Tags: Data Engineering, ETL/ELT, SQL, Data Modeling, Data Quality Testing, Orchestration, Cloud Data Warehousing
Domain: Open Source, Developer Analytics
Streamlit
GCS & BigQuery
Terraform
DuckDB
Python
Bruin
Build an end-to-end analytics pipeline for GH Archive data, with local and cloud execution paths, automated orchestration, data quality checks, and a dashboard for exploring repository activity.
GH Archive provides large-scale public GitHub event data, but the raw data is difficult to analyze directly because it is nested, high-volume, and not pre-aggregated.
This project solves that by building an end-to-end reproducible data pipeline that ingests raw GitHub events, transforms them into analytics-ready tables, and visualizes repository activity in a Streamlit dashboard. The workflow is orchestrated with Bruin.
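To make the ingestion step concrete, here is a minimal sketch of the local path. It assumes nothing beyond GH Archive's public URL pattern; the database file and table name (`gharchive.duckdb`, `raw_events`) are illustrative placeholders, not the project's actual Bruin assets.

```python
# Minimal ingestion sketch: load one GH Archive hourly file into DuckDB.
# Database file and table name are illustrative placeholders.
import duckdb

con = duckdb.connect("gharchive.duckdb")
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")  # needed to read remote files over HTTPS

# GH Archive publishes one gzipped NDJSON file per hour.
url = "https://data.gharchive.org/2024-01-01-15.json.gz"

# Land the raw events, keeping only the fields downstream models need.
con.execute(f"""
    CREATE OR REPLACE TABLE raw_events AS
    SELECT id,
           type,
           repo.name   AS repo_name,
           actor.login AS actor_login,
           created_at
    FROM read_json_auto('{url}')
""")

print(con.execute(
    "SELECT type, count(*) FROM raw_events GROUP BY 1 ORDER BY 2 DESC LIMIT 5"
).fetchall())
```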
The project includes two pipelines, split by storage backend (see the sketch after this list):
Local pipeline (DuckDB 🟡): fast iteration, debugging transformations, and dashboard development.
Cloud pipeline (GCS + BigQuery 🔵): scheduled execution and a production-like setup.
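The switch between the two paths is handled by Bruin's pipeline configuration. As a rough illustration of the idea only, a hypothetical Python helper that runs the same aggregation against either backend might look like this; the project ID, dataset, and table names are all placeholders:

```python
# Hypothetical illustration of the dual-backend idea, not project code:
# the same aggregation runs on DuckDB locally or BigQuery in the cloud.
import duckdb

AGG_SQL = """
    SELECT repo_name, count(*) AS events
    FROM raw_events
    GROUP BY repo_name
    ORDER BY events DESC
    LIMIT 10
"""

def run_local(db_path: str = "gharchive.duckdb"):
    # Local path: a DuckDB file on disk, instant feedback while iterating.
    con = duckdb.connect(db_path, read_only=True)
    try:
        return con.execute(AGG_SQL).fetchall()
    finally:
        con.close()

def run_cloud(project: str = "my-gcp-project"):
    # Cloud path: the same query against BigQuery (project and dataset are
    # placeholders; credentials come from the usual GCP environment).
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    sql = AGG_SQL.replace("raw_events", f"`{project}.gharchive.raw_events`")
    return [tuple(row.values()) for row in client.query(sql).result()]
```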
Python
SQL
Bruin
DuckDB
Google Cloud Storage
BigQuery
Terraform
Streamlit
Altair
GitHub
The final project is a reproducible data workflow that demonstrates local development, cloud execution, data modeling, orchestration, infrastructure provisioning, and dashboarding in one system.
It shows how raw event data can be transformed into structured analytics layers and delivered through an interactive dashboard.
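For a flavor of the dashboard layer, here is a minimal Streamlit + Altair sketch reading from the local DuckDB file; the query and names are placeholders rather than the project's actual models:

```python
# Minimal dashboard sketch (placeholder names, not the project's models):
# rank repositories by event count from the local DuckDB file.
import altair as alt
import duckdb
import streamlit as st

st.title("GH Archive: repository activity")

con = duckdb.connect("gharchive.duckdb", read_only=True)
df = con.execute("""
    SELECT repo_name, count(*) AS events
    FROM raw_events
    GROUP BY repo_name
    ORDER BY events DESC
    LIMIT 20
""").df()

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("events:Q", title="Events"),
        y=alt.Y("repo_name:N", sort="-x", title="Repository"),
    )
)
st.altair_chart(chart, use_container_width=True)
```

Saved as app.py, this runs with `streamlit run app.py`.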
To see the code, please visit my GitHub.