Tags: Data Engineering, ETL/ELT, SQL, Data Modeling, Data Quality Testing, Orchestration, Cloud Data Warehousing
Domain: Open Source, Developer Analytics
Streamlit
GCS & BigQuery
Terraform
DuckDB
Python
Bruin
Build an end-to-end analytics pipeline for GH Archive data, with local and cloud execution paths, automated orchestration, data quality checks, and a dashboard for exploring repository activity.
GH Archive provides large-scale public GitHub event data, but the raw data is difficult to analyze directly because it is nested, high-volume, and not pre-aggregated.
This project solves that by building an end-to-end reproducible data pipeline that ingests raw GitHub events, transforms them into analytics-ready tables, and visualizes repository activity in a Streamlit dashboard. The workflow is orchestrated with Bruin.
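To make the ingestion step concrete, here is a minimal sketch of the local path. It assumes nothing beyond GH Archive's public URL pattern; the database file and table name (`gharchive.duckdb`, `raw_events`) are illustrative placeholders, not the project's actual Bruin assets.

```python
# Minimal ingestion sketch: load one GH Archive hourly file into DuckDB.
# Database file and table name are illustrative placeholders.
import duckdb

con = duckdb.connect("gharchive.duckdb")
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")  # needed to read remote files over HTTPS

# GH Archive publishes one gzipped NDJSON file per hour.
url = "https://data.gharchive.org/2024-01-01-15.json.gz"

# Land the raw events, keeping only the fields downstream models need.
con.execute(f"""
    CREATE OR REPLACE TABLE raw_events AS
    SELECT id,
           type,
           repo.name   AS repo_name,
           actor.login AS actor_login,
           created_at
    FROM read_json_auto('{url}')
""")

print(con.execute(
    "SELECT type, count(*) FROM raw_events GROUP BY 1 ORDER BY 2 DESC LIMIT 5"
).fetchall())
```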
The project includes two pipelines, split by storage backend (see the sketch after this list):
Local pipeline (DuckDB 🟡): fast iteration, debugging transformations, and dashboard development.
Cloud pipeline (GCS + BigQuery 🔵): scheduled execution and a production-like setup.
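The switch between the two paths is handled by Bruin's pipeline configuration. As a rough illustration of the idea only, a hypothetical Python helper that runs the same aggregation against either backend might look like this; the project ID, dataset, and table names are all placeholders:

```python
# Hypothetical illustration of the dual-backend idea, not project code:
# the same aggregation runs on DuckDB locally or BigQuery in the cloud.
import duckdb

AGG_SQL = """
    SELECT repo_name, count(*) AS events
    FROM raw_events
    GROUP BY repo_name
    ORDER BY events DESC
    LIMIT 10
"""

def run_local(db_path: str = "gharchive.duckdb"):
    # Local path: a DuckDB file on disk, instant feedback while iterating.
    con = duckdb.connect(db_path, read_only=True)
    try:
        return con.execute(AGG_SQL).fetchall()
    finally:
        con.close()

def run_cloud(project: str = "my-gcp-project"):
    # Cloud path: the same query against BigQuery (project and dataset are
    # placeholders; credentials come from the usual GCP environment).
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    sql = AGG_SQL.replace("raw_events", f"`{project}.gharchive.raw_events`")
    return [tuple(row.values()) for row in client.query(sql).result()]
```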
Python
SQL
Bruin
DuckDB
Google Cloud Storage
BigQuery
Terraform
Streamlit
Altair
GitHub
The final project is a reproducible data workflow that demonstrates local development, cloud execution, data modeling, orchestration, infrastructure provisioning, and dashboarding in one system.
It shows how raw event data can be transformed into structured analytics layers and delivered through an interactive dashboard.
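For a flavor of the dashboard layer, here is a minimal Streamlit + Altair sketch reading from the local DuckDB file; the query and names are placeholders rather than the project's actual models:

```python
# Minimal dashboard sketch (placeholder names, not the project's models):
# rank repositories by event count from the local DuckDB file.
import altair as alt
import duckdb
import streamlit as st

st.title("GH Archive: repository activity")

con = duckdb.connect("gharchive.duckdb", read_only=True)
df = con.execute("""
    SELECT repo_name, count(*) AS events
    FROM raw_events
    GROUP BY repo_name
    ORDER BY events DESC
    LIMIT 20
""").df()

chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X("events:Q", title="Events"),
        y=alt.Y("repo_name:N", sort="-x", title="Repository"),
    )
)
st.altair_chart(chart, use_container_width=True)
```

Saved as app.py, this runs with `streamlit run app.py`.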
To see the code, please visit my GitHub.