This project demonstrates how to build a Lakehouse architecture on Azure using Data Factory, Azure Data Lake Storage Gen2, and Azure Databricks. The pipeline ingests NYC taxi data and structures it in a Medallion Architecture: Bronze, Silver, and Gold layers.
NYC Taxi & Limousine Commission (TLC)
Green Taxi Trip Records – Link
Monthly data for 2023 ingested dynamically using a parameterized pipeline.
- Storage: Azure Data Lake Gen2 (Hierarchical namespace enabled)
- Containers:
bronze
,silver
, andgold
as per Medallion Architecture - Ingestion: Azure Data Factory (ADF)
- A dynamic pipeline with
ForEach
+If Condition
- Fetches raw data and writes in Parquet format to the
bronze
container
- A dynamic pipeline with
- Databricks Notebooks:
silver_notebook
- Transformations:
- Clean and enrich raw data using PySpark
- Rename columns, split zones, add date fields
- Store structured data in the
silver
container in Parquet format
- Databricks Notebooks:
gold_notebook
- Transformations:
- Load silver layer data
- Store cleaned and query-optimized tables in Delta format
- Save as managed Delta Tables in
gold
database
Service | Details |
---|---|
Resource Group | Created to hold all resources |
Storage Account | ADLS Gen2 with hierarchical namespace, 3 containers: bronze, silver, gold |
Azure Data Factory | Pipeline to ingest CSV data into bronze |
Azure Databricks | Single-node cluster (Standard_D4pds_v6, 16 GB RAM, 4 vCPUs) |
Azure Entra ID | Service principal created via App Registration for Databricks access |
- Service Principal created via Azure Entra ID:
- Registered in App Registrations
- Granted necessary permissions on the Storage Account
- Read raw CSVs from bronze
- Apply schema to
trips_2023
- Add new columns (e.g., trip_date, trip_year)
- Write processed data to silver layer
- Load silver layer Parquet files
- Convert and save as Delta Tables
- Perform SQL queries:
- Filter by zones or fare amount
- Run
UPDATE
,DELETE
, andRESTORE
operations
-- View expensive trips
SELECT * FROM gold.trips_2023 WHERE total_amount > 1500;
-- Check trip zones for Newark Airport
SELECT * FROM gold.zone_type WHERE Zone1 = 'Newark Airport';
-- Restore deleted records
RESTORE gold.zone_type TO VERSION AS OF 0;
dnyctake10/
├── bronze/
│ ├── trip_type/
│ ├── trip_zone/
│ └── trips_2023/
├── silver/
│ ├── trip_zone/
│ └── trips_2023/
└── gold/
└── Delta Tables:
├── trip_type
├── zone_type
└── trips_2023
For any questions or clarifications, please contact Raza Mehar at [[email protected]].