NYC Taxi Data Lakehouse on Azure

This project demonstrates how to build a Lakehouse architecture on Azure using Data Factory, Azure Data Lake Storage Gen2, and Azure Databricks. The pipeline ingests NYC taxi data and structures it in a Medallion Architecture: Bronze, Silver, and Gold layers.


Data Source

NYC Taxi & Limousine Commission (TLC) – Green Taxi Trip Records
Monthly data for 2023 is ingested dynamically using a parameterized pipeline.


Architecture Overview

Bronze Layer

  • Storage: Azure Data Lake Storage Gen2 (hierarchical namespace enabled)
  • Containers: bronze, silver, and gold, per the Medallion Architecture
  • Ingestion: Azure Data Factory (ADF)
    • A dynamic pipeline with ForEach + If Condition
    • Fetches raw data and writes in Parquet format to the bronze container
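The dynamic ingestion loop can be sketched as an ADF pipeline fragment like the one below. The pipeline, activity, and parameter names are invented for illustration; the actual pipeline's definitions are not part of this README.

```json
{
  "name": "pl_ingest_green_taxi",
  "properties": {
    "parameters": {
      "year": { "type": "string", "defaultValue": "2023" }
    },
    "activities": [
      {
        "name": "ForEachMonth",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@range(1, 12)", "type": "Expression" },
          "activities": [
            {
              "name": "IfSingleDigitMonth",
              "type": "IfCondition",
              "typeProperties": {
                "expression": { "value": "@less(item(), 10)", "type": "Expression" },
                "ifTrueActivities": [ { "name": "CopyMonthZeroPadded", "type": "Copy" } ],
                "ifFalseActivities": [ { "name": "CopyMonth", "type": "Copy" } ]
              }
            }
          ]
        }
      }
    ]
  }
}
```

The If Condition handles the month-number formatting (01–09 vs 10–12) that the TLC file naming requires, and each Copy activity writes its month's file to the bronze container.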

Silver Layer

  • Databricks Notebooks: silver_notebook
  • Transformations:
    • Clean and enrich raw data using PySpark
    • Rename columns, split zones, add date fields
    • Store structured data in the silver container in Parquet format

Gold Layer

  • Databricks Notebooks: gold_notebook
  • Transformations:
    • Load silver layer data
    • Store cleaned and query-optimized tables in Delta format
    • Save as managed Delta Tables in gold database
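One way the gold step could be expressed in Spark SQL is sketched below; the database name, table name, and silver path are assumptions based on the layout described in this README, with `<storage-account>` as a placeholder.

```sql
-- Create the gold database, then register a managed Delta table
-- built from the silver Parquet output.
CREATE DATABASE IF NOT EXISTS gold;

CREATE TABLE gold.trips_2023
USING DELTA
AS SELECT *
   FROM parquet.`abfss://silver@<storage-account>.dfs.core.windows.net/trips_2023`;
```

Registering the tables as managed Delta tables is what enables the UPDATE, DELETE, and RESTORE operations shown in the example queries further down.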

Azure Resources Used

  • Resource Group – created to hold all project resources
  • Storage Account – ADLS Gen2 with hierarchical namespace and 3 containers: bronze, silver, gold
  • Azure Data Factory – dynamic pipeline that fetches the raw trip data and writes it to bronze
  • Azure Databricks – single-node cluster (Standard_D4pds_v6, 16 GB RAM, 4 vCPUs)
  • Microsoft Entra ID – service principal created via App Registration for Databricks access

Identity & Access Management

  • Service Principal created via Microsoft Entra ID:
    • Registered in App Registrations
    • Granted necessary permissions on the Storage Account
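In the Databricks notebooks, the service principal credentials are typically wired up with a config fragment like this. The `fs.azure.*` keys are the standard ABFS OAuth settings; the secret scope, key names, and storage account are placeholders.

```python
# Databricks config fragment (not standalone-runnable: `dbutils` and `spark`
# are notebook globals). Secret scope/key names are assumed.
client_id     = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
tenant_id     = dbutils.secrets.get(scope="kv-scope", key="tenant-id")

account = "<storage-account>.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```

With these settings in place, the notebooks can read and write `abfss://` paths on the storage account directly.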

Notebooks Summary

silver_notebook

  • Read raw trip data from the bronze container
  • Apply schema to trips_2023
  • Add new columns (e.g., trip_date, trip_year)
  • Write processed data to silver layer

gold_notebook

  • Load silver layer Parquet files
  • Convert and save as Delta Tables
  • Perform SQL queries:
    • Filter by zones or fare amount
    • Run UPDATE, DELETE, and RESTORE operations

Example Queries

-- View expensive trips
SELECT * FROM gold.trips_2023 WHERE total_amount > 1500;

-- Check trip zones for Newark Airport
SELECT * FROM gold.zone_type WHERE Zone1 = 'Newark Airport';

-- Restore deleted records
RESTORE gold.zone_type TO VERSION AS OF 0;

Storage Layout

dnyctake10/
├── bronze/
│   ├── trip_type/
│   ├── trip_zone/
│   └── trips_2023/
├── silver/
│   ├── trip_zone/
│   └── trips_2023/
└── gold/
    └── Delta Tables:
        ├── trip_type
        ├── zone_type
        └── trips_2023

Contact

For any questions or clarifications, please contact Raza Mehar at [[email protected]].
