NYC Taxi Data Lakehouse on Azure

This project demonstrates how to build a Lakehouse architecture on Azure using Data Factory, Azure Data Lake Storage Gen2, and Azure Databricks. The pipeline ingests NYC taxi data and structures it in a Medallion Architecture: Bronze, Silver, and Gold layers.


Data Source

NYC Taxi & Limousine Commission (TLC) – Green Taxi Trip Records
Monthly data for 2023 is ingested dynamically using a parameterized pipeline.


Architecture Overview

Bronze Layer

  • Storage: Azure Data Lake Storage Gen2 (hierarchical namespace enabled)
  • Containers: bronze, silver, and gold, per the Medallion Architecture
  • Ingestion: Azure Data Factory (ADF)
    • A dynamic pipeline with ForEach + If Condition
    • Fetches raw data and writes in Parquet format to the bronze container
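The dynamic ingestion loop can be sketched as an ADF pipeline fragment like the one below. The pipeline, activity, and parameter names are invented for illustration; the actual pipeline's definitions are not part of this README.

```json
{
  "name": "pl_ingest_green_taxi",
  "properties": {
    "parameters": {
      "year": { "type": "string", "defaultValue": "2023" }
    },
    "activities": [
      {
        "name": "ForEachMonth",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@range(1, 12)", "type": "Expression" },
          "activities": [
            {
              "name": "IfSingleDigitMonth",
              "type": "IfCondition",
              "typeProperties": {
                "expression": { "value": "@less(item(), 10)", "type": "Expression" },
                "ifTrueActivities": [ { "name": "CopyMonthZeroPadded", "type": "Copy" } ],
                "ifFalseActivities": [ { "name": "CopyMonth", "type": "Copy" } ]
              }
            }
          ]
        }
      }
    ]
  }
}
```

The If Condition handles the month-number formatting (01–09 vs 10–12) that the TLC file naming requires, and each Copy activity writes its month's file to the bronze container.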

Silver Layer

  • Databricks Notebooks: silver_notebook
  • Transformations:
    • Clean and enrich raw data using PySpark
    • Rename columns, split zones, add date fields
    • Store structured data in the silver container in Parquet format

Gold Layer

  • Databricks Notebooks: gold_notebook
  • Transformations:
    • Load silver layer data
    • Store cleaned and query-optimized tables in Delta format
    • Save as managed Delta Tables in gold database
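One way the gold step could be expressed in Spark SQL is sketched below; the database name, table name, and silver path are assumptions based on the layout described in this README, with `<storage-account>` as a placeholder.

```sql
-- Create the gold database, then register a managed Delta table
-- built from the silver Parquet output.
CREATE DATABASE IF NOT EXISTS gold;

CREATE TABLE gold.trips_2023
USING DELTA
AS SELECT *
   FROM parquet.`abfss://silver@<storage-account>.dfs.core.windows.net/trips_2023`;
```

Registering the tables as managed Delta tables is what enables the UPDATE, DELETE, and RESTORE operations shown in the example queries further down.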

Azure Resources Used

  • Resource Group – created to hold all project resources
  • Storage Account – ADLS Gen2 with hierarchical namespace and 3 containers: bronze, silver, gold
  • Azure Data Factory – dynamic pipeline that fetches the raw trip data and writes it to bronze
  • Azure Databricks – single-node cluster (Standard_D4pds_v6, 16 GB RAM, 4 vCPUs)
  • Microsoft Entra ID – service principal created via App Registration for Databricks access

Identity & Access Management

  • Service Principal created via Microsoft Entra ID:
    • Registered in App Registrations
    • Granted necessary permissions on the Storage Account
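In the Databricks notebooks, the service principal credentials are typically wired up with a config fragment like this. The `fs.azure.*` keys are the standard ABFS OAuth settings; the secret scope, key names, and storage account are placeholders.

```python
# Databricks config fragment (not standalone-runnable: `dbutils` and `spark`
# are notebook globals). Secret scope/key names are assumed.
client_id     = dbutils.secrets.get(scope="kv-scope", key="sp-client-id")
client_secret = dbutils.secrets.get(scope="kv-scope", key="sp-client-secret")
tenant_id     = dbutils.secrets.get(scope="kv-scope", key="tenant-id")

account = "<storage-account>.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{account}", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
```

With these settings in place, the notebooks can read and write `abfss://` paths on the storage account directly.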

Notebooks Summary

silver_notebook

  • Read raw trip data from the bronze container
  • Apply schema to trips_2023
  • Add new columns (e.g., trip_date, trip_year)
  • Write processed data to silver layer

gold_notebook

  • Load silver layer Parquet files
  • Convert and save as Delta Tables
  • Perform SQL queries:
    • Filter by zones or fare amount
    • Run UPDATE, DELETE, and RESTORE operations

Example Queries

-- View expensive trips
SELECT * FROM gold.trips_2023 WHERE total_amount > 1500;

-- Check trip zones for Newark Airport
SELECT * FROM gold.zone_type WHERE Zone1 = 'Newark Airport';

-- Restore deleted records
RESTORE gold.zone_type TO VERSION AS OF 0;

Storage Layout

dnyctake10/
├── bronze/
│   ├── trip_type/
│   ├── trip_zone/
│   └── trips_2023/
├── silver/
│   ├── trip_zone/
│   └── trips_2023/
└── gold/
    └── Delta Tables:
        ├── trip_type
        ├── zone_type
        └── trips_2023

Contact

For any questions or clarifications, please contact Raza Mehar at [[email protected]].
