An end-to-end AWS data engineering pipeline for ingesting, processing, and visualizing historical climate data for Russia.
The project explores how the Russian climate has changed over the last century. In particular, we focus on three main variables: temperature, precipitation, and the number of weather stations. I chose the Global Historical Climatology Network Daily (GHCN-D) dataset as the main data source.
The main objectives are to:
- Develop a pipeline to collect the Russian climate archive data and process it in batch
- Build a dashboard to visualize the trends
This project automates the full lifecycle of climate data from NOAA’s GHCN, including:
- Ingestion of raw data into S3
- ETL with AWS Glue (fixed-width → CSV, spatial joins, yearly partitioning)
- Orchestration via AWS Step Functions
- Storage in Amazon Redshift Serverless
- Visualization in Amazon QuickSight
All infrastructure is defined in Terraform, and configuration values live in a single `config.yaml`.
Figure 1: High-level data flow and AWS services.
```mermaid
flowchart LR
    subgraph S3 [Amazon S3]
        A1["ghcn_data/external<br/>(raw files)"]
        A2["ghcn_data/processed/metadata<br/>(CSV metadata)"]
        A3["ghcn_data/processed/by_year<br/>(Parquet partitions)"]
        A1 --> A2
    end
    subgraph Glue [AWS Glue ETL]
        B1["metadata_to_redshift_job"]
        B2["map_stations_job"]
        B3["process_by_year_job"]
        A2 --> B1 --> B2 --> B3
    end
    subgraph SFN [Step Functions]
        C["BackfillByYear"]
        B3 --> C
    end
    subgraph RS [Redshift Serverless]
        D["russia-climate-workgroup<br/>ghcn-data database"]
        C --> D
    end
    subgraph QS [QuickSight]
        E["Dashboard"]
        D -->|Direct Query / SPICE| E
    end
```
Key components:

- **AWS S3**
  - `ghcn_data/external` (raw files)
  - `ghcn_data/processed/metadata` (CSV metadata)
  - `ghcn_data/processed/by_year` (partitioned Parquet)
  - `scripts/` (uploaded ETL scripts)
  - `logs/` (Glue Spark & Step Functions logs)
- **AWS Glue Jobs**
  - `metadata_to_redshift_job` — fixed-width → CSV (see the parsing sketch after this list)
  - `map_stations_job` — spatial join with GADM GPKG
  - `process_by_year_job` — yearly Parquet partitions + data quality checks
- **AWS Step Functions**
  - `BackfillByYear` Map state fans out `process_by_year_job` for 1901–2024
- **Amazon Redshift Serverless**
  - Namespace: `russia-climate-namespace`
  - Workgroup: `russia-climate-workgroup`
  - Schemas & tables loaded via SQL scripts in `sql_scripts/` (a COPY sketch follows this list)
- **Amazon QuickSight**
  - Data source: Redshift Serverless (`ghcn-data` database)
  - Templates & dashboards generated from analysis definitions in `quicksight/`
- **Configuration**
  - Single `config.yaml` holds all constants (region, bucket, usernames, ARNs); a loading sketch follows this list
  - Consumed by Terraform (`yamldecode`), Python (PyYAML), SQL (Jinja2), and shell scripts (`yq` or a Python one-liner)
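For illustration, here is a minimal sketch of the fixed-width parsing step that `metadata_to_redshift_job` performs on GHCN-D station metadata. The real job runs as a Glue (PySpark) script; the pandas version below, the bucket placeholder, and the output path are assumptions made for readability, and only the column offsets follow NOAA's published `ghcnd-stations.txt` layout.

```python
# Illustrative sketch of the fixed-width -> CSV step (not the repository's actual Glue code).
# Reading/writing s3:// paths with pandas assumes s3fs is installed.
import pandas as pd

# Column offsets per NOAA's ghcnd-stations.txt readme (0-based, end-exclusive).
colspecs = [(0, 11), (12, 20), (21, 30), (31, 37), (38, 40), (41, 71)]
names = ["station_id", "latitude", "longitude", "elevation", "state", "name"]

stations = pd.read_fwf(
    "s3://<your-bucket>/ghcn_data/external/ghcnd-stations.txt",
    colspecs=colspecs,
    names=names,
)

# GHCN-D station IDs for Russia start with the country code "RS".
russia = stations[stations["station_id"].str.startswith("RS")]

# Write CSV metadata for the downstream jobs / Redshift COPY.
russia.to_csv(
    "s3://<your-bucket>/ghcn_data/processed/metadata/stations.csv",
    index=False,
)
```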
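Loading into Redshift Serverless happens through the SQL scripts in `sql_scripts/`; the snippet below is only a rough sketch of what a single Parquet load could look like via the Redshift Data API. The table name, S3 prefix, and IAM role ARN are placeholders, not the repository's actual scripts.

```python
# Rough sketch: run a COPY against the serverless workgroup with the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data", region_name="eu-central-1")

# Placeholder table, prefix, and role ARN; substitute your own values.
copy_sql = """
COPY climate.daily_observations
FROM 's3://<your-bucket>/ghcn_data/processed/by_year/year=2020/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-copy-role>'
FORMAT AS PARQUET;
"""

rsd.execute_statement(
    WorkgroupName="russia-climate-workgroup",
    Database="ghcn-data",
    Sql=copy_sql,
)
```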
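As a quick example of the PyYAML path, this is roughly how an ETL script can read `config.yaml` (the key names shown are hypothetical; on the Terraform side the same file is read with `yamldecode`):

```python
# Minimal sketch of loading config.yaml; adjust key names to your actual config.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Hypothetical keys for illustration only.
print(cfg["aws_region"], cfg["s3_bucket"])
```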
To deploy the pipeline end to end:

- **Clone repo**

  ```bash
  git clone https://github.com/your-org/pipeline-automation.git
  cd pipeline-automation
  ```
- **Configure**
  - Edit `config.yaml` with your AWS account ID, QuickSight user, bucket name, etc.
- **Terraform**

  ```bash
  cd infrastructure
  terraform init
  terraform apply
  ```
- **Upload ETL scripts & external data**

  ```bash
  cd ../scripts
  ./upload_to_s3.sh
  ```
- **Backfill historical data** (a boto3 alternative is sketched after these steps)

  ```bash
  aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:eu-central-1:615299755921:stateMachine:BackfillByYear
  ```
- **Publish QuickSight dashboards**

  ```bash
  cd ../quicksight
  aws quicksight create-template \
    --aws-account-id $AWS_ACCOUNT_ID \
    --template-id monthly-agg-template \
    --source-entity file://monthly-agg-data-template-definition.json
  aws quicksight create-dashboard \
    --aws-account-id $AWS_ACCOUNT_ID \
    --dashboard-id monthly-agg-dashboard \
    --source-entity file://monthly-agg-data-template-definition.json \
    --permissions file://dashboard-permissions.json
  ```
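If you prefer Python over the CLI, the backfill can also be started with boto3. The `{"years": [...]}` input below is an assumption about what the `BackfillByYear` Map state expects; adjust it to the state machine's actual input schema.

```python
# Sketch of starting the BackfillByYear execution with boto3.
import json
import boto3

sfn = boto3.client("stepfunctions", region_name="eu-central-1")

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:eu-central-1:615299755921:stateMachine:BackfillByYear",
    # Assumed input shape: one entry per year from 1901 through 2024.
    input=json.dumps({"years": list(range(1901, 2025))}),
)
print(response["executionArn"])
```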
Repository layout:

```text
pipeline-automation/
├── config.yaml
├── infrastructure/ # Terraform modules & main.tf
├── scripts/        # ETL & upload scripts
├── sql_scripts/    # Redshift DDL & COPY/INSERT scripts
├── quicksight/     # Template definitions & permissions
└── dashboards/     # PDF exports of the dashboards
```
Feel free to open issues or pull requests—happy to collaborate!
Contact: Edyarich [email protected]