Real-Time News Processing Pipeline
==================================
A real-time news processing pipeline that ingests articles from the newsapi.org REST API and processes them with Apache Kafka, Apache Spark, and Google BigQuery.
This project demonstrates a scalable and fault-tolerant architecture for processing news articles in real time. The pipeline collects news articles from various sources, processes them using Apache Spark, and stores the results in Google BigQuery for analysis.
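As a rough illustration of the collection step, a minimal producer might poll the newsapi.org top-headlines endpoint and publish each article to a Kafka topic, along these lines. The topic name `news_articles`, the broker address, and the `NEWSAPI_KEY` environment variable are assumptions for this sketch, not values taken from the repository.

```python
# Minimal producer sketch: poll newsapi.org and publish each article to Kafka.
# Topic name, broker address, and the NEWSAPI_KEY variable are assumptions.
import json
import os
import time

import requests
from kafka import KafkaProducer

NEWSAPI_URL = "https://newsapi.org/v2/top-headlines"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    resp = requests.get(
        NEWSAPI_URL,
        params={"country": "us", "pageSize": 100},
        headers={"X-Api-Key": os.environ["NEWSAPI_KEY"]},
        timeout=10,
    )
    resp.raise_for_status()
    for article in resp.json().get("articles", []):
        producer.send("news_articles", value=article)
    producer.flush()
    time.sleep(60)  # poll roughly once per minute
```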
The pipeline consists of the following components:
- Kafka: Collects news articles from various sources and stores them in a Kafka topic.
- Spark: Processes the news articles consumed from Kafka, performing tasks such as data cleaning, tokenization, and sentiment analysis (a rough consumer sketch follows this list).
- BigQuery: Stores the processed data for analysis and querying.
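As an illustration of the Spark step, the Structured Streaming sketch below reads the Kafka topic, parses the article JSON, does basic cleaning and tokenization, and attaches a toy word-list sentiment score. The topic name, schema, broker address, and word lists are assumptions, and the actual project may use a different schema and a proper sentiment model; running it also requires the spark-sql-kafka connector package.

```python
# Illustrative Spark consumer sketch: read articles from Kafka, clean and
# tokenize the text, and attach a toy word-count sentiment score.
# Topic, schema, broker address, and word lists are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("news-stream-processor").getOrCreate()

# Assumed subset of the newsapi.org article fields.
article_schema = StructType([
    StructField("title", StringType()),
    StructField("description", StringType()),
    StructField("publishedAt", StringType()),
    StructField("url", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "news_articles")
    .load()
)

# Kafka values arrive as bytes; parse the JSON payload into columns.
articles = raw.select(
    F.from_json(F.col("value").cast("string"), article_schema).alias("a")
).select("a.*")

# Cleaning and tokenization: lower-case, strip punctuation, split on whitespace.
cleaned = articles.withColumn(
    "tokens",
    F.split(
        F.lower(F.regexp_replace(F.coalesce("description", "title"), r"[^\w\s]", "")),
        r"\s+",
    ),
)

# Toy sentiment score: positive-word count minus negative-word count.
pos_words = F.array(*[F.lit(w) for w in ["good", "great", "strong", "win"]])
neg_words = F.array(*[F.lit(w) for w in ["bad", "poor", "weak", "loss"]])
scored = cleaned.withColumn(
    "sentiment",
    F.size(F.array_intersect("tokens", pos_words))
    - F.size(F.array_intersect("tokens", neg_words)),
)

# Write to the console for inspection; the real pipeline feeds BigQuery instead.
query = (
    scored.writeStream.outputMode("append")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```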
Requirements:
- Kafka: Apache Kafka 3.0 or later
- Spark: Apache Spark 3.2 or later
- BigQuery: Google BigQuery
- Python: Python 3.8 or later
- Java: Java 11 or later
This project deploys Apache Kafka on Google Cloud Platform (GCP) using Terraform for instance creation and provisioning, and Ansible for installing Spark, Kafka, Zookeeper, Python, pip, and the required Python modules. The Ansible files are rendered dynamically as Terraform templates and provisioned to the VMs. A new VPC is created, with a public subnet for the control VM and a private subnet behind a NAT Gateway for the Kafka VM, the Python producer VM, and the Spark consumer VM.
- `Terraform_GCP/`: Contains .tf scripts for instance creation.
- `ansible_files/`: Ansible files for Kafka & Zookeeper installation on the VMs.
To deploy Kafka on GCP:
- Set Up GCP Environment: Ensure you have a GCP account and the necessary permissions to create and manage resources. Create a key and download the .json file.
- Provision Virtual Machines: Create the VM instances that will serve as Kafka brokers and Zookeeper nodes (handled by Terraform).
- Install Kafka and Zookeeper: On the Kafka VM, install the required software packages using Ansible (started upon creation and provisioning of the VMs by Terraform with remote-exec).
- Configure Networking: Set up appropriate firewall rules and networking configurations to allow communication between the nodes.
- Start Services: Start Zookeeper and then the Kafka brokers, defined as Ansible tasks.
A Python service writes the stream obtained from Kafka to BigQuery.
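A minimal sketch of such a service, assuming a `processed_articles` topic and a hypothetical BigQuery table, could look like this; the actual service may batch rows, retry failed inserts, and map fields differently.

```python
# Rough sketch of the Kafka-to-BigQuery writer: consume processed articles
# and stream them into a BigQuery table. Topic and table names are assumptions.
import json

from google.cloud import bigquery
from kafka import KafkaConsumer

client = bigquery.Client()
table_id = "my-project.news.processed_articles"  # hypothetical project.dataset.table

consumer = KafkaConsumer(
    "processed_articles",  # hypothetical topic carrying the processed articles
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    # insert_rows_json streams one row per article and returns a list of errors.
    errors = client.insert_rows_json(table_id, [message.value])
    if errors:
        print(f"BigQuery insert errors: {errors}")
```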
The results can be viewed from the BigQuery Console. As time progresses, new entries are appended to the table.
(Resulting table after one week.)
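For a quick programmatic check instead of the console, a small query against the same hypothetical table could look like this:

```python
# Query the processed-articles table; the table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT COUNT(*) AS total_articles, AVG(sentiment) AS avg_sentiment
    FROM `my-project.news.processed_articles`
"""
for row in client.query(query).result():
    print(f"{row.total_articles} articles, average sentiment {row.avg_sentiment}")
```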
This project is licensed under the MIT License. See the LICENSE
file for details.