
Kafka Spark BigQuery NewsStream
===============================

A real-time news processing pipeline that ingests articles from the newsapi.org REST API using Apache Kafka, Apache Spark, and Google BigQuery.

Overview
--------


This project demonstrates a scalable and fault-tolerant architecture for processing news articles in real time. The pipeline collects news articles from various sources, processes them with Apache Spark, and stores the results in Google BigQuery for analysis.

Architecture
------------


The pipeline consists of the following components:

  1. Kafka: Collects news articles from various sources and stores them in a Kafka topic (see the producer sketch after this list).
  2. Spark: Processes the news articles collected from Kafka, performing tasks such as data cleaning, tokenization, and sentiment analysis.
  3. BigQuery: Stores the processed data for analysis and querying.
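
As a rough illustration of the first stage, here is a minimal producer sketch, assuming the kafka-python and requests packages; the broker address, the topic name `news`, and the `NEWSAPI_KEY` environment variable are illustrative assumptions, not taken from this repository:

```python
# Fetch top headlines from newsapi.org and publish one Kafka message per
# article. Broker address, topic name, and API parameters are assumptions.
import json
import os

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

resp = requests.get(
    "https://newsapi.org/v2/top-headlines",
    params={"language": "en", "apiKey": os.environ["NEWSAPI_KEY"]},
    timeout=10,
)
resp.raise_for_status()

for article in resp.json().get("articles", []):
    producer.send("news", article)  # one message per article

producer.flush()
```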

Requirements
------------


  • Kafka: Apache Kafka 3.0 or later
  • Spark: Apache Spark 3.2 or later
  • BigQuery: Google BigQuery
  • Python: Python 3.8 or later
  • Java: Java 11 or later

Terraform and Ansible
---------------------

This project deploys the pipeline on Google Cloud Platform (GCP): Terraform creates and provisions the instances, and Ansible installs Kafka, Zookeeper, Spark, Python, pip, and the required Python modules. The Ansible files are written as Terraform templates, rendered dynamically, and provisioned to the VMs. A new VPC is created with a public subnet for the control VM and a private subnet (behind a NAT gateway) for the Kafka VM, the Python producer VM, and the Spark consumer VM.

Directory Structure
-------------------

  • Terraform_GCP/: Terraform (.tf) scripts for instance creation.
  • ansible_files/: Ansible files for Kafka and Zookeeper installation on the VMs.

Getting Started
---------------

To deploy Kafka on GCP:

  1. Set Up GCP Environment: Ensure you have a GCP account and the necessary permissions to create and manage resources. Create a service-account key and download the .json file.

  2. Provision Virtual Machines: Create VM instances that will serve as Kafka brokers and Zookeeper nodes (handled by Terraform).

  3. Install Kafka and Zookeeper: On the Kafka VM, install the required software packages using Ansible (launched via Terraform's remote-exec provisioner once the VMs are created and provisioned).

    *(Screenshot: Kafka setup)*

  4. Configure Networking: Set up appropriate firewall rules and networking configurations to allow communication between nodes.

  5. Start Services: Initiate Zookeeper and then start the Kafka brokers, defined as Ansible tasks.

    *(Screenshot: Kafka)*

    *(Screenshot: Zookeeper)*

Resulting Table in BigQuery
---------------------------

A Python service writes the stream obtained from Kafka to BigQuery.

*(Screenshot: Python service)*
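
A minimal sketch of this stage, assuming PySpark with the spark-sql-kafka and spark-bigquery-connector packages on the classpath; the broker, topic, table, and bucket names are illustrative assumptions, not taken from this repository:

```python
# Read the news topic as a stream, parse the JSON payload, and append rows
# to BigQuery. Requires the spark-sql-kafka and spark-bigquery-connector
# packages; broker, topic, table, and bucket names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("news-consumer").getOrCreate()

# Only a few illustrative fields of a newsapi.org article are modeled here.
schema = StructType([
    StructField("title", StringType()),
    StructField("description", StringType()),
    StructField("publishedAt", StringType()),
])

articles = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-vm:9092")
    .option("subscribe", "news")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("a"))
    .select("a.*")
)

query = (
    articles.writeStream.format("bigquery")
    .option("table", "my_project.news_dataset.articles")
    .option("temporaryGcsBucket", "my-staging-bucket")
    .option("checkpointLocation", "gs://my-staging-bucket/checkpoints")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```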

The results can be viewed in the BigQuery Console. As time progresses, new entries are appended to the table; the screenshot below shows the resulting table after one week.

*(Screenshot: BigQuery results table)*
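
Outside the console, the table can also be queried from Python with the google-cloud-bigquery client; the table name below is a hypothetical placeholder:

```python
# Query the resulting table with the google-cloud-bigquery client.
# The table name is a hypothetical placeholder.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT title, publishedAt
    FROM `my_project.news_dataset.articles`
    ORDER BY publishedAt DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.publishedAt, row.title)
```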

License
-------

This project is licensed under the MIT License. See the LICENSE file for details.
