MP4-Distributed-Systems

RainStorm - A Stream Processing framework

RainStorm operates using a controller-worker architecture. In this setup, one machine acts as the controller, where the user submits jobs to the framework. The controller processes the job input and creates a job information map, which outlines the operations to be performed and the tasks required for each operation. Additionally, the map assigns machines to specific operations. Once the map is complete, the controller distributes it to all machines participating in the job.

Upon receiving the job information map, each worker identifies its responsibilities by scanning the map to determine the operations it is assigned. The operations can be categorized as source, sink, or intermediate:

  • Source machines fetch input files, read parts of these files, and forward each line to the next stage.
  • Intermediate machines receive input from the previous stage, process it according to the job information map, store the output in HyDFS, and forward the result to the next stage.
  • Sink machines receive processed input, perform the final operation as specified in the job map, and stream the output to the controller.

Each stage ensures seamless data flow:

  • Outputs are hashed to determine the relevant task in the next operation.
  • A unique key is generated for each output before passing it to the subsequent operation.
  • The machine locally stores the output until it receives an acknowledgment (ACK) from the machine handling the next stage.

When processing input, a machine stores the corresponding key in its local memory and sends an ACK to the previous stage upon successful processing. Retaining these keys ensures that every key is processed exactly once, even in the event of failures. All information is also logged in HyDFS for durability and recovery purposes.

Project Setup for all machines

  1. Ensure that you have Go installed on your machine.
  2. Clone this repo on each machine.
  3. cd into the mp4-distributed-systems folder on all machines.
cd mp4-distributed-systems
  4. Start the introducer on machine 1 -
./start_introducer.sh
  5. Ensure that your log files are present in the same directory on all machines. Example - /root/

How to run the RainStorm Stream Processing System

Now that the introducer is started, you can execute go run main.go and join from any of the machines, but first you have to follow these steps -

  1. Create a .env file in the mp4-distributed-systems folder and add the following on all machines -
INTRODUCER_ADDR=127.0.0.1

Enter the introducer IP / hostname. For example -

INTRODUCER_ADDR=fa24-cs425-9201

This has to be performed once on each machine where the distributed disseminator is to be run.

  2. Run ./worker_run.sh on the machine -
./worker_run.sh
  3. Various commands will be displayed for performing actions related to RainStorm and HyDFS.

  4. To run RainStorm, enter command 17, then enter Operation One's file name & its type, Operation Two's file name & its type, the HyDFS source file name, the HyDFS destination file name, and the number of tasks.

  5. After some time, the leader machine will start streaming keys produced by operation 1 and operation 2.

Group G92

Rishi Mundada ([email protected]) Daksh Pokar ([email protected])
