Commit 24b823a

Merge branch 'main' of github.com:adisve/hadoop-spark-cluster
2 parents 83d92b2 + 6e85b3b commit 24b823a

1 file changed: +4 -2 lines changed
README.md

Lines changed: 4 additions & 2 deletions
@@ -2,15 +2,17 @@
 
 ## Project Description
 
-This project, originally part of the Big Data course at the University of Kristianstad, aims to develop a versatile data pipeline capable of processing datasets ranging from 10-20 GB.
+This project aims to develop a versatile data pipeline capable of processing datasets large in size, utilizing Docker containers and Hadoop + Spark.
+
+## About
 
 For our practical implementation, we selected the [May 2015 Reddit Comments Dataset](https://www.kaggle.com/datasets/kaggle/reddit-comments-may-2015/) available on Kaggle. However, the pipeline's flexibility allows for the incorporation of various datasets. This adaptability is achieved by adjusting the NAMENODE_DATA_DIR variable in the ./hadoop-spark-cluster/Makefile and setting the namenode HDFS URL in scripts/spark/config.json.
 
 Leveraging Apache Spark for data processing and HDFS on a Hadoop cluster for data storage, each node operates within its own container, ensuring efficient data handling.
 
 The pipeline is designed to generate an output.csv file (prior to uploading it in parts as Parquet parts to the virtual HDFS container), located in the /data directory at the project's root. Should you opt to use the SQLite database from the provided link, a handy conversion script scripts/utils/csv_converter.py is available to convert the data from SQLite to CSV format before running the initialization script.
 
-### Prerequisites
+## Prerequisites
 
 - A comments.csv file under /data/output.csv (not included in the repository due to size), which can be downloaded from [May 2015 Reddit Comments](https://www.kaggle.com/datasets/kaggle/reddit-comments-may-2015/) and then manually parsed to a csv file with the helper script csv_converter.py under scripts/.
 - Pipenv (for installing dependencies)
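
The README text in this diff refers to a helper, csv_converter.py, that converts the SQLite release of the dataset to CSV before the initialization script runs. The script itself is not part of this commit, so the following is only a minimal sketch of how such a conversion could be done with Python's standard library. The function name `sqlite_table_to_csv`, its parameters, and the table and file names in the usage note are hypothetical and are not taken from the repository.

```python
import csv
import sqlite3


def sqlite_table_to_csv(db_path, table, csv_path, batch_size=10_000):
    """Stream one SQLite table into a CSV file and return the number of data rows.

    Hypothetical helper for illustration; not the repository's csv_converter.py.
    """
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as SQL parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        header = [col[0] for col in cursor.description]
        rows_written = 0
        with open(csv_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            # Fetch in batches so a multi-gigabyte table is never held in memory.
            while True:
                batch = cursor.fetchmany(batch_size)
                if not batch:
                    break
                writer.writerows(batch)
                rows_written += len(batch)
        return rows_written
    finally:
        conn.close()
```

A call might look like `sqlite_table_to_csv("database.sqlite", "comments", "data/output.csv")`, where the database file name and table name are placeholders to be replaced with the actual values from the downloaded dataset. Batched fetching is used here because the README describes large datasets; a smaller table could be written with a single `fetchall()`.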
