
Commit 9f640ad

Merge pull request #2 from adisve/adisve-patch-1
Update README.md
2 parents ab3e037 + 826a2a8 commit 9f640ad

1 file changed, +1 -1 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 ## Project Description
 
-This project, originally part of the Big Data course at the University of Kristianstad, aims to develop a versatile data pipeline capable of processing datasets ranging from 10-20 GB.
+This project aims to develop a versatile data pipeline capable of processing datasets large in size, utilizing Docker containers and Hadoop + Spark.
 
 For our practical implementation, we selected the [May 2015 Reddit Comments Dataset](https://www.kaggle.com/datasets/kaggle/reddit-comments-may-2015/) available on Kaggle. However, the pipeline's flexibility allows for the incorporation of various datasets. This adaptability is achieved by adjusting the NAMENODE_DATA_DIR variable in the ./hadoop-spark-cluster/Makefile and setting the namenode HDFS URL in scripts/spark/config.json.
 
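
The README context above names two settings to change when swapping datasets: `NAMENODE_DATA_DIR` in `./hadoop-spark-cluster/Makefile` and the namenode HDFS URL in `scripts/spark/config.json`. The following is a minimal sketch of how those two edits might be scripted, not the project's own tooling. The JSON key `hdfs_url`, the example URL `hdfs://namenode:9000`, and both helper names are assumptions for illustration; the actual key and Makefile assignment style in the repository may differ.

```python
import json
import re


def set_makefile_var(makefile_text: str, name: str, value: str) -> str:
    """Return Makefile text with the first assignment of `name` set to `value`.

    Handles the `=`, `:=`, and `?=` assignment forms on a single line.
    """
    pattern = re.compile(
        rf"^([ \t]*{re.escape(name)}[ \t]*[:?]?=[ \t]*).*$",
        re.MULTILINE,
    )
    # A callable replacement avoids backslash-escape surprises in `value`.
    new_text, count = pattern.subn(lambda m: m.group(1) + value, makefile_text, count=1)
    if count == 0:
        raise KeyError(f"{name} is not assigned in the given Makefile text")
    return new_text


def set_hdfs_url(config_text: str, url: str, key: str = "hdfs_url") -> str:
    """Return JSON config text with `key` set to `url`.

    `key` is a hypothetical name; check config.json for the real one.
    """
    config = json.loads(config_text)
    config[key] = url
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # In-memory examples; real use would read and write the two files.
    print(set_makefile_var("NAMENODE_DATA_DIR := /old/data\n", "NAMENODE_DATA_DIR", "/new/data"))
    print(set_hdfs_url('{"app_name": "demo"}', "hdfs://namenode:9000"))
```

Both helpers operate on strings rather than files, so they can be tried without touching the repository; to apply them, read each file with `pathlib.Path.read_text`, pass the contents through the helper, and write the result back.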
