Experiments: Running the MapReduce code (Linux)

The following experiments were conducted on Linux Mint 20 Cinnamon with 2 processors, 5.5 GB of RAM and 70 GB of storage. For each experiment, the commands are agnostic of the platform on which Hadoop was set up, unless mentioned otherwise. In addition, each experiment makes the following assumptions:

  • The username is burraabhishek
  • The present working directory is ~/src. The entire src directory of this repository was cloned to the home directory in the Linux machine.
  • The Hadoop version is Hadoop 3.3.0
  • All the directories in the Hadoop Distributed File System differ across various development environments

These values differ across development environments; replace them wherever necessary.

NOTE: To run these experiments, a Hadoop Development Environment is required. This guide can help you get started if you do not have a Hadoop Development Environment.

Starting Hadoop Distributed File System

Run the following commands to start HDFS and YARN:

start-dfs.sh
start-yarn.sh

For these experiments, it is recommended to open the Terminal in the present working directory and then run the above commands.

start-dfs.sh starts the Hadoop Distributed File System. Running this command starts the following daemons:

  • namenode (on localhost, unless otherwise specified)
  • datanodes
  • secondary namenodes

start-yarn.sh starts Hadoop YARN (Yet Another Resource Negotiator). YARN manages computing resources in the cluster. Running this command starts the following daemons:

  • resourcemanager
  • nodemanagers

To check the status of the Hadoop daemons, type the command jps. jps is the Java Virtual Machine Process Status tool. For example:

$ jps
2560 NodeManager
2706 Jps
2453 ResourceManager
2021 DataNode
2168 SecondaryNameNode
1930 NameNode

Ensure that all five daemons, along with Jps itself, are listed. The numbers on the left are process IDs and may differ across environments.
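Optionally, the health of HDFS can also be confirmed from the command line. The command below is a minimal check using the standard hdfs dfsadmin tool; it assumes default settings and simply verifies that the DataNode has registered with the NameNode.

hdfs dfsadmin -report

The report lists the configured capacity and the live DataNodes. If no DataNodes appear, revisit the start-dfs.sh step before continuing.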

Preparing the code

Hadoop Streaming requires the mapper and reducer to be executable programs. The Python scripts in this repository are not executable as-is.

For each of the 4 Python files in the directory, add a shebang (interpreter directive) as the first line, followed by a blank line. The interpreter path may differ across platforms.

For example,

#!/usr/bin/python

# Rest of the Python code

The first two bytes, #!, indicate that the Unix/Linux program loader should interpret the rest of the line as the interpreter with which to execute the program. For example, #!/usr/bin/python runs the script with the python executable located in /usr/bin.
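The exact interpreter path can vary between systems (some provide only python3). Below is a small sketch for locating the interpreter, together with a commonly used, more portable alternative shebang; it assumes the scripts are compatible with Python 3.

# Locate the Python interpreter on this machine
which python3

# Alternative, more portable shebang (placed as the first line of each .py file):
#!/usr/bin/env python3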

Then, for each of the files, run the following commands to mark them as executable:

chmod +x mapper.py
chmod +x reducer.py
chmod +x nextpass.py
chmod +x reset.py
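Hadoop Streaming executables read records from standard input and write key-value pairs to standard output, so the mapper and reducer can be smoke-tested locally with a shell pipeline before submitting the job. The sketch below uses a hypothetical sample file sample.csv and assumes the scripts can find any side files they expect (such as apriori_settings.json and part-00000) in the current directory.

# Simulate a single map -> shuffle -> reduce pass locally, without Hadoop
cat sample.csv | ./mapper.py | sort | ./reducer.py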

Preparing a dataset

You can either use the dataset generator included here or download a dataset available online. (If you choose the latter, please abide by its licensing conditions, if any.)

Upload the dataset into HDFS.

For example, suppose the dataset is ~/src/csv_dataset.csv, the present working directory (where the commands are executed) is ~/src, and the destination in HDFS is /dataset/, which does not exist yet. The following commands create the directory and copy the dataset into HDFS:

hdfs dfs -mkdir /dataset
hdfs dfs -put csv_dataset.csv /dataset/csv_dataset.csv

Note that the source and destination file names need not be the same when using hdfs dfs -put.
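To confirm that the upload succeeded, list the destination directory (and, optionally, inspect the first few lines of the file):

hdfs dfs -ls /dataset
hdfs dfs -cat /dataset/csv_dataset.csv | head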

HDFS using GUI

HDFS can be accessed using a web browser. If default settings are used, then the URL

localhost:9870

should open the HDFS web interface (the NameNode UI).

This URL may differ for different Hadoop configurations.

To browse the file system, go to Utilities, then select 'Browse the file system'.

Running the MapReduce code

The command to execute the MapReduce code is:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-libjars custom.jar \
-file apriori_settings.json \
-file part-00000 \
-input csv_dataset.csv \
-mapper mapper.py \
-reducer reducer.py \
-output /output1 \
-outputformat CustomMultiOutputFormat

Replace the following if they differ in your environment:

  • The version in the jar file name (3.3.0) with your Hadoop version.
  • $HADOOP_HOME with its value (the Hadoop installation location).
  • csv_dataset.csv with the path of the dataset in HDFS (for example, /dataset/csv_dataset.csv).
  • /output1 with the location in HDFS where you want to store the output.
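Once the job finishes, the output directory in HDFS can be inspected and, if needed, copied back to the local file system. The commands below are a sketch assuming the output location /output1 used above; the exact file names inside it (for example, part-00000) depend on the job and on CustomMultiOutputFormat.

hdfs dfs -ls /output1
hdfs dfs -cat /output1/part-00000
hdfs dfs -get /output1 ./output1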