Hadoop Configuration
Hadoop configuration is an essential step in setting up the Apache Hadoop software to process big data with the map-reduce paradigm. This article provides a step-by-step configuration guide based on Hadoop version 2.8.0, but the steps are similar for other Hadoop 2.* distributions. We assume the Hadoop distribution was downloaded as described in our article Install Hadoop, but alternative installations lead to a similar Hadoop configuration.
Step 1 – Check Hadoop Directory
We first inspect the concrete Hadoop distribution by looking into its directory and understanding its structure. The basic UNIX commands ‘cd’ and ‘ls -al’ help us as shown below.
$ cd hadoop-2.8.0
$ ls -al
Output:
total 156
drwxr-xr-x 9 ubuntu ubuntu  4096 Mar 17 05:31 .
drwxr-xr-x 7 ubuntu ubuntu  4096 Apr  2 15:32 ..
drwxr-xr-x 2 ubuntu ubuntu  4096 Mar 17 05:31 bin
drwxr-xr-x 3 ubuntu ubuntu  4096 Mar 17 05:31 etc
drwxr-xr-x 2 ubuntu ubuntu  4096 Mar 17 05:31 include
drwxr-xr-x 3 ubuntu ubuntu  4096 Mar 17 05:31 lib
drwxr-xr-x 2 ubuntu ubuntu  4096 Mar 17 05:31 libexec
-rw-r--r-- 1 ubuntu ubuntu 99253 Mar 17 05:31 LICENSE.txt
-rw-r--r-- 1 ubuntu ubuntu 15915 Mar 17 05:31 NOTICE.txt
-rw-r--r-- 1 ubuntu ubuntu  1366 Mar 17 05:31 README.txt
drwxr-xr-x 2 ubuntu ubuntu  4096 Mar 17 05:31 sbin
drwxr-xr-x 4 ubuntu ubuntu  4096 Mar 17 05:31 share
We would also like to know where exactly the directory is located, using the ‘pwd’ command as shown below. We need this information later for the correct configuration as the ‘Hadoop home directory’.
$ pwd
Output:
/home/ubuntu/hadoop-2.8.0
Step 2 – Create Environment Variables
Environment variables help with the configuration and later use of Hadoop. We create the Hadoop environment variables by appending the following lines to the ‘.bashrc’ file using ‘vi’ or any other text editor as shown below. Note that this file is located in your home directory and not in the Hadoop distribution. Also note that we re-use the information from above by providing the value given by the ‘pwd’ command when exporting ‘HADOOP_HOME’. This environment variable needs to be set to your ‘Hadoop home directory’. One example is given below:
$ vi .bashrc
Lines to be added to this file:
# Hadoop environment
export HADOOP_HOME=/home/ubuntu/hadoop-2.8.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME
In order to apply all changes in this file to the currently running shell session we can use the command ‘source’ as shown below. To test whether this was successful we can use the ‘echo’ command to read the configured environment variables such as ‘HADOOP_HOME’.
$ source ~/.bashrc
$ echo $HADOOP_HOME
Output:
/home/ubuntu/hadoop-2.8.0
All Hadoop configuration files are located in “$HADOOP_HOME/etc/hadoop”. We need to adapt several of these configuration files according to the Hadoop infrastructure we would like to set up.
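All of the files edited in the following steps (hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, and the mapred-site.xml template) are located in this directory, so we change into it first and keep it as the working directory for the configuration steps below:
$ cd $HADOOP_HOME/etc/hadoop
$ ls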
Step 3 – Link Java Environment
In order to link Hadoop with Java we need to edit another important environment variable, “JAVA_HOME”, which is set in the configuration file ‘hadoop-env.sh’. Its value needs to be replaced with the location of our Java installation on the system. We assume here a Java installation as described in our article Install Hadoop.
$ vi hadoop-env.sh
Replace the line containing JAVA_HOME as follows:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
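If you are unsure where Java is installed on your system, one way to find the path is to resolve the ‘java’ binary as sketched below. Note that the resulting path (for example ending in ‘/jre/bin/java’) has to be shortened to the installation root before it is used as JAVA_HOME, and that ‘/usr/lib/jvm/java-8-oracle’ above is only an example that depends on your Java installation.
$ which java
$ readlink -f $(which java)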
Step 4 – Configure core-site.xml
The Hadoop configuration file core-site.xml contains information about the particular ‘Hadoop site’ itself. This includes the hostname and port number used for this particular Hadoop instance. Other optional settings are the memory allocated for the file system, memory limits for storing data, or more detailed configurations such as the size of read and write buffers. We edit this file with a text editor and use in this example ‘localhost’ as hostname and ‘9000’ as port number. This needs to be adapted depending on the corresponding Hadoop infrastructure.
$ vi core-site.xml
Insert the following lines into the <configuration> tag:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Note that for production systems one would change localhost to a real IP address. Note also that some cloud providers do not allow binding a process to a public IP and port. This is the case when you host Hadoop within some clouds. However, you can bind the service to the internal IP of the cloud instance, which is then publicly accessible via the public IP address. More configuration options for the file core-site.xml can be obtained from here.
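As an illustration only, a core-site.xml entry bound to such an internal IP could look as follows; the address 10.2.2.240 is merely an example and has to be replaced by the internal IP of your own instance:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.2.2.240:9000</value>
  </property>
</configuration>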
Step 5 – Configure hdfs-site.xml
The Hadoop configuration file hdfs-site.xml contains information about the Hadoop Distributed File System (HDFS) that is part of the Hadoop distribution. It includes the value for ‘replication’ and the path to the ‘namenode’ as well as the paths to ‘datanodes’ on the local file system. This is needed in order to tell HDFS a concrete place where data in the Hadoop infrastructure is stored. Below is an example, but it needs to be adjusted to your file system structure depending on the Hadoop infrastructure.
$ vi hdfs-site.xml
Insert the following lines into the <configuration> tag:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///home/ubuntu/hadoopinfra/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///home/ubuntu/hadoopinfra/hdfs/datanode</value>
  </property>
</configuration>
The values above mean that our file system uses a replication factor of one only in this example. The ‘namenode’ path is a local directory used by HDFS for its metadata, and the ‘datanode’ path is a local directory used by HDFS to store the actual data blocks. Both are integral parts of HDFS. More configuration options for the file hdfs-site.xml can be obtained from here.
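Depending on the system it may be necessary to create these local directories beforehand and to make sure they are writable by the user running Hadoop; a minimal sketch assuming the example paths above is:
$ mkdir -p /home/ubuntu/hadoopinfra/hdfs/namenode
$ mkdir -p /home/ubuntu/hadoopinfra/hdfs/datanode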
Step 6 – Configure yarn-site.xml
The Hadoop configuration file yarn-site.xml is used for the Hadoop scheduling system ‘Yet Another Resource Negotiator’ (YARN). This component is also an integral part of Hadoop alongside HDFS.
$ vi yarn-site.xml
Insert the following lines into the <configuration> tag:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
More configuration options for the file yarn-site.xml can be obtained from here.
Step 7 – Configure mapred-site.xml
The Hadoop configuration file mapred-site.xml specifies which map-reduce framework is used, which is YARN in our example. The Hadoop distribution contains a template of the ‘mapred-site.xml’ file named ‘mapred-site.xml.template’. We first copy this template file to the correct name and then add the lines shown below.
$ cp mapred-site.xml.template mapred-site.xml
$ vi mapred-site.xml
Insert the following lines into the <configuration> tag:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
More configuration options for the file mapred-site.xml can be obtained from here.
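Since a single malformed tag in any of these XML files prevents the services from starting, an optional sanity check is to validate all four edited files with ‘xmllint’, if this tool happens to be installed on your system:
$ xmllint --noout core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml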
Step 8 – Format Name Node
In order to check the configuration and to prepare HDFS for use we need to format the HDFS ‘namenode’ using the ‘hdfs’ command. To do this we go back to the home directory and perform the following commands:
$ cd ~
$ hdfs namenode -format
Output:
17/04/02 17:38:32 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   user = ubuntu
STARTUP_MSG:   host = i-3992cc97.csdc3cloud.internal/10.1.1.240
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.8.0
STARTUP_MSG:   classpath = /home/ubuntu/hadoop-2.8.0/etc/hadoop:/ ...
17/04/02 17:38:42 INFO common.Storage: Storage directory /home/ubuntu/hadoopinfra/hdfs/namenode has been successfully formatted.
...
17/04/02 17:38:43 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at i-3992cc97.csdc3cloud.internal/10.1.1.240
************************************************************/
As shown above the formatting was successful but the HDFS ‘namenode’ is shut down again after the process. We therefore need to start the HDFS system as a whole in the next step.
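As an optional additional check (not required for the following steps), the namenode directory configured in hdfs-site.xml can be inspected; after a successful format it should contain a ‘current’ subdirectory with metadata files such as ‘VERSION’:
$ ls /home/ubuntu/hadoopinfra/hdfs/namenode/current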
Step 9 – Start HDFS
HDFS is a key element of the Apache Hadoop installation and configuration, and we need to start it before it can be used. The following command starts HDFS and at the same time verifies whether our configuration is correct:
$ start-dfs.sh
Output:
17/04/02 17:46:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
The authenticity of host 'localhost (::1)' can't be established.
ECDSA key fingerprint is SHA256:0paYZ2E0RvPsE1bDbnA0FCchCebuJUvQOyj/MsL6Ifk.
Are you sure you want to continue connecting (yes/no)? yes
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting namenode, logging to /home/ubuntu/hadoop-2.8.0/logs/hadoop-ubuntu-namenode-i-3992cc97.out
localhost: starting datanode, logging to /home/ubuntu/hadoop-2.8.0/logs/hadoop-ubuntu-datanode-i-3992cc97.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:0paYZ2E0RvPsE1bDbnA0FCchCebuJUvQOyj/MsL6Ifk.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/ubuntu/hadoop-2.8.0/logs/hadoop-ubuntu-secondarynamenode-i-3992cc97.out
17/04/02 17:49:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
During startup there might be questions about the authenticity of the hosts, which we answer with ‘yes’. This is often the case during the first startup or when hostnames have been changed. If there are no critical error messages we can confirm that HDFS has been properly configured and is ready to be used. We can also verify that all HDFS services have been started as processes using the ‘jps’ command as follows:
$ jps
Output:
4368 DataNode
4678 Jps
4540 SecondaryNameNode
4270 NameNode
As shown above all HDFS processes are started: DataNode, NameNode, and SecondaryNameNode. If one of these three essential services is missing there is likely a problem in the configuration. We can further check the port configuration that is linked with the process information above using the command below. This is particularly helpful when we try other configurations such as starting the services not on localhost but on a different IP address in the core-site.xml. Here we use an example of 10.2.2.240:9000.
$ ss -ltp
Output:
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
LISTEN  0       128     127.0.0.1:37865      *:*     users:(("java",pid=4368,fd=204))
LISTEN  0       128     *:50090              *:*     users:(("java",pid=4540,fd=212))
LISTEN  0       128     *:50070              *:*     users:(("java",pid=4270,fd=200))
LISTEN  0       128     *:ssh                *:*
LISTEN  0       128     *:50010              *:*     users:(("java",pid=4368,fd=200))
LISTEN  0       128     127.0.0.1:6010       *:*
LISTEN  0       128     *:50075              *:*     users:(("java",pid=4368,fd=228))
LISTEN  0       128     *:50020              *:*     users:(("java",pid=4368,fd=232))
LISTEN  0       128     10.2.2.240:9000      *:*     users:(("java",pid=4270,fd=218))
LISTEN  0       128     :::ssh               :::*
LISTEN  0       128     ::1:6010             :::*
Step 10 – Start YARN
The YARN scheduler is another key element of the Apache Hadoop installation and configuration, and we need to start it before it can be used. The following command starts YARN and at the same time verifies whether our configuration is correct:
$ start-yarn.sh
Output:
starting yarn daemons
starting resourcemanager, logging to /home/ubuntu/hadoop-2.8.0/logs/yarn-ubuntu-resourcemanager-i-3992cc97.out
localhost: starting nodemanager, logging to /home/ubuntu/hadoop-2.8.0/logs/yarn-ubuntu-nodemanager-i-3992cc97.out
If there are no critical error messages we can confirm that YARN is started and can be used.
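Analogous to the HDFS check above, the ‘jps’ command should now additionally list the two YARN daemons, ResourceManager and NodeManager; the process IDs shown below are only illustrative:
$ jps
Output:
4368 DataNode
4540 SecondaryNameNode
4270 NameNode
5021 ResourceManager
5342 NodeManager
5688 Jps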
Step 11 – Browse Hadoop Webpage
We are able to access the Hadoop configuration and installation via a browser. The default port number of the HDFS NameNode web interface is 50070 and we can use the following URL to get the Hadoop services displayed in the browser.
http://localhost:50070/
Another fast way to check whether something is listening on this port on localhost is the following ‘telnet’ command. If nothing is listening one gets a connection refused (try it with a different port). The command is as follows:
$ telnet localhost 50070
Output:
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
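If ‘telnet’ is not available on the system, ‘curl’ can be used in a similar way; the sketch below simply checks that the NameNode web interface answers an HTTP request:
$ curl -I http://localhost:50070/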
Step 12 – Browse Hadoop Applications
We are able to access the Hadoop applications as well via a browser. The default port number of the YARN ResourceManager web interface is 8088 and we can use the following URL to get the information displayed in the browser.
http://localhost:8088/
We can use the same ‘telnet’ check on this port on localhost as shown below. This also concludes the steps for the Hadoop configuration.
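For example, assuming the default ResourceManager web interface port:
$ telnet localhost 8088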
Hadoop Configuration Details
Please have a look at the following video in order to understand more details: