Install Hadoop
Installing Hadoop on a cluster enables us to process and analyse big data using the MapReduce paradigm. This article provides a step-by-step installation guide.
Step 1 – Create Hadoop User
Before we generate the keys it makes sense to create a dedicated user for the framework. This requires superuser privileges, i.e. the identity ‘root’, to which we can switch with the command ‘su’ if the root password is available. Note that this step is optional and not strictly necessary for generating key pairs; it can also be done using your normal login. The steps to generate the keys are the same either way and are shown below on Ubuntu Linux as an example. The user name ‘hadoop’ in the following optional commands is just one example and can be adapted depending on which big data framework is installed:
$ su
Output:
password:
# useradd hadoop
# passwd hadoop
Output:
New passwd:
Retype new passwd:
A common question is why another user needs to be created here. Creating a separate user in Ubuntu for a big data framework like Apache Hadoop is recommended in order to isolate, for example, the Hadoop Distributed File System (HDFS) from the standard Unix file system. Note that this is only a recommendation and not an absolute requirement. Once the user is created we can log into the system with the new hadoop identity and generate the SSH keys.
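For example, we can switch to the new identity and verify it as follows (the ‘-’ option of ‘su’ also loads the user’s login environment):
$ su - hadoop
$ whoami
Output:
hadoop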
Step 2 – Generate SSH Keys
The article Ubuntu Generate SSH Key provides an example of how public/private key pairs can be generated on an Ubuntu Linux system. This is required for various operations on a computing cluster, including starting and stopping Hadoop and other distributed daemons via shell scripts.
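As a brief sketch of what that article describes in detail, a passphrase-less key pair can be generated and authorized for the hadoop user, so that the Hadoop start and stop scripts can reach the node without a password prompt:
$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh localhost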
Step 3 – Install Java
Apache Hadoop is implemented in Java, so Java is the main prerequisite that needs to be installed. There are many ways to install Java; we provide an example in our article Install Oracle Java Ubuntu.
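Whichever way Java is installed, the result can be verified and JAVA_HOME set accordingly; the path below assumes the Oracle Java 8 package and has to be adapted to the actual installation:
$ java -version
$ echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle" >> ~/.bashrc
$ source ~/.bashrc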
Step 4 – Install Hadoop
First we need to download Apache Hadoop 2.8 from one of the download mirrors, manually using the command ‘wget’ as shown below. One can choose between the source and the binary distribution; we download the binary distribution of Hadoop 2.8 (~410 MB).
$ wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
Output:
--2017-04-02 15:29:49--  http://www-eu.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
Resolving www-eu.apache.org (www-eu.apache.org)... 88.198.26.2, 2a01:4f8:130:2192::2
Connecting to www-eu.apache.org (www-eu.apache.org)|88.198.26.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 429928937 (410M) [application/x-gzip]
Saving to: ‘hadoop-2.8.0.tar.gz’

hadoop-2.8.0.tar.gz 100%[====================================================================================>] 410.01M  4.18MB/s   in 95s

2017-04-02 15:31:25 (4.33 MB/s) - ‘hadoop-2.8.0.tar.gz’ saved [429928937/429928937]
Afterwards we need to extract the file using the ‘tar’ command as shown below:
$ tar xzf hadoop-2.8.0.tar.gz
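Optionally, the extracted directory can be moved to a convenient location and added to the PATH; the target directory ~/hadoop below is just an example, not a requirement:
$ mv hadoop-2.8.0 ~/hadoop
$ export HADOOP_HOME=~/hadoop
$ export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
$ hadoop version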
Then it is time to configure the Hadoop system. Please refer to our article Hadoop Configuration and perform all the steps described there in detail.
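Once the configuration is in place, HDFS is formatted once and the daemons are started with the scripts shipped in Hadoop’s sbin directory (assuming bin and sbin are on the PATH as sketched above):
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh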
Step 5 – Check Hadoop Processes
The command ‘jps’ is an easy way to check which Hadoop processes are running and thus whether the Hadoop installation works.
$ jps
Output:
5152 Jps
4368 DataNode
4778 ResourceManager
4540 SecondaryNameNode
4270 NameNode
4879 NodeManager
These process IDs are important to consider when the Hadoop installation should be accessible beyond ‘localhost’. We can map them to open ports using the ‘ss -ltp’ command as shown below.
$ ss -ltp
Output:
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
LISTEN  0       128     127.0.0.1:37865      *:*     users:(("java",pid=4368,fd=204))
LISTEN  0       128     *:50090              *:*     users:(("java",pid=4540,fd=212))
LISTEN  0       128     *:50070              *:*     users:(("java",pid=4270,fd=200))
LISTEN  0       128     *:ssh                *:*
LISTEN  0       128     *:50010              *:*     users:(("java",pid=4368,fd=200))
LISTEN  0       128     127.0.0.1:6010       *:*
LISTEN  0       128     *:50075              *:*     users:(("java",pid=4368,fd=228))
LISTEN  0       128     *:50020              *:*     users:(("java",pid=4368,fd=232))
LISTEN  0       128     10.2.2.240:9000      *:*     users:(("java",pid=4270,fd=218))
LISTEN  0       128     :::ssh               :::*
LISTEN  0       128     :::13562             :::*    users:(("java",pid=4879,fd=240))
LISTEN  0       128     ::1:6010             :::*
LISTEN  0       128     :::8030              :::*    users:(("java",pid=4778,fd=221))
LISTEN  0       128     :::8031              :::*    users:(("java",pid=4778,fd=210))
LISTEN  0       128     :::39744             :::*    users:(("java",pid=4879,fd=219))
LISTEN  0       128     :::8032              :::*    users:(("java",pid=4778,fd=231))
LISTEN  0       128     :::8040              :::*
Note that the example above shows a core-site.xml configured not for localhost but for a real IP address, as we would do in production so that the installation is reachable from outside. In our example the NameNode uses IP address 10.2.2.240 and listens on port 9000.
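For reference, the relevant part of such a core-site.xml looks roughly as follows; this is only a sketch, and the complete configuration is covered in the Hadoop Configuration article:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://10.2.2.240:9000</value>
  </property>
</configuration>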
Install Hadoop Details
We recommend checking the following video on this topic: