Tuesday, 23 June 2015

Hadoop Installation : Pseudo Distributed (Single Node) Cluster in Ubuntu

In this post, I will describe the installation of Hadoop as a single-node (pseudo-distributed) cluster, step by step.
I will try to explain each step clearly through the commands, as well as screenshots wherever possible.

Background :
Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. It internally contains 3 important components:
HDFS : A highly fault-tolerant distributed file system designed to be deployed on low-cost hardware.
YARN : (Yet Another Resource Negotiator) The cluster resource manager introduced in Hadoop 2.
MR : The MapReduce processing paradigm, provided by Hadoop since its inception.
 
Software versions used in this tutorial:
Operating System : Ubuntu 14.04 LTS
Hadoop : 2.6.0
Java/JDK : 1.7.0_79

Steps for Installation:

Note: updating the Ubuntu package index is recommended before starting:
          chandan@chandan:~$   sudo apt-get update
Jdk Installation:
For Hadoop installation, Java 1.5+ is a must. However, I recommend installing Java 7 if you are doing a fresh setup:
chandan@chandan:~$ sudo apt-get install default-jdk
chandan@chandan:~$ java -version
java version "1.7.0_79"

Create Group and User:
Add a new group "hadoop" and a dedicated Hadoop user "hduser". Although not mandatory, it is recommended to keep the Hadoop installation under a separate user:
chandan@chandan:~$ sudo addgroup hadoop
chandan@chandan:~$ sudo adduser --ingroup hadoop hduser
Add hduser to the list of sudoers so that you can run commands with sudo when you are logged in as hduser:
             chandan@chandan:~$  sudo adduser hduser sudo

Install SSH and create Certificates:
SSH is needed for remote login from one machine to another; Hadoop uses it to manage its daemons, even on a single node.
chandan@chandan:~$ sudo apt-get install ssh
chandan@chandan:~$ su hduser
hduser@chandan:/home/chandan$ ssh-keygen -t rsa -P ""
               
hduser@chandan:/home/chandan$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
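Depending on how the .ssh directory was created, you may also need to tighten its permissions, otherwise sshd can refuse key-based login:
hduser@chandan:/home/chandan$ chmod 700 $HOME/.ssh && chmod 600 $HOME/.ssh/authorized_keys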
hduser@chandan:/home/chandan$ ssh localhost       // check that passwordless login works


Download and Install Hadoop 2.6.0 :               
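If you do not already have the tarball, it can be downloaded from the Apache archive (the exact mirror/URL may differ; adjust as needed):
hduser@chandan:/home/chandan$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz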
hduser@chandan:/home/chandan$ sudo tar xvzf hadoop-2.6.0.tar.gz
hduser@chandan:/home/chandan$ cd hadoop-2.6.0/
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo mkdir /usr/local/hadoop
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
               
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop/
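You can quickly verify that the files were moved and that hduser now owns them:
hduser@chandan:/home/chandan/hadoop-2.6.0$ ls -l /usr/local/hadoop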


We will need to set at least these configuration files:
.bashrc , hadoop-env.sh , core-site.xml , hdfs-site.xml (plus mapred-site.xml, covered below)
Copy the entries into these files as described in the steps below:

// setting the .bashrc file :
hduser@chandan:/home/chandan/hadoop-2.6.0$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
#HADOOP VARIABLES END
hduser@chandan:/home/chandan/hadoop-2.6.0$ source ~/.bashrc
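To verify that the new variables are picked up (assuming the paths above match your installation), check the Hadoop version from the updated shell:
hduser@chandan:/home/chandan/hadoop-2.6.0$ hadoop version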
//setting hadoop-env.sh :  set JAVA_HOME
hduser@chandan:/home/chandan/hadoop-2.6.0$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # add/update this line

//core-site.xml : contains configuration properties that Hadoop uses when starting up
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo mkdir -p /app/hadoop/tmp
[sudo] password for hduser:
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo chown hduser:hadoop /app/hadoop/tmp
edit2: hduser@chandan:/home/chandan/hadoop-2.6.0$  sudo vi  /usr/local/hadoop/etc/hadoop/core-site.xml 
<configuration>
<property>
 <name>hadoop.tmp.dir</name>
 <value>/app/hadoop/tmp</value>
 <description>A base for other temporary directories.</description>
</property>
<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:54310</value>
 <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
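Note: fs.default.name still works in Hadoop 2 but is deprecated in favour of fs.defaultFS; if you prefer the newer name, the same property can be written as:
<property>
 <name>fs.defaultFS</name>
 <value>hdfs://localhost:54310</value>
</property>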


//mapred-site.xml : MapReduce-specific properties
hduser@chandan:/home/chandan/hadoop-2.6.0$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
hduser@chandan:/home/chandan/hadoop-2.6.0$ vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
</configuration>
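Note: mapred.job.tracker is a Hadoop 1 property and is ignored by YARN. If you want MapReduce jobs to actually run on YARN (the usual choice with Hadoop 2), a common alternative is roughly the following; the property names below are the standard ones shipped with Hadoop 2.6, adapt the values to your setup:
<!-- in mapred-site.xml -->
<property>
 <name>mapreduce.framework.name</name>
 <value>yarn</value>
</property>
<!-- in /usr/local/hadoop/etc/hadoop/yarn-site.xml -->
<property>
 <name>yarn.nodemanager.aux-services</name>
 <value>mapreduce_shuffle</value>
</property>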
               
// hdfs-site.xml : HDFS-related properties such as the namenode and datanode directories
//creating directories for namenode and datanode
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@chandan:/home/chandan/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop_store
//enter namenode/datanode information
hduser@chandan:/home/chandan/hadoop-2.6.0$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
 <name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
 <name>dfs.datanode.data.dir</name>  
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
               
// format the Hadoop file system : needed only once, after configuration and before starting Hadoop for the first time
hduser@chandan:/home/chandan/hadoop-2.6.0$ hadoop namenode -format      (in Hadoop 2 the preferred form is "hdfs namenode -format"; the older command still works but prints a deprecation warning)

//starting hadoop : scripts can be found at /usr/local/hadoop/sbin
               
hduser@chandan:/home/chandan/hadoop-2.6.0$ ls /usr/local/hadoop/sbin
hduser@chandan:/home/chandan/hadoop-2.6.0$ start-all.sh      (deprecated; equivalently run start-dfs.sh and then start-yarn.sh)
               
//verify the processes running
hduser@chandan:/home/chandan/hadoop-2.6.0$ jps
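If everything came up correctly, jps should list roughly the following daemons (the PIDs will differ and are omitted here):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps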

//Web UI for namenode/datanode:
http://localhost:50070/dfshealth.html#tab-overview

// check the Yarn UI : http://localhost:8088/cluster


//Stopping the nodes once you are done:
stop-all.sh  (stop-dfs.sh and stop-yarn.sh)

Hope the post was helpful for you!!  :)


