Tuesday, 23 June 2015

Spark Installation: Pseudo Distributed/Single Node Cluster in Ubuntu

In my last post, I showed how to install Hadoop YARN in pseudo-distributed (single-node) mode.
Going further, I will describe here how to install Apache Spark in the same mode.
Background:
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications.
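As a small illustration of those in-memory primitives (you can try this in spark-shell once it is set up later in this post; the dataset size is arbitrary):

scala> val data = sc.parallelize(1 to 1000000).cache()   // mark the RDD to be kept in memory
scala> data.count   // the first action computes the RDD and caches it
scala> data.count   // later actions reuse the in-memory copy instead of recomputing it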

I will be installing the following software (with versions) for the Spark setup:
  1. Java SDK 7
  2. SSH Remote Access
  3. Hadoop 2.6.0
  4. Scala 2.10.4
  5. Spark  1.3.1
For steps 1-3, please refer to my last post.

Installing Scala:

hduser@chandan:/home/chandan/Desktop/downloads$ sudo tar -xzf scala-2.10.4.tgz

hduser@chandan:/home/chandan/Desktop/downloads$ sudo mv scala-2.10.4 scala

hduser@chandan:/home/chandan/Desktop/downloads$ sudo mv scala /usr/local/

hduser@chandan:/home/chandan/Desktop/downloads$ sudo vi ~/.bashrc
# adding scala
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin


hduser@chandan:/home/chandan/Desktop/downloads$ source ~/.bashrc

hduser@chandan:/home/chandan/Desktop/downloads$ scala   // should open the Scala REPL; try anything like: val num = 100
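A quick sanity check inside the REPL could look like this (the values are just examples):

scala> val num = 100
num: Int = 100

scala> num * 2
res0: Int = 200

scala> :quit   // leave the REPL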


Installing Spark:
Download a Spark build compatible with your Hadoop version. For example, here I am downloading Spark 1.3.1, pre-built for my Hadoop 2.6.0:
https://spark.apache.org/downloads.html
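If you prefer the command line, the pre-built package can also be fetched directly; the URL below follows the standard Apache archive layout, so adjust it if your mirror differs:

hduser@chandan:/home/chandan/Desktop/downloads$ wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz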

 
hduser@chandan:/home/chandan/Desktop/downloads$ sudo tar -xzf spark-1.3.1-bin-hadoop2.6.tgz

hduser@chandan:/home/chandan/Desktop/downloads$ sudo mv spark-1.3.1-bin-hadoop2.6  spark
hduser@chandan:/home/chandan/Desktop/downloads$ sudo mv spark/ /usr/local/

hduser@chandan:/home/chandan/Desktop/downloads$ vi ~/.bashrc
# adding spark
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
hduser@chandan:/home/chandan/Desktop/downloads$ source ~/.bashrc
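To confirm that the Spark binaries are now on the PATH, you can print the version (the exact banner varies by build):

hduser@chandan:/home/chandan/Desktop/downloads$ spark-submit --version   // should report Spark 1.3.1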


// Setting up the Spark configuration files
// spark-env.sh
hduser@chandan:/home/chandan/Desktop/downloads$ cd /usr/local/spark/conf/

hduser@chandan:/usr/local/spark/conf$ ls

fairscheduler.xml.template  log4j.properties.template  metrics.properties.template
slaves.template  spark-defaults.conf.template  spark-env.sh.template

hduser@chandan:/usr/local/spark/conf$ sudo cp spark-env.sh.template spark-env.sh

hduser@chandan:/usr/local/spark/conf$ sudo vi spark-env.sh
export SPARK_MASTER_IP=localhost      # bind the standalone master to localhost (single-node setup)
export SPARK_WORKER_CORES=1           # cores allotted to each worker
export SPARK_WORKER_MEMORY=800m       # memory allotted to each worker
export SPARK_WORKER_INSTANCES=1       # run a single worker on this machine


//slaves file
hduser@chandan:/usr/local/spark/conf$ sudo cp slaves.template slaves

hduser@chandan:/usr/local/spark/conf$ sudo vi slaves
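For a single-node setup the slaves file only needs to list the local machine, which is already the template's default, so no change is strictly required. The file should simply contain:

# /usr/local/spark/conf/slaves
localhost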

hduser@chandan:/usr/local/spark/conf$ sudo cp log4j.properties.template log4j.properties

hduser@chandan:/usr/local/spark$ sudo mkdir logs


Test standalone Spark: spark-shell and a standalone Spark job/program
hduser@chandan:/usr/local/spark/conf$ spark-shell

scala> sc.parallelize(2 to 200).count   // should return res0: Long = 199
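// optionally, a slightly bigger sanity check; the file path below is just an example of a local file present on most Ubuntu systems

scala> val lines = sc.textFile("file:///etc/hosts")      // read a local file as an RDD of lines
scala> lines.filter(_.contains("localhost")).count       // count the lines mentioning localhost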

// find the corresponding web UI for this spark-shell application at http://localhost:4040 (the default port for the first Spark application)


scala> exit
hduser@chandan:/usr/local/spark$ run-example SparkPi 5
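run-example is only a convenience wrapper; an equivalent spark-submit invocation would look roughly like this (the examples jar name is an assumption, so check the exact file under /usr/local/spark/lib/):

hduser@chandan:/usr/local/spark$ spark-submit --class org.apache.spark.examples.SparkPi --master local[2] lib/spark-examples-1.3.1-hadoop2.6.0.jar 5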

// Edit the log4j.properties file so that logs go to a file instead of the console:
# Initialize root logger
log4j.rootLogger=INFO, FILE
# Set everything to be logged to the FILE appender
log4j.rootCategory=INFO, FILE
# Ignore messages below warning level from Jetty, because it's a bit verbose
log4j.logger.org.eclipse.jetty=WARN
# Set the appender named FILE to be a File appender
log4j.appender.FILE=org.apache.log4j.FileAppender
# Change the path to where you want the log file to reside
log4j.appender.FILE.File=/usr/local/spark/logs/SparkOut.log
# Prettify output a bit
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

// Run the job again: instead of the console, logs will now go to /usr/local/spark/logs/SparkOut.log
hduser@chandan:/usr/local/spark$ run-example SparkPi 5
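To confirm the logs are indeed going to the file rather than the console, tail it while (or after) the job runs:

hduser@chandan:/usr/local/spark$ tail -f logs/SparkOut.log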


Test Spark on a YARN Cluster:

hduser@chandan:/usr/local/spark$ start-dfs.sh
hduser@chandan:/usr/local/spark$ start-yarn.sh
The YARN cluster manager (ResourceManager) UI should be up at http://localhost:8088
hduser@chandan:/usr/local/spark$ spark-shell --master yarn   // this starts spark-shell on the YARN cluster
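Once the shell comes up on YARN, the same quick test from before works there too, and you can also submit the SparkPi example straight to the cluster (again, the examples jar name is an assumption; adjust it to whatever sits under /usr/local/spark/lib/):

scala> sc.parallelize(2 to 200).count   // should again return Long = 199

hduser@chandan:/usr/local/spark$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster lib/spark-examples-1.3.1-hadoop2.6.0.jar 5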


Note: Sometimes you might get permission issues; in that case you need to set appropriate permissions on the folder. For example, here I want user "hduser" of group "hadoop" to own my /usr/local/spark/ directory.

hduser@chandan:/usr/local$ sudo chown -R hduser:hadoop spark/
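You can verify the ownership afterwards:

hduser@chandan:/usr/local$ ls -ld spark/   // owner and group should now show hduser hadoop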

I hope the post was useful! :)


11 comments:

  1. Excellent, I was able to complete the whole Hadoop + Scala + Spark setup within 2 hours.

    It was like a cakewalk, with only one permission issue:

    For the //slaves file step,
    add sudo before the cp command:
    sudo cp slaves.template slaves

  2. I have a doubt.
    I am new to this Spark world and am in great trouble.
    I have been trying to install various versions of Spark because I am not able to meet my requisites.
    My doubt is:
    do I need to build Spark with sbt or Maven in order to get PySpark running? I know how to start a Python shell in Spark (./bin/pyspark), but I don't know how to build Spark. I always reach this point and quit... please help me.

  3. Very useful and to the point.

  4. Hi friend, very useful, thanks a lot for your post. It worked fine for me the first time, but the second time I started spark-sql and ran the "show tables;" query, it kept processing continuously without displaying the result.

  5. Hi, I am getting
    endless "INFO Client: Application report for application_xx (state: ACCEPTED)" messages:


    start time: 1464406185066
    final status: UNDEFINED
    tracking URL: http://ubuntu:8088/proxy/application_1464406138396_0001/
    user: hduser
    16/05/27 20:29:47 INFO Client: Application report for application_1464406138396_0001 (state: ACCEPTED)
    16/05/27 20:29:48 INFO Client: Application report for application_1464406138396_0001 (state: ACCEPTED)
    16/05/27 20:29:49 INFO Client: Application report for application_1464406138396_0001 (state: ACCEPTED)
    16/05/27 20:29:50 INFO Client: Application report for application_1464406138396_0001 (state: ACCEPTED)
    16/05/27 20:29:51 INFO Client: Application report for application_1464406138396_0001 (state: ACCEPTED)
    16/05/27 20:29:52 INFO Client: Application report for application_1464406138396_0001 (state: ACCEPTED)

  6. Hi,
    i have a great news about spark installation. i have doubt why this used FIFO mode which means first in first out? or else why should can use LIFO methods? is there any old version it has? but this spark installation have any error in between time of installation? if any issues occur what preventive method i should handle for it?
    Hadoop Training in Chennai

  7. nice blog too informative. looking and reading your points its so impressive. doing more blog like this. i really appreciated doing like this
    Hadoop training in chennai

  8. The blog says $ sudo vi slaves and moves on to log4j. Does it mean nothing needs to be added to the slaves file?

  9. Really it was an awesome article...very interesting to read..You have provided an nice article....Thanks for sharing..
    Android Training in Chennai
    Ios Training in Chennai
