Saturday, 4 June 2016

Apache Spark Streaming: Logging Driver and Executor Logs the Right Way


From my experience, setting up logging properly is one of the most important things to do when starting Spark Streaming development, especially when you are running on a cluster with multiple worker machines.
The reason is simple: a streaming job is a continuously running process, and an exception or error may arrive after many hours or days, caused either by the driver or by an executor. The root cause is hard to debug because the driver logs go to the console and cannot be seen after the application shuts down, while the executor logs end up in stdout/stderr files (I am using Mesos as the cluster manager), which are tedious to download and inspect. In my case, an out-of-memory issue appeared after 2 days of running and the application went down; I had to determine whether the driver or an executor was the actual culprit where the issue first occurred. So I set up logging properly before debugging the issue.
Proper logging also lets us control how many days of logs we retain for the driver and executors, so that disk space is not eaten up by an ever-running application. And if we run multiple Spark Streaming applications on the same cluster, we can direct logs to separate files for different executors, even when multiple executors happen to run on the same worker machine.

Logging Configuration Steps:
I am using the standard Apache log4j library for logging, with appropriate logging levels in the code. The default Spark log4j properties template can be found in the Spark conf directory; in my case it is at /usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j.properties.template.
1. Create separate log4j configuration files for the driver and the executor and place them in the conf/ directory. Each should be configured with a different log file name and its own rolling properties as per the use case. If you run multiple applications, create a separate pair of files for each application. For example, I have 2 streaming applications named Request and Event, so I have created 4 files:

log4j_RequestLogDriver.properties
log4j_RequestLogExecutor.properties
log4j_EventLogDriver.properties
log4j_EventLogExecutor.properties

The contents of a log4j file (here, the driver file for the Event application) should look like this:
log4j.rootCategory=INFO, FILE
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=/tmp/eventLogDriver.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=500MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
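
The executor file follows the same pattern; only the appender's output file needs to differ so that driver and executor logs never end up in the same file. A minimal sketch of the corresponding executor configuration (the log file name below is illustrative, not taken verbatim from my setup):

log4j.rootCategory=INFO, FILE

log4j.appender.FILE=org.apache.log4j.RollingFileAppender
# Separate file per role so executor logs do not mix with driver logs
log4j.appender.FILE.File=/tmp/eventLogExecutor.log
log4j.appender.FILE.ImmediateFlush=true
log4j.appender.FILE.Threshold=debug
log4j.appender.FILE.Append=true
log4j.appender.FILE.MaxFileSize=500MB
log4j.appender.FILE.MaxBackupIndex=10
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n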


2. Copy the above files to the same conf/ directory on each worker machine, and create the log directory for the log files on each of them. I have specified the log directory as /tmp/ for now, which already exists. As shown above, I keep only the 10 most recent log files of max size 500 MB each for my driver and executor logs on each worker machine, which gives roughly 4-5 days of log retention for my specific use case. (A quick copy/setup sketch is shown after step 3 below.)
3. Point the driver and executor at their log4j configuration files in the spark-submit command, as below:

spark-submit \
  --driver-java-options "-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties" \
  --master mesos://sparkclusterip:5050 \
  --jars …… and so on
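
For step 2, this is roughly what copying the files and preparing the log directory can look like; the worker hostnames below are placeholders, and the conf/ path is assumed to be the same on every machine:

for host in worker1 worker2 worker3; do
  # Ship the four log4j properties files to the same conf/ directory on the worker
  scp /usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_*Log*.properties \
      "$host":/usr/local/spark-1.5.1-bin-hadoop2.6/conf/
  # Create the log directory only if you use something other than /tmp (which already exists)
  ssh "$host" "mkdir -p /tmp"
done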

Troubleshooting a logging issue:
Ideally, as per the Spark documentation, this configuration is enough to get separate log files for the driver and the executors.
However, for some unknown reason, in my case the driver logs were being written properly to file, but the executor logs were not going into their file and were still ending up in the Mesos stdout/stderr files.
Setting the same driver and executor Java options on the SparkConf object in code, via the application conf properties file, did the trick for me. This is another way of setting Spark properties if you don't want to include them in the spark-submit command.
application.conf:
request-driver-options = "-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties"
request-executor-options = "-XX:+UseG1GC -Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties"
Code:
sparkConf.set("spark.driver.extraJavaOptions", conf.getString("application.request-driver-options"))
sparkConf.set("spark.executor.extraJavaOptions", conf.getString("application.request-executor-options"))

After doing the above, I can now see all logs getting written to the log files only. On the Mesos UI, the stderr/stdout files are now empty, whereas they used to contain the executor logs earlier.
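
For completeness, here is a minimal sketch of how that wiring can look end to end, assuming the Typesafe Config library (ConfigFactory) is used to read application.conf; the object name and batch interval are illustrative, not the exact code from my application:

import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RequestStreamingApp {
  def main(args: Array[String]): Unit = {
    // Loads application.conf from the classpath; the two option strings are
    // assumed to live under an "application" block, as in the code above.
    val appConfig = ConfigFactory.load()

    val sparkConf = new SparkConf()
      .setAppName("RequestStreamingApp")
      // Point the driver and executors at their own log4j properties files
      .set("spark.driver.extraJavaOptions", appConfig.getString("application.request-driver-options"))
      .set("spark.executor.extraJavaOptions", appConfig.getString("application.request-executor-options"))

    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // ... define the streaming sources and transformations here ...
    ssc.start()
    ssc.awaitTermination()
  }
}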




Although this post described logging configuration based on my experience running Spark Streaming on a Mesos cluster, it will be similar for batch Spark jobs and for other cluster managers such as standalone and YARN. Also, I am using Spark 1.5.1 right now, so behaviour may change in future releases as Spark evolves further, e.g. in cases like graceful shutdown; see my last post.

Comments:

  1. Nice article, Chandan :). One little out-of-context query: is there any reason
    for choosing Mesos for cluster management over the other alternatives? What was the decisive factor for choosing Mesos as the cluster manager?

  2. Yes Dwijen, there are a few reasons when compared to its competitor YARN:
    1. Mesos is a general-purpose resource manager: it allows sharing resources in the same cluster with other applications that may or may not be related to Big Data. YARN, as of now, only allows resource sharing among Big Data technologies like Hadoop, Spark, Storm, etc.
    2. Mesos is lightweight compared to YARN.
    3. Mesos was the default cluster manager for Spark when Spark was developed at UC Berkeley, and Mesos is a product of UC Berkeley as well; YARN support came later.
    4. Development of Spark with Mesos took slightly more time compared to YARN, but the benefit is that I can reuse my learning of the same resource manager for other future applications.
    5. Performance-wise, Mesos and YARN are almost similar, from what I have googled.
    P.S. In short, if your cluster is Big Data specific only, you can opt for YARN; else Mesos.
    There are a few other reasons which I don't recall right now.

  3. Thanks for sharing your knowledge and your blog is informative.

  4. How do I set up the number of days of logs that I want to retain?

    Say, retain the last 30 days of logs, no matter the size.

  5. Excellent job Chandan!! Could you please help me to store YARN and Spark application logs combined, either on a Unix box or on an HDFS Hadoop cluster?
