Monday, 29 February 2016

HeartBeat Issue in Apache Storm


Today I want to discuss a Storm issue I faced a few days back during development.
I made some code changes in a Bolt, tested it locally and deployed it in the staging environment.
Everything ran fine until the 24th hour, when all of a sudden Storm stopped processing.
I looked into the logs but could not find any exception or error.
However, in Supervisor.log I saw log lines indicating that the worker had been shut down and then relaunched:

2015-12-23 07:06:20 b.s.d.supervisor [INFO] Shutting down and clearing state for id edf00206-32c4-4782-8ace-dd01c1a7096a. Current supervisor time: 1450875980. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1450875947, :storm-id "staging-data-pipeline-topology-1-1450789565", :executors #{[2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [15 15] [16 16] [17 17] [-1 -1] [1 1]}, :port 6708}
2015-12-23 07:06:20 b.s.d.supervisor [INFO] Shutting down 5f5bcf02-b919-48c0-a73b-8ffbb445d315:edf00206-32c4-4782-8ace-dd01c1a7096a
2015-12-23 07:06:20 b.s.d.supervisor [INFO] Shut down 5f5bcf02-b919-48c0-a73b-8ffbb445d315:edf00206-32c4-4782-8ace-dd01c1a7096a
2015-12-23 07:06:20 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "staging-data-pipeline-topology-1-1450789565", :executors ([2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [15 15] [16 16] [17 17] [1 1])} for this supervisor 5f5bcf02-b919-48c0-a73b-8ffbb445d315 on port 6708 with id 9ce11d5d-58e7-4b39-8cef-1ec99fbb72d5
2015-12-23 07:06:20 b.s.d.supervisor [INFO] Launching worker with command………
…………………………….

The automatic restart was a surprise, as it had never happened before. Also, we use a ZMQ publisher as the source of data for the Spout, and the Spout needs a handshake with the publisher. Only after the handshake does the publisher start sending data to the Spout, which converts it into a stream that is then processed by the Bolt (for details on Storm internals, please refer to my last post: Storm Architecture).
But since the restart happened automatically, no handshake was done, so the ZMQ publisher did not start sending any data, and hence there was no Storm processing at all.
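Just to make the setup clearer, here is a rough sketch of what such a spout looks like. This is not our actual code: the class name, endpoints and the "HELLO"/"line" messages are made up for illustration, and it assumes the JeroMQ client (org.zeromq) plus the pre-1.0 backtype.storm package names that appear in the logs above.

import java.util.Map;

import org.zeromq.ZMQ;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical spout: handshakes with the ZMQ publisher once, then pulls log lines and emits them.
public class ZmqLogSpout extends BaseRichSpout {

    private transient SpoutOutputCollector collector;
    private transient ZMQ.Context context;
    private transient ZMQ.Socket subscriber;

    @Override
    public void open(Map conf, TopologyContext topologyContext, SpoutOutputCollector collector) {
        this.collector = collector;
        this.context = ZMQ.context(1);

        // Handshake: ask the publisher to start streaming (placeholder message).
        // In our real setup this only succeeded when the publisher was (re)started as well,
        // which is why an automatic worker restart left the pipeline idle.
        ZMQ.Socket handshake = context.socket(ZMQ.REQ);
        handshake.connect("tcp://publisher-host:5555");   // placeholder endpoint
        handshake.send("HELLO");                          // placeholder handshake message
        handshake.recvStr();                              // wait for the publisher's ack
        handshake.close();

        // After the handshake the publisher sends data on the SUB socket.
        this.subscriber = context.socket(ZMQ.SUB);
        subscriber.connect("tcp://publisher-host:5556");  // placeholder endpoint
        subscriber.subscribe("".getBytes());              // subscribe to everything
    }

    @Override
    public void nextTuple() {
        // Non-blocking read so the spout does not stall the executor thread.
        String line = subscriber.recvStr(ZMQ.DONTWAIT);
        if (line != null) {
            collector.emit(new Values(line));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}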
I restarted the ZMQ publisher and the Storm topology; the handshake happened and processing resumed, but after exactly 24 hours the same issue occurred: the topology was restarted and Storm stopped processing. Now it really became a concern.

So I started looking into the changes I had made and tried to relate the issue to them.

Changes Done in the Bolt:
In the Bolt, I was processing log lines arriving in real time, applying some logic and converting them into rows of records with a COUNT field (indicating the number of occurrences) before persisting them into the database. With the new change, I had to process a log line only if it was unique within the last 24 hours, and this uniqueness depended on the value of a field (let's call it "UniqueField", of type Long) coming as part of the incoming log line. So I used Guava's CacheBuilder to build a time-eviction-based TTLCache in which I stored the UniqueField values with expireAfterWrite set to 24 hours (for more details on the Guava cache, refer to: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilder.html).
If a log line with the same UniqueField value arrives within the 24-hour window, it is treated as a duplicate and skipped.
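To make the change concrete, here is a minimal sketch of the idea, not the actual production code: names like DedupBolt and the "uniqueField"/"line" tuple fields are hypothetical, and the real bolt also builds the COUNT records and writes them to the database.

import java.util.Map;
import java.util.concurrent.TimeUnit;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Hypothetical bolt illustrating the 24-hour dedup with a Guava time-evicting cache.
public class DedupBolt extends BaseRichBolt {

    private transient OutputCollector collector;
    // Keys expire 24 hours after they are written; the value is just a marker.
    private transient Cache<Long, Boolean> seenKeys;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.seenKeys = CacheBuilder.newBuilder()
                .expireAfterWrite(24, TimeUnit.HOURS)
                .build();
    }

    @Override
    public void execute(Tuple input) {
        long uniqueField = input.getLongByField("uniqueField");
        String logLine = input.getStringByField("line");

        // Only process the line if its UniqueField was not seen in the last 24 hours.
        if (seenKeys.getIfPresent(uniqueField) == null) {
            seenKeys.put(uniqueField, Boolean.TRUE);
            // ... real logic: build COUNT records and persist them to the database ...
            collector.emit(input, new Values(uniqueField, logLine));
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("uniqueField", "line"));
    }
}

Creating the cache in prepare() rather than in the constructor keeps it out of the serialized bolt instance that Storm ships to the workers.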

Finding The Root Cause:
It seemed the issue was somehow related to this 24-hour window cache. Over 24 hours the number of entries in this TTLCache<Long> was reaching up to 20 million, so I had given Storm plenty of memory (24 GB) to rule out a JVM memory issue.
Looking at the Supervisor.log after the issue appeared again, I found this line:

State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat

I googled it.
"State: :timed-out" means the heartbeat from the worker process did not reach its Supervisor in time, so the Supervisor considered the worker dead and restarted it. So I checked the default timeout value, which was 30 secs (see https://github.com/apache/storm/blob/master/conf/defaults.yaml):

#how long between heartbeats until supervisor considers that worker dead and tries to restart it
supervisor.worker.timeout.secs: 30

But why was the worker unable to send a heartbeat to the Supervisor exactly after 24 hours?
The thought of JVM garbage collection came to mind. A lot of entries were evicted from the cache exactly after 24 hours, as their lifetime was over. So it might have happened that a large number of Long objects (the UniqueField values) were suddenly being garbage collected, a stop-the-world event, causing a delay of more than 30 secs before the worker could send its heartbeat to the Supervisor. The Supervisor would then consider the worker dead and restart it. This made sense to me, so I increased the timeout value from 30 secs to 600 secs (10 minutes):
supervisor.worker.timeout.secs: 600
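Looking back, one way to confirm a pause like this is to enable GC logging on the worker JVM and check for long stop-the-world times around the eviction window. A possible setting, only as a sketch: worker.childopts is Storm's standard knob for worker JVM options, while the heap size and log path below are placeholders, not the exact values we used:

worker.childopts: "-Xmx24g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/storm/worker-gc.log"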

I restarted the publishers and the topology. After 24 hours, Storm stopped again.
Supervisor.log again showed the following lines:

2015-12-25 00:21:15 b.s.d.supervisor [INFO] Shutting down and clearing state for id da57c681-418c-4b87-86ef-e5912aac445a. Current supervisor time: 1451024475. State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1451024474, :storm-id "staging-data-pipeline-topology-1-1450937816", :executors #{[2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [15 15] [16 16] [17 17] [-1 -1] [1 1]}, :port 6708}
2015-12-25 00:21:15 b.s.d.supervisor [INFO] Shutting down 5f5bcf02-b919-48c0-a73b-8ffbb445d315:da57c681-418c-4b87-86ef-e5912aac445a
2015-12-25 00:21:15 b.s.d.supervisor [INFO] Shut down 5f5bcf02-b919-48c0-a73b-8ffbb445d315:da57c681-418c-4b87-86ef-e5912aac445a
2015-12-25 00:21:15 b.s.d.supervisor [INFO] Launching worker with assignment #backtype.storm.daemon.supervisor.LocalAssignment{:storm-id "staging-data-pipeline-topology-1-1450937816", :executors ([2 2] [3 3] [4 4] [5 5] [6 6] [7 7] [8 8] [9 9] [10 10] [11 11] [12 12] [13 13] [14 14] [15 15] [16 16] [17 17] [1 1])} for this supervisor 5f5bcf02-b919-48c0-a73b-8ffbb445d315 on port 6709 with id 1d3b6180-05dd-4d1e-b79a-1fa8f82f1fc4
2015-12-25 00:21:15 b.s.d.supervisor [INFO] Launching worker with command……….
………………….

The logs looked similar to the previous ones, with information about the worker shutdown and restart. However, this time the value of "State" in the log was different. It said:

State: :disallowed, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat

So I googled again.
"State: :disallowed" means that Nimbus told the Supervisor to kill the worker. It seemed that this time Nimbus might not have received the heartbeat from the Supervisor/Task (through ZooKeeper) in time (again due to the GC pause) and hence had decided to shut down and relaunch the worker.
I found similar heartbeat timeouts on the Nimbus side, with defaults of 30 and 60 secs:
nimbus.task.timeout.secs: 30
nimbus.supervisor.timeout.secs: 60

So I increased both of these values to 600 secs as well:
nimbus.task.timeout.secs: 600
nimbus.supervisor.timeout.secs: 600

After making this change, I restarted the topology and it ran successfully past the 24-hour mark and beyond without any issue.
This confirmed my assumption that a GC pause was causing the heartbeat issue between Storm components after the first 24 hours, when lots of entries from the TTLCache were being evicted and garbage collected. This was the first time I witnessed a live issue caused by garbage collection.

I could not find much information online about this heartbeat issue in Storm, so it took me some time to figure out the cause and fix it. I hope this post is useful for someone facing a similar issue in the future.

Comments:

  1. Hi, I got a "Too many open files" error while the worker tried to send its heartbeat message, and the worker died even before the supervisor realized it. But I noticed the following default value:
    worker.heartbeat.frequency.secs: 1

    Did you consider increasing this frequency as well?

  2. What version of Apache Storm were you using back then?

  3. Thank you for sharing this!
    I'm having a similar problem here. The only difference is that my topology had been running fine for 2 months and then this suddenly happened. I guess my issue has something to do with a network problem that caused the heartbeat to stop for a while.