Saturday, 2 January 2016

Apache Storm : Architecture Overview

This is a continuation of my last post, Apache Storm : Introduction.
The easiest way to understand the architecture of Storm is to compare its components with those of Apache Hadoop, which are superficially similar. Hadoop is the most senior citizen in the Big Data space, and most people are already familiar with it.

Superficial Comparison with Hadoop :
Storm Architecture
On a Hadoop cluster we run MR jobs, while on a Storm cluster we run topologies. The biggest difference is that an MR job starts, processes its input and eventually ends, while a topology, once started, is intended to keep processing live data forever (until we kill it). That data keeps arriving from sources such as ZeroMQ, Kafka, etc.
In terms of Node Services :
There are 2 kinds of nodes on a Storm cluster, similar to Hadoop: the master node and the worker nodes.
The master node runs a daemon called "Nimbus" that is similar to Hadoop's "JobTracker". Nimbus is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
Each worker node runs a daemon called the "Supervisor". Each supervisor can run one or more worker processes, which are separate JVM processes on its node. Each worker process, in turn, can run one or more tasks (spouts/bolts) in parallel. Each supervisor listens for work assigned by Nimbus to its node and starts and stops worker processes as necessary. Each worker process executes a subset of a topology; a running topology consists of many worker processes spread across many machines.
In terms of Tasks :
A Hadoop MR job is built from 2 kinds of tasks: map tasks and reduce tasks. This is restrictive: every job has to be expressed as a map phase followed by a reduce phase, and not all problems can be solved using this MR paradigm.
Storm, in contrast, has 2 kinds of components: spouts and bolts. In a topology, a spout receives data from external sources and creates the stream that bolts then process. Bolts can be chained serially or in parallel, depending on what kind of processing we want to do.
A simple Word Count problem can be solved in Storm in the following way :

Word Count Example
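
To make the data flow in the word count figure concrete, here is a minimal plain-Java sketch of the same pipeline: a spout feeding a split bolt feeding a count bolt. It deliberately does not use the Storm API. The class name WordCountSketch and the methods sentenceSpout, splitBolt and countBolt are made up for illustration, and everything runs in a single thread rather than on a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a word count topology: spout -> split bolt -> count bolt.
// None of these names are Storm classes; they only mimic the data flow.
class WordCountSketch {

    // "Spout": the source of the stream. Here it just replays a fixed list of sentences.
    static List<String> sentenceSpout() {
        return Arrays.asList("storm processes streams", "storm runs topologies");
    }

    // "Split bolt": consumes one sentence and emits one lower-cased word per token.
    static List<String> splitBolt(String sentence) {
        List<String> words = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    // "Count bolt": updates the running count for one incoming word.
    static void countBolt(Map<String, Integer> counts, String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Wires the three stages together, the way a topology wires spouts to bolts.
    static Map<String, Integer> run(List<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            for (String word : splitBolt(sentence)) {
                countBolt(counts, word);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(sentenceSpout()));
    }
}
```

In a real topology the spout would never run out of sentences, and each stage would run as many parallel tasks. The count bolt would then typically be wired with a fields grouping on the word, so that the same word always reaches the same task.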
An important point to note is that Apache Storm does not manage cluster state itself. Instead, it uses Apache ZooKeeper for this. All coordination between Nimbus and the Supervisors, such as task assignments and heartbeats, is done through a ZooKeeper cluster. The Nimbus and Supervisor daemons are stateless; all state is kept in ZooKeeper or on local disk (property storm.local.dir). This enables Storm to pick up right where it left off after a restart. Even if we kill -9 Nimbus or the Supervisors, they'll start back up like nothing happened.
Zookeeper for State Keeping
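
As a rough illustration of where these settings live, a storm.yaml fragment might look like the following. The hostnames, the directory and the port numbers are placeholders chosen for this sketch, not values from any real cluster.

```yaml
# storm.yaml (sketch; hostnames, path and ports are placeholders)

# ZooKeeper ensemble that Nimbus and the Supervisors coordinate through
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"

# Local directory where the daemons keep their on-disk state
storm.local.dir: "/var/storm"

# One worker process (slot) can be started per listed port on a supervisor node
supervisor.slots.ports:
  - 6700
  - 6701
```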

Another important point to be aware of is that Storm originally used the ZeroMQ library for inter-process communication (between different worker processes), but after it was adopted as an Apache project, the Storm developers replaced ZeroMQ with Netty.


Storm Components in depth :
  
Topology :
Topology Graph


In simple words, a topology is a network of spouts and bolts, as in the figure above. It is analogous to an MR job in Hadoop. It is a graph of computation in which spouts are the data stream sources and bolts are the actual processing tasks.
Each node in the graph contains some processing logic, and the links in the graph show how data is passed between nodes.
When a topology is submitted to a Storm cluster, the Nimbus service on the master node assigns its work to the supervisor services on the different worker nodes. Each supervisor creates one or more worker processes, each with its own separate JVM. Each worker process runs threads that we call executors. An executor thread runs the actual computational tasks: spouts or bolts.
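
The worker / executor layering described above can be pictured with a small plain-Java sketch. It spreads the executor threads of each component over a fixed number of worker processes in round-robin order. This is only an illustration: round-robin placement is an assumption made for this sketch and is not claimed to be how Storm's scheduler works, and the class ExecutorLayoutSketch and the component names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: places executors on workers round-robin.
// This is an assumed placement policy, not Storm's actual scheduler.
class ExecutorLayoutSketch {

    // parallelism maps a component name to the number of executor threads it asks for.
    // Returns one list per worker process, holding the executors placed on it.
    static List<List<String>> assign(Map<String, Integer> parallelism, int workers) {
        if (workers < 1) {
            throw new IllegalArgumentException("need at least one worker");
        }
        List<List<String>> layout = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            layout.add(new ArrayList<>());
        }
        int next = 0; // index of the next executor to place, across all components
        for (Map.Entry<String, Integer> component : parallelism.entrySet()) {
            for (int i = 0; i < component.getValue(); i++) {
                layout.get(next % workers).add(component.getKey() + "#" + i);
                next++;
            }
        }
        return layout;
    }

    public static void main(String[] args) {
        Map<String, Integer> parallelism = new LinkedHashMap<>();
        parallelism.put("sentence-spout", 2);
        parallelism.put("split-bolt", 3);
        parallelism.put("count-bolt", 3);

        List<List<String>> layout = assign(parallelism, 2);
        for (int w = 0; w < layout.size(); w++) {
            System.out.println("worker " + w + ": " + layout.get(w));
        }
    }
}
```

With 8 executors and 2 workers, each worker JVM ends up hosting 4 executor threads, with spout and bolt executors mixed inside the same process.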
A topology is submitted to a Storm cluster through a command :
storm jar storm-topology-code.jar chandan.storm.MyTopology arg1 arg2


Stream :  
1. Spout emitting tuples        2. Bolt processing tuples
A stream is the core abstraction in Storm. It is an unbounded sequence of tuples.
A tuple is the most basic data structure in Storm. It's a named list of values, and each value can be an object of any serializable type.
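
The idea of a "named list of values" can be sketched in plain Java as two parallel lists: one of field names and one of values. The NamedTuple class below is a hypothetical stand-in written for this post, not Storm's own tuple type, although Storm's tuple offers comparable lookups by field name and by position.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tuple: field names plus values at matching positions.
final class NamedTuple {
    private final List<String> fields;
    private final List<Object> values;

    NamedTuple(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("fields and values must have the same length");
        }
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by the name of its field.
    Object getByField(String name) {
        int index = fields.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("unknown field: " + name);
        }
        return values.get(index);
    }

    // Look a value up by its position in the list.
    Object get(int index) {
        return values.get(index);
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        NamedTuple tuple = new NamedTuple(
                Arrays.asList("word", "count"),
                Arrays.<Object>asList("storm", 2));
        System.out.println(tuple.getByField("word") + " -> " + tuple.getByField("count"));
    }
}
```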

Spout :
A spout is the entry point of a Storm topology and the source of its streams. It connects to the actual data source, such as a message queue like Kafka, continuously reads data, converts that data into a stream of tuples, and emits them to bolts for the actual processing. Spouts run as tasks inside worker processes, on executor threads.

Bolt :
A bolt contains the actual processing logic. It works only on streams, and it can emit new streams for further processing by other bolts downstream, or export/save data to persistent storage. It receives streams from one or more spouts or from other bolts. For example, in the simple word count example (see diagram above), the map and reduce steps are executed as 2 different bolts in serial fashion. Bolts can do anything: run functions, filter tuples, do streaming aggregations, do streaming joins, talk to databases, and more.

This pretty much sums up the architecture of Apache Storm. Hope it was helpful.  :)  

