Thursday, 31 December 2015

Apache Storm: Introduction

Apache Storm, in simple terms, is a distributed framework for real-time processing of Big Data, just as Apache Hadoop is a distributed framework for batch processing.

Why Storm was needed:
First of all, we should understand why we needed Storm when we already had Hadoop.
Hadoop is intelligently designed to process huge amounts of data in a distributed fashion on commodity hardware, but the way it processes data does not suit real-time use cases. MapReduce jobs are disk intensive: every time a job runs, it fetches data from disk, processes it in memory in the mappers/reducers, and then writes the results back to disk; the job finishes then and there. For the next batch of data, a new job is created and the whole cycle repeats.
In real-time processing, however, data arrives continuously, and you have to keep reading it, processing it, and writing the output. And of course, all of this has to be done in parallel, fast, in a distributed fashion, and with no data loss. This is where Storm comes in.

Background History of Storm:
Apache Storm was originally developed by Nathan Marz and the team at BackType. BackType was later acquired by Twitter, which open-sourced the project. It was then adopted by the Apache Software Foundation and, in 2014, became a top-level Apache project.
Before Storm was written, the usual way to process data in real time was a queues-and-workers approach. For example, some threads would continuously write data to queues such as RabbitMQ, while worker threads continuously read data from those queues and processed it. The output might be written to yet more queues, chained as input to further worker threads downstream.
Such a design is possible but obviously very fragile. Much of your time goes into maintaining the whole framework, serializing/deserializing messages, dealing with data loss, and resolving countless other issues rather than doing the actual processing work.
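The queues-and-workers pattern described above can be sketched with Python's standard library. Here an in-process queue.Queue stands in for a broker like RabbitMQ, and all the names are illustrative, not part of any real API:

```python
import queue
import threading

# A hypothetical two-stage pipeline: producer -> queue -> worker -> queue.
raw = queue.Queue()
processed = queue.Queue()
STOP = object()  # sentinel telling the worker to shut down

def worker():
    """Continuously read messages from `raw`, transform them, push results to `processed`."""
    while True:
        msg = raw.get()
        if msg is STOP:
            break
        processed.put(msg.upper())  # stand-in for real processing work

t = threading.Thread(target=worker)
t.start()
for word in ["storm", "hadoop"]:
    raw.put(word)
raw.put(STOP)
t.join()

results = []
while not processed.empty():
    results.append(processed.get())
print(results)  # ['STORM', 'HADOOP']
```

Even this toy version hints at the fragility: shutdown sentinels, message formats, and failure handling are all on you, and none of it is the actual processing logic.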
Nathan Marz came up with the nice idea of abstracting all of this away efficiently: you just write a SPOUT and BOLTs to do the necessary processing, submit the job as a TOPOLOGY, and the framework takes care of everything else.
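To make the spout/bolt/topology idea concrete, here is a conceptual sketch. Storm's real API is Java (org.apache.storm.*) and handles distribution, parallelism, and fault tolerance; these Python classes only mimic the data flow of a classic word-count topology, and every name here is illustrative:

```python
class SentenceSpout:
    """A spout is a source of tuples (here, hard-coded sentences)."""
    def next_tuple(self):
        for sentence in ["the cow jumped", "the moon rose"]:
            yield sentence

class SplitBolt:
    """A bolt consumes tuples and emits new ones (here, one word per tuple)."""
    def execute(self, sentence):
        for word in sentence.split():
            yield word

class CountBolt:
    """A terminal bolt that aggregates state (word counts)."""
    def __init__(self):
        self.counts = {}
    def execute(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

# "Submitting the topology" amounts to wiring spout -> bolt -> bolt;
# in real Storm the framework runs this wiring in parallel across a cluster.
spout, split, count = SentenceSpout(), SplitBolt(), CountBolt()
for sentence in spout.next_tuple():
    for word in split.execute(sentence):
        count.execute(word)
print(count.counts)  # {'the': 2, 'cow': 1, 'jumped': 1, 'moon': 1, 'rose': 1}
```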

Some of the really beautiful abstractions he came up with, I feel, are:
  • Streams: every piece of data processed in a topology is an abstraction called a tuple, and a sequence of these tuples is called a stream.
  • Message-processing algorithm: Nathan developed a very efficient algorithm that guarantees every message will be processed. No matter how large the tree of tuples spawned downstream from a message grows, only about 20 bytes are needed to track the state of each message tuple. This approach is obviously far more efficient, fast, and fail-safe than maintaining complex intermediate state in queues.
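The constant-space tracking mentioned above rests on an XOR trick, which can be sketched in a few lines of Python (function names here are illustrative, not Storm's API). Each tuple gets a random 64-bit id, and the tracker XORs every id in twice: once when the tuple is emitted and once when it is acknowledged. Since a value XORed with itself cancels out, the running value returns to 0 exactly when every tuple has been both emitted and acked, regardless of how large the tree grows:

```python
import random

ack_val = 0  # the entire tracking state: one 64-bit value

def emitted(tuple_id):
    """XOR in the id of a newly emitted tuple."""
    global ack_val
    ack_val ^= tuple_id

def acked(tuple_id):
    """XOR in the id of a fully processed tuple."""
    global ack_val
    ack_val ^= tuple_id

# Simulate a message that fans out into 1000 downstream tuples.
ids = [random.getrandbits(64) for _ in range(1000)]
for tid in ids:
    emitted(tid)
# ack_val is now nonzero (with overwhelming probability): work in flight.
for tid in ids:
    acked(tid)
print(ack_val)  # 0 -> the whole tuple tree is fully processed
```

The state stays a fixed size however many tuples the tree contains, which is why tracking each spout message needs only on the order of 20 bytes.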

I highly recommend that every developer read this excellent post by Nathan Marz, in which he beautifully explains his experience, how he came up with the idea of Storm, the issues he faced, and how he took things forward.

This was a brief background of Apache Storm. In the next post, I will write about the internal details and architecture.
