Sunday, 10 January 2016

Apache Drill : Introduction and Use Cases

Apache Drill is a highly useful query engine , which is very easy and quick to set up and give immediate insights from stored data without much developer efforts. It entered the Apache Incubator in August 2012 and is one of the top level Apache projects currently.
Official Definition from : “Apache Drill is a low latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data.

Why Apache Drill came in existence :
10-15 years back, data store meant majorly RDBMS. Amount of Data was not that vast and Datasets were predefined and schemas were fixed. And we had SQL queries to interact with them.
With time, amount of data to be stored and analysed grew exponentially along with the need of rapid application development. This gave birth to NoSql data stores like mongoDb which were faster with operations. With Hadoop, we got a new world of distributed processing(MR) and storage in distributed HDFS. Then to extract information from these massive distributed files in HDFS/S3, we came up with something like Hive which looked like Sql engine which internally ran MR job to extract the data. Also to reduce the size of data, we came up with optimised storage formats like Parquet. And from ages we were also using flat file formats like text,csv,tsv,etc.
In this whole process, we got surrounded by different kind of data stores,different structured/semi-structured/nested data formats each with its own usefulness. We needed something which can serve as a layer between them and user. Something with which user interacts as a single interface and it quickly returns the data no matter what internal storage system/format is. Google has already implemented Dremel on concept called “BigQuery” for this but it is not open sourced
This is where Apache Drill comes in picture. In a light-hearted way we can call it an open source alternative for Google Dremel although it has additional features as well.
Its important to keep in mind that Apache Drill is not a data storage at all . It does not store any data itself. Instead, its a distributed query engine intelligently built to extract data from different data storages/formats and the query syntax is just a standard ANSI Sql irrespective of the data store you are going to query . User doesn't need to know much about the actual data stores, he just needs to do some configuration like setting up path,workspace,etc of data stores with Drill and thereafter, he can fire normal SQL queries to Drill to get data from any data store/format. Drill will internally fetch data from the actual source intelligently.  As such, Drill is world’s first sql engine which doesn't need schemas. It automatically understands data structure on the fly.  Unlike Hive, it does not use  MR job internally and hence unlike most distributed query engines, it does not depend on Hadoop.  Rather it uses its own distributed processing service called DrillBit .  (Architecture Details in future post)

As in the above Diagram, 
A user might want to connect through ways like jdbc,cli,Rest,etc and would like to fetch data from a variety of data sources like Cassandra,mongoDb or even files like csv,json,parquet from local disk,hdfs,S3,etc.
The user will just define the storage plugin with Drill ,do some configurations and is good to start with firing query to any data store through drill. In fact, even in a single query he can combine results from different tables from different data stores. Drill does all the computation in memory.

My Experience with Drill :
We are using Drill as a service for one of the use cases. We get a lot of data(say some million lines) every hour in pre-defined csv file format . And there is requirement to fetch data from those files on adhoc basis at some point of time in future quite frequently for any date/time range. Files are being kept in folder structure say: year/month/day/hour/logType. Earlier we were using Apache Hive for extracting out the data but it was taking hours if time range was in months. So we tried our hands on with drill which reduced the turnaround time in few minutes. 
We used Drill in 2 stages : 
1st stage: As soon as we got the csv files for a hour we converted into Parquet format using drill using CTAS(create Table As) query . This saved data space as well as we only kept the fields we knew we can need in future. Another reason for parquet was speeded extraction(explained below).
2nd Stage: When a request comes for some column values for some time range, we fire Drill query to fetch the data from Parquet. Parquet is columnar storage and drill is intelligently built to understand the underlying format of any data store . It accordingly builds and optimises its query plan to retrieve data fast and efficiently. For example, if we pass say 5 column names to fetch from parquet files having 50 columns, Drill will understand its columnar structure and hence will access and read values from disk for only those 5 columns (unlike Rdbms query which scans and reads from disk row by row even for getting a single column data) thus saving a lot of IO seek,read and memory consumption.
Using CTAS in the query we can store the fetched data output at any location in any format.

Sample Drill Query example :
A sample select query in Drill looks as simple as :
SELECT COL1_ID, COL2_ID, COL3_ID, COL4_ID, COL5_ID, SERVER_DAY, SERVER_HOUR dfs.parquetlogs.`parquetlogs/20151024/20151024*/*` WHERE COL1_ID=100;
And a simple CTAS query is like : (for getting and saving the extracted data as per wish)
alter session set `store.format`='csv';

CREATE TABLE dfs.queryoutput.`job-01` AS (  SELECT COL1_ID, COL2_ID, COL3_ID, COL4_ID, COL5_ID, SERVER_DAY, SERVER_HOUR dfs.parquetlogs.`parquetlogs/20151024/20151024*/*` WHERE COL1_ID=100;  ); 

Benefits and Use Cases of Drill:
Although have mentioned some already above but to put precisely in bullet points :
  • First of all, Drill reduces the dependency of BI guys on Developers. Setting up Drill is quick and easy and only thing to be technically aware of is ANSI SQL Syntax. No more waiting for developers to code and build application like Hive before being able to extract data.
  • Schema Free : only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. Drill automatically understands the structure of the data.
  • Universal Query Engine: can query data from data stores like mongoDb,cassandra,rdbms to distributed file systems hdfs,S3 to old age flat file formats csv,tsv to nested data format json to advanced columnar format like parquet. Also it has support for Hive tables ,capable of extracting data without triggering a hive job .
  • Drill is a scale-out and columnar execution engine. Drill can scale from a single node to thousands of nodes.
  • Drill has the ability to support for nested data like json and can access nested data like element of an array in a json on fly without having to define schema definitions upfront ( big advantage over its competitor Impala).
  • Support for UDFs (User Defined Functions) .
  • High Performance : Drill does all the computation in memory and is highly performant because of following features: (details in next Architectural post)
    • Distributed query optimization and execution
    • Columnar execution
    • Runtime compilation and code generation
    • Vectorization
  • Allows join of tables from different sources like hbase,json,etc in a single query

Conclusion :

Overall, Apache Drill is a software solution that users can implement to leverage their traditional relational data assets alongside newer nosql sources in a quick and convenient way while continuing to use familiar tools and language.
In future posts , will share details on set up,examples and internal  architecture of Drill.