Thursday, 14 January 2016

Apache Drill : Quick Setup and Examples

This post is continuation of my last post Apache Drill: Introduction and Use Cases
In this post, i will share setup steps and some examples of using Apache Drill which will make understanding clear.
We can run Drill in 2 modes:
  1. Embedded/Single-Server mode
  2. Distributed/Multi-Server mode 

Distributed mode is needed when we want to install Drill on Hadoop Cluster. Useful to know, if Drill nodes are installed on the same nodes as Hadoop datanodes,it uses data locality to its advantage and optimises the query plan keeping data locality in mind, saving network bandwidth thus yielding very efficient and fast query results. Zookeeper is needed to be installed on Drill nodes for inter-node coordination while using in cluster mode.
Embedded mode is sufficient when you want to install Drill on a single server or local machine . It doesn't need any configuration or zookeeper to be installed. When we run Drill in embedded mode, drillbit service starts automatically and a Drill Shell is launched (similar to sql shell to write and fire queries).
In this post i will go with embedded mode as it hardly takes 5mins to setup and start using. 
For distributed mode, would re-direct to this link : https://drill.apache.org/docs/installing-drill-in-distributed-mode/

Installation and Setup steps :
Although i followed these for Mac, it will be similar for linux,ubuntu as well.
Pre-requisite :  Oracle Java SE Development (JDK) Kit 7 
Download the latest Drill version  e.g. i used http://getdrill.org/drill/download/apache-drill-1.2.0.tar.gz
chandans-MacBook-Pro:Downloads chandan$ wget http://getdrill.org/drill/download/apache-drill-<latest-version>.tar.gz
chandans-MacBook-Pro:Downloads chandan$ cd ~/Downloads/
chandans-MacBook-Pro:Downloads chandan$ tar -xzvf apache-drill-1.2.0.tar.gz
chandans-MacBook-Pro:Downloads chandan$ sudo mv apache-drill-1.2.0 /usr/local/
chandans-MacBook-Pro:Downloads chandan$ cd /usr/local/
chandans-MacBook-Pro:local chandan$ mv apache-drill-1.2.0/ drill
chandans-MacBook-Pro:local chandan$ sudo vi ~/.bash_profile
                              #adding apache drill 
                              export DRILL_HOME=/usr/local/drill
                              export PATH=$PATH:$DRILL_HOME/bin
chandans-MacBook-Pro:local chandan$ source ~/.bash_profile
handans-MacBook-Pro:drill chandan$ drill-embedded    //it will start drillbit process in background and an interactive drill-shell as :
jdbc:drill:zk=local> 
//employee.json file is by default added at class path in drill jar as sample
jdbc:drill:zk=local>SELECT employee_id,first_name,last_name FROM cp.`employee.json` LIMIT 2;
 To exit drill-shell, use (drillbit service will also shutdown if using embedded mode)
jdbc:drill:zk=local> !quit  

Configuring Storage Plugin for DataStore :
check the web console at : http://localhost:8047/storage
By default it will show cp(classpath),dfs(file system, can configure it to point any file system like hdfs),hbase,hive,mongo .
Important to know, Drill uses /tmp/drill/sys.storage_plugins folder to store storage plugin configurations in embedded mode.
We can use the Drill Web Console to update or add a new storage plugin configuration.

 Click on Update button of dfs, it should show something like below by default : 















The above image is self explanatory. Lets understand it practically by configuring a new storage.
Our example will use the existing “type” of storage plugin:dfs with a new “workspace”: logs which will have data stored in “format”: csv with “extension” : csv

Demo of Drill Queries :
In the last introduction post, i had discussed my experience with drill. Now will show the demo through drill queries here: how i can use the original csv file to get queried from drill, then get important columns saved in efficient parquet format using CTAS(Create Table As) query . Then will query that parquet file later to get the data displayed.
i have created a sample data file : friendList.csv
shekhar,kumar,28,male,mumbai,english,trading
apurv,awasthi,29,male,bangalore,hindi,marketing
shailesh,singh,30,male,bangalore,marathi,engineer

storage plugin changes at :  http://localhost:8047/storage/dfs
// define new workspaces for csv and parquet
"logs": {
      "location": "/Users/chandan/Documents/pers/logs",
      "writable": true,
      "defaultInputFormat": null
    },
    "parquetlogs": {
      "location": "/Users/chandan/Documents/pers/parquetlogs",
      "writable": true,
      "defaultInputFormat": null
    }

//define formats for csv and parquet
"formats": {
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "parquet": {
      "type": "parquet"
    } 
























Add workspace info and format info and press the update button.
Please note we will have to create workspace folders logs and parquet logs on disk using mkdir command ourselves.
mkdir /Users/chandan/Documents/pers/logs
mkdir /Users/chandan/Documents/pers/parquetlogs
Now fire queries one by one as below in drill-shell :
//select * to show each line in csv as one row
0: jdbc:drill:zk=local> select * from dfs.logs.`friendList.csv` ; 







//Time to extract only useful information, skipping Gender as i know all my friends are male  :)
0: jdbc:drill:zk=local> select columns[0] as `firstname`,columns[1] as `lastname`,columns[2] as `age`,columns[4] as `city`,columns[5] as `language`,columns[6] as `profession`from dfs.logs.`friendList.csv` ; 






//Now using CTAS, have to extract and save only the useful columns in efficient columnar Parquet storage in parquet workspace. But before that, change the store format of the session to parquet.
0: jdbc:drill:zk=local> alter session set `store.format`=‘parquet'; 






0: jdbc:drill:zk=local> CREATE TABLE dfs.parquetlogs.`friendInfo` AS   select columns[0] as `firstname`,columns[1] as `lastname`,columns[2] as `age`,columns[4] as `city`,columns[5] as `language`,columns[6] as `profession`from dfs.logs.`friendList.csv` ; 





//check the disk content whether parquet file has been created
chandans-MacBook-chandan$ ls -ltr ~/Documents/pers/parquetlogs/
drwxr-xr-x  4 chandan  staff  136 Jan 15 10:24 friendInfo
//Now query the parquet log file
0: jdbc:drill:zk=local> select * from dfs.parquetlogs.`friendInfo`; 







We have now only those fields which we wanted to be stored in parquet columnar storage with our own custom understandable column names.
Similarly , we can use CTAS query on parquet files also to fetch data later and export into some other format and data storage .


Join in Drill between different Data Storage :
Lastly, I want to show one of the most cool features of Drill : join on different data storages to get data in same query from different data storages. My post will be incomplete if i don't share join example.  :)
We have here 2 different storages: csv and parquet. We will fetch data from both storages in the same query using join.
i have created a csv file friendChildrenList.csv in the same logs folder which contains friend’s name and number of kids he has :
shekhar,2
apurv,4
shailesh,3

Now, call the join query on parquet and csv file to get the friends info with number of kids :
0: jdbc:drill:zk=local> select tb1.firstname,tb1.lastname,tb1.city,tb2.columns[1]  as `kids` from dfs.parquetlogs.`friendInfo` as tb1, dfs.logs.`friendChildrenList.csv` as tb2 where tb1.firstname=tb2.columns[0] ; 







We can see how intelligently Drill fetches data from both parquet and csv files , joins them in memory and show the output. Although i have just taken example of plain CSV and columnar parquet but the same kind of logic is applicable to other storages like hdfs,S3,Hive,Mongo,etc.
This was all about the setup and examples of Apache Drill in this post.
Feel free to share your comment/feedback.

1 comment:

  1. Hi, how to keep the drillbit service(in embedded mode) running in the background all the time.. I just want to utilize the rest api service in my web application.

    ReplyDelete