The use case i had in mind was something like this:
- A user is storing files containing structured data in S3 – Amazon’s store for bulk data. A very realistic use case could be a web-admin archiving Apache logs into S3 – or even transaction logs from some db
- Now this user wants to run sql queries on these files – perhaps to do some historical analysis
- Amazon provides compute resources on demand via EC2 – ideally these sql queries should use some allocated number of machines in EC2 to perform the required computations
- The results of sql queries that are interesting for later use should be easily stored back in S3
Of course – most of this is already in place. Hadoop already works on EC2 and can use an arbitrary number of machines to crunch data stored in S3. Hive offers a metadata management and SQL interpretation layer on top of Hadoop that organizes flat files into a Data Warehouse. The perplexing part was getting to the state of ideal separation between metadata, compute and tmp table store – and finally – long term data store:
- I should be able to store and access metadata about tables and partitions from anywhere. My favorite place would be my personal laptop from where i should be able to create tables over files in S3. This should not require a compute cluster
- I should then be able to allocate a compute cluster as large as I see fit
- I should then be able to compose a sql query and make it run on the allocated compute cluster from my laptop
- I can store the tables back in hdfs on the compute cluster (for tmp tables that are useful in subsequent queries) or in S3 (for long term storage)
- I should be able to terminate or increase or decrease the size of my compute cluster as i see fit (depending on the requirements of my queries)
As i went through this project – i also ended up articulating a nice to have – i didn’t want to create any more Amazon AMIs. There were already enough Hadoop AMIs floating around – and I didn’t want to create AMIs for cross product of Hive and Hadoop versions. I am happy to report that i could finally make all of this work today. The final set of steps looks something like this:
# create a hive table over a S3 directory
hive> create external table kv (key int, val string) location 's3n://data.s3ndemo.hive/kv';
# start a hadoop cluster with 10 nodes and setup ssh proxy to it
linux> launch-hadoop-cluster hive-cluster 10
linux> ssh -D xxxx ec2-mumbo-jumbo.amazonaws.com
# next two steps direct hive to use the just allocated cluster
hive> set fs.default.name=hdfs://ec2-mumbo-jumbo.amazonzws.com:50001;
hive> set mapred.job.tracker=ec2-mumbo-jumbo.amazonaws.com:50002;
# that's it - fire a query
hive> select distinct key from kv;
And it finally works! Thankfully – no new AMIs are involved. Getting to this state was not easy. I ended up filing and fixing at least four JIRAS (hadoop-5839, hive-467, hive-453, hadoop-5749) and there’s still more work ahead (ideally the sql engine should figure out the cluster size i need, decreasing the cluster size is still hard, hadoop has trouble doing ‘ls’ operations on files in s3).
A few takeways from this exercise:
- Open Source Rocks 🙂 – Prasad Chakka helped in reviewing my patches, Tom White and Philip Zeyliger from Cloudera helped me set up remote job submission. I was able to look at the Hadoop and Hive code and help myself out in other places.
- Except getting in changes is slow 🙁 – changes required to EC2 scripts to make this work will probably not show up in a stable release of Hadoop for quite a while.
- AWS has a non trivial learning curve – in terms of understanding S3 buckets and objects, various EC2 and S3 tools, installing and configuring all the required tools (yum yum), getting comfortable with using AWS and SSH keys, port authorizations and so on. At the same time – there’s awesome documentation and great community postings that easily solved any and all questions i had.
- India is far far away from Amazon data centers 🙁 – every api call, every file copy – took forever.
Of course – now i have to do the hard work of getting all the patches in, putting up detailed instructions on a wiki and ideally doing some performance and cost measurements for this setup. But for now – this seems like a happy milestone and something worth blogging a bit about.