Archive for category Hive

HBase and Map-Reduce

HBase + Map-Reduce is a really awesome combination. In all the back-and-forth about NoSQL – one thing that’s often missed is how convenient it is to run scalable data analysis directly against large online data sets (which new distributed databases like HBase allow). This is a particularly fun and feasible prospect now that frameworks like Hive allow users to express such analysis in plain old SQL.

The BigTable paper mentioned this aspect only in passing. But looking back – the successor to the LSM Trees paper (which inspired BigTable’s on-disk structures) – called LHAM – had a great investigation into the possibilities in this area. One of the best ideas from that paper is that analytic queries can have a different I/O path than real-time read/write traffic – even when accessing the same logical data set. Extrapolating this to HBase:

  • Applications requiring up-to-date versions can go through the RegionServer (Tablet server in BigTable parlance) API
  • Applications that do not care about the very latest updates can directly access compacted files from HDFS

Applications in the second category are surprisingly common. Many analytical and reporting applications are perfectly fine with data that’s about a day old. Issuing large I/O against HDFS will generally be more performant than streaming data out of region servers. It also allows one to size the HBase clusters based on the (more steady) real-time traffic – not based on large bursty batch jobs (put another way – it partially isolates the online traffic from resource contention by batch jobs).
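To make the trade-off between the two paths concrete, here is a toy sketch in Python. It is not HBase code: the class `TinyStore` and its method names are made up for illustration, a dict stands in for the in-memory write buffer, and another dict stands in for the compacted files on HDFS. The only point is that the batch path never sees writes that have not been compacted yet.

```python
class TinyStore:
    """Toy model of a store with an online path and a batch path (hypothetical)."""

    def __init__(self):
        self.memstore = {}   # recent writes, visible only through the serving API
        self.compacted = {}  # stands in for immutable compacted files on HDFS

    def put(self, key, value):
        """Online write: lands in memory first."""
        self.memstore[key] = value

    def compact(self):
        """Fold recent writes into the compacted data and clear the buffer."""
        self.compacted = {**self.compacted, **self.memstore}
        self.memstore = {}

    def online_get(self, key):
        """Serving path: always returns the latest version of a key."""
        if key in self.memstore:
            return self.memstore[key]
        return self.compacted.get(key)

    def batch_scan(self):
        """Batch path: reads only compacted data, so it may be slightly stale."""
        return sorted(self.compacted.items())


store = TinyStore()
store.put("a", 1)
store.compact()
store.put("a", 2)  # not yet compacted

latest = store.online_get("a")  # goes through the serving path
stale = store.batch_scan()      # bypasses the serving path entirely
```

Here `latest` reflects the second write while `stale` still holds the first one, which is exactly the staleness a day-old report can live with.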

Many more possibilities arise. For batch workloads, compaction processes in HBase can:

  • Duplicate data to a separate HDFS cluster – this would provide complete I/O path isolation between online and offline workloads
  • Physically organize historical versions used by analytic workloads in (columnar) formats that are even more friendly for analytics (and more compressible as well).

This also eliminates cumbersome Extraction processes (from the (in)famous trio of ETL) that usually convey data from online databases to back-end data warehouses.

A slightly more ambitious possibility is to exploit HBase data structures for the purposes of data analytics – without requiring the use of active HBase components. Hive, for example, was designed for immutable data sets. Row level updates are difficult without overwriting entire data partitions. While this is a perfect model for processing web logs, it doesn’t fit all real-world scenarios. In some applications – there are very large dimension data sets (for example – those capturing the status of micro-transactions) that are mutable – but that can have a scale approaching that of web logs. They are also naturally time partitioned – and mutations in old partitions happen rarely. Organizing Hive partitions storing such data sets as an LSM tree is an intriguing possibility. It would allow the rare row level updates to be a cheap append operation to the partition, while largely preserving the sequential data layout that provides high read bandwidth for analytic queries. As a bonus, writers and readers wouldn’t contend with each other.
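A rough sketch of what an LSM-organized partition could look like, in Python. This is purely illustrative and not how Hive or HBase store data: `LsmPartition` and its methods are hypothetical names, sorted in-memory lists stand in for sorted files, and a run sequence number decides which version of a row wins. Updates are appended as a small new run; scans merge the runs in key order and keep the newest version of each row.

```python
import heapq
from itertools import groupby
from operator import itemgetter


class LsmPartition:
    """Hypothetical partition stored as sorted runs; newer runs override older ones."""

    def __init__(self, base_rows):
        self._seq = 0
        self.runs = []
        self.append_run(base_rows)

    def append_run(self, rows):
        """Record a batch of row-level updates as a new sorted run (a cheap append)."""
        self._seq += 1
        # Negated sequence number: for equal keys, the newest run sorts first.
        run = sorted((key, -self._seq, value) for key, value in dict(rows).items())
        self.runs.append(run)

    def scan(self):
        """Merge all runs in key order, yielding only the newest value per key."""
        merged = heapq.merge(*self.runs)
        for key, versions in groupby(merged, key=itemgetter(0)):
            _, _, value = next(versions)  # first entry comes from the newest run
            yield key, value

    def compact(self):
        """Rewrite all runs into one, restoring a purely sequential layout."""
        rows = list(self.scan())
        self._seq = 0
        self.runs = []
        self.append_run(rows)


partition = LsmPartition([(1, "pending"), (2, "pending"), (3, "pending")])
partition.append_run([(2, "settled")])  # rare update: no rewrite of the base run
rows = list(partition.scan())
```

The base run is never rewritten by an update, so the bulk of the partition stays sequential on disk; an occasional `compact()` folds the small update runs back in.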

Many of the (harder) pieces of this puzzle are already in place. As an example – Hadoop already has multiple columnar file formats – including Hive’s RCFile. HBase’s HFile format was recently re-written and organized in a very modular fashion (it should be easy to plug into Hive directly, for example). A lot of interesting possibilities, it seems, are just around the corner.

Update on Hive+Hadoop+S3+EC2

A formal recipe on running SQL queries using EC2 against S3 files is now posted at:

But not before hitting a few more bugs (HADOOP-5861). Running a TPCH query using Hive was a pretty high point. (I did have to omit the order by clauses though :-()

I am amazed at how far Hive has come (and yet how glaring some of the missing features are). I am also impressed by the promise of the cloud (this being my first project using S3/EC2) and at how different the experience was compared to programming/administering a large in-house cluster. Amazon’s infrastructure seems to scream for developers to jump in and add value. Hopefully I will get a chance to explore some ideas on this in future posts.


Hive + Hadoop + S3 + EC2 = It works!

I have been enjoying my vacation time in India for the last few weeks and one of the fun projects I had taken up was getting a good story around running Hive on Amazon Infrastructure (AWS).

The use case I had in mind was something like this:

  1. A user is storing files containing structured data in S3 – Amazon’s store for bulk data. A very realistic use case could be a web-admin archiving Apache logs into S3 – or even transaction logs from some database
  2. Now this user wants to run SQL queries on these files – perhaps to do some historical analysis
  3. Amazon provides compute resources on demand via EC2 – ideally these SQL queries should use some allocated number of machines in EC2 to perform the required computations
  4. The results of SQL queries that are interesting for later use should be easily stored back in S3

Of course – most of this is already in place. Hadoop already works on EC2 and can use an arbitrary number of machines to crunch data stored in S3. Hive offers a metadata management and SQL interpretation layer on top of Hadoop that organizes flat files into a Data Warehouse. The perplexing part was getting to the state of ideal separation between metadata, compute and tmp table store – and finally – long term data store:

  • I should be able to store and access metadata about tables and partitions from anywhere. My favorite place would be my personal laptop, from where I should be able to create tables over files in S3. This should not require a compute cluster
  • I should then be able to allocate a compute cluster as large as I see fit
  • I should then be able to compose a SQL query and make it run on the allocated compute cluster from my laptop
  • I can store the tables back in HDFS on the compute cluster (for tmp tables that are useful in subsequent queries) or in S3 (for long term storage)
  • I should be able to terminate or increase or decrease the size of my compute cluster as I see fit (depending on the requirements of my queries)

As I went through this project – I also ended up articulating a nice-to-have – I didn’t want to create any more Amazon AMIs. There were already enough Hadoop AMIs floating around – and I didn’t want to create AMIs for the cross product of Hive and Hadoop versions. I am happy to report that I could finally make all of this work today. The final set of steps looks something like this:

# create a hive table over a S3 directory
hive> create external table kv (key int, val string) location 's3n://data.s3ndemo.hive/kv';
# start a hadoop cluster with 10 nodes and setup ssh proxy to it
linux> launch-hadoop-cluster hive-cluster 10
linux> ssh -D xxxx
# next two steps direct hive to use the just allocated cluster
hive> set;
hive> set;
# that's it - fire a query
hive> select distinct key from kv;

And it finally works! Thankfully – no new AMIs are involved. Getting to this state was not easy. I ended up filing and fixing at least four JIRAs (HADOOP-5839, HIVE-467, HIVE-453, HADOOP-5749) and there’s still more work ahead (ideally the SQL engine should figure out the cluster size I need, decreasing the cluster size is still hard, and Hadoop has trouble doing ‘ls’ operations on files in S3).

A few takeaways from this exercise:

  • Open Source Rocks 🙂 – Prasad Chakka helped in reviewing my patches, Tom White and Philip Zeyliger from Cloudera helped me set up remote job submission. I was able to look at the Hadoop and Hive code and help myself out in other places.
  • Except getting in changes is slow 🙁 – changes required to EC2 scripts to make this work will probably not show up in a stable release of Hadoop for quite a while.
  • AWS has a non-trivial learning curve – in terms of understanding S3 buckets and objects, various EC2 and S3 tools, installing and configuring all the required tools (yum yum), getting comfortable with using AWS and SSH keys, port authorizations and so on. At the same time – there’s awesome documentation and great community postings that easily solved any and all questions I had.
  • India is far far away from Amazon data centers 🙁 – every API call, every file copy – took forever.

Of course – now I have to do the hard work of getting all the patches in, putting up detailed instructions on a wiki and ideally doing some performance and cost measurements for this setup. But for now – this seems like a happy milestone and something worth blogging a bit about.


Curt Monash reports on Hadoop/Hive @ Facebook

Curt Monash posted a blog post on a conversation that Ashish Thusoo and I had with him regarding Hadoop and Hive and their deployment and usage at Facebook. It is heartening to see the mainstream database and analytics community starting to cover Hadoop and Hive. Even though these projects are still rapidly improving and developing strong communities around them – they are already of production quality and can bring substantial benefits to environments with large data sets – both in terms of reducing cost and in the ability to run more flexible computations against them.

Of course – this just makes all the questions being raised about the efficacy and architecture of map-reduce/Hadoop all the more important to address (hopefully in subsequent posts).