HBase + Map-Reduce is a really awesome combination. In all the back and forths about NoSQL – one of the things that’s often missed out is how convenient it is to be able to do scalable data analysis directly against large online data sets (that new distributed databases like HBase allow). This is a particularly fun and feasible prospect now that frameworks like Hive allow users to express such analysis in plain old SQL.

The BigTable paper mentioned this aspect only in passing. But looking back – the successor to the LSM Trees paper (from which BigTable’s on-disk structures are inspired) – called LHAM – had a great investigation into the possibilities in this area. One of the best ideas from here is that analytic queries can have a different I/O path than real-time read/write traffic – even when accessing the same logical data set. Extrapolating this to HBase:

  • Applications requiring up-to-date versions can go through the RegionServer (Tablets in BigTable parlance) API
  • However Applications that do not care about the very latest updates can directly access compacted files from HDFS

Applications in the second category are surprisingly common. Many analytical and reporting applications are perfectly fine with data that’s about a day old or so. Issuing large IO against HDFS will always be more performant than streaming data out of region servers. And it allows one to size the HBase clusters based on the (more steady) real-time traffic – not based on large bursty batch jobs (alternately put – partially isolates the online traffic from resource contention by batch jobs).

Many more possibilities arise. Data can be duplicated (for batch workloads) by compaction processes in HBase:

  • To a separate HDFS cluster – this would provide complete I/O path isolation between online and offline workloads
  • Historical versions for use by analytic workloads can be physically organized in (columnar) formats that are even more friendly for analytics (and more compressible as well).

This also eliminates cumbersome Extraction processes (from the (in)famous trio of ETL) that usually convey data from online databases to back-end data warehouses.

A slightly more ambitious possibility is to exploit HBase data structures for the purposes of data analytics – without requiring the use of active HBase components. Hive, for example, was designed for immutable data sets. Row level updates are difficult without overwriting entire data partitions. While this is a perfect model for processing web logs, this doesn’t fit all real-world scenarios. In some applications – there are very large dimension data sets (for example – those capturing the status of micro-transactions) that are mutable – but that can have a scale approaching those of web logs. They are also naturally time partitioned – where mutations in old partitions happen rarely/infrequently. Organizing Hive partitions storing such data sets as a LSM tree is an intriguing possibility. It would allow the rare row level updates to be a cheap append operation to the partition, while largely preserving the sequential data layout that provides high read bandwidth for analytic queries. As a bonus, writers and readers would’t contend with each other.

Many of (harder) pieces of this puzzle are already in place. As an example – Hadoop already has multiple columnar file formats – including Hive’s RCFile. HBase’s HFile format was recently re-written and organized in a very modular fashion (should be easy to plug into Hive directly for example). A lot of interesting possibilities, it seems, are just round the corner.