Archive for category Hadoop

HBase and Map-Reduce

HBase + Map-Reduce is a really awesome combination. In all the back and forths about NoSQL – one of the things that’s often missed out is how convenient it is to be able to do scalable data analysis directly against large online data sets (that new distributed databases like HBase allow). This is a particularly fun and feasible prospect now that frameworks like Hive allow users to express such analysis in plain old SQL.

The BigTable paper mentioned this aspect only in passing. But looking back – the successor to the LSM Trees paper (from which BigTable’s on-disk structures are inspired) – called LHAM – had a great investigation into the possibilities in this area. One of the best ideas from here is that analytic queries can have a different I/O path than real-time read/write traffic – even when accessing the same logical data set. Extrapolating this to HBase:

  • Applications requiring up-to-date versions can go through the RegionServer (Tablets in BigTable parlance) API
  • However Applications that do not care about the very latest updates can directly access compacted files from HDFS

Applications in the second category are surprisingly common. Many analytical and reporting applications are perfectly fine with data that’s about a day old or so. Issuing large IO against HDFS will always be more performant than streaming data out of region servers. And it allows one to size the HBase clusters based on the (more steady) real-time traffic – not based on large bursty batch jobs (alternately put – partially isolates the online traffic from resource contention by batch jobs).

Many more possibilities arise. Data can be duplicated (for batch workloads) by compaction processes in HBase:

  • To a separate HDFS cluster – this would provide complete I/O path isolation between online and offline workloads
  • Historical versions for use by analytic workloads can be physically organized in (columnar) formats that are even more friendly for analytics (and more compressible as well).

This also eliminates cumbersome Extraction processes (from the (in)famous trio of ETL) that usually convey data from online databases to back-end data warehouses.

A slightly more ambitious possibility is to exploit HBase data structures for the purposes of data analytics – without requiring the use of active HBase components. Hive, for example, was designed for immutable data sets. Row level updates are difficult without overwriting entire data partitions. While this is a perfect model for processing web logs, this doesn’t fit all real-world scenarios. In some applications – there are very large dimension data sets (for example – those capturing the status of micro-transactions) that are mutable – but that can have a scale approaching those of web logs. They are also naturally time partitioned – where mutations in old partitions happen rarely/infrequently. Organizing Hive partitions storing such data sets as a LSM tree is an intriguing possibility. It would allow the rare row level updates to be a cheap append operation to the partition, while largely preserving the sequential data layout that provides high read bandwidth for analytic queries. As a bonus, writers and readers would’t contend with each other.

Many of (harder) pieces of this puzzle are already in place. As an example – Hadoop already has multiple columnar file formats – including Hive’s RCFile. HBase’s HFile format was recently re-written and organized in a very modular fashion (should be easy to plug into Hive directly for example). A lot of interesting possibilities, it seems, are just round the corner.

Dynamo: A flawed architecture – Part I

(The opinions in this article are entirely mine – although i owe my education in part to other people)

(Update: Please do read the followup. Conversations around the web on this topic are well tracked here)


Recently i have had to look at Dynamo in close detail as part of a project. Unlike my previous casual perusals of the paper/architecture – this time i (along with some other colleagues) spent a good amount of time looking at the details of the architecture. Given the amount of buzz around this paper, the number of clones it has spawned, and the numerous users using those clones now – our takeaways were shockingly negative.

Before posting this note – I debated whether Dynamo was simply inappropriate for our application (and whether calling it ‘flawed’ would therefore be a mis-statement). But it is clear in my mind, that the paper is flawed – it tries to lead readers to believe things that are largely untrue and it makes architectural choices that are questionable. It’s design contradicts some of it’s own stated philosophies. Finally, i believe that the problems it proposes to solve – are solvable by other means that do not suffer from Dynamo’s design flaws. I will try to cover these points in a series of posts.

Eventual Consistency

First – i will start with the notion of ‘eventual’ consistency which Dynamo adhers to. What does this mean to the application programmer? Here are some practical implications:

  1. committed writes don’t show up in subsequent reads
  2. committed writes may show up in some subsequent reads and go missing thereafter
  3. that there is no SLA for when the writes are globally committed (and hence show up in all subsequent reads)

At a high level – not having SLA is not a practical proposition. ‘Eventually’ we are all dead. A system that returns what I wrote only after I am dead is no good for me (or anyone else). At a more ground level – returning stale data (the logical outcome of eventual consistency) leads to severe data loss in many practical applications. Let’s say that one is storing key-value pairs in Dynamo – where the value encodes a ‘list’. If Dynamo returns a stale read for a key and claims the key is missing, the application will create a new empty list and store it back in Dynamo. This will cause the existing key to be wiped out. Depending on how ‘stale’ the read was – the data loss (due to truncation of the list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded data loss – not even in the case of a Disaster.

(Update: Several people have called out that this data loss scenario is not possible in Dynamo due to Vector Clocks. That sounds correct. Some followups:

  1. The scenario is possible in Cassandra that does not use vector clocks but only client timestamps
  2. Dynamo depends on conflict resolution to solve this problem. Such conflicts are very difficult to resolve – in particular if deletion from the list is a valid operation – then how would one reconcile after mistaken truncation?
  3. In another simple scenario – a stale read may end up affecting writes to other keys – and this would be even harder to fix

The general points here are that returning stale reads is best avoided and where cannot be avoided – at least having some bounds on the staleness allows one to write applications with reasonable behavior. Dynamo puts no bounds on how stale a read can be and returns stale reads in single data-center environments where it can be entirely avoided)

Quorum Consensus

Dynamo starts by saying it’s eventually consistent – but then in Section 4.5. it claims a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting the number of reads (R) and number of writes (W) to be more than the total number of replicas (N) (ie. R+W>N) – one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many many updates and then rejoin the cluster – but are admitted back to the quorum group without any resynchronization barrier. As a result, reading from R copies is not sufficient to give up-to-date data. This is partly the reason why the system is only ‘eventually’ consistent. Of course – in a large cluster – nodes will leave and re-join the cluster all the time – so stale reads will be a matter of course.

This leads to the obvious question – why can one simply not put into place a resynchronization barrier when nodes re-join the cluster? The answer to this is troublesome: no one knows whether a node is in sync or not when it rejoins the cluster. No one can tell how much data it doesn’t have. There is no central commit log in the system. The only way to figure out the answer to these questions is to do a full diff of the data (Dynamo uses something called Merkel Merkle trees to do this) against all other members of the quorum group. This will of course be remarkably expensive and practically infeasible to do synchronously. Hence it is (at best) performed in the background.

The other way to provide strong consistency is to read from all the replicas all the time. There are two problems with this:

  1. This clearly fails under the case where the only surviving member of the quorum group is out of date (because it had failed previously and hasn’t been bought up to date)
  2. This imposes a super-high penalty for reads that is otherwise not implied by a lower setting for the parameter R. In fact it renders the parameter R moot – wondering why Dynamo would talk about quorum consensus in the first place?

I am not the first one to observe these problems. See for example Cassandra-225 which was an attempt to solve this problem with a centralized commit log in Cassandra (a Dynamo clone).

WAN considerations

It’s worth pointing out that when Dynamo clusters span a WAN – the odds of nodes re-joining the cluster and remaining out of date are significantly increased. If a node goes down, ‘hinted handoff’ sends updates to the next node in the ring. Since nodes of the two data centers alternate – the updates are sent to the remote data center. When the node re-joins the cluster, if the network is partitioned (which happen all the time), the node will not catch up on pending updates for a long time (until the network partitioning is healed).

Disaster Recovery

The effect of eventual consistency and replication design is felt most acutely when one considers the case of disaster recovery. If one data center fails, there is absolutely nothing one can say about the state of the surviving data center. One cannot quantify exactly how much data has been lost. With standard log-shipping based replication and disaster recovery, one can at least keep track of replication lag and have some idea of how far behind the surviving cluster is.

Lack of point in time consistency at the surviving replica (that is evident in this scenario) is very problematic for most applications. In cases where one transaction (B) populates entites that refer to entities populated in previous transactions (A), the effect of B being applied to the remote replica without A being applied leads to inconsistencies that applications are typically ill equipped to handle (and doing so would make most applications complicated).


At the beginning of the paper, the paper stresses the principle of Symmetry. To quote:

Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers

By the time we get to Section 4.8.2, this principle is gone, to quote:

To prevent logical partitions, some Dynamo nodes play the role of seeds.

Again, in section 2, one reads:

Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window

and by the time one gets to Section 2.3, one reads:

Dynamo is designed to be an eventually consistent data store

where of course – the term ‘eventual’ is not quantified with any ‘window’!

In addition i found this quote somewhat misleading (Section 2.3):

Data replication algorithms used in commercial systems traditionally perform synchronous replica coordination in order to provide a strongly consistent data access interface.

As an engineer who has worked extensively on production commercial replication and disaster recovery systems – I can vouch this claim is incorrect. Most database and storage systems are deployed in asynchronous (but serial) replication mode. Replicas are always point-in-time consistent. Commercial recovery technologies (backups, snapshots, replication) all typically rely on point-in-time consistency. The combination of high availability and point in time consistency at remote data centers is relatively easy to achieve.

It is possible that the paper is trying to refer to distributed databases that are distributed globally and update-able concurrently. However, these are extremely tiny minority of commercial database deployments (if they exist at all) and it’s worth noting that this statement is largely untrue in practice.

Consistency versus Availability

Several outside observers have noted that Dynamo is chooses the AP of the CAP theorem – while other systems (notably BigTable) choose CA. Unfortunately, Dynamo does not distinguish between ‘Availability’ and ‘Partition Tolerance’ in the SOSP paper.

The reader is left with the impression that there is always a tradeoff between Consistency and Availability of all kinds. This is, of course, untrue. One can achieve strong Consistency and High-Availability within a single data center – and this is on par for most commercial databases – as well as for systems like HBase/HDFS.

The unfortunate outcome of this is that people who are looking for scalable storage systems even within a single data center may conclude (incorrectly) that Dynamo is a better architecture (than BigTable+GFS).


Dynamo rails against centralized architectures, deeming them inherently of low availability – to quote (from Section 2.3):

In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.

Again – as an engineer who worked on in data-center high availability for years – i find this general line of thought questionable. Centralized control typically causes scalability bottlenecks. But by no means are they necessarily of low availability. The entire storage industry churns out highly available storage servers – typically by arranging for no single points of failure (SPOF). This means dual everything (dual motherboards, dual nics, dual switches, RAID, multipathing etc. etc.) and synchronously mirrored write-back caches (across motherboards). A storage server is most often a central entity that is highly available – and as such i am willing to bet that the bank accounts of all the Dynamo authors are stored and retrieved via some such server sitting in a vault somewhere. Such servers typically have 5 9’s availability (way more than anything Amazon offers for any of it’s services). The principles employed in building such highly available servers are well understood and easily applied to other centralized entities (for example the HDFS namenode).

Of course – having highly available centralized services needs discipline. One needs to have redundancy at every layer in the service (including, for example, power supplies for a rack and network uplinks). Many IT operations do not apply such discipline – but the resultant lack of availability is hardly the fault of the centralized architecture itself.

The irony in this claim is that a Dynamo cluster is likely itself to be centralized in some shape or form. One would likely have some racks of nodes in a Dynamo cluster all hanging off one set of core-switches. As such it’s clients (application/web-servers) would be in different network segment connected to different core-switches. Network partitioning between these two sets of core-switches would make Dynamo unavailable to the clients. (While this example shows how futile Dynamo’s goal of avoiding centralization is – it also shows how data centers need to be architected (with dual uplinks and switches at every network hop) to prevent network partitioning from happening and isolating centralized services).

Finally (and with extreme irony) we have already seen that the lack of centralization (of commit logs for instance) is the reason behind many of the consistency issues affecting Dynamo.

What next?

There are many other issues regarding fault isolation from data corruptions that are also worth discussing. And as promised, I will try to cover simpler schemes to cover some of the design goals of Dynamo as well. If all is well – in a subsequent post.

Tags: , , , , ,

Compression and Layering in Hadoop

One of the relatively late lessons I have received in operating a Hadoop cluster has been the (almost overwhelming) importance of compression in storage, computation and network transmission.

One of the architectural questions is whether compression belongs to the file-system (and similarly the networking sub-system) or whether it is something that the application layer (map-reduce and higher layers) are responsible for. While native compression is supported by many file systems (for example ZFS) – the arguments in favor of the former are less well made (or at least less commonly made). On the other hand, columnar, dictionary and other forms of compression at the application level (that exploit data schema and distribution) are common parlance and there’s a whole industry of sorts around these.

After some thought though – i have become more and more convinced that functionality such as compression should be moved as much down the stack as possible. The arguments for this (at least in the context of HDFS and Hadoop) are fairly obvious:

  1. Applications need to apply compression synchronously – often increasing latency of workloads and load during peak hours. File Systems can perform compression asynchronously
  2. FileSystems can manage data through it’s life cycle applying different forms of compression. One could, for example, keep new data uncompressed, recent data compressed via LZO and then finally migrate to Bzip.
  3. Where multiple replicas of data are maintained – some of the replicas can be maintained as compressed transparently – providing redundancy while saving space. Applications can run against uncompressed data whenever possible
  4. The former may be especially appropriate when data is maintained for disaster recovery. Data can be compressed before transmission to remote site and can be kept in compressed format there
  5. The filesystem may also be able to push compression to external hardware (co-processors)

Fairly similar arguments can be made to move wire compression to the networking sub-system. It can dynamically recognize whether a compute cluster is currently CPU or network bound and come up with appropriate compression level for data transmission. It is impossible for application level compression to achieve these things.

This leaves the tricky question of how to get the same levels of compression that an application may be able to achieve (given knowledge of data schema etc.). The challenge here points to a missing abstraction in the traditional split of file and database systems (and other applications). The traditional ‘byte-stream’ abstraction that filesystems (and networking systems) present means that they don’t have any knowledge of data inside the byte stream. However – if a data stream can be tagged with it’s schema – and optionally even a pointer to the right codec for this stream – then the file-system can easily perform optimal compression (while preserving the benefits above).

Traditionally – these kinds of proposals would also have encountered the standard opposition to running user-land (application) code in kernel space. But with user level file systems like HDFS gaining more and more tractions – and with user level processing becoming more common in operating systems as well (witness FUSE) – that argument is fairly moot. One of the opportunities with systems like Hadoop, i think, is that we can fix these historic anomalies. The open nature of this software and the fact that files are used as a record-stream (and not as a database image) – lends itself to the kind of schemes suggested above.

Tags: ,

Update on Hive+Hadoop+S3+EC2

A formal recipe on running SQL queries using EC2 against S3 files is now posted at:

But not before hitting a few more bugs ( HADOOP-5861 ). Running a TPCH query using Hive was a pretty high point. (I did have to omit the order by clauses though :-()

I am amazed at how far Hive has come (and yet how glaring some of the missing features are). I am also impressed by the promise of the cloud (this being my first project using S3/EC2) and at how different the experience was as compared to programming/administering a large in-house cluster. Amazon’s infrastructure seems to scream for developers to jump in and add value. Hopefully i will get a chance to explore some ideas on this in future posts.

Tags: ,

Hive + Hadoop + S3 + EC2 = It works!

I have been enjoying my vacation time in India for the last few weeks and one of the fun projects i had taken up was getting a good story around running Hive on Amazon Infrastructure (AWS) .

The use case i had in mind was something like this:

  1. A user is storing files containing structured data in S3 – Amazon’s store for bulk data. A very realistic use case could be a web-admin archiving Apache logs into S3 – or even transaction logs from some db
  2. Now this user wants to run sql queries on these files – perhaps to do some historical analysis
  3. Amazon provides compute resources on demand via EC2 – ideally these sql queries should use some allocated number of machines in EC2 to perform the required computations
  4. The results of sql queries that are interesting for later use should be easily stored back in S3

Of course – most of this is already in place. Hadoop already works on EC2 and can use an arbitrary number of machines to crunch data stored in S3. Hive offers a metadata management and SQL interpretation layer on top of Hadoop that organizes flat files into a Data Warehouse. The perplexing part was getting to the state of ideal separation between metadata, compute and tmp table store – and finally – long term data store:

  • I should be able to store and access metadata about tables and partitions from anywhere. My favorite place would be my personal laptop from where i should be able to create tables over files in S3. This should not require a compute cluster
  • I should then be able to allocate a compute cluster as large as I see fit
  • I should then be able to compose a sql query and make it run on the allocated compute cluster from my laptop
  • I can store the tables back in hdfs on the compute cluster (for tmp tables that are useful in subsequent queries) or in S3 (for long term storage)
  • I should be able to terminate or increase or decrease the size of my compute cluster as i see fit (depending on the requirements of my queries)

As i went through this project – i also ended up articulating a nice to have – i didn’t want to create any more Amazon AMIs. There were already enough Hadoop AMIs floating around – and I didn’t want to create AMIs for cross product of Hive and Hadoop versions. I am happy to report that i could finally make all of this work today. The final set of steps looks something like this:

# create a hive table over a S3 directory
hive> create external table kv (key int, val string) location 's3n://data.s3ndemo.hive/kv';
# start a hadoop cluster with 10 nodes and setup ssh proxy to it
linux> launch-hadoop-cluster hive-cluster 10
linux> ssh -D xxxx
# next two steps direct hive to use the just allocated cluster
hive> set;
hive> set;
# that's it - fire a query
hive> select distinct key from kv;

And it finally works! Thankfully – no new AMIs are involved. Getting to this state was not easy. I ended up filing and fixing at least four JIRAS (hadoop-5839, hive-467, hive-453, hadoop-5749) and there’s still more work ahead (ideally the sql engine should figure out the cluster size i need, decreasing the cluster size is still hard, hadoop has trouble doing ‘ls’ operations on files in s3).

A few takeways from this exercise:

  • Open Source Rocks 🙂 – Prasad Chakka helped in reviewing my patches, Tom White and Philip Zeyliger from Cloudera helped me set up remote job submission. I was able to look at the Hadoop and Hive code and help myself out in other places.
  • Except getting in changes is slow 🙁 – changes required to EC2 scripts to make this work will probably not show up in a stable release of Hadoop for quite a while.
  • AWS has a non trivial learning curve – in terms of understanding S3 buckets and objects, various EC2 and S3 tools, installing and configuring all the required tools (yum yum), getting comfortable with using AWS and SSH keys, port authorizations and so on. At the same time – there’s awesome documentation and great community postings that easily solved any and all questions i had.
  • India is far far away from Amazon data centers 🙁 – every api call, every file copy – took forever.

Of course – now i have to do the hard work of getting all the patches in, putting up detailed instructions on a wiki and ideally doing some performance and cost measurements for this setup. But for now – this seems like a happy milestone and something worth blogging a bit about.

Tags: , , ,

Curt Monash reports on Hadoop/Hive @ Facebook

Curt Monash posted a blog post on our (myself and Ashish Thusoo’s) conversation with him regarding Hadoop and Hive and their deployment and usage at Facebook.  It is heartening to see the mainstream database and analytics community starting to cover Hadoop and Hive.  Even though these projects are rapidly becoming better and developing strong communities around them – they are already of production quality and can bring substantial benefits to environments with large data sets – both in terms of reducing cost as well as in the ability to run more flexible computations against them.

Of course – this just makes all the questions being raised about the efficacy and architecture  of map-reduce/Hadoop all the more important to address (hopefully in subsequent posts).