(The opinions in this article are entirely mine – although i owe my education in part to other people)

(Update: Please do read the followup. Conversations around the web on this topic are well tracked here)

Background

Recently i have had to look at Dynamo in close detail as part of a project. Unlike my previous casual perusals of the paper/architecture – this time i (along with some other colleagues) spent a good amount of time looking at the details of the architecture. Given the amount of buzz around this paper, the number of clones it has spawned, and the numerous users using those clones now – our takeaways were shockingly negative.

Before posting this note – I debated whether Dynamo was simply inappropriate for our application (and whether calling it ‘flawed’ would therefore be a mis-statement). But it is clear in my mind, that the paper is flawed – it tries to lead readers to believe things that are largely untrue and it makes architectural choices that are questionable. It’s design contradicts some of it’s own stated philosophies. Finally, i believe that the problems it proposes to solve – are solvable by other means that do not suffer from Dynamo’s design flaws. I will try to cover these points in a series of posts.

Eventual Consistency

First – i will start with the notion of ‘eventual’ consistency which Dynamo adhers to. What does this mean to the application programmer? Here are some practical implications:

  1. committed writes don’t show up in subsequent reads
  2. committed writes may show up in some subsequent reads and go missing thereafter
  3. that there is no SLA for when the writes are globally committed (and hence show up in all subsequent reads)

At a high level – not having SLA is not a practical proposition. ‘Eventually’ we are all dead. A system that returns what I wrote only after I am dead is no good for me (or anyone else). At a more ground level – returning stale data (the logical outcome of eventual consistency) leads to severe data loss in many practical applications. Let’s say that one is storing key-value pairs in Dynamo – where the value encodes a ‘list’. If Dynamo returns a stale read for a key and claims the key is missing, the application will create a new empty list and store it back in Dynamo. This will cause the existing key to be wiped out. Depending on how ‘stale’ the read was – the data loss (due to truncation of the list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded data loss – not even in the case of a Disaster.

(Update: Several people have called out that this data loss scenario is not possible in Dynamo due to Vector Clocks. That sounds correct. Some followups:

  1. The scenario is possible in Cassandra that does not use vector clocks but only client timestamps
  2. Dynamo depends on conflict resolution to solve this problem. Such conflicts are very difficult to resolve – in particular if deletion from the list is a valid operation – then how would one reconcile after mistaken truncation?
  3. In another simple scenario – a stale read may end up affecting writes to other keys – and this would be even harder to fix

The general points here are that returning stale reads is best avoided and where cannot be avoided – at least having some bounds on the staleness allows one to write applications with reasonable behavior. Dynamo puts no bounds on how stale a read can be and returns stale reads in single data-center environments where it can be entirely avoided)

Quorum Consensus

Dynamo starts by saying it’s eventually consistent – but then in Section 4.5. it claims a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting the number of reads (R) and number of writes (W) to be more than the total number of replicas (N) (ie. R+W>N) – one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many many updates and then rejoin the cluster – but are admitted back to the quorum group without any resynchronization barrier. As a result, reading from R copies is not sufficient to give up-to-date data. This is partly the reason why the system is only ‘eventually’ consistent. Of course – in a large cluster – nodes will leave and re-join the cluster all the time – so stale reads will be a matter of course.

This leads to the obvious question – why can one simply not put into place a resynchronization barrier when nodes re-join the cluster? The answer to this is troublesome: no one knows whether a node is in sync or not when it rejoins the cluster. No one can tell how much data it doesn’t have. There is no central commit log in the system. The only way to figure out the answer to these questions is to do a full diff of the data (Dynamo uses something called Merkel Merkle trees to do this) against all other members of the quorum group. This will of course be remarkably expensive and practically infeasible to do synchronously. Hence it is (at best) performed in the background.

The other way to provide strong consistency is to read from all the replicas all the time. There are two problems with this:

  1. This clearly fails under the case where the only surviving member of the quorum group is out of date (because it had failed previously and hasn’t been bought up to date)
  2. This imposes a super-high penalty for reads that is otherwise not implied by a lower setting for the parameter R. In fact it renders the parameter R moot – wondering why Dynamo would talk about quorum consensus in the first place?

I am not the first one to observe these problems. See for example Cassandra-225 which was an attempt to solve this problem with a centralized commit log in Cassandra (a Dynamo clone).

WAN considerations

It’s worth pointing out that when Dynamo clusters span a WAN – the odds of nodes re-joining the cluster and remaining out of date are significantly increased. If a node goes down, ‘hinted handoff’ sends updates to the next node in the ring. Since nodes of the two data centers alternate – the updates are sent to the remote data center. When the node re-joins the cluster, if the network is partitioned (which happen all the time), the node will not catch up on pending updates for a long time (until the network partitioning is healed).

Disaster Recovery

The effect of eventual consistency and replication design is felt most acutely when one considers the case of disaster recovery. If one data center fails, there is absolutely nothing one can say about the state of the surviving data center. One cannot quantify exactly how much data has been lost. With standard log-shipping based replication and disaster recovery, one can at least keep track of replication lag and have some idea of how far behind the surviving cluster is.

Lack of point in time consistency at the surviving replica (that is evident in this scenario) is very problematic for most applications. In cases where one transaction (B) populates entites that refer to entities populated in previous transactions (A), the effect of B being applied to the remote replica without A being applied leads to inconsistencies that applications are typically ill equipped to handle (and doing so would make most applications complicated).

Contradictions

At the beginning of the paper, the paper stresses the principle of Symmetry. To quote:

Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers

By the time we get to Section 4.8.2, this principle is gone, to quote:

To prevent logical partitions, some Dynamo nodes play the role of seeds.

Again, in section 2, one reads:

Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window

and by the time one gets to Section 2.3, one reads:

Dynamo is designed to be an eventually consistent data store

where of course – the term ‘eventual’ is not quantified with any ‘window’!

In addition i found this quote somewhat misleading (Section 2.3):

Data replication algorithms used in commercial systems traditionally perform synchronous replica coordination in order to provide a strongly consistent data access interface.

As an engineer who has worked extensively on production commercial replication and disaster recovery systems – I can vouch this claim is incorrect. Most database and storage systems are deployed in asynchronous (but serial) replication mode. Replicas are always point-in-time consistent. Commercial recovery technologies (backups, snapshots, replication) all typically rely on point-in-time consistency. The combination of high availability and point in time consistency at remote data centers is relatively easy to achieve.

It is possible that the paper is trying to refer to distributed databases that are distributed globally and update-able concurrently. However, these are extremely tiny minority of commercial database deployments (if they exist at all) and it’s worth noting that this statement is largely untrue in practice.

Consistency versus Availability

Several outside observers have noted that Dynamo is chooses the AP of the CAP theorem – while other systems (notably BigTable) choose CA. Unfortunately, Dynamo does not distinguish between ‘Availability’ and ‘Partition Tolerance’ in the SOSP paper.

The reader is left with the impression that there is always a tradeoff between Consistency and Availability of all kinds. This is, of course, untrue. One can achieve strong Consistency and High-Availability within a single data center – and this is on par for most commercial databases – as well as for systems like HBase/HDFS.

The unfortunate outcome of this is that people who are looking for scalable storage systems even within a single data center may conclude (incorrectly) that Dynamo is a better architecture (than BigTable+GFS).

Centralization

Dynamo rails against centralized architectures, deeming them inherently of low availability – to quote (from Section 2.3):

In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.

Again – as an engineer who worked on in data-center high availability for years – i find this general line of thought questionable. Centralized control typically causes scalability bottlenecks. But by no means are they necessarily of low availability. The entire storage industry churns out highly available storage servers – typically by arranging for no single points of failure (SPOF). This means dual everything (dual motherboards, dual nics, dual switches, RAID, multipathing etc. etc.) and synchronously mirrored write-back caches (across motherboards). A storage server is most often a central entity that is highly available – and as such i am willing to bet that the bank accounts of all the Dynamo authors are stored and retrieved via some such server sitting in a vault somewhere. Such servers typically have 5 9′s availability (way more than anything Amazon offers for any of it’s services). The principles employed in building such highly available servers are well understood and easily applied to other centralized entities (for example the HDFS namenode).

Of course – having highly available centralized services needs discipline. One needs to have redundancy at every layer in the service (including, for example, power supplies for a rack and network uplinks). Many IT operations do not apply such discipline – but the resultant lack of availability is hardly the fault of the centralized architecture itself.

The irony in this claim is that a Dynamo cluster is likely itself to be centralized in some shape or form. One would likely have some racks of nodes in a Dynamo cluster all hanging off one set of core-switches. As such it’s clients (application/web-servers) would be in different network segment connected to different core-switches. Network partitioning between these two sets of core-switches would make Dynamo unavailable to the clients. (While this example shows how futile Dynamo’s goal of avoiding centralization is – it also shows how data centers need to be architected (with dual uplinks and switches at every network hop) to prevent network partitioning from happening and isolating centralized services).

Finally (and with extreme irony) we have already seen that the lack of centralization (of commit logs for instance) is the reason behind many of the consistency issues affecting Dynamo.

What next?

There are many other issues regarding fault isolation from data corruptions that are also worth discussing. And as promised, I will try to cover simpler schemes to cover some of the design goals of Dynamo as well. If all is well – in a subsequent post.