Dynamo: A flawed architecture – Part I

(The opinions in this article are entirely mine – although I owe my education in part to other people)

(Update: Please do read the followup. Conversations around the web on this topic are well tracked here)

Background

Recently I have had to look at Dynamo in close detail as part of a project. Unlike my previous casual perusals of the paper and architecture, this time I (along with some other colleagues) spent a good amount of time looking at the details of the architecture. Given the amount of buzz around this paper, the number of clones it has spawned, and the many users now running those clones, our takeaways were shockingly negative.

Before posting this note, I debated whether Dynamo was simply inappropriate for our application (and whether calling it ‘flawed’ would therefore be a misstatement). But it is clear in my mind that the paper is flawed: it tries to lead readers to believe things that are largely untrue, and it makes architectural choices that are questionable. Its design contradicts some of its own stated philosophies. Finally, I believe that the problems it proposes to solve are solvable by other means that do not suffer from Dynamo’s design flaws. I will try to cover these points in a series of posts.

Eventual Consistency

First, I will start with the notion of ‘eventual’ consistency to which Dynamo adheres. What does this mean to the application programmer? Here are some practical implications:

  1. committed writes may not show up in subsequent reads
  2. committed writes may show up in some subsequent reads and then go missing thereafter
  3. there is no SLA for when the writes are globally committed (and hence show up in all subsequent reads)

At a high level, not having an SLA is not a practical proposition. ‘Eventually’ we are all dead. A system that returns what I wrote only after I am dead is no good for me (or anyone else). At a more ground level, returning stale data (the logical outcome of eventual consistency) leads to severe data loss in many practical applications. Let’s say that one is storing key-value pairs in Dynamo, where the value encodes a ‘list’. If Dynamo returns a stale read for a key and claims the key is missing, the application will create a new empty list and store it back in Dynamo. This will wipe out the existing value for that key. Depending on how ‘stale’ the read was, the data loss (due to truncation of the list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded data loss – not even in the case of a disaster.
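
To illustrate, here is a minimal sketch of that failure mode against a hypothetical key-value client (the store and method names are made up for this example, not Dynamo’s actual API):

    # Hypothetical eventually consistent key-value client; names are illustrative.
    class EventuallyConsistentStore:
        def get(self, key):
            """May return a stale value, or None even though the key was written."""
            raise NotImplementedError

        def put(self, key, value):
            raise NotImplementedError

    def append_event(store, key, event):
        # Typical read-modify-write of a list value stored under `key`.
        current = store.get(key)
        if current is None:
            # Stale read: the replica we hit has not yet seen the committed list,
            # so the application "recreates" the key with an empty list...
            current = []
        current.append(event)
        # ...and this write clobbers the full list held by the up-to-date replicas,
        # truncating the history (absent vector-clock style conflict detection --
        # see the update below).
        store.put(key, current)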

(Update: Several people have called out that this data loss scenario is not possible in Dynamo due to Vector Clocks. That sounds correct. Some followups:

  1. The scenario is possible in Cassandra, which does not use vector clocks but only client timestamps
  2. Dynamo depends on conflict resolution to solve this problem. Such conflicts are very difficult to resolve – in particular, if deletion from the list is a valid operation, how would one reconcile after a mistaken truncation?
  3. In another simple scenario, a stale read may end up affecting writes to other keys – and this would be even harder to fix

The general points here are that returning stale reads is best avoided, and where it cannot be avoided, at least having some bound on the staleness allows one to write applications with reasonable behavior. Dynamo puts no bounds on how stale a read can be, and it returns stale reads even in single data-center environments where they can be entirely avoided.)
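
To make points 1 and 2 of the update above concrete, here is a rough sketch – my own simplification, not code from Dynamo or Cassandra – of vector-clock versioning versus last-write-wins on client timestamps:

    # Rough sketch of the two conflict-handling strategies (my simplification).

    def vector_clock_compare(vc_a, vc_b):
        """Compare two version vectors of the form {node: counter}."""
        a_dominates = all(vc_a.get(n, 0) >= c for n, c in vc_b.items())
        b_dominates = all(vc_b.get(n, 0) >= c for n, c in vc_a.items())
        if a_dominates and b_dominates:
            return "equal"       # same version
        if a_dominates:
            return "a"           # a subsumes b; b can be garbage collected
        if b_dominates:
            return "b"
        return "conflict"        # concurrent versions: both are kept and handed
                                 # back to the client to reconcile

    def last_write_wins(ts_a, val_a, ts_b, val_b):
        """Cassandra-style resolution on client timestamps: one side silently loses."""
        return val_a if ts_a >= ts_b else val_b

    # Dynamo-style: the truncated list and the full list are detected as
    # concurrent, but the application still has to guess how to merge them.
    print(vector_clock_compare({"A": 2}, {"A": 1, "B": 1}))                 # conflict

    # Timestamp-style: whichever write carries the later client clock wins
    # outright, so a mistaken truncation can simply overwrite the full list.
    print(last_write_wins(ts_a=101, val_a=[], ts_b=100, val_b=[1, 2, 3]))   # []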

Quorum Consensus

Dynamo starts by saying it is eventually consistent, but then in Section 4.5 it claims a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting the number of replicas read (R) and the number of replicas written (W) such that their sum exceeds the total number of replicas (N) (i.e. R+W>N), one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many updates, and then rejoin the cluster – yet they are admitted back to the quorum group without any resynchronization barrier. As a result, reading from R copies is not sufficient to return up-to-date data. This is partly why the system is only ‘eventually’ consistent. Of course, in a large cluster, nodes will leave and re-join the cluster all the time – so stale reads will be a matter of course.
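
A toy illustration of the problem (the node names, values, and the sloppy-quorum behavior with hinted handoff sketched here are my own simplification of the paper’s description):

    # Toy illustration: R + W > N does not guarantee overlap once failed nodes
    # rejoin without resynchronization and writes were taken via hinted handoff.
    N, R, W = 3, 2, 2                  # R + W > N, so reads "should" see writes

    stores = {
        "A": {"k": "v1"}, "B": {"k": "v1"}, "C": {"k": "v1"},  # preference list for "k"
        "D": {},                                               # fallback node for hints
    }

    # B and C are unreachable during an update. The coordinator still reports
    # success: it reaches W = 2 acknowledgements by writing to A and handing a
    # hint to D, which sits outside the preference list.
    stores["A"]["k"] = "v2"
    stores["D"]["k"] = "v2"            # hinted replica, not yet delivered to B or C

    # B and C rejoin. With no resynchronization barrier they still hold v1 and
    # immediately count toward read quorums again.
    read_set = ["B", "C"]              # a perfectly legal choice of R = 2 replicas
    print({stores[n]["k"] for n in read_set})   # {'v1'} -- the committed v2 is invisible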

This leads to the obvious question: why can one not simply put a resynchronization barrier in place when nodes re-join the cluster? The answer is troublesome: no one knows whether a node is in sync or not when it rejoins the cluster, and no one can tell how much data it is missing. There is no central commit log in the system. The only way to answer these questions is to do a full diff of the data (Dynamo uses Merkle trees to do this) against all other members of the quorum group. This is of course remarkably expensive and practically infeasible to do synchronously. Hence it is (at best) performed in the background.
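
For reference, the anti-entropy comparison looks roughly like the following – a simplified sketch, not Dynamo’s implementation (which maintains a separate tree per key range):

    import hashlib

    def merkle_tree(items):
        """Build a simplified Merkle tree over sorted (key, value) pairs."""
        level = [hashlib.sha256(f"{k}={v}".encode()).hexdigest()
                 for k, v in sorted(items)]
        tree = [level]
        while len(level) > 1:
            # Each parent node hashes the concatenation of its (up to two) children.
            level = [hashlib.sha256("".join(level[i:i + 2]).encode()).hexdigest()
                     for i in range(0, len(level), 2)]
            tree.append(level)
        return tree                    # tree[-1][0] is the root hash

    # Two replicas compare root hashes; only subtrees that differ need their
    # underlying keys exchanged. The catch is that building and comparing the
    # trees means scanning the data -- hence it runs in the background.
    a = merkle_tree([("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")])
    b = merkle_tree([("k1", "v1"), ("k2", "XX"), ("k3", "v3"), ("k4", "v4")])
    print(a[-1][0] == b[-1][0])        # False: the replicas diverge somewhere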

The other way to provide strong consistency is to read from all the replicas all the time. There are two problems with this:

  1. This clearly fails in the case where the only surviving member of the quorum group is out of date (because it had failed previously and hasn’t been brought up to date)
  2. This imposes a super-high penalty on reads that is not otherwise implied by a lower setting of the parameter R. In fact it renders the parameter R moot – which makes one wonder why Dynamo talks about quorum consensus in the first place

I am not the first one to observe these problems. See, for example, Cassandra-225, which was an attempt to solve this problem with a centralized commit log in Cassandra (a Dynamo clone).

WAN considerations

It’s worth pointing out that when Dynamo clusters span a WAN, the odds of nodes re-joining the cluster and remaining out of date are significantly higher. If a node goes down, ‘hinted handoff’ sends its updates to the next node in the ring. Since nodes from the two data centers alternate around the ring, those updates are sent to the remote data center. When the node re-joins the cluster, if the network is partitioned (which happens all the time), the node will not catch up on pending updates for a long time (until the network partition heals).

Disaster Recovery

The effect of eventual consistency and the replication design is felt most acutely when one considers disaster recovery. If one data center fails, there is absolutely nothing one can say about the state of the surviving data center. One cannot quantify how much data has been lost. With standard log-shipping-based replication and disaster recovery, one can at least keep track of replication lag and have some idea of how far behind the surviving cluster is.

The lack of point-in-time consistency at the surviving replica (evident in this scenario) is very problematic for most applications. In cases where one transaction (B) populates entities that refer to entities populated by a previous transaction (A), the effect of B being applied to the remote replica without A being applied leads to inconsistencies that applications are typically ill-equipped to handle (and handling them would make most applications complicated).
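
A simplified sketch of the difference between per-key eventual replication and serial log shipping (the transactions, keys, and values here are invented for illustration):

    # Transaction A creates an order; transaction B adds a line item referring to it.
    log = [
        ("A", "order:42",  {"status": "placed"}),
        ("B", "item:42/1", {"order": "order:42", "sku": "book"}),
    ]

    # Per-key eventual replication: each key propagates independently, so the
    # surviving data center can hold B's write without A's -- a dangling
    # reference the application must somehow cope with.
    remote_eventual = {"item:42/1": {"order": "order:42", "sku": "book"}}
    print("order:42" in remote_eventual)          # False: B arrived, A did not

    # Serial log shipping: the replica applies the log strictly in order, so at
    # any moment it holds a consistent prefix -- possibly behind, but never
    # referencing data it does not have, and the lag is measurable.
    applied = 1                                   # only A has been applied so far
    remote_log_shipped = {key: value for _, key, value in log[:applied]}
    print(remote_log_shipped)                     # {'order:42': {'status': 'placed'}}
    print(len(log) - applied)                     # replication lag, in transactions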

Contradictions

At the beginning, the paper stresses the principle of Symmetry. To quote:

Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers

By the time we get to Section 4.8.2, this principle is gone. To quote:

To prevent logical partitions, some Dynamo nodes play the role of seeds.

Again, in Section 2, one reads:

Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window

and by the time one gets to Section 2.3, one reads:

Dynamo is designed to be an eventually consistent data store

where, of course, the term ‘eventual’ is never quantified with any ‘window’!

In addition, I found this quote somewhat misleading (Section 2.3):

Data replication algorithms used in commercial systems traditionally perform synchronous replica coordination in order to provide a strongly consistent data access interface.

As an engineer who has worked extensively on production commercial replication and disaster recovery systems, I can vouch that this claim is incorrect. Most database and storage systems are deployed in asynchronous (but serial) replication mode. Replicas are always point-in-time consistent. Commercial recovery technologies (backups, snapshots, replication) all typically rely on point-in-time consistency. The combination of high availability and point-in-time consistency at remote data centers is relatively easy to achieve.

It is possible that the paper is trying to refer to databases that are distributed globally and updateable concurrently. However, these are an extremely tiny minority of commercial database deployments (if they exist at all), and it’s worth noting that the statement is largely untrue in practice.

Consistency versus Availability

Several outside observers have noted that Dynamo chooses AP from the CAP theorem, while other systems (notably BigTable) choose CA. Unfortunately, the SOSP paper does not distinguish between ‘Availability’ and ‘Partition Tolerance’.

The reader is left with the impression that there is always a tradeoff between Consistency and Availability of all kinds. This is, of course, untrue. One can achieve strong consistency and high availability within a single data center – and this is par for the course for most commercial databases, as well as for systems like HBase/HDFS.

The unfortunate outcome of this is that people who are looking for scalable storage systems even within a single data center may conclude (incorrectly) that Dynamo is a better architecture (than BigTable+GFS).

Centralization

Dynamo rails against centralized architectures, deeming them inherently low in availability. To quote (from Section 2.3):

In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.

Again, as an engineer who worked on data-center high availability for years, I find this general line of thought questionable. Centralized control typically causes scalability bottlenecks, but by no means does it necessarily imply low availability. The entire storage industry churns out highly available storage servers – typically by arranging for no single point of failure (SPOF). This means dual everything (dual motherboards, dual NICs, dual switches, RAID, multipathing, etc.) and synchronously mirrored write-back caches (across motherboards). A storage server is most often a central entity that is highly available – and as such I am willing to bet that the bank accounts of all the Dynamo authors are stored and retrieved via some such server sitting in a vault somewhere. Such servers typically have five 9s of availability (way more than anything Amazon offers for any of its services). The principles employed in building such highly available servers are well understood and easily applied to other centralized entities (for example the HDFS namenode).

Of course, having highly available centralized services requires discipline. One needs redundancy at every layer of the service (including, for example, power supplies for a rack and network uplinks). Many IT operations do not apply such discipline – but the resulting lack of availability is hardly the fault of the centralized architecture itself.

The irony in this claim is that a Dynamo cluster is itself likely to be centralized in some shape or form. One would likely have some racks of nodes in a Dynamo cluster all hanging off one set of core switches, while its clients (application/web servers) sit in a different network segment connected to different core switches. A network partition between these two sets of core switches would make Dynamo unavailable to its clients. (While this example shows how futile Dynamo’s goal of avoiding centralization is, it also shows how data centers need to be architected – with dual uplinks and switches at every network hop – to prevent network partitions from isolating centralized services.)

Finally (and with extreme irony), we have already seen that the lack of centralization (of commit logs, for instance) is the reason behind many of the consistency issues affecting Dynamo.

What next?

There are many other issues – regarding fault isolation from data corruption, for instance – that are also worth discussing. And as promised, I will try to cover simpler schemes that meet some of Dynamo’s design goals. If all goes well – in a subsequent post.

26 thoughts on “Dynamo: A flawed architecture – Part I”

  1. I think before you start making these big claims you need to read up on how Vector Clocks work. Any distributed systems textbook would explain that.

    Dynamo uses Vector Clocks for versioning. When a stale read occurs, the new list you talk about is written with a new version stamp and stored. If the version stamp subsumes the previous one, then the older version is garbage collected, i.e. overwritten. When versions conflict, both versions will be stored and returned to the client on a subsequent read. The client then resolves the conflict, and on the subsequent write of the resolved value a new version will be generated; the older versions will be garbage collected because they will be subsumed by the newly generated version stamp. So the case you are suggesting will never happen.

  2. [Disclosure: I used to work at Amazon, but not on Dynamo or Cart, and have no non-public information about those systems.]

    I think this is based on a flawed application of metrics from other organizations that do not match Amazon’s actual business needs.

    Sure, you can build a centralized system with a critical core that has near-zero downtime – usually at great operational cost – but there’s not a bank in the world that’s actually 99.999% available to customers over the internet. Most retail banks I’ve used take several hours of outages *per month* for maintenance on their web services.

    Amazon wants its systems not just to keep running but to keep taking orders from millions of people all over the world. This means that they are concerned with the reliability not just of a storage server, but everything needed to connect it to its application servers and to customers. They have found that the cost-effective way to do this is to distribute every component across geographically separated datacenters. Amazon has remained available through real and simulated datacenter-wide outages ranging from power/cooling failures to floods, fire, and hurricanes. No Amazon system lives within a single building, much less a single network switch or rack. The availability numbers you see externally for Amazon services are the combined downtime of every component in a large, networked system, and they also reflect tradeoffs made under a different cost-benefit model than a financial institution.

    Finally, although the formally provable aspects of Dynamo’s “eventual consistency” guarantee may be vague, any team operating a distributed system will study and understand the actual operational characteristics in practice, under both normal conditions and various failure modes. Some systems may have realistic failures that lead to days or hours of inconsistency (in which case the team has deliberately chosen this and will write client software to be aware of it); others might be tuned to achieve consistency within milliseconds under normal operation and to set off alarms within seconds after a failure. I’ve never known any that would be used in circumstances where “human lifetimes” are a relevant time period.

  3. Based on his reports on stale reads it is pretty clear he needs to attend Distributed Systems 101. It is apparent he just doesn’t realize how vector clock based versioning works within Dynamo.

  4. +1 – if it is so flawed, why are Amazon, Facebook and Digg using it in production? Twitter is also looking at it currently.

  5. The interesting and ironic thing about the claims that Dynamo helps Amazon is that, according to their published data, Dynamo is used as a cart system only. When the order goes through, there are still dependencies on Oracle databases. So the notion that Amazon’s fabulous uptime is due to Dynamo is itself flawed. They get uptime via Oracle, replication, judicious data center choices, a strategy of ‘never be down’, and extensive software teams and young engineers to sacrifice at the altar to keep things up.

    Vector clocks get into a state where they ‘punt’ and ask the application to merge 2 divergent states. In the Amazon cart example, this is where the code would merge 2 diverged carts, and then the result would be the union of the items, thus potentially resurrecting deleted items (a stated issue/problem in the paper IIRC). Not all applications have this kind of behaviour, and not all applications find this “OK”.

    Furthermore, the leading implementation of Dynamo, Cassandra, does not use vector clocks. It does support a more columnar data store, like BigTable, but not vector clocks, nor multiple versions.

    For my own needs, I have determined that the greater flexibility of a BigTable storage model is superior. You get good consistency properties, flexible schema design (left-partial indexes are great), and no hidden problems such as the key rebalancing indicated above. With the big community and lots of momentum, HBase is where I am.

  6. Mark :

    +1 – if it is so flawed, why are Amazon, Facebook and Digg using it in production? Twitter is also looking at it currently.

    I think you should know that the author holds the job title of “Data Infrastructure Lead” at Facebook. Additionally, Cassandra, which is the brainchild of Facebook and in use by Digg/Twitter, is not the same as Dynamo.

  7. I think it’s worth noting that Cassandra != Dynamo. It’s designed by one of the original Dynamo engineers (or so he says), but the model is quite different and has more in common with BigTable, IMO.

    Regardless, at the end of the day, Dynamo is a useful design that solves a specific problem space. The existence of Dynamo does not invalidate the existence of other data stores. There are places where HBase or Oracle or MySQL, etc will be the Right Choice. Dynamo (and the clones) are just another handy tool in the toolbelt.

  8. > Vector clocks get into a state where they ‘punt’ and ask the application to merge 2 divergent states. […] Not all applications have this kind of behaviour, and not all applications find this “OK”.

    And not all applications are suitable for a distributed environment. CAP lets you pick two at any point in time, but you still have to make a choice. If you can’t handle clients sorting out consistency issues on their own then you will need to sacrifice on availability or partition tolerance.

    > Furthermore, the leading implementation of Dynamo, Cassandra, does not use vector clocks. It does support a more columnar data store, like BigTable, but not vector clocks, nor multiple versions.

    Not sure where you are getting your info, but the leading implementation of Dynamo is whatever Amazon is currently using for maintaining cart state. Cassandra does not use vector clocks (which allow you to establish ordering even if clocks are out of sync among the nodes) but instead uses a simpler timestamp with majority rule among these timestamps for consistency conflicts.

    BigTable/HBase is fine if you can sacrifice partition-tolerance, but some applications need to be distributed across more than a single data center and are not too difficult to modify to handle looser consistency properties. Different strokes and all that…

  9. We’re also using it via Project Voldemort very successfully at Gilt Groupe for cart and order processing.

    I think that you missed something important somewhere – other commenters mentioned vector clocks, and I think that might be the core of it – did you take into account how those help the overall system (clients + servers) ensure consistency?

  10. @Nathan – thanks for your comments.

    To tell you the truth, some of the points in this post are derived from looking at Cassandra. Cassandra does not use vector clocks, only client timestamps. As such it is vulnerable to the scenario I described.

    That said, IMHO the problem does not disappear. Let’s take the scenario I mentioned – a key mapped to a singleton list (via mistaken truncation) on one node and a large list on another. Let’s say that at some point a client does receive both and needs to reconcile – how will it do so? What if deletion from the list were also a legitimate operation in my application? I would have no way of figuring out the right outcome. (See the toy example at the end of this comment.)

    Another simple example is where I populate or modify a second key based on the (stale) value of the first key.

    My general point here is that exposing clients to stale reads creates a very dangerous and tricky scenario. It is best to avoid this as much as possible. To the extent that consistency and availability are achievable within the context of a single data center, one should not suffer from inconsistent data in such a setting. However, Dynamo’s architecture creates the potential for inconsistency even within a single data center because of the way its read/write protocols are structured.

    The second point I am trying to make here is that Dynamo exposes applications to ‘unbounded’ staleness. I would be less worried about many of these scenarios if there were limits on how stale the read values could be.
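
    Going back to the truncated-list scenario above, here is a toy illustration (my own example, not anything from Cassandra or Dynamo) of why no merge function can recover the intent:

        # The two conflicting versions handed back to the client:
        full      = ["a", "b", "c", "d"]   # the real list on one replica
        truncated = ["e"]                  # rebuilt after a stale read, then appended to

        # Union-style merge (what a cart-like reconciler typically does): nothing
        # is ever lost, but any legitimate deletions would be resurrected.
        union_merge = list(dict.fromkeys(full + truncated))
        print(union_merge)                 # ['a', 'b', 'c', 'd', 'e']

        # The only alternative -- trusting the shorter list -- respects deletions
        # but silently throws away a, b, c and d after a mistaken truncation.
        # Nothing in the data says which of the two merges is correct.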

  11. @Matt – thanks for the comments.

    The general theme I was trying to convey in my post is that highly available centralized services should be considered in the context of any distributed architecture – especially if they don’t become scalability bottlenecks. The choice of Symmetry and Decentralization as guiding architectural principles is questionable, and not well justified – I think a lot of the complexity and problems in Dynamo (and its clones) come from this. Dynamo clusters would themselves be a centralized service in a data center.

    Note that Cassandra has started using Zookeeper (a centralized coordination service) for some operations.

  12. I’m glad you posted your thoughts – it’s good to have discussion like this out there. Generally I agree with you regarding the argument of centralization, however your assertion or implication that there are “dynamo clusters” at Amazon is unlikely to be true in practice.

    As you yourself state, the data center architecture is a critical component in a distributed architecture. Given the properties of the Amazon Dynamo system, I would suspect it is running just about everywhere co-resident with the web applications, and as such there would be no centralized “dynamo cluster”. This would be similar to how memcached is provisioned at your place of employment – Facebook.

  13. I disagree with your statement about Cassandra too. We are evaluating it, and we bypass this problem through something Cassandra has called callouts – so for read-repair and hinting we use callouts, i.e. conflict resolution. So I don’t see it as a problem at all.

    BTW, callouts are pieces of application code that can be registered with the system and invoked via a scripting engine API.

  14. BTW, Cassandra came from Facebook. Also, isn’t one of its authors from Dynamo too? I mean, the answers are within your reach.

  15. @Nathan – you are right, we are close to the horse’s mouth. My conclusion (shared by others) is that Cassandra cannot give any kind of consistency without reading from all the replicas, because of the way node failures and rejoins are handled, as I described in this note. The list truncation is a real scenario that I am looking at in my application – so it’s a show stopper. You may want to reconsider your evaluation (or perhaps this case does not apply to you). The performance impact of reading from all replicas all the time is likely to be unacceptable. I don’t understand how you are doing conflict resolution with only client timestamps. How can you know that the latest timestamp does not contain a valid update? If you can explain, I will happily incorporate it into my post.

    I also didn’t talk about performance because I wanted to focus on architecture. However, when people talk glibly about reading from all replicas, or about the whole vector-clock versioning scheme (which involves reading all versions), I would point out how expensive this kind of thing is likely to be. Cassandra stores data in multiple files – and reading in all the versions is much more expensive in terms of disk seeks.

    @Taylor – I would be very interested in learning about the Dynamo deployment at Amazon. I am pretty sure that Facebook’s Cassandra cluster is centralized. I am not sure about our exact memcache layout, but I do know that our requests to memcache cross core switches, and that core switch issues have caused some degree of isolation between the web and memcache tiers in the past. So in that sense our memcache tier is ‘centralized’, and we had better ensure very good connectivity between these tiers (which we try our best to do).

  16. @Dave – thanks for writing a good rebuttal. When I get some more time, I will try to address the points you have raised (although I might have addressed/incorporated one of them already).

    To address your comments here: even though Cassandra is not Dynamo, it has much more in common with Dynamo than with BigTable. In terms of CAP, it’s almost entirely Dynamo (except for the part about vector clocks and multiple versions). In its agreement with the architectural principles of Symmetry and Decentralization, it’s all Dynamo (except, as I noted, it’s starting to use Zookeeper).

  17. Why do you state that Cassandra is centralized? Deployments at Digg and at our end do not use Zookeeper at all. Could you elaborate on what you mean by the Cassandra cluster being centralized?

  18. What I mean is that Cassandra nodes are probably co-located in a small set of racks separate from the web tier. As such, any network partition that separates the web tier from Cassandra will cause downtime. So regardless of how tolerant Cassandra itself is of partitions, partitions within a data center are intolerable. That is what I mean by saying that Cassandra as a service is ‘centralized’.

    Once one comes around to this point of view, it becomes clear that tolerating partitions within a data center is a non-goal. One should design for consistency and availability within a data center and only worry about partitioning at higher levels in the system (and only for applications/data sets that are not OK with a single data-center service SLA). Imposing the overheads implied by partition tolerance on environments that do not have partitions is simply wrong.

  19. BTW, the hash trees that Dynamo uses are credited to Ralph, not Angela. It is spelled “Merkle”.

Comments are closed.