Archive for November, 2009

Dynamo – Part I: a followup and re-rebuttals

Thanks to everyone who responded (in different forums). i will try to summarize the responses made in the comments on my initial post as well as some of the substantive discussions at Dave’s blog and news.ycombinator.

The things i have heard are roughly in this order:

  1. Data loss scenario under the Eventual Consistency section is not possible with Vector Clocks: agreed. I screwed up. I was thinking about Cassandra, which does not have vector clocks – only client timestamps – where this is a definite possibility. I have updated the section with this information.

    However, i remain convinced that one should not force clients to deal with stale reads in environments where they can be avoided. As i have mentioned in the updated initial post – there are simple examples where stale reads cause havoc. One may not be able to do conflict resolution, or a stale read may affect writes to other keys in ways that are hard to fix later.

    The other point that i would re-emphasize is that there is no bound on how ‘stale’ the reads are. Nodes can be down for significant amounts of time, or they may rejoin the cluster after having lost some disks. It’s hard to imagine writing applications that can cope with data that is that far out of date.

    About Vector Clocks and multiple versions – it’s not a surprise that they were not implemented in Cassandra. In Cassandra, the cost of having to retrieve many versions of a key increases the disk seek cost of reads multi-fold. Due to the usage of LSM trees, a disk seek may be required for each file that has a version of the key. Even though the versions may not require reconciliation, one still has to read them.

  2. Analysis of quorum protocols is wrong: I don’t think so. Consider W=3, R=1, N=3. In this case, one node disappearing for a while and then re-joining the cluster is clearly problematic. That node may serve the single copy of data required by a read operation. Depending on how much data that node does not have – the read may be very stale. The Dynamo paper says this setting is used for persistent caching – but i am surprised that Amazon can afford to read cached data that is potentially grossly out of date (we can’t).

    Consider W=2, R=2, N=3. This has somewhat better properties. I need to take out two nodes in the same shard and re-insert them into the cluster after a while to get stale reads. Writes in the interim succeed due to quorums provided by hinted handoffs. So stale reads are still possible – though considerably less probable. Let’s do the math (see the back-of-envelope sketch after this list): we want a system that’s tolerant of 2 faults (which is why we have 3 replicas). Let’s say that in a cluster of 100 nodes, two nodes are simultaneously down roughly once per day on average (taking a totally wild guess). Under the assumption that a single node serves a single key range – the odds of these two nodes belonging to the same quorum group are about 0.06 (100 × C(3,2) / C(100,2)). That means that roughly every 16 days my cluster may get into a condition where stale reads can happen even with W=2 and R=2. Bad enough. Now if we throw virtual nodes into the equation – the odds will go up. The exact math sounds tough – but at a low ratio of virtual nodes per physical node, the odds will likely go up in direct proportion to the ratio. So with a virtual/physical ratio of 2 – i could see stale reads every week or so.

    Consider also the downsides of this scheme: reading twice is a very high penalty in a system. In Cassandra – which is write optimized – read overheads are worse than in traditional B-trees. Note also that although writes can be made highly available by hinted handoffs – there’s no such savior for reads. If the 3 replicas span a WAN – then one of the data centers has only one copy of the data. R=2 means one must read from both sides of the WAN when reading from this data center. That sounds pretty scary and highly partition intolerant and unavailable to me :-)!

  3. Replication schemes with point in time consistency also don’t prevent stale reads: Let me clarify – i simply wanted to correct the assertion in the paper that commercial databases update replicas across a WAN synchronously. They mostly don’t. They also aren’t typically deployed to perform transactions concurrently from more than one site. So there’s no comparison to Dynamo – except to point out that async replication with point in time consistency is pretty darned good for disaster recovery. That’s it.
  4. Pesky resync-barriers and quorums again
    One more problem with Dynamo: i completely overlooked that resynchronization barriers, in the way i was thinking of them, are impossible to implement (to fix the consistency issues).

    The problem is this – how does a node know that it’s rejoining the cluster and that it’s out of date? Of course – if a node is rebooting – then this is simple to guess. However, consider a trickier failure condition – the node’s network card (or the switch port) keeps going up and down. The software on the node thinks everything is healthy – but in effect the node is leaving and re-joining the cluster (every once in a while).

    In this case – even if Merkle trees are totally inexpensive (as Dave claims in his post) – i still wouldn’t know when exactly to invoke them in such a way as to not serve stale reads. (Surely i can’t invoke them before every read – i might as well read from all the replicas then!)

    So, unfortunately, i am repeating this yet again – Dynamo’s quorum consensus protocol seems fundamentally broken. How can one write outside the quorum group and claim a write quorum? And when one does so – how can one get consistent reads without reading every freaking replica all the time? (well – the answer is – one doesn’t – which is why Dynamo is eventually consistent. I just hope that users/developers of Dynamo clones realize this now).
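To make the back-of-envelope estimate from point 2 above concrete, here is a minimal Python sketch. All the inputs (100 nodes, 3 replicas per key range, two simultaneous node failures roughly once a day) are the same wild guesses as above, and the counting slightly over-counts node pairs that appear in more than one group of three consecutive ring positions, so treat it as a rough illustration rather than a precise model:

```python
from math import comb

nodes = 100                    # physical nodes in the cluster (wild guess, as above)
replicas = 3                   # N = 3 replicas per key range
double_faults_per_day = 1.0    # assumed: two nodes simultaneously down ~once a day

# Rough odds that the two down nodes serve the same quorum group.
# This matches the estimate in the text: 100 * C(3,2) / C(100,2) ~= 0.06.
p_same_group = nodes * comb(replicas, 2) / comb(nodes, 2)
days_between_stale_windows = 1 / (double_faults_per_day * p_same_group)

print(f"p(same quorum group) ~= {p_same_group:.3f}")
print(f"stale-read window roughly every {days_between_stale_windows:.1f} days")

# With virtual nodes, the odds scale roughly with the virtual/physical ratio.
for vratio in (1, 2, 4):
    p = min(1.0, p_same_group * vratio)
    print(f"virtual/physical ratio {vratio}: roughly every {1/p:.1f} days")
```

Running this reproduces the numbers above: about a 0.06 chance per double fault, i.e. a possible stale-read window every 16 days or so, dropping to roughly a week with a virtual/physical ratio of 2.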

    Symmetry
    While i pointed out the inherent contradiction between Dynamo’s goal of symmetry and the notion of seeds – i did not sufficiently point out the downside of Symmetry as a design principle.

    One aspect of this is that server hardware configurations are inherently asymmetric. The way one configures a highly available stateful centralized server is very different from the way one configures a cheap/stateless web server. By choosing symmetry as a design principle – one effectively rules out using different hardware for different components in a complex software system. IMHO – this is a mistake.

    Another aspect of this is that network connectivity is not symmetric. Connectivity between nodes inside a data center has properties (no partitions) that connectivity across data centers does not have. Dynamo’s single ring topology does not reflect this inherent asymmetry in network connectivity. Its design treats all inter-node links the same – when they clearly aren’t.

    Lastly, symmetry prevents us from decomposing complex software into separate services that can be built and deployed independently (and even on different machines). This is how most software gets built these days (so Dynamo is in some sense an anachronism). Zookeeper is a good example of a primitive that is useful in building a system like Dynamo and that can be run on dedicated nodes.

    Fault Isolation and Recovery

    As a matter of detail – the Dynamo paper also does not talk about how it addresses data corruption (say a particular disk block has corrupt data) or disk failures (on a multi-disk machine – what is the recovery protocol for a single disk outage?). In general, the issue is fault isolation and recovery from such faults. This is not a flaw per se – rather, I am just pointing out that one cannot build any kind of storage system without addressing these issues. Cassandra also doesn’t have any solutions for these issues (but some solutions may be in the works).

    Revisiting and Summarizing

    A lot of things have been mentioned by now – let me try to summarize my main points of contention so far:

    1. Stale Reads are bad. We should do our utmost to not have them if they can be avoided.
    2. Unbounded Stale Reads are pure evil and unacceptable. Even under disaster scenarios – applications expect finite/bounded data loss. In most cases – admins will prefer to bring down a service (or parts of it) rather than take unbounded data loss/corruption.
    3. Network Partitions within a data center can be (and are) avoided by redundant network connectivity (they are usually intolerable). Designing for partition tolerance within a data center does not make sense.
    4. Centralization does not mean low availability. Very high availability central services can be built – although scalability can be a concern.
    5. The notion of Symmetry as a deployment and design principle does not model well the asymmetry that is inherent in hardware configurations and networking.
    6. Consistency, high availability and scalability are simultaneously achievable in a data center environment (that does not have partitions). BigTable+GFS, HBase+HDFS (perhaps even an Oracle RAC database) are good examples of such systems. Strong consistency means that these systems do not suffer from stale reads.
    7. Dynamo’s read/write protocols can cause stale reads even when deployed inside a single data center.
    8. No bound can be put on the degree of staleness of such reads (which is, of course, why the system is described as eventually consistent).
    9. When deployed across data centers, there is no way in Dynamo to track how many pending updates have not been reflected globally. When trying to recover from a disaster (by potentially changing quorum votes) – the admin will have no knowledge of just how much data has been lost (and possibly corrupted forever).

    Enough said. Hopefully, in part II (when i get a chance), i can try to list some design alternatives in this space (I have already hinted at the broad principles that i would follow: don’t shirk centralization or asymmetry, model the asymmetry in the real world, don’t build for partition tolerance where partitions don’t/shouldn’t exist, think about what the admin would need in a disaster scenario).

Dynamo: A flawed architecture – Part I

(The opinions in this article are entirely mine – although i owe my education in part to other people)

(Update: Please do read the followup. Conversations around the web on this topic are well tracked here)

Background

Recently i have had to look at Dynamo in close detail as part of a project. Unlike my previous casual perusals of the paper/architecture – this time i (along with some other colleagues) spent a good amount of time looking at the details of the architecture. Given the amount of buzz around this paper, the number of clones it has spawned, and the many users of those clones – our takeaways were shockingly negative.

Before posting this note – I debated whether Dynamo was simply inappropriate for our application (and whether calling it ‘flawed’ would therefore be a mis-statement). But it is clear in my mind that the paper is flawed – it tries to lead readers to believe things that are largely untrue, and it makes architectural choices that are questionable. Its design contradicts some of its own stated philosophies. Finally, i believe that the problems it proposes to solve are solvable by other means that do not suffer from Dynamo’s design flaws. I will try to cover these points in a series of posts.

Eventual Consistency

First – i will start with the notion of ‘eventual’ consistency which Dynamo adheres to. What does this mean to the application programmer? Here are some practical implications:

  1. committed writes may not show up in subsequent reads
  2. committed writes may show up in some subsequent reads and go missing thereafter
  3. there is no SLA for when the writes are globally committed (and hence show up in all subsequent reads)

At a high level – not having an SLA is not a practical proposition. ‘Eventually’ we are all dead. A system that returns what I wrote only after I am dead is no good for me (or anyone else). At a more ground level – returning stale data (the logical outcome of eventual consistency) leads to severe data loss in many practical applications. Let’s say that one is storing key-value pairs in Dynamo – where the value encodes a ‘list’. If Dynamo returns a stale read for a key and claims the key is missing, the application will create a new empty list and store it back in Dynamo. This will cause the existing key to be wiped out. Depending on how ‘stale’ the read was – the data loss (due to truncation of the list) can be catastrophic. This is clearly unacceptable. No application can accept unbounded data loss – not even in the case of a Disaster.

(Update: Several people have called out that this data loss scenario is not possible in Dynamo due to Vector Clocks. That sounds correct. Some followups:

  1. The scenario is possible in Cassandra, which does not use vector clocks but only client timestamps
  2. Dynamo depends on conflict resolution to solve this problem. Such conflicts are very difficult to resolve – in particular, if deletion from the list is a valid operation, then how would one reconcile after a mistaken truncation?
  3. In another simple scenario – a stale read may end up affecting writes to other keys – and this would be even harder to fix

The general points here are that returning stale reads is best avoided, and where it cannot be avoided – at least having some bounds on the staleness allows one to write applications with reasonable behavior. Dynamo puts no bounds on how stale a read can be and returns stale reads in single data-center environments where they can be entirely avoided.)
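To make the Cassandra-style scenario in point 1 concrete, here is a minimal, purely illustrative Python sketch of a last-write-wins store keyed by client timestamps. The replica and key names are made up, and this models client-timestamp reconciliation only – Dynamo itself would keep both versions via vector clocks:

```python
# Two replicas of a last-write-wins store: each key maps to (timestamp, value).
# replica_a saw the original write; replica_b was down and missed it.
replica_a = {"user:42:cart": (100, ["book", "laptop", "phone"])}
replica_b = {}                      # stale: never saw the write above

def read(replica, key):
    # A stale replica simply reports the key as missing.
    return replica.get(key, (0, None))

def write(replicas, key, value, ts):
    for r in replicas:
        old_ts, _ = r.get(key, (0, None))
        if ts > old_ts:             # last-write-wins: highest timestamp silently wins
            r[key] = (ts, value)

# Application logic: read the list, append an item, write it back.
_, cart = read(replica_b, "user:42:cart")       # stale read: key appears missing
cart = (cart or []) + ["charger"]               # app builds a fresh, nearly empty list
write([replica_a, replica_b], "user:42:cart", cart, ts=200)

print(read(replica_a, "user:42:cart"))
# -> (200, ['charger'])  -- the three previously committed items are gone
```

The point is simply that with nothing but client timestamps, a stale read followed by a legitimate write silently discards the older list on every replica.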

Quorum Consensus

Dynamo starts by saying it’s eventually consistent – but then in Section 4.5 it claims a quorum consensus scheme for ensuring some degree of consistency. It is hinted that by setting the number of read replicas (R) plus the number of write replicas (W) to be more than the total number of replicas (N) (i.e. R+W>N) – one gets consistent data back on reads. This is flat out misleading. On close analysis one observes that there are no barriers to joining a quorum group (for a set of keys). Nodes may fail, miss out on many, many updates and then rejoin the cluster – but they are admitted back to the quorum group without any resynchronization barrier. As a result, reading from R copies is not sufficient to give up-to-date data. This is partly the reason why the system is only ‘eventually’ consistent. Of course – in a large cluster – nodes will leave and re-join the cluster all the time – so stale reads will be a matter of course.

This leads to the obvious question – why can one simply not put a resynchronization barrier in place when nodes re-join the cluster? The answer to this is troublesome: no one knows whether a node is in sync or not when it rejoins the cluster. No one can tell how much data it doesn’t have. There is no central commit log in the system. The only way to figure out the answer to these questions is to do a full diff of the data (Dynamo uses something called Merkle trees to do this) against all other members of the quorum group. This will of course be remarkably expensive and practically infeasible to do synchronously. Hence it is (at best) performed in the background.

The other way to provide strong consistency is to read from all the replicas all the time. There are two problems with this:

  1. This clearly fails in the case where the only surviving member of the quorum group is out of date (because it had failed previously and hasn’t been brought up to date)
  2. This imposes a super-high penalty for reads that is otherwise not implied by a lower setting for the parameter R. In fact it renders the parameter R moot – which makes one wonder why Dynamo would talk about quorum consensus in the first place.

I am not the first one to observe these problems. See, for example, CASSANDRA-225, which was an attempt to solve this problem with a centralized commit log in Cassandra (a Dynamo clone).
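Here is a small simulation of the rejoin-without-barrier problem, assuming a Dynamo-style sloppy quorum with hinted handoff. The node names, the N=3/W=2/R=2 setting and the simplified coordinator logic are all just for illustration:

```python
# Toy model of a sloppy quorum with N=3, W=2, R=2.
# A, B, C are the preferred replicas for key "k"; D is a fallback node
# that receives hinted handoffs when a preferred replica is down.
stores = {"A": {}, "B": {}, "C": {}, "D": {}}
down = set()

def write(key, value, w=2, n=3):
    # The coordinator writes to the first n healthy nodes in the preference
    # list (hinted handoff walks past the preferred group) and succeeds
    # once w of them hold the value.
    targets = [node for node in ["A", "B", "C", "D"] if node not in down][:n]
    for node in targets:
        stores[node][key] = value
    return len(targets) >= w

def read(key, r=2):
    # The coordinator collects r versions from the preferred replicas.
    versions = []
    for node in ["A", "B", "C"]:
        if node not in down and key in stores[node]:
            versions.append(stores[node][key])
            if len(versions) == r:
                break
    return versions

write("k", "v1")                # A, B, C all store v1
down.update({"A", "B"})         # A and B drop out for a while
write("k", "v2")                # W=2 met by C plus a hinted copy on D
down.clear()                    # A and B rejoin with no resynchronization barrier
print(read("k"))                # ['v1', 'v1'] -- a "quorum" read that is stale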

WAN considerations

It’s worth pointing out that when Dynamo clusters span a WAN – the odds of nodes re-joining the cluster and remaining out of date are significantly increased. If a node goes down, ‘hinted handoff’ sends its updates to the next node in the ring. Since nodes from the two data centers alternate on the ring – the updates are sent to the remote data center. When the node re-joins the cluster, if the network is partitioned (which happens all the time), the node will not catch up on pending updates for a long time (until the network partition is healed).

Disaster Recovery

The effect of eventual consistency and replication design is felt most acutely when one considers the case of disaster recovery. If one data center fails, there is absolutely nothing one can say about the state of the surviving data center. One cannot quantify exactly how much data has been lost. With standard log-shipping based replication and disaster recovery, one can at least keep track of replication lag and have some idea of how far behind the surviving cluster is.

Lack of point in time consistency at the surviving replica (which is evident in this scenario) is very problematic for most applications. In cases where one transaction (B) populates entities that refer to entities populated in a previous transaction (A), the effect of B being applied to the remote replica without A being applied leads to inconsistencies that applications are typically ill-equipped to handle (and handling them would make most applications complicated).
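By contrast, here is a rough sketch of how serial, log-shipping replication lets an admin quantify the damage after a disaster. The log sequence numbers and the two-site setup are hypothetical, purely to illustrate the point:

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    last_lsn: int   # last log sequence number durably applied at this site

# Hypothetical state just before the primary data center is lost.
primary = Site("primary-dc", last_lsn=1_048_576)
standby = Site("standby-dc", last_lsn=1_048_201)

# Because the replication stream is applied serially, the standby is
# point-in-time consistent as of its last_lsn, and the loss is exactly
# the gap between the two positions.
lost_records = primary.last_lsn - standby.last_lsn
print(f"standby is consistent as of LSN {standby.last_lsn}; "
      f"at most {lost_records} log records lost")
```

Nothing comparable exists in Dynamo – there is no single log position to compare, so the admin cannot even bound the loss.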

Contradictions

Right at the beginning, the paper stresses the principle of Symmetry. To quote:

Symmetry: Every node in Dynamo should have the same set of responsibilities as its peers

By the time we get to Section 4.8.2, this principle is gone, to quote:

To prevent logical partitions, some Dynamo nodes play the role of seeds.

Again, in section 2, one reads:

Dynamo has a simple key/value interface, is highly available with a clearly defined consistency window

and by the time one gets to Section 2.3, one reads:

Dynamo is designed to be an eventually consistent data store

where of course – the term ‘eventual’ is not quantified with any ‘window’!

In addition i found this quote somewhat misleading (Section 2.3):

Data replication algorithms used in commercial systems traditionally perform synchronous replica coordination in order to provide a strongly consistent data access interface.

As an engineer who has worked extensively on production commercial replication and disaster recovery systems – I can vouch that this claim is incorrect. Most database and storage systems are deployed in asynchronous (but serial) replication mode. Replicas are always point-in-time consistent. Commercial recovery technologies (backups, snapshots, replication) all typically rely on point-in-time consistency. The combination of high availability and point-in-time consistency at remote data centers is relatively easy to achieve.

It is possible that the paper is trying to refer to databases that are distributed globally and update-able concurrently. However, these are an extremely tiny minority of commercial database deployments (if they exist at all), and it’s worth noting that this statement is largely untrue in practice.

Consistency versus Availability

Several outside observers have noted that Dynamo chooses AP from the CAP theorem – while other systems (notably BigTable) choose CA. Unfortunately, Dynamo does not distinguish between ‘Availability’ and ‘Partition Tolerance’ in the SOSP paper.

The reader is left with the impression that there is always a tradeoff between Consistency and Availability of all kinds. This is, of course, untrue. One can achieve strong Consistency and High Availability within a single data center – this is par for the course for most commercial databases – as well as for systems like HBase/HDFS.

The unfortunate outcome of this is that people who are looking for scalable storage systems even within a single data center may conclude (incorrectly) that Dynamo is a better architecture (than BigTable+GFS).

Centralization

Dynamo rails against centralized architectures, deeming them inherently of low availability – to quote (from Section 2.3):

In the past, centralized control has resulted in outages and the goal is to avoid it as much as possible. This leads to a simpler, more scalable, and more available system.

Again – as an engineer who worked on data-center high availability for years – i find this general line of thought questionable. Centralized control typically causes scalability bottlenecks. But by no means does it necessarily imply low availability. The entire storage industry churns out highly available storage servers – typically by arranging for no single point of failure (SPOF). This means dual everything (dual motherboards, dual NICs, dual switches, RAID, multipathing etc. etc.) and synchronously mirrored write-back caches (across motherboards). A storage server is most often a central entity that is highly available – and as such i am willing to bet that the bank accounts of all the Dynamo authors are stored and retrieved via some such server sitting in a vault somewhere. Such servers typically have 5 9’s availability (way more than anything Amazon offers for any of its services). The principles employed in building such highly available servers are well understood and easily applied to other centralized entities (for example the HDFS namenode).

Of course – having highly available centralized services needs discipline. One needs to have redundancy at every layer in the service (including, for example, power supplies for a rack and network uplinks). Many IT operations do not apply such discipline – but the resultant lack of availability is hardly the fault of the centralized architecture itself.

The irony in this claim is that a Dynamo cluster is likely itself to be centralized in some shape or form. One would likely have some racks of nodes in a Dynamo cluster all hanging off one set of core switches. As such, its clients (application/web servers) would be in a different network segment connected to different core switches. Network partitioning between these two sets of core switches would make Dynamo unavailable to the clients. (While this example shows how futile Dynamo’s goal of avoiding centralization is – it also shows how data centers need to be architected (with dual uplinks and switches at every network hop) to prevent network partitioning from happening and isolating centralized services).

Finally (and with extreme irony) we have already seen that the lack of centralization (of commit logs for instance) is the reason behind many of the consistency issues affecting Dynamo.

What next?

There are many other issues regarding fault isolation from data corruptions that are also worth discussing. And as promised, I will try to cover simpler schemes to cover some of the design goals of Dynamo as well. If all is well – in a subsequent post.
