Thanks to everyone who responded (in different forums). i will try to summarize the responses made on the comments to my initial post as well as some of the substantive discussions at Dave’s blog and news.ycombinator.
The things i have heard are roughly in this order:
- Data loss scenario under the Eventual Consistency section is not possible with Vector Clocks: agreed. I screwed up. I was thinking about Cassandra that does not have Vector clocks – only client timestamps – where this is a definite possibility. I have updated the section with this information.
However, i remain convinced that one should not force clients to deal with stale reads in environments where they can be avoided. As i have mentioned in the updated initial post – there are simple examples where stale reads cause havoc. One may not be able to do conflict resolution or the reads can affect other keys in ways that are hard to fix later.
The other point that i would re-emphasize is that there is no bound on how ‘stale’ the reads are. Nodes can be down for significant amounts of time or they may rejoin the cluster after having lost some disks. It’s hard to imagine writing applications where the data returned is that much out of date.
About Vector Clocks and multiple versions – it’s not a surprise that they were not implemented in Cassandra. In Cassandra – the cost of having to retrieve many versions of a key increases the disk seek costs reads multi-fold. Due to the usage of LSM trees, a disk seek may be required for each file that has a version of the key. Even though the versions may not require reconciliation, one still has to read them.
- Analysis of quorum protocols is wrong: I don’t think so. Consider W=3, R=1, N=3. In this case, one node disappearing for a while and coming back and re-joining cluster is clearly problematic. That node may serve the single copy of data required by a read operation. Depending on how much data that node does not have – the read may be very stale. Dynamo paper says this setting is used for persistent caching – but i am surprised that Amazon can afford to read cached data that is potentially grossly out of date (we can’t).
Consider W=2, R=2, N=3. This has somewhat better properties. I need to take out two nodes in same shard and re-insert them after a while into the cluster to get stale reads. Writes in the interim succeed due to quorums provided by hinted handoffs. So stale reads are still possible – if considerably more improbable. Let’s do the math: we want a system that’s tolerant of 2 faults (which is why we have 3 replicas). Let’s say in a cluster of 100 nodes – that the mean number of times two nodes are simultaneously down is roughly once per day (taking a totally wild guess). Under the assumption that a single node serves a single key range – the odds of these two nodes belonging to the same quorum group is about 0.06 (100*(3C2)/100C2). That means that roughly every 16 days my cluster may get into a condition where there can be stale reads happen even with W=2 and R=2. Bad enough. Now if we throw virtual nodes into the equation – the odds will go up. The exact math sounds tough – but at a low ratio of virtual nodes per physical node – the odds will likely go up in direct proportion to the ratio. So with a virtual/physical ratio of 2 – i could see stale reads every week or so.
Consider also the downsides of this scheme: twice the read is a very high penalty in a system. In Cassandra – which is write optimized – read overheads are worse that traditional btrees. Note also that although writes can be made highly available by hinted handoffs – there’s no such savior for reads. If the 3 replicas span a WAN – then one of the data centers only has one copy of the data. R=2 means one must read from both the sides of the WAN when reading from this data center. That sounds pretty scary and highly partition intolerant and unavailable to me !
- Replication schemes with point in time consistency also don’t prevent stale reads: Let me clarify – i simply wanted to correct the assertion in the paper that commercial databases update replicas across a WAN synchronously. They mostly don’t. They also aren’t typically deployed to perform transactions concurrently from more than one site. So there’s no comparison to Dynamo – except to point out that async replication with point in time consistency is pretty darned good for disaster recovery. That’s it.
- Stale Reads are bad. We should do our utmost to not have them if they can be avoided.
- Unbounded Stale Reads are pure evil and unacceptable. Even under disaster scenarios – applications expect finite/bounded data loss. In most cases – admins will prefer to bring down a service (or parts of it) rather than take unbounded data loss/corruption.
- Network Partitions within a data center can (and are) avoided by redundant network connectivity (they are usually intolerable). Designing for partition tolerance within a data center does not make sense.
- Centralization does not mean low availability. Very high availability central services can be built – although scalability can be a concern.
- The notion of Symmetry as a deployment and design principle does not model well the asymmetry that is inherent in hardware configurations and networking
- Consistency, high availability and scalability are simultaneously achievable in a data center environment (that does not have partitions). BigTable+GFS, HBase+HDFS (perhaps even an Oracle RAC database) are good examples of such systems. Strong Consistency means that these systems do not suffer from stale reads
- Dynamo’s read/write protocols can cause stale reads even when deployed inside a single data center
- No bound can be put on the degree of staleness of such reads (which is, of course, why the system is described as eventually consistent).
- When deployed across data centers, there is no way in Dynamo to track how many pending updates have not been reflected globally. When trying to recover from a disaster (by potentially changing quorum votes) – the admin will have no knowledge of just how much data has been lost (and will be possibly corrupted forever).
Pesky resync-barriers and quorums again
One more problem with Dynamo: i completely overlooked that resynchronization barriers, in the way i was thinking are impossible to implement (to fix the consistency issues).
The problem is this – how does a node know it’s rejoining the cluster and it’s out of date? Of course – if a node is rebooting – then this is simple to guess. However consider a more tricky failure condition – the node’s network card (or the switch port) keeps going up and down. The software on the node thinks everything is healthy – but in effect it’s leaving and re-joining the cluster (every once in a while).
In this case – even if Merkel trees are totally in-expensive (as Dave claims in his post) – i still wouldn’t know when exactly to invoke them in such a way as to not serve stale reads. (surely i can’t invoke them before every read – i might as well read from all the replicas then!)
So, unfortunately, i am repeating this yet again – Dynamo’s quorum consensus protocol seems fundamentally broken. How can one write outside the quorum group and claim a write quorum? And when one does so – how can one get consistent reads without reading every freaking replica all the time? (well – the answer is – one doesn’t – which is why Dynamo is eventually consistent. I just hope that users/developers of Dynamo clones realize this now).
While i pointed out the inherent contradiction in Dynamo’s goal of symmetry and the notion of seeds – i did not sufficiently point out the downside of Symmetry as a design principle.
One aspect of this is that server hardware configurations are inherently asymmetric. The way one configures a highly available stateful centralized server is very different from the way one configures a cheap/stateless web server. By choosing symmetry as a design principle – one effectively rules out using different hardware for different components in a complex software system. IMHO – this is a mistake.
Another aspect of this is that network connectivity is not symmetric. Connectivity between nodes inside a data center has properties (no partitions) that connectivity across across data centers does not have. Dynamo’s single ring topology does not reflect this inherent asymmetry in network connectivity. It’s design treats all the inter-node links the same – where they clearly aren’t.
Lastly, symmetry prevents us from decomposing complex software into separate services that can be built and deployed independently (and on different machines even). This is how most software gets built these days (so Dynamo is in some sense an anachronism). Zookeeper is a good example of a primitive that is useful to build a system like Dynamo that can be run on dedicated nodes.
Fault Isolation and Recovery
As a matter of detail – Dynamo paper also does not talk about how it addresses data corruptions (say a particular disk block had corrupt data) or disk failures (on a multi-disk machine – what is the recovery protocol for a single disk outage). In general the issue is fault isolation and recovery from the same. This is not a flaw per se – rather I am just pointing out that one cannot build any kind of storage system without addressing these issues. Cassandra also doesn’t have any solutions for these issues (but some solutions may be in the works).
Revisiting and Summarizing
A lot of things have been mentioned by now – let me try to summarize my main points of contention so far:
Enough said. Hopefully, in part II (when i get a chance), i can try to list some design alternatives in this space (I have already hinted at the broad principles that i would follow: don’t shirk centralization or asymmetry, model the asymmetry in the real world, don’t build for partition tolerance where partitions don’t/shouldn’t exist, think about what the admin would need in a disaster scenario).