Thanks to everyone who responded (in different forums). I will try to summarize the responses made in the comments on my initial post, as well as some of the substantive discussions at Dave's blog and news.ycombinator.
The things I have heard, roughly in this order:
- Data loss scenario under the Eventual Consistency section is not possible with Vector Clocks: agreed. I screwed up. I was thinking about Cassandra – which does not have vector clocks, only client timestamps – where this is a definite possibility. I have updated the section with this information.
However, I remain convinced that one should not force clients to deal with stale reads in environments where they can be avoided. As I have mentioned in the updated initial post – there are simple examples where stale reads cause havoc. One may not be able to do conflict resolution, or the reads can affect other keys in ways that are hard to fix later.
The other point that I would re-emphasize is that there is no bound on how 'stale' the reads are. Nodes can be down for significant amounts of time, or they may rejoin the cluster after having lost some disks. It's hard to imagine writing applications where the data returned can be that far out of date.
About Vector Clocks and multiple versions – it's not a surprise that they were not implemented in Cassandra. In Cassandra, the cost of having to retrieve many versions of a key increases the disk seek cost of reads multi-fold. Due to the use of LSM trees, a disk seek may be required for each file that has a version of the key. Even though the versions may not require reconciliation, one still has to read them.
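For readers unfamiliar with the mechanism being discussed, here is a minimal sketch of how vector clocks detect conflicting versions (illustrative Python only – not Dynamo's or Cassandra's actual code). The point is that two concurrent writes are detectable as a conflict, whereas plain client timestamps would silently discard one of them:

```python
# Minimal vector-clock sketch. Each version of a key carries a
# {node_id: counter} map. Two versions conflict (and must both be kept
# for later reconciliation) iff neither clock dominates the other.

def dominates(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a-newer"
    if dominates(b, a):
        return "b-newer"
    return "concurrent"  # conflicting writes: surface both versions

# Two clients update the same key via different coordinator nodes:
v1 = {"A": 2, "B": 1}   # write coordinated by node A
v2 = {"A": 1, "B": 2}   # concurrent write coordinated by node B

print(compare(v1, v2))  # -> concurrent
```

With client timestamps alone, one of these two versions would win and the other would be lost – which is exactly the Cassandra data-loss scenario described above.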
- Analysis of quorum protocols is wrong: I don't think so. Consider W=3, R=1, N=3. In this case, one node disappearing for a while and then rejoining the cluster is clearly problematic. That node may serve the single copy of data required by a read operation. Depending on how much data that node is missing – the read may be very stale. The Dynamo paper says this setting is used for persistent caching – but I am surprised that Amazon can afford to read cached data that is potentially grossly out of date (we can't).
Consider W=2, R=2, N=3. This has somewhat better properties. I need to take out two nodes in the same shard and re-insert them into the cluster after a while to get stale reads. Writes in the interim succeed due to quorums provided by hinted handoffs. So stale reads are still possible – if considerably more improbable. Let's do the math: we want a system that's tolerant of 2 faults (which is why we have 3 replicas). Say that in a cluster of 100 nodes, two nodes are simultaneously down roughly once per day (a totally wild guess). Under the assumption that a single node serves a single key range – the odds of those two nodes belonging to the same quorum group are about 0.06 (100×C(3,2)/C(100,2)). That means that roughly every 16 days my cluster may get into a condition where stale reads can happen even with W=2 and R=2. Bad enough. Now if we throw virtual nodes into the equation – the odds go up. The exact math sounds tough – but at a low ratio of virtual nodes per physical node, the odds will likely go up in direct proportion to the ratio. So with a virtual/physical ratio of 2 – I could see stale reads every week or so.
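The back-of-envelope numbers above can be checked mechanically (same wild guesses as in the text – 100 nodes, N=3, one key range per node, two simultaneous failures per day):

```python
# Odds that a randomly chosen pair of down nodes shares a quorum group,
# for 100 nodes with 3-way replication and one key range per node.
from math import comb

nodes, replicas = 100, 3
pairs_total = comb(nodes, 2)                  # 4950 ways to pick the down pair
pairs_same_group = nodes * comb(replicas, 2)  # 300 pairs within some quorum group

p = pairs_same_group / pairs_total
print(round(p, 4))       # -> 0.0606
print(round(1 / p, 1))   # -> 16.5 (days between stale-read windows,
                         #    given one double-failure per day)
```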
Consider also the downsides of this scheme: doubling the reads is a very high penalty in any system. In Cassandra – which is write-optimized – read overheads are worse than with traditional B-trees. Note also that although writes can be made highly available by hinted handoffs – there's no such savior for reads. If the 3 replicas span a WAN – then one of the data centers has only one copy of the data. R=2 means one must read from both sides of the WAN when reading from this data center. That sounds pretty scary and highly partition intolerant and unavailable to me :-)!
- Replication schemes with point-in-time consistency also don't prevent stale reads: Let me clarify – I simply wanted to correct the assertion in the paper that commercial databases update replicas across a WAN synchronously. They mostly don't. They also aren't typically deployed to perform transactions concurrently from more than one site. So there's no comparison to Dynamo – except to point out that async replication with point-in-time consistency is pretty darned good for disaster recovery. That's it.
- Stale Reads are bad. We should do our utmost to not have them if they can be avoided.
- Unbounded Stale Reads are pure evil and unacceptable. Even under disaster scenarios – applications expect finite/bounded data loss. In most cases – admins will prefer to bring down a service (or parts of it) rather than take unbounded data loss/corruption.
- Network Partitions within a data center can be (and are) avoided by redundant network connectivity (they are usually intolerable). Designing for partition tolerance within a data center does not make sense.
- Centralization does not mean low availability. Very high availability central services can be built – although scalability can be a concern.
- The notion of Symmetry as a deployment and design principle does not model well the asymmetry that is inherent in hardware configurations and networking
- Consistency, high availability and scalability are simultaneously achievable in a data center environment (that does not have partitions). BigTable+GFS, HBase+HDFS (perhaps even an Oracle RAC database) are good examples of such systems. Strong Consistency means that these systems do not suffer from stale reads
- Dynamo’s read/write protocols can cause stale reads even when deployed inside a single data center
- No bound can be put on the degree of staleness of such reads (which is, of course, why the system is described as eventually consistent).
- When deployed across data centers, there is no way in Dynamo to track how many pending updates have not been reflected globally. When trying to recover from a disaster (by potentially changing quorum votes) – the admin will have no knowledge of just how much data has been lost (and will be possibly corrupted forever).
Pesky resync-barriers and quorums again
One more problem with Dynamo: I completely overlooked that resynchronization barriers, in the way I was imagining them, are impossible to implement (to fix the consistency issues).
The problem is this – how does a node know that it's rejoining the cluster and is out of date? Of course – if a node is rebooting – then this is simple to guess. However, consider a trickier failure condition – the node's network card (or the switch port) keeps going up and down. The software on the node thinks everything is healthy – but in effect it's leaving and re-joining the cluster (every once in a while).
In this case – even if Merkle trees are totally inexpensive (as Dave claims in his post) – I still wouldn't know when exactly to invoke them in such a way as to not serve stale reads. (Surely I can't invoke them before every read – I might as well read from all the replicas then!)
So, unfortunately, I am repeating this yet again – Dynamo's quorum consensus protocol seems fundamentally broken. How can one write outside the quorum group and claim a write quorum? And when one does so – how can one get consistent reads without reading every freaking replica all the time? (Well – the answer is – one doesn't – which is why Dynamo is eventually consistent. I just hope that users/developers of Dynamo clones realize this now.)
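To make this failure mode concrete, here is a toy model of a 'sloppy' quorum (hypothetical node names, grossly simplified – not how any real implementation is structured). A hinted write to a stand-in node counts toward W, so the write "succeeds" while a home replica misses it; if that replica rejoins before the hint is replayed, a read landing on it is stale:

```python
# Toy sloppy-quorum model: hinted writes to a node outside the natural
# replica group are counted toward the write quorum W.

class Node:
    def __init__(self, name):
        self.name, self.data, self.up = name, {}, True

home = [Node(n) for n in "ABC"]   # natural replicas for this key range
standby = Node("D")               # node outside the group that takes hints

def sloppy_write(key, val, w=2):
    acks = sum(1 for n in home if n.up)
    for n in home:
        if n.up:
            n.data[key] = val
    if acks < w:                  # make up the quorum via hinted handoff
        standby.data[key] = val
        acks += 1
    return acks >= w

home[1].up = home[2].up = False   # B and C drop off the ring together
assert sloppy_write("k", "v2")    # write still reports success (A + hint on D)

home[2].up = True                 # C rejoins before the hint is replayed
print(home[2].data.get("k"))      # -> None: an R=1 read from C is stale
```

A strict quorum would instead have failed the write (only one of the three group members was reachable), which is exactly the disagreement in the comment thread below.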
Symmetry
While I pointed out the inherent contradiction between Dynamo's goal of symmetry and its notion of seeds – I did not sufficiently point out the downsides of symmetry as a design principle.
One aspect of this is that server hardware configurations are inherently asymmetric. The way one configures a highly available stateful centralized server is very different from the way one configures a cheap/stateless web server. By choosing symmetry as a design principle – one effectively rules out using different hardware for different components in a complex software system. IMHO – this is a mistake.
Another aspect of this is that network connectivity is not symmetric. Connectivity between nodes inside a data center has properties (no partitions) that connectivity across data centers does not have. Dynamo's single ring topology does not reflect this inherent asymmetry in network connectivity. Its design treats all the inter-node links the same – when they clearly aren't.
Lastly, symmetry prevents us from decomposing complex software into separate services that can be built and deployed independently (and on different machines even). This is how most software gets built these days (so Dynamo is in some sense an anachronism). Zookeeper is a good example of a primitive that is useful to build a system like Dynamo that can be run on dedicated nodes.
Fault Isolation and Recovery
As a matter of detail – the Dynamo paper also does not talk about how it addresses data corruption (say a particular disk block has corrupt data) or disk failures (on a multi-disk machine – what is the recovery protocol for a single-disk outage?). In general, the issue is fault isolation and recovery from faults. This is not a flaw per se – rather, I am just pointing out that one cannot build any kind of storage system without addressing these issues. Cassandra also doesn't have any solutions for these issues (but some solutions may be in the works).
Revisiting and Summarizing
A lot of things have been mentioned by now – let me try to summarize my main points of contention so far:
Enough said. Hopefully, in part II (when I get a chance), I can try to list some design alternatives in this space (I have already hinted at the broad principles that I would follow: don't shirk centralization or asymmetry, model the asymmetry in the real world, don't build for partition tolerance where partitions don't/shouldn't exist, think about what the admin would need in a disaster scenario).
The core misunderstanding here is that Hinted Handoff does NOT mean “ignore W.” If it did, clearly R + W > N would mean nothing; in fact, W would be entirely superfluous. So to the degree that the Dynamo paper suggests otherwise, it should be treated as a “bug” in the paper.
So, just to treat your first example, with W=3,R=1 you can’t get inconsistent reads since writes will block if any replica is unavailable.
All quorum-based systems that I know of treat R + W like this. For Cassandra, see the code from StorageProxy.insertBlocking, specifically the calls to getNaturalEndpoints, getHintedEndpointMap, and getUnhintedNodes (and QuorumResponseHandler of course). Cassandra is careful to only hint N – W replicas as I’ve been discussing.
Everything else you object to basically falls out of this. For example, “there is no way in Dynamo to track how many pending updates have not been reflected globally” — which doesn’t matter, since either you are waiting for R + W > N, in which case you get consistency, or you are not, in which case you can’t complain if you don’t get it. 🙂
Jonathan – first of all – thanks for posting the comments here. I was getting tired of visiting different web sites 🙂
As you suggest – the Dynamo paper does clearly say otherwise. It clearly says that it observes ‘sloppy’ quorums. My understanding from a vigorous internal debate has been that Cassandra does the same thing as well (from people with more experience with code and the project). In fact – with a proper write quorum – there is no need for ‘hinted handoff’ – is there?
Secondly, let’s say for a minute, that Cassandra/Dynamo are doing what you are saying. Now we are in a serious availability black hole. Nodes fail all the time – so achieving W=3 is impossible for some subset of the key ranges pretty much all the time. This is simply not acceptable.
Simply put – you can't have it both ways. But the solutions are very simple really. In many classic quorum consensus protocols – there is a notion of a strong quorum group. Quorum groups are resized (and the votes adjusted) when nodes enter or leave the group. Re-entering a quorum group requires an explicit resynchronization point to observe consistency. It is very easy to see that such a solution fixes the problems above in the local data center case. But things will get complicated in a WAN setting (which brings us to the whole point of the fallacy of trying to treat multiple data centers the same way as a single data center).
I am very surprised at your claim actually. This has been discussed on the mailing lists as well (for example: http://markmail.org/message/vfhkctddw6lttpf3) and no one's contradicted it – Avinash in fact acknowledged it in his reply (saying that things don't go wrong in practice).
If by “sloppy quorum” you mean “cassandra ignores W” then no, absolutely not.
> with a proper write quorum – there is no need for ‘hinted handoff’ – is there?
There is no need _for the clients doing quorum reads/writes_, but the reason you do hinted handoff is because the client is the one who determines the desired consistency tradeoffs, and different clients can choose different values. So it matters if one client just wants minimum latency and is doing R=1 while other clients are doing quorum writes.

> Nodes fail all the time – so achieving W=3 is impossible
That is why most people who want consistency use quorum writes rather than one of the extremes (i.e., for N=3, W=R=2). Even with these small numbers you are already in very good shape compared to traditional master/slave db replication. And of course you can run with N=5 or more, trading latency for reliability.
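The rule Jonathan is invoking can be stated in one line (an illustrative helper, not Cassandra code): with strict quorums, every read set of R replicas intersects every write set of W replicas exactly when R + W > N, which is what guarantees a read sees the latest acknowledged write.

```python
# Quorum overlap rule: a read quorum and a write quorum must share at
# least one replica, which holds iff R + W > N (strict quorums assumed).

def overlap_guaranteed(n, w, r):
    return r + w > n

print(overlap_guaranteed(3, 3, 1))  # True  - W=3, R=1
print(overlap_guaranteed(3, 2, 2))  # True  - the common N=3 setting
print(overlap_guaranteed(3, 1, 1))  # False - fast, but only eventually consistent
print(overlap_guaranteed(5, 3, 3))  # True  - N=5 trades latency for reliability
```

Note this is the premise the rest of the thread disputes: the rule only holds if hinted writes outside the natural replica group are not counted toward W.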
> This has been discussed on the mailing lists as well
The context there is that Jun wants to wave a magic wand and achieve consistency for R=1 — you mentioned his ticket 224 in your original post — but TANSTAAFL and the tradeoffs you have to make to get that are not worth it. If you want consistency, use R + W > N.
The other context is that Avinash’s group entirely (?) runs with what you can think of as W=0,R=1: as long as you can record the writes anywhere for HH later, call it good. (This is what the dynamo paper refers to as “always writable.”) And he is saying that this is good enough in practice, which is a different discussion than whether it can theoretically offer strong consistency (obviously it can’t).
I give up. I will never be able to convince you (I am reminded of the quote that there are none so blind as those who will not see).
Read Section 4.6 of the Dynamo paper. Read your own comments. They are completely at odds.
However – let's say that we are just disagreeing on terminology at this point (on what W=3 means). But your claim (and some other Dynamo authors') is that consistency and availability are irreconcilable. I have tried to explain that it is possible to provide both when one does not suffer from partitions (as within a data center). I have given examples of systems (HDFS) that provide both (with an HA configuration for the namenode). Zookeeper is another example.
It has been frequently observed by outside observers that Dynamo chooses AP out of CAP while BigTable/HBase types of systems choose CA. I am simply reiterating this in detail – and my point is that choosing AP when one operates within a single data center is simply wrong. To the extent that partition tolerance is required – it can be built as a layer above a CA system. Such a layered system will model the inherent differences in network connectivity much better (I was hoping to cover this in a subsequent post) and can provide partition tolerance if and when required (and only cause inconsistency in those cases). But such an overall design will violate the principles of symmetry and decentralization that Dynamo holds so dear. Which is why it is a deeply flawed architecture.
As I’ve already said on the discussion of your last post on Hacker News, I agree with your interpretation of sloppy quorums. In my opinion Jonathan has got it wrong or is using a different terminology.
That said, I’m very happy that someone speaks out against Dynamo. I agree with most of your points in this and the last post. I just wish people would read what you are saying instead of blindly bashing you…
Joydeep,
Part of the cause of the trouble in your debate with Jonathan, and others, is that your arguments are predicated on something which is demonstrably false: that you can’t build a reliable, production, COMMERCIAL system based on the Dynamo or Cassandra architectures. You might be better served by taking a few different courses of action:
1) Don’t take such a lawyerly approach to the Dynamo paper. It is not a design document, and treating it as one is doomed to fail (as anyone who has implemented a Dynamo clone can tell you).
2) Recognize that not all situations are equally likely, so though there exist edge cases that can surface conflicts, they are far from the common case. This is called out clearly in the Dynamo paper (with the R/W/N parameters they used, it almost never happened). As explained in the paper, by Jonathan above, and by several others, you tune R, W, and N to suit your needs.
3) When someone who has implemented one of these systems tells you what is easy, what is hard, what works, and what doesn’t, you should listen. They are sharing hard-won wisdom, not being ‘blind’. It’s rather shocking to me that you have at least 4 folks who have direct implementation experience telling you you are misunderstanding how these systems work and your response is to tell them they are in the dark.
4) Read the papers! Not blogs on them, not presentations by unrelated folks about them, the actual papers. Further, read the _decades_ of supporting literature on vector clocks, gossip protocols, quorums, etc.
5) CAP is not 3 binary parameters, it is 3 knobs to be adjusted to suit different needs. Dynamo and Cassandra don’t eliminate C in favor of AP, they relax C slightly to get a very large increase in AP.
6) It is simply false to assume partitions inside a data center don’t happen, that sharing a pair of core routers several layers into the network implies a Dynamo cluster is ‘centralized’, or that being in multiple data centers implies massive latency between them (they can be, and often are, within a few miles of each other), etc. Yes, machines fail ‘constantly’, but on a timescale many orders of magnitude higher than writes to the cluster: writes happen on millisecond timescales, while server failures for a given cluster happen over timescales of days, weeks, or even months.
The impression I get from your comments is that you have extensive experience implementing traditional, tightly clustered storage systems and almost no experience with radically distributed systems like Dynamo and, more importantly, no experience with large-scale web operations: actually carrying a pager supporting your code, not just watching others do it. You are applying lessons learned in a very different context, coming to confusing answers, and asserting the experts in the new domain are wrong, despite their clear success. Revisit your assumptions.
b
Jonathan: You say: [with W=3,R=1 you can’t get inconsistent reads since writes will block if any replica is unavailable.] and later [ If you want consistency, use R + W > N. ]
In Cassandra, say you use N=3, W=3 & R=1. Let's say you managed to write only to replicas A & B, but not C. In this case Cassandra will return an error to the application saying the write failed – which is acceptable given that W=3. But Cassandra does not clean up/roll back the writes that happened to A & B. So how does R=1 give you consistency, when you could get data from A that supposedly shouldn't be there? [I checked this with Avinash and he confirms this behavior.]
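The scenario Kannan describes can be sketched in a few lines (simplified in-memory replicas; a hypothetical model, not Cassandra's actual code path): the mutations land on A and B before the coordinator notices the quorum cannot be met, the client is told the write failed, yet nothing is rolled back.

```python
# A W=3 write that reaches only A and B: reported as failed, not rolled
# back, and subsequently visible to an R=1 read served by A.

replicas = {"A": {}, "B": {}, "C": {}}
up = {"A": True, "B": True, "C": False}   # C is down

def write(key, val, w=3):
    reached = [r for r in replicas if up[r]]
    for r in reached:                     # mutations applied as they land...
        replicas[r][key] = val
    return len(reached) >= w              # ...before the quorum check fails

ok = write("k", "v-failed")
print(ok)                          # -> False: the client sees a write error
print(replicas["A"].get("k"))      # -> v-failed: but A will serve it at R=1
```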
Kannan,
The situation you describe is a partition: A & B are writable, while C is only readable [and you read from C; and you can avoid this by adjusting R/W/N]. Ignoring for the moment that such a scenario is extremely unlikely, recall that being always writable is an explicit design goal of Dynamo, _even at a cost to consistency_. This makes it appropriate for certain applications and inappropriate for others. Arguing that Dynamo and Cassandra are fatally flawed because they don't offer the same consistency guarantees as Oracle RAC is similar to arguing that Oracle RAC is fatally flawed because it isn't always writable under partition. Use the right tool for the right job; don't insist a tool be universally applicable.
b
@Benjamin – please don’t shoot the messenger.
I don't have a paper trail in PODC and SOSP – but here's my background. I brought up the Hadoop cluster at Facebook. I can comfortably claim to be one of the critical factors in making our Hadoop cluster scale from 80 nodes to thousands of nodes today. As the lead of this team and effort for almost a couple of years – I was indeed carrying the pager (or, more accurately, the cell phone) and I have spent numerous nights attending to our cluster, solving all sorts of problems and keeping our users happy. I have dealt with this shit – and I know it's not pleasant. While I was doing all this – I was also responsible for conceiving Hive and was one of its primary developers.
Secondly, what I am saying is academically sound. For example – read the VLDB '08 paper on PNUTS from Yahoo, by respected academicians. It's close to an implementation that I would design myself. While it doesn't attack Dynamo directly – it does cover the problems with eventual consistency for developing applications. I have a reasonably strong education in computer science – and I still haven't (entirely) forgotten reading the quorum consensus papers from Herlihy back in school.
I have used vector clocks myself when I worked at Netapp. We had a very simple 2-node HA cluster – but state synchronization was always an issue. I independently thought of and implemented vector clocks to reconcile cluster state with eventual consistency semantics. That level of consistency was just fine for the state we were dealing with. This was many, many years back. We also implemented a commercial disaster recovery system – it had to balance the tradeoffs among C, A and P (I didn't know this term then). A very smart colleague of mine solved the problem very elegantly using hierarchical quorums. Flat quorum groups just didn't work.
Consider that these HA and DR systems are used in many business critical applications (Netapp sold quite a load of them) and one of the reasons the DR solution worked well was because we were able to provide admins a good balance of C, A and P.
As regards my conclusions – Avinash has acknowledged flat out in internal mailing lists that Cassandra should not be used if data is desired to be consistent. I think you should reconsider the sources you trust – he has after all written the bulk of Cassandra code and was one of the Dynamo authors as well. (the fact that the leading committer of the open source Cassandra trunk doesn’t understand these issues points out how bad the situation is)
Consider also that we are hearing that Dynamo is slowly being deprecated inside Amazon. It would not be fair for me to comment further on this (let’s hear from Amazon guys).
Consider also that I am constantly pointing out BigTable/GFS as a better abstraction, with partition tolerance being built as a layer on top. So I am indeed again referring to prior work with tremendous credentials.
Note that while Dynamo is withering inside Amazon (a cloud computing stalwart) – BigTable powers a strong commercial-grade development platform (AppEngine).
I would let the facts speak for themselves.
Joydeep,
It is surprising to me that you’ve read the relevant papers as your analysis of Dynamo ignored vector clocks and the various conflict detection and resolution mechanisms while asserting conflict resolution was a glaring problem. A significant chunk of the paper deals with nothing but conflict detection and resolution, even though it is rare in practice.
I don't believe anyone is arguing that you should use Dynamo or Cassandra if guaranteed consistency is required. What people _are_ arguing is 1) there are interesting applications that don't require such guarantees, 2) those applications often have high writability/low latency requirements, 3) partitions happen (even in a single data center). As I said above, it would be foolish to use a system with relaxed consistency guarantees when consistency is paramount for your application.
On Amazon’s use of Dynamo, you are making a leap unsupported, even contradicted, by evidence: Werner did not say ‘Dynamo failed, so we stopped using it’, he said ‘We did the best we could at the time, it worked for years, but based on 5 years of experience we have now built something better’. I would be extremely surprised if they abandoned many of the principles embodied in Dynamo.
Finally, and again, your understanding of the underlying infrastructure is rather odd. Partitions can and do happen within a single datacenter. Your argument that 1) things like Dynamo are only ever deployed in a single datacenter and 2) partitions don’t happen in a single datacenter, hence 3) staying writable while partitioned is irrelevant is patently false on every point.
So, I agree: let the facts speak for themselves. You cherry pick and grossly misstate information to support your position, then cast aspersions at those who disagree with you. Dynamo and Cassandra were or are in production making enormous amounts of money for several companies. Any claims you make that they don’t work for the jobs for which they were built are bogus and say more about your attachment to being right than interest in an engineering discussion.
b
> Dynamo and Cassandra were or are in production making enormous amounts of money for several companies. Any claims you make that they don’t work for the jobs for which they were built are bogus and say more about your attachment to being right than interest in an engineering discussion.
On the contrary, Benjamin. People _are_ arguing that #nosql is the replacement for the traditional RDBMS.
Mobs and “movements” don’t like nuance.
@benjamin: regarding Amazon usage – perhaps I shouldn't have commented on it (I was passing on a second-hand story – but that's all it is). But I believe the relative success of BigTable is very much pertinent to this discussion. I don't think one could have provided an eventually consistent data store and achieved the same success as AppEngine with application developers.
I have posted a correction on my post about the vector clock stuff and explained why it happened. We were deep in discussions about Cassandra – and it doesn't use vector clocks.
Thank goodness you agree about consistency. So does Avinash. What I have tried to point out is that the Dynamo paper's section on quorums and consistency is confusing as hell. It leads readers to believe that they can get consistency – when they can't (with 100% odds). If you look at Jonathan's arguments – he's continuing to insist that there are proper read/write quorums in Dynamo/Cassandra – whereas there aren't. The term 'sloppy quorum' is used for a reason, and the system is only 'eventually consistent' for the same reason.
I haven't said that relaxed consistency is not attractive for some applications. I am also not saying that Dynamo is only deployed within a single data center. What I am saying, though, is that consistency needs to be relaxed only when partitioning is possible – and that this can be built as a separate layer above a data store with CA. The other thing that I keep stressing is that having bounds on inconsistency is a matter of practical importance. While recovering from an event like a disaster, one is faced with a choice between bringing significantly old data online and staying unavailable. In such admin-initiated actions – it's very important to have some idea of how much data could potentially be lost. The reason simply being that if data is significantly out of date – one might rather choose to be unavailable for the couple of days that it takes for the disaster to be repaired.
I continue to disagree about partitions within a single data center. You had asked whether I had on-the-ground experience at a web company. It might help to know, then, that my comments about core switches and partitioning are not some figment of imagination – but derived from actual events at our site – one of the largest in the world. Any kind of network partition in our data centers is usually catastrophic. We are simply unable to lose network access to one of our core services (from, say, our web tier) and continue functioning normally (from that data center). This would be fairly typical of any web site. So we must make arrangements that prevent network partitions in a data center. Rack failures (which are usually switch failures) are another case (almost like partitions) – but this problem is easily solved by replicating across racks (a la HDFS). Important central servers have to be attached to multiple switches.
I think this is a critical point (without which the argument for starting with CA-only falls apart). FWIW – I have had almost total success internally with this argument (people immediately agree from experience that partitions are simply intolerable within a data center).
(BTW – on a related point – S3 is eventually consistent as well – and it's a total pain to deal with that aspect of it (first-hand experience working out Hive integration with the Amazon guys). I almost felt sorry for the Amazon engineers as they kept explaining how screwed up the semantics were. Some day Amazon will have to fix it (competition is coming).)
@Benjamin: I am not sure you parsed my comment right. The case I am describing does not necessarily involve a partition. C could be down for a variety of reasons. Also, in my example, the read that happens after the failed write is not on C (as you mentioned) but on A.