<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments for Joydeep Sen Sarma's blog</title>
	<atom:link href="http://jsensarma.com/blog/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://jsensarma.com/blog</link>
	<description>musings on computing and storage</description>
	<pubDate>Wed, 10 Mar 2010 13:12:19 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>Comment on Dynamo: A flawed architecture - Part I by Mike Spreitzer</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-a-flawed-architecture-part-i/comment-page-1/#comment-154</link>
		<dc:creator>Mike Spreitzer</dc:creator>
		<pubDate>Fri, 13 Nov 2009 04:10:56 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=55#comment-154</guid>
		<description>BTW, the hash trees that Dynamo uses are credited to Ralph, not Angela.  It is spelled "Merkle".</description>
		<content:encoded><![CDATA[<p>BTW, the hash trees that Dynamo uses are credited to Ralph, not Angela.  It is spelled &#8220;Merkle&#8221;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo: A flawed architecture - Part I by tecosystems &#187; links for 2009-11-11</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-a-flawed-architecture-part-i/comment-page-1/#comment-153</link>
		<dc:creator>tecosystems &#187; links for 2009-11-11</dc:creator>
		<pubDate>Thu, 12 Nov 2009 01:05:01 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=55#comment-153</guid>
		<description>[...] Dynamo: A flawed architecture &#8211; Part I « Joydeep Sen Sarma’s blog pushback to Dynamo (tags: dynamo amazon scalability architecture distributed computing scaling nosql key-value eventual consistency) [...]</description>
		<content:encoded><![CDATA[<p>[...] Dynamo: A flawed architecture &#8211; Part I « Joydeep Sen Sarma’s blog pushback to Dynamo (tags: dynamo amazon scalability architecture distributed computing scaling nosql key-value eventual consistency) [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo - Part I: a followup and re-rebuttals by Kannan</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-part-i-a-followup-and-re-rebuttals/comment-page-1/#comment-150</link>
		<dc:creator>Kannan</dc:creator>
		<pubDate>Thu, 05 Nov 2009 06:41:22 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=64#comment-150</guid>
		<description>@Benjamin: I am not sure if you parsed my comment right. The case I am describing does not necessarily involve a partition. C could be down for a variety of reasons. Also, in my example, the read that happens after the failed write is not at C (as you mentioned) but at A.</description>
		<content:encoded><![CDATA[<p>@Benjamin: I am not sure if you parsed my comment right. The case I am describing does not necessarily involve a partition. C could be down for a variety of reasons. Also, in my example, the read that happens after the failed write is not at C (as you mentioned) but at A.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo: A flawed architecture - Part I by Lee</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-a-flawed-architecture-part-i/comment-page-1/#comment-148</link>
		<dc:creator>Lee</dc:creator>
		<pubDate>Wed, 04 Nov 2009 16:17:38 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=55#comment-148</guid>
		<description>Why are you assuming dynamo only runs in one data center?</description>
		<content:encoded><![CDATA[<p>Why are you assuming dynamo only runs in one data center?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo - Part I: a followup and re-rebuttals by Joydeep</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-part-i-a-followup-and-re-rebuttals/comment-page-1/#comment-146</link>
		<dc:creator>Joydeep</dc:creator>
		<pubDate>Wed, 04 Nov 2009 07:40:06 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=64#comment-146</guid>
		<description>@benjamin: regarding amazon usage - perhaps i shouldn't have commented on it (i was passing on a second-hand story - but that's all it is). but i believe the relative success of bigtable is very much pertinent to this discussion. i don't think one could have provided an eventually consistent data store and achieved the same success as appengine with application developers.

i have posted a correction on my post about the vector clock stuff and explained why it happened. we were deep in discussions about Cassandra - and it doesn't use vector timestamps.

thank goodness u agree about Consistency. So does Avinash. What i have tried to point out is that the Dynamo paper's section on quorums and consistency is confusing like hell. It leads readers to believe that they can get consistency with 100% odds - when they can't. If u look at Jonathan's arguments - he's continuing to insist that there are proper read/write quorums in Dynamo/Cassandra - whereas there aren't. The term 'sloppy quorum' is used for a reason and the system is only 'eventually consistent' for the same reason.

i haven't said that relaxed consistency is not attractive for some applications. i am also not saying that dynamo is only deployed within a single data center. what i am saying though is that consistency needs to be relaxed only when partitioning is possible - and that this can be built as a separate layer above a data store with CA. the other thing that i keep stressing is that having bounds on inconsistency is a matter of practical importance. while recovering from an event like a disaster, one is faced with a choice: bring significantly old data online in order to stay available, or stay unavailable. In such admin initiated actions - it's very important to have some idea of how much data could be potentially lost. The reason simply being that if data is significantly out of date - one might rather choose to be unavailable for the couple of days that it takes for the disaster to be repaired.

i continue to disagree about partitions within a single data center. u had asked whether i had on-the-ground experience in a web company. it might help to know then that my comments about core switches and partitioning are not some figment of imagination - but derived from actual events from our site - one of the largest in the world. any kind of network partition in our data centers is usually catastrophic. we are simply unable to lose network access to one of our core services (from say our web tier) and continue functioning normally (from that data center). this would be fairly typical of any web site. so we must build arrangements that prevent network partitions in a data center. rack failures (which are usually switch failures) are another case (that are almost like partitions) - but this problem is easily solved by replicating across racks (a la hdfs). important central servers have to be attached to multiple switches.

i think this is a critical point (without which the argument for starting with CA only falls apart). FWIW - i have had almost total success internally with this argument (people immediately agree from experience that partitions are simply intolerable within a data center). 

(btw - on a related point - S3 is eventually consistent as well - and it's a total pain to deal with that aspect of it (first hand experience working out Hive integration with Amazon guys). i almost felt sorry for Amazon engineers as they kept explaining how screwed up the semantics were. some day amazon will have to fix it (competition is coming)).</description>
		<content:encoded><![CDATA[<p>@benjamin: regarding amazon usage - perhaps i shouldn&#8217;t have commented on it (i was passing on a second-hand story - but that&#8217;s all it is). but i believe the relative success of bigtable is very much pertinent to this discussion. i don&#8217;t think one could have provided an eventually consistent data store and achieved the same success as appengine with application developers.</p>
<p>i have posted a correction on my post about the vector clock stuff and explained why it happened. we were deep in discussions about Cassandra - and it doesn&#8217;t use vector timestamps.</p>
<p>thank goodness u agree about Consistency. So does Avinash. What i have tried to point out is that the Dynamo paper&#8217;s section on quorums and consistency is confusing like hell. It leads readers to believe that they can get consistency with 100% odds - when they can&#8217;t. If u look at Jonathan&#8217;s arguments - he&#8217;s continuing to insist that there are proper read/write quorums in Dynamo/Cassandra - whereas there aren&#8217;t. The term &#8216;sloppy quorum&#8217; is used for a reason and the system is only &#8216;eventually consistent&#8217; for the same reason.</p>
<p>i haven&#8217;t said that relaxed consistency is not attractive for some applications. i am also not saying that dynamo is only deployed within a single data center. what i am saying though is that consistency needs to be relaxed only when partitioning is possible - and that this can be built as a separate layer above a data store with CA. the other thing that i keep stressing is that having bounds on inconsistency is a matter of practical importance. while recovering from an event like a disaster, one is faced with a choice: bring significantly old data online in order to stay available, or stay unavailable. In such admin initiated actions - it&#8217;s very important to have some idea of how much data could be potentially lost. The reason simply being that if data is significantly out of date - one might rather choose to be unavailable for the couple of days that it takes for the disaster to be repaired.</p>
<p>i continue to disagree about partitions within a single data center. u had asked whether i had on-the-ground experience in a web company. it might help to know then that my comments about core switches and partitioning are not some figment of imagination - but derived from actual events from our site - one of the largest in the world. any kind of network partition in our data centers is usually catastrophic. we are simply unable to lose network access to one of our core services (from say our web tier) and continue functioning normally (from that data center). this would be fairly typical of any web site. so we must build arrangements that prevent network partitions in a data center. rack failures (which are usually switch failures) are another case (that are almost like partitions) - but this problem is easily solved by replicating across racks (a la hdfs). important central servers have to be attached to multiple switches.</p>
<p>i think this is a critical point (without which the argument for starting with CA only falls apart). FWIW - i have had almost total success internally with this argument (people immediately agree from experience that partitions are simply intolerable within a data center). </p>
<p>(btw - on a related point - S3 is eventually consistent as well - and it&#8217;s a total pain to deal with that aspect of it (first hand experience working out Hive integration with Amazon guys). i almost felt sorry for Amazon engineers as they kept explaining how screwed up the semantics were. some day amazon will have to fix it (competition is coming)).</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo - Part I: a followup and re-rebuttals by psvt</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-part-i-a-followup-and-re-rebuttals/comment-page-1/#comment-145</link>
		<dc:creator>psvt</dc:creator>
		<pubDate>Wed, 04 Nov 2009 06:42:25 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=64#comment-145</guid>
		<description>On the contrary, Benjamin. People _are_ arguing that #nosql is the replacement for the traditional rdbms.

Mobs and "movements" don't like nuance.</description>
		<content:encoded><![CDATA[<p>On the contrary, Benjamin. People _are_ arguing that #nosql is the replacement for the traditional rdbms.</p>
<p>Mobs and &#8220;movements&#8221; don&#8217;t like nuance.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo - Part I: a followup and re-rebuttals by Benjamin Black</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-part-i-a-followup-and-re-rebuttals/comment-page-1/#comment-143</link>
		<dc:creator>Benjamin Black</dc:creator>
		<pubDate>Wed, 04 Nov 2009 02:27:38 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=64#comment-143</guid>
		<description>Joydeep,

It is surprising to me that you've read the relevant papers as your analysis of Dynamo ignored vector clocks and the various conflict detection and resolution mechanisms while asserting conflict resolution was a glaring problem.  A significant chunk of the paper deals with nothing but conflict detection and resolution, even though it is rare in practice.

I don't believe anyone is arguing that you should use Dynamo or Cassandra if guaranteed consistency is required.  What people _are_ arguing is 1) there are interesting applications that don't require such guarantees, 2) those applications often have high writability/low latency requirements, 3) partitions happen (even in a single data center).  As I said above, it would be foolish to use a system with relaxed consistency guarantees when consistency is paramount for your application.

On Amazon's use of Dynamo, you are making a leap unsupported, even contradicted, by evidence: Werner did not say 'Dynamo failed, so we stopped using it', he said 'We did the best we could at the time, it worked for years, but based on 5 years of experience we have now built something better'.  I would be extremely surprised if they abandoned many of the principles embodied in Dynamo.

Finally, and again, your understanding of the underlying infrastructure is rather odd.  Partitions can and do happen within a single datacenter.  Your argument that 1) things like Dynamo are only ever deployed in a single datacenter and 2) partitions don't happen in a single datacenter, hence 3) staying writable while partitioned is irrelevant is patently false on every point.

So, I agree: let the facts speak for themselves.  You cherry pick and grossly misstate information to support your position, then cast aspersions at those who disagree with you.  Dynamo and Cassandra were or are in production making enormous amounts of money for several companies.  Any claims you make that they don't work for the jobs for which they were built are bogus and say more about your attachment to being right than interest in an engineering discussion.


b</description>
		<content:encoded><![CDATA[<p>Joydeep,</p>
<p>It is surprising to me that you&#8217;ve read the relevant papers as your analysis of Dynamo ignored vector clocks and the various conflict detection and resolution mechanisms while asserting conflict resolution was a glaring problem.  A significant chunk of the paper deals with nothing but conflict detection and resolution, even though it is rare in practice.</p>
<p>I don&#8217;t believe anyone is arguing that you should use Dynamo or Cassandra if guaranteed consistency is required.  What people _are_ arguing is 1) there are interesting applications that don&#8217;t require such guarantees, 2) those applications often have high writability/low latency requirements, 3) partitions happen (even in a single data center).  As I said above, it would be foolish to use a system with relaxed consistency guarantees when consistency is paramount for your application.</p>
<p>On Amazon&#8217;s use of Dynamo, you are making a leap unsupported, even contradicted, by evidence: Werner did not say &#8216;Dynamo failed, so we stopped using it&#8217;, he said &#8216;We did the best we could at the time, it worked for years, but based on 5 years of experience we have now built something better&#8217;.  I would be extremely surprised if they abandoned many of the principles embodied in Dynamo.</p>
<p>Finally, and again, your understanding of the underlying infrastructure is rather odd.  Partitions can and do happen within a single datacenter.  Your argument that 1) things like Dynamo are only ever deployed in a single datacenter and 2) partitions don&#8217;t happen in a single datacenter, hence 3) staying writable while partitioned is irrelevant is patently false on every point.</p>
<p>So, I agree: let the facts speak for themselves.  You cherry pick and grossly misstate information to support your position, then cast aspersions at those who disagree with you.  Dynamo and Cassandra were or are in production making enormous amounts of money for several companies.  Any claims you make that they don&#8217;t work for the jobs for which they were built are bogus and say more about your attachment to being right than interest in an engineering discussion.</p>
<p>b</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo - Part I: a followup and re-rebuttals by Joydeep</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-part-i-a-followup-and-re-rebuttals/comment-page-1/#comment-142</link>
		<dc:creator>Joydeep</dc:creator>
		<pubDate>Wed, 04 Nov 2009 02:09:23 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=64#comment-142</guid>
		<description>@Benjamin - please don't shoot the messenger.

i don't have a paper trail in PODC and SOSP - but here's my background. i brought up the hadoop cluster at Facebook. I can comfortably claim to be one of the critical factors in making our Hadoop cluster scale from 80 to thousands of nodes today. as the lead of this team and effort for almost a couple of years - i was indeed carrying the pager (or more accurately the cell phone) and i have spent numerous nights attending to our cluster, solving all sorts of problems and keeping our users happy. i have dealt with this shit - and i know it's not pleasant. while i was doing all this - i was also responsible for conceiving Hive and was one of its primary developers.

secondly, what i am saying is sound academically. for example - read the VLDB 08 paper on PNUTS from Yahoo from respected academicians. It's close to an implementation that i would design myself. while it doesn't attack Dynamo directly - it does cover the problems with eventual consistency for developing applications. i have a reasonably strong education in computer science - and i still haven't forgotten (entirely) reading the quorum consensus papers from Herlihy back in school.

i have used vector clocks myself when i worked at Netapp. We had a very simple 2 node HA cluster - but state synchronization was always an issue. I independently thought of and implemented vector clocks to reconcile cluster state with eventual consistency semantics. That level of consistency was just fine for the state we were dealing with. this was many many years back. We also implemented a commercial disaster recovery system - it had to balance the tradeoffs between CAP (I didn't know this term then). A very smart colleague of mine solved the problem very elegantly using hierarchical quorums. Flat quorum groups just didn't work.

Consider that these HA and DR systems are used in many business critical applications (Netapp sold quite a load of them) and one of the reasons the DR solution worked well was because we were able to provide admins a good balance of C, A and P.

As regards my conclusions - Avinash has acknowledged flat out in internal mailing lists that Cassandra should not be used if data is desired to be consistent. I think you should reconsider the sources you trust - he has after all written the bulk of Cassandra code and was one of the Dynamo authors as well. (the fact that the leading committer of the open source Cassandra trunk doesn't understand these issues points out how bad the situation is)

Consider also that we are hearing that Dynamo is slowly being deprecated inside Amazon. It would not be fair for me to comment further on this (let's hear from Amazon guys).

Consider also that i am constantly pointing out BigTable/GFS as a better abstraction with partition tolerance being built as a layer on top of this. So i am indeed again referring to prior work with tremendous credentials.

Note that while Dynamo is withering inside Amazon (a cloud computing stalwart) - BigTable powers a strong commercial grade development platform (AppEngine).

i would let the facts speak for themselves.</description>
		<content:encoded><![CDATA[<p>@Benjamin - please don&#8217;t shoot the messenger.</p>
<p>i don&#8217;t have a paper trail in PODC and SOSP - but here&#8217;s my background. i brought up the hadoop cluster at Facebook. I can comfortably claim to be one of the critical factors in making our Hadoop cluster scale from 80 to thousands of nodes today. as the lead of this team and effort for almost a couple of years - i was indeed carrying the pager (or more accurately the cell phone) and i have spent numerous nights attending to our cluster, solving all sorts of problems and keeping our users happy. i have dealt with this shit - and i know it&#8217;s not pleasant. while i was doing all this - i was also responsible for conceiving Hive and was one of its primary developers.</p>
<p>secondly, what i am saying is sound academically. for example - read the VLDB 08 paper on PNUTS from Yahoo from respected academicians. It&#8217;s close to an implementation that i would design myself. while it doesn&#8217;t attack Dynamo directly - it does cover the problems with eventual consistency for developing applications. i have a reasonably strong education in computer science - and i still haven&#8217;t forgotten (entirely) reading the quorum consensus papers from Herlihy back in school.</p>
<p>i have used vector clocks myself when i worked at Netapp. We had a very simple 2 node HA cluster - but state synchronization was always an issue. I independently thought of and implemented vector clocks to reconcile cluster state with eventual consistency semantics. That level of consistency was just fine for the state we were dealing with. this was many many years back. We also implemented a commercial disaster recovery system - it had to balance the tradeoffs between CAP (I didn&#8217;t know this term then). A very smart colleague of mine solved the problem very elegantly using hierarchical quorums. Flat quorum groups just didn&#8217;t work.</p>
<p>Consider that these HA and DR systems are used in many business critical applications (Netapp sold quite a load of them) and one of the reasons the DR solution worked well was because we were able to provide admins a good balance of C, A and P.</p>
<p>As regards my conclusions - Avinash has acknowledged flat out in internal mailing lists that Cassandra should not be used if data is desired to be consistent. I think you should reconsider the sources you trust - he has after all written the bulk of Cassandra code and was one of the Dynamo authors as well. (the fact that the leading committer of the open source Cassandra trunk doesn&#8217;t understand these issues points out how bad the situation is)</p>
<p>Consider also that we are hearing that Dynamo is slowly being deprecated inside Amazon. It would not be fair for me to comment further on this (let&#8217;s hear from Amazon guys).</p>
<p>Consider also that i am constantly pointing out BigTable/GFS as a better abstraction with partition tolerance being built as a layer on top of this. So i am indeed again referring to prior work with tremendous credentials.</p>
<p>Note that while Dynamo is withering inside Amazon (a cloud computing stalwart) - BigTable powers a strong commercial grade development platform (AppEngine).</p>
<p>i would let the facts speak for themselves.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Dynamo - Part I: a followup and re-rebuttals by Benjamin Black</title>
		<link>http://jsensarma.com/blog/2009/11/dynamo-part-i-a-followup-and-re-rebuttals/comment-page-1/#comment-141</link>
		<dc:creator>Benjamin Black</dc:creator>
		<pubDate>Wed, 04 Nov 2009 01:53:14 +0000</pubDate>
		<guid isPermaLink="false">http://jsensarma.com/blog/?p=64#comment-141</guid>
		<description>Kannan,

The situation you describe is a partition: A &amp; B are writable, while C is only readable [and you read from C; and you can avoid this by adjusting R/W/N].  Ignoring for the moment that such a scenario is extremely unlikely, recall that being always writable is an explicit design goal of Dynamo, _even at a cost to consistency_.  This makes it appropriate for certain applications and inappropriate for others.  Arguing that Dynamo and Cassandra are fatally flawed because they don't offer the same consistency guarantees as Oracle RAC is similar to arguing that Oracle RAC is fatally flawed because it isn't always writable under partition.  Use the right tool for the right job; don't insist a tool be universally applicable.


b</description>
		<content:encoded><![CDATA[<p>Kannan,</p>
<p>The situation you describe is a partition: A &amp; B are writable, while C is only readable [and you read from C; and you can avoid this by adjusting R/W/N].  Ignoring for the moment that such a scenario is extremely unlikely, recall that being always writable is an explicit design goal of Dynamo, _even at a cost to consistency_.  This makes it appropriate for certain applications and inappropriate for others.  Arguing that Dynamo and Cassandra are fatally flawed because they don&#8217;t offer the same consistency guarantees as Oracle RAC is similar to arguing that Oracle RAC is fatally flawed because it isn&#8217;t always writable under partition.  Use the right tool for the right job; don&#8217;t insist a tool be universally applicable.</p>
<p>b</p>
]]></content:encoded>
	</item>
</channel>
</rss>
