HBase and Map-Reduce

HBase + Map-Reduce is a really awesome combination. In all the back-and-forth about NoSQL, one of the things that’s often missed is how convenient it is to be able to do scalable data analysis directly against large online data sets (that new distributed databases like HBase […]

Flash Memory

I have been finding and reading some great references on flash memory lately and thought I would collate some of the better ones here (and leave some takeaways as well). For starters, ACM Queue magazine had a great issue entitled Enterprise Flash Storage last year. Jim Gray’s and Goetz Graefe’s […]

Compression and Layering in Hadoop

One of the relatively late lessons I have learned in operating a Hadoop cluster has been the (almost overwhelming) importance of compression in storage, computation and network transmission. One of the architectural questions is whether compression belongs in the file system (and similarly the networking subsystem) or whether it is something […]

Update on Hive+Hadoop+S3+EC2

A formal recipe for running SQL queries using EC2 against S3 files is now posted at: http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely. But not before hitting a few more bugs (HADOOP-5861). Running a TPC-H query using Hive was a pretty high point. (I did have to omit the ORDER BY clauses though :-() […]

Curt Monash reports on Hadoop/Hive @ Facebook

Curt Monash wrote a blog post about our (Ashish Thusoo’s and my) conversation with him regarding Hadoop and Hive and their deployment and usage at Facebook. It is heartening to see the mainstream database and analytics community starting to cover Hadoop and Hive. Even though these projects are rapidly becoming […]