HBase and Map-Reduce

HBase + Map-Reduce is a really awesome combination. In all the back-and-forth about NoSQL, one thing that’s often missed is how convenient it is to be able to do scalable data analysis directly against large online data sets (that new distributed databases like HBase […]

Compression and Layering in Hadoop

One of the relatively late lessons I have learned in operating a Hadoop cluster is the (almost overwhelming) importance of compression in storage, computation and network transmission. One of the architectural questions is whether compression belongs in the file system (and similarly the networking subsystem) or whether it is something […]

Update on Hive+Hadoop+S3+EC2

A formal recipe for running SQL queries using EC2 against S3 files is now posted at http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely though not before hitting a few more bugs (HADOOP-5861). Running a TPC-H query using Hive was a pretty high point. (I did have to omit the ORDER BY clauses though :-() […]

Curt Monash reports on Hadoop/Hive @ Facebook

Curt Monash wrote a blog post about the conversation Ashish Thusoo and I had with him regarding Hadoop and Hive and their deployment and usage at Facebook. It is heartening to see the mainstream database and analytics community starting to cover Hadoop and Hive. Even though these projects are rapidly becoming […]