A formal recipe for running SQL queries from EC2 against files in S3 is now posted at: http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely
But not before hitting a few more bugs (HADOOP-5861). Running a TPC-H query using Hive was a pretty high point. (I did have to omit the ORDER BY clauses, though :-()
I am amazed at how far Hive has come (and yet how glaring some of the missing features are). I am also impressed by the promise of the cloud (this being my first project using S3/EC2) and by how different the experience was compared to programming/administering a large in-house cluster. Amazon’s infrastructure seems to scream for developers to jump in and add value. Hopefully I will get a chance to explore some ideas on this in future posts.