Compression and Layering in Hadoop

One of the relatively late lessons I have received in operating a Hadoop cluster has been the (almost overwhelming) importance of compression in storage, computation and network transmission.

One of the architectural questions is whether compression belongs to the file-system (and similarly the networking sub-system) or whether it is something that the application layer (map-reduce and higher layers) are responsible for. While native compression is supported by many file systems (for example ZFS) – the arguments in favor of the former are less well made (or at least less commonly made). On the other hand, columnar, dictionary and other forms of compression at the application level (that exploit data schema and distribution) are common parlance and there’s a whole industry of sorts around these.

After some thought though – i have become more and more convinced that functionality such as compression should be moved as much down the stack as possible. The arguments for this (at least in the context of HDFS and Hadoop) are fairly obvious:

  1. Applications need to apply compression synchronously – often increasing latency of workloads and load during peak hours. File Systems can perform compression asynchronously
  2. FileSystems can manage data through it’s life cycle applying different forms of compression. One could, for example, keep new data uncompressed, recent data compressed via LZO and then finally migrate to Bzip.
  3. Where multiple replicas of data are maintained – some of the replicas can be maintained as compressed transparently – providing redundancy while saving space. Applications can run against uncompressed data whenever possible
  4. The former may be especially appropriate when data is maintained for disaster recovery. Data can be compressed before transmission to remote site and can be kept in compressed format there
  5. The filesystem may also be able to push compression to external hardware (co-processors)

Fairly similar arguments can be made to move wire compression to the networking sub-system. It can dynamically recognize whether a compute cluster is currently CPU or network bound and come up with appropriate compression level for data transmission. It is impossible for application level compression to achieve these things.

This leaves the tricky question of how to get the same levels of compression that an application may be able to achieve (given knowledge of data schema etc.). The challenge here points to a missing abstraction in the traditional split of file and database systems (and other applications). The traditional ‘byte-stream’ abstraction that filesystems (and networking systems) present means that they don’t have any knowledge of data inside the byte stream. However – if a data stream can be tagged with it’s schema – and optionally even a pointer to the right codec for this stream – then the file-system can easily perform optimal compression (while preserving the benefits above).

Traditionally – these kinds of proposals would also have encountered the standard opposition to running user-land (application) code in kernel space. But with user level file systems like HDFS gaining more and more tractions – and with user level processing becoming more common in operating systems as well (witness FUSE) – that argument is fairly moot. One of the opportunities with systems like Hadoop, i think, is that we can fix these historic anomalies. The open nature of this software and the fact that files are used as a record-stream (and not as a database image) – lends itself to the kind of schemes suggested above.