Log Replay in MySQL and NetApp Filers

Mark Callaghan started a discussion on MySQL replication lag today on the MySQL@Facebook page. This happens to be one of my favorite topics, thanks to some related work I did at NetApp. I was happy to learn that there's already a bunch of work happening in this area for MySQL, and thought it would be good to document some of my experiences and how they relate to the MySQL efforts.

Somewhere back in 2003/2004, I started looking at file system log replay at NetApp (primarily because it was the coolest/most important problem I could find that didn't already have half a dozen people looking at it). NetApp maintains an operation log in NVRAM (which is mirrored over to a cluster partner in HA configurations). A filer restart or failover requires a log replay before the file system can be brought online. The log may have transactions dependent on each other (a file delete depends on a prior file create, for example), so the simplest way to replay the log is serially. Because replay happens when the file system cache is cold, most log replay operations typically block waiting for data/metadata to be paged in from disk. No wonder that basic serial log replay is dog slow, and this was one of the aspects of failover that we struggled to put a bound on. To make things worse, NVRAM sizes (and hence the accumulated operation log on failure) were getting bigger and bigger while disk random IOPS were not getting any better, so this was a problem that was only getting worse with time.

When I started working on this problem, there were some possible ideas floating around, but nothing that we knew would work for sure. Some of the obvious approaches were not good enough in the worst case (parallelizing log replay per file system) or were too complicated to fathom (analyzing the log and breaking it into independent streams that could be replayed in parallel). The latter particularly because there was no precise list of transactions and their dependent resources; the log format was one hairy ball (a common characteristic of closed source systems). At the beginning, I started with an approach where the failed filer's memory could be addressed by the filer taking over (i.e. instead of going to disk, it could use the partner filer's memory to fill in required blocks). But this was problematic since buffers could be dirty, and those weren't usable for replay. It was, of course, also problematic since it would never have worked in the simple restart case. At some point while playing around with this approach, I started having to maintain a list of buffers required for each log replay (I don't precisely remember why), and soon enough I had the Eureka moment: this list was all that was ever needed to speed up log replay.

The technique is documented in one of the few worthy patents I have ever filed (USPTO link), but long story short, it works like this (a rough code sketch follows the list):

  1. The filer keeps track of the file system blocks required to perform each modifying transaction (in a separate log called the WAFL Catalog).
  2. This catalog is also written to NVRAM (so it's available at the time of log replay) and mirrored to the partner filer.
  3. At replay time, the first thing the filer does is issue a huge boatload of read requests using the Catalog. It doesn't wait for these requests to complete and doesn't really care whether they succeed or fail; the hope is that most of them succeed and warm up the cache.
  4. The rest of replay happens as before (serially), but most of the operations being replayed find the required blocks already in memory (thanks to the IO requests issued in the previous step).
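
Here's a minimal, hypothetical sketch of the idea in Python. None of the names (BlockCache, replay, and so on) come from the actual filer code; the point is just the shape of it: record which blocks each logged operation touches, fire off all of those reads at once without waiting on them, and then run the same old serial replay against a now-warm cache.

```python
# Hypothetical sketch of catalog-assisted log replay (not NetApp's code).
# The catalog records which blocks each logged operation needs; at replay
# time those blocks are prefetched up front, then replay runs serially.
from concurrent.futures import ThreadPoolExecutor

DISK = {b: f"contents-of-{b}" for b in range(1000)}   # stand-in for slow disks

class BlockCache:
    def __init__(self):
        self.cache = {}

    def prefetch(self, block_id):
        # Fire and forget: we don't care if an individual read fails,
        # we only hope that most of them land and warm the cache.
        self.cache[block_id] = DISK.get(block_id)

    def read(self, block_id):
        if block_id not in self.cache:        # miss: a slow, blocking disk read
            self.cache[block_id] = DISK.get(block_id)
        return self.cache[block_id]

def replay(log, catalog, cache):
    # Step 3: issue a boatload of reads from the catalog, without waiting.
    pool = ThreadPoolExecutor(max_workers=32)
    for block_id in catalog:
        pool.submit(cache.prefetch, block_id)
    # Step 4: serial replay as before; most operations now find their
    # blocks already in memory.
    for op, block_id in log:
        _ = cache.read(block_id)              # mostly hits the warm cache
        # ... apply the logged modification to the block here ...
    pool.shutdown(wait=True)

# Example: a log of modifying operations and the catalog of blocks they touch.
log = [("set-attr", 7), ("write", 42), ("unlink", 7)]
catalog = {block for _, block in log}
replay(log, catalog, BlockCache())
```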

This implementation has been shipping on filers since at least 2005. It resulted in pretty ginormous speedups in log replay (4-5x was common). I was especially fond of this work because of the way it materialized (basically by hacking around), the way it worked (a pure optimization, non-intrusive to the basic architecture of the system), and the generality of the principles involved (it was apparent that this form of cataloging and prefetching could be used in any log replay environment). It was a simple and elegant method, and I have been dying ever since to find other places to apply it!

As a side note: disks are more efficient at random IO if they have more of it to do at once. Disk head movement can be minimized when serving a chain of random IO requests (disk drivers also sort pending IO requests). During measurements with this implementation, I was able to drive disks to around 200 IOPS during the prefetch phase of replay (where the common wisdom is no more than 100 IOPS per disk), thanks to the long list of IOs generated from the catalog.
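
A toy way to see why a long queue of outstanding requests helps (purely illustrative, nothing like how a real disk scheduler or driver is written): compare the total head travel for a batch of random block positions served in arrival order versus sorted, elevator-style.

```python
# Toy model: total head travel over a batch of random block positions,
# served in arrival order vs. sorted (elevator-style). Sorting the batch
# is only possible when many requests are outstanding at once.
import random

requests = random.sample(range(1_000_000), 50)   # 50 random block positions

def head_travel(order, start=0):
    travel, pos = 0, start
    for block in order:
        travel += abs(block - pos)
        pos = block
    return travel

print("arrival order:    ", head_travel(requests))
print("sorted (elevator):", head_travel(sorted(requests)))
```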

Roll forward to 2006 at Yahoo and 2008 at Facebook. Both of these organizations perennially complained about replication lag in MySQL. Nothing surprising: operation log replay in a database has the same problem as operation log replay in a filer. So I have been dying to find the time (and frankly the motivation) to dive into another enormous codebase and try to implement the same techniques. In bantering with colleagues earlier this year at Facebook, it became apparent that the technique described above is much easier to implement for a database. All transactions in a database are declarative and have a nice grammar (unlike the hairball NVRAM log format at NetApp). So one could generate a read query (select * from y where x='foo';) for each DML statement (update y set ... where x='foo';) that would essentially pre-fill the blocks required by the DML. And that should be it.
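As a back-of-the-envelope illustration (a naive regex rewrite, nothing like the careful SQL handling a production tool would do), here's roughly what turning a DML statement into its prefetching read query might look like:

```python
# Simplified, hypothetical sketch: rewrite UPDATE/DELETE statements into
# SELECTs over the same rows, so running the SELECT first warms the cache
# for the DML that follows.
import re

def prefetch_query(dml):
    """Return a SELECT that touches the same rows as the given DML, or None."""
    m = re.match(r"(?is)^\s*update\s+(\S+)\s+set\s+.*?\s+(where\s+.*)$", dml)
    if m:
        return f"SELECT * FROM {m.group(1)} {m.group(2)}"
    m = re.match(r"(?is)^\s*delete\s+from\s+(\S+)\s+(where\s+.*)$", dml)
    if m:
        return f"SELECT * FROM {m.group(1)} {m.group(2)}"
    return None   # INSERTs and other statements are left alone here

print(prefetch_query("UPDATE y SET col = 1 WHERE x = 'foo'"))
# -> SELECT * FROM y WHERE x = 'foo'
```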

So I was pleasantly surprised to learn today that this technique is already implemented for MySQL (and that it has a name too!). It's called the oracle algorithm and is available as part of the Maatkit toolset. Wonderful. I can't wait to see how it performs on Facebook's workloads.

MySQL also has more comprehensive work going on to parallelize replay. Harrison Fisk posted a couple of links to design docs and code for it. It's not surprising that these efforts take the route I had avoided (detecting conflicts in the log and replaying independent operations in parallel); the declarative nature of the MySQL log makes such efforts much more tractable, I would imagine.

Kudos to the MySQL community (and to Mark for his work at Facebook); it seems like there's a lot of good stuff happening.