MySQL on ZFS – we should all have started doing this a decade ago

One thing I have been asking myself for quite a few years now is: “How did we ever manage before ZFS?” Was everything really as painful and expensive before ZFS as it currently feels when I have to get anything done without ZFS? And why does anyone still use anything other than ZFS, a decade after it became available on the most popular server platforms (Solaris, FreeBSD, Linux)? I was fortunate enough to be exposed to it early on. If you are only discovering it and switching to it today, the question you will ask yourself will probably be more along the lines of: “Why have I not done this before?” This is at least as applicable to running MySQL on ZFS as it is to running anything else on ZFS.

Importance of Storage Alignment

Before we get into the ZFS specific technical part, let’s first look at the storage layer and how we can optimise the interaction between the storage subsystem and MySQL / MariaDB. Specifically, let’s look at the InnoDB storage engine and replication logs as these are by far the most ubiquitously used components. The most important aspect here is the page size of various files. InnoDB defaults to 16KB page size for tablespaces, 8KB for the transaction log, and binary logs don’t have a fixed page size. In order to simplify this, we can adjust the transaction log write size to 16KB (maximum is equal to the used InnoDB page size) using the innodb_log_write_ahead_size option. That reduces the requirement to 16KB blocks for the InnoDB tablespaces and logs, and no fixed block size for the replication logs. Even traditional file systems can be optimized for 16KB application page size (e.g. ext4’s stride and stripe-width mkfs parameters), but unfortunately systems engineers who are sufficiently familiar with both databases and storage are relatively rare.

Sector Size

With that out of the way, at the time of pool creation, the first thing you should consider is your underlying disk’s physical sector size. Most modern disks have 4KB hardware sectors, even of they emulate 512 byte sectors. Underlying page size on SSDs can be bigger, but this is heavily abstracted away and most SSDs are optimised for 4KB writes since this is the most commonly used block size on most file systems. ZFS will try to auto-detect this, but it is usually a good idea to explicitly instruct it to create a pool with 4KB sector size using the ashift parameter:

zpool create -o ashift=12 tank

“Wait, I thought you said 4KB, why 12?” I hear you ask? The ashift parameter is the exponent, rather than size. 2^12 = 4096. For 8KB sectors it would be ashift=13 because 2^13 = 8192.


Since InnoDB does it’s own prefetching, we can disable ZFS’ own prefetching (since it is redundant in this specific usage) by setting the kernel module paramter zfs_prefetch_disable=1. This is especially important in environments where disk I/O is heavily constrained and provisioning more IOPS is expensive (e.g. in cloud environments).

ZFS Tuning and Optimisation

At this stage, we need to consider the actual ZFS configuration parameters for the file system that will store our InnoDB tablespaces and logs. The parameters we normally recommend are the following:

atime=off: It means the same thing it means on all other file systems.

compression=lz4: LZ4 compression is very fast. So fast, in fact, that if you have a modern processor, you are likely to be able to decompress the data much faster than your SSD can read the compressed data. This means simultaneously having approximately double the disk throughput while consuming only half of the storage space.

logbias=throughput: From Solaris documentation: “If logbias is set to throughput, ZFS does not use the pool’s separate log devices. Instead, ZFS optimizes synchronous operations for global pool throughput and efficient use of resources. The default value is latency. For most configurations, the default value is recommended. Using the logbias=throughput value might improve performance for writing database files.”

primarycache=metadata: Since InnoDB already does it’s own caching via the Buffer Pool, there is no benefit to also caching the data in the page cache, or in the case of ZFS, ARC. Since O_DIRECT doesn’t disable caching on ZFS, we can disable caching of data in the ARC by setting primarycache to only cache metadata (default is “all”).

recordsize=16k: As discussed above, InnoDB uses 16KB pages. Using this parameter, we tell ZFS to limit the block size to 16KB. This prevents multiple pages from being written in one larger block, as that would necessitate reading the entire larger block for future updates to any one page in that block. For replication logs (binary/relay), we should create a separate file system on the same pool without specifying the recordsize option (or specifying it to be the default 128KB). This will typically improve compression ratios and provide best results for log-structured (append-only) data where the size of a write isn’t consistently predictable.

xattr=sa: This instructs ZFS to store extended file attributes in the file’s inode instead of a hidden directory, since storing it in the inode is more efficient. This is only really relevant on systems that use SELinux, since this uses extended attributes for storing security contexts of files.

MySQL Tuning and Optimization

In addition to optimisation on the ZFS side, there are also optimisations we can make on the MySQL side, to take maximum possible advantage of what ZFS offers.

innodb_log_write_ahead_size=16384: In order to prevent torn pages and avoid read-on-write overhead, we should set this to the underlying file system block size. If we are keeping the InnoDB logs in the data directory (as is the default), we should set this to what we set the ZFS recordsize to.

innodb_doublewrite=0: InnoDB double-write has one purpose – to prevent torn pages from corrupting the data in case of application crash. Because commits on ZFS are atomic, and we have aligned the InnoDB page size and innodb_log_write_ahead_size with ZFS recordsize, a torn page cannot occur – either the entire block is written, or the entire block is lost. This eliminates the risk of a partial page being written, so we can disable innodb_doublewrite.

innodb_checksum_algorithm=none: InnoDB records a checksum in each page. This is important for detecting data corruption on traditional storage stacks, but ZFS already does the same thing – it computes and stores the checksum of eack block. Doing it at InnoDB level as well is therefore rendered redundant and can be disabled for a small CPU saving.

innodb_flush_neighbors=0: When it is time to flush out a page to disk, InnoDB also flushes all of the nearby pages as well. This is beneficial on spinning rust and with traditional file systems, but with a copy-on-write file system like ZFS where logically nearby blocks are not necessarily physically nearby blocks, this typically just wastes disk I/O. It is better to disable it on ZFS, even when using mechanical disks.

innodb_use_native_aio=0: On Linux, the ZFS driver’s AIO implementation is a compatibility shim. InnoDB performance takes a hit when using the default code path. To fully disable the use of ZFS native AIO, innodb_use_atomic_writes=0 should also be set.

As discussed above, ZFS LZ4 compression is incredibly fast, so we should leave compression to ZFS and not use InnoDB’s built in page compression. As an added benefit, leaving compression to ZFS doesn’t disrupt the block alignment. Optimising the storage stack for 16KB writes and then using compression at InnoDB level would de-tune that optimisation significantly. Since ZFS recordsize is referring to the raw, uncompressed size (it could compress to as little as one sector/ashift size), it doesn’t disrupt the storage stack alignment.

Operating System Tuning and Optimization

ZFS does its own I/O scheduling so the scheduler for the disks used in ZFS pools should be set to “none” or “noop”, depending on what your kernel includes as available for your disks.

You can check the current values using:

cat /sys/block/*/queue/scheduler

This can have a dramatic effect on performance, especially on fast storage where the overhead of kernel scheduler getting in the way becomes dominant to the bottleneck traditionally caused by spinning disks being slow.

Things You Can Do Better With MySQL on ZFS

If you need to temporarily boost write performance without restarting mysqld (e.g. when you are loading a large amount of data), ZFS can help. In MySQL and MariaDB you can disable disk flushing by setting innodb_flush_log_at_trx_commit=0 and sync_binlog=0. With ZFS you can disable all explicit disk flushing to any one file system by setting sync=disabled at file system level for the duration of the write-heavy workload. The background commit thread will still flush the written data to disk every 5 seconds, but flushing the data once per 5 seconds instead of for each transaction gives a huge performance boost as it provides orders of magnitude more opportunity for merging writes.

Other Benefits

ZFS also brings other benefits. As mentioned above, it brings built in data corruption detection and healing (obviously healing requires storage redundancy, e.g. disk mirroring or RAIDZ). It brings transparent data compression that can both reduce storage costs and improve throughput. It also provides snapshot capability, but unlike other solutions for providing snapshots (e.g. LVM), it doesn’t have limited snapshot space – the only limitation is total space available in the pool. Additionally, unlike with LVM, there is no performance penalty from having snapshots, so snapshots can be kept indefinitely (or in line with backup data retention policy). These snapshots can be transferred incrementally to another ZFS system, making backup management much more workable, especially when very large databases are in use, since regularly taking a full backup of a multi-TB database is prohibitively slow and expensive. Snapshots can also be used for facilitating efficient Galera full state transfers, with some custom scripting.

Performance Measurements

How much performance difference does it make? Well, I last thoroughly measured this a few years back, when optimizing a client’s very large production system. Here are the numbers, measured on an AWS instance with EBS storage:

MD 4-disk RAID10 + ext4:
SELECT: 169m
INSERT: 140m
DELETE: 332m
TOTAL: 712m

ZFS 4-disk stripe-of-mirrors
SELECT: 164m (97%)
INSERT: 119m (85%)
UPDATE: 73m (104%)
DELETE: 202m (60%)
TOTAL: 558m (78%)

The 3-4% difference on SELECTs and UPDATEs is within the margin of measurement, and it was not very surprising – the memory on the servers was already sized appropriately for the workload, so the vast majority of reads were satisfied from the buffer pool. A significant part of the UPDATEs in this case were null operations that didn’t actually match any rows, so there was not much write workload involved, and as for the SELECTs, most of the reads came from the buffer pool and didn’t touch the storage once the caches were warmed up. INSERTs and DELETEs, however, show a significant improvement. The system in question is rather write-heavy, and the benefits of ZFS really show, with the overall performance being improved by 22%. Considering that these improvements came purely from ZFS and the optimisations it safely enabled at MySQL level, on a system where the database was already fully optimised by our database specialists, this 22% improvement on top is certainly worth having. The approximately 50% saving in EBS storage costs was also a nice bonus.

Would you like your databases migrated to run MySQL on ZFS? Give Shattered Silicon a call, we can help you improve your database performance.