MySQL on a Traditional File System

After a recent article about running MySQL on ZFS, which touched upon the subject of storage stack and file system alignment, I had a few questions from clients about whether any of that is applicable to other file systems as well. Indeed, there are analogues in other storage stacks that are just as important when it comes to tuning storage subsystem performance for InnoDB workloads (and other workloads with known or predictable block sizes). In a typical storage stack there are 4 layers:

Modern disks typically have 4KB sectors, though in some cases (certain SSDs), there minimum write block size can be even bigger. The important thing is to make sure that as you go up the stack, the block size only goes up and never down. If your RAID uses blocks of 64KB but your application uses 16KB pages (e.g. InnoDB by default uses 16KB pages), this will result unnecessary read and write amplification and read-modify-write overheads.

In this article I will illustrate alignment optimisations based on Linux software MD RAID and ext4 file system options. Since parity RAID (5, 6 and equivalents) performance profile isn’t well suited for OLTP database workloads, we will be working based on RAID0 disk arrays, since RAID10 is effectively RAID0 on top of RAID1 mirror devices.

The key point is that we must make sure that the RAID block size aligns with the application block size, and that the file system inbetween these two aligns as well. At MD RAID level, the important option is chunk, and for InnoDB we should set it to 16KB: mdadm […] –chunk=16K […]

Then we must tell ext4 to arrange its metadata in a way optimal for the RAID block size. This is done using the extended mkfs.ext4 parameters stride and stripe-width. It may seem intuitive to set stripe-width appropriate to the width of the array by looking at a RAID0 array as a RAID5 array without the parity disk, but this is in fact a mistake since there is no read-modify-write required to update the parity block. Both of these should be set to 16KB. The default (and maximum) ext4 block size is 4KB, and stride and stripe-width are expressed in numbers of blocks rather than absolute size, so the command we should use is: mkfs.ext4 -b 4096 -E stride=4,stripe-width=4 […]

There is one other option that is typically overlooked but can make a substantial difference, especially with very large file systems (and thus, databases). This is the -g blocks-per-group option. Since every access to a block group requires access to the block group header, if all of the block groups begin on the same disk in the RAID array, this will create a significant hot spot and bottleneck the performance on that disk. By default, a block group is 32768 blocks (128MB) in size, and it can only be changed in 8 block increments (32KB). This is, unfortunately, where it becomes impossible to provide an optimal one-size-fits-all option because the optimum will depend on the width of the array. With a 2^n number of disks in the array, unfortunately, every block group will in most cases start on the first disk in the array. There are three possible solutions:

In some cases, a combination of 1 and 2 may be necessary. Let’s have a look at an example.

Let’s say we have a 2-disk RAID0 array. With 16KB desired block size, and a minimum increment of blocks-per-group increment of 32KB, there is no way to make the block group align to 16KB. Other than enlarge the InnoDB page size to 32KB, the only thing we can do is increase the number of disks in the stripe to 3. With 3 disk RAID0 array optimised for 16KB alignment, the 32KB aligned block groups will always fall short of a full stripe by half of a minimum increment. This will result in the beginnings of block groups rotating around the underlying disks in a continuous 1-3-2-1-3-2… pattern, and thus spread the load evenly across all the disks in the array.

Caveat: Compression

One overlooked effect of using InnoDB compression is that it will completely break the above optimisation. If you are intending to use compression, you should instead optimise for the compressed block size rather than the regular block size. In InnoDB, we typically set KEY_BLOCK_SIZE=8. This means that pages that compress down to 8KB will be stored in an 8KB block rather than a 16KB block, and pages that don’t compress down to 8KB will be stored as two separate 8KB blocks. Since we optimised the storage stack for 16KB storage, this upset the alignment. Worse, some of those 8+8KB pages will end up straddling the disk boundaries, consuming an IOP from two disks rather than just one – this will, in the theoretical worst case, half our throughput. This is why on ZFS we tell ZFS that our desired recordsize is 16KB and let ZFS do the compression itself, and it will internally make sure that each 16KB logical block (even if it compresses), won’t straddle disk boundaries.

So, what can we do about this? The only thing we can do is, instead of optimising for 16KB block size, to optimise for 8KB block size. Assuming most of our InnoDB blocks compress down to 8KB, performance will be close to optimal, but it is worth bearing in mind that this will have an impact on all of the tables in our database – all pages in uncompressed tables will end up straddling two disks, thus consuming double the IOPS.

Would you like your storage stack optimised for MySQL? Are you having any kinds of database performance problems? Give Shattered Silicon a call, we can almost certainly help.