ZFS can extend the ARC with one or more L2ARC devices, which provide the most benefit for random read workloads. L2ARC devices should be faster and/or lower latency than the storage pool, which generally limits the useful choices to flash-based devices. In very large pools, finding devices faster than the pool itself may be difficult. In smaller pools it may be tempting to use a spinning disk as a dedicated L2ARC device, but this will generally result in lower pool performance (and definitely lower capacity) than simply placing that disk in the pool. There may be scenarios on low-memory systems where a single 15K SAS disk can improve the performance of a small pool of 5.4K or 7.2K RPM drives, but this is not a typical case.
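For reference, a cache device is attached with zpool add and the cache keyword; a minimal sketch, where the pool name tank and the device ada2 are placeholders:

zpool add tank cache ada2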

By default the L2ARC does not attempt to cache streaming/sequential workloads, on the assumption that the combined throughput of your pool disks exceeds the throughput of the L2ARC devices, and that this workload is therefore best left for the pool disks to serve. This is usually the case. If you believe otherwise (number of L2ARC devices × their max throughput > number of pool disks × their max throughput), this behavior can be toggled with the following sysctl:

vfs.zfs.l2arc_noprefetch

The default value of 1 does not allow caching of streaming and/or sequential workloads. Switching it to 0 allows streaming/sequential reads to be cached. Note that this is a run-time sysctl that is read at pool import time, which means the default mechanism for setting sysctls on FreeBSD will not work: values set in /etc/sysctl.conf are applied after ZFS pools are imported.
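Given that caveat, two hedged ways to apply the setting. It can go in /boot/loader.conf, so the value is in place before pools are imported at boot (this assumes your FreeBSD version honors vfs.zfs.* tunables from the loader):

vfs.zfs.l2arc_noprefetch="0"

Or on a running system, set the sysctl and then re-import the pool (the pool name tank is a placeholder):

sysctl vfs.zfs.l2arc_noprefetch=0
zpool export tank
zpool import tank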

The default throttle on loading the L2ARC device is 8 Mbytes/sec, on the assumption that the L2ARC is warming up from a random read workload served by spinning disks, for which 8 Mbytes/sec is usually more than the spinning disks can provide. For example, at a 4 Kbyte I/O size this is 2048 random disk IOPS, which may take at least 20 pool disks to drive. Raising the throttle above 8 Mbytes would therefore make no difference in many configurations, because the pool simply cannot provide more random IOPS. The downside of raising it is CPU consumption: the L2ARC feed thread periodically scans the ARC for buffers to cache, and the amount of scanning is based on the throttle size. If you raise the throttle but the pool disks cannot keep up, you burn CPU needlessly. In extreme cases of tuning, this can consume an entire CPU for the ARC scan.
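Spelled out, and assuming roughly 100 random IOPS per spinning disk:

8 Mbytes/sec / 4 Kbytes per I/O = 2048 IOPS
2048 IOPS / ~100 IOPS per disk = ~20 disks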

If you are using the L2ARC in its typical use case (say, fewer than 30 pool disks, caching a random read workload of ~4 Kbyte I/Os that is mostly being pulled from the pool disks), then 8 Mbytes/sec is usually sufficient. If you are outside this typical case (say, you are caching streaming workloads, or have several dozen disks), then you may want to consider raising the rate. Modern L2ARC devices (SSDs) can handle an order of magnitude more than the default. The rate can be tuned by setting the following sysctls:

vfs.zfs.l2arc_write_max
vfs.zfs.l2arc_write_boost

The former sets the run-time maximum rate at which data is loaded into the L2ARC. The latter can be used to accelerate loading on a freshly booted system. Note that the same caveats about sysctls and pool imports apply as for the previous one. While you can improve the L2ARC warmup rate this way, keep an eye on increased CPU consumption due to scanning by l2arc_feed_thread(); for example, use DTrace to profile on-CPU thread names.
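As a sketch only (both tunables are expressed in bytes, and the 64 Mbyte and 128 Mbyte values below are illustrative, not recommendations), the rates could be raised in /boot/loader.conf so they apply before pools are imported:

vfs.zfs.l2arc_write_max="67108864"
vfs.zfs.l2arc_write_boost="134217728"

For the CPU-consumption concern, one common DTrace one-liner that samples on-CPU kernel thread names (assuming the dtrace kernel modules are loaded) is:

dtrace -n 'profile-997 { @[stringof(curthread->td_name)] = count(); }'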

The known caveats:

There's no free lunch. A properly tuned L2ARC will increase read performance, but it comes at the price of decreased write performance: the pool essentially magnifies writes, since data is written both to the pool and to the L2ARC device. Another interesting effect that has been observed is a fall-off in L2ARC performance when doing a streaming read from the L2ARC while simultaneously doing a heavy write workload. My conjecture is that the writes cause cache thrashing, but this hasn't been confirmed at this time.

Given a working set close to the ARC size, an L2ARC can actually hurt performance. If a system has a 14 GB ARC and a 13 GB working set, adding an L2ARC device will rob ARC space to hold the headers that map the L2ARC. If the reduced ARC size is smaller than the working set, reads will be evicted from the ARC into the (ostensibly slower) L2ARC.
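To gauge this overhead on a running system, the current ARC size and the ARC space consumed by L2ARC headers can be read from the arcstats (assuming your ZFS version exposes these counters):

sysctl kstat.zfs.misc.arcstats.size
sysctl kstat.zfs.misc.arcstats.l2_hdr_size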

Multiple L2ARC devices are concatenated; there's no provision for mirroring them. If a heavily used L2ARC device fails, the pool will continue to operate, just with reduced performance. There's also no provision for striping reads across multiple devices: if the blocks for a file end up on multiple devices you'll see something like striping, but there's no way to force this behavior.
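For example (tank, da1, and da2 are placeholders), additional cache devices are simply listed together, and their individual utilization can be inspected afterwards:

zpool add tank cache da1 da2
zpool iostat -v tank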

Be very careful when adding devices to a production pool. By default, zpool add stripes new vdevs into the pool. If you omit the cache keyword you'll end up striping the device you intended to add as an L2ARC into the pool, and the only way to remove it will be backing up the pool, destroying it, and recreating it.
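A cautious habit (pool and device names are placeholders) is to dry-run the command with -n first, which prints the resulting layout without changing anything, and to confirm the device ends up under the cache section:

zpool add -n tank cache gpt/l2arc0
zpool add tank cache gpt/l2arc0
zpool status tank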

Many SSDs benefit from 4K alignment. Using gpart and gnop on L2ARC devices can help accomplish this. Because the pool ID isn't stored on hot spare or L2ARC devices, they can get lost if the system changes device names.
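One commonly cited FreeBSD sequence for this (a sketch only; da1 and the l2arc0 label are placeholders, and the gnop shim may be unnecessary on releases that already honor 4K sector hints) creates an aligned, labeled partition and adds the cache device by label, which also guards against device renumbering:

gpart create -s gpt da1
gpart add -t freebsd-zfs -a 4k -l l2arc0 da1
gnop create -S 4096 /dev/gpt/l2arc0
zpool add tank cache /dev/gpt/l2arc0.nop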

The caveat about only giving ZFS whole devices is a Solarism that doesn't apply to FreeBSD. On Solaris, write caches are disabled on drives if partitions are handed to ZFS; on FreeBSD this isn't the case.