Update – Some of the suggestions below have been questioned for a typical ZFS setup. To clarify, these settings should not be implemented on most ZFS installations. The L2ARC is designed so that, by default, it will never hurt performance. Some of the changes below can have a negative impact on workloads that are not using the L2ARC, and they accept the possibility of worse performance in some workloads in exchange for better performance with cache-friendly workloads. These suggestions are intended for ZFS implementations where the flash cache is one of the main motivating factors for deploying ZFS – think high SSD-to-disk ratios. In particular, these changes were tested with an Oracle database on ZFS.
In 2008 Sun Microsystems announced the availability of a feature in ZFS that could use SSDs as a read or write cache to accelerate ZFS. A good write-up on the implementation of the level 2 adaptive replacement cache (L2ARC) by a member of the fishworks team is available here. In 2008, flash SSDs were just starting to penetrate the enterprise storage market, and this cache was written with many of the early flash SSD issues in mind. First, it warms quite slowly, defaulting to a maximum cache load rate of 8 MB/s. Second, to keep it out of a write-heavy path, it is explicitly set outside of the data eviction path from the ZFS memory cache (ARC). This prevents it from behaving like a traditional level 2 cache and causes it to fill more slowly, mainly with static data. Finally, the default record size of the file system is rather large (128 KB), and the default assumption is that for sequential scans it is better to just read from the disk and skip the SSD cache.
Many of these assumptions don't line up with current-generation SSD products. An enterprise-class SSD can write quickly, has high capacity, has higher sequential bandwidth than most disk storage systems, and has a long life span even under a heavy write load. There are a few parameters that can be changed as a best practice when enterprise SSDs are being used for the L2ARC in ZFS:
Change the record size to a much lower value than 128 KB. The L2ARC fetches the full record on a read, and a 128 KB I/O to an SSD uses up device bandwidth and increases response time.
This can be set dynamically (for new files) with:
# zfs set recordsize <record size> <filesystem>
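For example, to match an 8 KB database block size (the pool/filesystem name tank/oradata and the 8K value below are illustrative, not from the original setup):

```shell
# Set a smaller record size on a hypothetical database filesystem;
# only files created after the change pick up the new record size.
zfs set recordsize=8K tank/oradata

# Verify the setting
zfs get recordsize tank/oradata
```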
Change l2arc_write_max to a higher value. Most SSDs specify device life in terms of full device write cycles per day. For instance, say you have 700 GB of SSDs that support 10 device cycles per day for 5 years. This equates to a maximum write rate of 7,000 GB/day, or about 83 MB/s. As the setting is the maximum write rate, I would suggest at least doubling the spec'd drive maximum. Because the L2ARC is a read-only cache that can abruptly fail without impacting file system availability, the only risk of too high a write rate is wearing out the drive earlier. This throttle was put in place when early SSDs with unsophisticated controllers were the norm. Early SSDs could experience significant performance problems during writes that would limit the performance of the reads the cache was meant to accelerate. Modern enterprise SSDs are orders of magnitude better at handling writes, so this is not a major concern.
On Solaris, this parameter is set by adding the following line to the /etc/system file:
set zfs:l2arc_write_max= <maximum bytes per second>
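As a sketch of the arithmetic above, using the hypothetical 700 GB drive rated for 10 device writes per day, doubling the spec'd rate works out to roughly 166 MB/s:

```shell
# 700 GB * 10 device writes/day = 7,000 GB/day of rated endurance.
# Double it, per the suggestion above, and convert to bytes/second:
echo $((2 * 7000 * 1024 * 1024 * 1024 / 86400))
# prints 173985943 (bytes/s), i.e. roughly 166 MB/s
```

That number would then go into /etc/system as `set zfs:l2arc_write_max=173985943`.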
Set l2arc_noprefetch=0. By default this is set to one, which skips the L2ARC for the prefetch reads used in sequential reading. The idea here is that disks are good at sequential I/O, so just read from them. With PCIe SSDs readily available with multiple GB/s of bandwidth, even sequential workloads can get a significant performance boost. Changing this parameter puts the L2ARC in the prefetch read path and can make a big difference for workloads that have a sequential component.
On Solaris, this parameter is set by adding the following line to the /etc/system file:
set zfs:l2arc_noprefetch=0
Comparing storage performance is a bit more difficult than meets the eye. This comes up quite a bit, as I frequently address the differences between RAM- and Flash-based SSDs. It comes up in particular when comparisons are made between TMS's flagship Flash system, the RamSan-630, and its flagship RAM system, the RamSan-440. The Flash system has higher IOPS, but the RAM system has lower latency.
There are three independent metrics of storage performance: response time, IOPS, and bandwidth. Understanding the relationships between these metrics is the key to understanding storage performance.
Bandwidth is really just a limitation of the design or standards used to connect storage. It is the maximum number of bytes that can be moved in a specific time period; response time overhead and concurrency do not play a factor. IOPS are nothing more than the number of I/O transactions that can be performed in a single second. Determining the maximum theoretical IOPS for a given transfer size is as simple as dividing the maximum bandwidth by the transfer size. For a storage system with a single 1 Gbps iSCSI connection (~100 MB/s of bandwidth) and a workload of 64 KB transfers, the maximum IOPS will be ~1,500. If the transfer size is a single sector (512 bytes), the maximum IOPS will be ~200,000 – a notable difference. At this upper limit, bandwidth will more than likely not be the performance limiter.
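The division above can be checked with shell arithmetic (the ~100 MB/s figure is the rounded iSCSI number from the text):

```shell
# Max theoretical IOPS = bandwidth / transfer size
BW=$((100 * 1024 * 1024))      # ~100 MB/s from a 1 Gbps iSCSI link

echo $((BW / (64 * 1024)))     # 64 KB transfers -> 1600 IOPS (~1,500)
echo $((BW / 512))             # 512 B sectors   -> 204800 IOPS (~200,000)
```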
There is another relationship that ties together response time and concurrency – Little’s law. Little’s law governs the concurrency of a system needed to achieve a desired amount of throughput. For storage, Little’s Law is: (Outstanding I/Os) ÷ (response time) = IOPS. I consider this the most important formula in storage performance. If you boil this down, ultimately the limitation of IOPS performance is the ability of a system to handle Outstanding I/Os concurrently. Once that limit is reached, the I/Os get clogged up and the response time increases rapidly. This is the reason a common tactic to increase storage performance has been to simply add disks – each additional disk increases the concurrent I/O capabilities.
Interestingly, IOPS performance isn't limited by the response time. Lower response times merely allow a given level of IOPS to be achieved at lower levels of concurrency. There are practical limits on the level of concurrency that can be achieved by the interfaces to the storage (e.g., the execution throttle setting in an HBA), and many applications have fairly low levels of concurrent I/O, but response time by itself does not limit IOPS. This is why, even though Flash media has a higher response time than RAM, Flash systems that handle a high level of concurrent I/O can achieve IOPS performance as good as or better than RAM-based systems.
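A quick sketch of Little's law with made-up numbers: 32 outstanding I/Os at a 0.3 ms response time sustain over 100,000 IOPS:

```shell
# Little's law: IOPS = (outstanding I/Os) / (response time in seconds)
awk 'BEGIN { printf "%.0f IOPS\n", 32 / 0.0003 }'   # -> 106667 IOPS
```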
Recently, I was helping with a project where the performance difference of Oracle on disks vs. on SSDs was being showcased. A database was created on 11g, and a simple script queried random subsets of rows. The database was set up so we could flip between the disks and SSDs quickly by just changing the preferred read failure group in ASM. Something very odd was happening: the SSD was faster than the 48 disks we were comparing against, but only 3 times as fast. This would be fine for a production database doing actual work, but in a benchmark we controlled, we needed to show at least a 10x improvement.
In cases where a system isn't behaving as expected, it is helpful to look at the basic statistics where different subsystems meet. In this case, we looked at iostat to see what the storage looked like to the host. The response time of the SSD was about 2 milliseconds even though the IOPS were only around 10,000 – well below the point where the SSD starts to get stressed. We stopped the database, pulled out a synthetic benchmark, and ran a random load against the SSD: at the 10,000 IOPS level the response time was 0.3 ms – about what we would expect. We switched back to the query and again: 10,000 IOPS, 2 ms response time – why? What was Oracle doing differently than a simple benchmark?
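As a back-of-the-envelope sketch, Little's law (rearranged as outstanding I/Os = IOPS × response time) shows how different the queue depths were at the two response times measured, for the same 10,000 IOPS:

```shell
# Outstanding I/Os = IOPS * response time (in seconds)
awk 'BEGIN { printf "%.0f\n", 10000 * 0.002  }'   # 2 ms   -> 20 outstanding I/Os
awk 'BEGIN { printf "%.0f\n", 10000 * 0.0003 }'   # 0.3 ms -> 3 outstanding I/Os
```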
We thought for a while, and the engineer I was working with had a "eureka moment." We referred to the integration guide that we had written for our customers to use:
To further improve the performance on Linux, append the kernel parameter "elevator=noop" to disable the I/O scheduler. This will help reduce latency on small-block requests to the RamSan. This will also greatly improve performance of mixed reads and writes to the filesystem. Example entry in /boot/grub/menu.lst (/etc/grub.conf):
title Red Hat Enterprise Linux Server (2.6.18-164.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-164.el5 ro root=LABEL=/ elevator=noop rhgb quiet
    initrd /initrd-2.6.18-164.el5.img
We implemented the settings change, ran the query again, and saw exactly what we expected: 0.3 ms response time, >50,000 IOPS, and the CPU was now the bottleneck in the database benchmark. We added a second RAC node and hit 100,000 IOPS through the database.
The Linux I/O scheduler is designed to reorder I/Os to disks to reduce seek thrash and thus lower the average response time. In the case of our simple I/O test, the requests were truly random and arrived at a regular interval, so it didn't bother trying to reorder them. With the Oracle tests, the load was less random and burstier, and there were some writes to the files that control the database mixed in. The scheduler saw that it could "help" and spent time reordering I/Os under the assumption that delaying them a little to make the disks more efficient was a good tradeoff. Needless to say, this is a bad assumption for SSDs, and it is a good example of optimizations put in place for disks becoming obsolete and just slowing down overall performance.
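On these kernels the scheduler can also be inspected and switched per block device at runtime via sysfs, without editing grub or rebooting (the device name sdb below is a placeholder for the SSD-backed device):

```shell
# Show the available schedulers; the active one appears in brackets
cat /sys/block/sdb/queue/scheduler

# Switch the device to the noop elevator (run as root)
echo noop > /sys/block/sdb/queue/scheduler
```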