Archive for the ‘Oracle’ Category

ZFS Tuning for SSDs

December 1, 2011 Leave a comment

Update – Some of the suggestions below have been questioned for a typical ZFS setup.  To clarify, these settings should not be implemented on most ZFS installations.  The L2ARC is designed so that, by default, it will never hurt performance.  Some of the changes below can have a negative impact on workloads that are not using the L2ARC, and they accept the possibility of worse performance on some workloads in exchange for better performance on cache-friendly workloads.  These suggestions are intended for ZFS implementations where the flash cache is one of the main motivating factors for deploying ZFS – think high SSD-to-disk ratios.  In particular, these changes were tested with an Oracle database on ZFS.

In 2008, Sun Microsystems announced the availability of a feature in ZFS that could use SSDs as a read or write cache to accelerate ZFS.  A good write-up on the implementation of the level 2 adaptive replacement cache (L2ARC) by a member of the fishworks team is available here.  In 2008, flash SSDs were just starting to penetrate the enterprise storage market, and this cache was written with many of the early flash SSD issues in mind.  First, it warms quite slowly, defaulting to a maximum cache load rate of 8 MB/s.  Second, to stay out of a write-heavy path, it is explicitly set outside the data eviction path of the ZFS memory cache (ARC).  This prevents it from behaving like a traditional level 2 cache and causes it to fill more slowly, with mainly static data.  Finally, the default record size of the file system is rather big (128 KB), and the default assumption is that for sequential scans it is better to just read from the disks and skip the SSD cache.

Many of those assumptions about SSDs don’t line up with current-generation SSD products.  An enterprise-class SSD can write quickly, has high capacity, has higher sequential bandwidth than most disk storage systems, and has a long life span even under a heavy write load.  There are a few parameters that can be changed as a best practice when enterprise SSDs are being used for the L2ARC in ZFS:

Record Size

Change the record size to a much lower value than 128 KB.  The L2ARC fetches the full record on a read, and a 128 KB I/O to an SSD uses up device bandwidth and increases response time.

This can be set dynamically (for new files) with:

# zfs set recordsize <record size> <filesystem>
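For example, for an Oracle database with the common 8 KB `db_block_size`, the record size can be matched to the database block size (the pool and file system name below is a hypothetical placeholder):

```shell
# Set an 8 KB record size to match a typical Oracle db_block_size.
# "tank/oradata" is a hypothetical pool/filesystem name.
zfs set recordsize=8k tank/oradata

# Verify the setting; note it only applies to newly written files.
zfs get recordsize tank/oradata
```

Because the setting only affects new files, datafiles created before the change keep their old record size until they are rewritten.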


Change l2arc_write_max to a higher value.  Most SSDs specify device life in terms of full device write cycles per day.  For instance, say you have 700 GB of SSDs that support 10 device write cycles per day for 5 years.  This equates to a maximum write rate of 7000 GB/day, or about 83 MB/s.  As the setting is a maximum write rate, I would suggest at least doubling the drive’s specified rate.  Because the L2ARC is a read-only cache that can fail abruptly without impacting file system availability, the only risk of too high a write rate is wearing out the drive earlier.  This throttle was put in place when early SSDs with unsophisticated controllers were the norm.  Early SSDs could experience significant performance problems during writes, which would limit the performance of the very reads the cache was meant to accelerate.  Modern enterprise SSDs are orders of magnitude better at handling writes, so this is not a major concern.

On Solaris, this parameter is set by adding the following line to the /etc/system file:

set zfs:l2arc_write_max= <maximum bytes per second>
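As a sanity check on the arithmetic above, here is the endurance-spec-to-write-rate conversion as a short calculation (a sketch, using binary units, with the same hypothetical 700 GB / 10 cycles-per-day drive spec):

```python
# Convert an SSD endurance spec (full device writes per day) into a
# sustained write rate, then double it as suggested for l2arc_write_max.
capacity_gib = 700       # total L2ARC SSD capacity
cycles_per_day = 10      # rated full device write cycles per day

bytes_per_day = capacity_gib * cycles_per_day * 2**30
bytes_per_sec = bytes_per_day / 86400   # seconds in a day
mib_per_sec = bytes_per_sec / 2**20

print(round(mib_per_sec))       # ~83 MB/s, the spec'd sustained rate
print(round(mib_per_sec * 2))   # ~166 MB/s, a doubled l2arc_write_max
```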


Set l2arc_noprefetch=0.  By default this is set to 1, which skips the L2ARC for the prefetch reads used in sequential scanning.  The idea here is that disks are good at sequential I/O, so just read from them.  With PCIe SSDs readily available with multiple GB/s of bandwidth, even sequential workloads can get a significant performance boost.  Changing this parameter puts the L2ARC in the prefetch read path and can make a big difference for workloads that have a sequential component.

On Solaris, this parameter is set by adding the following line to the /etc/system file:

set zfs:l2arc_noprefetch=0
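On a running Solaris system, the same variable can also be inspected and changed without a reboot using the kernel debugger (a sketch; requires root, and the change does not persist across reboots, so keep the /etc/system entry as well):

```shell
# Check the current value of l2arc_noprefetch in the live kernel.
echo "l2arc_noprefetch/X" | mdb -k

# Set it to 0 in the live kernel (root required; not persistent).
echo "l2arc_noprefetch/W 0" | mdb -kw
```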
Categories: Architectures, Oracle, PCIe, SSD, tuning

Linux Elevator Settings Impact on SSDs

Recently, I was helping with a project where the performance difference of Oracle on disks vs. on SSDs was being showcased.  A database was created on 11g, and a simple script queried random subsets of rows.  The database was set up so we could flip between the disks and SSDs quickly by just changing the preferred read failure group in ASM.  Something very odd was happening.  The SSD was faster than the 48 disks we were comparing against, but only 3 times as fast.  This would be fine for a production database doing actual work, but in a benchmark we controlled, we needed to show at least a 10x improvement.

In cases where a system isn’t behaving as expected, it is helpful to look at the basic statistics where different subsystems meet.  In this case, we looked at iostat to see what the storage looked like to the host.  The response time of the SSD was about 2 milliseconds even though the IOPS were only around 10,000 – well below the point where the SSD starts to get stressed.  We stopped the database, pulled out a synthetic benchmark, ran a random load against the SSD, and saw that at the 10,000 IOPS level the response time was 0.3 ms – about what we would expect.  We switched back to the query and again: 10,000 IOPS, 2 ms response time – why?  What was Oracle doing differently from a simple benchmark?
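The post doesn’t name the synthetic benchmark used; a comparable random-read load could be generated with a tool like fio (the device path, block size, and queue depth below are placeholder assumptions):

```shell
# Random 8 KB direct-I/O reads against the SSD device for 30 seconds.
# /dev/sdb is a placeholder; --direct=1 bypasses the page cache so the
# reported latency reflects the device rather than memory.
fio --name=randread --filename=/dev/sdb --direct=1 \
    --ioengine=libaio --rw=randread --bs=8k \
    --iodepth=16 --runtime=30 --time_based --group_reporting
```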

We thought for a while, and the engineer I was working with had a “eureka moment.”  We referred to the integration guide that we had written for our customers to use:

To further improve the performance on Linux, append the kernel parameter
“elevator=noop” to disable the I/O scheduler. This will help reduce latency on
small-block requests to the RamSan. This will also greatly improve performance
of mixed reads and writes to the filesystem.

Example entry in /boot/grub/menu.lst (/etc/grub.conf):

title Red Hat Enterprise Linux Server (2.6.18-164.el5)
root (hd0,0)
kernel /vmlinuz-2.6.18-164.el5 ro root=LABEL=/ elevator=noop rhgb quiet
initrd /initrd-2.6.18-164.el5.img
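Beyond the boot-time kernel parameter, the scheduler can also be checked and switched per device at runtime through sysfs (the device name is a placeholder, and the change does not survive a reboot):

```shell
# Show the available schedulers; the active one appears in brackets.
cat /sys/block/sdb/queue/scheduler

# Switch this device to the noop elevator (root required).
echo noop > /sys/block/sdb/queue/scheduler
```

This is handy for A/B testing the elevator’s impact before committing the grub change.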

We implemented the settings change, ran the query again, and saw exactly what we expected: 0.3 ms response time, >50,000 IOPS, and the CPU was now the bottleneck in the database benchmark.  We added a second RAC node and hit 100,000 IOPS through the database.

The Linux I/O scheduler is designed to reorder I/Os to disks to reduce head thrash and thus lower the average response time.  In the case of our simple I/O test, the requests were truly random and arrived at a regular interval, so it didn’t bother trying to reorder them.  With the Oracle tests, the load was less random and burstier, and there were some writes to the files that control the database mixed in.  The scheduler saw that it could “help” and spent time reordering I/Os, under the assumption that delaying the I/Os a little to make the disks more efficient was a good trade-off.  Needless to say, this is a bad assumption for SSDs, and it is a good example of optimizations put in place for disks becoming obsolete and just slowing down overall performance.

Categories: Disk, Oracle, SSD, tuning