Linux Elevator Settings Impact on SSDs
Recently, I was helping with a project where the performance difference of Oracle on disks vs. on SSDs was being showcased. A database was created on 11g, and a simple script queried random subsets of rows. The database was set up so we could flip between the disks and SSDs quickly by just changing the preferred read failure group in ASM. Something very odd was happening: the SSD was faster than the 48 disks we were comparing against, but only 3 times as fast. That would be fine for a production database doing actual work, but in a benchmark we controlled, we needed to show at least a 10x improvement.
In cases where a system isn’t behaving as expected, it is helpful to look at the basic statistics where different subsystems meet. In this case, we used iostat to see what the storage looked like from the host. The response time of the SSD was about 2 milliseconds even though it was doing only around 10,000 IOPS – well below the point where the SSD starts to get stressed. We stopped the database, pulled out a synthetic benchmark, and ran a random load against the SSD: at the 10,000 IOPS level the response time was 0.3 ms – about what we would expect. We switched back to the query and again saw 10,000 IOPS at a 2 ms response time. Why? What was Oracle doing differently than a simple benchmark?
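A quick way to see why those two latency numbers were suspicious is Little's law: average concurrency equals throughput times latency. This back-of-the-envelope check is our own illustration, not from the original measurements, but it shows that at the same 10,000 IOPS, the 2 ms case had roughly 20 I/Os sitting somewhere in the stack, versus about 3 in flight for the 0.3 ms case:

```shell
# Little's law: I/Os in flight = IOPS x average latency (in seconds).
# Same throughput, very different queue depths.
awk 'BEGIN {
    printf "Oracle run:    ~%.0f I/Os in flight\n", 10000 * 0.002   # 2 ms
    printf "synthetic run: ~%.0f I/Os in flight\n", 10000 * 0.0003  # 0.3 ms
}'
```

Those extra queued I/Os had to be waiting somewhere between the database and the device.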
We thought for a while, and then the engineer I was working with had a “eureka moment.” We referred to the integration guide that we had written for our customers to use:
To further improve the performance on Linux, append the kernel parameter “elevator=noop” to disable the I/O scheduler. This will help reduce latency on small-block requests to the RamSan. This will also greatly improve performance of mixed reads and writes to the filesystem. Example entry in /boot/grub/menu.lst (/etc/grub.conf):

    title Red Hat Enterprise Linux Server (2.6.18-164.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-164.el5 ro root=LABEL=/ elevator=noop rhgb quiet
        initrd /initrd-2.6.18-164.el5.img
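The change itself is mechanical: add elevator=noop to the kernel line of the boot entry. The sketch below demonstrates the edit against a throwaway copy of a menu.lst-style file (the sample entry is illustrative; on a real system, edit /boot/grub/menu.lst itself, carefully, as root):

```shell
# Work on a temporary copy of a GRUB-legacy style config (sample content).
cp_conf=$(mktemp)
cat > "$cp_conf" <<'EOF'
title Red Hat Enterprise Linux Server (2.6.18-164.el5)
    root (hd0,0)
    kernel /vmlinuz-2.6.18-164.el5 ro root=LABEL=/ rhgb quiet
    initrd /initrd-2.6.18-164.el5.img
EOF

# Append elevator=noop to each kernel line that does not already have an
# elevator= parameter. Kernel parameter order does not matter.
sed -i '/^[[:space:]]*kernel /{/elevator=/!s/$/ elevator=noop/}' "$cp_conf"

grep kernel "$cp_conf"
```

The setting takes effect at the next boot, since it is read from the kernel command line.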
We implemented the settings change, ran the query again, and saw exactly what we expected: 0.3 ms response time, more than 50,000 IOPS, and the CPU was now the bottleneck in the database benchmark. We added a second RAC node and hit 100,000 IOPS through the database.
The Linux I/O scheduler is designed to reorder I/Os to disks to reduce head thrash and thus lower the average response time. In the case of our simple I/O test, the requests were truly random and arrived at a regular interval, so the scheduler didn’t bother trying to reorder them. The Oracle workload was less random, burstier, and mixed in some writes to the files that control the database. The scheduler saw that it could “help” and spent time reordering I/Os, on the assumption that delaying them a little to make the disks more efficient was a good trade-off. Needless to say, this is a bad assumption for SSDs, and it is a good example of optimizations put in place for disks becoming obsolete and simply slowing down overall performance.
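If you want to confirm which scheduler a device is actually using, or try noop without waiting for a reboot, the kernel exposes it per device through sysfs. A small sketch (device names like sda are examples; writing to the file requires root):

```shell
# Show the active scheduler (the bracketed entry) for every block device.
for f in /sys/block/*/queue/scheduler; do
    if [ -r "$f" ]; then
        echo "$f: $(cat "$f")"
    fi
done

# To switch one device to noop at runtime, as root (example device name):
#   echo noop > /sys/block/sda/queue/scheduler
```

A runtime switch is handy for testing, but it does not survive a reboot, which is why the guide puts elevator=noop on the kernel command line.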