## SSD Sustained Write Performance

There is a lot of confusion out there about how SSDs handle writes, so I thought it would be worthwhile to shed some light on the subject. Flash chips can only be erased a whole block (e.g. 256 KB) at a time, so data cannot be rewritten in place. Since applications write in much smaller chunks than this, the controller runs a Flash Translation Layer that handles write I/Os. I’ll try to summarize the details here.

The Flash Translation Layer keeps an index mapping the logical addresses presented to the host to the physical locations where the data is actually written. Under a continuous small-block random write workload, eventually there are no pre-erased blocks left to write to. The flash controller then has to perform “moves” of data out of blocks that hold a mix of valid and stale data. These moves consolidate the valid data so that the controller ends up with a few completely full blocks and a freshly erased block to write into. A move operation ties up the chip and is just a little less work than a write. The amount of over-provisioned capacity determines how many background moves must be performed for each incoming write under the worst possible scenario.
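The core of the FTL can be sketched with a toy model. This is an illustration only, not how any real controller is built: a dictionary maps logical block addresses to physical locations, overwrites land in a fresh location, and the superseded copy sits stale until its erase block is eventually reclaimed.

```python
# Toy sketch of the FTL's core idea: an index from logical addresses
# (what the host sees) to physical locations. Overwrites never happen
# in place; the new data lands in a fresh location and the old copy
# turns stale. All names and sizes here are illustrative.

ftl_index = {}            # logical block address -> physical location
next_free = 0             # next never-written physical location (simplified)
stale = set()             # physical locations holding superseded data

def host_write(lba):
    global next_free
    if lba in ftl_index:
        # The old copy cannot be erased on its own; it stays stale
        # until its whole erase block is reclaimed by a "move".
        stale.add(ftl_index[lba])
    ftl_index[lba] = next_free
    next_free += 1

for lba in [7, 3, 7, 7]:            # overwrite LBA 7 twice
    host_write(lba)

print(ftl_index[7])   # 3  (the latest copy lives at the newest location)
print(sorted(stale))  # [0, 2]  (two stale copies awaiting reclamation)
```

Note that after only four host writes, half of the physical locations touched already hold stale data, which is exactly what the background moves exist to clean up.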

For example, on the RamSan-20 datasheet we list the sustained number for our random writes – 50,000 IOPS. Outside of this worst case you can achieve 160,000 write IOPS. The amount of over-provisioned space (~30%) is not arrived at by accident, and the sustained figure being roughly 1/3 of peak is not a coincidence either. The amount of extra space available determines the maximum number of moves needed to free a completely empty block – once all of the spare area is used up, effectively every block holds about as much stale data as the over-provisioning percentage. So (ignoring the complexities of the actual math) you have to perform a little less than 1/(the over-provisioning fraction) moves in the absolute worst case to get a full block to write to. With a very small over-provisioning percentage, the number of moves that have to take place can be very high. These moves keep the chips busy almost as long as a formal write, so the amount of write IOPS you can perform declines.
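A back-of-the-envelope check shows the shape of these numbers. The exact accounting is more involved (as noted above, the complexities of the actual math are being ignored), and this is not the RamSan-20’s real arithmetic, just the rough 1/(over-provisioning) approximation applied to the quoted figures:

```python
# Rough worst-case arithmetic, assuming ~30% spare area and the quoted
# 160,000 IOPS peak. Not the vendor's actual model -- just the 1/p
# approximation from the text.

over_provisioning = 0.30          # ~30% spare area
peak_write_iops = 160_000         # best case: pre-erased blocks available

worst_case_moves = 1 / over_provisioning   # "a little less than" this
# Each move keeps a chip busy almost as long as a write, so every host
# write effectively costs about (1 + moves) chip operations.
sustained_iops = peak_write_iops / (1 + worst_case_moves)

print(round(worst_case_moves, 1))   # at most ~3.3 moves per incoming write
print(round(sustained_iops))        # ~37,000 -- same ballpark as the quoted 50,000
```

The estimate lands a little below the 50,000 IOPS on the datasheet, which is consistent with the moves being “a little less” than both 1/p in number and a full write in cost.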

So why use ~30% over-provisioned space? Because the performance improvement hits diminishing returns as over-provisioning increases. The figure below graphs the function 1/(over-provisioning fraction), a rough approximation of the number of moves required in the worst-case scenario:
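The diminishing returns are easy to see by tabulating that same 1/p approximation for a few spare-area percentages:

```python
# Worst-case moves per incoming write as a function of over-provisioning,
# using the rough 1/p approximation from the text.

for pct in (5, 10, 20, 30, 40, 50):
    moves = 1 / (pct / 100)
    print(f"{pct:>3}% spare -> ~{moves:4.1f} moves per write")
```

Going from 5% to 10% spare area cuts the worst case from ~20 moves to ~10, while going from 30% to 40% only saves about 0.8 of a move, so spending much more than ~30% of the flash on spare area buys very little.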

It is worth noting that real-world applications are unlikely to perform 100% random writes, so a good flash controller uses the slack to perform background move operations that defragment the valid and stale data. To reach the worst case, you have to write randomly across all of the capacity without ever stopping to read. While some logging applications do sustain this type of constant write workload, their writes are sequential, never random. Nevertheless, it is still important to understand this concept to accurately compare different devices’ performance. It is tempting to reduce the amount of over-provisioned space to lower the cost per GB, but then the number of worst-case moves can become a real problem. Keep in mind that the over-provisioned space also shrinks over time, as it additionally serves to replace flash blocks that have worn out.

Some controllers may not actively defragment the space, to save on controller costs, so the worst-case performance becomes the “real” performance after the drive has been fully written a few times. This is becoming the exception, but it is still something to look out for. The performance decline in the sustained random write case can also easily be missed if the host driving the workload has no way to keep all of the chips busy all of the time, because triggering the decline requires hitting the flash as hard as possible. But with SSDs priced at a premium over disks on the strength of their performance, limiting performance all of the time just to avoid a decline in the worst case isn’t a good design tradeoff. For more information on write handling, the performance test specification from SNIA’s Solid State Storage Initiative goes into quite a bit of depth.

The SNIA link at the end gives a “Page not found” error. An updated link would be appreciated.

I have known for some time that write amplification follows roughly this pattern as a function of over-provisioning, but it was nice to see it explained this way with worst-case numbers. This also confirms my thought that sustained random write performance is (roughly) inversely proportional to write amplification.

Thanks for the notice on the SNIA link, I have updated it.