Flash Memory Errors – It’s Not Just About Write Endurance
Increasing the capacity of a single Flash chip is a driving focus in the semiconductor industry, and it has led to some interesting design decisions to increase density. The first is that high-capacity Flash chips include multiple internal Flash dies (typically four today). Putting four dies into a single chip lowers the packaging costs and reduces the footprint on the printed circuit board (PCB). Using smaller dies also increases the yield from the semiconductor foundries, since a defect on the wafer results in a smaller area being scrapped. In addition, each die is divided into two areas called planes. These planes are effectively independent regions of the die that can work on independent operations, but they share a common interface to the die. The SNIA whitepaper on NAND Reliability illustrates this construction on page 3.
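To make the arithmetic concrete, here is a minimal sketch of that chip geometry. The counts (four dies, two planes per die) come from the description above; the function names and the "plane"/"die"/"chip" labels are hypothetical and used only for illustration.

```python
from fractions import Fraction

# Geometry described in the text: typically four dies per chip, two planes per die.
DIES_PER_CHIP = 4
PLANES_PER_DIE = 2


def planes_per_chip(dies=DIES_PER_CHIP, planes=PLANES_PER_DIE):
    """Total number of independent planes inside one Flash chip."""
    return dies * planes


def fraction_lost(failure, dies=DIES_PER_CHIP, planes=PLANES_PER_DIE):
    """Fraction of a chip's capacity taken offline by a localized failure.

    `failure` is a hypothetical label: "plane", "die", or "chip".
    """
    if failure == "plane":
        return Fraction(1, dies * planes)
    if failure == "die":
        return Fraction(1, dies)
    if failure == "chip":
        return Fraction(1)
    raise ValueError(f"unknown failure type: {failure!r}")


if __name__ == "__main__":
    print("planes per chip:", planes_per_chip())
    for kind in ("plane", "die", "chip"):
        print(f"{kind} failure removes {fraction_lost(kind)} of the chip")
```

With these numbers a single plane is one eighth of the chip and a single die is one quarter, which is why a localized failure leaves most of the chip usable.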
This internal construction has long been leveraged by Flash controllers to make garbage collection and write handling more efficient, because internal operations (such as page “move” operations; see my post on sustained write performance) can be carried out within each plane in parallel. With eight planes in a typical chip, the external pins that connect to the Flash chip have been the ultimate performance limiter.
What has not been widely exploited, however, is that many of the failure modes within a Flash chip come from localized issues that only take a single plane or die offline, leaving the remainder of the chip fully functional. Failure can take several forms; for example, the program and erase controls for a single plane can fail, or a defect on the chip can render the blocks within a plane unreadable. These failures are actually fairly common occurrences and are a reason that virtually all enterprise SSDs leverage a fine-grained RAID of the Flash media (this presentation discusses Flash construction and shows several types of failure with highly magnified images).
A Flash chip is the size of a postage stamp and is designed to be surface mounted onto a PCB. A chip doesn’t fit into the typical field-replaceable unit design that storage engineers are used to without adding significant expense and failure points, which creates a problem when typical RAID designs are applied to Flash. The ability to easily field-replace the disk form factor has been one of the main reasons SSDs in a disk form factor have been so prominent. However, this approach moves the RAID from a granular chip level to one encompassing many chips, requires many Flash chips to be replaced at once, and creates a performance choke point. Flash is expensive, and you don’t want to waste a lot of good Flash when there is just a single partial chip error.
TMS was awarded a patent for an innovative solution to this quandary. The core building block of TMS’ latest products is the Series-7 Flash Controller™, which implements a RAID array across ten chips and leverages the overprovisioned capacity to handle block wear outs and partial chip failures. TMS’ Variable Stripe RAID (VSR)™ technology leverages the understanding that plane and die level failures are much more common than complete chip failures, and uses the overprovisioned space to effectively handle failures and avoid requiring maintenance.
First, VSR only maps out the failed portion of the chip rather than the complete chip. Typically, only 1/8 to 1/4 of a chip (a single plane or a single die) is taken offline rather than the entire chip. Second, it shrinks the stripe around the partial chip failure from a 9+1 RAID layout to 8+1. This process is complex, but it allows RAID protection to stay in place through normal block wear out and chip failures.
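The idea of shrinking a stripe can be sketched in a few lines. This is not TMS’ implementation, which isn’t described here; it is a simplified illustration that assumes single XOR parity, treats each stripe member as a small byte chunk, and ignores physical placement. All function names are hypothetical.

```python
from functools import reduce


def xor_parity(chunks):
    """XOR equal-length byte chunks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def build_stripe(data_chunks):
    """Return the data chunks followed by one parity chunk (N+1 layout)."""
    data_chunks = list(data_chunks)
    return data_chunks + [xor_parity(data_chunks)]


def recover(stripe, lost_index):
    """Rebuild the member at lost_index by XORing all surviving members."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_parity(survivors)


def shrink_stripe(stripe, failed_index):
    """Re-form an N+1 stripe as (N-1)+1 after one member's region fails.

    Returns (new_stripe, displaced_chunk). The displaced data chunk no longer
    fits in the narrower stripe; in this sketch we assume it would be
    rewritten into spare (overprovisioned) space.
    """
    data = list(stripe[:-1])
    if failed_index < len(data):
        # The failed member held data, so rebuild it from the survivors first.
        data[failed_index] = recover(stripe, failed_index)
    displaced = data.pop()          # one fewer data slot in the new stripe
    return build_stripe(data), displaced


if __name__ == "__main__":
    data = [bytes([i] * 4) for i in range(1, 10)]   # nine data chunks
    stripe = build_stripe(data)                     # 9+1 across ten chips
    new_stripe, displaced = shrink_stripe(stripe, failed_index=3)
    print(len(stripe), "members ->", len(new_stripe), "members")
```

In this sketch the narrower stripe is still fully parity protected, so another single-member loss within it remains recoverable. The cost is a small drop in usable capacity for that stripe (8 of 9 members hold data instead of 9 of 10), which is where the overprovisioned space comes in.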
This protection is particularly important for PCIe products like the RamSan-70, where maintenance and downtime on an internal direct-attached storage product are very undesirable. To be effectively leveraged in the architectures that need PCIe SSDs, the SSD needs to be significantly more reliable than the server it is placed in. That takes more than the ability to survive a fault, which most storage arrays are designed to do; it means avoiding maintenance altogether. VSR technology from TMS allows Flash failures to be mapped out at a very granular level and allows an automatic return to a protected status without requiring a maintenance event. Of course failures can still happen, but putting the storage inside the server on the PCIe bus means that you need to treat the storage failure modes more like memory than disks.