Increasing the capacity of a single Flash chip is a driving focus in the semiconductor industry, and it has led to some interesting design decisions to increase density. The first is that high-capacity Flash chips include multiple internal Flash dies (typically four today). Putting four dies into a single chip lowers packaging costs and reduces the footprint on the printed circuit board (PCB). Using smaller dies also increases the yield from semiconductor foundries, since a defect on the wafer scraps a smaller area. In addition, each die is divided into two areas called planes. These planes are effectively independent regions of the die that can work on independent operations, but they share a common interface to the die. The SNIA whitepaper on NAND Reliability illustrates this construction on page 3.
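The hierarchy described above can be sketched as a simple data model. This is purely illustrative; the die and plane counts are the typical values mentioned above, and the block count per plane is an assumed placeholder.

```python
# Illustrative model of a high-capacity Flash chip: one package holds
# multiple dies (typically four), and each die is split into two planes
# that can operate independently but share the die's interface.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Plane:
    blocks: int = 1024  # assumed block count, for illustration only

@dataclass
class Die:
    # two planes per die, as described in the text
    planes: List[Plane] = field(default_factory=lambda: [Plane(), Plane()])

@dataclass
class FlashChip:
    # four dies per chip is typical today
    dies: List[Die] = field(default_factory=lambda: [Die() for _ in range(4)])

    @property
    def plane_count(self) -> int:
        return sum(len(d.planes) for d in self.dies)

chip = FlashChip()
print(chip.plane_count)  # 4 dies x 2 planes = 8 independent planes
```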
This internal construction has long been leveraged by Flash controllers to make garbage collection and write handling more efficient, because internal operations can be accomplished within each plane in parallel (such as page “move” operations; see my post on sustained write performance). With eight planes in a typical chip, the external pins that connect to the Flash chip have become the ultimate performance limiter.
What has not been widely exploited, however, is that many of the failure modes within a Flash chip come from localized issues that take only a single plane or die offline, leaving the remainder of the chip fully functional. Failure can take several forms; for example, the program and erase controls for a single plane can fail, or a defect on the chip can render the blocks within a plane unreadable. These failures are actually fairly common occurrences and are a reason that virtually all enterprise SSDs leverage a fine-grained RAID of the Flash media (this presentation discusses Flash construction and shows several types of failure in highly magnified images).
A Flash chip is the size of a postage stamp and is designed to be surface-mounted onto a PCB. But a chip doesn’t fit into the typical field-replaceable-unit design that storage engineers are used to without adding significant expense and failure points. This creates an issue for typical RAID designs with Flash. The ability to easily field-replace the disk form factor has been one of the main reasons SSDs in a disk form factor have been so prominent. However, this approach moves the RAID from a granular chip level to encompassing many chips, requires many Flash chips to be replaced at once, and creates a performance choke point. Flash is expensive, and you don’t want to waste a lot of good Flash when there is just a single partial-chip error.
TMS was awarded a patent for an innovative solution to this quandary. The core building block of TMS’ latest products is the Series-7 Flash Controller™, which implements a RAID array across ten chips and leverages the overprovisioned capacity to handle block wear outs and partial chip failures. TMS’ Variable Stripe RAID (VSR)™ technology leverages the understanding that plane and die level failures are much more common than complete chip failures, and uses the overprovisioned space to effectively handle failures and avoid requiring maintenance.
First, VSR only maps out portions of the chip rather than the complete chip. So typically, only 1/8 to 1/4 of a chip is taken offline rather than the entire thing. Second, it shrinks the stripe around the partial chip failure from a 9+1 RAID layout to 8+1. This process is complex, but it allows RAID protection to stay in place through normal block wear out and chip failures.
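The stripe-shrinking step can be sketched in a few lines. This is a hypothetical illustration of the idea, not TMS's actual implementation; the member names and the reconstruction step are assumptions.

```python
# Hypothetical sketch of Variable Stripe RAID's response to a partial
# chip failure: a stripe starts as 9 data members + 1 parity member
# across ten chips. When a plane-level failure takes one member offline,
# the stripe shrinks to 8+1 instead of failing the whole chip or array.

def shrink_stripe(members, failed):
    """Drop the failed member; the stripe continues with one fewer member.

    Valid data from the failed member would be reconstructed from parity
    and rewritten into overprovisioned space on the survivors (not shown).
    """
    if failed not in members:
        raise ValueError(f"{failed} is not a member of this stripe")
    return [m for m in members if m != failed]

stripe = [f"chip{i}" for i in range(10)]   # 9 data + 1 parity (9+1)
stripe = shrink_stripe(stripe, "chip3")    # plane failure on chip3
print(len(stripe))  # 9 members remain: now an 8+1 layout
```

Note that only the affected plane's region of chip3 would be retired in practice; the chip's other planes can keep serving other stripes.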
This protection is particularly important for PCIe products like the RamSan-70, where maintenance and downtime on an internal direct-attached storage product is very undesirable. To be effectively leveraged in the architectures that need PCIe SSDs, the SSD needs to be significantly more reliable than the server it is placed in. This requires more than the ability to survive a fault, as most storage arrays are designed to do; it requires avoiding maintenance altogether. VSR technology from TMS allows Flash failures to be mapped out at a very granular level and the device to return automatically to a protected status without requiring a maintenance event. Of course failures can still happen, but putting the storage inside the server on the PCIe bus means that you need to treat the storage failure modes more like memory than disks.
I work for an organization that is a hardware design company at its core and I often end up describing what being a hardware company actually means. I’ll attempt here to capture this distinction and describe why it is important. I am going to start at a very high level, so bear with me.
In order to deliver a storage product, you need storage space, a way for hosts to access it, and features that manipulate the storage space that the host sees. In developing the product there are two basic design methods: hardware centric or software centric. In hardware centric design, the focus is on the physical embodiment of the product, and with software centric the focus is on the features.
If the idea a product implements is new and does not depend on the physical components, then software design methodology makes sense. A good recent example is data de-duplication, where the idea is to find multiple copies of the same data and only store one copy. It doesn’t matter if you have one disk or many, or if the data is local or on a NAS or a SAN. A product has to focus on certain areas of course, but the core idea of data de-duplication is not dependent on the physical embodiment.
Designing a software solution is where most startups begin; in the early stages of a company it is much easier to develop on an x86 platform and outsource the manufacturing and hardware design to a server vendor. The product that is sold may look like a hardware solution by the addition of a custom faceplate on a standard server, but it is really just software.
If on the other hand your product idea is not a new method of doing something, but a method of making a product that is faster, cheaper, or more efficient (in terms of power and space), then hardware is the way to go.
This is because, while software solutions provide extreme flexibility in what ideas can be implemented quickly, the trade-off is lower efficiency and performance. By contrast, hardware design involves deciding what signal goes where, why, and when. It allows repeatable, independent tasks to be handled by parallel circuits to hit nearly any performance target. The ability to design exactly what a chip will do allows much higher levels of throughput. Hardware is particularly good at data movement operations, encoding/decoding, and processing that can leverage deep pipelines for parallel execution.
A good recent example of designing exactly what a signal will do and when is illustrated in a patent that my employer just announced: “Patent No. 7,928,791 – Method and apparatus for clock calibration in a clocked digital device.” This is a fairly interesting patent if you are an electrical engineer. It covers a method that allows TMS to run chips with a higher degree of signal integrity than would otherwise be possible, by continually adjusting the sampling time for data pins to compensate for the timing changes that occur as environmental conditions (temperature, voltage, etc.) drift. Normally chips simply have defined setup and hold times that account for this variability, and the length of these times directly determines how fast you can move data in and out of a chip. By sampling at the center of the valid signal window, we can move data more robustly than we would otherwise be capable of.
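The "sample at the center of the window" idea can be illustrated with a toy calibration sweep. This is a hypothetical software sketch of a process that real hardware performs continually in logic, per pin; the tap indices and training sweep are assumptions for illustration.

```python
# Hypothetical illustration of window-centering calibration: sweep the
# sampling delay across a data pin's valid window, record which delay
# taps capture a training pattern correctly, then choose the center tap
# so environmental drift has the most margin in either direction.

def center_sample_delay(delay_passes):
    """delay_passes: list of bools, index = delay tap, True = captured OK.

    Returns the tap at the center of the passing window.
    """
    passing = [i for i, ok in enumerate(delay_passes) if ok]
    if not passing:
        raise RuntimeError("no valid sampling window found")
    return (passing[0] + passing[-1]) // 2

# e.g. taps 3..8 capture the pattern correctly; the center tap is 5
window = [False] * 3 + [True] * 6 + [False] * 3
print(center_sample_delay(window))  # 5
```

As temperature or voltage drifts, the passing window shifts; re-running the sweep and re-centering keeps the sample point robust, which is the essence of the patented approach as I read it.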
The tagline of Texas Memory Systems is The World’s Fastest Storage®. From this perspective, it is easy to understand why we choose to be a hardware company. The focus is on creating storage products that are faster than any others available.
Today TMS announced the awarding of a patent that covers “Efficient reduction of read disturb errors in NAND Flash memory” (Patent No. 7,818,525). This is a type of Flash error that is often lost in the noise of Flash write-endurance concerns. The basic issue is that isolated reads, done repeatedly, can corrupt data that is not being read. This can be a serious problem, because the data can become corrupted to the point where ECC can no longer fix it.
This is a documented issue with a fairly innocuous-sounding name – “read disturbs”. The abbreviated text from the patent sheds further light: “A read operation [of a flash cell] tends to impose elevated voltage stresses on the other unread cells within the same Block…. Over time, repeated application of this higher magnitude voltage … can result in charge migration onto the floating gate of the [adjacent] cell.” So at the end of the day reads are also a problem for Flash (albeit on a smaller scale than writes), and as is the case with most Flash issues, this is much worse for MLC than SLC.
We implemented a unique method of dealing with this issue without sacrificing the performance that RamSans are purchased for. A counter keeps track of the reads to each block, and once a threshold is hit, the pages are marked to be moved the next time they are read. It is important to note that we don’t move all of the pages in a block at once, as that takes considerable time within the Flash chip. Additionally, by waiting until the pages are read, the heavily read pages can move to several new blocks, reducing read impacts in the future.
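The counter-and-relocate scheme described above can be sketched as follows. This is a minimal sketch of the idea, not TMS's actual controller logic; the threshold value and the structures are illustrative assumptions.

```python
# Minimal sketch of read-disturb mitigation: count reads per block, and
# once a threshold is crossed, mark the block so each page is relocated
# the next time it is read. Hot pages thus spread across several new
# blocks instead of the whole block being moved at once.

READ_DISTURB_THRESHOLD = 100_000  # assumed value, for illustration

class Block:
    def __init__(self):
        self.read_count = 0
        self.move_on_next_read = False

def read_page(block, page, relocate):
    """Serve a read; relocate the page if the block is past threshold."""
    block.read_count += 1
    if block.read_count >= READ_DISTURB_THRESHOLD:
        block.move_on_next_read = True
    if block.move_on_next_read:
        relocate(page)  # copy this page to a fresh block as it is read

moved = []
blk = Block()
blk.read_count = READ_DISTURB_THRESHOLD - 1   # one read from threshold
read_page(blk, "page7", moved.append)
print(moved)  # ['page7']
```

Rarely read pages in the same block are never touched, which is exactly what allows parity and other cold data to stay in place, as described next.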
This implementation also allows pages that are rarely read to stay in place for quite some time without risking corruption. This matters because we RAID the Flash chips in our design, and it is not uncommon to have some number of pages that are only used for RAID reconstruction. First there is the XOR parity data, which is only read upon failure; but there is also data that is no longer valid yet still needed for RAID reconstruction. This happens because new data goes to a new physical location and a new RAID set, while the other elements in the RAID group with this “dirty” data may not have been rewritten. To maintain RAID protection, the data that is no longer referenced needs to stick around until it is convenient and efficient (after several members of the RAID set are invalid) to move the valid elements of the RAID set somewhere else.
I hope to post more about other types of Flash errors, as write issues are really just a small part of Flash handling. I’ll also go into more depth on why RAIDing Flash chips is so critical for an enterprise system. The Flash manufacturers keep quiet about Flash media issues, so I hope to shed some light on how years of dealing with Flash complexities have influenced TMS’s design. Unfortunately, I’ll only be discussing the parts of the secret sauce in our Flash controller that have IP protection.