Flash Read Leveling?
Today TMS announced that it has been awarded a patent covering “Efficient reduction of read disturb errors in NAND Flash memory” (Patent No. 7,818,525). Read disturb is a type of flash error that is often lost in the noise of flash write endurance concerns. The basic issue is that repeated reads of the same locations can corrupt data that is not being read. This can be a serious problem, because the data can become corrupted to the point where ECC can no longer fix it.
This is a documented issue with a fairly innocuous-sounding name: “read disturbs”. The abbreviated text from the patent sheds further light: “A read operation [of a flash cell] tends to impose elevated voltage stresses on the other unread cells within the same Block…. Over time, repeated application of this higher magnitude voltage … can result in charge migration onto the floating gate of the [adjacent] cell.” So reads are also a problem for flash (albeit on a smaller scale than writes), and as is the case with most flash issues, this is much worse for MLC than SLC.
We implemented a unique method of dealing with this issue without sacrificing the performance that RamSans are purchased for. A counter keeps track of the reads that take place to each block, and once a threshold is hit, the pages in that block are marked to be moved after the next time they are read. It is important to note that we don’t move all of the pages in a block at once, as that takes considerable time within the flash chip. Additionally, because we wait until the pages are read, the heavily read pages can be spread across several new blocks, which reduces the read impact in the future.
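The counter-and-mark behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the controller's actual logic: the class name `ReadLeveler`, the tiny threshold, the round-robin choice of destination blocks, and the decision to leave the counter running after the threshold is hit are all hypothetical choices that the text does not specify.

```python
from collections import defaultdict
from itertools import cycle


class ReadLeveler:
    """Toy model: count reads per block, flag pages at a threshold,
    and relocate each flagged page lazily on its next read."""

    def __init__(self, threshold, spare_blocks):
        self.threshold = threshold              # hypothetical value
        self.read_counts = defaultdict(int)     # block -> reads seen so far
        self.location = {}                      # page -> block holding it
        self.marked = set()                     # pages flagged for relocation
        # Assumption: destinations are chosen round-robin so that hot pages
        # end up spread over several blocks.
        self._spares = cycle(spare_blocks)

    def place(self, page, block):
        self.location[page] = block

    def read(self, page):
        """Read a page; return the new block if the page was relocated."""
        block = self.location[page]
        self.read_counts[block] += 1

        moved_to = None
        if page in self.marked:
            # The page was flagged earlier, so it moves now that it was read.
            moved_to = next(self._spares)
            self.location[page] = moved_to
            self.marked.discard(page)

        if self.read_counts[block] >= self.threshold:
            # Flag every page still in the block. Nothing is copied here;
            # pages that are never read again simply stay where they are.
            self.marked.update(
                p for p, b in self.location.items() if b == block
            )
        return moved_to
```

With a threshold of 3, for example, three reads of one page flag every page in its block, the fourth read relocates that hot page, and a page in the same block that is never read stays put.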
This implementation also allows pages that are rarely read to be left in place for quite some time without risking corruption. This is important because we RAID the flash chips in our design, and it is not uncommon to have some number of pages that are only used for RAID reconstruction. First, there is the XOR data that is only used upon failure, but there is also data that is no longer valid yet is still needed for RAID reconstruction. This happens because new data goes to a new physical location and a new RAID set, while the other elements in the RAID group that holds the now “dirty” data may not have been rewritten. To maintain the RAID protection, the data that is no longer referenced therefore needs to stick around until it is convenient and efficient (after several members of the RAID set are invalid) to move the valid elements in the RAID set somewhere else.
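To see why stale data has to stick around, here is a minimal sketch of an XOR-protected RAID set in Python. It assumes simple single-parity XOR across equal-sized elements; the `Stripe` class, its method names, and the `min_invalid` cutoff are hypothetical and only illustrate the idea, not our controller's layout.

```python
from functools import reduce


def xor_bytes(*chunks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


class Stripe:
    """Toy RAID set: data elements plus one XOR parity element."""

    def __init__(self, elements):
        self.elements = list(elements)
        self.valid = [True] * len(self.elements)
        self.parity = xor_bytes(*self.elements)

    def invalidate(self, i):
        # The logical data was rewritten to a new location and RAID set.
        # The old bytes are kept because the parity still covers them.
        self.valid[i] = False

    def reconstruct(self, lost):
        # Rebuilding one element needs parity plus ALL other elements,
        # including the ones that are no longer referenced.
        others = [e for j, e in enumerate(self.elements) if j != lost]
        return xor_bytes(self.parity, *others)

    def ready_to_retire(self, min_invalid):
        # Hypothetical policy: move the remaining valid elements only
        # once enough members of the set are invalid.
        return self.valid.count(False) >= min_invalid
```

If element 1 is invalidated and element 0 is later lost, `reconstruct(0)` still reads the stale bytes of element 1. Discarding them early would make element 0 unrecoverable.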
I hope to post more about other types of flash errors as write issues are really just a small part of Flash handling. I’ll also go into more depth on why RAIDing flash chips for an Enterprise system is so critical. The Flash manufacturers keep quiet about flash media issues so I hope to shed some light on how the years of dealing with flash complexities have influenced TMS’s design. Unfortunately, I’ll only be discussing the parts of the secret sauce in our flash controller that have IP protection.