In 2008 I was on a panel at WinHEC alongside other SSD and disk industry participants. The question came up: when would SSDs become the default option rather than a premium? Most answers took the form of "there are places where both disks and SSDs make sense." When my opportunity to comment came, I replied simply “five to ten years,” which added a bit of levity to a panel where the question had been skirted.
Now that several years have passed, I can look back at how the industry has moved forward, and I have decided that I am ready to update my prediction. In part of the market there is an unrelenting demand for additional capacity. There are enough applications where capacity beats out the gains in performance, form factor, and power consumption that come with SSDs to give products that optimize price per capacity a bright future. With disk manufacturers’ singular focus on optimizing cost per capacity, performance optimization has been left to SSDs. In applications that support business processes where performance and capacity are both important, this is a profound shift that has only begun. SSDs are not going to replace disks wholesale, but using some SSD capacity will become the norm.
One of the aspects of a market that moves from niche to mainstream is that the focus has to shift towards designing for mass markets, with attention to price points and ease of use. After looking at the market directions I reached the following conclusion: the biggest winners will be the companies that make it easiest to effectively use SSDs across the widest range of applications. I conducted a wide survey of the SSD industry, weighed some personal factors, and moved from Texas to Colorado to join LSI in the Accelerated Solutions Division.
Today LSI announced an application acceleration product family that fully embraces the vision of enabling the broadest number of applications to benefit from solid state technology – from DAS to SAN to 100% flash solutions. The problems that SSDs solve and the way they solve them (eliminating the time spent waiting on disks) have not really changed, but the capacities and entry-level price points have, allowing for much broader adoption.
Using SSDs still requires sophisticated flash management (which still varies widely from SSD to SSD), data protection from component failures, and integration to use the SSD capacity effectively alongside existing storage – so there is still plenty of room to add value to the raw flash. In the next wave of flash adoption, expect to see a much higher attach rate of SSDs to servers. A shift is taking place from explaining why SSDs are justified to explaining why they are not. In this next phase, reducing the friction of deploying flash far and wide is the key.
Update – Some of the suggestions below have been questioned for a typical ZFS setup. To clarify, these settings should not be implemented on most ZFS installations. The L2ARC is designed so that by default it will never hurt performance. Some of the changes below can have a negative impact on workloads that are not using the L2ARC, and they accept the possibility of worse performance on some workloads in exchange for better performance on cache-friendly workloads. These suggestions are intended for ZFS implementations where the flash cache is one of the main motivating factors for deploying ZFS – think high SSD-to-disk ratios. In particular, these changes were tested with an Oracle database on ZFS.
In 2008 Sun Microsystems announced the availability of a feature in ZFS that could use SSDs as a read or write cache to accelerate ZFS. A good write-up on the implementation of the level 2 adaptive read cache (L2ARC) by a member of the fishworks team is available here. In 2008, flash SSDs were just starting to penetrate the enterprise storage market, and this cache was written with many of the early flash SSD issues in mind. First, it warms quite slowly, defaulting to a maximum cache load rate of 8 MB/s. Second, to keep it out of a write-heavy path, it is explicitly set outside of the data eviction path from the ZFS memory cache (ARC). This prevents it from behaving like a traditional level 2 cache and causes it to fill more slowly, and with mainly static data. Finally, the default record size of the file system is rather big (128 KB), and the default assumption is that for sequential scans it is better to just read from the disk and skip the SSD cache.
Many of these assumptions about SSDs don’t line up with current generation SSD products. An enterprise-class SSD can write quickly, has high capacity, has higher sequential bandwidth than most disk storage systems, and has a long life span even under a heavy write load. There are a few parameters that can be changed as a best practice when enterprise SSDs are being used for the L2ARC in ZFS:
Change the record size to a much lower value than 128 KB. The L2ARC fetches the full record on a read, and a 128 KB IO to an SSD uses up device bandwidth and increases response time.
This can be set dynamically (for new files) with:
# zfs set recordsize=<record size> <filesystem>
Change the l2arc_write_max to a higher value. Most SSDs specify device life in terms of full device write cycles per day. For instance, say you have 700 GB of SSD that supports 10 device cycles per day for 5 years. This equates to a maximum write rate of 7000 GB/day, or about 81 MB/s. As the setting is the maximum write rate, I would suggest at least doubling the specified drive maximum. Because the L2ARC is a read-only cache that can abruptly fail without impacting file system availability, the only risk of too high a write rate is wearing out the drive earlier. This throttle was put in place when early SSDs with unsophisticated controllers were the norm; they could experience significant performance problems during writes that would limit the performance of the reads the cache was meant to accelerate. Modern enterprise SSDs are orders of magnitude better at handling writes, so this is not a major concern.
On Solaris, this parameter is set by adding the following line to the /etc/system file:
set zfs:l2arc_write_max=<maximum bytes per second>
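The write-budget arithmetic above can be sketched in a few lines (the 700 GB capacity and 10 cycles/day are the hypothetical figures from the example, and decimal units are assumed):

```python
# Worked example of the l2arc_write_max sizing math above.
# Assumes decimal units (1 GB = 10^9 bytes); the capacity and endurance
# figures are the hypothetical ones from the text, not a specific drive.
capacity_bytes = 700 * 10**9       # 700 GB of SSD used for the L2ARC
cycles_per_day = 10                # rated full-device writes per day
seconds_per_day = 86400

# Sustained write rate the endurance rating allows (~81 MB/s)
max_sustained = capacity_bytes * cycles_per_day // seconds_per_day

# Per the suggestion above, set the throttle to at least double that
l2arc_write_max = 2 * max_sustained
```

Since the parameter is a ceiling rather than a steady-state rate, erring high mostly just lets the cache warm faster.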
Set the l2arc_noprefetch=0. By default this is set to one, skipping the L2ARC for prefetch reads that are used for sequential reading. The idea here is that the disks are good at sequential so just read from them. With PCIe SSDs readily available with multiple GB/s of bandwidth even sequential workloads can get a significant performance boost. Changing this parameter will put the L2ARC in the prefetch read path and can make a big difference for workloads that have a sequential component.
On Solaris, this parameter is set by adding the following line to the /etc/system file:
set zfs:l2arc_noprefetch=0
It is interesting to see how some of the developments in the IT space are governed by interdepartmental realities. I see this most pronounced in the storage team’s perspective versus the rest of the IT team’s. Storage teams are exceptionally conservative by nature. This makes perfect sense – servers can be rebooted, applications can be reinstalled, hardware can be replaced – but if data is lost there are no easy solutions.
Application teams are aware of the risk of data loss, but are much more concerned with the day-to-day realities of managing an application – providing a valuable service, adding new features, and scaling performance. This difference in focus can lead to very real differences in viewpoints and a bit of mutual distrust between the application and storage teams.
The best example of this difference is where the storage team classifies storage array controllers as hardware solutions, when in reality most are just a predefined server configuration running data management software with disk shelves attached. Although the features provided by the controllers are important (like replication, deduplication, snapshot, file services, backup, etc.) they are inherently just software packages.
More and more of the major enterprise applications are now building in the same feature set traditionally found in the storage domain. With Oracle there is RMAN for backups, ASM for storage management, Data Guard for replication, and the Flash Recovery Area for snapshots. In Microsoft SQL Server there is database mirroring for synchronous or asynchronous replication and database snapshots. If you follow VMware’s updates, it is easy to see they are rapidly folding in more storage features with every release (as an aside, one of the amazing successes of VMware is in making managing software feel like managing hardware). Since solutions at the application level can be aware of the layout of the data, some of these features can be implemented much more efficiently. A prime example is replication, where database level replication tools tie into the transaction logging mechanism and send only the logs rather than blindly replicating all of the data.
The biggest hurdle that I have seen at customer sites looking to leverage these application level storage features is the resistance from the storage team in ceding control, either due to lack of confidence in the application team’s ability to manage data, internal requirements, or turf protection. One of the most surprising reasons I have seen PCIe SSD solutions selected by application architects is to avoid even having these internal discussions!
As the applications that support important business processes continue to grow in their data management sophistication there will be more discussions on where the data management and protection belong. Should they be bundled with the storage media? Bundled within the application? Or should they be purchased as separate software – perhaps as virtual appliances?
Where do you think these services will be located going forward? Let me know below in the comments.
Increasing the capacity in a single Flash chip is a driving focus in the semiconductor industry. This has led to some interesting design decisions to increase density. The first is that high capacity Flash chips include multiple internal Flash dies (typically four today). Putting four dies into a single chip lowers the packaging costs and reduces the footprint on the printed circuit board (PCB). Using smaller dies also increases the yield from the semiconductor foundries since defects on the wafer result in a smaller area being scrapped. In addition, each die is divided into two areas called planes. These planes are effectively independent regions of the die that can work on independent operations, but share a common interface to the die. The SNIA whitepaper on NAND Reliability illustrates this construction on page 3.
This internal construction has long been leveraged by Flash controllers to make garbage collection and write handling more efficient because internal operations can be accomplished within each plane in parallel (such as page “move” operations; see my post on sustained write performance). With eight planes in a typical chip the external pins that connect to the Flash chip have been the ultimate performance limiter.
What has not been widely exploited, however, is that many of the failure modes within a Flash chip come from localized issues that take only a single plane or die offline, leaving the remainder of the chip fully functional. Failure can take several forms; for example, the program and erase controls for a single plane can fail, or a defect on the chip can render the blocks within a plane unreadable. These failures are fairly common occurrences and are a reason that virtually all enterprise SSDs leverage a fine-grained RAID of the Flash media (this presentation discusses Flash construction and shows several types of failure with highly magnified images).
A Flash chip is the size of a postage stamp and designed to be surface mounted onto a PCB, so it doesn’t fit the typical field replaceable unit design that storage engineers are used to without adding significant expense and failure points. This creates an issue with typical RAID designs and Flash. The ability to easily field replace the disk form factor has been one of the main reasons SSDs in a disk form factor have been so prominent. However, this approach moves the RAID from a granular chip level to one encompassing many chips, requires many Flash chips to be replaced at once, and creates a performance choke point. Flash is expensive, and you don’t want to waste a lot of good Flash when there is just a single partial chip error.
TMS was awarded a patent for an innovative solution to this quandary. The core building block of TMS’ latest products is the Series-7 Flash Controller™, which implements a RAID array across ten chips and leverages the overprovisioned capacity to handle block wear outs and partial chip failures. TMS’ Variable Stripe RAID (VSR)™ technology leverages the understanding that plane and die level failures are much more common than complete chip failures, and uses the overprovisioned space to effectively handle failures and avoid requiring maintenance.
First, VSR only maps out portions of the chip rather than the complete chip. So typically, only 1/8 to 1/4 of a chip is taken offline rather than the entire thing. Second, it shrinks the stripe around the partial chip failure from a 9+1 RAID layout to 8+1. This process is complex, but it allows RAID protection to stay in place through normal block wear out and chip failures.
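The stripe-shrinking step can be illustrated with a toy model (my own simplification for illustration, not TMS's actual implementation): a stripe is a list of chip regions, and a plane-level failure removes one member rather than retiring whole chips.

```python
# Toy model of variable stripe RAID (an illustrative simplification,
# not TMS's actual implementation). A stripe spans ten chips as
# 9 data + 1 parity; a plane-level failure shrinks it to 8+1 instead
# of forcing replacement of the whole chip.

def shrink_stripe(stripe, failed_member):
    """Return a narrower stripe with the failed region removed.
    In the real product, data is re-protected across the survivors."""
    return [m for m in stripe if m != failed_member]

stripe = [f"chip{i}-plane0" for i in range(10)]   # 9+1 layout
stripe = shrink_stripe(stripe, "chip3-plane0")    # plane failure on chip 3
assert len(stripe) == 9                           # now an 8+1 layout
```

The point of the model is that the unit taken offline is a plane-sized region, so only 1/8 to 1/4 of one chip's capacity is sacrificed per failure.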
This protection is particularly important for PCIe products like the RamSan-70 where maintenance and downtime on an internal direct attached storage product is very undesirable. To be effectively leveraged in the architectures that need PCIe SSDs, the SSD needs to be significantly more reliable than the server it is placed in. This is more than just the ability to survive a fault, as most storage arrays are designed to do, but to avoid maintenance. VSR technology from TMS allows for Flash failures to be mapped out at a very granular level and for automatic return to a protected status without requiring a maintenance event. Of course failures can still happen, but putting the storage inside the server on the PCIe bus means that you need to treat the storage failure modes more like memory than disks.
Yesterday TMS announced our latest PCIe Flash offering, the RamSan-70. Prior to the launch, I had the chance to brief a number of analysts and to explain the key deployments we are tackling with this offering. One thing I discovered was that there is a lot of confusion about what PCIe SSDs bring to the table and what they don’t. So with that in mind, I present the following primer on the fit of PCIe SSDs.
How would you describe a PCIe SSD?
A PCIe SSD is like Direct-attached storage (DAS) on steroids.
Doesn’t being on the PCIe bus increase performance by being as close to the CPU as possible?
Yes, but nowhere near the degree it is promoted. Going through an HBA to a FC-attached RamSan adds about 10 µs of latency – that’s it. The reason that accessing SSDs through most SAN systems takes 1-2 ms is the software stack in the SAN head – not the PCIe to FC conversion. For our customers, the decision to go with a PCIe RamSan-70 over a FC/IB-attached RamSan-630 comes down to whether the architecture needs to share storage.
Are you working on a way to make the PCIe card sharable?
No, we have shared systems. If the architecture needs the shared storage, use our shared storage systems.
So why is PCIe making such a splash? Isn’t the DAS vs SAN argument over with SAN rising triumphant?
Well, the argument was over until two things happened: servers started to get really cheap, and really, really big clusters started getting deployed. In a shared storage model, a big core network is needed so each server can access the storage at a reasonable rate. This is one of the main reasons a dedicated high performance Storage Area Network is used for the server-to-storage network. However, once there are more than a few dozen servers, the network starts to become rather large. Now imagine you want tens of thousands of servers: the network becomes the dominant cost (see the aside in my post on SSDs and the Cloud for more details). In these very large clusters the use of a network attached shared storage model becomes impractical.
A new computing model developed for these environments – the shared nothing scale-out cluster. The basic idea is that each computer processes the part of the data that is stored locally, many nodes do this in parallel, and an aggregation step then compiles the results. This way all of the heavy data-to-CPU movement takes place within a single server, and only the results are compiled across the network. This is the foundation of Hadoop as well as several data warehouse appliances. In effect, rather than virtualized servers, a big network, and virtualized storage via a SAN or NAS array, the servers and storage are virtualized in a single step using hardware that has CPU resources and direct-attached storage.
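The shared nothing pattern can be sketched in a few lines (a generic illustration of the idea, not any particular product's API): each node computes over its local shard, and only the small partial results cross the network to the aggregation step.

```python
# Generic sketch of the shared-nothing pattern described above (not any
# particular product's API). Each node scans only its locally stored
# shard; the network carries just the small per-node results.
local_shards = [
    [3, 1, 4, 1, 5],   # data resident on node 0
    [9, 2, 6, 5, 3],   # data resident on node 1
    [5, 8, 9, 7, 9],   # data resident on node 2
]

def node_task(shard):
    # the heavy data-to-CPU movement stays inside one server
    return sum(shard)

partials = [node_task(s) for s in local_shards]  # parallel across nodes
total = sum(partials)                            # cheap aggregation step
```

Because each `node_task` touches only local storage, the cluster's network only has to carry a handful of integers here, regardless of how large the shards grow.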
PCIe SSDs are important for this compute framework because reasonably priced servers are really quite powerful and can leverage quite a bit of storage performance. With the RamSan-70 each PCIe slot can provide 2 GB/s of throughput while fitting directly inside the server. This much local performance allows building high performance nodes for a scale-out shared-nothing cluster that balances the CPU and storage resources. Otherwise, a large number of disks would be needed for each node or the nodes would have to scale to a lower CPU power than is readily available from mainstream servers. Both of these other options have negative power and space qualities that make them less desirable.
The rise of SSDs has provided a quantum leap in storage price-performance at a reasonable cost for capacity as new compute frameworks are moving into mainstream applications. Both of these developments are still being digested by the IT community. You can see the big vendors jostling for dominance to control the new platforms that are used to build the datacenters of the future. At TMS we see the need for “DAS on steroids” in the new frameworks and are leveraging our hardware engineering expertise to make the best DAS solution.