I have had the chance to meet with several analysts over the past couple of weeks and have argued that, with eMLC, the long-awaited price parity of Tier-1 disks and SSDs is virtually upon us. I got a mixed set of reactions, from “nope, not yet” to “sorry if I don’t act surprised, but I agree.” For the skeptics, I promised that I would compile some data to back up my claim.
For years the mantra of the SSD vendor was to look at the price per IOPS rather than the price per GB. The Storage Performance Council provides an excellent source of data that facilitates that comparison in an audited forum with their flagship SPC-1 benchmark. The SPC requires quite a bit of additional information to be reported for a result to be accepted, which makes it an excellent data source when you want to examine the enterprise storage market. If you bear with me, I will walk through a few of the ways I look at the data, and I promise that this is not a rehash of the cost-per-IOPS argument.
First, if you dig through the reports you can see how many disks are included in each solution as well as the total cost. The chart below is an aggregation of the HDD based SPC-1 submissions showing the reported Total Tested Storage Configuration Price (including three-year maintenance) divided by the number of HDDs reported in the “priced storage configuration components” description. It covers data from 12/1/2002 to 8/25/2011:
Now, let’s take it as a given that SSDs can deliver much higher IOPS than an HDD of equivalent capacity, and that price per GB is the only advantage disks bring to the table. The historical way to get higher IOPS from HDDs was to use lots of drives and short-stroke them. The modern-day equivalent is using low capacity, high performance HDDs rather than cheaper high capacity HDDs. With the total cost of enterprise disk close to $2,000 per HDD, the $/GB of enterprise SSDs determines the minimum logical capacity of an HDD. Here is an example of various SSD $/GB levels and the associated minimum disk capacity points:
Enterprise SSD $/GB    Minimum HDD capacity
~$14/GB                146 GB
~$7/GB                 300 GB
To get to the point that 300 GB HDDs no longer make sense, the enterprise SSD price just needs to reach around $7/GB, and 146 GB HDDs are gone at around $14/GB. Keep in mind that this is the price of the SSD capacity before redundancy and overhead, to make it comparable to the HDD case.
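The break-even arithmetic above can be sketched in a few lines, assuming only the roughly $2,000 total cost per enterprise HDD taken from the SPC-1 data:

```python
# Break-even sketch: at ~$2,000 total tested cost per enterprise HDD (the
# figure from the SPC-1 data above), an HDD capacity point only makes
# sense if the same capacity in SSD would cost more.
COST_PER_HDD = 2000  # approximate total cost per enterprise HDD, USD

def min_hdd_capacity_gb(ssd_price_per_gb):
    """Smallest HDD capacity (GB) that still beats SSD on $/GB."""
    return COST_PER_HDD / ssd_price_per_gb

for price in (14, 7):
    print(f"SSD at ${price}/GB -> HDDs below ~{min_hdd_capacity_gb(price):.0f} GB no longer make sense")
```

The output lands within rounding distance of the 146 GB and 300 GB capacity points cited above.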
It’s not fair (or permitted use) to compare audited SPC-1 data with data that has not gone through the same rigorous process, so I won’t make any comparisons here. However, I think that when looking at the trends, it is clear that the low capacity HDDs used for Tier-1 storage are going away sooner rather than later.
About the Storage Performance Council (SPC)
The SPC is a non-profit corporation founded to define, standardize and promote storage benchmarks and to disseminate objective, verifiable storage performance data to the computer industry and its customers. The organization’s strategic objectives are to empower storage vendors to build better products as well as to stimulate the IT community to more rapidly trust and deploy multi-vendor storage technology.
The SPC membership consists of a broad cross-section of the storage industry. A complete SPC membership roster is available at http://www.storageperformance.org/about/roster/.
A complete list of SPC Results is available at http://www.storageperformance.org/results.
SPC, SPC-1, SPC-1 IOPS, SPC-1 Price-Performance, and SPC-1 Results are trademarks or registered trademarks of the Storage Performance Council (SPC).
This week in TMS’s booth (booth #258) at VMworld we have a joint demo with our partner Datacore that shows an interesting combination of VMware, SSDs, and storage virtualization. We are using Datacore’s SANsymphony-V software to create an environment with the RamSan-70 as tier 1 storage and SATA disks as tier 2. The SANsymphony-V software handles the tiering, high-availability mirroring, snapshots, replication, and other storage virtualization features.
Iometer is running on four virtual machines within a server and handling just north of 140,000 4 KB read IOPS. A screenshot from Iometer on the master manager is shown below:
Running 140,000 IOPS is a healthy workload, but the real benefit of this configuration is its simplicity. It uses just two 1U servers and hits all of the requirements for a targeted VMware deployment. Much of the time, RamSans are deployed in support of a key database application where exceptionally high performance shared SSD capacity is the driving requirement. RamSan systems implement a highly parallel hardware design to achieve an extremely high performance level at exceptionally low latency. This is an ideal solution for a critical database environment where the database has all of the tools integrated that are normally “outsourced” to a SAN array (such as clustering, replication, snapshots, backup, etc.). However, in a VMware environment many physical and virtual servers are leveraging the SAN, so pushing the data management to each application is impractical.
Caching vs. Tiering
One of the key use cases for SSDs in VMware environments is automatically accelerating the most accessed data as new VMs are brought online, grow over time, and retire. The flexibility of a virtual infrastructure makes seamless, automatic access to SSD capacity all the more important. There are two accepted approaches to properly integrating an SSD in a virtual environment; I’ll call them caching and tiering. Although similar on the surface, there are some important distinctions.
In a caching approach, the data remains in place (in its primary location) and a cached copy is propagated to SSD. This setup is best suited to heavily accessed read data because write-back caches break all of the data management storage features running on the storage behind it (Woody Hutsell discusses this in more depth in this article). This approach is effective for frequently accessed static data, but it is not ideal for frequently changing data.
In tiering, the actual location of the data moves from one type of persistent storage to another. In the read-only caching case it is possible to create a transparent storage cache layer that is managed outside of the virtualized layer, but when tiering with the SSD, tiering and storage virtualization need to be managed together.
SSDs have solved the boot-storm startup issues that plague many virtual environments, but VMware’s recent license model updates sparked increased interest in other SSD use cases. With VMware moving to a memory-based licensing model, there is interest in using SSDs to accelerate VMs with a smaller memory footprint. In a tiering model, if VMDKs are created on LUNs that leverage SSDs, the virtual guest will automatically move the internal paging mechanisms within the VM to low latency SSDs. Paging is write-heavy, so the tiering model is important to ensure that the page files leverage SSD as they are modified (and that the less active storage doesn’t use the SSD).
We are showing this full setup at our booth (#258) at VMworld. If you are attending I would be happy to show you the setup.
Until recently, the model for Flash implementation has been to use SLC for the enterprise and MLC for the consumer. MLC solutions traded endurance, performance, and reliability for a lower cost, while SLC solutions made no such compromise. The tradeoff of 10x the endurance for 2x the price led most enterprise applications to adopt SLC.
But there is a shift taking place in the industry as SSD prices start to align with the prices enterprise customers have been paying for tier 1 HDD storage (which is much higher than the cost of consumer drives). If the per-GB pricing is similar, you can add so much more capacity that endurance becomes less of a concern. Rather than only having highly transactional OLTP systems on SSDs, you can move virtually every application using tier 1 HDD storage to SSD. The biggest concern for tier 1 storage has been that the critical datasets that reside on it cannot risk a full 10x drop in endurance.
Closing the gap: eMLC
At first glance, enterprise MLC (eMLC) sounds like Marketing is trying to pull a fast one. If there were a simple way to make MLC have higher endurance, why bother restricting it from consumer applications? eMLC sports endurance levels of 30,000 write cycles, whereas some of the newest MLC only has 3,000 write cycles (SLC endurance is generally 100,000 write cycles). There is a big reason for this restriction: eMLC makes a tradeoff to enable this endurance – retention.
It’s not commonly understood that although Flash is considered persistent, the data is slowly lost over time. For most Flash chips the retention is around 10 years – longer than most use cases. With eMLC, longer program and erase times are used with different voltage levels than MLC to increase the endurance. These changes reduce the retention to as low as three months for eMLC. This is plenty of retention for an enterprise storage system that can manage the Flash behind the scenes, but it makes eMLC impractical for consumer applications. Imagine if you didn’t get around to copying photos from your camera within a three month window and lost all the pictures!
Today, Texas Memory Systems announced its first eMLC RamSan product: the RamSan-810. This is a major announcement as we have investigated eMLC for some time (I have briefed analysts on eMLC for almost a year; here is a discussion of eMLC from Silverton Consulting, and a recent one from Storage Switzerland). TMS was not the first company to introduce an eMLC product, as the RamSan’s extremely high performance backplane and interfaces make endurance concerns more pronounced. However, with the latest eMLC chips, we aggregate enough capacity to be comfortable introducing an eMLC product that bears the RamSan name.
10 TB of capacity multiplied by 30,000 write cycles equates to 300 PB of lifetime writes. This amount of total writes is difficult to achieve during the lifetime of a storage device, even at the high bandwidth the RamSan-810 can support. Applications that demand the highest performance and those with more limited capacity requirements will still be served best by an SLC solution, but very high capacity SSD deployments will shift to eMLC. With a density of 10 TB per rack unit, petabyte scale SSD deployments are now a realistic deployment option.
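The lifetime-write math works out as follows; the 2 GB/s sustained write rate below is an illustrative assumption, not a published RamSan-810 specification:

```python
# Lifetime-write sketch for a 10 TB eMLC system rated at 30,000 write cycles.
CAPACITY_TB = 10
WRITE_CYCLES = 30_000

lifetime_writes_pb = CAPACITY_TB * WRITE_CYCLES / 1000  # TB -> PB
print(f"Lifetime writes: {lifetime_writes_pb:.0f} PB")  # 300 PB

# Hypothetical sustained write rate, assumed purely for illustration.
sustained_gb_per_s = 2
seconds = lifetime_writes_pb * 1e6 / sustained_gb_per_s  # PB -> GB
years = seconds / (365 * 24 * 3600)
print(f"At {sustained_gb_per_s} GB/s nonstop: ~{years:.1f} years to wear out")
```

Even writing flat-out around the clock at that assumed rate, it takes years to exhaust 300 PB of lifetime writes.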
We’re just getting warmed up discussing eMLC. Stay tuned for another post soon on tier 1 disk pricing vs. eMLC.
I started my career at a different Texas company – Texas Instruments – and I remember the 1” drive division aimed at the mobile device market. It didn’t take very long for that product to be end-of-lifed. It was a neat product and a serious feat of engineering, but it just couldn’t compete with Flash. At first that was because Flash was smaller, more rugged, and used less power; ultimately, it was just because Flash was cheaper! (Compare the disk-based iPod Mini and the Flash-based iPod Nano.)
Disks have a high fixed cost per unit and a small marginal cost per GB. Physically bigger disks have a lower cost per GB than smaller ones. This is very different from other storage media like Flash and tape. So it was bothering me recently: why are mainstream disks still shrinking, from 3.5” to 2.5”? If the disk market were only concerned with cost per GB and “tape is dead,” this is crazy – disks should be getting bigger! Why do disks continue their march toward smaller form factors when that just makes SSDs more competitive?
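The fixed-plus-marginal cost structure can be sketched with a toy model; the $40 fixed and $0.02-per-GB figures are assumptions for illustration only, not real drive economics:

```python
# Toy cost model: a disk has a high fixed cost per unit (motor, heads,
# controller, enclosure) plus a small marginal cost per GB of platter
# capacity, so bigger drives are cheaper per GB. Both constants are
# hypothetical, chosen only to show the shape of the curve.
FIXED_COST = 40.0        # assumed per-unit cost, USD
MARGINAL_PER_GB = 0.02   # assumed incremental cost per GB, USD

def cost_per_gb(capacity_gb):
    return (FIXED_COST + MARGINAL_PER_GB * capacity_gb) / capacity_gb

for gb in (250, 1000, 3000):
    print(f"{gb} GB drive: ${cost_per_gb(gb):.3f}/GB")
```

The per-GB cost falls monotonically with capacity, which is exactly why pure $/GB logic says disks should be getting bigger, not smaller.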
I originally thought that this was just a holdover from the attempts to make disks faster. Bigger disks are harder to spin at a high speed, so as the RPM rate marched forward disks had to get smaller. The advent of cost effective SSDs, however, has stopped the increase in RPMs. (Remember the news in 2008 of the 20k RPM disk?) The market for performance storage at a premium has been ceded to SSDs.
After spending some time thinking about it, I believe there are a few basic reasons disks continue their march:
- The attempt to have a converged enterprise, desktop, and laptop standard.
- The need for smaller units to compose RAID sets, so that during a rebuild the chance of a second failure is not too high. I understand this, but RAID-6 is an alternative solution.
- Disks are not just for capacity; they serve both performance and long-term storage needs.
Because disks store data on a circular platter, every time the linear bit density increases, the capacity grows quadratically, the ability to access data randomly doesn’t change, and the bandwidth only grows linearly. At some point the need for capacity is more than adequately met, so the performance need takes over and disks shrink to bring performance and capacity back in sync.
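The geometry argument can be made concrete with a small scaling sketch, assuming the simplified model that bits scale in two dimensions on the platter while the head reads one track at a time:

```python
# Scaling sketch: if linear bit density improves by some factor, platter
# capacity grows with the square of that factor (bits pack tighter in two
# dimensions), sequential bandwidth grows linearly (more bits pass under
# the head per revolution), and random access stays mechanically bound
# (seek time and rotational latency do not improve with density).
def scale(density_factor, capacity=1.0, bandwidth=1.0, iops=1.0):
    return {
        "capacity": capacity * density_factor ** 2,
        "bandwidth": bandwidth * density_factor,
        "iops": iops,  # unchanged: limited by mechanics, not density
    }

print(scale(2))  # doubling density: capacity x4, bandwidth x2, IOPS x1
```

Each density generation therefore widens the gap between how much a disk holds and how fast it can be accessed, which is the pressure pushing form factors smaller.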
Neither tape cartridges nor Flash suffers from the fixed cost problem or the geometry-induced accessibility issues of disks. With new high density cartridges coming online, tape continually avoids being supplanted by disk for pure capacity requirements. TMS even recently had a customer that was able to leverage a tape + SSD deployment and skip disks altogether.
Is the future of storage SSD + tape?
No. While this works for a streamlined processing application, tape just isn’t ever going to be fast enough for data that needs to feel instantly available. There is simply too much data that probably won’t be needed often, but when it is, it must be available immediately. Disks, however, are comfortably faster than the ~1 second response time needed for a user-facing application.
With SSDs handling more and more of the performance storage requirements, it will be interesting to see whether disks stop their march toward smaller form factors and head in the other direction – becoming bigger and slower – and fully cede the “tier 1 storage” title to SSDs.
In a recent post I discussed how PCIe SSDs fit in with the rise of shared-nothing scale-out clusters and why exceptionally large clusters favor direct attached storage. There are two elements required for this type of architecture to be successful: a logical way to partition the data and a need to scan large chunks of data from those partitions. These elements are prevalent in many applications, but not all of them. There are applications where “Total Access” to the data is required, and the scale-out model is ineffective because most of the data is fetched across the network rather than accessed locally.
Applications requiring total access are common in the scientific computing, financial modeling, and government areas. They are characterized by processing that can be effectively parallelized, but the data that each process needs could be located anywhere in a large dataset. This results in lots of small random accesses that traditional cluster designs are not well suited to. There have been three primary methods to tackle these problems: add a task-specific preprocessing step so that subsequent accesses are less random, create a cluster where each compute node’s memory can be accessed over a network, or deploy the application on a large memory SMP server.
There is a new option that I have seen getting deployed more and more often: using high capacity SSDs and a SAN shared file system. A SAN shared filesystem provides the locking to allow multiple servers to directly access the block storage concurrently. This provides the ease of use of a file system with the performance benefits of block storage access. If you add SSDs into this setup you can build a very powerful solution. The basic setup looks like the following:
Using SSDs as the primary storage allows the shared file system to handle small block random workloads in addition to the standard high bandwidth workloads that are the mainstay of most SAN shared file system deployments. It is relatively easy to construct a system that has a couple hundred cores of processing power and tens to hundreds of TBs of flash. This system can tackle the types of workloads that were previously reserved for the large memory SMP systems of the past.
Workloads that require “Total Access” to a large data set are limited in performance by the network and process steps that connect the compute resources and the storage. The big benefit of a SAN shared file system is, first, that the shared components – the SAN resources – can use simple block level communication and multiple high bandwidth interfaces to deliver very high performance at low latency. Second, solutions can scale up performance almost linearly, since the coordination processes run on the servers rather than the storage. When customers have extreme application performance needs, this is the architecture I look to first.
Comparing storage performance is a bit more difficult than meets the eye. This comes up quite a bit as I frequently address the differences between RAM and Flash-based SSDs, particularly when comparisons are made between TMS’s flagship Flash system, the RamSan-630, and its flagship RAM system, the RamSan-440. The Flash system has higher IOPS, but the RAM system has lower latency.
There are three independent metrics of storage performance: response time, IOPS, and bandwidth. Understanding the relationships between these metrics is the key to understanding storage performance.
Bandwidth is really just a limitation of the design or standards used to connect storage. It is the maximum number of bytes that can be moved in a specific time period; response time overhead and concurrency do not play a factor. IOPS are nothing more than the number of I/O transactions that can be performed in a single second. Determining the maximum theoretical IOPS for a given transfer size is as simple as dividing the maximum bandwidth by the transfer size. For a storage system with a single 1 Gbps iSCSI connection (~100 MB/s bandwidth) and a workload of 64 KB transfers, the maximum IOPS will be ~1,500. If the transfer size is a single sector (512 bytes), the maximum IOPS will be ~200,000 – a notable difference. At that upper limit, bandwidth will more than likely not be the performance limiter.
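The bandwidth-to-IOPS division works out as follows for the 1 Gbps iSCSI example:

```python
# Upper bound on IOPS from link bandwidth alone:
#   IOPS_max = bandwidth / transfer size
LINK_BANDWIDTH = 100e6  # ~100 MB/s for a single 1 Gbps iSCSI connection

def max_iops(transfer_size_bytes):
    return LINK_BANDWIDTH / transfer_size_bytes

print(f"64 KB transfers: ~{max_iops(64 * 1024):,.0f} IOPS")  # ~1,500
print(f"512 B transfers: ~{max_iops(512):,.0f} IOPS")        # ~195,000
```

The same pipe supports over a hundred times more I/Os per second when the transfers are sector-sized, which is why bandwidth is rarely the binding constraint for small-block workloads.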
There is another relationship that ties together response time and concurrency – Little’s law. Little’s law governs the concurrency of a system needed to achieve a desired amount of throughput. For storage, Little’s Law is: (Outstanding I/Os) ÷ (response time) = IOPS. I consider this the most important formula in storage performance. If you boil this down, ultimately the limitation of IOPS performance is the ability of a system to handle Outstanding I/Os concurrently. Once that limit is reached, the I/Os get clogged up and the response time increases rapidly. This is the reason a common tactic to increase storage performance has been to simply add disks – each additional disk increases the concurrent I/O capabilities.
Interestingly, IOPS performance isn’t limited by response time. Lower response times merely allow a given level of IOPS to be achieved at lower levels of concurrency. There are practical limits on the level of concurrency that can be achieved by the interfaces to the storage (e.g. the execution throttle setting in an HBA), and many applications have fairly low levels of concurrent I/O, but the response time by itself does not limit the IOPS. This is why, even though Flash media has a higher response time than RAM, Flash systems that handle a high level of concurrent I/O can achieve IOPS performance as good as or better than RAM systems.
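Little's law makes the RAM-vs-Flash point concrete; the latency figures below are illustrative assumptions, not measured numbers for any particular RamSan:

```python
# Little's law for storage: IOPS = outstanding I/Os / response time.
def iops(outstanding_ios, response_time_s):
    return outstanding_ios / response_time_s

# Assumed, purely illustrative latencies: RAM ~15 us, Flash ~100 us.
ram_latency, flash_latency = 15e-6, 100e-6

# At the same low queue depth, the lower-latency medium wins:
print(f"RAM,   4 outstanding: {iops(4, ram_latency):,.0f} IOPS")
print(f"Flash, 4 outstanding: {iops(4, flash_latency):,.0f} IOPS")

# But a Flash system that sustains more concurrent I/O catches up:
print(f"Flash, 32 outstanding: {iops(32, flash_latency):,.0f} IOPS")
```

At a queue depth of 32, the higher-latency Flash system posts more IOPS than the RAM system does at a queue depth of 4, which is the concurrency argument in miniature.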
On a flight this weekend, I had a chance to catch up on some reading and work through most of an interesting paper on memory technology, database architectures, and the future of enterprise applications from Credit Suisse. After overcoming the eerie feeling that I had written a small part of it (looking through the references, I found some papers I had written), I was glad to have made it through the 100-plus pages. The full paper is available here (it is quite repetitive, so skimming chapters of interest works well).
The central premise of the paper is that the rise of high capacity solid state options (RAM and Flash) in combination with columnar databases will lead to a consolidation of OLTP and OLAP systems. The driving force is that large memory capacities and SSDs enable columnar databases to handle OLTP workloads. Once that is handled, the inherent advantages of columnar databases for analytics will lead to their widespread displacement of row-based database architectures.
Here is a 10,000 foot view of the distinction between row and column-based databases: In a row-based database, all of the information for a record in a table is stored together. This makes the record-centric activities that OLTP systems support (recording a transaction, viewing the details of an account, etc.) exceptionally fast. In a columnar database, the data in a table is partitioned so that all of the values for a given field are stored together. This makes analytical activities, where queries aggregate data from particular fields across all of the records, much faster, because the queries can avoid reading the fields that are not needed. The trade-off of the columnar approach is that when a new record is added, all of the separate fields need to be updated with separate I/O operations. The driving force for database adoption has been the automation of business processes that tend to be transactional, so, due to the expense of record creation in columnar databases, row-based database systems dominate.
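The two layouts can be sketched in a few lines of Python; the account table and its field names are hypothetical, invented only to illustrate the storage difference:

```python
# Layout sketch: the same three (hypothetical) records stored row-wise
# vs column-wise.
records = [
    {"id": 1, "name": "Alice", "balance": 120},
    {"id": 2, "name": "Bob",   "balance": 75},
    {"id": 3, "name": "Carol", "balance": 200},
]

# Row store: each record is contiguous -- one access fetches a whole
# account, which is what OLTP lookups and inserts want.
row_store = [tuple(r.values()) for r in records]

# Column store: each field's values are contiguous -- an aggregate like
# SUM(balance) touches only one column, but inserting a new record must
# update every column with a separate operation.
col_store = {field: [r[field] for r in records] for field in records[0]}

print(sum(col_store["balance"]))  # scans only the balance column
```

Reading one contiguous list versus reassembling values scattered across every record is the whole performance story, in both directions.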
SSDs and in-memory databases shift this calculus by dramatically lowering the “cost” of a small block operation to storage. The added expense of in-memory or SSD storage can be offset by some of the space saving advantages that columnar databases provide: the ease of compressing data when it is grouped by fields rather than records, and the ability to organize data in a way that avoids the need for additional indexes. Having firsthand experience with the parallel I/O capability of SSD systems, I am not worried about the hardware’s ability to handle a heavier I/O workload. The biggest hurdle that I see is altering the locking mechanisms to handle updating multiple locations efficiently while maintaining transactional consistency. Predicting shifting trends in the IT industry is always a difficult business.
This paper lays out interesting arguments and looks to predict the large vendors’ moves. Rewriting software always takes more time and resources than expected, but solid state storage options represent such a change in hardware behavior that software paradigm shifts are bound to happen.
As an aside, there was one area where I disagreed with the paper: the argument that SSDs would be used as a stepping stone toward adoption of entirely in-memory database systems. The factors I weigh for larger systems are cost, performance, power, and restart persistence. The advantages of Flash for cost, power, and restart persistence are quite obvious, so that leaves performance as the only reason for in-memory to beat out SSDs. A simplistic argument is that RAM is faster than Flash because the read and write latency is so much lower. This is definitely true, but if the software has to change its locking mechanisms to allow true parallelization of inserts and updates anyway, then the latency differences can be handled. High performance SSDs can rival memory in terms of parallel IOPS performance.