Archive for the ‘Writes’ Category

SSSI Performance Test Specification

November 7, 2011 1 comment

I recently joined the governing board of SNIA’s Solid State Storage Initiative (SSSI), an organization designed to promote solid state storage and standards around the technology.  One of the biggest points of contention in the SSD space is comparing various devices’ performance claims.  This is contentious because test parameters can have a dramatic impact on the achievable performance.  Flash controllers are implemented in ways where the amount of over provisioning (see my post on this), the compressibility of the data, and how full the device is can each have a dramatic performance impact.  To help standardize the testing process and methodology for SSDs, the SSSI developed the Performance Test Specification (PTS).

This specification outlines the requirements for a test tool and defines the methodologies to use to test an SSD and the data that needs to be reported.  The SSSI is actively working to promote this standard.  One of the hurdles to adoption is that many vendors use their own testing methodology and tools internally and don’t want to modify their processes.  There is also hesitation to publish numbers that lack best-case assumptions, since they may be compared against competitors’ numbers measured under different circumstances.  This can make life more difficult for end users, as there is a gap between having a specification and having an easy way to run the test.

I spent some time over the past week making a bash script that uses the Flexible I/O utility (you will need to download this utility to use the script) to implement my reading of the IOPS section of the test.  I also made an Excel template to paste the results from the script into.  You can use this to select the right measurement window and create an IOPS chart for various block sizes.  There are still a few parameters that you need to edit in the script and you will have to do some manual editing of the charts to meet the reporting requirements, but I hope that this will help as a starting point for users that want to get an idea of what the spec does and run tests according to its methodology.

SSSI PTS report Template

Script (Copy and Paste):

#This Script Runs through part of the SSSI PTS using fio as the test tool for section 7 IOPS
#By Jamon Bowen
#check to see if right number of parameters.
if [ $# -ne 1 ]
then
  echo "Usage: $0 /dev/<device to test>"
  exit 1
fi
#The output from a test run is placed in the ./results folder.
#This folder is recreated at the start of every run.
rm -rf results
mkdir results
# Section 7 IOPS test
echo "Running on SSSI PTS 1.0 Section 7 - IOPS on device: $1"
echo "Device information:"
fdisk -l $1
#7.1 Purge device
echo "****Prior to running the test, Purge the SSS to be in compliance with PTS 1.0"
#These variables need to be set to the test operator's choice of SIZE, outstanding IO per thread and number of threads.
#To test the full device use fdisk -l to get the device size and update the values below.
#The three values below are placeholders only; edit them before running.
SIZE=100G
OIO=32
THREADS=4
echo "Test range 0 to $SIZE"
echo "OIO/thread = $OIO, Threads = $THREADS"
echo "Test Start time: `date`"
#7.2(b) Workload independent preconditioning
#Write 2x device user capacity with 128 KiB sequential writes.
echo "****Preconditioning"
./fio --name=precondition --filename=$1 --size=$SIZE --iodepth=$OIO --numjobs=1 --bs=128k --ioengine=libaio --invalidate=1 --rw=write --group_reporting --eta=never --direct=1 --thread --refill_buffers
echo "****50% complete: `date`"
./fio --name=precondition --filename=$1 --size=$SIZE --iodepth=$OIO --numjobs=1 --bs=128k --ioengine=libaio --invalidate=1 --rw=write --group_reporting --eta=never --direct=1 --thread --refill_buffers
echo "****Precondition complete:`date`"
#IOPS test 7.5
echo "****Random IOPS TEST"
echo "================"
echo "Pass, BS, %R, IOPS" >> "results/datapoints.csv"
#25 passes over every block size and read/write mix; each data point runs for 60 seconds.
for PASS in `seq 1 25`;
do
  echo "Pass, BS, %R, IOPS, Time"
  for i in 512 4096 8192 16384 32768 65536 131072 1048576 ;
  do
    for j in 0 5 35 50 65 95 100;
    do
      #Sum the read and write IOPS reported by fio for this data point.
      IOPS=`./fio --name=job --filename=$1  --size=$SIZE --iodepth=$OIO --numjobs=$THREADS --bs=$i --ioengine=libaio --invalidate=1 --rw=randrw --rwmixread=$j --group_reporting --eta=never --runtime=60 --direct=1 --norandommap --thread --refill_buffers | grep iops | gawk 'BEGIN{FS = "="}; {print $4}' | gawk '{total = total +$1}; END {print total}'`
      echo "$PASS, $i, $j, $IOPS, `date`"
      echo "$PASS, $i, $j, $IOPS"  >> "results/datapoints.csv"
    done
  done
done
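If you would rather not paste the results into Excel, the averaging step can also be done on the command line. The following is only a rough sketch, not part of the original script: the function name window_average is made up for illustration, and the pass range is whatever measurement window you choose after checking the data for steady state (the example uses passes 21 to 25 purely as an assumption). It reads results/datapoints.csv in the format the script writes and prints the average IOPS for each block size and read percentage.

```shell
# Sketch: average the IOPS for each (block size, read %) pair over a
# chosen window of passes. Expects the CSV layout written by the script
# above: a header line, then "Pass, BS, %R, IOPS" rows.
# Usage: window_average <csv file> <first pass> <last pass>
window_average() {
  awk -F', *' -v first="$2" -v last="$3" '
    NR > 1 && $1 >= first && $1 <= last {
      key = $2 ", " $3          # block size and read percentage
      sum[key] += $4
      count[key]++
    }
    END {
      for (key in sum)
        printf "%s, %.1f\n", key, sum[key] / count[key]
    }' "$1" | sort -t, -k1,1n -k2,2n
}

# Example (hypothetical window): window_average results/datapoints.csv 21 25
```

The output is sorted by block size and then by read percentage, so it can be pasted straight into a chart. Choosing the window is still a manual step; this only does the arithmetic.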
Categories: SSD, Writes

Update from the Field: VMworld 2011

August 30, 2011 Leave a comment

This week in TMS’s booth (booth #258) at VMworld we have a joint demo with our partner DataCore that shows an interesting combination of VMware, SSDs, and storage virtualization.  We are using DataCore’s SANsymphony-V software to create an environment with the RamSan-70 as tier 1 storage and SATA disks as tier 2.  The SANsymphony-V software handles the tiering, high-availability mirroring, snapshots, replication, and other storage virtualization features.


Iometer is running on four virtual machines within a server and handling just north of 140,000 4 KB read IOPS.  A screen shot from Iometer on the master manager is shown below:

Running 140,000 IOPS is a healthy workload, but the real benefit of this configuration is its simplicity.  It uses just two 1U servers and hits all of the requirements for a targeted VMware deployment.  Much of the time, RamSans are deployed in support of a key database application where exceptionally high performance shared SSD capacity is the driving requirement.  RamSan systems implement a highly parallel hardware design to achieve an extremely high performance level at exceptionally low latency.  This is an ideal solution for a critical database environment where the database has all of the tools integrated that are normally “outsourced” to a SAN array (such as clustering, replication, snapshots, backup, etc.). However, in a VMware environment many physical and virtual servers are leveraging the SAN, so pushing the data management to each application is impractical.

Caching vs. Tiering

One of the key use cases of SSDs in VMware environments is automatically accelerating the most accessed data as new VMs are brought online, grow over time, and retire.  The benefit of a flexible virtual infrastructure makes seamless automatic access to SSD capacity more important.  There are two accepted approaches to properly integrating an SSD in a virtual environment; I’ll call them caching and tiering.  Although similar on the surface, there are some important distinctions.

In a caching approach, the data remains in place (in its primary location) and a cached copy is propagated to SSD.  This setup is best suited to heavily accessed read data because write-back caches break all of the data management storage features running on the storage behind it (Woody Hutsell discusses this in more depth in this article).  This approach is effective for frequently accessed static data, but it is not ideal for frequently changing data.

In tiering, the actual location of the data moves from one type of persistent storage to another.  In the read-only caching case it is possible to create a transparent storage cache layer that is managed outside of the virtualized layer, but when tiering with the SSD, tiering and storage virtualization need to be managed together.

SSDs have solved the boot-storm startup issues that plague many virtual environments, but VMware’s recent license model updates sparked increased interest in other SSD use cases.  With VMware moving to a memory-based licensing model there is interest in using SSDs to accelerate VMs with a smaller memory footprint.  In a tiering model, if VMDKs are created on LUNs that leverage SSDs, the virtual guest will automatically move the internal paging mechanisms within the VM to low latency SSDs.  Paging is write-heavy, so the tiering model is important to ensure that the page files are leveraging SSD as they are modified (and that the less active storage doesn’t use the SSD).

We are showing this full setup at our booth (#258) at VMworld. If you are attending I would be happy to show you the setup.

Categories: Architectures, Cloud, Writes

Part 1: eMLC – It’s Not Just Marketing

August 23, 2011 Leave a comment

Until recently, the model for Flash implementation has been to use SLC for the enterprise and MLC for the consumer.   MLC solutions traded endurance, performance, and reliability for a lower cost while SLC solutions didn’t.  The tradeoff of 10x the endurance for 2x the price led most enterprise applications to adopt SLC.

But there is a shift taking place in the industry as SSD prices start to align with the prices enterprise customers have been paying for tier 1 HDD storage (which is much higher than the cost of consumer drives).  If the per-GB pricing is similar, you can add so much more capacity that endurance becomes less of a concern.  Rather than only having highly transactional OLTP systems on SSDs, you can move virtually every application using tier 1 HDD storage to SSD.  The biggest concern for tier 1 storage has been that the most critical datasets that reside on them cannot risk a full 10x drop in endurance.

Closing the gap: eMLC

At first glance, enterprise MLC (eMLC) sounds like Marketing is trying to pull a fast one.  If there were a simple way to make MLC have higher endurance, why restrict it from consumer applications?  eMLC sports endurance levels of 30,000 write cycles, whereas some of the newest MLC only has 3,000 write cycles (SLC endurance is generally 100,000 write cycles).  There is a big reason for this restriction: eMLC trades away retention to gain this endurance.

It’s not commonly understood that although Flash is considered persistent, the data is slowly lost over time.  For most Flash chips the retention is around 10 years – longer than most use cases.  With eMLC, longer program and erase times are used with different voltage levels than MLC to increase the endurance. These changes reduce the retention to as low as three months for eMLC.  This is plenty of retention for an enterprise storage system that can manage the Flash behind the scenes, but it makes eMLC impractical for consumer applications.  Imagine if you didn’t get around to copying photos from your camera within a three month window and lost all the pictures!

Today, Texas Memory Systems announced  its first eMLC RamSan product: the RamSan-810.  This is a major announcement as we have investigated eMLC for some time (I have briefed analysts on eMLC for almost a year; here is a discussion of eMLC from Silverton Consulting, and a recent one from Storage Switzerland).   TMS was not the first company to introduce an eMLC product as the RamSan’s extremely high performance backplane and interfaces make endurance concerns more palpable.  However, with the latest eMLC chips, we aggregate enough capacity to be comfortable introducing an eMLC product that bears the RamSan name.

10 TB of capacity multiplied by 30,000 write cycles equates to 300 PB of lifetime writes. This amount of total writes is difficult to achieve during the lifetime of a storage device, even at the high bandwidth the RamSan-810 can support.  Applications that demand the highest performance and those with more limited capacity requirements will still be served best by an SLC solution, but very high capacity SSD deployments will shift to eMLC.  With a density of 10 TB per rack unit, petabyte scale SSD deployments are now a realistic deployment option.
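To put the 300 PB figure in perspective, here is a back-of-the-envelope sketch of the arithmetic. The function names are made up for illustration, and the 1 GB/s sustained write rate in the example is a hypothetical number chosen for easy math, not a RamSan-810 specification. It uses decimal units and ignores write amplification and uneven wear, so treat the result as an optimistic upper bound.

```shell
# Sketch: lifetime writes and time to wear out, in decimal units.
# Usage: lifetime_writes_pb <capacity in TB> <write cycles>
lifetime_writes_pb() {
  awk -v tb="$1" -v cycles="$2" 'BEGIN { printf "%.0f\n", tb * cycles / 1000 }'
}

# Usage: years_to_wear_out <lifetime writes in PB> <sustained writes in GB/s>
years_to_wear_out() {
  awk -v pb="$1" -v gbps="$2" \
    'BEGIN { printf "%.1f\n", pb * 1e6 / gbps / (365 * 86400) }'
}

lifetime_writes_pb 10 30000   # prints 300
years_to_wear_out 300 1       # prints 9.5 (years of nonstop writing at 1 GB/s)
```

Even writing around the clock at the assumed rate, it takes years to consume the rated cycles, which is the point of aggregating so much capacity behind one controller.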

We’re just getting warmed up discussing eMLC. Stay tuned for another post soon on tier 1 disk pricing vs. eMLC.

Categories: SSD, Writes

SSD Sustained Write Performance

April 21, 2011 2 comments

There is a lot of confusion out there on how SSDs handle writes, and I thought it would be worthwhile to shed some light on this.  Flash chips need to have entire blocks (256 KB) written and erased at the same time.  Since applications write in smaller chunks than this, there is something called a Flash Translation Layer that handles write I/Os.  I’ll try to summarize the details here.

The Flash Translation Layer keeps an index of where a block is physically written and the logical address that is presented to the host.  After a continuous small block random write workload, eventually there are no pre-erased blocks available to write to.  The flash controller has to perform “moves” of data from a few blocks that have some valid data and some stale data.  These moves are performed to get a few completely full blocks and a new empty block to write to.  A move operation ties up the chip and is just a little less work than a write.  The amount of over provisioned capacity determines how many background moves you will have to perform for each new write that comes in under the worst possible scenario.

For example, on the RamSan-20 Datasheet we list the sustained number for our random writes: 50,000 IOPS.  Outside of this worst case you can achieve 160,000 write IOPS.  The amount of over provisioned space (~30%) is not arrived at by accident, and roughly 1/3 of peak performance is not a coincidence either.  The amount of extra space that is available determines the maximum number of moves that have to be done to get a completely empty block, because once all of the spare area is used up, effectively every block will have about as much stale data as the over provisioned percentage.  So (ignoring the complexities of the actual math) you have to perform a number of moves that is just a little less than 1/(the over provisioning percent) in the absolute worst case to get a full block to write to.  With a very small over provisioning percent, the number of moves that have to take place can be very high.  These moves keep the chips busy almost as long as a normal write, so the number of write IOPS you can perform declines.

So why use ~30% over-provisioned space?  Because there are diminishing returns in terms of performance improvement as the over provisioning increases.  The figure below graphs the function – 1/(over provisioning percent) (a rough approximation of the number of moves required in the worst case scenario):
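For a quick numeric feel for that curve, the sketch below tabulates 1/(over provisioning percent) for a few values. The function name is invented for illustration, and the formula is only the rough worst-case approximation described above, not the real controller math.

```shell
# Sketch: rough worst-case moves per incoming write, using the
# 1/(over provisioning percent) approximation from the text.
# Usage: worst_case_moves <over provisioning in percent>
worst_case_moves() {
  awk -v op="$1" 'BEGIN { printf "%.1f\n", 100 / op }'
}

for op in 5 10 20 30 50; do
  printf '%s%% over provisioning: about %s moves\n' "$op" "$(worst_case_moves "$op")"
done
# 5% -> 20.0, 10% -> 10.0, 20% -> 5.0, 30% -> 3.3, 50% -> 2.0
```

Going from 5% to 30% cuts the worst case from about 20 moves to about 3, while going from 30% to 50% only saves about one more, which is where the diminishing returns show up.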

It is worth noting that, in the real world of applications, you are unlikely to be performing 100% random writes, so a good flash controller will perform background move operations to defragment the valid and stale data.  To get to the worst case, you have to randomly write across all of the capacity without ever stopping to read.  While some logging applications have this type of constant write workload, they are never random.  Nevertheless, it is still important to understand this concept to accurately compare different devices’ performance.  It is tempting to reduce the amount of over provisioned space to lower the cost per GB, but then the number of worst case moves can become a real problem.  Keep in mind that the over provisioned space shrinks over time, as it is also intended to accommodate flash blocks that have worn out.

Some controllers may not actively defragment the space to save on controller costs, so the worst case performance becomes the “real” performance after the drive has been written a few times.  This is becoming the exception, but still something to look out for.  This performance decline in the sustained random write case can also easily be missed if there is no way for the host driving the workload to keep all of the chips busy all of the time. This is because the decline in performance requires hitting the flash as hard as possible.  But with SSDs at a premium price to disks based on their performance, limiting performance all of the time just to avoid a performance decline in the worst case isn’t a good design tradeoff. For more information on write handling, SNIA’s Solid State Storage Initiative’s performance test specification goes into quite a bit of depth.

Categories: SSD, Writes