Is There Room for Solid State Disks in the Hadoop Framework?
The MapReduce framework developed by Google has led to a revolution in computing. This revolution has accelerated as the open source Apache Hadoop framework has become widely deployed. The Hadoop framework includes important elements that leverage hardware layout, new recovery models, and help developers easily write parallel programs. A big part of the framework is a distributed parallel file system (HDFS) that uses local commodity disks to keep the data close to the processing, offers lots of managed capacity, and triple mirrors for redundancy. One of the big efficiencies in Hadoop for “Big Data” style workloads is that it moves the process to the data rather than the data to the process. Knowledge and use of the storage layout is part of the framework that makes it effective.
On the surface, this is not an environment that screams for SSDs due to the focus on effectively leveraging commodity disks. Some of the concepts of the framework can be applied to other areas that impact SSDs. However, one of the major impacts of Hadoop is that it enables software developers without a background in parallel programming to write highly parallel applications that can process large datasets with ease. This aspect of Hadoop is very important and I wanted to see if there was a place for SSDs in production Hadoop installations.
I worked with some partners and ran through a few workloads in the lab to get a better understanding of the storage workload and see where SSDs can fit in. I still have a lot of work to do on this but I am ready to draw a few conclusions. First, I’ll attempt to describe how Hadoop uses storage using a minimum of technical Hadoop terminology.
There is one part of the framework that is critical to understand: All input data is interpreted in a <key, value> form. The framework is very flexible in what these can be – for instance, one can be an array – but this format is required. Some common examples of <key, value> pairs are <word, frequency>, <word, location>, and <sources web address, destination web address>. These example <key, value> pairs are very useful in searching and indexing. More complex implementations can use the output of one MapReduce run as the input to additional runs.
To understand the full framework see the Hadoop tutorial: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
Here is my simplified overview, highlighting the storage use:
Map: Input data is broken into shards and placed on the local disks of many machines. In the Map phase this data is processed into an intermediate state (the data is “Mapped”) on a cluster of machines, each of which is working only on its shard of the data. Doing this allows the input data in this phase to be fetched from local disks. Limiting what you can do to only local operations on a piece of the data within a local machine allows Hadoop to turn simple functional code into a massively parallel operation. The intermediate data can be bigger or smaller than the input, and it is sorted locally and grouped by key. The intermediate data sorts that are too large to fit in memory take place on local disk (outside of the parallel file system).
Reduce: In this phase the data from all the different map processes are passed over the network to the reduce process that has been assigned to work on a particular key. Many reduce process run in parallel, each on separate keys. Each reduce process runs through all of the values associated with the key and outputs a final value. This output is placed locally in the parallel distributed file system. Since the output of one map-reduce run is often the input to a second one this pre-shards the input data to subsequent map-reduce passes.
Over a run there are a few storage/ data intensive components:
- Streaming the data in from local disk to the local processor.
- Sorting on a temporary space when the intermediate data is too large to sort in memory.
- Moving the smaller chunks of data across the network to be grouped by key for processing.
- Streaming the output data to local disk.
There are two places that SSDs can plug into this framework easily and a third place that SSDs and non-commodity servers can be used to tackle certain problems more effectively.
First, the easy ones:
- Using SSDs for storage within the parallel distributed file system. SSDs can be used to deliver a lower cost per MB/s than disks (this is highly manufacture design dependent), although the SSD has a higher cost per MB. For workloads where the dataset is small and processing time requirements are high, SSDs can be used in place of disks. Right now with most Hadoop problems focusing on large data sets, I don’t expect this to be the normal deployment, but this will grow as the price per MB difference between SSDs and HDDs compresses.
- Using SSDs for the temporary space. The temporary space is used for sorting that is I/O intensive, and it is also accessed randomly as the reduce processes fetch the data for the keys that they need to work with. This in an obvious place for SSDs as it can be much smaller than the disks used for the distributed file system, and bigger than the memory in each node. If the jobs that are running result in lots of disk sorts, using SSDs here makes sense.
The final use case is of the most interest to enterprise SSD suppliers and is based on changing the cluster architecture to impact one of the fundamental limitations of the MapReduce framework (from the Google Labs paper):
“Network bandwidth is a relatively scarce resource in our computing environment. We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS ) is stored on the local disks of the machines that make up our cluster.”
The key here is that “Network Bandwidth” is scarce because the number of nodes in the cluster is large. This inherently limits the network performance due to cost concerns. The CPU resources are actually quite cheap in commodity servers. So this setup works well with many workloads that are CPU intensive but where the intermediate data that needs to be passed around is small. However, this is really an attribute of the cluster setup more than of the Hadoop framework itself. Rather than using commodity machines you can use a more limited number of larger nodes in the cluster and a high performance network. Instead of using lots of 1 CPU machines with GigE networking, you can use quad socket servers with many cores and QDR Infiniband. With 4x the CPU resources and 40x the node to node bandwidth, you now have a cluster that is geared towards jobs that require lots of intermediate data sharing. You also now need the local storage to be much faster to keep the processors busy.
High bandwidth/high capacity enterprise SSDs, particularly PCIe SSDs, fit well into this framework as you want to have higher local storage bandwidth than node-to-node. This is in effect a smaller cluster of “fat” nodes. The processing power in this setup is more expensive, so this isn’t the best setup for CPU intensive jobs.
Hadoop is a powerful framework. One of its key advantages is that it enables highly parallel programs to be written by software engineers without a background in computer science. This is one reason it has grown so popular and why it is important for hardware vendors to see where they fit in. There is a lot of flexibility in the hardware layout, and different cluster configurations make sense for tackling different types of problems.