Columnar Databases and Solid State Storage
On a flight this weekend, I had a chance to catch up on some reading and work through most of an interesting paper from Credit Suisse on memory technology, database architectures, and the future of enterprise applications. After overcoming the eerie feeling that I had written a small part of it (a look at the references turned up some papers I had written), I was glad to have made it through the 100-plus pages. The full paper is available here (it is quite repetitive, so skimming the chapters of interest works well).
The central premise of the paper is that the rise of high-capacity solid state options (RAM and flash), in combination with columnar databases, will lead to a consolidation of OLTP and OLAP systems. The driving force is that large memory capacities and SSDs enable columnar databases to handle OLTP workloads. Once that hurdle is cleared, the inherent advantages of columnar databases for analytics will lead to their widespread displacement of row-based database architectures.
Here is a 10,000-foot view of the distinction between row-based and column-based databases. In a row-based database, all of the information for a record in a table is stored together. This makes the record-centric activities that OLTP systems support (recording a transaction, viewing the details of an account, etc.) exceptionally fast. In a columnar database, the data in a table is partitioned so that all of the values for a given field are stored together. This makes analytical activities, where queries aggregate data from particular fields across all of the records, much faster, because the queries can avoid reading the fields that are not needed. The trade-off of the columnar approach is that when a new record is added, each of the separate fields must be updated with a separate I/O operation. The driving force for database adoption has been the automation of business processes, which tend to be transactional, so, due to the expense of record creation in columnar databases, row-based database systems dominate.
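The layout difference can be sketched with a toy example. This is a rough illustration in plain Python, with lists standing in for on-disk pages and made-up field names, not how any real database is implemented:

```python
# Row-oriented: each record's fields are stored together.
rows = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 75.5},
]

# Column-oriented: each field's values are stored together.
columns = {
    "id": [1, 2],
    "region": ["east", "west"],
    "amount": [120.0, 75.5],
}

# Analytic query: total amount. The columnar layout touches only the
# "amount" column and never reads ids or regions.
total = sum(columns["amount"])

# Insert: the row store appends one record in a single operation; the
# column store must append to every column separately (one write each).
rows.append({"id": 3, "region": "east", "amount": 42.0})
for field, value in (("id", 3), ("region", "east"), ("amount", 42.0)):
    columns[field].append(value)
```

The insert loop at the bottom is the whole trade-off in miniature: one append for the row store versus one write per field for the column store.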
SSDs and in-memory databases shift this calculus by dramatically lowering the “cost” of a small block operation to storage. The added expense of in-memory or SSD storage can be offset by the space-saving advantages columnar databases provide: the ease of compressing data when it is grouped by fields rather than records, and the ability to organize data in a way that avoids the need for additional indexes. Having firsthand experience with the parallel I/O capability of SSD systems, I am not worried about the hardware's ability to handle a heavier I/O workload. The biggest hurdle that I see is altering the locking mechanisms to update multiple locations efficiently while maintaining transactional consistency. Predicting shifting trends in the IT industry is always a difficult business.
This paper lays out interesting arguments and looks to predict the large vendors' moves. Rewriting software always takes more time and resources than expected, but solid state storage represents such a change in hardware behavior that software paradigm shifts are bound to happen.
As an aside, there was one area where I disagreed with the paper: the argument that SSDs will be used as a stepping stone toward the adoption of entirely in-memory database systems. The reasons I don't see this happening for larger systems are cost, performance, power, and restart persistence. The advantages of flash for cost, power, and restart persistence are quite obvious, which leaves performance as the reason for in-memory to beat out SSDs. The simplistic argument is that RAM is faster than flash because its read and write latencies are so much lower. That is definitely true, but if the software has to change its locking mechanisms anyway to allow true parallelization of inserts and updates, then the latency difference can be handled: high-performance SSDs can rival memory in terms of parallel IOPS.
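The IOPS point is essentially Little's law: sustained throughput is roughly the number of outstanding operations divided by per-operation latency. A back-of-the-envelope sketch, using illustrative latency and queue-depth numbers rather than measurements of any particular device:

```python
def iops(queue_depth, latency_seconds):
    """Little's law: sustained operations/sec ~= concurrency / latency."""
    return queue_depth / latency_seconds

# A serial workload pays the full latency gap between DRAM and flash...
serial_ssd = iops(1, 50e-6)      # one outstanding ~50 us flash read
# ...but a parallelized workload keeps many flash channels busy at once.
parallel_ssd = iops(64, 50e-6)   # queue depth 64 on the same device
```

At queue depth 1 the device delivers tens of thousands of IOPS; at depth 64 it delivers over a million, which is why software that can issue many concurrent small writes (as a columnar insert must) closes much of the gap with memory.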