Hadoop Compression – The Elephant That’s Not In The Room

This article was originally posted at Cloudtimes.org http://cloudtimes.org/hadoop-compression-the-elephant-thats-not-in-the-room/

We are living in the age of “Big Data,” where billions of transactions, events or activities are generated through the use of smartphones, web browsing, smart-meter sensors and more. Hadoop, MapReduce and a new generation of NoSQL technologies are helping us manage, transform, analyze and deal with the data overload. Together they process extreme volumes using techniques and technologies that eschew the traditional “big iron” infrastructure that normally contributes to the high cost of running IT departments and data centers. By running on low-cost commodity servers and direct-attached storage, Hadoop and HDFS clusters can be used to process petabytes of raw data, producing meaningful, actionable business insights in a fraction of the time previously thought possible.

With the cost of acquiring new racks of physical storage continuing to trend lower per TB, it would seem that this should be a winning combination for many years to come. But as the story goes, whether it’s CPU-intensive computing power, RAM or physical storage, we will always find ways to consume and exceed current capacity. Invented by Doug Cutting and named after his son’s toy elephant, Hadoop can efficiently store and retrieve large data sets for processing. However, the stored data footprint actually becomes 2 to 3 times larger than the original raw size due to replication across nodes. Since Hadoop has no built-in compression, this has led to the use of basic binary compression technologies such as Gzip (see Amazon AWS’ guide to Hadoop compression here) and LZO (see the blog post by Matt Massie back in 2009 about using LZO with Hadoop at Twitter) to reduce the amount of disk required. As we know, binary compression has its limits and comes with a re-inflation penalty upon access. Meanwhile, higher compression rates and savings have been realized through other techniques such as de-duplication of files and objects with products such as Data Domain. But that presupposes that you have many copies of the same object, or that blocks of data on disk exhibit the same characteristics.
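To make the codec discussion concrete, here is a minimal sketch of how block compression is typically enabled on a MapReduce job through the standard Hadoop Java API; the job name and input/output paths are hypothetical, and the snippet is illustrative rather than taken from any particular distribution.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to reduce shuffle I/O between nodes.
        conf.setBoolean("mapreduce.map.output.compress", true);

        Job job = Job.getInstance(conf, "gzip-output-example"); // hypothetical job name
        job.setJarByClass(CompressedJobSketch.class);

        // Compress the final job output with the bundled Gzip codec;
        // an installed LZO codec class could be substituted here.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // hypothetical output path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that anything written this way still has to be decompressed when it is read back, which is exactly the re-inflation penalty described above.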

In reality, certain types of structured and semi-structured data do have similar characteristics, but at a much more granular level than a file or object. Transactions, call data records, log entries or events have repeated data values and patterns that are common across individual records and groups of records. These can be de-duplicated so that only unique entries are retained. This level of de-duplication (value-level de-dupe, similar to what columnar databases do) generally yields compression rates far greater than binary compression. The challenge, of course, is maintaining the integrity and original immutability of the individual records loaded into the system, ensuring that data can be accessed on demand without a high performance penalty. Our compression obsession has shown that when you achieve significant compression, many things become easier and in some cases even faster! Smaller amounts of data, written as large blocks, result in less I/O, as well as less bandwidth consumed when moved between nodes or networks. This means that data can be stored in a shared-nothing architecture like HDFS while also benefiting from a “logically shared everything” model, where each node can have access to data located on other nodes without major performance impact. Additionally, in a heterogeneous server environment, higher-compute-capacity nodes can actually compete for more tasks.
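As an illustration of value-level de-duplication (not a description of any particular product), the following sketch dictionary-encodes a heavily repeated field the way a columnar store might: each distinct value is stored once, each record keeps only a small integer reference, and any individual record can still be reconstructed on demand.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ValueDedupSketch {
    public static void main(String[] args) {
        // Hypothetical CDR-like field with heavily repeated values.
        String[] callTypes = {"VOICE", "SMS", "VOICE", "DATA", "VOICE", "SMS", "DATA", "VOICE"};

        Map<String, Integer> dictionary = new HashMap<>(); // value -> id
        List<String> uniqueValues = new ArrayList<>();     // id -> value
        int[] encoded = new int[callTypes.length];         // per-record references

        for (int i = 0; i < callTypes.length; i++) {
            Integer id = dictionary.get(callTypes[i]);
            if (id == null) {
                id = uniqueValues.size();
                dictionary.put(callTypes[i], id);
                uniqueValues.add(callTypes[i]);
            }
            encoded[i] = id;
        }

        System.out.println(callTypes.length + " records, "
                + uniqueValues.size() + " unique values stored");
        // Individual records stay addressable without re-inflating a whole compressed block.
        System.out.println("record 4 call type: " + uniqueValues.get(encoded[4]));
    }
}

Applied across millions of records and many fields, this is the kind of encoding that lets value-level de-duplication beat block-level binary compression while keeping records individually accessible.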

When you are dealing with petabyte-scale data, as major communication service providers (CSPs) do when they capture tens of billions of WAP logs and CDRs a day, basic binary compression simply doesn’t go far enough. Compliance requirements for on-demand accessibility, with retention periods ranging from three months to several years, make higher compression rates a critical factor in keeping operational costs at levels that can scale with the growth in subscribers and activity.

Compliance isn’t the only reason to retain large data sets. Better historical business reference and trend analysis across years of gathered data, or test data spanning millions of critical components, all require that data be accessible to yield the next great set of business insights. Take the announcement by Yahoo (an early user of and significant contributor to Hadoop), which said it is taking the extraordinary measure of retaining more data than the compliance requirement dictated by the European Union (EU) – (see Yahoo Jacks Data Retention Period from 90 days to 18 months). This raises another issue around when and how to determine what data should be removed as expiry periods are reached. But that’s another topic for another time.

As a community we continue to make great strides in leveraging Hadoop, such as Cloudera’s Distribution Including Apache Hadoop, which brings together a wealth of complementary technologies and components for enterprise-class Big Data management. However, in order to process Big Data, you will need to combine Hadoop with compression that can keep up. At petabyte scale today, trending toward exabyte scale in the future, the topic of Hadoop compression is one elephant that is conspicuously missing from the room.
