A Small Price to Pay for Big (Machine-generated) Data Retention

Big Data used to be generated mostly by human-driven interaction (texting, online retail purchases, stock trades), but more and more of it is now machine-generated (call data records, automated stock trades, smart meter sensors, security monitoring appliances, test and measurement devices). Machine-generated data (MGD) is widely expected to form the bulk of data growth going forward.

Cisco recently published an eye-opening white paper containing a number of Big Data-relevant facts, among them that “Global mobile data traffic will increase 26-fold between 2010 and 2015”. In case you didn’t know, 1 exabyte is 1,000 petabytes (PB), or 1,000,000 terabytes (TB). It used to be popular to say that “all words ever spoken by human beings” could be stored in approximately 5 exabytes. Cisco forecasts that mobile data traffic alone will hit 6.3 exabytes a month in 2015. Now that’s some Big Data!
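If you want to put that number in everyday units, here is a quick back-of-the-envelope conversion (a minimal Python sketch; decimal units are assumed, and the only input is Cisco’s 6.3 EB/month forecast):

```python
# Back-of-the-envelope conversion of Cisco's 2015 mobile traffic forecast.
# Decimal units assumed: 1 EB = 1,000 PB = 1,000,000 TB.
monthly_mobile_traffic_eb = 6.3                       # exabytes per month (Cisco forecast)
monthly_pb = monthly_mobile_traffic_eb * 1_000        # petabytes per month
monthly_tb = monthly_mobile_traffic_eb * 1_000_000    # terabytes per month
annual_eb = monthly_mobile_traffic_eb * 12            # exabytes per year

print(f"{monthly_mobile_traffic_eb} EB/month = {monthly_pb:,.0f} PB = {monthly_tb:,.0f} TB")
print(f"Roughly {annual_eb:,.1f} EB of mobile traffic per year")
```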

Who cares, you say; won’t all of that data just pass through networks and disappear into the digital abyss? Not so. The proliferation of mobile devices, together with regulatory compliance requirements to retain and maintain ongoing on-demand access to data such as call data records (CDRs) and Wireless Application Protocol (WAP) logs, puts the telecommunications industry at the forefront of the Big Data wave. But telco is not alone; Big Data retention needs are popping up in other industries and sectors, such as utilities with smart meters and cyber security with events and logs.

Whether driven by regulatory requirement or competitive business need, Big Data is big business, as evidenced by the rash of M&A activity in the Big Data analytics space: EMC acquired Greenplum, IBM acquired Netezza, and most recently HP acquired Vertica. While those products have succeeded in driving the cost of complex deep analytics down from the “enterprise-premium” prices of traditional OLTP RDBMS and data warehousing systems, they are still not in the same economic ballpark as a repository optimized for long-term, extreme MGD retention. I would argue that anyone using those products for long-term retention is still paying a premium, albeit a smaller one than before. As an aside, and as I previously blogged, Big Data retention requires unique features that these products do not offer.

But let’s focus on the economics of retaining MGD. Clearly, the capital expenditure (CAPEX) on the hardware and software needed to ingest the data and allow on-demand querying is what captures everyone’s initial attention. This is why Big Data software and data warehousing appliances are typically measured and priced per TB. Curt Monash leads the thinking on this often confusing pricing scheme, where apples are frequently compared with watermelons because of the disk and dedicated hardware required for system operation and high availability.

However, ongoing operating expenditure (OPEX) is what drives the total cost of owning and maintaining any system you put in place. The dimensions that make up this total CAPEX and OPEX equation include the servers (high-end Solaris boxes or commodity blades) needed to support the throughput of loading and querying the data, the amount of physical storage needed, and the class of storage (expensive SAN or lower-cost NAS). Much of this is dictated by how efficiently the software can compress or de-duplicate the data to reduce the space needed, how efficiently it can parallelize load and query, and what class of hardware and storage its optimal reference architecture requires. And what is often left out of the equation is the large pool of resources (skilled expert admins) required to tune and maintain these systems.
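To make that CAPEX-plus-OPEX equation concrete, here is a minimal sketch of it in Python. Every figure in it is a hypothetical placeholder (the per-TB license price, the 10:1 compression ratio, the admin headcount and cost), not a vendor quote; the point is simply that compression efficiency and admin overhead move the total as much as the headline per-TB price does.

```python
# Rough, illustrative TCO sketch for an MGD retention system.
# All inputs are hypothetical placeholders -- substitute your own figures.

def retention_tco(raw_tb,               # raw data volume to retain, in TB
                  sw_price_per_tb,      # software/appliance license price per TB of raw data
                  compression_ratio,    # e.g. 10 means a 10:1 reduction before storage
                  storage_cost_per_tb,  # CAPEX per TB of the storage tier chosen (SAN vs. NAS)
                  admin_fte,            # skilled admins needed to tune and maintain the system
                  fte_cost,             # annual fully loaded cost per admin
                  years):               # retention horizon
    stored_tb = raw_tb / compression_ratio
    capex = raw_tb * sw_price_per_tb + stored_tb * storage_cost_per_tb
    opex = admin_fte * fte_cost * years
    return capex, opex, capex + opex

# Hypothetical example: 1 PB of raw CDRs retained for 5 years.
capex, opex, total = retention_tco(raw_tb=1_000, sw_price_per_tb=10_000,
                                   compression_ratio=10, storage_cost_per_tb=1_000,
                                   admin_fte=2, fte_cost=150_000, years=5)
print(f"CAPEX ${capex:,.0f}  OPEX ${opex:,.0f}  5-year total ${total:,.0f}")
```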

To keep up with the data volumes the world’s machines are generating, we must drive down the cost of information retention. That means boosting compression rates well beyond those of basic binary compression and file-level dedupe; as the quick sketch below shows, every step up in compression ratio takes a direct bite out of the cost per raw petabyte retained. Soon prices for retaining MGD will be quoted per petabyte, and eventually per exabyte. Big Data retention should be economically feasible for any organization, because it literally should be “a small price to pay”.
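As a rough illustration of why compression is the lever that matters, the sketch below shows how the effective storage cost per raw petabyte falls as the compression ratio rises; the $500-per-stored-TB figure and the ratios themselves are hypothetical.

```python
# Illustrative only: effective cost per raw petabyte at different compression ratios.
# The $500/stored-TB figure is a hypothetical placeholder, not a quoted price.
storage_cost_per_stored_tb = 500
for ratio in (2, 5, 10, 20, 40):   # 2:1 is roughly basic binary compression; the rest are aspirational
    cost_per_raw_pb = 1_000 * storage_cost_per_stored_tb / ratio
    print(f"{ratio:>2}:1 compression -> ${cost_per_raw_pb:,.0f} per raw PB retained")
```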
