The Big Data Odd Couple: Retention and Analytics

I recently caught the 1968 movie The Odd Couple on TV. It starred Jack Lemmon and Walter Matthau, and was based on Neil Simon’s 1965 Broadway play. The movie and the subsequent TV series feature Felix Ungar, a neurotic neat freak, rooming with his friend Oscar Madison, a messy sportswriter. As you may recall, Felix and Oscar get into many humorous situations that highlight their personality differences. But somehow, someway, they get along despite their different approaches. Not unlike Felix and Oscar, Big Data Retention and Big Data Analytics solutions each bring their own distinct capabilities, and both can and should get along.

There has been a lot of back and forth about Big Data Retention and Big Data Analytics. When I last blogged about retention as “The Other Side of The Big Data Problem” three months ago, much of the fanfare in both mainstream media and IT publications was squarely focused on analytics. Last month’s acquisition of Greenplum by EMC validated the value of, and need for, technologies that satisfy the pursuit of fast insights within terabytes and increasingly petabytes of data. And just a few weeks ago I was invited by Mike Vizard at CTOEdge to author a guest post on “Surfing The Big Data Retention Wave,” which prompted many more inquiries around the topic of this Big Data Odd Couple.

Most recently, Doug Henschen, editor-in-chief of Intelligent Enterprise, wrote a great blog post titled Big Data: The Early Days Are Over. The post mentions several interesting use cases that straddle the need for both retention and analytics. One such example is Reliance Communications. The article stated:

Indeed storage was more of a priority than speed for this particular application. Reliance needed to retain CDRs for compliance reasons. In a police investigation, for instance, law enforcement officials might ask Reliance for a complete record for all calls a particular subscriber made or received during a certain time period. The Indian government requires CDRs to be retained for 13 months, and with nearly one billion new calls made each day, the demands were massive.

“Access to CDRs is not very frequent, but we needed fast loading and fast retrieval for large amounts of data,” said Raj Joshi, vice president of decision support systems.
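To put those numbers in rough perspective, here is a quick back-of-envelope estimate; the call volume and retention window come from the article, while the bytes-per-record figure is purely an assumption for illustration:

```python
# Back-of-envelope estimate of the Reliance CDR retention load.
# Calls per day and the 13-month window come from the article above;
# the per-record size is an illustrative assumption, not a quoted figure.
CALLS_PER_DAY = 1_000_000_000      # "nearly one billion new calls made each day"
RETENTION_DAYS = 13 * 30           # ~13-month mandated retention window
BYTES_PER_CDR = 200                # assumed raw size of a single call detail record

retained_records = CALLS_PER_DAY * RETENTION_DAYS
raw_terabytes = retained_records * BYTES_PER_CDR / 1e12

print(f"~{retained_records / 1e9:.0f} billion CDRs retained at any one time")
print(f"~{raw_terabytes:.0f} TB of raw data before any compression")
```

Even under conservative assumptions, that is hundreds of billions of records and tens of terabytes of raw data that must stay queryable for over a year.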

The description above seems to indicate that Big Data Retention was the primary use case for Reliance Communications back in 2008, when the company selected and deployed EMC/Greenplum to solve its problem. Fast forward to 2010, and RainStor’s solution for Big Data Retention is seeing significant traction in the CDR retention space, as evidenced by various ISV partnerships solving exactly the same problem.

So why are Big Data Analytics products such as EMC/Greenplum, Vertica, InfoBright and others being used in situations where retention may be the primary driver? Perhaps the reason is that data warehouses, marts and other forms of data aggregation repositories have long been anointed as the place to put massive amounts of data. The new kids on the block for Big Data Analytics simply satisfy the need for faster ingestion and high-performance query of that data, and so become the de facto infrastructure of choice. But is messy Oscar really the half of the Odd Couple you want mashing up and taking care of your compliant data?

Surely a dose of neurotic Felix is what is called for when compliance is concerned. In this day of strict regulatory requirements and stiff fines, it is not enough to have a fast data warehouse to be considered compliant. Strict controls that guarantee the data is immutable and tamper-proof, together with configurable retention and expiry policies, are required, and they come built in with any Big Data Retention solution. Additionally, a heavy helping of neat, organized de-duplication of the data results in compression so extreme that the resulting TCO undercuts that of any Big Data Analytics solution.
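To make that distinction concrete, here is a minimal sketch of the kind of behavior a retention-focused store enforces and a plain data warehouse typically does not; the class and method names are hypothetical and not any vendor’s actual API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)            # frozen: a record cannot be altered once written
class RetainedRecord:
    record_id: str
    payload: str
    ingested_at: datetime

class RetentionStore:
    """Illustrative write-once store with a configurable retention window."""

    def __init__(self, retention_days: int):
        self.retention = timedelta(days=retention_days)
        self._records: dict[str, RetainedRecord] = {}

    def ingest(self, record: RetainedRecord) -> None:
        # Immutability: existing records can never be overwritten or updated.
        if record.record_id in self._records:
            raise ValueError("records are immutable; updates are not allowed")
        self._records[record.record_id] = record

    def expire(self, now: datetime) -> int:
        """Delete only records that have aged past the retention window."""
        expired = [rid for rid, rec in self._records.items()
                   if now - rec.ingested_at > self.retention]
        for rid in expired:
            del self._records[rid]
        return len(expired)
```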

More formally, comparing Big Data Retention and Big Data Analytics offerings:

Feature & Use Case Big Data Retention (Felix) Big Data Analytics (Oscar)
Ingestion of Massive Volumes Yes Yes
On Demand Query Yes Yes
Standard Analytics Yes (Standard SQL Access) Yes
High Performance Analytics No (Not a Specialty) Yes
Configurable Retention Rules Yes (Granular) No
Configurable Expiry/Delete Yes (Granular) No
Guaranteed Immutable Yes Depends
Low Admin Cost Yes Not usually
Lowest Cost Per TB for Retention Yes No

The differences are subtle, and you can see why companies that have retention as a primary goal, along with large-scale ingestion requirements, might overpay for a Big Data Analytics solution that in the end may only solve part of their compliance problem.

A better way to think about when to use Felix vs. Oscar is to consider how Big Data Retention not only solves scenarios like the CDR use case at Reliance Communications, but is also used in other mainstream ILM (Information Lifecycle Management) retention scenarios. Two examples are application retirement, where legacy applications are shut down and their data is moved into Big Data Retention repositories for continued ongoing access, and application archiving, where static/historical data is moved from production OLTP systems into Big Data Retention repositories to relieve the burden on production while still affording on-demand accessibility. Big Data Analytics tools and repositories are NEVER used in these scenarios, because their cost economics and lack of retention-focused features make them a mismatch. This is why a Big Data Retention solution with high-ingestion performance characteristics is a much better fit for CDR scenarios such as Reliance Communications, while the higher cost of Big Data Analytics should be directed at more complex aggregation and trend analysis.
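As a rough illustration of the archiving pattern just described, the sketch below copies historical rows out of a production OLTP table into a retention repository and then removes them from production. The table names, databases and one-year cutoff are all hypothetical stand-ins:

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical stand-ins: in practice these would be the production OLTP
# database and the Big Data Retention repository, not two local SQLite files.
production = sqlite3.connect("production_oltp.db")
archive = sqlite3.connect("retention_repository.db")

cutoff = (datetime.now() - timedelta(days=365)).isoformat()  # assumed "static data" threshold

# Copy historical rows into the retention repository...
rows = production.execute(
    "SELECT id, customer_id, amount, created_at FROM orders WHERE created_at < ?",
    (cutoff,),
).fetchall()
archive.executemany(
    "INSERT INTO orders_archive (id, customer_id, amount, created_at) VALUES (?, ?, ?, ?)",
    rows,
)
archive.commit()

# ...then delete them from production to lighten the OLTP workload,
# while they remain accessible on demand from the archive.
production.execute("DELETE FROM orders WHERE created_at < ?", (cutoff,))
production.commit()
```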

To close, as an Odd Couple, Felix and Oscar do eventually get along, and Big Data Retention works just as well as a supporting repository to Big Data Analytics. In his latest InformationWeek article “The Big Data Era: How Data Strategy Will Change,” Doug identifies two additional trends that further highlight the need for an Odd Couple pairing:

1. Fast Access to Historical Source Records – Rather than extracting data out of data warehouses, which is a slow, arduous process, leading tools such as SAS can now run directly against databases from vendors such as Netezza and Teradata, eliminating data movement. Big Data Retention repositories likewise provide direct access to the originating source records, relieving the burden on OLTP systems and data warehouses while still allowing detailed drill-down from the aggregated results produced by Big Data Analytics.

2. In-Memory Is Coming, But Compliance Still Persists – Solid-state drives are coming, and in-memory processing will become compelling as an additional high-performance option for Big Data Analytics. (Side note: interestingly, RainStor was originally developed as an in-memory database, and it was the high cost of memory that ultimately led to its patented value and pattern de-dupe capability; a toy sketch of the value de-dupe idea follows below.) Like many long-term Big Data Retention offerings, RainStor pairs an in-memory model with the long-term persistent storage required to support compliance and to ensure the lowest TCO. Just like ILM data archiving for OLTP systems, Big Data Retention provides a tiered, compliant, lower-cost retention option for completed analysis, freeing up much more expensive Teradata, Netezza and EMC/Greenplum appliances, for example, to focus on what they do best.
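Here is that toy sketch of value-level de-duplication (essentially dictionary encoding): each distinct value is stored once per column, and rows shrink to lists of small references. It is a conceptual illustration only, not RainStor’s actual algorithm:

```python
def dedupe_columns(rows):
    """Toy value-level de-duplication: keep one dictionary of distinct values per
    column and replace each row value with an integer reference into it."""
    dictionaries = [{} for _ in rows[0]]      # one value dictionary per column
    encoded_rows = []
    for row in rows:
        encoded = []
        for col, value in enumerate(row):
            ref = dictionaries[col].setdefault(value, len(dictionaries[col]))
            encoded.append(ref)
        encoded_rows.append(encoded)
    return dictionaries, encoded_rows

# CDR-like rows repeat values heavily (same tower, same call result, ...),
# which is exactly where this style of de-dupe pays off.
cdrs = [
    ("9198700001", "TOWER_42", "COMPLETED"),
    ("9198700002", "TOWER_42", "COMPLETED"),
    ("9198700001", "TOWER_42", "DROPPED"),
]
dicts, encoded = dedupe_columns(cdrs)
print(encoded)   # [[0, 0, 0], [1, 0, 0], [0, 0, 1]] -- repeated values stored only once
```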

Now if I can only get this da dum dee dum dee dum theme tune out of my head!