Feeding the Elephant Peanuts and Making Pig Fly

Today at RainStor we announced a new product edition that runs natively on Hadoop and HDFS. We are particularly excited as we sincerely hope it will help support the growth and enterprise adoption of Hadoop in the marketplace. Although we are not an open source vendor, we have tremendous admiration and respect for the open source community and the incredible momentum that Hadoop has garnered. A special thanks to the efforts of Cloudera, who blazed and continues to blaze the trail evangelizing the virtues of Hadoop, and to others such as Hortonworks and MapR (all RainStor partners) who are legitimizing the technology for solving Big Data problems.

By applying our unique pattern and value de-duplication to raw data that would normally be compressed via LZO or Gzip, RainStor can deliver significant savings in the number of nodes required to retain Big Data. For example, 40:1 compression could cut the number of nodes from 75 down to 2, which means not just a lower upfront purchase cost but also a significant reduction in ongoing total operating cost. Why bother if your savings in deploying Hadoop are already so significant compared to “traditional” enterprise database or data warehouse hardware and software deployments? Besides the obvious fact that saving money never goes out of style, the sheer rate of data growth is outstripping advances in physical storage media, which means it is a never-ending job to feed the elephant.
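
To make the arithmetic behind that 75-to-2 example concrete, here is a minimal back-of-the-envelope sketch in Python. The raw volume and per-node capacity are illustrative assumptions chosen so the numbers line up with the example above (and HDFS replication is ignored); they are not RainStor sizing figures.

```python
# Back-of-the-envelope node-count estimate for the 75-to-2 example above.
# The 900 TB volume and 12 TB of usable capacity per node are illustrative
# assumptions, not RainStor sizing figures.
import math

def nodes_required(raw_tb: float, compression_ratio: float, usable_tb_per_node: float) -> int:
    """Number of data nodes needed to retain raw_tb of data on disk."""
    on_disk_tb = raw_tb / compression_ratio            # footprint after compression
    return math.ceil(on_disk_tb / usable_tb_per_node)  # round up to whole nodes

RAW_TB = 900.0   # assumed raw (uncompressed) data volume
NODE_TB = 12.0   # assumed usable capacity per data node

print(nodes_required(RAW_TB, 1, NODE_TB))   # no extra compression -> 75 nodes
print(nodes_required(RAW_TB, 40, NODE_TB))  # 40:1 de-duplication  -> 2 nodes
```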

Cost aside, another way to look at the challenge is to think about decoupling the storage and processing requirements within each Hadoop node. If you are adding nodes purely to hold data, you may be significantly under-utilizing the CPUs in each node. Worse, those CPUs may be spending effort re-inflating data compressed via LZO or Gzip rather than being fully applied to the query or business analytic calculations. RainStor, on the other hand, requires no re-inflation; its compressed files hold more records per block, which has a magnification effect on disk performance and bandwidth upon access. So you end up in an almost surreal situation where not only is the data more compressed, but Pig and MapReduce jobs actually run faster. Even though the number of nodes is reduced, they are used more efficiently, allowing you to strike the right balance between adding nodes for processing power and adding them for storage.
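
The "records per block" effect can be sketched with a very rough throughput model. Every figure below is an assumption for illustration only, not a benchmark of RainStor, LZO, or Gzip: the idea is simply that data which must be re-inflated is capped by decompressor speed, while data that can be read in place lets each physical block carry more logical records.

```python
# A rough, assumption-laden model of the "records per block" effect.
# All rates and sizes are placeholders, not measured values.

DISK_MB_PER_S = 100.0     # assumed sequential read rate per node
INFLATE_MB_PER_S = 250.0  # assumed decompressor output rate per core
RECORD_BYTES = 500        # assumed average raw record size

def records_per_second(compression_ratio: float, needs_inflation: bool) -> float:
    """Records per second one node can feed to a Pig or MapReduce job."""
    # Each physical MB read from disk carries compression_ratio MB of records.
    logical_mb_per_s = DISK_MB_PER_S * compression_ratio
    if needs_inflation:
        # Re-inflation caps throughput at the decompressor's output rate.
        logical_mb_per_s = min(logical_mb_per_s, INFLATE_MB_PER_S)
    return logical_mb_per_s * 1_000_000 / RECORD_BYTES

print(f"Gzip/LZO at 8:1, re-inflated: {records_per_second(8, True):,.0f} rec/s")
print(f"Query-in-place at 40:1:       {records_per_second(40, False):,.0f} rec/s")
```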

Finally, RainStor’s ability to run natively on Hadoop comes down to the fact that our architecture fits Hadoop and HDFS like a glove. As a large-block, MPP database that already uses MapReduce capabilities internally, RainStor was a natural fit to run on HDFS. This enables RainStor to be part of the Hadoop deployment, rather than a database or data warehouse that connects to HDFS or transfers data out of it. As a result, you get the security, auditing, unique compliance, and data lifecycle management features (and more) you would expect from an enterprise database, one that speaks perfect SQL so your traditional BI tools can access the data without having to transform or transfer it into a separate environment. Furthermore, our data virtualization partner Composite Software allows data stored within RainStor on Hadoop to be seamlessly combined with other data sources around the enterprise without the need for large-scale copy or transfer.
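
In practice, "speaks SQL" means a BI tool or script can reach the data over standard connectivity such as ODBC. The sketch below uses Python's pyodbc; the DSN name, table, and columns are hypothetical placeholders rather than RainStor specifics, so consult the product documentation for the actual driver and connection details.

```python
# Hypothetical example: ordinary SQL against data that never left the
# Hadoop cluster. The DSN, table, and column names are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=RainStorHadoop")  # assumed ODBC data source
cursor = conn.cursor()

cursor.execute(
    "SELECT call_date, COUNT(*) AS call_count "
    "FROM call_detail_records "
    "WHERE call_date >= ? "
    "GROUP BY call_date",
    "2012-01-01",
)
for row in cursor.fetchall():
    print(row.call_date, row.call_count)

cursor.close()
conn.close()
```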

In closing, I have to give credit to our CFO Jamie Andrews (who is a budding marketing intern on the side) for the title of this blog. He knows a thing or two about saving money, and he articulated that RainStor’s compression and node reduction will allow enterprises to feed their Hadoop cluster peanuts, all while making Pig and MapReduce jobs fly!
