As the Facebook IPO frenzy builds up to the pricing and Facebook starts trading this Friday (UPDATE: Facebook has priced at $38, giving it a market cap of about $104B), it got me thinking about how much data I have uploaded or contributed to Facebook over the last 5 years. It turns out you can get your own personal slice of the Big Data in Facebook back as a tidy zip file snapshot of everything you have done, uploaded or had commented on. If you want to try it yourself, take a look at the instructions here. Since I joined Facebook in 2007, it appears that I have generated or uploaded about 1.5GB of data. The zip file returned (after 5 hours of combined file preparation and download time) contains a nice HTML index page, which provides a stripped-down version of the photos and comments in chronological order, just like your wall. The basic capability was made available in 2010 and extended with an enhanced archive option after complaints made by Irish users who reported their concerns to the Irish Data Protection Commissioner.

So how much Big Data is in Facebook and where is it kept? The popular details making the rounds of Hadoop and Big Data conferences focus mainly on the huge clusters running Facebook's data warehouses on Hadoop and Hive. There was an interesting article on Facebook’s corporate blog about their massive Hadoop migration (30PB worth) last year to a larger data center. However, on a daily basis, the repository and platform you interact with is still powered by MySQL databases. Given the publicity around how traditional “relational” databases can’t handle internet scale, and that NoSQL databases are the way to go, the fact that Facebook still operates MySQL as the backend is eye-opening. This has prompted database experts such as Michael Stonebraker (of Vertica and VoltDB fame) to state that Facebook is “trapped in a MySQL fate worse than death”.
This was followed up by another GigaOM article detailing how Facebook is able to make MySQL scale. The article explains that Facebook does not rely purely on MySQL: a massive layer of memcached servers is used as an in-memory database, because MySQL servers on their own couldn’t possibly handle the read load of live Facebook traffic. For functionality such as the Facebook Inbox, Hadoop and HBase are used instead. Additionally, Hadoop is used as the backup for the MySQL data.

Back to my personal download of my Facebook information: I was quite impressed by the time (again 5 hours) it took to download a bulk copy of my personal data. However, I doubt that many users leverage the download option today. Rather, with more and more users joining, and more data being uploaded per account, it will be interesting to see if the MySQL architecture can continue to hold up, and how Facebook’s use of Hadoop, Hive and other Apache projects will evolve for Big Data warehousing and analytics.

First of all, congratulations to all Splunk employees, VCs and shareholders! Today is a great day for your company and those of us in Big Data (see who else is who in Big Data here). Almost 12 months ago I wrote a post titled “LinkedIn’s IPO – A Perfect Storm of Big Data, Open Source and Cloud Computing”, in which I marveled at the then $9B market cap a few days after the IPO. I noted that LinkedIn used or involved 3 core technology areas: Big Data, Open Source and Cloud Computing. Today, I was excited to see that Splunk IPOed and immediately doubled in price, making it worth a cool $3B, mostly on the basis of the hype and reality of Big Data. For an interesting financial and company analysis, see Dave Kellogg’s post in January about Splunk’s S1 and impending IPO.
It’s a great analysis, describing Splunk’s marketing as “the Virgin America of log file analysis,” as evidenced by one of many funny tag lines that often appear on t-shirts they hand out to their users and at trade shows:
The only thing he may have missed the mark on was that he felt the predicted $1B valuation to be rather high. I wonder what he thinks of the $3B market cap today? Irrational exuberance, high tech bubble, Instagram effect or Big Data trending? What does Splunk provide for its $40M VC money raised and $3B market cap, on $121M in revenues and an $11M loss? According to “About the company” on Splunk.com:

Splunk was founded to pursue a disruptive new vision: make machine data accessible, usable and valuable to everyone. Machine data is one of the fastest growing and most pervasive segments of “big data”, generated by websites, applications, servers, networks, mobile devices and the like that organizations rely on every day. By monitoring and analyzing everything from customer clickstreams and transactions to network activity and call records, and more, Splunk turns machine data into valuable insights no matter what business you’re in. It’s what we call operational intelligence.

Splunk has been dealing with what is now known as Big “machine-generated” Data since its founding in 2006, long before Hadoop was popular, when data was merely large. Splunk cracked the code for helping a very influential constituent: the internal IT groups of organizations who were struggling to analyze and manage the millions of logs generated by expanding infrastructure and data growth. In fact, many companies were dealing with such data at scale BH (Before Hadoop), but there is no doubt that Hadoop has raised the interest in Big Data and Big Data technologies to fever pitch, and Splunk’s IPO has just launched a nuclear missile into that explosive ammunition dump. Of course, I’m particularly excited about the Splunk IPO, since I work for RainStor, a Big Data database company, and we too at RainStor deal with the same Big Data and have done so BH. We also recently announced ourselves as the first database to run natively on Hadoop.
The Splunk IPO is good for everyone who’s in the Big Data space, both from a VC valuation perspective and for the general public's understanding of the types of real-life Big Data challenges that our technologies are looking to solve. In closing, to show that everything is happening at warp speed and extraordinary valuations, note that the CEO of Splunk, Godfrey Sullivan (who incidentally has 8% of the company, now valued at about $250M; see table at the end of the post), was previously the CEO of Hyperion (founded originally as IMRS in 1981, 26 yrs old) when it was sold to Oracle in 2007 for $3.3B on revenues of about $1B with 2500 employees. Today Splunk (founded in 2006, 6 yrs old) is pushing on that market valuation with just $121M in revenues and about 500 employees. It won’t be long before more Big Data related IPOs and M&A follow. Thanks and congrats again to Splunk for leading the way.
Also published at http://rainstor.com/how-much-is-that-hadoop-cluster-really-costing-you/

Last month when we released our RainStor for Big Data Analytics product edition that runs natively on Hadoop, we raised a lot of eyebrows with two of the points that we were making:
- Compression can dramatically reduce the number of Hadoop nodes needed, and therefore the TCO
- SQL access to the compressed data in HDFS can be achieved without having to transfer the data out of Hadoop or use specialized tools
In my post “Feeding the Elephant Peanuts and Making Pig Fly” I talked about how we could achieve massive compression, give SQL-92 access and boost the performance of MapReduce jobs. This post revisits the first point around TCO. I’ll cover the second point in a future blog post.

The reason that I decided to go over the TCO point again is that I had the pleasure of chatting with David Merrill, Hitachi Data Systems Chief Economist (@StoragEcon), on this very topic. I have been a fan of his white papers and noted that he had started writing about Big Data Storage Economics on his blog, The Storage Economist. For those of you who are unfamiliar with his work, a good example is his white paper Four Principles For Reducing Total Cost of Ownership, providing a pragmatic and quantifiable look at all of the factors that contribute towards operating and running different types of storage. We talked about his research and analysis and how, from purely a bare-metal CPU, disk and component perspective, commodity clusters such as Hadoop can appear to provide lower TCO from a cost per usable TB perspective. However, as his research showed, when cost per written-to TB is used, the equation is turned completely upside down. As David concluded, “don’t confuse price and cost, and look at a longer time horizon when planning and building big data storage infrastructures.”

In our January release on Hadoop we had an example in an infographic illustrating how RainStor’s compression can significantly drive down the physical storage and therefore the number of nodes required. We used a simple operating cost metric of $3000 per node (containing 12TB of raw disk) that resulted in a TCO (buying and operating the cluster) savings of over $1M over 3 years for storing 300TB of user data. If you take a look at David’s numbers in his post, he has it ranging from a low of around $3000 per usable TB for DAS to a cumulative high cost of $45,000 per written-to TB!
Granted, the research was done in 2009 and acquisition costs have plummeted since, but since the price of floor space, heating, cooling etc. just continues to grow, it demonstrates that $3k per node can be considered reasonably conservative. His post also pointed out that in general the server CPUs with DAS were tasked with a lot of “mundane tasks”. As part of our conversation, I detailed how RainStor’s unique value and pattern de-duplication process leverages CPU cycles up front to build highly compressed partitions, which not only save on physical disk space but also use collected metadata to make data access more intelligent and efficient, as well as magnifying the performance of the commodity disks by retrieving more data per block when required. This means using more CPU on load and improving performance overall upon access.

All of this reflects the savings and the impact on baseline storage and access costs, and doesn’t yet add the cost of administration (both software and personnel) of the nodes and cluster, or any development and integration costs (which I will cover in my next post). Bottom line, David’s research and white papers over the years have contributed greatly to the understanding of the overall TCO of storage and the benefits of widely adopted technologies such as thin provisioning. Now he is pointing out that Big Data Hadoop clusters have hidden hardware operating costs and it’s best to go into such endeavors with eyes (and pocketbooks) wide open. Meanwhile, here at RainStor we continue to focus on driving down the TCO of your choice of storage and configuration through our database. In the end, David and I both agreed that, services aside, the best TCO comes from the efficient selection and implementation of the hardware and software for the given use case, and that the right combination is what will make Big Data manageable and affordable.

A couple of interesting articles ran through my #BigData tweetstream last week.
Shout out to Ivan Chong (@ichong) at Informatica, whom I re-tweeted. “The Delivery Guy Who Saw Jeremy Lin Coming” tells the intriguing story of a FedEx delivery driver and numbers hobbyist named Ed Weiland, who in May 2010 wrote a long-term forecast of Jeremy Lin for the basketball website Hoops Analyst. His analysis of Lin’s college career statistics stated that beyond John Wall, the top point guard available for the draft, there may be a surprise point guard available out there: “The best candidate to pull off such a surprise might be Harvard’s Jeremy Lin. The reason is two numbers Lin posted, 2-point FG pct and RSB40. Lin was at .598 and 9.7. This is impressive on both counts. These numbers show NBA athleticism better than any other, because a high score in both shows dominance at the college level on both ends of the court.”

Such clairvoyance based on examining the statistics in detail, rather than relying on “traditional” physical makeup or well known basketball college pedigrees, is the stuff of baseball Sabermetrics or Moneyball (to you Brad Pitt/Oakland A's fans out there). Interestingly, basketball coaches and general managers obviously did not read much into these stats (if they looked at them at all), nor did they pay any attention when Jeremy Lin outplayed #1 Point Guard pick John Wall in the Summer League, but I digress … Such examination of statistics, blended with other related (or yet to be discovered related) factors, is the stuff that people are starting to coin as the work of Data Scientists. This newly popularized term has emerged together with Big Data and Hadoop as one of the highest paid and sexiest (as far as IT can be sexy to those of you not in the biz) job titles out there. EMC recently published an infographic, with the Mashable article calling it “the career of the future”. As a counterpoint, respected industry analyst Neil Raden recently weighed in that Data Scientist is an overused term.
Terminology aside, while it is doubtful that Ed Weiland derived this analysis by running a home built Hadoop cluster with hand-coded MapReduce jobs, he did nevertheless identify a combination of statistics which was vindicated by Jeremy’s subsequent ascendance as the starting point guard for the New York Knicks.

Meanwhile, a professional “data scientist” was busy uncovering market and buyer patterns that ended up identifying a pregnant teen before her father even knew. Andrew Pole, working for Target stores, revealed how the history of everything a customer has ever bought, and any demographic information Target has collected from them or bought from other sources, is used to determine if a woman has a high probability of being pregnant. They then use this information to send coupons or special offers for relevant products. In this instance there is no doubt that some form of Big Data analytics is at play. Traditional big iron analytics appliances such as Teradata (speculation on my part, as they have a big footprint in retail) may have helped Target home in on the pregnant teen. As has been much publicized, traditional SQL-based data warehousing isn’t the ideal combination of platform and data analysis for these types of speculative, predictive analytics, and products like Teradata, with their powerful capabilities and horsepower, come at a Ferrari-scale price tag. This is where Hadoop and MapReduce might shine more brightly, though a quick glance at the Powered by Hadoop website does not list Target as a user of Hadoop (though this information is strictly voluntary).

Many would consider the Jeremy Lin example cool and the Target example creepy, though a commenter pointed out that companies have been gathering data about us for credit scores and the resulting credit worthiness for decades. With new technologies such as Hadoop making high performance analytics more affordable than ever, expect to see more cross domain analysis and use cases expanding beyond marketing research.
Whether it’s the likes of Mark Zuckerberg becoming a billionaire through Facebook (powered by Hadoop), the prediction of Jeremy Lin being an NBA sensation out of Harvard, or predicting the pregnancies of teens, big data analytics and Hadoop are making big elephant sized footprints in business and mainstream media. Meanwhile I will be at the Strata Conference next week representing my company @RainStor and our database, which has been recently ported to run natively on Hadoop. DM me @ramonchen if you would like to meet up. See you there?

Today at RainStor we announced a new product edition that runs natively on Hadoop and HDFS. We are particularly excited as we sincerely hope it will help support the growth and enterprise adoption of Hadoop in the marketplace. Although we are not an open source vendor, we have tremendous admiration and respect for the open source community and the incredible momentum that Hadoop has garnered. A special thanks to the efforts of Cloudera, who blazed and continue to blaze the trail evangelizing the virtues of Hadoop, and to others such as Hortonworks and MapR (all RainStor partners) who are legitimizing the technology for solving Big Data problems.

By applying our unique pattern and value de-duplication to raw data that would normally be compressed via LZO or Gzip, RainStor can deliver significant savings in the number of nodes required to retain Big Data. For example, 40:1 compression could cut the number of nodes from 75 down to 2! That is not just a lower upfront purchase cost but also a significant ongoing total operating cost reduction.

Why bother if your savings in deploying Hadoop are already so significant compared to “traditional” enterprise database or data warehouse hardware and software deployments? Besides the obvious fact that saving money never goes out of style, the sheer rate of data growth is outstripping advances in physical storage media, which means it is a never ending job to feed the elephant.
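As a rough sanity check on the 75-to-2 example, here is a minimal sketch of the node arithmetic. It assumes the figures quoted elsewhere on this blog (300TB of user data, 12TB of raw disk per node) plus a replication factor of 3, which is the HDFS default but is not stated in the post, and it assumes the compression ratio applies before replication. The function name `nodes_needed` is purely illustrative.

```python
import math


def nodes_needed(user_tb, raw_tb_per_node, replication=3, compression_ratio=1.0):
    """Estimate how many storage nodes are needed to hold user_tb of data.

    Assumptions (not from the original post): every block is stored
    `replication` times, and compression shrinks the data before it is
    replicated. Nodes are counted purely by disk capacity.
    """
    stored_tb = user_tb * replication / compression_ratio
    return max(1, math.ceil(stored_tb / raw_tb_per_node))


baseline = nodes_needed(300, 12)                          # no extra compression
compressed = nodes_needed(300, 12, compression_ratio=40)  # 40:1 compression

print(f"baseline nodes:   {baseline}")
print(f"compressed nodes: {compressed}")
```

Under these assumptions the uncompressed case works out to 900TB of raw storage, or 75 nodes, and the 40:1 case to 22.5TB, which rounds up to 2 nodes. Real sizing would also need headroom for temporary data and growth, which this sketch ignores.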
Cost aside, another way to look at the challenge is to think logically about uncoupling the storage and processing requirements used within each Hadoop node for solving your problem. If you are adding nodes purely to hold the data, you might be significantly under-utilizing the CPUs in each node. Those CPUs might also be spending effort re-inflating data, if compressed via LZO or Gzip, rather than being fully applied to supporting the query or business analytic calculations. RainStor, on the other hand, requires no re-inflation, and the RainStor compressed files actually contain more records per block and have a magnification effect on disk performance and bandwidth upon access. So you end up in an almost surreal situation where not only is the data more compressed, but Pig and MapReduce jobs actually run faster. Even though the number of nodes is reduced, they would be more efficiently used, thereby allowing you to set the correct balance of adding nodes for processing power and storage needs.

Finally, RainStor’s ability to run natively on Hadoop is due to the fact that our architecture fits Hadoop and HDFS like a glove. As a large block, MPP database already using MapReduce capabilities internally, it was a natural fit for RainStor to run on HDFS. This enables RainStor to be part of the Hadoop deployment, rather than a database or data warehouse connecting to or transferring data out of HDFS. You get all of the security, auditing, unique compliance and data lifecycle management features, and more, that you would expect from an enterprise database that speaks perfect SQL, so that your traditional BI tools can access the data without having to transform or transfer it into a separate environment. Furthermore, our data virtualization partner Composite Software allows data stored within RainStor on Hadoop to be seamlessly combined with other data sources around the enterprise without the need for large scale copy or transfer.
In closing, I have to give credit to our CFO Jamie Andrews (who is a budding marketing intern on the side) for the title of this blog. He knows a thing or two about saving money and articulated that RainStor’s compression and node reduction will allow enterprises to feed their Hadoop cluster peanuts, all while making Pig and MapReduce jobs fly!

As is now my tradition (having done it last year), as we approach Turkey day I’d like to reflect on what’s been a wonderful year so far and give thanks, especially for the following:
- My wonderful wife Kathy, who has taken her game to a whole new level as Mom to Parker and Ryan. It’s a joy learning how to parent with you, and despite the sleepless nights and zero personal free time, I am bursting with love and admiration for you.
- Our nearly 11 month old twins Parker and Ryan. Never in my wildest dreams could I be so proud and happy every day of my life since you both arrived. Your smiles and giggles light up my day, and watching you grow and discover your new world opens my eyes to the beauty and intrigue of simple objects and activities in our world.
- An annual shout out to my family of in-laws, especially my mother-in-law Clare, who is currently rehabilitating after being rushed to hospital. Thank you to everyone who has been wishing her well and for all of the help and support in getting her better.
- To my fantastic mother, who has been regularly video Skyping with us on the weekends. Parker and Ryan will love seeing you in person next year. They and I can’t wait for their Grandma to hold them in her arms.
- Once again to all my dear friends, Wasim, Deirdre, Manish, Ruth, Henley to name a few. It’s been wonderful to exchange advice and ideas with you in 2011. I hope I have been able to help you in small ways this year, given my new family responsibilities. I look forward to seeing much more of you in 2012. “A good friend is cheaper than therapy.” ~Author Unknown
- The continued momentum of RainStor, my current company. Even through a tough economy and environment, we are making great strides with large partners such as Dell and others (whom I cannot yet reveal). I am expecting even better things in 2012!
- The SF 49ers for their 9-1 start (hopefully 10-1 on Thanksgiving day). Maybe they can match the Giants winning the World Series! Dare we dream of a Super Bowl?
And again, Happy Thanksgiving to everyone who has been reading my blog. I appreciate all of the kind emails, comments and feedback provided. All the best to you and your family, have a safe and fun holiday. -Ramon

When the world is coming up with ways to grapple with, process, store and retain the massive volumes of digital data, the USPS presents an innovative solution: get offline! I don’t often pay much attention to the ads on TV. However, this weekend I happened upon an ad from the US Post Office which left me dumbfounded. The ad, which can be seen here http://uspsvideo.com//video/127/USPS-Hacked-TV-commercial, focuses on the issue of online security, saying: “A refrigerator has never been hacked. An online virus has never attacked a corkboard. Give your customers an added feeling of security of printed statement that a receipt provides. With mail.”

The reason for these ads no doubt is the fact that the U.S. Postal Service said it lost $5.1 billion last year as a weak economy and increased Internet use drove down mail volume. Postal officials have been quoted as saying that the financial situation is “dire.” Postmaster General Patrick Donahoe has warned of a postal shutdown next year unless there is congressional action to address the agency’s long-term money problems. While I completely feel for the postal employees and the potential for layoffs in this poor economy, I can’t help but wonder who recommended this positioning as a possible way to address their challenges. Quite apart from the fact that the message of “security” is flawed, in that identity theft through stolen mail from unsecured mailboxes is a common occurrence, encouraging people to switch back to hardcopy paper statements goes against all environmental logic. Obviously this cannot be intended to be a long term fix, but perhaps just a last ditch attempt to spike post office revenues in the months ahead.
It is even more of a long shot when you consider that the ad is actually targeted mainly at businesses, as it says “It’s good for your business and good for your customers.” Since businesses have long championed the e-statement as a way for them to save money while projecting an environmentally conscious image, this message will appeal to no one. Thoughts that ran through my mind after watching this ad: “swimming against the tide”, “evolve or die”, “Blockbuster vs. Netflix”, “Kodak close to bankruptcy”, among others. Perhaps the massive workforce at the Post Office could be redirected towards embracing online activities rather than making oars for a ship that has already sailed. Perhaps a focus on improving online email security by partnering with encryption leader Voltage Security, or evolving to a system where paid emails sent via the post office have a guaranteed “certified delivery” that is legally admissible? Surely someone at the post office can embrace the digital age? Alternatively, their current line of positioning might net them some serious VC funding, if they just change their message to be “How to reduce Big Data without the need for Hadoop!”

Here is a quick graphic, which I last updated in May 2012, that visualizes some of the funding, M&A and partnerships related to the growing universe of Big Data and of course Hadoop. Note that in certain categories there is a mix of companies and technologies. The graphic is a work in progress and does NOT include RainStor, my current employer and a major player in Big Data and Hadoop, for confidentiality and conflict of interest reasons. (RainStor is an enterprise database deployed by over 150 companies, running on any platform including Hadoop, that has the best compression in the market, offers full SQL access and accelerates the performance of MapReduce processing.) The partnership lines in the graphic below converge on Hadoop market leader Cloudera, followed rapidly by Hortonworks, which has made significant partnering progress since its launch.
Investments by VC Firm (Source: Infrastructure 2.1 Nov 30, 2011 Newsletter - jkatsaros@irg-intl.com)
| Accel | $26.5 | Cloudera, Couchbase |
| AGF | $9.3 | Talend |
| Antham | $1.5 | StackIQ |
| Avalon | $1.5 | StackIQ |
| Balderton | $9.3 | Talend |
| Benchmark | $10.7 | Pentaho |
| Bessemer | $4.8 | Hadapt |
| Conor | $3.5 | Neo Technology |
| CrossLink | $6.8 | DataStax |
| Docomo | $1 | Couchbase |
| Fidelity | $3.5 | Neo Technology |
| Fly-bridge | $10 | 10gen |
| Free-Style | $.33 | BackType |
| Galileo | $9.3 | Talend |
| Giza | $4.5 | Mintigo |
| Greylock | $19 | Cloudera |
| Hummer Winblad | $2.5 | Karmasphere |
| Ignition | $26.5 | Cloudera, Couchbase |
| Index | $10.7 | Pentaho |
| Kleiner Perkins | $5.8 | Datameer |
| Lightspeed | $11.3 | DataStax, MapR |
| Lower-Case | $.33 | BackType |
| Mayfield | $7.5 | Couchbase |
| Meritech | $26.5 | Cloudera, Tableau |
| NEA | $15.2 | MapR, Pentaho |
| Northbridge | $7.5 | Couchbase |
| Norwest | $4.8 | Hadapt |
| Redpoint | $5.8 | Datameer |
| Sequoia | $10 | 10gen |
| Sunstone | $3.5 | Neo Technology |
| True | $.33 | BackType |
| Union Square | $10 | 10gen |
| USVP | $10 | Karmasphere, Tableau |
| Y-Comb | $.33 | BackType |
The following conversation with your Big Data was recorded at the offices of Dr. D. Dupe M.D.

Doctor: “Please take a seat and tell me what brings you here today?”

Big Data: “Thanks Doctor. Well, lately I’ve been feeling depressed. As you can see, I’m getting on in years. Certainly I’m not the same data I used to be. When I was young, I was able to adapt and change with the world, frequently updating and reinventing myself, which is one of the reasons my ex-girlfriend was attracted to me.”

Doctor: “Tell me more about your ex-girlfriend.”

Big Data: “She’s a famous RDBMS, very energetic, well liked, but very high maintenance. Her job is to process and manage transactional data; that’s how we met actually. We’ve been together for years, but our relationship has been changing for a while. She rarely calls me, and when she does she only reminisces about the old times and our history together. Even though the relationship wasn’t going anywhere, I was comfortable with the arrangement, until one day she broke it off.”

Doctor: “How did she break up with you?”

Big Data: “She told me that she chatted to her boss (and best friend), Margie Application, and she said we rarely see each other, that I never change, and apparently the last straw was that I was overweight and dragging her down. Taking up valuable space with all my junk. To top it off, I live in her apartment and her landlord, a nice guy named Joe Storage, told her that the lease was just for one person and she wasn’t allowed to have me stay there anymore.”

Doctor: “How did you react?”

Big Data: “I got mad, said a few choice words, grabbed my stuff and stormed out! Then I calmed down and went to talk to Joe to see if there was space in the building for me. Unfortunately, Joe Storage’s building caters for young up and coming types. The rent is astronomical per sq. foot as you pay for the update facilities. It has fast elevators providing high-speed access with expensive security. I really couldn’t afford to live there.”

Doctor: “You mentioned your age earlier. Do you think this is the main reason why she broke up with you?”

Big Data: “I did at first, because we had been together for so long. But a friend of mine recently experienced a similar issue. He’s young, machine-generated and straight off the network, but was also told that he takes up too much space, and since he too never changes, it was too expensive to keep him around.”

Doctor: “Actually you are correct, this is becoming a common issue. Many of my patients are reporting the same symptoms that you and your friend are experiencing. Unfortunately, there are many people offering different approaches to this problem and it can be confusing. Just like with fad diets and late night TV commercials, it’s important to be specific about your core needs, which I would recap as follows:
- You have come to terms with the fact that you no longer change
- You need help losing some weight
- You and your girlfriend would like to keep your relationship,even if you see each other infrequently
- You need somewhere nearby that you can stay,which is within your budget
- A place you know where you stand in terms of a lease,with a set of rules that are agreed upon as to when and if you have to leave
The organization and people I will introduce you to specialize in retaining Big Data like yourself and your friend. They can load billions of records a day, and unlike me their type of shrinking allows you to fit in a much smaller space. The result is a form of compression weight loss that forgoes the need for high-speed elevators. You can then choose your alternate Joe Storage apartment of choice; they negotiate and enforce the lease, and security is also provided. To your ex-girlfriend, or more to the point her Application boss or her superiors the end-users, you are the same Big Data. Nothing has changed about you; they can see you whenever they want. And best of all, the total cost will be less than what you were paying, often 10x less. Finally, you and your friend can move in immediately, no complicated setup required.”

Big Data: “Doc, you’re a lifesaver, I can’t thank you enough!”

Doctor: “You’re welcome, maybe you’ll come back in a few weeks and tell me how you are doing?”

Big Data: “Definitely, you can count on it.”

First of all, apologies for the lack of posts the last month or so. I’ve been busy working on the launch of significant enhancements to the RainStor product and lots of exciting activity with our partners, including our recently announced relationship with Dell. My other focus has been my fast growing identical twins Parker and Ryan, who are now 7 months old. So pardon my indulgence as I combine the two into this blog post. As ever, comments are welcome, but be gentle as I’m operating on a sleep deficit :)
As per Wikipedia, Shared Nothing (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. Shared Nothing architectures have become prevalent in the Data Warehousing space, with products such as EMC Greenplum, HP Vertica and Teradata providing Big Data analytic solutions. Hadoop with HDFS is also an example of an SN environment.

With that as a brief backdrop, let me turn my attention to the challenge of being a parent to twins. Together with the obvious extreme lack of sleep, the other conundrum is: do we need to buy two of everything, and if not, how will we share it? If you are a parent of twins (or two siblings), the scenario and questions below will be familiar:

1. Will they both need parallel access to it at the same time? (Items such as bottles, clothing, pacifiers and car seats meet this criterion.) This is a simple example of “shared nothing” with no contention, whereby each boy has his own item and is able to happily focus on it in a self-contained manner. With the use of their own bottles at feeding time, we are able to feed both boys simultaneously, in half the time it would take if we proceeded sequentially. For SN systems such as Hadoop, massive parallelization by simply adding more nodes, each with its own locally attached disk, allows it to scale to handle Big Data volumes.

2. How will we provide access to an item if one or the other needs it? (For example, using a fixed or portable changing table.) With a changing table built into a dresser, we are pretty much saying that all diaper changing will be done in a fixed location (at least around the house). We therefore bring each boy to the nursery where the table resides. In Hadoop, MapReduce function processing is moved to keep the work as close to the data as possible to reduce network traffic. This avoids moving the data itself, which is known as “data shipping”.
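To make the "move the work to the data" idea concrete, here is a toy sketch in Python. It is not Hadoop's API: the `Node` class, the sample log lines and the `count_errors` function are all hypothetical, and threads stand in for separate machines. Each node owns its own slice of the data, the function is sent to every node, and only a small partial result travels back to be combined.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """A hypothetical shared-nothing node that owns its own slice of data."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # stands in for locally attached disk

    def run(self, func):
        # The function travels to the data; only its small result travels back.
        return func(self._records)


def count_errors(records):
    """Per-node work: count the log lines that contain the word ERROR."""
    return sum(1 for line in records if "ERROR" in line)


nodes = [
    Node("node-1", ["INFO start", "ERROR disk", "INFO ok"]),
    Node("node-2", ["ERROR net", "ERROR auth"]),
    Node("node-3", ["INFO idle"]),
]

# Every node works on its own records in parallel, with no shared state.
with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
    partials = list(pool.map(lambda node: node.run(count_errors), nodes))

# The coordinator only ever sees one small number per node.
total = sum(partials)
print(partials, total)
```

The point of the design is that the raw records never leave the node that stores them, so adding nodes adds both storage and processing capacity without creating a shared point of contention.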
In contrast, Shared Disk and Shared Everything architectures don’t have a “data shipping” issue because each node has access to all of the data. Obviously Shared Nothing vs. Shared Disk vs. Shared Everything is a much more complex and sophisticated technical topic which I won’t be covering today. Throw in OLTP vs. OLAP and the Cloud and you have an even spicier debate. If you are interested, you can check the links below for some good discussion. And as usual, excellent reference material is available through links on Wikipedia:

To tie off this post with my two main focuses: RainStor’s unique architecture physically stores data using a “Shared Nothing” paradigm, while at the same time providing access from any SN node, mitigating the “data shipping” network transfer bandwidth problem by significantly compressing the data. This is one reason why RainStor is an ideal solution to ingest and support the query of ever growing Big Data volumes that have to be retained at petabyte-scale. Meanwhile, Parker and Ryan are themselves growing and scaling at an alarming rate.

Cloud ‘N Clear - Established April 2009