I attended an IEEE-CNSV (Consultants Network of Silicon Valley) meeting on Cloud Computing Paradigms: MapReduce, Hadoop and Cascading in Santa Clara last night. It was extremely well attended with about 100+ people listening to Chris K. Wensel of Scale Unlimited providing an excellent, very technical overview of Hadoop, MapReduce and Cascading, an open source project originated by Chris that provides a feature rich API for more easily implementing complex distributed processes in Hadoop and MapReduce.
The room was filled with very smart engineers, scientists (one guy had over 50 patents to this name) and an abundance of “expert witnesses”. So I’m sure they had no problem keeping up with Chris’ presentation which I found very enlightening. However, it did take a significant combination of leveraging my Computer Science degree where I majored in Artificial Intelligence, writing systems in Lisp and Prolog, together with my years of working with Oracle RDBMS and a healthy dose of Java and UNIX/LINUX terminology comprehension for me to get the most out of the presentation. If you’d rather dig technically deeper into Hadoop, MapReduce and Cascade, I’d suggest you take a look at Chris’ excellent site www.cascading.org.
First a quick fun fact, the name Hadoop comes from the name of a plush elephant toy (pic to the left is not the original) belonging to the child of the original developer, Doug Cutting (who now works for Cloudera). It is an open source software platform that allows you to design and run applications that need to store and process humongous amounts of data. We are talking petabytes here not your garden variety Oracle business application with 20 to 50 terabytes. In everyday terms, as of Jan 2008, Google say they process about 20 petabytes a day, the New York Times speculated in 2006 that “the entire works of humankind, from the beginning of recorded history, in all languages” would amount to 50 petabytes of data, which for you data warehousing buffs means that it can be stored in a Teradata 12 database. So it’s not surprising that many of the major social networking apps like Facebook (approx. 1.5 petabytes of user photos, roughly 10 billion photos) use Hadoop.
Hadoop distributes data and processing across clusters of computers (commodity hardware) and as a result can execute in parallel enabling rapid insertion and query of information. Recent performance benchmarks from Yahoo show Hadoop sorting a Petabyte in 16.25 hrs and a terabyte in 62 seconds, reclaiming a record previously held by Google. Another benefit of Hadoop is the built in fault tolerance because Hadoop automatically maintains multiple copies of the data and is able to redeploy processing in case of failure.
Hadoop implements MapReduce, which is a software framework introduced by Google for distributed computing. As Chris described in this presentation, MapReduce is actually Map, Group then Reduce. Without getting too technical, think of the Map phase as the feeding in of the data together with the keys and values. For example assume you have 2 records. First record has a key of first name ‘Ramon’, with a value of ‘1’. The second record looks exactly the same with another key of ‘Ramon’ with a value also of ‘1’. The Group phase would then reorganize the information into ‘Ramon’ with two 1’s. Then the Reduce phase would perform aggregation (count) and the stored result would be ‘Ramon’ and ‘2’. This is of course a trivial example and the most common one typically shown in MapReduce examples. But the basic premise is that the data can be efficiently distributed, stored and retrieved at a high rate of performance within the cluster. (Update: For a great way to explain MapReduce using a worker/visual analogy see http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/ and also visit Kristina Chodorow’s (of MongoDB fame) blog for an entertaining Star Trek analogy.
All sounds good right? Some of you as old as me might feel that this smacks a little of pre-RDBMS file systems back in the early days of computing where you had to write and manage all of the routines to retrieve and store data through low-level programming. For example, there is no “schema” in Hadoop/MapReduce, nor any transactional boundaries (commit processing). Chris’ Cascading project aims to make using MapReduce easier by allowing you to focus on the fields you want to store and retrieve, and not the heavy lifting of having to visualize concepts in MapReduce. However there are still major opportunities for improving ease of use, judging by the number of times Chris made mention of how to “game” or “hack” Hadoop to do what you want.
This is an opportunity being exploited by companies such as Cloudera who offer consulting services as well as a neat wizard to help define your Hadoop cluster by answering basic questions about your hardware configuration. Other companies such as Greenplum (ebay I believe is a customer) and Aster Data (another high profile customer is MySpace), also use MapReduce (albeit leveraging Postgres rather than using Hadoop) in a more user friendly way allowing SQL statements (or a variant thereof) in their commercial high performance analytical databases. Thereby introducing MapReduce to more mainstream business use. Putting them squarely in competition with a plethora of Columnar Databases out there such as Vertica and Paraccel in the land grab for the high performance analytical data warehousing market. Evan Levy of Baseline Consulting recently wrote a nice blog post on this very topic, and Merv Adrian’s blog has the very latest as he has recently spoken to, or visited many of the major analytics DB players.
Since Hadoop is open source, there are a number of initiatives to improve the system to overcome issues that prevent Hadoop from being used in more traditional RDBMS style processing scenarios. But is this the wise thing to do? IT and Computing, like most things in life, are cyclical in nature. Mainframes were outdated and on their way out in the 90s with the rise of Client Server computing, then a few years later the mainframe or server base computing (with thin client) came back into vogue, now cloud computing is the rage. Similarly it appears to me that primative low level file storage with programmatic manipulation was succeeded by RDBMS systems with SQL, now the lower level of file system storage through Hadoop with “roll your own SQL queries” is back.
Even though Hadoop is getting more and more popular for processing large datasets, the dirty little secret might be that Hadoop is not quite enterprise class yet, with a single point of failure in the master node, weak security standards and relatively poor binary data compression, which with replication requirements actually results in x times more physical storage required, and comes with data access performance penalties. Also not everyone can run a Facebook or Google size server farm to perform their calculations, no matter how cheap the storage or server hardware.
It will be interesting to see how everything plays out. Certainly there is no disputing that apps such as Facebook and Sharethis (another Aster Data customer) could only exist today with Hadoop and/or MapReduce style implementations. Oracle RDBMS simply isn’t up to the task and is more suited for its current business usage. Which brings me back to the context of the original title of my post, will Oracle just sit around and watch Hadoop and MapReduce takeoff? On the high performance analytics side they do offer Oracle Exadata, but will they make a move to acquire one of the many startups out there? Sybase already has a columnar DB offering through their IQ database, in fact they have sued Vertica claiming patent infringement. Certainly RDBMS’ are better at real-time while Hadoop is more batch oriented, but will Oracle or any of the big RDBMs vendors offer anything themselves in the area of MapReduce? Will they get in the game? Or will they stay “Irelephant” in this area and allow Greenplum, Aster Data or any of the columnar database vendors to become the next Oracle for non RDBMS big data?
Hopefully you got a sense of Hadoop and MapReduce from this post. Like the plush elephant that Hadoop is named after, I’m sure you’ll never forget