Last night I attended the excellent SDForum Cloud Services SIG Demo Night: Databases in the Cloud, hosted by SIG co-chairs Bernard Golden, CEO of Hyperstratus, and Dave Nielsen of Platformd. These are my notes for your reading pleasure. As ever, if you have comments or questions, please leave me a comment.
The presenters were:
- Roger Magoulas, Research Director at O’Reilly Media – kicked things off with an introduction to “big data”
- Ryan Barrett, Technical Staff (and one of the founders of Google App Engine) and lead on the Google Datastore – presented and demoed App Engine
- Christophe Bisciglia, co-founder of Cloudera - talked about Hadoop, MapReduce and Cloudera
- George Kong, member of technical staff, Aster Data – presented Aster Data and did a demo
After some introductions by Bernard, Roger Magoulas kicked things off with a presentation about “big data”. He described his use case at O’Reilly Media, where he has a 6 TB data set with over 1 billion rows of “messy data”. He uses Greenplum’s MPP database for his analytics (with the disclaimer that he and Tim O’Reilly have been granted options by Greenplum and do promote them). The premise of “big data” is that we are generating data, and new data types (such as unstructured, geo and graph data), at a phenomenal rate, and their uses (web, shared, public) are increasing just as quickly. He suggested taking a look at data.gov as an example of the coming data explosion.

A very interesting real-life example was Walt tools putting RFID tags on their tools so that workers can see where their tools are in their truck. The system paid for itself in two months, because workers forget their tools all the time and it costs time and money to go back for them. Other examples of large-scale data generation were cellphones that broadcast your movements and location (for instance, tracking which stores people visit in a mall). Capturing human behavior, both implicit and explicit, adds up to significant data accumulation. He noted that the new technologies supporting big data work on a premise of write infrequently, read often, with flexible schemas that are not predefined. Quote: “You can be stupid and still get great performance.”
He contrasted three big data technologies: column stores are good for integer processing, MPP assumes the hardware won’t break, and MapReduce assumes breakage and builds in fault tolerance. He also noted that Hadoop is not the most “joule efficient” in terms of power consumption, and highlighted the challenge of uploading large quantities of data to the cloud given bandwidth constraints (Amazon now lets you “ship your drive to them”, something Ryan from Google noted was a good idea that they should consider).
The next presentation, by Ryan Barrett, co-founder of Google App Engine, described how App Engine is designed to be a scalable web application platform running on Google infrastructure. He noted that applications typically are not architected for massive volumes right off the bat, and that a lot of rewriting is done as volumes and processing increase, just to stay in the same place. What Google App Engine is great at is being stateless and avoiding single points of failure, with full built-in support for distributed processing and sharding best practices baked in. He said “waterfall is out”. App Engine has its own query language, is designed for easy schema changes, is interoperable (supporting bulk upload), and lets you write in Python and Java. He did a quick live demo, showing how you can develop an app locally and then publish it live quickly. The App Engine web console he showed highlighted how the data view infers schema attributes on the fly. He mentioned that announcements about the future direction of App Engine are forthcoming, and noted that data mining is not an area of particular focus, but one they would love to partner on. When quizzed on pricing, he said Google currently charges 15 cents/GB/month, competitive with Amazon S3.
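To make the sharding idea concrete, here is a toy sketch in plain Python (this is my own illustration, not the App Engine API; the shard count and function names are invented) of the sharded-counter pattern: each write goes to a randomly chosen shard so no single record becomes a write hotspot, and a read sums across all shards.

```python
import random

NUM_SHARDS = 20
shards = [0] * NUM_SHARDS  # stands in for NUM_SHARDS separate datastore entities

def increment():
    """Write path: bump one random shard, spreading write contention
    roughly NUM_SHARDS-fold compared with a single counter row."""
    shards[random.randrange(NUM_SHARDS)] += 1

def count():
    """Read path: aggregate every shard to get the true total."""
    return sum(shards)

for _ in range(1000):
    increment()
print(count())  # 1000, regardless of how the increments were distributed
```

The trade-off is a slightly more expensive read in exchange for writes that scale horizontally, which is exactly the kind of pattern App Engine bakes in so you don't rediscover it under load.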
The next presenter was Christophe Bisciglia, co-founder of Cloudera, who gave an excellent presentation elegantly framing the big data issue with an iceberg analogy. Cloudera is an open source company that provides support and consulting for Hadoop; they “bring big data to the enterprise”. His first premise was that, given $1,000 to spend on hardware, you could have either 10 TB of hard disk or 32 GB of memory. The iceberg showed that 99.7% of the data is “underwater”, which led him to his two-minute overview of Hadoop (see also my previous blog post on Hadoop for beginners). Hadoop’s core is HDFS (the Hadoop Distributed File System), which spreads data over a set of clustered machines; the computation is pushed out to the data, and failure is handled in software. He then contrasted MapReduce (developed by Google), Hive (a data warehousing framework with SQL, developed by Facebook) and Pig (a high-level language for analytics, developed by Yahoo). He stressed that Hadoop is batch-oriented and does not serve data in real time: it should augment an existing RDBMS, enabling deeper and more flexible analysis. Another analogy, showing the RDBMS as a Ferrari and Cloudera/Hadoop as a freight train, “drove” home the point. Finally, he walked through a canned demo of Cloudera’s wizard-like configurator, which takes the pain out of defining a Hadoop cluster: you simply answer a series of questions about your hardware and it creates an RPM bundle for Red Hat Linux systems or an Amazon EC2 image. Cloudera also provides excellent free online training at www.cloudera.com/hadoop-training and, according to DJ from LinkedIn, “their in person training is even better!”
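For readers new to the model, the two-minute version of MapReduce can itself be sketched in a few lines of plain Python (a single-process toy with made-up documents, not Hadoop): map emits key/value pairs, a shuffle groups them by key, and reduce folds each group.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key; in Hadoop this is the
    # sort-and-transfer step between mappers and reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each group of values down to a single result per key.
    return {word: sum(values) for word, values in groups.items()}

docs = ["big data big clusters", "data moves to compute"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"])  # 2 2
```

The point of the framework is that map and reduce are the only parts you write; partitioning the input, re-running failed tasks, and moving intermediate data between machines is what Hadoop handles for you.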
Last but by no means least, George Kong, member of technical staff at Aster Data, started his presentation by highlighting ShareThis, one of Aster Data’s flagship customers. Aster Data launched last May and has about 12 customers; ShareThis was said to have an approximately 10 TB database with 200 GB of data being added per day, plus one backup cluster. George described the Aster architecture as an nCluster database: using a bee analogy, there is a Queen Bee tier, Workers and Loaders. He stated that Aster Data is able to load data 24×7 without affecting query performance, calling it the only product that can “load, query and backup” at the same time as an elastic database. Aster Data uses Postgres in its underlying foundation, together with MapReduce. George noted that Hadoop is good for large ETL loads, but that Aster Data is more optimized for distributed analysis. He highlighted how an Aster SQL query (through their nPath module) can easily retrieve results not possible with normal RDBMS SQL. Finally, he demoed Aster Data’s console, a very clean, intuitive interface that made it easy to add additional nodes.
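To give a flavor of the kind of question nPath targets, here is a rough plain-Python sketch (the users, event names and data are invented, and this is not Aster's SQL syntax): find the users whose clickstream contains a given sequence of events in order, something that is awkward to express in vanilla SQL.

```python
# Each user's events, already ordered by time.
clickstream = [
    ("alice", ["home", "search", "product", "checkout"]),
    ("bob",   ["home", "product"]),
    ("carol", ["search", "product", "checkout"]),
]

def matches(events, pattern):
    """True if `pattern` occurs within `events` in order
    (not necessarily adjacent), like a sequential pattern match."""
    it = iter(events)
    # `step in it` advances the iterator until it finds `step`,
    # so later steps can only match after earlier ones.
    return all(step in it for step in pattern)

pattern = ["home", "product", "checkout"]
converted = [user for user, events in clickstream if matches(events, pattern)]
print(converted)  # ['alice']
```

In SQL this ordered-sequence condition would need self-joins or window-function gymnastics over the event table; expressing it as a path pattern per user, evaluated inside the database, is the convenience nPath is selling.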
In summary, this was an excellent meeting, thanks again to Bernard and Dave for putting it together. The presentations will be online shortly and I will link to them via this post when they are up.