I presented a workshop at the MDM & Data Governance Summit in San Francisco this week on the topic of Big Data and Master Data Management (MDM). It was a particularly interesting topic for me because I have spent the last 8 years working first as VP Product Marketing at Siperian (a leading MDM provider acquired by Informatica) and, for the last 3 years, as VP Product Management at RainStor, a Big Data database provider.
For the workshop, I presented some base-level definitions around Big Data, types of data and new classes of database, and a quick overview of Hadoop and MapReduce. I was followed by Manish Sood, the CEO and Founder of Reltio, a new company offering capabilities to model and visualize Big Data from an unlimited number of widely varied sources, who presented a real-life case study from a financial services institution. Finally, Inderpal Bhandari, VP of Knowledge Solutions and Chief Data Officer at Express-Scripts, Inc., presented a whole range of additional use cases (some Big Data related, some not) ranging from retail to social media analysis.
The audience was very engaged, which led to some interesting questions that I thought I’d recap here for your reading pleasure:
Q1. Doesn’t MDM already touch and handle millions of records? Isn’t that considered Big Data?
While an MDM hub can handle data from many sources with volumes in the millions of records, it doesn’t match the size or complexity of Big Data as currently defined and recognized by most players in the industry. Firstly, MDM cleanses, matches and merges master reference data (e.g. customer name, address etc.), which is significantly less voluminous than the transactional data (e.g. customer orders) stored in applications. That transactional data is part of the 360 degree view, which MDM cross-references once a system of record is established. Additionally, other types of data now include interaction data (e.g. social media activity) and machine-generated data (e.g. from sensors), and those types of data quickly reach tens of terabytes to petabytes in volume.
Volume is not the only distinguishing factor in Big Data. The other “V’s” are Velocity, the rate at which data is being generated and captured (often in the billions of records per day), and Variety, meaning multi-structured or non-relational data that cannot be captured and accessed through standard RDBMS and data warehouses.
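To make the cleanse, match and merge steps mentioned above a little more concrete, here is a minimal sketch in Python. It is not how any particular MDM product works: the exact-match rule on name and address, the "first non-empty value wins" merge rule, and all of the field names and sample records are assumptions made purely for illustration. Real hubs typically use much richer fuzzy matching and survivorship rules.

```python
from collections import defaultdict

def cleanse(record):
    """Normalize every field: trim, collapse whitespace, lowercase."""
    return {key: " ".join(str(value).split()).lower() for key, value in record.items()}

def match_key(record):
    # Hypothetical exact-match rule: same name and same address means same customer.
    return (record["name"], record["address"])

def merge(records):
    """Build one master record, keeping the first non-empty value seen for each field."""
    master = {}
    for record in records:
        for key, value in record.items():
            if value and not master.get(key):
                master[key] = value
    return master

def build_master(source_records):
    """Cleanse each source record, group the matches, and merge each group."""
    groups = defaultdict(list)
    for raw in source_records:
        clean = cleanse(raw)
        groups[match_key(clean)].append(clean)
    return [merge(group) for group in groups.values()]

# Made-up records from hypothetical source systems (e.g. CRM and billing).
crm = {"name": "Jane  Doe", "address": "1 Main St", "phone": ""}
billing = {"name": "jane doe ", "address": "1 main st", "phone": "555-0100"}
other = {"name": "John Roe", "address": "9 Oak Ave", "phone": "555-0199"}

masters = build_master([crm, billing, other])
for master in masters:
    print(master)
```

The two Jane Doe records collapse into a single master record that picks up the phone number from the billing system, while John Roe remains a separate record. Note how small this reference data is compared with the orders, social activity or sensor readings that would hang off each master record.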
Q2. How does Big Data affect MDM and what my business users want?
While MDM has allowed siloed data sources within applications across an enterprise to be reconciled to gain a 360 degree view of a customer or product, the reality is that new data sources, such as Facebook, Twitter and other forms of social media, have appeared in recent years, providing external insights into the behavior, characteristics and relationships of customers. The types of answers that marketing and sales teams are looking for now go beyond what MDM can provide. For example, it used to be sufficient to use MDM to gain an understanding of what products a particular customer is purchasing across an enterprise. Marketers now also want to know what products that customer may be buying or favoring from competitors, or influencing the purchase of within their social network. For that purpose, MDM alone is no longer sufficient. Ironically, while MDM is used to consolidate reference data from multiple internal sources and a few external sources such as DnB etc., gaining insights from Big Data means combining many more sources from different feeds, with MDM itself being a contributing source.
Q3. So what is this Hadoop thing, and why should I look at it and other new generation products like Reltio?
Hadoop is a platform that enables big data management at scale on commodity hardware. It features MapReduce, which allows data to be processed independent of schema and can handle the ingestion and analysis of extremely high-velocity, high-volume multi-structured data. This provides an operating framework to ask all manner of questions about the data without having to conform to a fixed data model. In many instances this freedom is combined with a NoSQL form of database (open source HBase or Hive or a product such as RainStor) in order to efficiently manage and provide effective access to the data captured.
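For readers who have not seen MapReduce before, the idea can be sketched in a few lines of plain Python. To be clear, this is a single-process toy that only imitates the map, shuffle and reduce stages; it does not use Hadoop or any of its APIs, and the function names and sample records are my own assumptions for illustration. It does show the "independent of schema" point: the mapper decides how to interpret each record at read time, so differently shaped records can flow through the same job.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (user, 1) pairs; each record is interpreted when it is read."""
    if isinstance(record, dict):
        # e.g. an already-parsed social media event; the user may be missing
        yield record.get("user", "unknown"), 1
    else:
        # e.g. a raw "user, action" log line
        yield record.split(",")[0].strip(), 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Made-up, multi-structured input: two log lines and two event dictionaries.
records = [
    "alice, viewed_product",
    {"user": "bob", "action": "liked_page"},
    "alice, purchased",
    {"action": "shared_link"},
]

counts = run_job(records)
print(counts)
```

On a real cluster the map and reduce functions run in parallel across many machines and the framework handles the shuffle, which is where the scale on commodity hardware comes from.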
This is all well and good if you have the technical expertise (Hadoop consultants are hard to find, hence the popularity of companies like Cloudera). So applications like Reltio are set to take center stage by doing the heavy lifting of capturing, consolidating, modeling and visualizing the data to make sense of it all, without a bus load of consultants.
Q4. What would be the signs that I need to look into Big Data?
Deploying an MDM initiative is a big enough project in its own right, and if you are in charge of one, it may not fall to you to examine Big Data in the context of your efforts. Some signs, as mentioned previously, include your end-users voicing interest in the cross-reference hierarchies and relationships between your customers, suppliers etc., and showing a hunger to gain more insight from multi-structured data sources and social media feeds. It is more than likely that someone else in your company has already been put in charge of looking at this Big Data thing, but be prepared for that call or tap on the shoulder, as you still hold the “Master” data to which everything is related. Sooner or later the two worlds will collide.
That’s it for now, although there were many more questions posed. Let me know if there is interest in exploring more of them, and if you have questions or comments of your own, please post them for discussion. Also, please take a look at my post from last year, Mastering Big Data Management.