I was recently asked to define structured versus unstructured data and to explain how the different types of data are being managed within the enterprise. I thought I’d list my responses below in case you find them useful or interesting. As ever, feedback and debate are welcome.
What are the challenges seen with the two different types of unstructured information — unstructured data, such as machine-generated data, versus unstructured content, such as human generated information in emails or social media?
To be clear, machine-generated data (MGD) does have structure. It’s just that the structure is not strictly enforced in a traditional relational database context. In many cases, the data can be considered multi-structured, since there are several ways the data can be viewed without being fixed to a rigid, permanent relational format. For example, MGD is often placed into Hadoop in raw form and subsequently given structure through late-binding MapReduce processing. Data is often also loaded into HBase, and Hive is then used to provide structure through a SQL-like syntax so that traditional analytical insights can be gained.
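To make the late-binding idea concrete, here is a minimal sketch in plain Python rather than real Hadoop code. The pipe-delimited record layout, the sample values and all function names are assumptions made up for illustration. The raw lines are kept exactly as they arrived, and a structure is only applied when a question is asked of them.

```python
from collections import Counter

# Hypothetical raw call-detail-style records, stored exactly as they arrived.
RAW_LINES = [
    "2012-06-01T10:00:00|447700900001|447700900002|120",
    "2012-06-01T10:05:00|447700900001|447700900003|30",
    "2012-06-01T11:00:00|447700900002|447700900001|45",
]


def parse_call(line):
    """Bind field names and types to a raw line at read time."""
    timestamp, caller, callee, seconds = line.split("|")
    return {
        "timestamp": timestamp,
        "caller": caller,
        "callee": callee,
        "seconds": int(seconds),
    }


def total_seconds_by_caller(lines):
    """One view: map each line to (caller, seconds), then reduce by key."""
    totals = Counter()
    for line in lines:
        record = parse_call(line)
        totals[record["caller"]] += record["seconds"]
    return dict(totals)


def calls_by_hour(lines):
    """A second view of the same raw lines, keyed on the hour instead."""
    counts = Counter()
    for line in lines:
        hour = parse_call(line)["timestamp"][:13]  # e.g. "2012-06-01T10"
        counts[hour] += 1
    return dict(counts)
```

Both functions read the same untouched lines, which is the sense in which the data is multi-structured: the view is chosen by the query, not fixed when the data is stored.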
The reason that many call MGD unstructured is that it is often stored in flat files rather than in a relational database. These flat files resemble the containers that hold unstructured human-generated content, such as emails or social media posts.
- From a storage and retention perspective: The key difference between MGD and emails or social media content is that the eventual structure of MGD allows it to be efficiently analyzed and compressed through value and pattern de-duplication. My company, RainStor, can use the structure within MGD to reduce the raw physical footprint of this data without losing any of its meaning, thereby saving significant amounts of storage. With unstructured content such as emails and other forms of free-format text, space savings are limited to binary compression (much like that for images and video), which can only marginally reduce the amount of storage required to keep the data.
- From an accessibility perspective: MGD can be analyzed using both MapReduce and traditional SQL with existing BI tools. At RainStor we provide a database over Hadoop that allows both forms of access. Interpreting emails and social media content requires free-text scanning of the content to find patterns and to build the metadata and indexes that drive search and discovery of the information. Free-text search also requires context and ontologies to be effective; there are specialized products such as HP’s Autonomy and Oracle’s Endeca that provide such capabilities.
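The storage point above, value de-duplication, can be illustrated with a simple dictionary-encoding sketch. This is not RainStor's actual algorithm, which is not described here. The function names and sample records are hypothetical, and a single shared dictionary across all fields is a simplifying assumption.

```python
def dedupe_values(records):
    """Store each distinct field value once; rows become tuples of integer ids."""
    values = []   # distinct values, in first-seen order
    ids = {}      # value -> its position in `values`
    encoded = []
    for record in records:
        row = []
        for value in record:
            if value not in ids:
                ids[value] = len(values)
                values.append(value)
            row.append(ids[value])
        encoded.append(tuple(row))
    return values, encoded


def restore(values, encoded):
    """Rebuild the original records, showing that no meaning was lost."""
    return [tuple(values[i] for i in row) for row in encoded]


# Hypothetical records with the repetition typical of machine-generated data.
records = [
    ("447700900001", "VOICE", "UK"),
    ("447700900001", "VOICE", "UK"),
    ("447700900002", "SMS", "UK"),
]
values, encoded = dedupe_values(records)
```

Here nine field values are stored as five distinct values plus small integer references, and `restore(values, encoded)` returns the original records. Free-format text such as an email body rarely repeats whole values in this way, which is why it gains far less from this kind of approach.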
Are businesses being inundated with this data? Are they getting value from this data, or is most of it passing through unnoticed?
The two main reasons why this data might be retained are:
- Compliance: In many industries, MGD is subject to regulations that require it to be retained for mandated periods, for example Call Detail Records in Telco and trades and quotes in Financial Services. The massive storage requirements generated by the accumulation of this data have led companies to seek out new databases such as RainStor, which can not only reduce the data footprint through value and pattern de-duplication but also provide the immutability required to meet regulatory demands. For unstructured content such as email communication, there are laws that require internal and external emails to be discoverable, and there are many products that provide archiving tools to capture and retain this content.
- Business Competitive Edge: MGD can provide a competitive edge because patterns can be explored over greater volumes of data and longer time periods, provided the data can be retained. Sophisticated CRM systems that include customer interaction and support through emails already have technologies that can interpret the “mood” of a customer from the tone of their emails.
Will the value of this data surpass structured transactional data? If there is value now being seen, what types of applications or processes are taking advantage of unstructured data?
Both sets of data are equally valuable, although it can be argued that enterprises are only slowly awakening to the value of unstructured Big Data. New types of databases and technologies such as Hadoop are being used to take advantage of this data. Additionally, at a macro level, social media analysis of Twitter trends and comments through products such as Salesforce.com’s Radian6 already provides insight into crowd sentiment and a way of engaging with the prospect or customer through what Salesforce deems the “Social Enterprise”.
What areas of the business are now benefiting from unstructured data — both machine- and user-generated data? What are the issues that still need to be tackled with the two types of unstructured data?
As previously detailed, compliance and regulation in industries such as Telco and Financial Services are dictating company-wide retention of Big Data, ranging from MGD to email content. Additionally, across all industries, marketing departments are seeking better ways to connect with customers by analyzing their sphere of influence, that is, who they are connected to socially, to help drive more targeted sales. They also use social media to manage their company brand, leading the way in the use of sentiment analysis and in the capture and review of web clickstream data to better understand customer behavior and patterns.
Is there progress integrating this data into core enterprise systems?
Many enterprises are investigating the use of Hadoop to handle unstructured data and are looking for ways to bridge the gap by integrating it with traditional structured enterprise systems, enabling predictive learning on top of the combined information. RainStor provides the ability to handle unstructured MGD while retaining it and providing access to it through both traditional enterprise tools and new MapReduce paradigms. This allows you to ask traditional questions of the data, but also to explore questions that have yet to be thought of.
As for blending truly unstructured content such as social media and emails with enterprise data, large companies such as HP, Oracle, Dell, IBM, Salesforce and others are building or acquiring complementary technologies to provide an enterprise view across all data sources. Up-and-coming startups aiming to provide an all-encompassing, cross-data view of Big Data for enterprises include Factual, Clearstory and Reltio.