Data Scientists and Big Data: Predicting Linsanity and Targeting Pregnant Teens

A couple of interesting articles ran through my #BigData tweetstream last week. Shout out to Ivan Chong (@ichong) at Informatica, whom I retweeted:

Can this guy be considered a Data Scientist? online.wsj.com/article/SB1000… “Delivery Guy Who Saw Jeremy Lin Coming” #bigdata #overlooked #linsanity

“The Delivery Guy Who Saw Jeremy Lin Coming” tells the intriguing story of a FedEx delivery driver and numbers hobbyist named Ed Weiland, who in May 2010 wrote a long-term forecast of Jeremy Lin for the basketball website Hoops Analyst. His analysis of Lin’s college career statistics stated that, beyond John Wall, the top point guard available in the draft, there might be a surprise point guard out there: “The best candidate to pull off such a surprise might be Harvard’s Jeremy Lin. The reason is two numbers Lin posted, 2-point FG pct and RSB40. Lin was at .598 and 9.7. This is impressive on both counts. These numbers show NBA athleticism better than any other, because a high score in both shows dominance at the college level on both ends of the court.”

Such clairvoyance, based on examining the statistics in detail rather than relying on “traditional” physical makeup or well-known college basketball pedigrees, is the stuff of baseball Sabermetrics, or Moneyball (to you Brad Pitt/Oakland A’s fans out there). Interestingly, basketball coaches and general managers evidently did not read much into these stats (if they looked at them at all), nor did they pay any attention when Jeremy Lin outplayed #1 pick John Wall, also a point guard, in the Summer League, but I digress … Such examination of statistics, blended with other related (or yet-to-be-discovered related) factors, is what people are starting to describe as the work of Data Scientists.
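For the curious, here is a rough sketch of what a two-number screen like Weiland's might look like in code. The article quoted above does not define RSB40; I am assuming it means rebounds, steals and blocks per 40 minutes. The class name, function names, thresholds and sample totals below are all hypothetical and for illustration only. They are not Weiland's cutoffs or Lin's actual season totals.

```python
from dataclasses import dataclass


@dataclass
class SeasonTotals:
    """Season box-score totals for one college player (hypothetical structure)."""
    minutes: float
    two_pt_made: int
    two_pt_attempted: int
    rebounds: int
    steals: int
    blocks: int


def two_point_pct(s: SeasonTotals) -> float:
    """Share of 2-point field goal attempts that were made."""
    return s.two_pt_made / s.two_pt_attempted


def rsb40(s: SeasonTotals) -> float:
    """Rebounds + steals + blocks per 40 minutes (assumed definition of RSB40)."""
    return (s.rebounds + s.steals + s.blocks) * 40 / s.minutes


def passes_screen(s: SeasonTotals, min_two_pt_pct: float, min_rsb40: float) -> bool:
    """True only if the player clears BOTH thresholds, one for each end of the court."""
    return two_point_pct(s) >= min_two_pt_pct and rsb40(s) >= min_rsb40


# Made-up totals, chosen so the arithmetic is easy to follow.
sample = SeasonTotals(minutes=1000, two_pt_made=120, two_pt_attempted=200,
                      rebounds=150, steals=60, blocks=40)

print(two_point_pct(sample))   # 120 / 200
print(rsb40(sample))           # (150 + 60 + 40) * 40 / 1000
print(passes_screen(sample, min_two_pt_pct=0.55, min_rsb40=9.0))  # illustrative thresholds
```

The point of requiring both numbers at once is the one Weiland makes in the quote: a high score on either alone says less than a high score on both.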

The newly popularized term Data Scientist has emerged together with Big Data and Hadoop as one of the highest-paid and sexiest (as far as IT can be sexy to those of you not in the biz) job titles out there. EMC recently published an infographic, along with a Mashable article calling it “the career of the future”. As a counterpoint, respected industry analyst Neil Raden recently weighed in that Data Scientist is an overused term. Terminology aside, while it is doubtful that Ed Weiland derived his analysis by running a home-built Hadoop cluster with hand-coded MapReduce jobs, he nevertheless identified a combination of statistics that was vindicated by Jeremy’s subsequent rise to starting point guard for the New York Knicks.

Meanwhile, a professional “data scientist” was busy uncovering market and buyer patterns that ended up identifying a pregnant teen before her father even knew. Andrew Pole, working for Target stores, revealed how the history of everything a customer has ever bought, plus any demographic information Target has collected from them or bought from other sources, is used to determine whether a woman has a high probability of being pregnant. Target then uses this information to send coupons or special offers for relevant products. In this instance there is no doubt that some form of Big Data analytics is at play. Traditional big-iron analytics appliances such as Teradata (speculation on my part, as they have a big footprint in retail) may have helped Target home in on the pregnant teen. As has been much publicized, traditional SQL-based data warehousing isn’t the ideal combination of platform and data analysis for these types of speculative, predictive analytics, and products like Teradata, with their powerful capabilities and horsepower, come at a Ferrari-scale price tag. This is where Hadoop and MapReduce might shine more brightly, though a quick glance at the Powered by Hadoop website shows that Target is not listed as a Hadoop user (listing there is strictly voluntary).

Many would consider the Jeremy Lin example cool and the Target example creepy, though a commenter pointed out that companies have been gathering data about us for credit scores and the resulting creditworthiness ratings for decades. With new technologies such as Hadoop making high-performance analytics more affordable than ever, expect to see more cross-domain analysis and use cases expanding beyond marketing research. Whether it’s the likes of Mark Zuckerberg becoming a billionaire through Facebook (powered by Hadoop), the prediction that Jeremy Lin would be an NBA sensation out of Harvard, or the prediction of teen pregnancies, big data analytics and Hadoop are making big, elephant-sized footprints in business and mainstream media.

Meanwhile, I will be at the Strata Conference next week representing my company @RainStor and our database, which was recently ported to run natively on Hadoop. DM me @ramonchen if you would like to meet up. See you there?
