OCDQ Blog


February 19, 2013

Demystifying Data Science

February 19, 2013/ Jim Harris

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, special guest Dr. Melinda Thielbar, a Ph.D. statistician and actual data scientist, and I attempt to demystify data science. We explain what a data scientist does and the requisite skills involved, how to bridge the communication gap between data scientists and business leaders, and how to deliver data products business users can use on their own. We also provide a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, experimentation, and correlation.

Melinda Thielbar is the Senior Mathematician for IAVO Research and Scientific. Her work there focuses on power system optimization using real-time prediction models. She has worked as a software developer, an analytic lead for big data implementations, and a statistics and programming teacher.

Melinda Thielbar is a co-founder of Research Triangle Analysts, a professional group for analysts and data scientists located in the Research Triangle of North Carolina.

While Melinda Thielbar doesn’t specialize in a single field, she is particularly interested in power systems because, as she puts it, “A power systems optimizer has to work every time.”

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.

Gaining a Competitive Advantage with Data — Guest William McKnight discusses some of the practical, hands-on guidance provided by his book Information Management: Strategies for Gaining a Competitive Advantage with Data.

Doing Data Governance — Guest John Ladley discusses his book How to Design, Deploy and Sustain Data Governance and how to understand the difference and relationship between data governance and enterprise information management.

Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).

Measuring Data Quality for Ongoing Improvement — Guest Laura Sebastian-Coleman discusses bringing together a better understanding of what is represented in data with the expectations for use in order to improve the overall quality of data.

The Blue Box of Information Quality — Guest Daragh O Brien on why Information Quality is bigger on the inside, using stories as an analytical tool and change management technique, and why we must never forget that “people are cool.”

Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.

Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.

The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.

The Art of Data Matching — Guest Henrik Liliendahl Sørensen discusses data matching concepts and practices, including different match techniques, candidate selection, presentation of match results, and business applications of data matching.

Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

February 05, 2013

Big Data and the Infinite Inbox

February 05, 2013/ Jim Harris

Occasionally it’s necessary to temper the unchecked enthusiasm accompanying the peak of inflated expectations associated with any hype cycle. This may be especially true for big data, and especially now since, as Svetlana Sicular of Gartner recently blogged, big data is falling into the trough of disillusionment and “to minimize the depth of the fall, companies must be at a high enough level of analytical and enterprise information management maturity combined with organizational support of innovation.”

I fear the fall may feel bottomless for those who fell hard for the hype and believe the Big Data Psychic capable of making better, if not clairvoyant, predictions. In fact, “our predictions may be more prone to failure in the era of big data,” explained Nate Silver in his book The Signal and the Noise: Why Most Predictions Fail but Some Don't. “There isn’t any more truth in the world than there was before the Internet. Most of the data is just noise, as most of the universe is filled with empty space.”

Proposing the 3Ss (Small, Slow, Sure) as a counterpoint to the 3Vs (Volume, Velocity, Variety), Stephen Few recently blogged about the slow data movement. “Data is growing in volume, as it always has, but only a small amount of it is useful. Data is being generated and transmitted at an increasing velocity, but the race is not necessarily for the swift; slow and steady will win the information race. Data is branching out in ever-greater variety, but only a few of these new choices are sure.”

Big data requires us to revisit information overload, a term that originally referred not to the increasing amount of information, but to the increasing access to information. As Clay Shirky stated, “It’s not information overload, it’s filter failure.”

As Silver noted, the Internet (like the printing press before it) was a watershed moment in our increased access to information, but its data deluge didn’t increase the amount of truth in the world. And in today’s world, where many of us strive on a daily basis to prevent email filter failure and achieve what Merlin Mann called Inbox Zero, I find unfiltered enthusiasm about big data to be rather ironic, since big data is essentially enabling the data-driven decision making equivalent of the Infinite Inbox.

Imagine logging into your email every morning and discovering: You currently have (∞) Unread Messages.

However, I’m sure most of it probably would be spam, which you obviously wouldn’t have any trouble quickly filtering (after all, infinity minus spam must be a back of the napkin calculation), allowing you to only read the truly useful messages. Right?

Open MIKE Podcast — Episode 05: Defining Big Data

Magic Elephants, Data Psychics, and Invisible Gorillas

Data Silence

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Statistically Significant Resolution for 2013

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data, Predictive Analytics, and the Ideal Chronicler

What Magic Tricks teach us about Data Science

Swimming in Big Data

What Mozart for Babies teaches us about Data Science

December 04, 2012

The Wisdom of Crowds, Friends, and Experts

December 04, 2012/ Jim Harris

I recently finished reading the TED Book by Jim Hornthal, A Haystack Full of Needles, which included an overview of the different predictive approaches taken by one of the most common forms of data-driven decision making in the era of big data, namely, the recommendation engines increasingly provided by websites, social networks, and mobile apps.

These recommendation engines primarily employ one of three techniques, choosing to base their data-driven recommendations on the “wisdom” provided by either crowds, friends, or experts.

The Wisdom of Crowds

In his book The Wisdom of Crowds, James Surowiecki explained that the four conditions characterizing wise crowds are diversity of opinion, independent thinking, decentralization, and aggregation. Amazon is a great example of a recommendation engine using this approach by assuming that a sufficiently large population of buyers is a good proxy for your purchasing decisions.

For example, Amazon tells you that people who bought James Surowiecki’s bestselling book also bought Thinking, Fast and Slow by Daniel Kahneman, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business by Jeff Howe, and Wikinomics: How Mass Collaboration Changes Everything by Don Tapscott. However, Amazon neither provides nor possesses knowledge of why people bought all four of these books or qualification of the subject matter expertise of these readers.

These concerns, which we could think of as potential data quality issues, would be exacerbated within a small amount of transaction data, where the eclectic tastes and idiosyncrasies of individual readers would not help us decide what books to buy. Within a large amount of transaction data, however, we achieve the Wisdom of Crowds effect: taken in aggregate, the data gives us a general sense of what books we might like to read based on what a diverse group of readers collectively makes popular.
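As a rough illustration of this aggregate effect, the sketch below counts which titles most often share an order with a given book. It is not Amazon's actual method, which the post does not describe; the `also_bought` function and the sample orders are hypothetical and invented for illustration only.

```python
from collections import Counter

def also_bought(orders, title, top_n=3):
    """Count which other titles appear most often in orders containing `title`."""
    counts = Counter()
    for order in orders:
        items = set(order)  # ignore duplicate copies within one order
        if title in items:
            counts.update(items - {title})
    return counts.most_common(top_n)

# Hypothetical orders, invented for illustration only
orders = [
    ["The Wisdom of Crowds", "Thinking, Fast and Slow"],
    ["The Wisdom of Crowds", "Thinking, Fast and Slow", "Wikinomics"],
    ["The Wisdom of Crowds", "Crowdsourcing"],
    ["Wikinomics", "Crowdsourcing"],
    ["The Wisdom of Crowds", "Thinking, Fast and Slow"],
]

recommendations = also_bought(orders, "The Wisdom of Crowds")
```

Note that the counts say nothing about why any individual bought these books together. With only five orders, one eclectic reader can change the ranking; the signal only becomes dependable as the number of orders grows.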

As I blogged about in my post Sometimes it’s Okay to be Shallow, sometimes the aggregated, general sentiment of a large group of unknown, unqualified strangers will be sufficient to effectively make certain decisions.

The Wisdom of Friends

Although the influence of our friends and family is the oldest form of data-driven decision making, historically this influence was delivered by word of mouth, which required you either to be there to hear those influential words when they were spoken, or to have a large enough network of people you knew who would eventually pass along those words to you.

But the rise of social networking services, such as Twitter and Facebook, has transformed word of mouth into word of data by transcribing our words into short bursts of social data, such as status updates, online reviews, and blog posts.

Facebook “Likes” are a great example of a recommendation engine that uses the Wisdom of Friends, where our decision to buy a book, see a movie, or listen to a song might be based on whether or not our friends like it. Of course, “friends” is used in a very loose sense in a social network, and not just on Facebook, since it combines strong connections such as actual friends and family, with weak connections such as acquaintances, friends of friends, and total strangers from the periphery of our social network.

Social influence has never ended with the people we know well, as Nicholas Christakis and James Fowler explained in their book Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives. But the hyper-connected world enabled by the Internet, and further facilitated by mobile devices, has strengthened the social influence of weak connections, and these friends form a smaller crowd whose wisdom is involved in more of our decisions than we may even be aware of.
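One way to picture how strong and weak connections might both feed a recommendation is to weight each connection's "Like" by the strength of the tie. This is only a sketch under assumed numbers: the `friend_score` function, the tie weights, and the sample likes are hypothetical and do not describe how Facebook or any other service actually works.

```python
def friend_score(item, likes, ties):
    """Sum the tie-strength weights of the connections who liked `item`."""
    return sum(
        weight
        for person, weight in ties.items()
        if item in likes.get(person, set())
    )

# Hypothetical tie strengths: 1.0 for a strong connection, lower for weaker ones
ties = {"sister": 1.0, "coworker": 0.5, "friend_of_friend": 0.2}

# Hypothetical likes, invented for illustration only
likes = {
    "sister": {"book_a"},
    "coworker": {"book_a", "book_b"},
    "friend_of_friend": {"book_b"},
}

score_a = friend_score("book_a", likes, ties)
score_b = friend_score("book_b", likes, ties)
```

Here `book_a` scores higher because a strong connection liked it, yet the weak connections still contribute, which is the point made above about their growing influence.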

The Wisdom of Experts

Since it’s more common to associate wisdom with expertise, Pandora is a great example of a recommendation engine that uses the Wisdom of Experts. Pandora used a team of musicologists (professional musicians and scholars with advanced degrees in music theory) to deconstruct more than 800,000 songs into 450 musical elements that make up each performance, including qualities of melody, harmony, rhythm, form, composition, and lyrics, as part of what Pandora calls the Music Genome Project.

As Pandora explains, their methodology uses precisely defined terminology, a consistent frame of reference, redundant analysis, and ongoing quality control to ensure that data integrity remains reliably high, believing that delivering a great radio experience to each and every listener requires an incredibly broad and deep understanding of music.

Essentially, experts form the smallest crowd of wisdom. Of course, experts are not always right. At the very least, experts are not right about every one of their predictions. Nor do experts always agree with each other, which is why I imagine that one of the most challenging aspects of the Music Genome Project is getting music experts to consistently apply precisely the same methodology.

Pandora also acknowledges that each individual has a unique relationship with music (i.e., no one else has tastes exactly like yours), and allows you to “Thumbs Up” or “Thumbs Down” songs without affecting other users, producing more personalized results than either the popularity predicted by the Wisdom of Crowds or the similarity predicted by the Wisdom of Friends.
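To make the idea of expert-scored attributes plus personal feedback concrete, here is a minimal sketch that compares songs by cosine similarity over a few attribute scores and skips any song the listener has thumbed down. It is an assumption-laden toy, not Pandora's algorithm: the three attributes, the scores, the song names, and the `recommend` function are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical expert scores on three made-up attributes (tempo, acoustic, vocal)
songs = {
    "song_a": [0.9, 0.1, 0.8],
    "song_b": [0.8, 0.2, 0.9],
    "song_c": [0.1, 0.9, 0.2],
}

def recommend(seed, thumbs_down=()):
    """Return the song most similar to `seed`, skipping this listener's thumbs-down songs."""
    candidates = [
        name for name in songs
        if name != seed and name not in thumbs_down
    ]
    return max(candidates, key=lambda name: cosine(songs[seed], songs[name]), default=None)

first_pick = recommend("song_a")
second_pick = recommend("song_a", thumbs_down={"song_b"})
```

Because the thumbs-down set is passed per listener, one person's feedback changes only that person's results, which mirrors the personalization described above.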

The Future of Wisdom

It’s interesting to note that the Wisdom of Experts is the only one of these approaches that relies on what data management and business intelligence professionals would consider a rigorous approach to data quality and decision quality best practices. But this is also why the Wisdom of Experts is the most time-consuming and expensive approach to data-driven decision making.

In the past, the Wisdom of Crowds and Friends was ignored in data-driven decision making for the simple reason that this potential wisdom wasn’t digitized. But now, in the era of big data, not only are crowds and friends digitized, but technological advancements combined with cost-effective options via open source (data and software) and cloud computing make these approaches quicker and cheaper than the Wisdom of Experts. And despite the potential data quality and decision quality issues, the Wisdom of Crowds and/or Friends is proving itself a viable option for more categories of data-driven decision making.

I predict that the future of wisdom will increasingly become an amalgamation of experts, friends, and crowds, with the data and techniques from all three potential sources of wisdom often acknowledged as contributors to data-driven decision making.

Sometimes it’s Okay to be Shallow

Word of Mouth has become Word of Data

The Wisdom of the Social Media Crowd

Data Management: The Next Generation

Exercise Better Data Management

Data-Driven Intuition

Finding a Needle in a Needle Stack

Big Data, Predictive Analytics, and the Ideal Chronicler

The Limitations of Historical Analysis

Magic Elephants, Data Psychics, and Invisible Gorillas

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems


November 23, 2012

The Limitations of Historical Analysis

November 23, 2012/ Jim Harris

This blog post is sponsored by the Enterprise CIO Forum and HP.

“Those who cannot remember the past are condemned to repeat it,” wrote George Santayana in the early 20th century to caution us about not learning the lessons of history. But with the arrival of the era of big data and dawn of the data scientist in the early 21st century, it seems like we no longer have to worry about this problem since not only is big data allowing us to digitize history, data science is also building us sophisticated statistical models from which we can analyze history in order to predict the future.

However, “every model is based on historical assumptions and perceptual biases,” Daniel Rasmus blogged. “Regardless of the sophistication of the science, we often create models that help us see what we want to see, using data selected as a good indicator of such a perception.” Although perceptual bias is a form of the data silence I previously blogged about, even absent such a bias, there are limitations to what we can predict about the future based on our analysis of the past.

“We must remember that all data is historical,” Rasmus continued. “There is no data from or about the future. Future context changes cannot be built into a model because they cannot be anticipated.” Rasmus used the example that no models of retail supply chains in 1962 could have predicted the disruption eventually caused by that year’s debut of a small retailer in Arkansas called Wal-Mart. And no models of retail supply chains in 1995 could have predicted the disruption eventually caused by that year’s debut of an online retailer called Amazon. “Not only must we remember that all data is historical,” Rasmus explained, “but we must also remember that at some point historical data becomes irrelevant when the context changes.”

As I previously blogged, despite what its name implies, predictive analytics can’t predict what’s going to happen with certainty, but it can predict some of the possible things that could happen with a certain probability. Another important distinction is that “there is a difference between being uncertain about the future and the future itself being uncertain,” Duncan Watts explained in his book Everything is Obvious (Once You Know the Answer). “The former is really just a lack of information — something we don’t know — whereas the latter implies that the information is, in principle, unknowable. The former is an orderly universe, where if we just try hard enough, if we’re just smart enough, we can predict the future. The latter is an essentially random world, where the best we can ever hope for is to express our predictions of various outcomes as probabilities.”

“When we look back to the past,” Watts explained, “we do not wish that we had predicted what the search market share for Google would be in 1999. Instead we would end up wishing we’d been able to predict on the day of Google’s IPO that within a few years its stock price would peak above $500, because then we could have invested in it and become rich. If our prediction does not somehow help to bring about larger results, then it is of little interest or value to us. We care about things that matter, yet it is precisely these larger, more significant predictions about the future that pose the greatest difficulties.”

Although we should heed Santayana’s caution and try to learn history’s lessons in order to factor into our predictions about the future what was relevant from the past, as Watts cautioned, there will be many times when “what is relevant can’t be known until later, and this fundamental relevance problem can’t be eliminated simply by having more information or a smarter algorithm.”

Although big data and data science can certainly help enterprises learn from the past in order to predict some probable futures, the future does not always resemble the past. So, remember the past, but also remember the limitations of historical analysis.


Data Silence

WYSIWYG and WYSIATI

Information Overload Revisited

Big Data el Memorioso

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

Finding a Needle in a Needle Stack

Data-Driven Intuition

Magic Elephants, Data Psychics, and Invisible Gorillas

November 13, 2012

Data Silence

November 13, 2012/ Jim Harris

This blog post is sponsored by the Enterprise CIO Forum and HP.

In the era of big data, information optimization is becoming a major topic of discussion. But when some people discuss the big potential of big data analytics under the umbrella term of data science, they make it sound like since we have access to all the data we would ever need, all we have to do is ask the Data Psychic the right question and then listen intently to the answer.

However, in his recent blog post Silence Isn’t Always Golden, Bradley S. Fordham, PhD explained that “listening to what the data does not say is often as important as listening to what it does. There can be various types of silences in data that we must get past to take the right actions.” Fordham described these data silences as various potential gaps in our analysis.

One data silence is syntactic gaps: a proportionately small amount of data in a very large data set that “will not parse (be converted from raw data into meaningful observations with semantics or meaning) in the standard way. A common response is to ignore them under the assumption there are too few to really matter. The problem is that oftentimes these items fail to parse for similar reasons and therefore bear relationships to each other. So, even though it may only be .1% of the overall population, it is a coherent sub-population that could be telling us something if we took the time to fix the syntactic problems.”

This data silence reminded me of my podcast discussion with Thomas C. Redman, PhD about big data and data quality, during which we discussed how some people erroneously assume that data quality issues can be ignored in larger data sets.
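The syntactic gaps Fordham describes can be made visible by recording why each record failed to parse instead of silently dropping it. The sketch below is one possible approach, not Fordham's: the record layout (`YYYY-MM-DD,amount`), the sample records, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

def parse_record(raw):
    """Parse 'YYYY-MM-DD,amount'; raise ValueError with a short reason on failure."""
    fields = raw.split(",")
    if len(fields) != 2:
        raise ValueError("wrong field count")
    date_text, amount_text = fields
    try:
        sold_on = datetime.strptime(date_text, "%Y-%m-%d").date()
    except ValueError:
        raise ValueError("unrecognized date format") from None
    try:
        amount = float(amount_text)
    except ValueError:
        raise ValueError("non-numeric amount") from None
    return sold_on, amount

def parse_all(raw_records):
    """Return the parsed records plus a tally of failure reasons."""
    parsed, failures = [], Counter()
    for raw in raw_records:
        try:
            parsed.append(parse_record(raw))
        except ValueError as error:
            failures[str(error)] += 1  # keep the reason rather than ignoring the record
    return parsed, failures

# Hypothetical raw records, invented for illustration only
raw_records = [
    "2012-11-01,19.99",
    "2012-11-02,5.00",
    "01/11/2012,7.50",
    "02/11/2012,3.25",
    "2012-11-03,n/a",
]

parsed, failures = parse_all(raw_records)
```

In this toy data, two of the three failures share a cause (a different date format), which hints at a coherent sub-population, perhaps records from a different source system, rather than random noise.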

Another data silence is inferential gaps, which arise when an inference is based on only one variable in a data set. The example Fordham uses is from a data set showing that 41% of the cars sold during the first quarter of the year were blue, from which we might be tempted to infer that customers bought more blue cars because they preferred blue. However, by looking at additional variables in the data set and noticing that “70% of the blue cars sold were from the previous model year, it is likely they were discounted to clear them off the lots, thereby inflating the proportion of blue cars sold. So, maybe blue wasn’t so popular after all.”
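Working through the two percentages in Fordham's example shows how much of the headline figure the second variable explains. The calculation below uses only the 41% and 70% figures quoted above; the variable names are my own.

```python
blue_share = 0.41              # share of all cars sold that were blue (from the example)
prior_year_within_blue = 0.70  # share of blue cars from the previous model year (from the example)

# Blue cars from the previous model year, as a share of all cars sold
prior_year_blue = blue_share * prior_year_within_blue

# Blue cars from the current model year, as a share of all cars sold
current_year_blue = blue_share - prior_year_blue

prior_year_blue_pct = round(prior_year_blue * 100, 1)
current_year_blue_pct = round(current_year_blue * 100, 1)
```

So roughly 28.7% of all cars sold were previous-model-year blue cars, and only about 12.3% were current-model-year blue cars. The example does not say how total sales split across model years, so these figures alone cannot tell us how popular blue was among current-model-year buyers.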

Another data silence Fordham described using the same data set is gaps in field of view. “At first glance, knowing everything on the window sticker of every car sold in the first quarter seems to provide a great set of data to understand what customers wanted and therefore were buying. At least it did until we got a sinking feeling in our stomachs because we realized that this data only considers what the auto manufacturer actually built. That field of view is too limited to answer the important customer desire and motivation questions being asked. We need to break the silence around all the things customers wanted that were not built.”

This data silence reminded me of WYSIATI, which is an acronym coined by Daniel Kahneman to describe how the data you are looking at can greatly influence you to jump to the comforting, but false, conclusion that “what you see is all there is,” thereby preventing you from expanding your field of view to notice what data might be missing from your analysis.

As Fordham concluded, “we need to be careful to listen to all the relevant data, especially the data that is silent within our current analyses. Applying that discipline will help avoid many costly mistakes that companies make by taking the wrong actions from data even with the best of techniques and intentions.”

Therefore, in order for your enterprise to leverage big data analytics for business success, you not only need to adopt a mindset that embraces the principles of data science, you also need to make sure that your ears are set to listen for data silence.


WYSIWYG and WYSIATI

Information Overload Revisited

Big Data el Memorioso

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

Finding a Needle in a Needle Stack

Data-Driven Intuition