In the era of big data, information optimization is becoming a major topic of discussion. But when some people discuss the potential of big data analytics under the umbrella term of data science, they make it sound as if, because we have access to all the data we would ever need, all we have to do is ask the Data Psychic the right question and then listen intently to the answer.
However, in his recent blog post Silence Isn’t Always Golden, Bradley S. Fordham, PhD explained that “listening to what the data does not say is often as important as listening to what it does. There can be various types of silences in data that we must get past to take the right actions.” Fordham described these data silences as various potential gaps in our analysis.
One data silence is the syntactic gap: a proportionately small amount of data in a very large data set that “will not parse (be converted from raw data into meaningful observations with semantics or meaning) in the standard way. A common response is to ignore them under the assumption there are too few to really matter. The problem is that oftentimes these items fail to parse for similar reasons and therefore bear relationships to each other. So, even though it may only be .1% of the overall population, it is a coherent sub-population that could be telling us something if we took the time to fix the syntactic problems.”
This data silence reminded me of my podcast discussion with Thomas C. Redman, PhD about big data and data quality, during which we discussed how some people erroneously assume that data quality issues can be ignored in larger data sets.
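To make the syntactic gap concrete, here is a minimal Python sketch of one way to stop ignoring records that fail to parse: instead of discarding them, count them by the reason they failed. The record layout (`date,amount`), the sample values, and the names `classify_failure` and `failure_counts` are all hypothetical, invented for illustration; they do not come from Fordham's post.

```python
from collections import Counter
from datetime import datetime

def classify_failure(raw):
    """Return a short reason the record fails to parse, or None if it parses.

    Assumes a hypothetical 'YYYY-MM-DD,amount' record layout.
    """
    parts = raw.split(",")
    if len(parts) != 2:
        return "wrong_field_count"
    date_text, amount_text = parts
    try:
        datetime.strptime(date_text, "%Y-%m-%d")
    except ValueError:
        return "bad_date"
    try:
        float(amount_text)
    except ValueError:
        return "bad_amount"
    return None

# Invented sample data: two records use a different date format,
# one uses a placeholder instead of a number.
raw_records = [
    "2024-01-05,19.99",
    "2024-01-06,5.00",
    "05/01/2024,12.50",
    "06/01/2024,7.25",
    "2024-01-07,N/A",
]

# Tally only the records that failed, grouped by failure reason.
failure_counts = Counter(
    reason
    for reason in (classify_failure(raw) for raw in raw_records)
    if reason is not None
)

for reason, count in failure_counts.most_common():
    print(f"{reason}: {count}")
```

If one reason dominates the tally (here, the records sharing an alternative date format), that is a hint that the failures form a coherent sub-population, perhaps all from one source system, rather than random noise.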
Another data silence is the inferential gap, which arises from basing an inference on only one variable in a data set. The example Fordham uses is a data set showing that 41% of the cars sold during the first quarter of the year were blue, from which we might be tempted to infer that customers bought more blue cars because they preferred blue. However, by looking at additional variables in the data set and noticing that “70% of the blue cars sold were from the previous model year, it is likely they were discounted to clear them off the lots, thereby inflating the proportion of blue cars sold. So, maybe blue wasn’t so popular after all.”
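The blue-car example can be sketched in a few lines of Python. Only two figures come from Fordham's example: 41% of cars sold were blue, and 70% of those blue cars were from the previous model year. The total of 1,000 cars and the split of non-blue cars between model years are assumptions made up purely so the arithmetic can run; with different assumptions the per-year shares would differ.

```python
# Hypothetical counts of cars sold, keyed by (color, model_year).
# Blue counts follow the quoted figures (410 of 1,000 blue; 287 of 410
# from the previous model year). The non-blue split is an invented assumption.
sales = {
    ("blue", "previous"): 287,
    ("blue", "current"): 123,
    ("other", "previous"): 113,
    ("other", "current"): 477,
}

def blue_share(sales, model_year=None):
    """Fraction of cars sold that were blue, optionally within one model year."""
    rows = {
        key: count
        for key, count in sales.items()
        if model_year is None or key[1] == model_year
    }
    total = sum(rows.values())
    blue = sum(count for (color, _), count in rows.items() if color == "blue")
    return blue / total

overall = blue_share(sales)                    # one variable: color only
previous_year = blue_share(sales, "previous")  # color within previous model year
current_year = blue_share(sales, "current")    # color within current model year

print(f"overall blue share:       {overall:.1%}")
print(f"previous model year only: {previous_year:.1%}")
print(f"current model year only:  {current_year:.1%}")
```

Under these assumed numbers, the overall blue share looks high only because blue dominates the previous-model-year cars; among current-model-year cars the share is much lower. Splitting by a second variable is what surfaces the difference.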
Another data silence Fordham described using the same data set is gaps in field of view. “At first glance, knowing everything on the window sticker of every car sold in the first quarter seems to provide a great set of data to understand what customers wanted and therefore were buying. At least it did until we got a sinking feeling in our stomachs because we realized that this data only considers what the auto manufacturer actually built. That field of view is too limited to answer the important customer desire and motivation questions being asked. We need to break the silence around all the things customers wanted that were not built.”
This data silence reminded me of WYSIATI, which is an acronym coined by Daniel Kahneman to describe how the data you are looking at can greatly influence you to jump to the comforting, but false, conclusion that “what you see is all there is,” thereby preventing you from expanding your field of view to notice what data might be missing from your analysis.
As Fordham concluded, “we need to be careful to listen to all the relevant data, especially the data that is silent within our current analyses. Applying that discipline will help avoid many costly mistakes that companies make by taking the wrong actions from data even with the best of techniques and intentions.”
Therefore, for your enterprise to leverage big data analytics for business success, you need not only to adopt a mindset that embraces the principles of data science but also to make sure that your ears are set to listen for data silence.