The Limitations of Historical Analysis

This blog post is sponsored by the Enterprise CIO Forum and HP.

“Those who cannot remember the past are condemned to repeat it,” wrote George Santayana in the early 20th century to caution us about not learning the lessons of history. But with the arrival of the era of big data and dawn of the data scientist in the early 21st century, it seems like we no longer have to worry about this problem since not only is big data allowing us to digitize history, data science is also building us sophisticated statistical models from which we can analyze history in order to predict the future.

However, “every model is based on historical assumptions and perceptual biases,” Daniel Rasmus blogged. “Regardless of the sophistication of the science, we often create models that help us see what we want to see, using data selected as a good indicator of such a perception.” Although perceptual bias is a form of the data silence I previously blogged about, even absent such a bias, there are limitations to what we can predict about the future based on our analysis of the past.

“We must remember that all data is historical,” Rasmus continued. “There is no data from or about the future. Future context changes cannot be built into a model because they cannot be anticipated.” Rasmus used the example that no models of retail supply chains in 1962 could have predicted the disruption eventually caused by that year’s debut of a small retailer in Arkansas called Wal-Mart. And no models of retail supply chains in 1995 could have predicted the disruption eventually caused by that year’s debut of an online retailer called Amazon. “Not only must we remember that all data is historical,” Rasmus explained, “but we must also remember that at some point historical data becomes irrelevant when the context changes.”

As I previously blogged, despite what its name implies, predictive analytics can’t predict what’s going to happen with certainty, but it can predict some of the possible things that could happen with a certain probability. Another important distinction is that “there is a difference between being uncertain about the future and the future itself being uncertain,” Duncan Watts explained in his book Everything is Obvious (Once You Know the Answer). “The former is really just a lack of information — something we don’t know — whereas the latter implies that the information is, in principle, unknowable. The former is an orderly universe, where if we just try hard enough, if we’re just smart enough, we can predict the future. The latter is an essentially random world, where the best we can ever hope for is to express our predictions of various outcomes as probabilities.”

“When we look back to the past,” Watts explained, “we do not wish that we had predicted what the search market share for Google would be in 1999. Instead we would end up wishing we’d been able to predict on the day of Google’s IPO that within a few years its stock price would peak above $500, because then we could have invested in it and become rich. If our prediction does not somehow help to bring about larger results, then it is of little interest or value to us. We care about things that matter, yet it is precisely these larger, more significant predictions about the future that pose the greatest difficulties.”

Although we should heed Santayana’s caution and try to learn history’s lessons in order to factor into our predictions about the future what was relevant from the past, as Watts cautioned, there will be many times when “what is relevant can’t be known until later, and this fundamental relevance problem can’t be eliminated simply by having more information or a smarter algorithm.”

Although big data and data science can certainly help enterprises learn from the past in order to predict some probable futures, the future does not always resemble the past. So, remember the past, but also remember the limitations of historical analysis.

This blog post is sponsored by the Enterprise CIO Forum and HP.

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

WYSIWYG and WYSIATI

Will Big Data be Blinded by Data Science?

Big Data el Memorioso

Information Overload Revisited

HoardaBytes and the Big Data Lebowski

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems