Getting Messy with Big Data

William McKnight recently blogged about the need to throw out the old adage of garbage in, garbage out (GIGO), especially on analytics projects where data is sourced from operational systems. “Operational systems have a different sense of quality than an analytical system,” McKnight explained. “Often operational systems are built to take in data record by record and do not see the lack of attention to complete, accurate data. For an analytical database that is looking to support business decisions with high quality data and will look at trends in the data, this may not be acceptable.”

While GIGO is often invoked to declare doom if the operational data has quality issues from the perspective of analytic needs, McKnight explained “the source environment should not be required to acquiesce to the data standard of the analytic environment. Furthermore, it is very likely that significant operational system enhancement is not going to happen on your project’s timetable. Ultimately, the success of your analytic project will depend on the data being up to a standard. There are tools at your disposal to make sure you can do this. If the source data, from your perspective, is garbage, it will be your project’s demise to cite, especially well into the project, the GIGO mantra as if you have no options.”

The Sheer Messiness of Reality

“Life is stubbornly qualitative on every level,” James Kobielus blogged. “But we wouldn’t be modern and scientific if we didn’t try to constantly reduce it to numbers that we can calculate, manipulate, and extrapolate. Even when we’re trying to parse the mess into particular entities and interactions that we can analyze scientifically, the sheer messiness of reality often endures. Discovering meaningful patterns in a messy problem domain is what the best data scientists do exceptionally well.”

In Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier explained that “moving into a world of big data will require us to change our thinking about the merits of exactitude. The obsession with exactness is an artifact of the information-deprived analog era. When data was sparse, every data point was critical, and thus great care was taken to avoid letting any point bias the analysis. However, in many new situations that are cropping up today, allowing for imprecision—for messiness—may be a positive feature, not a shortcoming.”

“Of course, the data can’t be completely incorrect,” Mayer-Schönberger and Cukier noted, “but we’re willing to sacrifice a bit of accuracy in return for knowing the general trend. Big data transforms figures into something more probabilistic than precise. Big data, with its emphasis on comprehensive datasets and messiness, helps us get closer to reality.”
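The trade-off Mayer-Schönberger and Cukier describe — sacrificing a bit of accuracy in return for knowing the general trend — can be illustrated with a toy simulation. The numbers below (the slope, sample sizes, and noise levels) are arbitrary assumptions chosen for illustration: even when each individual measurement is far noisier, a comprehensive dataset still recovers the underlying trend.

```python
import random

random.seed(42)

TRUE_SLOPE = 2.0  # the underlying trend we want to recover

def sample(n, noise_sd):
    """Generate n (x, y) points along the true trend with Gaussian noise."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_SLOPE * x + random.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fitted_slope(xs, ys):
    """Ordinary least-squares slope through the origin (kept simple)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# A small, carefully measured dataset vs. a huge, messy one:
# each messy point is five times noisier, but there are a thousand
# times more of them.
clean_small = fitted_slope(*sample(n=20, noise_sd=1.0))
messy_big = fitted_slope(*sample(n=20_000, noise_sd=5.0))

print(f"small clean sample: slope estimate {clean_small:.2f}")
print(f"large messy sample: slope estimate {messy_big:.2f}")
```

Both estimates land near the true slope of 2.0, which is the point: at scale, the noise in individual data points washes out, and the general trend survives the messiness.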

From Messy to Multilingual

Google, one of the muddy poster children of the age of big data, jumped into un-muddying the waters of language translation. They did this, not by hiring a thousand language experts to accurately translate pages of text from one language into another, but by feeding their algorithms a trillion words of messy text from billions of pages of translations of widely varying quality (e.g., incomplete sentences, spelling errors, and grammatical errors) from the global internet, including corporate websites in multiple languages and reports from intergovernmental bodies like the United Nations and the European Union.

“Despite the messiness of the input,” Mayer-Schönberger and Cukier explained, the translations provided by Google Translate “are more accurate than other systems (though still highly imperfect). And it is far, far richer. By mid-2012 its dataset covered more than 60 languages. It could even accept voice input in 14 languages for fluid translations. And because it treats language as messy data with which to judge probabilities, it can even translate between languages, such as Hindi and Catalan, in which there are very few direct translations to develop the system. The reason Google’s translation system works well is not that it has a smarter algorithm. It works well because it’s fed more data—and not just of high quality.”

It’s Not a Small World After All

Big data shows us it’s not a small world after all. It’s a messy world, which is why much of the data describing it is messy.

Dismissing the messiness of big data by invoking GIGO is like burying your head in a compost pile. Composting, however, is an example of how you can get something other than garbage out of a process that takes garbage in. Composting messy piles of big data yields more fruitful insights than cherry-picking tidy files of tiny data. Businesses of all sizes need to get messy with big data.


This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet. I’ve been compensated to contribute to this program, but the opinions expressed in this post are my own and don’t necessarily represent IBM’s positions, strategies, or opinions.
