Poor Data Quality is a Virus
Jim Harris in Books, Data Quality
Tagged: Best of 2009, Danette McGilvray, Data Governance, David Loshin, Jill Dyché, Philosophy, Thomas Redman
Thursday, October 1, 2009 at 12:03AM
“A storm is brewing—a perfect storm of viral data, disinformation, and misinformation.”
These cautionary words (written by Timothy G. Davis, an Executive Director within the IBM Software Group) are from the foreword of the remarkable new book Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman.
“Viral data,” explains Fishman, “is a metaphor used to indicate that business-oriented data can exhibit qualities of a specific type of human pathogen: the virus. Like a virus, data by itself is inert. Data requires software (or people) for the data to appear alive (or actionable) and cause a positive, neutral, or negative effect.”
“Viral data is a perfect storm,” because as Fishman explains, it is “a perfect opportunity to miscommunicate with ubiquity and simultaneity—a service-oriented pandemic reaching all corners of the enterprise.”
“The antonym of viral data is trusted information.”
Data Quality
“Quality is a subjective term,” explains Fishman, “for which each person has his or her own definition.” Fishman goes on to quote from many of the published definitions of data quality, including a few of my personal favorites:
- David Loshin: “Fitness for use—the level of data quality determined by data consumers in terms of meeting or beating expectations.”
- Danette McGilvray: “The degree to which information and data can be a trusted source for any and/or all required uses. It is having the right set of correct information, at the right time, in the right place, for the right people to use to make decisions, to run the business, to serve customers, and to achieve company goals.”
- Thomas Redman: “Data are of high quality if those who use them say so. Usually, high-quality data must be both free of defects and possess features that customers desire.”
Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives. Starting from this foundation, information quality standards are customized to meet the subjective needs of each business unit and initiative. This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.
However, the enterprise-wide data quality standards must be understood as dynamic. Therefore, enforcing strict conformance to data quality standards can be self-defeating. On this point, Fishman quotes Joseph Juran: “conformance by its nature relates to static standards and specification, whereas quality is a moving target.”
Defining data quality is both an essential and challenging exercise for every enterprise. “While a succinct and holistic single-sentence definition of data quality may be difficult to craft,” explains Fishman, “an axiom that appears to be generally forgotten when establishing a definition is that in business, data is about things that transpire during the course of conducting business. Business data is data about the business, and any data about the business is metadata. First and foremost, the definition as to the quality of data must reflect the real-world object, concept, or event to which the data is supposed to be directly associated.”
Data Governance
“Data governance can be used as an overloaded term,” explains Fishman, and he quotes Jill Dyché and Evan Levy to explain that “many people confuse data quality, data governance, and master data management.”
“The function of data governance,” explains Fishman, “should be distinct and distinguishable from normal work activities.”
For example, although knowledge workers and subject matter experts are necessary to define the business rules for preventing viral data, according to Fishman, these are data quality tasks and not acts of data governance.
However, these data quality tasks must “subsequently be governed to make sure that all the requisite outcomes comply with the appropriate controls.”
Therefore, according to Fishman, “data governance is a function that can act as an oversight mechanism and can be used to enforce controls over data quality and master data management, but also over data privacy, data security, identity management, risk management, or be accepted in the interpretation and adoption of regulatory requirements.”
Conclusion
“There is a line between trustworthy information and viral data,” explains Fishman, “and that line is very fine.”
Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.
Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions. As the pathogen replicates, more and more decision-critical enterprise information will be compromised.
According to Fishman, enterprise data quality requires a multidisciplinary effort and a lifetime commitment to:
“Prevent viral data and preserve trusted information.”
Books Referenced in this Post
Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman
Enterprise Knowledge Management: The Data Quality Approach by David Loshin
Data Quality: The Field Guide by Thomas Redman
Customer Data Integration: Reaching a Single Version of the Truth by Jill Dyché and Evan Levy
Related Posts
DQ-Tip: “Don't pass bad data on to the next person...”
The Only Thing Necessary for Poor Data Quality
Hyperactive Data Quality (Second Edition)
The General Theory of Data Quality
Data Governance and Data Quality



Reader Comments (6)
On Twitter, Rob Paller commented:
"Viral YouTube videos are celebrated, but viral data should be vaccinated with trusted information."
On Twitter, Henrik Liliendahl Sørensen commented:
"Reminds me of an ebizQ blog post by David Linthicum: Lack of Focus on Data Killing SOA"
Quality is not subjective, nor are standards of data quality dynamic. Ask any programmer if the rules governing whether his applications compile are subjective or not. If he wants to effectively communicate within the context of his agreed upon contracts with the operating system / programming language / customer requirements he is not free to unilaterally change the rules. On the flip side, a good programmer given the same constraints can create an almost infinite range of very sophisticated - dare I say cool - new species of applications, precisely because those constraints are fixed and objective. How the heck does that happen? It happens the same way in any number of disciplines: Chemistry, Algebra, Genetics and Physics to name a few, but for some reason information management has not or will not make the same connections.
Good quality data improves the chances of enterprise survival. Poor quality data decreases them. The rules for determining whether the quality of data is high or low are extremely simple, even while the combinations and permutations through which that data is transformed seem impossibly complex.
Data is a medium of exchange: my business acquires data, transforms it and communicates it to one or more 'outside' entities. At each stage of this process, my business needs to establish a contract first with the incoming data (what context, who is the provider, when was it current, etc); the transformation process (how does my business add value to the data); and to whom and in what context I will transmit it. Going back to my programming example, the contract for data exchange with my programming language uses exactly the same contracts and constraints: I declare my vocabulary and what those vocabulary values represent; I pass information through methods to the processor that are 100% consistent with those rules; and I transmit those results to some other operation for interpretation and further action.
The problem with viral data is that the same rules that apply to machine processing do not apply to the commodity that processing is designed to manage. In other words, the packets are ok, but the contents stink. The DNA carried by a virus screws up the host by consuming available resources. What needs to happen before a balance in the information management ecosystem can be achieved is for information management as a discipline to adhere to a standard that identifies the suitability of the DNA coming in without limiting the uses to which that DNA can be put without modification of its elemental nature. The good news is those rules already exist.
From the LinkedIn Group for Master Data Management, Mariusz Binczycki commented:
“...everywhere somebody says data quality is the key...but the question is to what?
Today control is the key not quality!
You may have rubbish in your database even when it follows Sarbanes-Oxley (SOX). This has to stop!”
From the LinkedIn Group for Master Data Management, Amit Yadav commented:
“Many times even what ‘quality’ or ‘correct’ data means is not clear.
Thus there is no single version of truth to ‘Data Quality.’
Garbage In Garbage Out (GIGO) is pretty evident in many cases, but there are expectations that the transient processes will clean up the Garbage that comes in. This is possible only if the controls are strict.”
From the LinkedIn Group for Master Data Management, John O'Gorman commented:
“A virus in the human body consumes resources indiscriminately, sometimes defeating the host entirely. The body corporate must respond to all 'foreign' data (DNA messages) in the same way or risk destruction. In an enterprise, dirty data (i.e. information that has the wrong DNA) elicits a much lower-key response, because the enterprise has no way of making the distinction between quality DNA and garbage.
The prevailing philosophy seems to be: 'If it came from a computer it must be OK.'
The problem, as I see it, is too much reliance on the quality of the packets and not enough on the quality of the contents. To illustrate, an application can satisfy all of the constraints of the language in which it was written (the packets are ok) while delivering exactly the wrong message.
Until people start treating the data as a value commodity independent of the systems that move it around we are, as Einstein said, doomed to repeat the same insanity over and over.”
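The "packets versus contents" distinction raised in the comments can be made concrete with a small sketch. This is only an illustration under stated assumptions: the `CustomerRecord` layout, the function names, and the two business rules are hypothetical and do not come from Fishman's book or from any commenter. It shows a record that satisfies every type constraint (the packet is fine) while carrying values that cannot describe a real-world customer (the contents are not).

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CustomerRecord:
    # Hypothetical record layout, invented for this illustration.
    customer_id: int
    birth_date: date
    country_code: str


def structurally_valid(record: CustomerRecord) -> bool:
    """The "packet" check: every field has the declared type."""
    return (
        isinstance(record.customer_id, int)
        and isinstance(record.birth_date, date)
        and isinstance(record.country_code, str)
    )


def content_problems(record: CustomerRecord, today: date) -> list[str]:
    """The "contents" check: illustrative business rules only."""
    problems = []
    if record.birth_date > today:
        problems.append("birth_date is in the future")
    if len(record.country_code) != 2 or not record.country_code.isalpha():
        problems.append("country_code is not a two-letter code")
    return problems


# A record that passes the structural check but fails both content rules.
record = CustomerRecord(customer_id=42, birth_date=date(2109, 10, 1), country_code="XX1")
print(structurally_valid(record))
print(content_problems(record, today=date(2009, 10, 1)))
```

In this sketch the structural check plays the role of the compiler or transport layer, while the content rules stand in for the kind of business rules that subject matter experts would define. A real set of rules would be far larger and would change as the business changes.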