Jim Harris

My name is Jim Harris.  I am the Blogger-in-Chief of OCDQ Blog and an independent consultant, speaker, and freelance writer for hire.

Wednesday, August 12, 2009

The General Theory of Data Quality

In On the Electrodynamics of Moving Bodies, one of his famous 1905 Annus Mirabilis papers, Albert Einstein published what would later become known as his Special Theory of Relativity.

This theory introduced the concept that space and time are interrelated entities forming a single continuum and that the passage of time is a variable that can differ for each observer.

One of the many brilliant insights of special relativity was that it could explain why different observers can make validly different observations – it was a scientifically justifiable matter of perspective. 

As Einstein's Padawan Obi-Wan Kenobi would later explain in his remarkable 1983 “paper” on The Return of the Jedi:

“You're going to find that many of the truths we cling to depend greatly on our own point of view.”

Although the Special Theory of Relativity could explain the different perspectives of different observers, it could not explain the shared perspective of all observers.  Special relativity ignored a foundational force in classical physics – gravity.  So in 1916, Einstein used the force to incorporate a new perspective on gravity into what he called his General Theory of Relativity.

 

The Data-Information Continuum

In my popular post The Data-Information Continuum, I explained that data and information are also interrelated entities forming a single continuum.  I used the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers).

I explained that although a common definition for data quality is fitness for the purpose of use, the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I defined information as data in use or data in action.

I went on to explain that data's quality must be objectively measured separately from its many uses and that information's quality can only be subjectively measured according to its specific use.
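The distinction can be made concrete with a minimal sketch.  Everything in it is an assumption made for illustration: the sample records, the choice of completeness as the objective measure, and the per-use thresholds are all hypothetical and are not taken from the original post.  The point is only that one objective measurement of the data can lead to different fitness verdicts for different uses.

```python
# Hypothetical customer records, invented for this illustration.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "postal_code": "02134"},
    {"name": "Alan Turing", "email": None, "postal_code": "02139"},
    {"name": "Grace Hopper", "email": "grace@example.com", "postal_code": None},
    {"name": "Edgar Codd", "email": "edgar@example.com", "postal_code": "10001"},
]


def completeness(rows, field):
    """Data quality: share of rows where the field is populated.

    This is measured the same way no matter how the data is used.
    """
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(field))
    return populated / len(rows)


# Information quality: each use sets its own (assumed) fitness requirements
# as a minimum completeness per field.
uses = {
    "email_campaign": {"email": 0.9},
    "postal_mailing": {"postal_code": 0.7},
}


def fit_for_use(rows, requirements):
    """Information quality: does the data meet this specific use's thresholds?"""
    return all(
        completeness(rows, field) >= minimum
        for field, minimum in requirements.items()
    )


for use, requirements in uses.items():
    print(use, fit_for_use(records, requirements))
```

Both fields are 75% complete, yet under these assumed thresholds the same data is unfit for the email campaign and fit for the postal mailing.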

 

The Special Theory of Data Quality

The majority of data quality initiatives are reactive projects launched in the aftermath of an event when poor data quality negatively impacted decision-critical information. 

Many of these projects end in failure.  Some fail because of lofty expectations or unmanaged scope creep.  Most fail because they are based on the flawed perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.

Whenever an organization approaches data quality as a one-time project and not as a sustained program, they are accepting what I refer to as the Special Theory of Data Quality.

However, similar to the accuracy of special relativity for solving a narrowly defined problem, sometimes applications of the Special Theory of Data Quality can yield successful results – from a certain point of view. 

Tactical initiatives often have a necessarily narrow focus.  Reactive data quality projects are sometimes driven by business triage: the most critical data problems require near-term attention and simply can't wait for a proactive strategic initiative to take effect (i.e. one that might have prevented the problems from happening).

One of the worst things that can happen to an organization is a successful data quality project – because it is almost always an implementation of information quality customized to the needs of the tactical initiative that provided its funding. 

Ultimately, this misperceived success simply delays an actual failure when one of the following happens:

  1. The project ends and the team returns to their previous activities, only to be forced into triage once again when the next inevitable crisis occurs and poor data quality negatively impacts decision-critical information.
  2. A new project (or a later phase of the same project) attempts to enforce the information quality standards throughout the organization as if they were enterprise data quality standards.

 

The General Theory of Data Quality

True data quality standards are enterprise-wide standards providing an objective data foundation.  True information quality standards must always be customized to meet the subjective needs of a specific business process and/or initiative.

Both aspects of this shared perspective of quality must be incorporated into a single sustained program that enforces a consistent enterprise understanding of data, but that also provides the information necessary to support day-to-day operations.

Whenever an organization approaches data quality as a sustained program and not as a one-time project, they are accepting what I refer to as the General Theory of Data Quality.

Data governance provides the framework for crossing the special-to-general theoretical threshold necessary to evolve data quality from a project to a sustained program.  However, in this post, I want to remain focused on which theory an organization accepts, because if you don't accept the General Theory of Data Quality, you likely also don't accept the crucial role that data governance plays in a data quality initiative.  And in all fairness, data governance obviously involves much more than just data quality.

 

Theory vs. Practice

Even though I am an advocate for the General Theory of Data Quality, I also realize that no one works at a company called Perfect, Incorporated.  I would be lying if I said that I had not worked on more projects than programs, had not implemented more reactive data cleansing than proactive defect prevention, or had never championed a “single version of the truth.”

Therefore, my career has more often exemplified the Special Theory of Data Quality.  Or perhaps my career has exemplified what could be referred to as the General Practice of Data Quality?

What theory of data quality does your organization accept?  Which one do you personally accept? 

More importantly, what does your organization actually practice when it comes to data quality?

 

Related Posts

The Data-Information Continuum

Hyperactive Data Quality (Second Edition)

Hyperactive Data Quality (First Edition)

Data Governance and Data Quality

Schrödinger's Data Quality


Reader Comments (8)

From the LinkedIn Group for The Greater IBM Connection, Gurdon Blackwell commented:

I would argue that any focus on data (anything that is stored, like cutting slashes in trees) is misplaced. Of course, one's interpretation of data, information and meaning must first be explained, not as independent variables but as a continuum of process relationships ending in one's mind (assuming one accepts that concept).

What we say is "data" in database terms for example are referents to "things" that are missing a most important unit of data--a verb. What we are asked to do is store sentences in which both the time and actions causing their creation are removed and then through the magic of business intelligence (packaged arithmetic) or some other reverse engineering tactic, add back the time and verbs. This is storytelling as it was done in Homeric times.

I would look to science for more than analogies to help understand the so-called "data quality" issue. The answer to data quality is found in a most unusual place--our language--and a most provocative invention by Alan Turing: the "stored program computer" and the meaning of the term "computation."

August 15, 2009 | Registered Commenter Jim Harris

My theory of the data, information, knowledge continuum is more closely related to the element, compound, protein, structure arc.

In my world, there is no such thing as 'bad' data, just as there are no 'bad' elements. Data is either useful or not: the larger the audience that agrees that a string is representative of something they can use, the more that string will be of value to me.

By dint of its existence in the world of human communication and in keeping with my theory, I can assign every piece of data to one of a fixed number of classes, each with characteristics of their own, just like elements in the periodic table. And, just like the periodic table, those characteristics do not change. The same 109 usable elements in the periodic table are found and are consistent throughout the universe, and our ability to understand that universe is based on that stability.

Information is simply data in a given context, like a molecule of carbon in flour. The carbon retains all of its characteristics but the combination with other elements allows it to partake in a whole class of organic behavior. This is similar to the word 'practical' occurring in a sentence: Jim is a practical person or the letter 'p' in the last two words.

Where the analogue bends a bit is a cause of a lot of information management pain, but can be rectified with a slight change in perspective. Computers (and almost all indexes) have a hard time with homographs: strings that are identical but that mean different things. By creating fixed and persistent categories of data, my model suffers no such pain.

Take the word 'flies' in the following: 'Time flies like an arrow.' and 'Fruit flies like a pear.' The data 'flies' can be permanently assigned to two different places, and their use determines which instance is relevant in the context of the sentence. One instance is a verb, the other a plural noun.

Knowledge, in my opinion, is the ability to recognize, predict and synthesize patterns of information for past, present and future use, and more importantly to effectively communicate those patterns in one or more contexts to one or more audiences.

On one level, the model for information management that I use makes no apparent distinction between the data: we all use nouns, adjectives, verbs and sometimes scalar objects to communicate. We may compress those into extremely compact concepts but they can all be unraveled to get at elemental components. At another level every distinction is made to ensure precision.

The difference between information and knowledge is experiential and since experience is an accumulative construct, knowledge can be layered to appeal to common knowledge, special knowledge and unique knowledge.

Common being the most easily taught and widely applied; Special being related to one or more disciplines and/or special functions; and, Unique to individuals who have their own elevated understanding of the world and so have a need for compact and purpose-built semantic structures.

Going back to the analogue, knowledge is equivalent to the creation by certain proteins of cartilage, the use to which that cartilage is put throughout a body, and the specific shape of the cartilage that forms my nose as unique from the one on my wife's face.

To me, the most important part of the model is at the element level. If I can convince a group of people to use a fixed set of elemental categories and to reference those categories when they create information, it's amazing how much tension disappears in the design, creation and deployment of knowledge.

August 15, 2009 | Unregistered Commenter John O'Gorman

Jim,

I strongly believe that there is a niche for each of the approaches, application or project driven and more general approach aligned with the "Information as an Asset" mentality.

A few days ago I posted an article that you may find worth reading:

Quantifying Data Quality with Information Theory

Cheers,

Larry

August 15, 2009 | Unregistered Commenter Lawrence Dubov

Jim,

I subscribe to your "General Theory of Data Quality." I have stated on several occasions that data quality must be set as a corporate objective with everybody being made responsible for it from the CEO to the tea boy.

There is, however, a slight fear that is creeping into my thoughts when I read all the comments about data quality. It is starting to be viewed as a science where people put forward theory and conjecture and others set out to create proofs.

Data Quality and Data Governance are more akin to art than science.

With the strides we make in technology there will always be different ways to capture and store data thus giving rise to many different ways that data quality and data governance can be approached. There is no one absolute or proof to be found just different expressions of data quality perceptions.

There are companies without any data quality or governance frameworks that are still extremely successful and making sound strategic decisions. Is this luck, gut instinct or data quality being actioned on the fly? Whatever it is, there appears to be no logical framework in play, which in my mind does not conform to any science that I know.

Regards...

Mike

August 16, 2009 | Unregistered Commenter Mike Pratt

Jim,

Excellent article, well done. I completely agree that Data Quality is a journey rather than a destination, and once-off data quality projects are of little use.

Like you, I have worked on more projects than programmes. Many of the projects have been what I refer to as "End of the food chain" projects. An "End of food chain" project is one that depends on existing data within the Enterprise, but has no control over the capture, or quality of that data. These projects are typically 'tactical', often in response to a regulatory requirement (e.g. requirement to implement Transaction Monitoring to combat Anti-Money Laundering).

While working on these projects, I raised many "Enterprise Wide Data Issues." Tactical projects are not funded to address "Enterprise Wide Data Issues", and on many occasions I have had to develop tactical workarounds - such is the nature of our profession.

I am currently writing a series of blog posts in which I am sharing the "Enterprise Wide Data Issues" I have encountered, together with a process for assessing the status of these issues within an Enterprise.

Please see the fifth post in the series: There is little understanding of what Data Quality means.

Regards...

Ken

August 17, 2009 | Unregistered Commenter Ken O'Connor

Great article.

The problem with a program approach I have seen is the desire to boil the ocean, i.e. take on all dq problems in one go.

Start-up projects that prove some value to retain business interest and funding are vital, but within a program governance / communications structure / overall plan.

Our organisation has definitely suffered from 'Tactical initiatives' with a 'necessarily narrow focus' - most of our dq projects have been externally imposed (regulatory compliance) or occasionally internally around a specific business problem, and are indeed short-lived and solve only specific issues short-term.

It was also interesting to talk to a site here who started with a greenfields approach and implemented great business processes / supporting IT systems and had good, rich information about customers, products etc. Several years and lots of M&A later, not such a rosy picture - always challenges, even with the best thinking at the start!

August 21, 2009 | Unregistered Commenter glenn mead

From the LinkedIn Group for Master Data Management, Marian Sherrin commented:

"We are just starting to adopt Data Quality Principles. Top on the list are single system of reference, gold standard source data, and looking at data quality attributes as outlined in Larry English's TIQM book."

August 21, 2009 | Registered Commenter Jim Harris

From the LinkedIn Group for Master Data Management, John Ferraioli commented:

"It's important to identify and apply enterprise data standards as a first step. The Data Quality will come downstream after you apply standards, and data governance. Otherwise, you'll clean the data and then it will degrade in short order. Then you'll be in a nasty cycle of 'rinse, lather and repeat' for your data quality and integrity. Another important step is to apply data quality metrics so you can measure and report against your progress."

August 22, 2009 | Registered Commenter Jim Harris
