Jim Harris

My name is Jim Harris. I am the Blogger-in-Chief of OCDQ Blog and an independent consultant, speaker, and freelance writer for hire.

Monday, September 20, 2010

DQ-Tip: “There is no such thing as data accuracy...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“There is no such thing as data accuracy — There are only assertions of data accuracy.”

This DQ-Tip came from the Data Quality Pro webinar ISO 8000 Master Data Quality featuring Peter Benson of ECCMA.

You can download (.pdf file) quotes from this webinar by clicking on this link: Data Quality Pro Webinar Quotes - Peter Benson

ISO 8000 is the international standard for data quality.  You can get more information by clicking on this link: ISO 8000

 

Data Accuracy

Accuracy, thanks to substantial assistance from my readers, was defined in a previous post as the correctness of a data value within a limited context, such as verification by an authoritative reference (i.e., validity), combined with the correctness of a valid data value within an extensive context that includes other data as well as business processes (i.e., accuracy).

“The definition of data quality,” according to Peter and the ISO 8000 standards, “is the ability of the data to meet requirements.”

Although accuracy is only one of many dimensions of data quality, whenever we refer to data as accurate, we are referring to the ability of the data to meet specific requirements, and quite often it’s the ability to support making a critical business decision.

I agree with Peter and the ISO 8000 standards because we can’t simply take an accuracy metric on a data quality dashboard (or however else the assertion is presented to us) at face value without understanding how the metric is both defined and measured.

However, even when well defined and properly measured, data accuracy is still only an assertion.  Oftentimes, the only way to verify the assertion is by putting the data to its intended use.

If by using it you discover that the data is inaccurate, then by having established what the assertion of accuracy was based on, you have a head start on performing root cause analysis, enabling faster resolution of the issues—not only with the data, but also with the business and technical processes used to define and measure data accuracy.

 

Related Posts

Worthy Data Quality Whitepapers (Part 1)

Why isn’t our data quality worse?

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Data Quality and the Cupertino Effect

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-Tip: “There is no point in monitoring data quality...”

DQ-Tip: “Don't pass bad data on to the next person...”

DQ-Tip: “...Go talk with the people using the data”

DQ-Tip: “Data quality is about more than just improving your data...” 

DQ-Tip: “Start where you are...”


Reader Comments (16)

Jim,

I cannot agree with Benson's definition of data quality (as defined here) because it rests on the assumption that every piece of data exists to fulfill a business requirement, and (by extension) that data only exists within businesses.

This is clearly not the case. I also have no problem in believing that data accuracy exists and can be defined.

As is often the case, I think definitions and tips like these are rather too restricted by the context of the person and/or business making them - we need to be more open and generic in the way we view data and data quality.

Must be Monday morning ...

September 20, 2010 | Unregistered Commenter Graham Rhind

I saw a post a while back where they broke this down into two pieces, and I thought it made a lot of sense.

There's the "accuracy" issue - is the data correct? and then there's the usability issue - is the data "fit for use"?

I think this distinction makes a lot of sense because, essentially, you need to be sure you have the right types of info - and that the info you have is accurate. Does this jibe with what you're thinking here?

September 20, 2010 | Unregistered Commenter m ellard

@Graham — I understand your objections, and prior to attending the webinar, I would have agreed with you completely.

I have not had the time to review the ISO 8000 standards in more detail, but my understanding based on Peter Benson's presentation raises the following two counterpoints:

(1) Requirement was being used generically and not necessarily restricted to a business requirement. For example, your ability to download a readable PDF document from my website (like the one with Peter's quotes) is a requirement for data that neither exists within a business nor exists to fulfill a business requirement.

(2) Data accuracy does exist and can be defined. The point is that the definition of data accuracy, including how it is being measured, is only a verifiable assertion, and not a guarantee that the data will meet its intended purposes (which should be included in the definition of accuracy). For example, stating that the Country_Code field has 100% data accuracy is a verifiable assertion based on checking its definition and measurement (e.g., uses ISO 3166 standards for validity of the field values, which are verified as accurate for the postal address on the record on a monthly basis by running the acclaimed Acme Global Postal Address Verification Software written by Wile E. Coyote).
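One way to picture the Country_Code example is as a record that travels with the metric and states who made the assertion, on what basis, and when. The following is only a minimal Python sketch under stated assumptions: the class name AccuracyAssertion, its fields, the steward label, and the sample values are all hypothetical, and the code set is a small illustrative subset of ISO 3166-1 alpha-2 codes, not the full standard.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative subset only; a real check would load the full ISO 3166-1 list.
SAMPLE_ISO_3166_ALPHA2 = {"US", "GB", "NL", "DE", "FR"}


@dataclass(frozen=True)
class AccuracyAssertion:
    """Hypothetical record of an accuracy claim and what it rests on."""
    field_name: str
    asserted_by: str
    basis: str
    measured_on: date
    asserted_accuracy: float  # fraction between 0.0 and 1.0


def valid_fraction(values, allowed):
    """Share of values found in the allowed set (validity only, not accuracy)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v in allowed for v in values) / len(values)


country_codes = ["US", "GB", "NL", "US"]  # hypothetical field values

assertion = AccuracyAssertion(
    field_name="Country_Code",
    asserted_by="Data steward (hypothetical)",
    basis="ISO 3166 validity check plus monthly postal address verification",
    measured_on=date(2010, 9, 20),
    asserted_accuracy=1.0,
)

# The validity part of the claim can be re-checked by anyone who has the data.
recomputed_validity = valid_fraction(country_codes, SAMPLE_ISO_3166_ALPHA2)
```

The validity share can be recomputed at any time. Whether each code is accurate for the postal address on its record still rests on the stated basis, which is why the claim remains an assertion.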

Please Note: These are only my interpretations and might not accurately (pun intended) reflect the ISO 8000 standards :-)

@m ellard — I think that your question is answered by my response to Graham. If not, please let me know.

September 20, 2010 | Registered Commenter Jim Harris

Leaving aside ISO 8000 for a moment (and I admit that I leave aside for many moments most things that start ISO ...), let's discuss Schrödinger's Data.

You'll know that I regard data quality as being an intrinsic property of that data, and to be unrelated to any use to which that data is put or for which it is intended.

There are many whatever-bytes of data around the world (and on my hard disk) that are just there - not used and not referred to. Does this mean that that data has no quality? I would argue that that is NOT the case.

But that's because I largely equate accuracy with quality, and reject purpose as a definition of quality, because accurate data is always fit for any purpose.

I think we've been here before ;-)

If we can't agree on a definition of data quality, we won't be able to agree on tips etc. based on those definitions. I probably couldn't get past the first paragraph of the ISO 8000 document because it would be based on a definition alien to my own.

September 20, 2010 | Unregistered Commenter Graham Rhind

@Graham — The world would be a happier place if everyone abided by the GSO or JSO standards (those being, of course, the Graham's Standards Organization and the Jim's Standards Organization, respectively).

;-)

In Schrödinger's Data Quality, I used that famous thought experiment as a metaphor for waiting until the very end of a data quality project to learn if you have succeeded or failed.

Applied to data quality in general, I would say, according to the Copenhagen interpretation of quantum mechanics, until the data is used for some purpose, the data is simultaneously accurate and inaccurate.

Yet, once you use it, the data will either be accurate or inaccurate.

September 20, 2010 | Registered Commenter Jim Harris

I like to use the word "veracity" when referring to how closely a piece of data aligns with the actual events that occur. After all, the data that we collect about a thing or event is merely one representation of that thing or event.

I agree that it makes perfect sense to measure how "accurately" a piece of information matches to the intent of any particular model of something; but I also think there's value in describing the "veracity" of that information, too.

As you've said, defining accuracy is very dependent on defining the context of the conversation and either the model that's used to define how the information is collected or the requirements on that information for its usage. I think the concept of veracity can stand alone, however, because it is the definition of both context and data quality at the same time.

September 21, 2010 | Unregistered Commenter Paul Boal

@Paul — Thanks for the veracious comment :-)

Data is an abstract description of reality, whether it’s an abstract description of real-world entities (i.e., “master data”) or an abstract description of real-world interactions (i.e., “transaction data”) among entities.

Veracity as a term referring to the alignment of data with the real world it describes is definitely a concept that can stand alone and does make a good data quality definition (i.e., data quality = real-world alignment), which is independent of any potential uses of the data, which, from my perspective, is the threshold crossing over to information (i.e., information is data in use for a particular purpose).

However, I still think the veracity of data is only a verifiable assertion, which could be proven true or false, a relative truth that we must believe in, but a relative truth nonetheless.

To paraphrase Ralph Waldo Emerson:

“The success of the enterprise is upheld by the veracity of good data: it makes information useful.
They who made business decisions based on it found opportunities both plentiful and profitable.
Corporate life is sweet and tolerable only in our belief in such data.”

September 21, 2010 | Registered Commenter Jim Harris

Glad to see that the subject is of interest.

Regarding the definition of quality as "meets requirements" this comes from ISO 9000 and so far it appears to be standing the test of time if nothing else.

By extension data quality is data that meets requirements, nothing more and nothing less.

The issue at hand is therefore how to specify data requirements. ISO 22745-30 is the standard we use for expressing data requirements in XML. Given a data requirement statement, I can easily measure if any given data meets the requirements. The requirements for data are expressed through a collection of properties with constraints on the values that can be associated with the property. For example, I may want to know the day of the week, and the reply would be an enumerated value or a number between 1 and 7.
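To make the day-of-the-week example concrete, here is a minimal Python sketch of checking a record against a requirement statement. It does not reproduce the ISO 22745-30 XML format; the names PropertyRequirement, unmet_requirements, and day_of_week_allowed are hypothetical, and the sketch only assumes that a requirement is a property name plus a constraint on its allowed values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

DAY_NAMES = {
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
}


@dataclass(frozen=True)
class PropertyRequirement:
    """Hypothetical requirement: a property name and a constraint on its value."""
    name: str
    is_allowed: Callable[[Any], bool]


def day_of_week_allowed(value: Any) -> bool:
    """Accept an enumerated day name or an integer from 1 to 7."""
    if isinstance(value, bool):  # bool is a subclass of int; reject it explicitly
        return False
    if isinstance(value, int):
        return 1 <= value <= 7
    return isinstance(value, str) and value in DAY_NAMES


def unmet_requirements(
    record: Dict[str, Any], requirements: List[PropertyRequirement]
) -> List[str]:
    """Return the names of required properties that are missing or violate a constraint."""
    return [
        req.name
        for req in requirements
        if req.name not in record or not req.is_allowed(record[req.name])
    ]


requirements = [PropertyRequirement("day_of_week", day_of_week_allowed)]
```

Under this sketch, data "meets requirements" exactly when unmet_requirements returns an empty list for the requirement statement in question.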

In developing the standard we learnt to understand the difference between data and information and this is critical to understanding data quality as opposed to information quality.

Quality data does not necessarily yield quality information but quality information has to be based on quality data.

In the context of data quality, accuracy is an assertion of accuracy and we need to know who made the assertion and on what basis.

We are starting to deal with information quality and it appears that unlike data quality the characteristics of information quality are relative, for example timeliness is a characteristic of information quality which is measured relative to the recipient of the data.

Accuracy in information quality would be a measurement of proximity to real-world observation and again it would be relative to the measurement or observation performed. Think of a date of birth: how would you measure accuracy when we know that even authoritative records can be wrong and even the original record of a birth can be suspect? In the end, some organization asserts the accuracy of the data.

I am working on a standard for defining real property by combining shape and GPS coordinates. One of the issues is the accuracy of the survey, i.e., the data's ability to reflect the real world. A thorny question is what to do when the property moves position, as it does in an earthquake.

Real world alignment is never absolute (similar to the way light does not travel in a straight line, it is bent by gravity).

September 22, 2010 | Unregistered Commenter Peter Benson

Peter,

A few comments on your response, if I may.

1) Data quality as "meeting requirements" is a common definition in the business world, and it works there (as long as you don't ask anybody actually dealing with the data rather than those dealing with information derived from data...). If I'm not mistaken, ISO 9000 is about quality management in businesses and organizations. So, by extension, the definition will work there. But there is data around not in organizations or businesses, and if we define data quality, it needs to be a definition for all data, not just (big) business data.

2) I think measuring the values that can be associated with a property, such as a number for day of the week, is a measure of data validity, not of data quality. An invalid value (e.g., 8) is invalid data, and lacks quality. A valid value (e.g., 7) may be valid but may not be accurate and may still lack quality; this is the reason I associate accuracy so closely with quality.
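The distinction in point 2 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are hypothetical, and the numbering (Monday = 1 through Sunday = 7, as returned by Python's date.isoweekday) is assumed because the discussion does not say which day the number 1 stands for.

```python
from datetime import date


def is_valid_day_number(value: int) -> bool:
    """Validity: the value falls inside the allowed range 1..7."""
    return 1 <= value <= 7


def is_accurate_day_number(value: int, actual_date: date) -> bool:
    """Accuracy: the value is valid AND matches the real-world day.

    Assumes ISO numbering, where Monday is 1 and Sunday is 7.
    """
    return is_valid_day_number(value) and value == actual_date.isoweekday()


post_date = date(2010, 9, 20)  # the date of this post, a Monday

invalid = is_valid_day_number(8)                            # outside 1..7
valid_but_wrong = is_accurate_day_number(7, post_date)      # in range, wrong day
valid_and_right = is_accurate_day_number(1, post_date)      # in range, right day
```

A range check alone would accept the value 7 for this date, even though the day it describes is wrong.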

3) You say: "In the context of data quality accuracy is an assertion of accuracy and we need to know who made the assertion and on what basis." Yes, I can see that. But to me data quality (as opposed to information quality) is an inalienable and intrinsic property of the data. I can assert that I am male. Somebody else may assert that I am female. Only one of these assertions is accurate and only one has quality. Regardless of who asserts what in this case, and without any navel gazing or ponderances on quantum mechanics, DNA, or gravity, there is an accurate and quality aspect which belongs to the data. And that's data quality (to me, anyway). The data can be rendered in a number of ways (Male, M, Homme, Mannelijk and so on), but the quality of the data remains the same. When the rendering affects the data being used, that's an information quality issue.

We've been discussing what data quality is for many years and they'll still be discussing it long after I'm (assertably) dead.

I think we'd do better to start every discussion and post we make about data and information quality with the definition of each as we understand it ...

Graham

September 22, 2010 | Unregistered Commenter Graham Rhind

Graham,

The only limitation of the definition of data quality comes from the definition of data.

My preferred definition of data is "a disruption in a continuum" but for the purposes of ISO 8000 we have used the term data to mean that which can be processed by a computer which would exclude a painting for example.

Data quality is as relevant to someone using a cell phone or browsing the Internet as it is to a business using SAP, Oracle, or Microsoft Dynamics, definitely not just big business.

We have used as the definition of quality "meets requirements" and I have yet to see an alternative.

The common use of the term quality typically implies a reference but it is rarely (and regrettably) explicit. The "best quality eggs" and "top quality hotel" are actually meaningless statements, which is pretty clear if you have ever stayed at a "Quality Inn", not that it is a bad place to stay but the Ritz would probably differ on the definition (requirements) of quality.

The differentiation between data and information is important.

We appear to agree that the quality of a data element is an inalienable and intrinsic property of the data, but where we appear to disagree is that the data needs to be an accurate representation of the real world. I believe this is a characteristic of information quality.

Taking your example, one cannot actually assume that the individual is the ultimate authority on their sexuality. If the definition of the property "sex" is a strict biological definition, then it would be authoritatively defined by a DNA test, and the credibility of the testing organization may be a contributing factor in determining quality. Even then, there is sometimes a requirement for a legal determination, in XXY cases for example. Using your example, many countries have definitions of sex that differ from the strict biological definition, so here again you would need to know the definition of the property, on what basis the assertion was made, and by whom.

To validate the accuracy of data I need to know what was the basis of the test and who performed it; basically, this is the assertion of accuracy.

Bottom line: I believe the quality of the data can be measured by comparison with the requirements for data and we do this on a regular basis defining data as level 1 quality to level 4 quality based on the functions that the data supports. Data quality covers provenance and assertions of accuracy as well as completeness. These essentially define required data components.

Information quality is challenging, but I expect we will be able to define the characteristics of information quality to allow us to state that information is quality information. The first characteristic we already know: the data from which the information was derived was quality data.

Peter

September 23, 2010 | Unregistered Commenter Peter Benson

Peter,

Thanks for your clarifications. Much of why we disagree on this inevitably comes down to how we define words.

Data quality is absolutely essential to all of us, and I'm glad we agree on that, but many of us tend to look at it from the viewpoint of our experience with data and quality in our professional lives - in big business - and we therefore tend to overcomplicate the issue.

I do also think that much of our disagreement boils down to the definitions of:

Data Quality: http://www.dqglossary.com/dqglossary/#-229

Information Quality: http://www.dqglossary.com/dqglossary/#-363

I like to look at them simply in this way:

A mother writes down her son's new address (that's information) on a piece of paper and pops it into a box (not all definitions of data state that it has to be held on a computer ...). That's data. When her son invites her to come around for a visit, she locates the box and the paper within it and reads the address (which returns then to being information). Unfortunately, she has transposed two digits of the building number, and she goes to the wrong house. This is an information quality problem (the information is not fit for purpose). The information quality problem is directly based on the issue of her data being incorrect (inaccurate) - and that's the data quality issue.

Thus I equate data quality with accuracy and information quality with use.

In this case, to validate the accuracy of the data she did not "need to know what was the basis of the test and who performed it" - she just has to use some information derived from it (back to Jim's point). She didn't need to assess or define or assign levels of confidence, because that's not what real people do - that's what businesses do. If she had chosen to use a different part of her data (such as telling her daughter what her son's postal code was), the data would have been correct and the information derived from it equally good.

Thus the same data is fit for one use and not for another, which is the reason I reject the "fit for purpose/fulfills requirements" label.

So, for me, it's a case of what is data and what is information.

Incidentally, I was very careful in my example to state that I knew MY gender, not anybody else's, as you are quite correct about that point in your response.

Graham

September 24, 2010 | Unregistered Commenter Graham Rhind

Graham,

I believe you would enjoy reading the Semantic Conceptions of Information, which you can find at the following URL:

http://plato.stanford.edu/entries/information-semantic/

I believe that information is translated into data, transferred as data and translated back into information.

This points to three processes, two data translations and one data transfer. The quality of each of these processes will impact the quality of the information. In your example, the second translation was flawed.

Data quality is not absolute but can only be measured in reference to a data requirement, so yes, absolutely, data may be deemed quality data for one purpose but not for another. This is our bread and butter.

For example, Level one quality data in procurement is sufficient to order an item: I know who to order it from and I have an identifier that the supplier understands. Level two quality data contains a class sufficient to analyze the category of spend (spend analysis). Level four quality data is data sufficient to describe the item such that it may be competitively sourced based on its characteristics. (In Level three, some but not all of the requirements have been met.)
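The four levels can be read as a ladder of requirements that a single record either meets or does not. The sketch below is only one possible reading, with loud assumptions: the field names (supplier, supplier_item_id, item_class, characteristics) and the list of required characteristics are hypothetical, Level three is interpreted as "some but not all required characteristics are present", and a Level 0 is added here for records that cannot even be ordered.

```python
from typing import Any, Dict, Sequence


def has_value(record: Dict[str, Any], key: str) -> bool:
    """True when the key is present with a non-empty value."""
    return record.get(key) not in (None, "")


def procurement_quality_level(
    record: Dict[str, Any], required_characteristics: Sequence[str]
) -> int:
    """Return 0-4 under the assumptions stated in the lead-in (hypothetical fields)."""
    # Level 1: enough to order (who to order from, an identifier they understand).
    if not (has_value(record, "supplier") and has_value(record, "supplier_item_id")):
        return 0
    # Level 2: adds a class sufficient for spend analysis.
    if not has_value(record, "item_class"):
        return 1
    characteristics = record.get("characteristics") or {}
    present = [
        name for name in required_characteristics
        if characteristics.get(name) not in (None, "")
    ]
    if not present:
        return 2
    # Level 3: some, but not all, descriptive requirements are met.
    if len(present) < len(required_characteristics):
        return 3
    # Level 4: described well enough to be competitively sourced.
    return 4


required = ["material", "thread_size"]  # hypothetical requirement list
item = {
    "supplier": "Acme",
    "supplier_item_id": "A-100",
    "item_class": "fastener",
    "characteristics": {"material": "steel"},
}
level = procurement_quality_level(item, required)
```

The same record therefore meets the requirement for ordering and for spend analysis, but not yet the requirement for competitive sourcing.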

This is just one of many examples where data meets the requirement for one application but not another.

Quality data is data that meets a defined requirement—nothing more, nothing less.

Without a defined data requirement, you can not measure data quality.

In your previous example, my position was that I may need to know more about you before I could accept even your own assertion as to your own sex.

First, I would have to determine that you were sane, then that you understood the definition of sex, and then that you were capable of testing the defined characteristics. No offense intended, of course. :-)

Peter

September 24, 2010 | Unregistered Commenter Peter Benson

Some fantastic observations.

I think that data quality and accuracy begins with just two simple rules:

Data Quality Rule 1: Data ONLY exists to support the Business Functions of an enterprise.

We might need to add Rule 1a here, which is that a Business Function is NOT a Department! It is a fundamental business activity. Oh! and Rule 1b: A Function is not a Process!

Data Quality Rule 2: There are no exceptions to Rule 1.

Data that is not required by any Function within the enterprise should be removed. If the data might be of value to someone else then the enterprise should put it up for sale, otherwise simply delete it.


These two rules free the enterprise up from thinking that it has to control data for "life, the universe and everything". It allows the enterprise to define the boundaries of its world - its "worldview".

Data accuracy can only be defined within a clearly defined worldview!

The enterprise's worldview coincides with its "management horizon", that is, those areas and activities over which it can (and must) exert management control. This control can be over standards, practices, materials - even data quality.

Enterprises that practice Total Quality Management (TQM) know how important it is to move the management horizon as far upstream as possible and thus ensure that the quality of everything entering the enterprise, whether it is products, materials or data, meets the defined quality criteria within the enterprise. For data, this means moving the management horizon upstream to suppliers and even to customers. It is moved to the customers by enabling them to enter the correct data first time every time. This requires analysis, modeling, initiative, imagination and quality design.

With regard to the ability to download a PDF (to which one commentator alluded), this is simply an electronic means of receiving unstructured data, as is a fax, a phone call or an e-mail. The recipient must then decide whether or not this data is required by a Business Function within the enterprise. If it is, then they are responsible for entering the relevant parts of the data into the appropriate systems while conforming to all appropriate, existing data quality and integrity standards.

Let us learn from tried and proven techniques, such as TQM, and become much more imaginative about creating zero data defects. This will mean we do not have to be so creative about finding them. This might well spoil some people's fun but it must be done.

Is the person who says, "you can't have zero data defects" really saying, "I don't know how to achieve zero data defects"?

Regards,

John

September 26, 2010 | Unregistered Commenter John Owens

@Graham and Peter — Thank you both very much for continuing your discussion and debate in this comments section. Your perspectives are providing valuable insight. Please note that the extra line breaks, italics and bold formatting were mostly added by me to help emphasize what I thought were some of the most important aspects of each comment.


@John — Thank you as well for adding your excellent insights to this debate. Your feedback is always greatly appreciated.

However, I am afraid that I strongly disagree with your closing statement.

Achieving zero defects is what I refer to as the Defect Prevention Fallacy — I am currently writing a blog post about this topic, so for now I will try to be brief.

“Getting data right the first time, every time” is based on two HUGE assumptions:

(1) Data has ONLY one business use.

(2) The business use of data DOES NOT change over time.

Data’s quality is determined by evaluating its fitness for the purpose of business use.

However, in the vast majority of cases, data has multiple business uses, and data of sufficient quality for one use may not be for other, and perhaps unintended, uses.

Many times, it is the unknown future business uses of the originally entered data that is the context for what, in hindsight, appear to be obvious data defects.

“Getting data right the first time, every time” is usually the mantra for blaming data entry for creating data defects. I took on this topic in my blog post Who Framed Data Entry?

The carpenter’s motto is “measure twice, cut once.” Unfortunately, data entry is a bit more complicated, seemingly following the motto “entered once, used often.”

Not only can we not predict the unknown future business uses, we often can not satisfy all known business uses of the data at time of creation without disrupting the operational source system, especially since multiple business uses of the same data often conflict with each other — and for VALID business-justified reasons.

This is why, as I explained in my blog post The Fourth Law of Data Quality, data quality standards include both objective data quality and subjective information quality.

Getting back to the second assumption, businesses evolve over time. This has always been true, but is even more true today.

As I explained in my blog post Is your data complete and accurate, but useless to your business?, with silos replicating data as well as new data being created daily, managing all of the data is not only becoming impractical, but because we are too busy with the activity of trying to manage all of it, no one is stopping to evaluate usage or business relevance. When this happens, an organization can become so lost in all of the data it manages that it is unable to convert data into business insight and unable, as a result, to survive and thrive in today’s highly competitive and rapidly evolving marketplace.

Furthermore, data is now everywhere. Data is no longer just in the structured rows of our relational databases and spreadsheets. Data is also in the unstructured streams of our Facebook and Twitter status updates, as well as our blog posts, our photos, and our videos.

I am greatly puzzled by your recommendation for how these unstructured data sources should be used:

“The recipient must then decide whether or not this data is required by a Business Function within the enterprise. If it is, then they are responsible for entering the relevant parts of the data into the appropriate systems while conforming to all appropriate, existing data quality and integrity standards.”

So is the person who says: “We must have zero data defects!”

Really saying: “We can not use our data until we achieve zero data defects?”

If so, good luck building that Data Utopia — meanwhile business users have business decisions to make right now.

Although zero defects is obviously preferable to data containing defects, less than perfect data quality can not be used as an excuse to delay making a critical business decision.

When it comes to the quality of the data being used to make business decisions, you can't always get the data you want,
but if you try sometimes, you just might find, you get the business insight you need.

Best Regards,

Jim

September 26, 2010 | Registered Commenter Jim Harris

@John — May I be a little finicky (and why change the habit of a lifetime!)?

Regarding your Data Quality Rule 1: Data ONLY exists to support the Business Functions of an enterprise.

My rewriting of it would be:

Data Quality for data within business enterprises ONLY exists to support the Business Functions (past, present, and future) of an enterprise.

Apart from my obvious and oft repeated gripe that too many people treat data as if it only exists in businesses (and define it as such), when it comes to data we need to plan ahead. Collecting data now which may only be required for a review in a year's time is not wasted effort, in my view. I find myself doing far too much work to collect data that I should have collected before (when it was not required), and most businesses are no different.

Graham

September 26, 2010 | Unregistered Commenter Graham Rhind
