DQ-Tip: “Data quality is primarily about context not accuracy...”
Data Quality (DQ) Tips is an OCDQ regular segment. Each DQ-Tip is a clear and concise data quality pearl of wisdom.
“Data quality is primarily about context not accuracy.
Accuracy is part of the equation, but only a very small portion.”
This DQ-Tip is from Rick Sherman's recent blog post summarizing the TDWI Boston Chapter Meeting at MIT.
I define data using the Dragnet definition – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers). A common definition for data quality is fitness for the purpose of use. The common challenge is that data has multiple uses – each with its own fitness requirements. Viewing each intended use as the information that is derived from data, I define information as data in use or data in action.
Alternatively, information can be defined as data in context.
Quality, as Sherman explains, “is in the eyes of the beholder, i.e. the business context.”
Related Posts
DQ-Tip: “Don't pass bad data on to the next person...”
The General Theory of Data Quality
The Data-Information Continuum



Jim Harris
Reader Comments (13)
The challenges in Multi-Purpose Data Quality are one of my favourite topics.
I've written some more about it here, where I have tried to place the challenges within a real life example.
I hope to receive some comments on this matter.
From the LinkedIn Group for Data Governance & Data Quality, Julian Schwarzenbach commented:
"I disagree with this view - if data is inaccurate, or not of known accuracy, then it will not be able to effectively support the business processes and decision making it is intended to support. For transport and utilities organisations, long term strategic planning is reliant on the quality (particularly accuracy) of asset information.
There are examples in the UK where organisations who have either had poorly stated, or inaccurate, business plans have been impacted in the order of £hundreds of millions when final regulatory determinations have been published. In these circumstances data accuracy is a large part of the equation."
And I responded:
Thanks for sharing your perspective.
I was really hoping that someone would provide this counterpoint because I have to admit that I was initially uneasy about downplaying the importance of accuracy.
Accuracy is of course still important.
However, the challenge with many data metrics (especially financial metrics) is context.
For example, there can be many ways for an organization to calculate revenue. It is common for different business units within the same organization to calculate revenue very differently. (Of course, it is also common for this to happen even within the same business unit!).
Each calculation would be deemed accurate within the context of the formula or the particular data selected for inclusion in the calculation. The famous quote from Mark Twain applies here:
"A man with one watch knows what time it is. A man with two is never sure."
Therefore, often the first question needed to verify accuracy is "within what context?"
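To make the revenue example concrete, here is a minimal sketch in Python. The Order fields and the two formulas are hypothetical, invented purely for illustration; they are not taken from any particular organization. Each function returns a figure that is accurate within its own definition, yet the two figures differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    amount: float    # invoiced amount (hypothetical field)
    refunded: float  # amount later refunded (hypothetical field)
    shipped: bool    # whether the order has shipped (hypothetical field)


def gross_revenue(orders):
    """Revenue defined as the sum of everything invoiced."""
    return sum(o.amount for o in orders)


def net_shipped_revenue(orders):
    """Revenue defined as shipped orders only, net of refunds."""
    return sum(o.amount - o.refunded for o in orders if o.shipped)


# Made-up sample data.
orders = [
    Order(amount=100.0, refunded=0.0, shipped=True),
    Order(amount=250.0, refunded=50.0, shipped=True),
    Order(amount=80.0, refunded=0.0, shipped=False),
]

print(gross_revenue(orders))        # 430.0
print(net_shipped_revenue(orders))  # 300.0
```

Neither number is wrong. Asking "which one is accurate?" only makes sense once the context (the definition in use) has been stated.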
And Julian Schwarzenbach responded:
"Arguably, different definitions of revenue calculations are less important than the accuracy of the data used in the revenue calculations. If the calculated value of revenue (by whichever formula) is viewed as information, it does not detract from the core issue of ensuring that data is of suitable accuracy and precision."
To me it’s a question of whether we are talking about raw data quality or information quality.
Information quality is perhaps less about accuracy and more about context. Yet if I have to fix information quality with multi-purpose data, I have to deal with accuracy (and completeness, timeliness, uniqueness, etc.) in the raw data.
With financial data, it’s about the accuracy (and other data quality dimensions) of amounts, timestamps, currency rates, categories, etc., so that you can derive all the relevant information for various purposes.
With master data, you must likewise maintain the raw master data so that its accuracy, timeliness, uniqueness and other dimensions fulfill all the various purposes. The easiest way to do this is often to have a close real-world alignment (see the link in comment 1).
From the LinkedIn Group for Data Governance & Data Quality, Denis Kosar commented:
"Well, I have been in data architecture for over 25 years. During that time, I was lucky enough to have established a formal data quality function for a major healthcare provider because the business area recognized the value of inaccurate diagnosis code which lead to major delays in the claims process.
All I will say is a good friend of mine once defined data quality as "Fitness for Use." So I guess what is important is the type of field we are talking about, as well as the type of business rules we put in place and why we are testing them.
If you are talking about code set tables, the value must be a valid value and consistent with what is being tested. An example would be rule testing consistency. If you are working with asset class and your test is to see if maturity date is not null, the asset class should be fixed income (Bonds). You would not perform this test for equities. Another test where accuracy is important would be a primary key where a not null test is important.
Anyway, there are instances when accuracy is important and others where it's not, so I would agree context is important."
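The two rule tests Denis describes can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names (asset_id, asset_class, maturity_date) and the class label "fixed_income" are hypothetical and do not come from any real schema.

```python
def check_record(record):
    """Return a list of rule violations for one asset record (a dict)."""
    problems = []

    # The primary key must never be null, whatever the asset class.
    if record.get("asset_id") is None:
        problems.append("asset_id is null")

    # A maturity date is only required for fixed income (bonds);
    # the test is deliberately skipped for equities.
    if (record.get("asset_class") == "fixed_income"
            and record.get("maturity_date") is None):
        problems.append("fixed income asset has no maturity_date")

    return problems


# Made-up sample records.
bond = {"asset_id": 1, "asset_class": "fixed_income", "maturity_date": None}
equity = {"asset_id": 2, "asset_class": "equity", "maturity_date": None}

print(check_record(bond))    # ['fixed income asset has no maturity_date']
print(check_record(equity))  # []
```

The same null maturity date is a defect for the bond and perfectly acceptable for the equity, which is the point about context deciding which tests apply.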
I have to agree with Rick about data quality being in the eye of the beholder - and with Henrik on the several dimensions of quality.
A theme I often return to is "what does the business want/expect from data?" - and when you hear them talk about quality, it's not just an issue of accuracy. The business stakeholder cares - more than many seem to notice - about a number of other issues that are squarely BI concerns:
- Timeliness ("when I want it")
- Format ("how I want to see it") - visualization, delivery channels
- Usability ("how I want to then make use of it") - being able to extract information from a report (say) for other purposes
- Relevance ("I want highlighted the information that is meaningful to me")
And so on. Yes, accuracy is important, and delivering inaccurate information hurts your effectiveness. But that's not the only thing a business stakeholder can raise when discussing issues of quality. A report can be rejected as poor quality if it doesn't adequately meet business needs in a far more general sense. That is the constant challenge for a BI professional.
Data Quality is always about accuracy at the raw level, but I agree with Rick that the value of data quality is realized through the business context, which helps to focus and prioritize the data quality effort so that one gets the full benefit out of it. Data Quality must be 'Fit for Business Purpose.'
Maybe the headline could have been a little different. Saying '... Not Accuracy' somehow implies that accuracy is not important. If it is all about context and if accuracy is compromised, then the context loses the business benefit that could have been realized with higher accuracy.
This is a great debate, about a serious topic for Data Quality and BI professionals.
The bottom line from a business perspective is always "just enough is good enough." The challenge for the Data Quality profession is to make the business case for "just enough" to be clearly defined in terms of measurable dimensions.
Ultimately, the context is most important - i.e. the use to which the information is put. However, the final presentation of the information is dependent on the underlying data and, as Stephen Simmonds points out, on many other factors such as Timeliness, Format, Usability and Relevance.
Let me give you a simple analogy. Suppose there is a requirement to water a new lawn. The context is that 500 litres (or liters for our US friends) must be sprayed on the new lawn. One might assume that so long as the water is delivered, the requirement is met...
However, what if:
- A well had to be dug to provide the water?
- The water contains contaminants that will kill the new lawn?
- The hose contains many leaks, and leaks 5,000 litres in delivering 500 (incurring 10 times the water charges)?
- etc.
Watering a lawn is such an everyday occurrence that one reasonably assumes that the required 'plumbing' is in place to deliver clean water in a cost effective manner.
Business people have the right to assume that they can access the information they require. Business people have the right to assume that the required 'plumbing' is in place to deliver 'clean', 'complete', 'accurate', 'timely', 'relevant', 'usable' information in a cost effective manner.
Thus we need to split the 'plumbing', which should be standard across all applications, from the business specific, "bespoke by nature" part of data / information management. The business specific stuff, the 'context', the 'really important stuff' simply cannot happen if the 'plumbing' is not in place.
So, which is more important, the context or the accuracy? Which is more important, the chicken or the egg?
I have written a series on my blog covering Data 'plumbing' issues, and how to assess the status in your Enterprise:
Process for assessing status of common Enterprise-Wide Data Issues
The problem is that context gets thrown out the window when that one person believes she/he can individually solve the corporate data quality problem. That's what makes data quality so hard - you have to rely heavily on your cross-functional teams to both understand the meaning of the data and its impact on the organization.
An example is the simple two-letter abbreviation “pt.” Within various contexts, “pt” can mean many different things:
• PT Emp = Part-time employee
• PTCRSR = PT Cruiser (Personal Transportation Cruiser)
• Blk pt chassis = Black platinum chassis
• 24pt bk = Manual published in 24-point type
• 2 pt asbl = Two-part assembly
• 1 pt = One pint
• LIS PT = Lisbon, Portugal
Data quality requires people working together to understand the meaning of data like PT.
The only way to make the data fit-for-use is to provide context.
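As a rough sketch of what "providing context" could look like in code, the Python below resolves "pt" through a lookup keyed by business context. The context names (hr, automotive, and so on) are hypothetical labels chosen for this illustration; in practice they would have to come from the cross-functional teams who know what the data means.

```python
# Meaning of "pt" per business context.
# The context keys are hypothetical labels, not from any real system.
PT_MEANINGS = {
    "hr": "part-time",
    "automotive": "PT Cruiser",
    "materials": "platinum",
    "publishing": "point (type size)",
    "manufacturing": "part",
    "volume": "pint",
    "geography": "Portugal",
}


def expand_pt(context):
    """Return the meaning of 'pt' in the given context, or None if unknown."""
    return PT_MEANINGS.get(context)


print(expand_pt("hr"))         # part-time
print(expand_pt("geography"))  # Portugal
print(expand_pt("finance"))    # None
```

Returning None for an unknown context, rather than guessing, keeps the value from being silently misread.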
I see this discussion has morphed somewhat to be discussing information quality rather than data quality/accuracy.
I will resist adding my own views for the moment, since I first want to ask a question:
In terms of DATA (not information), can anybody give me an example where data which is accurate, relevant, complete and up-to-date has not proved to have been fit for ANY purpose?
Thanks!
If a tree falls in the forest, but there's no one there to hear it...
At the end of every post in this thread, there is a date corresponding to the date the poster added to the thread. If no one analyzes this information, does it have any purpose?
Yes, it can serve to reorder the thread (though it's not clear if you could shuffle multiple postings on the same day), and you could perform analysis on that field (say, what is the average, max, min, etc number of postings per day? What days had the most activity, and was there any identifiable driver for that activity? What if we were trying to perform some capacity planning for the website?).
But what if no one performs any of that analysis--then it becomes unneeded data, clogging up all the places it appears. One of the trends in database tuning helps identify these kinds of fields, so you can weed them out.
So, to refer to what others have previously posted:
Accuracy, timeliness, etc. don't mean a whole lot if you're not going to use that data.
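For what it's worth, the kind of analysis mentioned above for the posting dates takes only a few lines of Python. The dates below are invented sample data, not the actual dates from this thread.

```python
from collections import Counter
from datetime import date

# Invented posting dates, one entry per comment.
post_dates = [
    date(2010, 3, 1), date(2010, 3, 1),
    date(2010, 3, 2),
    date(2010, 3, 4), date(2010, 3, 4), date(2010, 3, 4),
]

# Number of postings on each day that had at least one posting.
per_day = Counter(post_dates)

# The day with the most activity, and how many postings it had.
busiest_day, busiest_count = per_day.most_common(1)[0]

# Average postings per active day (days with no postings are not counted).
average_per_active_day = len(post_dates) / len(per_day)

print(busiest_day, busiest_count)  # 2010-03-04 3
print(average_per_active_day)      # 2.0
```

Until somebody actually runs something like this, the date field is just being stored.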
Charles: I don't know if your post is meant as an answer to my question (I presume not, as your concerns are covered, I think, by the word "relevant" in my post).
I do feel, though, that the idea of non-required data clogging up systems is not the problem some consider it to be.
It is always better to collect more data than you may need than to collect too little, because when you find that you need to know that extra piece of information (and we never know what may be required in the future), you're generally not going to be able to go back and collect it in the time available.
In my experience software and hardware are developing faster than our ability to collect data, and stored data should not get in the way of other data or processes if it is stored and processed properly.
I have collected tens of millions of records on my humble PC over the years, but I am now able to process them faster than ever. Often I collected the data without knowing at the time what I would do with it, but my customers were always delighted that I had it when they suddenly needed it, and that it could be provided in a timely manner.
Naturally, I'm not suggesting that banks, for example, collect data about the number of trees I have in my garden - none of us can, or would want to, collect every piece of data about everything - but I would always advise collecting data which may not be relevant now but may be relevant in the future.
Graham,
UK addresses can be adequately recorded using only the first line of the address plus postcode. Software is available which returns a full address given those 2 pieces of information.
Yet, despite this, a customer database will still hold the full address. (I know you know this already). Therefore I put it to you that address line 2 is accurate, relevant (unless you want to argue semantics :-) ), complete (more than necessary) and up-to-date...yet serves no purpose.
If you give a postman the house number and postcode, the letter will arrive at its proper destination.
There is (was) an outdoor pursuits shop in the Lake District (I forget its name) which used to boast it was so famous that mail arrived even without a full address. The envelope just had a drawing of a man fishing with the word KESWICK (the town), but that's another story.
While I agree very much about data context, I have to say that getting inaccurate data contextually right often doesn't help much. I look for consistency in 4 things:
1. Complete
2. Accurate
3. Normalized
4. Timely
If all of that is true, the systems that have the data should be able to present it in the context I want. In general, I'm more concerned in getting the data in the first place - and getting it right - and letting downstream processes act upon the data knowing that they have consistently reliable information.