Jim Harris

My name is Jim Harris.  I am the Blogger-in-Chief of OCDQ Blog and an independent consultant, speaker, and freelance writer for hire.


Entries in Accuracy (12)

Tuesday
Jul 10, 2012

Shining a Social Light on Data Quality

Last week, when I published my blog post Lightning Strikes the Cloud, I unintentionally demonstrated three important things about data quality.

The first thing I demonstrated was that even an obsessive-compulsive data quality geek is capable of data defects, since I initially published the post with the title Lightening Strikes the Cloud.  This is an excellent example of the difference between validity and accuracy caused by the Cupertino Effect: although lightening is valid (i.e., a correctly spelled word), it isn’t contextually accurate.
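The difference between validity and accuracy can be sketched in a few lines of Python.  This is only an illustration: the tiny word list and the `intended` argument are hypothetical stand-ins for a spell checker's dictionary and for the context that a human reader supplies.

```python
# Hypothetical mini-dictionary standing in for a spell checker's word list.
DICTIONARY = {"lightning", "lightening", "strikes", "the", "cloud"}


def is_valid(word: str) -> bool:
    """Validity: the word is correctly spelled (it exists in the dictionary)."""
    return word.lower() in DICTIONARY


def is_accurate(word: str, intended: str) -> bool:
    """Accuracy: the word is valid AND is the word the context calls for."""
    return is_valid(word) and word.lower() == intended.lower()


published = "Lightening"
print(is_valid(published))                  # a real word, so it passes spell check
print(is_accurate(published, "Lightning"))  # but not the word this title needed
```

A validity check alone would never have flagged the title, because only the context reveals which valid word was meant.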

The second thing I demonstrated was the value of shining a social light on data quality — the value of using collaborative tools like social media to crowd-source data quality improvements.  Thankfully, Julian Schwarzenbach quickly noticed my error on Twitter.  “Did you mean lightning?  The concept of lightening clouds could be worth exploring further,” Julian humorously tweeted.  “Might be interesting to consider what happens if the cloud gets so light that it floats away.”  To which I replied that if the cloud gets so light that it floats away, it could become Interstellar Computing or, as Julian suggested, the start of the Intergalactic Net, which I suppose is where we will eventually have to store all of that big data we keep hearing so much about these days.

The third thing I demonstrated was the potential dark side of data cleansing, since the only remaining trace of my data defect is a broken URL.  This is an example of not providing a well-documented audit trail, which is necessary within an organization to communicate data quality issues and resolutions.
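As a rough sketch of what a minimal audit trail could look like, the Python below records the old value before overwriting it.  The `Post` and `Correction` classes and their fields are hypothetical and do not describe how any blogging platform actually works.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Correction:
    field_name: str
    old_value: str
    new_value: str
    reported_by: str
    corrected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Post:
    title: str
    audit_trail: list = field(default_factory=list)

    def correct_title(self, new_title: str, reported_by: str) -> None:
        # Record the defect before overwriting it, so the fix stays traceable.
        self.audit_trail.append(
            Correction("title", self.title, new_title, reported_by)
        )
        self.title = new_title


post = Post("Lightening Strikes the Cloud")
post.correct_title("Lightning Strikes the Cloud", reported_by="Julian Schwarzenbach")
for entry in post.audit_trail:
    print(entry.field_name, "|", entry.old_value, "->", entry.new_value)
```

With a record like this, the cleansed value and the original defect both remain available for anyone investigating the issue later.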

Communication and collaboration are essential to finding our way with data quality.  And social media can help us by providing more immediate and expanded access to our collective knowledge, experience, and wisdom, and by shining a social light that illuminates the shadows cast upon data quality issues when a perception filter or bystander effect gets the better of our individual attention or undermines our collective best intentions — which, as I recently demonstrated, occasionally happens to all of us.

 

Related Posts

Data Quality and the Cupertino Effect

Are you turning Ugly Data into Cute Information?

The Importance of Envelopes

The Algebra of Collaboration

Finding Data Quality

The Wisdom of the Social Media Crowd

Perception Filters and Data Quality

Data Quality and the Bystander Effect

The Family Circus and Data Quality

Data Quality and the Q Test

Metadata, Data Quality, and the Stroop Test

The Three Most Important Letters in Data Governance

Monday
Jun 25, 2012

Metadata, Data Quality, and the Stroop Test

In psychology, the Stroop Effect is a demonstration of the reaction time of a task.  The most commonly used example is what is known as the Stroop Test, which compares the time needed to name colors when they are printed in an ink color that matches their name (e.g., green, yellow, red, blue, brown, purple) with the time needed to name the same colors when they are printed in an ink color that does not match their name (e.g., blue, red, purple, green, brown, yellow).  Naming the color of the word takes longer, and is more prone to errors, when the ink color does not match the name of the color.

The Stroop Test, where colors do not match their names, reminds me of the relationship between metadata and data quality if I view the ink color as the metadata and the name of the color as the data, given that understanding data takes longer, and is more prone to errors, when the metadata does not match the data, or when the metadata is ambiguous.

Unlike the Stroop Test, where poor metadata (ink color) obfuscates good data (name of the color), data quality issues can also be caused when good metadata is undermined by poor data (e.g., data entry errors like an email address being entered into a postal address field).  And, of course, even when the entered data matches the metadata (or automatic data-to-metadata matching is enabled by drop-down boxes), more insidious data quality issues can be caused by the complex challenge of data accuracy.
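The email-in-a-postal-address example can be sketched as a simple check of data against its metadata.  This is a minimal illustration under stated assumptions: the function names are hypothetical, and the pattern is deliberately loose rather than a complete email validator.

```python
import re

# Deliberately loose pattern: "something@something.tld".  Illustrative only.
EMAIL_LIKE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value: str) -> bool:
    return bool(EMAIL_LIKE.match(value.strip()))


def check_postal_address(value: str) -> str:
    """Flag values that contradict the field's metadata (a postal address)."""
    if not value.strip():
        return "missing"
    if looks_like_email(value):
        return "mismatch: email address in postal address field"
    # Passing this check means the value fits the metadata, not that it is accurate.
    return "plausible"


print(check_postal_address("jane@example.com"))
print(check_postal_address("221 Main Street, Boston, MA"))
```

Note that "plausible" is the strongest result such a check can return, which is exactly why accuracy remains the more insidious challenge.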

Additionally, the point of view paradox can make data quality debates about fitness for the purpose of use even more colorful than the Stroop Test, such as when data that one user sees as red and green, another user sees as crimson and chartreuse.

But hopefully we can all agree that good data quality begins with good metadata, because better metadata makes data better.

 

Related Posts

You Say Potato and I Say Tater Tot

The Metadata Continuum

The Metadata Crisis

Let’s Meta a Data

What’s the Meta with your Data?

DQ-View: MetaData makes BettahMusic

Who Framed Data Entry?

Data Quality and the Cupertino Effect

DQ-Tip: “There is no such thing as data accuracy...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-BE: Data Quality Airlines

Data Quality and the Q Test

Wednesday
Jun 01, 2011

A Brave New Data World

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Welcome to the highly anticipated debut episode of the Obsessive-Compulsive Data Quality (OCDQ) podcast—OCDQ Radio!

In this episode, I discuss how data, data quality, data-driven decision making, and metadata quality no longer reside exclusively within the esoteric realm of data management.  Data has now so thoroughly pervaded mainstream culture that we hardly seem to notice that we are quite literally swimming in data on a daily basis.

The growing challenge is whether we can extract meaningful insights from these vast and veritable oceans of unrelenting data volumes, and use those insights to make better decisions in near real-time in order to positively impact the various aspects of our lives.

We are now living in a brave new data world where everyone is a data geek—and data quality affects us all.

Or to paraphrase William Shakespeare:

“O wonder!

How many goodly data are there here!  How beauteous data geeks are! 

O brave new world! 

That is so dependent on the quality of the data in it!”

 


 

Related Posts

Data, data everywhere, but where is data quality?

Data In, Decision Out

The Data-Decision Symphony

Data Confabulation in Business Intelligence

The Real Data Value is Business Insight

Thaler’s Apples and Data Quality Oranges

Amazon’s Data Management Brain

The Reptilian Anti-Data Brain

Identifying Duplicate Customers

Data Quality and the Cupertino Effect

What's the Meta with your Data?

Let’s Meta a Data

Thursday
Mar 03, 2011

Thaler’s Apples and Data Quality Oranges

In the opening chapter of his book Carrots and Sticks, Ian Ayres recounts the story of Thaler’s Apples:

“The behavioral revolution in economics began in 1981 when Richard Thaler published a seven-page letter in a somewhat obscure economics journal, which posed a pretty simple choice about apples.

Which would you prefer:

(A) One apple in one year, or

(B) Two apples in one year plus one day?

This is a strange hypothetical—why would you have to wait a year to receive an apple?  But choosing is not very difficult; most people would choose to wait an extra day to double the size of their gift.

Thaler went on, however, to pose a second apple choice.

Which would you prefer:

(C) One apple today, or

(D) Two apples tomorrow?

What’s interesting is that many people give a different, seemingly inconsistent answer to this second question.  Many of the same people who are patient when asked to consider this choice a year in advance turn around and become impatient when the choice has immediate consequences—they prefer C over D.

What was revolutionary about his apple example is that it illustrated the plausibility of what behavioral economists call ‘time-inconsistent’ preferences.  Richard was centrally interested in the people who chose both B and C.  These people, who preferred two apples in the future but one apple today, flipped their preferences as the delivery date got closer.”

What does this have to do with data quality?  Give me a moment to finish eating my second apple, and then I will explain . . .

 

Data Quality Oranges

Let’s imagine that an orange represents a unit of measurement for data quality, somewhat analogous to data accuracy, such that the more data quality oranges you have, the better the quality of data is for your needs—let’s say for making a business decision.

Which would you prefer:

(A) One data quality orange in one month, or

(B) Two data quality oranges in one month plus one day?

(Please Note: Due to the strange uncertainties of fruit-based mathematics, two data quality oranges do not necessarily equate to a doubling of data accuracy, but two data quality oranges are certainly an improvement over one data quality orange).

Now, of course, on those rare occasions when you can afford to wait a month or so before making a critical business decision, most people would choose to wait an extra day in order to improve their data quality before making their data-driven decision.

However, let’s imagine you are feeling squeezed by a more pressing business decision—now which would you prefer:

(C) One data quality orange today, or

(D) Two data quality oranges tomorrow?

In my experience with data quality and business intelligence, most people prefer B over A—and C over D.
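One way to see how the same person can prefer both B and C is a quasi-hyperbolic (beta-delta) discounting model, which is a common way of modeling time-inconsistent preferences.  The model is not from Thaler's letter or from this post, and the parameter values and the 30-day month below are assumptions chosen purely for illustration.

```python
def present_value(oranges: float, days_away: int,
                  beta: float = 0.4, delta: float = 0.99) -> float:
    """Quasi-hyperbolic (beta-delta) discounting.

    Anything delivered today keeps its full value; anything delayed is
    scaled by beta once, then by delta for each day of delay.
    """
    if days_away == 0:
        return oranges
    return oranges * beta * delta ** days_away


def prefer(first, second):
    """Return the label of the option with the higher present value."""
    (label_1, oranges_1, days_1), (label_2, oranges_2, days_2) = first, second
    if present_value(oranges_1, days_1) >= present_value(oranges_2, days_2):
        return label_1
    return label_2


# Assumes one month = 30 days.
far_choice = prefer(("A", 1, 30), ("B", 2, 31))  # one month vs. one month plus a day
near_choice = prefer(("C", 1, 0), ("D", 2, 1))   # today vs. tomorrow
print(far_choice, near_choice)
```

With these assumed parameters, the extra day costs almost nothing when both options are a month away, but the one-time penalty for any delay at all outweighs the second orange when the first is available today.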

This “time-inconsistent” data quality preference within business intelligence reflects the reality that with the speed at which things change these days, more real-time business decisions are required—perhaps making speed more important than quality.

In a recent Data Knights Tweet Jam, Mark Lorion pondered speed versus quality within business intelligence, asking: “Is it better to be perfect in 30 days or 70% today?  Good enough may often be good enough.”

To which Henrik Liliendahl Sørensen responded with the perfectly pithy wisdom: “Good, Fast, Decision—Pick any two.”

However, Steve Dine cautioned that speed versus quality is decision dependent: “70% is good when deciding how many pencils to order, but maybe not for a one billion dollar acquisition.”

Mark’s follow-up captured the speed versus quality tradeoff succinctly with “Good Now versus Great Later.”  And Henrik added the excellent cautionary note: “Good decision now, great decision too late—especially if data quality is not a mature discipline.”

 

What Say You?

How many data quality oranges do you think it takes?  Or for those who prefer a less fruitful phrasing, where do you stand on the speed versus quality debate?  How good does data quality have to be in order to make a good data-driven business decision?

 

Related Posts

To Our Data Perfectionists

DQ-Tip: “There is no such thing as data accuracy...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

Data Quality and the Cupertino Effect

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Data In, Decision Out

The Data-Decision Symphony

Data!

You Can’t Always Get the Data You Want

Tuesday
Feb 01, 2011

Data In, Decision Out

This recent blog post by Seth Godin made me think about the data quality adage garbage in, garbage out (aka GIGO).

Since we live in the era of data deluge and information overload, Godin’s question about how much time and effort should be spent on absorbing data and how much time and effort should be invested in producing output is an important one, especially for enterprise data management, where it boils down to how much data should be taken in before a business decision can come out.

In other words, it’s about how much time and effort is invested in the organization’s data in, decision out (i.e., DIDO) process.

And, of course, quality is an important aspect of the DIDO process—both data quality and decision quality.  But, oftentimes, it is an organization’s overwhelming concerns about its GIGO that lead to inefficiencies and ineffectiveness around its DIDO.

How much data is necessary to make an effective business decision?  Having complete (i.e., all available) data seems obviously preferable to incomplete data.  However, with data volumes always burgeoning, the unavoidable fact is that sometimes having more data only adds confusion instead of clarity, thereby becoming a distraction instead of helping you make a better decision.

Although accurate data is obviously preferable to inaccurate data, less than perfect data quality cannot be used as an excuse to delay making a business decision.  Even large amounts of high quality data will not guarantee high quality business decisions, just as high quality business decisions will not guarantee high quality business results.

In other words, overcoming GIGO will not guarantee DIDO success.

When it comes to the amount and quality of the data used to make business decisions, you can’t always get the data you want, and while you should always be data-driven, never only intuition-driven, eventually it has to become: Time to start deciding.

 

Related Posts

The Data-Decision Symphony

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

DQ-View: From Data to Decision

TDWI World Conference Orlando 2010

Thursday
Sep 23, 2010

DQ-BE: Data Quality Airlines

Data Quality By Example (DQ-BE) is a new OCDQ segment that will provide examples of key data quality concepts.

“Good morning sir!” said the smiling gentleman behind the counter—and a little too cheerily for 5 o’clock in the morning.  “Welcome to the check-in counter for Data Quality Airlines.  My name is Edward.  How may I help you today?”

“Good morning Edward,” I replied.  “My name is John Smith.  I am traveling to Boston today on flight number 221.”

“Thank you for choosing Data Quality Airlines!” responded Edward.  “May I please see your driver’s license, passport, or other government issued photo identification so that I can verify your data accuracy?”

As I handed Edward my driver’s license, I explained “it’s an old photograph in which I was clean-shaven, wearing contact lenses, and ten pounds lighter” since I now had a full beard, was wearing glasses, and, to be honest, was actually thirty pounds heavier.

“Oh,” said Edward, his plastic smile morphing into a more believable and stern frown.  “I am afraid you are on the No Fly List.”

“Oh, that’s right—because of my name being so common!” I replied while fumbling through my backpack, frantically searching for the piece of paper, which I then handed to Edward.  “I’m supposed to give you my Redress Control Number.”

“Actually, you’re supposed to use your Redress Control Number when making your reservation,” Edward retorted.

“In other words,” I replied, while sporting my best plastic smile, “although you couldn’t verify the accuracy of my customer data when I made my reservation on-line last month, you were able to verify the authorization to immediately charge my credit card for the full price of purchasing a non-refundable plane ticket to fly on Data Quality Airlines.”

“I don’t appreciate your sense of humor,” replied Edward.  “Everyone at Data Quality Airlines takes accuracy very seriously.”

Edward printed my boarding pass, wrote BCS on it in big letters, handed it to me, and with an even more plastic smile cheerily returning to his face, said: “Please proceed to the security checkpoint.  Thank you again for choosing Data Quality Airlines!”

“Boarding pass?” asked the not-at-all smiling woman at the security checkpoint.  After I handed her my boarding pass, she said, “And your driver’s license, passport, or other government issued photo identification so that I can verify your data accuracy.”

“I guess my verified data accuracy at the Data Quality Airlines check-in counter must have already expired,” I joked as I handed her my driver’s license.  “It’s an old photograph in which I was clean-shaven, wearing contact lenses, and ten pounds lighter.”

The woman silently examined my boarding pass and driver’s license, circled BCS with a magic marker, and then shouted over her shoulder to a group of not-at-all smiling security personnel standing behind her: “Randomly selected security screening!”

One of them, a very large man, stepped toward me as the sound from the snap of the fresh latex glove he had just placed on his very large hand echoed down the long hallway that he was now pointing me toward.  “Right this way sir,” he said with a smile.

Ten minutes later, as I slowly walked to the gate for Data Quality Airlines Flight Number 221 to Boston, the thought echoing through my mind was that there is no such thing as data accuracy—there are only verifiable assertions of data accuracy . . .

Related Posts

DQ-Tip: “There is no such thing as data accuracy...”

Why isn’t our data quality worse?

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Data Quality and the Cupertino Effect

DQ-Tip: “Data quality is primarily about context not accuracy...”

Monday
Sep 20, 2010

DQ-Tip: “There is no such thing as data accuracy...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“There is no such thing as data accuracy — There are only assertions of data accuracy.”

This DQ-Tip came from the Data Quality Pro webinar ISO 8000 Master Data Quality featuring Peter Benson of ECCMA.

You can download (.pdf file) quotes from this webinar by clicking on this link: Data Quality Pro Webinar Quotes - Peter Benson

ISO 8000 is the international standard for data quality.  You can get more information by clicking on this link: ISO 8000

 

Data Accuracy

Accuracy, thanks to substantial assistance from my readers, was defined in a previous post as both the correctness of a data value within a limited context such as verification by an authoritative reference (i.e., validity) and the correctness of a valid data value within an extensive context including other data as well as business processes (i.e., accuracy).

“The definition of data quality,” according to Peter and the ISO 8000 standards, “is the ability of the data to meet requirements.”

Although accuracy is only one of many dimensions of data quality, whenever we refer to data as accurate, we are referring to the ability of the data to meet specific requirements, and quite often it’s the ability to support making a critical business decision.

I agree with Peter and the ISO 8000 standards because we can’t simply take an accuracy metric on a data quality dashboard (or however else the assertion is presented to us) at face value without understanding how the metric is both defined and measured.

However, even when well defined and properly measured, data accuracy is still only an assertion.  Oftentimes, the only way to verify the assertion is by putting the data to its intended use.

If by using it you discover that the data is inaccurate, then by having established what the assertion of accuracy was based on, you have a head start on performing root cause analysis, enabling faster resolution of the issues—not only with the data, but also with the business and technical processes used to define and measure data accuracy.

 

Related Posts

Worthy Data Quality Whitepapers (Part 1)

Why isn’t our data quality worse?

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Data Quality and the Cupertino Effect

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-Tip: “There is no point in monitoring data quality...”

DQ-Tip: “Don't pass bad data on to the next person...”

DQ-Tip: “...Go talk with the people using the data”

DQ-Tip: “Data quality is about more than just improving your data...” 

DQ-Tip: “Start where you are...”

Thursday
Sep 09, 2010

Why isn’t our data quality worse?

In psychology, the term negativity bias is used to explain how bad evokes a stronger reaction than good in the human mind.  Don’t believe that theory?  Compare receiving an insult with receiving a compliment—which one do you remember more often?

Now, this doesn’t mean the dark side of the Force is stronger; it simply means that we all have a natural tendency to focus more on the negative aspects, rather than on the positive aspects, of most situations, including data quality.

In the aftermath of poor data quality negatively impacting decision-critical enterprise information, the natural tendency is for a data quality initiative to begin by focusing on the now painfully obvious need for improvement, essentially asking the question:

Why isn’t our data quality better?

Although this type of question is a common reaction to failure, it is also indicative of the problem-seeking mindset caused by our negativity bias.  However, Chip and Dan Heath, authors of the great book Switch, explain that even in failure, there are flashes of success, and following these “bright spots” can illuminate a road map for action, encouraging a solution-seeking mindset.

“To pursue bright spots is to ask the question: What’s working, and how can we do more of it?

Sounds simple, doesn’t it?  Yet, in the real-world, this obvious question is almost never asked.

Instead, the question we ask is more problem focused: What’s broken, and how do we fix it?”

Why isn’t our data quality worse?

For example, let’s pretend that a data quality assessment is performed on a data source used to make critical business decisions.  With the help of business analysts and subject matter experts, it’s verified that this critical source has an 80% data accuracy rate.

The common approach is to ask the following questions (using a problem-seeking mindset):

  • Why isn’t our data quality better?
  • What is the root cause of the 20% inaccurate data?
  • What process (business or technical, or both) is broken, and how do we fix it?
  • What people are responsible, and how do we correct their bad behavior?

But why don’t we ask the following questions (using a solution-seeking mindset):

  • Why isn’t our data quality worse?
  • What is the root cause of the 80% accurate data?
  • What process (business or technical, or both) is working, and how do we re-use it?
  • What people are responsible, and how do we encourage their good behavior?

I am not suggesting that we abandon the first set of questions, especially since there are times when a problem-seeking mindset might be a better approach (after all, it does also incorporate a solution-seeking mindset—albeit after a problem is identified).

I am simply wondering why we so often never even consider asking the second set of questions.
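Both sets of questions can be asked of the same assessment results.  The sketch below breaks an overall 80% accuracy rate down by where the data came from, so the process that is working shows up alongside the one that is broken.  The entry channels and the counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical assessment results: (entry channel, verified accurate?)
assessed = [
    ("web_form", True), ("web_form", True), ("web_form", True), ("web_form", True),
    ("call_center", True), ("call_center", False),
    ("batch_import", True), ("batch_import", True), ("batch_import", True),
    ("batch_import", False),
]


def accuracy_by_source(records):
    """Return the share of accurate records for each entry channel."""
    totals = defaultdict(int)
    accurate = defaultdict(int)
    for source, is_accurate in records:
        totals[source] += 1
        accurate[source] += is_accurate  # True counts as 1, False as 0
    return {source: accurate[source] / totals[source] for source in totals}


rates = accuracy_by_source(assessed)
overall = sum(ok for _, ok in assessed) / len(assessed)
bright_spot = max(rates, key=rates.get)   # what is working, and can we re-use it?
trouble_spot = min(rates, key=rates.get)  # what is broken, and how do we fix it?
print(overall, bright_spot, trouble_spot)
```

The same breakdown answers both the problem-seeking and the solution-seeking question; the only difference is which end of the list gets attention.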

Most data quality initiatives focus on developing new solutions—and not re-using existing solutions.

Most data quality initiatives focus on creating new best practices—and not leveraging existing best practices.

Perhaps you can be the chosen one who will bring balance to the data quality initiative by asking both questions:

Why isn’t our data quality better?  Why isn’t our data quality worse?

Monday
Aug 23, 2010

The Real Data Value is Business Insight

[Figure: Data Values for COUNTRY]

Understanding your data usage is essential to improving its quality, and therefore, you must perform data analysis on a regular basis.

A data profiling tool can help you by automating some of the grunt work needed to begin your data analysis, such as generating levels of statistical summaries supported by drill-down details, including data value frequency distributions (like the ones shown to the left).
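For readers without a profiling tool at hand, a data value frequency distribution can be sketched with Python's standard library.  The COUNTRY values below are hypothetical and are not the values from the original screenshot.

```python
from collections import Counter

# Hypothetical COUNTRY values; not the values from the original screenshot.
country_values = [
    "US", "USA", "US", "United States", "CA",
    "US", "", "Canada", "CA", "U.S.",
]

frequency = Counter(country_values)
total = len(country_values)

# Print each distinct value with its count and share of all records.
for value, count in frequency.most_common():
    label = value if value else "(blank)"
    print(f"{label:<15}{count:>3}{count / total:>8.0%}")
```

A summary like this is a starting point for asking questions about the data, such as whether the different spellings matter for the decisions the field supports.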

However, a common mistake is to hyper-focus on the data values.

Narrowing your focus to the values of individual fields is a mistake when it causes you to lose sight of the wider context of the data, which can cause other errors like mistaking validity for accuracy.

Understanding data usage is about analyzing its most important context—how your data is being used to make business decisions.

 

“Begin with the decision in mind”

In his excellent recent blog post It’s time to industrialize analytics, James Taylor wrote that “organizations need to be much more focused on directing analysts towards business problems.”  Although Taylor was writing about how, in advanced analytics (e.g., data mining, predictive analytics), “there is a tendency to let analysts explore the data, see what can be discovered,” I think this tendency is applicable to all data analysis, including less advanced analytics like data profiling and data quality assessments.

Please don’t misunderstand—Taylor and I are not saying that there is no value in data exploration, because, without question, it can definitely lead to meaningful discoveries.  And I continue to advocate that the goal of data profiling is not to find answers, but instead, to discover the right questions.

However, as Taylor explained, it is because “the only results that matter are business results” that data analysis should always “begin with the decision in mind.  Find the decisions that are going to make a difference to business results—to the metrics that drive the organization.  Then ask the analysts to look into those decisions and see what they might be able to predict that would help make better decisions.”

Once again, although Taylor is discussing predictive analytics, this cogent advice should guide all of your data analysis.

 

The Real Data Value is Business Insight

Returning to data quality assessments, which create and monitor metrics based on summary statistics provided by data profiling tools (like the ones shown in the mockup to the left), elevating what are low-level technical metrics up to the level of business relevance will often establish their correlation with business performance, but will not establish metrics that drive—or should drive—the organization.

Although built from the bottom-up by using, for the most part, the data value frequency distributions, these metrics lose sight of the top-down fact that business insight is where the real data value lies.

However, data quality metrics such as completeness, validity, accuracy, and uniqueness, which are just a few common examples, should definitely be created and monitored—unfortunately, a single straightforward metric called Business Insight doesn’t exist.

But let’s pretend that my other mockup metrics were real—50% of the data is inaccurate and there is an 11% duplicate rate.

Oh, no!  The organization must be teetering on the edge of oblivion, right?  Well, 50% accuracy does sound really bad, basically like your data’s accuracy is no better than flipping a coin.  However, which data is inaccurate, and far more important, is the inaccurate data actually being used to make a business decision?

As for the duplicate rate, I am often surprised by the visceral reaction it can trigger, such as: “how can we possibly claim to truly understand who our most valuable customers are if we have an 11% duplicate rate?”

So, would reducing your duplicate rate to only 1% automatically result in better customer insight?  Or would it simply mean that the data matching criteria were too conservative (e.g., requiring an exact match on all “critical” data fields), preventing you from discovering how many duplicate customers you have?  (Or maybe the 11% indicates the matching criteria were too aggressive).
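How much the duplicate rate depends on the matching criteria can be shown with a small sketch.  The customer records and the three matching keys below are hypothetical, and real matching tools use far more sophisticated techniques; the point is only that the same data produces very different rates.

```python
import re

# Hypothetical customer records: (name, postal code)
customers = [
    ("John Smith", "02101"),
    ("john smith", "02101"),
    ("Jon Smith", "02101"),   # may or may not be the same person
    ("Jane Smith", "02101"),  # a different person at the same postal code
    ("Mary Jones", "10001"),
    ("Mary Jones", "10001"),
]


def exact_key(record):
    # Conservative: every character of every field must match.
    return record


def normalized_key(record):
    # Looser: ignore case, punctuation, and repeated whitespace in the name.
    name, postal = record
    name = re.sub(r"[^a-z ]", "", name.lower())
    return (" ".join(name.split()), postal)


def surname_key(record):
    # Aggressive: same last name and postal code counts as a match.
    name, postal = normalized_key(record)
    return (name.split()[-1], postal)


def duplicate_rate(records, key):
    """Share of records that collapse into another record under the key."""
    distinct = {key(r) for r in records}
    return (len(records) - len(distinct)) / len(records)


for key in (exact_key, normalized_key, surname_key):
    print(f"{key.__name__:<15}{duplicate_rate(customers, key):.0%}")
```

Under the aggressive key, Jane Smith is merged with John Smith, so the highest rate here is also the least trustworthy one.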

My point is that accuracy and duplicate rates are just numbers—what determines if they are a good number or a bad number?

The fundamental question that every data quality metric you create must answer is: How does this provide business insight?

If a data quality (or any other data) metric cannot answer this question, then it is meaningless.  Meaningful metrics always represent business insight because they were created by beginning with the business decisions in mind.  Otherwise, your metrics could provide the comforting, but false, impression that all is well, or you could raise red flags that are really red herrings.

Instead of beginning data analysis with the business decisions in mind, many organizations begin with only the data in mind, which results in creating and monitoring data quality metrics that provide little, if any, business insight and decision support.

Although analyzing your data values is important, you must always remember that the real data value is business insight.

 

Related Posts

The First Law of Data Quality

Adventures in Data Profiling

Data Quality and the Cupertino Effect

Is your data complete and accurate, but useless to your business?

The Idea of Order in Data

You Can’t Always Get the Data You Want

Red Flag or Red Herring? 

DQ-Tip: “There is no point in monitoring data quality…”

Which came first, the Data Quality Tool or the Business Need?

Selling the Business Benefits of Data Quality

Tuesday
Jul 27, 2010

Is your data complete and accurate, but useless to your business?

Ensuring that complete and accurate data is being used to make critical daily business decisions is perhaps the primary reason why data quality is so vitally important to the success of your organization. 

However, this effort can sometimes take on a life of its own, where achieving complete and accurate data is allowed to become the raison d'être of your data management strategy—in other words, you start managing data for the sake of managing data.

When this phantom menace clouds your judgment, your data might be complete and accurate—but useless to your business.

 

Completeness and Accuracy

How much data is necessary to make an effective business decision?  Having complete (i.e., all available) data seems obviously preferable to incomplete data.  However, with data volumes always burgeoning, the unavoidable fact is that sometimes having more data only adds confusion instead of clarity, thereby becoming a distraction instead of helping you make a better decision.

Returning to my original question, how much data is really necessary to make an effective business decision? 

Accuracy, thanks to substantial assistance from my readers, was defined in a previous post as both the correctness of a data value within a limited context such as verification by an authoritative reference (i.e., validity) and the correctness of a valid data value within an extensive context including other data as well as business processes (i.e., accuracy).

Although accurate data is obviously preferable to inaccurate data, less than perfect data quality cannot be used as an excuse to delay making a critical business decision.  When it comes to the quality of the data being used to make these business decisions, you can’t always get the data you want, but if you try sometimes, you just might find, you get the business insight you need.

 

Data-driven Solutions for Business Problems

Obviously, there are even more dimensions of data quality beyond completeness and accuracy. 

However, although it’s about more than just improving your data, data quality can be misperceived to be an activity performed just for the sake of the data, when, in fact, data quality is an enterprise-wide initiative performed for the sake of implementing data-driven solutions for business problems, enabling better business decisions, and delivering optimal business performance.

In order to accomplish these objectives, data has to be not only complete and accurate, as well as whatever other dimensions you wish to add to your complete and accurate definition of data quality, but most important, data has to be useful to the business.

Perhaps the most common definition for data quality is “fitness for the purpose of use.” 

The missing word, which makes this definition both incomplete and inaccurate, puns intended, is “business.”  In other words, data quality is “fitness for the purpose of business use.”  How complete and how accurate (and however else) the data needs to be is determined by its business use—or uses since, in the vast majority of cases, data has multiple business uses.

 

Data, data everywhere

With silos replicating data as well as new data being created daily, managing all of the data is not only becoming impractical, but because we are too busy with the activity of trying to manage all of it, no one is stopping to evaluate usage or business relevance.

The fifth of the Five New Ideas From 2010 MIT Information Quality Industry Symposium, which is a recent blog post written by Mark Goloboy, was that “60-90% of operational data is valueless.”

“I won’t say worthless,” Goloboy clarified, “since there is some operational necessity to the transactional systems that created it, but valueless from an analytic perspective.  Data only has value, and is only worth passing through to the Data Warehouse if it can be directly used for analysis and reporting.  No news on that front, but it’s been more of the focus since the proliferation of data has started an increasing trend in storage spend.”

In his recent blog post Are You Afraid to Say Goodbye to Your Data?, Dylan Jones discussed the critical importance of designing an archive strategy for data, as opposed to the default position many organizations take, where burgeoning data volumes are allowed to proliferate because, in large part, no one wants to delete (or, at the very least, archive) any of the existing data. 

This often results in the data that the organization truly needs for continued success getting stuck in the long line of data waiting to be managed, and in many cases, behind data for which the organization no longer has any business use (and perhaps never even had the chance to use when the data was actually needed to make critical business decisions).

“When identifying data in scope for a migration,” Dylan advised, “I typically start from the premise that ALL data is out of scope unless someone can justify its existence.  This forces the emphasis back on the business to justify their use of the data.”

 

Data Memorioso

Funes el memorioso is a short story by Jorge Luis Borges, which describes a young man named Ireneo Funes who, as a result of a horseback riding accident, has lost his ability to forget.  Although Funes has a tremendous memory, he is so lost in the details of everything he knows that he is unable to convert the information into knowledge and unable, as a result, to grow in wisdom.

In Spanish, the word memorioso means “having a vast memory.”  When Data Memorioso is your data management strategy, your organization becomes so lost in all of the data it manages that it is unable to convert data into business insight and unable, as a result, to survive and thrive in today’s highly competitive and rapidly evolving marketplace.

In their great book Made to Stick: Why Some Ideas Survive and Others Die, Chip Heath and Dan Heath explained that “an accurate but useless idea is still useless.  If a message can’t be used to make predictions or decisions, it is without value, no matter how accurate or comprehensive it is.”  I believe that this is also true for your data and your organization’s business uses for it.

Is your data complete and accurate, but useless to your business?

 

Related Posts

Data Quality and the Cupertino Effect

Data Rock Stars: The Rolling Forecasts

Data!

Data, data everywhere, but where is data quality?

DQ-Tip: “There is no point in monitoring data quality…”

DQ-Tip: “Data quality is about more than just improving your data...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

The First Law of Data Quality

Thursday, July 15, 2010

Data Quality and the Cupertino Effect

The Cupertino Effect can occur when you accept the suggestion of a spellchecker program, which was attempting to assist you with a misspelled word (or what it “thinks” is a misspelling because it cannot find an exact match for the word in its dictionary). 

Although the suggestion (or in most cases, a list of possible words is suggested) is indeed spelled correctly, it might not be the word you were trying to spell, and in some cases, by accepting the suggestion, you create a contextually inappropriate result.

It’s called the “Cupertino” effect because, with older programs, the word “cooperation” was listed in the spellchecking dictionary only in hyphenated form (i.e., “co-operation”), so the spellchecker suggested “Cupertino” instead (i.e., the California city that is home to the worldwide headquarters of Apple, Inc., which essentially guarantees its presence in all spellchecking dictionaries).

By accepting the suggestion of a spellchecker program (and if there’s only one suggested word listed, don’t we always accept it?), a sentence where we intended to write something like:

“Cooperation is vital to our mutual success.”

Becomes instead:

“Cupertino is vital to our mutual success.”

And then confusion ensues (or hilarity—or both).
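The reason a spellchecker cannot catch this can be sketched in a few lines. This is a toy illustration, assuming a tiny hypothetical word list (`DICTIONARY`) rather than any real spellchecker's dictionary or API. A dictionary lookup is a validity check only, so both sentences pass:

```python
import re

# Hypothetical toy dictionary; a real spellchecker's word list is far larger.
DICTIONARY = {
    "cooperation", "cupertino", "is", "vital", "to", "our", "mutual", "success",
}

def misspelled_words(sentence):
    """Return the words not found in the dictionary (a validity check only)."""
    words = re.findall(r"[A-Za-z'-]+", sentence)
    return [w for w in words if w.lower() not in DICTIONARY]

intended = "Cooperation is vital to our mutual success."
published = "Cupertino is vital to our mutual success."

print(misspelled_words(intended))   # []
print(misspelled_words(published))  # []
```

Nothing in the lookup knows what the sentence was supposed to say, so the contextually wrong word is reported as clean.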

Beyond being a data quality issue for unstructured data (e.g., documents, e-mail messages, blog posts, etc.), the Cupertino Effect reminded me of the accuracy versus context debate.

 

“Data quality is primarily about context not accuracy...”

This Data Quality (DQ) Tip from last September sparked a nice little debate in the comments section.  The complete DQ-Tip was:

“Data quality is primarily about context not accuracy. 

Accuracy is part of the equation, but only a very small portion.”

Therefore, the key point wasn’t that accuracy isn’t important, but simply to emphasize that context is more important. 

In her fantastic book Executing Data Quality Projects, Danette McGilvray defines accuracy as “a measure of the correctness of the content of the data (which requires an authoritative source of reference to be identified and accessible).”

Returning to the Cupertino Effect for a moment, the spellchecking dictionary provides an identified, accessible, and somewhat authoritative source of reference—and “Cupertino” is correct data content for representing the name of a city in California. 

However, absent a context within which to evaluate accuracy, how can we determine the correctness of the content of the data?

 

The Free-Form Effect

Let’s use a different example.  A common root cause of poor quality for structured data is: free-form text fields.

Regardless of how well the metadata description is written or how well the user interface is designed, if a free-form text field is provided, then you will essentially be allowed to enter whatever you want for the content of the data (i.e., the data value).

For example, a free-form text field is provided for entering the Country associated with your postal address.

Therefore, you could enter data values such as:

Brazil
United States of America
Portugal
United States
República Federativa do Brasil
USA
Canada
Federative Republic of Brazil
Mexico
República Portuguesa
U.S.A.
Portuguese Republic

However, you could also enter data values such as:

Gondor
Gnarnia
Rohan
Citizen of the World
The Land of Oz
The Island of Sodor
Berzerkistan
Lilliput
Brobdingnag
Teletubbyland
Poketopia
Florin

The first list contains real countries, but a lack of standard values introduces needless variations. The second list contains fictional countries, which people like me enter into free-form fields to either prove a point or simply to amuse myself (well okay—both).

The most common solution is to provide a drop-down box of standard values, such as those provided by an identified, accessible, and authoritative source of reference—the ISO 3166 standard country codes.
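For data already captured in a free-form field, the same standard values can be applied after the fact. Here is a minimal sketch, assuming a hypothetical hand-built lookup table (`COUNTRY_VARIANTS`) that covers only the variations from the first list; only the two-letter codes themselves come from ISO 3166:

```python
# Hypothetical lookup table: free-form variations mapped to ISO 3166 alpha-2 codes.
COUNTRY_VARIANTS = {
    "brazil": "BR",
    "república federativa do brasil": "BR",
    "federative republic of brazil": "BR",
    "united states of america": "US",
    "united states": "US",
    "usa": "US",
    "u.s.a.": "US",
    "portugal": "PT",
    "república portuguesa": "PT",
    "portuguese republic": "PT",
    "canada": "CA",
    "mexico": "MX",
}

def standardize_country(value):
    """Return the alpha-2 code for a known variation, or None if unrecognized."""
    return COUNTRY_VARIANTS.get(value.strip().casefold())

print(standardize_country("U.S.A."))                          # US
print(standardize_country("República Federativa do Brasil"))  # BR
print(standardize_country("Gondor"))                          # None
```

Unrecognized values come back as `None` instead of being guessed at, so fictional countries are flagged for review and not silently mapped to a real one.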

Problem solved—right?  Maybe—but maybe not. 

Yes, I could now choose BR, US, PT, CA, MX (the ISO 3166 alpha-2 codes for Brazil, United States, Portugal, Canada, Mexico), which are the valid and standardized country code values for the countries from my first list above—and I would not be able to find any of my fictional countries listed in the new drop-down box.

However, I could also choose DO, RE, ME, FI, SO, LA, TT, DE (Dominican Republic, Réunion, Montenegro, Finland, Somalia, Lao People’s Democratic Republic, Trinidad and Tobago, Germany), all of which are valid and standardized country code values, but all of which are also contextually invalid for my postal address.
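The gap between a valid code and a contextually appropriate one can be sketched as two separate checks. This is an illustration under stated assumptions: `VALID_CODES` holds only the codes mentioned in this post (not the full standard), and `POSTAL_PATTERNS` is a hypothetical, simplified set of postal code formats used as the "other data" that supplies context:

```python
import re

# Only the ISO 3166 alpha-2 codes mentioned in this post, not the full standard.
VALID_CODES = {
    "BR", "US", "PT", "CA", "MX",
    "DO", "RE", "ME", "FI", "SO", "LA", "TT", "DE",
}

# Hypothetical, simplified postal code formats used only as context.
POSTAL_PATTERNS = {
    "US": r"\d{5}(-\d{4})?",
    "CA": r"[A-Z]\d[A-Z] ?\d[A-Z]\d",
    "DE": r"\d{5}",
}

def is_valid(code):
    """Validity: the code appears in the authoritative reference."""
    return code in VALID_CODES

def fits_context(code, postal_code):
    """Context: the code agrees with other data in the same record.

    Returns None when no context is available for that country.
    """
    pattern = POSTAL_PATTERNS.get(code)
    if pattern is None:
        return None
    return re.fullmatch(pattern, postal_code) is not None

print(is_valid("DE"))                 # True
print(fits_context("DE", "K1A 0B1"))  # False
print(fits_context("CA", "K1A 0B1"))  # True
```

Even this check only narrows things down: a five-digit U.S. ZIP code also matches the simplified German pattern, so agreeing with the context is evidence of accuracy, not proof of it.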

 

Accuracy: With or Without Context?

Accuracy is only one of the many dimensions of data quality—and you may have a completely different definition for it. 

Paraphrasing Danette McGilvray, accuracy is a measure of the validity of data values, as verified by an authoritative reference. 

My question is what about context?  Or more specifically, should accuracy be defined as a measure of the validity of data values, as verified by an authoritative reference, and within a specific context?

Please note that I am only trying to define the accuracy dimension of data quality, and not data quality itself.

Therefore, please resist the urge to respond with “fitness for the purpose of use” since even if you want to argue that “context” is just another word meaning “use” then next we will have to argue over the meaning of the word “fitness” and before you know it, we will be arguing over the meaning of the word “meaning.”

Please accurately share your thoughts (with or without context) about accuracy and context—by posting a comment below.

Wednesday, September 23, 2009

DQ-Tip: “Data quality is primarily about context not accuracy...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Data quality is primarily about context not accuracy. 

Accuracy is part of the equation, but only a very small portion.”

This DQ-Tip is from Rick Sherman's recent blog post summarizing the TDWI Boston Chapter Meeting at MIT.

 

I define data using the Dragnet definition: it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g., customers, vendors, suppliers).  A common definition for data quality is fitness for the purpose of use; the common challenge is that data has multiple uses, each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I define information as data in use or data in action.

Alternatively, information can be defined as data in context.

Quality, as Sherman explains, “is in the eyes of the beholder, i.e. the business context.”

 

Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

The General Theory of Data Quality

The Data-Information Continuum