Data Quality and the Cupertino Effect

The Cupertino Effect can occur when you accept the suggestion of a spellchecker program, which was attempting to assist you with a misspelled word (or what it “thinks” is a misspelling because it cannot find an exact match for the word in its dictionary). 

Although the suggestion (or in most cases, a list of possible words is suggested) is indeed spelled correctly, it might not be the word you were trying to spell, and in some cases, by accepting the suggestion, you create a contextually inappropriate result.

It’s called the “Cupertino” effect because with older programs the word “cooperation” was only listed in the spellchecking dictionary in hyphenated form (i.e., “co-operation”), making the spellchecker suggest “Cupertino” (i.e., the California city and home of the worldwide headquarters of Apple, Inc.,  thereby essentially guaranteeing it to be in all spellchecking dictionaries).

By accepting the suggestion of a spellchecker program (and if there’s only one suggested word listed, don’t we always accept it?), a sentence where we intended to write something like:

“Cooperation is vital to our mutual success.”

Becomes instead:

“Cupertino is vital to our mutual success.”

And then confusion ensues (or hilarity—or both).

Beyond being a data quality issue for unstructured data (e.g., documents, e-mail messages, blog posts, etc.), the Cupertino Effect reminded me of the accuracy versus context debate.

 

“Data quality is primarily about context not accuracy...”

This Data Quality (DQ) Tip from last September sparked a nice little debate in the comments section.  The complete DQ-Tip was:

“Data quality is primarily about context not accuracy. 

Accuracy is part of the equation, but only a very small portion.”

Therefore, the key point wasn’t that accuracy isn’t important, but simply to emphasize that context is more important. 

In her fantastic book Executing Data Quality Projects, Danette McGilvray defines accuracy as “a measure of the correctness of the content of the data (which requires an authoritative source of reference to be identified and accessible).”

Returning to the Cupertino Effect for a moment, the spellchecking dictionary provides an identified, accessible, and somewhat authoritative source of reference—and “Cupertino” is correct data content for representing the name of a city in California. 

However, absent a context within which to evaluate accuracy, how can we determine the correctness of the content of the data?

 

The Free-Form Effect

Let’s use a different example.  A common root cause of poor quality for structured data is: free-form text fields.

Regardless of how good the metadata description is written or how well the user interface is designed, if a free-form text field is provided, then you will essentially be allowed to enter whatever you want for the content of the data (i.e., the data value).

For example, a free-form text field is provided for entering the Country associated with your postal address.

Therefore, you could enter data values such as:

Brazil
United States of America
Portugal
United States
República Federativa do Brasil
USA
Canada
Federative Republic of Brazil
Mexico
República Portuguesa
U.S.A.
Portuguese Republic

However, you could also enter data values such as:

Gondor
Gnarnia
Rohan
Citizen of the World
The Land of Oz
The Island of Sodor
Berzerkistan
Lilliput
Brobdingnag
Teletubbyland
Poketopia
Florin

The first list contains real countries, but a lack of standard values introduces needless variations. The second list contains fictional countries, which people like me enter into free-form fields to either prove a point or simply to amuse myself (well okay—both).

The most common solution is to provide a drop-down box of standard values, such as those provided by an identified, accessible, and authoritative source of reference—the ISO 3166 standard country codes.

Problem solved—right?  Maybe—but maybe not. 

Yes, I could now choose BR, US, PT, CA, MX (the ISO 3166 alpha-2 codes for Brazil, United States, Portugal, Canada, Mexico), which are the valid and standardized country code values for the countries from my first list above—and I would not be able to find any of my fictional countries listed in the new drop-down box.

However, I could also choose DO, RE, ME, FI, SO, LA, TT, DE (Dominican Republic, Réunion, Montenegro, Finland, Somalia, Lao People’s Democratic Republic, Trinidad and Tobago, Germany), all of which are valid and standardized country code values, however all of them are also contextually invalid for my postal address.

 

Accuracy: With or Without Context?

Accuracy is only one of the many dimensions of data quality—and you may have a completely different definition for it. 

Paraphrasing Danette McGilvray, accuracy is a measure of the validity of data values, as verified by an authoritative reference. 

My question is what about context?  Or more specifically, should accuracy be defined as a measure of the validity of data values, as verified by an authoritative reference, and within a specific context?

Please note that I am only trying to define the accuracy dimension of data quality, and not data quality

Therefore, please resist the urge to respond with “fitness for the purpose of use” since even if you want to argue that “context” is just another word meaning “use” then next we will have to argue over the meaning of the word “fitness” and before you know it, we will be arguing over the meaning of the word “meaning.”

Please accurately share your thoughts (with or without context) about accuracy and context—by posting a comment below.