Data Quality and the Cupertino Effect
Jim Harris in Books, Data Quality, Debates
Tagged Accuracy, Best of 2010, Danette McGilvray
Thursday, July 15, 2010 at 3:00AM
The Cupertino Effect can occur when you accept the suggestion of a spellchecker program that was attempting to assist you with a misspelled word (or what it “thinks” is a misspelling because it cannot find an exact match for the word in its dictionary).
Although the suggestion (or, in most cases, each word in a list of suggestions) is indeed spelled correctly, it might not be the word you were trying to spell, and in some cases accepting it creates a contextually inappropriate result.
It’s called the “Cupertino” effect because with older programs the word “cooperation” was only listed in the spellchecking dictionary in hyphenated form (i.e., “co-operation”), making the spellchecker suggest “Cupertino” (i.e., the California city and home of the worldwide headquarters of Apple, Inc., thereby essentially guaranteeing it to be in all spellchecking dictionaries).
When we accept the suggestion of a spellchecker program (and if there’s only one suggested word listed, don’t we always accept it?), a sentence we intended to write as something like:
“Cooperation is vital to our mutual success.”
Becomes instead:
“Cupertino is vital to our mutual success.”
And then confusion ensues (or hilarity—or both).
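The mechanism can be sketched in a few lines of Python. This is only a toy illustration: the tiny word list is made up and is assumed to lack any spelling of “cooperation,” and the standard library's difflib.get_close_matches stands in for whatever matching algorithm a real spellchecker uses, so real suggestion lists will differ.

```python
import difflib

# Hypothetical, deliberately tiny dictionary. It is assumed to lack the
# word the writer actually wants ("cooperation").
DICTIONARY = ["cupertino", "vital", "mutual", "success"]


def suggest(word, dictionary=DICTIONARY):
    """Return the closest dictionary word, or None if nothing is close.

    A word that is already in the dictionary is returned unchanged.
    """
    lowered = word.lower()
    if lowered in dictionary:
        return lowered
    # n=1 mimics a spellchecker that offers a single suggestion.
    matches = difflib.get_close_matches(lowered, dictionary, n=1)
    return matches[0] if matches else None


suggestion = suggest("Cooperation")
# The suggestion is spelled correctly (it is in the dictionary), but it is
# not the word the writer intended.
print(suggestion)
```

The check that the suggestion is “in the dictionary” passes, which is exactly why the error slips through: nothing in the spellchecker knows what the sentence was supposed to say.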
Beyond being a data quality issue for unstructured data (e.g., documents, e-mail messages, blog posts, etc.), the Cupertino Effect reminded me of the accuracy versus context debate.
“Data quality is primarily about context not accuracy...”
This Data Quality (DQ) Tip from last September sparked a nice little debate in the comments section. The complete DQ-Tip was:
“Data quality is primarily about context not accuracy.
Accuracy is part of the equation, but only a very small portion.”
The key point wasn’t that accuracy is unimportant, but that context is more important.
In her fantastic book Executing Data Quality Projects, Danette McGilvray defines accuracy as “a measure of the correctness of the content of the data (which requires an authoritative source of reference to be identified and accessible).”
Returning to the Cupertino Effect for a moment, the spellchecking dictionary provides an identified, accessible, and somewhat authoritative source of reference—and “Cupertino” is correct data content for representing the name of a city in California.
However, absent a context within which to evaluate accuracy, how can we determine the correctness of the content of the data?
The Free-Form Effect
Let’s use a different example. A common root cause of poor quality for structured data is the free-form text field.
Regardless of how well the metadata description is written or how well the user interface is designed, if a free-form text field is provided, then you will essentially be allowed to enter whatever you want for the content of the data (i.e., the data value).
For example, a free-form text field is provided for entering the Country associated with your postal address.
Therefore, you could enter data values such as:
Brazil
United States of America
Portugal
United States
República Federativa do Brasil
USA
Canada
Federative Republic of Brazil
Mexico
República Portuguesa
U.S.A.
Portuguese Republic
However, you could also enter data values such as:
Gondor
Gnarnia
Rohan
Citizen of the World
The Land of Oz
The Island of Sodor
Berzerkistan
Lilliput
Brobdingnag
Teletubbyland
Poketopia
Florin
The first list contains real countries, but a lack of standard values introduces needless variations. The second list contains fictional countries, which people like me enter into free-form fields either to prove a point or simply to amuse ourselves (well, okay, both).
The most common solution is to provide a drop-down box of standard values, such as those provided by an identified, accessible, and authoritative source of reference—the ISO 3166 standard country codes.
Problem solved—right? Maybe—but maybe not.
Yes, I could now choose BR, US, PT, CA, MX (the ISO 3166 alpha-2 codes for Brazil, United States, Portugal, Canada, Mexico), which are the valid and standardized country code values for the countries from my first list above—and I would not be able to find any of my fictional countries listed in the new drop-down box.
However, I could also choose DO, RE, ME, FI, SO, LA, TT, DE (Dominican Republic, Réunion, Montenegro, Finland, Somalia, Lao People’s Democratic Republic, Trinidad and Tobago, Germany), all of which are valid and standardized country code values, but all of which are also contextually invalid for my postal address.
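To make the distinction concrete, here is a minimal Python sketch. The code set is only a small excerpt of the ISO 3166 alpha-2 list, and the city-to-country lookup is a hypothetical stand-in for whatever context a real address record would supply (postal code, city, and so on).

```python
# Small excerpt of ISO 3166 alpha-2 codes, standing in for the full
# drop-down list.
ISO_3166_ALPHA2 = {
    "BR", "US", "PT", "CA", "MX",
    "DO", "RE", "ME", "FI", "SO", "LA", "TT", "DE",
}

# Hypothetical context: the country implied by the city on the address.
CITY_TO_COUNTRY = {"Lisbon": "PT", "Toronto": "CA", "Helsinki": "FI"}


def is_valid(code):
    """Valid: the code appears in the authoritative reference."""
    return code in ISO_3166_ALPHA2


def fits_context(code, city):
    """Contextually appropriate: the code agrees with the rest of the address."""
    return CITY_TO_COUNTRY.get(city) == code


# Check a correct code, a valid-but-wrong code, and a fictional country
# against an address in Lisbon.
for code in ["PT", "TT", "Gondor"]:
    print(code, is_valid(code), fits_context(code, "Lisbon"))
```

The drop-down box only enforces the first check. “TT” passes it just as easily as “PT” does, and only the second check, which needs something beyond the reference list, tells them apart.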
Accuracy: With or Without Context?
Accuracy is only one of the many dimensions of data quality—and you may have a completely different definition for it.
Paraphrasing Danette McGilvray, accuracy is a measure of the validity of data values, as verified by an authoritative reference.
My question is: what about context? Or more specifically, should accuracy be defined as a measure of the validity of data values, as verified by an authoritative reference, and within a specific context?
Please note that I am only trying to define the accuracy dimension of data quality, and not data quality.
Therefore, please resist the urge to respond with “fitness for the purpose of use” since even if you want to argue that “context” is just another word meaning “use” then next we will have to argue over the meaning of the word “fitness” and before you know it, we will be arguing over the meaning of the word “meaning.”
Please accurately share your thoughts (with or without context) about accuracy and context—by posting a comment below.



Reader Comments (14)
Great article, Jim.
I have an interesting coincidence story to tell. I was listening to the recently released "Oops" episode of the WNYC Radiolab podcast when I saw your tweet about the "Cupertino effect."
So I read your article above, and remembered, "wait, didn't I reference WNYC Radiolab in a comment on Jim's inner beagle post? And didn't he say thanks for the introduction to it?" So I gotta think that you first heard about the Cupertino effect on Radiolab. Am I right?
If so (or even if not), great application of that amusing effect to your data quality and context post. I'm going to have to start putting a Radiolab tag on my comments of your blog.
Smaller coincidence, considering the Cupertino effect: when I type "Cupertino" into this editor, it's flagged as a bad spelling, and the suggestions for a change are "Pertinacious" and "Pertinent," but alas, no "cooperation."
Thanks for your comment, Alan.
Yes, you are correct! I followed your previous comment recommendation and started listening to WNYC Radiolab, and it was during their Oops podcast that I heard about the Cupertino Effect.
And I too have noticed that the Cupertino Effect appears to now be over-corrected in most spellcheckers. Therefore, if you were actually trying to spell the name of that city in California, you are going to have a rather pertinacious problem.
Perhaps we should call that the Anti-Cupertino Effect?
Best Regards,
Jim
It seems to me, Jim, that there is a danger here in confusing accuracy with validity.
You can, of course, choose any valid country code from a drop down. But accuracy (in my definition) is a measure of the extent to which a value reflects the real world entity, transaction etc. to which it refers. So, choosing TT to represent the real world entity that is the country in which I live (NL) is not accurate. And I don't see this as really relating to context, which is really an information issue (don't get me started on the difference between data and information :-) )
Thanks for your comment, Graham.
First, I must admit that I struggled against the urge to include the data versus information debate in this blog post because, as you alluded, that opens up a much larger thread of discussion and debate.
Second, I can't help but wonder if the challenge is semantics. When you say that accuracy measures the extent to which a data value reflects the real world entity or transaction to which it refers, doesn't that reflection mean that the real world object provides context?
Returning to the Danette McGilvray definition (which, by the way, I am not trying to criticize), I have seen many people create data quality dashboards with an accuracy dimension showing that the Country field was 100% accurate because all of the data values were verified by the authoritative reference of ISO 3166 standard country codes.
However, just because TT and NL are both valid data values, can the Country field be considered accurate without verifying the context of the record? That was the question I was trying to ask, but perhaps didn't articulate very well.
Best Regards,
Jim
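The dashboard scenario described in the reply above can be sketched in Python as follows. The records, the field names, and the idea of a separately verified country per record are all made up for illustration; the point is only that a validity rate and an accuracy rate computed over the same field can differ.

```python
# Excerpt of ISO 3166 alpha-2 codes (assumed reference list for this sketch).
VALID_CODES = {"NL", "TT", "US", "BR"}

# Hypothetical records: the stored code plus the country that was
# independently verified for that record.
records = [
    {"country": "NL", "verified_country": "NL"},
    {"country": "TT", "verified_country": "NL"},
    {"country": "US", "verified_country": "US"},
    {"country": "BR", "verified_country": "US"},
]


def rate(flags):
    """Share of True values, as a fraction between 0 and 1."""
    flags = list(flags)
    return sum(flags) / len(flags)


# Validity: every value is checked only against the reference list.
validity_rate = rate(r["country"] in VALID_CODES for r in records)

# Accuracy: every value is checked against what is true for its own record.
accuracy_rate = rate(r["country"] == r["verified_country"] for r in records)

print(f"valid: {validity_rate:.0%}, accurate: {accuracy_rate:.0%}")
# prints "valid: 100%, accurate: 50%"
```

A dashboard that labels the first number “accuracy” would report a perfect score for a field in which half of the sample values are wrong.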
Jim,
Semantics is part of the challenge - the very reason that I started building and maintaining an online Data Quality Glossary.
I noticed that the definitions of words and phrases vary according to the context in which they are used - a perfect example of understanding requiring context.
But for data I see this as a much clearer issue. For the country in which I live there is only one accurate answer (though that answer can be expressed in any number of ways: NL, NLD, Netherlands, The Netherlands, Nederland and so on).
Accuracy is a reflection of a real world situation, but that real world situation is information and not data (which, without stepping any more into that dangerous area, is why I see the context issue as being an information quality issue and not a data quality issue).
Thus, "NL" is an accurate way of describing the country in which I live, but would not suffice if written on an envelope sent from the USA - in that case "The Netherlands" would be better - so the context defines the way the data is presented (which is information ... sorry, there I go again!).
You'll know that I have little respect for most of what markets itself as data quality tools, and you've given a great example of why. Dashboards measuring validity are a step towards data quality, but won't help with measuring accuracy. So I, for one, don't agree with Danette's definition - it's flawed in the way you describe.
No wonder we get confused about such things!
Put 10 data quality experts into a room to discuss the difference between data and information, and watch the blood flow!
@Graham - And just imagine adding "Holland" into the mix!
@Crysta - exactly! But I resisted the temptation to add that to the list as I don't actually live in Holland - I live in Noord Holland, and as we're discussing accuracy . . .
@Graham — Excellent points, especially about the Battle Royale of Data versus Information. My recent contribution to it was my blog post about The Fourth Law of Data Quality.
@Crysta — What is this "Holland" you speak of? . . . Just kidding, Graham!
As always, as said, data versus information makes a good discussion.
I just performed a small experiment.
First, I found a data file on my computer. Lots of data in there being numbers and letters. And sure, what is interesting is the information I can derive for different purposes.
Then I deleted the data file and tried to see how much information was left behind.
Guess what? Not a bit.
@Henrik — First, thanks for the hilarious and insightful comment. But you know I have to ask: was the data valid and the information accurate before you deleted the data and thereby erased the information? :-)
Jim,
GREAT post, as usual, and great dialog amongst thought leaders!
I have to admit that I tend to think about this more the way Graham articulates accuracy vs. validity.
I usually separate those out by saying that validity is a binary measurement of whether a value is correct or incorrect within a certain context, whereas accuracy is a measurement of the valid value's "correctness" within the context of the other data surrounding it and/or the processes operating upon it.
So, validity answers the question "Is 'ZW' a valid country code?" and the answer would (currently) be "yes, on the African continent, or perhaps on planet Earth."
Accuracy answers the question "is it 2.5 degrees Celsius today in Redding, California?" - to which the answer would measure several things: is 2.5 degrees Celsius a valid temperature for Redding, CA? (yes it is), is it probable this time of year? (no, it has never been nearly that cold on this date), and are there any weather anomalies noted that might recommend that 2.5C is valid for Redding today? (no there aren't). So even though 2.5C is a valid air temperature, Redding, CA is a valid city and state combination, and 2.5C is valid for Redding in some parts of the year, that temperature has never been seen in Redding on July 15th and therefore it is probably not accurate.
Another "accuracy" use case is one I've run into before: Is it accurate that Customer A purchased $15,049.00 in <product> on order 123 on <this date>?
To answer this, you may look at the average order size for this product (in quantity and overall price), the average order sizes from Customer A (in quantity ordered and monetary value), any promotions that offer such pricing deals, etc.
Given that the normal CC charges for this customer are in the $50.00 to $150.00 range, and that the products ordered are on average $10.00 to $30.00, and that even the best customers normally do not order more than $200, and that there has never been a single order from this type of customer for this amount, then it is highly unlikely that a purchase of this size is accurate.
But it's still fuzzy in my head (as are many things!) . . .
My colleague, Larry Dubov, has done some significant work in applying information theory to measure the validity of an entire record by using probabilistic matching algorithms to estimate the probability that a record is "valid" - but you'll have to ask him to explain it!!!
Hope this is not confusing!
Cheers!
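The order-size example in the comment above amounts to a plausibility check against history, which might be sketched in Python like this. The function name, the tolerance factor, and the sample history are all assumptions for illustration (the amounts are simply picked from the $50.00 to $150.00 range quoted above), not a method described by the commenter.

```python
def is_plausible(amount, history, tolerance=2.0):
    """Return True if `amount` is at most `tolerance` times the largest
    amount previously seen for this customer.

    The tolerance of 2.0 is an arbitrary, hypothetical threshold.
    """
    if not history:
        return True  # no history to compare against, so nothing to flag
    return amount <= tolerance * max(history)


# Hypothetical past order amounts for Customer A.
history = [50.00, 72.50, 120.00, 150.00]

print(is_plausible(15049.00, history))  # False: far above anything seen before
print(is_plausible(140.00, history))    # True: within the usual range
```

A failed check does not prove the value is inaccurate. It only says the value is improbable given the surrounding data, which is a reason to verify it against the real-world order.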
@Marty — Thanks for your great comment. Let me see if I can summarize it both validly and accurately:
Validity = correctness of a data value within a limited context such as verification by an authoritative reference
Accuracy = correctness of a valid data value within an extensive context including other data as well as business processes
Therefore, a data value must first be determined to be valid before it can be determined to be accurate.
Returning to the example from the discussion between Graham and me, TT and NL are both valid country code values, but only NL is accurate for Graham's postal address.
Check out the excellent follow-up blog post expanding on this discussion, and written by Graham Rhind: Definition drift
Additionally, check out the online data quality glossary built and maintained by Graham Rhind: Data Quality Glossary
Hi, Jim!
I've just read your post Predictably Poor Data Quality on the DataFlux Community of Experts. Great!
My boss, a not-so-young lady, told me that I don't have a life, because I'm always thinking about Data Quality. I guess there is a connection between the post and this opinion.
We, OCDQP (Obsessive-Compulsive Data Quality People), can get satisfaction from this, although it is very difficult to explain how this is possible, to people who don't. Philosophers, tropical disease researchers (who cares about the mosquito? kill them all!), data practitioners ... we are brothers in arms.
That's why - I think - we are not IT folks and we cannot expect understanding and support from them. Actually, they see us as IT mosquitoes, zooming around their brilliant algorithmic brains (I've tested the software, it was fine, I don't know what happened, reality sucks, the user is dumb, where is my security blanket?).
Business people don't like us either, because if data is perfect they would become useless - automated processes will handle everything (and this is slowly becoming true).
So it's OK, resistance to Data Quality is part of the "people" problem. Information Technology is posing a great psychological dilemma to mankind. We all will become IT Luddites, we cannot support the perspective of a perfect information world.
Best regards and thanks for your work - it's very stimulating.