In my previous post, I argued that it’s often okay to declare data quality “good enough,” as opposed to demanding data perfection. However, a balanced perspective demands acknowledging that there are times when nothing less than perfect data quality will do. In fact, there are times when poor data quality can have deadly consequences.
In his book The Information: A History, a Theory, a Flood, James Gleick explained “pharmaceutical names are a special case: a subindustry has emerged to coin them, research them, and vet them. In the United States, the Food and Drug Administration reviews proposed drug names for possible collisions, and this process is complex and uncertain. Mistakes cause death.”
“Methadone, for opiate dependence, has been administered in place of Metadate, for attention-deficit disorder, and Taxol, a cancer drug, for Taxotere, a different cancer drug, with fatal results. Doctors fear both look-alike errors and sound-alike errors: Zantac/Xanax; Verelan/Virilon. Linguists devise scientific measures of the distance between names. But Lamictal and Lamisil and Ludiomil and Lomotil are all approved drug names.”
All data matching techniques, such as edit distance functions, phonetic comparisons, and more complex algorithms, provide a way to represent (e.g., numeric probabilities, weighted percentages, odds ratios, etc.) the likelihood that two non-exact matching data items are the same. No matter what data quality software vendors tell you, all data matching techniques are susceptible to false negatives (data that did not match, but should have) and false positives (data that matched, but should not have).
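To make those technique families concrete, here is a minimal sketch of two classic examples applied to the look-alike and sound-alike drug names from the quoted passage. Note the choice of Levenshtein distance (an edit distance function) and Soundex (a phonetic comparison) is my illustration; the post names the technique families, not these specific algorithms.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def soundex(name: str) -> str:
    """Classic Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    result, last = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            result += code
        if ch not in "HW":   # vowels reset the previous code; H and W do not
            last = code
    return (result + "000")[:4]

# Confusable pairs from the passage: a small edit distance or a shared
# (or near-identical) Soundex code flags them as likely matches.
for a, b in [("Zantac", "Xanax"), ("Lamictal", "Lamisil"),
             ("Taxol", "Taxotere")]:
    print(f"{a}/{b}: distance {levenshtein(a, b)}, "
          f"codes {soundex(a)}/{soundex(b)}")
```

Both functions reduce name similarity to a score, which is exactly where false positives and false negatives creep in: a threshold loose enough to catch Lamictal/Lamisil will inevitably flag some legitimately distinct names as well.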
This pharmaceutical example is one case where a false positive could be deadly, a time when poor data quality kills. Admittedly, it is an extreme example. What other examples can you offer where perfect data quality truly is a necessity?