Saturday, 18 Jul 2009

The Very True Fear of False Positives

Data matching is commonly defined as the comparison of two or more records to determine whether they correspond to the same real-world entity (i.e., are duplicates) or represent some other data relationship (e.g., a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels).  Ranking is often used as a primary method in differentiating the three possible result categories:

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review
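As a minimal sketch of how a ranking can drive these three categories, assume the match score is a number between 0 and 1 and that two cutoffs separate the categories. The threshold values and the names used here (`MatchResult`, `categorize`, `lower`, `upper`) are illustrative assumptions, not taken from any particular data matching product.

```python
from enum import Enum


class MatchResult(Enum):
    AUTO_MATCH = "automatic match"
    AUTO_NON_MATCH = "automatic non-match"
    POTENTIAL_MATCH = "potential match (manual review)"


def categorize(score: float, lower: float = 0.60, upper: float = 0.90) -> MatchResult:
    """Place a match score into one of the three result categories.

    The default cutoffs are hypothetical; real values depend on the data
    and on how much manual review the business can afford.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return MatchResult.AUTO_MATCH
    if score < lower:
        return MatchResult.AUTO_NON_MATCH
    return MatchResult.POTENTIAL_MATCH


if __name__ == "__main__":
    for s in (0.95, 0.75, 0.30):
        print(s, "->", categorize(s).value)
```

Widening the gap between the two cutoffs sends more record pairs to manual review; narrowing it automates more decisions at the cost of more errors on either side.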

All data matching techniques must also face the daunting challenge of what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched
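To make the two heads of the monster concrete, here is a small sketch that counts both kinds of error by comparing the pairs a matching process linked against a set of known true links. In practice the true links are rarely known in full and are usually approximated from a manually reviewed sample; the function name `count_errors` and the record identifiers are made up for illustration.

```python
def count_errors(predicted_pairs, true_pairs):
    """Compare linked record pairs against known true links.

    Pairs are treated as unordered, so ("A1", "B1") equals ("B1", "A1").
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}
    return {
        # matched, but should not have been matched
        "false_positives": len(predicted - truth),
        # did not match, but should have been matched
        "false_negatives": len(truth - predicted),
        # matched, and correctly so
        "true_positives": len(predicted & truth),
    }


if __name__ == "__main__":
    predicted = [("A1", "B1"), ("A2", "B9")]
    truth = [("B1", "A1"), ("A3", "B3")]
    print(count_errors(predicted, truth))
```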

For data examples that illustrate the challenge of false negatives and false positives, please refer to my Data Quality Pro articles:

 

Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution (at least not until Doc Brown and I get our patent-pending prototype of the redundant data capacitor working), because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique; every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the vendors in the Gartner Data Quality Magic Quadrant and from many of the so-called niche vendors.  Without exception, I have always been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.
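For readers who have not met the Fellegi-Sunter approach mentioned above, the following is a rough sketch of its core scoring idea, not a full implementation. Each compared field has an m-probability (the chance the field agrees when two records truly match) and a u-probability (the chance it agrees when they do not). Agreement contributes log2(m/u) to the total weight, disagreement contributes log2((1-m)/(1-u)), and the field weights are summed under the assumption of conditional independence. The field names and probability values below are invented for illustration; real values are estimated from the data.

```python
import math


def field_weight(agrees: bool, m: float, u: float) -> float:
    """Log2 likelihood-ratio weight for a single field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def match_weight(agreements: dict, m_probs: dict, u_probs: dict) -> float:
    """Sum field weights, assuming fields are conditionally independent."""
    return sum(
        field_weight(agreements[field], m_probs[field], u_probs[field])
        for field in agreements
    )


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only to show the arithmetic.
    m_probs = {"last_name": 0.95, "birth_year": 0.90, "postal_code": 0.85}
    u_probs = {"last_name": 0.01, "birth_year": 0.05, "postal_code": 0.02}
    agreements = {"last_name": True, "birth_year": True, "postal_code": False}
    print(round(match_weight(agreements, m_probs, u_probs), 2))
```

A higher total weight indicates stronger evidence that two records refer to the same entity; the total is then compared against upper and lower cutoffs to separate matches, non-matches, and potential matches.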

 

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives: the identification of records, within and across existing systems, that are not currently linked and are therefore preventing the enterprise from understanding the true data relationships that exist in its information assets.

However, the pursuit to reduce false negatives carries with it the risk of creating false positives. 

In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software than they are about the false negatives not linked.  After all, those records were not linked before investing in the data matching software.  Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.
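The difference between an overly cautious exact comparison and a more tolerant one can be sketched with Python's standard library. This example uses `difflib.SequenceMatcher` purely as a stand-in for a similarity measure; commercial data matching software uses its own, usually far more sophisticated, comparison functions. The helper names (`normalize`, `exact_match`, `similarity`) and the sample names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(value.lower().split())


def exact_match(a: str, b: str) -> bool:
    """Cautious approach: only identical normalized values match."""
    return normalize(a) == normalize(b)


def similarity(a: str, b: str) -> float:
    """Tolerant approach: a similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    a, b = "John Smith", "Jon Smith"
    print("exact:", exact_match(a, b))
    print("similarity:", round(similarity(a, b), 3))
```

The exact comparison rejects the pair outright, which avoids a possible false positive but guarantees a false negative if the two records really are the same person. The similarity score leaves room to route such a pair to manual review instead.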

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry's dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern.  It is these human and business factors, and not just the technology itself, that need to be taken into consideration when planning a data matching implementation. 

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.

 

Additional Resources

Data Quality and Record Linkage Techniques

The Art of Data Matching

Identifying Duplicate Customer Records - Case Study

Narrative Fallacy and Data Matching

Speaking of Narrative Fallacy

The Myth of Matching: Why We Need Entity Resolution

The Human Element in Identity Resolution

Probabilistic Matching: Sounds like a good idea, but...

Probabilistic Matching: Part Two


Reader Comments (6)

Over on the SmartData Collective, Henrik Liliendahl Sørensen commented:

Jim,

A fantastic post taking us from the academic foundations over the commercial considerations to the philosophy in the art of matching.

I do think that automated data matching is here to stay for a while. Alternative approaches, such as utilizing external reference data up front, do lower the need for fuzzy matching on the one hand, but on the other hand the spread of e-commerce and data mining in the cloud introduces increased demand for computerized data matching.

In e-commerce the customer becomes the data entry clerk thus setting some limitations in Data Governance thrown that way. The business case for e-commerce is very much about reducing staff involvement, so calls for excessive human interaction will not be well received. While some first movers in e-commerce have been pure players, now we see the bulk of the traditional retailers (and other industries) moving in on the net. They need a strong match between data captured offline and online.

For those interested in data matching, there is a new LinkedIn Group for Data Matching.

75 members joined during the first week, but I think the space is unlimited.

July 19, 2009 | Registered Commenter Jim Harris

Henrik,

Thanks for the feedback and for providing your insights, especially about data matching and e-commerce.

As more enterprise information assets move into the cloud, there will definitely be a growing need for organizations to use matching for understanding the relationships between data captured offline and online.

I completely agree that automated data matching is here to stay and I am not advocating an excessive human interaction to replace automated matching.

However, more human interaction, along with education about data matching concepts in less technical language, is necessary to help organizations understand how to best configure their automated data matching.

Organizations need to become comfortable with the fact that false positives will always occasionally occur and they shouldn’t let their very true fear of false positives allow them to implement less than optimal automated data matching.

Best Regards…

Jim

July 19, 2009 | Registered Commenter Jim Harris

It looks like we are on the same page Jim. Recently I wrote about having some more common sense around data quality:

Data Quality and Common Sense

LinkedIn Group for Data Matching then may be the place where we in a closed circle of wired matching addictives can throw around all the terms like deterministic record linkage, probabilistic learning elements, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or your personal favourite: The redundant data capacitor.

Hi Jim,

Great posting!

I have led two distinct projects. We used the same tool in both to find potential matches.

However, in one of them we had human inspection of the potential matches, but not on the other.

I believe when you have people validating the matches, you would like more false positives and fewer false negatives, since human eyes can easily distinguish them, and flag false positives as non-matches. On the other hand, when there is no human intervention, you want the opposite. As you state, in this latter case, it is better to miss a good match than to assume an incorrect match.

But I agree with you. Tools are biased towards eliminating false positives. In my project with human inspection, we had to complement the tool with other techniques to gather the false negatives. I wish there was better tuning based on what you're trying to accomplish.

Thanks!

Dalton

July 20, 2009 | Unregistered Commenter Dalton Cervo

Hi Dalton,

Thanks for the feedback and for sharing your insights.

I completely agree with you that when you have people validating the matches (especially during initial reviews when you are trying to verify and optimize the matching criteria), you want more false positives since human eyes can easily distinguish and flag them.

In fact, regardless of the data matching software I am using, I often initially set up the match to be overly aggressive and intentionally create false positives because they provide excellent test cases for understanding the client’s business rules, which is essential to the overall quality of the data matching implementation. I use the term “negativity bias” from psychology to explain that the concept of bad evokes a stronger reaction than the concept of good in the human mind, which makes it much easier for clients to explain the false positives that they don’t want than it is for them to explain the false negatives that they are trying to find.

One of my major pet peeves is when the very true fear of false positives is expressed by either a vendor or consultant (or both) by intentionally concealing false positives during reviews in order to pass user acceptance testing. Sadly, I have seen this attempted many times. In fact, I was almost fired once for refusing to conceal false positives during user acceptance.

Best Regards…

Jim

July 20, 2009 | Registered Commenter Jim Harris

Jim,

Great post!

You could also sum this up as: "You don't know what you are missing." Ignorance can be bliss.

It is essential that we understand the impact of both false positives and false negatives, and carefully weigh one against the other.

In patient matching, a match threshold may look very different from one in merging prospect lists for direct marketing companies, although even the latter companies are more and more realizing the importance of accuracy to drive efficiency and customer satisfaction.

Furthermore, we have to minimize both false positives and false negatives at the same time by picking the best matching approach.

Cheers,

Patrick

July 22, 2009 | Unregistered Commenter Patrick Austermann
