Wednesday, June 3, 2009

The Two Headed Monster of Data Matching

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

Data matching is commonly plagued by what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched
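
The two definitions can be sketched with set arithmetic. This is a minimal illustration only: the record IDs and pairings below are hypothetical, and it assumes the true matches are known, which in practice they rarely are.

```python
# Hypothetical record pairs, made up for illustration.
true_matches = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}
predicted_matches = {("A1", "B1"), ("A2", "B9")}

# False negatives: pairs that should have matched, but did not.
false_negatives = true_matches - predicted_matches

# False positives: pairs that matched, but should not have.
false_positives = predicted_matches - true_matches

print(sorted(false_negatives))
print(sorted(false_positives))
```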

 

I Fought The Two Headed Monster...

 

On a recent (mostly) business trip to Las Vegas, I scheduled a face-to-face meeting with a potential business partner that I had previously communicated with via phone and email only.  We agreed to a dinner meeting at a restaurant in the hotel/casino where I was staying. 

I would be meeting with the President/CEO and the Vice President of Business Development, a man and a woman respectively.

I was facing a real world data matching problem.

I knew their names, but I had no idea what they looked like.  Checking their company website and LinkedIn profiles didn't help - no photos.  I had neglected to get their mobile phone numbers; however, they had mine.

The restaurant was inside the casino and the only entrance was adjacent to a Starbucks that had tables and chairs facing the casino floor.  I decided to arrive at the restaurant 15 minutes early and camp out at Starbucks since anyone going near the restaurant would have to walk right past me.

I was more concerned about avoiding false positives.  I didn't want to walk up to every potential match and introduce myself since casino security would soon intervene (and I have seen enough movies to know that scene always ends badly). 

I decided to apply some probabilistic data matching principles to evaluate the mass of humanity flowing past me. 

If some of my matching criteria seem odd, please remember I was in a Las Vegas casino. 

I excluded from consideration all:

  • Individuals wearing a uniform or a costume
  • Groups consisting of more than two people
  • Groups consisting of two men or two women
  • Couples carrying shopping bags or souvenirs
  • Couples demonstrating a public display of affection
  • Couples where one or both were noticeably intoxicated
  • Couples where one or both were scantily clad
  • Couples where one or both seemed too young or too old

I carefully considered any:

  • Couples dressed in business attire or business casual attire
  • Couples pausing to wait at the restaurant entrance
  • Couples arriving close to the scheduled meeting time

I was quite pleased with myself for applying probabilistic data matching principles to a real world situation.
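
For readers who think in code, here is one hedged way the criteria above could be expressed: hard exclusion rules first, then a weight summed over the "carefully considered" signals, in the style of Fellegi-Sunter record linkage weights. The m and u probabilities, the signal names, and the function names are all assumptions invented for this sketch; the post itself gives no numbers.

```python
import math

# Assumed probabilities, invented for illustration:
#   m = chance a signal is present when the party really is the dinner companions
#   u = chance a signal is present for any other party walking past
SIGNALS = {
    "business_attire": (0.80, 0.10),
    "waits_at_entrance": (0.70, 0.05),
    "arrives_near_meeting_time": (0.90, 0.10),
}

def is_excluded(party_size: int, one_man_one_woman: bool) -> bool:
    """Hard rules: reject anything that is not a man-and-woman pair."""
    return party_size != 2 or not one_man_one_woman

def match_weight(observed: dict) -> float:
    """Sum log-likelihood-ratio weights over the observed signals."""
    total = 0.0
    for name, (m, u) in SIGNALS.items():
        if observed[name]:
            total += math.log2(m / u)              # signal present
        else:
            total += math.log2((1 - m) / (1 - u))  # signal absent
    return total

all_signals = {name: True for name in SIGNALS}
no_signals = {name: False for name in SIGNALS}
print(round(match_weight(all_signals), 2))  # positive: evidence for a match
print(round(match_weight(no_signals), 2))   # negative: evidence against
```

The design point is that a missing signal lowers the weight instead of rejecting the party outright, so no single criterion decides the outcome on its own.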

However, the scheduled meeting time passed.  At first, I simply assumed they might be running a little late or were delayed by traffic.  As the minutes continued to pass, I started questioning my matching criteria.

 

...And The Two Headed Monster Won

 

When the clock reached 30 minutes past the scheduled meeting time, my mobile phone rang.  My dinner companions were calling to ask if I was running late.  They had arrived on time, were inside the restaurant, and had already ordered.

Confused, I entered the restaurant.  Sure enough, there sat a man and a woman who had walked right past me.  I had excluded them from consideration because of how they were dressed.  The Vice President of Business Development was dressed in jeans, sneakers and a casual shirt.  The President/CEO was wearing shorts, sneakers and a casual shirt.

I had dismissed them as a vacationing couple.

I had been defeated by a false negative.

 

The Harsh Reality is that Monsters are Real

My data quality expertise could not guarantee victory in this particular battle with The Two Headed Monster. 

Monsters are real and the hero of the story doesn't always win.

And it doesn’t matter if the match algorithms I use are deterministic, probabilistic, or even supercalifragilistic. 

The harsh reality is that false negatives and false positives can be reduced, but never eliminated.
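
A small sketch of why the two error types trade off against each other. The similarity scores and true answers below are invented for illustration, and `errors_at` is a hypothetical helper; the point is only that moving a match threshold shifts errors from one head of the monster to the other.

```python
# Invented (similarity score, is_really_a_match) pairs for illustration.
scored_pairs = [
    (0.95, True), (0.85, True), (0.70, False),
    (0.65, True), (0.40, False), (0.30, True),
]

def errors_at(threshold: float) -> tuple:
    """Return (false_positives, false_negatives) at a given match threshold."""
    fp = sum(1 for score, real in scored_pairs if score >= threshold and not real)
    fn = sum(1 for score, real in scored_pairs if score < threshold and real)
    return fp, fn

for t in (0.9, 0.6, 0.2):
    print(t, errors_at(t))
# 0.9 (0, 3)  strict threshold: no false positives, three false negatives
# 0.6 (1, 1)
# 0.2 (2, 0)  loose threshold: no false negatives, two false positives
```

With this made-up data, no threshold reaches zero errors of both kinds, because a non-match scores higher than some true matches.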

 

Are You Fighting The Two Headed Monster?

Are you more concerned about false negatives or false positives?  Please share your battles with The Two Headed Monster.

 

Related Articles

Back in February and March, I published a five-part series of articles on data matching methodology on Data Quality Pro.

Parts 2 and 3 of the series provided data examples to illustrate the challenge of false negatives and false positives within the context of identifying duplicate customers.


Reader Comments (11)

Over on the SmartData Collective, Tanton Gibbs commented:

"I disagree completely that you used probabilistic matching. The way you describe it, your matching algorithm was completely deterministic. You either excluded a party or not. A probabilistic algorithm would have assigned some weight to the fact that the couple was a man and a woman and arriving at the appointed time. In fact, the probability that the couple was the correct couple would have increased as other parties were eliminated. Had you included other factors such as expected age, you might have gotten an even higher percentage. At the very least, walking to the restaurant to ask if your party had arrived would have helped ;-)"

And I responded:

Your criticism is very fair.

I didn't explain the frequency distributions, logarithmic equations, probabilistic weight assignments, comparisons to matching thresholds and multiple statistical evaluations of the same people that I was doing in my head.

I did explain my matching in a far more deterministic way.

However, my point (in addition to making fun of myself for being an idiot, which you did nicely pick up on) was more about the fact that false negatives and false positives are a challenge for any matching algorithm.

June 4, 2009 | Registered Commenter Jim Harris

On Twitter, Henrik Liliendahl Sørensen commented:

"Degrees of match confidence: Exact Match, Close Match, Clothes Match"

June 4, 2009 | Registered Commenter Jim Harris

Great post! I have found at times that clients view the matching algorithms as a technology that can transcend the simple, human logic of identifying two like entities.

I liked the way your post applied this logic in a practical way. After all, identifying duplication is really identifying the same people in two like forms.

In my experience, the ultimate goal in matching has been consolidation. As a result, my concern is focused on the identification of false positives. False negatives are a cause for concern as they decrease the effectiveness of a consolidation effort. However, a false positive represents the loss of opportunity. A falsely identified positive match will lead to the archival or, even worse, the removal of customers and as such should be mitigated at all times.

Better to miss an opportunity to decrease redundancy than to miss an opportunity at increasing profitability.

June 4, 2009 | Unregistered Commenter William Sharp

William,

Thanks for sharing your perspective.

In my experience, I have found that my clients' greater concern about false positives (for exactly the reasons that you excellently explained) motivated a cautious approach to duplicate identification that resulted in numerous false negatives, especially when they restricted the implementation to exact matching techniques.

False positives appear to be considered the more intimidating head of The Two Headed Monster.

Best Regards...

Jim

June 4, 2009 | Registered Commenter Jim Harris

You really hit the nail on the head with this one! In constantly thinking about identity resolution, I've talked about how even great software (in this case, your brain) can't overcome bad configuration assumptions (targets will be in business attire). For another recent take on this, see: The Human Element in Identity Resolution.

June 8, 2009 | Unregistered Commenter Bob Barker

Having spent some time in Vegas trying to meet up with people, I can say it is a fun way to entertain yourself without dropping coins and pulling on a handle. It just seems that the game was more interesting than the result. If you were with a woman she would have asked the maître d' and you would have been having a drink with them minutes after they arrived. It just seems to be a right vs. left brain discussion, paralysis by analysis.

If you want to see a great example of data analysis check out Visokio. Looked at these guys about five years ago, fantastically simple approach to multi-variant analysis, the whole world of "if, then, and".

Oh, I would have kept the scantily clad variable in my considered group.

June 9, 2009 | Unregistered Commenter peter

From the LinkedIn Group for The Greater IBM Connection, Larry van Onselen commented:

"Brilliant!!!

Well worth the read and now I can explain in layman's terms to my family what sort of challenges I handle in my job :-)

Awesome, thanks."

June 9, 2009 | Registered Commenter Jim Harris

From the LinkedIn Group for Master Data Management, Valentin Veytsman commented:

"As we called it 'over-consolidated' or 'under-consolidated' entities are the aftereffects of any match algorithm. Defining a proper threshold was the key tactic of my match days.

Also, what was worse for the business: a false negative or a false positive?

In our case, it was better to break one guy into two than have two different clients considered as one."

June 9, 2009 | Registered Commenter Jim Harris

I have two issues with your example.

One, although you had rules, you weren't probabilistic or deterministic, because the match didn't qualify for any of the sets of rules (for inclusion or exclusion).

In fact, (two) you were inferential, because that's what humans do - they infer from more knowledge than mere rules, and you dismissed the match for unforeseen but perfectly reasonable reasons (in general).

It's not a matter of whether a technique over-matches or under-matches, but measuring sameness and determining if the result is worth doing something about.

June 11, 2009 | Unregistered Commenter Jax Gibb

From the LinkedIn Group for Master Data Management Pros, Martin Doyle commented:

It seems more like a Venn diagram to me, made up of Black and White rings with Grey intersection.

There are three simple states:

1. Things are definite matches (Black)
2. Things might be matches (Grey)
3. Things are not matches – they are unique (White)

The challenge is minimizing the Grey so that machines (great with logic, poor with intuition) can work out the propensity or certainty of a set of records matching. And that's not always easy.
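
One common way to produce these three states is to compare a match score against two thresholds. The sketch below assumes a score between 0 and 1; the function name and the threshold values 0.6 and 0.9 are arbitrary choices for illustration, not recommendations.

```python
def classify(score: float, lower: float = 0.6, upper: float = 0.9) -> str:
    """Sort a match score into Black, Grey or White using two thresholds."""
    if score >= upper:
        return "black"  # definite match
    if score >= lower:
        return "grey"   # possible match, needs human review
    return "white"      # not a match

for s in (0.95, 0.75, 0.20):
    print(s, classify(s))
# 0.95 black
# 0.75 grey
# 0.2 white
```

Moving `lower` and `upper` closer together shrinks the Grey band, at the cost of pushing borderline pairs into Black or White where they may be wrong.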

Some of the challenges:

1. Missing or incomplete data - where two records might be the same yet there is missing data between them, e.g. one has a telephone number and the other does not. In this instance, should two null or empty strings score a 100% or a 0% match? Where one record has a value and the other does not, should the value be ignored, or scored 100% or 0% for that field?

2. Where Aliases occur - are Jim, Jimmy and James Harris the same guy? String wise Jim and James are miles off and if we were to rely on Soundex Jim is J500, James is J520, so they would not even be considered for matching.

As an overly simplistic example of string matching, the only two similarities between Jim and James are the letters "J" and "M", so 2 out of 5 characters = 40% match. However, if you throw away the vowels you have JM versus JMS, so 66.67% similarity, although you'll get a bunch of false positives from words containing J, M and S to contend with as matches.
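
The Soundex codes and the letter arithmetic above can be checked with a short sketch. The `soundex` function below is a plain implementation of standard American Soundex, and `shared_letter_ratio` is a hypothetical helper that mirrors the crude "shared letters over the longer length" measure from this comment; it is not a standard similarity metric and it counts each distinct letter only once.

```python
# Letter-to-digit table for standard American Soundex.
CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (result + "000")[:4]

def strip_vowels(text: str) -> str:
    """Drop A, E, I, O and U from an upper-cased copy of the text."""
    return "".join(c for c in text.upper() if c not in "AEIOU")

def shared_letter_ratio(a: str, b: str) -> float:
    """Distinct shared letters divided by the length of the longer string."""
    a, b = a.upper(), b.upper()
    return len(set(a) & set(b)) / max(len(a), len(b))

print(soundex("Jim"), soundex("James"))   # J500 J520
print(shared_letter_ratio("Jim", "James"))  # 0.4
print(shared_letter_ratio(strip_vowels("Jim"), strip_vowels("James")))
```

Under this implementation Jimmy also codes to J500, so Soundex groups Jim with Jimmy but still separates both from James.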

There are certainly many more issues which I won't go into now. However, I think you can see why Grey is a color we'd better get used to. The best we can do is minimize the problem; even differences between UK English and US English cause Grey!

So above are some of the reasons for Grey. As to the solutions, that's what smart guys in the data management industry work out. There is, in my opinion, no silver bullet, just solid strategies founded on many years of experience, successes and learning outcomes (some say failures). A combination of string, pattern, alias, weightings, grouping, linking, re-try and referencing will all add up to a level of certainty that two things match.

Having said all that, once they match and you are certain of it, what are you going to do with them? Which one, if any, is the perfect record? What attributes of each record can you trust? What will become the Single Customer Master, and how will you update all the siloed records across the enterprise so they are all congruent?

The guys who solve that will hold the keys to the kingdom.

June 17, 2009 | Registered Commenter Jim Harris

From the LinkedIn Group for Master Data Management Pros, Henrik Liliendahl Sørensen commented:

In response to Martin Doyle's comment, I like it when we go from defining the problems to lining up the solutions.

There is more on this matter here: Narrative Fallacy and Data Matching

I agree that we use machines for dividing the results into black, grey and white pots and that we tune the solutions so that:

• The black and white pots are compliant with the real world

• The grey pot is as small as possible, saving costly human interaction

June 17, 2009 | Registered Commenter Jim Harris
