Adventures in Data Profiling (Part 5)

In Part 4 of this series:  You went totally postal...shifting your focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

You learned that when a field is both 100% complete and has an extremely low cardinality, its most frequently occurring value could be its default value, and that forcing international addresses into country-specific data structures can cause data quality problems.  And with the expert assistance of Graham Rhind, we all learned more about international postal code formats.

In Part 5, you will continue your adventures in data profiling by completing your initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

As we discussed in Part 3 when we looked at the E-mail Address field, most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field. 

Postal Address Line 1 and Postal Address Line 2 are additional examples of the necessity of this analysis technique.  Not only is the cardinality of these fields very high, but they also have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record).  Some variations in postal addresses can be the result of data entry errors, the use of local conventions, or ignoring (or lacking) postal standards.

Additionally, postal address lines can sometimes contain overflow from other fields (e.g. Customer Name), or they can be used as a dumping ground for values without their own fields (e.g. Twitter username), values unable to conform to the limitations of their intended fields (e.g. countries with something analogous to a US state or CA province but incompatible with a two character field length), or comments (e.g. LDIY, which, as Steve Sarsfield discovered, warns us about the Large Dog In Yard).
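
To make this analysis technique more concrete, here is a minimal Python sketch (not modeled after any real profiling product) of how field formats might be constructed by classifying each character of a value – digits become “n” and letters become “a”, the convention used by the fictional data profiling tool in this series – and then counting the frequency of each resulting format.  The sample address lines are hypothetical.

```python
from collections import Counter

def field_format(value: str) -> str:
    """Classify each character of a field value: digits become 'n',
    letters become 'a', and spaces/punctuation are kept as-is -- the
    convention used by the fictional profiling tool in this series."""
    return "".join(
        "n" if ch.isdigit() else "a" if ch.isalpha() else ch
        for ch in value
    )

# Hypothetical postal address line values for illustration only
samples = ["123 MAIN ST", "PO BOX 456", "APT 7B", "LARGE DOG IN YARD"]
for fmt, count in Counter(field_format(v) for v in samples).most_common(20):
    print(f"{count:>3}  {fmt}")
```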

 

Postal Address Line 1

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 1:

Field Formats for Postal Address Line 1

The top twenty most frequently occurring field formats for Postal Address Line 1 collectively account for over 80% of the records with an actual value in this field for this data source.  All of these field formats appear to be common potentially valid structures.  Obviously, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 1 field?

 

Postal Address Line 2

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 2:

Field Formats for Postal Address Line 2

The top ten most frequently occurring field formats for Postal Address Line 2 collectively account for half of the records with an actual value in this sparsely populated field for this data source.  Some of these field formats show several common potentially valid structures.  Again, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 2 field?

 

Postal Address Validation

Many data quality initiatives include the implementation of postal address validation software.  This provides the capability to parse, identify, verify, and format a valid postal address by leveraging country-specific postal databases. 

Some examples of postal validation functionality include correcting misspelled street and city names, populating missing postal codes, and applying (within context) standard abbreviations for sub-fields such as directionals (e.g. N for North and E for East), street types (e.g. ST for Street and AVE for Avenue), and box types (e.g. BP for Boîte Postale and CP for Case Postale).  These standards not only vary by country, but can also vary within a country when there are multiple official languages.
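
As a rough illustration of this kind of standardization (and only an illustration – real postal validation software relies on country-specific postal databases, not hand-built dictionaries), here is a minimal dictionary-driven sketch in Python.  The abbreviation tables are deliberately tiny, and the function ignores the contextual subtleties mentioned above.

```python
# Tiny, illustrative abbreviation tables -- real postal validation software
# relies on country-specific postal databases, not hand-built dictionaries.
DIRECTIONALS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}
STREET_TYPES = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD"}

def standardize_line(line: str) -> str:
    """Apply standard abbreviations token by token.  Note that this ignores
    the contextual subtleties mentioned above (e.g. a directional that is
    actually part of a street name)."""
    standardized = []
    for token in line.upper().split():
        standardized.append(DIRECTIONALS.get(token, STREET_TYPES.get(token, token)))
    return " ".join(standardized)

print(standardize_line("123 North Main Street"))  # -> 123 N MAIN ST
```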

The presence of non-postal data can sometimes cause either validation failures (i.e. an inability to validate some records, not a process execution failure) or simply deletion of the unexpected values.  Therefore, some implementations will use a pre-process to extract the non-postal data prior to validation.

Most validation software will append one or more status fields indicating what happened to the records during processing.  It is a recommended best practice to perform post-validation analysis by not only looking at these status fields, but also comparing the record content before and after validation, in order to determine what modifications and enhancements have been performed.

 

What other analysis do you think should be performed for postal address?

 

In Part 6 of this series:  We will continue the adventures by analyzing the Account Number and Tax ID fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

The Only Thing Necessary for Poor Data Quality

“Demonstrate projected defects and business impacts if the business fails to act,” explains Dylan Jones of Data Quality Pro in his recent and remarkable post How To Deliver A Compelling Data Quality Business Case.

“Presenting a future without data quality management...leaves a simple take-away message – do nothing and the situation will deteriorate.”

I can not help but be reminded of the famous quote often attributed to the 18th century philosopher Edmund Burke:

“The only thing necessary for the triumph of evil, is for good men to do nothing.”

Or the even more famous quote often attributed to the long time ago Jedi Master Yoda:

“Poor data quality is the path to the dark side.  Poor data quality leads to bad business decisions. 

Bad business decisions lead to lost revenue.  Lost revenue leads to suffering.”

When you present the business case for your data quality initiative to executive management and other corporate stakeholders, demonstrate that poor data quality is not a theoretical problem – it is a real business problem that negatively impacts the quality of decision-critical enterprise information.

Preventing poor data quality is mission-critical.  Poor data quality will undermine the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

“The only thing necessary for Poor Data Quality – is for good businesses to Do Nothing.”

Related Posts

Hyperactive Data Quality (Second Edition)

Data Quality: The Reality Show?

Data Governance and Data Quality

Resistance is NOT Futile

Locutus of Borg

“Your opinion is irrelevant.  We wish to improve ourselves. 

We will add your business and technological distinctiveness to our own. 

Your culture will adapt to service us. 

You will be assimilated.  Resistance is futile.”

Continuing my Star Trek theme, which began with my previous post Hailing Frequencies Open, imagine that you have been called into the ready room to be told your enterprise has decided to implement the proven data quality framework known as Business Operations and Reporting Governance – the BORG.

 

Frameworks are NOT Futile

Please understand – I am an advocate for methodology and best practices, and there are certainly many excellent frameworks that are far from futile.  I have worked on many data quality initiatives that were following a framework and have seen varying degrees of success in their implementation.

However, the fictional BORG framework that I am satirizing exemplifies a general problem that I have with any framework that advocates a one-size-fits-all strategy, which I believe is an approach that is doomed to fail.

Any implemented framework must be customized to adapt to an organization's unique culture.  In part, this is necessary because implementing changes of any kind will be met with initial resistance.  An attempt at forcing a one-size-fits-all approach almost sends a message to the organization that everything they are currently doing is wrong, which will of course only increase the resistance to change.

 

Resistance is NOT Futile

Everyone has opinions – and opinions are never irrelevant.  Fundamentally, all change starts with changing people's minds. 

The starting point has to be improving communication and encouraging open dialogue.  This means listening to what people throughout the organization have to say and not just telling them what to do.  Keeping data aligned with business processes and free from poor quality requires getting people aligned and free to communicate their concerns.

Obviously, there will be dissension.  However, you must seek a mutual understanding by practicing empathic listening.  The goal is to foster an environment in which a diversity of viewpoints is freely shared without bias.

“One of the real dangers is emphasizing consensus over dissent,” explains James Surowiecki in his excellent book The Wisdom of Crowds.  “The best collective decisions are the product of disagreement and contest, not consensus or compromise.  Group deliberations are more successful when they have a clear agenda and when leaders take an active role in making sure that everyone gets a chance to speak.”

 

Avoid Assimilation

In order to be successful in your attempt to implement any framework, you must have realistic expectations. 

Starting with a framework simply provides a reference of best practices and recommended options of what has worked on successful data quality initiatives.  But the framework must still be reviewed in order to determine what can be learned from it and to select what will work in the current environment and what simply won't. 

This doesn't mean that the customized components of the framework will be implemented simultaneously.  All change will be gradual and implemented in phases – without the use of BORG nanoprobes.  You will NOT be assimilated. 

Your organization's collective consciousness will be best served by adapting the framework to your corporate culture. 

Your data quality initiative will facilitate the collaboration of business and technical stakeholders, as well as align data usage with business metrics, and enable people to be responsible for data ownership and data quality. 

Best practices will be disseminated throughout your collective – while also maintaining your individual distinctiveness.

 

Related Posts

Hailing Frequencies Open

Data Governance and Data Quality

Not So Strange Case of Dr. Technology and Mr. Business

The Three Musketeers of Data Quality

You're So Vain, You Probably Think Data Quality Is About You

Hailing Frequencies Open

“This is Captain James E. Harris of the Data Quality Starship Collaboration...”

Clearly, I am a Star Trek nerd – but I am also a people person.  Although people, process, and technology are all important for successful data quality initiatives, without people, process and technology are useless. 

Collaboration is essential.  More than anything else, it requires effective communication – which begins with effective listening.

 

Seek First to Understand...Then to Be Understood

This is Habit 5 from Stephen Covey's excellent book The 7 Habits of Highly Effective People.  “We typically seek first to be understood,” explains Covey.  “Most people do not listen with the intent to understand; they listen with the intent to reply.”

We are all proud of our education, knowledge, understanding, and experience.  Since it is commonly believed that experience is the path that separates knowledge from wisdom, we can't wait to share our wisdom with the world.  However, as Covey cautions, our desire to be understood can make “our conversations become collective monologues.”

Covey explains that listening is an activity that can be practiced at one of the following five levels:

  1. Ignoring – we are not really listening at all.
  2. Pretending – we are only waiting for our turn to speak, constantly nodding and saying: “Yeah. Uh-huh. Right.” 
  3. Selective Listening – we are only hearing certain parts of the conversation, such as when we're listening to the constant chatter of a preschool child.
  4. Attentive Listening – we are paying attention and focusing energy on the words that are being said.
  5. Empathic Listening – we are actually listening with the intent to really try to understand the other person's frame of reference.  You look out through it, you see the world the way they see the world, you understand their paradigm, you understand how they feel.

“Empathy is not sympathy,” explains Covey.  “Sympathy is a form of agreement, a form of judgment.  And it is sometimes the more appropriate response.  But people often feed on sympathy.  It makes them dependent.  The essence of empathic listening is not that you agree with someone; it's that you fully, deeply, understand that person, emotionally as well as intellectually.”

 

Vulcans

Some people balk at discussing the use of emotion in a professional setting, where typically it is believed that rational analysis must protect us from irrational emotions.  To return to a Star Trek metaphor, these people model their professional behavior after the Vulcans. 

Vulcans live according to the philosopher Surak's code of emotional self-control.  Starting at a very young age, they are taught meditation and other techniques in order to suppress their emotions and live a life guided by reason and logic alone.

 

Be Truly Extraordinary

In all professions, it is fairly common to encounter rational and logically intelligent people. 

Truly extraordinary people masterfully blend both kinds of intelligence – intellectual and emotional.  A well-grounded sense of self-confidence, an empathetic personality, and excellent communication skills exert a more powerfully positive influence than remarkable knowledge and expertise alone.

 

Your Away Mission

As a data quality consultant, when I begin an engagement with a new client, I often joke that I shouldn't be allowed to speak for the first two weeks.  This is my way of explaining that I will be asking more questions than providing answers. 

I am seeking first to understand the current environment from both the business and technical perspectives.  Only after I have achieved this understanding will I then seek to be understood regarding my extensive experience with the best practices that I have seen work on successful data quality initiatives.

As fellow Star Trek nerds know, the captain doesn't go on away missions.  Therefore, your away mission is to try your best to practice empathic listening at your next data quality discussion – “Make It So!”

Data quality initiatives require a holistic approach involving people, process, and technology.  You must consider the people factor first and foremost, because it will be the people involved, and not the process or the technology, that will truly allow your data quality initiative to “Live Long and Prosper.”

 

As always, hailing frequencies remain open to your comments.  And yes, I am trying my best to practice empathic listening.

 

Related Posts

Not So Strange Case of Dr. Technology and Mr. Business

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You

Adventures in Data Profiling (Part 4)

In Part 3 of this series:  The adventures continued with a detailed analysis of the fields Birth Date, Telephone Number and E-mail Address.  This provided you with an opportunity to become familiar with analysis techniques that use a combination of field values and field formats. 

You also saw examples of how valid values in a valid format can have an invalid context, how valid field formats can conceal invalid field values, and how free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.

In Part 4, you will continue your adventures in data profiling by going postal...postal address that is, by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

 

Country Code

Field Values for Country Code

  In Part 1, we wondered if 5 distinct Country Code field values indicated international postal addresses.  This drill-down “screen” provided by the data profiling tool shows the frequency distribution.  First of all, the field name might have led us to assume we would only see ISO 3166 standard country codes.

 

However, two of the field values are a country name and not a country code.  This is another example of how verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data. 

Secondly, the field values would appear to indicate that most of the postal addresses are from the United States.  However, if you recall from Part 3, we discovered some potential clues during our analysis of Telephone Number, which included two formats that appear invalid based on North American standards, and E-mail Address, which included country code Top Level Domain (TLD) values for Canada and the United Kingdom.

Additionally, whenever a field is both 100% complete and has an extremely low cardinality, it could be an indication that the most frequently occurring value is the field's default value. 

Therefore, is it possible that US is simply the default value for Country Code for this data source?

 

Zip Code

Field Formats for Zip Code

  From the Part 1 comments, it was noted that Zip Code as a field name is unique to the postal code system used in the United States (US).  This drill-down “screen” provided by the data profiling tool shows the field has only a total of ten field formats.   

  The only valid field formats for a ZIP Code (ZIP, by the way, is an acronym for Zone Improvement Plan) are 5 digits, or 9 digits when the 4 digit ZIP+4 add-on code is also present – and according to US postal standards, the add-on code should be separated from the 5 digit ZIP Code by a hyphen.
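
For the US case specifically, the two valid structures described above can be expressed as a simple pattern check – a minimal sketch, not a substitute for postal validation software.  The sample values are hypothetical.

```python
import re

# Valid ZIP Code structures per the description above: exactly 5 digits,
# optionally followed by a hyphen and the 4-digit ZIP+4 add-on code.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

# Hypothetical sample values for illustration only
for value in ["10011", "10011-1234", "100111234", "1234", "K1A 0B1"]:
    print(value, "->", "valid structure" if ZIP_PATTERN.match(value) else "needs more analysis")
```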

 

The actual field formats in the Zip Code field of this data source reveal another example of how we should not make assumptions about our data based on the metadata that describes it.  Although the three most frequently occurring field formats appear to be representative of potentially valid US postal codes, the alphanumeric postal code field formats are our first indication that it is, perhaps sadly, not all about US (pun intended, my fellow Americans).

The two most frequently occurring alphanumeric field formats appear to be representative of potentially valid Canadian postal codes.  An interesting thing to note is that their combined frequency distribution is double the count of the number of records having CA as a Country Code field value.  Therefore, if these field formats are representative of a valid Canadian postal code, then some Canadian records have a contextually invalid field value in Country Code.

The other alphanumeric field formats appear to be representative of potentially valid postal codes for the United Kingdom (UK).  To the uninitiated, the postal codes of Canada (CA) and the UK appear very similar.  Both postal code formats contain two parts, which according to their postal standards should be separated by a single character space. 

In CA postal codes, the first part is called the Forward Sortation Area (FSA) and the second part is called the Local Delivery Unit (LDU).  In UK postal codes, the first part is called the outward code and the second part is called the inward code. 

One easy way to spot the difference is that a UK inward code always has the format of a digit followed by two letters (i.e. “naa” in the field formats generated by my fictional data profiling tool), whereas a CA LDU always has the format of a digit followed by a letter followed by another digit (i.e. “nan”). 
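
That distinguishing signature can be sketched as two simple patterns, with the caveat (noted next) that a structurally valid format can still conceal an invalid value.  The UK pattern here is deliberately simplified, since real outward codes come in several shapes, and the sample codes are for illustration only.

```python
import re

# Canadian postal code: FSA ("ana") + space + LDU ("nan")
CA_PATTERN = re.compile(r"^[A-Z]\d[A-Z] \d[A-Z]\d$")
# UK postcode, deliberately simplified: outward code + space + inward code ("naa").
# Real UK outward codes come in several shapes; this only captures the general idea.
UK_PATTERN = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}$")

def classify_postal_code(code: str) -> str:
    if CA_PATTERN.match(code):
        return "potentially valid Canadian structure"
    if UK_PATTERN.match(code):
        return "potentially valid UK structure"
    return "needs more analysis"

# Hypothetical sample values for illustration only
for code in ["K1A 0B1", "SW1A 1AA", "M1 1AE"]:
    print(code, "->", classify_postal_code(code))
```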

However, we should never rule out the possibility of transposed values making a CA postal code look like a UK postal code, or vice versa.  Also, never forget the common data quality challenge of valid field formats concealing invalid field values.

Returning to the most frequently occurring field format of 5 digits, can we assume all valid field values would represent US postal addresses?  Of course not.  One significant reason is that a 5 digit postal code is one of the most common formats in the world. 

Just some of the other countries also using a 5 digit postal code include: Algeria, Cuba, Egypt, Finland, France, Germany, Indonesia, Israel, Italy, Kuwait, Mexico, Spain, and Turkey.

What about the less frequently occurring field formats of 4 digits and 6 digits?  It is certainly possible that these field formats could indicate erroneous attempts at entering a valid US postal code.  However, it could also indicate the presence of additional non-US postal addresses.

Just some of the countries using a 4 digit postal code include: Australia, Austria, Belgium, Denmark, El Salvador, Georgia (no, the US state did not once again secede – there is also a country called Georgia, and it's not even in the Americas), Hungary, Luxembourg, Norway, and Venezuela.  Just some of the countries using a 6 digit postal code include: Belarus, China, India, Kazakhstan (yes, Borat fans, Kazakhstan is a real country), Russia, and Singapore.

Additionally, why do almost 28% of the records in this data source not have a field value for Zip Code? 

One of the possibilities is that we could have postal addresses from countries that do not have a postal code system.  Just a few examples would be: Aruba, Bahamas (sorry fellow fans of the Beach Boys, but both Jamaica and Bermuda have a postal code system, and therefore I could not take you down to Kokomo), Fiji (home of my favorite bottled water), and Ireland (home of my ancestors and inventors of my second favorite coffee).

 

State Abbreviation

Field Values for State Abbreviation

  From the Part 1 comments, it was noted that the cardinality of State Abbreviation appeared suspect because, if we assume that its content matches its metadata, then we would expect only 51 distinct values (i.e. actual US state abbreviations without counting US territories) and not the 72 distinct values discovered by the data profiling tool.

  Let's assume that drill-downs have revealed the single profiled field data type was CHAR, and the profiled minimum/maximum field lengths were both 2.  Therefore, State Abbreviation, when populated, always contains a two character field value.    

  This drill-down “screen” first displays the top ten most frequently occurring values in the State Abbreviation field, which are all valid US state abbreviations.  The frequency distributions are also within general expectations since eight of the largest US states by population are represented.

 

However, our previous analysis of Country Code and Zip Code has already made us aware that international postal addresses exist in this data source.  Therefore, this drill-down “screen” also displays the top ten most frequently occurring non-US values based on the data profiling tool comparing all 72 distinct values against a list of valid US state and territory abbreviations.

Most of the field values discovered by this analysis appear to be valid CA province codes (including PQ being used as a common alternative for QC – the province of Quebec or Québec si vous préférez).  These frequency distributions are also within general expectations since six of the largest CA provinces by population are represented.  Their combined frequency distribution is also fairly close to the combined frequency distribution of potentially valid Canadian postal codes found in the Zip Code field.

However, we still have three additional values (ZZ, SA, HD) which require more analysis.  Additionally, almost 22% of the records in this data source do not have a field value for State Abbreviation, which could be attributable to the fact that even when the postal standards for other countries include something analogous to a US state or CA province, it might not be compatible with a two character field length.
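
To make that bucketing concrete, here is a minimal Python sketch of the same comparison the profiling tool performed – splitting the distinct values into valid US abbreviations, valid Canadian province codes, and values needing more analysis.  The abbreviation sets are truncated and the frequency counts are hypothetical; only the three unexplained values (ZZ, SA, HD) come from the drill-down above.

```python
# Truncated sets for illustration only -- a real check would include all
# US state/territory abbreviations and all Canadian province/territory codes.
US_STATES = {"CA", "TX", "NY", "FL", "IL", "PA", "OH", "MI", "GA", "NC"}
CANADIAN_PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NS", "ON", "PE", "PQ", "QC", "SK"}

# Hypothetical frequency counts; ZZ, SA and HD are the unexplained values above.
observed = {"CA": 51230, "TX": 40112, "ON": 9021, "PQ": 3100, "ZZ": 57, "SA": 12, "HD": 3}

for value, count in sorted(observed.items(), key=lambda item: -item[1]):
    if value in US_STATES:
        bucket = "valid US state abbreviation"
    elif value in CANADIAN_PROVINCES:
        bucket = "valid Canadian province code"
    else:
        bucket = "needs more analysis"
    print(f"{value}  {count:>6}  {bucket}")
```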

 

City Name

Let's assume that we have performed some preliminary analysis on the statistical summaries and frequency distributions provided by the data profiling tool for the City Name field using the techniques illustrated throughout this series so far. 

Let's also assume analyzing the City Name field in isolation didn't reveal anything suspicious.  The field is consistently populated and its frequently occurring values appeared to meet general expectations.  Therefore, let's assume we have performed additional drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for City Name

Based on reviewing these records, what conclusions, assumptions, and questions do you have about the City Name field?

 

What other questions can you think of for these fields?  What other analysis do you think should be performed for these fields?

 

In Part 5 of this series:  We will continue the adventures in data profiling by completing our initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

 

International Man of Postal Address Standards

Since I am a geographically-challenged American, the first (and often the only necessary) option I choose for assistance with international postal address standards is Graham Rhind.

His excellent book The Global Source-Book for Address Data Management is an invaluable resource and recognized standard reference that contains over 1,000 pages of data pertaining to over 240 countries and territories.

Imagining the Future of Data Quality

Earlier this week on Data Quality Pro, Dylan Jones published an Interview with Larry English, one of the earliest pioneers of information quality management, one of the most prominent thought leaders in the industry, and one of the co-founders (along with Thomas Redman) of the International Association for Information and Data Quality (IAIDQ).

The interview also unintentionally sparked some very common debates, including the differences between data and information, data quality (DQ) and information quality (IQ), as well as proactive and reactive approaches to quality management. 

Of course, I added my own strong opinions to these debates, including a few recent posts – The General Theory of Data Quality and Hyperactive Data Quality (Second Edition).

On a much lighter note, and with apologies to fellow fans of John Lennon, I also offer the following song:

Imagining the Future of Data Quality

Imagine there's no defects
It's easy if you try
No data cleansing beneath us
Above us only sky
Imagine all the data
Living with quality

Imagine there's no companies
It isn't hard to do
Nothing to manage or govern
And no experts too
Imagine all the data
Living life in peace

You may say that I'm a dreamer
But I'm not the only one
I hope someday you'll join us
And the DQ/IQ world will be as one

Imagine no best practices
I wonder if you can
No need for books or lectures
A brotherhood of man
Imagine all the data
Sharing all the world

You may say that I'm a dreamer
But I'm not the only one
I hope someday you'll join us
And the DQ/IQ world will live as one

Hyperactive Data Quality (Second Edition)

In the first edition of Hyperactive Data Quality, I discussed reactive and proactive approaches using the data quality lake analogy from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset:

“...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH.

The alternative is to reduce the pollutant at the point source – the factories.

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data.”

Reactive Data Quality

Reactive Data Quality (i.e. “cleaning the lake” in Redman's analogy) focuses entirely on finding and fixing the problems with existing data after it has been extracted from its sources. 

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert “lake cleaners”).  Data quality problems can be very insidious and even the best “lake cleaning” process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.
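
A minimal sketch of that best practice – validate each incoming record, pass the clean ones through, and suspend the exceptions for manual review – might look like this.  The validate callable is a placeholder for whatever record-level checks apply in your process.

```python
def process_records(records, validate):
    """Split incoming records into those that pass validation and those
    suspended for manual review and correction.  The `validate` callable is
    a placeholder for whatever record-level checks apply in your process;
    it returns a (possibly empty) list of exception messages per record."""
    passed, suspended = [], []
    for record in records:
        exceptions = validate(record)
        if exceptions:
            suspended.append({"record": record, "exceptions": exceptions})
        else:
            passed.append(record)
    return passed, suspended
```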

 

Proactive Data Quality

Proactive Data Quality focuses on preventing errors at the sources where data is entered or received – before it “enters the lake” in Redman's analogy – and before it is extracted for use by downstream applications. 

Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

“It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect.”

Proactive data quality advocates reevaluating business processes that create data, implementing improved controls on data entry screens and web forms, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the information needs of your consumers before delivering enterprise data for their use.

 

Proactive Data Quality > Reactive Data Quality

Proactive data quality is clearly the superior approach.  Although it is impossible to truly prevent every problem before it happens, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information. 

Reactive data quality essentially treats the symptoms without curing the disease.  As Redman explains: “...the problem with being a good lake cleaner is that life never gets better...it gets worse as more data...conspire to mean there is more work every day.”

So why do the vast majority of data quality initiatives use a reactive approach?

 

An Arrow Thickly Smeared With Poison

In Buddhism, there is a famous parable:

A man was shot with an arrow thickly smeared with poison.  His friends wanted to get a doctor to heal him, but the man objected by saying:

“I will neither allow this arrow to be pulled out nor accept any medical treatment until I know the name of the man who wounded me, whether he was a nobleman or a soldier or a merchant or a farmer or a lowly peasant, whether he was tall or short or of average height, whether he used a long bow or a crossbow, and whether the arrow that wounded me was hoof-tipped or curved or barbed.” 

While his friends went off in a frantic search for these answers, the man slowly, and painfully, died.

 

“Flight to Data Quality”

In economics, the term “flight to quality” describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments.

A similar “flight to data quality” can occur in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure, or a financial reporting scandal. 

Driven by a business triage for critical data problems, reactive data cleansing is purposefully chosen over proactive defect prevention.  The priority is finding and fixing the near-term problems rather than worrying about the long-term consequences of not identifying the root cause and implementing process improvements that would prevent it from happening again.

The enterprise has been shot with an arrow thickly smeared with poison – poor data quality.  Now is not the time to point out that the enterprise has actually shot itself by failing to have proactive measures in place. 

Reactive data quality only treats the symptoms.  However, during triage, the priority is to stabilize the patient.  A cure for the underlying condition is worthless if the patient dies before it can be administered.

 

Hyperactive Data Quality

Proactive data quality is the best practice.  Root cause analysis, business process improvement, and defect prevention will always be more effective than the endlessly vicious cycle of reactive data cleansing. 

A data governance framework is necessary for proactive data quality to be successful.  Patience and understanding are also necessary.  Proactive data quality requires a strategic organizational transformation that will not happen easily or quickly. 

Even when not facing an immediate crisis, the reality is that reactive data quality will occasionally be a necessary evil that is used to correct today's problems while proactive data quality is busy trying to prevent tomorrow's problems.

Just like any complex problem, data quality has no fast and easy solution.  Fundamentally, a hybrid discipline is required that combines proactive and reactive aspects into an approach that I refer to as Hyperactive Data Quality, which will make the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.

 

Related Posts

Hyperactive Data Quality (First Edition)

The General Theory of Data Quality

The General Theory of Data Quality

In one of the famous 1905 Annus Mirabilis Papers On the Electrodynamics of Moving Bodies, Albert Einstein published what would later become known as his Special Theory of Relativity.

This theory introduced the concept that space and time are interrelated entities forming a single continuum and that the passage of time can be a variable that could change for each specific observer.

One of the many brilliant insights of special relativity was that it could explain why different observers can make validly different observations – it was a scientifically justifiable matter of perspective. 

As Einstein's Padawan Obi-Wan Kenobi would later explain in his remarkable 1983 “paper” on The Return of the Jedi:

“You're going to find that many of the truths we cling to depend greatly on our own point of view.”

Although the Special Theory of Relativity could explain the different perspectives of different observers, it could not explain the shared perspective of all observers.  Special relativity ignored a foundational force in classical physics – gravity.  So in 1916, Einstein used the force to incorporate a new perspective on gravity into what he called his General Theory of Relativity.

 

The Data-Information Continuum

In my popular post The Data-Information Continuum, I explained that data and information are also interrelated entities forming a single continuum.  I used the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers).

I explained that although a common definition for data quality is fitness for the purpose of use, the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I defined information as data in use or data in action.

I went on to explain that data's quality must be objectively measured separately from its many uses and that information's quality can only be subjectively measured according to its specific use.

 

The Special Theory of Data Quality

The majority of data quality initiatives are reactive projects launched in the aftermath of an event when poor data quality negatively impacted decision-critical information. 

Many of these projects end in failure.  Some fail because of lofty expectations or unmanaged scope creep.  Most fail because they are based on the flawed perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.

Whenever an organization approaches data quality as a one-time project and not as a sustained program, they are accepting what I refer to as the Special Theory of Data Quality.

However, similar to the accuracy of special relativity for solving a narrowly defined problem, sometimes applications of the Special Theory of Data Quality can yield successful results – from a certain point of view. 

Tactical initiatives will often have a necessarily narrow focus.  Reactive data quality projects are sometimes driven by a business triage for the most critical data problems requiring near-term prioritization that simply can't wait for the effects that would be caused by implementing a proactive strategic initiative (i.e. one that may have prevented the problems from happening).

One of the worst things that can happen to an organization is a successful data quality project – because it is almost always an implementation of information quality customized to the needs of the tactical initiative that provided its funding. 

Ultimately, this misperceived success simply delays an actual failure when one of the following happens:

  1. When the project is over, the team returns to their previous activities only to be forced into triage once again when the next inevitable crisis occurs where poor data quality negatively impacts decision-critical information.
  2. When a new project (or a later phase of the same project) attempts to enforce the information quality standards throughout the organization as if they were enterprise data quality standards.

 

The General Theory of Data Quality

True data quality standards are enterprise-wide standards providing an objective data foundation.  True information quality standards must always be customized to meet the subjective needs of a specific business process and/or initiative.

Both aspects of this shared perspective of quality must be incorporated into a single sustained program that enforces a consistent enterprise understanding of data, but that also provides the information necessary to support day-to-day operations.

Whenever an organization approaches data quality as a sustained program and not as a one-time project, they are accepting what I refer to as the General Theory of Data Quality.

Data governance provides the framework for crossing the special to general theoretical threshold necessary to evolve data quality from a project to a sustained program.  However, in this post, I want to remain focused on which theory an organization accepts because if you don't accept the General Theory of Data Quality, you likely also don't accept the crucial role that data governance plays in a data quality initiative – and in all fairness, data governance obviously involves much more than just data quality.

 

Theory vs. Practice

Even though I am an advocate for the General Theory of Data Quality, I also realize that no one works at a company called Perfect, Incorporated.  I would be lying if I said that I had not worked on more projects than programs, implemented more reactive data cleansing than proactive defect prevention, or that I have never championed a “single version of the truth.”

Therefore, my career has more often exemplified the Special Theory of Data Quality.  Or perhaps my career has exemplified what could be referred to as the General Practice of Data Quality?

What theory of data quality does your organization accept?  Which one do you personally accept? 

More importantly, what does your organization actually practice when it comes to data quality?

 

Related Posts

The Data-Information Continuum

Hyperactive Data Quality (Second Edition)

Hyperactive Data Quality (First Edition)

Data Governance and Data Quality

Schrödinger's Data Quality

Adventures in Data Profiling (Part 3)

In Part 2 of this series:  The adventures continued with a detailed analysis of the Customer ID field and the preliminary analysis of the Gender Code and Customer Name fields.  This provided you with an opportunity to become familiar with the features of the fictional data profiling tool that you are using throughout this series to assist with performing your analysis.

Additionally, some of your fellow Data Gazers have provided excellent insights and suggestions via the comments they have left, including my time traveling alter ego who has left you some clues from what the future might hold when you reach the end of these adventures in data profiling.

In Part 3, you will continue your adventures by using a combination of field values and field formats to begin your analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

 

Birth Date

Field Summary for Birth Date

 

  The field summary for Birth Date includes input metadata along with the summary and additional statistics provided by the data profiling tool.  Let's assume that drill-downs revealed the single profiled field data type was DATE and the single profiled field format was MM-DD-CCYY (i.e. Month-Day-Year). 

  Combined with the profiled minimum/maximum field lengths and minimum/maximum field values, the good news appears to be that when Birth Date is populated it does contain a date value.

  However, the not so good news is that the profiled maximum field value (December 21, 2012) appears to indicate that some of the customers are either time travelers or the marketing department has a divinely inspired prospect list.

  This is a good example of a common data quality challenge – a field value can have a valid data type and a valid format – but an invalid context.  Although 12-21-2012 is a valid date in a valid format, in the context of a birth date, it can't be valid.
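
A contextual check like this is simple to sketch in code.  The profiling as-of date and the maximum plausible age used below are assumptions added for illustration; they are not rules from the series.

```python
from datetime import date

def check_birth_date(value: date, as_of: date, max_age_years: int = 120) -> str:
    """A birth date can have a valid data type and a valid format yet still
    be contextually invalid -- for example, a date in the future."""
    if value > as_of:
        return "contextually invalid: birth date is in the future"
    if value.year < as_of.year - max_age_years:
        return "suspect: implies an implausibly old customer"
    return "contextually plausible"

# Assuming the profiling was performed in mid-2009 (an assumption for this sketch)
print(check_birth_date(date(2012, 12, 21), as_of=date(2009, 7, 1)))
```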

 

Field Values for Birth Date

 

  We can use drill-downs on the field summary “screen” to get more details about Birth Date provided by the data profiling tool.

  The cardinality of Birth Date is not only relatively high, but it also has a very low Distinctness (i.e. the same field value frequently occurs on more than one record).  Therefore, we will limit the review to only the top ten most frequently occurring values.

  Additional analysis can be performed by extracting the birth year and reviewing only its top ten most frequently occurring values.  One aspect of this analysis is that it can be used as an easier method for examining the customer age range.

  Here we also see two contextually invalid birth years: 2011 and 2012.  Any thoughts on a possible explanation for this data anomaly?

 

Telephone Number

Field Summary for Telephone Number

  The field summary for Telephone Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  The presence of both multiple profiled field data types and multiple profiled field formats would appear to indicate inconsistencies in the way that telephone numbers are represented.

  The profiled minimum/maximum field lengths show additional inconsistencies, but perhaps more concerning is the profiled minimum/maximum field values, which show obviously invalid telephone numbers.

  Telephone Number is a good example of how you should not mistake Completeness (which as a data profiling statistic indicates the field is populated with an Actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.

  This summary information points to the need to use drill-downs in order to review more detailed information.

 

Field Values for Telephone Number

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different data types based on content and numeric precision.

  With only ten profiled field formats, we can easily review them all.  Most formats appear to be representative of potentially valid telephone numbers.  However, there are two formats for 7 digit numbers appearing to indicate local dialing syntax (i.e. missing the area code in the United States).  Additionally, there are two formats that appear invalid based on North American standards.

  However, a common data quality challenge is that valid field formats can conceal invalid field values.

  Since the cardinality of Telephone Number is very high, we will limit the review to only the top ten most frequently occurring values.  In this case, more obviously invalid telephone numbers are discovered.  
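
To illustrate the kinds of checks implied above, here is a rough Python sketch that strips punctuation and then classifies a telephone value against a simplified North American shape.  It is not full NANP validation, and the sample values are hypothetical.

```python
import re

NANP_10_DIGIT = re.compile(r"^[2-9]\d{2}[2-9]\d{6}$")  # simplified North American shape
OBVIOUS_JUNK = re.compile(r"^(\d)\1+$")                 # e.g. 0000000000, 9999999999

def classify_phone(raw: str) -> str:
    digits = re.sub(r"\D", "", raw)  # strip punctuation, spaces, and letters first
    if not digits or OBVIOUS_JUNK.match(digits):
        return "obviously invalid"
    if NANP_10_DIGIT.match(digits):
        return "potentially valid (10-digit North American)"
    if len(digits) == 7:
        return "local dialing syntax (missing area code)"
    return "needs more analysis"

# Hypothetical sample values for illustration only
for value in ["(212) 555-1234", "555-1234", "0000000000", "123"]:
    print(value, "->", classify_phone(value))
```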

 

E-mail Address

Field Summary for E-mail Address

 

  The field summary for E-mail Address includes input metadata along with the summary statistics provided by the data profiling tool.  In order to save some space, I have intentionally omitted the additional profiling statistics for this field.

  E-mail Address represents a greater challenge that really requires more than just summary statistics in order to perform effective analysis.

  Most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

 

Field Values for E-mail Address

 

  In the case of the E-mail Address field, potentially valid field values should be composed of the sub-values User, Domain and Top Level Domain (TLD).  These sub-values also have expected delimiters, such as User and Domain being separated by an at symbol (@) and Domain and TLD being separated by a dot symbol (.).

  Reviewing the top ten most frequently occurring field formats shows several common potentially valid structures.  However, some formats are missing one of the three required sub-values.  The formats missing User could be an indication that the field sometimes contains a Website Address.

  Extracting the top five most frequently occurring Domain and TLD sub-values provides additional alternative analysis for a high cardinality field.
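
A minimal sketch of that parsing step might look like the following.  The sample values are hypothetical, and a real profiling tool would generalize these sub-values into field formats rather than report them individually.

```python
def parse_email(value: str):
    """Split an e-mail address into its User, Domain and TLD sub-values using
    the expected delimiters described above (@ between User and Domain, and a
    dot between Domain and TLD).  Missing sub-values are returned as None."""
    if "@" in value:
        user, _, rest = value.partition("@")
    else:
        user, rest = None, value  # no @ delimiter, so the User sub-value is missing
    domain, sep, tld = rest.rpartition(".")
    if not sep:
        domain, tld = rest, None  # no dot delimiter, so the TLD sub-value is missing
    return (user or None, domain or None, tld or None)

# Hypothetical sample values for illustration only
for value in ["jane.doe@example.com", "www.example.com", "jane.doe@example"]:
    print(value, "->", parse_email(value))
```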

 

 

What other questions can you think of for these fields?  Additional analysis could be done using drill-downs to perform a more detailed review of records of interest.  What other analysis do you think should be performed for these fields? 

 

In Part 4 of this series:  We will continue the adventures by shifting our focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Adventures in Data Profiling (Part 2)

In Part 1 of this series:  The adventures began with the following scenario – You are an external consultant on a new data quality initiative.  You have got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses...ok, maybe not those last two or three things – but the rest is true.

You have no prior knowledge of the data or its expected characteristics.  You are performing this analysis without the aid of either business requirements or subject matter experts.  Your goal is to learn as much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.

 

The customer data source was processed by the data profiling tool, which provided the following statistical summaries:

 

Data Profiling Summary

 

The Adventures Continue...

In Part 1, we asked if Customer ID was the primary key for this data source.  In an attempt to answer this question, let's “click” on it and drill-down to a field summary provided by the data profiling tool:

 

Field Summary for Customer ID

  Please remember that my data profiling tool is fictional (i.e. not modeled after any real product) and therefore all of my “screen shots” are customized to illustrate series concepts.  This “screen” would not only look different in a real data profiling tool, but it would also contain additional information.

  This field summary for Customer ID includes some input metadata, identifying the expected data type and field length.  Verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data.

  The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.

 

 

Field Details for Customer ID

  We can use drill-downs on the field summary “screen” to get more details about Customer ID provided by the data profiling tool.

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS).  Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).

  Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values).  Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.

  The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit.  We could also continue drilling down to see the actual field values behind the field formats.
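
Reducing each value to its digit-count pattern is straightforward to sketch; the sample Customer ID values below are hypothetical and simply illustrate how the seven observed patterns could be counted.

```python
from collections import Counter

def numeric_pattern(value) -> str:
    """Reduce a numeric Customer ID to its digit-count pattern
    (e.g. 1234567 becomes '7 digits')."""
    return f"{len(str(value).strip())} digits"

# Hypothetical sample of Customer ID values for illustration only
customer_ids = [1234567, 987654, 42, 3338190, 7, 100001, 999999]
print(Counter(numeric_pattern(v) for v in customer_ids).most_common())
```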

 

Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?

 

In Part 1, we asked why the Gender Code field has 8 distinct values.  Cardinality can play a major role in deciding whether or not you want to drill-down to field values or field formats since it is much easier to review all of the field values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields). 

 

Field Values for Gender Code

  We will drill-down to this “screen” to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.

  It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male.  Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).

However, it is dangerous to assume anything – and what about those numeric values?  Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields.  For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name, and do the records with an Actual Gender Code value indicate the presence of a personal name? 

To attempt to answer these questions, it may be helpful to review records with each of these field values.  Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

 Record Drill-down for Gender Code

As is so often the case, data rarely conforms to our assumptions about it.  Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?

 

In Part 3 of this series:  We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Adventures in Data Profiling (Part 1)

In my popular post Getting Your Data Freq On, I explained that understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis. 

I explained the benefits of using a data profiling tool to help automate some of the grunt work, but that you need to perform the actual analysis and then prepare meaningful questions and reports to share with the rest of your team.

 

Series Overview

This post is the beginning of a vendor-neutral series on the methodology of data profiling.

In order to narrow the scope of the series, the scenario used will be that a customer data source for a new data quality initiative has been made available to an external consultant who has no prior knowledge of the data or its expected characteristics.  Also, the business requirements have not yet been documented, and the subject matter experts are not currently available.

The series will not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that are covered.  Both the data profiling tool and the data used throughout the series will be fictional.  The “screen shots” have been customized to illustrate concepts and are not modeled after any particular data profiling tool.

 

The Adventures Begin...

 Data Profiling Summary  

The customer data source has been processed by a data profiling tool, which has provided the above counts and percentages that summarize the following field content characteristics (a minimal sketch of computing these statistics follows the list):

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual
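
Here is a minimal Python sketch of how these statistics could be computed for a single field, with None standing in for NULL and a whitespace-only string standing in for a Missing value; the sample field values are hypothetical.

```python
def profile_field(values, total_records):
    """Compute the summary statistics defined above for a single field.
    `values` holds one entry per record; None represents NULL and a
    whitespace-only string represents a Missing (non-NULL) value."""
    nulls = sum(1 for v in values if v is None)
    missing = sum(1 for v in values if v is not None and str(v).strip() == "")
    actual_values = [v for v in values if v is not None and str(v).strip() != ""]
    actual = len(actual_values)
    cardinality = len(set(actual_values))
    return {
        "NULL": nulls,
        "Missing": missing,
        "Actual": actual,
        "Completeness": actual / total_records,
        "Cardinality": cardinality,
        "Uniqueness": cardinality / total_records,
        "Distinctness": cardinality / actual if actual else 0.0,
    }

# Tiny hypothetical example: 6 records for one field
print(profile_field(["F", "M", None, " ", "F", "U"], total_records=6))
```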

 

Some initial questions based on your analysis of these statistical summaries might include the following:

  1. Is Customer ID the primary key for this data source?
  2. Is Customer Name 1 the primary name on the account?  If so, why isn't it always populated?
  3. Do the statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records?
  4. Why does the Gender Code field have 8 distinct values?
  5. Do the 5 distinct values in Country Code indicate international postal addresses?

Please remember the series scenario – You are an external consultant with no prior knowledge of the data or its expected characteristics, who is performing this analysis without the aid of either business requirements or subject matter experts.

 

What other questions can you think of based on analyzing the statistical summaries provided by the data profiling tool?

 

In Part 2 of this series:  We will continue the adventures by attempting to answer these questions (and more) by beginning our analysis of the frequency distributions of the unique values and formats found within the fields.  Additionally, we will begin using drill-down analysis in order to perform a more detailed review of records of interest.

 

Related Posts

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Data Quality: The Reality Show?

Over on the DataFlux Community of Experts, Dylan Jones (of Data Quality Pro fame) posted Data Quality and Social Proof, which advocates an interesting approach to convincing stakeholders of the need to act on poor data quality:

Use video testimonials from knowledge workers to record what data quality really means to the people who use data and drive value in your business.

My overactive imagination and sense of humor couldn't help but wonder what some of these testimonials would be like...

 

A Few Good Knowledge Workers

“You want the truth?  You can't handle the truth!  We live in a world that has data and the quality of those data needs to be guarded by workers with knowledge.  Who's gonna do it?  You?  I have a greater responsibility than you can possibly fathom.  You have the luxury of not knowing what I know. 

You don't want the truth because deep down in places you don't talk about at board meetings, you want me on that data, you need me on that data!

We use words like completeness, consistency, accuracy, timeliness.  We use them as the backbone of a career spent trying to defend data.  You use them as bullet points on a presentation slide.

I suggest that you pick up a pen and sign the authorization for our data quality initiative!”

 

Data-pocalypse Now

“I've seen poor data quality...poor data quality that you've seen.  It's impossible for words to describe what is necessary to those who do not know what poor data quality means.  Poor data quality has a face:

A customer to whom we can not provide service, an auditor whom we can not prevent from failing us on regulatory compliance, a stockholder to whom we can not accurately report revenue.

Poor data quality...the horror...the horror...”

Data Busters

“You want to know how poor our data quality is? 

Our data is headed for a disaster of Y2K proportions.  What do we mean by Y2K? 

Old Mainframe, real wrath of EBCDIC type stuff.  Fire and brimstone coming down from the codepages!  Rivers and seas of boiling data!  Forty years of darkness!  Hard drive crashes!  HTTP 404!  Deleted records rising from the Recycle Bin!  Precision sacrifice!  Dogs and cats living together...Mass Hysteria!

We are all terrified beyond the capacity for rational thought. 

If someone asks if you are going to approve our data quality initiative...you say YES!”

 

Your Data Quality Reality Show

What would your video testimonial show about the reality of data quality in your organization? 

How would you respond if asked to help convince your stakeholders of the need to act on poor data quality?

The Wisdom of Failure

Earlier this month, I had the honor of being interviewed by Ajay Ohri on his blog Decision Stats, which is an excellent source of insights on business intelligence and data mining as well as interviews with industry thought leaders and chief evangelists.

One of the questions Ajay asked me during the interview was what methods and habits I would recommend to young analysts just starting out in the business intelligence field, and part of my response was:

“Don't be afraid to ask questions or admit when you don't know the answers.  The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don't know the answers.”

It is perhaps one of life’s cruelest paradoxes that some lessons simply cannot be taught, but instead have to be learned through the pain of making mistakes.  To err is human, but not all humans learn from their errors.  In fact, some of us find it extremely difficult even to acknowledge when we have made a mistake.  This was certainly true for me earlier in my career.

 

The Wisdom of Crowds

One of my favorite books is The Wisdom of Crowds by James Surowiecki.  Before reading it, I admit that I believed crowds were incapable of wisdom and that the best decisions are based on the expert advice of carefully selected individuals.  However, Surowiecki wonderfully elucidates the folly of “chasing the expert” and explains the four conditions that characterize wise crowds: diversity of opinion, independent thinking, decentralization and aggregation.  The book is also balanced by examining the conditions (e.g. confirmation bias and groupthink) that can commonly undermine the wisdom of crowds.  All in all, it is a wonderful discourse on both collective intelligence and collective ignorance with practical advice on how to achieve the former and avoid the latter.

 

Chasing the Data Quality Expert

Without question, a data quality expert can be an invaluable member of your team.  Often an external consultant, a data quality expert can provide extensive experience and best practices from successful implementations.  However, regardless of their experience – even experience with other companies in your industry – every organization and its data is unique.  An expert's perspective definitely has merit, but their opinions and advice should not be allowed to dominate the decision-making process. 

“The more power you give a single individual in the face of complexity,” explains Surowiecki, “the more likely it is that bad decisions will get made.”  No one person, regardless of their experience and expertise, can succeed on their own.  According to Surowiecki, the best experts “recognize the limits of their own knowledge and of individual decision making.”

 

“Success is on the far side of failure”

One of the most common obstacles organizations face with data quality initiatives is that many initial attempts end in failure.  Some fail because of lofty expectations, unmanaged scope creep, and the unrealistic perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.  However, regardless of the reason for the failure, it can negatively affect morale and cause employees to resist participating in the next data quality effort.

Although a common best practice is to perform a post-mortem in order to document the lessons learned, sometimes the stigma of failure persuades an organization to either skip the post-mortem or ignore its findings. 

However, in the famous words of IBM founder Thomas J. Watson: “Success is on the far side of failure.” 

A failed data quality initiative may have been closer to success than you realize.  At the very least, there are important lessons to be learned from the mistakes that were made.  The sooner you can recognize your mistakes, the sooner you can mitigate their effects and hopefully prevent them from happening again.

 

The Wisdom of Failure

In one of my other favorite books, How We Decide, Jonah Lehrer explains:

“The brain always learns the same way, accumulating wisdom through error...there are no shortcuts to this painstaking process...becoming an expert just takes time and practice...once you have developed expertise in a particular area...you have made the requisite mistakes.”

Therefore, although it may be true that experience is the path that separates knowledge from wisdom, I have come to realize that the true wisdom of my experience is the wisdom of failure.

 

Related Posts

A Portrait of the Data Quality Expert as a Young Idiot

All I Really Need To Know About Data Quality I Learned In Kindergarten

The Nine Circles of Data Quality Hell

Getting Your Data Freq On

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Data profiling is often performed during a data quality assessment.  It involves much more than reviewing the output generated by a data profiling tool, and a data quality assessment obviously involves much more than just data profiling. 

However, in this post I want to focus on some of the benefits of using a data profiling tool.

 

Freq'ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

 

How can a data profiling tool help you?  Let me count the ways

Data profiling tools provide counts and percentages for each field that summarize its content characteristics such as:

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual

The absence of data can be represented in many different ways, with NULL being the most common for relational database columns.  However, character fields can contain all spaces or an empty string, and numeric fields can contain all zeroes.  Consistently representing the absence of data is a common data quality standard. 
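As a minimal sketch of what that standard might look like in practice, the snippet below (Python/pandas, illustrative only) maps the different representations of absent data to a single NULL before any completeness calculations are made.  Whether all-zero values should be treated as absent is an assumption that depends on your business rules, and the field name in the usage comment is hypothetical.

```python
import pandas as pd

def standardize_absence(series: pd.Series, zeroes_are_absent: bool = False) -> pd.Series:
    """Map the different representations of absent data (NULL, all spaces,
    empty string, and optionally all zeroes) to a single NULL value."""
    values = series.astype("string")
    absent = values.isna() | values.str.strip().eq("")
    if zeroes_are_absent:
        absent = absent | values.str.fullmatch(r"0+").fillna(False)
    return values.mask(absent)

# Hypothetical usage, assuming business rules treat an all-zero Tax ID as absent:
# df["Tax ID"] = standardize_absence(df["Tax ID"], zeroes_are_absent=True)
```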

Completeness and uniqueness are particularly useful in evaluating potential key fields, especially a single primary key, which should be both 100% complete and 100% unique.  Required non-key fields may often be 100% complete, but a low cardinality could indicate the presence of potential default values.

Distinctness can be useful in evaluating the potential for duplicate records.  For example, a Tax ID field may be less than 100% complete (i.e. not every record has one) and therefore also less than 100% unique (i.e. it can not be considered a potential single primary key because it can not be used to uniquely identify every record).  If the Tax ID field is also less than 100% distinct (i.e. some distinct actual values occur on more than one record), then this could indicate the presence of potential duplicate records.
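To make that concrete, here is a small illustrative example (with made-up records and a hypothetical Tax ID field) that turns a less-than-100% distinct field into a list of records worth reviewing as potential duplicates:

```python
import pandas as pd

# Made-up records; the Tax ID field and its values are illustrative only.
df = pd.DataFrame({
    "Customer ID": [1, 2, 3, 4, 5],
    "Tax ID": ["123456789", "987654321", "123456789", None, "   "],
})

tax_ids = df["Tax ID"].dropna()
tax_ids = tax_ids[tax_ids.str.strip().ne("")]   # keep actual values only

counts = tax_ids.value_counts()
repeated = counts[counts > 1]                   # distinct values occurring on 2+ records

print(f"{len(repeated)} Tax ID value(s) occur on more than one record")
print(df[df["Tax ID"].isin(repeated.index)].sort_values("Tax ID"))
```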

Data profiling tools will often generate many other useful summary statistics for each field including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).

 

Show Me the Value (or the Format)

A frequency distribution of the unique formats found in a field is sometimes more useful than the unique values.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality (i.e. indicating potential default values)
  • Fields with a relatively low cardinality (e.g. gender code and source system code)
  • Fields with a relatively small number of valid values (e.g. state abbreviation and country code)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g. integer surrogate key or ZIP+4 add-on code)
  • Fields with a relatively limited number of valid formats (e.g. telephone number and birth date)
  • Fields with free-form values and a high cardinality  (e.g. customer name and postal address)

Cardinality can play a major role in deciding whether or not you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values.

Some fields can also be alternatively analyzed using partial values (e.g. birth year extracted from birth date) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers). 

Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, given name, family name, title).
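As a simple sketch of one way such field formats might be constructed, the example below replaces letters with A and digits with 9 while keeping punctuation and spacing, then builds a frequency distribution of the resulting formats.  The mask conventions and sample values are illustrative and not the output of any particular data profiling tool.

```python
import re
from collections import Counter

def field_format(value: str) -> str:
    """Construct a simple field format: letters become A, digits become 9,
    punctuation and spacing are kept as-is."""
    return re.sub(r"\d", "9", re.sub(r"[A-Za-z]", "A", value))

# Hypothetical sample values for a free-form postal address field
sample_values = [
    "123 Main St",
    "456 Elm St",
    "789 Oak St",
    "PO Box 1000",
    "10 Downing Street Apt 2B",
]

format_frequency = Counter(field_format(value) for value in sample_values)
for mask, count in format_frequency.most_common():
    print(f"{mask}: {count}")

# A combination of values and formats is also possible, e.g. testing that a
# hypothetical account number has an alpha prefix followed by all numbers:
# re.fullmatch(r"[A-Za-z]+\d+", account_number)
```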

 

Conclusion

Understanding your data is essential to using it effectively and improving its quality.  In order to achieve these goals, there is simply no substitute for data analysis.

A data profiling tool can help you by automating some of the grunt work needed to begin this analysis.  However, it is important to remember that the analysis itself can not be automated – you need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more importantly, translate your analysis into meaningful reports and questions to share with the rest of the project team.  Well performed data profiling is a highly interactive and iterative process.

Data profiling is typically one of the first tasks performed on a data quality project.  This is especially true when data is made available before business requirements are documented and subject matter experts are available to discuss usage, relevancy, standards and the metrics for measuring and improving data quality.  All of which are necessary to progress from profiling your data to performing a full data quality assessment.  However, these are not acceptable excuses for delaying data profiling.

 

Therefore, grab your favorite caffeinated beverage, settle into your most comfortable chair, roll up your sleeves and...

Get your data freq on! 

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Schrödinger's Data Quality

Data Gazers

The Very True Fear of False Positives

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels).  Ranking is often used as a primary method of differentiating the three possible result categories (a simple threshold sketch follows the lists below):

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review

All data matching techniques must also face the daunting challenge of what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched
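To make the trade-off concrete, here is a purely illustrative sketch of how ranked match scores might be split into the three result categories.  The 0-100 score scale and both threshold values are assumptions, not taken from any particular data matching product – and moving them in either direction is exactly how you feed one head of the monster while starving the other.

```python
# Illustrative only: the 0-100 score scale and both thresholds are assumptions,
# not taken from any particular data matching product.
AUTO_MATCH_THRESHOLD = 90
AUTO_NON_MATCH_THRESHOLD = 60

def classify_match(score: float) -> str:
    """Assign a ranked match result to one of the three result categories."""
    if score >= AUTO_MATCH_THRESHOLD:
        return "Automatic Match"
    if score < AUTO_NON_MATCH_THRESHOLD:
        return "Automatic Non-Match"
    return "Potential Match (manual review)"

# Raising AUTO_MATCH_THRESHOLD reduces the risk of false positives, but creates
# more false negatives and more manual review work; lowering it does the reverse.
for score in (95, 75, 40):
    print(score, "->", classify_match(score))
```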

For data examples that illustrate the challenge of false negatives and false positives, please refer to my articles on Data Quality Pro.

 

Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution – at least not until Doc Brown and I get our patent-pending prototype of the redundant data capacitor working – because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique – every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the vendors in the Gartner Data Quality Magic Quadrant, as well as many of the so-called niche vendors.  Without exception, I have always been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.

 

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives – the identification of currently unlinked records, within and across existing systems, that are preventing the enterprise from understanding the true data relationships that exist in its information assets.

However, the pursuit to reduce false negatives carries with it the risk of creating false positives. 

In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software than they are about the false negatives left unlinked – after all, those records were not linked before investing in the data matching software.  Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry's dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern.  It is these human and business factors – and not just the technology itself – that need to be taken into consideration when planning a data matching implementation. 

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.

 

Additional Resources

Data Quality and Record Linkage Techniques

The Art of Data Matching

Identifying Duplicate Customer Records - Case Study

Narrative Fallacy and Data Matching

Speaking of Narrative Fallacy

The Myth of Matching: Why We Need Entity Resolution

The Human Element in Identity Resolution

Probabilistic Matching: Sounds like a good idea, but...

Probabilistic Matching: Part Two