Hyperactive Data Quality (Second Edition)

In the first edition of Hyperactive Data Quality, I discussed reactive and proactive approaches using the data quality lake analogy from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset:

“...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH.

The alternative is to reduce the pollutant at the point source – the factories.

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data.”

Reactive Data Quality

Reactive Data Quality (i.e. “cleaning the lake” in Redman's analogy) focuses entirely on finding and fixing the problems with existing data after it has been extracted from its sources. 

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert “lake cleaners”).  Data quality problems can be very insidious and even the best “lake cleaning” process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.

 

Proactive Data Quality

Proactive Data Quality focuses on preventing errors at the sources where data is entered or received (i.e. where it “enters the lake” in Redman's analogy), before it is extracted for use by downstream applications. 

Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

“It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect.”

Proactive data quality advocates reevaluating business processes that create data, implementing improved controls on data entry screens and web forms, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the information needs of your consumers before delivering enterprise data for their use.

 

Proactive Data Quality > Reactive Data Quality

Proactive data quality is clearly the superior approach.  Although it is impossible to truly prevent every problem before it happens, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information. 

Reactive data quality essentially treats the symptoms without curing the disease.  As Redman explains: “...the problem with being a good lake cleaner is that life never gets better...it gets worse as more data...conspire to mean there is more work every day.”

So why do the vast majority of data quality initiatives use a reactive approach?

 

An Arrow Thickly Smeared With Poison

In Buddhism, there is a famous parable:

A man was shot with an arrow thickly smeared with poison.  His friends wanted to get a doctor to heal him, but the man objected by saying:

“I will neither allow this arrow to be pulled out nor accept any medical treatment until I know the name of the man who wounded me, whether he was a nobleman or a soldier or a merchant or a farmer or a lowly peasant, whether he was tall or short or of average height, whether he used a long bow or a crossbow, and whether the arrow that wounded me was hoof-tipped or curved or barbed.” 

While his friends went off in a frantic search for these answers, the man slowly, and painfully, died.

 

“Flight to Data Quality”

In economics, the term “flight to quality” describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments.

A similar “flight to data quality” can occur in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure, or a financial reporting scandal. 

Driven by a business triage for critical data problems, reactive data cleansing is purposefully chosen over proactive defect prevention.  The priority is finding and fixing the near-term problems rather than worrying about the long-term consequences of not identifying the root cause and implementing process improvements that would prevent it from happening again.

The enterprise has been shot with an arrow thickly smeared with poison – poor data quality.  Now is not the time to point out that the enterprise has actually shot itself by failing to have proactive measures in place. 

Reactive data quality only treats the symptoms.  However, during triage, the priority is to stabilize the patient.  A cure for the underlying condition is worthless if the patient dies before it can be administered.

 

Hyperactive Data Quality

Proactive data quality is the best practice.  Root cause analysis, business process improvement, and defect prevention will always be more effective than the endlessly vicious cycle of reactive data cleansing. 

A data governance framework is necessary for proactive data quality to be successful.  Patience and understanding are also necessary.  Proactive data quality requires a strategic organizational transformation that will not happen easily or quickly. 

Even when not facing an immediate crisis, the reality is that reactive data quality will occasionally be a necessary evil that is used to correct today's problems while proactive data quality is busy trying to prevent tomorrow's problems.

Just like any complex problem, data quality has no fast and easy solution.  Fundamentally, a hybrid discipline is required that combines proactive and reactive aspects into an approach that I refer to as Hyperactive Data Quality, which will make the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.

 

Related Posts

Hyperactive Data Quality (First Edition)

The General Theory of Data Quality

The General Theory of Data Quality

In one of his famous 1905 Annus Mirabilis papers, On the Electrodynamics of Moving Bodies, Albert Einstein published what would later become known as his Special Theory of Relativity.

This theory introduced the concept that space and time are interrelated entities forming a single continuum and that the passage of time could vary for each specific observer.

One of the many brilliant insights of special relativity was that it could explain why different observers can make validly different observations – it was a scientifically justifiable matter of perspective. 

As Einstein's Padawan Obi-Wan Kenobi would later explain in his remarkable 1983 “paper” on The Return of the Jedi:

“You're going to find that many of the truths we cling to depend greatly on our own point of view.”

Although the Special Theory of Relativity could explain the different perspectives of different observers, it could not explain the shared perspective of all observers.  Special relativity ignored a foundational force in classical physics – gravity.  So in 1916, Einstein used the force to incorporate a new perspective on gravity into what he called his General Theory of Relativity.

 

The Data-Information Continuum

In my popular post The Data-Information Continuum, I explained that data and information are also interrelated entities forming a single continuum.  I used the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers).

I explained that although a common definition for data quality is fitness for the purpose of use, the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I defined information as data in use or data in action.

I went on to explain that data's quality must be objectively measured separately from its many uses and that information's quality can only be subjectively measured according to its specific use.

 

The Special Theory of Data Quality

The majority of data quality initiatives are reactive projects launched in the aftermath of an event when poor data quality negatively impacted decision-critical information. 

Many of these projects end in failure.  Some fail because of lofty expectations or unmanaged scope creep.  Most fail because they are based on the flawed perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.

Whenever an organization approaches data quality as a one-time project and not as a sustained program, they are accepting what I refer to as the Special Theory of Data Quality.

However, similar to the accuracy of special relativity for solving a narrowly defined problem, sometimes applications of the Special Theory of Data Quality can yield successful results – from a certain point of view. 

Tactical initiatives will often have a necessarily narrow focus.  Reactive data quality projects are sometimes driven by a business triage for the most critical data problems requiring near-term prioritization that simply can't wait for the effects that would be caused by implementing a proactive strategic initiative (i.e. one that may have prevented the problems from happening).

One of the worst things that can happen to an organization is a successful data quality project – because it is almost always an implementation of information quality customized to the needs of the tactical initiative that provided its funding. 

Ultimately, this misperceived success simply delays an actual failure when one of the following happens:

  1. The project ends and the team returns to their previous activities, only to be forced into triage once again when the next inevitable crisis occurs and poor data quality negatively impacts decision-critical information.
  2. A new project (or a later phase of the same project) attempts to enforce the information quality standards throughout the organization as if they were enterprise data quality standards.

 

The General Theory of Data Quality

True data quality standards are enterprise-wide standards providing an objective data foundation.  True information quality standards must always be customized to meet the subjective needs of a specific business process and/or initiative.

Both aspects of this shared perspective of quality must be incorporated into a single sustained program that enforces a consistent enterprise understanding of data, but that also provides the information necessary to support day-to-day operations.

Whenever an organization approaches data quality as a sustained program and not as a one-time project, they are accepting what I refer to as the General Theory of Data Quality.

Data governance provides the framework for crossing the special to general theoretical threshold necessary to evolve data quality from a project to a sustained program.  However, in this post, I want to remain focused on which theory an organization accepts because if you don't accept the General Theory of Data Quality, you likely also don't accept the crucial role that data governance plays in a data quality initiative – and in all fairness, data governance obviously involves much more than just data quality.

 

Theory vs. Practice

Even though I am an advocate for the General Theory of Data Quality, I also realize that no one works at a company called Perfect, Incorporated.  I would be lying if I said that I had not worked on more projects than programs, implemented more reactive data cleansing than proactive defect prevention, or that I have never championed a “single version of the truth.”

Therefore, my career has more often exemplified the Special Theory of Data Quality.  Or perhaps my career has exemplified what could be referred to as the General Practice of Data Quality?

What theory of data quality does your organization accept?  Which one do you personally accept? 

More importantly, what does your organization actually practice when it comes to data quality?

 

Related Posts

The Data-Information Continuum

Hyperactive Data Quality (Second Edition)

Hyperactive Data Quality (First Edition)

Data Governance and Data Quality

Schrödinger's Data Quality

Adventures in Data Profiling (Part 3)

In Part 2 of this series:  The adventures continued with a detailed analysis of the Customer ID field and the preliminary analysis of the Gender Code and Customer Name fields.  This provided you with an opportunity to become familiar with the features of the fictional data profiling tool that you are using throughout this series to assist with performing your analysis.

Additionally, some of your fellow Data Gazers have provided excellent insights and suggestions via the comments they have left, including my time traveling alter ego who has left you some clues from what the future might hold when you reach the end of these adventures in data profiling.

In Part 3, you will continue your adventures by using a combination of field values and field formats to begin your analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

 

Birth Date

Field Summary for Birth Date

 

  The field summary for Birth Date includes input metadata along with the summary and additional statistics provided by the data profiling tool.  Let's assume that drill-downs revealed the single profiled field data type was DATE and the single profiled field format was MM-DD-CCYY (i.e. Month-Day-Year). 

  Combined with the profiled minimum/maximum field lengths and minimum/maximum field values, the good news appears to be that when Birth Date is populated it does contain a date value.

  However, the not so good news is that the profiled maximum field value (December 21, 2012) appears to indicate that some of the customers are either time travelers or the marketing department has a divinely inspired prospect list.

  This is a good example of a common data quality challenge – a field value can have a valid data type and a valid format – but an invalid context.  Although 12-21-2012 is a valid date in a valid format, in the context of a birth date, it can't be valid.
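
To make this contextual check concrete, here is a minimal Python sketch (mine, not a feature of the fictional profiling tool) that parses values in the profiled MM-DD-CCYY format and flags dates that are valid as dates but invalid as birth dates.  The as-of date and the maximum plausible age are illustrative assumptions.

```python
from datetime import datetime, date

def check_birth_date(value, as_of=date(2009, 7, 31), max_age_years=120):
    """Classify a Birth Date value; the as-of date and age limit are assumptions."""
    if value is None or not value.strip():
        return "missing"
    try:
        parsed = datetime.strptime(value.strip(), "%m-%d-%Y").date()  # MM-DD-CCYY
    except ValueError:
        return "invalid date or format"
    if parsed > as_of:
        return "future date (invalid context)"    # e.g. 12-21-2012
    if as_of.year - parsed.year > max_age_years:
        return "implausibly old (invalid context)"
    return "plausible birth date"

print(check_birth_date("12-21-2012"))   # future date (invalid context)
print(check_birth_date("02-29-2001"))   # invalid date or format (2001 is not a leap year)
print(check_birth_date("07-04-1976"))   # plausible birth date
```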

 

Field Values for Birth Date

 

  We can use drill-downs on the field summary “screen” to get more details about Birth Date provided by the data profiling tool.

  The cardinality of Birth Date is not only relatively high, but it also has a very low Distinctness (i.e. the same field value frequently occurs on more than one record).  Therefore, we will limit the review to only the top ten most frequently occurring values.

  Additional analysis can be performed by extracting the birth year and reviewing only its top ten most frequently occurring values.  This analysis also provides an easier way of examining the customer age range.

  Here we also see two contextually invalid birth years: 2011 and 2012.  Any thoughts on a possible explanation for this data anomaly?

 

Telephone Number

Field Summary for Telephone Number

  The field summary for Telephone Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  The presence of both multiple profiled field data types and multiple profiled field formats would appear to indicate inconsistencies in the way that telephone numbers are represented.

  The profiled minimum/maximum field lengths show additional inconsistencies, but perhaps more concerning are the profiled minimum/maximum field values, which show obviously invalid telephone numbers.

  Telephone Number is a good example of how you should not mistake Completeness (which as a data profiling statistic indicates the field is populated with an Actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.

  This summary information points to the need to use drill-downs in order to review more detailed information.

 

Field Values for Telephone Number

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different data types based on content and numeric precision.

  With only ten profiled field formats, we can easily review them all.  Most formats appear to be representative of potentially valid telephone numbers.  However, there are two formats for 7-digit numbers that appear to indicate local dialing syntax (i.e. missing the area code in the United States).  Additionally, there are two formats that appear invalid based on North American standards.

  However, a common data quality challenge is that valid field formats can conceal invalid field values.

  Since the cardinality of Telephone Number is very high, we will limit the review to only the top ten most frequently occurring values.  In this case, more obviously invalid telephone numbers are discovered.  
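
As an illustration of this kind of analysis, here is a minimal Python sketch that collapses telephone number values into field formats (every digit becomes 9), counts their frequencies, and flags the conditions discussed above.  The sample values are hypothetical, and the 10-digit North American assumption is deliberately simplistic.

```python
from collections import Counter

def phone_format(value):
    """Collapse a telephone number into a field format: every digit becomes 9."""
    return "".join("9" if ch.isdigit() else ch for ch in value.strip())

telephone_numbers = [                      # hypothetical sample values
    "(508) 555-1234", "508-555-6789", "555-1234",
    "(999) 999-9999", "0000000000", "ext. 42",
]

for fmt, count in Counter(phone_format(v) for v in telephone_numbers).most_common():
    digits = fmt.count("9")
    if digits == 7:
        note = "  <- local dialing syntax (missing area code)"
    elif digits != 10:
        note = "  <- invalid based on North American standards"
    else:
        note = ""
    print(f"{fmt!r}: {count}{note}")
```

Note that the all-zero sample value collapses into a valid-looking 10-digit format, which is exactly how valid field formats can conceal invalid field values.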

 

E-mail Address

Field Summary for E-mail Address

 

  The field summary for E-mail Address includes input metadata along with the summary statistics provided by the data profiling tool.  In order to save some space, I have intentionally omitted the additional profiling statistics for this field.

  E-mail Address represents a greater challenge that really requires more than just summary statistics in order to perform effective analysis.

  Most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

 

Field Values for E-mail Address

 

  In the case of the E-mail Address field, potentially valid field values should consist of the sub-values User, Domain and Top Level Domain (TLD).  These sub-values also have expected delimiters, such as User and Domain being separated by an at symbol (@) and Domain and TLD being separated by a dot symbol (.).

  Reviewing the top ten most frequently occurring field formats shows several common potentially valid structures.  However, some formats are missing one of the three required sub-values.  The formats missing User could be an indication that the field sometimes contains a Website Address.

  Extracting the top five most frequently occurring Domain and TLD sub-values provides additional alternative analysis for a high cardinality field.
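
To illustrate how such formats could be constructed, here is a minimal Python sketch that splits an e-mail address into its User, Domain and TLD sub-values and reports which of them are missing.  The sample addresses are hypothetical, and the parsing is purely structural (it does not validate against the full e-mail addressing rules).

```python
def email_format(value):
    """Build a structural format from the User, Domain and TLD sub-values."""
    value = value.strip()
    user, at, remainder = value.partition("@")
    if not at:                       # no @ symbol: possibly a Website Address
        user, remainder = "", value
    domain, dot, tld = remainder.rpartition(".")
    if not dot:                      # no dot: treat the remainder as Domain, TLD missing
        domain, tld = remainder, ""
    user_part = "User" if user else "(missing User)"
    domain_part = "Domain" if domain else "(missing Domain)"
    tld_part = "TLD" if tld else "(missing TLD)"
    return f"{user_part}@{domain_part}.{tld_part}"

for address in ["jane.doe@example.com", "www.example.com", "jane.doe@example", "@example.com"]:
    print(f"{address!r:24} -> {email_format(address)}")
```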

 

 

What other questions can you think of for these fields?  Additional analysis could be done using drill-downs to perform a more detailed review of records of interest.  What other analysis do you think should be performed for these fields? 

 

In Part 4 of this series:  We will continue the adventures by shifting our focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Adventures in Data Profiling (Part 2)

In Part 1 of this series:  The adventures began with the following scenario – You are an external consultant on a new data quality initiative.  You've got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses...ok, maybe not those last two or three things – but the rest is true.

You have no prior knowledge of the data or its expected characteristics.  You are performing this analysis without the aid of either business requirements or subject matter experts.  Your goal is to learn as much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.

 

The customer data source was processed by the data profiling tool, which provided the following statistical summaries:

 

Data Profiling Summary

 

The Adventures Continue...

In Part 1, we asked if Customer ID was the primary key for this data source.  In an attempt to answer this question, let's “click” on it and drill-down to a field summary provided by the data profiling tool:

 

Field Summary for Customer ID

  Please remember that my data profiling tool is fictional (i.e. not modeled after any real product) and therefore all of my “screen shots” are customized to illustrate series concepts.  This “screen” would not only look different in a real data profiling tool, but it would also contain additional information.

  This field summary for Customer ID includes some input metadata, identifying the expected data type and field length.  Verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data.

  The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.

 

 

Field Details for Customer ID

  We can use drill-downs on the field summary “screen” to get more details about Customer ID provided by the data profiling tool.

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS).  Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).

  Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values).  Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.

  The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit.  We could also continue drilling down to see the actual field values behind the field formats.
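
As a rough illustration of how these two drill-downs could be derived, here is a minimal Python sketch that classifies hypothetical Customer ID values both by the smallest common integer type that could hold them and by their numeric pattern.  The SMALLINT/INTEGER/BIGINT ranges are illustrative assumptions, since (as noted above) precision varies by RDBMS.

```python
from collections import Counter

def integer_data_type(value):
    """Classify a numeric string by the smallest integer type that could store it."""
    number = int(value)
    if -32768 <= number <= 32767:
        return "SMALLINT"
    if -2147483648 <= number <= 2147483647:
        return "INTEGER"
    return "BIGINT"

customer_ids = ["7", "42", "512", "7048", "65536", "1234567"]   # hypothetical values

print(Counter(integer_data_type(v) for v in customer_ids))      # distinct data types
print(Counter(f"{len(v)} digits" for v in customer_ids))        # distinct field formats
```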

 

Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?

 

In Part 1, we asked why the Gender Code field has 8 distinct values.  Cardinality can play a major role in deciding whether or not you want to drill-down to field values or field formats since it is much easier to review all of the field values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields). 

 

Field Values for Gender Code

  We will drill-down to this “screen” to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.

  It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male.  Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).

However, it is dangerous to assume anything.  And what about those numeric values?  Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields.  For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name, and do the records with an Actual Gender Code value indicate the presence of a personal name? 

To attempt to answer these questions, it may be helpful to review records with each of these field values.  Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

 Record Drill-down for Gender Code

As is so often the case, data rarely conforms to our assumptions about it.  Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?

 

In Part 3 of this series:  We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Adventures in Data Profiling (Part 1)

In my popular post Getting Your Data Freq On, I explained that understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis. 

I explained the benefits of using a data profiling tool to help automate some of the grunt work, but that you need to perform the actual analysis and then prepare meaningful questions and reports to share with the rest of your team.

 

Series Overview

This post is the beginning of a vendor-neutral series on the methodology of data profiling.

In order to narrow the scope of the series, the scenario used will be that a customer data source for a new data quality initiative has been made available to an external consultant who has no prior knowledge of the data or its expected characteristics.  Also, the business requirements have not yet been documented, and the subject matter experts are not currently available.

The series will not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that are covered.  Both the data profiling tool and the data used throughout the series will be fictional.  The “screen shots” have been customized to illustrate concepts and are not modeled after any particular data profiling tool.

 

The Adventures Begin...

 Data Profiling Summary  

The customer data source has been processed by a data profiling tool, which has provided the above counts and percentages that summarize the following field content characteristics:

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual
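
For readers who like to see the arithmetic, here is a minimal Python sketch of how these seven statistics could be computed for a single field.  Treating None as NULL and whitespace-only strings as Missing are conventions assumed for the sketch, not requirements of any particular data profiling tool.

```python
def profile_field(values):
    """Compute the summary statistics defined above for a single field."""
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    missing_count = sum(1 for v in values if v is not None and str(v).strip() == "")
    actual_values = [v for v in values if v is not None and str(v).strip() != ""]
    actual_count = len(actual_values)
    cardinality = len(set(actual_values))
    return {
        "NULL": null_count,
        "Missing": missing_count,
        "Actual": actual_count,
        "Completeness": actual_count / total if total else 0.0,
        "Cardinality": cardinality,
        "Uniqueness": cardinality / total if total else 0.0,
        "Distinctness": cardinality / actual_count if actual_count else 0.0,
    }

# Hypothetical Gender Code sample: 1 NULL, 1 Missing, 6 Actual, 3 distinct values
print(profile_field(["F", "M", "F", None, "  ", "Unknown", "M", "F"]))
```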

 

Some initial questions based on your analysis of these statistical summaries might include the following:

  1. Is Customer ID the primary key for this data source?
  2. Is Customer Name 1 the primary name on the account?  If so, why isn't it always populated?
  3. Do the statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records?
  4. Why does the Gender Code field have 8 distinct values?
  5. Do the 5 distinct values in Country Code indicate international postal addresses?

Please remember the series scenario – You are an external consultant with no prior knowledge of the data or its expected characteristics, who is performing this analysis without the aid of either business requirements or subject matter experts.

 

What other questions can you think of based on analyzing the statistical summaries provided by the data profiling tool?

 

In Part 2 of this series:  We will continue the adventures by attempting to answer these questions (and more) by beginning our analysis of the frequency distributions of the unique values and formats found within the fields.  Additionally, we will begin using drill-down analysis in order to perform a more detailed review of records of interest.

 

Related Posts

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Data Quality: The Reality Show?

Over on the DataFlux Community of Experts, Dylan Jones (of Data Quality Pro fame) posted Data Quality and Social Proof, which advocates an interesting approach to convincing stakeholders of the need to act on poor data quality:

Use video testimonials from knowledge workers to record what data quality really means to the people who use data and drive value in your business.

My overactive imagination and sense of humor couldn't help but wonder what some of these testimonials would be like...

 

A Few Good Knowledge Workers

“You want the truth?  You can't handle the truth!  We live in a world that has data and the quality of those data need to be guarded by workers with knowledge.  Who's gonna do it?  You?  I have a greater responsibility than you can possibly fathom.  You have the luxury of not knowing what I know. 

You don't want the truth because deep down in places you don't talk about at board meetings, you want me on that data, you need me on that data!

We use words like completeness, consistency, accuracy, timeliness.  We use them as the backbone of a career spent trying to defend data.  You use them as bullet points on a presentation slide.

I suggest that you pick up a pen and sign the authorization for our data quality initiative!”

 

Data-pocalypse Now

“I've seen poor data quality...poor data quality that you've seen.  It's impossible for words to describe what is necessary to those who do not know what poor data quality means.  Poor data quality has a face:

A customer to whom we can not provide service, an auditor that we can not prevent from failing us on regulatory compliance, a stockholder to whom we can not accurately report revenue.

Poor data quality...the horror...the horror...”

Data Busters

“You want to know how poor our data quality is? 

Our data is headed for a disaster of Y2K proportions.  What do we mean by Y2K? 

Old Mainframe, real wrath of EBCDIC type stuff.  Fire and brimstone coming down from the codepages!  Rivers and seas of boiling data!  Forty years of darkness!  Hard drive crashes!  HTTP 404!  Deleted records rising from the Recycle Bin!  Precision sacrifice!  Dogs and cats living together...Mass Hysteria!

We are all terrified beyond the capacity for rational thought. 

If someone asks if you are going to approve our data quality initiative...you say YES!”

 

Your Data Quality Reality Show

What would your video testimonial show about the reality of data quality in your organization? 

How would you respond if asked to help convince your stakeholders of the need to act on poor data quality?

The Wisdom of Failure

Earlier this month, I had the honor of being interviewed by Ajay Ohri on his blog Decision Stats, which is an excellent source of insights on business intelligence and data mining as well as interviews with industry thought leaders and chief evangelists.

One of the questions Ajay asked me during my interview was what methods and habits I would recommend to young analysts just starting in the business intelligence field, and part of my response was:

“Don't be afraid to ask questions or admit when you don't know the answers.  The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don't know the answers.”

It is perhaps one of life’s cruelest paradoxes that some lessons simply cannot be taught, but instead have to be learned through the pain of making mistakes.  To err is human, but not all humans learn from their errors.  In fact, some of us find it extremely difficult to even simply acknowledge when we have made a mistake.  This was certainly true for me earlier in my career.

 

The Wisdom of Crowds

One of my favorite books is The Wisdom of Crowds by James Surowiecki.  Before reading it, I admit that I believed crowds were incapable of wisdom and that the best decisions are based on the expert advice of carefully selected individuals.  However, Surowiecki wonderfully elucidates the folly of “chasing the expert” and explains the four conditions that characterize wise crowds: diversity of opinion, independent thinking, decentralization and aggregation.  The book is also balanced by examining the conditions (e.g. confirmation bias and groupthink) that can commonly undermine the wisdom of crowds.  All in all, it is a wonderful discourse on both collective intelligence and collective ignorance with practical advice on how to achieve the former and avoid the latter.

 

Chasing the Data Quality Expert

Without question, a data quality expert can be an invaluable member of your team.  Often an external consultant, a data quality expert can provide extensive experience and best practices from successful implementations.  However, regardless of their experience, even with other companies in your industry, every organization and its data is unique.  An expert's perspective definitely has merit, but their opinions and advice should not be allowed to dominate the decision making process. 

“The more power you give a single individual in the face of complexity,” explains Surowiecki, “the more likely it is that bad decisions will get made.”  No one person regardless of their experience and expertise can succeed on their own.  According to Surowiecki, the best experts “recognize the limits of their own knowledge and of individual decision making.”

 

“Success is on the far side of failure”

One of the most common obstacles organizations face with data quality initiatives is that many initial attempts end in failure.  Some fail because of lofty expectations, unmanaged scope creep, and the unrealistic perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.  However, regardless of the reason for the failure, it can negatively affect morale and cause employees to resist participating in the next data quality effort.

Although a common best practice is to perform a post-mortem in order to document the lessons learned, sometimes the stigma of failure persuades an organization to either skip the post-mortem or ignore its findings. 

However, in the famous words of IBM founder Thomas J. Watson: “Success is on the far side of failure.” 

A failed data quality initiative may have been closer to success than you realize.  At the very least, there are important lessons to be learned from the mistakes that were made.  The sooner you can recognize your mistakes, the sooner you can mitigate their effects and hopefully prevent them from happening again.

 

The Wisdom of Failure

In one of my other favorite books, How We Decide, Jonah Lehrer explains:

“The brain always learns the same way, accumulating wisdom through error...there are no shortcuts to this painstaking process...becoming an expert just takes time and practice...once you have developed expertise in a particular area...you have made the requisite mistakes.”

Therefore, although it may be true that experience is the path that separates knowledge from wisdom, I have come to realize that the true wisdom of my experience is the wisdom of failure.

 

Related Posts

A Portrait of the Data Quality Expert as a Young Idiot

All I Really Need To Know About Data Quality I Learned In Kindergarten

The Nine Circles of Data Quality Hell

Getting Your Data Freq On

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Data profiling is often performed during a data quality assessment.  Data profiling involves much more than reviewing the output generated by a data profiling tool, and a data quality assessment obviously involves much more than just data profiling. 

However, in this post I want to focus on some of the benefits of using a data profiling tool.

 

Freq'ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

 

How can a data profiling tool help you?  Let me count the ways

Data profiling tools provide counts and percentages for each field that summarize its content characteristics such as:

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual

The absence of data can be represented many different ways with NULL being most common for relational database columns.  However, character fields can contain all spaces or an empty string and numeric fields can contain all zeroes.  Consistently representing the absence of data is a common data quality standard. 

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique.  Required non-key fields may often be 100% complete but a low cardinality could indicate the presence of potential default values.

Distinctness can be useful in evaluating the potential for duplicate records.  For example, a Tax ID field may be less than 100% complete (i.e. not every record has one) and therefore also less than 100% unique (i.e. it can not be considered a potential single primary key because it can not be used to uniquely identify every record).  If the Tax ID field is also less than 100% distinct (i.e. some distinct actual values occur on more than one record), then this could indicate the presence of potential duplicate records.
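
Here is a minimal Python sketch (with hypothetical sample values) of how the completeness, uniqueness and distinctness heuristics just described could be applied to evaluate a candidate primary key and to flag potential duplicates.

```python
def evaluate_field(name, values):
    """Apply simple completeness / uniqueness / distinctness heuristics to a field."""
    total = len(values)
    actual = [v for v in values if v is not None and str(v).strip() != ""]
    cardinality = len(set(actual))
    completeness = len(actual) / total if total else 0.0
    uniqueness = cardinality / total if total else 0.0
    distinctness = cardinality / len(actual) if actual else 0.0

    if completeness == 1.0 and uniqueness == 1.0:
        verdict = "candidate single primary key"
    elif distinctness < 1.0:
        verdict = "possible duplicate records (some actual values repeat)"
    else:
        verdict = "populated values are distinct, but the field is not fully complete"
    print(f"{name}: completeness={completeness:.0%}, uniqueness={uniqueness:.0%}, "
          f"distinctness={distinctness:.0%} -> {verdict}")

# Hypothetical examples mirroring the Tax ID discussion above
evaluate_field("Customer ID", ["1", "2", "3", "4", "5"])
evaluate_field("Tax ID", ["111-22-3333", None, "111-22-3333", "444-55-6666", None])
```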

Data profiling tools will often generate many other useful summary statistics for each field including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).

 

Show Me the Value (or the Format)

A frequency distribution of the unique formats found in a field is sometimes more useful than the unique values.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality (i.e. indicating potential default values)
  • Fields with a relatively low cardinality (e.g. gender code and source system code)
  • Fields with a relatively small number of valid values (e.g. state abbreviation and country code)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g. integer surrogate key or ZIP+4 add-on code)
  • Fields with a relatively limited number of valid formats (e.g. telephone number and birth date)
  • Fields with free-form values and a high cardinality  (e.g. customer name and postal address)

Cardinality can play a major role in deciding whether or not you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values.

Some fields can also be alternatively analyzed using partial values (e.g. birth year extracted from birth date) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers). 

Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, given name, family name, title).
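
As a simple illustration of constructing formats by parsing and classifying individual values, here is a Python sketch for a personal name field.  The salutation and title lookup sets are tiny illustrative assumptions; real data profiling tools rely on much richer classification tables.

```python
SALUTATIONS = {"MR", "MRS", "MS", "DR"}       # tiny illustrative lookup tables
TITLES = {"JR", "SR", "II", "III", "MD"}

def name_format(value):
    """Construct a field format by classifying each token of a free-form name."""
    labels = []
    for token in value.replace(",", " ").replace(".", " ").split():
        upper = token.upper()
        if upper in SALUTATIONS:
            labels.append("Salutation")
        elif upper in TITLES:
            labels.append("Title")
        elif upper.isdigit():
            labels.append("Number")
        elif len(upper) == 1:
            labels.append("Initial")
        else:
            labels.append("Word")
    return " + ".join(labels) if labels else "(empty)"

for name in ["Dr. Jane Q. Doe", "Doe, John Jr.", "ACME Corporation"]:
    print(f"{name!r:22} -> {name_format(name)}")
```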

 

Conclusion

Understanding your data is essential to using it effectively and improving its quality.  In order to achieve these goals, there is simply no substitute for data analysis.

A data profiling tool can help you by automating some of the grunt work needed to begin this analysis.  However, it is important to remember that the analysis itself cannot be automated.  You need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more importantly, translate your analysis into meaningful reports and questions to share with the rest of the project team.  Well performed data profiling is a highly interactive and iterative process.

Data profiling is typically one of the first tasks performed on a data quality project.  This is especially true when data is made available before business requirements are documented and subject matter experts are available to discuss usage, relevancy, standards and the metrics for measuring and improving data quality.  All of which are necessary to progress from profiling your data to performing a full data quality assessment.  However, these are not acceptable excuses for delaying data profiling.

 

Therefore, grab your favorite caffeinated beverage, settle into your most comfortable chair, roll up your sleeves and...

Get your data freq on! 

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Schrödinger's Data Quality

Data Gazers

The Very True Fear of False Positives

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels).  Ranking is often used as a primary method in differentiating the three possible result categories:

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review
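
To show how ranking can feed these three categories, here is a minimal Python sketch that scores record pairs with a simple string similarity measure and classifies them against two thresholds.  The thresholds, the similarity measure, and the sample records are all illustrative assumptions, not a recommendation of any particular data matching technique.

```python
from difflib import SequenceMatcher

AUTO_MATCH = 0.90        # illustrative thresholds; real implementations tune these
AUTO_NON_MATCH = 0.60    # carefully against the trade-offs discussed below

def match_category(record_a, record_b):
    """Rank a record pair with a similarity score and classify the result."""
    score = SequenceMatcher(None, record_a.lower(), record_b.lower()).ratio()
    if score >= AUTO_MATCH:
        return score, "Automatic Match"
    if score < AUTO_NON_MATCH:
        return score, "Automatic Non-Match"
    return score, "Potential Match requiring manual review"

pairs = [
    ("Jon Smith, 123 Main St", "John Smith, 123 Main Street"),
    ("Jon Smith, 123 Main St", "Jane Jones, 456 Oak Ave"),
]
for a, b in pairs:
    score, category = match_category(a, b)
    print(f"{score:.2f}  {category}")
```

Where those thresholds are placed determines how many false negatives and false positives you are willing to accept, which is exactly the challenge described next.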

All data matching techniques must also face the daunting challenge of what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched

For data examples that illustrate the challenge of false negatives and false positives, please refer to my Data Quality Pro articles:

 

Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution (at least not until Doc Brown and I get our patent-pending prototype of the redundant data capacitor working) because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique; every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the Gartner Data Quality Magic Quadrant and many of the so-called niche vendors.  Without exception, I have always been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.

 

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives: the identification of records, within and across existing systems, that are not currently linked and are therefore preventing the enterprise from understanding the true data relationships that exist in its information assets.

However, the pursuit to reduce false negatives carries with it the risk of creating false positives. 

In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software than they are about the false negatives left unlinked.  After all, those records were not linked before investing in the data matching software.  Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry's dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern.  It is these human and business factors, and not just the technology itself, that need to be taken into consideration when planning a data matching implementation. 

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.

 

Additional Resources

Data Quality and Record Linkage Techniques

The Art of Data Matching

Identifying Duplicate Customer Records - Case Study

Narrative Fallacy and Data Matching

Speaking of Narrative Fallacy

The Myth of Matching: Why We Need Entity Resolution

The Human Element in Identity Resolution

Probabilistic Matching: Sounds like a good idea, but...

Probabilistic Matching: Part Two

Worthy Data Quality Whitepapers (Part 2)

Overall Approach to Data Quality ROI

Overall Approach to Data Quality ROI is a worthy data quality whitepaper freely available (name and email required for download) from the McKnight Consulting Group.

 

William McKnight

The author of the whitepaper is William McKnight, President of McKnight Consulting Group.  William focuses on delivering business value and solving business problems utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures.  William has more than 20 years of information management experience, nearly half of which was gained in IT leadership positions, dealing firsthand with the challenging issues his clients now face.  His IT and consulting teams have won best practice competitions for their implementations.  In 11 years of consulting, he has been a part of 150 client programs worldwide, has over 300 articles, whitepapers and tips in publication and is a frequent international speaker.  William and his team provide clients with action plans, architectures, complete programs, vendor-neutral tool selection and right-fit resources. 

Additionally, William has an excellent blog on the B-eye-Network and a new course now available on eLearningCurve.

 

Whitepaper Excerpts

Excerpts from Overall Approach to Data Quality ROI:

  • “Data quality is an elusive subject that can defy measurement and yet be critical enough to derail any single IT project, strategic initiative, or even a company as a whole.”
  • “Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise.  Put simply, it is a competitive strategy.”
  • Six key steps to help you realize tangible ROI on your data quality initiative:
    1. System Profiling – survey and prioritize your company systems according to their use of and need for quality data.
    2. Data Quality Rule Determination – data quality can be defined as a lack of intolerable defects.
    3. Data Profiling – usually no one can articulate how clean or dirty corporate data is.  Without this measurement of cleanliness, the effectiveness of activities that are aimed at improving data quality cannot be measured.
    4. Data Quality Scoring – scoring is a relative measure of conformance to rules.  System scores are an aggregate of the rule scores for that system and the overall score is a prorated aggregation of the system scores.
    5. Measure Impact of Various Levels of Data Quality – ROI is about accumulating all returns and investments from a project’s build, maintenance, and associated business and IT activities through to the ultimate desired results – all while considering the possible outcomes and their likelihood.
    6. Data Quality Improvement – it is much more costly to fix data quality errors in downstream systems than it is at the point of origin.
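
As a back-of-the-envelope illustration of step 4 (not taken from the whitepaper), here is a Python sketch that aggregates rule scores into system scores and then prorates the system scores into an overall score.  The rule names, scores and weights are all hypothetical.

```python
# Hypothetical rule scores per system (fraction of records conforming to each rule);
# the weights used for the prorated overall score are also assumptions.
rule_scores = {
    "CRM": {"valid email format": 0.92, "non-missing tax id": 0.81},
    "ERP": {"valid email format": 0.97, "non-missing tax id": 0.88},
}
system_weights = {"CRM": 0.6, "ERP": 0.4}   # e.g. prorated by record volume

def system_score(scores):
    """A system score as the simple average of its rule scores."""
    return sum(scores.values()) / len(scores)

system_scores = {name: system_score(scores) for name, scores in rule_scores.items()}
overall = sum(system_scores[name] * system_weights[name] for name in system_scores)

print(system_scores)                 # {'CRM': 0.865, 'ERP': 0.925}
print(f"Overall score: {overall:.3f}")   # Overall score: 0.889
```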
 

Related Posts

Worthy Data Quality Whitepapers (Part 1)

Data Quality Whitepapers are Worthless

Data Quality Blogging All-Stars

The 2009 Major League Baseball (MLB) All-Star Game is being held tonight at Busch Stadium in St. Louis, Missouri. 

For those readers who are not baseball fans, the All-Star Game is an annual exhibition held in mid-July that showcases the players with the best statistical performances from the first half of the MLB season.

As I watch the 80th Midsummer Classic, I offer this exhibition that showcases the bloggers with the posts I have most enjoyed reading from the first half of the 2009 data quality blogging season.

 

Dylan Jones

From Data Quality Pro:

 

Daragh O Brien

From The DOBlog:

 

Steve Sarsfield

From Data Governance and Data Quality Insider:

 

Daniel Gent

From Data Quality Edge:

 

Henrik Liliendahl Sørensen

From Liliendahl on Data Quality:

 

Stefanos Damianakis

From Netrics HD:

 

Vish Agashe

From Business Intelligence: Process, People and Products:

 

Mark Goloboy

From Boston Data, Technology & Analytics:

 

Additional Resources

Over on Data Quality Pro, read the data quality blog roundups from the first half of 2009:

From the IAIDQ, read the 2009 issues of the IAIDQ Blog Carnival:

Data Governance and Data Quality

Regular readers know that I often blog about the common mistakes I have observed (and made) in my professional services and application development experience in data quality (for example, see my post: The Nine Circles of Data Quality Hell).

According to Wikipedia: “Data governance is an emerging discipline with an evolving definition.  The discipline embodies a convergence of data quality, data management, business process management, and risk management surrounding the handling of data in an organization.”

Since I have never formally used the term “data governance” with my clients, I have been researching what data governance is and how it specifically relates to data quality.

Thankfully, I found a great resource in Steve Sarsfield's excellent book The Data Governance Imperative, where he explains:

“Data governance is about changing the hearts and minds of your company to see the value of information quality...data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise...at the root of the problems with managing your data are data quality problems...data governance guarantees that data can be trusted...putting people in charge of fixing and preventing issues with data...to have fewer negative events as a result of poor data.”

Although the book covers data governance more comprehensively, I focused on three of my favorite data quality themes:

  • Business-IT Collaboration
  • Data Quality Assessments
  • People Power

 

Business-IT Collaboration

Data governance establishes policies and procedures to align people throughout the organization.  Successful data quality initiatives require the Business and IT to forge an ongoing and iterative collaboration.  Neither the Business nor IT alone has all of the necessary knowledge and resources required to achieve data quality success.  The Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise and must partner with IT in defining the necessary data quality standards and processes. 

Steve Sarsfield explains:

“Business users need to understand that data quality is everyone's job and not just an issue with technology...the mantra of data governance is that technologists and business users must work together to define what good data is...constantly leverage both business users, who know the value of the data, and technologists, who can apply what the business users know to the data.” 

Data Quality Assessments

Data quality assessments provide a much needed reality check for the perceptions and assumptions that the enterprise has about the quality of its data.  Data quality assessments help with many tasks including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements.  Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.

Steve Sarsfield explains:

“In order to know if you're winning in the fight against poor data quality, you have to keep score...use data quality scorecards to understand the detail about quality of data...and aggregate those scores into business value metrics...solid metrics...give you a baseline against which you can measure improvement over time.” 
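
To make the scorecard idea a bit more concrete, here is a minimal Python sketch of how per-rule pass rates might be rolled up into an aggregate score that can serve as a baseline for measuring improvement over time.  The rules, field names and sample records are hypothetical illustrations, not taken from any particular data quality tool.

    import re

    # Hypothetical data quality rules: each returns True when a record passes.
    RULES = {
        "customer_name_populated": lambda r: bool(r.get("customer_name", "").strip()),
        "postal_code_is_5_digits": lambda r: bool(re.fullmatch(r"\d{5}", r.get("postal_code", ""))),
        "email_contains_at_sign": lambda r: "@" in r.get("email", ""),
    }

    def score_records(records):
        """Return per-rule pass rates and a simple aggregate score."""
        passed = {name: 0 for name in RULES}
        for record in records:
            for name, rule in RULES.items():
                if rule(record):
                    passed[name] += 1
        pass_rates = {name: passed[name] / len(records) for name in RULES}
        aggregate = sum(pass_rates.values()) / len(pass_rates)
        return pass_rates, aggregate

    if __name__ == "__main__":
        sample = [
            {"customer_name": "Ann", "postal_code": "02114", "email": "ann@example.com"},
            {"customer_name": "", "postal_code": "ABCDE", "email": "bob-at-example.com"},
        ]
        rates, overall = score_records(sample)
        for name, rate in rates.items():
            print(f"  {name}: {rate:.0%}")
        print(f"Aggregate score (the baseline to measure against): {overall:.0%}")

A production scorecard would weight each rule by business impact and trend the aggregate over time, but even a simple average provides the baseline against which improvement can be measured.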

People Power

Although incredible advancements continue, technology alone cannot provide the solution.  Data governance and data quality both require a holistic approach involving people, process and technology.  However, by far the most important of the three is people.  In my experience, it is always the people involved that make projects successful.

Steve Sarsfield explains:

“The most important aspect of implementing data governance is that people power must be used to improve the processes within an organization.  Technology will have its place, but it's most importantly the people who set up new processes who make the biggest impact.”

Conclusion

Data governance provides the framework for evolving data quality from a project to an enterprise-wide initiative.  By facilitating the collaboration of business and technical stakeholders, aligning data usage with business metrics, and enabling people to be responsible for data ownership and data quality, data governance provides for the ongoing management of the decision-critical information that drives the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

 

Related Posts

TDWI World Conference Chicago 2009

Not So Strange Case of Dr. Technology and Mr. Business

Schrödinger's Data Quality

The Three Musketeers of Data Quality

 

Additional Resources

Over on Data Quality Pro, read the following posts:

From the IAIDQ publications portal, read the 2008 industry report: The State of Information and Data Governance

Read Steve Sarsfield's book: The Data Governance Imperative and read his blog: Data Governance and Data Quality Insider

Missed It By That Much

In the mission to gain control over data chaos, a project is launched to implement a new system that will help remediate the poor data quality negatively impacting decision-critical enterprise information. 

The project appears to be well planned.  Business requirements were well documented.  A data quality assessment was performed to gain an understanding of the data challenges that would be faced during development and testing.  Detailed architectural and functional specifications were written to guide these efforts.

The project appears to be progressing well.  Business, technical and data issues all come up from time to time.  Meetings are held to prioritize the issues and determine their impact.  Some issues require immediate fixes, while other issues are deferred to the next phase of the project.  All of these decisions are documented and well communicated to the end-user community.

Expectations appear to have been properly set for end-user acceptance testing.

As a best practice, the new system was designed to identify and report exceptions when they occur.  The end-users agreed that an obsessive-compulsive quest to find and fix every data quality problem is a laudable pursuit but ultimately a self-defeating cause.  Data quality problems can be very insidious and even the best data remediation process will still produce exceptions.

Although all of this is easy to accept in theory, it is notoriously difficult to accept in practice.

Once the end-users start reviewing the exceptions, their confidence in the new system drops rapidly.  Even after some enhancements increase the number of records without an exception from 86% to 99% – the end-users continue to focus on the remaining 1% of the records that are still producing data quality exceptions.

Would you believe this incredibly common scenario can prevent acceptance of an overwhelmingly successful implementation?

How about if I quote one of the many people who can help you get smarter than you would by only listening to me?

In his excellent book Why New Systems Fail: Theory and Practice Collide, Phil Simon explains:

“Systems are to be appreciated by their general effects, and not by particular exceptions...

Errors are actually helpful the vast majority of the time.”

In fact, because the new system was designed to identify and report errors when they occur:

“End-users could focus on the root causes of the problem and not have to wade through hundreds of thousands of records in an attempt to find the problem records.”
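
As a rough illustration of that point, here is a minimal Python sketch of an exception report that summarizes errors by rule and source system instead of listing every failing record, so reviewers can concentrate on likely root causes.  The exception records and field names are hypothetical; they are not from Phil Simon's book or any specific system.

    from collections import Counter

    def summarize_exceptions(total_records, exceptions):
        """Group exceptions by (rule, source system) so reviewers see likely
        root causes rather than hundreds of thousands of individual records."""
        by_cause = Counter((e["rule"], e["source_system"]) for e in exceptions)
        pass_rate = 1 - (len(exceptions) / total_records)
        print(f"Records without exceptions: {pass_rate:.1%}")
        for (rule, source), count in by_cause.most_common():
            print(f"{count:6d}  {rule:<25} from {source}")

    if __name__ == "__main__":
        # Hypothetical exceptions identified and reported by the new system.
        exceptions = [
            {"record_id": 101, "rule": "missing_birth_date", "source_system": "CRM"},
            {"record_id": 102, "rule": "missing_birth_date", "source_system": "CRM"},
            {"record_id": 203, "rule": "invalid_country_code", "source_system": "Web Form"},
        ]
        summarize_exceptions(total_records=300, exceptions=exceptions)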

I have seen projects fail in the many ways described by the detailed case studies in Phil Simon's fantastic book.  However, one of the most common and frustrating data quality failures is the project that was so close to being a success, but whose focus on exceptions resulted in the end-users telling us that we “missed it by that much.”

I am not suggesting that end-users are unrealistic, nor that exceptions should be ignored. 

Reducing exceptions (i.e. poor data quality) is the whole point of the project and nobody understands the data better than the end-users.  However, chasing perfection can undermine the best intentions. 

In order to be successful, data quality projects must always be understood as an iterative process.  Small incremental improvements will build momentum to larger success over time. 

Instead of focusing on the exceptions – focus on the improvements. 

And you will begin making steady progress toward improving your data quality.

And loving it!

 

Related Posts

The Data Quality Goldilocks Zone

Schrödinger's Data Quality

The Nine Circles of Data Quality Hell

Worthy Data Quality Whitepapers (Part 1)

In my April blog post Data Quality Whitepapers are Worthless, I called for data quality whitepapers that are worth reading.

This post will be the first in an ongoing series about data quality whitepapers that I have read and can endorse as worthy.

 

It is about the data – the quality of the data

This is the subtitle of two brief but informative data quality whitepapers freely available (no registration required) from the Electronic Commerce Code Management Association (ECCMA): Transparency and Data Portability.

 

ECCMA

ECCMA is an international association of industry and government master data managers working together to increase the quality and lower the cost of descriptions of individuals, organizations, goods and services through developing and promoting International Standards for Master Data Quality. 

Formed in April 1999, ECCMA has brought together thousands of experts from around the world and provides them a means of working together in the fair, open and extremely fast environment of the Internet to build and maintain the global, open standard dictionaries that are used to unambiguously label information.  The existence of these dictionaries of labels allows information to be passed from one computer system to another without losing meaning.

 

Peter Benson

The author of the whitepapers is Peter Benson, the Executive Director and Chief Technical Officer of the ECCMA.  Peter is an expert in distributed information systems, content encoding and master data management.  He designed one of the very first commercial electronic mail software applications, WordStar Messenger, and was granted a landmark British patent in 1992 covering the use of electronic mail systems to maintain distributed databases.

Peter designed and oversaw the development of a number of strategic distributed database management systems used extensively in the UK and US by the Public Relations and Media Industries.  From 1994 to 1998, Peter served as the elected chairman of the American National Standards Institute Accredited Committee ANSI ASCX 12E, the Standards Committee responsible for the development and maintenance of the EDI standard for product data.

Peter is known for the design, development and global promotion of the UNSPSC as an internationally recognized commodity classification and more recently for the design of the eOTD, an internationally recognized open technical dictionary based on the NATO codification system.

Peter is an expert in the development and maintenance of Master Data Quality as well as an internationally recognized proponent of Open Standards that he believes are critical to protect data assets from the applications used to create and manipulate them. 

Peter is the Project Leader for ISO 8000, which is a new international standard for data quality.

You can get more information about ISO 8000 by clicking on this link: ISO 8000

 

Whitepaper Excerpts

Excerpts from Transparency:

  • “Today, more than ever before, our access to data, the ability of our computer applications to use it and the ultimate accuracy of the data determines how we see and interact with the world we live and work in.”
  • “Data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”
  • “Transparency requires that transaction data accurately identifies who, what, where and when and master data accurately describes who, what and where.”
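
The distinction drawn in the second and third excerpts, between master data that describes things and transaction data that describes events, can be sketched as two simple record types.  This is only a hypothetical Python illustration of the idea; it is not prescribed by the whitepaper or by ECCMA.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class MasterRecord:
        """Identifies and describes a thing: a who, a what, or a where."""
        identifier: str   # e.g. a customer, product, or location code
        description: str

    @dataclass
    class TransactionRecord:
        """Describes an event by pointing at master data: who, what, where, and when."""
        who: str          # identifier of a MasterRecord
        what: str         # identifier of a MasterRecord
        where: str        # identifier of a MasterRecord
        when: datetime

    # A transaction is only as transparent as the master data it references.
    customer = MasterRecord("CUST-001", "Acme Corporation")
    product = MasterRecord("PROD-042", "Industrial widget, 42 mm")
    warehouse = MasterRecord("LOC-NY-01", "New York distribution center")
    sale = TransactionRecord(customer.identifier, product.identifier,
                             warehouse.identifier, datetime(2009, 7, 14))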

 

Excerpts from Data Portability:

  • “In an environment where the life cycle of software applications used to capture and manage data is but a fraction of the life cycle of the data itself, the issues of data portability and long-term data preservation are critical.”
  • “Claims that an application exports data in XML does address the syntax part of the problem, but that is the easy part.  What is required is to be able to export all of the data in a form that can be easily uploaded into another application.”
  • “In a world rapidly moving towards SaaS and cloud computing, it really pays to pause and consider not just the physical security of your data but its portability.”

 

Not So Strange Case of Dr. Technology and Mr. Business

Strange Case of Dr Jekyll and Mr Hyde was Robert Louis Stevenson's classic novella about the duality of human nature and the inner conflict of our personal sense of good and evil that can undermine our noblest intentions.  The novella exemplified this inner conflict using the self-destructive split-personality of Henry Jekyll and Edward Hyde.

The duality of data quality's nature can sometimes cause an organizational conflict between the Business and IT.  The complexity of a data quality project can sometimes work against your best intentions.  Knowledge about data, business processes and supporting technology are spread throughout the organization. 

Neither the Business nor IT alone has all of the necessary information required to achieve data quality success. 

As a data quality consultant, I am often asked to wear many hats – and not just because my balding head is distractingly shiny. 

I often play a hybrid role that helps facilitate the business and technical collaboration of the project team.

I refer to this hybrid role as using the split-personality of Dr. Technology and Mr. Business.

 

Dr. Technology

With relatively few exceptions, IT is usually the first group that I meet with when I begin an engagement with a new client.  However, this doesn't mean that IT is more important than the Business.  Consultants are commonly brought on board after the initial business requirements have been drafted and the data quality tool has been selected.  Meeting with IT first is especially common if one of my tasks is to help install and configure the data quality tool.

When I meet with IT, I use my Dr. Technology personality.  IT needs to know that I am there to share my extensive experience and best practices from successful data quality projects to help them implement a well architected technical solution.  I ask about data quality solutions that have been attempted previously, how well they were received by the Business, and if they are still in use.  I ask if IT has any issues with or concerns about the data quality tool that was selected.

I review the initial business requirements with IT to make sure I understand any specific technical challenges such as data access, server capacity, security protocols, scheduled maintenance and after-hours support.  I freely “geek out” in techno-babble.  I debate whether Farscape or Battlestar Galactica was the best science fiction series in television history.  I verify the favorite snack foods of the data architects, DBAs, and server administrators since whenever I need a relational database table created or more temporary disk space allocated, I know the required currency will often be Mountain Dew and Doritos.

 

Mr. Business

When I meet with the Business for the first time, I do so without my IT entourage and I use my Mr. Business personality.  The Business needs to know that I am there to help customize a technical solution to their specific business needs.  I ask them to share their knowledge in their natural language using business terminology.  Regardless of my experience with other companies in their industry, every organization and their data is unique.  No assumptions should be made by any of us.

I review the initial requirements with the Business to make sure I understand who owns the data and how it is used to support the day-to-day operation of each business unit and initiative.  I ask if the requirements were defined before or after the selection of the data quality tool.  Knowing how the data quality tool works can sometimes cause a “framing effect” where requirements are defined in terms of tool functionality, framing them as a technical problem instead of a business problem.  All data quality tools provide viable solutions driven by impressive technology.  Therefore, the focus should always be on stating the problem and solution criteria in business terms.

 

Dr. Technology and Mr. Business Must Work Together

As the cross-functional project team starts working together, my Dr. Technology and Mr. Business personalities converge to help clarify communication by providing bi-directional translation, mentoring, documentation, training and knowledge transfer.  I can help interpret business requirements and functional specifications, help explain business and technical challenges, and help maintain an ongoing dialogue between the Business and IT. 

I can also help each group save face by playing the important role of Designated Asker of Stupid Questions – one of those intangible skills you can't find anywhere on my resume.

As the project progresses, the communication and teamwork between the Business and IT will become more and more natural and I will become less and less necessary – one of my most important success criteria.

 

Success is Not So Strange

When the Business and IT forge an ongoing collaborative partnership throughout the entire project, success is not so strange.

In fact, your data quality project can be the beginning of a beautiful friendship between the Business and IT. 

Everyone on the project team can develop a healthy split-personality. 

IT can use their Mr. Business (or Ms. Business) personality to help them understand the intricacies of business processes. 

The Business can use their Dr. Technology personality to help them “get their geek on.”

 

Data quality success is all about shiny happy people holding hands – and what's so strange about that?

 

Related Posts

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You

 

Additional Resources

From the Data Quality Pro forum, read the discussion: Data Quality is not an IT issue

From the blog Inside the Biz with Jill Dyché, read her posts:

From Paul Erb's blog Comedy of the Commons, read his post: I Don't Know Much About Data, but I Know What I Like