Adventures in Data Profiling (Part 3)
Jim Harris in
Data Quality,
Methodology tagged
Adventure in Data Profiling,
Data Profiling
Sunday, August 9, 2009 at 10:08PM In Part 2 of this series: The adventures continued with a detailed analysis of the Customer ID field and the preliminary analysis of the Gender Code and Customer Name fields. This provided you with an opportunity to become familiar with the features of the fictional data profiling tool that you are using throughout this series to assist with performing your analysis.
Additionally, some of your fellow Data Gazers have provided excellent insights and suggestions via the comments they have left, including my time traveling alter ego who has left you some clues from what the future might hold when you reach the end of these adventures in data profiling.
In Part 3, you will continue your adventures by using a combination of field values and field formats to begin your analysis of the following fields: Birth Date, Telephone Number and E-mail Address.
Birth Date
The field summary for Birth Date includes input metadata along with the summary and additional statistics provided by the data profiling tool. Let's assume that drill-downs revealed the single profiled field data type was DATE and the single profiled field format was MM-DD-CCYY (i.e. Month-Day-Year).
Combined with the profiled minimum/maximum field lengths and minimum/maximum field values, the good news appears to be that when Birth Date is populated it does contain a date value.
However, the not so good news is that the profiled maximum field value (December 21, 2012) appears to indicate that some of the customers are either time travelers or the marketing department has a divinely inspired prospect list.
This is a good example of a common data quality challenge – a field value can have a valid data type and a valid format – but an invalid context. Although 12-21-2012 is a valid date in a valid format, in the context of a birth date, it can't be valid.
We can use drill-downs on the field summary “screen” to get more details about Birth Date provided by the data profiling tool.
The cardinality of Birth Date is not only relatively high, but it also has a very low Distinctness (i.e. the same field value frequently occurs on more than one record). Therefore, we will limit the review to only the top ten most frequently occurring values.
Additional analysis can be performed by extracting the birth year and reviewing only its top ten most frequently occurring values. One aspect of this analysis is that it can be used as an easier method for examining the customer age range.
Here we also see two contextually invalid birth years: 2011 and 2012. Any thoughts on a possible explanation for this data anomaly?
Telephone Number
The field summary for Telephone Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.
The presence of both multiple profiled field data types and multiple profiled field formats would appear to indicate inconsistencies in the way that telephone numbers are represented.
The profiled minimum/maximum field lengths show additional inconsistencies, but perhaps more concerning is the profiled minimum/maximum field values, which show obviously invalid telephone numbers.
Telephone Number is a good example of how you should not mistake Completeness (which as a data profiling statistic indicates the field is populated with an Actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.
This summary information points to the need to use drill-downs in order to review more detailed information.
The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different data types based on content and numeric precision.
With only ten profiled field formats, we can easily review them all. Most formats appear to be representative of potentially valid telephone numbers. However, there are two formats for 7 digit numbers appearing to indicate local dialing syntax (i.e. missing the area code in the United States). Additionally, there are two formats that appear invalid based on North American standards.
However, a common data quality challenge is that valid field formats can conceal invalid field values.
Since the cardinality of Telephone Number is very high, we will limit the review to only the top ten most frequently occurring values. In this case, more obviously invalid telephone numbers are discovered.
E-mail Address
The field summary for E-mail Address includes input metadata along with the summary statistics provided by the data profiling tool. In order to save some space, I have intentionally omitted the additional profiling statistics for this field.
E-mail Address represents a greater challenge that really requires more than just summary statistics in order to perform effective analysis.
Most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.
In the case of the E-mail Address field, potentially valid field values should be comprised of the sub-values User, Domain and Top Level Domain (TLD). These sub-values also have expected delimiters such as User and Domain being separated by an at symbol (@) and Domain and TLD being separated by a dot symbol(.).
Reviewing the top ten most frequently occurring field formats shows several common potentially valid structures. However, some formats are missing one of the three required sub-values. The formats missing User could be an indication that the field sometimes contains a Website Address.
Extracting the top five most frequently occurring Domain and TLD sub-values provides additional alternative analysis for a high cardinality field.
What other questions can you think of for these fields? Additional analysis could be done using drill-downs to perform a more detailed review of records of interest. What other analysis do you think should be performed for these fields?
In Part 4 of this series: We will continue the adventures by shifting our focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.
Related Posts
Adventures in Data Profiling (Part 1)
Adventures in Data Profiling (Part 2)
Adventures in Data Profiling (Part 4)
Adventures in Data Profiling (Part 5)
Adventures in Data Profiling (Part 6)
Adventures in Data Profiling (Part 7)
Reader Comments (5)
My guess on the DOB's of 2011 & 2012 is the dates were possibly entered as 2 digit dates for 1911 & 1912 but the date validator incorrectly assumed this century. Many date algorithms use a cut off point to decide the century for 2 digits years. In this example we spotted the future dates, but what about a 2 digit date entered for 1906 which would give us 2006 - not in the future but a very young customer. Perhaps there's more we can do about profiling a typical customer age wise?
The dates in the field summary screen look very much like MM-DD-CCYY to me, not DD-MM-CCYY as stated in your text. Could you clarify what's going on there please - have I missed something?
The cardinality of 25,275 offers us about 25275 / 365 = 69 years worth of birth dates - more if we think that some dates won't have a customer...
Jim, excellent series you've published here.
For me, I think the thing that jumps out is just how accessible and easy it is for organisations to find this low-hanging fruit. There is often the misconception that we need to be data analysts armed with the latest tools and gadgets to find these type of issues.
The fact is that I've seen all of the problems you've cited above in business-critical data in banks, telcos, power companies, travel orgs, healthcare - the list goes on, and they're generally not too difficult to spot.
I'm sure some may read this and go yeah, but most systems these days don't allow data to be entered this bad, the fact is that modern apps can often make it more likely for users to enter erroneous information.
So what your mini-course is hopefully illustrating is just how much low-hanging fruit is there ripe for the picking. Of course the real challenge is how to take these stats and build in controls to prevent the causes re-occurring but I hope it inspires people to get started with profiling, it is always a real eye-opener and one of the fastest ways to create positive action.
Great stuff, really enjoyed the series.
Dylan
Phil,
First of all, thanks for catching my unintentional mistake on the date format!
I meant (and have since corrected the error) to write MM-DD-CCYY and not DD-MM-CCYY. No other explanation other than I screwed up – perhaps I should have used a data profiling tool on the content of my blog post :-)
Thanks also for offering the explanation that I was intentionally hinting at as a possible explanation for the apparently invalid birth years of 2011 and 2012. An additional question on the possibility that early 20th century dates could be misinterpreted as early 21st century dates giving us very young customers would be to ask if it would or would not make sense to have customers that young? Therefore, as you have suggested, perhaps there is more analysis that needs to be done regarding customer age range after we get some preliminary feedback from business analysts and subject matter experts.
Best Regards…
Jim
Dylan,
I completely agree that some readers will claim “those type of issues don’t happen in our data.”
My goal for the series is definitely my hope that it will inspire people to get started with data profiling. As you stated, it is always a real eye-opener when you actually look at the data.
It always surprises (and disappoints) me how often people don’t do it.
And you certainly don’t need to have a data profiling tool in order to perform the analysis covered in the series. A tool does make it easier since it can automate a lot of the menial labor for you, but as I keep mentioning (and illustrating) throughout the series, the data profiling tool does not automate the analysis.
Thanks for sharing your perspective. It is always greatly appreciated.
Best Regards…
Jim
Dylan,
You're right, there is plenty of low hanging fruit to be picked, but the only people picking it are the ones who are looking up at it. Those with their heads down saying 'yeah, but that doesn't happen' won't even see it.
Very often looking for the obvious is the most difficult just because it's the last thing you'd expect. I lost my keys today. I found them exactly where I'd left them on my desk in full view but it isn't where I normally keep them. If I'd not been influenced by habit I would have seen them straight away.
People who know the data inside out are the ones most likely to miss the easy issues, something experienced Data Gazers would do well to remember.
I also agree with your comment about the challenge of correcting the issues upstream. I have long been fighting to get representation for reporting/data warehousing into the operational front-end development. If developers understood the data quality issues in the data warehouse they might design the systems differently, although not all DQ issues are down to the robustness of the software.