Adventures in Data Profiling (Part 2)
Jim Harris in
Data Quality,
Methodology tagged
Adventure in Data Profiling,
Data Profiling
Wednesday, August 5, 2009 at 10:51PM In Part 1 of this series: The adventures began with the following scenario – You are an external consultant on a new data quality initiative. You have got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses...ok, maybe not those last two or three things – but the rest is true.
You have no prior knowledge of the data or its expected characteristics. You are performing this analysis without the aid of either business requirements or subject matter experts. Your goal is to learn us much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.
The customer data source was processed by the data profiling tool, which provided the following statistical summaries:
The Adventures Continue...
In Part 1, we asked if Customer ID was the primary key for this data source. In an attempt to answer this question, let's “click” on it and drill-down to a field summary provided by the data profiling tool:
Please remember that my data profiling tool is fictional (i.e. not modeled after any real product) and therefore all of my “screen shots” are customized to illustrate series concepts. This “screen” would not only look differently in a real data profiling tool, but it would also contain additional information.
This field summary for Customer ID includes some input metadata, identifying the expected data type and field length. Verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data.
The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.
We can use drill-downs on the field summary “screen” to get more details about Customer ID provided by the data profiling tool.
The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS). Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).
Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values). Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.
The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit. We could also continue drilling down to see the actual field values behind the field formats.
Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?
In Part 1, we asked why the Gender Code field has 8 distinct values. Cardinality can play a major role in deciding whether or not you want to drill-down to field values or field formats since it is much easier to review all of the field values when there are not very many of them. Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields).
We will drill-down to this “screen” to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.
It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male. Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).
However, it is dangerous to assume anything and what about those numeric values? Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields. For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name and do the records with an Actual Gender Code value indicate the presence of a personal name?
To attempt to answer these questions, it may be helpful to review records with each of these field values. Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:
As is so often the case, data rarely conforms to our assumptions about it. Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?
In Part 3 of this series: We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and E-mail Address.
Related Posts
Adventures in Data Profiling (Part 1)
Adventures in Data Profiling (Part 3)
Adventures in Data Profiling (Part 4)
Adventures in Data Profiling (Part 5)
Adventures in Data Profiling (Part 6)
Adventures in Data Profiling (Part 7)
Reader Comments (12)
It's a fair bet that Customer ID is a sequence but what happens when it exceeds 10 digits? Will it be increased to accommodate more or will some old data be culled and the number re-used. As a Data Warehouse (DW) we don't want to delete any data - the upstream may not care.
Has the gender coding changed over time, or are all variations still being entered? Is there more than 1 system supplying this data, each having different standards?
Phil,
Thanks for continuing to provide your perspectives and questions – both are greatly appreciated.
I share your concern about Customer ID – definitely a question to include in our report.
I also share your deductive questions about Gender Code. Let’s pretend that we fortuitously found ourselves standing in line in the cafeteria behind a business analyst that we know is working on the first draft of the business requirements and we decide to engage him in some "casual" conversation:
Us: “Hey Bill, how are you today? Looks like everyone is in line for the pizza today. Hey, I was wondering what the source of that customer data I am looking at is – do you know by any chance?”
Bill: “Hey Consultant Dude, I am going well, thanks for asking. Yes, pizza day is always a big hit in the cafeteria. That customer data we gave you? Oh, it’s a full-volume snapshot of our legacy system as of last week.”
Us: “Cool, good to know. How old is the legacy system?”
Bill: “Well, it’s been used as our system of record for as long as I have been here and that will be 15 years next week.”
Us: “Congratulations, Bill. I hope you at least get your anniversary as an extra personal day. At the very least, let me buy your pizza today.”
We external consultants are wily folk :-)
So, perhaps we can assume that since the data source is a legacy system that has been in use for at least 15 years, it is certainly possible that gender coding has changed over time and that older records may not have been updated when the conventions changed. However, that would still be an assumption and it probably still doesn’t explain some of the records we have looked at so far, does it?
Best Regards…
Jim
There's only 1 gender code but 2 customer name fields. Which customer field does gender apply to?
On one hand it appears to be customer name 1 since we have Elinor (Female?) and Robert (Male?) Frost recorded as Female (from name 1) but by contrast we have Patrick Thames and George Eliot (name 1) with different genders (1 & 2, whatever they are).
Then there's the Griffins who are both in the same field - perhaps we need a 'Family Guy' and 'Family Gal' as gender codes...
What can we deduce about the numeric codes? We can see from the distribution that there are more F's than M's and more Females than Males so we might guess that because there are more 1's than 2's the 1's must be Female.
Is it true that this business deals with more Female customers than Male? What business is it? If the distribution ought be roughly equal split between genders then we might decide that 1's are male to balance it up.
Oh, and there's some 0's in there as well, we can't assume that 0 is unknown, it might be 0 female, 1 male, 2 unknown...
There's little point speculating, we have to ask the question about the numeric values, or do you have something else up your sleeve?
Excellent analysis, as usual Phil!
The only additional information that I will reveal at this time (count it as another preview provided by Bill’s gratitude for us buying his pizza), is that the numeric gender codes were intended to represent the following scheme:
0 = Unknown
1 = Male
2 = Female
When we return to a more detailed analysis of Customer Name later in the series (probably Part 5 of what now is looking like a 8 part series), we will may wonder if the input Gender Code can be trusted or if we should recommend that some external reference data is used to make gender recommendations for all of the personal names found on the customer records.
Perhaps I have said too much…this comment will self destruct in…too late :-)
Best Regards…
Jim
Excellent article Jim.
And you tease us with your strip-tease like exposure of the tantalizing data and rules that exist below the surface...
Something I found interesting about Gender Code is the Customer ID values associated with them. For example Patrick Thames has Customer ID of 725019 and a Gender Code of 1, while Tereza M. Kundera has Customer ID of 2232687 and a Gender Code of F, and Mary Anne Evans has a customer ID of 2828666 and a Gender Code of 2. If the Customer ID is truly a sequential numeric key, then it seems that the different Gender Code values (numeric vs alpha) are being entered into the database during the same time periods. I wonder if this implies that the data in this legacy database are coming from different source systems and therefore supplying difference code values.
Kay,
You have a remarkable eye for spotting a potential pattern in the data and you provide an excellent deduction for why this pattern could be occurring.
I am writing this series in part because I truly find data analysis to be a fascinating adventure. I have blogged about what Arkady Maydanchik calls Data Gazing and why this is an essential skill for data quality initiatives (and really all enterprise information initiatives).
So I can’t help but make an additional observation of my own about Gender Code – it sometimes appears to be incorrect.
For example, is George Eliot (Gender Code = 2 = Female), as my favorite Johnny Cash song would say, analogous to “A Boy Named Sue?”
Beyond my intentional joke (George Eliot was the pen name used by 19th century English novelist Mary Anne Evans), can we truly determine gender code with absolute certainty – even when it is provided to us directly from the customer?
After all, Mary Anne Evans sadly had to tell her readership that she was a man in order (as she personally felt whether it was actually true or not) for her works to be taken seriously as literature at the time.
Best Regards…
Jim
P.S. Additionally, it may amuse you to know that the Customer ID values in my data were generated using a random number function – but I cannot deny the pattern you noticed.
Jim,
I am amused. Thanks for the chuckle before heading into my weekend.
:-)
Regards back atcha...
Kay
It is very good analysis and discussion on Profiling.
Some additional information that I want to contribute after looking at profiling report is that, as you see Customer Id (Field Format) Counts are in proper format. There is not a single number missed in range (1 - 3,338,190). Just look at counts for different format (say in single digit number maximum 9 values can exists, in two digits 90 values and so on). So there is no missing Customer Id whatever random number function is used, they may be or may not be sequential.
Apart from that, seven digits numbers still have "6,661,809" (9,000,000- 2,338,191) values available that is 284% of occupied values (i.e. 3,338,190). So right know DWH is comfortable to accept almost three times customers than the present count. This can help us to decide strategy for future Customer Ids.
Rahul
Jim,
Catching up on my reading. So I decided to go sequentially. I have not read part 3 or above.
A quick question, on the table where you have values for gender code and count, would getting dates be possible? Min(Created Date), Max(Created Date), Min (Update Date), Max (Update Date).
If we can get some visibility into these dates, some of the answers to above discussions can be found (if the Legacy system implemented how genders are entered or if some of this information is coming from data migration done from other systems etc.)
Was that information available to the consultant?
Oh, and by the way, if you are a consultant next to me in the line for lunch on my anniversary with my employer, I like to eat brownies with my Pizza... :)
Vish
Hi Vish,
First of all, I am glad to see that this series is on your reading list :)
I agree that checking the associated MIN/MAX Create and Update Dates would be excellent analysis to perform in order to determine if the gender coding conventions have changed over time.
This information was unfortunately not available for this data source.
However, as I mentioned above, even if we could confirm that (and when) the gender coding conventions changed, it probably still doesn’t explain the gender coding of the records we have looked at so far.
Best Regards...
Jim
Alright...if dates are not available...
It seems like either 0 or Unknown is being entered when there is no clarity about the name:
- 0 or Unknown is being used when only initials are in the customer name (0 for T.S.Elliot and Unknown for J.D. Salinger)
- 0 is used for Huggy Bear Brown (Sounds too suspicious to be a real name)
- Unknown is used for Peter and Lois Griffin
While this pattern does not have statistically significant set to conclude, further digging into this would be a good idea...
Look for the Gender Code where "&" or "and" is in the customer name.
Vish Agashe