Jim Harris

My name is Jim Harris, I am the Blogger-in-Chief of OCDQ Blog, and an independent consultant, speaker, and freelance writer for hire.

My Services Contact Me
Search OCDQ Blog
Recent Comments
« Hailing Frequencies Open | Main | Imagining the Future of Data Quality »
Tuesday
Aug182009

Adventures in Data Profiling (Part 4)

In Part 3 of this series:  The adventures continued with a detailed analysis of the fields Birth Date, Telephone Number and E-mail Address.  This provided you with an opportunity to become familiar with analysis techniques that use a combination of field values and field formats. 

You also saw examples of how valid values in a valid format can have an invalid context, how valid field formats can conceal invalid field values, and how free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.

In Part 4, you will continue your adventures in data profiling by going postal...postal address that is, by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

 

Country Code

Field Values for Country Code

  In Part 1, we wondered if 5 distinct Country Code field values indicated international postal addresses.  This drill-down “screen” provided by the data profiling tool shows the frequency distribution.  First of all, the field name might have lead us to assume we would only see ISO 3166 standard country codes.

 

However, two of the field values are a country name and not a country code.  This is another example of how verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data. 

Secondly, the field values would appear to indicate that most of the postal addresses are from the United States.  However, if you recall from Part 3, we discovered some potential clues during our analysis of Telephone Number, which included two formats that appear invalid based on North American standards, and E-mail Address, which included country code Top Level Domain (TLD) values for Canada and the United Kingdom.

Additionally, whenever a field is both 100% complete and has an extremely low cardinality, it could be an indication that the most frequently occurring value is the field's default value. 

Therefore, is it possible that US is simply the default value for Country Code for this data source?

 

Zip Code

Field Formats for Zip Code

  From the Part 1 comments, it was noted that Zip Code as a field name is unique to the postal code system used in the United States (US).  This drill-down “screen” provided by the data profiling tool shows the field has only a total of ten field formats.   

  The only valid field formats for ZIP (which, by the way, is an acronym for Zone Improvement Plan) are 5 digits and 9 digits when the 4 digit ZIP+4 add-on code is also present, which according to the US postal standards should be separated from the 5 digit ZIP Code using a hyphen.

 

The actual field formats in the Zip Code field of this data source reveal another example of how we should not make assumptions about our data based on the metadata that describes it.  Although the three most frequently occurring field formats appear to be representative of potentially valid US postal codes, the alphanumeric postal code field formats are our first indication that it is, perhaps sadly, not all about US (pun intended, my fellow Americans).

The two most frequently occurring alphanumeric field formats appear to be representative of potentially valid Canadian postal codes.  An interesting thing to note is that their combined frequency distribution is double the count of the number of records having CA as a Country Code field value.  Therefore, if these field formats are representative of a valid Canadian postal code, then some Canadian records have a contextually invalid field value in Country Code.

The other alphanumeric field formats appear to be representative of potentially valid postal codes for the United Kingdom (UK).  To the uninitiated, the postal codes of Canada (CA) and the UK appear very similar.  Both postal code formats contain two parts, which according to their postal standards should be separated by a single character space. 

In CA postal codes, the first part is called the Forward Sortation Area (FSA) and the second part is called the Local Delivery Unit (LDU).  In UK postal codes, the first part is called the outward code and the second part is called the inward code. 

One easy way to spot the difference is that a UK inward code always has the format of a digit followed by two letters (i.e. “naa” in the field formats generated by my fictional data profiling tool), whereas a CA LDU always has the format of a digit followed by a letter followed by another digit (i.e. “nan”). 

However, we should never rule out the possibility of transposed values making a CA postal code look like a UK postal code, or vice versa.  Also, never forget the common data quality challenge of valid field formats concealing invalid field values.

Returning to the most frequently occurring field format of 5 digits, can we assume all valid field values would represent US postal addresses?  Of course not.  One significant reason is that a 5 digit postal code is one of the most common formats in the world. 

Just some of the other countries also using a 5 digit postal code include: Algeria, Cuba, Egypt, Finland, France, Germany, Indonesia, Israel, Italy, Kuwait, Mexico, Spain, and Turkey.

What about the less frequently occurring field formats of 4 digits and 6 digits?  It is certainly possible that these field formats could indicate erroneous attempts at entering a valid US postal code.  However, it could also indicate the presence of additional non-US postal addresses.

Just some of the countries using a 4 digit postal code include: Australia, Austria, Belgium, Denmark, El Salvador, Georgia (no, the US state did not once again secede, there is also a country called Georgia and its not even in the Americas), Hungary, Luxembourg, Norway, and Venezuela.  Just some of the countries using a 6 digit postal code include: Belarus, China, India, Kazakhstan (yes, Borat fans, Kazakhstan is a real country), Russia, and Singapore.

Additionally, why do almost 28% of the records in this data source not have a field value for Zip Code? 

One of the possibilities is that we could have postal addresses from countries that do not have a postal code system.  Just a few examples would be: Aruba, Bahamas (sorry fellow fans of the Beach Boys, but both Jamaica and Bermuda have a postal code system, and therefore I could not take you down to Kokomo), Fiji (home of my favorite bottled water), and Ireland (home of my ancestors and inventors of my second favorite coffee).

 

State Abbreviation

Field Values for State Abbreviation

  From the Part 1 comments, it was noted that the cardinality of State Abbreviation appeared suspect because, if we assume that its content matches its metadata, then we would expect only 51 distinct values (i.e. actual US state abbreviations without counting US territories) and not the 72 distinct values discovered by the data profiling tool.

  Let's assume that drill-downs have revealed the single profiled field data type was CHAR, and the profiled minimum/maximum field lengths were both 2.  Therefore, State Abbreviation, when populated, always contains a two character field value.    

  This drill-down “screen” first displays the top ten most frequently occurring values in the State Abbreviation field, which are all valid US state abbreviations.  The frequency distributions are also within general expectations since eight of the largest US states by population are represented.

 

However, our previous analysis of Country Code and Zip Code has already made us aware that international postal addresses exist in this data source.  Therefore, this drill-down “screen” also displays the top ten most frequently occurring non-US values based on the data profiling tool comparing all 72 distinct values against a list of valid US state and territory abbreviations.

Most of the field values discovered by this analysis appear to be valid CA province codes (including PQ being used as a common alternative for QC – the province of Quebec or Québec si vous préférez).  These frequency distributions are also within general expectations since six of the largest CA provinces by population are represented.  Their combined frequency distribution is also fairly close to the combined frequency distribution of potentially valid Canadian postal codes found in the Zip Code field.

However, we still have three additional values (ZZ, SA, HD) which require more analysis.  Additionally, almost 22% of the records in this data source do not have a field value for State Abbreviation, which could be attributable to the fact that even when the postal standards for other countries include something analogous to a US state or CA province, it might not be compatible with a two character field length.

 

City Name

Let's assume that we have performed some preliminary analysis on the statistical summaries and frequency distributions provided by the data profiling tool for the City Name field using the techniques illustrated throughout this series so far. 

Let's also assume analyzing the City Name field in isolation didn't reveal anything suspicious.  The field is consistently populated and its frequently occurring values appeared to meet general expectations.  Therefore, let's assume we have performed additional drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for City Name

Based on reviewing these records, what conclusions, assumptions, and questions do you have about the City Name field?

 

What other questions can you think of for these fields?  What other analysis do you think should be performed for these fields?

 

In Part 5 of this series:  We will continue the adventures in data profiling by completing our initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

 

International Man of Postal Address Standards

Since I am a geographically-challenged American, the first (and often the only necessary) option I choose for assistance with international postal address standards is Graham Rhind

His excellent book The Global Source-Book for Address Data Management is an invaluable resource and recognized standard reference that contains over 1,000 pages of data pertaining to over 240 countries and territories.

PrintView Printer Friendly Version

EmailEmail Article to Friend

Reader Comments (9)

I think the illustration shows us real examples. If so, then a question is coming up: what is the source of these data?

Were they typed in manually over a GUI or through an integration software from another database? I can imagine that Hamburg is a US-city, but it is a little suspicious that London and Paris (not to mention New Delhi) is placed also in the US.

I think the country code is set automatically by the software, i.e. the country is not given from the original source (be it human or other database), because there are zero NULL values for it.

Sorry for my question, which is not related to the data integration profession: do you really think, that the target audience of your blog does not know, that Georgia is not only a US-state?

August 19, 2009 | Unregistered CommenterTibor Bossanyi

Ah, you knew I'd wake up when you got to postal codes, right, Jim?

You've kindly plugged the Global Sourcebook, so here's some information from it relating to your post:

El Salvador postal codes have 4 digits - it was 5 digits before 1997; and Iran now have postal codes of 10 digits.

Georgia now has a 4-digit system and Azerbaijan's format is now "AZ" followed by 4 digits.

Both Jamaica and Columbia now have postal code systems :-)

It's difficult to keep up sometimes ...

Cheers!

Graham

August 19, 2009 | Unregistered CommenterGraham Rhind

Graham,

Yes, I was willing to place sizable wagers that:

(1) You would comment on this part of the series
(2) I would mess up several of the postal code formats

Thanks for both – I have edited the post based on your feedback.

Best Regards…

Jim

August 19, 2009 | Registered CommenterJim Harris

Tibor,

The answer to both of your questions is that although I both acknowledge and appreciate my intelligent and internationally savvy audience, part of the motivation behind some of the details in this post is to talk directly to a common problem that we have here in the United States – many companies have data sources that have US-specific data structures for storing postal addresses despite the acknowledged presence of international content.

This common problem results in many non-US postal addresses suffering additional data quality problems caused by the inability to properly enter them (even when you know the country-specific standards) into fields designed to store a US postal address.

As for my Georgia joke – I would be willing to wager that even if all of my US colleagues knew that it was also a country, most of them could not find it on a map – myself included by the way.

Best Regards…

Jim

August 19, 2009 | Registered CommenterJim Harris

US-specific data structures have caused me much grief when online shopping!

I can't get an Australian address to fit into the boxes. So I put in some commas and make my state and postcode fit (which looks rather similar to your city name field).

Worst case scenario is if the zip field is not nullable, because then I have to make up a zip code (and sometimes I get told it is not a valid one, so I have to go look up a zip code) and then HOPE that someone reads my "Comments" box to realize that half of the address is fake. And then sometimes as a result of all this messing around, my visa card won't validate. It can be incredibly frustrating.

For an interesting read about your favourite brand of water, visit: Fiji Water: Spin the Bottle.


(For more information on Georgia, visit the Eurovision Song Contest website.)

August 19, 2009 | Unregistered CommenterTamyka Bell

I don't want to sound like a stick in the mud but now might be a good time to remind ourselves about what we are trying to achieve. The purpose of the series is to demonstrate profiling techniques so any analysis of any data is a good thing to bring out issues and examples and I accept that, so let's carry on analyzing everything.

In the real world, we must constantly remember that Data Quality is connected to it being fit for purpose and it might be seen as a waste of resource analyzing and fixing data that makes no difference to it's ultimate purpose.

For example, in the UK the postcode format, as you point out, has an outcode and an incode separated by a single space. Now let's assume that the ONLY use for our postcode is to print an address label in order to send out mailshots and/or letters. For that purpose the postcode is to be read by humans and the number of spaces between in and out codes is irrelevant. If we spent time analyzing and cleansing postcodes with incorrect spacings what benefit would that give us?

I'm off topic now, sorry, but my point is that before I can justify spending lots of time profiling I must already have priorities about which fields can give benefits back. Also, by showing real benefits to our sponsors they will be more likely to allow us to carry on dealing with the rest of the data quality stuff we love gazing at. :-)

Back on topic (I think), UK postcodes have changed format, previously they were fixed length 8 characters with space padding between the out and in codes. There's a time element involved when checking fields, how we decide the cut-off time in our zip code examples I don't know, but it's a question.

August 20, 2009 | Unregistered CommenterPhil A

Tamyka,

Thanks for sharing your experiences with the all too common prevalence of US-specific data structures in the web forms used for online shopping.

Earlier this year, the Data Value Talk blog posted:

It’s how you approach your customer – Search for the ultimate data entry form

Thanks also for the article on bottled water. Recently, I have been drinking Smartwater – because it has been on sale at my local supermarket, but that article makes me want to switch permanently.

The Eurovision Song Contest (and Georgia’s entries in it) has come up in the comments section of one of my previous posts:

The Very Model of a Modern DQ General

Best Regards…

Jim

August 20, 2009 | Registered CommenterJim Harris

Phil,

As usual, I completely agree with your perspective and I thank you for taking the time to share it.

Late last year on the B-eye-Network, David Loshin blogged about the need for performing a Directed Data Quality Assessment, where “instead of just throwing the data into the profiler and hoping that something good comes out…concentrate on the statistical details associated with the critical data elements as a way to evaluate the extent to which data anomalies might impact the business.”

Business requirements and subject matter experts are necessary to discuss usage, relevancy, standards and the metrics for measuring and improving data quality – in order to progress from basic data profiling to performing a full data quality assessment.

As you have clearly stated: the purpose of the series is to demonstrate profiling techniques so any analysis of any data is a good thing to bring out issues and examples.

Loshin also states at the beginning of the same blog post that “in situations where not much is known about the data, profiling provides some insight into basic issues with the data.”

As I have explained in earlier comments, I have been on many projects where I was told that I should delay data profiling until business requirements and subject matter experts are available. I always disagree because I think that I can do a better job of evaluating the business requirements and preparing for my meetings with subject matter experts when I have spent some time looking at the data from a starting point of blissful ignorance and curiosity.

However, in a more realistic scenario, the data source would have more than 15 fields and there would most likely be more than just one data source. Therefore, even preliminary data profiling would have to make some educated guesses to exclude some fields from analysis and not try to look at all of them.

When I have multiple data sources, I sometimes limit my preliminary data profiling to a particular domain or subject area and analyze only the fields in each of the data sources that I think would be relevant. For example, I might want to get a handle on just the postal address challenges and therefore only profile the fields in each data source that I think contain postal address data.

Best Regards…

Jim

August 20, 2009 | Registered CommenterJim Harris

Jim,

I am warmed by your attitude of "looking at the data from a starting point of blissful ignorance and curiosity". In this state of mind you will be open and have no pre-conceptions of the data before the experts start telling you what it looks like. It's an opportunity to take a look at the wood before the trees get in the way.

August 21, 2009 | Unregistered CommenterPhil A

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>