Recent Comments
Affiliate Links
« Fantasy League Data Quality | Main | Adventures in Data Profiling (Part 5) »
Thursday
03Sep2009

To Parse or Not To Parse

“To Parse, or Not To Parse,—that is the question:
Whether 'tis nobler in the data to suffer
The slings and arrows of free-form fields,
Or to take arms against a sea of information,
And by parsing, understand them?”

Little known fact: before William Shakespeare made it big as a playwright, he was a successful data quality consultant. 

Alas, poor data quality!  The Bard of Avon knew it quite well.  And he was neither a fan of free verse nor free-form fields.

 

Free-Form Fields

A free-form field contains multiple (usually interrelated) sub-fields.  Perhaps the most common examples of free-form fields are customer name and postal address.

A Customer Name field with the value “Christopher Marlowe” is comprised of the following sub-fields and values:

  • Given Name = “Christopher”
  • Family Name = “Marlowe”

A Postal Address field with the value “1587 Tambur Lane” is comprised of the following sub-fields and values:

  • House Number = “1587”
  • Street Name = “Tambur”
  • Street Type = “Lane”

Obviously, both of these examples are simplistic.  Customer name and postal address are comprised of additional sub-fields, not all of which will be present on every record or represented consistently within and across data sources.

Returning to the bard's question, a few of the data quality reasons to consider parsing free-form fields include:

  • Data Profiling
  • Data Standardization
  • Data Matching

 

Much Ado About Analysis

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  In Adventures in Data Profiling (Part 5), a data profiling tool was used to analyze the field Postal Address Line 1:

Field Formats for Postal Address Line 1

 

The Taming of the Variations

Free-form fields often contain numerous variations resulting from data entry errors, different conventions for representing the same value, and a general lack of data quality standards.  Additional variations are introduced by multiple data sources, each with its own unique data characteristics and quality challenges.

Data standardization parses free-form fields to break them down into their smaller individual sub-fields to gain improved visibility of the available input data.  Data standardization is the taming of the variations that creates a consistent representation, applies standard values where appropriate, and when possible, populates missing values.

The following example shows parsed and standardized postal addresses:

Parsed and Standardized Postal Address

In your data quality implementations, do you use this functionality for processing purposes only?  If you retain the standardized results, do you store the parsed and standardized sub-fields or just the standardized free-form value?

 

Shall I compare thee to other records?

Data matching often uses data standardization to prepare its input.  This allows for more direct and reliable comparisons of parsed sub-fields with standardized values, decreases the failure to match records because of data variations, and increases the probability of effective match results.

Imagine matching the following product description records with and without the parsed and standardized sub-fields:

Parsed and Standardized Product Description

 

Doth the bard protest too much? 

Please share your thoughts and experiences regarding free-form fields.

PrintView Printer Friendly Version

EmailEmail Article to Friend

Reader Comments (7)

Great examples Jim, free-form fields were the bane of my life on large scale data migration and integration projects.

I remember one project where we tried to match 2 systems which should have had the same equipment. System A had strong record attribution, i.e. the equipment type was broken down and recorded correctly. This resulted in 500 equipment types, about half of those turned out to be duplicates.

In the free-form equipment type field on the other system we found 7,500 distinct equipment types! This made matching a complete nightmare.

When you have a standard ruleset for one subject area (e.g. USA Address), life is a lot easier. However, when you move to sub-categories of rules (e.g. Global addresses or each manufacturer having their own coding and naming system), then it can dramatically increase the effort and cost. Product catalogues and master data management are obvious advantages here.

Technology has come a hell of a long way and the parsing algorithms are incredibly smart now but this problem should really be resolved when designing the data model and data structures. Organisations need to think ahead and understand how other systems, not just the operational system hosting the data, will want to connect with this data. From a business intelligence viewpoint, free-form fields are a dead end for example but they often contain the most attractive dimensions and values of information.

Well, that's my tuppenceworth anyway, would be really interested in hearing other people's free-form anecdotes.

September 4, 2009 | Unregistered CommenterDylan Jones

When it comes to addresses, one needs to be careful when parsing into component parts (what you call "sub-fields"). Unless there's a good procedural reason for parsing an address component (for example, parsing postal code and building number to allow postal address validation), I would counsel caution, because putting the data back into a meaningful format again is full of pitfalls. You illustrate this well when 221B Baker Street becomes 221 Baker Street Unit B - not the same address at all!

There is, for example, no real need to parse the thoroughfare name away from the thoroughfare type - on the ground they refer to a single entity, and the constants can be standardised. I've seen parsed data, such as AVENUE OF THE AMERICAS being reformed as OF THE AMERICAS AVENUE because the order is not that which is found "normally", and the relative positions of the components was not noted during parsing.

Attempting to parse personal names into component parts is doomed to cause more damage to the data than it resolves - I counsel strongly against anything more than superficial processing of names.

Dylan is absolutely on the button - any need to parse (and cause problems with the data) can be obviated by collecting and storing the data correctly in the first place.

September 4, 2009 | Unregistered CommenterGraham Rhind

Jim,

Great post - crystal clear - leaves one in no doubt as to the challenge posed by free form fields, the requirement to parse them, and the benefits of retaining the parsed and standardised sub-fields.

Ken

September 4, 2009 | Unregistered CommenterKen O'Connor

Good examples. It gets even worse than that, though. A decent parser should have the ability to know whether it is being handed an address or, for example, a name that has been misfielded. If the parser sees "Jim Harris" as an address, it should be smart enough to move along and look for the address elsewhere. It shouldn't make any assumptions that there is actually address data in the 'postal address' field.

Worst case scenario is when you get other data stuck into a name and address. You might see a tax id number or vin number, for example. A key parser feature would be to capture that misfielded data and not just cleanse the address by throwing it away. Again, no assumptions.

You don't see that type of misfielded bad data as much in modern databases, where it's more difficult to break the data model once it's configured. But you definitely see it in legacy systems like old mainframes, where it was much more difficult to tweak the data model when necessary.

September 4, 2009 | Unregistered CommenterSteve Sarsfield

Balancing whether to parse or not to parse is truly difficult and closely related to the art of avoiding false positives and false negatives. The project where I have worked the most with this is a global matching solution described here.

The solution I have ended up with here is that I do both. The matching is based on both parsed data and original data.

September 4, 2009 | Unregistered CommenterHenrik Liliendahl Sørensen

Jim,

Once again you make a great point on a subject which I thought everyone already knew!

Great comments too folks!

I would add that accurate parsing must be accompanied by a robust dictionary enabling the matching engine to take advantage of possible alias values for the parsed elements.

The classic Bob & Robert examples are handled either by populating name standards fields used for cross matching, or internally by the matching engine through alias tables. I personally find the internal alias table method preferable, as adding all of the alias values to the parsed data can increase the volume considerably.

Address issues can be more troublesome without the benefit of the USPS or UPU data in the case of international matching.

The parsing engines I have worked with recently have been particularly good at identifying name, address, attention text, and other extraneous information, found in the name and address stencil. Two in particular actually offer named elements or “Extra” buckets for this information, taking it out of the way of the matching process.

I don’t usually comment publicly on these issues but I have too often heard that a client “doesn’t want to add confusion by parsing and standardizing”, feeling that the original data is the most accurate way to match. I do not agree.

Keep up the posts Jim!

Triebs

September 6, 2009 | Unregistered CommenterTriebs

From the LinkedIn Group for Master Data Management, Jacqueline Roberts commented:

Great Article.

Data Parsing in the most efficient way to “auto” classification, improve search capabilities and to compare item records. It requires a standard template for descriptive technical properties. For instance, a class – CONE,TAPERED ROLLER BEARING would include the properties: Outside Diameter, Cup Width, Maximum House Fillet Radius, etc. The properties will always appear in the same sequence when the technical description is built.

The Long Description is:

CUP,TAPERED ROLLER BEARING - | OUTSIDE DIAMETER: 1.781 INCHES | CUP WIDTH: 0.475 INCHES | MAXIMUM HOUSE FILLET RADIUS: 0.050 INCHES | MAXIMUM HOUSING SHOULDER DIAMETER: 1.555 INCHES | MINIMUM HOUSING SHOULDER DIAMETER: 1.634 INCHES | BEARING WIDTH: 0.610 INCHES | TEMPERATURE RANGE: -20 -250 FAHRENHEIT | AGENCY COMPLIANCE: ABMA

Data parsing and profiling improves the automation of Master Data Management and Data Quality from every aspect of the business processes from engineering, purchasing to the maintenance employee searching for a functional equivalent part that has shut down the manufacturing line.

September 9, 2009 | Registered CommenterJim Harris

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>