Jim Harris

My name is Jim Harris, I am the Blogger-in-Chief of OCDQ Blog, and an independent consultant, speaker, and freelance writer for hire.

My Services Contact Me
Search OCDQ Blog
Recent Comments

Entries in Data Profiling (17)

Thursday
Jul072011

Data Profiling Early and Often

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

On this episode of OCDQ Radio, I discuss data profiling with James Standen, the founder and CEO of nModal Solutions Inc., the makers of Datamartist, which is a fast, easy to use, visual data profiling and transformation tool.

Before founding nModal, James had over 15 years experience in a broad range of roles involving data, ranging from building business intelligence solutions, creating data warehouses and a data warehouse competency center, through to working on data migration and ERP projects in large organizations.  You can learn more about and connect with James Standen on LinkedIn.

James thinks that while there is obviously good data and bad data, that often bad data is just misunderstood and can be coaxed away from the dark side if you know how to approach it.  He does recommend wearing the proper safety equipment however, and having the right tools.  For more of his wit and wisdom, follow Datamartist on Twitter, and read the Datamartist Blog.

 

Data Profiling Early and Often

Additional listening options:

 

Related Posts

Adventures in Data Profiling

Alternatives to Enterprise Data Quality Tools

The Real Data Value is Business Insight

Finding Data Quality

I’m Gonna Data Profile (500 Records)

Data Governance Star Wars

Master Data Management in Practice

The Art of Data Matching

Data Quality Pro

A Brave New Data World

Tuesday
Mar012011

Data Qualia

In philosophy (according to Wikipedia), the term qualia is used to describe the subjective quality of conscious experience.

Examples of qualia are the pain of a headache, the taste of wine, or the redness of an evening sky.  As Daniel Dennett explains:

“Qualia is an unfamiliar term for something that could not be more familiar to each of us:

The ways things seem to us.”

Like truth, beauty, and singing ability, data quality is in the eyes of the beholder, or since data quality is most commonly defined as fitness for the purpose of use, we could say that data quality is in the eyes of the user.

However, most data has both multiple uses and multiple users.  Data of sufficient quality for one use or one user may not be of sufficient quality for other uses and other users.  Quite often these diverse data needs and divergent data quality perspectives make it a daunting challenge to provide meaningful data quality metrics to the organization.

Recently on the Data Roundtable, Dylan Jones of Data Quality Pro discussed the need to create data quality reports that matter, explaining that if you’re relying on canned data profiling reports (i.e., column statistics and data quality metrics at an attribute, table, and system level), then you are measuring data quality in isolation of how the business is performing.

Instead, data quality metrics must measure data qualia—the subjective quality of the user’s business experience with data:

“Data Qualia is an unfamiliar term for something that must become more familiar to the organization:

The ways data quality impact business performance.”

Related Posts

The Point of View Paradox

DQ-BE: Single Version of the Time

Single Version of the Truth

Beyond a “Single Version of the Truth”

The Idea of Order in Data

Hell is other people’s data

DQ-BE: Data Quality Airlines

DQ-Tip: “There is no such thing as data accuracy...”

Data Quality and the Cupertino Effect

DQ-Tip: “Data quality is primarily about context not accuracy...”

Monday
Feb212011

Alternatives to Enterprise Data Quality Tools

The recent analysis by Andy Bitterer of Gartner Research (and ANALYSTerical) about the acquisition of open source data quality tool DataCleaner by the enterprise data quality vendor Human Inference, prompted the following Twitter conversation:

Since enterprise data quality tools can be cost-prohibitive, more prospective customers are exploring free and/or open source alternatives, such as the Talend Open Profiler, licensed under the open source General Public License, or non-open source, but entirely free alternatives, such as the Ataccama DQ Analyzer.  And, as Andy noted in his analysis, both of these tools offer an easy transition to the vendors’ full-fledged commercial data quality tools, offering more than just data profiling functionality.

As Henrik Liliendahl Sørensen explained, in his blog post Data Quality Tools Revealed, data profiling is the technically easiest part of data quality, which explains the tool diversity, and early adoption of free and/or open source alternatives.

And there are also other non-open source alternatives that are more affordable than enterprise data quality tools, such as Datamartist, which combines data profiling and data migration capabilities into an easy-to-use desktop application.

My point is neither to discourage the purchase of enterprise data quality tools, nor promote their alternatives—and this blog post is certainly not an endorsement—paid or otherwise—of the alternative data quality tools I have mentioned simply as examples.

My point is that many new technology innovations originate from small entrepreneurial ventures, which tend to be specialists with a narrow focus that can provide a great source of rapid innovation.  This is in contrast to the data management industry trend of innovation via acquisition and consolidation, embedding data quality technology within data management platforms, which also provide data integration and master data management (MDM) functionality as well, allowing the mega-vendors to offer end-to-end solutions and the convenience of one-vendor information technology shopping.

However, most software licenses for these enterprise data management platforms start in the six figures.  On top of the licensing, you have to add the annual maintenance fees, which are usually in the five figures.  Add to the total cost of the solution, the professional services that are needed for training and consulting for installation, configuration, application development, testing, and production implementation—and you have another six figure annual investment.

Debates about free and/or open source software usually focus on the robustness of functionality and the intellectual property of source code.  However, from my perspective, I think that the real reason more prospective customers are exploring these alternatives to enterprise data quality tools is because of the free aspect—but not because of the open source aspect.

In other words—and once again I am only using it as an example—I might download Talend Open Profiler because I wanted data profiling functionality at an affordable price—but not because I wanted the opportunity to customize its source code.

I believe the “try it before you buy it” aspect of free and/or open source software is what’s important to prospective customers.

Therefore, enterprise data quality vendors, instead of acquiring an open source tool as Human Inference did with DataCleaner, how about offering a free (with limited functionality) or trial version of your enterprise data quality tool as an alternative option?

 

Related Posts

Do you believe in Magic (Quadrants)?

Can Enterprise-Class Solutions Ever Deliver ROI?

Which came first, the Data Quality Tool or the Business Need?

Selling the Business Benefits of Data Quality

What Data Quality Technology Wants

Tuesday
Dec282010

I’m Gonna Data Profile (500 Records)

While researching my blog post (to be published on December 31) about the best data quality blog posts of the year, I re-read the great post Profound Profiling by Daragh O Brien, which recounted how he found data profiling cropping up in conversations and presentations he’d made this year, even where the topic of the day wasn’t “Information Quality” and shared his thoughts on the profound business benefits of data profiling for organizations seeking to manage risk and ensure compliance.

And I noticed that I had actually commented on this blog post . . . with song lyrics . . .

 

I’m Gonna Data Profile (500 Records) *

When I wake up, well I know I’m gonna be,
I’m gonna be the one who profiles early and often for you
When I go out, yeah I know I’m gonna be
I’m gonna be the one who goes along with data
If I get drunk, well I know I’m gonna be
I’m gonna be the one who gets drunk on managing risk for you
And if I haver up, yeah I know I’m gonna be
I’m gonna be the one who’s havering about how: “It’s the Information, Stupid!”

But I would profile 500 records
And I would profile 500 more
Just to be the one who profiles a thousand records
To deliver the profound business benefits of data profiling to your door

da da da da – ta ta ta ta
da da da da – ta ta ta ta – data!
da da da da – ta ta ta ta
da da da da – ta ta ta ta – data profiling!

When I’m working, yes I know I’m gonna be
I’m gonna be the one who’s working hard to ensure compliance for you
And when the money, comes in for the work I do
I’ll pass almost every penny on to improving data for you
When I come home (When I come home), well I know I’m gonna be
I’m gonna be the one who comes back home with data quality
And if I grow-old, (When I grow-old) well I know I’m gonna be
I’m gonna be the one who’s growing old with information quality

But I would profile 500 records
And I would profile 500 more
Just to be the one who profiles a thousand records
To deliver the profound business benefits of data profiling to your door

da da da da – ta ta ta ta
da da da da – ta ta ta ta – data!
da da da da – ta ta ta ta
da da da da – ta ta ta ta – data profiling!

When I’m lonely, well I know I’m gonna be
I’m gonna be the one who’s lonely without data profiling to do
And when I’m dreaming, well I know I’m gonna dream
I’m gonna dream about the time when I’m data profiling for you
When I go out (When I go out), well I know I’m gonna be
I’m gonna be the one who goes along with data
And when I come home (When I come home), yes I know I’m gonna be
I’m gonna be the one who comes back home with data quality
I’m gonna be the one who’s coming home with information quality

But I would profile 500 records
And I would profile 500 more
Just to be the one who profiles a thousand records
To deliver the profound business benefits of data profiling to your door

da da da da – ta ta ta ta
da da da da – ta ta ta ta – data!
da da da da – ta ta ta ta
da da da da – ta ta ta ta – data profiling!

___________________________________________________________________________________________________________________

* Based on the 1988 song I’m Gonna Be (500 Miles) by The Proclaimers.

 

Data Quality Music (DQ-Songs)

Over the Data Governance Rainbow

A Record Named Duplicate

New Time Human Business

People

You Can’t Always Get the Data You Want

A spoonful of sugar helps the number of data defects go down

Data Quality is such a Rush

I’m Bringing DQ Sexy Back

Imagining the Future of Data Quality

The Very Model of a Modern DQ General

Monday
Aug232010

The Real Data Value is Business Insight

Data Values for COUNTRY Understanding your data usage is essential to improving its quality, and therefore, you must perform data analysis on a regular basis.

A data profiling tool can help you by automating some of the grunt work needed to begin your data analysis, such as generating levels of statistical summaries supported by drill-down details, including data value frequency distributions (like the ones shown to the left).

However, a common mistake is to hyper-focus on the data values.

Narrowing your focus to the values of individual fields is a mistake when it causes you to lose sight of the wider context of the data, which can cause other errors like mistaking validity for accuracy.

Understanding data usage is about analyzing its most important context—how your data is being used to make business decisions.

 

“Begin with the decision in mind”

In his excellent recent blog post It’s time to industrialize analytics, James Taylor wrote that “organizations need to be much more focused on directing analysts towards business problems.”  Although Taylor was writing about how, in advanced analytics (e.g., data mining, predictive analytics), “there is a tendency to let analysts explore the data, see what can be discovered,” I think this tendency is applicable to all data analysis, including less advanced analytics like data profiling and data quality assessments.

Please don’t misunderstand—Taylor and I are not saying that there is no value in data exploration, because, without question, it can definitely lead to meaningful discoveries.  And I continue to advocate that the goal of data profiling is not to find answers, but instead, to discover the right questions.

However, as Taylor explained, it is because “the only results that matter are business results” that data analysis should always “begin with the decision in mind.  Find the decisions that are going to make a difference to business results—to the metrics that drive the organization.  Then ask the analysts to look into those decisions and see what they might be able to predict that would help make better decisions.”

Once again, although Taylor is discussing predictive analytics, this cogent advice should guide all of your data analysis.

 

The Real Data Value is Business Insight

The Real Data Value is Business Insight

Returning to data quality assessments, which create and monitor metrics based on summary statistics provided by data profiling tools (like the ones shown in the mockup to the left), elevating what are low-level technical metrics up to the level of business relevance will often establish their correlation with business performance, but will not establish metrics that drive—or should drive—the organization.

Although built from the bottom-up by using, for the most part, the data value frequency distributions, these metrics lose sight of the top-down fact that business insight is where the real data value lies.

However, data quality metrics such as completeness, validity, accuracy, and uniqueness, which are just a few common examples, should definitely be created and monitored—unfortunately, a single straightforward metric called Business Insight doesn’t exist.

But let’s pretend that my other mockup metrics were real—50% of the data is inaccurate and there is an 11% duplicate rate.

Oh, no!  The organization must be teetering on the edge of oblivion, right?  Well, 50% accuracy does sound really bad, basically like your data’s accuracy is no better than flipping a coin.  However, which data is inaccurate, and far more important, is the inaccurate data actually being used to make a business decision?

As for the duplicate rate, I am often surprised by the visceral reaction it can trigger, such as: “how can we possibly claim to truly understand who our most valuable customers are if we have an 11% duplicate rate?”

So, would reducing your duplicate rate to only 1% automatically result in better customer insight?  Or would it simply mean that the data matching criteria was too conservative (e.g., requiring an exact match on all “critical” data fields), preventing you from discovering how many duplicate customers you have?  (Or maybe the 11% indicates the matching criteria was too aggressive).

My point is that accuracy and duplicate rates are just numbers—what determines if they are a good number or a bad number?

The fundamental question that every data quality metric you create must answer is: How does this provide business insight?

If a data quality (or any other data) metric can not answer this question, then it is meaningless.  Meaningful metrics always represent business insight because they were created by beginning with the business decisions in mind.  Otherwise, your metrics could provide the comforting, but false, impression that all is well, or you could raise red flags that are really red herrings.

Instead of beginning data analysis with the business decisions in mind, many organizations begin with only the data in mind, which results in creating and monitoring data quality metrics that provide little, if any, business insight and decision support.

Although analyzing your data values is important, you must always remember that the real data value is business insight.

 

Related Posts

The First Law of Data Quality

Adventures in Data Profiling

Data Quality and the Cupertino Effect

Is your data complete and accurate, but useless to your business?

The Idea of Order in Data

You Can’t Always Get the Data You Want

Red Flag or Red Herring? 

DQ-Tip: “There is no point in monitoring data quality…”

Which came first, the Data Quality Tool or the Business Need?

Selling the Business Benefits of Data Quality

Thursday
Jul082010

Finding Data Quality

Have you ever experienced that sinking feeling, where you sense if you don’t find data quality, then data quality will find you?

In the spring of 2003, Pixar Animation Studios produced one of my all-time favorite Walt Disney Pictures—Finding Nemo

This blog post is an hommage to not only the film, but also to the critically important role into which data quality is cast within all of your enterprise information initiatives, including business intelligence, master data management, and data governance. 

I hope that you enjoy reading this blog post, but most important, I hope you always remember: “Data are friends, not food.”

 

Data Silos

“Mine!  Mine!  Mine!  Mine!  Mine!”

That’s the Data Silo Mantra—and it is also the bane of successful enterprise information management.  Many organizations persist on their reliance on vertical data silos, where each and every business unit acts as the custodian of their own private data—thereby maintaining their own version of the truth.

Impressive business growth can cause an organization to become a victim of its own success.  Significant collateral damage can be caused by this success, and most notably to the organization’s burgeoning information architecture.

Earlier in an organization’s history, it usually has fewer systems and easily manageable volumes of data, thereby making managing data quality and effectively delivering the critical information required to make informed business decisions everyday, a relatively easy task where technology can serve business needs well—especially when the business and its needs are small.

However, as the organization grows, it trades effectiveness for efficiency, prioritizing short-term tactics over long-term strategy, and by seeing power in the hoarding of data, not in the sharing of information, the organization chooses business unit autonomy over enterprise-wide collaboration—and without this collaboration, successful enterprise information management is impossible.

A data silo often merely represents a microcosm of an enterprise-wide problem—and this truth is neither convenient nor kind.

 

Data Profiling

“I see a light—I’m feeling good about my data . . . Good feeling’s gone—AHH!”

Although it’s not exactly a riddle wrapped in a mystery inside an enigma,  understanding your data is essential to using it effectively and improving its quality—to achieve these goals, there is simply no substitute for data analysis.

Data profiling can provide a reality check for the perceptions and assumptions you may have about the quality of your data.  A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.

However, it is important to remember that the analysis itself can not be automated—you need to translate your analysis into the meaningful reports and questions that will facilitate more effective communication and help establish tangible business context.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions. 

Discovering the right questions requires talking with data’s best friends—its stewards, analysts, and subject matter experts.  These discussions are a critical prerequisite for determining data usage, standards, and the business relevant metrics for measuring and improving data quality.  Always remember that well performed data profiling is highly interactive and a very iterative process.

 

Defect Prevention

“You, Data-Dude, takin’ on the defects.
You’ve got serious data quality issues, dude.
Awesome.”

Even though it is impossible to truly prevent every problem before it happens, proactive defect prevention is a highly recommended data quality best practice because the more control enforced where data originates, the better the overall quality will be for enterprise information.

Although defect prevention is most commonly associated with business and technical process improvements, after identifying the burning root cause of your data defects, you may predictably need to apply some of the principles of behavioral data quality.

In other words, understanding the complex human dynamics often underlying data defects is necessary for developing far more effective tactics and strategies for implementing successful and sustainable data quality improvements.

 

Data Cleansing

“Just keep cleansing.  Just keep cleansing.
Just keep cleansing, cleansing, cleansing.
What do we do?  We cleanse, cleanse.”

That’s not the Data Cleansing Theme Song—but it can sometimes feel like it.  Especially whenever poor data quality negatively impacts decision-critical information, the organization may legitimately prioritize a reactive short-term response, where the only remediation will be fixing the immediate problems.

Balancing the demands of this data triage mentality with the best practice of implementing defect prevention wherever possible, will often create a very challenging situation for you to contend with on an almost daily basis.

Therefore, although comprehensive data remediation will require combining reactive and proactive approaches to data quality, you need to be willing and able to put data cleansing tools to good use whenever necessary.

 

Communication

“It’s like he’s trying to speak to me, I know it.
Look, you’re really cute, but I can’t understand what
you’re saying.
Say that data quality thing again.”

I hear this kind of thing all the time (well, not the “you’re really cute” part).

Effective communication improves everyone’s understanding of data quality, establishes a tangible business context, and helps prioritize critical data issues. 

Keep in mind that communication is mostly about listening.  Also, be prepared to face “data denial” when data quality problems are discussed.  Most often, this is a natural self-defense mechanism for the people responsible for business processes, technology, and data—and because of the simple fact that nobody likes to feel blamed for causing or failing to fix the data quality problems.

The key to effective communication is clarity.  You should always make sure that all data quality concepts are clearly defined and in a language that everyone can understand.  I am not just talking about translating the techno-mumbojumbo, because even business-speak can sound more like business-babbling—and not just to the technical folks.

Additionally, don’t be afraid to ask questions or admit when you don’t know the answers.  Many costly mistakes can be made when people assume that others know (or pretend to know themselves) what key concepts and other terminology actually mean.

Never underestimate the potential negative impacts that the point of view paradox can have on communication.  For example, the perspectives of the business and technical stakeholders can often appear to be diametrically opposed.

Practicing effective communication requires shutting our mouth, opening our ears, and empathically listening to each other, instead of continuing to practice ineffective communication, where we merely take turns throwing word-darts at each other.

 

Collaboration

“Oh and one more thing:
When facing the daunting challenge of collaboration,
Work through it together, don't avoid it.
Come on, trust each other on this one.
Yes—trust—it’s what successful teams do.”

Most organizations suffer from a lack of collaboration, and as noted earlier, without true enterprise-wide collaboration, true success is impossible.

Beyond the data silo problem, the most common challenge for collaboration is the divide perceived to exist between the Business and IT, where the Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise, and IT usually owns the hardware and software infrastructure of the enterprise’s technical architecture.

However, neither the Business nor IT alone has all of the necessary knowledge and resources required to truly be successful.  Data quality requires that the Business and IT forge an ongoing and iterative collaboration.

You must rally the team that will work together to improve the quality of your data.  A cross-disciplinary team will truly be necessary because data quality is neither a business issue nor a technical issue—it is both, truly making it an enterprise issue.

Executive sponsors, business and technical stakeholders, business analysts, data stewards, technology experts, and yes, even consultants and contractors—only when all of you are truly working together as a collaborative team, can the enterprise truly achieve great things, both tactically and strategically.

Successful enterprise information management is spelled E—A—C.

Of course, that stands for Enterprises—Always—Collaborate.  The EAC can be one seriously challenging place, dude.

You don’t know if you know what they know, or if they know what you know, but when you know, then they know, you know?

It’s like first you are all like “Whoa!” and they are all like “Whoaaa!” then you are like “Sweet!” and then they are like “Totally!”

This critical need for collaboration might seem rather obvious.  However, as all of the great philosophers have taught us, sometimes the hardest thing to learn is the least complicated.

Okay.  Squirt will now give you a rundown of the proper collaboration technique:

“Good afternoon. We’re gonna have a great collaboration today.
Okay, first crank a hard cutback as you hit the wall.
There’s a screaming bottom curve, so watch out.
Remember: rip it, roll it, and punch it.”

 

Finding Data Quality

As more and more organizations realize the critical importance of viewing data as a strategic corporate asset, data quality is becoming an increasingly prevalent topic of discussion.

However, and somewhat understandably, data quality is sometimes viewed as a small fish—albeit with a “lucky fin”—in a much larger pond.

In other words, data quality is often discussed only in its relation to enterprise information initiatives such as data integration, master data management, data warehousing, business intelligence, and data governance.

There is nothing wrong with this perspective, and as a data quality expert, I admit to my general tendency to see data quality in everything.  However, regardless of the perspective from which you begin your journey, I believe that eventually you will be Finding Data Quality wherever you look as well.

 

Follow OCDQ

If you enjoyed this blog post, then please subscribe to OCDQ via my RSS feed, my E-mail updates, or Google Reader.

You can also follow OCDQ on Twitter, fan the Facebook page for OCDQ, and connect with me on LinkedIn.


Thursday
Mar042010

Adventures in Data Profiling

Data profiling is a critical step in a variety of information management projects, including data quality initiatives, MDM implementations, data migration and consolidation, building a data warehouse, and many others.

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.

 

Webinar

In this vendor-neutral eLearningCurve webinar, I discuss the common functionality provided by data profiling tools, which can help automate some of the work needed to begin your preliminary data analysis.

You can download (no registration required) the webinar (.wmv file) using this link: Adventures in Data Profiling Webinar

 

Presentation

You can download the presentation (no registration required) used in the webinar as an Adobe Acrobat Document (.pdf file) using this link: Adventures in Data Profiling Presentation

 

Complete Blog Series

You can read (no registration required) the complete OCDQ blog series Adventures in Data Profiling by following these links:

Tuesday
Dec012009

Adventures in Data Profiling (Part 8)

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.  This post is the conclusion of a vendor-neutral series on the methodology of data profiling.

Data profiling can help you perform essential analysis such as:

  • Provide a reality check for the perceptions and assumptions you may have about the quality of your data
  • Verify your data matches the metadata that describes it
  • Identify different representations for the absence of data (i.e., NULL and other missing values)
  • Identify potential default values
  • Identify potential invalid values
  • Check data formats for inconsistencies
  • Prepare meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage, and other important data architecture considerations.

 

Adventures in Data Profiling

This series was carefully designed as guided adventures in data profiling in order to provide the necessary framework for demonstrating and discussing the common functionality of data profiling tools and the basic methodology behind using one to perform preliminary data analysis.

In order to narrow the scope of the series, the scenario used was a customer data source for a new data quality initiative had been made available to an external consultant with no prior knowledge of the data or its expected characteristics.  Additionally, business requirements had not yet been documented, and subject matter experts were not currently available.

This series did not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that were covered.  Both the data profiling tool and data used throughout the series were fictional.  The “screen shots” were customized to illustrate concepts and were not modeled after any particular data profiling tool.

This post summarizes the lessons learned throughout the series, and is organized under three primary topics:

  1. Counts and Percentages
  2. Values and Formats
  3. Drill-down Analysis

 

Counts and Percentages

One of the most basic features of a data profiling tool is the ability to provide counts and percentages for each field that summarize its content characteristics:

 Data Profiling Summary

  • NULL – count of the number of records with a NULL value 
  • Missing – count of the number of records with a missing value (i.e., non-NULL absence of data, e.g., character spaces) 
  • Actual – count of the number of records with an actual value (i.e., non-NULL and non-Missing) 
  • Completeness – percentage calculated as Actual divided by the total number of records 
  • Cardinality – count of the number of distinct actual values 
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records 
  • Distinctness – percentage calculated as Cardinality divided by Actual

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique.  In Part 2, Customer ID provided an excellent example.

Distinctness can be useful in evaluating the potential for duplicate records.  In Part 6, Account Number and Tax ID were used as examples.  Both fields were less than 100% distinct (i.e., some distinct actual values occurred on more than one record).  The implied business meaning of these fields made this an indication of possible duplication.

Data profiling tools generate other summary statistics including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).  Throughout the series, several examples were provided, especially in Part 3 during the analysis of Birth Date, Telephone Number and E-mail Address.

 

Values and Formats

In addition to counts, percentages, and other summary statistics, a data profiling tool generates frequency distributions for the unique values and formats found within the fields of your data source.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality, indicating potential default values (e.g., Country Code in Part 4)
  • Fields with a relatively low cardinality (e.g., Gender Code in Part 2)
  • Fields with a relatively small number of known valid values (e.g., State Abbreviation in Part 4)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g., Customer ID in Part 2)
  • Fields with a relatively limited number of known valid formats (e.g., Birth Date in Part 3)
  • Fields with free-form values and a high cardinality (e.g., Customer Name 1 and Customer Name 2 in Part 7)

Cardinality can play a major role in deciding whether you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values, as we saw throughout the series (e.g., Telephone Number in Part 3).

Some fields can also be analyzed using partial values (e.g., in Part 3, Birth Year was extracted from Birth Date) or a combination of values and formats (e.g., in Part 6, Account Number had an alpha prefix followed by all numbers).

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  This analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high distinctness (i.e., the exact same field value rarely occurs on more than one record). 

Additionally, the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.  Examples of free-form field analysis were the focal points of Part 5 and Part 7.

We also saw examples of how valid values in a valid format can have an invalid context (e.g., in Part 3, Birth Date values set in the future), as well as how valid field formats can conceal invalid field values (e.g., Telephone Number in Part 3).

Part 3 also provided examples (in both Telephone Number and E-mail Address) of how you should not mistake completeness (which as a data profiling statistic indicates a field is populated with an actual value) for an indication the field is complete in the sense that its value contains all of the sub-values required to be considered valid. 

 

Drill-down Analysis

A data profiling tool will also provide the capability to drill-down on its statistical summaries and frequency distributions in order to perform a more detailed review of records of interest.  Drill-down analysis will often provide useful data examples to share with subject matter experts.

Performing a preliminary analysis on your data prior to engaging in these discussions better facilitates meaningful dialogue because real-world data examples better illustrate actual data usage.  As stated earlier, understanding your data is essential to using it effectively and improving its quality.

Various examples of drill-down analysis were used throughout the series.  However, drilling all the way down to the record level was shown in Part 2 (Gender Code), Part 4 (City Name), and Part 6 (Account Number and Tax ID).

 

Conclusion

Fundamentally, this series posed the following question: What can just your analysis of data tell you about it?

Data profiling is typically one of the first tasks performed on a data quality initiative.  I am often told to delay data profiling until business requirements are documented and subject matter experts are available to answer my questions. 

I always disagree – and begin data profiling as soon as possible.

I can do a better job of evaluating business requirements and preparing for meetings with subject matter experts after I have spent some time looking at data from a starting point of blissful ignorance and curiosity.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions.

Discovering the right questions is a critical prerequisite for effectively discussing data usage, relevancy, standards, and the metrics for measuring and improving quality.  All of which are necessary in order to progress from just profiling your data, to performing a full data quality assessment (which I will cover in a future series on this blog).

A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.  However, it is important to remember that the analysis itself can not be automated – you need to review the statistical summaries and frequency distributions generated by the data profiling tool and more important translate your analysis into meaningful reports and questions to share with the rest of your team. 

Always remember that well performed data profiling is both a highly interactive and a very iterative process.

 

Thank You

I want to thank you for providing your feedback throughout this series. 

As my fellow Data Gazers, you provided excellent insights and suggestions via your comments. 

The primary reason I published this series on my blog, as opposed to simply writing a whitepaper or a presentation, was because I knew our discussions would greatly improve the material.

I hope this series proves to be a useful resource for your actual adventures in data profiling.

 

The Complete Series


Monday
Oct192009

Adventures in Data Profiling (Part 7)

In Part 6 of this seriesYou completed your initial analysis of the Account Number and Tax ID fields. 

Previously during your adventures in data profiling, you have looked at customer name within the context of other fields.  In Part 2, you looked at the associated customer names during drill-down analysis on the Gender Code field while attempting to verify abbreviations as well as assess NULL and numeric values.  In Part 6, you investigated customer names during drill-down analysis for the Account Number and Tax ID fields while assessing the possibility of duplicate records. 

In Part 7 of this award-eligible series, you will complete your initial analysis of this data source with direct investigation of the Customer Name 1 and Customer Name 2 fields.

 

Previously, the data profiling tool provided you with the following statistical summaries for customer names:

Customer Name Summary

As we discussed when we looked at the E-mail Address field (in Part 3) and the Postal Address Line fields (in Part 5), most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

Customer Name 1 and Customer Name 2 are additional examples of the necessity of this analysis technique.  Not only are the cardinality of these fields very high, but they also have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record).

 

Customer Name 1

The data profiling tool has provided you the following drill-down “screen” for Customer Name 1:

Field Formats for Customer Name 1 

Please Note: The differentiation between given and family names has been based on our fictional data profiling tool using probability-driven non-contextual classification of the individual field values. 

For example, Harris, Edward, and James are three of the most common names in the English language, and although they can also be family names, they are more frequently given names.  Therefore, “Harris Edward James” is assigned “Given-Name Given-Name Given-Name” for a field format.  For this particular example, how do we determine the family name?

The top twenty most frequently occurring field formats for Customer Name 1 collectively account for over 80% of the records with an actual value in this field for this data source.  All of these field formats appear to be common potentially valid structures.  Obviously, more than one sample field value would need to be reviewed using more drill-down analysis. 

What conclusions, assumptions, and questions do you have about the Customer Name 1 field?

 

Customer Name 2

The data profiling tool has provided you the following drill-down “screen” for Customer Name 2:

Field Formats for Customer Name 2 

The top ten most frequently occurring field formats for Customer Name 2 collectively account for over 50% of the records with an actual value in this sparsely populated field for this data source.  Some of these field formats show common potentially valid structures.  Again, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Customer Name 2 field?

 

The Challenges of Person Names

Not that business names don't have their own challenges, but person names present special challenges.  Many data quality initiatives include the business requirement to parse, identify, verify, and format a “valid” person name.  However, unlike postal addresses where country-specific postal databases exist to support validation, no such “standards” exist for person names.

In his excellent book Viral Data in SOA: An Enterprise Pandemic, Neal A. Fishman explains that “a person's name is a concept that is both ubiquitous and subject to regional variations.  For example, the cultural aspects of an individual's name can vary.  In lieu of last name, some cultures specify a clan name.  Others specify a paternal name followed by a maternal name, or a maternal name followed by a paternal name; other cultures use a tribal name, and so on.  Variances can be numerous.”

“In addition,” continues Fishman, “a name can be used in multiple contexts, which might affect what parts should or could be communicated.  An organization reporting an employee's tax contributions might report the name by using the family name and just the first letter (or initial) of the first name (in that sequence).  The same organization mailing a solicitation might choose to use just a title and a family name.”

However, it is not a simple task to identify what part of a person's name is the family name or the first given name (as some of the above data profiling sample field values illustrate).  Again, regional, cultural, and linguistic variations can greatly complicate what at first may appear to be a straightforward business request (e.g. formatting a person name for a mailing label).

As Fishman cautions, “many regions have cultural name profiles bearing distinguishing features for words, sequences, word frequencies, abbreviations, titles, prefixes, suffixes, spelling variants, gender associations, and indications of life events.”

If you know of any useful resources for dealing with the challenges of person names, then please share them by posting a comment below.  Additionally, please share your thoughts and experiences regarding the challenges (as well as useful resources) associated with business names.

 

What other analysis do you think should be performed for customer names?

 

In Part 8 of this series:  We will conclude the adventures in data profiling with a summary of the lessons learned.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Getting Your Data Freq On

Monday
Sep212009

Adventures in Data Profiling (Part 6)

In Part 5 of this seriesYou completed your initial analysis of the fields relating to postal address with the investigation of Postal Address Line 1 and Postal Address Line 2.

You saw additional examples of why free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field. 

You learned this analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record). 

You also saw examples of how the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.

In Part 6, you will continue your adventures in data profiling by analyzing the Account Number and Tax ID fields.

 

Account Number

Field Summary for Account Number

  The field summary for Account Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  In Part 2, we learned that Customer ID is likely an integer surrogate key and the primary key for this data source because it is both 100% complete and 100% unique.  Account Number is 100% complete and almost 100% unique.  Perhaps it was intended to be the natural key for this data source?   

  Let's assume that drill-downs revealed the single profiled field data type was VARCHAR and the single profiled field format was aa-nnnnnnnnn (i.e. 2 characters, followed by a hyphen, followed by a 9 digit number).

  Combined with the profiled minimum/maximum field lengths, the good news appears to be that not only is Account Number always populated, it is also consistently formatted. 

  The profiled minimum/maximum field values appear somewhat suspicious, possibly indicating the presence of invalid values?

 

Field Values for Account Number

  We can use drill-downs on the field summary “screen” to get more details about Account Number provided by the data profiling tool.

  The cardinality of Account Number is very high, as is its Distinctness (i.e. the same field value rarely occurs on more than one record).  Therefore, when we limit the review to only the top ten most frequently occurring values, it is not surprising to see low counts.

  Since we do not yet have a business understanding of the data, we are not sure if it is valid for multiple records to have the same Account Number

  Additional analysis can be performed by extracting the alpha prefix and reviewing its top ten most frequently occurring values.  One aspect of this analysis is that it can be used to assess the possibility that Account Number is an “intelligent key.”  Perhaps the alpha prefix is a source system code?

 

 

Tax ID

Field Summary for Tax ID

   The field summary for Tax ID includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  Let's assume that drill-downs revealed the single profiled field data type was INTEGER and the single profiled field format was nnnnnnnnn (i.e. a 9 digit number).

  Combined with the profiled minimum/maximum field lengths, the good news appears to be that Tax ID is also consistently formatted.  However, the profiled minimum/maximum field values appear to indicate the presence of invalid values.

  In Part 4, we learned that most of the records appear to have either an United States (US) or Canada (CA) postal address.  For US records, the Tax ID field could represent the social security number (SSN), federal employer identification number (FEIN), or tax identification number (TIN).  For CA records, this field could represent the social insurance number (SIN).  All of these identifiers are used for tax reporting purposes and have a 9 digit number format (when no presentation formatting is used).

 

Field Values for Tax ID

  We can use drill-downs on the field summary “screen” to get more details about Tax ID provided by the data profiling tool.

  The Distinctness of Tax ID is slightly lower than Account Number and therefore the same field value does occasionally occur on more than one record.

  Since the cardinality of Tax ID is very high, we will limit the review to only the top ten most frequently occurring values.  This analysis reveals the presence of more (most likely) invalid values.

 

Potential Duplicate Records

In Part 1, we asked if the data profiling statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records.  In other words, since some distinct actual values for these fields occur on more than one record, does this imply more than just a possible data relationship, but a possible data redundancy?  Obviously, we would need to interact with the business team in order to better understand the data and their business rules for identifying duplicate records.

However, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for Account Number and Tax ID

 

What other analysis do you think should be performed for these fields?

 

In Part 7 of this series:  We will continue the adventures in data profiling by completing our initial analysis with the investigation of the Customer Name 1 and Customer Name 2 fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Thursday
Sep032009

To Parse or Not To Parse

“To Parse, or Not To Parse,—that is the question:
Whether 'tis nobler in the data to suffer
The slings and arrows of free-form fields,
Or to take arms against a sea of information,
And by parsing, understand them?”

Little known fact: before William Shakespeare made it big as a playwright, he was a successful data quality consultant. 

Alas, poor data quality!  The Bard of Avon knew it quite well.  And he was neither a fan of free verse nor free-form fields.

 

Free-Form Fields

A free-form field contains multiple (usually interrelated) sub-fields.  Perhaps the most common examples of free-form fields are customer name and postal address.

A Customer Name field with the value “Christopher Marlowe” is comprised of the following sub-fields and values:

  • Given Name = “Christopher”
  • Family Name = “Marlowe”

A Postal Address field with the value “1587 Tambur Lane” is comprised of the following sub-fields and values:

  • House Number = “1587”
  • Street Name = “Tambur”
  • Street Type = “Lane”

Obviously, both of these examples are simplistic.  Customer name and postal address are comprised of additional sub-fields, not all of which will be present on every record or represented consistently within and across data sources.

Returning to the bard's question, a few of the data quality reasons to consider parsing free-form fields include:

  • Data Profiling
  • Data Standardization
  • Data Matching

 

Much Ado About Analysis

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  In Adventures in Data Profiling (Part 5), a data profiling tool was used to analyze the field Postal Address Line 1:

Field Formats for Postal Address Line 1

 

The Taming of the Variations

Free-form fields often contain numerous variations resulting from data entry errors, different conventions for representing the same value, and a general lack of data quality standards.  Additional variations are introduced by multiple data sources, each with its own unique data characteristics and quality challenges.

Data standardization parses free-form fields to break them down into their smaller individual sub-fields to gain improved visibility of the available input data.  Data standardization is the taming of the variations that creates a consistent representation, applies standard values where appropriate, and when possible, populates missing values.

The following example shows parsed and standardized postal addresses:

Parsed and Standardized Postal Address

In your data quality implementations, do you use this functionality for processing purposes only?  If you retain the standardized results, do you store the parsed and standardized sub-fields or just the standardized free-form value?

 

Shall I compare thee to other records?

Data matching often uses data standardization to prepare its input.  This allows for more direct and reliable comparisons of parsed sub-fields with standardized values, decreases the failure to match records because of data variations, and increases the probability of effective match results.

Imagine matching the following product description records with and without the parsed and standardized sub-fields:

Parsed and Standardized Product Description

 

Doth the bard protest too much? 

Please share your thoughts and experiences regarding free-form fields.

Friday
Aug282009

Adventures in Data Profiling (Part 5)

In Part 4 of this series:  You went totally postal...shifting your focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

You learned when a field is both 100% complete and has an extremely low cardinality, its most frequently occurring value could be its default value, how forcing international addresses to be entered into country-specific data structures can cause data quality problems, and with the expert assistance of Graham Rhind, we all learned more about international postal code formats.

In Part 5, you will continue your adventures in data profiling by completing your initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

As we discussed in Part 3 when we looked at the E-mail Address field, most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field. 

Postal Address Line 1 and Postal Address Line 2 are additional examples of the necessity of this analysis technique.  Not only are the cardinality of these fields very high, but they also have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record).  Some variations in postal addresses can be the results of data entry errors, the use of local conventions, or ignoring (or lacking) postal standards.

Additionally, postal address lines can sometimes contain overflow from other fields (e.g. Customer Name) or they can be used as a dumping ground for values without their own fields (e.g. Twitter username), values unable to conform to the limitations of their intended fields (e.g. countries with something analogous to a US state or CA province but incompatible with a two character field length), or comments (e.g. LDIY, which as Steve Sarsfield discovered, warns us about the Large Dog In Yard).

 

Postal Address Line 1

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 1:

Field Formats for Postal Address Line 1

The top twenty most frequently occurring field formats for Postal Address Line 1 collectively account for over 80% of the records with an actual value in this field for this data source.  All of these field formats appear to be common potentially valid structures.  Obviously, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 1 field?

 

Postal Address Line 2

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 2:

Field Formats for Postal Address Line 2

The top ten most frequently occurring field formats for Postal Address Line 2 collectively account for half of the records with an actual value in this sparsely populated field for this data source.  Some of these field formats show several common potentially valid structures.  Again, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 2 field?

 

Postal Address Validation

Many data quality initiatives include the implementation of postal address validation software.  This provides the capability to parse, identify, verify, and format a valid postal address by leveraging country-specific postal databases. 

Some examples of postal validation functionality include correcting misspelled street and city names, populating missing postal codes, and applying (within context) standard abbreviations for sub-fields such as directionals (e.g. N for North and E for East), street types (e.g. ST for Street and AVE for Avenue), and box types (e.g. BP for Boite Postale and CP for Case Postale).  These standards not only vary by country, but can also vary within a country when there are multiple official languages.

The presence of non-postal data can sometimes cause either validation failures (i.e. an inability to validate some records, not a process execution failure) or simply deletion of the unexpected values.  Therefore, some implementations will use a pre-process to extract the non-postal data prior to validation.

Most validation software will append one or more status fields indicating what happened to the records during processing.  It is a recommended best practice to perform post-validation analysis by not only looking at these status fields, but also comparing the record content before and after validation, in order to determine what modifications and enhancements have been performed.

 

What other analysis do you think should be performed for postal address?

 

In Part 6 of this series:  We will continue the adventures by analyzing the Account Number and Tax ID fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Tuesday
Aug182009

Adventures in Data Profiling (Part 4)

In Part 3 of this series:  The adventures continued with a detailed analysis of the fields Birth Date, Telephone Number and E-mail Address.  This provided you with an opportunity to become familiar with analysis techniques that use a combination of field values and field formats. 

You also saw examples of how valid values in a valid format can have an invalid context, how valid field formats can conceal invalid field values, and how free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.

In Part 4, you will continue your adventures in data profiling by going postal...postal address that is, by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

 

Country Code

Field Values for Country Code

  In Part 1, we wondered if 5 distinct Country Code field values indicated international postal addresses.  This drill-down “screen” provided by the data profiling tool shows the frequency distribution.  First of all, the field name might have lead us to assume we would only see ISO 3166 standard country codes.

 

However, two of the field values are a country name and not a country code.  This is another example of how verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data. 

Secondly, the field values would appear to indicate that most of the postal addresses are from the United States.  However, if you recall from Part 3, we discovered some potential clues during our analysis of Telephone Number, which included two formats that appear invalid based on North American standards, and E-mail Address, which included country code Top Level Domain (TLD) values for Canada and the United Kingdom.

Additionally, whenever a field is both 100% complete and has an extremely low cardinality, it could be an indication that the most frequently occurring value is the field's default value. 

Therefore, is it possible that US is simply the default value for Country Code for this data source?

 

Zip Code

Field Formats for Zip Code

  From the Part 1 comments, it was noted that Zip Code as a field name is unique to the postal code system used in the United States (US).  This drill-down “screen” provided by the data profiling tool shows the field has only a total of ten field formats.   

  The only valid field formats for ZIP (which, by the way, is an acronym for Zone Improvement Plan) are 5 digits and 9 digits when the 4 digit ZIP+4 add-on code is also present, which according to the US postal standards should be separated from the 5 digit ZIP Code using a hyphen.

 

The actual field formats in the Zip Code field of this data source reveal another example of how we should not make assumptions about our data based on the metadata that describes it.  Although the three most frequently occurring field formats appear to be representative of potentially valid US postal codes, the alphanumeric postal code field formats are our first indication that it is, perhaps sadly, not all about US (pun intended, my fellow Americans).

The two most frequently occurring alphanumeric field formats appear to be representative of potentially valid Canadian postal codes.  An interesting thing to note is that their combined frequency distribution is double the count of the number of records having CA as a Country Code field value.  Therefore, if these field formats are representative of a valid Canadian postal code, then some Canadian records have a contextually invalid field value in Country Code.

The other alphanumeric field formats appear to be representative of potentially valid postal codes for the United Kingdom (UK).  To the uninitiated, the postal codes of Canada (CA) and the UK appear very similar.  Both postal code formats contain two parts, which according to their postal standards should be separated by a single character space. 

In CA postal codes, the first part is called the Forward Sortation Area (FSA) and the second part is called the Local Delivery Unit (LDU).  In UK postal codes, the first part is called the outward code and the second part is called the inward code. 

One easy way to spot the difference is that a UK inward code always has the format of a digit followed by two letters (i.e. “naa” in the field formats generated by my fictional data profiling tool), whereas a CA LDU always has the format of a digit followed by a letter followed by another digit (i.e. “nan”). 

However, we should never rule out the possibility of transposed values making a CA postal code look like a UK postal code, or vice versa.  Also, never forget the common data quality challenge of valid field formats concealing invalid field values.

Returning to the most frequently occurring field format of 5 digits, can we assume all valid field values would represent US postal addresses?  Of course not.  One significant reason is that a 5 digit postal code is one of the most common formats in the world. 

Just some of the other countries also using a 5 digit postal code include: Algeria, Cuba, Egypt, Finland, France, Germany, Indonesia, Israel, Italy, Kuwait, Mexico, Spain, and Turkey.

What about the less frequently occurring field formats of 4 digits and 6 digits?  It is certainly possible that these field formats could indicate erroneous attempts at entering a valid US postal code.  However, it could also indicate the presence of additional non-US postal addresses.

Just some of the countries using a 4 digit postal code include: Australia, Austria, Belgium, Denmark, El Salvador, Georgia (no, the US state did not once again secede, there is also a country called Georgia and its not even in the Americas), Hungary, Luxembourg, Norway, and Venezuela.  Just some of the countries using a 6 digit postal code include: Belarus, China, India, Kazakhstan (yes, Borat fans, Kazakhstan is a real country), Russia, and Singapore.

Additionally, why do almost 28% of the records in this data source not have a field value for Zip Code? 

One of the possibilities is that we could have postal addresses from countries that do not have a postal code system.  Just a few examples would be: Aruba, Bahamas (sorry fellow fans of the Beach Boys, but both Jamaica and Bermuda have a postal code system, and therefore I could not take you down to Kokomo), Fiji (home of my favorite bottled water), and Ireland (home of my ancestors and inventors of my second favorite coffee).

 

State Abbreviation

Field Values for State Abbreviation

  From the Part 1 comments, it was noted that the cardinality of State Abbreviation appeared suspect because, if we assume that its content matches its metadata, then we would expect only 51 distinct values (i.e. actual US state abbreviations without counting US territories) and not the 72 distinct values discovered by the data profiling tool.

  Let's assume that drill-downs have revealed the single profiled field data type was CHAR, and the profiled minimum/maximum field lengths were both 2.  Therefore, State Abbreviation, when populated, always contains a two character field value.    

  This drill-down “screen” first displays the top ten most frequently occurring values in the State Abbreviation field, which are all valid US state abbreviations.  The frequency distributions are also within general expectations since eight of the largest US states by population are represented.

 

However, our previous analysis of Country Code and Zip Code has already made us aware that international postal addresses exist in this data source.  Therefore, this drill-down “screen” also displays the top ten most frequently occurring non-US values based on the data profiling tool comparing all 72 distinct values against a list of valid US state and territory abbreviations.

Most of the field values discovered by this analysis appear to be valid CA province codes (including PQ being used as a common alternative for QC – the province of Quebec or Québec si vous préférez).  These frequency distributions are also within general expectations since six of the largest CA provinces by population are represented.  Their combined frequency distribution is also fairly close to the combined frequency distribution of potentially valid Canadian postal codes found in the Zip Code field.

However, we still have three additional values (ZZ, SA, HD) which require more analysis.  Additionally, almost 22% of the records in this data source do not have a field value for State Abbreviation, which could be attributable to the fact that even when the postal standards for other countries include something analogous to a US state or CA province, it might not be compatible with a two character field length.

 

City Name

Let's assume that we have performed some preliminary analysis on the statistical summaries and frequency distributions provided by the data profiling tool for the City Name field using the techniques illustrated throughout this series so far. 

Let's also assume analyzing the City Name field in isolation didn't reveal anything suspicious.  The field is consistently populated and its frequently occurring values appeared to meet general expectations.  Therefore, let's assume we have performed additional drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for City Name

Based on reviewing these records, what conclusions, assumptions, and questions do you have about the City Name field?

 

What other questions can you think of for these fields?  What other analysis do you think should be performed for these fields?

 

In Part 5 of this series:  We will continue the adventures in data profiling by completing our initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

 

International Man of Postal Address Standards

Since I am a geographically-challenged American, the first (and often the only necessary) option I choose for assistance with international postal address standards is Graham Rhind

His excellent book The Global Source-Book for Address Data Management is an invaluable resource and recognized standard reference that contains over 1,000 pages of data pertaining to over 240 countries and territories.

Sunday
Aug092009

Adventures in Data Profiling (Part 3)

In Part 2 of this series:  The adventures continued with a detailed analysis of the Customer ID field and the preliminary analysis of the Gender Code and Customer Name fields.  This provided you with an opportunity to become familiar with the features of the fictional data profiling tool that you are using throughout this series to assist with performing your analysis.

Additionally, some of your fellow Data Gazers have provided excellent insights and suggestions via the comments they have left, including my time traveling alter ego who has left you some clues from what the future might hold when you reach the end of these adventures in data profiling.

In Part 3, you will continue your adventures by using a combination of field values and field formats to begin your analysis of the following fields: Birth Date, Telephone Number and E-mail Address

 

Birth Date

Field Summary for Birth Date

 

  The field summary for Birth Date includes input metadata along with the summary and additional statistics provided by the data profiling tool.  Let's assume that drill-downs revealed the single profiled field data type was DATE and the single profiled field format was MM-DD-CCYY (i.e. Month-Day-Year). 

  Combined with the profiled minimum/maximum field lengths and minimum/maximum field values, the good news appears to be that when Birth Date is populated it does contain a date value.

  However, the not so good news is that the profiled maximum field value (December 21, 2012) appears to indicate that some of the customers are either time travelers or the marketing department has a divinely inspired prospect list.

  This is a good example of a common data quality challenge – a field value can have a valid data type and a valid format – but an invalid context.  Although 12-21-2012 is a valid date in a valid format, in the context of a birth date, it can't be valid.

 

Field Values for Birth Date

 

  We can use drill-downs on the field summary “screen” to get more details about Birth Date provided by the data profiling tool.

  The cardinality of Birth Date is not only relatively high, but it also has a very low Distinctness (i.e. the same field value frequently occurs on more than one record).  Therefore, we will limit the review to only the top ten most frequently occurring values.

  Additional analysis can be performed by extracting the birth year and reviewing only its top ten most frequently occurring values.  One aspect of this analysis is that it can be used as an easier method for examining the customer age range.

  Here we also see two contextually invalid birth years: 2011 and 2012.  Any thoughts on a possible explanation for this data anomaly?

 

Telephone Number

Field Summary for Telephone Number

  The field summary for Telephone Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  The presence of both multiple profiled field data types and multiple profiled field formats would appear to indicate inconsistencies in the way that telephone numbers are represented.

  The profiled minimum/maximum field lengths show additional inconsistencies, but perhaps more concerning is the profiled minimum/maximum field values, which show obviously invalid telephone numbers.

  Telephone Number is a good example of how you should not mistake Completeness (which as a data profiling statistic indicates the field is populated with an Actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.

  This summary information points to the need to use drill-downs in order to review more detailed information.

 

Field Values for Telephone Number

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different data types based on content and numeric precision.

  With only ten profiled field formats, we can easily review them all.  Most formats appear to be representative of potentially valid telephone numbers.  However, there are two formats for 7 digit numbers appearing to indicate local dialing syntax (i.e. missing the area code in the United States).  Additionally, there are two formats that appear invalid based on North American standards.

  However, a common data quality challenge is that valid field formats can conceal invalid field values.

  Since the cardinality of Telephone Number is very high, we will limit the review to only the top ten most frequently occurring values.  In this case, more obviously invalid telephone numbers are discovered.  

 

E-mail Address

Field Summary for E-mail Address

 

  The field summary for E-mail Address includes input metadata along with the summary statistics provided by the data profiling tool.  In order to save some space, I have intentionally omitted the additional profiling statistics for this field.

  E-mail Address represents a greater challenge that really requires more than just summary statistics in order to perform effective analysis.

  Most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

 

Field Values for E-mail Address

 

  In the case of the E-mail Address field, potentially valid field values should be comprised of the sub-values User, Domain and Top Level Domain (TLD).  These sub-values also have expected delimiters such as User and Domain being separated by an at symbol (@) and Domain and TLD being separated by a dot symbol(.).

  Reviewing the top ten most frequently occurring field formats shows several common potentially valid structures.  However, some formats are missing one of the three required sub-values.  The formats missing User could be an indication that the field sometimes contains a Website Address.

  Extracting the top five most frequently occurring Domain and TLD sub-values provides additional alternative analysis for a high cardinality field.

 

 

What other questions can you think of for these fields?  Additional analysis could be done using drill-downs to perform a more detailed review of records of interest.  What other analysis do you think should be performed for these fields? 

 

In Part 4 of this series:  We will continue the adventures by shifting our focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Wednesday
Aug052009

Adventures in Data Profiling (Part 2)

In Part 1 of this series:  The adventures began with the following scenario – You are an external consultant on a new data quality initiative.  You have got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses...ok, maybe not those last two or three things – but the rest is true.

You have no prior knowledge of the data or its expected characteristics.  You are performing this analysis without the aid of either business requirements or subject matter experts.  Your goal is to learn us much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.

 

The customer data source was processed by the data profiling tool, which provided the following statistical summaries:

 

Data Profiling Summary

 

The Adventures Continue...

In Part 1, we asked if Customer ID was the primary key for this data source.  In an attempt to answer this question, let's “click” on it and drill-down to a field summary provided by the data profiling tool:

 

Field Summary for Customer ID  Please remember that my data profiling tool is fictional (i.e. not modeled after any real product) and therefore all of my “screen shots” are customized to illustrate series concepts.  This “screen” would not only look differently in a real data profiling tool, but it would also contain additional information.

  This field summary for Customer ID includes some input metadata, identifying the expected data type and field length.  Verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data.

  The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.

 

 

Field Details for Customer ID

  We can use drill-downs on the field summary “screen” to get more details about Customer ID provided by the data profiling tool.

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS).  Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).

  Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values).  Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.

  The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit.  We could also continue drilling down to see the actual field values behind the field formats.

 

Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?

 

In Part 1, we asked why the Gender Code field has 8 distinct values.  Cardinality can play a major role in deciding whether or not you want to drill-down to field values or field formats since it is much easier to review all of the field values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields). 

 

Field Values for Gender Code

  We will drill-down to this “screen” to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.

  It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male.  Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).

However, it is dangerous to assume anything and what about those numeric values?  Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields.  For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name and do the records with an Actual Gender Code value indicate the presence of a personal name? 

To attempt to answer these questions, it may be helpful to review records with each of these field values.  Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

 Record Drill-down for Gender Code

As is so often the case, data rarely conforms to our assumptions about it.  Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?

 

In Part 3 of this series:  We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On