So Long 2009, and Thanks for All the . . .

Before I look ahead to the coming New Year and wonder what it may (or may not) bring, I wanted to pause to reflect on, and share in the following OCDQ Video, some of the many joys that 2009 brought me.

If you are having trouble viewing this video, then you can watch it on Vimeo by clicking on this link: OCDQ Video


Thank You

Thank you all—and I do mean every single one of you—thank you for everything.

Happy New Year!!!

Adventures in Data Profiling (Part 8)

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.  This post is the conclusion of a vendor-neutral series on the methodology of data profiling.

Data profiling can help you perform essential analysis such as:

  • Provide a reality check for the perceptions and assumptions you may have about the quality of your data
  • Verify your data matches the metadata that describes it
  • Identify different representations for the absence of data (i.e., NULL and other missing values)
  • Identify potential default values
  • Identify potential invalid values
  • Check data formats for inconsistencies
  • Prepare meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage, and other important data architecture considerations.


Adventures in Data Profiling

This series was carefully designed as guided adventures in data profiling in order to provide the necessary framework for demonstrating and discussing the common functionality of data profiling tools and the basic methodology behind using one to perform preliminary data analysis.

In order to narrow the scope of the series, the scenario used was that a customer data source for a new data quality initiative had been made available to an external consultant with no prior knowledge of the data or its expected characteristics.  Additionally, business requirements had not yet been documented, and subject matter experts were not currently available.

This series did not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that were covered.  Both the data profiling tool and data used throughout the series were fictional.  The “screen shots” were customized to illustrate concepts and were not modeled after any particular data profiling tool.

This post summarizes the lessons learned throughout the series, and is organized under three primary topics:

  1. Counts and Percentages
  2. Values and Formats
  3. Drill-down Analysis


Counts and Percentages

One of the most basic features of a data profiling tool is the ability to provide counts and percentages for each field that summarize its content characteristics:

 Data Profiling Summary

  • NULL – count of the number of records with a NULL value 
  • Missing – count of the number of records with a missing value (i.e., non-NULL absence of data, e.g., character spaces) 
  • Actual – count of the number of records with an actual value (i.e., non-NULL and non-Missing) 
  • Completeness – percentage calculated as Actual divided by the total number of records 
  • Cardinality – count of the number of distinct actual values 
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records 
  • Distinctness – percentage calculated as Cardinality divided by Actual
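To make these definitions concrete, here is a minimal sketch in Python of how the seven statistics relate to one another.  The field values are hypothetical, and real data profiling tools implement far more than this:

```python
def profile_field(values):
    """Summarize one field, where None represents a database NULL and a
    blank or whitespace-only string represents a non-NULL missing value."""
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    missing = sum(1 for v in values if v is not None and str(v).strip() == "")
    actual = total - nulls - missing
    # Cardinality counts distinct actual values only
    cardinality = len({str(v).strip() for v in values
                       if v is not None and str(v).strip() != ""})
    return {
        "NULL": nulls,
        "Missing": missing,
        "Actual": actual,
        "Completeness": actual / total if total else 0.0,
        "Cardinality": cardinality,
        "Uniqueness": cardinality / total if total else 0.0,
        "Distinctness": cardinality / actual if actual else 0.0,
    }

# 5 records: 1 NULL, 1 missing, 3 actual values, 2 of them distinct
stats = profile_field(["A123", "B456", None, "   ", "A123"])
```

A single primary key candidate would show Completeness, Uniqueness, and Distinctness all at 100%; here Distinctness is two-thirds, hinting that "A123" occurs on more than one record.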

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique.  In Part 2, Customer ID provided an excellent example.

Distinctness can be useful in evaluating the potential for duplicate records.  In Part 6, Account Number and Tax ID were used as examples.  Both fields were less than 100% distinct (i.e., some distinct actual values occurred on more than one record).  The implied business meaning of these fields made this an indication of possible duplication.

Data profiling tools generate other summary statistics including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).  Throughout the series, several examples were provided, especially in Part 3 during the analysis of Birth Date, Telephone Number and E-mail Address.


Values and Formats

In addition to counts, percentages, and other summary statistics, a data profiling tool generates frequency distributions for the unique values and formats found within the fields of your data source.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality, indicating potential default values (e.g., Country Code in Part 4)
  • Fields with a relatively low cardinality (e.g., Gender Code in Part 2)
  • Fields with a relatively small number of known valid values (e.g., State Abbreviation in Part 4)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g., Customer ID in Part 2)
  • Fields with a relatively limited number of known valid formats (e.g., Birth Date in Part 3)
  • Fields with free-form values and a high cardinality (e.g., Customer Name 1 and Customer Name 2 in Part 7)

Cardinality can play a major role in deciding whether you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values, as we saw throughout the series (e.g., Telephone Number in Part 3).
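A frequency distribution of values can be sketched with Python's `collections.Counter`; the `top` parameter mirrors the idea of limiting the review of a high cardinality field to its most frequently occurring values (the field data below is made up for illustration):

```python
from collections import Counter

def value_frequency(values, top=None):
    """Frequency distribution of a field's actual (non-NULL, non-missing)
    values, as (value, count, percentage) tuples, optionally limited to
    the `top` most frequently occurring values."""
    actual = [v for v in values if v is not None and str(v).strip() != ""]
    return [(value, count, count / len(actual))
            for value, count in Counter(actual).most_common(top)]

# A low cardinality field such as a gender code can be reviewed in full:
dist = value_frequency(["F", "M", "F", "U", "F", None])
```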

Some fields can also be analyzed using partial values (e.g., in Part 3, Birth Year was extracted from Birth Date) or a combination of values and formats (e.g., in Part 6, Account Number had an alpha prefix followed by all numbers).
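Partial-value analysis can be sketched the same way, for example by extracting the year from an ISO-formatted date field and profiling it on its own.  The dates below are fabricated, and real sources would call for proper date parsing rather than string slicing:

```python
from collections import Counter

# Extract the year portion of each YYYY-MM-DD value and profile it alone
birth_dates = ["1975-03-15", "1983-07-04", "1975-12-25", "2030-01-01"]
birth_year_distribution = Counter(d[:4] for d in birth_dates)
# A birth year set in the future, like "2030", stands out immediately
```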

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  This analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high distinctness (i.e., the exact same field value rarely occurs on more than one record). 

Additionally, the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.  Examples of free-form field analysis were the focal points of Part 5 and Part 7.
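One common convention for constructing such formats is a character-class mask, with "A" standing in for letters and "9" for digits.  This is only a sketch; real data profiling tools vary in both their notation and their parsing sophistication:

```python
from collections import Counter

def format_mask(value):
    """Generalize a value into a format: 'A' for letters, '9' for digits,
    all other characters (separators, punctuation) kept as-is."""
    return "".join("A" if ch.isalpha() else "9" if ch.isdigit() else ch
                   for ch in str(value))

# High cardinality, high distinctness values often collapse into a
# handful of formats that account for most of the records:
phones = ["(555) 123-4567", "555-123-4567", "(555) 987-6543", "5551234567"]
format_distribution = Counter(format_mask(p) for p in phones)
```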

We also saw examples of how valid values in a valid format can have an invalid context (e.g., in Part 3, Birth Date values set in the future), as well as how valid field formats can conceal invalid field values (e.g., Telephone Number in Part 3).

Part 3 also provided examples (in both Telephone Number and E-mail Address) of how you should not mistake completeness (which, as a data profiling statistic, indicates a field is populated with an actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid. 


Drill-down Analysis

A data profiling tool will also provide the capability to drill down on its statistical summaries and frequency distributions in order to perform a more detailed review of records of interest.  Drill-down analysis will often provide useful data examples to share with subject matter experts.
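As a sketch, drilling down amounts to filtering the source records by a value or format of interest selected from a frequency distribution.  The record layout and field names here are hypothetical:

```python
def drill_down(records, field, value):
    """Return the full records whose given field equals a value of interest,
    so they can be reviewed in detail and shared with subject matter experts."""
    return [record for record in records if record.get(field) == value]

records = [
    {"customer_id": 1, "gender_code": "U"},
    {"customer_id": 2, "gender_code": "F"},
    {"customer_id": 3, "gender_code": "U"},
]
# Drill down on an unexpected gender code seen in the frequency distribution
suspects = drill_down(records, "gender_code", "U")
```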

Performing a preliminary analysis on your data prior to engaging in these discussions better facilitates meaningful dialogue because real-world data examples better illustrate actual data usage.  As stated earlier, understanding your data is essential to using it effectively and improving its quality.

Various examples of drill-down analysis were used throughout the series.  However, drilling all the way down to the record level was shown in Part 2 (Gender Code), Part 4 (City Name), and Part 6 (Account Number and Tax ID).



Fundamentally, this series posed the following question: What can analysis of the data alone tell you about it?

Data profiling is typically one of the first tasks performed on a data quality initiative.  I am often told to delay data profiling until business requirements are documented and subject matter experts are available to answer my questions. 

I always disagree – and begin data profiling as soon as possible.

I can do a better job of evaluating business requirements and preparing for meetings with subject matter experts after I have spent some time looking at data from a starting point of blissful ignorance and curiosity.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions.

Discovering the right questions is a critical prerequisite for effectively discussing data usage, relevancy, standards, and the metrics for measuring and improving quality, all of which are necessary in order to progress from just profiling your data to performing a full data quality assessment (which I will cover in a future series on this blog).

A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.  However, it is important to remember that the analysis itself cannot be automated: you need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more importantly, translate your analysis into meaningful reports and questions to share with the rest of your team. 

Always remember that well-performed data profiling is both a highly interactive and a very iterative process.


Thank You

I want to thank you for providing your feedback throughout this series. 

As my fellow Data Gazers, you provided excellent insights and suggestions via your comments. 

The primary reason I published this series on my blog, as opposed to simply writing a whitepaper or a presentation, was because I knew our discussions would greatly improve the material.

I hope this series proves to be a useful resource for your actual adventures in data profiling.


The Complete Series


The meteoric rise of the Internet coupled with social media has created an amazing medium that is enabling people who are separated by vast distances and disparate cultures to come together, communicate, and collaborate in ways few would have thought possible just a few decades ago.  Blogging, especially when effectively integrated with social networking, can be one of the most powerful aspects of social media.

The great advantage to blogging as a medium, as opposed to books, newspapers, magazines, and even presentations, is that blogging is not just about broadcasting a message. 

This is not to say that books, newspapers, and magazines aren't useful (they certainly can be) or that presentations lack an interactive component (they certainly should not).  I simply believe that, when done well, blogging better facilitates effective communication by starting a conversation, encouraging collaboration, and fostering a true sense of community.

Mashing together the words collaboration, blog, and community, I use the term collablogaunity — which is pronounced “Call a Blog a Unity” — to describe how remarkable blogs do this remarkably well.



Blogging is a conversation — with your readers. 

I love the sound of my own voice and I talk to myself all the time (even in public).  However, the two-way conversation that blogging provides via comments from my readers greatly improves the quality of my blog content —  because it helps me better appreciate the difference between what I know and what I only think I know.

Without comments, the conversation is only one way.  Engaging readers in dialogue and discussion allows some of your points to be made for you by those who take the time to comment as opposed to you just telling everyone how you see the world.

Blogging isn't about using the Internet as your own personal bullhorn for broadcasting your message.  In her wonderful book The Whuffie Factor, Tara Hunt explains that you really need to:

“Turn the bullhorn around: stop talking, start listening, and create continuous conversations.”

Respond to the comments you receive (but never feed the troll).  You don't have to respond immediately.  Sometimes, the conversation will go more smoothly without your involvement as your readers talk amongst themselves.  Other times, your response will help continue the conversation and encourage participation from others. 

Always demonstrate that feedback is both welcome and appreciated.  Make sure to never talk down to your readers (either in your blog post or your comment responses).  It is perfectly fine to disagree and debate, just don't denigrate.  

In a recent guest post on ProBlogger, Rob McPhillips explained: 

“If instead, you are all the time only seeking praise and approval from everyone, then there is nothing solid, consistent or certain about your blog and so ultimately it will never gather a sizeable core of die hard fans.  Only drive by readers who scan a post and never look back.” 


Blogging is a collaboration — with other bloggers.

While conversation is primarily between you and your readers, collaboration is primarily between you and other bloggers.  Although you may be inclined to view other bloggers as “the competition,” especially those within your own niche, this would be a mistake.  Yes, it is true that blogs are competing with each other for readers.  However, sustainable success is achieved through collaboration and friendly competition with your peers.

Brian Clark has explained in the past and continues to exemplify that strategic collaboration is the secret to 21st century success.  Clark has stated that if he had to reduce his recipe for success to just three ingredients, it would be content, copywriting, and collaboration.  And if he had to give up two of those, then he'd keep collaboration.

In their terrific book Trust Agents, Chris Brogan and Julien Smith explain that although people in most cultures view themselves as the central hero in their life's story, the reality is that you need to build an army because you can't do it all alone.

Collaboration between bloggers is mainly about networking and cross-promotion.  You should network with other bloggers, especially those within your own niche.  This can be accomplished a number of ways including e-mail introductions, Twitter direct messages (if the other blogger is following you), LinkedIn connection requests, or Facebook friend requests.

As with any networking, the most important thing is being genuine.  As Darren Rowse and Chris Garrett explained in their highly recommended ProBlogger book, when you network with other bloggers, keep it real, be specific, keep it brief without being rude, and explain why you are interested in connecting.  They rightfully emphasize the importance of that last point.

As we all know, although content may be king, marketing is queen.  Networking with other bloggers can help you get the word out about your brilliant blog and its penchant for publishing posts that everyone must read.  Adding other bloggers to your blogroll, linking to their posts when applicable to your content, and leaving meaningful comments on their posts are not only recommended best practices of netiquette, they are also just the right thing to do.

Too many bloggers have a selfish networking and marketing strategy.  They only promote their own content and then wonder why nobody reads their blog.  I am fond of referring to all social media as Social Karma.  Focus on helping other bloggers promote their content and they will likely be more willing to return the favor.  However, don't misunderstand this technique to be a pathetic peer pressure tactic, as in: I re-tweeted your blog post, so why didn't you re-tweet my blog post?

One last point on collaboration is to set realistic expectations — for others and for yourself.  You should definitely try to help others when you can.  However, you simply can't help everyone.  Don't let people take advantage of your generosity. 

Politely, but firmly, say no when you need to say no.  Also extend the same courtesy to other people when they turn you down (or simply ignore you) when you try to connect with them or when you ask them for their help. 

Mean and selfish people definitely suck.  But let's face it, nobody's perfect — we all have bad days, we all occasionally say and do stupid things, and we all occasionally treat people worse than they deserve to be treated.  So don't be too hard on people when they disappoint you, because tomorrow it will probably be your turn to have a bad day.



Blogging is a community service.

If you truly believe and actually practice the principles of both conversation and collaboration, then viewing blogging as a community service comes naturally.  You will truly be more interested in actually listening to what your readers have to say, and less interested in just broadcasting your message.  You will see your words as simply the catalyst that gets the conversation started, and when necessary, helps continue the discussion. 

You will see friends not foes when encountering your blogging peers.  You will help them celebrate their successes and quickly recover from their failures.  You will help others when you can and without worrying about what's in it for you.

As James Chartrand says, you will welcome people to your blog because you view blogging as a festival of people, a community strengthened by people, where everyone can speak up with great care and attention, sharing thoughts and views while openly accepting differing opinions.  Blogging is a community service, providing a wealth of experience, thoughts, and knowledge shared by all sorts of participants.

In the closing keynote of this year's BlogWorld conference, Chris Brogan explained (from notes taken by David B. Thomas):

“Make it about them.  Stop looking at this as a cult of me. 

It has to be about your audience.  Turn them into a community. 

The difference between an audience and a community is the way you face the chairs. 

The difference between an audience and a community:

One will fall on its sword for you and the other will watch you fall.”


Pronounced: “Call a Blog a Unity”

There are literally millions of blogs on the Internet today.  Your blog (to quote Seth Godin) is “either remarkable or invisible.”

Remarkable blogs primarily do three things:

  1. Start conversations
  2. Encourage collaboration
  3. Foster a true sense of community

Remarkable blogs are collablogaunities.  Is your blog a collablogaunity?


Related Posts

The Mullet Blogging Manifesto

Brevity is the Soul of Social Media

Podcast: Your Blog, Your Voice

Beyond a “Single Version of the Truth”

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Henrik Liliendahl Sørensen and Charles Blyth.  Our contest is a Blogging Olympics of sorts, with the United States, Denmark, and England competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.” 

Please take the time to read all three posts and then vote for who you think has won the debate (see poll below).  Thanks!


The “Point of View” Paradox

In the early 20th century, within his Special Theory of Relativity, Albert Einstein introduced the concept that space and time are interrelated entities forming a single continuum, and therefore the passage of time can be a variable that could change for each individual observer.

One of the many brilliant insights of special relativity was that it could explain why different observers can make validly different observations – it was a scientifically justifiable matter of perspective. 

It was Einstein's apprentice, Obi-Wan Kenobi (to whom Albert explained “Gravity will be with you, always”), who stated:

“You're going to find that many of the truths we cling to depend greatly on our own point of view.”

The Data-Information Continuum

In the early 21st century, within his popular blog post The Data-Information Continuum, Jim Harris introduced the concept that data and information are interrelated entities forming a single continuum, and that speaking of oneself in the third person is the path to the dark side.

I use the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g., customers, vendors, suppliers).

Although a common definition for data quality is fitness for the purpose of use, the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I define information as data in use or data in action.

Quality within the Data-Information Continuum has both objective and subjective dimensions.  Data's quality is objectively measured separate from its many uses, while information's quality is subjectively measured according to its specific use.


Objective Data Quality

Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives. 

In order to lay this foundation, raw data is extracted directly from its sources, profiled, analyzed, transformed, cleansed, documented and monitored by data quality processes designed to provide and maintain universal data sources for the enterprise's information needs. 

At this phase of the architecture, the manipulations of raw data must be limited to objective standards and not be customized for any subjective use.  From this perspective, data is now fit to serve (as at least the basis for) each and every purpose.


Subjective Information Quality

Information quality standards (starting from the objective data foundation) are customized to meet the subjective needs of each business unit and initiative.  This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.

But please understand: customization should not be performed simply for the sake of it.  You must always define your information quality standards by using the enterprise-wide data quality standards as your initial framework. 

Whenever possible, enterprise-wide standards should be enforced without customization.  The key word within the phrase “subjective information quality standards” is standards — as opposed to subjective, which can quite often be misinterpreted as “you can do whatever you want.”  Yes you can – just as long as you have justifiable business reasons for doing so.

This approach to implementing information quality standards has three primary advantages.  First, it reinforces a consistent understanding and usage of data throughout the enterprise.  Second, it requires each business unit and initiative to clearly explain exactly how they are using data differently from the rest of your organization, and more important, justify why.  Finally, all deviations from enterprise-wide data quality standards will be fully documented. 


The “One Lie Strategy”

A common objection to separating quality standards into objective data quality and subjective information quality is the enterprise's significant interest in creating what is commonly referred to as a “Single Version of the Truth.”

However, in his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman explains:

“A fiendishly attractive concept is...'a single version of the truth'...the logic is compelling...unfortunately, there is no single version of the truth. 

For all important data, there are...too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. 

This does not imply malfeasance on anyone's part; it is simply a fact of life. 

Getting everyone to work from a single version of the truth may be a noble goal, but it is better to call this the 'one lie strategy' than anything resembling truth.”

Beyond a “Single Version of the Truth”

In the classic 1985 film Mad Max Beyond Thunderdome, the title character arrives in Bartertown, ruled by the evil Auntie Entity, where people living in the post-apocalyptic Australian outback go to trade for food, water, weapons, and supplies.  Auntie Entity forces Mad Max to fight her rival Master Blaster to the death within a gladiator-like arena known as Thunderdome, which is governed by one simple rule:

“Two men enter, one man leaves.”

I have always struggled with the concept of creating a “Single Version of the Truth.”  I imagine all of the key stakeholders from throughout the enterprise arriving in Corporatetown, ruled by the Machiavellian CEO known only as Veritas, where all business units and initiatives must go to request funding, staffing, and continued employment.  Veritas forces all of them to fight their Master Data Management rivals within a gladiator-like arena known as Meetingdome, which is governed by one simple rule:

“Many versions of the truth enter, a Single Version of the Truth leaves.”

For any attempted “version of the truth” to truly be successfully implemented within your organization, it must take into account both the objective and subjective dimensions of quality within the Data-Information Continuum. 

Both aspects of this shared perspective of quality must be incorporated into a “Shared Version of the Truth” that enforces a consistent enterprise understanding of data, but that also provides the information necessary to support day-to-day operations.

The Data-Information Continuum is governed by one simple rule:

“All validly different points of view must be allowed to enter,

In order for an all encompassing Shared Version of the Truth to be achieved.”


You are the Judge

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Henrik Liliendahl Sørensen and Charles Blyth.  Our contest is a Blogging Olympics of sorts, with the United States, Denmark, and England competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.” 

Please take the time to read all three posts and then vote for who you think has won the debate.  A link to the same poll is provided on all three blogs.  Therefore, wherever you choose to cast your vote, you will be able to view an accurate tally of the current totals. 

The poll will remain open for one week, closing at midnight on November 19 so that the “medal ceremony” can be conducted via Twitter on Friday, November 20.  Additionally, please share your thoughts and perspectives on this debate by posting a comment below.  Your comment may be copied (with full attribution) into the comments section of all of the blogs involved in this debate.


Related Posts

Poor Data Quality is a Virus

The General Theory of Data Quality

The Data-Information Continuum

The Once and Future Data Quality Expert

World Quality Day 2009

Wednesday, November 11 is World Quality Day 2009.

World Quality Day was established by the United Nations in 1990 as a focal point for the quality management profession and as a celebration of the contribution that quality makes to the growth and prosperity of nations and organizations.  The goal of World Quality Day is to raise awareness of how quality approaches (including data quality best practices) can have a tangible effect on business success, as well as contribute towards world-wide economic prosperity.



The International Association for Information and Data Quality (IAIDQ) was chartered in January 2004 and is a not-for-profit, vendor-neutral professional association whose purpose is to create a world-wide community of people who desire to reduce the high costs of low quality information and data by applying sound quality management principles to the processes that create, maintain and deliver data and information.

Since 2007 the IAIDQ has celebrated World Quality Day as a springboard for improvement and a celebration of successes.  Please join us to celebrate World Quality Day by participating in our interactive webinar in which the Board of Directors of the IAIDQ will share with you stories and experiences to promote data quality improvements within your organization.

In my recent Data Quality Pro article The Future of Information and Data Quality, I reported on the IAIDQ Ask The Expert Webinar with co-founders Larry English and Tom Redman, two of the industry pioneers for data quality and two of the most well-known data quality experts.


Data Quality Expert

As World Quality Day 2009 approaches, my personal reflections are focused on what the title data quality expert has meant in the past, what it means today, and most important, what it will mean in the future.

With over 15 years of professional services and application development experience, I consider myself to be a data quality expert.  However, my experience is paltry by comparison to English, Redman, and other industry luminaries such as David Loshin, to use one additional example from many. 

Experience is popularly believed to be the path that separates knowledge from wisdom, which is usually accepted as another way of defining expertise. 

Oscar Wilde once wrote that “experience is simply the name we give our mistakes.”  I agree.  I have found that the sooner I can recognize my mistakes, the sooner I can learn from the lessons they provide, and hopefully prevent myself from making the same mistakes again. 

The key is early detection.  As I gain experience, I gain an improved ability to more quickly recognize my mistakes and thereby expedite the learning process.

James Joyce wrote that “mistakes are the portals of discovery” and T.S. Eliot wrote that “we must not cease from exploration and the end of all our exploring will be to arrive where we began and to know the place for the first time.”

What I find in the wisdom of these sages is the need to acknowledge the favor our faults do for us.  Therefore, although experience is the path that separates knowledge from wisdom, the true wisdom of experience is the wisdom of failure.

As Jonah Lehrer explained: “Becoming an expert just takes time and practice.  Once you have developed expertise in a particular area, you have made the requisite mistakes.”

But expertise in any discipline is more than simply an accumulation of mistakes and birthdays.  And expertise is not a static state that once achieved, allows you to simply rest on your laurels.

In addition to my real-world experience working on data quality initiatives for my clients, I also read all of the latest books, articles, whitepapers, and blogs, as well as attend as many conferences as possible.


The Times They Are a-Changin'

Much of the discussion that I have heard regarding the future of the data quality profession has been focused on the need for the increased maturity of both practitioners and organizations.  Although I do not dispute this need, I am concerned about the apparent lack of attention being paid to how fast the world around us is changing.

Rapid advancements in technology, coupled with the meteoric rise of the Internet and social media (blogs, wikis, Twitter, Facebook, LinkedIn, etc.), have created an amazing medium that is enabling people separated by vast distances and disparate cultures to come together, communicate, and collaborate in ways few would have thought possible just a few decades ago. 

I don't believe that it is an exaggeration to state that we are now living in an age where the contrast between the recent past and the near future is greater than perhaps it has ever been in human history.  This brave new world has such people and technology in it, that practically every new day brings the possibility of another quantum leap forward.

Although it has been argued by some that the core principles of data quality management are timeless, I must express my doubt.  The daunting challenges of dramatically increasing data volumes and the unrelenting progress of cloud computing, software as a service (SaaS), and mobile computing architectures, would appear to be racing toward a high-speed collision with our time-tested (but time-consuming to implement properly) data quality management principles.

The times they are indeed changing and I believe we must stop using terms like Six Sigma and Kaizen as if they were shibboleths.  If these or any other disciplines are to remain relevant, then we must honestly assess them in the harsh and unforgiving light of our brave new world that is seemingly changing faster than the speed of light.

Expertise is not static.  Wisdom is not timeless.  The only constant is change.  For the data quality profession to truly mature, our guiding principles must change with the times, or be relegated to a past that is all too quickly becoming distant.


Share Your Perspectives

In celebration of World Quality Day, please share your perspectives regarding the past, present, and most important, the future of the data quality profession.  With apologies to T. H. White, I declare this debate to be about the difference between:

The Once and Future Data Quality Expert

Related Posts

Mistake Driven Learning

The Fragility of Knowledge

The Wisdom of Failure

A Portrait of the Data Quality Expert as a Young Idiot

The Nine Circles of Data Quality Hell


Additional IAIDQ Links

IAIDQ Ask The Expert Webinar: World Quality Day 2009

IAIDQ Ask The Expert Webinar with Larry English and Tom Redman

INTERVIEW: Larry English - IAIDQ Co-Founder

INTERVIEW: Tom Redman - IAIDQ Co-Founder

IAIDQ Publications Portal

The Tell-Tale Data

It is a dark and stormy night in the data center.  The constant humming of hard drives is mimicking the sound of a hard rain falling in torrents, except at occasional intervals, when it is checked by a violent gust of conditioned air sweeping through the seemingly endless aisles of empty cubicles, rattling along desktops, fiercely agitating the flickering glow from flat panel monitors that are struggling against the darkness.

Tonight, amid this foreboding gloom with only my thoughts for company, I race to complete the production implementation of the Dystopian Automated Transactional Analysis (DATA) system.  Nervous, very, very dreadfully nervous I have been, and am, but why will you say that I am mad?  Observe how calmly I can tell you the whole story.

Eighteen months ago, I was ordered by executive management to implement the DATA system.  The vendor's salesperson was an oddly charming fellow named Machiavelli, who had the eye of a vulture — a pale blue eye, with a film over it.  Whenever this eye fell upon me, my blood ran cold. 

Machiavelli assured us all that DATA's seamlessly integrated Magic Beans software would migrate and consolidate all of our organization's information, clairvoyantly detecting and correcting our existing data quality problems, and once DATA was implemented into production, Magic Beans would prevent all future data quality problems from happening.

As soon as a source was absorbed into DATA, Magic Beans automatically did us the favor of freeing up disk space by deleting all traces of the source, somehow even including our off-site archives.  DATA would then become our only system of record, truly our Single Version of the Truth.

It is impossible to say when doubt first entered my brain, but once conceived, it haunted me day and night.  Whenever I thought about it, my blood ran cold — as cold as when that vulture eye was gazing upon me — very gradually, I made up my mind to simply load DATA and rid myself of my doubt forever.

Now this is the point where you will fancy me quite mad.  But madmen know nothing.  You should have seen how wisely I proceeded — with what caution — with what foresight — with what Zen-like tranquility, I went to work! 

I was never happier than I was these past eighteen months while I simply followed the vendor's instructions step by step and loaded DATA!  Would a madman have been so wise as this?  I think not.

Tomorrow morning, DATA goes live.  I can imagine how wonderful that will be.  I will be sitting at my desk, grinning wildly, deliriously happy with a job well done.  DATA will be loaded, data quality will trouble me no more.

It is now four o'clock in the morning, but still it is as dark as midnight.  But as bright as the coming dawn, I can now see three strange men as they gather around my desk. 

Apparently, a shriek had been heard from the business analysts and subject matter experts as soon as they started using DATA.  Suspicions had been aroused, complaints had been lodged, and they (now identifying themselves as auditors) had been called in by a regulatory agency to investigate.

I smile — for what have I to fear?  I welcome these fine gentlemen.  I give them a guided tour of DATA using its remarkably intuitive user interface.  I urge them to audit — audit well.  They seem satisfied.  My manner has convinced them.  I am singularly at ease.  They sit, and while I answer cheerily, they chat away about trivial things.  But before long, I feel myself growing pale and wish them gone.

My head aches and I hear a ringing in my ears, but still they sit and chat.  The ringing becomes more distinct.  I talk more freely, to get rid of the feeling, but it continues and gains volume — until I find that this noise is not within my ears.

No doubt I now grow very pale — but I talk more fluently, and with a heightened voice.  Yet the sound increases — and what can I do?  It is a low, dull, quick sound.  I gasp for breath — and yet the auditors hear it not. 

I talk more quickly — more vehemently — but the noise steadily increases.  I arise, and argue about trifles, in a high key and with violent gesticulations — but the noise steadily increases.  Why will they not be gone?  I pace the floor back and forth, with heavy strides, as if excited to fury by the unrelenting observations of the auditors — but the noise steadily increases.

What could I do?  I raved — I ranted — I raged!  I swung my chair and smashed my computer with it — but the noise rises over all of my attempts to silence it.  It grows louder — louder — louder!  And still the auditors chat pleasantly, and smile.  Is it really possible they can not hear it?  Is it really possible they did not notice me smashing my computer?

They hear! — they suspect! — they know! — they are making a mockery of my horror! — this I thought, and this I think.  But anything is better than this agony!  Anything is more tolerable than this derision!  I can not bear their hypocritical smiles any longer!  I feel that I must scream or die! — and now — again! — the noise!  Louder!  Louder!!  LOUDER!!!


“DATA!” I finally shriek.  “DATA has no quality!  NO DATA QUALITY!!!  What have I done?  What — Have — I — Done?!?”


With a sudden jolt, I awaken at my desk, with my old friend Edgar shaking me by the shoulders. 

“Hey, wake up!  Executive management wants us in the conference room in five minutes.  Apparently, there is a vendor here today pitching a new system called DATA using software called Magic Beans...” 

“...and the salesperson has this weird eye...”

Days Without A Data Quality Issue

In 1970, the United States Department of Labor created the Occupational Safety and Health Administration (OSHA).  The mission of OSHA is to prevent work-related injuries, illnesses, and deaths.  Based on statistics from 2007, since OSHA's inception, occupational deaths in the United States have been cut by 62% and workplace injuries have declined by 42%.

OSHA regularly conducts inspections to determine if organizations are in compliance with safety standards and assesses financial penalties for violations.  In order to both promote workplace safety and avoid penalties, organizations provide their employees with training on the appropriate precautions and procedures to follow in the event of an accident or an emergency.

Training programs certify new employees in safety protocols and indoctrinate them into the culture of a safety-conscious workplace.  By requiring periodic re-certification, all employees maintain awareness of their personal responsibility in both avoiding workplace accidents and responding appropriately to emergencies.

Although there has been some debate about the effectiveness of the regulations and the enforcement policies, over the years OSHA has unquestionably brought about many necessary changes, especially in the area of industrial work site safety where dangerous machinery and hazardous materials are quite common. 

Obviously, even with well-defined safety standards in place, workplace accidents will still occasionally occur.  However, these standards have helped greatly reduce both the frequency and severity of the accidents.  And most importantly, safety has become a natural part of the organization's daily work routine.


A Culture of Data Quality

Similar to indoctrinating employees into the culture of a safety-conscious workplace, more and more organizations are realizing the importance of creating and maintaining the culture of a data quality-conscious workplace.  A culture of data quality is essential for effective enterprise information management.

Waiting until a serious data quality issue negatively impacts the organization before starting an enterprise data quality program is analogous to waiting until a serious workplace accident occurs before starting a safety program.

Many data quality issues are caused by a lack of data ownership and an absence of clear guidelines indicating who is responsible for ensuring that data is of sufficient quality to meet the daily business needs of the enterprise.  In order for data quality to be taken seriously within your organization, everyone first needs to know that data quality is an enterprise-wide priority.

Additionally, data quality standards must be well-defined, and everyone must accept their personal responsibility in both preventing data quality issues and responding appropriately to mitigate the associated business risks when issues do occur.


Data Quality Assessments

The data equivalent of a safety inspection is a data quality assessment, which provides a much needed reality check for the perceptions and assumptions that the enterprise has about the quality of its data. 

Performing a data quality assessment helps with a wide variety of tasks including: verifying data matches the metadata that describes it, preparing meaningful questions for subject matter experts, understanding how data is being used, quantifying the business impacts of poor quality data, and evaluating the ROI of data quality improvements.
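To make the first of those tasks concrete, here is a minimal sketch of a profiling pass in Python.  The record layout, field names, and the set of missing-value representations are all illustrative assumptions, not a real customer schema; a production assessment would use a data profiling tool rather than hand-rolled code like this.

```python
import re
from collections import Counter

# Hypothetical sample records; field names are illustrative only.
records = [
    {"customer_id": "1001", "postal_code": "02134"},
    {"customer_id": "1002", "postal_code": "2134"},     # format inconsistency
    {"customer_id": None,   "postal_code": "N/A"},      # missing-value variant
    {"customer_id": "1003", "postal_code": ""},         # another variant
]

# Different representations for the absence of data (assumed set)
MISSING = {None, "", "N/A", "NULL", "?"}
POSTAL_FORMAT = re.compile(r"^\d{5}$")  # assumed 5-digit standard

def profile_field(records, field, format_re=None):
    """Return simple completeness and format metrics for one field."""
    values = [r.get(field) for r in records]
    missing = sum(1 for v in values if v in MISSING)
    present = [v for v in values if v not in MISSING]
    bad_format = (
        sum(1 for v in present if not format_re.match(v)) if format_re else 0
    )
    return {
        "total": len(values),
        "missing": missing,
        "distinct": len(set(present)),
        "format_violations": bad_format,
    }

print(profile_field(records, "postal_code", POSTAL_FORMAT))
```

Even this toy pass surfaces the questions an assessment exists to raise: two different representations of absent postal codes, and one value that fails the assumed format — exactly the kind of findings you bring to subject matter experts.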

An initial assessment provides a baseline and helps establish data quality standards as well as set realistic goals for improvement.  Subsequent data quality assessments, which should be performed on a regular basis, will track your overall progress.

Although preventing data quality issues is your ultimate goal, don't let the pursuit of perfection undermine your efforts.  Always be mindful of the data quality issues that remain unresolved, but let them serve as motivation.  Learn from your mistakes without focusing on your failures – focus instead on making steady progress toward improving your data quality.


Data Governance

The data equivalent of verifying compliance with safety standards is data governance, which establishes policies and procedures to align people throughout the organization.  Enterprise data quality programs require a data governance framework in order to successfully deploy data quality as an enterprise-wide initiative. 

By facilitating the collaboration of all business and technical stakeholders, aligning data usage with business metrics, enforcing data ownership, and prioritizing data quality, data governance enables effective enterprise information management.

Obviously, even with well-defined and well-managed data governance policies and procedures in place, data quality issues will still occasionally occur.  However, your goal is to greatly reduce both the frequency and severity of your data quality issues. 

And most importantly, the responsibility for ensuring that data is of sufficient quality to meet your daily business needs has now become a natural part of your organization's daily work routine.


Days Without A Data Quality Issue

Organizations commonly display a sign indicating how long they have gone without a workplace accident.  Proving that I certainly did not miss my calling as a graphic designer, I created this “sign” for Days Without A Data Quality Issue:

Days Without A Data Quality Issue


Related Posts

Poor Data Quality is a Virus

DQ-Tip: “Don't pass bad data on to the next person...”

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

Data Governance and Data Quality

Poor Quality Data Sucks

Fenway Park 2008 Home Opener

Over the last few months on his Information Management blog, Steve Miller has been writing posts inspired by a great 2008 book that we both highly recommend: The Drunkard's Walk: How Randomness Rules Our Lives by Leonard Mlodinow.

In his most recent post The Demise of the 2009 Boston Red Sox: Super-Crunching Takes a Drunkard's Walk, Miller takes on my beloved Boston Red Sox and the less than glorious conclusion to their 2009 season. 

For those readers who are not baseball fans, the Los Angeles Angels of Anaheim swept the Red Sox out of the playoffs.  I will let Miller's words describe their demise: “Down two to none in the best of five series, the Red Sox took a 6-4 lead into the ninth inning, turning control over to impenetrable closer Jonathan Papelbon, who hadn't allowed a run in 26 postseason innings.  The Angels, within one strike of defeat on three occasions, somehow managed a miracle rally, scoring 3 runs to take the lead 7-6, then holding off the Red Sox in the bottom of the ninth for the victory to complete the shocking sweep.”


Baseball and Data Quality

What, you may be asking, does baseball have to do with data quality?  Beyond simply being two of my all-time favorite topics, quite a lot actually.  Baseball data is mostly transaction data describing the statistical events of games played.

Statistical analysis has been a beloved pastime even longer than baseball has been America's Pastime.  Number-crunching is far more than just a quantitative exercise in counting.  The qualitative component of statistics – discerning what the numbers mean, analyzing them to discover predictive patterns and trends – is the very basis of data-driven decision making.

“The Red Sox,” as Miller explained, “are certainly exemplars of the data and analytic team-building methodology” chronicled in Moneyball: The Art of Winning an Unfair Game, the 2003 book by Michael Lewis.  Red Sox General Manager Theo Epstein has always been an advocate of so-called evidence-based baseball, or baseball analytics, pioneered by Bill James, the baseball writer, historian, statistician, current Red Sox consultant, and founder of Sabermetrics.

In another book that Miller and I both highly recommend, Super Crunchers, author Ian Ayres explained that “Bill James challenged the notion that baseball experts could judge talent simply by watching a player.  James's simple but powerful thesis was that data-based analysis in baseball was superior to observational expertise.  James's number-crunching approach was particular anathema to scouts.” 

“James was baseball's herald,” continues Ayres, “of data-driven decision making.”


The Drunkard's Walk

As Mlodinow explains in the prologue: “The title The Drunkard's Walk comes from a mathematical term describing random motion, such as the paths molecules follow as they fly through space, incessantly bumping, and being bumped by, their sister molecules.  The surprise is that the tools used to understand the drunkard's walk can also be employed to help understand the events of everyday life.”

Later in the book, Mlodinow describes the hidden effects of randomness by discussing how to build a mathematical model for the probability that a baseball player will hit a home run: “The result of any particular at bat depends on the player's ability, of course.  But it also depends on the interplay of many other factors: his health, the wind, the sun or the stadium lights, the quality of the pitches he receives, the game situation, whether he correctly guesses how the pitcher will throw, whether his hand-eye coordination works just perfectly as he takes his swing, whether that brunette he met at the bar kept him up too late, or the chili-cheese dog with garlic fries he had for breakfast soured his stomach.”

“If not for all the unpredictable factors,” continues Mlodinow, “a player would either hit a home run on every at bat or fail to do so.  Instead, for each at bat all you can say is that he has a certain probability of hitting a home run and a certain probability of failing to hit one.  Over the hundreds of at bats he has each year, those random factors usually average out and result in some typical home run production that increases as the player becomes more skillful and then eventually decreases owing to the same process that etches wrinkles in his handsome face.  But sometimes the random factors don't average out.  How often does that happen, and how large is the aberration?”



I have heard some (not Mlodinow or anyone else mentioned in this post) argue that data quality is an irrelevant issue.  The basis of their argument is that poor quality data are simply random factors that, in any data set of statistically significant size, will usually average out and therefore have a negligible effect on any data-based decisions. 

However, the random factors don't always average out.  It is important not only to measure exactly how often poor quality data occur, but also to acknowledge how large an aberration poor quality data can be, especially in data-driven decision making.
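A quick simulation illustrates the distinction.  The numbers below are invented for illustration: symmetric random noise does average out over a large sample, but a systematic data quality defect (here, a hypothetical dropped digit in 10% of records) biases the result no matter how many records you have.

```python
import random

random.seed(42)  # reproducible run

true_values = [100.0] * 10_000  # the real quantity being measured

# Random measurement noise: symmetric and zero-mean
noisy = [v + random.gauss(0, 5) for v in true_values]

# A systematic data quality defect: assume 10% of records lose a digit (100 -> 10)
defective = [v / 10 if random.random() < 0.10 else v for v in noisy]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(noisy), 2))      # close to the true 100: noise averages out
print(round(mean(defective), 2))  # biased well below 100: the defect does not
```

The noisy mean lands within a fraction of a point of the truth, while the defective mean is pulled down by roughly 9% — an aberration that no amount of additional data will wash out.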

As every citizen of Red Sox Nation is taught from birth, the only acceptable opinion of our American League East Division rivals, the New York Yankees, is encapsulated in the chant heard throughout the baseball season (and not just at Fenway Park):

“Yankees Suck!”

From its inception, every organization bases its day-to-day business decisions on its data.  This decision-critical information drives the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace. 

It doesn't quite roll off the tongue as easily, but a chant heard throughout these enterprise information initiatives is:

“Poor Quality Data Sucks!”

Books Recommended by Red Sox Nation

Mind Game: How the Boston Red Sox Got Smart, Won a World Series, and Created a New Blueprint for Winning

Feeding the Monster: How Money, Smarts, and Nerve Took a Team to the Top

Theology: How a Boy Wonder Led the Red Sox to the Promised Land

Now I Can Die in Peace: How The Sports Guy Found Salvation Thanks to the World Champion (Twice!) Red Sox

Mistake Driven Learning

In his Copyblogger article How to Stop Making Yourself Crazy with Self-Editing, Sean D'Souza explains:

“Competency is a state of mind you reach when you’ve made enough mistakes.”

One of my continuing challenges is staying informed about the latest trends in data quality and its related disciplines, including Master Data Management (MDM), Dystopian Automated Transactional Analysis (DATA), and Data Governance (DG) – I am fairly certain that one of those three things isn't real, but I haven't figured out which one yet.

I read all of the latest books, as well as the books that I was supposed to have read years ago, when I was just pretending to have read all of the latest books.  I also read the latest articles, whitepapers, and blogs.  And I go to as many conferences as possible.

The basis of this endless quest for knowledge is fear.  Please understand – I have never been afraid to look like an idiot.  After all, we idiots are important members of society – we make everyone else look smart by comparison. 

However, I also market myself as a data quality expert.  Therefore, when I consult, speak, write, or blog, I am always at least a little afraid of not getting things quite right.  Being afraid of making mistakes can drive you crazy. 

But as a wise man named Seal Henry Olusegun Olumide Adeola Samuel (wisely better known by only his first name) lyrically taught us back in 1991:

“We're never gonna survive unless, we get a little crazy.”

“It’s not about getting things right in your brain,” explains D’Souza, “it’s about getting things wrong.  The brain has to make hundreds, even thousands of mistakes — and overcome those mistakes — to be able to reach a level of competency.”


So, get a little crazy, make a lot of mistakes, and never stop learning.


Related Posts

The Fragility of Knowledge

The Wisdom of Failure

A Portrait of the Data Quality Expert as a Young Idiot

The Nine Circles of Data Quality Hell

Poor Data Quality is a Virus

“A storm is brewing—a perfect storm of viral data, disinformation, and misinformation.” 

These cautionary words (written by Timothy G. Davis, an Executive Director within the IBM Software Group) are from the foreword of the remarkable new book Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman.

“Viral data,” explains Fishman, “is a metaphor used to indicate that business-oriented data can exhibit qualities of a specific type of human pathogen: the virus.  Like a virus, data by itself is inert.  Data requires software (or people) for the data to appear alive (or actionable) and cause a positive, neutral, or negative effect.”

“Viral data is a perfect storm,” because as Fishman explains, it is “a perfect opportunity to miscommunicate with ubiquity and simultaneity—a service-oriented pandemic reaching all corners of the enterprise.”

“The antonym of viral data is trusted information.”

Data Quality

“Quality is a subjective term,” explains Fishman, “for which each person has his or her own definition.”  Fishman goes on to quote from many of the published definitions of data quality, including a few of my personal favorites:

  • David Loshin: “Fitness for use—the level of data quality determined by data consumers in terms of meeting or beating expectations.”
  • Danette McGilvray: “The degree to which information and data can be a trusted source for any and/or all required uses.  It is having the right set of correct information, at the right time, in the right place, for the right people to use to make decisions, to run the business, to serve customers, and to achieve company goals.”
  • Thomas Redman: “Data are of high quality if those who use them say so.  Usually, high-quality data must be both free of defects and possess features that customers desire.”

Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives.  Starting from this foundation, information quality standards are customized to meet the subjective needs of each business unit and initiative.  This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.

However, the enterprise-wide data quality standards must be understood as dynamic.  Therefore, enforcing strict conformance to data quality standards can be self-defeating.  On this point, Fishman quotes Joseph Juran: “conformance by its nature relates to static standards and specification, whereas quality is a moving target.”

Defining data quality is both an essential and challenging exercise for every enterprise.  “While a succinct and holistic single-sentence definition of data quality may be difficult to craft,” explains Fishman, “an axiom that appears to be generally forgotten when establishing a definition is that in business, data is about things that transpire during the course of conducting business.  Business data is data about the business, and any data about the business is metadata.  First and foremost, the definition as to the quality of data must reflect the real-world object, concept, or event to which the data is supposed to be directly associated.”


Data Governance

“Data governance can be used as an overloaded term,” explains Fishman, and he quotes Jill Dyché and Evan Levy to explain that “many people confuse data quality, data governance, and master data management.” 

“The function of data governance,” explains Fishman, “should be distinct and distinguishable from normal work activities.” 

For example, although knowledge workers and subject matter experts are necessary to define the business rules for preventing viral data, according to Fishman, these are data quality tasks and not acts of data governance. 

However, these data quality tasks must “subsequently be governed to make sure that all the requisite outcomes comply with the appropriate controls.”

Therefore, according to Fishman, “data governance is a function that can act as an oversight mechanism and can be used to enforce controls over data quality and master data management, but also over data privacy, data security, identity management, risk management, or be accepted in the interpretation and adoption of regulatory requirements.”



“There is a line between trustworthy information and viral data,” explains Fishman, “and that line is very fine.”

Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace. 

Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions.  As the pathogen replicates, more and more decision-critical enterprise information will be compromised.

According to Fishman, enterprise data quality requires a multidisciplinary effort and a lifetime commitment to:

“Prevent viral data and preserve trusted information.”

Books Referenced in this Post

Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman

Enterprise Knowledge Management: The Data Quality Approach by David Loshin

Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information by Danette McGilvray

Data Quality: The Field Guide by Thomas Redman

Juran on Quality by Design: The New Steps for Planning Quality into Goods and Services by Joseph Juran

Customer Data Integration: Reaching a Single Version of the Truth by Jill Dyché and Evan Levy


Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

Data Governance and Data Quality

Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?


Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects with many characteristics including the following:

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independent of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense, and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield.  Second base and shortstop are also infield positions, and more specifically middle infield.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
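The distinctions above can be sketched in a few lines of Python.  This is purely illustrative: the class and attribute names are my assumptions, not a real fantasy-league schema.  The key point it demonstrates is that Fantasy Player is the master data object wrapping a shared Professional Player reference, so the same professional can appear on multiple fantasy teams.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ProfessionalPlayer:
    """Reference data object from an external content provider."""
    player_id: str
    first_name: str
    last_name: str
    uniform_number: int

@dataclass
class FantasyPlayer:
    """The true master data object: a professional player on a fantasy roster."""
    professional: ProfessionalPlayer
    fantasy_team: str
    roles: list = field(default_factory=list)  # e.g. ["Batter", "Fielder"]

# A simple position taxonomy: position -> broader groupings
POSITION_TAXONOMY = {
    "1B": ["Infield", "Corner Infield"],
    "3B": ["Infield", "Corner Infield"],
    "2B": ["Infield", "Middle Infield"],
    "SS": ["Infield", "Middle Infield"],
}

# One professional player, simultaneously on two fantasy teams
pro = ProfessionalPlayer("MLB-34", "David", "Ortiz", 34)
p1 = FantasyPlayer(pro, "Team A", ["Batter"])
p2 = FantasyPlayer(pro, "Team B", ["Batter"])
print(p1.professional is p2.professional)  # one reference object, two master records
```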


Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.
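The Type 2 pattern can be sketched as follows.  This is a minimal in-memory illustration of the technique, not the actual warehouse code; the column names (natural_key, start_date, end_date) and the sample team changes are assumptions for the example.

```python
from datetime import date

def scd2_upsert(dimension, natural_key, attrs, as_of):
    """Type 2 slowly changing dimension: expire the current row for
    natural_key if its attributes changed, then insert a new version."""
    current = next(
        (r for r in dimension
         if r["natural_key"] == natural_key and r["end_date"] is None),
        None,
    )
    if current is not None:
        if all(current[k] == v for k, v in attrs.items()):
            return  # no change: keep the current version
        current["end_date"] = as_of  # expire the old version
    dimension.append(
        {"natural_key": natural_key, **attrs,
         "start_date": as_of, "end_date": None}
    )

players = []  # the Player dimension "table"
scd2_upsert(players, "MLB-34", {"team": "MIN"}, date(2002, 4, 1))
scd2_upsert(players, "MLB-34", {"team": "BOS"}, date(2003, 4, 1))
print(len(players))                # two versions: full history is preserved
print(players[0]["end_date"])      # the first version's expiration date
```

Because expired versions are retained with their effective date ranges, queries can roll back to any point in time — exactly why Type 2 was used for all dimensions except Date.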

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players.  A separate Fantasy Player dimension would have redundantly stored multiple instances of the same professional player for each fantasy team he played for, and would have required snowflaking the Fantasy League and Fantasy Team dimensions.
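
The factless fact table pattern can be sketched in a few lines of SQL.  This is a minimal illustrative schema (the table and column names are my own, not from the actual data warehouse), run here through Python's built-in sqlite3 module:

```python
import sqlite3

# A factless fact table records only the intersection of dimension keys --
# here, which player was on which professional team's roster on which date.
# There are no numeric measures; the "fact" is that the combination exists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE dim_team   (team_key INTEGER PRIMARY KEY, team_name TEXT);
CREATE TABLE dim_player (player_key INTEGER PRIMARY KEY, player_name TEXT);
CREATE TABLE fact_pro_team_roster (
    date_key   INTEGER REFERENCES dim_date(date_key),
    team_key   INTEGER REFERENCES dim_team(team_key),
    player_key INTEGER REFERENCES dim_player(player_key),
    PRIMARY KEY (date_key, team_key, player_key)
);
""")
conn.execute("INSERT INTO dim_date VALUES (1, '2009-06-01')")
conn.execute("INSERT INTO dim_team VALUES (10, 'Boston Red Sox')")
conn.execute("INSERT INTO dim_player VALUES (100, 'Example Player')")
conn.execute("INSERT INTO fact_pro_team_roster VALUES (1, 10, 100)")

# "Who was on the roster on a given date?" is just a join and a filter.
rows = conn.execute("""
    SELECT t.team_name, p.player_name
    FROM fact_pro_team_roster f
    JOIN dim_team t   ON t.team_key = f.team_key
    JOIN dim_player p ON p.player_key = f.player_key
    JOIN dim_date d   ON d.date_key = f.date_key
    WHERE d.calendar_date = '2009-06-01'
""").fetchall()
print(rows)  # [('Boston Red Sox', 'Example Player')]
```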

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
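
The rolling-snapshot aggregation can be sketched as a trailing-window sum.  This is a simplified Python illustration with made-up daily values; the actual implementation aggregated the daily fact tables in the database rather than in application code:

```python
from collections import deque

def trailing_sums(daily_values, window=7):
    """Return the trailing-window sum aligned to each day.

    Equivalent to the 'previous 7 days' rolling snapshot: each output
    value is the sum of at most the last `window` daily values.
    """
    buf, out = deque(), []
    total = 0
    for v in daily_values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the day that aged out of the window
        out.append(total)
    return out

# Hypothetical hits per day for one player over nine days
hits_per_day = [1, 0, 2, 3, 1, 0, 2, 4, 1]
print(trailing_sums(hits_per_day, window=7))
# [1, 1, 3, 6, 7, 7, 9, 12, 13]
```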

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.

Additionally, newer baseball statistics defined in the late 20th century, such as secondary average and runs created, have widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
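
To illustrate the kind of formula metadata described above, here is a small Python sketch of two of these statistics: the standard batting average, and the basic Bill James version of runs created (just one of its many variants).  The sample numbers are hypothetical:

```python
def batting_average(hits: int, at_bats: int) -> float:
    """The long-standing standard definition: hits divided by at bats."""
    return hits / at_bats if at_bats else 0.0

def runs_created_basic(hits, walks, total_bases, at_bats):
    """Basic runs created: (H + BB) * TB / (AB + BB).

    Other variants of runs created use different terms, which is exactly
    why the formula belongs in the metadata alongside the statistic name.
    """
    denominator = at_bats + walks
    return (hits + walks) * total_bases / denominator if denominator else 0.0

# A .300 hitter: 150 hits in 500 at bats
print(batting_average(150, 500))                 # 0.3
print(runs_created_basic(150, 60, 250, 500))     # 93.75
```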

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.


Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 

This is a significant advantage for fantasy sports – numerous external data sources that can be used for validation are freely available online.  Of course, this wasn't always the case. 

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.


Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines all of the data quality dimensions, which include the following most applicable to fantasy sports:

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.



I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”

Hailing Frequencies Open

“This is Captain James E. Harris of the Data Quality Starship Collaboration...”

Clearly, I am a Star Trek nerd – but I am also a people person.  Although people, process, and technology are all important for successful data quality initiatives, without people, process and technology are useless. 

Collaboration is essential.  More than anything else, it requires effective communication – which begins with effective listening.


Seek First to Understand...Then to Be Understood

This is Habit 5 from Stephen Covey's excellent book The 7 Habits of Highly Effective People.  “We typically seek first to be understood,” explains Covey.  “Most people do not listen with the intent to understand; they listen with the intent to reply.”

We are all proud of our education, knowledge, understanding, and experience.  Since it is commonly believed that experience is the path that separates knowledge from wisdom, we can't wait to share our wisdom with the world.  However, as Covey cautions, our desire to be understood can make “our conversations become collective monologues.”

Covey explains that listening is an activity that can be practiced at one of the following five levels:

  1. Ignoring – we are not really listening at all.
  2. Pretending – we are only waiting for our turn to speak, constantly nodding and saying: “Yeah. Uh-huh. Right.” 
  3. Selective Listening – we are only hearing certain parts of the conversation, such as when we're listening to the constant chatter of a preschool child.
  4. Attentive Listening – we are paying attention and focusing energy on the words that are being said.
  5. Empathic Listening – we are actually listening with the intent to really try to understand the other person's frame of reference.  We look out through it, see the world the way they see it, understand their paradigm, and understand how they feel.

“Empathy is not sympathy,” explains Covey.  “Sympathy is a form of agreement, a form of judgment.  And it is sometimes the more appropriate response.  But people often feed on sympathy.  It makes them dependent.  The essence of empathic listening is not that you agree with someone; it's that you fully, deeply, understand that person, emotionally as well as intellectually.”



Some people balk at discussing the use of emotion in a professional setting, where typically it is believed that rational analysis must protect us from irrational emotions.  To return to a Star Trek metaphor, these people model their professional behavior after the Vulcans. 

Vulcans live according to the philosopher Surak's code of emotional self-control.  Starting at a very young age, they are taught meditation and other techniques in order to suppress their emotions and live a life guided by reason and logic alone.


Be Truly Extraordinary

In all professions, it is fairly common to encounter rational and logically intelligent people. 

Truly extraordinary people masterfully blend both kinds of intelligence – intellectual and emotional.  A well-grounded sense of self-confidence, an empathetic personality, and excellent communication skills exert a more powerfully positive influence than remarkable knowledge and expertise alone.


Your Away Mission

As a data quality consultant, when I begin an engagement with a new client, I often joke that I shouldn't be allowed to speak for the first two weeks.  This is my way of explaining that I will be asking more questions than providing answers. 

I am seeking first to understand the current environment from both the business and technical perspectives.  Only after I have achieved this understanding, will I then seek to be understood regarding my extensive experience of the best practices that I have seen work on successful data quality initiatives.

As fellow Star Trek nerds know, the captain doesn't go on away missions.  Therefore, your away mission is to try your best to practice empathic listening at your next data quality discussion – “Make It So!”

Data quality initiatives require a holistic approach involving people, process, and technology.  You must consider the people factor first and foremost, because it will be the people involved, and not the process or the technology, that will truly allow your data quality initiative to “Live Long and Prosper.”


As always, hailing frequencies remain open to your comments.  And yes, I am trying my best to practice empathic listening.


Related Posts

Not So Strange Case of Dr. Technology and Mr. Business

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You

Hyperactive Data Quality (Second Edition)

In the first edition of Hyperactive Data Quality, I discussed reactive and proactive approaches using the data quality lake analogy from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset:

“...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it: running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH.

The alternative is to reduce the pollutant at the point source – the factories.

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data.”

Reactive Data Quality

Reactive Data Quality (i.e. “cleaning the lake” in Redman's analogy) focuses entirely on finding and fixing the problems with existing data after it has been extracted from its sources. 

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert “lake cleaners”).  Data quality problems can be very insidious and even the best “lake cleaning” process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.


Proactive Data Quality

Proactive Data Quality focuses on preventing errors at the sources where data is entered or received, and before it is extracted for use by downstream applications (i.e. “enters the lake” in Redman's analogy). 

Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

“It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect.”
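
The Rule of Ten is easy to see with a little arithmetic.  This Python sketch uses hypothetical unit costs to show how even a modest defect rate inflates the total cost of work:

```python
# Hypothetical costs illustrating Redman's Rule of Ten: a unit of work
# costs $1 on perfect input and ten times that on defective input.
COST_CLEAN = 1.00
COST_DEFECTIVE = 10.00  # Rule of Ten multiplier

def total_cost(units: int, defect_rate: float) -> float:
    """Total cost of processing `units` of work at a given defect rate."""
    defective = units * defect_rate
    clean = units - defective
    return clean * COST_CLEAN + defective * COST_DEFECTIVE

print(total_cost(1000, 0.00))  # 1000.0 -- all input data perfect
print(total_cost(1000, 0.10))  # 1900.0 -- a 10% defect rate nearly doubles cost
```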

Proactive data quality advocates reevaluating business processes that create data, implementing improved controls on data entry screens and web forms, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the information needs of your consumers before delivering enterprise data for their use.


Proactive Data Quality > Reactive Data Quality

Proactive data quality is clearly the superior approach.  Although it is impossible to truly prevent every problem before it happens, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information. 

Reactive data quality essentially treats the symptoms without curing the disease.  As Redman explains: “...the problem with being a good lake cleaner is that life never gets any better as more data...conspire to mean there is more work every day.”

So why do the vast majority of data quality initiatives use a reactive approach?


An Arrow Thickly Smeared With Poison

In Buddhism, there is a famous parable:

A man was shot with an arrow thickly smeared with poison.  His friends wanted to get a doctor to heal him, but the man objected by saying:

“I will neither allow this arrow to be pulled out nor accept any medical treatment until I know the name of the man who wounded me, whether he was a nobleman or a soldier or a merchant or a farmer or a lowly peasant, whether he was tall or short or of average height, whether he used a long bow or a crossbow, and whether the arrow that wounded me was hoof-tipped or curved or barbed.” 

While his friends went off in a frantic search for these answers, the man slowly, and painfully, died.


“Flight to Data Quality”

In economics, the term “flight to quality” describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments.

A similar “flight to data quality” can occur in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure, or a financial reporting scandal. 

Driven by a business triage for critical data problems, reactive data cleansing is purposefully chosen over proactive defect prevention.  The priority is finding and fixing the near-term problems rather than worrying about the long-term consequences of not identifying the root cause and implementing process improvements that would prevent it from happening again.

The enterprise has been shot with an arrow thickly smeared with poison – poor data quality.  Now is not the time to point out that the enterprise has actually shot itself by failing to have proactive measures in place. 

Reactive data quality only treats the symptoms.  However, during triage, the priority is to stabilize the patient.  A cure for the underlying condition is worthless if the patient dies before it can be administered.


Hyperactive Data Quality

Proactive data quality is the best practice.  Root cause analysis, business process improvement, and defect prevention will always be more effective than the endlessly vicious cycle of reactive data cleansing. 

A data governance framework is necessary for proactive data quality to be successful.  Patience and understanding are also necessary.  Proactive data quality requires a strategic organizational transformation that will not happen easily or quickly. 

Even when not facing an immediate crisis, the reality is that reactive data quality will occasionally be a necessary evil that is used to correct today's problems while proactive data quality is busy trying to prevent tomorrow's problems.

Just like any complex problem, data quality has no fast and easy solution.  Fundamentally, a hybrid discipline is required that combines proactive and reactive aspects into an approach that I refer to as Hyperactive Data Quality, which will make the responsibility for managing data quality a daily activity for everyone in your organization.


Please share your thoughts and experiences.


Related Posts

Hyperactive Data Quality (First Edition)

The General Theory of Data Quality

The Very True Fear of False Positives

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.
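
To make one of those terms concrete, here is a minimal sketch of Fellegi-Sunter agreement and disagreement weights, the core of probabilistic record linkage.  The m and u probabilities below are illustrative values of my own choosing, not from any particular implementation:

```python
import math

def fs_weight(agrees: bool, m: float, u: float) -> float:
    """Fellegi-Sunter field weight.

    m = probability the field agrees given the pair is a true match
    u = probability the field agrees given the pair is a non-match
    Agreement contributes log2(m/u); disagreement log2((1-m)/(1-u)).
    """
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))

# Hypothetical comparison of one record pair:
# surname agrees (m=0.9, u=0.1), postal code disagrees (m=0.8, u=0.05)
score = fs_weight(True, 0.9, 0.1) + fs_weight(False, 0.8, 0.05)
print(round(score, 2))  # 0.92 -- agreement and disagreement nearly cancel out
```

The composite score, summed over all compared fields, is what then gets ranked against thresholds to decide the match result.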

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels).  Ranking is often used as a primary method in differentiating the three possible result categories:

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review
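
The three result categories are typically separated by a pair of thresholds applied to the match score.  A minimal sketch, with hypothetical threshold values that would be tuned per implementation:

```python
# Hypothetical thresholds on a normalized match score in [0, 1]
AUTO_MATCH_THRESHOLD = 0.90
AUTO_NON_MATCH_THRESHOLD = 0.60

def categorize(score: float) -> str:
    """Map a match score to one of the three possible result categories."""
    if score >= AUTO_MATCH_THRESHOLD:
        return "automatic match"
    if score < AUTO_NON_MATCH_THRESHOLD:
        return "automatic non-match"
    return "potential match (manual review)"

print(categorize(0.95))  # automatic match
print(categorize(0.75))  # potential match (manual review)
print(categorize(0.40))  # automatic non-match
```

Narrowing the gap between the two thresholds reduces the manual review workload; widening it shifts more of the risky middle ground to human judgment.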

All data matching techniques must also face the daunting challenge of what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched
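
When a set of true duplicate pairs is known (e.g. from manual review), both heads of The Two Headed Monster can be counted with simple set differences.  The record pairs here are hypothetical:

```python
# Hypothetical truth set and match results, keyed by record-pair ids
true_duplicates = {("r1", "r2"), ("r3", "r4"), ("r5", "r6")}
matched_pairs   = {("r1", "r2"), ("r5", "r6"), ("r7", "r8")}

# False negatives: true duplicates the matching process missed
false_negatives = true_duplicates - matched_pairs
# False positives: pairs linked that are not true duplicates
false_positives = matched_pairs - true_duplicates

print(sorted(false_negatives))  # [('r3', 'r4')]
print(sorted(false_positives))  # [('r7', 'r8')]
```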

For data examples that illustrate the challenge of false negatives and false positives, please refer to my Data Quality Pro articles:


Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution – at least not until Doc Brown and I get our patent-pending prototype of the redundant data capacitor working – because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique – every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the Gartner Data Quality Magic Quadrant and many of the so-called niche vendors.  Without exception, I have always been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.


The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives – identifying records within and across existing systems that are not currently linked and are therefore preventing the enterprise from understanding the true data relationships that exist in its information assets.

However, the pursuit to reduce false negatives carries with it the risk of creating false positives. 

In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software than they are about the false negatives left unlinked – after all, those records were not linked before investing in the data matching software.  Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry's dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern.  It is these human and business factors – and not just the technology itself – that need to be taken into consideration when planning a data matching implementation. 

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.


Additional Resources

Data Quality and Record Linkage Techniques

The Art of Data Matching

Identifying Duplicate Customer Records - Case Study

Narrative Fallacy and Data Matching

Speaking of Narrative Fallacy

The Myth of Matching: Why We Need Entity Resolution

The Human Element in Identity Resolution

Probabilistic Matching: Sounds like a good idea, but...

Probabilistic Matching: Part Two

Not So Strange Case of Dr. Technology and Mr. Business

Strange Case of Dr Jekyll and Mr Hyde was Robert Louis Stevenson's classic novella about the duality of human nature and the inner conflict of our personal sense of good and evil that can undermine our noblest intentions.  The novella exemplified this inner conflict using the self-destructive split-personality of Henry Jekyll and Edward Hyde.

The duality of data quality's nature can sometimes cause an organizational conflict between the Business and IT.  The complexity of a data quality project can also work against your best intentions.  Knowledge about data, business processes, and supporting technology is spread throughout the organization. 

Neither the Business nor IT alone has all of the necessary information required to achieve data quality success. 

As a data quality consultant, I am often asked to wear many hats – and not just because my balding head is distractingly shiny. 

I often play a hybrid role that helps facilitate the business and technical collaboration of the project team.

I refer to this hybrid role as using the split-personality of Dr. Technology and Mr. Business.


Dr. Technology

With relatively few exceptions, IT is usually the first group that I meet with when I begin an engagement with a new client.  However, this doesn't mean that IT is more important than the Business.  Consultants are commonly brought on board after the initial business requirements have been drafted and the data quality tool has been selected.  Meeting with IT first is especially common if one of my tasks is to help install and configure the data quality tool.

When I meet with IT, I use my Dr. Technology personality.  IT needs to know that I am there to share my extensive experience and best practices from successful data quality projects to help them implement a well architected technical solution.  I ask about data quality solutions that have been attempted previously, how well they were received by the Business, and if they are still in use.  I ask if IT has any issues with or concerns about the data quality tool that was selected.

I review the initial business requirements with IT to make sure I understand any specific technical challenges such as data access, server capacity, security protocols, scheduled maintenance and after-hours support.  I freely “geek out” in techno-babble.  I debate whether Farscape or Battlestar Galactica was the best science fiction series in television history.  I verify the favorite snack foods of the data architects, DBAs, and server administrators since whenever I need a relational database table created or more temporary disk space allocated, I know the required currency will often be Mountain Dew and Doritos.


Mr. Business

When I meet with the Business for the first time, I do so without my IT entourage and I use my Mr. Business personality.  The Business needs to know that I am there to help customize a technical solution to their specific business needs.  I ask them to share their knowledge in their natural language using business terminology.  Regardless of my experience with other companies in their industry, every organization and their data is unique.  No assumptions should be made by any of us.

I review the initial requirements with the Business to make sure I understand who owns the data and how it is used to support the day-to-day operation of each business unit and initiative.  I ask if the requirements were defined before or after the selection of the data quality tool.  Knowing how the data quality tool works can sometimes cause a “framing effect” where requirements are defined in terms of tool functionality, framing them as a technical problem instead of a business problem.  All data quality tools provide viable solutions driven by impressive technology.  Therefore, the focus should always be on stating the problem and solution criteria in business terms.


Dr. Technology and Mr. Business Must Work Together

As the cross-functional project team starts working together, my Dr. Technology and Mr. Business personalities converge to help clarify communication by providing bi-directional translation, mentoring, documentation, training and knowledge transfer.  I can help interpret business requirements and functional specifications, help explain business and technical challenges, and help maintain an ongoing dialogue between the Business and IT. 

I can also help each group save face by playing the important role of Designated Asker of Stupid Questions – one of those intangible skills you can't find anywhere on my resume.

As the project progresses, the communication and teamwork between the Business and IT will become more and more natural and I will become less and less necessary – one of my most important success criteria.


Success is Not So Strange

When the Business and IT forge an ongoing collaborative partnership throughout the entire project, success is not so strange.

In fact, your data quality project can be the beginning of a beautiful friendship between the Business and IT. 

Everyone on the project team can develop a healthy split-personality. 

IT can use their Mr. Business (or Ms. Business) personality to help them understand the intricacies of business processes. 

The Business can use their Dr. Technology personality to help them “get their geek on.”


Data quality success is all about shiny happy people holding hands – and what's so strange about that?


Related Posts

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You


Additional Resources

From the Data Quality Pro forum, read the discussion: Data Quality is not an IT issue

From the blog Inside the Biz with Jill Dyché, read her posts:

From Paul Erb's blog Comedy of the Commons, read his post: I Don't Know Much About Data, but I Know What I Like