Worthy Data Quality Whitepapers (Part 3)

In my April 2009 blog post Data Quality Whitepapers are Worthless, I called for data quality whitepapers worth reading.

This post is now the third entry in an ongoing series about data quality whitepapers that I have read and can endorse as worthy.


Matching Technology Improves Data Quality

Steve Sarsfield recently published Matching Technology Improves Data Quality, a worthy data quality whitepaper, which is a primer on the elementary principles, basic theories, and strategies of record matching.

This free whitepaper is available for download from Talend (requires registration by providing your full contact information).

The whitepaper describes the nuances of deterministic and probabilistic matching and the algorithms used to identify the relationships among records.  It covers the processes to employ in conjunction with matching technology to transform raw data into powerful information that drives success in enterprise applications, including customer relationship management (CRM), data warehousing, and master data management (MDM).

Steve Sarsfield is the Talend Data Quality Product Marketing Manager, and author of the book The Data Governance Imperative and the popular blog Data Governance and Data Quality Insider.


Whitepaper Excerpts

Excerpts from Matching Technology Improves Data Quality:

  • “Matching plays an important role in achieving a single view of customers, parts, transactions and almost any type of data.”
  • “Since data doesn’t always tell us the relationship between two data elements, matching technology lets us define rules for items that might be related.”
  • “Nearly all experts agree that standardization is absolutely necessary before matching.  The standardization process improves matching results, even when implemented along with very simple matching algorithms.  However, in combination with advanced matching techniques, standardization can improve information quality even more.”
  • “There are two common types of matching technology on the market today, deterministic and probabilistic.”
  • “Deterministic or rules-based matching is where records are compared using fuzzy algorithms.”
  • “Probabilistic matching is where records are compared using statistical analysis and advanced algorithms.”
  • “Data quality solutions often offer both types of matching, since one is not necessarily superior to the other.”
  • “Organizations often evoke a multi-match strategy, where matching is analyzed from various angles.”
  • “Matching is vital to providing data that is fit-for-use in enterprise applications.”
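
To make the difference between the two matching types a little more concrete, here is a minimal sketch in Python.  It is not taken from the whitepaper.  The field names, the sample records, and the m/u probabilities are all hypothetical values chosen for illustration, and the deterministic rule uses exact comparison of standardized fields, which is a simplification of what real matching tools do.

```python
import math

def standardize(record):
    """Trim, collapse whitespace, and lowercase every field before matching."""
    return {k: " ".join(str(v).split()).lower() for k, v in record.items()}

def deterministic_match(a, b, keys=("last_name", "postal_code")):
    """Rules-based sketch: records match only if every key field agrees."""
    return all(a[k] == b[k] for k in keys)

# Hypothetical probabilities per field (assumptions, not from the whitepaper):
#   m = P(field agrees | records truly match)
#   u = P(field agrees | records do not match)
WEIGHTS = {
    "first_name": (0.90, 0.05),
    "last_name": (0.95, 0.01),
    "postal_code": (0.85, 0.02),
}

def probabilistic_score(a, b, weights=WEIGHTS):
    """Sum of log-likelihood ratios across fields (Fellegi-Sunter style)."""
    score = 0.0
    for field, (m, u) in weights.items():
        if a[field] == b[field]:
            score += math.log2(m / u)              # agreement adds evidence
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return score

rec1 = standardize({"first_name": "Jim ", "last_name": "HARRIS", "postal_code": "50309"})
rec2 = standardize({"first_name": "James", "last_name": "Harris", "postal_code": "50309"})

print(deterministic_match(rec1, rec2))   # rule looks only at last name and postal code
print(probabilistic_score(rec1, rec2))   # compared against a threshold you choose
```

Without the standardize step, "HARRIS" and "Harris" would not agree under either approach, which is one small illustration of why standardization comes before matching.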

Related Posts

Identifying Duplicate Customers

Customer Incognita

To Parse or Not To Parse

The Very True Fear of False Positives

Data Governance and Data Quality

Worthy Data Quality Whitepapers (Part 2)

Worthy Data Quality Whitepapers (Part 1)

Data Quality Whitepapers are Worthless

Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?


Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects with many characteristics including the following:

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independently of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield.  Second base and shortstop are also infield positions, and more specifically middle infield.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
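
The position taxonomy described above can be sketched as a simple parent lookup.  This is only an illustration: the dictionary contains just the infield positions named in this post, and the idea of a roster "slot" that accepts any position beneath it is my assumption about how a league might use the taxonomy.

```python
# Each position points to its parent in the taxonomy (infield branch only).
POSITION_PARENT = {
    "first base": "corner infield",
    "third base": "corner infield",
    "second base": "middle infield",
    "shortstop": "middle infield",
    "corner infield": "infield",
    "middle infield": "infield",
}

def ancestors(position):
    """Walk up the taxonomy and return every broader category, nearest first."""
    path = []
    while position in POSITION_PARENT:
        position = POSITION_PARENT[position]
        path.append(position)
    return path

def qualifies(position, slot):
    """True if a player at this position can fill the given (hypothetical) slot."""
    return position == slot or slot in ancestors(position)

print(ancestors("shortstop"))
print(qualifies("first base", "corner infield"))
```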


Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.
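
For readers unfamiliar with Type 2 slowly changing dimensions, here is a minimal sketch of the idea using plain Python lists and dictionaries rather than database tables.  The column names (player_key, player_id, valid_from, valid_to) and the sample uniform number change are hypothetical, not the actual design of my warehouse.

```python
from datetime import date

def apply_type2_change(rows, player_id, attrs, effective):
    """Expire the current row for player_id (if any) and append a new version."""
    for row in rows:
        if row["player_id"] == player_id and row["valid_to"] is None:
            if row["attrs"] == attrs:
                return rows              # nothing changed, so no new version
            row["valid_to"] = effective  # close out the old version
    rows.append({
        "player_key": len(rows) + 1,     # surrogate key
        "player_id": player_id,          # natural key from the source
        "attrs": dict(attrs),
        "valid_from": effective,
        "valid_to": None,                # None marks the current version
    })
    return rows

def as_of(rows, player_id, when):
    """Return the version of the player that was in effect on a given date."""
    for row in rows:
        if (row["player_id"] == player_id
                and row["valid_from"] <= when
                and (row["valid_to"] is None or when < row["valid_to"])):
            return row
    return None

player_dim = []
apply_type2_change(player_dim, "P100", {"uniform_number": 7}, date(2009, 4, 1))
apply_type2_change(player_dim, "P100", {"uniform_number": 21}, date(2009, 7, 1))

print(as_of(player_dim, "P100", date(2009, 5, 15))["attrs"])
```

Because old versions are expired rather than overwritten, a fact recorded in May still joins to the attributes that were true in May, which is what makes full history possible.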

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 
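
A factless fact table records only that a combination of dimension keys occurred, with no numeric measures.  The following sketch models the Fantasy Team Lineup table as a set of key tuples.  The league, team, and player identifiers are made up for illustration.

```python
from datetime import date

# Each row holds only dimension keys:
# (date, fantasy_league, fantasy_team, player, position)
fantasy_team_lineup = {
    (date(2009, 6, 1), "League A", "Team 1", "P100", "1B"),
    (date(2009, 6, 1), "League A", "Team 1", "P200", "SS"),
    (date(2009, 6, 1), "League B", "Team 9", "P100", "1B"),
    (date(2009, 6, 2), "League A", "Team 1", "P200", "SS"),
}

def active_players(lineup, day, league, team):
    """Players in the active lineup of one fantasy team on one day."""
    return sorted(p for d, l, t, p, _ in lineup if (d, l, t) == (day, league, team))

def teams_for_player(lineup, day, player):
    """Every (league, team) pair that had the player active on that day."""
    return sorted((l, t) for d, l, t, p, _ in lineup if d == day and p == player)

print(active_players(fantasy_team_lineup, date(2009, 6, 1), "League A", "Team 1"))
print(teams_for_player(fantasy_team_lineup, date(2009, 6, 1), "P100"))
```

Note that the same player key appears under two different fantasy leagues on the same day without any duplication in the Player dimension itself.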

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players.  A separate Fantasy Player dimension would have redundantly stored multiple instances of the same professional player (one for each fantasy team he played for) and would have required using Fantasy League and Fantasy Team as snowflaked dimensions.

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
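
The rolling snapshots can be sketched as follows.  This version recomputes each window from the daily facts instead of incrementing a stored snapshot, and it treats the as-of date as the last day of each window.  Both are simplifying assumptions, and the sample statistics are invented.

```python
from datetime import date, timedelta

def rolling_totals(daily_facts, player, as_of, windows=(7, 14, 21)):
    """Sum each statistic over the N days ending on as_of (inclusive)."""
    totals = {}
    for days in windows:
        start = as_of - timedelta(days=days - 1)
        window_total = {}
        for (day, pid), stats in daily_facts.items():
            if pid == player and start <= day <= as_of:
                for name, value in stats.items():
                    window_total[name] = window_total.get(name, 0) + value
        totals[days] = window_total
    return totals

# Daily batting facts keyed by (date, player); values are invented.
daily_batting = {
    (date(2009, 6, 1), "P100"): {"hits": 2, "at_bats": 4},
    (date(2009, 6, 10), "P100"): {"hits": 1, "at_bats": 3},
    (date(2009, 6, 20), "P100"): {"hits": 3, "at_bats": 5},
    (date(2009, 6, 20), "P200"): {"hits": 0, "at_bats": 4},
}

snapshot = rolling_totals(daily_batting, "P100", date(2009, 6, 21))
print(snapshot[7], snapshot[14], snapshot[21])
```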

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.

Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created were defined with widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
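
As a rough illustration of a metadata table that stores a definition alongside its formula, here is a sketch containing only batting average, since it is the one statistic defined in this post.  The table layout and the choice to return None when there are no at bats are my assumptions.  Statistics with competing formulas would each need the chosen version recorded in the same way.

```python
# Metadata keyed by statistic name: a plain-language definition plus a formula.
STAT_METADATA = {
    "batting_average": {
        "definition": "Ratio of hits to at bats",
        "formula": lambda s: s["hits"] / s["at_bats"] if s["at_bats"] else None,
    },
}

def compute_stat(name, totals):
    """Look up the formula in the metadata and apply it to accumulated totals."""
    return STAT_METADATA[name]["formula"](totals)

print(STAT_METADATA["batting_average"]["definition"])
print(compute_stat("batting_average", {"hits": 3, "at_bats": 10}))
```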

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.


Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 

This is a significant advantage for fantasy sports – there are numerous external data sources that can be used for validation freely available online.  Of course, this wasn't always the case. 

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.


Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines all of the data quality dimensions, which include the following most applicable to fantasy sports:

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.



I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”

Worthy Data Quality Whitepapers (Part 2)

Overall Approach to Data Quality ROI

Overall Approach to Data Quality ROI is a worthy data quality whitepaper freely available (name and email required for download) from the McKnight Consulting Group.


William McKnight

The author of the whitepaper is William McKnight, President of McKnight Consulting Group.  William focuses on delivering business value and solving business problems utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures.  William has more than 20 years of information management experience, nearly half of which was gained in IT leadership positions, dealing firsthand with the challenging issues his clients now face.  His IT and consulting teams have won best practice competitions for their implementations.  In 11 years of consulting, he has been a part of 150 client programs worldwide, has over 300 articles, whitepapers and tips in publication and is a frequent international speaker.  William and his team provide clients with action plans, architectures, complete programs, vendor-neutral tool selection and right-fit resources. 

Additionally, William has an excellent blog on the B-eye-Network and a new course now available on eLearningCurve.


Whitepaper Excerpts

Excerpts from Overall Approach to Data Quality ROI:

  • “Data quality is an elusive subject that can defy measurement and yet be critical enough to derail any single IT project, strategic initiative, or even a company as a whole.”
  • “Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise.  Put simply, it is a competitive strategy.”
  • Six key steps to help you realize tangible ROI on your data quality initiative:
    1. System Profiling – survey and prioritize your company systems according to their use of and need for quality data.
    2. Data Quality Rule Determination – data quality can be defined as a lack of intolerable defects.
    3. Data Profiling – usually no one can articulate how clean or dirty corporate data is.  Without this measurement of cleanliness, the effectiveness of activities that are aimed at improving data quality cannot be measured.
    4. Data Quality Scoring – scoring is a relative measure of conformance to rules.  System scores are an aggregate of the rule scores for that system and the overall score is a prorated aggregation of the system scores.
    5. Measure Impact of Various Levels of Data Quality – ROI is about accumulating all returns and investments from a project’s build, maintenance, and associated business and IT activities through to the ultimate desired results – all while considering the possible outcomes and their likelihood.
    6. Data Quality Improvement – it is much more costly to fix data quality errors in downstream systems than it is at the point of origin.
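
The scoring step (step 4) lends itself to a short sketch.  The whitepaper excerpt does not specify how rule scores are aggregated or how the overall score is prorated, so the simple average and the weights below are assumptions, as are the sample systems, rules, and records.

```python
def rule_score(records, rule):
    """Share of records that conform to a rule (1.0 means no defects)."""
    if not records:
        return 1.0
    return sum(1 for r in records if rule(r)) / len(records)

def system_score(records, rules):
    """Aggregate of the rule scores for one system (plain average assumed)."""
    return sum(rule_score(records, rule) for rule in rules) / len(rules)

def overall_score(system_scores, weights):
    """Prorated aggregation: weight each system by its assumed importance."""
    total = sum(weights.values())
    return sum(score * weights[name] for name, score in system_scores.items()) / total

# Hypothetical systems, rules, and records.
crm = [
    {"email": "a@example.com", "postal": "50309"},
    {"email": "", "postal": "50310"},
    {"email": "b@example.com", "postal": "50311"},
    {"email": "c@example.com", "postal": "50312"},
]
billing = [{"amount": 10}, {"amount": -5}]

scores = {
    "crm": system_score(crm, [lambda r: "@" in r["email"],
                              lambda r: len(r["postal"]) == 5]),
    "billing": system_score(billing, [lambda r: r["amount"] > 0]),
}
overall = overall_score(scores, {"crm": 3, "billing": 1})
print(scores, overall)
```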

Related Posts

Worthy Data Quality Whitepapers (Part 1)

Data Quality Whitepapers are Worthless

Worthy Data Quality Whitepapers (Part 1)

In my April blog post Data Quality Whitepapers are Worthless, I called for data quality whitepapers that are worth reading.

This post will be the first in an ongoing series about data quality whitepapers that I have read and can endorse as worthy.


It is about the data – the quality of the data

This is the subtitle of two brief but informative data quality whitepapers freely available (no registration required) from the Electronic Commerce Code Management Association (ECCMA): Transparency and Data Portability.



ECCMA is an international association of industry and government master data managers working together to increase the quality and lower the cost of descriptions of individuals, organizations, goods and services through developing and promoting International Standards for Master Data Quality. 

Formed in April 1999, ECCMA has brought together thousands of experts from around the world and provides them a means of working together in the fair, open and extremely fast environment of the Internet to build and maintain the global, open standard dictionaries that are used to unambiguously label information.  The existence of these dictionaries of labels allows information to be passed from one computer system to another without losing meaning.


Peter Benson

The author of the whitepapers is Peter Benson, the Executive Director and Chief Technical Officer of the ECCMA.  Peter is an expert in distributed information systems, content encoding and master data management.  He designed one of the very first commercial electronic mail software applications, WordStar Messenger, and was granted a landmark British patent in 1992 covering the use of electronic mail systems to maintain distributed databases.

Peter designed and oversaw the development of a number of strategic distributed database management systems used extensively in the UK and US by the Public Relations and Media Industries.  From 1994 to 1998, Peter served as the elected chairman of the American National Standards Institute Accredited Committee ANSI ASC X12E, the Standards Committee responsible for the development and maintenance of the EDI standard for product data.

Peter is known for the design, development and global promotion of the UNSPSC as an internationally recognized commodity classification and more recently for the design of the eOTD, an internationally recognized open technical dictionary based on the NATO codification system.

Peter is an expert in the development and maintenance of Master Data Quality as well as an internationally recognized proponent of Open Standards that he believes are critical to protect data assets from the applications used to create and manipulate them. 

Peter is the Project Leader for ISO 8000, which is a new international standard for data quality.

You can get more information about ISO 8000 by clicking on this link: ISO 8000


Whitepaper Excerpts

Excerpts from Transparency:

  • “Today, more than ever before, our access to data, the ability of our computer applications to use it and the ultimate accuracy of the data determines how we see and interact with the world we live and work in.”
  • “Data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”
  • “Transparency requires that transaction data accurately identifies who, what, where and when and master data accurately describes who, what and where.”


Excerpts from Data Portability:

  • “In an environment where the life cycle of software applications used to capture and manage data is but a fraction of the life cycle of the data itself, the issues of data portability and long-term data preservation are critical.”
  • “Claims that an application exports data in XML does address the syntax part of the problem, but that is the easy part.  What is required is to be able to export all of the data in a form that can be easily uploaded into another application.”
  • “In a world rapidly moving towards SaaS and cloud computing, it really pays to pause and consider not just the physical security of your data but its portability.”


Data Quality Whitepapers are Worthless

During a 1609 interview, William Shakespeare was asked his opinion about an emerging genre of theatrical writing known as Data Quality Whitepapers.  The "Bard of Avon" was clearly not a fan.  His famously satirical response was:

Data quality's but a writing shadow, a poor paper

That struts and frets its words upon the page

And then is heard no more:  it is a tale

Told by a vendor, full of sound and fury

Signifying nothing.


Four centuries later, I find myself in complete agreement with Shakespeare (and not just because Harold Bloom told me so).


Today is April Fool's Day, but I am not joking around - call Dennis Miller and Lewis Black - because I am ready to RANT.


I am sick and tired of reading whitepapers.  Here is my "Bottom Ten List" explaining why: 

  1. Ones that make me fill out a "please mercilessly spam me later" contact information form before I am allowed to download them remind me of Mrs. Bun: "I DON'T LIKE SPAM!"
  2. Ones that, after I read their supposed pearls of wisdom, make me shake my laptop violently like an Etch-A-Sketch.  I have lost count of how many laptops I have destroyed this way.  I have started buying them in bulk at Wal-Mart.
  3. Ones comprised entirely of the exact same information found on the vendor's website make www = World Wide Worthless.
  4. Ones that start out good, but just when they get to the really useful stuff, refer to content only available to paying customers.  What a great way to guarantee that neither I nor anyone I know will ever become your paying customer!
  5. Ones that have a "Shock and Awe" title followed by "Aw Shucks" content because apparently the entire marketing budget was spent on the title.
  6. Ones that promise me the latest BUZZ but deliver only ZZZ are not worthless only when I have insomnia.
  7. Ones that claim to be about data quality, but have nothing at all to do with data quality:  "...don't make me angry.  You wouldn't like me when I'm angry."
  8. Ones that take the adage "a picture is worth a thousand words" too far by using a dizzying collage of logos, charts, graphs and other visual aids.  This is one reason we're happy that Pablo Picasso was a painter.  However, he did once write that "art is a lie that makes us realize the truth."  Maybe he was defending whitepapers.
  9. Ones that use acronyms without ever defining what they stand for remind me of that scene from Good Morning, Vietnam: "Excuse me, sir.  Seeing as how the VP is such a VIP, shouldn't we keep the PC on the QT?  Because if it leaks to the VC he could end up MIA, and then we'd all be put out in KP."
  10. Ones that really know they're worthless but aren't honest about it.  Don't promise me "The Top 10 Metrics for Data Quality Scorecards" and give me a list as pointless as this one.


I am officially calling out all writers of Data Quality Whitepapers. 

Shakespeare and I both believe that you can't write anything about data quality that is worth reading. 

Send your data quality whitepapers to Obsessive-Compulsive Data Quality and if they are not worthless, then I will let the world know that you proved Shakespeare and me wrong.


And while I am on a rant roll, I am officially calling out all Data Quality Bloggers.

The International Association for Information and Data Quality (IAIDQ) is celebrating its five-year anniversary by hosting:

El Festival del IDQ Bloggers – A Blog Carnival for Information/Data Quality Bloggers

For more information about the blog carnival, please follow this link:  IAIDQ Blog Carnival