Expectation and Data Quality

One of my favorite recently read books is You Are Not So Smart by David McRaney.  Earlier this week, the book’s chapter about expectation was excerpted as the online article Why We Can’t Tell Good Wine From Bad, which also provided additional examples of how we can be fooled when our expectations are altered.

“In one Dutch study,” McRaney explained, “participants were put in a room with posters proclaiming the awesomeness of high-definition, and were told they would be watching a new high-definition program.  Afterward, the subjects said they found the sharper, more colorful television to be a superior experience to standard programming.”

No surprise there, right?  After all, a high-definition television is expected to produce a high-quality image.

“What they didn’t know,” McRaney continued, “was they were actually watching a standard-definition image.  The expectation of seeing a better quality image led them to believe they had.  Recent research shows about 18 percent of people who own high-definition televisions are still watching standard-definition programming on the set, but think they are getting a better picture.”

I couldn’t help but wonder if establishing an expectation of delivering high-quality data could lead business users to believe that, for example, the data quality of the data warehouse met or exceeded their expectations.  Could business users actually be fooled by altering their expectations about data quality?  Wouldn’t their experience of using the data eventually reveal the truth?

Retailers expertly manipulate us with presentation, price, good marketing, and great service in order to create an expectation of quality in the things we buy.  “The actual experience is less important,” McRaney explained.  “As long as it isn’t total crap, your experience will match up with your expectations.  The build up to an experience can completely change how you interpret the information reaching your brain from your otherwise objective senses.  In psychology, true objectivity is pretty much considered to be impossible.  Memories, emotions, conditioning, and all sorts of other mental flotsam taint every new experience you gain.  In addition to all this, your expectations powerfully influence the final vote in your head over what you believe to be reality.”

“Your expectations are the horse,” McRaney concluded, “and your experience is the cart.”  You might think it should be the other way around, but when your expectations determine your direction, you shouldn’t be surprised by the journey you experience.

If you find it difficult to imagine a positive expectation causing people to overlook poor quality in their experience with data, how about the opposite?  I have seen a poor first impression of a data warehouse, caused by its initial data quality problems, create a negative expectation that led people to overlook the improved data quality in their subsequent experiences with it.  Once people expect to experience poor data quality when using the data warehouse, they stop trusting it and stop using it.

Data warehousing is only one example of how expectation can affect the data quality experience.  How are your organization’s expectations affecting its experiences with data quality?


Related Posts

The Data Outhouse

Data Quality and Anton’s Syndrome

Data Quality and Chicken Little Syndrome

Data Quality and Miracle Exceptions

Availability Bias and Data Quality Improvement

DQ-View: The Five Stages of Data Quality

Data Quality and the Bystander Effect

Data Quality and the Q Test

Data Myopia and Business Relativity

Plato’s Data

The Illusion-of-Quality Effect

Perception Filters and Data Quality


Do the Eyes Have It?

Predictably Poor Data Quality

Freudian Data Quality

Data Psychedelicatessen

Data Geeks and Business Blindness

The Data Sharpshooter Fallacy

Data Quality and the Barber’s Paradox

Open Source Business Intelligence

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, I discuss open source business intelligence (OSBI) with Lyndsay Wise, author of the insightful new book Using Open Source Platforms for Business Intelligence: Avoid Pitfalls and Maximize ROI.

Lyndsay Wise is the President and Founder of WiseAnalytics, an independent analyst firm and consultancy specializing in business intelligence for small and mid-sized organizations.  For more than ten years, Lyndsay Wise has assisted clients in business systems analysis, software selection, and implementation of enterprise applications.

Lyndsay Wise conducts regular research studies, consults, writes articles, and speaks about how to implement a successful business intelligence approach and improve the value of business intelligence within organizations.




Win a copy of the Book

Lyndsay Wise wants to give one OCDQ Radio listener a free copy of Using Open Source Platforms for Business Intelligence.


Here is how the book contest will work:

(1) Book Contest Question — Name one of the considerations for evaluating whether OSBI is the right choice for your organization that Lyndsay Wise discussed during this OCDQ Radio episode.


(2) Book Contest Deadline — On or before January 31, 2013, email Jim Harris with your answer to the book contest question.


(3) Book Contest Winner — In February 2013, one winner will be randomly selected from the emails containing the answer to the contest question, and Lyndsay Wise will email the winner requesting a postal address for sending a free copy of the book.


Related Lyndsay Wise Articles

What You Need to Know about Open Source BI

Open Source BI Considerations and Implications

Do Self-Service and Open Source Co-Exist?

Everything Executives Need to Know about Open Source BI

The Importance of Data Management for Business People


Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Data Myopia and Business Relativity

Since how data quality is defined has a significant impact on how data quality is perceived, measured, and managed, in this post I examine the two most prevalent perspectives on defining data quality, real-world alignment and fitness for the purpose of use, which respectively represent what I refer to as the danger of data myopia and the challenge of business relativity.


Real-World Alignment: The Danger of Data Myopia

Whether it’s an abstract description of real-world entities (i.e., master data) or an abstract description of real-world interactions (i.e., transaction data) among entities, data is an abstract description of reality.  The creation and maintenance of these abstract descriptions shapes the organization’s perception of the real world, which I philosophically pondered in my post Plato’s Data.

The inconvenient truth is that the real world is not the same thing as the digital worlds captured within our databases.

And, of course, creating and maintaining these digital worlds is no easy task, which is exactly the danger inherent with the real-world alignment definition of data quality — when the organization’s data quality efforts are focused on minimizing the digital distance between data and the constantly changing real world that data attempts to describe, it can lead to a hyper-focus on the data in isolation, otherwise known as data myopia.

Even if we create and maintain perfect real-world alignment, what value does high-quality data possess independent of its use?

Real-world alignment reflects the perspective of the data provider, and its advocates argue that providing a trusted source of data to the organization will be able to satisfy any and all business requirements, i.e., high-quality data should be fit to serve as the basis for every possible use.  Therefore, in theory, real-world alignment provides an objective data foundation independent of the subjective uses defined by the organization’s many data consumers.

However, providing the organization with a single system of record, a single version of the truth, a single view, a golden copy, or a consolidated repository of trusted data has long been the rallying cry and siren song of enterprise data warehousing (EDW), and more recently, of master data management (MDM).  Although these initiatives can provide significant business value, it is usually poor data quality that undermines the long-term success and sustainability of EDW and MDM implementations.

Perhaps the enterprise needs a Ulysses pact to protect it from believing in EDW or MDM as a miracle exception for data quality?

A significant challenge for the data provider perspective on data quality is that it is difficult to make a compelling business case on the basis of trusted data without direct connections to the specific business needs of data consumers, whose business, data, and technical requirements are often in conflict with one another.

In other words, real-world alignment does not necessarily guarantee business-world alignment.

So, if using real-world alignment as the definition of data quality has inherent dangers, we might be tempted to conclude that the fitness for the purpose of use definition of data quality is the better choice.  Unfortunately, that is not necessarily the case.


Fitness for the Purpose of Use: The Challenge of Business Relativity

In M. C. Escher’s famous 1953 lithograph Relativity, although we observe multiple, and conflicting, perspectives of reality, from the individual perspective of each person, everything must appear normal, since they are all casually going about their daily activities.

I have always thought this is an apt analogy for the multiple business perspectives on data quality that exist within every organization.

Like truth, beauty, and art, data quality can be said to be in the eyes of the beholder or, when data quality is defined as fitness for the purpose of use, in the eyes of the user.

Most data has both multiple uses and users.  Data of sufficient quality for one use or user may not be of sufficient quality for other uses and users.  These multiple, and often conflicting, perspectives are considered irrelevant from the perspective of an individual user, who just needs quality data to support their own business activities.

Therefore, the user (i.e., data consumer) perspective establishes a relative business context for data quality.

Whereas the real-world alignment definition of data quality can cause a data-myopic focus, the business-world alignment goal of the fitness for the purpose of use definition must contend with the daunting challenge of business relativity.  Most data has multiple data consumers, each with their own relative business context for data quality, making it difficult to balance the diverse data needs and divergent data quality perspectives within the conflicting, and rather Escher-like, reality of the organization.
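To make business relativity concrete, here is a minimal sketch in Python of the same data being judged against the requirements of two different data consumers.  Everything in it is an assumption made for illustration: the consumers, the thresholds, the five-digit postal code rule, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirement:
    """A hypothetical data consumer's minimum acceptable quality levels."""
    consumer: str
    min_completeness: float  # share of records that have a postal code
    min_validity: float      # share of present postal codes that look valid


def measure(records):
    """Return (completeness, validity) for the 'postal_code' field."""
    present = [r["postal_code"] for r in records if r.get("postal_code")]
    # Assumed validity rule for this sketch: exactly five digits.
    valid = [p for p in present if p.isdigit() and len(p) == 5]
    completeness = len(present) / len(records)
    validity = len(valid) / len(present) if present else 0.0
    return completeness, validity


def fit_for_purpose(records, requirements):
    """Judge the same measurements against each consumer's own thresholds."""
    completeness, validity = measure(records)
    return {
        req.consumer: (completeness >= req.min_completeness
                       and validity >= req.min_validity)
        for req in requirements
    }


# Hypothetical sample data: one malformed and one missing postal code.
customers = [
    {"name": "A", "postal_code": "02134"},
    {"name": "B", "postal_code": "9021"},
    {"name": "C", "postal_code": None},
    {"name": "D", "postal_code": "60614"},
]

requirements = [
    Requirement("regional sales report", 0.70, 0.60),
    Requirement("direct mail campaign", 0.95, 0.95),
]

result = fit_for_purpose(customers, requirements)
print(result)
```

With these made-up numbers, the data is fit for the regional sales report but not for the direct mail campaign, even though nothing about the data itself has changed.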

The data consumer perspective on data quality is often the root cause of the data silo problem, the bane of successful enterprise data management prevalent in most organizations, where each data consumer maintains their own data silo, customized to be fit for the purpose of their own use.  Organizational culture and politics also play significant roles since data consumers legitimately fear that losing their data silos would revert the organization to a one-size-fits-all data provider perspective on data quality.

So, clearly the fitness for the purpose of use definition of data quality is not without its own considerable challenges to overcome.


How does your organization define data quality?

As I stated at the beginning of this post, how data quality is defined has a significant impact on how data quality is perceived, measured, and managed.  I have witnessed the data quality efforts of an organization struggle with, and at times fail because of, either the danger of data myopia or the challenge of business relativity — or, more often than not, some combination of both.

Although some would define real-world alignment as data quality and fitness for the purpose of use as information quality, I have found adding the nuance of data versus information only further complicates an organization’s data quality discussions.

But for now, I will just conclude a rather long (sorry about that) post by asking for reader feedback on this perennial debate.

How does your organization define data quality?  Please share your thoughts and experiences by posting a comment below.


Related Posts

Data Quality: Quo Vadimus?

Data Quality and Miracle Exceptions

Plato’s Data

Once Upon a Time in the Data

The Most August Imagination

Data in the (Oscar) Wilde

The Idea of Order in Data

Hell is other people’s data

Song of My Data

You Say Potato and I Say Tater Tot


Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Redefining Data Quality — Guest Peter Perera discusses his proposed redefinition of data quality, as well as his perspective on the relationship of data quality to master data management and data governance.
  • Organizing for Data Quality — Guest Tom Redman (aka the “Data Doc”) discusses how your organization should approach data quality, including his call to action for your role in the data revolution.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Turning Data Silos into Glass Houses

Although data silos are denounced as inherently bad because they complicate the coordination of enterprise-wide business activities, they are often used to support some of those same business activities, so whether data silos are good or bad is a matter of perspective.  For example, data silos are bad when different business units are redundantly storing and maintaining their own private copies of the same data, but data silos are good when they are used to protect sensitive data that should not be shared.

Providing the organization with a single system of record, a single version of the truth, a single view, a golden copy, or a consolidated repository of trusted data has long been the anti-data-silo siren song of enterprise data warehousing (EDW), and more recently, of master data management (MDM).  Although these initiatives can provide significant business value, somewhat ironically, many data silos start with EDW or MDM data that was replicated and customized in order to satisfy the particular needs of an operational project or tactical initiative.  This customized data either becomes obsolete after the conclusion of its project or initiative, or it continues to be used because it is satisfying a business need that EDW and MDM are not.

One of the early goals of a new data governance program should be to provide the organization with a substantially improved view of how it is using its data — including data silos — to support its operational, tactical, and strategic business activities.

Data governance can help the organization catalog existing data sources, build a matrix of data usage and related business processes and technology, identify potential external reference sources to use for data enrichment, as well as help define the metrics that meaningfully measure data quality using business-relevant terminology.
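As a rough illustration of what a matrix of data usage might look like, here is a minimal sketch in Python.  The catalog entries, source names, and function names are all hypothetical, and a real data governance program would of course gather this information from the organization rather than a hard-coded list.

```python
from collections import defaultdict

# Hypothetical catalog entries: (data source, business process, technology).
usage = [
    ("EDW customer table", "quarterly reporting", "BI dashboard"),
    ("EDW customer table", "campaign targeting", "BI dashboard"),
    ("marketing spreadsheet", "campaign targeting", "Excel"),
    ("MDM customer hub", "order fulfillment", "ERP"),
]


def build_usage_matrix(entries):
    """Map each data source to the set of business processes it supports."""
    matrix = defaultdict(set)
    for source, process, _technology in entries:
        matrix[source].add(process)
    return dict(matrix)


def redundant_processes(entries):
    """Return processes served by more than one data source.

    These are candidates for a closer look, not proof of a bad data silo.
    """
    sources_by_process = defaultdict(set)
    for source, process, _technology in entries:
        sources_by_process[process].add(source)
    return {
        process: sorted(sources)
        for process, sources in sources_by_process.items()
        if len(sources) > 1
    }


matrix = build_usage_matrix(usage)
overlaps = redundant_processes(usage)
print(overlaps)
```

In this made-up example, campaign targeting is supported by both the data warehouse and a marketing spreadsheet, which is exactly the kind of overlap that transparency is meant to surface for discussion.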

The transparency provided by this combined analysis of the existing data, business, and technology landscape will provide a more comprehensive overview of enterprise data management problems, which will help the organization better evaluate any existing data and technology re-use and redundancies, as well as whether investing in new technology will be necessary.

Data governance can help topple data silos by first turning them into glass houses through transparency, empowering the organization to start throwing stones at those glass houses that must be eliminated.  And when data silos are allowed to persist, they should remain glass houses, clearly illustrating whether or not they have the business-justified reasons for continued use.


Related Posts

Data and Process Transparency

The Good Data

The Data Outhouse

Time Silos

Sharing Data

Single Version of the Truth

Beyond a “Single Version of the Truth”

The Quest for the Golden Copy

The Idea of Order in Data

Hell is other people’s data

Studying Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

On this episode, Gordon Hamilton and I discuss data quality key concepts, including those which we have studied in some of our favorite data quality books, and more important, those which we have implemented in our careers as data quality practitioners.

Gordon Hamilton is a Data Quality and Data Warehouse professional, whose 30 years’ experience in the information business encompasses many industries, including government, legal, healthcare, insurance and financial.  Gordon was most recently engaged in the healthcare industry in British Columbia, Canada, where he continues to advise several health care authorities on data quality and business intelligence platform issues.

Gordon Hamilton’s passion is to bring together:

  • Exposure of business rules through data profiling as recommended by Ralph Kimball.
  • Monitoring business rules in the EQTL (Extract-Quality-Transform-Load) pipeline leading into the data warehouse.
  • Managing the business rule violations through systemic and specific solutions within the statistical process control framework of Shewhart/Deming.
  • Researching how to sustain data quality metrics as the “fit for purpose” definitions change faster than the information product process can easily adapt.
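
As a rough illustration of the third point above, monitoring business rule violations within a Shewhart-style statistical process control framework could be sketched as a p-chart, which tracks the proportion of records violating a rule in each load.  This is a minimal sketch in Python and not a description of Gordon Hamilton’s actual method: the function name, the batch sizes, and the violation counts are hypothetical, and for simplicity the center line is estimated from all of the loads shown, whereas in practice it would usually come from a baseline period known to be stable.

```python
import math


def p_chart_signals(batches):
    """Flag loads whose violation rate falls outside 3-sigma control limits.

    batches: list of (records_checked, rule_violations), one pair per load.
    Returns a list of (index, proportion, lcl, ucl, out_of_control).
    """
    total_records = sum(n for n, _ in batches)
    total_violations = sum(v for _, v in batches)
    p_bar = total_violations / total_records  # center line

    results = []
    for i, (n, v) in enumerate(batches):
        # Standard error of a proportion for a load of size n.
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        lcl = max(0.0, p_bar - 3 * sigma)
        ucl = min(1.0, p_bar + 3 * sigma)
        p = v / n
        results.append((i, p, lcl, ucl, p < lcl or p > ucl))
    return results


# Hypothetical loads into the data warehouse: (records checked, violations).
loads = [(1000, 20), (1000, 22), (1000, 18), (1000, 21), (1000, 60), (1000, 19)]

results = p_chart_signals(loads)
flagged = [i for i, _p, _lcl, _ucl, out in results if out]
print(flagged)
```

With these made-up numbers, only the fifth load (index 4) falls outside the control limits, which would suggest investigating it for a specific cause rather than treating it as routine variation.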

Gordon Hamilton’s moniker of DQStudent on Twitter hints at his plan to dovetail his Lean Six Sigma skills and experience with the data quality foundations to improve the manufacture of the “information product” in today’s organizations.  Gordon is a member of IAIDQ, TDWI, and ASQ, as well as an enthusiastic reader of anything pertaining to data.

Gordon Hamilton recently became an Information Quality Certified Professional (IQCP), via the IAIDQ certification program.




Recommended Data Quality Books

By no means a comprehensive list, and listed in no particular order whatsoever, the following books were either discussed during this OCDQ Radio episode, or are otherwise recommended for anyone looking to study data quality and its related disciplines:


PLEASE NOTE: You are welcome to recommend additional books for this reading list by posting a comment below.


Related Posts

The Data Outhouse

Finding Data Quality

The Dichotomy Paradox, Data Quality and Zero Defects

The Data Quality Wager

What Data Quality Technology Wants

DAMA International

The Higher Education of Data Quality

International Data Quality

Big Data and Big Analytics

Organizing for Data Quality

Data Profiling Early and Often

Data Governance Star Wars

Master Data Management in Practice

The Art of Data Matching

Data Quality Pro

A Brave New Data World

The Data Outhouse

This is a screen capture of the results of last week’s unscientific data quality poll, where it was noted that in many organizations a data warehouse is the only system where data from numerous and disparate operational sources has been integrated into a single system of record containing fully integrated and historical data.  Although the rallying cry and promise of the data warehouse has long been that it will serve as the source for most of the enterprise’s reporting and decision support needs, many data warehouses simply get ignored by the organization, which continues to rely on its data silos and spreadsheets for reporting and decision making.

Based on my personal experience, the most common reason is that these big boxes of data are often built with little focus on the quality of the data being delivered.  However, since that’s just my opinion, I launched the poll and invited your comments.


Commendable Comments

Stephen Putman commented that data warehousing “projects are usually so large that if you approach them in a big-bang, OLTP management fashion, the foundational requirements of the thing change between inception and delivery.”

“I’ve seen very few data warehouses live up to the dream,” Dylan Jones commented.  “I’ve always found that silos still persisted after a warehouse introduction because the turnaround on adding new dimensions and reports to the warehouse/mart meant that the business users simply had no option.  I think data quality obviously plays a part.  The business side only need to be burnt once or twice before they lose faith.  That said, a data warehouse is one of the best enablers of data quality motivation, so without them a lot of projects simply wouldn’t get off the ground.”

“I just voted Outhouse too,” commented Paul Drenth, “because I agree with Dylan that the business side keeps using other systems out of disappointment in the trustworthiness of the data warehouse.  I agree that bad data quality plays a role in that, but more often it’s also a lack of discipline in the organization which causes a downward spiral of missing information, and thus deciding to keep other information in a separate or local system.  So I think usability of data warehouse systems still needs to be improved significantly, also by adding invisible or automatic data quality assurance, the business might gain more trust.”

“Great point Paul, useful addition,” Dylan responded.  “I think discipline is a really important aspect, this ties in with change management.  A lot of business people simply don’t see the sense of urgency for moving their reports to a warehouse so lack the discipline to follow the procedures.  Or we make the procedures too inflexible.  On one site I noticed that whenever the business wanted to add a new dimension or category it would take a 2-3 week turnaround to sign off.  For a financial services company this was a killer because they had simply been used to dragging another column into their Excel spreadsheets, instantly getting the data they needed.  If we’re getting into information quality for a second, then the dimension of presentation quality and accessibility become far more important than things like accuracy and completeness.  Sure a warehouse may be able to show you data going back 15 years and cross validates results with surrogate sources to confirm accuracy, but if the business can’t get it in a format they need, then it’s all irrelevant.”

“I voted Data Warehouse,” commented Jarrett Goldfedder, “but this is marked with an asterisk.  I would say that 99% of the time, a data warehouse becomes an outhouse, crammed with data that serves no purpose.  I think terminology is important here, though.  In my previous organization, we called the Data Warehouse the graveyard and the people who did the analytics were the morticians.  And actually, that’s not too much of a stretch considering our job was to do CSI-type investigations and autopsies on records that didn’t fit with the upstream information.  This did not happen often, but when it did, we were quite grateful for having historical records maintained.  IMHO, if the records can trace back to the existing data and will save the organization money in the long-run, then the warehouse has served its purpose.”

“I’m having a difficult time deciding,” Corinna Martinez commented, “since most of the ones I have seen are high quality data, but not enough of it and therefore are considered Data Outhouses.  You may want to include some variation in your survey that covers good data but not enough; and bad data but lots to shift through in order to find something.”

“I too have voted Outhouse,” Simon Daniels commented, “and have also seen beautifully designed, PhD-worthy data warehouse implementations that are fundamentally of no practical use.  Part of the reason for this I think, particularly from a marketing point-of-view, which is my angle, is that how the data will be used is not sufficiently thought through.  In seeking to create marketing selections, segmentation and analytics, how will the insight locked-up in the warehouse be accessed within the context of campaign execution and subsequent response analysis?  Often sitting in splendid isolation, the data warehouse doesn’t offer the accessibility needed in day-to-day activities.”

Thanks to everyone who voted and special thanks to everyone who commented.  As always, your feedback is greatly appreciated.


Can MDM and Data Governance save the Data Warehouse?

During last week’s Informatica MDM Tweet Jam, Dan Power explained that master data management (MDM) can deliver to the business “a golden copy of the data that they can trust” and I remarked how companies expected that from their data warehouse.

“Most companies had unrealistic expectations from data warehouses,” Power responded, “which ended up being expensive, read-only, and updated infrequently.  MDM gives them the capability to modify the data, publish to a data warehouse, and manage complex hierarchies.  I think MDM offers more flexibility than the typical data warehouse.  That’s why business intelligence (BI) on top of MDM (or more likely, BI on top of a data warehouse that draws data from MDM) is so popular.”

As a follow-up question, I asked if MDM should be viewed as a complement or a replacement for the data warehouse.  “Definitely a complement,” Power responded. “MDM fills a void in the middle between transactional systems and the data warehouse, and does things that neither can do to data.”

In his recent blog post How to Keep the Enterprise Data Warehouse Relevant, Winston Chen explains that the data quality deficiencies of most data warehouses could be aided by MDM and data governance, which “can define and enforce data policies for quality across the data landscape.”  Chen believes that the data warehouse “is in a great position to be the poster child for data governance, and in doing so, it can keep its status as the center of gravity for all things data in an enterprise.”

I agree with Power that MDM can complement the data warehouse, and I agree with Chen that data governance can make the data warehouse (as well as many other things) better.  So perhaps MDM and data governance can save the data warehouse.

However, I must admit that I remain somewhat skeptical.  The same challenges that have caused most data warehouses to become data outhouses are also fundamental threats to the success of MDM and data governance.


Thinking outside the house

Just as real outhouses were eventually made obsolete by indoor plumbing, I wonder if data outhouses will eventually be made obsolete, perhaps ironically by emerging trends of outdoor plumbing, i.e., open source, cloud computing, and software as a service (SaaS).

Many industry analysts are also advocating the evolution of data as a service (DaaS), where data is taken out of all of its houses, meaning that the answer to my poll question might be neither data warehouse nor data outhouse.

Although none of these trends obviates the need for data quality or alleviates the other significant challenges mentioned above, perhaps when it comes to data, we need to start thinking outside the house.


Related Posts

DQ-Poll: Data Warehouse or Data Outhouse?

Podcast: Data Governance is Mission Possible

Once Upon a Time in the Data

The Idea of Order in Data

Fantasy League Data Quality

Which came first, the Data Quality Tool or the Business Need?

Finding Data Quality

The Circle of Quality

TDWI World Conference Orlando 2010

Last week I attended the TDWI World Conference held November 7-12 in Orlando, Florida at the Loews Royal Pacific Resort.

As always, TDWI conferences offer a variety of full-day and half-day courses designed for professionals and taught in an objective, vendor-neutral manner by in-the-trenches practitioners who are well known in the industry.

In this blog post, I summarize a few key points from two of the courses I attended.  I used Twitter to help me collect my notes, and you can access the complete archive of my conference tweets on Twapper Keeper.


A Practical Guide to Analytics

Wayne Eckerson, author of the book Performance Dashboards: Measuring, Monitoring, and Managing Your Business, described the four waves of business intelligence:

  1. Reporting – What happened?
  2. Analysis – Why did it happen?
  3. Monitoring – What’s happening?
  4. Prediction – What will happen?

“Reporting is the jumping off point for analytics,” explained Eckerson, “but many executives don’t realize this.  The most powerful aspect of analytics is testing our assumptions.”  He went on to differentiate the two strains of analytics:

  1. Exploration and Analysis – Top-down and deductive, primarily uses query tools
  2. Prediction and Optimization – Bottom-up and inductive, primarily uses data mining tools

“A huge issue for predictive analytics is getting people to trust the predictions,” remarked Eckerson.  “Technology is the easy part, the hard part is selling the business benefits and overcoming cultural resistance within the organization.”

“The key is not getting the right answers, but asking the right questions,” he explained, quoting Ken Rudin of Zynga.

“Deriving insight from its unique information will always be a competitive advantage for every organization.”  He recommended the book Competing on Analytics: The New Science of Winning as a great resource for selling the business benefits of analytics.


Data Governance for BI Professionals

Jill Dyché, a partner and co-founder of Baseline Consulting, explained that data governance transcends business intelligence and other enterprise information initiatives such as data warehousing, master data management, and data quality.

“Data governance is the organizing framework,” explained Dyché, “for establishing strategy, objectives, and policies for corporate data.  Data governance is the business-driven policy making and oversight of corporate information.”

“Data governance is necessary,” remarked Dyché, “whenever multiple business units are sharing common, reusable data.”

“Data governance aligns data quality with business measures and acceptance, positions enterprise data issues as cross-functional, and ensures data is managed separately from its applications, thereby evolving data as a service (DaaS).”

In her excellent 2007 article Serving the Greater Good: Why Data Hoarding Impedes Corporate Growth, Dyché explained the need for “systemizing the notion that data – corporate asset that it is – belongs to everyone.”

“Data governance provides the decision rights around the corporate data asset.”


Related Posts

DQ-View: From Data to Decision

Podcast: Data Governance is Mission Possible

The Business versus IT—Tear down this wall!

MacGyver: Data Governance and Duct Tape

Live-Tweeting: Data Governance

Enterprise Data World 2010

Enterprise Data World 2009

TDWI World Conference Chicago 2009

Light Bulb Moments at DataFlux IDEAS 2010

DataFlux IDEAS 2009

DQ-Poll: Data Warehouse or Data Outhouse?

In many organizations, a data warehouse is the only system where data from numerous and disparate operational sources has been integrated into a single repository of enterprise data.

The rapid delivery of a single system of record containing fully integrated and historical data to be used as the source for most of the enterprise’s reporting and decision support needs has long been the rallying cry and promise of the data warehouse.

However, I have witnessed beautifully architected, elegantly implemented, and diligently maintained data warehouses simply get ignored by the organization, which continues to rely on its data silos and spreadsheets for reporting and decision making.

The most common reason is that these big boxes of data are often built with little focus on the quality of the data being delivered.

But that’s just my opinion based on my personal experience.  So let’s conduct an unscientific poll.


Additionally, please feel free to post a comment below and explain your vote or simply share your opinions and experiences.

Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?


Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.
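
A minimal Python sketch of this distinction follows. The class names, fields, and sample values are hypothetical and chosen only for illustration; they are not taken from any real fantasy league system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Player:
    """Master data: identifies and describes a thing (hypothetical fields)."""
    player_id: int
    first_name: str
    last_name: str


@dataclass(frozen=True)
class BattingEvent:
    """Transaction data: describes an event that happened on a given date."""
    player_id: int   # reference back to the master record
    game_date: date
    at_bats: int
    hits: int


# Made-up sample records.
player = Player(player_id=1, first_name="Sample", last_name="Batter")
events = [
    BattingEvent(1, date(2009, 4, 6), at_bats=4, hits=2),
    BattingEvent(1, date(2009, 4, 7), at_bats=3, hits=1),
]

# The master record stays fixed while transactions accumulate against it.
total_hits = sum(e.hits for e in events if e.player_id == player.player_id)
```

The point of the sketch is only that transaction rows point back to a master record rather than repeating its descriptive attributes.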

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects with many characteristics including the following:

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independently of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield.  Second base and shortstop are also infield positions, and more specifically middle infield.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
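
The position taxonomy described in the last bullet could be sketched in Python as a simple mapping. The dictionary name and helper function are hypothetical, and only the four infield positions mentioned above are included.

```python
# Each position maps to the groups it belongs to, from general to specific.
POSITION_TAXONOMY = {
    "first base": ["infield", "corner infield"],
    "third base": ["infield", "corner infield"],
    "second base": ["infield", "middle infield"],
    "shortstop": ["infield", "middle infield"],
}


def positions_in(group):
    """Return the individual positions that roll up to the given group."""
    return sorted(
        position
        for position, groups in POSITION_TAXONOMY.items()
        if group in groups
    )


corner_infield = positions_in("corner infield")
all_infield = positions_in("infield")
```

A real taxonomy would also need the outfield, pitcher, catcher, and non-fielding roles, which are left out here to keep the sketch short.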


Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.
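
For readers unfamiliar with Type 2 slowly changing dimensions, here is a minimal in-memory Python sketch of the idea: a changed attribute expires the current row and adds a new version, so history is preserved. The column names, the open-ended date of 9999-12-31, and the sample uniform numbers are assumptions for illustration, not the actual warehouse design.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed convention for "still current"


def apply_type2_change(rows, natural_key, new_attrs, effective):
    """Expire the current row for natural_key and append a new version."""
    next_key = max((r["surrogate_key"] for r in rows), default=0) + 1
    for row in rows:
        if row["natural_key"] == natural_key and row["valid_to"] == OPEN_END:
            if row["attrs"] == new_attrs:
                return rows  # nothing changed, keep the current version
            row["valid_to"] = effective  # close out the old version
    rows.append({
        "surrogate_key": next_key,
        "natural_key": natural_key,
        "attrs": dict(new_attrs),
        "valid_from": effective,
        "valid_to": OPEN_END,
    })
    return rows


def as_of(rows, natural_key, when):
    """Return the version of natural_key that was valid on the date `when`."""
    for row in rows:
        if (row["natural_key"] == natural_key
                and row["valid_from"] <= when < row["valid_to"]):
            return row
    return None


player_dim = []
apply_type2_change(player_dim, "P001", {"uniform_number": 7}, date(2009, 4, 1))
apply_type2_change(player_dim, "P001", {"uniform_number": 21}, date(2009, 7, 31))
```

Looking up `as_of(player_dim, "P001", date(2009, 5, 1))` returns the first version, which is what makes full history and rollbacks possible.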

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 
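
A factless fact table can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are hypothetical; the sketch only shows that the roster table holds dimension keys and no numeric measures, yet still answers questions such as roster size by date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE dim_pro_team (team_key INTEGER PRIMARY KEY, team_name TEXT);
    CREATE TABLE dim_player (player_key INTEGER PRIMARY KEY, player_name TEXT);
    -- Factless fact table: foreign keys only, no measures.
    CREATE TABLE fact_pro_team_roster (
        date_key INTEGER REFERENCES dim_date(date_key),
        team_key INTEGER REFERENCES dim_pro_team(team_key),
        player_key INTEGER REFERENCES dim_player(player_key),
        PRIMARY KEY (date_key, team_key, player_key)
    );
""")
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?)",
    [(20090406, "2009-04-06"), (20090407, "2009-04-07")],
)
conn.executemany("INSERT INTO dim_pro_team VALUES (?, ?)", [(1, "Team A")])
conn.executemany(
    "INSERT INTO dim_player VALUES (?, ?)",
    [(10, "Player X"), (11, "Player Y")],
)
# One row per day a player is on a roster; the row's existence is the fact.
conn.executemany(
    "INSERT INTO fact_pro_team_roster VALUES (?, ?, ?)",
    [(20090406, 1, 10), (20090407, 1, 10), (20090407, 1, 11)],
)

roster_sizes = conn.execute("""
    SELECT d.calendar_date, COUNT(*)
    FROM fact_pro_team_roster f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.calendar_date
    ORDER BY d.calendar_date
""").fetchall()
```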

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players.  A separate Fantasy Player dimension would have redundantly stored multiple instances of the same professional player, one for each fantasy team he played for, and would have required using Fantasy League and Fantasy Team as snowflaked dimensions.

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
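
The rolling snapshots can be illustrated with a short Python sketch. It assumes daily rows shaped as dictionaries with a date and a statistic, and it assumes the window includes the as-of date itself; both are illustrative choices rather than a description of the actual tables.

```python
from datetime import date, timedelta


def rolling_total(daily_rows, as_of, days, stat):
    """Sum `stat` over the `days` calendar days ending on `as_of` (inclusive)."""
    start = as_of - timedelta(days=days - 1)
    return sum(r[stat] for r in daily_rows if start <= r["date"] <= as_of)


# Made-up data: one hit per day for 21 consecutive days starting June 1.
daily = [
    {"date": date(2009, 6, 1) + timedelta(days=i), "hits": 1}
    for i in range(21)
]

last_7 = rolling_total(daily, date(2009, 6, 21), 7, "hits")
last_14 = rolling_total(daily, date(2009, 6, 21), 14, "hits")
last_21 = rolling_total(daily, date(2009, 6, 21), 21, "hits")
```

Recomputing from the daily rows like this is the simplest approach; incrementing a stored snapshot, as described above, trades that simplicity for faster queries.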

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.
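
One practical consequence of defining batting average as a ratio is that it cannot be averaged across days; it has to be recomputed from summed hits and at bats. The Python sketch below uses made-up numbers and a hypothetical helper name to show the difference.

```python
def batting_average(hits, at_bats):
    """Ratio of hits to at bats; None when there are no at bats."""
    return hits / at_bats if at_bats else None


# Made-up daily lines as (hits, at_bats): 1-for-4, then 3-for-3.
daily_lines = [(1, 4), (3, 3)]

total_hits = sum(h for h, _ in daily_lines)
total_at_bats = sum(ab for _, ab in daily_lines)

# Correct: recompute the ratio from the additive components.
combined = batting_average(total_hits, total_at_bats)

# Misleading: averaging the daily ratios weights each day equally.
mean_of_daily = sum(batting_average(h, ab) for h, ab in daily_lines) / len(daily_lines)
```

Here the combined average is 4 hits in 7 at bats, while the mean of the two daily averages comes out higher because the 3-for-3 day counts as much as the 1-for-4 day.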

Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created were defined with widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.


Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 
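
Validation against an external source amounts to a reconciliation check. The following Python sketch assumes both sides can be reduced to a mapping from player to a season total; the function name and sample values are hypothetical.

```python
def reconcile(warehouse_totals, reference_totals):
    """Return {player: (warehouse_value, reference_value)} for every mismatch."""
    mismatches = {}
    for player in sorted(set(warehouse_totals) | set(reference_totals)):
        mine = warehouse_totals.get(player)    # None if missing from warehouse
        theirs = reference_totals.get(player)  # None if missing from reference
        if mine != theirs:
            mismatches[player] = (mine, theirs)
    return mismatches


# Made-up home run totals from the warehouse and from an external website.
warehouse = {"Player X": 30, "Player Y": 12}
website = {"Player X": 30, "Player Y": 13, "Player Z": 5}

differences = reconcile(warehouse, website)
```

An empty result means the two sources agree; it does not prove that either one is correct, which is why checking against a second independent website is worthwhile.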

This is a significant advantage for fantasy sports – there are numerous external data sources that can be used for validation freely available online.  Of course, this wasn't always the case. 

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.


Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines all of the data quality dimensions, which include the following most applicable to fantasy sports:

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.
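
Two of these dimensions lend themselves to simple ratios. The Python sketch below is one possible way to quantify Accuracy and Data Coverage, not a formula from McGilvray's book; the function names and sample data are assumptions.

```python
def coverage(observed_ids, population_ids):
    """Share of the population of interest that is present in the data."""
    population = set(population_ids)
    if not population:
        return None
    return len(population & set(observed_ids)) / len(population)


def accuracy(observed, reference):
    """Share of shared records whose value matches the authoritative source."""
    shared = set(observed) & set(reference)
    if not shared:
        return None
    correct = sum(1 for key in shared if observed[key] == reference[key])
    return correct / len(shared)


# Made-up stolen base totals: my data versus an authoritative reference.
observed = {"P1": 30, "P2": 12, "P3": 5}
reference = {"P1": 30, "P2": 15, "P3": 5, "P4": 9}

coverage_score = coverage(observed, reference)
accuracy_score = accuracy(observed, reference)
```

In this example three of the four reference players are present, and two of those three have matching values.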



I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”

TDWI World Conference Chicago 2009

Founded in 1995, TDWI (The Data Warehousing Institute™) is the premier educational institute for business intelligence and data warehousing that provides education, training, certification, news, and research for executives and information technology professionals worldwide.  TDWI conferences always offer a variety of full-day and half-day courses taught in an objective, vendor-neutral manner.  The courses are designed for professionals and taught by in-the-trenches practitioners who are well known in the industry.


TDWI World Conference Chicago 2009 was held May 3-8 in Chicago, Illinois at the Hyatt Regency Hotel and was a tremendous success.  I attended as a Data Quality Journalist for the International Association for Information and Data Quality (IAIDQ).

I used Twitter to provide live reporting from the conference.  Here are my notes from the courses I attended: 


BI from Both Sides: Aligning Business and IT

Jill Dyché, CBIP, is a partner and co-founder of Baseline Consulting, a management and technology consulting firm that provides data integration and business analytics services.  Jill is responsible for delivering industry and client advisory services, is a frequent lecturer and writer on the business value of IT, and writes the excellent Inside the Biz blog.  She is the author of acclaimed books on the business value of information: e-Data: Turning Data Into Information With Data Warehousing and The CRM Handbook: A Business Guide to Customer Relationship Management.  Her latest book, written with Evan Levy, is Customer Data Integration: Reaching a Single Version of the Truth.

Course Quotes from Jill Dyché:

  • Five Critical Success Factors for Business Intelligence (BI):
    1. Organization - Build organizational structures and skills to foster a sustainable program
    2. Processes - Align both business and IT development processes that facilitate delivery of ongoing business value
    3. Technology - Select and build technologies that deploy information cost-effectively
    4. Strategy - Align information solutions to the company's strategic goals and objectives
    5. Information - Treat data as an asset by separating data management from technology implementation
  • Three Different Requirement Categories:
    1. What is the business need, pain, or problem?  What business questions do we need to answer?
    2. What data is necessary to answer those business questions?
    3. How do we need to use the resulting information to answer those business questions?
  • “Data warehouses are used to make business decisions based on data – so data quality is critical”
  • “Even companies with mature enterprise data warehouses still have data silos - each business area has its own data mart”
  • “Instead of pushing a business intelligence tool, just try to get people to start using data”
  • “Deliver a usable system that is valuable to the business and not just a big box full of data”


TDWI Data Governance Summit

Philip Russom is the Senior Manager of Research and Services at TDWI, where he oversees many of TDWI’s research-oriented publications, services, and events.  Prior to joining TDWI in 2005, he was an industry analyst covering BI at Forrester Research, as well as a contributing editor with Intelligent Enterprise and Information Management (formerly DM Review) magazines.

Summit Quotes from Philip Russom:

  • “Data Governance usually boils down to some form of control for data and its usage”
  • “Four Ps of Data Governance: People, Policies, Procedures, Process”
  • “Three Pillars of Data Governance: Compliance, Business Transformation, Business Integration”
  • “Two Foundations of Data Governance: Business Initiatives and Data Management Practices”
  • “Cross-functional collaboration is a requirement for successful Data Governance”


Becky Briggs, CBIP, CMQ/OE, is a Senior Manager and Data Steward for Airlines Reporting Corporation (ARC) and has 25 years of experience in data processing and IT - the last 9 in data warehousing and BI.  She leads the program team responsible for product, project, and quality management, business line performance management, and data governance/stewardship.

Summit Quotes from Becky Briggs:

  • “Data Governance is the act of managing the organization's data assets in a way that promotes business value, integrity, usability, security and consistency across the company”
  • Five Steps of Data Governance:
    1. Determine what data is required
    2. Evaluate potential data sources (internal and external)
    3. Perform data profiling and analysis on data sources
    4. Data Services - Definition, modeling, mapping, quality, integration, monitoring
    5. Data Stewardship - Classification, access requirements, archiving guidelines
  • “You must realize and accept that Data Governance is a program and not just a project”


Barbara Shelby is a Senior Software Engineer for IBM with over 25 years of experience holding positions of technical specialist, consultant, and line management.  Her global management and leadership positions encompassed network authentication, authorization application development, corporate business systems data architecture, and database development.

Summit Quotes from Barbara Shelby:

  • Four Common Barriers to Data Governance:
    1. Information - Existence of information silos and inconsistent data meanings
    2. Organization - Lack of end-to-end data ownership and organization cultural challenges
    3. Skill - Difficulty shifting resources from operational to transformational initiatives
    4. Technology - Business data locked in large applications and slow deployment of new technology
  • Four Key Decision Making Bodies for Data Governance:
    1. Enterprise Integration Team - Oversees the execution of CIO funded cross enterprise initiatives
    2. Integrated Enterprise Assessment - Responsible for the success of transformational initiatives
    3. Integrated Portfolio Management Team - Responsible for making ongoing business investment decisions
    4. Unit Architecture Review - Responsible for the IT architecture compliance of business unit solutions


Lee Doss is a Senior IT Architect for IBM with over 25 years of information technology experience.  He holds a patent for a process of aligning strategic capability for business transformation, and he has held various positions including strategy, design, development, and customer support for IBM networking software products.

Summit Quotes from Lee Doss:

  • Five Data Governance Best Practices:
    1. Create a sense of urgency that the organization can rally around
    2. Start small, grow fast...pick a few visible areas to set an example
    3. Sunset legacy systems (application, data, tools) as new ones are deployed
    4. Recognize the importance of organization culture…this will make or break you
    5. Always, always, always – Listen to your customers


Kevin Kramer is a Senior Vice President and Director of Enterprise Sales for UMB Bank and is responsible for development of sales strategy, sales tool development, and implementation of enterprise-wide sales initiatives.

Summit Quotes from Kevin Kramer:

  • “Without Data Governance, multiple sources of customer information can produce multiple versions of the truth”
  • “Data Governance helps break down organizational silos and shares customer data as an enterprise asset”
  • “Data Governance provides a roadmap that translates into best practices throughout the entire enterprise”


Kanon Cozad is a Senior Vice President and Director of Application Development for UMB Bank and is responsible for overall technical architecture strategy and oversees information integration activities.

Summit Quotes from Kanon Cozad:

  • “Data Governance identifies business process priorities and then translates them into enabling technology”
  • “Data Governance provides direction and Data Stewardship puts direction into action”
  • “Data Stewardship identifies and prioritizes applications and data for consolidation and improvement”


Jill Dyché, CBIP, is a partner and co-founder of Baseline Consulting, a management and technology consulting firm that provides data integration and business analytics services.  (For Jill's complete bio, please see above).

Summit Quotes from Jill Dyché:

  • “The hard part of Data Governance is the data”
  • “No data will be formally sanctioned unless it meets a business need”
  • “Data Governance focuses on policies and strategic alignment”
  • “Data Management focuses on translating defined policies into executable actions”
  • “Entrench Data Governance in the development environment”
  • “Everything is customer data – even product and financial data”


Data Quality Assessment - Practical Skills

Arkady Maydanchik is a co-founder of Data Quality Group, a recognized practitioner, author, and educator in the field of data quality and information integration.  Arkady's data quality methodology and breakthrough ARKISTRA technology were used to provide services to numerous organizations.  Arkady is the author of the excellent book Data Quality Assessment, a frequent speaker at various conferences and seminars, and a contributor to many journals and online publications.  Data quality curriculum by Arkady Maydanchik can be found at eLearningCurve.

Course Quotes from Arkady Maydanchik:

  • “Nothing is worse for data quality than desperately trying to fix it during the last few weeks of an ETL project”
  • “Quality of data after conversion is in direct correlation with the amount of knowledge about actual data”
  • “Data profiling tools do not do data profiling - it is done by data analysts using data profiling tools”
  • “Data Profiling does not answer any questions - it helps us ask meaningful questions”
  • “Data quality is measured by its fitness to the purpose of use – it's essential to understand how data is used”
  • “When data has multiple uses, there must be data quality rules for each specific use”
  • “Effective root cause analysis requires not stopping after the answer to your first question - Keep asking: Why?”
  • “The central product of a Data Quality Assessment is the Data Quality Scorecard”
  • “Data quality scores must be both meaningful to a specific data use and be actionable”
  • “Data quality scores must estimate both the cost of bad data and the ROI of data quality initiatives”


Modern Data Quality Techniques in Action - A Demonstration Using Human Resources Data

Gian Di Loreto formed Loreto Services and Technologies in 2004 from the client services division of Arkidata Corporation.  Loreto Services provides data cleansing and integration consulting services to Fortune 500 companies.  Gian is a classically trained scientist - he received his PhD in elementary particle physics from Michigan State University.

Course Quotes from Gian Di Loreto:

  • “Data Quality is rich with theory and concepts – however it is not an academic exercise, it has real business impact”
  • “To do data quality well, you must walk away from the computer and go talk with the people using the data”
  • “Undertaking a data quality initiative demands developing a deeper knowledge of the data and the business”
  • “Some essential data quality rules are ‘hidden’ and can only be discovered by ‘clicking around’ in the data”
  • “Data quality projects are not about systems working together - they are about people working together”
  • “Sometimes, data quality can be ‘good enough’ for source systems but not when integrated with other systems”
  • “Unfortunately, no one seems to care about bad data until they have it”
  • “Data quality projects are only successful when you understand the problem before trying to solve it”


Mark Your Calendar

TDWI World Conference San Diego 2009 - August 2-7, 2009.

TDWI World Conference Orlando 2009 - November 1-6, 2009.

TDWI World Conference Las Vegas 2010 - February 21-26, 2010.