Timeliness refers to the time frame within which data is expected to be accessible. Given the increasing demand for real-time data-driven decisions, timeliness is arguably the most important dimension of data quality.
Data Quality (DQ) Tips is an OCDQ regular segment. Each DQ-Tip is a clear and concise data quality pearl of wisdom.
“Data quality tools do not solve data quality problems—People solve data quality problems.”
This DQ-Tip came from the DataFlux IDEAS 2010 Assessing Data Quality Maturity workshop conducted by David Loshin, whose new book The Practitioner's Guide to Data Quality Improvement will be released next month.
Just like all technology, data quality tools are enablers. Data quality tools provide people with the capability for solving data quality problems, for which there are no fast and easy solutions. Although incredible advancements in technology continue, there are no Magic Beans for data quality.
And there never will be.
An organization’s data quality initiative can only be successful when people take on the challenge united by collaboration, guided by an effective methodology, and of course, enabled by powerful technology.
By far the most important variable in implementing successful and sustainable data quality improvements is acknowledging David’s sage advice: people—not tools—solve data quality problems.
A few weeks ago, David Loshin, whose new book The Practitioner's Guide to Data Quality Improvement will soon be released, wrote the excellent blog post First Cuts at Compliance, which examines a challenging aspect of regulatory compliance.
David uses a theoretical, but nonetheless very realistic, example of a new government regulation that requires companies to submit a report in order to be compliant. An associated government agency can fine companies that do not accurately report.
Therefore, it’s in the company’s best interest to submit a report because not doing so would raise a red flag, since it would make the company implicitly non-compliant. For the same reason, it’s in the government agency’s best interest to focus their attention on those companies that have not yet reported—since no checks for accuracy need to be performed on non-submitted reports.
David then raises an excellent question about the quality of that reported, but unverified, data, and shares a link to a real-world example where the verification was actually performed by an investigative reporter, who discovered significant discrepancies.
This blog post made me view the submitted report as a red herring, a literary device, quite common in mystery fiction, in which the author intentionally misleads the reader in order to build suspense or divert attention from important information.
Therefore, when faced with regulatory compliance, companies might conveniently choose a red herring over a red flag.
After all, it is definitely easier to submit an inaccurate report on time, which feigns compliance, than it is to submit an accurate report that might actually prove non-compliance. Even if the inaccuracies are detected—which is a big IF—then the company could claim that it was simply poor data quality—not actual non-compliance—and promise to resubmit an accurate report.
(Or as is apparently the case in the real-world example linked to in David's blog post, the company could provide the report data in a format not necessarily amenable to a straightforward verification of accuracy.)
The primary focus of data governance is the strategic alignment of people throughout the organization through the definition, and enforcement, of policies in relation to data access, data sharing, data quality, and effective data usage, all for the purposes of supporting critical business decisions and enabling optimal business performance.
Simply establishing these internal data governance policies is no easy task, just as passing a law that creates new government regulations can be extremely challenging.
However, without enforcement and compliance, policies and regulations are powerless to effect the real changes necessary.
This is where I have personally witnessed many data governance programs and regulatory compliance initiatives fail.
Red Flag or Red Herring?
Are you implementing data governance policies that raise red flags, not only for implicit, but also for explicit non-compliance?
Or are you instead establishing a system that will simply encourage the submission of unverified—or unverifiable—red herrings?
“A storm is brewing—a perfect storm of viral data, disinformation, and misinformation.”
These cautionary words (written by Timothy G. Davis, an Executive Director within the IBM Software Group) are from the foreword of the remarkable new book Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman.
“Viral data,” explains Fishman, “is a metaphor used to indicate that business-oriented data can exhibit qualities of a specific type of human pathogen: the virus. Like a virus, data by itself is inert. Data requires software (or people) for the data to appear alive (or actionable) and cause a positive, neutral, or negative effect.”
“Viral data is a perfect storm,” because as Fishman explains, it is “a perfect opportunity to miscommunicate with ubiquity and simultaneity—a service-oriented pandemic reaching all corners of the enterprise.”
“The antonym of viral data is trusted information.”
“Quality is a subjective term,” explains Fishman, “for which each person has his or her own definition.” Fishman goes on to quote from many of the published definitions of data quality, including a few of my personal favorites:
- David Loshin: “Fitness for use—the level of data quality determined by data consumers in terms of meeting or beating expectations.”
- Danette McGilvray: “The degree to which information and data can be a trusted source for any and/or all required uses. It is having the right set of correct information, at the right time, in the right place, for the right people to use to make decisions, to run the business, to serve customers, and to achieve company goals.”
- Thomas Redman: “Data are of high quality if those who use them say so. Usually, high-quality data must be both free of defects and possess features that customers desire.”
Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives. Starting from this foundation, information quality standards are customized to meet the subjective needs of each business unit and initiative. This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.
However, the enterprise-wide data quality standards must be understood as dynamic. Therefore, enforcing strict conformance to data quality standards can be self-defeating. On this point, Fishman quotes Joseph Juran: “conformance by its nature relates to static standards and specification, whereas quality is a moving target.”
Defining data quality is both an essential and challenging exercise for every enterprise. “While a succinct and holistic single-sentence definition of data quality may be difficult to craft,” explains Fishman, “an axiom that appears to be generally forgotten when establishing a definition is that in business, data is about things that transpire during the course of conducting business. Business data is data about the business, and any data about the business is metadata. First and foremost, the definition as to the quality of data must reflect the real-world object, concept, or event to which the data is supposed to be directly associated.”
“Data governance can be used as an overloaded term,” explains Fishman, and he quotes Jill Dyché and Evan Levy to explain that “many people confuse data quality, data governance, and master data management.”
“The function of data governance,” explains Fishman, “should be distinct and distinguishable from normal work activities.”
For example, although knowledge workers and subject matter experts are necessary to define the business rules for preventing viral data, according to Fishman, these are data quality tasks and not acts of data governance.
However, these data quality tasks must “subsequently be governed to make sure that all the requisite outcomes comply with the appropriate controls.”
Therefore, according to Fishman, “data governance is a function that can act as an oversight mechanism and can be used to enforce controls over data quality and master data management, but also over data privacy, data security, identity management, risk management, or be accepted in the interpretation and adoption of regulatory requirements.”
“There is a line between trustworthy information and viral data,” explains Fishman, “and that line is very fine.”
Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.
Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions. As the pathogen replicates, more and more decision-critical enterprise information will be compromised.
According to Fishman, enterprise data quality requires a multidisciplinary effort and a lifetime commitment to:
“Prevent viral data and preserve trusted information.”
For over 25 years, I have been playing fantasy league baseball and football. For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team. Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.
What does any of this have to do with data quality?
Master Data Management
In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”
In fantasy sports, this distinction is very easy to make:
- Master Data – data describing the real-world players on the roster of each fantasy team.
- Transaction Data – data describing the statistical events of the real-world games played.
In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”
In fantasy sports, Players and Teams are the master data objects with many characteristics including the following:
- Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number. Team attributes include name, owner, home city, and the name and seating capacity of their stadium.
- Definitions – Player and Team have both Professional and Fantasy definitions. Professional teams and players are real-world objects managed independently of fantasy sports. Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League). Therefore, Fantasy Team and Fantasy Player are the true master data objects. The distinction between professional and fantasy teams is simpler than between professional and fantasy players. Not every professional player will be used in fantasy sports (e.g. offensive linemen in football), and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).
- Roles – In baseball, the player roles are Batter, Pitcher, and Fielder. In football, the player roles are Offense, Defense and Special Teams. In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).
- Connections – Fantasy Players are connected to Fantasy Teams via a roster. On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball). These connections change throughout the season. Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).
- Taxonomies – Positions played are defined individually and organized into taxonomies. In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield. Second base and short stop are also infield positions, and more specifically middle infield. And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
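The master data characteristics above can be sketched in code. This is a minimal illustration, not any real fantasy sports system; all class and field names are my own invention, and the sample player is just an example value.

```python
from dataclasses import dataclass, field

# Hypothetical names for illustration -- not from any real fantasy sports API.
@dataclass
class ProfessionalPlayer:
    """Reference data from an external content provider (e.g. MLB)."""
    player_id: int
    first_name: str
    last_name: str
    birth_date: str
    uniform_number: int
    positions: list  # taxonomy codes, e.g. ["1B", "3B"] -> corner infield

@dataclass
class FantasyTeam:
    """A true master data object, managed within the fantasy league."""
    team_id: int
    name: str
    owner: str
    roster: list = field(default_factory=list)  # connections to player_ids

# The same professional player can appear on multiple fantasy rosters,
# which is why Fantasy Player (not Professional Player) is the master object.
pujols = ProfessionalPlayer(5, "Albert", "Pujols", "1980-01-16", 5, ["1B"])
team_a = FantasyTeam(1, "Harris Homers", "Jim")
team_b = FantasyTeam(2, "Dew Crew", "Pat")
team_a.roster.append(pujols.player_id)
team_b.roster.append(pujols.player_id)
```

The key design point the sketch makes is that the roster holds references (connections) to the shared professional player record rather than duplicating it.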
Combining a personal hobby with professional development, I built a fantasy baseball data warehouse. I downloaded master, reference, and transaction data from my fantasy league's website. I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.
My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team. All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.
For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).
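The Type 2 slowly changing dimension behavior described above can be sketched as follows. This is an illustrative in-memory version of the pattern, not the author's actual tables; the column names and sample rows are assumptions.

```python
from datetime import date

# Minimal Type 2 SCD sketch: each attribute change closes the current row
# (by setting its end_date) and appends a new version, preserving full
# history to support rollbacks.
def scd2_update(dim_rows, natural_key, new_attrs, effective):
    """Close the current row for natural_key and append a new version."""
    for row in dim_rows:
        if row["player_id"] == natural_key and row["end_date"] is None:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return dim_rows  # no attribute change -> no new version
            row["end_date"] = effective
    dim_rows.append({"player_id": natural_key, "start_date": effective,
                     "end_date": None, **new_attrs})
    return dim_rows

# Hypothetical example: a player traded mid-season.
dim = [{"player_id": 5, "team": "STL", "start_date": date(2009, 4, 1),
        "end_date": None}]
scd2_update(dim, 5, {"team": "LAA"}, date(2009, 7, 31))
# The dimension now holds two versions of player 5: one closed, one current.
```

Queries as of any past date can then join to the version whose start and end dates bracket that date.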
Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables. For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions.
The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players. A separate Fantasy Player dimension would have redundantly stored multiple instances of the same professional player (one for each fantasy team he played for) and would have required Fantasy League and Fantasy Team to be used as snowflaked dimensions.
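The idea of a factless fact table is easy to miss: its rows are nothing but dimension keys, recording that a relationship existed on a given date. A minimal sketch, with invented keys and dates:

```python
from datetime import date

# Illustrative factless fact table: each row is just a combination of
# dimension keys (date, professional team, player) -- there is no
# numeric measure to store. The event recorded is roster membership.
professional_team_roster = {
    (date(2009, 7, 4), "STL", 5),
    (date(2009, 7, 4), "STL", 50),
    (date(2009, 7, 5), "STL", 5),   # player 50 released on July 5
}

def on_roster(fact_table, day, team, player):
    """The membership question the factless fact table exists to answer."""
    return (day, team, player) in fact_table

print(on_roster(professional_team_roster, date(2009, 7, 5), "STL", 50))  # False
```

In the relational version, the same question is a simple existence query against the fact table's composite key.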
My base fact tables were daily transactions for Batting Statistics and Pitching Statistics. These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball.
The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics. This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues. Alternatively, I could have stored each fantasy league in a separate data mart.
Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams. Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only. Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
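The rolling 7/14/21-day snapshots described above can be sketched like this. The base fact data and field names are invented for illustration; a real implementation would aggregate in SQL against the daily base fact table.

```python
from collections import defaultdict
from datetime import date

# Sketch of rolling-window aggregates, assuming a base fact table keyed by
# (game_date, player_id) with a single measure (hits) for simplicity.
base_facts = {
    (date(2009, 7, 1), 5): 2,
    (date(2009, 7, 8), 5): 1,
    (date(2009, 7, 20), 5): 3,
}

def rolling_totals(facts, as_of, windows=(7, 14, 21)):
    """Accumulate per-player totals for each trailing window of days."""
    totals = {w: defaultdict(int) for w in windows}
    for (game_date, player_id), hits in facts.items():
        age = (as_of - game_date).days
        for w in windows:
            if 0 <= age < w:
                totals[w][player_id] += hits
    return totals

snap = rolling_totals(base_facts, date(2009, 7, 21))
# Player 5: 3 hits in the last 7 days, 4 in the last 14, 6 in the last 21.
```

Recomputing the snapshot each day from the base facts, as shown, trades load time for simplicity; incrementally adding the newest day and subtracting the day falling out of each window is the optimization the aggregate tables performed.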
Conformed facts were used in both the base and aggregate fact tables. In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century).
For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century. However, there are still statistics with multiple meanings. For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.
Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created have been defined with widely varying formulas. Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
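The metadata tables of statistic definitions can be sketched alongside the conformed facts themselves. The secondary average formula below is one common variant; as noted above, formulas for newer statistics vary by source, which is exactly why definitions were stored with the data.

```python
# Illustrative metadata table: definitions (with formulas where applicable)
# stored alongside the warehouse to avoid confusion over conformed facts.
stat_definitions = {
    "AVG":  "Batting average: H / AB",
    "SecA": "Secondary average (one common variant): "
            "(BB + (TB - H) + SB - CS) / AB",
}

def batting_average(hits, at_bats):
    """Ratio of hits to at bats, consistent since the late 19th century."""
    return hits / at_bats if at_bats else 0.0

def secondary_average(bb, tb, h, sb, cs, ab):
    """Extra bases, walks, and net steals per at bat (formula varies)."""
    return (bb + (tb - h) + sb - cs) / ab if ab else 0.0

avg = batting_average(186, 568)  # a .327 hitter, using made-up season totals
```

The same pattern covers the multiple-meaning statistics mentioned above: walks and strikeouts would carry separate definition entries for their batter and pitcher connotations.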
For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.
In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”
As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team. Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.
The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website. This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.” However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites.
This is a significant advantage for fantasy sports – there are numerous external data sources freely available online that can be used for validation. Of course, this wasn't always the case.
Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers. We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.
Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs.
I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves. I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.
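The kind of regression analysis described above can be sketched in a few lines. The player data and the perfect linear fit are invented for illustration; real projections would use many seasons, many predictors, and much noisier data.

```python
# Minimal ordinary least squares sketch for projecting fantasy points
# from a single statistic (home runs). All numbers are made up.
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Past seasons: home runs vs. fantasy points scored (illustrative values).
home_runs      = [18, 24, 30, 36]
fantasy_points = [310, 355, 400, 445]
a, b = fit_line(home_runs, fantasy_points)
projected = a + b * 30  # projected points for a 30-home-run season
```

Comparing the projected points of the 30-home-run first baseman against a similar projection for the .300-hitting, 100-run second baseman is then a straightforward draft-day decision.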
In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines all of the data quality dimensions, which include the following most applicable to fantasy sports:
- Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.
- Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.
- Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.
- Presentation Quality – A measure of how information is presented to and collected from those who utilize it. Format and appearance support appropriate use of the information.
- Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.
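Two of the dimensions above, timeliness and data coverage, lend themselves to simple measurement. A toy sketch for a feed of player statistics; the field names, thresholds, and sample records are my assumptions, not from McGilvray's book:

```python
from datetime import date

def coverage(records, expected_players):
    """Data Coverage: fraction of the population of interest present."""
    seen = {r["player_id"] for r in records}
    return len(seen & expected_players) / len(expected_players)

def timeliness(records, as_of, max_age_days=1):
    """Timeliness: share of records updated within the expected window."""
    fresh = [r for r in records
             if (as_of - r["updated"]).days <= max_age_days]
    return len(fresh) / len(records) if records else 0.0

# Hypothetical feed: one record is current, one is eleven days stale,
# and one expected player is missing entirely.
feed = [{"player_id": 5, "updated": date(2009, 7, 21)},
        {"player_id": 7, "updated": date(2009, 7, 10)}]
cov = coverage(feed, {5, 7, 9})                 # 2 of 3 expected players
fresh_share = timeliness(feed, date(2009, 7, 21))  # 1 of 2 records fresh
```

Accuracy, by contrast, cannot be computed from the feed alone: as the definition above notes, it requires an authoritative source of reference, which is exactly the role the external validation websites played.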
I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you. It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).
However, fantasy sports are more than just a hobby. They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.
So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:
“Are you ready for some data quality?”
Enterprise Data World is the business world’s most comprehensive vendor-neutral educational event about data and information management. This year’s program was bigger than ever before, with more sessions, more case studies, and more can’t-miss content. With 200 hours of in-depth tutorials, hands-on workshops, practical sessions and insightful keynotes, the conference was a tremendous success. Congratulations and thanks to Tony Shaw, Maya Stosskopf and the entire Wilshire staff.
I attended Enterprise Data World 2009 as a member of the Iowa Chapter of DAMA and as a Data Quality Journalist for the International Association for Information and Data Quality (IAIDQ).
I used Twitter to provide live reporting from the sessions that I was attending.
I wish that I could have attended every session, but here are some highlights from ten of my favorites:
8 Ways Data is Changing Everything
Keynote by Stephen Baker from BusinessWeek.
Quotes from the keynote:
- "Data is changing how we understand ourselves and how we understand our world"
- "Predictive data mining is about the mathematical modeling of humanity"
- "Anthropologists are looking at social networking (e.g. Twitter, Facebook) to understand the science of friendship"
Master Data Management: Proven Architectures, Products and Best Practices
Tutorial by David Loshin from Knowledge Integrity.
Quotes from the tutorial:
- "Master Data are the core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies"
- "Master Data Management (MDM) provides a unified view of core data subject areas (e.g. Customers, Products)"
- "With MDM, it is important not to over-invest and under-implement - invest in and implement only what you need"
Master Data Management: Ignore the Hype and Keep the Focus on Data
Quotes from the case study:
- "The most important thing about Master Data Management (MDM) is improving business processes"
- "80% of any enterprise implementation should be the testing phase"
- "MDM Data Quality (DQ) Challenge: Any % wrong means you’re 100% certain you’re not always right"
- "MDM DQ Solution: Re-design applications to ensure the ‘front-door’ protects data quality"
- "Technology is critical, however thinking through the operational processes is more important"
A Case of Usage: Working with Use Cases on Data-Centric Projects
Case Study by Susan Burk from IBM.
Quotes from the case study:
- "Use Case is a sequence of actions performed to yield a result of observable business value"
- "The primary focus of data-centric projects is data structure, data delivery and data quality"
- "Don’t like use cases? – ok, call them business acceptance criteria – because that’s what a use case is"
Crowdsourcing: People are Smart, When Computers are Not
Session by Sharon Chiarella from Amazon Web Services.
Quotes from the session:
- "Crowdsourcing is outsourcing a task typically performed by employees to a general community of people"
- "Crowdsourcing eliminates over-staffing, lowers costs and reduces work turnaround time"
- "An excellent example of crowdsourcing is open source software development (e.g. Linux)"
Improving Information Quality using Lean Six Sigma Methodology
Session by Atul Borkar and Guillermo Rueda from Intel.
Quotes from the session:
- "Information Quality requires a structured methodology in order to be successful"
- Lean Six Sigma Framework: DMAIC – Define, Measure, Analyze, Improve, Control:
- Define = Describe the challenge, goal, process and customer requirements
- Measure = Gather data about the challenge and the process
- Analyze = Use hypothesis and data to find root causes
- Improve = Develop, implement and refine solutions
- Control = Plan for stability and measurement
Universal Data Quality: The Key to Deriving Business Value from Corporate Data
Session by Stefanos Damianakis from Netrics.
Quotes from the session:
- "The information stored in databases is NEVER perfect, consistent and complete – and it never can be!"
- "Gartner reports that 25% of critical data within large businesses is somehow inaccurate or incomplete"
- "Gartner reports that 50% of implementations fail due to lack of attention to data quality issues"
- "A powerful approach to data matching is the mathematical modeling of human decision making"
- "The greatest advantage of mathematical modeling is that there are no data matching rules to build and maintain"
Defining a Balanced Scorecard for Data Management
Seminar by C. Lwanga Yonke, a founding member of the International Association for Information and Data Quality (IAIDQ).
Quotes from the seminar:
- "Entering the same data multiple times is like paying the same invoice multiple times"
- "Good metrics help start conversations and turn strategy into action"
- Good metrics have the following characteristics:
- Business Relevance
- Clarity of Definition
- Trending Capability (i.e. metric can be tracked over time)
- Easy to aggregate and roll-up to a summary
- Easy to drill-down to the details that comprised the measurement
Closing Panel: Data Management’s Next Big Thing!
Quotes from Panelist Peter Aiken from Data Blueprint:
- Capability Maturity Levels:
- "Most companies are at a capability maturity level of (1) Initial or (2) Repeatable"
- "Data should be treated as a durable asset"
Quotes from Panelist Noreen Kendle from Burton Group:
- "A new age for data and data management is on the horizon – a perfect storm is coming"
- "The perfect storm is being caused by massive data growth and software as a service (i.e. cloud computing)"
- "Always remember that you can make lemonade from lemons – the bad in life can be turned into something good"
Quotes from Panelist Karen Lopez from InfoAdvisors:
- "If you keep using the same recipe, then you keep getting the same results"
- "Our biggest problem is not technical in nature - we simply need to share our knowledge"
- "Don’t be a dinosaur! Adopt a ‘go with what is’ philosophy and embrace the future!"
Quotes from Panelist Eric Miller from Zepheira:
- "Applications should not be ON The Web, but OF The Web"
- "New Acronym: LED – Linked Enterprise Data"
- "Semantic Web is the HTML of DATA"
Quotes from Panelist Daniel Moody from University of Twente:
- "Unified Modeling Language (UML) was the last big thing in software engineering"
- "The next big thing will be ArchiMate, which is a unified language for enterprise architecture modeling"
Mark Your Calendar
Enterprise Data World 2010 will take place in San Francisco, California at the Hilton San Francisco on March 14-18, 2010.