For over 25 years, I have been playing fantasy league baseball and football. For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team. Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.
What does any of this have to do with data quality?
Master Data Management
In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”
In fantasy sports, this distinction is very easy to make:
- Master Data – data describing the real-world players on the roster of each fantasy team.
- Transaction Data – data describing the statistical events of the real-world games played.
In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”
In fantasy sports, Players and Teams are the master data objects with many characteristics including the following:
- Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number. Team attributes include name, owner, home city, and the name and seating capacity of their stadium.
- Definitions – Player and Team have both Professional and Fantasy definitions. Professional teams and players are real-world objects managed independent of fantasy sports. Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League). Therefore, Fantasy Team and Fantasy Player are the true master data objects. The distinction between professional and fantasy teams is simpler than between professional and fantasy players. Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).
- Roles – In baseball, the player roles are Batter, Pitcher, and Fielder. In football, the player roles are Offense, Defense and Special Teams. In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).
- Connections – Fantasy Players are connected to Fantasy Teams via a roster. On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball). These connections change throughout the season. Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).
- Taxonomies – Positions played are defined individually and organized into taxonomies. In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield. Second base and short stop are also infield positions, and more specifically middle infield. And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
Combining a personal hobby with professional development, I built a fantasy baseball data warehouse. I downloaded master, reference, and transaction data from my fantasy league's website. I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.
My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team. All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.
For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).
Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables. For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions.
The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players since a Fantasy Player dimension would redundantly store multiple instances of the same professional player for each fantasy team he played for, as well as using Fantasy League and Fantasy Team as snowflaked dimensions.
My base fact tables were daily transactions for Batting Statistics and Pitching Statistics. These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball.
The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics. This was necessary to accommodate for the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues. Alternatively, I could have stored each fantasy league in a separate data mart.
Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams. Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only. Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
Conformed facts were used in both the base and aggregate fact tables. In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century).
For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century. However, there are still statistics with multiple meanings. For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.
Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created have been defined with widely varying formulas. Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.
In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”
As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team. Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.
The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website. This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.” However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites.
This is a significant advantage for fantasy sports – there are numerous external data sources that can be used for validation freely available online. Of course, this wasn't always the case.
Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers. We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.
Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs.
I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves. I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.
In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines all of the data quality dimensions, which include the following most applicable to fantasy sports:
- Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.
- Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.
- Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.
- Presentation Quality – A measure of how information is presented to and collected from those who utilize it. Format and appearance support appropriate use of the information.
- Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.
I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you. It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).
However, fantasy sports are more than just a hobby. They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.
So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:
“Are you ready for some data quality?”