One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources.
Data profiling is often performed during a data quality assessment. It involves much more than reviewing the output generated by a data profiling tool, and a data quality assessment obviously involves much more than just data profiling.
However, in this post I want to focus on some of the benefits of using a data profiling tool.
Freq'ing Awesome Analysis
Data profiling can help you perform essential analysis such as:
- Verifying data matches the metadata that describes it
- Identifying missing values
- Identifying potential default values
- Identifying potential invalid values
- Checking data formats for inconsistencies
- Preparing meaningful questions to ask subject matter experts
Data profiling can also help you with many of the other aspects of domain, structural, and relational integrity, as well as with determining functional dependencies, identifying redundant storage, and evaluating other important data architecture considerations.
How can a data profiling tool help you? Let me count the ways
Data profiling tools provide counts and percentages for each field that summarize its content characteristics such as:
- NULL – count of the number of records with a NULL value
- Missing – count of the number of records with a missing value (i.e., a non-NULL absence of data, e.g., character spaces)
- Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
- Completeness – percentage calculated as Actual divided by the total number of records
- Cardinality – count of the number of distinct actual values
- Uniqueness – percentage calculated as Cardinality divided by the total number of records
- Distinctness – percentage calculated as Cardinality divided by Actual
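The counts and percentages above can be sketched in a few lines of code. This is a minimal illustration in plain Python, not how any particular profiling tool works internally; treating whitespace-only strings as "missing" is an assumption (real tools let you configure what counts as missing):

```python
def profile_field(values):
    """Summarize one field's content characteristics."""
    total = len(values)
    # NULL: absent value (represented here as Python's None)
    nulls = sum(1 for v in values if v is None)
    # Missing: non-NULL absence of data, e.g. character spaces or empty string
    missing = sum(1 for v in values
                  if isinstance(v, str) and v.strip() == "")
    actual = total - nulls - missing
    actual_values = [v for v in values
                     if v is not None
                     and not (isinstance(v, str) and v.strip() == "")]
    cardinality = len(set(actual_values))
    return {
        "NULL": nulls,
        "Missing": missing,
        "Actual": actual,
        "Completeness": actual / total if total else 0.0,        # Actual / total records
        "Cardinality": cardinality,
        "Uniqueness": cardinality / total if total else 0.0,     # Cardinality / total records
        "Distinctness": cardinality / actual if actual else 0.0, # Cardinality / Actual
    }

# Example: one NULL, one missing (spaces only), and a repeated value
stats = profile_field(["A", "B", "A", None, "   "])
print(stats)  # Actual = 3, Cardinality = 2, Completeness = 0.6
```

Note how Completeness and Uniqueness are measured against all records, while Distinctness is measured against actual values only, which is why the three percentages diverge as soon as a field has NULL or missing values.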
The absence of data can be represented in many different ways, with NULL being the most common for relational database columns. However, character fields can contain all spaces or an empty string, and numeric fields can contain all zeroes. Consistently representing the absence of data is a common data quality standard.
Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique. Required non-key fields may often be 100% complete but a low cardinality could indicate the presence of potential default values.
Distinctness can be useful in evaluating the potential for duplicate records. For example, a Tax ID field may be less than 100% complete (i.e., not every record has one) and therefore also less than 100% unique (i.e., it cannot be considered a potential single primary key because it cannot be used to uniquely identify every record). If the Tax ID field is also less than 100% distinct (i.e., some distinct actual values occur on more than one record), then this could indicate the presence of potential duplicate records.
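Turning a less-than-100% distinctness figure into a concrete list of suspects is straightforward: surface the actual values that occur on more than one record. A quick sketch (the Tax ID values are made up for illustration):

```python
from collections import Counter

def potential_duplicates(values):
    """Return actual (non-NULL, non-missing) values found on more than one record."""
    actual = [v for v in values
              if v not in (None, "") and not str(v).isspace()]
    counts = Counter(actual)
    return {value: n for value, n in counts.items() if n > 1}

tax_ids = ["123-45-6789", None, "987-65-4321", "123-45-6789", ""]
print(potential_duplicates(tax_ids))  # {'123-45-6789': 2}
```

Each repeated value is only a candidate duplicate; whether the two records truly describe the same entity is a question for the subject matter experts.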
Data profiling tools will often generate many other useful summary statistics for each field including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).
Show Me the Value (or the Format)
A frequency distribution of the unique formats found in a field is sometimes more useful than the unique values.
A frequency distribution of unique values is useful for:
- Fields with an extremely low cardinality (i.e. indicating potential default values)
- Fields with a relatively low cardinality (e.g. gender code and source system code)
- Fields with a relatively small number of valid values (e.g. state abbreviation and country code)
A frequency distribution of unique formats is useful for:
- Fields expected to contain a single data type and/or length (e.g. integer surrogate key or ZIP+4 add-on code)
- Fields with a relatively limited number of valid formats (e.g. telephone number and birth date)
- Fields with free-form values and a high cardinality (e.g. customer name and postal address)
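A format frequency distribution is typically built by mapping each character to a character class and then counting the resulting masks. This is a simplified sketch of that idea (the mask alphabet, 9 for digits and A for letters, is a common convention but varies by tool):

```python
from collections import Counter

def format_mask(value):
    """Map digits to 9 and letters to A; keep punctuation and spaces as-is."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

# Hypothetical telephone number field with inconsistent formatting
phones = ["555-123-4567", "555-987-6543", "(555) 123 4567", "5551234567"]
print(Counter(format_mask(p) for p in phones).most_common())
# The dominant mask 999-999-9999 stands out, and the outliers are easy to spot
```

Even for a high-cardinality field, the number of distinct masks is usually small, which is exactly why formats are often more reviewable than values.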
Cardinality plays a major role in deciding whether to review values or formats, since it is much easier to review all of the values when there are relatively few of them. Alternatively, the review of high-cardinality fields can be limited to the most frequently occurring values.
Some fields can also be alternatively analyzed using partial values (e.g. birth year extracted from birth date) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers).
Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, given name, family name, title).
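Parsing and classifying tokens within a free-form name field can be sketched with simple lookup lists. The salutation and suffix sets below are illustrative, not exhaustive; real parsing relies on much richer reference data:

```python
SALUTATIONS = {"mr", "mrs", "ms", "dr"}       # illustrative lookup list
SUFFIXES = {"jr", "sr", "ii", "iii", "md", "phd"}

def name_format(name):
    """Classify each token; unmatched tokens are assumed to be name parts."""
    labels = []
    for token in name.split():
        t = token.strip(".,").lower()
        if t in SALUTATIONS:
            labels.append("Salutation")
        elif t in SUFFIXES:
            labels.append("Suffix")
        else:
            labels.append("Name")
    return " + ".join(labels)

print(name_format("Dr. Jane Smith, PhD"))  # Salutation + Name + Name + Suffix
```

A frequency distribution of these classified formats quickly reveals how consistently names were captured, without having to read thousands of individual values.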
Understanding your data is essential to using it effectively and improving its quality. In order to achieve these goals, there is simply no substitute for data analysis.
A data profiling tool can help you by automating some of the grunt work needed to begin this analysis. However, it is important to remember that the analysis itself cannot be automated: you need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more importantly, translate your analysis into meaningful reports and questions to share with the rest of the project team. Well-performed data profiling is a highly interactive and iterative process.
Data profiling is typically one of the first tasks performed on a data quality project. This is especially true when data is made available before business requirements are documented and before subject matter experts are available to discuss usage, relevancy, standards, and the metrics for measuring and improving data quality, all of which are necessary to progress from profiling your data to performing a full data quality assessment. However, their absence is not an acceptable excuse for delaying data profiling.
Therefore, grab your favorite caffeinated beverage, settle into your most comfortable chair, roll up your sleeves and...
Get your data freq on!