The Data Quality Goldilocks Zone

In astronomy, the habitable region of space, where conditions are favorable for life as it is found on Earth, is referred to as the "Goldilocks Zone" because it is neither too close to the sun (too hot) nor too far away from it (too cold), but "just right."


In data quality, there is also a Goldilocks Zone, which is the habitable region of time when project conditions are favorable for success.


Too many projects fail because of lofty expectations, unmanaged scope creep, and the unrealistic belief that data quality problems can be permanently “fixed” rather than requiring eternal vigilance. To be successful, a data quality project must be understood as an iterative process. Return on investment (ROI) is achieved by targeting well-defined objectives that deliver small, incremental returns, which build momentum toward larger success over time.


Data quality projects are easy to start, even easier to end in failure, and often lack the decency to at least fail quickly. As with any complex problem, there is no fast and easy solution for data quality.


Projects are launched to understand and remediate the poor data quality that is negatively impacting decision-critical enterprise information. Data-driven problems require data-driven solutions. At the point in the project lifecycle when the team must decide whether the current iteration is ready for implementation, it has reached the Data Quality Goldilocks Zone, which is measured not by proximity to the sun but by proximity to full data remediation, otherwise known as perfection.


The obvious problem is that perfection is impossible. An obsessive-compulsive quest to find and fix every data quality problem is a laudable pursuit but ultimately a self-defeating cause. Data quality problems can be insidious, and even the best data remediation process will still produce exceptions. As a best practice, your process should be designed to identify and report exceptions when they occur. In fact, many implementations include logic to suspend exceptions for manual review and correction.
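
As a rough illustration of that best practice, here is a minimal Python sketch of a remediation step that reports exceptions and suspends them for manual review instead of failing or silently dropping them. Everything in it is hypothetical and chosen only for illustration: the `standardize_country` rule, the record layout, and the names `remediate` and `exception_report` are assumptions, not part of any particular data quality tool.

```python
from dataclasses import dataclass, field


@dataclass
class RemediationResult:
    cleaned: list = field(default_factory=list)    # records the rule could fix
    suspended: list = field(default_factory=list)  # (record, reason) pairs for manual review


def standardize_country(record):
    """Hypothetical remediation rule: map known country spellings to a code."""
    known = {"usa": "US", "united states": "US", "u.s.": "US", "canada": "CA"}
    raw = (record.get("country") or "").strip().lower()
    if raw not in known:
        # The rule cannot decide, so it raises instead of guessing.
        raise ValueError(f"unrecognized country value: {record.get('country')!r}")
    return {**record, "country": known[raw]}


def remediate(records, rule=standardize_country):
    """Apply the rule to every record; suspend the ones it cannot handle."""
    result = RemediationResult()
    for record in records:
        try:
            result.cleaned.append(rule(record))
        except ValueError as exc:
            result.suspended.append((record, str(exc)))
    return result


def exception_report(result):
    """Summarize how many records were suspended in this run."""
    total = len(result.cleaned) + len(result.suspended)
    rate = len(result.suspended) / total if total else 0.0
    return {"total": total, "suspended": len(result.suspended), "exception_rate": rate}
```

Calling `remediate` on a batch returns both the cleaned records and the suspended ones, and `exception_report` turns that into the counts a project team would review at the end of an iteration. The point of the design is that an exception is an expected, visible outcome of the process, not a crash.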


Although all of this is easy to accept in theory, it is notoriously difficult to accept in practice.


For example, let’s imagine that your project is processing one billion records and that exhaustive analysis has determined that the results are correct 99.99999% of the time, meaning that exceptions occur in only 0.00001% of the total data population. Now, imagine explaining these statistics to the project team, but providing only the 100 exception records for review. Do not underestimate the difficulty that the human mind has with large numbers (e.g., 100 is an easy number to relate to, but one billion is practically incomprehensible). Also, don’t ignore the effect known as “negativity bias,” where bad evokes a stronger reaction than good in the human mind. Just compare an insult and a compliment: which one do you remember more often? Focusing on the exceptions can undermine confidence and prevent acceptance of an overwhelmingly successful implementation.
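
The arithmetic behind that example can be checked with a short Python sketch. The function names `summarize` and `format_summary` are made up for illustration. The only inputs are the two figures from the example above: one billion records and an exception rate of 0.00001%, which is one in ten million.

```python
from fractions import Fraction

TOTAL_RECORDS = 1_000_000_000
# 0.00001% as an exact fraction: 0.00001 / 100 = 1 / 10,000,000
EXCEPTION_RATE = Fraction(1, 10_000_000)


def summarize(total, exception_rate):
    """Put the exception count in context next to the total population."""
    exceptions = int(total * exception_rate)
    correct = total - exceptions
    return {
        "total": total,
        "correct": correct,
        "exceptions": exceptions,
        # How many correct records stand behind each exception shown to the team.
        "correct_per_exception": correct // exceptions if exceptions else None,
    }


def format_summary(s):
    """One-line summary that leads with the successes, not the exceptions."""
    return (f"{s['correct']:,} of {s['total']:,} records correct; "
            f"{s['exceptions']:,} exceptions for review")


summary = summarize(TOTAL_RECORDS, EXCEPTION_RATE)
print(format_summary(summary))
# prints: 999,999,900 of 1,000,000,000 records correct; 100 exceptions for review
```

Presenting the 100 exceptions alongside the 999,999,900 correct records, or as roughly one exception for every ten million records, is one way to keep the review from being dominated by the exceptions alone.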


If you can accept there will be exceptions, admit perfection is impossible, implement data quality improvements in iterations, and acknowledge when the current iteration has reached the Data Quality Goldilocks Zone, then your data quality initiative will not be perfect, but it will be "just right."