Hyperactive Data Quality
Jim Harris in Books, Data Quality; tagged Philosophy, Thomas Redman
Tuesday, April 28, 2009 at 2:44PM
In economics, the term "flight to quality" describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments.
A similar "flight to data quality" can occur in the aftermath of an event in which poor data quality negatively impacted decision-critical enterprise information. Some examples include a customer service nightmare, a regulatory compliance failure or a financial reporting scandal. Whatever the triggering event, a common response is that data quality suddenly becomes a critical priority and an enterprise information initiative is launched.
Congratulations! You've realized (albeit the hard way) that this "data quality thing" is really important.
Now what are you going to do about it? How are you going to attempt to actually solve the problem?
In his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman uses an analogy called the data quality lake:
"...a lake represents a database and the water therein the data. The stream, which adds new water, is akin to a business process that creates new data and adds them to the database. The lake...is polluted, just as the data are dirty. Two factories pollute the lake. Likewise, flaws in the business process are creating errors...
One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH.
The alternative is to reduce the pollutant at the point source - the factories.
The contrast between the two approaches is stark. In the first, the focus is on the lake; in the second, it is on the stream. So too with data. Finding and fixing errors focuses on the database and data that have already been created. Preventing errors focuses on the business processes and future data."
Reactive Data Quality
A "flight to data quality" usually prompts an approach commonly referred to as Reactive Data Quality (i.e. "cleaning the lake" to use Redman's excellent analogy). The majority of enterprise information initiatives are reactive. The focus is typically on finding and fixing the problems with existing data in an operational data store (ODS), enterprise data warehouse (EDW) or other enterprise information repository. In other words, the focus is on fixing data after it has been extracted from its sources.
An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert "lake cleaners"). Data quality problems can be very insidious and even the best "lake cleaning" process will still produce exceptions. Your process should be designed to identify and report exceptions when they occur. In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.
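The exception handling described above could look roughly like the following sketch. It assumes incoming records arrive as dictionaries and that each check is a plain function. The field names (customer_id, postal_code), the two rules, and the load_with_suspense and exception_report functions are all hypothetical, chosen only to illustrate the idea of suspending records for manual review.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
# A rule returns a human-readable reason when a record fails, else None.
Rule = Callable[[Record], Optional[str]]


def missing_customer_id(record: Record) -> Optional[str]:
    """Hypothetical rule: every record needs a non-blank customer_id."""
    if not record.get("customer_id", "").strip():
        return "customer_id is missing"
    return None


def malformed_postal_code(record: Record) -> Optional[str]:
    """Hypothetical rule: postal_code, when present, must be 5 digits."""
    code = record.get("postal_code", "")
    if code and not (len(code) == 5 and code.isdigit()):
        return f"postal_code {code!r} is not 5 digits"
    return None


@dataclass
class SuspendedRecord:
    record: Record
    reasons: List[str]  # every rule the record failed, for the reviewer


@dataclass
class LoadResult:
    accepted: List[Record] = field(default_factory=list)
    suspended: List[SuspendedRecord] = field(default_factory=list)


def load_with_suspense(records: List[Record], rules: List[Rule]) -> LoadResult:
    """Accept clean records; suspend the rest instead of loading them."""
    result = LoadResult()
    for record in records:
        outcomes = [rule(record) for rule in rules]
        reasons = [reason for reason in outcomes if reason is not None]
        if reasons:
            result.suspended.append(SuspendedRecord(record, reasons))
        else:
            result.accepted.append(record)
    return result


def exception_report(result: LoadResult) -> List[str]:
    """One line per suspended record, listing why it was held back."""
    lines = []
    for item in result.suspended:
        label = item.record.get("customer_id") or "<no id>"
        lines.append(label + ": " + "; ".join(item.reasons))
    return lines
```

Suspended records stay out of the repository until someone corrects and resubmits them, and the report makes each exception visible rather than silently dropped.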
However, as Redman cautions: "...the problem with being a good lake cleaner is that life never gets better. Indeed, it gets worse as more data...conspire to mean there is more work every day." I tell my clients the only way to guarantee that reactive data quality will be successful is to unplug all the computers so that no one can add new data or modify existing data.
Proactive Data Quality
Attempting to prevent data quality problems before they happen is commonly referred to as Proactive Data Quality. The focus is on preventing errors at the sources where data is entered or received and before it is extracted for use by downstream applications (i.e. "enters the lake"). Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:
"It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect."
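To see what the Rule of Ten implies for a whole workload, here is a small worked calculation. The factor of ten comes from the quote above. The unit counts and the unit cost are made-up numbers for illustration only, and the function name total_cost is hypothetical.

```python
def total_cost(total_units: int, defective_units: int,
               unit_cost: float = 1.0, penalty_factor: float = 10.0) -> float:
    """Cost of completing all units when some have defective input data.

    Units with perfect input cost `unit_cost` each; units with defective
    input cost `penalty_factor` times as much (ten, per the Rule of Ten).
    """
    if not 0 <= defective_units <= total_units:
        raise ValueError("defective_units must be between 0 and total_units")
    clean_units = total_units - defective_units
    return clean_units * unit_cost + defective_units * unit_cost * penalty_factor


# Illustrative figures only: 1,000 units of work, 50 with defective input.
perfect = total_cost(1000, 0)      # 1000 * 1.0
actual = total_cost(1000, 50)      # 950 * 1.0 + 50 * 10.0
overhead_ratio = actual / perfect  # how much more the work costs overall
```

Under these assumed numbers, a 5% defect rate raises the total cost from 1,000 to 1,450 cost units, an increase of 45%.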
Proactive data quality advocates implementing improved edit controls on data entry screens, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the business needs of your enterprise information consumers before you deliver data to them.
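As one illustration of an edit control at the point of entry, the sketch below refuses to create a record at all when its values fail basic checks, so bad data never reaches downstream systems. The CustomerEntry class, its fields, and the deliberately simple email pattern are assumptions made for this example, not a recommended validation standard.

```python
import re
from dataclasses import dataclass

# Deliberately loose pattern: something@something.something, no spaces.
_EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass(frozen=True)
class CustomerEntry:
    """Hypothetical data entry form; invalid input is rejected on creation."""
    name: str
    email: str

    def __post_init__(self) -> None:
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if not _EMAIL_PATTERN.match(self.email):
            problems.append(f"email {self.email!r} is not well formed")
        if problems:
            # Report every problem at once so the person entering the
            # data can fix them all in a single pass.
            raise ValueError("; ".join(problems))
```

A data entry screen would catch the ValueError and show its message next to the form instead of saving the record.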
Obviously, it is impossible to truly prevent every problem before it happens. However, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information.
Hyperactive Data Quality
Too many enterprise information initiatives fail because they are launched based on a "flight to data quality" response and have the unrealistic perspective that data quality problems can be quickly and easily resolved. However, just like any complex problem, there is no fast and easy solution for data quality.
To be successful, you must combine aspects of both reactive and proactive data quality to create an enterprise-wide best practice that I call Hyperactive Data Quality, which makes managing data quality a daily responsibility for everyone in your organization.
Please share your thoughts and experiences. Is your data quality Reactive, Proactive or Hyperactive?
Reader Comments (5)
From the LinkedIn Master Data Management Interest Group, Len Dubois commented:
"At the enterprise level, the combination of proactive, reactive and hyperactive is usually best dealt with at the data governance level. Enterprise level data quality really needs the "governance" of a well thought out business strategy and executive sponsorship to bring it together successfully."
Over on the SmartData Collective, Daniel Gent commented:
"Having gone through the aspects of Proactive Data Quality, I have to admit it is the way to go. Sticking with the water metaphor, it's better to check the boat for holes before you set sail downstream."
And I responded:
I have to admit that I have spent much of my career implementing Reactive Data Quality. Sometimes, when you are so focused on “cleaning the lake” you forget to ask “how did the lake get so dirty in the first place?”
However, I agree with Daniel that Proactive Data Quality is the way to go (when you can). Sadly, no one seems to notice it when you are good at it. I think that is one of the reasons that people become so good at Reactive Data Quality - because it gives data quality practitioners more exposure as “heroes” battling poor data quality down in the project trenches.
Over on the SmartData Collective, Edith Ohri commented:
"There is a third way: to make the very use indifferent to quality or robust enough to overcome normal level of "pollution"...not many models can do that, yet it is worth searching. Control over the data entry is an insufficient solution for open systems that get input from other sources, as is the case in integrated, large or old systems. Prevention rather than correction looks to me here as a theoretical ideal. Actually, in my data mining forum, there's a recent entry that just says "data are not deterministic", arguing against the common assumption that data are solid entities. In reality, one has to accept that data is never clear cut, and there is a double stochastic behavior to consider: one (which is usually addressed) is the stochastic nature of cause-effect relations, and the second is the non-deterministic nature of data values themselves that for some reason fails to be mentioned. I recall one of the first lab experiments that we had as students at the Technion of Haifa. The experiment demonstrated how even a strict fact of physics turns to a stochastic measure when registered repeatedly several times… The lesson is that we have here a basic truth, a natural law of information: data elements are stochastic by nature."
And I responded:
Having spent my career advocating a pragmatic approach to data quality, I admit that I have often stated “prevention rather than correction…[is]…a theoretical ideal” that cannot be attained, therefore let’s just get to the practical work of finding and fixing the problems. However, I do believe that the more control that can be enforced where data originates, the better the overall quality will be. As for the nature of data elements, I can’t help but ponder whether the data itself is stochastic or if it is the information derived from data that is stochastic?
And Edith responded:
"My answer is that the data elements themselves are stochastic. It holds even in case that the data are machine made. Many assume that automatic data are free from human error, yet machines and systems are made by humans, therefore their definitions are subjected to human variations, and their integration involves human intervention which enables errors. Also, past generations of equipment & software create necessarily a degree of inconsistent diversity.
The result is, in my mind, that no matter how hard we try to clean the data:
(a) Some data entry sources remain out of control and therefore cannot be cleaned
(b) Conflicts in the meta-data should be expected
(c) Incoherence in application legacy is unavoidable
Altogether, it spells according to this line of thought, data uncertainty (in addition to cause-effect uncertainty)."
Over on the SmartData Collective, Peter Thomas commented:
I think the hybrid approach that you have suggested is the way to go.
You asked about other people's experiences. I blogged about mine in this article a while back:
Using BI to drive improvements in data quality
From the LinkedIn Master Data Management Interest Group, Ravi Shankar Devaraj commented:
"Nice analogy. I agree that a mix of proactive and reactive approaches to data quality is required within an enterprise. If you are using an MDM hub to manage the single version of the truth, the reactive approach translates to data being authored at the source ERP/CRM systems (system of entry), then brought into the MDM hub, through batch or real-time, to create the golden record. The proactive approach would be to stop using the source systems as system of entry for master data and have that data created directly within the MDM Hub itself (e.g., using Siperian Business Data Director). But that requires a higher level of data governance maturity, which only some organizations adopt. The hyperactive approach would be a mix of reactive and proactive approaches for different types of master data. E.g., an organization might use a proactive approach for customer data but use a reactive approach to product data. Again, it depends upon the data governance needs and maturity of the organization."