The Real Data Value is Business Insight
Jim Harris in Blogs, Data Quality
tagged Accuracy, Best of 2010, Business Intelligence, Completeness, Data Governance, Data Profiling, Data Quality Assessment
Monday, August 23, 2010 at 3:33PM
Understanding your data usage is essential to improving its quality, and therefore, you must perform data analysis on a regular basis.
A data profiling tool can help you by automating some of the grunt work needed to begin your data analysis, such as generating levels of statistical summaries supported by drill-down details, including data value frequency distributions (like the ones shown to the left).
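To make the idea of a data value frequency distribution concrete, here is a minimal Python sketch of the kind of summary a profiling tool automates. The Country values are made up for illustration, and the function name value_frequencies is hypothetical, not taken from any particular tool.

```python
from collections import Counter


def value_frequencies(values):
    """Return (value, count, percent) rows, most frequent first."""
    total = len(values)
    if total == 0:
        return []
    counts = Counter(values)
    return [
        (value, count, round(100.0 * count / total, 1))
        for value, count in counts.most_common()
    ]


# Hypothetical Country column, including a missing value (None).
country = ["US", "US", "USA", "United States", "CA", "XX", None, "US"]

for value, count, percent in value_frequencies(country):
    print(f"{value!r:>16} {count:>3} {percent:>5}%")
```

A summary like this shows which values occur and how often, but it says nothing about whether anyone uses the field to make a decision.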
However, a common mistake is to hyper-focus on the data values.
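Narrowing your focus to the values of individual fields is a mistake when it causes you to lose sight of the wider context of the data, which can lead to other errors, such as mistaking validity for accuracy. For example, a Country value can be a valid country name and still be the wrong country for that customer.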
Understanding data usage is about analyzing its most important context—how your data is being used to make business decisions.
“Begin with the decision in mind”
In his excellent recent blog post It’s time to industrialize analytics, James Taylor wrote that “organizations need to be much more focused on directing analysts towards business problems.” Although Taylor was writing about how, in advanced analytics (e.g., data mining, predictive analytics), “there is a tendency to let analysts explore the data, see what can be discovered,” I think this tendency is applicable to all data analysis, including less advanced analytics like data profiling and data quality assessments.
Please don’t misunderstand—Taylor and I are not saying that there is no value in data exploration, because, without question, it can definitely lead to meaningful discoveries. And I continue to advocate that the goal of data profiling is not to find answers, but instead, to discover the right questions.
However, as Taylor explained, it is because “the only results that matter are business results” that data analysis should always “begin with the decision in mind. Find the decisions that are going to make a difference to business results—to the metrics that drive the organization. Then ask the analysts to look into those decisions and see what they might be able to predict that would help make better decisions.”
Once again, although Taylor is discussing predictive analytics, this cogent advice should guide all of your data analysis.
The Real Data Value is Business Insight
Data quality assessments create and monitor metrics based on summary statistics provided by data profiling tools (like the ones shown in the mockup to the left). Elevating these low-level technical metrics to the level of business relevance will often establish their correlation with business performance, but it will not establish metrics that drive, or should drive, the organization.
Although built from the bottom-up by using, for the most part, the data value frequency distributions, these metrics lose sight of the top-down fact that business insight is where the real data value lies.
However, data quality metrics such as completeness, validity, accuracy, and uniqueness, which are just a few common examples, should definitely be created and monitored—unfortunately, a single straightforward metric called Business Insight doesn’t exist.
But let’s pretend that my other mockup metrics were real—50% of the data is inaccurate and there is an 11% duplicate rate.
Oh, no! The organization must be teetering on the edge of oblivion, right? Well, 50% accuracy does sound really bad, basically like your data’s accuracy is no better than flipping a coin. However, which data is inaccurate, and far more important, is the inaccurate data actually being used to make a business decision?
As for the duplicate rate, I am often surprised by the visceral reaction it can trigger, such as: “how can we possibly claim to truly understand who our most valuable customers are if we have an 11% duplicate rate?”
So, would reducing your duplicate rate to only 1% automatically result in better customer insight? Or would it simply mean that the data matching criteria were too conservative (e.g., requiring an exact match on all “critical” data fields), preventing you from discovering how many duplicate customers you have? (Or maybe the 11% indicates the matching criteria were too aggressive.)
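A small Python sketch can show how much the duplicate rate depends on the matching criteria rather than on the data alone. The customer records, field names, and the three match keys below are all hypothetical, chosen only to illustrate conservative versus aggressive matching.

```python
from collections import defaultdict


def duplicate_rate(records, key):
    """Share of records that are redundant copies under a given match key."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    # Every record beyond the first in a group counts as a duplicate.
    duplicates = sum(len(group) - 1 for group in groups.values())
    return duplicates / len(records)


def exact_key(record):
    # Conservative: exact match on every field.
    return (record["name"], record["postal_code"])


def normalized_key(record):
    # Moderate: ignore letter case in the name.
    return (record["name"].lower(), record["postal_code"])


def surname_key(record):
    # Aggressive: match on the last word of the name only.
    return record["name"].lower().split()[-1]


# Hypothetical customer records.
customers = [
    {"name": "John Smith", "postal_code": "02134"},
    {"name": "john smith", "postal_code": "02134"},
    {"name": "John Smith", "postal_code": "02134"},
    {"name": "Jane Smith", "postal_code": "02134"},
    {"name": "J. Smith", "postal_code": "90210"},
]

for label, key in [("exact", exact_key),
                   ("normalized", normalized_key),
                   ("surname only", surname_key)]:
    print(f"{label:>12}: {duplicate_rate(customers, key):.0%}")
```

The same five records yield a 20%, 40%, or 80% duplicate rate depending on the key, so the number by itself cannot tell you whether it is good or bad.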
My point is that accuracy and duplicate rates are just numbers—what determines if they are a good number or a bad number?
The fundamental question that every data quality metric you create must answer is: How does this provide business insight?
If a data quality (or any other data) metric cannot answer this question, then it is meaningless. Meaningful metrics always represent business insight because they were created by beginning with the business decisions in mind. Otherwise, your metrics could provide the comforting, but false, impression that all is well, or you could raise red flags that are really red herrings.
Instead of beginning data analysis with the business decisions in mind, many organizations begin with only the data in mind, which results in creating and monitoring data quality metrics that provide little, if any, business insight and decision support.
Although analyzing your data values is important, you must always remember that the real data value is business insight.
Related Posts
Data Quality and the Cupertino Effect
Is your data complete and accurate, but useless to your business?
You Can’t Always Get the Data You Want
DQ-Tip: “There is no point in monitoring data quality…”
Which came first, the Data Quality Tool or the Business Need?
Selling the Business Benefits of Data Quality



Reader Comments (11)
Indeed Jim,
Without insight, low-quality data and high-quality data become the same thing: an abyss for executives to wade through.
This is the challenge that many organizations have today -- pulling real business insight from their data. We live in such an instant world that executives are demanding this insight be delivered immediately: not a week after analysts have crunched the data, but NOW, as the insight weighs heavily on a pressing business decision. What we need to do is optimize our information. It must be accessible, real-time, and relevant regardless of its age, format, or repository.
You're on to something here Jim!
Thanks for your insightful comment, Natasha of Vivisimo :-)
Yes, deriving business insight from their data is essential for many organizations today.
And speed kills (the too slow kind) because we no longer have the luxury of data mining for weeks or months on end before making the business decisions that are critical to the organization’s survival in today’s fast-paced world.
Best Regards,
Jim
Hi Jim,
Again, I love the clarity of your statement. Somehow you are able to find the words I'm only capable of thinking of.
And the posting reflects my current situation in some ways, as purchase and implementation decisions (e.g., regarding Metadata Management, etc.) are driven by technology doubts and not business needs. Thank you for that.
Regards,
Rayk
Jim, your headline is absolutely right, but on the other hand, I really haven’t seen anyone taking the opposite stance.
If we keep our feet on the ground and look at your table of Country value frequencies, it surely shows a common state of database content, with two classic data quality issues:
• Multiple values for the same real world country
• Values that don’t represent a real world country
Yes, you may conduct daily business operations with these data and things may seem to work just fine.
Yes, you may make important business decisions based on these data, and no one will ever know if things would have turned out differently had the data been cleansed beforehand.
Yes, there may be more important things to do in an organization than cleansing these data and keeping them optimized (usually things such as rolling out SAP globally or looking for tangible cost reductions).
It’s a matter of risk. Maybe you will get away with doing daily business operations and making business decisions based on these data. Maybe some rainy day you don’t. But nobody has ever gone to jail for using bad data. Or, wait …
@Rayk — Thanks for the kind words, I am glad you found the post useful to your current situation. I too have been in many situations where decisions, such as which product to buy and how an initiative should be implemented, were driven by technology doubts and not business needs.
@Henrik — Yes, I know of no one who would admit to disagreeing with my headline in theory, but I have seen many organizations (and data quality professionals) disagree in practice.
I intentionally included those two classic data quality issues in the Country data value frequency distributions, and I agree with all of the excellent points that you made.
However, the missing question is: Does anyone use the Country field to make a business decision?
In my experience, a lot of data is cleansed (and then, hopefully, defect prevention controls are also put in place) simply because it needs to be—again, because of those two classic data quality issues.
However, these data quality efforts can sometimes take on a life of their own, where achieving high quality data is allowed to become the raison d'être of the organization's data management strategy.
In other words, the organization starts managing their data for the sake of managing their data—and not for the sake of enabling better business decisions and delivering optimal business performance.
Nice post. Yes, quality is in the eye of the beholder. Data quality metrics must be calculated within the context of a data consumer. This context is missing in most software tools on the market.
Another important metric is what I call the Materiality Metric.
In your example, 50% of customer data is inaccurate. It'd be helpful if we know which 50%. Are they the customers that generate the most revenue and profits, or are they dormant customers? Are they test records that were never purged from the system? We can calculate the materiality metric by aggregating a relevant business metric for those bad records.
For example, 85% of the year-to-date revenue is associated with those 50% bad customer records.
Now we know this is serious!
@Winston — I really like your idea of creating a materiality metric, which can definitely provide more business relevance to data quality metrics such as completeness, validity, accuracy, uniqueness, etc.
In essence, the materiality metric provides a business insight filter for the data quality metrics.
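As a rough illustration of the materiality metric described above, the following Python sketch aggregates a business metric over the records flagged as bad and divides it by the total. The customer records, the ytd_revenue and accurate fields, and the function name materiality are hypothetical, and the figures are invented so that they mirror the 50% and 85% numbers in the comment.

```python
def materiality(records, is_bad, business_value):
    """Fraction of total business value tied to records flagged as bad."""
    total = sum(business_value(r) for r in records)
    bad = sum(business_value(r) for r in records if is_bad(r))
    return bad / total if total else 0.0


# Hypothetical customers: half of the records are inaccurate.
customers = [
    {"id": 1, "ytd_revenue": 500, "accurate": False},
    {"id": 2, "ytd_revenue": 350, "accurate": False},
    {"id": 3, "ytd_revenue": 100, "accurate": True},
    {"id": 4, "ytd_revenue": 50, "accurate": True},
]

inaccurate_share = sum(not c["accurate"] for c in customers) / len(customers)
revenue_at_risk = materiality(
    customers,
    is_bad=lambda c: not c["accurate"],
    business_value=lambda c: c["ytd_revenue"],
)

print(f"Inaccurate records: {inaccurate_share:.0%}")
print(f"Revenue tied to inaccurate records: {revenue_at_risk:.0%}")
```

Swapping ytd_revenue for another business measure (profit, open orders) changes which bad records matter, which is the point of filtering data quality metrics by business relevance.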
From the LinkedIn Group for the IAIDQ Information/Data Quality Professional Open Community,
Val Pushkarev commented:
“Absolutely! Since data quality is all about fitness for use it makes little sense to carry on with data analysis without getting detailed business insight first.
It's easy to get carried away with developing a bunch of metrics that carry no value to the end user. As data custodians, we tend to place too much focus on each data element.
Making data picture-perfect is not what data quality is all about.”
And I responded:
Yes, far too often, data quality efforts take on a life of their own, where achieving high quality data becomes the entire focus of an organization's data management strategy, where the organization starts managing their data for the sake of managing their data—and not for the sake of enabling better business decisions and delivering optimal business performance.
And then Val Pushkarev made a great follow-up comment:
“Great point, Jim! And yet, on the other hand, we may manage to find a way to educate our end users about the added value of new metrics that might bring some additional business insight as well.
I find it fascinating how often business users have no idea about the meaning of some basic data quality indicators and their potential value. It's also common for technical folks to provide requested detail without any explanations. So now we're back to the infamous IT/Business gap.
Educating end users takes time and effort but carries as much if not more value than the data itself.”
Jim, I know, it’s an essential question: Does anyone use the country field to make a business decision?
It may be that the field is only used in daily operations, for example as part of a postal address, that delivery is made despite different spellings of the country, and that the invalid country value only exists in rows never used for postal mailing.
In that case you could decide that the data governance action should be one of the following:
• Delete the damn country field.
• If you don’t dare delete it: Keep the field and document that it’s useless.
• If you don’t trust that the documentation will be used in the future: Cleanse the field and adapt the business processes to only result in optimized values in the future.
In any case: Take action.
@Henrik — Of course, I knew that you knew that it’s an essential question, but not everyone is as smart as you :-)
I completely agree with your recommendations, especially the excellent advice that it’s based on: Take action!
Well said, sir. As always, well said.
From the LinkedIn Group for Enterprise Data Quality, Amer Malik commented:
“I don't disagree with what you've written, but I find myself not agreeing with it either. Forgive me for asking a bit of a numpty question, but what exactly do you mean by business insight? If what is meant is insight and analysis, I would argue that this is only leveraging a part of the asset we call data. If business insight encompasses other business needs, such as financial reporting, then I would say I agree totally with what you've said.”
And I responded:
Yes, business insight can be just as nebulous as accurate data.
I am defining business insight as a data-driven solution for a business problem, which, in most cases, would be data being used to make a business decision, or encompassing other business needs, such as financial reporting.
And then Mark Besaans made a great follow-up comment:
“I agree with your insightful statement.
To my way of thinking, the word insight means an understanding of cause and effect.
When applied to business, it means reaching an understanding of what precursors cause certain specific business outcomes or conversely what outcomes are impacted by which precursors.
This intelligence is derived from information. Information which allows us to apply science to business by measuring it. The measurements are made possible by the collection and processing of data. Data which must have a high degree of quality, because a small amount of infidelity is amplified by orders of magnitude as it propagates through the process.
The insight has to be reliable because it will be embedded into the nervous system of the organization, causing it to plan, execute and respond accordingly. Any infidelity in the insight can lead to eventual catastrophic failure.
(We need only examine the cause of the sub-prime housing debt crisis).”
From the SmartData Collective, James Taylor commented:
“Obviously I completely agree with you!
I am constantly amazed at the number of folks I meet who are paralyzed about advanced analytics, saying that "we have to fix/clean/integrate all our data before we can do that". They don't know if the data would even be relevant, haven't considered getting the data from an external source and haven't checked to see if the analytic techniques being considered could handle the bad or incomplete data automatically! Lots of techniques used in data mining were invented when data was hard to come by and very "dirty" so they are actually pretty good at coping. Unless someone thinks about the decision you want to improve, and the analytics they will need to do so, I don't see how they can say their data is too dirty, too inconsistent to be used.”
And I responded:
As with many complex challenges, data quality can be viewed as a binary problem, where at first it appears we must choose between two polar opposites, what I call The Pair of Perilous P’s: Procrastination and Perfection.
I have encountered many organizations (and even data quality professionals) who believe that data must be perfect before it can be used, and therefore they procrastinate from beginning such a daunting challenge.
However, it is simply unrealistic to expect to either identify or resolve every data quality problem, and attempting to do so is a surefire way to guarantee failure.
In order to be successful, data quality must always be understood as an iterative process. I advocate watching out for the “Goldilocks Zone” on data quality initiatives, which is the time when the efforts of the current iteration, although not perfect, are “just right” for implementation.
Although accurate data is obviously preferable to inaccurate data, less than perfect data quality cannot be used as an excuse to delay making a critical business decision.
When it comes to the quality of the data being used to make these business decisions, you can’t always get the data you want, but if you try sometimes, you just might find, you get the business insight you need.