Hyperactive Data Quality (Second Edition)

In the first edition of Hyperactive Data Quality, I discussed reactive and proactive approaches using the data quality lake analogy from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset:

“...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH.

The alternative is to reduce the pollutant at the point source – the factories.

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data.”

Reactive Data Quality

Reactive Data Quality (i.e. “cleaning the lake” in Redman's analogy) focuses entirely on finding and fixing the problems with existing data after it has been extracted from its sources. 

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert “lake cleaners”).  Data quality problems can be very insidious and even the best “lake cleaning” process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.

 

Proactive Data Quality

Proactive Data Quality focuses on preventing errors at the sources where data is entered or received (i.e. where it “enters the lake” in Redman's analogy), before it is extracted for use by downstream applications. 

Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

“It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect.”

Proactive data quality advocates reevaluating business processes that create data, implementing improved controls on data entry screens and web forms, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the information needs of your consumers before delivering enterprise data for their use.

 

Proactive Data Quality > Reactive Data Quality

Proactive data quality is clearly the superior approach.  Although it is impossible to truly prevent every problem before it happens, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information. 

Reactive data quality essentially treats the symptoms without curing the disease.  As Redman explains: “...the problem with being a good lake cleaner is that life never gets better...it gets worse as more data...conspire to mean there is more work every day.”

So why do the vast majority of data quality initiatives use a reactive approach?

 

An Arrow Thickly Smeared With Poison

In Buddhism, there is a famous parable:

A man was shot with an arrow thickly smeared with poison.  His friends wanted to get a doctor to heal him, but the man objected by saying:

“I will neither allow this arrow to be pulled out nor accept any medical treatment until I know the name of the man who wounded me, whether he was a nobleman or a soldier or a merchant or a farmer or a lowly peasant, whether he was tall or short or of average height, whether he used a long bow or a crossbow, and whether the arrow that wounded me was hoof-tipped or curved or barbed.” 

While his friends went off in a frantic search for these answers, the man slowly, and painfully, died.

 

“Flight to Data Quality”

In economics, the term “flight to quality” describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments.

A similar “flight to data quality” can occur in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure, or a financial reporting scandal. 

Driven by a business triage for critical data problems, reactive data cleansing is purposefully chosen over proactive defect prevention.  The priority is finding and fixing the near-term problems rather than worrying about the long-term consequences of not identifying the root cause and implementing process improvements that would prevent it from happening again.

The enterprise has been shot with an arrow thickly smeared with poison – poor data quality.  Now is not the time to point out that the enterprise has actually shot itself by failing to have proactive measures in place. 

Reactive data quality only treats the symptoms.  However, during triage, the priority is to stabilize the patient.  A cure for the underlying condition is worthless if the patient dies before it can be administered.

 

Hyperactive Data Quality

Proactive data quality is the best practice.  Root cause analysis, business process improvement, and defect prevention will always be more effective than the endlessly vicious cycle of reactive data cleansing. 

A data governance framework is necessary for proactive data quality to be successful.  Patience and understanding are also necessary.  Proactive data quality requires a strategic organizational transformation that will not happen easily or quickly. 

Even when not facing an immediate crisis, the reality is that reactive data quality will occasionally be a necessary evil that is used to correct today's problems while proactive data quality is busy trying to prevent tomorrow's problems.

Just like any complex problem, data quality has no fast and easy solution.  Fundamentally, a hybrid discipline is required that combines proactive and reactive aspects into an approach that I refer to as Hyperactive Data Quality, which will make the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.

 

Related Posts

Hyperactive Data Quality (First Edition)

The General Theory of Data Quality

The General Theory of Data Quality

In one of his famous 1905 Annus Mirabilis papers, On the Electrodynamics of Moving Bodies, Albert Einstein published what would later become known as his Special Theory of Relativity.

This theory introduced the concept that space and time are interrelated entities forming a single continuum and that the passage of time could vary for each specific observer.

One of the many brilliant insights of special relativity was that it could explain why different observers can make validly different observations – it was a scientifically justifiable matter of perspective. 

As Einstein's Padawan Obi-Wan Kenobi would later explain in his remarkable 1983 “paper” on The Return of the Jedi:

“You're going to find that many of the truths we cling to depend greatly on our own point of view.”

Although the Special Theory of Relativity could explain the different perspectives of different observers, it could not explain the shared perspective of all observers.  Special relativity ignored a foundational force in classical physics – gravity.  So in 1916, Einstein used the force to incorporate a new perspective on gravity into what he called his General Theory of Relativity.

 

The Data-Information Continuum

In my popular post The Data-Information Continuum, I explained that data and information are also interrelated entities forming a single continuum.  I used the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers).

I explained that although a common definition for data quality is fitness for the purpose of use, the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I defined information as data in use or data in action.

I went on to explain that data's quality must be objectively measured separately from its many uses and that information's quality can only be subjectively measured according to its specific use.

 

The Special Theory of Data Quality

The majority of data quality initiatives are reactive projects launched in the aftermath of an event when poor data quality negatively impacted decision-critical information. 

Many of these projects end in failure.  Some fail because of lofty expectations or unmanaged scope creep.  Most fail because they are based on the flawed perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.

Whenever an organization approaches data quality as a one-time project and not as a sustained program, they are accepting what I refer to as the Special Theory of Data Quality.

However, similar to the accuracy of special relativity for solving a narrowly defined problem, sometimes applications of the Special Theory of Data Quality can yield successful results – from a certain point of view. 

Tactical initiatives will often have a necessarily narrow focus.  Reactive data quality projects are sometimes driven by a business triage for the most critical data problems requiring near-term prioritization that simply can't wait for the effects that would be caused by implementing a proactive strategic initiative (i.e. one that may have prevented the problems from happening).

One of the worst things that can happen to an organization is a successful data quality project – because it is almost always an implementation of information quality customized to the needs of the tactical initiative that provided its funding. 

Ultimately, this misperceived success simply delays an actual failure when one of the following happens:

  1. The project ends and the team returns to their previous activities, only to be forced into triage once again when the next inevitable crisis occurs and poor data quality negatively impacts decision-critical information.
  2. A new project (or a later phase of the same project) attempts to enforce the information quality standards throughout the organization as if they were enterprise data quality standards.

 

The General Theory of Data Quality

True data quality standards are enterprise-wide standards providing an objective data foundation.  True information quality standards must always be customized to meet the subjective needs of a specific business process and/or initiative.

Both aspects of this shared perspective of quality must be incorporated into a single sustained program that enforces a consistent enterprise understanding of data, but that also provides the information necessary to support day-to-day operations.

Whenever an organization approaches data quality as a sustained program and not as a one-time project, they are accepting what I refer to as the General Theory of Data Quality.

Data governance provides the framework for crossing the special to general theoretical threshold necessary to evolve data quality from a project to a sustained program.  However, in this post, I want to remain focused on which theory an organization accepts because if you don't accept the General Theory of Data Quality, you likely also don't accept the crucial role that data governance plays in a data quality initiative – and in all fairness, data governance obviously involves much more than just data quality.

 

Theory vs. Practice

Even though I am an advocate for the General Theory of Data Quality, I also realize that no one works at a company called Perfect, Incorporated.  I would be lying if I said that I had not worked on more projects than programs, implemented more reactive data cleansing than proactive defect prevention, or that I have never championed a “single version of the truth.”

Therefore, my career has more often exemplified the Special Theory of Data Quality.  Or perhaps my career has exemplified what could be referred to as the General Practice of Data Quality?

What theory of data quality does your organization accept?  Which one do you personally accept? 

More importantly, what does your organization actually practice when it comes to data quality?

 

Related Posts

The Data-Information Continuum

Hyperactive Data Quality (Second Edition)

Hyperactive Data Quality (First Edition)

Data Governance and Data Quality

Schrödinger's Data Quality

Adventures in Data Profiling (Part 3)

In Part 2 of this series:  The adventures continued with a detailed analysis of the Customer ID field and the preliminary analysis of the Gender Code and Customer Name fields.  This provided you with an opportunity to become familiar with the features of the fictional data profiling tool that you are using throughout this series to assist with performing your analysis.

Additionally, some of your fellow Data Gazers have provided excellent insights and suggestions via the comments they have left, including my time traveling alter ego who has left you some clues from what the future might hold when you reach the end of these adventures in data profiling.

In Part 3, you will continue your adventures by using a combination of field values and field formats to begin your analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

 

Birth Date

Field Summary for Birth Date

 

  The field summary for Birth Date includes input metadata along with the summary and additional statistics provided by the data profiling tool.  Let's assume that drill-downs revealed the single profiled field data type was DATE and the single profiled field format was MM-DD-CCYY (i.e. Month-Day-Year). 

  Combined with the profiled minimum/maximum field lengths and minimum/maximum field values, the good news appears to be that when Birth Date is populated it does contain a date value.

  However, the not so good news is that the profiled maximum field value (December 21, 2012) appears to indicate that some of the customers are either time travelers or the marketing department has a divinely inspired prospect list.

  This is a good example of a common data quality challenge – a field value can have a valid data type and a valid format – but an invalid context.  Although 12-21-2012 is a valid date in a valid format, in the context of a birth date, it can't be valid.
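
To make this contextual check concrete, here is a minimal Python sketch (mine, not a feature of the fictional profiling tool) that parses values in the profiled MM-DD-CCYY format and flags dates that are valid as dates but invalid as birth dates.  The as-of date and the maximum plausible age are illustrative assumptions.

```python
from datetime import datetime, date

def check_birth_date(value, as_of=date(2009, 7, 31), max_age_years=120):
    """Classify a Birth Date value; the as-of date and age limit are assumptions."""
    if value is None or not value.strip():
        return "missing"
    try:
        parsed = datetime.strptime(value.strip(), "%m-%d-%Y").date()  # MM-DD-CCYY
    except ValueError:
        return "invalid date or format"
    if parsed > as_of:
        return "future date (invalid context)"    # e.g. 12-21-2012
    if as_of.year - parsed.year > max_age_years:
        return "implausibly old (invalid context)"
    return "plausible birth date"

print(check_birth_date("12-21-2012"))   # future date (invalid context)
print(check_birth_date("02-29-2001"))   # invalid date or format (2001 is not a leap year)
print(check_birth_date("07-04-1976"))   # plausible birth date
```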

 

Field Values for Birth Date

 

  We can use drill-downs on the field summary “screen” to get more details about Birth Date provided by the data profiling tool.

  The cardinality of Birth Date is not only relatively high, but it also has a very low Distinctness (i.e. the same field value frequently occurs on more than one record).  Therefore, we will limit the review to only the top ten most frequently occurring values.

  Additional analysis can be performed by extracting the birth year and reviewing only its top ten most frequently occurring values.  This analysis also provides an easier way of examining the customer age range.

  Here we also see two contextually invalid birth years: 2011 and 2012.  Any thoughts on a possible explanation for this data anomaly?

 

Telephone Number

Field Summary for Telephone Number

  The field summary for Telephone Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  The presence of both multiple profiled field data types and multiple profiled field formats would appear to indicate inconsistencies in the way that telephone numbers are represented.

  The profiled minimum/maximum field lengths show additional inconsistencies, but perhaps more concerning are the profiled minimum/maximum field values, which show obviously invalid telephone numbers.

  Telephone Number is a good example of how you should not mistake Completeness (which as a data profiling statistic indicates the field is populated with an Actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.

  This summary information points to the need to use drill-downs in order to review more detailed information.

 

Field Values for Telephone Number

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different data types based on content and numeric precision.

  With only ten profiled field formats, we can easily review them all.  Most formats appear to be representative of potentially valid telephone numbers.  However, there are two formats for 7-digit numbers that appear to indicate local dialing syntax (i.e. missing the area code in the United States).  Additionally, there are two formats that appear invalid based on North American standards.

  However, a common data quality challenge is that valid field formats can conceal invalid field values.

  Since the cardinality of Telephone Number is very high, we will limit the review to only the top ten most frequently occurring values.  In this case, more obviously invalid telephone numbers are discovered.  
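
As an illustration of this kind of analysis, here is a minimal Python sketch that collapses telephone number values into field formats (every digit becomes 9), counts their frequencies, and flags the conditions discussed above.  The sample values are hypothetical, and the 10-digit North American assumption is deliberately simplistic.

```python
from collections import Counter

def phone_format(value):
    """Collapse a telephone number into a field format: every digit becomes 9."""
    return "".join("9" if ch.isdigit() else ch for ch in value.strip())

telephone_numbers = [                      # hypothetical sample values
    "(508) 555-1234", "508-555-6789", "555-1234",
    "(999) 999-9999", "0000000000", "ext. 42",
]

for fmt, count in Counter(phone_format(v) for v in telephone_numbers).most_common():
    digits = fmt.count("9")
    if digits == 7:
        note = "  <- local dialing syntax (missing area code)"
    elif digits != 10:
        note = "  <- invalid based on North American standards"
    else:
        note = ""
    print(f"{fmt!r}: {count}{note}")
```

Note that the all-zero sample value collapses into a valid-looking 10-digit format, which is exactly how valid field formats can conceal invalid field values.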

 

E-mail Address

Field Summary for E-mail Address

 

  The field summary for E-mail Address includes input metadata along with the summary statistics provided by the data profiling tool.  In order to save some space, I have intentionally omitted the additional profiling statistics for this field.

  E-mail Address represents a greater challenge that really requires more than just summary statistics in order to perform effective analysis.

  Most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

 

Field Values for E-mail Address

 

  In the case of the E-mail Address field, potentially valid field values should consist of the sub-values User, Domain and Top Level Domain (TLD).  These sub-values also have expected delimiters, such as User and Domain being separated by an at symbol (@) and Domain and TLD being separated by a dot symbol (.).

  Reviewing the top ten most frequently occurring field formats shows several common potentially valid structures.  However, some formats are missing one of the three required sub-values.  The formats missing User could be an indication that the field sometimes contains a Website Address.

  Extracting the top five most frequently occurring Domain and TLD sub-values provides additional alternative analysis for a high cardinality field.
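
To illustrate how such formats could be constructed, here is a minimal Python sketch that splits an e-mail address into its User, Domain and TLD sub-values and reports which of them are missing.  The sample addresses are hypothetical, and the parsing is purely structural (it does not validate against the full e-mail addressing rules).

```python
def email_format(value):
    """Build a structural format from the User, Domain and TLD sub-values."""
    value = value.strip()
    user, at, remainder = value.partition("@")
    if not at:                       # no @ symbol: possibly a Website Address
        user, remainder = "", value
    domain, dot, tld = remainder.rpartition(".")
    if not dot:                      # no dot: treat the remainder as Domain, TLD missing
        domain, tld = remainder, ""
    user_part = "User" if user else "(missing User)"
    domain_part = "Domain" if domain else "(missing Domain)"
    tld_part = "TLD" if tld else "(missing TLD)"
    return f"{user_part}@{domain_part}.{tld_part}"

for address in ["jane.doe@example.com", "www.example.com", "jane.doe@example", "@example.com"]:
    print(f"{address!r:24} -> {email_format(address)}")
```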

 

 

What other questions can you think of for these fields?  Additional analysis could be done using drill-downs to perform a more detailed review of records of interest.  What other analysis do you think should be performed for these fields? 

 

In Part 4 of this series:  We will continue the adventures by shifting our focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Adventures in Data Profiling (Part 2)

In Part 1 of this series:  The adventures began with the following scenario – You are an external consultant on a new data quality initiative.  You've got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses...ok, maybe not those last two or three things – but the rest is true.

You have no prior knowledge of the data or its expected characteristics.  You are performing this analysis without the aid of either business requirements or subject matter experts.  Your goal is to learn as much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.

 

The customer data source was processed by the data profiling tool, which provided the following statistical summaries:

 

Data Profiling Summary

 

The Adventures Continue...

In Part 1, we asked if Customer ID was the primary key for this data source.  In an attempt to answer this question, let's “click” on it and drill-down to a field summary provided by the data profiling tool:

 

Field Summary for Customer ID

  Please remember that my data profiling tool is fictional (i.e. not modeled after any real product) and therefore all of my “screen shots” are customized to illustrate series concepts.  This “screen” would not only look different in a real data profiling tool, but it would also contain additional information.

  This field summary for Customer ID includes some input metadata, identifying the expected data type and field length.  Verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data.

  The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.

 

 

Field Details for Customer ID

  We can use drill-downs on the field summary “screen” to get more details about Customer ID provided by the data profiling tool.

  The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS).  Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).

  Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values).  Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.

  The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit.  We could also continue drilling down to see the actual field values behind the field formats.
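
As a rough illustration of how these two drill-downs could be derived, here is a minimal Python sketch that classifies hypothetical Customer ID values both by the smallest common integer type that could hold them and by their numeric pattern.  The SMALLINT/INTEGER/BIGINT ranges are illustrative assumptions, since (as noted above) precision varies by RDBMS.

```python
from collections import Counter

def integer_data_type(value):
    """Classify a numeric string by the smallest integer type that could store it."""
    number = int(value)
    if -32768 <= number <= 32767:
        return "SMALLINT"
    if -2147483648 <= number <= 2147483647:
        return "INTEGER"
    return "BIGINT"

customer_ids = ["7", "42", "512", "7048", "65536", "1234567"]   # hypothetical values

print(Counter(integer_data_type(v) for v in customer_ids))      # distinct data types
print(Counter(f"{len(v)} digits" for v in customer_ids))        # distinct field formats
```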

 

Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?

 

In Part 1, we asked why the Gender Code field has 8 distinct values.  Cardinality can play a major role in deciding whether or not you want to drill-down to field values or field formats since it is much easier to review all of the field values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields). 

 

Field Values for Gender Code

  We will drill-down to this “screen” to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.

  It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male.  Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).

However, it is dangerous to assume anything.  And what about those numeric values?  Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields.  For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name, and do the records with an Actual Gender Code value indicate the presence of a personal name? 

To attempt to answer these questions, it may be helpful to review records with each of these field values.  Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

 Record Drill-down for Gender Code

As is so often the case, data rarely conforms to our assumptions about it.  Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?

 

In Part 3 of this series:  We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Adventures in Data Profiling (Part 1)

In my popular post Getting Your Data Freq On, I explained that understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis. 

I explained the benefits of using a data profiling tool to help automate some of the grunt work, but that you need to perform the actual analysis and then prepare meaningful questions and reports to share with the rest of your team.

 

Series Overview

This post is the beginning of a vendor-neutral series on the methodology of data profiling.

In order to narrow the scope of the series, the scenario used will be that a customer data source for a new data quality initiative has been made available to an external consultant who has no prior knowledge of the data or its expected characteristics.  Also, the business requirements have not yet been documented, and the subject matter experts are not currently available.

The series will not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that are covered.  Both the data profiling tool and the data used throughout the series will be fictional.  The “screen shots” have been customized to illustrate concepts and are not modeled after any particular data profiling tool.

 

The Adventures Begin...

 Data Profiling Summary  

The customer data source has been processed by a data profiling tool, which has provided the above counts and percentages that summarize the following field content characteristics:

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual
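
For readers who like to see the arithmetic, here is a minimal Python sketch of how these seven statistics could be computed for a single field.  Treating None as NULL and whitespace-only strings as Missing are conventions assumed for the sketch, not requirements of any particular data profiling tool.

```python
def profile_field(values):
    """Compute the summary statistics defined above for a single field."""
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    missing_count = sum(1 for v in values if v is not None and str(v).strip() == "")
    actual_values = [v for v in values if v is not None and str(v).strip() != ""]
    actual_count = len(actual_values)
    cardinality = len(set(actual_values))
    return {
        "NULL": null_count,
        "Missing": missing_count,
        "Actual": actual_count,
        "Completeness": actual_count / total if total else 0.0,
        "Cardinality": cardinality,
        "Uniqueness": cardinality / total if total else 0.0,
        "Distinctness": cardinality / actual_count if actual_count else 0.0,
    }

# Hypothetical Gender Code sample: 1 NULL, 1 Missing, 6 Actual, 3 distinct values
print(profile_field(["F", "M", "F", None, "  ", "Unknown", "M", "F"]))
```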

 

Some initial questions based on your analysis of these statistical summaries might include the following:

  1. Is Customer ID the primary key for this data source?
  2. Is Customer Name 1 the primary name on the account?  If so, why isn't it always populated?
  3. Do the statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records?
  4. Why does the Gender Code field have 8 distinct values?
  5. Do the 5 distinct values in Country Code indicate international postal addresses?

Please remember the series scenario – You are an external consultant with no prior knowledge of the data or its expected characteristics, who is performing this analysis without the aid of either business requirements or subject matter experts.

 

What other questions can you think of based on analyzing the statistical summaries provided by the data profiling tool?

 

In Part 2 of this series:  We will continue the adventures by attempting to answer these questions (and more) by beginning our analysis of the frequency distributions of the unique values and formats found within the fields.  Additionally, we will begin using drill-down analysis in order to perform a more detailed review of records of interest.

 

Related Posts

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Data Quality: The Reality Show?

Over on the DataFlux Community of Experts, Dylan Jones (of Data Quality Pro fame) posted Data Quality and Social Proof, which advocates an interesting approach to convincing stakeholders of the need to act on poor data quality:

Use video testimonials from knowledge workers to record what data quality really means to the people who use data and drive value in your business.

My overactive imagination and sense of humor couldn't help but wonder what some of these testimonials would be like...

 

A Few Good Knowledge Workers

“You want the truth?  You can't handle the truth!  We live in a world that has data and the quality of those data need to be guarded by workers with knowledge.  Who's gonna do it?  You?  I have a greater responsibility than you can possibly fathom.  You have the luxury of not knowing what I know. 

You don't want the truth because deep down in places you don't talk about at board meetings, you want me on that data, you need me on that data!

We use words like completeness, consistency, accuracy, timeliness.  We use them as the backbone of a career spent trying to defend data.  You use them as bullet points on a presentation slide.

I suggest that you pick up a pen and sign the authorization for our data quality initiative!”

 

Data-pocalypse Now

“I've seen poor data quality...poor data quality that you've seen.  It's impossible for words to describe what is necessary to those who do not know what poor data quality means.  Poor data quality has a face:

A customer to whom we can not provide service, an auditor that we can not prevent from failing us on regulatory compliance, a stockholder to whom we can not accurately report revenue.

Poor data quality...the horror...the horror...”

Data Busters

“You want to know how poor our data quality is? 

Our data is headed for a disaster of Y2K proportions.  What do we mean by Y2K? 

Old Mainframe, real wrath of EBCDIC type stuff.  Fire and brimstone coming down from the codepages!  Rivers and seas of boiling data!  Forty years of darkness!  Hard drive crashes!  HTTP 404!  Deleted records rising from the Recycle Bin!  Precision sacrifice!  Dogs and cats living together...Mass Hysteria!

We are all terrified beyond the capacity for rational thought. 

If someone asks if you are going to approve our data quality initiative...you say YES!”

 

Your Data Quality Reality Show

What would your video testimonial show about the reality of data quality in your organization? 

How would you respond if asked to help convince your stakeholders of the need to act on poor data quality?

The Wisdom of Failure

Earlier this month, I had the honor of being interviewed by Ajay Ohri on his blog Decision Stats, which is an excellent source of insights on business intelligence and data mining as well as interviews with industry thought leaders and chief evangelists.

One of the questions Ajay asked me during my interview was what methods and habits I would recommend to young analysts just starting in the business intelligence field, and part of my response was:

“Don't be afraid to ask questions or admit when you don't know the answers.  The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don't know the answers.”

It is perhaps one of life’s cruelest paradoxes that some lessons simply cannot be taught, but instead have to be learned through the pain of making mistakes.  To err is human, but not all humans learn from their errors.  In fact, some of us find it extremely difficult to even simply acknowledge when we have made a mistake.  This was certainly true for me earlier in my career.

 

The Wisdom of Crowds

One of my favorite books is The Wisdom of Crowds by James Surowiecki.  Before reading it, I admit that I believed crowds were incapable of wisdom and that the best decisions are based on the expert advice of carefully selected individuals.  However, Surowiecki wonderfully elucidates the folly of “chasing the expert” and explains the four conditions that characterize wise crowds: diversity of opinion, independent thinking, decentralization and aggregation.  The book is also balanced by examining the conditions (e.g. confirmation bias and groupthink) that can commonly undermine the wisdom of crowds.  All in all, it is a wonderful discourse on both collective intelligence and collective ignorance with practical advice on how to achieve the former and avoid the latter.

 

Chasing the Data Quality Expert

Without question, a data quality expert can be an invaluable member of your team.  Often an external consultant, a data quality expert can provide extensive experience and best practices from successful implementations.  However, regardless of their experience, even with other companies in your industry, every organization and its data is unique.  An expert's perspective definitely has merit, but their opinions and advice should not be allowed to dominate the decision making process. 

“The more power you give a single individual in the face of complexity,” explains Surowiecki, “the more likely it is that bad decisions will get made.”  No one person regardless of their experience and expertise can succeed on their own.  According to Surowiecki, the best experts “recognize the limits of their own knowledge and of individual decision making.”

 

“Success is on the far side of failure”

One of the most common obstacles organizations face with data quality initiatives is that many initial attempts end in failure.  Some fail because of lofty expectations, unmanaged scope creep, and the unrealistic perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.  However, regardless of the reason for the failure, it can negatively affect morale and cause employees to resist participating in the next data quality effort.

Although a common best practice is to perform a post-mortem in order to document the lessons learned, sometimes the stigma of failure persuades an organization to either skip the post-mortem or ignore its findings. 

However, in the famous words of IBM founder Thomas J. Watson: “Success is on the far side of failure.” 

A failed data quality initiative may have been closer to success than you realize.  At the very least, there are important lessons to be learned from the mistakes that were made.  The sooner you can recognize your mistakes, the sooner you can mitigate their effects and hopefully prevent them from happening again.

 

The Wisdom of Failure

In one of my other favorite books, How We Decide, Jonah Lehrer explains:

“The brain always learns the same way, accumulating wisdom through error...there are no shortcuts to this painstaking process...becoming an expert just takes time and practice...once you have developed expertise in a particular area...you have made the requisite mistakes.”

Therefore, although it may be true that experience is the path that separates knowledge from wisdom, I have come to realize that the true wisdom of my experience is the wisdom of failure.

 

Related Posts

A Portrait of the Data Quality Expert as a Young Idiot

All I Really Need To Know About Data Quality I Learned In Kindergarten

The Nine Circles of Data Quality Hell

Getting Your Data Freq On

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Data profiling is often performed during a data quality assessment.  Data profiling involves much more than reviewing the output generated by a data profiling tool, and a data quality assessment obviously involves much more than just data profiling. 

However, in this post I want to focus on some of the benefits of using a data profiling tool.

 

Freq'ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

 

How can a data profiling tool help you?  Let me count the ways

Data profiling tools provide counts and percentages for each field that summarize its content characteristics such as:

  • NULL – count of the number of records with a NULL value
  • Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness – percentage calculated as Actual divided by the total number of records
  • Cardinality – count of the number of distinct actual values
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records
  • Distinctness – percentage calculated as Cardinality divided by Actual

The absence of data can be represented many different ways with NULL being most common for relational database columns.  However, character fields can contain all spaces or an empty string and numeric fields can contain all zeroes.  Consistently representing the absence of data is a common data quality standard. 

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique.  Required non-key fields may often be 100% complete but a low cardinality could indicate the presence of potential default values.

Distinctness can be useful in evaluating the potential for duplicate records.  For example, a Tax ID field may be less than 100% complete (i.e. not every record has one) and therefore also less than 100% unique (i.e. it can not be considered a potential single primary key because it can not be used to uniquely identify every record).  If the Tax ID field is also less than 100% distinct (i.e. some distinct actual values occur on more than one record), then this could indicate the presence of potential duplicate records.
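
Here is a minimal Python sketch (with hypothetical sample values) of how the completeness, uniqueness and distinctness heuristics just described could be applied to evaluate a candidate primary key and to flag potential duplicates.

```python
def evaluate_field(name, values):
    """Apply simple completeness / uniqueness / distinctness heuristics to a field."""
    total = len(values)
    actual = [v for v in values if v is not None and str(v).strip() != ""]
    cardinality = len(set(actual))
    completeness = len(actual) / total if total else 0.0
    uniqueness = cardinality / total if total else 0.0
    distinctness = cardinality / len(actual) if actual else 0.0

    if completeness == 1.0 and uniqueness == 1.0:
        verdict = "candidate single primary key"
    elif distinctness < 1.0:
        verdict = "possible duplicate records (some actual values repeat)"
    else:
        verdict = "populated values are distinct, but the field is not fully complete"
    print(f"{name}: completeness={completeness:.0%}, uniqueness={uniqueness:.0%}, "
          f"distinctness={distinctness:.0%} -> {verdict}")

# Hypothetical examples mirroring the Tax ID discussion above
evaluate_field("Customer ID", ["1", "2", "3", "4", "5"])
evaluate_field("Tax ID", ["111-22-3333", None, "111-22-3333", "444-55-6666", None])
```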

Data profiling tools will often generate many other useful summary statistics for each field including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).

 

Show Me the Value (or the Format)

A frequency distribution of the unique formats found in a field is sometimes more useful than the unique values.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality (i.e. indicating potential default values)
  • Fields with a relatively low cardinality (e.g. gender code and source system code)
  • Fields with a relatively small number of valid values (e.g. state abbreviation and country code)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g. integer surrogate key or ZIP+4 add-on code)
  • Fields with a relatively limited number of valid formats (e.g. telephone number and birth date)
  • Fields with free-form values and a high cardinality  (e.g. customer name and postal address)

Cardinality can play a major role in deciding whether or not you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values.

Some fields can also be alternatively analyzed using partial values (e.g. birth year extracted from birth date) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers). 

Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, given name, family name, title).
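
As a simple illustration of constructing formats by parsing and classifying individual values, here is a Python sketch for a personal name field.  The salutation and title lookup sets are tiny illustrative assumptions; real data profiling tools rely on much richer classification tables.

```python
SALUTATIONS = {"MR", "MRS", "MS", "DR"}       # tiny illustrative lookup tables
TITLES = {"JR", "SR", "II", "III", "MD"}

def name_format(value):
    """Construct a field format by classifying each token of a free-form name."""
    labels = []
    for token in value.replace(",", " ").replace(".", " ").split():
        upper = token.upper()
        if upper in SALUTATIONS:
            labels.append("Salutation")
        elif upper in TITLES:
            labels.append("Title")
        elif upper.isdigit():
            labels.append("Number")
        elif len(upper) == 1:
            labels.append("Initial")
        else:
            labels.append("Word")
    return " + ".join(labels) if labels else "(empty)"

for name in ["Dr. Jane Q. Doe", "Doe, John Jr.", "ACME Corporation"]:
    print(f"{name!r:22} -> {name_format(name)}")
```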

 

Conclusion

Understanding your data is essential to using it effectively and improving its quality.  In order to achieve these goals, there is simply no substitute for data analysis.

A data profiling tool can help you by automating some of the grunt work needed to begin this analysis.  However, it is important to remember that the analysis itself cannot be automated.  You need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more importantly, translate your analysis into meaningful reports and questions to share with the rest of the project team.  Well performed data profiling is a highly interactive and iterative process.

Data profiling is typically one of the first tasks performed on a data quality project.  This is especially true when data is made available before business requirements are documented and subject matter experts are available to discuss usage, relevancy, standards and the metrics for measuring and improving data quality.  All of which are necessary to progress from profiling your data to performing a full data quality assessment.  However, these are not acceptable excuses for delaying data profiling.

 

Therefore, grab your favorite caffeinated beverage, settle into your most comfortable chair, roll up your sleeves and...

Get your data freq on! 

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Schrödinger's Data Quality

Data Gazers

The Very True Fear of False Positives

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels).  Ranking is often used as a primary method in differentiating the three possible result categories:

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review
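
To show how ranking can feed these three categories, here is a minimal Python sketch that scores record pairs with a simple string similarity measure and classifies them against two thresholds.  The thresholds, the similarity measure, and the sample records are all illustrative assumptions, not a recommendation of any particular data matching technique.

```python
from difflib import SequenceMatcher

AUTO_MATCH = 0.90        # illustrative thresholds; real implementations tune these
AUTO_NON_MATCH = 0.60    # carefully against the trade-offs discussed below

def match_category(record_a, record_b):
    """Rank a record pair with a similarity score and classify the result."""
    score = SequenceMatcher(None, record_a.lower(), record_b.lower()).ratio()
    if score >= AUTO_MATCH:
        return score, "Automatic Match"
    if score < AUTO_NON_MATCH:
        return score, "Automatic Non-Match"
    return score, "Potential Match requiring manual review"

pairs = [
    ("Jon Smith, 123 Main St", "John Smith, 123 Main Street"),
    ("Jon Smith, 123 Main St", "Jane Jones, 456 Oak Ave"),
]
for a, b in pairs:
    score, category = match_category(a, b)
    print(f"{score:.2f}  {category}")
```

Where those thresholds are placed determines how many false negatives and false positives you are willing to accept, which is exactly the challenge described next.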

All data matching techniques must also face the daunting challenge of what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched

For data examples that illustrate the challenge of false negatives and false positives, please refer to my Data Quality Pro articles:

 

Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution (at least not until Doc Brown and I get our patent-pending prototype of the redundant data capacitor working) because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique; every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the Gartner Data Quality Magic Quadrant and many of the so-called niche vendors.  Without exception, I have always been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.

 

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives: the identification of records, within and across existing systems, that are not currently linked and are therefore preventing the enterprise from understanding the true data relationships that exist in its information assets.

However, the pursuit to reduce false negatives carries with it the risk of creating false positives. 

In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software than they are about the false negatives left unlinked.  After all, those records were not linked before investing in the data matching software.  Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry's dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern.  It is these human and business factors, and not just the technology itself, that need to be taken into consideration when planning a data matching implementation. 

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.

 

Additional Resources

Data Quality and Record Linkage Techniques

The Art of Data Matching

Identifying Duplicate Customer Records - Case Study

Narrative Fallacy and Data Matching

Speaking of Narrative Fallacy

The Myth of Matching: Why We Need Entity Resolution

The Human Element in Identity Resolution

Probabilistic Matching: Sounds like a good idea, but...

Probabilistic Matching: Part Two

Worthy Data Quality Whitepapers (Part 2)

Overall Approach to Data Quality ROI

Overall Approach to Data Quality ROI is a worthy data quality whitepaper freely available (name and email required for download) from the McKnight Consulting Group.

 

William McKnight

The author of the whitepaper is William McKnight, President of McKnight Consulting Group.  William focuses on delivering business value and solving business problems utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures.  William has more than 20 years of information management experience, nearly half of which was gained in IT leadership positions, dealing firsthand with the challenging issues his clients now face.  His IT and consulting teams have won best practice competitions for their implementations.  In 11 years of consulting, he has been a part of 150 client programs worldwide, has over 300 articles, whitepapers and tips in publication and is a frequent international speaker.  William and his team provide clients with action plans, architectures, complete programs, vendor-neutral tool selection and right-fit resources. 

Additionally, William has an excellent blog on the B-eye-Network and a new course now available on eLearningCurve.

 

Whitepaper Excerpts

Excerpts from Overall Approach to Data Quality ROI:

  • “Data quality is an elusive subject that can defy measurement and yet be critical enough to derail any single IT project, strategic initiative, or even a company as a whole.”
  • “Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise.  Put simply, it is a competitive strategy.”
  • Six key steps to help you realize tangible ROI on your data quality initiative:
    1. System Profiling – survey and prioritize your company systems according to their use of and need for quality data.
    2. Data Quality Rule Determination – data quality can be defined as a lack of intolerable defects.
    3. Data Profiling – usually no one can articulate how clean or dirty corporate data is.  Without this measurement of cleanliness, the effectiveness of activities that are aimed at improving data quality cannot be measured.
    4. Data Quality Scoring – scoring is a relative measure of conformance to rules.  System scores are an aggregate of the rule scores for that system and the overall score is a prorated aggregation of the system scores.
    5. Measure Impact of Various Levels of Data Quality – ROI is about accumulating all returns and investments from a project’s build, maintenance, and associated business and IT activities through to the ultimate desired results – all while considering the possible outcomes and their likelihood.
    6. Data Quality Improvement – it is much more costly to fix data quality errors in downstream systems than it is at the point of origin.
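
As a back-of-the-envelope illustration of step 4 (not taken from the whitepaper), here is a Python sketch that aggregates rule scores into system scores and then prorates the system scores into an overall score.  The rule names, scores and weights are all hypothetical.

```python
# Hypothetical rule scores per system (fraction of records conforming to each rule);
# the weights used for the prorated overall score are also assumptions.
rule_scores = {
    "CRM": {"valid email format": 0.92, "non-missing tax id": 0.81},
    "ERP": {"valid email format": 0.97, "non-missing tax id": 0.88},
}
system_weights = {"CRM": 0.6, "ERP": 0.4}   # e.g. prorated by record volume

def system_score(scores):
    """A system score as the simple average of its rule scores."""
    return sum(scores.values()) / len(scores)

system_scores = {name: system_score(scores) for name, scores in rule_scores.items()}
overall = sum(system_scores[name] * system_weights[name] for name in system_scores)

print(system_scores)                 # {'CRM': 0.865, 'ERP': 0.925}
print(f"Overall score: {overall:.3f}")   # Overall score: 0.889
```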
 

Related Posts

Worthy Data Quality Whitepapers (Part 1)

Data Quality Whitepapers are Worthless

Data Quality Blogging All-Stars

The 2009 Major League Baseball (MLB) All-Star Game is being held tonight at Busch Stadium in St. Louis, Missouri. 

For those readers who are not baseball fans, the All-Star Game is an annual exhibition held in mid-July that showcases the players with the best statistical performances from the first half of the MLB season.

As I watch the 80th Midsummer Classic, I offer this exhibition that showcases the bloggers with the posts I have most enjoyed reading from the first half of the 2009 data quality blogging season.

 

Dylan Jones

From Data Quality Pro:

 

Daragh O Brien

From The DOBlog:

 

Steve Sarsfield

From Data Governance and Data Quality Insider:

 

Daniel Gent

From Data Quality Edge:

 

Henrik Liliendahl Sørensen

From Liliendahl on Data Quality:

 

Stefanos Damianakis

From Netrics HD:

 

Vish Agashe

From Business Intelligence: Process, People and Products:

 

Mark Goloboy

From Boston Data, Technology & Analytics:

 

Additional Resources

Over on Data Quality Pro, read the data quality blog roundups from the first half of 2009:

From the IAIDQ, read the 2009 issues of the IAIDQ Blog Carnival:

Data Governance and Data Quality

Regular readers know that I often blog about the common mistakes I have observed (and made) in my professional services and application development experience in data quality (for example, see my post: The Nine Circles of Data Quality Hell).

According to Wikipedia: “Data governance is an emerging discipline with an evolving definition.  The discipline embodies a convergence of data quality, data management, business process management, and risk management surrounding the handling of data in an organization.”

Since I have never formally used the term “data governance” with my clients, I have been researching what data governance is and how it specifically relates to data quality.

Thankfully, I found a great resource in Steve Sarsfield's excellent book The Data Governance Imperative, where he explains:

“Data governance is about changing the hearts and minds of your company to see the value of information quality...data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise...at the root of the problems with managing your data are data quality problems...data governance guarantees that data can be trusted...putting people in charge of fixing and preventing issues with data...to have fewer negative events as a result of poor data.”

Although the book covers data governance more comprehensively, I focused on three of my favorite data quality themes:

  • Business-IT Collaboration
  • Data Quality Assessments
  • People Power

 

Business-IT Collaboration

Data governance establishes policies and procedures to align people throughout the organization.  Successful data quality initiatives require the Business and IT to forge an ongoing and iterative collaboration.  Neither the Business nor IT alone has all of the necessary knowledge and resources required to achieve data quality success.  The Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise and must partner with IT in defining the necessary data quality standards and processes. 

Steve Sarsfield explains:

“Business users need to understand that data quality is everyone's job and not just an issue with technology...the mantra of data governance is that technologists and business users must work together to define what good data is...constantly leverage both business users, who know the value of the data, and technologists, who can apply what the business users know to the data.” 

Data Quality Assessments

Data quality assessments provide a much needed reality check for the perceptions and assumptions that the enterprise has about the quality of its data.  Data quality assessments help with many tasks including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements.  Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.

Steve Sarsfield explains:

“In order to know if you're winning in the fight against poor data quality, you have to keep score...use data quality scorecards to understand the detail about quality of data...and aggregate those scores into business value metrics...solid metrics...give you a baseline against which you can measure improvement over time.” 
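
To make the scorecard idea a bit more concrete, here is a minimal Python sketch of how per-rule pass rates might be rolled up into an aggregate score that can serve as a baseline for measuring improvement over time.  The rules, field names and sample records are hypothetical illustrations, not taken from any particular data quality tool.

    import re

    # Hypothetical data quality rules: each returns True when a record passes.
    RULES = {
        "customer_name_populated": lambda r: bool(r.get("customer_name", "").strip()),
        "postal_code_is_5_digits": lambda r: bool(re.fullmatch(r"\d{5}", r.get("postal_code", ""))),
        "email_contains_at_sign": lambda r: "@" in r.get("email", ""),
    }

    def score_records(records):
        """Return per-rule pass rates and a simple aggregate score."""
        passed = {name: 0 for name in RULES}
        for record in records:
            for name, rule in RULES.items():
                if rule(record):
                    passed[name] += 1
        pass_rates = {name: passed[name] / len(records) for name in RULES}
        aggregate = sum(pass_rates.values()) / len(pass_rates)
        return pass_rates, aggregate

    if __name__ == "__main__":
        sample = [
            {"customer_name": "Ann", "postal_code": "02114", "email": "ann@example.com"},
            {"customer_name": "", "postal_code": "ABCDE", "email": "bob-at-example.com"},
        ]
        rates, overall = score_records(sample)
        for name, rate in rates.items():
            print(f"  {name}: {rate:.0%}")
        print(f"Aggregate score (the baseline to measure against): {overall:.0%}")

A production scorecard would weight each rule by business impact and trend the aggregate over time, but even a simple average provides the baseline against which improvement can be measured.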

People Power

Although incredible advancements continue, technology alone cannot provide the solution.  Data governance and data quality both require a holistic approach involving people, process and technology.  However, by far the most important of the three is people.  In my experience, it is always the people involved that make projects successful.

Steve Sarsfield explains:

“The most important aspect of implementing data governance is that people power must be used to improve the processes within an organization.  Technology will have its place, but it's most importantly the people who set up new processes who make the biggest impact.”

Conclusion

Data governance provides the framework for evolving data quality from a project to an enterprise-wide initiative.  By facilitating the collaboration of business and technical stakeholders, aligning data usage with business metrics, and enabling people to be responsible for data ownership and data quality, data governance provides for the ongoing management of the decision-critical information that drives the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

 

Related Posts

TDWI World Conference Chicago 2009

Not So Strange Case of Dr. Technology and Mr. Business

Schrödinger's Data Quality

The Three Musketeers of Data Quality

 

Additional Resources

Over on Data Quality Pro, read the following posts:

From the IAIDQ publications portal, read the 2008 industry report: The State of Information and Data Governance

Read Steve Sarsfield's book: The Data Governance Imperative and read his blog: Data Governance and Data Quality Insider

Missed It By That Much

In the mission to gain control over data chaos, a project is launched to implement a new system that will help remediate the poor data quality negatively impacting decision-critical enterprise information. 

The project appears to be well planned.  Business requirements were well documented.  A data quality assessment was performed to gain an understanding of the data challenges that would be faced during development and testing.  Detailed architectural and functional specifications were written to guide these efforts.

The project appears to be progressing well.  Business, technical and data issues all come up from time to time.  Meetings are held to prioritize the issues and determine their impact.  Some issues require immediate fixes, while other issues are deferred to the next phase of the project.  All of these decisions are documented and well communicated to the end-user community.

Expectations appear to have been properly set for end-user acceptance testing.

As a best practice, the new system was designed to identify and report exceptions when they occur.  The end-users agreed that an obsessive-compulsive quest to find and fix every data quality problem is a laudable pursuit but ultimately a self-defeating cause.  Data quality problems can be very insidious and even the best data remediation process will still produce exceptions.

Although all of this is easy to accept in theory, it is notoriously difficult to accept in practice.

Once the end-users start reviewing the exceptions, their confidence in the new system drops rapidly.  Even after some enhancements increase the number of records without an exception from 86% to 99% – the end-users continue to focus on the remaining 1% of the records that are still producing data quality exceptions.

Would you believe this incredibly common scenario can prevent acceptance of an overwhelmingly successful implementation?

How about if I quote one of the many people who can help you get smarter than you would by only listening to me?

In his excellent book Why New Systems Fail: Theory and Practice Collide, Phil Simon explains:

“Systems are to be appreciated by their general effects, and not by particular exceptions...

Errors are actually helpful the vast majority of the time.”

In fact, because the new system was designed to identify and report errors when they occur:

“End-users could focus on the root causes of the problem and not have to wade through hundreds of thousands of records in an attempt to find the problem records.”
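
As a rough illustration of that point, here is a minimal Python sketch of an exception report that summarizes errors by rule and source system instead of listing every failing record, so reviewers can concentrate on likely root causes.  The exception records and field names are hypothetical; they are not from Phil Simon's book or any specific system.

    from collections import Counter

    def summarize_exceptions(total_records, exceptions):
        """Group exceptions by (rule, source system) so reviewers see likely
        root causes rather than hundreds of thousands of individual records."""
        by_cause = Counter((e["rule"], e["source_system"]) for e in exceptions)
        pass_rate = 1 - (len(exceptions) / total_records)
        print(f"Records without exceptions: {pass_rate:.1%}")
        for (rule, source), count in by_cause.most_common():
            print(f"{count:6d}  {rule:<25} from {source}")

    if __name__ == "__main__":
        # Hypothetical exceptions identified and reported by the new system.
        exceptions = [
            {"record_id": 101, "rule": "missing_birth_date", "source_system": "CRM"},
            {"record_id": 102, "rule": "missing_birth_date", "source_system": "CRM"},
            {"record_id": 203, "rule": "invalid_country_code", "source_system": "Web Form"},
        ]
        summarize_exceptions(total_records=300, exceptions=exceptions)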

I have seen projects fail in the many ways described by the detailed case studies in Phil Simon's fantastic book.  However, one of the most common and frustrating data quality failures is the project that was so close to being a success, but whose focus on exceptions resulted in the end-users telling us that we “missed it by that much.”

I am not suggesting that end-users are unrealistic, nor that exceptions should be ignored. 

Reducing exceptions (i.e. poor data quality) is the whole point of the project and nobody understands the data better than the end-users.  However, chasing perfection can undermine the best intentions. 

In order to be successful, data quality projects must always be understood as an iterative process.  Small incremental improvements will build momentum to larger success over time. 

Instead of focusing on the exceptions – focus on the improvements. 

And you will begin making steady progress toward improving your data quality.

And loving it!

 

Related Posts

The Data Quality Goldilocks Zone

Schrödinger's Data Quality

The Nine Circles of Data Quality Hell

Worthy Data Quality Whitepapers (Part 1)

In my April blog post Data Quality Whitepapers are Worthless, I called for data quality whitepapers that are worth reading.

This post will be the first in an ongoing series about data quality whitepapers that I have read and can endorse as worthy.

 

It is about the data – the quality of the data

This is the subtitle of two brief but informative data quality whitepapers freely available (no registration required) from the Electronic Commerce Code Management Association (ECCMA): Transparency and Data Portability.

 

ECCMA

ECCMA is an international association of industry and government master data managers working together to increase the quality and lower the cost of descriptions of individuals, organizations, goods and services through developing and promoting International Standards for Master Data Quality. 

Formed in April 1999, ECCMA has brought together thousands of experts from around the world and provides them a means of working together in the fair, open and extremely fast environment of the Internet to build and maintain the global, open standard dictionaries that are used to unambiguously label information.  The existence of these dictionaries of labels allows information to be passed from one computer system to another without losing meaning.

 

Peter Benson

The author of the whitepapers is Peter Benson, the Executive Director and Chief Technical Officer of the ECCMA.  Peter is an expert in distributed information systems, content encoding and master data management.  He designed one of the very first commercial electronic mail software applications, WordStar Messenger, and was granted a landmark British patent in 1992 covering the use of electronic mail systems to maintain distributed databases.

Peter designed and oversaw the development of a number of strategic distributed database management systems used extensively in the UK and US by the Public Relations and Media Industries.  From 1994 to 1998, Peter served as the elected chairman of the American National Standards Institute Accredited Committee ANSI ASCX 12E, the Standards Committee responsible for the development and maintenance of the EDI standard for product data.

Peter is known for the design, development and global promotion of the UNSPSC as an internationally recognized commodity classification and more recently for the design of the eOTD, an internationally recognized open technical dictionary based on the NATO codification system.

Peter is an expert in the development and maintenance of Master Data Quality as well as an internationally recognized proponent of Open Standards that he believes are critical to protect data assets from the applications used to create and manipulate them. 

Peter is the Project Leader for ISO 8000, which is a new international standard for data quality.

You can get more information about ISO 8000 by clicking on this link: ISO 8000

 

Whitepaper Excerpts

Excerpts from Transparency:

  • “Today, more than ever before, our access to data, the ability of our computer applications to use it and the ultimate accuracy of the data determines how we see and interact with the world we live and work in.”
  • “Data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”
  • “Transparency requires that transaction data accurately identifies who, what, where and when and master data accurately describes who, what and where.”
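
The distinction drawn in the second and third excerpts, between master data that describes things and transaction data that describes events, can be sketched as two simple record types.  This is only a hypothetical Python illustration of the idea; it is not prescribed by the whitepaper or by ECCMA.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class MasterRecord:
        """Identifies and describes a thing: a who, a what, or a where."""
        identifier: str   # e.g. a customer, product, or location code
        description: str

    @dataclass
    class TransactionRecord:
        """Describes an event by pointing at master data: who, what, where, and when."""
        who: str          # identifier of a MasterRecord
        what: str         # identifier of a MasterRecord
        where: str        # identifier of a MasterRecord
        when: datetime

    # A transaction is only as transparent as the master data it references.
    customer = MasterRecord("CUST-001", "Acme Corporation")
    product = MasterRecord("PROD-042", "Industrial widget, 42 mm")
    warehouse = MasterRecord("LOC-NY-01", "New York distribution center")
    sale = TransactionRecord(customer.identifier, product.identifier,
                             warehouse.identifier, datetime(2009, 7, 14))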

 

Excerpts from Data Portability:

  • “In an environment where the life cycle of software applications used to capture and manage data is but a fraction of the life cycle of the data itself, the issues of data portability and long-term data preservation are critical.”
  • “Claims that an application exports data in XML does address the syntax part of the problem, but that is the easy part.  What is required is to be able to export all of the data in a form that can be easily uploaded into another application.”
  • “In a world rapidly moving towards SaaS and cloud computing, it really pays to pause and consider not just the physical security of your data but its portability.”

 

Not So Strange Case of Dr. Technology and Mr. Business

Strange Case of Dr Jekyll and Mr Hyde was Robert Louis Stevenson's classic novella about the duality of human nature and the inner conflict of our personal sense of good and evil that can undermine our noblest intentions.  The novella exemplified this inner conflict using the self-destructive split-personality of Henry Jekyll and Edward Hyde.

The duality of data quality's nature can sometimes cause an organizational conflict between the Business and IT.  The complexity of a data quality project can sometimes work against your best intentions.  Knowledge about data, business processes and supporting technology are spread throughout the organization. 

Neither the Business nor IT alone has all of the necessary information required to achieve data quality success. 

As a data quality consultant, I am often asked to wear many hats – and not just because my balding head is distractingly shiny. 

I often play a hybrid role that helps facilitate the business and technical collaboration of the project team.

I refer to this hybrid role as using the split-personality of Dr. Technology and Mr. Business.

 

Dr. Technology

With relatively few exceptions, IT is usually the first group that I meet with when I begin an engagement with a new client.  However, this doesn't mean that IT is more important than the Business.  Consultants are commonly brought on board after the initial business requirements have been drafted and the data quality tool has been selected.  Meeting with IT first is especially common if one of my tasks is to help install and configure the data quality tool.

When I meet with IT, I use my Dr. Technology personality.  IT needs to know that I am there to share my extensive experience and best practices from successful data quality projects to help them implement a well architected technical solution.  I ask about data quality solutions that have been attempted previously, how well they were received by the Business, and if they are still in use.  I ask if IT has any issues with or concerns about the data quality tool that was selected.

I review the initial business requirements with IT to make sure I understand any specific technical challenges such as data access, server capacity, security protocols, scheduled maintenance and after-hours support.  I freely “geek out” in techno-babble.  I debate whether Farscape or Battlestar Galactica was the best science fiction series in television history.  I verify the favorite snack foods of the data architects, DBAs, and server administrators since whenever I need a relational database table created or more temporary disk space allocated, I know the required currency will often be Mountain Dew and Doritos.

 

Mr. Business

When I meet with the Business for the first time, I do so without my IT entourage and I use my Mr. Business personality.  The Business needs to know that I am there to help customize a technical solution to their specific business needs.  I ask them to share their knowledge in their natural language using business terminology.  Regardless of my experience with other companies in their industry, every organization and their data is unique.  No assumptions should be made by any of us.

I review the initial requirements with the Business to make sure I understand who owns the data and how it is used to support the day-to-day operation of each business unit and initiative.  I ask if the requirements were defined before or after the selection of the data quality tool.  Knowing how the data quality tool works can sometimes cause a “framing effect” where requirements are defined in terms of tool functionality, framing them as a technical problem instead of a business problem.  All data quality tools provide viable solutions driven by impressive technology.  Therefore, the focus should always be on stating the problem and solution criteria in business terms.

 

Dr. Technology and Mr. Business Must Work Together

As the cross-functional project team starts working together, my Dr. Technology and Mr. Business personalities converge to help clarify communication by providing bi-directional translation, mentoring, documentation, training and knowledge transfer.  I can help interpret business requirements and functional specifications, help explain business and technical challenges, and help maintain an ongoing dialogue between the Business and IT. 

I can also help each group save face by playing the important role of Designated Asker of Stupid Questions – one of those intangible skills you can't find anywhere on my resume.

As the project progresses, the communication and teamwork between the Business and IT will become more and more natural and I will become less and less necessary – one of my most important success criteria.

 

Success is Not So Strange

When the Business and IT forge an ongoing collaborative partnership throughout the entire project, success is not so strange.

In fact, your data quality project can be the beginning of a beautiful friendship between the Business and IT. 

Everyone on the project team can develop a healthy split-personality. 

IT can use their Mr. Business (or Ms. Business) personality to help them understand the intricacies of business processes. 

The Business can use their Dr. Technology personality to help them “get their geek on.”

 

Data quality success is all about shiny happy people holding hands – and what's so strange about that?

 

Related Posts

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You

 

Additional Resources

From the Data Quality Pro forum, read the discussion: Data Quality is not an IT issue

From the blog Inside the Biz with Jill Dyché, read her posts:

From Paul Erb's blog Comedy of the Commons, read his post: I Don't Know Much About Data, but I Know What I Like