The Wisdom of Failure

Earlier this month, I had the honor of being interviewed by Ajay Ohri on his blog Decision Stats, which is an excellent source of insights on business intelligence and data mining as well as interviews with industry thought leaders and chief evangelists.

One of the questions Ajay asked me during my interview was what methods and habits I would recommend to young analysts just starting out in the business intelligence field.  Part of my response was:

“Don't be afraid to ask questions or admit when you don't know the answers.  The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don't know the answers.”

It is perhaps one of life’s cruelest paradoxes that some lessons simply cannot be taught, but instead have to be learned through the pain of making mistakes.  To err is human, but not all humans learn from their errors.  In fact, some of us find it extremely difficult to even simply acknowledge when we have made a mistake.  This was certainly true for me earlier in my career.

 

The Wisdom of Crowds

One of my favorite books is The Wisdom of Crowds by James Surowiecki.  Before reading it, I admit that I believed crowds were incapable of wisdom and that the best decisions are based on the expert advice of carefully selected individuals.  However, Surowiecki wonderfully elucidates the folly of “chasing the expert” and explains the four conditions that characterize wise crowds: diversity of opinion, independent thinking, decentralization and aggregation.  The book is also balanced by examining the conditions (e.g. confirmation bias and groupthink) that can commonly undermine the wisdom of crowds.  All in all, it is a wonderful discourse on both collective intelligence and collective ignorance with practical advice on how to achieve the former and avoid the latter.

 

Chasing the Data Quality Expert

Without question, a data quality expert can be an invaluable member of your team.  Often an external consultant, a data quality expert can provide extensive experience and best practices from successful implementations.  However, regardless of their experience, even with other companies in your industry, every organization and its data is unique.  An expert's perspective definitely has merit, but their opinions and advice should not be allowed to dominate the decision making process. 

“The more power you give a single individual in the face of complexity,” explains Surowiecki, “the more likely it is that bad decisions will get made.”  No one person regardless of their experience and expertise can succeed on their own.  According to Surowiecki, the best experts “recognize the limits of their own knowledge and of individual decision making.”

 

“Success is on the far side of failure”

One of the most common obstacles organizations face with data quality initiatives is that many initial attempts end in failure.  Some fail because of lofty expectations, unmanaged scope creep, and the unrealistic perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.  However, regardless of the reason for the failure, it can negatively affect morale and cause employees to resist participating in the next data quality effort.

Although a common best practice is to perform a post-mortem in order to document the lessons learned, sometimes the stigma of failure persuades an organization to either skip the post-mortem or ignore its findings. 

However, in the famous words of IBM founder Thomas J. Watson: “Success is on the far side of failure.” 

A failed data quality initiative may have been closer to success than you realize.  At the very least, there are important lessons to be learned from the mistakes that were made.  The sooner you can recognize your mistakes, the sooner you can mitigate their effects and hopefully prevent them from happening again.

 

The Wisdom of Failure

In one of my other favorite books, How We Decide, Jonah Lehrer explains:

“The brain always learns the same way, accumulating wisdom through error...there are no shortcuts to this painstaking process...becoming an expert just takes time and practice...once you have developed expertise in a particular area...you have made the requisite mistakes.”

Therefore, although it may be true that experience is the path that separates knowledge from wisdom, I have come to realize that the true wisdom of my experience is the wisdom of failure.

 

Related Posts

A Portrait of the Data Quality Expert as a Young Idiot

All I Really Need To Know About Data Quality I Learned In Kindergarten

The Nine Circles of Data Quality Hell

Getting Your Data Freq On

One of the most basic features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found within the fields of your data sources. 

Data profiling is often performed during a data quality assessment.  Data profiling involves much more than reviewing the output generated by a data profiling tool, and a data quality assessment obviously involves much more than just data profiling.

However, in this post I want to focus on some of the benefits of using a data profiling tool.

 

Freq'ing Awesome Analysis

Data profiling can help you perform essential analysis such as:

  • Verifying data matches the metadata that describes it
  • Identifying missing values
  • Identifying potential default values
  • Identifying potential invalid values
  • Checking data formats for inconsistencies
  • Preparing meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage and other important data architecture considerations.

 

How can a data profiling tool help you?  Let me count the ways

Data profiling tools provide counts and percentages for each field that summarize its content characteristics such as:

  • NULL - count of the number of records with a NULL value
  • Missing - count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
  • Actual - count of the number of records with an actual value (i.e. non-NULL and non-missing)
  • Completeness - percentage calculated as Actual divided by the total number of records
  • Cardinality - count of the number of distinct actual values
  • Uniqueness - percentage calculated as Cardinality divided by the total number of records
  • Distinctness - percentage calculated as Cardinality divided by Actual

The absence of data can be represented in many different ways, with NULL being the most common for relational database columns.  However, character fields can contain all spaces or an empty string, and numeric fields can contain all zeroes.  Consistently representing the absence of data is a common data quality standard.
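
To make these content characteristics concrete, here is a minimal sketch in plain Python showing how the counts and percentages listed above could be calculated for a single field.  The sample values, and the treatment of empty or all-space strings as missing, are illustrative assumptions rather than the behavior of any particular data profiling tool.

```python
# Minimal profiling sketch for a single field (illustrative sample values).
# None represents NULL; blank or all-space strings are treated as missing.
values = ["1001", "1002", "", "   ", None, "1002", "1003", None]

total = len(values)
null_count = sum(1 for v in values if v is None)
missing_count = sum(1 for v in values if v is not None and str(v).strip() == "")
actual_values = [v for v in values if v is not None and str(v).strip() != ""]
actual_count = len(actual_values)

cardinality = len(set(actual_values))                                # distinct actual values
completeness = actual_count / total                                  # Actual / total records
uniqueness = cardinality / total                                     # Cardinality / total records
distinctness = cardinality / actual_count if actual_count else 0.0   # Cardinality / Actual

print(f"NULL: {null_count}  Missing: {missing_count}  Actual: {actual_count}")
print(f"Completeness: {completeness:.0%}  Uniqueness: {uniqueness:.0%}  Distinctness: {distinctness:.0%}")
```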

Completeness and uniqueness are particularly useful in evaluating potential key fields, especially a single primary key, which should be both 100% complete and 100% unique.  Required non-key fields may often be 100% complete, but a low cardinality could indicate the presence of potential default values.

Distinctness can be useful in evaluating the potential for duplicate records.  For example, a Tax ID field may be less than 100% complete (i.e. not every record has one) and therefore also less than 100% unique (i.e. it cannot be considered a potential single primary key because it cannot be used to uniquely identify every record).  If the Tax ID field is also less than 100% distinct (i.e. some distinct actual values occur on more than one record), then this could indicate the presence of potential duplicate records.
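
As a rough illustration of using distinctness this way, the sketch below flags distinct actual Tax ID values that occur on more than one record.  The sample values and the exact-value grouping are assumptions for illustration; identifying true duplicates would ultimately rely on data matching techniques.

```python
from collections import Counter

# Illustrative Tax ID values: None is NULL, blank strings are missing.
tax_ids = ["12-3456789", None, "98-7654321", "12-3456789", "", "55-1234567"]

actual = [t for t in tax_ids if t is not None and t.strip() != ""]

# Distinct actual values occurring on more than one record are candidates
# for further duplicate investigation (the field is less than 100% distinct).
frequency = Counter(actual)
potential_duplicates = {value: count for value, count in frequency.items() if count > 1}
print(potential_duplicates)   # {'12-3456789': 2}
```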

Data profiling tools will often generate many other useful summary statistics for each field including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).

 

Show Me the Value (or the Format)

A frequency distribution of the unique formats found in a field is sometimes more useful than the unique values.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality (i.e. indicating potential default values)
  • Fields with a relatively low cardinality (e.g. gender code and source system code)
  • Fields with a relatively small number of valid values (e.g. state abbreviation and country code)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g. integer surrogate key or ZIP+4 add-on code)
  • Fields with a relatively limited number of valid formats (e.g. telephone number and birth date)
  • Fields with free-form values and a high cardinality  (e.g. customer name and postal address)

Cardinality can play a major role in deciding whether or not you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values.
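
As a rough sketch of both kinds of frequency distribution, the following Python snippet tallies unique values and unique formats, masking letters as A and digits as 9, and limits the review to the most frequently occurring entries.  The masking convention and the sample telephone numbers are assumptions for illustration, not the output of any particular data profiling tool.

```python
from collections import Counter

# Illustrative telephone numbers with inconsistent formatting.
phones = ["555-123-4567", "(555) 987-6543", "555-123-4567",
          "5551234567", "555-111-2222", "(555) 555-5555"]

def format_mask(value: str) -> str:
    """Mask letters as 'A' and digits as '9', keeping punctuation as-is."""
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

value_freq = Counter(phones)                           # frequency of unique values
format_freq = Counter(format_mask(p) for p in phones)  # frequency of unique formats

# For high cardinality fields, review only the most frequently occurring entries.
print(value_freq.most_common(3))
print(format_freq.most_common(3))
```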

Some fields can also be alternatively analyzed using partial values (e.g. birth year extracted from birth date) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers). 

Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, given name, family name, title).

 

Conclusion

Understanding your data is essential to using it effectively and improving its quality.  In order to achieve these goals, there is simply no substitute for data analysis.

A data profiling tool can help you by automating some of the grunt work needed to begin this analysis.  However, it is important to remember that the analysis itself cannot be automated: you need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more importantly, translate your analysis into meaningful reports and questions to share with the rest of the project team.  Well performed data profiling is a highly interactive and iterative process.

Data profiling is typically one of the first tasks performed on a data quality project.  This is especially true when data is made available before business requirements are documented and subject matter experts are available to discuss usage, relevancy, standards and the metrics for measuring and improving data quality – all of which are necessary to progress from profiling your data to performing a full data quality assessment.  However, the absence of these is not an acceptable excuse for delaying data profiling.

 

Therefore, grab your favorite caffeinated beverage, settle into your most comfortable chair, roll up your sleeves and...

Get your data freq on! 

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Schrödinger's Data Quality

Data Gazers

The Very True Fear of False Positives

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

The need for data matching solutions is one of the primary reasons that companies invest in data quality software and services.

The great news is that there are many data quality vendors to choose from and all of them offer viable data matching solutions driven by impressive technologies and proven methodologies.

The not so great news is that the wonderful world of data matching has a very weird way with words.  Discussions about data matching techniques often include advanced mathematical terms like deterministic record linkage, probabilistic record linkage, Fellegi-Sunter algorithm, Bayesian statistics, conditional independence, bipartite graphs, or my personal favorite:

The redundant data capacitor, which makes accurate data matching possible using only 1.21 gigawatts of electricity and a customized DeLorean DMC-12 accelerated to 88 miles per hour.

All data matching techniques provide some way to rank their match results (e.g. numeric probabilities, weighted percentages, odds ratios, confidence levels).  Ranking is often used as a primary method in differentiating the three possible result categories:

  1. Automatic Matches
  2. Automatic Non-Matches
  3. Potential Matches requiring manual review
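
To illustrate how a ranking can drive these three categories, here is a minimal sketch that classifies match scores against two thresholds.  The score scale, the threshold values and the record pairs are assumptions for illustration, not the output of any particular data matching technique; tuning the thresholds is precisely what trades one head of the monster described below for the other.

```python
# Classify ranked match results into the three possible categories.
# The 0-to-1 score scale and both thresholds are illustrative assumptions.
AUTO_MATCH_THRESHOLD = 0.90
AUTO_NON_MATCH_THRESHOLD = 0.60

def classify(score: float) -> str:
    if score >= AUTO_MATCH_THRESHOLD:
        return "Automatic Match"
    if score < AUTO_NON_MATCH_THRESHOLD:
        return "Automatic Non-Match"
    return "Potential Match (manual review)"

# Hypothetical scored record pairs produced by some matching technique.
scored_pairs = [("record 1", "record 2", 0.97),
                ("record 3", "record 4", 0.72),
                ("record 5", "record 6", 0.41)]

for left, right, score in scored_pairs:
    print(f"{left} vs {right}: {score:.2f} -> {classify(score)}")
```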

All data matching techniques must also face the daunting challenge of what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched

For data examples that illustrate the challenge of false negatives and false positives, please refer to my Data Quality Pro articles:

 

Data Matching Techniques

Industry analysts, experts, vendors and consultants often engage in heated debates about the different approaches to data matching.  I have personally participated in many of these debates and I certainly have my own strong opinions based on over 15 years of professional services, application development and software engineering experience with data matching. 

However, I am not going to try to convince you which data matching technique provides the superior solution (at least not until Doc Brown and I get our patent-pending prototype of the redundant data capacitor working) because I firmly believe in the following two things:

  1. Any opinion is biased by the practical limits of personal experience and motivated by the kind folks paying your salary
  2. There is no such thing as the best data matching technique: every data matching technique has its pros and cons

But in the interests of full disclosure, the voices in my head have advised me to inform you that I have spent most of my career in the Fellegi-Sunter fan club.  Therefore, I will freely admit to having a strong bias for data matching software that uses probabilistic record linkage techniques. 

However, I have used software from most of the Gartner Data Quality Magic Quadrant and many of the so-called niche vendors.  Without exception, I have always been able to obtain the desired results regardless of the data matching techniques provided by the software.

For more detailed information about data matching techniques, please refer to the Additional Resources listed below.

 

The Very True Fear of False Positives

Fundamentally, the primary business problem being solved by data matching is the reduction of false negatives: the identification of records, within and across existing systems, that are not currently linked and are therefore preventing the enterprise from understanding the true data relationships that exist in its information assets.

However, the pursuit to reduce false negatives carries with it the risk of creating false positives. 

In my experience, I have found that clients are far more concerned about the potential negative impact on business decisions caused by false positives in the records automatically linked by data matching software than they are about the false negatives left unlinked; after all, those records were not linked before investing in the data matching software.  Not solving an existing problem is commonly perceived to be not as bad as creating a new problem.

The very true fear of false positives often motivates the implementation of an overly cautious approach to data matching that results in the perpetuation of false negatives.  Furthermore, this often restricts the implementation to exact (or near-exact) matching techniques and ignores the more robust capabilities of the data matching software to find potential matches.

When this happens, many points in the heated debate about the different approaches to data matching are rendered moot.  In fact, one of the industry's dirty little secrets is that many data matching applications could have been successfully implemented without the investment in data matching software because of the overly cautious configuration of the matching criteria.

My point is neither to discourage the purchase of data matching software, nor to suggest that the very true fear of false positives should simply be accepted. 

My point is that data matching debates often ignore this pragmatic concern.  It is these human and business factors, and not just the technology itself, that need to be taken into consideration when planning a data matching implementation.

While acknowledging the very true fear of false positives, I try to help my clients believe that this fear can and should be overcome.  The harsh reality is that there is no perfect data matching solution.  The risk of false positives can be mitigated but never eliminated.  However, the risks inherent in data matching are worth the rewards.

Data matching must be understood to be just as much about art and philosophy as it is about science and technology.

 

Additional Resources

Data Quality and Record Linkage Techniques

The Art of Data Matching

Identifying Duplicate Customer Records - Case Study

Narrative Fallacy and Data Matching

Speaking of Narrative Fallacy

The Myth of Matching: Why We Need Entity Resolution

The Human Element in Identity Resolution

Probabilistic Matching: Sounds like a good idea, but...

Probabilistic Matching: Part Two

Worthy Data Quality Whitepapers (Part 2)

Overall Approach to Data Quality ROI

Overall Approach to Data Quality ROI is a worthy data quality whitepaper freely available (name and email required for download) from the McKnight Consulting Group.

 

William McKnight

The author of the whitepaper is William McKnight, President of McKnight Consulting Group.  William focuses on delivering business value and solving business problems utilizing proven, streamlined approaches in data warehousing, master data management and business intelligence, all with a focus on data quality and scalable architectures.  William has more than 20 years of information management experience, nearly half of which was gained in IT leadership positions, dealing firsthand with the challenging issues his clients now face.  His IT and consulting teams have won best practice competitions for their implementations.  In 11 years of consulting, he has been a part of 150 client programs worldwide, has over 300 articles, whitepapers and tips in publication and is a frequent international speaker.  William and his team provide clients with action plans, architectures, complete programs, vendor-neutral tool selection and right-fit resources. 

Additionally, William has an excellent blog on the B-eye-Network and a new course now available on eLearningCurve.

 

Whitepaper Excerpts

Excerpts from Overall Approach to Data Quality ROI:

  • “Data quality is an elusive subject that can defy measurement and yet be critical enough to derail any single IT project, strategic initiative, or even a company as a whole.”
  • “Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise.  Put simply, it is a competitive strategy.”
  • Six key steps to help you realize tangible ROI on your data quality initiative:
    1. System Profiling – survey and prioritize your company systems according to their use of and need for quality data.
    2. Data Quality Rule Determination – data quality can be defined as a lack of intolerable defects.
    3. Data Profiling – usually no one can articulate how clean or dirty corporate data is.  Without this measurement of cleanliness, the effectiveness of activities that are aimed at improving data quality cannot be measured.
    4. Data Quality Scoring – scoring is a relative measure of conformance to rules.  System scores are an aggregate of the rule scores for that system and the overall score is a prorated aggregation of the system scores.
    5. Measure Impact of Various Levels of Data Quality – ROI is about accumulating all returns and investments from a project’s build, maintenance, and associated business and IT activities through to the ultimate desired results – all while considering the possible outcomes and their likelihood.
    6. Data Quality Improvement – it is much more costly to fix data quality errors in downstream systems than it is at the point of origin.
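
As a rough interpretation of steps 4 and 5, the sketch below scores each rule as the percentage of records conforming to it, aggregates rule scores into a system score, and prorates system scores by record volume into an overall score.  The sample systems, rules and the choice of proration by record counts are my assumptions for illustration, not the whitepaper's actual methodology.

```python
# Illustrative data quality scoring roll-up (assumed interpretation).
# Each rule score is the fraction of records conforming to that rule.
systems = {
    "CRM":     {"records": 500_000, "rule_scores": [0.98, 0.91, 0.87]},
    "Billing": {"records": 200_000, "rule_scores": [0.99, 0.95]},
}

def system_score(rule_scores):
    # System score as an aggregate (simple average) of its rule scores.
    return sum(rule_scores) / len(rule_scores)

scores = {name: system_score(s["rule_scores"]) for name, s in systems.items()}

# Overall score prorated by each system's share of the total record volume.
total_records = sum(s["records"] for s in systems.values())
overall = sum(scores[name] * s["records"] / total_records for name, s in systems.items())

for name, score in scores.items():
    print(f"{name}: {score:.1%}")
print(f"Overall: {overall:.1%}")
```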
 

Related Posts

Worthy Data Quality Whitepapers (Part 1)

Data Quality Whitepapers are Worthless

Data Quality Blogging All-Stars

The 2009 Major League Baseball (MLB) All-Star Game is being held tonight at Busch Stadium in St. Louis, Missouri. 

For those readers who are not baseball fans, the All-Star Game is an annual exhibition held in mid-July that showcases the players with the best statistical performances from the first half of the MLB season.

As I watch the 80th Midsummer Classic, I offer this exhibition that showcases the bloggers with the posts I have most enjoyed reading from the first half of the 2009 data quality blogging season.

 

Dylan Jones

From Data Quality Pro:

 

Daragh O Brien

From The DOBlog:

 

Steve Sarsfield

From Data Governance and Data Quality Insider:

 

Daniel Gent

From Data Quality Edge:

 

Henrik Liliendahl Sørensen

From Liliendahl on Data Quality:

 

Stefanos Damianakis

From Netrics HD:

 

Vish Agashe

From Business Intelligence: Process, People and Products:

 

Mark Goloboy

From Boston Data, Technology & Analytics:

 

Additional Resources

Over on Data Quality Pro, read the data quality blog roundups from the first half of 2009:

From the IAIDQ, read the 2009 issues of the IAIDQ Blog Carnival:

Data Governance and Data Quality

Regular readers know that I often blog about the common mistakes I have observed (and made) in my professional services and application development experience in data quality (for example, see my post: The Nine Circles of Data Quality Hell).

According to Wikipedia: “Data governance is an emerging discipline with an evolving definition.  The discipline embodies a convergence of data quality, data management, business process management, and risk management surrounding the handling of data in an organization.”

Since I have never formally used the term “data governance” with my clients, I have been researching what data governance is and how it specifically relates to data quality.

Thankfully, I found a great resource in Steve Sarsfield's excellent book The Data Governance Imperative, where he explains:

“Data governance is about changing the hearts and minds of your company to see the value of information quality...data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise...at the root of the problems with managing your data are data quality problems...data governance guarantees that data can be trusted...putting people in charge of fixing and preventing issues with data...to have fewer negative events as a result of poor data.”

Although the book covers data governance more comprehensively, I focused on three of my favorite data quality themes:

  • Business-IT Collaboration
  • Data Quality Assessments
  • People Power

 

Business-IT Collaboration

Data governance establishes policies and procedures to align people throughout the organization.  Successful data quality initiatives require the Business and IT to forge an ongoing and iterative collaboration.  Neither the Business nor IT alone has all of the necessary knowledge and resources required to achieve data quality success.  The Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise and must partner with IT in defining the necessary data quality standards and processes. 

Steve Sarsfield explains:

“Business users need to understand that data quality is everyone's job and not just an issue with technology...the mantra of data governance is that technologists and business users must work together to define what good data is...constantly leverage both business users, who know the value of the data, and technologists, who can apply what the business users know to the data.” 

Data Quality Assessments

Data quality assessments provide a much needed reality check for the perceptions and assumptions that the enterprise has about the quality of its data.  Data quality assessments help with many tasks including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements.  Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.

Steve Sarsfield explains:

“In order to know if you're winning in the fight against poor data quality, you have to keep score...use data quality scorecards to understand the detail about quality of data...and aggregate those scores into business value metrics...solid metrics...give you a baseline against which you can measure improvement over time.” 

People Power

Although incredible advancements continue, technology alone cannot provide the solution.  Data governance and data quality both require a holistic approach involving people, process and technology.  However, by far the most important of the three is people.  In my experience, it is always the people involved that make projects successful.

Steve Sarsfield explains:

“The most important aspect of implementing data governance is that people power must be used to improve the processes within an organization.  Technology will have its place, but it's most importantly the people who set up new processes who make the biggest impact.”

Conclusion

Data governance provides the framework for evolving data quality from a project to an enterprise-wide initiative.  By facilitating the collaboration of business and technical stakeholders, aligning data usage with business metrics, and enabling people to be responsible for data ownership and data quality, data governance provides for the ongoing management of the decision-critical information that drives the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

 

Related Posts

TDWI World Conference Chicago 2009

Not So Strange Case of Dr. Technology and Mr. Business

Schrödinger's Data Quality

The Three Musketeers of Data Quality

 

Additional Resources

Over on Data Quality Pro, read the following posts:

From the IAIDQ publications portal, read the 2008 industry report: The State of Information and Data Governance

Read Steve Sarsfield's book: The Data Governance Imperative and read his blog: Data Governance and Data Quality Insider

Missed It By That Much

In the mission to gain control over data chaos, a project is launched in order to implement a new system to help remediate the poor data quality that is negatively impacting decision-critical enterprise information. 

The project appears to be well planned.  Business requirements were well documented.  A data quality assessment was performed to gain an understanding of the data challenges that would be faced during development and testing.  Detailed architectural and functional specifications were written to guide these efforts.

The project appears to be progressing well.  Business, technical and data issues all come up from time to time.  Meetings are held to prioritize the issues and determine their impact.  Some issues require immediate fixes, while other issues are deferred to the next phase of the project.  All of these decisions are documented and well communicated to the end-user community.

Expectations appear to have been properly set for end-user acceptance testing.

As a best practice, the new system was designed to identify and report exceptions when they occur.  The end-users agreed that an obsessive-compulsive quest to find and fix every data quality problem is a laudable pursuit but ultimately a self-defeating cause.  Data quality problems can be very insidious and even the best data remediation process will still produce exceptions.

Although all of this is easy to accept in theory, it is notoriously difficult to accept in practice.

Once the end-users start reviewing the exceptions, their confidence in the new system drops rapidly.  Even after some enhancements increase the number of records without an exception from 86% to 99%, the end-users continue to focus on the remaining 1% of the records that are still producing data quality exceptions.

Would you believe this incredibly common scenario can prevent acceptance of an overwhelmingly successful implementation?

How about if I quote one of the many people who can help you get smarter than you would by only listening to me?

In his excellent book Why New Systems Fail: Theory and Practice Collide, Phil Simon explains:

“Systems are to  be appreciated by their general effects, and not by particular exceptions...

Errors are actually helpful the vast majority of the time.”

In fact, because the new system was designed to identify and report errors when they occur:

“End-users could focus on the root causes of the problem and not have to wade through hundreds of thousands of records in an attempt to find the problem records.”

I have seen projects fail in the many ways described by detailed case studies in Phil Simon's fantastic book.  However, one of the most common and frustrating data quality failures is the project that was so close to being a success, but where the focus on exceptions resulted in the end-users telling us that we “missed it by that much.”

I am neither suggesting that end-users are unrealistic nor that exceptions should be ignored. 

Reducing exceptions (i.e. poor data quality) is the whole point of the project and nobody understands the data better than the end-users.  However, chasing perfection can undermine the best intentions. 

In order to be successful, data quality projects must always be understood as an iterative process.  Small incremental improvements will build momentum to larger success over time. 

Instead of focusing on the exceptions – focus on the improvements. 

And you will begin making steady progress toward improving your data quality.

And loving it!

 

Related Posts

The Data Quality Goldilocks Zone

Schrödinger's Data Quality

The Nine Circles of Data Quality Hell

Worthy Data Quality Whitepapers (Part 1)

In my April blog post Data Quality Whitepapers are Worthless, I called for data quality whitepapers that are worth reading.

This post will be the first in an ongoing series about data quality whitepapers that I have read and can endorse as worthy.

 

It is about the data – the quality of the data

This is the subtitle of two brief but informative data quality whitepapers freely available (no registration required) from the Electronic Commerce Code Management Association (ECCMA): Transparency and Data Portability.

 

ECCMA

ECCMA is an international association of industry and government master data managers working together to increase the quality and lower the cost of descriptions of individuals, organizations, goods and services through developing and promoting International Standards for Master Data Quality. 

Formed in April 1999, ECCMA has brought together thousands of experts from around the world and provides them a means of working together in the fair, open and extremely fast environment of the Internet to build and maintain the global, open standard dictionaries that are used to unambiguously label information.  The existence of these dictionaries of labels allows information to be passed from one computer system to another without losing meaning.

 

Peter Benson

The author of the whitepapers is Peter Benson, the Executive Director and Chief Technical Officer of the ECCMA.  Peter is an expert in distributed information systems, content encoding and master data management.  He designed one of the very first commercial electronic mail software applications, WordStar Messenger and was granted a landmark British patent in 1992 covering the use of electronic mail systems to maintain distributed databases.

Peter designed and oversaw the development of a number of strategic distributed database management systems used extensively in the UK and US by the Public Relations and Media Industries.  From 1994 to 1998, Peter served as the elected chairman of the American National Standards Institute Accredited Committee ANSI ASCX 12E, the Standards Committee responsible for the development and maintenance of EDI standard for product data.

Peter is known for the design, development and global promotion of the UNSPSC as an internationally recognized commodity classification and more recently for the design of the eOTD, an internationally recognized open technical dictionary based on the NATO codification system.

Peter is an expert in the development and maintenance of Master Data Quality as well as an internationally recognized proponent of Open Standards that he believes are critical to protect data assets from the applications used to create and manipulate them. 

Peter is the Project Leader for ISO 8000, which is a new international standard for data quality.


 

Whitepaper Excerpts

Excerpts from Transparency:

  • “Today, more than ever before, our access to data, the ability of our computer applications to use it and the ultimate accuracy of the data determines how we see and interact with the world we live and work in.”
  • “Data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”
  • “Transparency requires that transaction data accurately identifies who, what, where and when and master data accurately describes who, what and where.”

 

Excerpts from Data Portability:

  • “In an environment where the life cycle of software applications used to capture and manage data is but a fraction of the life cycle of the data itself, the issues of data portability and long-term data preservation are critical.”
  • “Claims that an application exports data in XML does address the syntax part of the problem, but that is the easy part.  What is required is to be able to export all of the data in a form that can be easily uploaded into another application.”
  • “In a world rapidly moving towards SaaS and cloud computing, it really pays to pause and consider not just the physical security of your data but its portability.”

 

Not So Strange Case of Dr. Technology and Mr. Business

Strange Case of Dr Jekyll and Mr Hyde was Robert Louis Stevenson's classic novella about the duality of human nature and the inner conflict of our personal sense of good and evil that can undermine our noblest intentions.  The novella exemplified this inner conflict using the self-destructive split-personality of Henry Jekyll and Edward Hyde.

The duality of data quality's nature can sometimes cause an organizational conflict between the Business and IT.  The complexity of a data quality project can sometimes work against your best intentions.  Knowledge about data, business processes and supporting technology is spread throughout the organization.

Neither the Business nor IT alone has all of the necessary information required to achieve data quality success. 

As a data quality consultant, I am often asked to wear many hats – and not just because my balding head is distractingly shiny. 

I often play a hybrid role that helps facilitate the business and technical collaboration of the project team.

I refer to this hybrid role as using the split-personality of Dr. Technology and Mr. Business.

 

Dr. Technology

With relatively few exceptions, IT is usually the first group that I meet with when I begin an engagement with a new client.  However, this doesn't mean that IT is more important than the Business.  Consultants are commonly brought on board after the initial business requirements have been drafted and the data quality tool has been selected.  Meeting with IT first is especially common if one of my tasks is to help install and configure the data quality tool.

When I meet with IT, I use my Dr. Technology personality.  IT needs to know that I am there to share my extensive experience and best practices from successful data quality projects to help them implement a well architected technical solution.  I ask about data quality solutions that have been attempted previously, how well they were received by the Business, and if they are still in use.  I ask if IT has any issues with or concerns about the data quality tool that was selected.

I review the initial business requirements with IT to make sure I understand any specific technical challenges such as data access, server capacity, security protocols, scheduled maintenance and after-hours support.  I freely “geek out” in techno-babble.  I debate whether Farscape or Battlestar Galactica was the best science fiction series in television history.  I verify the favorite snack foods of the data architects, DBAs, and server administrators since whenever I need a relational database table created or more temporary disk space allocated, I know the required currency will often be Mountain Dew and Doritos.

 

Mr. Business

When I meet with the Business for the first time, I do so without my IT entourage and I use my Mr. Business personality.  The Business needs to know that I am there to help customize a technical solution to their specific business needs.  I ask them to share their knowledge in their natural language using business terminology.  Regardless of my experience with other companies in their industry, every organization and its data is unique.  No assumptions should be made by any of us.

I review the initial requirements with the Business to make sure I understand who owns the data and how it is used to support the day-to-day operation of each business unit and initiative.  I ask if the requirements were defined before or after the selection of the data quality tool.  Knowing how the data quality tool works can sometimes cause a “framing effect” where requirements are defined in terms of tool functionality, framing them as a technical problem instead of a business problem.  All data quality tools provide viable solutions driven by impressive technology.  Therefore, the focus should always be on stating the problem and solution criteria in business terms.

 

Dr. Technology and Mr. Business Must Work Together

As the cross-functional project team starts working together, my Dr. Technology and Mr. Business personalities converge to help clarify communication by providing bi-directional translation, mentoring, documentation, training and knowledge transfer.  I can help interpret business requirements and functional specifications, help explain business and technical challenges, and help maintain an ongoing dialogue between the Business and IT. 

I can also help each group save face by playing the important role of Designated Asker of Stupid Questions – one of those intangible skills you can't find anywhere on my resume.

As the project progresses, the communication and teamwork between the Business and IT will become more and more natural and I will become less and less necessary – one of my most important success criteria.

 

Success is Not So Strange

When the Business and IT forge an ongoing collaborative partnership throughout the entire project, success is not so strange.

In fact, your data quality project can be the beginning of a beautiful friendship between the Business and IT. 

Everyone on the project team can develop a healthy split-personality. 

IT can use their Mr. Business (or Ms. Business) personality to help them understand the intricacies of business processes. 

The Business can use their Dr. Technology personality to help them “get their geek on.”

 

Data quality success is all about shiny happy people holding hands – and what's so strange about that?

 

Related Posts

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You

 

Additional Resources

From the Data Quality Pro forum, read the discussion: Data Quality is not an IT issue

From the blog Inside the Biz with Jill Dyché, read her posts:

From Paul Erb's blog Comedy of the Commons, read his post: I Don't Know Much About Data, but I Know What I Like

The Data-Information Continuum

Data is one of the enterprise's most important assets.  Data quality is a fundamental success factor for the decision-critical information that drives the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

When the results of these initiatives don't meet expectations, analysis often reveals poor data quality is a root cause.   Projects are launched to understand and remediate this problem by establishing enterprise-wide data quality standards.

However, a common issue is a lack of understanding about what I refer to as the Data-Information Continuum.

 

The Data-Information Continuum

In physics, the Space-Time Continuum explains that space and time are interrelated entities forming a single continuum.  In classical mechanics, the passage of time can be considered a constant for all observers of spatial objects in motion.  In relativistic contexts, the passage of time is a variable changing for each specific observer of spatial objects in motion.

Data and information are also interrelated entities forming a single continuum.  It is crucial to understand how they are different and how they relate.  I like using the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers). 

A common data quality definition is fitness for the purpose of use.  A common challenge is data has multiple uses, each with its own fitness requirements.  I like to view each intended use as the information that is derived from data, defining information as data in use or data in action.

Data could be considered a constant while information is a variable that redefines data for each specific use.  Data is not truly a constant since it is constantly changing.  However, information is still derived from data and many different derivations can be performed while data is in the same state (i.e. before it changes again). 

Quality within the Data-Information Continuum has both objective and subjective dimensions.

 

Objective Data Quality

Data's quality must be objectively measured separately from its many uses.  Enterprise-wide data quality standards must provide a highest common denominator for all business units to use as an objective data foundation for their specific tactical and strategic initiatives.  Raw data extracted directly from its sources must be profiled, analyzed, transformed, cleansed, documented and monitored by data quality processes designed to provide and maintain universal data sources for the enterprise's information needs.  At this phase, the manipulations of raw data by these processes must be limited to objective standards and not be customized for any subjective use.

 

Subjective Information Quality

Information's quality can only be subjectively measured according to its specific use.  Information quality standards are not enterprise-wide; they are customized to a specific business unit or initiative.  However, all business units and initiatives must begin defining their information quality standards by using the enterprise-wide data quality standards as a foundation.  This approach allows leveraging a consistent enterprise understanding of data while also deriving the information necessary for the day-to-day operation of each business unit and initiative.

 

A “Single Version of the Truth” or the “One Lie Strategy”

A common objection to separating quality standards into objective data quality and subjective information quality is the enterprise's significant interest in creating what is commonly referred to as a single version of the truth.

However, in his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman explains:

“A fiendishly attractive concept is...'a single version of the truth'...the logic is compelling...unfortunately, there is no single version of the truth. 

For all important data, there are...too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. 

This does not imply malfeasance on anyone's part; it is simply a fact of life. 

Getting everyone to work from a single version of the truth may be a noble goal, but it is better to call this the 'one lie strategy' than anything resembling truth.”

Conclusion

There is a significant difference between data and information and therefore a significant difference between data quality and information quality.  Many data quality projects are in fact implementations of information quality customized to the specific business unit or initiative that is funding the project.  Although these projects can achieve some initial success, they encounter failures in later iterations and phases when information quality standards try to act as enterprise-wide data quality standards. 

Significant time and money can be wasted by not understanding the Data-Information Continuum.

The Three Musketeers of Data Quality

People, process and technology.  All three are necessary for success on your data quality project.  By far, the most important of the three is people.  However, who exactly are some of the most important people on your data quality project? 

Or to phrase the question in a much more entertaining way...

 

Who are The Three Musketeers of Data Quality?

1. Athos, the Executive Sponsor - Provides the mandate for the Business and IT to forge an ongoing and iterative collaboration throughout the entire project.  You might not see him roaming the halls or sitting in on most of the meetings.  However, Athos provides oversight and arbitrates any issues of organizational politics.  Without an executive sponsor, a data quality project cannot get very far and can easily lose momentum or focus.  Perhaps most importantly, Athos is also usually the source of the project's funding.

 

2. Porthos, the Project Manager - Facilitates the strategic and tactical collaboration of the project team.  Knowledge about data, business processes and supporting technology is spread throughout your organization.  Neither the Business nor IT alone has all of the necessary information required to achieve data quality success.  Porthos coordinates discussions with all of the stakeholders.  Business users are able to share their knowledge in their natural language and IT users are able to “geek out” in techno-babble.  Porthos clarifies communication by providing bi-directional translation.  He interprets end user business requirements, explains technical challenges and maintains an ongoing dialogue between the Business and IT.  Yes, Porthos is also responsible for the project plan.  But he realizes that project management is more about providing leadership.

 

3. Aramis, the Subject Matter Expert - Provides detailed knowledge about specific data subject areas and business processes.  Aramis reviews the reports from the data quality assessments and provides feedback based on his understanding of how the data is actually being used.  He helps identify the data most valuable to the business.  Aramis will often be an excellent source for undocumented business rules and can quickly clarify seemingly complex issues based on his data-centric point of view.

 

Alexandre Dumas fans will recall that the novel's plucky primary protagonist was an outsider who became the Fourth Musketeer, and yes, data quality has one too:

 

4. D'Artagnan, the Data Quality Consultant - Provides extensive experience and best practices from successful data quality implementations.  Most commonly, d'Artagnan is a certified expert with the data quality tool you have selected.  D'Artagnan's goal is to help you customize a technical solution to your specific business needs.  Unlike the Dumas character, your d'Artagnan usually doesn't accept the Musketeer commission at the end of the project.  Therefore, his primary responsibility is to make himself obsolete as quickly as possible by providing mentoring, documentation, training and knowledge transfer. 

 

Your data quality project will typically have more than one person (and obviously not just men) playing each of these classic roles although you may use different job titles.  Additionally, there will be many other important people on your project playing many other key roles, such as data architect, business analyst, application developer and system tester - to name just a few. 

Data quality truly takes a team effort.  Remember that you are all in this together.

So if anyone asks you who is the most important person on your project, then just respond with the Musketeer Motto:

"All for Data Quality, Data Quality for All"


The Two Headed Monster of Data Matching

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

Data matching is commonly plagued by what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched

 

I Fought The Two Headed Monster...

On a recent (mostly) business trip to Las Vegas, I scheduled a face-to-face meeting with a potential business partner that I had previously communicated with via phone and email only.  We agreed to a dinner meeting at a restaurant in the hotel/casino where I was staying. 

I would be meeting with the President/CEO and the Vice President of Business Development, a man and a woman respectively.

I was facing a real world data matching problem.

I knew their names, but I had no idea what they looked like.  Checking their company website and LinkedIn profiles didn't help - no photos.  I neglected to get their mobile phone numbers, however they had mine.

The restaurant was inside the casino and the only entrance was adjacent to a Starbucks that had tables and chairs facing the casino floor.  I decided to arrive at the restaurant 15 minutes early and camp out at Starbucks since anyone going near the restaurant would have to walk right past me.

I was more concerned about avoiding false positives.  I didn't want to walk up to every potential match and introduce myself since casino security would soon intervene (and I have seen enough movies to know that scene always ends badly). 

I decided to apply some probabilistic data matching principles to evaluate the mass of humanity flowing past me. 

If some of my matching criteria seems odd, please remember I was in a Las Vegas casino. 

I excluded from consideration all:

  • Individuals wearing a uniform or a costume
  • Groups consisting of more than two people
  • Groups consisting of two men or two women
  • Couples carrying shopping bags or souvenirs
  • Couples demonstrating a public display of affection
  • Couples where one or both were noticeably intoxicated
  • Couples where one or both were scantily clad
  • Couples where one or both seemed too young or too old

I carefully considered any:

  • Couples dressed in business attire or business casual attire
  • Couples pausing to wait at the restaurant entrance
  • Couples arriving close to the scheduled meeting time

I was quite pleased with myself for applying probabilistic data matching principles to a real world situation.

However, the scheduled meeting time passed.  At first, I simply assumed they might be running a little late or were delayed by traffic.  As the minutes continued to pass, I started questioning my matching criteria.

 

...And The Two Headed Monster Won

When the clock reached 30 minutes past the scheduled meeting time, my mobile phone rang.  My dinner companions were calling to ask if I was running late.  They had arrived on time, were inside the restaurant, and had already ordered.

Confused, I entered the restaurant.  Sure enough, there sat a man and a woman who had walked right past me.  I excluded them from consideration because of how they were dressed.  The Vice President of Business Development was dressed in jeans, sneakers and a casual shirt.  The President/CEO was wearing shorts, sneakers and a casual shirt.

I had dismissed them as a vacationing couple.

I had been defeated by a false negative.

 

The Harsh Reality is that Monsters are Real

My data quality expertise could not guarantee victory in this particular battle with The Two Headed Monster. 

Monsters are real and the hero of the story doesn't always win.

And it doesn’t matter if the match algorithms I use are deterministic, probabilistic, or even supercalifragilistic. 

The harsh reality is that false negatives and false positives can be reduced, but never eliminated.

 

Are You Fighting The Two Headed Monster?

Are you more concerned about false negatives or false positives?  Please share your battles with The Two Headed Monster.

 

Related Articles

Back in February and March, I published a five-part series of articles on data matching methodology on Data Quality Pro.

Parts 2 and 3 of the series provided data examples to illustrate the challenge of false negatives and false positives within the context of identifying duplicate customers:

The Nine Circles of Data Quality Hell

“Abandon all hope, ye who enter here.” 

In Dante’s Inferno, these words are inscribed above the entrance into hell.  The Roman poet Virgil was Dante’s guide through its nine circles, each an allegory for unrepentant sins beyond forgiveness.

The Very Model of a Modern DQ General will be your guide on this journey through nine of the most common mistakes that can doom your data quality project:

 

1. Thinking data quality is an IT issue (or a business issue) - Data quality is not an IT issue.  Data quality is also not a business issue.  Data quality is everyone's issue.  Successful data quality projects are driven by an executive management mandate for the business and IT to forge an ongoing and iterative collaboration throughout the entire project.  The business usually owns the data and understands its meaning and use in the day to day operation of the enterprise and must partner with IT in defining the necessary data quality standards and processes.

 

2. Waiting for poor data quality to affect you - Data quality projects are often launched in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure or a financial reporting scandal.  Whatever the triggering event, a common response is data quality suddenly becomes prioritized as a critical issue.

 

3. Believing technology alone is the solution - Although incredible advancements continue, technology alone cannot provide the solution.  Data quality requires a holistic approach involving people, process and technology.  Your project can only be successful when people take on the challenge united by collaboration, guided by an effective methodology, and of course, implemented with amazing technology.

 

4. Listening only to the expert - An expert can be an invaluable member of the data quality project team.  However, sometimes an expert can dominate the decision making process.  The expert's perspective needs to be combined with the diversity of the entire project team in order for success to be possible.

 

5. Losing focus on the data - The complexity of your data quality project can sometimes work against your best intentions.  It is easy to get pulled into the mechanics of documenting the business requirements and functional specifications and then charging ahead with application development.  Once the project achieves some momentum, it can take on a life of its own and the focus becomes more and more about making progress against the tasks in the project plan, and less and less on the project's actual goal, which is to improve the quality of your data.

  • This common mistake was the theme of my post: Data Gazers.

 

6. Chasing perfection - An obsessive-compulsive quest to find and fix every data quality problem is a laudable pursuit but ultimately a self-defeating cause.  Data quality problems can be very insidious and even the best data quality process will still produce exceptions.  Although this is easy to accept in theory, it is notoriously difficult to accept in practice.  Do not let the pursuit of perfection undermine your data quality project.

 

7. Viewing your data quality assessment as a one-time event - Your data quality project should begin with a data quality assessment to assist with aligning perception with reality and to get the project off to a good start by providing a clear direction and a working definition of success.  However, the data quality assessment is not a one-time event that ends when development begins.  You should perform iterative data quality assessments throughout the entire development lifecycle.

 

8. Forgetting about the people - People, process and technology.  All three are necessary for success on your data quality project.  However, I have found that the easiest one to forget about (and by far the most important of the three) is people.

 

9. Assuming if you build it, data quality will come - There are many important considerations when planning a data quality project.  One of the most important is to realize that data quality problems cannot be permanently “fixed” by implementing a one-time “solution” that doesn't require ongoing improvements.

 

Knowing these common mistakes is no guarantee that your data quality project won't still find itself lost in a dark wood.

However, knowledge could help you realize when you have strayed from the right road and light a path to find your way back.

Schrödinger's Data Quality

In 1935, Austrian physicist Erwin Schrödinger described a now famous thought experiment where:

  “A cat, a flask containing poison, a tiny bit of radioactive substance and a Geiger counter are placed into a sealed box for one hour.  If the Geiger counter doesn't detect radiation, then nothing happens and the cat lives.  However if radiation is detected, then the flask is shattered, releasing the poison which kills the cat.  According to the Copenhagen interpretation of quantum mechanics, until the box is opened, the cat is simultaneously alive and dead.  Yet, once you open the box, the cat will either be alive or dead, not a mixture of alive and dead.” 

This was only a thought experiment.  Therefore, no actual cat was harmed. 

This paradox of quantum physics, known as Schrödinger's Cat, poses the question:

  “When does a quantum system stop existing as a mixture of states and become one or the other?”

 

Unfortunately, data quality projects are not thought experiments.  They are complex, time consuming and expensive enterprise initiatives.  Typically, a data quality tool is purchased, expert consultants are hired to supplement staffing, production data is copied to a development server and the project begins.  Until it is completed and the new system goes live, the project is a potential success or failure.  Yet, once the new system starts being used, the project will become either a success or failure.

This paradox, which I refer to as Schrödinger's Data Quality, poses the question:

  “When does a data quality project stop existing as potential success or failure and become one or the other?”

 

Data quality projects should begin with the parallel and complementary efforts of drafting the business requirements and performing a data quality assessment, which can help you:

  • Verify data matches the metadata that describes it
  • Identify potential missing, invalid and default values
  • Prepare meaningful questions for subject matter experts
  • Understand how data is being used
  • Prioritize critical data errors
  • Evaluate potential ROI of data quality improvements
  • Define data quality standards
  • Reveal undocumented business rules
  • Review and refine the business requirements
  • Provide realistic estimates for development, testing and implementation

Therefore, the data quality assessment assists with aligning perception with reality and gets the project off to a good start by providing a clear direction and a working definition of success.
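
For illustration, here is a minimal profiling sketch of the kind of checks an initial assessment might automate.  The records, field names, reference values, and suspected default value below are hypothetical assumptions, not a prescribed method; a real assessment would run against profiled production data and its documented metadata.

  # Minimal assessment sketch: compare data against its metadata and flag
  # missing, invalid, and suspected default values.  All values are hypothetical.
  from collections import Counter

  records = [
      {"customer_id": "1001", "state": "MA", "birth_date": "1975-03-14"},
      {"customer_id": "1002", "state": "",   "birth_date": "1900-01-01"},  # missing state, placeholder date
      {"customer_id": "1003", "state": "XX", "birth_date": "1982-11-02"},  # invalid state code
  ]

  valid_states = {"MA", "NY", "CA"}      # assumed reference data from the metadata
  suspected_defaults = {"1900-01-01"}    # "magic" values worth asking subject matter experts about

  findings = Counter()
  for row in records:
      if not row["state"]:
          findings["state: missing"] += 1
      elif row["state"] not in valid_states:
          findings["state: invalid"] += 1
      if row["birth_date"] in suspected_defaults:
          findings["birth_date: suspected default"] += 1

  for finding, count in findings.items():
      print(f"{finding}: {count} of {len(records)} records")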

 

However, a common mistake is to view the data quality assessment as a one-time event that ends when development begins. 

 

Projects should perform iterative data quality assessments throughout the entire development lifecycle, which can help you:

  • Gain a data-centric view of the project's overall progress
  • Build data quality monitoring functionality into the new system
  • Promote data-driven development
  • Enable more effective unit testing
  • Perform impact analysis on requested enhancements (i.e. scope creep)
  • Record regression cases for testing modifications
  • Identify data exceptions that require suspension for manual review and correction
  • Facilitate early feedback from the user community
  • Correct problems that could undermine user acceptance
  • Increase user confidence that the new system will meet their needs
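
As a sketch of how those iterative assessments can harden into the monitoring and exception handling mentioned above, the hypothetical Python example below applies a couple of data quality rules to incoming records and suspends the violations for manual review instead of silently loading them.  The field names, rules, and sample records are assumptions for illustration only.

  # Minimal monitoring sketch: rule violations are suspended for manual review.
  # Field names, rules, and sample records are hypothetical.
  from datetime import date

  def check_order(row):
      """Return a list of data quality rule violations for one order record."""
      violations = []
      if row["ship_date"] and row["ship_date"] < row["order_date"]:
          violations.append("ship_date precedes order_date")
      if row["quantity"] <= 0:
          violations.append("quantity must be positive")
      return violations

  orders = [
      {"order_id": 1, "order_date": date(2009, 6, 1), "ship_date": date(2009, 6, 3), "quantity": 2},
      {"order_id": 2, "order_date": date(2009, 6, 5), "ship_date": date(2009, 6, 2), "quantity": 1},
      {"order_id": 3, "order_date": date(2009, 6, 7), "ship_date": None,             "quantity": 0},
  ]

  loaded, suspended = [], []
  for order in orders:
      problems = check_order(order)
      if problems:
          suspended.append((order["order_id"], problems))
      else:
          loaded.append(order["order_id"])

  print("loaded:", loaded)
  print("suspended for manual review:", suspended)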

 

If you wait until the end of the project to learn if you have succeeded or failed, then you treat data quality like a game of chance.

And to paraphrase Albert Einstein:

  “Do not play dice with data quality.”


Data Gazers

Within cubicles randomly dispersed throughout the sprawling office space of companies large and small, there exist countless unsung heroes of enterprise information initiatives.  Although their job titles might label them as a Business Analyst, Programmer Analyst, Account Specialist or Application Developer, their true vocation is a far more noble calling.

 

They are Data Gazers.

 

In his excellent book Data Quality Assessment, Arkady Maydanchik explains that:

"Data gazing involves looking at the data and trying to reconstruct a story behind these data.  Following the real story helps identify parameters about what might or might not have happened and how to design data quality rules to verify these parameters.  Data gazing mostly uses deduction and common sense."

All enterprise information initiatives are complex endeavors and data quality projects are certainly no exception.  Success requires people taking on the challenge united by collaboration, guided by an effective methodology, and implementing a solution using powerful technology.

But the complexity of the project can sometimes work against your best intentions.  It is easy to get pulled into the mechanics of documenting the business requirements and functional specifications and then charging ahead on the common mantra:

"We planned the work, now we work the plan." 

Once the project achieves some momentum, it can take on a life of its own and the focus becomes more and more about making progress against the tasks in the project plan, and less and less on the project's actual goal...improving the quality of the data. 

In fact, I have often observed the bizarre phenomenon where, as a project "progresses," it tends to get further and further away from the people who use the data on a daily basis.

 

However, Arkady Maydanchik explains that:

"Nobody knows the data better than the users.  Unknown to the big bosses, the people in the trenches are measuring data quality every day.  And while they rarely can give a comprehensive picture, each one of them has encountered certain data problems and developed standard routines to look for them.  Talking to the users never fails to yield otherwise unknown data quality rules with many data errors."

There is a general tendency to assume that working directly with the users and the data during application development can only be disruptive to the project's progress.  There can be a quiet comfort and joy in simply developing from documentation and letting the interaction with the users and the data wait until the project plan indicates that user acceptance testing begins.

The project team can convince themselves that the documented business requirements and functional specifications are suitable surrogates for the direct knowledge of the data that users possess.  It is easy to believe that these documents tell you what the data is and what the rules are for improving the quality of the data.

Therefore, although ignoring the users and the data until user acceptance testing begins may be a good way to keep a data quality project on schedule, you will only be delaying the project's inevitable failure because, as all data gazers know and as my mentor Morpheus taught me:

"Unfortunately, no one can be told what the Data is. You have to see it for yourself."