Data Governance Frameworks are like Jigsaw Puzzles


In a recent interview, Jill Dyché addressed a common misconception, explaining that a data governance framework is not a strategy.  “Unlike other strategic initiatives that involve IT,” Jill explained, “data governance needs to be designed.  The cultural factors, the workflow factors, the organizational structure, the ownership, the political factors, all need to be accounted for when you are designing a data governance roadmap.”

“People need a mental model, that is why everybody loves frameworks,” Jill continued.  “But they are not enough and I think the mistake that people make is that once they see a framework, rather than understanding its relevance to their organization, they will just adapt it and plaster it up on the whiteboard and show executives without any kind of context.  So they are already defeating the purpose of data governance, which is to make it work within the context of your business problems, not just have some kind of mental model that everybody can agree on, but is not really the basis for execution.”

“So it’s a really, really dangerous trend,” Jill cautioned, “that we see where people equate strategy with framework because strategy is really a series of collected actions that result in some execution — and that is exactly what data governance is.”

And in her excellent article Data Governance Next Practices: The 5 + 2 Model, Jill explained that data governance requires a deliberate design so that the entire organization can buy into a realistic execution plan, not just a sound bite.  As usual, I agree with Jill, since, in my experience, many people expect a data governance framework to provide eureka-like moments of insight.

In The Myths of Innovation, Scott Berkun debunked the myth of the eureka moment using the metaphor of a jigsaw puzzle.

“When you put the last piece into place, is there anything special about that last piece or what you were wearing when you put it in?” Berkun asked.  “The only reason that last piece is significant is because of the other pieces you’d already put into place.  If you jumbled up the pieces a second time, any one of them could turn out to be the last, magical piece.”

“The magic feeling at the moment of insight, when the last piece falls into place,” Berkun explained, “is the reward for many hours (or years) of investment coming together.  In comparison to the simple action of fitting the puzzle piece into place, we feel the larger collective payoff of hundreds of pieces’ worth of work.”

Perhaps the myth of the data governance framework could also be debunked using the metaphor of a jigsaw puzzle.

Data governance requires coordinating a complex combination of myriad factors, including executive sponsorship, funding, decision rights, arbitration of conflicting priorities, policy definition, policy implementation, data quality remediation, data stewardship, business process optimization, technology enablement, change management, and many other puzzle pieces.

How could a data governance framework possibly predict how you will assemble the puzzle pieces?  Or how the puzzle pieces will fit together within your unique corporate culture?  Or which of the many aspects of data governance will turn out to be the last (or even the first) piece of the puzzle to fall into place in your organization?  And, of course, there is truly no last piece of the puzzle, since data governance is an ongoing program because the business world constantly gets jumbled up by change.

So, data governance frameworks are useful, but only if you realize that data governance frameworks are like jigsaw puzzles.

Data Quality and the Bystander Effect

In his recent Harvard Business Review blog post Break the Bad Data Habit, Tom Redman cautioned against correcting data quality issues without providing feedback to where the data originated.  “At a minimum,” Redman explained, “others using the erred data may not spot the error.  There is no telling where it might turn up or who might be victimized.”  And correcting bad data without providing feedback to its source also denies the organization an opportunity to get to the bottom of the problem.

“And failure to provide feedback,” Redman continued, “is but the proximate cause.  The deeper root issue is misplaced accountability — or failure to recognize that accountability for data is needed at all.  People and departments must continue to seek out and correct errors.  They must also provide feedback and communicate requirements to their data sources.”

In his blog post The Secret to an Effective Data Quality Feedback Loop, Dylan Jones responded to Redman’s blog post with some excellent insights regarding data quality feedback loops and how they can help improve your data quality initiatives.

I definitely agree with Redman and Jones about the need for feedback loops, but I have found, more often than not, that no feedback at all is provided on data quality issues because of the assumption that data quality is someone else’s responsibility.

This general lack of accountability for data quality issues is similar to what is known in psychology as the Bystander Effect, which refers to people often not offering assistance to the victim in an emergency situation when other people are present.  Apparently, the mere presence of other bystanders greatly decreases intervention, and the greater the number of bystanders, the less likely it is that any one of them will help.  Psychologists believe that the reason this happens is that as the number of bystanders increases, any given bystander is less likely to interpret the incident as a problem, and less likely to assume responsibility for taking action.

In my experience, the most common reason that data quality issues are neither reported nor corrected is that most people throughout the enterprise act like data quality bystanders, making them less likely to interpret bad data as a problem or, at the very least, less likely to consider it their responsibility.  The enterprise’s data quality is perhaps most negatively affected by this bystander effect, which may make it the worst bad data habit that the enterprise needs to break.

 

Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

Hyperactive Data Quality (Second Edition)

A Farscape Analogy for Data Quality

There is No Such Thing as a Root Cause

Data Quality and the Q Test

The Data Quality Wager

The Third Law of Data Quality

The Data Governance Oratorio

Shared Responsibility

The Algebra of Collaboration

Collaboration isn’t Brain Surgery

The Three Most Important Letters in Data Governance

 

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Organizing for Data Quality — Guest Tom Redman (aka the “Data Doc”) discusses how your organization should approach data quality, including his call to action for your role in the data revolution.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Redefining Data Quality — Guest Peter Perera discusses his proposed redefinition of data quality, as well as his perspective on the relationship of data quality to master data management and data governance.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Data Quality Pro

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Links for Data Quality Pro and Dylan Jones:

On this episode, I am joined by special guest Dylan Jones, the community leader of Data Quality Pro, the largest membership resource dedicated entirely to the data quality profession.

Dylan is currently overseeing the re-build and re-launch of Data Quality Pro into a next generation membership platform, and during our podcast discussion, Dylan describes some of the great new features that will be coming soon to Data Quality Pro.

 


 

Related Posts

#FollowFriday Spotlight: @DataQualityPro

The Once and Future Data Quality Expert

DQ-Tip: “There is no such thing as data accuracy...”

DQ-View: Is Data Quality the Sun?

Microwavable Data Quality

Customer Incognita

The Only Thing Necessary for Poor Data Quality

The Two Headed Monster of Data Matching

Identifying Duplicate Customers

A Brave New Data World

#FollowFriday Spotlight: @DataQualityPro

FollowFriday Spotlight is an OCDQ regular segment highlighting someone you should follow—and not just Fridays on Twitter.

Links for Data Quality Pro and Dylan Jones:

Data Quality Pro, founded and maintained by Dylan Jones, is a free and independent community resource dedicated to helping data quality professionals take their career or business to the next level.  Data Quality Pro is your free expert resource providing data quality articles, webinars, forums and tutorials from the world’s leading experts, every day.

With the mission to create the most beneficial data quality resource that is freely available to members around the world, the goal of Data Quality Pro is “winning-by-sharing,” and they believe that when members contribute a small amount of their experience, skill or time to support other members, truly great things can be achieved.

Membership is 100% free and provides a broad range of additional content for professionals of all backgrounds and skill levels.

Check out the Best of Data Quality Pro, which includes the following great blog posts written by Dylan Jones in 2010:

 

Related Posts

#FollowFriday and Re-Tweet-Worthiness

#FollowFriday and The Three Tweets

Dilbert, Data Quality, Rabbits, and #FollowFriday

Twitter, Meaningful Conversations, and #FollowFriday

The Fellowship of #FollowFriday

Social Karma (Part 7) – Twitter

DQ-Tip: “There is no such thing as data accuracy...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“There is no such thing as data accuracy — There are only assertions of data accuracy.”

This DQ-Tip came from the Data Quality Pro webinar ISO 8000 Master Data Quality featuring Peter Benson of ECCMA.

You can download (.pdf file) quotes from this webinar by clicking on this link: Data Quality Pro Webinar Quotes - Peter Benson

ISO 8000 is the international standard for data quality.  You can get more information by clicking on this link: ISO 8000

 

Data Accuracy

Thanks to substantial assistance from my readers, accuracy was defined in a previous post as the correctness of a data value within a limited context such as verification by an authoritative reference (i.e., validity) combined with the correctness of a valid data value within an extensive context including other data as well as business processes (i.e., accuracy).

“The definition of data quality,” according to Peter and the ISO 8000 standards, “is the ability of the data to meet requirements.”

Although accuracy is only one of many dimensions of data quality, whenever we refer to data as accurate, we are referring to the ability of the data to meet specific requirements, and quite often it’s the ability to support making a critical business decision.

I agree with Peter and the ISO 8000 standards because we can’t simply take an accuracy metric on a data quality dashboard (or however else the assertion is presented to us) at face value without understanding how the metric is both defined and measured.

However, even when well defined and properly measured, data accuracy is still only an assertion.  Oftentimes, the only way to verify the assertion is by putting the data to its intended use.

If by using it you discover that the data is inaccurate, then by having established what the assertion of accuracy was based on, you have a head start on performing root cause analysis, enabling faster resolution of the issues—not only with the data, but also with the business and technical processes used to define and measure data accuracy.

 

Related Posts

Worthy Data Quality Whitepapers (Part 1)

Why isn’t our data quality worse?

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Data Quality and the Cupertino Effect

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-Tip: “There is no point in monitoring data quality...”

DQ-Tip: “Don't pass bad data on to the next person...”

DQ-Tip: “...Go talk with the people using the data”

DQ-Tip: “Data quality is about more than just improving your data...” 

DQ-Tip: “Start where you are...”

DQ-View: Is Data Quality the Sun?

Data Quality (DQ) View is an OCDQ regular segment.  Each DQ-View is a brief video discussion of a data quality key concept.


This recent tweet by Dylan Jones of Data Quality Pro succinctly expresses a vitally important truth about the data quality profession.

Although few would debate the necessary requirement of skill, some might doubt the need for passion.  Therefore, in this new DQ-View segment, I want to discuss why data quality initiatives require passionate data professionals.

 


 

If you are having trouble viewing this video, then you can watch it on Vimeo by clicking on this link: DQ-View on Vimeo

 

Related Posts

Data Gazers

Finding Data Quality

Oh, the Data You’ll Show!

Data Rock Stars: The Rolling Forecasts

The Second Law of Data Quality

The General Theory of Data Quality

DQ-Tip: “Start where you are...”

Sneezing Data Quality

The 2010 Data Quality Blogging All-Stars

The 2010 Major League Baseball (MLB) All-Star Game is being held tonight (July 13) at Angel Stadium in Anaheim, California.

For those readers who are not baseball fans, the All-Star Game is an annual exhibition held in mid-July that showcases the players with (for the most part) the best statistical performances during the first half of the MLB season.

Last summer, I began my own annual exhibition of showcasing the bloggers whose posts I have personally most enjoyed reading during the first half of the data quality blogging season. 

Therefore, this post provides links to stellar data quality blog posts that were published between January 1 and June 30 of 2010.  My definition of a “data quality blog post” also includes Data Governance, Master Data Management, and Business Intelligence. 

Please Note: There is no implied ranking in the order that bloggers or blogs are listed, other than that Individual Blog All-Stars are listed first, followed by Vendor Blog All-Stars, and the blog posts are listed in reverse chronological order by publication date.

 

Henrik Liliendahl Sørensen

From Liliendahl on Data Quality:

 

Dylan Jones

From Data Quality Pro:

 

Julian Schwarzenbach

From Data and Process Advantage Blog:

 

Rich Murnane

From Rich Murnane's Blog:

 

Phil Wright

From Data Factotum:

 

Initiate – an IBM Company

From Mastering Data Management:

 

Baseline Consulting

From their three blogs: Inside the Biz with Jill Dyché, Inside IT with Evan Levy, and In the Field with our Experts:

 

DataFlux – a SAS Company

From Community of Experts:

 

Related Posts

Recently Read: May 15, 2010

Recently Read: March 22, 2010

Recently Read: March 6, 2010

Recently Read: January 23, 2010

The 2009 Data Quality Blogging All-Stars

 

Additional Resources

From the IAIDQ, read the 2010 issues of the Blog Carnival for Information/Data Quality:

Microwavable Data Quality

Data quality is definitely not a one-time project, but instead requires a sustained program of enterprise-wide best practices that are best implemented within a data governance framework that “bakes in” defect prevention, data quality monitoring, and near real-time standardization and matching services—all ensuring high quality data is available to support daily business decisions.

However, implementing a data governance program is an evolutionary process requiring time and patience.

Baking and cooking also require time and patience.  Microwavable meals can be an occasional welcome convenience, and if you are anything like me (my condolences) and you can’t bake or cook, then microwavable meals can be an absolute necessity.

Data cleansing can also be an occasional (not necessarily welcome) convenience, or a relative necessity (i.e., a “necessary evil”).

Last year on Data Quality Pro, Dylan Jones hosted a great debate on the necessity of data cleansing, which is well worth reading, especially since the more than 25 (and continuing) comments it received prove it is a polarizing topic for the data quality profession.

I reheated this debate (using the Data Quality Microwave, of course) earlier this year with my A Tale of Two Q’s blog post, which also received many commendable comments (but far fewer than Dylan’s blog post, not that I am counting or anything).

Similarly, a heated debate can be had over the health implications of the microwave.  Eating too many microwavable meals is certainly not healthy, but I have many friends and family who would argue quite strongly for either side of this “food fight.”

Both of these great debates can be as deeply polarizing as Pepsi vs. Coke and Soccer vs. Football.  Just for the official record, I am firmly for both Pepsi and Football—and by Football, I mean NFL Football—and firmly against both Coke and Soccer. 

Just as I advocate that everyone (myself included) should learn how to cook, but still accept the eternal reality of the microwave, I definitely advocate the implementation of a data governance program, but I also accept the eternal reality of data cleansing.   

However, my lawyers have advised me to report that beta testing for an actual Data Quality Microwave has not been promising.

 

Related Posts

A Tale of Two Q’s

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

 



Customer Incognita

Many enterprise information initiatives are launched in order to unravel that riddle, wrapped in a mystery, inside an enigma, that great unknown, also known as...Customer.

Centuries ago, cartographers used the Latin phrase terra incognita (meaning “unknown land”) to mark regions on a map not yet fully explored.  In this century, companies simply cannot afford to use the phrase customer incognita to indicate what information about their existing (and prospective) customers they don't currently have or don't properly understand.

 

What is a Customer?

First things first, what exactly is a customer?  Those happy people who give you money?  Those angry people who yell at you on the phone or say really mean things about your company on Twitter and Facebook?  Why do they have to be so mean? 

Mean people suck.  However, companies who don't understand their customers also suck.  And surely you don't want to be one of those companies, do you?  I didn't think so.

Getting back to the question, here are some insights from the Data Quality Pro discussion forum topic What is a customer?:

  • Someone who purchases products or services from you.  The word “someone” is key because it’s not the role of a “customer” that forms the real problem, but the precision of the term “someone” that causes challenges when we try to link other and more specific roles to that “someone.”  These other roles could be contract partner, payer, receiver, user, owner, etc.
  • Customer is a role assigned to a legal entity in a complete and precise picture of the real world.  The role is established when the first purchase is accepted from this real-world entity.  Of course, the main challenge is whether or not the company can establish and maintain a complete and precise picture of the real world.

These working definitions were provided by fellow blogger and data quality expert Henrik Liliendahl Sørensen, who recently posted 360° Business Partner View, which further examines the many different ways a real-world entity can be represented, including when, instead of a customer, the real-world entity represents a citizen, patient, member, etc.

A critical first step for your company is to develop your definition of a customer.  Don't underestimate either the importance or the difficulty of this process.  And don't assume it is simply a matter of semantics.

Some of my consulting clients have indignantly told me: “We don't need to define it, everyone in our company knows exactly what a customer is.”  I usually respond: “I have no doubt that everyone in your company uses the word customer; however, I will work for free if everyone defines the word customer in exactly the same way.”  So far, I haven't had to work for free.

 

How Many Customers Do You Have?

You have done the due diligence and developed your definition of a customer.  Excellent!  Nice work.  Your next challenge is determining how many customers you have.  Hopefully, you are not going to try using any of these techniques:

  • SELECT COUNT(*) AS "We have this many customers" FROM Customers
  • SELECT COUNT(DISTINCT Name) AS "No wait, we really have this many customers" FROM Customers
  • Middle-Square or Blum Blum Shub methods (i.e. random number generation)
  • Magic 8-Ball says: “Ask again later”

One of the most common and challenging data quality problems is the identification of duplicate records, especially redundant representations of the same customer information within and across systems throughout the enterprise.  The need for a solution to this specific problem is one of the primary reasons that companies invest in data quality software and services.
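
To illustrate why neither of the counting queries above can be trusted when duplicates are present, here is a minimal Python sketch.  The sample records, the normalize helper, and the assumption that these six rows describe exactly three real-world customers are all hypothetical and invented for illustration; real customer data and real matching rules would be far messier.

```python
# Hypothetical customer rows, invented for illustration only.
# Assumed ground truth: three real-world customers
#   (one John Smith in Boston, one Mary Jones in Chicago, one Mary Jones in Denver).
customers = [
    {"name": "John Smith", "city": "Boston"},
    {"name": "JOHN SMITH ", "city": "Boston"},   # same person, different formatting
    {"name": "Jon Smith", "city": "Boston"},     # same person, misspelled
    {"name": "Mary Jones", "city": "Chicago"},
    {"name": "Mary Jones", "city": "Denver"},    # a different person with the same name
    {"name": "M. Jones", "city": "Denver"},      # the Denver customer again, abbreviated
]


def normalize(name):
    """Assumed, deliberately crude standardization: trim, collapse spaces, lowercase."""
    return " ".join(name.split()).lower()


# Equivalent of SELECT COUNT(*): counts rows, not customers.
row_count = len(customers)

# Equivalent of SELECT COUNT(DISTINCT Name): splits one person into several
# spellings and merges different people who share a name.
distinct_names = len({c["name"] for c in customers})

# Even after simple standardization, the count is still wrong.
normalized_names = len({normalize(c["name"]) for c in customers})

print("rows:", row_count)
print("distinct names:", distinct_names)
print("distinct normalized names:", normalized_names)
```

With this sample, the three counts come out as 6, 5, and 4, and none of them equals the assumed true answer of 3, which is why identifying duplicates needs business rules rather than a single query.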

Earlier this year on Data Quality Pro, I published a five part series of articles on identifying duplicate customers, which focused on the methodology for defining your business rules and illustrated some of the common data matching challenges.

Topics covered in the series:

  • Why a symbiosis of technology and methodology is necessary when approaching this challenge
  • How performing a preliminary analysis on a representative sample of real data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem
  • How to document your business rules for identifying duplicate customers
  • How to set realistic expectations about application development
  • How to foster a collaboration of the business and technical teams throughout the entire project
  • How to consolidate identified duplicates by creating a “best of breed” representative record

To read the series, please follow these links:

To download the associated presentation (no registration required), please follow this link: OCDQ Downloads

 

Conclusion

“Knowing the characteristics of your customers,” stated Jill Dyché and Evan Levy in the opening chapter of their excellent book, Customer Data Integration: Reaching a Single Version of the Truth, “who they are, where they are, how they interact with your company, and how to support them, can shape every aspect of your company's strategy and operations.  In the information age, there are fewer excuses for ignorance.”

For companies of every size and within every industry, customer incognita is a crippling condition that must be replaced with customer cognizance in order for the company to continue to remain competitive in a rapidly changing marketplace.

Do you know your customers?  If not, then they likely aren't your customers anymore.

The Only Thing Necessary for Poor Data Quality

“Demonstrate projected defects and business impacts if the business fails to act,” explains Dylan Jones of Data Quality Pro in his recent and remarkable post How To Deliver A Compelling Data Quality Business Case.

“Presenting a future without data quality management...leaves a simple take-away message – do nothing and the situation will deteriorate.”

I cannot help but be reminded of the famous quote often attributed to the 18th century philosopher Edmund Burke:

“The only thing necessary for the triumph of evil, is for good men to do nothing.”

Or the even more famous quote often attributed to the long time ago Jedi Master Yoda:

“Poor data quality is the path to the dark side.  Poor data quality leads to bad business decisions.

Bad business decisions lead to lost revenue.  Lost revenue leads to suffering.”

When you present the business case for your data quality initiative to executive management and other corporate stakeholders, demonstrate that poor data quality is not a theoretical problem – it is a real business problem that negatively impacts the quality of decision-critical enterprise information.

Preventing poor data quality is mission-critical.  Poor data quality will undermine the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

“The only thing necessary for Poor Data Quality – is for good businesses to Do Nothing.”

Related Posts

Hyperactive Data Quality (Second Edition)

Data Quality: The Reality Show?

Data Governance and Data Quality

Data Quality Blogging All-Stars

The 2009 Major League Baseball (MLB) All-Star Game is being held tonight at Busch Stadium in St. Louis, Missouri. 

For those readers who are not baseball fans, the All-Star Game is an annual exhibition held in mid-July that showcases the players with the best statistical performances from the first half of the MLB season.

As I watch the 80th Midsummer Classic, I offer this exhibition that showcases the bloggers with the posts I have most enjoyed reading from the first half of the 2009 data quality blogging season.

 

Dylan Jones

From Data Quality Pro:

 

Daragh O Brien

From The DOBlog:

 

Steve Sarsfield

From Data Governance and Data Quality Insider:

 

Daniel Gent

From Data Quality Edge:

 

Henrik Liliendahl Sørensen

From Liliendahl on Data Quality:

 

Stefanos Damianakis

From Netrics HD:

 

Vish Agashe

From Business Intelligence: Process, People and Products:

 

Mark Goloboy

From Boston Data, Technology & Analytics:

 

Additional Resources

Over on Data Quality Pro, read the data quality blog roundups from the first half of 2009:

From the IAIDQ, read the 2009 issues of the IAIDQ Blog Carnival:

The Two Headed Monster of Data Matching

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

Data matching is commonly plagued by what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched
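
As a small illustration of the two error types listed above, the following Python sketch compares a toy match rule against hand-labeled answers.  The records, the entity labels, and the rule_says_match function are hypothetical; in practice the true identity behind each record is rarely known in advance.

```python
from itertools import combinations

# Hypothetical records; "entity" is an assumed hand-labeled real-world identity.
records = [
    {"id": 1, "name": "john smith", "entity": "A"},
    {"id": 2, "name": "jon smith", "entity": "A"},    # same person as id 1, misspelled
    {"id": 3, "name": "mary jones", "entity": "B"},
    {"id": 4, "name": "mary jones", "entity": "C"},   # a different person, same name
]


def rule_says_match(a, b):
    """Assumed toy rule: two records match only when their names are identical."""
    return a["name"] == b["name"]


false_negatives = []  # pairs that did not match, but should have been matched
false_positives = []  # pairs that matched, but should not have been matched

for a, b in combinations(records, 2):
    predicted = rule_says_match(a, b)
    actual = a["entity"] == b["entity"]
    if actual and not predicted:
        false_negatives.append((a["id"], b["id"]))
    elif predicted and not actual:
        false_positives.append((a["id"], b["id"]))

print("false negatives:", false_negatives)
print("false positives:", false_positives)
```

Here the exact-name rule misses the misspelled pair (1, 2) and wrongly joins the two different people in pair (3, 4), so a single simple rule produces both heads of the monster at once.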

 

I Fought The Two Headed Monster...

On a recent (mostly) business trip to Las Vegas, I scheduled a face-to-face meeting with a potential business partner that I had previously communicated with via phone and email only.  We agreed to a dinner meeting at a restaurant in the hotel/casino where I was staying. 

I would be meeting with the President/CEO and the Vice President of Business Development, a man and a woman respectively.

I was facing a real world data matching problem.

I knew their names, but I had no idea what they looked like.  Checking their company website and LinkedIn profiles didn't help - no photos.  I neglected to get their mobile phone numbers; however, they had mine.

The restaurant was inside the casino and the only entrance was adjacent to a Starbucks that had tables and chairs facing the casino floor.  I decided to arrive at the restaurant 15 minutes early and camp out at Starbucks since anyone going near the restaurant would have to walk right past me.

I was more concerned about avoiding false positives.  I didn't want to walk up to every potential match and introduce myself since casino security would soon intervene (and I have seen enough movies to know that scene always ends badly). 

I decided to apply some probabilistic data matching principles to evaluate the mass of humanity flowing past me. 

If some of my matching criteria seem odd, please remember I was in a Las Vegas casino.

I excluded from consideration all:

  • Individuals wearing a uniform or a costume
  • Groups consisting of more than two people
  • Groups consisting of two men or two women
  • Couples carrying shopping bags or souvenirs
  • Couples demonstrating a public display of affection
  • Couples where one or both were noticeably intoxicated
  • Couples where one or both were scantily clad
  • Couples where one or both seemed too young or too old

I carefully considered any:

  • Couples dressed in business attire or business casual attire
  • Couples pausing to wait at the restaurant entrance
  • Couples arriving close to the scheduled meeting time

I was quite pleased with myself for applying probabilistic data matching principles to a real world situation.

However, the scheduled meeting time passed.  At first, I simply assumed they might be running a little late or were delayed by traffic.  As the minutes continued to pass, I started questioning my matching criteria.

 

...And The Two Headed Monster Won

When the clock reached 30 minutes past the scheduled meeting time, my mobile phone rang.  My dinner companions were calling to ask if I was running late.  They had arrived on time, were inside the restaurant, and had already ordered.

Confused, I entered the restaurant.  Sure enough, there sat a man and a woman that had walked right past me.  I excluded them from consideration because of how they were dressed.  The Vice President of Business Development was dressed in jeans, sneakers and a casual shirt.  The President/CEO was wearing shorts, sneakers and a casual shirt.

I had dismissed them as a vacationing couple.

I had been defeated by a false negative.

 

The Harsh Reality is that Monsters are Real

My data quality expertise could not guarantee victory in this particular battle with The Two Headed Monster. 

Monsters are real and the hero of the story doesn't always win.

And it doesn’t matter if the match algorithms I use are deterministic, probabilistic, or even supercalifragilistic. 

The harsh reality is that false negatives and false positives can be reduced, but never eliminated.
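
One way to picture this trade-off is a match threshold.  The Python sketch below uses hypothetical similarity scores and hand-labeled answers (both invented for illustration, not produced by any particular matching tool) to show that moving the threshold shifts errors from one head of the monster to the other.

```python
# Hypothetical scored record pairs: (similarity score, truly the same entity?)
# Both the scores and the labels are invented for illustration.
scored_pairs = [
    (0.95, True),
    (0.90, False),
    (0.80, True),
    (0.60, True),
    (0.55, False),
    (0.30, False),
]


def count_errors(threshold):
    """Return (false_negatives, false_positives) when pairs scoring
    at or above the threshold are treated as matches."""
    false_negatives = sum(
        1 for score, same in scored_pairs if same and score < threshold
    )
    false_positives = sum(
        1 for score, same in scored_pairs if not same and score >= threshold
    )
    return false_negatives, false_positives


for threshold in (0.5, 0.7, 0.85, 0.99):
    fn, fp = count_errors(threshold)
    print(f"threshold {threshold}: {fn} false negatives, {fp} false positives")
```

In this sample, a low threshold yields no false negatives but two false positives, a very high threshold yields the reverse, and no threshold in between removes both, because a non-matching pair happens to score higher than some true matches.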

 

Are You Fighting The Two Headed Monster?

Are you more concerned about false negatives or false positives?  Please share your battles with The Two Headed Monster.

 

Related Articles

Back in February and March, I published a five-part series of articles on data matching methodology on Data Quality Pro.

Parts 2 and 3 of the series provided data examples to illustrate the challenge of false negatives and false positives within the context of identifying duplicate customers:

Identifying Duplicate Customers

I just finished publishing a five part series of articles on data matching methodology for dealing with the common data quality problem of identifying duplicate customers. 

The article series was published on Data Quality Pro, which is the leading data quality online magazine and free independent community resource dedicated to helping data quality professionals take their career or business to the next level.

Topics covered in the series:

  • Why a symbiosis of technology and methodology is necessary when approaching the common data quality problem of identifying duplicate customers
  • How performing a preliminary analysis on a representative sample of real project data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem
  • How to document your business rules for identifying duplicate customers
  • How to set realistic expectations about application development
  • How to foster a collaboration of the business and technical teams throughout the entire project
  • How to consolidate identified duplicates by creating a “best of breed” representative record

To read the series, please follow these links:

Do you have obsessive-compulsive data quality (OCDQ)?

Obsessive-compulsive data quality (OCDQ) affects millions of people worldwide.

The most common symptoms of OCDQ are:

  • Obsessively verifying data used in critical business decisions
  • Compulsively seeking an understanding of data in business terms
  • Repeatedly checking that data is complete and accurate before sharing it
  • Habitually attempting to calculate the cost of poor data quality
  • Constantly muttering a mantra that data quality must be taken seriously

While the good folks at Prescott Pharmaceuticals are busy working on a treatment, I am dedicating this independent blog as group therapy to all those who (like me) have dealt with OCDQ their entire professional lives.

Over the years, the work of many individuals and organizations has been immensely helpful to those of us with OCDQ.

Some of these heroes deserve special recognition:

Data Quality Pro – Founded and maintained by Dylan Jones, Data Quality Pro is a free independent community resource dedicated to helping data quality professionals take their career or business to the next level. With the mission to create the most beneficial data quality resource that is freely available to members around the world, Data Quality Pro provides free software, job listings, advice, tutorials, news, views and forums. Their goal is “winning-by-sharing” and they believe that by contributing a small amount of their experience, skill or time to support other members, truly great things can be achieved. With the new Member Service Register, consultants, service providers and technology vendors can promote their services and include links to their websites and blogs.

 

International Association for Information and Data Quality (IAIDQ) – Chartered in January 2004, IAIDQ is a not-for-profit, vendor-neutral professional association whose purpose is to create a world-wide community of people who desire to reduce the high costs of low quality information and data by applying sound quality management principles to the processes that create, maintain and deliver data and information. IAIDQ was co-founded by Larry English and Tom Redman, who are two of the most respected and well-known thought and practice leaders in the field of information and data quality. IAIDQ also provides two excellent blogs: IQ Trainwrecks and Certified Information Quality Professional (CIQP).

 

Beth Breidenbach – her blog Confessions of a database geek is fantastic in and of itself, but she has also compiled an excellent list of data quality blogs and provides them via aggregated feeds in both Feedburner and Google Reader formats.

 

Vincent McBurney – his blog Tooling Around in the IBM InfoSphere is an entertaining and informative look at data integration in the IBM InfoSphere covering many IBM Information Server products such as DataStage, QualityStage and Information Analyzer.

 

Daragh O Brien – is a leading writer, presenter and researcher in the field of information quality management, with a particular interest in legal aspects of information quality. His blog The DOBlog is a popular and entertaining source of great material.

 

Steve Sarsfield – his blog Data Governance and Data Quality Insider covers the world of data integration, data governance, and data quality from the perspective of an industry insider. Also, check out his new book: The Data Governance Imperative.