Hyperactive Data Quality

In economics, the term "flight to quality" describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments. 

A similar "flight to data quality" can occur in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure or a financial reporting scandal.  Whatever the triggering event, a common response is data quality suddenly becomes prioritized as a critical issue and an enterprise information initiative is launched.

Congratulations!  You've realized (albeit the hard way) that this "data quality thing" is really important.

Now what are you going to do about it?  How are you going to attempt to actually solve the problem?

In his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman uses an analogy he calls the data quality lake:

"...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH. 

The alternative is to reduce the pollutant at the point source - the factories. 

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data."

 

Reactive Data Quality

A "flight to data quality" usually prompts an approach commonly referred to as Reactive Data Quality (i.e. "cleaning the lake" to use Redman's excellent analogy).  The  majority of enterprise information initiatives are reactive.  The focus is typically on finding and fixing the problems with existing data in an operational data store (ODS), enterprise data warehouse (EDW) or other enterprise information repository.  In other words, the focus is on fixing data after it has been extracted from its sources.

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert "lake cleaners").  Data quality problems can be very insidious and even the best "lake cleaning" process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.
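
To make this concrete, here is a minimal sketch (my illustration, not Redman's) of a reactive "lake cleaning" pass: records that fail validation are reported and suspended for manual review and correction rather than silently loaded.  The field names and validation rules are hypothetical examples.

```python
import re

def validate(record):
    """Return a list of data quality exceptions found in a record."""
    exceptions = []
    if not record.get("customer_id"):
        exceptions.append("missing customer_id")
    email = record.get("email")
    if email and not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        exceptions.append("malformed email")
    return exceptions

def cleanse(records):
    """Split extracted records into loadable rows and suspended rows."""
    loadable, suspended = [], []
    for record in records:
        exceptions = validate(record)
        if exceptions:
            # Suspend the record, with its reasons, for manual review.
            suspended.append({"record": record, "exceptions": exceptions})
        else:
            loadable.append(record)
    return loadable, suspended

loadable, suspended = cleanse([
    {"customer_id": "C001", "email": "pat@example.com"},
    {"customer_id": "", "email": "not-an-email"},
])
print(f"{len(loadable)} loaded, {len(suspended)} suspended for manual review")
```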

However, as Redman cautions: "...the problem with being a good lake cleaner is that life never gets better.  Indeed, it gets worse as more data...conspire to mean there is more work every day."  I tell my clients the only way to guarantee that reactive data quality will be successful is to unplug all the computers so that no one can add new data or modify existing data.

 

Proactive Data Quality

Attempting to prevent data quality problems before they happen is commonly referred to as Proactive Data Quality.  The focus is on preventing errors at the sources where data is entered or received, before it is extracted for use by downstream applications (i.e. before it "enters the lake").  Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

"It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect."

Proactive data quality advocates implementing improved edit controls on data entry screens, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the business needs of your enterprise information consumers before you deliver data to them.
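
As a simple illustration of what an improved edit control might look like, here is a minimal sketch that rejects defective input at the point of entry, before it can "enter the lake."  The form fields, reference data, and rules are hypothetical, not a prescription.

```python
VALID_COUNTRY_CODES = {"US", "CA", "GB", "DE"}  # assumed reference data

def accept_order_entry(form):
    """Reject defective input at the source instead of loading it downstream."""
    errors = []
    if not form.get("customer_id", "").strip():
        errors.append("customer_id is required")
    if form.get("country") not in VALID_COUNTRY_CODES:
        errors.append(f"unknown country code: {form.get('country')!r}")
    quantity = str(form.get("quantity", ""))
    if not quantity.isdigit() or int(quantity) < 1:
        errors.append("quantity must be a positive integer")
    if errors:
        raise ValueError("; ".join(errors))
    return form

try:
    accept_order_entry({"customer_id": "C001", "country": "XX", "quantity": "0"})
except ValueError as error:
    print(f"Rejected at the source: {error}")
```

Following the Rule of Ten, every record rejected by an edit control like this one is a unit of work that will not cost ten times as much to complete downstream.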

Obviously, it is impossible to truly prevent every problem before it happens.  However, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information.

 

Hyperactive Data Quality

Too many enterprise information initiatives fail because they are launched based on a "flight to data quality" response and have the unrealistic perspective that data quality problems can be quickly and easily resolved.  However, just like any complex problem, there is no fast and easy solution for data quality.

To be successful, you must combine aspects of both reactive and proactive data quality to create an enterprise-wide best practice that I call Hyperactive Data Quality, which makes the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.  Is your data quality Reactive, Proactive or Hyperactive?

All I Really Need To Know About Data Quality I Learned In Kindergarten

Robert Fulghum's excellent book All I Really Need to Know I Learned in Kindergarten dominated the New York Times Bestseller List for all of 1989 and much of 1990.  The 15th Anniversary Edition, which was published in 2003, revised and expanded on the original inspirational essays.

A far less noteworthy achievement of the book is that it also inspired me to write about how:

All I Really Need To Know About Data Quality I Learned in Kindergarten

Show And Tell

I loved show and tell.  An opportunity to deliver an interactive presentation that encouraged audience participation.  No PowerPoint slides.  No podium.  No power suit.  Just me wearing the dorky clothes my parents bought me, standing right in front of the class, waving my Millennium Falcon over my head and explaining that "traveling through hyperspace ain't like dustin' crops, boy" while my classmates (and my teacher) were laughing so hard many of them fell out of their seats.  My show and tell made it clear that if you came over to my house after school to play, then you knew exactly what to expect - a geek who loved Star Wars - perhaps a little too much.

When you present the business case for your data quality initiative to executive management and other corporate stakeholders, remember the lessons of show and tell.  Poor data quality is not a theoretical problem - it is a real business problem that negatively impacts the quality of decision-critical enterprise information.  Your presentation should make it clear that if the data quality initiative doesn't get approved, then everyone will know exactly what to expect:

"Poor data quality is the path to the dark side. 

Poor data quality leads to bad business decisions. 

Bad business decisions lead to lost revenue.

Lost revenue leads to suffering."

The Five Second Rule

If you drop your snack on the floor, then as long as you pick it up within five seconds you can safely eat it.  When you have poor quality data in your enterprise systems, you do have more than five seconds to do something about it.  However, the longer poor quality data goes without remediation, the more likely it will negatively impact critical business decisions.  Don't let your data become the "smelly kid" in class.  No one likes to share their snacks with the smelly kid.  And no one trusts information derived from "smelly data."

 

When You Make A Mistake, Say You're Sorry

Nobody's perfect.  We all have bad days.  We all occasionally say and do stupid things.  When you make a mistake, own up to it and apologize for it.  You don't want to have to wear the dunce cap or stand in the corner for a time-out.  And don't be too hard on your friend that had to wear the dunce cap today.  It was simply their turn to make a mistake.  It will probably be your turn tomorrow.  They had to say they were sorry.  You also have to forgive them.  Who else is going to share their cookies with you when your mom once again packs carrots as your snack?

 

Learn Something New Every Day

We didn't stop learning after we "graduated" from kindergarten, did we?  We are all proud of our education, knowledge, understanding, and experience.  It may be true that experience is the path that separates knowledge from wisdom.  However, we must remain open to learning new things.   Socrates taught us that "the only true wisdom consists in knowing that you know nothing."  I bet Socrates headlined the story time circuit in the kindergartens of Ancient Greece.

 

Hold Hands And Stick Together

I remember going on numerous field trips in kindergarten.  We would visit museums, zoos and amusement parks.  Wherever we went, our teacher would always have us form an interconnected group by holding the hand of the person in front of you and the person behind you.  We were told to stick together and look out for one another.  This important lesson is also applicable to data quality initiatives.  Teamwork and collaboration are essential for success.  Remember that you are all in this together.

 

What did you learn about data quality in kindergarten?

A Portrait of the Data Quality Expert as a Young Idiot

Once upon a time (and a very good time it was), there was a young data quality consultant that fancied himself an expert.

 

He went from client to client and project to project, all along espousing his expertise.  He believed he was smarter than everyone else.  He didn't listen well - he simply waited for his turn to speak.  He didn't foster open communication without bias - he believed his ideas were the only ones of value.  He didn't seek mutual understanding on difficult issues - he bullied people until he got his way.  He didn't believe in the importance of the people involved in the project - he believed the project would be successful with or without them.

 

He was certain he was always right.

 

And he failed - many, many times.

 

In his excellent book How We Decide, Jonah Lehrer advocates paying attention to your inner disagreements, becoming a student of your own errors, and avoiding the trap of certainty.  When you are certain that you're right, you stop considering the possibility that you might be wrong.

 

James Joyce wrote that "mistakes are the portals of discovery" and T.S. Eliot wrote that "we must not cease from exploration and the end of all our exploring will be to arrive where we began and to know the place for the first time."

 

Once upon a time, there was a young data quality consultant that realized he was an idiot - and a very good time it was.

Are You Afraid Of Your Data Quality Solution?

As a data quality consultant, when I begin an engagement with a new client, I ask many questions.  I seek an understanding of the current environment from both the business and technical perspectives.  Some of the common topics I cover are what data quality solutions have been attempted previously, how successful they were, and whether they are still in use today.  To their credit, I find that many of my clients have successfully implemented data quality solutions that are still in use.

 

However, this revelation frequently leads to some form of the following dialogue:

OCDQ:  "Am I here to help with the enhancements for the next iteration of the project?"

Client:  "No, we don't want to enhance our existing solution, we want you to build us a brand new one."

OCDQ:  "I thought you had successfully implemented a data quality solution.  Is that not true?"

Client:  "We believe the current solution is working as intended.  It appears to handle many of our data quality issues."

OCDQ:  "How long have you been using the current solution?"

Client:  "Five years."

OCDQ:  "You haven't made any changes in five years?  Haven't there been requests for bug fixes and enhancements?"

Client:  "Yes, of course.  However, we didn't want to make any modifications because we were afraid we would break it."

OCDQ:  "Who created the current solution?  Didn't they provide documentation, training and knowledge transfer?"

Client:  "A previous consultant created it.  He provided some documentation and training, but only on how to run it."

 

A common data quality adage is:

"If you can't measure it, then you can't manage it." 

A far more important data quality adage is:

"If you don't know how to maintain it, then you shouldn't implement it."

 

There are many important considerations when planning a data quality initiative.  One of the most common mistakes is the unrealistic perspective that data quality problems can be permanently "fixed" by implementing a one-time "solution" that doesn't require ongoing improvements.  This flawed perspective leads many organizations to invest in powerful software and expert consultants, believing that:

"If they build it, data quality will come." 

However, data quality is not a field of dreams - and I know because I actually live in Iowa.

 

The reality is data quality initiatives can only be successful when they follow these very simple and time-tested instructions:

Measure, Improve, Repeat.
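
As a rough illustration of the "Measure" step, here is a minimal sketch (under assumed metric definitions and hypothetical field names) that computes simple completeness percentages, so each "Improve" iteration can be compared against the previous baseline when you "Repeat."

```python
def measure(records, required_fields):
    """Compute a completeness percentage for each required field."""
    total = len(records) or 1
    metrics = {}
    for field in required_fields:
        populated = sum(1 for r in records if str(r.get(field, "")).strip())
        metrics[f"{field}_completeness_pct"] = round(100.0 * populated / total, 2)
    return metrics

baseline = measure(
    [{"customer_id": "C001", "email": ""}, {"customer_id": "", "email": "pat@example.com"}],
    required_fields=["customer_id", "email"],
)
print(baseline)  # {'customer_id_completeness_pct': 50.0, 'email_completeness_pct': 50.0}
```

Real data quality scorecards measure far more than completeness, but the point stands: without a repeatable measurement, there is no way to know whether the next iteration actually improved anything.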


Enterprise Data World 2009

Formerly known as the DAMA International Symposium and Wilshire MetaData Conference, Enterprise Data World 2009 was held April 5-9 in Tampa, Florida at the Tampa Convention Center.

 

Enterprise Data World is the business world’s most comprehensive vendor-neutral educational event about data and information management.  This year’s program was bigger than ever before, with more sessions, more case studies, and more can’t-miss content.  With 200 hours of in-depth tutorials, hands-on workshops, practical sessions and insightful keynotes, the conference was a tremendous success.  Congratulations and thanks to Tony Shaw, Maya Stosskopf and the entire Wilshire staff.

 

I attended Enterprise Data World 2009 as a member of the Iowa Chapter of DAMA and as a Data Quality Journalist for the International Association for Information and Data Quality (IAIDQ).

I used Twitter to provide live reporting from the sessions that I was attending.

I wish that I could have attended every session, but here are some highlights from ten of my favorites:

 

8 Ways Data is Changing Everything

Keynote by Stephen Baker from BusinessWeek

His article Math Will Rock Your World inspired his excellent book The Numerati.  Additionally, check out his blog: Blogspotting.

Quotes from the keynote:

  • "Data is changing how we understand ourselves and how we understand our world"
  • "Predictive data mining is about the mathematical modeling of humanity"
  • "Anthropologists are looking at social networking (e.g. Twitter, Facebook) to understand the science of friendship"

 

Master Data Management: Proven Architectures, Products and Best Practices

Tutorial by David Loshin from Knowledge Integrity.

Included material from his excellent book Master Data Management.  Additionally, check out his blog: David Loshin.

Quotes from the tutorial:

  • "Master Data are the core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies"
  • "Master Data Management (MDM) provides a unified view of core data subject areas (e.g. Customers, Products)"
  • "With MDM, it is important not to over-invest and under-implement - invest in and implement only what you need"

 

Master Data Management: Ignore the Hype and Keep the Focus on Data

Case Study by Tony Fisher from DataFlux and Jeff Grayson from Equinox Fitness.

Quotes from the case study:

  • "The most important thing about Master Data Management (MDM) is improving business processes"
  • "80% of any enterprise implementation should be the testing phase"
  • "MDM Data Quality (DQ) Challenge: Any % wrong means you’re 100% certain you’re not always right"
  • "MDM DQ Solution: Re-design applications to ensure the ‘front-door’ protects data quality"
  • "Technology is critical, however thinking through the operational processes is more important"

 

A Case of Usage: Working with Use Cases on Data-Centric Projects

Case Study by Susan Burk from IBM.

Quotes from the case study:

  • "Use Case is a sequence of actions performed to yield a result of observable business value"
  • "The primary focus of data-centric projects is data structure, data delivery and data quality"
  • "Don’t like use cases? – ok, call them business acceptance criteria – because that’s what a use case is"

 

Crowdsourcing: People are Smart, When Computers are Not

Session by Sharon Chiarella from Amazon Web Services.

Quotes from the session:

  • "Crowdsourcing is outsourcing a task typically performed by employees to a general community of people"
  • "Crowdsourcing eliminates over-staffing, lowers costs and reduces work turnaround time"
  • "An excellent example of crowdsourcing is open source software development (e.g. Linux)"

 

Improving Information Quality using Lean Six Sigma Methodology

Session by Atul Borkar and Guillermo Rueda from Intel.

Quotes from the session:

  • "Information Quality requires a structured methodology in order to be successful"
  • Lean Six Sigma Framework: DMAIC – Define, Measure, Analyze, Improve, Control:
    • Define = Describe the challenge, goal, process and customer requirements
    • Measure = Gather data about the challenge and the process
    • Analyze = Use hypothesis and data to find root causes
    • Improve = Develop, implement and refine solutions
    • Control = Plan for stability and measurement

 

Universal Data Quality: The Key to Deriving Business Value from Corporate Data

Session by Stefanos Damianakis from Netrics.

Quotes from the session:

  • "The information stored in databases is NEVER perfect, consistent and complete – and it never can be!"
  • "Gartner reports that 25% of critical data within large businesses is somehow inaccurate or incomplete"
  • "Gartner reports that 50% of implementations fail due to lack of attention to data quality issues"
  • "A powerful approach to data matching is the mathematical modeling of human decision making"
  • "The greatest advantage of mathematical modeling is that there are no data matching rules to build and maintain"

 

Defining a Balanced Scorecard for Data Management

Seminar by C. Lwanga Yonke, a founding member of the International Association for Information and Data Quality (IAIDQ).

Quotes from the seminar:

  • "Entering the same data multiple times is like paying the same invoice multiple times"
  • "Good metrics help start conversations and turn strategy into action"
  • Good metrics have the following characteristics:
    • Business Relevance
    • Clarity of Definition
    • Trending Capability (i.e. metric can be tracked over time)
    • Easy to aggregate and roll-up to a summary
    • Easy to drill-down to the details that comprised the measurement

 

Closing Panel: Data Management’s Next Big Thing!

Quotes from Panelist Peter Aiken from Data Blueprint:

  • Capability Maturity Levels:
    1. Initial
    2. Repeatable
    3. Defined
    4. Managed
    5. Optimized
  • "Most companies are at a capability maturity level of (1) Initial or (2) Repeatable"
  • "Data should be treated as a durable asset"

Quotes from Panelist Noreen Kendle from Burton Group:

  • "A new age for data and data management is on horizon – a perfect storm is coming"
  • "The perfect storm is being caused by massive data growth and software as a service (i.e. cloud computing)"
  • "Always remember that you can make lemonade from lemons – the bad in life can be turned into something good"

Quotes from Panelist Karen Lopez from InfoAdvisors:

  • "If you keep using the same recipe, then you keep getting the same results"
  • "Our biggest problem is not technical in nature - we simply need to share our knowledge"
  • "Don’t be a dinosaur! Adopt a ‘go with what is’ philosophy and embrace the future!"

Quotes from Panelist Eric Miller from Zepheira:

  • "Applications should not be ON The Web, but OF The Web"
  • "New Acronym: LED – Linked Enterprise Data"
  • "Semantic Web is the HTML of DATA"

Quotes from Panelist Daniel Moody from University of Twente:

  • "Unified Modeling Language (UML) was the last big thing in software engineering"
  • "The next big thing will be ArchiMate, which is a unified language for enterprise architecture modeling"

 

Mark Your Calendar

Enterprise Data World 2010 will take place in San Francisco, California at the Hilton San Francisco on March 14-18, 2010.

There are no Magic Beans for Data Quality

The CIO put Jack in charge of an enterprise initiative with a sizable budget to spend on improving data quality.

Jack was sent to a leading industry conference to evaluate data quality vendors.  While his flight was delayed, Jack was passing the time in the airport bar when he was approached by Machiavelli, a salesperson from a data quality software company called Magic Beans.

Machiavelli told Jack that he didn't need to go to the conference to evaluate vendors.  Instead, Jack could simply trade his entire budget for an unlimited license of Magic Beans.

Machiavelli assured Jack that Magic Beans had the following features:

  • Simple to install
  • Remarkably intuitive user interface
  • Processes a gazillion records per nanosecond
  • Clairvoyantly detects and corrects existing data quality problems
  • Prevents all future data quality problems from happening

Jack agreed to the trade and went back to the office with Magic Beans.

Eighteen months later, Jack and the CIO carpooled to Washington, D.C. to ask Congress for a sizable bailout.

What is the moral of this story? 

(Other than never trust a salesperson named Machiavelli.)

There are many data quality vendors to choose from and all of them offer viable solutions driven by impressive technology.

However, technology sometimes carries with it a dangerous conceit – that what works in the laboratory and the engineering department will work in the boardroom and the accounting department, that what is true for the mathematician and the computer scientist will be true for the business analyst and the data steward.

My point is neither to discourage the purchase of data quality software, nor to try to convince you which vendor I think provides the superior solution – especially since these types of opinions are usually biased by the practical limits of your personal experience and motivated by the kind folks who are currently paying your salary.

And I am certainly not a Luddite opposed to the use of technology.  I am first, foremost, and proudly a techno-geek of the highest order.  However, I have seen too many projects fail when a solution to data quality problems was attempted by “throwing technology at it.”  I have seen beautifully architected, wonderfully coded, elegantly implemented technical solutions result in complete and utter failure.  These projects failed neither because using technology was the wrong approach nor because the wrong data quality software was selected.

Data quality solutions require a holistic approach involving people, methodology, and technology.

People

Sometimes, people doubt that data quality problems could be prevalent in their systems.  This “data denial” is not necessarily a matter of blissful ignorance, but is often a natural self-defense mechanism from the data owners on the business side and/or the process owners on the technical side.  No one likes to feel blamed for causing or failing to fix the data quality problems.  This is one of the many human dynamics that are missing from the relatively clean room of the laboratory where the technology was developed.  You must consider the human factor because it will be the people involved in the project, and not the technology itself, that will truly make the project successful.

 

Methodology

Data characteristics and their associated quality challenges are unique from company to company.  Data quality can be defined differently by different functional areas within the same company.  Business rules can change from project to project.  Decision makers on the same project can have widely varying perspectives.  All of this points to the need for having an effective methodology, which will help you maximize the time and effort as well as the subsequent return on whatever technology you invest in.

 

Technology

I have used software from most of the vendors in the Gartner Data Quality Magic Quadrant and many of the so-called niche vendors.  So I speak from experience when I say that all data quality vendors have viable solutions driven by impressive technology.  However, don't let the salesperson “blind you with science” into having unrealistic expectations of the software.  I am not trying to accuse all salespeople of Machiavellian machinations (even though we have all encountered a few who would shamelessly sell their mother’s soul to meet their quota).

 

Conclusion

Just like any complex problem, there is no fast and easy solution.  Although incredible advancements in technology continue, there are no Magic Beans for Data Quality.

And there never will be.

An organization's data quality initiative can only be successful when people take on the challenge united by collaboration, guided by an effective methodology, and of course, implemented with amazing technology.

Data Quality Whitepapers are Worthless

During a 1609 interview, William Shakespeare was asked his opinion about an emerging genre of theatrical writing known as Data Quality Whitepapers.  The "Bard of Avon" was clearly not a fan.  His famously satirical response was:

Data quality's but a writing shadow, a poor paper

That struts and frets its words upon the page

And then is heard no more:  it is a tale

Told by a vendor, full of sound and fury

Signifying nothing.

 

Four centuries later, I find myself in complete agreement with Shakespeare (and not just because Harold Bloom told me so).

 

Today is April Fool's Day, but I am not joking around - call Dennis Miller and Lewis Black - because I am ready to RANT.

 

I am sick and tired of reading whitepapers.  Here is my "Bottom Ten List" explaining why: 

  1. Ones that make me fill out a "please mercilessly spam me later" contact information form before I am allowed to download them remind me of Mrs. Bun: "I DON'T LIKE SPAM!"
  2. Ones that, after I read their supposed pearls of wisdom, make me shake my laptop violently like an Etch-A-Sketch.  I have lost count of how many laptops I have destroyed this way.  I have started buying them in bulk at Wal-Mart.
  3. Ones comprised entirely of the exact same information found on the vendor's website make www = World Wide Worthless.
  4. Ones that start out good, but just when they get to the really useful stuff, refer to content only available to paying customers.  What a great way to guarantee that neither I nor anyone I know will ever become your paying customer!
  5. Ones that have a "Shock and Awe" title followed by "Aw Shucks" content because apparently the entire marketing budget was spent on the title.
  6. Ones that promise me the latest BUZZ but deliver only ZZZ are worthless - except when I have insomnia.
  7. Ones that claim to be about data quality, but have nothing at all to do with data quality:  "...don't make me angry.  You wouldn't like me when I'm angry."
  8. Ones that take the adage "a picture is worth a thousand words" too far by using a dizzying collage of logos, charts, graphs and other visual aids.  This is one reason we're happy that Pablo Picasso was a painter.  However, he did once write that "art is a lie that makes us realize the truth."  Maybe he was defending whitepapers.
  9. Ones that use acronyms without ever defining what they stand for remind me of that scene from Good Morning, Vietnam: "Excuse me, sir.  Seeing as how the VP is such a VIP, shouldn't we keep the PC on the QT?  Because if it leaks to the VC he could end up MIA, and then we'd all be put out in KP."
  10. Ones that really know they're worthless but aren't honest about it.  Don't promise me "The Top 10 Metrics for Data Quality Scorecards" and give me a list as pointless as this one.

 

I am officially calling out all writers of Data Quality Whitepapers. 

Shakespeare and I both believe that you can't write anything about data quality that is worth reading. 

Send your data quality whitepapers to Obsessive-Compulsive Data Quality, and if they are not worthless, then I will let the world know that you proved Shakespeare and me wrong.

 

And while I am on a rant roll, I am officially calling out all Data Quality Bloggers.

The International Association for Information and Data Quality (IAIDQ) is celebrating its five-year anniversary by hosting:

El Festival del IDQ Bloggers – A Blog Carnival for Information/Data Quality Bloggers

For more information about the blog carnival, please follow this link:  IAIDQ Blog Carnival

You're So Vain, You Probably Think Data Quality Is About You

Don't you?

"Data Quality is an IT issue because information is stored in databases and applications that they manage.  Therefore, if there are problems with the data, then IT is responsible for cleaning up their own mess."

"Data Quality is a Business issue because information is created by business processes and users that they manage.  Therefore, if there are problems with the data, then the Business is responsible for cleaning up their own mess."

Responding to these common views (channeling the poet Walt Whitman), I sound my barbaric yawp over the roofs of the world:

"Data Quality is not an IT issue.  Data Quality is not a Business issue.  Data Quality is everyone's issue."

Unsuccessful data quality projects are most often characterized by the Business meeting independently to define the requirements and IT meeting independently to write the specifications.  Typically, IT then follows the all too common mantra of “code it, test it, implement it into production, and declare victory” that leaves the Business frustrated with the resulting “solution.”

Successful data quality projects are driven by an executive management mandate for the Business and IT to forge an ongoing and iterative collaboration throughout the entire project.  The Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise, and it must partner with IT in defining the necessary data quality standards and processes.

Here are some recommendations for fostering collaboration on your data quality project:

  • Provide Leadership – not only does the project require an executive sponsor to provide oversight and arbitrate any issues of organizational politics, but the Business and IT must each designate a team leader for the initiative.  Choose these leaders wisely.  The best choices are not necessarily those with the most seniority or authority.  You must choose leaders who know how to listen well, foster open communication without bias, seek mutual understanding on difficult issues, and truly believe it is the people involved that make projects successful.  Your team leaders should also collectively meet with the executive sponsor on a regular basis in order to demonstrate to the entire project team that collaboration is an imperative to be taken seriously.
  • Formalize the Relationship – consider creating a service level agreement (SLA) where the Business views IT as a supplier and IT views the Business as a customer.  However, there is no need to get the lawyers involved.  My point is that this internal strategic partnership should be viewed no differently than an external one.  Remember that you are formalizing a relationship based on mutual trust and cooperation.
  • Share Ideas – foster an environment in which a diversity of viewpoints is freely shared without prejudice.  For example, the Business often has practical insight on application development tasks, and IT often has a pragmatic view about Business processes.  Consider including everyone as optional invitees to meetings.  You may be pleasantly surprised at how often people not only attend but also make meaningful contributions.  Remember that you are all in this together.

 

Conclusion

Data quality is not about you.  Data quality is about us.

I believe in us.

Don't you?

 

The Data Quality Goldilocks Zone

In astronomy, the habitable region of space where stellar conditions are favorable for life as it is found on Earth is referred to as the "Goldilocks Zone" because such a region of space is neither too close to the sun (making it too hot) nor too far away from the sun (making it too cold), but is "just right."

 

In data quality, there is also a Goldilocks Zone, which is the habitable region of time when project conditions are favorable for success.

 

Too many projects fail because of lofty expectations, unmanaged scope creep, and the unrealistic perspective that data quality problems can be permanently “fixed” as opposed to needing eternal vigilance.  In order to be successful, projects must always be understood as an iterative process.  Return on investment (ROI) will be achieved by targeting well defined objectives that can deliver small incremental returns that will build momentum to larger success over time. 

 

Data quality projects are easy to get started, even easier to end in failure, and often lack the decency of at least failing quickly.  Just like any complex problem, there is no fast and easy solution for data quality.

 

Projects are launched to understand and remediate the poor data quality that is negatively impacting decision-critical enterprise information.  Data-driven problems require data-driven solutions.  At that point in the project lifecycle when the team must decide if the efforts of the current iteration are ready for implementation, they are dealing with the Data Quality Goldilocks Zone, which instead of being measured by proximity to the sun, is measured by proximity to full data remediation, otherwise known as perfection.

 

The obvious problem is that perfection is impossible.  An obsessive-compulsive quest to find and fix every data quality problem is a laudable pursuit but ultimately a self-defeating cause.  Data quality problems can be very insidious and even the best data remediation process will still produce exceptions.  As a best practice, your process should be designed to identify and report exceptions when they occur.  In fact, many implementations will include logic to provide the ability to suspend exceptions for manual review and correction.

 

Although all of this is easy to accept in theory, it is notoriously difficult to accept in practice.

 

For example, let’s imagine that your project is processing one billion records and that exhaustive analysis has determined that the results are correct 99.99999% of the time, meaning that exceptions occur in only 0.00001% of the total data population.  Now, imagine explaining these statistics to the project team, but providing only the 100 exception records for review.  Do not underestimate the difficulty that the human mind has with large numbers (i.e. 100 is an easy number to relate to, but one billion is practically incomprehensible).  Also, don’t ignore the effect known as “negativity bias,” where bad evokes a stronger reaction than good in the human mind - just compare an insult and a compliment: which one do you remember more often?  Focusing on the exceptions can undermine confidence and prevent acceptance of an overwhelmingly successful implementation.
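
For the skeptical, here is a quick check of the arithmetic in this example:

```python
# Sanity check: 99.99999% correct out of one billion records.
total_records = 1_000_000_000
exception_rate = 0.0000001        # 0.00001% expressed as a fraction
exceptions = total_records * exception_rate
print(f"{exceptions:.0f} exception records out of {total_records:,}")
# prints: 100 exception records out of 1,000,000,000
```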

 

If you can accept there will be exceptions, admit perfection is impossible, implement data quality improvements in iterations, and acknowledge when the current iteration has reached the Data Quality Goldilocks Zone, then your data quality initiative will not be perfect, but it will be "just right."

Identifying Duplicate Customers

I just finished publishing a five-part series of articles on data matching methodology for dealing with the common data quality problem of identifying duplicate customers.

The article series was published on Data Quality Pro, which is the leading data quality online magazine and free independent community resource dedicated to helping data quality professionals take their career or business to the next level.

Topics covered in the series:

  • Why a symbiosis of technology and methodology is necessary when approaching the common data quality problem of identifying duplicate customers
  • How performing a preliminary analysis on a representative sample of real project data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem
  • How to document your business rules for identifying duplicate customers
  • How to set realistic expectations about application development
  • How to foster a collaboration of the business and technical teams throughout the entire project
  • How to consolidate identified duplicates by creating a “best of breed” representative record (a simplified sketch of these ideas follows this list)
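
To give a flavor of these topics, here is my own simplified sketch - not the methodology from the article series - that scores candidate duplicates with a naive fuzzy comparison and builds a “best of breed” record from a matched pair.  The thresholds, weights, and survivorship rule are hypothetical and would be replaced by the business rules defined through the analysis described in the series.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Naive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_probable_duplicate(rec1, rec2, threshold=0.85):
    """Weighted comparison: too low a threshold creates false positives,
    too high a threshold creates false negatives."""
    name_score = similarity(rec1["name"], rec2["name"])
    address_score = similarity(rec1["address"], rec2["address"])
    return (0.6 * name_score + 0.4 * address_score) >= threshold

def best_of_breed(rec1, rec2):
    """Keep the most complete value for each field (a naive survivorship rule)."""
    return {key: max(rec1.get(key, ""), rec2.get(key, ""), key=len)
            for key in rec1.keys() | rec2.keys()}

a = {"name": "Jon Smith", "address": "123 Main St", "phone": ""}
b = {"name": "John Smith", "address": "123 Main Street", "phone": "555-0100"}
if is_probable_duplicate(a, b):
    print(best_of_breed(a, b))
```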

To read the series, please follow these links:

The Very Model of a Modern DQ General

[Image: Sir Henry Lytton as the Major-General]

With apologies to fellow fans of Gilbert and Sullivan and Sir Henry Lytton, I offer the following Data Quality (DQ) General's Song.  It is certainly not up to the high standards of The Pirates of Penzance or any other comic opera for that matter.  However, I hope that you find it entertaining.

 

 

 

 

The DQ General's Song

 

I am the very model of a modern DQ General,
I've cleansed data customer, product, and informational,
I know the challenges of data quality, and I quote issues historical,
From the Business to IT, in order categorical,
I'm very well acquainted, too, with matters very practical,
I understand application development is really quite iterational,
About teamwork and collaboration, I'm teeming with a lot o' news,
With many cheerful facts about how to succeed and not to lose.

I know the key to successful projects is the people, golly gee,
From executive sponsors and team leaders down to every busy bee,
Only together can we achieve great things tactically and strategically,
I have learned what progress has been made with modern technology,
But I understand that those are business problems we all see,
And nothing can be achieved without effective methodology;
In short, with data customer, product, and informational,
I am the very model of a modern DQ General.

I know understanding data is essential to using it effectively,
And that data's best friends are its stewards, analysts, and SMEs,
Profiling and statistical analysis can be a wonderful tool,
But if I forget the business context then I'll look like a fool,
I check for completeness and accuracy in all of my fields,
But always verify relevancy to boost my analytical yields.

I'm very good at matching and linking records probabilistically,
But I know often it can be done just as well deterministically,
And have even seen it performed quite supercalifragilistically;
In short, with data customer, product, and informational,
I am the very model of a modern DQ General.

Even with my impressive knowledge, I am still learning and must stay adventury,
With hard work and dedication, I will know everything by the end of the century;
But still, with data customer, product, and informational,
I am the very model of a modern DQ General.

Do you have obsessive-compulsive data quality (OCDQ)?

Obsessive-compulsive data quality (OCDQ) affects millions of people worldwide.

The most common symptoms of OCDQ are:

  • Obsessively verifying data used in critical business decisions
  • Compulsively seeking an understanding of data in business terms
  • Repeatedly checking that data is complete and accurate before sharing it
  • Habitually attempting to calculate the cost of poor data quality
  • Constantly muttering a mantra that data quality must be taken seriously

While the good folks at Prescott Pharmaceuticals are busy working on a treatment, I am dedicating this independent blog as group therapy to all those who (like me) have dealt with OCDQ their entire professional lives.

Over the years, the work of many individuals and organizations has been immensely helpful to those of us with OCDQ.

Some of these heroes deserve special recognition:

Data Quality Pro – Founded and maintained by Dylan Jones, Data Quality Pro is a free independent community resource dedicated to helping data quality professionals take their career or business to the next level. With the mission to create the most beneficial data quality resource that is freely available to members around the world, Data Quality Pro provides free software, job listings, advice, tutorials, news, views and forums. Their goal is "winning-by-sharing" and they believe that by contributing a small amount of their experience, skill or time to support other members then truly great things can be achieved. With the new Member Service Register, consultants, service providers and technology vendors can promote their services and include links to their websites and blogs.

 

International Association for Information and Data Quality (IAIDQ) – Chartered in January 2004, IAIDQ is a not-for-profit, vendor-neutral professional association whose purpose is to create a world-wide community of people who desire to reduce the high costs of low quality information and data by applying sound quality management principles to the processes that create, maintain and deliver data and information. IAIDQ was co-founded by Larry English and Tom Redman, who are two of the most respected and well-known thought and practice leaders in the field of information and data quality.  IAIDQ also provides two excellent blogs: IQ Trainwrecks and Certified Information Quality Professional (CIQP).

 

Beth Breidenbach – her blog Confessions of a database geek is fantastic in and of itself, but she has also compiled an excellent list of data quality blogs and provides them via aggregated feeds in both Feedburner and Google Reader formats.

 

Vincent McBurney – his blog Tooling Around in the IBM InfoSphere is an entertaining and informative look at data integration in the IBM InfoSphere covering many IBM Information Server products such as DataStage, QualityStage and Information Analyzer.

 

Daragh O Brien – is a leading writer, presenter and researcher in the field of information quality management, with a particular interest in legal aspects of information quality. His blog The DOBlog is a popular and entertaining source of great material.

 

Steve Sarsfield – his blog Data Governance and Data Quality Insider covers the world of data integration, data governance, and data quality from the perspective of an industry insider. Also, check out his new book: The Data Governance Imperative.