Recently Read: December 7, 2009

Recently Read is an OCDQ regular segment.  Each entry provides links to blog posts, articles, books, and other material I found interesting enough to share.  Please note “recently read” is literal – therefore what I share wasn't necessarily recently published.

 

Data Quality

For simplicity, “Data Quality” also includes Data Governance, Master Data Management, and Business Intelligence.

  • Data Quality Blog Roundup - November 2009 Edition – Dylan Jones at Data Quality Pro always provides a great collection of the previous month's best blog posts, which covers most of my “recently reads” for data quality.

     

  • The value of Christmas cards – In this Data Value Talk blog post from Human Inference, we learn how sending Christmas cards can optimize your data quality.

     

  • Santa Quality – Yes, Virginia, there is a Santa Claus—as well as a Saint Nicholas, a Père Noël, a Weihnachtsmann, and a Julemand.  In this blog post, Henrik Liliendahl Sørensen explains some ho-ho-holiday data quality issues.

     

  • Some TLC for Your Data – Data really needs some tender loving care, as Daniel Gent explains in his latest blog post.

     

  • Determining data quality is the first key step – In the second part of a blog series on data migration, James Standen explains that a data migration project will also be required to actually improve data quality at the same time, and is therefore really two projects in one.  The post contains the great line: “data quality sense tingling.”

     

  • Data Chaos and Five Truisms of Data Quality – In his debut post on the DataFlux Community of Experts, my good friend Phil Simon provides a quick case study and five universal truths of data quality.

 

Social Media

For simplicity, “Social Media” also includes Blogging, Writing, Social Networking, and Online Marketing.

 

Awesome Stuff

An eclectic list of articles, blog posts, and other “non-data quality, non-social media, but still awesome” stuff.

  • The Greatest Book Of All Time? – Josh Hanagarne (a.k.a. the “World’s Strongest Librarian”) recently reviewed a book he received from Ethan.  Josh has a simple philosophy of life — “Don’t make anyone’s day worse.”  If you are having a bad day (like I was the day I found this), then check this out.

     

  • Cute Apple parody from The Sun – Rob Beschizza on Boing Boing shares a great one-minute video of a recent commercial from The Sun about “The UK's best handheld for 40 years.”


Live-Tweeting: Data Governance

The term “live-tweeting” describes using Twitter to provide near real-time reporting from an event.  I live-tweet from the sessions I attend at industry conferences as well as interesting webinars.

Recently, I live-tweeted Successful Data Stewardship Through Data Governance, which was a data governance webinar featuring Marty Moseley of Initiate Systems and Jill Dyché of Baseline Consulting.

Instead of writing a blog post summarizing the webinar, I thought I would list my tweets with brief commentary.  My goal is to provide an example of this particular use of Twitter so you can decide its value for yourself.

 

As the webinar begins, Marty Moseley and Jill Dyché provide some initial thoughts on data governance:

Live-Tweets 1

 

Jill Dyché provides a great list of data governance myths and facts:

Live-Tweets 2

 

Jill Dyché provides some data stewardship insights:

Live-Tweets 3

 

As the webinar ends, Marty Moseley and Jill Dyché provide some closing thoughts about data governance and data quality:

Live-Tweets 4

 

Please Share Your Thoughts

If you attended the webinar, then you know additional material was presented.  Did my tweets do the webinar justice?  Did you follow along on Twitter during the webinar?  If you did not attend the webinar, are these tweets helpful?

What are your thoughts in general regarding the pros and cons of live-tweeting? 

 

Related Posts

The following three blog posts are conference reports based largely on my live-tweets from the events:

Enterprise Data World 2009

TDWI World Conference Chicago 2009

DataFlux IDEAS 2009

Data Quality is Sexy

 


I am sick and tired of hearing people talk about how data quality (DQ) is not sexy.

I was talking with my friend J.T. the other day and he told me I simply needed to remind people data quality has always been sexy.  Sometimes, people just have a tendency to forget. 

J.T. told me:

“You know what you gotta do J.H.?  You gotta bring DQ Sexy back.”

True dat, J.T.

 

I'm Bringing DQ Sexy Back

 


 

I’m bringing DQ Sexy back

All you naysayers, watch how I attack

I think your data’s special, why does your quality lack?

Grant me some access, and I’ll pick up the slack

 

 


 

Dirty data – you see the problems everywhere

Let me be your data cleanser, and baby, I'll be there

We'll whip the Business Process if it misbehaves

But just remember – trying to be perfect – it's not the way

 

 


I’m bringing DQ Sexy back

Them non-team players don’t know how to act

Let our collaboration get us back on track

Working together, we'll make the right impact

 

 


 

Look at that data – it's your 'prise asset 
Treat it well, and all your business needs will be met

Understanding it will really make you smile 
To get started, you really need to profile

There's no need for you to be afraid – come on 
Go ahead – get your data freak on

 


I’m bringing DQ Sexy back

Any non-believers left?  Don't make me give you a smack

If you have data, you'd better watch out for what it lacks

'Cause quality is what it needs – and that’s a fact

 

 

Data Quality is Sexy


That’s right. 

Data Quality is Sexy. 

Always has been. 

Always will be.

True dat, J.H.

Fo real!

 

Adventures in Data Profiling (Part 8)

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.  This post is the conclusion of a vendor-neutral series on the methodology of data profiling.

Data profiling can help you perform essential analysis such as:

  • Provide a reality check for the perceptions and assumptions you may have about the quality of your data
  • Verify your data matches the metadata that describes it
  • Identify different representations for the absence of data (i.e., NULL and other missing values)
  • Identify potential default values
  • Identify potential invalid values
  • Check data formats for inconsistencies
  • Prepare meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage, and other important data architecture considerations.

 

Adventures in Data Profiling

This series was carefully designed as guided adventures in data profiling in order to provide the necessary framework for demonstrating and discussing the common functionality of data profiling tools and the basic methodology behind using one to perform preliminary data analysis.

In order to narrow the scope of the series, the scenario used was that a customer data source for a new data quality initiative had been made available to an external consultant with no prior knowledge of the data or its expected characteristics.  Additionally, business requirements had not yet been documented, and subject matter experts were not currently available.

This series did not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that were covered.  Both the data profiling tool and data used throughout the series were fictional.  The “screen shots” were customized to illustrate concepts and were not modeled after any particular data profiling tool.

This post summarizes the lessons learned throughout the series, and is organized under three primary topics:

  1. Counts and Percentages
  2. Values and Formats
  3. Drill-down Analysis

 

Counts and Percentages

One of the most basic features of a data profiling tool is the ability to provide counts and percentages for each field that summarize its content characteristics:

 Data Profiling Summary

  • NULL – count of the number of records with a NULL value 
  • Missing – count of the number of records with a missing value (i.e., non-NULL absence of data, e.g., character spaces) 
  • Actual – count of the number of records with an actual value (i.e., non-NULL and non-Missing) 
  • Completeness – percentage calculated as Actual divided by the total number of records 
  • Cardinality – count of the number of distinct actual values 
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records 
  • Distinctness – percentage calculated as Cardinality divided by Actual

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique.  In Part 2, Customer ID provided an excellent example.

Distinctness can be useful in evaluating the potential for duplicate records.  In Part 6, Account Number and Tax ID were used as examples.  Both fields were less than 100% distinct (i.e., some distinct actual values occurred on more than one record).  The implied business meaning of these fields made this an indication of possible duplication.

Data profiling tools generate other summary statistics, including minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).  Throughout the series, several examples were provided, especially in Part 3 during the analysis of Birth Date, Telephone Number, and E-mail Address.
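
To make these definitions concrete, here is a minimal sketch of how such a summary could be computed outside of a profiling tool.  It assumes pandas and a hypothetical customer DataFrame, and is only an illustration of the statistics defined above, not the fictional tool used in the series:

import pandas as pd

def profile_field(df: pd.DataFrame, field: str) -> dict:
    """Compute the basic counts and percentages described above for one field."""
    total = len(df)
    column = df[field]
    null_count = int(column.isna().sum())
    # Missing = non-NULL values that are empty or only character spaces
    missing_mask = (~column.isna()) & (column.astype(str).str.strip() == "")
    missing_count = int(missing_mask.sum())
    actual_count = total - null_count - missing_count
    # Cardinality = distinct actual values (NULL and Missing excluded)
    cardinality = int(column[(~column.isna()) & (~missing_mask)].nunique())
    return {
        "NULL": null_count,
        "Missing": missing_count,
        "Actual": actual_count,
        "Completeness": actual_count / total if total else 0.0,
        "Cardinality": cardinality,
        "Uniqueness": cardinality / total if total else 0.0,
        "Distinctness": cardinality / actual_count if actual_count else 0.0,
    }

# Hypothetical example: Customer ID should be 100% complete and 100% unique
customers = pd.DataFrame({"Customer ID": ["C001", "C002", "C002", None, "   "]})
print(profile_field(customers, "Customer ID"))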

 

Values and Formats

In addition to counts, percentages, and other summary statistics, a data profiling tool generates frequency distributions for the unique values and formats found within the fields of your data source.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality, indicating potential default values (e.g., Country Code in Part 4)
  • Fields with a relatively low cardinality (e.g., Gender Code in Part 2)
  • Fields with a relatively small number of known valid values (e.g., State Abbreviation in Part 4)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g., Customer ID in Part 2)
  • Fields with a relatively limited number of known valid formats (e.g., Birth Date in Part 3)
  • Fields with free-form values and a high cardinality (e.g., Customer Name 1 and Customer Name 2 in Part 7)

Cardinality can play a major role in deciding whether you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values, as we saw throughout the series (e.g., Telephone Number in Part 3).

Some fields can also be analyzed using partial values (e.g., in Part 3, Birth Year was extracted from Birth Date) or a combination of values and formats (e.g., in Part 6, Account Number had an alpha prefix followed by all numbers).

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  This analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high distinctness (i.e., the exact same field value rarely occurs on more than one record). 

Additionally, the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.  Examples of free-form field analysis were the focal points of Part 5 and Part 7.
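
As a rough sketch of how a format-based frequency distribution can be constructed, the small example below masks digits as 9 and letters as A.  The masking convention and the sample Telephone Number values are assumptions for illustration, not the notation of any particular profiling tool:

import re
from collections import Counter

def value_to_format(value: str) -> str:
    """Mask a value into a format pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def format_frequency(values):
    """Frequency distribution of formats rather than raw values."""
    return Counter(value_to_format(v) for v in values if v)

# Hypothetical Telephone Number values
phones = ["(555) 123-4567", "555-123-4567", "5551234567", "(555) 123-4567"]
for fmt, count in format_frequency(phones).most_common():
    print(fmt, count)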

We also saw examples of how valid values in a valid format can have an invalid context (e.g., in Part 3, Birth Date values set in the future), as well as how valid field formats can conceal invalid field values (e.g., Telephone Number in Part 3).

Part 3 also provided examples (in both Telephone Number and E-mail Address) of how you should not mistake completeness (which as a data profiling statistic indicates a field is populated with an actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid. 
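
A small sketch of that distinction for E-mail Address follows; the sub-value rule (a non-empty local part plus a dotted domain) is an assumed simplification for illustration:

from typing import Optional

def is_populated(value: Optional[str]) -> bool:
    """Completeness in the profiling sense: the field holds an actual value."""
    return value is not None and value.strip() != ""

def has_required_subvalues(email: str) -> bool:
    """Completeness in the validity sense: a local part plus a dotted domain (assumed rule)."""
    local, _, domain = email.partition("@")
    return bool(local) and "." in domain

for email in ["jim@example.com", "jim@", "   ", None]:
    populated = is_populated(email)
    print(email, populated, populated and has_required_subvalues(email))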

 

Drill-down Analysis

A data profiling tool will also provide the capability to drill down on its statistical summaries and frequency distributions in order to perform a more detailed review of records of interest.  Drill-down analysis will often provide useful data examples to share with subject matter experts.

Performing a preliminary analysis on your data prior to engaging in these discussions better facilitates meaningful dialogue because real-world data examples better illustrate actual data usage.  As stated earlier, understanding your data is essential to using it effectively and improving its quality.

Various examples of drill-down analysis were used throughout the series.  However, drilling all the way down to the record level was shown in Part 2 (Gender Code), Part 4 (City Name), and Part 6 (Account Number and Tax ID).
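
Continuing the hypothetical pandas sketches above, drilling down from a frequency distribution entry to the underlying records can be as simple as filtering on the value of interest:

import pandas as pd

def drill_down(df: pd.DataFrame, field: str, value) -> pd.DataFrame:
    """Return the full records behind a single value of interest in a field."""
    return df[df[field] == value]

# Hypothetical example: review the records behind an unexpected Gender Code value
customers = pd.DataFrame({
    "Customer ID": ["C001", "C002", "C003"],
    "Gender Code": ["F", "U", "F"],
})
print(drill_down(customers, "Gender Code", "U"))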

 

Conclusion

Fundamentally, this series posed the following question: What can analysis of the data alone tell you about it?

Data profiling is typically one of the first tasks performed on a data quality initiative.  I am often told to delay data profiling until business requirements are documented and subject matter experts are available to answer my questions. 

I always disagree – and begin data profiling as soon as possible.

I can do a better job of evaluating business requirements and preparing for meetings with subject matter experts after I have spent some time looking at data from a starting point of blissful ignorance and curiosity.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions.

Discovering the right questions is a critical prerequisite for effectively discussing data usage, relevancy, standards, and the metrics for measuring and improving quality – all of which are necessary in order to progress from just profiling your data to performing a full data quality assessment (which I will cover in a future series on this blog).

A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.  However, it is important to remember that the analysis itself cannot be automated – you need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more important, translate your analysis into meaningful reports and questions to share with the rest of your team. 

Always remember that well-performed data profiling is both a highly interactive and a very iterative process.

 

Thank You

I want to thank you for providing your feedback throughout this series. 

As my fellow Data Gazers, you provided excellent insights and suggestions via your comments. 

The primary reason I published this series on my blog, as opposed to simply writing a whitepaper or a presentation, was because I knew our discussions would greatly improve the material.

I hope this series proves to be a useful resource for your actual adventures in data profiling.

 

The Complete Series


Recently Read: November 28, 2009

Recently Read is an OCDQ regular segment.  Each entry provides links to blog posts, articles, books, and other material I found interesting enough to share.  Please note “recently read” is literal – therefore what I share wasn't necessarily recently published.

 

Data Quality Blog Posts

For simplicity, “Data Quality” also includes Data Governance, Master Data Management, and Business Intelligence.

 

Social Media Blog Posts

For simplicity, “Social Media” also includes Blogging, Social Networking, and Online Marketing.

 

Book Quotes

An eclectic list of quotes from some recently read (and/or simply my favorite) books.

  • From The Wisdom of Crowds by James Surowiecki – “Refuse to allow the merit of an idea to be determined by the status of the person advocating it.”

     

  • From Purple Cow by Seth Godin – “We mistakenly believe that criticism leads to failure.”

     

  • From How We Decide by Jonah Lehrer – “The best decision-makers don't despair.  Instead, they become students of error, determined to learn from what went wrong.”

     

  • From The Whuffie Factor by Tara Hunt – “Whuffie is the residual outcome—the currency—of your reputation.  You lose or gain it based on positive or negative actions, your contributions to the community, and what people think of you.”

     

  • From Trust Agents by Chris Brogan and Julien Smith – “You accrue social capital as a side benefit of doing good, but doing good by itself is its own reward.”

Commendable Comments (Part 4)

Thanksgiving

Photo via Flickr (Creative Commons License) by: ella_marie 

Today is Thanksgiving Day, which is a United States holiday with a long and varied history.  The most consistent themes remain family and friends gathering together to share a large meal and express their gratitude.

This is the fourth entry in my ongoing series for expressing my gratitude to my readers for their truly commendable comments on my blog posts.  Receiving comments is the most rewarding aspect of my blogging experience.  Although I am truly grateful to all of my readers, I am most grateful to my commenting readers. 

 

Commendable Comments

On Days Without A Data Quality Issue, Steve Sarsfield commented:

“Data quality issues probably occur on some scale in most companies every day.  As long as you qualify what is and isn't a data quality issue, this gets back to what the company thinks is an acceptable level of data quality.

I've always advocated aggregating data quality scores to form business metrics.  For example, what data quality metrics would you combine to ensure that customers can always be contacted in case of an upgrade, recall or new product offering?  If you track the aggregation, it gives you more of a business feel.”

On Customer Incognita, Daragh O Brien commented:

“Back when I was with the phone company I was (by default) the guardian of the definition of a 'Customer'.  Basically I think they asked for volunteers to step forward and I was busy tying my shoelace when the other 11,000 people in the company as one entity took a large step backwards.

I found that the best way to get a definition of a customer was to lock the relevant stakeholders in a room and keep asking 'What' and 'Why'. 

My 'data modeling' methodology was simple.  Find out what the things were that were important to the business operation, define each thing in English without a reference to itself, and then we played the 'Yes/No Game Show' to figure out how that entity linked to other things and what the attributes of that thing were.

Much to IT's confusion, I insisted that the definition needed to be a living thing, not carved in two stone tablets we'd lug down from on top of the mountain. 

However, because of the approach that had been taken we found that when new requirements were raised (27 from one stakeholder), the model accommodated all of them either through an expansion of a description or the addition of a piece of reference data to part of the model.

Fast-forward a few months from the modeling exercise.  I was asked by IT to demo the model to a newly acquired subsidiary.  It was a significantly different business.  I played the 'Yes/No Game Show' with them for a day.  The model fitted their needs with just a minor tweak. 

The IT team from the subsidiary wanted to know how I had gone about normalizing the data to come up with the model, which is kind of like cutting up a perfectly good apple pie to find out what an apple is and how to make pastry.

What I found about the 'Yes/No Game Show' approach was that it made people open up their thinking a bit, but it took some discipline and perseverance on my part to keep asking what and why.  Luckily, having spent most of the previous few years trying to get these people to think seriously about data quality they already thought I was a moron so they were accommodating to me.

A key learning for me out of the whole thing is that, even if you are doing a data management exercise for a part of a larger business, you need to approach it in a way that can be evolved and continuously improved to ensure quality across the entire organization. 

Also, it highlighted the fallacy of assuming that a company can only have one kind of customer.”

On The Once and Future Data Quality Expert, Dylan Jones commented:

“I recently attended a conference and sat in on a panel that discussed some of the future trends, such as cloud computing.  It was a great discussion, highly polarized, and as I came home I thought about how far we've come as a profession but more importantly, how much more there is to do.

The reality is that the world is changing, the volumes of data held by businesses are immense and growing exponentially, our desire for new forms of information delivery insatiable, and the opportunities for innovation boundless.

I really believe we're not innovating as an industry anything like we should be.  The cloud, as an example, offers massive opportunities for a range of data quality services but I've certainly not read anything in the media or press that indicates someone is capitalizing on this.

There are a few recent data quality technology innovations which have caught my eye, but I also think there is so much more vendors should be doing.

On the personal side of the profession, I think online education is where we're headed.  The concept of localized training is now being replaced by online learning.  With the Internet you can now train people on every continent, so why aren't more people going down this route?

I find it incredibly ironic when I speak to data quality specialists who admit that 'they don't have the first clue about all this social media stuff.'  This is the next generation of information management, it's here right now, they should be embracing it.  I think if you're a 'guru' author, trainer or consultant you need to think of new ways to engage with your clients/trainees using the tools available.

What worries me is that the growth of information doesn't match the maturity and growth of our profession.  For example, we really need more people who can articulate the value of what we can offer. 

Ted Friedman made a great point on Twitter recently when he talked about how people should stop moaning about executives that 'don't get it' and instead focus on improving ways to demonstrate the value of data quality improvement.

Just because we've come a long way doesn't mean we know it all, there is still a hell of a long way to go.”

Thanks for giving your comments

Thank you very much for giving your comments and sharing your perspectives with our collablogaunity.  Since there have been so many commendable comments, please don't be offended if your commendable comment hasn't been featured yet. 

Please keep on commenting and stay tuned for future entries in the series. 

 

Related Posts

Commendable Comments (Part 1)

Commendable Comments (Part 2)

Commendable Comments (Part 3)

DQ-Tip: “Data quality is about more than just improving your data...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Data quality is about more than just improving your data.

Ultimately, the goal is improving your organization.”

This DQ-Tip is from Tony Fisher's great book The Data Asset: How Smart Companies Govern Their Data for Business Success.

In the book, Fisher explains that one of the biggest mistakes organizations make is not viewing their data as a corporate asset.  This common misconception often prevents data quality from being rightfully viewed as a critical priority. 

Data quality is misperceived to be an activity performed just for the sake of improving data, when in fact it is an activity performed for the sake of improving business processes.

“Better data leads to better decisions,” explains Fisher, “which ultimately leads to better business.  Therefore, the very success of your organization is highly dependent on the quality of your data.”

 

Related Posts

DQ-Tip: “...Go talk with the people using the data”

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-Tip: “Don't pass bad data on to the next person...”

Beyond a “Single Version of the Truth”

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Henrik Liliendahl Sørensen and Charles Blyth.  Our contest is a Blogging Olympics of sorts, with the United States, Denmark, and England competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.” 

Please take the time to read all three posts and then vote for who you think has won the debate (see poll below).  Thanks!

 

The “Point of View” Paradox

In the early 20th century, within his Special Theory of Relativity, Albert Einstein introduced the concept that space and time are interrelated entities forming a single continuum, and therefore the passage of time can be a variable that could change for each individual observer.

One of the many brilliant insights of special relativity was that it could explain why different observers can make validly different observations – it was a scientifically justifiable matter of perspective. 

It was Einstein's apprentice, Obi-Wan Kenobi (to whom Albert explained “Gravity will be with you, always”), who stated:

“You're going to find that many of the truths we cling to depend greatly on our own point of view.”

The Data-Information Continuum

In the early 21st century, within his popular blog post The Data-Information Continuum, Jim Harris introduced the concept that data and information are interrelated entities forming a single continuum, and that speaking of oneself in the third person is the path to the dark side.

I use the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g., customers, vendors, suppliers).

Although a common definition for data quality is fitness for the purpose of use, the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I define information as data in use or data in action.

Quality within the Data-Information Continuum has both objective and subjective dimensions.  Data's quality is objectively measured separately from its many uses, while information's quality is subjectively measured according to its specific use.

 

Objective Data Quality

Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives. 

In order to lay this foundation, raw data is extracted directly from its sources, profiled, analyzed, transformed, cleansed, documented and monitored by data quality processes designed to provide and maintain universal data sources for the enterprise's information needs. 

At this phase of the architecture, the manipulations of raw data must be limited to objective standards and not be customized for any subjective use.  From this perspective, data is now fit to serve (as at least the basis for) each and every purpose.

 

Subjective Information Quality

Information quality standards (starting from the objective data foundation) are customized to meet the subjective needs of each business unit and initiative.  This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.

But please understand: customization should not be performed simply for the sake of it.  You must always define your information quality standards by using the enterprise-wide data quality standards as your initial framework. 

Whenever possible, enterprise-wide standards should be enforced without customization.  The key word within the phrase “subjective information quality standards” is standards — as opposed to subjective, which can quite often be misinterpreted as “you can do whatever you want.”  Yes you can – just as long as you have justifiable business reasons for doing so.

This approach to implementing information quality standards has three primary advantages.  First, it reinforces a consistent understanding and usage of data throughout the enterprise.  Second, it requires each business unit and initiative to clearly explain exactly how they are using data differently from the rest of your organization, and more important, justify why.  Finally, all deviations from enterprise-wide data quality standards will be fully documented. 

 

The “One Lie Strategy”

A common objection to separating quality standards into objective data quality and subjective information quality is the enterprise's significant interest in creating what is commonly referred to as a “Single Version of the Truth.”

However, in his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman explains:

“A fiendishly attractive concept is...'a single version of the truth'...the logic is compelling...unfortunately, there is no single version of the truth. 

For all important data, there are...too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. 

This does not imply malfeasance on anyone's part; it is simply a fact of life. 

Getting everyone to work from a single version of the truth may be a noble goal, but it is better to call this the 'one lie strategy' than anything resembling truth.”

Beyond a “Single Version of the Truth”

In the classic 1985 film Mad Max Beyond Thunderdome, the title character arrives in Bartertown, ruled by the evil Auntie Entity, where people living in the post-apocalyptic Australian outback go to trade for food, water, weapons, and supplies.  Auntie Entity forces Mad Max to fight her rival Master Blaster to the death within a gladiator-like arena known as Thunderdome, which is governed by one simple rule:

“Two men enter, one man leaves.”

I have always struggled with the concept of creating a “Single Version of the Truth.”  I imagine all of the key stakeholders from throughout the enterprise arriving in Corporatetown, ruled by the Machiavellian CEO known only as Veritas, where all business units and initiatives must go to request funding, staffing, and continued employment.  Veritas forces all of them to fight their Master Data Management rivals within a gladiator-like arena known as Meetingdome, which is governed by one simple rule:

“Many versions of the truth enter, a Single Version of the Truth leaves.”

For any attempted “version of the truth” to truly be successfully implemented within your organization, it must take into account both the objective and subjective dimensions of quality within the Data-Information Continuum. 

Both aspects of this shared perspective of quality must be incorporated into a “Shared Version of the Truth” that enforces a consistent enterprise understanding of data, but that also provides the information necessary to support day-to-day operations.

The Data-Information Continuum is governed by one simple rule:

“All validly different points of view must be allowed to enter,

In order for an all encompassing Shared Version of the Truth to be achieved.”

 

You are the Judge

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Henrik Liliendahl Sørensen and Charles Blyth.  Our contest is a Blogging Olympics of sorts, with the United States, Denmark, and England competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.” 

Please take the time to read all three posts and then vote for who you think has won the debate.  A link to the same poll is provided on all three blogs.  Therefore, wherever you choose to cast your vote, you will be able to view an accurate tally of the current totals. 

The poll will remain open for one week, closing at midnight on November 19 so that the “medal ceremony” can be conducted via Twitter on Friday, November 20.  Additionally, please share your thoughts and perspectives on this debate by posting a comment below.  Your comment may be copied (with full attribution) into the comments section of all of the blogs involved in this debate.

 

Related Posts

Poor Data Quality is a Virus

The General Theory of Data Quality

The Data-Information Continuum

The Once and Future Data Quality Expert

World Quality Day 2009

Wednesday, November 11 is World Quality Day 2009.

World Quality Day was established by the United Nations in 1990 as a focal point for the quality management profession and as a celebration of the contribution that quality makes to the growth and prosperity of nations and organizations.  The goal of World Quality Day is to raise awareness of how quality approaches (including data quality best practices) can have a tangible effect on business success, as well as contribute towards world-wide economic prosperity.

 

IAIDQ

The International Association for Information and Data Quality (IAIDQ) was chartered in January 2004 and is a not-for-profit, vendor-neutral professional association whose purpose is to create a world-wide community of people who desire to reduce the high costs of low quality information and data by applying sound quality management principles to the processes that create, maintain and deliver data and information.

Since 2007 the IAIDQ has celebrated World Quality Day as a springboard for improvement and a celebration of successes.  Please join us to celebrate World Quality Day by participating in our interactive webinar in which the Board of Directors of the IAIDQ will share with you stories and experiences to promote data quality improvements within your organization.

In my recent Data Quality Pro article The Future of Information and Data Quality, I reported on the IAIDQ Ask The Expert Webinar with co-founders Larry English and Tom Redman, two of the industry pioneers for data quality and two of the most well-known data quality experts.

 

Data Quality Expert

As World Quality Day 2009 approaches, my personal reflections are focused on what the title data quality expert has meant in the past, what it means today, and most important, what it will mean in the future.

With over 15 years of professional services and application development experience, I consider myself to be a data quality expert.  However, my experience is paltry by comparison to English, Redman, and other industry luminaries such as David Loshin, to use one additional example from many. 

Experience is popularly believed to be the path that separates knowledge from wisdom, which is usually accepted as another way of defining expertise. 

Oscar Wilde once wrote that “experience is simply the name we give our mistakes.”  I agree.  I have found that the sooner I can recognize my mistakes, the sooner I can learn from the lessons they provide, and hopefully prevent myself from making the same mistakes again. 

The key is early detection.  As I gain experience, I gain an improved ability to more quickly recognize my mistakes and thereby expedite the learning process.

James Joyce wrote that “mistakes are the portals of discovery” and T.S. Eliot wrote that “we must not cease from exploration and the end of all our exploring will be to arrive where we began and to know the place for the first time.”

What I find in the wisdom of these sages is the need to acknowledge the favor our faults do for us.  Therefore, although experience is the path that separates knowledge from wisdom, the true wisdom of experience is the wisdom of failure.

As Jonah Lehrer explained: “Becoming an expert just takes time and practice.  Once you have developed expertise in a particular area, you have made the requisite mistakes.”

But expertise in any discipline is more than simply an accumulation of mistakes and birthdays.  And expertise is not a static state that once achieved, allows you to simply rest on your laurels.

In addition to my real-world experience working on data quality initiatives for my clients, I also read all of the latest books, articles, whitepapers, and blogs, as well as attend as many conferences as possible.

 

The Times They Are a-Changin'

Much of the discussion that I have heard regarding the future of the data quality profession has been focused on the need for the increased maturity of both practitioners and organizations.  Although I do not dispute this need, I am concerned about the apparent lack of attention being paid to how fast the world around us is changing.

Rapid advancements in technology, coupled with the meteoric rise of the Internet and social media (blogs, wikis, Twitter, Facebook, LinkedIn, etc.), have created an amazing medium that is enabling people separated by vast distances and disparate cultures to come together, communicate, and collaborate in ways few would have thought possible just a few decades ago. 

I don't believe that it is an exaggeration to state that we are now living in an age where the contrast between the recent past and the near future is greater than perhaps it has ever been in human history.  This brave new world has such people and technology in it, that practically every new day brings the possibility of another quantum leap forward.

Although it has been argued by some that the core principles of data quality management are timeless, I must express my doubt.  The daunting challenges of dramatically increasing data volumes and the unrelenting progress of cloud computing, software as a service (SaaS), and mobile computing architectures, would appear to be racing toward a high-speed collision with our time-tested (but time-consuming to implement properly) data quality management principles.

The times they are indeed changing, and I believe we must stop using terms like Six Sigma and Kaizen as if they were shibboleths.  If these or any other disciplines are to remain relevant, then we must honestly assess them in the harsh and unforgiving light of our brave new world that is seemingly changing faster than the speed of light.

Expertise is not static.  Wisdom is not timeless.  The only constant is change.  For the data quality profession to truly mature, our guiding principles must change with the times, or be relegated to a past that is all too quickly becoming distant.

 

Share Your Perspectives

In celebration of World Quality Day, please share your perspectives regarding the past, present, and most important, the future of the data quality profession.  With apologies to T. H. White, I declare this debate to be about the difference between:

The Once and Future Data Quality Expert

Related Posts

Mistake Driven Learning

The Fragility of Knowledge

The Wisdom of Failure

A Portrait of the Data Quality Expert as a Young Idiot

The Nine Circles of Data Quality Hell

 

Additional IAIDQ Links

IAIDQ Ask The Expert Webinar: World Quality Day 2009

IAIDQ Ask The Expert Webinar with Larry English and Tom Redman

INTERVIEW: Larry English - IAIDQ Co-Founder

INTERVIEW: Tom Redman - IAIDQ Co-Founder

IAIDQ Publications Portal

Customer Incognita

Many enterprise information initiatives are launched in order to unravel that riddle, wrapped in a mystery, inside an enigma, that great unknown, also known as...Customer.

Centuries ago, cartographers used the Latin phrase terra incognita (meaning “unknown land”) to mark regions on a map not yet fully explored.  In this century, companies simply cannot afford to use the phrase customer incognita to indicate what information about their existing (and prospective) customers they don't currently have or don't properly understand.

 

What is a Customer?

First things first, what exactly is a customer?  Those happy people who give you money?  Those angry people who yell at you on the phone or say really mean things about your company on Twitter and Facebook?  Why do they have to be so mean? 

Mean people suck.  However, companies who don't understand their customers also suck.  And surely you don't want to be one of those companies, do you?  I didn't think so.

Getting back to the question, here are some insights from the Data Quality Pro discussion forum topic What is a customer?:

  • Someone who purchases products or services from you.  The word “someone” is key because it’s not the role of a “customer” that forms the real problem, but the precision of the term “someone” that causes challenges when we try to link other and more specific roles to that “someone.”  These other roles could be contract partner, payer, receiver, user, owner, etc.
  • Customer is a role assigned to a legal entity in a complete and precise picture of the real world.  The role is established when the first purchase is accepted from this real-world entity.  Of course, the main challenge is whether or not the company can establish and maintain a complete and precise picture of the real world.

These working definitions were provided by fellow blogger and data quality expert Henrik Liliendahl Sørensen, who recently posted 360° Business Partner View, which further examines the many different ways a real-world entity can be represented, including when, instead of a customer, the real-world entity represents a citizen, patient, member, etc.

A critical first step for your company is to develop your definition of a customer.  Don't underestimate either the importance or the difficulty of this process.  And don't assume it is simply a matter of semantics.

Some of my consulting clients have indignantly told me: “We don't need to define it, everyone in our company knows exactly what a customer is.”  I usually respond: “I have no doubt that everyone in your company uses the word customer, however I will work for free if everyone defines the word customer in exactly the same way.”  So far, I haven't had to work for free.  

 

How Many Customers Do You Have?

You have done the due diligence and developed your definition of a customer.  Excellent!  Nice work.  Your next challenge is determining how many customers you have.  Hopefully, you are not going to try using any of these techniques:

  • SELECT COUNT(*) AS "We have this many customers" FROM Customers
  • SELECT COUNT(DISTINCT Name) AS "No wait, we really have this many customers" FROM Customers
  • Middle-Square or Blum Blum Shub methods (i.e. random number generation)
  • Magic 8-Ball says: “Ask again later”

One of the most common and challenging data quality problems is the identification of duplicate records, especially redundant representations of the same customer information within and across systems throughout the enterprise.  The need for a solution to this specific problem is one of the primary reasons that companies invest in data quality software and services.

Earlier this year on Data Quality Pro, I published a five part series of articles on identifying duplicate customers, which focused on the methodology for defining your business rules and illustrated some of the common data matching challenges.

Topics covered in the series:

  • Why a symbiosis of technology and methodology is necessary when approaching this challenge
  • How performing a preliminary analysis on a representative sample of real data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem
  • How to document your business rules for identifying duplicate customers
  • How to set realistic expectations about application development
  • How to foster a collaboration of the business and technical teams throughout the entire project
  • How to consolidate identified duplicates by creating a “best of breed” representative record

To read the series, please follow these links:

To download the associated presentation (no registration required), please follow this link: OCDQ Downloads
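
The series describes the methodology rather than code, but as a hedged sketch of the general idea, the example below pairs records whose normalized names match exactly.  The normalization rule and sample records are assumptions for illustration, and real matching also requires the fuzzy comparisons and false negative/positive analysis discussed in the articles:

from itertools import combinations

def normalize(name: str) -> str:
    """Crude normalization: uppercase, strip punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in name.upper() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def candidate_duplicates(records):
    """Pair records whose normalized names match exactly (exact matching only)."""
    return [
        (left, right)
        for left, right in combinations(records, 2)
        if normalize(left["Name"]) == normalize(right["Name"])
    ]

# Hypothetical customer records: the third is a false negative for exact matching
customers = [
    {"Customer ID": "C001", "Name": "Harris, Jim"},
    {"Customer ID": "C002", "Name": "HARRIS  JIM"},
    {"Customer ID": "C003", "Name": "Jim Harris"},
]
print(candidate_duplicates(customers))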

 

Conclusion

“Knowing the characteristics of your customers,” stated Jill Dyché and Evan Levy in the opening chapter of their excellent book, Customer Data Integration: Reaching a Single Version of the Truth, “who they are, where they are, how they interact with your company, and how to support them, can shape every aspect of your company's strategy and operations.  In the information age, there are fewer excuses for ignorance.”

For companies of every size and within every industry, customer incognita is a crippling condition that must be replaced with customer cognizance in order for the company to continue to remain competitive in a rapidly changing marketplace.

Do you know your customers?  If not, then they likely aren't your customers anymore.

The Tell-Tale Data

It is a dark and stormy night in the data center.  The constant humming of hard drives is mimicking the sound of a hard rain falling in torrents, except at occasional intervals, when it is checked by a violent gust of conditioned air sweeping through the seemingly endless aisles of empty cubicles, rattling along desktops, fiercely agitating the flickering glow from flat panel monitors that are struggling against the darkness.

Tonight, amid this foreboding gloom with only my thoughts for company, I race to complete the production implementation of the Dystopian Automated Transactional Analysis (DATA) system.  Nervous, very, very dreadfully nervous I have been, and am, but why will you say that I am mad?  Observe how calmly I can tell you the whole story.

Eighteen months ago, I was ordered by executive management to implement the DATA system.  The vendor's salesperson was an oddly charming fellow named Machiavelli, who had the eye of a vulture — a pale blue eye, with a film over it.  Whenever this eye fell upon me, my blood ran cold. 

Machiavelli assured us all that DATA's seamlessly integrated Magic Beans software would migrate and consolidate all of our organization's information, clairvoyantly detecting and correcting our existing data quality problems, and once DATA was implemented into production, Magic Beans would prevent all future data quality problems from happening.

As soon as a source was absorbed into DATA, Magic Beans automatically did us the favor of freeing up disk space by deleting all traces of the source, somehow even including our off-site archives.  DATA would then become our only system of record, truly our Single Version of the Truth.

It is impossible to say when doubt first entered my brain, but once conceived, it haunted me day and night.  Whenever I thought about it, my blood ran cold — as cold as when that vulture eye was gazing upon me — very gradually, I made up my mind to simply load DATA and rid myself of my doubt forever.

Now this is the point where you will fancy me quite mad.  But madmen know nothing.  You should have seen how wisely I proceeded — with what caution — with what foresight — with what Zen-like tranquility, I went to work! 

I was never happier than I was these past eighteen months while I simply followed the vendor's instructions step by step and loaded DATA!  Would a madman have been so wise as this?  I think not.

Tomorrow morning, DATA goes live.  I can imagine how wonderful that will be.  I will be sitting at my desk, grinning wildly, deliriously happy with a job well done.  DATA will be loaded, data quality will trouble me no more.

It is now four o'clock in the morning, but still it is as dark as midnight.  But as bright as the coming dawn, I can now see three strange men as they gather around my desk. 

Apparently, a shriek had been heard from the business analysts and subject matter experts as soon as they started using DATA.  Suspicions had been aroused, complaints had been lodged, and they (now identifying themselves as auditors) had been called in by a regulatory agency to investigate.

I smile — for what have I to fear?  I welcome these fine gentlemen.  I give them a guided tour of DATA using its remarkably intuitive user interface.  I urge them to audit — audit well.  They seemed satisfied.  My manner has convinced them.  I am singularly at ease.  They sit, and while I answer cheerily, they chat away about trivial things.  But before long, I feel myself growing pale and wish them gone.

My head aches and I hear a ringing in my ears, but still they sit and chat.  The ringing becomes more distinct.  I talk more freely, to get rid of the feeling, but it continues and gains volume — until I find that this noise is not within my ears.

No doubt I now grow very pale — but I talk more fluently, and with a heightened voice.  Yet the sound increases — and what can I do?  It is a low, dull, quick sound.  I gasp for breath — and yet the auditors hear it not. 

I talk more quickly — more vehemently — but the noise steadily increases.  I arise, and argue about trifles, in a high key and with violent gesticulations — but the noise steadily increases.  Why will they not be gone?  I pace the floor back and forth, with heavy strides, as if excited to fury by the unrelenting observations of the auditors — but the noise steadily increases.

What could I do?  I raved — I ranted — I raged!  I swung my chair and smashed my computer with it — but the noise rises over all of my attempts to silence it.  It grows louder — louder — louder!  And still the auditors chat pleasantly, and smile.  Is it really possible they can not hear it?  Is it really possible they did not notice me smashing my computer?

They hear! — they suspect! — they know! — they are making a mockery of my horror! — this I thought, and this I think.  But anything is better than this agony!  Anything is more tolerable than this derision!  I can not bear their hypocritical smiles any longer!  I feel that I must scream or die! — and now — again! — the noise!  Louder!  Louder!!  LOUDER!!!

 

“DATA!” I finally shriek.  “DATA has no quality!  NO DATA QUALITY!!!  What have I done?  What — Have — I — Done?!?”

 

With a sudden jolt, I awaken at my desk, with my old friend Edgar shaking me by the shoulders. 

“Hey, wake up!  Executive management wants us in the conference room in five minutes.  Apparently, there is a vendor here today pitching a new system called DATA using software called Magic Beans...” 

“...and the salesperson has this weird eye...”

Days Without A Data Quality Issue

In 1970, the United States Department of Labor created the Occupational Safety and Health Administration (OSHA).  The mission of OSHA is to prevent work-related injuries, illnesses, and deaths.  Based on statistics from 2007, since OSHA's inception, occupational deaths in the United States have been cut by 62% and workplace injuries have declined by 42%.

OSHA regularly conducts inspections to determine if organizations are in compliance with safety standards and assesses financial penalties for violations.  In order to both promote workplace safety and avoid penalties, organizations provide their employees with training on the appropriate precautions and procedures to follow in the event of an accident or an emergency.

Training programs certify new employees in safety protocols and indoctrinate them into the culture of a safety-conscious workplace.  By requiring periodic re-certification, all employees maintain awareness of their personal responsibility in both avoiding workplace accidents and responding appropriately to emergencies.

Although there has been some debate about the effectiveness of the regulations and the enforcement policies, over the years OSHA has unquestionably brought about many necessary changes, especially in the area of industrial work site safety where dangerous machinery and hazardous materials are quite common. 

Obviously, even with well-defined safety standards in place, workplace accidents will still occasionally occur.  However, these standards have helped greatly reduce both the frequency and severity of the accidents.  And most importantly, safety has become a natural part of the organization's daily work routine.

 

A Culture of Data Quality

Similar to indoctrinating employees into the culture of a safety-conscious workplace, more and more organizations are realizing the importance of creating and maintaining the culture of a data quality-conscious workplace.  A culture of data quality is essential for effective enterprise information management.

Waiting until a serious data quality issue negatively impacts the organization before starting an enterprise data quality program is analogous to waiting until a serious workplace accident occurs before starting a safety program.

Many data quality issues are caused by a lack of data ownership and an absence of clear guidelines indicating who is responsible for ensuring that data is of sufficient quality to meet the daily business needs of the enterprise.  In order for data quality to be taken seriously within your organization, everyone first needs to know that data quality is an enterprise-wide priority.

Additionally, data quality standards must be well-defined, and everyone must accept their personal responsibility in both preventing data quality issues and responding appropriately to mitigate the associated business risks when issues do occur.

 

Data Quality Assessments

The data equivalent of a safety inspection is a data quality assessment, which provides a much needed reality check for the perceptions and assumptions that the enterprise has about the quality of its data. 

Performing a data quality assessment helps with a wide variety of tasks including: verifying data matches the metadata that describes it, preparing meaningful questions for subject matter experts, understanding how data is being used, quantifying the business impacts of poor quality data, and evaluating the ROI of data quality improvements.
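
As one small, hedged example of such a check – verifying data matches the metadata that describes it – the sketch below flags values that violate a documented data type or maximum length; the rules and the field are hypothetical:

def check_metadata_conformance(values, expected_type=str, max_length=None):
    """Flag values that violate the documented data type or maximum length."""
    violations = []
    for value in values:
        if not isinstance(value, expected_type):
            violations.append((value, "unexpected data type"))
        elif max_length is not None and len(value) > max_length:
            violations.append((value, "exceeds documented maximum length"))
    return violations

# Hypothetical example: State Abbreviation is documented as a 2-character string
print(check_metadata_conformance(["MA", "CALIFORNIA", 12], max_length=2))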

An initial assessment provides a baseline and helps establish data quality standards as well as set realistic goals for improvement.  Subsequent data quality assessments, which should be performed on a regular basis, will track your overall progress.

Although preventing data quality issues is your ultimate goal, don't let the pursuit of perfection undermine your efforts.  Always be mindful of the data quality issues that remain unresolved, but let them serve as motivation.  Learn from your mistakes without focusing on your failures – focus instead on making steady progress toward improving your data quality.

 

Data Governance

The data equivalent of verifying compliance with safety standards is data governance, which establishes policies and procedures to align people throughout the organization.  Enterprise data quality programs require a data governance framework in order to successfully deploy data quality as an enterprise-wide initiative. 

By facilitating the collaboration of all business and technical stakeholders, aligning data usage with business metrics, enforcing data ownership, and prioritizing data quality, data governance enables effective enterprise information management.

Obviously, even with well-defined and well-managed data governance policies and procedures in place, data quality issues will still occasionally occur.  However, your goal is to greatly reduce both the frequency and severity of your data quality issues. 

And most importantly, the responsibility for ensuring that data is of sufficient quality to meet your daily business needs has now become a natural part of your organization's daily work routine.

 

Days Without A Data Quality Issue

Organizations commonly display a sign indicating how long they have gone without a workplace accident.  Proving that I certainly did not miss my calling as a graphic designer, I created this “sign” for Days Without A Data Quality Issue:

Days Without A Data Quality Issue

 

Related Posts

Poor Data Quality is a Virus

DQ-Tip: “Don't pass bad data on to the next person...”

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

Data Governance and Data Quality

We are the (IBM Information) Champions

Recently, I was honored to be named a 2009-2010 IBM Information Champion.

From Vality Technology, through Ascential Software, and eventually with IBM, I have spent most of my career working with the data quality tool that is now known as IBM InfoSphere QualityStage. 

Throughout my time in Research and Development (as a Senior Software Engineer and a Development Engineer) and Professional Services (as a Principal Consultant and a Senior Technical Instructor), I was often asked to wear many hats for QualityStage – and not just because my balding head is distractingly shiny.

True champions are championship teams.  The QualityStage team (past and present) is the most remarkable group of individuals that I have ever had the great privilege to know, let alone the good fortune to work with.  Thank you all very, very much.

 

The IBM Information Champion Program

Previously known as the Data Champion Program, the IBM Information Champion Program honors individuals making outstanding contributions to the Information Management community. 

Technical communities, websites, books, conference speakers, and blogs all contribute to the success of IBM’s Information Management products.  But these activities don’t run themselves. 

Behind the scenes, there are dedicated and loyal individuals who put in their own time to run user groups, manage community websites, speak at conferences, post to forums, and write blogs.  Their time is uncompensated by IBM.

IBM honors the commitment of these individuals with a special designation — Information Champion — as a way of showing their appreciation for the time and energy these exceptional community members expend.

Information Champions are objective experts.  They have no official obligation to IBM. 

They simply share their opinions and years of experience with others in the field, and their work contributes greatly to the overall success of IBM Information Management.

 

We are the Champions

The IBM Information Champion Program has been expanded from the Data Management segment to all segments in Information Management, and now includes IBM Cognos, Enterprise Content Management, and InfoSphere. 

To read more about all of the Information Champions, please follow this link:  Profiles of the IBM Information Champions

 

IBM Website Links

IBM Information Champion Community Space

IBM Information Management User Groups

IBM developerWorks

IBM Information On Demand 2009 Global Conference

IBM Home Page (United States)

 

QualityStage Website Links

IBM Redbook for QualityStage

QualityStage Forum on IBM developerWorks

QualityStage Forum on DSXchange

LinkedIn Group for IBM InfoSphere QualityStage

DataQualityFirst

If you tweet away, I will follow

Today is Friday, which, for Twitter users like me, can mean only one thing...

Every Friday, Twitter users recommend other users that you should follow.  FollowFriday has kind of become the Twitter version of peer pressure.  In other words: I recommended you, so why didn't you recommend me?

Among my fellow Twitter addicts, it has come to be viewed either as a beloved tradition of social media community building or as a hated annoyance.  It is almost as deeply polarizing as Pepsi vs. Coke or Soccer vs. Football (by the way, just for the official record, I love FollowFriday, and I am firmly in the Pepsi and Football camps, and by Football, I mean American Football).

If you are curious how it got started, then check out the Interview with Micah Baldwin, Father of FollowFriday on TwiTip.

In this blog post, I want to provide you with some examples of what I do on FollowFriday, and how I manage to actually follow (or do I?) so many people (586 and counting).

 

FollowFriday Example # 1 – The List

Perhaps the most common example of a FollowFriday tweet is to simply list as many users as you can within the 140 characters:

Twitter FollowFriday 1

 

FollowFriday Example # 2 – The Tweet-Out

An alternative FollowFriday tweet is to send a detailed Tweet-Out (the Twitter version of a Shout-Out) to a single user:

Twitter FollowFriday 2

 

FollowFriday Example # 3 – The Twitter Roll

Yet another alternative FollowFriday tweet is to send a link to a Twitter Roll (the Twitter version of a Blog Roll):

Twitter FollowFriday 3

To add your Twitter link so we can follow you, please click here:  OCDQ Twitter Roll

 

Give a Hoot, Use HootSuite

Most of my FollowFriday tweets are actually scheduled.  In part, I do this because I follow people from all around the world, and by the time I finally crawl out of bed on Friday, many of my tweeps have already started their weekend.  And let's face it, the other reason that I schedule my FollowFriday tweets has a lot to do with why obsessive-compulsive is in the name of my blog.

For scheduling tweets, I like using HootSuite:

HootSuite

Please note that the limitation of 140 characters has necessitated the abbreviation #FF instead of the #followfriday “standard.”

 

The Tweet-rix

The Matrix

Unless you only follow a few people, it is a tremendous challenge to actually follow every user you follow.  To be perfectly honest, I do not follow everyone I follow – no, I wasn't just channeling Yogi Berra (I am a Boston Red Sox fan!).  To borrow an analogy from Phil Simon, trying to watch your entire Twitter stream (i.e. The Tweet-rix) is like being an operator on The Matrix.

My primary Twitter application is TweetDeck:

TweetDeck

Not that I am all about me, but I do pay the most attention to Mentions and Direct Messages.  Next, since I am primarily interested in data quality, I use an embedded search to follow any tweets that use the #dataquality hashtag or mention the phrase “data quality.”  TweetDeck is one of many clients allowing you to create Groups of users to help organize The Tweet-rix. 
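
Just for illustration, here is a tiny sketch of the idea behind that embedded search, which is simply filtering tweet text for the hashtag or the phrase.  The sample tweets are made up, and this is not how TweetDeck itself works under the hood:

```python
# A minimal sketch of the "embedded search" idea: filter a stream of tweets for
# the #dataquality hashtag or the phrase "data quality".  The tweet data here is
# made up for illustration only.
def matches_data_quality(tweet_text: str) -> bool:
    text = tweet_text.lower()
    return "#dataquality" in text or "data quality" in text

tweets = [
    "Our #dataquality assessment starts Monday",
    "Happy #FollowFriday everyone!",
    "Data quality is everyone's responsibility",
]

for tweet in tweets:
    if matches_data_quality(tweet):
        print(tweet)
```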

To further prove my Sci-Fi geek status, I created a group called TweetDeck Actual, which is an homage to BattleStar Galactica, where saying “This is Galactica Actual” confirms an open communications channel has been established with the Galactica. 

I rotate the users I follow in and out of TweetDeck Actual on a regular basis in order to provide for a narrowly focused variety of trenchant tweets.  (By the way, I learned the word trenchant from a Jill Dyché tweet).

 

The Search for Tweets

You do not need to actually have a Twitter account in order to follow tweets.  There are several search engines designed specifically for Twitter.  And according to recent rumors, tweets will be coming soon to a Google near you.

Here are just a few ways to search Twitter for data quality content:

 

Conclusion

With apologies to fellow fans of U2 (one of my all-time favorite bands):

If you tweet away, tweet away
I tweet away, tweet away
I will follow
If you tweet away, tweet away
I tweet away, tweet away
I will follow
I will follow

Related Posts

Tweet 2001: A Social Media Odyssey

Poor Quality Data Sucks

Fenway Park 2008 Home Opener

Over the last few months on his Information Management blog, Steve Miller has been writing posts inspired by a great 2008 book that we both highly recommend: The Drunkard's Walk: How Randomness Rules Our Lives by Leonard Mlodinow.

In his most recent post The Demise of the 2009 Boston Red Sox: Super-Crunching Takes a Drunkard's Walk, Miller takes on my beloved Boston Red Sox and the less than glorious conclusion to their 2009 season. 

For those readers who are not baseball fans, the Los Angeles Angels of Anaheim swept the Red Sox out of the playoffs.  I will let Miller's words describe their demise: “Down two to none in the best of five series, the Red Sox took a 6-4 lead into the ninth inning, turning control over to impenetrable closer Jonathan Papelbon, who hadn't allowed a run in 26 postseason innings.  The Angels, within one strike of defeat on three occasions, somehow managed a miracle rally, scoring 3 runs to take the lead 7-6, then holding off the Red Sox in the bottom of the ninth for the victory to complete the shocking sweep.”

 

Baseball and Data Quality

What, you may be asking, does baseball have to do with data quality?  Beyond simply being two of my all-time favorite topics, quite a lot actually.  Baseball data is mostly transaction data describing the statistical events of games played.

Statistical analysis has been a beloved pastime even longer than baseball has been America's Pastime.  Number-crunching is far more than just a quantitative exercise in counting.  The qualitative component of statistics – discerning what the numbers mean, analyzing them to discover predictive patterns and trends – is the very basis of data-driven decision making.

“The Red Sox,” as Miller explained, “are certainly exemplars of the data and analytic team-building methodology” chronicled in Moneyball: The Art of Winning an Unfair Game, the 2003 book by Michael Lewis.  Red Sox General Manager Theo Epstein has always been an advocate of so-called evidence-based baseball, or baseball analytics, pioneered by Bill James, the baseball writer, historian, statistician, current Red Sox consultant, and founder of Sabermetrics.

In another book that Miller and I both highly recommend, Super Crunchers, author Ian Ayres explained that “Bill James challenged the notion that baseball experts could judge talent simply by watching a player.  James's simple but powerful thesis was that data-based analysis in baseball was superior to observational expertise.  James's number-crunching approach was particular anathema to scouts.” 

“James was baseball's herald,” continues Ayres, “of data-driven decision making.”

 

The Drunkard's Walk

As Mlodinow explains in the prologue: “The title The Drunkard's Walk comes from a mathematical term describing random motion, such as the paths molecules follow as they fly through space, incessantly bumping, and being bumped by, their sister molecules.  The surprise is that the tools used to understand the drunkard's walk can also be employed to help understand the events of everyday life.”

Later in the book, Mlodinow describes the hidden effects of randomness by discussing how to build a mathematical model for the probability that a baseball player will hit a home run: “The result of any particular at bat depends on the player's ability, of course.  But it also depends on the interplay of many other factors: his health, the wind, the sun or the stadium lights, the quality of the pitches he receives, the game situation, whether he correctly guesses how the pitcher will throw, whether his hand-eye coordination works just perfectly as he takes his swing, whether that brunette he met at the bar kept him up too late, or the chili-cheese dog with garlic fries he had for breakfast soured his stomach.”

“If not for all the unpredictable factors,” continues Mlodinow, “a player would either hit a home run on every at bat or fail to do so.  Instead, for each at bat all you can say is that he has a certain probability of hitting a home run and a certain probability of failing to hit one.  Over the hundreds of at bats he has each year, those random factors usually average out and result in some typical home run production that increases as the player becomes more skillful and then eventually decreases owing to the same process that etches wrinkles in his handsome face.  But sometimes the random factors don't average out.  How often does that happen, and how large is the aberration?”
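
To see Mlodinow's point in action, here is a small simulation sketch: a player with constant skill (the same home run probability on every at bat) still shows noticeable season-to-season swings driven purely by randomness.  The probability, at bats, and number of seasons below are illustrative assumptions, not real statistics:

```python
# A small simulation of Mlodinow's point: a player with constant skill
# (a fixed home run probability per at bat) still shows season-to-season
# swings driven purely by randomness.  The numbers below (6% per at bat,
# 500 at bats, 20 seasons) are illustrative assumptions, not real statistics.
import random

random.seed(42)

HR_PROBABILITY = 0.06   # assumed per-at-bat chance of a home run
AT_BATS = 500           # assumed at bats per season
SEASONS = 20

totals = []
for season in range(1, SEASONS + 1):
    home_runs = sum(1 for _ in range(AT_BATS) if random.random() < HR_PROBABILITY)
    totals.append(home_runs)
    print(f"Season {season:2d}: {home_runs} home runs")

print(f"\nAverage: {sum(totals) / SEASONS:.1f} home runs per season")
print(f"Range:   {min(totals)} to {max(totals)} home runs")
```

Most seasons cluster around the average, but every so often the random factors don't average out, and an outlier season can look like a change in skill even though none occurred.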

 

Conclusion

I have heard some (not Mlodinow or anyone else mentioned in this post) argue that data quality is an irrelevant issue.  The basis of their argument is that poor quality data are simply random factors that, in any data set of statistically significant size, will usually average out and therefore have a negligible effect on any data-based decisions. 

However, the random factors don't always average out.  It is important not only to measure exactly how often poor quality data occur, but also to acknowledge how large an aberration poor quality data can be, especially in data-driven decision making.
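
Here is a quick sketch, with made-up numbers, of that distinction: zero-mean random noise largely cancels out across a large data set, but a systematic data quality defect, such as a consistent bias affecting even a modest share of records, does not:

```python
# A sketch contrasting random noise with a systematic data quality defect.
# All numbers are made up for illustration.  Random errors tend to cancel out
# in a large sample; a consistent bias in even a modest share of records does not.
import random

random.seed(7)

TRUE_VALUE = 100.0
N = 100_000

# Case 1: random errors only (zero-mean noise) -- these largely average out.
random_only = [TRUE_VALUE + random.gauss(0, 10) for _ in range(N)]

# Case 2: 10% of records carry a systematic defect (e.g., values captured in the
# wrong unit, inflating them by 25%) -- this bias does not average out.
biased = [
    value * 1.25 if random.random() < 0.10 else value
    for value in random_only
]

print(f"True value:              {TRUE_VALUE:.2f}")
print(f"Mean with random noise:  {sum(random_only) / N:.2f}")
print(f"Mean with 10% bad data:  {sum(biased) / N:.2f}")
```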

As every citizen of Red Sox Nation is taught from birth, the only acceptable opinion of our American League East Division rivals, the New York Yankees, is encapsulated in the chant heard throughout the baseball season (and not just at Fenway Park):

“Yankees Suck!”

From its inception, every organization bases its day-to-day business decisions on its data.  This decision-critical information drives the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

It doesn't quite roll off the tongue as easily, but a chant heard throughout these enterprise information initiatives is:

“Poor Quality Data Sucks!”

Books Recommended by Red Sox Nation

Mind Game: How the Boston Red Sox Got Smart, Won a World Series, and Created a New Blueprint for Winning

Feeding the Monster: How Money, Smarts, and Nerve Took a Team to the Top

Theology: How a Boy Wonder Led the Red Sox to the Promised Land

Now I Can Die in Peace: How The Sports Guy Found Salvation Thanks to the World Champion (Twice!) Red Sox