Commendable Comments (Part 1)

Six months ago today, I launched this blog by asking: Do you have obsessive-compulsive data quality (OCDQ)?

As of September 10, here are the monthly traffic statistics provided by my blog platform:

OCDQ Blog Traffic Overview

 

It Takes a Village (Idiot)

In my recent Data Quality Pro article Blogging about Data Quality, I explained why I started this blog.  Blogging provides me a way to demonstrate my expertise.  It is one thing for me to describe myself as an expert and another to back up that claim by allowing you to read my thoughts and decide for yourself.

In general, I have always enjoyed sharing my experiences and insights.  A great aspect of doing this via a blog (as opposed to only via whitepapers and presentations) is the dialogue and discussion provided by comments from my readers.

This two-way conversation not only greatly improves the quality of the blog content, but much more importantly, it helps me better appreciate the difference between what I know and what I only think I know. 

Even an expert's opinions are biased by the practical limits of their personal experience.  Having spent most of my career working with what is now mostly IBM technology, I sometimes have to pause and consider if some of that yummy Big Blue Kool-Aid is still swirling around in my head (since I “think with my gut,” I have to “drink with my head”).

Don't get me wrong – “You're my boy, Blue!” – but there are many other vendors and all of them also offer viable solutions driven by impressive technologies and proven methodologies.

Data quality isn't exactly the most exciting subject for a blog.  Data quality is not just a niche – if technology blogging were a Matryoshka (a.k.a. Russian nested) doll, then data quality would be the last, innermost doll. 

This doesn't mean that data quality isn't an important subject – it just means that you will not see a blog post about data quality hitting the front page of Digg anytime soon.

All blogging is more art than science.  My personal blogging style can perhaps best be described as mullet blogging – not “business in the front, party in the back” but “take your subject seriously, but still have a sense of humor about it.”

My blog uses a lot of metaphors and analogies (and sometimes just plain silliness) to try to make an important (but dull) subject more interesting.  Sometimes it works and sometimes it sucks.  However, I have never been afraid to look like an idiot.  After all, idiots are important members of society – they make everyone else look smart by comparison.

Therefore, I view my blog as a Data Quality Village.  And as the Blogger-in-Chief, I am the Village Idiot.

 

The Rich Stuff of Comments

Earlier this year in an excellent IT Business Edge article by Ann All, David Churbuck of Lenovo explained:

“You can host focus groups at great expense, you can run online surveys, you can do a lot of polling, but you won’t get the kind of rich stuff (you will get from blog comments).”

How very true.  But before we get to the rich stuff of our village, let's first take a look at a few more numbers:

  • Not counting this one, I have published 44 posts on this blog
  • Those blog posts have collectively received a total of 185 comments
  • Only 5 blog posts received no comments
  • 30 comments were actually me responding to my readers
  • 45 comments were from LinkedIn groups (23), SmartData Collective re-posts (17), or Twitter re-tweets (5)

The ten blog posts receiving the most comments:

  1. The Two Headed Monster of Data Matching – 11 Comments
  2. Adventures in Data Profiling (Part 4) – 9 Comments
  3. Adventures in Data Profiling (Part 2) – 9 Comments
  4. You're So Vain, You Probably Think Data Quality Is About You – 8 Comments
  5. There are no Magic Beans for Data Quality – 8 Comments
  6. The General Theory of Data Quality – 8 Comments
  7. Adventures in Data Profiling (Part 1) – 8 Comments
  8. To Parse or Not To Parse – 7 Comments
  9. The Wisdom of Failure – 7 Comments
  10. The Nine Circles of Data Quality Hell – 7 Comments

 

Commendable Comments

This post will be the first in an ongoing series celebrating my heroes – my readers.

As Darren Rowse and Chris Garrett explained in their highly recommended ProBlogger book: “even the most popular blogs tend to attract only about a 1 percent commenting rate.” 

Therefore, I am completely in awe of my blog's current 88 percent commenting rate.  Sure, I get my fair share of the simple and straightforward comments like “Great post!” or “You're an idiot!” but I decided to start this series because I am consistently amazed by the truly commendable comments that I regularly receive.

On The Data Quality Goldilocks Zone, Daragh O Brien commented:

“To take (or stretch) your analogy a little further, it is also important to remember that quality is ultimately defined by the consumers of the information.  For example, if you were working on a customer data set (or 'porridge' in Goldilocks terms) you might get it to a point where Marketing thinks it is 'just right' but your Compliance and Risk management people might think it is too hot and your Field Sales people might think it is too cold.  Declaring 'Mission Accomplished' when you have addressed the needs of just one stakeholder in the information can often be premature.

Also, one of the key learnings that we've captured in the IAIDQ over the past 5 years from meeting with practitioners and hosting our webinars is that, just like any Change Management effort, information quality change requires you to break the challenge into smaller deliverables so that you get regular delivery of 'just right' porridge to the various stakeholders rather than boiling the whole thing up together and leaving everyone with a bad taste in their mouths.  It also means you can more quickly see when you've reached the Goldilocks zone.”

On Data Quality Whitepapers are Worthless, Henrik Liliendahl Sørensen commented:

“Bashing in blogging must be carefully balanced.

As we all tend to find many things from gurus to tools in our own country, I have also found one of my favourite sayings from Søren Kierkegaard:

If One Is Truly to Succeed in Leading a Person to a Specific Place, One Must First and Foremost Take Care to Find Him Where He is and Begin There.

This is the secret in the entire art of helping.

Anyone who cannot do this is himself under a delusion if he thinks he is able to help someone else.  In order truly to help someone else, I must understand more than he–but certainly first and foremost understand what he understands.

If I do not do that, then my greater understanding does not help him at all.  If I nevertheless want to assert my greater understanding, then it is because I am vain or proud, then basically instead of benefiting him I really want to be admired by him.

But all true helping begins with a humbling.

The helper must first humble himself under the person he wants to help and thereby understand that to help is not to dominate but to serve, that to help is not to be the most dominating but the most patient, that to help is a willingness for the time being to put up with being in the wrong and not understanding what the other understands.”

On All I Really Need To Know About Data Quality I Learned In Kindergarten, Daniel Gent commented:

“In kindergarten we played 'Simon Says...'

I compare it as a way of following the requirements or business rules.

Simon says raise your hands.

Simon says touch your nose.

Touch your feet.

With that final statement you learned very quickly in kindergarten that you can be out of the game if you are not paying attention to what is being said.

Just like in data quality, to have good accurate data and to keep the business functioning properly you need to pay attention to what is being said, what the business rules are.

So when Simon says touch your nose, don't be touching your toes, and you'll stay in the game.”

Since there have been so many commendable comments, I could only list a few of them in the series debut.  Therefore, please don't be offended if your commendable comment didn't get featured in this post.  Please keep on commenting and stay tuned for future entries in the series.

 

Because of You

As Brian Clark of Copyblogger explains, The Two Most Important Words in Blogging are “You” and “Because.”

I wholeheartedly agree, but prefer to paraphrase it as: Blogging is “because of you.” 

Not you meaning me, the blogger – you meaning you, the reader.

Thank You.

 

Related Posts

Commendable Comments (Part 2)

Commendable Comments (Part 3)


Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?

 

Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects with many characteristics including the following:

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independent of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield.  Second base and short stop are also infield positions, and more specifically middle infield.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
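A position taxonomy like the one described above can be sketched as a small nested structure.  This is only an illustrative example – the grouping names and position codes are my assumptions, not an official classification:

```python
# Illustrative baseball position taxonomy (group names and position
# codes are assumptions for this sketch, not an official standard).
position_taxonomy = {
    "infield": {
        "corner_infield": ["1B", "3B"],
        "middle_infield": ["2B", "SS"],
    },
    "outfield": {
        "outfield": ["LF", "CF", "RF"],
    },
}

def classify(position):
    """Walk the taxonomy to find the groups a position belongs to."""
    for group, subgroups in position_taxonomy.items():
        for subgroup, positions in subgroups.items():
            if position in positions:
                return (group, subgroup)
    return None  # e.g. a pinch runner has no fielding position

print(classify("1B"))  # ('infield', 'corner_infield')
```

Navigating from the individual position up through its groupings is what lets reports roll statistics up to "corner infield" or "infield" without hard-coding those groupings in every query.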

 

Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).
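The Type 2 slowly changing dimension behavior described above can be sketched in a few lines.  This is a simplified in-memory illustration – the column names (effective/expiration dates plus a current-row flag) are one common Type 2 convention, and the player and team values are made up:

```python
from datetime import date

def apply_type2_update(dim_rows, player_id, new_attrs, effective):
    """Expire the current row for a player and insert a new version.

    Old versions are kept (not overwritten), which is what supports
    full history and rollbacks.
    """
    for row in dim_rows:
        if row["player_id"] == player_id and row["is_current"]:
            row["is_current"] = False
            row["expiration_date"] = effective  # old version ends here
    new_row = {"player_id": player_id, "is_current": True,
               "effective_date": effective, "expiration_date": None}
    new_row.update(new_attrs)
    dim_rows.append(new_row)
    return dim_rows

# Hypothetical Player dimension with a single current row.
player_dim = [{"player_id": 24, "team": "Giants", "is_current": True,
               "effective_date": date(2009, 4, 1), "expiration_date": None}]

# The player changes teams mid-season; history is preserved.
apply_type2_update(player_dim, 24, {"team": "Dodgers"}, date(2009, 7, 31))
```

Queries against a given date simply pick the row whose effective/expiration range contains that date, so statistics accrued before the change stay attributed to the old team.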

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players since a Fantasy Player dimension would redundantly store multiple instances of the same professional player for each fantasy team he played for, as well as using Fantasy League and Fantasy Team as snowflaked dimensions.
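A factless fact table can be sketched as rows containing nothing but dimension keys – the "fact" is simply that the row exists.  The league, team, and player values below are invented for illustration:

```python
from datetime import date

# Hypothetical Fantasy Team Lineup factless fact table: each row is just
# dimension keys (date, league, team, player, position) - no measures.
fantasy_team_lineup = {
    (date(2009, 9, 7), "AL-East", "Bothered Bashers", 24, "1B"),
    (date(2009, 9, 7), "AL-East", "Bothered Bashers", 51, "2B"),
    # The same professional player on another fantasy team in a
    # different league - the conformed Player dimension handles this.
    (date(2009, 9, 7), "NL-West", "Dirty Data Dogs", 24, "1B"),
}

def active_players(lineup, on_date, league, team):
    """Answer 'who was active?' by testing row existence, not summing."""
    return sorted(p for (d, lg, t, p, pos) in lineup
                  if d == on_date and lg == league and t == team)

print(active_players(fantasy_team_lineup, date(2009, 9, 7),
                     "AL-East", "Bothered Bashers"))  # [24, 51]
```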

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
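The rolling-snapshot aggregation can be sketched as follows.  The daily fact rows and statistics here are fabricated, and a real warehouse would maintain the snapshot incrementally rather than rescanning the base facts, but the trailing-window logic is the same:

```python
from datetime import date, timedelta

# Hypothetical daily batting base facts keyed by (date, player_id):
# ten days of identical lines for one player, to keep the math obvious.
daily_batting = {
    (date(2009, 9, d), 24): {"AB": 4, "H": 2} for d in range(1, 11)
}

def rolling_totals(facts, as_of, days, player_id):
    """Aggregate the trailing N-day window (inclusive of as_of)."""
    start = as_of - timedelta(days=days - 1)
    totals = {"AB": 0, "H": 0}
    for (d, p), row in facts.items():
        if p == player_id and start <= d <= as_of:
            totals["AB"] += row["AB"]
            totals["H"] += row["H"]
    return totals

print(rolling_totals(daily_batting, date(2009, 9, 10), 7, 24))
```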

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.

Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created have been defined with widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
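The metadata-with-formulas idea can be sketched by pairing each derived statistic with its definition, so every report computes conformed facts the same way.  The dictionary layout is my own illustration, and the secondary average formula shown is just one of the published variants:

```python
# Hypothetical metadata entries: each derived statistic carries both a
# human-readable definition and the formula used warehouse-wide.
stat_definitions = {
    "batting_average": {
        "description": "Hits divided by at bats",
        "formula": lambda s: s["H"] / s["AB"] if s["AB"] else 0.0,
    },
    "secondary_average": {
        # One of several published variants - newer statistics like this
        # are exactly why the metadata table exists.
        "description": "(TB - H + BB + SB) / AB",
        "formula": lambda s: ((s["TB"] - s["H"] + s["BB"] + s["SB"])
                              / s["AB"] if s["AB"] else 0.0),
    },
}

season = {"AB": 500, "H": 150, "TB": 250, "BB": 60, "SB": 20}

avg = stat_definitions["batting_average"]["formula"](season)
print(round(avg, 3))  # 0.3
```

Because both the base and aggregate fact tables resolve a statistic through the same definition, two reports can never disagree about what "secondary average" means.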

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.

 

Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 

This is a significant advantage for fantasy sports – there are numerous external data sources that can be used for validation freely available online.  Of course, this wasn't always the case. 

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.

 

Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines the data quality dimensions, of which the following are most applicable to fantasy sports:

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.

 

Conclusion

I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”


Resistance is NOT Futile

Locutus of Borg: “Your opinion is irrelevant.  We wish to improve ourselves. 

We will add your business and technological distinctiveness to our own. 

Your culture will adapt to service us. 

You will be assimilated.  Resistance is futile.”

Continuing my Star Trek theme, which began with my previous post Hailing Frequencies Open, imagine that you have been called into the ready room to be told your enterprise has decided to implement the proven data quality framework known as Business Operations and Reporting Governance – the BORG.

 

Frameworks are NOT Futile

Please understand – I am an advocate for methodology and best practices, and there are certainly many excellent frameworks that are far from futile.  I have worked on many data quality initiatives that were following a framework and have seen varying degrees of success in their implementation.

However, the fictional BORG framework that I am satirizing exemplifies my general problem with any framework that advocates a one-size-fits-all strategy – an approach that I believe is doomed to fail.

Any implemented framework must be customized to adapt to an organization's unique culture.  In part, this is necessary because implementing changes of any kind will be met with initial resistance.  Forcing a one-size-fits-all approach sends a message to the organization that everything it is currently doing is wrong, which of course only increases the resistance to change.

 

Resistance is NOT Futile

Everyone has opinions – and opinions are never irrelevant.  Fundamentally, all change starts with changing people's minds. 

The starting point has to be improving communication and encouraging open dialogue.  This means listening to what people throughout the organization have to say and not just telling them what to do.  Keeping data aligned with business processes and free from poor quality requires getting people aligned and free to communicate their concerns.

Obviously, there will be dissension.  However, you must seek a mutual understanding by practicing empathic listening.  The goal is to foster an environment in which a diversity of viewpoints is freely shared without bias.

“One of the real dangers is emphasizing consensus over dissent,” explains James Surowiecki in his excellent book The Wisdom of Crowds.  “The best collective decisions are the product of disagreement and contest, not consensus or compromise.  Group deliberations are more successful when they have a clear agenda and when leaders take an active role in making sure that everyone gets a chance to speak.”

 

Avoid Assimilation

In order to be successful in your attempt to implement any framework, you must have realistic expectations. 

Starting with a framework simply provides a reference of best practices and recommended options of what has worked on successful data quality initiatives.  But the framework must still be reviewed in order to determine what can be learned from it and to select what will work in the current environment and what simply won't. 

This doesn't mean that the customized components of the framework will be implemented simultaneously.  All change will be gradual and implemented in phases – without the use of BORG nanoprobes.  You will NOT be assimilated. 

Your organization's collective consciousness will be best served by adapting the framework to your corporate culture. 

Your data quality initiative will facilitate the collaboration of business and technical stakeholders, as well as align data usage with business metrics, and enable people to be responsible for data ownership and data quality. 

Best practices will be disseminated throughout your collective – while also maintaining your individual distinctiveness.

 

Related Posts

Hailing Frequencies Open

Data Governance and Data Quality

Not So Strange Case of Dr. Technology and Mr. Business

The Three Musketeers of Data Quality

You're So Vain, You Probably Think Data Quality Is About You

Hailing Frequencies Open

“This is Captain James E. Harris of the Data Quality Starship Collaboration...”

Clearly, I am a Star Trek nerd – but I am also a people person.  Although people, process, and technology are all important for successful data quality initiatives, without people, process and technology are useless. 

Collaboration is essential.  More than anything else, it requires effective communication – which begins with effective listening.

 

Seek First to Understand...Then to Be Understood

This is Habit 5 from Stephen Covey's excellent book The 7 Habits of Highly Effective People.  “We typically seek first to be understood,” explains Covey.  “Most people do not listen with the intent to understand; they listen with the intent to reply.”

We are all proud of our education, knowledge, understanding, and experience.  Since it is commonly believed that experience is the path that separates knowledge from wisdom, we can't wait to share our wisdom with the world.  However, as Covey cautions, our desire to be understood can make “our conversations become collective monologues.”

Covey explains that listening is an activity that can be practiced at one of the following five levels:

  1. Ignoring – we are not really listening at all.
  2. Pretending – we are only waiting for our turn to speak, constantly nodding and saying: “Yeah. Uh-huh. Right.” 
  3. Selective Listening – we are only hearing certain parts of the conversation, such as when we're listening to the constant chatter of a preschool child.
  4. Attentive Listening – we are paying attention and focusing energy on the words that are being said.
  5. Empathic Listening – we are listening with the intent to truly understand the other person's frame of reference.  You look out through it, you see the world the way they see it, you understand their paradigm, and you understand how they feel.

“Empathy is not sympathy,” explains Covey.  “Sympathy is a form of agreement, a form of judgment.  And it is sometimes the more appropriate response.  But people often feed on sympathy.  It makes them dependent.  The essence of empathic listening is not that you agree with someone; it's that you fully, deeply, understand that person, emotionally as well as intellectually.”

 

Vulcans

Some people balk at discussing the use of emotion in a professional setting, where typically it is believed that rational analysis must protect us from irrational emotions.  To return to a Star Trek metaphor, these people model their professional behavior after the Vulcans. 

Vulcans live according to the philosopher Surak's code of emotional self-control.  Starting at a very young age, they are taught meditation and other techniques in order to suppress their emotions and live a life guided by reason and logic alone.

 

Be Truly Extraordinary

In all professions, it is fairly common to encounter rational and logically intelligent people. 

Truly extraordinary people masterfully blend both kinds of intelligence – intellectual and emotional.  A well-grounded sense of self-confidence, an empathetic personality, and excellent communication skills exert a more powerfully positive influence than remarkable knowledge and expertise alone.

 

Your Away Mission

As a data quality consultant, when I begin an engagement with a new client, I often joke that I shouldn't be allowed to speak for the first two weeks.  This is my way of explaining that I will be asking more questions than providing answers. 

I am seeking first to understand the current environment from both the business and technical perspectives.  Only after I have achieved this understanding, will I then seek to be understood regarding my extensive experience of the best practices that I have seen work on successful data quality initiatives.

As fellow Star Trek nerds know, the captain doesn't go on away missions.  Therefore, your away mission is to try your best to practice empathic listening at your next data quality discussion – “Make It So!”

Data quality initiatives require a holistic approach involving people, process, and technology.  You must consider the people factor first and foremost, because it will be the people involved, and not the process or the technology, that will truly allow your data quality initiative to “Live Long and Prosper.”

 

As always, hailing frequencies remain open to your comments.  And yes, I am trying my best to practice empathic listening.

 

Related Posts

Not So Strange Case of Dr. Technology and Mr. Business

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You

Hyperactive Data Quality (Second Edition)

In the first edition of Hyperactive Data Quality, I discussed reactive and proactive approaches using the data quality lake analogy from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset:

“...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH.

The alternative is to reduce the pollutant at the point source – the factories.

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data.”

Reactive Data Quality

Reactive Data Quality (i.e. “cleaning the lake” in Redman's analogy) focuses entirely on finding and fixing the problems with existing data after it has been extracted from its sources. 

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert “lake cleaners”).  Data quality problems can be very insidious and even the best “lake cleaning” process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.
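The validate-and-suspend pattern described above can be sketched simply.  The validation rules and field names below are illustrative assumptions, not a prescription – the point is that failing records are queued for manual review rather than silently loaded:

```python
# Sketch of the exception-handling step in a reactive cleansing process.
# Rules and field names are hypothetical; real rules come from profiling
# and from the business requirements.

def validate(record):
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    if record.get("birth_date", "") > "2009-09-10":  # ISO dates compare as strings
        errors.append("birth_date in the future")
    return errors

def load_with_suspense(records):
    """Load clean records; suspend exceptions for manual review."""
    loaded, suspended = [], []
    for rec in records:
        errors = validate(rec)
        if errors:
            suspended.append({"record": rec, "errors": errors})
        else:
            loaded.append(rec)
    return loaded, suspended

incoming = [
    {"customer_id": "C1", "birth_date": "1972-05-01"},
    {"customer_id": "",   "birth_date": "1980-01-15"},
]
loaded, suspended = load_with_suspense(incoming)
print(len(loaded), len(suspended))  # 1 1
```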

 

Proactive Data Quality

Proactive Data Quality focuses on preventing errors at the sources where data is entered or received, before it is extracted for use by downstream applications (i.e. before it “enters the lake” in Redman's analogy). 

Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

“It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect.”

Proactive data quality advocates reevaluating business processes that create data, implementing improved controls on data entry screens and web forms, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the information needs of your consumers before delivering enterprise data for their use.

 

Proactive Data Quality > Reactive Data Quality

Proactive data quality is clearly the superior approach.  Although it is impossible to truly prevent every problem before it happens, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information. 

Reactive data quality essentially treats the symptoms without curing the disease.  As Redman explains: “...the problem with being a good lake cleaner is that life never gets better...it gets worse as more data...conspire to mean there is more work every day.”

So why do the vast majority of data quality initiatives use a reactive approach?

 

An Arrow Thickly Smeared With Poison

In Buddhism, there is a famous parable:

A man was shot with an arrow thickly smeared with poison.  His friends wanted to get a doctor to heal him, but the man objected by saying:

“I will neither allow this arrow to be pulled out nor accept any medical treatment until I know the name of the man who wounded me, whether he was a nobleman or a soldier or a merchant or a farmer or a lowly peasant, whether he was tall or short or of average height, whether he used a long bow or a crossbow, and whether the arrow that wounded me was hoof-tipped or curved or barbed.” 

While his friends went off in a frantic search for these answers, the man slowly, and painfully, died.

 

“Flight to Data Quality”

In economics, the term “flight to quality” describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments.

A similar “flight to data quality” can occur in the aftermath of an event in which poor data quality negatively impacted decision-critical enterprise information.  Examples include a customer service nightmare, a regulatory compliance failure, and a financial reporting scandal. 

Driven by a business triage for critical data problems, reactive data cleansing is purposefully chosen over proactive defect prevention.  The priority is finding and fixing the near-term problems rather than worrying about the long-term consequences of not identifying the root cause and implementing the process improvements that would prevent the problems from happening again.

The enterprise has been shot with an arrow thickly smeared with poison – poor data quality.  Now is not the time to point out that the enterprise has actually shot itself by failing to have proactive measures in place. 

Reactive data quality only treats the symptoms.  However, during triage, the priority is to stabilize the patient.  A cure for the underlying condition is worthless if the patient dies before it can be administered.

 

Hyperactive Data Quality

Proactive data quality is the best practice.  Root cause analysis, business process improvement, and defect prevention will always be more effective than the endlessly vicious cycle of reactive data cleansing. 

A data governance framework is necessary for proactive data quality to be successful.  Patience and understanding are also necessary.  Proactive data quality requires a strategic organizational transformation that will not happen easily or quickly. 

Even when not facing an immediate crisis, the reality is that reactive data quality will occasionally be a necessary evil that is used to correct today's problems while proactive data quality is busy trying to prevent tomorrow's problems.

Like any complex problem, data quality has no fast and easy solution.  Fundamentally, a hybrid discipline is required that combines proactive and reactive aspects into an approach I refer to as Hyperactive Data Quality, which makes the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.

 

Related Posts

Hyperactive Data Quality (First Edition)

The General Theory of Data Quality

The Wisdom of Failure

Earlier this month, I had the honor of being interviewed by Ajay Ohri on his blog Decision Stats, which is an excellent source of insights on business intelligence and data mining as well as interviews with industry thought leaders and chief evangelists.

One of the questions Ajay asked me during the interview was what methods and habits I would recommend to young analysts just starting in the business intelligence field.  Part of my response was:

“Don't be afraid to ask questions or admit when you don't know the answers.  The only difference between a young analyst just starting out and an expert is that the expert has already made and learned from all the mistakes caused by being afraid to ask questions or admitting when you don't know the answers.”

It is perhaps one of life’s cruelest paradoxes that some lessons simply cannot be taught, but instead have to be learned through the pain of making mistakes.  To err is human, but not all humans learn from their errors.  In fact, some of us find it extremely difficult to even simply acknowledge when we have made a mistake.  This was certainly true for me earlier in my career.

 

The Wisdom of Crowds

One of my favorite books is The Wisdom of Crowds by James Surowiecki.  Before reading it, I admit that I believed crowds were incapable of wisdom and that the best decisions are based on the expert advice of carefully selected individuals.  However, Surowiecki wonderfully elucidates the folly of “chasing the expert” and explains the four conditions that characterize wise crowds: diversity of opinion, independent thinking, decentralization and aggregation.  The book is also balanced by examining the conditions (e.g. confirmation bias and groupthink) that can commonly undermine the wisdom of crowds.  All in all, it is a wonderful discourse on both collective intelligence and collective ignorance with practical advice on how to achieve the former and avoid the latter.

 

Chasing the Data Quality Expert

Without question, a data quality expert can be an invaluable member of your team.  Often an external consultant, a data quality expert can provide extensive experience and best practices from successful implementations.  However, regardless of their experience, even with other companies in your industry, every organization and its data are unique.  An expert's perspective definitely has merit, but their opinions and advice should not be allowed to dominate the decision making process. 

“The more power you give a single individual in the face of complexity,” explains Surowiecki, “the more likely it is that bad decisions will get made.”  No one person, regardless of their experience and expertise, can succeed on their own.  According to Surowiecki, the best experts “recognize the limits of their own knowledge and of individual decision making.”

 

“Success is on the far side of failure”

One of the most common obstacles organizations face with data quality initiatives is that many initial attempts end in failure.  Some fail because of lofty expectations, unmanaged scope creep, and the unrealistic perspective that data quality problems can be permanently “fixed” by a one-time project as opposed to needing a sustained program.  However, regardless of the reason for the failure, it can negatively affect morale and cause employees to resist participating in the next data quality effort.

Although a common best practice is to perform a post-mortem in order to document the lessons learned, sometimes the stigma of failure persuades an organization to either skip the post-mortem or ignore its findings. 

However, in the famous words of IBM founder Thomas J. Watson: “Success is on the far side of failure.” 

A failed data quality initiative may have been closer to success than you realize.  At the very least, there are important lessons to be learned from the mistakes that were made.  The sooner you can recognize your mistakes, the sooner you can mitigate their effects and hopefully prevent them from happening again.

 

The Wisdom of Failure

In one of my other favorite books, How We Decide, Jonah Lehrer explains:

“The brain always learns the same way, accumulating wisdom through error...there are no shortcuts to this painstaking process...becoming an expert just takes time and practice...once you have developed expertise in a particular area...you have made the requisite mistakes.”

Therefore, although it may be true that experience is the path that separates knowledge from wisdom, I have come to realize that the true wisdom of my experience is the wisdom of failure.

 

Related Posts

A Portrait of the Data Quality Expert as a Young Idiot

All I Really Need To Know About Data Quality I Learned In Kindergarten

The Nine Circles of Data Quality Hell

Data Governance and Data Quality

Regular readers know that I often blog about the common mistakes I have observed (and made) in my professional services and application development experience in data quality (for example, see my post: The Nine Circles of Data Quality Hell).

According to Wikipedia: “Data governance is an emerging discipline with an evolving definition.  The discipline embodies a convergence of data quality, data management, business process management, and risk management surrounding the handling of data in an organization.”

Since I have never formally used the term “data governance” with my clients, I have been researching what data governance is and how it specifically relates to data quality.

Thankfully, I found a great resource in Steve Sarsfield's excellent book The Data Governance Imperative, where he explains:

“Data governance is about changing the hearts and minds of your company to see the value of information quality...data governance is a set of processes that ensures that important data assets are formally managed throughout the enterprise...at the root of the problems with managing your data are data quality problems...data governance guarantees that data can be trusted...putting people in charge of fixing and preventing issues with data...to have fewer negative events as a result of poor data.”

Although the book covers data governance more comprehensively, I focused on three of my favorite data quality themes:

  • Business-IT Collaboration
  • Data Quality Assessments
  • People Power

 

Business-IT Collaboration

Data governance establishes policies and procedures to align people throughout the organization.  Successful data quality initiatives require the Business and IT to forge an ongoing and iterative collaboration.  Neither the Business nor IT alone has all of the necessary knowledge and resources required to achieve data quality success.  The Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise and must partner with IT in defining the necessary data quality standards and processes. 

Steve Sarsfield explains:

“Business users need to understand that data quality is everyone's job and not just an issue with technology...the mantra of data governance is that technologists and business users must work together to define what good data is...constantly leverage both business users, who know the value of the data, and technologists, who can apply what the business users know to the data.” 

Data Quality Assessments

Data quality assessments provide a much needed reality check for the perceptions and assumptions that the enterprise has about the quality of its data.  Data quality assessments help with many tasks, including verifying metadata, preparing meaningful questions for subject matter experts, understanding how data is being used, and most importantly – evaluating the ROI of data quality improvements.  Building data quality monitoring functionality into the applications that support business processes provides the ability to measure the effect that poor data quality can have on decision-critical information.

Steve Sarsfield explains:

“In order to know if you're winning in the fight against poor data quality, you have to keep score...use data quality scorecards to understand the detail about quality of data...and aggregate those scores into business value metrics...solid metrics...give you a baseline against which you can measure improvement over time.” 
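A data quality scorecard of the kind Sarsfield describes can be sketched with a single completeness rule (a real scorecard would add validity, uniqueness, timeliness, and business-value weightings):

```python
def completeness(rows, column):
    """Fraction of rows with a non-empty value in the given column."""
    filled = sum(1 for row in rows if str(row.get(column, "")).strip())
    return filled / len(rows)

def scorecard(rows, columns):
    """Per-column scores plus a simple aggregate for trending over time."""
    scores = {column: completeness(rows, column) for column in columns}
    scores["overall"] = sum(scores.values()) / len(columns)
    return scores

rows = [
    {"name": "Acme", "phone": "555-0100"},
    {"name": "Widgets Inc", "phone": ""},
    {"name": "", "phone": "555-0199"},
]
card = scorecard(rows, ["name", "phone"])
```

Measured against the same baseline each month, even this crude score shows whether the fight against poor data quality is being won.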

People Power

Although incredible advancements continue, technology alone cannot provide the solution.  Data governance and data quality both require a holistic approach involving people, process and technology.  However, by far the most important of the three is people.  In my experience, it is always the people involved that make projects successful.

Steve Sarsfield explains:

“The most important aspect of implementing data governance is that people power must be used to improve the processes within an organization.  Technology will have its place, but it's most importantly the people who set up new processes who make the biggest impact.”

Conclusion

Data governance provides the framework for evolving data quality from a project into an enterprise-wide initiative.  By facilitating the collaboration of business and technical stakeholders, aligning data usage with business metrics, and enabling people to be responsible for data ownership and data quality, data governance provides for the ongoing management of the decision-critical information that drives the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

 

Related Posts

TDWI World Conference Chicago 2009

Not So Strange Case of Dr. Technology and Mr. Business

Schrödinger's Data Quality

The Three Musketeers of Data Quality

 

Additional Resources

Over on Data Quality Pro, read the following posts:

From the IAIDQ publications portal, read the 2008 industry report: The State of Information and Data Governance

Read Steve Sarsfield's book: The Data Governance Imperative and read his blog: Data Governance and Data Quality Insider

Missed It By That Much

In the mission to gain control over data chaos, a project is launched to implement a new system that helps remediate the poor data quality negatively impacting decision-critical enterprise information. 

The project appears to be well planned.  Business requirements were well documented.  A data quality assessment was performed to gain an understanding of the data challenges that would be faced during development and testing.  Detailed architectural and functional specifications were written to guide these efforts.

The project appears to be progressing well.  Business, technical and data issues all come up from time to time.  Meetings are held to prioritize the issues and determine their impact.  Some issues require immediate fixes, while other issues are deferred to the next phase of the project.  All of these decisions are documented and well communicated to the end-user community.

Expectations appear to have been properly set for end-user acceptance testing.

As a best practice, the new system was designed to identify and report exceptions when they occur.  The end-users agreed that an obsessive-compulsive quest to find and fix every data quality problem is a laudable pursuit but ultimately a self-defeating cause.  Data quality problems can be very insidious and even the best data remediation process will still produce exceptions.

Although all of this is easy to accept in theory, it is notoriously difficult to accept in practice.

Once the end-users start reviewing the exceptions, their confidence in the new system drops rapidly.  Even after some enhancements increase the percentage of records without an exception from 86% to 99% – the end-users continue to focus on the remaining 1% of the records that are still producing data quality exceptions.

Would you believe this incredibly common scenario can prevent acceptance of an overwhelmingly successful implementation?

How about if I quote one of the many people who can help you get smarter than listening only to me?

In his excellent book Why New Systems Fail: Theory and Practice Collide, Phil Simon explains:

“Systems are to be appreciated by their general effects, and not by particular exceptions...

Errors are actually helpful the vast majority of the time.”

In fact, because the new system was designed to identify and report errors when they occur:

“End-users could focus on the root causes of the problem and not have to wade through hundreds of thousands of records in an attempt to find the problem records.”

I have seen projects fail in the many ways described by the detailed case studies in Phil Simon's fantastic book.  However, one of the most common and frustrating data quality failures is the project that came so close to success, only for the focus on exceptions to leave the end-users telling us that we “missed it by that much.”

I am not suggesting that end-users are unrealistic, nor that exceptions should be ignored. 

Reducing exceptions (i.e. poor data quality) is the whole point of the project and nobody understands the data better than the end-users.  However, chasing perfection can undermine the best intentions. 

In order to be successful, data quality projects must always be understood as an iterative process.  Small incremental improvements will build momentum to larger success over time. 

Instead of focusing on the exceptions – focus on the improvements. 

And you will begin making steady progress toward improving your data quality.

And loving it!

 

Related Posts

The Data Quality Goldilocks Zone

Schrödinger's Data Quality

The Nine Circles of Data Quality Hell

The Data-Information Continuum

Data is one of the enterprise's most important assets.  Data quality is a fundamental success factor for the decision-critical information that drives the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

When the results of these initiatives don't meet expectations, analysis often reveals that poor data quality is a root cause.  Projects are launched to understand and remediate this problem by establishing enterprise-wide data quality standards.

However, a common issue is a lack of understanding about what I refer to as the Data-Information Continuum.

 

The Data-Information Continuum

In physics, the Space-Time Continuum explains that space and time are interrelated entities forming a single continuum.  In classical mechanics, the passage of time can be considered a constant for all observers of spatial objects in motion.  In relativistic contexts, the passage of time is a variable changing for each specific observer of spatial objects in motion.

Data and information are also interrelated entities forming a single continuum.  It is crucial to understand how they are different and how they relate.  I like using the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers). 

A common data quality definition is fitness for the purpose of use.  A common challenge is data has multiple uses, each with its own fitness requirements.  I like to view each intended use as the information that is derived from data, defining information as data in use or data in action.

Data could be considered a constant while information is a variable that redefines data for each specific use.  Data is not truly a constant since it is constantly changing.  However, information is still derived from data and many different derivations can be performed while data is in the same state (i.e. before it changes again). 
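A tiny illustration of that point: one state of the data, with two different derivations of information from it (the records and uses are, of course, invented):

```python
# One state of the data: "just the facts" about real-world entities.
customers = [
    {"name": "Alice", "country": "US", "active": True},
    {"name": "Bob", "country": "CA", "active": False},
]

# Information for marketing: only active customers are fit for this use.
mailing_list = [c["name"] for c in customers if c["active"]]

# Information for compliance: every customer, grouped by country.
by_country = {}
for c in customers:
    by_country.setdefault(c["country"], []).append(c["name"])
```

The data did not change between the two derivations; only the intended use, and therefore the fitness requirements, changed.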

Quality within the Data-Information Continuum has both objective and subjective dimensions.

 

Objective Data Quality

Data's quality must be objectively measured separate from its many uses.  Enterprise-wide data quality standards must provide a highest common denominator for all business units to use as an objective data foundation for their specific tactical and strategic initiatives.  Raw data extracted directly from its sources must be profiled, analyzed, transformed, cleansed, documented and monitored by data quality processes designed to provide and maintain universal data sources for the enterprise's information needs.  At this phase, the manipulations of raw data by these processes must be limited to objective standards and not be customized for any subjective use.
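Objective profiling might look like this minimal sketch, producing use-neutral statistics about raw data that any business unit can build on (the column names are hypothetical):

```python
from collections import Counter

def profile(rows, column):
    """Use-neutral statistics about one column of raw data."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

rows = [{"state": "IL"}, {"state": "IL"}, {"state": "WI"}, {"state": ""}]
stats = profile(rows, "state")
```

Nothing here is customized for a subjective use; the same profile serves every downstream information need.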

 

Subjective Information Quality

Information's quality can only be subjectively measured according to its specific use.  Information quality standards are not enterprise-wide; they are customized to a specific business unit or initiative.  However, all business units and initiatives must begin defining their information quality standards by using the enterprise-wide data quality standards as a foundation.  This approach allows leveraging a consistent enterprise understanding of data while also deriving the information necessary for the day-to-day operation of each business unit and initiative.

 

A “Single Version of the Truth” or the “One Lie Strategy”

A common objection to separating quality standards into objective data quality and subjective information quality is the enterprise's significant interest in creating what is commonly referred to as a single version of the truth.

However, in his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman explains:

“A fiendishly attractive concept is...'a single version of the truth'...the logic is compelling...unfortunately, there is no single version of the truth. 

For all important data, there are...too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. 

This does not imply malfeasance on anyone's part; it is simply a fact of life. 

Getting everyone to work from a single version of the truth may be a noble goal, but it is better to call this the 'one lie strategy' than anything resembling truth.”

Conclusion

There is a significant difference between data and information and therefore a significant difference between data quality and information quality.  Many data quality projects are in fact implementations of information quality customized to the specific business unit or initiative that is funding the project.  Although these projects can achieve some initial success, they encounter failures in later iterations and phases when information quality standards try to act as enterprise-wide data quality standards. 

Significant time and money can be wasted by not understanding the Data-Information Continuum.

Data Gazers

Within cubicles randomly dispersed throughout the sprawling office space of companies large and small, there exist countless unsung heroes of enterprise information initiatives.  Although their job titles might label them as a Business Analyst, Programmer Analyst, Account Specialist or Application Developer, their true vocation is a far more noble calling.

 

They are Data Gazers.

 

In his excellent book Data Quality Assessment, Arkady Maydanchik explains that:

"Data gazing involves looking at the data and trying to reconstruct a story behind these data.  Following the real story helps identify parameters about what might or might not have happened and how to design data quality rules to verify these parameters.  Data gazing mostly uses deduction and common sense."

All enterprise information initiatives are complex endeavors and data quality projects are certainly no exception.  Success requires people taking on the challenge united by collaboration, guided by an effective methodology, and implementing a solution using powerful technology.

But the complexity of the project can sometimes work against your best intentions.  It is easy to get pulled into the mechanics of documenting the business requirements and functional specifications and then charging ahead on the common mantra:

"We planned the work, now we work the plan." 

Once the project achieves some momentum, it can take on a life of its own and the focus becomes more and more about making progress against the tasks in the project plan, and less and less on the project's actual goal...improving the quality of the data. 

In fact, I have often observed the bizarre phenomenon where, as a project "progresses", it tends to get further and further away from the people who use the data on a daily basis.

 

However, Arkady Maydanchik explains that:

"Nobody knows the data better than the users.  Unknown to the big bosses, the people in the trenches are measuring data quality every day.  And while they rarely can give a comprehensive picture, each one of them has encountered certain data problems and developed standard routines to look for them.  Talking to the users never fails to yield otherwise unknown data quality rules with many data errors."

There is a general tendency to assume that working directly with the users and the data during application development can only disrupt the project's progress.  There can be a quiet comfort and joy in simply developing from documentation and letting the interaction with the users and the data wait until the project plan indicates that user acceptance testing begins. 

The project team can convince themselves that the documented business requirements and functional specifications are suitable surrogates for the direct knowledge of the data that users possess.  It is easy to believe that these documents tell you what the data is and what the rules are for improving the quality of the data.

Therefore, although ignoring the users and the data until user acceptance testing begins may be a good way to keep a data quality project on schedule, you will only be delaying the project's inevitable failure, because, as all data gazers know and as my mentor Morpheus taught me:

"Unfortunately, no one can be told what the Data is. You have to see it for yourself."


TDWI World Conference Chicago 2009

Founded in 1995, TDWI (The Data Warehousing Institute™) is the premier educational institute for business intelligence and data warehousing that provides education, training, certification, news, and research for executives and information technology professionals worldwide.  TDWI conferences always offer a variety of full-day and half-day courses taught in an objective, vendor-neutral manner.  The courses are designed for professionals and taught by in-the-trenches practitioners who are well known in the industry.

 

TDWI World Conference Chicago 2009 was held May 3-8 in Chicago, Illinois at the Hyatt Regency Hotel and was a tremendous success.  I attended as a Data Quality Journalist for the International Association for Information and Data Quality (IAIDQ).

I used Twitter to provide live reporting from the conference.  Here are my notes from the courses I attended: 

 

BI from Both Sides: Aligning Business and IT

Jill Dyché, CBIP, is a partner and co-founder of Baseline Consulting, a management and technology consulting firm that provides data integration and business analytics services.  Jill is responsible for delivering industry and client advisory services, is a frequent lecturer and writer on the business value of IT, and writes the excellent Inside the Biz blog.  She is the author of acclaimed books on the business value of information: e-Data: Turning Data Into Information With Data Warehousing and The CRM Handbook: A Business Guide to Customer Relationship Management.  Her latest book, written with Evan Levy, is Customer Data Integration: Reaching a Single Version of the Truth.

Course Quotes from Jill Dyché:

  • Five Critical Success Factors for Business Intelligence (BI):
    1. Organization - Build organizational structures and skills to foster a sustainable program
    2. Processes - Align both business and IT development processes that facilitate delivery of ongoing business value
    3. Technology - Select and build technologies that deploy information cost-effectively
    4. Strategy - Align information solutions to the company's strategic goals and objectives
    5. Information - Treat data as an asset by separating data management from technology implementation
  • Three Different Requirement Categories:
    1. What is the business need, pain, or problem?  What business questions do we need to answer?
    2. What data is necessary to answer those business questions?
    3. How do we need to use the resulting information to answer those business questions?
  • “Data warehouses are used to make business decisions based on data – so data quality is critical”
  • “Even companies with mature enterprise data warehouses still have data silos - each business area has its own data mart”
  • “Instead of pushing a business intelligence tool, just try to get people to start using data”
  • “Deliver a usable system that is valuable to the business and not just a big box full of data”

 

TDWI Data Governance Summit

Philip Russom is the Senior Manager of Research and Services at TDWI, where he oversees many of TDWI’s research-oriented publications, services, and events.  Prior to joining TDWI in 2005, he was an industry analyst covering BI at Forrester Research, as well as a contributing editor with Intelligent Enterprise and Information Management (formerly DM Review) magazines.

Summit Quotes from Philip Russom:

  • “Data Governance usually boils down to some form of control for data and its usage”
  • “Four Ps of Data Governance: People, Policies, Procedures, Process”
  • “Three Pillars of Data Governance: Compliance, Business Transformation, Business Integration”
  • “Two Foundations of Data Governance: Business Initiatives and Data Management Practices”
  • “Cross-functional collaboration is a requirement for successful Data Governance”

 

Becky Briggs, CBIP, CMQ/OE, is a Senior Manager and Data Steward for Airlines Reporting Corporation (ARC) and has 25 years of experience in data processing and IT - the last 9 in data warehousing and BI.  She leads the program team responsible for product, project, and quality management, business line performance management, and data governance/stewardship.

Summit Quotes from Becky Briggs:

  • “Data Governance is the act of managing the organization's data assets in a way that promotes business value, integrity, usability, security and consistency across the company”
  • Five Steps of Data Governance:
    1. Determine what data is required
    2. Evaluate potential data sources (internal and external)
    3. Perform data profiling and analysis on data sources
    4. Data Services - Definition, modeling, mapping, quality, integration, monitoring
    5. Data Stewardship - Classification, access requirements, archiving guidelines
  • “You must realize and accept that Data Governance is a program and not just a project”

 

Barbara Shelby is a Senior Software Engineer for IBM with over 25 years of experience holding positions of technical specialist, consultant, and line management.  Her global management and leadership positions encompassed network authentication, authorization application development, corporate business systems data architecture, and database development.

Summit Quotes from Barbara Shelby:

  • Four Common Barriers to Data Governance:
    1. Information - Existence of information silos and inconsistent data meanings
    2. Organization - Lack of end-to-end data ownership and organization cultural challenges
    3. Skill - Difficulty shifting resources from operational to transformational initiatives
    4. Technology - Business data locked in large applications and slow deployment of new technology
  • Four Key Decision Making Bodies for Data Governance:
    1. Enterprise Integration Team - Oversees the execution of CIO funded cross enterprise initiatives
    2. Integrated Enterprise Assessment - Responsible for the success of transformational initiatives
    3. Integrated Portfolio Management Team - Responsible for making ongoing business investment decisions
    4. Unit Architecture Review - Responsible for the IT architecture compliance of business unit solutions

 

Lee Doss is a Senior IT Architect for IBM with over 25 years of information technology experience.  He has a patent for process of aligning strategic capability for business transformation and he has held various positions including strategy, design, development, and customer support for IBM networking software products.

Summit Quotes from Lee Doss:

  • Five Data Governance Best Practices:
    1. Create a sense of urgency that the organization can rally around
    2. Start small, grow fast...pick a few visible areas to set an example
    3. Sunset legacy systems (application, data, tools) as new ones are deployed
    4. Recognize the importance of organization culture…this will make or break you
    5. Always, always, always – Listen to your customers

 

Kevin Kramer is a Senior Vice President and Director of Enterprise Sales for UMB Bank and is responsible for development of sales strategy, sales tool development, and implementation of enterprise-wide sales initiatives.

Summit Quotes from Kevin Kramer:

  • “Without Data Governance, multiple sources of customer information can produce multiple versions of the truth”
  • “Data Governance helps break down organizational silos and shares customer data as an enterprise asset”
  • “Data Governance provides a roadmap that translates into best practices throughout the entire enterprise”

 

Kanon Cozad is a Senior Vice President and Director of Application Development for UMB Bank and is responsible for overall technical architecture strategy and oversees information integration activities.

Summit Quotes from Kanon Cozad:

  • “Data Governance identifies business process priorities and then translates them into enabling technology”
  • “Data Governance provides direction and Data Stewardship puts direction into action”
  • “Data Stewardship identifies and prioritizes applications and data for consolidation and improvement”

 

Jill Dyché, CBIP, is a partner and co-founder of Baseline Consulting, a management and technology consulting firm that provides data integration and business analytics services.  (For Jill's complete bio, please see above).

Summit Quotes from Jill Dyché:

  • “The hard part of Data Governance is the data”
  • “No data will be formally sanctioned unless it meets a business need”
  • “Data Governance focuses on policies and strategic alignment”
  • “Data Management focuses on translating defined policies into executable actions”
  • “Entrench Data Governance in the development environment”
  • “Everything is customer data – even product and financial data”

 

Data Quality Assessment - Practical Skills

Arkady Maydanchik is a co-founder of Data Quality Group, a recognized practitioner, author, and educator in the field of data quality and information integration.  Arkady's data quality methodology and breakthrough ARKISTRA technology were used to provide services to numerous organizations.  Arkady is the author of the excellent book Data Quality Assessment, a frequent speaker at various conferences and seminars, and a contributor to many journals and online publications.  Data quality curriculum by Arkady Maydanchik can be found at eLearningCurve.

Course Quotes from Arkady Maydanchik:

  • “Nothing is worse for data quality than desperately trying to fix it during the last few weeks of an ETL project”
  • “Quality of data after conversion is in direct correlation with the amount of knowledge about actual data”
  • “Data profiling tools do not do data profiling - it is done by data analysts using data profiling tools”
  • “Data Profiling does not answer any questions - it helps us ask meaningful questions”
  • “Data quality is measured by its fitness to the purpose of use – it's essential to understand how data is used”
  • “When data has multiple uses, there must be data quality rules for each specific use”
  • “Effective root cause analysis requires not stopping after the answer to your first question - Keep asking: Why?”
  • “The central product of a Data Quality Assessment is the Data Quality Scorecard”
  • “Data quality scores must be both meaningful to a specific data use and be actionable”
  • “Data quality scores must estimate both the cost of bad data and the ROI of data quality initiatives”
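To make these ideas concrete, here is a minimal sketch of a data quality scorecard in the spirit Maydanchik describes. The record fields, rule names, and data uses below are hypothetical illustrations, not taken from his book or course: each data use gets its own rules, and each score reports which rules failed so it stays both meaningful and actionable.

```python
# Hypothetical sketch of a data quality scorecard: rule results are
# aggregated into a score per data use, so each score is meaningful
# to a specific use and actionable (it shows which rules failed).

def scorecard(records, rules_by_use):
    """rules_by_use maps a data use to a list of (rule_name, predicate)."""
    scores = {}
    for use, rules in rules_by_use.items():
        failures = {name: 0 for name, _ in rules}
        for record in records:
            for name, predicate in rules:
                if not predicate(record):
                    failures[name] += 1
        total_checks = len(records) * len(rules)
        passed = total_checks - sum(failures.values())
        scores[use] = {
            "score": passed / total_checks if total_checks else 1.0,
            "failures": failures,  # actionable: which rule failed, how often
        }
    return scores

# Illustrative data: one clean customer record, one missing an email.
customers = [
    {"email": "a@example.com", "zip": "50309"},
    {"email": "", "zip": "50309"},
]
rules = {
    "marketing": [("has_email", lambda r: bool(r["email"]))],
    "shipping": [("has_zip", lambda r: bool(r["zip"]))],
}
result = scorecard(customers, rules)
```

Note how the same records score differently per use: the missing email hurts the marketing score but leaves the shipping score perfect, which is exactly why rules must exist for each specific use.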

 

Modern Data Quality Techniques in Action - A Demonstration Using Human Resources Data

Gian Di Loreto formed Loreto Services and Technologies in 2004 from the client services division of Arkidata Corporation.  Loreto Services provides data cleansing and integration consulting services to Fortune 500 companies.  Gian is a classically trained scientist - he received his PhD in elementary particle physics from Michigan State University.

Course Quotes from Gian Di Loreto:

  • “Data Quality is rich with theory and concepts – however it is not an academic exercise, it has real business impact”
  • “To do data quality well, you must walk away from the computer and go talk with the people using the data”
  • “Undertaking a data quality initiative demands developing a deeper knowledge of the data and the business”
  • “Some essential data quality rules are ‘hidden’ and can only be discovered by ‘clicking around’ in the data”
  • “Data quality projects are not about systems working together - they are about people working together”
  • “Sometimes, data quality can be ‘good enough’ for source systems but not when integrated with other systems”
  • “Unfortunately, no one seems to care about bad data until they have it”
  • “Data quality projects are only successful when you understand the problem before trying to solve it”

 

Mark Your Calendar

TDWI World Conference San Diego 2009 - August 2-7, 2009.

TDWI World Conference Orlando 2009 - November 1-6, 2009.

TDWI World Conference Las Vegas 2010 - February 21-26, 2010.

Hyperactive Data Quality

In economics, the term "flight to quality" describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments. 

A similar "flight to data quality" can occur in the aftermath of an event in which poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure or a financial reporting scandal.  Whatever the triggering event, a common response is that data quality suddenly becomes prioritized as a critical issue and an enterprise information initiative is launched.

Congratulations!  You've realized (albeit the hard way) that this "data quality thing" is really important.

Now what are you going to do about it?  How are you going to attempt to actually solve the problem?

In his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman uses an analogy he calls the data quality lake:

"...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH. 

The alternative is to reduce the pollutant at the point source - the factories. 

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data."

 

Reactive Data Quality

A "flight to data quality" usually prompts an approach commonly referred to as Reactive Data Quality (i.e. "cleaning the lake" to use Redman's analogy).  The majority of enterprise information initiatives are reactive.  The focus is typically on finding and fixing the problems with existing data in an operational data store (ODS), enterprise data warehouse (EDW) or other enterprise information repository.  In other words, the focus is on fixing data after it has been extracted from its sources.

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert "lake cleaners").  Data quality problems can be very insidious and even the best "lake cleaning" process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.
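As a minimal sketch of that best practice (the record fields and validation rules below are hypothetical, purely for illustration), a reactive "lake cleaning" process can report exceptions and suspend the offending records for manual review rather than silently loading them:

```python
# Hypothetical sketch of reactive data quality: records that fail
# validation are suspended for manual review instead of being loaded.

def validate(record):
    """Return a list of exception messages (empty means the record is clean)."""
    exceptions = []
    if not record.get("customer_id"):
        exceptions.append("missing customer_id")
    if "@" not in record.get("email", ""):
        exceptions.append("malformed email")
    return exceptions

def load(records):
    """Split incoming records into clean rows and a suspense queue."""
    clean, suspended = [], []
    for record in records:
        exceptions = validate(record)
        if exceptions:
            # Suspend the record with its exceptions for manual review.
            suspended.append({"record": record, "exceptions": exceptions})
        else:
            clean.append(record)
    return clean, suspended

clean, suspended = load([
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "", "email": "not-an-email"},
])
```

The key design choice is that the suspense queue preserves both the record and the reasons it was rejected, so the manual review step knows exactly what to correct.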

However, as Redman cautions: "...the problem with being a good lake cleaner is that life never gets better.  Indeed, it gets worse as more data...conspire to mean there is more work every day."  I tell my clients the only way to guarantee that reactive data quality will be successful is to unplug all the computers so that no one can add new data or modify existing data.

 

Proactive Data Quality

Attempting to prevent data quality problems before they happen is commonly referred to as Proactive Data Quality.  The focus is on preventing errors at the sources where data is entered or received and before it is extracted for use by downstream applications (i.e. "enters the lake").  Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

"It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect."

Proactive data quality advocates implementing improved edit controls on data entry screens, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the business needs of your enterprise information consumers before you deliver data to them.

Obviously, it is impossible to truly prevent every problem before it happens.  However, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information.
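A proactive edit control might look like the following sketch (the field names and the ZIP code rule are hypothetical examples, not a prescribed standard): defective input is rejected at the point of entry, before it ever "enters the lake" and reaches downstream applications.

```python
# Hypothetical sketch of a proactive edit control: the entry point
# refuses defective data, so errors never "enter the lake".
import re

# Illustrative rule: US ZIP codes as NNNNN or NNNNN-NNNN.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

class EntryError(ValueError):
    """Raised when input fails validation at the point of entry."""

def enter_customer(db, name, zip_code):
    """Validate at the source; only clean records reach storage."""
    if not name.strip():
        raise EntryError("name is required")
    if not ZIP_PATTERN.match(zip_code):
        raise EntryError("zip_code must be NNNNN or NNNNN-NNNN")
    db.append({"name": name.strip(), "zip_code": zip_code})

db = []
enter_customer(db, "Ada Lovelace", "50309")   # accepted
try:
    enter_customer(db, "  ", "ABCDE")          # rejected at the source
except EntryError:
    pass
```

Per Redman's Rule of Ten, every defect stopped here is roughly ten times cheaper than completing downstream work with that defective input.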

 

Hyperactive Data Quality

Too many enterprise information initiatives fail because they are launched based on a "flight to data quality" response and have the unrealistic perspective that data quality problems can be quickly and easily resolved.  However, just like any complex problem, there is no fast and easy solution for data quality.

To be successful, you must combine aspects of both reactive and proactive data quality to create an enterprise-wide best practice that I call Hyperactive Data Quality, which makes the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.  Is your data quality Reactive, Proactive or Hyperactive?

All I Really Need To Know About Data Quality I Learned In Kindergarten

Robert Fulghum's excellent book All I Really Need to Know I Learned in Kindergarten dominated the New York Times Bestseller List for all of 1989 and much of 1990.  The 15th Anniversary Edition, which was published in 2003, revised and expanded on the original inspirational essays.

A far less noteworthy achievement of the book is that it also inspired me to write about how:

All I Really Need To Know About Data Quality I Learned in Kindergarten

Show And Tell

I loved show and tell.  An opportunity to deliver an interactive presentation that encouraged audience participation.  No PowerPoint slides.  No podium.  No power suit.  Just me wearing the dorky clothes my parents bought me, standing right in front of the class, waving my Millennium Falcon over my head and explaining that "traveling through hyper-space ain't like dustin' crops, boy" while my classmates (and my teacher) were laughing so hard many of them fell out of their seats.  My show and tell made it clear that if you came over to my house after school to play, then you knew exactly what to expect - a geek who loved Star Wars - perhaps a little too much. 

When you present the business case for your data quality initiative to executive management and other corporate stakeholders, remember the lessons of show and tell.  Poor data quality is not a theoretical problem - it is a real business problem that negatively impacts the quality of decision critical enterprise information.  Your presentation should make it clear that if the data quality initiative doesn't get approved, then everyone will know exactly what to expect:

"Poor data quality is the path to the dark side. 

Poor data quality leads to bad business decisions. 

Bad business decisions lead to lost revenue. 

Lost revenue leads to suffering."

The Five Second Rule

If you drop your snack on the floor, then as long as you pick it up within five seconds you can safely eat it.  When you have poor quality data in your enterprise systems, you do have more than five seconds to do something about it.  However, the longer poor quality data goes without remediation, the more likely it will negatively impact critical business decisions.  Don't let your data become the "smelly kid" in class.  No one likes to share their snacks with the smelly kid.  And no one trusts information derived from "smelly data."

 

When You Make A Mistake, Say You're Sorry

Nobody's perfect.  We all have bad days.  We all occasionally say and do stupid things.  When you make a mistake, own up to it and apologize for it.  You don't want to have to wear the dunce cap or stand in the corner for a time-out.  And don't be too hard on your friend who had to wear the dunce cap today.  It was simply their turn to make a mistake.  It will probably be your turn tomorrow.  They had to say they were sorry.  You also have to forgive them.  Who else is going to share their cookies with you when your mom once again packs carrots as your snack?

 

Learn Something New Every Day

We didn't stop learning after we "graduated" from kindergarten, did we?  We are all proud of our education, knowledge, understanding, and experience.  It may be true that experience is the path that separates knowledge from wisdom.  However, we must remain open to learning new things.  Socrates taught us that "the only true wisdom consists in knowing that you know nothing."  I bet Socrates headlined the story time circuit in the kindergartens of Ancient Greece.

 

Hold Hands And Stick Together

I remember going on numerous field trips in kindergarten.  We would visit museums, zoos and amusement parks.  Wherever we went, our teacher would always have us form an interconnected group by holding the hand of the person in front of you and the person behind you.  We were told to stick together and look out for one another.  This important lesson is also applicable to data quality initiatives.  Teamwork and collaboration are essential for success.  Remember that you are all in this together.

 

What did you learn about data quality in kindergarten?

A Portrait of the Data Quality Expert as a Young Idiot

Once upon a time (and a very good time it was), there was a young data quality consultant who fancied himself an expert.

 

He went from client to client and project to project, all along espousing his expertise.  He believed he was smarter than everyone else.  He didn't listen well - he simply waited for his turn to speak.  He didn't foster open communication without bias - he believed his ideas were the only ones of value.  He didn't seek mutual understanding on difficult issues - he bullied people until he got his way.  He didn't believe in the importance of the people involved in the project - he believed the project would be successful with or without them.

 

He was certain he was always right.

 

And he failed - many, many times.

 

In his excellent book How We Decide, Jonah Lehrer advocates paying attention to your inner disagreements, becoming a student of your own errors, and avoiding the trap of certainty.  When you are certain that you're right, you stop considering the possibility that you might be wrong.

 

James Joyce wrote that "mistakes are the portals of discovery" and T.S. Eliot wrote that "we must not cease from exploration and the end of all our exploring will be to arrive where we began and to know the place for the first time."

 

Once upon a time, there was a young data quality consultant who realized he was an idiot - and a very good time it was.

Enterprise Data World 2009

Formerly known as the DAMA International Symposium and Wilshire MetaData Conference, Enterprise Data World 2009 was held April 5-9 in Tampa, Florida at the Tampa Convention Center.

 

Enterprise Data World is the business world’s most comprehensive vendor-neutral educational event about data and information management.  This year’s program was bigger than ever before, with more sessions, more case studies, and more can’t-miss content.  With 200 hours of in-depth tutorials, hands-on workshops, practical sessions and insightful keynotes, the conference was a tremendous success.  Congratulations and thanks to Tony Shaw, Maya Stosskopf and the entire Wilshire staff.

 

I attended Enterprise Data World 2009 as a member of the Iowa Chapter of DAMA and as a Data Quality Journalist for the International Association for Information and Data Quality (IAIDQ).

I used Twitter to provide live reporting from the sessions that I was attending.

I wish that I could have attended every session, but here are some highlights from ten of my favorites:

 

8 Ways Data is Changing Everything

Keynote by Stephen Baker from BusinessWeek

His article Math Will Rock Your World inspired his excellent book The Numerati.  Additionally, check out his blog: Blogspotting.

Quotes from the keynote:

  • "Data is changing how we understand ourselves and how we understand our world"
  • "Predictive data mining is about the mathematical modeling of humanity"
  • "Anthropologists are looking at social networking (e.g. Twitter, Facebook) to understand the science of friendship"

 

Master Data Management: Proven Architectures, Products and Best Practices

Tutorial by David Loshin from Knowledge Integrity.

Included material from his excellent book Master Data Management.  Additionally, check out his blog: David Loshin.

Quotes from the tutorial:

  • "Master Data are the core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies"
  • "Master Data Management (MDM) provides a unified view of core data subject areas (e.g. Customers, Products)"
  • "With MDM, it is important not to over-invest and under-implement - invest in and implement only what you need"

 

Master Data Management: Ignore the Hype and Keep the Focus on Data

Case Study by Tony Fisher from DataFlux and Jeff Grayson from Equinox Fitness.

Quotes from the case study:

  • "The most important thing about Master Data Management (MDM) is improving business processes"
  • "80% of any enterprise implementation should be the testing phase"
  • "MDM Data Quality (DQ) Challenge: Any % wrong means you’re 100% certain you’re not always right"
  • "MDM DQ Solution: Re-design applications to ensure the ‘front-door’ protects data quality"
  • "Technology is critical, however thinking through the operational processes is more important"

 

A Case of Usage: Working with Use Cases on Data-Centric Projects

Case Study by Susan Burk from IBM.

Quotes from the case study:

  • "Use Case is a sequence of actions performed to yield a result of observable business value"
  • "The primary focus of data-centric projects is data structure, data delivery and data quality"
  • "Don’t like use cases? – ok, call them business acceptance criteria – because that’s what a use case is"

 

Crowdsourcing: People are Smart, When Computers are Not

Session by Sharon Chiarella from Amazon Web Services.

Quotes from the session:

  • "Crowdsourcing is outsourcing a task typically performed by employees to a general community of people"
  • "Crowdsourcing eliminates over-staffing, lowers costs and reduces work turnaround time"
  • "An excellent example of crowdsourcing is open source software development (e.g. Linux)"

 

Improving Information Quality using Lean Six Sigma Methodology

Session by Atul Borkar and Guillermo Rueda from Intel.

Quotes from the session:

  • "Information Quality requires a structured methodology in order to be successful"
  • Lean Six Sigma Framework: DMAIC – Define, Measure, Analyze, Improve, Control:
    • Define = Describe the challenge, goal, process and customer requirements
    • Measure = Gather data about the challenge and the process
    • Analyze = Use hypothesis and data to find root causes
    • Improve = Develop, implement and refine solutions
    • Control = Plan for stability and measurement

 

Universal Data Quality: The Key to Deriving Business Value from Corporate Data

Session by Stefanos Damianakis from Netrics.

Quotes from the session:

  • "The information stored in databases is NEVER perfect, consistent and complete – and it never can be!"
  • "Gartner reports that 25% of critical data within large businesses is somehow inaccurate or incomplete"
  • "Gartner reports that 50% of implementations fail due to lack of attention to data quality issues"
  • "A powerful approach to data matching is the mathematical modeling of human decision making"
  • "The greatest advantage of mathematical modeling is that there are no data matching rules to build and maintain"

 

Defining a Balanced Scorecard for Data Management

Seminar by C. Lwanga Yonke, a founding member of the International Association for Information and Data Quality (IAIDQ).

Quotes from the seminar:

  • "Entering the same data multiple times is like paying the same invoice multiple times"
  • "Good metrics help start conversations and turn strategy into action"
  • Good metrics have the following characteristics:
    • Business Relevance
    • Clarity of Definition
    • Trending Capability (i.e. metric can be tracked over time)
    • Easy to aggregate and roll-up to a summary
    • Easy to drill-down to the details that comprised the measurement

 

Closing Panel: Data Management’s Next Big Thing!

Quotes from Panelist Peter Aiken from Data Blueprint:

  • Capability Maturity Levels:
    1. Initial
    2. Repeatable
    3. Defined
    4. Managed
    5. Optimized
  • "Most companies are at a capability maturity level of (1) Initial or (2) Repeatable"
  • "Data should be treated as a durable asset"

Quotes from Panelist Noreen Kendle from Burton Group:

  • "A new age for data and data management is on the horizon – a perfect storm is coming"
  • "The perfect storm is being caused by massive data growth and software as a service (i.e. cloud computing)"
  • "Always remember that you can make lemonade from lemons – the bad in life can be turned into something good"

Quotes from Panelist Karen Lopez from InfoAdvisors:

  • "If you keep using the same recipe, then you keep getting the same results"
  • "Our biggest problem is not technical in nature - we simply need to share our knowledge"
  • "Don’t be a dinosaur! Adopt a ‘go with what is’ philosophy and embrace the future!"

Quotes from Panelist Eric Miller from Zepheira:

  • "Applications should not be ON The Web, but OF The Web"
  • "New Acronym: LED – Linked Enterprise Data"
  • "Semantic Web is the HTML of DATA"

Quotes from Panelist Daniel Moody from University of Twente:

  • "Unified Modeling Language (UML) was the last big thing in software engineering"
  • "The next big thing will be ArchiMate, which is a unified language for enterprise architecture modeling"

 

Mark Your Calendar

Enterprise Data World 2010 will take place in San Francisco, California at the Hilton San Francisco on March 14-18, 2010.