March 24, 2022

Baseball Data Analysis Challenge

March 24, 2022/ Jim Harris

Calling all data analysts, machine learning engineers, and data scientists!

I am working on building some demos and tutorials for machine learning. Of course, I will be sharing everything I do on GitHub. I thought it would be fun to share my input data with all of you before I start and make a little challenge out of this. While not as exciting or lucrative as a Kaggle competition, please feel free to have at it and use whatever techniques and tools you would like to discover any insights and/or make any predictions (even if you do not know anything about baseball).

The input data for this challenge represents 6 years (2016-2021) of Boston Red Sox Major League Baseball (MLB) regular season baseball game results, including a Game_Result column, labeled either 0 or 1, where 0 = Loss and 1 = Win.

The input data for this challenge is available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_input.csv

The data profiling results for the input data are available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_profile.csv

The raw data used in this challenge was collected via a paid subscription to: https://stathead.com/baseball/
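If you want a starting point, here is a minimal Python sketch that computes the overall win rate from the Game_Result column. The inline sample rows and the Runs column name are made up for illustration; only Game_Result and its 0/1 encoding come from the description above. To try it on the real data, pass an open handle to a downloaded copy of the CSV file instead of the sample text.

```python
import csv
import io

# Made-up sample rows for illustration only; "Runs" is a hypothetical column name.
# Only Game_Result (0 = Loss, 1 = Win) comes from the challenge description.
SAMPLE_CSV = """Runs,Game_Result
5,1
2,0
7,1
3,0
4,1
"""


def win_rate(csv_file):
    """Return (wins, games, win rate) from a file-like object of CSV text."""
    wins = 0
    games = 0
    for row in csv.DictReader(csv_file):
        games += 1
        wins += int(row["Game_Result"])
    rate = wins / games if games else 0.0
    return wins, games, rate


wins, games, rate = win_rate(io.StringIO(SAMPLE_CSV))
print(f"{wins} wins in {games} games ({rate:.1%})")
```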

Update for 2022 MLB Opening Day

I completed my initial work in time for the opening day of the 2022 MLB season, the results of which you can find in this Microsoft Excel file: Baseball Data Analysis Challenge 2022-04-05.xlsx. My baseball data analysis was performed using my employer’s (Vertica) in-database machine learning capabilities, and you can find my SQL scripts on GitHub.

I used logistic regression classification models to calculate win probabilities for the Red Sox across nine (9) game metrics: opponent, opponent’s division, month of year, day of week, runs scored, hits, extra base hits, home runs, and walks versus strikeouts. I also used the input data to train a Naïve Bayes classification model to predict wins and losses with an associated probability based on the runs scored, hits, extra base hits, home runs, and walks versus strikeouts game metrics (all of which are binned ranges of input data values). Its initial accuracy is only 77%, but I plan on making some adjustments. I also plan on using the 2022 baseball season as my test data. So not only will I be watching how many games the Red Sox win or lose this season, but I will also be watching how many games my machine learning model predicts correctly.
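To make the Naïve Bayes idea concrete, here is a minimal from-scratch sketch in Python. It is not the Vertica in-database implementation mentioned above: the feature names, bin labels, and training rows are made up for illustration, and the add-one (Laplace) smoothing is an assumption of this sketch rather than a detail taken from my SQL scripts.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class totals and per-feature bin frequencies for each class."""
    class_counts = Counter(labels)
    # bin_counts[label][feature][bin] -> number of training rows
    bin_counts = defaultdict(lambda: defaultdict(Counter))
    bins_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for feature, bin_label in row.items():
            bin_counts[label][feature][bin_label] += 1
            bins_seen[feature].add(bin_label)
    return class_counts, bin_counts, bins_seen


def predict(model, row):
    """Return (predicted label, probability) using add-one smoothing."""
    class_counts, bin_counts, bins_seen = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior
        for feature, bin_label in row.items():
            seen = bin_counts[label][feature][bin_label]
            score += math.log((seen + 1) / (count + len(bins_seen[feature])))
        log_scores[label] = score
    # Turn the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


# Hypothetical binned game metrics; 1 = Win, 0 = Loss.
games = [
    {"runs": "6+", "home_runs": "1+"},
    {"runs": "6+", "home_runs": "0"},
    {"runs": "3-5", "home_runs": "1+"},
    {"runs": "3-5", "home_runs": "0"},
    {"runs": "0-2", "home_runs": "0"},
    {"runs": "0-2", "home_runs": "0"},
]
results = [1, 1, 1, 0, 0, 0]

model = train(games, results)
label, probability = predict(model, {"runs": "6+", "home_runs": "1+"})
print(label, round(probability, 2))
```

Because every input is a binned range, the model only needs counts of how often each bin appears in wins and in losses, which is why this approach is easy to express in plain SQL as well.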

Think you can best my model? Game on! The baseball data analysis challenge continues. Play ball!

March 09, 2022

OCDQ Radio on Big Data and Data Science

March 09, 2022/ Jim Harris

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

This podcast is no longer an active project, meaning not only do I rarely publish a new episode, but its episodes are only available to listen to on this website and no longer distributed on platforms such as Apple Podcasts and Google Podcasts.

I have been enjoying listening to many of the old episodes, and I was happy to hear how evergreen they are: their content is still applicable today. This post is part of my Best of OCDQ Radio series, organizing groups of episodes by topic(s).

Podcast Episodes on Big Data and Data Science

March 05, 2022

OCDQ Radio on Data Governance

March 05, 2022/ Jim Harris

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Podcast Episodes on Data Governance

March 01, 2022

OCDQ Radio on Data Quality

March 01, 2022/ Jim Harris

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Podcast Episodes on Data Quality

June 01, 2021

Why No One Cares about Poor Data Quality

June 01, 2021/ Jim Harris

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Why does no one care about poor data quality? Because you’re probably measuring data quality without connecting it to your organization’s business processes, applications, or other business uses for enterprise data.

During this episode, I discuss how that connection is made by implementing a data governance policy as an executable process: a combination of business rules and data rules that create and track meaningful data quality metrics, framed within a relative business context and associated with a data quality threshold (i.e., tolerance for poor data quality). Each business use for enterprise data should be governed by its own policy. Compliance with these data governance policies aligns data quality with business insight, providing the missing link between poor data quality and poor business performance. And it is then, and only then, that anyone cares about poor data quality.
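As a rough illustration of what an executable policy could look like, here is a minimal Python sketch. The Policy class, the invoicing example, its two rules, and the 75% threshold are all hypothetical; the point is only that a policy bundles rules, a metric, and a threshold for one specific business use.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class Policy:
    """A data governance policy for one business use of the data."""
    business_use: str
    rules: List[Callable[[Record], bool]]  # business rules and data rules
    threshold: float  # minimum acceptable share of passing records

    def metric(self, records: List[Record]) -> float:
        """Share of records that pass every rule (the data quality metric)."""
        if not records:
            return 1.0
        passing = sum(all(rule(r) for rule in self.rules) for r in records)
        return passing / len(records)

    def is_compliant(self, records: List[Record]) -> bool:
        """Compare the metric against the tolerance for poor data quality."""
        return self.metric(records) >= self.threshold


# Hypothetical policy: invoices need an email address (data rule)
# and a positive amount (business rule).
invoicing = Policy(
    business_use="customer invoicing",
    rules=[lambda r: bool(r["email"]), lambda r: r["amount"] > 0],
    threshold=0.75,
)

records = [
    {"email": "a@example.com", "amount": 120.0},
    {"email": "", "amount": 75.0},
    {"email": "c@example.com", "amount": 40.0},
    {"email": "d@example.com", "amount": -5.0},
]

score = invoicing.metric(records)
print(invoicing.business_use, score, invoicing.is_compliant(records))
```

A different business use of the same records, say marketing outreach, would get its own Policy with its own rules and threshold.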

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.

Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.

Gaining a Competitive Advantage with Data — Guest William McKnight discusses some of the practical, hands-on guidance provided by his book Information Management: Strategies for Gaining a Competitive Advantage with Data.

Doing Data Governance — Guest John Ladley discusses his book How to Design, Deploy and Sustain Data Governance and how to understand the difference and relationship between data governance and enterprise information management.

Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).

Measuring Data Quality for Ongoing Improvement — Guest Laura Sebastian-Coleman discusses bringing together a better understanding of what is represented in data with the expectations for use in order to improve the overall quality of data.

The Blue Box of Information Quality — Guest Daragh O Brien on why Information Quality is bigger on the inside, using stories as an analytical tool and change management technique, and why we must never forget that “people are cool.”

Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.

Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.

The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.

The Art of Data Matching — Guest Henrik Liliendahl Sørensen discusses data matching concepts and practices, including different match techniques, candidate selection, presentation of match results, and business applications of data matching.

Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

January 01, 2019

Data Quality and Chicken Little Syndrome

January 01, 2019/ Jim Harris

“The sky is falling!” exclaimed Chicken Little after an acorn fell on his head, causing him to undertake a journey to tell the King that the world is coming to an end. So says the folk tale that became an allegory for people accused of being unreasonably afraid, or people trying to incite an unreasonable fear in those around them, sometimes referred to as Chicken Little Syndrome.

The sales pitches for data quality solutions often suffer from Chicken Little Syndrome, when vendors and consultants, instead of trying to sell the business benefits of data quality, focus too much on the negative aspects of not investing in data quality, and try scaring people into prioritizing data quality initiatives by exclaiming “your company is failing because your data quality is bad!”

The Chicken Littles of Data Quality use sound bites like “data quality problems cost businesses more than $600 billion a year!” or “poor data quality costs organizations 35% of their revenue!” However, the most common characteristic of these fear mongering estimates about the costs of poor data quality is that, upon closer examination, most of them either rely on anecdotal evidence, or hide behind the curtain of an allegedly proprietary case study, the details of which conveniently can’t be publicly disclosed.

Lacking a tangible estimate for the cost of poor data quality often complicates building the business case for data quality. Even though a data quality initiative has the long-term potential of reducing the costs, and mitigating the risks, associated with poor data quality, its initial costs are very tangible. For example, the short-term increased costs of a data quality initiative can include the purchase of data quality software, and the professional services needed for training and consulting to support installation, configuration, application development, testing, and production implementation. When considering these short-term costs, and especially when lacking a tangible estimate for the cost of poor data quality, many organizations understandably conclude that it’s less risky to gamble on not investing in a data quality initiative and hope things are just not as bad as Chicken Little claims.

“The sky isn’t falling on us.”

Furthermore, the reason that citing specific examples of poor data quality also doesn’t work very well is not just because of the lack of a verifiable estimate for the associated business costs. Another significant contributing factor is that people naturally dismiss the possibility that something bad that happened to someone else could also happen to them.

So, when Chicken Little undertakes a journey to tell the CEO that the organization is coming to an end due to poor data quality, exclaiming that “the sky is falling!” while citing one of those data quality disaster stories that befell another organization, should we really be surprised when the CEO looks up, scratches their head, and declares that “the sky isn’t falling on us”?

Sometimes, denying the existence of data quality issues is a natural self-defense mechanism for the people responsible for the business processes and technology surrounding data since nobody wants to be blamed for causing, or failing to fix, data quality issues. Other times, people suffer from the illusion-of-quality effect caused by the dark side of data cleansing. In other words, they don’t believe that data quality issues occur very often because the data made available to end users in dashboards and reports often passes through many processes that cleanse or otherwise sanitize the data before it reaches them.

Can we stop Playing Chicken with Data Quality?

Most of the time, advocating for data quality feels like we are playing chicken with executive sponsors and business stakeholders, as if we were driving toward them at full speed on a collision course, armed with fear mongering and disaster stories, hoping that they swerve in the direction of approving a data quality initiative. But there has to be a better way to advocate for data quality other than constantly exclaiming that “the sky is falling!” (Don’t cry fowl — I realize that I just mixed my chicken metaphors.)

September 05, 2018

What an Old Dictionary teaches us about Metadata

September 05, 2018/ Jim Harris

Spelling, pronunciation, and examples of usage are included in the dictionary definition of a word, which is a good example of one of the many uses of metadata, namely to provide a definition, description, and context for data.

The dictionary that has been on my desk for over 15 years is a good metaphor for the challenges of metadata management.

When I first bought the dictionary, it was, as its front cover attested, “The Newest. The Best. A Trusted Authority. A brand-new dictionary of the 1990s, for the 1990s. Comprehensive coverage of current words and terms, with clear, understandable definitions and up-to-the-minute usage guidance.”

And its back cover boasted of “60,000 entries assembled by a state-of-the-art authority using the most modern sources of information, and prepared by lexicographic experts to provide the one-stop reference book to turn to for all of your word questions.” (However, if one of your word questions was about metadata you were out of luck because it didn’t have an entry for it.)

The multidimensionality of metadata is exemplified by how a dictionary rarely contains a single definition for a word, and an old dictionary exemplifies how constantly changing semantics further complicate metadata management.

Using an old dictionary has several downsides: new words would not be in it, and some existing words would have either new definitions or an updated definition order based on the predominant context of current usage.

Organizations face a similar challenge while trying to maintain a metadata dictionary containing comprehensive coverage of business and technical terminology. Ideally, a metadata dictionary provides clear, understandable definitions and usage guidance prepared by subject matter experts, making it a trusted authority and one-stop reference to turn to for all your data questions.

At least, that’s the theory. In practice, I haven’t encountered a metadata dictionary that could deliver on that promise.

And just as there are many dictionary publishers (e.g., Houghton Mifflin Harcourt, Merriam-Webster, Oxford University Press), as well as numerous online dictionaries (e.g., Collins, Urban, Wiktionary), there’s often more than one metadata dictionary within every organization as well. In fact, sometimes the organization has just as many metadata silos as it does data silos.

An old dictionary reminds us that language — and especially its everyday usage — evolves. An old dictionary also teaches us that metadata — and especially the data it defines, describes, and provides a context for — evolves as well. Which is probably why doing metadata management well is not, well, something that just automagically happens.

June 15, 2018

Beware the Data Governance Ides of March

June 15, 2018/ Jim Harris


Morte de Césare (Death of Caesar) by Vincenzo Camuccini, 1798

Today is the Ides of March (March 15), which back in 44 BC was definitely not a good day to be Julius Caesar. He was literally stabbed in the back by the Roman Senate during his assassination in the Theatre of Pompey (as depicted above). The assassination was spearheaded by Brutus and Cassius in a failed attempt to restore the Roman Republic, but instead resulted in a series of civil wars that ultimately led to the establishment of the permanent Roman Empire by Caesar’s heir Octavius (aka Caesar Augustus).

“Beware the Ides of March” is the famously dramatized warning from William Shakespeare’s play Julius Caesar, which has me pondering whether a data governance program implementation has an Ides of March (albeit a less dramatic one—hopefully).

Hybrid Approach (starting Top-Down) is currently leading my unscientific poll about the best way to approach data governance, acknowledging executive sponsorship and a data governance board will be required for the top-down-driven activities of funding, policy making and enforcement, decision rights, and arbitration of conflicting business priorities as well as organizational politics.

The definition of data governance policies illustrates the intersection of business, data, and technical knowledge spread throughout the organization, revealing how interconnected and interdependent the organization is. The policies provide a framework for the communication and collaboration of business, data, and technical stakeholders, and establish an enterprise-wide understanding of the roles and responsibilities involved, and the accountability required to support the organization’s daily business activities.

The process of defining data governance policies resembles the communication and collaboration of the Roman Republic, but the process of implementing and enforcing data governance policies resembles the command and control of the Roman Empire.

During this transition of power, from policy definition to policy implementation and enforcement, lies the greatest challenge for a data governance program. Even though no executive sponsor is the Data Governance Emperor (not even Caesar CEO) and the data governance board is not the Data Governance Senate, a heavy-handed top-down approach to data governance can make policy compliance feel like imperial rule and policy enforcement feel like martial law. Although a series of enterprise civil wars is unlikely to result, the data governance program is likely to fail without the support of a strong and stable bottom-up foundation.

The enforcement of data governance policies is often confused with traditional management notions of command and control, but the enduring success of data governance requires an organizational culture that embodies communication and collaboration, which is mostly facilitated by bottom-up-driven activities led by the example of data stewards and other peer-level change agents.

“Beware the Data Governance Ides of March” is my dramatized warning about relying too much on the top-down approach to implementing data governance—and especially if your organization has any data stewards named Brutus or Cassius.

March 01, 2018

Plato’s Data

March 01, 2018/ Jim Harris

Plato’s Cave is a famous allegory from philosophy that describes a fictional scenario where people mistake an illusion for reality.

The allegory describes a group of people who have lived their whole lives as prisoners chained motionless in a dark cave, forced to face a blank wall. Behind the prisoners is a large fire. In front of the fire are puppeteers that project shadows onto the cave wall, acting out little plays, which include mimicking voices and sound effects that echo off the cave walls. These shadows and echoes are only projections, partial reflections of a reality created by the puppeteers. However, this illusion represents the only reality the prisoners have ever known, and so to them the shadows are real sights and the echoes are real sounds.

When one of the prisoners is freed and permitted to turn around and see the source of the shadows and echoes, he rejects reality as an illusion. The prisoner is then dragged out of the cave into the sunlight, out into the bright, painful light of the real world, which he also rejects as an illusion. How could these sights and sounds be real to him when all he has ever known is the cave?

But eventually the prisoner acclimates to the real world, realizing that the real illusion was the shadows and echoes in the cave.

Unfortunately, this is when he’s returned to his imprisonment in the cave. Can you imagine how painful the rest of his life will be, once again being forced to watch the shadows and listen to the echoes, except now he knows that they are not real?

Plato’s Cinema

A modern update on the allegory is something we could call Plato’s Cinema, where a group of people live their whole lives as prisoners chained motionless in a dark cinema, forced to face a blank screen. Behind the audience is a large movie projector.

Please stop reading for a moment and try to imagine if everything you ever knew was based entirely on the movies you watched.

Now imagine you are one of the prisoners, and you did not get to choose the movies, but instead were forced to watch whatever the projectionist chooses to show you. Although the fictional characters and stories of these movies are only projections, partial reflections of a reality created by the movie producers, since this illusion would represent the only reality you have ever known, to you the characters would be real people and the stories would be real events.

If you were freed from this cinema prison, permitted to turn around and see the projector, wouldn’t you reject it as an illusion? If you were dragged out of the cinema into the sunlight, out into the bright, painful light of the real world, wouldn’t you also reject reality as an illusion? How could these sights and sounds be real to you when all you have ever known is the cinema?

Let’s say that you eventually acclimated to the real world, realizing that the real illusion was the projections on the movie screen.

However, now let’s imagine that you are then returned to your imprisonment in the dark cinema. Can you imagine how painful the rest of your life would be, once again being forced to watch the movies, except now you know that they are not real?

Plato’s Data

Whether it’s an abstract description of real-world entities (i.e., “master data”) or an abstract description of real-world interactions (i.e., “transaction data”) among entities, data is an abstract description of reality — let’s call this the allegory of Plato’s Data.

We often act as if we are being forced to face our computer screen, upon which data tells us a story about the real world that is just as enticing as the flickering shadows on the wall of Plato’s Cave, or the mesmerizing movies projected in Plato’s Cinema.

Data shapes our perception of the real world, but sometimes we forget that data is only a partial reflection of reality.

I am sure that it sounds silly to point out something so obvious, but imagine if, before you were freed, the other prisoners, in either the cave or the cinema, tried to convince you that the shadows or the movies weren’t real. Or imagine you’re the prisoner returning to either the cave or the cinema. How would you convince other prisoners that you’ve seen the true nature of reality?

A common question about Plato’s Cave is whether it’s crueler to show the prisoner the real world, or to return the prisoner to the cave after he has seen it. Much like the illusions of the cave and the cinema, data makes more sense the more we believe it is real.

However, with data, neither breaking the illusion nor returning ourselves to it is cruel, but is instead a necessary practice because it’s important to occasionally remind ourselves that data and the real world are not the same thing.

July 01, 2016

Data Governance Frameworks are like Jigsaw Puzzles

July 01, 2016/ Jim Harris

In a recent interview, Jill Dyché explained a common misconception, namely that a data governance framework is a strategy. “Unlike other strategic initiatives that involve IT,” Jill explained, “data governance needs to be designed. The cultural factors, the workflow factors, the organizational structure, the ownership, the political factors, all need to be accounted for when you are designing a data governance roadmap.”

“People need a mental model, that is why everybody loves frameworks,” Jill continued. “But they are not enough and I think the mistake that people make is that once they see a framework, rather than understanding its relevance to their organization, they will just adapt it and plaster it up on the whiteboard and show executives without any kind of context. So they are already defeating the purpose of data governance, which is to make it work within the context of your business problems, not just have some kind of mental model that everybody can agree on, but is not really the basis for execution.”

“So it’s a really, really dangerous trend,” Jill cautioned, “that we see where people equate strategy with framework because strategy is really a series of collected actions that result in some execution — and that is exactly what data governance is.”

And in her excellent article Data Governance Next Practices: The 5 + 2 Model, Jill explained that data governance requires a deliberate design so that the entire organization can buy into a realistic execution plan, not just a sound bite. As usual, I agree with Jill, since, in my experience, many people expect a data governance framework to provide eureka-like moments of insight.

In The Myths of Innovation, Scott Berkun debunked the myth of the eureka moment using the metaphor of a jigsaw puzzle.

“When you put the last piece into place, is there anything special about that last piece or what you were wearing when you put it in?” Berkun asked. “The only reason that last piece is significant is because of the other pieces you’d already put into place. If you jumbled up the pieces a second time, any one of them could turn out to be the last, magical piece.”

“The magic feeling at the moment of insight, when the last piece falls into place,” Berkun explained, “is the reward for many hours (or years) of investment coming together. In comparison to the simple action of fitting the puzzle piece into place, we feel the larger collective payoff of hundreds of pieces’ worth of work.”

Perhaps the myth of the data governance framework could also be debunked using the metaphor of a jigsaw puzzle.

Data governance requires the coordination of a complex combination of a myriad of factors, including executive sponsorship, funding, decision rights, arbitration of conflicting priorities, policy definition, policy implementation, data quality remediation, data stewardship, business process optimization, technology enablement, change management — and many other puzzle pieces.

How could a data governance framework possibly predict how you will assemble the puzzle pieces? Or how the puzzle pieces will fit together within your unique corporate culture? Or which of the many aspects of data governance will turn out to be the last (or even the first) piece of the puzzle to fall into place in your organization? And, of course, there is truly no last piece of the puzzle, since data governance is an ongoing program because the business world constantly gets jumbled up by change.

So, data governance frameworks are useful, but only if you realize that data governance frameworks are like jigsaw puzzles.

January 01, 2016

Data Quality in Six Verbs

January 01, 2016/ Jim Harris

Once upon a time, when asked on Twitter to identify a list of critical topics for data quality practitioners, I gave a pithy response (with only 140 characters in a tweet, pithy is as good as it gets). Since I prefer emphasizing the need to take action, I proposed six critical verbs: Investigate, Communicate, Collaborate, Remediate, Inebriate, and Reiterate.

Lest my pith be misunderstood aplenty, this blog post provides more detail, plus links to related posts, about what I meant.

1 — Investigate

Data quality is not exactly a riddle wrapped in a mystery inside an enigma. However, understanding your data is essential to using it effectively and improving its quality. Therefore, the first thing you must do is investigate.

So, grab your favorite (preferably highly caffeinated) beverage, get settled into your comfy chair, roll up your sleeves, and start analyzing that data. Data profiling tools can be very helpful with raw data analysis.

However, data profiling is elementary, my dear reader. In order for you to make sense of those data elements, you require business context. This means you must also go talk with data’s best friends—its stewards, analysts, and subject matter experts.
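For readers who have not used a data profiling tool, here is a minimal Python sketch of the kind of column summary one produces. The profile function and the sample customer rows are hypothetical, and real tools report far more (patterns, ranges, data types), so treat this as a toy version of the idea.

```python
from collections import Counter


def profile(rows):
    """Summarize each column: missing values, distinct values, most common value."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    report = {}
    for name, values in columns.items():
        present = [v for v in values if v not in ("", None)]
        counts = Counter(present)
        report[name] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0][0] if counts else None,
        }
    return report


# Made-up customer rows for illustration only.
rows = [
    {"state": "MA", "zip": "02108"},
    {"state": "ma", "zip": "02108"},
    {"state": "MA", "zip": ""},
    {"state": "Mass.", "zip": "2108"},
]

report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

The summary shows that the state column has three distinct spellings, but it cannot say which spelling the business expects. That answer comes from the stewards, analysts, and subject matter experts.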

Six blog posts related to Investigate:

2 — Communicate

After you have completed your preliminary investigation, the next thing you must do is communicate your findings, which helps improve everyone’s understanding of how data is being used, verify data’s business relevancy, and prioritize critical issues.

Keep in mind that communication is mostly about listening. Also, be prepared to face “data denial” whenever data quality is discussed. This is a natural self-defense mechanism for the people responsible for business processes, technology, and data, which is understandable because nobody likes to be blamed (or feel blamed) for causing or failing to fix data quality problems.

No matter how uncomfortable these discussions may be at times, they are essential to evaluating the potential ROI of data quality improvements, defining data quality standards, and most importantly, providing a working definition of success.

Six blog posts related to Communicate:

3 — Collaborate

After you have investigated and communicated, now you must rally the team that will work together to improve the quality of your data. A cross-disciplinary team will be needed because data quality is neither a business nor a technical issue—it is both.

Therefore, you will need the collaborative effort of business and technical folks. The business folks usually own the data, or at least the business processes that create it, so they understand its meaning and daily use. The technical folks usually own the hardware and software comprising your data architecture. Both sets of folks must realize they are all “one company folk” that must collaborate in order to be successful.

No, you don’t need a folk singer, but you may need an executive sponsor. The need for collaboration might sound rather simple, but as one of my favorite folk singers taught me, sometimes the hardest thing to learn is the least complicated.

Six blog posts related to Collaborate:

4 — Remediate

Resolving data quality issues requires a combination of data cleansing and defect prevention. Data cleansing is reactive and its common (and deserved) criticism is that it essentially treats the symptoms without curing the disease.

Defect prevention is proactive and, through root cause analysis and process improvements, it essentially is the cure for the quality ills that ail your data. However, a data governance framework is often necessary for defect prevention to be successful. As are patience and understanding, since it will require a strategic organizational transformation that doesn’t happen overnight.

The unavoidable reality is that data cleansing is used to correct today’s problems while defect prevention is busy building a better tomorrow for your organization. Fundamentally, data quality requires a hybrid discipline that combines data cleansing and defect prevention into an enterprise-wide best practice.
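The difference between the two can be sketched in a few lines of Python. Everything here is hypothetical (the state codes, the mapping of known fixes, and the function names); the sketch only contrasts correcting values that are already stored with rejecting bad values at the point of entry.

```python
VALID_STATES = {"MA", "NH", "RI"}  # hypothetical reference data
KNOWN_FIXES = {"ma": "MA", "Mass.": "MA", "mass": "MA"}  # hypothetical mapping


def cleanse(records):
    """Reactive: correct values that are already stored."""
    cleaned = []
    for record in records:
        state = record["state"].strip()
        cleaned.append({**record, "state": KNOWN_FIXES.get(state, state)})
    return cleaned


def accept(record):
    """Proactive: reject a bad value before it is ever stored."""
    if record["state"] not in VALID_STATES:
        raise ValueError(f"unknown state code: {record['state']!r}")
    return record


stored = [
    {"id": 1, "state": "ma"},
    {"id": 2, "state": "Mass."},
    {"id": 3, "state": "NH"},
]
cleansed_states = [r["state"] for r in cleanse(stored)]
print(cleansed_states)

try:
    accept({"id": 4, "state": "Mass."})
    rejected = False
except ValueError as error:
    rejected = True
    print("rejected:", error)
```

Cleansing fixes today's stored records, while the entry check keeps tomorrow's records from needing the same fix.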

Six blog posts related to Remediate:

5 — Inebriate

I am not necessarily advocating that kind of inebriation. Instead, think Emily Dickinson (i.e., “Inebriate of air am I” – it’s a line from a poem about happiness that, yes, also happens to make a good drinking song).

My point is that you must not only celebrate your successes, but celebrate them quite publicly. Channel yet another poet (Walt Whitman) and sound your barbaric yawp over the cubicles of your company: “We just improved the quality of our data!”

Of course, you will need to be more specific. Declare success using words illustrating the business impact of your achievements, such as mitigated risks, reduced costs, or increased revenues — those three are always guaranteed executive crowd pleasers.

Six blog posts related to Inebriate:

6 — Reiterate

Like the legend of the phoenix, the end is also a new beginning. Therefore, don’t get too inebriated, since you are not celebrating the end of your efforts. Your data quality journey has only just begun. Your continuous monitoring must continue and your ongoing improvements must remain ongoing. Which is why, despite the tension this reality, and this bad grammatical pun, might cause you, always remember that the tense of all six of these verbs is future continuous.

What Say You?

Please let me know what you think, pithy or otherwise, by posting a comment below. And feel free to use more than six verbs.

December 25, 2015

Finding Data Quality

December 25, 2015/ Jim Harris

Have you ever experienced that sinking feeling, where you sense if you don’t find data quality, then data quality will find you?

In the spring of 2003, Pixar Animation Studios produced one of my all-time favorite Walt Disney Pictures—Finding Nemo.

This blog post is an homage not only to the film, but also to the critically important role in which data quality is cast within all of your enterprise information initiatives, including business intelligence, master data management, and data governance.

I hope that you enjoy reading this blog post, but most important, I hope you always remember: “Data are friends, not food.”

Data Silos

“Mine! Mine! Mine! Mine! Mine!”

That’s the Data Silo Mantra, and it is also the bane of successful enterprise information management. Many organizations persist in their reliance on vertical data silos, where each and every business unit acts as the custodian of its own private data, thereby maintaining its own version of the truth.

Impressive business growth can cause an organization to become a victim of its own success. This success can cause significant collateral damage, most notably to the organization’s burgeoning information architecture.

Earlier in an organization’s history, it usually has fewer systems and easily manageable volumes of data. Managing data quality and effectively delivering the critical information required to make informed business decisions every day is then a relatively easy task, and technology can serve business needs well, especially when the business and its needs are small.

However, as the organization grows, it trades effectiveness for efficiency, prioritizing short-term tactics over long-term strategy, and by seeing power in the hoarding of data, not in the sharing of information, the organization chooses business unit autonomy over enterprise-wide collaboration—and without this collaboration, successful enterprise information management is impossible.

A data silo often merely represents a microcosm of an enterprise-wide problem—and this truth is neither convenient nor kind.

Data Profiling

“I see a light—I’m feeling good about my data . . .

Good feeling’s gone—AHH!”

Although it’s not exactly a riddle wrapped in a mystery inside an enigma, understanding your data is essential to using it effectively and improving its quality—to achieve these goals, there is simply no substitute for data analysis.

Data profiling can provide a reality check for the perceptions and assumptions you may have about the quality of your data. A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.
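As a rough illustration of the grunt work a profiling tool automates, here is a minimal Python sketch that counts rows, missing values, and distinct values for each column. The `profile_columns` helper, the sample data, and its column names are all hypothetical, invented for this example; a real profiling tool reports far more.

```python
import csv
import io
from collections import Counter


def profile_columns(rows):
    """Return basic per-column statistics for a list of dict rows."""
    profile = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [row[col] for row in rows]
        # Treat empty strings and None as missing values.
        non_null = [v for v in values if v not in ("", None)]
        counts = Counter(non_null)
        profile[col] = {
            "count": len(values),
            "nulls": len(values) - len(non_null),
            "distinct": len(counts),
            "most_common": counts.most_common(1)[0][0] if counts else None,
        }
    return profile


# Hypothetical sample data, made up for illustration only.
SAMPLE_CSV = """customer_id,state,birth_date
1,MA,1980-01-15
2,ma,
3,MA,1975-07-04
4,,1990-12-31
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)
for col, stats in profile.items():
    print(col, stats)
```

In this sample, the state column reports two distinct values because “MA” and “ma” are counted separately. The profile cannot tell you whether that difference matters to the business; it only hands you a question to ask.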

However, it is important to remember that the analysis itself cannot be automated. You need to translate your analysis into the meaningful reports and questions that will facilitate more effective communication and help establish tangible business context.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions.

Discovering the right questions requires talking with data’s best friends: its stewards, analysts, and subject matter experts. These discussions are a critical prerequisite for determining data usage, standards, and the business-relevant metrics for measuring and improving data quality. Always remember that well-performed data profiling is a highly interactive and very iterative process.

Defect Prevention

“You, Data-Dude, takin’ on the defects.

You’ve got serious data quality issues, dude.

Awesome.”

Even though it is impossible to truly prevent every problem before it happens, proactive defect prevention is a highly recommended data quality best practice because the more control enforced where data originates, the better the overall quality will be for enterprise information.

Although defect prevention is most commonly associated with business and technical process improvements, after identifying the burning root cause of your data defects, you may predictably need to apply some of the principles of behavioral data quality.

In other words, understanding the complex human dynamics often underlying data defects is necessary for developing far more effective tactics and strategies for implementing successful and sustainable data quality improvements.

Data Cleansing

“Just keep cleansing. Just keep cleansing.

Just keep cleansing, cleansing, cleansing.

What do we do? We cleanse, cleanse.”

That’s not the Data Cleansing Theme Song—but it can sometimes feel like it. Especially whenever poor data quality negatively impacts decision-critical information, the organization may legitimately prioritize a reactive short-term response, where the only remediation will be fixing the immediate problems.

Balancing the demands of this data triage mentality with the best practice of implementing defect prevention wherever possible will often create a very challenging situation for you to contend with on an almost daily basis.

Therefore, although comprehensive data remediation will require combining reactive and proactive approaches to data quality, you need to be willing and able to put data cleansing tools to good use whenever necessary.

Communication

“It’s like he’s trying to speak to me, I know it.

Look, you’re really cute, but I can’t understand what you’re saying.

Say that data quality thing again.”

I hear this kind of thing all the time (well, not the “you’re really cute” part).

Effective communication improves everyone’s understanding of data quality, establishes a tangible business context, and helps prioritize critical data issues.

Keep in mind that communication is mostly about listening. Also, be prepared to face “data denial” when data quality problems are discussed. Most often, this is a natural self-defense mechanism for the people responsible for business processes, technology, and data, because nobody likes to feel blamed for causing or failing to fix data quality problems.

The key to effective communication is clarity. You should always make sure that all data quality concepts are clearly defined and in a language that everyone can understand. I am not just talking about translating the techno-mumbo-jumbo, because even business-speak can sound more like business-babbling, and not just to the technical folks.

Additionally, don’t be afraid to ask questions or admit when you don’t know the answers. Many costly mistakes can be made when people assume that others know (or pretend to know themselves) what key concepts and other terminology actually mean.

Never underestimate the potential negative impacts that the point of view paradox can have on communication. For example, the perspectives of the business and technical stakeholders can often appear to be diametrically opposed.

Practicing effective communication requires shutting our mouths, opening our ears, and empathically listening to each other, instead of continuing to practice ineffective communication, where we merely take turns throwing word-darts at each other.

Collaboration

“Oh and one more thing:

When facing the daunting challenge of collaboration,

Work through it together, don't avoid it.

Come on, trust each other on this one.

Yes—trust—it’s what successful teams do.”

Most organizations suffer from a lack of collaboration, and as noted earlier, without true enterprise-wide collaboration, true success is impossible.

Beyond the data silo problem, the most common challenge for collaboration is the divide perceived to exist between the Business and IT, where the Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise, and IT usually owns the hardware and software infrastructure of the enterprise’s technical architecture.

However, neither the Business nor IT alone has all of the necessary knowledge and resources required to truly be successful. Data quality requires that the Business and IT forge an ongoing and iterative collaboration.

You must rally the team that will work together to improve the quality of your data. A cross-disciplinary team will truly be necessary because data quality is neither a business issue nor a technical issue—it is both, truly making it an enterprise issue.

Executive sponsors, business and technical stakeholders, business analysts, data stewards, technology experts, and yes, even consultants and contractors: only when all of you are truly working together as a collaborative team can the enterprise truly achieve great things, both tactically and strategically.

Successful enterprise information management is spelled E—A—C.

Of course, that stands for Enterprises—Always—Collaborate. The EAC can be one seriously challenging place, dude.

You don’t know if you know what they know, or if they know what you know, but when you know, then they know, you know?

It’s like first you are all like “Whoa!” and they are all like “Whoaaa!” then you are like “Sweet!” and then they are like “Totally!”

This critical need for collaboration might seem rather obvious. However, as all of the great philosophers have taught us, sometimes the hardest thing to learn is the least complicated.

Okay. Squirt will now give you a rundown of the proper collaboration technique:

“Good afternoon. We’re gonna have a great collaboration today.

Okay, first crank a hard cutback as you hit the wall.

There’s a screaming bottom curve, so watch out.

Remember: rip it, roll it, and punch it.”

Finding Data Quality

As more and more organizations realize the critical importance of viewing data as a strategic corporate asset, data quality is becoming an increasingly prevalent topic of discussion.

However, and somewhat understandably, data quality is sometimes viewed as a small fish—albeit with a “lucky fin”—in a much larger pond.

In other words, data quality is often discussed only in its relation to enterprise information initiatives such as data integration, master data management, data warehousing, business intelligence, and data governance.

There is nothing wrong with this perspective, and as a data quality expert, I admit to my general tendency to see data quality in everything. However, regardless of the perspective from which you begin your journey, I believe that eventually you will be Finding Data Quality wherever you look as well.

November 20, 2014

It’s Not about being Data-Driven

November 20, 2014/ Jim Harris

This post explains that becoming a successful organization in any industry is not about being data-driven, but about whether data, regardless of its source, is driving your organization to make better business decisions.

November 06, 2014

Data-Driven Intuition

November 06, 2014/ Jim Harris

This post, inspired by Jeffrey Ma’s book The House Advantage: Playing the Odds to Win Big In Business, explores the possibility that our intuition has always been more data-driven than we gave it credit for.

October 09, 2014

Big Data and Quantified Self-Awareness

October 09, 2014/ Jim Harris

If ignorance is bliss, what is digital abundance? This post posits a contrarian’s view on the quantified self movement, wondering if we are ready for the impact that big data will have on self-awareness.
