Data Quality and Chicken Little Syndrome

“The sky is falling!” exclaimed Chicken Little after an acorn fell on his head, causing him to undertake a journey to tell the King that the world is coming to an end.  So says the folk tale that became an allegory for people accused of being unreasonably afraid, or people trying to incite an unreasonable fear in those around them, sometimes referred to as Chicken Little Syndrome.

The sales pitches for data quality solutions often suffer from Chicken Little Syndrome: vendors and consultants, instead of selling the business benefits of data quality, focus too much on the negative consequences of not investing in data quality, and try to scare people into prioritizing data quality initiatives by exclaiming “your company is failing because your data quality is bad!”

The Chicken Littles of Data Quality use sound bites like “data quality problems cost businesses more than $600 billion a year!” or “poor data quality costs organizations 35% of their revenue!”  However, the most common characteristic of these fear-mongering estimates about the costs of poor data quality is that, upon closer examination, most of them either rely on anecdotal evidence, or hide behind the curtain of an allegedly proprietary case study, the details of which conveniently can’t be publicly disclosed.

Lacking a tangible estimate for the cost of poor data quality often complicates building the business case for data quality.  Even though a data quality initiative has the long-term potential of reducing the costs, and mitigating the risks, associated with poor data quality, its initial costs are very tangible.  For example, the short-term increased costs of a data quality initiative can include the purchase of data quality software, and the professional services needed for training and consulting to support installation, configuration, application development, testing, and production implementation.  When considering these short-term costs, and especially when lacking a tangible estimate for the cost of poor data quality, many organizations understandably conclude that it’s less risky to gamble on not investing in a data quality initiative and hope things are just not as bad as Chicken Little claims.

“The sky isn’t falling on us.”

Furthermore, the reason that citing specific examples of poor data quality (e.g., IQTrainwrecks.com) also doesn’t work very well is not just the lack of a verifiable estimate for the associated business costs.  Another significant contributing factor is that people naturally dismiss the possibility that something bad that happened to someone else could also happen to them.

So, when Chicken Little undertakes a journey to tell the CEO that the organization is coming to an end due to poor data quality, exclaiming that “the sky is falling!” while citing one of those data quality disaster stories that befell another organization, should we really be surprised when the CEO looks up, scratches their head, and declares that “the sky isn’t falling on us”?

Sometimes, denying the existence of data quality issues is a natural self-defense mechanism for the people responsible for the business processes and technology surrounding data since nobody wants to be blamed for causing, or failing to fix, data quality issues.  Other times, people suffer from the illusion-of-quality effect caused by the dark side of data cleansing.  In other words, they don’t believe that data quality issues occur very often because the data made available to end users in dashboards and reports often passes through many processes that cleanse or otherwise sanitize the data before it reaches them.

Can we stop Playing Chicken with Data Quality?

Most of the time, advocating for data quality feels like we are playing chicken with executive sponsors and business stakeholders, as if we were driving toward them at full speed on a collision course, armed with fear-mongering and disaster stories, hoping that they swerve in the direction of approving a data quality initiative.  But there has to be a better way to advocate for data quality than constantly exclaiming that “the sky is falling!”  (Don’t cry fowl — I realize that I just mixed my chicken metaphors.)

Data Governance Frameworks are like Jigsaw Puzzles


In a recent interview, Jill Dyché addressed a common misconception, explaining that a data governance framework is not a strategy.  “Unlike other strategic initiatives that involve IT,” Jill explained, “data governance needs to be designed.  The cultural factors, the workflow factors, the organizational structure, the ownership, the political factors, all need to be accounted for when you are designing a data governance roadmap.”

“People need a mental model, that is why everybody loves frameworks,” Jill continued.  “But they are not enough and I think the mistake that people make is that once they see a framework, rather than understanding its relevance to their organization, they will just adapt it and plaster it up on the whiteboard and show executives without any kind of context.  So they are already defeating the purpose of data governance, which is to make it work within the context of your business problems, not just have some kind of mental model that everybody can agree on, but is not really the basis for execution.”

“So it’s a really, really dangerous trend,” Jill cautioned, “that we see where people equate strategy with framework because strategy is really a series of collected actions that result in some execution — and that is exactly what data governance is.”

And in her excellent article Data Governance Next Practices: The 5 + 2 Model, Jill explained that data governance requires a deliberate design so that the entire organization can buy into a realistic execution plan, not just a sound bite.  As usual, I agree with Jill, since, in my experience, many people expect a data governance framework to provide eureka-like moments of insight.

In The Myths of Innovation, Scott Berkun debunked the myth of the eureka moment using the metaphor of a jigsaw puzzle.

“When you put the last piece into place, is there anything special about that last piece or what you were wearing when you put it in?” Berkun asked.  “The only reason that last piece is significant is because of the other pieces you’d already put into place.  If you jumbled up the pieces a second time, any one of them could turn out to be the last, magical piece.”

“The magic feeling at the moment of insight, when the last piece falls into place,” Berkun explained, “is the reward for many hours (or years) of investment coming together.  In comparison to the simple action of fitting the puzzle piece into place, we feel the larger collective payoff of hundreds of pieces’ worth of work.”

Perhaps the myth of the data governance framework could also be debunked using the metaphor of a jigsaw puzzle.

Data governance requires the coordination of a complex combination of factors, including executive sponsorship, funding, decision rights, arbitration of conflicting priorities, policy definition, policy implementation, data quality remediation, data stewardship, business process optimization, technology enablement, change management — and many other puzzle pieces.

How could a data governance framework possibly predict how you will assemble the puzzle pieces?  Or how the puzzle pieces will fit together within your unique corporate culture?  Or which of the many aspects of data governance will turn out to be the last (or even the first) piece of the puzzle to fall into place in your organization?  And, of course, there is truly no last piece of the puzzle, since data governance is an ongoing program because the business world constantly gets jumbled up by change.

So, data governance frameworks are useful, but only if you realize that data governance frameworks are like jigsaw puzzles.

The Wisdom of Crowds, Friends, and Experts

I recently finished reading the TED Book by Jim Hornthal, A Haystack Full of Needles, which included an overview of the different predictive approaches taken by one of the most common forms of data-driven decision making in the era of big data, namely, the recommendation engines increasingly provided by websites, social networks, and mobile apps.

These recommendation engines primarily employ one of three techniques, choosing to base their data-driven recommendations on the “wisdom” provided by either crowds, friends, or experts.

 

The Wisdom of Crowds

In his book The Wisdom of Crowds, James Surowiecki explained that the four conditions characterizing wise crowds are diversity of opinion, independent thinking, decentralization, and aggregation.  Amazon is a great example of a recommendation engine using this approach by assuming that a sufficiently large population of buyers is a good proxy for your purchasing decisions.

For example, Amazon tells you that people who bought James Surowiecki’s bestselling book also bought Thinking, Fast and Slow by Daniel Kahneman, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business by Jeff Howe, and Wikinomics: How Mass Collaboration Changes Everything by Don Tapscott.  However, Amazon neither provides nor possesses knowledge of why people bought all four of these books, nor any qualification of the subject matter expertise of those readers.

These concerns, which we could think of as potential data quality issues, would be exacerbated within a small amount of transaction data, where the eclectic tastes and idiosyncrasies of individual readers would not help us decide what books to buy.  Within a large amount of transaction data, however, we achieve the Wisdom of Crowds effect: taken in aggregate, the data gives us a general sense of what books we might like to read based on what a diverse group of readers collectively makes popular.

As I blogged about in my post Sometimes it’s Okay to be Shallow, sometimes the aggregated, general sentiment of a large group of unknown, unqualified strangers will be sufficient to effectively make certain decisions.
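To make the crowd-based approach concrete, here is a minimal sketch (with invented purchase data, and certainly not Amazon’s actual algorithm) of how co-purchase counts, aggregated across many orders, can produce a “people who bought this also bought” recommendation:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: each order is the set of books one customer bought.
orders = [
    {"The Wisdom of Crowds", "Thinking, Fast and Slow"},
    {"The Wisdom of Crowds", "Crowdsourcing", "Wikinomics"},
    {"The Wisdom of Crowds", "Thinking, Fast and Slow", "Wikinomics"},
    {"Thinking, Fast and Slow", "Moby-Dick"},
]

# Count how often each pair of books appears in the same order.
co_purchases = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        co_purchases[pair] += 1

def recommend(book, top_n=3):
    """Rank the other books by how often the crowd bought them alongside `book`."""
    scores = Counter()
    for (a, b), count in co_purchases.items():
        if a == book:
            scores[b] += count
        elif b == book:
            scores[a] += count
    return scores.most_common(top_n)

print(recommend("The Wisdom of Crowds"))
# e.g., [('Thinking, Fast and Slow', 2), ('Wikinomics', 2), ('Crowdsourcing', 1)]
```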

 

The Wisdom of Friends

Although the influence of our friends and family is the oldest form of data-driven decision making, historically this influence was delivered by word of mouth, which required you either to be there to hear those influential words when they were spoken, or to know a large enough network of people who would eventually pass those words along to you.

But the rise of social networking services, such as Twitter and Facebook, has transformed word of mouth into word of data by transcribing our words into short bursts of social data, such as status updates, online reviews, and blog posts.

Facebook “Likes” are a great example of a recommendation engine that uses the Wisdom of Friends, where our decision to buy a book, see a movie, or listen to a song might be based on whether or not our friends like it.  Of course, “friends” is used in a very loose sense in a social network, and not just on Facebook, since it combines strong connections such as actual friends and family, with weak connections such as acquaintances, friends of friends, and total strangers from the periphery of our social network.

Social influence has never been limited to the people we know well, as Nicholas Christakis and James Fowler explained in their book Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives.  But the hyper-connected world enabled by the Internet, and further facilitated by mobile devices, has strengthened the social influence of weak connections, and these friends form a smaller crowd whose wisdom is involved in more of our decisions than we may even be aware of.
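Here is a minimal sketch of that idea, using a hypothetical network with invented tie strengths rather than Facebook’s actual ranking: items are scored by who liked them, with strong connections weighted more heavily than weak ones.

```python
# Hypothetical social network: tie strengths blend strong and weak connections.
tie_strength = {
    "sister": 1.0,              # strong connection
    "close friend": 0.8,        # strong connection
    "coworker": 0.4,            # weak connection
    "friend of a friend": 0.1,  # very weak connection
}

# What each person in the network has "Liked".
likes = {
    "sister": {"Book A", "Movie X"},
    "close friend": {"Movie X"},
    "coworker": {"Book B", "Movie X"},
    "friend of a friend": {"Book B"},
}

def friend_score(item):
    """Sum the tie strengths of everyone in the network who liked the item."""
    return sum(weight for person, weight in tie_strength.items() if item in likes[person])

all_items = {item for liked in likes.values() for item in liked}
for item in sorted(all_items, key=friend_score, reverse=True):
    print(item, round(friend_score(item), 2))
# Movie X 2.2, Book A 1.0, Book B 0.5 (weak ties still count, just for less)
```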

 

The Wisdom of Experts

Since it’s more common to associate wisdom with expertise, Pandora is a great example of a recommendation engine that uses the Wisdom of Experts.  Pandora used a team of musicologists (professional musicians and scholars with advanced degrees in music theory) to deconstruct more than 800,000 songs into 450 musical elements that make up each performance, including qualities of melody, harmony, rhythm, form, composition, and lyrics, as part of what Pandora calls the Music Genome Project.

As Pandora explains, their methodology uses precisely defined terminology, a consistent frame of reference, redundant analysis, and ongoing quality control to ensure that data integrity remains reliably high, believing that delivering a great radio experience to each and every listener requires an incredibly broad and deep understanding of music.

Essentially, experts form the smallest crowd of wisdom.  Of course, experts are not always right.  At the very least, experts are not right about every one of their predictions.  Nor do experts always agree with each other, which is why I imagine that one of the most challenging aspects of the Music Genome Project is getting music experts to consistently apply precisely the same methodology.

Pandora also acknowledges that each individual has a unique relationship with music (i.e., no one else has tastes exactly like yours), and allows you to “Thumbs Up” or “Thumbs Down” songs without affecting other users, producing more personalized results than either the popularity predicted by the Wisdom of Crowds or the similarity predicted by the Wisdom of Friends.
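In the spirit of that expert-driven approach, here is a minimal sketch of content-based recommendation, using invented songs and a handful of made-up attributes (the actual Music Genome Project scores hundreds of expert-defined musical elements), where a thumbed-up song is matched to other songs by comparing their attribute vectors:

```python
import math

# Invented songs scored on a handful of made-up attributes (0.0 to 1.0).
songs = {
    "Song A": {"tempo": 0.8, "acoustic": 0.2, "minor_key": 0.1, "vocal_centric": 0.9},
    "Song B": {"tempo": 0.7, "acoustic": 0.3, "minor_key": 0.2, "vocal_centric": 0.8},
    "Song C": {"tempo": 0.2, "acoustic": 0.9, "minor_key": 0.7, "vocal_centric": 0.3},
}

def cosine(a, b):
    """Cosine similarity between two expert-scored attribute vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms

def similar_to(seed):
    """Rank the other songs by similarity to the song the listener thumbed up."""
    return sorted(
        ((other, round(cosine(songs[seed], songs[other]), 3)) for other in songs if other != seed),
        key=lambda pair: pair[1],
        reverse=True,
    )

print(similar_to("Song A"))  # Song B scores far higher than Song C
```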

 

The Future of Wisdom

It’s interesting to note that the Wisdom of Experts is the only one of these approaches that relies on what data management and business intelligence professionals would consider a rigorous approach to data quality and decision quality best practices.  But this is also why the Wisdom of Experts is the most time-consuming and expensive approach to data-driven decision making.

In the past, the Wisdom of Crowds and Friends was ignored in data-driven decision making for the simple reason that this potential wisdom wasn’t digitized.  But now, in the era of big data, not only are crowds and friends digitized, but technological advancements combined with cost-effective options via open source (data and software) and cloud computing make these approaches quicker and cheaper than the Wisdom of Experts.  And despite the potential data quality and decision quality issues, the Wisdom of Crowds and/or Friends is proving itself a viable option for more categories of data-driven decision making.

I predict that the future of wisdom will increasingly become an amalgamation of experts, friends, and crowds, with the data and techniques from all three potential sources of wisdom often acknowledged as contributors to data-driven decision making.

 

Related Posts

Sometimes it’s Okay to be Shallow

Word of Mouth has become Word of Data

The Wisdom of the Social Media Crowd

Data Management: The Next Generation

Exercise Better Data Management

Darth Vader, Big Data, and Predictive Analytics

Data-Driven Intuition

The Big Data Theory

Finding a Needle in a Needle Stack

Big Data, Predictive Analytics, and the Ideal Chronicler

The Limitations of Historical Analysis

Magic Elephants, Data Psychics, and Invisible Gorillas

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

HoardaBytes and the Big Data Lebowski

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

A Tale of Two Datas

Data Silence

This blog post is sponsored by the Enterprise CIO Forum and HP.

In the era of big data, information optimization is becoming a major topic of discussion.  But when some people discuss the big potential of big data analytics under the umbrella term of data science, they make it sound as if, since we have access to all the data we would ever need, all we have to do is ask the Data Psychic the right question and then listen intently to the answer.

However, in his recent blog post Silence Isn’t Always Golden, Bradley S. Fordham, PhD explained that “listening to what the data does not say is often as important as listening to what it does.  There can be various types of silences in data that we must get past to take the right actions.”  Fordham described these data silences as various potential gaps in our analysis.

One data silence is the syntactic gap: a proportionately small amount of data in a very large data set that “will not parse (be converted from raw data into meaningful observations with semantics or meaning) in the standard way.  A common response is to ignore them under the assumption there are too few to really matter.  The problem is that oftentimes these items fail to parse for similar reasons and therefore bear relationships to each other.  So, even though it may only be .1% of the overall population, it is a coherent sub-population that could be telling us something if we took the time to fix the syntactic problems.”

This data silence reminded me of my podcast discussion with Thomas C. Redman, PhD about big data and data quality, during which we discussed how some people erroneously assume that data quality issues can be ignored in larger data sets.
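As a minimal sketch of the syntactic-gap idea, using a hypothetical parser and a handful of invented records (in which the silent proportion is obviously exaggerated), the point is to count and inspect the records that fail to parse instead of silently dropping them:

```python
def parse(record):
    """Hypothetical parser: expects 'YYYY-MM-DD|amount' and returns (date, amount)."""
    date, amount = record.split("|")
    return date, float(amount)

# Invented raw records; the last two use a different locale (DD/MM/YYYY, decimal comma).
raw_records = ["2023-01-02|19.99", "2023-01-03|24.50", "02/01/2023|19,99", "03/01/2023|7,25"]

parsed, silent = [], []
for record in raw_records:
    try:
        parsed.append(parse(record))
    except ValueError:
        silent.append(record)  # don't just drop it; keep it for inspection

share = len(silent) / len(raw_records)
print(f"{len(parsed)} records parsed, {len(silent)} silent ({share:.0%})")
print(silent)
# The silent records share a pattern (a locale the parser doesn't expect),
# i.e., a coherent sub-population that is telling us something.
```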

Another data silence is the inferential gap: basing an inference on only one variable in a data set.  The example Fordham uses is from a data set showing that 41% of the cars sold during the first quarter of the year were blue, from which we might be tempted to infer that customers bought more blue cars because they preferred blue.  However, by looking at additional variables in the data set and noticing that “70% of the blue cars sold were from the previous model year, it is likely they were discounted to clear them off the lots, thereby inflating the proportion of blue cars sold.  So, maybe blue wasn’t so popular after all.”
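Here is a minimal sketch of closing that inferential gap, using invented sales records whose proportions only loosely mirror Fordham’s example, by bringing a second variable (model year) into the analysis before drawing a conclusion from the first (color):

```python
# Invented sales records whose proportions loosely mirror Fordham's example.
sales = (
    [{"color": "blue", "model_year": "previous"}] * 29
    + [{"color": "blue", "model_year": "current"}] * 12
    + [{"color": "red", "model_year": "current"}] * 34
    + [{"color": "silver", "model_year": "current"}] * 25
)

blue = [car for car in sales if car["color"] == "blue"]
blue_share = len(blue) / len(sales)  # the one-variable inference
previous_year_share = sum(car["model_year"] == "previous" for car in blue) / len(blue)

print(f"Blue cars were {blue_share:.0%} of first-quarter sales")
print(f"...but {previous_year_share:.0%} of those blue cars were last year's models")
```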

Another data silence Fordham described using the same data set is gaps in field of view.  “At first glance, knowing everything on the window sticker of every car sold in the first quarter seems to provide a great set of data to understand what customers wanted and therefore were buying.  At least it did until we got a sinking feeling in our stomachs because we realized that this data only considers what the auto manufacturer actually built.  That field of view is too limited to answer the important customer desire and motivation questions being asked.  We need to break the silence around all the things customers wanted that were not built.”

This data silence reminded me of WYSIATI, which is an acronym coined by Daniel Kahneman to describe how the data you are looking at can greatly influence you to jump to the comforting, but false, conclusion that “what you see is all there is,” thereby preventing you from expanding your field of view to notice what data might be missing from your analysis.

As Fordham concluded, “we need to be careful to listen to all the relevant data, especially the data that is silent within our current analyses.  Applying that discipline will help avoid many costly mistakes that companies make by taking the wrong actions from data even with the best of techniques and intentions.”

Therefore, in order for your enterprise to leverage big data analytics for business success, you not only need to adopt a mindset that embraces the principles of data science, you also need to make sure that your ears are set to listen for data silence.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

Magic Elephants, Data Psychics, and Invisible Gorillas

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

WYSIWYG and WYSIATI

Will Big Data be Blinded by Data Science?

Big Data el Memorioso

Information Overload Revisited

HoardaBytes and the Big Data Lebowski

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

The Big Data Theory

Finding a Needle in a Needle Stack

Darth Vader, Big Data, and Predictive Analytics

Data-Driven Intuition

A Tale of Two Datas

Availability Bias and Data Quality Improvement

The availability heuristic is a mental shortcut that occurs when people make judgments based on the ease with which examples come to mind.  Although this heuristic can be beneficial, such as when it helps us recall examples of a dangerous activity to avoid, sometimes it leads to availability bias, where we’re affected more strongly by the ease of retrieval than by the content retrieved.

In his thought-provoking book Thinking, Fast and Slow, Daniel Kahneman explained how availability bias works by recounting an experiment where different groups of college students were asked to rate a course they had taken the previous semester by listing ways to improve the course — while varying the number of improvements that different groups were required to list.

Counterintuitively, students in the group required to list more necessary improvements gave the course a higher rating, whereas students in the group required to list fewer necessary improvements gave the course a lower rating.

According to Kahneman, the extra cognitive effort expended by the students required to list more improvements biased them into believing it was difficult to list necessary improvements, leading them to conclude that the course didn’t need much improvement, and conversely, the little cognitive effort expended by the students required to list few improvements biased them into concluding, since it was so easy to list necessary improvements, that the course obviously needed improvement.

This is counterintuitive because you’d think that the students would rate the course based on an assessment of the information retrieved from their memory regardless of how easy that information was to retrieve.  It would have made more sense for the course to be rated higher for needing fewer improvements, but availability bias led the students to the opposite conclusion.

Availability bias can also affect an organization’s discussions about the need for data quality improvement.

If you asked stakeholders to rate the organization’s data quality by listing business-impacting incidents of poor data quality, would they reach a different conclusion if you asked them to list one incident versus asking them to list at least ten incidents?

In my experience, an event where poor data quality negatively impacted the organization, such as a regulatory compliance failure, is often easily dismissed by stakeholders as an isolated incident to be corrected by a one-time data cleansing project.

But would forcing stakeholders to list ten business-impacting incidents of poor data quality make them concede that data quality improvement should be supported by an ongoing program?  Or would the extra cognitive effort bias them into concluding, since it was so difficult to list ten incidents, that the organization’s data quality doesn’t really need much improvement?

I think that the availability heuristic helps explain why most organizations easily approve reactive data cleansing projects, and availability bias helps explain why most organizations usually resist proactively initiating a data quality improvement program.

 

Related Posts

DQ-View: The Five Stages of Data Quality

Data Quality: Quo Vadimus?

Data Quality and Chicken Little Syndrome

The Data Quality Wager

You only get a Return from something you actually Invest in

“Some is not a number and soon is not a time”

Why isn’t our data quality worse?

Data Quality and the Bystander Effect

Data Quality and the Q Test

Perception Filters and Data Quality

Predictably Poor Data Quality

WYSIWYG and WYSIATI

 

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Organizing for Data Quality — Guest Tom Redman (aka the “Data Doc”) discusses how your organization should approach data quality, including his call to action for your role in the data revolution.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Redefining Data Quality — Guest Peter Perera discusses his proposed redefinition of data quality, as well as his perspective on the relationship of data quality to master data management and data governance.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

A Tale of Two Datas

Is big data more than just lots and lots of data?  Is big data unstructured and not-so-big data structured?  Malcolm Chisholm explored these questions in his recent Information Management column, where he posited that there are, in fact, two datas.

“One type of data,” Chisholm explained,  “represents non-material entities in vast computerized ecosystems that humans create and manage.  The other data consists of observations of events, which may concern material or non-material entities.”

Providing an example of the first type, Chisholm explained, “my bank account is not a physical thing at all; it is essentially an agreed upon idea between myself, the bank, the legal system, and the regulatory authorities.  It only exists insofar as it is represented, and it is represented in data.  The balance in my bank account is not some estimate with a positive and negative tolerance; it is exact.  The non-material entities of the financial sector are orderly human constructs.  Because they are orderly, we can more easily manage them in computerized environments.”

The orderly human constructs that are represented in data, and the stories told by data (including the stories data tell about us and the stories we tell data), form one of my favorite topics.  In our increasingly data-constructed world, it’s important to occasionally remind ourselves that data and the real world are not the same thing, especially when data represents non-material entities since, with the possible exception of Makers using 3-D printers, data-represented entities do not re-materialize into the real world.

Describing the second type, Chisholm explained, “a measurement is usually a comparison of a characteristic using some criteria, a count of certain instances, or the comparison of two characteristics.  A measurement can generally be quantified, although sometimes it’s expressed in a qualitative manner.  I think that big data goes beyond mere measurement, to observations.”

Chisholm called the first type the Data of Representation, and the second type the Data of Observation.

The data of representation tends to be structured, in the relational sense, but doesn’t need to be (e.g., graph databases).  The data of observation tends to be unstructured, but it can also be structured (e.g., the structured observations generated by either a data profiling tool analyzing structured relational tables or flat files, or a word-counting algorithm analyzing unstructured text).
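To illustrate that parenthetical point, here is a minimal sketch of a word-counting algorithm turning unstructured text into structured observations: the input has no schema, but the output is a tidy set of (word, count) observations that could be loaded straight into a relational table.

```python
import re
from collections import Counter

# Unstructured input: a sentence with no schema at all.
text = "It was the best of times, it was the worst of times."

# Structured output: (word, count) observations.
word_counts = Counter(re.findall(r"[a-z']+", text.lower()))
print(word_counts.most_common(5))
# [('it', 2), ('was', 2), ('the', 2), ('of', 2), ('times', 2)]
```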

“Structured and unstructured,” Chisholm concluded, “describe form, not essence, and I suggest that representation and observation describe the essences of the two datas.  I would also submit that both datas need different data management approaches.  We have a good idea what these are for the data of representation, but much less so for the data of observation.”

I agree that there are two types of data (i.e., representation and observation, not big and not-so-big) and that different data uses will require different data management approaches.  Although data modeling is still important and data quality still matters, how much data modeling and data quality is needed before data can be effectively used for specific business purposes will vary.

In order to move our discussions forward regarding “big data” and its data management and business intelligence challenges, we have to stop fiercely defending our traditional perspectives about structure and quality in order to effectively manage both the form and essence of the two datas.  We also have to stop fiercely defending our traditional perspectives about data analytics, since there will be some data use cases where depth and detailed analysis may not be necessary to provide business insight.

 

A Tale of Two Datas

In conclusion, and with apologies to Charles Dickens and his A Tale of Two Cities, I offer the following A Tale of Two Datas:

It was the best of times, it was the worst of times.
It was the age of Structured Data, it was the age of Unstructured Data.
It was the epoch of SQL, it was the epoch of NoSQL.
It was the season of Representation, it was the season of Observation.
It was the spring of Big Data Myth, it was the winter of Big Data Reality.
We had everything before us, we had nothing before us,
We were all going direct to hoarding data, we were all going direct the other way.
In short, the period was so far like the present period, that some of its noisiest authorities insisted on its being signaled, for Big Data or for not-so-big data, in the superlative degree of comparison only.

Related Posts

HoardaBytes and the Big Data Lebowski

The Idea of Order in Data

The Most August Imagination

Song of My Data

The Lies We Tell Data

Our Increasingly Data-Constructed World

Plato’s Data

OCDQ Radio - Demystifying Master Data Management

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

Swimming in Big Data

Sometimes it’s Okay to be Shallow

Darth Vader, Big Data, and Predictive Analytics

The Big Data Theory

Finding a Needle in a Needle Stack

Exercise Better Data Management

Magic Elephants, Data Psychics, and Invisible Gorillas

Why Can’t We Predict the Weather?

Data and its Relationships with Quality

A Tale of Two Q’s

A Tale of Two G’s

Cooks, Chefs, and Data Governance

In their book Practical Wisdom, Barry Schwartz and Kenneth Sharpe quoted retired Lieutenant Colonel Leonard Wong, who is a Research Professor of Military Strategy in the Strategic Studies Institute at the United States Army War College, focusing on the human and organizational dimensions of the military.

“Innovation,” Wong explained, “develops when an officer is given a minimal number of parameters (e.g., task, condition, and standards) and the requisite time to plan and execute the training.  Giving the commanders time to create their own training develops confidence in operating within the boundaries of a higher commander’s intent without constant supervision.”

According to Wong, too many rules and requirements “remove all discretion, resulting in reactive instead of proactive thought, compliance instead of creativity, and adherence instead of audacity.”  Wong believed that it came down to a difference between cooks, those who are quite adept at carrying out a recipe, and chefs, those who can look at the ingredients available to them and create a meal.  A successful military strategy is executed by officers who are trained to be chefs, not cooks.

 

Data Governance’s Kitchen

Data governance requires the coordination of a complex combination of factors, including executive sponsorship, funding, decision rights, arbitration of conflicting priorities, policy definition, policy implementation, data quality remediation, data stewardship, business process optimization, technology enablement, and, perhaps most notably, policy enforcement.

Because of this complexity, many organizations think the only way to run data governance’s kitchen is to institute a bureaucracy that dictates policies and demands compliance.  In other words, data governance policies are recipes and employees are cooks.

Although implementing data governance policies does occasionally require a cook-adept-at-carrying-out-a-recipe mindset, the long-term success of a data governance program will also require chefs.  The dynamic challenges faced, and overcome daily, by business analysts, data stewards, technical architects, and others exemplify today’s constantly changing business world, which cannot be successfully governed by forcing employees to systematically apply rules or follow rigid procedures.

Data governance requires chefs who are empowered with an understanding of the principles of the policies, and who are trusted to figure out how to best implement the policies in a particular business context by combining rules with the organizational ingredients available to them, and creating a flexible procedure that operates within the boundaries of the policy’s principles.

But, of course, just as a military cannot be staffed entirely by officers, and a kitchen cannot be staffed entirely by chefs, in order to implement a data governance program successfully, an organization needs both cooks and chefs.

Similar to how data governance is neither all-top-down nor all-bottom-up, it’s also neither all-cook nor all-chef.

Only the unique corporate culture of your organization can determine how to best staff your data governance kitchen.

 

Related Posts

Aristotle, Data Governance, and Lead Rulers

Data Governance Frameworks are like Jigsaw Puzzles

The Unsung Heroes of Data

MacGyver: Data Governance and Duct Tape

Data Governance and the Adjacent Possible

The Godfather of Data Governance

Jack Bauer and Enforcing Data Governance Policies

Beware the Data Governance Ides of March

Open MIKE Podcast — Episode 02: Information Governance and Distributing Power

Data Governance Star Wars: Balancing Bureaucracy And Agility

OCDQ Radio - Data Governance Star Wars

A Tale of Two G’s

The Data Governance Oratorio

The Three Most Important Letters in Data Governance

The Collaborative Culture of Data Governance

Exercise Better Data Management

Recently on Twitter, Daragh O Brien and I discussed his proposed concept.  “After Big Data,” Daragh tweeted, “we will inevitably begin to see the rise of MOData as organizations seek to grab larger chunks of data and digest it.  What is MOData?  It’s MO’Data, as in MOre Data. Or Morbidly Obese Data.  Only good data quality and data governance will determine which.”

Daragh asked if MO’Data will be the Big Data Killer.  I said only if MO’Data doesn’t include MO’BusinessInsight, MO’DataQuality, and MO’DataPrivacy (i.e., more business insight, more data quality, and more data privacy).

“But MO’Data is about more than just More Data,” Daragh replied.  “It’s about avoiding Morbidly Obese Data that clogs data insight and data quality, etc.”

I responded that More Data becomes Morbidly Obese Data only if we don’t exercise better data management practices.

Agreeing with that point, Daragh replied, “Bring on MOData and the Pilates of Data Quality and Data Governance.”

To slightly paraphrase lines from one of my favorite movies — Airplane! — the Cloud is getting thicker and the Data is getting laaaaarrrrrger.  Surely I know well that growing data volumes are a serious issue — but don’t call me Shirley.

Whether you choose to measure it in terabytes, petabytes, exabytes, HoardaBytes, or how much reality bites, the truth is we were consuming way more than our recommended daily allowance of data long before the data management industry took a tip from McDonald’s and put the word “big” in front of its signature sandwich.  (Oh great . . . now I’m actually hungry for a Big Mac.)

But nowadays, with silos replicating data, as well as new data, and new types of data, being created and stored on a daily basis, our data is resembling the size of Bob Parr in retirement, making it seem like not even Mr. Incredible in his prime possessed the super strength needed to manage all of our data.  Those were references to the movie The Incredibles, where Mr. Incredible was a superhero who, after retiring into civilian life under the alias of Bob Parr, elicited the following observation from his superhero costume tailor: “My God, you’ve gotten fat.”  Yes, I admit not even Helen Parr (aka Elastigirl) could stretch that far for a big data joke.

A Healthier Approach to Big Data

Although Daragh’s concerns about morbidly obese data are valid, no superpowers (or other miracle exceptions) are needed to manage all of our data.  In fact, it’s precisely when we are so busy trying to manage all of our data that we hoard countless bytes of data without evaluating data usage, gathering data requirements, or planning for data archival.  It’s like we are trying to lose weight by eating more and exercising less, i.e., consuming more data and exercising less data quality and data governance.  As Daragh said, only good data quality and data governance will determine whether we get more data or morbidly obese data.

Losing weight requires a healthy approach to both diet and exercise.  A healthy approach to diet includes carefully choosing the food you consume and carefully controlling your portion size.  A healthy approach to exercise includes a commitment to exercise on a regular basis at a sufficient intensity level without going overboard by spending several hours a day, every day, at the gym.

Swimming is a great form of exercise, but swimming in big data without having a clear business objective before you jump into the pool is like telling your boss that you didn’t get any work done because you decided to spend all day working out at the gym.

Carefully choosing the data you consume and carefully controlling your data portion size is becoming increasingly important since big data is forcing us to revisit information overload.  However, the main reason that traditional data management practices often become overwhelmed by big data is because traditional data management practices are not always the right approach.

We need to acknowledge that some big data use cases differ considerably from traditional ones.  Data modeling is still important and data quality still matters, but how much data modeling and data quality is needed before big data can be effectively used for business purposes will vary.  In order to move the big data discussion forward, we have to stop fiercely defending our traditional perspectives about structure and quality.  We also have to stop fiercely defending our traditional perspectives about analytics, since there will be some big data use cases where depth and detailed analysis may not be necessary to provide business insight.

Better than Big or More

Jim Ericson explained that your data is big enough.  Rich Murnane explained that bigger isn’t better, better is better.  Although big data may indeed be followed by more data, that doesn’t necessarily mean we require more data management in order to prevent more data from becoming morbidly obese data.  I think that we just need to exercise better data management.

 

Related Posts

Demystifying Master Data Management

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, special guest John Owens and I attempt to demystify master data management (MDM) by explaining the three types of data (Transaction, Domain, Master) and the four master data entities (Party, Product, Location, Asset), as well as, and perhaps the most important concept of all, the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
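As a minimal sketch (my own illustration of the Party-Role Relationship discussed during the episode, not John Owens’s IMM notation), the idea is that “Customer,” “Supplier,” and “Employee” are not separate master data entities, but roles that a single Party plays:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Party:
    """A master data entity: the real-world party, stored exactly once."""
    party_id: int
    name: str

@dataclass(frozen=True)
class PartyRole:
    """The Party-Role Relationship: a role a Party plays in a business context."""
    party_id: int
    role: str  # e.g., "Customer", "Supplier", "Employee"

acme = Party(party_id=1, name="Acme Corporation")
roles = [
    PartyRole(party_id=1, role="Customer"),  # Acme buys from us...
    PartyRole(party_id=1, role="Supplier"),  # ...and also sells to us
]

customer_ids = {r.party_id for r in roles if r.role == "Customer"}
print(acme.party_id in customer_ids)  # True: one Party, many roles, no duplicate master records
```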

John Owens is a thought leader, consultant, mentor, and writer in the worlds of business and data modelling, data quality, and master data management (MDM).  He has built an international reputation as a highly innovative specialist in these areas and has worked in and led multi-million dollar projects in a wide range of industries around the world.

John Owens has a gift for identifying the underlying simplicity in any enterprise, even when shrouded in complexity, and bringing it to the surface.  He is the creator of the Integrated Modelling Method (IMM), which is used by business and data analysts around the world.  Later this year, John Owens will be formally launching the IMM Academy, which will provide high quality resources, training, and mentoring for business and data analysts at all levels.

You can also follow John Owens on Twitter and connect with John Owens on LinkedIn.  And if you’re looking for an MDM course, consider the online course from John Owens, which you can find by clicking on this link: MDM Online Course (Affiliate Link)

 


 

Related Posts

Choosing Your First Master Data Domain

Lycanthropy, Silver Bullets, and Master Data Management

Voyage of the Golden Records

The Quest for the Golden Copy (Part 1)

The Quest for the Golden Copy (Part 2)

The Quest for the Golden Copy (Part 3)

The Quest for the Golden Copy (Part 4)

How Social can MDM get?

Will Social MDM be the New Spam?

More Thoughts about Social MDM

Is Social MDM going the Wrong Way?

The Semantic Future of MDM

 

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Redefining Data Quality — Guest Peter Perera discusses his proposed redefinition of data quality, as well as his perspective on the relationship of data quality to master data management and data governance.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.

DQ-View: The Five Stages of Data Quality

Data Quality (DQ) View is an OCDQ regular segment. Each DQ-View is a brief video discussion of a data quality key concept.

In my experience, all organizations cycle through five stages while coming to terms with the daunting challenges of data quality, stages somewhat similar to The Five Stages of Grief.  So, in this short video, I explain The Five Stages of Data Quality:

  1. Denial — Our organization is well-managed and highly profitable.  We consistently meet, or exceed, our business goals.  We obviously understand the importance of high-quality data.  Data quality issues can’t possibly be happening to us.
  2. Anger — We’re now in the midst of a financial reporting scandal, and facing considerable fines in the wake of a regulatory compliance failure.  How can this be happening to us?  Why do we have data quality issues?  Who is to blame for this?
  3. Bargaining — Okay, we may have just overreacted a little bit.  We’ll purchase a data quality tool, approve a data cleansing project, implement defect prevention, and initiate data governance.  That will fix all of our data quality issues — right?
  4. Depression — Why, oh why, do we keep having data quality issues?  Why does this keep happening to us?  Maybe we should just give up, accept our doomed fate, and not bother doing anything at all about data quality and data governance.
  5. Acceptance — We can’t fight the truth anymore.  We accept that we have to do the hard daily work of continuously improving our data quality and continuously implementing our data governance principles, policies, and procedures.

Quality is the Higgs Field of Data

Recently on Twitter, Daragh O Brien replied to my David Weinberger quote “The atoms of data hook together only because they share metadata,” by asking “So, is Quality Data the Higgs Boson of Information Management?”

I responded that Quality is the Higgs Boson of Data and Information since Quality gives Data and Information their Mass (i.e., their Usefulness).

“Now that is profound,” Daragh replied.

“That’s cute and all,” Brian Panulla interjected, “but you can’t measure Quality.  Mass is objective.  It’s more like Weight — a mass in context.”

I agreed with Brian’s great point since in a previous post I explained the often misunderstood difference between mass, an intrinsic property of matter based on atomic composition, and weight, a gravitational force acting on matter.

Using these concepts metaphorically, mass is an intrinsic property of data, representing objective data quality, whereas weight is a gravitational force acting on data, thereby representing subjective data quality.

But my previous post didn’t explain where matter theoretically gets its mass, and since this scientific mystery was radiating in the cosmic background of my Twitter banter with Daragh and Brian, I decided to use this post to attempt a brief explanation along the way to yet another data quality analogy.

As you have probably heard by now, big scientific news was recently reported about the discovery of the Higgs Boson, which, since the 1960s, the Standard Model of particle physics has theorized to be the fundamental particle associated with a ubiquitous quantum field (referred to as the Higgs Field) that gives all matter its mass by interacting with the particles that make up atoms and weighing them down.  This is foundational to our understanding of the universe because without something to give mass to the basic building blocks of matter, everything would behave the same way as the intrinsically mass-less photons of light behave, floating freely and not combining with other particles.  Therefore, without mass, ordinary matter, as we know it, would not exist.

 

Ping-Pong Balls and Maple Syrup

I like the Higgs Field explanation provided by Brian Cox and Jeff Forshaw.  “Imagine you are blindfolded, holding a ping-pong ball by a thread.  Jerk the string and you will conclude that something with not much mass is on the end of it.  Now suppose that instead of bobbing freely, the ping-pong ball is immersed in thick maple syrup.  This time if you jerk the thread you will encounter more resistance, and you might reasonably presume that the thing on the end of the thread is much heavier than a ping-pong ball.  It is as if the ball is heavier because it gets dragged back by the syrup.”

“Now imagine a sort of cosmic maple syrup that pervades the whole of space.  Every nook and cranny is filled with it, and it is so pervasive that we do not even know it is there.  In a sense, it provides the backdrop to everything that happens.”

Mass is therefore generated as a result of an interaction between the ping-pong balls (i.e., atomic particles) and the maple syrup (i.e., the Higgs Field).  However, although the Higgs Field is pervasive, it is also variable and selective, since some particles are affected by the Higgs Field more than others, and photons pass through it unimpeded, thereby remaining mass-less particles.

 

Quality — Data Gets Higgy with It

Now that I have vastly oversimplified the Higgs Field, let me Get Higgy with It by attempting an analogy for data quality based on the Higgs Field.  As I do, please remember the wise words of Karen Lopez: “All analogies are perfectly imperfect.”

Quality provides the backdrop to everything that happens when we use data.  Data in the wild, independent from use, is as carefree as the mass-less photon whizzing around at the speed of light, like a ping-pong ball bouncing along without a trace of maple syrup on it.  But once we interact with data using our sticky-maple-syrup-covered fingers, data begins to slow down, begins to feel the effects of our use.  We give data mass so that it can become the basic building blocks of what matters to us.

Some data is affected more by our use than others.  The more subjective our use, the more we weigh data down.  The more objective our use, the less we weigh data down.  Sometimes, we drag data down deep into the maple syrup, covering data up with an application layer, or bottling data into silos.  Other times, we keep data in the shallow end of the molasses swimming pool.

Quality is the Higgs Field of Data.  As users of data, we are the Higgs Bosons — we are the fundamental particles associated with a ubiquitous data quality field.  By using data, we give data its quality.  The quality of data can not be separated from its use any more than the particles of the universe can be separated from the Higgs Field.

The closest data equivalent of a photon, a ping-pong ball particle that doesn’t get stuck in the maple syrup of the Higgs Field, is Open Data, which doesn’t get stuck within silos, but is instead data freely shared without the sticky quality residue of our use.

 

Related Posts

Our Increasingly Data-Constructed World

What is Weighing Down your Data?

Data Myopia and Business Relativity

Redefining Data Quality

Are Applications the La Brea Tar Pits for Data?

Swimming in Big Data

Sometimes it’s Okay to be Shallow

Data Quality and Big Data

Data Quality and the Q Test

My Own Private Data

No Datum is an Island of Serendip

Sharing Data

The Family Circus and Data Quality


Like many young intellectuals, the only part of the Sunday newspaper I read growing up was the color comics section, and one of my favorite comic strips was The Family Circus created by cartoonist Bil Keane.  One of the recurring themes of the comic strip was a set of invisible gremlins that the children used to shift blame for any misdeeds, including Ida Know, Not Me, and Nobody.

Although I no longer read any section of the newspaper on any day of the week, this Sunday morning I have been contemplating how this same set of invisible gremlins is used by many people throughout most organizations to shift blame for any incidents when poor data quality negatively impacted business activities, especially since, when investigating the root cause, you often find that Ida Know owns the data, Not Me is accountable for data governance, and Nobody takes responsibility for data quality.