Measuring Data Quality for Ongoing Improvement

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, Laura Sebastian-Coleman, author of the book Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework, and I discuss bringing together a better understanding of what is represented in data, and how it is represented, with the expectations for its use in order to improve the overall quality of data.  Our discussion also covers avoiding two common mistakes made when starting a data quality project and defining five dimensions of data quality.

Laura Sebastian-Coleman has worked on data quality in large health care data warehouses since 2003.  She has implemented data quality metrics and reporting, launched and facilitated a data quality community, contributed to data consumer training programs, and has led efforts to establish data standards and to manage metadata.  In 2009, she led a group of analysts in developing the original Data Quality Assessment Framework (DQAF), which is the basis for her book.

Laura Sebastian-Coleman has delivered papers at MIT’s Information Quality Conferences and at conferences sponsored by the International Association for Information and Data Quality (IAIDQ) and the Data Governance Organization (DGO).  She holds the IQCP (Information Quality Certified Professional) designation from IAIDQ, a Certificate in Information Quality from MIT, a B.A. in English and History from Franklin & Marshall College, and a Ph.D. in English Literature from the University of Rochester.

OCDQ Radio Episode 38: Measuring Data Quality for Ongoing Improvement
Jim Harris with Guest Laura Sebastian-Coleman

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including whether data quality matters less in larger data sets and whether statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Data Profiling Early and Often — Guest James Standen discusses data profiling concepts and practices, and how bad data is often misunderstood and can be coaxed away from the dark side if you know how to approach it.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Too Big to Ignore

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, Phil Simon shares his sage advice for getting started with big data, including the importance of having a data-oriented mindset, why ambitious long-term goals should give way to more reasonable and attainable short-term objectives, and the need to always remember that big data is just another means of solving business problems.

Phil Simon is a sought-after speaker and the author of five management books, most recently Too Big to Ignore: The Business Case for Big Data.  A recognized technology expert, he advises companies on how to optimize their use of technology.  His contributions have been featured on NBC, CNBC, ABC News, Inc. magazine, BusinessWeek, Huffington Post, the Globe and Mail, Fast Company, Forbes, the New York Times, ReadWriteWeb, and many other sites.

OCDQ Radio Episode 37: Too Big to Ignore
Jim Harris with Guest Phil Simon


Sometimes Worse Data Quality is Better

Continuing a theme from three previous posts, which discussed when it’s okay to call data quality as good as it needs to get, the occasional times when perfect data quality is necessary, and the costs and profits of poor data quality, in this blog post I want to provide three examples of when the world of consumer electronics proved that sometimes worse data quality is better.

 

When the Betamax Bet on Video Busted

While it seems like a long time ago in a galaxy far, far away, during the 1970s and 1980s a videotape format war raged between Betamax and VHS.  Betamax was widely believed to provide superior video data quality.

But a blank Betamax tape allowed users to record up to two hours of high-quality video, whereas a VHS tape allowed users to record up to four hours of slightly lower-quality video.  Consumers consistently chose quantity over quality, especially since lower quality also meant a lower price.  Betamax tapes and machines remained more expensive, a bet that consumers would be willing to pay a premium for higher-quality video.

The VHS victory demonstrated how people often choose quantity over quality, so it doesn’t always pay to have better data quality.

 

When Lossless Lost to Lossy Audio

Much to the dismay of those working in the data quality profession, most people do not care about the quality of their data unless it becomes bad enough for them to pay attention to — and complain about.

An excellent example is bitrate, which refers to the number of bits — or the amount of data — that are processed over a certain amount of time.  In his article Does Bitrate Really Make a Difference In My Music?, Whitson Gordon examined the common debate about lossless versus lossy audio formats.

Using the example of ripping a track from a CD to a hard drive, a lossless format means the track is not compressed to the point where any of its data is lost, retaining, for all intents and purposes, the same audio data quality as the original CD track.

By contrast, a lossy format compresses the track so that it takes up less space by intentionally deleting some of its data, reducing audio data quality.  Audiophiles often claim anything other than vinyl records sounds lousy because those other formats are so lossy.
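
To make the bitrate arithmetic concrete, here is a minimal sketch in Python that estimates how much smaller a lossy rip of a four-minute track would be.  The bitrates are rounded, commonly cited figures (roughly 1,411 kbps for uncompressed CD audio and 320 kbps for a high-quality MP3), not measurements of any particular track:

    # Minimal sketch: estimate audio file sizes from bitrate and duration.
    # The bitrates below are rounded, commonly cited figures, not measurements.

    def size_in_megabytes(bitrate_kbps, duration_seconds):
        """Convert a constant bitrate and a duration into an approximate file size."""
        total_bits = bitrate_kbps * 1_000 * duration_seconds  # bits processed over time
        return total_bits / 8 / 1_000_000                     # bits -> bytes -> megabytes

    track_length = 4 * 60  # a four-minute track, in seconds

    cd_quality = size_in_megabytes(1_411, track_length)  # uncompressed CD audio (the lossless ceiling)
    mp3_rip = size_in_megabytes(320, track_length)       # high-bitrate lossy MP3

    print(f"CD-quality audio: ~{cd_quality:.1f} MB")
    print(f"320 kbps MP3:     ~{mp3_rip:.1f} MB")
    print(f"The lossy copy is roughly {cd_quality / mp3_rip:.1f} times smaller.")

The lossy copy is smaller precisely because some of the original data has been deliberately discarded; whether anyone actually misses that data is a different question.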

However, like truth, beauty, and art, data quality can be said to be in the eyes — or the ears — of the beholder.  So, if your favorite music sounds fine to you in MP3 format, then not only do you no longer need vinyl records, audio tapes, or CDs, but because you consider MP3 files good enough, you will not pay more attention to (or pay more money for) audio data quality.

 

When Digital Killed the Photograph Star

The Eastman Kodak Company, commonly known as Kodak, which was founded by George Eastman in 1888 and dominated the photography industry for most of the 20th century, filed for bankruptcy in January 2012.  The primary reason was that Kodak, which had previously pioneered innovations like celluloid film and color photography, failed to embrace the industry’s transition to digital photography, despite the fact that Kodak invented some of the core technology used in current digital cameras.

Why?  Because Kodak believed that the data quality of digital photographs would be generally unacceptable to consumers as a replacement for film photographs.  In much the same way that Betamax assumed consumers wanted higher-quality video, Kodak assumed consumers would always want higher-quality photographs to capture their “Kodak moments.”

In fairness to Kodak, mobile devices are causing a massive — and rapid — disruption to many well-established business models, creating a brave new digital world, and obviously not just for photography.  However, when digital killed the photograph star, it proved, once again, that sometimes worse data quality is better.

  

Related Posts

Data Quality and the OK Plateau

When Poor Data Quality Kills

The Costs and Profits of Poor Data Quality

Promoting Poor Data Quality

Data Quality and the Cupertino Effect

The Data Quality Wager

How Data Cleansing Saves Lives

The Dichotomy Paradox, Data Quality and Zero Defects

Data Quality and Miracle Exceptions

Data Quality: Quo Vadimus?

The Seventh Law of Data Quality

A Tale of Two Q’s

Paleolithic Rhythm and Data Quality

Groundhog Data Quality Day

Data Quality and The Middle Way

Stop Poor Data Quality STOP

When Poor Data Quality Calls

Freudian Data Quality

Predictably Poor Data Quality

Satisficing Data Quality

i blog of Data glad and big

I recently blogged about the need to balance the hype of big data with some anti-hype.  My hope was that, like a collision of matter and anti-matter, the hype and anti-hype would cancel each other out, transitioning our energy into a more productive discussion about big data.  But, of course, few things in human discourse ever reach such an equilibrium, or can maintain it for very long.

For example, Quentin Hardy recently blogged about six big data myths based on a conference presentation by Kate Crawford, who herself also recently blogged about the hidden biases in big data.  “I call B.S. on all of it,” Derrick Harris blogged in his response to the backlash against big data.  “It might be provocative to call into question one of the hottest tech movements in generations, but it’s not really fair.  That’s because how companies and people benefit from big data, data science or whatever else they choose to call the movement toward a data-centric world is directly related to what they expect going in.  Arguing that big data isn’t all it’s cracked up to be is a strawman, pure and simple — because no one should think it’s magic to begin with.”

In their new book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schonberger and Kenneth Cukier explained that “like so many new technologies, big data will surely become a victim of Silicon Valley’s notorious hype cycle: after being feted on the cover of magazines and at industry conferences, the trend will be dismissed and many of the data-smitten startups will flounder.  But both the infatuation and the damnation profoundly misunderstand the importance of what is taking place.  Just as the telescope enabled us to comprehend the universe and the microscope allowed us to understand germs, the new techniques for collecting and analyzing huge bodies of data will help us make sense of our world in ways we are just starting to appreciate.  The real revolution is not in the machines that calculate data, but in data itself and how we use it.”

Although there have been numerous critical technology factors making the era of big data possible, such as increases in the amount of computing power, decreases in the cost of data storage, increased network bandwidth, parallel processing frameworks (e.g., Hadoop), scalable and distributed models (e.g., cloud computing), and other techniques (e.g., in-memory computing), Mayer-Schonberger and Cukier argued that “something more important changed too, something subtle.  There was a shift in mindset about how data could be used.  Data was no longer regarded as static and stale, whose usefulness was finished once the purpose for which it was collected was achieved.  Rather, data became a raw material of business, a vital economic input, used to create a new form of economic value.”

“In fact, with the right mindset, data can be cleverly used to become a fountain of innovation and new services.  The data can reveal secrets to those with the humility, the willingness, and the tools to listen.”

Pondering this big data war of words reminded me of the E. E. Cummings poem i sing of Olaf glad and big, which sings of Olaf, a conscientious objector forced into military service, who passively endures brutal torture inflicted upon him by training officers, while calmly responding (pardon the profanity): “I will not kiss your fucking flag” and “there is some shit I will not eat.”

Without question, big data has both positive and negative aspects, but the seeming unwillingness of either side in the big data war of words to “kiss each other’s flag,” so to speak, is not as concerning to me as the conscientious objection to big data and data science expanding into realms where people and businesses were not used to enduring their influence.  For example, some will feel that data-driven audits of their decision-making are like brutal torture inflicted upon their less-than-data-driven intuition.

E.E. Cummings sang the praises of Olaf “because unless statistics lie, he was more brave than me.”  i blog of Data glad and big, but I fear that, regardless of how big it is, “there is some data I will not believe” will be a common refrain by people who will lack the humility and willingness to listen to data, and who will not be brave enough to admit that statistics don’t always lie.

 

Related Posts

The Need for Data Philosophers

On Philosophy, Science, and Data

OCDQ Radio - Demystifying Data Science

OCDQ Radio - Data Quality and Big Data

Big Data and the Infinite Inbox

The Laugh-In Effect of Big Data

HoardaBytes and the Big Data Lebowski

Magic Elephants, Data Psychics, and Invisible Gorillas

Will Big Data be Blinded by Data Science?

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

Our Increasingly Data-Constructed World

The Wisdom of Crowds, Friends, and Experts

Data Separates Science from Superstition

Headaches, Data Analysis, and Negativity Bias

Why Data Science Storytelling Needs a Good Editor

Predictive Analytics, the Data Effect, and Jed Clampett

Rage against the Machines Learning

The Flying Monkeys of Big Data

Cargo Cult Data Science

Speed Up Your Data to Slow Down Your Decisions

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

The Need for Data Philosophers

In my post On Philosophy, Science, and Data, I explained that although some argue philosophy only reigns in the absence of data while science reigns in the analysis of data, a conceptual bridge still remains between analysis and insight, the crossing of which is itself a philosophical exercise.  Therefore, I argued that an endless oscillation persists between science and philosophy, which is why, despite the fact that all we hear about is the need for data scientists, there’s also a need for data philosophers.

Of course, the debate between science and philosophy is a very old one, as is the argument we need both.  In my previous post, I slightly paraphrased Immanuel Kant (“perception without conception is blind and conception without perception is empty”) by saying that science without philosophy is blind and philosophy without science is empty.

In his book Cosmic Apprentice: Dispatches from the Edges of Science, Dorion Sagan explained that science and philosophy hang “in a kind of odd balance, watching each other, holding hands.  Science’s eye for detail, buttressed by philosophy’s broad view, makes for a kind of alembic, an antidote to both.  Although philosophy isn’t fiction, it can be more personal, creative and open, a kind of counterbalance for science even as it argues that science, with its emphasis on a kind of impersonal materialism, provides a crucial reality check for philosophy and a tendency to over-theorize that’s inimical to the scientific spirit.  Ideally, in the search for truth, science and philosophy, the impersonal and autobiographical, can keep each other honest in a kind of open circuit.”

“Science’s spirit is philosophical,” Sagan concluded.  “It is the spirit of questioning, of curiosity, of critical inquiry combined with fact-checking.  It is the spirit of being able to admit you’re wrong, of appealing to data, not authority.”

“Science,” as his father Carl Sagan said, “is a way of thinking much more than it is a body of knowledge.”  By extension, we could say that data science is about a way of thinking much more than it is about big data or about being data-driven.

I have previously blogged that science has always been about bigger questions, not bigger data.  As Claude Lévi-Strauss said, “the scientist is not a person who gives the right answers, but one who asks the right questions.”  As far as data science goes, what are the right questions?  Data scientist Melinda Thielbar proposes three key questions (Actionable? Verifiable? Repeatable?).

Here again we see the interdependence of science and philosophy.  “Philosophy,” Marilyn McCord Adams said, “is thinking really hard about the most important questions and trying to bring analytic clarity both to the questions and the answers.”

“Philosophy is critical thinking,” Don Cupitt said. “Trying to become aware of how one’s own thinking works, of all the things one takes for granted, of the way in which one’s own thinking shapes the things one’s thinking about.”  Yes, even a data scientist’s own thinking could shape the things they are thinking scientifically about.  James Kobielus has blogged about five biases that may crop up in a data scientist’s work (Cognitive, Selection, Sampling, Modeling, Funding).

“Data science has a bright future ahead,” explained Hilary Mason in a recent interview.  “There will only be more data, and more of a need for people who can find meaning and value in that data.  We’re also starting to see a greater need for data engineers, people to build infrastructure around data and algorithms, and data artists, people who can visualize the data.”

I agree with Mason, and I would add that we are also starting to see a greater need for data philosophers, people who can, borrowing the words that Anthony Kenny used to define philosophy, “think as clearly as possible about the most fundamental concepts that reach through all the disciplines.”

Keep Looking Up Insights in Data

In a previous post, I used the history of the Hubble Space Telescope to explain how data cleansing saves lives, based on a true story I read in the book Space Chronicles: Facing the Ultimate Frontier by Neil deGrasse Tyson.  In this post, Hubble and Tyson once again provide the inspiration for an insightful metaphor about data quality.

Hubble is one of dozens of space telescopes of assorted sizes and shapes orbiting the Earth.  “Each one,” Tyson explained, “provides a view of the cosmos that is unobstructed, unblemished, and undiminished by Earth’s turbulent and murky atmosphere.  They are designed to detect bands of light invisible to the human eye, some of which never penetrate Earth’s atmosphere.  Hubble is the first and only space telescope to observe the universe using primarily visible light.  Its stunningly crisp, colorful, and detailed images of the cosmos make Hubble a kind of supreme version of the human eye in space.”

This is how we’d like the quality of data to be when we’re looking for business insights.  High-quality data provides stunningly crisp, colorful, and detailed images of the business cosmos, acting as a kind of supreme version of the human eye in data.

However, despite their less-than-perfect vision, Earth-based telescopes still facilitated significant scientific breakthroughs long before Hubble was launched in 1990.

In 1609, when the Italian physicist and astronomer Galileo Galilei turned a telescope of his own design to the sky, as Tyson explained, he “heralded a new era of technology-aided discovery, whereby the capacities of the human senses could be extended, revealing the natural world in unprecedented, even heretical ways.  The fact that Galileo revealed the Sun to have spots, the planet Jupiter to have satellites [its four largest moons: Callisto, Ganymede, Europa, Io], and Earth not to be the center of all celestial motion was enough to unsettle centuries of Aristotelian teachings by the Catholic Church and to put Galileo under house arrest.”

And in 1964, another Earth-based telescope, this one operated by the American astronomers Arno Penzias and Robert Wilson at AT&T Bell Labs, was responsible for what is widely considered the most important single discovery in astrophysics, what’s now known as cosmic microwave background radiation, and for which Penzias and Wilson won the 1978 Nobel Prize in Physics.

Recently, I’ve blogged about how there are times when perfect data quality is necessary, when we need the equivalent of a space telescope, and times when okay data quality is good enough, when the equivalent of an Earth-based telescope will do.

What I would like you to take away from this post is that perfect data quality is not a prerequisite for the discovery of new business insights.  Even when data doesn’t provide a perfect view of the business cosmos, even when it’s partially obstructed, blemished, or diminished by the turbulent and murky atmosphere of poor quality, data can still provide business insights.

This doesn’t mean that you should settle for poor data quality, just that you shouldn’t demand perfection before using data.

Tyson ends each episode of his StarTalk Radio program by saying “keep looking up,” so I will end this blog post by saying, even when its quality isn’t perfect, keep looking up insights in data.

 

Related Posts

Data Quality and the OK Plateau

When Poor Data Quality Kills

How Data Cleansing Saves Lives

The Dichotomy Paradox, Data Quality and Zero Defects

The Asymptote of Data Quality

To Our Data Perfectionists

Data Quality and Miracle Exceptions

Data Quality: Quo Vadimus?

The Seventh Law of Data Quality

A Tale of Two Q’s

Data Quality and The Middle Way

Stop Poor Data Quality STOP

Freudian Data Quality

Predictably Poor Data Quality

This isn’t Jeopardy

Satisficing Data Quality

The Laugh-In Effect of Big Data

Although I am an advocate for data science and big data done right, lately I have been sounding the Anti-Hype Horn with blog posts offering a contrarian’s view of unstructured data, forewarning you about the flying monkeys of big data, cautioning you against performing Cargo Cult Data Science, and inviting you to ponder the perils of the Infinite Inbox.

The hype of big data has resulted in a lot of people and vendors extolling its virtues with stories about how Internet companies, political campaigns, and new technologies have profited, or otherwise benefited, from big data.  These stories are served up as alleged business cases for investing in big data and data science.  Although some of these stories are fluff pieces, many of them accurately, and in some cases comprehensively, describe a real-world application of big data and data science.  However, these messages most often lack a critically important component — applicability to your specific business.  In Made to Stick: Why Some Ideas Survive and Others Die, Chip Heath and Dan Heath explained that “an accurate but useless idea is still useless.  If a message can’t be used to make predictions or decisions, it is without value, no matter how accurate or comprehensive it is.”

Rowan & Martin’s Laugh-In was an American sketch comedy television series, which aired from 1968 to 1973.  One of the recurring characters portrayed by Arte Johnson was Wolfgang the German soldier, who would often comment on the previous comedy sketch by saying (in a heavy and long-drawn-out German accent): “Very interesting . . . but stupid!”

From now on whenever someone shares another interesting story masquerading as a solid business case for big data that lacks any applicability beyond the specific scenario in the story, a common phenomenon I call The Laugh-In Effect of Big Data, my unapologetic response will resoundingly be: “Very interesting . . . but stupid!”

 

Related Posts

On Philosophy, Science, and Data

OCDQ Radio - Demystifying Data Science

OCDQ Radio - Data Quality and Big Data

Big Data and the Infinite Inbox

HoardaBytes and the Big Data Lebowski

Will Big Data be Blinded by Data Science?

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Big Data el Memorioso

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

Dot Collectors and Dot Connectors

The Wisdom of Crowds, Friends, and Experts

A Contrarian’s View of Unstructured Data

The Flying Monkeys of Big Data

Cargo Cult Data Science

A Statistically Significant Resolution for 2013

Speed Up Your Data to Slow Down Your Decisions

Rage against the Machines Learning

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Swimming in Big Data

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

The Costs and Profits of Poor Data Quality

Continuing the theme of my two previous posts, which discussed when it’s okay to call data quality as good as it needs to get and when perfect data quality is necessary, in this post I want to briefly discuss the costs — and profits — of poor data quality.

Loraine Lawson interviewed Ted Friedman of Gartner Research about How to Measure the Cost of Data Quality Problems, such as the costs associated with reduced productivity, redundancies, business processes breaking down because of data quality issues, regulatory compliance risks, and lost business opportunities.  David Loshin blogged about the challenge of estimating the cost of poor data quality, noting that many estimates, upon close examination, seem to rely exclusively on anecdotal evidence.
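
For readers who want something more concrete than anecdotes, here is a minimal back-of-envelope sketch in Python of one common way to frame such an estimate: multiply the number of defective records that actually get used by an estimated cost to research and correct each one.  Every figure below is a hypothetical assumption, not a benchmark:

    # Minimal back-of-envelope sketch of a cost-of-poor-data-quality estimate.
    # Every figure below is a hypothetical assumption, not a benchmark;
    # replace each one with numbers measured in your own organization.

    customer_records = 1_000_000     # records in scope (assumed)
    defect_rate = 0.05               # share of records with a data quality issue (assumed)
    share_actually_used = 0.60       # share of defective records someone touches in a year (assumed)
    rework_cost_per_record = 10.00   # labor cost to research and correct one record (assumed)

    # Only the defective records that someone actually uses generate rework cost.
    defective_records = customer_records * defect_rate
    annual_rework_cost = defective_records * share_actually_used * rework_cost_per_record

    print(f"Defective records:  {defective_records:,.0f}")
    print(f"Annual rework cost: ${annual_rework_cost:,.0f}")

Of course, the shortcoming David Loshin points out applies here too: the defect rate and unit costs are themselves often guesses, so the output is only as credible as the measurements behind the inputs.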

A recent Mental Floss article recounted 10 Very Costly Typos.  Among them: the 1962 missing hyphen in programming code that led to the destruction of the Mariner 1 spacecraft, at a cost of $80 million; the 2007 Roswell, New Mexico car dealership promotion in which, instead of 1 out of 50,000 scratch lottery tickets revealing a $1,000 cash grand prize, all of the tickets were printed as grand-prize winners (a $50 million payout had they been honored; $250,000 in Walmart gift certificates were given out instead); and, more recently, the March 2013 typographical error in the price of pay-per-ride cards on 160,000 maps and posters that cost New York City’s Transportation Authority approximately $500,000.

Although we often think only about the costs of poor data quality, the article also shared 2010 research performed at Harvard University claiming that Google profits an estimated $497 million a year from people mistyping the names of popular websites and landing on typosquatter sites, which just happen to be conveniently littered with Google ads.

Poor data quality has also long played an important role in improving Google Search, where misspellings of search terms entered by users (and not just a spellchecker program) are leveraged by the algorithm that provides the Did you mean, Including results for, and Search instead for help text displayed at the top of the first page of Google Search results.
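
Google’s actual query-correction algorithms are far more sophisticated (and proprietary), but a minimal sketch of the underlying idea, fuzzy-matching a misspelled query against terms users have actually searched for, fits in a few lines of Python using only the standard library; the query log below is a made-up example:

    # Minimal sketch of a "Did you mean" style suggestion.
    # This illustrates the general idea of fuzzy-matching a misspelled query
    # against previously seen search terms; it is not Google's algorithm.
    import difflib

    # Hypothetical log of frequently searched terms (made-up example data).
    frequent_queries = ["data quality", "data governance", "master data management",
                        "data warehouse", "big data"]

    def did_you_mean(query, known_queries):
        """Return the closest known query, or None if nothing is similar enough."""
        matches = difflib.get_close_matches(query, known_queries, n=1, cutoff=0.6)
        return matches[0] if matches else None

    print(did_you_mean("data qualty", frequent_queries))     # -> data quality
    print(did_you_mean("data goverance", frequent_queries))  # -> data governance

Real search engines learn these corrections from the aggregate behavior of millions of users, which is exactly the sense in which those poor-quality misspelled queries become a valuable data asset.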

What examples (or calculation methods) can you provide about either the costs or profits associated with poor data quality?

 

Related Posts

Promoting Poor Data Quality

Data Quality and the Cupertino Effect

The Data Quality Wager

Data Quality and the OK Plateau

When Poor Data Quality Kills

How Data Cleansing Saves Lives

The Dichotomy Paradox, Data Quality and Zero Defects

Data Quality and Miracle Exceptions

Data Quality: Quo Vadimus?

Data and its Relationships with Quality

The Seventh Law of Data Quality

A Tale of Two Q’s

Paleolithic Rhythm and Data Quality

Groundhog Data Quality Day

Data Quality and The Middle Way

Stop Poor Data Quality STOP

When Poor Data Quality Calls

Freudian Data Quality

Predictably Poor Data Quality

Satisficing Data Quality

Data Quality and the OK Plateau

In his book Moonwalking with Einstein: The Art and Science of Remembering, Joshua Foer explained that “when people first learn to use a keyboard, they improve very quickly from sloppy single-finger pecking to careful two-handed typing, until eventually the fingers move so effortlessly across the keys that the whole process becomes unconscious and the fingers seem to take on a mind of their own.”

“At this point,” Foer continued, “most people’s typing skills stop progressing.  They reach a plateau.  If you think about it, it’s a strange phenomenon.  After all, we’ve always been told that practice makes perfect, and many people sit behind a keyboard for at least several hours a day in essence practicing their typing.  Why don’t they just keep getting better and better?”

Foer then recounted research performed in the 1960s by the psychologists Paul Fitts and Michael Posner, which described the three stages that everyone goes through when acquiring a new skill:

  1. Cognitive — During this stage, you intellectualize the task and discover new strategies to accomplish it more proficiently.
  2. Associative — During this stage, you concentrate less, make fewer major errors, and generally become more efficient.
  3. Autonomous — During this stage, you have gotten as good as you need to get, and are basically running on autopilot.

“During that autonomous stage,” Foer explained, “you lose conscious control over what you are doing.  Most of the time that’s a good thing.  Your mind has one less thing to worry about.  In fact, the autonomous stage seems to be one of those handy features that evolution worked out for our benefit.  The less you have to focus on the repetitive tasks of everyday life, the more you can concentrate on the stuff that really matters, the stuff you haven’t seen before.  And so, once we’re just good enough at typing, we move it to the back of our mind’s filing cabinet and stop paying it any attention.”

“You can see this shift take place in fMRI scans of people learning new skills.  As a task becomes automated, parts of the brain involved in conscious reasoning become less active and other parts of the brain take over.  You could call it the OK plateau, the point at which you decide you’re OK with how good you are at something, turn on autopilot, and stop improving.”

“We all reach OK plateaus in most things we do,” Foer concluded.  “We learn how to drive when we’re in our teens and once we’re good enough to avoid tickets and major accidents, we get only incrementally better.  My father has been playing golf for forty years, and he’s still a duffer.  In four decades his handicap hasn’t fallen even a point.  Why?  He reached an OK plateau.”

I believe that data quality improvement initiatives also eventually reach an OK Plateau, a point just short of data perfection, where the diminishing returns of chasing after zero defects give way to calling data quality as good as it needs to get.

As long as the autopilot keeps accepting that data quality is a journey not a destination, keeps preventing data quality from getting worse, and keeps making sure best practices don’t stop being practiced, then I’m OK with data quality and the OK plateau.  Are you OK?

 

Related Posts

The Dichotomy Paradox, Data Quality and Zero Defects

The Asymptote of Data Quality

To Our Data Perfectionists

Data Quality and Miracle Exceptions

Data Quality: Quo Vadimus?

The Seventh Law of Data Quality

Data Quality and The Middle Way

Freudian Data Quality

Predictably Poor Data Quality

Satisficing Data Quality

Expectation and Data Quality

One of my favorite recently read books is You Are Not So Smart by David McRaney.  Earlier this week, the book’s chapter about expectation was excerpted as an online article on Why We Can’t Tell Good Wine From Bad, which also provided additional examples about how we can be fooled by altering our expectations.

“In one Dutch study,” McRaney explained, “participants were put in a room with posters proclaiming the awesomeness of high-definition, and were told they would be watching a new high-definition program.  Afterward, the subjects said they found the sharper, more colorful television to be a superior experience to standard programming.”

No surprise there, right?  After all, a high-definition television is expected to produce a high-quality image.

“What they didn’t know,” McRaney continued, “was they were actually watching a standard-definition image.  The expectation of seeing a better quality image led them to believe they had.  Recent research shows about 18 percent of people who own high-definition televisions are still watching standard-definition programming on the set, but think they are getting a better picture.”

I couldn’t help but wonder if establishing an expectation of delivering high-quality data could lead business users to believe that, for example, the data quality of the data warehouse met or exceeded their expectations.  Could business users actually be fooled by altering their expectations about data quality?  Wouldn’t their experience of using the data eventually reveal the truth?

Retailers expertly manipulate us with presentation, price, good marketing, and great service in order to create an expectation of quality in the things we buy.  “The actual experience is less important,” McRaney explained.  “As long as it isn’t total crap, your experience will match up with your expectations.  The build up to an experience can completely change how you interpret the information reaching your brain from your otherwise objective senses.  In psychology, true objectivity is pretty much considered to be impossible.  Memories, emotions, conditioning, and all sorts of other mental flotsam taint every new experience you gain.  In addition to all this, your expectations powerfully influence the final vote in your head over what you believe to be reality.”

“Your expectations are the horse,” McRaney concluded, “and your experience is the cart.”  You might think it should be the other way around, but when your expectations determine your direction, you shouldn’t be surprised by the journey you experience.

If you find it difficult to imagine a positive expectation causing people to overlook poor quality in their experience with data, how about the opposite?  I have seen the first impression of a data warehouse, initially marred by poor data quality, create a negative expectation that caused people to overlook the improved data quality in their subsequent experiences with it.  Once people expect to experience poor data quality, they stop trusting, and stop using, the data warehouse.

Data warehousing is only one example of how expectation can affect the data quality experience.  How are your organization’s expectations affecting its experiences with data quality?

On Philosophy, Science, and Data

Ever since Melinda Thielbar helped me demystify data science on OCDQ Radio, I have been pondering my paraphrasing of an old idea: Science without philosophy is blind; Philosophy without science is empty; Data needs both science and philosophy.

“A philosopher’s job is to find out things about the world by thinking rather than observing,” the philosopher Bertrand Russell once said.  One could say a scientist’s job is to find out things about the world by observing and experimenting.  In fact, Russell observed that “the most essential characteristic of scientific technique is that it proceeds from experiment, not from tradition.”

Russell also said that “science is what we know, and philosophy is what we don’t know.”  However, Stuart Firestein, in his book Ignorance: How It Drives Science, explained “there is no surer way to screw up an experiment than to be certain of its outcome.”

Although it seems it would make more sense for science to be driven by what we know, by facts, “working scientists,” according to Firestein, “don’t get bogged down in the factual swamp because they don’t care that much for facts.  It’s not that they discount or ignore them, but rather that they don’t see them as an end in themselves.  They don’t stop at the facts; they begin there, right beyond the facts, where the facts run out.  Facts are selected for the questions they create, for the ignorance they point to.”

In this sense, philosophy and science work together to help us think about and experiment with what we do and don’t know.

Some might argue that while anyone can be a philosopher, being a scientist requires more rigorous training.  A commonly stated requirement in the era of big data is to hire data scientists, but this raises the question: Is data science only for data scientists?

“Clearly what we need,” Firestein explained, “is a crash course in citizen science—a way to humanize science so that it can be both appreciated and judged by an informed citizenry.  Aggregating facts is useless if you don’t have a context to interpret them.”

I would argue that clearly what organizations need is a crash course in data science—a way to humanize data science so that it can be both appreciated and judged by an informed business community.  Big data is useless if you don’t have a business context to interpret it.  Firestein also made great points about science not being exclusionary (i.e., not just for scientists).  Just as you can enjoy watching sports without being a professional athlete and you can appreciate music without being a professional musician, you can—and should—learn the basics of data science (especially statistics) without being a professional data scientist.

In order to truly deliver business value to organizations, data science cannot be exclusionary.  This doesn’t mean you shouldn’t hire data scientists.  In many cases, you will need the expertise of professional data scientists.  However, you will not be able to direct them or interpret their findings without understanding the basics, what could be called the philosophy of data science.

Some might argue that philosophy only reigns in the absence of data, while science reigns in the analysis of data.  Although in the era of big data there seems to be fewer areas truly absent of data, a conceptual bridge still remains between analysis and insight, the crossing of which is itself a philosophical exercise.  So, an endless oscillation persists between science and philosophy, which is why science without philosophy is blind, and philosophy without science is empty.  Data needs both science and philosophy.

Data Governance needs Searchers, not Planners

In his book Everything Is Obvious: How Common Sense Fails Us, Duncan Watts explained that “plans fail, not because planners ignore common sense, but rather because they rely on their own common sense to reason about the behavior of people who are different from them.”

As development economist William Easterly explained, “A Planner thinks he already knows the answer; A Searcher admits he doesn’t know the answers in advance.  A Planner believes outsiders know enough to impose solutions; A Searcher believes only insiders have enough knowledge to find solutions, and that most solutions must be homegrown.”

I made a similar point in my post Data Governance and the Adjacent Possible.  Change management efforts are resisted when they impose new methods by emphasizing only the bad business and technical processes, and the bad data-related employee behaviors, while ignoring the unheralded processes and employees whose existing methods are already preventing other problems from happening.

Demonstrating that some data governance policies reflect existing best practices reduces resistance to change by showing that the search for improvement was not limited to only searching for what is currently going wrong.

This is why data governance needs Searchers, not Planners.  A Planner thinks a framework provides all the answers; A Searcher knows a data governance framework is like a jigsaw puzzle.  A Planner believes outsiders (authorized by executive management) know enough to impose data governance solutions; A Searcher believes only insiders (united by collaboration) have enough knowledge to find the ingredients for data governance solutions, and a true commitment to change always comes from within.

 

Related Posts

The Hawthorne Effect, Helter Skelter, and Data Governance

Cooks, Chefs, and Data Governance

Data Governance Frameworks are like Jigsaw Puzzles

Data Governance and the Buttered Cat Paradox

Data Governance Star Wars: Bureaucracy versus Agility

Beware the Data Governance Ides of March

Aristotle, Data Governance, and Lead Rulers

Data Governance and the Adjacent Possible

The Three Most Important Letters in Data Governance

The Data Governance Oratorio

An Unsettling Truth about Data Governance

The Godfather of Data Governance

Over the Data Governance Rainbow

Getting Your Data Governance Stuff Together

Datenvergnügen

Council Data Governance

A Tale of Two G’s

Declaration of Data Governance

The Role Of Data Quality Monitoring In Data Governance

The Collaborative Culture of Data Governance

Demystifying Data Science

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, special guest and actual data scientist Dr. Melinda Thielbar, a Ph.D. Statistician, and I attempt to demystify data science by explaining what a data scientist does, including the requisite skills involved, bridging the communication gap between data scientists and business leaders, delivering data products business users can use on their own, and providing a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, experimentation, and correlation.

Melinda Thielbar is the Senior Mathematician for IAVO Research and Scientific.  Her work there focuses on power system optimization using real-time prediction models.  She has worked as a software developer, an analytic lead for big data implementations, and a statistics and programming teacher.

Melinda Thielbar is a co-founder of Research Triangle Analysts, a professional group for analysts and data scientists located in the Research Triangle of North Carolina.

While Melinda Thielbar doesn’t specialize in a single field, she is particularly interested in power systems because, as she puts it, “A power systems optimizer has to work every time.”

OCDQ Radio Episode 35: Demystifying Data Science
Jim Harris with Guest Melinda Thielbar

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including whether data quality matters less in larger data sets and whether statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.