OCDQ Blog

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

Our Increasingly Data-Constructed World

July 12, 2012

Quality is the Higgs Field of Data

July 12, 2012/ Jim Harris

Recently on Twitter, Daragh O Brien replied to my David Weinberger quote “The atoms of data hook together only because they share metadata,” by asking “So, is Quality Data the Higgs Boson of Information Management?”

I responded that Quality is the Higgs Boson of Data and Information since Quality gives Data and Information their Mass (i.e., their Usefulness).

“Now that is profound,” Daragh replied.

“That’s cute and all,” Brian Panulla interjected, “but you can’t measure Quality. Mass is objective. It’s more like Weight — a mass in context.”

I agreed with Brian’s great point since in a previous post I explained the often misunderstood difference between mass, an intrinsic property of matter based on atomic composition, and weight, a gravitational force acting on matter.

Using these concepts metaphorically, mass is an intrinsic property of data, representing objective data quality, whereas weight is a gravitational force acting on data, thereby representing subjective data quality.

But my previous post didn’t explain where matter theoretically gets its mass, and since this scientific mystery was radiating in the cosmic background of my Twitter banter with Daragh and Brian, I decided to use this post to attempt a brief explanation along the way to yet another data quality analogy.

As you have probably heard by now, big scientific news was recently reported about the discovery of the Higgs Boson, which, since the 1960s, the Standard Model of particle physics has theorized to be the fundamental particle associated with a ubiquitous quantum field (referred to as the Higgs Field) that gives all matter its mass by interacting with the particles that make up atoms and weighing them down. This is foundational to our understanding of the universe because without something to give mass to the basic building blocks of matter, everything would behave the same way as the intrinsically mass-less photons of light behave, floating freely and not combining with other particles. Therefore, without mass, ordinary matter, as we know it, would not exist.

Ping-Pong Balls and Maple Syrup

I like the Higgs Field explanation provided by Brian Cox and Jeff Forshaw. “Imagine you are blindfolded, holding a ping-pong ball by a thread. Jerk the string and you will conclude that something with not much mass is on the end of it. Now suppose that instead of bobbing freely, the ping-pong ball is immersed in thick maple syrup. This time if you jerk the thread you will encounter more resistance, and you might reasonably presume that the thing on the end of the thread is much heavier than a ping-pong ball. It is as if the ball is heavier because it gets dragged back by the syrup.”

“Now imagine a sort of cosmic maple syrup that pervades the whole of space. Every nook and cranny is filled with it, and it is so pervasive that we do not even know it is there. In a sense, it provides the backdrop to everything that happens.”

Mass is therefore generated as a result of an interaction between the ping-pong balls (i.e., atomic particles) and the maple syrup (i.e, the Higgs Field). However, although the Higgs Field is pervasive, it is also variable and selective, since some particles are affected by the Higgs Field more than others, and photons pass through it unimpeded, thereby remaining mass-less particles.

Quality — Data Gets Higgy with It

Now that I have vastly oversimplified the Higgs Field, let me Get Higgy with It by attempting an analogy for data quality based on the Higgs Field. As I do, please remember the wise words of Karen Lopez: “All analogies are perfectly imperfect.”

Quality provides the backdrop to everything that happens when we use data. Data in the wild, independent from use, is as carefree as the mass-less photon whizzing around at the speed of light, like a ping-pong ball bouncing along without a trace of maple syrup on it. But once we interact with data using our sticky-maple-syrup-covered fingers, data begins to slow down, begins to feel the effects of our use. We give data mass so that it can become the basic building blocks of what matters to us.

Some data is affected more by our use than others. The more subjective our use, the more we weigh data down. The more objective our use, the less we weigh data down. Sometimes, we drag data down deep into the maple syrup, covering data up with an application layer, or bottling data into silos. Other times, we keep data in the shallow end of the molasses swimming pool.

Quality is the Higgs Field of Data. As users of data, we are the Higgs Bosons — we are the fundamental particles associated with a ubiquitous data quality field. By using data, we give data its quality. The quality of data can not be separated from its use any more than the particles of the universe can be separated from the Higgs Field.

The closest data equivalent of a photon, a ping-pong ball particle that doesn’t get stuck in the maple syrup of the Higgs Field, is Open Data, which doesn’t get stuck within silos, but is instead data freely shared without the sticky quality residue of our use.

What is Weighing Down your Data?

Data Myopia and Business Relativity

Redefining Data Quality

Are Applications the La Brea Tar Pits for Data?

Swimming in Big Data

Sometimes it’s Okay to be Shallow

Data Quality and Big Data

Data Quality and the Q Test

My Own Private Data

No Datum is an Island of Serendip

Sharing Data

July 10, 2012

Shining a Social Light on Data Quality

July 10, 2012/ Jim Harris

Last week, when I published my blog post Lightning Strikes the Cloud, I unintentionally demonstrated three important things about data quality.

The first thing I demonstrated was even an obsessive-compulsive data quality geek is capable of data defects, since I initially published the post with the title Lightening Strikes the Cloud, which is an excellent example of the difference between validity and accuracy caused by the Cupertino Effect, since although lightening is valid (i.e., a correctly spelled word), it isn’t contextually accurate.

The second thing I demonstrated was the value of shining a social light on data quality — the value of using collaborative tools like social media to crowd-source data quality improvements. Thankfully, Julian Schwarzenbach quickly noticed my error on Twitter. “Did you mean lightning? The concept of lightening clouds could be worth exploring further,” Julian humorously tweeted. “Might be interesting to consider what happens if the cloud gets so light that it floats away.” To which I replied that if the cloud gets so light that it floats away, it could become Interstellar Computing or, as Julian suggested, the start of the Intergalactic Net, which I suppose is where we will eventually have to store all of that big data we keep hearing so much about these days.

The third thing I demonstrated was the potential dark side of data cleansing, since the only remaining trace of my data defect is a broken URL. This is an example of not providing a well-documented audit trail, which is necessary within an organization to communicate data quality issues and resolutions.

Communication and collaboration are essential to finding our way with data quality. And social media can help us by providing more immediate and expanded access to our collective knowledge, experience, and wisdom, and by shining a social light that illuminates the shadows cast upon data quality issues when a perception filter or bystander effect gets the better of our individual attention or undermines our collective best intentions — which, as I recently demonstrated, occasionally happens to all of us.

Data Quality and the Cupertino Effect

Are you turning Ugly Data into Cute Information?

The Importance of Envelopes

The Algebra of Collaboration

Finding Data Quality

The Wisdom of the Social Media Crowd

Perception Filters and Data Quality

Data Quality and the Bystander Effect

The Family Circus and Data Quality

Data Quality and the Q Test

Metadata, Data Quality, and the Stroop Test

The Three Most Important Letters in Data Governance

July 02, 2012

Saving Private Data

July 02, 2012/ Jim Harris

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

This episode is an edited rebroadcast of a segment from the OCDQ Radio 2011 Year in Review, during which Daragh O Brien and I discuss the data privacy and data protection implications of social media, cloud computing, and big data.

Daragh O Brien is one of Ireland’s leading Information Quality and Governance practitioners. After being born at a young age, Daragh has amassed a wealth of experience in quality information driven business change, from CRM Single View of Customer to Regulatory Compliance, to Governance and the taming of information assets to benefit the bottom line, manage risk, and ensure customer satisfaction. Daragh O Brien is the Managing Director of Castlebridge Associates, one of Ireland’s leading consulting and training companies in the information quality and information governance space.

Daragh O Brien is a founding member and former Director of Publicity for the IAIDQ, which he is still actively involved with. He was a member of the team that helped develop the Information Quality Certified Professional (IQCP) certification and he recently became the first person in Ireland to achieve this prestigious certification.

In 2008, Daragh O Brien was awarded a Fellowship of the Irish Computer Society for his work in developing and promoting standards of professionalism in Information Management and Governance.

Daragh O Brien is a regular conference presenter, trainer, blogger, and author with two industry reports published by Ark Group, the most recent of which is The Data Strategy and Governance Toolkit.

You can also follow Daragh O Brien on Twitter and connect with Daragh O Brien on LinkedIn.

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

So Long 2011, and Thanks for All the . . . — The OCDQ Radio 2011 Year in Review, featuring Jarrett Goldfedder, who discusses Big Data, Nicola Askham, who discusses Data Governance, and Daragh O Brien, who discusses Data Privacy.

The Blue Box of Information Quality — Guest Daragh O Brien on why Information Quality is bigger on the inside, using stories as an analytical tool and change management technique, and why we must never forget that “people are cool.”

Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.

Big Data and Big Analytics — Guests Jill Dyché and Dan Soceanu discuss big trends in Business Intelligence, including Cloud, Collaboration, and Big Data, the last of which lead to a discussion about Big Analytics.

Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.

Decision Management Systems — Guest James Taylor discusses concepts from his book: Decision Management Systems: A Practical Guide to Using Business Rules and Predictive Analytics

Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.

The Data Governance Imperative — Guest Steve Sarsfield discusses his book The Data Governance Imperative, explaining how data governance is about changing the hearts and minds of your company to see the value of data quality.

Social Media Strategy — Guest Crysta Anderson of IBM Initiate explains social media strategy and content marketing, including three recommended practices: (1) Listen intently, (2) Communicate succinctly, and (3) Have fun.

The Fall Back Recap Show — A look back at the Best of OCDQ Radio, including discussions about Data, Information, Business-IT Collaboration, Change Management, Big Analytics, Data Governance, and the Data Revolution.

March 13, 2012

Commendable Comments (Part 12)

March 13, 2012/ Jim Harris

Since I officially launched this blog on March 13, 2009, that makes today the Third Blogiversary of OCDQ Blog!

So, absolutely without question, there is no better way to commemorate this milestone other than to also make this the 12th entry in my ongoing series for expressing my gratitude to my readers for their truly commendable comments on my blog posts.

Commendable Comments

On Big Data el Memorioso, Mark Troester commented:

“I think this helps illustrate that one size does not fit all.

You can’t take a singular approach to how you design for big data. It’s all about identifying relevance and understanding that relevance can change over time.

There are certain situations where it makes sense to leverage all of the data, and now with high performance computing capabilities that include in-memory, in-DB and grid, it's possible to build and deploy rich models using all data in a short amount of time. Not only can you leverage rich models, but you can deploy a large number of models that leverage many variables so that you get optimal results.

On the other hand, there are situations where you need to filter out the extraneous information and the more intelligent you can be about identifying the relevant information the better.

The traditional approach is to grab the data, cleanse it, and land it somewhere before processing or analyzing the data. We suggest that you leverage analytics up front to determine what data is relevant as it streams in, with relevance based on your organizational knowledge or context. That helps you determine what data should be acted upon immediately, where it should be stored, etc.

And, of course, there are considerations about using visual analytic techniques to help you determine relevance and guide your analysis, but that’s an entire subject just on its own!”

On Data Governance Frameworks are like Jigsaw Puzzles, Gabriel Marcan commented:

“I agree (and like) the jigsaw puzzles metaphor. I would like to make an observation though:

Can you really construct Data Governance one piece at a time?

I would argue you need to put together sets of pieces simultaneously, and to ensure early value, you might want to piece together the interesting / easy pieces first.

Hold on, that sounds like the typical jigsaw strategy anyway . . . :-)”

On Data Governance Frameworks are like Jigsaw Puzzles, Doug Newdick commented:

“I think that there are a number of more general lessons here.

In particular, the description of the issues with data governance sounds very like the issues with enterprise architecture. In general, there are very few eureka moments in solving the business and IT issues plaguing enterprises. These solutions are usually 10% inspiration, 90% perspiration in my experience. What looks like genius or a sudden breakthrough is usually the result of a lot of hard work.

I also think that there is a wider Myth of the Framework at play too.

The myth is that if we just select the right framework then everything else will fall into place. In reality, the selection of the framework is just the start of the real work that produces the results. Frameworks don’t solve your problems, people solve your problems by the application of brain-power and sweat.

All frameworks do is take care of some of the heavy-lifting, i.e., the mundane foundational research and thinking activity that is not specific to your situation.

Unfortunately the myth of the framework is why many organizations think that choosing TOGAF will immediately solve their IT issues and are then disappointed when this doesn’t happen, when a more sensible approach might have garnered better long-term success.”

On Data Quality: Quo Vadimus?, Richard Jarvis commented:

“I agree with everything you’ve said, but there’s a much uglier truth about data quality that should also be discussed — the business benefit of NOT having a data quality program.

The unfortunate reality is that in a tight market, the last thing many decision makers want to be made public (internally or externally) is the truth.

In a company with data quality principles ingrained in day-to-day processes, and reporting handled independently, it becomes much harder to hide or reinterpret your falling market share. Without these principles though, you’ll probably be able to pick your version of the truth from a stack of half a dozen, then spend your strategy meeting discussing which one is right instead of what you’re going to do about it.

What we’re talking about here is the difference between a Politician — who will smile at the camera and proudly announce 0.1% growth was a fantastic result given X, Y, and Z factors — and a Statistician who will endeavor to describe reality with minimal personal bias.

And the larger the organization, the more internal politics plays a part. I believe a lot of the reluctance in investing in data quality initiatives could be traced back to this fear of being held truly accountable, regardless of it being in the best interests of the organization. To build a data quality-centric culture, the change must be driven from the CEO down if it’s to succeed.”

On Data Quality: Quo Vadimus?, Peter Perera commented:

“The question: ‘Is Data Quality a Journey or a Destination?’ suggests that it is one or the other.

I agree with another comment that data quality is neither . . . or, I suppose, it could be both (the journey is the destination and the destination is the journey. They are one and the same.)

The quality of data (or anything for that matter) is something we experience.

Quality only radiates when someone is in the act of experiencing the data, and usually only when it is someone that matters. This radiation decays over time, ranging from seconds or less to years or more.

The only problem with viewing data quality as radiation is that radiation can be measured by an instrument, but there is no such instrument to measure data quality.

We tend to confuse data qualities (which can be measured) and data quality (which cannot).

In the words of someone whose name I cannot recall: ‘Quality is not job one. Being totally %@^#&$*% amazing is job one.’ The only thing I disagree with here is that being amazing is characterized as a ‘job.’

Data quality is not something we ‘do’ to data. It’s not a business initiative or project or job. It’s not a discipline. We need to distinguish between the pursuit (journey) of being amazing and actually being amazing (destination — but certainly not a final one). To be amazing requires someone to be amazed. We want data to be continuously amazing . . . to someone that matters, i.e., someone who uses and values the data a whole lot for an end that makes a material difference.

Come to think of it, the only prerequisite for data quality is being alive because that is the only way to experience it. If you come across some data and have an amazed reaction to it and can make a difference using it, you cannot help but experience great data quality. So if you are amazing people all the time with your data, then you are doing your data quality job very well.”

On Data Quality and Miracle Exceptions, Gordon Hamilton commented:

“Nicely delineated argument, Jim. Successfully starting a data quality program seems to be a balance between getting started somewhere and determining where best to start. The data quality problem is like a two-edged sword without a handle that is inflicting the ‘death of a thousand cuts’.

Data quality is indeed difficult to get ‘a handle on’.”

And since they generated so much great banter, please check out all of the commendable comments received by the blog posts There is No Such Thing as a Root Cause and You only get a Return from something you actually Invest in.

Thank You for Three Awesome Years

You are Awesome — which is why receiving your comments has been the most rewarding aspect of my blogging experience over the last three years. Even if you have never posted a comment, you are still awesome — feel free to tell everyone I said so.

This entry in the series highlighted commendable comments on blog posts published between December 2011 and March 2012.

Since there have been so many commendable comments, please don’t be offended if one of your comments wasn’t featured.

Please continue commenting and stay tuned for future entries in the series.

Thank you for reading the Obsessive-Compulsive Data Quality blog for the last three years. Your readership is deeply appreciated.

Commendable Comments (Part 11)

Commendable Comments (Part 10) – The 300th OCDQ Blog Post

730 Days and 264 Blog Posts Later – The Second Blogiversary of OCDQ Blog

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

Big Data and the Infinite Inbox

January 30, 2012

HoardaBytes and the Big Data Lebowski

January 30, 2012/ Jim Harris

The recent #GartnerChat on Big Data was an excellent Twitter discussion about what I often refer to as the Seven Letter Tsunami of the data management industry, which as Gartner Research explains, although the term acknowledges the exponential growth, availability, and use of information in today’s data-rich landscape, big data is about more than just data volume. Data variety (i.e., structured, semi-structured, and unstructured data, as well as other types, such as the sensor data emanating from the Internet of Things), and data velocity (i.e., how fast data is produced and how fast data must be processed to meet demand) are also key characteristics of the big challenges associated with the big buzzword that big data has become over the last year.

Since ours is an industry infatuated with buzzwords, Timo Elliott remarked “new terms arise because of new technology, not new business problems. Big Data came from a need to name Hadoop [and other technologies now being relentlessly marketed as big data solutions], so anybody using big data to refer to business problems is quickly going to tie themselves in definitional knots.”

To which Mark Troester responded, “the hype of Hadoop is driving pressure on people to keep everything — but they ignore the difficulty in managing it.” John Haddad then quipped that “big data is a hoarders dream,” which prompted Andy Bitterer to coin the term HoardaByte for measuring big data, and then asking, “Would the real Big Data Lebowski please stand up?”

HoardaBytes

Although it’s probably no surprise that a blogger with obsessive-compulsive in the title of his blog would like Bitterer’s new term, the fact is that whether you choose to measure it in terabytes, petabytes, exabytes, HoardaBytes, or how much reality bitterly bites, our organizations have been compulsively hoarding data for a long time.

And with silos replicating data as well as new data, and new types of data, being created and stored on a daily basis, managing all of the data is not only becoming impractical, but because we are too busy with the activity of trying to manage all of it, we are hoarding countless bytes of data without evaluating data usage, gathering data requirements, or planning for data archival.

The Big Data Lebowski

In The Big Lebowski, Jeff Lebowski (“The Dude”) is, in a classic data quality blunder caused by matching on person name only, mistakenly identified as millionaire Jeffrey Lebowski (“The Big Lebowski”) in an eccentric plot expected from a Coen brothers film, which, since its release in the late 1990s, has become a cult classic and inspired a religious following known as Dudeism.

Historically, a big part of the problem in our industry has been the fact that the word “data” is prevalent in the names we have given industry disciplines and enterprise information initiatives. For example, data architecture, data quality, data integration, data migration, data warehousing, master data management, and data governance — to name but a few.

However, all this achieved was to perpetuate the mistaken identification of data management as an esoteric technical activity that played little more than a minor, supporting, and often uncredited, role within the business activities of our organizations.

But since the late 1990s, there has been a shift in the perception of data. The real data deluge has not been the rising volume, variety, and velocity of data, but instead the rising awareness of the big impact that data has on nearly every aspect of our professional and personal lives. In this brave new data world, companies like Google and Facebook have built business empires mostly out of our own personal data, which is why, like it or not, as individuals, we must accept that we are all data geeks now.

All of the hype about Big Data is missing the point. The reality is that Data is Big — meaning that data has now so thoroughly pervaded mainstream culture that data has gone beyond being just a cult classic for the data management profession, and is now inspiring an almost religious following that we could call Dataism.

The Data must Abide

“The Dude abides. I don’t know about you, but I take comfort in that,” remarked The Stranger in The Big Lebowski.

The Data must also abide. And the Data must abide both the Business and the Individual. The Data abides the Business if data proves useful to our business activities. The Data abides the Individual if data protects the privacy of our personal activities.

The Data abides. I don’t know about you, but I would take more comfort in that than in any solutions The Stranger Salesperson wants to sell me that utilize an eccentric sales pitch involving HoardaBytes and the Big Data Lebowski.

The Laugh-In Effect of Big Data

The Need for Data Philosophers

OCDQ Radio - Demystifying Data Science

OCDQ Radio - Data Quality and Big Data

A Tale of Two Datas

i blog of Data glad and big

Big Data is Just Another Brick in the Wall

The Graystone Effects of Big Data

Magic Elephants, Data Psychics, and Invisible Gorillas

January 03, 2012

Best OCDQ Blog Posts of 2011

January 03, 2012/ Jim Harris

Welcome to my roundup of the best blog posts published on the Obsessive-Compulsive Data Quality (OCDQ) blog during 2011.

My selections were based on a pseudo-scientific, quasi-statistical combination of page views, comments, and re-tweets (as well as choosing a few of my personal favorites). Instead of ordering the posts chronologically, I decided to organize them by theme.

The Metadata Trilogy

Although it has an incredibly important role to play in data quality and its related disciplines, I don’t write about metadata very often. But the reader feedback that I received lead me to writing three blog posts about metadata in the span of a few weeks:

The Metadata Crisis — There is a running debate within many organizations over the meaning of commonly used terms, which complicates what on the surface seem like straightforward business questions.

The Metadata Continuum — There is a continuum, where at one end we have the uniformity of controlled vocabularies, and at the other end we have the flexibility of chaotic folksonomies. However, both flexibility and uniformity provide value.

You Say Potato and I Say Tater Tot — The demarcations of the borders between metadata, data, and information are important, but sometimes difficult to discern. In this post, I offer an explanation about these demarcations using potatoes.

The Data Governance Star Wars (one less than a) Trilogy

In June, Rob Karel of Forrester Research and I used a Star Wars themed blog mock debate to take on one of data governance’s biggest challenges — how to balance bureaucracy and business agility. Gwen Thomas of the Data Governance Institute joined Rob and I to continue the discussion during a special, extended, and Star Wars themed episode of OCDQ Radio:

Data Governance Star Wars: Balancing Bureaucracy and Agility — In character as OCDQ-Wan, I argue in favor of business agility and explain that Collaboration is the Data Governance Force.

Data Governance Star Wars on OCDQ Radio — In Part 1, Rob Karel and I discuss our blog mock debate, which is followed by a brief Star Wars themed intermission, and then in Part 2, Gwen Thomas joins us to provide her excellent insights.

Although not Star Wars themed, here are some additional Best OCDQ Blog Posts of 2011 on the topic of data governance:

The Three Most Important Letters in Data Governance — There are only three letters of difference between the words cooperative and competitive, which we could say are the three most important letters in data governance.

Data Governance and the Adjacent Possible — It’s important to demonstrate that some data governance policies reflect existing best practices, which helps reduce resistance to change, and therefore I advise: “If it ain’t broke, bricolage it.”

Aristotle, Data Governance, and Lead Rulers — Well-constructed data governance policies are like lead rulers — flexible rules that empower us with an understanding of the principle of the policy, and how to enforce it in a particular context.

The Stakeholder’s Dilemma — There will be times when sacrifices for the long-term greater good will require that stakeholders either contribute more resources during the current phase, or receive fewer benefits from its deliverables.

Beware the Data Governance Ides of March — My dramatized warning about relying too much on the top-down approach to implementing data governance — and especially if your organization has any data stewards named Brutus or Cassius.

Data Governance and the Buttered Cat Paradox — The fearless felines of the buttered-toast-paratrooper brigade ponder how to approach data governance — top-down or bottom-up. See the follow-up post: Zig-Zag-Diagonal Data Governance

OCDQ Radio

In June, I launched OCDQ Radio, which is a vendor-neutral podcast about data quality and the audio complement to this blog, providing me with a platform for recorded discussions with the great folks working in the data management industry. So far, there have been 21 episodes of OCDQ Radio, including 22 guests from 7 countries. Here are a few of the most popular episodes:

So Long 2011, and Thanks for All the . . . — The OCDQ Radio 2011 Year in Review, featuring Jarrett Goldfedder, who discusses Big Data, Nicola Askham, who discusses Data Governance, and Daragh O Brien, who discusses Data Privacy.

The Fall Back Recap Show — A look back at the Best of OCDQ Radio, including discussions about Data, Information, Business-IT Collaboration, Change Management, Big Analytics, Data Governance, and the Data Revolution.

Big Data and Big Analytics — Special Guests Jill Dyché and Dan Soceanu discuss big trends in Business Intelligence, including Cloud, Collaboration, and Big Data, the last of which lead to a discussion about Big Analytics.

Organizing for Data Quality — Guest Tom Redman (aka the “Data Doc”) discusses how your organization should approach data quality, including his call to action for your role in the data revolution.

Making EIM Work for Business — Guest John Ladley discusses his book Making EIM Work for Business, exploring what makes information management, not just useful, but valuable to the enterprise.

The Blue Box of Information Quality — Guest Daragh O Brien on why Information Quality is bigger on the inside, using stories as an analytical tool and change management technique, and why we must never forget that “people are cool.”

Master Data Management in Practice — Guests Dalton Cervo and Mark Allen discuss their book MDM in Practice, and how to properly prepare for a new MDM program.

Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.

Social Media Strategy — Guest Crysta Anderson of IBM Initiate explains social media strategy and content marketing, including three recommended practices: (1) Listen intently, (2) Communicate succinctly, and (3) Have fun.

The Best of the Rest

Plato’s Data — Data shapes our perception of the real world, but sometimes we forget that data is only a partial reflection of reality. This theme was also discussed on the OCDQ Radio episode Redefining Data Quality with Peter Perera.

There is No Such Thing as a Root Cause — There are no root causes, only strong correlations. And correlations are strengthened by continuous monitoring. This post received excellent comments, including great banter with Martin Doyle.

You only get a Return from something you actually Invest in — Invest in doing the hard daily work of continuously improving your data quality and putting into practice your data governance principles, policies, and procedures.

The Dichotomy Paradox, Data Quality and Zero Defects — Has your data quality practice become motionless by trying to prove that Zero Defects is more than just theoretically possible?

The Data Quality Wager — Inspired by Gordon Hamilton, my rendering of Pascal’s Wager in a data quality context.

DQ-View: Talking about Data — DQ-View video discussion about how data professionals should talk about data when invited to participate in business discussions within their organizations.

The Speed of Decision — Examines the constraints that time puts on data-driven decision making, pondering whether decision speed is more important than data quality and decision quality.

The Data Cold War — Examines how Google and Facebook have performed the Master Data Management Magic Trick and socialized data (“Information wants to be free!”) in order to capitalize data as a true corporate asset.

A Farscape Analogy for Data Quality — Ponders whether data is not viewed as an asset because data has so thoroughly pervaded the enterprise that data has become invisible to those who are so dependent upon its quality.

No Datum is an Island of Serendip — Our organizations need to create collaborative environments that foster serendipitous connections bringing all of our business units and people together around our shared data assets.

Thank You for Reading OCDQ Blog in 2011

In 2011, the Obsessive-Compulsive Data Quality (OCDQ) blog published 112 posts, which received 130,000 total page views, averaging 350 page views and 150 unique visitors a day.

Thank you for reading OCDQ Blog in 2011. Your readership was deeply appreciated.

So Long 2011, and Thanks for All the . . . – The OCDQ Radio 2011 Year in Review

2011 Quarterly Review of the Data Roundtable (Part 3)

2011 Quarterly Review of the Data Roundtable (Part 2)

2011 Quarterly Review of the Data Roundtable (Part 1)

Commendable Comments (Part 10) – The 300th OCDQ Blog Post

730 Days and 264 Blog Posts Later – The Second Blogiversary of OCDQ Blog

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

The Best Data Quality Blog Posts of 2010

December 29, 2011

So Long 2011, and Thanks for All the . . .

December 29, 2011/ Jim Harris

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Don’t Panic! Welcome to the mostly harmless OCDQ Radio 2011 Year in Review episode. During this approximately 42 minute episode, I recap the data-related highlights of 2011 in a series of sometimes serious, sometimes funny, segments, as well as make wacky and wildly inaccurate data-related predictions about 2012.

Special thanks to my guests Jarrett Goldfedder, who discusses Big Data, Nicola Askham, who discusses Data Governance, and Daragh O Brien, who discusses Data Privacy. Additional thanks to Rich Murnane and Dylan Jones. And Deep Thanks to that frood Douglas Adams, who always knew where his towel was, and who wrote The Hitchhiker’s Guide to the Galaxy.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.

Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.

Gaining a Competitive Advantage with Data — Guest William McKnight discusses some of the practical, hands-on guidance provided by his book Information Management: Strategies for Gaining a Competitive Advantage with Data.

Doing Data Governance — Guest John Ladley discusses his book How to Design, Deploy and Sustain Data Governance and how to understand the difference and relationship between data governance and enterprise information management.

Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).

Measuring Data Quality for Ongoing Improvement — Guest Laura Sebastian-Coleman discusses bringing together a better understanding of what is represented in data with the expectations for use in order to improve the overall quality of data.

The Blue Box of Information Quality — Guest Daragh O Brien on why Information Quality is bigger on the inside, using stories as an analytical tool and change management technique, and why we must never forget that “people are cool.”

Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.

Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.

The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.

The Art of Data Matching — Guest Henrik Liliendahl Sørensen discusses data matching concepts and practices, including different match techniques, candidate selection, presentation of match results, and business applications of data matching.

Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

December 08, 2011

You only get a Return from something you actually Invest in

December 08, 2011/ Jim Harris

In my previous post, I took a slightly controversial stance on a popular three-word phrase — Root Cause Analysis. In this post, it’s another popular three-word phrase — Return on Investment (most commonly abbreviated as the acronym ROI).

What is the ROI of purchasing a data quality tool or launching a data governance program?

Zero. Zip. Zilch. Intet. Ingenting. Rien. Nada. Nothing. Nichts. Niets. Null. Niente. Bupkis.

There is No Such Thing as the ROI of purchasing a data quality tool or launching a data governance program.

Before you hire “The Butcher” to eliminate me for being The Man Who Knew Too Little about ROI, please allow me to explain.

Returns only come from Investments

Although the reason that you likely purchased a data quality tool is because you have business-critical data quality problems, simply purchasing a tool is not an investment (unless you believe in Magic Beans) since the tool itself is not a solution.

You use tools to build, test, implement, and maintain solutions. For example, I spent several hundred dollars on new power tools last year for a home improvement project. However, I haven’t received any return on my home improvement investment for a simple reason — I still haven’t even taken most of the tools out of their packaging yet. In other words, I barely even started my home improvement project. It is precisely because I haven’t invested any time and effort that I haven’t seen any returns. And it certainly isn’t going to help me (although it would help Home Depot) if I believed buying even more new tools was the answer.

Although the reason that you likely launched a data governance program is because you have complex issues involving the intersection of data, business processes, technology, and people, simply launching a data governance program is not an investment since it does not conjure the three most important letters.

Data is only an Asset if Data is a Currency

In his book UnMarketing, Scott Stratten discusses this within the context of the ROI of social media (a commonly misunderstood aspect of social media strategy), but his insight is just as applicable to any discussion of ROI. “Think of it this way: You wouldn’t open a business bank account and ask to withdraw $5,000 before depositing anything. The banker would think you are a loony.”

Yet, as Stratten explained, people do this all the time in social media by failing to build up what is known as social currency. “You’ve got to invest in something before withdrawing. Investing your social currency means giving your time, your knowledge, and your efforts to that channel before trying to withdraw monetary currency.”

The same logic applies perfectly to data quality and data governance, where we could say it’s the failure to build up what I will call data currency. You’ve got to invest in data before you could ever consider data an asset to your organization. Investing your data currency means giving your time, your knowledge, and your efforts to data quality and data governance before trying to withdraw monetary currency (i.e., before trying to calculate the ROI of a data quality tool or a data governance program).

If you actually want to get a return on your investment, then actually invest in your data. Invest in doing the hard daily work of continuously improving your data quality and putting into practice your data governance principles, policies, and procedures.

Data is only an asset if data is a currency. Invest in your data currency, and you will eventually get a return on your investment.

You only get a return from something you actually invest in.

Can Enterprise-Class Solutions Ever Deliver ROI?

Do you believe in Magic (Quadrants)?

Which came first, the Data Quality Tool or the Business Need?

What Data Quality Technology Wants

A Farscape Analogy for Data Quality

The Data Quality Wager

“Some is not a number and soon is not a time”

The Dumb and Dumber Guide to Data Quality

November 21, 2011

Commendable Comments (Part 11)

November 21, 2011/ Jim Harris

This Thursday is Thanksgiving Day, which in the United States is a holiday with a long, varied, and debated history. However, the most consistent themes remain family and friends gathering together to share a large meal and express their gratitude.

This is the eleventh entry in my ongoing series for expressing my gratitude to my readers for their commendable comments on my blog posts. Receiving comments is the most rewarding aspect of my blogging experience because not only do comments greatly improve the quality of my blog, comments also help me better appreciate the difference between what I know and what I only think I know. Which is why, although I am truly grateful to all of my readers, I am most grateful to my commenting readers.

Commendable Comments

On The Stakeholder’s Dilemma, Gwen Thomas commented:

“Recently got to listen in on a ‘cooperate or not’ discussion. (Not my clients.) What struck me was that the people advocating cooperation were big-picture people (from architecture and process) while those who just wanted what they wanted were more concerned about their own short-term gains than about system health. No surprise, right?

But what was interesting was that they were clearly looking after their own careers, and not their silos’ interests. I think we who help focus and frame the Stakeholder’s Dilemma situations need to be better prepared to address the individual people involved, and not just the organizational roles they represent.”

On Data, Information, and Knowledge Management, Frank Harland commented:

“As always, an intriguing post. Especially where you draw a parallel between Data Governance and Knowledge Management (wisdom management?) We sometimes portray data management (current term) as ‘well managed data administration’ (term from 70s-80s). As for the debate on ‘data’ and ‘information’ I prefer to see everything written, drawn and / or stored on paper or in digital format as data with various levels of informational value, depending on the amount and quality of metadata surrounding the data item and the accessibility, usefulness (quality) of that item.

For example, 12024561414 is a number with low informational value. I could add metadata, for instance: ‘Phone number’, that makes it potentially known as a phone number. Rather than let you find out whose number it is we could add more information value and add more metadata like: ‘White House Switchboard’. Accessibility could be enhanced by improving formatting like: (1) 202-456-1414.

What I am trying to say with this example is that data items should be placed on a rising scale of informational value rather than be put on steps or firm levels of informational value. So the Information Hierarchy provided by Professor Larson does not work very well for me. It could work only if for all data items the exact information value was determined for every probable context. This model is useful for communication purposes.”

On Plato’s Data, Peter Perera commented:

“‘erised stra ehru oyt ube cafru oyt on wohsi.’

To all Harry Potter fans this translates to: ‘I show not your face but your heart’s desire.’

It refers to The Mirror of Erised. It does not reflect reality but what you desire. (Erised is Desired spelled backwards.) Often data will cast a reflection of what people want to see.

‘Dumbledore cautions Harry that the mirror gives neither knowledge nor truth and that men have wasted away before it, entranced by what they see.’ How many systems are really Mirrors of Erised?”

On Plato’s Data, Larisa Bedgood commented:

“Because the prisoners in the cave are chained and unable to turn their heads to see what goes on behind them, they perceive the shadows as reality. They perceive imperfect reflections of truth and reality.

Bringing the allegory to modern times, this serves as a good reminder that companies MUST embrace data quality for an accurate and REAL view of customers, business initiatives, prospects, and so on. Continuing to view half-truths based on possibly faulty data and information means you are just lost in a dark cave!

I also like the comparison to the Mirror of Erised. One of my favorite movies is the Matrix, in which there are also a lot of parallelisms to Plato’s Cave Allegory. As Morpheus says to Neo: ‘That you are a slave, Neo. Like everyone else you were born into bondage. Into a prison that you cannot taste or see or touch. A prison for your mind.’ Once Neo escapes the Matrix, he discovers that his whole life was based on shadows of the truth.

Plato, Harry Potter, and Morpheus — I’d love to hear a discussion between the three of them in a cave!”

On Plato’s Data, John Owens commented:

“It is true that data is only a reflection of reality but that is also true of anything that we perceive with our senses. When the prisoners in the cave turn around, what they perceive with their eyes in the visible spectrum is only a very narrow slice of what is actually there. Even the ‘solid’ objects they see, and can indeed touch, are actually composed of 99% empty space.

The questions that need to be asked and answered about the essence of data quality are far less esoteric than many would have us believe. They can be very simple, without being simplistic. Indeed simplicity can be seen as a cornerstone of true data quality. If you cannot identify the underlying simplicity that lies at the heart of data quality you can never achieve it. Simple questions are the most powerful. Questions like, ‘In our world (i.e., the enterprise in question) what is it that we need to know about (for example) a Sale that will enable us to operate successfully and meet all of our goals and objectives?’ If the enterprise cannot answer such simple questions then it is in trouble. Making the questions more complicated will not take the enterprise any closer to where it needs to be. Rather it will completely obscure the goal.

Data quality is rather like a ‘magic trick’ done by a magician. Until you know how it is done it appears to an unfathomable mystery. Once you find out that is merely an illusion, the reality is absolutely simple and, in fact, rather mundane. But perhaps that is why so many practitioners perpetuate the illusion. It is not for self gain. They just don’t want to tell the world that, when it comes to data quality, there is no Tooth Fairy, no Easter Bunny, or no Santa Claus. It’s sad, but true. Data quality is boringly simple!”

On Plato’s Data, Peter Benson commented:

“Actually I would go substantially further, whereas data was originally no more than a representation of the real world and if validation was required the real world was the ‘authoritative source’ — but that is clearly no longer the case. Data is in fact the new reality!

Data is now used to track everything, if the data is wrong the real world item disappears. It may have really been destroyed or it may be simply lost, but it does not matter, if the data does not provide evidence of its existence then it does not exist. If you doubt this, just think of money, how much you have is not based on any physical object but on data.

By the way the theoretical definition I use for data is as follows:

Datum — a disruption in a continuum.

The practical definition I use for data is as follows:

Data — elements into which information is transformed so that it can be stored or moved.”

On Data Governance and the Adjacent Possible, Paul Erb commented:

“We can see that there’s a trench between those who think adjacent means out of scope and those who think it means opportunity. Great leaders know that good stories make for better governance for an organization that needs to adapt and evolve, but stay true to its mission. Built from, but not about, real facts, good fictions are broadly true without being specifically true, and therefore they carry well to adjacent business processes where their truths can be applied to making improvements.

On the other hand, if it weren’t for nonfiction — accounts of real markets and processes — there would be nothing for the POSSIBLE to be adjacent TO. Managers often have trouble with this because they feel called to manage the facts, and call anything else an airy-fairy waste of time.

So a data governance program needs to assert whether its purpose is to fix the status quo only, or to fix the status quo in order to create agility to move into new areas when needed. Each of these should have its own business case and related budgets and thresholds (tolerances) in the project plan. And it needs to choose its sponsorship and data quality players accordingly.”

On You Say Potato and I Say Tater Tot, John O’Gorman commented:

“I’ve been working on a definitive solution for the data / information / metadata / attributes / properties knot for a while now and I think I have it figured out.

I read your blog entitled The Semantic Future of MDM and we share the same philosophy even while we differ a bit on the details. Here goes. It’s all information. Good, bad, reliable or not, the argument whether data is information or vice versa is not helpful. The reason data seems different than information is because it has too much ambiguity when it is out of context. Data is like a quantum wave: it has many possibilities one of which is ‘collapsed’ into reality when you add context. Metadata is not a type of data, any more than attributes, properties or associations are a type of information. These are simply conventions to indicate the role that information is playing in a given circumstance.

Your Michelle Davis example is a good illustration: Without context, that string could be any number of individuals, so I consider it data. Give it a unique identifier and classify it as a digital representation in the class of Person, however and we have information. If I then have Michelle add attributes to her personal record — like sex, age, etc. — and assuming that these are likewise identified and classed — now Michelle is part of a set, or relation. Note that it is bad practice — and consequently the cause of many information management headaches — to use data instead of information. Ambiguity kills. Now, if I were to use Michelle’s name in a Subject Matter Expert field as proof of the validity of a digital asset; or in the Author field as an attribute, her information does not *become* metadata or an attribute: it is still information. It is merely being used differently.

In other words, in my world while the terms ‘data’ and ‘information’ are classified as concepts, the terms ‘metadata’, ‘attribute’ and ‘property’ are classified as roles to which instances of those concepts (well, one of them anyway) can be put, i.e., they are fit for purpose. This separation of the identity and class of the string from the purpose to which it is being assigned has produced very solid results for me.”

Thanks for giving your comments

Thank you very much for giving your comments and sharing your perspectives with our collablogaunity. This entry in the series highlighted commendable comments on OCDQ Blog posts published between July and November of 2011.

Since there have been so many commendable comments, please don’t be offended if one of your comments wasn’t featured.

Please keep on commenting and stay tuned for future entries in the series.

Thank you for reading the Obsessive-Compulsive Data Quality (OCDQ) blog. Your readership is deeply appreciated.

Commendable Comments (Part 10) – The 300th OCDQ Blog Post

730 Days and 264 Blog Posts Later – The Second Blogiversary of OCDQ Blog

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

Freemium is the future – and the future is now

August 29, 2011

The Data Cold War

August 29, 2011/ Jim Harris

One of the many things I love about Twitter is its ability to spark ideas via real-time conversations. For example, while live-tweeting during last week’s episode of DM Radio, the topic of which was how to get started with data governance, I tweeted about the data silo challenges and corporate cultural obstacles being discussed.

I tweeted that data is an asset only if it is a shared asset, across the silos, across the corporate culture, and that, in order to be successful with data governance, organizations must replace the mantra “my private knowledge is my power” with “our shared knowledge empowers us all.”

“That’s very socialist thinking,” Mark Madsen responded. “Soon we’ll be having arguments about capitalizing over socializing our data.”

To which I responded that the more socialized data is, the more capitalized data can become . . . just ask Google.

“Oh no,” Mark humorously replied, “decades of political rhetoric about socialism to be ruined by a discussion of data!” And I quipped that discussions about data have been accused of worse, and decades of data rhetoric certainly hasn’t proven very helpful in corporate politics.

Later, while ruminating on this light-hearted exchange, I wondered if we actually are in the midst of the Data Cold War.

The Data Cold War

The Cold War, which lasted approximately from 1946 to 1991, was the political, military, and economic competition between the Communist World, primarily the former Soviet Union, and the Western world, primarily the United States. One of the major tenets of the Cold War was the conflicting ideologies of socialism and capitalism.

In enterprise data management, one of the most debated ideologies is whether or not data should be viewed as a corporate asset, especially by the for-profit corporations of capitalism, which is (even before the Cold War began), and will likely forever remain, the world’s dominant economic model.

My earlier remark that data is an asset only if it is a shared asset, across the silos, across the corporate culture, is indicative of the bounded socialist view of enterprise data. In other words, almost no one in the enterprise data management space is suggesting that data should be shared beyond the boundary of the organization. In this sense, advocates, including myself, of data governance are advocating socializing data within the enterprise so that data can be better capitalized as a true corporate asset.

This mindset makes sense because sharing data with the world, especially for free, couldn’t possibly be profitable — or could it?

The Master Data Management Magic Trick

The genius (and some justifiably ponder if it’s evil genius) of companies like Google and Facebook is they realized how to make money in a free world — by which I mean the world of Free: The Future of a Radical Price, the 2009 book by Chris Anderson.

By encouraging their users to freely share their own personal data, Google and Facebook ingeniously answer what David Loshin calls the most dangerous question in data management: What is the definition of customer?

How do Google and Facebook answer the most dangerous question?

A customer is a product.

This is the first step that begins what I call the Master Data Management Magic Trick.

Instead of trying to manage the troublesome master data domain of customer and link it, through sales transaction data, to the master data domain of product (products, by the way, have always been undeniably accepted as a corporate asset even though product data has not been), Google and Facebook simply eliminate the need for customers (and, by extension, eliminate the need for customer service because, since their product is free, it has no customers) by transforming what would otherwise be customers into the very product that they sell — and, in fact, the only “real” product that they have.

And since what their users perceive as their product is virtual (i.e., entirely Internet-based), it’s not really a product, but instead a free service, which can be discontinued at any time. And if it was, who would you complain to? And on what basis?

After all, you never paid for anything.

This is the second step that completes the Master Data Management Magic Trick — a product is a free service.

Therefore, Google and Facebook magically make both their customers and their products (i.e., master data) disappear, while simultaneously making billions of dollars (i.e., transaction data) appear in their corporate bank accounts.

(Yes, the personal data of their users is master data. However, because it is used in an anonymized and aggregated format, it is not, nor does it need to be, managed like the master data we talk about in the enterprise data management industry.)

Google and Facebook have Capitalized Socialism

By “empowering” us with free services, Google and Facebook use the power of our own personal data against us — by selling it.

However, it’s important to note that they indirectly sell our personal data as anonymized and aggregated demographic data.

Although they do not directly sell our individually identifiable information (because, truthfully, it has very limited, and mostly no legal, value, i.e., that would be identity theft), Google and Facebook do occasionally get sued (mostly outside the United States) for violating data privacy and data protection laws.

However, it’s precisely because we freely give our personal data to them, that until, or if, laws are changed to protect us from ourselves, it’s almost impossible to prove they are doing anything illegal (again, their undeniable genius is arguably evil genius).

Google and Facebook are the exact same kind of company — they are both Internet advertising agencies.

They both sell online advertising space to other companies, which are looking to demographically target prospective customers because those companies actually do view people as potential real customers for their own real products.

The irony is that if all of their users stopped using their free service, then not only would our personal data be more private and more secure, but the new revenue streams of Google and Facebook would eventually dry up because, specifically by design, they have neither real customers nor real products. More precisely, their only real customers (other companies) would stop buying advertising from them because no one would ever see and (albeit, even now, only occasionally) click on their ads.

Essentially, companies like Google and Facebook are winning the Data Cold War because they have capitalized socialism.

In other words, the bottom line is Google and Facebook have socialized data in order to capitalize data as a true corporate asset.

The Age of the Platform

Amazon’s Data Management Brain

The Semantic Future of MDM

A Brave New Data World

Big Data and Big Analytics

A Farscape Analogy for Data Quality

Organizing For Data Quality

Sharing Data

Song of My Data

Data in the (Oscar) Wilde

The Most August Imagination

Once Upon a Time in the Data

The Idea of Order in Data

Hell is other people’s data

August 02, 2011

Are you turning Ugly Data into Cute Information?

August 02, 2011/ Jim Harris

Sometimes the ways of the data force are difficult to understand precisely because they are sometimes difficult to see.

Daragh O Brien and I were discussing this recently on Twitter, where tweets about data quality and information quality form the midi-chlorians of the data force. Share disturbances you’ve felt in the data force using the #UglyData and #CuteInfo hashtags.

Presentation Quality

Perhaps one of the most common examples of the difference between data and information is the presentation layer created for business users. In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray defines Presentation Quality as “a measure of how information is presented to, and collected from, those who utilize it. Format and appearance support appropriate use of the information.”

Tom Redman emphasizes the two most important points in the data lifecycle are when data is created and when data is used.

I describe the connection between those two points as the Data-Information Bridge. By passing over this bridge, data becomes the information used to make the business decisions that drive the tactical and strategic initiatives of the organization. Some of the most important activities of enterprise data management actually occur on the Data-Information Bridge, where preventing critical disconnects between data creation and data usage is essential to the success of the organization’s business activities.

Defect prevention and data cleansing are two of the required disciplines of an enterprise-wide data quality program. Defect prevention is focused on the moment of data creation, attempting to enforce better controls to prevent poor data quality at the source. Data cleansing can either be used to compensate for a lack of defect prevention, or it can be included in the processing that prepares data for a specific use (i.e., transforms data into information fit for the purpose of a specific business use.)

The Dark Side of Data Cleansing

In a previous post, I explained that although most organizations acknowledge the importance of data quality, they don’t believe that data quality issues occur very often because the information made available to end users in dashboards and reports often passes through many processes that cleanse or otherwise sanitize the data before it reaches them.

ETL processes that extract source data for a data warehouse load will often perform basic data quality checks. However, a fairly standard practice for “resolving” a data quality issue is to substitute either a missing or default value (e.g., a date stored in a text field in the source, which can not be converted into a valid date value, is loaded with either a NULL value or the processing date).

When postal address validation software generates a valid mailing address, it often does so by removing what it considers to be “extraneous” information from input address fields, which may include valid data accidentally entered in the wrong field, or that was lacking its own input field (e.g., e-mail address in an input address field deleted from the output valid mailing address).

And some reporting processes intentionally filter out “bad records” or eliminate “outlier values.” This happens most frequently when preparing highly summarized reports, especially those intended for executive management.

These are just a few examples of the Dark Side of Data Cleansing, which can turn Ugly Data into Cute Information.

Has your Data Quality turned to the Dark Side?

Like truth, beauty, and singing ability, data quality is in the eyes of the beholder, or since data quality is most commonly defined as fitness for the purpose of use, we could say that data quality is in the eyes of the user. But how do users know if data is truly fit for their purpose, or if they are simply being presented with information that is aesthetically pleasing for their purpose?

Has your data quality turned to the dark side by turning ugly data into cute information?

Data, Information, and Knowledge Management

Beyond a “Single Version of the Truth”

The Data-Information Continuum

The Circle of Quality

Data Quality and the Cupertino Effect

The Idea of Order in Data

Hell is other people’s data

OCDQ Radio - Organizing for Data Quality

The Reptilian Anti-Data Brain

Amazon’s Data Management Brain

Holistic Data Management (Part 3)

Holistic Data Management (Part 2)

Holistic Data Management (Part 1)

OCDQ Radio - Data Governance Star Wars

Data Governance Star Wars: Bureaucracy versus Agility

July 21, 2011

The Age of the Platform

July 21, 2011/ Jim Harris

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Phil Simon is the author of three books: The New Small (Motion, 2010), Why New Systems Fail (Cengage, 2010) and The Next Wave of Technologies (John Wiley & Sons, 2010).

A recognized technology expert, he consults companies on how to optimize their use of technology. His contributions have been featured on The Globe and Mail, the American Express Open Forum, ComputerWorld, ZDNet, abcnews.com, forbes.com, The New York Times, ReadWriteWeb, and many other sites.

When not fiddling with computers, hosting podcasts, putting himself in comics, and writing, Phil enjoys English Bulldogs, tennis, golf, movies that hurt the brain, fantasy football, and progressive rock—which is also the subject of this episode’s book contest (see below).

On this episode of OCDQ Radio, Phil and I discuss his fourth book, The Age of the Platform, which will be published later this year thanks to the help of the generous contributions of people like you who are backing the book’s Kickstarter project.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.

Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.

Gaining a Competitive Advantage with Data — Guest William McKnight discusses some of the practical, hands-on guidance provided by his book Information Management: Strategies for Gaining a Competitive Advantage with Data.

Doing Data Governance — Guest John Ladley discusses his book How to Design, Deploy and Sustain Data Governance and how to understand the difference and relationship between data governance and enterprise information management.

Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).

Measuring Data Quality for Ongoing Improvement — Guest Laura Sebastian-Coleman discusses bringing together a better understanding of what is represented in data with the expectations for use in order to improve the overall quality of data.

The Blue Box of Information Quality — Guest Daragh O Brien on why Information Quality is bigger on the inside, using stories as an analytical tool and change management technique, and why we must never forget that “people are cool.”

Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.

Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.

The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.

The Art of Data Matching — Guest Henrik Liliendahl Sørensen discusses data matching concepts and practices, including different match techniques, candidate selection, presentation of match results, and business applications of data matching.

Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

July 16, 2011

Commendable Comments (Part 10)

July 16, 2011/ Jim Harris

Welcome to the 300th Obsessive-Compulsive Data Quality (OCDQ) blog post!

You might have been expecting a blog post inspired by the movie 300, but since I already did that with Spartan Data Quality, instead I decided to commemorate this milestone with the 10th entry in my ongoing series for expressing my gratitude to my readers for their truly commendable comments on my blog posts.

Commendable Comments

On DQ-BE: Single Version of the Time, Vish Agashe commented:

“This has been one of my pet peeves for a long time. Shared version of truth or the reference version of truth is so much better, friendly and non-dictative (if such a word exists) than single version of truth.

I truly believe that starting a discussion with Single Version of the Truth with business stakeholders is a nonstarter. There will always be a need for multifaceted view and possibly multiple aspects of the truth.

A very common term/example I have come across is the usage of the term revenue. Unfortunately, there is no single version of revenue across the organizations (and for valid reasons). From Sales Management prospective, they like to look at sales revenue (sales bookings) which is the business on which they are compensated on, financial folks want to look at financial revenue, which is the revenue they capture in the books and marketing possibly wants to look at marketing revenue (sales revenue before the discount) which is the revenue marketing uses to justify their budgets. So if you ever asked questions to a group of people about what revenue of the organization is, you will get three different perspectives. And these three answers will be accurate in the context of three different groups.”

On Data Confabulation in Business Intelligence, Henrik Liliendahl Sørensen commented:

“I think this is going to dominate the data management realm in the coming years. We are not only met with drastically increasing volumes of data, but also increasing velocity and variety of data.

The dilemma is between making good decisions and making fast decisions, whether the decisions based on business intelligence findings should wait for assuring the quality of the data upon which the decisions are made, thus risking the decision being too late. If data quality always could be optimal by being solved at the root we wouldn’t have that dilemma.

The challenge is if we are able to have optimal data all the time when dealing with extreme data, which is data of great variety moving in high velocity and coming in huge volumes.”

On The People Platform, Mark Allen commented:

“I definitely agree and think you are burrowing into the real core of what makes or breaks EDM and MDM type initiatives -- it's the people.

Business models, processes, data, and technology all provide fixed forms of enablement or constraint. And where in the past these dynamics have been very compartmentalized throughout a company's business model and systems architecture, with EDM and MDM involving more integrated functions and shared data, people become more of the x-factor in the equation. This demands the presence of data governance to be the facilitating process that drives the collaborative, cross-functional, and decision making dynamics needed for successful EDM and MDM. Of course, the dilemma is that in a governance model people can still make bad decisions that inhibit people from working effectively.

So in terms of the people platform and data governance, there needs to be the correct focus on what are the right roles and good decisions made that can enable people to interact effectively.”

On Beware the Data Governance Ides of March, Jill Wanless commented:

“Our organization has taken the Hybrid Approach (starting Bottom-Up) and it works well for two reasons: (1) the worker bee rock stars are all aligned and ready to hit the ground running, and (2) the ‘Top’ can sit back and let the ‘aligned’ worker bees get on with it.

Of course, this approach is sometimes (painfully) slow, but with the ground-level rock stars already aligned, there is less resistance implementing the policies, and the Top’s heavy hand is needed much less frequently, but I voted for Hybrid Approach (starting Top-Down) because I have less than stellar patience for the long and scenic route.”

On Data Governance and the Buttered Cat Paradox, Rob Drysdale commented:

“Too many companies get paralyzed thinking about how to do this and implement it. (Along with the overwhelmed feeling that it is too much time/effort/money to fix it.) But I think your poll needs another option to vote on, specifically: ‘Whatever works for the company/culture/organization’ since not all solutions will work for every organization.

In some where it is highly structured, rigid and controlled, there wouldn’t be the freedom at the grass-roots level to start something like this and it might be frowned upon by upper-level management. In other organizations that foster grass-roots things then it could work.

However, no matter which way you can get it started and working, you need to have buy-in and commitment at all levels to keep it going and make it effective.”

On The Data Quality Wager, Gordon Hamilton commented:

“Deming puts a lot of energy into his arguments in 'Out of the Crisis' that the short-term mindset of the executives, and by extension the directors, is a large part of the problem.

Jackanapes, a lovely under-used term, might be a bit strong when the executives are really just doing what they are paid for. In North America we get what the directors measure! In fact, one quandary is that a proactive executive, who invests in data quality is building the long-term value of their company but is also setting it up to be acquired by somebody who recognizes that the 'under the radar' improvements are making the prize valuable.

Deming says on p.100: 'Fear of unfriendly takeover may be the single most important obstacle to constancy of purpose. There is also, besides the unfriendly takeover, the equally devastating leveraged buyout. Either way, the conqueror demands dividends, with vicious consequences on the vanquished.'”

On Got Data Quality?, Graham Rhind commented:

“It always makes me smile when people attempt to put a percentage value on their data quality as though it were something as tangible and measurable as the fat content of your milk.

In order to make such a measurement one would need to know where 100% of the defects lie. If they knew that they would be able to resolve the defects and achieve 100% quality. In reality you cannot and do not know where each defect is and how many there are.

Even though tools such as profilers will tell you, for example, that 95% of your US address records have a valid state added, there is still no way to measure how many of these valid states are applicable to the real world entity on the ground. Mr Smith may be registered in the database to an existing and valid address in the database, but if he moved last week there's a data quality issue that won't be discovered until one attempts to contact him.

The same applies when people say they have removed 95% of duplicates from their data. If they can measure it then they know where the other 5% of duplicates are and they can remove them.

But back to the point: you may not achieve 100% quality. In fact, we know you never will. But aiming for that target means that you're aiming in the right direction. As long as your goal is to get close to perfection and not to achieve it, I don't see the problem.”

On Data Governance Star Wars: Balancing Bureaucracy and Agility, Rob “Darth” Karel commented:

“A curious question to my Rebellious friend OCDQ-Wan, while data governance agility is a wonderful goal, and maybe a great place to start your efforts, is it sustainable?

Your agile Rebellion is like any start-up: decisions must be made quickly, you must do a lot with limited resources, everyone plays multiple roles willingly, and your objective is very targeted and specific. For example, to fire a photon torpedo into a small thermal exhaust port - only 2 meters wide - connected directly to the main reactor of the Death Star. Let's say you 'win' that market objective. What next?

The Rebellion defeats the Galactic Empire, leaving a market leadership vacuum. The Rebellion begins to set up a new form of government to serve all (aka grow existing market and expand into new markets) and must grow larger, with more layers of management, in order to scale. (aka enterprise data governance supporting all LOBs, geographies, and business functions).

At some point this Rebellion becomes a new Bureaucracy - maybe with a different name and legacy, but with similar results. Don't forget, the Galactic Empire started as a mini-rebellion itself spearheaded by the agile Palpatine!”

You Are Awesome

Thank you very much for sharing your perspectives with our collablogaunity. This entry in the series highlighted the commendable comments received on OCDQ Blog posts published between January and June of 2011.

Since there have been so many commendable comments, please don’t be offended if one of your comments wasn’t featured.

Please keep on commenting and stay tuned for future entries in the series.

By the way, even if you have never posted a comment on my blog, you are still awesome — feel free to tell everyone I said so.

Thank you for reading the Obsessive-Compulsive Data Quality (OCDQ) blog. Your readership is deeply appreciated.

730 Days and 264 Blog Posts Later – The Second Blogiversary of OCDQ Blog

Commendable Comments (Part 5) – The 100th OCDQ Blog Post