Jim Harris

My name is Jim Harris.  I am the Blogger-in-Chief of OCDQ Blog and an independent consultant, speaker, and freelance writer for hire.


Entries in Metadata (12)

Thursday
Jan 03, 2013

Best OCDQ Blog Posts of 2012

Welcome to my roundup of the best blog posts published on the Obsessive-Compulsive Data Quality (OCDQ) blog during 2012.

My selections were based on a pseudo-scientific, quasi-statistical combination of page views, comments, and re-tweets, plus a few of my personal favorites.  I have organized them into four sections of ten best posts by topic or type.

 

Ten Best Posts on Big Data

  • Dot Collectors and Dot Connectors — The multifaceted challenges of big data require the dot collectors of data management and the dot connectors of business intelligence to overcome their attention blindness and work together more collaboratively.
  • HoardaBytes and the Big Data Lebowski — Don’t hoard Data, dude.  The Data must abide.  The Data must abide both the Business, by proving useful to our business activities, and the Individual, by protecting the privacy of our personal activities.
  • Our Increasingly Data-Constructed World — What we now call Big Data is in fact a long-running macro trend underlying the many recent trends and innovations making our world, not just more data-driven, but increasingly data-constructed.
  • Will Big Data be Blinded by Data Science? — With apologies to Thomas Dolby, will the business leaders being told to hire data scientists to derive business value from big data analytics be blind to what data science tries to show them?
  • The Graystone Effects of Big Data — Using a metaphor based on the science fiction television show Caprica, I refer to the positive aspects of Big Data as the Zoe Graystone Effect, and the negative aspects of Big Data as the Daniel Graystone Effect.
  • Exercise Better Data Management — Big Data may be followed by MOData (i.e., MOre Data or Morbidly Obese Data), but that doesn’t necessarily mean we require more data management; instead, we just need to exercise better data management.
  • A Tale of Two Datas — Inspired by Malcolm Chisholm and Charles Dickens, there are two types of data (i.e., representation and observation, not big and not-so-big) with different data uses that will require different data management approaches.
  • Data Silence — Not only do we need to adopt a mindset that embraces the principles of data science, but we also have to acknowledge that the biases and preconceptions in our minds could silence the signal and amplify the noise in big data.
  • The Wisdom of Crowds, Friends, and Experts — The future of wisdom will increasingly become an amalgamation of experts, friends, and crowds, with the data and techniques from all three sources often contributing to data-driven decision making.

 

Ten Best Posts on Data Governance and Data Quality

  • Data Quality: Quo Vadimus? — With lots of help from Henrik Liliendahl Sørensen, Garry Ure, Bryan Larkin, and many others via the comments, I ponder where data quality is going, and whether data quality is a journey or a destination.
  • Data Quality and Miracle Exceptions — Battling the dark forces of poor data quality doesn’t require any superpowers, and data quality doesn’t have any miracle exceptions, so for the love of high-quality data everywhere, stop trying to sell us one.
  • Data Myopia and Business Relativity — Examines the two most prevalent definitions for data quality, real-world alignment and fitness for the purpose of use, otherwise known as the danger of data myopia and the challenge of business relativity.
  • How Data Cleansing Saves Lives — Although proactive defect prevention is far superior to reactive data cleansing, the history of the Hubble Space Telescope proves that data cleansing can be not just a necessary evil, but also a necessary good.
  • Data Quality and the Bystander Effect — The most common reason data quality issues are neither reported nor corrected is the Bystander Effect making people less likely to interpret bad data as a problem or, at the very least, not their responsibility.
  • Data Quality and Chicken Little Syndrome — A chicken-metaphor-based post about the far-too-common and fowl folly of emphasizing the negative aspects of not investing in data quality instead of trying to sell its business benefits.
  • Data and its Relationships with Quality — The metadata linking the data management industry to what it manages suffers from the one-to-many relationships created by never agreeing on how data, information, and quality should be defined.
  • Cooks, Chefs, and Data Governance — Implementing policies requires cooks who are adept at carrying out a recipe, as well as chefs who are trusted to figure out how to best combine policies with the organizational ingredients available to them.
  • Availability Bias and Data Quality Improvement — The availability heuristic explains why a reactive data cleansing project is easily approved, and availability bias explains why initiating a proactive data quality program is usually resisted.

 

Ten Best Podcasts

  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Saving Private Data — Recorded in December 2011, guest Daragh O Brien discusses the data privacy and data protection implications of social media, cloud computing, and big data.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Defining Big Data — This episode of the Open MIKE Podcast, with assistance from Robert Hillard, discusses how big data refers to big complexity, not big volume, even though complex datasets tend to grow rapidly, thus making them voluminous.
  • Getting to Know NoSQL — This episode of the Open MIKE Podcast discusses how NoSQL does not mean AntiSQL (i.e., NoSQL is not a Relational replacement), and that business-driven big data needs will often require “Not Only SQL.”

 

Ten Best of the Rest

  • DQ-View: Data Is as Data Does — In this short video, I explain that data’s value comes from data’s usefulness, exemplifying the potential value of unstructured data based on whether or not you put what you read in data management books to use.
  • DQ-View: The Five Stages of Data Quality — In this short video, using my superb acting skills, I demonstrate how coming to terms with the daunting challenge of data quality is somewhat similar to experiencing the Five Stages of Grief.
  • DQ-View: MetaData makes BettahMusic — In this short video, I demonstrate how better metadata makes data better using the metadata automatically and manually created after importing my CD collection into my iTunes library.
  • Metadata, Data Quality, and the Stroop Test — In this colorful (and perhaps too colorful) post, I use the Stroop Test, where colors do not match their names, to discuss the relationship between metadata and data quality.
  • Quality is the Higgs Field of Data — Using one of the biggest science stories of 2012, the potential discovery of the elusive Higgs Boson (which I also attempt to explain), I attempt an analogy for data quality based on the Higgs Field.
  • The Family Circus and Data Quality — Thanks to The Family Circus comic strip created by cartoonist Bil Keane, I explain how Ida Know owns the data, Not Me is accountable for data governance, and Nobody takes responsibility for data quality.
  • Data Love Song Mashup — Since your data needs love too, on Valentine’s Day I wrote this post providing a mashup of love songs for your data (and Rob DuMoulin added a few more in the comments) — Happy Data Quality to you and your data!
  • The Algebra of Collaboration — The trick of algebra equates collaboration with data quality and data governance success when collaboration is viewed not just as a guiding principle, but also as a call to action in your daily practices.
  • The Return of the Dumb Terminal — With help from author Kevin Kelly and my old green machine, I ponder how the mobile-app-portal-to-the-cloud computing model means mobile devices are bringing about the return of the dumb terminal.
  • An Enterprise Carol — Jacob Marley raises the ghosts of a few ideas to consider about how to keep the Enterprise well in the new year via the Ghosts of Enterprise Past (Legacy Applications), Present (IT Consumerization), and Future (Big Data).

 

Thank You for Reading OCDQ Blog in 2012

In 2012, the Obsessive-Compulsive Data Quality (OCDQ) blog published 92 posts, which received 160,000 total page views, while averaging over 400 page views and 200 unique visitors a day.

Thank you for reading OCDQ Blog in 2012.  Your readership was deeply appreciated.

 

Related Posts

Best OCDQ Blog Posts of 2011

So Long 2011, and Thanks for All the . . . – The OCDQ Radio 2011 Year in Review

2012 Quarterly Review of the Data Roundtable (Part 4)

2012 Quarterly Review of the Data Roundtable (Part 3)

2012 Quarterly Review of the Data Roundtable (Part 2)

2012 Quarterly Review of the Data Roundtable (Part 1)

2011 Quarterly Review of the Data Roundtable (Part 4)

2011 Quarterly Review of the Data Roundtable (Part 3)

2011 Quarterly Review of the Data Roundtable (Part 2)

2011 Quarterly Review of the Data Roundtable (Part 1)

Thursday
Nov 15, 2012

Open MIKE Podcast — Episode 07

Method for an Integrated Knowledge Environment (MIKE2.0) is an open source delivery framework for Enterprise Information Management, which provides a comprehensive methodology that can be applied across a number of different projects within the Information Management space.  For more information, click on this link: openmethodology.org/wiki/What_is_MIKE2.0

The Open MIKE Podcast is a video podcast show, hosted by Jim Harris, which discusses aspects of the MIKE2.0 framework, and features content contributed to MIKE 2.0 Wiki Articles, Blog Posts, and Discussion Forums.

 

Episode 07: Guiding Principles for the Open Semantic Enterprise

If you’re having trouble viewing this video, you can watch it on Vimeo by clicking on this link: Open MIKE Podcast on Vimeo

 

MIKE2.0 Content Featured in or Related to this Podcast

Semantic Enterprise Guiding Principles: openmethodology.org/wiki/Guiding_Principles_for_the_Open_Semantic_Enterprise *

* Based on Mike Bergman’s article: mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise

Semantic Enterprise Composite Offering: openmethodology.org/wiki/Semantic_Enterprise_Composite_Offering

Semantic Enterprise Wiki Category: openmethodology.org/wiki/Category:Semantic_Enterprise

You can also find the videos and blog post summaries for every episode of the Open MIKE Podcast at: ocdqblog.com/MIKE

 

Related Posts

Open MIKE Podcast — Episode 04: Metadata Management

You Say Potato and I Say Tater Tot

The Metadata Continuum

The Metadata Crisis

DQ-View: MetaData makes BettahMusic

Metadata, Data Quality, and the Stroop Test

Data Quality and the Q Test

Data and its Relationships with Quality

What’s the Meta with your Data?

Let’s Meta a Data

Listen to Peter Benson discuss Metadata, Data, and Information on the Knights of the Data Roundtable

Thursday
Sep 27, 2012

Open MIKE Podcast — Episode 04


 

Episode 04: Metadata Management

If you’re having trouble viewing this video, you can watch it on Vimeo by clicking on this link: Open MIKE Podcast on Vimeo

 

MIKE2.0 Content Featured in or Related to this Podcast

Information Asset Management: openmethodology.org/wiki/Information_Asset_Management_Offering_Group

Metadata Management Solution Offering: openmethodology.org/wiki/Metadata_Management_Solution_Offering

You can also find the videos and blog post summaries for every episode of the Open MIKE Podcast at: ocdqblog.com/MIKE

 

Related Posts

You Say Potato and I Say Tater Tot

The Metadata Continuum

The Metadata Crisis

DQ-View: MetaData makes BettahMusic

Metadata, Data Quality, and the Stroop Test

Data Quality and the Q Test

Data and its Relationships with Quality

What’s the Meta with your Data?

Let’s Meta a Data

Listen to Peter Benson discuss Metadata, Data, and Information on the Knights of the Data Roundtable

Tuesday
Aug 21, 2012

Data and its Relationships with Quality

The title of this blog post is an allusion to the graphic (shown above) that accompanied the recent Information Management column by Malcolm Chisholm, in which he wrote that data quality is not fitness for use as it is most commonly defined, stating he thinks “a strong case can be made that the definition is indeed inappropriate and should be replaced with a better one.”

“Before we get into the definition of data quality, let us take a brief look at what data is related to,” Chisholm opened, explaining that “data represents something — a thing, event, or concept.”

As I blogged in my post Plato’s Data, whether it’s an abstract description of real-world entities (i.e., “master data”) or an abstract description of real-world interactions (i.e., “transaction data”) among entities, data is an abstract description of reality.  Although data shapes our perception of the real world, sometimes we forget that data is only a partial reflection of reality.

“Data is understood,” Chisholm continued, “by something, for which the best term I can find is the interpretant.”

“The interpretant applies the data to one or more uses, which achieve objectives the interpretant has.  The interpretant is independent of the data.  It understands the data and can put it to use.  But if the interpretant misunderstands the data, or puts it to an inappropriate use, that is hardly the fault of the data, and cannot constitute a data quality problem.”

As I blogged in my post Quality is the Higgs Field of Data, independent from use, data is as carefree as the mass-less photon whizzing around at the speed of light.  But once we interact with it, data begins to feel the effects of our use. We give data mass so that it can become the basic building blocks of what matters to us.  Some data is affected more by our use than others.  The more subjective our use, the more we weigh data down.  The more objective our use, the less we weigh data down.

“A more fundamental problem is that data can have many uses,” Chisholm continued.  “If we think data quality is fitness for use, then data quality must be assessed independently for each use we put it to.”  Instead, Chisholm contends that data quality is “an expression of the relationship between the thing, event, or concept and the data that represents it.  This is a one-to-one relationship, unlike the one-to-many relationship between data and uses.”

Therefore, Chisholm proposes that a better definition of data quality is “the extent to which the data actually represents what it purports to represent.  This definition can be used to think of data quality as a property of the data itself, and then our diagnosis and remediation efforts will focus on the special problems of the relationship between data and what it represents.”

But, of course, although Chisholm doesn’t like it as a definition for data quality, he is not denying that fitness for use describes “a set of valid concepts that deal with types of problems around the use of data.”  Two examples he cites are when the interpretant misunderstands the data, or when the interpretant uses data for a purpose that is incompatible with the data.

In his conclusion, Chisholm states that “the special problems of the relationships between data and what it is used for requires a different set of approaches and should be called something other than data quality.”

And this is exactly why, as I blogged in my post Data Myopia and Business Relativity, many data professionals prefer to define data quality as real-world alignment and information quality as fitness for the purpose of use.  However, I have found that adding the nuance of data versus information only further complicates data quality discussions with business professionals.

Chisholm also suggests that his proposed definition of data quality is not only better, but that “it also alludes to the existence of metadata that links the data to what it is representing.”  The important role that metadata plays in supporting data and its relationships with information and quality is something I blogged about in my post You Say Potato and I Say Tater Tot.

The irony is that the metadata linking the data management industry to what it manages suffers from the one-to-many relationships we’ve created by seemingly never agreeing on how data, information, and quality should be defined.

 

Related Posts

Plato’s Data

Quality is the Higgs Field of Data

Data Myopia and Business Relativity

Data, Information, and Knowledge Management

You Say Potato and I Say Tater Tot

Metadata, Data Quality, and the Stroop Test

Data Quality and the Q Test

Data Quality and Miracle Exceptions

Data Quality and Chicken Little Syndrome

Data Quality: Quo Vadimus?

DQ-View: The Five Stages of Data Quality

Exercise Better Data Management

 

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Redefining Data Quality — Guest Peter Perera discusses his proposed redefinition of data quality, as well as his perspective on the relationship of data quality to master data management and data governance.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Monday
Jun 25, 2012

Metadata, Data Quality, and the Stroop Test

In psychology, the Stroop Effect is a demonstration of interference in the reaction time of a task.  The most commonly used example is what is known as the Stroop Test, which compares the time needed to name colors when they are printed in an ink color that matches their name (e.g., green, yellow, red, blue, brown, purple) with the time needed to name the same colors when they are printed in an ink color that does not match their name (e.g., blue, red, purple, green, brown, yellow).  Naming the color of the word takes longer, and is more prone to errors, when the ink color does not match the name of the color.

The Stroop Test, where colors do not match their names, reminds me of the relationship between metadata and data quality if I view the ink color as the metadata and the name of the color as the data, given that understanding data takes longer, and is more prone to errors, when the metadata does not match the data, or when the metadata is ambiguous.

Unlike the Stroop Test, where poor metadata (ink color) obfuscates good data (name of the color), data quality issues can also be caused when good metadata is undermined by poor data (e.g., data entry errors like an email address being entered into a postal address field).  And, of course, even when the entered data matches the metadata (or automatic data-to-metadata matching is enabled by drop-down boxes), more insidious data quality issues can be caused by the complex challenge of data accuracy.
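As a rough illustration of good metadata being undermined by poor data, here is a minimal Python sketch that checks whether each entered value fits what its field name says it should hold.  Everything in it is hypothetical and chosen only for this example: the field names, the simplified email pattern, and the assumption that a postal address contains at least one digit.

```python
import re

# Simplified, illustrative email pattern (an assumption, not a full validator).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical field metadata: each field name maps to a check that
# says whether a value looks like what the field is supposed to hold.
FIELD_METADATA = {
    "email": lambda value: bool(EMAIL_RE.match(value)),
    # Assumption: a postal address is not an email and contains a digit.
    "postal_address": lambda value: not EMAIL_RE.match(value)
    and any(ch.isdigit() for ch in value),
}


def find_mismatches(record):
    """Return the names of fields whose value does not fit the field's metadata."""
    return [
        name
        for name, fits in FIELD_METADATA.items()
        if name in record and not fits(record[name])
    ]


good = {"email": "pat@example.com", "postal_address": "12 Main Street, Springfield"}
bad = {"email": "pat@example.com", "postal_address": "pat@example.com"}

print(find_mismatches(good))
print(find_mismatches(bad))
```

A check like this only catches data that contradicts its metadata.  A well-formed but wrong postal address would still pass, which is the harder problem of data accuracy.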

Additionally, the point of view paradox can turn data quality debates about fitness for the purpose of use even more colorful than the Stroop Test, such as when data that one user sees as red and green, another user sees as crimson and chartreuse.

But hopefully we can all agree that good data quality begins with good metadata, because better metadata makes data better.

 

Related Posts

You Say Potato and I Say Tater Tot

The Metadata Continuum

The Metadata Crisis

Let’s Meta a Data

What’s the Meta with your Data?

DQ-View: MetaData makes BettahMusic

Who Framed Data Entry?

Data Quality and the Cupertino Effect

DQ-Tip: “There is no such thing as data accuracy...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-BE: Data Quality Airlines

Data Quality and the Q Test

Thursday
Jan 26, 2012

The Johari Window of Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

The Johari Window is a term from psychology for a technique used to help people better understand their personality and behavior by combining a self assessment with assessments from their peers.  In relation to data, the Johari Window is a metaphor for helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.

During this episode, I discuss the Johari Window of Data Quality with Martin Doyle.  Our discussion, inspired by our blog comment banter on my post There is No Such Thing as a Root Cause, includes root cause analysis, the pursuit of data perfection, metadata, communication, Business-IT collaboration, change management, defect prevention, and continuous improvement.

Martin Doyle is a Data Quality Improvement Evangelist and the CEO of DQ Global, which is a UK-based data quality software and services vendor providing data cleansing, international address and email verification, data deduplication, and data matching solutions for Customer Relationship Management, Single Customer View, and Master Data Management.  DQ Global has worked with over 500 businesses worldwide on a variety of projects, providing their clients with improved data quality, making their data fit for business use, and enabling them to trust their data and make decisions based on a foundation of fact.

 


 

Related Posts

There is No Such Thing as a Root Cause

The Dichotomy Paradox, Data Quality and Zero Defects

The Asymptote of Data Quality

To Our Data Perfectionists

DQ-View: The Cassandra Effect

The Data Quality Wager

DQ-View: Data Is as Data Does

Selling the Business Benefits of Data Quality

 


Tuesday
Jan 24, 2012

DQ-View: MetaData makes BettahMusic

Tuesday
Jan 03, 2012

Best OCDQ Blog Posts of 2011

Welcome to my roundup of the best blog posts published on the Obsessive-Compulsive Data Quality (OCDQ) blog during 2011.

My selections were based on a pseudo-scientific, quasi-statistical combination of page views, comments, and re-tweets (as well as choosing a few of my personal favorites).  Instead of ordering the posts chronologically, I decided to organize them by theme.

 

The Metadata Trilogy

Although it has an incredibly important role to play in data quality and its related disciplines, I don’t write about metadata very often.  But the reader feedback that I received led me to write three blog posts about metadata in the span of a few weeks:

  • The Metadata Crisis — There is a running debate within many organizations over the meaning of commonly used terms, which complicates what on the surface seem like straightforward business questions.
  • The Metadata Continuum — There is a continuum, where at one end we have the uniformity of controlled vocabularies, and at the other end we have the flexibility of chaotic folksonomies.  However, both flexibility and uniformity provide value.
  • You Say Potato and I Say Tater Tot — The demarcations of the borders between metadata, data, and information are important, but sometimes difficult to discern.  In this post, I offer an explanation about these demarcations using potatoes.

 

The Data Governance Star Wars (one less than a) Trilogy

In June, Rob Karel of Forrester Research and I used a Star Wars themed blog mock debate to take on one of data governance’s biggest challenges — how to balance bureaucracy and business agility.  Gwen Thomas of the Data Governance Institute joined Rob and me to continue the discussion during a special, extended, and Star Wars themed episode of OCDQ Radio:

  • Data Governance Star Wars on OCDQ Radio — In Part 1, Rob Karel and I discuss our blog mock debate, which is followed by a brief Star Wars themed intermission, and then in Part 2, Gwen Thomas joins us to provide her excellent insights.

 

Although not Star Wars themed, here are some additional Best OCDQ Blog Posts of 2011 on the topic of data governance:

  • Data Governance and the Adjacent Possible — It’s important to demonstrate that some data governance policies reflect existing best practices, which helps reduce resistance to change, and therefore I advise: “If it ain’t broke, bricolage it.”
  • Aristotle, Data Governance, and Lead Rulers — Well-constructed data governance policies are like lead rulers — flexible rules that empower us with an understanding of the principle of the policy, and how to enforce it in a particular context.
  • The Stakeholder’s Dilemma — There will be times when sacrifices for the long-term greater good will require that stakeholders either contribute more resources during the current phase, or receive fewer benefits from its deliverables.
  • Beware the Data Governance Ides of March — My dramatized warning about relying too much on the top-down approach to implementing data governance — and especially if your organization has any data stewards named Brutus or Cassius.

 

OCDQ Radio

In June, I launched OCDQ Radio, which is a vendor-neutral podcast about data quality and the audio complement to this blog, providing me with a platform for recorded discussions with the great folks working in the data management industry.  So far, there have been 21 episodes of OCDQ Radio, including 22 guests from 7 countries.  Here are a few of the most popular episodes:

  • The Fall Back Recap Show — A look back at the Best of OCDQ Radio, including discussions about Data, Information, Business-IT Collaboration, Change Management, Big Analytics, Data Governance, and the Data Revolution.
  • Organizing for Data Quality — Guest Tom Redman (aka the “Data Doc”) discusses how your organization should approach data quality, including his call to action for your role in the data revolution.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.
  • Social Media Strategy — Guest Crysta Anderson of IBM Initiate explains social media strategy and content marketing, including three recommended practices: (1) Listen intently, (2) Communicate succinctly, and (3) Have fun.

 

The Best of the Rest

  • DQ-View: Talking about Data — DQ-View video discussion about how data professionals should talk about data when invited to participate in business discussions within their organizations.
  • The Speed of Decision — Examines the constraints that time puts on data-driven decision making, pondering whether decision speed is more important than data quality and decision quality.
  • The Data Cold War — Examines how Google and Facebook have performed the Master Data Management Magic Trick and socialized data (“Information wants to be free!”) in order to capitalize data as a true corporate asset.
  • A Farscape Analogy for Data Quality — Ponders whether data is not viewed as an asset because data has so thoroughly pervaded the enterprise that data has become invisible to those who are so dependent upon its quality.
  • No Datum is an Island of Serendip — Our organizations need to create collaborative environments that foster serendipitous connections bringing all of our business units and people together around our shared data assets.

 

Thank You for Reading OCDQ Blog in 2011

In 2011, the Obsessive-Compulsive Data Quality (OCDQ) blog published 112 posts, which received 130,000 total page views, averaging 350 page views and 150 unique visitors a day.

Thank you for reading OCDQ Blog in 2011.  Your readership was deeply appreciated.

 

Related Posts

So Long 2011, and Thanks for All the . . . – The OCDQ Radio 2011 Year in Review

2011 Quarterly Review of the Data Roundtable (Part 3)

2011 Quarterly Review of the Data Roundtable (Part 2)

2011 Quarterly Review of the Data Roundtable (Part 1)

Commendable Comments (Part 10) – The 300th OCDQ Blog Post

730 Days and 264 Blog Posts Later – The Second Blogiversary of OCDQ Blog

OCDQ Blog Bicentennial – The 200th OCDQ Blog Post

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

The Best Data Quality Blog Posts of 2010

Friday
Nov 11, 2011

You Say Potato and I Say Tater Tot

One thread of the comment discussion on my blog post The Metadata Continuum raised the excellent point that the demarcation of the border between data and metadata is important, but sometimes difficult to discern.  By extension, we can say the same thing about the demarcation of the border between data and information.

So, in this blog post, I thought I would try to offer an explanation about the importance of these demarcations using potatoes.

 

You Say Potato and I Say Potahto

Let’s Call the Whole Thing Off was a song written by George Gershwin and Ira Gershwin, which became famous for its playful lyrics that poked fun at the differences in the pronunciation of words, such as “you say potato and I say potahto.”

Spelling and pronunciation are included in the dictionary definition of a word, which is a good example of one of the many uses of metadata, namely as a label that provides a definition, description, and context for data.  Essentially, metadata describes data, and since data is attempting to describe a real world object, such as a potato, metadata is a further abstraction from reality.

And as we saw with the example of white horses in my blog post The Metadata Crisis, these abstract definitions can also include additional classifications (e.g., there are over 4,000 different varieties of potato), which also have to be well defined in order to facilitate clear communication and effective discussion.  These levels of abstractions, definitions, and classifications are essential to our attempts to understand, and do business with, the real world.  And this challenge continues even further with information.

 

You Say Potato and I Say Tater Tot

The difference, and relationship, between data and information is a common debate.  Not only do these two terms have varying definitions, but they are often used interchangeably.  Just a few examples include comparing and contrasting data quality with information quality, data management with information management, and data governance with information governance.

Some consider this an esoteric debate between data geeks and information nerds, but what is not debated is the importance of understanding how organizations use data and/or information to support their business activities.

Extending my analogy, data is like a potato and information is like a tater tot.  In other words, information is one of the many possible specific things that we can make using data, which is why information quality professionals often speak about the information product.

So it’s important to remember that we can’t have a tater tot (information) without a potato (data), and that we can’t have either a tater tot or a potato without having a working definition (metadata) of what a potato is.

 

Let’s Not Call the Whole Thing Data

David Corrigan recently blogged about the importance of the metadata that tracks the lineage of information presented to an end user, and how the root causes of data quality and data governance issues are impossible to discover without this metadata.

Therefore, the lines of demarcation separating metadata, data, and information are not just an esoteric technical debate.  These demarcations are foundational to the efficiency and effectiveness of business operations.  So, let’s not call the whole thing data.

Let’s acknowledge the separate, but deeply interrelated, continuum formed by the disciplines of metadata, data, and information.

 

Related Posts

The Metadata Continuum

The Metadata Crisis

What’s the Meta with your Data?

Let’s Meta a Data

Listen to Peter Benson discuss Metadata, Data, and Information on the Knights of the Data Roundtable

Listen to Daragh O Brien discuss Data and Information Quality on OCDQ Radio

Listen to Gordon Hamilton discuss the Information Product on OCDQ Radio

Plato’s Data

Data, Information, and Knowledge Management

The Data-Information Continuum

The First Law of Data Quality

OCDQ Radio - The Fall Back Recap Show

Thursday
Nov032011

The Metadata Continuum

Since my previous post about metadata received excellent commentary, I decided to write a follow-up post to address one of the many great points this discussion and its participants raised, namely the role of controlled vocabularies or metadata dictionaries.

According to an insightful comment from John O’Gorman, “the nature of the medium in which we are trying to solve these problems is multi-dimensional.  Any organization can have—and should manage—multiple dialects.”

“By that I mean,” O’Gorman continued, “in the dialect of accounting, customer means some agent who has contributed to increased sales.  In the dialect of marketing, customer can mean anyone with a pulse that will sit and listen to a pitch.  This insistence on a single version of anything, which is embedded in controlled vocabularies, relational tables, object classes, or a folder structure, is the single largest impediment to cleaning up the digital wasteland.”

One example of this digital wasteland metadata challenge, taken from the crowd-sourced wisdom of social media, is a hashtag, which Twitter users include in their tweets in order to tag them for search engines and trending topics websites.

Since it’s also a common strategy for making any type of unstructured data more usable, tagging is a great example of one of the semantic challenges of metadata.  Letting users freely choose tags often creates a so-called folksonomy, as opposed to forcing users to select terms only from a controlled vocabulary.  This freedom is precisely why the metadata resulting from tagging can include homonyms (i.e., the same tags used with different meanings) and synonyms (i.e., multiple tags for the same concept), which may lead to inappropriate data relationships and inefficient searches for data about a particular subject.
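To make the homonym and synonym problem concrete, here is a minimal Python sketch.  The tags and their intended meanings are hypothetical, and in a real folksonomy the intended meaning is usually not recorded at all, which is exactly what makes these collisions hard to find.

```python
from collections import defaultdict

# Hypothetical tagged items: (tag, concept the user had in mind).
# Real folksonomy data rarely records the intended concept; it is
# supplied here only to make the two problems visible.
tagged_items = [
    ("#dq", "data quality"),
    ("#dataquality", "data quality"),
    ("#dq", "Dairy Queen"),
    ("#mdm", "master data management"),
]


def find_homonyms(items):
    """Return tags that were used for more than one concept."""
    concepts_by_tag = defaultdict(set)
    for tag, concept in items:
        concepts_by_tag[tag].add(concept)
    return {t: sorted(c) for t, c in concepts_by_tag.items() if len(c) > 1}


def find_synonyms(items):
    """Return concepts that were labeled with more than one tag."""
    tags_by_concept = defaultdict(set)
    for tag, concept in items:
        tags_by_concept[concept].add(tag)
    return {c: sorted(t) for c, t in tags_by_concept.items() if len(t) > 1}


homonyms = find_homonyms(tagged_items)
synonyms = find_synonyms(tagged_items)
print(homonyms)
print(synonyms)
```

A search for the homonym tag would return results about two unrelated subjects, while a search for only one of the synonym tags would miss items filed under the other.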

 

The Metadata of Babel

Another insightful comment came from Peter Benson, based on his work with the eOTD (ECCMA Open Technical Dictionary).

“Mention the word metadata,” Benson explained, “and you have immediately lost all but the hard core techies and they have neither the authority nor the budget to solve the problem.  If you take a hard look at the financial crisis or cancer research you will indeed find the reason the challenges are so difficult to solve is in large part because of the limitations in our ability to communicate effectively and the lack of transparency that comes from poor data integration.  So, metadata is really important.”

“The Babel approach of a single language to unite them all,” Benson continued, “has a very poor track history and there is good reason for this.  Language is more about power and authority than it is about true communication.  We have tried to come up with a solution that is solely focused on achieving unambiguous communication.  It really does not matter what it is called as long as we agree on what it is.  We do this by using terminology to define concepts and then assigning concept identifiers that are used as metadata.  The separation of the terminology from the concept identifier, or rather linking terminology through a concept identifier, allows everyone to remain comfortably in their own space yet communicate with others.”
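The idea of linking terminology through a concept identifier can be sketched in a few lines of Python.  This is only my illustration of the principle described in the comment: the identifiers, definitions, and dialect names below are invented, and none of it is taken from the eOTD itself.

```python
# Hypothetical concept registry: each concept has an opaque identifier.
concepts = {
    "C-0001": "A party that has purchased goods or services",
    "C-0002": "A party that has been contacted about a possible purchase",
}

# Each dialect links its own terms to a concept identifier.
terminology = {
    ("accounting", "customer"): "C-0001",
    ("marketing", "customer"): "C-0002",
    ("marketing", "client"): "C-0001",
}


def resolve(dialect, term):
    """Look up the concept identifier a dialect means by a term."""
    return terminology.get((dialect, term.lower()))


def same_concept(dialect_a, term_a, dialect_b, term_b):
    """True when both dialects are referring to the same concept."""
    id_a = resolve(dialect_a, term_a)
    id_b = resolve(dialect_b, term_b)
    return id_a is not None and id_a == id_b


def translate(term, from_dialect, to_dialect):
    """Find the terms another dialect uses for the same concept."""
    concept_id = resolve(from_dialect, term)
    if concept_id is None:
        return []
    return sorted(
        t for (d, t), cid in terminology.items()
        if d == to_dialect and cid == concept_id
    )
```

In this sketch, accounting and marketing both keep the word customer, yet `same_concept("accounting", "customer", "marketing", "customer")` is false, while `translate("customer", "accounting", "marketing")` finds the marketing term that shares accounting's concept identifier.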

 

The Metadata Continuum

So it would appear that we face a daunting challenge, which we could call the Metadata Continuum, where at one end we have the uniformity of controlled vocabularies, and at the other end we have the flexibility of chaotic folksonomies.  The daily business operations of most organizations are governed by a metadata strategy that falls somewhere in between, which raises the question: In which direction should the best practices of metadata management flow, toward flexibility or toward uniformity?

Since in my previous post I used an example of the metadata complexities of everyday language, I thought it might be useful to share two perspectives about linguistic flexibility and uniformity.

In his book Final Jeopardy: Man vs. Machine and the Quest to Know Everything, Stephen Baker explained that “flexibility isn’t a weakness of language, but a strength.  Humans need words to be inexact.  If they were too precise, each person would have a unique vocabulary of several billion words, all of them unintelligible to everyone else.  You might have a unique word for the sip of coffee you just took at 7:59 A.M., which was flavored with the anxiety about the traffic in the Lincoln Tunnel or along Paris’s Boulevard Périphérique.  But that single word would be as useless to you as to everyone else.  A word has to be used at least twice to have any purpose.  Each word is a lingua franca, a fragment of a clumsy common language.”

“Yet paradoxically,” explained Kevin Kelly, in his book What Technology Wants, “diversity can be unleashed by a type of uniformity.  The uniformity of a standard writing system (like an alphabet or script) unleashes the unexpected diversity of literature.  Without uniform rules, every word has to be made up, so communication is localized, inefficient, and thwarted.”

“But with a uniform language,” Kelly continued, “sufficient communication transpires in large circles so that a novel word, phrase, or idea can be appreciated, caught, and disseminated.  The rigidity of an alphabet has done more to enable creativity than any unhinged brain-storming exercise ever invented.  The standard 26 letters in English have produced 16 million different books in English.  Words and language will keep evolving, but their evolution rides on basic fundamentals that are conserved and shared; unvarying (over the short term) letters, spelling, and grammar rules enable creativity in ideas.  In a curious way, the homogenization of shared universals allows the transmission of diversity.”

Perhaps since both flexibility and uniformity have linguistic value, metadata will forever remain a continuum between the two.

Where along the Metadata Continuum is your organization?

 

Related Posts

The Metadata Crisis

What’s the Meta with your Data?

Let’s Meta a Data

The First Law of Data Quality

Data Quality and the Cupertino Effect

DQ-Tip: “There is no such thing as data accuracy...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

Plato’s Data

Data, Information, and Knowledge Management

The Data Cold War

The Semantic Future of MDM

OCDQ Radio - A Brave New Data World

Thursday
Oct272011

The Metadata Crisis

I am reading the book The Information: A History, a Theory, a Flood by James Gleick, which recounts a dialogue written by the ancient Chinese philosopher Gongsun Long known as When a White Horse is Not a Horse:

“Horses certainly have color.  Hence, there are white horses.  If it were the case that horses had no color, there would simply be horses, and then how could one select a white horse?  And so it follows that a horse and a white horse are different.  Hence, I say that a white horse is not a horse.

Furthermore, a white horse is a horse and white, but horse is that by means of which one names the shape, and white is that by means of which one names the color.  What names the color is not what names the shape.  Hence, I say that a white horse is not a horse.”

“On its face, this is unfathomable,” explained Gleick, “but it begins to come into focus as a statement about language and logic.  Paradoxes like this formed part of what Chinese historians called the language crisis, a running debate over the nature of language.  Names are not the things they name.”

One of my favorite topics is how data is not the real world it describes.  But perhaps a better data management example of how “names are not the things they name” is metadata.  Julie Hunt blogged about this in her post Stumbling Over Metadata, which explored better definitions than the oversimplified “metadata is data about data.”

Metadata can be thought of as a label that provides a definition, description, and context for data.  Common examples include relational table definitions and flat file layouts.  More detailed examples of metadata include conceptual and logical data models.
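A flat file layout is perhaps the easiest of these examples to show in code.  In the minimal Python sketch below, the field names, positions, and sample record are all hypothetical; the point is only that the record is unreadable until the layout (the metadata) tells us where each field starts and ends and what it means.

```python
# Hypothetical fixed-width layout: (field name, start, end, description).
# The layout is the metadata; the record further down is the data.
LAYOUT = [
    ("customer_id", 0, 6, "Unique customer number"),
    ("last_name", 6, 16, "Family name, padded with spaces"),
    ("revenue", 16, 24, "Revenue in whole dollars, zero padded"),
]


def parse_record(line, layout=LAYOUT):
    """Slice a fixed-width record into named fields using the layout."""
    return {name: line[start:end].strip() for name, start, end, _ in layout}


def describe(field, layout=LAYOUT):
    """Return the definition the layout gives for a field name."""
    for name, _, _, description in layout:
        if name == field:
            return description
    return None


# A 24-character record that means nothing without the layout above.
record = "000042" + "Harris".ljust(10) + "00001500"
parsed = parse_record(record)
print(parsed)
```

Changing the layout without changing the record would silently change what the data appears to say, which is one reason metadata and data quality are so closely related.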

Therefore, metadata—among its many other uses—often plays an integral role in determining your data usage.  Although it’s often overlooked, there is a strong relationship between metadata and data quality, and by extension, between metadata and data-driven decision making, since a business intelligence report’s metadata often provides the framing effect for its data.

I have often witnessed what could be called the metadata crisis, a running debate within many organizations over the meaning of commonly used terms like revenue, which complicates what on the surface seem like straightforward business questions, such as how much revenue was generated during a particular fiscal reporting period.

A metadata management version of When a White Horse is Not a Horse might be When Recognized Revenue is Not Revenue.
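As a deliberately simplified sketch of that debate, the Python below answers the same question, how much revenue was generated in a quarter, under two different definitions of revenue.  The orders, dates, and the two definitions are hypothetical and ignore the real complexities of revenue recognition rules.

```python
from datetime import date

# Hypothetical orders with the date each was booked and the date its
# revenue was recognized.
orders = [
    {"amount": 1000, "booked": date(2011, 9, 28), "recognized": date(2011, 10, 5)},
    {"amount": 500, "booked": date(2011, 9, 15), "recognized": date(2011, 9, 20)},
    {"amount": 250, "booked": date(2011, 10, 2), "recognized": date(2011, 10, 9)},
]


def revenue(orders, start, end, date_field):
    """Total the amounts whose chosen date falls within the period."""
    return sum(o["amount"] for o in orders if start <= o[date_field] <= end)


# One reporting period, two definitions of "revenue".
period_start, period_end = date(2011, 7, 1), date(2011, 9, 30)
booked_revenue = revenue(orders, period_start, period_end, "booked")
recognized_revenue = revenue(orders, period_start, period_end, "recognized")
print(booked_revenue, recognized_revenue)
```

The data is identical in both calculations; only the metadata, the definition of which date makes an order count as revenue, differs, and the two totals disagree.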

However, the complexities of revenue recognition probably pale in comparison with the metadata crisis that can be caused by what David Loshin calls the most dangerous question in data management: What is the definition of customer?

What examples of the metadata crisis have you encountered in your organization?

 

Related Posts

What’s the Meta with your Data?

Let’s Meta a Data

The First Law of Data Quality

Data Quality and the Cupertino Effect

DQ-Tip: “There is no such thing as data accuracy...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

Plato’s Data

Data, Information, and Knowledge Management

The Data Cold War

The Semantic Future of MDM

OCDQ Radio - Master Data Management in Practice

OCDQ Radio - A Brave New Data World

Wednesday
Jun012011

A Brave New Data World

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Welcome to the highly anticipated debut episode of the Obsessive-Compulsive Data Quality (OCDQ) podcast—OCDQ Radio!

In this episode, I discuss how data, data quality, data-driven decision making, and metadata quality no longer reside exclusively within the esoteric realm of data management.  Data has now so thoroughly pervaded mainstream culture that we hardly seem to notice that we are quite literally swimming in data on a daily basis.

The growing challenge is whether we can extract meaningful insights from these vast and veritable oceans of unrelenting data volumes, and use those insights to make better decisions in near real-time in order to positively impact the various aspects of our lives.

We are now living in a brave new data world where everyone is a data geek—and data quality affects us all.

Or to paraphrase William Shakespeare:

“O wonder!

How many goodly data are there here!  How beauteous data geeks are! 

O brave new world! 

That is so dependent on the quality of the data in it!”

 

Related Posts

Data, data everywhere, but where is data quality?

Data In, Decision Out

The Data-Decision Symphony

Data Confabulation in Business Intelligence

The Real Data Value is Business Insight

Thaler’s Apples and Data Quality Oranges

Amazon’s Data Management Brain

The Reptilian Anti-Data Brain

Identifying Duplicate Customers

Data Quality and the Cupertino Effect

What’s the Meta with your Data?

Let’s Meta a Data