Studying Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

On this episode, Gordon Hamilton and I discuss key data quality concepts, including those we have studied in some of our favorite data quality books and, more importantly, those we have implemented in our careers as data quality practitioners.

Gordon Hamilton is a Data Quality and Data Warehouse professional whose 30 years’ experience in the information business encompasses many industries, including government, legal, healthcare, insurance, and financial services.  Gordon was most recently engaged in the healthcare industry in British Columbia, Canada, where he continues to advise several healthcare authorities on data quality and business intelligence platform issues.

Gordon Hamilton’s passion is to bring together:

  • Exposing business rules through data profiling, as recommended by Ralph Kimball.

  • Monitoring business rules in the EQTL (Extract-Quality-Transform-Load) pipeline leading into the data warehouse.

  • Managing the business rule violations through systemic and specific solutions within the statistical process control framework of Shewhart/Deming.

  • Researching how to sustain data quality metrics as the “fit for purpose” definitions change faster than the information product process can easily adapt.

Gordon Hamilton’s moniker of DQStudent on Twitter hints at his plan to dovetail his Lean Six Sigma skills and experience with the data quality foundations to improve the manufacture of the “information product” in today’s organizations.  Gordon is a member of IAIDQ, TDWI, and ASQ, as well as an enthusiastic reader of anything pertaining to data.

Gordon Hamilton recently became an Information Quality Certified Professional (IQCP), via the IAIDQ certification program.

Recommended Data Quality Books

By no means a comprehensive list, and listed in no particular order whatsoever, the following books were either discussed during this OCDQ Radio episode, or are otherwise recommended for anyone looking to study data quality and its related disciplines:

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.

The Data Cold War

One of the many things I love about Twitter is its ability to spark ideas via real-time conversations.  For example, while live-tweeting during last week’s episode of DM Radio, the topic of which was how to get started with data governance, I tweeted about the data silo challenges and corporate cultural obstacles being discussed.

I tweeted that data is an asset only if it is a shared asset, across the silos, across the corporate culture, and that, in order to be successful with data governance, organizations must replace the mantra “my private knowledge is my power” with “our shared knowledge empowers us all.”

“That’s very socialist thinking,” Mark Madsen responded.  “Soon we’ll be having arguments about capitalizing over socializing our data.”

To which I responded that the more socialized data is, the more capitalized data can become . . . just ask Google.

“Oh no,” Mark humorously replied, “decades of political rhetoric about socialism to be ruined by a discussion of data!”  And I quipped that discussions about data have been accused of worse, and decades of data rhetoric certainly hasn’t proven very helpful in corporate politics.

 

Later, while ruminating on this light-hearted exchange, I wondered if we actually are in the midst of the Data Cold War.

 

The Data Cold War

The Cold War, which lasted approximately from 1946 to 1991, was the political, military, and economic competition between the Communist World, primarily the former Soviet Union, and the Western world, primarily the United States.  One of the major tenets of the Cold War was the conflicting ideologies of socialism and capitalism.

In enterprise data management, one of the most debated ideologies is whether or not data should be viewed as a corporate asset, especially by the for-profit corporations of capitalism, which was the world’s dominant economic model even before the Cold War began and will likely forever remain so.

My earlier remark that data is an asset only if it is a shared asset, across the silos, across the corporate culture, is indicative of the bounded socialist view of enterprise data.  In other words, almost no one in the enterprise data management space is suggesting that data should be shared beyond the boundary of the organization.  In this sense, advocates of data governance, myself included, are advocating socializing data within the enterprise so that data can be better capitalized as a true corporate asset.

This mindset makes sense because sharing data with the world, especially for free, couldn’t possibly be profitable — or could it?

 

The Master Data Management Magic Trick

The genius (and some justifiably wonder whether it is evil genius) of companies like Google and Facebook is that they realized how to make money in a free world — by which I mean the world of Free: The Future of a Radical Price, the 2009 book by Chris Anderson.

By encouraging their users to freely share their own personal data, Google and Facebook ingeniously answer what David Loshin calls the most dangerous question in data management: What is the definition of customer?

How do Google and Facebook answer the most dangerous question?

A customer is a product.

This is the first step that begins what I call the Master Data Management Magic Trick.

Instead of trying to manage the troublesome master data domain of customer and link it, through sales transaction data, to the master data domain of product (products, by the way, have always been accepted as a corporate asset, even though product data has not been), Google and Facebook simply eliminate the need for customers.  By extension, they also eliminate the need for customer service, because a free product has no customers.  They transform what would otherwise be customers into the very product that they sell — and, in fact, the only “real” product that they have.

And since what their users perceive as their product is virtual (i.e., entirely Internet-based), it’s not really a product, but instead a free service, which can be discontinued at any time.  And if it were, who would you complain to?  And on what basis?

After all, you never paid for anything.

This is the second step that completes the Master Data Management Magic Trick — a product is a free service.

Therefore, Google and Facebook magically make both their customers and their products (i.e., master data) disappear, while simultaneously making billions of dollars (i.e., transaction data) appear in their corporate bank accounts.

(Yes, the personal data of their users is master data.  However, because it is used in an anonymized and aggregated format, it is not, nor does it need to be, managed like the master data we talk about in the enterprise data management industry.)

 

Google and Facebook have Capitalized Socialism

By “empowering” us with free services, Google and Facebook use the power of our own personal data against us — by selling it.

However, it’s important to note that they indirectly sell our personal data as anonymized and aggregated demographic data.

Although they do not directly sell our individually identifiable information (truthfully, it has very limited value, and selling it outright would mostly be illegal, i.e., identity theft), Google and Facebook do occasionally get sued (mostly outside the United States) for violating data privacy and data protection laws.

However, precisely because we freely give them our personal data, until (or unless) laws are changed to protect us from ourselves, it is almost impossible to prove they are doing anything illegal (again, their undeniable genius is arguably evil genius).

Google and Facebook are the exact same kind of company — they are both Internet advertising agencies.

They both sell online advertising space to other companies, which are looking to demographically target prospective customers because those companies actually do view people as potential real customers for their own real products.

The irony is that if all of their users stopped using their free services, then not only would our personal data be more private and more secure, but the revenue streams of Google and Facebook would eventually dry up because, by design, they have neither real customers nor real products.  More precisely, their only real customers (other companies) would stop buying advertising from them, because no one would ever see their ads, let alone (as even now happens only occasionally) click on them.

Essentially, companies like Google and Facebook are winning the Data Cold War because they have capitalized socialism.

In other words, the bottom line is Google and Facebook have socialized data in order to capitalize data as a true corporate asset.

 

Related Posts

Freemium is the future – and the future is now

The Age of the Platform

Amazon’s Data Management Brain

The Semantic Future of MDM

A Brave New Data World

Big Data and Big Analytics

A Farscape Analogy for Data Quality

Organizing For Data Quality

Sharing Data

Song of My Data

Data in the (Oscar) Wilde

The Most August Imagination

Once Upon a Time in the Data

The Idea of Order in Data

Hell is other people’s data

DAMA International

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

DAMA International is a non-profit, vendor-independent, global association of technical and business professionals dedicated to advancing the concepts and practices of information and data management.

On this episode, special guest Loretta Mahon Smith provides an overview of the Data Management Body of Knowledge (DMBOK) and Certified Data Management Professional (CDMP) certification program.

Loretta Mahon Smith is a visionary and influential data management professional known for her consistent awareness of trends in the forefront of the industry.  Since 1983, she has worked in international financial services, and been actively involved in the maturity and growth of Information Architecture functions, specializing in Data Stewardship and Data Strategy Development.

Loretta Mahon Smith has been a member of DAMA for more than 10 years, with a lifetime membership to the DAMA National Capitol Region Chapter.  As President of the chapter she has the opportunity to help the Washington DC and Baltimore data management communities.  She serves the world community by her involvement on the DAMA International Board as VP of Communications.  She additionally volunteers her time to work on the ICCP Certification Council, most recently working on the development of the Zachman and Data Governance examinations.

In the past, Loretta has facilitated Special Interest Group sessions on Governance and Stewardship and presented Stewardship training at numerous local chapters for DAMA, IIBA, TDWI, and ACM, as well as major conferences including Project World (IIBA), INFO360 (AIIM), EDW (DAMA) and the IQ.  She earned Certified Computing Professional (CCP), Certified Business Intelligence Professional (CBIP), and Certified Data Management Professional (CDMP) designations, achieving mastery level proficiency rating in Data Warehousing, Data Management, and Data Quality.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

The Higher Education of Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

On this episode of OCDQ Radio, we leave the corporate world, where data quality and master data management are mostly focused on the challenges of managing data about customers, products, and revenue, and we get schooled in the higher education of data quality.  In other words, we discuss data quality and master data management in higher education, which are mostly focused on the challenges of managing data about students, courses, and tuition.

Our guest lecturer will be Mark Horseman, who has been working at the University of Saskatchewan for over 10 years and has been on the implementation team of many of the University’s enterprise software solutions.  Mark Horseman now works in Information Strategy and Analytics, leveraging his knowledge to assist the University in managing its data quality challenges.

Follow Mark Horseman on Twitter and read his Eccentric Data Quality blog to hear more about the challenges faced by Mark on his quest (yes, it’s a quest) to improve Higher-Education Data Quality.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

A Farscape Analogy for Data Quality

Farscape was one of my all-time favorite science fiction television shows.  In the weird way my mind works, the recent blog post Four Steps to Fixing Your Bad Data by Tom Redman (which has received great comments) triggered a Farscape analogy.

“The notion that data are assets sounds simple and is anything but,” Redman wrote.  “Everyone touches data in one way or another, so the tendrils of a data program will affect everyone — the things they do, the way they think, their relationships with one another, your relationships with customers.”

The key word for me was tendrils — like I said, my mind works in a weird way.

 

Moya and Pilot

On Farscape, the central characters of the show travel through space aboard Moya, a Leviathan, which is a species of living, sentient spaceships.  Pilot is a sentient creature (of a species also known as Pilots) with the vast capacity for multitasking that is necessary for the simultaneous handling of the many systems aboard a Leviathan.  The tendrils of a Pilot’s lower body are biologically bonded with the living systems of a Leviathan, creating a permanent symbiotic connection, meaning that, once bonded, a Pilot and a Leviathan can no longer exist independently for more than an hour or so, or both of them will die.

Leviathans were one of the many laudably original concepts of Farscape.  The role of the spaceship in most science fiction is analogous to the role of a boat.  In other words, traveling through space is most often imagined like traveling on water.  However, seafaring vessels and spaceships are usually seen as technological objects providing transportation and life support, but not actually alive in their own right (despite the fact that both types of ship are usually anthropomorphized, and usually as female).

Because Moya was alive, when she was damaged, she felt pain and needed time to heal.  And because she was sentient, highly intelligent, and capable of communicating with the crew through Pilot (who was the only one who could understand the complexity of the Leviathan language, which was beyond the capability of a universal translator), Moya was much more than just a means of transportation.  In other words, there truly was a symbiotic relationship, not only between Moya and Pilot, but also between the two of them and their crew and passengers.

 

Enterprise and Data

(Sorry, my fellow science fiction geeks, but it’s not that Enterprise and that Data.  Perfectly understandable mistake, though.)

Although technically not alive in the biological sense, in many respects an organization is like a living, sentient organism, and, like space and seafaring ships, it is often anthropomorphized.  An enterprise is much more than just a large organization providing a means of employment and offering products and/or services (and, in a sense, life support to its employees and customers).

As Redman explains in his book Data Driven: Profiting from Your Most Important Business Asset, data is not just the lifeblood of the Information Age, data is essential to everything the enterprise does, from helping it better understand its customers, to guiding its development of better products and/or services, to setting a strategic direction toward achieving its business goals.

So the symbiotic relationship between Enterprise and Data is analogous to the symbiotic relationship between Moya and Pilot.

Data is the Pilot of the Enterprise Leviathan.  The enterprise cannot survive without its data.  A healthy enterprise requires healthy data — data of sufficient quality capable of supporting the operational, tactical, and strategic functions of the enterprise.

Returning to Redman’s words, “Everyone touches data in one way or another, so the tendrils of a data program will affect everyone — the things they do, the way they think, their relationships with one another, your relationships with customers.”

So the relationship between an enterprise and its data, and its people, business processes, and technology, is analogous to the relationship between Moya and Pilot, and their crew and passengers.  It is the enterprise’s people, its crew (i.e., employees), who, empowered by high quality data and enabled by technology, optimize business processes for superior corporate performance, thereby delivering superior products and/or services to the enterprise’s passengers (i.e., customers).

 

So why isn’t data viewed as an asset?

So if this deep symbiosis exists, if these intertwined and symbiotic relationships exist, if the tendrils of data are biologically bonded with the complex enterprise ecosystem — then why isn’t data viewed as an asset?

In Data Driven, Redman references the book The Social Life of Information by John Seely Brown and Paul Duguid, who explained that “a technology is never fully accepted until it becomes invisible to those who use it.”  The term informationalization describes the process of building data and information into a product or service.  “When products and services are fully informationalized,” Redman noted, then data “blends into the background and people do not even think about it anymore.”

Perhaps that is why data isn’t viewed as an asset.  Perhaps data has so thoroughly pervaded the enterprise that it has become invisible to those who use it.  Perhaps it is not an asset because data is invisible to those who are so dependent upon its quality.

 

Perhaps we only see Moya, but not her Pilot.

 

Related Posts

Organizing For Data Quality

Data, data everywhere, but where is data quality?

Finding Data Quality

The Data Quality Wager

Beyond a “Single Version of the Truth”

Poor Data Quality is a Virus

DQ-Tip: “Don't pass bad data on to the next person...”

Retroactive Data Quality

Hyperactive Data Quality (Second Edition)

A Brave New Data World

International Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

On this episode of OCDQ Radio, I discuss the sometimes mysterious world of international name and address data quality, which is why I am pleased to be joined by, not an international man of mystery, but instead, an international man of data quality.

Graham Rhind is an acknowledged expert in the field of data quality.  Graham runs GRC Database Information, a consultancy company based in The Netherlands, where he researches postal code and addressing systems, collates international data, runs a busy postal link website, writes data management software, and maintains an online Data Quality Glossary.

Graham Rhind speaks regularly on the subject and is the author of four books on the topic of international data management, including The Global Source Book for Name and Address Data Management, which has been an invaluable resource for me.

On this episode of OCDQ Radio, Graham Rhind and I discuss the international challenges of postal address and person name data quality, including their implications for web forms and other data entry interfaces.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Big Data and Big Analytics

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Jill Dyché is the Vice President of Thought Leadership and Education at DataFlux.  Jill’s role at DataFlux is a combination of best-practice expert, key client advisor and all-around thought leader.  She is responsible for industry education, key client strategies and market analysis in the areas of data governance, business intelligence, master data management and customer relationship management.  Jill is a regularly featured speaker and the author of several books.

Jill’s latest book, Customer Data Integration: Reaching a Single Version of the Truth (Wiley & Sons, 2006), was co-authored with Evan Levy and shows the business breakthroughs achieved with integrated customer data.

Dan Soceanu is the Director of Product Marketing and Sales Enablement at DataFlux.  Dan manages global field sales enablement and product marketing, including product messaging and marketing analysis.  Prior to joining DataFlux in 2008, Dan has held marketing, partnership and market research positions with Teradata, General Electric and FormScape, as well as data management positions in the Financial Services sector.

Dan received his Bachelor of Science in Business Administration from Kutztown University of Pennsylvania, as well as earning his Master of Business Administration from Bloomsburg University of Pennsylvania.

On this episode of OCDQ Radio, Jill Dyché, Dan Soceanu, and I discuss the recent Pacific Northwest BI Summit, where the three core conference topics were Cloud, Collaboration, and Big Data, the last of which led to a discussion about Big Analytics.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Are you turning Ugly Data into Cute Information?

The ways of the data force are sometimes difficult to understand precisely because they are difficult to see.

Daragh O Brien and I were discussing this recently on Twitter, where tweets about data quality and information quality form the midi-chlorians of the data force.  Share disturbances you’ve felt in the data force using the #UglyData and #CuteInfo hashtags.

 

Presentation Quality

Perhaps one of the most common examples of the difference between data and information is the presentation layer created for business users.  In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray defines Presentation Quality as “a measure of how information is presented to, and collected from, those who utilize it.  Format and appearance support appropriate use of the information.”

Tom Redman emphasizes that the two most important points in the data lifecycle are when data is created and when data is used.

I describe the connection between those two points as the Data-Information Bridge.  By passing over this bridge, data becomes the information used to make the business decisions that drive the tactical and strategic initiatives of the organization.  Some of the most important activities of enterprise data management actually occur on the Data-Information Bridge, where preventing critical disconnects between data creation and data usage is essential to the success of the organization’s business activities.

Defect prevention and data cleansing are two of the required disciplines of an enterprise-wide data quality program.  Defect prevention is focused on the moment of data creation, attempting to enforce better controls to prevent poor data quality at the source.  Data cleansing can either be used to compensate for a lack of defect prevention, or it can be included in the processing that prepares data for a specific use (i.e., transforms data into information fit for the purpose of a specific business use).

 

The Dark Side of Data Cleansing

In a previous post, I explained that although most organizations acknowledge the importance of data quality, they don’t believe that data quality issues occur very often because the information made available to end users in dashboards and reports often passes through many processes that cleanse or otherwise sanitize the data before it reaches them.

ETL processes that extract source data for a data warehouse load will often perform basic data quality checks.  However, a fairly standard practice for “resolving” a data quality issue is to substitute either a missing or default value (e.g., a date stored in a text field in the source, which cannot be converted into a valid date value, is loaded with either a NULL value or the processing date).
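
To make this pattern concrete, here is a minimal sketch in Python (the date formats, field values, and function name are hypothetical illustrations, not taken from any particular ETL tool) of how such a “resolution” often works: the unconvertible text is quietly replaced with a NULL or with the processing date, and the original value never reaches the data warehouse.

```python
from datetime import datetime, date

def load_order_date(raw_value, processing_date=None):
    """Attempt to convert a text field from the source system into a date.

    If conversion fails, "resolve" the issue the way many ETL jobs do:
    substitute either a missing value (None/NULL) or the processing date.
    The original, unconvertible text is silently discarded.
    """
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y"):
        try:
            return datetime.strptime(raw_value.strip(), fmt).date()
        except (ValueError, AttributeError):
            continue
    # Ugly data in, cute information out: downstream reports show a tidy
    # date (or a blank), and no one sees the original source value.
    return processing_date  # pass None instead to load a NULL

# Example: a source value that is not a valid date
print(load_order_date("07/32/2011", processing_date=date(2011, 7, 15)))
```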

When postal address validation software generates a valid mailing address, it often does so by removing what it considers to be “extraneous” information from the input address fields, which may include valid data that was accidentally entered in the wrong field, or that never had an input field of its own (e.g., an e-mail address entered in an address field is deleted from the validated output mailing address).

And some reporting processes intentionally filter out “bad records” or eliminate “outlier values.”  This happens most frequently when preparing highly summarized reports, especially those intended for executive management.
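
As a hedged illustration of this kind of report preparation (a minimal sketch assuming pandas is available; the regions, amounts, and thresholds are invented), the filtering below silently drops a “bad record” and an “outlier value” before the summary is produced:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "amount": [1200.0, 1350.0, 980.0, -45.0, 250000.0],
})

# "Bad records" (a negative amount from a refund) and "outlier values"
# (an implausibly large amount) are dropped before summarizing, and the
# report never mentions that they were dropped.
clean = orders[(orders["amount"] > 0) & (orders["amount"] < 100_000)]

summary = clean.groupby("region")["amount"].sum()
print(summary)
# East    2550.0
# West     980.0
```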

These are just a few examples of the Dark Side of Data Cleansing, which can turn Ugly Data into Cute Information.

 

Has your Data Quality turned to the Dark Side?

Like truth, beauty, and singing ability, data quality is in the eye of the beholder.  Or, since data quality is most commonly defined as fitness for the purpose of use, we could say that data quality is in the eyes of the user.  But how do users know if data is truly fit for their purpose, or if they are simply being presented with information that is aesthetically pleasing for their purpose?

Has your data quality turned to the dark side by turning ugly data into cute information?

 

Related Posts

Data, Information, and Knowledge Management

Beyond a “Single Version of the Truth”

The Data-Information Continuum

The Circle of Quality

Data Quality and the Cupertino Effect

The Idea of Order in Data

Hell is other people’s data

OCDQ Radio - Organizing for Data Quality

The Reptilian Anti-Data Brain

Amazon’s Data Management Brain

Holistic Data Management (Part 3)

Holistic Data Management (Part 2)

Holistic Data Management (Part 1)

OCDQ Radio - Data Governance Star Wars

Data Governance Star Wars: Bureaucracy versus Agility

Organizing for Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Dr. Thomas C. Redman (the “Data Doc”) is an innovator, advisor and teacher.  He was first to extend quality principles to data and information in the late 80s.  Since then he has crystallized a body of tools, techniques, roadmaps and organizational insights that help organizations make order-of-magnitude improvements.

More recently Tom has developed keen insights into the nature of data and formulated the first comprehensive approach to “putting data to work.”  Taken together, these enable organizations to treat data as assets of virtually unlimited potential.

Tom has personally helped dozens of leaders and organizations better understand data and data quality and start their data programs.  He is a sought-after lecturer and the author of dozens of papers and four books.  The most recent, Data Driven: Profiting from Your Most Important Business Asset (Harvard Business Press, 2008) was a Library Journal best buy of 2008.

Prior to forming Navesink Consulting Group in 1996, Tom conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995.  Tom holds a Ph.D. in statistics from Florida State University.  He holds two patents.

On this episode of OCDQ Radio, Tom Redman and I discuss concepts from his Data Governance and Information Quality 2011 post-conference tutorial about organizing for data quality, which includes his call to action for your role in the data revolution.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Data, Information, and Knowledge Management

The difference, and relationship, between data and information is a common debate.  Not only do these two terms have varying definitions, but they are often used interchangeably.  Just a few examples include comparing and contrasting data quality with information quality, data management with information management, and data governance with information governance.

In a previous blog post, I referenced the Information Hierarchy provided by Professor Ray R. Larson of the School of Information at the University of California, Berkeley:

  • Data – The raw material of information
  • Information – Data organized and presented by someone
  • Knowledge – Information read, heard, or seen, and understood
  • Wisdom – Distilled and integrated knowledge and understanding

Some consider this an esoteric debate between data geeks and information nerds, but what is not debated is the importance of understanding how organizations use data and/or information to support their business activities.  Of particular interest is the organization’s journey from data to decision, the latter of which is usually considered the primary focus of business intelligence.

In his recent blog post, Scott Andrews explained what he called The Information Continuum:

  • Data – A Fact or a piece of information, or a series thereof
  • Information – Knowledge discerned from data
  • Business Intelligence – Information Management pertaining to an organization’s policy or decision-making, particularly when tied to strategic or operational objectives

 

Knowledge Management

[Image: “Data Cake” by EpicGraphic]

This recent graphic does a great job of visualizing the difference between data and information, as well as the importance of how information is presented.  Although the depiction of knowledge as consumed information is oversimplified, I am not sure how this particular visual metaphor could properly represent knowledge as actually understanding the consumed information.

It’s been a while since the term knowledge management was in vogue within the data management industry.  When I began my career in the early 1990s, I remember hearing about knowledge management as often as we hear about data governance today, which, as you know, is quite often.  The reason I have resurrected the term in this blog post is that I can’t help but wonder if the debate about data and information obfuscates the fact that the organization’s appetite, its business hunger, is for knowledge.

 

Three Questions for You

  1. Does your organization make a practical distinction between data and information?
  2. If so, how does this distinction affect your quality, management, and governance initiatives?
  3. What is the relationship between those initiatives and your business intelligence efforts?

 

Please share your thoughts and experiences by posting a comment below.

 

Related Posts

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Data In, Decision Out

The Data-Decision Symphony

Data Confabulation in Business Intelligence

Thaler’s Apples and Data Quality Oranges

DQ-View: Baseball and Data Quality

Beyond a “Single Version of the Truth”

The Business versus IT—Tear down this wall!

Finding Data Quality

Fantasy League Data Quality

The Circle of Quality

Data Quality Mischief Managed

Even if you are not a fan of Harry Potter (i.e., you’re a Muggle who hasn’t either read the books or at least seen the movies), you’re probably aware the film franchise concludes this summer.

As I have discussed in my blog post Data Quality Magic, data quality tools are not magic in and of themselves, but like the wands in the wizarding world of Harry Potter, they channel the personal magic force of the wizards or witches who wield them.  In other words, the magic in the wizarding world of data quality comes from the people working on data quality initiatives.

Extending the analogy, data quality methodology is like the books of spells and potions in Harry Potter, which are also not magic in and of themselves, but again require people through which to channel their magical potential.  And the importance of having people who are united by trust, cooperation, and collaboration is the data quality version of the Order of the Phoenix, with the Data Geeks battling against the Data Eaters (i.e., the dark wizards, witches, spells, and potions that are perpetuating the plague of poor data quality throughout the organization).

And although data quality doesn’t have a Marauder’s Map (nor does it usually require you to recite the oath: “I solemnly swear that I am up to no good”), sometimes the journey toward getting your organization’s data quality mischief managed feels like you’re on a magical quest.

 

Related Posts

Data Quality Magic

Data Quality is not a Magic Trick

Do you believe in Magic (Quadrants)?

There are no Magic Beans for Data Quality

The Tooth Fairy of Data Quality

Video: Oh, the Data You’ll Show!

Data Quality and #FollowFriday the 13th

Dilbert, Data Quality, Rabbits, and #FollowFriday

Spartan Data Quality

Pirates of the Computer: The Curse of the Poor Data Quality

The Tell-Tale Data

Data Quality is People!

Commendable Comments (Part 10)

Welcome to the 300th Obsessive-Compulsive Data Quality (OCDQ) blog post!

You might have been expecting a blog post inspired by the movie 300, but since I already did that with Spartan Data Quality, instead I decided to commemorate this milestone with the 10th entry in my ongoing series for expressing my gratitude to my readers for their truly commendable comments on my blog posts.

 

Commendable Comments

On DQ-BE: Single Version of the Time, Vish Agashe commented:

“This has been one of my pet peeves for a long time. Shared version of truth or the reference version of truth is so much better, friendly and non-dictative (if such a word exists) than single version of truth.

I truly believe that starting a discussion with Single Version of the Truth with business stakeholders is a nonstarter. There will always be a need for multifaceted view and possibly multiple aspects of the truth.

A very common term/example I have come across is the usage of the term revenue. Unfortunately, there is no single version of revenue across the organizations (and for valid reasons). From Sales Management prospective, they like to look at sales revenue (sales bookings) which is the business on which they are compensated on, financial folks want to look at financial revenue, which is the revenue they capture in the books and marketing possibly wants to look at marketing revenue (sales revenue before the discount) which is the revenue marketing uses to justify their budgets. So if you ever asked questions to a group of people about what revenue of the organization is, you will get three different perspectives. And these three answers will be accurate in the context of three different groups.”

On Data Confabulation in Business Intelligence, Henrik Liliendahl Sørensen commented:

“I think this is going to dominate the data management realm in the coming years. We are not only met with drastically increasing volumes of data, but also increasing velocity and variety of data.

The dilemma is between making good decisions and making fast decisions, whether the decisions based on business intelligence findings should wait for assuring the quality of the data upon which the decisions are made, thus risking the decision being too late. If data quality always could be optimal by being solved at the root we wouldn’t have that dilemma.

The challenge is if we are able to have optimal data all the time when dealing with extreme data, which is data of great variety moving in high velocity and coming in huge volumes.”

On The People Platform, Mark Allen commented:

“I definitely agree and think you are burrowing into the real core of what makes or breaks EDM and MDM type initiatives -- it's the people.

Business models, processes, data, and technology all provide fixed forms of enablement or constraint. And where in the past these dynamics have been very compartmentalized throughout a company's business model and systems architecture, with EDM and MDM involving more integrated functions and shared data, people become more of the x-factor in the equation. This demands the presence of data governance to be the facilitating process that drives the collaborative, cross-functional, and decision making dynamics needed for successful EDM and MDM. Of course, the dilemma is that in a governance model people can still make bad decisions that inhibit people from working effectively.

So in terms of the people platform and data governance, there needs to be the correct focus on what are the right roles and good decisions made that can enable people to interact effectively.”

On Beware the Data Governance Ides of March, Jill Wanless commented:

“Our organization has taken the Hybrid Approach (starting Bottom-Up) and it works well for two reasons: (1) the worker bee rock stars are all aligned and ready to hit the ground running, and (2) the ‘Top’ can sit back and let the ‘aligned’ worker bees get on with it.

Of course, this approach is sometimes (painfully) slow, but with the ground-level rock stars already aligned, there is less resistance implementing the policies, and the Top’s heavy hand is needed much less frequently, but I voted for Hybrid Approach (starting Top-Down) because I have less than stellar patience for the long and scenic route.”

On Data Governance and the Buttered Cat Paradox, Rob Drysdale commented:

“Too many companies get paralyzed thinking about how to do this and implement it. (Along with the overwhelmed feeling that it is too much time/effort/money to fix it.) But I think your poll needs another option to vote on, specifically: ‘Whatever works for the company/culture/organization’ since not all solutions will work for every organization.

In some where it is highly structured, rigid and controlled, there wouldn’t be the freedom at the grass-roots level to start something like this and it might be frowned upon by upper-level management. In other organizations that foster grass-roots things then it could work.

However, no matter which way you can get it started and working, you need to have buy-in and commitment at all levels to keep it going and make it effective.”

On The Data Quality Wager, Gordon Hamilton commented:

“Deming puts a lot of energy into his arguments in 'Out of the Crisis' that the short-term mindset of the executives, and by extension the directors, is a large part of the problem.

Jackanapes, a lovely under-used term, might be a bit strong when the executives are really just doing what they are paid for. In North America we get what the directors measure! In fact, one quandary is that a proactive executive, who invests in data quality is building the long-term value of their company but is also setting it up to be acquired by somebody who recognizes that the 'under the radar' improvements are making the prize valuable.

Deming says on p.100: 'Fear of unfriendly takeover may be the single most important obstacle to constancy of purpose. There is also, besides the unfriendly takeover, the equally devastating leveraged buyout. Either way, the conqueror demands dividends, with vicious consequences on the vanquished.'”

On Got Data Quality?, Graham Rhind commented:

“It always makes me smile when people attempt to put a percentage value on their data quality as though it were something as tangible and measurable as the fat content of your milk.

In order to make such a measurement one would need to know where 100% of the defects lie. If they knew that they would be able to resolve the defects and achieve 100% quality. In reality you cannot and do not know where each defect is and how many there are.

Even though tools such as profilers will tell you, for example, that 95% of your US address records have a valid state added, there is still no way to measure how many of these valid states are applicable to the real world entity on the ground. Mr Smith may be registered in the database to an existing and valid address in the database, but if he moved last week there's a data quality issue that won't be discovered until one attempts to contact him.

The same applies when people say they have removed 95% of duplicates from their data. If they can measure it then they know where the other 5% of duplicates are and they can remove them.

But back to the point: you may not achieve 100% quality. In fact, we know you never will. But aiming for that target means that you're aiming in the right direction. As long as your goal is to get close to perfection and not to achieve it, I don't see the problem.”

On Data Governance Star Wars: Balancing Bureaucracy and Agility, Rob “Darth” Karel commented:

“A curious question to my Rebellious friend OCDQ-Wan, while data governance agility is a wonderful goal, and maybe a great place to start your efforts, is it sustainable?

Your agile Rebellion is like any start-up: decisions must be made quickly, you must do a lot with limited resources, everyone plays multiple roles willingly, and your objective is very targeted and specific. For example, to fire a photon torpedo into a small thermal exhaust port - only 2 meters wide - connected directly to the main reactor of the Death Star. Let's say you 'win' that market objective. What next?

The Rebellion defeats the Galactic Empire, leaving a market leadership vacuum. The Rebellion begins to set up a new form of government to serve all (aka grow existing market and expand into new markets) and must grow larger, with more layers of management, in order to scale. (aka enterprise data governance supporting all LOBs, geographies, and business functions).

At some point this Rebellion becomes a new Bureaucracy - maybe with a different name and legacy, but with similar results. Don't forget, the Galactic Empire started as a mini-rebellion itself spearheaded by the agile Palpatine!” 

You Are Awesome

Thank you very much for sharing your perspectives with our collablogaunity.  This entry in the series highlighted the commendable comments received on OCDQ Blog posts published between January and June of 2011.

Since there have been so many commendable comments, please don’t be offended if one of your comments wasn’t featured.

Please keep on commenting and stay tuned for future entries in the series.

By the way, even if you have never posted a comment on my blog, you are still awesome — feel free to tell everyone I said so.

Thank you for reading the Obsessive-Compulsive Data Quality (OCDQ) blog.  Your readership is deeply appreciated.

 

Related Posts

730 Days and 264 Blog Posts Later – The Second Blogiversary of OCDQ Blog

OCDQ Blog Bicentennial – The 200th OCDQ Blog Post

Commendable Comments (Part 9)

Commendable Comments (Part 8)

Commendable Comments (Part 7)

Commendable Comments (Part 6)

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

Commendable Comments (Part 4)

Commendable Comments (Part 3)

Commendable Comments (Part 2)

Commendable Comments (Part 1)

Social Media Strategy

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Effectively using social media within a business context is more art than science, which is why properly planning and executing a social media strategy is essential for organizations as well as individual professionals.

On this episode, I discuss social media strategy and content marketing with Crysta Anderson, a Social Media Strategist for IBM, who manages IBM InfoSphere’s social media presence, including the Mastering Data Management blog, the @IBMInitiate and @IBM_InfoSphere Twitter accounts, LinkedIn and other platforms.

Crysta Anderson also serves as a social media subject matter expert for IBM’s Information Management division.

Under Crysta’s execution, IBM Initiate has received numerous social media awards, including “Best Corporate Blog” from the Chicago Business Marketing Association, Marketing Sherpa’s 2010 Viral and Social Marketing Hall of Fame, and BtoB Magazine’s list of “Most Successful Online Social Networking Initiatives.”

Crysta graduated from the University of Chicago with a BA in Political Science and is currently pursuing a Master’s in Integrated Marketing Communications at Northwestern University’s Medill School.  Learn more about Crysta Anderson on LinkedIn.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

The Stakeholder’s Dilemma

Game theory models a strategic situation as a game in which an individual player’s success depends on the choices made by the other players involved in the game.  One excellent example is the game known as The Prisoner’s Dilemma, which is deliberately designed to demonstrate why two people might not cooperate—even if it is in both of their best interests to do so.

Here is the classic scenario.  Two criminal suspects are arrested, but the police have insufficient evidence for a conviction.  So they separate the prisoners and offer each the same deal.  If one testifies for the prosecution against the other (i.e., defects) and the other remains silent (i.e., cooperates), the defector goes free and the silent accomplice receives the full one-year sentence.  If both remain silent, both prisoners are sentenced to only one month in jail for a minor charge.  If each betrays the other, each receives a three-month sentence.  Each prisoner must choose to betray the other or to remain silent.

If you have ever regularly watched a police procedural television series, such as Law & Order, then you have seen many dramatizations of the prisoner’s dilemma, including several sample outcomes of when the prisoners make different choices.

The Iterated Prisoner’s Dilemma

In iterated versions of the prisoner’s dilemma, players remember the previous actions of their opponent and change their strategy accordingly.  In many fields of study, these variations are considered fundamental to understanding cooperation and trust.

Here is an economics scenario with two players and a banker.  Each player holds a set of two cards, one printed with the word Cooperate (as in, with each other), the other printed with the word Defect.  Each player puts one card face-down in front of the banker.  Laying the cards face down eliminates the possibility of either player knowing the other player’s selection in advance.  At the end of each turn, the banker turns over both cards and gives out the payments, which can vary, but one example is as follows.

If both players cooperate, they are each awarded $5.  If both players defect, they are each penalized $1.  But if one player defects while the other player cooperates, the defector is awarded $10, while the cooperator neither wins nor loses any money.

Therefore, the safest play is to always cooperate, since you would never lose any money—and if your opponent always cooperates, then you can both win on every turn.  However, although defecting creates the possibility of losing a small amount of money, it also creates the possibility of winning twice as much money.

It is the iterated nature of this version of the prisoner’s dilemma that makes it so interesting for those studying human behavior.

For example, if you were playing against me, and I defected on the first two turns while you cooperated, I would have won $20 while you would have won nothing.  So what would you do on the third turn?  Let’s say that you choose to defect.

But if I defected yet again, although we would both lose $1, overall I would still be +$19 while you would be -$1.  And what if I continued defecting?  This would actually be an understandable strategy for me—if I was only playing for money, since you would have to defect 19 more times in a row before I broke even, but by which time you would have also lost $20.  And if instead, you start cooperating again in order to stop your losses, I could win a lot of money—at the expense of losing your trust.

Although the iterated prisoner’s dilemma is designed so that, over the long term, cooperating players generally do better than non-cooperating players, in the short term, the best result for an individual player is to defect while their opponent cooperates.
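
To make the mechanics concrete, here is a minimal sketch in Python that encodes the payoffs described above (both cooperate: $5 each; both defect: lose $1 each; one defects: $10 to the defector, $0 to the cooperator).  The strategy names and the turn counts are illustrative assumptions on my part, not part of the original scenario.

```python
# Payoff table: (move_a, move_b) -> (payoff_a, payoff_b)
PAYOFFS = {
    ("C", "C"): (5, 5),
    ("D", "D"): (-1, -1),
    ("D", "C"): (10, 0),
    ("C", "D"): (0, 10),
}

def always_cooperate(my_history, their_history):
    return "C"

def tit_for_tat(my_history, their_history):
    # Cooperate first, then mirror the opponent's previous move.
    return their_history[-1] if their_history else "C"

def defect_first_two_turns(my_history, their_history):
    # Reproduces the exploit described in the post.
    return "D" if len(my_history) < 2 else "C"

def play(strategy_a, strategy_b, turns):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(turns):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

print(play(defect_first_two_turns, always_cooperate, turns=2))  # (20, 0)
print(play(tit_for_tat, tit_for_tat, turns=20))                 # (100, 100)
```

The two printed results match the narrative: a defector exploiting a cooperator wins $20 to nothing in two turns, while two players who consistently cooperate both do better over the long run.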

The Stakeholder’s Dilemma

Organizations embarking on an enterprise-wide initiative, such as data quality, master data management, and data governance, play a version of the iterated prisoner’s dilemma, which I refer to as The Stakeholder’s Dilemma.

These initiatives often bring together key stakeholders from all around the organization, representing each business unit or business function, and perhaps stakeholders representing data and technology as well.  These stakeholders usually form a committee or council, which is responsible for certain top-down aspects of the initiative, such as funding and strategic planning.

Of course, it is unrealistic to expect every stakeholder to cooperate equally at all times.  The realities of the fiscal calendar effect, conflicting interests, and changing business priorities will mean that during any particular turn in the game (i.e., the current phase of the initiative), the amount of resources (money, time, people) allocated to the effort by a particular stakeholder will vary.

There will be times when sacrifices for the long-term greater good of the initiative will require that cooperating stakeholders either contribute more resources during the current phase, or receive fewer benefits from its deliverables, than defecting stakeholders.

As with the iterated prisoner’s dilemma, the challenge is what happens during the next turn (i.e., the next phase of the initiative).

If the same stakeholders repeatedly defect, then will the other stakeholders continue to cooperate?  Or will the spirit of trust, cooperation, and collaboration necessary for the continuing success of the ongoing initiative be irreparably damaged?

There are many, and often complex, reasons for why enterprise-wide initiatives fail, but failing to play the stakeholder’s dilemma well is one very common reason—and it is also a reason why many future enterprise-wide initiatives will fail to garner support.

How well does your organization play The Stakeholder’s Dilemma?

Related Posts