Finding Data Quality


Have you ever experienced that sinking feeling, where you sense that if you don’t find data quality, then data quality will find you?

In the spring of 2003, Pixar Animation Studios produced one of my all-time favorite Walt Disney Pictures, Finding Nemo.

This blog post is an homage to not only the film, but also to the critically important role into which data quality is cast within all of your enterprise information initiatives, including business intelligence, master data management, and data governance.

I hope that you enjoy reading this blog post, but most important, I hope you always remember: “Data are friends, not food.”

Data Silos


“Mine!  Mine!  Mine!  Mine!  Mine!”

That’s the Data Silo Mantra, and it is also the bane of successful enterprise information management.  Many organizations persist in their reliance on vertical data silos, where each and every business unit acts as the custodian of its own private data, thereby maintaining its own version of the truth.

Impressive business growth can cause an organization to become a victim of its own success.  Significant collateral damage can be caused by this success, and most notably to the organization’s burgeoning information architecture.

Earlier in an organization’s history, it usually has fewer systems and easily manageable volumes of data.  That makes managing data quality, and effectively delivering the critical information required to make informed business decisions every day, a relatively easy task where technology can serve business needs well, especially when the business and its needs are small.

However, as the organization grows, it trades effectiveness for efficiency, prioritizing short-term tactics over long-term strategy, and by seeing power in the hoarding of data, not in the sharing of information, the organization chooses business unit autonomy over enterprise-wide collaboration—and without this collaboration, successful enterprise information management is impossible.

A data silo often merely represents a microcosm of an enterprise-wide problem—and this truth is neither convenient nor kind.

Data Profiling


“I see a light—I’m feeling good about my data . . . Good feeling’s gone—AHH!”

Although it’s not exactly a riddle wrapped in a mystery inside an enigma,  understanding your data is essential to using it effectively and improving its quality—to achieve these goals, there is simply no substitute for data analysis.

Data profiling can provide a reality check for the perceptions and assumptions you may have about the quality of your data.  A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.

However, it is important to remember that the analysis itself cannot be automated.  You need to translate your analysis into the meaningful reports and questions that will facilitate more effective communication and help establish tangible business context.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions. 

Discovering the right questions requires talking with data’s best friends: its stewards, analysts, and subject matter experts.  These discussions are a critical prerequisite for determining data usage, standards, and the business-relevant metrics for measuring and improving data quality.  Always remember that well-performed data profiling is a highly interactive and very iterative process.
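To make the "grunt work" a little more concrete, here is a minimal sketch of what a profiling pass over a single column might compute.  It is only an illustration, not how any particular data profiling tool works: the function names, the character-pattern convention (9 for a digit, A for a letter), and the sample postal codes are all my own assumptions.

```python
from collections import Counter


def value_pattern(value):
    """Reduce a value to a character pattern: 9 for digits, A for letters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # keep punctuation and spaces as they are
    return "".join(out)


def profile_column(values):
    """Return basic profiling statistics for one column of raw string values."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    return {
        "row_count": len(values),
        "missing_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "top_values": Counter(present).most_common(3),
        "patterns": Counter(value_pattern(v) for v in present).most_common(),
    }


# Hypothetical sample: a postal code column with mixed formats.
postal_codes = ["02134", "02134", "2134", "", None, "02134-1234", "O2134"]
report = profile_column(postal_codes)
for name, result in report.items():
    print(name, result)
```

Notice that the output raises questions rather than answering them.  Is "2134" missing a leading zero?  Is the "A9999" pattern a typo of the letter O for the digit 0?  Those are exactly the questions to bring to the stewards and subject matter experts.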

Defect Prevention


“You, Data-Dude, takin’ on the defects.

You’ve got serious data quality issues, dude.”


Even though it is impossible to truly prevent every problem before it happens, proactive defect prevention is a highly recommended data quality best practice because the more control enforced where data originates, the better the overall quality will be for enterprise information.

Although defect prevention is most commonly associated with business and technical process improvements, after identifying the burning root cause of your data defects, you may predictably need to apply some of the principles of behavioral data quality.

In other words, understanding the complex human dynamics often underlying data defects is necessary for developing far more effective tactics and strategies for implementing successful and sustainable data quality improvements.

Data Cleansing


“Just keep cleansing.  Just keep cleansing.

Just keep cleansing, cleansing, cleansing.

What do we do?  We cleanse, cleanse.”

That’s not the Data Cleansing Theme Song—but it can sometimes feel like it.  Especially whenever poor data quality negatively impacts decision-critical information, the organization may legitimately prioritize a reactive short-term response, where the only remediation will be fixing the immediate problems.

Balancing the demands of this data triage mentality with the best practice of implementing defect prevention wherever possible, will often create a very challenging situation for you to contend with on an almost daily basis.

Therefore, although comprehensive data remediation will require combining reactive and proactive approaches to data quality, you need to be willing and able to put data cleansing tools to good use whenever necessary.



“It’s like he’s trying to speak to me, I know it.

Look, you’re really cute, but I can’t understand what you’re saying.

Say that data quality thing again.”

I hear this kind of thing all the time (well, not the “you’re really cute” part).

Effective communication improves everyone’s understanding of data quality, establishes a tangible business context, and helps prioritize critical data issues. 

Keep in mind that communication is mostly about listening.  Also, be prepared to face “data denial” when data quality problems are discussed.  Most often, this is a natural self-defense mechanism for the people responsible for business processes, technology, and data, because nobody likes to feel blamed for causing or failing to fix the data quality problems.

The key to effective communication is clarity.  You should always make sure that all data quality concepts are clearly defined and in a language that everyone can understand.  I am not just talking about translating the techno-mumbojumbo, because even business-speak can sound more like business-babbling—and not just to the technical folks.

Additionally, don’t be afraid to ask questions or admit when you don’t know the answers.  Many costly mistakes can be made when people assume that others know (or pretend to know themselves) what key concepts and other terminology actually mean.

Never underestimate the potential negative impacts that the point of view paradox can have on communication.  For example, the perspectives of the business and technical stakeholders can often appear to be diametrically opposed.

Practicing effective communication requires shutting our mouths, opening our ears, and empathically listening to each other, instead of continuing to practice ineffective communication, where we merely take turns throwing word-darts at each other.



“Oh and one more thing:

When facing the daunting challenge of collaboration,

Work through it together, don’t avoid it.

Come on, trust each other on this one.

Yes—trust—it’s what successful teams do.”

Most organizations suffer from a lack of collaboration, and as noted earlier, without true enterprise-wide collaboration, true success is impossible.

Beyond the data silo problem, the most common challenge for collaboration is the divide perceived to exist between the Business and IT, where the Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise, and IT usually owns the hardware and software infrastructure of the enterprise’s technical architecture.

However, neither the Business nor IT alone has all of the necessary knowledge and resources required to truly be successful.  Data quality requires that the Business and IT forge an ongoing and iterative collaboration.

You must rally the team that will work together to improve the quality of your data.  A cross-disciplinary team will truly be necessary because data quality is neither a business issue nor a technical issue—it is both, truly making it an enterprise issue.

Executive sponsors, business and technical stakeholders, business analysts, data stewards, technology experts, and yes, even consultants and contractors—only when all of you are truly working together as a collaborative team, can the enterprise truly achieve great things, both tactically and strategically.

Successful enterprise information management is spelled E—A—C.

Of course, that stands for Enterprises—Always—Collaborate.  The EAC can be one seriously challenging place, dude.

You don’t know if you know what they know, or if they know what you know, but when you know, then they know, you know?

It’s like first you are all like “Whoa!” and they are all like “Whoaaa!” then you are like “Sweet!” and then they are like “Totally!”

This critical need for collaboration might seem rather obvious.  However, as all of the great philosophers have taught us, sometimes the hardest thing to learn is the least complicated.

Okay.  Squirt will now give you a rundown of the proper collaboration technique:

“Good afternoon. We’re gonna have a great collaboration today.

Okay, first crank a hard cutback as you hit the wall.

There’s a screaming bottom curve, so watch out.

Remember: rip it, roll it, and punch it.”

Finding Data Quality


As more and more organizations realize the critical importance of viewing data as a strategic corporate asset, data quality is becoming an increasingly prevalent topic of discussion.

However, and somewhat understandably, data quality is sometimes viewed as a small fish—albeit with a “lucky fin”—in a much larger pond.

In other words, data quality is often discussed only in its relation to enterprise information initiatives such as data integration, master data management, data warehousing, business intelligence, and data governance.

There is nothing wrong with this perspective, and as a data quality expert, I admit to my general tendency to see data quality in everything.  However, regardless of the perspective from which you begin your journey, I believe that eventually you will be Finding Data Quality wherever you look as well.


Data is a Game Changer


Nowadays we hear a lot of chatter, rather reminiscent of the boisterous bluster of sports talk radio debates, about the potential of big data and its related technologies to enable predictive and real-time analytics and, by leveraging an infrastructure provided by the symbiotic relationship of cloud and mobile, serve up better business performance and an enhanced customer experience.

Sports have always provided great fodder for the data-obsessed with their treasure troves of statistical data dissecting yesterday’s games down to the most minute detail, which is called upon by experts and amateurs alike to try to predict tomorrow’s games as well as analyze in real-time the play-by-play of today’s games.  Arguably, it was the bestselling book Moneyball by Michael Lewis, which was also adapted into a popular movie starring Brad Pitt, that brought data obsession to the masses, further fueling the hype and overuse of sports metaphors such as how data can be a game changer for businesses in any industry and of any size.

The Future is Now Playing on Center Court

Which is why it is so refreshing to see a tangible real-world case study for big data analytics being delivered with the force of an Andy Murray two-handed backhand as over the next two weeks the United States Tennis Association (USTA) welcomes hundreds of thousands of spectators to New York City’s Flushing Meadows for the 2013 U.S. Open tennis tournament.  Both the fans in the stands and the millions more around the world will visit, via the web or mobile apps, in order to follow the action, watch live-streamed tennis matches, and get scores, stats, and the latest highlights and news thanks to IBM technologies.

Before, during, and after each match, predictive and real-time analytics drive IBM’s SlamTracker tool.  Before matches, IBM analyzes 41 million data points collected from eight years of Grand Slam play, including head-to-head matches, similar player types, and playing surfaces.  SlamTracker uses this data to create engaging and compelling tools for digital audiences, which identify key actions players must take to enhance their chances of winning, and give fans player information, match statistics, social sentiment, and more.

The infrastructure that supports the U.S. Open’s digital presence is hosted on an IBM SmartCloud.  This flexible, scalable environment, managed by IBM Analytics, lets the USTA ensure continuous availability of their digital platforms throughout the tournament and year-round.  The USTA and IBM give fans the ability to experience the matches from anywhere, with any device via a mobile-friendly site and engaging apps for multiple mobile platforms.  Together these innovations make the U.S. Open experience immediate and intimate for fans sitting in the stands or on another continent.

Better Service, More Winners, and Fewer Unforced Errors

In tennis, a service (also known as a serve) is a shot to start a point.  In business, a service is a shot to start a point of positive customer interaction, whether that’s a point of sale or an opportunity to serve a customer’s need (e.g., resolving a complaint).

In tennis, a winner is a shot not reached by your opponent, which wins you a point.  In business, a winner is a differentiator not reached by your competitor, which wins your business a sale when it makes a customer choose your product or service.

In tennis, an unforced error is a failure to complete a service or return a shot, which cannot be attributed to any factor other than poor judgement or execution by the player.  In business, an unforced error is a failure to service a customer or get a return on an investment, which cannot be attributed to any factor other than poor decision making or execution by the organization.

Properly supported by enabling technologies, businesses of all sizes, and across all industries, can capture and analyze data to uncover hidden patterns and trends that can help them achieve better service, more winners, and fewer unforced errors.

How can Data change Your Game?

Whether it’s on the court, in the stands, on the customer-facing front lines, in the dashboards used by executive management, or behind the scenes of a growing midsize business, data is a game changer.  How can data change your game?


Business Analytics for Midsize Businesses

As this growing list of definitions for big data attests, big data evangelist and IBM thought leader James Kobielus rightfully warns that big data is in danger of definitional overkill.  But most midsize business owners are less concerned about defining big data than they are about, as Laurie McCabe recently blogged, determining whether big data is relevant for their business.

“The fact of the matter is, big is a relative term,” McCabe explained, “relative to the amount of information that your organization needs to sift through to find the insights you need to operate the business more proactively and profitably.”

McCabe also noted that this is not just a problem for big businesses, since getting better insights from the data you already have is a challenge for businesses of all sizes.  Midsize businesses “may not be dealing with terabytes of data,” McCabe explained, “but many are finding that tools that used to suffice—such as Excel spreadsheets—fall short even when it comes to analyzing internal transactional databases.”  McCabe also provided recommendations for how midsize businesses can put big data to work.

The recent IBM study The Case for Business Analytics in Midsize Firms lists big data as one of the trends making a compelling case for the growing importance of business analytics for midsize businesses.  The study also noted that important functional data continues to live in departmental spreadsheets, and state-of-the-art business analytics solutions are needed to make it easy to pull all that data, along with data from other sources, together in a meaningful way.  Despite the common misconception that such solutions are too expensive for midsize businesses, solutions are now available that can deliver analytics capabilities to help overcome big data challenges without requiring a big upfront investment in hardware or software.

Phil Simon, author of Too Big to Ignore: The Business Case for Big Data, recently blogged about reporting versus analytics, explaining that the essence of analytics is that it goes beyond the what and where provided by reporting, and tries to explain the why.

Big data isn’t the only reason why analytics is becoming more of a necessity.  But with the barriers to what it costs and where it can be deployed becoming easier to overcome, business analytics is becoming more commonplace in midsize businesses.


Business Intelligence for Midsize Businesses

Business intelligence is one of those phrases that everyone agrees is something all organizations, regardless of their size, should be doing.  After all, no organization would admit to doing business stupidity.  Nor, I presume, would any vendor admit to selling it.

But not everyone seems to agree on what the phrase means.  Personally, I have always defined business intelligence as the data analytics performed in support of making informed business decisions (i.e., for me, business intelligence = decision support).

Oftentimes, this analytics is performed on data integrated, cleansed, and consolidated into a repository (e.g., a data warehouse).  Other times, it’s performed on a single data set (e.g., a customer information file).  Either way, business decision makers interact with the analytical results via static reports, data visualizations, dynamic dashboards, and ad hoc querying and reporting tools.

But robust business intelligence and analytics solutions used to be perceived as something only implemented by big businesses, as evinced by the big price tags usually associated with them.  However, free and open source software, cloud computing, mobile, social, and a variety of as-a-service technologies drove the consumerization of IT, driving down the costs of solutions and enabling small and midsize businesses to afford them.  Additionally, the open data movement led to a wealth of free public data sets that can be incorporated into business intelligence and analytics solutions (examples can be found at

Lyndsay Wise, author of the insightful book Using Open Source Platforms for Business Intelligence (to listen to a podcast about the book, click here: OSBI on OCDQ Radio), recently blogged about business intelligence for small and midsize businesses.

Wise advised that “recent market changes have shifted the market in favor of small and midsize businesses.  Before this, most were limited by requirements for large infrastructures, high-cost licensing, and limited solution availability.  With this newly added flexibility and access to lower price points, business intelligence and analytics solutions are no longer out of reach.”


This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet. I’ve been compensated to contribute to this program, but the opinions expressed in this post are my own and don’t necessarily represent IBM’s positions, strategies, or opinions.


Related Posts

The Big Datastillery

Smart Big Data Adoption for Midsize Businesses

Big Data is not just for Big Businesses

Big Data Lessons from Orbitz

The Graystone Effects of Big Data

Will Big Data be Blinded by Data Science?

Social Business is more than Social Marketing

Social Media Marketing: From Monologues to Dialogues

Social Media for Midsize Businesses

Barriers to Cloud Adoption

Leveraging the Cloud for Application Development

Cloud Computing for Midsize Businesses

Cloud Computing is the New Nimbyism

Devising a Mobile Device Strategy

The Age of the Mobile Device

Word of Mouth has become Word of Data

Information Asymmetry versus Empowered Customers

Talking Business about the Weather

Open Source Business Intelligence

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, I discuss open source business intelligence (OSBI) with Lyndsay Wise, author of the insightful new book Using Open Source Platforms for Business Intelligence: Avoid Pitfalls and Maximize ROI.

Lyndsay Wise is the President and Founder of WiseAnalytics, an independent analyst firm and consultancy specializing in business intelligence for small and mid-sized organizations.  For more than ten years, Lyndsay Wise has assisted clients in business systems analysis, software selection, and implementation of enterprise applications.

Lyndsay Wise conducts regular research studies, consults, writes articles, and speaks about how to implement a successful business intelligence approach and improving the value of business intelligence within organizations.




Win a copy of the Book

Lyndsay Wise wants to give one OCDQ Radio listener a free copy of Using Open Source Platforms for Business Intelligence.


Here is how the book contest will work:

(1) Book Contest Question — Name one of the considerations for evaluating whether OSBI is the right choice for your organization that Lyndsay Wise discussed during this OCDQ Radio episode.


(2) Book Contest Deadline — On or before January 31, 2013, email Jim Harris with your answer to the book contest question.


(3) Book Contest Winner — In February 2013, one winner will be randomly selected from the emails containing the answer to the contest question, and Lyndsay Wise will email the winner requesting a postal address for sending a free copy of the book.


Related Lyndsay Wise Articles

What You Need to Know about Open Source BI

Open Source BI Considerations and Implications

Do Self-Service and Open Source Co-Exist?

Everything Executives Need to Know about Open Source BI

The Importance of Data Management for Business People


Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

The Wisdom of Crowds, Friends, and Experts

I recently finished reading the TED Book by Jim Hornthal, A Haystack Full of Needles, which included an overview of the different predictive approaches taken by one of the most common forms of data-driven decision making in the era of big data, namely, the recommendation engines increasingly provided by websites, social networks, and mobile apps.

These recommendation engines primarily employ one of three techniques, choosing to base their data-driven recommendations on the “wisdom” provided by either crowds, friends, or experts.


The Wisdom of Crowds

In his book The Wisdom of Crowds, James Surowiecki explained that the four conditions characterizing wise crowds are diversity of opinion, independent thinking, decentralization, and aggregation.  Amazon is a great example of a recommendation engine using this approach by assuming that a sufficiently large population of buyers is a good proxy for your purchasing decisions.

For example, Amazon tells you that people who bought James Surowiecki’s bestselling book also bought Thinking, Fast and Slow by Daniel Kahneman, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business by Jeff Howe, and Wikinomics: How Mass Collaboration Changes Everything by Don Tapscott.  However, Amazon neither provides nor possesses knowledge of why people bought all four of these books or qualification of the subject matter expertise of these readers.

These concerns, which we could think of as potential data quality issues, would be exacerbated within a small amount of transaction data, where the eclectic tastes and idiosyncrasies of individual readers would not help us decide what books to buy.  Within a large amount of transaction data, however, we achieve the Wisdom of Crowds effect when, taken in aggregate, we receive a general sense of what books we might like to read based on what a diverse group of readers collectively makes popular.

As I blogged about in my post Sometimes it’s Okay to be Shallow, sometimes the aggregated, general sentiment of a large group of unknown, unqualified strangers will be sufficient to effectively make certain decisions.
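As a rough illustration of the "people who bought this also bought" idea, here is a minimal co-purchase counting sketch.  To be clear, this is not Amazon's actual method, which I have no knowledge of: the function name and the handful of sample orders are made up, and a real engine would work over vastly more transaction data.

```python
from collections import Counter


def also_bought(orders, title, top_n=3):
    """Count which other titles appear most often in orders containing `title`."""
    counts = Counter()
    for order in orders:
        items = set(order)  # ignore duplicate copies within one order
        if title in items:
            counts.update(items - {title})
    return counts.most_common(top_n)


# Hypothetical transaction data: each inner list is one buyer's order.
orders = [
    ["The Wisdom of Crowds", "Thinking, Fast and Slow"],
    ["The Wisdom of Crowds", "Crowdsourcing", "Wikinomics"],
    ["The Wisdom of Crowds", "Thinking, Fast and Slow", "Wikinomics"],
    ["Wikinomics", "Crowdsourcing"],
    ["The Wisdom of Crowds", "Thinking, Fast and Slow"],
]
recommendations = also_bought(orders, "The Wisdom of Crowds")
print(recommendations)
```

The sketch knows nothing about why any of these buyers chose their books or how qualified they are.  It only aggregates, which is the whole point: with five orders the result is noise, but with millions of orders the aggregate can become a useful signal.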


The Wisdom of Friends

Although the influence of our friends and family is the oldest form of data-driven decision making, historically this influence was delivered by word of mouth, which required you to either be there to hear those influential words when they were spoken, or have a large enough network of people you knew that would be able to eventually pass along those words to you.

But the rise of social networking services, such as Twitter and Facebook, has transformed word of mouth into word of data by transcribing our words into short bursts of social data, such as status updates, online reviews, and blog posts.

Facebook “Likes” are a great example of a recommendation engine that uses the Wisdom of Friends, where our decision to buy a book, see a movie, or listen to a song might be based on whether or not our friends like it.  Of course, “friends” is used in a very loose sense in a social network, and not just on Facebook, since it combines strong connections such as actual friends and family, with weak connections such as acquaintances, friends of friends, and total strangers from the periphery of our social network.

Social influence has never ended with the people we know well, as Nicholas Christakis and James Fowler explained in their book Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives.  But the hyper-connected world enabled by the Internet, and further facilitated by mobile devices, has strengthened the social influence of weak connections, and these friends form a smaller crowd whose wisdom is involved in more of our decisions than we may even be aware of.


The Wisdom of Experts

Since it’s more common to associate wisdom with expertise, Pandora is a great example of a recommendation engine that uses the Wisdom of Experts.  Pandora used a team of musicologists (professional musicians and scholars with advanced degrees in music theory) to deconstruct more than 800,000 songs into 450 musical elements that make up each performance, including qualities of melody, harmony, rhythm, form, composition, and lyrics, as part of what Pandora calls the Music Genome Project.

As Pandora explains, their methodology uses precisely defined terminology, a consistent frame of reference, redundant analysis, and ongoing quality control to ensure that data integrity remains reliably high, believing that delivering a great radio experience to each and every listener requires an incredibly broad and deep understanding of music.

Essentially, experts form the smallest crowd of wisdom.  Of course, experts are not always right.  At the very least, experts are not right about every one of their predictions.  Nor do experts always agree with each other, which is why I imagine that one of the most challenging aspects of the Music Genome Project is getting music experts to consistently apply precisely the same methodology.

Pandora also acknowledges that each individual has a unique relationship with music (i.e., no one else has tastes exactly like yours), and allows you to “Thumbs Up” or “Thumbs Down” songs without affecting other users, producing more personalized results than either the popularity predicted by the Wisdom of Crowds or the similarity predicted by the Wisdom of Friends.


The Future of Wisdom

It’s interesting to note that the Wisdom of Experts is the only one of these approaches that relies on what data management and business intelligence professionals would consider a rigorous approach to data quality and decision quality best practices.  But this is also why the Wisdom of Experts is the most time-consuming and expensive approach to data-driven decision making.

In the past, the Wisdom of Crowds and Friends was ignored in data-driven decision making for the simple reason that this potential wisdom wasn’t digitized.  But now, in the era of big data, not only are crowds and friends digitized, but technological advancements combined with cost-effective options via open source (data and software) and cloud computing make these approaches quicker and cheaper than the Wisdom of Experts.  And despite the potential data quality and decision quality issues, the Wisdom of Crowds and/or Friends is proving itself a viable option for more categories of data-driven decision making.

I predict that the future of wisdom will increasingly become an amalgamation of experts, friends, and crowds, with the data and techniques from all three potential sources of wisdom often acknowledged as contributors to data-driven decision making.


Related Posts

Sometimes it’s Okay to be Shallow

Word of Mouth has become Word of Data

The Wisdom of the Social Media Crowd

Data Management: The Next Generation

Exercise Better Data Management

Darth Vader, Big Data, and Predictive Analytics

Data-Driven Intuition

The Big Data Theory

Finding a Needle in a Needle Stack

Big Data, Predictive Analytics, and the Ideal Chronicler

The Limitations of Historical Analysis

Magic Elephants, Data Psychics, and Invisible Gorillas

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

HoardaBytes and the Big Data Lebowski

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

A Tale of Two Datas

A Tale of Two Datas

Is big data more than just lots and lots of data?  Is big data unstructured and not-so-big data structured?  Malcolm Chisholm explored these questions in his recent Information Management column, where he posited that there are, in fact, two datas.

“One type of data,” Chisholm explained,  “represents non-material entities in vast computerized ecosystems that humans create and manage.  The other data consists of observations of events, which may concern material or non-material entities.”

Providing an example of the first type, Chisholm explained, “my bank account is not a physical thing at all; it is essentially an agreed upon idea between myself, the bank, the legal system, and the regulatory authorities.  It only exists insofar as it is represented, and it is represented in data.  The balance in my bank account is not some estimate with a positive and negative tolerance; it is exact.  The non-material entities of the financial sector are orderly human constructs.  Because they are orderly, we can more easily manage them in computerized environments.”

The orderly human constructs that are represented in data, in the stories told by data (including the stories data tell about us and the stories we tell data), are one of my favorite topics.  In our increasingly data-constructed world, it’s important to occasionally remind ourselves that data and the real world are not the same thing, especially when data represents non-material entities since, with the possible exception of Makers using 3-D printers, data-represented entities do not re-materialize into the real world.

Describing the second type, Chisholm explained, “a measurement is usually a comparison of a characteristic using some criteria, a count of certain instances, or the comparison of two characteristics.  A measurement can generally be quantified, although sometimes it’s expressed in a qualitative manner.  I think that big data goes beyond mere measurement, to observations.”

Chisholm called the first type the Data of Representation, and the second type the Data of Observation.

The data of representation tends to be structured, in the relational sense, but doesn’t need to be (e.g., graph databases) and the data of observation tends to be unstructured, but it can also be structured (e.g., the structured observations generated by either a data profiling tool analyzing structured relational tables or flat files, or a word-counting algorithm analyzing unstructured text).
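As a small illustration of the word-counting case mentioned above, here is a minimal Python sketch of how unstructured text can yield structured observations.  The function name `count_words`, the sample sentence, and the simple tokenizing rule are my own assumptions for the sake of the example, not anything taken from Chisholm’s column.

```python
import re
from collections import Counter


def count_words(text: str) -> list[tuple[str, int]]:
    """Turn free text into structured (word, count) observations.

    Hypothetical helper for illustration only.
    """
    # Lowercase the text and treat runs of ASCII letters and straight
    # apostrophes as words (a deliberately simple tokenizing assumption).
    words = re.findall(r"[a-z']+", text.lower())
    # Sort by descending count, then alphabetically, so the output is stable.
    return sorted(Counter(words).items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "It was the best of times, it was the worst of times."
    for word, count in count_words(sample):
        print(f"{word}\t{count}")
```

The input has no rows or columns, but the output is a tidy two-column table of observations, which is the sense in which the data of observation can be structured even when its source is not.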

“Structured and unstructured,” Chisholm concluded, “describe form, not essence, and I suggest that representation and observation describe the essences of the two datas.  I would also submit that both datas need different data management approaches.  We have a good idea what these are for the data of representation, but much less so for the data of observation.”

I agree that there are two types of data (i.e., representation and observation, not big and not-so-big) and that different data uses will require different data management approaches.  Although data modeling is still important and data quality still matters, how much data modeling and data quality is needed before data can be effectively used for specific business purposes will vary.

In order to move our discussions forward regarding “big data” and its data management and business intelligence challenges, we have to stop fiercely defending our traditional perspectives about structure and quality in order to effectively manage both the form and essence of the two datas.  We also have to stop fiercely defending our traditional perspectives about data analytics, since there will be some data use cases where depth and detailed analysis may not be necessary to provide business insight.


A Tale of Two Datas

In conclusion, and with apologies to Charles Dickens and his A Tale of Two Cities, I offer the following A Tale of Two Datas:

It was the best of times, it was the worst of times.
It was the age of Structured Data, it was the age of Unstructured Data.
It was the epoch of SQL, it was the epoch of NoSQL.
It was the season of Representation, it was the season of Observation.
It was the spring of Big Data Myth, it was the winter of Big Data Reality.
We had everything before us, we had nothing before us,
We were all going direct to hoarding data, we were all going direct the other way.
In short, the period was so far like the present period, that some of its noisiest authorities insisted on its being signaled, for Big Data or for not-so-big data, in the superlative degree of comparison only.

Related Posts

HoardaBytes and the Big Data Lebowski

The Idea of Order in Data

The Most August Imagination

Song of My Data

The Lies We Tell Data

Our Increasingly Data-Constructed World

Plato’s Data

OCDQ Radio - Demystifying Master Data Management

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

Swimming in Big Data

Sometimes it’s Okay to be Shallow

Darth Vader, Big Data, and Predictive Analytics

The Big Data Theory

Finding a Needle in a Needle Stack

Exercise Better Data Management

Magic Elephants, Data Psychics, and Invisible Gorillas

Why Can’t We Predict the Weather?

Data and its Relationships with Quality

A Tale of Two Q’s

A Tale of Two G’s

Exercise Better Data Management

Recently on Twitter, Daragh O Brien and I discussed his proposed concept.  “After Big Data,” Daragh tweeted, “we will inevitably begin to see the rise of MOData as organizations seek to grab larger chunks of data and digest it.  What is MOData?  It’s MO’Data, as in MOre Data. Or Morbidly Obese Data.  Only good data quality and data governance will determine which.”

Daragh asked if MO’Data will be the Big Data Killer.  I said only if MO’Data doesn’t include MO’BusinessInsight, MO’DataQuality, and MO’DataPrivacy (i.e., more business insight, more data quality, and more data privacy).

“But MO’Data is about more than just More Data,” Daragh replied.  “It’s about avoiding Morbidly Obese Data that clogs data insight and data quality, etc.”

I responded that More Data becomes Morbidly Obese Data only if we don’t exercise better data management practices.

Agreeing with that point, Daragh replied, “Bring on MOData and the Pilates of Data Quality and Data Governance.”

To slightly paraphrase lines from one of my favorite movies — Airplane! — the Cloud is getting thicker and the Data is getting laaaaarrrrrger.  Surely I know well that growing data volumes is a serious issue — but don’t call me Shirley.

Whether you choose to measure it in terabytes, petabytes, exabytes, HoardaBytes, or how much reality bites, the truth is we were consuming way more than our recommended daily allowance of data long before the data management industry took a tip from McDonald’s and put the word “big” in front of its signature sandwich.  (Oh great . . . now I’m actually hungry for a Big Mac.)

But nowadays with silos replicating data, as well as new data, and new types of data, being created and stored on a daily basis, our data is resembling the size of Bob Parr in retirement, making it seem like not even Mr. Incredible in his prime possessed the super strength needed to manage all of our data.  Those were references to the movie The Incredibles, where Mr. Incredible is a superhero who, after retiring into civilian life under the alias of Bob Parr, elicits this observation from his superhero costume tailor: “My God, you’ve gotten fat.”  Yes, I admit not even Helen Parr (aka Elastigirl) could stretch that far for a big data joke.

A Healthier Approach to Big Data

Although Daragh’s concerns about morbidly obese data are valid, no superpowers (or other miracle exceptions) are needed to manage all of our data.  In fact, it’s precisely when we are so busy trying to manage all of our data that we hoard countless bytes of data without evaluating data usage, gathering data requirements, or planning for data archival.  It’s like we are trying to lose weight by eating more and exercising less, i.e., consuming more data and exercising less data quality and data governance.  As Daragh said, only good data quality and data governance will determine whether we get more data or morbidly obese data.

Losing weight requires a healthy approach to both diet and exercise.  A healthy approach to diet includes carefully choosing the food you consume and carefully controlling your portion size.  A healthy approach to exercise includes a commitment to exercise on a regular basis at a sufficient intensity level without going overboard by spending several hours a day, every day, at the gym.

Swimming is a great form of exercise, but swimming in big data without having a clear business objective before you jump into the pool is like telling your boss that you didn’t get any work done because you decided to spend all day working out at the gym.

Carefully choosing the data you consume and carefully controlling your data portion size is becoming increasingly important since big data is forcing us to revisit information overload.  However, the main reason that traditional data management practices often become overwhelmed by big data is that traditional data management practices are not always the right approach.

We need to acknowledge that some big data use cases differ considerably from traditional ones.  Data modeling is still important and data quality still matters, but how much data modeling and data quality is needed before big data can be effectively used for business purposes will vary.  In order to move the big data discussion forward, we have to stop fiercely defending our traditional perspectives about structure and quality.  We also have to stop fiercely defending our traditional perspectives about analytics, since there will be some big data use cases where depth and detailed analysis may not be necessary to provide business insight.

Better than Big or More

Jim Ericson explained that your data is big enough.  Rich Murnane explained that bigger isn’t better, better is better.  Although big data may indeed be followed by more data, that doesn’t necessarily mean we require more data management in order to prevent more data from becoming morbidly obese data.  I think that we just need to exercise better data management.


Data Quality and Big Data

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

This is Part 2 of 2 from my recent discussion with Tom Redman.  In this episode, Tom and I discuss data quality and big data, including whether data quality matters less in larger data sets, whether statistical outliers represent business insights or data quality issues, statistical sampling errors versus measurement calibration errors, mistaking signal for noise (i.e., good data for bad data), and whether or not the principles and practices of true “data scientists” will truly be embraced by an organization’s business leaders.

Dr. Thomas C. Redman (the “Data Doc”) is an innovator, advisor, and teacher.  He was first to extend quality principles to data and information in the late 80s.  Since then he has crystallized a body of tools, techniques, roadmaps and organizational insights that help organizations make order-of-magnitude improvements.

More recently Tom has developed keen insights into the nature of data and formulated the first comprehensive approach to “putting data to work.”  Taken together, these enable organizations to treat data as assets of virtually unlimited potential.

Tom has personally helped dozens of leaders and organizations better understand data and data quality and start their data programs.  He is a sought-after lecturer and the author of dozens of papers and four books.  The most recent, Data Driven: Profiting from Your Most Important Business Asset (Harvard Business Press, 2008) was a Library Journal best buy of 2008.

Prior to forming Navesink Consulting Group in 1996, Tom conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995.  Tom holds a Ph.D. in statistics from Florida State University. He holds two patents.



Related Posts

Bayesian Data-Driven Decision Making

Magic Elephants, Data Psychics, and Invisible Gorillas

Big Data el Memorioso

Neither the I Nor the T is Magic

Information Overload Revisited

HoardaBytes and the Big Data Lebowski


The Speed of Decision

The Data-Decision Symphony

A Decision Needle in a Data Haystack

The Big Data Collider

Dot Collectors and Dot Connectors


Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Organizing for Data Quality — Guest Tom Redman (aka the “Data Doc”) discusses how your organization should approach data quality, including his call to action for your role in the data revolution.