Neither the I Nor the T is Magic

This blog post is sponsored by the Enterprise CIO Forum and HP.

It’s that time when we reflect on the past year and try to predict the future, as Paul Muller, Joel Rothman, and Pearl Zhu did with their recent blog posts.  Although I have previously written about why most predictions don’t come true, in this post, I throw my fortune-telling hat into the 2012 prediction ring.

The information technology (IT) trends of 2011 included consumerization and decentralization, application modernization and information optimization, cloud computing and cloud security (and, by extension, enterprise security).  However, perhaps the biggest IT trend of the year is that 2011 is going out with a Big Bang about Big Data for 2012 and beyond.

Since its inception, the IT industry has both benefited from and battled against the principle known as Clarke’s Third Law:

“Any sufficiently advanced technology is indistinguishable from magic.”

This principle often fuels the Diderot Effect of New Technology, enchanting our organizations with the mad desire to stock up on new technologically magic things.  As such, many are predicting 2012 will be the Year of the Magic Elephant named Hadoop because, as Gartner Research predicts about big data, “the size, complexity of formats, and speed of delivery exceeds the capabilities of traditional data management technologies; it requires the use of new or exotic technologies simply to manage the volume alone.  Many new technologies are emerging, with the potential to be disruptive.  Analytics has become a major driving application.”  As a corollary, the potential business value of integrating big data into business analytics seems to be conjuring up an alternative version of Clarke’s Third Law:

“Any sufficiently advanced information is indistinguishable from magic.”

In other words, many big data proponents (especially IT vendors selling Hadoop-based solutions) extol its virtues as if its information were capable of providing clairvoyant business insight, as if big data were the Data Psychic of the Information Age.

Although both sufficiently advanced information and technology will have important business-enabling IT roles to play in 2012, never forget that neither the I nor the T is magic — no matter what the Data Psychics and Magic Elephants may say.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

Information Overload Revisited

The Data Encryption Keeper

The Cloud Security Paradox

The Good, the Bad, and the Secure

Securing your Digital Fortress

Shadow IT and the New Prometheus

Are Cloud Providers the Bounty Hunters of IT?

The Diderot Effect of New Technology

The IT Consumerization Conundrum

The IT Prime Directive of Business First Contact

A Sadie Hawkins Dance of Business Transformation

Are Applications the La Brea Tar Pits for Data?

Why does the sun never set on legacy applications?

The Partly Cloudy CIO

The IT Pendulum and the Federated Future of IT

Suburban Flight, Technology Sprawl, and Garage IT

Redefining Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, I have an occasionally spirited discussion about data quality with Peter Perera, partially precipitated by his provocative post from this past summer, The End of Data Quality...as we know it, which included his proposed redefinition of data quality, as well as his perspective on the relationship of data quality to master data management and data governance.

Peter Perera is a recognized consultant and thought leader with significant experience in Master Data Management, Customer Relationship Management, Data Quality, and Customer Data Integration.  For over 20 years, he has been advising and working with Global 5000 organizations and mid-size enterprises to increase the usability and value of their customer information.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Information Overload Revisited

This blog post is sponsored by the Enterprise CIO Forum and HP.

Information Overload is a term invoked regularly during discussions about the data deluge of the Information Age, which has created a 24 hours a day, 7 days a week, 365 days a year, worldwide whirlwind of constant information flow, where the very air we breathe is teeming with digital data streams — continually inundating us with new, and new types of, information.

Information overload generally refers to how too much information can overwhelm our ability to understand an issue, and can even disable our decision making in regard to that issue (this latter aspect is generally referred to as Analysis Paralysis).

But we often forget that the term is over 40 years old.  It was popularized by Alvin Toffler in his bestselling book Future Shock, which was published in 1970, back when the Internet was still in its infancy, and long before the Internet’s progeny would give birth to the clouds contributing to the present, potentially perpetual, forecast for data precipitation.

A related term that has become big in the data management industry is Big Data.  As Gartner Research explains, although the term acknowledges the exponential growth, availability, and use of information in today’s data-rich landscape, big data is about more than just data volume.  Data variety (i.e., structured, semi-structured, and unstructured data, as well as other types, such as the sensor data emanating from the Internet of Things) and data velocity (i.e., how fast data is being produced and how fast the data must be processed to meet demand) are also key characteristics of the big challenges of big data.

John Dodge and Bob Gourley recently discussed big data on Enterprise CIO Forum Radio, where Gourley explained that big data is essentially “the data that your enterprise is not currently able to do analysis over.”  This point resonates with a similar one made by Bill Laberis, who recently discussed new global research where half of the companies polled responded that they cannot effectively deal with analyzing the rising tide of data available to them.

Most of the big angst about big data comes from this fear that organizations are not tapping the potential business value of all that data not currently being included in their analytics and decision making.  This reminds me of psychologist Herbert Simon, who won the 1978 Nobel Prize in Economics for his pioneering research on decision making, which included comparing and contrasting the decision-making strategies of maximizing and satisficing (a term that combines satisfying with sufficing).

Simon explained that a maximizer is like a perfectionist who considers all the data they can find because they need to be assured that their decision was the best that could be made.  This creates a psychologically daunting task, especially as the amount of available data constantly increases (again, note that this observation was made over 40 years ago).  The alternative is to be a satisficer, someone who attempts to meet criteria for adequacy rather than identify an optimal solution, an approach that is especially valuable when time is a critical factor, as it is with the real-time decision making demanded by a constantly changing business world.

Big data strategies will also have to compare and contrast maximizing and satisficing.  Maximizers, if driven by their angst about all that data they are not analyzing, might succumb to information overload.  Satisficers, if driven by information optimization, might integrate just enough of big data into their business analytics to satisfy specific business needs.
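To make the contrast concrete, here is a minimal sketch in Python (the scoring function, adequacy threshold, and data source names are hypothetical, purely for illustration) of Simon’s two strategies applied to choosing a data source: the maximizer scans every candidate in search of the optimum, while the satisficer stops at the first candidate that is good enough.

```python
# A minimal sketch contrasting Herbert Simon's maximizing and satisficing
# strategies; the scoring function, threshold, and source names are hypothetical.

def maximize(candidates, score):
    """Examine every candidate and return the one with the highest score."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        s = score(candidate)          # cost grows with the amount of data considered
        if s > best_score:
            best, best_score = candidate, s
    return best

def satisfice(candidates, score, good_enough=0.8):
    """Return the first candidate whose score meets the adequacy threshold."""
    for candidate in candidates:
        if score(candidate) >= good_enough:
            return candidate          # stop searching as soon as adequacy is met
    return None                       # nothing met the threshold

if __name__ == "__main__":
    sources = ["crm_extract", "web_logs", "survey_sample", "full_history"]
    completeness = {"crm_extract": 0.75, "web_logs": 0.85,
                    "survey_sample": 0.60, "full_history": 0.95}.get
    print(maximize(sources, completeness))   # full_history (after scanning everything)
    print(satisfice(sources, completeness))  # web_logs (first good-enough answer)
```

Neither strategy is right in the abstract; the satisficer simply makes the adequacy criteria, and therefore the time saved, explicit.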

As big data forces us to revisit information overload, it may be useful for us to remember that originally the primary concern was not about the increasing amount of information, but instead the increasing access to information.  As Clay Shirky succinctly stated, “It’s not information overload, it’s filter failure.”  So, to harness the business value of big data, we will need better filters, which may ultimately make for the entire distinction between information overload and information optimization.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

The Data Encryption Keeper

The Cloud Security Paradox

The Good, the Bad, and the Secure

Securing your Digital Fortress

Shadow IT and the New Prometheus

Are Cloud Providers the Bounty Hunters of IT?

The Diderot Effect of New Technology

The IT Consumerization Conundrum

The IT Prime Directive of Business First Contact

A Sadie Hawkins Dance of Business Transformation

Are Applications the La Brea Tar Pits for Data?

Why does the sun never set on legacy applications?

The Partly Cloudy CIO

The IT Pendulum and the Federated Future of IT

Suburban Flight, Technology Sprawl, and Garage IT

Making EIM Work for Business

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, I discuss Enterprise Information Management (EIM) with John Ladley, the author of the excellent book Making EIM Work for Business, exploring what makes information management, not just useful, but valuable to the enterprise.

John Ladley is a business technology thought leader with 30 years of experience in improving organizations through the successful implementation of information systems.  He is a recognized authority in the use and implementation of business intelligence and enterprise information management.  John Ladley frequently writes and speaks on a variety of technology and enterprise information management topics.  His information management experience is balanced between strategic technology planning, project management, and, most important, the practical application of technology to business problems.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Two Weeks Before Christmas

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Season’s Greetings fellow data management enthusiasts and welcome to a special holiday-themed episode of OCDQ Radio.

With the Christmas, Hanukkah, Kwanzaa, and Festivus seasons now upon us, I revisited my ‘Twas Two Weeks Before Christmas blog post from 2009, which is based on the poem A Visit from St. Nicholas.  During this brief podcast, I perform a recital.

The entire OCDQ Blog family wishes you and yours all the best during this holiday season and the coming new year.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

You only get a Return from something you actually Invest in

In my previous post, I took a slightly controversial stance on a popular three-word phrase — Root Cause Analysis.  In this post, it’s another popular three-word phrase — Return on Investment (most commonly abbreviated as the acronym ROI).

What is the ROI of purchasing a data quality tool or launching a data governance program?

Zero.  Zip.  Zilch.  Intet.  Ingenting.  Rien.  Nada.  Nothing.  Nichts.  Niets.  Null.  Niente.  Bupkis.

There is No Such Thing as the ROI of purchasing a data quality tool or launching a data governance program.

Before you hire “The Butcher” to eliminate me for being The Man Who Knew Too Little about ROI, please allow me to explain.

Returns only come from Investments

Although the reason that you likely purchased a data quality tool is because you have business-critical data quality problems, simply purchasing a tool is not an investment (unless you believe in Magic Beans) since the tool itself is not a solution.

You use tools to build, test, implement, and maintain solutions.  For example, I spent several hundred dollars on new power tools last year for a home improvement project.  However, I haven’t received any return on my home improvement investment for a simple reason — I still haven’t even taken most of the tools out of their packaging yet.  In other words, I barely even started my home improvement project.  It is precisely because I haven’t invested any time and effort that I haven’t seen any returns.  And it certainly isn’t going to help me (although it would help Home Depot) if I believed buying even more new tools was the answer.

Although the reason that you likely launched a data governance program is because you have complex issues involving the intersection of data, business processes, technology, and people, simply launching a data governance program is not an investment since it does not conjure the three most important letters.

Data is only an Asset if Data is a Currency

In his book UnMarketing, Scott Stratten discusses this within the context of the ROI of social media (a commonly misunderstood aspect of social media strategy), but his insight is just as applicable to any discussion of ROI.  “Think of it this way: You wouldn’t open a business bank account and ask to withdraw $5,000 before depositing anything. The banker would think you are a loony.”

Yet, as Stratten explained, people do this all the time in social media by failing to build up what is known as social currency.  “You’ve got to invest in something before withdrawing. Investing your social currency means giving your time, your knowledge, and your efforts to that channel before trying to withdraw monetary currency.”

The same logic applies perfectly to data quality and data governance, where we could say it’s the failure to build up what I will call data currency.  You’ve got to invest in data before you can ever consider data an asset to your organization.  Investing your data currency means giving your time, your knowledge, and your efforts to data quality and data governance before trying to withdraw monetary currency (i.e., before trying to calculate the ROI of a data quality tool or a data governance program).

If you actually want to get a return on your investment, then actually invest in your data.  Invest in doing the hard daily work of continuously improving your data quality and putting into practice your data governance principles, policies, and procedures.

Data is only an asset if data is a currency.  Invest in your data currency, and you will eventually get a return on your investment.

You only get a return from something you actually invest in.

Related Posts

Can Enterprise-Class Solutions Ever Deliver ROI?

Do you believe in Magic (Quadrants)?

Which came first, the Data Quality Tool or the Business Need?

What Data Quality Technology Wants

A Farscape Analogy for Data Quality

The Data Quality Wager

“Some is not a number and soon is not a time”

The Dumb and Dumber Guide to Data Quality

There is No Such Thing as a Root Cause

Root cause analysis.  Most people within the industry, myself included, often discuss the importance of determining the root cause of data governance and data quality issues.  However, the complex cause-and-effect relationships underlying an issue mean that when an issue is encountered, you are often seeing only one of the numerous effects of its root cause (or causes).

In my post The Root! The Root! The Root Cause is on Fire!, I poked fun at those resistant to root cause analysis with the lyrics:

The Root! The Root! The Root Cause is on Fire!
We don’t want to determine why, just let the Root Cause burn.
Burn, Root Cause, Burn!

However, I think that the time is long overdue for even me to admit the truth — There is No Such Thing as a Root Cause.

Before you charge at me with torches and pitchforks for having an Abby Normal brain, please allow me to explain.

 

Defect Prevention, Mouse Traps, and Spam Filters

Some advocates of defect prevention claim that zero defects is not only a useful motivation, but also an attainable goal.  In my post The Asymptote of Data Quality, I quoted Daniel Pink’s book Drive: The Surprising Truth About What Motivates Us:

“Mastery is an asymptote.  You can approach it.  You can home in on it.  You can get really, really, really close to it.  But you can never touch it.  Mastery is impossible to realize fully.

The mastery asymptote is a source of frustration.  Why reach for something you can never fully attain?

But it’s also a source of allure.  Why not reach for it?  The joy is in the pursuit more than the realization.

In the end, mastery attracts precisely because mastery eludes.”

The mastery of defect prevention is sometimes distorted into a belief in data perfection, into a belief that we can build not just a better mousetrap, but a mousetrap that catches all the mice, or that by placing a mousetrap in our garage, which prevents mice from entering via the garage, we somehow also prevent mice from finding another way into our house.

Obviously, we can’t catch all the mice.  However, that doesn’t mean we should let the mice be like Pinky and the Brain:

Pinky: “Gee, Brain, what do you want to do tonight?”

The Brain: “The same thing we do every night, Pinky — Try to take over the world!”

My point is that defect prevention is not the same thing as defect elimination.  Defects evolve.  An excellent example of this is spam.  Even conservative estimates indicate almost 80% of all e-mail sent worldwide is spam.  A similar percentage of blog comments are spam, and spam-generating bots are quite prevalent on Twitter and other micro-blogging and social networking services.  The inconvenient truth is that as we build better and better spam filters, spammers create better and better spam.

Just as mousetraps don’t eliminate mice and spam filters don’t eliminate spam, defect prevention doesn’t eliminate defects.

However, mousetraps, spam filters, and defect prevention are essential proactive best practices.

 

There are No Lines of Causation — Only Loops of Correlation

There are no root causes, only strong correlations.  And correlations are strengthened by continuous monitoring.  Believing there are root causes means believing continuous monitoring, and by extension, continuous improvement, has an end point.  I call this the defect elimination fallacy, which I parodied in song in my post Imagining the Future of Data Quality.

Knowing there are only strong correlations means knowing continuous improvement is an infinite feedback loop.  A practical example of this reality comes from data-driven decision making, where:

  1. Better Business Performance is often correlated with
  2. Better Decisions, which, in turn, are often correlated with
  3. Better Data, which is precisely why Better Decisions with Better Data is foundational to Business Success — however . . .

This does not mean that we can draw straight lines of causation between (3) and (1), (3) and (2), or (2) and (1).

Despite our preference for simplicity over complexity, if bad data were the root cause of bad decisions and/or bad business performance, no organization would ever be profitable, and if good data were the root cause of good decisions and/or good business performance, every organization would always be profitable.  Even if good data were a root cause, not just a correlation, and even when data perfection is temporarily achieved, the effects would still be ephemeral because not only do defects evolve, but so does the business world.  This evolution requires an endless revolution of continuous monitoring and improvement.

Many organizations implement data quality thresholds to close the feedback loop evaluating the effectiveness of their data management and data governance, but few implement decision quality thresholds to close the feedback loop evaluating the effectiveness of their data-driven decision making.
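As a rough sketch of what closing both feedback loops might look like in practice (the metric names and threshold values below are hypothetical, not prescriptive), each monitoring cycle could check a data quality score and a decision quality score against their respective thresholds and suggest corrective action when either one slips:

```python
# Hypothetical sketch of closing both feedback loops: a data quality threshold
# and a decision quality threshold are checked on every monitoring cycle.

from dataclasses import dataclass

@dataclass
class CycleMetrics:
    data_quality: float      # e.g., share of records passing validation rules
    decision_quality: float  # e.g., share of decisions that met their business target

DATA_QUALITY_THRESHOLD = 0.95
DECISION_QUALITY_THRESHOLD = 0.70

def review_cycle(metrics: CycleMetrics) -> list:
    """Return the corrective actions suggested by this cycle's measurements."""
    actions = []
    if metrics.data_quality < DATA_QUALITY_THRESHOLD:
        actions.append("investigate data defects and adjust validation rules")
    if metrics.decision_quality < DECISION_QUALITY_THRESHOLD:
        actions.append("review decision criteria against actual business results")
    return actions or ["no action: keep monitoring (the loop never ends)"]

print(review_cycle(CycleMetrics(data_quality=0.97, decision_quality=0.55)))
```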

The quality of a decision is determined by the business results it produces, not the person who made the decision, the quality of the data used to support the decision, or even the decision-making technique.  Of course, the reality is that business results are often not immediate and may sometimes be contingent upon the complex interplay of multiple decisions.

Even though evaluating decision quality only establishes a correlation, and not a causation, between the decision execution and its business results, it is still essential to continuously monitor data-driven decision making.

Although the business world will never be totally predictable, we cannot turn a blind eye to the need for data-driven decision-making best practices, to the reality that no best practice can eliminate the potential for poor data quality and decision quality, or to the potential for poor business results even despite better data quality and decision quality.  Central to continuous improvement is closing the feedback loops that make data-driven decisions more transparent through better monitoring, allowing the organization to learn from its decision-making mistakes and make adjustments when necessary.

We need to connect the dots of better business performance, better decisions, and better data by drawing loops of correlation.

 

Decision-Data Feedback Loop

Continuous improvement enables better decisions with better data, which drives better business performance — as long as you never stop looping the Decision-Data Feedback Loop, and start accepting that there is no such thing as a root cause.

I discuss this, and other aspects of data-driven decision making, in my DataFlux white paper, which is available for download (registration required) using the following link: Decision-Driven Data Management

 

Related Posts

The Root! The Root! The Root Cause is on Fire!

Bayesian Data-Driven Decision Making

The Role of Data Quality Monitoring in Data Governance

The Circle of Quality

Oughtn’t you audit?

The Dichotomy Paradox, Data Quality and Zero Defects

The Asymptote of Data Quality

To Our Data Perfectionists

Imagining the Future of Data Quality

What going to the Dentist taught me about Data Quality

DQ-Tip: “There is No Such Thing as Data Accuracy...”

The HedgeFoxian Hypothesis

Bayesian Data-Driven Decision Making

In his book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman recounts the story of economist John Maynard Keynes, who, when asked what he does when new data is presented that does not support his earlier decision, responded: “I change my opinion.  What do you do?”

“This is the way good decision makers behave,” Redman explained.  “They know that a newly made decision is but the first step in its execution.  They regularly and systematically evaluate how well a decision is proving itself in practice by acquiring new data.  They are not afraid to modify their decisions, even admitting they are wrong and reversing course if the facts demand it.”

Since he has a PhD in statistics, it’s not surprising that Redman explained effective data-driven decision making using Bayesian statistics, which is “an important branch of statistics that differs from classic statistics in the way it makes inferences based on data.  One of its advantages is that it provides an explicit means to quantify uncertainty, both a priori, that is, in advance of the data, and a posteriori, in light of the data.”

Good decision makers, Redman explained, follow at least three Bayesian principles:

  1. They bring as much of their prior experience as possible to bear in formulating their initial decision spaces and determining the sorts of data they will consider in making the decision.
  2. For big, important decisions, they adopt decision criteria that minimize the maximum risk.
  3. They constantly evaluate new data to determine how well a decision is working out, and they do not hesitate to modify the decision as needed.
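As a minimal worked example of the Bayesian updating Redman describes (the scenario and numbers are mine, purely illustrative), a Beta prior over a decision’s success rate quantifies uncertainty a priori, and new outcome data revises it a posteriori:

```python
# Illustrative Beta-Binomial update: quantify uncertainty about a decision's
# success rate before (a priori) and after (a posteriori) observing outcomes.

def beta_mean(alpha: float, beta: float) -> float:
    return alpha / (alpha + beta)

# Prior experience: roughly 7 successes in 10 comparable past decisions.
alpha, beta = 7.0, 3.0
print(f"prior estimate of success rate: {beta_mean(alpha, beta):.2f}")

# New data arrives: 2 successes and 6 failures since the decision was made.
successes, failures = 2, 6
alpha += successes
beta += failures
print(f"posterior estimate of success rate: {beta_mean(alpha, beta):.2f}")
# The estimate drops from 0.70 to 0.50, signaling the decision may need revising.
```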

A key concept of statistical process control and continuous improvement is the importance of closing the feedback loop that allows a process to monitor itself, learn from its mistakes, and adjust when necessary.

The importance of building feedback loops into data-driven decision making is too often ignored.

I discuss this, and other aspects of data-driven decision making, in my DataFlux white paper, which is available for download (registration required) using the following link: Decision-Driven Data Management

 

Related Posts

Decision-Driven Data Management

The Speed of Decision

The Big Data Collider

A Decision Needle in a Data Haystack

The Data-Decision Symphony

Thaler’s Apples and Data Quality Oranges

Satisficing Data Quality

Data Confabulation in Business Intelligence

The Data that Supported the Decision

Data Psychedelicatessen

OCDQ Radio - Big Data and Big Analytics

OCDQ Radio - Good-Enough Data for Fast-Enough Decisions

The Circle of Quality

A Farscape Analogy for Data Quality

OCDQ Radio - Organizing for Data Quality

No Datum is an Island of Serendip

Continuing a series of blog posts inspired by the highly recommended book Where Good Ideas Come From by Steven Johnson, in this blog post I want to discuss the important role that serendipity plays in data — and, by extension, business success.

Let’s start with a brief etymology lesson.  The origin of the word serendipity, which is commonly defined as a “happy accident” or “pleasant surprise,” can be traced to the Persian fairy tale The Three Princes of Serendip, whose heroes were always making discoveries of things they were not in quest of either by accident or by sagacity (i.e., the ability to link together apparently innocuous facts to come to a valuable conclusion).  Serendip was an old name for the island nation now known as Sri Lanka.

“Serendipity,” Johnson explained, “is not just about embracing random encounters for the sheer exhilaration of it.  Serendipity is built out of happy accidents, to be sure, but what makes them happy is the fact that the discovery you’ve made is meaningful to you.  It completes a hunch, or opens up a door in the adjacent possible that you had overlooked.  Serendipitous discoveries often involve exchanges across traditional disciplines.  Serendipity needs unlikely collisions and discoveries, but it also needs something to anchor those discoveries.  The challenge, of course, is how to create environments that foster these serendipitous connections.”

 

No Datum is an Island of Serendip

“No man is an island, entire of itself; every man is a piece of the continent, a part of the main.”

These famous words were written by the poet John Donne, and their meaning is generally taken to be that human beings do not thrive when isolated from others.  Likewise, data does not thrive in isolation.  However, many organizations persist in data isolation, in data silos created when separate business units see power in the hoarding of data, not in the sharing of data.

But no business unit is an island, entire of itself; every business unit is a piece of the organization, a part of the enterprise.

Likewise, no datum is an Island of Serendip.  Data thrives through the connections, collisions, and combinations that collectively unleash serendipity.  When data is exchanged across organizational boundaries, and shared with the entire enterprise, it enables the interdisciplinary discoveries required for making business success more than just a happy accident or pleasant surprise.

Our organizations need to create collaborative environments that foster serendipitous connections bringing all of our business units and people together around our shared data assets.  We need to transcend our organizational boundaries, reduce our data silos, and gather our enterprise’s heroes together on the Data Island of Serendip — our United Nation of Business Success.

 

Related Posts

Data Governance and the Adjacent Possible

The Three Most Important Letters in Data Governance

The Stakeholder’s Dilemma

The Data Cold War

Turning Data Silos into Glass Houses

The Good Data

DQ-BE: Single Version of the Time

My Own Private Data

Sharing Data

Are you Building Bridges or Digging Moats?

The Collaborative Culture of Data Governance

The Interconnected User Interface

Commendable Comments (Part 11)

This Thursday is Thanksgiving Day, which in the United States is a holiday with a long, varied, and debated history.  However, the most consistent themes remain family and friends gathering together to share a large meal and express their gratitude.

This is the eleventh entry in my ongoing series for expressing my gratitude to my readers for their commendable comments on my blog posts.  Receiving comments is the most rewarding aspect of my blogging experience because not only do comments greatly improve the quality of my blog, but they also help me better appreciate the difference between what I know and what I only think I know.  That is why, although I am truly grateful to all of my readers, I am most grateful to my commenting readers.

 

Commendable Comments

On The Stakeholder’s Dilemma, Gwen Thomas commented:

“Recently got to listen in on a ‘cooperate or not’ discussion.  (Not my clients.) What struck me was that the people advocating cooperation were big-picture people (from architecture and process) while those who just wanted what they wanted were more concerned about their own short-term gains than about system health.  No surprise, right?

But what was interesting was that they were clearly looking after their own careers, and not their silos’ interests.  I think we who help focus and frame the Stakeholder’s Dilemma situations need to be better prepared to address the individual people involved, and not just the organizational roles they represent.”

On Data, Information, and Knowledge Management, Frank Harland commented:

“As always, an intriguing post. Especially where you draw a parallel between Data Governance and Knowledge Management (wisdom management?)  We sometimes portray data management (current term) as ‘well managed data administration’ (term from 70s-80s).  As for the debate on ‘data’ and ‘information’ I prefer to see everything written, drawn and / or stored on paper or in digital format as data with various levels of informational value, depending on the amount and quality of metadata surrounding the data item and the accessibility, usefulness (quality) of that item.

For example, 12024561414 is a number with low informational value. I could add metadata, for instance: ‘Phone number’, that makes it potentially known as a phone number.  Rather than let you find out whose number it is we could add more information value and add more metadata like: ‘White House Switchboard’.  Accessibility could be enhanced by improving formatting like: (1) 202-456-1414.

What I am trying to say with this example is that data items should be placed on a rising scale of informational value rather than be put on steps or firm levels of informational value.  So the Information Hierarchy provided by Professor Larson does not work very well for me.  It could work only if for all data items the exact information value was determined for every probable context.  This model is useful for communication purposes.”

On Plato’s Data, Peter Perera commented:

“‘erised stra ehru oyt ube cafru oyt on wohsi.’

To all Harry Potter fans this translates to: ‘I show not your face but your heart’s desire.’

It refers to The Mirror of Erised.  It does not reflect reality but what you desire. (Erised is Desired spelled backwards.)  Often data will cast a reflection of what people want to see.

‘Dumbledore cautions Harry that the mirror gives neither knowledge nor truth and that men have wasted away before it, entranced by what they see.’  How many systems are really Mirrors of Erised?”

On Plato’s Data, Larisa Bedgood commented:

“Because the prisoners in the cave are chained and unable to turn their heads to see what goes on behind them, they perceive the shadows as reality.  They perceive imperfect reflections of truth and reality.

Bringing the allegory to modern times, this serves as a good reminder that companies MUST embrace data quality for an accurate and REAL view of customers, business initiatives, prospects, and so on.  Continuing to view half-truths based on possibly faulty data and information means you are just lost in a dark cave!

I also like the comparison to the Mirror of Erised.  One of my favorite movies is the Matrix, in which there are also a lot of parallelisms to Plato’s Cave Allegory.  As Morpheus says to Neo: ‘That you are a slave, Neo.  Like everyone else you were born into bondage.  Into a prison that you cannot taste or see or touch.  A prison for your mind.’  Once Neo escapes the Matrix, he discovers that his whole life was based on shadows of the truth.

Plato, Harry Potter, and Morpheus — I’d love to hear a discussion between the three of them in a cave!”

On Plato’s Data, John Owens commented:

“It is true that data is only a reflection of reality but that is also true of anything that we perceive with our senses.  When the prisoners in the cave turn around, what they perceive with their eyes in the visible spectrum is only a very narrow slice of what is actually there.  Even the ‘solid’ objects they see, and can indeed touch, are actually composed of 99% empty space.

The questions that need to be asked and answered about the essence of data quality are far less esoteric than many would have us believe.  They can be very simple, without being simplistic.  Indeed simplicity can be seen as a cornerstone of true data quality.  If you cannot identify the underlying simplicity that lies at the heart of data quality you can never achieve it.  Simple questions are the most powerful.  Questions like, ‘In our world (i.e., the enterprise in question) what is it that we need to know about (for example) a Sale that will enable us to operate successfully and meet all of our goals and objectives?’  If the enterprise cannot answer such simple questions then it is in trouble.  Making the questions more complicated will not take the enterprise any closer to where it needs to be.  Rather it will completely obscure the goal.

Data quality is rather like a ‘magic trick’ done by a magician.  Until you know how it is done it appears to an unfathomable mystery.  Once you find out that is merely an illusion, the reality is absolutely simple and, in fact, rather mundane.  But perhaps that is why so many practitioners perpetuate the illusion.  It is not for self gain.  They just don’t want to tell the world that, when it comes to data quality, there is no Tooth Fairy, no Easter Bunny, or no Santa Claus.  It’s sad, but true.  Data quality is boringly simple!”

On Plato’s Data, Peter Benson commented:

“Actually I would go substantially further, whereas data was originally no more than a representation of the real world and if validation was required the real world was the ‘authoritative source’ — but that is clearly no longer the case.  Data is in fact the new reality!

Data is now used to track everything, if the data is wrong the real world item disappears.  It may have really been destroyed or it may be simply lost, but it does not matter, if the data does not provide evidence of its existence then it does not exist.  If you doubt this, just think of money, how much you have is not based on any physical object but on data.

By the way the theoretical definition I use for data is as follows:

Datum — a disruption in a continuum.

The practical definition I use for data is as follows:

Data — elements into which information is transformed so that it can be stored or moved.”

On Data Governance and the Adjacent Possible, Paul Erb commented:

“We can see that there’s a trench between those who think adjacent means out of scope and those who think it means opportunity.  Great leaders know that good stories make for better governance for an organization that needs to adapt and evolve, but stay true to its mission. Built from, but not about, real facts, good fictions are broadly true without being specifically true, and therefore they carry well to adjacent business processes where their truths can be applied to making improvements.

On the other hand, if it weren’t for nonfiction — accounts of real markets and processes — there would be nothing for the POSSIBLE to be adjacent TO.  Managers often have trouble with this because they feel called to manage the facts, and call anything else an airy-fairy waste of time.

So a data governance program needs to assert whether its purpose is to fix the status quo only, or to fix the status quo in order to create agility to move into new areas when needed.  Each of these should have its own business case and related budgets and thresholds (tolerances) in the project plan.  And it needs to choose its sponsorship and data quality players accordingly.”

On You Say Potato and I Say Tater Tot, John O’Gorman commented:

“I’ve been working on a definitive solution for the data / information / metadata / attributes / properties knot for a while now and I think I have it figured out.

I read your blog entitled The Semantic Future of MDM and we share the same philosophy even while we differ a bit on the details.  Here goes.  It’s all information.  Good, bad, reliable or not, the argument whether data is information or vice versa is not helpful.  The reason data seems different than information is because it has too much ambiguity when it is out of context.  Data is like a quantum wave: it has many possibilities one of which is ‘collapsed’ into reality when you add context.  Metadata is not a type of data, any more than attributes, properties or associations are a type of information.  These are simply conventions to indicate the role that information is playing in a given circumstance.

Your Michelle Davis example is a good illustration: Without context, that string could be any number of individuals, so I consider it data.  Give it a unique identifier and classify it as a digital representation in the class of Person, however and we have information.  If I then have Michelle add attributes to her personal record — like sex, age, etc. — and assuming that these are likewise identified and classed — now Michelle is part of a set, or relation. Note that it is bad practice — and consequently the cause of many information management headaches — to use data instead of information.  Ambiguity kills.  Now, if I were to use Michelle’s name in a Subject Matter Expert field as proof of the validity of a digital asset; or in the Author field as an attribute, her information does not *become* metadata or an attribute: it is still information.  It is merely being used differently.

In other words, in my world while the terms ‘data’ and ‘information’ are classified as concepts, the terms ‘metadata’, ‘attribute’ and ‘property’ are classified as roles to which instances of those concepts (well, one of them anyway) can be put, i.e., they are fit for purpose.  This separation of the identity and class of the string from the purpose to which it is being assigned has produced very solid results for me.”

Thanks for giving your comments

Thank you very much for giving your comments and sharing your perspectives with our collablogaunity.  This entry in the series highlighted commendable comments on OCDQ Blog posts published between July and November of 2011.

Since there have been so many commendable comments, please don’t be offended if one of your comments wasn’t featured.

Please keep on commenting and stay tuned for future entries in the series.

Thank you for reading the Obsessive-Compulsive Data Quality (OCDQ) blog.  Your readership is deeply appreciated.

 

Related Posts

Commendable Comments (Part 10) – The 300th OCDQ Blog Post

730 Days and 264 Blog Posts Later – The Second Blogiversary of OCDQ Blog

OCDQ Blog Bicentennial – The 200th OCDQ Blog Post

Commendable Comments (Part 9)

Commendable Comments (Part 8)

Commendable Comments (Part 7)

Commendable Comments (Part 6)

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

Commendable Comments (Part 4)

Commendable Comments (Part 3)

Commendable Comments (Part 2)

Commendable Comments (Part 1)

The Speed of Decision

In a previous post, I used the Large Hadron Collider as a metaphor for big data and big analytics, where the creative destruction caused by high-velocity collisions of large volumes of varying data attempts to reveal elementary particles of business intelligence.

Since recent scientific experiments have sparked discussion about the possibility of exceeding the speed of light, in this blog post I examine whether it’s possible to exceed the speed of decision (i.e., the constraints that time puts on data-driven decision making).

 

Is Decision Speed more important than Data Quality?

In my blog post Thaler’s Apples and Data Quality Oranges, I explained how time-inconsistent data quality preferences within business intelligence reflect the reality that with the speed at which things change these days, more near-real-time operational business decisions are required, which sometimes makes decision speed more important than data quality.

Even though advancements in computational power, network bandwidth, parallel processing frameworks (e.g., MapReduce), scalable and distributed models (e.g., cloud computing), and other techniques (e.g., in-memory computing) are making real-time data-driven decisions more technologically possible than ever before, as I explained in my blog post Satisficing Data Quality, data-driven decision making often has to contend with the practical trade-offs between correct answers and timely answers.

Although we can’t afford to completely sacrifice data quality for faster business decisions, and obviously high quality data is preferable to poor quality data, less than perfect data quality cannot be used as an excuse to delay making a critical decision.
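As a hedged illustration of that trade-off between correct answers and timely answers (the time budget and data below are hypothetical), consider an estimate that keeps refining as more records are processed but returns its best-so-far answer when the decision deadline arrives:

```python
# Hypothetical sketch of trading completeness for timeliness: refine an
# estimate record by record, but stop when the decision deadline arrives.

import time

def timely_estimate(records, deadline_seconds=0.05):
    """Running average over as many records as the time budget allows."""
    start = time.monotonic()
    total, count = 0.0, 0
    for value in records:
        total += value
        count += 1
        if time.monotonic() - start > deadline_seconds:
            break  # a good-enough answer now beats a perfect answer too late
    return total / count if count else None, count

estimate, records_used = timely_estimate(x * 0.001 for x in range(1_000_000))
print(f"estimate from first {records_used} records: {estimate:.3f}")
```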

 

Is Decision Speed more important than Decision Quality?

The increasing demand for real-time data-driven decisions is not only requiring us to re-evaluate our data quality thresholds.  In my blog post The Circle of Quality, I explained the connection between data quality and decision quality, and how result quality trumps them both because an organization’s success is measured by the quality of the business results it produces.

Again, with the speed at which the business world now changes, the reality is that the fear of making a mistake cannot be used as an excuse to delay making a critical decision, which sometimes makes decision speed more important than decision quality.

“Fail faster” has long been hailed as the mantra of business innovation.  It’s not because failure is a laudable business goal, but instead because the faster you can identify your mistakes, the faster you can correct your mistakes.  Of course this requires that you are actually willing to admit you made a mistake.

(As an aside, I often wonder what’s more difficult for an organization to admit: poor data quality or poor decision quality?)

Although good decisions are obviously preferable to bad decisions, we have to acknowledge the fragility of our knowledge and accept that mistake-driven learning is an essential element of efficient and effective data-driven decision making.

Although the speed of decision is not the same type of constant as the speed of light, in our constantly changing business world, the speed of decision represents the constant demand for good-enough data for fast-enough decisions.

 

Related Posts

The Big Data Collider

A Decision Needle in a Data Haystack

The Data-Decision Symphony

Thaler’s Apples and Data Quality Oranges

Satisficing Data Quality

Data Confabulation in Business Intelligence

The Data that Supported the Decision

Data Psychedelicatessen

OCDQ Radio - Big Data and Big Analytics

OCDQ Radio - Good-Enough Data for Fast-Enough Decisions

Data, Information, and Knowledge Management

Data In, Decision Out

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

The Circle of Quality

The Three Most Important Letters in Data Governance


In his book I Is an Other: The Secret Life of Metaphor and How It Shapes the Way We See the World, James Geary included several examples of the psychological concept of priming.  “Our metaphors prime how we think and act.  This kind of associative priming goes on all the time.  In one study, researchers showed participants pictures of objects characteristic of a business setting: briefcases, boardroom tables, a fountain pen, men’s and women’s suits.  Another group saw pictures of objects—a kite, sheet music, a toothbrush, a telephone—not characteristic of any particular setting.”

“Both groups then had to interpret an ambiguous social situation, which could be described in several different ways.  Those primed by pictures of business-related objects consistently interpreted the situation as more competitive than those who looked at pictures of kites and toothbrushes.”

“This group’s competitive frame of mind asserted itself in a word completion task as well.  Asked to complete fragments such as wa_, _ight, and co_p__tive, the business primes produced words like war, fight, and competitive more often than the control group, eschewing equally plausible alternatives like was, light, and cooperative.”

Communication, collaboration, and change management are arguably the three most critical aspects for implementing a new data governance program successfully.  Since all three aspects are people-centric, we should pay careful attention to how we are priming people to think and act within the context of data governance principles, policies, and procedures.  We could simplify this down to whether we are fostering an environment that primes people for cooperation—or primes people for competition.

Since there are only three letters of difference between the words cooperative and competitive, we could say that these are the three most important letters in data governance.


The Big Data Collider

As I mentioned in a previous post, I am reading the book Where Good Ideas Come From by Steven Johnson, which examines recurring patterns in the history of innovation.  The current chapter that I am reading is dispelling the traditional notion of the eureka effect by explaining that the evolution of ideas, like all evolution, stumbles its way toward the next good idea, which inevitably, and not immediately, leads to a significant breakthrough.

One example is how the encyclopedic book Enquire Within Upon Everything, the first edition of which was published in 1856, influenced a young British scientist, who in his childhood in the 1960s was drawn to the “suggestion of magic in the book’s title, and who spent hours exploring this portal to the world of information, along with the wondrous feeling of exploring an immense trove of data.”  His childhood fascination with data and information influenced a personal project that he started in 1980, which ten years later became a professional project while he was working in the Swiss particle physics lab CERN.

The scientist was Tim Berners-Lee and his now famous project created the World Wide Web.

“Journalists always ask me,” Berners-Lee explained, “what the crucial idea was, or what the singular event was, that allowed the Web to exist one day when it hadn’t the day before.  They are frustrated when I tell them there was no eureka moment.”

“Inventing the World Wide Web involved my growing realization that there was a power in arranging ideas in an unconstrained, web-like way.  And that awareness came to me through precisely that kind of process.”

CERN is famous for its Large Hadron Collider that uses high-velocity particle collisions to explore some of the open questions in physics concerning the basic laws governing the interactions and forces among elementary particles in an attempt to understand the deep structure of space and time, and, in particular, the intersection of quantum mechanics and general relativity.

 

The Big Data Collider

While reading this chapter, I stumbled toward an idea about Big Data.  As Gartner Research explains, although the term acknowledges the exponential growth, availability, and use of information in today’s data-rich landscape, Big Data is about more than just data volume.  Data variety (i.e., structured, semi-structured, and unstructured data, as well as other types of data such as sensor data) and data velocity (i.e., how fast data is being produced and how fast the data must be processed to meet demand) are also key characteristics of Big Data.

David Loshin’s recent blog post about Hadoop and Big Data provides a straightforward explanation and simple example of using MapReduce for not only processing fast-moving large volumes of various data, but also deriving meaningful insights from it.
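Loshin’s post walks through his own example; purely as a generic sketch of the map-and-reduce pattern he describes (not his example, and no Hadoop cluster required), the following counts term frequencies across a small pile of text records by mapping each record to intermediate pairs, grouping by key, and reducing each group:

```python
# Minimal in-memory illustration of the map/reduce pattern: map each record
# to (key, value) pairs, group by key, then reduce each group to a result.

from collections import defaultdict

def map_phase(record: str):
    for word in record.lower().split():
        yield word, 1

def reduce_phase(key, values):
    return key, sum(values)

def map_reduce(records):
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):   # map: emit intermediate pairs
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())  # reduce per key

records = ["big data big analytics", "big noise small signal"]
print(map_reduce(records))  # {'big': 3, 'data': 1, 'analytics': 1, ...}
```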

My idea was how Big Analytics uses the Big Data Collider to allow large volumes of various data particles to bounce off each other in high-velocity collisions.  Although a common criticism of Big Data is that it contains more noise than signal, smashing data particles together in the Big Data Collider may destroy most of the noise in the collision, allowing the signals that survive that creative destruction to potentially reduce into an elementary particle of business intelligence.

Admittedly not the greatest metaphor, but as we enquire within data about everything in the Information Age, I thought that it might be useful to share my idea so that it might stumble its way toward the next good idea by colliding with an idea of your own.

 

Related Posts

OCDQ Radio - Big Data and Big Analytics

OCDQ Radio - Good-Enough Data for Fast-Enough Decisions

OCDQ Radio - A Brave New Data World

Data, Information, and Knowledge Management

Thaler’s Apples and Data Quality Oranges

Data Confabulation in Business Intelligence

Data In, Decision Out

The Data-Decision Symphony

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Beyond a “Single Version of the Truth”

The General Theory of Data Quality

The Data-Information Continuum

Schrödinger’s Data Quality

Data Governance and the Buttered Cat Paradox