DQ-Tip: “Undisputable fact about the value and use of data…”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Undisputable fact about the value and use of data—any business process that is based on the assumption of having access to trustworthy, accurate, and timely data will produce invalid, unexpected, and meaningless results if this assumption is false.”

This DQ-Tip is from the excellent book Master Data Management and Data Governance by Alex Berson and Larry Dubov.

As data quality professionals, our strategy for quantifying and qualifying the business value of data is an essential tenet of how we make the pitch to get executive management to invest in enterprise data quality improvement initiatives.

However, all too often, the problem when we talk about data with executive management is exactly that—we talk about data.

Let’s instead follow the sage advice of Berson and Dubov.  Before discussing data quality, let’s research the data quality assumptions underlying core business processes.  This due diligence will allow us to frame data quality discussions within a business context by focusing on how the organization is using its data to support its business processes, which, in turn, lets us qualify and quantify the business value of having high-quality data as a strategic corporate asset.

 

Related Posts

DQ-Tip: “Data quality tools do not solve data quality problems...”

DQ-Tip: “There is no such thing as data accuracy...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-Tip: “There is no point in monitoring data quality...”

DQ-Tip: “Don't pass bad data on to the next person...”

DQ-Tip: “...Go talk with the people using the data”

DQ-Tip: “Data quality is about more than just improving your data...”

DQ-Tip: “Start where you are...”

The IT Pendulum and the Federated Future of IT

This blog post is sponsored by the Enterprise CIO Forum and HP.

In a previous post, I asked whether the consumerization of IT, which is decentralizing IT and is a net positive for better servicing the diverse technology needs of large organizations, will help or hinder enterprise-wide communication and collaboration.

Stephen Putman commented that a centralized IT department will just change focus, but not disappear altogether:

“I look at the centralized/decentralized IT argument in the same way as the in-house/outsourced development argument—as a pendulum, over time.  Right now, the pendulum is swinging toward the decentralized end, but when people realize the need for collaboration and enterprise-wide communication (dare I say, ‘federalization’), the need for a centralized organization will be better realized.  I think that smart centralized IT departments will realize this, and shift focus to facilitating collaboration.”

I agree with Putman that the IT Pendulum is currently swinging toward the decentralized end, but once large organizations realize the increased communication and collaboration challenges it will bring, then the IT Pendulum will start swinging back a little more toward the centralized end (and hopefully before people within the organization start taking swings at each other).

A federated approach, combining centralized IT control over core areas (including, but not necessarily limited to, facilitating communication and collaboration) with business unit and individual user autonomy over areas where centralization would disrupt efficiency and effectiveness, may ultimately be what many large organizations will have to adopt for long-term success.

The Federated Future of IT, which would allow the IT Pendulum to swing in harmony, balancing centralized control with decentralized autonomy, may not be as far off as some might imagine.  After all, a centralized IT department is one of the few organizational functions that regularly interacts with the entire enterprise, and is therefore already strategically positioned to be able to best support the evolving technology needs of the organization.

However, the required paradigm shift is for IT to move its focus away from controlling how the organization uses technology, and toward advising the organization on how to better leverage technology—including centralized and decentralized options.

Joel Dobbs has advised that it’s “absolutely critical” for IT to embrace consumerization, and John Dodge has recently blogged about “striking that hybrid balance where IT is delivering agility.”  Historically, IT Delivery has been focused on control, but the future of IT is to boldly go beyond just a centralized department, and to establish a united federation of information technology that truly enables the enterprise-wide communication and collaboration needed for 21st century corporate survival and success.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

A Sadie Hawkins Dance of Business Transformation

Are Applications the La Brea Tar Pits for Data?

Why does the sun never set on legacy applications?

The Partly Cloudy CIO

Suburban Flight, Technology Sprawl, and Garage IT

Follow the Data

In his recent blog post Multiple Data Touch Points, David Loshin wrote about how common it is for organizations not to document how their processes acquire, read, or modify data.  As a result, when an error occurs, it manifests itself in a downstream application, and it takes a long time to figure out where the error originated and how it relates to the resulting negative impacts.

Data is often seen as just a by-product of business and technical processes, but a common root cause of poor data quality is this lack of awareness of the end-to-end process of how the organization is using its data to support its business activities.

For example, imagine we have discovered an error in a report.  Do we know the business and technical processes the data passed through before appearing in the report?  Do we know the chain of custody for the report data?  In other words, do we know the business analyst who prepared it, the data steward who verified its data quality, the technical architect who designed its database, and the data entry clerk who created the data?  And if we can’t answer these questions, do we even know where to start looking?

When an organization doesn’t understand its multiple data touch points, it’s blindsided by the negative business impacts of poor data quality, e.g., a customer service nightmare, a regulatory compliance failure, or a financial reporting scandal.

“Follow the money” is an expression often used during the investigation of criminal activities or political corruption.  I remember the phrase from the 1976 Academy Award winning movie All the President’s Men, which was based on the non-fiction book of the same name written by Carl Bernstein and Bob Woodward, two of the journalists who investigated the Watergate scandal.

“Follow the data” is an expression sometimes used during the investigation of incidents of poor data quality.  However, it’s often limited to reactive data cleansing projects where the only remediation will be finding and fixing the critical data problems, but without taking corrective action to resolve the root cause—and in some cases, without even identifying the root cause.

A more proactive approach is establishing a formal process to follow the data from its inception and document every step of its journey throughout the organization, including the processes and people that the data encountered.  This makes it much easier to retrace data’s steps, recover more quickly when something goes awry, and prevent similar problems from recurring in the future.
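
To make that formal process more concrete, here is a minimal sketch, in Python, of an append-only lineage log that records every person and process touching a dataset, so a report’s chain of custody can be retraced when an error surfaces.  The dataset names, steps, and roles are illustrative assumptions, not anything from David Loshin’s post.

```python
# A minimal sketch of recording data touch points so a report's chain of
# custody can be retraced; all names below are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class TouchPoint:
    dataset: str      # e.g., "q3_revenue_report"
    step: str         # e.g., "data entry", "quality verification", "report preparation"
    actor: str        # person or process that touched the data
    role: str         # e.g., "data entry clerk", "data steward", "business analyst"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class LineageLog:
    """Append-only log of every process and person that touched a dataset."""
    def __init__(self) -> None:
        self._events: List[TouchPoint] = []

    def record(self, event: TouchPoint) -> None:
        self._events.append(event)

    def chain_of_custody(self, dataset: str) -> List[TouchPoint]:
        """Return the touch points for a dataset, oldest first."""
        return sorted(
            (e for e in self._events if e.dataset == dataset),
            key=lambda e: e.timestamp,
        )

# Usage: when a report error surfaces, retrace who and what touched the data.
log = LineageLog()
log.record(TouchPoint("q3_revenue_report", "data entry", "j.doe", "data entry clerk"))
log.record(TouchPoint("q3_revenue_report", "quality verification", "a.smith", "data steward"))
log.record(TouchPoint("q3_revenue_report", "report preparation", "b.jones", "business analyst"))

for event in log.chain_of_custody("q3_revenue_report"):
    print(f"{event.timestamp:%Y-%m-%d} {event.step}: {event.actor} ({event.role})")
```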

Deep Throat told Woodward and Bernstein to “Follow the money.”

Deep Thought told me 42 times to tell you to “Follow the data.”


Retroactive Data Quality

As I, and many others, have blogged about many times before, the proactive approach to data quality, i.e., defect prevention, is highly recommended over the reactive approach to data quality, i.e., data cleansing.

However, reactive data quality still remains the most common approach because “let’s wait and see if something bad happens” is typically much easier to sell strategically than “let’s try to predict the future by preventing something bad before it happens.”

Of course, when something bad does happen (and it always does), it is often too late to do anything about it.  So imagine if we could somehow travel back in time and prevent specific business-impacting occurrences of poor data quality from happening.

This would appear to be the best of both worlds since we could reactively wait and see if something bad happens, and if (when) it does, then we could travel back in time and proactively prevent just that particular bad thing from happening to our data quality.

This approach is known as Retroactive Data Quality—and it has been (somewhat successfully) implemented at least three times.

 

Flux Capacitated Data Quality

In 1985, Dr. Emmett “Doc” Brown turned a modified DeLorean DMC-12 into a retroactive data quality machine that, when accelerated to 88 miles per hour, created a time displacement window using its flux capacitor (according to Doc, it’s what makes time travel possible), powered by 1.21 gigawatts of electricity, which could be provided by either a nuclear reaction or a lightning strike.

On October 25, 1985, Doc sent data quality expert Marty McFly back in time to November 5, 1955 to prevent a few data defects in the original design of the flux capacitor, which inadvertently triggered some severe data defects in 2015, requiring Doc and Marty to travel back to 1955, then 1885, before traveling Back to the Future of a defect-free 1985—when the flux capacitor was destroyed.

 

Quantum Data Quality

In 1989, theorizing a data steward could time travel within his own database, Dr. Sam Beckett launched a retroactive data quality project called Quantum Data Quality, stepped into its Quantum Leap data accelerator—and vanished.

He awoke to find himself trapped in the past, stewarding data that was not his own, and driven by an unknown force to change data quality for the better.  His only guide on this journey was Al, a subject matter expert from his own time, who appeared in the form of a hologram only Sam could see and hear.  And so, Dr. Beckett found himself leaping from database to database, putting data right that once went wrong, and hoping each time that his next leap would be the leap home to his own database—but Sam never returned home.

 

Data Quality Slingshot Effect

The slingshot effect is caused by traveling in a starship at an extremely high warp factor toward a sun.  After allowing the gravitational pull to accelerate it to even faster speeds, the starship will then break away from the sun, which creates the so-called slingshot effect that transports the starship through time.

In 2267, Captain Gene Roddenberry will begin a Star Trek, commanding a starship using the slingshot effect to travel back in time to September 8, 1966 to launch a retroactive data quality initiative that has the following charter:

“Data: the final frontier.  These are the voyages of the starship Quality.  Its continuing mission: To explore strange, new databases; To seek out new data and new corporations; To boldly go where no data quality has gone before.”

 

Retroactive Data Quality Log, Supplemental

It is understandable if many of you doubt the viability of time travel as an approach to improving your data quality.  After all, whenever Doc and Marty, or Sam and Al, or Captain Roddenberry and the crew of the starship Quality, travel back in time and prevent specific business-impacting occurrences of poor data quality from happening, how do we prove they were successful?  Within the resulting altered timeline, there would be no traces of the data quality issues after they were retroactively resolved.

“Great Scott!”  It will always be harder to sell the business benefits of defect prevention than it is to sell data cleansing after a CxO responds “Oh, boy!” the next time poor data quality negatively impacts business performance.

Nonetheless, you must continue your mission to engage your organization in a proactive approach to data quality.  “Make It So!”

 

Related Posts

Groundhog Data Quality Day

What Data Quality Technology Wants

To Our Data Perfectionists

Finding Data Quality

MacGyver: Data Governance and Duct Tape

What going to the dentist taught me about data quality

Microwavable Data Quality

A Tale of Two Q’s

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

Suburban Flight, Technology Sprawl, and Garage IT

This blog post is sponsored by the Enterprise CIO Forum and HP.

Suburban flight is a term describing the migration of people away from an urban center into its surrounding, less-populated, residential communities, aka suburbs.  The urban center is a large city or metropolitan area providing a source of employment and other professional opportunities, whereas suburbs provide a sense of community and other personal opportunities.  Despite their strong economic ties to the urban center, most suburbs have political autonomy and remain focused on internal matters.

Historically, the IT department has been a technological urban center providing and managing the technology used by all of the people within a large organization.  However, in his blog post Has your IT department died yet?, John Dodge pondered whether this notion of “the IT department as a single and centralized organization is on the way out at many enterprises.”

David Heinemeier Hansson raised similar points in his recent blog post The end of the IT department, explaining “the problem with IT departments seems to be that they’re set up as a forced internal vendor.  But change is coming.  Dealing with technology has gone from something only for the techy geeks to something more mainstream.”

Nicholas Carr, author of the infamous 2004 book Does IT Matter?, expanded on his perspective in his 2009 book The Big Switch, which uses the history of electric grid power utilities as a backdrop and analogy for Internet-based utility (i.e., cloud) computing:

“In the long run, the IT department is unlikely to survive, at least not in its familiar form.  IT will have little left to do once the bulk of business computing shifts out of private data centers and into the cloud.  Business units and even individual employees will be able to control the processing of information directly, without the need for legions of technical people.”

Cloud computing, software-as-a-service (SaaS), open source software, and the rise of mobile computing have all been contributing factors to the technology sprawl that has begun within many large organizations, which, similar to suburban flight, is causing a migration of people and business units away from an IT-centric approach to providing for their technology needs.

We are all familiar with the stories of how some of the world’s largest technology companies were started in a garage, including Google, Apple, and Hewlett-Packard (HP), which William Hewlett and David Packard started in a garage in Palo Alto, California.

However, in this new era of the consumerization of IT, new information technology projects may start—and stay—in the garage, where in the organizational suburbs, most business units, and some individual users, will run their own Garage IT department.

Although decentralizing IT is a net positive for better servicing the technology needs of the organization, will Garage IT stop large organizations from carpooling together toward the business-driven success of the corporate urban center?  In other words, will the technological autonomy of the consumerization of IT help or hinder enterprise-wide communication and collaboration?

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

A Sadie Hawkins Dance of Business Transformation

Are Applications the La Brea Tar Pits for Data?

Why does the sun never set on legacy applications?

The Partly Cloudy CIO

The IT Pendulum and the Federated Future of IT


730 Days and 264 Blog Posts Later . . .

 

If you are having trouble viewing this video, then you can watch it on Vimeo by clicking on this link: OCDQ on Vimeo

 

Thank You

Thank you for reading my many musings on data quality and its related disciplines, and for tolerating my various references, from Adventures in Data Profiling to Social Karma, Shakespeare to Dr. Seuss, The Pirates of Penzance to The Rolling Stones, from The Three Musketeers to The Three Tweets, Dante Alighieri to Dumb and Dumber, Jack Bauer to Captain Jack Sparrow, Finding Data Quality to Discovering What Data Quality Technology Wants, and from Schrödinger’s Cat to the Buttered Cat.

Thank you for reading Obsessive-Compulsive Data Quality for the last two years.  Your readership is deeply appreciated.

 

Related Posts

OCDQ Blog Bicentennial – The 200th OCDQ Blog Post

Commendable Comments (Part 5) – The 100th OCDQ Blog Post

Do you have obsessive-compulsive data quality? – The First OCDQ Blog Post

Twitter, Data Governance, and a #ButteredCat #FollowFriday

I have previously blogged in defense of Twitter, the pithy platform for social networking that I use perhaps a bit too frequently, and about which many people argue is incompatible with meaningful communication (Twitter that is, not me—hopefully).

Whether it is a regularly scheduled meeting of the minds, like the Data Knights Tweet Jam, or simply a spontaneous supply of trenchant thoughts, Twitter quite often facilitates discussions that deliver practical knowledge or thought-provoking theories.

However, occasionally the discussions center around more curious concepts, such as a paradox involving a buttered cat, which thankfully Steve Sarsfield, Mark Horseman, and Daragh O Brien can help me attempt to explain (remember I said attempt):

So, basically . . . successful data governance is all about Buttered Cats, Breaded CxOs, and Beer-Battered Data Quality Managers working together to deliver Bettered Data to the organization . . . yeah, that all sounded perfectly understandable to me.

But just in case you don’t have your secret decoder ring, let’s decipher the message (remember: “Be sure to drink your Ovaltine”):

  • Buttered Cats – metaphor for combining the top-down and bottom-up approaches to data governance
  • Breaded CxOs – metaphor for executive sponsors, especially ones providing bread (i.e., funding, not lunch—maybe both)
  • Beer-Battered Data Quality Managers – metaphor (and possibly also a recipe) for data stewardship
  • Bettered Data – metaphor for the corporate asset thingy that data governance helps you manage

(For more slightly less cryptic information, check out my previous post/poll: Data Governance and the Buttered Cat Paradox)

 

#FollowFriday Recommendations

Today is #FollowFriday, the day when Twitter users recommend other users you should follow, so here are some great tweeps for mostly non-buttered-cat tweets about Data Quality, Data Governance, Master Data Management, and Business Intelligence:

(Please Note: This is by no means a comprehensive list, it is in no particular order whatsoever, and no offense is intended to any of my tweeps not listed below.  I hope that everyone has a great #FollowFriday and an even greater weekend.)

 

Related Posts

#FollowFriday Spotlight: @PhilSimon

#FollowFriday Spotlight: @hlsdk

#FollowFriday Spotlight: @DataQualityPro

#FollowFriday and The Three Tweets

Dilbert, Data Quality, Rabbits, and #FollowFriday

Twitter, Meaningful Conversations, and #FollowFriday

The Fellowship of #FollowFriday

The Wisdom of the Social Media Crowd

Social Karma (Part 7) – Twitter

Data Governance and the Buttered Cat Paradox

One of the most common questions about data governance is:

What is the best way to approach it—top-down or bottom-up?

The top-down approach emphasizes executive sponsorship and the role of the data governance board.

The bottom-up approach emphasizes data stewardship and the role of peer-level data governance change agents.

This debate reminds me of the buttered cat paradox (shown to the left as illustrated by Greg Williams), which is a thought experiment combining the two common adages: “cats always land on their feet” and “buttered toast always lands buttered side down.”

In other words, if you strapped buttered toast (butter side up) on the back of a cat and then dropped it from a high height (Please Note: this is only a thought experiment, so no cats or toast are harmed), presumably the very laws of physics would be suspended, leaving our fearless feline of the buttered-toast-paratrooper brigade hovering forever in midair, spinning in perpetual motion, as both the buttered side of the toast and the cat’s feet attempt to land on the ground.

It appears that the question of either a top-down or a bottom-up approach with data governance poses a similar paradox.

Data governance will require executive sponsorship and a data governance board for the top-down-driven activities of funding, policy making and enforcement, decision rights, and arbitration of conflicting business priorities as well as organizational politics.

However, data governance will also require data stewards and other grass roots advocates for the bottom-up-driven activities of policy implementation, data remediation, and process optimization, all led by the example of peer-level change agents adopting the organization’s new best practices for data quality management, business process management, and technology management.

Therefore, recognizing the eventual need for aspects of both a top-down and a bottom-up approach with data governance can leave an organization at a loss to understand where to begin, hovering forever in mid-decision, spinning in perpetual thought, unable to land a first footfall on their data governance journey—and afraid of falling flat on the buttered side of their toast.

Although data governance is not a thought experiment, planning and designing your data governance program does require thought, and perhaps some experimentation, in order to discover what will work best for your organization’s corporate culture.

What do you think is the best way to approach data governance? Please feel free to post a comment below and explain your vote or simply share your opinions and experiences.

Thaler’s Apples and Data Quality Oranges

In the opening chapter of his book Carrots and Sticks, Ian Ayres recounts the story of Thaler’s Apples:

“The behavioral revolution in economics began in 1981 when Richard Thaler published a seven-page letter in a somewhat obscure economics journal, which posed a pretty simple choice about apples.

Which would you prefer:

(A) One apple in one year, or

(B) Two apples in one year plus one day?

This is a strange hypothetical—why would you have to wait a year to receive an apple?  But choosing is not very difficult; most people would choose to wait an extra day to double the size of their gift.

Thaler went on, however, to pose a second apple choice.

Which would you prefer:

(C) One apple today, or

(D) Two apples tomorrow?

What’s interesting is that many people give a different, seemingly inconsistent answer to this second question.  Many of the same people who are patient when asked to consider this choice a year in advance turn around and become impatient when the choice has immediate consequences—they prefer C over D.

What was revolutionary about his apple example is that it illustrated the plausibility of what behavioral economists call ‘time-inconsistent’ preferences.  Richard was centrally interested in the people who chose both B and C.  These people, who preferred two apples in the future but one apple today, flipped their preferences as the delivery date got closer.”

What does this have to do with data quality?  Give me a moment to finish eating my second apple, and then I will explain . . .

 

Data Quality Oranges

Let’s imagine that an orange represents a unit of measurement for data quality, somewhat analogous to data accuracy, such that the more data quality oranges you have, the better the quality of data is for your needs—let’s say for making a business decision.

Which would you prefer:

(A) One data quality orange in one month, or

(B) Two data quality oranges in one month plus one day?

(Please Note: Due to the strange uncertainties of fruit-based mathematics, two data quality oranges do not necessarily equate to a doubling of data accuracy, but two data quality oranges are certainly an improvement over one data quality orange).

Now, of course, on those rare occasions when you can afford to wait a month or so before making a critical business decision, most people would choose to wait an extra day in order to improve their data quality before making their data-driven decision.

However, let’s imagine you are feeling squeezed by a more pressing business decision—now which would you prefer:

(C) One data quality orange today, or

(D) Two data quality oranges tomorrow?

In my experience with data quality and business intelligence, most people prefer B over A—and C over D.

This “time-inconsistent” data quality preference within business intelligence reflects the reality that with the speed at which things change these days, more real-time business decisions are required—perhaps making speed more important than quality.

In a recent Data Knights Tweet Jam, Mark Lorion pondered speed versus quality within business intelligence, asking: “Is it better to be perfect in 30 days or 70% today?  Good enough may often be good enough.”

To which Henrik Liliendahl Sørensen responded with the perfectly pithy wisdom: “Good, Fast, Decision—Pick any two.”

However, Steve Dine cautioned that speed versus quality is decision dependent: “70% is good when deciding how many pencils to order, but maybe not for a one billion dollar acquisition.”

Mark’s follow-up captured the speed versus quality tradeoff succinctly with “Good Now versus Great Later.”  And Henrik added the excellent cautionary note: “Good decision now, great decision too late—especially if data quality is not a mature discipline.”

 

What Say You?

How many data quality oranges do you think it takes?  Or for those who prefer a less fruitful phrasing, where do you stand on the speed versus quality debate?  How good does data quality have to be in order to make a good data-driven business decision?

 

Related Posts

To Our Data Perfectionists

DQ-Tip: “There is no such thing as data accuracy...”

DQ-Tip: “Data quality is primarily about context not accuracy...”

Data Quality and the Cupertino Effect

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Data In, Decision Out

The Data-Decision Symphony

Data!

You Can’t Always Get the Data You Want

Data Qualia

In philosophy (according to Wikipedia), the term qualia is used to describe the subjective quality of conscious experience.

Examples of qualia are the pain of a headache, the taste of wine, or the redness of an evening sky.  As Daniel Dennett explains:

“Qualia is an unfamiliar term for something that could not be more familiar to each of us:

The ways things seem to us.”

Like truth, beauty, and singing ability, data quality is in the eye of the beholder; or, since data quality is most commonly defined as fitness for the purpose of use, we could say that data quality is in the eye of the user.

However, most data has both multiple uses and multiple users.  Data of sufficient quality for one use or one user may not be of sufficient quality for other uses and other users.  Quite often these diverse data needs and divergent data quality perspectives make it a daunting challenge to provide meaningful data quality metrics to the organization.

Recently on the Data Roundtable, Dylan Jones of Data Quality Pro discussed the need to create data quality reports that matter, explaining that if you’re relying on canned data profiling reports (i.e., column statistics and data quality metrics at an attribute, table, and system level), then you are measuring data quality in isolation of how the business is performing.

Instead, data quality metrics must measure data qualia—the subjective quality of the user’s business experience with data:

“Data Qualia is an unfamiliar term for something that must become more familiar to the organization:

The ways data quality impacts business performance.”
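
As a purely illustrative sketch, and not anything from Dylan’s post, here is one way to contrast a canned column-level metric with a data qualia metric: the same column statistics are rolled up into per-business-process scores, weighted by how much each process depends on each attribute.  The field names, processes, and weights are assumptions.

```python
# A hedged sketch contrasting a canned column-level completeness metric with
# one weighted by the business processes that actually use the data.
# The field names, processes, and weights below are illustrative assumptions.
records = [
    {"email": "a@example.com", "phone": None,       "postcode": "90210"},
    {"email": None,            "phone": "555-0100", "postcode": None},
    {"email": "c@example.com", "phone": "555-0101", "postcode": "10001"},
]

def completeness(field: str) -> float:
    """Canned metric: share of records where the field is populated."""
    return sum(1 for r in records if r[field] is not None) / len(records)

# Business-aligned view: each process weights the fields it depends on,
# so the same column statistics roll up into a per-process "qualia" score.
process_dependencies = {
    "email_campaign":   {"email": 1.0},
    "delivery_routing": {"postcode": 0.7, "phone": 0.3},
}

for process, weights in process_dependencies.items():
    score = sum(w * completeness(f) for f, w in weights.items())
    print(f"{process}: weighted data quality score = {score:.2f}")
```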

Related Posts

The Point of View Paradox

DQ-BE: Single Version of the Time

Single Version of the Truth

Beyond a “Single Version of the Truth”

The Idea of Order in Data

Hell is other people’s data

DQ-BE: Data Quality Airlines

DQ-Tip: “There is no such thing as data accuracy...”

Data Quality and the Cupertino Effect

DQ-Tip: “Data quality is primarily about context not accuracy...”

Data Confabulation in Business Intelligence

Jarrett Goldfedder recently asked the excellent question: When does Data become Too Much Information (TMI)?

We now live in a 24-hours-a-day, 7-days-a-week, 365-days-a-year worldwide whirlwind of constant information flow, where the very air we breathe is literally teeming with digital data streams—continually inundating us with new information.

The challenge is that our time is a zero-sum game, meaning that for every new information source we choose, others are excluded.

There’s no way to acquire all available information.  And even if we somehow could, due to the limitations of human memory, we often don’t remember much of the new information we do acquire.  In my blog post Mind the Gap, I wrote about the need to coordinate our acquisition of new information with its timely and practical application.

So I definitely agree with Jarrett that finding the right amount of information appropriate for the moment is the needed (and far from easy) solution.  Since this is indeed the age of the data deluge and TMI, I fear that data-driven decision making may simply become intuition-driven decision making, validated after the fact by selectively choosing the data that supports the decision already made.  The human mind is already exceptionally good at doing this—the term for it in psychology is confabulation.

Although, according to Wikipedia, the term can be used to describe neurological or psychological dysfunction, Jonathan Haidt explained in his book The Happiness Hypothesis that confabulation is frequently used by “normal” people as well.  For example, after buying my new smart phone, I chose to read only the positive online reviews about it, trying to make myself feel more confident that I had made the right decision—and more capable of justifying my decision beyond saying I bought the phone that looked “cool.”

 

Data Confabulation in Business Intelligence

Data confabulation in business intelligence occurs when intuition-driven business decisions are claimed to be data-driven and justified after the fact using the results of selective post-decision data analysis.  This is even worse than when confirmation bias causes intuition-driven business decisions, which are justified using the results of selective pre-decision data analysis that only confirms preconceptions or favored hypotheses, resulting in potentially bad—albeit data-driven—business decisions.

My fear is that the data deluge will actually increase the use of both of these business decision-making “techniques” because they are much easier than, as Jarrett recommended, trying to make sense of the business world by gathering and sorting through as much data as possible, deriving patterns from the chaos and developing clear-cut, data-driven, data-justifiable business decisions.

But the data deluge generally broadcasts more noise than signal, and sometimes trying to get better data to make better decisions simply means getting more data, which often only delays or confuses the decision-making process, or causes analysis paralysis.

Can we somehow listen for decision-making insights among the cacophony of chaotic and constantly increasing data volumes?

I fear that the information overload of the data deluge is going to trigger an intuition override of data-driven decision making.

 

Related Posts

The Reptilian Anti-Data Brain

Data In, Decision Out

The Data-Decision Symphony

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

DQ-View: From Data to Decision

TDWI World Conference Orlando 2010

Hell is other people’s data

Mind the Gap

The Fragility of Knowledge

Alternatives to Enterprise Data Quality Tools

The recent analysis by Andy Bitterer of Gartner Research (and ANALYSTerical) about the acquisition of the open source data quality tool DataCleaner by the enterprise data quality vendor Human Inference prompted the following Twitter conversation:

Since enterprise data quality tools can be cost-prohibitive, more prospective customers are exploring free and/or open source alternatives, such as the Talend Open Profiler, licensed under the open source General Public License, or non-open source, but entirely free alternatives, such as the Ataccama DQ Analyzer.  And, as Andy noted in his analysis, both of these tools offer an easy transition to the vendors’ full-fledged commercial data quality tools, offering more than just data profiling functionality.

As Henrik Liliendahl Sørensen explained, in his blog post Data Quality Tools Revealed, data profiling is the technically easiest part of data quality, which explains the tool diversity, and early adoption of free and/or open source alternatives.

And there are also other non-open source alternatives that are more affordable than enterprise data quality tools, such as Datamartist, which combines data profiling and data migration capabilities into an easy-to-use desktop application.
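
For readers unfamiliar with what basic data profiling covers, here is a minimal sketch of the column-level statistics such tools automate (null counts, distinct values, length ranges, and character-pattern frequencies).  The sample data is purely illustrative, and real profilers do far more than this.

```python
# A minimal sketch of the kind of column profiling data quality tools automate;
# the sample data is illustrative only.
import re
from collections import Counter

def pattern_mask(value: str) -> str:
    """Generalize a value into a pattern, e.g. '90210-A' -> '99999-X'."""
    masked = re.sub(r"[A-Za-z]", "X", value)
    return re.sub(r"\d", "9", masked)

def profile_column(name: str, values: list) -> dict:
    populated = [v for v in values if v not in (None, "")]
    lengths = [len(str(v)) for v in populated]
    return {
        "column": name,
        "rows": len(values),
        "nulls": len(values) - len(populated),
        "distinct": len(set(populated)),
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
        "top_patterns": Counter(pattern_mask(str(v)) for v in populated).most_common(3),
    }

# Usage with an illustrative column of postal codes:
postcodes = ["90210", "10001", None, "B3J 2K9", "10001", ""]
print(profile_column("postcode", postcodes))
```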

My point is neither to discourage the purchase of enterprise data quality tools, nor to promote their alternatives—and this blog post is certainly not an endorsement (paid or otherwise) of the alternative data quality tools I have mentioned simply as examples.

My point is that many new technology innovations originate from small entrepreneurial ventures, which tend to be specialists with a narrow focus that can provide a great source of rapid innovation.  This is in contrast to the data management industry trend of innovation via acquisition and consolidation, embedding data quality technology within data management platforms that also provide data integration and master data management (MDM) functionality, allowing the mega-vendors to offer end-to-end solutions and the convenience of one-vendor information technology shopping.

However, most software licenses for these enterprise data management platforms start in the six figures.  On top of the licensing, you have to add the annual maintenance fees, which are usually in the five figures.  Add to the total cost of the solution the professional services needed for training and consulting for installation, configuration, application development, testing, and production implementation—and you have another six-figure annual investment.

Debates about free and/or open source software usually focus on the robustness of functionality and the intellectual property of source code.  However, from my perspective, I think that the real reason more prospective customers are exploring these alternatives to enterprise data quality tools is because of the free aspect—but not because of the open source aspect.

In other words—and once again I am only using it as an example—I might download Talend Open Profiler because I wanted data profiling functionality at an affordable price—but not because I wanted the opportunity to customize its source code.

I believe the “try it before you buy it” aspect of free and/or open source software is what’s important to prospective customers.

Therefore, enterprise data quality vendors, instead of acquiring an open source tool as Human Inference did with DataCleaner, how about offering a free (with limited functionality) or trial version of your enterprise data quality tool as an alternative option?

 

Related Posts

Do you believe in Magic (Quadrants)?

Can Enterprise-Class Solutions Ever Deliver ROI?

Which came first, the Data Quality Tool or the Business Need?

Selling the Business Benefits of Data Quality

What Data Quality Technology Wants

Has Data Become a Four-Letter Word?

In her excellent blog post 'The Bad Data Ate My Homework' and Other IT Scapegoating, Loraine Lawson explained how “there are a lot of problems that can be blamed on bad data.  I suspect it would be fair to say that there’s a good percentage of problems we don’t even know about that can be blamed on bad data and a lack of data integration, quality and governance.”

Lawson examined whether bad data could have been the cause of the bank foreclosure fiasco, as opposed to what she concludes are the more realistic causes: bad business and negligence, which, if not addressed, could lead to another global financial crisis.

“Bad data,” Lawson explained, “might be the most ubiquitous excuse since ‘the dog ate my homework.’  But while most of us would laugh at the idea of blaming the dog for missing homework, when someone blames the data, we all nod our heads in sympathy, because we all know how troublesome computers are.  And then the buck gets (unfairly) passed to IT.”

Unfairly blaming IT, or technology in general, when poor data quality negatively impacts business performance ignores the organization’s collective ownership of its problems and its shared responsibility for the solutions to those problems.  It also causes, as Lawson explained in Data’s Conundrum: Everybody Wants Control, Nobody Wants Responsibility, an “unresolved conflict on both the business and the IT side over data ownership and its related issues, from stewardship to governance.”

In organizations suffering from this unresolved conflict between IT and the Business—a dysfunctional divide also known as the IT-Business Chasm—bad data becomes the default scapegoat used by both sides.

Perhaps, in a strange way, placing the blame on bad data is progress when compared with the historical notions of data denial, when an organization’s default was to claim that it had no data quality issues whatsoever.

However, admitting not only that bad data exists, but also that it has a tangible negative impact on business performance, doesn’t seem to have motivated organizations to take action.  Instead, many appear to prefer practicing bad data blamestorming, where the Business blames bad data on IT and its technology, and IT blames bad data on the Business and its business processes.

Or perhaps, by default, everyone just claims that “the bad data ate my homework.”

Are your efforts to convince executive management that data needs to be treated like a five-letter word (“asset”) being undermined by the fact that data has become a four-letter word in your organization?

 

Related Posts

The Business versus IT—Tear down this wall!

Quality and Governance are Beyond the Data

Data In, Decision Out

The Data-Decision Symphony

The Reptilian Anti-Data Brain

Hell is other people’s data

Promoting Poor Data Quality

Who Framed Data Entry?

Data, data everywhere, but where is data quality?

The Circle of Quality

Commendable Comments (Part 9)

Today is February 14 — Valentine’s Day — the annual celebration of enduring romance, where true love is publicly judged according to your willingness to purchase chocolate, roses, and extremely expensive jewelry, and privately judged in ways that nobody (and please, trust me when I say nobody) wants to see you post on Twitter, Facebook, Flickr, YouTube, or your blog.

This is the ninth entry in my ongoing series for expressing my true love to my readers for their truly commendable comments on my blog posts.  Receiving comments is the most rewarding aspect of my blogging experience.  Although I love all of my readers, I love my commenting readers most of all.

 

Commendable Comments

On Data Quality Industry: Problem Solvers or Enablers?, Henrik Liliendahl Sørensen commented:

“I sometimes compare our profession with that of dentists.  Dentists are also believed to advocate for good habits around your teeth, but are making money when these good habits aren’t followed.

So when 4 out of 5 dentists recommend a certain toothpaste, it is probably no good :-)

Seriously though, I take the amount of money spent on data quality tools as a sign that organizations believe there are issues best solved with technology.  Of course these tools aren’t magic.

Data quality tools only solve a certain part of your data and information related challenges.  On the other hand, the few problems they do solve may be solved very well and cannot be solved by any other line of products or in any practical way by humans in any quantity or quality.”

On Data Quality Industry: Problem Solvers or Enablers?, Jarrett Goldfedder commented:

“I think that the expectations of clients from their data quality vendors have grown tremendously over the past few years.  This is, of course, in line with most everything in the Web 2.0 cloud world that has become point-and-click, on-demand response.

In the olden days of 2002, I remember clients asking for vendors to adjust data only to the point where dashboard statistics could be presented on a clean Java user interface.  I have noticed that some clients today want the software to not just run customizable reports, but to extract any form of data from any type of database, to perform advanced ETL and calculations with minimal user effort, and to be easy to use.  It’s almost like telling your dentist to fix your crooked teeth with no anesthesia, no braces, no pain, during a single office visit.

Of course, the reality today does not match the expectation, but data quality vendors and architects may need to step up their game to remain cutting edge.”

On Data Quality is not an Act, it is a Habit, Rob Paller commented:

“This immediately reminded me of the practice of Kaizen in the manufacturing industry.  The idea being that continued small improvements yield large improvements in productivity when compounded.

For years now, many of the thought leaders have preached that projects from business intelligence to data quality to MDM to data governance, and so on, start small and that by starting small and focused, they will yield larger benefits when all of the small projects are compounded.

But the one thing that I have not seen it tied back to is the successes that were found in the leaders of the various industries that have adopted the Kaizen philosophy.

Data quality practitioners need to recognize that their success lies in the fundamentals of Kaizen: quality, effort, participation, willingness to change, and communication. The fundamentals put people and process before technology.  In other words, technology may help eliminate the problem, but it is the people and process that allow that elimination to occur.”

On Data Quality is not an Act, it is a Habit, Dylan Jones commented:

“Subtle but immensely important because implementing a coordinated series of small, easily trained habits can add up to a comprehensive data quality program.

In my first data quality role we identified about ten core habits that everyone on the team should adopt and the results were astounding.  No need for big programs, expensive technology, change management and endless communication, just simple, achievable habits that importantly were focused on the workers.

To make habits work they need the WIIFM (What’s In It For Me) factor.”

On Darth Data, Rob Drysdale commented:

“Interesting concept about using data for the wrong purpose.  I think that data, if it is the ‘true’ data can be used for any business decision as long as it is interpreted the right way.

One problem is that data may have a margin of error associated with it and this must be understood in order to properly use it to make decisions.  Another issue is that the underlying definitions may be different.

For example, an organization may use the term ‘customer’ when it means different things.  The marketing department may have a list of ‘customers’ that includes leads and prospects, but the operational department may only call them ‘customers’ when they are generating revenue.

Each department’s data and interpretation of it is correct for their own purpose, but you cannot mix the data or use it in the ‘other’ department to make decisions.

If all the data is correct, the definitions and the rules around capturing it are fully understood, then you should be able to use it to make any business decision.

But when it gets misinterpreted and twisted to suit some business decision that it may not be suited for, then you are crossing over to the Dark Side.”

On Data Governance and the Social Enterprise, Jacqueline Roberts commented:

“My continuous struggle is the chaos of data electronically submitted by many, many sources, different levels of quality and many different formats while maintaining the history of classification, correction, language translation, where used, and a multitude of other ‘data transactions’ to translate this data into usable information for multi-business use and reporting.  This is my definition of Master Data Management.

I chuckled at the description of the ‘rigid business processes’ and I added ‘software products’ to the concept, since the software industry must understand the fluidity of the change of data to address the challenges of Master Data Management, Data Governance, and Data Cleansing.”

On Data Governance and the Social Enterprise, Frank Harland commented: 

“I read: ‘Collaboration is the key to business success. This essential collaboration has to be based on people, and not on rigid business processes . . .’

And I think: Collaboration is the key to any success.  This must have been true since the time man hunted the Mammoth.  When collaborating, it went a lot better to catch the bugger.

And I agree that the collaboration has to be based on people, and not on rigid business processes.  That is as opposed to based on rigid people, and not on flexible business processes. All the truths are in the adjectives.

I don’t mean to bash, Jim, I think there is a lot of truth here and you point to the exact relationship between collaboration as a requirement and Data Governance as a prerequisite.  It’s just me getting a little tired of Gartner saying things of the sort that ‘in order to achieve success, people should work together. . .’

I have a word in mind that starts with ‘du’ and ends with ‘h’ :-)”

On Quality and Governance are Beyond the Data, Milan Kučera commented:

“Quality is a result of people’s work, their responsibility, improvement initiatives, etc.  I think it is more about the company culture and its possible regulation by government.  It is the most complicated to set-up a ‘new’ (information quality) culture, because of its influence on every single employee.  It is about well balanced information value chain and quality processes at every ‘gemba’ where information is created.

Confidence in the information is necessary because we make many decisions based on it.  Sometimes we do better or worse than before.  We should store/use as much accurate information as possible.

All stewardship or governance frameworks should help companies with the change of its culture, define quality measures (the most important is accuracy), cost of poor quality system (allowing them to monitor impacts of poor quality information), and other necessary things.  Only at this moment would we be able to trust corporate information and make decisions.

A small remark on technology only.  Data quality technology is a good tool for helping you to analyze ‘technical’ quality of data – patterns, business rules, frequencies, NULL or Not NULL values, etc.  Many technology companies narrow information quality into an area of massive cleansing (scrap/rework) activities.  They can correct some errors but everything in general leads to a higher validity, but not information accuracy.  If cleansing is implemented as a regular part of the ETL processes then the company institutionalizes massive correction, which is only a cost adding activity and I am sure it is not the right place to change data contents – we increase data inconsistency within information systems.

Every quality management system (for example TQM, TIQM, Six Sigma, Kaizen) focuses on improvement at the place where errors occur – gemba.  All those systems require: leaders, measures, trained people, and simply – adequate culture.

Technology can be a good assistant (helper), but a bad master.”

On Can Data Quality avoid the Dustbin of History?, Vish Agashe commented:

“In a sense, I would say that the current definitions and approaches of/towards data quality might very well not be able to avoid the Dustbin of History.

In the world of phones and PDAs, quality of information about environments, current fashions/trends, locations and current moods of the customer might be more important than a single view of customer or de-duped customers.  The pace at which consumer’s habits are changing, it might be the quality of information about the environment in which the transaction is likely to happen that will be more important than the quality of the post transaction data itself . . . Just a thought.”

On Does your organization have a Calumet Culture?, Garnie Bolling commented:

“So true, so true, so true.

I see this a lot.  Great projects or initiatives start off, collaboration is expected across organizations, and there is initial interest, big meetings / events to jump start the Calumet.  Now what, when the events no longer happen, funding to fly everyone to the same city to bond, share, explore together dries up.

Here is what we have seen work. After the initial kick off, have small events, focus groups, and let the Calumet grow organically. Sometimes after a big powwow, folks assume others are taking care of the communication / collaboration, but with a small venue, it slowly grows.

Success breeds success and folks want to be part of that, so when the focus group achieves, the growth happens.  This cycle is then repeated, hopefully.

While it is important for folks to come together at the kick off to see the big picture, it is the small rolling waves of success that will pick up momentum, and people will want to join the effort to collaborate versus waiting for others to pick up the ball and run.

Thanks for posting, good topic.  Now where is my small focus group? :-)”

You Are Awesome

Thank you very much for sharing your perspectives with our collablogaunity.  This entry in the series highlighted the commendable comments received on OCDQ Blog posts published in October, November, and December of 2010.

Since there have been so many commendable comments, please don’t be offended if one of your comments wasn’t featured.

Please keep on commenting and stay tuned for future entries in the series.

By the way, even if you have never posted a comment on my blog, you are still awesome — feel free to tell everyone I said so.

 

Related Posts

Commendable Comments (Part 8)

Commendable Comments (Part 7)

Commendable Comments (Part 6)

Commendable Comments (Part 5)

Commendable Comments (Part 4)

Commendable Comments (Part 3)

Commendable Comments (Part 2)

Commendable Comments (Part 1)

Spartan Data Quality

My recent Twitter conversation with Dylan Jones, Henrik Liliendahl Sørensen, and Daragh O Brien was sparked by the blog post Case study with Data blogs, from 300 to 1000, which included a list of the top 500 data blogs ranked by influence.

Data Quality Pro was ranked #57, Liliendahl on Data Quality was ranked #87, The DOBlog was a glaring omission, and I was proud OCDQ Blog was ranked #33 – at least until, being the data quality geeks we are, we noticed that it was also ranked #165.

In other words, there was an ironic data quality issue—a data quality blog was listed twice (i.e., a duplicate record in the list)!

Hilarity ensued, including some epic photo shopping by Daragh, leading, quite inevitably, to the writing of this Data Quality Tale, which is obviously loosely based on the epic movie 300—and perhaps also the epically terrible comedy Meet the Spartans.  Enjoy!

 

Spartan Data Quality

In 1989, an alliance of Data Geeks, led by the Spartans, an unrivaled group of data quality warriors, battled against an invading data deluge in the mountain data center of Thermopylae, caused by the complexities of the Greco-Persian Corporate Merger.

Although they were vastly outnumbered, the Data Geeks overcame epic data quality challenges in one of the most famous enterprise data management initiatives in history—The Data Integration of Thermopylae.

This is their story.

Leonidas, leader of the Spartans, espoused an enterprise data management approach known as Spartan Data Quality, defined by its ethos of collaboration amongst business, data, and technology experts, collectively and affectionately known as Data Geeks.

Therefore, Leonidas was chosen as the Thermopylae Project Lead.  However, Xerxes, the new Greco-Persian CIO, believed that the data integration project was pointless, Spartan Data Quality was a fool’s errand, and the technology-only Persian approach, known as Magic Beans, should be implemented instead.  Xerxes saw the Thermopylae project as an unnecessary sacrifice.

“There will be no glory in your sacrifice,” explained Xerxes.  “I will erase even the memory of Sparta from the database log files!  Every bit and byte of Data Geek tablespace shall be purged.  Every data quality historian and every data blogger shall have their Ethernet cables pulled out, and their network connections cut from the Greco-Persian mainframe.  Why, uttering the very name of Sparta, or Leonidas, will be punishable by employee termination!  The corporate world will never know you existed at all!”

“The corporate world will know,” replied Leonidas, “that Data Geeks stood against a data deluge, that few stood against many, and before this battle was over, a CIO blinded by technology saw what it truly takes to manage data as a corporate asset.”

Addressing his small army of 300 Data Geeks, Leonidas declared: “Gather round!  No retreat, no surrender.  That is Spartan law.  And by Spartan law we will stand and fight.  And together, united by our collaboration, our communication, our transparency, and our trust in each other, we shall overcome this challenge.”

“A new Information Age has begun.  An age of data-driven business decisions, an age of data-empowered consumers, an age of a world connected by a web of linked data.  And all will know, that 300 Data Geeks gave their last breath to defend it!”

“But there will be so many data defects, they will blot out the sun!” exclaimed Xerxes.

“Then we will fight poor data quality in the shade,” Leonidas replied, with a sly smile.

“This is madness!” Xerxes nervously responded as the new servers came on-line in the data center of Thermopylae.

“Madness?  No,” Leonidas calmly said as the first wave of the data deluge descended upon them.  “THIS . . . IS . . . DATA !!!”

 

Related Posts

Pirates of the Computer: The Curse of the Poor Data Quality

Video: Oh, the Data You’ll Show!

The Quest for the Golden Copy (Part 1)

The Quest for the Golden Copy (Part 2)

The Quest for the Golden Copy (Part 3)

The Quest for the Golden Copy (Part 4)

‘Twas Two Weeks Before Christmas

My Own Private Data

The Tell-Tale Data

Data Quality is People!