Got Data Quality?

I have written many blog posts about how it’s neither a realistic nor a required data management goal to achieve data perfection, i.e., 100% data quality or zero defects.

Of course, this admonition logically invites the questions:

If achieving 100% data quality isn’t the goal, then what is?

99%?

98%?

As I was pondering these questions while grocery shopping, I walked down the dairy aisle, casually perusing the wide variety of milk options, when the thought occurred to me that data quality issues have a lot in common with the fat content of milk.

The classification of the percentage of fat (more specifically butterfat) in milk varies slightly by country.  In the United States, whole milk is approximately 3.25% fat, whereas reduced fat milk is 2% fat, low fat milk is 1% fat, and skim milk is 0.5% fat.

Reducing the total amount of fat (especially saturated and trans fat) is a common recommendation for a healthy diet.  Likewise, reducing the total amount of defects (i.e., data quality issues) is a common recommendation for a healthy data management strategy.  However, just like it would be unhealthy to remove all of the fat from your diet (because some fatty acids are essential nutrients that can’t be derived from other sources), it would be unhealthy to attempt to remove all of the defects from your data.

So maybe your organization is currently drinking whole data (i.e., 3.25% defects or 96.75% data quality) and needs to consider switching to reduced defect data (i.e., 2% defects or 98% data quality), low defect data (i.e., 1% defects or 99% data quality), or possibly even skim data (i.e., 0.5% defects or 99.5% data quality).

No matter what your perspective is regarding the appropriate data quality goal for your organization, at the very least, I think that we can all agree that all of our enterprise data management initiatives have to ask the question: “Got Quality?”

 

Related Posts

The Dichotomy Paradox, Data Quality and Zero Defects

The Asymptote of Data Quality

To Our Data Perfectionists

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

Thaler’s Apples and Data Quality Oranges

Data Quality and The Middle Way

Missed It By That Much

The Data Quality Goldilocks Zone

You Can’t Always Get the Data You Want

Data Quality Practices—Activate!

This is a screen capture of the results of last month’s unscientific poll about proactive data quality versus reactive data quality alongside one of my favorite (this is the third post I’ve used it in) graphics of the Wonder Twins (Zan and Jayna) with Gleek.

Although reactive (15 combined votes) easily defeated proactive (6 combined votes) in the poll, proactive versus reactive is one debate that will likely never end.  However, the debate makes it seem as if we are forced to choose one approach over the other.

Generally speaking, most recommended data quality practices advocate implementing proactive defect prevention and avoiding reactive data cleansing.  But as Graham Rhind commented, data quality is neither exclusively proactive nor exclusively reactive.

“And if you need proof, start looking at the data,” Graham explained.  “For example, gender.  To produce quality data, a gender must be collected and assigned proactively, i.e., at the data collection stage.  Gender coding reactively on the basis of, for example, name, only works correctly and with certainty in a certain percentage of cases (that percentage always being less than 100).  Reactive data quality in that case can never be the best practice because it can never produce the best data quality, and, depending on what you do with your data, can be very damaging.”

“On the other hand,” Graham continued, “the real world to which the data is referring changes.  People move, change names, grow old, die.  Postal code systems and telephone number systems change.  Place names change, countries come and go.  In all of those cases, a reactive process is the one that will improve data quality.”

“Data quality is a continuous process,” Graham concluded.  From his perspective, a realistic data quality practice advocates being “proactive as much as possible, and reactive to keep up with a dynamic world.  Works for me, and has done well for decades.”

I agree with Graham because, just like any complex problem, data quality has no fast and easy solution.  In my experience, a hybrid discipline is always required, combining proactive and reactive approaches into one continuous data quality practice.
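
For example, here is a minimal sketch of what that hybrid discipline might look like in code, using Graham’s gender example.  The record structure and the tiny name-to-gender lookup are hypothetical and purely illustrative, not a production-grade solution.

from typing import Optional

# Illustrative lookup only; real name-based inference covers far less than 100%
# of cases, which is the limitation Graham describes for reactive gender coding.
NAME_TO_GENDER = {"alice": "F", "mary": "F", "john": "M", "tommy": "M"}

def collect_gender_proactively(name: str, gender: Optional[str]) -> dict:
    """Proactive defect prevention: require a valid gender at the data collection stage."""
    if gender not in ("F", "M"):
        raise ValueError("Gender must be captured at collection time")
    return {"name": name, "gender": gender}

def repair_gender_reactively(record: dict) -> dict:
    """Reactive cleansing: infer a missing gender from the first name, when possible."""
    if record.get("gender") not in ("F", "M"):
        first_name = record["name"].split()[0].lower()
        record["gender"] = NAME_TO_GENDER.get(first_name)  # may remain None: certainty < 100%
    return record

if __name__ == "__main__":
    print(collect_gender_proactively("Alice Hardy", "F"))
    print(repair_gender_reactively({"name": "Pat Smith", "gender": None}))  # gender stays None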

Or as Zan (representing Proactive) and Jayna (representing Reactive) would say: “Data Quality Practices—Activate!”

And as Gleek would remind us: “The best data quality practices remain continuously active.”

 

Related Posts

How active is your data quality practice?

The Data Quality Wager

The Dichotomy Paradox, Data Quality and Zero Defects

Retroactive Data Quality

A Tale of Two Q’s

What going to the dentist taught me about data quality

Groundhog Data Quality Day

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

What Data Quality Technology Wants

MacGyver: Data Governance and Duct Tape

To Our Data Perfectionists

Finding Data Quality

Data Quality and #FollowFriday the 13th

As Alice Hardy arrived at her desk at Crystal Lake Insurance, it seemed like a normal Friday morning.  Her thoughts about her weekend camping trip were interrupted by an eerie sound emanating from one of the adjacent cubicles:

Da da da, ta ta ta.  Da da da, ta ta ta.

“What’s that sound?” Alice wondered out loud.

“Sorry, am I typing too loud again?” responded Tommy Jarvis from another adjacent cubicle.  “Can you come take a look at something for me?”

“Sure, I’ll be right over,” Alice replied as she quickly circumnavigated their cluster of cubicles, puzzled and unsettled to find the other desks unoccupied with their computers turned off, wondering, to herself this time, where did that eerie sound come from?  Where are the other data counselors today?

“What’s up?” she casually asked upon entering Tommy’s cubicle, trying, as always, to conceal her discomfort about being alone in the office with the one colleague that always gave her the creeps.  Visiting his cubicle required a constant vigilance in order to avoid making prolonged eye contact, not only with Tommy Jarvis, but also with the horrifying hockey mask hanging above his computer screen like some possessed demon spawn from a horror movie.

“I’m analyzing the Date of Death in the life insurance database,” Tommy explained.  “And I’m receiving really strange results.  First of all, there are no NULLs, which indicates all of our policyholders are dead, right?  And if that wasn’t weird enough, there are only 12 unique values: January 13, 1978, February 13, 1981, March 13, 1987, April 13, 1990, May 13, 2011, June 13, 1997, July 13, 2001, August 13, 1971, September 13, 2002, October 13, 2006, November 13, 2009, and December 13, 1985.”

“That is strange,” said Alice.  “All of our policyholders can’t be dead.  And why is Date of Death always the 13th of the month?”

“It’s not just always the 13th of the month,” Tommy responded, almost cheerily.  “It’s always a Friday the 13th.”

“Well,” Alice slowly, and nervously, replied.  “I have a life insurance policy with Crystal Lake Insurance.  Pull up my policy.”

After a few quick, loud, pounding keystrokes, Tommy ominously read aloud the results now displaying on his computer screen, just below the hockey mask that Alice could swear was staring at her.  “Date of Death: May 13, 2011 . . . Wait, isn’t that today?”

Da da da, ta ta ta.  Da da da, ta ta ta.

“Did you hear that?” asked Alice.  “Hear what?” responded Tommy with a devilish grin.

“Never mind,” replied Alice quickly while trying to focus her attention on only the computer screen.  “Are you sure you pulled up the right policy?  I don’t recognize the name of the Primary Beneficiary . . . Who the hell is Jason Voorhees?”

“How the hell could you not know who Jason Voorhees is?” asked Tommy, with anger sharply crackling throughout his words.  “Jason Voorhees is now rightfully the sole beneficiary of every life insurance policy ever issued by Crystal Lake Insurance.”

Da da da, ta ta ta.  Da da da, ta ta ta.

“What?  That’s impossible!” Alice screamed.  “This has to be some kind of sick data quality joke.”

“It’s a data quality masterpiece!” Tommy retorted with rage.  “I just finished implementing my data machete, er I mean, my data matching solution.  From now on, Crystal Lake Insurance will never experience another data quality issue.”

“There’s just one last thing that I need to take care of.”

Da da da, ta ta ta.  Da da da, ta ta ta.

“And what’s that?” Alice asked, smiling nervously while quickly backing away into the hallway—and preparing to run for her life.

Da da da, ta ta ta.  Da da da, ta ta ta.

“Real-world alignment,” replied Tommy.  Rising to his feet, he put on the hockey mask, and pulled an actual machete out of the bottom drawer of his desk.  “Your Date of Death is entered as May 13, 2011.  Therefore, I must ensure real-world alignment.”

Da da da, ta ta ta.  Da da da, ta ta ta.  Da da da, ta ta ta.  Da da da, ta ta ta.  Data Quality.

The End.

(Note — You can also listen to the OCDQ Radio Theater production of this DQ-Tale in the Scary Calendar Effects episode.)

#FollowFriday Recommendations

#FollowFriday is when Twitter users recommend other users you should follow, so here are some great tweeps who provide tweets mostly about Data Quality, Data Governance, Master Data Management, Business Intelligence, and Big Data Analytics:

(Please Note: This is by no means a comprehensive list, is listed in no particular order whatsoever, and no offense is intended to any of my tweeps not listed below.  I hope that everyone has a great #FollowFriday and an even greater weekend.)

Are Applications the La Brea Tar Pits for Data?

This blog post is sponsored by the Enterprise CIO Forum and HP.

In a previous post, I explained that application modernization must become the information technology (IT) prime directive in order for IT departments to satisfy the speed and agility business requirements of their organizations.  An excellent point raised in the comments of that post was that continued access to legacy data is often a business driver for not sunsetting legacy applications.

“I find many legacy applications are kept alive in read-only mode, i.e., purely for occasional query/reporting purposes,” explained Beth Breidenbach.  “Stated differently, the end users often just want to be able to look at the legacy data from time to time.”

Gordon Hamilton commented that data is often stuck in the “La Brea Tar Pits of legacy” applications.  Even when the data is migrated during the implementation of a new application (its new tar pit, so to speak), the legacy data, as Breidenbach said, is often still accessed via the legacy application.  As Hamilton noted, this can be dangerous because the legacy data diverges from the version migrated to the new application (i.e., after migration, the legacy data could be updated, or possibly deleted).

The actual La Brea Tar Pits were often covered with water, causing animals that came to drink to fall in and get stuck in the tar, thus preserving their fossils for centuries—much to the delight of future paleontologists and natural history museum enthusiasts.

Although they are often cited as the bane of data management, most data silos are actually application silos because, historically, data and applications have been so tightly coupled.  Data is often covered with an application layer, causing users who enter, access, and use the data to get stuck with the functionality provided by its application, thus preserving their use of the application even after it has become outdated (i.e., legacy)—much to the dismay of IT departments and emerging technology enthusiasts.

When so tightly coupled with data, applications—not just legacy applications—truly can be the La Brea Tar Pits for data, since once data needed to support business activities gets stuck in an application, that application will stick around for a very long time.

If applications and data were not so tightly coupled, we could both modernize our applications and optimize our data usage in order to better satisfy the speed and agility business requirements of our organizations.  Therefore, not only should we sunset our legacy applications, we should also approach data management with the mindset of decoupling our data from its applications.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

A Sadie Hawkins Dance of Business Transformation

Why does the sun never set on legacy applications?

The Partly Cloudy CIO

The IT Pendulum and the Federated Future of IT

Suburban Flight, Technology Sprawl, and Garage IT

DQ-BE: Invitation to Duplication

Data Quality By Example (DQ-BE) is an OCDQ regular segment that provides examples of data quality key concepts.

I recently received my invitation to the Data Governance and Information Quality Conference, which will be held June 27-30 in San Diego, California at the Catamaran Resort Hotel and Spa.  Well, as shown above, I actually received both of my invitations.

Although my postal address is complete, accurate, and exactly the same on both of the invitations, my name is slightly different (“James” vs. “Jim”), and my title (“Data Quality Journalist” vs. “Blogger-in-Chief”) and company (“IAIDQ” vs. “OCDQ Blog”) are both completely different.  I wonder how many of the data quality software vendors sponsoring this conference would consider my invitations to be duplicates.  (Maybe I’ll use the invitations to perform a vendor evaluation on the exhibit floor.)
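
For illustration, here is a minimal sketch of how a simple matching engine might score my two invitations.  The field weights, the placeholder postal address, and the use of string similarity from the Python standard library (difflib) are all my own assumptions; real data matching software is far more sophisticated.

from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Simple string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec1: dict, rec2: dict, weights: dict) -> float:
    """Weighted average of per-field similarities."""
    total = sum(weights.values())
    return sum(similarity(rec1[f], rec2[f]) * w for f, w in weights.items()) / total

invite1 = {"name": "James Harris", "title": "Data Quality Journalist",
           "company": "IAIDQ", "address": "123 Example Street, Boston, MA"}
invite2 = {"name": "Jim Harris", "title": "Blogger-in-Chief",
           "company": "OCDQ Blog", "address": "123 Example Street, Boston, MA"}

# Hypothetical weights: address agreement matters most, then name, then the rest.
weights = {"address": 0.5, "name": 0.3, "title": 0.1, "company": 0.1}
print(f"Match score: {match_score(invite1, invite2, weights):.2f}")
# Whether this counts as a duplicate depends entirely on the vendor's threshold.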

So it would seem that even “The Premier Event in Data Governance and Data Quality” can experience data quality problems.

No worries, I doubt the invitation system will be one of the “Practical Approaches and Success Stories” presented—unless it’s used as a practical approach to a success story about demonstrating how embarrassing it might be to send duplicate invitations to a data quality journalist and blogger-in-chief.  (I wonder if this blog post will affect the approval of my Press Pass for the event.)

 

Okay, on a far more serious note, you should really consider attending this event.  As the conference agenda shows, there will be great keynote presentations, case studies, tutorials, and other sessions conducted by experts in data governance and data quality, including (among many others) Larry English, Danette McGilvray, Mike Ferguson, David Loshin, and Thomas Redman.

 

Related Posts

DQ-BE: Dear Valued Customer

Customer Incognita

Identifying Duplicate Customers

Adventures in Data Profiling (Part 7) – Customer Name

The Quest for the Golden Copy (Part 3) – Defining “Customer”

‘Tis the Season for Data Quality

The Seven Year Glitch

DQ-IRL (Data Quality in Real Life)

Data Quality, 50023

Once Upon a Time in the Data

The Semantic Future of MDM

The Dichotomy Paradox, Data Quality and Zero Defects

As Joseph Mazur explains in Zeno’s Paradox, the ancient Greek philosopher Zeno constructed a series of logical paradoxes to prove that motion is impossible, which today remain on the cutting edge of our investigations into the fabric of space and time.

One of the paradoxes is known as the Dichotomy:

“A moving object will never reach any given point, because however near it may be, it must always first accomplish a halfway stage, and then the halfway stage of what is left and so on, and this series has no end.  Therefore, the object can never reach the end of any given distance.”

Of course, this paradox sounds silly.  After all, a given point like the finish line in a race is reachable in real life, since people win races all the time.  However, in theory, the mathematics is maddeningly sound, since it creates an infinite series of steps between the starting point and the finish line—and an infinite number of steps creates a journey that can never end.

Furthermore, this theoretical race cannot even begin, since the recursive nature of the paradox proves that we would never reach the point of completing even the first step.  Hence, the paradoxical conclusion is that any travel over any finite distance can neither be completed nor begun, and so all motion must be an illusion.  Some of the greatest minds in history (from Galileo to Einstein to Stephen Hawking) have tackled the Dichotomy Paradox—without being able to disprove it.

Data Quality and Zero Defects

The given point that many enterprise initiatives attempt to reach with data quality is 100% on a metric such as data accuracy.  Leaving aside (in this post) the fact that any data quality metric without a tangible business context provides no business value, 100% data quality (aka Zero Defects) is an unreachable destination—no matter how close you get or how long you try to reach it.

Zero Defects is a laudable goal—but its theory and practice come from manufacturing quality.  However, I have always been of the opinion, unpopular among some of my peers, that manufacturing quality and data quality are very different disciplines, and although there is much to be learned from studying the theories of manufacturing quality, I believe that brute-forcing those theories onto data quality is impractical and fundamentally flawed (and I’ve even said so in verse: To Our Data Perfectionists).

The given point that enterprise initiatives should actually be attempting to reach is data-driven solutions for business problems.

Advocates of Zero Defects argue that, in theory, defect-free data should be fit to serve as the basis for every possible business use, enabling a data-driven solution for any business problem.  However, in practice, business uses for data, as well as business itself, are always evolving.  Therefore, business problems are dynamic problems that do not have—nor do they require—perfect solutions.

Although the Dichotomy Paradox proves motion is theoretically impossible, our physical motion practically proves otherwise.  Has your data quality practice become motionless by trying to prove that Zero Defects is more than just theoretically possible?

Why does the sun never set on legacy applications?

This blog post is sponsored by the Enterprise CIO Forum and HP.

Most information technology (IT) departments split their time between implementing new and maintaining existing technology.  On the software side, most of the focus is on applications supporting business processes.  A new application is usually intended to replace an outdated (i.e., legacy) application.  However, both will typically run in production until the new application is proven, after which the legacy application is obsolesced (i.e., no longer used) and sunset (i.e., no longer maintained, usually removed).

At least, in theory, that is how it is all supposed to work.  However, in practice, legacy applications are rarely sunset.  Why?

The simple reason most legacy applications do not go gentle into that good night, but instead rage against the dying of their light, is that some users continue using some of the functionality provided by the legacy application to support daily business activities.

Two of the biggest contributors to this IT conundrum are speed and agility—the most common business drivers for implementing new technology.  Historically, IT has spent a significant part of its implementation time setting up application environments (i.e., development, test, and parallel production).  For traditional enterprise solutions, this meant on-site hardware configuration, followed by software acquisition, installation, configuration, and customization.  Therefore, a significant amount of time, effort, and money had to be expended before the development of the new application could even begin.

Legacy applications are typically the antithesis of agility, but they are usually good at doing what they have always done.  Although this doesn’t satisfy all of the constantly evolving business needs of the organization, it is often easier to use the legacy application to support the less dynamic business needs and to focus new application development on new requirements.

Additionally, the consumerization of IT and the technology trifecta of Cloud, SaaS, and Mobility have both helped and hindered the legacy logjam, since a common characteristic of this new breed of off-premise applications is quickly providing only the features that users currently need.  This is often in stark contrast to new on-premise applications that, although feature-rich, often remain user-poor because of the slower time to implement—again failing the business requirements of speed and agility.

Therefore, although legacy applications are used less and less, since they continue to support some business needs, they are never sunset.  This feature fracture (i.e., technology supporting business needs being splintered across new and legacy applications) often leaves IT departments overburdened with maintaining a lot of technology that is not being used all that much.

The white paper The Mandate to Modernize Aging Applications is a good resource for information about this complex challenge and explains why application modernization must become the enterprise’s information technology prime directive.

IT cannot enable the enterprise’s future if it is still stuck supporting its past.  If the sun never sets on legacy applications, then a foreboding darkness may fall, and the number of successful new days dawning for the organization may quickly dwindle.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

A Sadie Hawkins Dance of Business Transformation

Are Applications the La Brea Tar Pits for Data?

The Partly Cloudy CIO

The IT Pendulum and the Federated Future of IT

Suburban Flight, Technology Sprawl, and Garage IT

DQ-View: Talking about Data

Data Quality (DQ) View is an OCDQ regular segment.  Each DQ-View is a brief video discussion of a data quality key concept.

 

If you are having trouble viewing this video, then you can watch it on Vimeo by clicking on this link: DQ-View on Vimeo

 

Related Posts

DQ-View: The Poor Data Quality Blizzard

DQ-View: New Data Resolutions

DQ-View: From Data to Decision

DQ View: Achieving Data Quality Happiness

Data Quality is not a Magic Trick

DQ-View: The Cassandra Effect

DQ-View: Is Data Quality the Sun?

DQ-View: Designated Asker of Stupid Questions

Video: Oh, the Data You’ll Show!

The Data Governance Oratorio

Boston Symphony Orchestra

An oratorio is a large musical composition collectively performed by an orchestra of musicians and a choir of singers, all of whom accept a shared responsibility for the quality of their performance, while individual performers also accept accountability for playing their own musical instruments or singing their own lines, including the occasional instrumental or lyrical solo.

During a well-executed oratorio, individual mastery combines with group collaboration, creating a true symphony, a sounding together, which produces a more powerful performance than even the most consummate solo artist could deliver on their own.

 

The Data Governance Oratorio

Ownership, Responsibility, and Accountability comprise the core movements of the Data Governance ORA-torio.

Data is a corporate asset collectively owned by the entire enterprise.  Data governance is a cross-functional, enterprise-wide initiative requiring that everyone, regardless of their primary role or job function, accept a shared responsibility for preventing data quality issues, and for responding appropriately to mitigate the associated business risks when issues do occur.  However, individuals must still be held accountable for the specific data, business process, and technology aspects of data governance.

Data governance provides the framework for the communication and collaboration of business, data, and technical stakeholders, establishes an enterprise-wide understanding of the roles and responsibilities involved, and defines the accountability required to support the organization’s business activities and materialize the value of the enterprise’s data as positive business impacts.

Collective ownership, shared responsibility, and individual accountability combine to create a true enterprise-wide symphony, a sounding together by the organization’s people, who, when empowered by high quality data and enabled by technology, can optimize business processes for superior corporate performance.

Is your organization collectively performing the Data Governance Oratorio?

 

Related Posts

Data Governance and the Buttered Cat Paradox

Beware the Data Governance Ides of March

Zig-Zag-Diagonal Data Governance

A Tale of Two G’s

The People Platform

The Collaborative Culture of Data Governance

Connect Four and Data Governance

The Business versus IT—Tear down this wall!

The Road of Collaboration

Collaboration isn’t Brain Surgery

Shared Responsibility

The Role Of Data Quality Monitoring In Data Governance

Quality and Governance are Beyond the Data

Data Transcendentalism

Podcast: Data Governance is Mission Possible

Video: Declaration of Data Governance

Don’t Do Less Bad; Do Better Good

Jack Bauer and Enforcing Data Governance Policies

The Prince of Data Governance

MacGyver: Data Governance and Duct Tape

The Diffusion of Data Governance

How active is your data quality practice?

My recent blog post The Data Quality Wager received a provocative comment from Richard Ordowich that sparked another round of discussion and debate about proactive data quality versus reactive data quality in the LinkedIn Group for the IAIDQ.

“Data quality is a reactive practice,” explained Ordowich.  “Perhaps that is not what is professed in the musings of others or the desired outcome, but it is nevertheless the current state of the best practices.  Data profiling and data cleansing are after the fact data quality practices.  The data is already defective.  Proactive defect prevention requires a greater discipline and changes to organizational behavior that is not part of the current best practices.  This I suggest is wishful thinking at this point in time.”

“How can data quality practices,” C. Lwanga Yonke responded, “that do not include proactive defect prevention (with the required discipline and changes to organizational behavior) be considered best practices?  Seems to me a data quality program must include these proactive activities to be considered a best practice.  And from what I see, there are many such programs out there.  True, they are not the majority—but they do exist.”

After Ordowich requested real examples of proactive data quality practices, Jayson Alayay commented “I have implemented data quality using statistical process control techniques where expected volumes and ratios are predicted using forecasting models that self-adjust using historical trends.  We receive an alert when significant deviations from forecast are detected.  One of our overarching data quality goals is to detect a significant data issue as soon as it becomes detectable in the system.”

“It is possible,” replied Ordowich, “to estimate the probability of data errors in data sets based on the currency (freshness) and usage of the data.  The problem is this process does not identify the specific instances of errors just the probability that an error may exist in the data set.  These techniques only identify trends not specific instances of errors.  These techniques do not predict the probability of a single instance data error that can wreak havoc.  For example, the ratings of mortgages was a systemic problem, which data quality did not address.  Yet the consequences were far and wide.  Also these techniques do not predict systemic quality problems related to business policies and processes.  As a result, their direct impact on the business is limited.”

“For as long as human hands key in data,” responded Alayay, “a data quality implementation to a great extent will be reactive.  Improving data quality not only pertains to detection of defects, but also enhancement of content, e.g., address standardization, geocoding, application of rules and assumptions to replace missing values, etc.  With so many factors in play, a real life example of a proactive data quality implementation that suits what you’re asking for may be hard to pinpoint.  My opinion is that the implementation of ‘comprehensive’ data quality programs can have big rewards and big risks.  One big risk is that it can slow time-to-market and kill innovation because otherwise talented people would be spending a significant amount of their time complying with rules and standards in the name of improving data quality.”

“When an organization embarks on a new project,” replied Ordowich, “at what point in the conversation is data quality discussed?  How many marketing plans, new product development plans, or even software development plans have you seen include data quality?  Data quality is not even an afterthought in most organizations, it is ignored.  Data quality is not even in the vocabulary until a problem occurs.  Data quality is not part of the culture or behaviors within most organizations.”
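
To make Alayay’s statistical process control example more concrete, here is a minimal sketch of that kind of check, assuming hypothetical daily record volumes.  The “forecast” is just a rolling mean with control limits at plus or minus three standard deviations, a deliberately simplified stand-in for the self-adjusting forecasting models he describes.

from statistics import mean, stdev

def volume_alert(history: list, observed: int, sigmas: float = 3.0) -> bool:
    """Return True when the observed volume falls outside the control limits
    derived from the historical trend (the rolling window keeps the limits
    self-adjusting as new data arrives)."""
    window = history[-30:]                     # rolling 30-day window
    center, spread = mean(window), stdev(window)
    lower, upper = center - sigmas * spread, center + sigmas * spread
    return not (lower <= observed <= upper)

if __name__ == "__main__":
    daily_volumes = [1000, 1020, 985, 1010, 995, 1005, 990, 1015, 1008, 997]
    print(volume_alert(daily_volumes, 1012))   # False: within the expected range
    print(volume_alert(daily_volumes, 1600))   # True: significant deviation, raise an alert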

 

 

Please feel free to post a comment below and explain your vote or simply share your opinions and experiences.

 

Related Posts

A Tale of Two Q’s

What going to the dentist taught me about data quality

Groundhog Data Quality Day

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

What Data Quality Technology Wants

MacGyver: Data Governance and Duct Tape

To Our Data Perfectionists

Finding Data Quality

Retroactive Data Quality

The Partly Cloudy CIO

This blog post is sponsored by the Enterprise CIO Forum and HP.

The increasing frequency with which the word cloud is mentioned during information technology (IT) discussions has some people believing that CIO now officially stands for Cloud Information Officer.  At the very least, CIOs are being frequently asked about their organization’s cloud strategy.  However, as John Dodge has blogged, when it comes to the cloud, many organizations and industries still appear to be somewhere between FUD (fear, uncertainty, and doubt) and HEF (hype, enthusiasm, and fright).

 

Information Technology’s Hierarchy of Needs


Joel Dobbs describes IT’s hierarchy of needs (conceptually similar to Abraham Maslow’s hierarchy of needs) as a pyramid with basic operational needs at the bottom, short-term tactical needs in the middle, and long-term strategic needs at the top.

Dobbs explains that cloud computing (and outsourcing) is an option that organizations should consider for some of their basic operational and short-term tactical needs in order to free up internal IT resources to support the strategic goals of the enterprise.

Since cost, simplicity, speed, and agility are common business drivers for cloud-based solutions, this approach alleviates some of the bottlenecks caused by rigid, centralized IT departments, allowing them to cede control over less business-critical applications and to focus on servicing the unique needs of the organization that require their internal expertise and in-house oversight.

 

The Partly Cloudy CIO

The white paper Get Your Head in the Cloud is a good resource for information about the various cloud options facing CIOs and distinguishes among the choices based on specific business requirements.  The emphasis should always be on technology-aware, business-driven IT solutions, which means selecting the technology option that best satisfies a particular business need — and data security is one obvious example of the technology awareness needed during the evaluation of cloud-based solutions.

Although no one is advising an all-or-nothing cloud strategy, cloud computing is becoming a critical component of IT Delivery.  With hybrid solutions becoming more common, the forecast for the foreseeable future is calling for the Partly Cloudy CIO.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

A Sadie Hawkins Dance of Business Transformation

Are Applications the La Brea Tar Pits for Data?

Why does the sun never set on legacy applications?

The IT Pendulum and the Federated Future of IT

Suburban Flight, Technology Sprawl, and Garage IT

The Data Quality Wager

Gordon Hamilton emailed me with an excellent recommended topic for a data quality blog post:

“It always seems crazy to me that few executives base their ‘corporate wagers’ on the statistical research touted by data quality authors such as Tom Redman, Jack Olson and Larry English that shows that 15-45% of the operating expense of virtually all organizations is WASTED due to data quality issues.

So, if every organization is leaving 15-45% on the table each year, why don’t they do something about it?  Philip Crosby says that quality is free, so why do the executives allow the waste to go on and on and on?  It seems that if the shareholders actually think about the Data Quality Wager they might wonder why their executives are wasting their shares’ value.  A large portion of that 15-45% could all go to the bottom line without a capital investment.

I’m maybe sounding a little vitriolic because I’ve been re-reading Deming’s Out of the Crisis and he has a low regard for North American industry because they won’t move beyond their short-term goals to build a quality organization, let alone implement Deming’s 14 principles or Larry English’s paraphrasing of them in a data quality context.”

The Data Quality Wager

Gordon Hamilton explained in his email that his reference to the Data Quality Wager was an allusion to Pascal’s Wager, but what follows is my rendering of it in a data quality context (i.e., if you don’t like what follows, please yell at me, not Gordon).

Although I agree with Gordon, I also acknowledge that convincing your organization to invest in data quality initiatives can be a hard sell.  A common mistake is not framing the investment in data quality initiatives using business language such as mitigated risks, reduced costs, or increased revenue.  I also acknowledge the reality of the fiscal calendar effect and how most initiatives increase short-term costs based on the long-term potential of eventually mitigating risks, reducing costs, or increasing revenue.

Short-term increased costs of a data quality initiative can include the purchase of data quality software and its maintenance fees, as well as the professional services needed for training and consulting for installation, configuration, application development, testing, and production implementation.  And there are often additional short-term increased costs, both external and internal.

Please note that I am talking about the costs of proactively investing in a data quality initiative before any data quality issues have manifested that would prompt reactively investing in a data cleansing project.  Although the short-term increased costs are the same either way, I am simply acknowledging the reality that it is always easier for a reactive project to get funding than it is for a proactive program—and this is obviously not only true for data quality initiatives.

Therefore, the organization has to evaluate the possible outcomes of proactively investing in data quality initiatives while also considering the possible existence of data quality issues (i.e., the existence of tangible business-impacting data quality issues):

  1. Invest in data quality initiatives + Data quality issues exist = Decreased risks and (eventually) decreased costs

  2. Invest in data quality initiatives + Data quality issues do not exist = Only increased costs — No ROI

  3. Do not invest in data quality initiatives + Data quality issues exist = Increased risks and (eventually) increased costs

  4. Do not invest in data quality initiatives + Data quality issues do not exist = No increased costs and no increased risks

Data quality professionals, vendors, and industry analysts all strongly advocate #1 — and all strongly criticize #3.  (Additionally, since we believe data quality issues exist, most “orthodox” data quality folks generally refuse to even acknowledge #2 and #4.)

Unfortunately, when advocating #1, we often don’t effectively sell the business benefits of data quality, and when criticizing #3, we often focus too much on the negative aspects of not investing in data quality.

Only #4 “guarantees” neither increased costs nor increased risks by gambling on not investing in data quality initiatives based on the belief that data quality issues do not exist—and, by default, this is how many organizations make the Data Quality Wager.
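
Since the wager is essentially a bet placed under uncertainty, here is a minimal sketch of it as an expected-cost calculation.  The probabilities and dollar figures are entirely hypothetical; the point is the structure of the bet, not the numbers.

def expected_cost(p_issues_exist: float, invest: bool,
                  cost_of_initiative: float, cost_of_unmanaged_issues: float,
                  residual_issue_cost: float) -> float:
    """Expected cost of the chosen strategy, given the chance that
    business-impacting data quality issues exist."""
    if invest:
        # Outcomes 1 and 2: you always pay for the initiative, and issues that
        # do exist are (eventually) reduced to a residual business impact.
        return cost_of_initiative + p_issues_exist * residual_issue_cost
    # Outcomes 3 and 4: no initiative cost, but existing issues go unmanaged.
    return p_issues_exist * cost_of_unmanaged_issues

if __name__ == "__main__":
    # Hypothetical figures: 80% chance issues exist, a $500K initiative,
    # $2M of unmanaged business impact, $250K residual impact after investing.
    print(expected_cost(0.8, invest=True,  cost_of_initiative=500_000,
                        cost_of_unmanaged_issues=2_000_000, residual_issue_cost=250_000))
    print(expected_cost(0.8, invest=False, cost_of_initiative=500_000,
                        cost_of_unmanaged_issues=2_000_000, residual_issue_cost=250_000))
    # Under these assumed numbers, investing costs less in expectation.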

How is your organization making the Data Quality Wager?

Zig-Zag-Diagonal Data Governance

This is a screen capture of the results of last month’s unscientific poll about the best way to approach data governance.  Data governance requires executive sponsorship and a data governance board for the top-down-driven activities of funding, policy making and enforcement, decision rights, and arbitration of conflicting business priorities as well as organizational politics.  But it also requires data stewards and other grass-roots advocates for the bottom-up-driven activities of policy implementation, data remediation, and process optimization, all led by the example of peer-level change agents adopting the organization’s new best practices for data quality management, business process management, and technology management.

Hybrid Approach (starting Top-Down) won by a slim margin, but overall the need for a hybrid approach to data governance was the prevailing consensus opinion, with the only real debate being whether to start data governance top-down or bottom-up.

 

Commendable Comments

Rob Drysdale commented: “Too many companies get paralyzed thinking about how to do this and implement it. (Along with the overwhelmed feeling that it is too much time/effort/money to fix it.)  But I think your poll needs another option to vote on, specifically: ‘Whatever works for the company/culture/organization’ since not all solutions will work for every organization.  In some where it is highly structured, rigid and controlled, there wouldn’t be the freedom at the grass-roots level to start something like this and it might be frowned upon by upper-level management.  In other organizations that foster grass-roots things then it could work.  However, no matter which way you can get it started and working, you need to have buy-in and commitment at all levels to keep it going and make it effective.”

Paul Fulton commented: “I definitely agree that it needs to be a combination of both.  Data Governance at a senior level making key decisions to provide air cover and Data Management at the grass-roots level actually making things happen.”

Jill Wanless commented: “Our organization has taken the Hybrid Approach (starting Bottom-Up) and it works well for two reasons: (1) the worker bee rock stars are all aligned and ready to hit the ground running, and (2) the ‘Top’ can sit back and let the ‘aligned’ worker bees get on with it.  Of course, this approach is sometimes (painfully) slow, but with the ground-level rock stars already aligned, there is less resistance implementing the policies, and the Top’s heavy hand is needed much less frequently, but I voted for Hybrid Approach (starting Top-Down) because I have less than stellar patience for the long and scenic route.”

 

Zig-Zag-Diagonal Data Governance

I definitely agree with Rob’s well-articulated points that corporate culture is the most significant variable with data governance since it determines whether starting top-down or bottom-up is the best approach for a particular organization—and no matter which way you get started, you eventually need buy-in and commitment at all levels to keep it going and make it effective.

I voted for Hybrid Approach (starting Bottom-Up) since I have seen more data governance programs get successfully started because of the key factor of grass-roots alignment minimizing resistance to policy implementation, as Jill’s comment described.

And, of course, I agree with Paul’s remark that eventually data governance will require a combination of both top-down and bottom-up aspects.  At certain times during the evolution of a data governance program, top-down aspects will be emphasized, and at other times, bottom-up aspects will be emphasized.  However, it is unlikely that any long-term success can be sustained by relying exclusively on either a top-down-only or a bottom-up-only approach to data governance.

Let’s stop debating top-down versus bottom-up data governance—and start embracing Zig-Zag-Diagonal Data Governance.

 

Data Governance “Next Practices”

Phil Simon and I co-host and co-produce the wildly popular podcast Knights of the Data Roundtable, a bi-weekly data management podcast sponsored by the good folks at DataFlux, a SAS Company.

On Episode 5, our special guest, best-practice expert, and all-around industry thought leader Jill Dyché discussed her excellent framework for data governance “next practices” called The 5 + 2 Model.

 

Related Posts

Beware the Data Governance Ides of March

Data Governance and the Buttered Cat Paradox

Twitter, Data Governance, and a #ButteredCat #FollowFriday

A Tale of Two G’s

The Collaborative Culture of Data Governance

Connect Four and Data Governance

Quality and Governance are Beyond the Data

Podcast: Data Governance is Mission Possible

Video: Declaration of Data Governance

Don’t Do Less Bad; Do Better Good

Jack Bauer and Enforcing Data Governance Policies

The Prince of Data Governance

MacGyver: Data Governance and Duct Tape