Data Governance Frameworks are like Jigsaw Puzzles


In a recent interview, Jill Dyché addressed a common misconception by explaining that a data governance framework is not a strategy.  “Unlike other strategic initiatives that involve IT,” Jill explained, “data governance needs to be designed.  The cultural factors, the workflow factors, the organizational structure, the ownership, the political factors, all need to be accounted for when you are designing a data governance roadmap.”

“People need a mental model, that is why everybody loves frameworks,” Jill continued.  “But they are not enough and I think the mistake that people make is that once they see a framework, rather than understanding its relevance to their organization, they will just adapt it and plaster it up on the whiteboard and show executives without any kind of context.  So they are already defeating the purpose of data governance, which is to make it work within the context of your business problems, not just have some kind of mental model that everybody can agree on, but is not really the basis for execution.”

“So it’s a really, really dangerous trend,” Jill cautioned, “that we see where people equate strategy with framework because strategy is really a series of collected actions that result in some execution — and that is exactly what data governance is.”

And in her excellent article Data Governance Next Practices: The 5 + 2 Model, Jill explained that data governance requires a deliberate design so that the entire organization can buy into a realistic execution plan, not just a sound bite.  As usual, I agree with Jill, since, in my experience, many people expect a data governance framework to provide eureka-like moments of insight.

In The Myths of Innovation, Scott Berkun debunked the myth of the eureka moment using the metaphor of a jigsaw puzzle.

“When you put the last piece into place, is there anything special about that last piece or what you were wearing when you put it in?” Berkun asked.  “The only reason that last piece is significant is because of the other pieces you’d already put into place.  If you jumbled up the pieces a second time, any one of them could turn out to be the last, magical piece.”

“The magic feeling at the moment of insight, when the last piece falls into place,” Berkun explained, “is the reward for many hours (or years) of investment coming together.  In comparison to the simple action of fitting the puzzle piece into place, we feel the larger collective payoff of hundreds of pieces’ worth of work.”

Perhaps the myth of the data governance framework could also be debunked using the metaphor of a jigsaw puzzle.

Data governance requires the coordination of a myriad of factors, including executive sponsorship, funding, decision rights, arbitration of conflicting priorities, policy definition, policy implementation, data quality remediation, data stewardship, business process optimization, technology enablement, change management — and many other puzzle pieces.

How could a data governance framework possibly predict how you will assemble the puzzle pieces?  Or how the puzzle pieces will fit together within your unique corporate culture?  Or which of the many aspects of data governance will turn out to be the last (or even the first) piece of the puzzle to fall into place in your organization?  And, of course, there is truly no last piece of the puzzle, since data governance is an ongoing program because the business world constantly gets jumbled up by change.

So, data governance frameworks are useful, but only if you realize that data governance frameworks are like jigsaw puzzles.

Data Quality and Miracle Exceptions

“Reading superhero comic books with the benefit of a Ph.D. in physics,” James Kakalios explained in The Physics of Superheroes, “I have found many examples of the correct description and application of physics concepts.  Of course, the use of superpowers themselves involves direct violations of the known laws of physics, requiring a deliberate and willful suspension of disbelief.”

“However, many comics need only a single miracle exception — one extraordinary thing you have to buy into — and the rest that follows as the hero and the villain square off would be consistent with the principles of science.”

“Data Quality is all about . . .”

It is essential to foster a marketplace of ideas about data quality in which a diversity of viewpoints is freely shared without bias, where everyone is invited to get involved in discussions and debates and have an opportunity to hear what others have to offer.

However, one of my biggest pet peeves about the data quality industry is that when I listen to analysts, vendors, consultants, and other practitioners discuss data quality challenges, I am often required to make a miracle exception for data quality.  In other words, I am given one extraordinary thing I have to buy into in order to be willing to buy their solution to all of my data quality problems.

These superhero comic book style stories usually open with a miracle exception telling me that “data quality is all about . . .”

Sometimes, the miracle exception is purchasing technology from the right magic quadrant.  Other times, the miracle exception is either following a comprehensive framework, or following the right methodology from the right expert within the right discipline (e.g., data modeling, business process management, information quality management, agile development, data governance, etc.).

But I am especially irritated by individuals who bash vendors for selling allegedly reactive-only data cleansing tools, while selling their own allegedly proactive-only defect prevention methodology, as if we could avoid cleaning up the existing data quality issues, or could shut down and restart our organizations so that, before another single datum is created or business activity is executed, everyone could learn how to “do things the right way” so that “the data will always be entered right, the first time, every time.”

Although these and other miracle exceptions do correctly describe the application of data quality concepts in isolation, by doing so, they also oversimplify the multifaceted complexity of data quality, requiring a deliberate and willful suspension of disbelief.

Miracle exceptions certainly make for more entertaining stories and more effective sales pitches, but oversimplifying complexity for the purposes of explaining your approach, or, even worse and sadly more common, preaching at people that your approach definitively solves their data quality problems, is nothing less than applying the principle of deus ex machina to data quality.

Data Quality and deus ex machina

Deus ex machina is a plot device whereby a seemingly unsolvable problem is suddenly and abruptly solved with the contrived and unexpected intervention of some new event, character, ability, or object.

This technique is often used in the marketing of data quality software and services, where the problem of poor data quality can seemingly be solved by a new event (e.g., creating a data governance council), a new character (e.g., hiring an expert consultant), a new ability (e.g., aligning data quality metrics with business insight), or a new object (e.g., purchasing a new data quality tool).

Now, don’t get me wrong.  I do believe various technologies and methodologies from numerous disciplines, as well as several core principles (e.g., communication, collaboration, and change management) are all important variables in the data quality equation, but I don’t believe that any particular variable can be taken in isolation and deified as the God Particle of data quality physics.

Data Quality is Not about One Extraordinary Thing

Data quality isn’t all about technology, nor is it all about methodology.  And data quality isn’t all about data cleansing, nor is it all about defect prevention.  Data quality is not about only one thing — no matter how extraordinary any one of its things may seem.

Battling the dark forces of poor data quality doesn’t require any superpowers, but it does require doing the hard daily work of continuously improving your data quality.  Data quality does not have a miracle exception, so please stop believing in one.

And for the love of high-quality data everywhere, please stop trying to sell us one.

The People Platform

Platforms are popular in enterprise data management.  Most of the time, the term is used to describe a technology platform, an integrated suite of tools that enables the organization to manage its data in support of its business processes.

Other times the term is used to describe a methodology platform, an integrated set of best practices that enables the organization to manage its data as a corporate asset in order to achieve superior business performance.

Data governance is an example of a methodology platform, where one of its central concepts is the definition, implementation, and enforcement of policies, which govern the interactions between business processes, data, technology, and people.

But many rightfully lament the misleading term “data governance” because it appears to put the emphasis on data, arguing that since business needs come first in every organization, data governance should be formalized as a business process, and therefore mature organizations should view data governance as business process management.

However, successful enterprise data management is about much more than data, business processes, or enabling technology.

Business process management, data quality management, and technology management are all people-driven activities because people empowered by high quality data, enabled by technology, optimize business processes for superior business performance.

Data governance policies illustrate the intersection of business, data, and technical knowledge, which is spread throughout the enterprise, transcending any artificial boundaries imposed by an organizational chart, where different departments or different business functions appear as if they were independent of the rest of the organization.

Data governance policies reveal how truly interconnected and interdependent the organization is, and how everything that happens within the organization happens as a result of the interactions occurring among its people.

Michael Fauscette defines people-centricity as “our current social and business progression past the industrial society’s focus on business, technology, and process.  Not that business or technology or process go away, but instead they become supporting structures that facilitate new ways of collaborating and interacting with customers, suppliers, partners, and employees.”

In short, Fauscette believes people are becoming the new enterprise platform—and not just for data management.

I agree, but I would argue that people have always been—and always will be—the only successful enterprise platform.

 

Related Posts

The Collaborative Culture of Data Governance

Data Governance and the Social Enterprise

Connect Four and Data Governance

What Data Quality Technology Wants

Data and Process Transparency

The Business versus IT—Tear down this wall!

Collaboration isn’t Brain Surgery

Trust is not a checklist

Quality and Governance are Beyond the Data

Data Transcendentalism

Podcast: Data Governance is Mission Possible

Video: Declaration of Data Governance

Scrum Screwed Up

The inaugural cartoon on Implementing Scrum by Michael Vizdos and Tony Clark does a great job of illustrating the fable of The Chicken and the Pig, which is used to describe the two types of roles involved in Scrum.  Scrum, quite rare for our industry, is not an acronym, but one common approach among many iterative, incremental frameworks for agile software development.

Scrum is also sometimes used as a generic synonym for any agile framework.  Although I’m not an expert, I’ve worked on more than a few agile programs.  And since I am fond of metaphors, I will use the Chicken and the Pig to describe two common ways that scrums of all kinds can easily get screwed up:

  1. All Chicken and No Pig
  2. All Pig and No Chicken

However, let’s first establish a more specific context for agile development using one provided by a recent blog post on the topic.

 

A Contrarian’s View of Agile BI

In her excellent blog post A Contrarian’s View of Agile BI, Jill Dyché took a somewhat unpopular view of a popular view, which is something that Jill excels at—not simply for the sake of doing it—because she’s always been well-known for telling it like it is.

In preparation for the upcoming TDWI World Conference in San Diego, Jill was pondering the utilization of agile methodologies in business intelligence (aka BI—ah, there’s one of those oh so common industry acronyms straight out of The Acronymicon).

The provocative TDWI conference theme is: “Creating an Agile BI Environment—Delivering Data at the Speed of Thought.”

Now, please don’t misunderstand.  Jill is an advocate for doing agile BI the right way.  And it’s certainly understandable why so many organizations love the idea of agile BI, especially when you consider the slower time to value of most other approaches compared with agile BI, which, following Jill’s rule of thumb, would have “either new BI functionality or new data deployed (at least) every 60-90 days.  This approach establishes BI as a program, greater than the sum of its parts.”

“But in my experience,” Jill explained, “if the organization embracing agile BI never had established BI development processes in the first place, agile BI can be a road to nowhere.  In fact, the dirty little secret of agile BI is this: It’s companies that don’t have the discipline to enforce BI development rigor in the first place that hurl themselves toward agile BI.”

“Peek under the covers of an agile BI shop,” Jill continued, “and you’ll often find dozens or even hundreds of repeatable canned BI reports, but nary an advanced analytics capability. You’ll probably discover an IT organization that failed to cultivate solid relationships with business users and is now hiding behind an agile vocabulary to justify its own organizational ADD. It’s lack of accountability, failure to manage a deliberate pipeline, and shifting work priorities packaged up as so much scrum.”

I really love the term Organizational Attention Deficit Disorder, and in spite of myself, I can’t help but render it acronymically as OADD—which should be pronounced as “odd” because the “a” is silent, as in: “Our organization is really quite OADD, isn’t it?”

 

Scrum Screwed Up: All Chicken and No Pig

Returning to the metaphor of the Scrum roles, the pigs are the people with their bacon in the game performing the actual work, and the chickens are the people to whom the results are being delivered.  Most commonly, the pigs are IT or the technical team, and the chickens are the users or the business team.  But these scrum lines are drawn in the sand, and therefore easily crossed.

Many organizations love the idea of agile BI because they are thinking like chickens and not like pigs.  And the agile life is always easier for the chicken, because the chicken is only involved, whereas the pig is committed.

OADD organizations often “hurl themselves toward agile BI” because they’re enamored with the theory, but unrealistic about what the practice truly requires.  They’re all-in when it comes to the planning, but bacon-less when it comes to the execution.

This is one common way that OADD organizations can get Scrum Screwed Up—they are All Chicken and No Pig.

 

Scrum Screwed Up: All Pig and No Chicken

Closer to the point being made in Jill’s blog post, IT can pretend to be pigs making seemingly impressive progress, but although they’re bringing home the bacon, it lacks any real sizzle because it’s not delivering any real advanced analytics to business users. 

Although they appear to be scrumming, IT is really just screwing around with technology, albeit in an agile manner.  However, what good is “delivering data at the speed of thought” when that data is neither what the business is thinking about, nor what the business truly needs?

This is another common way that OADD organizations can get Scrum Screwed Up—they are All Pig and No Chicken.

 

Scrum is NOT a Silver Bullet

Scrum—and any other agile framework—is not a silver bullet.  However, agile methodologies can work—and not just for BI.

But whether you want to call it Chicken-Pig Collaboration, or Business-IT Collaboration, or Shiny Happy People Holding Hands, a true enterprise-wide collaboration facilitated by a cross-disciplinary team is necessary for any success—agile or otherwise.

Agile frameworks, when implemented properly, help organizations realistically embrace complexity and avoid oversimplification, by leveraging recurring iterations of relatively short duration that always deliver data-driven solutions to business problems. 

Agile frameworks are successful when people take on the challenge united by collaboration, guided by effective methodology, and supported by enabling technology.  Agile frameworks allow the enterprise to follow what works, for as long as it works, and without being afraid to adjust as necessary when circumstances inevitably change.

For more information about Agile BI, follow Jill Dyché and the TDWI World Conference in San Diego, August 15-20, via Twitter.

Common Change

I recently finished reading the great book Switch: How to Change Things When Change Is Hard by Chip Heath and Dan Heath, which examines why it can be so difficult for us to make lasting changes—both professional changes and personal changes.

“For anything to change,” the Heaths explain, “someone has to start acting differently.  Ultimately, all change efforts boil down to the same mission: Can you get people to start behaving in a new way?”

Their metaphor for change of all kinds is making a Switch, which they explain requires the following three things:

  1. Directing the Rider, which is a metaphor for the rational aspect of our decisions and behavior.
  2. Motivating the Elephant, which is a metaphor for the emotional aspect of our decisions and behavior.
  3. Shaping the Path, which is a metaphor for the situational aspect of our decisions and behavior.

Despite being the most common phenomenon in the universe, change is almost universally resisted, making most of us act as if change is anything but common.  Therefore, in this blog post, I will discuss the Heaths’ three key concepts using some common terminology: Common Sense, Common Feeling, and Common Place—which, when working together, lead to Common Change.

 

Common Sense

“What looks like resistance is often a lack of clarity,” the Heaths explain.  “Ambiguity is the enemy.  Change begins at the level of individual decisions and behaviors.  To spark movement in a new direction, you need to provide crystal-clear guidance.”

Unfortunately, changes are usually communicated in ways that cause confusion instead of provide clarity.  Many change efforts fail at the outset because of either ambiguous goals or a lack of specific instructions explaining exactly how to get started.

One personal change example would be: Eat Healthier.

Although the goal makes sense, what exactly should I do?  Should I eat smaller amounts of the same food, or eat different food?  Should I start eating two large meals a day while eliminating snacks, or start eating several smaller meals throughout the day?

One professional example would be: Streamline Inefficient Processes.

This goal is even more ambiguous.  Does it mean all of the existing processes are inefficient?  What does streamline really mean?  What exactly should I do?  Should I be spending less time on certain tasks, or eliminating some tasks from my daily schedule?

Ambiguity is the enemy.  For any chance of success to be possible, both the change itself and the plan for making it happen must sound like Common Sense.

More specifically, the following two things must be clearly defined and effectively communicated:

  1. Long-term Goal – What exactly is the change that we are going to make—what is our destination?
  2. Short-term Critical Moves – What are the first few things we need to do—how do we begin our journey?

“What is essential,” as the Heaths explain, “is to marry your long-term goal with short-term critical moves.”

“What you don’t need to do is anticipate every turn in the road between today and the destination.  It’s not that plotting the whole journey is undesirable; it’s that it’s impossible.  When you’re at the beginning, don’t obsess about the middle, because the middle is going to look different once you get there.  Just look for a strong beginning and a strong ending and get moving.”

 

Common Feeling

I just emphasized the critical importance of envisioning both the beginning and the end of our journey toward change.

However, what happens in the middle is the change.  So, if common sense can help us understand where we are going and how to get started, what can help keep us going during the really challenging aspects of the middle?

There’s really only one thing that can carry us through the middle—we need to get hooked on a Common Feeling.

Some people—and especially within a professional setting—will balk at discussing the role that feeling (i.e., emotion) plays in our decision making and behavior because it is commonly believed that rational analysis must protect us from irrational emotions.

However, relatively recent advancements in the fields of psychology and neuroscience have proven that good decision making requires the flexibility to know when to rely on rational analysis and when to rely on emotions—and to always consider not only how we’re thinking, but also how we’re feeling.

In their book The Heart of Change: Real-Life Stories of How People Change Their Organizations, John Kotter and Dan Cohen explained that “the core of the matter is always about changing the behavior of people, and behavior change happens mostly by speaking to people’s feelings.  In highly successful change efforts, people find ways to help others see the problems or solutions in ways that influence emotions, not just thought.”

Kotter and Cohen wrote that most people think change happens in this order: ANALYZE—THINK—CHANGE. 

However, from interviewing over 400 people across more than 130 large organizations in the United States, Europe, Australia, and South Africa, they observed that in almost all successful change efforts, the sequence of change is: SEE—FEEL—CHANGE.

“We know there’s a difference between knowing how to act and being motivated to act,” the Heaths explain.  “But when it comes time to change the behavior of other people, our first instinct is to teach them something.”

Making only a rational argument for change without an emotional appeal results in understanding without motivation, and making only an emotional appeal for change without a rational plan results in passion without direction.

Therefore, making the case for lasting change requires that you effectively combine common sense with common feeling.

 

Common Place

“That is NOT how we do things around here” is the most common objection to change.  This is the Oath of Change Resistance, which maintains the status quo—the current situation that is so commonplace that it seems like “these people will never change.”

But as the Heaths explain, “what looks like a people problem is often a situation problem.”

Stanford psychologist Lee Ross coined the term fundamental attribution error to describe our tendency to ignore the situational forces that shape other people’s behavior.  The error lies in our inclination to attribute people’s behavior to the way they are rather than to the situation they are in.

When we lament that “these people will never change,” we have convinced ourselves that change-resistant behavior equates to a change-resistant personal character, and we discount the possibility that it could simply be a reflection of the current situation.

The great analogy used by the Heaths is water.  When boiling in a pot on the stove, it’s a scalding-hot liquid, but when cooling in a tray in the freezer, it’s an icy-cold solid.  However, declaring either scalding-hot or icy-cold as a fundamental attribute of water and not a situational attribute of water would obviously be absurd—but we do this with people and their behavior all the time.

This doesn’t mean that people’s behavior is always a result of their situation—nor does it excuse inappropriate behavior. 

The fundamental point is that the situation that people are currently in (i.e., their environment) can always be changed, and most important, it can be tweaked in ways that influence their behavior and encourage them to change for the better.

“Tweaking the environment,” the Heaths explain, “is about making the right behaviors a little bit easier and the wrong behaviors a little bit harder.  It’s that simple.”  The status quo is sometimes described as the path of least resistance.  So consider how you could tweak the environment in order to transform the path of least resistance into the path of change.

Therefore, in order to facilitate lasting change, you must create a new Common Place where the change becomes accepted as: “That IS how we do things around here—from now on.”  This is the Oath of Change, which redefines the status quo.

 

Common Change

“When change happens,” the Heaths explain, “it tends to follow a pattern.”  Although it is far easier to recognize than to embrace, in order for any of the changes we need to make to be successful, “we’ve got to stop ignoring that pattern and start embracing it.”

Change begins when our behavior changes.  In order for this to happen, we have to think that the change makes common sense, we have to feel that the change evokes a common feeling, and we have to accept that the change creates a new common place.

When all three of these rational, emotional, and situational forces are in complete alignment, then instead of resisting change, we will experience it as Common Change.

 

Related Posts

The Winning Curve

The Balancing Act of Awareness

The Importance of Envelopes

The Point of View Paradox

Persistence

The Winning Curve

Illustrated above is what I am calling The Winning Curve, which combines ideas from three books I have recently read: Switch by Chip Heath and Dan Heath, The Dip by Seth Godin, and Linchpin by Seth Godin.

The Winning Curve is applicable to any type of project or the current iteration of an ongoing program—professional or personal.

 

Insight

The Winning Curve starts with the Design Phase, the characteristics of which are inspired by Tim Brown (quoted in Switch).  Brown explains how every design phase goes through “foggy periods.”  He uses a U-shaped curve called a “project mood chart” that predicts how people will feel at different stages of the design phase.

The design phase starts with a peak of positive emotion, labeled “Hope,” and ends with a second peak of positive emotion, labeled “Confidence.”  In between these two great heights is a deep valley of negative emotion, labeled “Insight.”

The design phase, according to Brown, is “rarely a graceful leap from height to height,” and as Harvard Business School professor Rosabeth Moss Kanter explains, “everything can look like a failure in the middle.”

Therefore, the design phase is really exciting—at the beginning.

After the reality of all the research, as well as the necessary communication and collaboration with others, has a chance to set in, the hope you started out with quickly dissipates, and insight is the last thing you would expect to find “down in the valley.”

During this stage, “it’s easy to get depressed, because insight doesn’t always strike immediately,” explain Chip and Dan Heath.  “But if the team persists through this valley of angst and doubt, it eventually emerges with a growing sense of momentum.”

 

“The Dip”

After The Winning Curve has finally reached the exhilarating summit of Confidence Mountain (i.e., your design is completed), you are then faced with yet another descent, since now the Development Phase is ready to begin.

Separating the start of the development phase from the delivery date is another daunting valley, otherwise known as “The Dip.”

The development phase can be downright brutal.  It is where the grand conceptual theory of your design’s insight meets the grunt-work practice required by your development’s far-from-conceptual daily realities.

Everything sounds easier on paper (or on a computer screen).  Although completing the design phase was definitely a challenge, completing the development phase is almost always more challenging.

However, as Seth Godin explains, “The Dip is where success happens.  Successful people don’t just ride out The Dip.  They don’t just buckle down and survive it.  No, they lean into The Dip.”

“All our successes are the same.  All our failures, too,” explains Godin in the closing remarks of The Dip.  “We succeed when we do something remarkable.  We fail when we give up too soon.”

 

“Real Artists Ship”

When Steve Jobs said “real artists ship,” he was calling the bluff of a recalcitrant engineer who couldn’t let go of some programming code.  In Linchpin, Seth Godin quotes poet Bruce Ario to explain that “creativity is an instinct to produce.”

Toward the end of the development phase, the Delivery Date forebodingly looms.  The delivery date is when your definition of success will be judged by others, which is why some people prefer the term Judgment Day since it seems far more appropriate.

“The only purpose of starting,” writes Godin, “is to finish, and while the projects we do are never really finished, they must ship.”

Godin explains that the primary challenge to shipping (i.e., completing development by or before your delivery date) is thrashing.

“Thrashing is the apparently productive brainstorming and tweaking we do for a project as it develops.  Thrashing is essential.  The question is: when to thrash?  Professional creators thrash early.  The closer the project gets to completion, the fewer people see it and the fewer changes are permitted.”

Thrashing is mostly about the pursuit of perfection. 

We believe that if what we deliver isn’t perfect, then our efforts will be judged a failure.  Of course, we know that perfection is impossible.  However, our fear of failure is often based on our false belief that perfection was the actual expectation of others. 

Therefore, our fear of failure offers this simple and comforting advice: if you don’t deliver, then you can’t fail.

However, real artists realize that success or failure—or even worse, mediocrity—could be the judgment that they receive after they have delivered.  Success rocks and failure sucks—but only if you don’t learn from it.  That’s why real artists always ship. 

 

The Winning Curve

I named it “The Winning Curve” both because its shape resembles a “W” and because it sounds better than calling it “The Failing Curve.”

However, the key point is that failure often (if not always) precedes success, and in both our professional and personal lives, most (if not all) of us are pursuing one or more kinds of success—and in these pursuits, we generally view failure as the enemy.

Failure is not the enemy.  In fact, the most successful people realize failure is their greatest ally.

As Thomas Edison famously said, “I didn’t find a way to make a light bulb, I found a thousand ways how not to make one.”

“Even in failure, there is success,” explain Chip and Dan Heath.  Whenever you fail, it’s extremely rare that everything you did was a failure.  Your approach almost always creates a few small sparks in your quest to find a way to make your own light bulb.

“These flashes of success—these bright spots—can illuminate the road map for action,” according to the Heaths, who also explain that “we will struggle, we will fail, we will be knocked down—but throughout, we’ll get better, and we’ll succeed in the end.”

The Winning Curve can’t guarantee success—only learning.  Unfortunately, the name “The Learning Curve” was already taken.

 

Related Posts

Persistence

Thinking along the edges of the box

The HedgeFoxian Hypothesis

The Once and Future Data Quality Expert

Mistake Driven Learning

The Fragility of Knowledge

The Wisdom of Failure

A Portrait of the Data Quality Expert as a Young Idiot

Jack Bauer and Enforcing Data Governance Policies


In my recent blog post Red Flag or Red Herring?, I explained that the primary focus of data governance is the strategic alignment of people throughout the organization through the definition and enforcement of policies in relation to data access, data sharing, data quality, and effective data usage, all for the purposes of supporting critical business decisions and enabling optimal business performance.

Simply establishing these internal data governance policies is often no easy task to accomplish.

However, without enforcement, data governance policies are powerless to effect the real changes necessary.

(Pictured: Jack Bauer enforcing a data governance policy.)

 

Jack Bauer and Data Governance

Jill Wanless commented that “sometimes organizations have the best of intentions.  They establish strategic alignment and governing policies (no small feat!) only to fail at the enforcement and compliance.  I believe some of this behavior is due to the fact that they may not know how to enforce effectively, without risking the very alignment they have established.  I would really like to see a follow up post on what effective enforcement looks like.”

As I began drafting this requested blog post, the first image that came to my mind for what effective enforcement looks like was Jack Bauer, the protagonist of the popular (but somewhat controversial) television series 24.

Well-known for his willingness to do whatever it takes, you can almost imagine Jack explaining to executive management:

“The difference between success and failure for your data governance program is the ability to enforce your policies.  But the business processes, technology, data, and people that I deal with don’t care about your policies.  Every day I will regret looking into the eyes of men and women, knowing that at any moment, their jobs—or even their lives—may be deemed expendable, in order to protect the greater corporate good.

I will regret every decision and mistake I have to make, which results in the loss of an innocent employee.  But you know what I will regret the most?  I will regret that data governance even needs people like me.”

Although definitely dramatic and somewhat cathartic, I don’t think it would be the right message for this blog post.  Sorry, Jack.

 

Enforcing Data Governance Policies

So if hiring Jack Bauer isn’t the answer, what is?  I recommend the following five steps for enforcing data governance policies, which I have summarized in the simple list below and explain in slightly more detail in the corresponding sections that follow:

  1. Documentation – Use straightforward, natural language to document your policies in a way everyone can understand.
  2. Communication – Effective communication requires that you encourage open discussion and debate of all viewpoints.
  3. Metrics – Truly meaningful metrics can be effectively measured, and represent the business impact of data governance.
  4. Remediation – Correcting any combination of business process, technology, data, and people—and sometimes, all four.
  5. Refinement – You must dynamically evolve and adapt your data governance policies—as well as their associated metrics.

 

Documentation

The first step in enforcing data governance policies is effectively documenting the defined policies.  As stated above, the definition process itself can be quite laborious.  However, before you can expect anyone to comply with the new policies, you first have to make sure that they can understand exactly what they mean. 

This requires documenting your policies using straightforward and natural language.  I am not just talking about avoiding the use of techno-mumbo-jumbo.  Even business-speak can sound more like business-babbling—and not just to the technical folks.  Perhaps most important, avoid using acronyms and other lexicons of terminology—unless you can unambiguously define them.

For additional information on aspects related to documentation, please refer to these blog posts:

 

Communication

The second step is the effective communication of the defined and documented data governance policies.  Consider using a wiki in order to facilitate easy distribution, promote open discussion, and encourage feedback—as well as track all changes.

I always emphasize the importance of communication since it’s a crucial component of the collaboration that data governance truly requires in order to be successful. 

Your data governance policies reflect a shared business understanding.  The enforcement of these policies has as much to do with enterprise-wide collaboration as it does with supporting critical business decisions and enabling optimal business performance.

Never underestimate the potential negative impacts that the point of view paradox can have on communication.  For example, the perspectives of the business and technical stakeholders can often appear to be diametrically opposed. 

At the other end of the communication spectrum, you must also watch out for what Jill Dyché calls the tyranny of consensus, where the path of least resistance is taken, and justifiable objections either remain silent or are silenced by management. 

The tyranny of consensus is indeed the antithesis of the wisdom of crowds.  As James Surowiecki explains in his excellent book, the best collective decisions are the product of disagreement and contest, not consensus or compromise.

Data Governance lives on the two-way Street named Communication (which, of course, intersects with Collaboration Road).

For additional information on aspects related to communication, please refer to these blog posts:

 

Metrics

The third step in enforcing data governance policies is the creation of metrics with tangible business relevance.  These metrics must be capable of being effectively measured, and must also meaningfully represent the business impact of data governance.

The common challenge is that the easiest ones to create and monitor are low-level technical metrics, such as those provided by data profiling.  However, elevating these technical metrics to a level representing business relevance can often, and far too easily, merely establish their correlation with business performance.  Of course, correlation does not imply causation.

This doesn’t mean that creating metrics to track compliance with your data governance policies is impossible; it simply means you must be as careful with the definition of the metrics as you were with the definition of the policies themselves.

In his blog post Metrics, The Trap We All Fall Into, Thomas Murphy of Gartner discussed a few aspects of this challenge.

Truly meaningful metrics always align your data governance policies with your business performance.  Lacking this alignment, you could provide the comforting, but false, impression that all is well, or you could raise red flags that are really red herrings.

For additional information on aspects related to metrics, please refer to these blog posts:

 

Remediation

Effective metrics will let you know when something has gone wrong.  Francis Bacon taught us that “knowledge is power.”  However, Jackson Beck also taught us that “knowing is half the battle.”  Therefore, the fourth step in enforcing data governance policies is taking the necessary corrective actions when non-compliance and other problems inevitably arise. 

Remediation can involve any combination of business processes, technology, data, and people—and sometimes, all four. 

The most common is data remediation, which includes both reactive and proactive approaches to data quality.

Proactive defect prevention is the superior approach.  Although it is impossible to truly prevent every problem before it happens, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information.

However, reactive data cleansing will be necessary, most often driven by a business triage for critical data problems.

After the root causes of the data quality problems are identified—and they should always be identified—then additional remediation may involve a combination of business processes, technology, or people—and sometimes, all three.

Effective metrics also help identify business-driven priorities that determine the necessary corrective actions to be implemented.

For additional information on aspects related to remediation, please refer to these blog posts:

 

Refinement

The fifth and final step is the ongoing refinement of your data governance policies, which, as explained above, you are enforcing for the purposes of supporting critical business decisions and enabling optimal business performance.

As such, your data governance policies—as well as their associated metrics—can never remain static, but instead, they must dynamically evolve and adapt, all in order to protect and serve the enterprise’s continuing mission to survive and thrive in today’s highly competitive and rapidly changing marketplace.  

For additional information on aspects related to refinement, please refer to these blog posts:

 

Conclusion

Obviously, the high-level framework I described for enforcing your data governance policies has omitted some important details, such as when you should create your data governance board, and what the responsibilities of the data stewardship function are, as well as how data governance relates to specific enterprise information initiatives, such as master data management (MDM). 

However, if you are looking to follow a step-by-step, paint-by-numbers, only color inside the lines, guaranteed fool-proof plan, then you are going to fail before you even begin—because there are simply NO universal frameworks for data governance.

This is only the beginning of a more detailed discussion, the specifics of which will vary based on your particular circumstances, especially the unique corporate culture of your organization. 

Most important, you must be brutally honest about where your organization currently is in terms of data governance maturity, as this, more than anything else, dictates what your realistic capabilities are during every phase of a data governance program.

Please share your thoughts about enforcing data governance policies, as well as your overall perspectives on data governance.

 



Adventures in Data Profiling

Data profiling is a critical step in a variety of information management projects, including data quality initiatives, MDM implementations, data migration and consolidation, building a data warehouse, and many others.

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.

 

Webinar

In this vendor-neutral eLearningCurve webinar, I discuss the common functionality provided by data profiling tools, which can help automate some of the work needed to begin your preliminary data analysis.

You can download (no registration required) the webinar (.wmv file) using this link: Adventures in Data Profiling Webinar

 

Presentation

You can download the presentation (no registration required) used in the webinar as an Adobe Acrobat Document (.pdf file) using this link: Adventures in Data Profiling Presentation

 

Complete Blog Series

You can read (no registration required) the complete OCDQ blog series Adventures in Data Profiling by following these links:

Adventures in Data Profiling (Part 8)

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.  This post is the conclusion of a vendor-neutral series on the methodology of data profiling.

Data profiling can help you perform essential analysis such as:

  • Provide a reality check for the perceptions and assumptions you may have about the quality of your data
  • Verify your data matches the metadata that describes it
  • Identify different representations for the absence of data (i.e., NULL and other missing values)
  • Identify potential default values
  • Identify potential invalid values
  • Check data formats for inconsistencies
  • Prepare meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural, and relational integrity, as well as determining functional dependencies, identifying redundant storage, and other important data architecture considerations.

 

Adventures in Data Profiling

This series was carefully designed as guided adventures in data profiling in order to provide the necessary framework for demonstrating and discussing the common functionality of data profiling tools and the basic methodology behind using one to perform preliminary data analysis.

In order to narrow the scope of the series, the scenario used was that a customer data source for a new data quality initiative had been made available to an external consultant with no prior knowledge of the data or its expected characteristics.  Additionally, business requirements had not yet been documented, and subject matter experts were not currently available.

This series did not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that were covered.  Both the data profiling tool and data used throughout the series were fictional.  The “screen shots” were customized to illustrate concepts and were not modeled after any particular data profiling tool.

This post summarizes the lessons learned throughout the series, and is organized under three primary topics:

  1. Counts and Percentages
  2. Values and Formats
  3. Drill-down Analysis

 

Counts and Percentages

One of the most basic features of a data profiling tool is the ability to provide counts and percentages for each field that summarize its content characteristics:

 Data Profiling Summary

  • NULL – count of the number of records with a NULL value 
  • Missing – count of the number of records with a missing value (i.e., non-NULL absence of data, e.g., character spaces) 
  • Actual – count of the number of records with an actual value (i.e., non-NULL and non-Missing) 
  • Completeness – percentage calculated as Actual divided by the total number of records 
  • Cardinality – count of the number of distinct actual values 
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records 
  • Distinctness – percentage calculated as Cardinality divided by Actual

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique.  In Part 2, Customer ID provided an excellent example.

Distinctness can be useful in evaluating the potential for duplicate records.  In Part 6, Account Number and Tax ID were used as examples.  Both fields were less than 100% distinct (i.e., some distinct actual values occurred on more than one record).  The implied business meaning of these fields made this an indication of possible duplication.

Data profiling tools generate other summary statistics including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).  Throughout the series, several examples were provided, especially in Part 3 during the analysis of Birth Date, Telephone Number and E-mail Address.
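
For readers who like to see the arithmetic behind these statistics, here is a minimal sketch in Python (using pandas, with purely hypothetical sample data) of how the counts and percentages relate to one another; an actual data profiling tool would, of course, compute these across every field automatically:

  import pandas as pd

  def profile_field(series: pd.Series) -> dict:
      """Compute the summary statistics described above for a single field."""
      total = len(series)
      null_count = int(series.isna().sum())                # NULL: true nulls
      non_null = series.dropna()
      blank = non_null.astype(str).str.strip().eq("")      # Missing: non-NULL absence of data
      missing_count = int(blank.sum())
      actual = non_null[~blank]                            # Actual: non-NULL and non-Missing
      actual_count = len(actual)
      cardinality = int(actual.nunique())                  # Cardinality: distinct actual values
      return {
          "NULL": null_count,
          "Missing": missing_count,
          "Actual": actual_count,
          "Completeness": actual_count / total if total else 0.0,              # Actual / total records
          "Cardinality": cardinality,
          "Uniqueness": cardinality / total if total else 0.0,                 # Cardinality / total records
          "Distinctness": cardinality / actual_count if actual_count else 0.0, # Cardinality / Actual
      }

  # Hypothetical sample: one NULL, one blank (Missing), and one duplicated actual value.
  customer_id = pd.Series(["C001", "C002", "C003", None, "C003", "   "])
  print(profile_field(customer_id))

A field that returns 100% for both Completeness and Uniqueness, as Customer ID did in Part 2, is exactly the kind of primary key candidate this evaluation is meant to surface.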

 

Values and Formats

In addition to counts, percentages, and other summary statistics, a data profiling tool generates frequency distributions for the unique values and formats found within the fields of your data source.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality, indicating potential default values (e.g., Country Code in Part 4)
  • Fields with a relatively low cardinality (e.g., Gender Code in Part 2)
  • Fields with a relatively small number of known valid values (e.g., State Abbreviation in Part 4)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g., Customer ID in Part 2)
  • Fields with a relatively limited number of known valid formats (e.g., Birth Date in Part 3)
  • Fields with free-form values and a high cardinality (e.g., Customer Name 1 and Customer Name 2 in Part 7)

Cardinality can play a major role in deciding whether you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values, as we saw throughout the series (e.g., Telephone Number in Part 3).

Some fields can also be analyzed using partial values (e.g., in Part 3, Birth Year was extracted from Birth Date) or a combination of values and formats (e.g., in Part 6, Account Number had an alpha prefix followed by all numbers).

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  This analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high distinctness (i.e., the exact same field value rarely occurs on more than one record). 

Additionally, the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.  Examples of free-form field analysis were the focal points of Part 5 and Part 7.
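
To make the idea of a format-based frequency distribution more concrete, here is a minimal sketch (my own illustration, not modeled after any particular data profiling tool, using hypothetical telephone number values) that parses each value into a format mask by classifying its characters, and then counts how often each mask occurs:

  from collections import Counter

  def format_mask(value: str) -> str:
      """Classify each character: A for letters, 9 for digits, everything else kept as-is."""
      return "".join("A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value)

  # Hypothetical free-form telephone number values; a real source would have far more variety.
  values = ["(555) 123-4567", "555-123-4567", "5551234567", "555 123 4567", "(555) 123-4567"]
  frequency = Counter(format_mask(value) for value in values)
  for mask, count in frequency.most_common():
      print(mask, count)

Running this against the sample values produces masks such as (999) 999-9999 and 999-999-9999, which is the kind of frequency distribution of unique formats a profiling tool would display.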

We also saw examples of how valid values in a valid format can have an invalid context (e.g., in Part 3, Birth Date values set in the future), as well as how valid field formats can conceal invalid field values (e.g., Telephone Number in Part 3).

Part 3 also provided examples (in both Telephone Number and E-mail Address) of how you should not mistake completeness (which, as a data profiling statistic, indicates a field is populated with an actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.

 

Drill-down Analysis

A data profiling tool will also provide the capability to drill-down on its statistical summaries and frequency distributions in order to perform a more detailed review of records of interest.  Drill-down analysis will often provide useful data examples to share with subject matter experts.

Performing a preliminary analysis on your data prior to engaging in these discussions better facilitates meaningful dialogue because real-world data examples better illustrate actual data usage.  As stated earlier, understanding your data is essential to using it effectively and improving its quality.

Various examples of drill-down analysis were used throughout the series.  However, drilling all the way down to the record level was shown in Part 2 (Gender Code), Part 4 (City Name), and Part 6 (Account Number and Tax ID).
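
As a minimal sketch of the mechanics (again using pandas and hypothetical sample data), drilling down is essentially filtering from a frequency distribution back to the underlying records so they can be reviewed and shared with subject matter experts:

  import pandas as pd

  # Hypothetical data source; in practice this would be the profiled customer data.
  df = pd.DataFrame({
      "customer_id": ["C001", "C002", "C003", "C004"],
      "gender_code": ["F", "M", "U", "F"],
  })

  # Summary level: the frequency distribution of Gender Code values.
  print(df["gender_code"].value_counts())

  # Drill-down level: the records behind a value of interest (e.g., the unexpected "U").
  print(df[df["gender_code"] == "U"])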

 

Conclusion

Fundamentally, this series posed the following question: What can just your analysis of data tell you about it?

Data profiling is typically one of the first tasks performed on a data quality initiative.  I am often told to delay data profiling until business requirements are documented and subject matter experts are available to answer my questions. 

I always disagree – and begin data profiling as soon as possible.

I can do a better job of evaluating business requirements and preparing for meetings with subject matter experts after I have spent some time looking at data from a starting point of blissful ignorance and curiosity.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions.

Discovering the right questions is a critical prerequisite for effectively discussing data usage, relevancy, standards, and the metrics for measuring and improving quality.  All of which are necessary in order to progress from just profiling your data, to performing a full data quality assessment (which I will cover in a future series on this blog).

A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.  However, it is important to remember that the analysis itself cannot be automated—you need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more important, translate your analysis into meaningful reports and questions to share with the rest of your team.

Always remember that well performed data profiling is both a highly interactive and a very iterative process.

 

Thank You

I want to thank you for providing your feedback throughout this series. 

As my fellow Data Gazers, you provided excellent insights and suggestions via your comments. 

The primary reason I published this series on my blog, as opposed to simply writing a whitepaper or a presentation, was because I knew our discussions would greatly improve the material.

I hope this series proves to be a useful resource for your actual adventures in data profiling.

 

The Complete Series


Customer Incognita

Many enterprise information initiatives are launched in order to unravel that riddle, wrapped in a mystery, inside an enigma, that great unknown, also known as...Customer.

Centuries ago, cartographers used the Latin phrase terra incognita (meaning “unknown land”) to mark regions on a map not yet fully explored.  In this century, companies simply cannot afford to use the phrase customer incognita to indicate what information about their existing (and prospective) customers they don't currently have or don't properly understand.

 

What is a Customer?

First things first, what exactly is a customer?  Those happy people who give you money?  Those angry people who yell at you on the phone or say really mean things about your company on Twitter and Facebook?  Why do they have to be so mean? 

Mean people suck.  However, companies who don't understand their customers also suck.  And surely you don't want to be one of those companies, do you?  I didn't think so.

Getting back to the question, here are some insights from the Data Quality Pro discussion forum topic What is a customer?:

  • Someone who purchases products or services from you.  The word “someone” is key because it’s not the role of a “customer” that forms the real problem, but the precision of the term “someone” that causes challenges when we try to link other and more specific roles to that “someone.”  These other roles could be contract partner, payer, receiver, user, owner, etc.
  • Customer is a role assigned to a legal entity in a complete and precise picture of the real world.  The role is established when the first purchase is accepted from this real-world entity.  Of course, the main challenge is whether or not the company can establish and maintain a complete and precise picture of the real world.

These working definitions were provided by fellow blogger and data quality expert Henrik Liliendahl Sørensen, who recently posted 360° Business Partner View, which further examines the many different ways a real-world entity can be represented, including when, instead of a customer, the real-world entity represents a citizen, patient, member, etc.

A critical first step for your company is to develop your definition of a customer.  Don't underestimate either the importance or the difficulty of this process.  And don't assume it is simply a matter of semantics.

Some of my consulting clients have indignantly told me: “We don't need to define it, everyone in our company knows exactly what a customer is.”  I usually respond: “I have no doubt that everyone in your company uses the word customer, however I will work for free if everyone defines the word customer in exactly the same way.”  So far, I haven't had to work for free.  

 

How Many Customers Do You Have?

You have done the due diligence and developed your definition of a customer.  Excellent!  Nice work.  Your next challenge is determining how many customers you have.  Hopefully, you are not going to try using any of these techniques:

  • SELECT COUNT(*) AS "We have this many customers" FROM Customers
  • SELECT COUNT(DISTINCT Name) AS "No wait, we really have this many customers" FROM Customers
  • Middle-Square or Blum Blum Shub methods (i.e. random number generation)
  • Magic 8-Ball says: “Ask again later”

One of the most common and challenging data quality problems is the identification of duplicate records, especially redundant representations of the same customer information within and across systems throughout the enterprise.  The need for a solution to this specific problem is one of the primary reasons that companies invest in data quality software and services.
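If you want a feel for the record-linking idea outside of commercial data quality software, here is a minimal sketch using only the Python standard library.  The sample values, the token-sort normalization, and any threshold you would apply are illustrative assumptions, not the business rules defined in the series.

```python
# A rough illustration of comparing two customer name values for potential
# duplication.  Real data matching software does far more than this.
import re
from difflib import SequenceMatcher

def normalize(value):
    """Uppercase, strip punctuation, and sort tokens so that
    'MARLOWE, CHRISTOPHER' and 'Christopher Marlowe' compare equally."""
    tokens = re.findall(r"[A-Z0-9]+", str(value).upper())
    return " ".join(sorted(tokens))

def name_similarity(a, b):
    """Return a 0.0-1.0 similarity score for two name strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(name_similarity("Christopher Marlowe", "MARLOWE, CHRISTOPHER"))  # 1.0
print(name_similarity("Christopher Marlowe", "Chris Marlow"))          # a partial match
```

Whatever score you treat as a potential duplicate is exactly the kind of subjective business rule the series methodology asks you to define with examples of real data.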

Earlier this year on Data Quality Pro, I published a five part series of articles on identifying duplicate customers, which focused on the methodology for defining your business rules and illustrated some of the common data matching challenges.

Topics covered in the series:

  • Why a symbiosis of technology and methodology is necessary when approaching this challenge
  • How performing a preliminary analysis on a representative sample of real data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem
  • How to document your business rules for identifying duplicate customers
  • How to set realistic expectations about application development
  • How to foster a collaboration of the business and technical teams throughout the entire project
  • How to consolidate identified duplicates by creating a “best of breed” representative record

To read the series, please follow these links:

To download the associated presentation (no registration required), please follow this link: OCDQ Downloads

 

Conclusion

“Knowing the characteristics of your customers,” stated Jill Dyché and Evan Levy in the opening chapter of their excellent book, Customer Data Integration: Reaching a Single Version of the Truth, “who they are, where they are, how they interact with your company, and how to support them, can shape every aspect of your company's strategy and operations.  In the information age, there are fewer excuses for ignorance.”

For companies of every size and within every industry, customer incognita is a crippling condition that must be replaced with customer cognizance if the company is to remain competitive in a rapidly changing marketplace.

Do you know your customers?  If not, then they likely aren't your customers anymore.

Adventures in Data Profiling (Part 7)

In Part 6 of this series:  You completed your initial analysis of the Account Number and Tax ID fields.

Previously during your adventures in data profiling, you have looked at customer name within the context of other fields.  In Part 2, you looked at the associated customer names during drill-down analysis on the Gender Code field while attempting to verify abbreviations as well as assess NULL and numeric values.  In Part 6, you investigated customer names during drill-down analysis for the Account Number and Tax ID fields while assessing the possibility of duplicate records. 

In Part 7 of this award-eligible series, you will complete your initial analysis of this data source with direct investigation of the Customer Name 1 and Customer Name 2 fields.

 

Previously, the data profiling tool provided you with the following statistical summaries for customer names:

Customer Name Summary

As we discussed when we looked at the E-mail Address field (in Part 3) and the Postal Address Line fields (in Part 5), most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

Customer Name 1 and Customer Name 2 are additional examples of the necessity of this analysis technique.  Not only is the cardinality of these fields very high, but they also have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record).
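As a minimal sketch of the summary statistics being discussed, assume the field values arrive as a plain Python list and that Distinctness is quantified as cardinality divided by the populated record count (which matches its usage here: 1.0 means no value ever repeats).  A real data profiling tool computes these, and much more, for you.

```python
from collections import Counter

def profile_field(values):
    """Compute a few of the field summary statistics discussed in this series."""
    populated = [v for v in values if v not in (None, "")]
    counts = Counter(populated)
    cardinality = len(counts)
    return {
        "completeness": len(populated) / len(values) if values else 0.0,
        "cardinality": cardinality,
        # Distinctness: 1.0 means the same value never occurs on more than one record.
        "distinctness": cardinality / len(populated) if populated else 0.0,
        "top_values": counts.most_common(10),
    }

print(profile_field(["Christopher Marlowe", "Christopher Marlowe", "William Shakespeare", None]))
```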

 

Customer Name 1

The data profiling tool has provided you the following drill-down “screen” for Customer Name 1:

Field Formats for Customer Name 1 

Please Note: The differentiation between given and family names has been based on our fictional data profiling tool using probability-driven non-contextual classification of the individual field values. 

For example, Harris, Edward, and James are three of the most common names in the English language, and although they can also be family names, they are more frequently given names.  Therefore, “Harris Edward James” is assigned “Given-Name Given-Name Given-Name” for a field format.  For this particular example, how do we determine the family name?
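To make the “probability-driven non-contextual classification” idea concrete, here is a minimal sketch that labels each name token on its own using (tiny, purely illustrative) reference lists; a real tool would use much richer reference data and actual frequencies.

```python
# Hypothetical reference lists -- illustrative only.
GIVEN_NAMES = {"HARRIS", "EDWARD", "JAMES", "CHRISTOPHER"}
FAMILY_NAMES = {"MARLOWE", "SHAKESPEARE", "HARRIS", "JAMES"}

def classify_token(token):
    """Non-contextual: each token is judged on its own, so a token that is
    usually a given name is labeled Given-Name even when it is actually
    someone's family name -- exactly the 'Harris Edward James' problem."""
    token = token.upper()
    if token in GIVEN_NAMES:
        return "Given-Name"
    if token in FAMILY_NAMES:
        return "Family-Name"
    return "Word"

def field_format(value):
    return " ".join(classify_token(t) for t in value.split())

print(field_format("Harris Edward James"))  # Given-Name Given-Name Given-Name
print(field_format("Christopher Marlowe"))  # Given-Name Family-Name
```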

The top twenty most frequently occurring field formats for Customer Name 1 collectively account for over 80% of the records with an actual value in this field for this data source.  All of these field formats appear to be common potentially valid structures.  Obviously, more than one sample field value would need to be reviewed using more drill-down analysis. 

What conclusions, assumptions, and questions do you have about the Customer Name 1 field?

 

Customer Name 2

The data profiling tool has provided you the following drill-down “screen” for Customer Name 2:

Field Formats for Customer Name 2 

The top ten most frequently occurring field formats for Customer Name 2 collectively account for over 50% of the records with an actual value in this sparsely populated field for this data source.  Some of these field formats show common potentially valid structures.  Again, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Customer Name 2 field?

 

The Challenges of Person Names

Not that business names don't have challenges of their own, but person names present some particularly thorny ones.  Many data quality initiatives include the business requirement to parse, identify, verify, and format a “valid” person name.  However, unlike postal addresses, where country-specific postal databases exist to support validation, no such “standards” exist for person names.

In his excellent book Viral Data in SOA: An Enterprise Pandemic, Neal A. Fishman explains that “a person's name is a concept that is both ubiquitous and subject to regional variations.  For example, the cultural aspects of an individual's name can vary.  In lieu of last name, some cultures specify a clan name.  Others specify a paternal name followed by a maternal name, or a maternal name followed by a paternal name; other cultures use a tribal name, and so on.  Variances can be numerous.”

“In addition,” continues Fishman, “a name can be used in multiple contexts, which might affect what parts should or could be communicated.  An organization reporting an employee's tax contributions might report the name by using the family name and just the first letter (or initial) of the first name (in that sequence).  The same organization mailing a solicitation might choose to use just a title and a family name.”

However, it is not a simple task to identify what part of a person's name is the family name or the first given name (as some of the above data profiling sample field values illustrate).  Again, regional, cultural, and linguistic variations can greatly complicate what at first may appear to be a straightforward business request (e.g. formatting a person name for a mailing label).

As Fishman cautions, “many regions have cultural name profiles bearing distinguishing features for words, sequences, word frequencies, abbreviations, titles, prefixes, suffixes, spelling variants, gender associations, and indications of life events.”

If you know of any useful resources for dealing with the challenges of person names, then please share them by posting a comment below.  Additionally, please share your thoughts and experiences regarding the challenges (as well as useful resources) associated with business names.

 

What other analysis do you think should be performed for customer names?

 

In Part 8 of this series:  We will conclude the adventures in data profiling with a summary of the lessons learned.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Getting Your Data Freq On

Adventures in Data Profiling (Part 6)

In Part 5 of this series:  You completed your initial analysis of the fields relating to postal address with the investigation of Postal Address Line 1 and Postal Address Line 2.

You saw additional examples of why free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field. 

You learned this analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record). 

You also saw examples of how the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.

In Part 6, you will continue your adventures in data profiling by analyzing the Account Number and Tax ID fields.

 

Account Number

Field Summary for Account Number

  The field summary for Account Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  In Part 2, we learned that Customer ID is likely an integer surrogate key and the primary key for this data source because it is both 100% complete and 100% unique.  Account Number is 100% complete and almost 100% unique.  Perhaps it was intended to be the natural key for this data source?   

  Let's assume that drill-downs revealed the single profiled field data type was VARCHAR and the single profiled field format was aa-nnnnnnnnn (i.e. 2 characters, followed by a hyphen, followed by a 9 digit number).

  Combined with the profiled minimum/maximum field lengths, the good news appears to be that not only is Account Number always populated, it is also consistently formatted. 

  The profiled minimum/maximum field values appear somewhat suspicious, possibly indicating the presence of invalid values?

 

Field Values for Account Number

  We can use drill-downs on the field summary “screen” to get more details about Account Number provided by the data profiling tool.

  The cardinality of Account Number is very high, as is its Distinctness (i.e. the same field value rarely occurs on more than one record).  Therefore, when we limit the review to only the top ten most frequently occurring values, it is not surprising to see low counts.

  Since we do not yet have a business understanding of the data, we are not sure whether it is valid for multiple records to have the same Account Number.

  Additional analysis can be performed by extracting the alpha prefix and reviewing its top ten most frequently occurring values.  One aspect of this analysis is that it can be used to assess the possibility that Account Number is an “intelligent key.”  Perhaps the alpha prefix is a source system code?
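Here is a minimal sketch of that prefix analysis, assuming the profiled aa-nnnnnnnnn format holds; the sample values and prefixes are invented for illustration.

```python
from collections import Counter

# Invented sample values following the profiled aa-nnnnnnnnn format.
account_numbers = ["AA-000000001", "AA-000000002", "AB-123456789", "GL-987654321", "AB-555555555"]

# Extract the alpha prefix and review its most frequently occurring values.
prefix_counts = Counter(value.split("-", 1)[0] for value in account_numbers)
for prefix, count in prefix_counts.most_common(10):
    print(prefix, count)
```

If the prefixes cluster into a handful of values with meaningful frequencies, that strengthens the “intelligent key” theory and gives you concrete examples to bring to the business team.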

 

 

Tax ID

Field Summary for Tax ID

   The field summary for Tax ID includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  Let's assume that drill-downs revealed the single profiled field data type was INTEGER and the single profiled field format was nnnnnnnnn (i.e. a 9 digit number).

  Combined with the profiled minimum/maximum field lengths, the good news appears to be that Tax ID is also consistently formatted.  However, the profiled minimum/maximum field values appear to indicate the presence of invalid values.

  In Part 4, we learned that most of the records appear to have either a United States (US) or Canada (CA) postal address.  For US records, the Tax ID field could represent the social security number (SSN), federal employer identification number (FEIN), or tax identification number (TIN).  For CA records, this field could represent the social insurance number (SIN).  All of these identifiers are used for tax reporting purposes and have a 9 digit number format (when no presentation formatting is used).
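A minimal sketch of a format sanity check consistent with that 9 digit profile is shown below.  The “suspicious” patterns are illustrative assumptions, not the official SSN, FEIN, TIN, or SIN validation rules.

```python
import re

NINE_DIGITS = re.compile(r"^\d{9}$")

def flag_tax_id(value):
    """Flag values that do not fit the profiled 9 digit format, or that fit
    the format but look like placeholder values."""
    value = str(value)
    if not NINE_DIGITS.match(value):
        return "invalid format"
    if len(set(value)) == 1:  # e.g. 000000000 or 999999999
        return "suspicious value"
    return "format ok"

for tax_id in ["123456789", "000000000", "12345678", "123-45-6789"]:
    print(tax_id, "->", flag_tax_id(tax_id))
```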

 

Field Values for Tax ID

  We can use drill-downs on the field summary “screen” to get more details about Tax ID provided by the data profiling tool.

  The Distinctness of Tax ID is slightly lower than that of Account Number, and therefore the same field value does occasionally occur on more than one record.

  Since the cardinality of Tax ID is very high, we will limit the review to only the top ten most frequently occurring values.  This analysis reveals the presence of more (most likely) invalid values.

 

Potential Duplicate Records

In Part 1, we asked if the data profiling statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records.  In other words, since some distinct actual values for these fields occur on more than one record, does this imply more than just a possible data relationship, but a possible data redundancy?  Obviously, we would need to interact with the business team in order to better understand the data and their business rules for identifying duplicate records.

However, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for Account Number and Tax ID

 

What other analysis do you think should be performed for these fields?

 

In Part 7 of this series:  We will continue the adventures in data profiling by completing our initial analysis with the investigation of the Customer Name 1 and Customer Name 2 fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

To Parse or Not To Parse

“To Parse, or Not To Parse,—that is the question:
Whether 'tis nobler in the data to suffer
The slings and arrows of free-form fields,
Or to take arms against a sea of information,
And by parsing, understand them?”

Little known fact: before William Shakespeare made it big as a playwright, he was a successful data quality consultant. 

Alas, poor data quality!  The Bard of Avon knew it quite well.  And he was neither a fan of free verse nor free-form fields.

 

Free-Form Fields

A free-form field contains multiple (usually interrelated) sub-fields.  Perhaps the most common examples of free-form fields are customer name and postal address.

A Customer Name field with the value “Christopher Marlowe” is comprised of the following sub-fields and values:

  • Given Name = “Christopher”
  • Family Name = “Marlowe”

A Postal Address field with the value “1587 Tambur Lane” is comprised of the following sub-fields and values:

  • House Number = “1587”
  • Street Name = “Tambur”
  • Street Type = “Lane”

Obviously, both of these examples are simplistic.  Customer name and postal address are comprised of additional sub-fields, not all of which will be present on every record or represented consistently within and across data sources.
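To make the parsing idea concrete, here is a minimal sketch that handles only a simplistic, well-behaved address line like the one above; real addresses need far more sophisticated parsing than one regular expression.

```python
import re

# A deliberately naive pattern: house number, street name, street type.
ADDRESS_PATTERN = re.compile(r"^(?P<house_number>\d+)\s+(?P<street_name>.+?)\s+(?P<street_type>\w+)$")

match = ADDRESS_PATTERN.match("1587 Tambur Lane")
if match:
    print(match.groupdict())
    # {'house_number': '1587', 'street_name': 'Tambur', 'street_type': 'Lane'}
```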

Returning to the bard's question, a few of the data quality reasons to consider parsing free-form fields include:

  • Data Profiling
  • Data Standardization
  • Data Matching

 

Much Ado About Analysis

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  In Adventures in Data Profiling (Part 5), a data profiling tool was used to analyze the field Postal Address Line 1:

Field Formats for Postal Address Line 1

 

The Taming of the Variations

Free-form fields often contain numerous variations resulting from data entry errors, different conventions for representing the same value, and a general lack of data quality standards.  Additional variations are introduced by multiple data sources, each with its own unique data characteristics and quality challenges.

Data standardization parses free-form fields to break them down into their smaller individual sub-fields to gain improved visibility of the available input data.  Data standardization is the taming of the variations that creates a consistent representation, applies standard values where appropriate, and when possible, populates missing values.

The following example shows parsed and standardized postal addresses:

Parsed and Standardized Postal Address
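As a minimal sketch of the standardization step, assume a tiny, illustrative dictionary of standard abbreviations; production tools apply country-specific postal standards and context rules rather than a flat lookup table.

```python
# Illustrative abbreviation dictionaries -- not a complete postal standard.
STREET_TYPE_STANDARDS = {"LANE": "LN", "STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}
DIRECTIONAL_STANDARDS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W"}

def standardize_token(token):
    upper = token.upper()
    return STREET_TYPE_STANDARDS.get(upper, DIRECTIONAL_STANDARDS.get(upper, upper))

def standardize_address_line(line):
    """Apply standard values to each parsed token of an address line."""
    return " ".join(standardize_token(t) for t in line.split())

print(standardize_address_line("1587 Tambur Lane"))        # 1587 TAMBUR LN
print(standardize_address_line("221 North Baker Street"))  # 221 N BAKER ST
```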

In your data quality implementations, do you use this functionality for processing purposes only?  If you retain the standardized results, do you store the parsed and standardized sub-fields or just the standardized free-form value?

 

Shall I compare thee to other records?

Data matching often uses data standardization to prepare its input.  This allows for more direct and reliable comparisons of parsed sub-fields with standardized values, decreases the failure to match records because of data variations, and increases the probability of effective match results.

Imagine matching the following product description records with and without the parsed and standardized sub-fields:

Parsed and Standardized Product Description
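The payoff is easiest to see in a minimal sketch: the raw free-form descriptions disagree as strings, but their parsed and standardized sub-fields (hypothetical names and values, not the ones from the illustration above) compare directly.

```python
raw_a = "Acme Widget 16oz Blue"
raw_b = "ACME WIDGET BLUE 16 OZ"
print(raw_a == raw_b)  # False: raw free-form values fail an exact comparison

parsed_a = {"brand": "ACME", "product": "WIDGET", "size": "16 OZ", "color": "BLUE"}
parsed_b = {"brand": "ACME", "product": "WIDGET", "size": "16 OZ", "color": "BLUE"}
print(parsed_a == parsed_b)  # True: standardized sub-fields compare directly
```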

 

Doth the bard protest too much? 

Please share your thoughts and experiences regarding free-form fields.

Adventures in Data Profiling (Part 5)

In Part 4 of this series:  You went totally postal...shifting your focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

You learned that when a field is both 100% complete and has an extremely low cardinality, its most frequently occurring value could be its default value, and that forcing international addresses into country-specific data structures can cause data quality problems.  And with the expert assistance of Graham Rhind, we all learned more about international postal code formats.

In Part 5, you will continue your adventures in data profiling by completing your initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

As we discussed in Part 3 when we looked at the E-mail Address field, most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field. 

Postal Address Line 1 and Postal Address Line 2 are additional examples of the necessity of this analysis technique.  Not only is the cardinality of these fields very high, but they also have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record).  Some variations in postal addresses can be the result of data entry errors, the use of local conventions, or ignoring (or lacking) postal standards.

Additionally, postal address lines can sometimes contain overflow from other fields (e.g. Customer Name) or they can be used as a dumping ground for values without their own fields (e.g. Twitter username), values unable to conform to the limitations of their intended fields (e.g. countries with something analogous to a US state or CA province but incompatible with a two character field length), or comments (e.g. LDIY, which, as Steve Sarsfield discovered, warns us about the Large Dog In Yard).

 

Postal Address Line 1

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 1:

Field Formats for Postal Address Line 1

The top twenty most frequently occurring field formats for Postal Address Line 1 collectively account for over 80% of the records with an actual value in this field for this data source.  All of these field formats appear to be common potentially valid structures.  Obviously, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 1 field?

 

Postal Address Line 2

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 2:

Field Formats for Postal Address Line 2

The top ten most frequently occurring field formats for Postal Address Line 2 collectively account for half of the records with an actual value in this sparsely populated field for this data source.  Some of these field formats show several common potentially valid structures.  Again, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 2 field?

 

Postal Address Validation

Many data quality initiatives include the implementation of postal address validation software.  This provides the capability to parse, identify, verify, and format a valid postal address by leveraging country-specific postal databases. 

Some examples of postal validation functionality include correcting misspelled street and city names, populating missing postal codes, and applying (within context) standard abbreviations for sub-fields such as directionals (e.g. N for North and E for East), street types (e.g. ST for Street and AVE for Avenue), and box types (e.g. BP for Boîte Postale and CP for Case Postale).  These standards not only vary by country, but can also vary within a country when there are multiple official languages.

The presence of non-postal data can sometimes cause either validation failures (i.e. an inability to validate some records, not a process execution failure) or simply deletion of the unexpected values.  Therefore, some implementations will use a pre-process to extract the non-postal data prior to validation.

Most validation software will append one or more status fields indicating what happened to the records during processing.  It is a recommended best practice to perform post-validation analysis by not only looking at these status fields, but also comparing the record content before and after validation, in order to determine what modifications and enhancements have been performed.
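A minimal sketch of that post-validation analysis is shown below, assuming the validation software returns a status code plus before/after address values; the field and status names here are hypothetical, so substitute whatever your software actually appends.

```python
from collections import Counter

# Hypothetical post-validation output records.
validated = [
    {"status": "VALIDATED", "before": "1587 Tambur Lane",  "after": "1587 TAMBUR LN"},
    {"status": "VALIDATED", "before": "221 Baker Street",  "after": "221 BAKER ST"},
    {"status": "FAILED",    "before": "LARGE DOG IN YARD", "after": "LARGE DOG IN YARD"},
]

# Summarize the status fields appended by the validation software.
print(Counter(record["status"] for record in validated))

# Compare record content before and after validation.
changed = [r for r in validated if r["before"] != r["after"]]
print(f"{len(changed)} of {len(validated)} records were modified by validation")
```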

 

What other analysis do you think should be performed for postal address?

 

In Part 6 of this series:  We will continue the adventures by analyzing the Account Number and Tax ID fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Adventures in Data Profiling (Part 4)

In Part 3 of this series:  The adventures continued with a detailed analysis of the fields Birth Date, Telephone Number and E-mail Address.  This provided you with an opportunity to become familiar with analysis techniques that use a combination of field values and field formats. 

You also saw examples of how valid values in a valid format can have an invalid context, how valid field formats can conceal invalid field values, and how free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.

In Part 4, you will continue your adventures in data profiling by going postal...postal address that is, by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

 

Country Code

Field Values for Country Code

  In Part 1, we wondered if 5 distinct Country Code field values indicated international postal addresses.  This drill-down “screen” provided by the data profiling tool shows the frequency distribution.  First of all, the field name might have led us to assume we would only see ISO 3166 standard country codes.

 

However, two of the field values are a country name and not a country code.  This is another example of why verifying that the data matches the metadata describing it is an essential analytical task, one where data profiling provides a much needed reality check for the perceptions and assumptions we may have about our data.

Secondly, the field values would appear to indicate that most of the postal addresses are from the United States.  However, if you recall from Part 3, we discovered some potential clues during our analysis of Telephone Number, which included two formats that appear invalid based on North American standards, and E-mail Address, which included country code Top Level Domain (TLD) values for Canada and the United Kingdom.

Additionally, whenever a field is both 100% complete and has an extremely low cardinality, it could be an indication that the most frequently occurring value is the field's default value. 

Therefore, is it possible that US is simply the default value for Country Code for this data source?
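Here is a minimal sketch of that default-value heuristic: flag the most frequent value of any field that is 100% complete with extremely low cardinality.  The thresholds and sample values are illustrative assumptions.

```python
from collections import Counter

def possible_default(values, max_cardinality=10, dominance=0.80):
    """Return the most frequent value if it looks suspiciously like a default."""
    counts = Counter(values)
    if None in counts or "" in counts:
        return None                      # not 100% complete
    if len(counts) > max_cardinality:
        return None                      # cardinality is not "extremely low"
    value, count = counts.most_common(1)[0]
    return value if count / len(values) >= dominance else None

# Invented distribution echoing this data source: mostly US, plus two country names.
country_codes = ["US"] * 90 + ["CA"] * 5 + ["UK"] * 3 + ["United States", "Canada"]
print(possible_default(country_codes))  # US is suspiciously dominant
```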

 

Zip Code

Field Formats for Zip Code

  From the Part 1 comments, it was noted that Zip Code as a field name is unique to the postal code system used in the United States (US).  This drill-down “screen” provided by the data profiling tool shows the field has a total of only ten field formats.

  The only valid field formats for ZIP (which, by the way, is an acronym for Zone Improvement Plan) are 5 digits and 9 digits when the 4 digit ZIP+4 add-on code is also present, which according to the US postal standards should be separated from the 5 digit ZIP Code using a hyphen.

 

The actual field formats in the Zip Code field of this data source reveal another example of how we should not make assumptions about our data based on the metadata that describes it.  Although the three most frequently occurring field formats appear to be representative of potentially valid US postal codes, the alphanumeric postal code field formats are our first indication that it is, perhaps sadly, not all about US (pun intended, my fellow Americans).

The two most frequently occurring alphanumeric field formats appear to be representative of potentially valid Canadian postal codes.  An interesting thing to note is that their combined frequency distribution is double the count of the number of records having CA as a Country Code field value.  Therefore, if these field formats are representative of a valid Canadian postal code, then some Canadian records have a contextually invalid field value in Country Code.

The other alphanumeric field formats appear to be representative of potentially valid postal codes for the United Kingdom (UK).  To the uninitiated, the postal codes of Canada (CA) and the UK appear very similar.  Both postal code formats contain two parts, which according to their postal standards should be separated by a single character space. 

In CA postal codes, the first part is called the Forward Sortation Area (FSA) and the second part is called the Local Delivery Unit (LDU).  In UK postal codes, the first part is called the outward code and the second part is called the inward code. 

One easy way to spot the difference is that a UK inward code always has the format of a digit followed by two letters (i.e. “naa” in the field formats generated by my fictional data profiling tool), whereas a CA LDU always has the format of a digit followed by a letter followed by another digit (i.e. “nan”). 
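A minimal sketch of that spot check is shown below, applying the “naa” versus “nan” rule to the second part of the postal code; it is only a format hint, for the reasons given in the next paragraph.

```python
import re

UK_INWARD = re.compile(r"^\d[A-Z]{2}$")  # naa: digit, letter, letter
CA_LDU = re.compile(r"^\d[A-Z]\d$")      # nan: digit, letter, digit

def second_part_looks_like(postal_code):
    """Classify the second part of a two-part postal code as UK-like or CA-like."""
    parts = postal_code.upper().split()
    if len(parts) != 2:
        return "unexpected structure"
    if UK_INWARD.match(parts[1]):
        return "possible UK inward code"
    if CA_LDU.match(parts[1]):
        return "possible CA Local Delivery Unit"
    return "neither"

print(second_part_looks_like("SW1A 1AA"))  # possible UK inward code
print(second_part_looks_like("K1A 0B1"))   # possible CA Local Delivery Unit
```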

However, we should never rule out the possibility of transposed values making a CA postal code look like a UK postal code, or vice versa.  Also, never forget the common data quality challenge of valid field formats concealing invalid field values.

Returning to the most frequently occurring field format of 5 digits, can we assume all valid field values would represent US postal addresses?  Of course not.  One significant reason is that a 5 digit postal code is one of the most common formats in the world. 

Just some of the other countries also using a 5 digit postal code include: Algeria, Cuba, Egypt, Finland, France, Germany, Indonesia, Israel, Italy, Kuwait, Mexico, Spain, and Turkey.

What about the less frequently occurring field formats of 4 digits and 6 digits?  It is certainly possible that these field formats could indicate erroneous attempts at entering a valid US postal code.  However, it could also indicate the presence of additional non-US postal addresses.

Just some of the countries using a 4 digit postal code include: Australia, Austria, Belgium, Denmark, El Salvador, Georgia (no, the US state did not once again secede, there is also a country called Georgia and it's not even in the Americas), Hungary, Luxembourg, Norway, and Venezuela.  Just some of the countries using a 6 digit postal code include: Belarus, China, India, Kazakhstan (yes, Borat fans, Kazakhstan is a real country), Russia, and Singapore.

Additionally, why do almost 28% of the records in this data source not have a field value for Zip Code? 

One of the possibilities is that we could have postal addresses from countries that do not have a postal code system.  Just a few examples would be: Aruba, Bahamas (sorry fellow fans of the Beach Boys, but both Jamaica and Bermuda have a postal code system, and therefore I could not take you down to Kokomo), Fiji (home of my favorite bottled water), and Ireland (home of my ancestors and inventors of my second favorite coffee).

 

State Abbreviation

Field Values for State Abbreviation

  From the Part 1 comments, it was noted that the cardinality of State Abbreviation appeared suspect because, if we assume that its content matches its metadata, then we would expect only 51 distinct values (i.e. actual US state abbreviations without counting US territories) and not the 72 distinct values discovered by the data profiling tool.

  Let's assume that drill-downs have revealed the single profiled field data type was CHAR, and the profiled minimum/maximum field lengths were both 2.  Therefore, State Abbreviation, when populated, always contains a two character field value.    

  This drill-down “screen” first displays the top ten most frequently occurring values in the State Abbreviation field, which are all valid US state abbreviations.  The frequency distributions are also within general expectations since eight of the largest US states by population are represented.

 

However, our previous analysis of Country Code and Zip Code has already made us aware that international postal addresses exist in this data source.  Therefore, this drill-down “screen” also displays the top ten most frequently occurring non-US values based on the data profiling tool comparing all 72 distinct values against a list of valid US state and territory abbreviations.

Most of the field values discovered by this analysis appear to be valid CA province codes (including PQ being used as a common alternative for QC – the province of Quebec or Québec si vous préférez).  These frequency distributions are also within general expectations since six of the largest CA provinces by population are represented.  Their combined frequency distribution is also fairly close to the combined frequency distribution of potentially valid Canadian postal codes found in the Zip Code field.

However, we still have three additional values (ZZ, SA, HD) which require more analysis.  Additionally, almost 22% of the records in this data source do not have a field value for State Abbreviation, which could be attributable to the fact that even when the postal standards for other countries include something analogous to a US state or CA province, it might not be compatible with a two character field length.
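A minimal sketch of that comparison is shown below, splitting the distinct State Abbreviation values into US, CA, and “requires more analysis” buckets; the reference lists are deliberately abbreviated for illustration.

```python
# Partial reference lists -- a real check would use the complete lists,
# including US territories and PQ as a common alias for QC.
US_STATES = {"CA", "TX", "NY", "FL", "IL", "PA", "OH", "GA"}
CA_PROVINCES = {"ON", "QC", "PQ", "BC", "AB", "MB", "SK", "NS"}

# A few of the distinct values discovered by the data profiling tool.
distinct_values = {"TX", "NY", "ON", "PQ", "ZZ", "SA", "HD"}

us_values = distinct_values & US_STATES
ca_values = distinct_values & CA_PROVINCES
unexplained = distinct_values - US_STATES - CA_PROVINCES

print("US:", sorted(us_values))                    # ['NY', 'TX']
print("CA:", sorted(ca_values))                    # ['ON', 'PQ']
print("Requires analysis:", sorted(unexplained))   # ['HD', 'SA', 'ZZ']
```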

 

City Name

Let's assume that we have performed some preliminary analysis on the statistical summaries and frequency distributions provided by the data profiling tool for the City Name field using the techniques illustrated throughout this series so far. 

Let's also assume analyzing the City Name field in isolation didn't reveal anything suspicious.  The field is consistently populated and its frequently occurring values appeared to meet general expectations.  Therefore, let's assume we have performed additional drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for City Name

Based on reviewing these records, what conclusions, assumptions, and questions do you have about the City Name field?

 

What other questions can you think of for these fields?  What other analysis do you think should be performed for these fields?

 

In Part 5 of this series:  We will continue the adventures in data profiling by completing our initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

 

International Man of Postal Address Standards

Since I am a geographically-challenged American, the first (and often the only necessary) option I choose for assistance with international postal address standards is Graham Rhind.

His excellent book The Global Source-Book for Address Data Management is an invaluable resource and recognized standard reference that contains over 1,000 pages of data pertaining to over 240 countries and territories.