Tweet 2001: A Social Media Odyssey

HAL 9000: “I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”

As I get closer and closer to my 2001st tweet on Twitter, I wanted to pause for some quiet reflection on my personal odyssey in social media – but then I decided to blog about it instead.

 

The Dawn of OCDQ

Except for LinkedIn, my epic drama of social media adventure and exploration started with my OCDQ blog.

In my Data Quality Pro article Blogging about Data Quality, I explained why I started this blog and discussed some of my thoughts on blogging.  Most importantly, I explained that I am neither a blogging expert nor a social media expert.

But now that I have been blogging and using social media for over six months, I feel more comfortable sharing my thoughts and personal experiences with social media without worrying about sounding like too much of an idiot (no promises, of course).

 

LinkedIn

My social media odyssey began in 2007 when I created my account on LinkedIn, which I admit, I initially viewed as just an online resume.  I put little effort into my profile, only made a few connections, and only joined a few groups.

Last year (motivated by the economic recession), I started using LinkedIn more extensively.  I updated my profile with a complete job history, asked my colleagues for recommendations, expanded my network with more connections, and joined more groups.  I also used LinkedIn applications (e.g. Reading List by Amazon and Blog Link) to further enhance my profile.

My favorite feature is the LinkedIn Groups, which not only provide an excellent opportunity to connect with other users, but also provide Discussions, News (including support for RSS feeds), and Job Postings.

By no means a comprehensive list, here are some LinkedIn Groups that you may be interested in:

For more information about LinkedIn features and benefits, check out the following posts on the LinkedIn Blog:

 

Twitter

Shortly after launching my blog in March 2009, I created my Twitter account to help promote my blog content.  In blogging, content is king, but marketing is queen.  LinkedIn (via group news feeds) is my leading source of blog visitors from social media, but Twitter isn't far behind. 

However, as Michele Goetz of Brain Vibe explained in her blog post Is Twitter an Effective Direct Marketing Tool?, Twitter has a click-through rate equivalent to direct mail.  Citing research from Pear Analytics, she noted that a “useful” tweet has a shelf life of about one hour with about a 1% click-through rate on links.

In his blog post Is Twitter Killing Blogging?, Ajay Ohri of Decision Stats examined whether Twitter was a complement or a substitute for blogging.  I created a Data Quality on Twitter page on my blog in order to illustrate what I have found to be the complementary nature of tweeting and blogging. 

My ten blog posts receiving the most tweets (tracked using the Retweet Button from TweetMeme):

  1. The Nine Circles of Data Quality Hell 13 Tweets
  2. Adventures in Data Profiling (Part 1) 13 Tweets
  3. Fantasy League Data Quality 12 Tweets
  4. Not So Strange Case of Dr. Technology and Mr. Business 12 Tweets 
  5. The Fragility of Knowledge 11 Tweets
  6. The General Theory of Data Quality 9 Tweets
  7. The Very True Fear of False Positives 8 Tweets
  8. Data Governance and Data Quality 8 Tweets
  9. Adventures in Data Profiling (Part 3) 8 Tweets
  10. Data Quality: The Reality Show? 7 Tweets

Most of my social networking is done using Twitter (with LinkedIn being a close second).  I have also found Twitter to be great for doing research, which I complement with RSS subscriptions to blogs.

To search Twitter for data quality content:

If you are new to Twitter, then I would recommend reading the following blog posts:

 

Facebook

I also created my Facebook account shortly after launching my blog.  Although I almost exclusively use social media for professional purposes, I do use Facebook as a way to stay connected with family and friends. 

I created a page for my blog to separate the professional and personal aspects of Facebook without the need to manage multiple accounts.  Additionally, this allows you to become a “fan” of my blog without requiring you to also become my “friend.”

A quick note on Facebook games, polls, and trivia: I do not play them.  With my obsessive-compulsive personality, I have to ignore them.  Therefore, please don't be offended if, for example, I have ignored your invitation to play Mafia Wars.

By no means a comprehensive list, here are some Facebook Pages or Groups that you may be interested in:

 

Additional Social Media Websites

Although LinkedIn, Twitter, and Facebook are my primary social media websites, I also have accounts on three of the most popular social bookmarking websites: Digg, StumbleUpon, and Delicious.

Social bookmarking can be a great promotional tool that can help blog content go viral.  However, niche content is almost impossible to get to go viral.  Data quality is not just a niche – if technology blogging was a Matryoshka (a.k.a. Russian nested) doll, then data quality would be the last, innermost doll. 

This doesn't mean that data quality isn't an important subject – it just means that you will not see a blog post about data quality hitting the front pages of mainstream social bookmarking websites anytime soon.  Dylan Jones of Data Quality Pro created DQVote, which is a social bookmarking website dedicated to sharing data quality community content.

I also have an account on FriendFeed, which is an aggregator that can consolidate content from other social media websites, blogs, or anything providing an RSS feed.  My blog posts and my updates from other social media websites (except for Facebook) are automatically aggregated.  On Facebook, my personal page displays my FriendFeed content.

 

Social Media Tools and Services

Social media tools and services that I personally use (listed in no particular order):

  • Flock – The Social Web Browser Powered by Mozilla
  • TweetDeck – Connecting you with your contacts across Twitter, Facebook, MySpace and more
  • Digsby – Digsby = Instant Messaging (IM) + E-mail + Social Networks
  • Ping.fm – Update all of your social networks at once
  • HootSuite – The professional Twitter client
  • Twitterfeed – Feed your blog to Twitter
  • Google FeedBurner – Provide an e-mail subscription to your blog
  • TweetMeme – Add a Retweet Button to your blog
  • Squarespace Blog Platform – The secret behind exceptional websites

 

Social Media Strategy

As Darren Rowse of ProBlogger explained in his blog post How I use Social Media in My Blogging, Chris Brogan developed a social media strategy using the metaphor of a Home Base with Outposts.

“A home base,” explains Rowse, “is a place online that you own.”  For example, your home base could be your blog or your company's website.  “Outposts,” continues Rowse, “are places that you have an online presence out in other parts of the web that you might not own.”  For example, your outposts could be your LinkedIn, Twitter, and Facebook accounts.

According to Rowse, your Outposts will make your Home Base stronger by providing:

“Relationships, ideas, traffic, resources, partnerships, community and much more.”

Social Karma

An effective social media strategy is essential for both companies and individual professionals.  Using social media can help promote you, your expertise, your company and your products and services.

However, too many companies and individuals have a selfish social media strategy.

You should not use social media exclusively for self-promotion.  You should view social media as Social Karma.

If you can focus on helping others when you use social media, then you will get much more back than just a blog reader, a LinkedIn connection, a Twitter follower, a Facebook friend, or even a potential customer.

Yes, I use social media to promote myself and my blog content.  However, more than anything else, I use social media to listen, to learn, and to help others when I can.

 

Please Share Your Social Media Odyssey

As always, I am interested in hearing from you.  What have been your personal experiences with social media?

DQ-Tip: “Data quality is primarily about context not accuracy...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Data quality is primarily about context not accuracy. 

Accuracy is part of the equation, but only a very small portion.”

This DQ-Tip is from Rick Sherman's recent blog post summarizing the TDWI Boston Chapter Meeting at MIT.

 

I define data using the Dragnet definition – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers).  A common definition for data quality is fitness for the purpose of use; the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I define information as data in use or data in action.

Alternatively, information can be defined as data in context.

Quality, as Sherman explains, “is in the eyes of the beholder, i.e. the business context.”

 

Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

The General Theory of Data Quality

The Data-Information Continuum

Adventures in Data Profiling (Part 6)

In Part 5 of this series: You completed your initial analysis of the fields relating to postal address with the investigation of Postal Address Line 1 and Postal Address Line 2.

You saw additional examples of why free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field. 

You learned this analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record). 

You also saw examples of how the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.

In Part 6, you will continue your adventures in data profiling by analyzing the Account Number and Tax ID fields.

 

Account Number

Field Summary for Account Number

  The field summary for Account Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  In Part 2, we learned that Customer ID is likely an integer surrogate key and the primary key for this data source because it is both 100% complete and 100% unique.  Account Number is 100% complete and almost 100% unique.  Perhaps it was intended to be the natural key for this data source?   

  Let's assume that drill-downs revealed the single profiled field data type was VARCHAR and the single profiled field format was aa-nnnnnnnnn (i.e. 2 characters, followed by a hyphen, followed by a 9 digit number).
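  As a rough illustration of how a field format like aa-nnnnnnnnn is constructed, here is a minimal Python sketch that classifies each character of a value (letters as a, digits as n) while keeping punctuation; the sample values are hypothetical, and a real data profiling tool performs this classification far more robustly:

```python
def format_mask(value):
    """Build a format mask: letters become 'a', digits become 'n',
    and everything else (hyphens, spaces, etc.) is kept as-is."""
    mask = []
    for ch in value:
        if ch.isalpha():
            mask.append("a")
        elif ch.isdigit():
            mask.append("n")
        else:
            mask.append(ch)
    return "".join(mask)

# Hypothetical Account Number values
for value in ["AB-123456789", "XY-987654321", "QQ-000000001"]:
    print(value, "->", format_mask(value))  # each produces aa-nnnnnnnnn
```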

  Combined with the profiled minimum/maximum field lengths, the good news appears to be that not only is Account Number always populated, it is also consistently formatted. 

  The profiled minimum/maximum field values appear somewhat suspicious, possibly indicating the presence of invalid values?

 

Field Values for Account Number

  We can use drill-downs on the field summary “screen” to get more details about Account Number provided by the data profiling tool.

  The cardinality of Account Number is very high, as is its Distinctness (i.e. the same field value rarely occurs on more than one record).  Therefore, when we limit the review to only the top ten most frequently occurring values, it is not surprising to see low counts.
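  A minimal sketch of the two measures being discussed here, under the common definitions of cardinality (count of distinct values) and distinctness (distinct values as a share of populated records); the sample values are hypothetical:

```python
# Hypothetical Account Number values (None represents a missing value)
values = ["AB-123456789", "AB-123456789", "XY-987654321", "CD-111222333", None]

populated = [v for v in values if v is not None]
cardinality = len(set(populated))
distinctness = cardinality / len(populated)

print("Cardinality:", cardinality)      # 3 distinct values
print("Distinctness:", distinctness)    # 0.75 -> some values repeat across records
```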

  Since we do not yet have a business understanding of the data, we are not sure if it is valid for multiple records to have the same Account Number.

  Additional analysis can be performed by extracting the alpha prefix and reviewing its top ten most frequently occurring values.  One aspect of this analysis is that it can be used to assess the possibility that Account Number is an “intelligent key.”  Perhaps the alpha prefix is a source system code?
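  A minimal sketch of that prefix analysis, assuming the aa-nnnnnnnnn format holds; the sample values are hypothetical:

```python
from collections import Counter

# Hypothetical Account Number values following the aa-nnnnnnnnn format
account_numbers = ["AB-123456789", "AB-223344556", "XY-987654321",
                   "AB-555666777", "CD-111222333", "XY-444555666"]

# Extract the alpha prefix (the two characters before the hyphen)
prefixes = [value.split("-", 1)[0] for value in account_numbers]

# If a handful of prefixes dominate the frequency distribution, they may
# well be source system codes embedded in an "intelligent key"
for prefix, count in Counter(prefixes).most_common(10):
    print(prefix, count)
```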

 

 

Tax ID

Field Summary for Tax ID

   The field summary for Tax ID includes input metadata along with the summary and additional statistics provided by the data profiling tool.

  Let's assume that drill-downs revealed the single profiled field data type was INTEGER and the single profiled field format was nnnnnnnnn (i.e. a 9 digit number).

  Combined with the profiled minimum/maximum field lengths, the good news appears to be that Tax ID is also consistently formatted.  However, the profiled minimum/maximum field values appear to indicate the presence of invalid values.

  In Part 4, we learned that most of the records appear to have either a United States (US) or Canada (CA) postal address.  For US records, the Tax ID field could represent the social security number (SSN), federal employer identification number (FEIN), or tax identification number (TIN).  For CA records, this field could represent the social insurance number (SIN).  All of these identifiers are used for tax reporting purposes and have a 9 digit number format (when no presentation formatting is used).

 

Field Values for Tax ID

  We can use drill-downs on the field summary “screen” to get more details about Tax ID provided by the data profiling tool.

  The Distinctness of Tax ID is slightly lower than that of Account Number, and therefore the same field value does occasionally occur on more than one record.

  Since the cardinality of Tax ID is very high, we will limit the review to only the top ten most frequently occurring values.  This analysis reveals the presence of more (most likely) invalid values.
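  The invalid values themselves are not shown here, but placeholder patterns such as repeated or sequential digits are common suspects.  A minimal sketch of flagging them (both the sample values and the specific checks are illustrative assumptions):

```python
def is_suspicious_tax_id(tax_id: str) -> bool:
    """Flag 9-digit values that look like placeholders rather than real identifiers."""
    if len(tax_id) != 9 or not tax_id.isdigit():
        return True                               # wrong length or non-numeric
    if len(set(tax_id)) == 1:
        return True                               # e.g. 000000000 or 999999999
    if tax_id in ("123456789", "987654321"):
        return True                               # sequential placeholder values
    return False

# Hypothetical values from a top-ten frequency distribution
for value in ["999999999", "123456789", "000000000", "555443333"]:
    print(value, is_suspicious_tax_id(value))
```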

 

Potential Duplicate Records

In Part 1, we asked if the data profiling statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records.  In other words, since some distinct actual values for these fields occur on more than one record, does this imply not just a possible data relationship, but a possible data redundancy?  Obviously, we would need to interact with the business team in order to better understand the data and their business rules for identifying duplicate records.

However, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for Account Number and Tax ID
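Outside of a data profiling tool, a similar drill-down can be approximated by grouping on each field and keeping only the values that occur on more than one record.  A minimal sketch with hypothetical records:

```python
from collections import defaultdict

# Hypothetical records: (customer_id, account_number, tax_id)
records = [
    (1001, "AB-123456789", "555443333"),
    (1002, "AB-123456789", "555443333"),   # shares Account Number and Tax ID with 1001
    (1003, "XY-987654321", "999999999"),
    (1004, "CD-111222333", "999999999"),   # shares only a (likely invalid) Tax ID
]

def duplicate_candidates(field_index, label):
    groups = defaultdict(list)
    for record in records:
        groups[record[field_index]].append(record[0])
    # Only values occurring on more than one record are duplicate candidates
    for value, customer_ids in groups.items():
        if len(customer_ids) > 1:
            print(f"{label} {value} occurs on customers {customer_ids}")

duplicate_candidates(1, "Account Number")
duplicate_candidates(2, "Tax ID")
```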

 

What other analysis do you think should be performed for these fields?

 

In Part 7 of this series:  We will continue the adventures in data profiling by completing our initial analysis with the investigation of the Customer Name 1 and Customer Name 2 fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Commendable Comments (Part 2)

In a recent guest post on ProBlogger, Josh Hanagarne “quoted” Jane Austen:

“It is a truth universally acknowledged, that a blogger in possession of a good domain must be in want of some worthwhile comments.”

“The most rewarding thing has been that comments,” explained Hanagarne, “led to me meeting some great people I possibly never would have known otherwise.”  I wholeheartedly echo that sentiment. 

This is the second entry in my ongoing series celebrating my heroes – my readers.

 

Commendable Comments

Proving that comments are the best part of blogging, on The Data-Information Continuum, Diane Neville commented:

“This article is intriguing. I would add more still.

A most significant quote:  'Data could be considered a constant while Information is a variable that redefines data for each specific use.'

This tells us that Information draws from a snapshot of a Data store.  I would state further that the very Information [specification] is – in itself – a snapshot.

The earlier quote continues:  'Data is not truly a constant since it is constantly changing.'

Similarly, it is a business reality that 'Information is not truly a constant since it is constantly changing.'

The article points out that 'The Data-Information Continuum' implies a many-to-many relationship between the two.  This is a sensible CONCEPTUAL model.

Enterprise Architecture is concerned as well with its responsibility for application quality in service to each Business Unit/Initiative.

For example, in the interest of quality design in Application Architecture, an additional LOGICAL model must be maintained between a then-current Information requirement and the particular Data (snapshots) from which it draws.  [Snapshot: generally understood as captured and frozen – and uneditable – at a particular point in time.]  Simply put, Information Snapshots have a PARENT RELATIONSHIP to the Data Snapshots from which they draw.

Analyzing this further, refer to this further piece of quoted wisdom (from section 'Subjective Information Quality'):  '...business units and initiatives must begin defining their Information...by using...Data...as a foundation...necessary for the day-to-day operation of each business unit and initiative.'

From logically-related snapshots of Information to the Data from which it draws, we can see from this quote that yet another PARENT/CHILD relationship exists...that from Business Unit/Initiative Snapshots to the Information Snapshots that implement whatever goals are the order of the day.  But days change.

If it is true that 'Data is not truly a constant since it is constantly changing,' and if we can agree that Information is not truly a constant either, then we can agree to take a rational and profitable leap to the truth that neither is a Business Unit/Initiative...since these undergo change as well, though they represent more slowly-changing dimensions.

Enterprises have an increasing responsibility for regulatory/compliance/archival systems that will qualitatively reproduce the ENTIRE snapshot of a particular operational transaction at any given point in time.

Thus, the Enterprise Architecture function has before it a daunting task:  to devise a holistic process that can SEAMLESSLY model the correct relationship of snapshots between Data (grandchild), Information (parent) and Business Unit/Initiative (grandparent).

There need be no conversion programs or redundant, throw-away data structures contrived to bridge the present gap.  The ability to capture the activities resulting from the undeniable point-in-time hierarchy among these entities is where tremendous opportunities lie.”

On Missed It By That Much, Vish Agashe commented:

“My favorite quote is 'Instead of focusing on the exceptions – focus on the improvements.'

I think that it is really important to define incremental goals for data quality projects and track the progress through percentage improvement over a period of time.

I think it is also important to manage the expectations that the goal is not necessarily to reach 100% (which will be extremely difficult if not impossible) clean data but the goal is to make progress to a point where the purpose for cleaning the data can be achieved in much better way than had the original data been used.

For example, if marketing wanted to use the contact data to create a campaign for those contacts which have a certain ERP system installed on-site.  But if the ERP information on the contact database is not clean (it is free text, in some cases it is absent etc...) then any campaign run on this data will reach only X% contacts at best (assuming only X% of contacts have ERP which is clean)...if the data quality project is undertaken to clean this data, one needs to look at progress in terms of % improvement.  How many contacts now have their ERP field cleaned and legible compared to when we started etc...and a reasonable goal needs to be set based on how much marketing and IT is willing to invest in these issues (which in turn could be based on ROI of the campaign based on increased outreach).”

Proving that my readers are way smarter than I am, on The General Theory of Data Quality, John O'Gorman commented:

“My theory of the data, information, knowledge continuum is more closely related to the element, compound, protein, structure arc.

In my world, there is no such thing as 'bad' data, just as there is no 'bad' elements.  Data is either useful or not: the larger the audience that agrees that a string is representative of something they can use, the more that string will be of value to me.

By dint of its existence in the world of human communication and in keeping with my theory, I can assign every piece of data to one of a fixed number of classes, each with characteristics of their own, just like elements in the periodic table.  And, just like the periodic table, those characteristics do not change.  The same 109 usable elements in the periodic table are found and are consistent throughout the universe, and our ability to understand that universe is based on that stability.

Information is simply data in a given context, like a molecule of carbon in flour.  The carbon retains all of its characteristics but the combination with other elements allows it to partake in a whole class of organic behavior. This is similar to the word 'practical' occurring in a sentence: Jim is a practical person or the letter 'p' in the last two words.

Where the analogue bends a bit is a cause of a lot of information management pain, but can be rectified with a slight change in perspective.  Computers (and almost all indexes) have a hard time with homographs: strings that are identical but that mean different things.  By creating fixed and persistent categories of data, my model suffers no such pain.

Take the word 'flies' in the following: 'Time flies like an arrow.' and 'Fruit flies like a pear.'  The data 'flies' can be permanently assigned to two different places, and their use determines which instance is relevant in the context of the sentence.  One instance is a verb, the other a plural noun.

Knowledge, in my opinion, is the ability to recognize, predict and synthesize patterns of information for past, present and future use, and more importantly to effectively communicate those patterns in one or more contexts to one or more audiences.

On one level, the model for information management that I use makes no apparent distinction between the data: we all use nouns, adjectives, verbs and sometimes scalar objects to communicate.  We may compress those into extremely compact concepts but they can all be unraveled to get at elemental components. At another level every distinction is made to insure precision.

The difference between information and knowledge is experiential and since experience is an accumulative construct, knowledge can be layered to appeal to common knowledge, special knowledge and unique knowledge.

Common being the most easily taught and widely applied; Special being related to one or more disciplines and/or special functions; and, Unique to individuals who have their own elevated understanding of the world and so have a need for compact and purpose-built semantic structures.

Going back to the analogue, knowledge is equivalent to the creation by certain proteins of cartilage, the use to which that cartilage is put throughout a body, and the specific shape of the cartilage that forms my nose as unique from the one on my wife's face.

To me, the most important part of the model is at the element level.  If I can convince a group of people to use a fixed set of elemental categories and to reference those categories when they create information, it's amazing how much tension disappears in the design, creation and deployment of knowledge.”

 

Tá mé buíoch díot

Daragh O Brien recently taught me the Irish Gaelic phrase Tá mé buíoch díot, which translates as I am grateful to you.

I am very grateful to all of my readers.  Since there have been so many commendable comments, please don't be offended if your commendable comment hasn't been featured yet.  Please keep on commenting and stay tuned for future entries in the series.

 

Related Posts

Commendable Comments (Part 1)

Commendable Comments (Part 3)

DQ-Tip: “Don't pass bad data on to the next person...”

Data Quality (DQ) Tips is a new regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Don't pass bad data on to the next person.  And don't accept bad data from the previous person.”

This DQ-Tip is from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset.

In the book, Redman explains that this advice is a rewording of his favorite data quality policy of all time.

Assuming that it is someone else's responsibility is a fundamental root cause of enterprise data quality problems.  One of the primary goals of a data quality initiative must be to define the roles and responsibilities for data ownership and data quality.

In sports, it is common for inspirational phrases to be posted above every locker room exit door.  Players acknowledge and internalize the inspirational phrase by reaching up and touching it as they head out onto the playing field.

Perhaps you should post this DQ-Tip above every break room exit door throughout your organization?

 

Related Posts

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

Data Governance and Data Quality

 

Additional Resources

Who is responsible for data quality?

DQ Problems? Start a Data Quality Recognition Program!

Starting Your Own Personal Data Quality Crusade

The Fragility of Knowledge

In his excellent book The Black Swan: The Impact of the Highly Improbable, Nassim Nicholas Taleb explains:

“What you don’t know is far more relevant than what you do know.”

Our tendency is to believe the opposite.  After we have accumulated the information required to be considered knowledgeable in our field, we believe that what we have learned and experienced (i.e. what we know) is far more relevant than what we don’t know.  We are all proud of our experience, which we believe is the path that separates knowledge from wisdom.

“We tend to treat our knowledge as personal property to be protected and defended,” explains Taleb.  “It is an ornament that allows us to rise in the pecking order.  We take what we know a little too seriously.”

However, our complacency is all too often upset by the unexpected.  Some new evidence is discovered that disproves our working theory of how things work.  Or something that we have repeatedly verified in the laboratory of our extensive experience, suddenly doesn’t produce the usual results.

Taleb cautions that this “illustrates a severe limitation to our learning from experience and the fragility of our knowledge.”

I have personally encountered this many times throughout my career in data quality.  At first, it seemed like a cruel joke or some bizarre hazing ritual.  Every time I thought that I had figured it all out, that I had learned all the rules, something I didn’t expect would come along and smack me upside the head.

“We do not spontaneously learn,” explains Taleb, “that we don’t learn that we don’t learn.  The problem lies in the structure of our minds: we don’t learn rules, just facts, and only facts.”

Facts are important.  Facts are useful.  However, sometimes our facts are really only theories.  Mistaking a theory for a fact can be very dangerous.  What you don’t know can hurt you. 

However, as Taleb explains, “what you know cannot really hurt you.”  Therefore, we tend to only “look at what confirms our knowledge, not our ignorance.”  This is unfortunate, because “there are so many things we can do if we focus on antiknowledge, or what we do not know.”

This is why, as a data quality consultant, when I begin an engagement with a new client, I usually open with the statement (completely without sarcasm):

“Tell me something that I don’t know.” 

Related Posts

Hailing Frequencies Open

Commendable Comments (Part 1)

Six months ago today, I launched this blog by asking: Do you have obsessive-compulsive data quality (OCDQ)?

As of September 10, here are the monthly traffic statistics provided by my blog platform:

OCDQ Blog Traffic Overview

 

It Takes a Village (Idiot)

In my recent Data Quality Pro article Blogging about Data Quality, I explained why I started this blog.  Blogging provides me a way to demonstrate my expertise.  It is one thing for me to describe myself as an expert and another to back up that claim by allowing you to read my thoughts and decide for yourself.

In general, I have always enjoyed sharing my experiences and insights.  A great aspect to doing this via a blog (as opposed to only via whitepapers and presentations) is the dialogue and discussion provided via comments from my readers.

This two-way conversation not only greatly improves the quality of the blog content, but much more importantly, it helps me better appreciate the difference between what I know and what I only think I know. 

Even an expert's opinions are biased by the practical limits of their personal experience.  Having spent most of my career working with what is now mostly IBM technology, I sometimes have to pause and consider if some of that yummy Big Blue Kool-Aid is still swirling around in my head (since I “think with my gut,” I have to “drink with my head”).

Don't get me wrong – “You're my boy, Blue!” – but there are many other vendors and all of them also offer viable solutions driven by impressive technologies and proven methodologies.

Data quality isn't exactly the most exciting subject for a blog.  Data quality is not just a niche – if technology blogging was a Matryoshka (a.k.a. Russian nested) doll, then data quality would be the last, innermost doll. 

This doesn't mean that data quality isn't an important subject – it just means that you will not see a blog post about data quality hitting the front page of Digg anytime soon.

All blogging is more art than science.  My personal blogging style can perhaps best be described as mullet blogging – not “business in the front, party in the back” but “take your subject seriously, but still have a sense of humor about it.”

My blog uses a lot of metaphors and analogies (and sometimes just plain silliness) to try to make an important (but dull) subject more interesting.  Sometimes it works and sometimes it sucks.  However, I have never been afraid to look like an idiot.  After all, idiots are important members of society – they make everyone else look smart by comparison.

Therefore, I view my blog as a Data Quality Village.  And as the Blogger-in-Chief, I am the Village Idiot.

 

The Rich Stuff of Comments

Earlier this year in an excellent IT Business Edge article by Ann All, David Churbuck of Lenovo explained:

“You can host focus groups at great expense, you can run online surveys, you can do a lot of polling, but you won’t get the kind of rich stuff (you will get from blog comments).”

How very true.  But before we get to the rich stuff of our village, let's first take a look at a few more numbers:

  • Not counting this one, I have published 44 posts on this blog
  • Those blog posts have collectively received a total of 185 comments
  • Only 5 blog posts received no comments
  • 30 comments were actually me responding to my readers
  • 45 comments were from LinkedIn groups (23), SmartData Collective re-posts (17), or Twitter re-tweets (5)

The ten blog posts receiving the most comments:

  1. The Two Headed Monster of Data Matching 11 Comments
  2. Adventures in Data Profiling (Part 4) 9 Comments
  3. Adventures in Data Profiling (Part 2) 9 Comments
  4. You're So Vain, You Probably Think Data Quality Is About You 8 Comments
  5. There are no Magic Beans for Data Quality 8 Comments
  6. The General Theory of Data Quality 8 Comments
  7. Adventures in Data Profiling (Part 1) 8 Comments
  8. To Parse or Not To Parse 7 Comments
  9. The Wisdom of Failure 7 Comments
  10. The Nine Circles of Data Quality Hell 7 Comments

 

Commendable Comments

This post will be the first in an ongoing series celebrating my heroes – my readers.

As Darren Rowse and Chris Garrett explained in their highly recommended ProBlogger book: “even the most popular blogs tend to attract only about a 1 percent commenting rate.” 

Therefore, I am completely in awe of my blog's current 88 percent commenting rate.  Sure, I get my fair share of the simple and straightforward comments like “Great post!” or “You're an idiot!” but I decided to start this series because I am consistently amazed by the truly commendable comments that I regularly receive.

On The Data Quality Goldilocks Zone, Daragh O Brien commented:

“To take (or stretch) your analogy a little further, it is also important to remember that quality is ultimately defined by the consumers of the information.  For example, if you were working on a customer data set (or 'porridge' in Goldilocks terms) you might get it to a point where Marketing thinks it is 'just right' but your Compliance and Risk management people might think it is too hot and your Field Sales people might think it is too cold.  Declaring 'Mission Accomplished' when you have addressed the needs of just one stakeholder in the information can often be premature.

Also, one of the key learnings that we've captured in the IAIDQ over the past 5 years from meeting with practitioners and hosting our webinars is that, just like any Change Management effort, information quality change requires you to break the challenge into smaller deliverables so that you get regular delivery of 'just right' porridge to the various stakeholders rather than boiling the whole thing up together and leaving everyone with a bad taste in their mouths.  It also means you can more quickly see when you've reached the Goldilocks zone.”

On Data Quality Whitepapers are Worthless, Henrik Liliendahl Sørensen commented:

“Bashing in blogging must be carefully balanced.

As we all tend to find many things from gurus to tools in our own country, I have also found one of my favourite sayings from Søren Kierkegaard:

If One Is Truly to Succeed in Leading a Person to a Specific Place, One Must First and Foremost Take Care to Find Him Where He is and Begin There.

This is the secret in the entire art of helping.

Anyone who cannot do this is himself under a delusion if he thinks he is able to help someone else.  In order truly to help someone else, I must understand more than he–but certainly first and foremost understand what he understands.

If I do not do that, then my greater understanding does not help him at all.  If I nevertheless want to assert my greater understanding, then it is because I am vain or proud, then basically instead of benefiting him I really want to be admired by him.

But all true helping begins with a humbling.

The helper must first humble himself under the person he wants to help and thereby understand that to help is not to dominate but to serve, that to help is not to be the most dominating but the most patient, that to help is a willingness for the time being to put up with being in the wrong and not understanding what the other understands.”

On All I Really Need To Know About Data Quality I Learned In Kindergarten, Daniel Gent commented:

“In kindergarten we played 'Simon Says...'

I compare it as a way of following the requirements or business rules.

Simon says raise your hands.

Simon says touch your nose.

Touch your feet.

With that final statement you learned very quickly in kindergarten that you can be out of the game if you are not paying attention to what is being said.

Just like in data quality, to have good accurate data and to keep the business functioning properly you need to pay attention to what is being said, what the business rules are.

So when Simon says touch your nose, don't be touching your toes, and you'll stay in the game.”

Since there have been so many commendable comments, I could only list a few of them in the series debut.  Therefore, please don't be offended if your commendable comment didn't get featured in this post.  Please keep on commenting and stay tuned for future entries in the series.

 

Because of You

As Brian Clark of Copyblogger explains, The Two Most Important Words in Blogging are “You” and “Because.”

I wholeheartedly agree, but prefer to paraphrase it as: Blogging is “because of you.” 

Not you meaning me, the blogger – you meaning you, the reader.

Thank You.

 

Related Posts

Commendable Comments (Part 2)

Commendable Comments (Part 3)


Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?

 

Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects with many characteristics including the following:

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independent of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield.  Second base and short stop are also infield positions, and more specifically middle infield.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).

 

Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.
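As a rough sketch of what Type 2 handling looks like, here is how a change to a tracked attribute expires the current dimension row and inserts a new version; the table and column names are my own illustrative choices, not taken from the actual data warehouse:

```python
from datetime import date

# A tiny Player dimension: each row is one version of a player
player_dim = [
    {"player_key": 1, "player_id": 42, "professional_team": "Red Sox",
     "effective_date": date(2009, 4, 1), "end_date": None, "is_current": True},
]

def apply_type2_update(dim, player_id, changes, as_of):
    """Expire the current row for the player and insert a new version."""
    current = next(r for r in dim if r["player_id"] == player_id and r["is_current"])
    current["end_date"] = as_of                     # close out the old version
    current["is_current"] = False
    new_row = dict(current, **changes)              # copy attributes, apply the changes
    new_row.update({"player_key": max(r["player_key"] for r in dim) + 1,
                    "effective_date": as_of, "end_date": None, "is_current": True})
    dim.append(new_row)

# The player changes professional teams mid-season; full history is preserved
apply_type2_update(player_dim, 42, {"professional_team": "Yankees"}, date(2009, 7, 31))
for row in player_dim:
    print(row)
```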

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players, since a Fantasy Player dimension would have redundantly stored multiple instances of the same professional player for each fantasy team he played for, and would have required Fantasy League and Fantasy Team to be used as snowflaked dimensions.
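A minimal pandas sketch of the factless fact idea: roster membership is recorded purely by the presence of rows, so a question like "who was on this fantasy team's roster on a given date?" is answered by filtering and joining to the conformed Player dimension.  The table and column names are illustrative:

```python
import pandas as pd

# Factless fact table: one row per (date, fantasy_team_key, player_key) on the roster
fantasy_team_roster = pd.DataFrame({
    "date": pd.to_datetime(["2009-07-31", "2009-07-31", "2009-08-01"]),
    "fantasy_team_key": [10, 10, 10],
    "player_key": [1, 2, 2],   # player 1 drops off the roster after 2009-07-31
})

# Conformed Player dimension shared by professional and fantasy fact tables
player_dim = pd.DataFrame({"player_key": [1, 2],
                           "player_name": ["Player A", "Player B"]})

# Who was on team 10's roster on 2009-08-01?  Answered purely by row existence.
on_date = fantasy_team_roster[(fantasy_team_roster["date"] == "2009-08-01")
                              & (fantasy_team_roster["fantasy_team_key"] == 10)]
print(on_date.merge(player_dim, on="player_key")["player_name"].tolist())
```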

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate for the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
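A minimal pandas sketch of the rolling-snapshot idea, computing trailing 7-day totals from a daily batting fact table (in the actual warehouse these were persisted as separate aggregate fact tables rather than computed on the fly; the column names are illustrative):

```python
import pandas as pd

# Daily batting statistics fact table (one row per player per day)
batting_facts = pd.DataFrame({
    "date": pd.date_range("2009-08-01", periods=10, freq="D"),
    "player_key": [1] * 10,
    "hits":    [1, 0, 2, 1, 3, 0, 1, 2, 0, 1],
    "at_bats": [4, 3, 5, 4, 4, 3, 4, 5, 4, 4],
})

# Trailing 7-day totals per player: the equivalent of a rolling snapshot aggregate
rolling_7d = (batting_facts
              .set_index("date")
              .groupby("player_key")[["hits", "at_bats"]]
              .rolling("7D").sum()
              .rename(columns={"hits": "hits_7d", "at_bats": "at_bats_7d"}))
print(rolling_7d.tail(3))
```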

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.

Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created have been defined with widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
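A minimal sketch of that metadata idea: store each statistic's definition, including its formula, alongside the computation so there is no confusion about which variant is in use.  The structure here is purely illustrative:

```python
# Metadata-style definitions for derived statistics, including their formulas
STAT_DEFINITIONS = {
    "batting_average": {
        "formula": "hits / at_bats",
        "compute": lambda hits, at_bats: hits / at_bats if at_bats else 0.0,
    },
}

hits, at_bats = 150, 500
stat = STAT_DEFINITIONS["batting_average"]
print(stat["formula"], "=", round(stat["compute"](hits, at_bats), 3))  # 0.3
```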

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.

 

Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 

This is a significant advantage for fantasy sports – there are numerous external data sources that can be used for validation freely available online.  Of course, this wasn't always the case. 

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.

 

Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines the data quality dimensions, including the following, which are the most applicable to fantasy sports:

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.

 

Conclusion

I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”


To Parse or Not To Parse

“To Parse, or Not To Parse,—that is the question:
Whether 'tis nobler in the data to suffer
The slings and arrows of free-form fields,
Or to take arms against a sea of information,
And by parsing, understand them?”

Little known fact: before William Shakespeare made it big as a playwright, he was a successful data quality consultant. 

Alas, poor data quality!  The Bard of Avon knew it quite well.  And he was neither a fan of free verse nor free-form fields.

 

Free-Form Fields

A free-form field contains multiple (usually interrelated) sub-fields.  Perhaps the most common examples of free-form fields are customer name and postal address.

A Customer Name field with the value “Christopher Marlowe” is comprised of the following sub-fields and values:

  • Given Name = “Christopher”
  • Family Name = “Marlowe”

A Postal Address field with the value “1587 Tambur Lane” is comprised of the following sub-fields and values:

  • House Number = “1587”
  • Street Name = “Tambur”
  • Street Type = “Lane”

Obviously, both of these examples are simplistic.  Customer name and postal address are comprised of additional sub-fields, not all of which will be present on every record or represented consistently within and across data sources.
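To make the idea concrete, here is a minimal parsing sketch for the two simplified examples above; a real data quality tool would rely on much richer classification logic and reference data than these naive rules:

```python
import re

def parse_customer_name(value):
    """Naively split a simple 'Given Family' name into its sub-fields."""
    given, _, family = value.partition(" ")
    return {"Given Name": given, "Family Name": family}

def parse_postal_address(value):
    """Naively split 'HouseNumber StreetName StreetType' into its sub-fields."""
    match = re.match(r"^(?P<house_number>\d+)\s+(?P<street_name>.+)\s+(?P<street_type>\w+)$",
                     value)
    return match.groupdict() if match else {"unparsed": value}

print(parse_customer_name("Christopher Marlowe"))
print(parse_postal_address("1587 Tambur Lane"))
```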

Returning to the bard's question, a few of the data quality reasons to consider parsing free-form fields include:

  • Data Profiling
  • Data Standardization
  • Data Matching

 

Much Ado About Analysis

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  In Adventures in Data Profiling (Part 5), a data profiling tool was used to analyze the field Postal Address Line 1:

Field Formats for Postal Address Line 1

 

The Taming of the Variations

Free-form fields often contain numerous variations resulting from data entry errors, different conventions for representing the same value, and a general lack of data quality standards.  Additional variations are introduced by multiple data sources, each with its own unique data characteristics and quality challenges.

Data standardization parses free-form fields to break them down into their smaller individual sub-fields to gain improved visibility of the available input data.  Data standardization is the taming of the variations that creates a consistent representation, applies standard values where appropriate, and when possible, populates missing values.

The following example shows parsed and standardized postal addresses:

Parsed and Standardized Postal Address

In your data quality implementations, do you use this functionality for processing purposes only?  If you retain the standardized results, do you store the parsed and standardized sub-fields or just the standardized free-form value?

 

Shall I compare thee to other records?

Data matching often uses data standardization to prepare its input.  This allows for more direct and reliable comparisons of parsed sub-fields with standardized values, decreases the failure to match records because of data variations, and increases the probability of effective match results.
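Before the product description example below, here is a minimal sketch (reusing the earlier postal address) of why this matters: an exact comparison of the raw free-form values fails, while comparing the parsed and standardized sub-fields succeeds.  The standardization lookup is a tiny illustrative stand-in for real reference data:

```python
# Tiny illustrative standardization lookup for street types
STREET_TYPE_STANDARDS = {"LANE": "LN", "LN": "LN", "STREET": "ST", "ST": "ST"}

def standardize_address(value):
    """Parse a simple address and apply standard abbreviations to the street type."""
    house_number, street_name, street_type = value.upper().split()
    return (house_number, street_name,
            STREET_TYPE_STANDARDS.get(street_type, street_type))

record_a = "1587 Tambur Lane"
record_b = "1587 TAMBUR LN"

print(record_a == record_b)                                             # False
print(standardize_address(record_a) == standardize_address(record_b))  # True
```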

Imagine matching the following product description records with and without the parsed and standardized sub-fields:

Parsed and Standardized Product Description

 

Doth the bard protest too much? 

Please share your thoughts and experiences regarding free-form fields.

Adventures in Data Profiling (Part 5)

In Part 4 of this series:  You went totally postal...shifting your focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

You learned that when a field is both 100% complete and has an extremely low cardinality, its most frequently occurring value could be its default value, and that forcing international addresses to be entered into country-specific data structures can cause data quality problems.  With the expert assistance of Graham Rhind, we all learned more about international postal code formats.

In Part 5, you will continue your adventures in data profiling by completing your initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

As we discussed in Part 3 when we looked at the E-mail Address field, most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field. 

Postal Address Line 1 and Postal Address Line 2 are additional examples of the necessity of this analysis technique.  Not only is the cardinality of these fields very high, but they also have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record).  Some variations in postal addresses can be the result of data entry errors, the use of local conventions, or ignoring (or lacking) postal standards.

Additionally, postal address lines can sometimes contain overflow from other fields (e.g. Customer Name) or they can be used as a dumping ground for values without their own fields (e.g. Twitter username), values unable to conform to the limitations of their intended fields (e.g. countries with something analogous to a US state or CA province but incompatible with a two character field length), or comments (e.g. LDIY, which, as Steve Sarsfield discovered, warns us about the Large Dog In Yard).

 

Postal Address Line 1

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 1:

Field Formats for Postal Address Line 1

The top twenty most frequently occurring field formats for Postal Address Line 1 collectively account for over 80% of the records with an actual value in this field for this data source.  All of these field formats appear to be common potentially valid structures.  Obviously, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 1 field?

 

Postal Address Line 2

The data profiling tool has provided you the following drill-down “screen” for Postal Address Line 2:

Field Formats for Postal Address Line 2

The top ten most frequently occurring field formats for Postal Address Line 2 collectively account for half of the records with an actual value in this sparsely populated field for this data source.  Some of these field formats show several common potentially valid structures.  Again, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Postal Address Line 2 field?

 

Postal Address Validation

Many data quality initiatives include the implementation of postal address validation software.  This provides the capability to parse, identify, verify, and format a valid postal address by leveraging country-specific postal databases. 

Some examples of postal validation functionality include correcting misspelled street and city names, populating missing postal codes, and applying (within context) standard abbreviations for sub-fields such as directionals (e.g. N for North and E for East), street types (e.g. ST for Street and AVE for Avenue), and box types (e.g. BP for Boite Postale and CP for Case Postale).  These standards not only vary by country, but can also vary within a country when there are multiple official languages.

The presence of non-postal data can sometimes cause either validation failures (i.e. an inability to validate some records, not a process execution failure) or simply deletion of the unexpected values.  Therefore, some implementations will use a pre-process to extract the non-postal data prior to validation.

Most validation software will append one or more status fields indicating what happened to the records during processing.  It is a recommended best practice to perform post-validation analysis by not only looking at these status fields, but also comparing the record content before and after validation, in order to determine what modifications and enhancements have been performed.
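A minimal sketch of that post-validation analysis, summarizing an assumed status field and diffing record content before and after validation; the status codes, field names, and records are all hypothetical rather than the output of any particular validation product:

```python
from collections import Counter

before = [
    {"id": 1, "line1": "123 main stret", "city": "Sprinfield", "postal_code": ""},
    {"id": 2, "line1": "456 Oak Ave",    "city": "Toronto",    "postal_code": "M5H 2N2"},
]
after = [
    {"id": 1, "line1": "123 MAIN ST",  "city": "SPRINGFIELD", "postal_code": "01103",
     "validation_status": "CORRECTED"},
    {"id": 2, "line1": "456 OAK AVE",  "city": "TORONTO",     "postal_code": "M5H 2N2",
     "validation_status": "VALIDATED"},
]

# Summarize the status field appended by the validation software
print(Counter(record["validation_status"] for record in after))

# Compare record content before and after to see what was modified or enhanced
for b, a in zip(before, after):
    changes = {field: (b[field], a[field]) for field in b if b[field] != a[field]}
    print(a["id"], changes)
```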

 

What other analysis do you think should be performed for postal address?

 

In Part 6 of this series:  We will continue the adventures by analyzing the Account Number and Tax ID fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

The Only Thing Necessary for Poor Data Quality

“Demonstrate projected defects and business impacts if the business fails to act,” explains Dylan Jones of Data Quality Pro in his recent and remarkable post How To Deliver A Compelling Data Quality Business Case:

“Presenting a future without data quality management...leaves a simple take-away message – do nothing and the situation will deteriorate.”

I cannot help but be reminded of the famous quote often attributed to the 18th century philosopher Edmund Burke:

“The only thing necessary for the triumph of evil, is for good men to do nothing.”

Or the even more famous quote often attributed to the long time ago Jedi Master Yoda:

“Poor data quality is the path to the dark side.  Poor data quality leads to bad business decisions. 

Bad business decisions lead to lost revenue.  Lost revenue leads to suffering.”

When you present the business case for your data quality initiative to executive management and other corporate stakeholders, demonstrate that poor data quality is not a theoretical problem – it is a real business problem that negatively impacts the quality of decision-critical enterprise information.

Preventing poor data quality is mission-critical.  Poor data quality will undermine the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

“The only thing necessary for Poor Data Quality – is for good businesses to Do Nothing.”

Related Posts

Hyperactive Data Quality (Second Edition)

Data Quality: The Reality Show?

Data Governance and Data Quality

Resistance is NOT Futile

Locutus of Borg “Your opinion is irrelevant.  We wish to improve ourselves. 

We will add your business and technological distinctiveness to our own. 

Your culture will adapt to service us. 

You will be assimilated.  Resistance is futile.”

Continuing my Star Trek theme, which began with my previous post Hailing Frequencies Open, imagine that you have been called into the ready room to be told your enterprise has decided to implement the proven data quality framework known as Business Operations and Reporting Governance – the BORG.

 

Frameworks are NOT Futile

Please understand – I am an advocate for methodology and best practices, and there are certainly many excellent frameworks that are far from futile.  I have worked on many data quality initiatives that were following a framework and have seen varying degrees of success in their implementation.

However, the fictional BORG framework that I am satirizing exemplifies a general problem that I have with any framework that advocates a one-size-fits-all strategy, which I believe is an approach that is doomed to fail.

Any implemented framework must be customized to adapt to an organization's unique culture.  In part, this is necessary because implementing changes of any kind will be met with initial resistance.  Forcing a one-size-fits-all approach sends a message to the organization that everything it is currently doing is wrong, which will of course only increase the resistance to change.

 

Resistance is NOT Futile

Everyone has opinions – and opinions are never irrelevant.  Fundamentally, all change starts with changing people's minds. 

The starting point has to be improving communication and encouraging open dialogue.  This means listening to what people throughout the organization have to say and not just telling them what to do.  Keeping data aligned with business processes and free from poor quality requires getting people aligned and free to communicate their concerns.

Obviously, there will be dissension.  However, you must seek a mutual understanding by practicing empathic listening.  The goal is to foster an environment in which a diversity of viewpoints is freely shared without bias.

“One of the real dangers is emphasizing consensus over dissent,” explains James Surowiecki in his excellent book The Wisdom of Crowds.  “The best collective decisions are the product of disagreement and contest, not consensus or compromise.  Group deliberations are more successful when they have a clear agenda and when leaders take an active role in making sure that everyone gets a chance to speak.”

 

Avoid Assimilation

In order to be successful in your attempt to implement any framework, you must have realistic expectations. 

Starting with a framework simply provides a reference of best practices and recommended options of what has worked on successful data quality initiatives.  But the framework must still be reviewed in order to determine what can be learned from it and to select what will work in the current environment and what simply won't. 

This doesn't mean that the customized components of the framework will be implemented simultaneously.  All change will be gradual and implemented in phases – without the use of BORG nanoprobes.  You will NOT be assimilated. 

Your organization's collective consciousness will be best served by adapting the framework to your corporate culture. 

Your data quality initiative will facilitate the collaboration of business and technical stakeholders, align data usage with business metrics, and enable people to take responsibility for data ownership and data quality. 

Best practices will be disseminated throughout your collective – while also maintaining your individual distinctiveness.

 

Related Posts

Hailing Frequencies Open

Data Governance and Data Quality

Not So Strange Case of Dr. Technology and Mr. Business

The Three Musketeers of Data Quality

You're So Vain, You Probably Think Data Quality Is About You

Hailing Frequencies Open

“This is Captain James E. Harris of the Data Quality Starship Collaboration...”

Clearly, I am a Star Trek nerd – but I am also a people person.  Although people, process, and technology are all important for successful data quality initiatives, without people, process and technology are useless. 

Collaboration is essential.  More than anything else, it requires effective communication – which begins with effective listening.

 

Seek First to Understand...Then to Be Understood

This is Habit 5 from Stephen Covey's excellent book The 7 Habits of Highly Effective People.  “We typically seek first to be understood,” explains Covey.  “Most people do not listen with the intent to understand; they listen with the intent to reply.”

We are all proud of our education, knowledge, understanding, and experience.  Since it is commonly believed that experience is the path that separates knowledge from wisdom, we can't wait to share our wisdom with the world.  However, as Covey cautions, our desire to be understood can make “our conversations become collective monologues.”

Covey explains that listening is an activity that can be practiced at one of the following five levels:

  1. Ignoring – we are not really listening at all.
  2. Pretending – we are only waiting for our turn to speak, constantly nodding and saying: “Yeah. Uh-huh. Right.” 
  3. Selective Listening – we are only hearing certain parts of the conversation, such as when we're listening to the constant chatter of a preschool child.
  4. Attentive Listening – we are paying attention and focusing energy on the words that are being said.
  5. Empathic Listening – we are actually listening with the intent to really try to understand the other person's frame of reference.  We look out through it, we see the world the way they see the world, we understand their paradigm, we understand how they feel.

“Empathy is not sympathy,” explains Covey.  “Sympathy is a form of agreement, a form of judgment.  And it is sometimes the more appropriate response.  But people often feed on sympathy.  It makes them dependent.  The essence of empathic listening is not that you agree with someone; it's that you fully, deeply, understand that person, emotionally as well as intellectually.”

 

Vulcans

Some people balk at discussing the use of emotion in a professional setting, where typically it is believed that rational analysis must protect us from irrational emotions.  To return to a Star Trek metaphor, these people model their professional behavior after the Vulcans. 

Vulcans live according to the philosopher Surak's code of emotional self-control.  Starting at a very young age, they are taught meditation and other techniques in order to suppress their emotions and live a life guided by reason and logic alone.

 

Be Truly Extraordinary

In all professions, it is fairly common to encounter rational and logically intelligent people. 

Truly extraordinary people masterfully blend both kinds of intelligence – intellectual and emotional.  A well-grounded sense of self-confidence, an empathetic personality, and excellent communication skills exert a more powerful positive influence than remarkable knowledge and expertise alone.

 

Your Away Mission

As a data quality consultant, when I begin an engagement with a new client, I often joke that I shouldn't be allowed to speak for the first two weeks.  This is my way of explaining that I will be asking more questions than providing answers. 

I am seeking first to understand the current environment from both the business and technical perspectives.  Only after I have achieved this understanding will I seek to be understood regarding my extensive experience with the best practices that I have seen work on successful data quality initiatives.

As fellow Star Trek nerds know, the captain doesn't go on away missions.  Therefore, your away mission is to try your best to practice empathic listening at your next data quality discussion – “Make It So!”

Data quality initiatives require a holistic approach involving people, process, and technology.  You must consider the people factor first and foremost, because it will be the people involved, and not the process or the technology, that will truly allow your data quality initiative to “Live Long and Prosper.”

 

As always, hailing frequencies remain open to your comments.  And yes, I am trying my best to practice empathic listening.

 

Related Posts

Not So Strange Case of Dr. Technology and Mr. Business

The Three Musketeers of Data Quality

Data Quality is People!

You're So Vain, You Probably Think Data Quality Is About You

Adventures in Data Profiling (Part 4)

In Part 3 of this series:  The adventures continued with a detailed analysis of the fields Birth Date, Telephone Number and E-mail Address.  This provided you with an opportunity to become familiar with analysis techniques that use a combination of field values and field formats. 

You also saw examples of how valid values in a valid format can have an invalid context, how valid field formats can conceal invalid field values, and how free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.

In Part 4, you will continue your adventures in data profiling by going postal...postal address that is, by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

 

Previously, the data profiling tool provided you with the following statistical summaries for postal address:

Postal Address Summary

 

Country Code

Field Values for Country Code

In Part 1, we wondered if 5 distinct Country Code field values indicated international postal addresses.  This drill-down “screen” provided by the data profiling tool shows the frequency distribution.  First of all, the field name might have led us to assume we would only see ISO 3166 standard country codes.

 

However, two of the field values are country names and not country codes.  This is another example of why verifying that the data matches the metadata describing it is an essential analytical task – data profiling provides a much needed reality check for the perceptions and assumptions that we may have about our data. 

Secondly, the field values would appear to indicate that most of the postal addresses are from the United States.  However, if you recall from Part 3, we discovered some potential clues during our analysis of Telephone Number, which included two formats that appear invalid based on North American standards, and E-mail Address, which included country code Top Level Domain (TLD) values for Canada and the United Kingdom.

Additionally, whenever a field is both 100% complete and has an extremely low cardinality, it could be an indication that the most frequently occurring value is the field's default value. 

Therefore, is it possible that US is simply the default value for Country Code for this data source?
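One simple way to flag such suspects is to check whether a field is (nearly) 100% complete, has an extremely low cardinality, and is dominated by a single value.  The following minimal sketch uses illustrative thresholds, not established rules.

    from collections import Counter

    def suspect_default(values, completeness_floor=0.99, dominance_floor=0.9):
        """Return the most frequent value as a possible default when the field
        is (nearly) complete, has low cardinality, and one value dominates."""
        populated = [v for v in values if v not in (None, "")]
        if not values or len(populated) / len(values) < completeness_floor:
            return None
        counts = Counter(populated)
        top_value, top_count = counts.most_common(1)[0]
        if len(counts) <= 10 and top_count / len(populated) >= dominance_floor:
            return top_value
        return None

    # Hypothetical Country Code sample dominated by "US".
    print(suspect_default(["US"] * 95 + ["CA", "CA", "UK", "D", "CANADA"]))  # US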

 

Zip Code

Field Formats for Zip Code

From the Part 1 comments, it was noted that Zip Code as a field name is unique to the postal code system used in the United States (US).  This drill-down “screen” provided by the data profiling tool shows the field has a total of only ten field formats.

The only valid field formats for ZIP (which, by the way, is an acronym for Zone Improvement Plan) are 5 digits, or 9 digits when the 4 digit ZIP+4 add-on code is also present, which according to US postal standards should be separated from the 5 digit ZIP Code by a hyphen.
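For illustration, a simple pattern check along those lines might look like the following sketch.  It only tests the structure of the value, so a well-formed but non-existent ZIP Code would still pass.

    import re

    # 5 digit ZIP, optionally followed by a hyphen and the 4 digit ZIP+4 add-on.
    ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")

    for candidate in ["12345", "12345-6789", "123456789", "1234"]:
        print(candidate, bool(ZIP_PATTERN.match(candidate)))
    # 12345 True, 12345-6789 True, 123456789 False, 1234 False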

 

The actual field formats in the Zip Code field of this data source reveal another example of how we should not make assumptions about our data based on the metadata that describes it.  Although the three most frequently occurring field formats appear to be representative of potentially valid US postal codes, the alphanumeric postal code field formats are our first indication that it is, perhaps sadly, not all about US (pun intended, my fellow Americans).

The two most frequently occurring alphanumeric field formats appear to be representative of potentially valid Canadian postal codes.  An interesting thing to note is that their combined frequency distribution is double the number of records having CA as a Country Code field value.  Therefore, if these field formats are representative of a valid Canadian postal code, then some Canadian records have a contextually invalid field value in Country Code.

The other alphanumeric field formats appear to be representative of potentially valid postal codes for the United Kingdom (UK).  To the uninitiated, the postal codes of Canada (CA) and the UK appear very similar.  Both postal code formats contain two parts, which according to their postal standards should be separated by a single character space. 

In CA postal codes, the first part is called the Forward Sortation Area (FSA) and the second part is called the Local Delivery Unit (LDU).  In UK postal codes, the first part is called the outward code and the second part is called the inward code. 

One easy way to spot the difference is that a UK inward code always has the format of a digit followed by two letters (i.e. “naa” in the field formats generated by my fictional data profiling tool), whereas a CA LDU always has the format of a digit followed by a letter followed by another digit (i.e. “nan”). 

However, we should never rule out the possibility of transposed values making a CA postal code look like a UK postal code, or vice versa.  Also, never forget the common data quality challenge of valid field formats concealing invalid field values.
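To illustrate the distinction (and nothing more), here is a hedged sketch that classifies a value as a Canadian-style or UK-style postal code format using simplified patterns.  As noted above, matching a format still tells us nothing about whether the code is real or deliverable.

    import re

    # Canadian postal code: FSA "letter digit letter", LDU "digit letter digit".
    CA_PATTERN = re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$")
    # UK postcode (simplified): outward code, then inward code "digit letter letter".
    UK_PATTERN = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")

    def classify_postal_code(value):
        """Rough classification by structure only."""
        code = value.strip().upper()
        if CA_PATTERN.match(code):
            return "CA format"
        if UK_PATTERN.match(code):
            return "UK format"
        return "other"

    for code in ["K1A 0B1", "SW1A 1AA", "90210"]:
        print(code, classify_postal_code(code))
    # K1A 0B1 CA format, SW1A 1AA UK format, 90210 other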

Returning to the most frequently occurring field format of 5 digits, can we assume all valid field values would represent US postal addresses?  Of course not.  One significant reason is that a 5 digit postal code is one of the most common formats in the world. 

Just some of the other countries also using a 5 digit postal code include: Algeria, Cuba, Egypt, Finland, France, Germany, Indonesia, Israel, Italy, Kuwait, Mexico, Spain, and Turkey.

What about the less frequently occurring field formats of 4 digits and 6 digits?  It is certainly possible that these field formats could indicate erroneous attempts at entering a valid US postal code.  However, it could also indicate the presence of additional non-US postal addresses.

Just some of the countries using a 4 digit postal code include: Australia, Austria, Belgium, Denmark, El Salvador, Georgia (no, the US state did not once again secede, there is also a country called Georgia and it's not even in the Americas), Hungary, Luxembourg, Norway, and Venezuela.  Just some of the countries using a 6 digit postal code include: Belarus, China, India, Kazakhstan (yes, Borat fans, Kazakhstan is a real country), Russia, and Singapore.

Additionally, why do almost 28% of the records in this data source not have a field value for Zip Code? 

One of the possibilities is that we could have postal addresses from countries that do not have a postal code system.  Just a few examples would be: Aruba, Bahamas (sorry fellow fans of the Beach Boys, but both Jamaica and Bermuda have a postal code system, and therefore I could not take you down to Kokomo), Fiji (home of my favorite bottled water), and Ireland (home of my ancestors and inventors of my second favorite coffee).

 

State Abbreviation

Field Values for State Abbreviation

From the Part 1 comments, it was noted that the cardinality of State Abbreviation appeared suspect because, if we assume that its content matches its metadata, then we would expect only 51 distinct values (i.e. the 50 actual US states plus the District of Columbia, without counting US territories) and not the 72 distinct values discovered by the data profiling tool.

Let's assume that drill-downs have revealed the single profiled field data type was CHAR, and the profiled minimum/maximum field lengths were both 2.  Therefore, State Abbreviation, when populated, always contains a two character field value.

This drill-down “screen” first displays the top ten most frequently occurring values in the State Abbreviation field, which are all valid US state abbreviations.  The frequency distributions are also within general expectations since eight of the largest US states by population are represented.

 

However, our previous analysis of Country Code and Zip Code has already made us aware that international postal addresses exist in this data source.  Therefore, this drill-down “screen” also displays the top ten most frequently occurring non-US values based on the data profiling tool comparing all 72 distinct values against a list of valid US state and territory abbreviations.

Most of the field values discovered by this analysis appear to be valid CA province codes (including PQ being used as a common alternative for QC – the province of Quebec or Québec si vous préférez).  These frequency distributions are also within general expectations since six of the largest CA provinces by population are represented.  Their combined frequency distribution is also fairly close to the combined frequency distribution of potentially valid Canadian postal codes found in the Zip Code field.

However, we still have three additional values (ZZ, SA, HD) which require more analysis.  Additionally, almost 22% of the records in this data source do not have a field value for State Abbreviation, which could be attributable to the fact that even when the postal standards for other countries include something analogous to a US state or CA province, it might not be compatible with a two character field length.
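As an illustration of that comparison, the following sketch buckets two character values into US, Canadian, or unrecognized codes.  The reference lists shown are partial and purely illustrative; a real check would use complete lists of US state and territory abbreviations and CA province codes.

    # Partial, illustrative reference lists (not complete).
    US_STATES = {"CA", "TX", "NY", "FL", "IL", "PA", "OH", "MI", "DC", "PR"}
    CA_PROVINCES = {"ON", "QC", "PQ", "BC", "AB", "MB", "SK", "NS", "NB", "NL", "PE"}

    def classify_state_abbreviation(values):
        """Bucket two character codes into US, Canadian, or unrecognized values."""
        buckets = {"US": set(), "CA": set(), "unrecognized": set()}
        for value in values:
            code = value.strip().upper()
            if code in US_STATES:
                buckets["US"].add(code)
            elif code in CA_PROVINCES:
                buckets["CA"].add(code)
            else:
                buckets["unrecognized"].add(code)
        return buckets

    print(classify_state_abbreviation(["NY", "ON", "PQ", "ZZ", "SA", "HD"]))
    # e.g. {'US': {'NY'}, 'CA': {'ON', 'PQ'}, 'unrecognized': {'ZZ', 'SA', 'HD'}}
    # (set display order may vary)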

 

City Name

Let's assume that we have performed some preliminary analysis on the statistical summaries and frequency distributions provided by the data profiling tool for the City Name field using the techniques illustrated throughout this series so far. 

Let's also assume analyzing the City Name field in isolation didn't reveal anything suspicious.  The field is consistently populated and its frequently occurring values appeared to meet general expectations.  Therefore, let's assume we have performed additional drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for City Name

Based on reviewing these records, what conclusions, assumptions, and questions do you have about the City Name field?

 

What other questions can you think of for these fields?  What other analysis do you think should be performed for these fields?

 

In Part 5 of this series:  We will continue the adventures in data profiling by completing our initial analysis of postal address by investigating the following fields: Postal Address Line 1 and Postal Address Line 2.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

 

International Man of Postal Address Standards

Since I am a geographically-challenged American, the first (and often the only necessary) option I choose for assistance with international postal address standards is Graham Rhind.

His excellent book The Global Source-Book for Address Data Management is an invaluable resource and recognized standard reference that contains over 1,000 pages of data pertaining to over 240 countries and territories.

Imagining the Future of Data Quality

Earlier this week on Data Quality Pro, Dylan Jones published an Interview with Larry English, one of the earliest pioneers of information quality management, one of the most prominent thought leaders in the industry, and one of the co-founders (along with Thomas Redman) of the International Association for Information and Data Quality (IAIDQ).

The interview also unintentionally sparked some very common debates, including the differences between data and information, data quality (DQ) and information quality (IQ), as well as proactive and reactive approaches to quality management. 

Of course, I added my own strong opinions to these debates, including a few recent posts – The General Theory of Data Quality and Hyperactive Data Quality (Second Edition).

On a much lighter note, and with apologies to fellow fans of John Lennon, I also offer the following song:

Imagining the Future of Data Quality

Imagine there's no defects
It's easy if you try
No data cleansing beneath us
Above us only sky
Imagine all the data
Living with quality

Imagine there's no companies
It isn't hard to do
Nothing to manage or govern
And no experts too
Imagine all the data
Living life in peace

You may say that I'm a dreamer
But I'm not the only one
I hope someday you'll join us
And the DQ/IQ world will be as one

Imagine no best practices
I wonder if you can
No need for books or lectures
A brotherhood of man
Imagine all the data
Sharing all the world

You may say that I'm a dreamer
But I'm not the only one
I hope someday you'll join us
And the DQ/IQ world will live as one