Adventures in Data Profiling (Part 7)

In Part 6 of this series:  You completed your initial analysis of the Account Number and Tax ID fields.

Previously during your adventures in data profiling, you have looked at customer names within the context of other fields.  In Part 2, you looked at the associated customer names during drill-down analysis on the Gender Code field while attempting to verify abbreviations as well as assess NULL and numeric values.  In Part 6, you investigated customer names during drill-down analysis for the Account Number and Tax ID fields while assessing the possibility of duplicate records.

In Part 7 of this award-eligible series, you will complete your initial analysis of this data source with direct investigation of the Customer Name 1 and Customer Name 2 fields.

 

Previously, the data profiling tool provided you with the following statistical summaries for customer names:

Customer Name Summary

As we discussed when we looked at the E-mail Address field (in Part 3) and the Postal Address Line fields (in Part 5), most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

Customer Name 1 and Customer Name 2 are additional examples of the necessity of this analysis technique.  Not only is the cardinality of these fields very high, but they also have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record).
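For readers who like to see the arithmetic behind these statistics, here is a minimal sketch (in Python, my assumption, since this series never shows any actual tool code) of how cardinality and Distinctness are typically computed:

```python
from collections import Counter

def profile_distinctness(values):
    """Compute basic profiling statistics for one field.

    NULL and empty values are excluded, mirroring the convention of
    counting only records with an actual value in the field.
    """
    populated = [v for v in values if v not in (None, "")]
    counts = Counter(populated)
    cardinality = len(counts)                        # number of distinct values
    distinctness = cardinality / len(populated) if populated else 0.0
    return {"populated": len(populated),
            "cardinality": cardinality,
            "distinctness": distinctness}            # near 1.0 means values rarely repeat

# Example: a Distinctness near 1.0 tells us a straight value-frequency
# distribution won't reveal much, which is why we analyze field formats instead.
print(profile_distinctness(["Harris Edward James", "J. Smith", "J. Smith", ""]))
# roughly: {'populated': 3, 'cardinality': 2, 'distinctness': 0.67}
```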

 

Customer Name 1

The data profiling tool has provided you the following drill-down “screen” for Customer Name 1:

Field Formats for Customer Name 1 

Please Note: The differentiation between given and family names is based on our fictional data profiling tool using probability-driven, non-contextual classification of the individual field values.

For example, Harris, Edward, and James are three of the most common names in the English language, and although they can also be family names, they are more frequently given names.  Therefore, “Harris Edward James” is assigned “Given-Name Given-Name Given-Name” for a field format.  For this particular example, how do we determine the family name?
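To illustrate what probability-driven, non-contextual classification might look like under the hood, here is a hedged sketch; the lexicon, probabilities, and labels below are illustrative assumptions, not the fictional tool's actual internals:

```python
# Hypothetical lexicon: P(token is a given name), loosely based on name frequency data.
GIVEN_NAME_PROBABILITY = {
    "james": 0.85, "edward": 0.80, "harris": 0.55,   # assumed: more often given names
    "smith": 0.05, "fernandez": 0.10,                # assumed: more often family names
}

def classify_token(token):
    """Classify one name token on its own, ignoring its position (non-contextual)."""
    if len(token.rstrip(".")) == 1:
        return "Initial"                             # e.g. "J." or "J"
    p_given = GIVEN_NAME_PROBABILITY.get(token.strip(".,").lower())
    if p_given is None:
        return "Unknown-Word"
    return "Given-Name" if p_given >= 0.5 else "Family-Name"

def field_format(value):
    """Build the field format by classifying each token independently."""
    return " ".join(classify_token(token) for token in value.split())

print(field_format("Harris Edward James"))
# Given-Name Given-Name Given-Name (which is exactly why the family name is ambiguous)
```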

The top twenty most frequently occurring field formats for Customer Name 1 collectively account for over 80% of the records with an actual value in this field for this data source.  All of these field formats appear to be common, potentially valid structures.  Obviously, more than one sample field value would need to be reviewed using more drill-down analysis.

What conclusions, assumptions, and questions do you have about the Customer Name 1 field?

 

Customer Name 2

The data profiling tool has provided you the following drill-down “screen” for Customer Name 2:

Field Formats for Customer Name 2 

The top ten most frequently occurring field formats for Customer Name 2 collectively account for over 50% of the records with an actual value in this sparsely populated field for this data source.  Some of these field formats show common, potentially valid structures.  Again, more than one sample field value would need to be reviewed using more drill-down analysis.
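Both coverage figures (over 80% for Customer Name 1, over 50% for Customer Name 2) come from the same simple calculation: rank the field formats by frequency and accumulate their share of the populated records.  A minimal sketch; format_of stands in for whatever turns a raw value into a field format, for example the field_format helper sketched earlier:

```python
from collections import Counter

def top_format_coverage(values, format_of, top_n=10):
    """Share of populated records covered by the top_n most frequent field formats.

    format_of is any function that maps a raw field value to its field format.
    """
    populated = [v for v in values if v not in (None, "")]
    if not populated:
        return [], 0.0
    format_counts = Counter(format_of(v) for v in populated)
    top = format_counts.most_common(top_n)
    coverage = sum(count for _, count in top) / len(populated)
    return top, coverage

# A coverage above 80% means reviewing a handful of formats (plus a few sample
# values behind each one) tells us most of this field's story.
```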

What conclusions, assumptions, and questions do you have about the Customer Name 2 field?

 

The Challenges of Person Names

Business names certainly have their own challenges, but person names present special challenges.  Many data quality initiatives include the business requirement to parse, identify, verify, and format a “valid” person name.  However, unlike postal addresses, where country-specific postal databases exist to support validation, no such “standards” exist for person names.

In his excellent book Viral Data in SOA: An Enterprise Pandemic, Neal A. Fishman explains that “a person's name is a concept that is both ubiquitous and subject to regional variations.  For example, the cultural aspects of an individual's name can vary.  In lieu of last name, some cultures specify a clan name.  Others specify a paternal name followed by a maternal name, or a maternal name followed by a paternal name; other cultures use a tribal name, and so on.  Variances can be numerous.”

“In addition,” continues Fishman, “a name can be used in multiple contexts, which might affect what parts should or could be communicated.  An organization reporting an employee's tax contributions might report the name by using the family name and just the first letter (or initial) of the first name (in that sequence).  The same organization mailing a solicitation might choose to use just a title and a family name.”

However, it is not a simple task to identify what part of a person's name is the family name or the first given name (as some of the above data profiling sample field values illustrate).  Again, regional, cultural, and linguistic variations can greatly complicate what at first may appear to be a straightforward business request (e.g. formatting a person name for a mailing label).

As Fishman cautions, “many regions have cultural name profiles bearing distinguishing features for words, sequences, word frequencies, abbreviations, titles, prefixes, suffixes, spelling variants, gender associations, and indications of life events.”

If you know of any useful resources for dealing with the challenges of person names, then please share them by posting a comment below.  Additionally, please share your thoughts and experiences regarding the challenges (as well as useful resources) associated with business names.

 

What other analysis do you think should be performed for customer names?

 

In Part 8 of this series:  We will conclude the adventures in data profiling with a summary of the lessons learned.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Getting Your Data Freq On

Commendable Comments (Part 3)

In a July 2008 blog post on Men with Pens (one of the Top 10 Blogs for Writers 2009), James Chartrand explained:

“Comment sections are communities strengthened by people.”

“Building a blog community creates a festival of people” where everyone can, as Chartrand explained, “speak up with great care and attention, sharing thoughts and views while openly accepting differing opinions.”

I agree with James (and not just because of his cool first name) – my goal for this blog is to foster an environment in which a diversity of viewpoints is freely shared without bias.  Everyone is invited to get involved in the discussion and have an opportunity to hear what others have to offer.  This blog's comment section has become a community strengthened by your contributions.

This is the third entry in my ongoing series celebrating my heroes – my readers.

 

Commendable Comments

On The Fragility of Knowledge, Andy Lunn commented:

“In my field of Software Development, you simply cannot rest and rely on what you know.  The technology you master today will almost certainly evolve over time and this can catch you out.  There's no point being an expert in something no one wants any more!  This is not always the case, but don't forget to come up for air and look around for what's changing.

I've lost count of the number of organizations I've seen who have stuck with a technology that was fresh 15 years ago and a huge stagnant pot of data, who are now scrambling to come up to speed with what their customers expect.  Throwing endless piles of cash at the problem, hoping to catch up.

What am I getting at?  The secret I've learned is to adapt.  This doesn't mean jump on every new fad immediately, but be aware of it.  Follow what's trending, where the collective thinking is heading and most importantly, what do your customers want?

I just wish more organizations would think like this and realize that the systems they create, the data they hold, and the customers they have are in a constant state of flux.  They are all projects that need care and attention.  All subject to change, there's no getting away from it, but small, well planned changes are a lot less painful, trust me.”

On DQ-Tip: “Data quality is primarily about context not accuracy...”, Stephen Simmonds commented:

“I have to agree with Rick about data quality being in the eye of the beholder – and with Henrik on the several dimensions of quality.

A theme I often return to is 'what does the business want/expect from data?' – and when you hear them talk about quality, it's not just an issue of accuracy.  The business stakeholder cares – more than many seem to notice – about a number of other issues that are squarely BI concerns:

– Timeliness ('WHEN I want it')
– Format ('how I want to SEE it') – visualization, delivery channels
– Usability ('how I want to then make USE of it') – being able to extract information from a report (say) for other purposes
– Relevance ('I want HIGHLIGHTED the information that is meaningful to me')

And so on.  Yes, accuracy is important, and it messes up your effectiveness when delivering inaccurate information.  But that's not the only thing a business stakeholder can raise when discussing issues of quality.  A report can be rejected as poor quality if it doesn't adequately meet business needs in a far more general sense.  That is the constant challenge for a BI professional.”

On Mistake Driven Learning, Ken O'Connor commented:

“There is a Chinese proverb that says:

'Tell me and I'll forget; Show me and I may remember; Involve me and I'll understand.'

I have found the above to be very true, especially when seeking to brief a large team on a new policy or process.  Interaction with the audience generates involvement and a better understanding.

The challenge facing books, whitepapers, blog posts etc. is that they usually 'Tell us,' they often 'Show us,' but they seldom 'Involve us.'

Hence, we struggle to remember, and struggle even more to understand.  We learn best by 'doing' and by making mistakes.”

You Are Awesome

Thank you very much for your comments.  For me, the best part of blogging is the dialogue and discussion provided by interactions with my readers.  Since there have been so many commendable comments, please don't be offended if your commendable comment hasn't been featured yet.  Please keep on commenting and stay tuned for future entries in the series.

By the way, even if you have never posted a comment on my blog, you are still awesome — feel free to tell everyone I said so.

 

Related Posts

Commendable Comments (Part 1)

Commendable Comments (Part 2)

DQ-Tip: “...Go talk with the people using the data”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“In order for your data quality initiative to be successful, you must:

Walk away from the computer and go talk with the people using the data.”

This DQ-Tip came from the TDWI World Conference Chicago 2009 presentation Modern Data Quality Techniques in Action by Gian Di Loreto from Loreto Services and Technologies.

As I blogged about in Data Gazers (borrowing that excellent phrase from Arkady Maydanchik), within cubicles randomly dispersed throughout the sprawling office space of companies large and small, there exist countless unsung heroes of data quality initiatives.  Although their job titles might label them as a Business Analyst, Programmer Analyst, Account Specialist or Application Developer, their true vocation is a far more noble calling.  They are Data Gazers.

A most bizarre phenomenon (that I have witnessed too many times) is that as a data quality initiative “progresses” it tends to get further and further away from the people who use the data on a daily basis.

Please follow the excellent advice of Gian and Arkady — go talk with your users. 

Trust me — everyone on your data quality initiative will be very happy that you did.

 

Related Posts

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-Tip: “Don't pass bad data on to the next person...”

Blog-Bout: “Risk” versus “Monopoly”

A “blog-bout” is a good-natured debate between two bloggers.  This blog-bout is between Jim Harris and Phil Simon, where they debate which board game is the better metaphor for an Information Technology (IT) project: “Risk” or “Monopoly.”

 

Why “Risk” is a better metaphor for an IT Project

By Jim Harris

IT projects and “Risk” have a great deal in common.  I thought long and hard about this while screaming obscenities and watching professional sports on television, the source of all of my great thinking.  I came up with five world dominating reasons.

1. Both things start with the players marking their territory.  In Risk, the game begins with the players placing their “armies” on the territories they will initially occupy.  On IT projects, the different groups within the organization will initially claim their turf. 

Please note that the term “Information Technology” is being used in a general sense to describe a project (e.g. Data Quality, Master Data Management, etc.) and should not be confused with the IT group within an organization.  At a very high level, the Business and IT are the internal groups representing the business and technical stakeholders on a project.

The Business usually owns the data and understands its meaning and use in the day-to-day operation of the enterprise.  IT usually owns the hardware and software infrastructure of the enterprise's technical architecture. 

Both groups can claim they are only responsible for what they own, resist collaborating with the “other side” and therefore create organizational barriers as fiercely defended as the continental borders of Europe and Asia in Risk.

2. In both, there are many competing strategies.  In Risk, the official rules of the game include some basic strategies and over the years many players have developed their own fool-proof plans to guarantee victory.  Some strategies advocate focusing on controlling entire continents, while others advise fortifying your borders by invading and occupying neighboring territories.  And my blog-bout competitor Phil Simon half-jokingly claims that the key to winning Risk is securing the island nation of Madagascar.

On IT projects, you often hear a lot of buzzwords and strategies bandied about, such as Lean, Agile, Six Sigma, and Kaizen, to name but a few.  Please understand – I am an advocate for methodology and best practices, and there are certainly many excellent frameworks out there, including the paradigms I just mentioned.

However, a general problem that I have with most frameworks is their tendency to adopt a one-size-fits-all strategy, which I believe is an approach that is doomed to fail.  Any implemented framework must be customized to adapt to an organization’s unique culture. 

In part, this is necessary because implementing changes of any kind will be met with initial resistance, but an attempt at forcing a one-size-fits-all approach almost sends a message to the organization that everything they are currently doing is wrong, which will of course only increase the resistance to change. 

Starting with a framework simply provides a reference of best practices and recommended options of what has worked on successful IT projects.  The framework should be reviewed in order to determine what can be learned from it and to select what will work in the current environment and what simply won't.     

3. Pyrrhic victories are common during both endeavors.  In Risk, sacrificing everything to win a single battle or to defend your favorite territory can ultimately lead you to lose the war.  Political fiefdoms can undermine what could otherwise have been a successful IT project.  Do not underestimate the unique challenges of your corporate culture.

Obviously, business, technical and data issues will all come up from time to time, and there will likely be disagreements regarding how these issues should be prioritized.  Some issues will likely affect certain stakeholders more than others. 

Keeping data and technology aligned with business processes requires getting people aligned and free to communicate their concerns.  Coordinating discussions with all of the stakeholders and maintaining open communication can prevent a Pyrrhic victory for one stakeholder causing the overall project to fail.

4. Alliances are the key to true victory.  In Risk, it is common for players to form alliances by combining their resources and coordinating their efforts in order to defend their shared borders or to eliminate a common enemy. 

On IT projects, knowledge about data, business processes and supporting technology are spread throughout the organization.  Neither the Business nor IT alone has all of the necessary information required to achieve success. 

Successful projects are driven by an executive management mandate for the Business and IT to forge an alliance of ongoing and iterative collaboration throughout the entire project.

5. The outcomes of both are too often left to chance.  IT projects are complex, time-consuming, and expensive enterprise initiatives.  Success requires people taking on the challenge united by collaboration, guided by an effective methodology, and implementing a solution using powerful technology.

But the complexity of an IT project can sometimes work against your best intentions.  It is easy to get pulled into the mechanics of documenting the business requirements and functional specifications, drafting the project plan and then charging ahead on the common mantra: “We planned the work, now we work the plan.”

Once an IT project achieves some momentum, it can take on a life of its own and the focus becomes more and more about making progress against the tasks in the project plan, and less and less on the project's actual business goals.  Typically, this leads to another all too common mantra: “Code it, test it, implement it into production, and then declare victory.”

In Risk, the outcomes are literally determined by a roll of the dice.  If you allow your IT project to lose sight of its business goals, then you treat it like a game of chance.  And to paraphrase Albert Einstein:

“Do not play dice with IT Projects.”

Why “Monopoly” is a better metaphor for an IT Project

By Phil Simon

IT projects and “Monopoly” have a great deal in common.  I thought long and hard about this at the gym, the source of all of my great thinking.  I came up with six really smashing reasons.

1. Both things take much longer than originally expected.  IT projects typically take much longer than expected for a wide variety of reasons.  Rare is the project that finishes on time (with expected functionality delivered).

The same holds true for Monopoly.  Remember when you were a kid and you wanted to play a quick game?  Now, I consider the term “a quick game of Monopoly” to be the very definition of an oxymoron.  You’d better block off about four to six hours for a proper game.  Unforeseen complexities will doubtlessly delay even the best intentions.

2. During both endeavors, screaming matches typically erupt.  Many projects become tense.  I remember one in which two participants nearly came to blows.  Most projects have key players engage in very heated debates over strategic vision and execution.

With Monopoly, especially after the properties are divvied up, players scream and yell over what constitutes a “fair” deal.  “What do you mean Boardwalk for Ventnor Avenue and Pennsylvania Railroad isn’t reasonable?  IT’S COMPLETELY FAIR!”  Debates like this are the rule, not the exception.

3. While the basic rules may be the same, different people play by different rules.  The vast majority of projects on which I have worked have had the usual suspects: steering committees, executive sponsors, PMOs, different stages of testing, and ultimately system activation.  However, different organizations often try to do things in vastly different ways.  For example, on two similar projects in different organizations, you are likely to find differences with respect to:

  • the number of internal and external folks assigned to a project
  • the project’s timeline and budget
  • project objectives

By the same token, people play Monopoly in somewhat different ways.  Many don’t know about the auction rule.  Others replenish Free Parking with a new $500 bill after someone lands on it.  Also, many people disregard altogether the property assessment card while sticklers like me assess penalties when that vaunted red card appears.

4. Personal relationships can largely determine the outcome in both.  Negotiation is key on IT projects.  Clients negotiate rates, prices, and responsibilities with consulting vendors and/or software vendors.

In Monopoly, personal rivalries play a big part in who makes a deal with whom.  Often players chime in (uninvited, of course) with their opinions on potential deals, no doubt hoping to affect the outcome.

5. Little things really matter, especially at the end.  Towards the end of an IT project, snakes in the woodwork often come out to bite people when they least expect it.  A tightly staffed or planned project may not be able to withstand a relatively minor problem, especially if the go-live date is non-negotiable.

In Monopoly, the same holds true.  Laugh all you want when your opponent builds hotels on Mediterranean Avenue and Baltic Avenue, but at the end of the game those $250 and $450 charges can really hurt, especially when you’re low on cash.

6. Many times, each does not end; it is merely abandoned.  A good percentage of projects have their plugs pulled prior to completion.  A CIO may become tired of an interminable project and decide to simply end it before costs skyrocket even further.

I’d say that about half of the Monopoly games that I’ve played in the last fifteen years have also been called by “executive decision.”  The writing is on the board, as 1 a.m. rolls around and only two players remain.  Often player X simply cedes the game to player Y.

 

You are the Referee

All bouts require a referee.  Blog-bouts are refereed by the readers.  Therefore, please cast your vote in the poll and also weigh in on this debate by sharing your thoughts by posting a comment below.  Since a blog-bout is co-posted, your comments will be copied (with full attribution) into the comments section of both of the blogs co-hosting this blog-bout.

 

About Jim Harris

Jim Harris is the Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ), which is an independent blog offering a vendor-neutral perspective on data quality.  Jim is also an independent consultant, speaker, writer and blogger with over 15 years of professional services and application development experience in data quality (DQ), data integration, data warehousing (DW), business intelligence (BI), customer data integration (CDI), and master data management (MDM).  Jim is also a contributing writer to Data Quality Pro, the leading online magazine and community resource dedicated to data quality professionals.

 

About Phil Simon

Phil Simon is the author of the acclaimed book Why New Systems Fail: Theory and Practice Collide and the highly anticipated upcoming book The Next Wave of Technologies: Opportunities from Chaos.  Phil is also an independent systems consultant and a dynamic public speaker for hire focusing on how organizations use technology.  Phil also writes for a number of technology media outlets.

Mistake Driven Learning

In his Copyblogger article How to Stop Making Yourself Crazy with Self-Editing, Sean D'Souza explains:

“Competency is a state of mind you reach when you’ve made enough mistakes.”

One of my continuing challenges is staying informed about the latest trends in data quality and its related disciplines, including Master Data Management (MDM), Dystopian Automated Transactional Analysis (DATA), and Data Governance (DG) – I am fairly certain that one of those three things isn't real, but I haven't figured out which one yet.

I read all of the latest books, as well as the books that I was supposed to have read years ago, when I was just pretending to have read all of the latest books.  I also read the latest articles, whitepapers, and blogs.  And I go to as many conferences as possible.

The basis of this endless quest for knowledge is fear.  Please understand – I have never been afraid to look like an idiot.  After all, we idiots are important members of society – we make everyone else look smart by comparison. 

However, I also market myself as a data quality expert.  Therefore, when I consult, speak, write, or blog, I am always at least a little afraid of not getting things quite right.  Being afraid of making mistakes can drive you crazy. 

But as a wise man named Seal Henry Olusegun Olumide Adeola Samuel (wisely better known by only his first name) lyrically taught us back in 1991:

“We're never gonna survive unless, we get a little crazy.”

“It’s not about getting things right in your brain,” explains D’Souza, “it’s about getting things wrong.  The brain has to make hundreds, even thousands of mistakes — and overcome those mistakes — to be able to reach a level of competency.”

 

So, get a little crazy, make a lot of mistakes, and never stop learning.

 

Related Posts

The Fragility of Knowledge

The Wisdom of Failure

A Portrait of the Data Quality Expert as a Young Idiot

The Nine Circles of Data Quality Hell

Poor Data Quality is a Virus

“A storm is brewing—a perfect storm of viral data, disinformation, and misinformation.” 

These cautionary words (written by Timothy G. Davis, an Executive Director within the IBM Software Group) are from the foreword of the remarkable new book Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman.

“Viral data,” explains Fishman, “is a metaphor used to indicate that business-oriented data can exhibit qualities of a specific type of human pathogen: the virus.  Like a virus, data by itself is inert.  Data requires software (or people) for the data to appear alive (or actionable) and cause a positive, neutral, or negative effect.”

“Viral data is a perfect storm,” because as Fishman explains, it is “a perfect opportunity to miscommunicate with ubiquity and simultaneity—a service-oriented pandemic reaching all corners of the enterprise.”

“The antonym of viral data is trusted information.”

Data Quality

“Quality is a subjective term,” explains Fishman, “for which each person has his or her own definition.”  Fishman goes on to quote from many of the published definitions of data quality, including a few of my personal favorites:

  • David Loshin: “Fitness for use—the level of data quality determined by data consumers in terms of meeting or beating expectations.”
  • Danette McGilvray: “The degree to which information and data can be a trusted source for any and/or all required uses.  It is having the right set of correct information, at the right time, in the right place, for the right people to use to make decisions, to run the business, to serve customers, and to achieve company goals.”
  • Thomas Redman: “Data are of high quality if those who use them say so.  Usually, high-quality data must be both free of defects and possess features that customers desire.”

Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives.  Starting from this foundation, information quality standards are customized to meet the subjective needs of each business unit and initiative.  This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.

However, the enterprise-wide data quality standards must be understood as dynamic.  Therefore, enforcing strict conformance to data quality standards can be self-defeating.  On this point, Fishman quotes Joseph Juran: “conformance by its nature relates to static standards and specification, whereas quality is a moving target.”

Defining data quality is both an essential and challenging exercise for every enterprise.  “While a succinct and holistic single-sentence definition of data quality may be difficult to craft,” explains Fishman, “an axiom that appears to be generally forgotten when establishing a definition is that in business, data is about things that transpire during the course of conducting business.  Business data is data about the business, and any data about the business is metadata.  First and foremost, the definition as to the quality of data must reflect the real-world object, concept, or event to which the data is supposed to be directly associated.”

 

Data Governance

“Data governance can be used as an overloaded term,” explains Fishman, and he quotes Jill Dyché and Evan Levy to explain that “many people confuse data quality, data governance, and master data management.” 

“The function of data governance,” explains Fishman, “should be distinct and distinguishable from normal work activities.” 

For example, although knowledge workers and subject matter experts are necessary to define the business rules for preventing viral data, according to Fishman, these are data quality tasks and not acts of data governance. 

However,  these data quality tasks must “subsequently be governed to make sure that all the requisite outcomes comply with the appropriate controls.”

Therefore, according to Fishman, “data governance is a function that can act as an oversight mechanism and can be used to enforce controls over data quality and master data management, but also over data privacy, data security, identity management, risk management, or be accepted in the interpretation and adoption of regulatory requirements.”

 

Conclusion

“There is a line between trustworthy information and viral data,” explains Fishman, “and that line is very fine.”

Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace. 

Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions.  As the pathogen replicates, more and more decision-critical enterprise information will be compromised.

According to Fishman, enterprise data quality requires a multidisciplinary effort and a lifetime commitment to:

“Prevent viral data and preserve trusted information.”

Books Referenced in this Post

Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman

Enterprise Knowledge Management: The Data Quality Approach by David Loshin

Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information by Danette McGilvray

Data Quality: The Field Guide by Thomas Redman

Juran on Quality by Design: The New Steps for Planning Quality into Goods and Services by Joseph Juran

Customer Data Integration: Reaching a Single Version of the Truth by Jill Dyché and Evan Levy

 

Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

Data Governance and Data Quality

Tweet 2001: A Social Media Odyssey

HAL 9000: “I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”

As I get closer and closer to my 2001st tweet on Twitter, I wanted to pause for some quiet reflection on my personal odyssey in social media – but then I decided to blog about it instead.

 

The Dawn of OCDQ

Except for LinkedIn, my epic drama of social media adventure and exploration started with my OCDQ blog.

In my Data Quality Pro article Blogging about Data Quality, I explained why I started this blog and discussed some of my thoughts on blogging.  Most importantly, I explained that I am neither a blogging expert nor a social media expert.

But now that I have been blogging and using social media for over six months, I feel more comfortable sharing my thoughts and personal experiences with social media without worrying about sounding like too much of an idiot (no promises, of course).

 

LinkedIn

My social media odyssey began in 2007 when I created my account on LinkedIn, which I admit, I initially viewed as just an online resume.  I put little effort into my profile, only made a few connections, and only joined a few groups.

Last year (motivated by the economic recession), I started using LinkedIn more extensively.  I updated my profile with a complete job history, asked my colleagues for recommendations, expanded my network with more connections, and joined more groups.  I also used LinkedIn applications (e.g. Reading List by Amazon and Blog Link) to further enhance my profile.

My favorite feature is the LinkedIn Groups, which not only provide an excellent opportunity to connect with other users, but also provide Discussions, News (including support for RSS feeds), and Job Postings.

By no means a comprehensive list, here are some LinkedIn Groups that you may be interested in:

For more information about LinkedIn features and benefits, check out the following posts on the LinkedIn Blog:

 

Twitter

Shortly after launching my blog in March 2009, I created my Twitter account to help promote my blog content.  In blogging, content is king, but marketing is queen.  LinkedIn (via group news feeds) is my leading source of blog visitors from social media, but Twitter isn't far behind. 

However, as Michele Goetz of Brain Vibe explained in her blog post Is Twitter an Effective Direct Marketing Tool?, Twitter has a click-through rate equivalent to direct mail.  Citing research from Pear Analytics, a “useful” tweet was found to have a shelf life of about one hour with about a 1% click-through rate on links.

In his blog post Is Twitter Killing Blogging?, Ajay Ohri of Decision Stats examined whether Twitter was a complement or a substitute for blogging.  I created a Data Quality on Twitter page on my blog in order to illustrate what I have found to be the complementary nature of tweeting and blogging. 

My ten blog posts receiving the most tweets (tracked using the Retweet Button from TweetMeme):

  1. The Nine Circles of Data Quality Hell 13 Tweets
  2. Adventures in Data Profiling (Part 1) 13 Tweets
  3. Fantasy League Data Quality 12 Tweets
  4. Not So Strange Case of Dr. Technology and Mr. Business 12 Tweets 
  5. The Fragility of Knowledge 11 Tweets
  6. The General Theory of Data Quality 9 Tweets
  7. The Very True Fear of False Positives 8 Tweets
  8. Data Governance and Data Quality 8 Tweets
  9. Adventures in Data Profiling (Part 3) 8 Tweets
  10. Data Quality: The Reality Show? 7 Tweets

Most of my social networking is done using Twitter (with LinkedIn being a close second).  I have also found Twitter to be great for doing research, which I complement with RSS subscriptions to blogs.

To search Twitter for data quality content:

If you are new to Twitter, then I would recommend reading the following blog posts:

 

Facebook

I also created my Facebook account shortly after launching my blog.  Although I almost exclusively use social media for professional purposes, I do use Facebook as a way to stay connected with family and friends. 

I created a page for my blog to separate my professional and personal aspects of Facebook without the need to manage multiple accounts.  Additionally, this allows you to become a “fan” of my blog without requiring you to also become my “friend.”

A quick note on Facebook games, polls, and trivia: I do not play them.  With my obsessive-compulsive personality, I have to ignore them.  Therefore, please don't be offended if, for example, I have ignored your invitation to play Mafia Wars.

By no means a comprehensive list, here are some Facebook Pages or Groups that you may be interested in:

 

Additional Social Media Websites

Although LinkedIn, Twitter, and Facebook are my primary social media websites, I also have accounts on three of the most popular social bookmarking websites: Digg, StumbleUpon, and Delicious.

Social bookmarking can be a great promotional tool that can help blog content go viral.  However, niche content is almost impossible to get to go viral.  Data quality is not just a niche – if technology blogging were a Matryoshka (a.k.a. Russian nested) doll, then data quality would be the last, innermost doll.

This doesn't mean that data quality isn't an important subject – it just means that you will not see a blog post about data quality hitting the front pages of mainstream social bookmarking websites anytime soon.  Dylan Jones of Data Quality Pro created DQVote, which is a social bookmarking website dedicated to sharing data quality community content.

I also have an account on FriendFeed, which is an aggregator that can consolidate content from other social media websites, blogs, or anything providing an RSS feed.  My blog posts and my updates from other social media websites (except for Facebook) are automatically aggregated.  On Facebook, my personal page displays my FriendFeed content.

 

Social Media Tools and Services

Social media tools and services that I personally use (listed in no particular order):

  • Flock – The Social Web Browser Powered by Mozilla
  • TweetDeck – Connecting you with your contacts across Twitter, Facebook, MySpace and more
  • Digsby – Digsby = Instant Messaging (IM) + E-mail + Social Networks
  • Ping.fm – Update all of your social networks at once
  • HootSuite – The professional Twitter client
  • Twitterfeed – Feed your blog to Twitter
  • Google FeedBurner – Provide an e-mail subscription to your blog
  • TweetMeme – Add a Retweet Button to your blog
  • Squarespace Blog Platform – The secret behind exceptional websites

 

Social Media Strategy

As Darren Rowse of ProBlogger explained in his blog post How I use Social Media in My Blogging, Chris Brogan developed a social media strategy using the metaphor of a Home Base with Outposts.

“A home base,” explains Rowse, “is a place online that you own.”  For example, your home base could be your blog or your company's website.  “Outposts,” continues Rowse, “are places that you have an online presence out in other parts of the web that you might not own.”  For example, your outposts could be your LinkedIn, Twitter, and Facebook accounts.

According to Rowse, your Outposts will make your Home Base stronger by providing:

“Relationships, ideas, traffic, resources, partnerships, community and much more.”

Social Karma

An effective social media strategy is essential for both companies and individual professionals.  Using social media can help promote you, your expertise, your company and your products and services.

However, too many companies and individuals have a selfish social media strategy.

You should not use social media exclusively for self-promotion.  You should view social media as Social Karma.

If you can focus on helping others when you use social media, then you will get much more back than just a blog reader, a LinkedIn connection, a Twitter follower, a Facebook friend, or even a potential customer.

Yes, I use social media to promote myself and my blog content.  However, more than anything else, I use social media to listen, to learn, and to help others when I can.

 

Please Share Your Social Media Odyssey

As always, I am interested in hearing from you.  What have been your personal experiences with social media?

DQ-Tip: “Data quality is primarily about context not accuracy...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Data quality is primarily about context not accuracy. 

Accuracy is part of the equation, but only a very small portion.”

This DQ-Tip is from Rick Sherman's recent blog post summarizing the TDWI Boston Chapter Meeting at MIT.

 

I define data using the Dragnet definition – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers).  A common definition for data quality is fitness for the purpose of use; the common challenge is that data has multiple uses, each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I define information as data in use or data in action.

Alternatively, information can be defined as data in context.

Quality, as Sherman explains, “is in the eyes of the beholder, i.e. the business context.”

 

Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

The General Theory of Data Quality

The Data-Information Continuum

Adventures in Data Profiling (Part 6)

In Part 5 of this series:  You completed your initial analysis of the fields relating to postal address with the investigation of Postal Address Line 1 and Postal Address Line 2.

You saw additional examples of why free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field. 

You learned this analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high Distinctness (i.e. the exact same field value rarely occurs on more than one record). 

You also saw examples of how the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.

In Part 6, you will continue your adventures in data profiling by analyzing the Account Number and Tax ID fields.

 

Account Number

Field Summary for Account Number

The field summary for Account Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

In Part 2, we learned that Customer ID is likely an integer surrogate key and the primary key for this data source because it is both 100% complete and 100% unique.  Account Number is 100% complete and almost 100% unique.  Perhaps it was intended to be the natural key for this data source?

Let's assume that drill-downs revealed the single profiled field data type was VARCHAR and the single profiled field format was aa-nnnnnnnnn (i.e. 2 characters, followed by a hyphen, followed by a 9-digit number).

Combined with the profiled minimum/maximum field lengths, the good news appears to be that not only is Account Number always populated, it is also consistently formatted.

The profiled minimum/maximum field values appear somewhat suspicious, possibly indicating the presence of invalid values?
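For readers who want to replicate that consistency check outside the tool, here is a minimal sketch; the regular expression simply encodes the profiled aa-nnnnnnnnn format, and the min/max helper is only an assumed way to reproduce the suspicious boundary values:

```python
import re

# The single profiled format aa-nnnnnnnnn: two letters, a hyphen, then nine digits.
ACCOUNT_NUMBER_PATTERN = re.compile(r"^[A-Za-z]{2}-\d{9}$")

def conforms_to_account_format(value):
    """Return True if the value matches the profiled Account Number format."""
    return bool(ACCOUNT_NUMBER_PATTERN.match(value or ""))

def min_max_values(values):
    """Reproduce the profiled minimum/maximum field values for eyeballing."""
    populated = sorted(v for v in values if v not in (None, ""))
    return (populated[0], populated[-1]) if populated else (None, None)
```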

 

Field Values for Account Number

We can use drill-downs on the field summary “screen” to get more details about Account Number provided by the data profiling tool.

The cardinality of Account Number is very high, as is its Distinctness (i.e. the same field value rarely occurs on more than one record).  Therefore, when we limit the review to only the top ten most frequently occurring values, it is not surprising to see low counts.

Since we do not yet have a business understanding of the data, we are not sure if it is valid for multiple records to have the same Account Number.

Additional analysis can be performed by extracting the alpha prefix and reviewing its top ten most frequently occurring values.  One aspect of this analysis is that it can be used to assess the possibility that Account Number is an “intelligent key.”  Perhaps the alpha prefix is a source system code?
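A minimal sketch of that alpha-prefix analysis; the hyphen split follows the aa-nnnnnnnnn format, and the example values are hypothetical:

```python
from collections import Counter

def prefix_frequencies(account_numbers, top_n=10):
    """Count the alpha prefixes of aa-nnnnnnnnn style account numbers.

    A small, stable set of prefixes would support the 'intelligent key'
    hypothesis, e.g. the prefix acting as a source system code.
    """
    prefixes = (value.split("-", 1)[0].upper()
                for value in account_numbers
                if value and "-" in value)
    return Counter(prefixes).most_common(top_n)

# Hypothetical usage:
print(prefix_frequencies(["GD-000000001", "gd-000000002", "US-123456789"]))
# [('GD', 2), ('US', 1)]
```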

 

 

Tax ID

Field Summary for Tax ID

The field summary for Tax ID includes input metadata along with the summary and additional statistics provided by the data profiling tool.

Let's assume that drill-downs revealed the single profiled field data type was INTEGER and the single profiled field format was nnnnnnnnn (i.e. a 9-digit number).

Combined with the profiled minimum/maximum field lengths, the good news appears to be that Tax ID is also consistently formatted.  However, the profiled minimum/maximum field values appear to indicate the presence of invalid values.

In Part 4, we learned that most of the records appear to have either a United States (US) or Canada (CA) postal address.  For US records, the Tax ID field could represent the social security number (SSN), federal employer identification number (FEIN), or tax identification number (TIN).  For CA records, this field could represent the social insurance number (SIN).  All of these identifiers are used for tax reporting purposes and have a 9-digit number format (when no presentation formatting is used).
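Here is a hedged sketch of the kind of structural checks that could flag those suspicious Tax ID values; the specific “junk” patterns below are common conventions, not rules confirmed by this data source:

```python
import re

NINE_DIGITS = re.compile(r"^\d{9}$")

def looks_like_invalid_tax_id(value):
    """Flag Tax IDs that fit the nnnnnnnnn format but are almost certainly junk."""
    v = (value or "").strip()
    if not NINE_DIGITS.match(v):
        return True                        # wrong length or non-numeric
    if len(set(v)) == 1:
        return True                        # e.g. 000000000 or 999999999
    if v in ("123456789", "987654321"):
        return True                        # keyboard-walk placeholder values
    return False

print(looks_like_invalid_tax_id("999999999"))   # True
print(looks_like_invalid_tax_id("416789352"))   # False (well-formed and plausible)
```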

 

Field Values for Tax ID

We can use drill-downs on the field summary “screen” to get more details about Tax ID provided by the data profiling tool.

The Distinctness of Tax ID is slightly lower than that of Account Number, and therefore the same field value does occasionally occur on more than one record.

Since the cardinality of Tax ID is very high, we will limit the review to only the top ten most frequently occurring values.  This analysis reveals the presence of more (most likely) invalid values.

 

Potential Duplicate Records

In Part 1, we asked if the data profiling statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records.  In other words, since some distinct actual values for these fields occur on more than one record, does this imply not just a possible data relationship, but a possible data redundancy?  Obviously, we would need to interact with the business team in order to better understand the data and their business rules for identifying duplicate records.

However, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

Record Drill-down for Account Number and Tax ID
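To show how such a drill-down could be assembled outside the tool, here is a minimal sketch that groups records sharing the same Tax ID (or Account Number) for side-by-side review; the record structure and values are hypothetical:

```python
from collections import defaultdict

def group_potential_duplicates(records, key_field):
    """Group records that share a value in key_field (e.g. 'tax_id').

    Only groups with more than one record are returned.  Whether they are
    true duplicates still requires review with the business team.
    """
    groups = defaultdict(list)
    for record in records:
        value = record.get(key_field)
        if value:                          # ignore NULL/empty keys
            groups[value].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}

# Hypothetical records of interest:
records = [
    {"customer_id": 1, "tax_id": "416789352", "customer_name_1": "Harris Edward James"},
    {"customer_id": 2, "tax_id": "416789352", "customer_name_1": "Harris E James"},
    {"customer_id": 3, "tax_id": "305227719", "customer_name_1": "J. Smith"},
]
print(group_potential_duplicates(records, "tax_id"))
```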

 

What other analysis do you think should be performed for these fields?

 

In Part 7 of this series:  We will continue the adventures in data profiling by completing our initial analysis with the investigation of the Customer Name 1 and Customer Name 2 fields.

 

Related Posts

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

Commendable Comments (Part 2)

In a recent guest post on ProBlogger, Josh Hanagarne “quoted” Jane Austen:

“It is a truth universally acknowledged, that a blogger in possession of a good domain must be in want of some worthwhile comments.”

“The most rewarding thing has been that comments,” explained Hanagarne, “led to me meeting some great people I possibly never would have known otherwise.”  I wholeheartedly echo that sentiment. 

This is the second entry in my ongoing series celebrating my heroes – my readers.

 

Commendable Comments

Proving that comments are the best part of blogging, on The Data-Information Continuum, Diane Neville commented:

“This article is intriguing. I would add more still.

A most significant quote:  'Data could be considered a constant while Information is a variable that redefines data for each specific use.'

This tells us that Information draws from a snapshot of a Data store.  I would state further that the very Information [specification] is – in itself – a snapshot.

The earlier quote continues:  'Data is not truly a constant since it is constantly changing.'

Similarly, it is a business reality that 'Information is not truly a constant since it is constantly changing.'

The article points out that 'The Data-Information Continuum' implies a many-to-many relationship between the two.  This is a sensible CONCEPTUAL model.

Enterprise Architecture is concerned as well with its responsibility for application quality in service to each Business Unit/Initiative.

For example, in the interest of quality design in Application Architecture, an additional LOGICAL model must be maintained between a then-current Information requirement and the particular Data (snapshots) from which it draws.  [Snapshot: generally understood as captured and frozen – and uneditable – at a particular point in time.]  Simply put, Information Snapshots have a PARENT RELATIONSHIP to the Data Snapshots from which they draw.

Analyzing this further, refer to this further piece of quoted wisdom (from section 'Subjective Information Quality'):  '...business units and initiatives must begin defining their Information...by using...Data...as a foundation...necessary for the day-to-day operation of each business unit and initiative.'

From logically-related snapshots of Information to the Data from which it draws, we can see from this quote that yet another PARENT/CHILD relationship exists...that from Business Unit/Initiative Snapshots to the Information Snapshots that implement whatever goals are the order of the day.  But days change.

If it is true that 'Data is not truly a constant since it is constantly changing,' and if we can agree that Information is not truly a constant either, then we can agree to take a rational and profitable leap to the truth that neither is a Business Unit/Initiative...since these undergo change as well, though they represent more slowly-changing dimensions.

Enterprises have an increasing responsibility for regulatory/compliance/archival systems that will qualitatively reproduce the ENTIRE snapshot of a particular operational transaction at any given point in time.

Thus, the Enterprise Architecture function has before it a daunting task:  to devise a holistic process that can SEAMLESSLY model the correct relationship of snapshots between Data (grandchild), Information (parent) and Business Unit/Initiative (grandparent).

There need be no conversion programs or redundant, throw-away data structures contrived to bridge the present gap.  The ability to capture the activities resulting from the undeniable point-in-time hierarchy among these entities is where tremendous opportunities lie.”

On Missed It By That Much, Vish Agashe commented:

“My favorite quote is 'Instead of focusing on the exceptions – focus on the improvements.'

I think that it is really important to define incremental goals for data quality projects and track the progress through percentage improvement over a period of time.

I think it is also important to manage the expectations that the goal is not necessarily to reach 100% (which will be extremely difficult if not impossible) clean data but the goal is to make progress to a point where the purpose for cleaning the data can be achieved in much better way than had the original data been used.

For example, if marketing wanted to use the contact data to create a campaign for those contacts which have a certain ERP system installed on-site.  But if the ERP information on the contact database is not clean (it is free text, in some cases it is absent etc...) then any campaign run on this data will reach only X% contacts at best (assuming only X% of contacts have ERP which is clean)...if the data quality project is undertaken to clean this data, one needs to look at progress in terms of % improvement.  How many contacts now have their ERP field cleaned and legible compared to when we started etc...and a reasonable goal needs to be set based on how much marketing and IT is willing to invest in these issues (which in turn could be based on ROI of the campaign based on increased outreach).”

Proving that my readers are way smarter than I am, on The General Theory of Data Quality, John O'Gorman commented:

“My theory of the data, information, knowledge continuum is more closely related to the element, compound, protein, structure arc.

In my world, there is no such thing as 'bad' data, just as there is no 'bad' elements.  Data is either useful or not: the larger the audience that agrees that a string is representative of something they can use, the more that string will be of value to me.

By dint of its existence in the world of human communication and in keeping with my theory, I can assign every piece of data to one of a fixed number of classes, each with characteristics of their own, just like elements in the periodic table.  And, just like the periodic table, those characteristics do not change.  The same 109 usable elements in the periodic table are found and are consistent throughout the universe, and our ability to understand that universe is based on that stability.

Information is simply data in a given context, like a molecule of carbon in flour.  The carbon retains all of its characteristics but the combination with other elements allows it to partake in a whole class of organic behavior. This is similar to the word 'practical' occurring in a sentence: Jim is a practical person or the letter 'p' in the last two words.

Where the analogue bends a bit is a cause of a lot of information management pain, but can be rectified with a slight change in perspective.  Computers (and almost all indexes) have a hard time with homographs: strings that are identical but that mean different things.  By creating fixed and persistent categories of data, my model suffers no such pain.

Take the word 'flies' in the following: 'Time flies like an arrow.' and 'Fruit flies like a pear.'  The data 'flies' can be permanently assigned to two different places, and their use determines which instance is relevant in the context of the sentence.  One instance is a verb, the other a plural noun.

Knowledge, in my opinion, is the ability to recognize, predict and synthesize patterns of information for past, present and future use, and more importantly to effectively communicate those patterns in one or more contexts to one or more audiences.

On one level, the model for information management that I use makes no apparent distinction between the data: we all use nouns, adjectives, verbs and sometimes scalar objects to communicate.  We may compress those into extremely compact concepts but they can all be unraveled to get at elemental components. At another level every distinction is made to insure precision.

The difference between information and knowledge is experiential and since experience is an accumulative construct, knowledge can be layered to appeal to common knowledge, special knowledge and unique knowledge.

Common being the most easily taught and widely applied; Special being related to one or more disciplines and/or special functions; and, Unique to individuals who have their own elevated understanding of the world and so have a need for compact and purpose-built semantic structures.

Going back to the analogue, knowledge is equivalent to the creation by certain proteins of cartilage, the use to which that cartilage is put throughout a body, and the specific shape of the cartilage that forms my nose as unique from the one on my wife's face.

To me, the most important part of the model is at the element level.  If I can convince a group of people to use a fixed set of elemental categories and to reference those categories when they create information, it's amazing how much tension disappears in the design, creation and deployment of knowledge.”

 

Tá mé buíoch díot

Daragh O Brien recently taught me the Irish Gaelic phrase Tá mé buíoch díot, which translates as I am grateful to you.

I am very grateful to all of my readers.  Since there have been so many commendable comments, please don't be offended if your commendable comment hasn't been featured yet.  Please keep on commenting and stay tuned for future entries in the series.

 

Related Posts

Commendable Comments (Part 1)

Commendable Comments (Part 3)

DQ-Tip: “Don't pass bad data on to the next person...”

Data Quality (DQ) Tips is a new regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Don't pass bad data on to the next person.  And don't accept bad data from the previous person.”

This DQ-Tip is from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset.

In the book, Redman explains that this advice is a rewording of his favorite data quality policy of all time.

Assuming that it is someone else's responsibility is a fundamental root cause of enterprise data quality problems.  One of the primary goals of a data quality initiative must be to define the roles and responsibilities for data ownership and data quality.

In sports, it is common for inspirational phrases to be posted above every locker room exit door.  Players acknowledge and internalize the inspirational phrase by reaching up and touching it as they head out onto the playing field.

Perhaps you should post this DQ-Tip above every break room exit door throughout your organization?

 

Related Posts

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

Data Governance and Data Quality

 

Additional Resources

Who is responsible for data quality?

DQ Problems? Start a Data Quality Recognition Program!

Starting Your Own Personal Data Quality Crusade

The Fragility of Knowledge

In his excellent book The Black Swan: The Impact of the Highly Improbable, Nassim Nicholas Taleb explains:

“What you don’t know is far more relevant than what you do know.”

Our tendency is to believe the opposite.  After we have accumulated the information required to be considered knowledgeable in our field, we believe that what we have learned and experienced (i.e. what we know) is far more relevant than what we don’t know.  We are all proud of our experience, which we believe is the path that separates knowledge from wisdom.

“We tend to treat our knowledge as personal property to be protected and defended,” explains Taleb.  “It is an ornament that allows us to rise in the pecking order.  We take what we know a little too seriously.”

However, our complacency is all too often upset by the unexpected.  Some new evidence is discovered that disproves our working theory of how things work.  Or something that we have repeatedly verified in the laboratory of our extensive experience, suddenly doesn’t produce the usual results.

Taleb cautions that this “illustrates a severe limitation to our learning from experience and the fragility of our knowledge.”

I have personally encountered this many times throughout my career in data quality.  At first, it seemed like a cruel joke or some bizarre hazing ritual.  Every time I thought that I had figured it all out, that I had learned all the rules, something I didn’t expect would come along and smack me upside the head.

“We do not spontaneously learn,” explains Taleb, “that we don’t learn that we don’t learn.  The problem lies in the structure of our minds: we don’t learn rules, just facts, and only facts.”

Facts are important.  Facts are useful.  However, sometimes our facts are really only theories.  Mistaking a theory for a fact can be very dangerous.  What you don’t know can hurt you. 

However, as Taleb explains, “what you know cannot really hurt you.”  Therefore, we tend to only “look at what confirms our knowledge, not our ignorance.”  This is unfortunate, because “there are so many things we can do if we focus on antiknowledge, or what we do not know.”

This is why, as a data quality consultant, when I begin an engagement with a new client, I usually open with the statement (completely without sarcasm):

“Tell me something that I don’t know.” 

Related Posts

Hailing Frequencies Open

Commendable Comments (Part 1)

Six months ago today, I launched this blog by asking: Do you have obsessive-compulsive data quality (OCDQ)?

As of September 10, here are the monthly traffic statistics provided by my blog platform:

OCDQ Blog Traffic Overview

 

It Takes a Village (Idiot)

In my recent Data Quality Pro article Blogging about Data Quality, I explained why I started this blog.  Blogging provides me a way to demonstrate my expertise.  It is one thing for me to describe myself as an expert and another to back up that claim by allowing you to read my thoughts and decide for yourself.

In general, I have always enjoyed sharing my experiences and insights.  A great aspect to doing this via a blog (as opposed to only via whitepapers and presentations) is the dialogue and discussion provided via comments from my readers.

This two-way conversation not only greatly improves the quality of the blog content, but much more importantly, it helps me better appreciate the difference between what I know and what I only think I know. 

Even an expert's opinions are biased by the practical limits of their personal experience.  Having spent most of my career working with what is now mostly IBM technology, I sometimes have to pause and consider if some of that yummy Big Blue Kool-Aid is still swirling around in my head (since I “think with my gut,” I have to “drink with my head”).

Don't get me wrong – “You're my boy, Blue!” – but there are many other vendors and all of them also offer viable solutions driven by impressive technologies and proven methodologies.

Data quality isn't exactly the most exciting subject for a blog.  Data quality is not just a niche – if technology blogging were a Matryoshka (a.k.a. Russian nested) doll, then data quality would be the last, innermost doll. 

This doesn't mean that data quality isn't an important subject – it just means that you will not see a blog post about data quality hitting the front page of Digg anytime soon.

All blogging is more art than science.  My personal blogging style can perhaps best be described as mullet blogging – not “business in the front, party in the back” but “take your subject seriously, but still have a sense of humor about it.”

My blog uses a lot of metaphors and analogies (and sometimes just plain silliness) to try to make an important (but dull) subject more interesting.  Sometimes it works and sometimes it sucks.  However, I have never been afraid to look like an idiot.  After all, idiots are important members of society – they make everyone else look smart by comparison.

Therefore, I view my blog as a Data Quality Village.  And as the Blogger-in-Chief, I am the Village Idiot.

 

The Rich Stuff of Comments

Earlier this year in an excellent IT Business Edge article by Ann All, David Churbuck of Lenovo explained:

“You can host focus groups at great expense, you can run online surveys, you can do a lot of polling, but you won’t get the kind of rich stuff (you will get from blog comments).”

How very true.  But before we get to the rich stuff of our village, let's first take a look at a few more numbers:

  • Not counting this one, I have published 44 posts on this blog
  • Those blog posts have collectively received a total of 185 comments
  • Only 5 blog posts received no comments
  • 30 comments were actually me responding to my readers
  • 45 comments were from LinkedIn groups (23), SmartData Collective re-posts (17), or Twitter re-tweets (5)

The ten blog posts receiving the most comments:

  1. The Two Headed Monster of Data Matching – 11 Comments
  2. Adventures in Data Profiling (Part 4) – 9 Comments
  3. Adventures in Data Profiling (Part 2) – 9 Comments
  4. You're So Vain, You Probably Think Data Quality Is About You – 8 Comments
  5. There are no Magic Beans for Data Quality – 8 Comments
  6. The General Theory of Data Quality – 8 Comments
  7. Adventures in Data Profiling (Part 1) – 8 Comments
  8. To Parse or Not To Parse – 7 Comments
  9. The Wisdom of Failure – 7 Comments
  10. The Nine Circles of Data Quality Hell – 7 Comments

 

Commendable Comments

This post will be the first in an ongoing series celebrating my heroes – my readers.

As Darren Rowse and Chris Garrett explained in their highly recommended ProBlogger book: “even the most popular blogs tend to attract only about a 1 percent commenting rate.” 

Therefore, I am completely in awe of my blog's current 88 percent commenting rate.  Sure, I get my fair share of the simple and straightforward comments like “Great post!” or “You're an idiot!” but I decided to start this series because I am consistently amazed by the truly commendable comments that I regularly receive.

On The Data Quality Goldilocks Zone, Daragh O Brien commented:

“To take (or stretch) your analogy a little further, it is also important to remember that quality is ultimately defined by the consumers of the information.  For example, if you were working on a customer data set (or 'porridge' in Goldilocks terms) you might get it to a point where Marketing thinks it is 'just right' but your Compliance and Risk management people might think it is too hot and your Field Sales people might think it is too cold.  Declaring 'Mission Accomplished' when you have addressed the needs of just one stakeholder in the information can often be premature.

Also, one of the key learnings that we've captured in the IAIDQ over the past 5 years from meeting with practitioners and hosting our webinars is that, just like any Change Management effort, information quality change requires you to break the challenge into smaller deliverables so that you get regular delivery of 'just right' porridge to the various stakeholders rather than boiling the whole thing up together and leaving everyone with a bad taste in their mouths.  It also means you can more quickly see when you've reached the Goldilocks zone.”

On Data Quality Whitepapers are Worthless, Henrik Liliendahl Sørensen commented:

“Bashing in blogging must be carefully balanced.

As we all tend to find many things from gurus to tools in our own country, I have also found one of my favourite sayings from Søren Kierkegaard:

If One Is Truly to Succeed in Leading a Person to a Specific Place, One Must First and Foremost Take Care to Find Him Where He is and Begin There.

This is the secret in the entire art of helping.

Anyone who cannot do this is himself under a delusion if he thinks he is able to help someone else.  In order truly to help someone else, I must understand more than he–but certainly first and foremost understand what he understands.

If I do not do that, then my greater understanding does not help him at all.  If I nevertheless want to assert my greater understanding, then it is because I am vain or proud, then basically instead of benefiting him I really want to be admired by him.

But all true helping begins with a humbling.

The helper must first humble himself under the person he wants to help and thereby understand that to help is not to dominate but to serve, that to help is not to be the most dominating but the most patient, that to help is a willingness for the time being to put up with being in the wrong and not understanding what the other understands.”

On All I Really Need To Know About Data Quality I Learned In Kindergarten, Daniel Gent commented:

“In kindergarten we played 'Simon Says...'

I compare it as a way of following the requirements or business rules.

Simon says raise your hands.

Simon says touch your nose.

Touch your feet.

With that final statement you learned very quickly in kindergarten that you can be out of the game if you are not paying attention to what is being said.

Just like in data quality, to have good accurate data and to keep the business functioning properly you need to pay attention to what is being said, what the business rules are.

So when Simon says touch your nose, don't be touching your toes, and you'll stay in the game.”

Since there have been so many commendable comments, I could only list a few of them in the series debut.  Therefore, please don't be offended if your commendable comment didn't get featured in this post.  Please keep on commenting and stay tuned for future entries in the series.

 

Because of You

As Brian Clark of Copyblogger explains, The Two Most Important Words in Blogging are “You” and “Because.”

I wholeheartedly agree, but prefer to paraphrase it as: Blogging is “because of you.” 

Not you meaning me, the blogger – you meaning you, the reader.

Thank You.

 

Related Posts

Commendable Comments (Part 2)

Commendable Comments (Part 3)


Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?

 

Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects, with many characteristics including the following (a minimal code sketch follows this list):

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independent of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield.  Second base and short stop are also infield positions, and more specifically middle infield.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
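
As a minimal sketch only – the class and field names below are my own illustrative assumptions, not taken from Loshin's book or from any fantasy sports platform – the distinction between reference data (professional players) and master data (fantasy players and teams) might be modeled like this:

    from dataclasses import dataclass, field
    from datetime import date
    from typing import List

    # Reference data object from the external content provider (e.g. MLB).
    @dataclass(frozen=True)
    class ProfessionalPlayer:
        player_id: str          # provider-assigned identifier (illustrative)
        first_name: str
        last_name: str
        birth_date: date
        uniform_number: int
        positions: List[str] = field(default_factory=list)  # taxonomy, e.g. ["1B", "3B"]

    # True master data object: a professional player in the context of one fantasy team.
    @dataclass
    class FantasyPlayer:
        professional: ProfessionalPlayer
        fantasy_team_id: str
        role: str               # e.g. "Batter", "Pitcher", "Fielder"
        active: bool = True     # lineup status for the current scoring period

    @dataclass
    class FantasyTeam:
        team_id: str
        name: str
        owner: str
        roster: List[FantasyPlayer] = field(default_factory=list)

Note how the same ProfessionalPlayer instance can be referenced by FantasyPlayer objects on several rosters, which is exactly why Fantasy Player, not Professional Player, is the master data object here.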

 

Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).
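
A minimal sketch of the Type 2 behavior described above – plain Python with made-up column names, not the actual warehouse schema or any particular ETL tool – looks something like this:

    from datetime import date

    # A Type 2 slowly changing dimension keeps history by closing the current row
    # and inserting a new one, rather than overwriting attributes in place.
    player_dim = [
        {"player_key": 1, "player_id": "MLB123", "team": "Red Sox",
         "effective_date": date(2009, 4, 1), "expiration_date": None, "current": True},
    ]

    def apply_scd2_update(dim, player_id, changes, as_of):
        """Close the current row for player_id and insert a new current row."""
        current = next(r for r in dim if r["player_id"] == player_id and r["current"])
        current["expiration_date"] = as_of
        current["current"] = False
        new_row = {**current, **changes,
                   "player_key": max(r["player_key"] for r in dim) + 1,
                   "effective_date": as_of, "expiration_date": None, "current": True}
        dim.append(new_row)

    # Example: the player is traded mid-season; both rows are retained,
    # which is what supports full history and rollbacks.
    apply_scd2_update(player_dim, "MLB123", {"team": "Yankees"}, date(2009, 7, 31))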

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players.  The alternative, a separate Fantasy Player dimension, would have redundantly stored multiple instances of the same professional player for each fantasy team he played for, and would have required snowflaking the Fantasy League and Fantasy Team dimensions.
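
A very small sketch of the factless idea, using the Professional Team Roster example above (the keys are illustrative surrogate values): the table has no numeric measures at all, and questions are answered by testing for or counting rows.

    from datetime import date

    # (date_key, professional_team_key, player_key) – the row itself is the fact:
    # "this player was on this professional team's roster on this day."
    professional_team_roster = {
        (date(2009, 7, 31), 10, 1),
        (date(2009, 7, 31), 10, 2),
    }

    on_roster = (date(2009, 7, 31), 10, 1) in professional_team_roster
    roster_size = sum(1 for d, team, _player in professional_team_roster
                      if d == date(2009, 7, 31) and team == 10)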

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables maintained current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
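
A minimal sketch of how one of those rolling snapshots might be rebuilt from the daily fantasy fact rows – the field layout and values are illustrative assumptions; a real aggregate table would be maintained incrementally in the database:

    from datetime import date, timedelta

    # Daily fantasy batting fact rows:
    # (date, fantasy_league_key, fantasy_team_key, player_key, home_runs)
    daily_batting = [
        (date(2009, 9, 1), 1, 42, 7, 1),
        (date(2009, 9, 5), 1, 42, 7, 2),
        (date(2009, 9, 9), 1, 42, 7, 0),
    ]

    def rolling_total(rows, player, as_of, days):
        """Sum a statistic for one player over the trailing N-day window."""
        window_start = as_of - timedelta(days=days)
        return sum(hr for d, _lg, _tm, p, hr in rows
                   if p == player and window_start < d <= as_of)

    # Previous 7-, 14-, and 21-day home run totals as of September 10.
    snapshots = {days: rolling_total(daily_batting, 7, date(2009, 9, 10), days)
                 for days in (7, 14, 21)}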

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.

Additionally, newer baseball statistics defined in the late 20th century, such as secondary average and runs created, have widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
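
For example, such a metadata entry might record batting average's settled definition alongside one commonly cited basic form of runs created (the latter is included only as an illustration of why formulas belong in metadata; several published variants exist):

    \text{AVG} = \frac{H}{AB}
    \qquad
    \text{RC}_{\text{basic}} = \frac{(H + BB) \times TB}{AB + BB}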

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.

 

Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 
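
A minimal sketch of that kind of cross-check – the statistic names and totals below are made up purely for illustration:

    # Year-to-date totals computed from the aggregate fact tables in the warehouse.
    warehouse_totals = {"HR": 212, "RBI": 804, "SB": 96}

    # The same totals as reported by the fantasy league website (the surrogate source).
    website_totals = {"HR": 212, "RBI": 801, "SB": 96}

    # Any non-zero difference flags a statistic to investigate in the base fact tables.
    discrepancies = {stat: warehouse_totals[stat] - website_totals.get(stat, 0)
                     for stat in warehouse_totals
                     if warehouse_totals[stat] != website_totals.get(stat)}
    print(discrepancies)   # {'RBI': 3}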

This is a significant advantage for fantasy sports – there are numerous external data sources freely available online that can be used for validation.  Of course, this wasn't always the case. 

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.

 

Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines the data quality dimensions, including the following, which are the most applicable to fantasy sports (a rough measurement sketch follows this list):

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.
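
As one rough illustration of how two of these dimensions might be measured against the fantasy warehouse – this sketch is my own, not from McGilvray's book, and every identifier and threshold in it is an assumption:

    from datetime import date

    # Player rows loaded into the warehouse, with the date each row was loaded (illustrative).
    loaded_players = [
        {"player_id": "MLB123", "loaded_on": date(2009, 9, 10)},
        {"player_id": "MLB456", "loaded_on": date(2009, 9, 12)},
    ]

    # The population of interest: every active player published by the league website.
    all_active_player_ids = {"MLB123", "MLB456", "MLB789"}

    # Data Coverage: share of the population of interest actually present in the warehouse.
    loaded_ids = {p["player_id"] for p in loaded_players}
    coverage = len(loaded_ids & all_active_player_ids) / len(all_active_player_ids)

    # Timeliness: share of rows loaded within one day of the scoring period they describe
    # (the one-day expectation is an assumed service level, not a standard).
    scoring_date = date(2009, 9, 9)
    timely = sum((p["loaded_on"] - scoring_date).days <= 1 for p in loaded_players) / len(loaded_players)

    print(f"coverage={coverage:.0%}, timeliness={timely:.0%}")   # coverage=67%, timeliness=50%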

 

Conclusion

I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”


To Parse or Not To Parse

“To Parse, or Not To Parse,—that is the question:
Whether 'tis nobler in the data to suffer
The slings and arrows of free-form fields,
Or to take arms against a sea of information,
And by parsing, understand them?”

Little known fact: before William Shakespeare made it big as a playwright, he was a successful data quality consultant. 

Alas, poor data quality!  The Bard of Avon knew it quite well.  And he was neither a fan of free verse nor free-form fields.

 

Free-Form Fields

A free-form field contains multiple (usually interrelated) sub-fields.  Perhaps the most common examples of free-form fields are customer name and postal address.

A Customer Name field with the value “Christopher Marlowe” is comprised of the following sub-fields and values:

  • Given Name = “Christopher”
  • Family Name = “Marlowe”

A Postal Address field with the value “1587 Tambur Lane” is comprised of the following sub-fields and values:

  • House Number = “1587”
  • Street Name = “Tambur”
  • Street Type = “Lane”

Obviously, both of these examples are simplistic.  Customer name and postal address are comprised of additional sub-fields, not all of which will be present on every record or represented consistently within and across data sources.
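
To make the idea concrete, here is a deliberately naive sketch of parsing those two free-form values into sub-fields.  It handles only the simplistic examples above; real parsers rely on much richer reference data and classification rules.

    import re

    def parse_customer_name(value):
        """Naively split a two-token name into given and family name sub-fields."""
        tokens = value.split()
        if len(tokens) >= 2:
            return {"given_name": tokens[0], "family_name": tokens[-1]}
        return {"given_name": value}

    def parse_postal_address(value):
        """Naively split a simple street address into house number, street name, and street type."""
        match = re.match(r"^(?P<house_number>\d+)\s+(?P<street_name>.+)\s+(?P<street_type>\w+)$", value)
        return match.groupdict() if match else {"unparsed": value}

    print(parse_customer_name("Christopher Marlowe"))
    # {'given_name': 'Christopher', 'family_name': 'Marlowe'}
    print(parse_postal_address("1587 Tambur Lane"))
    # {'house_number': '1587', 'street_name': 'Tambur', 'street_type': 'Lane'}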

Returning to the bard's question, a few of the data quality reasons to consider parsing free-form fields include:

  • Data Profiling
  • Data Standardization
  • Data Matching

 

Much Ado About Analysis

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  In Adventures in Data Profiling (Part 5), a data profiling tool was used to analyze the field Postal Address Line 1:

Field Formats for Postal Address Line 1
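
A minimal sketch of how such a field format might be constructed – the tiny lexicon below is made up for illustration; a real profiling tool would use far larger reference tables and probabilistic classification:

    # A tiny, made-up lexicon mapping tokens to classifications.
    LEXICON = {
        "street": "Street-Type", "st": "Street-Type", "lane": "Street-Type", "ln": "Street-Type",
        "box": "Box-Type", "po": "Box-Type",
    }

    def classify_token(token):
        if token.isdigit():
            return "Number"
        return LEXICON.get(token.strip(".,").lower(), "Word")

    def field_format(value):
        """Build a field format by classifying each token in a free-form value."""
        return " ".join(classify_token(t) for t in value.split())

    print(field_format("1587 Tambur Lane"))   # Number Word Street-Type
    print(field_format("PO Box 1587"))        # Box-Type Box-Type Number

Analyzing the distinct formats (rather than the millions of distinct values) is what makes high-cardinality free-form fields tractable.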

 

The Taming of the Variations

Free-form fields often contain numerous variations resulting from data entry errors, different conventions for representing the same value, and a general lack of data quality standards.  Additional variations are introduced by multiple data sources, each with its own unique data characteristics and quality challenges.

Data standardization parses free-form fields to break them down into their smaller individual sub-fields to gain improved visibility of the available input data.  Data standardization is the taming of the variations that creates a consistent representation, applies standard values where appropriate, and when possible, populates missing values.

The following example shows parsed and standardized postal addresses:

Parsed and Standardized Postal Address
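
As a rough illustration of the kind of rules involved – the standard values here are illustrative assumptions; production tools draw on country-specific postal reference data:

    # Illustrative standard values for street types.
    STREET_TYPE_STANDARDS = {"st": "ST", "street": "ST", "ln": "LN", "lane": "LN",
                             "ave": "AVE", "avenue": "AVE"}

    def standardize_address(parsed):
        """Apply standard values to parsed postal address sub-fields."""
        standardized = dict(parsed)
        street_type = parsed.get("street_type", "").strip(".").lower()
        if street_type in STREET_TYPE_STANDARDS:
            standardized["street_type"] = STREET_TYPE_STANDARDS[street_type]
        standardized["street_name"] = parsed.get("street_name", "").upper()
        return standardized

    print(standardize_address({"house_number": "1587", "street_name": "Tambur", "street_type": "Lane"}))
    # {'house_number': '1587', 'street_name': 'TAMBUR', 'street_type': 'LN'}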

In your data quality implementations, do you use this functionality for processing purposes only?  If you retain the standardized results, do you store the parsed and standardized sub-fields or just the standardized free-form value?

 

Shall I compare thee to other records?

Data matching often uses data standardization to prepare its input.  This allows for more direct and reliable comparisons of parsed sub-fields with standardized values, decreases the failure to match records because of data variations, and increases the probability of effective match results.

Imagine matching the following product description records with and without the parsed and standardized sub-fields:

Parsed and Standardized Product Description
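
A minimal sketch of why this helps – the product attributes and the simple agreement score below are illustrative assumptions, not any particular matching engine's algorithm:

    def match_on_subfields(record_a, record_b, fields):
        """Score two records by the share of parsed, standardized sub-fields that agree exactly."""
        agreements = sum(record_a.get(f) == record_b.get(f) for f in fields)
        return agreements / len(fields)

    raw_a = "Bolts, hex 10mm x 50 mm stainless"
    raw_b = "10x50mm S/S Hex Bolt"
    print(raw_a == raw_b)   # False: the raw free-form descriptions never match

    parsed_a = {"item": "BOLT", "head": "HEX", "diameter_mm": 10, "length_mm": 50, "material": "STAINLESS STEEL"}
    parsed_b = {"item": "BOLT", "head": "HEX", "diameter_mm": 10, "length_mm": 50, "material": "STAINLESS STEEL"}
    print(match_on_subfields(parsed_a, parsed_b,
                             ["item", "head", "diameter_mm", "length_mm", "material"]))   # 1.0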

 

Doth the bard protest too much? 

Please share your thoughts and experiences regarding free-form fields.