So Long 2009, and Thanks for All the . . .

Before I look ahead to the coming New Year and wonder what it may (or may not) bring, I wanted to pause, reflect, and share in the following OCDQ Video some of the many joys that I am thankful 2009 brought me.

If you are having trouble viewing this video, then you can watch it on Vimeo by clicking on this link: OCDQ Video

 

Thank You

Thank you all—and I do mean every single one of you—thank you for everything.

Happy New Year!!!

The War of Word Craft

After publishing my previous post, I watched Empire of the Word Part 4: The Future of Reading, which was a panel discussion on The Agenda with Steve Paikin, featuring Cynthia Good, Keith Oatley, Mark Federman, Bob Stein, and Bill Buxton.

Please let me stress that I highly respect all of the panelists who were involved in this discussion.  My selective paraphrasing of their quotes, which I have woven into the tapestry of this blog post, doesn't come close to doing justice to the full range of excellent insights they shared.  Therefore, although it is 53 minutes long, I highly recommend watching the full video.

 

The War of Word Craft

Bob Stein used the extremely popular multi-player online game World of Warcraft, where the players collaboratively create the narrative in real-time, as an example of the type of interactive multimedia experience that may be the true future of reading.

This analogy inspired my post title—since the debate seems to be about not only the future of reading, but also the future of how what we read (and by whatever means we “read” it) will be produced—or using far more dramatic flourish, this debate is about:

The War of Word Craft

e-Books are the end of anything worth reading?

When the financial implications of electronic publishing were briefly discussed, Bill Buxton explained that when things go digital and there is no cost of goods (i.e., producing an e-book), there is a law of economics that states the price drops essentially to zero.

Buxton argued this would mean the end of anything worth reading.  When professional writers are no longer able to make a living from writing (i.e., because e-books are “free”), only amateurs will write.  This will cause a dramatic drop in the overall quality of writing, and therefore no new writing will be worth reading.

 

Publishing companies are the gatekeepers of standards?

A somewhat similar sentiment was expressed by Cynthia Good, in defending what have traditionally been considered the gatekeepers for the standards of high quality, professional writing—publishing companies. 

(Please note: Good was formerly the president of a publishing company, and is now an academic director of publishing.)

Good argues that historically it has been publishers and editors who select and perfect the books to be published, thereby guaranteeing high standards for quality writing—and that society still requires these standards.

 

The Cult of the Amateur

In 2007, Andrew Keen wrote the controversial book The Cult of the Amateur, which has the provocative sub-title: “how blogs, MySpace, YouTube, and the rest of today's user-generated media are destroying our economy, our culture, and our values.”

I am definitely not suggesting Buxton and Good are advocating a similar perspective.  However, I find both the notion that only “professional” writers can write anything worth reading, and the notion that we require gatekeepers of “standards” to protect us from ourselves, to be incredibly pretentious and outdated ideas.

Writing is not an esoteric skill possessed by only a select few—and the best writers are not motivated (only) by money.

Publishing companies publish books that guarantee a high profit margin—and not high standards for quality writing.

 

The New Word Order

Bob Stein discussed the differences between the old-school and new-school mentality of authors.

The commitment of old-school authors is to engage with the subject matter on behalf of future readers.

By contrast, the commitment of new-school authors is to engage with readers in the context of the subject matter.

Stein believes the future role of the publisher is to develop a community around the subject matter, and bring the content to the community who wants to read it, instead of pushing the community toward the content you tell them they should read.

Mark Federman agreed, and sees the role of the publisher changing into one of creating an environment of engagement for genres and niche communities, which bring together writers and readers.

Federman also sees the roles of writers and readers becoming interchangeable within these communities. 

Quoting Finnegans Wake by James Joyce: “my consumers, are they not my producers?”

Pardon the pun, but I believe this will become the new order of the publishing world, or more simply: The New Word Order.

 

A Different Kind of Social Media

Bob Stein explained that solitary reading is really a recent development in human history.  Previously, most reading was a very social activity, where groups of people came together to listen to books (and poetry and other works) being read out loud.

Books (and reading as we know it) will not go away.  However, Stein believes we are at the very beginning of the explosion of new forms of written (and other creative) expression. 

The idea of reading (and writing) with others is going to become commonplace again, because we value the input of others, which greatly improves our individual experience and understanding, and unleashes the true joy of reading.

In what Stein describes, I see the future of reading and writing as a different kind of social media—a better kind of social media.

 

New Medium, New Message

In his book Understanding Media: The Extensions of Man, Marshall McLuhan coined the phrase: “the medium is the message.”

Steve Paikin asked what, within this new medium we have been discussing, is the message?

Mark Federman responded:

“Connection—the ability to connect readers and writers and interchange their roles.  The ability to collaborate as we construct knowledge, as we engage with one another's experiences, as we bring multiple contexts into understanding what it is we are reading and creating simultaneously—that's the message.”


Will people still read in the future?

This question and debate were motivated by my comments on the recent blog post The Future of Reading by Phil Simon.

In the following OCDQ Video, I share some of my perspectives on the future of reading, specifically covering three key points:

  1. Books vs. e-Books
  2. Print Media vs. Social Media
  3. Reading vs. Multimedia

If you are having trouble viewing this video, then you can watch it on Vimeo by clicking on this link: OCDQ Video

 

A Very Brief History of Human Communication

Long before written language evolved, humans communicated using hand and facial gestures, monosyllabic and polysyllabic grunting, as well as crude drawings and other symbols, all in an attempt to share our thoughts and feelings with each other.

First, improved spoken language increased our ability to communicate by using words as verbal symbols for emotions and ideas.  Listening to stories, and retelling them to others, became the predominant means of education and “recording” our history.

Improved symbolism via more elaborate drawings, sculptures, and other physical and lyrical works of artistic expression, greatly enhanced our ability to not only communicate, but also leave a lasting legacy beyond the limits of our individual lives.

Later, written language would provide a quantum leap in human evolution.  Writing (and reading) greatly improved our ability to communicate, educate, record our history, and thereby pass on our knowledge and wisdom to future generations.

 

The Times They Are a-Changin’

The pervasiveness of the Internet and the rapid proliferation of powerful mobile technology are transforming the very nature of human communication, and some purists might even argue they are making it regress.

I believe there is already a declining interest in reading throughout society in general, and more specifically, a marked decline across current generation gaps, which will become even more dramatic in the coming decades.

 

Books vs. e-Books

People are reading fewer books—and fewer people are reading books.  The highly polarized “book versus e-book debate” is really only a debate within the shrinking segment of the population that still reads books. 

So, yes, between us book lovers, some of us will not exchange our personal tactile relationship with printed books for an e-book reader made of the finest plastic, glass, and metal, and equipped with all the bells and whistles of the latest technology. 

However, e-book readers simply aren't going to make non-book readers want to read books.  I am truly sorry Amazon and Barnes & Noble, but the truth is that the Kindle and Nook are not going to make reading books cool.  They will simply provide an alternative for people who already enjoy reading books, and mostly for those who also love having the latest techno-gadgets.

 

Print Media vs. Social Media

We continue to see print media (newspapers, magazines, and books) either offering electronic alternatives, or transitioning into online publications—or in some cases, simply going out of business.

I believe the primary reason for this media transition is our increasing interest in exchanging what has traditionally been only a broadcast medium (print media) for a conversation medium (social media).

Social media can engage us in conversation and enable communication between content creators and their consumers.

We are constantly communicating with other people via phone calls, text messages, e-mails, and status updates on Twitter and Facebook.  We are also sharing more of our lives visually through the photos we post on Flickr and the videos we post on YouTube.  More and more, we are creating—and not just consuming—content that we want to share with others.

We are also gaining more control over how we filter communication.  Google real-time searches and e-mail alerts, RSS readers, and hashtagged Twitter streams—these are just a few examples of the many tools currently allowing us to customize and personalize the content we create and consume.

We are becoming an increasingly digital society, and through social media, we are living more and more of both our personal and professional lives online, blurring—if not eliminating—the distinction between the two.

 

Reading vs. Multimedia

I believe the future of human communication will be a return to the more direct social interactions that existed before the evolution of written language.  I am not predicting a return to polysyllabic grunting and interpretive dance. 

Instead, I believe we will rely less and less on reading and writing, and more and more on watching, listening, and speaking.

The future of human communication may become short digital bursts of multimedia experiences, seamlessly blending an economy of words with audio and video elements.  Eventually, even digitally written words may themselves disappear—and we will communicate via interactive digital video and audio—and the very notion of “literacy” may become meaningless.

But fear not—I don't predict this will happen until the end of the century—and I am probably completely wrong anyway.

 

Please Share Your Thoughts

Do you read a lot of books?  If so, have you purchased an e-book reader (e.g., Amazon Kindle, Barnes & Noble Nook) or are you planning to in the near future?  If you have an e-book reader, how would you compare it to reading a printed book?

Do you read newspapers and/or magazines?  If so, are you reading them in print or online? 

How often do you read blogs and other publications that are only available as online content?

How often do you listen to podcasts or watch video blogs or other online videos (excluding television and movies)?

What is the future of reading?


Recently Read: December 21, 2009

Recently Read is an OCDQ regular segment.  Each entry provides links to blog posts, articles, books, and other material I found interesting enough to share.  Please note “recently read” is literal – therefore what I share wasn't necessarily recently published.

 

Data Quality

For simplicity, “Data Quality” also includes Data Governance, Master Data Management, and Business Intelligence.

  • Welcome to DQ Directions – In this blog post, Dylan Jones of Data Quality Pro formally announced the DQ Directions online conference, which will debut in Q2 2010, and will feature presentations from experts and industry thought leaders specializing in data quality, data governance, and master data management.

     

  • Ways to 'Communivate' your Data Issues – In her Purple Cow of a blog post, Jill Wanless (aka Sheezaredhead) explains that ‘Communivate’ is a combination of the words communicate and innovate, and it means to communicate in an innovative way, which she does regarding the importance of data quality.

     

  • ’Tis the Season for a Data Governance Carol – Part 1 and Part 2 – In his excellent two-part series, Rob Paller of Baseline Consulting uses a Dickensian framework to explain the importance of data governance and data quality – and the fact that there isn’t a simple framework to blindly follow for Data Governance.

     

  • The “Santa Intelligence” Team – An excellent Christmas-themed blog post from Paul Boal, in which we learn that Santa does indeed have a Business Intelligence team.

     

  • Data quality is for life not just for Christmas – In this Diary of a Marketing Insight Guy blog post, Simon Daniels reminds us data quality can be a gift that will keep on giving—if data quality management is built into the heart of an organization’s processes and operations.

     

  • Finding a home for MDM – In his second post on the DataFlux Community of Experts, Charles Blyth examines where master data management (MDM) fits within your overall enterprise architecture.

     

  • The Decade of Data: Seven Trends to Watch in 2010 – In his blog post on Informatica Perspectives, Joe McKendrick examines some up-and-coming trends that he predicts will shape the data management space in 2010.

     

  • Are we ready for all this data? – In his blog post, Rich Murnane uses some recent news stories to ponder whether even we experienced data geeks are really ready for the amount of data we're going to need to manage due to the unrelenting increases in data volumes.

 

Social Media

For simplicity, “Social Media” also includes Blogging, Writing, Social Networking, and Online Marketing.

 

Book Quotes

An eclectic list of quotes from some recently read (and/or simply my favorite) books.

  • From Crush It! by Gary Vaynerchuk – “Your business and your personal brand need to be one and the same...Your latest tweet and comment on Facebook and most recent blog post—that's your résumé now...It's a whole new world, build your personal brand and get ready for it.”

     

  • From A Whole New Mind by Daniel Pink – “Empathy is neither a deviation from intelligence nor the single route to it.  Sometimes we need detachment; many other times we need attachment.  The people who will thrive will be those who can toggle between the two.” 

     

  • From Connected by Nicholas Christakis and James Fowler – “Just as brains can do things that no single neuron can do, so can social networks do things that no single person can do...our connections to other people matter...most of all it is about what makes us uniquely human...To know who we are, we must understand how we are connected.”

Podcast: Stand-Up Data Quality

December—the last month of the year when we hustle and bustle to finish our work, while visions of sugar-plums dance in our holiday shopping heads.  During this time of year, little attention (and rightfully so) is paid to the blogosphere—especially the neither naughty nor nice, but simply niche-y corners of the blogosphere.

As I have often joked, data quality is not just a niche – if technology blogging were a Matryoshka (a.k.a. Russian nested) doll, then data quality would be the last, innermost doll.  This doesn't mean that data quality isn't an important subject – it just means its extra-niche-y-ness all but guarantees December (and usually January and most of February too) will be a very cold month – when all niche blogs struggle to rub two random RSS readers together in order to start a cozy fire, keeping them warm until their blogging hope springs eternal once again come springtime.

Niche blogs can either shut down during this blogging lull, or use it as an opportunity to experiment.  I have chosen the latter, which explains why four of my last six blog posts have used either a Podcast or a Video.

Not to worry though, I haven't given up writing more “traditional” blog posts.  I simply plan to use more podcasts and videos in 2010 as a way to add more variety (and more of a personal touch) to my blog content.  They may not appear as frequently as they have recently, but more is to come in the new year.  For now, I am experimenting with how best to produce them.

 

Stand-Up Data Quality

In this OCDQ Podcast, I discuss using humor to enliven a niche topic, and revisit some of the stand-up comedy aspects of some of my favorite written-down blog posts from earlier this year.

Humor can be a great way to start a conversation and hold your readers' attention for those few precious additional seconds while you are getting to your point.  Obviously, there will be times when the seriousness of your subject would make comedy inappropriate, and if you are not naturally inclined to use humor, then you shouldn't try to force it.

 

You can also download this podcast (MP3 file) by clicking on this link: Stand-Up Data Quality

 

Related Posts

The Tell-Tale Data

Data Quality: The Reality Show?

Data Quality is People!

All I Really Need To Know About Data Quality I Learned In Kindergarten

The Mullet Blogging Manifesto

Video: The DQ General's Song

In this OCDQ Video, I revisit The Very Model of a Modern DQ General, which was the second post ever published on this blog.

Using The Major-General's Song from The Pirates of Penzance by Gilbert and Sullivan as a framework, I encapsulated into lyrics some of the knowledge I have accumulated from over 15 years of experience in the data quality profession.  The intended result was a comical delivery of serious insight.

I recorded a video and not simply a podcast so that you could follow along with the lyrics.  However, my budget couldn't afford the inclusion of the “follow the bouncing ball” technology I enjoyed in many of my favorite childhood cartoons. 

Sparing you the pain of listening to me actually sing, I instead offer for your amusement, my recital of The DQ General's Song:

 

If you are reading this blog post via e-mail or a feed reader, then to view this video, please click on this link: OCDQ Video

 

Related Posts

The Very Model of a Modern DQ General

Imagining the Future of Data Quality

Data Quality is Sexy

’Twas Two Weeks Before Christmas

’Twas two weeks before Christmas, and all about the data warehouse,
Every employee was stirring, busy clicking their mouse;
The stockings were hung on our cubicle walls with care,
In hopes that year-end bonus checks soon would be there.

The data were nestled all snug in their test beds,
While visions of sugar-plums danced in DBAs' heads;
Working together, the Business and IT, for collaboration is best,
All had just settled in, for a winter night's long, pre-production test.

When out in the parking lot there arose such a clatter,
We all sprang from our desk chairs to see what was the matter;
Away to the window we flew like a flash,
Tore open the shutters and threw up the sash.

The moon on the crest of the new-fallen snow,
Gave the luster of mid-day to objects below;
When, what to our wondering eyes should appear?

The Big Boss Man dressed up as Santa,
Carrying eight tiny candles, to Light the Menorah.

We descended the stairs to the lobby, so lively and quick,
We wanted to know in mere moments, if this was some trick;
The Big Boss Man greeted us, as into the lobby we all did file,
He whistled, and shouted, then gave us a big grinning smile.

He was dressed all in faux fur, from his head to his toes,
And his clothes were well-tailored with buttons and bows;
A bundle of bonus checks he had flung on his back,
We were as giddy as young children as he opened the sack.

His eyes—how they twinkled, his dimples how merry!
His cheeks were like roses, his nose like a cherry!
His droll little mouth was drawn up like a bow,
And the beard of his chin was as white as the snow.

The stump of a pipe he held tight in his teeth,
And the smoke it encircled his head like a wreath;
He had a broad face and a little round belly,
That shook when he laughed, like a bowlful of jelly.

He was chubby and plump, a right jolly old elf,
And we laughed when we saw him, in spite of ourselves;
A wink of his eye and a twist of his head,
Soon gave us to know, we had nothing to dread.

And these were the words that carefully he said:

“Whether you celebrate Christmas or Hanukkah, Kwanzaa or Festivus,
Whether for you, these are Holy Days or holidays, or simply a rest for us,
My words are the same, and they are just as bright:

Peace, Love, and Happiness to All,
And to all—A Good Night.”

To you and yours, from the entire OCDQ Blog family.

Video: Twitter Search Tutorial

In this OCDQ Video, I provide a brief tutorial on Twitter Search.

Key points about Twitter Search covered in the video tutorial:

  • Unlike other social networking sites (e.g., Facebook, LinkedIn), you don't need an account for read access to Twitter content
  • This is a safe way for you or your company to start leveraging Twitter for “listening purposes only”
  • You can save Twitter Search queries as RSS feeds (e.g., for viewing within Google Reader)

 

If you are reading this blog post via e-mail or a feed reader, then to view this video, please click on this link: OCDQ Video

 

For more help finding data quality content on Twitter, click on this link: Data Quality on Twitter

 

Related Posts

Live-Tweeting: Data Governance

Brevity is the Soul of Social Media

If you tweet away, I will follow

Tweet 2001: A Social Media Odyssey

Recently Read: December 7, 2009

Recently Read is an OCDQ regular segment.  Each entry provides links to blog posts, articles, books, and other material I found interesting enough to share.  Please note “recently read” is literal – therefore what I share wasn't necessarily recently published.

 

Data Quality

For simplicity, “Data Quality” also includes Data Governance, Master Data Management, and Business Intelligence.

  • Data Quality Blog Roundup - November 2009 Edition – Dylan Jones at Data Quality Pro always provides a great collection of the previous month's best blog posts, which covers most of my “recently reads” for data quality.

     

  • The value of Christmas cards – In this Data Value Talk blog post from Human Inference, we learn about how sending Christmas cards can optimize your data quality.

     

  • Santa Quality – Yes, Virginia, there is a Santa Claus—as well as a Saint Nicholas, a Père Noël, a Weihnachtsmann, and a Julemand.  In this blog post, Henrik Liliendahl Sørensen explains some ho-ho-holiday data quality issues.

     

  • Some TLC for Your Data – Data really needs some tender loving care.  Daniel Gent explains in his latest blog post.

     

  • Determining data quality is the first key step – In the second part of a blog series on data migration, James Standen explains that a data migration project will be required to actually improve data quality at the same time, and therefore it is really two projects in one.  The post contains the great line: “data quality sense tingling.”

     

  • Data Chaos and Five Truisms of Data Quality – In his debut post on the DataFlux Community of Experts, my good friend Phil Simon provides a quick case study and five universal truths of data quality.

 

Social Media

For simplicity, “Social Media” also includes Blogging, Writing, Social Networking, and Online Marketing.

 

Awesome Stuff

An eclectic list of articles, blog posts, and other “non-data quality, non-social media, but still awesome” stuff.

  • The Greatest Book Of All Time? – Josh Hanagarne (a.k.a. the “World’s Strongest Librarian”) recently reviewed a book he received from Ethan.  Josh has a simple philosophy of life — “Don’t make anyone’s day worse” — if you are having a bad day (like I was the day I found this), then check this out.

     

  • Cute Apple parody from The Sun – Rob Beschizza on Boing Boing shares a great one minute video of a recent commercial from The Sun about “The UK's best handheld for 40 years.”


Podcast: Your Blog, Your Voice

In this OCDQ Podcast, I discuss the importance of blogging in your own voice. 

The best way to produce unique content is to let your blogging style reflect your personality.  Make your readers feel like they are having a conversation with a real person – not just someone who is blogging what they think people want to read.

Your Blog, Your Voice

 

You can also download this podcast (MP3 file) by clicking on this link: Your Blog, Your Voice

 

Related Posts

The Mullet Blogging Manifesto

Collablogaunity

Brevity is the Soul of Social Media

Live-Tweeting: Data Governance

The term “live-tweeting” describes using Twitter to provide near real-time reporting from an event.  I live-tweet from the sessions I attend at industry conferences as well as interesting webinars.

Recently, I live-tweeted Successful Data Stewardship Through Data Governance, which was a data governance webinar featuring Marty Moseley of Initiate Systems and Jill Dyché of Baseline Consulting.

Instead of writing a blog post summarizing the webinar, I thought I would list my tweets with brief commentary.  My goal is to provide an example of this particular use of Twitter so you can decide its value for yourself.

 

As the webinar begins, Marty Moseley and Jill Dyché provide some initial thoughts on data governance:

Live-Tweets 1

 

Jill Dyché provides a great list of data governance myths and facts:

Live-Tweets 2

 

Jill Dyché provides some data stewardship insights:

Live-Tweets 3

 

As the webinar ends, Marty Moseley and Jill Dyché provide some closing thoughts about data governance and data quality:

Live-Tweets 4

 

Please Share Your Thoughts

If you attended the webinar, then you know additional material was presented.  Did my tweets do the webinar justice?  Did you follow along on Twitter during the webinar?  If you did not attend the webinar, then are these tweets helpful?

What are your thoughts in general regarding the pros and cons of live-tweeting? 

 

Related Posts

The following three blog posts are conference reports based largely on my live-tweets from the events:

Enterprise Data World 2009

TDWI World Conference Chicago 2009

DataFlux IDEAS 2009

Data Quality is Sexy

 


I am sick and tired of hearing people talk about how data quality (DQ) is not sexy.

I was talking with my friend J.T. the other day and he told me I simply needed to remind people data quality has always been sexy.  Sometimes, people just have a tendency to forget. 

J.T. told me:

“You know what you gotta do J.H.?  You gotta bring DQ Sexy back.”

True dat, J.T.

 

I'm Bringing DQ Sexy Back

 


 

I’m bringing DQ Sexy back

All you naysayers, watch how I attack

I think your data’s special, why does your quality lack?

Grant me some access, and I’ll pick up the slack

 

 


 

Dirty data – you see the problems everywhere

Let me be your data cleanser, and baby, I'll be there

We'll whip the Business Process if it misbehaves

But just remember – trying to be perfect – it's not the way

 

 


I’m bringing DQ Sexy back

Them non-team players don’t know how to act

Let our collaboration get us back on track

Working together, we'll make the right impact

 

 


 

Look at that data – it's your 'prise asset 
Treat it well, and all your business needs will be met

Understanding it will really make you smile 
To get started, you really need to profile

There's no need for you to be afraid – come on 
Go ahead – get your data freak on

 


I’m bringing DQ Sexy back

Any non-believers left?  Don't make me give you a smack

If you have data, you'd better watch out for what it lacks

'Cause quality is what it needs – and that’s a fact

 

 

Data Quality is Sexy


That’s right. 

Data Quality is Sexy. 

Always has been. 

Always will be.

True dat, J.H.

Fo real!

 

Adventures in Data Profiling (Part 8)

Understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.  This post is the conclusion of a vendor-neutral series on the methodology of data profiling.

Data profiling can help you perform essential analysis such as:

  • Provide a reality check for the perceptions and assumptions you may have about the quality of your data
  • Verify your data matches the metadata that describes it
  • Identify different representations for the absence of data (i.e., NULL and other missing values)
  • Identify potential default values
  • Identify potential invalid values
  • Check data formats for inconsistencies
  • Prepare meaningful questions to ask subject matter experts

Data profiling can also help you with many of the other aspects of domain, structural and relational integrity, as well as determining functional dependencies, identifying redundant storage, and other important data architecture considerations.

 

Adventures in Data Profiling

This series was carefully designed as guided adventures in data profiling in order to provide the necessary framework for demonstrating and discussing the common functionality of data profiling tools and the basic methodology behind using one to perform preliminary data analysis.

In order to narrow the scope of the series, the scenario used was that a customer data source for a new data quality initiative had been made available to an external consultant with no prior knowledge of the data or its expected characteristics.  Additionally, business requirements had not yet been documented, and subject matter experts were not currently available.

This series did not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that were covered.  Both the data profiling tool and data used throughout the series were fictional.  The “screen shots” were customized to illustrate concepts and were not modeled after any particular data profiling tool.

This post summarizes the lessons learned throughout the series, and is organized under three primary topics:

  1. Counts and Percentages
  2. Values and Formats
  3. Drill-down Analysis

 

Counts and Percentages

One of the most basic features of a data profiling tool is the ability to provide counts and percentages for each field that summarize its content characteristics:

 Data Profiling Summary

  • NULL – count of the number of records with a NULL value 
  • Missing – count of the number of records with a missing value (i.e., non-NULL absence of data, e.g., character spaces) 
  • Actual – count of the number of records with an actual value (i.e., non-NULL and non-Missing) 
  • Completeness – percentage calculated as Actual divided by the total number of records 
  • Cardinality – count of the number of distinct actual values 
  • Uniqueness – percentage calculated as Cardinality divided by the total number of records 
  • Distinctness – percentage calculated as Cardinality divided by Actual
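To make these definitions concrete, here is a minimal Python sketch that computes all seven measures for a single field.  It is only an illustration, not the behavior of any particular data profiling tool: the function name profile_field, the sample Tax ID values, and the choice to treat empty or whitespace-only strings as “Missing” are all assumptions.

```python
from typing import Optional, Sequence


def profile_field(values: Sequence[Optional[str]]) -> dict:
    """Summarize one field using the counts and percentages listed above."""
    total = len(values)
    null_count = sum(1 for v in values if v is None)
    # Assumption: "Missing" means a non-NULL value that is empty or only whitespace.
    missing_count = sum(1 for v in values if v is not None and v.strip() == "")
    actual = [v for v in values if v is not None and v.strip() != ""]
    cardinality = len(set(actual))

    def pct(numerator: int, denominator: int) -> float:
        # Guard against division by zero for empty fields.
        return 100.0 * numerator / denominator if denominator else 0.0

    return {
        "null": null_count,
        "missing": missing_count,
        "actual": len(actual),
        "completeness": pct(len(actual), total),
        "cardinality": cardinality,
        "uniqueness": pct(cardinality, total),
        "distinctness": pct(cardinality, len(actual)),
    }


# Hypothetical sample data: eight records, one NULL, one blank, two repeated values.
tax_ids = ["111", "222", "222", None, "   ", "333", "333", "444"]
summary = profile_field(tax_ids)
```

For this sample, the field is 75% complete, 50% unique, and roughly 67% distinct, so it would not qualify as a single primary key.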

Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a single primary key, which should be both 100% complete and 100% unique.  In Part 2, Customer ID provided an excellent example.

Distinctness can be useful in evaluating the potential for duplicate records.  In Part 6, Account Number and Tax ID were used as examples.  Both fields were less than 100% distinct (i.e., some distinct actual values occurred on more than one record).  The implied business meaning of these fields made this an indication of possible duplication.
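When distinctness falls below 100%, a natural next step is to list the actual values that occur on more than one record.  The sketch below shows one way to do that; the function name `repeated_values` is hypothetical, and whether a repeated value really indicates a duplicate still depends on the business meaning of the field.

```python
from collections import Counter

def repeated_values(values):
    """Return actual values that occur on more than one record, with counts."""
    counts = Counter(
        v for v in values
        if v is not None and str(v).strip() != ""  # skip NULL and Missing
    )
    return {value: count for value, count in counts.items() if count > 1}
```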

Data profiling tools generate other summary statistics including: minimum/maximum values, minimum/maximum field sizes, and the number of data types (based on analyzing the values, not the metadata).  Throughout the series, several examples were provided, especially in Part 3 during the analysis of Birth Date, Telephone Number and E-mail Address.
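The following is a rough sketch of these other summary statistics, assuming values arrive as strings.  The type names, the `YYYY-MM-DD` date layout, and the function names are assumptions for illustration; a real tool would recognize many more types and layouts.  Note that minimum and maximum values here are compared as text, not as numbers or dates.

```python
from collections import Counter
from datetime import datetime

def infer_type(text):
    """Guess a data type from the value itself, not from metadata."""
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date layout
        return "date"
    except ValueError:
        return "text"

def summary_statistics(values):
    """Min/max values, min/max field sizes, and inferred data types."""
    actual = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not actual:
        return None  # nothing to summarize
    sizes = [len(v) for v in actual]
    return {
        "min_value": min(actual),  # lexicographic comparison
        "max_value": max(actual),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "data_types": dict(Counter(infer_type(v) for v in actual)),
    }
```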

 

Values and Formats

In addition to counts, percentages, and other summary statistics, a data profiling tool generates frequency distributions for the unique values and formats found within the fields of your data source.

A frequency distribution of unique values is useful for:

  • Fields with an extremely low cardinality, indicating potential default values (e.g., Country Code in Part 4)
  • Fields with a relatively low cardinality (e.g., Gender Code in Part 2)
  • Fields with a relatively small number of known valid values (e.g., State Abbreviation in Part 4)

A frequency distribution of unique formats is useful for:

  • Fields expected to contain a single data type and/or length (e.g., Customer ID in Part 2)
  • Fields with a relatively limited number of known valid formats (e.g., Birth Date in Part 3)
  • Fields with free-form values and a high cardinality (e.g., Customer Name 1 and Customer Name 2 in Part 7)
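One common way to build a frequency distribution of formats is to reduce each value to a mask and then count the masks.  The sketch below assumes a simple convention (letters become `A`, digits become `9`, everything else is kept as-is); the convention and the function names are illustrative, not those of any particular tool.

```python
from collections import Counter

def format_mask(value):
    """Reduce a value to its format: letters -> A, digits -> 9."""
    mask = []
    for ch in value:
        if ch.isdigit():
            mask.append("9")
        elif ch.isalpha():
            mask.append("A")
        else:
            mask.append(ch)  # keep punctuation and spaces
    return "".join(mask)

def format_frequencies(values):
    """Count how often each format occurs, most frequent first."""
    return Counter(format_mask(v) for v in values).most_common()
```

For example, `format_mask("(555) 123-4567")` yields `(999) 999-9999`, so differently punctuated telephone numbers show up as separate formats.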

Cardinality can play a major role in deciding whether you want to be shown values or formats since it is much easier to review all of the values when there are not very many of them.  Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values, as we saw throughout the series (e.g., Telephone Number in Part 3).
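Limiting the review to the most frequently occurring values can be sketched as below.  The function name `top_values` and the default of five values are assumptions; the percentage is calculated against the number of records with an actual value.

```python
from collections import Counter

def top_values(values, n=5):
    """Return (value, count, percent of actual) for the n most frequent values."""
    actual = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(actual)
    total = len(actual)
    return [
        (value, count, round(100.0 * count / total, 2))
        for value, count in counts.most_common(n)
    ]
```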

Some fields can also be analyzed using partial values (e.g., in Part 3, Birth Year was extracted from Birth Date) or a combination of values and formats (e.g., in Part 6, Account Number had an alpha prefix followed by all numbers).
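Both ideas can be sketched in a few lines.  The example below assumes birth dates stored as `YYYY-MM-DD` text and account numbers made of letters followed by digits; those layouts and the function names are assumptions for illustration, since the series data was fictional.

```python
import re
from collections import Counter

def birth_year_counts(birth_dates):
    """Frequency distribution of the year portion of YYYY-MM-DD values."""
    return Counter(
        d[:4] for d in birth_dates
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", d)  # skip other layouts
    )

def split_prefix(account_number):
    """Split an alpha prefix from the numeric remainder, or return None."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", account_number)
    return (match.group(1), match.group(2)) if match else None
```

The prefix returned by `split_prefix` can then be profiled as values, while the numeric remainder is profiled as a format.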

Free-form fields are often easier to analyze as formats constructed by parsing and classifying the individual values within the field.  This analysis technique is often necessary since not only is the cardinality of free-form fields usually very high, but they also tend to have a very high distinctness (i.e., the exact same field value rarely occurs on more than one record). 

Additionally, the most frequently occurring formats for free-form fields will often collectively account for a large percentage of the records with an actual value in the field.  Examples of free-form field analysis were the focal points of Part 5 and Part 7.
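To illustrate parsing and classifying, the sketch below splits a name into tokens and labels each one, so that many different names collapse into a handful of formats.  The token classes and the small lookup sets are hypothetical; a real tool would rely on much larger reference tables.

```python
TITLES = {"mr", "mrs", "ms", "dr"}        # hypothetical lookup sets
SUFFIXES = {"jr", "sr", "ii", "iii"}

def classify_token(token):
    """Assign one token of a free-form value to a simple class."""
    bare = token.strip(".,").lower()
    if bare in TITLES:
        return "TITLE"
    if bare in SUFFIXES:
        return "SUFFIX"
    if len(bare) == 1 and bare.isalpha():
        return "INITIAL"
    if bare.isalpha():
        return "WORD"
    if bare.isdigit():
        return "NUMBER"
    return "OTHER"

def name_format(value):
    """Build a format string from the classes of the value's tokens."""
    return " ".join(classify_token(token) for token in value.split())
```

Counting the results of `name_format` across all records gives the frequency distribution of formats described above.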

We also saw examples of how valid values in a valid format can have an invalid context (e.g., in Part 3, Birth Date values set in the future), as well as how valid field formats can conceal invalid field values (e.g., Telephone Number in Part 3).
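A check for invalid context can be as simple as comparing a well-formed date to today.  This sketch assumes `YYYY-MM-DD` text; the function name `birth_date_issue` and the issue labels are made up for illustration.

```python
from datetime import date, datetime

def birth_date_issue(text, today=None):
    """Return a description of the problem, or None if no issue is found."""
    today = today or date.today()
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return "invalid format or value"
    if parsed > today:
        # Valid value in a valid format, but an invalid context
        return "future date"
    return None
```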

Part 3 also provided examples (in both Telephone Number and E-mail Address) of why you should not mistake completeness (which as a data profiling statistic indicates only that a field is populated with an actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.
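The difference can be made concrete with a sketch that looks for the sub-values of an e-mail address.  The three parts checked here are a deliberate simplification and an assumption on my part, not a full validation of e-mail syntax.

```python
def email_missing_parts(value):
    """List the sub-values a populated e-mail field appears to lack."""
    local, at_sign, domain = value.strip().partition("@")
    missing = []
    if not local:
        missing.append("local part")
    if not at_sign:
        missing.append("@")
    if not domain:
        missing.append("domain")
    elif "." not in domain:
        missing.append("top-level domain")
    return missing
```

A field can therefore count toward completeness while `email_missing_parts` still reports what is absent from its value.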

 

Drill-down Analysis

A data profiling tool will also provide the capability to drill down on its statistical summaries and frequency distributions in order to perform a more detailed review of records of interest.  Drill-down analysis will often provide useful data examples to share with subject matter experts.

Performing a preliminary analysis on your data prior to engaging in these discussions better facilitates meaningful dialogue because real-world data examples better illustrate actual data usage.  As stated earlier, understanding your data is essential to using it effectively and improving its quality.

Various examples of drill-down analysis were used throughout the series.  However, drilling all the way down to the record level was shown in Part 2 (Gender Code), Part 4 (City Name), and Part 6 (Account Number and Tax ID).
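At the record level, drilling down amounts to filtering the source on a value of interest taken from a frequency distribution.  Here is a minimal sketch, assuming records are held as dictionaries; the field names and sample records are hypothetical.

```python
def drill_down(records, field, value):
    """Return the records whose `field` holds the value of interest."""
    return [record for record in records if record.get(field) == value]

# Hypothetical sample records for illustration only
customers = [
    {"customer_id": 1, "gender_code": "M"},
    {"customer_id": 2, "gender_code": "F"},
    {"customer_id": 3, "gender_code": "X"},
]

# Pull the records behind an unexpected Gender Code value
unexpected = drill_down(customers, "gender_code", "X")
```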

 

Conclusion

Fundamentally, this series posed the following question: What can just your analysis of data tell you about it?

Data profiling is typically one of the first tasks performed on a data quality initiative.  I am often told to delay data profiling until business requirements are documented and subject matter experts are available to answer my questions. 

I always disagree – and begin data profiling as soon as possible.

I can do a better job of evaluating business requirements and preparing for meetings with subject matter experts after I have spent some time looking at data from a starting point of blissful ignorance and curiosity.

Ultimately, I believe the goal of data profiling is not to find answers, but instead, to discover the right questions.

Discovering the right questions is a critical prerequisite for effectively discussing data usage, relevancy, standards, and the metrics for measuring and improving quality.  All of which are necessary in order to progress from just profiling your data, to performing a full data quality assessment (which I will cover in a future series on this blog).

A data profiling tool can help you by automating some of the grunt work needed to begin your analysis.  However, it is important to remember that the analysis itself cannot be automated – you need to review the statistical summaries and frequency distributions generated by the data profiling tool and, more importantly, translate your analysis into meaningful reports and questions to share with the rest of your team.

Always remember that well-performed data profiling is both a highly interactive and a very iterative process.

 

Thank You

I want to thank you for providing your feedback throughout this series. 

As my fellow Data Gazers, you provided excellent insights and suggestions via your comments. 

The primary reason I published this series on my blog, as opposed to simply writing a whitepaper or a presentation, was because I knew our discussions would greatly improve the material.

I hope this series proves to be a useful resource for your actual adventures in data profiling.

 

The Complete Series


Recently Read: November 28, 2009

Recently Read is an OCDQ regular segment.  Each entry provides links to blog posts, articles, books, and other material I found interesting enough to share.  Please note “recently read” is literal – therefore what I share wasn't necessarily recently published.

 

Data Quality Blog Posts

For simplicity, “Data Quality” also includes Data Governance, Master Data Management, and Business Intelligence.

 

Social Media Blog Posts

For simplicity, “Social Media” also includes Blogging, Social Networking, and Online Marketing.

 

Book Quotes

An eclectic list of quotes from some recently read (and/or simply my favorite) books.

  • From The Wisdom of Crowds by James Surowiecki – “Refuse to allow the merit of an idea to be determined by the status of the person advocating it.”

     

  • From Purple Cow by Seth Godin – “We mistakenly believe that criticism leads to failure.”

     

  • From How We Decide by Jonah Lehrer – “The best decision-makers don't despair.  Instead, they become students of error, determined to learn from what went wrong.”

     

  • From The Whuffie Factor by Tara Hunt – “Whuffie is the residual outcome—the currency—of your reputation.  You lose or gain it based on positive or negative actions, your contributions to the community, and what people think of you.”

     

  • From Trust Agents by Chris Brogan and Julien Smith – “You accrue social capital as a side benefit of doing good, but doing good by itself is its own reward.”

Commendable Comments (Part 4)

Thanksgiving

Photo via Flickr (Creative Commons License) by: ella_marie 

Today is Thanksgiving Day, which is a United States holiday with a long and varied history.  The most consistent themes remain family and friends gathering together to share a large meal and express their gratitude.

This is the fourth entry in my ongoing series for expressing my gratitude to my readers for their truly commendable comments on my blog posts.  Receiving comments is the most rewarding aspect of my blogging experience.  Although I am truly grateful to all of my readers, I am most grateful to my commenting readers. 

 

Commendable Comments

On Days Without A Data Quality Issue, Steve Sarsfield commented:

“Data quality issues probably occur on some scale in most companies every day.  As long as you qualify what is and isn't a data quality issue, this gets back to what the company thinks is an acceptable level of data quality.

I've always advocated aggregating data quality scores to form business metrics.  For example, what data quality metrics would you combine to ensure that customers can always be contacted in case of an upgrade, recall or new product offering?  If you track the aggregation, it gives you more of a business feel.”

On Customer Incognita, Daragh O Brien commented:

“Back when I was with the phone company I was (by default) the guardian of the definition of a 'Customer'.  Basically I think they asked for volunteers to step forward and I was busy tying my shoelace when the other 11,000 people in the company as one entity took a large step backwards.

I found that the best way to get a definition of a customer was to lock the relevant stakeholders in a room and keep asking 'What' and 'Why'. 

My 'data modeling' methodology was simple.  Find out what the things were that were important to the business operation, define each thing in English without a reference to itself, and then we played the 'Yes/No Game Show' to figure out how that entity linked to other things and what the attributes of that thing were.

Much to IT's confusion, I insisted that the definition needed to be a living thing, not carved in two stone tablets we'd lug down from on top of the mountain. 

However, because of the approach that had been taken we found that when new requirements were raised (27 from one stakeholder), the model accommodated all of them either through an expansion of a description or the addition of a piece of reference data to part of the model.

Fast-forward a few months from the modeling exercise.  I was asked by IT to demo the model to a newly acquired subsidiary.  It was a significantly different business.  I played the 'Yes/No Game Show' with them for a day.  The model fitted their needs with just a minor tweak. 

The IT team from the subsidiary wanted to know how I had gone about normalizing the data to come up with the model, which is kind of like cutting up a perfectly good apple pie to find out what an apple is and how to make pastry.

What I found about the 'Yes/No Game Show' approach was that it made people open up their thinking a bit, but it took some discipline and perseverance on my part to keep asking what and why.  Luckily, having spent most of the previous few years trying to get these people to think seriously about data quality they already thought I was a moron so they were accommodating to me.

A key learning for me out of the whole thing is that, even if you are doing a data management exercise for a part of a larger business, you need to approach it in a way that can be evolved and continuously improved to ensure quality across the entire organization. 

Also, it highlighted the fallacy of assuming that a company can only have one kind of customer.”

On The Once and Future Data Quality Expert, Dylan Jones commented:

“I recently attended a conference and sat in on a panel that discussed some of the future trends, such as cloud computing.  It was a great discussion, highly polarized, and as I came home I thought about how far we've come as a profession but more importantly, how much more there is to do.

The reality is that the world is changing, the volumes of data held by businesses are immense and growing exponentially, our desire for new forms of information delivery insatiable, and the opportunities for innovation boundless.

I really believe we're not innovating as an industry anything like we should be.  The cloud, as an example, offers massive opportunities for a range of data quality services but I've certainly not read anything in the media or press that indicates someone is capitalizing on this.

There are a few recent data quality technology innovations which have caught my eye, but I also think there is so much more vendors should be doing.

On the personal side of the profession, I think online education is where we're headed.  The concept of localized training is now being replaced by online learning.  With the Internet you can now train people on every continent, so why aren't more people going down this route?

I find it incredibly ironic when I speak to data quality specialists who admit that 'they don't have the first clue about all this social media stuff.'  This is the next generation of information management, it's here right now, they should be embracing it.  I think if you're a 'guru' author, trainer or consultant you need to think of new ways to engage with your clients/trainees using the tools available.

What worries me is that the growth of information doesn't match the maturity and growth of our profession.  For example, we really need more people who can articulate the value of what we can offer. 

Ted Friedman made a great point on Twitter recently when he talked about how people should stop moaning about executives that 'don't get it' and instead focus on improving ways to demonstrate the value of data quality improvement.

Just because we've come a long way doesn't mean we know it all, there is still a hell of a long way to go.”

Thanks for giving your comments

Thank you very much for giving your comments and sharing your perspectives with our collablogaunity.  Since there have been so many commendable comments, please don't be offended if your commendable comment hasn't been featured yet. 

Please keep on commenting and stay tuned for future entries in the series. 

 

Related Posts

Commendable Comments (Part 1)

Commendable Comments (Part 2)

Commendable Comments (Part 3)