DQ-BE: Old Beer bought by Old Man

Data Quality By Example (DQ-BE) is an OCDQ regular segment that provides examples of data quality key concepts.

Over the weekend, in preparation for watching the Boston Red Sox, I bought some beer and pizza.  Later that night, after a thrilling victory that sent the Red Sox to the 2013 World Series, I was cleaning up the kitchen and was about to throw out the receipt when I couldn’t help but notice two data quality issues.

First, although I had purchased Samuel Adams Octoberfest, the receipt indicated I had bought Spring Ale, which, although it’s still available in some places and it’s still good beer, is three seasonal beers (Summer Ale, Winter Lager, Octoberfest) old.  This data quality issue impacts the store’s inventory and procurement systems (e.g., maybe the store orders more Spring Ale next year because people were apparently still buying it in October this year).
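
For illustration, here is a minimal Python sketch of how a single stale product mapping can propagate downstream; the SKU, catalog, and sales log are hypothetical stand-ins, not the store’s actual systems:

    product_catalog = {"SKU-1049": "SAMUEL ADAMS SPRING ALE"}  # never updated for the seasonal rotation

    sales_log = []

    def ring_up(sku):
        # the receipt, inventory, and procurement systems all read this stale description
        sales_log.append(product_catalog[sku])

    ring_up("SKU-1049")  # the Octoberfest I actually bought
    print(sales_log)     # ['SAMUEL ADAMS SPRING ALE'] -- next year's order forecast is now skewed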

The second, and far more personal, data quality issue was that the age verification portion of my receipt indicated I was born on or before November 22, 1922, making me at least 91 years old!  While I am of the age (42) typical of a midlife crisis, I wasn’t driving a new red sports car, just wearing my old Red Sox sports jersey and hat.  As for the store, this data quality issue could be viewed as a regulatory compliance failure since it seems like their systems are set up by default to allow the sale of alcohol without proper age verification.  Additionally, this data quality issue might make it seem like their only alcohol-purchasing customers are very senior citizens.
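
Here, too, a small sketch may clarify the suspected mechanism; it assumes, hypothetically, that the register falls back to a sentinel birthdate whenever no ID is scanned:

    from datetime import date

    SENTINEL_DOB = date(1922, 11, 22)  # the "born on or before" date printed on my receipt

    def verify_age(dob, purchase_date, minimum_age=21):
        if dob is None:
            dob = SENTINEL_DOB  # a default that silently passes every age check
        years = purchase_date.year - dob.year - (
            (purchase_date.month, purchase_date.day) < (dob.month, dob.day)
        )
        return years >= minimum_age

    # no ID scanned, yet the sale is approved
    print(verify_age(None, date(2013, 10, 19)))  # True

A sentinel like this makes every transaction look compliant in the system of record, which is exactly why it can pass unnoticed until it shows up on a customer’s receipt.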

 

What examples (good or poor) of data quality have you encountered?  Please share them by posting a comment below.

 

Related Posts

DQ-BE: The Time Traveling Gift Card

DQ-BE: Invitation to Duplication

DQ-BE: Dear Valued Customer

DQ-BE: Single Version of the Time

DQ-BE: Data Quality Airlines

Retroactive Data Quality

Sometimes Worse Data Quality is Better

Data Quality and the OK Plateau

When Poor Data Quality Kills

The Costs and Profits of Poor Data Quality

Council Data Governance

Inspired by the great Eagles song Hotel California, this DQ-Song “sings” about the common mistake of convening a council too early when starting a new data governance program.  Now, of course, data governance is a very important and serious subject, which is why some people might question whether music is the best way to discuss it.

Although I understand that skepticism, I can’t help but recall the words of Frank Zappa:

“Information is not knowledge;

Knowledge is not wisdom;

Wisdom is not truth;

Truth is not beauty;

Beauty is not love;

Love is not music;

Music is the best.”

Council Data Governance

Down a dark deserted hallway, I walked with despair
As the warm smell of bagels rose up through the air
Up ahead in the distance, I saw a shimmering light
My head grew heavy and my sight grew dim
I had to attend another data governance council meeting
As I stood in the doorway
I heard the clang of the meeting bell

And I was thinking to myself
This couldn’t be heaven, but this could be hell
As stakeholders argued about the data governance way
There were voices down the corridor
I thought I heard them say . . .

Welcome to the Council Data Governance
Such a dreadful place (such a dreadful place)
Time crawls along at such a dreadful pace
Plenty of arguing at the Council Data Governance
Any time of year (any time of year)
You can hear stakeholders arguing there

Their agendas are totally twisted, with means to their own end
They use lots of pretty, pretty words, which I don’t comprehend
How they dance around the complex issues with sweet sounding threats
Some speak softly with remorse, some speak loudly without regrets

So I cried out to the stakeholders
Can we please reach consensus on the need for collaboration?
They said, we haven’t had that spirit here since nineteen ninety nine
And still those voices they’re calling from far away
Wake you up in the middle of this endless meeting
Just to hear them say . . .

Welcome to the Council Data Governance
Such a dreadful place (such a dreadful place)
Time crawls along at such a dreadful pace
They argue about everything at the Council Data Governance
And it’s no surprise (it’s no surprise)
To hear defending the status quo alibis

Bars on all of the windows
Rambling arguments, anything but concise
We are all just prisoners here
Of our own device
In the data governance council chambers
The bickering will never cease
They stab it with their steely knives
But they just can’t kill the beast

Last thing I remember, I was
Running for the door
I had to find the passage back
To the place I was before
Relax, said the stakeholders
We have been programmed by bureaucracy to believe
You can leave the council meeting any time you like
But success with data governance, you will never achieve!

 

More Data Quality Songs

Data Love Song Mashup

I’m Gonna Data Profile (500 Records)

A Record Named Duplicate

New Time Human Business

You Can’t Always Get the Data You Want

I’m Bringing DQ Sexy Back

Imagining the Future of Data Quality

The Very Model of a Modern DQ General

More Data Governance Posts

Beware the Data Governance Ides of March

Data Governance Star Wars: Bureaucracy versus Agility

Aristotle, Data Governance, and Lead Rulers

Data Governance needs Searchers, not Planners

Data Governance Frameworks are like Jigsaw Puzzles

Is DG a D-O-G?

The Hawthorne Effect, Helter Skelter, and Data Governance

Data Governance and the Buttered Cat Paradox

DQ-BE: The Time Traveling Gift Card

Data Quality By Example (DQ-BE) is an OCDQ regular segment that provides examples of data quality key concepts.

As an avid reader, I tend to redeem most of my American Express Membership Rewards points for Barnes & Noble gift cards to buy new books for my Nook.  As a data quality expert, I tend to notice when something is amiss with data.  As shown above, for example, my recent gift card was apparently issued on — and only available for use until — January 1, 1900.

At first, I thought I might have encountered the time traveling gift card.  However, I doubted the gift card would be accepted as legal tender in 1900.  Then I thought my gift card was actually worth $1,410 (what $50 in 1900 would be worth today), which would allow me to buy a lot more books — as long as Barnes & Noble would overlook the fact the gift card expired 113 years ago.
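
Here is a minimal sketch of one plausible root cause, assuming, hypothetically, that the issue date was stored as a day-number serial that was never populated (an illustration, not a confirmed diagnosis of the actual gift card system):

    from datetime import date, timedelta

    EPOCH = date(1900, 1, 1)  # a common day-zero in legacy date representations

    def serial_to_date(serial):
        return EPOCH + timedelta(days=serial)

    issue_serial = 0  # field left at its default
    print(serial_to_date(issue_serial))  # 1900-01-01 -- the "time traveling" gift card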

Fortunately, I was able to use the gift card to purchase $50 worth of books in 2013.

So, I guess the moral of this story is that sometimes poor data quality does pay.  However, it probably never pays to display your poor data quality to someone who runs an obsessive-compulsive data quality blog with a series about data quality by example.

 

What examples (good or poor) of data quality have you encountered in your time travels?

 

Related Posts

DQ-BE: Invitation to Duplication

DQ-BE: Dear Valued Customer

DQ-BE: Single Version of the Time

DQ-BE: Data Quality Airlines

Retroactive Data Quality

Sometimes Worse Data Quality is Better

Data Quality, 50023

DQ-IRL (Data Quality in Real Life)

The Seven Year Glitch

When Poor Data Quality Calls

Data has an Expiration Date

Sometimes it’s Okay to be Shallow

The Laugh-In Effect of Big Data

Although I am an advocate for data science and big data done right, lately I have been sounding the Anti-Hype Horn with blog posts offering a contrarian’s view of unstructured data, forewarning you about the flying monkeys of big data, cautioning you against performing Cargo Cult Data Science, and inviting you to ponder the perils of the Infinite Inbox.

The hype of big data has resulted in a lot of people and vendors extolling its virtues with stories about how Internet companies, political campaigns, and new technologies have profited, or otherwise benefited, from big data.  These stories are served up as alleged business cases for investing in big data and data science.  Although some of these stories are fluff pieces, many of them accurately, and in some cases comprehensively, describe a real-world application of big data and data science.  However, these messages most often lack a critically important component — applicability to your specific business.  In Made to Stick: Why Some Ideas Survive and Others Die, Chip Heath and Dan Heath explained that “an accurate but useless idea is still useless.  If a message can’t be used to make predictions or decisions, it is without value, no matter how accurate or comprehensive it is.”

Rowan & Martin’s Laugh-In was an American sketch comedy television series, which aired from 1968 to 1973.  One of the recurring characters portrayed by Arte Johnson was Wolfgang the German soldier, who would often comment on the previous comedy sketch by saying (in a heavy and long-drawn-out German accent): “Very interesting . . . but stupid!”

From now on, whenever someone shares another interesting story masquerading as a solid business case for big data that lacks any applicability beyond the specific scenario in the story (a common phenomenon I call The Laugh-In Effect of Big Data), my unapologetic response will resoundingly be: “Very interesting . . . but stupid!”

 

Related Posts

On Philosophy, Science, and Data

OCDQ Radio - Demystifying Data Science

OCDQ Radio - Data Quality and Big Data

Big Data and the Infinite Inbox

HoardaBytes and the Big Data Lebowski

Will Big Data be Blinded by Data Science?

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Big Data el Memorioso

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

Dot Collectors and Dot Connectors

The Wisdom of Crowds, Friends, and Experts

A Contrarian’s View of Unstructured Data

The Flying Monkeys of Big Data

Cargo Cult Data Science

A Statistically Significant Resolution for 2013

Speed Up Your Data to Slow Down Your Decisions

Rage against the Machines Learning

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Swimming in Big Data

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

Big Data and the Infinite Inbox

Occasionally it’s necessary to temper the unchecked enthusiasm accompanying the peak of inflated expectations associated with any hype cycle.  This may be especially true for big data, and especially now since, as Svetlana Sicular of Gartner recently blogged, big data is falling into the trough of disillusionment and “to minimize the depth of the fall, companies must be at a high enough level of analytical and enterprise information management maturity combined with organizational support of innovation.”

I fear the fall may feel bottomless for those who fell hard for the hype and believe the Big Data Psychic capable of making better, if not clairvoyant, predictions.  When, in fact, “our predictions may be more prone to failure in the era of big data,” explained Nate Silver in his book The Signal and the Noise: Why So Many Predictions Fail but Some Don’t.  “There isn’t any more truth in the world than there was before the Internet.  Most of the data is just noise, as most of the universe is filled with empty space.”

Proposing the 3Ss (Small, Slow, Sure) as a counterpoint to the 3Vs (Volume, Velocity, Variety), Stephen Few recently blogged about the slow data movement.  “Data is growing in volume, as it always has, but only a small amount of it is useful.  Data is being generated and transmitted at an increasing velocity, but the race is not necessarily for the swift; slow and steady will win the information race.  Data is branching out in ever-greater variety, but only a few of these new choices are sure.”

Big data requires us to revisit information overload, a term that was originally about, not the increasing amount of information, but instead the increasing access to information.  As Clay Shirky stated, “It’s not information overload, it’s filter failure.”

As Silver noted, the Internet (like the printing press before it) was a watershed moment in our increased access to information, but its data deluge didn’t increase the amount of truth in the world.  And in today’s world, where many of us strive on a daily basis to prevent email filter failure and achieve what Merlin Mann called Inbox Zero, I find unfiltered enthusiasm about big data to be rather ironic, since big data is essentially enabling the data-driven decision making equivalent of the Infinite Inbox.

Imagine logging into your email every morning and discovering: You currently have (∞) Unread Messages.

However, I’m sure most of it probably would be spam, which you obviously wouldn’t have any trouble quickly filtering (after all, infinity minus spam must be a back of the napkin calculation), allowing you to only read the truly useful messages.  Right?

 

Related Posts

HoardaBytes and the Big Data Lebowski

OCDQ Radio - Data Quality and Big Data

Open MIKE Podcast — Episode 05: Defining Big Data

Will Big Data be Blinded by Data Science?

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

A Statistically Significant Resolution for 2013

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Swimming in Big Data

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

Popeye, Spinach, and Data Quality

As a kid, one of my favorite cartoons was Popeye the Sailor, who was empowered by eating spinach to take on many daunting challenges, such as battling his brawny nemesis Bluto for the affections of his love interest Olive Oyl, often kidnapped by Bluto.

I am reading the book The Half-life of Facts: Why Everything We Know Has an Expiration Date by Samuel Arbesman, who, while examining how a novel fact, even a wrong one, spreads and persists, explained that one of the strangest examples of the spread of an error is related to Popeye the Sailor.  “Popeye, with his odd accent and improbable forearms, used spinach to great effect, a sort of anti-Kryptonite.  It gave him his strength, and perhaps his distinctive speaking style.  But why did Popeye eat so much spinach?  What was the reason for his obsession with such a strange food?”

The truth begins over fifty years before the comic strip made its debut.  “Back in 1870,” Arbesman explained, “Erich von Wolf, a German chemist, examined the amount of iron within spinach, among many other green vegetables.  In recording his findings, von Wolf accidentally misplaced a decimal point when transcribing data from his notebook, changing the iron content in spinach by an order of magnitude.  While there are actually only 3.5 milligrams of iron in a 100-gram serving of spinach, the accepted fact became 35 milligrams.  Once this incorrect number was printed, spinach’s nutritional value became legendary.  So when Popeye was created, studio executives recommended he eat spinach for his strength, due to its vaunted health properties, and apparently Popeye helped increase American consumption of spinach by a third!”

“This error was eventually corrected in 1937,” Arbesman continued, “when someone rechecked the numbers.  But the damage had been done.  It spread and spread, and only recently has gone by the wayside, no doubt helped by Popeye’s relative obscurity today.  But the error was so widespread, that the British Medical Journal published an article discussing this spinach incident in 1981, trying its best to finally debunk the issue.”

“Ultimately, the reason these errors spread,” Arbesman concluded, “is because it’s a lot easier to spread the first thing you find, or the fact that sounds correct, than to delve deeply into the literature in search of the correct fact.”
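
In data quality terms, this is exactly the kind of error a simple plausibility check can catch at the point of entry.  Here is a minimal sketch; the reference range is illustrative, not a published nutritional standard:

    REFERENCE_RANGES = {"iron_mg_per_100g": (0.1, 10.0)}  # a plausible range for leafy greens

    def check_plausibility(field, value):
        low, high = REFERENCE_RANGES[field]
        if low <= value <= high:
            return "ok"
        return f"suspect: {value} outside [{low}, {high}] -- possible decimal-point error"

    print(check_plausibility("iron_mg_per_100g", 3.5))   # ok
    print(check_plausibility("iron_mg_per_100g", 35.0))  # suspect -- von Wolf's transcription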

What “spinach” has your organization been falsely consuming because of a data quality issue that was not immediately obvious, and which may have led to a long, and perhaps ongoing, history of data-driven decisions based on poor quality data?

Popeye said “I yam what I yam!”  Your organization yams what your data yams, so you had better make damn sure it’s correct.

 

Related Posts

The Family Circus and Data Quality

Can Data Quality avoid the Dustbin of History?

Retroactive Data Quality

Spartan Data Quality

Pirates of the Computer: The Curse of the Poor Data Quality

The Tooth Fairy of Data Quality

The Dumb and Dumber Guide to Data Quality

Darth Data

Occurred, a data defect has . . .

The Data Quality Placebo

Data Quality is People!

DQ-View: The Five Stages of Data Quality

DQ-BE: Data Quality Airlines

Wednesday Word: Quality-ish

The Five Worst Elevator Pitches for Data Quality

Shining a Social Light on Data Quality

The Poor Data Quality Jar

Data Quality and #FollowFriday the 13th

Dilbert, Data Quality, Rabbits, and #FollowFriday

Data Love Song Mashup

Exercise Better Data Management

Recently on Twitter, Daragh O Brien and I discussed a concept he proposed.  “After Big Data,” Daragh tweeted, “we will inevitably begin to see the rise of MOData as organizations seek to grab larger chunks of data and digest it.  What is MOData?  It’s MO’Data, as in MOre Data.  Or Morbidly Obese Data.  Only good data quality and data governance will determine which.”

Daragh asked if MO’Data will be the Big Data Killer.  I said only if MO’Data doesn’t include MO’BusinessInsight, MO’DataQuality, and MO’DataPrivacy (i.e., more business insight, more data quality, and more data privacy).

“But MO’Data is about more than just More Data,” Daragh replied.  “It’s about avoiding Morbidly Obese Data that clogs data insight and data quality, etc.”

I responded that More Data becomes Morbidly Obese Data only if we don’t exercise better data management practices.

Agreeing with that point, Daragh replied, “Bring on MOData and the Pilates of Data Quality and Data Governance.”

To slightly paraphrase lines from one of my favorite movies — Airplane! — the Cloud is getting thicker and the Data is getting laaaaarrrrrger.  Surely I know well that growing data volumes are a serious issue — but don’t call me Shirley.

Whether you choose to measure it in terabytes, petabytes, exabytes, HoardaBytes, or how much reality bites, the truth is we were consuming way more than our recommended daily allowance of data long before the data management industry took a tip from McDonald’s and put the word “big” in front of its signature sandwich.  (Oh great . . . now I’m actually hungry for a Big Mac.)

But nowadays with silos replicating data, as well as new data, and new types of data, being created and stored on a daily basis, our data is resembling the size of Bob Parr in retirement, making it seem like not even Mr. Incredible in his prime possessed the super strength needed to manage all of our data.  Those were references to the movie The Incredibles, where Mr. Incredible was a superhero who, after retiring into civilian life under the alias of Bob Parr, elicited this observation from his superhero costume tailor: “My God, you’ve gotten fat.”  Yes, I admit not even Helen Parr (aka Elastigirl) could stretch that far for a big data joke.

A Healthier Approach to Big Data

Although Daragh’s concerns about morbidly obese data are valid, no superpowers (or other miracle exceptions) are needed to manage all of our data.  In fact, it’s precisely when we are so busy trying to manage all of our data that we hoard countless bytes of data without evaluating data usage, gathering data requirements, or planning for data archival.  It’s like we are trying to lose weight by eating more and exercising less, i.e., consuming more data and exercising less data quality and data governance.  As Daragh said, only good data quality and data governance will determine whether we get more data or morbidly obese data.

Losing weight requires a healthy approach to both diet and exercise.  A healthy approach to diet includes carefully choosing the food you consume and carefully controlling your portion size.  A healthy approach to exercise includes a commitment to exercise on a regular basis at a sufficient intensity level without going overboard by spending several hours a day, every day, at the gym.

Swimming is a great form of exercise, but swimming in big data without having a clear business objective before you jump into the pool is like telling your boss that you didn’t get any work done because you decided to spend all day working out at the gym.

Carefully choosing the data you consume and carefully controlling your data portion size is becoming increasingly important since big data is forcing us to revisit information overload.  However, the main reason that traditional data management practices often become overwhelmed by big data is that they are not always the right approach.

We need to acknowledge that some big data use cases differ considerably from traditional ones.  Data modeling is still important and data quality still matters, but how much data modeling and data quality is needed before big data can be effectively used for business purposes will vary.  In order to move the big data discussion forward, we have to stop fiercely defending our traditional perspectives about structure and quality.  We also have to stop fiercely defending our traditional perspectives about analytics, since there will be some big data use cases where depth and detailed analysis may not be necessary to provide business insight.

Better than Big or More

Jim Ericson explained that your data is big enough.  Rich Murnane explained that bigger isn’t better, better is better.  Although big data may indeed be followed by more data, that doesn’t necessarily mean we require more data management in order to prevent more data from becoming morbidly obese data.  I think that we just need to exercise better data management.

 

DQ-View: The Five Stages of Data Quality

Data Quality (DQ) View is an OCDQ regular segment. Each DQ-View is a brief video discussion of a data quality key concept.

In my experience, all organizations cycle through five stages, somewhat similar to The Five Stages of Grief, while coming to terms with the daunting challenges of data quality.  So, in this short video, I explain The Five Stages of Data Quality:

  1. Denial — Our organization is well-managed and highly profitable.  We consistently meet, or exceed, our business goals.  We obviously understand the importance of high-quality data.  Data quality issues can’t possibly be happening to us.
  2. Anger — We’re now in the midst of a financial reporting scandal, and facing considerable fines in the wake of a regulatory compliance failure.  How can this be happening to us?  Why do we have data quality issues?  Who is to blame for this?
  3. Bargaining — Okay, we may have just overreacted a little bit.  We’ll purchase a data quality tool, approve a data cleansing project, implement defect prevention, and initiate data governance.  That will fix all of our data quality issues — right?
  4. Depression — Why, oh why, do we keep having data quality issues?  Why does this keep happening to us?  Maybe we should just give up, accept our doomed fate, and not bother doing anything at all about data quality and data governance.
  5. Acceptance — We can’t fight the truth anymore.  We accept that we have to do the hard daily work of continuously improving our data quality and continuously implementing our data governance principles, policies, and procedures.

Shining a Social Light on Data Quality

Last week, when I published my blog post Lightning Strikes the Cloud, I unintentionally demonstrated three important things about data quality.

The first thing I demonstrated was that even an obsessive-compulsive data quality geek is capable of data defects, since I initially published the post with the title Lightening Strikes the Cloud, which is an excellent example of the difference between validity and accuracy caused by the Cupertino Effect: although lightening is valid (i.e., a correctly spelled word), it isn’t contextually accurate.
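
The distinction is easy to see in a minimal sketch, using a toy dictionary and a hypothetical context check rather than a real spell-checking API:

    VALID_WORDS = {"lightning", "lightening", "strikes", "the", "cloud"}

    def is_valid(word):
        # validity: the value conforms to the domain (a correctly spelled word)
        return word.lower() in VALID_WORDS

    def is_accurate(word, intended):
        # accuracy: the value is also correct in its context
        return word.lower() == intended.lower()

    title_word = "Lightening"                    # valid, but not what was meant
    print(is_valid(title_word))                  # True  -- passes the spell check
    print(is_accurate(title_word, "lightning"))  # False -- the Cupertino Effect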

The second thing I demonstrated was the value of shining a social light on data quality — the value of using collaborative tools like social media to crowd-source data quality improvements.  Thankfully, Julian Schwarzenbach quickly noticed my error on Twitter.  “Did you mean lightning?  The concept of lightening clouds could be worth exploring further,” Julian humorously tweeted.  “Might be interesting to consider what happens if the cloud gets so light that it floats away.”  To which I replied that if the cloud gets so light that it floats away, it could become Interstellar Computing or, as Julian suggested, the start of the Intergalactic Net, which I suppose is where we will eventually have to store all of that big data we keep hearing so much about these days.

The third thing I demonstrated was the potential dark side of data cleansing, since the only remaining trace of my data defect is a broken URL.  This is an example of not providing a well-documented audit trail, which is necessary within an organization to communicate data quality issues and resolutions.

Communication and collaboration are essential to finding our way with data quality.  And social media can help us by providing more immediate and expanded access to our collective knowledge, experience, and wisdom, and by shining a social light that illuminates the shadows cast upon data quality issues when a perception filter or bystander effect gets the better of our individual attention or undermines our collective best intentions — which, as I recently demonstrated, occasionally happens to all of us.

 

Related Posts

Data Quality and the Cupertino Effect

Are you turning Ugly Data into Cute Information?

The Importance of Envelopes

The Algebra of Collaboration

Finding Data Quality

The Wisdom of the Social Media Crowd

Perception Filters and Data Quality

Data Quality and the Bystander Effect

The Family Circus and Data Quality

Data Quality and the Q Test

Metadata, Data Quality, and the Stroop Test

The Three Most Important Letters in Data Governance

The Family Circus and Data Quality

Like many young intellectuals, the only part of the Sunday newspaper I read growing up was the color comics section, and one of my favorite comic strips was The Family Circus created by cartoonist Bil Keane.  One of the recurring themes of the comic strip was a set of invisible gremlins that the children used to shift blame for any misdeeds, including Ida Know, Not Me, and Nobody.

Although I no longer read any section of the newspaper on any day of the week, this Sunday morning I have been contemplating how this same set of invisible gremlins is used by many people throughout most organizations to shift blame for any incidents when poor data quality negatively impacted business activities, especially since, when investigating the root cause, you often find that Ida Know owns the data, Not Me is accountable for data governance, and Nobody takes responsibility for data quality.

The Data Quality Placebo

Inspired by a recent Boing Boing blog post

Are you suffering from persistent and annoying data quality issues?  Or are you suffering from the persistence of data quality tool vendors and consultants annoying you with sales pitches about how you must be suffering from persistent data quality issues?

Either way, the Data Division of Prescott Pharmaceuticals (trusted makers of gastroflux, datamine, selectium, and qualitol) is proud to present the perfect solution to all of your real and/or imaginary data quality issues — The Data Quality Placebo.

Simply take two capsules (made with an easy-to-swallow coating) every morning and you will be guaranteed to experience:

“Zero Defects with Zero Side Effects”™

(Legal Disclaimer: Zero Defects with Zero Side Effects may be the result of Zero Testing, which itself is probably just a side effect of The Prescott Promise: “We can promise you that we will never test any of our products on animals because . . . we never test any of our products.”)

Data Love Song Mashup

Today is February 14 — Valentine’s Day — the annual celebration of enduring romance, where true love is publicly judged according to your willingness to purchase chocolate, roses, and extremely expensive jewelry, and privately judged in ways that nobody (and please, trust me when I say nobody) wants to see you post on Twitter, Facebook, YouTube, or your blog.

Valentine’s Day is for people in love to celebrate their love privately in whatever way works best for them.

But since your data needs love too, this blog post provides a mashup of love songs for your data.

Data Love Song Mashup

I’ve got sunshine on a cloud computing day
When it’s cold outside, I’ve got backups from the month of May
I guess you’d say, what can make me feel this way?
My data, my data, my data
Singing about my data
My data

My data’s so beautiful 
And I tell it every day
When I see your user interface
There’s not a thing that I would change
Because my data, you’re amazing
Just the way you are
You’re amazing data
Just the way you are

They say we’re young and we don’t know
We won’t find data quality issues until we grow
Well I don’t know if that is true
Because you got me, data
And data, I got you
I got you, data

Look into my eyes, and you will see
What my data means to me
Don’t tell me data quality is not worth trying for
Don’t tell me it’s not worth fighting for
You know it’s true
Everything I do, I do data quality for you

I can’t make you love data if you don’t
I can’t make your heart feel something it won’t

But there’s nothing you can do that can’t be done
Nothing you can sing that can’t be sung
Nothing you can make that can’t be made
All you need is love . . . for data
Love for data is all you need

Business people working hard all day and through the night
Their database queries searching for business insight
Some will win, some will lose
Some were born to sing the data quality blues
Oh, the need for business insight never ends
It goes on and on and on and on
Don’t stop believing
Hold on to that data loving feeling

Look at your data, I know its poor quality is showing
Look at your organization, you don’t know where it’s going
I don’t know much, but I know your data needs love too
And that may be all I need to know

Nothing compares to data quality, no worries or cares
Business regrets and decision mistakes, they’re memories made
But if you don’t continuously improve, how bittersweet that will taste
I wish nothing but the best for you
I wish nothing but the best for your data too
Don’t forget data quality, I beg, please remember I said
Sometimes quality lasts in data, but sometimes it hurts instead

 

Happy Valentine’s Day to you and yours

Happy Data Quality to you and your data

HoardaBytes and the Big Data Lebowski

The recent #GartnerChat on Big Data was an excellent Twitter discussion about what I often refer to as the Seven Letter Tsunami of the data management industry.  As Gartner Research explains, although the term acknowledges the exponential growth, availability, and use of information in today’s data-rich landscape, big data is about more than just data volume.  Data variety (i.e., structured, semi-structured, and unstructured data, as well as other types, such as the sensor data emanating from the Internet of Things) and data velocity (i.e., how fast data is produced and how fast data must be processed to meet demand) are also key characteristics of the big challenges associated with the big buzzword that big data has become over the last year.

Since ours is an industry infatuated with buzzwords, Timo Elliott remarked “new terms arise because of new technology, not new business problems.  Big Data came from a need to name Hadoop [and other technologies now being relentlessly marketed as big data solutions], so anybody using big data to refer to business problems is quickly going to tie themselves in definitional knots.”

To which Mark Troester responded, “the hype of Hadoop is driving pressure on people to keep everything — but they ignore the difficulty in managing it.”  John Haddad then quipped that “big data is a hoarder’s dream,” which prompted Andy Bitterer to coin the term HoardaByte for measuring big data and to ask, “Would the real Big Data Lebowski please stand up?”

HoardaBytes

Although it’s probably no surprise that a blogger with obsessive-compulsive in the title of his blog would like Bitterer’s new term, the fact is that whether you choose to measure it in terabytes, petabytes, exabytes, HoardaBytes, or how much reality bitterly bites, our organizations have been compulsively hoarding data for a long time.

And with silos replicating data, and with new data and new types of data being created and stored on a daily basis, managing all of the data is not only becoming impractical, but, because we are too busy trying to manage all of it, we are also hoarding countless bytes of data without evaluating data usage, gathering data requirements, or planning for data archival.

The Big Data Lebowski

In The Big Lebowski, Jeff Lebowski (“The Dude”) is, in a classic data quality blunder caused by matching on person name only, mistakenly identified as millionaire Jeffrey Lebowski (“The Big Lebowski”) in an eccentric plot expected from a Coen brothers film, which, since its release in the late 1990s, has become a cult classic and inspired a religious following known as Dudeism.
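
For those who prefer code to Coen brothers plots, here is a minimal sketch of the blunder, using hypothetical records; real matching engines compare many attributes, often with fuzzy rather than exact logic:

    dude = {"name": "Jeffrey Lebowski", "city": "Venice",   "occupation": "unemployed"}
    big  = {"name": "Jeffrey Lebowski", "city": "Pasadena", "occupation": "millionaire"}

    def match_name_only(a, b):
        return a["name"] == b["name"]

    def match_multiple_attributes(a, b):
        return all(a[k] == b[k] for k in ("name", "city", "occupation"))

    print(match_name_only(dude, big))            # True  -- The Dude is mistaken for The Big Lebowski
    print(match_multiple_attributes(dude, big))  # False -- two distinct people after all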

Historically, a big part of the problem in our industry has been the fact that the word “data” is prevalent in the names we have given industry disciplines and enterprise information initiatives.  For example, data architecture, data quality, data integration, data migration, data warehousing, master data management, and data governance — to name but a few.

However, all this achieved was to perpetuate the mistaken identification of data management as an esoteric technical activity that played little more than a minor, supporting, and often uncredited, role within the business activities of our organizations.

But since the late 1990s, there has been a shift in the perception of data.  The real data deluge has not been the rising volume, variety, and velocity of data, but instead the rising awareness of the big impact that data has on nearly every aspect of our professional and personal lives.  In this brave new data world, companies like Google and Facebook have built business empires mostly out of our own personal data, which is why, like it or not, as individuals, we must accept that we are all data geeks now.

All of the hype about Big Data is missing the point.  The reality is that Data is Big — meaning that data has now so thoroughly pervaded mainstream culture that data has gone beyond being just a cult classic for the data management profession, and is now inspiring an almost religious following that we could call Dataism.

The Data must Abide

“The Dude abides.  I don’t know about you, but I take comfort in that,” remarked The Stranger in The Big Lebowski.

The Data must also abide.  And the Data must abide both the Business and the Individual.  The Data abides the Business if data proves useful to our business activities.  The Data abides the Individual if data protects the privacy of our personal activities.

The Data abides.  I don’t know about you, but I would take more comfort in that than in any solutions The Stranger Salesperson wants to sell me that utilize an eccentric sales pitch involving HoardaBytes and the Big Data Lebowski.

 

Scary Calendar Effects

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, recorded on the first of three occurrences of Friday the 13th in 2012, I discuss scary calendar effects.

In other words, I discuss how schedules, deadlines, and other date-related aspects can negatively affect enterprise initiatives such as data quality, master data management, and data governance.

Please Beware: This episode concludes with the OCDQ Radio Theater production of Data Quality and Friday the 13th.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.