Jim Harris

My name is Jim Harris, I am the Blogger-in-Chief of OCDQ Blog, and an independent consultant, speaker, and freelance writer for hire.

My Services Contact Me
Search OCDQ Blog
Recent Comments

Entries in Big Data (39)

Thursday
May162013

The Need for Data Philosophers

In my post On Philosophy, Science, and Data, I explained that although some argue philosophy only reigns in the absence of data while science reigns in the analysis of data, a conceptual bridge still remains between analysis and insight, the crossing of which is itself a philosophical exercise.  Therefore, I argued that an endless oscillation persists between science and philosophy, which is why, despite the fact that all we hear about is the need for data scientists, there’s also a need for data philosophers.

Of course, the debate between science and philosophy is a very old one, as is the argument we need both.  In my previous post, I slightly paraphrased Immanuel Kant (“perception without conception is blind and conception without perception is empty”) by saying that science without philosophy is blind and philosophy without science is empty.

In his book Cosmic Apprentice: Dispatches from the Edges of Science, Dorion Sagan explained that science and philosophy hang “in a kind of odd balance, watching each other, holding hands.  Science’s eye for detail, buttressed by philosophy’s broad view, makes for a kind of alembic, an antidote to both.  Although philosophy isn’t fiction, it can be more personal, creative and open, a kind of counterbalance for science even as it argues that science, with its emphasis on a kind of impersonal materialism, provides a crucial reality check for philosophy and a tendency to over-theorize that’s inimical to the scientific spirit.  Ideally, in the search for truth, science and philosophy, the impersonal and autobiographical, can keep each other honest in a kind of open circuit.”

“Science’s spirit is philosophical,” Sagan concluded.  “It is the spirit of questioning, of curiosity, of critical inquiry combined with fact-checking.  It is the spirit of being able to admit you’re wrong, of appealing to data, not authority.”

“Science,” as his father Carl Sagan said, “is a way of thinking much more than it is a body of knowledge.”  By extension, we could say that data science is about a way of thinking much more than it is about big data or about being data-driven.

I have previously blogged that science has always been about bigger questions, not bigger data.  As Claude Lévi-Strauss said, “the scientist is not a person who gives the right answers, but one who asks the right questions.”  As far as data science goes, what are the right questions?  Data scientist Melinda Thielbar proposes three key questions (Actionable? Verifiable? Repeatable?).

Here again we see the interdependence of science and philosophy.  “Philosophy,” Marilyn McCord Adams said, “is thinking really hard about the most important questions and trying to bring analytic clarity both to the questions and the answers.”

“Philosophy is critical thinking,” Don Cupitt said. “Trying to become aware of how one’s own thinking works, of all the things one takes for granted, of the way in which one’s own thinking shapes the things one’s thinking about.”  Yes, even a data scientist’s own thinking could shape the things they are thinking scientifically about.  Big data evangelist James Kobielus recently blogged about five biases that may crop up in a data scientist’s work (Cognitive, Selection, Sampling, Modeling, Funding).

“Data science has a bright future ahead,” explained Hilary Mason in a recent interview.  “There will only be more data, and more of a need for people who can find meaning and value in that data.  We’re also starting to see a greater need for data engineers, people to build infrastructure around data and algorithms, and data artists, people who can visualize the data.”

I agree with Mason, and I would add that we are also starting to see a greater need for data philosophers, people who can, borrowing the words that Anthony Kenny used to define philosophy, “think as clearly as possible about the most fundamental concepts that reach through all the disciplines.”

 

Related Posts

On Philosophy, Science, and Data

OCDQ Radio - Demystifying Data Science

OCDQ Radio - Data Quality and Big Data

Big Data and the Infinite Inbox

The Laugh-In Effect of Big Data

HoardaBytes and the Big Data Lebowski

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

The Wisdom of Crowds, Friends, and Experts

Why Data Science Storytelling Needs a Good Editor

Predictive Analytics, the Data Effect, and Jed Clampett

Bigger Questions, not Bigger Data

The Flying Monkeys of Big Data

Cargo Cult Data Science

Speed Up Your Data to Slow Down Your Decisions

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

Tuesday
Apr232013

The Laugh-In Effect of Big Data

Although I am an advocate for data science and big data done right, lately I have been sounding the Anti-Hype Horn with blog posts offering a contrarian’s view of unstructured data, forewarning you about the flying monkeys of big data, cautioning you against performing Cargo Cult Data Science, and inviting you to ponder the perils of the Infinite Inbox.

The hype of big data has resulted in a lot of people and vendors extolling its virtues with stories about how Internet companies, political campaigns, and new technologies have profited, or otherwise benefited, from big data.  These stories are served up as alleged business cases for investing in big data and data science.  Although some of these stories are fluff pieces, many of them accurately, and in some cases comprehensively, describe a real-world application of big data and data science.  However, these messages most often lack a critically important component — applicability to your specific business.  In Made to Stick: Why Some Ideas Survive and Others Die, Chip Heath and Dan Heath explained that “an accurate but useless idea is still useless.  If a message can’t be used to make predictions or decisions, it is without value, no matter how accurate or comprehensive it is.”

Rowan & Martin’s Laugh-In was an American sketch comedy television series, which aired from 1968 to 1973.  One of the recurring characters portrayed by Arte Johnson was Wolfgang the German soldier, who would often comment on the previous comedy sketch by saying (in a heavy and long-drawn-out German accent): “Very interesting . . . but stupid!”

From now on whenever someone shares another interesting story masquerading as a solid business case for big data that lacks any applicability beyond the specific scenario in the story, a common phenomenon I call The Laugh-In Effect of Big Data, my unapologetic response will resoundingly be: “Very interesting . . . but stupid!”

 

Related Posts

On Philosophy, Science, and Data

OCDQ Radio - Demystifying Data Science

OCDQ Radio - Data Quality and Big Data

Big Data and the Infinite Inbox

HoardaBytes and the Big Data Lebowski

Will Big Data be Blinded by Data Science?

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Big Data el Memorioso

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

Dot Collectors and Dot Connectors

The Wisdom of Crowds, Friends, and Experts

A Contrarian’s View of Unstructured Data

The Flying Monkeys of Big Data

Cargo Cult Data Science

A Statistically Significant Resolution for 2013

Speed Up Your Data to Slow Down Your Decisions

Rage against the Machines Learning

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Swimming in Big Data

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

Thursday
Mar282013

The Big Datastillery

If you’re having trouble viewing this video, you can watch it on Vimeo by clicking on this link: The Big Datastillery on Vimeo

To view or download the infographic featured in the video, click on this direct link to its PDF: The Big Datastillery.pdf

 

This video was sponsored by the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet. I’ve been compensated to contribute to this program, but the opinions expressed in this video are my own and don’t necessarily represent IBM’s positions, strategies, or opinions.

 

Related Posts

Smart Big Data Adoption for Midsize Businesses

Big Data is not just for Big Businesses

Social Business is more than Social Marketing

Social Media Marketing: From Monologues to Dialogues

Social Media for Midsize Businesses

Cloud Computing is the New Nimbyism

Leveraging the Cloud for Application Development

Barriers to Cloud Adoption

Will Big Data be Blinded by Data Science?

Big Data Lessons from Orbitz

The Graystone Effects of Big Data

Talking Business about the Weather

Word of Mouth has become Word of Data

Information Asymmetry versus Empowered Customers

The Age of the Mobile Device

Devising a Mobile Device Strategy

Thursday
Mar142013

On Philosophy, Science, and Data

Ever since Melinda Thielbar helped me demystify data science on OCDQ Radio, I have been pondering my paraphrasing of an old idea: Science without philosophy is blind; Philosophy without science is empty; Data needs both science and philosophy.

“A philosopher’s job is to find out things about the world by thinking rather than observing,” the philosopher Bertrand Russell once said.  One could say a scientist’s job is to find out things about the world by observing and experimenting.  In fact, Russell observed that “the most essential characteristic of scientific technique is that it proceeds from experiment, not from tradition.”

Russell also said that “science is what we know, and philosophy is what we don’t know.”  However, Stuart Firestein, in his book Ignorance: How It Drives Science, explained “there is no surer way to screw up an experiment than to be certain of its outcome.”

Although it seems it would make more sense for science to be driven by what we know, by facts, “working scientists,” according to Firestein, “don’t get bogged down in the factual swamp because they don’t care that much for facts.  It’s not that they discount or ignore them, but rather that they don’t see them as an end in themselves.  They don’t stop at the facts; they begin there, right beyond the facts, where the facts run out.  Facts are selected for the questions they create, for the ignorance they point to.”

In this sense, philosophy and science work together to help us think about and experiment with what we do and don’t know.

Some might argue that while anyone can be a philosopher, being a scientist requires more rigorous training.  A commonly stated requirement in the era of big data is to hire data scientists, but this begs the question: Is data science only for data scientists?

“Clearly what we need,” Firestein explained, “is a crash course in citizen science—a way to humanize science so that it can be both appreciated and judged by an informed citizenry.  Aggregating facts is useless if you don’t have a context to interpret them.”

I would argue that clearly what organizations need is a crash course in data science—a way to humanize data science so that it can be both appreciated and judged by an informed business community.  Big data is useless if you don’t have a business context to interpret it.  Firestein also made great points about science not being exclusionary (i.e., not just for scientists).  Just as you can enjoy watching sports without being a professional athlete and you can appreciate music without being a professional musician, you can—and should—learn the basics of data science (especially statistics) without being a professional data scientist.

In order to truly deliver business value to organizations, data science can not be exclusionary.  This doesn’t mean you shouldn’t hire data scientists.  In many cases, you will need the expertise of professional data scientists.  However, you will not be able to direct them or interpret their findings without understanding the basics, what could be called the philosophy of data science.

Some might argue that philosophy only reigns in the absence of data, while science reigns in the analysis of data.  Although in the era of big data there seems to be fewer areas truly absent of data, a conceptual bridge still remains between analysis and insight, the crossing of which is itself a philosophical exercise.  So, an endless oscillation persists between science and philosophy, which is why science without philosophy is blind, and philosophy without science is empty.  Data needs both science and philosophy.

 

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, experimentation, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.

 

Related Posts

Big Data and the Infinite Inbox

HoardaBytes and the Big Data Lebowski

Will Big Data be Blinded by Data Science?

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Big Data el Memorioso

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

Dot Collectors and Dot Connectors

The Wisdom of Crowds, Friends, and Experts

A Statistically Significant Resolution for 2013

Speed Up Your Data to Slow Down Your Decisions

Rage against the Machines Learning

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Swimming in Big Data

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

Tuesday
Feb262013

Smart Big Data Adoption for Midsize Businesses

In a previous post, I explained that big data is not just for big businesses.  During a recent interview Ed Abrams discussed how mobile, social, and cloud are driving big data adoption among midsize businesses.

As Sharon Hurley Hall recently blogged, midsize businesses are adopting social for the simple reason “a significant proportion of your potential customers are online, and while there they could be buying your product or service.”  She also makes a great point about social adoption not being only externally focused.  “Social business technologies will improve internal communication and knowledge-sharing.  The result is a better-informed and more engaged workforce, and the technology gives the ability to harness creativity and implement innovation to increase your competitive advantage.”

“Becoming more social,” Hall concluded, “doesn’t mean that employees waste time online; in fact, it means that everyone is better informed about both data and strategy, leading to business benefits.  The combination of integrating social technologies to improve your operational efficiency and harnessing social data to boost your knowledge base means that your business can be more competitive and more profitable.”

Hall’s insights also exemplify the proper perspective for midsize businesses to use when adopting big data.  No business of any size should adopt big data just because everyone is talking about it, nor simply because they think it might help their business.

As with everything in the business world, you should seek first to understand what big data adoption can offer, and what kind of investment it requires, before making any type of commitment.  The best thing about big data for midsize businesses is that it provides a big list of possibilities.  But trying to embrace all of the possibilities of big data would be a big mistake.  Start small with big data.  Smart big data adoption for midsize businesses means starting with well-defined, business-enhancing opportunities.

 

This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet. I’ve been compensated to contribute to this program, but the opinions expressed in this post are my own and don’t necessarily represent IBM’s positions, strategies, or opinions.

 

Related Posts

Big Data is not just for Big Businesses

Devising a Mobile Device Strategy

Social Business is more than Social Marketing

Barriers to Cloud Adoption

Leveraging the Cloud for Application Development

Cloud Computing for Midsize Businesses

Social Media Marketing: From Monologues to Dialogues

Social Media for Midsize Businesses

Cloud Computing is the New Nimbyism

The Age of the Mobile Device

Big Data Lessons from Orbitz

The Graystone Effects of Big Data

Word of Mouth has become Word of Data

Information Asymmetry versus Empowered Customers

Talking Business about the Weather

Will Big Data be Blinded by Data Science?

Tuesday
Feb192013

Demystifying Data Science 

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, special guest, and actual data scientist, Dr. Melinda Thielbar, a Ph.D. Statistician, and I attempt to demystify data science by explaining what a data scientist does, including the requisite skills involved, bridging the communication gap between data scientists and business leaders, delivering data products business users can use on their own, and providing a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, experimentation, and correlation.

Melinda Thielbar is the Senior Mathematician for IAVO Research and Scientific.  Her work there focuses on power system optimization using real-time prediction models.  She has worked as a software developer, an analytic lead for big data implementations, and a statistics and programming teacher.

Melinda Thielbar is a co-founder of Research Triangle Analysts, a professional group for analysts and data scientists located in the Research Triangle of North Carolina.

While Melinda Thielbar doesn’t specialize in a single field, she is particularly interested in power systems because, as she puts it, “A power systems optimizer has to work every time.”

 

Demystifying Data Science

Additional listening options:

 

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.

 

Related Posts

There is No Such Thing as a Root Cause

Big Data and the Infinite Inbox

HoardaBytes and the Big Data Lebowski

OCDQ Radio - Data Quality and Big Data

Will Big Data be Blinded by Data Science?

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

How Predictable Are You?

A Statistically Significant Resolution for 2013

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Tuesday
Feb052013

Big Data and the Infinite Inbox

Occasionally it’s necessary to temper the unchecked enthusiasm accompanying the peak of inflated expectations associated with any hype cycle.  This may be especially true for big data, and especially now since, as Svetlana Sicular of Gartner recently blogged, big data is falling into the trough of disillusionment and “to minimize the depth of the fall, companies must be at a high enough level of analytical and enterprise information management maturity combined with organizational support of innovation.”

I fear the fall may feel bottomless for those who fell hard for the hype and believe the Big Data Psychic capable of making better, if not clairvoyant, predictions.  When, in fact, “our predictions may be more prone to failure in the era of big data,” explained Nate Silver in his book The Signal and the Noise: Why Most Predictions Fail but Some Don't.  “There isn’t any more truth in the world than there was before the Internet.  Most of the data is just noise, as most of the universe is filled with empty space.”

Proposing the 3Ss (Small, Slow, Sure) as a counterpoint to the 3Vs (Volume, Velocity, Variety), Stephen Few recently blogged about the slow data movement.  “Data is growing in volume, as it always has, but only a small amount of it is useful.  Data is being generated and transmitted at an increasing velocity, but the race is not necessarily for the swift; slow and steady will win the information race.  Data is branching out in ever-greater variety, but only a few of these new choices are sure.”

Big data requires us to revisit information overload, a term that was originally about, not the increasing amount of information, but instead the increasing access to information.  As Clay Shirky stated, “It’s not information overload, it’s filter failure.”

As Silver noted, the Internet (like the printing press before it) was a watershed moment in our increased access to information, but its data deluge didn’t increase the amount of truth in the world.  And in today’s world, where many of us strive on a daily basis to prevent email filter failure and achieve what Merlin Mann called Inbox Zero, I find unfiltered enthusiasm about big data to be rather ironic, since big data is essentially enabling the data-driven decision making equivalent of the Infinite Inbox.

Imagine logging into your email every morning and discovering: You currently have () Unread Messages.

However, I’m sure most of it probably would be spam, which you obviously wouldn’t have any trouble quickly filtering (after all, infinity minus spam must be a back of the napkin calculation), allowing you to only read the truly useful messages.  Right?

 

Related Posts

HoardaBytes and the Big Data Lebowski

OCDQ Radio - Data Quality and Big Data

Open MIKE Podcast — Episode 05: Defining Big Data

Will Big Data be Blinded by Data Science?

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

A Statistically Significant Resolution for 2013

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Swimming in Big Data

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

Thursday
Jan032013

Best OCDQ Blog Posts of 2012

Welcome to my roundup of the best blog posts published on the Obsessive-Compulsive Data Quality (OCDQ) blog during 2012.

My selections were based on a pseudo-scientific, quasi-statistical combination of page views, comments, and re-tweets, as well as choosing a few of my personal favorites, and which I have organized into four sections of ten best posts by topic or type.

 

Ten Best Posts on Big Data

  • Dot Collectors and Dot Connectors — The multifaceted challenges of big data require the dot collectors of data management and the dot connectors of business intelligence to overcome their attention blindness and work together more collaboratively.
  • HoardaBytes and the Big Data Lebowski — Don’t hoard Data, dude.  The Data must abide.  The Data must abide both the Business, by proving useful to our business activities, and the Individual, by protecting the privacy of our personal activities.
  • Our Increasingly Data-Constructed World — What we now call Big Data is in fact a long-running macro trend underlying the many recent trends and innovations making our world, not just more data-driven, but increasingly data-constructed.
  • Will Big Data be Blinded by Data Science? — With apologies to Thomas Dolby, will the business leaders being told to hire data scientists to derive business value from big data analytics be blind to what data science tries to show them?
  • The Graystone Effects of Big Data — Using a metaphor based on the science fiction television show Caprica, I refer to the positive aspects of Big Data as the Zoe Graystone Effect, and the negative aspects of Big Data as the Daniel Graystone Effect.
  • Exercise Better Data Management — Big Data may be followed by MOData (i.e., MOre Data or Morbidly Obese Data), but that doesn’t necessarily mean we require more data management, instead we just need to exercise better data management.
  • A Tale of Two Datas — Inspired by Malcolm Chisholm and Charles Dickens, there are two types of data (i.e., representation and observation, not big and not-so-big) with different data uses that will require different data management approaches.
  • Data Silence — Not only do we need to adopt a mindset that embraces the principles of data science, but we also have to acknowledge that the biases and preconceptions in our minds could silence the signal and amplify the noise in big data.
  • The Wisdom of Crowds, Friends, and Experts — The future of wisdom will increasingly become an amalgamation of experts, friends, and crowds, with the data and techniques from all three sources often contributing to data-driven decision making.

 

Ten Best Posts on Data Governance and Data Quality

  • Data Quality: Quo Vadimus? — With lots of help from Henrik Liliendahl Sørensen, Garry Ure, Bryan Larkin, and many others via the comments, I ponder where data quality is going, and whether data quality is a journey or a destination.
  • Data Quality and Miracle Exceptions — Battling the dark forces of poor data quality doesn’t require any superpowers, and data quality doesn’t have any miracle exceptions, so for the love of high-quality data everywhere, stop trying to sell us one.
  • Data Myopia and Business Relativity — Examines the two most prevalent definitions for data quality, real-world alignment and fitness for the purpose of use, otherwise known as the danger of data myopia and the challenge of business relativity.
  • How Data Cleansing Saves Lives — Although proactive defect prevention is far superior to reactive data cleansing, the history of the Hubble Space Telescope proves that data cleansing can be not just a necessary evil, but also a necessary good.
  • Data Quality and the Bystander Effect — The most common reason data quality issues are neither reported nor corrected is the Bystander Effect making people less likely to interpret bad data as a problem or, at the very least, not their responsibility.
  • Data Quality and Chicken Little Syndrome — A chicken-metaphor-based post about the far-too-common and fowl folly of, instead of trying to sell the business benefits of data quality, emphasizing the negative aspects of not investing in data quality.
  • Data and its Relationships with Quality — The metadata linking the data management industry to what it manages suffers from the one-to-many relationships created by never agreeing on how data, information, and quality should be defined.
  • Cooks, Chefs, and Data Governance — Implementing policies requires cooks who are adept at carrying out a recipe, as well as chefs who are trusted to figure out how to best combine policies with the organizational ingredients available to them.
  • Availability Bias and Data Quality Improvement — The availability heuristic explains why a reactive data cleansing project is easily approved, and availability bias explains why initiating a proactive data quality program is usually resisted.

 

Ten Best Podcasts

  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Saving Private Data — Recorded in December 2011, guest Daragh O Brien discusses the data privacy and data protection implications of social media, cloud computing, and big data.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Defining Big Data — This episode of the Open MIKE Podcast, with assistance from Robert Hillard, discusses how big data refers to big complexity, not big volume, even though complex datasets tend to grow rapidly, thus making them voluminous.
  • Getting to Know NoSQL — This episode of the Open MIKE Podcast discusses how NoSQL does not mean AntiSQL (i.e., NoSQL is not a Relational replacement), and that business-driven big data needs will often require “Not Only SQL.”

 

Ten Best of the Rest

  • DQ-View: Data Is as Data Does — In this short video, I explain that data’s value comes from data’s usefulness, exemplifying the potential value of unstructured data based on whether or not you put what you read in data management books to use.
  • DQ-View: The Five Stages of Data Quality — In this short video, using my superb acting skills, I demonstrate how coming to terms with the daunting challenge of data quality is somewhat similar to experiencing the Five Stages of Grief.
  • DQ-View: MetaData makes BettahMusic — In this short video, I demonstrate how better metadata makes data better using the metadata automatically and manually created after importing my CD collection into my iTunes library.
  • Metadata, Data Quality, and the Stroop Test — In this colorful (and perhaps too colorful) post, I use the Stroop Test, where colors do not match their names, to discuss the relationship between metadata and data quality.
  • Quality is the Higgs Field of Data — Using one of the biggest science stories of 2012, the potential discovery of the elusive Higgs Boson (which I also attempt to explain), I attempt an analogy for data quality based on the Higgs Field.
  • The Family Circus and Data Quality — Thanks to The Family Circus comic strip created by cartoonist Bil Keane, I explain how Ida Know owns the data, Not Me is accountable for data governance, and Nobody takes responsibility for data quality.
  • Data Love Song Mashup — Since your data needs love too, on Valentine’s Day I wrote this post providing a mashup of love songs for your data (and Rob DuMoulin added a few more in the comments) — Happy Data Quality to you and your data!
  • The Algebra of Collaboration — The trick of algebra equates collaboration with data quality and data governance success when collaboration is viewed not just as a guiding principle, but also as a call to action in your daily practices.
  • The Return of the Dumb Terminal — With help from author Kevin Kelly and my old green machine, I ponder how the mobile-app-portal-to-the-cloud computing model means mobile devices are bringing about the return of the dumb terminal.
  • An Enterprise Carol — Jacob Marley raises the ghosts of a few ideas to consider about how to keep the Enterprise well in the new year via the Ghosts of Enterprise Past (Legacy Applications), Present (IT Consumerization), and Future (Big Data).

 

Thank You for Reading OCDQ Blog in 2012

In 2012, the Obsessive-Compulsive Data Quality (OCDQ) blog published 92 posts, which received 160,000 total page views, while averaging over 400 page views and 200 unique visitors a day.

Thank you for reading OCDQ Blog in 2012.  Your readership was deeply appreciated.

 

Related Posts

Best OCDQ Blog Posts of 2011

So Long 2011, and Thanks for All the . . . – The OCDQ Radio 2011 Year in Review

2012 Quarterly Review of the Data Roundtable (Part 4)

2012 Quarterly Review of the Data Roundtable (Part 3)

2012 Quarterly Review of the Data Roundtable (Part 2)

2012 Quarterly Review of the Data Roundtable (Part 1)

2011 Quarterly Review of the Data Roundtable (Part 4)

2011 Quarterly Review of the Data Roundtable (Part 3)

2011 Quarterly Review of the Data Roundtable (Part 2)

2011 Quarterly Review of the Data Roundtable (Part 1)

Saturday
Dec292012

An Enterprise Resolution

This blog post is sponsored by the Enterprise CIO Forum and HP.

Since just before Christmas I posted An Enterprise Carol, I decided just before New Year’s to post An Enterprise Resolution.

In her article The Irrational Allure of the Next Big Thing, Karla Starr examined why people value potential over achievement in books, sports, and politics.  However, her findings apply equally well to technology and the enterprise’s relationship with IT.

“Subjectivity and hype,” Starr explained, “make people particularly prone to falling for Next Best Thing-ism.”

“Our collective willingness to jump on the bandwagon,” Starr continued, “seems at odds with one of psychology’s most robust findings: We are averse to uncertainty.  But as it turns out, our reaction to incomplete information depends on our interpretation of the scant data we do have.  Uncertainty is a sort of amplifier, intensifying our response whether it’s positive or negative.  As long as we react positively to the little information shown, we’re actually attracted to uncertainty.  It’s curiosity rather than knowledge that leads to increased cognitive engagement.  If the only information at hand is positive, your mind is going to fill in the gaps with other positive details.  A whiff of positive information is all we need to set our minds aflutter.”

In his book Thinking, Fast and Slow, Daniel Kahneman explained “when people are favorably disposed toward a technology, they rate it as offering large benefits and imposing little risk; when they dislike a technology, they can think only of its disadvantages, and few advantages come to mind.  People who receive a message extolling the benefits of a technology also change their beliefs about its risks.  Good technologies have few costs in the imaginary world we inhabit, bad technologies have no benefits, and all decisions are easy.  In the real world of course, we often face painful tradeoffs between benefits and costs.”

In his book What Technology Wants, Kevin Kelly explained that technology has a social dimension beyond the mere functionality it provides.  “We adopt new technologies largely because of what they do for us, but also in part because of what they mean to us.  Often we refuse to adopt technology for the same reason: because of how the avoidance reinforces or shapes our identity.”

So, in 2013, as the big data hype cycle comes down from the peak of inflated expectations, as the painful tradeoffs between the benefits and costs of cloud computing are faced, and as IT consumerization continues to reshape the identity of the IT function, let’s make an enterprise resolution to deal with these realities before we go off chasing the next best thing.  Happy New Year!

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

An Enterprise Carol

Why does the sun never set on legacy applications?

Are Applications the La Brea Tar Pits for Data?

The Diffusion of the Consumerization of IT

Serving IT with a Side of Hash Browns

The Cloud is shifting our Center of Gravity

A Swift Kick in the AAS

Sometimes all you Need is a Hammer

Shadow IT and the New Prometheus

The IT Consumerization Conundrum

The Diderot Effect of New Technology

More Tethered by the Untethered Enterprise?

The Return of the Dumb Terminal

Magic Elephants, Data Psychics, and Invisible Gorillas

Big Data el Memorioso

Information Overload Revisited

The Limitations of Historical Analysis

OCDQ Radio - The Evolution of Enterprise Security

Enterprise Security and Social Engineering

Can the Enterprise really be Secured?

Thursday
Dec202012

Big Data is not just for Big Businesses

“It is widely assumed that big data, which imbues a sense of grandiosity, is only for those large enterprises with enormous amounts of data and the dedicated IT staff to tackle it,” opens the recent article Big data: Why it matters to the midmarket.

Much of the noise generated these days about the big business potential of big data certainly seems to contain very little signal directed at small and midsize businesses.  Although it’s true that big businesses generate more data, faster, and in more varieties, a considerable amount of big data is externally generated, much of which is freely available for use by businesses of all sizes.

The easiest example is the poster child for leveraging big data — Google Search.  But there’s also a growing number of open data sources (e.g., weather data) and social data sources (e.g., Twitter), and, since more of the world is becoming directly digitized, more businesses are now using more data no matter how big they are.  Additionally, as Phil Simon wrote about in The New Small, the free and open source software, as-a-service, cloud, mobile, and social technology trends driving the consumerization of IT are enabling small and midsize businesses to, among other things, use more data and be more competitive with big businesses.

“Each minute of every day, information is produced about the activities of your business, your customers, and your industry,” explained Sarita Harbour in her recent blog post Harnessing Big Data: Giving Midsize Business a Competitive Edge.  “Hidden within this enormous amount of data are trends, patterns, and indicators that, if extracted and identified, can yield important information to make your business more efficient and more competitive, and ultimately, it can make you more money.”

However, the biggest driver of the misperception about big data is its over-identification with data volume.  Which is why earlier this year in his blog post It’s time for a new definition of big data, Robert Hillard used several examples to explain that big data refers more to big complexity than big volume.  While acknowledging that complex datasets tend to grow rapidly, thus making big data voluminous, his wonderfully pithy conclusion was that “big data can be very small and not all large datasets are big.”

Therefore, by extension we could say that the businesses using big data can be small, or mid-sized, and not all the businesses using big data are big.  But, of course, that’s not quite pithy enough.  So let’s simply say that big data is not just for big businesses.

 

This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet.

 

Related Posts

Will Big Data be Blinded by Data Science?

Big Data Lessons from Orbitz

The Graystone Effects of Big Data

Word of Mouth has become Word of Data

Information Asymmetry versus Empowered Customers

Talking Business about the Weather

Magic Elephants, Data Psychics, and Invisible Gorillas

Open MIKE Podcast — Episode 05: Defining Big Data

Open MIKE Podcast — Episode 06: Getting to Know NoSQL

OCDQ Radio - Data Quality and Big Data

HoardaBytes and the Big Data Lebowski

Sometimes it’s Okay to be Shallow

How Predictable Are You?

The Wisdom of Crowds, Friends, and Experts

Exercise Better Data Management

A Tale of Two Datas

Darth Vader, Big Data, and Predictive Analytics

The Big Data Theory

Data Management: The Next Generation

Big Data: Structure and Quality

Tuesday
Dec182012

An Enterprise Carol

This blog post is sponsored by the Enterprise CIO Forum and HP.

Since ‘tis the season for reflecting on the past year and predicting the year ahead, while pondering this post my mind wandered to the reflections and predictions provided by the ghosts of A Christmas Carol by Charles Dickens.  So, I decided to let the spirit of Jacob Marley revisit my previous Enterprise CIO Forum posts to bring you the Ghosts of Enterprise Past, Present, and Future.

 

The Ghost of Enterprise Past

Legacy applications have a way of haunting the enterprise long after they should have been sunset.  The reason that most of them do not go gentle into that good night, but instead rage against the dying of their light, is some users continue using some of the functionality they provide, as well as the data trapped in those applications, to support the enterprise’s daily business activities.

This freaky feature fracture (i.e., technology supporting business needs being splintered across new and legacy applications) leaves many IT departments overburdened with maintaining a lot of technology and data that’s not being used all that much.

The Ghost of Enterprise Past warns us that IT can’t enable the enterprise’s future if it’s stuck still supporting its past.

 

The Ghost of Enterprise Present

While IT was busy battling the Ghost of Enterprise Past, a familiar, but fainter, specter suddenly became empowered by the diffusion of the consumerization of IT.  The rapid ascent of the cloud and mobility, spirited by service-oriented solutions that were more focused on the user experience, promised to quickly deliver only the functionality required right now to support the speed and agility requirements driving the enterprise’s business needs in the present moment.

Gifted by this New Prometheus, Shadow IT emerged from the shadows as the Ghost of Enterprise Present, with business-driven and decentralized IT solutions becoming more commonplace, as well as begrudgingly accepted by IT leaders.

All of which creates quite the IT Conundrum, forming yet another front in the war against Business-IT collaboration.  Although, in the short-term, the consumerization of IT usually better services the technology needs of the enterprise, in the long-term, if it’s not integrated into a cohesive strategy, it creates a complex web of IT that entangles the enterprise much more than it enables it.

And with the enterprise becoming much more of a conceptual, rather than a physical, entity due to the cloud and mobile devices enabling us to take the enterprise with us wherever we go, the evolution of enterprise security is now facing far more daunting challenges than the external security threats we focused on in the past.  This more open business environment is here to stay, and it requires a modern data security model, despite the fact that such a model could become the weakest link in enterprise security.

The Ghost of Enterprise Present asks many questions, but none more frightening than: Can the enterprise really be secured?

 

The Ghost of Enterprise Future

Of course, the T in IT wasn’t the only apparition previously invisible outside of the IT department to recently break through the veil in a big way.  The I in IT had its own coming-out party this year also since, as many predicted, 2012 was the year of Big Data.

Although neither the I nor the T is magic, instead of sugar plums, Data Psychics and Magic Elephants appear to be dancing in everyone’s heads this holiday season.  In other words, the predictive power of big data and the technological wizardry of Hadoop (as well as other NoSQL techniques) seem to be on the wish list of every enterprise for the foreseeable future.

However, despite its unquestionable potential, as its hype starts to settle down, the sobering realities of big data analytics will begin to sink in.  Data’s value comes from data’s usefulness.  If all we do is hoard data, then we’ll become so lost in the details that we’ll be unable to connect enough of the dots to discover meaningful patterns and convert big data into useful information that enables the enterprise to take action, make better decisions, or otherwise support its business activities.

Big data will force us to revisit information overload as we are occasionally confronted with the limitations of historical analysis, and blindsided by how our biases and preconceptions could silence the signal and amplify the noise, which will also force us to realize that data quality still matters in big data and that bigger data needs better data management.

As the Ghost of Enterprise Future, big data may haunt us with more questions than the many answers it will no doubt provide.

 

“Bah, Humbug!”

I realize that this post lacks the happy ending of A Christmas Carol.  To paraphrase Dickens, I endeavored in this ghostly little post to raise the ghosts of a few ideas, not to put my readers out of humor with themselves, with each other, or with the season, but simply to give them thoughts to consider about how to keep the Enterprise well in the new year.  Happy Holidays Everyone!

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

Why does the sun never set on legacy applications?

Are Applications the La Brea Tar Pits for Data?

The Diffusion of the Consumerization of IT

The Cloud is shifting our Center of Gravity

More Tethered by the Untethered Enterprise?

A Swift Kick in the AAS

The UX Factor

Sometimes all you Need is a Hammer

Shadow IT and the New Prometheus

The IT Consumerization Conundrum

OCDQ Radio - The Evolution of Enterprise Security

The Cloud Security Paradox

The Good, the Bad, and the Secure

The Weakest Link in Enterprise Security

Can the Enterprise really be Secured?

Magic Elephants, Data Psychics, and Invisible Gorillas

Big Data el Memorioso

Information Overload Revisited

The Limitations of Historical Analysis

Data Silence

Tuesday
Dec112012

Devising a Mobile Device Strategy

As I previously blogged in The Age of the Mobile Device, the disruptiveness of mobile devices to existing business models is difficult to overstate.  Mobile was also cited as one of the complementary technology forces, along with social and cloud, in the recent Harvard Business Review blog post by R “Ray” Wang about new business models being enabled by big data.

Since their disruptiveness to existing IT models is also difficult to overstate, this post ponders the Bring Your Own Device (BYOD) trend that’s forcing businesses of all sizes to devise a mobile device strategy.  BYOD is often not about bringing your own device to the office, but about bringing your own device with you wherever you go (even though the downside of this untethered enterprise may be that our always precarious work-life balance surrenders to the pervasive work-is-life feeling mobile devices can enable).

In his recent InformationWeek article, BYOD: Why Mobile Device Management Isn’t Enough, Michael Davis observed that too many IT departments are not devising a mobile device strategy, but instead “they’re merely scrambling to meet pressure from the CEO on down to offer BYOD options or increase mobile app access.”  Davis also noted that when IT creates BYOD policies, they often to fail to acknowledge mobile devices have to be managed differently, partially since they are not owned by the company.

An alternative to BYOD, which Brian Proffitt recently blogged about, is Corporate Owned, Personally Enabled (COPE). “Plenty of IT departments see BYOD as a demon to be exorcised from the cubicle farms,” Proffitt explained, “or an opportunity to dump the responsibility for hardware upkeep on their internal customers.  The idea behind BYOD is to let end users choose the devices, programs, and services that best meet their personal and business needs, with access, support, and security supplied by the company IT department — often with subsidies for device purchases.”  Whereas the idea behind COPE is “the organization buys the device and still owns it, but the employee is allowed, within reason, to install the applications they want on the device.”

Whether you opt for BYOD or COPE, Information Management recently highlighted 5 Trouble Spots to consider, which included assuming that mobile device security is already taken care of by in-house security initiatives, data integration disconnects with on-premises data essentially turning mobile devices into mobile data silos, and the combination of personal and business data, which complicates, among other things, remote wiping the data on a mobile device in the event of a theft or security violation, which is why, as Davis concluded, managing the company data on the device is more important than managing the device itself.

With the complex business and IT challenges involved, how is your midsize business devising a mobile device strategy?

 

This post was written as part of the IBM for Midsize Business program, which provides midsize businesses with the tools, expertise and solutions they need to become engines of a smarter planet.

 

Related Posts

The Age of the Mobile Device

The Return of the Dumb Terminal

More Tethered by the Untethered Enterprise?

OCDQ Radio - Social Media for Midsize Businesses

Social Media Marketing: From Monologues to Dialogues

Social Business is more than Social Marketing

The Cloud is shifting our Center of Gravity

Barriers to Cloud Adoption

OCDQ Radio - Cloud Computing for Midsize Businesses

Cloud Computing is the New Nimbyism

The Cloud Security Paradox

OCDQ Radio - The Evolution of Enterprise Security

The Graystone Effects of Big Data

Big Data Lessons from Orbitz

Will Big Data be Blinded by Data Science?

Tuesday
Dec042012

The Wisdom of Crowds, Friends, and Experts

I recently finished reading the TED Book by Jim Hornthal, A Haystack Full of Needles, which included an overview of the different predictive approaches taken by one of the most common forms of data-driven decision making in the era of big data, namely, the recommendation engines increasingly provided by websites, social networks, and mobile apps.

These recommendation engines primarily employ one of three techniques, choosing to base their data-driven recommendations on the “wisdom” provided by either crowds, friends, or experts.

 

The Wisdom of Crowds

In his book The Wisdom of Crowds, James Surowiecki explained that the four conditions characterizing wise crowds are diversity of opinion, independent thinking, decentralization, and aggregation.  Amazon is a great example of a recommendation engine using this approach by assuming that a sufficiently large population of buyers is a good proxy for your purchasing decisions.

For example, Amazon tells you that people who bought James Surowiecki’s bestselling book also bought Thinking, Fast and Slow by Daniel Kahneman, Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business by Jeff Howe, and Wikinomics: How Mass Collaboration Changes Everything by Don Tapscott.  However, Amazon neither provides nor possesses knowledge of why people bought all four of these books or qualification of the subject matter expertise of these readers.

However, these concerns, which we could think of as potential data quality issues, and which would be exacerbated within a small amount of transaction data where the eclectic tastes and idiosyncrasies of individual readers would not help us decide what books to buy, within a large amount of transaction data, we achieve the Wisdom of Crowds effect when, taken in aggregate, we receive a general sense of what books we might like to read based on what a diverse group of readers collectively makes popular.

As I blogged about in my post Sometimes it’s Okay to be Shallow, sometimes the aggregated, general sentiment of a large group of unknown, unqualified strangers will be sufficient to effectively make certain decisions.

 

The Wisdom of Friends

Although the influence of our friends and family is the oldest form of data-driven decision making, historically this influence was delivered by word of mouth, which required you to either be there to hear those influential words when they were spoken, or have a large enough network of people you knew that would be able to eventually pass along those words to you.

But the rise of social networking services, such as Twitter and Facebook, has transformed word of mouth into word of data by transcribing our words into short bursts of social data, such as status updates, online reviews, and blog posts.

Facebook “Likes” are a great example of a recommendation engine that uses the Wisdom of Friends, where our decision to buy a book, see a movie, or listen to a song might be based on whether or not our friends like it.  Of course, “friends” is used in a very loose sense in a social network, and not just on Facebook, since it combines strong connections such as actual friends and family, with weak connections such as acquaintances, friends of friends, and total strangers from the periphery of our social network.

Social influence has never ended with the people we know well, as Nicholas Christakis and James Fowler explained in their book Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives.  But the hyper-connected world enabled by the Internet, and further facilitated by mobile devices, has strengthened the social influence of weak connections, and these friends form a smaller crowd whose wisdom is involved in more of our decisions than we may even be aware of.

 

The Wisdom of Experts

Since it’s more common to associate wisdom with expertise, Pandora is a great example of a recommendation engine that uses the Wisdom of Experts.  Pandora used a team of musicologists (professional musicians and scholars with advanced degrees in music theory) to deconstruct more than 800,000 songs into 450 musical elements that make up each performance, including qualities of melody, harmony, rhythm, form, composition, and lyrics, as part of what Pandora calls the Music Genome Project.

As Pandora explains, their methodology uses precisely defined terminology, a consistent frame of reference, redundant analysis, and ongoing quality control to ensure that data integrity remains reliably high, believing that delivering a great radio experience to each and every listener requires an incredibly broad and deep understanding of music.

Essentially, experts form the smallest crowd of wisdom.  Of course, experts are not always right.  At the very least, experts are not right about every one of their predictions.  Nor do experts always agree with other, which is why I imagine that one of the most challenging aspects of the Music Genome Project is getting music experts to consistently apply precisely the same methodology.

Pandora also acknowledges that each individual has a unique relationship with music (i.e., no one else has tastes exactly like yours), and allows you to “Thumbs Up” or “Thumbs Down” songs without affecting other users, producing more personalized results than either the popularity predicted by the Wisdom of Crowds or the similarity predicted by the Wisdom of Friends.

 

The Future of Wisdom

It’s interesting to note that the Wisdom of Experts is the only one of these approaches that relies on what data management and business intelligence professionals would consider a rigorous approach to data quality and decision quality best practices.  But this is also why the Wisdom of Experts is the most time-consuming and expensive approach to data-driven decision making.

In the past, the Wisdom of Crowds and Friends was ignored in data-driven decision making for the simple reason that this potential wisdom wasn’t digitized.  But now, in the era of big data, not only are crowds and friends digitized, but technological advancements combined with cost-effective options via open source (data and software) and cloud computing make these approaches quicker and cheaper than the Wisdom of Experts.  And despite the potential data quality and decision quality issues, the Wisdom of Crowds and/or Friends is proving itself a viable option for more categories of data-driven decision making.

I predict that the future of wisdom will increasingly become an amalgamation of experts, friends, and crowds, with the data and techniques from all three potential sources of wisdom often acknowledged as contributors to data-driven decision making.

 

Related Posts

Sometimes it’s Okay to be Shallow

Word of Mouth has become Word of Data

The Wisdom of the Social Media Crowd

Data Management: The Next Generation

Exercise Better Data Management

Darth Vader, Big Data, and Predictive Analytics

Data-Driven Intuition

The Big Data Theory

Finding a Needle in a Needle Stack

Big Data, Predictive Analytics, and the Ideal Chronicler

The Limitations of Historical Analysis

Magic Elephants, Data Psychics, and Invisible Gorillas

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

HoardaBytes and the Big Data Lebowski

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

A Tale of Two Datas

Friday
Nov232012

The Limitations of Historical Analysis

This blog post is sponsored by the Enterprise CIO Forum and HP.

“Those who cannot remember the past are condemned to repeat it,” wrote George Santayana in the early 20th century to caution us about not learning the lessons of history.  But with the arrival of the era of big data and dawn of the data scientist in the early 21st century, it seems like we no longer have to worry about this problem since not only is big data allowing us to digitize history, data science is also building us sophisticated statistical models from which we can analyze history in order to predict the future.

However, “every model is based on historical assumptions and perceptual biases,” Daniel Rasmus blogged. “Regardless of the sophistication of the science, we often create models that help us see what we want to see, using data selected as a good indicator of such a perception.”  Although perceptual bias is a form of the data silence I previously blogged about, even absent such a bias, there are limitations to what we can predict about the future based on our analysis of the past.

“We must remember that all data is historical,” Rasmus continued. “There is no data from or about the future.  Future context changes cannot be built into a model because they cannot be anticipated.”  Rasmus used the example that no models of retail supply chains in 1962 could have predicted the disruption eventually caused by that year’s debut of a small retailer in Arkansas called Wal-Mart.  And no models of retail supply chains in 1995 could have predicted the disruption eventually caused by that year’s debut of an online retailer called Amazon.  “Not only must we remember that all data is historical,” Rasmus explained, “but we must also remember that at some point historical data becomes irrelevant when the context changes.”

As I previously blogged, despite what its name implies, predictive analytics can’t predict what’s going to happen with certainty, but it can predict some of the possible things that could happen with a certain probability.  Another important distinction is that “there is a difference between being uncertain about the future and the future itself being uncertain,” Duncan Watts explained in his book Everything is Obvious (Once You Know the Answer).  “The former is really just a lack of information — something we don’t know — whereas the latter implies that the information is, in principle, unknowable.  The former is an orderly universe, where if we just try hard enough, if we’re just smart enough, we can predict the future.  The latter is an essentially random world, where the best we can ever hope for is to express our predictions of various outcomes as probabilities.”

“When we look back to the past,” Watts explained, “we do not wish that we had predicted what the search market share for Google would be in 1999.  Instead we would end up wishing we’d been able to predict on the day of Google’s IPO that within a few years its stock price would peak above $500, because then we could have invested in it and become rich.  If our prediction does not somehow help to bring about larger results, then it is of little interest or value to us.  We care about things that matter, yet it is precisely these larger, more significant predictions about the future that pose the greatest difficulties.”

Although we should heed Santayana’s caution and try to learn history’s lessons in order to factor into our predictions about the future what was relevant from the past, as Watts cautioned, there will be many times when “what is relevant can’t be known until later, and this fundamental relevance problem can’t be eliminated simply by having more information or a smarter algorithm.”

Although big data and data science can certainly help enterprises learn from the past in order to predict some probable futures, the future does not always resemble the past.  So, remember the past, but also remember the limitations of historical analysis.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

WYSIWYG and WYSIATI

Will Big Data be Blinded by Data Science?

Big Data el Memorioso

Information Overload Revisited

HoardaBytes and the Big Data Lebowski

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

The Big Data Theory

Finding a Needle in a Needle Stack

Darth Vader, Big Data, and Predictive Analytics

Data-Driven Intuition

A Tale of Two Datas

Tuesday
Nov132012

Data Silence

This blog post is sponsored by the Enterprise CIO Forum and HP.

In the era of big data, information optimization is becoming a major topic of discussion.  But when some people discuss the big potential of big data analytics under the umbrella term of data science, they make it sound like since we have access to all the data we would ever need, all we have to do is ask the Data Psychic the right question and then listen intently to the answer.

However, in his recent blog post Silence Isn’t Always Golden, Bradley S. Fordham, PhD explained that “listening to what the data does not say is often as important as listening to what it does.  There can be various types of silences in data that we must get past to take the right actions.”  Fordham described these data silences as various potential gaps in our analysis.

One data silence is syntactic gaps, which is a proportionately small amount of data in a very large data set that “will not parse (be converted from raw data into meaningful observations with semantics or meaning) in the standard way.  A common response is to ignore them under the assumption there are too few to really matter.  The problem is that oftentimes these items fail to parse for similar reasons and therefore bear relationships to each other.  So, even though it may only be .1% of the overall population, it is a coherent sub-population that could be telling us something if we took the time to fix the syntactic problems.”

This data silence reminded me of my podcast discussion with Thomas C. Redman, PhD about big data and data quality, during which we discussed how some people erroneously assume that data quality issues can be ignored in larger data sets.

Another data silence is inferential gaps, which is basing an inference on only one variable in a data set.  The example Fordham uses is from a data set showing that 41% of the cars sold during the first quarter of the year were blue, from which we might be tempted to infer that customers bought more blue cars because they preferred blue.  However, by looking at additional variables in the data set and noticing that “70% of the blue cars sold were from the previous model year, it is likely they were discounted to clear them off the lots, thereby inflating the proportion of blue cars sold.  So, maybe blue wasn’t so popular after all.”

Another data silence Fordham described using the same data set is gaps in field of view.  “At first glance, knowing everything on the window sticker of every car sold in the first quarter seems to provide a great set of data to understand what customers wanted and therefore were buying.  At least it did until we got a sinking feeling in our stomachs because we realized that this data only considers what the auto manufacturer actually built.  That field of view is too limited to answer the important customer desire and motivation questions being asked.  We need to break the silence around all the things customers wanted that were not built.”

This data silence reminded me of WYSIATI, which is an acronym coined by Daniel Kahneman to describe how the data you are looking at can greatly influence you to jump to the comforting, but false, conclusion that “what you see is all there is,” thereby preventing you from expanding your field of view to notice what data might be missing from your analysis.

As Fordham concluded, “we need to be careful to listen to all the relevant data, especially the data that is silent within our current analyses.  Applying that discipline will help avoid many costly mistakes that companies make by taking the wrong actions from data even with the best of techniques and intentions.”

Therefore, in order for your enterprise to leverage big data analytics for business success, you not only need to adopt a mindset that embraces the principles of data science, you also need to make sure that your ears are set to listen for data silence.

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

Magic Elephants, Data Psychics, and Invisible Gorillas

OCDQ Radio - Data Quality and Big Data

Big Data: Structure and Quality

WYSIWYG and WYSIATI

Will Big Data be Blinded by Data Science?

Big Data el Memorioso

Information Overload Revisited

HoardaBytes and the Big Data Lebowski

The Data-Decision Symphony

OCDQ Radio - Decision Management Systems

The Big Data Theory

Finding a Needle in a Needle Stack

Darth Vader, Big Data, and Predictive Analytics

Data-Driven Intuition

A Tale of Two Datas