Demystifying Data Science
Jim Harris in Data Quality, OCDQ Radio, Podcasts
Tagged Best of 2013, Big Data, Data Science, Melinda Thielbar, Philosophy, Predictive Analytics, Statistics
Tuesday, February 19, 2013 at 3:14AM

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.
During this episode, special guest and actual data scientist Dr. Melinda Thielbar, a Ph.D. statistician, and I attempt to demystify data science. We explain what a data scientist does, including the requisite skills involved; how to bridge the communication gap between data scientists and business leaders; how to deliver data products that business users can use on their own; and key concepts such as signal-to-noise ratio, uncertainty, experimentation, and correlation.
Melinda Thielbar is the Senior Mathematician for IAVO Research and Scientific. Her work there focuses on power system optimization using real-time prediction models. She has worked as a software developer, an analytic lead for big data implementations, and a statistics and programming teacher.
Melinda Thielbar is a co-founder of Research Triangle Analysts, a professional group for analysts and data scientists located in the Research Triangle of North Carolina.
While Melinda Thielbar doesn’t specialize in a single field, she is particularly interested in power systems because, as she puts it, “A power systems optimizer has to work every time.”

Related OCDQ Radio Episodes
Clicking on the link will take you to the episode’s blog post:
- Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
- Decision Management Systems — Guest James Taylor discusses data-driven decision making and analytical concepts from his book Decision Management Systems: A Practical Guide to Using Business Rules and Predictive Analytics.
- Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.
- So Long 2011, and Thanks for All the . . . — The OCDQ Radio 2011 Year in Review, featuring Jarrett Goldfedder, who discusses Big Data, Nicola Askham, who discusses Data Governance, and Daragh O Brien, who discusses Data Privacy.
- Big Data and Big Analytics — Special Guests Jill Dyché and Dan Soceanu, following the 2011 Pacific Northwest BI Summit, discuss big trends in business intelligence, including cloud computing, collaboration, and big data analytics.



Reader Comments (2)
I found your podcast on Stitcher, and it is the only one I have listened to so far, so apologies if this comment is taking a single remark out of context. I found the podcast very interesting and have subscribed.
However, I wanted to respond to the criticism/comment about Netflix, because their competition, known as The Netflix Prize*, was specifically about improving prediction in a context where a prediction is required, rather than about trying to reduce the "viewers" to a series of featureless numbers.
But your other comment along the lines of "I liked Star Wars" so "I should also like Star Trek" also annoyed me slightly, because I followed some of the early bloggers who were using SVD techniques, and whilst you get a lot of what you would expect, the more detailed categories that emerged from the data were really quite interesting.
Here were some examples of contrasts (Source: sifter.org/~simon/journal/20061027.2.html):
I found category 9 funny: it seems that people who like Star Trek really don't like The Office.
Category 9:
Star Trek VI: The Undiscovered Country (1991)
Star Trek: The Next Generation: Season 3 (1989)
Star Trek: Generations (1994)
Star Trek: First Contact (1996)
Star Trek: Insurrection (1998)
Star Trek: The Next Generation: Season 1 (1987)
Star Trek III: The Search for Spock (1984)
Labyrinth (1986)
Star Trek V: The Final Frontier (1989)
Star Trek: The Next Generation: Season 7 (1993)
Star Trek: The Next Generation: Season 5 (1991)
What Dreams May Come (1998)
Star Trek IV: The Voyage Home (1986)
Star Trek: The Next Generation: Season 2 (1988)
Star Trek: The Next Generation: Season 4 (1990)
Vs.
The Passion of the Christ (2004)
The Office: Series 1 (2001)
The Office Special (2001)
The Office: Series 2 (2002)
Diary of a Mad Black Woman (2005)
Curb Your Enthusiasm: Season 1 (2000)
Arrested Development: Season 1 (2003)
Because of Winn-Dixie (2005)
City of God (2002)
Curb Your Enthusiasm: Season 2 (2001)
Madea's Class Reunion (2003)
Barbershop 2: Back in Business (2004)
The Fast and the Furious (2001)
Shark Tale (2004)
The Wire: Season 1 (2003)
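As a rough illustration of how a contrast like this can fall out of ratings data, here is a minimal Python sketch. It is not the code behind the lists above: the Netflix Prize approaches fitted factors incrementally on a huge, sparse ratings set, whereas this sketch runs a plain dense SVD (NumPy's `numpy.linalg.svd`) on a tiny, fully observed matrix. The ratings are invented toy data, and `factor_extremes` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical toy data: rows are viewers, columns are titles,
# values are invented 1-5 star ratings (not real Netflix data).
TITLES = [
    "Star Trek: First Contact",
    "Star Trek: Generations",
    "The Office: Series 1",
    "Curb Your Enthusiasm: Season 1",
]
RATINGS = np.array(
    [
        [5, 4, 1, 2],
        [4, 5, 2, 1],
        [5, 5, 1, 1],
        [1, 2, 5, 4],
        [2, 1, 4, 5],
        [1, 1, 5, 5],
    ],
    dtype=float,
)


def factor_extremes(ratings, titles, factor=0, k=2):
    """Return the k titles at each end of one latent factor."""
    # Remove each title's average rating so the factor captures taste, not popularity.
    centered = ratings - ratings.mean(axis=0)
    # Rows of vt hold the loading of every title on each latent factor.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    order = np.argsort(vt[factor])
    low_end = [titles[i] for i in order[:k]]
    high_end = [titles[i] for i in order[-k:]]
    return low_end, high_end


if __name__ == "__main__":
    low_end, high_end = factor_extremes(RATINGS, TITLES)
    print("One end of the factor:", low_end)
    print("Vs.")
    print("Other end of the factor:", high_end)
```

The sign of a singular vector is arbitrary, so which group prints first carries no meaning; only the split between the two ends does.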
* "The Netflix Prize sought to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences."
** "It is our great honor to announce the $1M Grand Prize winner of the Netflix Prize contest as team BellKor’s Pragmatic Chaos for their verified submission on July 26, 2009 at 18:18:28 UTC, achieving the winning RMSE of 0.8567 on the test subset. This represents a 10.06% improvement over Cinematch’s score on the test subset at the start of the contest."
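For readers unfamiliar with the metric in that announcement, the short Python sketch below shows how a root-mean-square error (RMSE) and a relative improvement can be computed. The function names are hypothetical, and the baseline it derives is only back-calculated from the two rounded figures quoted above (0.8567 and 10.06%), so treat it as an approximation rather than an official number.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which new_rmse is lower than baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse


def implied_baseline(new_rmse, improvement):
    """Baseline RMSE implied by a new score and its relative improvement."""
    return new_rmse / (1.0 - improvement)


if __name__ == "__main__":
    # Figures quoted in the announcement; the baseline is derived, not quoted.
    baseline = implied_baseline(0.8567, 0.1006)
    print(f"Implied baseline RMSE: {baseline:.4f}")
    print(f"Improvement check: {relative_improvement(baseline, 0.8567):.2%}")
```

On a 1 to 5 star scale, a lower RMSE means the predicted ratings sit closer, on average, to the ratings viewers actually gave.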
Thanks for your excellent comment, Tom.
I really appreciate you calling me out on what was an unfair criticism of the Netflix algorithm, which actually does a good job helping me discover television shows and movies that I would not have previously considered.
During that segment of the podcast, Melinda and I were discussing the limitations of using data science to predict human behavior, and I simply chose Netflix as a quick example of something that is inherently more predictable than other human behavior and that has little negative effect when predicted poorly.
By the latter point, I mean that poorly predicting my tastes in movies is not as concerning as poorly predicting the likelihood I would default on my mortgage, or poorly predicting my plans to save and invest for retirement.
Additionally, an important aspect of data science that was not covered during this podcast is the data privacy implications of predictive models using big data. Here again, entertainment-enhancing algorithms such as Netflix's do not raise the same concerns as Orwellian Big Brother data-surveillance algorithms.
Best Regards,
Jim