In my post The Wisdom of Crowds, Friends, and Experts, I used Amazon, Facebook, and Pandora respectively as examples of three techniques used by the recommendation engines increasingly provided by websites, social networks, and mobile apps.
Richard Jarvis commented that my assessment of the data quality associated with these techniques actually needed to look at metadata, data, and information, as well as knowledge management. “For crowd-sourced data, we’re assessing quality based on the first-order value rather than the immediate downstream usability. We’re not questioning the accuracy of Amazon’s assertion that customers who purchased X also purchased Y. Rather, we’re interested in the relevance of that information. In terms of knowledge management, I would describe this as broadening data quality to embrace information and knowledge quality.”
As usual, I agreed with Richard. Amazon is not providing access to the operational data underlying their recommendations, which is, of course, understandable, but instead Amazon is providing some aggregated information (e.g., sales rank) along with some detailed information (e.g., consumer reviews) and numerous metadata attributes (e.g., product category).
We have no way of knowing if the underlying operational data is accurate (as well as other aspects of data quality), nor do we have any way of verifying any aspect of the information quality. Some of the metadata could be verified by cross-referencing other sources (e.g., for books, we could verify the metadata with the publishers and other sellers such as Barnes & Noble).
Making use of Amazon’s information has to be done on the assumption of quality — something that data and information quality professionals would never endorse in other contexts (e.g., within Amazon’s internal financial accounting systems).
While this situation has always existed, the Internet and the era of big data is exacerbating it. Although this example focused on recommendation engines, many of the sources involved in big data analytics face this same challenge, such as sentiment analysis and other analysis that is dependent upon self-reported data.
Furthermore, I would argue that many, for lack of a better term, traditional data and information management applications, have functioned off of the same assumption of quality even though data and information quality best practices are implemented.
By extension, although there is an assumption that quality business decisions can only be made based on quality metadata, data, and information, if that were true in all cases, then every business would be bankrupt.
None of this is meant to imply that quality is not important.
On the contrary, my point is that in almost every application of metadata, data, and information, there is an assumption of quality. Obviously, this assumption should be tested whenever it can be, but we have to accept the fact that there will be many times when we will not be able to, thus forcing us to leverage metadata, data, and information on the assumption of their quality.