Is big data more than just lots and lots of data? Is big data unstructured and not-so-big data structured? Malcolm Chisholm explored these questions in his recent Information Management column, where he posited that there are, in fact, two datas.
“One type of data,” Chisholm explained, “represents non-material entities in vast computerized ecosystems that humans create and manage. The other data consists of observations of events, which may concern material or non-material entities.”
Providing an example of the first type, Chisholm explained, “my bank account is not a physical thing at all; it is essentially an agreed upon idea between myself, the bank, the legal system, and the regulatory authorities. It only exists insofar as it is represented, and it is represented in data. The balance in my bank account is not some estimate with a positive and negative tolerance; it is exact. The non-material entities of the financial sector are orderly human constructs. Because they are orderly, we can more easily manage them in computerized environments.”
The orderly human constructs that are represented in data, in the stories told by data (including the stories data tell about us and the stories we tell data) is one of my favorite topics. In our increasingly data-constructed world, it’s important to occasionally remind ourselves that data and the real world are not the same thing, especially when data represents non-material entities since, with the possible exception of Makers using 3-D printers, data-represented entities do not re-materialize into the real world.
Describing the second type, Chisholm explained, “a measurement is usually a comparison of a characteristic using some criteria, a count of certain instances, or the comparison of two characteristics. A measurement can generally be quantified, although sometimes it’s expressed in a qualitative manner. I think that big data goes beyond mere measurement, to observations.”
Chisholm called the first type the Data of Representation, and the second type the Data of Observation.
The data of representation tends to be structured, in the relational sense, but doesn’t need to be (e.g., graph databases) and the data of observation tends to be unstructured, but it can also be structured (e.g., the structured observations generated by either a data profiling tool analyzing structured relational tables or flat files, or a word-counting algorithm analyzing unstructured text).
“Structured and unstructured,” Chisholm concluded, “describe form, not essence, and I suggest that representation and observation describe the essences of the two datas. I would also submit that both datas need different data management approaches. We have a good idea what these are for the data of representation, but much less so for the data of observation.”
I agree that there are two types of data (i.e., representation and observation, not big and not-so-big) and that different data uses will require different data management approaches. Although data modeling is still important and data quality still matters, how much data modeling and data quality is needed before data can be effectively used for specific business purposes will vary.
In order to move our discussions forward regarding “big data” and its data management and business intelligence challenges, we have to stop fiercely defending our traditional perspectives about structure and quality in order to effectively manage both the form and essence of the two datas. We also have to stop fiercely defending our traditional perspectives about data analytics, since there will be some data use cases where depth and detailed analysis may not be necessary to provide business insight.
A Tale of Two Datas
In conclusion, and with apologies to Charles Dickens and his A Tale of Two Cities, I offer the following A Tale of Two Datas:
It was the best of times, it was the worst of times.
It was the age of Structured Data, it was the age of Unstructured Data.
It was the epoch of SQL, it was the epoch of NoSQL.
It was the season of Representation, it was the season of Observation.
It was the spring of Big Data Myth, it was the winter of Big Data Reality.
We had everything before us, we had nothing before us,
We were all going direct to hoarding data, we were all going direct the other way.
In short, the period was so far like the present period, that some of its noisiest authorities insisted on its being signaled, for Big Data or for not-so-big data, in the superlative degree of comparison only.