In my last post, I offered a taxonomy of “big data.” I limited the discussion to externally gathered, customer-oriented data, which I then divided into two categories: structured and unstructured. Before delving into how the various parts of the taxonomy combine to unlock insights, I’d like to dig a little deeper into unstructured data. As several readers have pointed out, the category is more complex than one might think. Part of this complexity stems from the fact that not all of what we typically classify as unstructured data is completely unstructured.
Because much unstructured data is machine-generated, it can be thought of as a log file that contains a fair amount of embedded information. Take a tweet, for example. Only a small part of the information in the log file is the text of the tweet itself. The rest is data about the Twitter user's identity and screen name, the language, the time of day, the time zone, the user's location, and so on. A raw log entry shows up as a long string of numbers, letters, and characters, and writing code to parse that string can be tedious. One way of dealing with this is to use a programming language like Python, in which a few concise lines of code can recognize the recurring patterns and extract the embedded fields. In other words, the code imposes some structure.
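To make this concrete, here is a minimal sketch of pulling structured fields out of a single tweet record. The field names loosely mirror the Twitter API's JSON, but treat the record and its fields as illustrative, not as the API's actual schema.

```python
import json

# A raw "log file" entry for one tweet: mostly metadata, a little text.
# The field names are illustrative stand-ins for a real tweet record.
raw = '''{
  "text": "Just tried the new detergent. Love it!",
  "user": {"id": 12345, "screen_name": "example_user"},
  "lang": "en",
  "created_at": "Mon Apr 01 09:15:00 +0000 2013",
  "time_zone": "Eastern Time (US & Canada)"
}'''

record = json.loads(raw)

# A few lines of code separate the structured metadata from the free text.
structured = {
    "screen_name": record["user"]["screen_name"],
    "language": record["lang"],
    "timestamp": record["created_at"],
}
unstructured_text = record["text"]

print(structured)
print(unstructured_text)
```

The structured fields can go straight into a conventional database; only the `text` field needs the text-mining techniques discussed next.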
What remains is the text portion itself, which requires analysis that goes beyond typical structured data algorithms. This can be as simple as obtaining word counts or testing for the occurrence of pre-specified words, or as complex as requiring advanced natural language processing and other text analysis algorithms. These algorithms might perform processes like “stemming” and “lemmatization,” which reduce words to their roots by stripping prefixes, suffixes, and inflectional endings, thereby enabling a classification of the comment. This in turn can lead to a determination of the user’s sentiment. Other approaches have human analysts categorize a sample of text posts, which in turn trains an algorithm to categorize millions more. Regardless of which approach is used, text mining of this sort has enormous value.
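The simple end of that spectrum can be sketched in a few lines of standard-library Python. The toy stemmer and the positive/negative word lists below are illustrative stand-ins for what a real NLP library's stemmer or lemmatizer and sentiment lexicon would provide; the sample comments are invented.

```python
import re
from collections import Counter

comments = [
    "Loving the new detergent, clothes smell great",
    "Loved it, but the bottle leaked",
    "Smells terrible, returning it",
]

def tokenize(text):
    # Lowercase and split on non-letters: the simplest possible tokenizer.
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # Toy stemmer: strip a few common English suffixes so that, e.g.,
    # "loving" and "loved" both collapse to the root "lov".
    # Real work would use an NLP library's stemmer or lemmatizer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Word counts over the stemmed tokens.
counts = Counter(stem(w) for c in comments for w in tokenize(c))

# Test for occurrence of pre-specified words (here, stemmed roots)
# to get a crude net-sentiment score.
positive = {"lov", "great"}
negative = {"terrible", "leak"}
score = sum(counts[w] for w in positive) - sum(counts[w] for w in negative)

print(counts.most_common(3))
print("net sentiment:", score)
```

A production pipeline would replace each piece here with something sturdier, but the shape is the same: tokenize, normalize to roots, then count or classify.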
Therefore, tweets and blog posts, typically thought of as unstructured data, really turn out to be a mix of structured data (log files or other metadata) and unstructured text data that can be mined with advanced algorithms. I’d like to call this kind of data “semi-structured,” and add that notion to our taxonomy. Other common forms of semi-structured data are Web and scientific data.
What about other types of unstructured user-generated data such as audio, image, and video content? In my view, this data is much harder to make sense of and usually not relevant in most business contexts—at least for now. Take video, for example. Gaining consumer insight from analyzing video is nothing new. Years ago, on behalf of a large CPG company, my colleagues and I asked consumers to take home videos of their spouses doing the laundry and painstakingly watched the videos to gain qualitative insights. No data mining, no algorithms—just people watching, categorizing, and distilling what they observed. Retailers have used similar techniques to understand the way shoppers navigate a store. But today, this kind of data is best leveraged at the point of creation: purposefully generated to address a particular question, and interpreted by humans, not machines.
This is not to say that for non-business uses—crime fighting, national security, newsgathering—more automated ways to mine video wouldn’t be valuable. In fact, video analysis software is already commercially available, and new developments such as biometric “loyalty cards” and facial recognition are starting to appear in mainstream retailing. But we’re not there yet. For now, in the consumer-oriented context, most companies should focus on deriving more insight from text, the most prevalent and useful form of so-called unstructured data.