What Is “Big Data,” Anyway?

A crucial first step in applying analytics is to decode the jargon.

June 25, 2013

(originally published by Booz & Company)

“Big data”—the large data sets that can be managed and analyzed only by increasingly powerful and sophisticated tools—is an expansive and rapidly evolving field. For now, I’m not going to talk about internally generated information that companies use primarily to mine for operational and financial efficiency or the increasing amount of data that machines can generate to indicate that they need servicing, that they are out of an item, and so on. This is fascinating stuff, but beyond the scope of this blog.

Yet even “limiting” ourselves to customer-oriented data barely shrinks the field. So for my inaugural post, my colleagues and I have created a taxonomy to help us get beyond generalities like big data, zero in on the most useful information, and point out how it can help companies get to new insights.

Why is this important? Because the amount of data generated by digitization will always exceed our ability to store, process, and make sense of it. Don’t just take my word for it. The celebrated statistician Nate Silver says that three times per second, every day, we produce the equivalent of the amount of data that the Library of Congress has in its entire collection. Most of it is irrelevant noise, so unless non-technical businesspeople are clear about the kinds of data being gathered and how to make practical use of them, they will be overwhelmed.

Externally gathered data can be divided into two main categories: structured and unstructured. Most businesspeople are more familiar with structured data, of which there are five main types: created, provoked, transacted, compiled, and experimental.

Created data includes things like old-fashioned market research surveys and consumer panels. Another source is people registering online (or even offline) for clubs and loyalty programs, and thereby voluntarily providing information about themselves. This is “created” data because it wouldn’t exist unless we put some mechanism into place to ask people questions and capture their answers.

Provoked data is generated by giving people the opportunity to express their views. The best-known kind of provoked data is ratings and reviews, such as the one-to-five-star rating you give a restaurant or a product when prompted on a retailer’s website or e-commerce site. You add to this growing stream whenever you visit Yelp or Amazon and post a review.

Transacted data is generated every time we click on a website or on a banner, buy something online, or check out at a cash register. Transacted data is a powerful way to understand exactly what was bought, where it was bought, and when. Matching this type of data with other information, such as weather, can yield even more insights. (We know that people buy more Pop-Tarts at Walmart when a storm is predicted.)
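
To make the idea of matching transacted data with weather concrete, here is a minimal sketch. Everything in it is hypothetical: the records, the store name, the item, and the function average_units_by_forecast are invented for illustration and do not come from any real retailer's data. It simply looks up the forecast for each sale by date and store, then averages units sold under each forecast.

```python
from collections import defaultdict

# Hypothetical transaction records: (date, store, item, units sold)
transactions = [
    ("2013-06-01", "store_a", "toaster pastry", 10),
    ("2013-06-02", "store_a", "toaster pastry", 25),
    ("2013-06-03", "store_a", "toaster pastry", 12),
    ("2013-06-04", "store_a", "toaster pastry", 30),
]

# Hypothetical forecast data keyed by (date, store)
forecasts = {
    ("2013-06-01", "store_a"): "clear",
    ("2013-06-02", "store_a"): "storm",
    ("2013-06-03", "store_a"): "clear",
    ("2013-06-04", "store_a"): "storm",
}


def average_units_by_forecast(transactions, forecasts):
    """Join each sale to its forecast and average units per forecast type."""
    totals = defaultdict(lambda: [0, 0])  # forecast -> [total units, row count]
    for date, store, _item, units in transactions:
        forecast = forecasts.get((date, store))
        if forecast is None:
            continue  # skip sales with no matching forecast
        totals[forecast][0] += units
        totals[forecast][1] += 1
    return {f: units / rows for f, (units, rows) in totals.items()}


averages = average_units_by_forecast(transactions, forecasts)
```

With these made-up numbers, averages maps each forecast type to its mean units sold, which is the kind of side-by-side comparison that surfaces a pattern like the Pop-Tarts one.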

Compiled data comes from the giant databases that companies like Acxiom and Experian maintain on every U.S. household. They compile your credit scores, where you live, your purchase history, what automobiles you’ve registered in your name, and more. These databases use name and address as common identifiers and provide a wealth of information for marketing companies to mine and match up against other data they might have.
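
As a rough illustration of matching on name and address, the sketch below normalizes both fields into a key and uses it to attach compiled attributes to a company's own customer list. The records, field names, and the helpers match_key and enrich are all hypothetical; real data vendors use far more sophisticated matching than this, and nothing here reflects how any particular vendor works.

```python
def match_key(name, address):
    """Build a simple join key from a name and address (illustrative only)."""
    def normalize(text):
        # Lowercase, drop periods and commas, collapse repeated whitespace
        cleaned = text.lower().replace(".", "").replace(",", "")
        return " ".join(cleaned.split())
    return (normalize(name), normalize(address))


def enrich(customers, compiled):
    """Attach compiled fields to customers that share a name/address key."""
    index = {match_key(r["name"], r["address"]): r for r in compiled}
    enriched = []
    for customer in customers:
        hit = index.get(match_key(customer["name"], customer["address"]))
        # Copy over only fields the customer record does not already have
        extra = {k: v for k, v in hit.items() if k not in customer} if hit else {}
        enriched.append({**customer, **extra})
    return enriched


# Hypothetical compiled-database record
compiled = [
    {"name": "Pat Q. Example", "address": "12 Main St., Springfield", "vehicles": 2},
]

# Hypothetical in-house loyalty program records
customers = [
    {"name": "pat q example", "address": "12 Main St Springfield", "loyalty_id": "L-001"},
    {"name": "Sam Sample", "address": "9 Oak Ave", "loyalty_id": "L-002"},
]

enriched = enrich(customers, compiled)
```

The first customer picks up the vehicles field because the normalized keys agree despite differences in punctuation and capitalization; the second has no match and is returned unchanged.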

Experimental data is really a hybrid of created and transacted data. It involves designing experiments in which different customer sets receive different marketing treatments (the created piece) and observing the results in the real world (the transactional piece).
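
A minimal sketch of such an experiment, under stated assumptions: customers are split at random between two hypothetical treatments ("coupon" and "no_coupon"), and the set of purchasers stands in for results observed in the real world. The customer IDs, treatment names, and purchase data are invented for illustration, and a real test would also need a check for statistical significance.

```python
import random


def assign_treatments(customer_ids, treatments, seed=0):
    """Randomly assign each customer to one treatment (the 'created' piece)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    # Deal shuffled customers round-robin so group sizes differ by at most one
    return {cid: treatments[i % len(treatments)] for i, cid in enumerate(shuffled)}


def response_rates(assignments, purchasers):
    """Share of customers in each treatment who purchased (the 'transacted' piece)."""
    counts = {}
    for cid, treatment in assignments.items():
        sent, bought = counts.get(treatment, (0, 0))
        counts[treatment] = (sent + 1, bought + (cid in purchasers))
    return {t: bought / sent for t, (sent, bought) in counts.items()}


customer_ids = [f"c{i}" for i in range(100)]
assignments = assign_treatments(customer_ids, ["coupon", "no_coupon"])

# Hypothetical set of customers observed to buy after the campaign
purchasers = {"c1", "c5", "c42"}

rates = response_rates(assignments, purchasers)
```

Comparing the entries in rates shows whether one treatment drew a higher purchase rate than the other, which is the basic readout of this kind of experiment.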

Now let’s talk about unstructured data—information that can’t be easily classified by a numerical rating, click, computer IP address, cookie, or barcode. Unstructured data is, of course, where the real explosion in data quantity is happening. Think about it: Every time you post a picture to Instagram, you’re adding to the mountain of unstructured data being generated around the world. Unstructured data can be divided into two main parts: captured and user-generated.

Captured data refers to information gathered passively from an individual’s behavior, such as search terms you enter and the location data that your phone generates through its GPS. In these cases, you are not necessarily aware that you are generating information about yourself.

User-generated data includes the videos posted on YouTube; the collaborative project files hosted on SharePoint; or the views expressed when someone comments on an article on a news media website, writes a blog entry, or posts an opinion on Twitter. Most user-generated data is not attributable to an individual. (I know the hashtag, but not the person.) It can be used to provide a context for product design and communications, but not for direct targeting.

The real magic happens when these disparate data sources are combined, harmonized, and used as the basis for powerful experiments and predictive models. More on this in my next post.

David Meer

David Meer is a thought leader on consumer insights and marketing analytics, with a special focus on the retail and consumer sectors at Strategy&, PwC’s strategy consulting group. Based in New York, he is a principal with PwC US.