Will data science become necessary in all businesses like marketing or accounting departments? Neal Patel investigates
There was a time when you could go to cocktail parties and use the words “big data” with careless abandon. No one laughed uncomfortably. There were no unpleasant pauses. No one gave you those derisive looks we normally reserve for the guy who turns up at the show wearing the headlining band’s T-shirt. But today, it seems like those of us who work at the intersection of computer and social science are eager to distance ourselves from terms like “big data”, “analytics” and even “data science”. In fact, every time I hire a new lab member, we have an awkward conversation in which they delicately ask permission to switch their job title from “data scientist” to something else.
What everyone appears to be thinking, but doesn’t want to say out loud, is that big data is massively overhyped and is increasingly embarrassing to legitimate scientists. First, in practice, most of what passes for “data science” is not all that scientific, and is done without empirical rigour. Second, “big data” is inherently limited in what it can do – there are forms of consumer knowledge that remain beyond its reach. Nevertheless, big data continues to be hyped because its intimidating scale and technology make it easy to sell “truthy” tidbits to neophytes who don’t know any better. Indeed, the prevalence of “analytics” and the big data shovel routinely displaces other, less popular research methods more appropriately tuned to uncovering the insights businesses are actually looking for.
In practice, “big data” lacks rigour
Practically speaking, most companies turn to big data to better understand and market products or services to their customers. Unfortunately, they are likely to encounter a range of pseudo-scientific gimmicks masquerading as research before they arrive at anything resembling the truth.
The first excursion is typically with “analytics”. Corporate leaders notice how successfully some systems predict consumer preferences, and assume the same can be done with consumer insights. They want apps that mine Twitter posts; they want to know what’s “trending” on Facebook; they believe a trending Tweet puts them “inside” the minds of their customers.
But the problem with these sorts of analytics is that they don’t get inside the minds of consumers; they measure an outcome behaviour (in this case, Tweeting) driven by what’s going on inside a given customer’s head. The problem with reporting observed outcomes is that they are rarely directional: it is like trying to understand a cause-and-effect relationship without knowing the cause. This is precisely how we get things like alchemy and bloodletting. In the Middle Ages, barbers observed that people became calmer and more relaxed when they lost blood. Relaxation is indeed an outcome of exsanguination, but only because the body is weakened by blood loss and, for all intents and purposes, preparing to die! Taking guidance from a Tweet without understanding why that person tweeted is, therefore, “soothsaying” in the truest sense: it invites assumptions which cannot be verified.
Seldom do these methods successfully link informational or behavioural exchange (tweets, social contacts, etc) with revenue, profit, loss, growth or any corresponding strategic guidance. Meanwhile, “analytics” relies on a host of pseudo-metrics, such as “item trend velocity” and “Tweet volume”, which create a false sense of precision.
Several months ago, I met with a prominent firm pitching a self-described “integrated strategy, technology, and marketing” solution. Out of a team of six, no one – not a single soul – could describe either the monetary or strategic value of a “trending” item on Twitter; or what action should be taken when content is sufficiently “Liked” on Facebook; or whether increased “trend velocity” is good or bad.
As a matter of fact, the team claimed to have devised a six-million-dollar “real-time rapid-response framework” driven by “social media analytics”. After a few tough questions, the “rapid-response framework” appeared more like a person watching TV while Tweeting. Then Tweeting during the commercials. Then maybe monitoring Bottlenose and producing qualitative reports. Setting aside the obvious scientific problem of influencing the outcome variable (a term most commonly used in correlation and regression designs, in which cause-and-effect relationships cannot be demonstrated), this fails to provide a detailed understanding of how consumers think or make sense of the world. This is not consumer insight. This is snake oil.
But many companies have a taste for snake oil. Indeed, because computational methods driven by “big data” work convincingly well when applied to search and ads, practically anything that can be called “data science” enjoys similar credibility – whether scientific or not. Unfortunately, because “data science” merely refers to the practice of extracting generalizable knowledge from data, it can mean anything. The choice between deeper analytic engagement with customers and the previous example is a choice between treating “data science” as “science” and discarding empirical rigour altogether.
Credit companies, for example, employ “data scientists” to learn about their customers. Visa famously disavows being able to predict whether a person is headed for divorce based on their credit records. Is this science? On the one hand, the correlation, in and of itself, might be sufficient for Visa’s hypothetical purposes – after all, they should be primarily interested in their own customers.
On the other hand, the results are not intended for application without limit. However, whenever a data scientist “discovers” a thought-provoking correlation – between, say, marital status and credit records – the subsequent public discourse assumes it applies to everyone. This, in turn, distorts the scientific merit of the “discovery” by stretching its claims beyond the supporting evidence. Generalizing to the public on the basis of one statistically valid association is like releasing a drug to the public on the basis of a single clinical trial – or, for that matter, on the very first observation that it works.
Indeed, if we treated what passes for “data science” as rigorously as “real science,” we might decide that simply finding an association is only the beginning, not the end, of a discovery. This goes beyond the mere distinction between correlation and causation, although it certainly applies. Rather, there is an underlying social process or cultural framework accounting for the correlation which requires further investigation, in the same way that the demonstrated effect of a drug requires a detailed explanation of its function within the human body.
This explanation is not something readily understood or captured by the methods that “discovered” the association. Instead, cultural and phenomenological methods must decode the underlying framework of shared assumptions, taboos and structures of meaning that explain the existence of the observed correlation. Just because something correlates doesn’t mean we can decide which caused which or, indeed, if either caused the other at all.
Big data has limits
One of the reasons why the abuse of science is so prevalent in big data is because it allows less scrupulous researchers to avoid having to acknowledge the limits of what big data can do. At the recent Consumer Electronics Show, Yahoo! CEO Marissa Mayer proclaimed: “The future of search is contextual knowledge”. By “contextual”, Mayer means using cues from an individual’s previous online activity to guess what they might mean when they mistype search terms, or the next album they’ll want to purchase. Big data originated in search, where it is a proven solution to problems like “contextual search”. Imagine how a search engine works. Billions of bits of data from every website must be “read” in some way and then organized so they can be referred to later. This task alone requires a database so massive and complex that it exceeds the computational ability of any individual computer, necessitating a “cloud” of computers acting in concert.
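To make the indexing task concrete, here is a toy sketch of the data structure at the heart of it – an inverted index. This is a hypothetical, single-machine illustration, not any real engine’s implementation; actual search systems shard an index like this across thousands of machines.

```python
# A toy inverted index: map each word to the documents containing it.
# Documents and IDs below are invented for illustration.
documents = {
    1: "big data works well for search and ads",
    2: "search engines index billions of pages",
    3: "big data is massively overhyped",
}

index = {}
for doc_id, text in documents.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def search(query):
    """Return the IDs of documents containing every query word."""
    hits = [index.get(word, set()) for word in query.split()]
    return set.intersection(*hits) if hits else set()

print(search("big data"))   # {1, 3}
print(search("search"))     # {1, 2}
```

The point is simply that every page must be “read” once, up front, so that any later query becomes a cheap lookup – it is the sheer volume of pages, not the cleverness of the lookup, that demands the “cloud”.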
Compare this to a conventional survey dataset. It may contain tens of thousands of observations, but a cloud-enabled survey could solicit a random sample every day, for months; before long, the observation count grows from thousands to millions. Big data delivers impossibly comprehensive datasets – in some cases, population-level data.
At a very basic level, contextual search determines what else consumers want, based on the content they already click on. Imagine that each click is merely one in a cascading set of pair-wise choices, choice (a) versus (b), choice (c) versus (b), (a) over (c), and so on. These choices eventually form a hierarchy. After thousands of iterations, it becomes possible to calculate the probability of selecting choice (a) versus any other option in the set. A massive computational infrastructure powers this algorithm, arriving at a top set of preferences for every individual who visits a site.
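The cascading pair-wise logic above can be sketched in a few lines. The click log, option names and resulting probabilities below are entirely made up for illustration; production systems use far more sophisticated preference models.

```python
from collections import Counter

# Hypothetical click log: each entry is one pair-wise choice,
# recorded as (chosen, rejected).
choices = [
    ("a", "b"), ("a", "c"), ("c", "b"),
    ("a", "b"), ("c", "b"), ("a", "c"),
]

wins = Counter(chosen for chosen, _ in choices)
appearances = Counter()
for chosen, rejected in choices:
    appearances[chosen] += 1
    appearances[rejected] += 1

# Empirical probability that each option wins the pairings it appears in.
preference = {opt: wins[opt] / appearances[opt] for opt in appearances}
ranking = sorted(preference, key=preference.get, reverse=True)
print(ranking)  # ['a', 'c', 'b'] – option (a) tops the hierarchy
```

After enough iterations, the win rates stabilize into exactly the kind of preference hierarchy described above – for every visitor, a ranked guess at what they will click next.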
This is an exceptionally simple version of the algorithms internet companies use, but the advantages are straightforward. Whereas a randomly placed ad is essentially a wager, an ad targeted at click-through behaviour is a data-driven decision. Indeed, the same system makes it possible to compare revenue from targeted ads with revenue from other advertising streams. Finally, costly marketing research can be reduced to simple A/B testing, a practice known as “optimization”.
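As a rough illustration of what such A/B “optimization” involves, here is a minimal two-proportion test with invented numbers; real platforms automate thousands of such comparisons.

```python
from math import sqrt, erf

# Hypothetical figures: does the targeted ad (B) out-perform
# the randomly placed ad (A)?
clicks_a, views_a = 310, 10_000   # control: random placement
clicks_b, views_b = 370, 10_000   # variant: targeted placement

p_a, p_b = clicks_a / views_a, clicks_b / views_b
p_pool = (clicks_a + clicks_b) / (views_a + views_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
z = (p_b - p_a) / se

# Two-sided p-value from the normal approximation.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z = {z:.2f}, p = {p_value:.3f}")  # z ≈ 2.34: B's lift is unlikely to be chance
```

Note what the test delivers: confidence that (b) beats (a), and nothing whatsoever about why – which is precisely the limitation the next section takes up.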
The system works well until we begin to consider “context” more rigorously. Even a perfectly accurate system – and I have seen contextual search platforms ranging from eerily accurate to comically non sequitur – fails to answer the simplest, most fundamental question: why?
“Contextual knowledge” is important because it provides a framework for understanding subjective mental order – the choices, biases, predilections and assumptions that organize our comprehension of reality. A hypothetically perfect contextual search solution generates, at best, the outcome of this mental order, an articulated preference for one choice over another. But there is tremendous business value in understanding the mental order driving those outcomes: understanding why people make the choices that they do.
A company which understands how its consumers think and make sense of their world speaks to them in a shared language. It is the difference between the short margins of selling option (a) over option (b), and a deep, intersubjective connection between a product and an individual – one in which consumers believe the brand is an authentic expression of themselves.
In other words, despite “big” data, some things remain unmeasurable, especially because most big data is purely observational. It can neither confirm causation the way controlled experiments do, nor explain the mental framework which drives decision-making. This is the domain of qualitative and theoretical methods, which social scientists have been using to mine mental life for over a century. In The Philosophy of Money, founding sociologist Georg Simmel distinguishes between “objective” knowledge, which can be measured and quantified, and abstract, “subjective” knowledge concerned with “those questions… that we have so far been unable either to answer or dismiss” (Simmel, 1978).
By “subjective”, Simmel refers to both inner mental experience and fundamental philosophical questions about the origins of things, neither of which readily lend themselves to a quantifiable solution. Yet both are integral to understanding social phenomena. “Even the empirical in its perfected state,” Simmel (1978) argues, “might no more replace philosophy as an interpretation… than would the perfection of mechanical reproduction of phenomena make the visual arts superfluous.”
Thus, extending the example of credit card records, discovering a correlation to marital status should compel an investigation of the underlying values, choices and cultural rules embedded in those spending habits – the truest form of contextual knowledge, the reasons why. Yet these deeper scientific questions are routinely overlooked in favour of the surface correlation. It is no wonder the label “data science” is passé among those who analyse big data for a living – for true computational social scientists, observing the frenzy around pithy “data science” correlations in public discourse is like an electrical engineer watching an electromagnetic field detector in the hands of a “ghost” hunter.
Big data done the right way
There are, of course, examples of data science which reach the highest standards of scientific rigour. Sandy Pentland’s (MIT Media Lab) recent investigation of what makes teams “click”, for instance, deployed 2,500 sociometric badges collecting vast amounts of sensor and proximity data among sales and support teams. Pentland and his team discovered that they could accurately predict each team’s success from its communication-pattern data, without ever meeting the team’s members.
Were this the typical shiny object served up by “data science”, the discussion would have ended there. However, Pentland and his team conducted a deep investigation into the underlying framework for social interaction which explained this result – identifying three key communication dynamics which influence performance, and experimentally demonstrating improvement among teams that map their communication dynamics to the ideal pattern. As a result, Pentland not only made a significant scientific discovery, but generated clear business recommendations about what teams should do to succeed. According to Pentland, stellar teams hold frequent informal meetings, or “asides”, and ensure everyone speaks and listens in equal measure. In determining why, the MIT researchers transformed a curious insight into concrete guidance for businesses interested in higher team productivity.
The question preoccupying individuals in my profession is whether high or low scientific standards will prevail in the world of “big data”. Will “data science” and “analytics” become the next “focus group”, or something equally discredited?
Indeed, as the set of practices referred to as “big data” and “data science” evolves into a professional institution, it will have to choose between the high standards of Sandy Pentland’s work and the pseudo-scientific slapstick of the next big “integrated technology, strategy and marketing” solution. Today, data scientists are born “by accident”, when a social science PhD picks up programming skills, or a computer scientist develops an intellectual fascination with human behaviour. The next generation will train in “data science” degree programmes – indeed, a number of universities have already started data science Master’s programmes. Today, data scientists define their own positions within organizations; tomorrow, there will be official data science departments. The level of scientific rigour these programmes instil in future students, and what prospective employers expect of them, will depend on the prevailing professional standards of our day.
Unfortunately, industrial competition does not always select for the fittest institutional structures. As Neil Fligstein’s study of the savings and loan industry suggests, firms operating in “emerging” industries can behave mimetically – that is, they simply copy what other firms do, because it seems to work. And this is the greatest danger posed by the fact that “big data” works too well: the promise of immediate reward is more attractive than the more difficult engagement with context, with why – although the latter offers the greater long-term payoff.
For the individuals who work with big data every day, the challenge is ultimately ours. In the long run, will data science become a necessary competency of all businesses (like marketing, IT or accounting)? A skill everyone applies in their job? Or will it become the province of outside consultants who operate on demand? Ultimately, each of us must decide whether to embrace computational methods as true scientists, pursuing an engagement with why, or gorge ourselves on the golden goose of pseudo “data science”.
Neal Patel is technical program lead, Advanced Technology and Projects (ATAP) at Google Inc
References
Ciarelli, N. (2010) “How Visa Predicts Divorce”, The Daily Beast.
Edwards, J. (2014) “Yahoo Just Acquired A New Search Product That Could Hurt Google”, Business Insider.
Indvik, L. (2013) “Eric Schmidt: The Future of Magazines Is on Tablets”, Mashable.
Pentland, A. (2012) “The New Science of Building Great Teams”, Harvard Business Review.
Simmel, G. (1978) The Philosophy of Money, Frisby, D. (ed.), London: Routledge.