Seven myths of business experimentation

Disciplined, rigorous testing of ideas and strategies is fundamental to innovation, but businesses are often held back by misconceptions.

August 10, 2020

When Isaac Newton published his third law of motion in 1687 (that for every action, there is an equal and opposite reaction), he unintentionally gave us a conceptual model that extended beyond the motion of physical objects. About 300 years later, the economist Albert Hirschman applied this action–reaction lens to the study of political, social, and economic progress and arrived at a provocative conclusion. He proposed in his book The Rhetoric of Reaction that opposition to progress is often “shaped, not so much by fundamental personality traits, but simply by the imperatives of argument, almost regardless of the desires, character, or conviction of the participants.” Hirschman’s theses can help us understand why some executives aren’t going full throttle with business experimentation, a practice that is key to innovation, drives profitable growth, and creates shareholder value.

I’ve found that there are commonly held misconceptions — deployed as rhetorical devices — that are holding organizations back. These misconceptions need to be understood, addressed, and then set aside.

Hirschman concluded that arguments directed against progress usually come in three flavors: the perversity thesis, the futility thesis, and the jeopardy thesis. When you try to change an organization, it’s likely that opponents will put forward such theses. According to the perversity thesis, any action taken to improve some aspect of a system will backfire, and the organization will be worse off than before the action began. The futility thesis holds that any efforts to transform an organization will barely make a dent because they don’t address the deeper structural challenges. Any action is futile and not worth pursuing.

But the jeopardy thesis is perhaps the most dangerous, because it asserts that a proposed action, though beneficial, involves unacceptable risks and costs. Herein lies the argument’s danger. It’s easy to specify costs and risks up front; however, the benefits of action are often elusive, especially before the action is taken. For example, a supermarket chain would have no problem calculating the cost of remodeling its stores. But the impact on revenue remains uncertain until the stores open for business. The true cost of inaction is opportunity cost, which doesn’t appear on any balance sheet or income statement. The most potent weapons of jeopardy thesis proponents are fear, uncertainty, and doubt.

Becoming an experimentation organization will undoubtedly cause friction, as for every action there will be an opposing reaction. The causes that I’ve come across cover a broad spectrum: inertia, anxiety, incentives, hubris, perceived costs and risks, and so on. But I have also found that managers aren’t always aware of the power of business experiments. This failure to understand and appreciate their true benefits has given rise to fallacies that undermine innovation. Here are seven specific myths that I’ve come across.

Myth 1: Experimentation-driven innovation will kill intuition and judgment

A few years ago, I gave a presentation on business experimentation to a large audience of executives and entrepreneurs. The audience was intrigued until one participant, the founder and CEO of a national restaurant chain, energetically voiced his opposition to subjecting his employees’ ideas to rigorous tests. He strongly believed that innovation is about creativity, confidence, and vision. In a loud voice, he proclaimed: “Steve Jobs didn’t test any of his ideas.” His perversity message was unambiguous: A greater focus on experiments will backfire, will put great ideas at risk of being prematurely dismissed, and will ultimately kill intuition and judgment.

But, I countered, it’s not about intuition versus experiments; in fact, the two need each other. Intuition, customer insights, and qualitative research are valuable sources for new hypotheses, which may or may not be refuted, but hypotheses can often be improved through rigorous testing. The empirical evidence shows that even experts are poor at predicting customer behavior; indeed, they get it wrong most of the time. Wouldn’t it be preferable to know what does and does not work early, and focus resources on the most promising ideas? After some participants sided with this reasoning, the CEO gradually relented. (Curiously, I later found out that his company had been a user of a popular tool for running rigorous in-restaurant experiments, yet he was unaware of it.) With respect to his comment about Steve Jobs, it’s remarkable how many people believe that their intuition and creativity can match Jobs’s track record, until they don’t. Incidentally, let me dispel another myth: Apple does run experiments.

Myth 2: Online experiments will lead to incremental innovation, but not breakthrough performance changes

Managers commonly assume that the greater a change they make, the larger an impact they will see. But this is another manifestation of the perversity thesis. Breakthroughs in business performance aren’t always the result of one or a few big changes. They can also come from the continuous flow of many smaller successful changes that accumulate quickly and can operate on customers over a long period of time. A culture of incremental innovation can be a good thing as long as there are many improvements, they are tested and scaled quickly, and there is scientific evidence for cause and effect. In the digital world, having impact is also about getting many small changes right and scaling them to millions or billions of users.

Live experiments can be scary when we make big changes. For one thing, they can fail in big ways and expose customers to poor outcomes. For a high-traffic online business, the cost of a sudden drop in user conversion can escalate rapidly to millions of dollars. There is another concern: What could an organization possibly learn about cause and effect when several changes are made at once and it can’t isolate the variable that caused the metric to change? Big changes work best when you want to explore and move to a new plateau (such as a new business model or Web experience) because you’ve reached a local optimum, where successive experiments yield diminishing returns.

Certainly, experienced experimenters run breakthrough experiments in which they change several variables at once. And when they do, they pay close attention to behaviors such as change aversion. Short-term reactions to large changes may not be indicative of long-term effects. All innovation involves uncertainty, and both incremental and radical experiments are instrumental to addressing it.

Myth 3: We don’t have enough hypotheses for large-scale experimentation

When managers hear about leading digital companies launching dozens of new experiments every day, they get intimidated. To reach 10,000 experiments per year, their employees would have to design, approve, launch, and analyze around 40 experiments daily, which seems impossible. Worse yet, companies such as Amazon, Booking.com, and Microsoft seem so far ahead that they are not even considered role models. Opponents claim that the small number of feasible experiments their organizations are able to implement will barely make a dent in their company’s financial performance, and that the effort will therefore be futile.

But none of the organizations described started out as virtuosos. Everything they’ve accomplished has come through the careful design and redesign of experimentation systems and years of practice. The reality is that most companies don’t run thousands of experiments each year. State Farm runs between 100 and 200 tests annually (with many variants) and benefits significantly from what it learns. Some companies run even fewer experiments and observe improvements on key performance metrics. Over time, organizations can increase scale and outrun competition.

So it’s not surprising that the adoption of A/B testing tools is especially prominent in startup companies. High-velocity testing gives them the agility to respond to market and customer changes and reduces marketing research expenses. A 2019 study by researchers at Duke and Harvard Universities found that 75 percent of a sample of 13,935 startups founded in 2013 used A/B testing tools. Even though it’s unclear how effectively these companies deployed these tools, the study found that A/B testing had a positive impact on business performance.

Myth 4: Brick-and-mortar companies don’t have enough transactions to run experiments

A risk of using large digital companies to demonstrate the power of business experiments is that skeptics immediately focus on sample size. They note that the vast majority of business isn’t conducted through digital channels: It uses complex distribution systems, such as store networks, sales territories, bank branches, and so on. Business experiments in the latter environments suffer from a variety of analytical complexities, the most important of which is that sample sizes are typically too small to yield statistically valid results. Whereas a large online retailer can simply select 50,000 consumers in a random fashion and determine their reactions to an experiment, even the largest brick-and-mortar retailers can’t randomly assign 50,000 stores to test a new promotion. For them, a realistic test group usually numbers in the dozens, not the thousands. They may ask, Why bother with disciplined business experiments?

But experiments can work well in the brick-and-mortar context, for a number of reasons. First, experiments simply need a sample large enough to average out the effects of all variables except those being studied. The sample size required depends in large part on the magnitude of the expected effect the experimenters are trying to pinpoint. If a company expects the cause it’s studying to have a large effect, the sample size can be smaller. If the expected effect is small, the sample must be larger. That’s because the smaller the expected effect, the greater the number of observations required to detect it, with the desired statistical confidence, amid the surrounding noise of other potential causes.
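The relationship between expected effect and required sample size can be made concrete with the textbook normal-approximation formula for comparing two group means. The sketch below is only an illustration: the function name, the 5 percent significance level, the 80 percent power target, and the noise level are all assumptions, not figures from the book.

```python
import math
from statistics import NormalDist


def required_sample_per_group(effect, sigma, alpha=0.05, power=0.80):
    """Approximate units (stores, customers) needed in each group.

    Uses the normal approximation for a two-sided, two-sample
    comparison of means:
        n = 2 * ((z_alpha/2 + z_power) * sigma / effect) ** 2
    """
    if effect <= 0 or sigma <= 0:
        raise ValueError("effect and sigma must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sigma / effect) ** 2
    return math.ceil(n)


# Hypothetical noise level: unit-to-unit standard deviation of 10
# (in whatever unit the metric is measured).
sigma = 10.0
for effect in (10.0, 5.0, 1.0):
    n = required_sample_per_group(effect, sigma)
    print(f"expected effect {effect:>4}: about {n} units per group")
```

Under these assumptions, halving the expected effect roughly quadruples the required sample, which is why a test group of a few dozen stores can be adequate for a large effect but not for a small one.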

Second, managers often mistakenly assume that a larger sample will automatically lead to better data. Indeed, an experiment can involve a lot of observations, but if they are highly clustered, or correlated to one another, then the true sample size might actually be quite small. Third, companies can utilize special algorithms in combination with multiple sets of big data to offset the limitations of environments with sample sizes even smaller than 100. And finally, experiments that lack a high level of rigor can still be useful for exploration when you are looking for changes in direction.
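One common way to quantify how clustering shrinks the "true" sample size is the design-effect approximation from survey statistics. The sketch below assumes equal-sized clusters and uses hypothetical numbers (20 stores, 1,000 transactions each); it is not a method described in the book.

```python
def effective_sample_size(n_clusters, cluster_size, icc):
    """Effective number of independent observations under clustering.

    Applies the standard design-effect approximation for equal-sized
    clusters: deff = 1 + (cluster_size - 1) * icc, where icc is the
    intraclass correlation (0 = independent observations,
    1 = identical observations within each cluster).
    """
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must be between 0 and 1")
    total = n_clusters * cluster_size
    design_effect = 1 + (cluster_size - 1) * icc
    return total / design_effect


# Hypothetical test: 20 stores, 1,000 transactions observed in each.
for icc in (0.0, 0.05, 1.0):
    n_eff = effective_sample_size(20, 1000, icc)
    print(f"icc={icc}: 20,000 observations behave like about {n_eff:.0f}")
```

Even a modest within-store correlation collapses the 20,000 transactions toward a few hundred independent observations, and in the limit of perfect correlation the sample is effectively the 20 stores themselves.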

It’s also true that companies without digital roots are increasingly finding themselves exposed to digital competition. And when they are interacting with customers through Web-based and mobile channels, companies have access to larger sample sizes. When they do, managers should realize that having an experimentation capability they can use to optimize customer experiences will be necessary to compete.

Myth 5: We tried A/B testing, but it had only a modest impact on our business performance

About a year ago, I discussed online testing with a colleague, and he told me about a conversation he’d had with the CEO of a travel business. The company utilized A/B testing, but according to the CEO, “It didn’t create the promised business value.” In situations like this, the organization stops pushing for scale, scope, and integration across business units, and the futility mind-set becomes self-fulfilling. An organization runs a few dozen tests, finds few winners, and declares the initiative a flop. A variant of this futility thesis is, “We are disappointed with A/B testing because the cumulative business impact is lower than the expected sum of test results.” Perhaps executives zero in too quickly on good news, or teams are understandably excited and overpromise when they “win.”

But there are several reasons test results don’t have to add up. For one, interaction effects can keep results from being additive. Here is a very simple example: Imagine running two experiments, one on font color and the other on background color. Independent experiments show that changing the color to blue in either case results in a sales conversion increase of 1 percent. But when both are changed to blue at the same time, the metrics crash (blue font on blue background is impossible to read). That’s a negative interaction. On the other hand, positive interaction effects can make the whole effect greater than the sum of the experiments. Instead of changing font color, now imagine changing just the wording and again observing a lift of 1 percent. But this time, the combination of better words and blue background color results in a 3 percent improvement (not 1 percent plus 1 percent).
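The arithmetic behind these two examples can be sketched as a simple two-by-two comparison. The function name is hypothetical, the lifts are treated as simply additive percentages for illustration, and the size of the "crash" is a made-up placeholder because the text gives no figure for it.

```python
def interaction_effect(lift_a, lift_b, lift_both):
    """Interaction in a 2x2 test, in the same units as the lifts.

    All lifts are measured against the same control (neither change
    applied). Zero means the two changes are additive; a positive or
    negative value means the combination does better or worse than
    the sum of its parts.
    """
    return lift_both - (lift_a + lift_b)


# Wording and background example from the text: +1 and +1 alone, +3 together.
positive = interaction_effect(1.0, 1.0, 3.0)

# Font and background example: the text gives no number for the crash,
# so -20.0 is an invented placeholder.
negative = interaction_effect(1.0, 1.0, -20.0)

print(f"wording x background interaction: {positive:+.1f} percent")
print(f"font x background interaction: {negative:+.1f} percent")
```

A nonzero interaction is exactly why summing the lifts from separately run tests can overstate or understate the combined business impact.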

There are other reasons (false positives, testing on subsets of a customer base, etc.) why experimental results don’t have to be additive, and it’s important to manage expectations. As Douglas C. Montgomery points out in his book Design and Analysis of Experiments, experimental designs that are particularly suitable to finding and leveraging interaction effects can help.

At times, I’ve also run into skeptics who are concerned about the cost of experimenting at large scale. They want to see the return on investment (ROI) on experimentation before getting started, because that’s how they evaluate all new initiatives. I used to patiently explain the costs and benefits, so they could fill their spreadsheets for a financial analysis. But, as we’ve seen, the costs are tangible and the benefits are about opportunity, which requires a leap of faith. So I’ve changed my response to: “What’s the ROI on breathing?” Perhaps it’s a ridiculous response, but if mastering experimentation is critical to survival, the analogy isn’t so far-fetched.

Myth 6: Understanding causality is no longer needed in the age of big data and business analytics

That’s an actual statement an executive made at the end of a classroom discussion, and another myth that stems from the futility mind-set. He had read stories about companies that found correlations between seemingly unrelated variables (such as buying behaviors of customers) that a company could act on without understanding why those correlations happened. For example, Amazon at one point gave its customers a recommendation to buy organic extra-virgin olive oil when they bought toilet paper, because the correlation was an actual big data finding. (I would have loved to attend the meeting to discuss possible causal explanations!)

But correlation is not causation, and having only a superficial understanding of why things happen can be costly or, in the case of medicine, even dangerous. I told the executive that experiments and advances in big data are complements to each other. Correlations and other interesting patterns that are learned from the analysis of large data sets are excellent sources for new hypotheses that need to be rigorously tested for cause and effect. And big data can help make experiments more efficient, especially when sample sizes are small.

Myth 7: Running experiments on customers without advance consent is always unethical

This myth is the product of a jeopardy mind-set, but it does address some legitimate concerns. Companies must behave lawfully, and they need to demonstrate ethical behavior in order to earn and retain the trust of their customers. In academia, social science researchers have to follow strict protocols when their work involves human subjects. Before getting started, projects are approved by review boards. Medical research has even higher standards and carefully weighs the therapeutic and welfare benefits of experiments against the cost to patients. But we ought to be careful about overstating the potential risks of business experiments and downplaying the true benefits. Without rigorous experiments, without the scientific method, building and organizing knowledge about cause and effect stagnates. If anything, companies don’t experiment enough.

Clearly, the search for knowledge doesn’t give companies a license to run tests that are unethical. The real jeopardy, however, isn’t running unethical experiments that are somehow out of control. The bigger risk lies in not experimenting, and in so doing, forgoing a capability that’s critical for innovation. Some companies institute practices that can strengthen ethical behavior among employees. LinkedIn’s internal guidelines state that the company will not run experiments “that are intended to deliver a negative member experience, have a goal of altering members’ moods or emotions, or override existing members’ settings or choices.” Booking.com includes ethical training as part of its onboarding process for new recruits. The company also demands complete transparency before and after an experiment is launched. Ethical discussions are open to all employees and can be vigorous at times, but ultimately everyone has the same objective: to improve customer experiences and take the friction out of travel. Tricking customers or persuading them to do things that go against this objective doesn’t work in the long run. To find out what does and does not work with speed and rigor, consider the maxim of the company’s former CEO Gillian Tans: “Everything is a test.” To get to a place where testing is more commonplace at companies, the myths have to make way for facts.

Author profile:

Stefan H. Thomke is the William Barclay Harding professor of business at Harvard Business School. He has chaired numerous executive education programs, both at Harvard Business School and in companies around the world.
Reprinted by permission of Harvard Business Review Press. Excerpted from Experimentation Works: The Surprising Power of Business Experiments by Stefan H. Thomke. Copyright 2020 Stefan H. Thomke. All rights reserved.