Meltdown: Why Our Systems Fail and What We Can Do About It
by Chris Clearfield and András Tilcsik, Penguin Press, 2018
If you like disaster stories, you’ll love Meltdown, by Chris Clearfield, a principal at risk consultancy System Logic, and András Tilcsik, an associate professor at the Rotman School of Management. The authors cover a gamut of catastrophe, from a ruined Thanksgiving dinner to the water crisis in Flint, Mich., and the multiple meltdowns at the Fukushima Daiichi Nuclear Power Plant caused by the Tōhoku earthquake and tsunami in 2011. The worst part of all these examples: According to the authors, they were preventable.
All the disasters recounted in Meltdown share characteristics first identified by sociologist Charles Perrow. Now in his nineties, Perrow earned the appellation “master of disaster” for his seminal study of a host of incidents in high-risk settings, starting with the Three Mile Island Nuclear Generating Station accident in 1979. “In Perrow’s view,” explain Clearfield and Tilcsik, “the accident was not a freak occurrence, but a fundamental feature of the nuclear power plant as a system.”
This system, like each of the systems described in Meltdown’s disasters, is complex and tightly coupled: complex in that the systems are nonlinear, with parts sometimes interacting in hidden ways, and tightly coupled in that there is little slack in these systems. A failure in one part quickly, and often unexpectedly, affects other parts.
For instance, on May 18, 2012, at 11:05 a.m. ET, when Facebook stock was supposed to trade for the first time on Nasdaq, nothing happened. Three billion dollars in trades were not executed. Within minutes, programmers tracked the problem to several lines of code installed years earlier as a validation check. Under pressure, a senior executive instructed the programmers to remove the code. “But Nasdaq’s system was incredibly complicated, and the workaround caused a series of unexpected failures,” write Clearfield and Tilcsik. Some trades were ignored and delayed, and hours elapsed before other trades were reported. Although legally prohibited from trading, Nasdaq itself sold US$125 million worth of Facebook stock. Hundreds of millions of dollars in trading losses resulted.
Clearfield and Tilcsik argue that there are more and more complex, tightly coupled systems in our world — including social media platforms, dam management systems, computerized trading programs, deep-sea oil drilling rigs, and ATMs — and they are accompanied by a steadily increasing risk of meltdowns. “Complexity and coupling make these failures more likely and more consequential,” they write, “and our brains and organizations aren’t built to deal with these kinds of systems.”
Happily, most of Meltdown is devoted to solutions, as well as to some counterintuitive insights that are well worth considering. One of those insights is that adding safety features, such as redundancies and alarms, to protect systems from meltdowns adds complexity — and this makes them more vulnerable to failure, not less. “One study of bedside alarms in five intensive-care units found that, in just one month, there were 2.5 million alerts, nearly four hundred thousand of which made some kind of sound. That’s about one alert every second and some sort of beeping every eight minutes,” report Clearfield and Tilcsik. “Nearly 90 percent of the alarms were false positives. It’s like the old fable: cry wolf every eight minutes, and soon people will tune you out. Worse, when something serious does happen, constant alerts make it hard to sort out the important from the trivial.”
So what can you do to prevent meltdowns? One set of solutions offered by the authors revolves around design. First, figure out if your system is vulnerable because of complexity and tight coupling. Then, make it less so. Clearfield and Tilcsik tell us that Boeing avoids desensitizing pilots to alarms by ranking the alarms by priority. The highest level of alarm is reserved for an aerodynamic stall, because it requires an urgent response. Red warning lights go on, a red message appears on the cockpit screen, and the control columns shake.
Another set of recommendations revolves around diversity. The authors note that at a high-profile technology company that failed spectacularly, the board of directors was “remarkable for its lack of diversity.” Other than two company executives who served on the board, every member was a white man. And their average age was 76. The problem: Research shows that homogeneous groups are less likely to distrust and challenge each other; they also are more likely to accept wrong answers.
Meltdown offers advice in other areas, too. There are sets of recommendations related to identifying clues to systemic weakness and learning from them, encouraging dissent, using outsiders to surface problems, and crafting an effective response when a meltdown occurs. Each set is heavily illustrated with stories and research, although maybe a touch too heavily here and there, with the point getting partially buried.
Overall, Meltdown is something of a rarity: an enlightening and entertaining business book. It synthesizes the work of experts in high-risk systems such as Perrow, Karl Weick of the University of Michigan, Kathleen Sutcliffe of Johns Hopkins University, and Marlys Christianson of the University of Toronto, all of whom studied effective response, and builds upon it in an accessible and practical way.