Review: The Field Guide to Understanding ‘Human Error’
Why looking at problems through the lens of human error is fraught with danger, and what to do about it
Last week Facebook, WhatsApp, and Facebook Messenger went down, for everyone in the world, for six hours. The services were unavailable not just to their ordinary users, but to those inside Facebook; there were stories of technicians sent to repair the damage being unable to access Facebook’s data centres, because the service that checked their badges and unlocked the doors was one of those that was offline.
After the collective global meltdown, the accounts of what happened began to emerge and coverage switched to how the problems had occurred.¹ Predictably, virtually all of them contained the phrase “human error”. “It was simply human error,” said The Times. “The outage was caused by human error that occurred while an engineer was doing routine maintenance work,” offered USA Today’s fact-checking department.
If there’s anyone whose gears are ground by this invocation of “human error” to explain incidents, it’s Sidney Dekker. A former commercial pilot and now psychology professor in Australia, Dekker has made it his life’s work to understand why things go wrong and how to avoid it, studying industries like aviation, logistics, medicine and manufacturing. One of the worst things you can do in trying to create a culture of safety, Dekker writes in his Field Guide to Understanding ‘Human Error’, is to use the phrase “human error” and think that it means or explains anything beyond how deficient your own view of the situation is.
Viewing failures through the lens of “human error”, and looking for the mistakes that individuals made as explanations of what happened, is symptomatic of what Dekker calls the “old view” of safety. This old view is deficient because it:

- fails to take into account the realities of work: people have to behave in a safe and correct manner, avoiding mistakes, but they also have to get the job done on time and on budget; they face competing and contradictory pressures that undermine efforts to remain safe
- generally operates in hindsight, through which it’s very easy to connect cause to effect and with which you have unlimited time and concentration
- spends too long focusing on counterfactuals (“they should have done X”, “they shouldn’t have done Y”) rather than explaining why what happened happened
- is judgemental: it gets caught up in a moralistic condemnation of the people who were involved, rather than a dispassionate view of the events that unfolded
- focuses too much on the events and people closest to the incident, rather than on the bigger picture; it pays too close attention to the “sharp end” rather than the “blunt end”, the wider organisation that supports and constrains what happens at the sharp end
In reality, Dekker argues, people don’t go to work with the intention of failing or doing people harm. (If your employees do, you’re systematically hiring psychopaths, and probably have bigger problems to deal with.) People generally operate according to the “local rationality principle”: they do what makes sense to them at any given moment, given their goals, attentional focus, and knowledge. If what they do seems inexplicable to us, it’s because we haven’t made the effort to understand why it made sense for them to do what they did.
People’s motivations, as Daniel Pink has suggested, are driven by desires for autonomy, mastery, and purpose. We want to explore our own ways of doing our jobs rather than being dictated to; that’s autonomy. We want to get better at what we do, and are driven by a sense of progress; that’s mastery. And we want to feel a connection to something bigger, to get the sense that our work is a part of a larger whole; that’s purpose.
So if humans are generally doing their best, why do things go wrong? Not because of a failure in them as individuals, not because of “human error”, but because of failures in their environment, their tools, the objectives that they’ve been set and the way that they’re managed. Dekker suggests a new way of viewing human error that takes this into account:
| The old view… | The new view… |
| --- | --- |
| Asks who is responsible for the outcome | Asks what is responsible for the outcome |
| Sees “human error” as the cause of trouble | Sees “human error” as a symptom of deeper trouble |
| Sees “human error” as random, unreliable behaviour | Sees “human error” as systematically connected to people’s tools, tasks, and operating environment |
| Thinks “human error” is an acceptable conclusion of an investigation | Thinks “human error” is only the starting point for further investigation |
This new view responds to incidents and accidents not by pointing the finger and finding someone to blame, and not by introducing yet more rules, regulations, and procedures with which people must comply. Both of these responses are counter-productive. Witch-hunts and blame-casting backfire by creating cultures in which people don’t feel able to report problems, and so bury issues until they become too bad to ignore. Introducing new processes overwhelms people, and ignores the reality that people must constantly use their judgement to diverge from processes, both to balance their competing real-world objectives and to cope with new and uncertain situations.
There are subtler implications, too. For example, it’s not possible to erase errors solely through training or experience: even the most experienced and well-trained people make mistakes, and they will certainly at some point encounter novel situations, stressful circumstances, or other conditions that cause them to deviate from their training. Nor is it the case that there are “bad apples”, people whose special incompetence makes them responsible for a larger share of mistakes. Believing that there are, and acting on it by firing people, means that you remove people from an organisation without removing the factors that led them into trouble in the first place.
Dekker’s book is a field guide, as the title suggests, for creating an organisational culture that values safety over bureaucracy, and that makes people the answer to improvement rather than the cause of failure. It’s a culture that balances the real-world needs of a business – the need for planes to take off on time and packages to make it to their destination – with safety, rather than promoting a blinkered, narrow culture that forces compliance with arbitrary rules.
As a handbook it’s obviously intended for industries and situations where the stakes are high: where safety failure means injury or death, not just lost money or mildly disgruntled customers. I’m lucky enough to work in an industry where those serious outcomes are incredibly rare. But it strikes me that the culture Dekker describes is just as applicable to, and desirable for, any industry. It’s a culture that values truth and openness, that promotes flexibility, and that allows everyone to learn from others’ mistakes. That’s a special thing, and is worth fostering in any organisation.
¹ Cloudflare’s blog has a great technical explainer of the outage that assumes only basic tech knowledge (but assumes a lot of interest in the protocols that underpin how the internet operates, which you may or may not possess!)