Blameless PostMortems and a Just Culture

Original source (codeascraft.com)

Does it mean everyone gets off the hook for making mistakes? No.

Well, maybe. It depends on what “gets off the hook” means. Let me explain.

Having a Just Culture means that you’re making an effort to balance safety and accountability. It means that by investigating mistakes in a way that focuses on the situational aspects of a failure’s mechanism and on the decision-making process of individuals proximate to the failure, an organization can come out safer than it would normally be if it had simply punished the actors involved as a remediation.

Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an accident can give a detailed account of:

  • what actions they took at what time,
  • what effects they observed,
  • expectations they had,
  • assumptions they had made,
  • and their understanding of the timeline of events as they occurred.

…and that they can give this detailed account without fear of punishment or retribution.

Why shouldn’t they be punished or reprimanded? Because an engineer who thinks they’re going to be reprimanded is disincentivized to give the details necessary to get an understanding of the mechanism, pathology, and operation of the failure. This lack of understanding of how the accident occurred all but guarantees that it will repeat. If not with the original engineer, then with another one in the future.

We believe that this detail is paramount to improving safety at Etsy.
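To make the shape of such an account concrete, here is a minimal sketch in Python of one way the elements in the list above could be captured as structured data. This is purely illustrative and not something the post describes; every name in it (TimelineEntry, render_timeline, and so on) is hypothetical.

# Hypothetical sketch only: the field names mirror the bulleted list above.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class TimelineEntry:
    """One engineer-reported event in a blameless postmortem timeline."""
    timestamp: datetime   # what action was taken, at what time
    actor: str            # the engineer giving the account
    action: str           # what they did
    observed_effect: str  # what effects they observed
    expectation: str      # what they expected to happen
    assumptions: List[str] = field(default_factory=list)  # assumptions they had made

def render_timeline(entries: List[TimelineEntry]) -> str:
    """Sort entries chronologically to reconstruct the timeline of events."""
    lines = []
    for e in sorted(entries, key=lambda e: e.timestamp):
        lines.append(f"{e.timestamp:%Y-%m-%d %H:%M} {e.actor}: {e.action}")
        lines.append(f"  expected: {e.expectation}")
        lines.append(f"  observed: {e.observed_effect}")
        for a in e.assumptions:
            lines.append(f"  assumed:  {a}")
    return "\n".join(lines)

Note what is deliberately absent from a structure like this: there is no field for fault or blame. It records only what the engineer did, observed, expected, and assumed at the time, which is exactly the detail that is lost when people fear punishment.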

If we go with “blame” as the predominant approach, then we’re implicitly accepting that deterrence is how organizations become safer. This is founded in the belief that individuals, not situations, cause errors. It’s also aligned with the idea that there has to be some fear that not doing one’s job correctly could lead to punishment, because the fear of punishment will motivate people to act correctly in the future. Right?

This cycle of name/blame/shame can be looked at like this:

  1. Engineer takes action and contributes to a failure or incident.
  2. Engineer is punished, shamed, blamed, or retrained.
  3. Reduced trust between engineers on the ground (the “sharp end”) and management (the “blunt end”) looking for someone to scapegoat
  4. Engineers become silent on details about actions/situations/observations, resulting in “Cover-Your-Ass” engineering (from fear of punishment)
  5. Management becomes less aware and informed on how work is being performed day to day, and engineers become less educated on lurking or latent conditions for failure due to silence mentioned in #4, above
  6. Errors become more likely, and latent conditions can’t be identified, due to #5 above
  7. Repeat from step 1

We need to avoid this cycle. We want the engineer who has made an error to give details about why (either explicitly or implicitly) they did what they did; why the action made sense to them at the time. This is paramount to understanding the pathology of the failure. The action made sense to the person at the time they took it, because if it hadn’t made sense to them at the time, they wouldn’t have taken the action in the first place.

The fundamental principle here is something Erik Hollnagel has said:

We must strive to understand that accidents don’t happen because people gamble and lose.
Accidents happen because the person believes that:
…what is about to happen is not possible,
…or what is about to happen has no connection to what they are doing,
…or that the possibility of getting the intended outcome is well worth whatever risk there is.

A Second Story

This idea of digging deeper into the circumstance and environment that an engineer found themselves in is called looking for the “Second Story”. In Post-Mortem meetings, we want to find Second Stories to help understand what went wrong.

From Behind Human Error, here’s the difference between “first” and “second” stories of human error:

  • First story: Human error is seen as the cause of failure.
    Second story: Human error is seen as the effect of systemic vulnerabilities deeper inside the organization.
  • First story: Saying what people should have done is a satisfying way to describe failure.
    Second story: Saying what people should have done doesn’t explain why it made sense for them to do what they did.
  • First story: Telling people to be more careful will make the problem go away.
    Second story: Only by constantly seeking out its vulnerabilities can organizations enhance safety.

Allowing Engineers to Own Their Own Stories

A funny thing happens when engineers make mistakes and feel safe when giving details about them: they are not only willing to be held accountable, they are also enthusiastic in helping the rest of the company avoid the same error in the future. They are, after all, the most expert in their own error. They ought to be heavily involved in coming up with remediation items.

So technically, engineers are not at all “off the hook” with a blameless PostMortem process. They are very much on the hook for helping Etsy become safer and more resilient, in the end. And lo and behold: most engineers I know find this idea of making things better for others a worthwhile exercise.

So what do we do to enable a “Just Culture” at Etsy?

  • We encourage learning by having these blameless Post-Mortems on outages and accidents.
  • The goal is to understand how an accident could have happened, in order to better equip ourselves to prevent it from happening in the future.
  • We seek out Second Stories, gather details from multiple perspectives on failures, and we don’t punish people for making mistakes.
  • Instead of punishing engineers, we instead give them the requisite authority to improve safety by allowing them to give detailed accounts of their contributions to failures.
  • We enable and encourage people who do make mistakes to be the experts on educating the rest of the organization how not to make them in the future.
  • We accept that there is always a discretionary space where humans can decide to take action or not, and that the judgement of those decisions lies in hindsight.
  • We accept that the Hindsight Bias will continue to cloud our assessment of past events, and work hard to eliminate it.
  • We accept that the Fundamental Attribution Error is also difficult to escape, so we focus on the environment and circumstances people are working in when investigating accidents.
  • We strive to make sure that the blunt end of the organization understands how work is actually getting done (as opposed to how they imagine it’s getting done, via Gantt charts and procedures) on the sharp end.
  • The sharp end is relied upon to inform the organization where the line is between appropriate and inappropriate behavior. This isn’t something that the blunt end can come up with on its own.

Failure happens. In order to understand how failures happen, we first have to understand our reactions to failure.

One option is to assume the single cause is incompetence and scream at engineers to make them “pay attention!” or “be more careful!”
Another option is to take a hard look at how the accident actually happened, treat the engineers involved with respect, and learn from the event.

That’s why we have blameless Post-Mortems at Etsy, and why we’re looking to create a Just Culture here.

Comments

– Data overload (are we not detecting when to take action? If we are, is the information that accompanies an alert informative or noisy?)
– Lots of other things that I can’t imagine

http://asrs.arc.nasa.gov/overview/summary.html
The idea is to avoid people covering up mistakes. It is more important to report a mistake than it is to punish someone for making a mistake.

It could be one reason why aviation is so safe in the US.

It certainly is an unusual program for the government.

[…] the way in which the organisation deals with failures in the software systems needs to shift to a blame-free model, allowing the whole organisation to learn and improve. In our experience, a ‘big […]

[…] error” approach is the equivalent of cutting off your nose to spite your face. He explains in a blog post that at Etsy, their approach is to “view mistakes, errors, slips, lapses, etc., with a […]

There are two relevant books for reading up on this in depth:
Behind Human Error [Kindle Edition]
The Field Guide to Understanding Human Error [Kindle Edition]
Thank you for providing this info in your recent O’Reilly podcast!

[…] What are the various ways we can anticipate, monitor, respond to, and learn from our failures and our successes?  […]

[…] and operations domain. I’d love to think that the concepts that we’ve taken from the New View on ‘human error’ are becoming more widely known and that people are looking to explore their own narratives through […]

christian evans • 2 years ago

Thank you.

[…] idea of “the blameless postmortem”, a term they credit to Etsy’s John Allspaw. In his article Allspaw writes “an engineer who thinks they’re going to be reprimanded are […]

[…] RCA to be most effective we should instill the idea of the “blameless postmortem” into how we envision RCA. Blameless postmortem is an awesome concept that defines a culture […]

[…] Blameless Postmortems (Allspaw) […]

Kevin • 2 years ago

John,
Thank you very much for this insightful post. You have hit the nail squarely on the head with issues I am seeing at several of my clients these days. Referencing this post has made it much easier to consult with my clients, as I don’t have to be the messenger.

Thank you again.

[…] or too ready to blame others. Etsy, the recently public craft-focused e-commerce site, has made a concerted effort to change that. In a conversation yesterday with Quartz editor-in-chief Kevin Delaney, Etsy CEO Chad Dickerson […]

[…] behind the three armed sweater, Etsy SVP of Technical Operations (and now CTO) John Allspaw wrote a blog post in 2012 about how shaming people who make mistakes basically guarantees that the mistake will happen again, […]

[…] transition to this new model, not everything will go smoothly. You need to encourage individuals to learn from the inevitable mishaps and challenges, and to feed that learning back into the development cycle. Be proactive about building an […]

[…] Blameless PostMortems and a Just Culture […]

[…] focuses on the time line leading up to a failure. Etsy has done an excellent job of encouraging blameless postmortems in industry. It’s a matter of when things are going to fail, not if failure will occur, when […]

[…] Blameless Post-Mortems at Etsy […]

[…] things down requires us to take a pause, collect our thoughts and draft an impartial, sober, and fearless account of what happened, how we dealt with it, what we learned and what steps we’re taking to […]

[…] great article on Blameless PostMortems by John […]

[…] make time for a full post mortem.  I like the philosophy and format shared by Etsy, called a Blameless Post Mortem.  Basically, you want to emphasize that mistakes happen and the key is to focus on what […]

Andy • 1 year ago

I love this piece, but there are two implicit assumptions it makes that (in my experience) don’t hold true in the real world very often:

(1) The action that contributed to a failure or incident was an action taken by an Engineer, not a Manager, and
(2) A “blameful” approach will result in reduced quality of work experience over time because of reduced information flow from Engineering to Management.

In my career thus far (a decade and counting), I’ve observed that it’s much more likely for a failure or incident to have a root cause in Management, not in Engineering. My favorite example is when our Management signed a contract for a vendor to provide a critical service through an API, without even checking to see whether that vendor provided that particular service (they didn’t) or even had an API (they don’t). You can’t blame Engineering for being unable to use a product when that product literally doesn’t exist; that sort of logical atrocity is unique to the profession of Management. I’ve never seen an Engineering decision result in anything worse than a small amount of lost data or a few hours of downtime; but I’ve seen Management decisions result in company-wrecking calamities. There’s not even a contest.

Unfortunately, unlike Engineers, there is no mechanism in a corporate hierarchy to “punish, shame, blame, or retrain” a Manager when their action led to a failure or incident. Even when it is clear to everyone studying the situation that a particular Management decision is the direct cause of a failure, the political reality of living in a capitalist society prevents that person from being blamed or held responsible in any way. So, if avoiding blame is truly a better approach for minimizing failure over time — if the claim of this article is indeed true — then we would expect the areas of a company that are intrinsically shielded from blame to be the areas with the LOWEST rates of failure. Instead, we observe them to be the areas with the HIGHEST rates of failure. In other words, failure has an inverse correlation to blame, not a positive correlation as this article suggests.

You rightly point out that “human error is seen as the effect of systemic vulnerabilities deeper inside the organization.” However, those systemic vulnerabilities are all, categorically, the purview of Management. There is no part of the system of a company that is outside the authority of Management to control; ergo, any systemic vulnerability that has been identified and documented, but not eliminated, must exist BECAUSE OF (rather than despite) Management’s choices. When you are talking about Engineering, you can clearly separate the humans from the system they work within, because that system is created and maintained independently of the actions of the Engineers. But when you are talking about Management, there is NO distinction between human failure and systemic failure, because the “system” is entirely the product of human action — in particular, the actions of the humans called Managers. To put it another way, what we call “the system” is simply the collective actions and decisions of Management, so if “the system” fails, then by definition a Manager’s action (or decision not to act) is the cause of that failure.

You might be able to address SOME Engineering failures by avoiding blame and drawing distinctions between human and systemic failures. But you will never be able to minimize failure across an entire company by this approach because (a) people who expect to never be blamed can objectively be seen to fail more often, and (b) the distinctions between human and systemic failures are mostly fictional.

[…] Also check out John Allspaw, who gets credit for coining the ‘Blameless Post Mortem’ used here: https://codeascraft.com/2012/05/22/blameless-postmortems/ […]

[…] a post on Etsy’s blog, CTO John Allspaw states that, instead of punishing the “bad […]

[…] pages have gotten slower on the backend. The performance team kicked off this quarter by hosting a post mortem for a site-wide performance degradation that occurred at the end of Q2. At that time, we had […]

[…] instead of sitting at your desk and wondering what went wrong, try taking your team through a blameless post-mortem. It could be that the cause is something altogether different from what you think it […]

[…] game days to rehearse incident management practices, and after each incident we recommend that a blameless post mortem is conducted to identify whether there are actions that could improve the team’s ability to […]

[…] posts (part 1 and part 2). Adopting a DevOps model – with the attached concepts of blameless postmortems and failing often and failing fast – is essential to success here. The ability to rapidly […]

[…] should feel comfortable in the post mortem giving [a] detailed account “without fear of punishment or retribution.” Because if engineers – or any individual for that matter – see the focus on blame, they […]

[…] “fail fast” to catch individual security defects, applying other best practices like blameless post mortems can help reinforce the learning […]

[…] Allspaw from Etsy has a concept called Blameless Post Mortem, built around the idea that “human error is seen as the effect of systemic vulnerabilities […]

[…] do postmortems or even specifically how to do them because there are already a lot of great posts out there on the topic. Like Jeff Atwood says, “I don’t think it matters how you conduct the […]

Very nice piece. It’s worth taking a look at the postmortem methodology practiced by the NTSB (National Transportation Safety Board). It is very similar to what you describe in your work at Etsy.

[…] you haven’t read John Allspaw’s piece on Blameless Postmortems, take a few minutes and do it now. John Allspaw is one of the greats in our […]

[…] idea of Blameless PostMortems is not new to TIM Group. We’ve done our best to use our RCAs as a tool for improving the system […]

[…] wonder how a blame-free culture could help ease channels of communication. Etsy has “blameless postmortems, because the company wants to “view mistakes, errors, slips, lapses, […]

[…] a postmortem won’t lead to the blame game; it will yield the root cause. As Etsy CTO John Allspaw says, people are “the most expert in their own error. They ought to be heavily involved in coming […]
