Blameless Responsibility

I suspect that most people with jobs involving any kind of production (widgets, software, energy…) are familiar with the idea of a post-mortem. You may know it by another name; the Nuclear Navy calls it a critique. A friendlier term is retrospective, which emphasizes that it’s important to note what went well in addition to discussing areas for improvement. If you’re in tech, this is extremely familiar. You may even use the eponymous SaaS tool, which is IMO legitimately useful, if for nothing other than automatically spinning up a Slack channel, Zoom meeting, and Jira ticket with a single command.

The general idea is that if you remove the who element, people will feel freer to discuss the why, and thus improvements will be more easily made. An oft-quoted message is the Agile Prime Directive:

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

I think this quote generally holds true, but many learning opportunities (for both personal growth and technical knowledge) are being lost by insisting on removing any mention of people and eschewing attribution. Rein Henrichs makes this point (and many more) far better than I can, and that video is well worth your 37 minutes. Beyond that, I think (or to properly attribute, my wife - who is not in tech - said this to me during a discussion on the subject) that this has a far more insidious side effect, which is to condition people to be unwilling to step up and say “I made this mistake.”

Google, having created the concept of SRE, has this to say in their book on the subject:

When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.

I agree with much of what they’ve written (and spoiler, in Chapter 33 they reveal that someone heavily involved with the program’s creation came from the Nuclear Navy), but I do disagree with (or perhaps misunderstand) their point on fixing people. Can you make someone do something? No, not really. Can someone willingly change their behavior, if their current approach has proved to be unsuccessful? Yes. Can you encourage and nurture their growth? Absolutely.

I’ve apparently never discussed my time in the Navy on this blog, other than a short mention on my About Me page. Briefly, I spent a decade in the Nuclear Navy as a Submarine Nuclear Electronics Technician, and I also spent time teaching Instrumentation & Control Equipment at Nuclear Field ‘A’ School. None of this has a lot to do with tech, with the exception of critiques/post-mortems. Though I hated them at the time, in retrospect they were a highly effective process. The overall process is this:

1. Something goes wrong. This could be as minor as “I performed Step 3.b.iii without having performed Step 3.b.ii,” or as serious as “I misinterpreted indications during a casualty drill and isolated the non-faulted side of the steam plant.”
   - Note that these are both attributable to personnel (although inadequate training is often substituted here; even then, tracing that backwards lands you at personnel again), which is common for Nuclear Navy critiques. The systems are massively redundant, heavily tested, and have well-known inputs and outputs. For your company’s CRUD app, this may not be true.
2. The maintenance, casualty drill, or other plant evolution is halted, and conditions are evaluated. For maintenance actions, it may be permissible to continue; for casualty drills, that’s highly unlikely.
3. A timeline is generated, which includes the names of all personnel remotely involved, actions taken (or missed), plant conditions, etc.
   - For casualty drills, the entire watch team will be included, even if it’s clear that the incorrect action can be attributed to a single person.
4. The expected indications and/or actions are documented as a comparison to those seen and/or taken.
5. A critique is scheduled ASAP, generally within hours. The mandatory attendees will include the watch team / maintenance team, the leadership of the personnel involved, and all nuclear senior leadership, i.e. Engineering Department Master Chief, Engineering Officer, Executive Officer, and Commanding Officer. If the event happened in-port, various personnel from shore facilities may also attend.
6. At the critique, the following are presented:
   - Timeline of events.
   - Expected actions.
   - Actual and potential consequences.
7. Exploratory questions are then asked of involved personnel, asking them to explain what they saw, why they took the actions they did, and what they expected to happen.
8. Generally, at this point all junior personnel - even if they are the attributable cause - are excused.
9. A root cause analysis is then performed, from which future mitigations can be developed.
10. Following the process, an anonymized message (e.g. instead of “On 2014-12-01 at 14:33Z, aboard USS MISSOURI, ET1(SS) Garland incorrectly…”, it would read “Recently on a ship of this class, a Propulsion Plant Operator…”) may be generated and sent to the rest of the fleet for sharing lessons learned.

Note that this doesn’t shy away from asking people to confront reality, to have ownership, and to explain why they made decisions, but nor does it then publicly shame those people. Accepting blame - ideally, claiming blame - is a critical part of the process. This is not to say that people are always wholly responsible - human-machine interaction is a complicated subject, and Naval designers take it into consideration for improvements. If one person misreads a gauge, it may be user error. If multiple people misread that gauge, there’s a far greater chance that it’s difficult to read.
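The gauge reasoning can be sketched in a few lines of Python. This is only an illustration: the `Misread` record, the `suspect_instruments` function, and the threshold of two distinct people are all assumptions of mine, not anything the Navy actually uses. The point is that you can only run this kind of analysis if the reports say who misread what.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Misread:
    """One documented misreading of an instrument (hypothetical record shape)."""
    instrument: str
    person: str


def suspect_instruments(reports, min_people=2):
    """Return instruments misread by at least `min_people` distinct people."""
    readers = defaultdict(set)
    for report in reports:
        # A set means repeat mistakes by the same person count only once.
        readers[report.instrument].add(report.person)
    return sorted(
        name for name, people in readers.items() if len(people) >= min_people
    )


reports = [
    Misread("gauge A", "operator 1"),
    Misread("gauge A", "operator 2"),
    Misread("gauge B", "operator 1"),
    Misread("gauge B", "operator 1"),  # same person twice: still one reader
]
print(suspect_instruments(reports))  # ['gauge A']
```

Gauge B was misread twice by the same person, so it looks like user error. Gauge A tripped up two different people, so it’s the one worth a second look from the designers.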

Imagine this process if people didn’t feel comfortable admitting when they’d made a mistake (admittedly, you don’t really get a choice to participate, but just go with it), or worse, if any mention of personnel action or inaction was simply elided in discussion. How would progress ever be made? How would you share lessons learned with others? For that matter, what lessons would there be to learn?

People are rightfully afraid of punishment. The etymology of the word makes this obvious - the Latin stem punire means to “cause pain for some offense.” An overly heavy-handed approach is unlikely to gain the results you actually want: if the consequences for failure are severe, people are likely to simply cover up failure. I believe that the opposite is also true, though: if there are no consequences, people are likely to not care if they’ve caused a problem. This doesn’t mean that people need to be demoted, be publicly shamed, or anything like that, but they should feel a sense of responsibility, and have a large role in helping to prevent the problem from happening again. Achieving this requires that the organization not severely punish mistakes, even massive ones. If a junior dev deletes prod and causes millions of dollars in downtime, that is almost certainly not their fault. They may have been the one who ran DROP TABLE customers;, but why did they have unfettered access, without oversight or controls? Don’t fire the dev; have them help in the incident, come up with mitigating measures, and yes, speak in the post-mortem. It’s a wonderful learning opportunity.
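As a rough idea of what “oversight or controls” could look like, here is a minimal sketch of a two-person rule for destructive statements. Everything in it is hypothetical: the `check_statement` function, the `"prod"` environment name, and the short keyword list are assumptions for illustration. A real control would more likely live in database permissions or deployment tooling than in a string match.

```python
import re

# Statement types treated as destructive here; an assumed, deliberately short list.
DESTRUCTIVE = re.compile(r"^\s*(DROP|TRUNCATE|DELETE)\b", re.IGNORECASE)


class ApprovalRequired(Exception):
    """Raised when a destructive statement lacks a second person's sign-off."""


def check_statement(sql, environment, author, approver=None):
    """Return the statement unchanged if it may run; raise otherwise."""
    if environment != "prod" or not DESTRUCTIVE.match(sql):
        return sql
    # Two-person rule: someone other than the author must approve.
    if approver is None or approver == author:
        raise ApprovalRequired(f"needs a second approver in prod: {sql!r}")
    return sql


check_statement("SELECT * FROM customers;", "prod", author="junior")
check_statement("DROP TABLE customers;", "staging", author="junior")
check_statement("DROP TABLE customers;", "prod", author="junior", approver="senior")

try:
    check_statement("DROP TABLE customers;", "prod", author="junior")
except ApprovalRequired as err:
    print(f"blocked: {err}")
```

With a guard like this, the interesting post-mortem question stops being “who typed it?” and becomes “why did two people think this was safe?”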

A few pertinent incidents illustrate these points: one from the Navy, and two from tech, at different companies.

I was performing maintenance on a system called Steam Generator Water Level Control System (abbreviated SGWLCS, and pronounced “squiggles”) while in-port. Part of this involved switching inputs such that Maneuvering (the control room for the reactor plant) didn’t lose any indications. They would lose redundancy, but this is an accepted trade-off for the maintenance in a shutdown condition. Somehow, I switched the wrong one, which caused all indications of that Steam Generator to be lost (there remains a purely mechanical indication elsewhere in the plant, but Maneuvering lost visibility). The Engineering Duty Officer immediately called for a halt to the maintenance over the 2MC (engineering spaces announcing circuit), and ordered the maintenance team to Maneuvering. Walking in, I saw the indications, and immediately knew what I’d done. Some part of my brain refused to accept that this could be, and I attempted to make some excuse about how it was expected, and fine. Even as I was saying these words, they sounded hollow and insincere. The excuse did not work, and a critique was scheduled. In the critique, I owned the mistake, and mentally vowed to never again do that.

Many years later, I was a Junior SRE at a SaaS monitoring company. At the time, it operated mostly on-prem across three datacenters, and had a regular (weekly?) deployment schedule, orchestrated by Bamboo, but kicked off manually (if while reading this, you think “this needs automation,” you’re right). Part of this dance involved an SRE manually editing deployment targets the day before, to deploy a canary. This was documented in Confluence, with screenshots. I don’t precisely recall the confusion, only that instead of deploying to a single target, I deployed to the single target and an entire datacenter. The deployments quickly caught the attention of others, who rapidly assessed the situation, relative risk of rollback vs. continuing, and handled it. I DM’d my boss, saying something like “I’m very sorry, I didn’t mean to do that.” He replied, “Of course not - if you had meant to, we’d be having a different conversation.” My penance was to rewrite the documentation, and to walk another new person through the operation with my documentation, to gauge its clarity. No one shamed me for this beyond a light ribbing (perhaps subjective; I find that it helps bond, but others may not), and I never felt judged or made to feel lesser.

I don’t have as many details for this final story, as I wasn’t involved with the incident; the only reason I was in the meeting was that it was an executive roll-up of the past month’s incidents, and I was presenting the one I was involved with. What I do remember is the team presenting what had happened (customers were unable to log in for somewhere around an hour), and how as a remediation, they were adding additional monitoring. Now, obviously monitoring and alerting are extremely important, and I’m not trying to say that they aren’t. What followed, though, has stuck with me: someone mentioned how they should add an E2E test for logins in staging, to which the reply was that there was a problem with OAuth and they hadn’t been able to figure it out yet. This was seemingly accepted by everyone, until a Director, obviously frustrated, said: “This incident [customers unable to log in] has happened five times. Why can’t people own their work, and take the time to manually test login in staging?” Absolute silence. Claims about development velocity, scalability, etc. followed.

I firmly believe that the Director was spot-on, and that everyone else utterly missed her point. The point was not, “let’s manually run something to be sure that it works, and not worry about automation,” although that may have been a good side effect from customers’ perspective. The point was “why can’t people own their work?”

Own your work. Be accountable to yourself first and foremost, and then to your teammates. Leave ego at the door, and embrace failure.