The secret to blameless post mortems

How your engineering teams can move past finger-pointing to effectively managing mistakes

Sidney Dekker describes the ‘bad apple’ theory as the belief that complex systems would be fine if it were not for the erratic behavior of some unreliable people. According to this theory, when unexpected events occur in an otherwise safe system, they are typically and conveniently attributed to “human error” and, when they are severe, to “operator carelessness”. Similarly, post mortems often look to define and parcel out blame to engineers. Yet this raises the question of how effective post mortems can be if their only purpose is to assign blame. Instead, effective post mortems need to “acknowledge the human tendency to blame, to allow for a productive form of its expression, and constantly refocus the postmortem’s attention past it.”

Post mortems vs retrospectives

The problem with post mortems begins with the name “post mortem”, which if you ask me sounds more than a bit macabre. But a distinct name is necessary, because a post mortem needs to be distinguished from a “retrospective”. Retrospectives are routine and often take place at the end of a sprint or in a weekly meeting. They are planned conversations that allow for a positive exchange of ideas and thoughts about projects. Retrospectives will ask questions like:

How did the release or deployment go?
What should we stop doing?
What should we start doing?
What metrics do we need to focus on?

Post mortems, on the other hand, will (ideally):

Take place within 24 hours of a significant failure, such as a major piece of code failing or a failed deployment
Not focus on blaming one individual
Focus on RCA (root-cause analysis)
Bring together the key stakeholders in the failure, including management
Document the issues and define actionable points
Create tickets for the agreed changes so that they actually get made
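The checklist above can be sketched as a small record that a team fills in for each incident. This is only an illustration in Python, not a prescribed tool: the class name `PostMortem`, its fields, the ticket ID, and the dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class PostMortem:
    """Hypothetical record of one post mortem; every field name is an assumption."""
    failure_at: datetime   # when the failure happened
    held_at: datetime      # when the post mortem meeting took place
    root_causes: List[str] = field(default_factory=list)
    attendees: List[str] = field(default_factory=list)
    tickets: List[str] = field(default_factory=list)

    def unmet_items(self) -> List[str]:
        """Return the checklist items that this post mortem has not met yet."""
        problems = []
        if self.held_at - self.failure_at > timedelta(hours=24):
            problems.append("held more than 24 hours after the failure")
        if not self.root_causes:
            problems.append("no root cause recorded")
        if "management" not in self.attendees:
            problems.append("management did not attend")
        if not self.tickets:
            problems.append("no tickets created")
        return problems


# Illustrative data only.
pm = PostMortem(
    failure_at=datetime(2024, 1, 10, 9, 0),
    held_at=datetime(2024, 1, 10, 15, 0),
    root_causes=["config change not tested with dependent services"],
    attendees=["dev", "ops", "management"],
    tickets=["PM-1"],
)
print(pm.unmet_items())  # []
```

Note that the record has no field for a culprit. Root causes are written as conditions in the system, which keeps the focus on what failed rather than on who.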

The problem is that while retrospectives are often actionable and productive team exercises, post mortems are frequently demoralizing and end up as a blame game. This is particularly true in DevOps environments where a wall still stands between Devs and Ops, so that issues on one side get blamed on the other. But the problem of walls applies to any engineering company where groups are siloed. Furthermore, much of the valuable information that comes out of a post mortem on one side is not shared with the other side, much less with the business stakeholders.

Given these realities, how can an engineer get a wider view? How can a post mortem work to help everyone in the company?

Post mortems: Why do bad things happen?

Engineers make trade-offs and don’t always realize the impact of their actions. Systems are complicated, and a code problem might not surface until the software is deployed alongside six other pieces of software. One could argue that putting Ops and Devs on teams together would solve some of these issues. See our blog post on this topic for further discussion.

Often, teams run at 100% capacity without any room for a buffer. As a result, the opportunity to work methodically is reduced. Managers don’t focus on managing the workload of their engineers or on making sure the work assigned matches the team’s measured velocity. Instead, teams are managed by higher-ups who don’t have visibility into the team’s constraints and its ability to take on new work.

Treat blame as a speed bump

The goal of a blameless post mortem is not so much to avoid seeing where blame lies as to recognize the failure, understand its root causes, and move on. In the case of a software bug causing a deployment failure, the following steps would ideally follow:

Use OnPage’s Enterprise console to get excellent insights into where alerts happened and how they can be fine-tuned so they are more actionable
Ensure the post mortem is constructive and doesn’t seek to divvy up blame as if it were a pie
Engineers should feel comfortable giving a detailed account in the post mortem “without fear of punishment or retribution.” If engineers, or any individuals for that matter, see that the focus is on blame, they will shut down
Have Devs and Ops talk to one another. The walls between them need to come down
Involve stakeholders. Form a task force. Build empathy, and recognize that a failure is painful on both sides
Feed the findings of post mortems back upstream so that concrete actions are taken. Write them up and keep the documents in an accessible place. Make sure management sees them, for example at a lunchtime seminar
Follow up on post mortems: create tickets and check where they stand at the end of the month
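The follow-up step is easy to let slide, so it can help to make it mechanical. Below is a minimal sketch in Python of a check that lists post mortem tickets still open after a month. The names `ActionItem` and `overdue`, the ticket IDs, and the 30-day cutoff standing in for "a month" are assumptions made for illustration; in practice the data would come from whatever ticketing system the team already uses.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class ActionItem:
    """Hypothetical follow-up ticket created during a post mortem."""
    ticket_id: str
    created: date
    done: bool = False


def overdue(items: List[ActionItem], today: date, max_age_days: int = 30) -> List[str]:
    """Return IDs of tickets still open more than max_age_days after creation."""
    cutoff = today - timedelta(days=max_age_days)
    return [i.ticket_id for i in items if not i.done and i.created < cutoff]


# Illustrative data only.
items = [
    ActionItem("PM-1", date(2024, 1, 2), done=True),
    ActionItem("PM-2", date(2024, 1, 2)),
    ActionItem("PM-3", date(2024, 1, 25)),
]
print(overdue(items, today=date(2024, 2, 10)))  # ['PM-2']
```

The list this produces is a prompt for a conversation about why the work stalled, not a list of people to blame.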

Management must understand what’s in it for them. It’s not just a matter of solving an engineering problem. When the engineering problem is solved, the product and company can be more productive. Strong leadership is a must in effective post mortems.

Conclusion

Management and engineers need to see the pain points felt by the other side. Each group needs to invite the other into ‘their house’, and the invitation will then be returned. Each side has different perspectives and different needs. In the end, for the team to be effective, everyone needs to buy in. Get devs to know ops. Get management to know their engineers. These are the first steps toward a painless post mortem, or at least toward post mortems that don’t hurt as much.

OnPage is a cloud-based incident alerting and management platform that elevates notifications on your smartphone so they continue to alert until read. Incidents can be routed to the person on call and escalated if they are not attended to promptly. Schedule a demonstration today!


Published by

OnPage Corporation

Tags: DevOps, Network Monitoring Links, post mortem, RCA

9 years ago
