Sidney Dekker’s theory on ‘bad apples’ holds that complex systems think they would be fine if it were not for the erratic behavior of some unreliable people. According to this theory, when unexpected events are seen in an otherwise safe system, they are typically and conveniently assigned to “human error” and when they are severe to “operator carelessness”.. Similarly, post mortems often look to define and parcel out blame to engineers. Yet it begs the question of how effective the post mortems are if their only purpose is to assign blame. Instead, effective post mortems needs to “acknowledge the human tendency to blame, to allow for a productive form of its expression, and constantly refocus the postmortem’s attention past it.”
The problem with post mortems begins with its name “post mortem”, which if you ask me sounds more than a bit macabre. But the distinction is necessary, as it needs to be defined against a “retrospective”. Retrospectives are routine and often take place at the end of a sprint or in a weekly meeting. They are planned conversations that allow for a positive exchange of ideas and thoughts about projects. Retrospectives will ask questions like:
Post mortems, on the other hand, will (ideally):
The problem is that while retrospectives are often actionable and productive team exercises, post mortems are frequently demoralizing and end up being akin to a blame game. Particularly in the DevOps culture where there is often a wall between Devs and Ops, issues on one side get blamed on the other. But this problem of walls is true of any engineering company where groups are siloed. Furthermore, much valuable information that comes out of a post mortem on one side is not shared with the other side, much less with the business stake holders.
Given these realities, how can an engineer get a wider view? How can a post mortem work to help everyone in the company?
Engineers make trade-offs and don’t always realize the impact of their actions. Systems are complicated and a code problem might not be realized until the software is deployed along with 6 other pieces of software. One could argue that by putting Ops and Devs into teams together, some of these issues would be solved. See our blog on this topic for further discussion of the topic.
Often, teams run at 100% capacity without any room for a buffer. As such, the opportunity to work methodically is reduced. Managers don’t focus on managing the work load of their engineers or on making sure they have the ability at the team’s mathematical velocity. Instead, teams are managed by higher ups who don’t have visibility into the team’s constraints and ability to take on new work.
The goal of blameless is to not as much to avoid seeing where blame lies, but rather recognizing failure, understanding the root causes and moving on. In the case of a software bug causing a deployment failure, the following steps would ideally follow:
Management must understand what’s in it for them. It’s not just a matter of solving an engineering problem. When the engineering problem is solved, the product and company can be more productive. Strong leadership is a must in effective post mortems.
Management and engineers need to see the pain points felt by the other side. Each group needs to invite the other into ‘their house’. Then they’ll invite you into their house. Each side has different perspectives and different needs. In the end, for the team to be effective, everyone needs to buy in. Get devs to know ops. Get management to know their engineers. These will be the beginning steps of a painless post mortem or at least post mortems that don’t hurt as much.
OnPage is cloud-based incident alerting and management platform that elevates notifications on your smartphone so they continue to alert until read. Incidents can be programmed to arrive to the person on-call and can be escalated if they are not attended to promptly. Schedule a demonstration today!
Site Reliability Engineer’s Guide to Black Friday It’s gotten to the point where Black Friday…
Cloud engineers have become a vital part of many organizations – orchestrating cloud services to…
Organizations across the globe are seeing rapid growth in the technologies they use every day.…
How Effective Are Your Alerting Rules? Recently, I came across this Reddit post highlighting the…
What Are Large Language Models? Large language models are algorithms designed to understand, generate, and…
Recognition highlights OnPage's commitment to advancing healthcare communication through new integrations and platform upgrades. Waltham,…