
The secret to blameless post mortems

How your engineering teams can move past finger-pointing and manage mistakes effectively

Sidney Dekker’s ‘bad apple’ theory holds that complex systems would be fine were it not for the erratic behavior of some unreliable people. According to this view, when unexpected events occur in an otherwise safe system, they are typically, and conveniently, attributed to “human error” or, when severe, to “operator carelessness.” Similarly, post mortems often look to define and parcel out blame to engineers. This raises the question of how effective post mortems can be if their only purpose is to assign blame. Instead, effective post mortems need to “acknowledge the human tendency to blame, to allow for a productive form of its expression, and constantly refocus the postmortem’s attention past it.”

Post mortems vs retrospectives

The problem with post mortems begins with the name “post mortem,” which, if you ask me, sounds more than a bit macabre. But the term is worth keeping, because it marks a necessary distinction from the “retrospective.” Retrospectives are routine and often take place at the end of a sprint or in a weekly meeting. They are planned conversations that allow for a positive exchange of ideas and thoughts about projects. Retrospectives ask questions like:

  • How did the release or deployment go?
  • What should we stop doing?
  • What should we start doing?
  • What metrics do we need to focus on?

Post mortems, on the other hand, will (ideally):

  • Take place within 24 hours of a significant failure, such as a major piece of code failing or a failed deployment
  • Not focus on blaming one individual
  • Focus on root-cause analysis (RCA)
  • Bring together the key stakeholders in the failure, including management
  • Document the issues and define actionable points (one possible shape for such a record is sketched below)
  • Create tickets for the changes so that they actually get made
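
To make this concrete, here is a minimal sketch of what a post mortem record might look like as a data structure. The field names and the sample incident are illustrative assumptions, not a standard format.

```python
# Minimal sketch of a post mortem record mirroring the list above.
# Field names and sample values are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    owner: str
    ticket_id: str | None = None  # filled in once a ticket is created

@dataclass
class PostMortem:
    incident: str
    date: str                      # ISO date of the failure
    root_cause: str                # outcome of the root-cause analysis
    stakeholders: list[str]        # key people involved, including management
    action_items: list[ActionItem] = field(default_factory=list)

# Hypothetical example of a completed record
pm = PostMortem(
    incident="Failed deployment of release 2.4",
    date="2024-11-02",
    root_cause="Config drift between staging and production",
    stakeholders=["dev lead", "ops lead", "engineering manager"],
    action_items=[ActionItem("Automate config validation", owner="ops team")],
)
```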

The problem is that while retrospectives are often actionable and productive team exercises, post mortems are frequently demoralizing and devolve into a blame game. This is particularly true in DevOps cultures where there is a wall between Devs and Ops, and issues on one side get blamed on the other. But the problem of walls holds for any engineering company where groups are siloed. Furthermore, much of the valuable information that comes out of a post mortem on one side is never shared with the other side, much less with the business stakeholders.

Given these realities, how can an engineer get a wider view? How can a post mortem work to help everyone in the company?

Post mortems: Why do bad things happen?

Engineers make trade-offs and don’t always realize the impact of their actions. Systems are complicated, and a code problem might not surface until the software is deployed alongside six other pieces of software. One could argue that putting Ops and Devs on teams together would solve some of these issues; see our blog for further discussion of that topic.

Often, teams run at 100% capacity without any room for a buffer, which reduces the opportunity to work methodically. Managers don’t focus on managing their engineers’ workload or on keeping commitments in line with the team’s measured velocity. Instead, teams are managed by higher-ups who lack visibility into the team’s constraints and its ability to take on new work. A simple capacity check is sketched below.
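
As a rough illustration of managing to velocity, the sketch below checks whether a proposed sprint commitment leaves slack for incidents. The 20% buffer and the sample point values are assumptions for illustration, not recommendations.

```python
# Minimal sketch: sanity-check a sprint commitment against measured velocity.
# The 20% buffer and the sample numbers are illustrative assumptions.

def can_commit(measured_velocity: float, committed_points: float,
               buffer_ratio: float = 0.2) -> bool:
    """Return True if the commitment leaves the desired slack."""
    usable_capacity = measured_velocity * (1 - buffer_ratio)
    return committed_points <= usable_capacity

# A team averaging 40 points per sprint should commit no more than 32.
print(can_commit(40, 36))  # False: overcommitted, no buffer left
print(can_commit(40, 30))  # True: room left to absorb incidents
```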

Treat blame as a speed bump

The goal of blamelessness is not so much to avoid seeing where blame lies as to recognize failure, understand the root causes, and move on. In the case of a software bug causing a deployment failure, the following steps would ideally follow:

  • Use OnPage’s Enterprise console to get excellent insights into where alerts happened and how they can be fine-tuned so they are more actionable
  • Ensure the post mortem is constructive and doesn’t seek to divvy up blame as if it were a pie
  • Engineers should feel comfortable giving a detailed account in the post mortem “without fear of punishment or retribution.” If engineers, or any individual for that matter, see the focus turn to blame, they will shut down
  • Have Devs and Ops talk to one another, and break down the walls between them
  • Involve stakeholders, form a task force, and build empathy; recognize that the failure is painful on both sides
  • Feed post mortem findings back upstream so that concrete actions are taken. Document them in an accessible place, and make sure management sees them, for example at a lunchtime seminar
  • Follow up on post mortems: create tickets and check their status at the end of the month (a minimal follow-up sketch appears after this list)
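
To illustrate that last step, here is a minimal sketch of a monthly follow-up check. It assumes a generic REST issue tracker; the URL, the “postmortem-action” label, and the JSON response shape are assumptions for illustration, not any real product’s API.

```python
# Minimal sketch: flag post mortem action items still open after a month.
# The tracker URL, label name, and JSON shape are assumptions about a
# generic REST issue tracker, not a specific product.
from datetime import datetime, timedelta, timezone

import requests

TRACKER_URL = "https://tracker.example.com/api/issues"  # hypothetical endpoint

def stale_action_items(max_age_days: int = 30) -> list[dict]:
    """Return open post mortem tickets created more than max_age_days ago."""
    resp = requests.get(
        TRACKER_URL,
        params={"label": "postmortem-action", "state": "open"},
        timeout=10,
    )
    resp.raise_for_status()
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [
        issue for issue in resp.json()
        # assumes ISO-8601 timestamps with UTC offsets in the response
        if datetime.fromisoformat(issue["created_at"]) < cutoff
    ]

if __name__ == "__main__":
    for issue in stale_action_items():
        print(f"Still open after a month: {issue['id']} - {issue['title']}")
```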

Management must understand what’s in it for them: it’s not just a matter of solving an engineering problem. When the engineering problem is solved, the product and the company become more productive. Strong leadership is a must for effective post mortems.

Conclusion

Management and engineers need to see the pain points felt by the other side. Each group needs to invite the other into ‘their house,’ and the other will return the invitation. Each side has different perspectives and different needs. In the end, for the team to be effective, everyone needs to buy in. Get Devs to know Ops. Get management to know their engineers. These are the beginning steps toward painless post mortems, or at least post mortems that don’t hurt as much.

OnPage is a cloud-based incident alerting and management platform that elevates notifications on your smartphone so they continue to alert until read. Incidents can be routed to the person on call and escalated if they are not attended to promptly. Schedule a demonstration today!
