Almost half of all technology professionals experience on-call as an integral part of their job. The typical IT on-call schedule often means a 2 a.m. wake-up call that turns out to be a false alarm or an issue the engineer can do little about. These sleep interruptions and tensions inevitably lead to alert fatigue, which is considered the #1 pain point for both traditional IT teams and modern DevOps engineers.
Previous guides have failed to focus on the salient issues that need to be addressed in order to move the conversation forward. As such, OnPage is putting forth the following to highlight the issues that need to be discussed and provide solutions to help improve life on call.
The goal of this blog is to:
Email remains the number one channel through which people learn about problems. However, it is the worst way to learn about an issue. Email often gets buried under many other messages, so it gives the recipient no sense of immediacy. Furthermore, there is no easy way to separate the communications about a particular incident in an email channel.
Alert Noise
As more technologies get added to the IT stack, the number of items being monitored is increasing vastly. The resulting flood of alerts is often referred to as ‘alert hell’, and it is only going to grow. In fact, large IT organizations can receive up to 150,000 alerts per day from their monitoring systems. No team can realistically respond to that many alerts.
Inefficient Communication
When you cannot reach engineers or colleagues and don’t know who is on-call, your ability to resolve problems decreases drastically. Not having the tools to exchange information quickly is also a significant problem. When on-call engineers do have effective communication tools at their fingertips, they are much more productive in managing their on-call shifts and solving problems quickly.
More than limiting the number of alerts to the on-call team, the goal of on-call is to limit disruption to the end customer. To this end, a pageable alert is only fired when action must be taken. Anything that does not require immediate action becomes a ticket.
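The page-or-ticket rule can be sketched in a few lines of Python. This is only an illustration: the `Alert` class, its `action_required` field, and the `route` function are hypothetical names, not part of any particular alerting product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; the field names are assumptions."""
    name: str
    action_required: bool  # True if someone must act right now


def route(alert: Alert) -> str:
    """Return "page" only when action must be taken; otherwise "ticket"."""
    if alert.action_required:
        return "page"
    return "ticket"


# Example: only the first alert would wake someone up.
outage = Alert(name="checkout API down", action_required=True)
disk_trend = Alert(name="disk usage trending up", action_required=False)
print(route(outage), route(disk_trend))
```

In practice, deciding what counts as "action required" is the hard part; the point of the sketch is that everything else is routed to a queue rather than to a person's phone.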
Step 1: Create a fair on-call schedule
Use group schedules to make sure everyone gets a chance at bat. Rotations are key in this regard as they ensure everyone is put on-call at some point during a normal schedule. Moreover, a fair schedule will promote the sense that no one group is being picked on or forced to work more hours than any other.
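As a rough illustration of a fair rotation, the sketch below assigns shifts round-robin so that each engineer comes up equally often. The function name `build_rotation`, the one-week default shift length, and the engineer names are assumptions made for the example.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_days=7):
    """Assign consecutive shifts to engineers in round-robin order.

    Returns a list of (shift_start_date, engineer) tuples.
    """
    if not engineers:
        raise ValueError("at least one engineer is required")
    rotation = cycle(engineers)  # repeats the list endlessly, in order
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        schedule.append((shift_start, next(rotation)))
    return schedule


# Example: three engineers sharing six weekly shifts.
for shift_start, engineer in build_rotation(["ana", "ben", "chi"], date(2024, 1, 1), 6):
    print(shift_start.isoformat(), engineer)
```

A real schedule would also need to handle vacations, time zones, and swaps, which this sketch deliberately leaves out.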
Step 2: Make sure alerts are persistent
How many times has someone on your team said they didn’t respond to an alert because they didn’t hear it? Most alerting technologies notify engineers via SMS or email and don’t provide persistent alerting if the engineer is temporarily out of range.
Make sure you are using a tool that avoids these problems by creating persistent alerts that keep sounding until they are heard, including when the engineer comes back into range.
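One way to picture persistent alerting is a loop that keeps re-sending the alert until it is acknowledged. The sketch below is a simplified assumption of how such a loop might look, not how any specific tool implements it; `send` and `is_acknowledged` are hypothetical callbacks supplied by the caller.

```python
import time
from typing import Callable


def alert_until_acknowledged(
    send: Callable[[], None],
    is_acknowledged: Callable[[], bool],
    max_attempts: int = 5,
    interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-send an alert until it is acknowledged or attempts run out.

    Returns True if the alert was acknowledged, False otherwise
    (at which point a caller might escalate to someone else).
    """
    for _ in range(max_attempts):
        send()                   # deliver (or re-deliver) the alert
        sleep(interval_seconds)  # give the engineer time to respond
        if is_acknowledged():
            return True
    return False
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays. A `False` result is the natural hook for escalating to the next person on the schedule.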
Step 3: Messaging for efficient communications
Make sure the on-call communication tools you use let engineers talk to each other. That is, choose a tool that supports both alerting and critical communications. Engineers should be able to message fellow engineers as well as groups.
Ideally, your messaging platform will also integrate with widely used industry tools such as Slack. From Slack, for example, engineers could alert individuals to significant events that need their colleagues’ input.
Life on-call doesn’t need to remind everyone of a Stephen King horror novel. With adequate forethought, life on-call can be manageable and alert fatigue can be reduced.
Want to read 4 more steps to improve on-call scheduling? Download our whitepaper.