OnPage’s 3 Steps to Mastering IT On-Call Scheduling

Almost half of all technology professionals experience on-call as an integral part of their job. The typical IT on-call schedule often means a 2 a.m. wake-up call that turns out to be a false alarm or an issue the engineer can do little about. These sleep interruptions and tensions inevitably lead to alert fatigue, which is considered the #1 pain point for traditional IT teams and modern DevOps engineers alike.

Previous guides have failed to focus on the salient issues that need to be addressed in order to move the conversation forward. As such, OnPage is putting forth the following to highlight the issues that need to be discussed and provide solutions to help improve life on call.

The goal of this blog is to:

  • note what has impeded us from reaching effective life on-call
  • provide 3 steps to mastering life on-call
  • highlight what will be achieved with effective life on-call

Issues impeding effective life on call

Email

Email remains the number one channel through which people learn about problems. However, it is the worst way to learn about an issue. Email often gets buried under many other messages, so it gives the recipient no sense of immediacy. Furthermore, there is no easy way to separate communications about a particular incident within an email channel.

Alert Noise

As more technologies are added to the IT stack, the number of items being monitored grows rapidly. The resulting volume of alerts is often referred to as ‘alert hell’, and it is only going to increase in the future. In fact, large IT organizations can receive up to 150,000 alerts per day from their monitoring systems. It is impossible for teams to respond to that many alerts.

Inefficient Communication

When you cannot reach engineers or colleagues and don’t know who is on-call, your ability to resolve problems decreases drastically. Not having the tools to exchange information quickly is also a significant problem. When on-call engineers do have effective communication tools at their fingertips, they are much more productive in managing their on-call shifts and solving problems quickly.

Improving IT on-call scheduling

More than limiting the number of alerts to the on-call team, the goal of on-call is to limit disruption to the end customer. To this end, a pageable alert should only fire when action must be taken. Anything that falls outside that context is a ticket.
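The page-versus-ticket rule can be sketched in a few lines of Python. This is only an illustration: the `Alert` class, its `action_required` field, and the `route` and `split_alerts` helpers are hypothetical names, not part of any OnPage product or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record used only for this illustration."""
    name: str
    action_required: bool  # assumption: set by your monitoring rules


def route(alert: Alert) -> str:
    """Return "page" only when someone must act; otherwise "ticket"."""
    return "page" if alert.action_required else "ticket"


def split_alerts(alerts):
    """Separate a batch of alerts into pages and tickets."""
    pages = [a for a in alerts if route(a) == "page"]
    tickets = [a for a in alerts if route(a) == "ticket"]
    return pages, tickets
```

In practice, deciding whether `action_required` is true is the hard part. The sketch only shows that everything else should land in a ticket queue rather than wake someone up.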

Step 1: Create a fair on-call schedule

Use group schedules to make sure everyone gets a turn at bat. Rotations are key in this regard, as they ensure everyone is put on-call at some point during a normal schedule. Moreover, a fair schedule promotes the sense that no one group is being picked on or forced to work more hours than any other.
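One simple way to get a fair rotation is round-robin assignment, where each engineer takes a shift in turn. The sketch below assumes fixed-length shifts and a flat list of engineers. The function name `build_rotation` and its parameters are hypothetical, chosen only for this example.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_days=7):
    """Assign consecutive shifts to engineers in round-robin order.

    Returns a list of (shift_start_date, engineer) tuples.
    """
    if not engineers:
        raise ValueError("need at least one engineer")
    people = cycle(engineers)  # repeats the list endlessly, in order
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        schedule.append((shift_start, next(people)))
    return schedule


# Example: three engineers (made-up names) sharing six weekly shifts.
rotation = build_rotation(["ana", "ben", "chi"], date(2024, 1, 1), shifts=6)
```

When the number of shifts is a multiple of the team size, everyone carries the same load. Real schedules usually also need to handle holidays and shift swaps, which this sketch leaves out.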

Step 2: Make sure alerts are persistent

How many times has someone on your team said they didn’t respond to the alert because they didn’t hear it? Most alerting technologies notify engineers via SMS or email and don’t provide persistent alerting if the engineer is temporarily out of range.

Make sure you are using a tool that avoids these problems by creating persistent alerts that will be heard. Additionally, make sure an alert that was missed will still sound when the engineer comes back into range.
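The idea of a persistent alert can be sketched as a loop that keeps re-sending a notification until someone acknowledges it. This is a minimal illustration and not how any particular alerting product works. The callbacks `notify` and `is_acknowledged`, and the parameters `max_attempts` and `interval_seconds`, are assumptions you would wire up to your own tooling.

```python
import time


def alert_until_acknowledged(notify, is_acknowledged, max_attempts=5,
                             interval_seconds=60.0, sleep=time.sleep):
    """Re-send an alert until it is acknowledged or attempts run out.

    notify(attempt) sends one notification; is_acknowledged() reports
    whether the engineer has responded. Returns True if acknowledged.
    """
    for attempt in range(1, max_attempts + 1):
        notify(attempt)
        sleep(interval_seconds)  # give the engineer time to respond
        if is_acknowledged():
            return True
    return False  # nobody responded within max_attempts
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting. A `False` result tells the caller that the alert was never acknowledged, so it can decide what to do next.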

Step 3: Messaging for efficient communications

Make sure the on-call communication tools you use enable communication between engineers. That is, make sure your team has a tool that supports both alerting and critical communications. Engineers should be able to message fellow engineers as well as groups.

Ideally, your messaging platform will also integrate with widely used industry tools such as Slack. From Slack, for example, engineers could alert individuals to significant events that need their colleagues’ input.
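As a rough illustration of what a Slack integration involves, the sketch below prepares a message for a generic Slack incoming webhook, which accepts a JSON body with a `text` field. It does not show OnPage’s own integration. The helper name `build_slack_request` and the webhook URL are placeholders, and the request is only built here, not sent.

```python
import json
import urllib.request

# Placeholder only: a real incoming-webhook URL comes from your Slack workspace.
WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/PLACEHOLDER"


def build_slack_request(incident, message, webhook_url=WEBHOOK_URL):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": f"[{incident}] {message}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_slack_request("INC-123", "Database failover needs a second opinion")
# To actually send it: urllib.request.urlopen(request)
```

Prefixing each message with an incident identifier (here the made-up `INC-123`) keeps the conversation about one incident easy to find, which is exactly what email makes difficult.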

Conclusion

Life on-call doesn’t need to remind everyone of a Stephen King horror novel. Instead, with adequate forethought, life on call can actually be manageable and lead to a decrease in alert fatigue.

Want to read 4 more steps to improve on-call scheduling? Download our whitepaper.

Published by
OnPage Corporation
