7 steps to creating an actionable IT on-call schedule

How to make sure IT on-call works for you

I spent a bit of time on Reddit the other day and found it interesting just how many posts focused on IT on-call and on-call scheduling. Some posts were rants about horrible customers – who hasn’t had some of those? Some actually described positive interactions from being on-call, though those posts were rare. But many engineers in DevOps and IT posted about their trepidation about being on-call. They wondered:

  • What is the best way for my team to create an after-hours schedule?
  • How do I ensure I wake up if I am alerted?
  • Should my growing on-call team use a cell phone and hand it off between rotations?
  • How do I manage being on-call and then having to show up at eight a.m. the next morning?
  • Is it reasonable to expect on-call duty 24/7?

The answers to these questions don’t need to cause trepidation, though. While after-hours assignments can be anxiety-producing, having the right tools and management goes a long way toward creating reasonable expectations and outcomes.

Why on-call is necessary for all

If I were to ask you why on-call is necessary, you might think me a bit of a dunce – go ahead, I’ve been called worse. Isn’t it obvious that it’s needed to answer customer questions about the product? Duh?!

But the truth is that answering customer product questions is not the only reason on-call exists. In the realm of product development, it’s a necessary pursuit. You cannot develop a product effectively without testing its resilience. And you cannot know the product’s resilience unless you put it in front of your customers, allow them to test it, and let them call you when it breaks.

Additionally, after-hours rotations allow Dev, Ops and the rest of your IT team to see how well the product or setup they have created is working. Many people I have spoken to in the DevOps world call this ‘eating your own dogfood.’ Yuck. The phrase is meant to illustrate that no one in the IT family can simply create their perceived technical masterpiece and walk away. Instead, they need to take responsibility for their creation. Being part of the after-hours family helps ensure this level of responsibility.

Traditional problems with IT alerting

Beyond the burden of being on-call itself, there are many issues with how alerting is handled. Often, issues come in after hours and lack context. These sorts of problems come in many flavors. For example:

  • A call comes in but the engineer cannot escalate the issue if they need to
  • There’s a hand-off of a customer problem from regular hours to after-hours and the issue gets muddled because there’s no audit trail on the alert
  • During ‘sleeping hours,’ alerts are not sufficiently persistent to get engineers out of bed
  • Poor management of on-call and alerting causes engineer burnout

A much better approach than handling alerts ad hoc is to create an actual schedule with a dedicated tool designed to handle effective alerting, auditing and messaging. A tool like OnPage can address these on-call issues as well as many of the worries engineers have about being on-call.

Improving life on-call

Effective management of after-hours assignments needs to be premeditated. That is, the process needs to be thought through and cannot be ad hoc. While most DevOps and IT teams have a schedule, many haven’t thought through the whole process. Instead, teams should create on-call schedules that:

  • Enable escalation. For example, you cannot expect one person to be on-call 24/7 without having an escalation procedure. Everyone needs a backup if they cannot attend to a call. People have lives and stuff happens. So, make sure there’s an escalation procedure. OnPage’s tool has strong escalation capabilities for issues in this realm.
  • Provide time off after being on-call overnight. When a team member has been actively on-call overnight, it is only fair to give that person a reasonable amount of time off before they show up to work again.
  • Rotate equitably. Make sure all your team members take a turn on-call. Create a schedule that rotates through the team members equitably.
  • Include runbooks (defined procedures). When your on-call engineer is alerted in the middle of the night, help them out by having runbooks available that provide solutions to problems that have cropped up in the past. This is really helpful when the engineer is woken up at 2 a.m. and their thinking is somewhat clouded.
  • Include prominent and persistent alerts. OnPage provides persistent alerting that will continue for up to 8 hours until answered. Also, there’s no chance of sleeping through the OnPage alerts as they are really designed to wake you up.
  • Ensure audit trails to help with hand-offs. Provide an audit trail for alerts so it is clear who on the team is working on an existing issue. Audit trails also provide context for MTTR and help your team keep track of metrics.
  • Are based on a shared app. Ensure your team has an alerting app on their smartphones so there is no need to physically hand off pagers. With a smartphone application like OnPage, scheduling is much easier, as is ensuring a response from the right person every time.
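
To make the rotation and escalation points above concrete, here is a minimal Python sketch of an equitable weekly rotation in which every shift has a backup. It is only an illustration: the names (Shift, build_rotation, who_to_page), the weekly shift length and the 15-minute escalation threshold are all assumptions made for this example, and none of it reflects how OnPage or any other tool actually works.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Shift:
    start: date    # first day of the shift
    primary: str   # engineer paged first
    backup: str    # engineer paged if the primary does not respond


def build_rotation(engineers, first_day, weeks):
    """Rotate weekly through the team so shifts are shared equitably.

    The backup is simply the next engineer in the list, so every
    primary has someone to escalate to.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers to have a backup")
    shifts = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        backup = engineers[(week + 1) % len(engineers)]
        shifts.append(Shift(first_day + timedelta(weeks=week), primary, backup))
    return shifts


def who_to_page(shift, minutes_unacknowledged, escalate_after=15):
    """Return the backup once an alert has gone unacknowledged for
    `escalate_after` minutes (an assumed threshold), else the primary."""
    if minutes_unacknowledged >= escalate_after:
        return shift.backup
    return shift.primary
```

For a three-person team, build_rotation(["ana", "ben", "chi"], date(2024, 1, 1), 4) puts each engineer on call once before the cycle repeats, and who_to_page switches from the primary to the backup once the alert has sat unacknowledged for the threshold you choose.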

Conclusion

While IT on-call might cause trepidation initially, the time spent planning will definitely pay dividends. Again, use a scheduling tool that will allow your team to work effectively together and more like a, well…, team.

OnPage is an excellent tool for managing and improving life during after hours. Learn how OnPage can help you and your team better manage alerting. Schedule a demo with OnPage today.

Published by OnPage Corporation