Guide to: Facilitating Equitable On-Call Rotations
When required to work after hours, IT teams can face burnout, decreased productivity, and lower job satisfaction. It is therefore management’s responsibility to make on-call rotations as equitable as possible and improve their engineers’ work-life balance.
An environment with a healthy work-life balance improves job satisfaction, allowing engineers to accomplish their daily tasks diligently without carrying burnout over from their previous on-call shifts.
Unfortunately, fostering equitable on-call management is not always intuitive, so this guide provides you with essential practices that will enable your team to facilitate equitable on-call rotations.
What is an On-Call Rotation?
Seamless service delivery has become the industry standard, so clients will not tolerate prolonged outages or unexpected delays. With that, the need for on-call management cannot be overstated. However, recognizing the importance of on-call management is just the first step.
Teams must establish an effective and equitable on-call rotation that will maintain the satisfaction and productivity of their engineers, and further ensure continuous service operations.
Many aspects of on-call rotations vary from company to company, including:
- Rotation Schedule
An on-call rotation is a scheduling practice where teams take turns being available after hours to tend to unexpected outages, threats, or system failures.
Engineers are assigned on-call shifts: specified lengths of time outside of normal business hours during which they must be available to respond promptly to critical incidents.
Teams can employ several types of on-call rotations, depending on the company’s size, its industry, or an employee’s role. Deciding which is best for your team is important for distributing on-call workloads equitably.
- Triaging
When dealing with incidents, on-call engineers must be able to categorize them correctly based on impact and severity. This allows engineers to prioritize further: high-priority incidents must be resolved immediately, and low-priority incidents can wait until normal business hours.
It is up to the discretion of an organization to determine which incidents are considered high-priority. And, for effective incident management, there must be a shared understanding amongst the on-call team about what types of incidents require prompt attention, to avoid confusion and maintain consistent reliability.
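The shared understanding described above can be written down as a small lookup that every responder applies the same way. The sketch below is one possible policy, not a standard: the three-level scale, the choice to multiply impact by severity, and the thresholds are all assumptions made for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for both impact and severity."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(impact: Level, severity: Level) -> str:
    """Combine impact and severity into a priority label.

    The thresholds are illustrative; each organization sets its own.
    """
    score = int(impact) * int(severity)
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


def respond_now(impact: Level, severity: Level) -> bool:
    """Only high-priority incidents interrupt the engineer after hours."""
    return priority(impact, severity) == "high"
```

Under these assumed thresholds, an incident with high impact and medium severity is handled immediately, while one with low impact and medium severity waits until business hours.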
- Escalation
When identifying the impact and severity of an incident, on-call engineers must know when to get subject matter experts (SMEs) or management involved. By having a clear idea of when to escalate incidents, teams can ensure that issues are resolved both efficiently and effectively.
Escalation procedures vary between companies based on how they categorize incidents or structure their hierarchy. With escalation procedures that align with a company’s specific goals, teams can ensure that the right individuals are mobilized to critical incidents as soon as possible.
- Resolution
When a high-priority incident is identified, on-call engineers are responsible for resolving the issue. The goal is to immediately restore normal business operations and mitigate the impact of the critical incident.
Each organization may have different protocols on how to resolve an incident after hours. So it is important that they establish a response plan that enables on-call engineers to accurately tend to an issue in accordance with an organization’s policies.
- Post-Incident Reviews
After an incident occurs, whether it is after hours or during normal business hours, a post-incident review must be conducted. Post-incident reviews allow teams to analyze an incident, to identify the origin and mitigate the possibility of recurrences.
A post-incident review can be conducted in many ways, including meetings, workshops, or written reports. Teams should therefore standardize their approach so that every review shares the same objective and enhances the team’s knowledge base.
Who Owns On-Call?
There are many reasons why an organization may employ an on-call rotation, meaning that there is not just one role that is responsible for after hours maintenance. Teams may rely on multiple on-call engineers who are assigned different systems or duties to monitor.
Some of the roles that may typically have an on-call rotation include:
- IT Engineers
IT engineers manage the IT infrastructure and have a well-developed understanding of the systems. So oftentimes they are included in the on-call rotation and are responsible for addressing issues that arise within critical systems after hours.
With their advanced expertise and their ability to diagnose and resolve critical issues in important systems, it is crucial that there are IT engineers available 24/7, to ensure seamless service operations.
- Help Desk/Support Staff
A lot of the time clients require a company’s service outside of normal business hours, and if they run into a service issue they do not want to wait until the morning to get it fixed. If teams do not have an on-call support staff to handle client issues, their reputation may be tarnished considering that continuous service delivery is expected.
Support staff is on an on-call rotation so that they can immediately resolve client specific issues and ensure their satisfaction even after hours.
- Security Analysts
Security analysts ensure that there are no vulnerabilities within critical systems that could put the security of the systems, data, or the organization at risk. They are knowledgeable about cybersecurity threats and possess the ability to identify and promptly respond to security incidents.
Cyberattacks are not limited to office hours, so security analysts must be available 24/7 to mitigate potential damage to an organization’s infrastructure or the loss of sensitive data.
- DevOps Engineers
DevOps engineers must ensure that the software that they are responsible for is reliable and available at all times. With an advanced understanding of their systems and software, they can quickly resolve critical issues that may occur after hours.
It is important that DevOps engineers are on an on-call rotation so that they can prevent further service interruptions for clients.
- Site Reliability Engineers
There are many incidents that could affect system performance after hours and it is the site reliability engineers’ job to maintain the continuous delivery of optimal services.
So, they are a part of the on-call rotation to quickly diagnose and resolve issues, ultimately ensuring the reliability of critical systems and services.
What are the Responsibilities for On-Call Engineers?
On-call engineers have multiple responsibilities that go beyond being available after hours. It is important that before their first on-call shift, engineers are aware of all the on-call procedures and company policies.
So, it is essential that these responsibilities are clearly outlined by management, to ensure smooth after hours operations:
- Incident Response
One of the top priorities for on-call engineers is incident response, and while that is rather intuitive, it is essential to maintaining continuous service delivery after hours.
When responders receive an alert, they must categorize and prioritize it if this is not done automatically. The main goal of incident response is to resolve an incident as quickly as possible to prevent prolonged damage.
So, if they consider the incident high-priority, it is their responsibility to follow the response plan and do all that they can to resolve the critical incident.
- Diagnosis
On-call staff is also responsible for diagnosing incidents that occur after hours. They must analyze current and historical data in order to identify the origin of the incident.
By determining the cause, IT teams can detect system vulnerabilities, resolve the incident effectively, and avoid recurrences. They must also document their findings to enhance the knowledge base and inform future incident resolution.
- Documentation
Along with documenting the diagnosis, on-call staff must have a way to document incidents, and what they have done to mitigate their effects. There are many ways to record this information, so it is crucial that the on-call staff is trained on how and where your organization documents this data.
This is an essential practice that aids both handoffs and future incident management. If there is an on-call shift change, it is important that the primary on-call individual relays accurate and contextual information to the next person in the on-call rotation. Exchanging this information allows for a smooth rotation that enables the prompt and effective resolution of incidents that roll over between shifts.
- Communication and Collaboration
Sometimes, an incident requires the on-call engineer to involve SMEs who can better handle it, so the engineer must be able to communicate incidents quickly to ensure prompt resolution.
There must be one centralized place where communication takes place, to ensure the seamless delivery of critical information and avoid missed alerts and miscommunication. It is also essential that the communication method facilitate collaboration, and allow for the exchange of ideas and improved solutions.
- Improving the Knowledge Base
The importance of a robust knowledge base has been sprinkled throughout this section. It is a crucial aspect of incident management that is frequently used by on-call staff.
On-call engineers are constantly recording, analyzing and documenting data during after hours incidents. While these records help hold these engineers accountable and help with current issues, that is not their only purpose. With well-documented information, teams can analyze historical data to improve the reliability of systems, services and security. By looking back at incidents that frequently recur or systems that have begun to fail at an increased rate, the process of fixing vulnerabilities and avoiding damages is simplified.
Types of On-Call Rotations
Evaluating the requirements and needs of your organization and the on-call staff is essential when establishing an on-call rotation. Depending on factors such as company size, office hours, and the service provided, there are multiple types of on-call rotation schedules that can be employed.
Some of the common types of on-call rotations are:
- Primary and Secondary On-Call Schedules
A primary and secondary on-call schedule can be used alongside any of the other schedule types. By providing a backup plan, it ensures that critical incidents are never missed.
With this type of schedule, there are proactive plans in place to alert progressive tiers of responders on the off chance that the primary on-call engineer misses the alert. Providing an extra layer of support is a crucial practice that lets teams rest assured that incidents will always receive a response.
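The tiered alerting described above can be sketched as a loop over responders, each with an acknowledgement timeout. The `Tier` type, the `escalate` function, the responder names, and the timeouts are hypothetical, and `acknowledges` stands in for whatever mechanism a real paging tool uses to detect an acknowledgement.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tier:
    responder: str            # who is paged at this tier
    ack_timeout_minutes: int  # how long to wait before escalating


def escalate(
    tiers: Sequence[Tier],
    acknowledges: Callable[[str], bool],
) -> Tuple[Optional[str], int]:
    """Page each tier in order until someone acknowledges.

    Returns the responder who acknowledged (or None if nobody did) and
    the minutes spent waiting on earlier tiers before that page went out.
    """
    waited = 0
    for tier in tiers:
        if acknowledges(tier.responder):
            return tier.responder, waited
        waited += tier.ack_timeout_minutes
    return None, waited


tiers = [Tier("primary-engineer", 5), Tier("secondary-engineer", 10)]
# Simulate the primary sleeping through the page.
owner, delay = escalate(tiers, lambda who: who == "secondary-engineer")
print(owner, delay)
```

In this simulated run the secondary engineer takes the incident after the primary's five-minute window expires. If no tier acknowledges, the function returns `None`, which a team would typically treat as a signal to notify management.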
- Inverse Schedule on an Escalation Policy
This schedule type goes hand-in-hand with the previous type. It ensures the equal distribution of the after hours workload, while still enabling a cushion for error.
When creating an inverse schedule, the primary and secondary teams alternate, so that one team is not constantly being alerted first. These schedules ensure the escalation of missed alerts while simultaneously making for a more equitable on-call rotation.
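A minimal sketch of the alternation, assuming the two teams swap roles weekly (the guide does not fix a period, so the weekly cadence, the function name, and the team labels are illustrative):

```python
from typing import Dict


def roles_for_week(week_number: int, team_a: str, team_b: str) -> Dict[str, str]:
    """Swap primary and secondary each week so neither team is always first."""
    if week_number % 2 == 0:
        return {"primary": team_a, "secondary": team_b}
    return {"primary": team_b, "secondary": team_a}
```

Calling `roles_for_week(0, "Team A", "Team B")` puts Team A first, and `roles_for_week(1, "Team A", "Team B")` reverses the order, so missed alerts still escalate while the first-alert burden is shared.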
- Follow-the-Sun Schedules
The follow-the-sun schedule is used by distributed teams with locations across various time zones, allowing them to provide 24/7 on-call coverage without scheduling engineers after hours. Engineers are scheduled for on-call shifts during their local office hours and are responsible for critical incidents across the company during that time.
This is incredibly effective in creating a better work-life balance for your team, because engineers do not have to be ready to respond after hours and can rest fully before their next daytime shift. If this is not possible, the other options can still create equity and maintain employee satisfaction.
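One way to picture follow-the-sun routing is as a table of UTC windows, one per site, that together cover the whole day. The site names and hours below are invented for illustration; a real table would be derived from each office's actual working hours.

```python
from datetime import datetime, timezone

# Hypothetical sites and the UTC hours each one covers (start inclusive,
# end exclusive). Together the windows span all 24 hours.
COVERAGE_UTC = [
    ("APAC", 22, 6),   # wraps past midnight UTC
    ("EMEA", 6, 14),
    ("AMER", 14, 22),
]


def site_on_call(moment: datetime) -> str:
    """Return the site whose coverage window contains the given moment.

    `moment` must be timezone-aware so it can be converted to UTC.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for site, start, end in COVERAGE_UTC:
        if start < end:
            covered = start <= hour < end
        else:  # window wraps around midnight
            covered = hour >= start or hour < end
        if covered:
            return site
    raise LookupError(f"no site covers {hour:02d}:00 UTC")
```

Because every hour falls inside exactly one window in this table, an incident at any time is routed to a team that is in its working day.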
- Bi-Weekly Schedules
Similar to the inverse schedule, bi-weekly schedules alternate on-call duties to temporarily alleviate the workload of the on-call staff. This schedule puts specific team members on-call every other week.
There are, of course, other ways that alternating schedules can be designed including alternating weekends or months. When designing an alternating schedule, it is important that management evaluates the organization’s needs and creates a schedule accordingly.
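Alternating schedules of this kind reduce to a round-robin over a roster. The sketch below assumes week-long shifts by default and uses hypothetical names; changing `shift_days` gives other alternation periods.

```python
from datetime import date
from typing import Sequence


def on_call_engineer(
    day: date,
    roster: Sequence[str],
    first_shift: date,
    shift_days: int = 7,
) -> str:
    """Return who is on call on `day` in a round-robin rotation.

    `first_shift` is the date the first person on the roster started.
    """
    shifts_elapsed = (day - first_shift).days // shift_days
    return roster[shifts_elapsed % len(roster)]
```

With a two-person roster and the default seven-day shift, each engineer is on call every other week.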
- Expert That is Always On-Call
This type of schedule must be employed with another type of on-call rotation to ensure that the SME does not face burnout.
With this schedule, someone who is very knowledgeable about the organization’s systems and services would always be on-call, specifically to deal with high-level incidents. Low-level incidents that occur more frequently would be routed to other on-call engineers who are on an on-call rotation, ensuring that the SME maintains a fair workload.
Obstacles that Stem from Ineffective On-Call Management
So far, we have discussed the logistics of on-call management, but without the proper tools and practices, on-call rotations can cause frustrations and inefficiencies. It is essential that these obstacles are evaluated, so that the after hours schedule is effective and equitable.
Some of the challenges to look out for when deploying on-call management include:
- Alert Fatigue
Oftentimes when there is not an automated way to prioritize alerts, engineers experience alert fatigue – mental exhaustion that comes from constantly receiving unactionable alerts.
Engineers who work after hours typically sleep until they receive an incident alert. A lot of the time these alerts are unactionable or low-priority and do not require immediate response. It is unacceptable that on-call staff must be constantly woken up for alerts that they cannot tend to.
Another issue that stems from this is that fatigued engineers may begin putting off alerts and accidentally miss a high-priority one.
Ultimately, equipping your staff with an advanced alerting system that prioritizes alerts reduces alert fatigue and enhances on-call management.
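The prioritization described above amounts to a routing rule applied before anyone is woken up. This sketch is an assumption about how such a rule might look, not a description of any particular product; the `Alert` fields, the four outcomes, and the 9:00 to 17:00 business day are all illustrative.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Alert:
    summary: str
    priority: str     # "high" or "low" in this sketch
    actionable: bool  # can the responder do anything about it?


def route(
    alert: Alert,
    now: time,
    day_start: time = time(9, 0),
    day_end: time = time(17, 0),
) -> str:
    """Decide what happens to an alert; only "page" wakes anyone up."""
    if not alert.actionable:
        return "log_only"
    if alert.priority == "high":
        return "page"
    in_business_hours = day_start <= now < day_end
    return "notify" if in_business_hours else "queue"
```

Under this rule, a low-priority alert at 3:00 is queued for the morning instead of interrupting the engineer's sleep, while a high-priority alert pages at any hour.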
- Poor Work-Life Balance
Work-life balance is a crucial aspect of all careers that directly affects job satisfaction and productivity. On-call staff need to have time off where they can solely focus on their at-home life and not be worried about responding to incidents.
Sometimes, there is not a formal on-call rotation in place and the same few individuals are constantly responding to incidents. These individuals are at risk of being unsatisfied with their career, because they feel overworked, resulting in decreased productivity and quality of work.
Without a structured approach there is no guarantee that the on-call duties are equally distributed, so it is important to evaluate the effectiveness of your on-call management.
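One simple way to check distribution is to compare each engineer's share of after hours responses with an even share. The sketch below assumes you can export a list of who responded to each incident; the function names and the 1.5x threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def load_ratios(roster: Sequence[str], responders: Sequence[str]) -> Dict[str, float]:
    """Return each engineer's response count divided by an even share.

    A ratio of 1.0 means a fair share; 2.0 means twice the fair share.
    """
    counts = Counter(responders)
    total = sum(counts[name] for name in roster)
    if total == 0:
        return {name: 0.0 for name in roster}
    fair_share = total / len(roster)
    return {name: counts[name] / fair_share for name in roster}


def overloaded(ratios: Dict[str, float], threshold: float = 1.5) -> List[str]:
    """List engineers carrying at least `threshold` times a fair share."""
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)
```

If one engineer answered six of nine incidents on a three-person roster, their ratio is 2.0 and they appear in the `overloaded` list, which is a prompt to revisit the rotation.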
- Missed Alerts
When an incident occurs in the middle of the night, teams must be sure that the communication method is capable of waking up the on-call staff. Email and text messages are easy to set up, but many individuals are desensitized to common smartphone noises or leave their phones on Do Not Disturb. Ultimately, this increases the number of missed alerts, leaving your organization’s systems and security at risk.
It is important to discuss what methods of communication would be effective with your team and create a plan that will ensure that they never miss a critical alert.
- Heavy Workloads
If you do not work after hours, it is easy to overlook the amount of work that it actually is. When a high-priority incident occurs at night, the on-call responder has to tend to it, and still show up in the morning to accomplish their daily duties.
If this becomes routine and on-call workers begin to lose sleep, it puts a strain on them. Management must be aware of this, employ equitable on-call rotations, and allow days off or late starts after stressful after hours incidents to reduce the fatigue of on-call staff.
What Tools are Required for Equitable On-Call Management?
There are many tools that are required when managing on-call duties that ensure seamless business operations throughout the night. These tools also offer ways to effectively alleviate some of the on-call duties from engineers.
These tools include:
- Company Devices or Accounts
When deploying these tools, it is essential that all of the on-call staff is equipped with the resources to effectively respond to critical incidents before their first shift. This can include login information to alerting tools, access to documentation and reports, and ways to contact management or SMEs.
- Alerting Tools
Alerting tools are among the most important tools required for on-call management. Engineers must be effectively alerted to critical incidents, and without a robust tool that provides the following functions, it is incredibly difficult for on-call engineers to stay on top of them:
- Prioritized Alerts
- Loud, Audible Alerts that Wake Up Engineers
- Escalation Policies
- Monitoring Tools
Monitoring tools analyze the health and security of systems and can detect irregularities that may indicate vulnerabilities. They integrate with alerting tools, so that when they do detect an irregularity, an alert is immediately delivered to the on-call engineer, mobilizing them into action.
- Communication and Collaboration Tools
As mentioned previously, teams need an effective way to communicate with SMEs or management during a complex, critical incident. So, it is crucial that there is an effective system in place that allows them to securely deliver contextual messages about the situation to the right individuals.
On-Call Management Best Practices
Along with the right tools, there are many recommended approaches that teams should practice to improve their on-call management. These practices will enhance not only productivity but also the satisfaction of the on-call staff:
- Transparency
It is essential for teams to practice transparency to avoid any future issues or mistakes. On-call engineers need to be able to communicate when something went wrong or a problem is too big for them to fix. By encouraging this type of communication, engineers are more motivated to take risks and effectively remediate situations without the fear of being punished.
Overall, this practice reduces situations in which engineers feel the need to keep mistakes to themselves, which can lead to bigger problems in the infrastructure.
- Consistent, Equal On-Call Rotations
Choosing the right on-call rotation for your team is one of the best ways to ensure the equitable distribution of on-call duties. If the on-call staff seems to burn out quickly or have reduced productivity during normal business hours, it may be time to change the on-call rotation schedule to something more consistent that caters to your team’s needs. This will improve your team’s overall satisfaction and facilitate equal on-call shifts.
- Analyze KPIs/Metrics
It is also important to have a routine check on the effectiveness of your on-call management approach. This can be done by analyzing metrics like mean time to respond or mean time to acknowledge. With these metrics you can analyze both how well your team is responding to incidents after hours, as well as how effective your incident management tools are.
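These metrics are averages over incident timestamps. The sketch below computes mean time to acknowledge, plus a second mean measured from trigger to resolution. Definitions of "mean time to respond" differ between teams, so treat the second function as an assumption and substitute whichever timestamp your team counts as the response. The `Incident` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class Incident:
    triggered: datetime     # when the alert fired
    acknowledged: datetime  # when a responder acknowledged it
    resolved: datetime      # when service was restored


def _mean(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_acknowledge(incidents: Sequence[Incident]) -> timedelta:
    return _mean([i.acknowledged - i.triggered for i in incidents])


def mean_time_to_resolve(incidents: Sequence[Incident]) -> timedelta:
    return _mean([i.resolved - i.triggered for i in incidents])
```

Tracking these values per rotation or per month makes it easier to see whether a schedule or tooling change actually improved after hours response.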
Facilitating Equitable On-Call Rotations with OnPage
OnPage is a game-changing tool that empowers teams to effectively manage on-call rotations and significantly improve incident alerting. Some of OnPage’s features that enhance on-call management are:
- Prioritized Alerts
With OnPage, teams can prioritize alerts based on specific thresholds, ensuring that only high-priority incidents notify team members after hours. This reduces the effects of alert fatigue and ensures that critical incidents rise above the clutter of numerous unactionable alerts and smartphone notifications.
- Loud, Audible Alerts that Wake Up Engineers
When a high-priority alert is sent, OnPage produces a loud, distinguishable alert that bypasses Do Not Disturb on smartphones. This ensures that engineers are immediately mobilized when they hear an OnPage alert, allowing them to promptly resolve critical incidents.
- Escalation Policies
OnPage also automates the escalation process and will escalate alerts to the subsequent on-call engineer if an alert is not acknowledged by the primary responder. This is an essential feature to ensure that there is quick action taken when a critical incident arises.
- Role-Based Messaging
There is also a digital scheduler that helps to ensure that the right people are always notified of an incident. This is true for both automatic and manual alerting and messaging. If an alert is triggered by a monitoring system, it will automatically be routed to the on-call engineer. And, if that engineer needs to escalate the incident to an on-call SME, they can simply message the on-call SME without having to look up their specific contact information, allowing for a seamless handoff that enhances after hours operations.
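The general idea of addressing a role instead of a person can be illustrated with a schedule lookup. This is a generic sketch and does not represent OnPage's interface; the `Shift` type, the role names, and the function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Shift:
    role: str        # for example "on-call engineer" or "on-call SME"
    person: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def on_call_for(role: str, moment: datetime, shifts: Sequence[Shift]) -> Optional[str]:
    """Resolve a role to whoever holds it at `moment`, or None if unstaffed."""
    for shift in shifts:
        if shift.role == role and shift.start <= moment < shift.end:
            return shift.person
    return None
```

The sender only needs to know the role; the schedule determines which individual receives the message at that moment.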