Being on-call doesn’t have to mean you’re always tired
Monitoring in the DevOps world means alerts can fire 24/7, and with them comes alert fatigue. Monitoring needs alerts in order to be effective, but while our technology runs 24/7, humans cannot. Even if engineers do attempt to push at the margins and be on-call longer and later, there are considerable health, psychological and work-related effects. Even with on-call schedules, burnout remains a real risk. There are also significant financial implications for companies if they have stressed, unhappy and sleep-deprived engineers. For example, engineers who are feeling the stress of alert fatigue are likely to leave for greener pastures, taking their knowledge with them and leaving their employers needing to rehire, which can cost as much as 30% of the individual’s salary. Clearly, 24/7 alerts need to be better calibrated with human physiological realities in order to avoid alert fatigue.
Alert fatigue in DevOps
In a traditional IT and DevOps setup, email is the main channel for reporting issues such as deployment problems or server problems. If software fails to deploy correctly, an email goes to a designated engineer. Similarly, if a server experiences a power surge, an email is sent. Nagios and SolarWinds are wonderful monitoring tools that can identify critical events in the monitoring life-cycle. However, if they are configured to send emails to the group or to a pager, they are like a Ferrari stuck in rush hour traffic, unable to move at their true potential. When they are not configured correctly, priorities are unclear, meaningless alerts are sent and engineers are woken up in the middle of the night for no reason.
Often, individuals are alerted for non-critical events or events for which no action is required. While it might seem like it’s better to get a false positive than to miss an alert, there is definitely a cost to receiving too many false positives. As Twitter engineer Caitie McCaffrey noted in her recent Monitorama conference talk, “when alerts are more often false than true, the on-call’s sense of urgency in responding to alerts is diminished…the simple burden of alerts desensitizes the on-call to alerts”. This desensitizing inevitably has a significant negative impact on customer satisfaction.
Impact of desensitizing
The scope of alert fatigue is brought into further focus when the impacts are examined. First, traditional alerting workflows are often poorly calibrated and don’t alert the correct person. So, in effect, the wrong people are alerted to a situation and woken up for no reason. Think of a doctor who is constantly sleep deprived and you have a pretty good image of what the effects of imperfect alerting are. You get tired engineers making poor diagnoses, missing alerts or developing pathological work behaviors that are detrimental to the team as a whole. MTTR increases as issues take longer to resolve.
Second, and this is often not discussed in examinations of alert fatigue, is the cost of engineers constantly having to juggle and refocus, which costs the company in lost efficiency. Engineers, like most professionals, work best when they can focus. Constantly juggling priorities is detrimental to the bottom line.
Third, and this was alluded to earlier, is the “crying wolf syndrome”. Engineers will ignore warnings when they have been alerted too many times to meaningless alerts. These false positives are a ticking time bomb as it is only a matter of time before a truly critical alert is ignored.
Ways to avoid fatigue
Simply hiring more people to handle the problem won’t end well either. The new hires will also be stressed by false alarms and sleep deprivation for the time they are on call. Hiring does not address the fundamental issues, which at their core result from human error in how alerting is set up rather than from technical malfunction. There will still be false alarms, poor routing of issues and poor handling of problems. SLAs will not be met through more hires. More hires will just make the problem more expensive.
However, there are some tried and true ways to overcome the significant issue of alert fatigue and take positive steps towards a happy workforce.
(1) Avoid email. Send alerts through proper incident management tools that route critical alerts to the right individual.
(2) Alert through a priority messaging application like OnPage which enables engineers to prioritize and act like team players rather than like solo warriors.
(3) Provide context for the problem or issue so that the issue is actionable.
(4) Make sure you have runbooks so engineers don’t need to reinvent the wheel when issues arise. This will decrease MTTR.
(5) Alert the right person and make it loud. Proper scheduling will ensure that the person who can do the most to correct the problem is alerted.
(6) Have escalation procedures so that on-call individuals can escalate critical issues when they need assistance. Support ChatOps through tools like Slack so the team can communicate during an incident.
(7) Work smarter by holding postmortems.
By following these seven steps, you will decrease overall stress levels and increase the happiness of the team.
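To make a few of these steps concrete, here is a minimal Python sketch of priority-based routing with escalation, touching on steps (1), (3), (4), (5) and (6). Everything in it is hypothetical: the Alert fields, the severity names, the 15-minute escalation window and the route function are assumptions made for illustration, not the API of OnPage, Nagios or any other tool.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical severity labels; real monitoring tools define their own.
CRITICAL = "critical"
WARNING = "warning"
INFO = "info"


@dataclass
class Alert:
    source: str        # which monitoring tool raised the alert
    severity: str      # one of the labels above
    summary: str       # context that makes the alert actionable, step (3)
    runbook: str = ""  # where the fix is documented, step (4)


def route(alert: Alert, chain: List[str], minutes_unacked: int = 0,
          escalate_after: int = 15) -> Tuple[str, str]:
    """Return (channel, recipient) for an alert.

    chain is the on-call escalation order, primary first.
    """
    if not chain:
        raise ValueError("escalation chain must not be empty")
    if alert.severity != CRITICAL:
        # Non-critical events go to a digest; nobody is woken up.
        return ("digest", "team")
    # Each full window without acknowledgement moves one step up the
    # chain, stopping at the last person listed.
    level = min(minutes_unacked // escalate_after, len(chain) - 1)
    return ("page", chain[level])


chain = ["primary-engineer", "secondary-engineer"]
disk = Alert("nagios", CRITICAL, "db01 disk 98% full", "runbooks/disk-full")

print(route(disk, chain))                      # ('page', 'primary-engineer')
print(route(disk, chain, minutes_unacked=20))  # ('page', 'secondary-engineer')
print(route(Alert("nagios", INFO, "nightly backup finished"), chain))  # ('digest', 'team')
```

The point of the sketch is the split: only critical alerts page a person, and an unacknowledged page moves up the chain instead of repeating to the same tired engineer.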
OnPage gives alerts a voice – like Adele
On-call is only as good as the tools generating the alerts, so invest in good tools. As a caveat, on-call is also only as good as the tool delivering the information. If that alert is being delivered by email, SMS or ping, it’s not very audible. Every critical alert wants to be heard. OnPage gives it a voice.