Bringing monitoring into the DevOps world means alerts can fire 24/7, and with them comes alert fatigue. Monitoring needs alerts in order to be effective, but while our technology runs 24/7, humans cannot. Clearly, 24/7 alerting needs to be better calibrated to human physiological realities in order to avoid alert fatigue.
The remedy, then, is to implement an IT alerting system that differentiates high-priority alerts and allows for messaging with attachments.
In the traditional IT and DevOps setup, email is the main way of reporting issues such as deployment problems or server problems. If software fails to deploy correctly, an email goes to a designated engineer. Similarly, if a server experiences a power surge, an email is sent.
Tools such as Nagios or SolarWinds are wonderful at identifying critical events in the monitoring life-cycle. However, if they are configured to send emails to the group or to a pager, they are like a Ferrari stuck in rush hour traffic, unable to move at their true potential.
When these tools are not configured correctly, priorities are unclear, meaningless alerts are sent, and engineers are woken up in the middle of the night for no reason.
Often, individuals are alerted for non-critical events or events for which no action is required. While it might seem like it’s better to get a false positive than to miss an alert, there is definitely a cost to receiving too many false positives.
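One way to picture the distinction between alerts that deserve an interruption and alerts that do not is a small triage rule. The sketch below is illustrative only: the `Alert` fields, the `Severity` levels, and the channel names ("page", "ticket", "digest") are assumptions made for this example, not the behavior of any particular monitoring product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Alert:
    source: str        # the check that fired (hypothetical name)
    message: str
    severity: Severity
    actionable: bool   # does a human need to do something right now?


def delivery_channel(alert: Alert) -> str:
    """Pick a delivery channel; the channel names are hypothetical."""
    if alert.severity == Severity.CRITICAL and alert.actionable:
        return "page"    # interrupt the on-call engineer
    if alert.actionable:
        return "ticket"  # handle during working hours
    return "digest"      # summarize later, never interrupt
```

Under this rule, only an alert that is both critical and actionable wakes anyone up; everything else waits for working hours or a summary.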
As Twitter engineer Caitie McCaffrey noted in her recent Monitorama conference talk, “when alerts are more often false than true, the on-call’s sense of urgency in responding to alerts is diminished…the simple burden of alerts desensitizes the on-call to alerts”. This desensitization inevitably has a significant negative impact on customer satisfaction.
The scope of alert fatigue comes into sharper focus when its impacts are examined. First, traditional alerting workflows are often poorly calibrated and don’t alert the correct person. In effect, the wrong people are alerted to a situation and woken up for no reason.
Think of a doctor who is constantly sleep deprived and you have a pretty good image of the effects of imperfect alerting. You get tired engineers making poor diagnoses, missing alerts, or developing pathological work behaviors that are detrimental to the team as a whole.
Second, and this is often not discussed in examinations of alert fatigue, is the cost of engineers having to constantly juggle and refocus, which costs the company in lost efficiency. Engineers, like most professionals, work best when they can focus. Constantly juggling priorities is detrimental to the bottom line.
Third, and this was alluded to earlier, is the “crying wolf syndrome”. Engineers will ignore warnings once they have been alerted too many times to meaningless events. These false positives are a ticking time bomb, as it is only a matter of time before a truly critical alert is ignored.
Simply hiring more people to handle the problem won’t end well either. The new hires will also be stressed by false alarms and sleep deprivation for the time they are on call. Hiring does not address the fundamental issues, which at their core result from human error rather than technical malfunction.
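Alerting the correct person can be as simple as looking up who owns the affected service and who is on call for that team this week. The following is a minimal sketch under stated assumptions: the service names, team names, engineer names, and the weekly rotation scheme are all hypothetical and stand in for whatever schedule a real team maintains.

```python
from datetime import datetime

# Hypothetical data: which team owns each service, and each team's rotation.
OWNERS = {"web-server": "infrastructure", "deploy-pipeline": "release"}
ROTATIONS = {
    "infrastructure": ["alice", "bob"],
    "release": ["carol", "dave"],
}


def on_call_for(service: str, when: datetime) -> str:
    """Return the single engineer who should be alerted for this service."""
    team = OWNERS.get(service)
    if team is None:
        raise KeyError(f"no owner registered for {service!r}")
    rotation = ROTATIONS[team]
    # Assumed scheme: rotate weekly, using the ISO week number as the index.
    week = when.isocalendar()[1]
    return rotation[week % len(rotation)]
```

With a lookup like this, a server problem reaches one infrastructure engineer instead of an entire mailing list, and nobody else is woken up.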
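One way to spot the rules that cry wolf is to review past alerts and flag any rule that was false more often than true. The sketch below assumes a hypothetical history of `(rule_name, was_actionable)` pairs recorded after each alert; the 0.5 threshold mirrors the "more often false than true" test and is otherwise an arbitrary choice.

```python
from collections import Counter


def noisy_rules(history, threshold=0.5):
    """Return rule names whose false-positive share exceeds the threshold.

    history: iterable of (rule_name, was_actionable) pairs (hypothetical log).
    """
    fired = Counter()
    false_alarms = Counter()
    for rule, was_actionable in history:
        fired[rule] += 1
        if not was_actionable:
            false_alarms[rule] += 1
    return sorted(
        rule for rule in fired
        if false_alarms[rule] / fired[rule] > threshold
    )
```

Rules that show up in this list are candidates for retuning or removal before they train the team to ignore them.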
However, there are some tried and true ways to overcome the significant issue of alert fatigue and take positive steps toward a happy workforce.
By following these 7 steps, you will decrease overall stress levels and you will increase the happiness of the team.
On-call is only as good as the tools generating the alerts, so invest in good tools. As a caveat, on-call is also only as good as the tool delivering the information. If an alert is delivered by email, SMS, or ping, it’s not very audible. Every critical alert wants to be heard. OnPage gives it a voice.
Schedule a demo to learn how OnPage can help you better manage your alert fatigue issues.