In IT Ops, monitors send alerts when thresholds are crossed. Unfortunately, these thresholds are responsible for much of the excessive noise. In fact, organizations waste an average of $1.27 million every year responding to the noise of false alerts. Today, it’s no longer enough to simply get alerts. Ops teams already get far too many of them, and the steady stream is training them to ignore alerts altogether. Ops teams need better incident alert management.
The goal of this blog post is to highlight ways to eliminate noise from monitoring and, as a result, enhance the effectiveness of your alerting. Following these practices will help you manage your critical events effectively.
Use Analytics to Learn Normal Behavior
Effective alerting means effective monitoring. It means you are alerting on the right metrics at the right intensity. You don’t alert on events which are not actionable and you don’t alert on events which are redundant. You alert on IT events that have meaning. The ultimate goal of alerts is to raise awareness of underlying code or infrastructure problems.
Too many events and alerts, many of them false positives, will reduce the effectiveness of IT operations, because your team will start to overlook the important ones. Consequently, it is important to learn which statistics are worth tracking. Is it MySQL availability, aborted connections or error logs? Know which ones are important for your organization and alert on them.
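One simple way to learn normal behavior is to compare each new reading against the recent history of the same metric, rather than against a fixed threshold. The sketch below is only an illustration: the function name `is_anomalous`, the cutoff of three standard deviations and the sample values are all assumptions, not taken from any particular monitoring tool.

```python
from statistics import mean, stdev

def is_anomalous(history, value, k=3.0):
    """Return True if value is more than k standard deviations from the mean of history.

    history: recent readings of one metric (hypothetical sample data below).
    k: illustrative cutoff; tune it for your own environment.
    """
    if len(history) < 2:
        return False  # not enough data yet to say what "normal" looks like
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # a perfectly flat history: any change is unusual
    return abs(value - mu) > k * sigma

# Hypothetical counts of aborted MySQL connections per minute.
aborted_connections = [4, 6, 5, 7, 5, 6, 4, 5]
print(is_anomalous(aborted_connections, 6))   # within the usual range
print(is_anomalous(aborted_connections, 40))  # far outside the usual range
```

A reading that sits inside the usual range stays quiet, while a sudden jump stands out, even though nobody picked a hard-coded threshold in advance.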
Minimize monitoring surface area
In the world of IT monitoring, there is the data you are able to alert on and the data you would like to alert on. Not every item that you can monitor is important or useful. Additionally, not every item you want to monitor is available to you. As such, it is important to find the sweet spot where these two circles intersect.
Part of the goal of effective monitoring is to minimize the surface area of what you alert on, because duplicate alerts and repetitive signals decrease the quality of your alerts. Your goal is to eliminate alerts on everything except root causes. If you are alerting on resource limits, then the main issue to alert on is an event such as percent of maximum connections available. Events such as open file handles on table caches are subsidiary to the main concern.
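The idea of alerting only on root causes can be sketched as a small suppression table: when a root-cause alert is firing, the alerts it explains are dropped. The mapping `SUBSIDIARY`, the alert names and the function `root_causes_only` are hypothetical and exist only to illustrate the principle.

```python
# Hypothetical mapping from a root-cause alert to the symptom alerts it explains.
SUBSIDIARY = {
    "max_connections_pct": {"open_file_handles"},
}

def root_causes_only(firing):
    """Drop alerts that are explained by a root-cause alert that is also firing."""
    firing = set(firing)
    suppressed = set()
    for root, symptoms in SUBSIDIARY.items():
        if root in firing:
            suppressed |= symptoms
    return sorted(firing - suppressed)

print(root_causes_only(["open_file_handles", "max_connections_pct"]))
print(root_causes_only(["open_file_handles"]))
```

Note that a symptom alert still gets through when its root cause is not firing, so nothing is silently lost.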
When you look through your monitoring alerts, you might also find that you have outright duplicates. These could be alerts that are left over from tests and replicate the functionality of other alerts. If you don’t eliminate them, your team will be inundated with alert noise. Again, your goal is to minimize the alerting surface area, and eliminating duplicates will help you reach it.
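If your alert rules can be exported as a list, a short script can point out the likely duplicates. The sketch below assumes each rule is a dictionary with `name`, `metric`, `operator` and `threshold` fields. That layout is invented for this example and will differ from tool to tool.

```python
from collections import defaultdict

def find_duplicate_rules(rules):
    """Group rule names that watch the same metric with the same condition."""
    groups = defaultdict(list)
    for rule in rules:
        key = (rule["metric"], rule["operator"], rule["threshold"])
        groups[key].append(rule["name"])
    # Only groups with more than one rule are duplicates worth reviewing.
    return [names for names in groups.values() if len(names) > 1]

# Hypothetical exported rules, including one left over from a test.
rules = [
    {"name": "cpu-high", "metric": "cpu_pct", "operator": ">", "threshold": 90},
    {"name": "cpu-high-test", "metric": "cpu_pct", "operator": ">", "threshold": 90},
    {"name": "disk-full", "metric": "disk_pct", "operator": ">", "threshold": 95},
]
print(find_duplicate_rules(rules))
```

The script only reports candidates. Someone who knows the system should still decide which copy to keep.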
Bring in rich alerting (Low vs. high)
Not every heartburn is a heart attack. Similarly, not every alert is high priority and requires a 2 am wake-up call. Instead, monitoring should be programmed to send different alerts based on different scenarios. Equally important, the alerting tool you use should support different types of alerts with different ring tones or different levels of persistence.
For example, a major issue with the server room should trigger a high-priority alert. It only needs to go to the one person or team responsible for investigating the issue, but it should arrive immediately and be marked as a problem that needs to be fixed now. Alerting should also come with escalation so that if the first person receiving the alert is unable to answer the call of duty, there is another person who receives the notification.
By contrast, a low priority alert such as CPU usage at 65% should be investigated, but the investigation does not need to happen immediately and definitely does not need to wake anyone up at 2 am. These sorts of low priority alerts should be set so that they don’t get sent outside of a certain time range.
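Escalation can be pictured as walking down an ordered list of people until one of them acknowledges the alert. The following is a minimal sketch under that assumption. The function `notify_with_escalation` and the names in the chain are made up, and a real tool would also wait for a timeout between steps.

```python
def notify_with_escalation(chain, acknowledged_by):
    """Return the people notified, in order, stopping at the first who acknowledges.

    chain: ordered list of responders (hypothetical names below).
    acknowledged_by: set of responders who would acknowledge the alert.
    """
    notified = []
    for person in chain:
        notified.append(person)
        if person in acknowledged_by:
            break  # someone has picked it up, so stop escalating
    return notified

# The first responder does not answer, so the alert moves on to the second.
print(notify_with_escalation(["alice", "bob", "carol"], {"bob"}))
```

If nobody acknowledges, everyone in the chain ends up notified, which is the behavior you want for a high-priority alert.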
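The split between high and low priority alerts can be expressed as a small routing rule: high priority always notifies, low priority only notifies inside a set time window. The function `should_notify_now` and the 9:00 to 18:00 window are assumptions chosen for illustration.

```python
from datetime import time

def should_notify_now(priority, now, start=time(9, 0), end=time(18, 0)):
    """Decide whether an alert should notify someone right now.

    High priority alerts always go out. Low priority alerts only go out
    between start and end (an illustrative 9:00 to 18:00 window).
    """
    if priority == "high":
        return True
    return start <= now < end

print(should_notify_now("high", time(2, 0)))   # server room issue at 2 am
print(should_notify_now("low", time(2, 0)))    # CPU at 65% at 2 am
print(should_notify_now("low", time(10, 30)))  # CPU at 65% mid-morning
```

Low priority alerts that arrive outside the window are not thrown away in a real setup. They would typically be held and delivered once the window opens.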
CONCLUSION – Alert on what matters
The fewer alerts you receive, the happier you will be. This is particularly true if many of the alerts are the result of noise in the system or duplicate alerting. To ensure effective alerting, you need not only a robust monitoring tool but also a robust alerting tool that enables you to respond effectively to situations that arise in IT.
Pagers just don’t have the robustness required to live up to the demands of effective alerting. You need to bring in a tool that will alert you in an intelligent manner and allow you to take control of the situation quickly and efficiently.
To read about more solutions for improving IT alert management, download our whitepaper.