Improve IT alert management

Reduce IT alert noise and improve productivity

In IT Ops, monitors send alerts when thresholds are crossed. Unfortunately, poorly chosen thresholds are responsible for much of the excess noise. In fact, organizations waste an average of $1.27 million every year responding to false alerts. Today, it’s no longer enough to simply get alerts. Ops teams already get far too many of them, and the steady stream is training them to ignore alerts altogether. Ops teams need better incident alert management.

The goal of this post is to highlight ways to eliminate noise from monitoring and, as a result, make your alerting more effective. Following these practices will help you manage your critical events effectively.

Use Analytics to Learn Normal Behavior

Effective alerting means effective monitoring. It means you are alerting on the right metrics at the right intensity. You don’t alert on events which are not actionable and you don’t alert on events which are redundant. You alert on IT events that have meaning. The ultimate goal of alerts is to raise awareness of underlying code or infrastructure problems.

Too many events and alerts (false positives) reduce the effectiveness of IT operations and cause you to overlook the important ones. Consequently, it is important to learn which statistics matter. Is it MySQL availability, aborted connections or error logs? Know which ones are important for your organization and alert on them.
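One way to let analytics learn normal behavior is to derive the alert threshold from recent history instead of hard-coding it. The sketch below is only an illustration: the function name `is_anomalous`, the three-standard-deviation rule and the sample values are all assumptions, not something a particular monitoring tool prescribes.

```python
from statistics import mean, stdev


def is_anomalous(history, value, k=3.0):
    """Return True when value sits more than k standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough data to learn a baseline yet
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return value != baseline
    return abs(value - baseline) > k * spread


# Hypothetical samples: aborted MySQL connections per minute.
aborted_connections = [4, 5, 3, 6, 5, 4, 5, 6, 4, 5]

print(is_anomalous(aborted_connections, 6))   # within the normal range
print(is_anomalous(aborted_connections, 40))  # far outside the normal range
```

A learned baseline like this adapts to each system, so a value that is routine on one server does not trigger the same alert that it would on a quieter one.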

Minimize monitoring surface area

In the world of IT monitoring, there are data points which are available to alert on and data points which are desirable to alert on. Not every item that you can monitor is important or useful. Additionally, not every item you want to monitor is available to you. As such, it is important to find the sweet spot where these two circles intersect.

Part of the goal of effective monitoring is to minimize the surface area of what you alert on. Duplicate alerts and repetitive signals decrease the quality of your alerts. Your goal is to eliminate alerts on everything except root causes. If you are alerting on resource limits, for example, then alert on the main issue, such as the percent of maximum connections available. Events such as open file handles or table caches are subsidiary to the main concern.
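As a rough sketch of alerting only on root causes, you can record which checks are subsidiary to which main issue and drop the subsidiary alerts whenever the main one is already firing. The check names and the `ROOT_CAUSE_OF` mapping below are hypothetical and simply mirror the resource-limit example above.

```python
# Hypothetical mapping: subsidiary check -> the root-cause check it depends on.
ROOT_CAUSE_OF = {
    "open_file_handles": "max_connections_pct",
    "table_cache_usage": "max_connections_pct",
}


def actionable_alerts(firing):
    """Drop every alert whose root cause is also firing; keep everything else in order."""
    firing_set = set(firing)
    return [name for name in firing if ROOT_CAUSE_OF.get(name) not in firing_set]


print(actionable_alerts(["max_connections_pct", "open_file_handles", "table_cache_usage"]))
print(actionable_alerts(["open_file_handles"]))
```

Note that a subsidiary alert still gets through when its root cause is not firing, so you do not lose the signal entirely.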

By looking through your monitoring alerts, you might also find that you actually have duplicate alerts. These could be alerts that are left over from tests and replicate the functionality of other alerts. If you don’t eliminate these repetitive alerts, you will receive a lot of alert noise that inundates your team. Again, your goal is to minimize the alerting surface area. Eliminating duplicates will help in reaching this goal.
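Finding duplicates can be as simple as grouping alerts by what they are about. The sketch below assumes each alert is a dictionary with `host` and `check` fields, which is an illustrative shape rather than the format of any specific tool.

```python
def deduplicate(alerts):
    """Keep the first alert for each (host, check) pair and count how often it repeated."""
    unique = {}
    for alert in alerts:
        key = (alert["host"], alert["check"])
        if key in unique:
            unique[key]["count"] += 1
        else:
            # Copy the alert so the caller's data is left untouched.
            unique[key] = {**alert, "count": 1}
    return list(unique.values())


# Hypothetical alerts: the second entry duplicates the first.
incoming = [
    {"host": "db01", "check": "mysql_availability"},
    {"host": "db01", "check": "mysql_availability"},
    {"host": "db02", "check": "error_log"},
]

for alert in deduplicate(incoming):
    print(alert["host"], alert["check"], alert["count"])
```

A repeat count above one is also a useful hint that a leftover test alert may be replicating another alert and can be retired.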

Bring in rich alerting (Low vs. high)

Not every heartburn is a heart attack. Similarly, not every alert is high priority and requires a 2 am wake-up call. Instead, monitoring should be configured to send different alerts for different scenarios. Equally important, the alerting tool you use should support different types of alerts with different ring tones or different levels of persistence.

For example, a major issue with the server room should trigger a high-priority alert. It can be sent to just the one person or team responsible for investigating the problem, but it should arrive as an immediate, high-priority alert that demands a fix. Alerting should also come with escalation so that if the first person receiving the alert is unable to respond, another person receives the notification.
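Escalation can be pictured as walking an ordered on-call chain until someone acknowledges the alert. The following is a minimal simulation under stated assumptions: the names in the chain are made up, and a real tool would wait for a timeout between steps rather than checking a set.

```python
def escalate(chain, acknowledged_by):
    """Notify each person in order and stop at the first one who acknowledges.

    Returns the list of people who were notified.
    """
    notified = []
    for person in chain:
        notified.append(person)
        if person in acknowledged_by:
            break  # someone has taken ownership, so stop escalating
    return notified


# Hypothetical on-call chain for a server room alert.
on_call_chain = ["primary_engineer", "secondary_engineer", "team_lead"]

# The primary is unavailable, so the alert escalates to the secondary.
print(escalate(on_call_chain, acknowledged_by={"secondary_engineer"}))
```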

Alternatively, a low-priority alert such as CPU usage at 65% should be investigated but does not require immediate action, and it definitely does not need to wake anyone up at 2 am. These sorts of low-priority alerts should be configured so that they are only sent within a certain time range.
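The low versus high distinction can be sketched as a small routing rule: high-priority alerts always page, while low-priority alerts are delivered only inside an allowed window and held otherwise. The 8:00 to 18:00 window and the action names `page`, `notify` and `hold` are assumptions chosen for illustration.

```python
from datetime import time

# Assumed delivery window for low-priority alerts.
LOW_PRIORITY_WINDOW = (time(8, 0), time(18, 0))


def route(priority, now):
    """Decide what to do with an alert given its priority and the current time of day."""
    if priority == "high":
        return "page"  # always interrupt, even at 2 am
    start, end = LOW_PRIORITY_WINDOW
    if start <= now <= end:
        return "notify"  # deliver quietly during the allowed window
    return "hold"  # keep it until the window opens again


print(route("high", time(2, 0)))
print(route("low", time(2, 0)))
print(route("low", time(10, 30)))
```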

CONCLUSION – Alert on what matters

The fewer alerts you receive, the happier you will be. This is particularly true if many of the alerts are the result of noise in the system or duplicate alerting. What you need to ensure effective alerting is not only a robust monitoring tool but also a robust alerting tool that enables you to respond effectively to situations that arise in IT.

Pagers just don’t have the robustness required to meet the demands of effective alerting. You need a tool that will alert you in an intelligent manner and allow you to take control of the situation quickly and efficiently.

To read more solutions for improving IT alert management, download our whitepaper.

OnPage Corporation

Published by

OnPage Corporation

Tags: IT management

8 years ago
