Improve IT alert management

Reduce IT alert noise and improve productivity

In IT Ops, monitors send alerts when thresholds are crossed. Unfortunately, these thresholds are responsible for much of the excessive noise. In fact, organizations waste an average of $1.27 million every year responding to false alerts. Today, it’s no longer enough to simply get alerts; Ops teams already get far too many of them, and the steady stream is training them to ignore alerts altogether. Ops teams need better incident alert management.

The goal of this post is to highlight ways to eliminate noise from monitoring and, as a result, make your alerting more effective. Following these practices will help you manage your critical events effectively.

Use Analytics to Learn Normal Behavior

Effective alerting means effective monitoring. It means you are alerting on the right metrics at the right intensity. You don’t alert on events which are not actionable and you don’t alert on events which are redundant. You alert on IT events that have meaning. The ultimate goal of alerts is to raise awareness of underlying code or infrastructure problems.

Too many events and alerts, many of them false positives, reduce the effectiveness of IT operations and lead you to overlook the ones that matter. Consequently, it is important to learn which statistics are worth tracking. Is it MySQL availability, aborted connections or error logs? Know which ones are important for your organization and alert on them.
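One way to learn what normal looks like is to compare each new reading against the recent history of the same metric. The sketch below is only an illustration, not a feature of any particular monitoring tool: the function name, the three-standard-deviation cutoff, the minimum sample count and the sample data are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def is_anomalous(history, value, k=3.0, min_samples=10):
    """Return True when value is more than k standard deviations
    away from the mean of the recent history."""
    if len(history) < min_samples:
        return False  # too little data to know what "normal" is yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # perfectly flat history: any change stands out
    return abs(value - mu) > k * sigma

# Hypothetical samples: aborted MySQL connections per minute.
history = [2, 3, 1, 2, 4, 3, 2, 3, 2, 1, 3, 2]
print(is_anomalous(history, 4))   # False: within the learned range
print(is_anomalous(history, 25))  # True: far outside the learned range
```

A fixed threshold would need to be tuned by hand for every metric. A baseline like this adapts to each metric's own history, although real systems usually also account for daily and weekly patterns.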

Minimize monitoring surface area

In the world of IT monitoring, there is the data that is available to alert on and the data that is desirable to alert on. Not every item that you can monitor is important or useful. Additionally, not every item you want to monitor is available to you. As such, it is important to find the sweet spot where these two circles intersect.

 

Part of the goal of effective monitoring is to minimize the surface area of what you alert on. Duplicate alerts and repetitive signals decrease the quality of your alerts, so your goal is to eliminate alerts on everything except root causes. If you alert on resource limits, for example, alert on the main issue, such as the percent of maximum connections available. Events such as open file handles or table caches are subsidiary to the main concern.
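The idea of alerting only on root causes can be sketched as a small suppression step that runs before notifications go out. This is a minimal sketch under stated assumptions: the alert names and the mapping from symptom to root cause are hypothetical and would have to come from your own knowledge of the system.

```python
# Hypothetical map: symptom alert -> the root-cause alert that explains it.
ROOT_CAUSE_OF = {
    "open_file_handles_high": "max_connections_pct_high",
    "table_cache_full": "max_connections_pct_high",
}

def alerts_to_send(firing):
    """Drop any symptom alert whose root-cause alert is also firing.

    Root-cause alerts and symptoms with no firing root cause pass through.
    """
    firing_set = set(firing)
    return [a for a in firing if ROOT_CAUSE_OF.get(a) not in firing_set]

firing = ["max_connections_pct_high", "open_file_handles_high", "table_cache_full"]
print(alerts_to_send(firing))  # ['max_connections_pct_high']
```

A symptom still gets through when its root cause is not firing, so nothing is silently lost if the mapping turns out to be incomplete.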

By looking through your monitoring alerts, you might also find that you actually have duplicate alerts. These could be alerts that are left over from tests and replicate the functionality of other alerts.  If you don’t eliminate these repetitive alerts, you will receive a lot of alert noise that inundates your team. Again, your goal is to minimize the alerting surface area. Eliminating duplicates will help in reaching this goal.
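Duplicate alert rules can often be found mechanically by grouping rules that watch the same metric with the same condition. The following sketch assumes each rule can be exported as a dictionary with `name`, `metric`, `op` and `threshold` fields. That layout is an assumption made for this example, not the export format of any specific tool.

```python
from collections import defaultdict

def find_duplicate_rules(rules):
    """Group rule names that share the same metric, comparison and threshold.

    Returns a list of groups; each group holds two or more rule names.
    """
    groups = defaultdict(list)
    for rule in rules:
        key = (rule["metric"], rule["op"], rule["threshold"])
        groups[key].append(rule["name"])
    return [names for names in groups.values() if len(names) > 1]

# Hypothetical rule export, including one leftover test rule.
rules = [
    {"name": "cpu-high", "metric": "cpu_pct", "op": ">", "threshold": 90},
    {"name": "cpu-high-test", "metric": "cpu_pct", "op": ">", "threshold": 90},
    {"name": "disk-full", "metric": "disk_pct", "op": ">", "threshold": 95},
]
print(find_duplicate_rules(rules))  # [['cpu-high', 'cpu-high-test']]
```

Each reported group is a candidate for review rather than automatic deletion, since two identical rules may notify different teams on purpose.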

Bring in rich alerting (low vs. high priority)

Not every heartburn is a heart attack. Similarly, not every alert is high priority and warrants a 2 am wake-up call. Instead, monitoring should be configured to send different alerts for different scenarios. Equally important, the alerting tool you use should support different types of alerts with different ringtones or different levels of persistence.

For example, a major issue with the server room should trigger a high-priority alert. It can go only to the one person or team responsible for investigating it, but it should arrive as an immediate, high-priority alert that demands a fix. Alerting should also come with escalation, so that if the first person receiving the alert is unable to respond, another person receives the notification.
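The escalation behavior described above can be sketched as a loop over an ordered list of responders. Everything here is hypothetical: the `notify` and `acknowledged` callbacks stand in for whatever your alerting tool provides, and a real system would wait for a timeout between steps instead of checking immediately.

```python
def escalate(alert, chain, notify, acknowledged):
    """Notify each responder in order until one acknowledges the alert.

    Returns the responder who acknowledged, or None if nobody did.
    """
    for responder in chain:
        notify(responder, alert)
        # A real tool would wait here for an acknowledgement timeout.
        if acknowledged(responder, alert):
            return responder
    return None

# Hypothetical usage: the primary on-call does not answer, the secondary does.
sent = []
acks = {"secondary"}
owner = escalate(
    "server room overheating",
    ["primary", "secondary", "manager"],
    notify=lambda responder, alert: sent.append(responder),
    acknowledged=lambda responder, alert: responder in acks,
)
print(owner)  # secondary
print(sent)   # ['primary', 'secondary']
```

Stopping at the first acknowledgement keeps the alert with one owner and avoids waking the rest of the chain unnecessarily.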

Alternatively, a low-priority alert such as CPU usage at 65% should be investigated, but the investigation does not need to happen immediately and definitely does not need to wake anyone up at 2 am. These sorts of low-priority alerts should be configured so that they are not sent outside of a certain time range.
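Restricting low-priority alerts to a time range can be as simple as a check before sending. In this sketch the 9:00 to 18:00 window, the priority labels and the function name are assumptions made for illustration. Alerts held back outside the window would need to be queued for later review, which is not shown.

```python
from datetime import time

# Assumed delivery window for low-priority alerts.
BUSINESS_HOURS = (time(9, 0), time(18, 0))

def should_notify_now(priority, now, window=BUSINESS_HOURS):
    """High-priority alerts always go out; others only inside the window."""
    if priority == "high":
        return True
    start, end = window
    return start <= now < end

print(should_notify_now("low", time(2, 0)))    # False: hold until morning
print(should_notify_now("high", time(2, 0)))   # True: wake someone up
print(should_notify_now("low", time(10, 30)))  # True: inside the window
```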

CONCLUSION – Alert on what matters

The fewer alerts you receive, the happier you will be. This is particularly true if many of the alerts are the result of noise in the system or duplicate alerting. To ensure effective alerting, you need not only a robust monitoring tool but also a robust alerting tool that enables you to respond effectively to situations that arise in IT.

Pagers just don’t have the robustness required to live up to the demands of effective alerting. You need a tool that will alert you in an intelligent manner and allow you to take control of the situation quickly and efficiently.

To read more about how to improve IT alert management, download our whitepaper.

OnPage Corporation
