Managing alert noise from monitoring systems like SolarWinds can be tricky, and failing to bring order to the noise can cause:

  • Alert fatigue: too many alerts waking engineers up at night will not only leave them tired, but also hurt your team’s overall effectiveness.
  • Increased MTTR: because there are too many alerts, it will take extra time for engineers to respond intelligently to the issue or begin proper escalation.
  • Missed alerts: Like the boy who cried wolf, after too many false positives, engineers will begin to ignore alerts and, as a result, miss important issues.

SolarWinds Alerts with OnPage focus on MTTR

The very purpose of monitoring is to set thresholds that tell the team when and how to act. If the monitoring and alerting tools are not providing actionable events, then there is a problem with how the system is set up. Bringing a strong testing mindset to monitoring and alerting can help solve many of these issues. OnPage’s integration partner SolarWinds gets this just right when they indicate:

It is only with continuous monitoring that a network admin can maintain a high-performance IT infrastructure for an organization. … Adopting the best practices can help the network admin streamline their network monitoring to identify and resolve issues much faster with very less MTTR (Mean Time To Resolve)

So how can organizations make order out of the noise?

At OnPage, our best practices encourage DevOps, IT or SecOps to implement the following procedures:

  • Establish a baseline for the system. Initially set the IT monitoring and IT alerting parameters somewhat loosely so that you can determine the overall health and robustness of your system. While initially painful, this will allow you to see which types of alerts are garbage and which are meaningful. You won’t always know this from the outset, so it is a necessary part of the process. OnPage’s integration partner SolarWinds goes on to note, “Once normal or baseline behavior of the various elements and services in the network are understood, the information can be used by the admin to set threshold values for alerts.”
  • After three to four weeks of monitoring, you can review the audit trail on your OnPage console. Reviewing the console will allow you to see which components of your system are producing alerts that need an immediate response and which ones do not. In the language of OnPage, you are able to determine which alerts are low priority and which are high priority. Low priority alerts such as ‘server is 90% full’ can often be taken care of during normal working hours. High priority alerts such as a potential zero-day attack need immediate attention and should wake up the on-call engineer.
  • Ensure that alerts come with proactive messaging. A message that describes the problem allows engineers to solve it quickly and to know right away whether they can handle the issue themselves or need to escalate.
  • In order to keep up with the pace of change that will inevitably befall your system, it is important that every component of your IT stack follow this process. Otherwise, you will quickly be drowning in alerts.
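The baseline step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample readings are made up, the function name `baseline_threshold` is hypothetical, and the "mean plus three standard deviations" rule is one common way to turn a baseline into a threshold, not something prescribed by OnPage or SolarWinds.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return mean + k sample standard deviations of the baseline readings.

    `k` is an assumed tuning knob: a larger k means fewer, louder alerts.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a baseline")
    return mean(samples) + k * stdev(samples)


def breaches(value, threshold):
    """True when a new reading exceeds the baseline-derived threshold."""
    return value > threshold


# Hypothetical CPU utilisation readings (percent) gathered during the
# baseline period; real values would come from your monitoring tool.
cpu_samples = [40, 42, 38, 41, 39, 40, 43, 37]
threshold = baseline_threshold(cpu_samples)

print(f"alert when CPU exceeds {threshold:.1f}%")
print("45% breaches:", breaches(45, threshold))
print("60% breaches:", breaches(60, threshold))
```

Recomputing the threshold whenever the baseline window is refreshed keeps it in step with changes to the system.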

Not every attack of heartburn is a heart attack. Similarly, not every alert is high priority requiring a 2 a.m. wake-up call. You need to know how to tell the difference.
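To make that distinction concrete, here is a minimal routing sketch in Python. Everything in it is assumed for illustration: the `Alert` class, the action names, and the 9-to-5 weekday business hours are hypothetical and do not reflect the OnPage or SolarWinds APIs.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Alert:
    source: str
    message: str
    priority: str  # "high" or "low" (assumed two-level scheme)


# Assumed business hours: Monday to Friday, 09:00 to 17:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def route(alert, now):
    """Decide what to do with an alert at the given moment."""
    if alert.priority == "high":
        # High priority always wakes the on-call engineer.
        return "page_on_call"
    in_hours = now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END
    # Low priority is handled right away only during working hours.
    return "notify_team" if in_hours else "queue_for_business_hours"


disk_alert = Alert("db01", "server is 90% full", "low")
attack_alert = Alert("fw01", "potential zero-day attack", "high")
two_am = datetime(2024, 1, 3, 2, 0)  # a Wednesday at 2 a.m.

print(route(disk_alert, two_am))
print(route(attack_alert, two_am))
```

With this rule the disk alert waits for the morning, while the potential attack pages the on-call engineer immediately.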

Control the noise

If you want to maintain your monitoring tool’s usefulness, you need alerting that is meaningful and actionable. You need to create thresholds and analyze them. Having a thousand alerts come through will cause the most tolerant of engineers to lose their cool. You don’t want that and we at OnPage don’t want that for your team either.
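One simple way to analyze alerts is to tally how many each component produced and what share of them actually required action. The Python sketch below assumes a hypothetical log of `(component, was_actionable)` pairs; a real audit trail export will look different and would need to be mapped into this shape first.

```python
from collections import Counter

# Hypothetical review data: (component, was the alert actionable?)
alert_log = [
    ("web01", False), ("web01", False), ("web01", True),
    ("db01", True), ("db01", True),
    ("cache01", False), ("cache01", False),
    ("cache01", False), ("cache01", False),
]


def noise_report(log):
    """Return (component, (total alerts, actionable ratio)) pairs,
    sorted so the noisiest (least actionable) components come first."""
    totals = Counter(component for component, _ in log)
    actionable = Counter(component for component, ok in log if ok)
    report = {c: (totals[c], actionable[c] / totals[c]) for c in totals}
    return sorted(report.items(), key=lambda item: item[1][1])


report = noise_report(alert_log)
for component, (total, ratio) in report:
    print(f"{component}: {total} alerts, {ratio:.0%} actionable")
```

Components at the top of the list are candidates for looser thresholds or a lower priority.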

Shawn Lazarus
