IT alerting and IT monitoring are not what they used to be. In years past, software releases were scheduled a few times per year. Often, a single monitoring tool would watch the infrastructure and send out alerts. Sorry, but those days are gone. Nowadays, start-ups use containers, microservices, and continuous integration and delivery. As such, monitoring can, and needs to, happen at multiple points along the pipeline.
If you are not taking the time to calibrate your systems to reduce the amount of noise and ensure effective alerting, then you’ve got monitoring and alerting all wrong. Don’t worry though. It’s not a death sentence – thankfully. There are clear methods for turning IT monitoring noise into actionable IT alerting.
It’s not just a catchy line from Quiet Riot. ‘Come on feel the noise’ also encapsulates how many engineers in IT Ops experience monitoring. Because of the need to monitor multiple points in the stack, multiple monitoring tools have arisen. And because there are multiple monitoring tools, there is a lot of noise. Per BigPanda’s CTO:
The old “one tool to rule them all” approach no longer works. Instead, many enterprises are selecting the best tool for each part of their stack with different choices for systems monitoring, application monitoring, error tracking, and web and user monitoring.
….
As companies add more tools, the number of alerts that they must field can grow by orders of magnitude. It’s simply impossible for any human, or teams of humans, to effectively manage that.
Indeed, it is impossible for Dev, Ops, IT or SecOps to stay on top of 100 alerts during the day and night. Instead, these groups need to find a way to bring order to the madness. Teams need to be nimble to remain competitive and to support the many moving parts that make up their systems. As BigPanda’s CTO goes on to add:
If organizations [do not adjust their monitoring strategies] they will not only cripple their ability to identify, triage and remediate issues, but they run the risk of violating SLAs, suffering downtime, and losing the trust of customers.
Furthermore, by failing to order the noise, engineers and corporations will suffer a predictable set of problems:
The very purpose of monitoring is to set thresholds that tell the team when and how to act. If the monitoring and alerting tools are not producing actionable events, then there is a problem with how the system is set up.
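To make the idea of an actionable threshold concrete, here is a minimal sketch in Python. It is not taken from any particular monitoring product. The `Threshold` class, the `evaluate` function, the metric name and the numeric limits are all hypothetical and chosen only for illustration. The point is that every threshold carries the action the team is expected to take, and a reading below the threshold produces no alert at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold definition: limits plus the expected response."""
    metric: str
    warn: float
    critical: float
    action: str  # what the on-call engineer should do when this fires


def evaluate(threshold: Threshold, value: float) -> Optional[dict]:
    """Return an alert only when a limit is crossed; otherwise return None."""
    if value >= threshold.critical:
        level = "critical"
    elif value >= threshold.warn:
        level = "warning"
    else:
        return None  # below both limits: no noise
    return {
        "metric": threshold.metric,
        "level": level,
        "value": value,
        "action": threshold.action,
    }


# Illustrative values only; real limits depend on your own systems.
disk = Threshold(
    metric="disk_used_percent",
    warn=80.0,
    critical=95.0,
    action="Free space or expand the volume",
)

for reading in (50.0, 85.0, 97.0):
    print(reading, evaluate(disk, reading))
```

If a threshold cannot be given a meaningful `action`, that is usually a sign it should not generate an alert in the first place.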
But by bringing a strong testing mindset to bear, monitoring and alerting can help solve many of the issues. SolarWinds gets this just right when they indicate:
It is only with continuous monitoring that a network admin can maintain a high-performance IT infrastructure for an organization. … Adopting the best practices can help the network admin streamline their network monitoring to identify and resolve issues much faster with very less MTTR (Mean Time To Resolve)
So how can and should organizations make order of the noise? At OnPage, our best practices encourage DevOps, IT or SecOps to implement the following procedures:
Not every attack of heartburn is a heart attack. Similarly, not every alert is a high-priority event that requires a 2 a.m. wake-up call. You need to know how to tell the difference.
If you want to maintain your stack’s value and usefulness, you need alerting that is meaningful and useful. You need to create thresholds and analyze them. Having a thousand alerts come through will cause even the most tolerant of engineers to lose their cool. You don’t want that, and we at OnPage don’t want that for your team either.
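One way to picture the difference between a 2 a.m. page and an alert that can wait is a small router that considers priority and suppresses repeats. The sketch below is an assumption-laden illustration, not OnPage's implementation or API. The `AlertRouter` class, the `"high"` priority label, the alert keys and the five-minute suppression window are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AlertRouter:
    """Hypothetical router: page on high priority, queue the rest, drop repeats."""
    suppress_seconds: float = 300.0  # assumed window; tune for your team
    _last_delivered: Dict[str, float] = field(default_factory=dict)

    def route(self, key: str, priority: str, now: float) -> str:
        """Return 'page', 'queue' or 'suppress' for an alert identified by key."""
        last = self._last_delivered.get(key)
        if last is not None and now - last < self.suppress_seconds:
            return "suppress"  # same alert delivered recently: do not repeat it
        self._last_delivered[key] = now
        # Only high-priority alerts justify waking someone up.
        return "page" if priority == "high" else "queue"


router = AlertRouter(suppress_seconds=300.0)
print(router.route("db-down", "high", now=0.0))     # first occurrence
print(router.route("db-down", "high", now=60.0))    # repeat inside the window
print(router.route("disk-80pct", "low", now=60.0))  # can wait until morning
```

Timestamps are passed in explicitly rather than read from the clock so the behavior is easy to test. In a real setup the priority would come from analyzed thresholds rather than a hard-coded label.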