When tools are out of control you need critical alerting

Tools going Rogue – a story for Halloween

We have all heard stories of DevOps woe. Some tales are sad. Some describe true misfortune. And some just leave you wondering what on earth the developers were thinking. This story is one of the latter. It tells how developers at a start-up in New England created code that was supposed to live and work on AWS EC2 servers. However, the developers never thought to test what they were spinning up or to put critical alerting in place for when things went wrong. And that is where our tale of woe begins.

Tale number 1: Automation destroyed the world

What the tool was supposed to do has long since been forgotten, but the horror and nightmares it caused will not go away so soon. At the start-up in question, seemingly every protocol was ignored when this new tool was written and deployed. There was little documentation, little testing, far too much privilege and no way to muzzle the tool. The code had to run at the highest privilege, which meant it could do anything. That is dangerous, because code that can do anything will eventually do everything. Somehow, the engineers assumed the tool would work without these necessary safeguards.

This tool was built to reach out to AWS and scale up infrastructure based on requests. But the tool did not work consistently. For example, the base instance of the code worked, but when a new instance of the code was required and new servers were spun up, the tool didn’t reach out to the GitHub repository to get a new copy of the code. Instead, the code had to be pushed manually.

Additionally, when the servers were no longer needed, the tool would scale down by killing the oldest instances first. And indeed, this is how scaling down should work. However, the tool didn’t check whether the code was working on the newest servers before destroying the old ones. And because the tool didn’t reliably push code to the new instances, the company was left with new servers that had no code running on them.
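The missing check can be illustrated with a minimal Python sketch. It assumes each server is described by a launch time and a flag saying whether the code on it is confirmed to be running. The `Instance` type, its fields and the `plan_scale_down` function are hypothetical names made up for this illustration; they are not part of the original tool or of any AWS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    instance_id: str    # hypothetical identifier
    launched_at: float  # launch time, seconds since epoch
    code_running: bool  # True if the application code is confirmed working


def plan_scale_down(instances: List[Instance], target_count: int) -> List[Instance]:
    """Return the instances that are safe to terminate, oldest first.

    An instance is only scheduled for termination if at least
    `target_count` instances with working code would remain afterwards.
    """
    oldest_first = sorted(instances, key=lambda i: i.launched_at)
    remaining = list(oldest_first)
    to_terminate: List[Instance] = []
    for candidate in oldest_first:
        if len(remaining) <= target_count:
            break  # already at the desired size
        survivors = [i for i in remaining if i is not candidate]
        healthy = sum(1 for i in survivors if i.code_running)
        if healthy < target_count:
            break  # removing this server would leave too few working ones
        remaining = survivors
        to_terminate.append(candidate)
    return to_terminate
```

With one old working server and one new server that never received its code, this plan terminates nothing, which is the behavior the tool in the story lacked.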

The resolution: How Automation was Muzzled

Eventually, the issues with the original tool were recognized. The reality is that it was not a terrible tool. The problem was that it was not fully thought out. What the tool needed was a second tool, plus critical alerting, to monitor it. This second tool, which was eventually built, checked whether the data and the environment were in good order and whether the input was healthy. If not, the original tool would do nothing.

In addition to validating requests, code also needs critical alerts to notify developers and Ops when it doesn’t work as planned. You wouldn’t let a developer deploy code without testing. Similarly, you cannot build a tool and assume it works in production the way it does on a laptop. Critical alerting tools such as OnPage are clearly needed in DevOps to ensure requests and environmental checks are working.
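The combination of a validation gate and an alert can be sketched in a few lines of Python. This is only an illustration of the idea, not the second tool from the story: `run_if_healthy`, the check names and the `alert` callable are all assumptions, and `alert` stands in for whatever alerting integration a team actually uses.

```python
from typing import Callable, Dict, List


def run_if_healthy(
    checks: Dict[str, Callable[[], bool]],
    action: Callable[[], None],
    alert: Callable[[str], None],
) -> bool:
    """Run `action` only when every named check passes.

    Returns True if the action ran, False if it was blocked.
    """
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that crashes is treated as a failure
        if not ok:
            failed.append(name)
    if failed:
        # Do nothing except tell a human what went wrong.
        alert("Automation blocked; failed checks: " + ", ".join(failed))
        return False
    action()
    return True
```

A caller might pass checks such as `{"input_valid": ..., "new_servers_have_code": ...}` (hypothetical names) together with the scale-down routine as `action`, so that an unhealthy environment results in an alert rather than destroyed servers.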

Tale number 2: Automation quickly balloons your Amazon bill

In this second tale of DevOps horror and woe, another New England start-up had a group of its developers create a tool for spinning up infrastructure. When this tool decided it needed something, it would go out and build it. That is to say, the tool had no dependencies. What exactly does that mean? It means that if the tool wanted to build an SQS queue, it could. If the tool wanted to create an SNS operator, it could. If the tool wanted to build up another server, it could. Again, the tool was designed to have no dependencies and indeed it did not. Every instance could create anything and everything.

Sounds great, right? Not exactly. The problem with having no dependencies is that you can build everything and there is a cost to building infrastructure. No dependencies can start costing your team a lot of money very quickly. And that is indeed what happened in this case. The lack of control on this tool enabled it to create lots of unnecessary infrastructure.

The lack of supervision also allowed for the creation of thousands of CloudWatch metrics, which the DevOps team initially knew nothing about. The team was using Datadog to check CloudWatch metrics, and Datadog made API calls every 10 minutes to do so. With 80,000 CloudWatch metrics resulting from all of the infrastructure, and every metric checked every 10 minutes, that soon became a lot of API calls. AWS charged the company any time it made more than 2 million API calls per month. With 80,000 metrics being called every 10 minutes, the company very quickly exceeded 2 million calls.
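A quick back-of-the-envelope calculation shows how fast those numbers add up. The sketch below uses the figures from the story and assumes, for simplicity, that each poll costs one API call per metric; real integrations may batch requests, so treat the result as an upper-bound illustration rather than the company's actual usage.

```python
METRICS = 80_000               # CloudWatch metrics, from the story
POLL_INTERVAL_MINUTES = 10     # polling interval, from the story
BILLING_THRESHOLD = 2_000_000  # calls per month before charges, from the story

# Assumption: one API call per metric per poll.
polls_per_day = 24 * 60 // POLL_INTERVAL_MINUTES  # 144 polls
calls_per_day = METRICS * polls_per_day           # 11,520,000 calls
calls_per_30_days = calls_per_day * 30            # 345,600,000 calls

# How long until the monthly threshold is crossed?
polls_to_threshold = BILLING_THRESHOLD / METRICS  # 25 polls
hours_to_threshold = polls_to_threshold * POLL_INTERVAL_MINUTES / 60

print(f"Calls per day: {calls_per_day:,}")
print(f"Calls per 30 days: {calls_per_30_days:,}")
print(f"Hours until threshold: {hours_to_threshold:.1f}")
```

Under that assumption, the 2 million call threshold is crossed a little over four hours into the month.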

The team only recognized how far things had gone out of control when the bill came at the end of the month.

Fix it

From a pure DevOps perspective, there should have been, from the very beginning, a clear understanding of what the code was designed to do and how the Ops team would monitor the tool. Should it ever be the case that infrastructure is built and Ops doesn’t know about it? Probably not.

What proved the saving grace in this situation was critical alerting. Any time costs exceeded $X, the managers would receive an alert. Any time new infrastructure was spun up, the managers would receive an alert. With a tool like OnPage, this company could easily have created low-priority and high-priority alerts based on how big the bill was or how much new infrastructure was created.
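The idea of low-priority and high-priority alerts can be sketched as a simple threshold function. This is an illustration only: `alert_priority` and the threshold values in the usage note are hypothetical, and the sketch does not call OnPage or AWS; it only decides which priority an alert should carry.

```python
from typing import Optional


def alert_priority(value: float, low_threshold: float, high_threshold: float) -> Optional[str]:
    """Map a measured value (a bill in dollars, or a count of new
    resources) to an alert priority, or None if no alert is needed."""
    if low_threshold > high_threshold:
        raise ValueError("low_threshold must not exceed high_threshold")
    if value >= high_threshold:
        return "high"
    if value >= low_threshold:
        return "low"
    return None
```

For example, with an assumed low threshold of $1,000 and an assumed high threshold of $5,000, `alert_priority(1200, 1000, 5000)` yields a low-priority alert and `alert_priority(7500, 1000, 5000)` a high-priority one. The same function could be applied to the number of newly created resources.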

Furthermore, developers were given their own account, separate from Testing and Production, in which they cannot create new infrastructure that impacts Ops. Now developers can create and ideate all they want without costing the company thousands of dollars.

Finally, the company instituted weekly Dev and Ops meetings to ensure that each side is aware of the other’s pain points. Ops knows what the goals of the Devs are and vice versa. We actually wrote a blog on this very point a few months back, highlighting the need for this constant back-and-forth communication between Devs and Ops. Only through this constant communication can companies hope to achieve true collaboration and growth.

A cautionary tale for developers everywhere – use critical alerting

While much of the fault in both tales lies with the developers, the scenarios are not unique. Most DevOps engineers can probably think of instances where imperfect code was pushed into production. The cautionary tale here is that there was no mechanism or alerting platform to recognize the problems. Prayer is not a strategy for effective DevOps. Instead, you need to do the following:

Put mechanisms in place to alert you when things go wrong. OnPage’s critical alerting is the place to start. Alerts can let you know when your AWS bill has hit $1000 or when new server instances have deployed incorrectly.
TEST! Don’t assume that because things are supposed to work, they will. Use effective monitoring tools to keep track of how your system is responding to growth and the data it is given.
Create documentation that explains how the tool is supposed to work and how to test inputs.
Make sure your Dev and Ops teams are talking to each other and each knows what the code is supposed to do. Never deploy new code and go off on vacation.

To learn more about OnPage’s critical alerting platform and how it can help you avoid DevOps nightmares, you can visit our website or schedule a demo.

Published by

OnPage Corporation

Tags: CRITICALPriority Messaging

9 years ago
