Faced with limited financing and a high burn rate, many startups focus on product development and application coding at the expense of operations engineering. The reasons for this focus are understandable to a point: companies need to ship product, and unseasoned CEOs don’t always see the value in investing in IT Ops. Some call this movement toward operating without IT Ops “NoOps” or “serverless”.
Yet the “dusty old concepts”, as Charity Majors calls them, things like scalability and graceful degradation, don’t take care of themselves when companies stop worrying about them. The problem grows when developers try to compensate for the missing Ops team by building DIY tools to cover the shortfall. To paraphrase the poet Robert Browning, their reach exceeds their grasp. And with that reach comes technical debt.
Martin Fowler has a useful way of framing technical debt. He describes it as follows:
You have a piece of functionality that you need to add to your system. You see two ways to do it, one is quick to do but is messy – you are sure that it will make further changes harder in the future. The other results in a cleaner design, but will take longer to put in place.
This explanation lays out the tradeoff. The quick and messy way creates technical debt which, like financial debt, has implications for the future. If we choose not to pay down the technical debt, we keep paying interest on it. In development terms, that means repeatedly going back to that quick and dirty piece of technology and spending extra effort that never needed to be spent.
Alternatively, developers can invest in better design and, in the case of this argument, bring in IT operations to worry about the things Ops worries about: scalability, graceful degradation, query performance and availability.
Unfortunately, DIY practitioners often don’t invest the time to think through this tradeoff. Given financial constraints, rushed deadlines, short-sightedness or some combination thereof, they choose the quick and dirty option.
My colleague Andrew Ben, OnPage’s VP of Research and Development, spoke with Nick Simmonds, the former Lead Operations Engineer at Datarista. Nick described a previous company where one of his first jobs was to gain control over a DIY scaling tool that had been developed in-house. The tool was created before any operations engineers had been hired, so it was designed as a “quick and dirty” method of provisioning servers.
According to Simmonds, the tool’s faults were significant. It was meant to eliminate manual scaling of microservices, but it simply spun up new instances with nothing deployed on them. When it spun servers down, it never checked whether the code was actually running on the newest servers before destroying the old ones. And because the tool never reliably pushed code to the new instances, the company was left with fresh servers running no code at all.
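For illustration only, here is a minimal sketch of the kind of check that was missing. It assumes a hypothetical /healthz endpoint on each instance, and the provisioning, deployment and teardown helpers are placeholders you would wire to your own cloud provider rather than any particular API:

```python
"""Sketch of a safer scale-up flow: never destroy the old server until the
replacement has been verified to be serving the current release.
The provision/deploy/terminate helpers are placeholders, not a real API."""
import time

import requests


def provision_instance() -> str:
    # Placeholder: call your cloud provider and return the new host's address.
    raise NotImplementedError("wire this to your provisioning API")


def deploy_code(host: str) -> None:
    # Placeholder: push the current release to the new host (image pull, rsync, etc.).
    raise NotImplementedError("wire this to your deployment pipeline")


def terminate_instance(host: str) -> None:
    # Placeholder: only ever called after the replacement passes health checks.
    raise NotImplementedError("wire this to your provider's terminate call")


def is_healthy(host: str, retries: int = 10, delay: float = 5.0) -> bool:
    """Poll an application health endpoint; a server with no code will fail this."""
    for _ in range(retries):
        try:
            if requests.get(f"http://{host}/healthz", timeout=3).status_code == 200:
                return True
        except requests.RequestException:
            pass
        time.sleep(delay)
    return False


def replace_instance(old_host: str) -> None:
    new_host = provision_instance()
    deploy_code(new_host)
    if not is_healthy(new_host):
        # Keep the old server if the new one never came up with working code.
        raise RuntimeError(f"new instance {new_host} failed health checks; keeping {old_host}")
    terminate_instance(old_host)
```

The point is simply that the old server only goes away after the new one has been confirmed to be running the code it was supposed to get.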
A significant part of the problem was that the tool Nick’s colleagues built had no alerting component. His team only recognized the failure once it was live in production. No monitoring, no alerting. Nothing was in place to let Simmonds’ team know the deployment had failed.
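Even a bare-bones deployment script can at least fail loudly. As a rough sketch, assuming nothing more than a generic webhook endpoint (the URL and payload below are placeholders, not any specific product’s API), the deployment step could report its own failures instead of dying silently:

```python
"""Sketch of failing loudly: if a deployment step blows up, push an alert
to a webhook rather than swallowing the error. URL and payload are placeholders."""
import requests

ALERT_WEBHOOK = "https://alerts.example.com/hook"  # placeholder endpoint


def send_alert(message: str) -> None:
    try:
        requests.post(
            ALERT_WEBHOOK,
            json={"severity": "critical", "message": message},
            timeout=5,
        )
    except requests.RequestException:
        # Last resort: at least leave a trace in the logs.
        print(f"ALERT DELIVERY FAILED: {message}")


def deploy_or_alert(deploy_fn, target: str) -> None:
    try:
        deploy_fn(target)
    except Exception as exc:  # intentionally broad: any failure should page someone
        send_alert(f"deployment to {target} failed: {exc}")
        raise
```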
I don’t want this cautionary tale to cause nightmares for any young DevOps engineers out there. I wouldn’t want that on my resume. Instead, I want to impress upon application developers the need to be mindful of how “quick and dirty” affects future operations and releases. Tools shouldn’t be created as a temporary hack until Ops gets on board.
Teams should invest the time in building a robust piece of code or tooling. If they don’t have the time, they should invest in existing tools that accomplish the desired result. And, as Nick stressed in his interview with Andrew, never try to hack together a tool for monitoring and alerting.
Alerting is too important and too complex to leave to a hack or to accumulated technical debt. A robust alerting tool has to do far more than drop an email into your inbox: email alerting is useful only if your job requires your eyes to be glued to email at all times, never letting down your guard, maintaining constant vigilance.
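To get a feel for why it is complex, here is a deliberately over-simplified sketch of just one behavior a real alerting tool needs, escalating an alert that nobody acknowledges. The responder names, timings and helper functions are illustrative assumptions, and a production system also needs on-call schedules, redundant delivery channels, delivery confirmation and more:

```python
"""Over-simplified sketch of one alerting behavior: escalate if the on-call
responder does not acknowledge in time. Names, timings and helpers are placeholders."""
import time

ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "engineering-manager"]
ACK_TIMEOUT_SECONDS = 300  # escalate after five unacknowledged minutes


def notify(responder: str, message: str) -> None:
    # Placeholder: push, SMS or phone call through whatever channels you support.
    print(f"notifying {responder}: {message}")


def acknowledged(responder: str) -> bool:
    # Placeholder: check your incident store for an acknowledgement.
    return False


def escalate(message: str) -> None:
    for responder in ESCALATION_CHAIN:
        notify(responder, message)
        deadline = time.time() + ACK_TIMEOUT_SECONDS
        while time.time() < deadline:
            if acknowledged(responder):
                return
            time.sleep(10)
    raise RuntimeError("alert was never acknowledged; page everyone")
```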
If you would rather invest in a robust alerting tool that already has these capabilities fully tested and working, contact OnPage. We have alerting figured out.
OnPage is a critical alerting and incident notification platform used by DevOps and IT practitioners. Download a free trial to get started on the path to better incident management.