The more I read and learn about how to succeed in DevOps, the more I realize how important failure is to the process. You need to fail to be great at DevOps. Netflix, for example, takes it a step further by deliberately introducing failure into its testing process. In our blog The Seven Deadly Sins of DevOps, we wrote about how you should not do DevOps. Interestingly enough, failure is not a sin. In fact, failure is something you should strive for. This blog will give you a sense of how you can plan to fail strategically.
How do you succeed by failing? It sounds like a contradiction. Simply put, it’s by building failure into the testing process. Think of it as ‘controlled failure’: you think strategically about where the system is likely to break down under stress, and test those points deliberately.
At its core, DevOps is about shifting from a fear of failure to a willingness to fail fast and move forward. DevOps is about quickly adapting code and product to meet customer needs. If zero failure is part of your corporate mantra, then your employees are afraid to innovate. That is because innovation is about failing, often many times, before you succeed. I am not saying that your goal should be to fail. Rather, you need to look at failing as an opportunity to learn and improve.
According to the Agile Manifesto you should “welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.” By living up to this part of the Manifesto, you understand that some project and coding efforts will fail, either due to a change in product specifications or due to a streamlining of the process. Either way, the goal should be to quickly shift focus and move forward.
Netflix has the Chaos Monkey, as part of its Simian Army, to introduce failure into the process when it is not expected so that engineers are trained to deal with things breaking and not working as planned. Netflix believes that the best way to avoid major failures is to fail constantly.
Chaos Monkey schedules failure, so simulated outages occur at times when they can be closely monitored. In this way, teams prepare for major unexpected errors or outages rather than waiting for a real one to occur and seeing how well they cope.
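To make the idea concrete, here is a minimal sketch of scheduled failure injection. It is not Netflix’s actual Chaos Monkey, just the same principle in Python with boto3; the opt-in “chaos” tag and the business-hours window are assumptions made for illustration.

```python
# A minimal, hypothetical sketch of scheduled failure injection (not Netflix's
# actual Chaos Monkey). Assumes boto3 credentials/region and an opt-in "chaos" tag.
import random
from datetime import datetime

import boto3

ec2 = boto3.client("ec2")


def candidates():
    """Return IDs of running instances that have opted in to chaos testing."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["opt-in"]},  # hypothetical tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]


def inject_failure():
    """Terminate one random opted-in instance, but only during business hours,
    when engineers are watching the dashboards."""
    if not 9 <= datetime.now().hour < 17:
        return  # only fail when the failure can be closely monitored
    pool = candidates()
    if pool:
        victim = random.choice(pool)
        ec2.terminate_instances(InstanceIds=[victim])
        print(f"Chaos: terminated {victim}")


if __name__ == "__main__":
    inject_failure()
```

Run on a schedule during hours when the team is watching its dashboards, a script like this turns outages from surprises into rehearsals.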
The earlier you can get to issues, the cheaper it is to fix them. That alone is a reason to shift left. Additionally, addressing security issues before they affect your customers is crucial. According to one DevOps expert, bringing quality earlier into your development schedule typically means that:
Rather than recommending a specific tool, I will instead say that it is important to pick a set of tools that causes the least amount of friction, confusion, and breakdowns in communication between project stakeholders. Anything you can do to make tooling more coherent across your teams will help you focus on quality.
Similar to the way Linux brought open source to developers everywhere, cloud infrastructure is enabling companies to lower the cost of development. By using AWS EC2, teams pay only for the capacity they actually use rather than purchasing large racks that sit idle. One expert writes:
A wonderful thing happens because suddenly, the fear of failure evaporates. Just as with a public cloud, if a new application initiative fails, you simply shut it down — no harm, no foul. The cost of failure is inexpensive, which fosters risk taking, which in turn creates a culture of acceptance of failed initiatives.
Additionally, by combining DevOps with the cloud:
By moving further towards the cloud, DevOps furthers its mission of failing fast at lower cost.
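As a concrete illustration of how cheap it is to walk away from a failed initiative, here is a minimal sketch, assuming boto3 credentials and a hypothetical Project tag grouping the experiment’s resources, that tears down everything the experiment was running.

```python
# A minimal sketch of the "no harm, no foul" teardown: terminate every EC2
# instance tagged for a failed experiment. The Project tag is a hypothetical
# convention for grouping an initiative's resources.
import boto3

ec2 = boto3.client("ec2")


def tear_down(project):
    """Terminate all running instances tagged Project=<project>; return their IDs."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Project", "Values": [project]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)
    return instance_ids


if __name__ == "__main__":
    print(tear_down("failed-experiment"))  # hypothetical project name
```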
Also, be sure to include strong Application Performance Monitoring (APM) tools and feedback loops, using tools such as Slack to communicate when tests fail. When significant failures do occur (and they inevitably will), it is important to have OnPage integrated into monitoring systems such as DataDog or SolarWinds to ensure that critical notifications are delivered prominently and promptly to the on-call engineer’s smartphone.
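As one possible shape for that feedback loop, here is a minimal sketch that runs a test suite and posts to a Slack channel when it fails. The incoming-webhook URL is a hypothetical placeholder, and the pytest invocation is only an assumption about how your tests are run.

```python
# A minimal sketch of a failure feedback loop: post to Slack when the test
# suite fails. The webhook URL below is a hypothetical placeholder.
import json
import subprocess
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical


def notify_slack(text):
    """Send a plain-text message to the configured Slack incoming webhook."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


def run_tests():
    """Run the test suite and alert the channel if anything fails."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if result.returncode != 0:
        notify_slack(f"Test run failed (exit code {result.returncode}).")


if __name__ == "__main__":
    run_tests()
```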
In last week’s blog, we talked about how a lack of critical alerting allowed AWS bills to grow in ways that engineers couldn’t explain. Without alerting, the engineers only discovered that over 80,000 API calls were being made to AWS every 10 minutes when their bill arrived.
It is easy to see how a team building a new piece of code isn’t aware of every way in which the code will interact with its environment. True enough. What is unacceptable is that there were no alerts in place to notify the DevOps teams when thousands of dollars were being spent on unnecessary API calls. And, more to the point, this was only recognized late in the game.
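A guardrail like the one that was missing can be set up in a few lines. The sketch below, with a hypothetical SNS topic ARN and budget threshold, creates a CloudWatch alarm on AWS estimated charges; billing metrics are reported in us-east-1 and must first be enabled in the account.

```python
# A minimal sketch of a cost guardrail: a CloudWatch alarm on AWS estimated
# charges that notifies an SNS topic. The topic ARN and threshold are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-charges-guardrail",        # hypothetical name
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                                   # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=1000.0,                               # hypothetical monthly budget in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical ARN
)
```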
In order to minimize MTTR (mean time to resolution), it is key to use OnPage’s prominent and persistent alerting. While failure cannot be avoided, your team does have the ability to manage MTTR effectively.
Know that when systems or code fail, there is documentation to fall back on that explains how the system is supposed to work. Documentation should also cover error codes, workflows and best practices. It is not always possible to reach out to the developer to ask why a piece of code is malfunctioning or why it isn’t deploying as expected. With proper documentation, this knowledge is easily accessible.
As teams get larger, documentation becomes a set of founding principles explaining why the code was written a certain way and why decisions were made. Without this foundation, the core thinking is likely to change based on whichever team member is in charge. No communal memory gets imprinted, which in turn makes it difficult to stay focused on what has failed in the past. Effective documentation limits this drift.
One of the most important things for any business is learning from past mistakes. With DevOps, it’s crucial. In our blog from September, we explain how blameless post mortems are necessary for effective DevOps. Blameless post mortems point to a mature team that recognizes mistakes are inevitable and is willing to learn from them.
Henry Ford noted that “Failure is simply an opportunity to begin again, only this time more intelligently”. If failure is seen as part of the process rather than a cause for blame, then it can be effectively incorporated into improving the DevOps process.