IT management thought leadership

Constant Vigilance in Continuous Delivery – How to do DevOps right

The importance of monitoring and alerting in the continuous delivery cycle

In the 2016 State of DevOps report, Puppet reported that the top DevOps shops like Amazon or Etsy deploy new software releases multiple times per day. The next tier of companies deploy on a weekly or monthly basis. What is the difference between High and Medium IT performers? More than just tools, it is mindset.

In addition to building code, the top DevOps teams have adopted a mindset of “constant vigilance”, as Mad-Eye Moody said to Harry Potter. Not only should teams always be building but they should also always be testing. Through testing they should remain constantly vigilant – attuned to their software and its performance. And equally important in this process should be incident alerting to let both Dev and Ops know when things go awry.

A cautionary tale: What happens without constant vigilance

While many DevOps shops buy into the testing mindset, it is important to present an awful case of what happens when sufficient testing is not done throughout the stack. In 2012, Knight Capital Group famously lost $460 million in 45 minutes of trading on the stock exchanges due to a failed software update. According to one article I read on the matter, “Had Knight automated its deployments, fully re-deployed servers periodically using automated tools, or removed old unused code from its codebase more aggressively, a technician’s failure to deploy code to the eighth server in a cluster would not have had such disastrous results.” Cluster indeed.

This cautionary tale instructs us that it’s not enough to build great software. You need to test it as well. If Knight Capital had employed continuous delivery philosophy, they would have realized that their 5th column laywith their 8th server. Knight Capital needed to test and attempt to simulate a full deployment in order to truly understand the robustness of their system.

Necessity of Tools in Continuous Deployment

Necessary to detecting issues such as those faced by Knight Capital and numerous others (although hopefully not at quite the same scale) is tools. In fact, one can argue, that DevOps today has very much become intertwined with tools ranging from Github to Fortify. If you have a spare moment, you should take a look at the XebiaLabs’ awesome periodic table to catalog all the tools available to DevOps.

Testing begins with continuous integration (CI). CI, the practice in which code changes are tested as soon as they are committed, is supported by tools such as Apache Continuum, Bamboo, Codeship, Jenkins/Hudson and TravisCI. Continuous integration is supposed to make merging software into the trunk everyone’s daily work. It is at this point that software engineers can begin to see what problems their software (and those of their colleagues) are bringing into the system.

This becomes a great opportunity to learn what bugs they are bringing to the steady state system and learn to create alerts with tools like OnPage to keep them apprised of these issues in future releases. Integrating your team’s incident management process into your continuous integration workflow is an excellent way to improve communication and transparency around failed builds.

Following CI is Continuous Delivery (CD). Martin Fowler defines CD as an extension of continuous integration in that, as soon as the unit tests pass, the code is immediately released to production. Tools for CD such as Puppet, Ansible, Chef or Salt help keep a pulse on the development system. These tools reduce process performance time and propensity for errors while improving reproducibility.

Continuous Delivery also requires that whenever anyone makes a change that causes an automated test to fail, breaking the deployment pipeline, the pipeline and all associated systems must be brought back into a deployable state. By having logable incidents, you can record steps taken towards remediation and have that information ready for future use. You want your company to place a high premium on learning and by having these processes in place that alert and log when incidents occur, you are taking a big step in that direction.

Monitoring – it’s importance in the Continuous Deployment process

Once the software has been tested and deployed, the time for monitoring arises. Monitoring helps to define the thresholds and notifications and provide definition for who needs to know and when. According to Cory von Wallenstein, Chief Technologist at Dyn, if you set thresholds too aggressively or notifications too broadly, folks will become numb to the system and lose trust, while setting thresholds too tolerantly or notifications too narrowly will cause issues affecting customers to get missed.

Only with effective alert management processes will engineers be able to react to monitoring notifications. Only then will they know if an alert from Logzio’s is indicating a new and relevant issue or if they don’t need to be alerted at all.

The Importance of incident management:

Based on many engineers that I have spoken to, alerts often fall into the chasm that Cory von Wallenstein refers to. That is, incident alerting tools are often too weakly defined and engineers receive numerous excess alerts. What Cory doesn’t mention but probably should is that while broad alerting might be necessary at first, over time you need to learn how to narrow the range. If this does not happen, then you are not effectively relying on the data.

Taking the information from the incident alerting tools in order to create better alerts in the future is a really important piece of feedback. In the realm of constant vigilance, monitoring is as important as testing and delivery. Specifically, it tells you if you are testing for the right things:

  • Security: is there right security embedded in the code
  • Scalability: Can this software handle 100,000 downloads?
  • Integration: Can the code integrate with known and necessary 3rd party software

Advantages from continuous testing

So what is the end result of all this testing? The advantages are many:

  • MTTR decreases: The time it takes for an issue to be recognized and resolved decreases dramatically
  • Lower change failure rate: High-performing organizations (ones that deploy frequently) have a change failure rate that’s three times lower, and they recover 24 times faster from failures.
  • Fast response to software vulnerability announcements: Because your organization is lean and able to adapt, it is better able to see which code produced by which person is responsible for the vulnerabilities and quickly resolve it.
  • Continuous feedback loop: Testing is always bringing to light more issues that need to be tested as well as a recognition of which issues are no longer important
  • More competitive: Because organizations are able to deploy more frequently, they can introduce changes more quickly and better keep up with competition

Continuous alerting is a key driver for many of these advantages in that the alerts from OnPage make it possible for engineers to realize that errors have occurred. Continuous alerting is not only helpful for continuous deployment, it is also essential for monitoring to let the right person know when failures have occurred.

Conclusion

While I’ve provided a framework to look at some of the tools and the necessary mindset for understanding continuous testing, there is always more to do. Embrace the journey and always be vigilant.

OnPage is cloud-based incident alerting and management platform that elevates notifications on your smartphone so they continue to alert until read. Incidents can be programmed to arrive to the person on-call and can be escalated if they are not attended to promptly. Schedule a demonstration today!

OnPage Corporation

Share
Published by
OnPage Corporation

Recent Posts

OnPage’s Strategic Edge Earns Coveted ‘Challenger’ Spot in 2024 Gartner MQ for Clinical Communication & Collaboration

Gartner’s Magic Quadrant for CC&C recognized OnPage for its practical, purpose-built solutions that streamline critical…

2 days ago

Site Reliability Engineer’s Guide to Black Friday

Site Reliability Engineer’s Guide to Black Friday   It’s gotten to the point where Black Friday…

2 weeks ago

Cloud Engineer – Roles and Responsibilities

Cloud engineers have become a vital part of many organizations – orchestrating cloud services to…

1 month ago

The Vitals Signs: Why Managed IT Services for Healthcare?

Organizations across the globe are seeing rapid growth in the technologies they use every day.…

1 month ago

How Effective are Your Alerting Rules?

How Effective Are Your Alerting Rules? Recently, I came across this Reddit post highlighting the…

2 months ago

Using LLMs for Automated IT Incident Management

What Are Large Language Models?  Large language models are algorithms designed to understand, generate, and…

2 months ago