In the 2016 State of DevOps report, Puppet reported that the top DevOps shops like Amazon or Etsy deploy new software releases multiple times per day. The next tier of companies deploy on a weekly or monthly basis. What is the difference between High and Medium IT performers? More than just tools, it is mindset.
In addition to building code, the top DevOps teams have adopted a mindset of “constant vigilance”, as Mad-Eye Moody said to Harry Potter. Not only should teams always be building but they should also always be testing. Through testing they should remain constantly vigilant – attuned to their software and its performance. And equally important in this process should be incident alerting to let both Dev and Ops know when things go awry.
While many DevOps shops buy into the testing mindset, it is important to present an awful case of what happens when sufficient testing is not done throughout the stack. In 2012, Knight Capital Group famously lost $460 million in 45 minutes of trading on the stock exchanges due to a failed software update. According to one article I read on the matter, “Had Knight automated its deployments, fully re-deployed servers periodically using automated tools, or removed old unused code from its codebase more aggressively, a technician’s failure to deploy code to the eighth server in a cluster would not have had such disastrous results.” Cluster indeed.
This cautionary tale instructs us that it’s not enough to build great software. You need to test it as well. If Knight Capital had employed a continuous delivery philosophy, they would have realized that their fifth column lay with their eighth server. Knight Capital needed to test and simulate a full deployment in order to truly understand the robustness of their system.
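To make the "eighth server" problem concrete, here is a minimal sketch of a post-deployment consistency check. It is an illustration only, not a description of Knight Capital's systems: the server names, version strings, and the `find_stale_servers` helper are all hypothetical, and in practice the inventory would come from your own deployment tooling.

```python
from collections import Counter


def find_stale_servers(deployed_versions, expected_version):
    """Return the servers whose deployed version differs from the expected one."""
    return sorted(
        server
        for server, version in deployed_versions.items()
        if version != expected_version
    )


def summarize_versions(deployed_versions):
    """Count how many servers are running each version."""
    return Counter(deployed_versions.values())


# Hypothetical inventory: eight servers, one of which missed the update.
cluster = {f"server-{i}": "2.0.0" for i in range(1, 8)}
cluster["server-8"] = "1.9.3"

stale = find_stale_servers(cluster, expected_version="2.0.0")
if stale:
    print(f"Deployment incomplete, stale servers: {stale}")
```

A check like this is cheap to run after every release, and a non-empty result is exactly the kind of event that should raise an alert rather than wait to be noticed.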
Detecting issues such as those faced by Knight Capital and numerous others (although hopefully not at quite the same scale) requires tools. In fact, one can argue that DevOps today has become intertwined with tools ranging from GitHub to Fortify. If you have a spare moment, take a look at XebiaLabs’ awesome periodic table, which catalogs the tools available to DevOps.
Testing begins with continuous integration (CI). CI, the practice in which code changes are tested as soon as they are committed, is supported by tools such as Apache Continuum, Bamboo, Codeship, Jenkins/Hudson and Travis CI. Continuous integration is supposed to make merging software into the trunk everyone’s daily work. It is at this point that software engineers can begin to see what problems their software (and that of their colleagues) is bringing into the system.
This is a great opportunity for engineers to learn what bugs they are bringing to the steady-state system and to create alerts with tools like OnPage that keep them apprised of these issues in future releases. Integrating your team’s incident management process into your continuous integration workflow is an excellent way to improve communication and transparency around failed builds.
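As a rough sketch of what wiring alerts into a CI workflow can look like, the snippet below runs a build step and hands a small alert payload to a notifier when the step fails. Everything here is an assumption for illustration: the payload fields, the `check_step` helper, and the `notify` callable are hypothetical and do not represent the API of OnPage or of any CI product. In a real pipeline, `notify` would call whatever alerting integration your team uses.

```python
import subprocess
import sys


def run_build_step(command):
    """Run one build or test command and return (exit code, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def build_alert(step_name, returncode, output, max_chars=500):
    """Turn a failed step into a small alert payload (hypothetical schema)."""
    return {
        "title": f"Build step '{step_name}' failed with exit code {returncode}",
        "details": output[-max_chars:],  # keep only the tail of the log
        "severity": "high",
    }


def check_step(step_name, command, notify):
    """Run a step and call notify(alert) if it fails. Returns True on success."""
    returncode, output = run_build_step(command)
    if returncode != 0:
        notify(build_alert(step_name, returncode, output))
        return False
    return True


# Demo: a stand-in "test suite" that always fails, with a list as the notifier.
sent_alerts = []
failing_suite = [
    sys.executable,
    "-c",
    "import sys; print('1 test failed'); sys.exit(1)",
]
build_ok = check_step("unit-tests", failing_suite, sent_alerts.append)
```

Passing the notifier in as a parameter keeps the build logic testable on its own and makes it easy to swap one alerting tool for another.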
Following CI is Continuous Delivery (CD). Martin Fowler describes CD as an extension of continuous integration in which software is built so that it can be released to production at any time. When every change that passes the automated tests is released to production automatically, the practice is usually called continuous deployment. Tools such as Puppet, Ansible, Chef or Salt support CD by automating configuration and deployment. These tools reduce the time each process takes and the propensity for errors while improving reproducibility.
Continuous Delivery also requires that whenever anyone makes a change that causes an automated test to fail, breaking the deployment pipeline, the pipeline and all associated systems must be brought back into a deployable state. By having loggable incidents, you can record the steps taken towards remediation and have that information ready for future use. You want your company to place a high premium on learning, and by having processes in place that alert and log when incidents occur, you are taking a big step in that direction.
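One simple way to picture a loggable incident is as a record that accumulates timestamped remediation steps. The sketch below is a hypothetical data structure, not the format used by any particular incident management tool; the `Incident` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    """A minimal incident record with a timestamped remediation log."""

    summary: str
    steps: list = field(default_factory=list)
    resolved: bool = False

    def log_step(self, description):
        """Record one remediation step along with the current UTC time."""
        self.steps.append((datetime.now(timezone.utc), description))

    def resolve(self, description):
        """Record the final step and mark the incident as resolved."""
        self.log_step(description)
        self.resolved = True


incident = Incident("Pipeline red: integration test failing on main")
incident.log_step("Reverted the offending commit")
incident.resolve("Pipeline green again after rerun")

for timestamp, description in incident.steps:
    print(timestamp.isoformat(), description)
```

Even a record this small answers the questions a later reader will ask: what broke, what was tried, and in what order.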
Once the software has been tested and deployed, it is time for monitoring. Setting up monitoring means defining thresholds and notifications, and deciding who needs to know and when. According to Cory von Wallenstein, Chief Technologist at Dyn, if you set thresholds too aggressively or notifications too broadly, folks will become numb to the system and lose trust, while setting thresholds too tolerantly or notifications too narrowly will cause issues affecting customers to get missed.
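The trade-off between aggressive and tolerant thresholds can be sketched with a small rule evaluator. This is one possible approach, assuming that requiring several consecutive breaches is an acceptable way to ignore brief spikes; the `AlertRule` fields, the metric name, the numbers, and the team name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    metric: str
    threshold: float
    min_consecutive: int  # breaches required in a row before notifying
    notify: str           # who needs to know (hypothetical team name)


def should_alert(rule, samples):
    """Return True if the most recent samples all exceed the threshold."""
    if len(samples) < rule.min_consecutive:
        return False
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.threshold for value in recent)


rule = AlertRule(
    metric="p95_latency_ms",
    threshold=800,
    min_consecutive=3,
    notify="ops-on-call",
)

brief_spike = [300, 950, 310, 320]
sustained_breach = [300, 900, 950, 1000]

for name, samples in [("brief spike", brief_spike), ("sustained", sustained_breach)]:
    if should_alert(rule, samples):
        print(f"{name}: notify {rule.notify} about {rule.metric}")
    else:
        print(f"{name}: no alert")
```

Lowering `threshold` or `min_consecutive` makes the rule more aggressive, and raising them makes it more tolerant, which is the dial von Wallenstein is describing.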
Only with effective alert management processes will engineers be able to react to monitoring notifications. Only then will they know whether an alert from Logz.io indicates a new and relevant issue or whether they don’t need to be alerted at all.
According to many engineers I have spoken to, alerts often fall into the chasm that Cory von Wallenstein refers to. That is, alerting rules are often too loosely defined and engineers receive numerous excess alerts. What Cory doesn’t mention but probably should is that while broad alerting might be necessary at first, over time you need to learn how to narrow the range. If this does not happen, then you are not making effective use of the data.
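One way to learn where to narrow the range is to review alert history and see which rules rarely lead to action. The following is a minimal sketch under the assumption that each past alert has been marked as actionable or not; the rule names, the cutoff values, and the `noisy_rules` helper are hypothetical.

```python
from collections import Counter, defaultdict


def actionable_ratio(history):
    """Map each rule name to the fraction of its alerts that needed action."""
    totals = defaultdict(int)
    acted = defaultdict(int)
    for rule_name, was_actionable in history:
        totals[rule_name] += 1
        if was_actionable:
            acted[rule_name] += 1
    return {name: acted[name] / totals[name] for name in totals}


def noisy_rules(history, min_ratio=0.5, min_alerts=3):
    """Return rules with enough alerts to judge and a low actionable ratio."""
    counts = Counter(name for name, _ in history)
    ratios = actionable_ratio(history)
    return sorted(
        name
        for name, ratio in ratios.items()
        if counts[name] >= min_alerts and ratio < min_ratio
    )


# Hypothetical history of (rule name, did someone need to act?) pairs.
history = (
    [("disk_usage_warn", False)] * 4
    + [("disk_usage_warn", True)]
    + [("api_5xx_rate", True)] * 3
    + [("cpu_high", False)]
)

print(noisy_rules(history))
```

Rules that show up in this list are candidates for a tighter threshold or a narrower audience, while rules with too few alerts to judge are left alone until more data arrives.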
Taking the information from the incident alerting tools in order to create better alerts in the future is a really important piece of feedback. In the realm of constant vigilance, monitoring is as important as testing and delivery. Specifically, it tells you if you are testing for the right things.
So what is the end result of all this testing? The advantages are many.
Continuous alerting is a key driver for many of these advantages in that the alerts from OnPage make it possible for engineers to realize that errors have occurred. Continuous alerting is not only helpful for continuous deployment; it is also essential for monitoring, because it lets the right person know when failures have occurred.
While I’ve provided a framework to look at some of the tools and the necessary mindset for understanding continuous testing, there is always more to do. Embrace the journey and always be vigilant.
OnPage is a cloud-based incident alerting and management platform that elevates notifications on your smartphone so they continue to alert until read. Incidents can be routed to the person on call and escalated if they are not attended to promptly.