Fight Alert Fatigue

How to Win the Alert Fatigue Battle

IT engineers and DevOps teams cannot help but experience alert fatigue when they receive after-hours alerts lacking context or relevance. A message might tell the on-call engineer that disk space is used up. Does this mean 60% used or 100% used? Or an after-hours message might report a downed server. Which server? Did the backup server come online as a result? Lack of context can cause significant frustration among engineers as well as alert fatigue. The remedy then is to implement an IT alerting system that differentiates high-priority alerts and allows for messaging with attachments.

Impact of alert fatigue

Companies shouldn’t downplay the impact of alert fatigue. There are also significant financial implications for companies if they have stressed-out, unhappy, sleep-deprived engineers. For example, engineers who are feeling the stress of alert fatigue are […] Read more »
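The questions above (how full is the disk, which server, did the backup take over) can be carried in the alert itself. The following is a minimal sketch, not any particular product's API: the `Alert` type, the `disk_alert` and `should_page` helpers, and the 90% threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1
    HIGH = 2


@dataclass
class Alert:
    host: str             # which server the alert is about
    metric: str           # what was measured
    value: float          # the measured value, e.g. percent of disk used
    priority: Priority
    context: dict = field(default_factory=dict)  # extra facts for the on-call engineer


def disk_alert(host: str, used_pct: float, failover_active: bool) -> Alert:
    # Assumed threshold: only a nearly full disk counts as high priority.
    priority = Priority.HIGH if used_pct >= 90.0 else Priority.LOW
    return Alert(
        host=host,
        metric="disk_used_pct",
        value=used_pct,
        priority=priority,
        context={"failover_active": failover_active},
    )


def should_page(alert: Alert, after_hours: bool) -> bool:
    # After hours, only high-priority alerts wake the on-call engineer;
    # during working hours every alert is delivered.
    return alert.priority is Priority.HIGH or not after_hours
```

With these assumptions, a disk at 60% stays quiet overnight while a disk at 100% pages the engineer, and the message already states the host and whether failover is active.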

7 Ways DevOps Can Avoid Alert Fatigue

Being on-call doesn’t have to mean you’re always tired

The introduction of monitoring into the DevOps world means alerts will occur 24/7, and with them comes alert fatigue. Monitoring needs alerts in order to be effective, but while our technology runs 24/7, humans cannot. Clearly, 24/7 alerts need to be better calibrated with human physiological realities in order to avoid alert fatigue. The remedy then is to implement an IT alerting system that differentiates high-priority alerts and allows for messaging with attachments.

Alert fatigue in DevOps

In the traditional IT and DevOps setup, email is the main channel for reporting issues such as deployment problems or server problems. If software fails to deploy correctly, an email goes to a designated engineer. Similarly, if a server experiences a power surge, an email is sent. Monitoring […] Read more »

Why you need to fail to be great at DevOps

Seven steps to failure and greatness

The more I read and learn about how to succeed in DevOps, the more I realize how important failure is to the process. You need to fail to be great at DevOps. Netflix, for example, takes it a step further by introducing failure into its testing process. In our blog post The Seven Deadly Sins of DevOps, we wrote about how you should not do DevOps. Interestingly enough, though, failure is not a sin. In fact, failure is something you should strive for. This post will give you a sense of how you can plan to fail strategically. Embrace DevOps and fail fast.

Fail to be great

How do you succeed by failing? It sounds like a contradiction. Simply put, it’s by building failure into the testing process. Think of it as ‘controlled failure’ whereby you think strategically about where the system is likely […] Read more »
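One way to picture ‘controlled failure’ in a test is to wrap a dependency so that it fails on purpose and then check that the caller copes. This is a rough sketch under stated assumptions, not a description of Netflix's tooling: `InjectedFailure`, `flaky`, and `fetch_with_fallback` are hypothetical names introduced here.

```python
import random


class InjectedFailure(Exception):
    """Raised when the test harness deliberately fails a call."""


def flaky(func, failure_rate, rng=None):
    # Wrap func so that a fraction of calls (failure_rate, between 0.0 and 1.0)
    # raise InjectedFailure instead of running. Pass a seeded random.Random
    # for repeatable tests.
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_with_fallback(primary, fallback):
    # The behavior under test: when the primary call fails, use the fallback.
    try:
        return primary()
    except InjectedFailure:
        return fallback()
```

Setting `failure_rate` to 1.0 in a test forces the failure path every time, which makes it possible to confirm that the fallback really works before a real outage does the same check in production.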

Feel the burnout

Everyone on your team is feeling the pain.

Eleven practical ways for DevOps engineers to better manage their work environment

At OnPage, we know how serious DevOps burnout is and have explored it in other formats such as our e-book and video. The seriousness of the issue is highlighted by the following components:

Decreased employee happiness. Employees become less satisfied and content with their work.
Decreased productivity. Because employees are fatigued, they are less productive.
Frequent job shifts. Throughout the industry, it has become standard for engineers to switch jobs every 2 to 3 years in hopes of finding employment that won’t burn them out.

How to recognize DevOps burnout

How do you realize that you are suffering from burnout? It’s like the famous description of a frog in boiling water. The frog only knows he’s going to die when it’s too late. Similarly, the engineer only knows they are suffering burnout when they have either burnt bridges or broken friendships […] Read more »

When tools are out of control you need critical alerting

Tools going rogue: a story for Halloween

We have all heard stories of DevOps woe. Some tales are sad. Some tales describe true misfortune. And some tales just leave you wondering what the heck the developers were thinking. This story is a tale of the latter kind. It tells how some developers at a start-up in New England created code that was supposed to live and work on AWS EC2 servers. However, the developers never thought to test what they were spinning up or to put critical alerting in place for when things went wrong. And that is where our tale of woe begins.

Tale number 1: Automation destroyed the world

What the tool was supposed to do has long since been forgotten, but the horror and nightmares it caused will not go away so soon. At the start-up I am referring to here, every protocol […] Read more »

The Seven Deadly Sins of DevOps

What to avoid when you start DevOps

While there are many ways to do DevOps correctly, there are specific cardinal sins that will put you afoul of the Church of DevOps. From lacking an incident management tool to handle critical alerts to treating DevOps as a job title, there are many ways for you to hurt your status as an A-class DevOps shop. In order to achieve excellence in DevOps, it is key for executives to avoid committing the cardinal sins of DevOps that are discussed below.

DevOps sin 1: You treat DevOps as a title, not a philosophy

In speaking to directors of engineering at numerous companies, I have heard the phrase: ‘if you have DevOps in your title, you’re doing it wrong’. The point of this statement is that DevOps is a philosophy, not a title. You shouldn’t assume that you can simply put the word ‘DevOps’ in someone’s title and get anywhere […] Read more »

Netflix earnings, DevOps and profitability

DevOps as the road to profitability

Netflix released a great earnings report earlier this week. According to the Wall Street Journal’s page one article on October 18th, “Netflix Inc. blew through its forecast for the subscriber additions in the September quarter…sending its shares soaring 20% in after-hours trading. … The better-than-expected performance came mainly in international markets, where the company has completed a massive, near global expansion this year.”

For anyone who reads the DevOps literature, this success doesn’t come as a surprise. Rapid testing and provisioning is the name of the game at Netflix. Puppet’s 2016 State of DevOps Report notes that Netflix is among the top DevOps performers: “High [DevOps] performers deploy on demand, with Etsy deploying 80 times per day, and large companies like Amazon or Netflix deploying thousands of times per day.”

So how tightly are the components of profitability and DevOps excellence intertwined? Perhaps another […] Read more »

Serverless promises and the persistent need for critical alerting

Why serverless computing doesn’t end the need for security or alerts

Serverless computing removes the problem of managing servers. For many small start-ups, this is a huge advantage, as the cost of purchasing, maintaining and scaling servers is a real pain point. Serverless also holds out the prospect of ending the need for Ops as we know it, for security worries and for being on-call. But while this modern-day DevOps marvel might seem like a panacea, serverless computing needs to come with a healthy dose of reality.

The reality of serverless

In an article I recently posted to DZone entitled “How Smart Is Serverless”, I question how smart it is to outsource your security concerns to a third party like AWS. As I note in the article, you cannot abstract security without facing some pretty scary consequences. Amichai […] Read more »

NoOps and the Need for Critical Alerting

NoOps eschews critical alerting at its own peril

Many start-ups embrace serverless architectures on platforms such as AWS, believing they will be able to adopt NoOps. NoOps means no worries about servers because everything is in the cloud, and if there are no worries about servers, the thinking goes, there is no need to worry about critical alerting. The reality is slightly different. No matter how minimized Ops becomes, there will always be a need for strong incident management applications. The shift will simply push monitoring further from an Ops-only role to an important role for everyone on the development team.

What is NoOps and why is there so much criticism?

NoOps defines an IT environment that is so automated and abstracted from the underlying infrastructure that there is no need for a dedicated team to manage Ops in-house. The two main drivers behind NoOps are increasing IT automation and cloud computing. Even among […] Read more »

Constant Vigilance in Continuous Delivery – How to do DevOps right

Alastor Moody's mantra of  "constant vigilance" is applicable to DevOps.

The importance of monitoring and alerting in the continuous delivery cycle

In the 2016 State of DevOps Report, Puppet reported that top DevOps shops like Amazon or Etsy deploy new software releases multiple times per day. The next tier of companies deploys on a weekly or monthly basis. What is the difference between high and medium IT performers? More than just tools, it is mindset. In addition to building code, the top DevOps teams have adopted a mindset of “constant vigilance”, as Mad-Eye Moody said to Harry Potter. Not only should teams always be building, but they should also always be testing. Through testing they should remain constantly vigilant, attuned to their software and its performance. Equally important in this process is incident alerting, which lets both Dev and Ops know when things go awry.

A cautionary tale: What happens without constant vigilance

While many DevOps […] Read more »

The Secret to Making Your DevOps Team World Class

Continuous deployment is key to world-class DevOps

With their State of DevOps report released at the beginning of the summer, Puppet clearly defined the characteristics of world-class DevOps organizations and the make-up of those lagging behind. According to Nigel Kersten, CIO of Puppet, there is a huge gap between organizations that get DevOps and are able to ship software on demand and “organizations that take days, weeks or even years to ship simple upgrades … and the gap is widening”.

Where is your company on the spectrum? Is your company deploying 80 times per day like Etsy or thousands of times per day like Amazon? Is your company one of those that spends 50% less time remediating security issues than low performers, and 22% less time on unplanned work? How much time does your team have for building new code? Perhaps you don’t even know the exact answer […] Read more »

What you need to know about MTTR and why IT MaTTeRs

What all engineering teams should know about MTTR

In the IT world, performance is everything. So when technology fails, your first thought is how to use incident management knowledge to repair the situation and minimize downtime. As both a manager and an engineer, you need to minimize your MTTR (Mean Time To Resolution) in order to comply with your SLAs (service level agreements) and keep your group at the top of its game. This article will highlight the issues impeding effective MTTR management and offer insights on how to improve the use of MTTR as a metric.

Who cares about MTTR

I have asserted the importance of MTTR without saying to whom in particular the metric matters. The truth is that just about everyone in engineering uses MTTR to measure how long it takes their teams to resolve an incident after it has […] Read more »
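As a rough illustration of the metric, MTTR can be computed as the average of each incident's resolution time. The sketch below assumes each incident is recorded as an (opened, resolved) pair of timestamps; the function names and the SLA comparison are hypothetical and not taken from any specific incident management tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    # incidents: list of (opened, resolved) datetime pairs for resolved incidents.
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


def meets_sla(mttr, sla_target):
    # True when the measured MTTR is within the agreed target duration.
    return mttr <= sla_target


if __name__ == "__main__":
    history = [
        (datetime(2016, 10, 1, 2, 0), datetime(2016, 10, 1, 2, 30)),   # 30 minutes
        (datetime(2016, 10, 5, 14, 0), datetime(2016, 10, 5, 15, 30)), # 90 minutes
    ]
    mttr = mean_time_to_resolution(history)
    print(mttr, meets_sla(mttr, timedelta(hours=1)))
```

Because a mean hides outliers, a single very long incident can dominate the figure, so it can be worth looking at the individual durations alongside the average.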

The secret to blameless post mortems

How your engineering teams can move past finger-pointing to effectively managing mistakes

Sidney Dekker’s theory on ‘bad apples’ describes the belief that complex systems would be fine if it were not for the erratic behavior of some unreliable people. According to this theory, when unexpected events are seen in an otherwise safe system, they are typically and conveniently assigned to “human error” and, when they are severe, to “operator carelessness”. Similarly, post mortems often look to define and parcel out blame to engineers. Yet this raises the question of how effective post mortems are if their only purpose is to assign blame. Instead, effective post mortems need to “acknowledge the human tendency to blame, to allow for a productive form of its expression, and constantly refocus the postmortem’s attention past it.”

Post mortems vs retrospectives

The problem with post mortems begins with the name “post mortem”, which if you ask […] Read more »

Bringing Dev and Ops together with on-call groups

Make Dev and Ops better together by building empathy with on-call groups

Much has been written on the tension that often exists between Dev and Ops teams in an organization. All too frequently, Devs are focused on rapid prototyping and creating code while Ops are focused on keeping the ship stable and making as few changes as possible. When I was at the DevOps Boston Conference last week, much of the “hallway conference” was devoted to conversations on how to build empathy between these frenemies and reduce the opposition between them. How can Dev and Ops become less siloed? How can management encourage cross-pollination? One important psychological realization was that in order to create empathy between these two groups and ensure an effective group dynamic, the teams need to spend more time walking in one another’s shoes. One strong and significant step that can […] Read more »