Effective IT incident management is concerned with deviations from, and threats to, the standard operation of services. Over time, even the best IT department will experience incidents. How IT reacts to incidents is a key driver of MTTR (mean time to repair) as well as customer satisfaction.
Yet not all IT departments handle incidents the same way, and those differences help explain why some IT teams are more successful than others. Some departments treat each incident as a chance to learn how to improve for the future, while others attempt to resolve each issue as quickly as possible without looking back.
How can IT departments improve their incident management process? What processes and procedures do teams need to adopt? What tools should they bring on board? Read on to learn more.
IT teams need to experiment with their incident management process to see which practices are effective and which are not. Experimentation and iteration are key to making on-call rotations better over time. While this has become gospel for development teams, it also rings true for Ops and IT teams, which can run simulations of outages to determine how best to manage downed servers, site latency or other similar situations. This practice is the idea behind the Simian Army at Netflix, a set of tools that deliberately introduces failures, and it has been credited with much of their DevOps success.
Quick detection of potential issues is the most important objective of monitoring and alerting. The difficulty lies in pursuing two conflicting goals: speed and accuracy. In Target’s 2013 data breach, staff in Bangalore, India, notified Target staff in Minneapolis that they had detected an attack. However, no action was taken because these alerts were buried among many other alerts that were likely false. Target’s IT, like many other tech departments, suffered from alert overload as 52% of alerts were false positives. In the case of the 2013 hack, the Minneapolis-based team had become desensitized to actual alerts due to the overwhelming number of false ones.
Effective incident management is also about reducing the noise so IT teams know which alerts truly require a reaction at 2 a.m. Too many events and alerts, many of them false positives, will reduce the effectiveness of IT operations, and you’ll start to overlook the important ones. Consequently, it is important to learn which statistics are worth tracking. Is it MySQL availability, aborted connections or error logs? Know which ones are important for your organization and alert on them.
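As a rough illustration of the noise reduction described above, the following Python sketch pages only on critical alerts for a short allow-list of metrics and suppresses repeats that arrive within a fixed window. Everything in it is an assumption made for illustration: the `Alert` fields, the `PAGEABLE` allow-list, the ten-minute `SUPPRESS_WINDOW` and the `alerts_to_page` function are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    source: str        # e.g. "mysql"
    metric: str        # e.g. "availability"
    severity: str      # "info", "warning" or "critical"
    timestamp: datetime


# Hypothetical policy: only these (source, metric) pairs may page someone.
PAGEABLE = {("mysql", "availability"), ("mysql", "aborted_connections")}

# Hypothetical window in which repeats of an already-paged alert are dropped.
SUPPRESS_WINDOW = timedelta(minutes=10)


def alerts_to_page(alerts):
    """Return the alerts worth waking someone for, oldest first."""
    last_paged = {}  # (source, metric) -> timestamp of the last page sent
    result = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.metric)
        # Drop anything that is not critical or not on the allow-list.
        if alert.severity != "critical" or key not in PAGEABLE:
            continue
        # Drop repeats of the same alert inside the suppression window.
        previous = last_paged.get(key)
        if previous is not None and alert.timestamp - previous < SUPPRESS_WINDOW:
            continue
        last_paged[key] = alert.timestamp
        result.append(alert)
    return result
```

Which metrics belong on the allow-list, and how long the suppression window should be, are decisions each organization has to make for itself; the values here are placeholders.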
Effective IT incident management requires effective use of tools. Most IT teams have an abundance of tools, so having them is less of an issue than determining which ones are crucial in a time of need.
If a task can be automated, then there is no reason an engineer needs to be alerted to the event. For example, if automated backups are available, then the IT team should bring on the technology and tools that enable them. Enabling automation means that teams save on engineer hours and reduce the potential for mistakes.
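One way to picture this principle is an event handler that tries an automated fix first and pages a person only when no fix exists or the fix fails. The Python sketch below is a minimal illustration under stated assumptions: `handle_event`, the `remediations` mapping and the `page` callback are hypothetical names, not part of any specific alerting product.

```python
from typing import Callable, Dict


def handle_event(
    event_type: str,
    remediations: Dict[str, Callable[[], bool]],
    page: Callable[[str], None],
) -> str:
    """Try an automated fix first; page an engineer only as a fallback.

    `remediations` maps an event type to a function that attempts the fix
    and returns True on success. `page` notifies the on-call engineer.
    Both are hypothetical hooks supplied by the caller.
    """
    handler = remediations.get(event_type)
    if handler is not None:
        try:
            if handler():
                return "automated"
        except Exception:
            # A crashing remediation is treated like a failed one.
            pass
    page(event_type)
    return "paged"
```

With a mapping such as `{"backup_due": run_backup}`, where `run_backup` is a function you supply, a `backup_due` event is resolved without waking anyone, while an unknown or failed event still reaches the on-call engineer.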
Engineers should only be brought in to work on a problem where their knowledge can add value. This is particularly true for issues picked up by monitoring and alerting tools. Indeed, effective identification of problems is the first step in successful incident management. Effective alerting brings the need for incident management to the forefront.
Effective alerting speeds up subsequent repair and recovery. Reporting gives teams a record of what they have achieved and lets them bring that knowledge back into the virtuous cycle. Yet effective incident management begins with an understanding of the points outlined above.
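To make the reporting step concrete, here is a small Python sketch that computes MTTR from a list of incident records. It assumes each record is a (detected, resolved) pair of timestamps and that MTTR is the simple average of the resolution times; the function name `mean_time_to_repair` and the record layout are illustrative choices, and teams often define the start and end of an incident differently.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def mean_time_to_repair(
    incidents: List[Tuple[datetime, datetime]],
) -> Optional[timedelta]:
    """Average time from detection to resolution.

    Each incident is a (detected, resolved) pair. Returns None when
    there are no incidents, since an average is undefined in that case.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        return None
    # Start the sum from an empty timedelta so the durations add up cleanly.
    return sum(durations, timedelta()) / len(durations)
```

Tracking this figure from one reporting period to the next is one simple way to see whether changes to alerting and automation are actually shortening recovery.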
To read three more tips on how IT can improve incident management, download our white paper.