What all engineering teams should know about MTTR
In the IT world, performance is everything. So when technology fails, your first thought is how to use your incident management knowledge to repair the situation and minimize downtime. As both a manager and an engineer, you need to minimize your MTTR (mean time to resolution) in order to comply with your SLAs (service level agreements) and keep your group at the top of its game. You want to ensure that ITIL (Information Technology Infrastructure Library) and ITSM (information technology service management) best practices are followed so that you can manage incidents effectively. Even in the best scenario, however, failures are still part of the game. You need a plan for receiving alerts through your incident management tools when an event occurs, so that you can quickly deploy your team to fix the issue. Yet an ideal response is not easily achieved, because the impediments to decreasing MTTR are abundant: engineers not being alerted to incidents, excessive latency in getting the message to engineers, insufficient data accompanying an alert, and many more. This article highlights the issues that impede effective MTTR management and offers insights on how to improve your use of MTTR as a metric.
Who cares about MTTR
So far I have stressed the importance of MTTR without saying to whom the metric matters. The truth is that just about everyone in engineering uses MTTR to measure how long it takes their teams to resolve an incident after it has been reported. An incident might be a server that is down, a component that is running too slowly, or software that fails to deploy or deploys incorrectly. Here’s how three different groups encounter MTTR:
• IT Shops – MTTR is used by countless IT shops to delve into issues such as why the repair time for components is too high. For individuals working in IT shops, MTTR often means the time until a failed or broken part is replaced.
• DevOps – According to Payal, MTTR is a “true indicator of how good [the team is] getting with handling change.” When a deployment goes wrong or unusual activity occurs on the server, the DevOps team should be prepared to handle the issue in a time span agreed to by management. There will inevitably be spikes when the DevOps team encounters an issue it has never faced before, but the goal is to have MTTR decrease over time.
• MSPs – Managed service providers constantly look at MTTR because it defines their efficiency. MSPs look across the whole range of issues, from monitoring to testing, to keep MTTR to a minimum. According to Kaseya, the key is to reduce the variability in time spent resolving issues.
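To make the metric concrete, here is a minimal sketch of how MTTR and its variability might be computed from a list of per-incident resolution times. The sample durations and the function names `mttr` and `variability` are illustrative assumptions, not data or tooling from any of the groups mentioned above.

```python
from statistics import mean, pstdev

# Hypothetical resolution times, in minutes, for one team's incidents
# (illustrative numbers only).
resolution_minutes = [30, 45, 60, 45]


def mttr(durations):
    """Mean time to resolution: the arithmetic mean of per-incident durations."""
    return mean(durations)


def variability(durations):
    """Population standard deviation of the resolution times."""
    return pstdev(durations)


print(f"MTTR: {mttr(resolution_minutes):.1f} min")
print(f"Std dev: {variability(resolution_minutes):.1f} min")
```

Tracking both numbers over successive weeks or months is one way to see whether MTTR is trending down and whether resolution times are becoming more predictable.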
Issues impeding effective MTTR
While the importance of MTTR is generally acknowledged, the impediments to its effective management are many.
• Data channel connectivity. Consider, for example, the situation where you have a team in India. Your U.S.-based team should cover the hours not worked in India and vice versa. Yet because of the high cost of mobile data, your team in India turns its data connection off and is reachable only from the office. Since the India team is delayed in receiving and responding to messages, MTTR increases.
• Lack of effective monitoring tools. There is often no baseline for how your system should operate. In this situation, ITIL’s framework of best practices for aligning IT with business needs is not being followed. Instead, your teams use homegrown tools to monitor the system and create a baseline, and effective ITSM best practices are ignored. Without proper tools, or with tools that lack the necessary robustness, you cannot truly understand how your system is behaving.
• No escalation. Even if an engineer is alerted to the incident, they have no easy way to escalate the issue once they realize the scope of the problem. Effective resolution often requires bringing in other members of the team.
• No audit trails. No record exists of who was alerted and on what criteria. Looking back, management cannot see what caused the most recent alert, who was notified, and in which order.
• No scheduling tools. Management cannot coordinate who is to be alerted based on the type of incident. Instead, the whole team is alerted regardless of each member’s ability to provide insight or assistance.
• Excessive alerting. The team receives too many false positives, inevitably begins to ignore alerts, and eventually starts to miss important ones.
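Several of the gaps above (no escalation path, no audit trail, no routing by incident type) can be illustrated with a short sketch. Everything in it is hypothetical: the responder names, the `ESCALATION_CHAINS` table, and the `notify_next` function are assumptions made for illustration and do not describe any particular product's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical on-call chains keyed by incident type; names are illustrative.
ESCALATION_CHAINS = {
    "database": ["dba_on_call", "dba_lead", "engineering_manager"],
    "network": ["network_on_call", "network_lead"],
}
DEFAULT_CHAIN = ["general_on_call", "engineering_manager"]


@dataclass
class Incident:
    incident_id: str
    incident_type: str
    audit_trail: list = field(default_factory=list)
    level: int = 0  # index of the next responder in the chain


def chain_for(incident):
    """Pick the responder chain for this incident type, or the default chain."""
    return ESCALATION_CHAINS.get(incident.incident_type, DEFAULT_CHAIN)


def notify_next(incident, reason):
    """Alert the next responder in the chain and record who was notified and why.

    Returns the responder's name, or None once the chain is exhausted.
    """
    chain = chain_for(incident)
    if incident.level >= len(chain):
        return None
    responder = chain[incident.level]
    incident.level += 1
    incident.audit_trail.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "responder": responder,
        "reason": reason,
    })
    return responder


incident = Incident("INC-1", "database")
notify_next(incident, "initial alert")
notify_next(incident, "not acknowledged, escalating")
for entry in incident.audit_trail:
    print(entry["time"], entry["responder"], entry["reason"])
```

Because every notification is appended to `audit_trail`, the order in which people were alerted can be reconstructed afterwards, and only the chain for the relevant incident type is contacted rather than the whole team.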
How to improve your MTTR
Clearly, MTTR has an impact on many industries, and these industries recognize its importance. The question that remains is how to remove the barriers to decreasing MTTR and improve MTTR management.
• Start using better monitoring tools. Use good monitoring tools such as Labtech, Nagios or Kaseya, and make sure each alert arrives as a ticket with as much relevant information as possible. The time taken to become aware of a problem depends primarily on the sophistication of the monitoring system(s). Identifying the root cause is usually the biggest source of MTTR variability and the one that has the highest cost associated with it. Once again, the solution lies both with the tools you use and the processes you put in place.
• Ensure that information goes to the right person. Make sure alerts go to the right person at the right time. Every time. When a monitoring system detects an issue and sends an email, use OnPage to make sure that the correct engineer is alerted.
• Enable escalations. Have escalations enabled so that engineers can reach out for the expertise and assistance of others in their DevOps group, MSP or IT team.
• Facilitate communications by allowing for redundancies. When engineers are outside the Wi-Fi magic garden, they need tools that let them be reached even without a data connection. Ensure that your engineers have redundant channels that allow them to be reached by phone.
• Know what is low vs. high priority. Not all alerts are created equal. Make sure engineers know when an alert is low priority so they don’t spend time today on an issue that keeps them from resolving a higher-priority one.
• Measure. Take the time to measure how long it takes your engineers to resolve issues. If MTTR is too high at your shop, you can only start addressing the problem by measuring how long it takes until an issue is identified (MTTI), how long it takes until it is acknowledged (MTTA), and how long it takes until it is resolved (MTTR). Fully 70% of MTTR is taken up by MTTI, so it is important to pinpoint the issue quickly. As the saying often attributed to Peter Drucker goes, “If you can’t measure it you can’t improve it.”
• Bring human insights in as well. If monitoring alerts are not coming in with enough detail, find a way to provide further information or documentation. Also, if the wrong person is being alerted for an issue, determine who is the right person.
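The measurement step in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields (`started`, `identified`, `acknowledged`, `resolved`) and the sample timestamps are hypothetical, and all three metrics are measured from the moment the incident started, which is one possible convention among several.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names and timestamps are illustrative.
incidents = [
    {"started": "2024-01-01T10:00", "identified": "2024-01-01T10:20",
     "acknowledged": "2024-01-01T10:25", "resolved": "2024-01-01T11:00"},
    {"started": "2024-01-02T14:00", "identified": "2024-01-02T14:40",
     "acknowledged": "2024-01-02T14:50", "resolved": "2024-01-02T15:00"},
]


def minutes_between(start, end):
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mean_minutes(records, milestone):
    """Average time from incident start to the given milestone, in minutes."""
    return mean(minutes_between(r["started"], r[milestone]) for r in records)


mtti = mean_minutes(incidents, "identified")    # mean time to identify
mtta = mean_minutes(incidents, "acknowledged")  # mean time to acknowledge
mttr_minutes = mean_minutes(incidents, "resolved")  # mean time to resolve

print(f"MTTI: {mtti:.1f} min")
print(f"MTTA: {mtta:.1f} min")
print(f"MTTR: {mttr_minutes:.1f} min")
print(f"Identification share of MTTR: {mtti / mttr_minutes:.0%}")
```

Breaking MTTR into these stages shows where the time actually goes, which in turn indicates whether better monitoring, better routing, or better documentation is the most useful fix.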
Effective incident management is key to improving MTTR. If incidents are not handled correctly and MTTR continues to rise, the bottom line (revenue) will take a beating along with corporate reputation. SLAs will not be met and productivity will diminish. Rather than remaining helpless in the battle to improve MTTR, IT professionals can consider and implement the key components outlined in this article.
OnPage is a cloud-based incident alerting and management platform that elevates notifications on your smartphone so they continue to alert until read. Incidents can be routed to the person on call and escalated if they are not attended to promptly. Schedule a demonstration today!