What You Need to Know About MTTR and Why IT MaTTeRs

What All Engineering Teams Should Know About Mean Time to Resolution (MTTR)

In the IT world, performance is everything. When technology fails, your first thought is how to apply your incident management knowledge to repair the situation and minimize downtime. As both manager and engineer, you need to minimize your mean time to resolution (MTTR) to comply with your service level agreements (SLAs) and keep your group at the top of its game. You also want to follow Information Technology Infrastructure Library (ITIL) and information technology service management (ITSM) best practices so that you can manage incidents effectively.

However, even in the best scenario, failures are still part of the game. You need a plan for receiving alerts through your incident management tools so that you know an event has occurred and can quickly deploy your team to fix the issue. That ideal response is not easily achieved. The impediments to decreasing MTTR are abundant: engineers are not alerted to incidents, messages reach engineers with excessive latency, alerts arrive with insufficient data, and more.

This article highlights the issues impeding effective MTTR management and offers insights on how to improve the use of MTTR as a metric.


Who Cares About MTTR?

So far I have stressed the importance of MTTR without saying to whom the metric matters. The truth is that just about everyone in engineering uses MTTR to measure how long it takes their teams to resolve an incident after it has been reported. An incident might be a server that is down, a component that is running too slowly or software that fails to deploy correctly. Here’s how three different groups encounter MTTR:

IT shops: MTTR is used by countless IT shops to delve into issues, such as why the repair time for components is too high. For individuals working in IT shops, MTTR often means the time until a failed or broken part is replaced.
DevOps: According to Payal Chakravarty, MTTR is a “True indicator of how good [the team is] getting with handling change.” When a deployment goes wrong or unusual activity occurs on the server, the DevOps team should be prepared to handle the issue in a time span agreed to by management. There will inevitably be spikes when the DevOps team encounters an issue it has never faced before, but the goal is to have MTTR decrease over time.
MSPs: Managed service providers are constantly looking at MTTR as it defines their efficiency. MSPs look across the range of issues from monitoring to testing to constantly minimize MTTR. According to Kaseya, the key is to reduce the variability in time spent resolving issues.

Issues Impeding Effective MTTR

While the importance of MTTR is generally acknowledged, the impediments to its effective management are many.

Data channel connectivity. Consider, for example, the situation where you have a team in India. Your U.S.-based team should cover the hours when the India team is off, and vice versa. Yet, due to the high cost of the data channel, team members in India turn their data channel off and are only reachable when they are in the office. Since your India team is delayed in receiving and responding to messages, MTTR increases.
Lack of effective monitoring tools. There is often no baseline for how your system should operate. In this situation, ITIL’s framework of best practices for aligning IT with business needs is undermined, and effective ITSM best practices are ignored. Instead, teams use homegrown tools to monitor the system and create a baseline. Without proper tools, or with tools that lack the necessary robustness, you cannot truly understand what your monitoring is telling you.
No escalation. Even if an engineer is alerted to the incident, they have no easy way to escalate the issue once they realize the scope of the problem. Often, effective resolution requires bringing in other members of your team.
Audit trails. No record exists of who was alerted or on what criterion. Looking back, management cannot see what caused the most recent alert, who was notified and in what order.
Scheduling tools. Management cannot coordinate who is alerted based on the type of incident. Instead, the whole team is alerted regardless of their ability to provide insight or assistance.
Excessive alerting. The team receives too many false positives and inevitably begins to ignore alerts. Engineers then start to miss critical notifications.
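
Three of the gaps above (no escalation, no audit trail, alerting the whole team regardless of incident type) describe missing behavior rather than missing hardware. As a rough illustration only, here is a minimal Python sketch of what routing by incident type, escalating, and recording each decision might look like. The class name, the chain layout and the responder names are all hypothetical, and the actual delivery of the alert is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EscalationPolicy:
    # Hypothetical structure: incident type -> ordered list of people to notify.
    chains: dict
    audit_log: list = field(default_factory=list)

    def notify(self, incident_type, level=0, now=None):
        """Pick the responder at the given escalation level and record the decision."""
        chain = self.chains.get(incident_type, self.chains["default"])
        # Clamp so repeated escalations stay on the last person in the chain.
        responder = chain[min(level, len(chain) - 1)]
        self.audit_log.append({
            "time": now or datetime.now(timezone.utc),
            "incident_type": incident_type,
            "level": level,
            "notified": responder,
        })
        return responder


# Made-up responder names, for illustration only.
policy = EscalationPolicy(chains={
    "database": ["dba-on-call", "dba-lead"],
    "default": ["ops-on-call", "ops-manager"],
})

first = policy.notify("database")            # first responder for this incident type
second = policy.notify("database", level=1)  # escalate when the first responder needs help
fallback = policy.notify("network")          # unknown type falls back to the default chain
```

Because every call appends to audit_log, management can later reconstruct who was notified, for which type of incident and in what order.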


How to Improve Your MTTR

Clearly, MTTR has an impact on many industries, and these industries recognize its importance. But the question remains: How can you eliminate the barriers to decreasing MTTR and improve MTTR management?

Start using better monitoring tools. Use good monitoring tools such as Labtech, Nagios or Kaseya, and make sure each ticket comes in with as much relevant information as possible. The time taken to become aware of a problem depends primarily on the sophistication of the monitoring system(s). Identifying the root cause is usually the biggest source of MTTR variability and the one with the highest associated cost. The solution lies in the tools you use and the processes you have in place.
Ensure that information goes to the right person. Alerts need to reach the right person at the right time. When a monitoring system detects an issue and sends an email, use OnPage’s IT service alerting (ITSA) solution to make sure that the correct engineer is notified promptly.
Enable escalations. Have escalations enabled so that engineers can reach out for the expertise and assistance of others in their DevOps groups, MSP or IT teams.
Facilitate communications by allowing for redundancies. When engineers are outside the Wi-Fi “magic garden,” they need tools that let them be reached even when they don’t have a data package. Ensure that your engineers have redundant channels that allow them to be reached by phone.
Know what is low versus high priority. Not all alerts are created equal. Ensure engineers know when an alert is low priority, so they don’t spend time on it while a time-sensitive, high-priority incident waits.
Measure. Take the time to measure how long it takes your engineers to resolve issues. If MTTR is too high at your shop, you can only start addressing it by measuring how long it takes until an issue is identified (mean time to identify, MTTI), how long until it is acknowledged (mean time to acknowledge, MTTA), and how long until it is resolved (MTTR). Seventy percent of MTTR is taken up by MTTI, so it is important to identify the problem quickly. As the saying often attributed to Peter Drucker goes, “If you can’t measure it you can’t improve it.”
Bring in human insights. If monitoring alerts are not coming in with enough detail, find a way to provide further information or documentation. Also, if the wrong person is being alerted of an issue, determine who is the right person to address the incident.
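
To make the measurement step concrete, here is a minimal Python sketch that computes MTTA, MTTI and MTTR from a list of incident records. It rests on a few assumptions: the field names are hypothetical and would need to be mapped to whatever your ticketing tool exports, every interval is measured from the moment the incident was reported, and the two sample incidents are invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names; map them to your ticketing tool's export.
    reported: datetime      # alert fired / incident reported
    acknowledged: datetime  # an engineer accepted the alert
    identified: datetime    # the problem was identified
    resolved: datetime      # the issue was resolved


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60


def summarize(incidents):
    """Return MTTA, MTTI and MTTR in minutes, each measured from the report time."""
    return {
        "MTTA": mean_minutes([i.acknowledged - i.reported for i in incidents]),
        "MTTI": mean_minutes([i.identified - i.reported for i in incidents]),
        "MTTR": mean_minutes([i.resolved - i.reported for i in incidents]),
    }


# Invented sample data: two incidents reported at the same time.
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(start, start + timedelta(minutes=5),
             start + timedelta(minutes=40), start + timedelta(minutes=60)),
    Incident(start, start + timedelta(minutes=15),
             start + timedelta(minutes=80), start + timedelta(minutes=120)),
]
stats = summarize(incidents)
print(stats)  # {'MTTA': 10.0, 'MTTI': 60.0, 'MTTR': 90.0}
```

Tracking the three numbers side by side shows where the time actually goes. If MTTI dominates, better monitoring data will likely help more than faster paging.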

Conclusion

Effective incident management is key to improving MTTR. If incident management is not handled correctly and MTTR continues to rise, then the true bottom line (i.e., revenue) will take a beating along with corporate reputation. SLAs will not be met and productivity will diminish. Rather than staying helpless in the battle to improve MTTR, IT professionals can consider and implement the key components outlined in this article.