Incident Management Defined

Incident management refers to the IT processes and people put in place to identify, analyze and correct incidents that cause company downtime or service interruption.

The professionals who handle these incidents are part of an IT incident response team. This team is usually directed by an incident manager. The objective is to resolve major incidents as quickly as possible.

Incident Management Means Everything

Without incident management, you may lose valuable data, experience reduced productivity and revenues due to downtime, or be held liable for breach of service level agreements (SLAs). Even when incidents are minor with no lasting harm, IT teams must devote valuable time to investigating and correcting issues.

Effective incident management also reduces overall costs. According to a study by Gartner, system or service downtime can cost organizations $300,000 per hour. Regulatory fines and loss of customer trust can add further significant financial impact.

MTTR - How Important is Incident Response and Resolution Speed?

Speed is of the utmost importance for IT incident response teams, and mean time to resolution (MTTR) is the metric used to measure it. If an IT team doesn’t know how long it takes to fix issues, it can’t improve performance.
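As a rough illustration, MTTR can be computed as the average time between when an incident is opened and when it is resolved. The sketch below assumes each incident is recorded as an (opened, resolved) pair of timestamps; the function name and the sample data are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average (resolved - opened) across resolved incidents.

    `incidents` is a list of (opened, resolved) datetime pairs.
    Incidents that are still open (resolved is None) are skipped.
    Returns a timedelta, or None if nothing has been resolved yet.
    """
    durations = [
        resolved - opened
        for opened, resolved in incidents
        if resolved is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data for illustration only.
incidents = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 3, 0)),     # 60 minutes
    (datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 30)),  # 30 minutes
    (datetime(2024, 1, 9, 9, 0), datetime(2024, 1, 9, 10, 30)),   # 90 minutes
    (datetime(2024, 1, 12, 8, 0), None),                          # still open
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Tracking this figure per week or per month gives the team a baseline against which to judge whether process changes are actually helping.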

There are many roadblocks to minimizing MTTR, including:

> Inconsistent data channel connectivity: As an example, let’s say there’s an IT team in India as well as in the U.S. The U.S.-based team should cover the hours not worked in India and vice versa. Yet, due to the high cost of the data channel, the team in India turns its data channel off and is reachable only while in the office. Since the India team is delayed in receiving and responding to messages, MTTR increases.

> Lack of effective monitoring tools: Without quality monitoring solutions and processes, it will take more time than necessary to do root cause analysis of the incident. Techs can also use monitoring tools to see the change in data as they apply fixes and tweaks to ensure that they are headed in the right direction toward resolution.

> No escalations: When an engineer is alerted to a critical incident, he or she may want to escalate the issue if the scope of the problem is larger than originally anticipated. Effective resolution often requires bringing in other members of the team. If there’s no fast way to alert the team or determine who’s available, the incident will take much longer to fix.

> Lack of audit trails: If no trail exists of who was alerted based on what criterion, management is unable to see incident reports with a history of the cause of the most recent alert and who was notified and in which order. This is a missed opportunity to help the IT team discuss their performance during a post-mortem review and work on continuously improving MTTR.

> No scheduling tools: Management cannot coordinate who’s to be alerted based on the type of incident. Instead, the whole team is alerted regardless of their ability to provide insight or assistance.

> Excessive alerting: The team receives too many false positives, inevitably begins to ignore alerts and eventually starts to miss important ones. Alert fatigue not only affects MTTR, but also leads to employee burnout and high employee churn rates in the IT organization.
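To make the scheduling roadblock concrete, here is a minimal sketch of routing an alert to a single on-call engineer based on incident type and time of day, rather than alerting the whole team. The schedule layout, the incident types and the engineer names are all assumptions made up for this example.

```python
from datetime import datetime

# Hypothetical on-call schedule: incident type -> shifts given as
# (start_hour, end_hour, engineer), with hours in UTC and end exclusive.
SCHEDULE = {
    "database": [(0, 12, "asha"), (12, 24, "ben")],
    "network": [(0, 12, "chen"), (12, 24, "dana")],
}


def on_call_for(incident_type, when):
    """Return the engineer on call for this incident type at `when`.

    Returns None when no shift covers the incident type or the hour,
    which a real system would treat as a gap in coverage.
    """
    for start_hour, end_hour, engineer in SCHEDULE.get(incident_type, []):
        if start_hour <= when.hour < end_hour:
            return engineer
    return None


print(on_call_for("database", datetime(2024, 1, 1, 2, 0)))   # asha
print(on_call_for("network", datetime(2024, 1, 1, 15, 30)))  # dana
```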

OnPage Is the Ideal Incident Alert Management Solution

OnPage is a SaaS-based incident alert management system hosted in secure, SSAE-16-compliant hosting facilities across the U.S. With OnPage, IT professionals:

  • Get a complete system consisting of a web management console and mobile app
  • Receive mobile alerts that bypass the silent switch on all phones
  • Get instant visibility and feedback on incident status
  • Track alert delivery, ticket status and responses to tickets
  • Depend on rock-solid reliability: a must for those who need to elevate critical incidents and ensure fast resolution

OnPage provides powerful integrations with mission-critical systems through the industry’s easiest integration framework.

Not everything should be an emergency. Management should take steps to reduce the noise so IT teams know which alerts truly require action at 2 a.m. Too many events and alerts (false positives) reduce the effectiveness of IT operations, and the team will start to overlook critical events or alerts. To reduce noise, it’s important to determine the few occurrences, metrics and levels that command a high-priority response. The rest can be classified as low-priority and does not require immediate action.
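One simple way to express this split is a short list of metrics with thresholds that trigger a high-priority response, with everything else defaulting to low priority. The sketch below is illustrative only: the metric names and threshold values are assumptions, and each team would choose its own.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    metric: str
    value: float


# Hypothetical thresholds: only these metrics can ever page someone.
HIGH_PRIORITY_THRESHOLDS = {
    "error_rate_percent": 5.0,
    "disk_used_percent": 95.0,
}


def priority(alert):
    """Classify an alert as 'high' or 'low' priority.

    An alert is high priority only if its metric is on the short list
    and its value meets or exceeds the configured threshold.
    """
    threshold = HIGH_PRIORITY_THRESHOLDS.get(alert.metric)
    if threshold is not None and alert.value >= threshold:
        return "high"
    return "low"


print(priority(Alert("error_rate_percent", 7.5)))  # high
print(priority(Alert("disk_used_percent", 80.0)))  # low
print(priority(Alert("cpu_percent", 99.0)))        # low (not on the list)
```

Keeping the high-priority list deliberately short is the point of the design: anything not explicitly listed cannot wake someone up.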

High-priority alerts need to be distinctive and get immediate attention. That means they should not be transmitted via email, instant messaging or text messaging, where they’ll be buried under a multitude of less important content. High-priority alerting should be delivered on a device that is always available and convenient: a smartphone!

Make It Easy to Escalate Alerts

To ensure that alerts are never missed, the workflow must include a way to automatically escalate the notification if the tech assigned to the incident does not respond within a predetermined length of time. Some IT teams have found that by incorporating this technology and process, the number of missed alerts has been reduced quickly and dramatically (in many cases to zero) and responses to critical incidents have sped up by 300 percent or more.
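The escalation workflow described here can be sketched as a loop over an ordered chain of responders: notify one, wait for an acknowledgement up to a timeout, and move to the next if none arrives. This is a minimal sketch and not OnPage’s implementation; `notify` and `wait_for_ack` are hypothetical callbacks standing in for whatever alerting system is in use, and the five-minute default is an assumed value.

```python
def escalate(chain, notify, wait_for_ack, timeout_seconds=300):
    """Walk the escalation chain until someone acknowledges the alert.

    chain:         responders in escalation order
    notify:        callable(responder) that sends the alert (hypothetical)
    wait_for_ack:  callable(responder, timeout_seconds) -> bool, True if
                   the responder acknowledged in time (hypothetical)
    Returns the responder who acknowledged, or None if nobody did.
    """
    for responder in chain:
        notify(responder)
        if wait_for_ack(responder, timeout_seconds):
            return responder
    return None


# Stubbed usage: the primary never responds, the secondary does.
notified = []
acks = {"primary": False, "secondary": True}

owner = escalate(
    ["primary", "secondary", "manager"],
    notify=notified.append,
    wait_for_ack=lambda responder, timeout: acks.get(responder, False),
)
print(owner, notified)  # secondary ['primary', 'secondary']
```

Because the manager is never notified once the secondary acknowledges, the chain stays as quiet as possible while still guaranteeing that an unanswered alert keeps moving.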

Invest in Automation As Well as the Right Processes

Most IT teams have an abundance of tools, so a lack of solutions for automation is not as much of an issue as determining which ones are crucial in a time of need.

If a task can be automated, there is no reason to alert an engineer about it. For example, if backups can be automated, the IT team should adopt the technology and tools that make this possible, saving time and labor hours and avoiding the potential for human error.

Engineers should be assigned only to problems where their knowledge can add value. This is particularly true for issues picked up by monitoring and alerting tools. In fact, precise identification of problems is the first step in the incident management workflow.
