Major Incident Management

What’s an Incident?

ITIL defines an incident as an unplanned interruption to a service or reduction in the quality of a service. Incidents have four priority levels: Critical, High, Medium and Low. Major incidents are typically classified as high-priority events based on the urgency and business impact of the situation.

What's a Major Incident?

A major incident disrupts a major business operation. It can bring an organization’s entire operation to a standstill and affect its revenue and bottom line, with far-reaching consequences for the company’s reputation.

Major incidents commonly include:

  • Nonfunctional or unresponsive eCommerce websites
  • Client access portals that are down
  • Severe outages in airline check-in processes

Time is critical during a major incident, and an organization’s ability to bring business back to normal makes all the difference. Major incident management (MIM) helps distinguish a real major incident from a routine outage. The goal of MIM is to manage the incident life cycle and remediate the issue while minimizing disruption.

The Cloudflare 2019 global outage is an example of a major incident. A minor change in the rules used to detect anomalies resulted in major outages. Per Cloudflare’s estimates, systems were down for approximately 27 minutes, and the outage affected almost half of the internet’s accessibility.

 

 

The Four Stages of Major Incident Management

 

The Process in Detail ...

Stage 1: Detection

Identifying the Major Incident

The first step in the MIM process is identifying a major incident. Organizations encounter incidents every few minutes, so the challenge is distinguishing major incidents from the rest.

A key indicator of a major incident is that it affects many users, disrupting one or several critical services of a business. Incidents are often reported to a service desk technician or detected by monitoring tools that automatically trigger notifications when anomalies are identified.
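
The indicator above can be sketched as a simple classification rule. The Python below is only an illustration under stated assumptions: the `Incident` fields, the `CRITICAL_SERVICES` set and the 500-user threshold are hypothetical values chosen for the example, and each organization would define its own.

```python
from dataclasses import dataclass, field

# Hypothetical values for illustration; each organization defines its own.
CRITICAL_SERVICES = {"checkout", "client-portal", "check-in"}
AFFECTED_USER_THRESHOLD = 500


@dataclass
class Incident:
    summary: str
    affected_users: int
    affected_services: set[str] = field(default_factory=set)


def is_major(incident: Incident) -> bool:
    """Return True when many users are affected and a critical service is disrupted."""
    hits_critical = bool(incident.affected_services & CRITICAL_SERVICES)
    many_users = incident.affected_users >= AFFECTED_USER_THRESHOLD
    return hits_critical and many_users
```

With these assumed values, `is_major(Incident("Checkout down", 12000, {"checkout"}))` evaluates to `True`, while a three-user printer problem does not qualify. In practice the rule would be tuned to the organization's own definition of a major incident.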

Stakeholder Communication

When a major incident is detected, the relevant stakeholders need to be engaged to contain the situation and minimize business losses. There are three key groups that must be informed of the situation:

  • Incident Response Team: The team manages and takes control of the critical situation. It is important that teams have the right tools in place to detect issues and alert engineers of severe outages.
  • Senior Management: The command manager must send timely situational reports (SITREPs) with timelines to senior management. Incident alert management platforms can be used to streamline this process through pre-configured recipient groups and email templates.
  • Users: Being transparent and keeping users apprised of a major incident helps alleviate stress and anxiety. It also reflects well on the company, solidifying trust and fostering better relationships with customers. Mass messaging platforms can be deployed to broadcast timely updates to users over multiple message channels.
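
As a rough illustration of the three audiences above, the sketch below builds a SITREP from a template and pairs it with one channel per stakeholder group. Everything here is an assumption made for the example: the `AUDIENCE_CHANNELS` mapping, the report fields and the function names are hypothetical and do not represent any particular platform's API.

```python
from datetime import datetime, timezone

# Hypothetical mapping of stakeholder group to delivery channel.
AUDIENCE_CHANNELS = {
    "incident_response_team": "high-priority page",
    "senior_management": "email SITREP",
    "users": "mass notification",
}


def build_sitrep(title: str, status: str, started_at: datetime, now: datetime) -> str:
    """Fill a plain-text SITREP template, including the elapsed time in minutes."""
    elapsed_minutes = int((now - started_at).total_seconds() // 60)
    return (
        f"SITREP: {title}\n"
        f"Status: {status}\n"
        f"Started: {started_at:%Y-%m-%d %H:%M} UTC\n"
        f"Elapsed: {elapsed_minutes} min"
    )


def plan_notifications(sitrep: str) -> list[dict]:
    """Return one notification per stakeholder group; actual delivery is out of scope."""
    return [
        {"audience": audience, "channel": channel, "body": sitrep}
        for audience, channel in AUDIENCE_CHANNELS.items()
    ]
```

A real deployment would likely tailor the message per audience, for example sending engineers technical detail and users a plain-language status update.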

Stage 2: Orchestration

Assemble the major incident team to remediate the incident. The team should include engineers, incident commanders and other key stakeholders, such as external consultants. All parties aim to minimize damage and resolve the issue.

Centralized Communication

In critical situations, email and SMS are ineffective message channels: messages are often missed or left unaddressed, and neither channel can elevate high-priority messages above routine ones. Incident alert management applications allow for real-time, secure messaging for team collaboration.

Stage 3: Resolution

The resolution stage occurs when the outage has subsided and systems have been restored to full functionality.

Once resolution is achieved, the team should document the entire incident management process for future reference. If necessary, carefully implement process changes and ensure that other dependencies are not affected.

Stage 4: Post-Incident Analysis

Post-incident analysis measures the performance of resources and systems in place. Incident managers conduct post-incident reviews to gain insight into the situation and event resolution process. This helps organizations become well prepared for future incidents. Message audit trails can be analyzed to get information about the team’s incident response.
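
One way to analyze a message audit trail is to measure how long each step of the response took. The sketch below assumes a hypothetical trail of `(timestamp, event)` records; the event names and sample data are invented for illustration and are not taken from any real export format.

```python
from datetime import datetime

# Hypothetical audit-trail records: (ISO 8601 timestamp, event name).
AUDIT_TRAIL = [
    ("2024-03-01T09:00:00", "alert_sent"),
    ("2024-03-01T09:04:00", "alert_acknowledged"),
    ("2024-03-01T09:52:00", "incident_resolved"),
]


def minutes_between(trail, start_event: str, end_event: str) -> float:
    """Return the minutes elapsed between two named events in the trail.

    If an event name appears more than once, its last occurrence is used.
    """
    times = {event: datetime.fromisoformat(ts) for ts, event in trail}
    return (times[end_event] - times[start_event]).total_seconds() / 60


time_to_acknowledge = minutes_between(AUDIT_TRAIL, "alert_sent", "alert_acknowledged")
time_to_resolve = minutes_between(AUDIT_TRAIL, "alert_sent", "incident_resolved")
```

For the sample data, `time_to_acknowledge` is 4.0 minutes and `time_to_resolve` is 52.0 minutes. Tracking these figures across incidents gives the post-incident review a concrete baseline.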

Why Adopt a Major Incident Management Tool

Status Quo vs OnPage

While you can catalog an incident using your IT service management (ITSM) ticketing tools, there is very little you can do to manage the incident by simply using tickets. Ticketing tools only allow for unprioritized SMS and email incident alerts. This limitation inhibits team collaboration and slows down the incident remediation process.

Integrating your existing tools with an incident alert management system overcomes these limitations. Tickets are converted into intelligent, actionable alerts that can be audibly delivered to response teams during critical incidents. Teams can then collaborate directly on tickets and seamlessly bring additional resources into the conversation thread, which syncs back to the ticketing tool in real time.

An infographic illustrating how OnPage's comprehensive incident alert management system integrates with several monitoring, ticketing and remote monitoring solutions to promptly alert the right teams/service owners.

OnPage Capabilities

An incident alert management tool must be more than just an alerting service. Here are some requirements of any incident alert management solution:

  • High-priority notifications that bypass the silent switch on mobile
  • Secure messaging to aid team communication
  • Ability to integrate with ticketing tools
  • Persistent, distinguishable mobile alerts
  • Digital on-call schedulers
  • Alert escalation policies
  • Fail-over options if an alert escalation fails
  • High- and low-priority alerting
  • Ability to track incoming and outgoing alerts and messages
  • Reporting to summarize and gain insights into historical data
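
Two of the requirements above, alert escalation policies and fail-over, can be sketched in a few lines. This is a generic illustration and not OnPage's implementation: the `escalate` function and its parameters are hypothetical, and the `acknowledges` callback stands in for "send the alert and wait for an acknowledgement within some timeout".

```python
from typing import Callable


def escalate(
    alert: str,
    on_call_chain: list[str],
    failover_contact: str,
    acknowledges: Callable[[str, str], bool],
) -> str:
    """Alert each responder in order and return the first one who acknowledges.

    If nobody in the chain acknowledges, the fail-over contact takes ownership.
    """
    for responder in on_call_chain:
        if acknowledges(responder, alert):
            return responder
    return failover_contact


# Usage: only the secondary engineer is reachable in this made-up scenario.
reachable = {"secondary-engineer"}
owner = escalate(
    "Database cluster unreachable",
    ["primary-engineer", "secondary-engineer"],
    "operations-manager",
    lambda responder, alert: responder in reachable,
)
```

Here `owner` ends up as `"secondary-engineer"`. If no one in the chain had acknowledged, the alert would have fallen through to `"operations-manager"`.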

Continuous Industry Success

OnPage is a G2 Leader for incident alert management, consistently receiving recognition for high performance and user satisfaction.
