
Incident Alert Management for MSPs

Please schedule a more convenient time for your IT breakdown

Incidents that could hurt business never happen at a convenient time. So it makes sense for MSPs in charge of these businesses’ IT infrastructure to move alerting to their smartphones. MSPs may forget to eat breakfast or even sleep, but chances are they are glued to their smartphones. That makes the smartphone the best medium for delivering an alert.

Take a look at the following anatomy of an incident to see a moment in the life of an MSP and their smartphone.

First, you have the incident itself. Someone’s pet monkey has broken into the server room and pulled out all the wires! Steps to follow?

  • Send a memo to employees, canceling bring-your-pet-to-work-day. They will probably read this message on their smartphone.
  • Inform your on-call team of the monkey business. When your company’s automated alerting process is integrated with your monitoring tools and sensors, an important notification triggers an audible alert on the MSP’s smartphone.

Second, you mobilize your team. If you previously relied on an underpaid intern making calls to the on-call team, then you are doing it wrong. You need to assign team members to an escalation group with automated alerts.

Third, you need a plan B in case the first person your frantic intern calls is you. You are at an obnoxiously loud concert and can’t hear the alerts, so you want to make sure there is a backup on-call engineer. The reliable ones on your team (clearly, not you) get an alert because they are in the escalation group. Both the order in which MSPs are alerted and the time between escalations can be adjusted. Make sure that if an incident is not acknowledged or resolved within a predetermined amount of time, it is escalated to the next person on call. And if a message sent to an escalation group does not reach anyone in the group, make sure you have failover options.
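For the curious, the escalation logic described above can be sketched in a few lines of Python. This is a minimal simulation under stated assumptions, not OnPage's implementation or API: the names `EscalationPolicy`, `run_escalation`, the `acknowledges` callback, and the `noc-mailbox` failover contact are all hypothetical, and waiting is modeled by a minute counter rather than a real timer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# One audit entry: (minutes since the incident, who, what happened).
AuditEntry = Tuple[int, str, str]


@dataclass
class EscalationPolicy:
    """Hypothetical policy object; every field name here is an assumption."""
    on_call: List[str]              # alert order: first entry is alerted first
    escalate_after_min: int = 5     # wait time before moving to the next person
    failover: Optional[str] = None  # last-resort contact if nobody acknowledges


def run_escalation(
    policy: EscalationPolicy,
    acknowledges: Callable[[str], bool],
) -> Tuple[Optional[str], List[AuditEntry]]:
    """Walk the escalation order.

    Returns the person who acknowledged (or None if nobody in the group did)
    together with an audit trail of every step taken.
    """
    audit: List[AuditEntry] = []
    minute = 0
    for person in policy.on_call:
        audit.append((minute, person, "alerted"))
        if acknowledges(person):
            audit.append((minute, person, "acknowledged"))
            return person, audit
        # No acknowledgment within the window: record it and escalate.
        minute += policy.escalate_after_min
        audit.append((minute, person, "timed out"))
    # Nobody in the escalation group responded: fall back to the failover contact.
    if policy.failover is not None:
        audit.append((minute, policy.failover, "failover"))
    return None, audit


# Example: "you" are at the concert, so only "dana" hears the alert.
policy = EscalationPolicy(
    on_call=["you", "dana", "lee"],
    escalate_after_min=5,
    failover="noc-mailbox",
)
available = {"dana"}
responder, audit = run_escalation(policy, lambda person: person in available)

for minute, person, event in audit:
    print(f"t+{minute:>2} min  {person:<12} {event}")
```

Reordering `on_call` changes who is alerted first, and `escalate_after_min` changes the time between escalations, which are the two adjustments mentioned above. The returned list doubles as a tiny audit trail of who was alerted and who responded.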

Fourth, the alerts have been sent out, and your on-call team now has several options. They can send and receive messages that include images and voice attachments, which enrich the alert and describe the incident further. All of this is completely secure, of course, and works over cellular or Wi-Fi coverage.

Fifth, take action. Now that the escalation has moved to the rest of your team members, they can collaborate to fix the incident by sending high- and low-priority messages. High-priority messages could be about how to solve the incident. Low-priority messages could be reserved for discussing how much they hate you. Your team can also acknowledge that they have fixed the issue using predefined reply options built into the app and tracked by our audit trail.

Sixth, it’s the day of reckoning. Every single thing you and your team did during the outage has been cataloged using audit trails. Ignoring the alert while posting concert pictures to Facebook was not a good idea. With the audit trail, your boss knows every alert that went out and who responded.

Imagine if this scenario were real. Wouldn’t you want to make sure you had technology on your side that was robust enough:

  • To handle on-call scheduling
  • To enable escalation of alerts
  • To enable communication among team members
  • To enable alert tracking through audit trails
  • To generate failover reports

You could try to search for this technology on your own or you could try OnPage.
Contact us for more information on how we can fix your monkey business.

Shawn Lazarus
