ChatOps – Secret to great incident management in DevOps teams of 20 or 2000


Chat your way to excellence

DevOps is constantly trying to improve production through automation, collaboration and tools. ChatOps is often the paradigm that brings these tasks together into a single conversation. In ChatOps, “chat applications and tools for real-time communication and task execution [are distributed] among members of development and IT operations teams”. Yet proponents of ChatOps often don’t pay sufficient attention to the incident management part of the operation, preferring instead to look at the bots and chat room tools.

However, as James Fryman noted in his talk on ChatOps: Technology and Philosophy at Geekdom San Francisco, “the shared context [of chat rooms] allows everybody to see and collaborate around things that happen. This is super amazing [for] the incident management space.” Specifically, though, when a high-priority or critical alert occurs, notifications need to broadcast the incident beyond the chat room and ensure the conversations don’t get muddled.

ChatOps as a central doctrine of DevOps: why it is important

The notion of chat rooms did not begin with GitHub. Rather, IT has long used BBSs (Bulletin Board Systems) and later IRC (Internet Relay Chat) to connect people through chat. Even today, with Slack, Spark or HipChat, the goal is the same as it was with these earlier forums. As New Relic notes in a blog article, “even as the tools facilitating real-time chat have changed, the primary reasons for using them have not.” The goal remains to enable synchronous and asynchronous communication for distributed groups and people. Chat allows for greater collaboration on development, less delay and better outcomes. Tomer Levy, CEO of Logz.io, notes that the strength of ChatOps lies in the “feedback loop [which] enhances collaboration.”

One engineer I met who works for Wayfair told me he was challenged and fatigued by calls with the company’s developers in China. “At 2am my heavy Jersey accent and their heavy Chinese accent made it difficult to understand one another. Putting our conversation in chat channels really cut down on the frustration.”

Low priority conversations and chats

Now that we are agreed on the importance of ChatOps, I need to rock the boat a bit and make a distinction here between normal chat and priority alerts in chat channels. As discussed above, chat is great for furthering collaboration. However, when important or critical situations happen, a different approach needs to be taken.

When a low-priority alert occurs, an engineer still needs to be notified, and a simple chat request is not sufficient. OnPage allows you to create an audible alert on an engineer’s smartphone through a simple Slack command. See OnPage’s video on Slack integration for further explanation of how this works. For low-priority alerts, the chat can then continue in OnPage or in Slack channels. OnPage is bi-directional, so conversations can go from Slack to OnPage and vice versa. The question that remains is how best to handle critical or high-priority alerts and elevate them to their own channel.

Why separate high priority conversations are needed in ChatOps

Critical alerts need separate channels because traditional Slack or chat channels quickly get muddled if they also become the place for critical alert discussion. The meandering and humorous tone of chat channels is not conducive to high-priority alerts.

In critical alerting situations, the conversation needs to be focused and directed. The audible notification itself can be triggered through a command to OnPage, which delivers the critical alert to the engineer’s smartphone. These alerts can be louder than, or simply different from, a low-priority alert. The conversation should then continue in a new high-priority chat channel that is separated from the existing conversation.

There are several reasons for separating the high-priority conversation into another channel. These include:

  • Better management of the signal-to-noise ratio. By having critical alerts occur through OnPage, and continuing important conversations on a separate channel, engineers are able to have important conversations rise above the noise.
  • Letting on-call engineers break away from the team of 20 or 2000. In a dedicated critical-alert channel, not everyone chimes in with suggestions. Instead, the conversation is focused and intentional.
  • Ensure the message comes with context. Alerts in OnPage are message rich and provide details on the issue at hand.
  • Ensure the message gets to the right on-call engineer. Rather than guessing or assuming who is on call, alerts to OnPage will be directed to the correct engineer.
  • Ensure persistent alerts. Alerts will continue for up to 8 hours until acknowledged.
  • Once the issue is resolved, the engineer can return to the regular Slack channel.
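
As a rough illustration of the routing described above, here is a minimal Python sketch. It does not call OnPage’s or Slack’s actual APIs. The on-call table, channel names, alert tones and the route function are all hypothetical, made up only to show the idea: every alert pages the right on-call engineer, but only critical alerts open a dedicated channel, and re-alerting stops after 8 hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Priority(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    priority: Priority
    raised_at: datetime


@dataclass(frozen=True)
class Routing:
    channel: str            # chat channel where the conversation continues
    page_engineer: str      # on-call engineer who receives the audible alert
    tone: str               # critical alerts sound different from low-priority ones
    repeat_until: datetime  # stop re-alerting after this time if unacknowledged


# Hypothetical on-call schedule and channel names, for illustration only.
ON_CALL = {"checkout": "alice", "search": "bob"}
REGULAR_CHANNEL = "#ops-chat"
MAX_PERSISTENCE = timedelta(hours=8)  # the "up to 8 hours" mentioned in the list


def route(alert: Alert) -> Routing:
    """Decide where an alert's conversation goes and who gets paged."""
    engineer = ON_CALL.get(alert.service, "fallback-on-call")
    if alert.priority is Priority.HIGH:
        # Critical alerts get a dedicated channel, away from regular chat.
        channel = f"#incident-{alert.service}-{alert.raised_at:%Y%m%d-%H%M}"
        tone = "critical"
    else:
        channel = REGULAR_CHANNEL
        tone = "standard"
    return Routing(channel, engineer, tone, alert.raised_at + MAX_PERSISTENCE)


alert = Alert("checkout", "Payment errors above 5%", Priority.HIGH,
              datetime(2024, 1, 15, 2, 0))
decision = route(alert)
print(decision.channel)        # #incident-checkout-20240115-0200
print(decision.page_engineer)  # alice
```

Keeping the decision in one small function means the rule for "what counts as critical" lives in a single place, whatever tool actually sends the page.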

Impact of elevating critical alerts

The critical alert is enabled by OnPage and the conversation is elevated to a dedicated Slack channel. A separate, dedicated channel for critical alert situations brings several advantages:

  • Improve MTTR (mean time to resolution). By ensuring conversations for critical alerts are on separate channels, the amount of time until issues are detected and resolved decreases.
  • Deploy more frequently. Because engineers resolve issues more quickly, less time is lost to firefighting and more is available for shipping changes.
  • Decrease lead time for changes. Decrease the amount of time it takes to go from code commit to code successfully running in production.
  • Decrease change failure rate. Decrease the percentage of changes that result in degraded service or require remediation.
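
To make two of these measures concrete, the short Python sketch below computes MTTR and change failure rate from made-up sample data. The function names and the numbers are assumptions for illustration only, not figures from OnPage or from any real team.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time from detection to resolution across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def change_failure_rate(total_changes: int, failed_changes: int) -> float:
    """Fraction of changes that degraded service or needed remediation."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return failed_changes / total_changes


# Made-up sample data: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 1, 15, 2, 0), datetime(2024, 1, 15, 2, 30)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 15, 30)),
]
print(mean_time_to_resolve(incidents))  # 1:00:00
print(change_failure_rate(40, 6))       # 0.15
```

Tracking these numbers before and after introducing a dedicated critical-alert channel is one way to check whether the expected improvement actually shows up.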

Conclusion

For ChatOps to truly improve DevOps, critical alerting has to be nuanced. Critical alerts cannot stay in the thread of regular chat. Chat needs to be differentiated based on whether the conversations are for critical issues or low-priority or regular conversations. For both situations, OnPage provides a solution that ensures engineers receive the right message at the right time. Every time.

OnPage is a cloud-based incident alerting and management platform that elevates notifications on your smartphone so they continue to alert until read. Incidents can be routed to the person on call and escalated if they are not attended to promptly. Schedule a demonstration today!
