By Umair Hussain on February 13, 2017
It’s 3 AM and you’re sound asleep. Suddenly, the terrible sound of a train horn blares from your phone. Dazed and disoriented, you reach for your phone, knocking over the glass of water you unwittingly placed right beside it so you wouldn’t forget to drink it first thing in the morning. You dry your hands on your pajamas and make a brief mental note to plan that better as you pick up your phone. Just before you reach the app, you realize the alarm has stopped. Frustrated, you, the on-call Operations Engineer for the week, open up the PagerDuty app and find that it was the _____ check on _____ service, again, and that it resolved itself, again…
As Operations Engineers, I am sure we can all think of one or five different checks on different services that could fill in the blanks. Everyone knows the story of “The Boy Who Cried Wolf”. The on-call Operations Engineer can end up in a similar situation when enough of the alerts fired during a rotation are false positives. Each and every alert could signal a company-wide outage and major financial loss, so none can be safely ignored. Recurring false positives are, in a sense, like the boy’s cry of wolf: where they once commanded a flurry of action, they now draw halfhearted sighs, while still carrying the same threat of severe and irreparable damage.
So, given these very real problems, how is Hootsuite tackling these alerts? The Operations Team is working together with the Development Teams, and our approach can be summarized in the following three points:
- Creating a ticket for each and every alert that the on-call Operations Engineer sees during their rotation.
- Developing tools to ease the process of finding and targeting the noisiest alerts.
- Dedicating time in our sprints to focus on rectifying our alerts.
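
To make the second point a little more concrete, here is a minimal sketch of what ranking the noisiest alerts could look like. It is an illustration only, not our actual tooling: the record fields (`service`, `check`, `auto_resolved`) and the `noisiest_alerts` helper are hypothetical and do not reflect PagerDuty's real data format.

```python
from collections import Counter

# Hypothetical alert records for one on-call rotation.
# The field names are assumptions for this sketch, not PagerDuty's schema.
alerts = [
    {"service": "api", "check": "latency", "auto_resolved": True},
    {"service": "api", "check": "latency", "auto_resolved": True},
    {"service": "db", "check": "disk", "auto_resolved": True},
    {"service": "api", "check": "latency", "auto_resolved": True},
    {"service": "db", "check": "disk", "auto_resolved": False},
    {"service": "web", "check": "http", "auto_resolved": False},
]


def noisiest_alerts(alerts, top_n=3):
    """Rank (service, check) pairs by how often they resolved themselves."""
    totals = Counter()         # every alert fired for the pair
    auto_resolved = Counter()  # alerts that cleared without intervention
    for alert in alerts:
        key = (alert["service"], alert["check"])
        totals[key] += 1
        if alert["auto_resolved"]:
            auto_resolved[key] += 1
    # Most self-resolved first, then most frequent, then by name for stable output.
    ranked = sorted(totals, key=lambda k: (-auto_resolved[k], -totals[k], k))
    return [(key, totals[key], auto_resolved[key]) for key in ranked[:top_n]]


for (service, check), total, auto in noisiest_alerts(alerts):
    print(f"{service}/{check}: {total} alerts, {auto} self-resolved")
```

Sorting by self-resolved count first is a design choice for this sketch: an alert that keeps clearing on its own is the likeliest false positive, so it is a natural first candidate for a ticket and some sprint time.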