How to Reduce False-Positive Monitoring and Alerting
It’s 3AM and you’re sound asleep. Suddenly, a terrible train-horn sound blares from your phone. Dazed and disoriented, you reach for it, knocking over the glass of water you unwittingly placed right beside it so you wouldn’t forget to drink it first thing in the morning. You dry your hands on your pajamas and make a brief mental note to plan that better as you pick up your phone. Just before you reach the app, the alarm stops. Frustrated, you, the on-call Operations Engineer for the week, open up the PagerDuty app and find that it was the _____ check on _____ service, again, and that it resolved itself, again…
As Operations Engineers, I am sure we can all think of one or five different checks on different services that could fill in those blanks. Everyone knows the story of the “Boy Who Cried Wolf”. The on-call Operations Engineer faces a similar situation when enough of the alerts fired during a rotation turn out to be false-positives, because each and every alert could be a company-wide outage with major financial consequences. Recurring false-positives are, in a sense, like the boy’s cry of wolf: where they once commanded a flurry of action, they now draw halfhearted sighs while still carrying the same threat of severe and irreparable damage.
So given these very real problems, how is Hootsuite tackling these alerts? Well, the Operations Team is working together with the Development Teams and our approach can be summarized in the following three points:
- Creating a ticket for each and every alert that the on call Operations Engineer sees during their rotation.
- Developing tools to ease the process of finding and targeting the noisiest alerts.
- Dedicating time in our sprints to focus on rectifying our alerts.
Noting It
As a team, we noticed a lack of action on many alerts as we knowingly, and unknowingly, brushed them under the rug or passed them on to the next person in the on call rotation. To combat this, we proposed creating tickets for all alerts that occur during on call. Incorporating this simple task ensured that:
- All false-positives would be noted down for further investigation.
- We’d avoid the issue of trying to recall contextual information for an alert that occurred the night before.
- The work of investigating and resolving the underlying issue behind an alert would no longer fall solely on the Operations Engineer who saw it; instead, it could be picked up during sprint planning by any other team member, or even by Developers with intimate knowledge of the underlying service, and brought to a resolution.
Expanding Your Toolkit
Being the first responder to all alerts while taking care of day-to-day tasks can get tricky. This was essentially the problem we had in mind when we decided to set up automation to handle ticket creation for our alerts. A simple workflow below outlines the process.
This script was implemented in AWS Lambda rather than on an EC2 instance, taking advantage of Lambda’s stateless execution model. We used AWS API Gateway to trigger the script whenever a new alert fired in PagerDuty. By introducing this, we ensured that a ticket would be created for each and every alert, while the person on call stayed free to focus on more important tasks, such as putting out fires.
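To make the shape of that handler concrete, here is a minimal sketch in Python. It assumes a PagerDuty-v2-style webhook payload (a `messages` list with `incident.trigger` events) and a hypothetical `create_ticket` helper standing in for our ticketing system’s API; both are illustrative assumptions, not our exact implementation.

```python
import json


def create_ticket(summary, description):
    """Hypothetical ticketing-system call; replace with your tracker's API."""
    print(f"Created ticket: {summary}")
    return {"summary": summary, "description": description}


def lambda_handler(event, context):
    """Entry point invoked by API Gateway when PagerDuty fires a webhook.

    Assumes a PagerDuty-v2-style payload with a "messages" list; check the
    exact field names against your own webhook configuration.
    """
    body = json.loads(event.get("body", "{}"))
    tickets = []
    for message in body.get("messages", []):
        # Only file tickets for newly triggered incidents, not acks/resolves.
        if message.get("event") == "incident.trigger":
            incident = message.get("incident", {})
            summary = incident.get("description", "Unknown alert")
            tickets.append(create_ticket(
                summary=f"[on-call] {summary}",
                description=f"Auto-filed for incident {incident.get('id', 'n/a')}",
            ))
    return {"statusCode": 200,
            "body": json.dumps({"tickets_created": len(tickets)})}
```

Because the handler is stateless, Lambda can scale it to bursty alert storms without any instance to babysit, which was the appeal over an EC2-hosted script.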
In the face of several problem alerts, we needed a way to prioritize which kind of alert, when resolved, would give the best return on investment. To focus directly on these recurring alerts, we developed an application, integrated with our PagerDuty account, that read our alert history and pulled out the noisiest alerts. Given a time range in the past, it pulled all the alerts that occurred during that range and ordered them by how frequently they fired. This information gave the team a good stepping stone when we began addressing alert tickets.
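The ranking step at the heart of that tool is simple. The sketch below uses canned incident data in place of the PagerDuty API call (which, in the real tool, supplied the incidents for the requested time range); the field name `summary` is an illustrative assumption.

```python
from collections import Counter


def rank_noisiest(incidents, top_n=5):
    """Order alert names by how often they fired within the fetched window."""
    counts = Counter(incident["summary"] for incident in incidents)
    return counts.most_common(top_n)


# Canned data standing in for the incidents fetched from PagerDuty.
incidents = [
    {"summary": "omnibus-check on media-service"},
    {"summary": "disk-usage on web-01"},
    {"summary": "omnibus-check on media-service"},
    {"summary": "disk-usage on web-01"},
    {"summary": "circuit-breaker tripped"},
    {"summary": "omnibus-check on media-service"},
]
print(rank_noisiest(incidents, top_n=2))
# → [('omnibus-check on media-service', 3), ('disk-usage on web-01', 2)]
```

The output reads as a worklist: the alert at the top of the ranking is the one whose fix pays back the most sleep.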
The Final Piece
This last point is perhaps the most important, because ideas are nothing without actions to implement them. It came in the form of dedicating several Operations Team sprints solely to addressing these recurring false-positives. During the planning sessions for those sprints, the team prioritized the alert tickets that had been created and assigned them amongst themselves. These sprints gave the team the much-needed time to work with the Development Teams to address the sometimes trivial root causes and successfully prevent false-positives from recurring.
As a concrete example, one of the alerts we focused on was an omnibus check against several of our microservices. This check tested the service and all its dependencies, which was a great idea on paper. In practice, however, some of our services had dependencies that were far less reliable than others. Even worse, one service going down would cause a huge cascade of alerts from everything else, clouding the issue. We refactored the checks to focus on service-owned dependencies (databases, cache) and to raise a different, more meaningful alert for external dependencies. This simple change dramatically reduced the volume of alerts and let us zero in on the real problems.
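The shape of the refactored check can be sketched as below. The dependency names and the two-tier severity scheme are illustrative; the point is the split between page-worthy owned dependencies and lower-severity external ones.

```python
def check_service_health(owned_checks, external_checks):
    """Run owned-dependency checks separately from external ones.

    Each check is a (name, callable) pair; the callable returns True if
    healthy. Owned failures are page-worthy; external failures raise a
    lower-severity alert so a flaky upstream cannot mask a real problem.
    """
    critical = [name for name, check in owned_checks if not check()]
    degraded = [name for name, check in external_checks if not check()]
    if critical:
        return ("CRITICAL", critical)   # page the on-call engineer
    if degraded:
        return ("DEGRADED", degraded)   # informational alert, no page
    return ("OK", [])


# Illustrative wiring: real checks would ping the database, cache, etc.
status, failing = check_service_health(
    owned_checks=[("postgres", lambda: True), ("redis", lambda: True)],
    external_checks=[("partner-api", lambda: False)],
)
print(status, failing)  # → DEGRADED ['partner-api']
```

With this split, an unreliable external dependency shows up as a distinct, quieter signal instead of waking someone up with the same severity as a dead database.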
To add to this discussion, I’ve included a chart, obtained from PagerDuty, our alerts provider, showing the number of alerts we’ve seen over time. Note that we started our first sprint tackling alert-related issues on October 19, 2016, and followed it up with three more. The chart below is a good indication of our progress in this endeavour.
It is worth reiterating: if tools and systems are created but time is not dedicated to following up on alert-related tickets, scenes like the one depicted at the beginning become more likely, and with each recurring false-positive, the on-call Operations Engineer is more likely to shrug it off as a non-issue until, one day, it isn’t, at which point it may be too late. Take the circuit breaker we had between two critical services, with an alert for when it tripped. It had been tripping frequently due to a timeout change, but nothing was done about it. Then, one day, after resolving a service outage, we realized that the ignored circuit breaker would have given us early warning of the outage, had we not assumed the alert was “just that damn timeout again!”
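For readers unfamiliar with the pattern, here is a minimal circuit-breaker sketch showing exactly where that trip alert fires. The threshold and the alert callback are hypothetical, not our actual implementation; the lesson is that the `on_trip` signal is worth a ticket, not a shrug.

```python
class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures and fires an alert callback at the moment it trips."""

    def __init__(self, threshold, on_trip):
        self.threshold = threshold
        self.on_trip = on_trip
        self.failures = 0
        self.open = False

    def call(self, func):
        if self.open:
            # Fail fast while open instead of hammering the broken service.
            raise RuntimeError("circuit open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
                self.on_trip(self.failures)  # the alert worth investigating
            raise
        self.failures = 0  # any success resets the failure streak
        return result
```

A timeout change like the one in our story shows up here as a rising `failures` count; when the breaker opens, `on_trip` is the early warning that we learned not to ignore.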
On call, by itself, can be a daunting experience. It can be all the more daunting if practices and policies are not in place to keep alerts in check. I hope you’ve gained some insight into the common problems we faced and the solutions we found effective.