Operations Engineering and Sensory Overload

Sensu reports problems with the main page of your site, your Graphite graphs confirm that page execution time is off the charts, PagerDuty is blowing up your phone, and your Elasticsearch cluster is drowning in error logs…

Outages are a deluge of ALL CAPS emails, PagerDuty alerts, and text messages from our various monitoring tools. That’s good! A relief, too – I want to know when things are going sideways. That said, all these disparate systems each compete for my attention by shouting at me, and sometimes I find myself wishing for a ‘system’ that collects the noise and spits out just the facts – specifically, useful insight into our application issues.

Researching and implementing this ‘system’ has been the focus of my co-op term in Operations Engineering.

Do All the Things

At a high level, my ideal system:

  • Replaces our existing monitoring tools
  • Provides a unified monitoring solution across all of Hootsuite’s products
  • Provides our software engineers with a detailed view of their applications (especially important as we move to a Service-Oriented Architecture)

My approach has been research → build → test → assess. My first experiment is with New Relic, a SaaS Application Performance Management (APM) tool for monitoring web and mobile applications. For the purpose of my task, I’m focusing only on web.

My proof-of-concept is to build a series of dashboards and assess their usefulness in diagnosing application issues: do they contribute to a faster time-to-diagnose (and therefore time-to-resolve) by presenting information about an issue in a more intuitive fashion?

APM Dashboard

30-minute Monitoring Slice of the Hootsuite Dashboard

This APM dashboard is monitoring one of our PHP applications. It takes in information from all the servers running the app and displays it in easy-to-consume chunks. My favourite ‘at-a-glance’ measure is the Apdex score, which summarizes overall application health as a single number between 0 and 1. Below these graphs is a list of transactions and their execution times, each clickable for additional information. The error rate graph has already helped us diagnose issues.
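For readers curious about what sits behind that single number: the Apdex standard counts requests that finish within a target time T as "satisfied", those that finish within 4T as "tolerating", and scores the application as (satisfied + tolerating / 2) / total. Here is a minimal Python sketch of that formula. The function name, the sample response times, and the 0.5-second threshold are my own illustrative assumptions, not values from our dashboards or New Relic's implementation.

```python
def apdex(response_times, threshold):
    """Return the Apdex score (0.0 to 1.0) for response times in seconds.

    `threshold` is the target time T: requests at or under T count as
    satisfied, requests over T but at or under 4T count as tolerating,
    and everything slower counts as frustrated.
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    # Tolerating requests are worth half a satisfied request.
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical sample: two fast requests, two tolerable ones, one very slow.
samples = [0.1, 0.3, 0.6, 1.5, 3.0]
print(apdex(samples, threshold=0.5))  # → 0.6
```

A score near 1.0 means almost every request met the target, which is why the number works so well as an at-a-glance health check.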

Server Monitoring Dashboard

24-hours of Server Monitoring

In addition to a PHP agent, each server runs a system monitoring agent, which has its own dashboard. This is a record of one of the servers running one of our applications over 24 hours. The daily cycle of social media activity is reflected in the network I/O and CPU usage. It’s easy to see the processes running on the server and how many resources each one takes. At the moment this dashboard only displays information about one server at a time, but we have put in a feature request with New Relic, so hopefully we will soon be able to ‘roll up’ the information as we do with the application monitoring.

A Real-world Example

Earlier this year our blog was sending a flurry of outage alerts to our on-call Operations Engineer.

Here was an opportunity to assess my proof-of-concept. I deployed New Relic to our blog.hootsuite.com servers to see if we’d get any insight into the problem:

Error Rate Spikes on the Blog

As soon as I enabled monitoring, it showed that the error rate had spiked to 80% (our standard for error rates is well below 1%), so something was clearly wrong. New Relic showed there were 1700 instances of a single error in just a short span of time: calling a function with too few arguments. This visibility let us diagnose the problem, fix it, and deploy a release within three hours. The dashboard made the problem obvious – there was no dredging through error logs or reading stack traces. This event encouraged me to continue my experiment by deploying and testing New Relic on more of our application servers.
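The two things the dashboard surfaced here, an overall error rate and one error dominating the rest, are simple aggregations. The following Python sketch shows the idea only; the function names and sample messages are hypothetical, and this is not how New Relic computes or exposes these figures.

```python
from collections import Counter


def error_rate(error_count, request_count):
    """Fraction of requests that ended in an error (0.0 when idle)."""
    if request_count == 0:
        return 0.0
    return error_count / request_count


def top_errors(messages, n=3):
    """Return the n most frequent error messages with their counts."""
    return Counter(messages).most_common(n)


# Made-up sample: 10 requests, 4 of which failed.
errors = [
    "Too few arguments",
    "Too few arguments",
    "Too few arguments",
    "Connection timed out",
]
print(error_rate(len(errors), 10))  # → 0.4
print(top_errors(errors, n=1))      # → [('Too few arguments', 3)]
```

When one message accounts for nearly all of the errors, as it did on the blog, the grouped view points straight at the culprit without anyone reading individual log lines.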

As an aside, moments like this are so rewarding as a co-op student: working in a real production environment and fixing a real problem.

Conclusion

Having a clear and helpful view of your system during an outage is necessary to make an accurate diagnosis of any problems. Furthermore, applying a proof-of-concept system to a real-world situation has proven an effective way to grade its usefulness. Early results show promise: the information on our New Relic dashboards has so far helped reduce our time-to-resolve for more than one issue in production. My experiment continues…

Thanks

Thanks to Aaron Budge, Jeff Oliver, Jeremie Bethmont, Jonas Courteau, Noel Pullen, Kimli Welsh, and Jacob King for reading drafts of this post.

About the Author

Neil Power is a co-op student from Memorial University of Newfoundland, working on the Operations Engineering team at Hootsuite. This is his first time in Vancouver and he’s enjoying the good weather and trying all sorts of new food. Follow him on Twitter @neilpower.