Accurate Counting with Graphite and Statsd
We have been using statsd and Graphite at HootSuite to get visibility into the frequency of events within our technology stack for over a year now. And to great effect.
In real time we are able to see the health of the HootSuite dashboard, its supporting services and the myriad APIs our product uses, and to better understand things like our payments systems. Visibility has reduced downtime, made us more responsive to performance problems and generally forced us to confront more technical debt than we otherwise might have. More recently we have taken things a step further and are now using statsd and Graphite to measure and optimize the implementation of specific features. We test theories about whether a given change in the product will actually yield the intended result. It’s the scientific method at work and a tried and true approach to evolving software as a service (SaaS) in general.
This approach has been used to make product deliveries better. We have applied these techniques to things like:
- Optimizing our sign-up with Facebook feature
- Making it easier for users to add social networks to HootSuite, and
- Significantly reducing the number of Twitter API calls from our Twitter archiving service
But for all of their benefits, statsd and Graphite have suffered from internal doubts about data accuracy. When we periodically validate counts against some other known-to-be-accurate data source like a MySQL table, the numbers do not always match. Relative amounts of multiple Graphite stats always seem to be OK, but not the absolutes. It’s disconcerting.
Lack of confidence in data is arguably the surest kiss of death for a data driven system. As a development team we have been touting the success of data driven development to other business units but if the data is wrong, so might be the conclusions we’re drawing about the results of our development efforts!
As it turns out, things are not that bad. Statsd and Graphite work just fine. As is so often the case, we have simply, on occasion, been looking at Graphite data through the wrong lens – one that distorts the absolute numbers.
This article describes the problems we have encountered and how we have gained confidence in Graphite by better understanding what’s going on under the hood.
The Problem in Pictures
First, here is a view of the problem as we encounter it, in pictures. This is a view of the total occurrences of a real user driven Graphite stat over a recent three hour span:
This graph is telling us that the event has happened about 270 times in the last three hours, and at a fairly constant rate. About 3 events occur every 2 minutes. Useful.
Here is a view of the same stat with the same counting mechanism applied but with the relative time range expanded out to the last 22 hours:
Uh oh! Now Graphite is telling us there have been 240 occurrences of the event in the last 22 hours, or about once every five and a half minutes! What gives?
How Graphite and Statsd Work
To understand why the same approach to counting over two different time spans is inconsistent, here is what you need to know first about how these systems gather, store and return data.
Reporting Intervals in Statsd
Instrumented code sends events to statsd over UDP. When code that fires statsd events is hit, statsd gets the event from the code, but only forwards data to Graphite every 10 seconds by default. This is an obvious optimization intended to cut down on network traffic. So for example, if an event happens 250 times per second, statsd will send one datapoint to Graphite with a value of 2500 every 10 seconds. And that’s how it will be stored in Graphite too (to start). A datapoint of 2500 will be stored for every 10 second interval.
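To make the batching concrete, here is a minimal Python sketch of the idea. It is not statsd's code: the function name `flush_buckets` and the way events are represented as plain timestamps are assumptions made for illustration only.

```python
from collections import Counter

FLUSH_INTERVAL = 10  # seconds; the default flush interval described above


def flush_buckets(event_times, flush_interval=FLUSH_INTERVAL):
    """Group raw event timestamps (in seconds) into per-flush counts.

    Returns a dict mapping the start of each flush window to the number of
    events seen in that window, i.e. the one value forwarded per window.
    """
    counts = Counter()
    for t in event_times:
        window_start = int(t // flush_interval) * flush_interval
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# 250 events per second for 20 seconds: two flush windows of 2500 events each.
events = [second + i / 250 for second in range(20) for i in range(250)]
print(flush_buckets(events))  # {0: 2500, 10: 2500}
```

However many events fire inside a window, only one number per window leaves the process.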
Storage Intervals in Graphite
Just as statsd batches data sending for network optimization, Graphite batches data storage for disk use optimization. The older data gets in Graphite, the more optimized storage of that data becomes. The specifics are configurable, but in general, Graphite stores very recent data in very short intervals and summarizes those short intervals into larger intervals over time on the basis that the older data is, the less important interval granularity becomes.
This table shows an example of how storage intervals in Graphite might be configured.
|Data Age (d)||Interval Size|
|d < 6 hours||10 seconds|
|6 hours <= d < 2 months||1 minute|
|2 months <= d||10 minutes|
This approach makes a lot of sense. Detailed visibility of old data is rarely required and this optimization means Graphite requires about 1% of the storage capacity that would be required if everything was stored for all time at 10 second intervals.
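The example table can be read as a simple lookup from data age to storage interval. The sketch below mirrors it in Python; the names `TIERS` and `interval_for_age` are hypothetical, and "2 months" is approximated as 60 days, which is an assumption rather than anything Graphite prescribes.

```python
HOUR = 3600
DAY = 24 * HOUR

# (upper bound on data age in seconds, storage interval in seconds)
# "2 months" is approximated as 60 days here; that figure is an assumption.
TIERS = [
    (6 * HOUR, 10),
    (60 * DAY, 60),
    (float("inf"), 600),
]


def interval_for_age(age_seconds):
    """Return the storage interval (seconds) for data of the given age."""
    for max_age, interval in TIERS:
        if age_seconds < max_age:
            return interval
    return TIERS[-1][1]


print(interval_for_age(5 * HOUR))   # 10
print(interval_for_age(7 * HOUR))   # 60
print(interval_for_age(90 * DAY))   # 600
```

A query that reaches back seven hours therefore touches data stored at one minute intervals, which is where the trouble described next begins.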
Interval Aggregation in Graphite
But here’s the rub. Graphite by default aggregates older data in a non-intuitive way, which makes it subject to being counted inaccurately if one is not careful. This too is configurable but by default, Graphite stores the AVERAGE rate of data within an aggregated interval, and NOT the sum.
A Simple Example to Illustrate
Consider an event that fires once per minute and how it gets stored over 7 hours in a Graphite system using the data ageing configuration from the table above:
The first six hours of data will conceptually look like this:
|Datapoint||Time Stamp||Stat Value|
The seventh hour will look like this – note the bigger intervals, one minute instead of 10 seconds:
|Datapoint||Time Stamp||Stat Value|
At first, that 0.1666 number looks weird until we recall the AVERAGE aggregation technique that Graphite uses by default. 0.1666 is the average number of times the event happened in any given 10 second interval across any minute of the seventh hour.
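The arithmetic behind 0.1666 is easy to reproduce. The Python sketch below rolls six 10 second datapoints into one 1 minute datapoint, once by averaging (the default described above) and once by summing. The `rollup` function is a hypothetical illustration, not Graphite's implementation; in a real installation the method is chosen by the `aggregationMethod` setting in `storage-aggregation.conf`.

```python
def rollup(values, points_per_bucket, method="average"):
    """Collapse consecutive groups of datapoints into one datapoint each."""
    buckets = [
        values[i:i + points_per_bucket]
        for i in range(0, len(values), points_per_bucket)
    ]
    if method == "average":
        return [sum(b) / len(b) for b in buckets]
    if method == "sum":
        return [sum(b) for b in buckets]
    raise ValueError(f"unknown aggregation method: {method}")


# One event in one minute, stored as six 10 second datapoints.
one_minute = [1, 0, 0, 0, 0, 0]

print(round(rollup(one_minute, 6, "average")[0], 4))  # 0.1667
print(rollup(one_minute, 6, "sum")[0])                # 1
```

With averaging, the one event that happened in the minute is no longer visible as a 1 anywhere in the stored data.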
This has implications.
Graphite comes bundled with a myriad of useful functions that can be run on data sets to glean more insight. Counting the total number of times an event happened over a given time range is one of the most common and useful tasks that one can perform with Graphite.
The integral function is particularly handy for counting, but it’s no silver bullet.
From the Graphite documentation, here is how it works:
“integral() will show the sum over time, sort of like a continuous addition function. Useful for finding totals or trends in metrics that are collected per minute.”
This is far less than the whole story of the integral function but regardless, the intuitive web app developer will quickly find this function and reasonably start using it for immediate benefit.
The procedure goes something like this:
- Drop a statsd call into code to send data to a Graphite counter stat based on some user driven event
- Deploy the code to production
- Set up a Graphite dashboard with the integral function against the stat
- Sit back and watch how often users trigger the event
- Iterate on the code to increase / decrease the rate at which the event occurs, as desired
This is a nice tight development loop for gaining insight into the frequency of anything in a web app. Even nicer is when you add two or more stats, say for an A/B test. The team can then watch multiple stats compete over time and quickly pick a winning implementation. It’s a tried and true HootSuite model for rapid continuous improvement of an app.
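For the first step of that procedure, a counter increment is just a small text packet. The sketch below builds one in statsd's `name:value|c` counter format. The stat name `signup.facebook.variant_a` is a made-up example, and most teams would use an existing statsd client library rather than hand-rolling this.

```python
import socket


def counter_packet(stat, count=1):
    """Build a statsd counter packet of the form b'<stat>:<count>|c'."""
    return f"{stat}:{count}|c".encode("ascii")


def send_counter(stat, count=1, host="127.0.0.1", port=8125):
    """Fire-and-forget a counter increment over UDP (8125 is statsd's usual port)."""
    packet = counter_packet(stat, count)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
    return packet


# Hypothetical stat name for one arm of an A/B test.
print(counter_packet("signup.facebook.variant_a"))  # b'signup.facebook.variant_a:1|c'
```

Calling `send_counter("signup.facebook.variant_a")` from the code path that handles the user event is all the instrumentation requires. Because the transport is UDP, the call does not wait for statsd to respond.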
But the integral function starts to fall apart after 6 hours, as the problem pictures above show, because of how Graphite aggregates data.
How Integral Breaks
The integral function sums things. So when the specified time range includes data aggregated with averages, you will not get sums as results, but rather sums of averages. In short, things won’t look right at all.
Another Example to illustrate:
Consider a statsd event that fires once per minute to a Graphite store that keeps the last 1 hour of data at 10 second intervals and keeps 1 minute averages (not sums) for anything older than 1 hour. The data will look something like this in the database:
Beyond an hour line, data is stored as averages
Now if we run the integral function on the stat for the most recent hour (or less), we’ll get the right answer:
360 datapoints, 1/6 of which have a value of 1, yields 60 events. Right answer.
But, if we run the integral function on the stat for the most recent hour plus 1 minute, things change drastically. Without having done a deep forensic code analysis, here is how I speculate it works – and this speculation holds up against several empirical observations:
The interval after the one hour mark is 1 minute and the aggregation function is configured to average time series, not sum them. So, the integral function will convert the dataset to one minute interval averages before doing any summation. The result is 61 data points, each with a value of 0.1666 over a span of one minute each. Adding these all up, we get an answer of just over 10.
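The two cases can be reproduced with a few lines of Python. This follows the speculation above rather than Graphite's actual code: `integral` here is a plain running sum, and the datasets are the idealized ones from this example.

```python
from itertools import accumulate


def integral(values):
    """Running total of a series, one output point per input point."""
    return list(accumulate(values))


# Most recent hour: 360 ten-second datapoints, one event per minute.
raw_hour = [1, 0, 0, 0, 0, 0] * 60

# Hour plus one minute: 61 one-minute datapoints, each the average 1/6.
averaged = [1 / 6] * 61

print(integral(raw_hour)[-1])            # 60
print(round(integral(averaged)[-1], 2))  # 10.17
```

Extending the time range by a single minute drops the reported total from 60 to roughly 10, even though 61 events actually occurred.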
So in short, the integral function on its own is only useful on a Graphite instance with default configuration if we don’t query time ranges beyond the point in history where Graphite starts aggregating data using averages.
Counting Past Graphite Aggregation Boundaries
But counting is still so useful, and Graphite has the data in some form, so how do we get counts across time ranges that span Graphite aggregation boundaries? At HootSuite we use the hitcount() function. Hitcount lets you specify a stat and a bucket size and counts data correctly across Graphite aggregation boundaries. With these parameters, hitcount will give accurate totals per interval.
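One way to see why hitcount survives aggregation boundaries: the Graphite documentation describes it as treating each value as hits per second and scaling by the length of the interval the value covers. The Python sketch below is an interpretation of that description, not Graphite's code. It assumes the stat is stored as a per-second rate, and the function signature is hypothetical.

```python
def hitcount(rates, step_seconds, bucket_seconds):
    """Estimate hits per bucket from per-second rates stored every step_seconds."""
    points_per_bucket = bucket_seconds // step_seconds
    hits = [rate * step_seconds for rate in rates]  # rate x duration = count
    return [
        sum(hits[i:i + points_per_bucket])
        for i in range(0, len(hits), points_per_bucket)
    ]


# One event per minute, expressed as a per-second rate (an assumption):
# at 10 second resolution, one slot in six holds 1 event / 10 s = 0.1 per second.
raw = [0.1, 0, 0, 0, 0, 0] * 60
# After averaging into 1 minute points, every slot holds 1/60 per second.
averaged = [1 / 60] * 60

print(round(sum(hitcount(raw, 10, 3600)), 6))       # 60.0
print(round(sum(hitcount(averaged, 60, 3600)), 6))  # 60.0
```

Multiplying an average rate by the interval it covers gives back the count for that interval, so the total comes out the same at either resolution.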
For example, in the Graphite web app, to see the count per hour of a stat over whatever time range is specified in the UI, use this:
Here’s an example of running hitcount with hourly breakdown through a full day time range.
This view of things is quite different in form and meaning from what integral gives us, but is often useful in its own right.
To get something closer to integral, use the hitcount function on a time range that is less than the specified bucket size. For example, to get a count of how many things have happened in the last 10 hours, set the time range in the Graphite web app to the last 10 hours and run this function:
Since the time range is no greater than the bucket size, we get one number:
The number is right, but the display is kind of rough. Truth be told, hitcount is kind of hard to use in the Graphite web UI because:
- Lower bounds get set based on the minimum value in the returned data set. For hitcount, a constant lower bound of 0 would be a more logical default
- A histogram type view would be easier to interpret than points connected by lines for the hitcount function
These aspects of data viewing are configurable in the web UI but configuration is a pain.
Maybe the most powerful use of hitcount is to use it to make the integral function work the way developers intuit that it should!
Going back to the problem graph from the start of this article, we can fix it simply by applying integral over the long time range to the output of hitcount with a small bucket size, like this:
integral( hitcount(stat, "1minutes") )
Applying this to a similar 22 hour dataset on the same stat from the problem graphs yields the right answer:
That’s more like it.
Overcoming the Web UI Limitations
As alluded to, the Graphite web UI is less than ideal, especially for counting. Lots of configuration is required to get something half decent. Doing better than the Graphite web UI is actually quite straightforward because Graphite provides a very workable API with one render endpoint that can pass JSON data back for sets of stats, with any functions applied, across whatever time range. At HootSuite we’ve used this API to build rapidly configurable interfaces for key stats using simple jQuery and the D3.js graphing library.
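Our own tooling is jQuery and D3, but the shape of the API call is the same from any language. Here is a minimal Python sketch that builds a render URL with `format=json` and reads a running total out of a response. The host `graphite.example.com`, the stat name `stats.example.event`, and the sample response body are all made up for illustration.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, target, time_from="-22h"):
    """Build a Graphite render API URL that returns JSON for one target."""
    query = urlencode({"target": target, "from": time_from, "format": "json"})
    return f"{base_url}/render?{query}"


def last_value(response_text):
    """Return the last non-null value of the first series in a JSON response."""
    series = json.loads(response_text)
    values = [v for v, _timestamp in series[0]["datapoints"] if v is not None]
    return values[-1] if values else None


# Hypothetical host and stat name; the target is the counting expression above.
url = render_url(
    "http://graphite.example.com",
    'integral(hitcount(stats.example.event,"1minutes"))',
)
print(url)

# Made-up response body in the render API's [value, timestamp] datapoint layout.
sample = (
    '[{"target": "example", "datapoints": '
    "[[1.0, 1400000000], [3.0, 1400000060], [null, 1400000120]]}]"
)
print(last_value(sample))  # 3.0
```

Because the target is an integral, the last non-null datapoint is the total for the whole time range, which is usually the one number a counting dashboard needs.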
Here’s a screenshot of a nice D3 based view of a Graphite stat called via API, broken down by week and color coded by sub-stat:
It gives a nice view of how an event grew over time.
statsd and Graphite are powerful and we use them to great effect at HootSuite. Shortcomings that we’ve encountered have always been a function of our own inexperience or presumptions conspiring with Graphite’s tendency for user unfriendliness. Issues so far have never been data layer failures in Graphite.
Digging deeper into understanding how Graphite works has enabled us to get way more out of a system designed for power at the expense of usability.