Accurate Counting with Graphite and Statsd

We have been using statsd and Graphite at HootSuite to get visibility into the frequency of events within our technology stack for over a year now. And to great effect.

In real time we are able to see the health of the HootSuite dashboard, its supporting services and the myriad APIs our product uses, and to better understand things like our payments systems. Visibility has reduced downtime, made us more responsive to performance problems and generally forced us to confront more technical debt than we otherwise might have. More recently we have taken things a step further and are now using statsd and Graphite to measure and optimize the implementation of specific features. We test theories about whether a given change in the product will actually yield the intended result. It’s the scientific method at work, and a tried and true approach to evolving software as a service (SaaS) in general.

We have used this approach to improve product deliveries, applying these techniques to things like:

  • Optimizing our sign-up with Facebook feature
  • Making it easier for users to add social networks to HootSuite, and
  • Significantly reducing the number of Twitter API calls from our Twitter archiving service

But for all of their benefits, statsd and Graphite have suffered from internal doubts about data accuracy. When we periodically validate counts against some other known-to-be-accurate data source, like a MySQL table, the numbers do not always match. Relative amounts across multiple Graphite stats always seem to be OK, but not the absolutes. It’s disconcerting.

Lack of confidence in data is arguably the surest kiss of death for a data driven system. As a development team we have been touting the success of data driven development to other business units but if the data is wrong, so might be the conclusions we’re drawing about the results of our development efforts!

As it turns out, things are not that bad. Statsd and Graphite work just fine. As is so often the case, we have simply, on occasion, been looking at Graphite data through the wrong lens – one that distorts the absolute numbers.

This article describes the problems we have encountered and how we have gained confidence in Graphite by better understanding what’s going on under the hood.

The Problem in Pictures

First, here is a view of the problem as we encounter it. This is a graph of the total occurrences of a real user-driven Graphite stat over a recent three hour span:

GraphiteProblem-good

This graph is telling us that the event has happened about 270 times in the last three hours, at a fairly constant rate of about 3 events every 2 minutes. Useful.

Here is a view of the same stat with the same counting mechanism applied but with the relative time range expanded out to the last 22 hours:

GraphiteProblem-bad

Uh oh! Now Graphite is telling us there have been 240 occurrences of the event in the last 22 hours, or only about once every five and a half minutes! What gives?

How Graphite and Statsd Work

To understand why the same approach to counting gives inconsistent results over two different time spans, you first need to know how these systems gather, store and return data.

Reporting Intervals in Statsd

Instrumented application code sends events to statsd over UDP. But statsd does not forward every event to Graphite immediately; instead, it aggregates them in memory and only flushes data to Graphite every 10 seconds by default. This is an obvious optimization intended to cut down on network traffic. So for example, if an event happens 250 times per second, statsd will send Graphite a single value of 2500 every 10 seconds. And that’s how it will be stored in Graphite too (to start): one datapoint of 2500 for every 10 second interval.
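As a toy illustration (this is a sketch of the batching behaviour, not statsd’s actual implementation), a counter with a flush interval works roughly like this:

```python
class ToyStatsdCounter:
    """Toy model of statsd's counter batching: increments accumulate
    in memory and are flushed as one datapoint per interval."""

    def __init__(self, flush_interval=10):
        self.flush_interval = flush_interval  # seconds; statsd's default is 10
        self.count = 0

    def incr(self, n=1):
        # Called every time the instrumented code path fires.
        self.count += n

    def flush(self):
        # Called once per flush interval; in real statsd, this single
        # aggregated number is what goes over the wire to Graphite.
        total, self.count = self.count, 0
        return total

counter = ToyStatsdCounter()
for _ in range(2500):   # 250 events/sec for 10 seconds
    counter.incr()
print(counter.flush())  # -> 2500, stored as a single 10-second datapoint
```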

Storage Intervals in Graphite

Just as statsd batches data sending for network optimization, Graphite batches data storage for disk use optimization. The older data gets in Graphite, the more optimized storage of that data becomes. The specifics are configurable, but in general, Graphite stores very recent data in very short intervals and summarizes those short intervals into larger intervals over time on the basis that the older data is, the less important interval granularity becomes.

This table shows an example of how storage intervals in Graphite might be configured.

Data Age (d)              Interval Size
d < 6 hours               10 seconds
6 hours <= d < 2 months   1 minute
2 months <= d             10 minutes

This approach makes a lot of sense. Detailed visibility of old data is rarely required and this optimization means Graphite requires about 1% of the storage capacity that would be required if everything was stored for all time at 10 second intervals.
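In Graphite, these retention rules live in carbon’s storage-schemas.conf. A configuration matching the table above might look like the following; the section name, pattern and one-year tail are illustrative assumptions, not taken from our production setup:

```ini
# storage-schemas.conf (example only)
[stats]
pattern = ^stats\.
# 10s datapoints for 6 hours, then 1min datapoints up to 2 months,
# then 10min datapoints (here capped at 1 year for illustration)
retentions = 10s:6h,1m:60d,10m:1y
```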

Interval Aggregation in Graphite

But here’s the rub. Graphite by default aggregates older data in a non-intuitive way, which makes it subject to being counted inaccurately if one is not careful. This too is configurable but by default, Graphite stores the AVERAGE rate of data within an aggregated interval, and NOT the sum.
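The aggregation method is also configurable per stat pattern, in carbon’s storage-aggregation.conf. For example, to make counter-style stats aggregate by sum instead of average (the section names and patterns here are illustrative):

```ini
# storage-aggregation.conf (example only)
[counters]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[everything_else]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```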

A Simple Example to Illustrate

Consider an event that fires once per minute and how it gets stored over 7 hours in a Graphite system using the data ageing configuration from the table above:

The first six hours of data will conceptually look like this:

Datapoint   Time Stamp   Stat Value
1           00:00:00     0
2           00:00:10     0
3           00:00:20     0
4           00:00:30     0
5           00:00:40     0
6           00:00:50     1
...         ...          ...
2155        05:59:00     0
2156        05:59:10     0
2157        05:59:20     0
2158        05:59:30     0
2159        05:59:40     0
2160        05:59:50     1

The seventh hour will look like this – note the bigger intervals, one minute instead of 10 seconds:

Datapoint   Time Stamp   Stat Value
2161        06:00:00     0.1666
2162        06:01:00     0.1666
2163        06:02:00     0.1666
...         ...          ...
2218        06:57:00     0.1666
2219        06:58:00     0.1666
2220        06:59:00     0.1666

At first, that 0.1666 number looks weird, until we recall the AVERAGE aggregation technique that Graphite uses by default: 0.1666 is the average value of the six 10 second datapoints that make up each minute of the seventh hour. The event fired in one of those six intervals, so the average is 1/6.
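The arithmetic behind that number is easy to check with a minimal sketch:

```python
# One minute of 10-second datapoints for an event that fires
# once per minute: five empty intervals, then one hit.
ten_second_points = [0, 0, 0, 0, 0, 1]

# Graphite's default AVERAGE aggregation rolls the six 10s points
# into a single 1-minute point by averaging them:
minute_point = sum(ten_second_points) / len(ten_second_points)
print(round(minute_point, 4))  # -> 0.1667, i.e. 1/6
```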

This has implications.

Counting Data

Graphite comes bundled with a myriad of useful functions that can be run on data sets to glean more insight. Counting the total number of times an event happened over a given time range is one of the most common and useful tasks that one can perform with Graphite.

The integral function is particularly handy for counting, but it’s no silver bullet.

From the Graphite documentation, here is how it works:

“integral() will show the sum over time, sort of like a continuous addition function. Useful for finding totals or trends in metrics that are collected per minute.”

This is far from the whole story of the integral function, but regardless, an intuitive web app developer will quickly find it and reasonably start using it for immediate benefit.

The procedure goes something like this:

  • Drop a statsd call into code to send data to a Graphite counter stat based on some user driven event
  • Deploy the code to production
  • Set up a Graphite dashboard with the integral function against the stat
  • Sit back and watch how often users trigger the event
  • Iterate on the code to increase / decrease the rate at which the event occurs, as desired

This is a nice tight development loop for gaining insight into the frequency of anything in a web app. Even nicer is adding two or more stats, say for an A/B test. The team can then watch multiple stats compete over time and quickly pick a winning implementation. It’s a tried and true HootSuite model for rapid continuous improvement of an app.

But the integral function starts to fall apart after 6 hours, as the problem pictures above show, because of how Graphite aggregates data.

How Integral Breaks

The integral function sums things. So when the specified time range includes data aggregated with averages, you will not get sums as results, but rather sums of averages. In short, things won’t look right at all.

Another Example to illustrate:

Consider a statsd event that fires once per minute to a Graphite store that keeps the last 1 hour of data at 10 second intervals and keeps 1 minute averages (not sums) for anything older than 1 hour. The data will look something like this in the database:

Datapoint   Timestamp   Stat Value
1           00:00:00    0
2           00:00:10    0
3           00:00:20    0
4           00:00:30    0
5           00:00:40    0
6           00:00:50    1
...         ...         ...
355         00:59:00    0
356         00:59:10    0
357         00:59:20    0
358         00:59:30    0
359         00:59:40    0
360         00:59:50    1

Beyond the one hour line, data is stored as averages:

Datapoint   Timestamp   Stat Value
361         01:00:00    0.166
362         01:01:00    0.166
363         01:02:00    0.166

Now if we run the integral function on the stat for the most recent hour (or less), we’ll get the right answer:

360 datapoints, 1/6 of which have a value of 1, yields 60 events. Right answer.

But, if we run the integral function on the stat for the most recent hour plus 1 minute, things change drastically. Without having done a deep forensic code analysis, here is how I speculate it works – and this speculation holds up against several empirical observations:

The interval after the one hour mark is 1 minute, and the aggregation function is configured to average the time series, not sum it. So the integral function will convert the whole dataset to one minute interval averages before doing any summation. The result is 61 datapoints, each spanning one minute and each with a value of 0.1666. Adding these up, we get an answer of just over 10.

Wrong answer!
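A minimal numeric sketch of this failure mode (assuming, as speculated above, that integral() simply sums the datapoints it is handed):

```python
# 61 minutes of an event that fires once per minute.
# Raw 10-second datapoints: five zeros then a one, each minute.
raw = [0, 0, 0, 0, 0, 1] * 61

# Summing the raw 10s data (what integral effectively does when no
# aggregation is involved) gives the true total:
true_total = sum(raw)  # 61 events

# Average-aggregation collapses each minute's six points into one
# datapoint with value 1/6:
aggregated = [sum(raw[i:i + 6]) / 6 for i in range(0, len(raw), 6)]

# Summing those averages gives the misleading "integral":
wrong_total = sum(aggregated)
print(true_total, round(wrong_total, 2))  # -> 61 10.17
```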

So in short, on a Graphite instance with default configuration, the integral function on its own is only useful if we don’t query time ranges beyond the point in history where Graphite starts aggregating data using averages.

Counting Past Graphite Aggregation Boundaries

But counting is still so useful, and Graphite has the data in some form, so how do we get counts across time ranges that span Graphite aggregation boundaries? At HootSuite we use the hitcount() function. Hitcount lets you specify a stat and a bucket size, and it counts data correctly across aggregation boundaries, giving accurate totals per bucket.

For example, in the Graphite web app, to see the count per hour of a stat over whatever time range is specified in the UI, use this:

hitcount(stat,"1hours")

Here’s an example of running hitcount with hourly breakdown through a full day time range.

Graphite-hitcount

This view of things is quite different in form and meaning from what integral gives us, but is often useful in its own right.

To get something closer to integral, use the hitcount function with a bucket size greater than the specified time range. For example, to get a count of how many things have happened in the last 10 hours, set the time range in the Graphite web app to the last 10 hours and run this function:

hitcount(stat,"1days")

Since the time range is greater than the bucket size, we get one number:

Graphite-hitcount-1point

The number is right, but the display is kind of rough. Truth be told, hitcount is kind of hard to use in the Graphite web UI because:

  • Lower bounds get set based on the minimum value in the returned data set. For hitcount, a constant lower bound of 0 would be a more logical default
  • A histogram type view would be easier to interpret than a line graph connected by points for the hitcount function

These aspects of data viewing are configurable in the web UI but configuration is a pain.

Maybe the most powerful use of hitcount is to use it to make the integral function work the way developers intuit that it should!

Going back to the problem graph from the start of this article, we can fix it simply by applying the integral function, over a long time range, to the output of the hitcount function with a small bucket size, like this:

integral( hitcount(stat, "1minutes") )

Applied to a similar 22 hour dataset of the same stat from the problem above, this yields the right answer:

Graphite-integral-hitcount

That’s more like it.

Overcoming the Web UI Limitations

As alluded to, the Graphite web UI is less than ideal, especially for counting; lots of configuration is required to get something half decent. Doing better than the Graphite web UI is actually quite straightforward, because Graphite provides a very workable API with a render endpoint that can return JSON data for sets of stats, with any functions applied, across any time range. At HootSuite we’ve used this API to build rapidly configurable interfaces for key stats using simple jQuery and the D3.js graphing library.
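For example, the render API call behind such a dashboard can be assembled like this. The /render endpoint and the target, from and format parameters are standard Graphite; the host name and the stat are placeholders:

```python
from urllib.parse import urlencode

GRAPHITE_HOST = "http://graphite.example.com"  # placeholder host

def render_url(target, time_range="-22h"):
    """Build a Graphite render API URL that returns JSON for the
    given target expression over the given relative time range."""
    params = urlencode({
        "target": target,
        "from": time_range,
        "format": "json",
    })
    return "{}/render?{}".format(GRAPHITE_HOST, params)

# The fixed 22 hour count from above, as an API call:
url = render_url('integral(hitcount(stat,"1minutes"))')
print(url)
```

The JSON that comes back is a list of {target, datapoints} objects, ready to hand to a client-side graphing library like D3.js.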

Here’s a screenshot of a nice D3 based view of a Graphite stat called via API, broken down by week and color coded by sub-stat:

Graphite-REST-D3

It gives a nice view of how an event grew over time.

Conclusion

statsd and Graphite are powerful and we use them to great effect at HootSuite. The shortcomings we’ve encountered have always been a function of our own inexperience or presumptions conspiring with Graphite’s tendency toward user unfriendliness. So far, no issue has ever turned out to be a failure in Graphite’s data layer.

Digging deeper into understanding how Graphite works has enabled us to get way more out of a system designed for power at the expense of usability.