Distributed Configuration Management and Dark Launching Using Consul

Something that has been in our sights for a while is Distributed Configuration. This is a puzzle piece we’re currently missing, and one that will help guide and support many of the tasks that we’re going to face in the next couple years. In this post we’ll recount our experience with implementing Consul to fill that need.


The Problem

We’re improving our build and deploy pipeline, but don’t yet have a way to handle dynamic configuration of services. This part of the system will see some heavy work in the next year, and Distributed Config will influence a lot of the design of that system.

The usage of Consul has also been driven by a need to improve an existing piece of our system – the Dark Launch mechanism. It’s one of the key ways Hootsuite is able to be nimble and keep our deployment rate up, without sacrificing quality. Dark Launching, or “feature flagging”, allows us to have control over very granular pieces of the codebase through an interface we created. We can modify the execution of our code at runtime by setting conditions on the execution of a certain block, such as boolean true or false, random percentage, specific members, and more. We’ve had this system in place for years now, and we can confidently say that implementing it was one of the best decisions Hootsuite Engineering ever made. The high degree of confidence and control over their code is invaluable to our engineers.
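
To give a rough sense of how a feature flag check like this is used in application code, here is a purely hypothetical sketch; the class, flag code, and condition handling are made up for illustration and are not our actual interface:

  <?php
  // Hypothetical sketch of a dark launch ("feature flag") check; the class,
  // flag codes, and condition types here are made up for illustration.
  class DarkLaunch
  {
      private $flags;

      public function __construct(array $flags)
      {
          $this->flags = $flags;
      }

      public function isEnabled($code, $memberId)
      {
          if (!isset($this->flags[$code])) {
              return false;
          }
          $flag = $this->flags[$code];
          switch ($flag['type']) {
              case 'boolean':    return (bool) $flag['value'];
              case 'percentage': return ($memberId % 100) < $flag['value'];
              case 'members':    return in_array($memberId, $flag['value'], true);
              default:           return false;
          }
      }
  }

  $darkLaunch = new DarkLaunch(array(
      'NEW_SCHEDULER_UI' => array('type' => 'percentage', 'value' => 25),
  ));

  if ($darkLaunch->isEnabled('NEW_SCHEDULER_UI', 12345)) {
      // new code path, seen only when the flag's conditions match
  } else {
      // existing, known-good code path
  }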

However, being that important and useful comes with a cost. As its usage has grown, the Dark Launching system has developed a number of hotspots and has become a potential point of failure. It relies on a MySQL table that is cached using memcached and PHP APC, which means the keys required to serve the dark launch data are read very heavily. We also need a unified way of managing dark launch codes across many services and codebases, in various languages.

Reasons We Picked Consul

  • We’ve wanted to use Serf for a while, and Consul takes Serf and adds other great stuff
  • We’re fans of other Hashicorp products such as Vagrant
  • It supports 0-TTL DNS-based service discovery, plus events, a KV store, etc.
  • It will allow us to solve current problems and is flexible enough to be able to handle future situations
  • Multi-data center aware – very important as we scale out a globally-distributed service oriented platform
  • Easy to set up – the same agent process runs on all servers, just with slightly modified configuration
  • Easy to use REST API
  • Baked-in security through encryption and ACLs
  • Based on the well-known and proven protocols SWIM and RAFT

For more information about some alternatives to Consul that we investigated, have a look at Consul’s own list.

Consul overview (All nodes run the same agent, but “servers” participate in RAFT and become seed nodes for clients, which communicate by gossiping)

Risks to Consider

  • Beta (bleeding edge, untested, lack of community resources) – If necessary we can contribute back to Consul or fork a stable version (it’s written in Go)
  • The quick pace of its development could be an ops / dev problem if frequent updates are required
  • It can shim itself into the DNS lookup path; if this fails, it could fail hard
  • The amount of data sent between nodes for gossip and RAFT is an unknown
  • The key-value store is per-datacenter – replicating its data to other datacenters requires a separate process

Our Implementation

Our first implementation of Consul revolves around the key-value data store. Given our problems with the Dark Launching system, we thought Consul would be a great choice to replace its backend because of the combination of the KV store and events. This made it possible to move to a push-based system, rather than the pull-based system we had.

When we want to modify a dark launch value, we use our admin panel, which runs on webservers that also run the Consul agent. When we save the value, the PHP code of the admin panel makes a request to the REST API on the local agent, changing the JSON value of a key using a URL like this:

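The shape of that call, with an illustrative key name and payload (the /v1/kv/ endpoint and port 8500 are Consul defaults; darklaunch/dashboard/core is the prefix we use):

  curl -X PUT -d '{"type": "percentage", "value": 25}' \
      http://localhost:8500/v1/kv/darklaunch/dashboard/core/NEW_SCHEDULER_UI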

The key prefix is like a directory structure or namespace within the KV store, and Consul allows you to set watches not only on individual keys, but also on key prefixes. These watches mean that when anything inside a certain prefix is modified, a predefined script is executed and passed the data of all the keys inside that prefix.

When we make that REST call, the local agent sends the data to the RAFT leader, which then sends out an event to all the nodes that something has changed in the KV store, and watches are called on any server where the Consul agent has been configured to respond to that type of event. Watch handlers can be any executable file; when they are triggered, they are passed the relevant data through STDIN. In our case, we watch the “darklaunch/dashboard/core” prefix, and when an event comes in we take all the data in that prefix, write it into a PHP-parseable file on the filesystem, and then invalidate that file in the server’s opcode cache.

Example watch:
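
A minimal sketch of what that watch definition looks like in the agent’s JSON configuration (the handler path here is illustrative):

  {
    "watches": [
      {
        "type": "keyprefix",
        "prefix": "darklaunch/dashboard/core",
        "handler": "/usr/local/bin/darklaunch-watch-handler.php"
      }
    ]
  }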

And corresponding handler:
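
A minimal sketch of such a handler in PHP (the output path is illustrative, and invalidating the webserver’s opcode cache is not shown):

  #!/usr/bin/env php
  <?php
  // Consul passes the keys under the watched prefix to the handler as a JSON
  // array on STDIN, with each value base64 encoded.
  $keys = json_decode(file_get_contents('php://stdin'), true) ?: array();

  $data = array();
  foreach ($keys as $entry) {
      $code = substr($entry['Key'], strlen('darklaunch/dashboard/core/'));
      $data[$code] = json_decode(base64_decode($entry['Value']), true);
  }

  // Write a PHP-parseable file that application code can simply include.
  file_put_contents(
      '/var/cache/darklaunch/core.php',
      '<?php return ' . var_export($data, true) . ';',
      LOCK_EX
  );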

In the results we’ve seen so far, it typically takes less than a second for over 100 servers to converge on the data, and although Consul technically doesn’t guarantee delivery of events, we haven’t yet seen any instances of the data being behind.

Typical convergence of data on a subset of 59 servers after a dark launch code is modified. Each bar represents 0.1 second.

Moving from a pull-based system, where dark launch values are read from a database or cache, to a push-based system, where a file containing up-to-date dark launch data always exists on the filesystem, has been great, as there are fewer external dependencies in the code that relies on that dark launch data. Even if the Consul agent were to go down, the file would still remain on disk. We have seen a small improvement in page execution time and a reduction in CPU usage on webservers since this change because of the simplified way of loading this data. We were also able to roll the change out in a manageable way by implementing the Consul-based Dark Launch system and running it in tandem with the existing system: all reads and writes happened on both sides, and on each read both values were compared and logged to determine whether any data was missing or incorrect.
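
Application code can then load the data with a plain include; a hypothetical sketch (the path matches the handler sketch above and is made up):

  <?php
  // No database, memcached, or network call in the request path: the latest
  // dark launch data is just a PHP file on local disk.
  $darkLaunchData = include '/var/cache/darklaunch/core.php';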

Our typical way of carefully rolling out features is to dark launch them, but that wasn’t really feasible here for obvious reasons.

Lessons Learned

  • We had to modify our dev Vagrant config to open ports for the Consul nodes to communicate, and to make sure each node has a unique name
  • Implement ACLs early, as doing it later will mean a bunch of reconfiguration and will invalidate a lot of the data in the KV store
  • The performance of the underlying Serf gossiping is not tunable (see, for example, Serf’s Convergence Simulator), though it seems that they picked some sane values for a typical setup.
  • The version we started with (0.4.0) had a bug where watches were not being properly loaded, though it was quickly fixed in 0.4.1
  • We found it useful to always use the “rejoin_after_leave” config option for agents. Without it, the bootstrap server will not rejoin the cluster after being destroyed; it will instead just start a new cluster with itself as leader
    • From the docs: “By default, Consul treats leave as a permanent intent, and does not attempt to join the cluster again when starting. This flag allows the previous state to be used to rejoin the cluster.”
  • It’s important to understand Consul’s Disaster Recovery process. Servers can be brought up or down, destroyed, and rebuilt at will, but in the case where all servers are down or have been destroyed, the next time you want to bring them up you’ll need to switch over to the disaster recovery steps. We’ve found this to be reasonable, as the likelihood of all of your RAFT members (typically 3, 5, or 7, distributed through various zones) going down at once is pretty small.
  • Events (at least keyprefix-type events) will be delivered to nodes that were down at the time the event was triggered. When the node comes back up, it will trigger any applicable watches. (@armon from Hashicorp tells us that this is because keyprefix events are handled using a different mechanism, which benefits from extra reliability.)

Now that we’re proving out Consul in production, we’re looking at implementing Scala dark launching, working on a similar principle to the PHP watch handler we created. Instead of writing to a local file, we’ll take a Scala-specific approach, using Akka Remoting to communicate with an agent running inside existing processes and telling it to update its local dark launch storage when an event comes in. We’re also brainstorming all the other fancy things we’d like to do with it, and taking it into account as a key piece of our build and deploy pipeline.

Things We’d Like To Do

  • Brokerless Services: It should be possible to have a pool of workers, and instead of having a load balancer or broker that routes traffic for that pool, all clients could have an up-to-date list of all workers in the pool and connect to them directly.
  • Load balancers that can automatically update their own configs when a new web server comes up
  • Consul clients that can specify a specific service that they provide (MySQL, for example), and then other clients can discover the providers of that service via DNS or the REST API – see the sketch after this list
  • Use it for its health checking features rather than having a separate dedicated health checking agent.
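
For the service-specific discovery idea, a rough sketch of what this could look like using Consul’s service definitions and DNS interface (the service name, port, and check command are illustrative; 8600 is Consul’s default DNS port). A node offering MySQL would register the service in its agent config:

  {
    "service": {
      "name": "mysql",
      "port": 3306,
      "check": {
        "script": "mysqladmin ping",
        "interval": "10s"
      }
    }
  }

Clients could then discover providers of that service with an SRV lookup against their local agent:

  dig @127.0.0.1 -p 8600 mysql.service.consul SRV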

Trying out bleeding edge software can be a risky proposition, but in the case of Consul we’ve found it to be a solid system that works basically as described and was easy to get up and running. We managed to go from initial investigations to production within a month. The value was immediately obvious once we looked into the key-value store combined with the events system and its DNS features, and each of these has worked the way we expected. Overall it has been fun to work with and has worked well, and based on the initial work we have done with the Dark Launching system we’re feeling confident in Consul’s operation and are looking forward to expanding the scope of its use.

Bill

About the Author: Bill is a Specialist Engineer working mostly on the Platform team, which is building the foundation of our move toward a service oriented architecture built on Scala. He also enjoys long walks on the beach, code optimization, and distributed computing papers. Follow Bill on Twitter @bmonkman.