Distributed Configuration Management and Dark Launching Using Consul
Something that has been in our sights for a while is Distributed Configuration. This is a puzzle piece we’re currently missing, and one that will help guide and support many of the tasks we’re going to face in the next couple of years. In this post we’ll recount our experience implementing Consul to fill that need.
The Problem
We’re improving our build and deploy pipeline, but don’t yet have a way to handle dynamic configuration of services. This part of the system will see some heavy work in the next year, and Distributed Config will influence a lot of the design of that system.
The usage of Consul has also been driven by a need to improve an existing piece of our system – the Dark Launch mechanism. It’s one of the key ways Hootsuite is able to be nimble and keep our deployment rate up, without sacrificing quality. Dark Launching, or “feature flagging”, allows us to have control over very granular pieces of the codebase through an interface we created. We can modify the execution of our code at runtime by setting conditions on the execution of a certain block, such as boolean true or false, random percentage, specific members, and more. We’ve had this system in place for years now, and we can confidently say that implementing it was one of the best decisions Hootsuite Engineering ever made. The high degree of confidence and control over their code is invaluable to our engineers.
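To illustrate the idea, a dark launch check evaluates a condition attached to a flag before running a guarded block of code. This is a hypothetical sketch, not Hootsuite’s actual interface; the flag shapes and function name are assumptions:

```python
# Hypothetical sketch of a dark launch check; the flag shapes and the
# function name are illustrative, not Hootsuite's actual interface.
import random

def is_enabled(flag, member_id=None):
    """Evaluate a dark launch flag against its configured condition."""
    if flag["type"] == "boolean":        # plain on/off switch
        return flag["value"]
    if flag["type"] == "percentage":     # random percentage of requests
        return random.random() * 100 < flag["value"]
    if flag["type"] == "members":        # enabled only for specific members
        return member_id in flag["value"]
    return False

# Guarding a block of code:
# if is_enabled(flags["NEW_SCHEDULER"], member_id=current_member.id):
#     ...new code path...
```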
However, being that important and useful comes with a cost. As usage has grown, the Dark Launching system has developed a number of hotspots and is a potential point of failure. It relies on a MySQL table that is cached using memcached and PHP APC, and the keys required to serve the dark launch data are very heavily read. We also need a unified way of managing dark launch codes across many services, codebases, and languages.
Reasons We Picked Consul
- We’ve wanted to use Serf for a while, and Consul takes Serf and adds other great stuff
- We’re fans of other Hashicorp products such as Vagrant
- It supports 0-TTL DNS-based service discovery, plus events, a KV store, and more
- It will allow us to solve current problems and is flexible enough to be able to handle future situations
- Multi-datacenter aware – very important as we scale out a globally distributed, service-oriented platform
- Easy to set up – the same agent process runs on all servers, just with slightly modified configuration
- Easy-to-use REST API
- Baked-in security through encryption and ACLs
- Based on the well-known and proven protocols SWIM and Raft
Risks to Consider
- Beta (bleeding edge, untested, lack of community resources) – if necessary we can contribute back to Consul or fork a stable version (it’s written in Go)
- The quick pace of its development could be an ops/dev problem if frequent updates are required
- It can shim itself in between DNS lookups; if this fails, it could fail hard
- The amount of data sent between nodes for gossip and Raft is an unknown
- The key-value store is per-datacenter; replicating the data across datacenters requires a separate process
Our Implementation
Our first implementation of Consul revolves around the key-value data store. Given our problems with the Dark Launching system, we thought Consul would be a great choice to replace it because of the combination of the KV store and events. This made it possible to rely on a push-based system, rather than a pull-based system like we had.
When we want to modify a dark launch value we use our admin panel which is running on webservers with the consul agent running. When we save the value, the PHP code of the admin panel makes a request to the REST API on the local agent, changing the JSON value of a key using a URL like this:
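The shape of that call, sketched here in Python for illustration (the key name and JSON payload format are assumptions; the agent’s KV endpoint is `/v1/kv/<key>`):

```python
# Sketch of the REST call the admin panel makes to the local Consul agent.
# The dark launch code and payload shape are illustrative assumptions.
import json
import urllib.request

AGENT = "http://localhost:8500"          # local Consul agent's HTTP API
PREFIX = "darklaunch/dashboard/core"     # prefix used by the dashboard

def kv_url(code):
    """Build the KV endpoint URL for a dark launch code."""
    return f"{AGENT}/v1/kv/{PREFIX}/{code}"

def set_dark_launch(code, payload):
    """PUT a JSON value to the local agent; Consul replicates from there."""
    req = urllib.request.Request(
        kv_url(code), data=json.dumps(payload).encode(), method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read() == b"true"  # the KV endpoint returns "true" on success

# e.g. set_dark_launch("NEW_COMPOSER_UI", {"type": "boolean", "value": True})
```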
The key prefix is like a directory structure or namespace within the KV store, and Consul allows you to set watches not only on individual keys, but also on key prefixes. These watches mean that when anything inside a certain prefix is modified, a predefined script is executed and passed the data of all the keys inside that prefix.
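A keyprefix watch of this sort is registered in the agent’s configuration; a minimal sketch (the handler path is an assumption):

```json
{
  "watches": [
    {
      "type": "keyprefix",
      "prefix": "darklaunch/dashboard/core",
      "handler": "/usr/local/bin/darklaunch-watch-handler"
    }
  ]
}
```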
When we make that REST call, the local agent sends the data to the Raft leader, which then sends out an event to all the nodes that something has changed in the KV store, and watches are called on any server where the Consul agent has been configured to respond to that type of event. Watch handlers can be any executable file; when they are triggered, they are passed the relevant data through STDIN. In our case, we watch the “darklaunch/dashboard/core” prefix, and when an event comes in we take all the data in that prefix, write it into a PHP-parseable file on the filesystem, and then invalidate that file in the server’s opcode cache.
And corresponding handler:
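Since a watch handler can be any executable, here is a Python sketch of the behavior described above. The output path, the PHP file shape, and the cache-invalidation step are assumptions; Consul passes the full contents of the watched prefix as a JSON array on STDIN, with base64-encoded values:

```python
#!/usr/bin/env python3
# Sketch of a keyprefix watch handler. Consul passes the contents of the
# watched prefix as a JSON array on STDIN; each entry's Value is
# base64-encoded. Output path and file shape are assumptions.
import base64
import json
import sys

OUT_PATH = "/var/cache/darklaunch/core.php"  # assumed location

def render_php(entries):
    """Render Consul KV entries as a PHP-parseable array file."""
    lines = ["<?php return array("]
    for entry in entries or []:
        value = base64.b64decode(entry["Value"] or "").decode()
        lines.append(f"  {json.dumps(entry['Key'])} => {json.dumps(value)},")
    lines.append(");")
    return "\n".join(lines)

def main():
    entries = json.load(sys.stdin)
    with open(OUT_PATH, "w") as out:
        out.write(render_php(entries))
    # ...then invalidate OUT_PATH in APC / the opcode cache (not shown)

# main()  # Consul invokes the script; the call is left commented in this sketch
```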
From the results we’ve seen so far, it typically takes less than a second for over 100 servers to converge on the data, and although Consul technically doesn’t guarantee delivery of events, we haven’t yet seen any instances of data being behind.
Moving from a pull-based system, where dark launch values are read from a database or cache, to a push-based system, where a file containing up-to-date dark launch data always exists on the filesystem, has been great: there are fewer external dependencies in the code that relies on that dark launch data. Even if the Consul agent were to go down, the file would still remain on disk. We have also seen a small improvement in page execution time and a reduction in CPU usage on webservers since this change, because of the simplified way of loading this data. We were able to roll the change out in a manageable way by implementing the Consul-based Dark Launch system and then running it in tandem with the existing system: all reads and writes happened on both sides, and on each read both values were compared and logged to determine whether any data was missing or incorrect.
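The tandem read path can be sketched like so (the names and logging detail are assumptions; the real comparison also covered writes):

```python
# Sketch of the tandem rollout's read path: read from both the legacy
# store and the Consul-backed store, log any mismatch, and keep serving
# the legacy value until the new system is verified. Names are assumptions.
import logging

log = logging.getLogger("darklaunch.migration")

def read_flag(code, legacy_store, consul_store):
    """Read a flag from both systems and log any disagreement."""
    old = legacy_store.get(code)
    new = consul_store.get(code)
    if old != new:
        log.warning("dark launch mismatch for %s: legacy=%r consul=%r",
                    code, old, new)
    return old  # keep serving the legacy value during the rollout
```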
Our typical way of carefully rolling out features is to dark launch them but that wasn’t really feasible here for obvious reasons.
Things We Learned
- We had to modify our dev Vagrant config to open ports for the Consul nodes to communicate, and to make sure each node has a unique name
- Implement ACLs early; doing it later will mean a bunch of reconfiguration and will invalidate a lot of the data in the KV store
- The performance of the underlying Serf gossip is not tunable (see, for example, Serf’s Convergence Simulator), though it seems they picked sane values for a typical setup
- The version we started with (0.4.0) had a bug where watches were not being properly loaded, though it was quickly fixed in 0.4.1
- We found it useful to always use the “rejoin_after_leave” config option for agents. Without it, the bootstrap server will not rejoin the cluster after being destroyed; it will instead just start a new cluster with itself as leader
- From the docs: “By default, Consul treats leave as a permanent intent, and does not attempt to join the cluster again when starting. This flag allows the previous state to be used to rejoin the cluster.”
- It’s important to understand Consul’s Disaster Recovery process. Servers can be brought up or down, destroyed, and rebuilt at will, but in the case where all servers are down or have been destroyed, the next time you want to bring them up you’ll need to switch over to the disaster recovery steps. We’ve found this to be reasonable, as the likelihood of all (typically 3, 5, or 7, distributed through various zones) of your Raft members going down at once is pretty small.
- Events (at least keyprefix-type events) will be delivered to nodes that were down at the time the event was triggered. When the node comes back up, it will trigger any applicable watches. (@armon from Hashicorp tells us that this is because keyprefix events are handled through a different mechanism that benefits from extra reliability.)
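The rejoin_after_leave option mentioned above lives in the agent’s JSON configuration; a minimal server-agent sketch (the other fields are illustrative):

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "rejoin_after_leave": true
}
```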
Things We’d Like To Do
- Brokerless Services: It should be possible to have a pool of workers, and instead of having a load balancer or broker that routes traffic for that pool, all clients could have an up-to-date list of all workers in the pool and connect to them directly.
- Load balancers that can automatically update their own configs when a new web server comes up
- Consul clients that can specify a specific service that they provide (MySQL for example), and then other clients can discover the providers of that service via DNS or REST API
- Use it for its health-checking features rather than having a separate dedicated health-checking agent
About the Author: Bill is a Specialist Engineer working mostly on the Platform team, which is building the foundation of our move toward a service oriented architecture built on Scala. He also enjoys long walks on the beach, code optimization, and distributed computing papers. Follow Bill on Twitter @bmonkman.