Rebuilding the Build
Continuous Integration (CI) and Continuous Delivery (CD) are core processes within our engineering team. CD is hard to get right, and we pride ourselves on following best practices. Culturally, we embrace the central values of DevOps: continuous improvement, tight feedback loops, automation, education, blamelessness, and autonomy.
In fact, it can be argued that CI is an absolute baseline requirement in modern software development. It is the hub for all the spokes that make up engineering (devs, ops, QA, management). The build and release pipelines are therefore a mission-critical production distributed system, and should be treated as such. But what happens when the pipeline (or part of it) goes down? Important fixes can’t be pushed out, developers get blocked, visibility is lost, and everyone notices. Worse, by the time the problem is resolved, the commits have piled up and an important tenet is broken: we ship one large deployment instead of small incremental changes.
When you start out as a monolithic app developed by a small team, you can easily get away with just configuration management and a simple build server (e.g. Ansible & Jenkins). But as time moves on, the product, team, and company grow, and that monolith is halfway transitioned to a handful of microservices.
Over the last half year we took a step back to look at things with a bit of perspective. We can’t just hire someone to monkey around with Jenkins and expect our growth problems to go away. A deeper re-examination of the current issues (and future requirements) needs to take place. With a solid design philosophy, implementation more easily falls into place and decisions are made within the proper context. So, in the spirit of openness, we’re opening up our dev process in the hope that we gain insights from unexpected places, and that our lessons learned unexpectedly help others.
Under Pressure
Our situation is a sign of a healthy business and hyper growth. We move fast, our systems grow organically, and code naturally degrades over time.
We have a legacy PHP monolith and a few dozen Scala services. The legacy app has integration tests that masquerade as unit tests, the developer feedback loop is slowing, hundreds of coupled snowflake jobs make the build server brittle and hard to maintain, and Jenkins struggles to keep up with demand. With the continued effort of chipping services off the monolith and the doubling of our engineering team over the next year, the problem is only going to get (exponentially) worse.
This coupling leads to snowflake pipelines and diverging processes. We’ve recently seen errors due to infrastructure being deployed inconsistently or with too much manual intervention. Case in point: in some instances we spin up a base Amazon AMI and provision it with Ansible as the final production instance. In others, we provision, then bake an AMI for speedy spin-up (or auto-scaling groups). Inevitable problems ensue from the inconsistency.
As with any code base, you need to refactor. Most of those pipelines are doing the same thing in slightly different ways, and can easily be aligned into a smaller series of templated workflows. By unifying the processes and our stack, we greatly increase our agility.
The fantastic book “Continuous Delivery” provides a lot of sage and hard-won technical and cultural guidance. In addition to those ideas, we identify with the following guiding principles. None of these are particularly new or novel, but it helps to codify them and foster cultural buy-in.
Infrastructure: Everything as Code
Provisioning and the setup of infrastructure will be handled as first-class software projects. Using Ansible/Terraform to set up a load balancer or a RabbitMQ instance for example, should go from commit, to code review, build, test (serverspec/rspec), and deployment. The goal is to provide us the following:
- Every change has been committed to version control and tested just like source code
- Reproducibility: confidence in ability to rebuild lost infrastructure
- Re-deploy staging (or any environment) from scratch
- Easier to spin up in a new availability zone/region
- Continuously integrate the Continuous integration server
- Auditability for security and compliance
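As an illustrative sketch (the resource names, ports, and zones are hypothetical, not our actual setup), standing up a load balancer becomes just another reviewed, tested source file, here in Terraform:

```hcl
# Hypothetical load balancer definition: committed, reviewed,
# and tested like any other piece of source code.
resource "aws_elb" "web" {
  name               = "web-elb"
  availability_zones = ["us-east-1a", "us-east-1b"]

  listener {
    instance_port     = 8080
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }

  health_check {
    target              = "HTTP:8080/health"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 2
  }
}
```

Because the definition lives in version control, rebuilding it in a new region is a plan/apply away, and every change leaves an audit trail.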
Deployments should be immutable once built. All static configuration (routes, DI wiring, certain constants, defaults, etc.) is baked into the image, and all other configuration is kept in the environment, preferably env vars (http://12factor.net/config) via our Consul service. This gives us:
- Safe (failure is more likely during build/test phase)
- Increased reliability (predictable)
- Decreases config drift & snowflakes
- Atomic: deployment can be done as blue/green deploy
- Easy rollback
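To make the split concrete, here is a minimal sketch (names and defaults are hypothetical) of 12-factor-style configuration: static defaults are baked into the image at build time, and anything set in the environment (e.g. by Consul) wins at runtime:

```python
import os

# Static configuration baked into the image at build time.
DEFAULTS = {
    "APP_LOG_LEVEL": "info",
    "APP_HTTP_PORT": "8080",
}

def get_config(key):
    """Environment variables win; baked-in defaults are the fallback."""
    return os.environ.get(key, DEFAULTS.get(key))
```

The same image can then be promoted unchanged from staging to production, with only the environment differing.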
Deployments are to be rolled out as machine images. There are various options available, such as AMIs, VM images, or Docker. Docker fits our use case best: it’s the next step in artifact evolution, from tarballs to Debian packages to containers. It also gives us many other highly desirable advantages:
- Clear interface (lingua franca) between Devs & Ops
- Polyglot friendly: languages, libraries, dependencies, tools
- Code and runtime part of release
- Fewer dependencies at release time. Would you compile source on the target as a release strategy? In a way that’s what traditionally happens with git pull & setup scripts
- Portability: easier to move cloud providers, deal with international offices, or possibly go bare metal
- Greater parity between all environments: dev/stage/prod
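As a sketch of what “code and runtime are part of the release” means in practice (the base image tag and paths here are hypothetical), a service image might be built from a Dockerfile like this:

```dockerfile
# Base JRE image (hypothetical tag); the runtime ships with the release
FROM java:8-jre
# The built artifact is the only thing copied in, no git pull on the target
COPY target/service.jar /app/service.jar
# Baked-in default; overridable via the environment at deploy time
ENV HTTP_PORT=8080
EXPOSE 8080
CMD ["java", "-jar", "/app/service.jar"]
```

The same image runs on a laptop, in staging, and in production, which is where the dev/stage/prod parity comes from.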
Build & Deploy as a Service
B&D is a service that is provided internally to all teams. The objective is not to be overly prescriptive, but instead to encourage independence in teams with a “self help” approach.
An effective service provides a clear and well-documented API. There are many build tools that need to be supported: sbt, Maven, Grunt, PHPUnit, etc. The choice of tooling should be left to the teams to fit their needs, and the build pipeline should not limit that choice (within reason). Each project will have a spec outlining the build and release steps in a declarative manner, similar to the yaml specs of DotCi, CircleCI, Travis CI, etc.
Our Build System
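Such a declarative project spec might look something like the following (the file name, keys, and image names are hypothetical, in the spirit of the Travis/CircleCI formats):

```yaml
# .build.yml -- hypothetical declarative build/release spec
builder: builder-scala:2.11    # container holding sbt and friends
test:
  - sbt test
build:
  - sbt assembly
release:
  image: analytics/widget-service
  deploy: blue-green
```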
Jenkins is our main workhorse. It’s a tool both loved and hated, with great flexibility. What it has going for it is a vast ecosystem of integrations and plugins that shape it to almost any need. As we evaluate other possible tools, we need to assess whether a new system is worth the high overhead of migration (integration with our custom dashboards and information radiators, for example).
Of particular interest is eBay’s recent work running a federated Jenkins setup on Mesos. As the team grows, a single Jenkins master becomes unfeasible. Running on Mesos has several key advantages: it keeps costs down by sharing computing resources, and makes scaling up capacity straightforward. Notably, their pre-Mesos utilization was below 5%, due to mostly idle, isolated slaves.
Tester & Builder Containers
Building a project should be possible in any location with minimal setup or fuss. To help encapsulate build dependencies, we introduce the idea of builder and tester containers. All tooling required to build or test an application will be kept in a discrete container. This lets developers build an app locally with the identical binaries the CI server will use, and lets us version, update, and migrate our tooling with greater ease. Another benefit is that our build system itself is continuously integrated, just like a first-class software project.
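As a sketch (the image name, base image, and install script are illustrative only, not our actual tooling), a builder container just packages the toolchain, and both developers and CI mount the source into it:

```dockerfile
# builder-scala: all tooling needed to build our Scala services,
# versioned and migrated like any other first-class project.
FROM java:8-jdk
# install-sbt.sh is a hypothetical helper that pins the exact
# sbt version CI will use, so dev and CI binaries are identical.
COPY install-sbt.sh /tmp/install-sbt.sh
RUN /tmp/install-sbt.sh
WORKDIR /src
CMD ["sbt", "test"]
```

A developer then builds with the same binaries CI uses, e.g. `docker run --rm -v "$(pwd)":/src builder-scala`.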
Deployment Types
Most of our deployments mutate the state of our running servers: they upload an artifact (tar, deb), unpack it, run a setup script, and restart the service. This changes as we move to image-based deployments. Regardless, the various possible deployment types must be considered:
Continuous Deployment: From commit to automatic production deployment. We don’t have the desire or the requisite test coverage to do this for customer-facing products, nor do we find it appropriate there. When developers merge to master, they are “on call” until their change is live and all post-deploy production metrics look OK. Currently this is only a fit for offline/ETL workloads, or non-critical systems such as local tooling.
Rolling: Instances are updated a few at a time rather than all at once. Necessary for certain apps, for example those with expensive startups or cache-warming routines.
Blue / Green: A new pool gets deployed, then a load balancer (networking route, config, etc.) points to the new pool. Once everything looks OK, the old pool gets torn down. The switchover is instantaneous and provides fast & easy rollback. Our Hootsuite Analytics team uses Blue / Green deployments.
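The switchover itself is just an atomic pointer flip, which is what makes rollback trivial. A toy in-memory sketch (standing in for a real load balancer route or Consul entry):

```python
class Router:
    """Toy stand-in for a load balancer: 'active' points at one pool."""

    def __init__(self, active_pool):
        self.active = active_pool
        self.previous = None

    def switch_to(self, new_pool):
        # Atomic flip: traffic moves to the new pool all at once.
        self.previous, self.active = self.active, new_pool

    def rollback(self):
        # Easy rollback: the old pool is still running until torn down.
        self.active, self.previous = self.previous, self.active
```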
Canary: Spin up an isolated deployment, then dial in a percentage of traffic to get production performance data and feedback. Lower risk, and appropriate for low-level core changes like protocol updates, library changes, or new runtimes (JVM).
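Dialing in a fixed percentage of traffic can be as simple as hashing a stable request attribute; a toy sketch (not our actual routing layer):

```python
import zlib

def route(user_id, canary_percent):
    """Send a stable canary_percent of users to the canary pool.

    Hashing the user id (rather than choosing randomly per request)
    keeps each user pinned to one pool for consistent behavior.
    """
    bucket = zlib.crc32(user_id.encode()) % 100
    return "canary" if bucket < canary_percent else "stable"
```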
The Road Ahead
All this leads to where we ultimately want to go: unified process and tooling that increase our agility. It allows us to move fast, and gives us the flexibility to leverage future platforms. While we are ready to commit to Docker as a format, we feel the PaaS space is moving too fast and needs to settle a little. Some really exciting stuff is being worked on, but there are open problems that need a lot of work (intra-container networking being one of them). Some offerings are still in beta (CoreOS, Kubernetes, Flynn, Deis, Longshoreman, etc.), and the choice is dizzying. Others are production-proven (Mesos), but don’t have the tooling integrations we require and are a big investment to commit to. Nonetheless, the future is exciting.
We’d love your help
Does rebuilding our build resonate with you? Right on. Apply for our Build and Deploy Engineer role.
About the Author
Mark is a Senior Software Engineer on the Platform Team at Hootsuite. He is passionate about Scala micro-services, container technologies such as Docker & Mesos, and DevOps in general. Follow Mark on Twitter @markeijsermans