Why Migrate? – The Non-Technical Parts

Earlier this year, our Product Operations and Delivery team decided to migrate services from our Mesos cluster to Kubernetes. George Gao wrote a post detailing the technical reasons for the move. On the non-technical side of things, the tooling around Kubernetes was more developer-friendly than what Mesos offered, which boded well for our dev teams. Additionally, only the core operations team that originally implemented Mesos understood it; when problems arose, they were the only ones capable of troubleshooting.

Following an evaluation of alternatives, the team made a bet on Kubernetes and began migrating the fifteen services running on Mesos over to Kubernetes. The team gave themselves three months to complete the migration, but thanks to our service mesh, the project only took two!

This was because our service mesh decoupled microservice networking from the application code. As a result, the migration process was limited to simple routing changes. To fully appreciate this, we need to understand how our service mesh works.

What is a Service Mesh?

Imagine you’re writing a new service. Let’s say that your service has a bunch of microservices it needs to talk to. Do you hardcode the URLs to these dependencies into your service? What if there are multiple instances of each service, with requests load balanced across them? How will your service continuously discover new instances when instances can go down and be brought back up at any time with different URLs?

Adding logic for these considerations would bloat your application code and you’d have to do the same work for every service in your architecture. This work is only compounded as the number of services and languages grows.

One solution is to move this responsibility from the clients to the networking layer. By doing so, we have a ‘thin client’ and ‘fat middleware’ model. This is a service mesh. Practically speaking, this means setting up lightweight proxies between the origin and destination services to take care of service discovery and routing. Service discovery is the mechanism that allows services to dynamically track other services and route to them.

Once the service mesh has been set up, adding a new service to it makes the service automatically routable from existing services. You can then focus on writing application logic and trust that the network will route as you expect it to. At Hootsuite, this helped lower the barrier to writing new microservices. Mark Eijsermans gave a talk that goes into more detail.

Hootsuite’s Service Mesh

Our in-house service mesh is called Skyline. It uses Consul for service discovery and NGINX for routing.

Mesos to Mesos

On each node in the Mesos cluster, we run a Consul agent and an NGINX server. The NGINX config is kept up-to-date by fsconsul and consul-template.
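As a rough sketch of that mechanism (the template below is illustrative, not our actual config), consul-template re-renders upstream blocks whenever Consul’s view of a service changes:

```nginx
# Illustrative consul-template fragment: regenerate the upstream
# for service "foo" from Consul's catalog. When instances register
# or deregister, consul-template rewrites this file and reloads NGINX.
upstream foo {
  {{ range service "foo" }}
  server {{ .Address }}:{{ .Port }};
  {{ end }}
}
```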

Each container that runs on a Mesos node makes application requests to a Skyline URL: http://localhost:5040/service/foo/endpoint. This request first goes to the local NGINX proxy at port 5040. The local NGINX then proxies that request to the destination node’s NGINX proxy at port 5041, which routes the request to the correct application on that node. So, the Mesos service only needs to know the Skyline URL of its downstream Mesos service.
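The two hops can be sketched as follows. This is a simplified, hypothetical config — the real one is generated by consul-template, and the upstream name and application port here are assumptions:

```nginx
# Origin node: local Skyline proxy on port 5040.
server {
  listen 5040;
  location /service/foo/ {
    # Forward to the NGINX proxy on a node running foo.
    proxy_pass http://foo_nodes:5041;
  }
}

# Destination node: proxy on port 5041 hands the request
# to the application's container port (illustrative).
server {
  listen 5041;
  location /service/foo/ {
    proxy_pass http://127.0.0.1:31002/;
  }
}
```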

Mesos to Kubernetes

If the local NGINX proxy can’t figure out where to send the request, it just gets proxied to the Kubernetes cluster. All Kubernetes worker nodes are listed in Consul, so any calls to a service that isn’t found in the Mesos cluster will route to Kubernetes via a catch-all.

When a request comes in from outside the Kubernetes cluster, it will reach any Kubernetes worker node at random. On our Kubernetes nodes, we run a skyline-bridge Pod in a Daemon Set on all worker nodes. These Pods just run NGINX and listen on their container port 5041, which is mapped to the host port 5041. When a request comes into a Kubernetes node, the skyline-bridge Pod transforms the request URL into a kubedns name: http://foo.default.svc.cluster.local:8080/endpoint. After that, kubedns takes care of routing the request to the correct destination Service.
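A hypothetical sketch of that transformation in the skyline-bridge NGINX config (the regex, resolver address, and namespace are assumptions):

```nginx
# skyline-bridge: turn /service/<name>/<path> into the kubedns
# name <name>.default.svc.cluster.local and let cluster DNS route it.
server {
  listen 5041;
  location ~ ^/service/(?<svc>[^/]+)/(?<rest>.*)$ {
    resolver 10.96.0.10;  # kube-dns ClusterIP (illustrative)
    proxy_pass http://$svc.default.svc.cluster.local:8080/$rest;
  }
}
```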

For example, say a Mesos service wants to reach the Giphy service sitting in Kubernetes. The origin service calls http://localhost:5040/service/giphy/media. The request gets proxied from the local NGINX to a dedicated set of gateway servers called the ‘service discovery bridge’ (SDB) and into a Kubernetes worker node.

The skyline-bridge Pod on that node receives the request at {NODE IP}:5041/service/giphy/media. It transforms the request into http://giphy.default.svc.cluster.local:8080/media. That request is then passed to kubedns to be routed to the Giphy Service.

Kubernetes to Mesos

If a request from within Kubernetes is destined for a Mesos service, the requesting service calls the kubedns name of a Service that represents the Mesos service. This Service targets skyline-bridge Pods.

For example, the Organization service lives in Mesos, but there is a Service named organization.default.svc.cluster.local to represent it in Kubernetes. The skyline-bridge Pods will then transform the kubedns name of the destination into a Skyline URL before proxying it to the Mesos cluster.
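One way to wire such a Service up — a sketch, assuming the bridge Pods carry an app: skyline-bridge label — is a selector that targets them:

```yaml
# Hypothetical manifest: gives the Mesos-hosted Organization service
# a kubedns name inside Kubernetes. Traffic to it lands on the
# skyline-bridge Pods, which proxy it out to Mesos.
apiVersion: v1
kind: Service
metadata:
  name: organization
  namespace: default
spec:
  selector:
    app: skyline-bridge   # assumed label on the DaemonSet Pods
  ports:
    - port: 8080          # port callers use via the kubedns name
      targetPort: 5041    # the bridge's NGINX listener
```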

These gateway servers are part of the ‘service discovery bridge’ (SDB), which acts as a gateway into both the Kubernetes and Mesos clusters.

Life of a Request through the Service Mesh

Let’s look at a more detailed example. Say that we have a request from a service outside the cluster to our Giphy service running in Kubernetes. This request will first be made to the local NGINX proxy at http://localhost:5040/service/giphy/media on the origin server. This request then gets proxied to the SDB.

Once the request hits the SDB, it will proxy the request to its known backends based on the request URL. If the destination service sits in our Mesos cluster, it appears as a dedicated location block in the SDB’s NGINX config.
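A sketch of such a block (the service name, upstream name, and addresses are illustrative):

```nginx
# Rendered by consul-template from Consul's catalog: the current
# Mesos slaves running the Organization service.
upstream organization_mesos {
  server 10.0.1.12:5041;
  server 10.0.1.27:5041;
}

location /service/organization/ {
  proxy_pass http://organization_mesos;
}
```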

A request matching such a block will be routed to one of those Mesos slaves. If the Mesos service goes down, its location block in the SDB is removed by consul-template, and the SDB falls back to the Kubernetes upstream.
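That catch-all fallback might look something like this (worker addresses are illustrative):

```nginx
# Catch-all: any /service/ path without a more specific Mesos
# location block is proxied to the Kubernetes worker nodes,
# whose addresses come from Consul.
upstream kubernetes_workers {
  server 10.0.2.11:5041;
  server 10.0.2.12:5041;
}

location /service/ {
  proxy_pass http://kubernetes_workers;
}
```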

On a side note, since the Mesos service location blocks come before the Kubernetes one, the SDB will always prioritize routing to Mesos over Kubernetes as long as a Mesos version of the service is healthy. Once the request reaches the Kubernetes cluster, the skyline-bridge Pods will take care of routing it to the correct Service.

How did this help with the migration?

Easy cutover

Consider a service dependency that originally lived in Mesos and was reachable through Skyline. If it got scaled down, the SDB would automatically route requests to the catch-all and into the Kubernetes cluster. Assuming that a Kubernetes version was already running, the traffic cutover would happen seamlessly.

This is the power of a service mesh. Instead of having to SSH into each upstream service and make manual routing changes, we can just tell Mesos to scale down the service and our service mesh will correct the routing for us.

Easy fallback

If something went wrong with the Kubernetes version of the service while it was serving requests, falling back would be as simple as scaling the Mesos service back up. Since the SDB favours the Mesos cluster, it would automatically route traffic back into the Mesos cluster. This also sidesteps manual configuration in emergencies.

Minimal code changes

With this, the only code changes our migration team had to make were to the dependency URLs. These are usually defined in a service’s deploy configurations (.yml files).

If a service was in Mesos, it would reach a downstream service at http://localhost:5040/service/foo/endpoint. Once it was deployed to Kubernetes, the downstream service URL needed to become http://foo.default.svc.cluster.local:8080/endpoint. For our Giphy example, the change to its deploy configuration amounted to swapping one URL form for the other.
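A sketch of that change, assuming a hypothetical FOO_SERVICE_URL entry in the deploy .yml (the key name and layout are illustrative):

```yaml
# Before: on Mesos, dependencies were reached through the
# local Skyline proxy.
env:
  FOO_SERVICE_URL: "http://localhost:5040/service/foo/endpoint"

# After: on Kubernetes, the same dependency is reached via kubedns.
env:
  FOO_SERVICE_URL: "http://foo.default.svc.cluster.local:8080/endpoint"
```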

Thus, pull requests were relatively small and simple to review, which made the migration smoother for all parties involved.

Towards the future

Migrating from Mesos to Kubernetes reduced a huge amount of technical debt across all teams. On the operations side, it didn’t make sense for us to maintain two separate container schedulers and deploy environments. It also saved us $67k a year. On the development side, service owners have a more robust and developer-friendly tool in kubectl to debug and maintain deployments, enabling them to troubleshoot problems instead of relying on the core operations team.

For our next step, we are looking at second generation service meshes. Compared to Skyline, these service meshes promise out-of-the-box features like smart retries, circuit breaking, better metrics and logging, authentication and authorization, etc. Currently, Istio and Envoy are our strongest contenders. As our microservices grow, this will give us even more operational control over our service mesh to empower fast and safe deployment of services at Hootsuite.

Big thanks to Luke Kysow and Nafisa Shazia for helping me with this post.

About the Author

Jordan Siaw is a co-op on the Product Operations and Delivery (POD) team. He is a Computing Science major at Simon Fraser University (SFU). When he’s not busy with code, he enjoys reading, podcasts, playing guitar and programming jokes. Find him on Github, Twitter, Instagram or LinkedIn.