Background:

On the datalab team at Hootsuite, we help the rest of the company make data-driven decisions. A big part of this is the collection of data. We collect data from external vendors that we work with, such as Salesforce and Marketo, and we also collect data on the Hootsuite product itself. We load all of this data into a single data warehouse that our data scientists, business analysts, and product teams can easily use.

The engineering side of the datalab has 3 main tasks:

  • Collecting data
  • Enriching the data
  • Storing data in different formats
In order to do all this, datalab develops specialized apps that each do one thing very well. These apps get input from one or many internal and external data sources. These sources are a mix of internal APIs, databases, and data streams, as well as external APIs. This first step is known as the extract step. The apps then process the data: this could range from enriching one data source with data from other data sources to doing some calculations and enriching the data with the results. This second step is known as the transform step. The app finally loads the data into another data source. This last step is known as the load step. In data engineering parlance these apps are called ETLs: Extract, Transform, Load.
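
To make the extract-transform-load shape concrete, here is a minimal sketch in Python. The records, the enrichment, and the destination table are all hypothetical placeholders, not our actual systems; it is only meant to show the three steps.

```python
# A minimal, illustrative ETL skeleton. The records and the destination
# table named below are hypothetical placeholders.

def extract():
    """Pull raw records from one or more internal/external sources."""
    raw_records = [
        {"account_id": 1, "plan": "pro", "mrr": 99.0},
        {"account_id": 2, "plan": "enterprise", "mrr": 999.0},
    ]
    return raw_records

def transform(records):
    """Enrich or reshape the raw records before loading."""
    for record in records:
        record["is_enterprise"] = record["plan"] == "enterprise"
    return records

def load(records):
    """Write the transformed records to the destination data store."""
    for record in records:
        print(f"INSERT INTO warehouse.accounts VALUES {record}")

if __name__ == "__main__":
    load(transform(extract()))
```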

Problem:

We are dealing with large volumes of data, and all the data operations are subject to a high standard of data quality. The result of one data operation is often an input to multiple other data operations, so one problem can easily cascade and degrade the data quality of other parts of the system. A large number of stakeholders consume this data at any given moment and use the insights to make critical decisions. Therefore, it is imperative that the data is always correct and complete.

These are some of the technical difficulties that the datalab had to solve to create a reliable data pipeline:

  • How can we quickly spot any anomalies in the system?
  • How can we easily troubleshoot and fix the problems?
  • Given that the outputs of some jobs are the inputs to others, how do we make sure that the jobs run in the correct order every time? Additionally, if one of the components in the pipeline fails, how do we prevent it from affecting other parts of the system?
  • How should we manage our infrastructure? Some apps run continuously and some run periodically. How should we schedule and deploy these apps on our servers? We want to make sure that all the apps have enough computing and memory resources when running, but we also want to make sure that our servers are not sitting idle (cloud bills can be expensive!).

Technologies:

1- Docker:

Datalab packages all of its apps as Docker containers. In practice, you can think of a Docker container as a Linux machine that runs on your development machine or on your server. Your app will run inside of this container, which is completely isolated from the host machine it is running on.

Docker has a much lower overhead than a virtual machine, as it does not require running an entire kernel for each container. In addition, while you can put limits on the resources for each Docker container, the container only uses those resources if it needs to. The resources dedicated to a VM, on the other hand, are reserved whether the VM uses them or not.

This lightweight environment isolation enabled by Docker provides many advantages:

  • Environment setup: For an app to run properly, the server has to have the right versions of all of its dependencies installed. Furthermore, the file system has to be laid out the way the app expects, and environment variables have to be set correctly. All of this configuration can lead to an app working perfectly fine on my computer, but not on yours, or on the server. Docker helps with these issues. A developer writes a Dockerfile, which is very similar to a bash script. The Dockerfile has instructions on what dependencies need to be installed, and sets up the file system and environment variables. When a developer wants to ship a product, s/he ships a Docker image instead of a jar file or some other build artifact. This guarantees that the program will behave exactly the same irrespective of what machine it is running on. As an added bonus, Docker has developed tooling so you can run Docker containers on Windows and macOS. So you can develop on a Mac or a Windows machine and rest assured that your program will work on a server running Linux.
  • Easy deployment: By building the Dockerfile, the developer creates a Docker image. You can think of a Docker image as the hard drive of a Linux computer that is completely set up to run our app. The developer can then push this Docker image to an image repository. The servers that actually run the app will pull the image from the image repository and start running it. Depending on the size of the image, the pull can take a bit of time. However, one nice thing about Docker images is their layered architecture: once the server pulls the image for the first time, subsequent pulls only fetch the layers that have changed. If the developer makes a change to the main code, it only affects the outermost layer of the image, so the server only needs to pull a very small amount of data to have the most up-to-date version. If an app needs an update, the server pulls the new version and starts it in parallel to the running old version. When the new version is up and running, the old container is destroyed. This allows for zero downtime. (A sketch of this build-push-pull-run workflow follows this list.)
  • Running multiple apps on the same server: Docker allows us to run multiple containers on the same server, helping us maximize the value we get out of our resources. This is possible thanks to the low overhead of these containers. Moreover, if each app needs a different version of Python, or a different version of SQLAlchemy, it does not matter, as each container has its own independent environment.
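
As a rough illustration of that workflow, here is a sketch using the Docker SDK for Python (docker-py). The image name, registry, and resource limits are hypothetical; in practice the build/push half would run on a developer machine or CI, and the pull/run half on the server.

```python
import docker  # Docker SDK for Python (docker-py)

client = docker.from_env()

# Developer/CI side: build an image from a Dockerfile in the current
# directory and push it to an image repository (names are hypothetical).
image, build_logs = client.images.build(path=".", tag="example-registry/etl-app:latest")
client.images.push("example-registry/etl-app", tag="latest")

# Server side: pull the latest image and run it as an isolated container
# with explicit (illustrative) memory and CPU limits.
client.images.pull("example-registry/etl-app", tag="latest")
container = client.containers.run(
    "example-registry/etl-app:latest",
    detach=True,
    mem_limit="512m",          # cap memory for this container
    nano_cpus=500_000_000,     # roughly half a CPU core
    environment={"ENV": "production"},
    name="etl-app",
)
print(container.status)
```
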
Now that we have come to resource allocation, one might wonder: how do we make sure that each container has enough resources, and how do we manage all the servers required to run our apps?

2- ECS:

Companies that adopt Docker, and the microservice architecture that often comes with it, end up with many small, specialized containers. This calls for technologies to manage and orchestrate these containers. A number of technologies have emerged that make managing and deploying containers easier; some of the better known names are Kubernetes, Amazon ECS, and Docker Swarm. Here at datalab we have picked Amazon ECS as our primary container orchestration solution.

Amazon ECS provides a cluster of servers (nodes) to the user and takes care of distributing the containers over the cluster. If servers are running out of memory or computing resources, the cluster will automatically create a new server.

We have some apps that are constantly running on this cluster, but there are also apps that only run periodically (once a day or once a week). In order to save on costs, we destroy the containers that have completed their job and only create them again when we want to run them. So you can imagine that the number and type of containers running on the cluster are very dynamic. ECS automatically decides which server to schedule new containers on. It will add a new server if more resources are required. Finally, it will phase out a server if there is not enough work for it.
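
For a flavour of what asking ECS to place a container looks like programmatically, here is a small sketch using boto3 (the AWS SDK for Python). The cluster name and task definition are hypothetical.

```python
import boto3  # AWS SDK for Python

ecs = boto3.client("ecs", region_name="us-east-1")

# Ask ECS to place one instance of a (hypothetical) task definition
# somewhere on the cluster; ECS picks a server with enough free
# memory and CPU to satisfy the task's resource reservation.
response = ecs.run_task(
    cluster="datalab-cluster",      # hypothetical cluster name
    taskDefinition="etl-app:3",     # hypothetical task definition and revision
    count=1,
)

for task in response["tasks"]:
    print(task["taskArn"], task["lastStatus"])
```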

In short, ECS takes care of distributing containers over the cluster and helps us pack the most containers onto the fewest number of servers possible. But how do we actually schedule a container to run?

3- Airflow:

Airflow is a tool developed by Airbnb that we use to help us with a few tasks. Firstly, we needed a way to schedule an app to run at a certain time of the day or week. Secondly, and more importantly, we needed a way to make sure that a job only runs when all the jobs it depends on have completed successfully. Many of our apps’ inputs are the outputs of our other apps. So if app A’s input is the output of app B, we have to make sure that app A only runs after app B has run successfully. Airflow allows the developer to create DAGs (Directed Acyclic Graphs), where each node is a job. Airflow will only schedule a job if all of its parent nodes have run successfully. It can be configured to rerun a job in case it fails. If it still cannot run the job, it will send alerts to the team, so that the problem can be fixed as soon as possible and the pipeline can continue its operation where it left off. Airflow also has a nice UI that shows all the dependencies and allows developers to rerun failed jobs.
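
Here is a minimal sketch of what such a dependency looks like as an Airflow DAG. The job names, callables, and email address are hypothetical, and the exact import paths and settings vary between Airflow versions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # path differs in Airflow 2.x

def run_app_b():
    print("app B: produce the dataset that app A consumes")

def run_app_a():
    print("app A: consume app B's output")

default_args = {
    "owner": "datalab",
    "retries": 2,                        # rerun a failed job before alerting
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,            # notify the team if retries are exhausted
    "email": ["datalab@example.com"],    # hypothetical address
}

dag = DAG(
    "example_pipeline",
    default_args=default_args,
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",          # run once a day
)

app_b = PythonOperator(task_id="app_b", python_callable=run_app_b, dag=dag)
app_a = PythonOperator(task_id="app_a", python_callable=run_app_a, dag=dag)

# app A only runs after app B has completed successfully
app_b >> app_a
```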

So Airflow will schedule our periodic jobs and notify us if things go wrong. But how can we monitor the apps that run constantly?

4 and 5- Sumo Logic and Sensu:

As a developer, the most useful thing I have for debugging my apps is the logs. However, accessing the logs on servers is usually hard, and it is even harder when using a distributed system such as Amazon ECS. The dynamic nature of ECS means that an app could be running on a different server on any given day. In addition, if a container is destroyed for any reason, all of its logs are lost with it.

To solve the complexity of capturing and storing logs on a distributed system, we have made the system even more distributed! Sumo Logic is a service that accepts logs from apps over the network and will store them on the cloud. The logs can easily be searched using the name of the app. They can be further narrowed down with additional filters. So if a developer needs access to the logs for a specific app, s/he can get them with only a few clicks.

This still means that in order to quickly identify a broken app, someone has to be constantly looking at the logs. That sounds super boring, so we have automated the process by using Sumo Logic’s API and another technology called Sensu. Sensu is an open source project that allows companies to check the health of their apps, servers, and more. Sensu regularly runs the defined checks and alerts the team if something is wrong. At a high level, Sensu has two components: Sensu server and Sensu client. The Sensu server regularly asks for updates from the clients. The clients run some tests and return a pass/fail message back to the server. The server can then notify the team if the check has failed.

One of the use cases of Sensu in datalab is monitoring the health of our apps. One thing that I found particularly interesting is that the system is designed in a way that no extra code needs to be added to the apps; this is done by monitoring an app’s logs. This is how it all works: our Sensu client is always running on the ECS cluster. Let’s say that the Sensu server wants to check on the status of app X. It will send a request to the Sensu client asking for an update on the status of app X, along with a regular expression describing what a success message looks like. The Sensu client will then send a request to the Sumo Logic API asking for all the recent logs for app X. Next, the Sensu client will search through the logs to see if they include the success expression. The client sends a success message back to the server if it finds a match, and a failure message otherwise. The server will send an email to the team when a check fails or if the client is unresponsive, and the engineer on call can take measures to resolve the issue quickly.
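
Below is a sketch of what such a log-based check could look like. Sensu checks follow the Nagios plugin convention of signalling status through exit codes (0 = OK, 2 = critical). The fetch_recent_logs helper stands in for the actual Sumo Logic search API call, which is not reproduced here; treat it, the app name, and the success pattern as hypothetical.

```python
import re
import sys

def fetch_recent_logs(app_name):
    """Placeholder for a Sumo Logic search scoped to one app.

    In the real check this would query the Sumo Logic API for the app's
    recent log lines; here it just returns a canned example.
    """
    return [
        "2017-05-01 02:00:01 app_x starting run",
        "2017-05-01 02:03:47 app_x finished: loaded 12345 rows successfully",
    ]

def check_app(app_name, success_pattern):
    """Exit 0 (OK) if a recent log line matches the success pattern, else 2 (critical)."""
    pattern = re.compile(success_pattern)
    logs = fetch_recent_logs(app_name)
    if any(pattern.search(line) for line in logs):
        print(f"OK: found success message for {app_name}")
        sys.exit(0)
    print(f"CRITICAL: no success message found for {app_name}")
    sys.exit(2)

if __name__ == "__main__":
    # Hypothetical app name and success regex
    check_app("app_x", r"finished: loaded \d+ rows successfully")
```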

That was a long post, so I’ll stop here. Hopefully you have gotten an idea of how Hootsuite manages its big data. To summarize: we use Docker containers for the ease of development and deployment and the modularity they provide, ECS to orchestrate these containers, Airflow to schedule apps and enforce the order in which they run, and Sensu and Sumo Logic to monitor and troubleshoot our apps.

About the Author

Ali is a co-op software developer on the data lab. He studies computer science at UBC. Connect with him on LinkedIn.