Processing Big Data with a Micro-Service-Inspired Data Pipeline
You aren’t truly ready for a career in Big Data until you have everyone in the room cringing from the endless jargon you are throwing at them. Everyone in tech is always trying to out-impress one another with their grasp of technical jargon.
However, tech jargon does exist for a reason: it summarizes complex concepts into a simple narrative, and allows developers to abstract implementation details into design patterns which can be mixed and matched to solve any technical task. With that in mind, let’s take a look at the technical tasks the Data Lab team was facing this year, and how we addressed them with an absurd quantity of geek speak.
The Data Lab team at Hootsuite exists to help the business make data-driven decisions. From an engineering standpoint, this means designing a data pipeline to manage and aggregate all our data from various sources (Product, Salesforce, Localytics, etc.) and make it available in Redshift for analysis by our Analysts. Analyses typically take the form of either a specific query used to answer an ad-hoc request, or a more permanent Dashboard designed to monitor key metrics.
However, as Hootsuite grew, the Data Lab team became a bottleneck for data requests from stakeholders across the business. This led us to look for a way to let various decision makers dig into our data on their own, without needing SQL knowledge.
Enter Interana. Interana is a real-time, time-indexed, interactive data analytics tool which would allow all of our employees to visualize and explore data themselves. Awesome, right?! Unfortunately, there was one little problem: we didn’t have the infrastructure for real-time data processing. Our pipeline only supported a series of nightly ETLs, which were run by a cron job.
Creating something from scratch is incredibly exciting. Finally, an opportunity to implement a solution using all of the jargon you’d like, without any of the technical debt! We laid out our goals, and chose the solution that best fit our needs.
While analyzing the problem, I realized that the qualities we wanted our pipeline to have were the same qualities computer scientists have been striving to achieve for decades: abstraction, modularity, and robustness. What has changed are the problems software engineers face, and the technologies that have been developed to provide modularity, robustness, and increased abstraction. It makes sense. We wouldn’t be able to create a real-time data pipeline by running our ETLs every second; we needed a different solution, one which addressed these issues.
Micro-services are small applications that perform a single, specific service. They are often used in applications where each request can be delegated to a separate and complete application. What makes them fantastic to work with is that they abstract away the implementation details, and present only an interface comprising their data inputs and outputs. This means that as long as the interface remains the same, modifications made inside a service stay compatible with the rest of the system. In fact, one could safely replace one micro-service with another!
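To make the "same interface, swappable implementation" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the `Cleaner` interface, the two implementations, and `run_stage` are assumed names, not part of our actual pipeline.

```python
from __future__ import annotations

from typing import Protocol


class Cleaner(Protocol):
    """Hypothetical interface: one raw record in, one cleaned record out."""

    def clean(self, raw: dict) -> dict: ...


class StripWhitespaceCleaner:
    """Trims surrounding whitespace from every string value."""

    def clean(self, raw: dict) -> dict:
        return {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}


class LowercaseKeysCleaner:
    """Trims string values and also lowercases the field names."""

    def clean(self, raw: dict) -> dict:
        return {
            k.lower(): v.strip() if isinstance(v, str) else v
            for k, v in raw.items()
        }


def run_stage(cleaner: Cleaner, records: list[dict]) -> list[dict]:
    # The caller depends only on the interface, never on a concrete class.
    return [cleaner.clean(record) for record in records]
```

Because `run_stage` only relies on `clean(raw) -> dict`, either class can be passed in without touching the caller, which is the property the paragraph above describes.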
With all of Hootsuite breaking apart our monolith into a set of micro-services, the Data Lab team also wanted a slice of the fun. Wanting to move away from our monolith-like ETL codebase, we saw an opportunity to implement our real-time data pipeline using the best practices established by our Product brethren. Of course, a data pipeline has some inherently different requirements from a SaaS product, so we needed to make a few changes to what a typical micro-service product looks like. Our micro-services:
- Behave more like workstations on an assembly line than independent services: after processing its data, a service passes the result onward and does not “respond” to its caller
- Have a dependency structure that forms a directed acyclic graph: we don’t want data circulating through our pipeline forever!
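The acyclic requirement can be checked mechanically. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the stage names and the `processing_order` helper are assumptions made up for illustration, not a description of how we actually validate our pipeline.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stage names. Each key maps to the stages it reads from.
DEPENDENCIES = {
    "cleaner": {"source_queue"},
    "enricher": {"cleaner"},
    "uploader": {"enricher"},
}


def processing_order(dependencies):
    """Return a valid stage order, or raise ValueError if data could loop."""
    try:
        # static_order() yields each stage only after all of its inputs.
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # err.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"pipeline contains a cycle: {err.args[1]}") from err
```

Calling `processing_order(DEPENDENCIES)` returns the stages from source to uploader, while a configuration in which two stages feed each other raises an error instead of silently letting data circulate.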
Above is an overview of our real-time data pipeline. We have a diverse set of sources for our data: some of them produce data in real time, while others do not, so we built a micro-service to support batch-updated data. Data from each source is put onto a queue, where our cleaner micro-services clean it. The cleaned data is then converted into a common data format and passed on to a “unified, cleaned” message queue for our enricher to consume. The enricher micro-service enriches our data by cross-referencing various fields with our data sets (and other micro-services!), and then uploads it into our data store. Finally, it sends a message into another message queue asking to have that data uploaded to our analytical data warehouse. Voila! A complete data pipeline.
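The flow described above can be sketched in a few lines of Python. This is a single-process toy, assuming in-memory `queue.Queue` objects in place of real message queues and a plain list in place of the data store; the field names (`uid`, `action`, `plan`) and all function names are hypothetical.

```python
import queue


def clean_product_event(raw):
    """Hypothetical cleaner: map one source's fields onto the common format."""
    return {"user_id": int(raw["uid"]), "event": raw["action"].strip().lower()}


def enrich(record, plans_by_user):
    """Hypothetical enricher: cross-reference a field against a lookup table."""
    return {**record, "plan": plans_by_user.get(record["user_id"], "unknown")}


def run_pipeline(raw_events, plans_by_user):
    source_q = queue.Queue()   # one queue per data source
    unified_q = queue.Queue()  # the "unified, cleaned" queue
    upload_q = queue.Queue()   # requests to load data into the warehouse
    data_store = []            # stand-in for the real data store

    for raw in raw_events:
        source_q.put(raw)

    # Cleaner stage: consume raw records, emit records in the common format.
    while not source_q.empty():
        unified_q.put(clean_product_event(source_q.get()))

    # Enricher stage: enrich, store, then ask for a warehouse upload.
    while not unified_q.empty():
        data_store.append(enrich(unified_q.get(), plans_by_user))
        upload_q.put({"store_index": len(data_store) - 1})

    upload_requests = [upload_q.get() for _ in range(upload_q.qsize())]
    return data_store, upload_requests
```

Each stage only reads from the queue before it and writes to the queue after it, so no stage ever replies to the one that fed it.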
We were able to create a complete data pipeline which meets the three qualities we sought out at the beginning (abstraction, modularity, and robustness), plus two more:
- It is abstract. Each service hides its implementation details, and reveals only what it consumes and what it outputs.
- It is modular. Each micro-service can be reused and re-arranged without needing to refactor the entire system it resides in.
- It is robust. New data sources can be easily added (just clone and update a cleaner/producer micro-service), and if one service fails, the rest of the pipeline can still operate correctly.
- It is distributed. Each micro-service is run on a separate box, and may be consuming data from entirely different places.
- It is scalable. We can always create more instances of each application to consume and process data in parallel.
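The scalability point, several instances of the same service consuming from one queue, can be illustrated with threads standing in for separate boxes. This is only a sketch under that assumption: `consume`, `process_in_parallel`, and the doubling "work" are hypothetical placeholders, and `None` is reserved as a shutdown signal, so it cannot be used as a work item.

```python
import queue
import threading


def consume(work_q, results, lock):
    """One service instance: pull items until the shutdown sentinel arrives."""
    while True:
        item = work_q.get()
        if item is None:  # sentinel: no more work for this instance
            break
        processed = item * 2  # stand-in for real cleaning or enriching
        with lock:
            results.append(processed)


def process_in_parallel(items, workers=3):
    work_q = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=consume, args=(work_q, results, lock))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        work_q.put(item)
    for _ in threads:
        work_q.put(None)  # one sentinel per instance
    for thread in threads:
        thread.join()
    return results
```

Raising `workers` adds consumers without changing the producer or the processing code. Results arrive in no particular order, which is the usual trade-off when several consumers share one queue.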