How to Build a Service Oriented Data Pipeline
Has this happened to you? An Account Manager comes to you in a panic. They’ve just lost their most important client, and they had no idea the client was unhappy, because they had no insight into how the client was actually using the product. To make sure the Account Manager doesn’t get blindsided like this again, you want to build a tool that lets them check in on the health of their accounts. The backend will be built as a data pipeline: a series of jobs to collect, clean, enrich and aggregate data. But what should each job do? How will you run them in a robust way, with proper dependencies between jobs, re-runs on failure, and alerts raised when necessary? How will you handle different jobs requiring dramatically different architecture and tools, while keeping the code simple and the system cohesive?
In this presentation, given at Applicative 2016, we’ll look at the challenges we’ve faced scaling Data Analytics at Hootsuite, and how we’ve moved from a monolithic project that was becoming unmaintainable to a series of very small, loosely coupled jobs connected by a communication layer, stealing ideas from Service Oriented Architecture.
We’ll actually build a simple data pipeline for our imaginary Account Manager, constructing it as a series of Scala apps, deployed to AWS Lambda, stitched together using Airbnb’s open source Airflow tool. If you want to build something similar, feel free to steal the code as a starter project.
- Data cleaning and processing job
- Statistic calculation job
- Airflow, a tool for building data pipelines
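
To make the orchestration questions above concrete (dependencies between jobs, re-runs on failure, alerts), here is a minimal sketch of what a pipeline runner has to do. It is not Airflow's API and not the code from the talk: it uses only the Python standard library, and every name in it (`run_pipeline`, `collect`, `clean`, `stats`, the sample accounts and numbers) is hypothetical. The jobs in the talk are Scala apps on AWS Lambda; Python is used here only because Airflow pipelines are themselves defined in Python.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(jobs, deps, max_retries=2, alert=print):
    """Run jobs in dependency order; retry failures, alert when retries run out.

    jobs: job name -> callable taking a dict of upstream results
    deps: job name -> set of upstream job names (each must also be in jobs)
    """
    results = {}
    graph = {name: deps.get(name, set()) for name in jobs}
    for name in TopologicalSorter(graph).static_order():
        # Each job sees only the outputs of the jobs it declares as upstream.
        upstream = {d: results[d] for d in deps.get(name, ())}
        for attempt in range(1, max_retries + 2):
            try:
                results[name] = jobs[name](upstream)
                break
            except Exception as exc:
                if attempt > max_retries:
                    alert(f"job {name!r} failed after {attempt} attempts: {exc}")
                    raise
    return results


def collect(_upstream):
    # Hypothetical raw usage events: (account, logins); None marks a bad record.
    return [("acme", 3), ("acme", None), ("globex", 0), ("acme", 5)]


def clean(upstream):
    # Data cleaning job: drop records with missing values.
    return [(acct, n) for acct, n in upstream["collect"] if n is not None]


def stats(upstream):
    # Statistic calculation job: total logins per account.
    totals = {}
    for acct, n in upstream["clean"]:
        totals[acct] = totals.get(acct, 0) + n
    return totals


JOBS = {"collect": collect, "clean": clean, "stats": stats}
DEPS = {"clean": {"collect"}, "stats": {"clean"}}

if __name__ == "__main__":
    print(run_pipeline(JOBS, DEPS)["stats"])
```

Each job is a small function that knows nothing about the others; the only coupling is the declared dependency map, which is the idea borrowed from Service Oriented Architecture. A scheduler such as Airflow takes over the role of `run_pipeline` and adds scheduling, persistence and a UI on top.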