How to Build a Service Oriented Data Pipeline

Has this happened to you? An Account Manager comes to you in a panic. They've just lost their most important client, and they had no idea the client was unhappy, because they had no insight into how the client was actually using the product. To make sure the Account Manager doesn't get blindsided like this again, you want to build a tool that lets them check in on the health of their accounts. The backend will be built as a data pipeline: a series of jobs to collect, clean, enrich, and aggregate data. But what should each job do? How will you run the jobs in a robust way, with proper dependencies between them, re-runs on failure, and alerts raised when necessary? How will you handle jobs that require dramatically different architectures and tools, while keeping the code simple and the system cohesive?
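To make the shape of such a pipeline concrete, here is a minimal Scala sketch of the clean/enrich/aggregate stages. All of the types and names (RawEvent, AccountHealthPipeline, and so on) are illustrative assumptions, not the actual Hootsuite code, and in the real system each stage would be a separately deployed job rather than a function in one file:

```scala
// Illustrative types only -- the real pipeline's schemas will differ.
final case class RawEvent(accountId: String, action: String, valid: Boolean)
final case class EnrichedEvent(accountId: String, action: String, plan: String)

object AccountHealthPipeline {
  // Clean: drop malformed and duplicate events.
  def clean(raw: Seq[RawEvent]): Seq[RawEvent] =
    raw.filter(_.valid).distinct

  // Enrich: join usage events with account metadata (e.g. plan tier).
  def enrich(events: Seq[RawEvent], plans: Map[String, String]): Seq[EnrichedEvent] =
    events.map(e => EnrichedEvent(e.accountId, e.action, plans.getOrElse(e.accountId, "unknown")))

  // Aggregate: reduce events to a per-account health metric. Here the
  // "score" is just an activity count, standing in for a real model.
  def aggregate(events: Seq[EnrichedEvent]): Map[String, Int] =
    events.groupBy(_.accountId).map { case (id, es) => id -> es.size }
}
```

Each function above maps onto one job in the pipeline; the questions in the paragraph before it (dependencies, retries, alerting) are really about what happens between these functions once they become separate deployable units.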

In this presentation, given at Applicative 2016, we’ll look at the challenges we’ve faced scaling Data Analytics at Hootsuite, and how we’ve moved from a monolithic project that was becoming unmaintainable to a series of very small, loosely coupled jobs connected by a communication layer, stealing ideas from Service Oriented Architecture.
We'll actually build a simple data pipeline for our imaginary Account Manager, constructing it as a series of Scala apps deployed to AWS Lambda and stitched together with Airbnb's open-source Airflow tool. If you want to build something similar, feel free to steal the code as a starter project.
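As a rough sketch of what one of those Lambda-deployed Scala jobs might look like, the handler below implements the RequestHandler interface from the aws-lambda-java-core library, which is how a JVM function is exposed to Lambda. The EnrichRequest/EnrichResult types and the job body are assumptions for illustration; note also that Lambda's default JSON (de)serialization expects bean-style classes, so real Scala case classes may need extra serialization plumbing.

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}

// Hypothetical input/output types for a single pipeline stage.
final case class EnrichRequest(accountId: String)
final case class EnrichResult(accountId: String, status: String)

// One small job, deployed on its own: Lambda invokes handleRequest,
// while the scheduler decides when this job runs and what depends on it.
class EnrichJobHandler extends RequestHandler[EnrichRequest, EnrichResult] {
  override def handleRequest(input: EnrichRequest, context: Context): EnrichResult = {
    context.getLogger.log(s"Enriching usage data for account ${input.accountId}")
    // ... fetch raw events, join with account metadata, write enriched rows ...
    EnrichResult(input.accountId, "ok")
  }
}
```

On the Airflow side, each such Lambda becomes a task in a DAG, so dependencies between jobs, re-runs on failure, and alerting are handled by the scheduler rather than by the jobs themselves.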
Slides: "Let's Build a Service Oriented Data Pipeline!" by Yasha Podeswa
About the Author
Yasha is a Software Developer on the Data Lab team at Hootsuite, where he builds software to help the Product and Business teams make data-driven, customer-centric decisions. Originally an Oceanographer, he got into Software Development through scientific computing, and is an avid fan of oceans, hockey, and delicious beers. Follow him on Twitter @ypodeswa.