Dashboard Code Growth: Monolith vs. SOA
About three years ago, the Hootsuite Engineering team began a transition from a monolithic application to a service-oriented architecture (SOA). While the journey is ongoing, we have come a long way, from learning how to build services at all to efficiently delivering new customer value on top of a constellation of services.
This post shares what our journey has looked like, what we’ve learned along the way, and how key metrics have been impacted.
To gauge our progress we have tracked a number of relevant metrics:
- Production lines of code in monolith vs SOA
- Services in production
- Team size
Background
In early 2013, we were forced to confront two stark realities:
- Slowing Velocity: Incremental improvements that we tried to make to the Hootsuite dashboard were getting more complex and more expensive to build. Though we were not explicitly measuring velocity at the time, it was clear that as the team grew, our productivity per person was shrinking.
- Dropping Reliability: Our customer base had grown to a size greater than what our applications were designed to handle. Consequently, things broke more often and our reliability was well below acceptable levels.
We saw two ways forward:
- Allocate large amounts of effort to paying down technical debt, or
- Declare technical bankruptcy and undertake a significant re-write
At the time, our entire core product (not including mobile apps) consisted of approximately 120,000 lines of PHP code, exclusive of third party libraries and tests. This is more code than any one person can be expected to keep straight, yet our teams were organized on the assumption that everyone was a generalist capable of contributing anywhere in the code base.
Team members with years of experience building Hootsuite had to spend more time than ever figuring out a myriad of inter-dependencies to make any required product change. We were hiring fast in 2013, but every new development team member needed more onboarding time to get up to speed and start being productive.
While most software companies never get to the application scale we were at in 2013, many do. Our response to these challenges was to take a page out of the playbook of some of those companies who have successfully scaled past 1 million monthly active users. As others have done before us, we decided to move from a monolithic web application to a services-oriented architecture (SOA).
Product Code vs. Team Size
The graph below shows how the number of lines of code in our monolithic dashboard PHP codebase has grown over the last 3 years:
This shows that dashboard PHP code growth has been essentially linear, with an average of 3,000 net new lines of code added per month. The rate of code growth has tapered somewhat in the past six months but still, the monolith continues to grow.
What’s interesting is that our team size, and therefore our theoretical capacity to build new code, has grown at a faster rate than the code. Since 2013, the size of the product development team at Hootsuite has grown roughly fivefold, from 30 people to 150 people:
What the monolith code growth graph does not show is that instead of having all new developers add new dashboard code, we have been paying down technical debt and building the services that have enabled our product to scale and be more maintainable. We have also removed a lot of PHP code that has been rendered redundant by services.
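To make the comparison between code growth and team growth concrete, here is a rough back-of-the-envelope calculation using only the figures quoted in this post. It assumes the roughly 120,000-line figure from early 2013 is the starting point of the dashboard graph and that growth was perfectly linear at 3,000 lines per month for 36 months; both are simplifications, so treat the output as an estimate rather than a measured value.

```python
# Figures quoted in the post. The linear model and the choice of baseline
# are assumptions made for illustration only.
BASELINE_LINES = 120_000         # approximate monolith PHP size in early 2013
NET_NEW_LINES_PER_MONTH = 3_000  # average net growth reported in the post
MONTHS = 36                      # "the last 3 years"
TEAM_START, TEAM_END = 30, 150   # product development team size


def projected_lines(months: int) -> int:
    """Monolith size after `months`, assuming perfectly linear growth."""
    return BASELINE_LINES + NET_NEW_LINES_PER_MONTH * months


code_growth = projected_lines(MONTHS) / BASELINE_LINES
team_growth = TEAM_END / TEAM_START

print(f"projected monolith size: {projected_lines(MONTHS):,} lines")  # 228,000 lines
print(f"code growth factor: {code_growth:.1f}x")                      # 1.9x
print(f"team growth factor: {team_growth:.1f}x")                      # 5.0x
```

Under these assumptions the monolith less than doubled while the team grew fivefold, which is consistent with most of the added capacity going into debt paydown and services rather than new dashboard code.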
Service Code Growth
This graph shows how all service code bases have grown over time, broken out by individual service:
As the graph shows, services now account for almost half of our total production code base.
Also observable in this graph are three distinct phases of service development:
- Phase 1 (Jan 2013 – June 2014): Learning How to Build Services
- Phase 2 (July 2014 – October 2015): Adding Key Services to Pay Down Technical Debt
- Phase 3 (November 2015 – present): Increasing Product Velocity with New Service Development
Phase 1: Learning How to Build Services
Our ability to build production grade (Scala) services did not happen overnight.
The first true production service that we built was one that abstracts Hootsuite user management functionality away from our dashboard product. To support this service, we also built a custom ZeroMQ based communication protocol to handle communications between the dashboard, other products, and our services layer. In fact, custom protocol building started before we started building our first service.
In retrospect, we may have been better off leveraging existing protocols like HTTP, but the decision to build our own protocol was well-thought-through. In essence, we knew that a ZeroMQ-based protocol could scale horizontally with service call volume. We were not sure that an HTTP-based protocol with single threaded PHP as the main micro-service consumer could scale to our requirements.
The graph below shows how our first core service grew in terms of lines of code over time. This service first went into production in May 2014:
This service took a long time to build because building also involved solving the following problems:
- Learning Scala: To deploy mission critical high volume services, we had to learn Scala and its ecosystem and supporting toolchain deeply
- Deploying Services: We had to learn how to stand-up and deploy service infrastructure
- Data Migration: A large migration of existing data was required – without any downtime – before cutover to this service could occur
- Data Normalization: Since the work done by this service is so mission critical, we had to run both the old monolith logic and the new Scala service logic concurrently for several months to verify the equivalence of the Scala logic within a very large problem domain.
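The idea of running old and new logic side by side to verify equivalence can be sketched as follows. This is a minimal illustration in Python, not the Scala and PHP code we actually ran; the class `ParallelRun` and the two `*_display_name` functions are hypothetical names invented for the example. The legacy result is always the one returned, and any disagreement (or failure) in the candidate is only recorded.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, TypeVar

Req = TypeVar("Req")
Res = TypeVar("Res")


@dataclass
class ParallelRun(Generic[Req, Res]):
    """Serve from the legacy path while shadow-calling the candidate."""

    legacy: Callable[[Req], Res]
    candidate: Callable[[Req], Res]
    mismatches: List[tuple] = field(default_factory=list)

    def __call__(self, request: Req) -> Res:
        expected = self.legacy(request)  # legacy stays the source of truth
        try:
            actual = self.candidate(request)
            if actual != expected:
                self.mismatches.append((request, expected, actual))
        except Exception as exc:  # a candidate failure must not reach callers
            self.mismatches.append((request, expected, exc))
        return expected


# Hypothetical stand-ins for the monolith logic and the new service logic.
def legacy_display_name(user: dict) -> str:
    return user["name"].strip()


def service_display_name(user: dict) -> str:
    return user["name"].strip().title()  # deliberately diverges on case


check = ParallelRun(legacy_display_name, service_display_name)
check({"name": " Ada Lovelace "})  # both paths agree
check({"name": "grace hopper"})    # candidate returns "Grace Hopper"
print(len(check.mismatches))       # 1
```

Returning the legacy result unconditionally is the design choice that makes a long verification period safe: callers never see candidate behaviour until the mismatch log has stayed empty for long enough to justify cutover.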
Once in production, the service delivered several benefits:
- Scalability: Out of the gate, this service was capable of handling orders of magnitude more requests per second than the monolith equivalent
- Reliability: Unburdened from having to process requests now handled by the service, our web servers immediately ran more reliably
- Cost: Scala being so much more efficient than PHP, having this service in production yielded immediate cost savings
- Extensibility: Since this service was decoupled from our dashboard product, having such a service opened the door to having products other than our dashboard access and manage data without having to go through the dashboard.
Phase 1 left us with two conclusions:
- The benefits of SOA are real
- We had to get much better and faster at building services
Phase 2: Adding Key Services to Pay Down Technical Debt
As the graph below shows, from May 2014 to February 2015, we shipped (or substantially started work on) nine new mission critical services:
All through construction of the original service, the team worked out loud and shared knowledge with other teams. We consider this high visibility style of working key to learning at Hootsuite. In the second half of 2014, other teams began to apply this hard-won know-how to building out other mission critical services. Having other teams involved allowed us to replace several systems in desperate need of overhauling.
In this phase we built services that support key high volume systems like:
- The ow.ly link shortening service
- Publishing customer content to social networks
- Sessions, unique identifiers and data encryption
The Event Bus
From April to July of 2015, we also worked on another core piece of SOA enabling technology, something we affectionately refer to as The Event Bus. The Event Bus enables asynchronous, pub/sub style communications between independent services.
With the Event Bus in production, we were able to more easily treat our services as a graph of interconnected things instead of a cluster of independent services that could talk to a central product but not easily to one another.
Newer product functionality like Analytics, video & message Publishing, and Billing all increasingly rely on the Event Bus to handle communications between services.
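The pub/sub pattern behind the Event Bus can be illustrated with a small in-memory sketch. This is not our implementation: the real Event Bus is asynchronous and runs between separate services, whereas this Python example delivers events synchronously inside one process. The class name `InMemoryEventBus`, the topic `message.published`, and the event fields are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class InMemoryEventBus:
    """Toy pub/sub bus: publishers and subscribers only share topic names."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Any) -> int:
        """Deliver `event` to every subscriber of `topic`; return the count."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryEventBus()
analytics_seen: List[dict] = []
billing_seen: List[dict] = []

# Two independent consumers react to the same event without knowing
# about each other or about the publisher.
bus.subscribe("message.published", analytics_seen.append)
bus.subscribe("message.published", billing_seen.append)

delivered = bus.publish("message.published", {"message_id": 1, "network": "twitter"})
print(delivered)  # 2
```

The publisher never references analytics or billing directly, which is what lets services form a graph rather than all routing through a central product.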
Whereas in Phase 1 of our journey to SOA, we required 18 months to build one service, in Phase 2 we built 7 important services in 14 months. Compared to their former PHP equivalents, these services are low maintenance.
While we were getting more efficient at building & shipping new services, we were still not shipping much new service oriented customer value (other than perhaps significantly improved product reliability!). In Phase 3 of our transition to SOA, this changed. Over two years into our journey, we were poised to turn the corner from technical debt paydown to value delivery.
Phase 3: Increasing Product Velocity with New Service Development
In the last 9 months or so, the third part of our journey to SOA, service development has accelerated dramatically. In that time, one quarter of the three years under analysis, we’ve written over half of our services code and shipped, extended, or come close to shipping another 16 services.
The graph below shows the code growth of services that were either started or substantially improved since Q3 2015:
Services built during this productive phase do things for our customers like:
- Manage media uploads
- Drive workflow
- Deal with billing and entitlements
- Send notifications
- Track social media interactions between our customers and their customers
SOA in Production
The graph below shows how many services we’ve had in development over time:
As with the services code growth graph, this graph shows that in each subsequent phase of our SOA development journey, the number of services in active development has also grown.
Most importantly, our customers have benefitted in the form of more reliable products that evolve more quickly. This is captured in how our availability has improved over the last three years:
What We’ve Learned
The short answer is “a lot.” Less tersely, here are some of the key takeaways from our multi-phased, ongoing journey from monolith to SOA.
Scala is Powerful
The power of Scala & functional programming is real. Anecdotally, building Scala services takes longer, but once deployed, they just run. So despite large upfront development cost, there is payback in the form of reduced maintenance cost and therefore total cost of ownership.
Scala is Hard to Learn
Most software engineers are not trained in functional programming, and therefore Scala has a steep learning curve. To deal with this at Hootsuite, we’ve run numerous Scala training programs in house. This has been much more effective than our initial approach of trying to hire people who already know Scala.
Favour Existing Protocols
Building and maintaining a custom service communication protocol is a lot of work. While we had good reason to build our own, we have since started transitioning from reliance on our protocol to reliance on HTTP as the transport layer. The gains in efficiency of building and deploying new services have been immediate.
Work Out Loud
A key to our ability to move quickly as a team and adopt entirely new languages or architectures is what we call “Working Out Loud.” When venturing forth on a new path, we may dedicate a small focused team, but part of that team’s mandate is always to share with the entire team what is succeeding and what is failing. Tangibly, this happens through regular demos, IM, posts to Facebook@Work, and regular cross-team check-ins. We find that if we consciously work out loud from the outset of a project, information and knowledge propagate much more effectively than if we wait until something concrete is shipped.
SOA Operating Costs Can Spiral
The intuition that the infrastructure costs of operating a service oriented architecture are higher is accurate. This can be mitigated through things like reliance on orchestration and containerized services. The takeaway is to build costing into your SOA transition plan.
Conway’s Law is Real
Having gone through many Engineering team re-organizations as the size of the Hootsuite organization has grown, we’ve continued to be amazed at the applicability of Conway’s Law. This states:
“… organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.”
This also held true in our transition to SOA. Key to our transition was re-structuring the organization from one that was built to deliver generic product features to one that was designed to build services AND features. Essentially, this meant transitioning from a large group of generalists who self-formed into short running project teams to a set of stable teams, each responsible for an area of functionality or technology, including the services within it.