Learn about the technology, culture, and processes we use to build Hootsuite.

Everyone — our software developers, co-op students, high school students — 'works out loud' about the tools we use, the experiments we ran, and lessons we learned. We hope our stories and lessons help you.

Recent Posts:

The world of social media moves fast as new platforms arise and gain popularity. Since Hootsuite is here to champion the power of human connection, we want to be able to integrate new social channels quickly. This was one of the driving factors behind creating a universal polling system.


The first design decision was to determine which characteristics of a polling system could be shared across all social channels. This led to splitting our system into a shared scheduler, responsible for registrations, polling frequencies, scheduling, retrying, dealing with rate limits, etc., and a worker for making the actual API calls to the social channels and publishing the results to our Kafka event bus. These scheduler and worker microservices run inside a Kubernetes cluster, which helps ensure each has enough resources to handle bursts in traffic and makes spinning up more replicas a breeze.

Overview of the polling system, with emphasis on how a job flows through it


In our system, a job refers to a specific social channel API endpoint and the kind of data we want to publish to our event bus from it. Each job constantly cycles from a scheduler to a worker and back. It has contextual data that is important for the API query, but not used by the scheduler. For example, if a job’s purpose is to retrieve a user’s new posts, it must keep track of the id of the last post it saw in order to make the appropriate API query. This job context can vary greatly between jobs depending on exactly what it’s meant to publish to the event bus, the social channel it’s connecting to, and the endpoint it’s hitting. For maximum flexibility, we used a hash map stored as a property of each job.
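
In code, such a job might look something like the sketch below; the field names and types are illustrative rather than the actual production model, but the important part is the context hash map:

```go
// Illustrative sketch of a job; field names and types are assumptions.
type Job struct {
	ID           string `json:"id"`
	Channel      string `json:"channel"`      // which social network this job polls
	Endpoint     string `json:"endpoint"`     // which API endpoint it hits
	ServiceLevel string `json:"serviceLevel"` // drives polling frequency

	// Context carries endpoint-specific state between runs, e.g. the id of the
	// last post seen, so the next query can pick up where the previous one left off.
	Context map[string]string `json:"context"`
}
```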

AWS Simple Queue Service (SQS)

Since we need this system to be scalable enough to handle thousands, perhaps millions, of API calls per second, we needed an effective and reliable bridge to connect our scheduler and worker microservices. The team decided to use Amazon’s robust Simple Queue Service to move jobs between services, giving each job its own queue. When a job is completed by a worker, it is returned to the scheduler through the completed queue. One of SQS’s great features is that jobs aren’t deleted immediately once they are consumed from the queue, but only once a confirmation message has been provided (i.e., after the job has been fully processed). This way SQS handles retry logic and helps ensure no jobs are lost. It also provides a dead-letter queue for jobs that have failed after multiple retries, preventing them from clogging up the main queues.


Once a job is consumed from its SQS queue by a worker, the worker uses the job’s context to build an appropriate API query and make calls to the specific API endpoint. Once the calls are made, the results are filtered as appropriate (i.e., we may only be interested in posts after a certain date) and published to the event bus for consumption by Hootsuite products. The job’s context is updated with information important for subsequent runs and the job is pushed into the completed queue along with rate limit information. Since some APIs only allow you to call them a set number of times within a given time window, it is important to respect these limits and to use this information when rescheduling a job. Only after this step is a confirmation sent to the original SQS queue the job was pulled from, signalling that it has been completed and can be removed from the queue.
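
Sketched in Go with the AWS SDK, the worker’s consume/process/confirm cycle looks roughly like the following. The queue URLs and the parseJob/doPoll/publishToEventBus/encodeCompleted helpers are placeholders for illustration, not our actual code:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func runWorker(jobQueueURL, completedQueueURL string) {
	client := sqs.New(session.Must(session.NewSession()))
	for {
		out, err := client.ReceiveMessage(&sqs.ReceiveMessageInput{
			QueueUrl:            aws.String(jobQueueURL),
			MaxNumberOfMessages: aws.Int64(1),
			WaitTimeSeconds:     aws.Int64(20), // long polling
		})
		if err != nil || len(out.Messages) == 0 {
			continue
		}
		msg := out.Messages[0]

		job := parseJob(*msg.Body)                // placeholder: unmarshal the job
		results, rateInfo, pollErr := doPoll(job) // placeholder: query the social API
		if pollErr == nil {
			publishToEventBus(results) // placeholder: filter and push to Kafka
		}

		// Return the job (updated context, rate limit info, any error code)
		// to the scheduler through the completed queue.
		client.SendMessage(&sqs.SendMessageInput{
			QueueUrl:    aws.String(completedQueueURL),
			MessageBody: aws.String(encodeCompleted(job, rateInfo, pollErr)),
		})

		// Confirm only after the job is fully processed; if the worker dies
		// before this point, SQS re-delivers the message after its timeout.
		client.DeleteMessage(&sqs.DeleteMessageInput{
			QueueUrl:      aws.String(jobQueueURL),
			ReceiptHandle: msg.ReceiptHandle,
		})
	}
}
```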

What If Something Goes Wrong?

In the event that something goes wrong with calling the API, error codes are included with the job and it is pushed into the completed queue without publishing anything. Based on the error code, it will be rescheduled differently by the scheduler. If the job fails inside the worker for other reasons, perhaps a bug, it will never be confirmed as completed to the original SQS queue. After a timeout period, it will reappear in that queue again, ready to be retried. After a few instances of this, it will be sent back to the scheduler through an SQS dead-letter queue.

Rescheduling Jobs

Our scheduler reads from the SQS completed and dead-letter queues, rescheduling jobs into time slots based on the job type, whether it returned with an error code, the type of error code, rate limit information, and the Hootsuite service level associated with the account. The service level is based on the type of Hootsuite plan an account is associated with, and different service levels may poll at different frequencies. For example, since enterprise customers require more timely data, their jobs poll more frequently. Rate limit information is especially important if there are multiple products, such as legacy products, that are accessing the same API and sharing the same limit. We have a cushion value for each job, and if the remaining available API calls fall below this value, we schedule the job in the next rate limit window.

Each time slot that a job is scheduled into represents a minute in the day and is implemented as a set within our Redis instance. There are 1440 time slots, each containing the jobs that will be run in that minute. Every minute, an entire time slot is drained and all of the jobs are sent to their appropriate SQS queues. Once a job is pushed back into its appropriate SQS queue, the entire cycle begins anew, nurturing Hootsuite customers with life-sustaining data.
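
A minimal sketch of these time slots, assuming the go-redis client and a hypothetical timeslot:<minute> key scheme (the real key names aren’t described here):

```go
import (
	"context"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// Hypothetical key scheme: one Redis set per minute of the day (0-1439).
func slotKey(minute int) string { return fmt.Sprintf("timeslot:%d", minute) }

// Scheduling a job into a time slot is just a set add...
func scheduleJob(ctx context.Context, rdb *redis.Client, minute int, jobID string) error {
	return rdb.SAdd(ctx, slotKey(minute), jobID).Err()
}

// ...and once a minute the current slot is drained so its jobs can be pushed
// back onto their SQS queues.
func drainSlot(ctx context.Context, rdb *redis.Client, now time.Time) ([]string, error) {
	key := slotKey(now.Hour()*60 + now.Minute())
	jobIDs, err := rdb.SMembers(ctx, key).Result()
	if err != nil {
		return nil, err
	}
	return jobIDs, rdb.Del(ctx, key).Err()
}
```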


In true microservice fashion, we kept our scheduler and worker services stateless, storing all of our data in AWS ElastiCache Redis instances. An in-memory solution was chosen since our throughput was estimated to start at tens of thousands of reads/writes per second for updating and scheduling jobs. It provided us with data structures such as sets, which we used to represent time slots. Redis was also used to cache necessary information such as auth tokens and service levels to avoid straining other Hootsuite services.

Adding a New Channel

The wonderful thing about this polling system is that we can use the same scheduler for new social channels after adding some code to point to the correct SQS queues and Redis instances. A new worker needs to be created for making the API calls, but large parts of its functionality, like publishing to the event bus and reading from SQS queues, can be templated from previous workers. Each channel also requires its own Redis instances: one to store registrations, jobs, time slots, and service level information, and another for the auth tokens the worker requires. Finally, SQS queues for each new job associated with the channel would finish the integration.

Lessons Learned

Social media accounts can be very unpredictable: some accounts have very little activity and then all of a sudden unleash a torrent of posts. In certain cases there were more posts than could be captured within our polling frequency and API query limits. We had to log accounts that showed such activity and manually bump up their polling frequency. A future consideration is a dynamic way to change polling frequencies when there is a lot of activity. Publishing a mass of posts to the Kafka event bus also proved challenging, because exceptions would be thrown related to queuing posts faster than they could be sent. The team ended up internally throttling posts going to the event bus to solve this issue.
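
The post above doesn’t describe exactly how that throttling works; one simple way to express the idea in Go is a token-bucket limiter in front of the producer, as in this sketch (the 500 publishes per second and burst of 50 are arbitrary example values):

```go
import (
	"context"

	"golang.org/x/time/rate"
)

// Hypothetical throttle: allow at most 500 publishes per second with a small
// burst, blocking the caller instead of overflowing the producer's send queue.
var publishLimiter = rate.NewLimiter(rate.Limit(500), 50)

func publishThrottled(ctx context.Context, publish func() error) error {
	if err := publishLimiter.Wait(ctx); err != nil {
		return err
	}
	return publish()
}
```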


Thanks to excellent leadership and teamwork, the universal polling system is successfully polling for its first social channel in production. Its microservice design and fast in-memory database mean it can scale to handle millions of jobs per second. It will make adding new social channels to Hootsuite much quicker and help keep the company at the forefront of social media management.

About the Author

Ostap Manastyrski is a Software Developer Co-op working with the Strategic Integrations team on the universal polling system. When not coding, he does Brazilian jiu jitsu, plays board games, and blends his food. Connect with him on LinkedIn.

As summer interns on the Measure backend team in Bucharest, we had to implement functionalities for both Hootsuite Insights and Analytics. We worked on three main projects, involving controllers and services.

What are controllers? Our data processing system is laid out as a pipeline. The entities within it are called controllers, and they communicate through queues. Data goes through the pipeline, where we might access extra information from our services or the Internet.

As for services, they define a functionality or a set of functionalities that different clients can reuse for different purposes, together with the policies that control its usage.

Health Check Service for Insights

Health checks are a way of interrogating a component’s status inside a distributed system, in order to notify the load-balancing mechanisms when the component should be removed or when more instances should be installed. To understand how health checks work in our project, we need to have some basic knowledge about a service.

Services are implemented using gRPC, for its simple service definition and automatic generation of idiomatic client and server stubs for a variety of languages and platforms. gRPC uses Protocol Buffers (a powerful binary serialization toolset and language) to define a service’s protocol and then it generates stubs which can be used in your server/client implementation.

The server side of a service runs inside containers within Kubernetes Pods. To be more precise, a Dockerized server is deployed in Kubernetes, which adjusts the number of containers and pods automatically to match traffic requirements.

So far, we know what a service is, how to implement it and where it will run. So, what about health checks? Kubernetes does not provide native health checks for gRPC services, so we decided to develop them ourselves.

Health checks are also implemented as a service. The main difference from a regular service is that you “attach” this service to the existing ones. How? You simply ‘register’ it alongside them.

grpcServer := grpc.NewServer(grpc.UnaryInterceptor(interceptors.CreateMonitoringInterceptor()))
pb.RegisterOutboundAnalyticsServer(grpcServer, newServer(es_driver))
health_pb.RegisterHealthCheckServer(grpcServer, health_check.NewServer(make([]func() bool, 0)))

How does it actually work?

Here comes a feature of Kubernetes. When deploying a service, Kubernetes allows tests to be run after the creation of a container (readiness probes) and while it is running (liveness probes). There are 3 types of handlers for probes: TCP (tcpSocket), HTTP request (httpGet) and command (exec). In order to use our custom health checks, we need the latter. We use a client to connect to our server, then use its Check function to verify the status of the service. The specified method runs a list of checks and shows us whether the pod is running correctly or not.
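
The exec probe can invoke a small client binary along these lines. The Check RPC and the health_pb package come from the registration snippet above, but the port, request/response message names, and the Healthy field are assumptions, since the full proto isn’t shown here:

```go
import (
	"context"
	"os"
	"time"

	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure()) // port is an assumption
	if err != nil {
		os.Exit(1)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Message names are assumed; the real ones come from the health check proto.
	resp, err := health_pb.NewHealthCheckClient(conn).Check(ctx, &health_pb.CheckRequest{})
	if err != nil || !resp.GetHealthy() {
		os.Exit(1) // a non-zero exit tells the kubelet the probe failed
	}
}
```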

How do I make my own checks for my service?

Well, a check is simply a boolean function. Each developer can customize their own checks for a service as a list of boolean functions.
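
Putting it together, the server registered with health_check.NewServer above might look roughly like this sketch; apart from the []func() bool checks and the Check method, the names are assumptions:

```go
// Sketch of the health check service implementation (package health_check).
type Server struct {
	checks []func() bool
}

func NewServer(checks []func() bool) *Server {
	return &Server{checks: checks}
}

// Check runs every registered boolean check; the pod is healthy only if all pass.
func (s *Server) Check(ctx context.Context, req *health_pb.CheckRequest) (*health_pb.CheckResponse, error) {
	for _, check := range s.checks {
		if !check() {
			return &health_pb.CheckResponse{Healthy: false}, nil
		}
	}
	return &health_pb.CheckResponse{Healthy: true}, nil
}
```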

We have already integrated health checks for all Insights services in production. For easier deployment on Kubernetes, we have also defined some Jinja templates that use the health-checking system as readiness and liveness probes. Later on, the system was also ported to Analytics by our colleagues.


Anytime something doesn’t work in production, alerts are generated to notify the on-call developers. To achieve this, various checks are run periodically to verify the status of different infrastructure components. These checks use parameters that are statically defined inside the code or YAML files, which define “normal behavior” for a specific component.

One problem of this system is that every time someone wanted to modify the values of the parameters, a deployment needed to be done, which took a lot of time. We decided it would be a big improvement if we could dynamically configure these parameters.

A technology already used by Hootsuite for configuration management is Consul. Amongst other features, Consul provides a distributed system for hierarchical key-value storage. For ease of access, Consul exposes a WebUI which can be used to easily modify and adapt these values, without the need to go through the development process of modifying the code and redeploying it.

We configured the alert-generating system to use Consul for controllers and services alerts on both Insights and Analytics.

Services Alerts

A service’s implementation plots each of its RPC calls and errors to Graphite. This provides very precise information about a service’s status. By monitoring the Graphite metrics, we can define automatic alerts on our services to signal unexpected behaviour. To do so, we followed these steps:
  • generate metadata about services and their RPCs using the Protocol Buffer definitions
  • generate default alerts for each service using the previously obtained metadata and insert these values into Consul KV
  • modify the existing regenerating scripts (which generate .cfg files for Nagios) to use the values from Consul KV

Controllers Alerts

Controllers communicate through Kestrel queues. When a controller doesn’t work properly, messages may pile up in its queue. By monitoring the queues, we can both
  • generate alerts in order to signal the unexpected behavior;
  • scale that controller’s instances up in order to get the jobs processed.
Every type of controller has a set of parameters it needs to stay within, defining the number of instances or the size and speed of the input queue. These values used to be defined statically in YAML configuration files. We decided to load these values into Consul, in order to be able to fetch their updated values during the scaling process and at alert-generating time.

What we did was:

  • add a script that loads the default values from the configuration files into Consul; the script should be run every time a new type of controller is added;
  • modify the auto-scaling and alert-generating mechanisms to use the values stored inside Consul; if Consul fails, we use the static values as a fallback (a sketch of this follows the list).
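
The fallback in the second step can be as simple as the following sketch, using the Consul Go API; the key layout and parameter names are placeholders rather than our real configuration. Here, kv would come from consulapi.NewClient(consulapi.DefaultConfig()) followed by client.KV().

```go
import (
	consulapi "github.com/hashicorp/consul/api"
)

// Fetch a controller parameter from Consul KV, falling back to the statically
// defined value if Consul is unreachable or the key is missing.
// The key layout ("controllers/<name>/max_queue_size") is a placeholder.
func maxQueueSize(kv *consulapi.KV, controller, fallback string) string {
	pair, _, err := kv.Get("controllers/"+controller+"/max_queue_size", nil)
	if err != nil || pair == nil {
		return fallback // Consul failed: use the static value from the YAML config
	}
	return string(pair.Value)
}
```
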
To simplify developers’ lives even further, we use a web app called Dynamic Config (previously developed during a Hackathon by some of our colleagues from the Bucharest office). It features a Flask server that accepts HTTP requests to modify or fetch the controllers’ Consul configurations, and a web UI where developers can change or read these values with just a few clicks.

Analytics: Graphite Plotting of Mongo Calls from Go

The problem

Since our team has recently started to develop new services in Go, some of the functionalities already present in our existing Python code base are not available. One of them was the Graphite plotting of calls to Mongo databases.

Graphite is a timeseries database. It can plot different data over extended periods of time and it’s commonly used for monitoring the usage of different services.

The design choice

What we wanted to obtain was a method for adding calls to the Graphite API every time Mongo is used.

We had multiple ideas about how to achieve this, but we decided from the start that:

  1. The solution should be easy to use (just plug and play)
  2. The developer should not have to worry about how it’s implemented and they should still be able to use all of the methods and fields available in the mgo (Mongo Go) package
  3. Code readability should not be affected at all
Since in our Python code, this functionality was implemented using decorators, the first idea was to try and add them here, too. They can be implemented with reflection, as in the example found [here](https://gist.github.com/saelo/4190b75724adc06b1c5a). The major downside of this was that it affected the readability and that the decorator itself would be hard to understand for any maintainer that wanted to modify it later.

Another idea that came to mind was to define wrappers for the basic objects used from the mgo package. At first, we wanted to be able to write unit tests, so we had to mock the calls to Mongo by using an interface.

With anonymous fields, we were not required to implement all the methods on the wrapper class: if you call a method on a MongoDBWrapperInterface implementation that is only defined on the embedded mgo object, the embedded object’s method is the one that actually gets called. Unfortunately, this design had a major flaw: if we wanted to access an attribute of mgo.Database (like Name) through the wrapper interface, it would not be visible at compile time, since interfaces do not have fields. You would have to use a work-around like db.(MongoDBWrapper).Name or require a getName() method on the interface, which would make the code really ugly.

After discussing with our mentors, we decided to switch to integration tests and avoid mocking for the moment. The final design looks like this:

type GraphiteMonitoredDatabase struct{ *mgo.Database }
type GraphiteMonitoredCollection struct{ *mgo.Collection }

func (db *GraphiteMonitoredDatabase) C(name string) *GraphiteMonitoredCollection {
	return &GraphiteMonitoredCollection{db.Database.C(name)}
}

Some other functions are redefined too. Adding the calls to Graphite was pretty basic:

func (c *GraphiteMonitoredCollection) Find(query interface{}) *mgo.Query {
	startTime := time.Now()
	result := c.Collection.Find(query)
	elapsedTime := time.Since(startTime).Seconds()
	addMongoCallMetrics(c, "find", elapsedTime)
	return result
}


Using the wrappers instead of the actual mgo objects inside the code went smoothly as well. Here’s an example from the tests:

func (suite *GraphiteMonitoredCollectionSuite) TestInsertAndFind() {
	t := suite.T()
	for i := 0; i < 5; i++ {
		suite.collection.Insert(&testType{TestInt: i})
	}
	var findResult []testType
	suite.collection.Find(nil).All(&findResult)
	assert.Equal(t, 5, len(findResult))
}

The final result

In the end, we were pleased with the design we built because it implied minimal code changes and it suited all of the requirements we set from the start.

Now all our Go components are also plotting their Mongo calls:


Working on these projects has been really fun for both of us. Apart from learning a new programming language (Go), we’ve had the occasion to design our own solutions, learn best practices from our mentors and also see our projects adding value when deployed into production.

Of course, After 5 parties, a great team building trip, the everyday foosball and billiards championships as well as weekly basketball matches might have also contributed to our cool experience :).

About the Authors

Monica and Alex are fourth year students at University POLITEHNICA of Bucharest, Faculty of Automatic Control and Computer Science. They found their interest in technology during their high school years and decided to follow their passion as a future career. In her spare time, Monica likes playing ping pong and reading. Alex enjoys the occasional scuba dive and has also been a professional dancer for 10 years. After a summer working on Hootsuite projects, they are prepared to successfully meet all the challenges of their 4th year at university.


As many companies tend towards a service oriented architecture, developers will often wonder whether more and more parts of their service could be moved into the cloud. Databases, file storage, and even servers are slowly transitioning to the cloud, with servers being run in virtual containers as opposed to being hosted on dedicated machines. Recently, FaaS (function as a service) was introduced to allow developers to upload their “application logic” to the cloud without requiring the server, essentially abstracting the servers away. Despite not having to worry about the servers, developers find that they now have to deal with the cloud: complexity around uploading, deploying, and versioning now becomes cloud related. This, along with several current limitations of the FaaS model, has often positioned serverless technology as being best suited to complementing a dedicated server.


Recently, the Serverless Framework was introduced, allowing us to abstract away even the cloud part of the development process. Now, with just about everything server or cloud related hidden away, developers can write code for the actual application directly. The serverless model also offers several other advantages over traditional servers, such as cost and ease of scaling. So would it be possible to completely replace the server with a serverless model? Taking into account the limitations of the FaaS model, we set out to build a fully functional, cloud based, serverless app.

Why Migrate? – The Non-Technical Parts

Earlier this year, our Product Operations and Delivery team decided to migrate services from our Mesos cluster to Kubernetes. George Gao wrote a post detailing the technical reasons for the move. On the non-technical side of things, the tooling around Kubernetes was more developer friendly than what Mesos offered, which bode well for our dev teams. Additionally, only the core operations team that originally implemented Mesos understood it. When problems arose, they were the only ones capable of troubleshooting.

Following an evaluation of alternatives, the team made a bet on Kubernetes and started migrating the services running on Mesos, with the goal of moving all fifteen to Kubernetes. The team gave themselves three months to complete the migration, but thanks to our service mesh, the project only took two!

This was because our service mesh decoupled microservice networking from the application code. As a result, the migration process was limited to simple routing changes. To fully appreciate this, we need to understand how our service mesh works.

What is a Service Mesh?

Imagine you’re writing a new service. Let’s say that your service has a bunch of microservices it needs to talk to. Do you hardcode the URLs to these dependencies into your service? What if there are multiple instances of each service so requests are load balanced? How will your service continuously discover these new instances if they could go down and be brought back up anytime with different URLs?

Adding logic for these considerations would bloat your application code and you’d have to do the same work for every service in your architecture. This work is only compounded as the number of services and languages grows.

One solution is to move this responsibility from the clients to the networking layer. By doing so, we have a ‘thin client’ and ‘fat middleware’ model. This is a service mesh. Practically speaking, this means setting up lightweight proxies between the origin and destination services to take care of service discovery and routing. Service discovery is the mechanism that allows services to dynamically track other services and route to them.

Once the service mesh has been set up, adding a new service to it makes the service automatically routable from existing services. You can then focus on writing application logic and trust that the network will route as you expect it to. At Hootsuite, this helped lower the barrier to writing new microservices. Mark Eijsermans gave a talk that goes into more detail.

Hootsuite’s Service Mesh

Our in-house service mesh is called Skyline. It uses Consul for service discovery and NGINX for routing.

Mesos to Mesos

On each node in the Mesos cluster, we run a Consul agent and an NGINX server. The NGINX config is kept up-to-date by fsconsul and consul-template.

Each container that runs on a Mesos node makes application requests to a Skyline URL: http://localhost:5040/service/foo/endpoint. This request first goes to the local NGINX proxy at port 5040. The local NGINX then proxies that request to the destination NGINX proxy at port 5041, which routes the request to the correct application on the node. So, the Mesos service only needs to know the Skyline URL of its downstream Mesos service.

Mesos to Kubernetes

If the local NGINX proxy can’t figure out where to send the request, it just gets proxied to the Kubernetes cluster. All Kubernetes worker nodes are listed in Consul, so any calls to a service that isn’t found in the Mesos cluster will route to Kubernetes via a catch-all.

When a request comes in from outside the Kubernetes cluster, it will reach any Kubernetes worker node at random. On our Kubernetes nodes, we run a skyline-bridge Pod in a Daemon Set on all worker nodes. These Pods just run NGINX and listen on their container port 5041, which is mapped to the host port 5041. When a request comes into a Kubernetes node, the skyline-bridge Pod transforms the request URL into a kubedns name: http://foo.default.svc.cluster.local:8080/endpoint. After that, kubedns takes care of routing the request to the correct destination Service.

For example, say a Mesos service wants to reach the Giphy service sitting in Kubernetes. The origin service calls http://localhost:5040/service/giphy/media. The request gets proxied from the local NGINX to a dedicated set of gateway servers called the ‘service discovery bridge’ (SDB) and into a Kubernetes worker node.

The skyline-bridge Pod on that node receives the request at {NODE IP}:5041/service/giphy/media. It transforms the request into http://giphy.default.svc.cluster.local:8080/media. That request is then passed to kubedns to be routed to the Giphy Service.
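
The rewrite itself lives in the skyline-bridge NGINX configuration, but the mapping is easy to express as a small Go function for illustration (the default namespace and port 8080 are taken from the example above):

```go
import (
	"fmt"
	"strings"
)

// skylineToKubeDNS maps a Skyline path like "/service/giphy/media" to the
// kubedns form "http://giphy.default.svc.cluster.local:8080/media".
// Illustrative only: the production rewrite is plain NGINX config.
func skylineToKubeDNS(path string) (string, error) {
	parts := strings.SplitN(strings.TrimPrefix(path, "/"), "/", 3)
	if len(parts) < 2 || parts[0] != "service" {
		return "", fmt.Errorf("not a skyline service path: %q", path)
	}
	rest := ""
	if len(parts) == 3 {
		rest = "/" + parts[2]
	}
	return fmt.Sprintf("http://%s.default.svc.cluster.local:8080%s", parts[1], rest), nil
}
```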

Kubernetes to Mesos

If a request from within Kubernetes is destined for a Mesos service, the requesting service calls the kubedns name of a Service that represents the Mesos service. This Service targets skyline-bridge Pods.

For example, the Organization service lives in Mesos, but there is a Service named organization.default.svc.cluster.local to represent it in Kubernetes. The skyline-bridge Pods will then transform the kubedns name of the destination into a Skyline URL before proxying it to the Mesos cluster.

The server named service-discovery-bridge.hootops.com is part of the ‘service discovery bridge’ (SDB) that acts as a gateway into both the Kubernetes and Mesos clusters.

Life of a Request through the Service Mesh

Let’s look at a more detailed example. Say that we have a request from a service outside the cluster to our Giphy service running in Kubernetes. This request will first be made to the local NGINX proxy at http://localhost:5040/service/giphy/media on the origin server. This request then gets proxied to the SDB.

Once the request hits the SDB, it will proxy the request to its known backends based on the request URL. If the destination service sits in our Mesos cluster, it would appear in the SDB’s NGINX config like this:

From this, we know that our request will be routed to one of these Mesos slaves. If the Mesos service goes down, its location block in the SDB will be removed by consul-template. In that case, the SDB will fall back to the Kubernetes upstream, as shown below.

On a side note, since the Mesos service location blocks come before the Kubernetes one, the SDB will always prioritize routing to Mesos over Kubernetes as long as a Mesos version of the service is healthy. Once the request reaches the Kubernetes cluster, the skyline-bridge Pods will take care of routing it to the correct Service.

How did this help with the migration?

Easy cutover

Consider a service’s dependency that was originally in Mesos and sitting on Skyline. If it got scaled down, the SDB would automatically route to the catch-all and into the Kubernetes cluster. Assuming that there was already a Kubernetes version running, the traffic cutover would happen seamlessly.

This is the power of a service mesh. Instead of having to SSH into each upstream service and make manual routing changes, we can just tell Mesos to scale down the service and our service mesh will correct the routing for us.

Easy fallback

If something went wrong with the Kubernetes version of the service while it was serving requests, falling back would be as simple as scaling the Mesos service back up. Since the SDB favours the Mesos cluster, it would automatically route traffic back into the Mesos cluster. This also sidesteps manual configuration in emergencies.

Minimal code changes

With this, the only code changes our migration team had to make were to the dependency URLs. These are usually defined in a service’s deploy configurations (.yml files).

If a service was in Mesos, it would reach a downstream service at http://localhost:5040/service/foo/endpoint. Once it gets deployed to Kubernetes, the downstream service URL would need to be http://foo.default.svc.cluster.local:8080/endpoint. Following our Giphy example, the code changes we made to its deploy configuration looked something like this:

Thus, pull requests were relatively small and simple to review, which made the migration smoother for all parties involved.

Towards the future

Migrating from Mesos to Kubernetes reduced a huge amount of technical debt across all teams. On the operations side, it didn’t make sense for us to maintain two separate container schedulers and deploy environments. It also saved us $67k a year. On the development side, service owners have a more robust and developer-friendly tool in kubectl to debug and maintain deployments, enabling them to troubleshoot problems instead of relying on the core operations team.

For our next step, we are looking at second generation service meshes. Compared to Skyline, these service meshes promise out-of-the-box features like smart retries, circuit breaking, better metrics and logging, authentication and authorization, etc. Currently, Istio and Envoy are our strongest contenders. As our microservices grow, this will give us even more operational control over our service mesh to empower fast and safe deployment of services at Hootsuite.

Big thanks to Luke Kysow and Nafisa Shazia for helping me with this post.

About the Author

Jordan Siaw is a co-op on the Product Operations and Delivery (POD) team. He is a Computing Science major at Simon Fraser University (SFU). When he’s not busy with code, he enjoys reading, podcasts, playing guitar and programming jokes. Find him on Github, Twitter, Instagram or LinkedIn.

The Hootsuite App Directory is a collection of extensions and applications which Hootsuite users can add to their Hootsuite dashboard to create a customized experience. Since its launch in 2011, it has been used by millions of Hootsuite customers. In the past 6 years, it has accumulated gigabytes of data, from information about the apps that people can install to information about which apps are installed for each user. As more apps are released into the app directory and more customers install and use apps, it becomes all the more necessary to have a database which is easy to maintain and scale. The requirements for our App Directory database are:

  • The ability to handle relational data.
  • The ability to easily ensure data integrity.
  • The ability to handle many simultaneous read requests.

When the Hootsuite App Directory was introduced in 2011, we did not know how it would develop over the next six years. MongoDB was chosen to hold the App Directory data; at the time, MongoDB was Hootsuite’s primary database. MongoDB provided us with much needed flexibility during the early stages of the App Directory. We were able to store data in Mongo without focusing strictly on the structure of the data, allowing us to quickly handle rapidly changing requirements as we experimented with various apps and integrations.


We have previously written about our move from a PHP monolith to a microservice architecture. As a part of the larger-scale migration project, we are moving the App Directory logic from the monolith into a Scala microservice. Along with the logic, we are moving the data from the MongoDB Database connected to the PHP monolith, into a new database connected to our microservice. This has given us an opportunity to revisit our database choice and the data’s schema.

Relational Data

Our App Directory has matured since its launch and we have a more stable data model than we had when the project began. This stability allows the model to be represented well by a schema. The model consists of various relationships which can be used to reduce the complexity of the business layer.

We currently have these collections among others in our MongoDB:

  • An App collection which contains data related to Apps.
  • An InstalledApp collection which contains data related to installed Apps.
  • An InstalledAppComponent collection which contains information related to external developers.
By taking advantage of features built into Mongo like document embedding and document references, we could possibly have used MongoDB much more efficiently. These techniques can be used with one-to-many relationships. An example of this kind of relationship in our database is the one between App and InstalledApp:

We have denormalized InstalledApp and InstalledAppComponent, and used references to link the InstalledApp collection to the App collection.

The complexity of the queries increases dramatically when it comes to more involved relationships such as multi-layer hierarchy or many-to-many relationships. You can see this in the example above. When the document starts growing, performing update or search operations on the data becomes more difficult. Though it is easier to update documents using document referencing, it would require us to use multiple queries to retrieve the related data. In the end this leaves us with both ugly documents and highly convoluted queries.

In newer versions of MongoDB, they have introduced “lookup”, a feature analogous to SQL left outer joins. There are two main reasons why we are reluctant to use lookup. Firstly, lookup is a relatively new feature and would require us to upgrade our Mongo version. Secondly, it only performs left outer joins, so performing inner joins and full joins would still result in messy, hard to maintain code.

The complex Mongo query above is expressed relatively simply in SQL. Here is the same query, retrieving the total number of app installs for each app developed by a certain app provider:

The above query shows how easily relations can be handled in MySQL. These are the benefits of using a robust query language like SQL: it allows for operations such as joining, filtering and grouping. MySQL is a relational database and our data model consists of complex relations, so we feel that MySQL is a more suitable database for our use case in the Hootsuite App Directory.

Data Integrity

Data integrity is defined as the overall completeness, accuracy, and consistency of the data. This is highly valuable to us, as it increases the maintainability, reusability, stability, and performance of the service.

MongoDB follows the BASE approach, which sacrifices consistency in favor of making the database more partition tolerant. As a result, performing operations on more than one document at once may lead to corrupted data in the database. MongoDB provides two-phase commits, which allow transactions to be performed similarly to transactions in SQL. If there is an error during the transaction, a rollback is performed. One important difference from SQL is that a user can still access the intermediate state while the operation is in progress. The Mongo documentation warns:

It is possible for applications to return intermediate data at intermediate points during the two-phase commit or rollback. [1]

This is not the case with SQL as it adheres to ACID, having the properties of Atomicity, Consistency, Isolation, and Durability. This ensures that the data always remains in a valid state, both consistent and accurate.

Being schema-less and without referential integrity, MongoDB shifts the burden of maintaining data consistency onto the developers. A side-effect of the lack of a strict schema and referential integrity is that bugs in code can result in inconsistencies in the database. These inconsistencies may not surface as errors until something breaks. It is certainly possible to prevent data inconsistencies in the database by thoroughly designing and testing the software that reads from and writes to it. However, as we are moving from PHP to Scala, we would not only have to rewrite all the models, but we would also have to write extra code to ensure consistency. We reasoned that this would slow down the migration process as well as add to the difficulty of maintaining the code. With the relations in our data, we would like to have referential integrity so that we don’t create orphaned data. Implementing referential integrity in MongoDB would require the following steps:

Inserting an InstalledApp

  • Insert the InstalledApp.
  • Search Mongo for the correct App using the appId.
  • Insert the InstalledApp into the installedApp array in the App collection.

Deleting an App

  • Fetch the right App.
  • Get all the InstalledApp ids.
  • Remove all of the InstalledApps.
  • Delete the App from the App collection.
There are many other scenarios that would have to be covered for our use cases, and that is with only 3 collections! To make things worse, if anything fails partway through, we end up with faulty data.

MySQL requires us to define a schema, declare data types, nullable fields, etc. We can declare foreign keys (referential integrity) while creating the schema itself. The schema does not reside in the business layer; it is part of the database itself. If the data does not agree with the defined schema, it will not be added to the database. This lessens the burden on the developer to implement logic ensuring the consistency and correctness of the data.

Read Requests

Our service receives many simultaneous read requests, and also relies heavily on relational data. We need a database that performs very well under such conditions. Generally speaking, Mongo outperforms MySQL when the service is exposed to a high volume of write requests. This is because things like referential integrity and ACID compliance have a cost. Being horizontally scalable, MongoDB can deal with an even higher volume of requests by taking advantage of the benefits of distributed systems.

When it comes to read requests, especially when dealing with relations, MySQL often outperforms MongoDB. Moreover, with proper indexing, the performance of operations, such as joins, can be improved drastically in MySQL. One reason why MongoDB is slower in these cases is that some of the logic handling the relations resides in the business layer, which is not the case with MySQL. Because we experience a high volume of read requests, allowing for slower write requests in favor of faster read requests is a reasonable trade off.

Although MySQL is generally vertically scalable, there are still ways to make it horizontally scalable. Features like replication, and products like dbShards can be used if needed. With our requirements, replication is a good option as we can balance high read requests between various slave MySQL databases.

Migration Progress

Anyone who has done a data migration knows that it is not an easy task. We are dealing with customer data and we want to ensure that the integrity of that data is maintained throughout the entire process. Our strategy is to write to both our MongoDB and MySQL databases, and then compare whether the data matches. For historical data, we use a migration script which exports the data from Mongo and then imports it into MySQL using the new schema. Any mismatches are fixed by the team, either by adding more validation checks in the business layer or by updating the migration script.

This migration project has given us some good insights into our legacy code. This has enabled us to write much more efficient and more maintainable code for our microservice. It is a win-win situation for us: we are storing clean data in the database and we have higher-quality code.

Remarks and conclusion

In the end, both MongoDB and MySQL have their strengths and weaknesses. However, the differences between MongoDB and MySQL are lessening with new features that continue to be released. Newer versions of MongoDB can use join-like operations, and MySQL now has the ability to store JSON data. Along with that, there are many integrations available which can be used to improve the performance of MySQL, or to handle transactions effectively in MongoDB. Which database is the right choice ultimately depends on the data and what you want to do with it.

For our App Directory service we have a well defined relational data model. We want to ensure that the principles of data integrity are offered by the database itself, and that it can also handle many simultaneous read requests. These requirements led us to choose MySQL for our new App Directory service database.

Shoutout to Neil, Sim, Jody, Steve and Isha for helping me with the blog post.

Preetkaran Rawal is a Co-op Software Developer on the Developer Products team. He currently attends University of Waterloo for Computer Engineering.
