By Ostap Manastyrski on January 11, 2018
The world of social media moves fast as new platforms arise and gain popularity. Since Hootsuite is here to champion the power of human connection, we want to be able to integrate new social channels quickly. This was one of the driving factors behind creating a universal polling system.
The first design decisions were to determine what characteristics of a polling system can be shared between all social channels. This led to splitting our system into a shared scheduler, responsible for registrations, polling frequencies, scheduling, retrying, dealing with rate limits, etc., and a worker for making the actual API calls to the social channels and publishing the result to our Kafka event bus. These scheduler and worker microservices run inside a Kubernetes cluster which helps ensure each has enough resources to handle bursts in traffic and makes spinning up more replicas a breeze.
Overview of of the polling system with emphasis on how a job flows through it
In our system, a job refers to a specific social channel API endpoint and the kind of data we want to publish to our event bus from it. Each job constantly cycles from a scheduler to a worker and back. It has contextual data that is important for the API query, but not used by the scheduler. For example, if a job’s purpose is to retrieve a user’s new posts, it must keep track of the id of the last post it saw in order to make the appropriate API query. This job context can vary greatly between jobs depending on exactly what it’s meant to publish to the event bus, the social channel it’s connecting to, and the endpoint it’s hitting. For maximum flexibility, we used a hash map stored as a property of each job.
AWS Simple Queue Service (SQS)
Since we need this system to be scalable enough to handle thousands, perhaps millions, of API calls per second, we needed an effective and reliable bridge that could connect our scheduler and worker microservices together. The team decided to use Amazon’s robust Simple Queue Service to move jobs between services, giving each job its own queue. When a job is completed by a worker, it is returned back to the scheduler through the completed queue. One of SQS’s great features is that jobs aren’t immediately deleted once they all consumed from the queue, but only once a confirmation messages has been provided (ie. after the job has been fully processed). This way SQS will handle retry logic and help ensure no jobs are lost. It also provides a dead-letter queue for jobs that have failed after multiple retries, preventing them from clogging up the main queues.
Once a job is consumed from it’s SQS queue by a worker, it will use the job’s context to build an appropriate API query and make calls to the specific API endpoint. Once the calls are made, the results are filtered as appropriate (ie. we may only be interested in posts after a certain date) and published to the event bus for consumption by Hootsuite products. The job’s context is updated with information important for subsequent runs and the job is pushed into the completed queue along with rate limit information. Since some APIs only allow you to call them a set number of times within a given time window, it is important to respect these limits and to use this information when rescheduling a job. Only after this step is a confirmation sent to the original SQS queue where the job was pulled from, signalling that it has been completed and can be removed from the queue.
What If Something Goes Wrong?
In the event that something goes wrong with calling the API, error codes are included with the job and it is pushed into the completed queue without publishing anything. Based on the error code, it will be rescheduled differently by the scheduler. If the job fails inside the worker for other reasons, perhaps a bug, it will never be confirmed as completed to the original SQS queue. After a timeout period, it will reappear in that queue again, ready to be retried. After a few instances of this, it will be sent back to the scheduler through a SQS dead-letter queue.
Our scheduler reads from the SQS completed and dead-letter queues, rescheduling jobs into time slots based on the job type, whether the it returned with an error code, the type of error code, rate limit information, and the Hootsuite service level associated the account. The service level is based on the type of Hootsuite plan an account is associated with and different service levels may poll at different frequencies. For example, since enterprise customers require more timely data they will poll more frequently. Rate limit information is especially important if there are multiple products, such as legacy products, that are accessing the same API and sharing the same limit. We have a cushion value for each job and if the remaining available API calls are less than this value, we will schedule it in the next rate limit window.
Each time slot that a job is scheduled into represents a minute in the day and is implemented as a set within our Redis instance. There are 1440 time slots, each containing the jobs that will be run in that minute. Every minute, an entire time slot is drained and all of the jobs are sent to their appropriate SQS queues. Once a job is pushed back into its appropriate SQS queue, the entire cycles begins anew, nurturing Hootsuite customers with life sustaining data.
In true microservice fashion, we kept our scheduler and worker services stateless, storing all of our data in AWS ElastiCache Redis instances. An in-memory solution was chosen since our throughput was estimated to be starting at tens of thousands of read/writes per second for updating and scheduling jobs. It provided us with data structures such a sets, which we used to represent time slots. Redis was also used to cache necessary information such as auth tokens and service levels to avoid straining other Hootsuite services.
Adding a New Channel
The wonderful thing about this polling system is that we can use the same scheduler for new social channels after adding some code to point to the correct SQS queues and Redis instances. A new worker needs to be created for making the API calls, but large parts of its functionality, like publishing to the event bus and reading from SQS queues, can be templated from previous workers. Each channel would also require its own Redis instances. One to store registrations, jobs, time slots, and service level information and the other for auth tokens that the worker requires. Finally, SQS queues for each new job associated with the channel would finish the integration.
Social media accounts can be very unpredictable and some accounts will have very little activity and all of the sudden unleash a torrent of posts. In certain cases there were more posts than could be captured within our polling frequency and API query limits. We had to log accounts that showed such activity and manually bump up their polling frequency. A future consideration is a dynamic way to change polling frequencies when there is a lot of activity. Publishing a mass of posts to the Kafka event bus also proved challenging because exceptions would be thrown related to queuing posts faster than they could be sent. The team ended up internally throttling posts going to the event bus to solve this issue.
Thanks to excellent leadership and teamwork the universal polling system is successfully polling for its first social channel in production. Its microservice design and fast in-memory database means it can scale to handle millions of jobs per second. It will make adding new social channels to Hootsuite much quicker and help keep the company at the forefront of social media management.
About the Author
Ostap Manastyrski is Software Developer Co-op working with the Strategic Integrations team on the universal polling system. When not coding, he does Brazilian jiu jitsu, plays board games, and blends his food. Connect with him on LinkedIn.