Learn about the technology, culture, and processes we use to build Hootsuite.

Everyone — our software developers, co-op students, high school students — 'works out loud' about the tools we use, the experiments we ran, and lessons we learned. We hope our stories and lessons help you.

Recent Posts:

A big movement at Hootsuite is to have everything on Kubernetes, and this includes our CI/CD pipelines. So given Jenkins, Kubernetes, and Vault, our goal was to create a CI/CD system that was secure, portable, and scalable.

With this blog post I’ll be covering some of the steps we took to achieve this.

Figuring out how Jenkins and Kubernetes work together

So going into this, we knew two things: First we wanted Jenkins to be our CI/CD tool and second we wanted to take advantage of Kubernetes to schedule our jobs. To link the two together we decided to use the jenkins-kubernetes-plugin. This plugin allows the Jenkins Masters to make Kubernetes API calls to schedule builds inside their own Kubernetes Pods.

This provides the following benefits:

  • Isolation: Each build runs in its own Pod, so it can’t affect other builds. As the Kubernetes documentation puts it, each Pod is a logical host.
  • Ephemeral: Pods do a fantastic job of cleaning up after themselves because they are ephemeral by nature. Unless we explicitly keep the changes a Pod makes during its lifetime, everything is erased. No more conflicting workspaces!
  • Build Dependencies: Related to isolation, but with Pods each job can define exactly what its build needs. Say a build pipeline has several stages: one for building the JavaScript frontend, and another for building the Go backend. Each of those stages can have its own container, simply by pulling the necessary image for each stage.
Below is a pipeline taken from the plugin repo which demonstrates the three benefits. A unique Pod is created, its state is wiped when the build completes, and most importantly each build step uses a container with a specific image. How lovely!

Containerized Jenkins Master and Agents

Now that we had an idea of how Jenkins and Kubernetes would work together, we had to move our Jenkins Master and Agent into modern times. This meant defining two rock-solid containers.

For the Jenkins Master we started off by creating an image that would contain all the plugins that it would require. The base of this plugin image was this one. The reason we separated the plugins out was to allow us to update plugins independently from the Jenkins Master. Our plugin base image ended up looking something like this:

From there we could then build our Jenkins Master image on top of it. One of the nice things about the official Jenkins image is that it offers a lot of flexibility. We took advantage of this by copying in configuration overrides, initial groovy startup scripts, and other files to ensure our Jenkins Master would start up configured and ready to go.

Files that we copy over to our Jenkins Master

As for our Jenkins Agent, we used this image as the base. The benefit here being that the Kubernetes plugin that we chose was developed and tested with this image in mind. As such the communication between our Master and Agents is well supported.

Once we had decided how the images would look, we set up a Jenkins job that would regularly build and push our Master and Agent images to our registry. This way we would always be up to date and avoid having outdated versions of Jenkins and its plugins.

Using Vault to handle our CI/CD secrets

With our images set up, the next step was figuring out if we could #BuildABetterWay to manage our secrets in Jenkins. For this we turned to Vault. *

This choice was made for the following reasons:

  1. Provides a single source of truth; previously our secrets would get sprawled across our Jenkins Masters and became difficult to manage
  2. We are already using Vault extensively at Hootsuite, so we have lots of support and knowledge
  3. Vault supports a Kubernetes Authentication method (More on this below)
The first two reasons are self explanatory, but the third was where things got interesting and also the main focus of my contributions.

Previously we were using AppRole Authentication and while it worked well, it meant we had a secret_id and role_id that we had to manage. Ideally what we wanted was a way for our Pods to tell Vault that it belonged to a certain Kubernetes cluster and should be granted certain access. This is where Kubernetes Authentication comes in.

I’ve outlined the steps for our Kubernetes Cluster to authenticate with Vault:

  1. Before anything happens, we set up the Vault and Kubernetes relationship by giving Vault some information about our cluster:
    • The cluster’s CA cert
    • The host of our Kubernetes cluster
    • A Vault policy
    • A Vault role that is mapped to our Kubernetes namespace/serviceaccount
    With that completed, Vault now knows which Kubernetes cluster to respond to and which ServiceAccount in the cluster is allowed to authenticate against the Vault Role
  2. When we define the Jenkins Master Pod, we add a field that attaches a ServiceAccount to that Pod. This ServiceAccount is referenced when the Pod starts up and is used to retrieve the account’s JWT.
  3. Once the JWT is retrieved it is sent over to Vault which then forwards it to Kubernetes.
  4. Vault will then receive a response from Kubernetes that says the JWT came from the correct namespace and is actually the Service Account it claims to be.
  5. Once Vault gets confirmation it knows that the Pod has the right ServiceAccount which means it is mapped to a Vault role, and so Vault gives back a VAULT_TOKEN that the Pod can then use.

The Five steps converted as a picture

What’s great is that there is no secret that has to be managed and the Pod only needs to use the Kubernetes API. So from the Pod’s perspective in the startup script of a container it would do something like:

And with the Vault token inside VAULT_LOGIN_RESULT it can use that for subsequent calls to Vault.
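For readers who prefer Go to shell, the same login exchange can be sketched roughly as follows. It reads the ServiceAccount JWT that Kubernetes mounts into the Pod, posts it to Vault's Kubernetes auth login endpoint, and pulls the client token out of the reply. The role name `jenkins-master`, the default `kubernetes` auth mount path, and the helper names are assumptions for illustration, not values from our setup.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// loginRequest is the JSON body sent to Vault's Kubernetes auth login endpoint.
type loginRequest struct {
	Role string `json:"role"`
	JWT  string `json:"jwt"`
}

// loginResponse keeps only the field we need from Vault's reply.
type loginResponse struct {
	Auth struct {
		ClientToken string `json:"client_token"`
	} `json:"auth"`
}

// buildLoginBody encodes the role and ServiceAccount JWT as JSON.
func buildLoginBody(role, jwt string) ([]byte, error) {
	return json.Marshal(loginRequest{Role: role, JWT: jwt})
}

// parseClientToken extracts the Vault token from a login response body.
func parseClientToken(body []byte) (string, error) {
	var resp loginResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.Auth.ClientToken == "" {
		return "", errors.New("vault response contained no client token")
	}
	return resp.Auth.ClientToken, nil
}

// login reads the Pod's JWT and exchanges it for a Vault token.
// It assumes the auth method is mounted at the default "kubernetes" path.
func login(vaultAddr, role, jwtPath string) (string, error) {
	jwt, err := os.ReadFile(jwtPath)
	if err != nil {
		return "", err
	}
	body, err := buildLoginBody(role, strings.TrimSpace(string(jwt)))
	if err != nil {
		return "", err
	}
	resp, err := http.Post(vaultAddr+"/v1/auth/kubernetes/login", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("vault login failed: %s", resp.Status)
	}
	return parseClientToken(raw)
}

func main() {
	addr := os.Getenv("VAULT_ADDR")
	if addr == "" {
		fmt.Println("VAULT_ADDR is not set; skipping login")
		return
	}
	// "jenkins-master" is a hypothetical Vault role name.
	token, err := login(addr, "jenkins-master", "/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		fmt.Fprintln(os.Stderr, "login error:", err)
		os.Exit(1)
	}
	fmt.Println("received a Vault token of length", len(token))
}
```

The returned token would then be sent in the `X-Vault-Token` header on subsequent reads.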

At this point you might be wondering how we can go from Vault Secrets to Jenkins Credentials. This is where the initial groovy startup scripts come in. On startup our entrypoint script reads Jenkins-related secrets from Vault and writes these values as JSON objects into a temporary file. The startup script then reads this temporary file and converts those values into Jenkins Credentials.

So at a Vault path containing Jenkins Credentials, we would keep something like

Which on Jenkins Master gets converted to

A Hashicorp Vault Secret converted to a Jenkins Credential

For information on how to programmatically add credentials check here.

With all that done we now have a way to securely retrieve CI/CD secrets from Vault and a way to convert them to Jenkins Credentials if needed.

* For those wondering why we didn’t use something like the Jenkins Vault Plugin, it was lacking in a few areas:

  1. It did not support Kubernetes Authentication
  2. The secrets could not be used to make Jenkins Credentials which other plugins can use
  3. It would mean adding initial setup scripts to our Jenkinsfiles


The steps in this post hopefully described how you can improve your Jenkins CI/CD pipeline with Docker, Kubernetes, and Vault. Revisiting our goals, let’s make sure we’ve made things secure, portable, and scalable. We added security by letting Vault handle our CI/CD secrets. With Docker we ended up with self-contained Jenkins Master and Agent containers. Then, with Kubernetes orchestrating these containers, we are able to handle dynamic workloads.


About the Author

David Jung is a co-op Student on the Production, Operations, and Delivery team. He is currently studying Computer Engineering at The University of British Columbia. Connect with him on LinkedIn.







Love technology? Are you a high school student in grade 11 or 12? Do you live in British Columbia?

This is the 4th summer of Hootsuite’s Tech High School Program and we’re back offering an opportunity to sharpen your technical skills in our company and community of creators, innovators, and builders dedicated to championing the power of human connection.

You’ll pair with a mentor and work side-by-side with a passionate team to build something bigger than yourself. You’ll experience what it’s like to build products that millions of people use every day.

There are opportunities at Hootsuite in all aspects of technology including software (both mobile and web), operations, and IT.

This is a paid position with a competitive salary.

Application Deadline: Apr 30th, 2018

Application instructions are at the end of this post.

Hootsuite 2018 Summer High School Program
The only source of knowledge is experience – Albert Einstein

Software Development

Work with our software developers to build and ship software in a continuous delivery environment, where we ship code to production many times a day. Come and collaborate on solving problems with other developers and play an active role in shaping our product as you learn. Some of our current technologies include JavaScript (React.js and Flux), Go, HTML5/CSS3, Scala, Python, Play framework, iOS, Android, Akka, MongoDB, MySQL, and more.

“Working on Hootsuite’s Publishing team has made this the best summer of my life. I’ve learned a great deal about web development as well as developing software as part of a team, which is something that you can only get with experience. The skills I gained during this summer will be incredibly valuable down the road and I’d like to thank everybody at Hootsuite who gave me a hand in developing them.” – Rhys Rustad-Elliott.

Operations Development

Our operations engineers build and maintain the systems infrastructure, servers, and networks that power our mission-critical social media applications. Put another way, your code may not end up on the homepage, but your team loves you because you make everything “just work”. Our current technical stack and tool chain includes Consul, Terraform, Nginx, MySQL, MongoDB, Sensu, Graphite, PagerDuty, Redis, Memcached and many others. We build tools with Python, Bash, and Perl.

“Working for the Ops Team at Hootsuite I get to work on methods to maintain Hootsuite’s infrastructure. It’s the biggest responsibility I have had over the years working for various tech companies. It has taught me how to think bigger and foresee problems. I think what makes Hootsuite awesome are the people here. They are amazing, talented, and really friendly. It’s a fun environment to work in and everyone has trust in each other to do the right thing.” – Mishra.


Our “NerdNest” team supports our 1000+ staff who need a stable and reliable environment to get their work done: hardware troubleshooting, setting up laptops, network admin. This is best done with a positive attitude, by someone who is graceful under pressure and has a natural curiosity for solving problems with physical technology. Our technology includes Apple, Dell, VoIP, AD, Okta, VMWare, Meraki, multiple SaaS solutions, and Google Apps to highlight a few.

“I used to think that being in IT is about being a hardware engineer, but once I started my co-op, I realized that there is a great deal of diversity within the IT department. There are a variety of opportunities and positions: helpdesk, support, purchaser, system administrators, and technicians.” – Shameen Gill.

Curious about Hootsuite?

Application Instructions

Apply for this position here, on our careers page.

Selected candidates will hear from Hootsuite directly.

The world of social media moves fast as new platforms arise and gain popularity. Since Hootsuite is here to champion the power of human connection, we want to be able to integrate new social channels quickly. This was one of the driving factors behind creating a universal polling system.


The first design decisions were to determine what characteristics of a polling system can be shared between all social channels. This led to splitting our system into a shared scheduler, responsible for registrations, polling frequencies, scheduling, retrying, dealing with rate limits, etc., and a worker for making the actual API calls to the social channels and publishing the result to our Kafka event bus. These scheduler and worker microservices run inside a Kubernetes cluster which helps ensure each has enough resources to handle bursts in traffic and makes spinning up more replicas a breeze.

Overview of the polling system with emphasis on how a job flows through it


In our system, a job refers to a specific social channel API endpoint and the kind of data we want to publish to our event bus from it. Each job constantly cycles from a scheduler to a worker and back. It has contextual data that is important for the API query, but not used by the scheduler. For example, if a job’s purpose is to retrieve a user’s new posts, it must keep track of the id of the last post it saw in order to make the appropriate API query. This job context can vary greatly between jobs depending on exactly what it’s meant to publish to the event bus, the social channel it’s connecting to, and the endpoint it’s hitting. For maximum flexibility, we used a hash map stored as a property of each job.
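As a rough illustration of that idea, a job could be modelled as a small struct whose context is a string map, serialised to JSON when it travels through a queue. The field names and the `last_post_id` key below are hypothetical; the real job schema is not shown in this post.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Job is a hypothetical shape for a polling job. The scheduler only needs the
// top-level fields; Context is opaque to it and meaningful only to the worker.
type Job struct {
	ID      string            `json:"id"`
	Channel string            `json:"channel"`
	Type    string            `json:"type"`
	Context map[string]string `json:"context"`
}

// recordLastSeen stores the id of the newest post the worker has processed,
// so that the next run can ask the API only for newer posts.
func (j *Job) recordLastSeen(postID string) {
	if j.Context == nil {
		j.Context = make(map[string]string)
	}
	j.Context["last_post_id"] = postID
}

// encode serialises the job for transport between scheduler and worker.
func (j Job) encode() ([]byte, error) {
	return json.Marshal(j)
}

// decodeJob restores a job from its serialised form.
func decodeJob(data []byte) (Job, error) {
	var j Job
	err := json.Unmarshal(data, &j)
	return j, err
}

func main() {
	job := Job{ID: "job-1", Channel: "example-channel", Type: "new_posts"}
	job.recordLastSeen("12345")

	data, err := job.encode()
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(data))

	restored, err := decodeJob(data)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("last post seen:", restored.Context["last_post_id"])
}
```

Keeping the context as a plain map means a new job type can carry whatever keys it needs without any change to the scheduler.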

AWS Simple Queue Service (SQS)

Since we need this system to be scalable enough to handle thousands, perhaps millions, of API calls per second, we needed an effective and reliable bridge to connect our scheduler and worker microservices. The team decided to use Amazon’s robust Simple Queue Service to move jobs between services, giving each job its own queue. When a job is completed by a worker, it is returned to the scheduler through the completed queue. One of SQS’s great features is that jobs aren’t deleted as soon as they are consumed from the queue, but only once a confirmation message has been provided (i.e. after the job has been fully processed). This way SQS handles retry logic and helps ensure no jobs are lost. It also provides a dead-letter queue for jobs that have failed after multiple retries, preventing them from clogging up the main queues.


Once a job is consumed from its SQS queue by a worker, the worker uses the job’s context to build an appropriate API query and makes calls to the specific API endpoint. Once the calls are made, the results are filtered as appropriate (i.e. we may only be interested in posts after a certain date) and published to the event bus for consumption by Hootsuite products. The job’s context is updated with information important for subsequent runs and the job is pushed into the completed queue along with rate limit information. Since some APIs only allow you to call them a set number of times within a given time window, it is important to respect these limits and to use this information when rescheduling a job. Only after this step is a confirmation sent to the original SQS queue where the job was pulled from, signalling that it has been completed and can be removed from the queue.

What If Something Goes Wrong?

In the event that something goes wrong with calling the API, error codes are included with the job and it is pushed into the completed queue without publishing anything. Based on the error code, it will be rescheduled differently by the scheduler. If the job fails inside the worker for other reasons, perhaps a bug, it will never be confirmed as completed to the original SQS queue. After a timeout period, it will reappear in that queue, ready to be retried. After a few instances of this, it will be sent back to the scheduler through an SQS dead-letter queue.

Rescheduling Jobs

Our scheduler reads from the SQS completed and dead-letter queues, rescheduling jobs into time slots based on the job type, whether it returned with an error code, the type of error code, rate limit information, and the Hootsuite service level associated with the account. The service level is based on the type of Hootsuite plan an account is associated with, and different service levels may poll at different frequencies. For example, since enterprise customers require more timely data, they will poll more frequently. Rate limit information is especially important if there are multiple products, such as legacy products, that are accessing the same API and sharing the same limit. We have a cushion value for each job, and if the remaining available API calls are less than this value, we will schedule it in the next rate limit window.
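The cushion rule can be sketched as a small function. This is a simplified model under stated assumptions: the service level names, the interval values, and the idea of expressing everything in whole minutes are illustrative, and error-code handling is left out.

```go
package main

import "fmt"

// pollingIntervals maps a service level to a polling interval in minutes.
// Both the level names and the values are illustrative assumptions.
var pollingIntervals = map[string]int{
	"enterprise": 1,
	"standard":   5,
}

// defaultInterval is used for service levels that are not listed above.
const defaultInterval = 15

// intervalFor returns the polling interval in minutes for a service level.
func intervalFor(level string) int {
	if minutes, ok := pollingIntervals[level]; ok {
		return minutes
	}
	return defaultInterval
}

// delayMinutes returns how many minutes from now a job should run again.
// If fewer API calls remain than the job's cushion, and the rate limit window
// resets later than the normal interval, the job waits for the next window.
func delayMinutes(interval, remainingCalls, cushion, minutesUntilReset int) int {
	if remainingCalls < cushion && minutesUntilReset > interval {
		return minutesUntilReset
	}
	return interval
}

func main() {
	interval := intervalFor("enterprise")
	// Plenty of calls left: poll again after the normal interval.
	fmt.Println(delayMinutes(interval, 200, 10, 12))
	// Below the cushion: wait until the rate limit window resets.
	fmt.Println(delayMinutes(interval, 3, 10, 12))
}
```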

Each time slot that a job is scheduled into represents a minute in the day and is implemented as a set within our Redis instance. There are 1440 time slots, each containing the jobs that will be run in that minute. Every minute, an entire time slot is drained and all of the jobs are sent to their appropriate SQS queues. Once a job is pushed back into its appropriate SQS queue, the entire cycle begins anew, nurturing Hootsuite customers with life-sustaining data.
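To make the slot arithmetic concrete, here is a minimal in-memory sketch. It stands in for the Redis sets with a Go map, so it shows the indexing and draining logic only; the `timeslot:` key format and the type names are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// slotsPerDay is the number of one-minute time slots in a day (1440).
const slotsPerDay = 24 * 60

// slotIndex converts a time of day into its slot number.
func slotIndex(hour, minute int) int {
	return hour*60 + minute
}

// slotAfter returns the slot that lies delay minutes after the current one,
// wrapping around at midnight.
func slotAfter(current, delay int) int {
	return (current + delay) % slotsPerDay
}

// slotKey builds a hypothetical key name for the set backing a slot.
func slotKey(index int) string {
	return fmt.Sprintf("timeslot:%04d", index)
}

// Scheduler keeps one set of job ids per slot, mimicking Redis sets in memory.
type Scheduler struct {
	slots map[int]map[string]struct{}
}

// NewScheduler returns an empty scheduler.
func NewScheduler() *Scheduler {
	return &Scheduler{slots: make(map[int]map[string]struct{})}
}

// Schedule adds a job to a slot; adding the same job twice has no effect.
func (s *Scheduler) Schedule(slot int, jobID string) {
	if s.slots[slot] == nil {
		s.slots[slot] = make(map[string]struct{})
	}
	s.slots[slot][jobID] = struct{}{}
}

// Drain removes and returns every job in a slot, sorted for stable output.
func (s *Scheduler) Drain(slot int) []string {
	jobs := make([]string, 0, len(s.slots[slot]))
	for id := range s.slots[slot] {
		jobs = append(jobs, id)
	}
	delete(s.slots, slot)
	sort.Strings(jobs)
	return jobs
}

func main() {
	now := time.Now()
	current := slotIndex(now.Hour(), now.Minute())
	target := slotAfter(current, 5)

	s := NewScheduler()
	s.Schedule(target, "job-1")
	s.Schedule(target, "job-2")
	fmt.Println(slotKey(target), s.Drain(target))
}
```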


In true microservice fashion, we kept our scheduler and worker services stateless, storing all of our data in AWS ElastiCache Redis instances. An in-memory solution was chosen since our throughput was estimated to start at tens of thousands of reads/writes per second for updating and scheduling jobs. It provided us with data structures such as sets, which we used to represent time slots. Redis was also used to cache necessary information such as auth tokens and service levels to avoid straining other Hootsuite services.

Adding a New Channel

The wonderful thing about this polling system is that we can use the same scheduler for new social channels after adding some code to point to the correct SQS queues and Redis instances. A new worker needs to be created for making the API calls, but large parts of its functionality, like publishing to the event bus and reading from SQS queues, can be templated from previous workers. Each channel would also require its own Redis instances: one to store registrations, jobs, time slots, and service level information, and another for the auth tokens that the worker requires. Finally, SQS queues for each new job associated with the channel would finish the integration.

Lessons Learned

Social media accounts can be very unpredictable: some accounts will have very little activity and then all of a sudden unleash a torrent of posts. In certain cases there were more posts than could be captured within our polling frequency and API query limits. We had to log accounts that showed such activity and manually bump up their polling frequency. A future consideration is a dynamic way to change polling frequencies when there is a lot of activity. Publishing a mass of posts to the Kafka event bus also proved challenging because exceptions would be thrown related to queuing posts faster than they could be sent. The team ended up internally throttling posts going to the event bus to solve this issue.


Thanks to excellent leadership and teamwork, the universal polling system is successfully polling for its first social channel in production. Its microservice design and fast in-memory database mean it can scale to handle millions of jobs per second. It will make adding new social channels to Hootsuite much quicker and help keep the company at the forefront of social media management.

About the Author

Ostap Manastyrski is a Software Developer Co-op working with the Strategic Integrations team on the universal polling system. When not coding, he does Brazilian jiu jitsu, plays board games, and blends his food. Connect with him on LinkedIn.

As summer interns on the Measure backend team in Bucharest, we had to implement functionalities for both Hootsuite Insights and Analytics. We worked on three main projects, involving controllers and services.

What are controllers? Our data processing system is laid out as a pipeline. The entities within it are called controllers, and they communicate through queues. Data goes through the pipeline, where we might access extra information from our services or the Internet.

As for services, they define a functionality or a set of functionalities that different clients can reuse for different purposes, together with the policies that control its usage.

Health Check Service for Insights

Health checks are a way of interrogating a component’s status inside a distributed system, in order to notify the load-balancing mechanisms when the component should be removed or when more instances should be installed. To understand how health checks work in our project, we need to have some basic knowledge about a service.

Services are implemented using gRPC, for its simple service definition and automatic generation of idiomatic client and server stubs for a variety of languages and platforms. gRPC uses Protocol Buffers (a powerful binary serialization toolset and language) to define a service’s protocol and then it generates stubs which can be used in your server/client implementation.

The server side of a service is running inside containers within Kubernetes Pods. To be more precise, a dockerized server is deployed in Kubernetes, which handles the number of containers alongside its pods to automatically adjust to the “traffic” requirements.

So far, we know what a service is, how to implement it and where it will run. So, what about health checks? Kubernetes does not provide native health checks for gRPC services, so we decided to develop them ourselves.

Health checks are also implemented as a service. The main difference from a regular service is that you “attach” this service to the existing ones. How? You simply ‘register’ it alongside them.

grpcServer := grpc.NewServer(grpc.UnaryInterceptor(interceptors.CreateMonitoringInterceptor()))
pb.RegisterOutboundAnalyticsServer(grpcServer, newServer(es_driver))
health_pb.RegisterHealthCheckServer(grpcServer, health_check.NewServer(make([]func() bool, 0)))

How does it actually work?

Here comes a feature of Kubernetes. When deploying a service, Kubernetes allows tests to be run after the creation of a container (readiness probes) and while it is running (liveness probes). There are three types of handlers for probes: TCP (tcpSocket), HTTP request (httpGet), and command (exec). In order to use our custom health checks, we need the last one. We use a client to connect to our server, then use its Check function to verify the status of the service. That method runs a list of checks and reports whether the pod is running correctly or not.

How do I make my own checks for my service?

Well, a check is simply a boolean function. Each developer can customize their own checks for a service as a list of boolean functions.
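A minimal sketch of that idea, independent of gRPC: the health status is simply the conjunction of every registered check. The function name `runChecks` and the example checks are hypothetical and only illustrate the "list of boolean functions" shape.

```go
package main

import "fmt"

// runChecks reports whether every check in the list passes.
// An empty list counts as healthy, since nothing has failed.
func runChecks(checks []func() bool) bool {
	for _, check := range checks {
		if !check() {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical checks; a real service might ping its database
	// or verify that a worker goroutine is still alive.
	databaseReachable := func() bool { return true }
	queueNotFull := func() bool { return true }

	healthy := runChecks([]func() bool{databaseReachable, queueNotFull})
	fmt.Println("healthy:", healthy)
}
```

A server built this way can accept the list at construction time, which matches the empty `make([]func() bool, 0)` passed when registering the health check service.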

We have already integrated health checks for all Insights services in production. For easier deployment on Kubernetes, we have also defined some Jinja templates that use the health-checking system as readiness and liveness probes. The system was later ported to Analytics by our colleagues.


Anytime something doesn’t work in production, alerts are generated to notify the on-call developers. To achieve this, various checks are run periodically to verify the status of different infrastructure components. These checks use parameters that are statically defined inside the code or YAML files, which define “normal behavior” for a specific component.

One problem with this system was that every time someone wanted to modify the values of the parameters, a deployment needed to be done, which took a lot of time. We decided it would be a big improvement if we could dynamically configure these parameters.

A technology already used by Hootsuite for configuration management is Consul. Amongst other features, Consul provides a distributed system for hierarchical key-value storage. For ease of access, Consul exposes a WebUI which can be used to easily modify and adapt these values, without the need to go through the development process of modifying the code and redeploying it.

We configured the alert-generating system to use Consul for controllers and services alerts on both Insights and Analytics.

Services Alerts

A service’s implementation plots each of its RPC calls and errors to Graphite. This provides very precise information about a service’s status. By monitoring the Graphite metrics, we can define automatic alerts on our services to signal unexpected behaviour. To do so, we followed these steps:
  • generate metadata about services and their RPCs using the Protocol Buffer definitions
  • generate default alerts for each service using the previously obtained metadata and insert these values into Consul KV
  • modify the existing regenerating scripts (which generate .cfg files for Nagios) to use the values from Consul KV

Controllers Alerts

Controllers communicate through Kestrel queues. When a controller doesn’t work properly, messages may pile up in its queue. By monitoring the queues, we can both
  • generate alerts in order to signal the unexpected behavior;
  • scale that controller’s instances up in order to get the jobs processed.
Every type of controller has a set of parameters it needs to stay within, defining the number of instances or the size and speed of the input queue. These values used to be defined statically in YAML configuration files. We decided to load these values into Consul so that we could fetch their updated values during the scaling process and at alert-generating time.

What we did was:

  • add a script that loads into Consul the default values from the configuration files; the script should be run every time a new type of controller is added;
  • modify the auto-scaling and alert-generating mechanisms to use the values stored inside Consul; if Consul fails, we use the static values as fallback.
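The fallback behaviour can be sketched as below. To keep the example self-contained it hides Consul behind a tiny interface with in-memory stand-ins; the interface, the key layout, and the function name `thresholdFor` are assumptions rather than the real implementation or the Consul client API.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// KVStore is a hypothetical, minimal view of a key-value store such as Consul KV.
type KVStore interface {
	Get(key string) (value string, found bool, err error)
}

// mapStore is an in-memory stand-in used for illustration.
type mapStore map[string]string

func (m mapStore) Get(key string) (string, bool, error) {
	value, found := m[key]
	return value, found, nil
}

// failingStore simulates the key-value store being unreachable.
type failingStore struct{}

func (failingStore) Get(string) (string, bool, error) {
	return "", false, errors.New("store unavailable")
}

// thresholdFor returns the dynamically configured value for key, or the static
// fallback when the store fails, the key is missing, or the value is not a number.
func thresholdFor(store KVStore, key string, fallback int) int {
	raw, found, err := store.Get(key)
	if err != nil || !found {
		return fallback
	}
	value, err := strconv.Atoi(raw)
	if err != nil {
		return fallback
	}
	return value
}

func main() {
	const key = "controllers/example/max_queue_size" // hypothetical key layout
	dynamic := mapStore{key: "500"}

	fmt.Println(thresholdFor(dynamic, key, 100))        // value from the store
	fmt.Println(thresholdFor(failingStore{}, key, 100)) // static fallback
}
```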
To simplify the developer’s life even further, we decided to use a web app called Dynamic Config (previously developed during a Hackathon by some of our colleagues from the Bucharest office). It features a Flask server that accepts HTTP requests to modify or fetch the controllers’ Consul configurations, and a web UI where developers can easily change or read these values with just a few clicks.

Analytics: Graphite Plotting of Mongo Calls from Go

The problem

Since our team has recently started to develop new services in Go, some of the functionalities already present in our existing Python code base are not available. One of them was the Graphite plotting of calls to Mongo databases.

Graphite is a timeseries database. It can plot different data over extended periods of time and it’s commonly used for monitoring the usage of different services.

The design choice

What we wanted to obtain was a method for adding calls to the Graphite API every time Mongo is used.

We had multiple ideas about how to achieve this, but we decided from the start that:

  1. The solution should be easy to use (just plug and play)
  2. The developer should not have to worry about how it’s implemented and they should still be able to use all of the methods and fields available in the mgo (Mongo Go) package
  3. Code readability should not be affected at all
Since in our Python code, this functionality was implemented using decorators, the first idea was to try and add them here, too. They can be implemented with reflection, as in the example found here: https://gist.github.com/saelo/4190b75724adc06b1c5a. The major downside of this was that it affected the readability and that the decorator itself would be hard to understand for any maintainer that wanted to modify it later.

Another idea that came to mind was to define wrappers for the basic objects used from the mgo package. At first, we wanted to be able to write unit tests, so we had to mock the calls to Mongo by using an interface.

With anonymous fields, we were not required to implement all the methods on the wrapper class. If one has a MongoDatabase object and calls a method that is defined for the MongoDBWrapperInterface implementation, that method is the one actually called. Unfortunately, this design had a major flaw: if we wanted to access an attribute of mgo.Database (like Name) from the wrapper, it would not be visible by default at compile time, since interfaces do not have fields. You would have to use a work-around like db.(MongoDBWrapper).Name or require a getName() method on the interface, which would make the code really ugly.
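The flaw is easy to reproduce in isolation. In the sketch below, `Database` is a stand-in for mgo.Database and all other names are hypothetical: the embedded field's `Name` is promoted on the concrete wrapper, but it disappears as soon as the value is held behind an interface.

```go
package main

import "fmt"

// Database is a simplified stand-in for mgo.Database.
type Database struct {
	Name string
}

// C mimics a method on the wrapped type.
func (d *Database) C(collection string) string {
	return d.Name + "." + collection
}

// DatabaseAPI is the kind of interface one would introduce to mock calls in tests.
type DatabaseAPI interface {
	C(collection string) string
}

// Wrapper embeds *Database, so its methods and fields are promoted.
type Wrapper struct {
	*Database
}

// nameOf shows the work-around: a type assertion back to the concrete wrapper.
func nameOf(api DatabaseAPI) (string, bool) {
	concrete, ok := api.(*Wrapper)
	if !ok {
		return "", false
	}
	return concrete.Name, true
}

func main() {
	w := &Wrapper{&Database{Name: "analytics"}}
	fmt.Println(w.Name) // the promoted field is visible on the concrete type

	var api DatabaseAPI = w
	fmt.Println(api.C("posts")) // methods are available through the interface
	// api.Name would not compile: interfaces expose methods, not fields.

	if name, ok := nameOf(api); ok {
		fmt.Println(name)
	}
}
```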

After discussing with our mentors, we decided to switch to integration tests and avoid mocking for the moment. The final design looks like this:

type GraphiteMonitoredDatabase struct{ *mgo.Database }
type GraphiteMonitoredCollection struct{ *mgo.Collection }

// C wraps the collection returned by the embedded mgo.Database.
func (db *GraphiteMonitoredDatabase) C(name string) *GraphiteMonitoredCollection {
    return &GraphiteMonitoredCollection{db.Database.C(name)}
}

Some other functions are redefined too. Adding the calls to Graphite was pretty basic:

func (c *GraphiteMonitoredCollection) Find(query interface{}) *mgo.Query {
    startTime := time.Now()
    result := c.Collection.Find(query)
    elapsedTime := time.Since(startTime).Seconds()
    addMongoCallMetrics(c, "find", elapsedTime)
    return result
}


Using the wrappers instead of the actual mgo objects inside the code went smoothly as well. Here’s an example from the tests:

func (suite *GraphiteMonitoredCollectionSuite) TestInsertAndFind() {
    t := suite.T()
    for i := 0; i < 5; i++ {
        suite.collection.Insert(&testType{TestInt: i})
    }
    var findResult []testType
    // ...
}

The final result

In the end, we were pleased with the design we built because it implied minimal code changes and it suited all of the requirements we set from the start.

Now all our Go components are also plotting their Mongo calls:


Working on these projects has been really fun for both of us. Apart from learning a new programming language (Go), we’ve had the occasion to design our own solutions, learn best practices from our mentors and also see our projects adding value when deployed into production.

Of course, the After 5 parties, a great team-building trip, the everyday foosball and billiards championships, and the weekly basketball matches might have also contributed to our cool experience :).

About the Authors

Monica and Alex are fourth-year students at University POLITEHNICA of Bucharest, Faculty of Automatic Control and Computer Science. They found their interest in technology during their high school years and decided to turn that passion into a career. In her spare time, Monica likes playing ping pong and reading. Alex enjoys occasional scuba diving and has also been a professional dancer for 10 years. After a summer working on Hootsuite’s projects, they are prepared to successfully meet all the challenges of their fourth year at the faculty.


As many companies tend towards a service oriented architecture, developers will often wonder whether more and more parts of their service could be moved into the cloud. Databases, file storage, and even servers are slowly transitioning to the cloud, with servers being run in virtual containers as opposed to being hosted on a dedicated machine. Recently, FaaS (function as a service) was introduced to allow developers to upload their “application logic” to the cloud without requiring the server, essentially abstracting the servers away. Despite not having to worry about the servers, developers find that they now have to deal with the cloud. Complexity with uploading, deploying, and versioning now becomes cloud related. This, along with several current limitations of the FaaS model, has often positioned serverless technology as being best suited towards complementing a dedicated server.


Recently, the Serverless Framework was introduced, allowing us to abstract even the cloud part of the development process away. Now, with just about everything server or cloud related hidden away, developers can write code that relates directly to the actual application. The serverless model also offers several other advantages over traditional servers, such as cost and ease of scaling. So would it be possible to completely replace the server with a serverless model? Taking into account the limitations of the FaaS model, we set out to build a fully functional, cloud based, serverless app.
