Learn about the technology, culture, and processes we use to build Hootsuite.

Everyone — our software developers, co-op students, and high school students — 'works out loud' about the tools we use, the experiments we run, and the lessons we learn. We hope our stories and lessons help you.

Recent Posts:

Positioning in CSS can be a nightmare, especially when you want all your pages to be responsive and look pretty on phones, tablets, laptops, and bigger screens. To lay out an entire page, one option is to use a grid system like Bootstrap's. But what if we want to lay out something smaller, like the items inside a component?

As part of the WebDev team, I worked on several components that were used by editors in our CMS to create content for the Hootsuite website. I would be given the design of a component, and it was my job to build it, keeping responsiveness, translations, and content variability in mind at all times. I was fairly comfortable with the basics of CSS, and the transition to using SCSS at Hootsuite didn't turn out to be a huge challenge either. What I struggled with the most was positioning – even tasks that looked simple often turned out not to be.

Pre-Flexbox: Absolute Positioning

Let's take a simple situation as an example: vertically centering something. Besides the horrendous idea of hard-coding pixel values, a neat trick that does the job is absolute positioning:

See the Pen Vertical Centering – Absolute Positioning by Jieun Lee (@jieun-lee) on CodePen.
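As a sketch (class names here are hypothetical, and the pen may differ in its details), the trick looks something like this:

```scss
.parent {
  position: relative;
  height: 200px;
}

.child {
  position: absolute;
  top: 50%;                      // push the top edge to the middle of the parent
  transform: translateY(-50%);   // then pull the element back up by half its own height
}
```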

This isn’t a terrible idea, since it accomplishes the task with a few lines of code and is also fairly responsive, as the object would be centered at different screen sizes as well. However, it’s not the best solution, as it makes code overly complicated when we get to more complex situations, such as when using nested components and handling varying designs for different screen sizes. I kept thinking that there had to be a better way to approach these problems, and it turned out the solution I was looking for was flexbox.

The Transition to Using Flexbox

During a code review, one of my teammates suggested using flexbox to refactor my component. At first, it seemed like a lot of work: I would have to rewrite a lot of CSS after spending so much time getting all the images and content to align nicely with absolute positioning. However, after a quick read of a few flexbox articles I found online, it didn't seem so bad. In fact, it seemed to make the tasks that should have been easy actually easy.

After some thought, I decided to start fresh with a blank CSS file – I literally scrapped all the CSS I had and re-wrote everything using flexbox. It was amazing to see how quickly I was able to re-create the component, and my file went from 150 lines to 100 lines. Refactoring was definitely worth it – of course, it did take a bit more time to redo the component, but it took much less time to implement and my code was shorter, cleaner, and easier to read and manage.

So let’s go back to that vertical centering problem. This is the same scenario created using flexbox:

See the Pen Vertical Centering – Flexbox by Jieun Lee (@jieun-lee) on CodePen.

All we had to do was apply display: flex; then set the vertical margin to auto. This was a much easier and more flexible solution.
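In code, that could look something like this (again with hypothetical class names):

```scss
.parent {
  display: flex;
  height: 200px;
}

.child {
  margin: auto 0;   // auto vertical margins absorb the free space equally, centering the child
}
```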

Flexbox Layouts with an Example

There are many resources out there that explain the fundamentals of flexbox and its basic properties, so instead of repeating them I will show you a sample layout built with flexbox and break down how it was created. Let's say we're trying to make something like this:

See the Pen Flexbox Example – Final by Jieun Lee (@jieun-lee) on CodePen.

It's a basic example, but it contains several uses of flexbox:

  • Overall Layout: the sidebar and main section are in a wrapper div, and the wrapper and header are in a page div
  • Header: the box that contains the title is flexible and fills the space that isn’t filled up by the circular buttons on the right
  • Sidebar: the menu items are spaced apart evenly from one another (this might look ugly on mobile, normally we wouldn’t keep the menu as a sidebar on smaller screens)
  • Main: contains three streams, evenly spaced, with each stream containing multiple text boxes with varying lengths (two streams on smaller screen sizes; this would normally be styled nicer on real components)

NOTE: If you're not familiar with SCSS, part of the syntax may be confusing, so here's a short guide to the SCSS syntax you'll encounter in my code:

  • Variables, named with a dollar sign, e.g. $menu-color: #333;
  • Nesting with ampersands, where the & stands in for the parent selector, e.g. .header { &__main { /* css */ } } compiles to .header__main { /* css */ }

Making the Menu Items Space Evenly on the Sidebar

Making items space evenly is easy with flexbox: add display: flex; to the flex container and set margin: auto; on the flex items. This is the same technique we used to center a single item. Try changing the height to verify that the items stay evenly spaced at various heights.

See the Pen Flexbox Example – Sidebar by Jieun Lee (@jieun-lee) on CodePen.
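A minimal sketch of the sidebar spacing, with hypothetical class names:

```scss
.sidebar {
  display: flex;
  flex-direction: column;   // stack the menu items vertically
  height: 100vh;

  &__item {
    margin: auto;           // each item takes an equal share of the leftover space
  }
}
```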

Splitting the Main Section into Streams

This is actually pretty simple, since all three streams (two, if you’re on mobile) have the same width, height, and flex properties. We will be able to see the desired behavior as soon as we add display: flex; and flex-direction: row;. However, the underlying property that makes this all work is set on the flex items (the streams).

In the code below, I have explicitly included flex: 0 1 auto; which is the default value in most modern browsers. The values represent flex-grow, flex-shrink, and flex-basis, respectively. None of the streams will grow larger than the specified width (flex-grow: 0;), but all of them have the equal ability to shrink (flex-shrink: 1;). They also have a flex-basis: auto; which is used to set the default size of the stream. This method is better than specifying widths in percentages, as it will be easier to make changes if we ever have to add or remove streams but still want to fill the entire space.

See the Pen Flexbox Example – Main Section by Jieun Lee (@jieun-lee) on CodePen.
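A sketch of the streams, with hypothetical class names and sizes:

```scss
.main {
  display: flex;
  flex-direction: row;
}

.stream {
  flex: 0 1 auto;   // flex-grow: 0, flex-shrink: 1, flex-basis: auto
  width: 300px;     // with flex-basis: auto, this width acts as the default size
  margin: 0 10px;
}
```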

Filling Remaining Space in the Top Header

The header is composed of three circular buttons with fixed sizes on the right and a titlebar on the left that fills the space that isn’t taken up by the buttons. When the width is adjusted, the titlebar shrinks or grows accordingly, and the title text inside centers itself automatically (just another use of flexbox with auto margin).

The only actual work we have to do is set header__menu to have flex-shrink: 0; instead of the default value of 1. This means these menu items cannot shrink, so the titlebar with an initial width of 100% shrinks instead, as it still has flex-shrink: 1;.

See the Pen Flexbox Example – Header by Jieun Lee (@jieun-lee) on CodePen.
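Roughly, the header could be set up like this (class names and sizes are hypothetical):

```scss
.header {
  display: flex;

  &__title {
    display: flex;    // another flex container so the title can center itself
    width: 100%;      // start at full width; flex-shrink defaults to 1, so it gives way to the buttons

    h1 {
      margin: auto;   // center the title text
    }
  }

  &__menu {
    flex-shrink: 0;   // the circular buttons never shrink
    width: 40px;
    height: 40px;
    border-radius: 50%;
  }
}
```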

Combining the Sections

Basically, the “page” div holds the header and the wrapper, and the “wrapper” holds the sidebar and the main section. Of course, I used display: flex; on both the page and the wrapper. This example uses minimal content (no menus, only one stream, etc.) to illustrate the layout.

As mentioned previously, the page div holds the header and the wrapper (with a flex-direction: column;). We want the header to be a fixed height, which we specify. We also add flex-shrink: 0; so that the heading section will never get squished. This means the wrapper div (below the header) has a flex-shrink: 1; so it will adjust its height as needed.

The wrapper works in a similar way: it has flex-direction: row; and a sidebar with a fixed size. For the sidebar we set width: 20%; and flex-shrink: 0;, and we give the main section width: 100%; so it can shrink as needed.

See the Pen Flexbox Example – Layout by Jieun Lee (@jieun-lee) on CodePen.
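Putting the pieces together, the skeleton might look like this (again, names and sizes are hypothetical):

```scss
.page {
  display: flex;
  flex-direction: column;   // header on top, wrapper below
  height: 100vh;

  .header {
    height: 60px;
    flex-shrink: 0;          // the header never gets squished
  }

  .wrapper {
    display: flex;
    flex-direction: row;     // sidebar on the left, main section on the right
    height: 100%;            // shrinks as needed, since flex-shrink defaults to 1

    .sidebar {
      width: 20%;
      flex-shrink: 0;
    }

    .main {
      width: 100%;           // takes whatever space the sidebar leaves over
    }
  }
}
```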

Put these four steps together to get the final product that was illustrated at the top of this section. If all of this seems easy to you, try and create this exact scenario without using flexbox – in addition to looking correct, it has to be scalable to content changes and different screen sizes. You’ll quickly realize how much flexbox simplifies the whole process.

One final note about flexbox: some older browsers have limited or no support for it. For example, Internet Explorer 10 has a different default value for flex, and some properties such as flex-shrink are not supported. Most newer browsers do have full support for flexbox, but we may need to add vendor prefixes (e.g. display: -webkit-flex;) for compatibility. Tools like Autoprefixer can automate this, or you can use something like SCSS mixins to add the prefixed properties all at once.
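For example, a small SCSS mixin could bundle the prefixed declarations (a sketch; the exact prefixes you need depend on the browsers you support):

```scss
@mixin display-flex {
  display: -webkit-flex;   // older WebKit browsers
  display: -ms-flexbox;    // Internet Explorer 10
  display: flex;
}

.container {
  @include display-flex;
}
```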

There are a lot of neat things you can do with flexbox, including with properties I didn't use in my examples. Either way, I encourage you to try it out – you'll be amazed by how easily you can create interesting layouts with just a few short lines of code.

About the Author

Jieun Lee is a Software Developer Co-op on the WebDev Team. Jieun studies Computer Science and Mathematics at UBC, and likes to play piano in her spare time. Connect with her on LinkedIn.

What is Kubernetes?

Kubernetes is an open-source container orchestration platform that works across clusters of machines.

In Kubernetes, containers belonging to each application are grouped into work units called pods which are scheduled on specific worker nodes in the cluster on a resource-availability basis.

This is analogous to how an operating system's CPU scheduler decides which processes receive CPU cycles at any given moment. Other OS scheduling concepts like affinity and priority have Kubernetes equivalents as well. If a pod needs more resources, it can be scaled vertically by changing the resource limits in its manifest.

Situations do arise when it is necessary to have multiple pods running the same container, for reasons such as load balancing or high availability; this is known as horizontal scaling. Pods are horizontally scaled and managed using deployments. Pods running the same application are grouped into services, which provide a single point of access for the pods they represent; a service's IP remains the same even when its backend pods are scaled or rescheduled. Services use a selector, which identifies backend pods based on their labels – arbitrary tags that can be used to group pods together.

When a service is deployed in Kubernetes it is easily accessible from other services in the cluster via kube-dns. But what if you want to host a service meant to be accessed from outside the cluster?

The Options

There are three ways to expose an external-facing service in Kubernetes, which differ in their ServiceType specification: as a NodePort, a LoadBalancer, or a ClusterIP with Ingress.

  • NodePort: each service is accessible on its own port, in the form <NodeIP>:<Port>.
  • LoadBalancer: each service provisions its own LoadBalancer from the cloud provider.
  • ClusterIP with Ingress: all services share one LoadBalancer, and requests are proxied to each service.
The method most commonly used in production environments is ClusterIP with Ingress, due to its high scalability, accessibility, and low infrastructure usage – reasons this post will elaborate upon! When services are exposed this way, only one load balancer and one wildcard DNS record are required, and clients do not have to keep track of a service's port number. All services use a common point of entry, which simplifies development, operations, and security efforts. Routing rules and provisioning are done within the manifests of Kubernetes-native objects, allowing for easy configurability.

ClusterIP With Ingress

In this example, a request to xyz.bar.com is directed to xyz-container and a request to foo.bar.com is directed to foo-container. This type of proxying is known as name-based virtual hosting. The ingress-controller pod runs a container which handles the proxying – most commonly this is implemented using nginx, an open-source reverse proxy and load balancer (among other cool features!) although other implementations such as HAproxy and Traefik exist. Proxy rules are defined in the nginx configuration file /etc/nginx/nginx.conf and kept up-to-date by a controller binary that watches the Kubernetes API for changes to the list of external-facing services, known as ingresses; hence the name ingress controller. When an ingress is added, removed, or changed, the controller binary rewrites nginx.conf according to a template and signals nginx to reload the configuration. Nginx proxies requests according to its configuration by looking at the host header to determine the correct backend.

Each exposed service requires the following Kubernetes objects:

  • Deployment: specifies the number of pods and the containers that run on those pods.
  • Ingress: defines the hostname rules and backends used by the controller for the service.
  • Service: lists the destination pods and ports according to the selector in the Endpoints API.

Here’s what the Kubernetes manifest describing these objects for the xyz application might look like in YAML:
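A minimal sketch of what that manifest could contain – the names, image, replica count, and ports here are hypothetical, and the apiVersion values vary between Kubernetes releases:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xyz-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: xyz
  template:
    metadata:
      labels:
        app: xyz
    spec:
      containers:
        - name: xyz-container
          image: example/xyz:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: xyz-service
spec:
  selector:
    app: xyz
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: xyz-ingress
spec:
  rules:
    - host: xyz.bar.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: xyz-service
                port:
                  number: 80
```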

The ingress controller requires its own deployment and service just like any other application in Kubernetes. The controller’s service provisions the load balancer from the cloud provider with optional provider-specific configuration. Additionally, the controller can use a ConfigMap, a Kubernetes object for storing configuration key-value pairs, for global customization and consume per-service annotations, which are key-value metadata attached to Kubernetes objects, specified in each ingress. Examples of these configuration options include the use of PROXY protocol, SSL termination, use of multiple ingress controllers in parallel, and restriction of the load balancer’s source IP range.

Let’s return our attention to the two other methods of exposing external-facing services. Firstly, let’s revisit ServiceType LoadBalancer:

LoadBalancer

From the client's perspective, the behaviour is the same: a request to xyz.bar.com still reaches xyz-container and foo.bar.com still hits foo-container. However, note the duplicated infrastructure: an extra load balancer and CNAME record is needed for every service. If there were a thousand unique services, there would have to be a thousand load balancers and DNS records, all needing to be maintained; what happens if a change needs to be made to every load balancer or record, or perhaps only a specific subset of them? Updating DNS records can also prove a hassle with this configuration, as changes will not be instant due to caching, as opposed to an nginx configuration reload, which is a near-zero-downtime operation. This is a network that simply does not scale, and it wastes resources if each service does not receive enough requests to justify having its own load balancer.

For completeness, here is the equivalent ServiceType NodePort network diagram:

NodePort

Here, client request behaviour is radically different. To access xyz-container, a request to node.bar.com:30000 must be made, and for foo-container, the request goes to node.bar.com:32767. This port-based virtual hosting is extremely unfriendly to the client: port numbers must be tracked and kept up to date in case they change, and this is additional overhead on the node; while there are thousands of possible ports, their finite number is still a limitation to take into account. These port numbers are auto-assigned by Kubernetes by default; it is possible to specify a service's external port, but the responsibility for avoiding collisions falls on the developer. While it is possible to solve some of these issues with SRV records or service discovery, those solutions add a layer of complexity that negates the simplicity of this method.

The Choice

In comparison, ClusterIP with Ingress provides a powerful, transparent, and intelligent way to allow Kubernetes services to be accessed from outside the cluster. It allows services to assert control over how they expose themselves by moving per-service routing rules from cloud provider infrastructure like DNS records and load balancers to Kubernetes object manifests inside the cluster. In most cases, it is the preferred solution to expose an external-facing service with advantages in scalability, accessibility, and infrastructure over NodePort or LoadBalancer. Hopefully you’ve come away from this post with a general overview of the three methods of exposing external-facing services in Kubernetes!

About The Author

Winfield Chen is a high school co-op on the Operations team, recently graduated from Centennial Secondary School. He will be attending Simon Fraser University's School of Computing Science in September.

Over the past two months, I've had the chance to work on Hootsuite's Amplify team to implement a system for detecting buying intent in tweets, in order to help Amplify users track prospective customers. While Amplify has traditionally relied on user-defined keyword matching to filter tweets, users still had to sift through these potential buying signals to find leads. By integrating a system that intelligently scores tweets and ranks contacts, we make Amplify more effective at automating the social selling process. This post details the decisions and challenges I came across in implementing these changes.

Initial Considerations

One of the defining factors of any scoring system is how it chooses to interpret its data. In our case, this was deciding on how to evaluate the stages of the buyer’s lifecycle. Kulkarni, Lodha, & Yeh (2012) described three main discrete steps in online buying behavior, where a customer:
  1. Declares intention to buy (express intent, or EI)
  2. Searches and evaluates options (purchase intent, or PI)
  3. Makes post-purchase statements (post-evaluation, or PE)
Following this system, our focus was to accurately classify and score tweets that fell in those categories.

We decided to evaluate tweets which fell into those categories differently. For example, a post such as “Should I get a Mac or a PC?” expresses a much higher intent to buy than a post where somebody expresses their thoughts on a product they just bought. There was also the problem of ambiguity – for example, in the case that somebody states “I am going to buy X type of product”, it would be difficult to know for sure whether they were simply expressing an intent to search for X type of product, or if they were literally steps away from purchasing that product. For these reasons, we decided on ‘base scores’ for posts such that PI > EI > PE.

Choosing a Service

We compared candidate services on four capabilities – sentiment, intensity, emotion, and intent:

  • IBM Watson: sentiment via the Natural Language Understanding service; intensity implied within the Tone Analyzer; emotion via the Tone Analyzer or Natural Language Understanding; intent via the Natural Language Classifier.
  • Microsoft: sentiment via the Text Analytics API (in Preview as of June 2017); intent via the Language Understanding Intelligent Service (in Preview as of June 2017).
  • Converseon: sentiment, intensity, and emotion via ConveyAI; intent unknown (waiting for their response).

Another initial consideration was which machine learning service to use. While looking at available services, we searched primarily for ones that provided not only text classification, but also insight into sentiment and intensity. We decided on IBM's Watson Developer Cloud for its large ecosystem of services.

Implementation

As Hootsuite continues to transition from a monolithic architecture to microservices, our scoring system, the Intelligent Contact Ranking Engine (ICRE), was implemented as a Flask web service within Kubernetes. This provides a layer of abstraction between our existing Ruby back-end and the handling of asynchronous requests and scoring of posts done by the ICRE.

While the ICRE acts as an adapter between our back-end and IBM's Natural Language Classifier (NLC), it also handles some functionality of its own, as well as technical hurdles we encountered along the way. Here are a few:

Batch Requests

One challenge we came across was a lack of support for processing batch requests. Potential buying signals (tweets) are collected in batches by a job in Amplify's back-end. This would be fine, except it takes, on average, 0.8 seconds for a request to be sent to Watson's NLC and for a response to be returned. Given the nearly 2.6 million keyword-matched tweets stored in our database, it's clear that sending these tweets one by one would be a major bottleneck. The ICRE optimizes this process by using a thread pool. Though parallelism in Python is limited by the Global Interpreter Lock, most of the processing is done on IBM's side, so any inefficiency is minimized.
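A rough sketch of the batching approach, where classify_tweet is a hypothetical wrapper around a single NLC request:

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool


def classify_tweet(tweet):
    """Hypothetical wrapper: one blocking HTTP round trip (~0.8s) to Watson's NLC."""
    ...


def classify_batch(tweets, workers=20):
    # The threads spend almost all of their time waiting on network I/O,
    # so the GIL has little impact here.
    with ThreadPool(workers) as pool:
        return pool.map(classify_tweet, tweets)
```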

Re-training & Scoring

Another minor challenge came with training our ML model. An interesting aspect of Watson’s NLC is that once a model (termed ‘classifier’) is trained, it cannot be retrained. This means that if we ever wanted to retrain our model, we would have to initialize the training process on IBM’s side, wait for that classifier to finish training, and then switch the classifier_id in our code to use that new classifier. ICRE reduces this complexity for developers in two ways:
  1. Allowing devs to call a training event with a simple command when running the ICRE
  2. Automatically detecting the latest available classifier each time it’s called
This way, a developer can simply call a training event, have a cup of coffee, check back later, and delete the old classifier once they confirm that the newest model has completed training. Even if the newest model completed training 30 minutes before the developer checked back, ICRE would have already started using it behind the scenes.
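A sketch of the "always use the newest classifier" logic, assuming a hypothetical list of classifier metadata (the field names and format shown are assumptions):

```python
from datetime import datetime


def latest_classifier_id(classifiers):
    """Pick the most recently created, fully trained classifier.

    Each entry is assumed to look like:
    {"classifier_id": "abc123", "status": "Available", "created": "2017-07-01T00:00:00.000Z"}
    """
    available = [c for c in classifiers if c.get("status") == "Available"]
    newest = max(available, key=lambda c: datetime.strptime(c["created"], "%Y-%m-%dT%H:%M:%S.%fZ"))
    return newest["classifier_id"]
```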

All of these implementation details allow the ICRE to reliably calculate signal scores given the information returned by the Natural Language Classifier.

Training & Methodology

So how did we train the model? As we didn’t have any previously available annotated dataset, I had to create the dataset myself. In essence, the idea was to collect data for as many feasible cases as possible. I collected training data by manually classifying information from:
  1. Keyword-matched tweets from our production databases
  2. Tweets grabbed through keyword searches
  3. Product reviews online
Kulkarni et al. (2012) found that 3.4% of all tweets they collected were “related to consumer buying behavior”. Therefore, besides classifying our data between the stages of the buyer’s lifecycle, we had to make sure we also had a large sample of data classified as “irrelevant”.

I found that there were a few patterns to tweets which fell in the same category – here are just a few examples of soft rules/guidelines I outlined while classifying training data:

Express Intent

E.g., “I’m looking to buy a new 6-10 seater dining table. Any recommendations?”

Express Intent is simply the declaration of the desire to purchase. This often includes:

  • Keywords such as “want”, “wanna”, “desire”, “wish”, etc.
  • Expression of anticipation

Purchase Intent

E.g., “I’ve decided my new PC is going to be #Ryzen based, unless someone can convince me to buy #Intel?”

Purchase Intent includes both the search and evaluation of options. This often includes:

  • Asking for details on how to obtain something
  • Asking about the ‘goodness’ of a product/service
  • Asking for opinions of one purchase option versus another

Post-Evaluation

E.g., “Solid purchase, no regrets.”

Post-Evaluation can be thought of as a review or statement after having purchased a service/product.

On Consistency

We decided that we would create a more effectively trained model by manually classifying data against a set of soft rules rather than crowdsourcing for labelled data. I found that in cases where context or images were required to fully understand a tweet, data classification could be ambiguous, especially when spread across a wide variety of people with different understandings of what “purchase intent” may mean. Consistency is key to training a good model. With an initial training set of over 1700 classified strings, we found that we were already getting good results.

Integration

While the original purpose of integrating intent analysis into Amplify was to score posts so that we could intelligently rank contacts on the front-end, we came across new possible use cases while implementing support in Amplify’s back-end. This led to some important decisions about how we should handle contact scoring.

Syncing Post Scoring and Contact Scoring

In an ideal situation, we would update contact scores whenever post scores are updated. However, this introduces unnecessarily high server load. Recalculating contact scores every time post scores are calculated (a job run at a fairly small interval) would mean running many PostgreSQL queries involving both a join and a summation at that same interval. This is computationally inefficient.

Incrementing contact scores each time a post score is updated might have been ideal, but old post removal is automatically handled by our database – therefore making it difficult to track when we would need to decrement contact scores.

The most efficient way, then, of going about contact scoring would probably be to do a user-specific calculation every time they look at their contacts list, with a time limit between re-calculations. It would ensure that we don’t calculate contact scores for users who don’t need them… but we didn’t do that. Why?

We found that contact scores were valuable for features other than sorting the contacts list. Some features actively required contact scores in the background. For that reason, we needed to find a solution which would efficiently calculate contact scores for all users.

The Middle Ground

Ultimately, we decided to integrate contact score calculation into an existing daily job that already processes all contacts in our database. This allowed us to offload much of the calculation work to the database while adding only a few more calculations to an already existing and tested job. Now, incremental updates are done upon post scoring, allowing for immediate contact ranking throughout the day. This can be done because the daily job recalculates contact scores from scratch, therefore ignoring any scored posts which may have been removed due to age.

Integration into Amplify’s Front-end

After integrating support for contact scoring in the backend, updating our app to support contact ranking simply consisted of:
  1. Updating the retrieval of contact profiles to include their contact score
  2. Sorting the contacts list by score and name, rather than by name alone.
And we’ve successfully implemented contact ranking in Amplify!

Conclusion

The integration of contact ranking in Hootsuite Amplify demonstrates just one business use case of machine learning. While purchase intent scoring was originally implemented solely for this use case, it has already proved useful in other features (e.g., alerting sellers about definite buyers with push notifications). The value of machine learning here isn't a flashy new feature, but rather a subtle change that provides greater value to Amplify's users. Leveraging cutting-edge technology in even subtle features like this hints at the potential for machine learning to drive intelligence in many industries, now and in the future.

About the Author

Daniel Zhang is a High School Co-op on the Amplify team at Hootsuite. He will be studying Computer Science at the University of Waterloo in September, and in his spare time he likes to paint and work on projects. Connect with him on LinkedIn.

Overview:

When I arrived at Hootsuite as a Summer High School Technical Intern, I was tasked with creating sample apps for the Hootsuite App Directory. These apps should be easy for developers to host, quick to modify to suit their purposes, and should use a minimal number of external libraries so the source code is easy to understand for as many people as possible.

The GitHub repository for the sample apps I’ve mentioned in this blog post can be seen here.

Goals

  • Reduce the amount of time it takes for a developer to get a basic app up and running
  • Find flaws in the documentation and the Hootsuite SDK
  • Provide code, which can be easier for developers to digest than documentation
  • Allow developers to test out the Hootsuite SDK
  • Provide a reference implementation
  • Give developers an understanding of what Hootsuite apps are
  • Address common pitfalls and “gotchas”

To jQuery or not to jQuery

I knew I didn't want to use a framework like React or Angular, because we wanted the sample app code to be readable by as many JavaScript developers as possible. I still had to manipulate the DOM with JavaScript, though, and I had to decide whether jQuery was the right choice. Because jQuery is an external library and not part of the JavaScript language, I decided it wasn't, and that I would use the groundbreaking VanillaJS instead. This meant I had to write a few simple helper functions for common operations, like finding a single element by class name, but in my opinion this was worth it because it didn't add unnecessary bloat to this barebones sample app.

Here's an example of a few of the functions I created instead of using jQuery:
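A couple of small DOM helpers like these cover most of what jQuery would otherwise be used for (the function names here are illustrative, not necessarily the ones in the repo):

```javascript
// Find the first element with a given class name, similar to $('.foo').
function getFirstByClassName(className) {
  return document.getElementsByClassName(className)[0];
}

// Attach a click handler, similar to $(element).click(handler).
function onClick(element, handler) {
  element.addEventListener('click', handler);
}
```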

Choosing a backend

The backend for this sample app is very simple: all it does is serve static content and accept POST requests so that window.postMessage can communicate with the Hootsuite dashboard. I started off with Python/Flask but switched to Node/Express so that all of the code could be in JavaScript and the sample app would be easier to host on Heroku. In the same spirit as avoiding jQuery on the frontend, the backend doesn't use a database or anything else that would over-complicate the sample app.
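A minimal sketch of that kind of server – the route and directory names are hypothetical:

```javascript
const express = require('express');

const app = express();
app.use(express.static('public'));   // serve the static front-end assets
app.use(express.json());             // parse JSON bodies on POST requests

// Endpoint the dashboard-facing code can POST to.
app.post('/message', (req, res) => {
  res.sendStatus(200);
});

// Heroku provides the port through an environment variable.
app.listen(process.env.PORT || 3000);
```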

Easy hosting

One of the challenges with the postMessage API is that your endpoint must be able to accept POST requests, which makes the sample app more difficult to host than a set of static files. This is where Heroku is really useful: the free tier is more than enough to host this kind of app, and it handles HTTPS for you as well, which is required for integrations with Hootsuite. Heroku is well integrated with Node/Express and works seamlessly. Hosting with Heroku helps achieve one of the primary goals of writing sample apps: reducing the time it takes for a developer to get an app up and running.

Production Readiness

One challenge with creating something so public-facing is that the code needs to be clean and easily readable, with comments explaining what is going on throughout. The best process for achieving this is code review. Before any change goes out to the public it is reviewed by multiple people at Hootsuite, making sure that all the code is easily understood and follows best practices. To save reviewers time, it is a good idea to run a linter, such as JSHint for JavaScript, before you submit the code for review.

Another challenge with open source software is making sure that secrets, API keys, and other sensitive information aren't checked into version control. I dealt with this by explaining in the README how to put user-specific credentials into a configuration file and/or into the sample app, and by logging errors when an API key or secret hasn't been provided.

Conclusion

There were many benefits to writing a sample app for the Hootsuite SDK. One of the first was simply having fresh eyes on the SDK: there were many small issues and inconsistencies in both the documentation and the SDK that I was able to point out because I had a fresh perspective. Once these issues are resolved, the Hootsuite SDK will be easier to use for all developers. Another key benefit of writing sample apps is providing developers with a base to work from. Sample apps provide a lot of the boilerplate code so that developers can get to writing the valuable parts of their apps much faster.

About The Author

Vadym Martsynovskyy is a high school summer student on the Developer Products team. He will be studying Computer Science at the University of Toronto starting in September 2017. In his spare time he enjoys programming, watching Game of Thrones fan conspiracy theories, playing video games, and skiing.

Hootsuite is a big advocate of continuous integration (CI) and continuous delivery (CD). These practices allow developers to ship value to the customer quickly, reliably, and regularly. Every service – from our main dashboard, to dozens of customer-facing microservices, to internal tools – is built and shipped using an automated pipeline. This encourages a culture of agility and taking calculated risks.

My team's mission was to improve developer productivity and satisfaction and to enable developers to deliver stable and reliable software to our customers as quickly as possible. With CI and CD being such a big part of developers' day-to-day workflow, we decided we needed a tool to help us measure and identify the pain points and delays developers have to deal with. Developers should not spend their time debugging tests that would auto-resolve on a re-run, or be blocked for an extended period while waiting for their code to deploy.

Jenkins Metrics Phaser

Jenkins Metrics Phaser is our internal tool for collecting all metrics related to our build and deploy pipelines. Its ultimate goal is to enhance developer productivity and satisfaction. It aims to do so by tracking pipeline performance, flagging pain points, and collecting feedback from developers. It also gives us visibility into all our pipelines across the variety of microservices and tools we have.

Implementation

Through years of Jenkins usage we have ended up with a very diverse Jenkins setup. Our legacy dashboard Jenkins is self-hosted on our local servers. We have cloud-hosted Jenkins instances for a couple of our teams. Finally, all of our newer instances are hosted in Mesos clusters, with each instance belonging to its designated team. We needed Phaser to aggregate the data from all of these instances into one platform.

Getting information from Jenkins

We use Amazon Simple Notification Service (SNS) and Amazon Simple Queue Service (SQS) to reliably deliver data from Jenkins. All of our Jenkins instances are pre-configured with a custom-made Jenkins plugin that sends data to our SNS topic after the completion of each job. The SNS topic immediately pushes these messages to the SQS queue to allow for real-time processing.

This setup has proven reliable even when things break. One time, a third-party dependency used by Phaser introduced a breaking change downstream and caused our service to be down for a few hours. As we tackled the problem and worked on a fix, we weren't worried about data loss: we knew the queue would simply grow larger while the messages remained there. And indeed, once we brought Phaser back up it immediately picked up from where it left off and fully recovered within minutes, without any data loss.

Saving the information in our database

Each batch of SNS messages gets distributed between goroutines for asynchronous processing (a sketch of this fan-out follows the list below). Every message gets stripped down to its essential information and passed along to our main controllers. The controllers fall under two main categories, each corresponding to its own database tables:

  1. Build and Deploy: responsible for storing information about all pipeline jobs. Each table row represents a single Jenkins job, with information such as result, time, duration, etc. One key detail is that each job also contains a relational link to its triggering (parent) job. This allows us to use the data to visualize an entire pipeline workflow.
  2. Test: responsible for describing the performance of unit and integration tests of all pipelines. After identifying that a job is in fact a test job, Phaser posts to the Jenkins API to get test results in JUnit XML format and parses them down to individual test units. We convert all test results, regardless of the language they were written in, to JUnit XML format.
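A sketch of the fan-out between goroutines, with hypothetical message and processing types:

```go
package phaser

import "sync"

// Message is a stripped-down SNS/SQS payload (fields are hypothetical).
type Message struct {
	Body string
}

// process hands one message to the build/deploy or test controller.
func process(msg Message) {
	// ... parse the payload and write to the appropriate tables ...
}

// handleBatch processes a batch of messages concurrently, one goroutine per message.
func handleBatch(batch []Message) {
	var wg sync.WaitGroup
	for _, msg := range batch {
		wg.Add(1)
		go func(m Message) {
			defer wg.Done()
			process(m)
		}(msg)
	}
	wg.Wait() // only acknowledge the batch once every message has been handled
}
```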

Notifying users via Slack and classifying pipeline failures

Upon receiving a failed job on a master branch build, Phaser sends the failed job information to our Slack bot through a completely non-blocking Go channel. The bot uses the git commit information to get the developer's email, which is correlated with their Slack user. Phaser then sends an interactive Slack message with a summary of the error, a link to the job, and the option to classify the error. We use this data to identify common errors and specific pain points that we can tackle to make the overall experience better for our developers.
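A sketch of that non-blocking hand-off to the Slack bot, assuming a hypothetical FailedJob type:

```go
package phaser

// FailedJob carries the information the Slack bot needs (fields are hypothetical).
type FailedJob struct {
	JobURL string
	Commit string
}

// A buffered channel lets the message-processing path hand work to the bot without waiting.
var slackQueue = make(chan FailedJob, 100)

func notifySlack(job FailedJob) {
	select {
	case slackQueue <- job:
		// queued; a bot goroutine will pick it up and post the interactive message
	default:
		// queue full: drop (or log) instead of blocking the pipeline-processing path
	}
}
```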

Visualizing the data

We developed an internal API and UI platform, for use by both our team and developers, which showcases metrics for all pipelines. Each individual pipeline has its own sub-platform with a variety of metrics such as build times, longest tests, most failing tests, merge freezes, job stability, etc.

Conclusion

Our focus with Phaser was to build a unified platform for monitoring all of our pipelines with the ultimate goal of finding ways to improve developers’ experience. Both Go and AWS allowed us to develop a concurrent, fault-tolerant system. Slack allowed for easy and convenient delivery of important information directly to developers, and the ability for the developers to interact back.

About the Author

Elad Michaeli is a Software Developer Co-op at Hootsuite. Elad studies Computer Science at the University of British Columbia. Connect with him on LinkedIn.
