Posts from August 2017

What is Kubernetes?

Kubernetes is an open-source container orchestration platform that works across clusters of machines.

In Kubernetes, containers belonging to each application are grouped into work units called pods which are scheduled on specific worker nodes in the cluster on a resource-availability basis.

This is analogous to how an operating system’s CPU scheduler decides which processes receive CPU cycles at any given moment. Other OS process scheduling concepts, like affinity and priority, also have Kubernetes equivalents. If a pod needs more resources, it can be scaled vertically by changing the resource limits in its manifest.

Situations do arise when it is necessary to have multiple pods running the same container, for reasons such as load balancing or high availability; this is known as horizontal scaling. Pods are horizontally scaled and managed using deployments. Pods running the same application are grouped together into services, which provide a single point of access for the pods they represent; a service’s IP remains the same even when its backend pods are scaled or rescheduled. Services employ a selector which identifies backend pods based on their labels, arbitrary tags that can be used to group pods together.

When a service is deployed in Kubernetes it is easily accessible from other services in the cluster via kube-dns. But what if you want to host a service meant to be accessed from outside the cluster?

The Options

There are three ways to expose an external-facing service in Kubernetes, differing in their ServiceType specification: as a NodePort, a LoadBalancer, or a ClusterIP with Ingress.

NodePort: Each service is accessible on its own port in the form <NodeIP>:<Port>.
LoadBalancer: Each service provisions its own LoadBalancer from the cloud provider.
ClusterIP with Ingress: All services share a LoadBalancer and requests are proxied to each service.
The method most commonly used in production environments is ClusterIP with Ingress, due to its high scalability, accessibility, and low infrastructure usage; reasons which this post will elaborate upon! When services are exposed in this way, only one load balancer and one wildcard DNS record are required, and clients do not have to keep track of a service’s port number. All services use a common point of entry, which simplifies development, operations, and security efforts. Routing rules and provisioning are done within the manifests of Kubernetes-native objects, allowing for easy configurability.

ClusterIP With Ingress

In this example, a request to xyz.bar.com is directed to xyz-container and a request to foo.bar.com is directed to foo-container. This type of proxying is known as name-based virtual hosting. The ingress-controller pod runs a container which handles the proxying – most commonly this is implemented using nginx, an open-source reverse proxy and load balancer (among other cool features!), although other implementations such as HAProxy and Traefik exist. Proxy rules are defined in the nginx configuration file /etc/nginx/nginx.conf and kept up to date by a controller binary that watches the Kubernetes API for changes to the list of external-facing services, known as ingresses; hence the name ingress controller. When an ingress is added, removed, or changed, the controller binary rewrites nginx.conf according to a template and signals nginx to reload the configuration. Nginx then proxies requests according to its configuration, looking at the Host header to determine the correct backend.

Each exposed service requires the following Kubernetes objects:

Deployment: Specifies the number of pods and the containers that run on those pods.
Ingress: Defines the hostname rules and backends used by the controller for the service.
Service: Lists the destination pods and ports according to the selector in the Endpoints API.

Here’s what the Kubernetes manifest describing these objects for the xyz application might look like in YAML:

The ingress controller requires its own deployment and service just like any other application in Kubernetes. The controller’s service provisions the load balancer from the cloud provider with optional provider-specific configuration. Additionally, the controller can use a ConfigMap, a Kubernetes object for storing configuration key-value pairs, for global customization and consume per-service annotations, which are key-value metadata attached to Kubernetes objects, specified in each ingress. Examples of these configuration options include the use of PROXY protocol, SSL termination, use of multiple ingress controllers in parallel, and restriction of the load balancer’s source IP range.

Let’s return our attention to the two other methods of exposing external-facing services. Firstly, let’s revisit ServiceType LoadBalancer:

LoadBalancer

From the client’s perspective, the behaviour is the same. A request to xyz.bar.com still reaches xyz-container and foo.bar.com still hits foo-container. However, note the duplicated infrastructure: an extra load balancer and CNAME record are needed for each service. If there were a thousand unique services, there would have to be an equivalent number of load balancers and DNS records, all needing to be maintained; what happens if a change needs to be made to every load balancer or record, or perhaps only a specific subset of them? Updating DNS records can also prove a hassle with this configuration, as changes will not be instant due to caching, as opposed to an nginx configuration reload, which is a near-zero-downtime operation. This is a network that simply does not scale, and it wastes resources if each service does not receive enough requests to justify having its own load balancer.

For completeness, here is the equivalent ServiceType NodePort network diagram:

NodePort

Here, client request behaviour is radically different. To access xyz-container, a request to node.bar.com:30000 must be made, and for foo-container, the request is made to node.bar.com:32767. This port-based virtual hosting is extremely unfriendly to the client: port numbers must be tracked and kept up to date in case they change, and they add overhead on the node; while there are thousands of possible ports, their finite number is still a limitation to take into account. These port numbers are auto-assigned by Kubernetes by default; it is possible to specify the service’s external port, but the responsibility for avoiding collisions falls on the developer. While it is possible to solve some of these issues with SRV records or service discovery, those solutions add a layer of complexity which negates the simplicity that is this method’s main advantage.

The Choice

In comparison, ClusterIP with Ingress provides a powerful, transparent, and intelligent way to allow Kubernetes services to be accessed from outside the cluster. It allows services to assert control over how they expose themselves by moving per-service routing rules from cloud provider infrastructure like DNS records and load balancers to Kubernetes object manifests inside the cluster. In most cases, it is the preferred solution to expose an external-facing service with advantages in scalability, accessibility, and infrastructure over NodePort or LoadBalancer. Hopefully you’ve come away from this post with a general overview of the three methods of exposing external-facing services in Kubernetes!

About The Author

Winfield Chen is a high school co-op on the Operations team, recently graduated from Centennial Secondary School. He is attending Simon Fraser University’s School of Computing Science in September. 

Over the past two months, I’ve had the chance to work on Hootsuite’s Amplify team to implement a system for detecting buying intent in tweets, in order to help Amplify users track prospective customers. While traditionally Amplify has relied on user-defined keyword matching to filter tweets, that user would still have to sift through these potential buying signals in order to find leads. By integrating a system that intelligently scores tweets and ranks contacts, we make Amplify more effective at automating the social selling process. This post details the decisions and challenges I’ve come across in implementing these changes.

Initial Considerations

One of the defining factors of any scoring system is how it chooses to interpret its data. In our case, this was deciding on how to evaluate the stages of the buyer’s lifecycle. Kulkarni, Lodha, & Yeh (2012) described three main discrete steps in online buying behavior, where a customer:
  1. Declares intention to buy (express intent, or EI)
  2. Searches and evaluates options (purchase intent, or PI)
  3. Makes post-purchase statements (post-evaluation, or PE)
Following this system, our focus was to accurately classify and score tweets that fell in those categories.

We decided to evaluate tweets which fell into those categories differently. For example, a post such as “Should I get a Mac or a PC?” expresses a much higher intent to buy than a post where somebody expresses their thoughts on a product they just bought. There was also the problem of ambiguity – for example, in the case that somebody states “I am going to buy X type of product”, it would be difficult to know for sure whether they were simply expressing an intent to search for X type of product, or if they were literally steps away from purchasing that product. For these reasons, we decided on ‘base scores’ for posts such that PI > EI > PE.

Choosing a Service

We compared each service on four capabilities: sentiment, intensity, emotion, and intent.

IBM Watson: sentiment via the Natural Language Understanding service; intensity implied within the Tone Analyzer; emotion via the Tone Analyzer or the Natural Language Understanding service; intent via the Natural Language Classifier.
Microsoft: sentiment via the Text Analytics API (in Preview as of June 2017); intent via the Language Understanding Intelligent Service (in Preview as of June 2017).
Converseon: sentiment, intensity, and emotion via ConveyAI; intent support unknown (waiting for their response).

Another initial consideration was which machine learning service to use. While looking at available services, we searched primarily for services which provided not only text classification services, but also insight into sentiment and intensity. We decided on using IBM’s Watson Developer Cloud for their large ecosystem of services.

Implementation

As Hootsuite continues to transition from a monolithic architecture to microservices, our scoring system, the Intelligent Contact Ranking Engine (ICRE), was implemented as a Flask web service within Kubernetes. This provides a layer of abstraction between our existing Ruby back-end and the handling of asynchronous requests and scoring of posts done by the ICRE.

While the ICRE acts as an adapter between our back-end and IBM’s Natural Language Classifier (NLC), it also handles some functionality of its own, as well as technical hurdles that we encountered along the way. Here are a few:

Batch Requests

One challenge we came across was a lack of support for processing batch requests. Potential buying signals (tweets) are collected in batches by a job in Amplify’s back-end. This would be fine, except it takes a total of, on average, 0.8 seconds for a request to be sent to Watson’s NLC, and for a response to be returned. Given the nearly 2.6 million keyword-matched tweets stored in our database, it’s clear that sending these tweets one by one would be a major bottleneck. The ICRE optimizes this process by using a thread pool. Though parallelism in Python is limited by the Global Interpreter Lock, most of the processing is done on IBM’s side, so any inefficiency is minimized.

Re-training & Scoring

Another minor challenge came with training our ML model. An interesting aspect of Watson’s NLC is that once a model (termed ‘classifier’) is trained, it cannot be retrained. This means that if we ever wanted to retrain our model, we would have to initialize the training process on IBM’s side, wait for that classifier to finish training, and then switch the classifier_id in our code to use that new classifier. ICRE reduces this complexity for developers in two ways:
  1. Allowing devs to call a training event with a simple command when running the ICRE
  2. Automatically detecting the latest available classifier each time it’s called
This way, a developer can simply call a training event, have a cup of coffee, check back later, and delete the old classifier once they confirm that the newest model has completed training. Even if the newest model completed training 30 minutes before the developer checked back, ICRE would have already started using it behind the scenes.

All of these implementation details allow the ICRE to reliably calculate signal scores given the information returned by the Natural Language Classifier.

Training & Methodology

So how did we train the model? As we didn’t have any previously available annotated dataset, I had to create the dataset myself. In essence, the idea was to collect data for as many feasible cases as possible. I collected training data by manually classifying information from:
  1. Keyword-matched tweets from our production databases
  2. Tweets grabbed through keyword searches
  3. Product reviews online
Kulkarni et al. (2012) found that 3.4% of all tweets they collected were “related to consumer buying behavior”. Therefore, besides classifying our data between the stages of the buyer’s lifecycle, we had to be careful to make sure that we had a large sample of data to classify as “irrelevant”.

I found that there were a few patterns to tweets which fell in the same category – here are just a few examples of soft rules/guidelines I outlined while classifying training data:

Express Intent

E.g., “I’m looking to buy a new 6-10 seater dining table. Any recommendations?”

Express Intent is simply the declaration of the desire to purchase. This often includes:

  • Keywords such as “want”, “wanna”, “desire”, “wish”, etc.
  • Expression of anticipation
Purchase Intent

E.g., “I’ve decided my new PC is going to be #Ryzen based, unless someone can convince me to buy #Intel?”

Purchase Intent includes both the search and evaluation of options. This often includes:

  • Asking for details on how to obtain something
  • Asking about the ‘goodness’ of a product/service
  • Asking for opinions of one purchase option versus another
Post-Evaluation

E.g., “Solid purchase, no regrets.”

Post-Evaluation can be thought of as a review or statement after having purchased a service/product.

On Consistency

We decided that we would create a more effectively trained model by manually classifying data against a set of soft rules rather than crowdsourcing for labelled data. I found that in cases where context or images were required to fully understand a tweet, data classification could be ambiguous, especially when spread across a wide variety of people with different understandings of what “purchase intent” may mean. Consistency is key to training a good model. With an initial training set of over 1700 classified strings, we found that we were already getting good results.

Integration

While the original purpose of integrating intent analysis into Amplify was to score posts so that we could intelligently rank contacts on the front-end, we came across new possible use cases while implementing support in Amplify’s back-end. This led to some important decisions about how we should handle contact scoring.

Syncing Post Scoring and Contact Scoring

In an ideal situation, we would update contact scores whenever post scores are updated. However, this introduces unnecessarily high server load. Recalculating contact scores every time post scores are calculated (a job run on a fairly small interval) would mean running many PostgreSQL queries involving both a join and a summation at that same interval. This is computationally inefficient.

Incrementing contact scores each time a post score is updated might have been ideal, but old post removal is automatically handled by our database – therefore making it difficult to track when we would need to decrement contact scores.

The most efficient way, then, of going about contact scoring would probably be to do a user-specific calculation every time they look at their contacts list, with a time limit between re-calculations. It would ensure that we don’t calculate contact scores for users who don’t need them… but we didn’t do that. Why?

We found that contact scores were valuable for features other than sorting the contacts list. Some features actively required contact scores in the background. For that reason, we needed to find a solution which would efficiently calculate contact scores for all users.

The Middle Ground

Ultimately, we decided to integrate contact score calculation into an existing daily job that already processes all contacts in our database. This allowed us to offload much of the calculation work to the database while adding only a few more calculations to an already existing and tested job. Now, incremental updates are done upon post scoring, allowing for immediate contact ranking throughout the day. This can be done because the daily job recalculates contact scores from scratch, therefore ignoring any scored posts which may have been removed due to age.

Integration into Amplify’s Front-end

After integrating support for contact scoring in the backend, updating our app to support contact ranking simply consisted of:
  1. Updating the retrieval of contact profiles to include their contact score
  2. Sorting the contacts list by score and name, rather than by name alone.
And we’ve successfully implemented contact ranking in Amplify!

Conclusion

The integration of contact ranking in Hootsuite Amplify demonstrates just one use case of machine learning for businesses. While purchase intent scoring was originally implemented solely for this use case, it has already proved useful in other features (e.g., alerting sellers about definite buyers with push notifications). The value of machine learning, in this case, isn’t a flashy new feature, but rather a subtle change which provides greater value to Amplify’s users. In this way, leveraging cutting edge technology in even subtle features serves as an indication of the potential for machine learning to drive intelligence in so many industries in the present, and in the future.

 

About the Author

Daniel Zhang is a High School Co-op on the Amplify team at Hootsuite. He will be studying Computer Science at the University of Waterloo in September, and in his spare time he likes to paint and work on projects. Connect with him on LinkedIn.

Overview:

When I arrived at Hootsuite as a Summer High School Technical Intern I was tasked with creating sample apps for the Hootsuite App Directory. These apps should be easy for developers to host, quickly modify to suit their purposes, and use a minimal amount of external libraries in order to make the source code easy to understand for as many people as possible.

The GitHub repository for the sample apps I’ve mentioned in this blog post can be seen here.

Goals

  • Reduce the amount of time it takes for a developer to get a basic app up and running
  • Find flaws in the documentation and the Hootsuite SDK
  • Provide code, which can be easier for developers to digest than documentation
  • Allow developers to test out the Hootsuite SDK
  • Provide a reference implementation
  • Give developers an understanding of what Hootsuite apps are
  • Address common pitfalls and “gotchas”

To jQuery or not to jQuery

I knew that I didn’t want to use any frameworks like React or Angular because we wanted the Sample App code to be readable by as many JavaScript developers as possible. I would still have to use JavaScript to manipulate the DOM, though, and I had to decide whether jQuery was the right choice here. Because jQuery is an external library and not a part of the JavaScript language, I decided that it wasn’t the right choice and that I would use the groundbreaking VanillaJS instead. This meant that I had to make a few simple helper functions for some common operations, like finding a single element by class name, but this was worth it in my opinion because it didn’t add unnecessary bloat to this barebones sample app.

Here’s an example of a few of the functions I created instead of using jQuery:
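The exact snippet isn’t reproduced here, but a minimal sketch of helpers along those lines might look like this (names are illustrative):

// Return the first element with the given class name, or null if there isn't one.
function findByClass(className) {
  return document.getElementsByClassName(className)[0] || null;
}

// Return all elements matching a CSS selector as a real array instead of a NodeList.
function findAll(selector) {
  return Array.prototype.slice.call(document.querySelectorAll(selector));
}

// Attach an event listener, e.g. on(sendButton, 'click', handleSend).
function on(element, eventName, handler) {
  element.addEventListener(eventName, handler);
}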

Choosing a backend

The backend for this Sample App is very simple because all it does is serve static content and accept POST requests so that window.postMessage can communicate with the Hootsuite dashboard. I initially started off with Python/Flask but switched to Node/Express so that all of the code could be in JavaScript and so that the Sample App would be easier to host on Heroku. Similar to the motivation for not using jQuery on the frontend, the backend doesn’t use any databases or anything else that would over-complicate the sample app.
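A rough sketch of a server along those lines (the routes and file names here are illustrative, not the actual sample app):

var express = require('express');
var bodyParser = require('body-parser');

var app = express();

// Serve the static frontend (HTML, CSS, JS) from the public directory.
app.use(express.static('public'));

// Parse JSON bodies so POSTed messages can be read.
app.use(bodyParser.json());

// Accept the POST requests needed for the postMessage integration.
app.post('/message', function (req, res) {
  console.log('Received:', req.body);
  res.sendStatus(200);
});

app.listen(process.env.PORT || 3000);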

Easy hosting

One of the challenges with the postMessage API is that your endpoint must be able to accept POST requests. This means that the sample app is more difficult to host than just setting up static files. This is where Heroku is really useful: the free tier is more than enough to host this kind of app, and it deals with HTTPS for you as well, which is required for integrations with Hootsuite. Heroku is very well integrated with NodeJS/Express and works seamlessly. Hosting with Heroku helps achieve one of the primary goals of writing sample apps, which is to reduce the time it takes for a developer to get an app up and running.

Production Readiness

One challenge with creating something so public-facing is that the code needs to be clean and easily readable, with comments explaining what is going on throughout. The best process for achieving this is code review. Before any change goes out to the public it is reviewed by multiple people at Hootsuite, making sure that all the code is easily understood and follows best practices. To save reviewers time, it is a good idea to run a linter, such as JSHint for JavaScript, before you submit the code for review.

Another challenge with Open Source Software is that you must have a way to make sure that secrets, API keys, and other sensitive information aren’t checked into your version control. I dealt with this by explaining in the README how to supply user-specific credentials through a configuration file and/or input them into the sample app, and by logging errors when users haven’t provided an API key or secret.
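A hypothetical sketch of that approach (the config file and key names are made up for illustration):

// config.js is listed in .gitignore; config.example.js documents the expected shape.
var config = {};
try {
  config = require('./config');
} catch (err) {
  // No config file present; fall back to environment variables.
  config.apiKey = process.env.API_KEY;
  config.apiSecret = process.env.API_SECRET;
}

if (!config.apiKey || !config.apiSecret) {
  console.error('Missing API key or secret: copy config.example.js to config.js and fill in your credentials.');
}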

Conclusion

There were many benefits to writing a sample app for the Hootsuite SDK. One of the first benefits was simply having fresh eyes on the SDK. There were many small issues and inconsistencies in both the documentation and the SDK that I was able to point out because I had a fresh perspective. Once these issues are resolved, the Hootsuite SDK will be easier to use for all developers. Another key benefit of writing sample apps is providing developers with a base to work from. Sample apps provide a lot of the boilerplate code so that developers can get to writing the valuable parts of their apps much faster.

About The Author

Vadym Martsynovskyy is a high school summer student on the Developer Products team. He will be studying Computer Science at the University of Toronto starting in September 2017. In his spare time he enjoys programming, watching Game of Thrones fan conspiracy theories, playing video games, and skiing.

Hootsuite is a big advocate of continuous integration (CI) and continuous delivery (CD). These tools allow developers to ship value to the customer quickly, reliably, and regularly. Every service, from our main dashboard to dozens of customer-facing microservices to internal tools, is built and shipped using an automated pipeline. This encourages a culture of agility and taking calculated risks.

My team’s mission was to improve developer productivity and satisfaction and to enable developers to deliver stable and reliable software to our customers as quickly as possible. With CI and CD being such a big part of developers’ day-to-day workflow, we decided we needed a tool to help us measure and identify any pain points and delays developers have to deal with. Developers should not spend their time debugging tests that would auto-resolve on a re-run, or be blocked for an extended period of time while waiting for their code to deploy.

Jenkins Metrics Phaser

Jenkins Metrics Phaser is our internal tool for collecting all metrics related to our build and deploy pipelines. Its ultimate goal is to enhance developer productivity and satisfaction. It aims to do so by tracking pipeline performance, flagging pain points, and collecting feedback from developers. It also allows us to gain visibility into all our pipelines across the variety of microservices and tools we have.

Implementation

Through our years of using Jenkins we have ended up with a very diverse setup. Our legacy dashboard Jenkins is self-hosted on our local servers. We have cloud-hosted Jenkins instances for a couple of our teams. Finally, all of our newer instances are hosted in Mesos clusters, with each instance belonging to its designated team. We needed Phaser to aggregate all of the data from these instances into one platform.

Getting information from Jenkins

We use Amazon Simple Notification Service (SNS) and Amazon Simple Queue Service (SQS) to reliably deliver data from Jenkins. All of our Jenkins instances are pre-configured with a custom-made Jenkins plugin that sends data to our SNS topic after the completion of each job. The SNS topic instantly pushes these messages to the SQS queue to allow for real-time processing.

This setup has proven reliable even when things break. One time, a third-party dependency used by Phaser introduced a breaking change downstream and caused our service to be down for a few hours. As we tackled the problem and worked on a fix, we weren’t worried about data loss: we knew the queue would just grow larger, but the messages would remain there. And indeed, once we brought Phaser back up it immediately picked up from where it left off and fully recovered within minutes, without any data loss.

Saving the information in our database

Each batch of messages from the queue gets distributed between goroutines for asynchronous processing. Every message gets stripped down to its essential information and passed along to our main controllers. The controllers fall under two main categories, each corresponding to its designated database tables:

  1. Build and Deploy: responsible for storing information about all pipeline jobs. Each table row represents a single Jenkins job, with information such as result, time, duration, etc. One key detail is that each job also contains a relational link to its triggering (parent) job. This allows us to use the data to visualize an entire pipeline workflow.
  2. Test: responsible for describing the performance of unit and integration tests of all pipelines. After identifying that a job is in fact a test job, Phaser calls the Jenkins API to get test results in JUnit XML format and parses them down to individual test units. We convert all test results, regardless of the language they were written in, to JUnit XML format.

Notifying users via Slack and classifying pipeline failures

Upon receiving a failed job from a master branch build, Phaser sends the failed job information to our Slack bot through a completely non-blocking Go channel. The bot uses the git commit information to get the developer’s email, which is correlated with their Slack user. Phaser then sends an interactive Slack message with a summary of the error, a link to the job, and the option to classify the error. We use this data to identify common errors and specific pain points which we can tackle in order to make the overall experience better for our developers.

Visualizing the data

We developed an internal API and UI platform, for the use of both our team and developers, which showcases metrics for all pipelines. Each individual pipeline has its own sub-platform with a variety of metrics such as build times, longest tests, most-failing tests, merge freezes, job stability, etc.

Conclusion

Our focus with Phaser was to build a unified platform for monitoring all of our pipelines with the ultimate goal of finding ways to improve developers’ experience. Both Go and AWS allowed us to develop a concurrent, fault-tolerant system. Slack allowed for easy and convenient delivery of important information directly to developers, and the ability for the developers to interact back.

About the Author

Elad Michaeli is a Software Developer Co-op at Hootsuite. Elad studies Computer Science at the University of British Columbia. Connect with him on LinkedIn.

Introduction:

Golang is a free and open-source programming language created at Google in 2007 and released as open source in 2009. It is a popular language choice for building microservices and we use it extensively here at Hootsuite along with Scala. What makes Golang such a powerful language is its efficiency and scalability. Go compiles to a native binary and doesn’t depend on a virtual machine (e.g., the JVM for Java). It’s built as an alternative to C and C++, so it’s fast and highly efficient but has garbage collection and memory safety features. My personal favourite features of Golang are goroutines and channels and how easy they are to use. A goroutine is a function running independently in the same address space as other goroutines. Goroutines are a bit like threads but much, much cheaper. Communication between goroutines is done using Go channels. With these two features, you can build powerful concurrent programs that scale.

Goroutines and Channels:

So how easy is it to write concurrent programs in Golang? You can spawn a new goroutine simply using the go keyword:
package main

import (
    "fmt"
    "time"
)

func myFunc(done chan string) {
    for i := 0; i < 10; i++ {
        time.Sleep(time.Millisecond * 500)
        fmt.Println(i, " myFunc")
    }
    fmt.Println("finished loop in myFunc")
    done <- "goroutine finished" // send the message into the channel
}

func main() {
    done := make(chan string) // make the "done" channel
    go myFunc(done)           // run myFunc on a goroutine
    for i := 0; i < 5; i++ {
        time.Sleep(time.Millisecond * 500)
        fmt.Println(i, " main")
    }

    msg := <-done // receive from the channel
    fmt.Println(msg)
}

In this simple program, myFunc() runs concurrently with main(), and the “done” channel is used for communication and synchronization between the two goroutines. Run this program and see what it prints (https://play.golang.org/)! You should expect to see the myFunc and main numbers interleaved as they are running concurrently.

The above example uses the channel as a blocking receive to wait for the myFunc() goroutine to finish. If that channel weren’t there, the program would exit right after the main() for loop finishes, and we would never finish the rest of the loop execution in myFunc() (try commenting out the last two lines of the code and running it). The presence of the receive from the channel (<- done) in main() prevents the program from exiting until a notification is received. When the for loop in myFunc() finishes, it passes the message “goroutine finished” into the “done” channel; that message is then received in main() and the program is allowed to exit.

That was a pretty simple example of goroutine and channel usage. I won’t go too much into syntax, but for those who are interested, https://gobyexample.com/goroutines provides great examples of the concepts.

Now that we have covered some of the basics, let’s see how we use goroutines in a real microservice here at Hootsuite.

Goroutines in Interaction History Service:

Interaction history is a feature that allows a Hootsuite user to view all their past interactions with another user on a social network. When we show the interaction history for Twitter, we display all the mentions, likes, quotes, and retweets that happened between the two users. So every time any of these events occurs, our service has to store it. Our service consumes roughly 1 million of these events a day, and we handle this heavy lifting with goroutines and channels.

At Hootsuite, we have an “Event Bus” built using Apache Kafka which is responsible for distributing real-time events. Using the Sarama Go client for Kafka, we subscribe to the Event Bus topics we are interested in, such as Twitter mentions, likes, and quotes, and receive these events in real time through channels.

Here is the simplified/modified version of the function we have for consuming events:

func consumeMessages(consumer *saramaCluster.Consumer) {
    for w := 0; w < 400; w++ {
        go func() {
            // infinite for loop
            for {
                select {
                case msg := <-consumer.Messages(): // consumer.Messages() returns a read channel where new messages will be passed through
                    saveMessageToDB(msg)
                case err := <-consumer.Errors(): // consumer.Errors() returns a read channel where error messages will be passed through
                    logError(err)
                }
            }
        }()
    }
}
Let’s take a look at what’s happening here. We constructed 400 goroutines, each running an infinite for loop that waits for messages to come back from the channels. Go’s “select” lets us wait on multiple channels at once. If we receive a message from the Messages channel, we save that message to our database, but if we receive a message from the Errors channel, we log the error.

What if we had spawned only 1 goroutine instead of 400 to achieve this task? The processing time for these messages isn’t very long, so if there were no other possible delays, we might not even notice a difference. However, in a real service we often experience network delays, and in our case, delays reading from and writing to the database. While those delays are happening, another goroutine can jump in and start processing another message. These context switches happen so fast that it almost seems like things are happening in parallel. With 400 goroutines each ready to process messages, we are able to handle the heavy traffic that comes over from the event bus.

The Power of Goroutines

So how were we able to construct 400 goroutines without suffering any consequences? It is actually pretty common for a single Go program to create thousands of goroutines. The key here is that goroutines are not threads; they are much lighter weight. In Go, when you block the current execution, you only block the current goroutine, not the thread. You can have thousands of goroutines running concurrently on a single thread and they will be able to context switch efficiently. Goroutines are lighter weight than OS threads because they use little memory and few resources: their initial stack size is small but can grow as needed. And if you ever want to run goroutines in parallel, Go provides the ability to add more logical processors via the GOMAXPROCS environment variable or runtime function.

 

About the Author

Colby Song is a Software Developer Co-op at Hootsuite on the Engagement Team. Colby studies Computer Science at UBC. Connect with him on LinkedIn.

Jon Slow sees that his favorite social media platform is missing a feature that would make it great: the ability to directly add pictures from his Google Drive account. He asks the company support if they’re going to add this feature, and gets a negative response, so Jon decides to do it himself. With his new programming boot camp certificate in hand, he gets to work, feverishly coding day and night. Without any official support from the platform, he has to get creative, hacking a solution together, spoofing HTTP calls, spending hours on Stack Overflow, until finally, it works. His third-party add-on, with no support from the platform creators, successfully connects to his Google Drive (sometimes). Suddenly, the platform creators push a minor, unrelated bug fix, and all of Jon’s work comes crashing down. His integration crashes immediately and permanently. Months of work, gone. If only there had been an API he could use.

What is an API?

An API (Application Programming Interface) is a well-documented set of functions that allow one program to interact with another. Essentially, an API is a contract between the API provider and the client program: the client knows that it can call a predefined function with certain types of parameters, and the provider promises to return a specifically formatted response for the client to use.

At Hootsuite, we allow third-party developers to create apps that integrate with our social media platform in a variety of ways. For some, we have created inbound APIs that third-party apps can call to take advantage of Hootsuite features. For others, we define an outbound API that an app must implement, and then Hootsuite will invoke predefined endpoints in that API. Recently, we created functionality to support a new type of app, media apps, which are used to find and attach images or GIFs from an external content source (like Giphy or Google Drive) to a social media post created in Hootsuite. As part of that process, we defined an outbound API for media apps to implement, and we decided to share our decision-making process for the API design here. Once it’s in wide usage, it’s very difficult to significantly change an API, so we tried to be very deliberate with our design process in an attempt to get it right the first time.

Goals for the API design

  • It has to work. The most important part! The functions in the API have to return data in a format that lets us at Hootsuite do what we need with it. For this specific project, that meant the app had to return a list of media, with each item containing information such as file type, height, width, and size, in a format readable by the Hootsuite dashboard.
  • It has to be easy for us to work with. We don’t want every app to return data in a different format, as that increases the amount of code we have to write to support it as well as the possibility of unexpected bugs. We expect every app to return data in almost exactly the same format, to minimize the work needed on our end to support different integrations.
  • It has to be easy for devs to work with! None of this process works if we can’t convince developers to create apps for our platform! Because we are requiring developers to conform to our API design, we want to make sure that it isn’t a pain to do so. Ideally, our API specifications are easy to understand and implement, so that the development process for 3rd-party devs is as smooth and enjoyable as possible.
  • It has to be RESTful. REST is a web API design standard in common usage across a ton of web applications. By making our API RESTful, it will be similar to other APIs that developers may have come across previously, and will help them understand how it’s meant to be used. RESTful design includes a number of best practices that make an API easier to use and maintain.

The API

With those goals in mind, after a few meetings, we produced this spec for our API (still in alpha). This is the API that we expect all apps with media library functionality to implement. The main component is the “media” endpoint, which Hootsuite will call in order to retrieve the media that the app provides. The documentation defines both the type of requests that Hootsuite will make, as well as the format of the responses it expects. It defines which fields in both the request and the response are optional, which are required, and which are sometimes optional and under what conditions. There are also some optional OAuth endpoints, for apps that require the user to login before they will provide media. More detail about the OAuth setup can be found here if you’re interested.
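As a purely illustrative sketch of the general shape (these field names are invented; the published spec defines the real ones), a response from an app’s media endpoint might look something like:

// Hypothetical example response from an app's media endpoint.
var exampleMediaResponse = {
  media: [
    {
      url: 'https://example.com/images/cat.gif',
      thumbnailUrl: 'https://example.com/thumbs/cat.gif',
      mimeType: 'image/gif',
      width: 480,
      height: 270,
      sizeBytes: 1048576
    }
  ],
  cursor: 'opaque-token-for-the-next-page' // only present when more results exist
};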

Problems

While we’re pleased with our final API, it’s still not perfect, and its flaws come from places where our goals conflicted with each other. One example is the “cursor”. When an app has more than one page’s worth of media that fits a specific query, it returns a cursor in addition to the rest of the payload. The cursor is a unique key that Hootsuite can send to the app to get the next page of results.

During our internal planning discussions, we thought about providing the app developer a guarantee that our software would return all the search parameters from the first page in addition to the cursor when asking for the second page. We eventually decided against it, as we thought it would add too much complexity in designing and implementing our internal services that interface with media apps.

As part of this project, we developed our own app that implemented the API as a proof of concept. During the development process, we realized that only sending a cursor, without the related search params, made it a lot harder for the app to generate said cursor. We had to encode all the search parameters in the cursor, send it to Hootsuite, and then decode the cursor that Hootsuite sends back when it asks for the next page of results. Not impossible, but a great deal more complicated than the alternative when search params are sent: “page2”.
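For illustration only (not the actual implementation), one way an app might pack the search state into an opaque cursor is to serialize and base64-encode it:

// Encode the search parameters plus the next page number into an opaque cursor.
function encodeCursor(searchParams, nextPage) {
  var state = Object.assign({}, searchParams, { page: nextPage });
  return Buffer.from(JSON.stringify(state)).toString('base64');
}

// Decode the cursor that comes back with the request for the next page.
function decodeCursor(cursor) {
  return JSON.parse(Buffer.from(cursor, 'base64').toString('utf8'));
}

// Contrast with the alternative where the search params are re-sent by the caller:
// the cursor could then simply be the next page number, e.g. "page2".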

This problem is typical of API design. Two of our major goals (“make it easy to use for external devs”, “make it easy to use for internal devs”) conflicted, and we had to make a decision about what was most important to the team. Maybe it was the wrong decision! We won’t really know until we show the API to more external developers and get some feedback. At that point, maybe we could change it.

Changing an API after release

Changing an API isn’t as easy as it might seem at first. As I said earlier, an API is a contract between two pieces of software, promising each other that they will both act a certain way. Changing the rules of that contract can cause a lot of problems. If the API implementation changes, every program that uses the API will probably have to change as well, and given that the API provider and consumer are usually developed by different teams or companies, it might be a long time before everything is back in sync. In the meantime, software will break and customers will be unhappy.

In order to avoid that, there are a few steps to take when updating your API. The most important is to make non-breaking changes. These are changes that will allow all apps currently using your API to continue using it in exactly the same way without malfunctioning. Generally, this restricts your changes to adding optional features: either new methods that API users can call, or optional parameters to existing methods that will add functionality.

Breaking changes, then, are basically anything else: removing functions, changing the required parameters that they accept, or the type or form of the data they return. If you really need to make a breaking change, then you’ll need to make a new version of the API, release it, try to convince developers to do the work to switch to the new version, and continue to support the old version for as long as possible (some might say forever) while developers are switching. The easiest way to do this is to version your endpoints (e.g. currently our media endpoint is called “v1/media”, and if we made breaking changes, we would create a new “v2/media” endpoint while leaving the old one alone). Even so, it’s a major headache, and the main reason why it’s important to be thoughtful with your initial design.
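Sketched with Express-style routes (purely illustrative; the handlers and payloads are made up):

var express = require('express');
var app = express();

// v1 is left untouched so existing integrations keep working.
app.get('/v1/media', function (req, res) {
  res.json({ media: [] });
});

// Breaking changes ship under a new version, supported alongside the old one.
app.get('/v2/media', function (req, res) {
  res.json({ items: [], cursor: null });
});

app.listen(process.env.PORT || 3000);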

Finale

Jon Slow has been ruined. His integration is broken beyond repair. Every time he tries to use it, his computer gives off a little puff of smoke in despair. His programming career is at death’s door. Suddenly, an update arrives: the platform creators have released an API for media integrations! Reinvigorated, Jon sits at his computer. An hour later, his Google Drive has successfully connected to the platform. He can quickly and easily attach images to his posts. It’s everything he dreamed of and more. Jon Slow’s life is complete.

About the Author

Jacob Brunette is a Software Developer Co-op with the Platform team at Hootsuite. Jacob studies computer science at the University of British Columbia. Connect with him on LinkedIn.

Background:

On the datalab team at Hootsuite, we help the rest of the company make data-driven decisions. A big part of this is the collection of data. We collect data from external vendors that we work with, such as Salesforce and Marketo. We also collect data on the Hootsuite product itself. We load all of this data into a single data warehouse that our data scientists, business analysts, and product teams can easily use.

The engineering side of the datalab has 3 main tasks:

  • Collecting data
  • Enriching the data
  • Storing data in different formats
In order to do all this, datalab develops specialized apps that do one thing very well. These apps get input from one or many internal and external data sources. These sources are a mix of internal APIs, databases, and data streams, as well as external APIs. This first step is known as the extract step. The apps then process the data. This could range from enriching one data source with data from other data sources to doing some calculations and enriching the data with the results. This second step is known as the transform step. The app finally loads the data into another data source. This last step is known as the load step. In data engineering parlance these apps are called ETLs: Extract, Transform, Load.

Problem:

We are dealing with large volumes of data and all the data operations are subject to a high standard of data quality. The result of one data operation is often an input to multiple other data operations. Therefore, one problem can easily cascade into decreasing the data quality of other parts of the system. There are a large number of stakeholders consuming this data every moment. These stakeholders use the insights for making critical decisions. Therefore, it is imperative that the data is always correct and complete.

These are some of the technical difficulties that the datalab had to solve to create a reliable data pipeline:

  • How can we quickly spot any anomalies in the system?
  • How can we easily troubleshoot and fix the problems?
  • Given that the output of some jobs are the input to others, how do we make sure that the jobs run in the correct order every time? Additionally, if one of the components in the pipeline fails, how do we prevent it from affecting other parts of the system?
  • How should we manage our infrastructure? Some apps run continuously and some run periodically. How should we schedule and deploy these apps on our servers? We want to make sure that all the apps have enough computing and memory resources when running. But we also want to make sure that our servers are not sitting there idly (cloud bills can be expensive!).

Technologies:

1- Docker:

Datalab packages all of its apps as Docker containers. In practice, you can think of a Docker container as a Linux machine that runs on your development machine or on your server. Your app will run inside of this container, which is completely isolated from the host machine it is running on.

Docker has a much lower overhead than a virtual machine, as it does not require running an entire kernel for each container. In addition, while you can put limits on the resources for each Docker container, the container only uses those resources if it needs to. The resources dedicated to a VM, on the other hand, are not usable by the rest of the system anymore.

This lightweight environment isolation enabled by Docker provides many advantages:

  • Environment setup: For an app to run properly it requires that the server has the right versions of all of its dependencies installed. Furthermore, the file system has to be set up as the app expects, and environment variables have to be set correctly. All of this configuration can lead to an app working perfectly fine on my computer, but not on yours, or on the server. Docker can help with these issues. A developer can write a dockerfile, which is very similar to a bash script. The dockerfile has instructions on what dependencies need to be installed, and sets up the file system and environment variables. When a developer wants to ship a product, s/he will ship a Docker image instead of a jar file or some other build artifact. This guarantees that the program will behave exactly the same irrespective of what machine it is running on. As an added bonus, Docker has developed tooling so you can run Docker containers on Windows and Macs. So you can develop on a Mac or a Windows machine, but rest assured that your program will work on a server running Linux.
  • Easy Deployment: By running the dockerfile the developer can create a Docker image. You can think of a Docker image as the hard drive of a Linux computer that is completely set up to run our app. The developer can then push this Docker image to an image repository. The servers that actually run the app will pull the image from the image repository and start running it. Depending on the size of the image, the pull can take a bit of time. However, one nice thing about Docker images is their layered architecture. This means that once the server pulls the image for the first time, subsequent pulls will only fetch the layers that have changed. If the developer makes a change to the main code, it will only affect the outermost layer of the image, so the server will only need to pull a very small amount of data to have the most up-to-date version. If an app needs an update, the server will pull the recent version and start it in parallel with the old running version. When the new version is up and running, the old container is destroyed. This allows for zero downtime of the server.
  • Running Multiple apps on the same server: Docker allows us to run multiple containers on the same server, helping us maximize the value that we get out of our resources. This is possible thanks to the low overhead of these containers. Moreover, if each app needs a different version of Python, or a different version of SQLAlchemy, it does not matter, as each container has its own independent environment.
So now that we have come to resource allocation, one might wonder: how do we make sure that each container has enough resources, and how do we manage all the servers required for running our apps?

2- ECS:

Companies that adopt Docker, and the microservice architecture that comes along with it, end up with many small, specialized containers. This calls for technologies to manage and orchestrate those containers. An array of technologies has emerged to make managing and deploying containers easier; some of the better-known names are Kubernetes, Amazon ECS, and Docker Swarm. Here at datalab we have picked Amazon ECS as our primary container orchestration solution.

Amazon ECS provides a cluster of servers (nodes) to the user and takes care of distributing the containers over the cluster. If servers are running out of memory or computing resources, the cluster will automatically create a new server.

We have some apps that are constantly running on this cluster. But there are also apps that only run periodically (once a day or once a week). In order to save on costs, we destroy the containers that have completed their job and only create them again when we want to run them. So you can imagine that the number and type of containers running on the cluster is very dynamic. ECS automatically decides which server to schedule new containers on. It will add a new server if more resources are required. Finally, it will phase out a server if there is not enough work for it.

In short ECS takes care of distributing containers over the cluster, and helps us pack the most containers onto the fewest number of servers possible. But how do we actually schedule a container to run?

3- Airflow:

Airflow is a tool developed by Airbnb that we use to help us with a few tasks. Firstly, we needed a way to schedule an app to run at a certain time of the day or week. Secondly, and more importantly, we needed a way to make sure that a job only runs when all the jobs it depends on have completed successfully. Many of our apps’ inputs are the outputs of other apps. So if app A’s input is the output of app B, we have to make sure that app A only runs if app B has run successfully. Airflow allows the developer to create DAGs (Directed Acyclic Graphs), where each node is a job. Airflow will only schedule a job if all of its parent nodes have run successfully. It can be configured to rerun a job in case it fails. If it still cannot run the job, it will send alerts to the team, so that the problem can be fixed as soon as possible and the pipeline can continue its operation where it left off. Airflow has a nice UI that shows all the dependencies and allows developers to rerun failed jobs.

So Airflow will schedule our periodic jobs and notify us if things go wrong. But how can we monitor the apps that run constantly?

4 and 5- Sumo Logic and Sensu:

As a developer, the most useful thing I use to debug my apps is the logs. However, accessing the logs on servers is usually hard. It is even harder when using a distributed system such as Amazon ECS. The dynamic nature of ECS means that an app could be running on a different server on any given day. In addition, if a container is destroyed for any reason, all of its logs are lost too.

To solve the complexity of capturing and storing logs on a distributed system, we have made the system even more distributed! Sumo Logic is a service that accepts logs from apps over the network and will store them on the cloud. The logs can easily be searched using the name of the app. They can be further narrowed down with additional filters. So if a developer needs access to the logs for a specific app, s/he can get them with only a few clicks.

This still means that in order to quickly identify a broken app, someone has to be constantly looking at the logs. That sounds super boring, so we have automated the process by using Sumo Logic’s API and another technology called Sensu. Sensu is an open source project that allows companies to check the health of their apps, servers, and more. Sensu regularly runs the defined checks and alerts the team if something is wrong. At a high level, Sensu has two components: Sensu server and Sensu client. The Sensu server regularly asks for updates from the clients. The clients run some tests and return a pass/fail message back to the server. The server can then notify the team if the check has failed.

One of the use cases of Sensu in datalab is monitoring the health of our apps. One thing that I found particularly interesting is that the system is designed in a way that no extra code needs to be added to the apps; this is done by monitoring an app’s logs. Here is how it all works: our Sensu client is always running on the ECS cluster. Let’s say that the Sensu server wants to check on the status of app X. It will send a request to the Sensu client asking for updates on the status of app X, along with a regular expression describing what a success message looks like. The Sensu client will then send a request to the Sumo Logic API asking for all the recent logs for app X. Next, the Sensu client will search through the logs to see if they include the success expression. The client sends a success message back to the server if it can find the success expression, and a failure message otherwise. The server will send an email to the team when a check fails or if the client is unresponsive, and the engineer on call can take measures to resolve the issue quickly.

That was a long post, so I’ll stop here. Hopefully you have gotten an idea of how Hootsuite manages its big data. To summarize, we use Docker containers for the ease of development, deployment, and the modularity it provides, ECS to orchestrate these containers, Airflow to manage scheduling and enforcing the order in which apps should run, and Sensu and Sumo Logic for monitoring and troubleshooting our apps.

About the Author

Ali is a co-op software developer on the datalab team. He studies computer science at UBC. Connect with him on LinkedIn.

Overview:

On the Plan and Create team we were using CommonJS as our module system until very recently, when we ran into a circular dependency issue with our React components. For one of our new features we had a component that needed to reference a modal, and that modal in turn needed to reference a new instance of the original component. This circular relationship caused a problem, and webpack could not resolve the dependencies correctly. Our solution was to upgrade our modules to use ES6 import/exports. This enabled us to reuse the React components and avoid circular dependencies while moving us closer to ES standards. We upgraded as much as we could without affecting other teams.
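For illustration, the circular relationship looked roughly like this (component names are made up); with a CommonJS require cycle, one of the requires resolves while the other file is still being evaluated and can receive a partially-populated exports object:

// ComposerView.js (CommonJS)
var Modal = require('./Modal');

module.exports = function ComposerView() {
  // ...renders a button that opens Modal...
};

// Modal.js (CommonJS)
// ComposerView.js is still being evaluated when this require runs,
// so this can resolve to an incomplete exports object.
var ComposerView = require('./ComposerView');

module.exports = function Modal() {
  // ...renders a fresh ComposerView inside the modal...
};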

What is a module?

A module is a reusable block of code that packages the data and implementation details for a specific piece of functionality and exposes a public API to be loaded and used by other modules. The concept of a module stems from the modular programming paradigm, which says that software should be composed of separate components, each responsible for a specific function, that are linked together to form a complete program.

Why are modules useful?

Modules allow programmers to:

  • Abstract code: hide implementation details from the user, so they only have knowledge of what the object does, not how it’s done
  • Encapsulate code: hide attributes so they can only be accessed via the methods of their class
  • Reuse code: avoid repetition by abstracting out methods and classes
  • Manage dependencies

ES5 Module System – CommonJS

ES5 was not designed with modules in mind, so developers introduced patterns to simulate modular design. CommonJS modules were designed with server-side development in mind, so the API is synchronous: modules are loaded at the moment they are needed, in the order they are required inside the file. Each file is a distinct module, and two constructs, require and module.exports, are used to declare dependencies and expose module contents.

Exports

exports, or module.exports, is used to expose a module’s contents as public elements. A module is referenced by its module identifier (the location path of the module).

Require

require is used by modules to import the exports of other modules. Every time you call require(‘example-module’) you get the same cached instance of that module, ensuring each module is a singleton and its state stays synchronized throughout the application.
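
For illustration, a minimal CommonJS module and its consumer might look like the sketch below; the file names and functions are made up:

```javascript
// math.js – a CommonJS module; anything not attached to module.exports stays private
const secret = 42; // private to this module

function add(a, b) {
  return a + b;
}

module.exports = { add };

// app.js – a consumer; the path './math' is the module identifier
const math = require('./math');   // same cached instance on every require
console.log(math.add(1, 2));      // 3
```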

ES6 Module System

ES6 introduces a standard module system influenced by CommonJS, but it operates differently from the mechanism above. CommonJS assumes that you will either use an entire module or not use it at all, whereas ES6 assumes that a module exports one or more entities and that another module will use any number of those exported entities.

The two core concepts of the ES6 module system are exporting and importing. Each file represents a single module which can export any number of entities as well as import any other entities.

Exporting

Variables and functions declared in a module are scoped to that module, so only entities that are exported from the module are public to other modules; the rest remain private. This can be leveraged for abstraction and to make elements publicly available only when explicitly intended.
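
A small sketch of what exporting looks like; the module name and its contents are hypothetical:

```javascript
// logger.js – only exported entities are visible to other modules
const prefix = '[app]';                 // private: not exported

export function log(message) {          // named export
  console.log(`${prefix} ${message}`);
}

export const LEVEL = 'info';            // another named export

export default log;                     // optional default export
```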

Importing

The import directive is used to bring modules into the current file. A module can import any number of other modules and refer to none, some, or all of their exported objects; any object that is referred to must be named in the import statement.
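
Continuing the hypothetical logger module from the sketch above, importing looks like this:

```javascript
// main.js – import only the entities you need from each module
import log, { LEVEL } from './logger';   // default export plus a named export

log(`starting up at level ${LEVEL}`);
```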

Other Benefits

  • In ES6 modules you get strict mode for free, so you do not have to explicitly write ‘use strict’ in every file. Strict mode is a restricted set of JavaScript semantics that eliminates silent errors and fixes mistakes that make it difficult for engines to perform optimizations.
  • Because import and export are static, static analyzers can build a tree of dependencies.
  • Modules can be synchronously and asynchronously loaded

Limitations

At this time not all browsers implement ES6 module loading. The workaround is to use transpilers such as Babel to convert code to an ES5 module format.

Difficulties/Changes

Updating our code base to use ES6 modules led to some problems in our test suite, causing the majority of the tests to fail. We soon realized that this was because our test suite used the JS Rewire library to mock modules in tests, and Rewire does not support the new ES6 module syntax, causing everything to explode.

Rewire is important because it provides an easy way to perform dependency injection: it adds getter and setter methods to modules, allowing us to modify the behaviour of imported modules in order to test the component that imports those modules in isolation.

Luckily there is an alternative to the JS Rewire library, babel-plugin-rewire, that works with Babel’s ES6 modules by adding similar methods to modules, allowing us to mock data and test components in isolation. To make this change you must include babel-plugin-rewire in the dependencies section of your package.json file, among other changes.

Example using the JS Rewire library (note the CommonJS require syntax):
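
The original code sample is not reproduced here, but a minimal sketch of the pattern looks something like this; the module under test, its private ‘api’ dependency, and the file paths are hypothetical:

```javascript
// Hypothetical test using the JS Rewire library with CommonJS modules.
// userList.js is assumed to require an 'api' module internally.
const rewire = require('rewire');

const userList = rewire('../src/userList');     // load the module under test

// Replace the private 'api' dependency with a mock via the injected setter.
const revert = userList.__set__('api', {
  fetchUsers: () => Promise.resolve([{ name: 'Test User' }]),
});

// ...run assertions against userList using the mocked api...

revert();                                        // restore the original dependency
```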

Example using babel-plugin-rewire (note the ES6 import syntax – you can import named exports, which brings in only the methods you need rather than the entire library):
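
Again, the original sample is not reproduced here; the sketch below shows the general shape of a test with babel-plugin-rewire, with the same hypothetical module and dependency as above:

```javascript
// Hypothetical test using babel-plugin-rewire with ES6 modules.
// Named imports pull in only what the test needs, not the entire library.
import userList, { __RewireAPI__ as UserListRewireAPI } from '../src/userList';

// Swap the private 'api' dependency for a mock.
UserListRewireAPI.__Rewire__('api', {
  fetchUsers: () => Promise.resolve([{ name: 'Test User' }]),
});

// ...run assertions against userList using the mocked api...

UserListRewireAPI.__ResetDependency__('api');    // restore the original dependency
```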

About the Author

Sonalee is a Co-op Front-End Developer on the Plan and Create team at Hootsuite. She is pursuing a Bachelor of Science in Computer Science with a minor in Commerce at the University of British Columbia. In her spare time, Sonalee enjoys hiking, snowboarding, and foosball.

Follow her on Instagram or connect with her on LinkedIn!

In the summer of 2017, I had the pleasure of joining Hootsuite’s Product Design team as a UX designer for all things mobile. I first came to Hootsuite looking to get a taste of the tech industry and am so thankful to have spent 4 months in such an inclusive, collaborative, and welcoming environment. Even though I was just a co-op student, there were endless projects and opportunities that I was able to have full ownership over. This creative freedom and level of responsibility was a rewarding learning experience, thanks largely to the amount of mentorship I received along the way. By working alongside UX designers, user researchers, developers, and product managers, I got to see my designs being developed and learned the process of shipping a product and what it takes to get there.

 

One of the most valuable lessons I learned at Hootsuite was the importance of evaluating your design before breaking it down into phases for release.

 
Often as design students, we’re taught to create meaningful, delightful end-products. We develop a vision for our designs and create mockups of this ‘ideal state’. But what we fail to consider is how to follow through with our ideas. How will your designs be implemented? How will they be built? How will they be adopted by users? How will they scale over time?

All of this requires careful consideration among multiple stakeholders, including you, the designer. A fellow UX designer at Hootsuite, Guillaume D’Arabian, describes it with a pyramid-shaped model that illustrates the relationship between business, design, and technology. By finding the right balance, you’re able to meet the needs of your customers. The real challenge, however, is maintaining this balance over time and adapting to unexpected circumstances: customer feedback, unaccounted-for use cases, and edge cases.

 

So what do we mean by phases?

It’s great to run developers through mockups and the vision of your design, but when it comes to putting a product out into the real world, it takes time and multiple iterations. In order to achieve your vision, you need to break your design down into phases to account for how it will be released.

Phases represent the scope of a particular sprint or release. They’re helpful for everyone because they define the priority of what needs to be built, show how each element or feature will be implemented, and create a shared understanding of how your product will evolve. In order to determine the priority of what needs to be built, it takes a level of strategic, technical, and customer understanding. This is where the pyramid comes into play.

A lot of factors emerge when defining the scope for each phase: the business strategy of your product, the complexity of each feature, the bandwidth and resources available within your team, the known behaviours among customers, and adapting to customer feedback.

Here are 3 key considerations to keep in mind when designing in phases:

Create An Easy Transition

  • If you’re proposing a redesign or a new feature, how will you transition core users to your design?
  • How will you introduce new users to your design?
  • What will you introduce first and how?
  • Are you replacing a behaviour or creating one?
  • Have a general understanding of how your design will be built. What are the more complex pieces? What are the easiest pieces to develop?
  • Are there elements within the existing product that can be reused in your designs?
  • What pieces are crucial in solving the problem you’re designing for?
Ideally, this should help you visualize the steps you need to take in order to jump from point A (the existing product) to point B (your new vision).

Consider How it’ll Scale

Design is never finished. It’s important to consider how your designs might adapt or scale over time, while keeping your vision in mind. In general, this is a good exercise to test the longevity of your designs. And even though you might not be able to predict it all, this might just help you define potential phases down the road and identify future areas of opportunity.

  • How will your users’ behaviours change as they transition from being a new user to an everyday user?
  • How will your design scale to a high volume of users?
  • How will your design scale as it adopts new features?

Creating your MVP

By now, you should have a general idea of how to create your MVP (minimum viable product) for the first phase/release. Coined by Frank Robinson and popularized by Steve Blank and Eric Ries, the term MVP refers to “the smallest thing that you can build that delivers customer value.” Your MVP should be able to address the core problem and provide basic functionality, but most importantly, create a lovable experience. With every release, your customer feedback will either validate or refine your ideas, helping you plan for the next phase of release.


As UX designers, we’re trained to be considerate of our audience and our users—but this also applies to our own teams. The most successful teams are built out of strong relationships. In order to build a successful product, it’s important that we stay mindful of everyone’s needs.

 

About the Author

Amanda is a Co-op UX Designer on the Product Design team at Hootsuite. Working closely on Hootsuite Mobile, she’s worked across multiple teams, from Publisher and Plan & Create to the iOS and Android core teams, as well as the first team for Hatch, an incubator program at Hootsuite. During her off time, you can find her visiting local events, thrift stores, and art exhibitions, or taking street photography.

Follow her on Instagram, Twitter, or connect with her on LinkedIn!

 

 

While working on Hootsuite’s Facebook Real-Time service for the past few months, I have had an eye-opening experience with back-end development and the underlying architecture that makes all of our internal services work together. In this post I will highlight the architecture of a Hootsuite service, with reference to Facebook Real-Time.

Overview

The Facebook Real-Time service consists of two microservices: Facebook Streaming and Facebook Processing. Microservices are an approach to service-oriented architecture that divides a service into smaller, more flexible, specialized pieces of deployable software. This design was chosen to ensure each service does one thing very well. It allows each module of a service to be scaled according to the resources it needs and gives greater control over it; one key reason is that Facebook Processing does a lot of JSON parsing, which consumes more CPU. The way we utilize microservices is by implementing each one as a specialized piece of software: in the case of Facebook Real-Time, Facebook Streaming specializes in collecting event payload data from Facebook, while Facebook Processing parses this data into specific event types.

Facebook Streaming

Facebook Streaming is a streaming service: it collects large payloads of information from Facebook when certain real-time events occur. Data streaming, in this case, means that dynamic data is generated by Facebook on a continual basis and sent to the Facebook Streaming microservice. These events can include actions such as a post on a Facebook page or the simple press of the ‘like’ button on a post. Facebook Processing then parses these large payloads of data into specific event types.

Webhooks are used to collect events from a Facebook page. We register callbacks with Facebook to receive real-time events: when data is updated on Facebook, its webhook sends an HTTP POST request containing the event payloads to a callback URL belonging to the Facebook Streaming microservice.
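
As a rough illustration of the callback pattern, here is a minimal Node/Express sketch of such an endpoint. This is not the actual Hootsuite service (which is not written this way); the route, port, and token handling are assumptions, while the hub.verify_token/hub.challenge handshake is Facebook’s standard webhook verification flow.

```javascript
// Illustrative sketch of a webhook callback endpoint, not the real service.
const express = require('express');
const app = express();
app.use(express.json());

const VERIFY_TOKEN = process.env.VERIFY_TOKEN; // shared secret configured with Facebook

// Verification handshake: echo hub.challenge back if the token matches.
app.get('/webhooks/facebook', (req, res) => {
  if (req.query['hub.verify_token'] === VERIFY_TOKEN) {
    res.send(req.query['hub.challenge']);
  } else {
    res.sendStatus(403);
  }
});

// Real-time event payloads arrive here as HTTP POSTs.
app.post('/webhooks/facebook', (req, res) => {
  // Hand the raw payload off for publishing downstream (e.g. to the Event Bus).
  console.log('received payload', JSON.stringify(req.body));
  res.sendStatus(200);
});

app.listen(3000);
```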

 

Facebook Processing

The Facebook Processing microservice parses the event payloads from Facebook Streaming into more specific events, so a published post will be one kind of event and a ‘like’ will be another. Here, the number of events that are assigned event types can be controlled. This is important because event payloads are large and require a lot of CPU to parse, and limiting the set of event payloads being parsed at once reduces CPU usage. Instead of parsing event payloads at the rate they are received from Facebook Streaming, the service can consume a set of payloads at a time while the rest wait in a queue until the current set has been parsed.

We have also built a registration endpoint into the Facebook Processing service. Instead of manually adding Facebook pages to the database of pages registered for streaming, another service can call the endpoint to register the specified Facebook page.

 

Event Bus

A payload consists of batches of events from different Facebook pages; we call this a ‘raw event’. These raw events are published from Facebook Streaming to the Event Bus. The Event Bus is a message bus that lets different microservices communicate asynchronously; it is built on Apache Kafka and consists of a set of Kafka clusters with topics. A topic corresponds to a specific event type, and consumers can subscribe to these topics; the event data corresponding to a topic is collected by its consumers. A service can consume events from the Event Bus, produce events to it, or both! Each service is configured to know which topics to consume or produce.

Event messages are formatted using protocol buffers. Protocol buffers (Protobuf) are a mechanism for serializing structured data: the structure of a type of event only needs to be defined once, and it can then be read or written easily from a variety of data streams and languages. We decided to use protocol buffers because they have an efficient binary format and can be compiled into implementations in all of our supported languages, which include Scala, PHP, and Python. Event payloads can also easily be made backwards compatible, which avoids a lot of ‘ugly code’ compared to using JSON. These are just some key examples; there are many other benefits to using Protobuf. With the growing number of services at Hootsuite, the occurrence of an event often needs to be communicated to other services asynchronously. The value the Event Bus provides is that the service where the event happens does not need to know about all the other services the event might affect.

To give an overview of how the Facebook Real-Time service interacts with the Event Bus: raw events are first published to the Event Bus by the Facebook Streaming microservice and consumed by Facebook Processing, where they are assigned specific event topics and sent back to the Event Bus. This allows specific events to be consumed by other services. Events also stay on the Event Bus for a period of time until they expire. Because we wanted to publish raw events and then parse them at a different rate than they are sent to us, this let us use the Event Bus as a temporary storage tool and split Facebook Real-Time into two separate microservices. It also allows us to offset the producer load against the consumer load used for processing events in the Facebook Processing service.
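
The sketch below illustrates that consume-parse-republish loop using the kafkajs client. It is purely illustrative: the real services are not written in Node, the topic names and payload shape are invented, and the messages are shown as JSON for brevity even though the Event Bus actually carries Protobuf-encoded messages.

```javascript
// Illustrative raw-event -> typed-event flow, not the actual Hootsuite implementation.
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'facebook-processing', brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'facebook-processing' });
const producer = kafka.producer();

async function run() {
  await consumer.connect();
  await producer.connect();
  await consumer.subscribe({ topic: 'facebook.raw-events' }); // hypothetical topic

  await consumer.run({
    eachMessage: async ({ message }) => {
      const raw = JSON.parse(message.value.toString());
      // Assign a specific event type to each entry in the raw batch,
      // then publish it back to the Event Bus under its own topic.
      for (const event of raw.entry || []) {
        const topic = event.changes ? 'facebook.page-change' : 'facebook.other';
        await producer.send({ topic, messages: [{ value: JSON.stringify(event) }] });
      }
    },
  });
}

run();
```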

 

Service Discovery

Skyline routing is used to send HTTP requests to the Facebook Real-Time service. Skyline is a service discovery mechanism that ‘glues’ services together: it allows services to communicate with each other and to indicate whether they are available and in a healthy state. This makes our services more reliable and lets us build features faster. Skyline routing allows a request to be sent to a service without knowing the specific server the service is hosted on; the request is sent and then redirected to the appropriate server for that service. Because it routes requests from clients to instances of a service, it can reroute a request to another server if a service worker fails. The same applies when a service is being restarted: if there are several instances of the service running, requests go to another instance while one instance restarts. This also allows new instances of a service to be spun up if it is being overloaded and there are too many events in the queue, improving response time.

In addition, the client can access all the services routed by Skyline via a single URL-based namespace (localhost:5040) by specifying the service name and associated path. The request is then routed to the corresponding address where the service is hosted.
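
For example, a caller might address a service through that namespace as sketched below; the service name and path here are hypothetical, and only the localhost:5040 namespace comes from the description above.

```javascript
// Illustrative only: a request to the Skyline namespace is routed to whichever
// healthy instance of the named service is currently registered.
const http = require('http');

http.get('http://localhost:5040/facebook-processing/status', (res) => {
  console.log(`facebook-processing responded with HTTP ${res.statusCode}`);
});
```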

In conclusion, a microservice publishes events to the Event Bus, which holds a large pool of events. Those events can be consumed by other microservices, which put the event data to use. The services themselves can communicate with each other over HTTP using Skyline routing, a service discovery mechanism.

 

About the Author

Xintong is a Co-op Software Developer on Hootsuite Platform’s Promise team. She works closely with the Facebook Streaming microservice, which is built on Hootsuite’s deployment pipeline and uses the Event Bus to communicate event information to the Inbound Automation Service. Xintong also enjoys the arts and likes to invest her creativity in fine art and music.