Posts from April 2017

After any outage or severe incident at Hootsuite, we take a closer look at the event to uncover valuable lessons and data. We do this by working backwards, asking “Why?” to find the most likely root causes for the problem. Then we put in place incremental improvements to bulletproof ourselves against a recurrence.

Before I explain our mindset, practices, and step-by-step process, let me tell you a story.

Oh Shit

I’m sure you’ve felt this way before…

On April 6th, 2017, my spidey-sense started to tingle, my hands began to sweat, and I had that sinking feeling in the pit of my stomach: all the physical and emotional signs of having done something wrong.

Turns out, I brought down part of our site for 39 minutes. More specifically, our users were unable to post to LinkedIn or to add new LinkedIn accounts. I had inadvertently deleted the Hootsuite app from the LinkedIn Developer Portal because I thought I was deleting my developer access to the app. Fortunately for our customers, this happened in the evening, around 6:30 PM PDT, when our site is not under heavy load.

What happened next says everything about our culture.

Now what?

Everyone came together.

  • One of our Product Managers helped me verify the degraded functionality.
  • Our VP Engineering calmly directed me to our Strategic Partnership team who have a relationship with LI.
  • Our SVP Strategy and Corporate Development tried to get in touch with a contact at LinkedIn while I submitted a support ticket to LinkedIn Developer Support.
  • One of our advocates from Customer Support let us know about the first customer reports of a problem in our Slack #outage channel.
  • Developers, Operations, and others from across the company immediately responded in #outage and offered to help.

The Fix

Turns out emailing LinkedIn Developer Support was the key to a fast resolution. Danthanh Tran from LI was able to “undelete” our Production, Staging, and Development apps in a matter of minutes.

Whose fault was it?

That’s a trick question. As my cold sweats told me, I felt responsible, but no one is to blame. This is key to the culture of a team that improves systems quickly.

The next day, a member of our Ops team facilitated a blameless post-mortem called a “5 Whys” that focused on the timeline, data about the event, what went wrong, and how we can fix our system so the same issue doesn’t happen again.

Note the words “fix our system,” not “how we can fix me.” Like most humans, I’m well intentioned but fallible. Many of us could have made that error. In fact, letting personal social network accounts administer the one key connection to our social networks has been a time bomb waiting to happen. This event is a catalyst for long-overdue change.

“Human error is a symptom—never the cause—of trouble deeper within the system” – Dave Zwieback in an article on First Round Review.

So, the question is: how can we bulletproof our system so the next person doesn’t have the same inadvertent nightmare? Dan Milstein puts it more humorously in How to Run a 5 Whys (with Humans, not Robots): “Let’s plan for a future where we are all as stupid as we are today” 🙂 Focusing on the system instead of the person is a way to remove the stigma of speaking up when we feel guilt for an accident. This is key because accidents contain valuable information and should be “seen as a source of data, not something embarrassing to shy away from,” as John Allspaw puts it.

Next, we post a summary of our 5 Whys on our internal tool called Morgue (also from Etsy) and take actions to abate the probability of a repeat event.

A Note on the title of this outage.

Two people pointed out the incongruity of saying we’re blameless but using a name in the title “That time Noel Deleted LinkedIn”. Let me explain. This is the only time we put a name to a 5 Whys. It was done as a satirical tribute to me, expressly for my contributions to our culture of blameless post-mortems.

Our Mindset and Culture on 5 Whys

To surface all the valuable data, our culture must remain “open, with a willingness to share information about … problems without the fear of being nailed for them.” – Sidney Dekker in The Field Guide to Understanding ‘Human Error’.

Trust by default

  • Believe that people are doing the best they can with the information they have.

Provide psychological safety

  • How many of you would have admitted to causing a problem that brought down production? Do you feel you can take the risk to tell people without feeling insecure or embarrassed? Censoring ourselves means we limit what we can do.
  • As my colleague Geordie put it: “only people who feel psychologically safe in a team and can be themselves, are capable of doing their best work.”

Helpfulness is the secret weapon

  • Helpfulness feels great and it is an indicator of highly productive teams. Margaret Heffernan, in her TED talk, speaks about an MIT study in which teams were given very hard problems to solve. The really successful teams were not the teams with the highest aggregate IQs. They were, in fact, the most helpful and most diverse teams: the ones with the most empathy towards one another, the ones without a dominant voice, and the ones with the most women.

Continuous Improvement supported by data

  • One of our values is Build a better way. To do this we have to make time for reflection then follow it up with adaptation. Supporting data is key. Tracking metrics and severity in a single place means we can use it to add context, to make decisions, and to measure our improvement.

How we got here

We arrived at these practices because people at other companies ‘worked out loud’ and showed us. Then, we iterated towards our practices based on our context.

The following list is from invaluable thinkers who influenced our mindset, practices, and tools.

Step-by-step

  1. Something happens… an outage or incident

  2. Resolve it.

  3. Read the articles above. If you read just one, read this one.

  4. Set a time for a post-mortem. The sooner the better. Grab a whiteboard.

  5. A note about where to hold it: a whiteboard in a public place, not in a meeting room. We also leave the details up until the next 5 Whys. This is a visible reminder of openness and it invites participation from passers-by who provide input based on their past experiences.

  6. Invite the person or people who… (taken from Bethany Macri’s talk on post-mortems):

    • introduced the problem
    • identified the problem
    • responded to the problem
    • debugged the problem
    • anyone interested

  7. Conduct a 5 Whys (blameless post-mortem). Focus on the system, not the person.

  8. Like Dan Milstein says, start with humour to get participants to open up. Relative to other disasters… How bad is this? Did the site go down? Accidentally post from our main Twitter handle? Lose customer data? Send an email to all our users with the greeting “Hello FName!”

  9. Collect metrics

    • Timeline
    • MTTR: Mean Time to Resolution
    • MTTD: Mean Time to Discovery

  10. Drill down
    1. Ask “Why?” at least five times.
    2. Each statement should be both verifiable (how do you know?) and not compound (single cause). An example from Five Whys – How To Do It Better:
    • Example problem: SUV Model Z exhaust system rattle
    • Why? Change of position of bracket results in vibration ← COMPOUND
    • Why? Exhaust pipe vibration ← SINGLE
  11. You must be able to work up the chain (last ‘why’ to the first ‘why’) to verify the causation.

  12. List the improvements you can make to bulletproof the system

  13. Ask people to commit to completing an improvement.

  14. How do we arrive at action items and commitments? Organically. As a heuristic, point to each “why”, and to each step in the timeline, then ask how we can improve the process, technology, and training for each item; and who will take it on.

  15. Publish it in Morgue or in the most visible communication channel.

  16. Follow-up on that channel as you make the fixes and share your learnings.

Too soon? 🙂

Thank you

To everyone who helped with this outage. To Jonas, Beier, Geordie, Tyler, James, and Trenton for their contributions to this process and their help putting this article together.

About the Author

Noel Pullen focuses on culture, employee engagement, technical community involvement, and training for Hootsuite’s technical groups. He loves to exchange ideas and would like to hear how you do these things at your organization. Get in touch via LinkedIn.

Henry Ford once said, “Coming together is a beginning, staying together is progress, and working together is success.” It is based upon this philosophy of embracing collaboration from start to finish that my team, as well as numerous others at Hootsuite, has adopted an additional role in our Agile methodology.

To provide some background for those unfamiliar with Agile, an Epic is a large unit of work which, when completed, will provide a lot of value to the customer. Epics are then broken down into stories and tickets/tasks which developers will commit to and complete.

Every developer is encouraged to work on whichever task is highest in priority, which keeps work fluid and ensures each developer stays well-rounded. However, in any given sprint there can be numerous Epics in progress, as well as many more being planned in the backlog, which often makes it difficult for product owners to maintain an accurate idea of each Epic's progress based on the small, fragmented updates from each developer at scrum. Further, the process of conceiving a new feature often gets muddled as it is passed around between design, growth, and management before finally arriving at the engineers. The solution to all these problems and more? The Epic Champion. Read More …

I have been honored to work on the Test Tooling team at Hootsuite since January, and one of the main projects I actively participated in is the Arbiter test management system, a project that provides a user-friendly way for QAs and developers to manage the tests to be run on different build pipelines in real time. By allowing users to directly toggle tests from the UI, Arbiter has significantly reduced the time it takes the organization to resolve test failures.

As the program aims to optimize the developer experience, we focused not only on the practical convenience of the program, but also on a user interface design that is visually pleasant for developers. With this in mind, we designed 5 different themes spanning the spectrum: dark, light, oreo, tiramisu, and azure.

Since React is the de-facto standard for front-end development at Hootsuite, we chose React to effectively build the visual effects we intended to achieve.

CSS to LESS

When Arbiter was first built, we stored our styles in old-fashioned CSS stylesheets: each theme was loaded as a stylesheet link element in the HTML, and switching stylesheets involved simply changing the href of that link element with a JavaScript function.

In February, I completed the process of refactoring the CSS stylesheets into LESS stylesheets to match the organization standards at Hootsuite. This transition resulted in a much more efficient integration with React. The stylesheets are now loaded directly into the components they style with less-loader, so that we don’t have to work with stylesheets in raw CSS.

BUT, alas, every rose comes with its thorn! Now that stylesheet loading is handled by less-loader, we can no longer explicitly manipulate the stylesheet links to switch themes 🙁

After two days of desperate research, I was frustrated by the lack of solutions on this topic online. Through this blog post, I would like to demonstrate the simple process that I came up with so that people in my shoes can save their research time. It is also a testament that sometimes it is not necessary to rely on more advanced packages to solve issues; simple JavaScript, applied with some thought, does the job equally well!

Component Structure

To keep things simple, I have separated all the theme-loading related functions into their own React component named ReactThemeLoader. The overall structure of this component involves initially loading all the available themes, disabling all of them, and then re-enabling the current theme.

The mechanism of this system relies on the fact that stylesheets loaded onto the DOM can be toggled on and off by setting the boolean property “disabled”. This property gives us the flexibility to load multiple themes onto the page at once by enabling the current theme and disabling the others, without having them interfere with one another.

In componentWillMount, all the themes are loaded and disabled initially. The indexes of the theme stylesheets are tracked here and recorded as key-value pairs in a dictionary called themes in state.

After the components are mounted, in componentDidMount, the program reads the theme name stored in the cookie and sets the theme by calling setTheme(). In case the cookie is not present, the theme will be set to the default theme (in this case dark), which is handled by the setTheme() method.

In the setTheme method, the stylesheet indexes of the old theme and new theme are simply retrieved from the dictionary themes and toggled by setting the “disabled” property of the stylesheets as shown above.

The UML Diagram for the theme-loading process is shown below:

NPM Package

To make this React component more reusable, I have published an NPM package: react-theme-loader (http://ow.ly/U9Lt30asuTX), a ready-to-use open-source React package for you. Below is where the tutorial actually starts.

Install the NPM Package

Under your project directory, run:
npm install react-theme-loader --save

This command will install this node package to your project and add it to the dependencies in package.json

Import the React Component

In your top-level React components where you want the theme styles to take effect, import the package and render it as a component

When rendering the ThemeLoader, there are certain properties we need to attach to the element; the above is just an example, please read the instructions below on filling in the props and do NOT copy the above exactly.

Required
“ref”: pass in “themeLoader” exactly so that you have access to the functions provided by the component.

“supportedThemes”: pass in an array of strings that represent the names of the theme files (without “.less”).

Example:

“themeDirectory”: pass in the relative path of the directory of the theme files in “supportedThemes”

Optional
“fonts”: pass in the relative path of the fonts file. This file should contain ALL fonts used in the themes, and do NOT import any fonts in the theme files since they mess up the indexes of loaded stylesheets. If “fonts” is not specified, no fonts file will be loaded.

“defaultTheme”: pass in the name(string) of the default theme, which will be loaded initially when no theme cookie is stored. The first theme in “supportedThemes” will be the default if not specified.

“themeCookie”: pass in the name(string) of the cookie where you wish to store the current theme name, so that the browser will remember the selected theme. “CURRENT_THEME” will be used if not specified.

Using react-theme-loader

Switching Themes
In the React component where the ThemeLoader is rendered, simply call its setTheme(theme) method (available through the “themeLoader” ref described above) to switch themes, where theme is the name (string) of the theme you wish to switch to.

When a theme is loaded, the ThemeLoader will render the following to HTML:

Cookie

Currently, the theme information in our system relies entirely on an unencrypted browser cookie, named “CURRENT_THEME” by default. The package gives you the option to pass in the cookie name you wish to use.

Although theme information is not sensitive, this approach is still inconvenient and leaves holes for security issues. We are currently looking into other ways of improving this feature.

React-Theme-Loader 2.0 !!!

Currently, the react-theme-loader package is still in a rudimentary stage and a lot of customized features can be added to the component. The ones listed below are in progress; please feel free to leave any ideas in the comments below.

 

  • Customized way to store current theme information
As mentioned above, the current way of storing theme information in a cookie isn’t ideal. In a lot of applications, it makes more sense to store it against a user account. Instead of doing the cookie handling in the component itself, the component will be able to accept a customized function from the user to handle the job.

 

  • More robust import handling
Currently, you would have to store all the imported fonts into a separate fonts file for the component to read the correct indexes of other stylesheets, since importing fonts in a stylesheet causes index problems. In the future, more investigation will be made to ensure that adding “import” statements in .less files will not cause the program to break.

Conclusion

Completing the transition from CSS stylesheets to LESS stylesheets was one of the first refactoring processes I worked on during my time at Hootsuite. While I went through a lot of frustration and desperation in the process, I was forced to discover more hidden features in the world of programming (leading me to publish my first npm package!). Looking back, I consider it a tremendously helpful experience in the long run, with major improvements made to the system.

About the Author

Sonny Yu was a co-op student on the Test Tooling team for the spring term of 2017. He studies Computer Science and Business Administration at University of Waterloo. Connect with him on LinkedIn!

How did this idea get started?

When you are running more than 1500 servers in AWS with no consistent standard for creating them, it is really hard for Operations Developers to get an insightful view of the system inventory on each machine. How should I know when an instance needs to be patched? What if there is a package with unpatched vulnerabilities installed on several servers? As a result, the Ops team wanted a solution to monitor and gather inventory information on our servers.

Why use AWS Simple Systems Manager (SSM)?

Why not use Puppet or other tools? Although some configuration management tools like Puppet and Chef already gather inventory information on their clients, they just don’t fit with Hootsuite’s Ansible-based ecosystem. Setting up an additional configuration management tool and only using it for a small use case like this is just overkill. So what could be a good option that is efficient and requires little configuration?

How about running bash scripts as a cron job to collect the required information (packages, CVEs, OS version, etc.) on each system? First of all, the bash script for gathering CVEs takes at least 20 minutes to run, and 90% of the packages on most instances are the same, so it is not ideal to have all instances gathering duplicated information. Secondly, what if the script needs to be changed in the future? Is there a better way than re-deploying it to every machine? Eventually, I came across Simple Systems Manager, a service that AWS recently launched to help users automate management tasks at no additional charge.

How do we achieve all of this?

Prerequisites of SSM: an agent needs to be installed on every instance, and an IAM role needs to be attached to the instance so it is granted access to the SSM service in AWS.

AWS SSM has a cool feature called “Send Command” which allows users to run bash scripts on target machines without establishing an SSH connection to them, and the same command can be sent to as many machines as the user wants. Documents define the actions that SSM will perform on the instance, and they can be associated with EC2 instances as scheduled tasks. The bash script for gathering packages and system info is embedded into SSM documents as parameters and then associated with all instances in the SSM console. The diagram below is a visual representation of the idea.
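
For a concrete feel of what that looks like, here is a hedged sketch in Python (boto3) of registering a Command document that takes the inventory script as a parameter. The document name matches the one associated later in this post, but the exact content is illustrative rather than our production definition.

import json
import boto3

ssm = boto3.client('ssm')

# A Command document whose shell script is supplied as a parameter,
# mirroring the structure of the built-in AWS-RunShellScript document.
document = {
    "schemaVersion": "2.2",
    "description": "Gather package and system inventory",
    "parameters": {
        "commands": {
            "type": "StringList",
            "description": "The inventory-gathering script to run"
        }
    },
    "mainSteps": [
        {
            "action": "aws:runShellScript",
            "name": "gatherInventory",
            "inputs": {"runCommand": "{{ commands }}"}
        }
    ]
}

ssm.create_document(
    Name='upload_pkg_info',   # the document later associated with each instance
    DocumentType='Command',
    Content=json.dumps(document)
)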

What do we need to collect?

Besides basic system info such as installed packages, OS version/name, and uptime, the CVEs (Common Vulnerabilities and Exposures) for all packages also need to be collected from each instance, as the CVEs are crucial for the Security team to determine potential vulnerabilities of Hootsuite servers.

Implementation

Workflow Diagram:

Uploading Data to DynamoDB:

After collecting all the required information, the bash script generates a JSON file that contains all the data and uploads the file to DynamoDB using the AWS CLI. The JSON object is strictly formatted to match the requirements of DynamoDB, which look like:

{"Key":{"Data_Type":"Value"},
"Attribute1":{"Data_Type":"Value"}
"Attribute2":{"Data_Type":"Value"}
"Attribute3":{"Data_Type":"Value"}
"Attribute4":{"Data_Type":"Value"}
"Attribute5":{"Data_Type":"Value"}}

The object also contains a timestamp attribute, “TTL” (time to live), for auto-expiration of terminated instances in the DB. This attribute is important because the bash script runs every 5 days to update the information. If the “TTL” attribute has not been updated by the 6th day, it likely means the instance was terminated or stopped, so the database will remove the item to save space.

Sample JSON Object:

{"instance_id":{"S":"instance_id"},
"runstatus": {"S": "True"},
"ttl": {"N": "1493081195"},
"os": {"M":{"name": {"S":"Ubuntu"}, "version": {"S":"14.04.5"}}},
"uptimebydays": {"S":" 84 "},
"pkg": {"M":{
"accountsservice":{"M":{ "pkgversion":{"S": "0.6.35-0ubuntu7.3"}, "status": {"S":"latest"}}}}}

Create Association:

AWS Config Rules are used to monitor configuration changes in SSM. When an instance is created, it is automatically added to the SSM console in AWS, and the creation event is captured by AWS Config Rules to trigger a Lambda function called “ssm_association”. The event is passed into the Lambda in JSON format, so the Lambda can easily retrieve the instance id and event type to determine whether the association needs to be created. The Lambda function then uses Boto3 (the Python AWS SDK) to create the association.

Instance Creation Event:

{'configRuleId': 'config-rule-g3xyel',
 'version': '1.0',
 'configRuleName': 'create_ssm_assoiciation',
 'configRuleArn': 'arn:aws:config:us-east-1:1111111111:config-rule/config-rule-g3xyel',
 'invokingEvent': '{"configurationItemSummary": {"changeType": "CREATE",
     "configurationItemVersion": "1.2", "configurationItemCaptureTime": "2017-1111111111",
     "configurationStateId": 12345687654, "awsAccountId": "1111111111",
     "configurationItemStatus": "OK", "resourceType": "AWS::SSM::ManagedInstanceInventory",
     "resourceId": "i-06ad1615134baaa2a", "resourceName": null,
     "ARN": "arn:aws:ssm:us-east-1:1111111111:managed-instance-inventory/i-xxxxxxxx",
     "awsRegion": "us-east-1", "availabilityZone": null,
     "configurationStateMd5Hash": "6b1a5634c1f60482767fc239e4422ea4",
     "resourceCreationTime": null}, "s3DeliverySummary": null,
     "notificationCreationTime": "2017-04-10T20:05:02.891Z", "recordVersion": "1.0"}',
 'eventLeftScope': False,
 'ruleParameters': '{"type":"ssm_testing"}',
 'executionRoleArn': 'arn:aws:iam::1111111111:role/AWSConfig',
 'accountId': '11111111'}

Lambda Function:

# The following code is the simplified Lambda function.
# It takes the instance id from the creation event, checks that the instance is running, and creates the association when the event type is "CREATE".
import json

import boto3


def lambda_handler(event, context):
    # The invoking event arrives as a JSON string and has to be parsed first
    invoking_event = json.loads(event['invokingEvent'])
    configuration_item = invoking_event['configurationItemSummary']
    instanceid = configuration_item['resourceId']
    change_type = configuration_item['changeType']

    # Make sure the instance is running
    ec2 = boto3.client('ec2')
    response = ec2.describe_instance_status(
        InstanceIds=[instanceid],
        IncludeAllInstances=True
    )
    state = response['InstanceStatuses'][0]['InstanceState']['Name']
    print(response)

    if state == "running" and change_type == "CREATE":
        print("Executing create_association")
        # create_association is an SSM API call, so it needs an SSM client
        ssm = boto3.client('ssm')
        response = ssm.create_association(
            Name='upload_pkg_info',
            DocumentVersion='$LATEST',
            Targets=[
                {
                    'Key': 'InstanceIds',
                    'Values': [instanceid]
                }
            ],
            ScheduleExpression='cron(0 0 0/12 1/1 * ? *)'
        )

Gathering CVE:

The Security team wants to collect not only the unpatched CVEs for all installed packages but also those that have already been patched. In fact, gathering CVEs became the biggest bottleneck of the process, as I could not find any available database or API where I could query all CVEs using a package name and version number. The only known method is to use

apt-get changelog PKG_NAME

It can take a few seconds to download each changelog, which results in an extremely long running time.

To solve this problem, another Lambda function is introduced to create a list of instances along with the packages installed on them. The Lambda function then calls SSM to invoke “Send Command” to run the bash script on each instance, so that this task requires minimal time and resources.

# Example of sending the bash script to one of the target instances.
# target_instance is a dict generated by the Lambda function: each key is an instance id
# and its value is the list of packages to check on that instance.
# The list of packages is turned into a bash array and combined with the bash script.
# The bash script cannot be parameterized in this case because the list of instances
# and packages might change every time we run the code.
import boto3

client = boto3.client('ssm')

# key is the instance id currently being processed by the Lambda
pkg_list = " ".join(target_instance[key])
# e.g. basharray = 'pkgs=( python3 apport sensu ... )'
basharray = "pkgs=( " + pkg_list + " )"
get_cve = '''
for i in "${pkgs[@]}"
do
    CVE=`sudo apt-get changelog $i | grep -o "CVE-.*" | cut -c1-13 | sort -u | paste -s -d, -`
    if [ -z "$CVE" ]; then
        echo "No CVE"
    else
        aws dynamodb update-item --table-name pkg_CVE \
            --key '{"pkg_name":{"S":"'$i'"}}' \
            --update-expression "SET cve_list=:y" \
            --expression-attribute-values '{":y":{"S":"'$CVE'"}}' \
            --return-values ALL_NEW \
            --region us-east-1
    fi
done'''
commands = [basharray] + get_cve.splitlines()
response = client.send_command(
    InstanceIds=[key],
    DocumentName='AWS-RunShellScript',
    TimeoutSeconds=1800,
    Comment='get CVE from ' + key,
    Parameters={'commands': commands}
)

Limitation/Improvement:

  • Currently there is no API designed for this project, which means people will have to pull data directly from the database.
  • An API layer would give more flexibility in terms of designing the data structure: the API defines the data structure for the user, instead of strictly formatting it in the DB.
  • SSM provides little feedback when creating associations between instances and documents.
  • Link Lambda functions with SNS topics to gather error messages and help troubleshoot the system.

Conclusion:

This project is a PoC for gathering system inventory using SSM, and it can be optimized in many aspects. It is also a good test of what SSM is capable of, and we can clearly see advantages such as parameterizing the bash script. I feel SSM is a tool with real potential, and it can be leveraged for much more than just patching.

About the Author

Andy Han is an Operations Developer Co-op at Hootsuite. Andy studies Management Engineering at the University of Waterloo. Contact him on LinkedIn.

 

Guy Drut. From hurdler49

Guy and the Gold Medal

The French Olympic hurdler, Guy Drut, found himself in an unenviable position in the early summer of 1976. He was France’s only hope for a track-and-field medal, and the burden of carrying the nation’s pride on his shoulders was getting to him. Drut later told me that he had spoken on several occasions prior to the games with our long-time client Jean-Claude Killy and that he really felt he owed a part of his gold medal to Killy. He explained it as follows: “Jean-Claude told me that I was the only one who knew how to get my body and mind to their ultimate peak for the Olympic Games. He then told me that after I had done this that I should keep saying to myself, ‘I have done everything I can to get ready for this race and if I win, everything will be great, but if I don’t win my friends will still be my friends, my enemies will still be my enemies, and the world will still be the same.’ I repeated this sentence to myself before the qualifying heats and during the break between the semi-finals and finals. I kept saying the sentence over and over, and it blocked out everything else. I was still repeating it to myself when I went up to get my gold medal.”

From the Fear of Failure passage in What They Don’t Teach You at Harvard Business School: Notes from a Street-smart Executive by Mark H. McCormack. Underlining is mine.

This isn’t the fear you’re looking for

Few of us have been in the starting blocks at the Olympics but for many of us a similar level of anxiety can be brought on at the thought of presenting a technical demo in front of our fellow engineers – even our friends and colleagues.

By repeating those words, Drut downplayed the consequences of failure and detached his anxiety from his situation. Every time I go up on stage, I say those same words, for the exact same reason.

Working out loud is a good thing

Every Wednesday morning for the last four years our entire technical staff gets together for Demos and an All Hands. Engineers sign up to give 5-minute demonstrations of new product functionality or internal tooling and then take questions from the audience. After the Demos we move on to announcements and awards. Over the last four years we’ve done upwards of 440 demos. I’ve watched almost all of them.

For everyone who attends this session, it celebrates people and accomplishments; it drives alignment around mission, strategy and priorities; and finally, it provides a forum to ask and answer questions. (Thanks for that excellent article Gokul).

Each presenter needs to make the most of this opportunity because talking about your work is as important as the work itself. The challenge is to get so good at presenting a technical demo that others feel compelled to celebrate your work, change their outlook, and share your story. That means making it succinct, informative, and relevant.

The leap from paper to the stage is huge – the way our ideas sound in our head is not at all how they sound out loud. Here are five ways to elevate a mediocre technical demo to a great one. Read More …

Do you build microservices in Golang? If so, today is your lucky day as we have just open sourced our Go Health Checks Framework which implements our standard Health Checks API.

What is it?

The Go Health Checks Framework is a declarative, extendable health checking framework written in Go that provides a simple way to register dependencies as status endpoints and integrate them into an existing microservice that uses either the standard net/http package or the Gin Framework.

Monitor From the Inside

The Health Checks API helps you monitor your service health from the inside by exposing a set of standardized endpoints at “/status/…” that can be monitored using any monitoring framework.

Monitor your microservice from the inside.

We have found that the best way to monitor the health of a microservice is from the inside. This is because it is the single source of real truth for its health. If you’re not monitoring from the inside, then you are inferring the health of the microservice and this comes with its own problems. Not convinced? Watch this talk by Kelsey Hightower titled Stop reverse engineering applications and start monitoring from the inside.

Getting Started

Using the Go Health Checks Framework in your Golang microservice is easy:

  1. Define a StatusEndpoint for each dependency in your microservice.
  2. Configure framework options and register the Health Checks framework to respond to all /status/… requests passing a slice of all your StatusEndpoints.

That’s it! As long as you have defined your StatusEndpoints correctly, the framework will take care of the rest.

Not Just Monitoring

The Health Checks API enables more than just microservice monitoring and gives you the power to explore, debug, and document your ever changing architecture. Below is a demo video of a tool we use that displays a dashboard for each microservice in a distributed application and lets developers/ops navigate the microservice graph in real time. This dashboard not only shows information about the services in your graph, but also displays the current status of each microservice in the graph and its dependencies. The open sourcing of this tool is coming soon!

Want to learn more?

Watch my full talk from DevOpsDaysYVR where I go over the Health Checks API and demo a tool we use to explore microservice graphs in real time.

Links

  • Health Checks API – A cross language standard for checking health in a distributed application
  • Go Health Checks Framework – A Golang implementation of the Health Checks API used for microservice exploration, documentation and monitoring.

About the Author

Adam Arsenault is a senior specialist in full stack and mobile development. He leads the mobile platform at Hootsuite. Get in touch via Twitter @adam_arsenault.

I have worked at several different software companies as a co-op student; however, this was the first time I was able to participate in a company-wide hackathon. I have always wanted to participate, but I always seem to arrive at a company just after its hackathon has finished or leave just before one starts.

Last October, Hootsuite ran #SuiteHacks, a company-wide hackathon with the theme of Customers and Tools. I joined a team mostly made up of my fellow Ops teammates and a person from our Security team. The idea we decided to hack on was automating instance startup tasks using AWS Lambdas. The serverless, ephemeral nature of Lambdas matches perfectly with automating relatively infrequent and quick tasks. It was also an excuse for us to play around a little bit more with one of Amazon’s newer services.

The Design

The idea behind this hack was a system where an instance can announce a change in state, and the system responds by running tasks for the instance based on that state change. When an instance goes down, we can verify and automate cleanup and remediation where needed. When an instance is created, we can verify and automate different startup tasks. Doing so automates tedious tasks that are important to infrastructure health. The diagram below gives a general idea of what the system looks like:

The system starts when the instances send information to a centralized Enrichment SNS topic. In this model, servers would be able to trigger this data transfer based on different events in the server’s life cycle. In this project, the event used to demo this functionality was server startup. On startup, the server gathers the data that was assigned when the instance was provisioned and pushes it to the Enrichment topic. The Enricher receives this data and proceeds to gather more data about the instance from AWS, based on the identifying data sent from the instance. This is done using AWS’ Describe Instance API. The data is then formatted and standardized, ensuring a common structure for the plugins to expect. The Enricher job is also the point in the process where data sent from the instance is verified. Specifically, the state that the instance reports being in needs to be verified. This is done by cross-referencing the data sent by the instance with the data that AWS provides about the instance. If there is a discrepancy, the data does not move forward in the process. Some form of alerting would also be needed to surface these errors. Once verified, the structured data is pushed to an SNS topic that triggers the plugins.
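
As a rough illustration of that flow, here is a hedged sketch of what such an Enricher Lambda might look like in Python; the topic ARN, message fields, and enrichment details are illustrative, not the code we wrote at the hackathon.

import json

import boto3

ec2 = boto3.client('ec2')
sns = boto3.client('sns')
PLUGIN_TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:plugin-topic'  # hypothetical


def handler(event, context):
    # SNS delivers the instance's self-reported data as a JSON string
    reported = json.loads(event['Records'][0]['Sns']['Message'])
    instance_id = reported['instance_id']

    # Ask AWS for the authoritative view of the instance
    reservations = ec2.describe_instances(InstanceIds=[instance_id])['Reservations']
    instance = reservations[0]['Instances'][0]

    # Cross-reference the reported state against what AWS says before moving on
    if reported.get('state') == 'startup' and instance['State']['Name'] != 'running':
        raise ValueError('Reported state does not match AWS for ' + instance_id)

    # Format and standardize the record so every plugin sees the same structure
    enriched = {
        'instance_id': instance_id,
        'state': reported.get('state'),
        'private_ip': instance.get('PrivateIpAddress'),
        'tags': {t['Key']: t['Value'] for t in instance.get('Tags', [])},
    }

    # Fan the enriched record out to every plugin subscribed to the topic
    sns.publish(TopicArn=PLUGIN_TOPIC_ARN, Message=json.dumps(enriched))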

The plugin jobs are triggered by data being published to the Plugin SNS topic. Because SNS is a publish/subscribe system, each plugin receives a copy of the data published to the topic. Each plugin processes the enriched data to perform a specific function and runs only for the states it cares about. For example, startup tasks only proceed if the data sent specifies the proper state.
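
An individual plugin can then stay very small. A hedged sketch, again with illustrative names only:

import json


def handler(event, context):
    enriched = json.loads(event['Records'][0]['Sns']['Message'])

    if enriched.get('state') != 'startup':
        return  # this plugin only cares about instance start-up

    # ... perform the start-up task here, e.g. register DNS and Reverse DNS
    # records for enriched['private_ip'], or verify the instance's tags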

The POC

During the hackathon we were able to implement a simple version of this model to create a system that allows servers to trigger multiple tools on startup. In the proof of concept we focused on some common tasks such as DNS, Reverse DNS, and tag verification. We also created a simple plugin to calculate the monthly costs of these resources and place that information in Slack.

The Learnings

The hackathon started with the team deciding how to send data to the system. One option was to utilize AWS Config as the trigger for the Enricher Lambda. AWS Config is a service that provides AWS configuration management and history. A lot of the tasks that we want to automate are tied to configuration changes on different instances, so it made sense to investigate the option. At first glance, AWS Config’s integration with AWS Lambdas would make it a good choice; however, the delay between the change happening and when the change is registered in Config was too long. In the worst case it was several minutes before the change was registered and, in turn, triggered automation. For some tasks we wanted to trigger the process as soon as possible, specifically DNS and Reverse DNS. Because of this, we decided to trigger events through a simple script located on the instance. The script gathers information about the instance and then sends it to the Enricher SNS topic and the automation system that follows.

The Future

Since it was for a hackathon we kept it simple, but this system can easily be expanded to a multitude of applications. An expansion to the DNS plugin idea was one that would ensure that any DNS or Reverse DNS records are removed on instance termination. Other applications that we considered during the hackathon were automatically registering and deregistering instances from monitoring systems or health checks on startup. Another advantage of this system is that it is language and workflow agnostic. AWS Lambdas allow you to create jobs in multiple languages, and all of them can ingest data from SNS topics. Furthermore, the Lambdas themselves do not need to follow a specific workflow to achieve their tasks; the jobs can be designed to best meet the needs of each task. The architecture is plug and play, allowing for a lot of flexibility. Since this system is built on AWS Lambdas, it scales to meet the needs of your growing system. With this flexibility, there are a number of tasks around scaling infrastructure up and down that could be automated in this simple and scalable manner.

About the Author

Stuart was a co-op at Hootsuite in the Fall of 2016. He’s currently studying Management Engineering at The University of Waterloo, and graduates in 2017.

One memorable moment I had while on the APIs and Integrations team at Hootsuite was trying to debug a mysterious HTTP 500 response coming from one of our endpoints. Normally, one would check the logs of the API service, or perhaps navigate to its error handling code to see under which circumstances a 500 should be returned. The problem was that errors were generated at several different levels, from several different services and even worse, were modified repeatedly as they propagated to the end user.

As you can imagine, debugging this was a nightmare and involved some creative uses of print statements scattered throughout the code base. I did eventually figure out the problem (converting years prior to 1970 didn’t play well with UNIX timestamps), but boy was it a lot more difficult than it should have been. It got me thinking, and it led to some conversations with the team about how the decisions we made led us to where we are today. Some of those conversations inspired this blog post, which shares the rationale behind our design choices and our experiences for anyone in the same position we were in.

The proper API-building mindset.

The first run of the API was built for a single customer and contained some basic user management functionality across five endpoints. It met the customer’s needs and worked well enough. However, due to inexperience and time pressure, the endpoints contained several inconsistencies in data types, error models, and field names. Looking back, it’s clear now that some endpoints also lacked RESTful patterns. For example, we often didn’t embed resource IDs into URIs, preferring to pass them as parameters instead.

An example of unRESTful parameter passing.

At the time of writing, our API comprises 30 endpoints; we have expanded our original user management API and added support for messages, social profiles, and media as well. Along the way, we’ve also made significant improvements in our API’s consistency, RESTfulness, and developer experience. Here are four lessons we’ve learned as a result.

Lesson 1: Focus on your errors and focus on them early

While not usually a major talking point during preliminary design, focusing early on hammering out an error schema can prevent a lot of headaches down the line. We fell victim to the trap of not establishing clear guidelines for how we’d like our errors to look. Without an overarching plan, our errors grew organically as we supported integration with our various microservices. Read More …
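
For illustration only, a consistent error envelope along these lines might look something like the following; the field names are hypothetical and not Hootsuite’s actual schema.

{
  "error": {
    "code": "INVALID_TIMESTAMP",
    "message": "createdTime must be a date on or after 1970-01-01",
    "requestId": "3f6c2a1e-0b1d-4c9a-9f1e-8a2b3c4d5e6f",
    "details": []
  }
}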

In a modern age where the computer is considered one of man’s best friends, there lives a boy who works as a software developer. He gets up in the morning full of excitement and hurries to the office. Upon arrival, the boy logs into his lovely computer to start bundling his code – combining the JavaScript files into a single file. He checks that bundling has started and casually walks away to grab a cup of coffee. A few moments later, when he comes back with his aromatic coffee, he notices that bundling has still not finished!

At that moment, rather than a few sips of coffee running down his throat, he experiences something else rolling down from his eyes. “Sacrilege”, the boy whispers with subtle but clear hints of bitterness in his voice. ‘Slower than a coffee machine? This execution time is a bold mockery in the holy realm of programmers – who are obligated to optimize.’

Concluding that something must be done, the boy hesitates no more and quickly turns to the ‘savior’ of all coders – Google. ‘Oh thy Google, have mercy on such undignified bundling time.’ Soon after, Google blesses the boy with a Babel Precompilation feature. Too glaringly bright to even glimpse at, the boy humbly accepts the feature and never again has to wait for the code to bundle with a fresh coffee present in his mug. A few days later, still treasuring the moment of epiphany, he sets out to teach others what he has learned in the hope that no one will ever need to suffer like he did. Okay, so the boy lived happily ever after… Now onto the technical stuff that you were expecting to read.

Overview

First, I would like to introduce some terms that are crucial to understanding this blog post: compilation and pre-compilation, what they mean, and how they are used with JavaScript. JavaScript, as an interpreted language, doesn’t require a compiler, since the browser essentially compiles the necessary parts of the code on the fly. When mentioning compilation in the context of JavaScript, it is natural and perfectly safe to refer to it by a different name: transpilation. Transpilation is the process of translating code into a different version of code. In most cases it involves converting a newer version of a language, like ES6, into a more compatible one, like ES5, which a greater number of browsers are able to understand. This is where Babel compilation steps in to compile the original source code so that all the browsers are happy.

The beauty of Babel compilation is well noted; one might wonder why we would need anything else. The thing is that if we stop at compilation only, the boy from our story will forever weep as he waits for his npm script to finish. What is missing? PREcompilation. At a big company such as Hootsuite, we have hundreds of thousands of lines of JavaScript code organized into a number of sub-projects. Every day, we merge in files for new features, bug fixes, and more, with sprinkles of mystery in every sub-project. The problem is evident and inevitable – the entire project must be compiled every time we set up a working environment! This is where the Babel Precompilation feature takes charge to change our lives; a programmer’s life is too short to be spent waiting for entire projects to compile. Read More …