Posts from April 2017

Henry Ford once said, “Coming together is a beginning, staying together is progress, and working together is success.” It is based on this philosophy of embracing collaboration from start to finish that my team, along with numerous others at Hootsuite, has adopted an additional role in our Agile methodology.

To provide some background for those unfamiliar with Agile, an Epic is a large unit of work which, when completed, will provide a lot of value to the customer. Epics are then broken down into stories and tickets/tasks which developers will commit to and complete.

Every developer is encouraged to work on whichever task is highest in priority, which keeps work fluid and ensures each developer stays well-rounded. However, in any given sprint there can be numerous Epics in progress, as well as more being planned in the backlog, which often makes it difficult for product owners to maintain an accurate picture of each Epic's progress based on small, fragmented updates from each developer at scrum. Further, the process of conceiving a new feature often gets muddled as it is passed between design, growth, and management before finally arriving at the engineers. The solution to all these problems and more? The Epic Champion. Read More …

How did this idea get started?

When you are running more than 1500 servers in AWS with no consistent standard for creating them, it is really hard for Operations Developers to get an accurate view of the system inventory on each machine. How do I know when an instance needs to be patched? What if a package with unpatched vulnerabilities is installed on several servers? As a result, the Ops team wanted a solution to monitor and gather inventory information on our servers.

Why use AWS Simple System Manager (SSM)?

Why not use Puppet or other tools? Although some configuration management tools like Puppet and Chef already gather inventory information on their clients, they just don't fit with Hootsuite's Ansible-based ecosystem. Setting up an additional configuration management tool and using it only for a small use case like this is overkill. So what could be a good option that is efficient and requires little configuration?

How about running bash scripts as a cron job to collect the required information (packages, CVEs, OS version, etc.) on each system? First, the bash script for gathering CVEs takes at least 20 minutes to run, and 90% of the packages on most instances are the same, so it is not ideal to have every instance gathering duplicate information. Secondly, what if the script needs to be changed in the future? Is there a better way than re-deploying it to every machine? Eventually, I came across Simple System Manager, a service AWS recently launched to help users automate management tasks at no additional charge.

How to achieve all of this?

Prerequisites of SSM: An agent needs to be installed on every instance, and an IAM role needs to be attached to each instance so that it is granted access to SSM in the AWS console.

AWS SSM has a cool feature called “Send Command” which allows users to run bash scripts on target machines without establishing an SSH connection to them, and the same command can be sent to as many machines as the user wants. Documents define the actions that SSM will perform on an instance, and they can be associated with EC2 instances as scheduled tasks. The bash script for gathering packages and system info is embedded into SSM documents as parameters and then associated with all instances in the SSM console. The diagram below is a visual representation of the idea.
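As a rough sketch of how the document piece fits together, a custom command document can expose the script as a parameter and later be associated with instances. This is only an illustration using boto3 and the legacy 1.2 document schema; the parameter layout is an assumption, though 'upload_pkg_info' matches the document name referenced later in this post.

# Hypothetical sketch: register a command document that takes the bash script as a parameter.
import json
import boto3

ssm = boto3.client('ssm')

document_content = {
    "schemaVersion": "1.2",
    "description": "Gather package and system inventory",
    "parameters": {
        "commands": {
            "type": "StringList",
            "description": "Inventory-gathering bash script",
            "default": []
        }
    },
    "runtimeConfig": {
        "aws:runShellScript": {
            "properties": [
                {"id": "0.aws:runShellScript", "runCommand": "{{ commands }}"}
            ]
        }
    }
}

ssm.create_document(
    Content=json.dumps(document_content),
    Name='upload_pkg_info'
)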

What do we need to collect?

Besides basic system info such as installed packages, OS version/name, and uptime, the CVEs (Common Vulnerabilities and Exposures) for all packages also need to be collected from each instance, as they are crucial for the Security team to determine potential vulnerabilities of Hootsuite servers.

Implementation

Workflow Diagram:

Uploading Data to DynamoDB:

After collecting all required information, the bash script generates a JSON file that contains all the data and uploads it to DynamoDB using the AWS CLI. The JSON object is strictly formatted to match the requirements of DynamoDB, which looks like this:

{"Key":{"Data_Type":"Value"},
"Attribute1":{"Data_Type":"Value"},
"Attribute2":{"Data_Type":"Value"},
"Attribute3":{"Data_Type":"Value"},
"Attribute4":{"Data_Type":"Value"},
"Attribute5":{"Data_Type":"Value"}}

The object also contains a timestamp attribute, “TTL (time to live)”, for auto-expiration of terminated instances in the DB. This attribute is important because the bash script runs every 5 days to update the information. If the “TTL” attribute has not been updated by the 6th day, it likely means that the instance has been terminated or stopped, so the database will remove the item to save space.
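The expiry itself is handled by DynamoDB's built-in TTL feature, which only needs to be pointed at that numeric attribute once. A minimal sketch with boto3; the table name is a placeholder, not necessarily the real one:

import boto3

dynamodb = boto3.client('dynamodb')

# Tell DynamoDB to auto-expire items whose "ttl" epoch timestamp has passed.
dynamodb.update_time_to_live(
    TableName='pkg_inventory',
    TimeToLiveSpecification={
        'Enabled': True,
        'AttributeName': 'ttl'
    }
)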

Sample JSON Object:

{"instance_id":{"S":"instance_id"},
"runstatus": {"S": "True"},
"ttl": {"N": "1493081195"},
"os": {"M":{"name": {"S":"Ubuntu"}, "version": {"S":"14.04.5"}}},
"uptimebydays": {"S":" 84 "},
"pkg": {"M":{
"accountsservice":{"M":{ "pkgversion":{"S": "0.6.35-0ubuntu7.3"}, "status": {"S":"latest"}}}}}

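The actual upload happens from the bash script through the AWS CLI, but the equivalent call is easier to read in boto3. A sketch using the sample item above; the table name is again a placeholder:

import boto3

dynamodb = boto3.client('dynamodb')

# Equivalent of the `aws dynamodb put-item` call run by the bash script;
# the item follows the strict DynamoDB type format shown above.
dynamodb.put_item(
    TableName='pkg_inventory',
    Item={
        "instance_id": {"S": "i-06ad1615134baaa2a"},
        "runstatus": {"S": "True"},
        "ttl": {"N": "1493081195"},
        "os": {"M": {"name": {"S": "Ubuntu"}, "version": {"S": "14.04.5"}}},
        "uptimebydays": {"S": "84"},
        "pkg": {"M": {
            "accountsservice": {"M": {"pkgversion": {"S": "0.6.35-0ubuntu7.3"},
                                      "status": {"S": "latest"}}}}}
    }
)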
 

Create Association:

AWS Config Rules are used to monitor configuration changes in SSM. When an instance is created, it is automatically added to the SSM console in AWS, and the creation event is captured by AWS Config Rules to trigger a Lambda function called “ssm_association”. The event is passed into the Lambda in JSON format, so the Lambda can easily retrieve the instance id and event type to determine whether the association needs to be created. The Lambda function then uses Boto3 (the Python AWS SDK) to create the association.

Instance Creation Event:

{'configRuleId': 'config-rule-g3xyel',
 'version': '1.0',
 'configRuleName': 'create_ssm_assoiciation',
 'configRuleArn': 'arn:aws:config:us-east-1:1111111111:config-rule/config-rule-g3xyel',
 'invokingEvent': '{"configurationItemSummary":{"changeType":"CREATE",
     "configurationItemVersion":"1.2","configurationItemCaptureTime":"2017-1111111111",
     "configurationStateId":12345687654,"awsAccountId":"1111111111",
     "configurationItemStatus":"OK","resourceType":"AWS::SSM::ManagedInstanceInventory",
     "resourceId":"i-06ad1615134baaa2a","resourceName":null,
     "ARN":"arn:aws:ssm:us-east-1:1111111111:managed-instance-inventory/i-xxxxxxxx",
     "awsRegion":"us-east-1","availabilityZone":null,
     "configurationStateMd5Hash":"6b1a5634c1f60482767fc239e4422ea4",
     "resourceCreationTime":null},"s3DeliverySummary":null,
     "notificationCreationTime":"2017-04-10T20:05:02.891Z","recordVersion":"1.0"}',
 'eventLeftScope': False,
 'ruleParameters': '{"type":"ssm_testing"}',
 'executionRoleArn': 'arn:aws:iam::1111111111:role/AWSConfig',
 u'accountId': '11111111'}

Lambda Function:

# The following code is the simplified Lambda function.
# It takes the instance id from the creation event, checks that the instance is
# running, and creates the association when the event type is "CREATE".
import json

import boto3

def lambda_handler(event, context):
    # The invoking event arrives as a JSON string (see the sample event above).
    invoking_event = json.loads(event['invokingEvent'])
    configuration_item = invoking_event['configurationItemSummary']
    instanceid = configuration_item['resourceId']
    change_type = configuration_item['changeType']

    # Make sure the instance is running.
    ec2 = boto3.client('ec2')
    response = ec2.describe_instance_status(InstanceIds=[instanceid])
    state = response['InstanceStatuses'][0]['InstanceState']['Name']

    if state == "running" and change_type == "CREATE":
        print("Executing create_association")
        # create_association is an SSM API, so it needs an SSM client.
        ssm = boto3.client('ssm')
        response = ssm.create_association(
            Name='upload_pkg_info',
            DocumentVersion='$LATEST',
            Targets=[
                {
                    'Key': 'InstanceIds',
                    'Values': [instanceid]
                },
            ],
            ScheduleExpression='cron(0 0 0/12 1/1 * ? *)'
        )

   

Gathering CVE:

The Security team wants to collect not only the unpatched CVEs for all installed packages but also those that have already been patched. In fact, gathering CVEs became the biggest bottleneck of the process, as I could not find any available database or API where I could query all CVEs by package name and version number. The only known method is to use

apt-get changelog PKG_NAME

It can take a few seconds to download each changelog, which results in an extremely long total running time.

To solve this problem, another Lambda function is introduced to create a list of instances along with the packages installed on them. The Lambda function then calls SSM to invoke “Send Command” to run the bash script on each instance, so that the task requires minimal time and resources.

# Example of sending the bash script to one of the target instances.
# target_instance is a dict of target instances generated by the Lambda function;
# each key is an instance id, and its value (a list of packages) is turned into a
# bash array and prepended to the bash script.
# The bash script cannot be parameterized in this case because the list of instances
# and packages might change every time we run the code.
import boto3

client = boto3.client('ssm')

pkg_list = " ".join(target_instance[key])
# e.g. basharray = 'pkgs=( python3 apport sensu ... )'
basharray = "pkgs=( " + pkg_list + " )"
get_cve = '''
   for i in "${pkgs[@]}"
   do
      CVE=`sudo apt-get changelog $i | grep -o "CVE-.*" | cut -c1-13 | sort -u | paste -s -d, -`
      if [ -z "$CVE" ]; then
         echo "No CVE"
      else
        aws dynamodb update-item --table-name pkg_CVE \
           --key '{"pkg_name":{"S":"'$i'"}}' \
           --update-expression "SET cve_list=:y" \
           --expression-attribute-values '{":y":{"S":"'$CVE'"}}' \
           --return-values ALL_NEW \
           --region us-east-1
      fi
   done'''
# Combine the bash array declaration and the script into one shell command.
commands = [basharray + get_cve]
response = client.send_command(
    InstanceIds=[key],
    DocumentName='AWS-RunShellScript',
    TimeoutSeconds=1800,
    Comment='get CVE from ' + key,
    Parameters={'commands': commands}
)

 

Limitations/Improvements:

  • Currently there is no API designed for this project, which means people have to pull data directly from the database.
  • An API layer would give more flexibility in terms of designing the data structure: the API could define the data structure for the user instead of strictly formatting it in the DB.
  • SSM provides little feedback when creating associations between instances and documents.
  • Linking the Lambda functions to SNS topics to gather error messages would help troubleshoot the system.

Conclusion:

This project is a PoC for gathering system inventory using SSM, and it can be optimized in many aspects. It is also a good test case of what SSM is capable of, and we can clearly see advantages such as parameterizing the bash script. I feel SSM is a tool with good potential that can be leveraged well beyond its use as a patching tool.

Guy Drut. From hurdler49

Guy and the Gold Medal

The French Olympic hurdler, Guy Drut, found himself in an unenviable position in the early summer of 1976. He was France’s only hope for a track-and-field medal, and the burden of carrying the nation’s pride on his shoulders was getting to him. Drut later told me that he had spoken on several occasions prior to the games with our long-time client Jean-Claude Killy and that he really felt he owed a part of his gold medal to Killy. He explained it as follows: “Jean-Claude told me that I was the only one who knew how to get my body and mind to their ultimate peak for the Olympic Games. He then told me that after I had done this that I should keep saying to myself, ‘I have done everything I can to get ready for this race and if I win, everything will be great, but if I don’t win my friends will still be my friends, my enemies will still be my enemies, and the world will still be the same.’ I repeated this sentence to myself before the qualifying heats and during the break between the semi-finals and finals. I kept saying the sentence over and over, and it blocked out everything else. I was still repeating it to myself when I went up to get my gold medal.”

From the Fear of Failure passage in What They Don’t Teach You at Harvard Business School: Notes from a Street-smart Executive by Mark H. McCormack. Underlining is mine.

This isn’t the fear you’re looking for

Few of us have been in the starting blocks at the Olympics, but for many of us a similar level of anxiety can be brought on by the thought of presenting a technical demo in front of our fellow engineers – even our friends and colleagues.

By repeating those words, Drut downplayed the consequences of failure and detached his anxiety from his situation. Every time I go up on stage, I say those same words, for the exact same reason.

Working out loud is a good thing

Every Wednesday morning for the last four years our entire technical staff has gotten together for Demos and an All Hands. Engineers sign up to give 5-minute demonstrations of new product functionality or internal tooling and then take questions from the audience. After the Demos we move on to announcements and awards. Over the last four years we’ve done upwards of 440 demos. I’ve watched almost all of them.

For everyone who attends this session, it celebrates people and accomplishments; it drives alignment around mission, strategy and priorities; and finally, it provides a forum to ask and answer questions. (Thanks for that excellent article Gokul).

Each presenter needs to make the most of this opportunity because talking about your work is as important as the work itself. The challenge is to get so good at presenting a technical demo that others feel compelled to celebrate your work, change their outlook, and share your story. That means making it succinct, informative, and relevant.

The leap from paper to the stage is huge – the way our ideas sound in our head is not at all how they sound out loud. Here are five ways to elevate a mediocre technical demo to a great one. Read More …

Do you build microservices in Golang? If so, today is your lucky day as we have just open sourced our Go Health Checks Framework which implements our standard Health Checks API.

What is it?

The Go Health Checks Framework is a declarative, extendable health checking framework written in Go that provides a simple way to register dependencies as status endpoints and integrate them into an existing microservice that uses either the standard net/http package or the Gin Framework.

Monitor From the Inside

The Health Checks API helps you monitor your service health from the inside by exposing a set of standardized endpoints at “/status/…” that can be monitored using any monitoring framework.

Monitor your microservice from the inside.

We have found that the best way to monitor the health of a microservice is from the inside, because the service itself is the single source of truth for its health. If you’re not monitoring from the inside, then you are inferring the health of the microservice, and that comes with its own problems. Not convinced? Watch this talk by Kelsey Hightower titled Stop reverse engineering applications and start monitoring from the inside.

Getting Started

Using the Go Health Checks Framework in your Golang microservice is easy:

  1. Define a StatusEndpoint for each dependency in your microservice.
  2. Configure framework options and register the Health Checks framework to respond to all /status/… requests passing a slice of all your StatusEndpoints.

That’s it! As long as you have defined your StatusEndpoints correctly, the framework will take care of the rest.

Not Just Monitoring

The Health Checks API enables more than just microservice monitoring and gives you the power to explore, debug, and document your ever changing architecture. Below is a demo video of a tool we use that displays a dashboard for each microservice in a distributed application and lets developers/ops navigate the microservice graph in real time. This dashboard not only shows information about the services in your graph, but also displays the current status of each microservice in the graph and its dependencies. The open sourcing of this tool is coming soon!

Want to learn more?

Watch my full talk from DevOpsDaysYVR where I go over the Health Checks API and demo a tool we use to explore microservice graphs in real time.

Links

  • Health Checks API – A cross language standard for checking health in a distributed application
  • Go Health Checks Framework – A Golang implementation of the Health Checks API used for microservice exploration, documentation and monitoring.
About the Author
Adam Arsenault is a senior specialist in full stack and mobile development. He leads the mobile platform at Hootsuite. Get in touch via Twitter @adam_arsenault

I have worked at several different software companies as a co-op student; however, this was the first time I was able to participate in a company-wide hackathon. I had always wanted to participate, but I always seemed to join a company just after its hackathon had finished or leave just before the next one started.

Last October, Hootsuite ran #SuiteHacks, a company-wide hackathon with the theme of Customers and Tools. I joined a team mostly made up of my fellow Ops teammates and a person from our Security team. The idea we decided to hack on was automating instance startup tasks using AWS Lambdas. The serverless, ephemeral nature of Lambdas matches perfectly with automating relatively infrequent and quick tasks. It was also an excuse for us to play around a little bit more with one of Amazon’s newer services.

The Design

The idea behind this hack was a system in which an instance can announce a change in state, and the system responds by running tasks for the instance based on that state change. When an instance goes down, we can verify and automate cleanup and remediation where needed. When an instance is created, we can verify and automate various startup tasks. In doing so, we would be automating tedious tasks that are important to infrastructure health. The diagram below gives a general idea of what the system looks like:

The system starts when an instance sends information to a centralized Enrichment SNS topic. In this model, servers would be able to trigger this data transfer based on different events in the server’s life cycle; in this project, the event used to demo the functionality was server startup. On startup, the server gathers the data that was assigned when the instance was provisioned and pushes it to the Enrichment topic. The Enricher receives that data and proceeds to gather more data about the instance from AWS, based on the identifying data sent from the instance, using AWS’ Describe Instances API. The data is then formatted and standardized, ensuring a common structure for the plugins to expect. The Enricher job is also the point in the process where data sent from the instance is verified; specifically, the state that the instance reports being in needs to be verified. This is done by cross-referencing the data sent by the instance with the data that AWS provides about the instance. If there is a discrepancy, the data does not move forward in the process. Some form of alerting would also be needed to surface these errors. Once verified, the structured data is pushed to an SNS topic that triggers the plugins.
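A minimal sketch of what such an Enricher Lambda might look like, assuming the instance publishes a small JSON payload (its instance id plus the state it claims to be in) and that the Plugin topic ARN is supplied via configuration; none of these names come from the actual hackathon code:

import json
import os

import boto3

ec2 = boto3.client('ec2')
sns = boto3.client('sns')

# Placeholder; in the hack this would point at the Plugin SNS topic.
PLUGIN_TOPIC_ARN = os.environ.get('PLUGIN_TOPIC_ARN')

def handler(event, context):
    # SNS delivers the instance's payload as a JSON string in the message body.
    payload = json.loads(event['Records'][0]['Sns']['Message'])
    instance_id = payload['instance_id']

    # Enrich: pull the authoritative description of the instance from AWS.
    reservations = ec2.describe_instances(InstanceIds=[instance_id])['Reservations']
    instance = reservations[0]['Instances'][0]

    # Verify: the state the instance claims must match what AWS reports.
    if instance['State']['Name'] != payload.get('state'):
        print("State mismatch for %s, dropping event" % instance_id)
        return

    enriched = {
        'instance_id': instance_id,
        'state': instance['State']['Name'],
        'tags': instance.get('Tags', []),
        'private_ip': instance.get('PrivateIpAddress'),
    }
    # Publish the standardized record to the Plugin topic for fan-out.
    sns.publish(TopicArn=PLUGIN_TOPIC_ARN, Message=json.dumps(enriched))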

The Plugin jobs are triggered by data being published to the Plugin SNS topic. Because SNS is a publish/subscribe system, each plugin receives a copy of the data published to the topic. Each plugin processes the enriched data to perform a specific function, running only for the states the instance reports; for example, startup tasks proceed only if the data sent specifies the proper state.
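A hypothetical startup plugin could be as small as the sketch below; it simply ignores any event whose state is not "running". The names are illustrative, not the actual hackathon code:

import json

def handler(event, context):
    # Each plugin receives its own copy of the enriched record from the Plugin topic.
    enriched = json.loads(event['Records'][0]['Sns']['Message'])

    # This plugin only cares about startup events; ignore every other state.
    if enriched.get('state') != 'running':
        return

    # Placeholder for the plugin's real work, e.g. registering DNS or verifying tags.
    print("Running startup tasks for %s" % enriched['instance_id'])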

The POC

During the hackathon we were able to implement a simple version of this model: a system that allows servers to trigger multiple tools on startup. In the proof of concept we focused on some common tasks such as DNS, reverse DNS, and tag verification. We also created a simple plugin to calculate the monthly costs of these resources and place that information in Slack.

The Learnings

The hackathon started with the team deciding how to send data into the system. One option was to use AWS Config as the trigger for the Enricher Lambda. AWS Config is a service that provides AWS configuration management and history, and a lot of the tasks we want to automate are tied to configuration changes on different instances, so it made sense to investigate the option. At first glance, AWS Config’s integration with AWS Lambdas would make it a good choice; however, the delay between a change happening and the change being registered in Config was too long. In the worst case it was several minutes before the change was registered and, in turn, triggered automation. For some tasks, we wanted to trigger the process as soon as possible, specifically DNS and reverse DNS. Because of this, we decided to trigger events through a simple script located on the instance. The script gathers information about the instance and then sends it to the Enricher SNS topic and the automation system that follows.
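The publishing side of that script can be just a few lines run on boot. A sketch under the same assumptions as above; the topic ARN and payload shape are placeholders:

import json

import boto3
import requests

# EC2 instance metadata service; gives the script its own instance id.
METADATA_URL = 'http://169.254.169.254/latest/meta-data/instance-id'
ENRICHMENT_TOPIC_ARN = 'arn:aws:sns:us-east-1:111111111111:enrichment'  # placeholder

instance_id = requests.get(METADATA_URL, timeout=2).text
sns = boto3.client('sns')
sns.publish(
    TopicArn=ENRICHMENT_TOPIC_ARN,
    Message=json.dumps({'instance_id': instance_id, 'state': 'running'})
)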

The Future

Since it was for a hackathon we kept it simple, but this system can easily be expanded to a multitude of applications. One expansion of the DNS plugin idea would ensure that any DNS or reverse DNS records are removed on instance termination. Other applications we considered during the hackathon were automatically registering and deregistering instances from monitoring systems or health checks on startup. Another advantage of this system is that it is language and workflow agnostic: AWS Lambdas let you create jobs in multiple languages, and all of them can ingest data from SNS topics. Furthermore, the Lambdas themselves do not need to follow a specific workflow to achieve their task; each job can be designed to best meet the needs of the task. The architecture is plug and play, allowing for a lot of flexibility. Since this system is built on AWS Lambdas, it scales to meet the needs of your growing infrastructure. With this flexibility, there are a number of tasks around scaling infrastructure up and down that could be automated in this simple and scalable manner.

About the Author

Stuart was a co-op at Hootsuite in the Fall of 2016. He’s currently studying Management Engineering at The University of Waterloo, and graduates in 2017.

One memorable moment I had while on the APIs and Integrations team at Hootsuite was trying to debug a mysterious HTTP 500 response coming from one of our endpoints. Normally, one would check the logs of the API service, or perhaps navigate to its error handling code to see under which circumstances a 500 should be returned. The problem was that errors were generated at several different levels, from several different services and even worse, were modified repeatedly as they propagated to the end user.

As you can imagine, debugging this was a nightmare and involved some creative use of print statements scattered throughout the code base. I did eventually figure out the problem (converting years prior to 1970 didn’t play well with UNIX timestamps), but boy was it a lot more difficult than it should have been. It got me thinking, and it led to some conversations with the team about how the decisions we made led us to where we are today. Some of those conversations inspired this blog post, which shares the rationale behind our design choices and our experiences for anyone in the same position we were in.

The proper API-building mindset.

The first run of the API was built for a single customer and contained some basic user management functionality across five endpoints. It met the customer’s needs and worked well enough. However, due to inexperience and time pressure, the endpoints contained several inconsistencies in data types, error models, and field names. Looking back, it’s clear now that some endpoints also lacked RESTful patterns. For example, we often didn’t embed resource IDs into URIs, preferring to pass them as parameters instead.

An example of unRESTful parameter passing.
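To make the contrast concrete, here is a hypothetical illustration (not one of our actual endpoints) of the two styles:

# unRESTful: the resource id is passed as a query parameter
GET /members?memberId=12345

# RESTful: the resource id is embedded in the URI
GET /members/12345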

At the time of writing, our API comprises 30 endpoints; we have expanded our original user management API and added support for messages, social profiles, and media as well. Along the way, we’ve also made significant improvements in our API’s consistency, RESTfulness, and developer experience. Here are four lessons we’ve learned as a result.

Lesson 1: Focus on your errors and focus on them early

While not usually a major talking point during preliminary design, focusing early on hammering out an error schema can prevent a lot of headaches down the line. We fell victim to the trap of not establishing clear guidelines for how we’d like our errors to look. Without an overarching plan, our errors grew organically as we supported integration with our various microservices. Read More …

In a modern age where the computer is considered one of man’s best friends, there lives a boy who works as a software developer. He gets up in the morning full of excitement and hurries to the office. Upon arrival, the boy logs into his lovely computer to start bundling his code – combining the JavaScript files into a single file. He checks that bundling has started and casually walks away to grab a cup of coffee. A few moments later, when he comes back with his aromatic coffee, he notices that the bundling has still not finished!

At that moment, rather than a few sips of coffee running down his throat, he experiences something else rolling down from his eyes. “Sacrilege”, the boy whispers with subtle but clear hints of bitterness in his voice. ‘Slower than a coffee machine? This execution time is a bold mockery in the holy realm of programmers – who are obligated to optimize.’

Concluding that something must be done, the boy hesitates no more and quickly turns to the ‘savior’ of all coders – Google. ‘Oh thy Google, have mercy on such undignified bundling time.’ Soon after, Google blesses the boy with a Babel precompilation feature. Too glaringly bright to even glimpse at, the boy humbly accepts the feature and never again has to wait for the code to bundle with a fresh coffee present in his mug. A few days later, still treasuring the moment of epiphany, he sets out to teach others what he has learned in the hope that no one will ever need to suffer as he did. Okay, so the boy lived happily ever after… Now onto the technical stuff that you were expecting to read.

Overview

First, I would like to introduce some terms that are crucial to understanding this blog post: compilation and precompilation, what they mean, and how they are used with JavaScript. JavaScript, as an interpreted language, doesn’t require a compiler, since the browser essentially compiles the necessary parts of the code on the fly. When mentioning compilation in the context of JavaScript, it is natural and perfectly safe to refer to it by a different name: transpilation. Transpilation is the process of translating code into a different version of code. In most cases it involves converting a newer version of a language, like ES6, into a more compatible one, like ES5, which a greater number of browsers are able to understand. This is where Babel compilation steps in to compile the original source code so that all the browsers are happy.

The beauty of Babel compilation is well noted; one might wonder why we would need anything else. The thing is, if we stop at compilation only, the boy from our story will forever weep as he waits for his npm script to finish. What is missing? PREcompilation. At a big company such as Hootsuite, we have hundreds of thousands of lines of JavaScript code organized into a number of sub-projects. Every day, we merge in files for new features, bug fixes, and more, with sprinkles of mystery in every sub-project. The problem is evident and inevitable – compilation of the entire project must happen every time to set up a working environment! This is where the Babel precompilation feature takes charge and changes our lives; a programmer’s life is too short to be spent waiting for entire projects to compile. Read More …