How did this idea get started?

When you are running more than 1,500 servers in AWS with no consistent standard for creating them, it is really hard for the Operations Developers to get an insightful view of the system inventory on each machine. How do I know when an instance needs to be patched? What if a package with unpatched vulnerabilities is installed on several servers? To answer these questions, the Ops team wanted a solution to monitor and gather inventory information on our servers.

Why use AWS Simple Systems Manager (SSM)?

Why not use Puppet or other tools? Although some configuration management tools like Puppet and Chef already gather inventory information on their clients, they just don’t fit with Hootsuite’s Ansible-based ecosystem. Setting up an additional configuration management tool and using it only for a small use case like this would be overkill. So what could be a good option that is efficient and requires little configuration?

How about running bash scripts as a cron job to collect the required information (packages, CVEs, OS version, etc.) on each system? First of all, the bash script for gathering CVEs takes at least 20 minutes to run, and 90% of the packages on most instances are the same, so it is not ideal to have every instance gather duplicate information. Secondly, what if the script needs to be changed in the future? Is there a better way than re-deploying it to every machine? Eventually, I came across Simple Systems Manager, a service that AWS recently launched to help users automate management tasks at no additional charge.

How to achieve all of these?

Prerequisites of SSM: an agent needs to be installed on every instance, and an IAM role needs to be attached to the instance so that it is granted access to SSM and appears in the console in AWS.

AWS SSM has a cool feature called “Send Command”, which allows users to run bash scripts on target machines without establishing an SSH connection to them; the same command can be sent to as many machines as the user wants. Documents define the actions that SSM will perform on an instance, and they can be associated with EC2 instances as scheduled tasks. The bash script for gathering packages and system info is embedded into an SSM document as a parameter, and the document is then associated with all instances in the SSM console. The diagram below is a visual representation of the idea.
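To make the idea of a document more concrete, here is a minimal sketch of what the content of a Command document could look like when built in Python. The function name, the step name, and the two-line placeholder script are assumptions for illustration, and the sketch embeds the script lines directly in the step instead of passing them as a parameter to keep it short; it is not the document used in this project.

```python
import json


def build_inventory_document(script_lines, description="Collect package and system info"):
    """Return SSM Command document content (schema 2.2) that runs script_lines."""
    return {
        "schemaVersion": "2.2",
        "description": description,
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "collectInventory",  # hypothetical step name
                "inputs": {"runCommand": list(script_lines)},
            }
        ],
    }


# Hypothetical two-line script standing in for the real inventory script.
content = json.dumps(build_inventory_document(["#!/bin/bash", "uname -r"]), indent=2)
print(content)
```

A string like `content` could then be registered through Boto3 with `ssm.create_document(Content=content, Name='upload_pkg_info', DocumentType='Command')`, where `upload_pkg_info` is the document name that appears later in this post.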

What do we need to collect?

Besides basic system info such as installed packages, OS name and version, and uptime, the CVEs (Common Vulnerabilities and Exposures) for all packages also need to be collected from each instance, as they are crucial for the Security team to determine potential vulnerabilities of Hootsuite servers.

Implementation

Workflow Diagram:

Uploading Data to DynamoDB:

After collecting all the required information, the bash script generates a JSON file that contains the data and uploads it to DynamoDB using the AWS CLI. The JSON object is strictly formatted to match the requirements of DynamoDB, which look like:

{"Key":{"Data_Type":"Value"},
"Attribute1":{"Data_Type":"Value"},
"Attribute2":{"Data_Type":"Value"},
"Attribute3":{"Data_Type":"Value"},
"Attribute4":{"Data_Type":"Value"},
"Attribute5":{"Data_Type":"Value"}}

The object also contains a timestamp attribute, “TTL” (time to live), for auto-expiration of terminated instances in the DB. This attribute is important because the bash script runs every 5 days to update the information. If the “TTL” attribute has not been updated by the 6th day, it likely means that the instance has been terminated or stopped, so the database will remove the item to save space.
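As a rough illustration of the typed format and the TTL attribute, the sketch below converts a plain Python dictionary into DynamoDB's attribute-value JSON and adds an expiry timestamp. The function names, the example instance id, and the six-day expiry window are assumptions made for illustration; the project itself builds this JSON in the bash script.

```python
import json
import time

SECONDS_PER_DAY = 24 * 60 * 60
EXPIRY_DAYS = 6  # assumption: the script runs every 5 days, so expire on the 6th


def to_attribute_value(value):
    """Convert a plain Python value into DynamoDB's typed JSON form."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are transmitted as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    raise TypeError("unsupported type: %r" % type(value))


def build_item(record, now=None):
    """Return a DynamoDB item for record with a 'ttl' expiry attribute added."""
    if now is None:
        now = time.time()
    item = {key: to_attribute_value(value) for key, value in record.items()}
    # TTL is an epoch timestamp in seconds, stored as a Number attribute
    item["ttl"] = {"N": str(int(now) + EXPIRY_DAYS * SECONDS_PER_DAY)}
    return item


# Hypothetical record; the attribute names follow the sample object in this post.
record = {
    "instance_id": "i-0123456789abcdef0",
    "os": {"name": "Ubuntu", "version": "14.04.5"},
}
print(json.dumps(build_item(record), indent=2))
```

Each run of the script would write a fresh `ttl` value, which is what keeps a live instance from expiring. DynamoDB deletes expired items in the background, so an item can linger for a while after its timestamp has passed. Boto3 also ships a comparable converter, `boto3.dynamodb.types.TypeSerializer`, for code that already depends on the SDK.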

Sample JSON Object:

{"instance_id":{"S":"instance_id"},
"runstatus": {"S": "True"},
"ttl": {"N": "1493081195"},
"os": {"M":{"name": {"S":"Ubuntu"}, "version": {"S":"14.04.5"}}},
"uptimebydays": {"S":" 84 "},
"pkg": {"M":{
"accountsservice":{"M":{ "pkgversion":{"S": "0.6.35-0ubuntu7.3"}, "status": {"S":"latest"}}}}}}

Create Association:

AWS Config Rules are used to monitor configuration changes in SSM. When an instance is created, it is automatically added to the SSM console in AWS, and the creation event is captured by AWS Config Rules to trigger a Lambda function called “ssm_association”. The event is passed into the Lambda function in JSON format, and the function can easily retrieve the instance id and event type to determine whether the association needs to be created. The Lambda function then uses Boto3 (the AWS SDK for Python) to create the association.

Instance Creation Event:

{'configRuleId': 'config-rule-g3xyel',
 'version': '1.0',
 'configRuleName': 'create_ssm_assoiciation',
 'configRuleArn': 'arn:aws:config:us-east-1:1111111111:config-rule/config-rule-g3xyel',
 'invokingEvent': '{"configurationItemSummary":{"changeType":"CREATE","configurationItemVersion":"1.2","configurationItemCaptureTime":"2017-1111111111","configurationStateId":12345687654,"awsAccountId":"1111111111","configurationItemStatus":"OK","resourceType":"AWS::SSM::ManagedInstanceInventory","resourceId":"i-06ad1615134baaa2a","resourceName":null,"ARN":"arn:aws:ssm:us-east-1:1111111111:managed-instance-inventory/i-xxxxxxxx","awsRegion":"us-east-1","availabilityZone":null,"configurationStateMd5Hash":"6b1a5634c1f60482767fc239e4422ea4","resourceCreationTime":null},"s3DeliverySummary":null,"notificationCreationTime":"2017-04-10T20:05:02.891Z","recordVersion":"1.0"}',
 'eventLeftScope': False,
 'ruleParameters': '{"type":"ssm_testing"}',
 'executionRoleArn': 'arn:aws:iam::1111111111:role/AWSConfig',
 u'accountId': '11111111'}

Lambda Function:

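# The following code is the simplified Lambda function.
# It takes the instance id from the creation event, checks that the instance is running,
# and creates the association when the event type is "CREATE".
import json

import boto3

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')


def lambda_handler(event, context):
    # invokingEvent arrives as a JSON string, so it has to be parsed first
    invokingEvent = json.loads(event['invokingEvent'])

    # The sample event above carries the fields in 'configurationItemSummary';
    # other notifications carry 'configurationItem' and 'configurationItemDiff'
    if 'configurationItemSummary' in invokingEvent:
        summary = invokingEvent['configurationItemSummary']
        instanceid = summary['resourceId']
        changeType = summary['changeType']
    else:
        instanceid = invokingEvent['configurationItem']['resourceId']
        changeType = invokingEvent['configurationItemDiff']['changeType']

    # Make sure the instance is running. Instances that are not running are
    # omitted from the response by default, so the list may be empty.
    response = ec2.describe_instance_status(InstanceIds=[instanceid])
    print(response)
    statuses = response['InstanceStatuses']
    state = statuses[0]['InstanceState']['Name'] if statuses else None

    if state == "running" and changeType == "CREATE":
        print("Executing create_association")
        # create_association is an SSM call, so it needs the SSM client
        response = ssm.create_association(
            Name='upload_pkg_info',
            DocumentVersion='$LATEST',
            Targets=[
                {
                    'Key': 'InstanceIds',
                    'Values': [instanceid]
                },
            ],
            ScheduleExpression='cron(0 0 0/12 1/1 * ? *)'
        )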

Gathering CVE:

The Security team wants to collect not only unpatched CVEs for all installed packages but also those that are already patched. In fact, gathering CVEs has become the biggest bottleneck of the process, as I could not find any available database or API that I can query for all CVEs by package name and version number. The only known method is to use

apt-get changelog PKG_NAME

Downloading each changelog can take a few seconds, which adds up to an extremely long running time.
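The changelog is free text, so the CVE identifiers have to be picked out of it by pattern matching. A minimal sketch of that step in Python follows; the function name and the changelog excerpt are made up for illustration, and the identifiers in it are placeholders, not real advisories.

```python
import re

# CVE ids look like CVE-<year>-<sequence>; the sequence has four or more digits
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


def extract_cves(changelog_text):
    """Return the sorted, de-duplicated CVE identifiers found in changelog_text."""
    return sorted(set(CVE_PATTERN.findall(changelog_text)))


# Hypothetical changelog excerpt, not taken from a real package.
sample = """
  * SECURITY UPDATE: example issue
    - CVE-2099-0001
  * SECURITY UPDATE: another example issue
    - CVE-2099-12345
    - CVE-2099-0001
"""
print(",".join(extract_cves(sample)))
```

Because a changelog records fixes that have already shipped, the identifiers found this way correspond to CVEs that were patched in some version of the package.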

To solve this problem, another Lambda function is introduced to build a list of target instances, each paired with the packages it should look up. The Lambda function then calls SSM “Send Command” to run the bash script on each of those instances with its own package list, so that the task takes as little time and as few resources as possible.
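The post does not show how that list is built. One plausible approach, sketched below under the assumption that the goal is to look up each distinct package only once, is to walk the inventory and hand every package to the first instance that has it installed. The function name and the instance ids are hypothetical; `target_instance` mirrors the variable name used in the code further down.

```python
def assign_packages(inventory):
    """Map each instance to the packages it should look up.

    inventory: dict of instance id -> iterable of installed package names.
    Every distinct package is assigned to exactly one instance that has it.
    """
    target_instance = {}
    seen = set()
    for instance_id in sorted(inventory):
        for pkg in sorted(inventory[instance_id]):
            if pkg not in seen:
                seen.add(pkg)
                target_instance.setdefault(instance_id, []).append(pkg)
    return target_instance


# Hypothetical inventory of two instances that share one package.
inventory = {
    "i-aaa": ["python3", "apport"],
    "i-bbb": ["python3", "sensu"],
}
print(assign_packages(inventory))
```

This greedy assignment is simple but tends to load the first instances most heavily; spreading packages evenly across instances would shorten the longest run at the cost of a little more bookkeeping.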

# Example of sending the bash script to one of the target instances.
# target_instance maps each target instance id to its list of packages; it is generated by the Lambda function.
# The package list for `key` is turned into a bash array and combined with the bash script.
# The bash script cannot be parameterized in this case because the list of instances and packages might change every time we run the code.
import boto3

ssm = boto3.client('ssm')

pkg_list = " ".join(target_instance[key])
# basharray = 'pkgs=( python3 apport sensu ... )'
basharray = "pkgs=( " + pkg_list + " )"
get_cve = r'''
for i in "${pkgs[@]}"
do
    # match whole CVE ids (the sequence number can have more than four digits)
    CVE=$(sudo apt-get changelog "$i" | grep -oE "CVE-[0-9]{4}-[0-9]{4,}" | sort -u | paste -s -d, -)
    if [ -z "$CVE" ]; then
        echo "No CVE"
    else
        aws dynamodb update-item --table-name pkg_CVE \
            --key '{"pkg_name":{"S":"'"$i"'"}}' \
            --update-expression "SET cve_list=:y" \
            --expression-attribute-values '{":y":{"S":"'"$CVE"'"}}' \
            --return-values ALL_NEW \
            --region us-east-1
    fi
done
'''
# commands must be a list of strings: the array definition first, then the loop
commands = [basharray, get_cve]
response = ssm.send_command(
    InstanceIds=[key],
    DocumentName='AWS-RunShellScript',
    TimeoutSeconds=1800,
    Comment='get CVE from ' + key,
    Parameters={'commands': commands}
)

Limitation/Improvement:

  • Currently there is no API designed for this project, which means people have to pull data directly from the database.
  • An API layer would give more flexibility in designing the data structure: the API would define the data structure for the user, instead of it being strictly formatted in the DB.
  • SSM provides little feedback when creating associations between instances and documents.
  • Lambda functions could be linked with SNS topics to gather error messages and help troubleshoot the system.

Conclusion:

This project is a PoC for gathering system inventory using SSM and it can be optimized in many aspects. This is also a good test case of what SSM is capable of doing and we can clearly see the advantages such as parametrizing the bash script. I feel SSM is a tool that has some good potential and it can be leveraged to a higher level than just using it as a patching tool.

About the Author

Andy Han is an Operations Developer Co-op at Hootsuite. Andy studies Management Engineering at the University of Waterloo. Contact him on LinkedIn.