By Noel Pullen on April 28, 2017
After any outage or severe incident at Hootsuite, we take a closer look at the event to uncover valuable lessons and data. We do this by working backwards, asking “Why?” to find the most likely root causes for the problem. Then we put in place incremental improvements to bulletproof ourselves against a reoccurrence.
Before I explain our mindset, practice, and step-by-steps, let me tell you a story.
Oh Shit
I’m sure you’ve felt this way before…
On April 6th, 2017, my spidey-sense started to tingle, my hands began to sweat, and I had that sinking feeling in the pit of my stomach: all the physical and emotional signs of having done something wrong.
Turns out, I brought down part of our site for 39 minutes. More specifically, our users were unable to post to LinkedIn or to add new LinkedIn accounts. I had inadvertently deleted the Hootsuite app from the LinkedIn Developer Portal because I thought I was deleting my developer access to the app. Fortunately for our customers, this happened in the evening, around 6:30 PM PDT, when our site is not under heavy load.
What happened next says everything about our culture.
Everyone came together.
- One of our Product Managers helped me verify the degraded functionality.
- Our VP Engineering calmly directed me to our Strategic Partnership team who have a relationship with LI.
- Our SVP Strategy and Corporate Development tried to get in touch with a contact at LinkedIn while I submitted a support ticket to LinkedIn Developer Support.
- One of our advocates from Customer Support let us know about the first customer reports of a problem in our Slack #outage channel.
- Developers, Operations, and others from across the company immediately responded in #outage and offered to help.
The Fix
Turns out emailing LinkedIn Developer Support was the key to a fast resolution. Danthanh Tran from LI was able to “undelete” our Production, Staging, and Development apps in a matter of minutes.
Whose fault was it?
That’s a trick question. As my cold sweats told me, I felt responsible, but no one is to blame. This is key to the culture of a team that improves systems quickly.
The next day, a member of our Ops team facilitated a blameless post mortem called a “5 Whys” that focused on the timeline, data about the event, what went wrong, and how we can fix our system so the same issue doesn’t happen again.
Note the words: fix our system, not fix me. Like most humans, I’m well intentioned but fallible. Many of us could have made that error. In fact, having personal social network accounts able to administer the one key connection to our social networks has been a time bomb waiting to go off. This event is a catalyst for long-overdue change.
“Human error is a symptom—never the cause—of trouble deeper within the system” – Dave Zwieback in an article on First Round Review.
So the question is: how can we bulletproof our system so the next person doesn’t have the same inadvertent nightmare? Dan Milstein puts it more humorously in How to Run a 5 Whys (With Humans, Not Robots): “Let’s plan for a future where we are all as stupid as we are today” 🙂 Focusing on the system instead of the person removes the stigma of speaking up when we feel guilty about an accident. This is key because accidents contain valuable information and should be “seen as a source of data, not something embarrassing to shy away from”, as John Allspaw puts it.
A Note on the title of this outage
Two people pointed out the incongruity of saying we’re blameless but using a name in the title “That time Noel Deleted LinkedIn”. Let me explain. This is the only time we put a name to a 5 Whys. It was done as a satirical tribute to me, expressly for my contributions to our culture of blameless post-mortems.
Our Mindset and Culture on 5 Whys
To surface all the valuable data our culture must remain “open, with a willingness to share information about … problems without the fear of being nailed for them.” – Sidney Dekker in The Field Guide to Understanding ‘Human Error’.
Trust by default
- Believe that people are doing the best they can with the information they have.
Provide psychological safety
- How many of you would have admitted to causing a problem that brought down production? Do you feel you can take the risk to tell people without feeling insecure or embarrassed? Censoring ourselves means we limit what we can do.
- As my colleague Geordie put it: “only people who feel psychologically safe in a team and can be themselves, are capable of doing their best work.”
Helpfulness is the secret weapon
- Helpfulness feels great and it is an indicator of highly productive teams. Margaret Heffernan, in her TED talk, speaks about an MIT study in which teams were given very hard problems to solve. The really successful teams were not the teams with the highest aggregate IQs. They were, in fact, the teams that were the most helpful and the most diverse: the ones with the most empathy towards one another, the ones without a dominant voice, and the ones with the most women.
Continuous Improvement supported by data
- One of our values is Build a better way. To do this we have to make time for reflection then follow it up with adaptation. Supporting data is key. Tracking metrics and severity in a single place means we can use it to add context, to make decisions, and to measure our improvement.
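As a rough illustration of what tracking incidents in a single place makes possible, here is a minimal Python sketch that computes mean time to discovery and mean time to resolution. The `Incident` record, its fields, the severity scale, and the sample data are all made up for illustration and are not our actual tooling. It also assumes both times are measured from the start of the incident; your team may define them differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    title: str
    severity: int        # hypothetical scale: 1 is the most severe
    started: datetime    # when the problem began
    detected: datetime   # when someone first noticed it
    resolved: datetime   # when service was restored


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd(incidents):
    """Mean time to discovery, measured from the start of each incident."""
    return mean_minutes([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time to resolution, measured from the start of each incident."""
    return mean_minutes([i.resolved - i.started for i in incidents])


# Made-up sample data, not real outages.
t0 = datetime(2017, 1, 1, 12, 0)
incidents = [
    Incident("example A", 2, t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=35)),
    Incident("example B", 3, t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=65)),
]

print(f"MTTD: {mttd(incidents):.1f} min, MTTR: {mttr(incidents):.1f} min")
```

Keeping the timestamps next to the severity means the same records can later be filtered or grouped, for example to compare this quarter's averages with the last one.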
How we got here
We arrived at these practices because people at other companies ‘worked out loud’ and showed us. Then, we iterated towards our practices based on our context.
The following list is from invaluable thinkers who influenced our mindset, practices, and tools:
- Dan Milstein, How to Run a 5 Whys (With Humans, Not Robots) video, slides.
- John Allspaw, Etsy, on Blameless Post Mortems and Just Culture. Where Etsy takes it next.
- Ian Malpass, Etsy, Fallible Humans video
- Google on how to foster psychological safety (as well as Five keys to a successful Google team, and Why psychological safety matters and what to do about it).
- “Underneath every simple, obvious story about ‘human error,’ there is a deeper, more complex story about the organization.” – Sidney Dekker, The Field Guide to Understanding ‘Human Error’.
- Five Whys – How To Do It Better
Something happens… an outage or incident
Read the articles above. If you read just one, read this one.
Set a time for a post-mortem. The sooner the better. Grab a whiteboard.
A note about where to hold it. We use a whiteboard in a public place, not in a meeting room. We also leave the details up until the next 5 Whys. This is a visible reminder of openness and it invites participation from passers-by who provide input based on their past experiences.
Invite the people who:
- introduced the problem
- identified the problem
- responded to the problem
- debugged the problem
- anyone else who is interested
Conduct a 5 Whys (blameless post-mortem). Focus on the system, not the person.
As Dan Milstein says, start with humour to get participants to open up. Relative to other disasters, how bad is this? Did the site go down? Did we accidentally post from our main Twitter handle? Lose customer data? Send an email to all our users with the greeting “Hello FName!”?
- MTTR: Mean Time to Resolution
- MTTD: Mean Time to Discovery
- Drill down
- Ask “Why?” at least five times.
- Each statement should be both verifiable (how do you know?) and not compound (single cause). An example from Five Whys – How To Do It Better:
- Example problem: SUV Model Z exhaust system rattle
- Why? Change of position of bracket results in vibration <– COMPOUND
- Why? Exhaust pipe vibration <– SINGLE
You must be able to work up the chain (last ‘why’ to the first ‘why’) to verify the causation.
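The checks above can be sketched in a few lines of Python. This is only an illustration: the `Why` record, the `review` and `read_back` helpers, and the evidence strings are hypothetical, and the two statements from the SUV example are treated as alternative answers to the first “why”.

```python
from dataclasses import dataclass


@dataclass
class Why:
    statement: str
    evidence: str = ""       # how do you know? empty means not yet verified
    compound: bool = False   # True when the statement bundles several causes


def review(chain):
    """Return a list of issues: unverified or compound statements."""
    issues = []
    for n, why in enumerate(chain, start=1):
        if not why.evidence:
            issues.append(f"Why {n} is unverified: {why.statement}")
        if why.compound:
            issues.append(f"Why {n} is compound: {why.statement}")
    return issues


def read_back(problem, chain):
    """List the statements from the last 'why' up to the problem itself."""
    return [why.statement for why in reversed(chain)] + [problem]


problem = "SUV Model Z exhaust system rattle"

# Two candidate answers to the first "why"; the evidence text is invented.
compound = Why(
    "Change of position of bracket results in vibration",
    evidence="hypothetical: inspection notes",
    compound=True,
)
single = Why("Exhaust pipe vibration", evidence="hypothetical: felt on a test drive")

print(review([compound]))  # flags the compound statement
print(review([single]))    # no issues
print(" -> therefore -> ".join(read_back(problem, [single])))
```

Reading the output of `read_back` aloud with “therefore” between the statements is a quick way to check that each step really causes the one above it.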
List the improvements you can make to bulletproof the system
Ask people to commit to completing an improvement.
How do we arrive at action items and commitments? Organically. As a heuristic, point to each “why”, and to each step in the timeline, then ask how we can improve the process, technology, and training for each item; and who will take it on.
Publish it in Morgue or in the most visible communication channel.
Follow-up on that channel as you make the fixes and share your learnings.
Too soon? 🙂
Thank you
To everyone who helped with this outage. To Jonas, Beier, Geordie, Tyler, James, and Trenton for their contributions to this process and their help putting this article together.
About the Author
Noel focuses on culture, employee engagement, technical community involvement, and training for Hootsuite’s technical groups. He loves to exchange ideas and would like to hear how you do these things at your organization. Get in touch via LinkedIn.