
After any outage or severe incident at Hootsuite, we take a closer look at the event to uncover valuable lessons and data. We do this by working backwards, asking “Why?” to find the most likely root causes for the problem. Then we put in place incremental improvements to bulletproof ourselves against a reoccurrence.

Before I explain our mindset, practice, and step-by-steps, let me tell you a story.

Oh Shit

I’m sure you’ve felt this way before…

On April 6th, 2017, my spidey-sense started to tingle, my hands began to sweat, and I had that sinking feeling in the pit of my stomach: all the physical and emotional signs of having done something wrong.

Turns out, I brought down part of our site for 39 minutes. More specifically, our users were unable to post to LinkedIn or to add new LinkedIn accounts. I had inadvertently deleted the Hootsuite app from the LinkedIn Developer Portal because I thought I was deleting my developer access to the app. Fortunately, for our customers, this happened in the evening around 6:30PM PDT when our site is not under heavy load.

What happened next says everything about our culture.

Now what?

Everyone came together.

  • One of our Product Managers helped me verify the degraded functionality.
  • Our VP Engineering calmly directed me to our Strategic Partnership team who have a relationship with LI.
  • Our SVP Strategy and Corporate Development tried to get in touch with a contact at LinkedIn while I submitted a support ticket to LinkedIn Developer Support.
  • One of our advocates from Customer Support let us know about the first customer reports of a problem in our Slack #outage channel.
  • Developers, Operations and others from across the company immediately responded in #outage and offered to help.

The Fix

Turns out emailing LinkedIn Developer Support was the key to a fast resolution. Danthanh Tran from LI was able to “undelete” our Production, Staging, and Development apps in a matter of minutes.

Whose fault was it?

That’s a trick question. As my cold sweats told me, I felt responsible, but no one is to blame. This is key to the culture of a team that improves systems quickly.

The next day, a member of our Ops team facilitated a blameless post mortem called a “5 Whys” that focused on the timeline, data about the event, what went wrong, and how we can fix our system so the same issue doesn’t happen again.

Note the words: fix our system, not fix me. Like most humans, I’m well intentioned but fallible. Many of us could have made that error. In fact, having personal social network accounts able to administer the one key connection to our social networks has been a time bomb waiting to happen. This event is a catalyst for long-overdue change.

“Human error is a symptom—never the cause—of trouble deeper within the system” – Dave Zwieback in an article on First Round Review.

So, the question is: how can we bulletproof our system so the next person doesn’t have the same inadvertent nightmare? Dan Milstein puts it more humorously in How to Run a 5Why’s (with Humans not Robots): “Let’s plan for a future where we are all as stupid as we are today” 🙂 Focusing on the system instead of the person removes the stigma of speaking up when we feel guilt for an accident. This is key because accidents contain valuable information and should be “seen as a source of data, not something embarrassing to shy away from” via John Allspaw.

Next, we post a summary of our 5 Whys on our internal tool called Morgue (also from Etsy) and take actions to reduce the probability of a repeat event.

A Note on the title of this outage.

Two people pointed out the incongruity of saying we’re blameless but using a name in the title “That time Noel Deleted LinkedIn”. Let me explain. This is the only time we put a name to a 5 Whys. It was done as a satirical tribute to me, expressly for my contributions to our culture of blameless post-mortems.

Our Mindset and Culture on 5 Whys

To surface all the valuable data our culture must remain “open, with a willingness to share information about … problems without the fear of being nailed for them.” – Sidney Dekker in The Field Guide to Understanding ‘Human Error’.

Trust by default

  • Believe that people are doing the best they can with the information they have.

Provide psychological safety

  • How many of you would have admitted to causing a problem that brought down production? Do you feel you can take the risk to tell people without feeling insecure or embarrassed? Censoring ourselves means we limit what we can do.
  • As my colleague Geordie put it: “only people who feel psychologically safe in a team and can be themselves, are capable of doing their best work.”

Helpfulness is the secret weapon

  • Helpfulness feels great and it is an indicator of highly productive teams. Margaret Heffernan in her TED talk speaks about an MIT study of teams, where teams were given very hard problems to solve. The really successful teams were not the teams with the highest aggregate IQs. They were, in fact, the teams that were the most helpful and the most diverse: the ones that had the most empathy towards one another, the ones without a dominant voice, and the ones with the most women.

Continuous Improvement supported by data

  • One of our values is Build a better way. To do this we have to make time for reflection then follow it up with adaptation. Supporting data is key. Tracking metrics and severity in a single place means we can use it to add context, to make decisions, and to measure our improvement.

How we got here

We arrived at these practices because people at other companies ‘worked out loud’ and showed us. Then, we iterated towards our practices based on our context.

The following list is from invaluable thinkers who influenced our mindset, practices, tools

Step-by-step

  1. Something happens… an outage or incident

  2. Resolve it.

  3. Read the articles above. If you read just one, read this one.

  4. Set a time for a post-mortem. The sooner the better. Grab a whiteboard.

  5. A note about where to hold it: use a whiteboard in a public place, not in a meeting room. We also leave the details up until the next 5 Whys. This is a visible reminder of openness and it invites participation from passers-by who provide input based on their past experiences.

  6. Invite the people or persons who… (taken from Bethany Macri’s talk on post-mortems):

    • introduced the problem
    • identified the problem
    • responded to the problem
    • debugged the problem
    • anyone interested

  7. Conduct a 5 Why’s (blameless post-mortem). Focus on the system not the person.

  8. Like Dan Milstein says, start with humour to get participants to open up. Relative to other disasters… How bad is this? Did the site go down? Accidentally post from our main twitter handle? Lose customer data? Send an email to all our users with the greeting “Hello FName!”

  9. Collect metrics

    • Timeline
    • MTTR: Mean Time to Resolution
    • MTTD: Mean Time to Discovery

  10. Drill down
    1. Ask “Why?” at least five times.
    2. Each statement should be both verifiable (how do you know?) and not compound (single cause). An example from Five Whys – How To Do It Better:
    • Example problem: SUV Model Z exhaust system rattle
    • Why? Change of position of bracket results in vibration <– COMPOUND
    • Why? Exhaust pipe vibration <– SINGLE
  11. You must be able to work up the chain (last ‘why’ to the first ‘why’) to verify the causation.

  12. List the improvements you can make to bulletproof the system

  13. Ask people to commit to completing an improvement.

  14. How do we arrive at action items and commitments? Organically. As a heuristic, point to each “why”, and to each step in the timeline, then ask how we can improve the process, technology, and training for each item; and who will take it on.

  15. Publish it in Morgue or in the most visible communication channel.

  16. Follow-up on that channel as you make the fixes and share your learnings.
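To make the “Collect metrics” step concrete, here is a minimal sketch in Python of how detection and resolution times could be derived from a timeline. It assumes both are measured from the start of customer impact (some teams measure resolution from detection instead). The function names and the detection timestamp are hypothetical. Only the 39-minute duration and the approximate 6:30 PM start come from the story above.

```python
from datetime import datetime, timedelta


def incident_durations(started, detected, resolved):
    """Return (time_to_detect, time_to_resolve) for one incident.

    Assumption: both durations are measured from the start of impact.
    """
    if not (started <= detected <= resolved):
        raise ValueError("timeline must be ordered: started <= detected <= resolved")
    return detected - started, resolved - started


def mean_duration(durations):
    """Average a collection of timedeltas (the 'mean' in MTTD and MTTR)."""
    durations = list(durations)
    if not durations:
        raise ValueError("need at least one incident")
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Illustrative timeline: the start time is approximate and the
    # detection time is invented for this example.
    started = datetime(2017, 4, 6, 18, 30)
    detected = datetime(2017, 4, 6, 18, 35)   # hypothetical
    resolved = started + timedelta(minutes=39)

    time_to_detect, time_to_resolve = incident_durations(started, detected, resolved)
    print("time to detect:", time_to_detect)
    print("time to resolve:", time_to_resolve)
```

Feeding the per-incident durations from several post-mortems into `mean_duration` gives the MTTD and MTTR figures to track over time.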

    Too soon? 🙂

    Thank you

    To everyone who helped with this outage. To Jonas, Beier, Geordie, Tyler, James, and Trenton for their contributions to this process and their help putting this article together.

    About the Author

    Noel focuses on culture, employee engagement, technical community involvement, and training for Hootsuite’s technical groups. He loves to exchange ideas and would like to hear how you do these things at your organization. Get in touch via LinkedIn.

Guy Drut. From hurdler49

Guy and the Gold Medal

The French Olympic hurdler, Guy Drut, found himself in an unenviable position in the early summer of 1976. He was France’s only hope for a track-and-field medal, and the burden of carrying the nation’s pride on his shoulders was getting to him. Drut later told me that he had spoken on several occasions prior to the games with our long-time client Jean-Claude Killy and that he really felt he owed a part of his gold medal to Killy. He explained it as follows: “Jean-Claude told me that I was the only one who knew how to get my body and mind to their ultimate peak for the Olympic Games. He then told me that after I had done this that I should keep saying to myself, ‘I have done everything I can to get ready for this race and if I win, everything will be great, but if I don’t win my friends will still be my friends, my enemies will still be my enemies, and the world will still be the same.’ I repeated this sentence to myself before the qualifying heats and during the break between the semi-finals and finals. I kept saying the sentence over and over, and it blocked out everything else. I was still repeating it to myself when I went up to get my gold medal.”

From the Fear of Failure passage in What They Don’t Teach You at Harvard Business School: Notes from a Street-smart Executive by Mark H. McCormack. Underlining is mine.

This isn’t the fear you’re looking for

Few of us have been in the starting blocks at the Olympics but for many of us a similar level of anxiety can be brought on at the thought of presenting a technical demo in front of our fellow engineers – even our friends and colleagues.

By repeating those words, Drut downplayed the consequences of failure and detached his anxiety from his situation. Every time I go up on stage, I say those same words, for the exact same reason.

Working out loud is a good thing

Every Wednesday morning for the last four years, our entire technical staff has gotten together for Demos and an All Hands. Engineers sign up to give 5-minute demonstrations of new product functionality or internal tooling and then take questions from the audience. After the Demos we move on to announcements and awards. Over the last four years we’ve done upwards of 440 demos. I’ve watched almost all of them.

For everyone who attends this session, it celebrates people and accomplishments; it drives alignment around mission, strategy and priorities; and finally, it provides a forum to ask and answer questions. (Thanks for that excellent article Gokul).

Each presenter needs to make the most of this opportunity because talking about your work is as important as the work itself. The challenge is to get so good at presenting a technical demo that others feel compelled to celebrate your work, change their outlook, and share your story. That means making it succinct, informative, and relevant.

The leap from paper to the stage is huge – the way our ideas sound in our head is not at all how they sound out loud. Here are five ways to elevate a mediocre technical demo to a great one.

We’ve all been there: Those awkward moments where you’re sharing space with someone, like waiting for the coffee machine at work. There’s a nod, maybe a smile, but not much more… Two people existing in parallel but never really connecting.

But we’ve also been here: A random encounter with someone where you unexpectedly connect on a more-than-superficial level. Surprisingly, that chance encounter leads to discovering a shared interest and sparks a deep discussion around a problem. All of a sudden, you find yourself with a new friend, and are tackling that gnarly problem from a completely different angle which, in turn, leads to a better solution than you could have imagined.

Assisted Serendipity

In any group, I find it’s the informal relationships that are the most effective in getting stuff done. Why? People naturally want to help other people and they are much more likely to do so when the obstacle of unfamiliarity is out of the way. This is where a facilitated and informal ‘blind’ chat can work its magic. We call this #randomcoffee. It’s a catalyst that facilitates relationships and eases the awkwardness of approaching someone new. Why coffee? Coffee dates require less commitment because they are inexpensive and quick in nature, yet can be drawn out in length by good conversation.

The term “assisted serendipity” comes from Ryan Vanderbilt’s article in Fast Company about the future of collaboration. He elegantly uses the proximity of magnets to illustrate the benefits of helping to bring two things together. “If two magnets are separated by too much distance, they won’t have any impact on each other,” he writes. “But, if something helps move them a bit closer, they will gravitate towards each other and connect. Technology can be used in a similar way. It can connect you to other people, skills, tools, and trigger new ways of thinking and working; it can create an ‘assisted serendipity.’”

Ten Thousand Coffees built a business around helping people connect; Robert Meggs at Etsy built an internal tool called Mixer for the same reasons; and the University of Michigan started “Innovate Brew” last October to foster innovation, randomly matching faculty for 30-minute coffee meetings once a month.

Each of these organizations is looking for ways to bring people together for a conversation to learn about one another—and from one another—in the hopes that it changes their future behaviour for the benefit of all.

#randomcoffee

So, after a few conversations (over coffee, true story), one Google Form, and assistance with writing and sending… in February 2016, we had 128 people sign up for their first #randomcoffee at Hootsuite.
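How the pairs were drawn isn’t covered here, so the following is only a minimal sketch of random matching, assuming a plain list of sign-up names (for example, exported from the form). The function name, the optional seed, and the choice to fold an odd person out into a group of three are all assumptions made for illustration.

```python
import random


def random_coffee_pairs(people, seed=None):
    """Shuffle the sign-ups and split them into pairs.

    If the count is odd, the last person joins the final pair to form
    a trio. A single sign-up cannot be matched and yields no groups.
    """
    rng = random.Random(seed)  # a seed makes a draw repeatable
    shuffled = list(people)    # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    # Take consecutive pairs; stop before a leftover single person.
    groups = [shuffled[i:i + 2] for i in range(0, len(shuffled) - 1, 2)]

    # Odd number of sign-ups: add the leftover person to the last pair.
    if len(shuffled) % 2 == 1 and groups:
        groups[-1].append(shuffled[-1])
    return groups


if __name__ == "__main__":
    signups = ["Ana", "Ben", "Chloe", "Dev", "Emi"]  # hypothetical names
    for group in random_coffee_pairs(signups):
        print(" + ".join(group))
```

With 128 sign-ups this produces 64 pairs. A later round could exclude pairs that have already met, which this sketch does not attempt.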

This is an email sent to participants. I hoped to strike a balance of humour and encouragement 🙂

What happened next?

I started to collect anecdotes because people kept stopping me in the hallway and sharing something positive about their experience. People also posted selfies of their coffees and a blurb about the experience on our internal network. These anecdotes are so uplifting to read because they remind me of the positive effect of #randomcoffee. They have also served as seeds for the program’s evolution.

Rhys Rustad-Elliott and Noah Tajwar.

Why does Hootsuite offer Technical Summer Jobs to High School Students?

Because these kids are amazing. This blog post details their accomplishments, our philosophy, and a request for more industry peers to help by starting a paid high school summer program: http://code.hootsuite.com/why-you-need-high-school-students/

How will Rhys and Noah contribute over the next two months?

Noah has joined our Android team, and Rhys has joined our Publishing team. Both will pair with a training guide, go through our onboarding program, participate in all aspects of software engineering as a team, work out loud, participate in Guild meetings, demonstrate their work at our Wednesday All Hands, write a blog post, and that’s just the beginning 🙂

Read on to learn more about Rhys and Noah.

Can you guess what these two scenarios have in common?

Imagine seeing Google or Facebook working on ways to provide internet connectivity to remote regions, and then launching project Seed—an off-the-grid web server that solves the same problem on a smaller scale.

Or say you built an Android app, and in the process figured out a unique way to improve the user experience when the phone is turned on its side. When you write about your methods, you hear from programmers who thank you for saving them tons of time.

These would be great accomplishments for any senior technologist. The fact is, these are examples of the enthusiasm and accomplishments of high school students.

In this post I’m going to illustrate why a paid, technical summer high school program has, in my mind, no downside for students, employers, or schools.

Eric Hamber Kids come for a visit

Hootsuite Co-ops

The Story of Ira Needles

When Ira G. Needles arrived in the Waterloo region in 1925 to take a job as assistant sales manager at B.F. Goodrich, he hid the fact that he was university educated. At the time, the business world considered it “snooty” to have a higher education.

His education didn’t hurt him, however, and Needles gradually rose within the ranks of the tire giant, and by 1951 he was appointed president of B.F. Goodrich Canada.

However, in the summer of 1956, Needles’ two separate worlds – industry and academia – would finally come together in a radical speech he made to the Rotary Club of Kitchener-Waterloo. Needles’ speech would ultimately transform the nature of education in Canada.

During the talk, entitled “WANTED: 150,000 Engineers – The Waterloo Plan,” Needles presented a new kind of education that would involve studies in the classroom as well as training in industry.

Courtesy of the University of Waterloo public library file on Ira Needles

Thank Goodness for Ira

Needles and two others would then go on to found the future University of Waterloo (first known as the Waterloo College Associate Faculties) in 1957 and admit the first 75 students, all of whom were also co-ops. Sixty years later, roughly 80,000 students in Canada enroll in co-operative education each year.

A co-op program has, in my mind, no downside for students, employers, or the institutions that support it. Students apply their learning, test-drive life in industry, fund their education, and return to their studies to challenge some of the theoretical principles based on this experience. Employers, like us, get an opportunity to cement our own knowledge by teaching students and at the same time build a pipeline of young, bright talent. Lastly, demand for an academic institution’s programs grows as the demand for their graduating students grows.

How Do You Measure the Success of a Co-op Program?

Love technology? Are you a high school student in grade 11 or 12, or home schooled? Do you live in BC? 

Hootsuite’s technologists would love to help you develop your technical skills and to provide you with experience around people, process, and technology by working with you over the summer. You’ll pair with a mentor and work at our Vancouver office side-by-side with a passionate, egoless team having fun building something bigger than themselves. Experience what it’s like to make a difference in people’s lives by building the products our customers use to turn messages into meaningful relationships.

There are opportunities at Hootsuite in all aspects of technology including software (both mobile and web), operations, security, and IT.

This is a paid position with a competitive salary.

Application instructions are at the end of this post.

This is the second summer of this program. You can meet the High School students who worked with us last year and read about their stories and accomplishments.

Passive learning creates knowledge. Active practice creates skill. – James Clear

Photo courtesy of @alexrousse_ (instagram)


Everything changes and nothing stands still – Heraclitus

Someone emailed me recently to ask about “the good, the bad, and the ugly” of Guilds, because almost a year has passed since I first wrote about them. We set up Guilds to tap into our desire to learn and improve how we do things, as well as facilitate horizontal communication and collective action across our stable teams. Most times our Guilds aim to effect change on something external, but this post focuses on changes within the Guilds themselves. Here are some insights from a recent retrospective on Guilds that we held in July.

Guilds session at July Unconference

1. Do > Talk

Hands-on sessions have higher engagement and a high participant return-on-time-invested. Some examples from our technical Guilds include coding workshops, group code-reviews, and mini-hackathons.

2. Why Did People Show Up?

Are members looking for a support network? Do they want a place to learn? Why did you start it? What do people want to get out of this Guild? Kick off your first session with your perception of the problem and a vision of how the Guild will help address it.


We have a norm for onboarding new engineers on Day 1: Push to Production. This norm is a powerful experience for our new people and a litmus test for our systems.

Simon Whitfield pushes to Production

The Experience and Why it’s Important

Pushing to Production on Day 1 is a small and early exposure to an engineering culture that values learning, trust, collaboration, and a bias towards thoughtful action.

Always Be Shipping. As software engineers, this is our role, this is why we’re here, this is how we make an impact and help others – like our customers.

This is a hands-on experience. There is an important difference between passive learning and active practice. Reading a wiki page or diagram about deployment steps and hearing our philosophy of “anyone can deploy to Production at any time, from anywhere in less than five minutes” develops knowledge. The act of deploying helps someone understand how to change our product and how it feels to make that change.

This shows the speed at which we operate. Continuous Delivery – shipping to Production multiple times per day – means we can get something useful into the hands of our customers very quickly. We pushed 2015 times in 2014, an average of about 8 times per working day.

This says ‘we trust you’. Our engineers are responsible for delivering code the last mile and accountable for how it affects our customers. We try to convey the magnitude and gravity of shipping to Production, but at the same time, remove the fear of deployment.

“You must feel the Force around you; here, between you, me, the tree, the rock, everywhere, yes.” – Master Yoda

The Danger of Team Debt

In her PyCon 2015 keynote, Kate Heddleston explains how a familiar engineering concept – technical debt – applies to any growing organization. Any technical system can accrue technical debt as a consequence of bad design. Heddleston argues that organizations can also accrue ‘team debt’ as a consequence of bad design: where each person added to your team eventually decreases overall team productivity. Productivity drops because each new addition lacks an understanding about the team’s processes, cultural norms, how to do their job, corporate values, code standards, architecture, and more.

Steve Blank sums it up as “all the people/culture compromises made to ‘just get it done’ in the early stages of a startup.” It’s like death by a thousand cuts – each person’s inefficiencies compound to a point where their time and effort spent navigating your ‘system’ outweighs their time and effort spent shipping code.

This was exactly our situation in our Engineering group two years ago. Our team had tripled in size from 13 to 39 in the span of two years and was slated to double again to 78 in 2014. So much of our ‘just get it done’ approach led to misaligned expectations and a lack of understanding of our code base, our practices, and our culture. Symptoms of the problem trickled in to me periodically, but the depth of the situation really hit home when someone I had hired resigned and cited some of these issues in their exit interview.

That event radically shifted the way I looked at introducing new engineers.

Lechon Kirb photo via Unsplash.com

