November 26, 2021 04:58 pm GMT

Incident Post-Mortems at Jobber

No matter how stable your software product is, occasionally things go wrong in production, and Jobber is committed to doing a post-mortem investigation to follow up and learn from each incident.

At a high level, an incident post-mortem answers these questions:

  • What went wrong?
  • What did we do to fix it?
  • What will we do differently, so it doesn't happen again?
  • What went well during the incident, that we should keep doing?

As we've grown and moved to a remote working environment, we've changed our process to work better for remote teams and super busy schedules. This is a summary of what we're doing to make sure that incidents remain rare and our customers can keep getting their work done!

Our process

Our process is broken down into 4 steps: resolve the incident, investigate it, debrief about it, then share the results.

Collect data during the incident. We collect as much data as we can in a Slack channel dedicated to incidents, keeping it organized with threads. This includes server graphs, snippets from logs, and screenshots showing what was going on at each point in the incident. It doesn't all end up being useful, but it's nice to have everything collected when you start going through the investigation.

Start the investigation right away. We get one of the involved people to take on the role of lead investigator, which really means they're in charge of making sure the investigation gets done, the post-mortem document gets filled in, and the debrief gets held. Starting it right away makes sure nothing gets lost.

Review the results within a week. While things are still fresh, hold a debrief to review the post-mortem document, discuss the action items, and make any edits needed. This is a 30-60 minute Zoom session with the team involved in the incident as well as reps from other departments (mainly the customer support/escalation team).

Share the results as soon as the debrief is done, so everyone gets a chance to learn from it! We post it to a Slack channel that the whole company has access to, for transparency.

New Challenges

With a larger company, people working in all sorts of time zones, and everyone being remote, scheduling and coordinating got a lot more complicated. The process is still mostly the same, but with some tweaks to keep it effective.

Shorter timelines

We've shortened the timeline expectations - getting the incident doc started faster and the debrief done sooner helps us capture all the data and lets everyone involved get back to their sprint work sooner.

Assume async

Scheduling the debrief sooner means that it's harder to find a spot in everyone's calendars. Rather than pushing the meeting further and further out, do more of the work asynchronously. Make sure the document can stand on its own, and use Slack to ask people for their contributions.

We also record the debrief (easy with Zoom) so that anyone who couldn't attend can watch it later, and nobody has to worry about missing out.

Simple incident doc template

We're using a wiki template for consistency, and over time we've repeatedly simplified the template so there are fewer sections to worry about.

Setting it up with a button to auto-create the new page from the template works well.

The template has sections for:

  • Impact and Scope
  • Trigger (what started the incident)
  • Resolution (what ended up fixing it)
  • Timeline of events
  • Root Cause
  • What went well
  • What didn't go well
  • Action items
  • Data & Analysis (all the charts!)
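As a rough illustration of what a button that auto-creates the page might produce, here is a minimal sketch that renders an empty plain-text skeleton from the section list above. The section names and hints are taken from the list; the function name `new_postmortem_page`, the page title format, and the placeholder text are assumptions for illustration, not Jobber's actual wiki setup.

```python
# Section names and hints are copied from the template list above.
# Everything else (function name, title line, placeholder text) is hypothetical.
SECTIONS = [
    ("Impact and Scope", ""),
    ("Trigger", "what started the incident"),
    ("Resolution", "what ended up fixing it"),
    ("Timeline of events", ""),
    ("Root Cause", ""),
    ("What went well", ""),
    ("What didn't go well", ""),
    ("Action items", ""),
    ("Data & Analysis", "all the charts!"),
]


def new_postmortem_page(title: str) -> str:
    """Return a plain-text page skeleton with one empty block per section."""
    lines = [f"Post-mortem: {title}", ""]
    for name, hint in SECTIONS:
        lines.append(name)
        # Show the hint from the template if there is one, else a generic placeholder.
        lines.append(f"({hint})" if hint else "(to be filled in)")
        lines.append("")
    return "\n".join(lines)


print(new_postmortem_page("Example incident"))
```

Keeping the section list in one place means that simplifying the template later is a one-line change.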

Asking for input from customer-facing teams right away

Our customer success team always has great input and is able to help fill in gaps in the timeline. We reach out to them early so there's time for their input to be added into the post-mortem doc before the debrief. Waiting for the debrief is too late!

Tracking action items in Jira

Why track action item progress in an incident doc when we already have a standard tool for tracking work? As soon as we can, we get all action items from post-mortems in as Jira tickets so they can be assigned to backlogs and don't get lost.

We also have some reports set up to view the list of outstanding post-mortem actions - driven by a post-mortem label on the items.
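The idea behind a label-driven report can be sketched with plain in-memory data. This is only an illustration of the filtering logic, not a Jira integration: the `Ticket` class, the exact label text `post-mortem`, the status names, and the ticket keys are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical stand-in for a tracked work item (not a real Jira object)."""
    key: str
    summary: str
    status: str
    labels: set[str] = field(default_factory=set)


POSTMORTEM_LABEL = "post-mortem"    # assumed label text
DONE_STATUSES = {"Done", "Closed"}  # assumed names for finished work


def outstanding_postmortem_actions(tickets: list[Ticket]) -> list[Ticket]:
    """Keep tickets that carry the post-mortem label and are not finished yet."""
    return [
        t for t in tickets
        if POSTMORTEM_LABEL in t.labels and t.status not in DONE_STATUSES
    ]


# Made-up example data.
example = [
    Ticket("EX-1", "Add alert on queue depth", "To Do", {"post-mortem"}),
    Ticket("EX-2", "Document rollback steps", "Done", {"post-mortem"}),
    Ticket("EX-3", "Unrelated bug", "To Do", {"bug"}),
]
for ticket in outstanding_postmortem_actions(example):
    print(ticket.key, ticket.summary)
```

Because the report keys off a single label, an action item only needs that label to show up in it, whichever backlog it is assigned to.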

Have a section for things we should do if we have time

Realistically, not all action items are actually actionable - some are more aspirational or are something we just need everyone to keep in mind. In order to keep the Jira action items clearer, we've included this section as a spot to put the things we think are important but we couldn't turn into assignable/trackable work.

Our approach is that it's better to have a smaller set of action items that we actually do than a giant list of things we'd like to do given infinite time.
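One way to picture the split is a simple triage rule. The rule below (an item becomes a ticket only if it has an owner and a concrete "done when" statement) is an assumption chosen for illustration; the article does not say how the decision is actually made, and the `ProposedAction` class is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ProposedAction:
    """Hypothetical shape for an action item raised during a post-mortem."""
    description: str
    owner: Optional[str] = None      # who would pick it up
    done_when: Optional[str] = None  # concrete completion criterion


def is_trackable(action: ProposedAction) -> bool:
    # Assumed rule: assignable (has an owner) and trackable (has a finish line).
    return bool(action.owner) and bool(action.done_when)


def triage(
    actions: List[ProposedAction],
) -> Tuple[List[ProposedAction], List[ProposedAction]]:
    """Split items into (ticket candidates, 'if we have time' section)."""
    tickets: List[ProposedAction] = []
    if_we_have_time: List[ProposedAction] = []
    for action in actions:
        (tickets if is_trackable(action) else if_we_have_time).append(action)
    return tickets, if_we_have_time


# Made-up example items.
tickets, later = triage([
    ProposedAction("Add alert on queue depth", "infra team", "alert fires in staging"),
    ProposedAction("Be more careful with migrations"),
])
print(len(tickets), "ticket(s);", len(later), "item(s) for the 'if we have time' section")
```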

Keep it Blameless

This one isn't actually new, but it's well worth repeating! We're interested in what happened and what we're going to do to fix it going forward, not in pointing fingers.

"Removing blame from a postmortem gives people the confidence to escalate issues without fear."
- the SRE book

About Jobber

We're hiring for remote positions across Canada at all software engineering levels!

Our awesome Jobber technology teams span Payments, Infrastructure, AI/ML, Business Workflows & Communications. We work on cutting-edge, modern tech stacks using React, React Native, Ruby on Rails, & GraphQL.

If you want to be a part of a collaborative work culture, help small home service businesses scale and create a positive impact on our communities, then visit our careers site to learn more!


Original Link: https://dev.to/jobber/incident-post-mortems-at-jobber-43ja
