This is a guest post from ITHAKA Scrum Master Tom Bellinson, whose work has focused on scrum and project management for a variety of organizations from tech companies to the University of Michigan Medical School. His full bio is below. The opinions and statements in this article are the sole responsibility of Mr. Bellinson and do not necessarily reflect the opinions or positions of ITHAKA or Cronicle Press.
If you would like to write a guest post for Cronicle Press, please email the editor. We are looking for guest posts in tech industry thought leadership, entrepreneurship and startup culture, and book and media reviews of tech-related content.
Why Blameless Postmortems?
This spring, I gave a talk at an agile conference about blameless culture. During the presentation, I asked, by a show of hands, how many people work in a culture ruled by fear. Sadly, but not unexpectedly, well over half the crowd raised their hands.
The reality is that there are many people who seek leadership roles for the wrong reasons. They seek the ability to control their environment, which comes with the power to punish. In the past, strongmen often rose to the top. It stands to reason that they would relate to people who manage the way they do. Thus, the cycle perpetuates itself.
A new breed of organization is shifting from fear-based leadership to servant leadership. Servant leaders seek to build a supportive environment in which people find trust at all levels. Leaders can find other opportunities to demonstrate the right behaviors and that’s helpful, but the blameless postmortem is a powerful tool to help transform a culture.
The Blameless Postmortem
The blameless postmortem, or BPM as it is often called, is an approach to learning that offers a number of important benefits:
- It encourages systems thinking
- It acknowledges that failure can’t be avoided
- It makes real the practice of blamelessness
- It creates an opportunity for leadership to build trust by demonstrating support and appreciation for people willing to share their experiences honestly
- It provides an opportunity for broader knowledge sharing
At ITHAKA, we hold Blameless Postmortem meetings, or BPMs, after incidents. We have a dynamic production environment: there are more than 250 applications in AWS, over 100 changes are made to those applications every week, and millions of students and researchers use our website every day. Failures are inevitable.
We prepare for failures, so our systems are designed for rapid recovery. This includes putting deployment systems in the hands of our engineers, along with automation that creates Slack channels and incident documentation and communicates with users on our websites.
The key here is that our product engineering teams manage their own systems in production. They change, deploy, and monitor their own software. Also, anyone in the organization can initiate the incident automation that puts our incident response in motion. This is a demonstration of trust: leaders tend to stay out of the way and allow teams to use their judgement.
Our incident automation creates a Slack channel, generates BPM documentation from a template, and pages the Incident Manager and Incident Communications Specialist on call. They are responsible for making certain that all the right resources are engaged and for communicating with the organization. Developers usually know shortly after a failure appears. Many people keep an eye on our Slack #platform-alerts channel for reporting outages. We also have a fairly well-instrumented codebase, so when things go south, often more than one alarm sounds. Finally, being in a collab space makes it easy to talk to a bunch of people just by raising your voice.
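To make the three automated steps concrete, here is a minimal sketch in Python. It is not ITHAKA's actual automation. The template text, the channel naming scheme, the `start_incident` function, and the on-call lookup keys are all hypothetical, and nothing here calls Slack or a paging service. It only models what such a tool might produce when someone declares an incident.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Hypothetical template text; the real BPM template is not reproduced in this post.
BPM_TEMPLATE = (
    "Blameless Postmortem: {title}\n"
    "Opened: {opened}\n"
    "Reported by: {reporter}\n"
    "\n"
    "Timeline (Date/Time | Actor | Action or Event):\n"
)


@dataclass
class Incident:
    title: str
    reporter: str
    opened: datetime
    channel: str                 # name of the chat channel to create
    document: str                # BPM document seeded from the template
    paged: List[str] = field(default_factory=list)  # who gets paged


def channel_name(title: str, opened: datetime) -> str:
    """Build a channel name from the date and a lowercase, hyphenated title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"incident-{opened:%Y%m%d}-{slug}"


def start_incident(
    title: str,
    reporter: str,
    on_call: Dict[str, str],
    now: Optional[datetime] = None,
) -> Incident:
    """Anyone can call this; it prepares the channel, document, and pages."""
    opened = now or datetime.now(timezone.utc)
    document = BPM_TEMPLATE.format(
        title=title,
        opened=opened.isoformat(timespec="minutes"),
        reporter=reporter,
    )
    # Assumed roles: the two on-call roles named in the text above.
    paged = [on_call["incident_manager"], on_call["communications_specialist"]]
    return Incident(title, reporter, opened, channel_name(title, opened), document, paged)
```

A call such as `start_incident("Search latency spike", "dev-a", {"incident_manager": "im-1", "communications_specialist": "ics-1"})` returns a record describing the channel, the seeded document, and the two people to page. A real tool would then hand those values to the chat and paging systems.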
An Outline of The Blameless Postmortem Process
As soon as possible after user impact has been mitigated, we get people together for the BPM. The goal is to reconstruct the incident timeline in writing, along with what we were thinking at each step. The challenge here is to separate hindsight from what we actually knew as actions were taken leading up to the incident. Experienced facilitation for these discussions is critical to helping people share openly among their peers. Learning about our system is the ultimate goal of this conversation. The psychological safety provided by good facilitation and the support of our senior leadership is critical to effective learning. The Blameless Postmortem post on Code as Craft was our initial inspiration for this process. Building on this, we created a document template that captures the following information from the incident:
- Date/Time of Action or Event
- Actor’s Name
- Action or Event Description
Our focus of discussion tends to center on the last three items in the list above. Because the real value of the exercise is to learn about our systems (both tech and human), we choose to focus on the decisions we made at the time and try to understand the motivation for them. If the decisions had suboptimal outcomes, it is highly valuable to determine if there were signals available that could have told us about our impending skirmish. If we missed available signals, then we can focus our attention on why we missed them. If they weren’t there, we know to add them.
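One way to picture the written timeline is as a list of small records that get sorted into chronological order before the discussion. The sketch below is an illustration only, not our actual template. The `TimelineEntry` class and the `build_timeline` and `render` helpers are hypothetical names, and the optional `thinking` field is an assumption based on the goal of recording what people were thinking at each step.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime       # Date/Time of Action or Event
    actor: str           # Actor's Name
    description: str     # Action or Event Description
    thinking: str = ""   # assumed extra field: what the actor believed at the time


def build_timeline(entries: Iterable[TimelineEntry]) -> List[TimelineEntry]:
    """Return entries in chronological order, however they were collected."""
    return sorted(entries, key=lambda entry: entry.when)


def render(entries: Iterable[TimelineEntry]) -> str:
    """Format the timeline as plain text, one entry per line."""
    lines = []
    for entry in build_timeline(entries):
        line = f"{entry.when:%Y-%m-%d %H:%M} | {entry.actor} | {entry.description}"
        if entry.thinking:
            line += f" | thinking: {entry.thinking}"
        lines.append(line)
    return "\n".join(lines)
```

Keeping the actor's thinking next to each action is what lets the group later ask whether a signal was available at that moment, rather than judging the decision by its outcome.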
Learning From BPMs
What may make our BPM practice somewhat unusual is that everyone is invited to these sessions. So, anyone with an interest in learning from another team’s experience is able to participate. Our product teams share libraries and practices, so we all have an incentive to learn.
We always finish our BPMs by reviewing what participants learned and identifying any follow-up actions. Systems thinking takes practice. We choose to practice it as best we can in our daily work, but the BPM allows us to conduct thought experiments that cost only the 45-60 minutes a BPM usually takes.
As the participants recount their experiences and share their thoughts from the event, learners may witness a new scenario, one they might not have considered. Sometimes they have had similar experiences and can contribute to the dialog. If people gain a more thorough understanding of how other systems behave, they begin to see potential problems in their own systems before they arise.
Tom Bellinson has been working in information technology positions for 40 years. His diverse background has allowed him to gain intimate working knowledge in technical, marketing, sales and executive roles. He currently serves as a Scrum Master at ITHAKA, best known for JSTOR, a globally recognized online academic research system. Bellinson holds a degree in Communications with a Minor in Management from Oakland University in Rochester, MI, and has held a variety of technical certifications including APICS, CPIM, and CSCP.