This is part of the Post-Mortem series. Read the Executive Brief (7 min), the Field Guide (20 min), or the Definitive Guide (60 min, canonical).
Major Incidents Are Swiss Cheese, Not Single Bullets
When a major outage happens, human nature seeks a simple explanation. We want to find the one person or one decision that "caused" the problem. But in complex systems like our production environments, failures are almost never due to one person or one glitch in isolation: they result from multiple contributing factors aligning perfectly.
It's like Swiss cheese: each slice has holes, but you can only see through the stack when several holes line up. Similarly, most incidents require multiple things to go wrong simultaneously. A code bug and a missing alert and a slow rollback procedure and unclear documentation all conspire together.
If we only blame the engineer who pushed the code, we miss the other three factors that made the incident possible - and inevitable.

The Trap of Human Nature
Even veteran incident investigators fall into predictable cognitive traps:
Hindsight Bias Makes Everything "Obvious"
After an incident, it's human nature to ask "Who missed the warning signs?" and "How did we not see this coming?" Those questions fall prey to hindsight bias: the tendency for past events to seem more predictable than they actually were.
Once we know the outcome, we conclude we "should have known" things that were actually unknowable beforehand. This makes us judge decisions based on their outcomes rather than the information available when they were made.
Confirmation Bias Seeks Easy Answers
Studies show that even experienced investigators can be led astray by their preconceived theories. They seek evidence to fit a favorite hypothesis and overlook contrary facts.⁹ In practice, this means a post-mortem might pin the cause on an easy scapegoat when reality involves multiple contributing factors.
Dave Zwieback points out that hindsight and blame create a "comfortable story" that satisfies our need for closure but prevents real learning.⁷ We prematurely decide "Susan deployed bad code, that's the root cause," and stop analyzing deeper systemic issues.
The Fundamental Attribution Error
When something goes wrong, we tend to attribute others' actions to their character ("Bob is careless") while attributing our own actions to circumstances ("I was under pressure"). This bias leads us to focus on individual traits rather than the contextual factors that influenced behavior.
The Systems Thinking Alternative
Leading organizations shift their analysis from "who" to "how" by examining the system of conditions that made failure possible:
Ask Better Questions
Transform your incident investigation language:
- Instead of: "Why did Bob deploy a bug on Friday?" Ask: "What testing or review process failed such that a bug made it to production? What pressures or assumptions led Bob to think deployment was safe?"
- Instead of: "Who missed the alert?" Ask: "How could our alerting system be designed so critical issues are impossible to miss?"
- Instead of: "Why didn't Sarah follow the runbook?" Ask: "What made the runbook difficult to follow? How could we make the correct path the easy path?"
This shift reveals systemic fixes rather than individual blame.
Apply Structured Analysis Frameworks
Use proven techniques to systematically explore contributing factors:
The "5 Whys" with a Twist
Traditional 5 Whys asks "Why did this happen?" five times. Systems-focused 5 Whys asks "Why did the system allow this to happen?" each time:
- Why did the database time out? - The connection pool was exhausted
- Why did the system allow the connection pool to be exhausted? - No monitoring alerts fired
- Why did the system allow alerts to not fire? - The threshold was set too high
- Why did the system allow incorrect thresholds? - No process for validating alert configurations
- Why did the system allow missing validation processes? - No ownership model for infrastructure reliability
Each "why" reveals a layer of systemic opportunity for improvement.
Fishbone (Ishikawa) Diagrams
Map contributing causes across categories:
- Human factors: Was someone new? Under pressure? Missing training?
- Process issues: Were procedures unclear? Missing steps? Conflicting guidance?
- Technology problems: Tool failures? Missing capabilities? Poor interfaces?
- Environmental factors: Time pressure? Resource constraints? External dependencies?
This structured approach ensures you don't miss entire categories of contributing factors.
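If your post-mortem notes live somewhere scriptable, one lightweight way to apply these categories is to tag each contributing factor and flag the categories nobody has explored yet. The incident details below are invented for illustration:

```python
# Tag each contributing factor with a fishbone category, then flag the
# categories the review has not yet explored. All details are hypothetical.
FISHBONE_CATEGORIES = {"human", "process", "technology", "environment"}

contributing_factors = [
    {"category": "technology", "factor": "Connection pool limit hardcoded to a stale value"},
    {"category": "process", "factor": "No review step for alert threshold changes"},
    {"category": "human", "factor": "On-call engineer was two weeks into the rotation"},
]

covered = {f["category"] for f in contributing_factors}
for gap in sorted(FISHBONE_CATEGORIES - covered):
    print(f"Reminder: no contributing factors recorded under '{gap}' yet")
```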
Examine Human Factors as System Issues
When someone makes a mistake, resist the urge to focus on their individual failings. Instead, examine the conditions that made the mistake possible:
- Was documentation misleading or incomplete?
- Did alert fatigue cause an alarm to be ignored?
- Was the engineer new to this system or under time pressure?
- Were procedures tested under realistic conditions?
- Did tooling make the wrong action easy and the right action difficult?
As Dave Zwieback puts it: "Human error is a symptom, never the cause, of deeper trouble in the system."⁷ If someone made a mistake, ask what made that mistake possible and how the system could catch or prevent it.
Learning from Aviation's Transformation
The aviation industry provides a powerful example of systems thinking in action. Aviation achieved a 95%+ incident reporting rate and dramatically reduced accidents by adopting a systemic, non-blame approach.²²
Through programs like NASA's Aviation Safety Reporting System, pilots receive limited immunity from enforcement action when they voluntarily report errors and near misses. This has created an enormous database of systemic issues and fixes. The result: aviation's fatal accident rate has kept dropping even as system complexity has grown.²²
The key insight: When people aren't punished for mistakes, they report problems freely, and the organization as a whole gets safer. Nearly all incidents get reported, providing a rich dataset for systematic improvement.
Practical Systems Analysis Techniques
Look for Patterns Across Incidents
Don't analyze incidents in isolation. Review past incidents to identify systemic trends:
- Are multiple incidents related to the same microservice?
- Do several outages stem from similar configuration mistakes?
- Is there a pattern of incidents happening at specific times or under specific conditions?
Pattern recognition reveals system-level weaknesses that might not be obvious from single incidents.
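If your incident tracker can export records, even a few lines of scripting will surface these trends. The field names below are assumptions for illustration; adapt them to whatever your tooling actually exports:

```python
from collections import Counter

# Hypothetical incident records exported from an incident tracker; the fields
# ("service", "category", "started_at") are assumptions for illustration.
incidents = [
    {"id": "INC-101", "service": "checkout", "category": "config-change", "started_at": "2024-03-02T02:10"},
    {"id": "INC-114", "service": "checkout", "category": "config-change", "started_at": "2024-04-11T01:47"},
    {"id": "INC-120", "service": "search", "category": "capacity", "started_at": "2024-04-29T14:05"},
]

# Count recurring services and categories to surface system-level weak spots.
by_service = Counter(i["service"] for i in incidents)
by_category = Counter(i["category"] for i in incidents)
off_hours = sum(1 for i in incidents if int(i["started_at"][11:13]) < 6)

print("Incidents per service:", by_service.most_common())
print("Incidents per category:", by_category.most_common())
print(f"{off_hours}/{len(incidents)} incidents started before 06:00")
```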
Include Multiple Perspectives
Gather input from all relevant areas during post-mortem discussions:
- Operations might spot monitoring gaps
- QA might note missing test cases
- Support might reveal customer-facing symptoms that weren't obvious
- Security might identify broader vulnerability patterns
Different perspectives ensure nothing gets missed and reveal the full system context.
Document Alternative Hypotheses
Force yourself to consider multiple possible explanations:
- What other factors could have contributed?
- What nearly went right that prevented worse outcomes?
- What assumptions are we making that might be wrong?
This counteracts the tendency to settle on the first plausible explanation.
Common Systems Thinking Mistakes
Stopping at the First Reasonable Cause
Just because you found a cause doesn't mean you found all the causes. Complex incidents typically have 3-4 significant contributing factors. Keep digging until you understand the full system context.
Focusing Only on Technical Factors
Remember to examine organizational, process, and human factors alongside technical ones. Often the most impactful fixes involve clarifying procedures, improving training, or adjusting team structures.
Making Fixes Too Specific
If your incident was caused by "missing validation in the user registration service," don't just add validation to that one service. Ask: "How many other services have similar validation gaps? How do we prevent this class of problem systematically?"
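One way to act on that question, sketched under the assumption of a monorepo with one directory per service and a shared validate_registration helper (both hypothetical), is a repository-wide test that fails whenever any service skips the shared validator:

```python
# Hypothetical repo-wide guard: rather than patching one service, fail the
# build for every service whose registration handler skips the shared
# validator. The directory layout and function name are assumptions.
import pathlib
import re

SERVICES_DIR = pathlib.Path("services")
REQUIRED_CALL = re.compile(r"\bvalidate_registration\(")

def services_missing_validation():
    missing = []
    for handler in SERVICES_DIR.glob("*/registration.py"):
        if not REQUIRED_CALL.search(handler.read_text()):
            missing.append(handler.parent.name)
    return sorted(missing)

def test_all_services_validate_registration():
    # Run under pytest; a non-empty list means a service re-introduced the gap.
    assert services_missing_validation() == [], (
        "Services accepting registrations without shared validation: "
        + ", ".join(services_missing_validation())
    )
```

The specific bug gets fixed once; the test keeps the whole class of gaps from quietly returning.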
The Business Value of Systems Thinking
Organizations that adopt systems thinking in incident analysis see multiple benefits:
Prevents Entire Classes of Incidents
Instead of fixing one specific bug, you fix the conditions that allow similar bugs to reach production. This dramatically reduces repeat incidents.
Improves Team Morale
Engineers appreciate analyses that focus on improving systems rather than assigning blame. This leads to better retention and more willing participation in incident reviews.
Builds Antifragile Systems
By understanding how failures propagate through your system, you can design resilience that actually improves under stress. Companies like Netflix have embraced this through chaos engineering.¹⁷
Your Implementation Checklist
- Modify your post-mortem template to include a "Contributing Factors" section (plural)
- Train facilitators to ask "how" and "what" questions instead of "who" and "why"
- Include "What went well" sections to balance the analysis
- Require multiple perspectives in every significant incident review
- Look for patterns across incidents rather than treating each as isolated
Continue the series:
- Previous: Psychological Safety Infrastructure - Building blame-free cultures that surface truth
- Next: Action Accountability That Sticks - Closing the execution gap on improvements
- Four-Phase Implementation Playbook - Step-by-step timeline from incident to improvement
- Convincing Skeptical Leaders - Getting executive support for transformation
Want the definitive framework? Read the Definitive Guide for detailed implementation steps, aviation case studies, and structured analysis templates.
Resources
- Definitive Guide (60 min) – canonical reference
- Post-Mortem Cheat Sheet – free quick-reference checklist
- Post-Mortem Template – free, ready-to-use Notion template
- Blameless Post-Mortem Policy – ready-to-implement blameless policy framework