This is part of the Post-Mortem series. Read the Executive Brief (7 min), the Field Guide (20 min), or the Definitive Guide (60 min, canonical).
Major Incidents Are Swiss Cheese, Not Single Bullets
When a major outage happens, human nature seeks a simple explanation. We want to find the one person or one decision that "caused" the problem. But in complex systems like our production environments, failures are almost never due to one person or one glitch in isolation: they result from multiple contributing factors aligning perfectly.
It's like Swiss cheese: each slice has holes, but you can only see through the stack when several holes line up. Similarly, most incidents require multiple things to go wrong simultaneously. A code bug and a missing alert and a slow rollback procedure and unclear documentation all conspire together.
If we only blame the engineer who pushed the code, we miss the other three factors that made the incident possible - and inevitable.

The Trap of Human Nature
Even veteran incident investigators fall into predictable cognitive traps:
Hindsight Bias Makes Everything "Obvious"
After an incident, it's human nature to ask "Who missed the warning signs?" and "How did we not see this coming?" Those questions fall prey to hindsight bias: the tendency for past events to seem more predictable than they actually were.
Once we know the outcome, we conclude we "should have known" things that were actually unknowable beforehand. This makes us judge decisions based on their outcomes rather than the information available when they were made.
Confirmation Bias Seeks Easy Answers
Studies show that even experienced investigators can be led astray by their preconceived theories. They seek evidence to fit a favorite hypothesis and overlook contrary facts.⁹ In practice, this means a post-mortem might pin the cause on an easy scapegoat when reality involves multiple contributing factors.
Dave Zwieback points out that hindsight and blame create a "comfortable story" that satisfies our need for closure but prevents real learning.⁷ We prematurely decide "Susan deployed bad code, that's the root cause," and stop analyzing deeper systemic issues.
The Fundamental Attribution Error
When something goes wrong, we tend to attribute others' actions to their character ("Bob is careless") while attributing our own actions to circumstances ("I was under pressure"). This bias leads us to focus on individual traits rather than the contextual factors that influenced behavior.
The Systems Thinking Alternative
Leading organizations shift their analysis from "who" to "how" by examining the system of conditions that made failure possible:
Ask Better Questions
Transform your incident investigation language:
- Instead of: "Why did Bob deploy a bug on Friday?" Ask: "What testing or review process failed such that a bug made it to production? What pressures or assumptions led Bob to think deployment was safe?"
- Instead of: "Who missed the alert?" Ask: "How could our alerting system be designed so critical issues are impossible to miss?"
- Instead of: "Why didn't Sarah follow the runbook?" Ask: "What made the runbook difficult to follow? How could we make the correct path the easy path?"
This shift reveals systemic fixes rather than individual blame.
Apply Structured Analysis Frameworks
Use proven techniques to systematically explore contributing factors:
The "5 Whys" with a Twist
Traditional 5 Whys asks "Why did this happen?" five times. Systems-focused 5 Whys asks "Why did the system allow this to happen?" each time:
- Why did the database time out? - The connection pool was exhausted
- Why did the system allow the connection pool to be exhausted? - No monitoring alerts fired
- Why did the system allow alerts to not fire? - The threshold was set too high
- Why did the system allow incorrect thresholds? - No process for validating alert configurations
- Why did the system allow missing validation processes? - No ownership model for infrastructure reliability
Each "why" reveals a layer of systemic opportunity for improvement.
Fishbone (Ishikawa) Diagrams
Map contributing causes across categories:
- Human factors: Was someone new? Under pressure? Missing training?
- Process issues: Were procedures unclear? Missing steps? Conflicting guidance?
- Technology problems: Tool failures? Missing capabilities? Poor interfaces?
- Environmental factors: Time pressure? Resource constraints? External dependencies?
This structured approach ensures you don't miss entire categories of contributing factors.
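If your post-mortem notes live somewhere scriptable, one lightweight way to apply these categories is to tag each contributing factor and flag the categories nobody has explored yet. The incident details below are invented for illustration:

```python
# Tag each contributing factor with a fishbone category, then flag the
# categories the review has not yet explored. All details are hypothetical.
FISHBONE_CATEGORIES = {"human", "process", "technology", "environment"}

contributing_factors = [
    {"category": "technology", "factor": "Connection pool limit hardcoded to a stale value"},
    {"category": "process", "factor": "No review step for alert threshold changes"},
    {"category": "human", "factor": "On-call engineer was two weeks into the rotation"},
]

covered = {f["category"] for f in contributing_factors}
for gap in sorted(FISHBONE_CATEGORIES - covered):
    print(f"Reminder: no contributing factors recorded under '{gap}' yet")
```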
Examine Human Factors as System Issues
When someone makes a mistake, resist the urge to focus on their individual failings. Instead, examine the conditions that made the mistake possible:
- Was documentation misleading or incomplete?
- Did alert fatigue cause an alarm to be ignored?
- Was the engineer new to this system or under time pressure?
- Were procedures tested under realistic conditions?
- Did tooling make the wrong action easy and the right action difficult?
As Dave Zwieback puts it: "Human error is a symptom, never the cause, of deeper trouble in the system."⁷ If someone made a mistake, ask what made that mistake possible and how the system could catch or prevent it.
Learning from Aviation's Transformation
The aviation industry provides a powerful example of systems thinking in action. Aviation achieved a 95%+ incident reporting rate and dramatically reduced accidents by adopting a systemic, non-blame approach.²²
Through programs like NASA's Aviation Safety Reporting System, pilots receive limited immunity from enforcement action when they voluntarily report errors and near misses. This has created an enormous database of systemic issues and fixes. The result: aviation's fatal accident rate has kept dropping even as system complexity has grown.²²
The key insight: When people aren't punished for mistakes, they report problems freely, and the organization as a whole gets safer. Nearly all incidents get reported, providing a rich dataset for systematic improvement.
Practical Systems Analysis Techniques
Look for Patterns Across Incidents
Don't analyze incidents in isolation. Review past incidents to identify systemic trends:
- Are multiple incidents related to the same microservice?
- Do several outages stem from similar configuration mistakes?
- Is there a pattern of incidents happening at specific times or under specific conditions?
Pattern recognition reveals system-level weaknesses that might not be obvious from single incidents.
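If your incident tracker can export records, even a few lines of scripting will surface these trends. The field names below are assumptions for illustration; adapt them to whatever your tooling actually exports:

```python
from collections import Counter

# Hypothetical incident records exported from an incident tracker; the fields
# ("service", "category", "started_at") are assumptions for illustration.
incidents = [
    {"id": "INC-101", "service": "checkout", "category": "config-change", "started_at": "2024-03-02T02:10"},
    {"id": "INC-114", "service": "checkout", "category": "config-change", "started_at": "2024-04-11T01:47"},
    {"id": "INC-120", "service": "search", "category": "capacity", "started_at": "2024-04-29T14:05"},
]

# Count recurring services and categories to surface system-level weak spots.
by_service = Counter(i["service"] for i in incidents)
by_category = Counter(i["category"] for i in incidents)
off_hours = sum(1 for i in incidents if int(i["started_at"][11:13]) < 6)

print("Incidents per service:", by_service.most_common())
print("Incidents per category:", by_category.most_common())
print(f"{off_hours}/{len(incidents)} incidents started before 06:00")
```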
Include Multiple Perspectives
Gather input from all relevant areas during post-mortem discussions:
- Operations might spot monitoring gaps
- QA might note missing test cases
- Support might reveal customer-facing symptoms that weren't obvious
- Security might identify broader vulnerability patterns
Different perspectives ensure nothing gets missed and reveal the full system context.
Document Alternative Hypotheses
Force yourself to consider multiple possible explanations:
- What other factors could have contributed?
- What nearly went right that prevented worse outcomes?
- What assumptions are we making that might be wrong?
This counteracts the tendency to settle on the first plausible explanation.
Common Systems Thinking Mistakes
Stopping at the First Reasonable Cause
Just because you found a cause doesn't mean you found all the causes. Complex incidents typically have 3-4 significant contributing factors. Keep digging until you understand the full system context.
Focusing Only on Technical Factors
Remember to examine organizational, process, and human factors alongside technical ones. Often the most impactful fixes involve clarifying procedures, improving training, or adjusting team structures.
Making Fixes Too Specific
If your incident was caused by "missing validation in the user registration service," don't just add validation to that one service. Ask: "How many other services have similar validation gaps? How do we prevent this class of problem systematically?"
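One way to act on that question, sketched under the assumption of a monorepo with one directory per service and a shared validate_registration helper (both hypothetical), is a repository-wide test that fails whenever any service skips the shared validator:

```python
# Hypothetical repo-wide guard: rather than patching one service, fail the
# build for every service whose registration handler skips the shared
# validator. The directory layout and function name are assumptions.
import pathlib
import re

SERVICES_DIR = pathlib.Path("services")
REQUIRED_CALL = re.compile(r"\bvalidate_registration\(")

def services_missing_validation():
    missing = []
    for handler in SERVICES_DIR.glob("*/registration.py"):
        if not REQUIRED_CALL.search(handler.read_text()):
            missing.append(handler.parent.name)
    return sorted(missing)

def test_all_services_validate_registration():
    # Run under pytest; a non-empty list means a service re-introduced the gap.
    assert services_missing_validation() == [], (
        "Services accepting registrations without shared validation: "
        + ", ".join(services_missing_validation())
    )
```

The specific bug gets fixed once; the test keeps the whole class of gaps from quietly returning.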
The Business Value of Systems Thinking
Organizations that adopt systems thinking in incident analysis see multiple benefits:
Prevents Entire Classes of Incidents
Instead of fixing one specific bug, you fix the conditions that allow similar bugs to reach production. This dramatically reduces repeat incidents.
Improves Team Morale
Engineers appreciate analyses that focus on improving systems rather than assigning blame. This leads to better retention and more willing participation in incident reviews.
Builds Antifragile Systems
By understanding how failures propagate through your system, you can design resilience that actually improves under stress. Companies like Netflix have embraced this through chaos engineering.¹⁷
Your Implementation Checklist
- Modify your post-mortem template to include a "Contributing Factors" section (plural)
- Train facilitators to ask "how" and "what" questions instead of "who" and "why"
- Include "What went well" sections to balance the analysis
- Require multiple perspectives in every significant incident review
- Look for patterns across incidents rather than treating each as isolated
Continue the series:
- Previous: Psychological Safety Infrastructure - Building blame-free cultures that surface truth
- Next: Action Accountability That Sticks - Closing the execution gap on improvements
- Four-Phase Implementation Playbook - Step-by-step timeline from incident to improvement
- Convincing Skeptical Leaders - Getting executive support for transformation
Want the definitive framework? Read the Definitive Guide for detailed implementation steps, aviation case studies, and structured analysis templates.
Resources
- Definitive Guide (60 min) – canonical reference
- Post-Mortem Cheat Sheet – free quick-reference checklist
- Post-Mortem Template – free, ready-to-use Notion template
- Blameless Post-Mortem Policy – ready-to-implement blameless policy framework