What percentage of incidents are preventable?

A 2024 study of 26 major fintech incidents found that 80% stemmed from internal changes like deployments and config updates that weren't tested or controlled properly. Additionally, 69% of those incidents lacked proactive alerts, meaning teams only discovered problems after damage was done. In that sample, the vast majority of outages were self-inflicted and caught too late.

How do elite teams differ from average teams in incident management?

Elite teams prevent approximately 95% of repeat incidents by systematically addressing not just immediate causes but the conditions that allowed failures to happen. Average teams get stuck in a blame-fix-repeat cycle, treating each incident as isolated rather than as symptoms of systemic issues. Companies with continuous learning cultures experience far fewer customer-impacting incidents than their peers.

What does downtime really cost a business?

Gartner estimates downtime costs approximately $5,600 per minute on average, which works out to about $336,000 per hour. For high-traffic services, costs can be even higher. Beyond direct costs, there's also customer churn, SLA penalties, and the opportunity cost of engineering time spent firefighting instead of innovating.

Why do incidents keep repeating at my organization?

Repeat incidents happen due to three main factors: the blame reflex that stops investigation once a scapegoat is found, hindsight bias that makes failures seem obvious and leads to shallow fixes, and the action item void where follow-up tasks disappear into backlogs. Without clear ownership, tracking, and systemic analysis, the underlying vulnerabilities remain unfixed.

Effective Post-Mortems: Reality Check

This is part of the Post-Mortem series. Read the Executive Brief (7 min), the Field Guide (20 min), or the Definitive Guide (60 min, canonical).


The Recurring Nightmare

A director explained the same database timeout issue for the third time in six months. Each incident write-up blamed a different team member, but the root cause never changed. No systemic fixes followed, so the outages kept happening.

This isn't a rare story; it's common when post-mortems are treated as a formality or finger-pointing exercise. Most organizations do create post-mortem reports after big incidents, but they skip the hard work of systemic change. The result: the same failures repeat.

The Brutal Data

Here's what empirical research tells us about why incidents happen and keep happening:

80% Are Self-Inflicted

A 2024 study of 26 major fintech incidents found that 80% of incidents stemmed from internal changes: deployments, config updates, and other modifications that weren't tested or controlled properly.¹ In other words, most outages aren't caused by external forces or unforeseeable circumstances. They're self-inflicted wounds from our own actions.

69% Lack Early Warning

The same study showed 69% of incidents lacked proactive alerts, meaning teams only discovered the problem after damage was done.¹ These weren't subtle, hard-to-detect issues. They were problems that could have been caught if proper monitoring and alerting systems were in place.

Most Learning Efforts Fail

Despite formal incident processes, recurring IT incidents persist across most organizations. This indicates that teams aren't truly learning or improving systems; they're going through the motions without addressing underlying causes.

The Elite vs. Average Divide

The gap between average and elite teams is enormous when it comes to incident management:

Elite Teams: The Prevention Masters

High-performing organizations virtually eliminate repeat failures. In mature Site Reliability Engineering (SRE) cultures, major incidents rarely recur. Companies with continuous learning cultures (blameless post-mortems, proactive fixes) experience far fewer customer-impacting incidents than their peers.²

Elite teams prevent ~95% of repeat incidents. When they have an outage, they systematically address not just the immediate cause but the conditions that allowed it to happen. They ask: "What other ways could this type of failure occur?" and "How do we prevent the entire class of similar incidents?"

Average Teams: The Blame Cycle

Most teams get stuck in a reactive pattern:

Incident happens
Someone gets blamed
Surface-level fix applied
Same type of incident happens again
Different person gets blamed
Cycle repeats

Teams caught in this cycle stay reactive. They treat each incident as an isolated event rather than a symptom of systemic issues. Their post-mortems focus on "who" rather than "how," missing opportunities for meaningful improvement.

The Hidden Organizational Costs

The cost of this repetitive cycle extends beyond immediate downtime:

Financial Impact

Gartner estimates downtime costs ~$5,600 per minute on average¹³
For high-traffic services, costs can run even higher than the roughly $336,000 per hour that average implies
The savings from preventing even one repeat incident can far outweigh the engineering effort needed
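
To make that trade-off concrete, here is a rough back-of-the-envelope sketch. Only the $5,600 per minute figure comes from the estimate above. The function names, the 45-minute outage, the 80 engineer-hours, and the $150 hourly rate are hypothetical values chosen for illustration; substitute your own numbers.

```python
# Illustrative arithmetic only. Every input except the per-minute
# figure is an assumption made up for this example.
COST_PER_MINUTE_USD = 5_600  # Gartner average cited above


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimated direct cost of an outage lasting `minutes`."""
    return minutes * cost_per_minute


def prevention_payoff(outage_minutes, fix_hours, hourly_rate):
    """Avoided outage cost minus the engineering cost of a systemic fix."""
    return downtime_cost(outage_minutes) - fix_hours * hourly_rate


if __name__ == "__main__":
    # One hour of downtime at the average rate.
    print(downtime_cost(60))
    # Hypothetical: a 45-minute repeat outage vs. 80 engineer-hours at $150/hour.
    print(prevention_payoff(45, 80, 150))
```

This counts direct cost only. Churn, SLA penalties, and burnout are not modeled, so the real payoff of a fix is likely understated.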

Human Impact

Engineers suffer from firefighting fatigue
On-call burnout increases with repeated incidents
Organizations with poor incident practices have 21% higher attrition²
Talent chooses to work where they won't constantly fight the same fires

Opportunity Cost

Engineering time spent on repeat incidents can't be spent on innovation
Teams in blame cycles focus on covering themselves rather than improving systems
Competitive advantage erodes when engineering capacity is consumed by preventable problems

Why Smart Teams Get Stuck

Even intelligent, well-intentioned teams fall into this trap. Three factors create the repetitive incident cycle:

The Blame Reflex

When something goes wrong, human nature seeks someone to hold responsible. This satisfies our need for closure but prevents deep analysis. We stop investigating once we find a scapegoat, missing the systemic factors that made the failure possible.

Hindsight Bias

After an incident, everything seems obvious. We think "we should have known" things that were actually unknowable beforehand. This false clarity leads to shallow fixes focused on individual awareness rather than system design.

The Action Item Void

Even when good insights emerge, execution often fails. Without clear ownership and tracking, follow-up tasks disappear into backlogs. Teams move on to new work, and the underlying vulnerabilities remain.
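
One way to keep follow-ups from vanishing is to make ownership and due dates explicit, then check for gaps on a schedule. The sketch below is a minimal illustration, not a prescribed tool: the `ActionItem` fields and the `needs_attention` helper are hypothetical names, and in practice the data would come from whatever tracker your team already uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]   # None means nobody has accepted the task
    due: Optional[date]    # None means no deadline was set
    done: bool = False


def needs_attention(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open items that are unowned, undated, or past due."""
    flagged = []
    for item in items:
        if item.done:
            continue
        if item.owner is None or item.due is None or item.due < today:
            flagged.append(item)
    return flagged


if __name__ == "__main__":
    # Hypothetical follow-ups from a database timeout post-mortem.
    items = [
        ActionItem("Add connection pool alert", "dana", date(2025, 1, 10)),
        ActionItem("Review query timeout settings", None, date(2025, 3, 1)),
        ActionItem("Load test failover path", "lee", date(2025, 3, 1)),
    ]
    for item in needs_attention(items, today=date(2025, 2, 1)):
        print(item.title)
```

Running a check like this weekly and reviewing the flagged list in a standing meeting gives the "clear ownership and tracking" a concrete form.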

The Path Forward

The good news? This cycle isn't inevitable. Organizations that implement systematic approaches to incident learning see dramatic improvements:

50% reduction in repeat incidents within 12 months¹²
30% faster resolution times as teams get better at diagnosis¹²
Significantly higher team satisfaction as firefighting decreases²

The key is moving from reactive blame to proactive system strengthening. This requires cultural changes (psychological safety), analytical changes (systems thinking), and process changes (action accountability).
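
To see whether repeat incidents are actually falling, a team first needs a working definition of "repeat." The sketch below assumes each incident has been tagged with a single contributing-factor label during its post-mortem, and counts every occurrence of a label after the first as a repeat. The function name and the labels are hypothetical, and this is only one possible definition.

```python
from collections import Counter
from typing import List


def repeat_incident_rate(factor_labels: List[str]) -> float:
    """Share of incidents whose contributing-factor label appeared before.

    `factor_labels` holds one label per incident for the period measured.
    """
    if not factor_labels:
        return 0.0
    counts = Counter(factor_labels)
    # Every occurrence of a label beyond the first counts as a repeat.
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(factor_labels)


if __name__ == "__main__":
    # Hypothetical labels for five incidents in one quarter.
    quarter = ["db-timeout", "bad-config", "db-timeout", "db-timeout", "cert-expiry"]
    print(repeat_incident_rate(quarter))
```

Comparing this rate quarter over quarter shows a trend, but it is only as good as the labeling: if post-mortems stop at "who," the labels will hide the repeats.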

Continue the series:

Psychological Safety Infrastructure - Building blame-free cultures that surface truth
Systems Thinking Over Person-Hunting - Finding root causes in complex systems
Action Accountability That Sticks - Closing the execution gap on improvements
Four-Phase Implementation Playbook - Step-by-step timeline from incident to improvement
Convincing Skeptical Leaders - Getting executive support for transformation

Want the definitive framework? Read the Definitive Guide for detailed implementation steps, success stories, and metrics for measuring transformation.

Resources

Definitive Guide (60 min) – canonical reference
- https://www.benjamincharity.com/articles/post-mortem-definitive-guide

Benjamin Charity
