The five whys were invented by Sakichi Toyoda in the early 20th century as a method for finding systemic causes of manufacturing defects. The idea: keep asking "why" until you reach the root cause, not the proximate one. It was a genuinely brilliant insight. Toyota built one of the most reliable manufacturing operations in history partly on this principle.

We took this idea, applied it to software incidents, and almost immediately got it backwards.

In practice, the five whys in a software postmortem work like this: you ask "why" until you reach a human decision, and then you stop. The human who made that decision is your root cause. Investigation complete. Postmortem filed. Blame assigned.

A Five Whys in the Wild — Annotated

Why 1: Why did the database go down? → A query saturated the connection pool.
Why 2: Why did the query saturate the pool? → Someone ran a migration without a transaction timeout.
Why 3: Why did they run it without a timeout? → "Alex ran it manually, following the runbook."
STOP: We found a human. Investigation over. Alex is the root cause. Whys 4 and 5 (why does the runbook lack a timeout? why is the runbook not code-reviewed? why can any engineer run this manually?) remain unasked. Alex remains slightly nervous in retros for six months.
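Why 4 points at something fixable: the tooling could simply refuse to run a migration without a timeout, making the "forgotten" step impossible rather than training-dependent. A minimal sketch of that idea — Postgres's `SET LOCAL statement_timeout` is real, but the wrapper function and its name are illustrative, not from this incident:

```python
def guarded_migration_sql(sql: str, timeout_ms: int) -> str:
    """Wrap migration SQL so it cannot execute without a statement timeout.

    Illustrative guardrail: the runner raises instead of proceeding
    when no positive timeout is supplied.
    """
    if timeout_ms <= 0:
        raise ValueError("refusing to run a migration without a timeout")
    # SET LOCAL scopes the timeout to the enclosing transaction only.
    return (
        "BEGIN;\n"
        f"SET LOCAL statement_timeout = {timeout_ms};\n"
        f"{sql.rstrip(';')};\n"
        "COMMIT;"
    )

print(guarded_migration_sql("ALTER TABLE users ADD COLUMN age int", 5000))
```

The point is not this exact wrapper. The point is that "Alex forgot the timeout" stops being a possible sentence once the tooling will not proceed without one.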

Why This Happens: The Organizational Mechanics of Blame

The failure of root cause analysis in software engineering isn't a training problem. It's not that people don't know how to do RCA properly. It's that blame assignment is doing a different — and organizationally valuable — job that nobody wants to admit out loud.

Blame Creates Closure

An incident without a responsible party is an open loop. Open loops create anxiety in organizations, especially among stakeholders who can't tell the difference between "we found the systemic cause" and "we covered it up." Naming a person closes the loop. The incident has an author. It can be filed. It is over.

Blame Creates Accountability Theater

Leadership needs to be able to answer "what are we doing to ensure this doesn't happen again?" with something more tangible than "we're improving our deployment pipeline systematically." They need a face. A conversation had. An action taken. Blame provides this. Systemic improvement is invisible. Blame is legible.

The uncomfortable truth: Organizations don't fail at blameless postmortems because they're bad at postmortems. They fail because blameless postmortems don't produce the outputs that incidents create organizational demand for: closure, visible accountability, and speed.

Blame Is Fast

A thorough systemic incident investigation takes time. Finding the actual contributing factors — the gaps in tooling, the latent pressure that caused someone to skip a step, the missing automation that would have prevented the manual action — requires work that extends well past the incident. Blame assignment takes about 15 minutes in the right war room.

The Science Behind Why RCA Is Hard

Modern safety research — the kind that underpins aviation safety and nuclear reliability — has largely moved past the concept of "root cause" entirely. The field calls this the "New View" or systems thinking approach, and it's built on a disturbing insight: complex systems don't have root causes.

They have contributing factors. Plural. Interacting. Distributed across time and organizational layers. The reason a plane crashes is never one thing; it's the combination of the fatigued crew, the ambiguous procedure, the warning light that cried wolf sixteen times before, and the weather that was technically within limits but shouldn't have been.

Software incidents work the same way. The outage wasn't caused by Alex's migration. It was caused by the migration plus the connection pool configuration plus the lack of circuit breakers plus the alerting that fired 8 minutes too late plus the runbook that assumed a smaller database plus the review process that rubber-stamps infra changes on Fridays.

That's a much more expensive set of action items. So we call it Alex's fault and ship the postmortem.
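One item on that expensive list — the missing circuit breaker — is small enough to sketch. A minimal, illustrative breaker that fails fast instead of letting callers pile onto a saturated pool (class name, thresholds, and cooldown are assumptions, not any particular library's API):

```python
import time
from typing import Optional


class CircuitBreaker:
    """Minimal circuit breaker: trips open after N consecutive failures,
    rejects calls while open, and allows a probe call after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Fail fast: don't add load to a struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

With a breaker like this in front of the connection pool, a runaway migration degrades one code path instead of taking down the database for everyone.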

What RCA Produces

  • A proximate human cause
  • One or two action items, usually assigned to that human
  • A filed document nobody re-reads
  • Slightly increased caution from a specific person
  • No systemic change

What Good Postmortems Produce

  • A timeline with contributing factors
  • System-level improvements
  • Prioritized action items with owners across teams
  • Shared mental models of failure modes
  • Lower incident frequency
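Those outputs can be baked into the postmortem template itself: a record with room for many contributing factors and cross-team owners, so a singular "root cause" field never exists to be filled in with a name. A sketch — field and class names are assumptions, not a standard:

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ActionItem:
    description: str
    owner_team: str  # accountability lands on a team, not an individual


@dataclass
class Postmortem:
    incident_id: str
    timeline: List[str] = field(default_factory=list)             # ordered events
    contributing_factors: List[str] = field(default_factory=list)  # plural, always
    action_items: List[ActionItem] = field(default_factory=list)

    def teams_involved(self) -> Set[str]:
        return {item.owner_team for item in self.action_items}
```

A template with a `contributing_factors` list and no `root_cause` field quietly changes what investigators go looking for.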

The Signs Your RCA Process Has Become Blame Assignment

Diagnostic questions for your next postmortem:

  • Did the investigation stop at the first human decision it found?
  • Are the action items assigned mostly (or entirely) to the person closest to the trigger?
  • Does the document name an individual where it could name a missing guardrail?
  • Was the last postmortem filed and never re-read?
  • Did anyone leave the review more nervous rather than more informed?

If you answered yes to three or more: you are doing blame assignment with RCA paperwork. This is not a judgment. It is a description of most engineering organizations.

# Actual postmortem action items (sanitized)
"1. Alex to complete database migration training by EOQ"
"2. Alex to review runbook and update"
"3. Alex to add migration checklist to Confluence"
# Systemic action items: 0
# "Alex" job listings found: 1 (6 weeks later)

What Actually Reduces Incident Recurrence

The organizations with the lowest incident rates — the ones reducing total downtime by reducing incident frequency, not just shaving MTTR — do a few things differently in their incident investigation process:

They treat incidents as information, not failures. The incident is data about the gap between how they thought the system worked and how it actually works. This reframe makes investigation genuinely curious rather than accusatory.

They separate investigation from accountability. Accountability happens outside the postmortem. The postmortem is purely analytical. This is genuinely hard to maintain under organizational pressure, which is why it requires explicit cultural and process support.

They ask what would have had to be true for this not to happen. Not "who should have known better" — but "what would the system have needed to make this impossible or automatically corrected." That question produces infrastructure improvements. The former produces nervous engineers.

The irony is that good RCA actually does produce accountability — just at the system level rather than the individual level. The team is accountable for fixing the conditions that allowed the incident. That accountability is collective, forward-looking, and actually changes things.

Mean Time to Blame is fast. Mean Time to Actually Fix the Underlying Problem is slower. The organizations that accept this tradeoff explicitly are the ones whose incident graphs trend downward over time.


If this resonates

Ciroos Actually Solves This

AI SRE teammates that surface systemic patterns before the incident — so your postmortems can investigate causes, not assign blame. Lower MTTB. More learning.

See How Ciroos Works → Calculate Your MTTB Score