Let's start with a definition. MTTR — Mean Time to Repair, Mean Time to Recovery, or Mean Time to Remediation, depending on who you ask and how good their week has been — technically measures the average time from incident detection through remediation and resolution. (In practice, it's whatever makes your metrics look good.) It's one of the four DORA metrics. It appears in reliability dashboards. People put it in slides. Leadership asks about it.
MTTB — Mean Time to Blame — measures something different: the average time from incident detection to the identification of the team, person, or service that caused it. It doesn't appear in any official framework. It has no dashboard. Nobody puts it in slides. And yet, in a substantial number of incident postmortems, it is the first metric to be satisfied.
The gap between these two metrics is the gap between the SRE organization you have and the one you think you have.
A Complete Map of SRE Incident Metrics
To understand why MTTB matters, you have to understand what the full landscape of incident metrics actually looks like — and which ones organizations use vs. which ones they aspire to use.
| Metric | What it measures | Who cares | Reality |
|---|---|---|---|
| MTTR | Detection → Resolution | Everyone (officially) | Heavily gamed; often excludes detection time |
| MTTD | Incident start → Detection | Monitoring teams | Underreported; hard to measure accurately |
| MTTI | Detection → Investigation start | Nobody (they should) | The gap where MTTB lives |
| MTTF | Restore → Next failure | Reliability-focused teams | Rarely tracked; deeply revealing |
| MTTB™ | Detection → Blame assigned | Everyone (unofficially) | Optimized in every org, measured in none |
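The official rows of that table reduce to timestamp arithmetic. A minimal sketch in Python (field names and numbers are invented; your incident tracker's schema will differ):

```python
from statistics import mean

# Hypothetical incident records; timestamps in minutes from an arbitrary epoch.
incidents = [
    {"started": 0, "detected": 8, "investigation_started": 14, "resolved": 46},
    {"started": 0, "detected": 3, "investigation_started": 5, "resolved": 22},
]

def mttd(incs):
    """Mean Time to Detect: incident start -> detection."""
    return mean(i["detected"] - i["started"] for i in incs)

def mtti(incs):
    """Mean Time to Investigate: detection -> investigation start."""
    return mean(i["investigation_started"] - i["detected"] for i in incs)

def mttr(incs):
    """Mean Time to Resolve: detection -> resolution."""
    return mean(i["resolved"] - i["detected"] for i in incs)

print(mttd(incidents), mtti(incidents), mttr(incidents))
```

MTTB would be one more column: blame_assigned minus detected. Nobody logs that timestamp, which is rather the point.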
How MTTR Gets Gamed (A Field Guide)
MTTR reduction is the stated goal of most SRE programs. It's also one of the most reliably distorted metrics in engineering. The distortions are rarely intentional — they're structural. Here's how they happen:
The Clock Start Problem
MTTR is calculated from when the incident was detected. But incident detection is itself a fuzzy concept. If a monitor fires at 2:14am but nobody acknowledges it until 2:22am, when did the incident start? Different tools, different teams, and different on-call practices will answer this differently. Organizations that report great MTTR often have generous clock-start definitions.
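To make the clock-start problem concrete, here is that 2:14am incident under three defensible definitions (the ticket and resolution timestamps are invented for illustration):

```python
# One incident, minutes past midnight.
monitor_fired = 134   # 2:14am: the monitor fires
acknowledged = 142    # 2:22am: on-call acknowledges
ticket_opened = 151   # 2:31am: incident ticket created (hypothetical)
resolved = 190        # 3:10am: service restored (hypothetical)

# Three honest MTTRs for the same incident, depending on where the clock starts.
mttr_from_monitor = resolved - monitor_fired  # 56 minutes
mttr_from_ack = resolved - acknowledged       # 48 minutes
mttr_from_ticket = resolved - ticket_opened   # 39 minutes
```

A team reporting the 39-minute number isn't lying; it just picked the most generous defensible clock start.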
The Rollback Cheat
Rollback is the fastest path to MTTR. Revert the deploy, metrics recover, incident closed. MTTR: 12 minutes. Excellent. The underlying issue — why the deploy broke production, why the tests didn't catch it, why the deploy was made at 5pm on a Friday — goes into the postmortem, gets three action items assigned to the team that shipped the change, and is quietly never addressed.
The system optimized for MTTR. The system got faster rollbacks. The system got the same incident again next month.
The MTTB Phenomenon: Why It Dominates
Here's what's actually happening in the first 30 minutes of most P0 incidents, in roughly this order:
1. Alert fires. On-call acknowledges.
2. War room convened. People join.
3. Someone asks "what changed recently?"
4. Git history and deploy log examined.
5. Someone is identified.
6. That person is asked to explain themselves.
7. Incident resolution continues, now with an audience and a narrative.
Steps 3–6 happen faster than steps 1–2, 7, and every step after. The identification of a responsible party is often the fastest-moving part of incident response. It's prioritized, implicitly, because it answers the organizational question everyone is actually asking: who did this?
What You Should Be Measuring Instead
The most useful SRE metrics are the ones that create feedback loops toward the behaviors you want. MTTR, as typically measured, creates a feedback loop toward faster rollbacks and more conservative clock-start definitions. That's not nothing, but it's not the loop you want.
Here's what the metrics framework looks like in organizations that are actually improving:
| What Most Teams Measure | What Improves Reliability |
|---|---|
| MTTR (often gamed) | MTTD — are you catching things early? |
| Incident count (often filtered) | MTTI — how fast does investigation start? |
| SLA compliance (often a lagging indicator) | Incident recurrence rate — same cause twice? |
| Deploy frequency (activity, not outcome) | Action item completion rate from postmortems |
| MTTB (not measured, fully optimized) | % of incidents caught before user impact |
The right side of that table has a common characteristic: these metrics measure the quality of your incident response and prevention work, not just its speed. They create pressure toward building better detection, better runbooks, and better systems — not just faster fingers on the rollback button.
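Of those, incident recurrence rate is the cheapest to start tracking if postmortems carry a root-cause tag. A sketch with invented tags:

```python
from collections import Counter

# Hypothetical root-cause tags from one quarter's postmortems.
root_causes = [
    "config-change", "bad-deploy", "config-change",
    "cert-expiry", "bad-deploy", "config-change",
]

counts = Counter(root_causes)
# Every incident beyond the first for a given cause is a repeat.
repeats = sum(n - 1 for n in counts.values())
recurrence_rate = repeats / len(root_causes)

print(f"recurrence rate: {recurrence_rate:.0%}")  # 3 of 6 incidents were repeats
```

A flat MTTR with a rising recurrence rate means you are getting faster at having the same incident.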
The Uncomfortable Reason MTTB Persists
MTTB persists not because engineers are bad people, but because organizations have legitimate needs that MTTB satisfies: accountability, closure, and the ability to tell stakeholders "we know what happened and we've addressed it." These are real organizational needs, and dismissing them as dysfunctional misses the point.
The question isn't how to eliminate the pressure toward blame — it's how to satisfy that pressure through systemic accountability rather than individual blame. The answer to "who's responsible?" in a high-performing SRE organization is "the team that owns this service, and here are the three things they're changing about the system." That's a different answer than "Jordan deployed at 4pm" — and it's actually more satisfying, if leadership has been educated to want it.
```sql
SELECT
    AVG(blame_assigned_at - detected_at) AS mttb_minutes,
    AVG(resolved_at - detected_at) AS mttr_minutes,
    AVG(blame_assigned_at - detected_at)
      / AVG(resolved_at - detected_at) * 100 AS pct_recovery_spent_on_blame
FROM incidents
WHERE severity = 'P0'
  AND quarter = 'Q4-2025';

-- Industry median: ~38% of MTTR spent on blame assignment
-- This is the number nobody tracks and everyone should
```
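If you want to actually run that query, here is a sketch against an in-memory SQLite database (schema and numbers invented; the ratio is computed from the full AVG expressions because standard SQL won't let one select-list expression reuse another's alias):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE incidents (
        severity TEXT, quarter TEXT,
        detected_at REAL, blame_assigned_at REAL, resolved_at REAL
    )
""")
# Synthetic P0 incidents; times are minutes since each incident's detection.
conn.executemany(
    "INSERT INTO incidents VALUES ('P0', 'Q4-2025', ?, ?, ?)",
    [(0, 15, 40), (0, 10, 30)],
)

row = conn.execute("""
    SELECT
        AVG(blame_assigned_at - detected_at) AS mttb_minutes,
        AVG(resolved_at - detected_at) AS mttr_minutes,
        AVG(blame_assigned_at - detected_at)
          / AVG(resolved_at - detected_at) * 100 AS pct_recovery_spent_on_blame
    FROM incidents
    WHERE severity = 'P0' AND quarter = 'Q4-2025'
""").fetchone()

print(row)  # (mttb_minutes, mttr_minutes, pct_recovery_spent_on_blame)
```

With these made-up numbers, 12.5 of an average 35 minutes go to blame assignment, about 36%.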
The Actual Path to MTTR Reduction
Real MTTR reduction — the kind that compounds over time and actually reduces incident frequency, not just incident duration — comes from a different place than faster rollbacks and better on-call rotations. It comes from understanding the relationship between your incidents.
The incidents you're having today are related to each other. There are patterns in what breaks, when it breaks, and why. Those patterns are in your data. They're in your postmortems. They're in your change history and your monitoring dashboards. Organizations that find and address those patterns see incident rates drop. Organizations that treat each incident as a discrete event, assign blame, and move on see incident rates stay flat or increase with scale.
The metric that matters most isn't how fast you recover. It's how fast you learn. Mean Time to Learning, maybe — though that one won't fit on a dashboard quite as cleanly.
What you measure shapes what you optimize for. If you measure MTTR in isolation, you'll get faster rollbacks. If you measure incident recurrence, you'll get better systems. If you measure MTTB — even informally, even just by being honest about what your postmortem process actually produces — you'll at least know what game you're playing.
Most organizations are playing the blame game and calling it SRE.
If this resonates
Ciroos Actually Solves This
AI SRE teammates that find the patterns in your incidents before they repeat — so you're optimizing Mean Time to Learning, not Mean Time to Blame. This is what MTTR reduction actually looks like.
See How Ciroos Works → Calculate Your MTTB Score