On-Call Burnout: The Feature, Not a Bug

Let's be honest about what on-call is. On-call is a system by which a company rents a portion of an engineer's consciousness — including their sleeping, eating, showering, and weekend consciousness — for a week at a time, in exchange for a stipend that works out to somewhere between $4 and $18 per hour depending on alert volume, and the occasional "we really appreciate your dedication" in a team meeting.

On-call burnout is when that engineer can no longer pretend the math makes sense.

Here's the part that's hard to say in a job performance review: on-call burnout isn't an accident. It's not a side effect of scale or complexity. It's the predictable output of an on-call management system designed around the assumption that human attention is an infinitely renewable resource that can be taxed at will, and that the engineer who eventually quits was just not resilient enough.

02:14:07 [CRIT] High CPU on prod-worker-07 (94%)
02:14:09 [CRIT] High CPU on prod-worker-07 (95%)
02:14:11 [CRIT] High CPU on prod-worker-07 (93%)
02:14:13 [WARN] High CPU on prod-worker-07 resolved
02:14:15 [CRIT] High CPU on prod-worker-07 (96%)
02:14:17 [HUMAN] on-call engineer acknowledges alert #47 tonight
02:14:18 [INFO] prod-worker-07 auto-scaled. No action needed.
02:14:19 [HUMAN] on-call engineer stares at ceiling

How On-Call Becomes Burnout: A Technical Explanation

The human nervous system has a threat response that was calibrated over millions of years to handle things like tigers. When a tiger appears, adrenaline fires, cortisol spikes, heart rate elevates, and the body prepares for action. Then the tiger is dealt with and the system recovers.

PagerDuty has a similar activation profile to a tiger. The 3am page fires all the same systems. The difference is that after the PagerDuty alert, the engineer silences it, checks a dashboard, and often takes no action because it was a CPU spike that resolved itself. The body activated for a tiger and got a false alarm. Eleven times in one night.

The cortisol doesn't know the alert was noisy. The cortisol just knows it was woken up eleven times. Over a week of on-call, this produces a physiological state that clinical literature calls "burnout" and that engineers describe as "I need to find a job somewhere my phone doesn't scream at me."

40% of SREs cite on-call as primary reason for job change

~60% of on-call pages require no human action

2.7× higher attrition on teams with noisy alert environments

The Three Ways Organizations "Solve" On-Call Burnout

To be charitable: most organizations recognize on-call burnout is real and try to address it. They just tend to address the symptoms rather than the causes.

Solution 1: Better Rotations

More people in the rotation means each person is on-call less often. This helps. It doesn't solve the problem that each on-call shift is still brutal. If the alert volume is 60 noisy pages per night, spreading that across six engineers means each engineer gets it once a month instead of once a week. They're still having 60-page nights. They're just having them less frequently.
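The arithmetic is worth spelling out. The sketch below assumes one-week shifts and reuses the 60-pages-per-night figure from the example above; the function names are made up for illustration. It shows that a bigger rotation changes how often the bad week arrives, while the load inside that week stays the same.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: rotation size changes how often a bad week comes around,
# not how bad it is. PAGES_PER_NIGHT comes from the example in the text; the
# one-week shift length is an assumption.
PAGES_PER_NIGHT=60
SHIFT_NIGHTS=7

# Weeks on call per year for a rotation of $1 engineers (integer division).
weeks_on_call_per_year() {
  echo $(( 52 / $1 ))
}

# Pages one engineer absorbs during a single shift; independent of rotation size.
pages_per_shift() {
  echo $(( PAGES_PER_NIGHT * SHIFT_NIGHTS ))
}

for size in 1 3 6; do
  printf 'rotation=%d  weeks_on_call=%d  pages_per_shift=%d\n' \
    "$size" "$(weeks_on_call_per_year "$size")" "$(pages_per_shift)"
done
```

The `pages_per_shift` column never moves. Only alert volume can change it.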

Rotation improvements are staffing solutions to tooling problems.

Solution 2: On-Call Training and Runbooks

If engineers had better runbooks, they'd resolve incidents faster and sleep better. This is sometimes true. It's also addressing the wrong problem. The issue isn't that engineers don't know how to handle the alerts. It's that the alerts shouldn't require a human in the first place.

Solution 3: "No Blame Culture"

If we remove the psychological burden of blame from the on-call experience, engineers will be less stressed. Also sometimes true. But the 3am page doesn't fire because of organizational culture. It fires because the alert threshold was set by someone who wanted to err on the side of caution, four years ago, and nobody has revisited it since.

The actual problem: On-call burnout is an automation problem wearing a culture problem's clothes. The signals that require human judgment are drowning in signals that don't. SRE tools exist to separate these. Most teams haven't built that separation.
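One common form of that separation is refusing to page on a spike that clears on its own. Here is a minimal sketch, assuming one CPU sample per line on stdin; the threshold of 90 and the five-sample window are illustrative guesses, not recommendations, and `should_page` is a hypothetical helper rather than part of any real tool.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: page a human only when CPU stays above THRESHOLD for
# SUSTAIN consecutive samples. Both values are illustrative assumptions.
THRESHOLD=90
SUSTAIN=5

# Reads one CPU percentage per line on stdin.
# Prints "page" if the threshold was breached for SUSTAIN samples in a row,
# otherwise "suppress".
should_page() {
  awk -v t="$THRESHOLD" -v n="$SUSTAIN" '
    $1 > t { run++; if (run >= n) paged = 1; next }
    { run = 0 }
    END { print (paged ? "page" : "suppress") }
  '
}

# A spike that clears after three samples, like the flapping alert in the log.
printf '%s\n' 94 95 93 40 96 | should_page   # prints: suppress

# A sustained breach that nothing fixed automatically.
printf '%s\n' 94 95 93 96 97 98 | should_page   # prints: page
```

The first series would have woken someone up four times under a per-sample alert. Under the sustained rule it wakes nobody, and the second series still pages.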

What the On-Call Experience Actually Looks Like in High-Performing Teams

The organizations with the best on-call health — the ones where engineers don't have on-call horror stories — share a characteristic: they have ruthlessly separated "signals that require human judgment" from "signals that can be handled automatically."

This sounds obvious. It is obvious. It's also extremely hard to maintain, because the default behavior of monitoring systems is to alert on everything, and the default behavior of engineers is to add alerts for the thing that just broke them, and nobody goes back to remove alerts that have been noisy for six months because that's not on the roadmap.

Good on-call management means treating alert hygiene as a first-class engineering concern, one that gets prioritized sprint work, not just good intentions. Teams that do this have lower on-call volume, higher quality pages, faster MTTR, and dramatically lower burnout rates. They also have better incident response, because engineers who slept arrive at incidents more capable than engineers who didn't.
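What might that sprint work start with? Possibly something as small as the audit below: a rough sketch that reads a page history and lists the alerts that rarely lead to action. The input format (one page per line, alert name followed by `acted` or `none`), the 50% cutoff, and the `noisy_alerts` name are all assumptions for illustration, not any paging tool's real export.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: flag alerts whose pages rarely lead to human action.
# The input format ("<alert_name> <acted|none>" per page) and the 50% cutoff
# are assumptions for illustration, not a real tool's export format.
MIN_ACTIONABLE_PCT=50

# Reads page history on stdin; prints "<alert_name> <actionable_pct>" for each
# alert below the cutoff, sorted by name.
noisy_alerts() {
  awk -v cutoff="$MIN_ACTIONABLE_PCT" '
    { total[$1]++; if ($2 == "acted") acted[$1]++ }
    END {
      for (name in total) {
        pct = int(100 * acted[name] / total[name])
        if (pct < cutoff) print name, pct
      }
    }
  ' | sort
}

noisy_alerts <<'EOF'
high_cpu none
high_cpu none
high_cpu none
high_cpu acted
disk_full acted
disk_full acted
EOF
```

With this sample history, only `high_cpu` is listed, at 25% actionable. Anything that shows up here month after month is a candidate for retuning, automating, or deleting.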

On-Call Health Indicators: Typical vs. High-Performing Teams

Indicator | Typical | Good
Actionable pages (requiring human decision) | 38% | 85%+
Alerts auto-resolved without action | 61% | <15%
Engineers who'd recommend their on-call setup | 22% | 70%+

The Real Cost of On-Call Burnout

The argument for fixing on-call burnout usually gets made in terms of engineer wellbeing, which is correct and also, in many organizations, insufficient motivation for an engineering roadmap conversation.

So here's the economic argument: a senior SRE costs between $180,000 and $280,000 annually in salary plus benefits. The fully loaded cost including recruiting, onboarding, and ramp time for a replacement is roughly 1.5–2× their annual salary. On-call burnout has a documented correlation with attrition. If you're losing one senior SRE per year to burnout-driven exits, you're spending more on replacement than you would on the SRE tools and automation work that would have fixed the alert noise.

The math isn't close. Alert hygiene is cheaper than attrition. It's just harder to put on a roadmap because the cost of burnout shows up in HR and the cost of fixing it shows up in engineering time.

On-call burnout is a feature of the current system, not a bug. It's the predictable output of treating human attention as cheap. The organizations that fix it aren't doing something heroic — they're just doing the math.

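# on_call_cost_calculator.sh
SENIOR_SRE_SALARY=240000
REPLACEMENT_MULTIPLIER=1.7
ANNUAL_ATTRITION_FROM_BURNOUT=1.2 # team of 8, one leaves every 10 months

# bc handles the decimal math. The dollar sign sits in a single-quoted printf
# format so it prints literally (an unescaped $$ expands to the shell's PID).
COST=$(echo "$SENIOR_SRE_SALARY * $REPLACEMENT_MULTIPLIER * $ANNUAL_ATTRITION_FROM_BURNOUT" | bc)
printf 'Annual burnout cost: $%.0f\n' "$COST"
# Output: Annual burnout cost: $489600
# Cost of fixing alert noise: "we'll add it to next quarter's roadmap"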
