## Introduction
On-call is often the most stressful part of engineering. Poorly structured on-call leads to alert fatigue, burnout, and engineers leaving. Sustainable on-call requires alert quality, clear escalation paths, good tooling, and cultural norms that protect engineers from unnecessary interruptions.
## Alert Quality: The Foundation
```yaml
# Prometheus alert: actionable, symptom-based
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: page
    team: backend
  annotations:
    summary: "Error rate {{ $value | humanizePercentage }} exceeds 5%"
    description: "Service {{ $labels.service }} has had >5% errors for 5 minutes"
    runbook: "https://runbook.example.com/high-error-rate"
    dashboard: "https://grafana.example.com/d/service?service={{ $labels.service }}"

# BAD: alert that wakes someone up but has no action
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes > 500000000
  # No for: clause — fires on momentary spikes
  # No runbook
  # No context — is this actually a problem?
```
Alert criteria:
- Is it actionable? If no human action is needed, it should not page.
- Is it customer-impacting? Alert on symptoms (latency, errors) not causes (CPU, memory) unless they directly predict impact.
- Does it have a runbook? An alert without a runbook wastes incident response time.
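These criteria can be checked mechanically before a rule ever ships. A minimal sketch, assuming alert rules have already been parsed into dicts (e.g. from the rule file with a YAML loader); the function name and message format are hypothetical:

```python
# Hypothetical lint for parsed Prometheus alert rules: every alert that
# pages must have a "for:" duration and a runbook annotation.
def lint_alert(rule: dict) -> list[str]:
    problems = []
    pages = rule.get("labels", {}).get("severity") == "page"
    if pages and "for" not in rule:
        problems.append(f"{rule['alert']}: paging alert without 'for:' duration")
    if pages and "runbook" not in rule.get("annotations", {}):
        problems.append(f"{rule['alert']}: paging alert without runbook link")
    return problems

# The BAD example above fails both checks:
bad = {
    "alert": "HighMemoryUsage",
    "expr": "container_memory_usage_bytes > 500000000",
    "labels": {"severity": "page"},
}
print(lint_alert(bad))
```

Running this in CI turns the criteria from a code-review custom into a hard gate.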
## Runbook Structure
````markdown
# High Error Rate Runbook

## When to use this
Alert: `HighErrorRate`
Severity: P1 (page immediately)

## Impact
Users are receiving 5xx errors. Conversion rate is affected.
Check the dashboard: https://grafana.example.com/d/service

## Diagnosis (< 5 minutes)
1. Check recent deployments:
   ```bash
   kubectl rollout history deployment/api -n production
   ```
2. Check error logs:
   ```bash
   kubectl logs -l app=api -n production --since=10m | grep "level=error"
   ```
3. Check database connectivity:
   ```bash
   kubectl exec -it $(kubectl get pod -l app=api -o name | head -1) -- \
     python -c "import psycopg2; psycopg2.connect('...')"
   ```

## Mitigation
If caused by a bad deployment:
```bash
kubectl rollout undo deployment/api -n production
kubectl rollout status deployment/api -n production
```
If the database is down:
- Check RDS status in the AWS console
- Notify the DBA on-call: #dba-oncall

If an external dependency is failing:
- Check the status page: https://status.stripe.com
- Enable the circuit-breaker flag:
  ```bash
  kubectl set env deployment/api PAYMENT_FALLBACK=true
  ```

## Escalation
- 10 minutes without resolution: escalate to senior engineer
- 20 minutes: escalate to engineering manager
````
Paging itself can be automated against the incident tool's events API. For PagerDuty, triggering and resolving incidents looks like this:

```python
# pagerduty_client.py: trigger and resolve incidents programmatically
import httpx

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

async def trigger_incident(
    routing_key: str,
    summary: str,
    severity: str,
    source: str,
    details: dict,
) -> str:
    """Trigger an incident and return its dedup key."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            PAGERDUTY_EVENTS_URL,
            json={
                "routing_key": routing_key,
                "event_action": "trigger",
                "payload": {
                    "summary": summary,
                    "severity": severity,  # critical, error, warning, info
                    "source": source,
                    "custom_details": details,
                },
            },
        )
        resp.raise_for_status()
        return resp.json()["dedup_key"]

async def resolve_incident(routing_key: str, dedup_key: str) -> None:
    """Resolve a previously triggered incident by its dedup key."""
    async with httpx.AsyncClient() as client:
        await client.post(
            PAGERDUTY_EVENTS_URL,
            json={
                "routing_key": routing_key,
                "event_action": "resolve",
                "dedup_key": dedup_key,
            },
        )
```
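Whatever triggers the page has to translate measured impact into the pager's four-level severity scale (critical, error, warning, info). A hypothetical mapping for an error-rate signal; the thresholds below are illustrative, not from the source:

```python
# Hypothetical mapping from measured error rate to pager severity.
# Thresholds are illustrative; tune them to your own alert definitions.
def severity_for_error_rate(rate: float) -> str:
    if rate >= 0.25:
        return "critical"  # a large fraction of requests failing
    if rate >= 0.05:
        return "error"     # the 5% paging threshold used earlier
    if rate >= 0.01:
        return "warning"   # elevated, ticket rather than page
    return "info"

print(severity_for_error_rate(0.12))  # "error"
```

Keeping this mapping in one place ensures every caller classifies impact consistently.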
## Incident Response Process
1. **Acknowledge** (< 5 min)
   - Confirm you are investigating
   - Post in #incidents: "I'm looking at this"
2. **Assess** (< 10 min)
   - What is broken? What is the user impact?
   - Is it getting worse, stable, or recovering?
   - Update the incident channel: "API error rate at 12%, investigating deployment"
3. **Mitigate** (< 30 min for P1)
   - Fix the symptom, not the root cause
   - Roll back if a recent deployment is suspect
   - Scale up if overloaded
   - Update every 15 minutes: "Rollback in progress, ETA 5 min"
4. **Resolve**
   - Confirm metrics have returned to normal
   - Give the all-clear in the incident channel
   - Schedule a post-mortem within 48 hours
5. **Post-mortem**
   - Blameless, held within 5 business days
   - Action items with owners and deadlines
## Reducing Alert Noise
```promql
# Track alert frequency — noisy alerts need to be fixed or removed.
# Prometheus: count alert firings over 30 days. changes() counts how
# often each alert's "active since" timestamp moved, i.e. how often it
# re-fired (increase() on this timestamp gauge would be meaningless).
sum by (alertname) (
  changes(ALERTS_FOR_STATE[30d])
) > 0

# Alerts that fire > 5 times/week are candidates for adjustment:
# - Raise the threshold
# - Add a "for:" duration to suppress transient spikes
# - Convert to warning (ticket) instead of page
# - Delete if not actionable
```
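The firing counts from that query can drive a standing triage list. A sketch using the 5-per-week cutoff from the comment above; the function name and input shape are hypothetical:

```python
# Classify alerts by firings over a 30-day window. More than 5/week
# (about 21 per 30 days) marks an alert as a candidate for adjustment.
NOISY_PER_30D = 5 * 30 / 7  # ≈ 21.4

def triage(firings_30d: dict[str, int]) -> list[str]:
    """Return alert names that fired often enough to need attention."""
    return sorted(name for name, count in firings_30d.items()
                  if count > NOISY_PER_30D)

counts = {"HighErrorRate": 2, "HighMemoryUsage": 40, "DiskAlmostFull": 25}
print(triage(counts))  # ['DiskAlmostFull', 'HighMemoryUsage']
```

Reviewing this list in a weekly on-call handoff keeps noise reduction on the team's agenda rather than leaving it to whoever is most annoyed.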
## On-Call Health Metrics
```python
# Track on-call burden — identify systemic problems
METRICS = {
    "pages_per_week": "should be < 5 per engineer",
    "mean_time_to_acknowledge": "should be < 5 minutes",
    "mean_time_to_resolve": "P1 < 30 min, P2 < 2 hours",
    "pages_outside_hours": "% of pages at night/weekend — should be declining",
    "alert_action_rate": "% of alerts that required action — should be > 80%",
}
```
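Several of these fall straight out of a page log. A sketch under an assumed log schema (one dict per page, with an ISO timestamp and whether the responder had to act); the 9:00–18:00 business-hours window is an illustrative choice:

```python
from datetime import datetime

def health_metrics(pages: list[dict], weeks: float) -> dict:
    """Compute on-call burden metrics from a simple page log."""
    n = len(pages)
    outside = sum(
        1 for p in pages
        if datetime.fromisoformat(p["at"]).hour not in range(9, 18)
        or datetime.fromisoformat(p["at"]).weekday() >= 5  # Sat/Sun
    )
    actionable = sum(1 for p in pages if p["action_required"])
    return {
        "pages_per_week": n / weeks,
        "pages_outside_hours_pct": 100 * outside / n if n else 0.0,
        "alert_action_rate_pct": 100 * actionable / n if n else 0.0,
    }

log = [
    {"at": "2024-03-04T10:30:00", "action_required": True},   # Mon, in hours
    {"at": "2024-03-06T02:15:00", "action_required": True},   # Wed, night
    {"at": "2024-03-09T14:00:00", "action_required": False},  # Sat, weekend
]
print(health_metrics(log, weeks=1))
```

Trend these per rotation, not per incident: one bad week is noise, three is a systemic problem.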
## Conclusion
Sustainable on-call starts with high-signal alerts: every page must be actionable and backed by a runbook. Track on-call burden, and treat chronic noise as an engineering problem to fix rather than a fact of life. Blameless post-mortems prevent recurrence. Engineers who feel supported during incidents, and who see their feedback acted on, are far less likely to burn out.