## Introduction
On-call is often the most stressful part of engineering. Poorly structured on-call leads to alert fatigue, burnout, and engineers leaving. Sustainable on-call requires alert quality, clear escalation paths, good tooling, and cultural norms that protect engineers from unnecessary interruptions.
## Alert Quality: The Foundation
```yaml
# Prometheus alert: actionable, symptom-based
- alert: HighErrorRate
  expr: |
    sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      /
    sum by (service) (rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: page
    team: backend
  annotations:
    summary: "Error rate {{ $value | humanizePercentage }} exceeds 5%"
    description: "Service {{ $labels.service }} has had >5% errors for 5 minutes"
    runbook: "https://runbook.example.com/high-error-rate"
    dashboard: "https://grafana.example.com/d/service?service={{ $labels.service }}"

# BAD: alert that wakes someone up but has no action
- alert: HighMemoryUsage
  expr: container_memory_usage_bytes > 500000000
  # No for: clause — fires on momentary spikes
  # No runbook
  # No context — is this actually a problem?
```
Alert criteria:
- Is it actionable? If no human action is needed, it should not page.
- Is it customer-impacting? Alert on symptoms (latency, errors) not causes (CPU, memory) unless they directly predict impact.
- Does it have a runbook? An alert without a runbook wastes incident response time.
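These criteria can be checked mechanically before a rule ever ships. A minimal sketch, assuming alert rules have already been parsed into dicts (e.g. from the rule file with a YAML loader); the function name and message format are hypothetical:

```python
# Hypothetical lint for parsed Prometheus alert rules: every alert that
# pages must have a "for:" duration and a runbook annotation.
def lint_alert(rule: dict) -> list[str]:
    problems = []
    pages = rule.get("labels", {}).get("severity") == "page"
    if pages and "for" not in rule:
        problems.append(f"{rule['alert']}: paging alert without 'for:' duration")
    if pages and "runbook" not in rule.get("annotations", {}):
        problems.append(f"{rule['alert']}: paging alert without runbook link")
    return problems

# The BAD example above fails both checks:
bad = {
    "alert": "HighMemoryUsage",
    "expr": "container_memory_usage_bytes > 500000000",
    "labels": {"severity": "page"},
}
print(lint_alert(bad))
```

Running this in CI turns the criteria from a code-review custom into a hard gate.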
## Runbook Structure
````markdown
# High Error Rate Runbook

## When to use this
Alert: `HighErrorRate`
Severity: P1 (page immediately)

## Impact
Users are receiving 5xx errors. Conversion rate is affected.
Check the dashboard: https://grafana.example.com/d/service

## Diagnosis (< 5 minutes)
1. Check recent deployments:
   ```bash
   kubectl rollout history deployment/api -n production
   ```
2. Check error logs:
   ```bash
   kubectl logs -l app=api -n production --since=10m | grep "level=error"
   ```
3. Check database connectivity:
   ```bash
   kubectl exec -it $(kubectl get pod -l app=api -o name | head -1) -- \
     python -c "import psycopg2; psycopg2.connect('...')"
   ```

## Mitigation
If caused by a bad deployment:
```bash
kubectl rollout undo deployment/api -n production
kubectl rollout status deployment/api -n production
```
If the database is down:
- Check RDS status in the AWS console
- Notify the DBA on-call: #dba-oncall

If an external dependency is failing:
- Check the status page: https://status.stripe.com
- Enable the circuit-breaker flag:
  ```bash
  kubectl set env deployment/api PAYMENT_FALLBACK=true
  ```

## Escalation
- 10 minutes without resolution: escalate to senior engineer
- 20 minutes: escalate to engineering manager
````
Paging itself can be automated against the incident tool's events API. For PagerDuty, triggering and resolving incidents looks like this:

```python
# pagerduty_client.py: trigger and resolve incidents programmatically
import httpx

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

async def trigger_incident(
    routing_key: str,
    summary: str,
    severity: str,
    source: str,
    details: dict,
) -> str:
    """Trigger an incident and return its dedup key."""
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            PAGERDUTY_EVENTS_URL,
            json={
                "routing_key": routing_key,
                "event_action": "trigger",
                "payload": {
                    "summary": summary,
                    "severity": severity,  # critical, error, warning, info
                    "source": source,
                    "custom_details": details,
                },
            },
        )
        resp.raise_for_status()
        return resp.json()["dedup_key"]

async def resolve_incident(routing_key: str, dedup_key: str) -> None:
    """Resolve a previously triggered incident by its dedup key."""
    async with httpx.AsyncClient() as client:
        await client.post(
            PAGERDUTY_EVENTS_URL,
            json={
                "routing_key": routing_key,
                "event_action": "resolve",
                "dedup_key": dedup_key,
            },
        )
```
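Whatever triggers the page has to translate measured impact into the pager's four-level severity scale (critical, error, warning, info). A hypothetical mapping for an error-rate signal; the thresholds below are illustrative, not from the source:

```python
# Hypothetical mapping from measured error rate to pager severity.
# Thresholds are illustrative; tune them to your own alert definitions.
def severity_for_error_rate(rate: float) -> str:
    if rate >= 0.25:
        return "critical"  # a large fraction of requests failing
    if rate >= 0.05:
        return "error"     # the 5% paging threshold used earlier
    if rate >= 0.01:
        return "warning"   # elevated, ticket rather than page
    return "info"

print(severity_for_error_rate(0.12))  # "error"
```

Keeping this mapping in one place ensures every caller classifies impact consistently.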
## Incident Response Process
1. **Acknowledge** (< 5 min)
   - Confirm you are investigating
   - Post in #incidents: "I'm looking at this"
2. **Assess** (< 10 min)
   - What is broken? What is the user impact?
   - Is it getting worse, stable, or recovering?
   - Update the incident channel: "API error rate at 12%, investigating deployment"
3. **Mitigate** (< 30 min for P1)
   - Fix the symptom, not the root cause
   - Roll back if a recent deployment is suspect
   - Scale up if overloaded
   - Update every 15 minutes: "Rollback in progress, ETA 5 min"
4. **Resolve**
   - Confirm metrics have returned to normal
   - Give the all-clear in the incident channel
   - Schedule a post-mortem within 48 hours
5. **Post-mortem**
   - Blameless, held within 5 business days
   - Action items with owners and deadlines
## Reducing Alert Noise
```promql
# Track alert frequency — noisy alerts need to be fixed or removed.
# Prometheus: count alert firings over 30 days. changes() counts how
# often each alert's "active since" timestamp moved, i.e. how often it
# re-fired (increase() on this timestamp gauge would be meaningless).
sum by (alertname) (
  changes(ALERTS_FOR_STATE[30d])
) > 0

# Alerts that fire > 5 times/week are candidates for adjustment:
# - Raise the threshold
# - Add a "for:" duration to suppress transient spikes
# - Convert to warning (ticket) instead of page
# - Delete if not actionable
```
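The firing counts from that query can drive a standing triage list. A sketch using the 5-per-week cutoff from the comment above; the function name and input shape are hypothetical:

```python
# Classify alerts by firings over a 30-day window. More than 5/week
# (about 21 per 30 days) marks an alert as a candidate for adjustment.
NOISY_PER_30D = 5 * 30 / 7  # ≈ 21.4

def triage(firings_30d: dict[str, int]) -> list[str]:
    """Return alert names that fired often enough to need attention."""
    return sorted(name for name, count in firings_30d.items()
                  if count > NOISY_PER_30D)

counts = {"HighErrorRate": 2, "HighMemoryUsage": 40, "DiskAlmostFull": 25}
print(triage(counts))  # ['DiskAlmostFull', 'HighMemoryUsage']
```

Reviewing this list in a weekly on-call handoff keeps noise reduction on the team's agenda rather than leaving it to whoever is most annoyed.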
## On-Call Health Metrics
```python
# Track on-call burden — identify systemic problems
METRICS = {
    "pages_per_week": "should be < 5 per engineer",
    "mean_time_to_acknowledge": "should be < 5 minutes",
    "mean_time_to_resolve": "P1 < 30 min, P2 < 2 hours",
    "pages_outside_hours": "% of pages at night/weekend — should be declining",
    "alert_action_rate": "% of alerts that required action — should be > 80%",
}
```
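Several of these fall straight out of a page log. A sketch under an assumed log schema (one dict per page, with an ISO timestamp and whether the responder had to act); the 9:00–18:00 business-hours window is an illustrative choice:

```python
from datetime import datetime

def health_metrics(pages: list[dict], weeks: float) -> dict:
    """Compute on-call burden metrics from a simple page log."""
    n = len(pages)
    outside = sum(
        1 for p in pages
        if datetime.fromisoformat(p["at"]).hour not in range(9, 18)
        or datetime.fromisoformat(p["at"]).weekday() >= 5  # Sat/Sun
    )
    actionable = sum(1 for p in pages if p["action_required"])
    return {
        "pages_per_week": n / weeks,
        "pages_outside_hours_pct": 100 * outside / n if n else 0.0,
        "alert_action_rate_pct": 100 * actionable / n if n else 0.0,
    }

log = [
    {"at": "2024-03-04T10:30:00", "action_required": True},   # Mon, in hours
    {"at": "2024-03-06T02:15:00", "action_required": True},   # Wed, night
    {"at": "2024-03-09T14:00:00", "action_required": False},  # Sat, weekend
]
print(health_metrics(log, weeks=1))
```

Trend these per rotation, not per incident: one bad week is noise, three is a systemic problem.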
## Conclusion
Sustainable on-call starts with high-signal alerts: every page must be actionable and backed by a runbook. Track on-call burden, and treat chronic noise as an engineering problem to fix rather than a fact of life. Blameless post-mortems prevent recurrence. Engineers who feel supported during incidents, and who see their feedback acted on, are far less likely to burn out.