## Introduction
A post-mortem (also called a root cause analysis, or RCA) is a document that captures what happened during an incident, why it happened, and what will prevent recurrence. The value is not in assigning blame but in building organizational memory and systematically eliminating failure modes.
## Blameless Post-Mortems
The blameless principle holds that the people involved made the best decisions they could with the information available at the time. Systems fail, not people. Blame suppresses honest reporting and drives problems underground.
A well-run post-mortem asks “how did the system make it easy to make this mistake?” rather than “who made this mistake?”
## Post-Mortem Template
```markdown
# Post-Mortem: [Service] Outage — YYYY-MM-DD

**Status**: Complete
**Severity**: P1 (Complete outage affecting all users)
**Duration**: 47 minutes (14:23 UTC to 15:10 UTC)
**Impact**: 100% of API requests failed. ~50,000 users affected.

## Summary

A misconfigured Kubernetes resource limit caused the API pods to be throttled
to near-zero CPU, making all requests time out. The configuration was introduced
in a deployment 2 hours before the outage and activated during a traffic spike.

## Timeline

All times UTC.

| Time  | Event |
|-------|-------|
| 12:15 | Deployment of api-v2.3.1 with updated resource limits |
| 14:20 | Traffic spike begins (2x normal) |
| 14:23 | Error rate rises above 50%; PagerDuty alert fires |
| 14:31 | On-call engineer begins investigation |
| 14:45 | Root cause identified: CPU limit 100m vs required 2000m |
| 14:52 | Rollback initiated |
| 15:05 | Deployment rolled back, error rate returns to normal |
| 15:10 | Incident declared resolved |

## Root Cause

The API resource limit was set to `cpu: 100m` in the deployment YAML.
The correct value, used in staging, was `cpu: 2000m`. The incorrect value was
introduced when the production values file was manually edited. At low traffic
the service ran normally; CPU throttling became severe only when traffic doubled.

## Contributing Factors

1. **No automated validation** of resource limits against historical usage before deployment.
2. **Staging load** is 10% of production — the misconfiguration was not visible in staging testing.
3. **No CPU throttling alert** — monitoring covered error rate and latency, not CPU throttle ratio.
4. **Manual YAML editing** of production values without peer review or diff against previous values.

## Impact

- 50,000+ users experienced failed API requests
- 47 minutes of complete unavailability
- ~$12,000 estimated revenue impact (based on conversion rate)

## What Went Well

- PagerDuty alert fired within 3 minutes of impact start
- On-call engineer identified the issue without escalation
- Rollback was available and executed quickly
- Runbook covered this scenario accurately

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add CI check: reject deployments where CPU request < 50% of p95 usage from last 7 days | @platform-team | 2025-07-30 | P0 |
| Add Prometheus alert: CPU throttle ratio > 25% for 5 minutes | @platform-team | 2025-07-23 | P0 |
| Require peer review for production values file changes | @eng-manager | 2025-07-16 | P1 |
| Add production-scale load test to deployment pipeline | @backend-team | 2025-08-15 | P2 |

## Lessons Learned

Automatic validation that prevents deployment of clearly misconfigured resources
would have caught this before it reached production. Configuration changes
should receive the same review scrutiny as code changes.
```
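The misconfiguration in this example template can be sketched as two versions of the production values file (illustrative only; the real file layout and field names are not part of the post-mortem):

```yaml
# Hypothetical sketch of the bad manual edit to the production values file.
# Deployed (incorrect): throttles the API pods as soon as traffic rises.
resources:
  requests:
    cpu: 100m
  limits:
    cpu: 100m
---
# Intended (matches staging):
resources:
  requests:
    cpu: 2000m
  limits:
    cpu: 2000m
```

A plain `diff` of the two files before deployment would have made the 20x reduction obvious, which is what the peer-review action item is meant to guarantee.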
## 5 Whys Analysis
The 5 Whys technique traces the causal chain to the systemic root cause.
```text
1. Why did the API return errors?
   → The API pods were CPU throttled to near zero.

2. Why were the pods CPU throttled?
   → The CPU limit was set to 100m, far below the actual 1800m required.

3. Why was the CPU limit wrong?
   → The production values file was manually edited incorrectly.

4. Why was a manual edit not caught?
   → There was no diff review or automated validation of resource limits.

5. Why was there no validation?
   → Resource limits were treated as boilerplate, not code requiring review.

Root cause: Resource configuration lacked the same review controls as application code.
```
## Making Action Items Effective
Bad action items are vague and unassigned:
```markdown
# BAD: too vague, no owner, no deadline
- Improve deployment testing
- Better monitoring for resource issues
- Review our deployment process
```
Good action items are specific, assigned, and time-bound:
```markdown
# GOOD: specific, assigned, has due date and priority
- [P0, @platform-team, due 2025-07-23]
  Add Prometheus alert: container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 0.25 for 5m
  → creates a PagerDuty P2 incident
- [P0, @platform-team, due 2025-07-30]
  CI job that queries VPA recommendations for changed deployments and fails if
  new CPU request < 50% of the recommended target
```
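The core of the CI check described in the second action item can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual job: the function and parameter names are invented, and in practice the recommended target would come from VPA output or a Prometheus p95 query rather than a string literal.

```python
# Sketch of a CI gate that rejects a deployment whose new CPU request is
# below 50% of a recommended target (e.g. p95 usage over the last 7 days).
# All names here are hypothetical.

def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity string ('100m', '2') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

def validate_cpu_request(new_request: str, recommended: str,
                         min_ratio: float = 0.5) -> bool:
    """Return True if the new request is at least min_ratio of the target."""
    return parse_cpu_millicores(new_request) >= min_ratio * parse_cpu_millicores(recommended)

# The outage configuration would have failed this check:
assert not validate_cpu_request("100m", "2000m")   # 100m < 50% of 2000m
assert validate_cpu_request("2000m", "2000m")
```

The point of the 50% threshold is to catch order-of-magnitude mistakes like 100m vs 2000m without blocking deliberate, modest reductions.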
## Tracking Action Items
Post-mortems lose value if action items are never completed. Track them:
```bash
# Create a GitHub issue for each action item.
# ($'...' makes bash expand \n into real newlines for the issue body;
#  --assignee takes GitHub user logins, not team names.)
gh issue create \
  --title "Add CPU throttling alert (post-mortem: api-outage-2025-07-07)" \
  --body $'See: https://wiki.example.com/post-mortems/2025-07-07\n\nAdd alert: throttle ratio > 25% for 5min' \
  --label "reliability,post-mortem" \
  --assignee platform-team

# Review at the weekly SRE sync:
gh issue list --label "post-mortem" --state open
```
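Beyond listing open issues, the weekly review should flag items that have slipped past their due date. A minimal sketch of that report logic, assuming a hypothetical list of action items (in practice the data would come from your tracker, e.g. `gh issue list --json`):

```python
# Sketch of an overdue-action-item report. The item shape is hypothetical;
# only "due" (ISO date), "state", and "title" are assumed here.
from datetime import date

def overdue_items(items: list[dict], today: date) -> list[dict]:
    """Return action items that are still open and past their due date."""
    return [i for i in items
            if i["state"] == "open" and date.fromisoformat(i["due"]) < today]

items = [
    {"title": "Add CPU throttling alert", "due": "2025-07-23", "state": "open"},
    {"title": "Require peer review",      "due": "2025-07-16", "state": "closed"},
]
late = overdue_items(items, today=date(2025, 8, 1))
# late contains only the still-open throttling-alert item
```

Surfacing this list automatically, rather than relying on someone to remember, is what keeps action items from quietly dying in the backlog.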
## Conclusion
A post-mortem’s value is in the action items, not the document. Document the timeline honestly, trace to systemic root causes (not individual mistakes), and create specific, assigned, time-bound action items that eliminate the failure class — not just the specific failure. Track completion. Review open post-mortem actions in your weekly reliability meeting.