## Introduction
A post-mortem (also called a root cause analysis, or RCA) is a document that captures what happened during an incident, why it happened, and what will prevent recurrence. The value is not in assigning blame but in building organizational memory and systematically eliminating failure modes.
## Blameless Post-Mortems
The blameless principle holds that the people involved made the best decisions they could with the information available at the time. Systems fail, not people. Blame suppresses honest reporting and drives problems underground.
A well-run post-mortem asks “how did the system make it easy to make this mistake?” rather than “who made this mistake?”
## Post-Mortem Template
```markdown
# Post-Mortem: [Service] Outage — YYYY-MM-DD

**Status**: Complete
**Severity**: P1 (Complete outage affecting all users)
**Duration**: 47 minutes (14:23 UTC to 15:10 UTC)
**Impact**: 100% of API requests failed. ~50,000 users affected.

## Summary

A misconfigured Kubernetes resource limit caused the API pods to be throttled
to near-zero CPU, making all requests time out. The configuration was introduced
in a deployment 2 hours before the outage and activated during a traffic spike.

## Timeline

All times UTC.

| Time  | Event |
|-------|-------|
| 12:15 | Deployment of api-v2.3.1 with updated resource limits |
| 14:20 | Traffic spike begins (2x normal) |
| 14:23 | Error rate rises above 50%; PagerDuty alert fires |
| 14:31 | On-call engineer begins investigation |
| 14:45 | Root cause identified: CPU limit 100m vs required 2000m |
| 14:52 | Rollback initiated |
| 15:05 | Deployment rolled back, error rate returns to normal |
| 15:10 | Incident declared resolved |

## Root Cause

The API resource limit was set to `cpu: 100m` in the deployment YAML.
The correct value, used in staging, was `cpu: 2000m`. The incorrect value was
introduced when the production values file was manually edited. At low traffic
the service ran normally; CPU throttling became severe only when traffic doubled.

## Contributing Factors

1. **No automated validation** of resource limits against historical usage before deployment.
2. **Staging load** is 10% of production — the misconfiguration was not visible in staging testing.
3. **No CPU throttling alert** — monitoring covered error rate and latency, not CPU throttle ratio.
4. **Manual YAML editing** of production values without peer review or diff against previous values.

## Impact

- 50,000+ users experienced failed API requests
- 47 minutes of complete unavailability
- ~$12,000 estimated revenue impact (based on conversion rate)

## What Went Well

- PagerDuty alert fired within 3 minutes of impact start
- On-call engineer identified the issue without escalation
- Rollback was available and executed quickly
- Runbook covered this scenario accurately

## Action Items

| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add CI check: reject deployments where CPU request < 50% of p95 usage from last 7 days | @platform-team | 2025-07-30 | P0 |
| Add Prometheus alert: CPU throttle ratio > 25% for 5 minutes | @platform-team | 2025-07-23 | P0 |
| Require peer review for production values file changes | @eng-manager | 2025-07-16 | P1 |
| Add production-scale load test to deployment pipeline | @backend-team | 2025-08-15 | P2 |

## Lessons Learned

Automatic validation that prevents deployment of clearly misconfigured resources
would have caught this before it reached production. Configuration changes
should receive the same review scrutiny as code changes.
```
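The misconfiguration in this example template can be sketched as two versions of the production values file (illustrative only; the real file layout and field names are not part of the post-mortem):

```yaml
# Hypothetical sketch of the bad manual edit to the production values file.
# Deployed (incorrect): throttles the API pods as soon as traffic rises.
resources:
  requests:
    cpu: 100m
  limits:
    cpu: 100m
---
# Intended (matches staging):
resources:
  requests:
    cpu: 2000m
  limits:
    cpu: 2000m
```

A plain `diff` of the two files before deployment would have made the 20x reduction obvious, which is what the peer-review action item is meant to guarantee.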
## 5 Whys Analysis
The 5 Whys technique traces the causal chain to the systemic root cause.
```text
1. Why did the API return errors?
   → The API pods were CPU throttled to near zero.

2. Why were the pods CPU throttled?
   → The CPU limit was set to 100m, far below the actual 1800m required.

3. Why was the CPU limit wrong?
   → The production values file was manually edited incorrectly.

4. Why was a manual edit not caught?
   → There was no diff review or automated validation of resource limits.

5. Why was there no validation?
   → Resource limits were treated as boilerplate, not code requiring review.

Root cause: Resource configuration lacked the same review controls as application code.
```
## Making Action Items Effective
Bad action items are vague and unassigned:
```markdown
# BAD: too vague, no owner, no deadline
- Improve deployment testing
- Better monitoring for resource issues
- Review our deployment process
```
Good action items are specific, assigned, and time-bound:
```markdown
# GOOD: specific, assigned, has due date and priority
- [P0, @platform-team, due 2025-07-23]
  Add Prometheus alert: container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total > 0.25 for 5m
  → creates a PagerDuty P2 incident
- [P0, @platform-team, due 2025-07-30]
  CI job that queries VPA recommendations for changed deployments and fails if
  new CPU request < 50% of the recommended target
```
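The core of the CI check described in the second action item can be sketched in a few lines of Python. This is a minimal illustration, not the team's actual job: the function and parameter names are invented, and in practice the recommended target would come from VPA output or a Prometheus p95 query rather than a string literal.

```python
# Sketch of a CI gate that rejects a deployment whose new CPU request is
# below 50% of a recommended target (e.g. p95 usage over the last 7 days).
# All names here are hypothetical.

def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity string ('100m', '2') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)

def validate_cpu_request(new_request: str, recommended: str,
                         min_ratio: float = 0.5) -> bool:
    """Return True if the new request is at least min_ratio of the target."""
    return parse_cpu_millicores(new_request) >= min_ratio * parse_cpu_millicores(recommended)

# The outage configuration would have failed this check:
assert not validate_cpu_request("100m", "2000m")   # 100m < 50% of 2000m
assert validate_cpu_request("2000m", "2000m")
```

The point of the 50% threshold is to catch order-of-magnitude mistakes like 100m vs 2000m without blocking deliberate, modest reductions.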
## Tracking Action Items
Post-mortems lose value if action items are never completed. Track them:
```bash
# Create a GitHub issue for each action item.
# ($'...' makes bash expand \n into real newlines for the issue body;
#  --assignee takes GitHub user logins, not team names.)
gh issue create \
  --title "Add CPU throttling alert (post-mortem: api-outage-2025-07-07)" \
  --body $'See: https://wiki.example.com/post-mortems/2025-07-07\n\nAdd alert: throttle ratio > 25% for 5min' \
  --label "reliability,post-mortem" \
  --assignee platform-team

# Review at the weekly SRE sync:
gh issue list --label "post-mortem" --state open
```
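Beyond listing open issues, the weekly review should flag items that have slipped past their due date. A minimal sketch of that report logic, assuming a hypothetical list of action items (in practice the data would come from your tracker, e.g. `gh issue list --json`):

```python
# Sketch of an overdue-action-item report. The item shape is hypothetical;
# only "due" (ISO date), "state", and "title" are assumed here.
from datetime import date

def overdue_items(items: list[dict], today: date) -> list[dict]:
    """Return action items that are still open and past their due date."""
    return [i for i in items
            if i["state"] == "open" and date.fromisoformat(i["due"]) < today]

items = [
    {"title": "Add CPU throttling alert", "due": "2025-07-23", "state": "open"},
    {"title": "Require peer review",      "due": "2025-07-16", "state": "closed"},
]
late = overdue_items(items, today=date(2025, 8, 1))
# late contains only the still-open throttling-alert item
```

Surfacing this list automatically, rather than relying on someone to remember, is what keeps action items from quietly dying in the backlog.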
## Conclusion
A post-mortem’s value is in the action items, not the document. Document the timeline honestly, trace to systemic root causes (not individual mistakes), and create specific, assigned, time-bound action items that eliminate the failure class — not just the specific failure. Track completion. Review open post-mortem actions in your weekly reliability meeting.