Production Incident Lifecycle
Introduction
Production incidents are inevitable in complex systems. Mature teams treat incident response as a lifecycle with defined phases, clear roles, and measurable outcomes. The goal is to restore service quickly without losing the forensic evidence needed for long-term fixes.
Phases of an Incident
The lifecycle typically follows five phases.
- Detection: An SLO burn-rate alert or a customer report triggers an incident.
- Triage: Validate scope, severity, and potential blast radius.
- Mitigation: Apply the fastest safe change, such as a rollback or a feature-flag toggle.
- Resolution: Restore full functionality and confirm metrics recovery.
- Postmortem: Document root causes and action items.
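The Detection phase above mentions SLO burn rate. As a rough sketch of how such a trigger could be evaluated: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names, the shape of the `window` object, and the default threshold of 14.4 (a commonly cited fast-burn value for a one-hour window against a 30-day budget) are assumptions for illustration, not part of any specific tool.

```javascript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
// A burn rate of 1 means the budget is being spent exactly at the sustainable pace.
function burnRate(errorCount, totalCount, sloTarget) {
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new RangeError("sloTarget must be between 0 and 1 (exclusive)");
  }
  if (totalCount === 0) return 0; // no traffic in the window, nothing burned
  const errorBudget = 1 - sloTarget;
  return errorCount / totalCount / errorBudget;
}

// Hypothetical trigger: open an incident when the window burns budget
// at or above the threshold. The default threshold is an assumption.
function shouldOpenIncident(window, sloTarget, threshold = 14.4) {
  return burnRate(window.errors, window.total, sloTarget) >= threshold;
}
```

For example, 20 errors in 1,000 requests against a 99.9% target is a burn rate of roughly 20, which would trip the default threshold in this sketch.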
Incident Roles and Communication
Define roles ahead of time. The incident commander, communications lead, and subject-matter experts should each be named explicitly. Communication should be concise and tied to observable telemetry, not speculation.
JavaScript Example: Incident State Machine
The following state machine helps you drive consistent incident transitions in internal tooling.
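// Canonical incident states; frozen so callers cannot mutate them.
const IncidentState = Object.freeze({
  DETECTED: "detected",
  TRIAGE: "triage",
  MITIGATING: "mitigating",
  RESOLVED: "resolved",
  POSTMORTEM: "postmortem",
});

// Allowed next states for each state, built once rather than on every call.
// Triage may move directly to resolved, skipping mitigation.
const ALLOWED_TRANSITIONS = Object.freeze({
  [IncidentState.DETECTED]: [IncidentState.TRIAGE],
  [IncidentState.TRIAGE]: [IncidentState.MITIGATING, IncidentState.RESOLVED],
  [IncidentState.MITIGATING]: [IncidentState.RESOLVED],
  [IncidentState.RESOLVED]: [IncidentState.POSTMORTEM],
  [IncidentState.POSTMORTEM]: [], // terminal state
});

// Returns `next` if the move is allowed; throws otherwise.
function transition(current, next) {
  // Own-property check so inherited keys such as "constructor" are rejected cleanly.
  const allowed = Object.hasOwn(ALLOWED_TRANSITIONS, current)
    ? ALLOWED_TRANSITIONS[current]
    : [];
  if (!allowed.includes(next)) {
    throw new Error(`Invalid transition ${current} -> ${next}`);
  }
  return next;
}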
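One way internal tooling might build on the state machine above is to keep the current state together with a timestamped history of every transition. The following is a minimal, self-contained sketch; `createIncident`, `advance`, and the injectable `now` clock are hypothetical names introduced here, and the transition table is repeated so the block runs on its own.

```javascript
// Same transition table as the state machine above, repeated for self-containment.
const ALLOWED = {
  detected: ["triage"],
  triage: ["mitigating", "resolved"],
  mitigating: ["resolved"],
  resolved: ["postmortem"],
  postmortem: [],
};

// Hypothetical helper: tracks the current state and records when each state was entered.
// `now` is injectable so the clock can be controlled in tests.
function createIncident(now = Date.now) {
  let state = "detected";
  const history = [{ state, at: now() }];
  return {
    get state() {
      return state;
    },
    get history() {
      return history.slice(); // return a copy so callers cannot edit the log
    },
    advance(next) {
      if (!ALLOWED[state].includes(next)) {
        throw new Error(`Invalid transition ${state} -> ${next}`);
      }
      state = next;
      history.push({ state, at: now() });
      return state;
    },
  };
}

// Usage: a rejected transition leaves the incident in its previous state.
const incident = createIncident();
incident.advance("triage");
try {
  incident.advance("postmortem"); // not allowed directly from triage
} catch (err) {
  console.log(err.message);
}
console.log(incident.state);
```

Recording the entry time of each state also gives later analysis the raw timestamps it needs without a separate logging step.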
Metrics That Matter
Track mean time to detect (MTTD), mean time to mitigate (MTTM), and mean time to resolve (MTTR). Report each metric per severity level, because a single average across all incidents hides how the team performs on the outages with the greatest impact.
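As an illustration, these metrics could be computed per severity from incident records. This sketch assumes each record carries a `severity` label and millisecond timestamps named `startedAt`, `detectedAt`, `mitigatedAt`, and `resolvedAt`; those field names are hypothetical. It also assumes all three durations are measured from the start of impact, which is one common convention but not the only one.

```javascript
// Arithmetic mean; returns NaN for an empty list so gaps stay visible.
function mean(values) {
  if (values.length === 0) return NaN;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

const MS_PER_MINUTE = 60000;

// Groups incidents by severity and reports MTTD, MTTM, and MTTR in minutes.
// Assumption: every duration is measured from `startedAt` (start of impact).
function metricsBySeverity(incidents) {
  const groups = new Map();
  for (const incident of incidents) {
    if (!groups.has(incident.severity)) groups.set(incident.severity, []);
    groups.get(incident.severity).push(incident);
  }

  const result = {};
  for (const [severity, group] of groups) {
    const minutesTo = (field) =>
      mean(group.map((i) => (i[field] - i.startedAt) / MS_PER_MINUTE));
    result[severity] = {
      count: group.length,
      mttdMinutes: minutesTo("detectedAt"),
      mttmMinutes: minutesTo("mitigatedAt"),
      mttrMinutes: minutesTo("resolvedAt"),
    };
  }
  return result;
}
```

Keeping the grouping key explicit makes it easy to swap severity for another dimension, such as service or team, if that view is more useful.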
Conclusion
A disciplined incident lifecycle improves reliability by reducing recovery time and capturing actionable learning. The process must be practiced regularly to stay effective during real outages.