Introduction#
Production incidents are inevitable in complex systems. Mature teams treat incident response as a lifecycle with defined phases, clear roles, and measurable outcomes. The goal is to restore service quickly without losing the forensic evidence needed for long-term fixes.
Phases of an Incident#
The lifecycle typically follows five phases.
- Detection: An SLO burn-rate alert or a customer report triggers an incident.
- Triage: Validate scope, severity, and potential blast radius.
- Mitigation: Apply the fastest safe change, such as a rollback or a feature-flag toggle.
- Resolution: Restore full functionality and confirm metrics recovery.
- Postmortem: Document root causes and action items.
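The Detection phase above mentions SLO burn rate. As a minimal sketch, burn rate can be computed as the observed error ratio divided by the error budget ratio (1 minus the SLO target). The function names and the `PAGE_THRESHOLD` value are illustrative assumptions, not part of any specific monitoring product; 14.4 corresponds to spending 2% of a 30-day budget in one hour, and your own thresholds may differ.

```javascript
// Burn rate = observed error ratio / error budget ratio, where the error
// budget ratio is (1 - SLO target). A burn rate of 1 spends the budget
// exactly over the SLO window; higher values spend it faster.
function burnRate(errorCount, totalCount, sloTarget) {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (totalCount <= 0) {
    return 0; // no traffic in the window, so no budget was burned
  }
  const errorRatio = errorCount / totalCount;
  const budgetRatio = 1 - sloTarget;
  return errorRatio / budgetRatio;
}

// Hypothetical paging threshold (assumption): 2% of a 30-day budget in 1 hour.
const PAGE_THRESHOLD = 14.4;

// Returns true when the window burns budget fast enough to open an incident.
function shouldOpenIncident(errorCount, totalCount, sloTarget, threshold = PAGE_THRESHOLD) {
  return burnRate(errorCount, totalCount, sloTarget) >= threshold;
}

// Example: 15 errors in 1,000 requests against a 99.9% SLO.
console.log(shouldOpenIncident(15, 1000, 0.999));
```

In practice, alerting on a single window tends to be noisy; pairing a short and a long window is a common refinement that this sketch leaves out.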
Incident Roles and Communication#
Define roles ahead of time: the incident commander, communications lead, and subject-matter experts should each be explicitly assigned. Communication should be concise and tied to observable telemetry, not speculation.
JavaScript Example: Incident State Machine#
The following state machine enforces consistent incident transitions in internal tooling: each state lists the states that may follow it, and `transition` throws on anything else.
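```javascript
// Canonical incident states; frozen so tooling cannot mutate them at runtime.
const IncidentState = Object.freeze({
  DETECTED: "detected",
  TRIAGE: "triage",
  MITIGATING: "mitigating",
  RESOLVED: "resolved",
  POSTMORTEM: "postmortem"
});

// Returns `next` if the move from `current` is allowed; throws otherwise.
function transition(current, next) {
  // Each key maps a state to the states that may directly follow it.
  const allowed = {
    [IncidentState.DETECTED]: [IncidentState.TRIAGE],
    [IncidentState.TRIAGE]: [IncidentState.MITIGATING, IncidentState.RESOLVED],
    [IncidentState.MITIGATING]: [IncidentState.RESOLVED],
    [IncidentState.RESOLVED]: [IncidentState.POSTMORTEM],
    [IncidentState.POSTMORTEM]: [] // terminal state
  };

  // Unknown states have no entry, so optional chaining makes them fail too.
  if (!allowed[current]?.includes(next)) {
    throw new Error(`Invalid transition ${current} -> ${next}`);
  }
  return next;
}
```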
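The transition rules above can be wrapped in a small incident object that also records when each state was entered, which is the raw data later needed for timing metrics. This is a sketch under stated assumptions: `createIncident`, its `advance` method, and the injectable `now` clock are hypothetical names, and the `ALLOWED` table is a compact copy of the one above so that the block runs on its own.

```javascript
// Compact copy of the transition table from the state machine above,
// repeated here only so this sketch is self-contained.
const ALLOWED = Object.freeze({
  detected: ["triage"],
  triage: ["mitigating", "resolved"],
  mitigating: ["resolved"],
  resolved: ["postmortem"],
  postmortem: []
});

// Hypothetical helper: tracks the current state and a timestamped history.
// `now` is injectable so tests can supply a deterministic clock.
function createIncident(id, now = () => Date.now()) {
  let state = "detected";
  const history = [{ state, at: now() }];

  return {
    id,
    get state() {
      return state;
    },
    // Return a copy so callers cannot rewrite the recorded history.
    get history() {
      return history.slice();
    },
    // Move to `next` if allowed; on failure the state is left unchanged.
    advance(next) {
      if (!ALLOWED[state]?.includes(next)) {
        throw new Error(`Invalid transition ${state} -> ${next}`);
      }
      state = next;
      history.push({ state, at: now() });
      return state;
    }
  };
}

// Usage: walk an incident through triage and mitigation.
const incident = createIncident("INC-001");
incident.advance("triage");
incident.advance("mitigating");
console.log(incident.state, incident.history.length);
```

Keeping the history append-only means the timeline in the postmortem comes from recorded transitions rather than from memory.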
Metrics That Matter#
Track mean time to detect (MTTD), mean time to mitigate (MTTM), and mean time to resolve (MTTR). Report each metric per severity level rather than as one overall average, so that frequent low-severity incidents do not mask slow recovery from severe ones.
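As a rough sketch of how these metrics might be computed, the function below groups incident records by severity and averages the three durations in minutes. The record shape (`severity`, `startedAt`, `detectedAt`, `mitigatedAt`, `resolvedAt` as epoch milliseconds) is a hypothetical assumption, and so is the choice to measure every duration from `startedAt`; some teams measure mitigation and resolution from detection instead.

```javascript
const MS_PER_MINUTE = 60000;

// Mean of a list of numbers; returns null for an empty list so a missing
// metric is not mistaken for a zero-minute recovery.
function mean(values) {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Groups incidents by severity and computes MTTD, MTTM, and MTTR in minutes.
// Assumption: all durations are measured from `startedAt`.
function summarizeBySeverity(incidents) {
  const groups = new Map();
  for (const incident of incidents) {
    if (!groups.has(incident.severity)) groups.set(incident.severity, []);
    groups.get(incident.severity).push(incident);
  }

  const summary = {};
  for (const [severity, group] of groups) {
    // Average minutes elapsed between startedAt and the given timestamp field.
    const meanMinutes = (field) =>
      mean(group.map((i) => (i[field] - i.startedAt) / MS_PER_MINUTE));

    summary[severity] = {
      count: group.length,
      mttdMinutes: meanMinutes("detectedAt"),
      mttmMinutes: meanMinutes("mitigatedAt"),
      mttrMinutes: meanMinutes("resolvedAt")
    };
  }
  return summary;
}

// Usage with made-up sample records (timestamps in epoch milliseconds).
const sample = [
  { severity: "SEV1", startedAt: 0, detectedAt: 5 * MS_PER_MINUTE, mitigatedAt: 20 * MS_PER_MINUTE, resolvedAt: 60 * MS_PER_MINUTE },
  { severity: "SEV1", startedAt: 0, detectedAt: 15 * MS_PER_MINUTE, mitigatedAt: 40 * MS_PER_MINUTE, resolvedAt: 120 * MS_PER_MINUTE }
];
console.log(summarizeBySeverity(sample));
```

Means are sensitive to outliers, so a single very long incident can dominate a severity bucket; reporting the count alongside each mean helps readers judge how much weight to give it.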
Conclusion#
A disciplined incident lifecycle improves reliability by reducing recovery time and capturing actionable learning. The process must be practiced regularly to stay effective during real outages.