Production Incident Lifecycle

Introduction

Production incidents are inevitable in complex systems. Mature teams treat incident response as a lifecycle with defined phases, clear roles, and measurable outcomes. The goal is to restore service quickly without losing the forensic evidence needed for long-term fixes.

Phases of an Incident

The lifecycle typically follows five phases.

  • Detection: An SLO burn-rate alert or a customer report triggers the incident.
  • Triage: Validate scope, severity, and potential blast radius.
  • Mitigation: Apply the fastest safe change, such as a rollback or a feature-flag toggle.
  • Resolution: Restore full functionality and confirm metrics recovery.
  • Postmortem: Document root causes and action items.
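The detection phase mentions SLO burn rate. As a minimal sketch, burn rate can be computed as the observed error ratio divided by the error budget (1 minus the SLO target). The function names and the default threshold of 14.4 (a commonly cited fast-burn value for a one-hour window on a 30-day SLO) are assumptions for illustration, not values from this article.

```javascript
// Burn rate = observed error ratio / error budget ratio.
// A burn rate of 1 spends the budget exactly over the full SLO window.
function burnRate(errorCount, totalCount, sloTarget) {
  if (!(sloTarget > 0 && sloTarget < 1)) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  if (totalCount === 0) return 0; // no traffic, so no budget was spent
  const errorBudget = 1 - sloTarget; // e.g. about 0.001 for a 99.9% SLO
  const errorRatio = errorCount / totalCount;
  return errorRatio / errorBudget;
}

// Hypothetical paging rule: the threshold is an assumption and should be
// tuned to the alerting window and the SLO period in use.
function shouldOpenIncident(rate, threshold = 14.4) {
  return rate >= threshold;
}

// 50 errors in 10,000 requests against a 99.9% SLO burns roughly 5x budget.
const rate = burnRate(50, 10000, 0.999);
console.log(rate.toFixed(2), shouldOpenIncident(rate));
```

Real alerting setups usually combine a short and a long window to reduce noise; this sketch evaluates a single window only.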

Incident Roles and Communication

Define roles ahead of time: the incident commander, communications lead, and subject-matter experts should each be named explicitly before an incident begins. Communication should be concise and tied to observable telemetry, not speculation.
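One way to keep updates tied to telemetry is to build them from a structured record that cannot be formatted without at least one observation. The sketch below is illustrative only; the function name `formatStatusUpdate` and every field name are assumptions rather than part of any existing tool.

```javascript
// Builds a short status update from observed metrics only.
// All field names here are hypothetical.
function formatStatusUpdate({ severity, state, commander, observations, nextUpdateMinutes }) {
  if (!Array.isArray(observations) || observations.length === 0) {
    throw new Error("A status update needs at least one observed metric");
  }
  // Each observation names its source so readers can verify it.
  const lines = observations.map(
    (o) => `- ${o.metric}: ${o.value} (source: ${o.source})`
  );
  return [
    `[${severity}] state=${state} IC=${commander}`,
    "Observed:",
    ...lines,
    `Next update in ${nextUpdateMinutes} min`
  ].join("\n");
}

console.log(
  formatStatusUpdate({
    severity: "SEV2",
    state: "mitigating",
    commander: "alice",
    observations: [{ metric: "checkout 5xx ratio", value: "4.2%", source: "dashboard" }],
    nextUpdateMinutes: 15
  })
);
```

Rejecting an update with no observations is a deliberate choice here: it nudges the communications lead to cite a metric instead of a guess.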

JavaScript Example: Incident State Machine

The following state machine helps you drive consistent incident transitions in internal tooling.

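// Canonical incident states, frozen so tooling cannot mutate them.
const IncidentState = Object.freeze({
  DETECTED: "detected",
  TRIAGE: "triage",
  MITIGATING: "mitigating",
  RESOLVED: "resolved",
  POSTMORTEM: "postmortem"
});

// Allowed next states for each state. Triage may skip directly to
// resolved; postmortem is terminal.
const ALLOWED_TRANSITIONS = Object.freeze({
  [IncidentState.DETECTED]: [IncidentState.TRIAGE],
  [IncidentState.TRIAGE]: [IncidentState.MITIGATING, IncidentState.RESOLVED],
  [IncidentState.MITIGATING]: [IncidentState.RESOLVED],
  [IncidentState.RESOLVED]: [IncidentState.POSTMORTEM],
  [IncidentState.POSTMORTEM]: []
});

// Returns the next state if the transition is allowed; throws otherwise.
function transition(current, next) {
  const targets = ALLOWED_TRANSITIONS[current];
  // Array.isArray also rejects unknown states and inherited keys
  // such as "constructor".
  if (!Array.isArray(targets) || !targets.includes(next)) {
    throw new Error(`Invalid transition ${current} -> ${next}`);
  }

  return next;
}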
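As a usage sketch, the same transition table can back an incident record that timestamps every state change, which later feeds the timing metrics. The table is repeated so the snippet runs on its own; `createIncident`, `advance`, and the record shape are hypothetical names introduced for illustration.

```javascript
// Same transitions as the state machine above, repeated for a standalone run.
const ALLOWED = Object.freeze({
  detected: ["triage"],
  triage: ["mitigating", "resolved"],
  mitigating: ["resolved"],
  resolved: ["postmortem"],
  postmortem: []
});

// Hypothetical helper: a new incident always starts in "detected".
function createIncident(id, now = Date.now()) {
  return { id, state: "detected", history: [{ state: "detected", at: now }] };
}

// Returns a new record with the transition appended to its history.
// The input record is not mutated.
function advance(incident, next, now = Date.now()) {
  const targets = ALLOWED[incident.state];
  if (!Array.isArray(targets) || !targets.includes(next)) {
    throw new Error(`Invalid transition ${incident.state} -> ${next}`);
  }
  return {
    ...incident,
    state: next,
    history: [...incident.history, { state: next, at: now }]
  };
}

let incident = createIncident("INC-1");
incident = advance(incident, "triage");
incident = advance(incident, "mitigating");
console.log(incident.state, incident.history.length);

try {
  advance(incident, "detected"); // moving backwards is rejected
} catch (err) {
  console.log(err.message);
}
```

Returning a new record instead of mutating in place keeps earlier snapshots intact, which can help when reconstructing a timeline for the postmortem.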

Metrics That Matter

Track mean time to detect (MTTD), mean time to mitigate (MTTM), and mean time to resolve (MTTR). Segment each metric by severity level so that the averages reflect actual impact and are not skewed by a mix of minor and major incidents.
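A minimal sketch of computing these three metrics per severity level follows. It assumes each incident record carries epoch-millisecond timestamps named `startedAt`, `detectedAt`, `mitigatedAt`, and `resolvedAt`, and it measures all three durations from `startedAt`. Those field names and that choice of start point are assumptions, since teams differ on whether the clock starts at impact or at detection.

```javascript
// Arithmetic mean, or null for an empty list.
function mean(values) {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Groups incidents by severity and averages each duration in milliseconds.
// Field names (severity, startedAt, ...) are hypothetical.
function metricsBySeverity(incidents) {
  const groups = new Map();
  for (const inc of incidents) {
    if (!groups.has(inc.severity)) groups.set(inc.severity, []);
    groups.get(inc.severity).push(inc);
  }

  const result = {};
  for (const [severity, list] of groups) {
    result[severity] = {
      count: list.length,
      mttdMs: mean(list.map((i) => i.detectedAt - i.startedAt)),
      mttmMs: mean(list.map((i) => i.mitigatedAt - i.startedAt)),
      mttrMs: mean(list.map((i) => i.resolvedAt - i.startedAt))
    };
  }
  return result;
}

const sample = [
  { severity: "SEV1", startedAt: 0, detectedAt: 60000, mitigatedAt: 600000, resolvedAt: 1800000 },
  { severity: "SEV1", startedAt: 0, detectedAt: 120000, mitigatedAt: 1200000, resolvedAt: 3600000 },
  { severity: "SEV3", startedAt: 1000, detectedAt: 31000, mitigatedAt: 61000, resolvedAt: 91000 }
];
console.log(metricsBySeverity(sample));
```

Means are sensitive to outliers, so a single long outage can dominate a severity bucket; reporting the count alongside each mean makes that easier to spot.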

Conclusion

A disciplined incident lifecycle improves reliability by reducing recovery time and capturing actionable learning. The process must be practiced regularly to stay effective during real outages.
