Production Incident Lifecycle

Introduction

Production incidents are inevitable in complex systems. Mature teams treat incident response as a lifecycle with defined phases, clear roles, and measurable outcomes. The goal is to restore service quickly without losing the forensic evidence needed for long-term fixes.

Phases of an Incident

The lifecycle typically follows five phases.

  • Detection: An SLO burn-rate alert or a customer report triggers an incident.
  • Triage: Validate scope, severity, and potential blast radius.
  • Mitigation: Apply the fastest safe change, such as a rollback or a feature-flag toggle.
  • Resolution: Restore full functionality and confirm metrics recovery.
  • Postmortem: Document root causes and action items.
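
The Detection phase mentions SLO burn rate without defining it. As a rough sketch, burn rate is commonly computed as the observed error ratio divided by the error budget (1 minus the SLO target). The function names and the 14.4 threshold below are assumptions for illustration, not values from this post; choose thresholds that match your own alerting policy.

```javascript
// Hypothetical helpers: names and the threshold are illustrative assumptions.
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
function burnRate(errorCount, totalCount, sloTarget) {
  if (totalCount <= 0) return 0; // no traffic in the window, nothing burned
  const errorBudget = 1 - sloTarget;
  if (errorBudget <= 0) throw new RangeError("sloTarget must be below 1");
  return (errorCount / totalCount) / errorBudget;
}

// Assumed paging threshold for a short window; tune it for your SLO policy.
const PAGE_THRESHOLD = 14.4;

function shouldOpenIncident(errorCount, totalCount, sloTarget, threshold = PAGE_THRESHOLD) {
  return burnRate(errorCount, totalCount, sloTarget) >= threshold;
}
```

With a 99.9% target, 20 errors in 1,000 requests gives a burn rate of roughly 20, which crosses the assumed threshold and would open an incident.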

Incident Roles and Communication

Define roles ahead of time: the incident commander, communications lead, and subject-matter experts should each be named explicitly. Communication should be concise and tied to observable telemetry, not speculation.

JavaScript Example: Incident State Machine

The following state machine helps you drive consistent incident transitions in internal tooling.

const IncidentState = Object.freeze({
  DETECTED: "detected",
  TRIAGE: "triage",
  MITIGATING: "mitigating",
  RESOLVED: "resolved",
  POSTMORTEM: "postmortem"
});

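// Allowed next states for each current state. Triage may skip mitigation
// and go straight to "resolved".
const ALLOWED_TRANSITIONS = Object.freeze({
  [IncidentState.DETECTED]: [IncidentState.TRIAGE],
  [IncidentState.TRIAGE]: [IncidentState.MITIGATING, IncidentState.RESOLVED],
  [IncidentState.MITIGATING]: [IncidentState.RESOLVED],
  [IncidentState.RESOLVED]: [IncidentState.POSTMORTEM],
  [IncidentState.POSTMORTEM]: [] // terminal state
});

// Returns `next` if the transition is allowed; otherwise throws.
function transition(current, next) {
  // Object.hasOwn keeps inherited keys such as "constructor" from being
  // treated as states.
  const targets = Object.hasOwn(ALLOWED_TRANSITIONS, current)
    ? ALLOWED_TRANSITIONS[current]
    : [];

  if (!targets.includes(next)) {
    throw new Error(`Invalid transition ${current} -> ${next}`);
  }

  return next;
}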
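
To show how such a function might be used in tooling, here is a minimal sketch of a wrapper that records a timestamp for every state change. The name `createIncident`, the `history` shape, and the injectable `now` clock are all hypothetical; the wrapper only assumes that `transitionFn` behaves like the `transition(current, next)` function above, returning the next state or throwing.

```javascript
// Hypothetical wrapper, not part of the original tooling.
// `transitionFn(current, next)` must return `next` or throw.
// `now` is injectable so tests can supply a fake clock.
function createIncident(transitionFn, now = () => Date.now()) {
  let state = "detected";
  const history = [{ state, at: now() }];

  return {
    get state() {
      return state;
    },
    get history() {
      return history.slice(); // shallow copy so callers cannot reorder the log
    },
    advance(next) {
      // If transitionFn throws, state and history are left unchanged.
      state = transitionFn(state, next);
      history.push({ state, at: now() });
      return state;
    }
  };
}
```

Passing `transition` as `transitionFn` and calling `advance("triage")` moves the incident forward and appends a timestamped entry, while an invalid request such as `advance("postmortem")` from triage throws and leaves the recorded history untouched.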

Metrics That Matter

Track mean time to detect (MTTD), mean time to mitigate (MTTM), and mean time to resolve (MTTR). Break each metric down by severity level so that the numbers reflect actual impact.
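
As an illustration, the three metrics could be computed per severity from a list of incident records. The record shape (`severity`, `startedAt`, `detectedAt`, `mitigatedAt`, `resolvedAt`, all in epoch milliseconds) is hypothetical, and the measurement windows are an assumption: detection is measured from impact start, while mitigation and resolution are measured from detection. Teams define these windows differently, so adjust the subtractions to match your own definitions.

```javascript
// Hypothetical record shape:
//   { severity, startedAt, detectedAt, mitigatedAt, resolvedAt }
// Timestamps are milliseconds since the epoch.
function mean(values) {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function metricsBySeverity(incidents) {
  // Group incidents by severity label.
  const groups = new Map();
  for (const inc of incidents) {
    if (!groups.has(inc.severity)) groups.set(inc.severity, []);
    groups.get(inc.severity).push(inc);
  }

  // Assumed windows: detect = start -> detection,
  // mitigate = detection -> mitigation, resolve = detection -> resolution.
  const result = {};
  for (const [severity, group] of groups) {
    result[severity] = {
      count: group.length,
      mttdMs: mean(group.map((i) => i.detectedAt - i.startedAt)),
      mttmMs: mean(group.map((i) => i.mitigatedAt - i.detectedAt)),
      mttrMs: mean(group.map((i) => i.resolvedAt - i.detectedAt))
    };
  }
  return result;
}
```

Grouping before averaging keeps a handful of long low-severity incidents from masking the response times of the high-severity ones.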

Conclusion

A disciplined incident lifecycle improves reliability by reducing recovery time and capturing actionable learning. The process must be practiced regularly to stay effective during real outages.

This post is licensed under CC BY 4.0 by the author.