Alert Fatigue: How to Fix It

Introduction

Alert fatigue happens when on-call engineers are flooded with low-signal alerts. The result is slower incident response and a gradual erosion of trust in the monitoring system. Solving it requires shifting from static threshold-based alerts to alerting based on service level objectives (SLOs), backed by disciplined alert hygiene.

Root Causes of Alert Fatigue

Alert fatigue is usually caused by weak signal design rather than tooling.

  • Thresholds that ignore traffic patterns or seasonality.
  • Alerts that fire on internal signals (for example, high CPU) rather than on user impact.
  • Duplicate alerts across layers with no deduplication.
  • Alerts with no clear action or runbook.

Align Alerts to SLOs

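SLO-based alerts reduce noise by focusing on user impact. A burn-rate alert fires when you are consuming error budget too quickly. Burn rate is the observed error rate divided by the error budget, so a burn rate of 1 spends the budget exactly over the SLO window. Because the error rate is a fraction of requests rather than a count of errors, a burn-rate alert is more stable during traffic spikes than one based on raw error counts.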

Python Example: Burn Rate Evaluation

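This example computes a fast burn rate from a 5-minute window and a slow burn rate from a 1-hour window, then returns an alert severity. Error rates are fractions of failed requests (0.01 means 1%). It assumes an SLO of 99.9% over 30 days.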

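def burn_rate(error_rate: float, slo: float) -> float:
    """Return how many times faster than budgeted the error budget is burning."""
    error_budget = 1.0 - slo
    if error_budget <= 0:
        raise ValueError("slo must be below 1.0 so the error budget is non-zero")
    return error_rate / error_budget


def alert_severity(error_rate_5m: float, error_rate_1h: float, slo: float = 0.999) -> str:
    """Map 5-minute and 1-hour error rates (as fractions) to an alert severity."""
    fast_burn = burn_rate(error_rate_5m, slo)
    slow_burn = burn_rate(error_rate_1h, slo)

    # Require both windows to agree, so a brief spike alone is not critical.
    if fast_burn >= 14 and slow_burn >= 6:
        return "critical"
    # A slower burn sustained over the hour is a warning.
    if slow_burn >= 2:
        return "warning"
    return "none"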
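
Burn-rate thresholds are easier to judge once they are translated into budget consumption. The following is a rough sketch, assuming the burn rate stays constant and using the 30-day SLO window stated above. The function names budget_consumed and hours_to_exhaustion are hypothetical and not part of the example above.

```python
def budget_consumed(burn: float, window_hours: float, slo_period_days: int = 30) -> float:
    """Fraction of the error budget spent if `burn` is sustained for `window_hours`."""
    period_hours = slo_period_days * 24
    return burn * window_hours / period_hours


def hours_to_exhaustion(burn: float, slo_period_days: int = 30) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    if burn <= 0:
        raise ValueError("burn must be positive")
    return slo_period_days * 24 / burn


if __name__ == "__main__":
    for burn in (2, 6, 14):
        print(burn, round(hours_to_exhaustion(burn), 1), "hours to exhaustion")
```

Under these assumptions, a burn rate of 14 exhausts a 30-day budget in about 51 hours, while a burn rate of 2 exhausts it in 15 days. That gap is one reason to treat the two levels with different urgency.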

Deduplication and Routing

Alert deduplication should happen before paging. Use a stable incident key such as service + alert_name + region and suppress alerts that are already correlated to a parent incident. Route alerts to the team that owns the error budget, not the infrastructure component.
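
The incident-key idea can be sketched in a few lines. This is a minimal illustration, not a real alerting API: the Alert class, incident_key, and alerts_to_page are hypothetical names, and correlation to a parent incident is simplified to a set of keys for incidents that are already open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    alert_name: str
    region: str


def incident_key(alert: Alert) -> str:
    """Build a stable key from fields that do not change between firings."""
    return f"{alert.service}:{alert.alert_name}:{alert.region}"


def alerts_to_page(alerts: list[Alert], open_incident_keys: set[str]) -> list[Alert]:
    """Return the alerts that should page: one per key, skipping open incidents."""
    seen = set(open_incident_keys)  # copy so the caller's set is not mutated
    to_page = []
    for alert in alerts:
        key = incident_key(alert)
        if key in seen:
            continue  # duplicate, or already attached to an open incident
        seen.add(key)
        to_page.append(alert)
    return to_page


if __name__ == "__main__":
    incoming = [
        Alert("checkout", "HighBurnRate", "us-east-1"),
        Alert("checkout", "HighBurnRate", "us-east-1"),  # duplicate firing
        Alert("search", "HighBurnRate", "eu-west-1"),    # incident already open
    ]
    paged = alerts_to_page(incoming, {"search:HighBurnRate:eu-west-1"})
    print([incident_key(a) for a in paged])
```

The key deliberately leaves out volatile fields such as hostnames or timestamps, which would otherwise produce a new key on every firing.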

Continuous Alert Hygiene

Review alerts every sprint. For each alert, document the last time it fired, how it was handled, and whether it led to a code change. Retire alerts that never lead to action.
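
One way to keep the review lightweight is to record each alert as a small structured entry and filter for retirement candidates. The sketch below is only an illustration: the AlertReview fields, the action_taken flag, and the sample rows are made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AlertReview:
    """One row of a sprint alert review (field names are illustrative)."""
    name: str
    last_fired: Optional[date]  # None if the alert has never fired
    handling: str               # short note on how the last firing was handled
    led_to_code_change: bool
    action_taken: bool          # did anyone need to do something in response?


def retirement_candidates(reviews: list[AlertReview]) -> list[str]:
    """Return names of alerts whose last firing required no action."""
    return [r.name for r in reviews if not r.action_taken]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    reviews = [
        AlertReview("checkout-burn-rate", date(2024, 5, 2), "rolled back deploy", True, True),
        AlertReview("disk-70-percent", date(2024, 5, 9), "acknowledged, no action", False, False),
    ]
    print(retirement_candidates(reviews))
```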

Conclusion

Alert fatigue is a process problem, not a tooling problem. Focus on SLO-aligned signals, deduplication, and routine alert hygiene to restore trust in on-call systems.

This post is licensed under CC BY 4.0 by the author.