Introduction#
Alert fatigue happens when on-call engineers are flooded with low-signal alerts. The result is slower incident response and a gradual erosion of trust in the monitoring system. Solving it requires shifting from threshold-based alerts to objective-based alerting and disciplined alert hygiene.
Root Causes of Alert Fatigue#
Alert fatigue is usually caused by weak signal design rather than tooling. Common causes include:
- Thresholds that ignore traffic patterns or seasonality.
- Alerts that fire on internal conditions (such as high CPU) rather than on user impact.
- Duplicate alerts across layers with no deduplication.
- Alerts with no clear action or runbook.
Align Alerts to SLOs#
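SLO-based alerts reduce noise by focusing on user impact. A burn-rate alert fires when you are consuming error budget too quickly. Burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends the budget exactly over the SLO window. This is more stable than raw error-rate alerts during traffic spikes.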
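To put numbers on "too quickly", the sketch below converts a sustained error rate into the number of days until the error budget runs out. It assumes a 99.9% SLO over a 30-day window and a constant error rate; the function and constant names are illustrative and not taken from any library.

```python
SLO_TARGET = 0.999   # assumed SLO: 99.9% of requests succeed
WINDOW_DAYS = 30.0   # assumed SLO window


def days_until_budget_exhausted(
    error_rate: float,
    slo: float = SLO_TARGET,
    window_days: float = WINDOW_DAYS,
) -> float:
    """Days until the full error budget is spent at a constant error rate."""
    if error_rate <= 0:
        return float("inf")  # no errors, so the budget is never consumed
    error_budget = 1.0 - slo
    burn = error_rate / error_budget  # burn rate as a multiple of the budget
    return window_days / burn


# A sustained 1.4% error rate is a burn rate of about 14,
# which spends a 30-day budget in a little over two days.
print(round(days_until_budget_exhausted(0.014), 1))
```

An error rate equal to the budget (0.1% here) gives a burn rate of 1 and exhausts the budget in exactly one window, which is why thresholds are expressed as multiples of 1.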
Python Example: Burn Rate Evaluation#
This example computes a fast burn rate over a 5-minute window and a slow burn rate over a 1-hour window, then returns an alert severity. It assumes an SLO of 99.9% over 30 days.
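```python
SLO_TARGET = 0.999  # 99.9% over a 30-day window


def burn_rate(error_rate: float, slo: float = SLO_TARGET) -> float:
    """Return the error rate as a multiple of the error budget."""
    error_budget = 1.0 - slo
    return error_rate / error_budget


def alert_severity(error_rate_5m: float, error_rate_1h: float) -> str:
    """Map 5-minute and 1-hour error rates to an alert severity."""
    fast_burn = burn_rate(error_rate_5m)
    slow_burn = burn_rate(error_rate_1h)
    # Page only when both windows agree that the budget is burning quickly.
    if fast_burn >= 14 and slow_burn >= 6:
        return "critical"
    if slow_burn >= 2:
        return "warning"
    return "none"
```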
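The error rates passed to the severity check have to come from somewhere, typically request counters sampled per window. The following is a minimal sketch of that step. The `min_requests` guard and its default of 100 are assumptions added here to avoid alerting on a handful of requests; treating low-traffic windows as zero error rate is a design choice that may not suit services where low traffic is itself a symptom.

```python
def windowed_error_rate(errors: int, total: int, min_requests: int = 100) -> float:
    """Error ratio for one window, or 0.0 when traffic is too low to judge."""
    # max(..., 1) also protects against division by zero when total is 0.
    if total < max(min_requests, 1):
        return 0.0
    return errors / total


# Hypothetical counter readings for the 5-minute and 1-hour windows.
rate_5m = windowed_error_rate(errors=45, total=3_000)
rate_1h = windowed_error_rate(errors=250, total=36_000)
print(rate_5m, rate_1h)
```

Because the result is a ratio rather than a count, a traffic spike with a constant failure percentage leaves it unchanged.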
Deduplication and Routing#
Alert deduplication should happen before paging. Use a stable incident key such as service + alert_name + region and suppress alerts that are already correlated to a parent incident. Route alerts to the team that owns the error budget, not the infrastructure component.
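One way to sketch this is an in-memory deduplicator keyed on service, alert name, and region. The class and field names are hypothetical, and a real system would persist open incidents rather than hold them in a set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Alert:
    service: str
    alert_name: str
    region: str

    @property
    def incident_key(self) -> str:
        """Stable key: identical for every firing of the same alert."""
        return f"{self.service}:{self.alert_name}:{self.region}"


class Deduplicator:
    def __init__(self) -> None:
        self._open_keys = set()  # incident keys that have already paged

    def should_page(self, alert: Alert, parent_key: Optional[str] = None) -> bool:
        """Return True only for the first alert of an incident."""
        # Suppress repeats, and children of an already-open parent incident.
        if alert.incident_key in self._open_keys or parent_key in self._open_keys:
            return False
        self._open_keys.add(alert.incident_key)
        return True

    def resolve(self, alert: Alert) -> None:
        """Close the incident so that a later recurrence pages again."""
        self._open_keys.discard(alert.incident_key)


dedup = Deduplicator()
checkout = Alert("checkout", "HighBurnRate", "eu-west-1")
print(dedup.should_page(checkout))  # first firing pages
print(dedup.should_page(checkout))  # repeat is suppressed
```

How a child alert learns its `parent_key` (for example, from a service dependency map) is left out of this sketch.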
Continuous Alert Hygiene#
Review alerts every sprint. For each alert, document the last time it fired, how it was handled, and whether it led to a code change. Retire alerts that never lead to any action.
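The review can be kept as structured records so that retirement candidates fall out of a simple filter. This is a sketch under assumed field names; `action_taken` is a hypothetical flag summarizing the free-text handling notes.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class AlertReview:
    name: str
    last_fired: Optional[date]  # None if the alert has never fired
    handling: str               # free-text note on how it was handled
    action_taken: bool          # did a responder have to do anything?
    led_to_code_change: bool


def retirement_candidates(reviews: List[AlertReview]) -> List[str]:
    """Names of alerts that produced neither an action nor a code change."""
    return [
        r.name
        for r in reviews
        if not r.action_taken and not r.led_to_code_change
    ]


# Hypothetical review data for one sprint.
reviews = [
    AlertReview("CheckoutBurnRate", date(2024, 5, 2), "rolled back deploy", True, True),
    AlertReview("DiskUsageHigh", date(2024, 5, 9), "acknowledged, self-resolved", False, False),
]
print(retirement_candidates(reviews))
```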
Conclusion#
Alert fatigue is a process problem, not a tooling problem. Focus on SLO-aligned signals, deduplication, and routine alert hygiene to restore trust in on-call systems.