Introduction
An error budget is the maximum allowable downtime or error rate derived from your SLO. It converts an abstract reliability goal into a concrete operational policy: if you have budget remaining, ship features; if the budget is exhausted, focus on reliability. Error budgets align engineering and product teams around a shared definition of acceptable risk.
From SLO to Error Budget
| # SLO: 99.9% availability per month
# Error budget = 1 - SLO = 0.1% of requests may fail
SLO = 0.999
ERROR_BUDGET_FRACTION = 1 - SLO # 0.001 = 0.1%
MONTH_MINUTES = 30 * 24 * 60 # 43,200 minutes
ALLOWED_DOWNTIME_MINUTES = MONTH_MINUTES * ERROR_BUDGET_FRACTION # 43.2 minutes
# If handling 1000 req/s:
REQUESTS_PER_MONTH = 1000 * 60 * MONTH_MINUTES # 2.59 billion
ALLOWED_ERRORS = REQUESTS_PER_MONTH * ERROR_BUDGET_FRACTION # 2.59 million errors
print(f"Allowed downtime: {ALLOWED_DOWNTIME_MINUTES:.1f} minutes/month")
print(f"Allowed errors: {ALLOWED_ERRORS:,.0f} per month")
|
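The same arithmetic generalizes to other targets. The following is a minimal sketch, assuming a 30-day window by default; `allowed_downtime_minutes` is an illustrative helper name, not something defined elsewhere in this article.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage permitted by `slo` over `window_days`."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo)


# Compare how quickly the budget shrinks as the target adds another nine.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} -> {minutes:.1f} minutes per 30 days")
```

Each additional nine cuts the budget by a factor of ten, which is why the choice of SLO target deserves as much scrutiny as the alerting built on top of it.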
Calculating Burn Rate
Burn rate measures how fast you are consuming your error budget.
| Burn rate = current error rate / error budget rate
Burn rate 1 = consuming budget at exactly the right pace (exhausted at month end)
Burn rate 2 = consuming budget 2x too fast (exhausted in 2 weeks)
Burn rate 14.4 = consuming budget 14.4x too fast (2% of a 30-day budget spent per hour; exhausted in about 50 hours)
|
| def calculate_burn_rate(
    current_error_rate: float,  # fraction of requests failing right now
    slo: float,
) -> float:
    error_budget_rate = 1 - slo
    return current_error_rate / error_budget_rate
# Example: SLO = 99.9%, currently 1% of requests failing
burn_rate = calculate_burn_rate(0.01, 0.999)
print(f"Burn rate: {burn_rate:.1f}x")
# Output: Burn rate: 10.0x
# At this rate, budget exhausted in 3 days (30 day month / 10)
|
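To turn a burn rate into a time horizon, divide the window length by the burn rate. The sketch below assumes the burn rate stays constant and the budget starts full; `hours_to_exhaustion` and `budget_consumed_fraction` are illustrative helper names, not part of any library.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    return window_days * 24 / burn_rate


def budget_consumed_fraction(
    burn_rate: float, elapsed_hours: float, window_days: int = 30
) -> float:
    """Fraction of the whole window's budget spent after `elapsed_hours`."""
    return burn_rate * elapsed_hours / (window_days * 24)


print(f"{hours_to_exhaustion(10.0):.0f} hours")        # 72 hours, i.e. 3 days
print(f"{budget_consumed_fraction(14.4, 1.0):.1%}")    # 2.0% of the budget in 1 hour
```

The second function is the more useful one for alert design: it answers "how much budget will this incident cost if it lasts another hour?"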
Multi-Window Burn Rate Alerts (Google’s Approach)
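Alert on burn rate over several time windows, each with its own threshold, to balance precision and recall. A short window with a high threshold catches severe outages quickly, while a long window with a lower threshold catches slow leaks. Pairing a long window with a shorter one, as the medium rule below does, confirms the burn is still in progress before paging.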
| # Prometheus alerting rules (SLO 99.9%, 30-day budget)
# sum() aggregates away labels so that 5xx traffic is divided by all traffic;
# without it, each 5xx series would be divided by itself and the ratio would be 1.
groups:
  - name: error-budget
    rules:
      # Page immediately: fast burn (14.4x spends 2% of the budget per hour)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999) > 14.4
        for: 2m
        annotations:
          summary: "Error budget burning fast (>14.4x burn rate)"
          runbook: "https://runbook.example.com/error-budget-fast"
      # Page: moderate burn (6x spends 5% of the budget in 6 hours)
      - alert: ErrorBudgetBurnMedium
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) / (1 - 0.999) > 6
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) / (1 - 0.999) > 6
        for: 15m
        annotations:
          summary: "Error budget burning (>6x burn rate)"
      # Ticket: slow burn (3x spends 10% of the budget in 24 hours)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[24h]))
            /
            sum(rate(http_requests_total[24h]))
          ) / (1 - 0.999) > 3
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error budget burn rate elevated (>3x)"
|
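The thresholds 14.4, 6, and 3 in the rules above are not arbitrary: each is the burn rate that spends a chosen share of a 30-day budget within the rule's long window. The sketch below shows that derivation and the long-window-and-short-window check in plain Python. The function names and the example error rates are hypothetical, chosen only for illustration.

```python
def burn_rate_threshold(
    budget_fraction: float, window_hours: float, period_days: int = 30
) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_days * 24 / window_hours


def multi_window_alert(
    long_error_rate: float,
    short_error_rate: float,
    slo: float,
    threshold: float,
) -> bool:
    """Fire only if both the long and the short window exceed the threshold."""
    budget_rate = 1 - slo
    long_burn = long_error_rate / budget_rate
    short_burn = short_error_rate / budget_rate
    return long_burn > threshold and short_burn > threshold


# (share of budget, long window in hours) for the three rules
for share, hours in ((0.02, 1), (0.05, 6), (0.10, 24)):
    print(f"{share:.0%} in {hours}h -> burn rate {burn_rate_threshold(share, hours):.1f}")

# Incident already over: the 6h window is still hot, the 30m window is clean.
print(multi_window_alert(0.01, 0.0, slo=0.999, threshold=6))
```

The last call returns `False`: the short window stops the alert from paging for an incident that has already recovered, even though the long window still shows an elevated rate.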
Error Budget Dashboard
| # Python: calculate remaining error budget
import prometheus_api_client as prom

client = prom.PrometheusConnect(url="http://prometheus:9090")


def query_scalar(promql: str) -> float:
    """Return the first sample value of an instant query, or 0.0 if nothing matches."""
    result = client.custom_query(promql)
    return float(result[0]["value"][1]) if result else 0.0


def get_error_budget_status(service: str, slo: float, window_days: int = 30) -> dict:
    window = f"{window_days * 24}h"
    total = query_scalar(
        f'sum(increase(http_requests_total{{service="{service}"}}[{window}]))'
    )
    # An empty result here means no 5xx series exist, which counts as zero errors.
    errors = query_scalar(
        f'sum(increase(http_requests_total{{service="{service}",status=~"5.."}}[{window}]))'
    )
    error_rate = errors / total if total > 0 else 0.0
    budget_total = total * (1 - slo)
    budget_remaining = budget_total - errors
    # With no traffic there is no budget to spend; report it as fully intact.
    budget_pct_remaining = (
        budget_remaining / budget_total * 100 if budget_total > 0 else 100.0
    )
    return {
        "error_rate": error_rate,
        "budget_consumed_pct": 100 - budget_pct_remaining,
        "budget_remaining_pct": budget_pct_remaining,
        "errors_allowed": budget_total,
        "errors_actual": errors,
    }


status = get_error_budget_status("api", slo=0.999)
print(f"Budget remaining: {status['budget_remaining_pct']:.1f}%")
|
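The budget arithmetic is easier to check by hand when it is separated from the Prometheus queries. The following is a small sketch with made-up request counts; `budget_status` is a hypothetical helper, not a function of `prometheus_api_client`.

```python
def budget_status(total: float, errors: float, slo: float) -> dict:
    """Compute the allowed error count and the share of budget still unspent."""
    budget_total = total * (1 - slo)
    if budget_total <= 0:
        # No traffic (or a 100% SLO) leaves nothing to divide by.
        return {"errors_allowed": 0.0, "budget_remaining_pct": 100.0}
    remaining_pct = (budget_total - errors) / budget_total * 100
    return {"errors_allowed": budget_total, "budget_remaining_pct": remaining_pct}


# 10 million requests at 99.9% allow about 10,000 errors; 4,000 have occurred.
healthy = budget_status(10_000_000, 4_000, slo=0.999)
print(f"{healthy['budget_remaining_pct']:.1f}% remaining")

# 1 million requests allow about 1,000 errors; 1,500 have occurred.
overspent = budget_status(1_000_000, 1_500, slo=0.999)
print(f"{overspent['budget_remaining_pct']:.1f}% remaining")
```

The second case prints a negative percentage. A dashboard should show that value rather than clamp it to zero, because the size of the overspend indicates how long the budget will take to recover.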
Error Budget Policy
A documented policy prevents debates about prioritization:
| Error Budget Policy: API Service
SLO: 99.9% availability (28-day rolling window)
Actions when budget is:
- >50% remaining: normal feature development cadence
- 25-50% remaining: reduce risky deployments, increase testing
- 10-25% remaining: freeze new features, focus on reliability
- <10% remaining: incident response mode
  - Halt all non-critical deployments
  - On-call team focuses exclusively on reliability
  - Post-mortem required for each contributing incident
  - Engineering manager approval required for any deploy
Budget replenishes automatically as older errors age out of the rolling window.
Recovery: once budget returns to >25%, normal cadence resumes.
|
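A policy like this can also be encoded so that dashboards and deploy tooling show the same answer. The sketch below is one possible mapping, under two assumptions that the policy text leaves open: a value of exactly 25% or 10% falls into the less restrictive tier, and the recovery rule is ignored, so the function is stateless. `policy_action` is a hypothetical name.

```python
def policy_action(budget_remaining_pct: float) -> str:
    """Map remaining budget (in percent) to the policy tier described above."""
    if budget_remaining_pct > 50:
        return "normal feature development cadence"
    if budget_remaining_pct >= 25:  # assumption: exactly 25% counts as this tier
        return "reduce risky deployments, increase testing"
    if budget_remaining_pct >= 10:  # assumption: exactly 10% counts as this tier
        return "freeze new features, focus on reliability"
    return "incident response mode"


for pct in (80.0, 40.0, 15.0, 5.0, -20.0):
    print(f"{pct:>6.1f}% -> {policy_action(pct)}")
```

Negative values, meaning an overspent budget, fall through to incident response mode, which matches the intent of the strictest tier.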
Conclusion
Error budgets transform vague reliability goals into operational decisions. Calculate burn rate continuously, and alert on multi-window burn rates to catch both fast and slow budget consumption. The error budget policy is the most valuable artifact: it settles the engineering-versus-product debate about when to ship features and when to fix reliability. Make it explicit and get everyone to agree before an incident, not during one.