SRE Error Budgets in Practice

Introduction

An error budget is the maximum allowable downtime or error rate derived from your SLO. It converts an abstract reliability goal into a concrete operational policy: if you have budget remaining, ship features; if the budget is exhausted, focus on reliability. Error budgets align engineering and product teams around a shared definition of acceptable risk.

From SLO to Error Budget

# SLO: 99.9% availability per month
# Error budget = 1 - SLO = 0.1% of requests may fail

SLO = 0.999
ERROR_BUDGET_FRACTION = 1 - SLO  # 0.001 = 0.1%

MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes
ALLOWED_DOWNTIME_MINUTES = MONTH_MINUTES * ERROR_BUDGET_FRACTION  # 43.2 minutes

# If handling 1000 req/s:
REQUESTS_PER_MONTH = 1000 * 60 * MONTH_MINUTES  # 2.59 billion
ALLOWED_ERRORS = REQUESTS_PER_MONTH * ERROR_BUDGET_FRACTION  # 2.59 million errors

print(f"Allowed downtime: {ALLOWED_DOWNTIME_MINUTES:.1f} minutes/month")
print(f"Allowed errors: {ALLOWED_ERRORS:,.0f} per month")
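
To see how quickly the budget shrinks as the target tightens, the same arithmetic can be applied to several SLO values. This is a minimal sketch: the helper name `allowed_downtime_minutes` and the list of targets are illustrative assumptions, and it keeps the 30-day month used above.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Return the downtime budget in minutes for an availability SLO.

    Hypothetical helper for illustration; assumes a fixed-length window.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo)


# Illustrative targets only; use the one that matches your service.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"SLO {target:.2%}: {minutes:.1f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten: roughly 432, 43.2, and 4.3 minutes for the three targets shown.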

Calculating Burn Rate

Burn rate measures how fast you are consuming your error budget, relative to the pace that would spend it exactly over the SLO window.

Burn rate = current error rate / error budget rate
Burn rate 1 = consuming budget at exactly the sustainable pace (exhausted at the end of a 30-day month)
Burn rate 2 = consuming budget 2x too fast (exhausted in 15 days)
Burn rate 14.4 = consuming budget 14.4x too fast (exhausted in 50 hours; 2% of the monthly budget is spent every hour)

def calculate_burn_rate(
    current_error_rate: float,   # fraction of requests failing right now
    slo: float,
) -> float:
    error_budget_rate = 1 - slo
    return current_error_rate / error_budget_rate

# Example: SLO = 99.9%, currently 1% of requests failing
burn_rate = calculate_burn_rate(0.01, 0.999)
print(f"Burn rate: {burn_rate:.1f}x")
# Output: Burn rate: 10.0x
# At this rate, budget exhausted in 3 days (30 day month / 10)
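
The link between a burn rate, the time until the budget runs out, and the share of budget spent in a given interval is simple arithmetic. The sketch below works it through for a 30-day window; the function names are hypothetical, and it assumes the burn rate stays constant.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent at a constant burn rate."""
    if burn_rate <= 0:
        return float("inf")  # no errors, so the budget is never exhausted
    return window_days * 24 / burn_rate


def budget_consumed_fraction(
    burn_rate: float, elapsed_hours: float, window_days: int = 30
) -> float:
    """Fraction of the window's budget spent after elapsed_hours at burn_rate."""
    return burn_rate * elapsed_hours / (window_days * 24)


for rate in (1, 2, 10, 14.4):
    print(
        f"burn rate {rate:>4}: exhausted in {hours_to_exhaustion(rate):6.1f} h, "
        f"{budget_consumed_fraction(rate, 1):.2%} of budget spent per hour"
    )
```

This is where the 14.4 threshold comes from: spending 2% of a 30-day budget in one hour corresponds to a burn rate of 0.02 * 720 = 14.4, which would exhaust the budget in about 50 hours.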

Multi-Window Burn Rate Alerts (Google’s Approach)

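Alert when the burn rate is high over more than one time window. The long window shows that a meaningful share of the budget has been spent; the short window confirms that the problem is still happening, so the alert clears soon after recovery. Together they balance precision and recall.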

# Prometheus alerting rules
groups:
- name: error-budget
  rules:
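  # Page immediately: fast burn (2% of the 30-day budget spent in 1 hour;
  # at 14.4x the whole budget lasts about 50 hours)
  # sum() aggregates across status codes and instances so the ratio is
  # errors / all requests rather than a per-series match
  - alert: ErrorBudgetBurnFast
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[1h])) /
        sum(rate(http_requests_total[1h]))
      ) / (1 - 0.999) > 14.4
      and
      (
        sum(rate(http_requests_total{status=~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
      ) / (1 - 0.999) > 14.4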
    for: 2m
    annotations:
      summary: "Error budget burning fast (>14.4x burn rate)"
      runbook: "https://runbook.example.com/error-budget-fast"

  # Page: moderately fast burn (5% of the 30-day budget spent in 6 hours;
  # at 6x the whole budget lasts 5 days)
  - alert: ErrorBudgetBurnMedium
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[6h])) /
        sum(rate(http_requests_total[6h]))
      ) / (1 - 0.999) > 6
      and
      (
        sum(rate(http_requests_total{status=~"5.."}[30m])) /
        sum(rate(http_requests_total[30m]))
      ) / (1 - 0.999) > 6
    for: 15m
    annotations:
      summary: "Error budget burning (>6x burn rate)"

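  # Ticket: slow burn (10% of the 30-day budget spent in 24 hours;
  # at 3x the whole budget lasts 10 days)
  # Short window is 1/12 of the long window, the ratio suggested in the SRE Workbook
  - alert: ErrorBudgetBurnSlow
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[24h])) /
        sum(rate(http_requests_total[24h]))
      ) / (1 - 0.999) > 3
      and
      (
        sum(rate(http_requests_total{status=~"5.."}[2h])) /
        sum(rate(http_requests_total[2h]))
      ) / (1 - 0.999) > 3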
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Error budget burn rate elevated (>3x)"
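
The two-window condition used in these rules can also be written as a small pure function, which is handy for reasoning about edge cases or unit-testing thresholds before touching Prometheus. This is a sketch under assumptions: `BurnRateRule` and `should_fire` are hypothetical names, and the error rates are passed in directly rather than queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """Hypothetical container for one alert tier."""
    name: str
    threshold: float  # burn-rate multiple that both windows must exceed


def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo)


def should_fire(
    rule: BurnRateRule,
    long_window_error_rate: float,
    short_window_error_rate: float,
    slo: float,
) -> bool:
    """Fire only if the long AND the short window both exceed the threshold."""
    return (
        burn_rate(long_window_error_rate, slo) > rule.threshold
        and burn_rate(short_window_error_rate, slo) > rule.threshold
    )


medium = BurnRateRule(name="ErrorBudgetBurnMedium", threshold=6)

# Ongoing incident: the 6h and 30m windows both show 1% errors (about 10x burn).
print(should_fire(medium, 0.01, 0.01, slo=0.999))    # True
# Recovered incident: the 6h window is still elevated, the 30m window is clean.
print(should_fire(medium, 0.01, 0.0005, slo=0.999))  # False
```

The second call shows why the short window matters: without it, the alert would keep firing for hours after the errors stopped.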

Error Budget Dashboard

# Python: calculate remaining error budget
import prometheus_api_client as prom

client = prom.PrometheusConnect(url="http://prometheus:9090")

def _query_scalar(query: str) -> float:
    # custom_query returns a list of samples; an empty list means no series
    # matched (for example, no 5xx responses in the window), so treat it as 0
    result = client.custom_query(query)
    return float(result[0]["value"][1]) if result else 0.0

def get_error_budget_status(service: str, slo: float, window_days: int = 30) -> dict:
    window = f"{window_days * 24}h"

    total = _query_scalar(
        f'sum(increase(http_requests_total{{service="{service}"}}[{window}]))'
    )
    errors = _query_scalar(
        f'sum(increase(http_requests_total{{service="{service}",status=~"5.."}}[{window}]))'
    )
    error_rate = errors / total if total > 0 else 0.0

    budget_total = total * (1 - slo)
    budget_remaining = budget_total - errors
    # With no traffic there is no budget to spend; report it as fully remaining
    budget_pct_remaining = (
        budget_remaining / budget_total * 100 if budget_total > 0 else 100.0
    )

    return {
        "error_rate": error_rate,
        "budget_consumed_pct": 100 - budget_pct_remaining,
        "budget_remaining_pct": budget_pct_remaining,
        "errors_allowed": budget_total,
        "errors_actual": errors,
    }

status = get_error_budget_status("api", slo=0.999)
print(f"Budget remaining: {status['budget_remaining_pct']:.1f}%")

Error Budget Policy

A documented policy prevents debates about prioritization:

Error Budget Policy — API Service

SLO: 99.9% availability (28-day rolling window)

Actions when budget is:
- >50% remaining: normal feature development cadence
- 25-50% remaining: reduce risky deployments, increase testing
- 10-25% remaining: freeze new features, focus on reliability
- <10% remaining: incident response mode
  - Halt all non-critical deployments
  - On-call team focuses exclusively on reliability
  - Post-mortem required for each contributing incident
  - Engineering manager approval required for any deploy

Budget replenishes automatically as old errors age out of the rolling window.
Recovery: once budget returns to >25%, the feature freeze lifts; normal cadence resumes above 50%.
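
The tiers listed above are easy to encode so that a dashboard or deploy pipeline can report the current policy state. The following is a minimal sketch, not a prescribed implementation: `policy_action` is a hypothetical name, and assigning the exact boundary values (25% and 10%) to the less restrictive tier is an assumption, since the policy text does not say which side they fall on.

```python
def policy_action(budget_remaining_pct: float) -> str:
    """Map remaining error budget (percent) to the policy tier listed above.

    Assumption: a value exactly on a boundary (25 or 10) belongs to the
    less restrictive tier.
    """
    if budget_remaining_pct > 50:
        return "normal feature development cadence"
    if budget_remaining_pct >= 25:
        return "reduce risky deployments, increase testing"
    if budget_remaining_pct >= 10:
        return "freeze new features, focus on reliability"
    return "incident response mode"


for pct in (80.0, 40.0, 15.0, 5.0, -20.0):
    print(f"{pct:6.1f}% remaining -> {policy_action(pct)}")
```

A negative percentage means the budget is overspent, which falls through to incident response mode like any other value below 10%.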

Conclusion

Error budgets transform vague reliability goals into operational decisions. Calculate burn rate continuously, and alert on multi-window burn rates to catch both fast and slow budget consumption. The error budget policy is the most valuable artifact: it settles the engineering-versus-product debate about when to ship features and when to fix reliability. Make it explicit and get everyone to agree to it before an incident.