SLI/SLO/SLA Practical Implementation
Introduction
Service-level indicators (SLIs), objectives (SLOs), and agreements (SLAs) are only useful when they are operationalized. Advanced teams treat them as production-grade artifacts: versioned, monitored, and tied to rollout decisions. This post focuses on the practical pipeline from raw telemetry to enforceable objectives.
Clarifying the Contract
SLIs are metrics, SLOs are targets, and SLAs are external commitments. The implementation order matters: define the SLI first, set the SLO against it, and only then negotiate the SLA. A minimal data model is sketched after the list.
- SLI: A measurable signal such as availability or latency.
- SLO: A target for the SLI over a time window, such as 99.9% availability over a rolling 30-day window.
- SLA: A contract with penalties, often looser than internal SLOs.
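One way to make the distinction operational is to treat the SLO itself as a versioned artifact, as the intro suggests. The sketch below is illustrative only; the record name and fields are assumptions, not a prescribed schema.

// Illustrative only: an SLO as data that can be versioned and reviewed like code.
import java.time.Duration;

public record SloDefinition(String sliName, double target, Duration window, int revision) {

    // Example: 99.9% availability over a rolling 30-day window, revision 2 of the objective.
    public static SloDefinition httpAvailability() {
        return new SloDefinition("http_availability", 0.999, Duration.ofDays(30), 2);
    }
}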
Designing a Robust SLI
A strong SLI is user-centric and derived from production data, not synthetic tests. For HTTP services, availability is usually defined as good_requests / total_requests. For latency, define a success threshold and count the requests that complete under it, rather than tracking mean latency.
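A minimal latency-SLI sketch under that definition might look like the following. The 300 ms threshold is an illustrative assumption; pick the real value from user-experience data.

// Latency SLI sketch: count requests that finish under a threshold instead of
// averaging latencies. The 300 ms threshold is an assumption for illustration.
public class LatencySliAggregator {
    private static final long THRESHOLD_MILLIS = 300;

    private long good;
    private long total;

    public synchronized void record(long latencyMillis) {
        total++;
        if (latencyMillis <= THRESHOLD_MILLIS) {
            good++;
        }
    }

    public synchronized double ratio() {
        return total == 0 ? 1.0 : (double) good / total;
    }
}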
Implementation Pipeline
A reliable pipeline has four stages.
- Collect raw events from logs or metrics.
- Classify events as good, bad, or excluded (see the classification sketch after this list).
- Aggregate within time windows.
- Persist the SLI for alerting and error budget calculations.
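A classification sketch for the second stage could look like this. The health-check prefix used for exclusion is an assumption; real services will have their own exclusion rules.

// Classification sketch: every raw event becomes GOOD, BAD, or EXCLUDED before aggregation.
public enum EventClass { GOOD, BAD, EXCLUDED }

class EventClassifier {
    EventClass classify(String path, int statusCode) {
        // Exclude synthetic probes so they neither help nor hurt the SLI (illustrative rule).
        if (path.startsWith("/health")) {
            return EventClass.EXCLUDED;
        }
        // Server errors are bad; client errors still count as successfully served.
        return statusCode < 500 ? EventClass.GOOD : EventClass.BAD;
    }
}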
Java Example: Availability SLI
The example below shows the aggregation core for a Spring Boot service that counts good and total requests and snapshots them each minute. The aggregation is explicit so the calculation can be validated and versioned.
// SliSnapshot.java
/** Immutable result of one aggregation window. */
public record SliSnapshot(long good, long total) {
    public double availability() {
        // Treat an empty window as fully available to avoid false alerts.
        return total == 0 ? 1.0 : (double) good / total;
    }
}

// SliAggregator.java
import java.util.concurrent.atomic.AtomicLong;

/** Thread-safe good/total counters for the current aggregation window. */
public class SliAggregator {
    private final AtomicLong good = new AtomicLong();
    private final AtomicLong total = new AtomicLong();

    public void record(int statusCode) {
        total.incrementAndGet();
        // 5xx responses are bad; 4xx client errors still count as served.
        if (statusCode < 500) {
            good.incrementAndGet();
        }
    }

    /** Snapshot and reset both counters; each counter is swapped atomically. */
    public SliSnapshot snapshotAndReset() {
        return new SliSnapshot(good.getAndSet(0), total.getAndSet(0));
    }
}
Persist the computed availability alongside deployment metadata so you can correlate SLO violations with release events.
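One way to wire the minute-level snapshot and persistence together is a scheduled flush like the sketch below. It assumes Spring scheduling is enabled via @EnableScheduling, and the sli_window table and DEPLOY_VERSION environment variable are illustrative names, not a prescribed schema.

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Sketch of the minute-level flush; table and column names are assumptions.
@Component
public class SliFlusher {
    private final SliAggregator aggregator;
    private final JdbcTemplate jdbc;

    public SliFlusher(SliAggregator aggregator, JdbcTemplate jdbc) {
        this.aggregator = aggregator;
        this.jdbc = jdbc;
    }

    @Scheduled(fixedRate = 60_000) // flush once per minute
    public void flush() {
        SliSnapshot snapshot = aggregator.snapshotAndReset();
        // Persist the window together with the running deployment version so
        // SLO violations can be correlated with releases.
        jdbc.update(
            "INSERT INTO sli_window (good, total, availability, deploy_version) VALUES (?, ?, ?, ?)",
            snapshot.good(), snapshot.total(), snapshot.availability(),
            System.getenv("DEPLOY_VERSION"));
    }
}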
Error Budgets and Burn Rates
Error budgets turn SLOs into actionable capacity. For a 99.9% SLO over 30 days, the monthly budget is 43.2 minutes of downtime. A fast-burn alert can trigger when the current 1-hour window is consuming the budget 14x faster than allowed, while a slow-burn alert can catch long-running regressions.
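A minimal burn-rate calculation follows from those definitions: burn rate is the observed error rate divided by the error rate the SLO allows. The request counts in main are made up for illustration.

// Burn-rate sketch: burn rate = observed error rate / allowed error rate.
// At 14x burn over one hour, roughly 2% of a 30-day budget is spent in that hour.
public final class BurnRate {
    public static double burnRate(long bad, long total, double slo) {
        double allowed = 1.0 - slo;                        // e.g. 0.001 for a 99.9% SLO
        double observed = total == 0 ? 0.0 : (double) bad / total;
        return observed / allowed;
    }

    public static void main(String[] args) {
        // Hypothetical 1-hour windows with 60,000 requests each.
        System.out.println(burnRate(120, 60_000, 0.999)); // 0.2% errors -> ~2x burn
        System.out.println(burnRate(840, 60_000, 0.999)); // 1.4% errors -> ~14x burn (fast-burn alert)
    }
}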
Operationalizing SLAs
SLAs should be derived from SLO data, not the other way around. Export weekly or monthly SLO reports, and apply exclusions (planned maintenance, known customer outages) under an explicit, documented policy.
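As a sketch of the contractual calculation, exclusions are removed from both the numerator and the denominator before the reported number is computed. How exclusions are sourced and approved is policy; the method and figures below are illustrative assumptions.

// SLA report sketch: excluded traffic (planned maintenance, known customer
// outages) is removed from both good and total counts before reporting.
public final class SlaReport {
    public static double contractualAvailability(long good, long total,
                                                 long excludedGood, long excludedTotal) {
        long reportableGood = good - excludedGood;
        long reportableTotal = total - excludedTotal;
        return reportableTotal <= 0 ? 1.0 : (double) reportableGood / reportableTotal;
    }

    public static void main(String[] args) {
        // Hypothetical month: 1M requests, 2k failures; a 10k-request maintenance
        // window (9.5k good) is excluded from the contractual figure.
        System.out.println(contractualAvailability(998_000, 1_000_000, 9_500, 10_000));
    }
}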
Conclusion
SLI/SLO/SLA implementation is a data pipeline, not a slide deck. By making calculations explicit and automating error budget enforcement, reliability goals become measurable and enforceable in production.