Introduction
Service-level indicators (SLIs), objectives (SLOs), and agreements (SLAs) are only useful when they are operationalized. Advanced teams treat them as production-grade artifacts: versioned, monitored, and tied to rollout decisions. This post focuses on the practical pipeline from raw telemetry to enforceable objectives.
Clarifying the Contract
SLIs are metrics, SLOs are targets, and SLAs are external commitments. Implement them in that order: you cannot set a meaningful target without a trustworthy measurement, and you should not sign a contract around a target you do not already meet.
- SLI: A measurable signal such as availability or latency.
- SLO: A target for the SLI over a time window, like 99.9% availability per 30 days.
- SLA: A contract with penalties, often looser than internal SLOs.
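The layering above can be made concrete with a small value type. Everything here is illustrative, not a standard API:

```java
// Illustrative only: an SLO as a target for a named SLI over a time window.
record ServiceLevelObjective(String sliName, double target, int windowDays) {
    // The error budget is simply whatever the target leaves on the table.
    double errorBudgetFraction() {
        return 1.0 - target;
    }
}
```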
Designing a Robust SLI
A strong SLI is user-centric and derived from production data, not synthetic tests. For HTTP services, availability is usually defined as good_requests / total_requests. For latency, define a success threshold and count the fraction of requests under it, rather than tracking mean latency, which hides tail behavior.
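A threshold-based latency SLI of this shape can be sketched as a pair of counters (class and method names here are hypothetical):

```java
import java.util.concurrent.atomic.AtomicLong;

// Counts requests that complete under a latency threshold; a hypothetical sketch.
class LatencySli {
    private final long thresholdMillis;
    private final AtomicLong good = new AtomicLong();
    private final AtomicLong total = new AtomicLong();

    LatencySli(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    void record(long durationMillis) {
        total.incrementAndGet();
        if (durationMillis <= thresholdMillis) {
            good.incrementAndGet();
        }
    }

    // Fraction of requests under the threshold; an empty window counts as fully good.
    double ratio() {
        long t = total.get();
        return t == 0 ? 1.0 : (double) good.get() / t;
    }
}
```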
Implementation Pipeline
A reliable pipeline has four stages.
- Collect raw events from logs or metrics.
- Classify events as good, bad, or excluded.
- Aggregate within time windows.
- Persist the SLI for alerting and error budget calculations.
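The classification stage is the one most worth keeping as a pure, testable function, since exclusion rules tend to accumulate policy. A sketch, with illustrative names and an assumed exclusion rule for health checks:

```java
// Hypothetical classifier for stage two of the pipeline.
enum EventClass { GOOD, BAD, EXCLUDED }

final class EventClassifier {
    // Health-check traffic is excluded from the SLI entirely;
    // only 5xx responses count against availability.
    static EventClass classify(int statusCode, String path) {
        if (path.startsWith("/health")) return EventClass.EXCLUDED;
        if (statusCode >= 500) return EventClass.BAD;
        return EventClass.GOOD;
    }
}
```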
Java Example: Availability SLI
The example below shows the core of an availability SLI for a Spring Boot service: an aggregator counts good and total requests, and a snapshot method rolls them up once per minute (for example from a @Scheduled task). The aggregation is explicit so the calculation can be validated and versioned.
import java.util.concurrent.atomic.AtomicLong;

public record SliSnapshot(long good, long total) {
    public double availability() {
        // Treat an empty window as fully available rather than dividing by zero.
        return total == 0 ? 1.0 : (double) good / total;
    }
}

public class SliAggregator {
    private final AtomicLong good = new AtomicLong();
    private final AtomicLong total = new AtomicLong();

    public void record(int statusCode) {
        total.incrementAndGet();
        // Only 5xx responses count against availability; 4xx is a client error.
        if (statusCode < 500) {
            good.incrementAndGet();
        }
    }

    // Called once per aggregation window, e.g. every minute by a scheduler.
    public SliSnapshot snapshotAndReset() {
        return new SliSnapshot(good.getAndSet(0), total.getAndSet(0));
    }
}
Persist the computed availability alongside deployment metadata so you can correlate SLO violations with release events.
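One minimal shape for that persisted row, assuming the release version is known at write time (the record and factory names are illustrative):

```java
import java.time.Instant;

// Hypothetical persisted row: the computed SLI plus the release it was measured
// under, so SLO violations can later be joined against deployment events.
record SliRecord(Instant windowStart, double availability, String releaseVersion) {

    static SliRecord of(Instant windowStart, long good, long total, String releaseVersion) {
        double availability = total == 0 ? 1.0 : (double) good / total;
        return new SliRecord(windowStart, availability, releaseVersion);
    }
}
```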
Error Budgets and Burn Rates
Error budgets turn SLOs into actionable capacity. For a 99.9% SLO over 30 days, the monthly budget is 43.2 minutes of downtime. A fast-burn alert can trigger when the current 1-hour window is consuming the budget 14x faster than allowed, while a slow-burn alert can catch long-running regressions.
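The arithmetic is simple enough to encode and unit-test directly. A sketch; the 14x factor follows the common multiwindow fast-burn alerting practice:

```java
// Error budget arithmetic for a target such as 99.9% over 30 days.
final class ErrorBudget {
    // Total downtime budget in minutes for the window:
    // (1 - 0.999) * 30 days * 24 h * 60 min = 43.2 minutes.
    static double budgetMinutes(double slo, int windowDays) {
        return (1.0 - slo) * windowDays * 24 * 60;
    }

    // Burn rate: observed error ratio divided by the budgeted error ratio.
    // A rate of 14 sustained over 1 hour is a common fast-burn alert threshold.
    static double burnRate(double observedErrorRatio, double slo) {
        return observedErrorRatio / (1.0 - slo);
    }
}
```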
Operationalizing SLAs
SLAs should be derived from SLO data, not the other way around. Export weekly or monthly SLO reports, and include exclusions (planned maintenance, known customer outages) with explicit policy.
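A report entry that makes exclusions explicit might look like the sketch below, where excluded events are removed from the denominator rather than counted as bad. That is an assumed policy, not the only reasonable one:

```java
// Hypothetical report entry: availability with excluded windows subtracted.
record SloReport(long good, long bad, long excluded) {
    // Excluded events (planned maintenance, known customer outages) are
    // removed from the denominator instead of counting against the SLO.
    double availability() {
        long counted = good + bad;
        return counted == 0 ? 1.0 : (double) good / counted;
    }
}
```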
Conclusion
SLI/SLO/SLA implementation is a data pipeline, not a slide deck. By making calculations explicit and automating error budget enforcement, reliability goals become measurable and enforceable in production.