Introduction#
The golden signals are a compact, battle-tested set of metrics that describe user experience and system health. They are especially effective because they are outcome-focused and map cleanly to service-level objectives. Advanced teams use them as the first layer of telemetry, then pivot into detailed traces and logs when an alert fires.
The Four Golden Signals#
Each signal captures a different failure mode. Together they form a balanced view of availability and capacity.
- Latency: Distribution of request duration, including tail latency (p95/p99). Median-only metrics hide queuing and downstream degradation.
- Traffic: The demand on the service, usually requests per second or messages per second. Traffic provides the denominator for error-rate normalization.
- Errors: Both explicit failures (5xx, exceptions) and implicit failures (timeouts, missing data).
- Saturation: How close the service is to its limits, such as CPU, memory, thread pool depth, or queue backlog.
Mapping Signals to Real Metrics#
Use measurable metrics that are already part of your production pipeline. Avoid alerting on heavily derived or pre-aggregated views, which obscure what actually changed.
| Signal | Example Metrics | Why It Matters |
|---|---|---|
| Latency | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | Captures user-visible delays and queueing. |
| Traffic | http_requests_total | Provides the demand baseline and the denominator for error-rate normalization. |
| Errors | http_request_errors_total, grpc_server_handled_total{code!="OK"} | Tracks correctness and dependency health. |
| Saturation | process_cpu_seconds_total, work_queue_depth | Indicates approaching bottlenecks. |
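To make error-rate normalization concrete, the sketch below divides the error counter by the traffic counter with a PromQL query over Prometheus's HTTP query API. The server address and the 5-minute window are illustrative assumptions; the metric names match the table above.

```python
import requests

# Illustrative: error ratio = errors / traffic over a 5-minute window.
PROM_URL = "http://localhost:9090/api/v1/query"  # assumed Prometheus address
ERROR_RATIO_QUERY = (
    "sum(rate(http_request_errors_total[5m]))"
    " / sum(rate(http_requests_total[5m]))"
)

def current_error_ratio() -> float:
    """Fetch the instant error ratio from Prometheus's HTTP query API."""
    resp = requests.get(PROM_URL, params={"query": ERROR_RATIO_QUERY}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result means no traffic in the window; treat as zero errors.
    return float(result[0]["value"][1]) if result else 0.0
```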
Instrumenting a Python Service#
The Python example below defines a latency histogram, an error counter, and saturation-ready gauges using prometheus_client. The key is to label by route and status, not by user identifiers.
```python
from prometheus_client import Histogram, Counter, Gauge

# Latency: a histogram labeled by route and status. Tail percentiles
# (p95/p99) come from histogram_quantile() over the buckets at query time.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Latency by route and status",
    ["route", "status"],
)

# Errors: explicit failures, counted per route and status code.
REQUEST_ERRORS = Counter(
    "http_request_errors_total",
    "Error responses",
    ["route", "status"],
)

# Saturation: concurrency and backlog gauges.
IN_FLIGHT = Gauge("http_requests_in_flight", "Concurrent requests")
QUEUE_DEPTH = Gauge("work_queue_depth", "Work queue backlog")

def record_request(route: str, status: int, duration: float) -> None:
    REQUEST_LATENCY.labels(route=route, status=str(status)).observe(duration)
    if status >= 500:
        REQUEST_ERRORS.labels(route=route, status=str(status)).inc()
```
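A minimal usage sketch tying the pieces together; handle_with_metrics and the zero-argument handler callable are hypothetical stand-ins for your framework's request path.

```python
import time

def handle_with_metrics(route: str, handler) -> int:
    """Run a request handler while recording latency, errors, and in-flight count.
    `handler` is a hypothetical zero-argument callable returning an HTTP status."""
    start = time.perf_counter()
    status = 500  # pessimistic default if the handler raises
    with IN_FLIGHT.track_inprogress():  # Gauge context manager from prometheus_client
        try:
            status = handler()
        finally:
            record_request(route, status, time.perf_counter() - start)
    return status

# Example: a handler that succeeds immediately.
handle_with_metrics("/checkout", lambda: 200)
```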
For saturation, pair QUEUE_DEPTH with infrastructure-level metrics (CPU, memory, and thread pool utilization) to detect capacity exhaustion before errors spike.
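One way to keep work_queue_depth current is a small exporter loop; here is a sketch assuming the backlog lives in a standard-library queue.Queue.

```python
import queue
import threading
import time

work_queue: queue.Queue = queue.Queue()  # assumed application backlog

def export_queue_depth(interval_seconds: float = 5.0) -> None:
    # qsize() is approximate under concurrency, which is fine for a
    # saturation signal sampled every few seconds.
    while True:
        QUEUE_DEPTH.set(work_queue.qsize())
        time.sleep(interval_seconds)

threading.Thread(target=export_queue_depth, daemon=True).start()
```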
Alerting with Error Budgets#
Instead of alerting on raw error rate, align alerts to your SLO. For a 99.9% availability target, the 30-day error budget is 0.1% of requests, roughly 43 minutes of full downtime. Multi-window burn-rate alerts detect both fast and slow budget consumption while avoiding noise.
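A minimal sketch of the multi-window check, using the burn-rate thresholds popularized by the Google SRE Workbook (14.4x over 1h/5m for paging, 6x over 6h/30m for ticketing); the error-ratio inputs are assumed to come from queries like the one shown earlier.

```python
SLO = 0.999                 # availability target
ERROR_BUDGET = 1 - SLO      # 0.1% of requests over the 30-day window

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' the budget is burning."""
    return error_ratio / ERROR_BUDGET

def should_page(ratio_1h: float, ratio_5m: float) -> bool:
    # Fast burn: 2% of the 30-day budget consumed within one hour.
    return burn_rate(ratio_1h) > 14.4 and burn_rate(ratio_5m) > 14.4

def should_ticket(ratio_6h: float, ratio_30m: float) -> bool:
    # Slow burn: 5% of the budget consumed within six hours.
    return burn_rate(ratio_6h) > 6 and burn_rate(ratio_30m) > 6
```

Requiring both the long and short window to exceed the threshold keeps alerts responsive without letting a long window keep firing after the incident has already recovered.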
Common Pitfalls#
- Alerting on average latency instead of percentiles.
- Using high-cardinality labels such as user IDs or request IDs.
- Ignoring saturation metrics until error rates explode.
- Alerting on traffic spikes without correlating latency and errors.
Conclusion#
Golden signals are small but powerful. When they are instrumented with careful labels and aligned to SLOs, they provide fast detection, predictable alerting, and a consistent path to root-cause analysis.