Golden Signals Explained (With Real Metrics)

Introduction#

The golden signals are a compact, battle-tested set of metrics that describe user experience and system health. They are especially effective because they are outcome-focused and map cleanly to service-level objectives. Advanced teams use them as the first layer of telemetry, then pivot into detailed traces and logs when an alert fires.

The Four Golden Signals#

Each signal captures a different failure mode. Together they form a balanced view of availability and capacity.

  • Latency: Distribution of request duration, including tail latency (p95/p99). Median-only metrics hide queuing and downstream degradation, as the sketch after this list demonstrates.
  • Traffic: The demand on the service, usually requests per second or messages per second. This is needed for error-rate normalization.
  • Errors: Both explicit failures (5xx, exceptions) and implicit failures (timeouts, missing data).
  • Saturation: How close the service is to its limits, such as CPU, memory, thread pool depth, or queue backlog.
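To see why the median misleads, consider a purely illustrative Python sketch with made-up latencies: most requests are fast, but a small fraction is stuck behind a queue.

import random
import statistics

random.seed(0)
# 985 fast requests plus 15 stragglers stuck behind a queue.
latencies = [random.uniform(0.01, 0.05) for _ in range(985)]
latencies += [random.uniform(1.0, 3.0) for _ in range(15)]

pcts = statistics.quantiles(latencies, n=100)
print(f"p50={pcts[49]:.3f}s  p95={pcts[94]:.3f}s  p99={pcts[98]:.3f}s")
# The median stays around 0.03s while p99 exposes the second-plus tail.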

Mapping Signals to Real Metrics#

Use measurable metrics that are already part of your production pipeline. Avoid overly derived or aggregated views for alerting.

| Signal | Example Metrics | Why It Matters |
|---|---|---|
| Latency | http_request_duration_seconds{quantile="0.99"} | Captures user-visible delays and queueing. |
| Traffic | http_requests_total | Provides the demand baseline and error normalization. |
| Errors | http_request_errors_total, grpc_server_handled_total{code!="OK"} | Tracks correctness and dependency health. |
| Saturation | process_cpu_seconds_total, work_queue_depth | Indicates approaching bottlenecks. |
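For instance, traffic is the denominator that turns an error count into an error ratio. A minimal sketch of that normalization against the Prometheus HTTP API, assuming a server at localhost:9090, the requests library, and the metric names from the table above:

import requests

# PromQL: errors per second divided by total requests per second,
# i.e., the error ratio over the last five minutes.
QUERY = (
    "sum(rate(http_request_errors_total[5m]))"
    " / sum(rate(http_requests_total[5m]))"
)

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # assumed local Prometheus
    params={"query": QUERY},
    timeout=5,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    print(f"5m error ratio: {float(result[0]['value'][1]):.4%}")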

Instrumenting a Python Service#

The Python example below emits latency, errors, and saturation-ready gauges using prometheus_client. The key is to label by route and status, not by user identifiers.

from prometheus_client import Histogram, Counter, Gauge

# Latency histogram labeled by route template and status code. Keep
# label cardinality low: route templates and status codes, never user
# or request identifiers.
REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "Latency by route and status",
    ["route", "status"]
)
# Explicit-failure counter; incremented only for 5xx responses below.
REQUEST_ERRORS = Counter(
    "http_request_errors_total",
    "Error responses",
    ["route", "status"]
)
# Saturation-oriented gauges: concurrency and queue backlog.
IN_FLIGHT = Gauge("http_requests_in_flight", "Concurrent requests")
QUEUE_DEPTH = Gauge("work_queue_depth", "Work queue backlog")


def record_request(route: str, status: int, duration: float) -> None:
    # observe() buckets each duration so quantiles can be derived later.
    REQUEST_LATENCY.labels(route=route, status=str(status)).observe(duration)
    # Count only server-side failures as errors.
    if status >= 500:
        REQUEST_ERRORS.labels(route=route, status=str(status)).inc()

For saturation, pair QUEUE_DEPTH with infrastructure-level metrics (CPU, memory, and thread pool utilization) to detect capacity exhaustion before errors spike.
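Continuing the example above, one way to wire up the two gauges. WORK and handle_request are hypothetical names; set_function and track_inprogress are prometheus_client Gauge helpers.

from queue import Queue

# Hypothetical application work queue.
WORK: Queue = Queue()

# set_function evaluates the callback at scrape time, so the gauge
# reflects the current backlog without per-operation bookkeeping.
QUEUE_DEPTH.set_function(WORK.qsize)


@IN_FLIGHT.track_inprogress()  # gauge goes up on entry, down on exit
def handle_request(payload: dict) -> None:
    ...  # application work; rising concurrency shows up as saturation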

Alerting with Error Budgets#

Instead of alerting on raw error rate, align alerts to your SLO. For a 99.9% availability target, the 30-day error budget is 0.1% of requests, roughly 43 minutes of total downtime. Multi-window burn-rate alerts detect both fast and slow budget consumption while avoiding noise.
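A minimal sketch of the multi-window pattern, using the window and threshold pairs popularized by the Google SRE Workbook; the error ratios are assumed to come from your metrics backend, and the thresholds are a starting point, not a prescription.

# Error budget for a 99.9% SLO: 0.1% of requests may fail.
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # 0.001


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' we are burning."""
    return error_ratio / ERROR_BUDGET


def should_page(error_ratio_1h: float, error_ratio_5m: float) -> bool:
    # Fast burn: 14.4x consumes ~2% of a 30-day budget in one hour.
    # Requiring the short window too keeps the page from firing long
    # after the incident has already recovered.
    return (burn_rate(error_ratio_1h) >= 14.4
            and burn_rate(error_ratio_5m) >= 14.4)


def should_ticket(error_ratio_6h: float, error_ratio_30m: float) -> bool:
    # Slow burn: 6x consumes ~5% of the budget in six hours.
    return (burn_rate(error_ratio_6h) >= 6.0
            and burn_rate(error_ratio_30m) >= 6.0)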

Common Pitfalls#

  • Alerting on average latency instead of percentiles.
  • Using high-cardinality labels such as user IDs or request IDs; the normalization sketch after this list shows one mitigation.
  • Ignoring saturation metrics until error rates explode.
  • Alerting on traffic spikes without correlating latency and errors.
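For the cardinality pitfall, a hypothetical route normalizer: it collapses identifiers into templates so the route label takes a handful of values rather than one per user or request. The function name and patterns are illustrative, not from any particular framework.

import re

# Match whole path segments only, so partial IDs are left alone.
UUID_SEGMENT = re.compile(
    r"/[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}(?=/|$)"
)
NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def normalize_route(path: str) -> str:
    # Replace identifier segments with template tokens.
    path = UUID_SEGMENT.sub("/{uuid}", path)
    return NUMERIC_SEGMENT.sub("/{id}", path)


print(normalize_route("/users/123/orders/456"))
# -> /users/{id}/orders/{id}
print(normalize_route("/sessions/3f2504e0-4f89-11d3-9a0c-0305e82c3301"))
# -> /sessions/{uuid}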

Conclusion#

Golden signals are small but powerful. When they are instrumented with careful labels and aligned to SLOs, they provide fast detection, predictable alerting, and a consistent path to root-cause analysis.
