OpenTelemetry Metrics and Logs: The Complete Observability Stack

Introduction#

OpenTelemetry (OTel) is the CNCF standard for observability instrumentation, covering all three pillars: traces, metrics, and logs. Using OTel means your instrumentation is vendor-neutral — you write it once and route to any backend (Prometheus, Datadog, Grafana, Honeycomb). This post covers the metrics and logs pillars to complement distributed tracing.

OTel Metrics: Instruments#

from opentelemetry import metrics
from opentelemetry.metrics import Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

def setup_metrics(service_name: str, otlp_endpoint: str = "http://otel-collector:4317"):
    exporter = OTLPMetricExporter(endpoint=otlp_endpoint)
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=30_000)
    provider = MeterProvider(
        metric_readers=[reader],
        resource=Resource.create({SERVICE_NAME: service_name}),
    )
    metrics.set_meter_provider(provider)
    return metrics.get_meter(service_name)

meter = setup_metrics("order-service")

# Counter: monotonically increasing (requests, errors, events)
request_counter = meter.create_counter(
    "http.requests.total",
    description="Total HTTP requests",
    unit="1",
)

# Histogram: distribution of values (latency, payload size)
request_duration = meter.create_histogram(
    "http.request.duration",
    description="HTTP request duration",
    unit="ms",
)

# Gauge: current value, read via callback (queue depth, active connections)
active_connections = meter.create_observable_gauge(
    "db.connections.active",
    # get_active_connection_count() is your own pool-inspection helper
    callbacks=[lambda options: [Observation(get_active_connection_count())]],
    description="Active database connections",
)

# UpDownCounter: can increase or decrease (queue size)
queue_depth = meter.create_up_down_counter(
    "queue.depth",
    description="Messages in processing queue",
)
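
Under the hood, the SDK aggregates histogram measurements into explicit buckets before export. Here is a minimal pure-Python sketch of that bucketing, assuming the spec's default boundary set (no SDK required; `bucket_counts` is an illustrative helper, not an OTel API):

```python
from bisect import bisect_left

# Default explicit-bucket boundaries per the OTel metrics SDK spec (assumed here)
DEFAULT_BOUNDARIES = [0, 5, 10, 25, 50, 75, 100, 250, 500, 750,
                      1000, 2500, 5000, 7500, 10000]

def bucket_counts(values, boundaries=DEFAULT_BOUNDARIES):
    """Sort measurements into buckets the way explicit-bucket aggregation
    does: bucket i holds values where boundaries[i-1] < v <= boundaries[i];
    the final extra bucket is the overflow above the last boundary."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts
```

The exporter ships these counts (plus sum and count) rather than raw samples, which is why histogram metrics stay cheap even at high request rates.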

Instrumenting FastAPI with Metrics#

import time
from fastapi import FastAPI, Request
from opentelemetry import metrics

app = FastAPI()
meter = metrics.get_meter(__name__)

http_requests_total = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests by method, path, status",
)

http_request_duration_ms = meter.create_histogram(
    "http_request_duration_ms",
    description="HTTP request duration in milliseconds",
    unit="ms",
)

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()

    response = await call_next(request)

    duration_ms = (time.time() - start) * 1000
    # Label with the matched route template, not the raw path, to keep
    # cardinality bounded (/orders/{order_id} instead of /orders/12345)
    route = request.scope.get("route")
    labels = {
        "method": request.method,
        "route": route.path if route else request.url.path,
        "status": str(response.status_code),
    }

    http_requests_total.add(1, labels)
    http_request_duration_ms.record(duration_ms, labels)

    return response

# Business metrics
orders_created = meter.create_counter("orders.created.total")
order_value = meter.create_histogram("orders.value.usd", unit="USD")
payment_failures = meter.create_counter("payments.failed.total")

@app.post("/orders")
async def create_order(order: OrderRequest):
    try:
        result = await process_order(order)
        orders_created.add(1, {"source": order.source})
        order_value.record(float(order.total), {"currency": order.currency})
        return result
    except PaymentDeclined as e:
        payment_failures.add(1, {"reason": e.reason})
        raise
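
On the query side, the backend reconstructs percentiles from these histogram buckets. The following is a rough sketch of the linear interpolation that PromQL's `histogram_quantile()` applies to cumulative bucket counts (synthetic numbers and a hypothetical helper, not the actual Prometheus code):

```python
def quantile_from_buckets(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative histogram buckets by
    linear interpolation inside the bucket where the rank falls --
    roughly what PromQL's histogram_quantile() does."""
    total = cumulative_counts[-1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(upper_bounds, cumulative_counts):
        if rank <= count:
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume samples are spread evenly across the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return upper_bounds[-1]
```

Because the estimate interpolates within a bucket, percentile accuracy depends on how closely your bucket boundaries track the latencies you actually see.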

OTel Logs: Bridging to Existing Logging#

import logging
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter

def setup_logging(service_name: str, otlp_endpoint: str = "http://otel-collector:4317"):
    """Bridge Python logging to OpenTelemetry."""
    exporter = OTLPLogExporter(endpoint=otlp_endpoint)
    provider = LoggerProvider(
        resource=Resource.create({SERVICE_NAME: service_name})
    )
    provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
    set_logger_provider(provider)

    # Attach OTel handler to root logger
    handler = LoggingHandler(level=logging.INFO, logger_provider=provider)
    logging.getLogger().addHandler(handler)

# Now standard logging automatically includes trace context
logger = logging.getLogger(__name__)

@app.post("/orders")
async def create_order(order_data: dict):
    # This log will automatically include trace_id and span_id from the active span
    logger.info("Creating order", extra={
        "order_id": order_data["id"],
        "user_id": order_data["user_id"],
        "total": order_data["total"],
    })
    # In Grafana/Loki, you can click a log line and jump directly to the trace
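
If you also want the trace ids visible in plain stdout logs (not only in the OTLP export), a standard `logging.Filter` can stamp them in the same W3C hex form the bridge uses. A standalone sketch with the id lookup injected so it runs without the SDK; in a real app you would read the ids from `trace.get_current_span().get_span_context()`:

```python
import logging

class TraceContextFilter(logging.Filter):
    """Stamp each record with trace_id/span_id in W3C hex form.
    get_ids is injected so this sketch runs standalone; swap in
    trace.get_current_span().get_span_context() in a real service."""
    def __init__(self, get_ids):
        super().__init__()
        self.get_ids = get_ids

    def filter(self, record):
        trace_id, span_id = self.get_ids()
        record.trace_id = f"{trace_id:032x}"  # 16-byte trace id -> 32 hex chars
        record.span_id = f"{span_id:016x}"    # 8-byte span id -> 16 hex chars
        return True
```

Attach it to your stdout handler and reference `%(trace_id)s` in the formatter, and every console line becomes greppable by trace.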

OTel Collector Configuration#

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

  resource:
    attributes:
    - key: environment
      value: production
      action: upsert

  # Drop health check spans to reduce noise
  filter:
    spans:
      exclude:
        match_type: strict
        span_names: ["GET /health", "GET /metrics"]

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"    # Prometheus scrapes this

  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, filter]
      exporters: [otlp/jaeger]

    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]

Correlating Traces, Metrics, and Logs in Grafana#

The three pillars work together in Grafana:

Trace → Logs:
  In Jaeger/Tempo, click a span → see all logs with that trace_id
  Requires logs to include trace_id field (OTel logging bridge does this)

Logs → Traces:
  In Loki, click a trace_id in a log line → jump to Jaeger
  Requires Grafana datasource linking

Metrics → Traces:
  In a Prometheus chart, click an anomalous data point
  → See traces from that time window with matching attributes

Exemplars: embed a trace_id in histogram samples
  Allows jumping from a slow histogram bucket → specific trace showing what was slow
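
For the Logs → Traces hop, Grafana needs a derived field on the Loki datasource that extracts the trace id from the log line and links it to your tracing datasource. A provisioning sketch along these lines (the datasource uid and the regex are assumptions; adjust them to your setup and log format):

```yaml
# grafana/provisioning/datasources/loki.yaml (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id[="]+(\w+)'
          url: "$${__value.raw}"       # $$ escapes env interpolation in provisioning files
          datasourceUid: tempo         # uid of your Tempo/Jaeger datasource
```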

Exemplars for Histogram Drill-Down#

from opentelemetry import trace

latency_histogram = meter.create_histogram(
    "http_request_duration_ms",
    description="Request latency with trace exemplars",
)

@app.middleware("http")
async def record_with_exemplar(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = (time.time() - start) * 1000

    # With exemplars enabled, the SDK reads the active span context itself
    # and attaches the trace_id to this sample -- no manual wiring needed,
    # as long as record() happens inside the request's span
    latency_histogram.record(
        duration,
        attributes={"route": request.url.path},
    )
    return response
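
On the Prometheus side, exemplars surface in the OpenMetrics text format as a `# {...}` suffix on a bucket sample. A tiny sketch of what such an exposition line looks like (`openmetrics_exemplar_line` is an illustrative helper, not a library API):

```python
def openmetrics_exemplar_line(metric, le, count, trace_id, value):
    """Render a histogram bucket sample with an exemplar in OpenMetrics
    text format. The '# {...}' suffix is what lets Grafana draw a dot
    on the chart that clicks through to the trace."""
    return (f'{metric}_bucket{{le="{le}"}} {count} '
            f'# {{trace_id="{trace_id}"}} {value}')
```

Note that Prometheus only stores exemplars when started with `--enable-feature=exemplar-storage`, and the scrape must use the OpenMetrics format.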

Conclusion#

OpenTelemetry unifies the three observability pillars under a single, vendor-neutral API. Traces, metrics, and logs share the same SDK, resource attributes, and context propagation model. The OTel Collector decouples instrumentation from backend choice — route to Prometheus today, Datadog tomorrow, without changing application code. The highest value comes from correlation: following a slow request from a Prometheus anomaly through a Tempo trace into Loki logs, all linked by the same trace ID.