Introduction
OpenTelemetry (OTel) is the CNCF standard for observability instrumentation, covering all three pillars: traces, metrics, and logs. Using OTel means your instrumentation is vendor-neutral — you write it once and route to any backend (Prometheus, Datadog, Grafana, Honeycomb). This post covers the metrics and logs pillars to complement distributed tracing.
OTel Metrics: Instruments
```python
from opentelemetry import metrics
from opentelemetry.metrics import Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter


def setup_metrics(service_name: str, otlp_endpoint: str = "http://otel-collector:4317"):
    exporter = OTLPMetricExporter(endpoint=otlp_endpoint)
    reader = PeriodicExportingMetricReader(exporter, export_interval_millis=30_000)
    provider = MeterProvider(
        metric_readers=[reader],
        resource=Resource.create({SERVICE_NAME: service_name}),
    )
    metrics.set_meter_provider(provider)
    return metrics.get_meter(service_name)


meter = setup_metrics("order-service")

# Counter: monotonically increasing (requests, errors, events)
request_counter = meter.create_counter(
    "http.requests.total",
    description="Total HTTP requests",
    unit="1",
)

# Histogram: distribution of values (latency, payload size)
request_duration = meter.create_histogram(
    "http.request.duration",
    description="HTTP request duration",
    unit="ms",
)

# Gauge: current value, observed via callback (queue depth, active connections).
# get_active_connection_count() is your application's own helper.
active_connections = meter.create_observable_gauge(
    "db.connections.active",
    callbacks=[lambda options: [Observation(get_active_connection_count())]],
    description="Active database connections",
)

# UpDownCounter: can increase or decrease (queue size)
queue_depth = meter.create_up_down_counter(
    "queue.depth",
    description="Messages in processing queue",
)
```
Instrumenting FastAPI with Metrics
```python
import time

from fastapi import FastAPI, Request
from opentelemetry import metrics

app = FastAPI()
meter = metrics.get_meter(__name__)

http_requests_total = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests by method, path, status",
)
http_request_duration_ms = meter.create_histogram(
    "http_request_duration_ms",
    description="HTTP request duration in milliseconds",
    unit="ms",
)


@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration_ms = (time.time() - start) * 1000
    labels = {
        "method": request.method,
        # Prefer the route template over the raw path in production
        # to keep label cardinality bounded
        "route": request.url.path,
        "status": str(response.status_code),
    }
    http_requests_total.add(1, labels)
    http_request_duration_ms.record(duration_ms, labels)
    return response


# Business metrics
orders_created = meter.create_counter("orders.created.total")
order_value = meter.create_histogram("orders.value.usd", unit="USD")
payment_failures = meter.create_counter("payments.failed.total")


# OrderRequest, process_order, and PaymentDeclined are your application's own types
@app.post("/orders")
async def create_order(order: OrderRequest):
    try:
        result = await process_order(order)
        orders_created.add(1, {"source": order.source})
        order_value.record(float(order.total), {"currency": order.currency})
        return result
    except PaymentDeclined as e:
        payment_failures.add(1, {"reason": e.reason})
        raise
```
OTel Logs: Bridging to Existing Logging
```python
import logging

from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs import LoggerProvider, LoggingHandler
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.exporter.otlp.proto.grpc._log_exporter import OTLPLogExporter


def setup_logging(service_name: str, otlp_endpoint: str = "http://otel-collector:4317"):
    """Bridge Python logging to OpenTelemetry."""
    exporter = OTLPLogExporter(endpoint=otlp_endpoint)
    provider = LoggerProvider(
        resource=Resource.create({SERVICE_NAME: service_name})
    )
    provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
    set_logger_provider(provider)

    # Attach the OTel handler to the root logger
    handler = LoggingHandler(level=logging.INFO, logger_provider=provider)
    logging.getLogger().addHandler(handler)


setup_logging("order-service")

# Now standard logging automatically includes trace context
logger = logging.getLogger(__name__)


@app.post("/orders")
async def create_order(order_data: dict):
    # This log record automatically carries the trace_id and span_id
    # of the active span
    logger.info("Creating order", extra={
        "order_id": order_data["id"],
        "user_id": order_data["user_id"],
        "total": order_data["total"],
    })
    # In Grafana/Loki, you can click a log line and jump directly to the trace
```
OTel Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
  # Drop health check spans to reduce noise
  # (legacy match syntax; newer collectors also accept OTTL conditions)
  filter:
    spans:
      exclude:
        match_type: strict
        span_names: ["GET /health", "GET /metrics"]

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"  # Prometheus scrapes this
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, filter]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```
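To try this config locally, the collector can be run from the official contrib image (a sketch, assuming Docker Compose; the image tag and file paths are placeholders to adjust for your setup):

```yaml
# docker-compose.yaml (sketch)
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest  # contrib build includes the loki exporter and filter processor
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC in
      - "4318:4318"   # OTLP HTTP in
      - "8889:8889"   # Prometheus scrape endpoint out
```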
Correlating Traces, Metrics, and Logs in Grafana
The three pillars work together in Grafana:

- Trace → Logs: in Jaeger/Tempo, click a span to see all logs with that trace_id. Requires logs to include a trace_id field (the OTel logging bridge does this).
- Logs → Traces: in Loki, click a trace_id in a log line to jump to the trace in Jaeger. Requires Grafana datasource linking.
- Metrics → Traces: in a Prometheus chart, click an anomalous data point to see traces from that time window with matching attributes.
- Exemplars: embed a trace_id in histogram samples, allowing you to jump from a slow histogram bucket to the specific trace showing what was slow.
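The Logs → Traces hop is wired up on the Loki datasource via derived fields. A provisioning sketch — the regex, datasource names, and the `tempo` UID are illustrative assumptions to match against your own log format and tracing backend:

```yaml
# grafana/provisioning/datasources/loki.yaml (sketch)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract the trace id from each log line and render it
        # as a link into the tracing datasource
        - name: trace_id
          matcherRegex: '"trace_id":"(\w+)"'
          url: "$${__value.raw}"   # $$ escapes $ in provisioning files
          datasourceUid: tempo     # UID of your Tempo/Jaeger datasource
```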
Exemplars for Histogram Drill-Down
```python
import time

from fastapi import FastAPI, Request
from opentelemetry import metrics

app = FastAPI()
meter = metrics.get_meter(__name__)

latency_histogram = meter.create_histogram(
    "http_request_duration_ms",
    description="Request latency with trace exemplars",
    unit="ms",
)


@app.middleware("http")
async def record_with_exemplar(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = (time.time() - start) * 1000

    # record() runs inside the active request span, so the SDK can attach
    # the current trace_id/span_id to this sample as an exemplar, linking
    # the histogram bucket to the specific trace that produced it.
    latency_histogram.record(
        duration,
        attributes={"route": request.url.path},
    )
    return response
```

No manual wiring of span context is needed: the SDK samples exemplars when a measurement is recorded inside a sampled span. Note that exemplar support landed in opentelemetry-python only in recent releases; on older SDK versions the measurement is recorded without the trace link.
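Exemplars must also survive the rest of the pipeline: Prometheus, for instance, discards them unless exemplar storage is explicitly enabled. A sketch of the relevant flag and query endpoint (assuming a local Prometheus on the default port):

```shell
# Start Prometheus with exemplar storage enabled (off by default)
prometheus --enable-feature=exemplar-storage --config.file=prometheus.yml

# Exemplars attached to a series can then be fetched over the HTTP API
curl -G 'http://localhost:9090/api/v1/query_exemplars' \
  --data-urlencode 'query=http_request_duration_ms_bucket' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z'
```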
Conclusion
OpenTelemetry unifies the three observability pillars under a single, vendor-neutral API. Traces, metrics, and logs share the same SDK, resource attributes, and context propagation model. The OTel Collector decouples instrumentation from backend choice — route to Prometheus today, Datadog tomorrow, without changing application code. The highest value comes from correlation: following a slow request from a Prometheus anomaly through a Tempo trace into Loki logs, all linked by the same trace ID.