Monitoring Event-Driven Systems
Introduction
Event-driven systems shift complexity from request latency to throughput, lag, and retry behavior. Observability must focus on the flow of messages, not just service-level latency, to prevent silent data loss or backlog accumulation.
Core Signals for Event Pipelines
Focus on metrics that indicate flow and failure.
- Ingress rate: Messages produced per second.
- Consumer lag: The gap between the latest produced offset and the last offset the consumer has processed.
- Retry rate: Percentage of messages retried or dead-lettered.
- Processing latency: Time between enqueue and acknowledge.
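The last three signals reduce to simple arithmetic once the raw numbers are available. The sketch below is only an illustration: the function names and input shapes (plain objects keyed by partition, epoch-millisecond timestamps) are assumptions made for this example, not part of any broker client's API.

```javascript
// Hypothetical helpers for illustration only.

// Lag per partition: latest produced offset minus the consumer's committed offset.
// A partition with no committed offset is treated as fully unconsumed (an assumption).
function consumerLag(latestOffsets, committedOffsets) {
  const lag = {};
  for (const [partition, latest] of Object.entries(latestOffsets)) {
    const committed = committedOffsets[partition] ?? 0;
    lag[partition] = Math.max(0, latest - committed);
  }
  return lag;
}

// Retry rate as a percentage of processed messages; 0 when nothing was processed.
function retryRatePercent(retriedCount, processedCount) {
  if (processedCount === 0) return 0;
  return (retriedCount * 100) / processedCount;
}

// Processing latency: time from enqueue to acknowledge, in seconds.
function processingLatencySeconds(enqueuedAtMs, acknowledgedAtMs) {
  return (acknowledgedAtMs - enqueuedAtMs) / 1000;
}

console.log(consumerLag({ 0: 120, 1: 80 }, { 0: 100, 1: 80 })); // { '0': 20, '1': 0 }
console.log(retryRatePercent(5, 200)); // 2.5
console.log(processingLatencySeconds(1000, 2500)); // 1.5
```

In practice the offsets would come from the broker's admin or consumer API and the timestamps from message metadata; the calculations themselves stay the same.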
Node.js Example: Consumer Instrumentation
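This example wraps a message handler with a latency histogram and a retry counter. The histogram times the handler call itself, so it does not include the time a message spent waiting in the queue.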
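```javascript
import { Counter, Histogram } from "prom-client";

// Handler duration in seconds, labeled by topic and outcome.
const processingLatency = new Histogram({
  name: "message_processing_seconds",
  help: "Processing latency in seconds",
  labelNames: ["topic", "status"]
});

// Failed handler invocations, each of which is expected to trigger a retry.
const retryCounter = new Counter({
  name: "message_retries_total",
  help: "Total retries",
  labelNames: ["topic"]
});

export async function handleMessage(topic, message, handler) {
  // startTimer returns a function that records the elapsed time when called.
  const end = processingLatency.startTimer({ topic });
  try {
    await handler(message);
    end({ status: "success" });
  } catch (error) {
    retryCounter.inc({ topic });
    end({ status: "failure" });
    // Rethrow so the caller or broker can redeliver the message.
    throw error;
  }
}
```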
Tracing Across Async Boundaries
Propagate trace context in message headers so that a consumer's span joins the trace its producer started. A single trace then shows both how long a message waited in the queue and how long downstream processing took.
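One way to picture header-based propagation is with the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`). The sketch below is a hand-rolled illustration, not a tracing library: the helper names are hypothetical, headers are assumed to be a plain object of strings, and `Math.random` stands in for a proper random source such as `crypto.randomBytes`. A real system would more likely rely on a tracing SDK's propagator.

```javascript
// Hypothetical helpers for illustration only.

// Random lowercase hex string of the given byte length.
// Math.random is used to keep the sketch dependency-free; prefer a crypto source in production.
function randomHex(byteCount) {
  let out = "";
  for (let i = 0; i < byteCount; i++) {
    out += Math.floor(Math.random() * 256).toString(16).padStart(2, "0");
  }
  return out;
}

// Producer side: attach a traceparent header, continuing the parent trace if one exists.
function injectTraceContext(headers, parentContext) {
  const traceId = parentContext ? parentContext.traceId : randomHex(16);
  const spanId = randomHex(8); // new span for this publish operation
  return { ...headers, traceparent: `00-${traceId}-${spanId}-01` };
}

// Version 00 traceparent: 32 hex trace id, 16 hex parent span id, 2 hex flags.
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Consumer side: read the header back; returns null if it is missing or malformed.
function extractTraceContext(headers) {
  const match = TRACEPARENT_PATTERN.exec(headers.traceparent ?? "");
  if (!match) return null;
  return {
    traceId: match[1],
    parentSpanId: match[2],
    sampled: (parseInt(match[3], 16) & 1) === 1
  };
}
```

The consumer would call `extractTraceContext(message.headers)` before handling a message and pass the result to `injectTraceContext` when it publishes follow-up messages, so every hop shares one trace ID.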
Conclusion
Monitoring event-driven systems requires telemetry that follows messages end-to-end. With lag, retry, and processing-latency metrics in place, you can detect failures before they impact customers.
This post is licensed under CC BY 4.0 by the author.