Monitoring Event-Driven Systems

Posted Oct 7, 2025

By R G

Introduction

Event-driven systems shift complexity from request latency to throughput, lag, and retry behavior. Observability must therefore follow the flow of messages, not just service-level latency, so that silent data loss and growing backlogs are caught early.

Core Signals for Event Pipelines

Focus on metrics that indicate flow and failure.

Ingress rate: Messages produced per second.
Consumer lag: The distance between produced and processed offsets.
Retry rate: Percentage of messages retried or dead-lettered.
Processing latency: Time between enqueue and acknowledgment.
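
The lag and retry-rate definitions above reduce to simple arithmetic. The following is a minimal sketch rather than a broker client: the function names and the plain-object offset maps keyed by partition are assumptions made for illustration, and a partition with no committed offset is assumed to start at 0.

```javascript
// Consumer lag per partition: newest produced offset minus last committed offset.
// Both arguments are hypothetical plain objects of the form { partition: offset }.
function consumerLag(latestOffsets, committedOffsets) {
  const lag = {};
  for (const [partition, latest] of Object.entries(latestOffsets)) {
    // Assumption: a partition without a committed offset starts at 0.
    const committed = committedOffsets[partition] ?? 0;
    // Clamp at 0 so a stale "latest" reading never reports negative lag.
    lag[partition] = Math.max(0, latest - committed);
  }
  return lag;
}

// Sum of lag across all partitions, useful as a single backlog number.
function totalLag(lagByPartition) {
  return Object.values(lagByPartition).reduce((sum, n) => sum + n, 0);
}

// Retry rate: retried plus dead-lettered messages as a percentage of processed messages.
function retryRatePercent(retried, deadLettered, processed) {
  if (processed === 0) return 0; // avoid dividing by zero on an idle consumer
  return ((retried + deadLettered) * 100) / processed;
}
```

For example, `consumerLag({ 0: 120, 1: 80 }, { 0: 100, 1: 80 })` reports a lag of 20 on partition 0 and 0 on partition 1. In practice the offsets would come from the broker's admin API or from an exporter, not from hand-built objects.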

Node.js Example: Consumer Instrumentation

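This example instruments a message handler with a latency histogram and a retry counter. The counter increments on every failed attempt, which equals the retry count only if the broker redelivers a message whose handler throws.
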
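import { Counter, Histogram } from "prom-client";

// Duration of each handler call, labeled by topic and outcome.
const processingLatency = new Histogram({
  name: "message_processing_seconds",
  help: "Processing latency in seconds",
  labelNames: ["topic", "status"]
});

// Failed attempts per topic; each failure is expected to trigger a redelivery.
const retryCounter = new Counter({
  name: "message_retries_total",
  help: "Total retries",
  labelNames: ["topic"]
});

export async function handleMessage(topic, message, handler) {
  // startTimer returns a function that records the elapsed seconds when called.
  const end = processingLatency.startTimer({ topic });
  try {
    await handler(message);
    end({ status: "success" });
  } catch (error) {
    retryCounter.inc({ topic });
    end({ status: "failure" });
    // Rethrow so the consumer framework can retry or dead-letter the message.
    throw error;
  }
}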
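
The handler above times only the handler call, while the processing latency signal listed earlier runs from enqueue to acknowledgment. One way to approximate that wider measurement is to have the producer stamp each message with its enqueue time. The sketch below assumes a hypothetical `enqueuedAtMs` header holding epoch milliseconds; neither the header name nor the function is part of any library.

```javascript
// Seconds elapsed between enqueue and now, or null if the message carries no usable timestamp.
// `enqueuedAtMs` is a hypothetical header the producer is assumed to set.
function endToEndLatencySeconds(headers, nowMs = Date.now()) {
  if (headers.enqueuedAtMs == null) return null;
  const enqueuedAtMs = Number(headers.enqueuedAtMs);
  if (!Number.isFinite(enqueuedAtMs)) return null;
  // Clamp at 0: clock skew between producer and consumer can make the difference negative.
  return Math.max(0, nowMs - enqueuedAtMs) / 1000;
}
```

Calling this just before acknowledging the message and recording the result in a second histogram would separate time spent waiting in the queue from time spent in the handler. The value is only as accurate as the clock synchronization between producer and consumer hosts.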

Tracing Across Async Boundaries

Propagate trace context in message headers so that producer and consumer spans belong to the same trace. This makes it possible to connect queue lag with downstream processing time for an individual message.
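
To make the header mechanics concrete, here is a hand-rolled sketch of the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`). It is an illustration under assumptions, not a replacement for a tracing SDK: the function names are hypothetical, headers are assumed to be a plain object of strings, and `Math.random` stands in for a proper ID generator.

```javascript
// Random lowercase hex string. Math.random is acceptable for a sketch, not for production IDs.
function randomHex(length) {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Producer side: return a copy of the headers with a traceparent entry added.
// Pass an existing traceId to continue a trace; otherwise a new one is generated.
function injectTraceContext(headers, traceId = randomHex(32)) {
  const spanId = randomHex(16); // identifies the producer's span
  return { ...headers, traceparent: `00-${traceId}-${spanId}-01` }; // 01 = sampled
}

// Version 00, 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Consumer side: parse the header, returning null if it is missing or malformed.
function extractTraceContext(headers) {
  const match = TRACEPARENT_PATTERN.exec(headers.traceparent ?? "");
  if (!match) return null;
  const [, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid in the W3C format.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

The consumer would start its own span with the extracted `traceId` and use `parentSpanId` as the parent, so both sides appear in one trace. Some brokers deliver header values as byte buffers, in which case they need decoding to strings before parsing.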

Conclusion

Monitoring event-driven systems requires telemetry that follows messages end-to-end. With metrics for lag, retries, and processing latency in place, and traces linking producers to consumers, you can detect failures before they impact customers.

DevOps

This post is licensed under CC BY 4.0 by the author.