
Distributed Tracing: How It Works Internally

Introduction

Distributed tracing exposes the path of a request through multiple services, giving you latency and error context across service boundaries. For advanced teams, understanding the internal mechanics is essential for tuning sampling, optimizing storage, and interpreting trace data correctly.

Trace and Span Model

A trace is a tree of spans that share one trace ID. Each span represents a logical unit of work and records its own span ID, its parent's span ID, start and end times, a status, and attributes. The parent-child relationships let you identify which downstream dependency contributed the latency or the error.
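To make the parent-child model concrete, here is a minimal sketch. The SpanRecord type and the TraceTree helper are hypothetical names for illustration, not part of any SDK, and the self-time calculation assumes that child spans run one after another.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal span shape; real SDKs also carry status, attributes, and events.
public sealed record SpanRecord(
    string TraceId,
    string SpanId,
    string? ParentSpanId,   // null for the root span
    string Name,
    DateTimeOffset Start,
    TimeSpan Duration);

public static class TraceTree
{
    // Groups spans by parent span ID so the children of any span can be looked up directly.
    public static ILookup<string, SpanRecord> ChildrenByParent(IEnumerable<SpanRecord> spans) =>
        spans.Where(s => s.ParentSpanId is not null)
             .ToLookup(s => s.ParentSpanId!);

    // Self time = span duration minus the time spent in direct children.
    // Assumption: children do not overlap; parallel children would need interval merging.
    public static TimeSpan SelfTime(SpanRecord span, ILookup<string, SpanRecord> children)
    {
        var childTotal = children[span.SpanId]
            .Aggregate(TimeSpan.Zero, (sum, child) => sum + child.Duration);
        var self = span.Duration - childTotal;
        return self < TimeSpan.Zero ? TimeSpan.Zero : self;
    }
}
```

A span with a large duration but a small self time is mostly waiting on its children, which is usually where the investigation should continue.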

Context Propagation

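Tracing relies on context propagation across process boundaries. The W3C Trace Context standard defines a traceparent header that carries a version, a 16-byte trace ID, the 8-byte ID of the calling span, and trace flags that include the sampled bit. If any hop drops or rewrites the header, the trace fragments into disconnected pieces and latency analysis becomes unreliable.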
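The header layout is easier to see in code. The following is a hand-rolled sketch that parses only version 00 of traceparent; the TraceParent type is a hypothetical name for illustration. In production .NET code, the built-in ActivityContext.TryParse is likely the better choice.

```csharp
using System;
using System.Linq;

// Hypothetical holder for the fields of a version-00 traceparent header.
public sealed record TraceParent(string TraceId, string ParentId, bool Sampled)
{
    // Expected layout: "00-<32 hex chars>-<16 hex chars>-<2 hex chars>".
    public static bool TryParse(string? header, out TraceParent? result)
    {
        result = null;
        if (header is null) return false;

        var parts = header.Trim().Split('-');
        if (parts.Length != 4 || parts[0] != "00") return false;
        if (!IsLowerHex(parts[1], 32) || !IsLowerHex(parts[2], 16) || !IsLowerHex(parts[3], 2))
            return false;

        // The specification treats all-zero trace and parent IDs as invalid.
        if (parts[1].All(c => c == '0') || parts[2].All(c => c == '0')) return false;

        // Bit 0 of the flags byte is the sampled flag.
        var flags = Convert.ToByte(parts[3], 16);
        result = new TraceParent(parts[1], parts[2], (flags & 0x01) != 0);
        return true;
    }

    // Formats the fields back into a header value for the outgoing request.
    public string ToHeader() => $"00-{TraceId}-{ParentId}-{(Sampled ? "01" : "00")}";

    private static bool IsLowerHex(string value, int length) =>
        value.Length == length &&
        value.All(c => (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'));
}
```

When a service makes an outgoing call, it keeps the trace ID, replaces the parent ID with the ID of its own current span, and forwards the flags.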

Sampling Strategies

Sampling controls cost and fidelity. Common strategies include:

  • Head-based sampling: Decide at request start, usually by probability.
  • Tail-based sampling: Decide after completion, based on latency or error status.

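Tail sampling keeps the traces that matter most, such as slow or failed requests, but the collector must buffer every span of a trace until the decision is made, and all spans of a trace must reach the same collector instance.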
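The two strategies can be sketched as follows. The type and method names are hypothetical, and the head sampler assumes that trace IDs are randomly generated, so that their leading bytes are uniformly distributed. This illustrates the idea and is not the algorithm of any particular SDK or collector.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical shape of a finished span as seen by a tail-sampling collector.
public sealed record FinishedSpan(string TraceId, bool IsError, TimeSpan Duration);

public static class Sampling
{
    // Head-based: derive the decision from the trace ID, so every service that
    // sees the same trace reaches the same verdict without coordination.
    public static bool HeadSample(string traceIdHex, double ratio)
    {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;

        // Interpret the first 8 bytes (16 hex chars) of the trace ID as a number.
        var bucket = ulong.Parse(
            traceIdHex.Substring(0, 16), NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return bucket < (ulong)(ratio * ulong.MaxValue);
    }

    // Tail-based: inspect the buffered spans of a completed trace and keep it
    // if any span failed or any span was slower than the threshold.
    public static bool TailKeep(IReadOnlyCollection<FinishedSpan> trace, TimeSpan slowThreshold) =>
        trace.Count > 0 &&
        (trace.Any(s => s.IsError) || trace.Max(s => s.Duration) >= slowThreshold);
}
```

The head sampler needs only the trace ID, so it is cheap and can run in every service. The tail decision needs the whole trace, which is why it has to run where the spans are buffered.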

Storage and Indexing

Many trace backends store spans in a columnar format optimized for fast queries. Indexing high-cardinality attributes, such as user or request IDs, is expensive in both storage and write throughput. Advanced teams keep a limited set of indexed attributes and rely on exemplars, which link a metric data point to a specific trace, for deeper analysis.
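One way to picture the "limited set of indexed attributes" is an allowlist applied at ingestion. The sketch below is purely illustrative: the AttributeIndexing class and the keys in the allowlist are assumptions, and real backends configure indexing in their own ways.

```csharp
using System;
using System.Collections.Generic;

public static class AttributeIndexing
{
    // Hypothetical allowlist of low-cardinality keys that are worth indexing.
    private static readonly HashSet<string> IndexedKeys = new HashSet<string>(StringComparer.Ordinal)
    {
        "service.name", "http.route", "error"
    };

    // Splits span attributes into an indexed set and an unindexed payload.
    // The payload is still stored with the span; it just cannot be searched cheaply.
    public static (Dictionary<string, string> Indexed, Dictionary<string, string> Payload)
        Split(IEnumerable<KeyValuePair<string, string>> attributes)
    {
        var indexed = new Dictionary<string, string>();
        var payload = new Dictionary<string, string>();
        foreach (var (key, value) in attributes)
        {
            (IndexedKeys.Contains(key) ? indexed : payload)[key] = value;
        }
        return (indexed, payload);
    }
}
```

With this split, an attribute such as customer.id remains visible when you open a trace, but it does not add a new index entry for every customer.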

C# Example: Manual Span Creation

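In .NET, System.Diagnostics.Activity is the span type and ActivitySource is the factory that creates it. StartActivity returns null when no listener is registered for the source, which is why the example uses the null-conditional operator. Explicit spans are useful for database calls, caching layers, and remote dependencies.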

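using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Excerpt from the order service class; Order, OrderRequest, and _repository are defined elsewhere.
// The source name must match the name registered with your tracing SDK or listener.
private static readonly ActivitySource ActivitySource = new("Orders.Service");

public async Task<Order> CreateOrderAsync(OrderRequest request) {
    // StartActivity returns null when nothing is listening, hence the ?. calls below.
    using var activity = ActivitySource.StartActivity("orders.create", ActivityKind.Server);
    activity?.SetTag("customer.id", request.CustomerId);
    activity?.SetTag("order.total", request.Total);

    try {
        var order = await _repository.SaveAsync(request);
        activity?.SetTag("order.id", order.Id);
        return order;
    } catch (Exception ex) {
        // Mark the span as failed so error queries can find it (SetStatus requires .NET 6 or later).
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}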
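Spans created this way are only recorded when something listens to the source. In most applications the OpenTelemetry SDK registers that listener for you; the sketch below does it by hand with ActivityListener to show the mechanism. The TracingBootstrap class and its ListenTo method are hypothetical names.

```csharp
using System;
using System.Diagnostics;

public static class TracingBootstrap
{
    // Registers a listener for one source name and records every activity it creates.
    // Dispose the returned listener to stop listening.
    public static ActivityListener ListenTo(string sourceName, Action<Activity> onStopped)
    {
        var listener = new ActivityListener
        {
            // Only observe the named source.
            ShouldListenTo = source => source.Name == sourceName,
            // Sample everything; a real setup would apply a sampling policy here.
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            // Called when an activity is stopped or disposed.
            ActivityStopped = onStopped
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Calling TracingBootstrap.ListenTo("Orders.Service", a => Console.WriteLine($"{a.OperationName} took {a.Duration.TotalMilliseconds} ms")) at startup would print one line per completed span from the example above.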

Troubleshooting Trace Gaps

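Common causes of missing spans include incompatible propagators (for example, one service emitting B3 headers while another expects W3C traceparent), asynchronous context loss, and overly aggressive sampling. Validate that trace IDs are consistent across services and verify that the sampled flag is propagated downstream so that every service makes the same decision.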
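A quick way to spot a gap is to look for orphans: spans that name a parent which never arrived. The following is a minimal sketch over span and parent IDs exported from a single trace; TraceGapCheck is a hypothetical helper, not a backend API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TraceGapCheck
{
    // Returns the IDs of spans whose parent is referenced but absent from the collected set.
    // Root spans (null parent) are not orphans.
    public static IReadOnlyList<string> FindOrphans(
        IEnumerable<(string SpanId, string? ParentSpanId)> spans)
    {
        var list = spans.ToList();
        var known = list.Select(s => s.SpanId).ToHashSet();
        return list
            .Where(s => s.ParentSpanId is not null && !known.Contains(s.ParentSpanId))
            .Select(s => s.SpanId)
            .ToList();
    }
}
```

An orphan usually means that the parent span was dropped by sampling or never exported. A trace that contains several root spans more often points to a propagation break, because the downstream service started a new trace instead of continuing the existing one.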

Conclusion

Distributed tracing is not just instrumentation; it is a data pipeline. When you understand its internal mechanics, you can tune fidelity and cost without sacrificing reliability insights.

This post is licensed under CC BY 4.0 by the author.