Handling Partial Failures in Microservices

Partial failures are the default state in distributed systems. A single service instance can fail, a downstream dependency can be slow, or a network partition can isolate only part of the system. The system must remain functional under these conditions.

Recognizing Partial Failure Modes

Typical partial failures include:

  • Slow downstream responses that exhaust thread pools.
  • Transient network errors that affect a subset of traffic.
  • Partial data unavailability (for example, a replica lagging).
  • Dependency brownouts, where the service is up but degraded.

Design Principles

1. Timeouts and Budgeting

Set explicit timeouts for every network call and enforce end-to-end latency budgets. If a request requires three downstream calls, split the latency budget among them.
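One way to enforce this is deadline propagation: compute the remaining budget before each call and use it as that call's timeout. A minimal sketch, with all names illustrative rather than taken from a real framework:

```java
import java.time.Duration;
import java.time.Instant;

// Deadline propagation sketch: each downstream call gets whatever
// remains of the request's end-to-end latency budget.
public class LatencyBudget {
    private final Instant deadline;

    public LatencyBudget(Duration total) {
        this.deadline = Instant.now().plus(total);
    }

    /** Remaining time before the end-to-end deadline, floored at zero. */
    public Duration remaining() {
        Duration left = Duration.between(Instant.now(), deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** Timeout for the next call: the remainder, capped per call. */
    public Duration nextTimeout(Duration perCallCap) {
        Duration left = remaining();
        return left.compareTo(perCallCap) < 0 ? left : perCallCap;
    }
}
```

A handler creates one budget per inbound request and asks it for a timeout before each downstream call, so a slow first call automatically shrinks the time granted to the second and third.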

2. Fallback Paths

For non-critical features, serve cached data or partial responses. This prevents a single dependency from taking down the entire request path.
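A common stale-on-error variant remembers the last good value per key and serves it when the live call fails. A minimal sketch, assuming a simple in-memory map rather than a real cache library:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Stale-on-error fallback sketch: cache the last successful response
// per key and serve it when the live call throws.
public class FallbackCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    public V get(K key, Supplier<V> liveCall, V defaultValue) {
        try {
            V fresh = liveCall.get();
            lastGood.put(key, fresh);
            return fresh;
        } catch (RuntimeException e) {
            // Degrade gracefully: stale data beats a failed request
            // for non-critical features.
            return lastGood.getOrDefault(key, defaultValue);
        }
    }
}
```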

3. Isolation and Bulkheads

Partition resources per dependency so that one slow service does not exhaust all threads. Use separate thread pools or circuit breaker isolation.
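The simplest bulkhead is a per-dependency semaphore: at most N calls run concurrently, and excess callers fail fast instead of queuing on shared threads. A stdlib-only sketch (illustrative names, not a Resilience4j API):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore bulkhead sketch: bound concurrent calls to one dependency
// so its slowness cannot exhaust the shared thread pool.
public class Bulkhead {
    private final Semaphore permits;

    public Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public <T> T execute(Supplier<T> call, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // bulkhead full: fail fast
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```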

4. Idempotent Retries

Retries can improve availability but must be bounded and idempotent to avoid amplification. Pair retries with jittered backoff.
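A bounded retry loop with full-jitter exponential backoff can be sketched in a few lines (illustrative helper, not a library API):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Bounded retry sketch with full-jitter exponential backoff:
// sleep a random duration in [0, base * 2^attempt) between attempts.
public class JitteredRetry {
    public static <T> T call(Supplier<T> op, int maxAttempts, long baseMillis)
            throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                long cap = baseMillis << attempt; // exponential growth
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap));
            }
        }
        throw last; // budget exhausted: surface the last failure
    }
}
```

Randomizing the full interval, rather than adding a small jitter term, spreads retries from many clients evenly and avoids synchronized retry storms. The operation must be idempotent, since a timed-out call may have succeeded server-side before the retry fires.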

Spring Boot Example with Resilience4j

import java.time.Duration;

import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;

import reactor.core.publisher.Mono;

@Service
public class InventoryClient {
    private final WebClient webClient;

    public InventoryClient(WebClient.Builder builder) {
        this.webClient = builder.baseUrl("http://inventory").build();
    }

    // Both annotations reference instances configured under the
    // name "inventory" in application.yml.
    @CircuitBreaker(name = "inventory", fallbackMethod = "fallback")
    @Retry(name = "inventory")
    public Mono<InventoryResponse> getInventory(String sku) {
        return webClient.get()
                .uri("/items/{sku}", sku)
                .retrieve()
                .bodyToMono(InventoryResponse.class)
                // Per-call timeout so a slow dependency cannot hold the
                // request open past its latency budget.
                .timeout(Duration.ofMillis(200));
    }

    // Fallback: same parameters as getInventory, plus the triggering Throwable.
    private Mono<InventoryResponse> fallback(String sku, Throwable error) {
        return Mono.just(InventoryResponse.unavailable(sku));
    }
}
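The "inventory" instances referenced by the annotations are configured in application.yml. A plausible fragment (all thresholds and durations here are illustrative, not recommendations):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      inventory:
        slidingWindowSize: 20
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
  retry:
    instances:
      inventory:
        maxAttempts: 3
        waitDuration: 100ms
        enableExponentialBackoff: true
```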

Data Consistency in Partial Failure Scenarios

When a write to one service succeeds and a related write to another fails, you need compensation. The Saga pattern, in which each completed step has a compensating action that can semantically undo it, is the most practical option for microservices: two-phase commit (2PC) blocks when the coordinator or a participant fails, so it tolerates partial failures poorly.
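The core of an orchestrated saga fits in a few lines: each successful step registers its compensation, and on failure the compensations run in reverse order. A minimal sketch with illustrative names, not a specific saga library:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Orchestrated saga sketch: record a compensating action after each
// successful step; on failure, undo completed steps in reverse order.
public class Saga {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    public void step(Runnable action, Runnable compensation) {
        action.run();                     // may throw
        compensations.push(compensation); // recorded only after success
    }

    public void compensate() {
        while (!compensations.isEmpty()) {
            compensations.pop().run(); // LIFO: undo in reverse order
        }
    }
}
```

A caller wraps the steps in try/catch and invokes compensate() when a step throws; in a real system the registered compensations would also be persisted so the undo survives an orchestrator crash.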

Operational Best Practices

  • Per-dependency dashboards: visualize error rate, latency, and timeouts.
  • Synthetic checks: detect dependency failures before they hit production traffic.
  • Degraded mode testing: chaos tests should simulate partial failures.

Common Pitfalls

  • Using long, unbounded retries that create retry storms.
  • Letting slow calls occupy shared thread pools.
  • Returning success when critical dependencies have not been updated.

Summary

Partial failures must be assumed and designed for explicitly. With timeout budgets, bulkhead isolation, bounded retries, and fallback paths, microservices can continue operating even when parts of the system are degraded.

This post is licensed under CC BY 4.0 by the author.