Handling Partial Failures in Microservices

Partial failures are the default state in distributed systems. A single service instance can fail, a downstream dependency can be slow, or a network partition can isolate only part of the system. The system must remain functional under these conditions.

Recognizing Partial Failure Modes

Typical partial failures include:

  • Slow downstream responses that exhaust thread pools.
  • Transient network errors that affect a subset of traffic.
  • Partial data unavailability (for example, a replica lagging).
  • Dependency brownouts, where the service is up but degraded.

Design Principles

1. Timeouts and Budgeting

Set explicit timeouts for every network call and enforce end-to-end latency budgets. If a request requires three downstream calls, split the latency budget among them.
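One way to enforce this is deadline propagation: compute the remaining budget before each call and use it as that call's timeout. A minimal sketch, with all names illustrative rather than taken from a real framework:

```java
import java.time.Duration;
import java.time.Instant;

// Deadline propagation sketch: each downstream call gets whatever
// remains of the request's end-to-end latency budget.
public class LatencyBudget {
    private final Instant deadline;

    public LatencyBudget(Duration total) {
        this.deadline = Instant.now().plus(total);
    }

    /** Remaining time before the end-to-end deadline, floored at zero. */
    public Duration remaining() {
        Duration left = Duration.between(Instant.now(), deadline);
        return left.isNegative() ? Duration.ZERO : left;
    }

    /** Timeout for the next call: the remainder, capped per call. */
    public Duration nextTimeout(Duration perCallCap) {
        Duration left = remaining();
        return left.compareTo(perCallCap) < 0 ? left : perCallCap;
    }
}
```

A handler creates one budget per inbound request and asks it for a timeout before each downstream call, so a slow first call automatically shrinks the time granted to the second and third.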

2. Fallback Paths

For non-critical features, serve cached data or partial responses. This prevents a single dependency from taking down the entire request path.
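A common stale-on-error variant remembers the last good value per key and serves it when the live call fails. A minimal sketch, assuming a simple in-memory map rather than a real cache library:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Stale-on-error fallback sketch: cache the last successful response
// per key and serve it when the live call throws.
public class FallbackCache<K, V> {
    private final Map<K, V> lastGood = new ConcurrentHashMap<>();

    public V get(K key, Supplier<V> liveCall, V defaultValue) {
        try {
            V fresh = liveCall.get();
            lastGood.put(key, fresh);
            return fresh;
        } catch (RuntimeException e) {
            // Degrade gracefully: stale data beats a failed request
            // for non-critical features.
            return lastGood.getOrDefault(key, defaultValue);
        }
    }
}
```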

3. Isolation and Bulkheads

Partition resources per dependency so that one slow service does not exhaust all threads. Use separate thread pools or circuit breaker isolation.
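The simplest bulkhead is a per-dependency semaphore: at most N calls run concurrently, and excess callers fail fast instead of queuing on shared threads. A stdlib-only sketch (illustrative names, not a Resilience4j API):

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Semaphore bulkhead sketch: bound concurrent calls to one dependency
// so its slowness cannot exhaust the shared thread pool.
public class Bulkhead {
    private final Semaphore permits;

    public Bulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public <T> T execute(Supplier<T> call, Supplier<T> rejected) {
        if (!permits.tryAcquire()) {
            return rejected.get(); // bulkhead full: fail fast
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }
}
```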

4. Idempotent Retries

Retries can improve availability but must be bounded and idempotent to avoid amplification. Pair retries with jittered backoff.
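A bounded retry loop with full-jitter exponential backoff can be sketched in a few lines (illustrative helper, not a library API):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Bounded retry sketch with full-jitter exponential backoff:
// sleep a random duration in [0, base * 2^attempt) between attempts.
public class JitteredRetry {
    public static <T> T call(Supplier<T> op, int maxAttempts, long baseMillis)
            throws InterruptedException {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                long cap = baseMillis << attempt; // exponential growth
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap));
            }
        }
        throw last; // budget exhausted: surface the last failure
    }
}
```

Randomizing the full interval, rather than adding a small jitter term, spreads retries from many clients evenly and avoids synchronized retry storms. The operation must be idempotent, since a timed-out call may have succeeded server-side before the retry fires.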

Spring Boot Example with Resilience4j

import java.time.Duration;

import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;

import reactor.core.publisher.Mono;

@Service
public class InventoryClient {
    private final WebClient webClient;

    public InventoryClient(WebClient.Builder builder) {
        this.webClient = builder.baseUrl("http://inventory").build();
    }

    // Both annotations reference instances configured under the
    // name "inventory" in application.yml.
    @CircuitBreaker(name = "inventory", fallbackMethod = "fallback")
    @Retry(name = "inventory")
    public Mono<InventoryResponse> getInventory(String sku) {
        return webClient.get()
                .uri("/items/{sku}", sku)
                .retrieve()
                .bodyToMono(InventoryResponse.class)
                // Per-call timeout so a slow dependency cannot hold the
                // request open past its latency budget.
                .timeout(Duration.ofMillis(200));
    }

    // Fallback: same parameters as getInventory, plus the triggering Throwable.
    private Mono<InventoryResponse> fallback(String sku, Throwable error) {
        return Mono.just(InventoryResponse.unavailable(sku));
    }
}
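The "inventory" instances referenced by the annotations are configured in application.yml. A plausible fragment (all thresholds and durations here are illustrative, not recommendations):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      inventory:
        slidingWindowSize: 20
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
  retry:
    instances:
      inventory:
        maxAttempts: 3
        waitDuration: 100ms
        enableExponentialBackoff: true
```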

Data Consistency in Partial Failure Scenarios

When a write to one service succeeds and a related write to another fails, you need compensation. The Saga pattern, in which each completed step has a compensating action that can semantically undo it, is the most practical option for microservices: two-phase commit (2PC) blocks when the coordinator or a participant fails, so it tolerates partial failures poorly.
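The core of an orchestrated saga fits in a few lines: each successful step registers its compensation, and on failure the compensations run in reverse order. A minimal sketch with illustrative names, not a specific saga library:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Orchestrated saga sketch: record a compensating action after each
// successful step; on failure, undo completed steps in reverse order.
public class Saga {
    private final Deque<Runnable> compensations = new ArrayDeque<>();

    public void step(Runnable action, Runnable compensation) {
        action.run();                     // may throw
        compensations.push(compensation); // recorded only after success
    }

    public void compensate() {
        while (!compensations.isEmpty()) {
            compensations.pop().run(); // LIFO: undo in reverse order
        }
    }
}
```

A caller wraps the steps in try/catch and invokes compensate() when a step throws; in a real system the registered compensations would also be persisted so the undo survives an orchestrator crash.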

Operational Best Practices

  • Per-dependency dashboards: visualize error rate, latency, and timeouts.
  • Synthetic checks: detect dependency failures before they hit production traffic.
  • Degraded mode testing: chaos tests should simulate partial failures.

Common Pitfalls

  • Using long, unbounded retries that create retry storms.
  • Letting slow calls occupy shared thread pools.
  • Returning success when critical dependencies have not been updated.

Summary

Partial failures must be assumed and designed for explicitly. With timeout budgets, bulkhead isolation, bounded retries, and fallback paths, microservices can continue operating even when parts of the system are degraded.

This post is licensed under CC BY 4.0 by the author.