Designing Resilient Distributed Systems

Resilience is the ability of a system to absorb failures and continue operating. It goes beyond availability by focusing on degradation, recovery, and fault isolation.

Core Resilience Principles#

1. Redundancy and Failure Domains#

Replicate critical services across zones and regions, but keep failure domains isolated. Avoid correlated failures by separating networks, data stores, and deployment pipelines.
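One way to make the isolation requirement checkable is to verify that losing any single failure domain still leaves enough replicas running. The sketch below is illustrative only: `PlacementCheck`, `Replica`, and `survivesSingleZoneLoss` are hypothetical names, and it assumes the zone is the only failure domain that matters.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical placement check: names and the single-zone-loss model are assumptions.
final class PlacementCheck {

    // A replica of a service and the zone (failure domain) it runs in.
    record Replica(String id, String zone) {}

    // True if at least minHealthy replicas remain after the most populated zone is lost.
    static boolean survivesSingleZoneLoss(List<Replica> replicas, int minHealthy) {
        Map<String, Long> replicasPerZone = replicas.stream()
                .collect(Collectors.groupingBy(Replica::zone, Collectors.counting()));
        // Worst case: the zone holding the most replicas fails.
        long largestZone = replicasPerZone.values().stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(0L);
        return replicas.size() - largestZone >= minHealthy;
    }
}
```

Three replicas spread over three zones pass a `minHealthy` of 2, while the same three replicas with two sharing a zone do not. A check like this could run in a deployment pipeline, though real placements usually need to consider more than one kind of failure domain.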

2. Graceful Degradation#

Separate critical features from optional ones. When an optional feature fails or responds slowly, return partial results or cached data instead of failing the entire request.
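As a rough illustration in plain Java, the sketch below composes a page from one critical lookup and one optional lookup. `ProductPageComposer`, `ProductPage`, and the `cachedRecommendations` fallback are hypothetical names chosen for the example, not part of any framework.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical composer: "details" is critical, "recommendations" is optional.
final class ProductPageComposer {

    // degraded is true when the optional part was replaced by the fallback.
    record ProductPage(String details, List<String> recommendations, boolean degraded) {}

    static ProductPage compose(Supplier<String> details,
                               Supplier<List<String>> recommendations,
                               List<String> cachedRecommendations,
                               Duration optionalTimeout) {
        // Optional dependency: run asynchronously, bound the wait, and turn any
        // failure (including the timeout) into an empty Optional.
        // Note: a timed-out task keeps running in the background; only the wait is bounded.
        CompletableFuture<Optional<List<String>>> optionalPart = CompletableFuture
                .supplyAsync(recommendations)
                .orTimeout(optionalTimeout.toMillis(), TimeUnit.MILLISECONDS)
                .thenApply(Optional::of)
                .exceptionally(error -> Optional.empty());

        // Critical dependency: no fallback, so a failure here propagates to the caller.
        String criticalDetails = details.get();

        Optional<List<String>> fresh = optionalPart.join();
        return new ProductPage(
                criticalDetails,
                fresh.orElse(cachedRecommendations),
                fresh.isEmpty());
    }
}
```

The `degraded` flag is a design choice worth keeping: callers and dashboards can tell a fully served response from a partially served one, which a silent fallback would hide.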

3. Backpressure and Load Shedding#

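Protect the system by rejecting excess load early, before it consumes downstream capacity. Bound every queue, and derive its depth from the latency budget: work that would wait longer than the budget allows should be rejected rather than queued.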
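A minimal sketch of this idea using the JDK's `ThreadPoolExecutor` with a bounded queue and `AbortPolicy`. The `BoundedWorkerPool` class and the sizing rule in `maxQueueDepth` are assumptions for illustration; the rule is a back-of-the-envelope estimate that treats service time as roughly constant.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper: sheds load instead of letting the queue grow without bound.
final class BoundedWorkerPool {
    private final ThreadPoolExecutor executor;

    BoundedWorkerPool(int workers, int maxQueueDepth) {
        this.executor = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(maxQueueDepth),   // bounded queue
                new ThreadPoolExecutor.AbortPolicy());     // reject when full
    }

    // Rough sizing rule (an assumption): the last queued task waits about
    // depth * avgServiceMillis / workers, so keep that within the wait budget.
    static int maxQueueDepth(int workers, long queueWaitBudgetMillis, long avgServiceMillis) {
        return (int) Math.max(1L, workers * queueWaitBudgetMillis / avgServiceMillis);
    }

    // Returns true if the task was accepted, false if it was shed.
    boolean trySubmit(Runnable task) {
        try {
            executor.execute(task);
            return true;
        } catch (RejectedExecutionException overloaded) {
            return false;
        }
    }

    void shutdown() {
        executor.shutdownNow();
    }
}
```

With 4 workers, a 200 ms queue-wait budget, and an average service time of 50 ms, the rule gives a depth of 16. A caller that receives `false` from `trySubmit` would typically answer with an overload response (for HTTP, commonly a 429 or 503) so that clients can back off.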

4. Automatic Recovery#

Automate restarts, traffic shifts, and failover. Avoid manual runbooks for common failures.
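To make automated failover concrete, here is a simplified sketch of a client-side router that shifts traffic away from a failing endpoint and retries it after a cooldown. `FailoverRouter`, its parameters, and the eject-on-first-failure policy are all assumptions for illustration; production systems usually rely on health checks and a load balancer or service mesh for this.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical router: tries endpoints in priority order and ejects failing ones for a cooldown.
final class FailoverRouter {
    private final List<String> endpoints;                         // priority order
    private final Map<String, Long> ejectedUntil = new HashMap<>();
    private final long cooldownMillis;
    private final LongSupplier clock;                             // injectable for tests

    FailoverRouter(List<String> endpoints, long cooldownMillis, LongSupplier clock) {
        this.endpoints = List.copyOf(endpoints);
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    // Simplification: the lock is held during the request, which serializes calls.
    synchronized <T> T call(Function<String, T> request) {
        long now = clock.getAsLong();
        RuntimeException lastFailure = null;
        for (String endpoint : endpoints) {
            if (ejectedUntil.getOrDefault(endpoint, 0L) > now) {
                continue; // still cooling down, skip without sending traffic
            }
            try {
                T result = request.apply(endpoint);
                ejectedUntil.remove(endpoint); // recovered: restore normal routing
                return result;
            } catch (RuntimeException failure) {
                ejectedUntil.put(endpoint, now + cooldownMillis);
                lastFailure = failure;
            }
        }
        throw new IllegalStateException("no healthy endpoint available", lastFailure);
    }
}
```

Once the cooldown expires, the primary is tried again on the next call, so recovery needs no operator action. Passing `System::currentTimeMillis` as the clock would be the usual choice outside tests.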

Spring Boot Example: Fallback for Optional Dependencies#

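import java.time.Duration;

import org.springframework.stereotype.Service;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// Price and Price.unavailable(sku) are application types not shown in this example.
@Service
public class PricingService {
    private final WebClient webClient;

    public PricingService(WebClient.Builder builder) {
        this.webClient = builder.baseUrl("http://pricing").build();
    }

    // Pricing is treated as optional: a slow or failing call degrades to a placeholder price.
    public Mono<Price> getPrice(String sku) {
        return webClient.get()
                .uri("/price/{sku}", sku)
                .retrieve()
                .bodyToMono(Price.class)
                // Fail fast once the call exceeds its 150 ms latency budget.
                .timeout(Duration.ofMillis(150))
                // Any error, including the timeout, yields the placeholder instead of propagating.
                .onErrorReturn(Price.unavailable(sku));
    }
}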

Testing for Resilience#

  • Chaos tests to inject latency and failures.
  • Dependency failure drills.
  • Game days focused on recovery and observability.
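A chaos test of the kind listed above can start as a small wrapper that adds latency and random failures in front of a dependency call. The sketch below is a hand-rolled illustration, not the API of any chaos-engineering tool; `FaultInjector`, `InjectedFaultException`, and the parameters are hypothetical.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical fault injector for tests: adds fixed latency and fails a fraction of calls.
final class FaultInjector {

    // Marker exception so tests can tell injected faults from real ones.
    static final class InjectedFaultException extends RuntimeException {
        InjectedFaultException() {
            super("injected fault");
        }
    }

    private final double failureRate;       // 0.0 = never fail, 1.0 = always fail
    private final long addedLatencyMillis;
    private final Random random;            // seeded so test runs are repeatable

    FaultInjector(double failureRate, long addedLatencyMillis, long seed) {
        this.failureRate = failureRate;
        this.addedLatencyMillis = addedLatencyMillis;
        this.random = new Random(seed);
    }

    // Wraps a dependency call so each invocation may be delayed or fail.
    <T> Supplier<T> wrap(Supplier<T> dependency) {
        return () -> {
            pause(addedLatencyMillis);
            if (random.nextDouble() < failureRate) {
                throw new InjectedFaultException();
            }
            return dependency.get();
        };
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

Wrapping a dependency this way in a test lets you assert that the caller's timeouts and fallbacks actually engage, rather than assuming they do.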

Metrics That Matter#

Track SLOs based on success rate and latency, not just uptime. Measure mean time to detect (MTTD) and mean time to recover (MTTR).
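These measures reduce to simple arithmetic over request counts and incident timestamps. The sketch below assumes a "good" request is one that succeeded within the latency threshold, and that both MTTD and MTTR are measured from the start of the incident; teams define these differently, and `ResilienceMetrics` and `Incident` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper: definitions of "good", MTTD, and MTTR are assumptions stated above.
final class ResilienceMetrics {

    record Incident(Instant started, Instant detected, Instant recovered) {}

    // Fraction of requests that were good (successful and within the latency threshold).
    static double successRate(long good, long total) {
        return total == 0 ? 1.0 : (double) good / total;
    }

    // Share of the error budget used: observed error rate divided by the allowed error rate.
    static double errorBudgetConsumed(long good, long total, double sloTarget) {
        double allowedErrorRate = 1.0 - sloTarget;
        double observedErrorRate = 1.0 - successRate(good, total);
        return observedErrorRate / allowedErrorRate;
    }

    static Duration meanTimeToDetect(List<Incident> incidents) {
        return mean(incidents, i -> Duration.between(i.started(), i.detected()).toMillis());
    }

    static Duration meanTimeToRecover(List<Incident> incidents) {
        return mean(incidents, i -> Duration.between(i.started(), i.recovered()).toMillis());
    }

    // Arithmetic mean in milliseconds; an empty list yields a zero duration.
    private static Duration mean(List<Incident> incidents, ToLongFunction<Incident> millis) {
        double average = incidents.stream().mapToLong(millis).average().orElse(0.0);
        return Duration.ofMillis(Math.round(average));
    }
}
```

For instance, 9,995 good requests out of 10,000 against a 99.9% target consume roughly half of the error budget, a signal that plain uptime would not show.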

Summary#

Resilient systems require redundancy, isolation, and controlled degradation. Automating recovery paths and validating them through failure testing is the most reliable way to build durable platforms.
