Introduction#
Learning from production failures is critical for building reliable distributed systems. This post analyzes real-world incidents from major tech companies, examining root causes, cascading failures, and lessons learned. Understanding these failure modes helps engineers anticipate problems and design more resilient systems.
The AWS S3 Outage (February 2017)#
What Happened#
An engineer executing a playbook to remove a small number of S3 servers accidentally removed a much larger set, including servers supporting two critical subsystems: the index subsystem, which manages metadata for all objects in the region, and the placement subsystem, which manages allocation of new storage. S3 in the affected region was unavailable for roughly four hours.
Root Cause#
A typo in the command-line argument removed far more capacity than intended. The subsystems that were taken offline needed to restart and rebuild their state, which took longer than expected at the new scale.
# Python: Simulating the S3 outage scenario
from typing import Set, Dict
from dataclasses import dataclass
from enum import Enum

class ServerState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    REBUILDING = "rebuilding"

@dataclass
class Server:
    server_id: str
    state: ServerState
    data_shards: Set[str]

class DistributedStorageCluster:
    """Simulates the S3 subsystem failure scenario"""

    def __init__(self, num_servers: int):
        self.servers: Dict[str, Server] = {}
        self.index_system_servers: Set[str] = set()
        self.placement_system_servers: Set[str] = set()
        # Initialize servers
        for i in range(num_servers):
            server_id = f"server-{i}"
            self.servers[server_id] = Server(
                server_id=server_id,
                state=ServerState.RUNNING,
                data_shards=set()
            )
            # First 20% belong to the index subsystem
            if i < num_servers * 0.2:
                self.index_system_servers.add(server_id)
            # Next 10% belong to the placement subsystem
            elif i < num_servers * 0.3:
                self.placement_system_servers.add(server_id)

    def remove_servers_command(self, count: int):
        """
        DANGEROUS: Command to remove servers.
        Simulates the typo that caused the outage.
        """
        print(f"\n=== Executing remove command for {count} servers ===")
        # INTENDED: remove 'count' servers
        # ACTUAL: typo causes 'count' to be interpreted as a percentage
        actual_removed = int(len(self.servers) * (count / 100.0))
        print(f"INTENDED to remove: {count} servers")
        print(f"ACTUALLY removed: {actual_removed} servers due to typo!")
        removed_servers = list(self.servers.keys())[:actual_removed]
        critical_systems_affected = False
        for server_id in removed_servers:
            self.servers[server_id].state = ServerState.STOPPED
            # Check whether critical subsystems are affected
            if server_id in self.index_system_servers:
                critical_systems_affected = True
                print(f"WARNING: Index system server {server_id} removed!")
            if server_id in self.placement_system_servers:
                critical_systems_affected = True
                print(f"WARNING: Placement system server {server_id} removed!")
        if critical_systems_affected:
            print("\n!!! CRITICAL: Essential subsystems taken offline !!!")
            self.begin_recovery()

    def begin_recovery(self):
        """Simulate the slow recovery process"""
        print("\n=== Beginning Recovery Process ===")
        # The index subsystem must rebuild its state before serving requests
        print("Index system rebuilding state...")
        rebuild_time_hours = 4.5  # on the order of the actual incident's duration
        print(f"Estimated recovery time: {rebuild_time_hours} hours")
        print("Issue: System had grown significantly since last restart")
        print("Issue: State rebuilding not optimized for current scale")
        # In the real incident, this caused cascading failures
        print("\nCascading effects:")
        print("- S3 PUT/GET/DELETE requests failing")
        print("- AWS Console unable to load (uses S3 for assets)")
        print("- Many AWS services degraded (depend on S3)")
        print("- Public internet affected (many sites use S3)")

# Demonstrate the incident
print("=== AWS S3 Outage Simulation (Feb 2017) ===")
cluster = DistributedStorageCluster(num_servers=1000)
# The fateful command with the typo:
# intended to remove 40 servers; actually removed 40% of the fleet
cluster.remove_servers_command(count=40)
print("\n=== Lessons Learned ===")
print("1. Command-line tools need better input validation")
print("2. Critical operations should require confirmation")
print("3. Implement rate limiting on destructive operations")
print("4. Test recovery procedures at current scale")
print("5. Design for faster state rebuilding")
print("6. Gradual rollout of infrastructure changes")
Lessons Learned#
- Input Validation: Implement safeguards on destructive operations
- Confirmation Steps: Multi-step confirmation for critical commands
- Rate Limiting: Limit the speed of destructive operations
- Recovery Testing: Regularly test recovery at production scale
- Graceful Degradation: Design subsystems to function with reduced capacity
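The first two lessons can be sketched together: a removal command that checks its argument against a blast-radius cap and demands an explicit confirmation phrase before touching anything. This is a minimal illustration, not real AWS tooling; the function name `safe_remove_servers` and the 5% cap are hypothetical.

```python
# Hypothetical guarded removal command: validate the blast radius and
# require explicit confirmation before destroying capacity.
def safe_remove_servers(fleet, count, confirm, max_fraction=0.05):
    if count <= 0:
        raise ValueError("count must be positive")
    # Blast-radius cap: refuse to remove more than a small fraction at once
    if count > len(fleet) * max_fraction:
        raise ValueError(
            f"refusing to remove {count} of {len(fleet)} servers: "
            f"exceeds blast-radius cap of {max_fraction:.0%}"
        )
    # Confirmation step: the operator must type back the exact intent
    expected = f"remove {count}"
    if confirm != expected:
        raise ValueError(f"confirmation mismatch: expected '{expected}'")
    return fleet[:count]

fleet = [f"server-{i}" for i in range(1000)]
removed = safe_remove_servers(fleet, 4, "remove 4")  # 4 of 1000: allowed
```

A fat-fingered `count=400` would be rejected by the cap rather than silently draining 40% of the fleet.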
The GitHub Outage (October 2018)#
What Happened#
A network partition between East Coast and West Coast data centers lasted just 43 seconds. It caused a split-brain condition: both sides elected themselves primary and accepted writes independently, leading to data inconsistencies that took roughly 24 hours to fully resolve.
// Java: Simulating the GitHub split-brain scenario
import java.util.*;
import java.util.concurrent.*;
import java.time.Instant;

public class SplitBrainScenario {

    enum DatacenterLocation {
        EAST_COAST,
        WEST_COAST
    }

    static class Datacenter {
        private final DatacenterLocation location;
        private boolean isPrimary;
        private final Map<String, RepositoryData> repositories;
        private boolean canReachPeer;
        private long primaryElectedAt;

        public Datacenter(DatacenterLocation location) {
            this.location = location;
            this.isPrimary = false;
            this.repositories = new ConcurrentHashMap<>();
            this.canReachPeer = true;
        }

        public void detectPartition() {
            this.canReachPeer = false;
            System.out.printf(
                "[%s] Network partition detected! Cannot reach peer%n",
                location
            );
            // Both datacenters attempt to become primary
            electSelfAsPrimary();
        }

        private void electSelfAsPrimary() {
            if (!isPrimary) {
                this.isPrimary = true;
                this.primaryElectedAt = System.currentTimeMillis();
                System.out.printf(
                    "[%s] Elected self as PRIMARY (split-brain!)%n",
                    location
                );
            }
        }

        public void acceptWrite(String repoId, String data) {
            if (!isPrimary) {
                System.out.printf(
                    "[%s] Rejecting write - not primary%n",
                    location
                );
                return;
            }
            RepositoryData repo = repositories.computeIfAbsent(
                repoId,
                id -> new RepositoryData(id, location)
            );
            repo.commits.add(new Commit(
                UUID.randomUUID().toString(),
                data,
                Instant.now(),
                location
            ));
            System.out.printf(
                "[%s] Accepted write to %s: %s%n",
                location, repoId, data
            );
        }

        public void healPartition(Datacenter peer) {
            this.canReachPeer = true;
            System.out.printf(
                "[%s] Partition healed, can reach peer again%n",
                location
            );
            // Detect split-brain
            if (this.isPrimary && peer.isPrimary) {
                System.out.println("\n!!! SPLIT-BRAIN DETECTED !!!");
                System.out.println("Both datacenters accepted writes independently");
                resolveInconsistencies(peer);
            }
        }

        private void resolveInconsistencies(Datacenter peer) {
            System.out.println("\n=== Inconsistency Resolution ===");
            Set<String> allRepos = new HashSet<>();
            allRepos.addAll(this.repositories.keySet());
            allRepos.addAll(peer.repositories.keySet());
            for (String repoId : allRepos) {
                RepositoryData thisRepo = this.repositories.get(repoId);
                RepositoryData peerRepo = peer.repositories.get(repoId);
                if (thisRepo != null && peerRepo != null) {
                    if (!thisRepo.commits.equals(peerRepo.commits)) {
                        System.out.printf(
                            "Repository %s has DIVERGED:%n",
                            repoId
                        );
                        System.out.printf(
                            "  %s: %d commits%n",
                            this.location,
                            thisRepo.commits.size()
                        );
                        System.out.printf(
                            "  %s: %d commits%n",
                            peer.location,
                            peerRepo.commits.size()
                        );
                        // The real GitHub incident required manual reconciliation
                        System.out.println(
                            "  Resolution: Manual reconciliation required"
                        );
                    }
                }
            }
            System.out.printf(
                "%nResolution took: ~24 hours in actual incident%n"
            );
        }
    }

    static class RepositoryData {
        String repoId;
        DatacenterLocation primaryLocation;
        List<Commit> commits;

        public RepositoryData(String repoId, DatacenterLocation location) {
            this.repoId = repoId;
            this.primaryLocation = location;
            this.commits = new ArrayList<>();
        }
    }

    static class Commit {
        String commitId;
        String data;
        Instant timestamp;
        DatacenterLocation origin;

        public Commit(
            String commitId,
            String data,
            Instant timestamp,
            DatacenterLocation origin
        ) {
            this.commitId = commitId;
            this.data = data;
            this.timestamp = timestamp;
            this.origin = origin;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("=== GitHub Split-Brain Incident (Oct 2018) ===\n");
        Datacenter eastCoast = new Datacenter(DatacenterLocation.EAST_COAST);
        Datacenter westCoast = new Datacenter(DatacenterLocation.WEST_COAST);

        // Initially, east coast is primary
        eastCoast.isPrimary = true;

        // Normal operation
        eastCoast.acceptWrite("repo-1", "commit-A");
        Thread.sleep(100);

        // Network partition occurs (43 seconds in real incident)
        System.out.println("\n--- NETWORK PARTITION BEGINS ---\n");
        eastCoast.detectPartition();
        westCoast.detectPartition();

        // Both datacenters accept writes independently (split-brain!)
        Thread.sleep(100);
        eastCoast.acceptWrite("repo-1", "commit-B-east");
        westCoast.acceptWrite("repo-1", "commit-C-west");
        Thread.sleep(100);
        eastCoast.acceptWrite("repo-1", "commit-D-east");
        westCoast.acceptWrite("repo-1", "commit-E-west");
        Thread.sleep(100);

        // Partition heals
        System.out.println("\n--- NETWORK PARTITION HEALS ---\n");
        eastCoast.healPartition(westCoast);

        System.out.println("\n=== Lessons Learned ===");
        System.out.println("1. Need fencing tokens to prevent split-brain");
        System.out.println("2. Automatic failover must be carefully designed");
        System.out.println("3. Monitor and alert on inconsistencies");
        System.out.println("4. Test partition scenarios regularly");
        System.out.println("5. Have runbooks for manual reconciliation");
        System.out.println("6. Consider strongly consistent coordination (Raft/Paxos)");
    }
}
Lessons Learned#
- Fencing Tokens: Use increasing tokens to detect stale primaries
- Consensus Protocols: Implement proper leader election (Raft/Paxos)
- Monitoring: Detect split-brain conditions automatically
- Testing: Regular chaos engineering with partition testing
- Reconciliation: Have procedures for data divergence resolution
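The fencing-token idea is simple enough to sketch: a coordination service hands out a monotonically increasing token with each leadership grant, and replicas reject any write carrying a token lower than the highest one they have seen. A minimal in-memory illustration (the class names `LockService` and `Replica` are hypothetical, not from any real system):

```python
# Minimal fencing-token sketch: the lock service issues increasing
# tokens, and the replica fences off writes from stale leaders.
class LockService:
    def __init__(self):
        self.token = 0

    def acquire_leadership(self) -> int:
        self.token += 1          # each new leader gets a strictly higher token
        return self.token

class Replica:
    def __init__(self):
        self.highest_seen = 0
        self.log = []

    def write(self, token: int, entry: str) -> bool:
        if token < self.highest_seen:
            return False         # stale leader: write rejected (fenced)
        self.highest_seen = token
        self.log.append(entry)
        return True

lock = LockService()
replica = Replica()
old = lock.acquire_leadership()  # original primary holds token 1
new = lock.acquire_leadership()  # failover elects a new primary with token 2
replica.write(new, "commit-from-new-primary")   # accepted
replica.write(old, "commit-from-stale-primary") # rejected
```

With a scheme like this, the partitioned East Coast primary in the GitHub scenario could not have kept writing once the other side obtained a higher token.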
The Cloudflare Outage (July 2019)#
What Happened#
A regex rule deployed to the Web Application Firewall (WAF) exhibited catastrophic backtracking, consuming excessive CPU on every server worldwide and causing a global outage that lasted about 27 minutes.
// Node.js: Simulating catastrophic backtracking
class ReDoSExample {
    /**
     * Demonstrates catastrophic backtracking,
     * similar to the Cloudflare incident.
     */
    static demonstrateCatastrophicBacktracking() {
        console.log("=== Cloudflare ReDoS Incident (July 2019) ===\n");
        // Simplified version of the problematic regex.
        // Actual regex was: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
        const badRegex = /^(a+)+$/;
        // Test strings; the final, non-matching one is the worst case
        const testStrings = [
            "aaaaaa",
            "aaaaaaaaaaa",
            "aaaaaaaaaaaaaaaa",
            "aaaaaaaaaaaaaaaaaaaaaX" // Doesn't match - worst case
        ];
        console.log("Testing regex: /^(a+)+$/\n");
        // NOTE: regex matching in JavaScript is synchronous and cannot be
        // interrupted by a timer, so we simply measure how long each match
        // takes. The inputs are deliberately short - a few more characters
        // on the non-matching string would hang this process for minutes.
        testStrings.forEach(str => {
            const start = Date.now();
            const match = badRegex.test(str);
            const duration = Date.now() - start;
            console.log(`Input: "${str}" (length: ${str.length})`);
            console.log(`  Match: ${match}`);
            console.log(`  Time: ${duration}ms`);
            console.log(`  Complexity: O(2^n) on non-matching input - exponential!\n`);
        });
        this.showImpact();
        this.showSolution();
    }

    static showImpact() {
        console.log("=== Real Incident Impact ===");
        console.log("1. WAF rule deployed globally");
        console.log("2. Regex consumed excessive CPU on all servers");
        console.log("3. CPU exhaustion caused request failures");
        console.log("4. Global outage for 27 minutes");
        console.log("5. Affected millions of websites\n");
    }

    static showSolution() {
        console.log("=== Solutions ===");
        console.log("1. Validate regex patterns before deployment");
        console.log("2. Use regex analyzers to detect backtracking");
        console.log("3. Set CPU/time limits on regex execution");
        console.log("4. Gradual rollout instead of global deployment");
        console.log("5. Use specialized parsing libraries instead of regex");
        console.log("6. Implement circuit breakers\n");
    }

    static demonstrateSafeAlternative() {
        console.log("=== Safe Alternative ===\n");
        // Safe regex without nested quantifiers
        const safeRegex = /^a+$/;
        const testString = "a".repeat(100000); // Very long string
        const start = Date.now();
        const match = safeRegex.test(testString);
        const duration = Date.now() - start;
        console.log(`Safe regex: /^a+$/`);
        console.log(`Input length: ${testString.length}`);
        console.log(`Match: ${match}`);
        console.log(`Time: ${duration}ms`);
        console.log(`Complexity: O(n) - linear!`);
    }
}

// Demonstrate the incident
ReDoSExample.demonstrateCatastrophicBacktracking();
ReDoSExample.demonstrateSafeAlternative();
Lessons Learned#
- Regex Validation: Analyze regex for backtracking before deployment
- Gradual Rollout: Never deploy changes globally instantly
- Resource Limits: Implement CPU/time limits on operations
- Circuit Breakers: Detect and stop runaway processes
- Specialized Tools: Use parsers instead of complex regex
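One way to enforce the resource-limits lesson in Python is to run a suspect pattern under a wall-clock budget in a separate process, since a synchronous regex match cannot be interrupted from within its own process. A sketch, assuming a POSIX system where `multiprocessing` uses fork (the function names are illustrative):

```python
# Run a regex under a wall-clock budget in a worker process; if the
# budget is exceeded, the pool is terminated and the pattern flagged.
import re
from multiprocessing import Pool, TimeoutError as MpTimeout

def _match(args):
    pattern, text = args
    return bool(re.match(pattern, text))

def match_with_timeout(pattern, text, seconds=1.0):
    """Return True/False for a match, or None if the time budget is exceeded."""
    with Pool(processes=1) as pool:
        result = pool.apply_async(_match, ((pattern, text),))
        try:
            return result.get(timeout=seconds)
        except MpTimeout:
            # Likely catastrophic backtracking; the worker is killed
            # when the pool context exits.
            return None
```

A vetting pipeline could run every new WAF rule through a check like this against adversarial inputs before allowing deployment.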
The Google Cloud Networking Outage (June 2019)#
What Happened#
A configuration change intended for a small group of servers in a single region was accidentally applied to servers across multiple regions, causing widespread network congestion and routing failures for roughly four hours.
// C#: Configuration management failure scenario
using System;
using System.Collections.Generic;
using System.Linq;

public class ConfigurationManagementFailure
{
    public enum Region
    {
        US_EAST_1,
        US_WEST_1,
        EU_WEST_1,
        ASIA_EAST_1
    }

    public class NetworkConfiguration
    {
        public Region TargetRegion { get; set; }
        public bool ApplyGlobally { get; set; }
        public Dictionary<string, string> RoutingRules { get; set; }
        public int Version { get; set; }

        public NetworkConfiguration()
        {
            RoutingRules = new Dictionary<string, string>();
        }
    }

    public class ConfigurationManager
    {
        private readonly Dictionary<Region, NetworkConfiguration> _regionConfigs;
        private NetworkConfiguration _globalConfig;

        public ConfigurationManager()
        {
            _regionConfigs = new Dictionary<Region, NetworkConfiguration>();
            // Initialize regional configs
            foreach (Region region in Enum.GetValues(typeof(Region)))
            {
                _regionConfigs[region] = new NetworkConfiguration
                {
                    TargetRegion = region,
                    ApplyGlobally = false,
                    Version = 1
                };
            }
            _globalConfig = new NetworkConfiguration
            {
                ApplyGlobally = true,
                Version = 1
            };
        }

        public void ApplyConfiguration(NetworkConfiguration config)
        {
            Console.WriteLine($"\n=== Applying Configuration ===");
            Console.WriteLine($"Target Region: {config.TargetRegion}");
            Console.WriteLine($"Apply Globally: {config.ApplyGlobally}");
            Console.WriteLine($"Version: {config.Version}");
            // BUG: Flag was incorrectly set or interpreted
            if (config.ApplyGlobally)
            {
                Console.WriteLine("\n!!! WARNING: Applying to ALL REGIONS !!!");
                // Apply configuration globally (UNINTENDED)
                foreach (var region in _regionConfigs.Keys.ToList())
                {
                    _regionConfigs[region] = config;
                    Console.WriteLine($"  Applied to {region}");
                }
                Console.WriteLine("\n!!! OUTAGE: All regions affected !!!");
                SimulateGlobalOutage();
            }
            else
            {
                // Apply to single region (INTENDED)
                _regionConfigs[config.TargetRegion] = config;
                Console.WriteLine($"Applied to {config.TargetRegion} only");
            }
        }

        private void SimulateGlobalOutage()
        {
            Console.WriteLine("\nOutage Effects:");
            Console.WriteLine("- Network routing disrupted globally");
            Console.WriteLine("- Services unable to reach backends");
            Console.WriteLine("- Inter-region communication broken");
            Console.WriteLine("- User-facing services down");
            Console.WriteLine("\nDuration: ~4 hours (actual incident)");
        }
    }

    public static void Main()
    {
        Console.WriteLine("=== Google Cloud Networking Outage (June 2019) ===\n");
        var configManager = new ConfigurationManager();

        // Engineer intends to update US-EAST-1 only
        var regionalUpdate = new NetworkConfiguration
        {
            TargetRegion = Region.US_EAST_1,
            ApplyGlobally = false, // INTENDED
            Version = 2
        };
        regionalUpdate.RoutingRules["rule-1"] = "new-routing-config";

        // BUG: Flag somehow becomes true (UI bug, API bug, or misunderstanding)
        regionalUpdate.ApplyGlobally = true; // ACTUAL (bug)
        configManager.ApplyConfiguration(regionalUpdate);

        Console.WriteLine("\n=== Lessons Learned ===");
        Console.WriteLine("1. Separate regional and global configuration systems");
        Console.WriteLine("2. Require explicit approval for global changes");
        Console.WriteLine("3. Implement dry-run mode for configuration changes");
        Console.WriteLine("4. Gradual rollout with automatic rollback");
        Console.WriteLine("5. Clear UI/API design to prevent misinterpretation");
        Console.WriteLine("6. Configuration validation and diff review");
        Console.WriteLine("7. Blast radius limitation for changes");
    }
}
Lessons Learned#
- Separation of Concerns: Separate global and regional configs
- Change Validation: Require review for wide-impact changes
- Gradual Rollout: Never apply config changes globally at once
- Automated Rollback: Detect issues and rollback automatically
- Clear Interfaces: Design APIs/UIs to prevent misinterpretation
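The gradual-rollout-with-rollback lesson can be sketched generically: apply the change one region at a time, verify health after each step, and unwind everything already applied on the first failure. A toy illustration; `staged_rollout` and its callback parameters are hypothetical placeholders, not any real deployment API:

```python
# Hypothetical staged rollout: apply the change per region, run a
# health check after each step, and roll back everything applied so
# far if any check fails.
def staged_rollout(regions, apply_change, health_check, rollback):
    applied = []
    for region in regions:
        apply_change(region)
        applied.append(region)
        if not health_check(region):
            # Unwind in reverse order to restore the previous state
            for r in reversed(applied):
                rollback(r)
            return False, applied
    return True, applied

# Simulated run: the second region fails its health check,
# so both touched regions are rolled back.
state = {}
ok, touched = staged_rollout(
    ["us-east-1", "us-west-1", "eu-west-1"],
    apply_change=lambda r: state.__setitem__(r, "v2"),
    health_check=lambda r: r != "us-west-1",
    rollback=lambda r: state.pop(r),
)
```

The key property is that a bad change never reaches the third region at all; the blast radius is bounded by how far the rollout got before the first failed check.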
Common Failure Patterns#
# Python: Taxonomy of distributed system failures
from enum import Enum
from typing import List
from dataclasses import dataclass

class FailureCategory(Enum):
    CASCADING_FAILURE = "cascading_failure"
    RESOURCE_EXHAUSTION = "resource_exhaustion"
    SPLIT_BRAIN = "split_brain"
    TIMING_BUG = "timing_bug"
    CONFIGURATION_ERROR = "configuration_error"
    DEPENDENCY_FAILURE = "dependency_failure"

@dataclass
class FailurePattern:
    name: str
    category: FailureCategory
    description: str
    real_examples: List[str]
    prevention: List[str]
    detection: List[str]

def analyze_failure_patterns():
    patterns = [
        FailurePattern(
            name="Thundering Herd",
            category=FailureCategory.CASCADING_FAILURE,
            description="Many clients retry simultaneously after failure",
            real_examples=[
                "Facebook cache invalidation (2010)",
                "AWS Lambda cold starts under load"
            ],
            prevention=[
                "Exponential backoff with jitter",
                "Circuit breakers",
                "Request rate limiting",
                "Queue-based retry"
            ],
            detection=[
                "Spike in retry rate",
                "Correlated request patterns",
                "Backend overload metrics"
            ]
        ),
        FailurePattern(
            name="Retry Storm",
            category=FailureCategory.CASCADING_FAILURE,
            description="Retries overwhelm system preventing recovery",
            real_examples=[
                "Cloudflare API outage (2020)",
                "GitHub webhook delivery delays"
            ],
            prevention=[
                "Exponential backoff",
                "Max retry limits",
                "Circuit breakers",
                "Separate retry queues"
            ],
            detection=[
                "Increasing retry queue depth",
                "High retry-to-new-request ratio",
                "Latency increase despite low success rate"
            ]
        ),
        FailurePattern(
            name="Resource Leak",
            category=FailureCategory.RESOURCE_EXHAUSTION,
            description="Gradual resource consumption until exhaustion",
            real_examples=[
                "Netflix Hystrix thread pool exhaustion",
                "Memory leaks in long-running services"
            ],
            prevention=[
                "Resource limits and quotas",
                "Automatic resource cleanup",
                "Regular restarts for stateless services",
                "Memory profiling"
            ],
            detection=[
                "Gradual resource utilization increase",
                "Correlation with uptime",
                "Performance degradation over time"
            ]
        ),
        FailurePattern(
            name="Clock Skew",
            category=FailureCategory.TIMING_BUG,
            description="Clock drift causes incorrect ordering or timeouts",
            real_examples=[
                "AWS authentication failures",
                "Certificate validation errors"
            ],
            prevention=[
                "NTP synchronization",
                "Logical clocks (Lamport/Vector)",
                "Lease-based coordination",
                "Monitor clock drift"
            ],
            detection=[
                "Authentication failures",
                "Out-of-order events",
                "Unexpected timeouts"
            ]
        ),
        FailurePattern(
            name="Metadata Corruption",
            category=FailureCategory.SPLIT_BRAIN,
            description="Inconsistent metadata leads to incorrect routing",
            real_examples=[
                "GitHub split-brain (2018)",
                "Elasticsearch split-brain scenarios"
            ],
            prevention=[
                "Consensus protocols (Raft/Paxos)",
                "Fencing tokens",
                "Quorum-based operations",
                "Regular consistency checks"
            ],
            detection=[
                "Divergent state across replicas",
                "Multiple nodes claiming leadership",
                "Inconsistent query results"
            ]
        )
    ]

    print("=== Common Distributed System Failure Patterns ===\n")
    for pattern in patterns:
        print(f"{pattern.name} ({pattern.category.value})")
        print(f"  Description: {pattern.description}")
        print(f"  Real Examples:")
        for example in pattern.real_examples:
            print(f"    - {example}")
        print(f"  Prevention:")
        for prevention in pattern.prevention:
            print(f"    - {prevention}")
        print(f"  Detection:")
        for detection in pattern.detection:
            print(f"    - {detection}")
        print()

analyze_failure_patterns()
Chaos Engineering: Testing for Failures#
// Java: Chaos engineering framework
import java.util.*;

public class ChaosEngineeringFramework {

    public enum ChaosExperiment {
        NETWORK_PARTITION,
        HIGH_LATENCY,
        RESOURCE_EXHAUSTION,
        RANDOM_FAILURES,
        CLOCK_SKEW
    }

    public static class ChaosInjector {
        private final Random random = new Random();
        private final Set<ChaosExperiment> activeExperiments;

        public ChaosInjector() {
            this.activeExperiments = new HashSet<>();
        }

        public void startExperiment(ChaosExperiment experiment) {
            activeExperiments.add(experiment);
            System.out.printf(
                "Started chaos experiment: %s%n",
                experiment
            );
        }

        public void stopExperiment(ChaosExperiment experiment) {
            activeExperiments.remove(experiment);
            System.out.printf(
                "Stopped chaos experiment: %s%n",
                experiment
            );
        }

        public void injectLatency(int baseLatencyMs) throws InterruptedException {
            if (activeExperiments.contains(ChaosExperiment.HIGH_LATENCY)) {
                int injectedLatency = baseLatencyMs + random.nextInt(5000);
                System.out.printf(
                    "Injecting latency: %dms%n",
                    injectedLatency
                );
                Thread.sleep(injectedLatency);
            }
        }

        public boolean shouldFailRequest() {
            if (activeExperiments.contains(ChaosExperiment.RANDOM_FAILURES)) {
                return random.nextDouble() < 0.1; // 10% failure rate
            }
            return false;
        }

        public boolean isPartitioned(String nodeA, String nodeB) {
            if (activeExperiments.contains(ChaosExperiment.NETWORK_PARTITION)) {
                // Simulate a partition between east and west
                boolean aIsEast = nodeA.contains("east");
                boolean bIsWest = nodeB.contains("west");
                return aIsEast && bIsWest || !aIsEast && !bIsWest;
            }
            return false;
        }
    }

    public static class ResilientService {
        private final ChaosInjector chaos;
        private final CircuitBreaker circuitBreaker;

        public ResilientService(ChaosInjector chaos) {
            this.chaos = chaos;
            this.circuitBreaker = new CircuitBreaker(5, 30000);
        }

        public String processRequest(String request) {
            try {
                // Check circuit breaker
                if (!circuitBreaker.allowRequest()) {
                    System.out.println("Circuit breaker OPEN - rejecting request");
                    return "Service unavailable";
                }
                // Inject chaos
                chaos.injectLatency(100);
                if (chaos.shouldFailRequest()) {
                    throw new RuntimeException("Chaos-induced failure");
                }
                // Process request
                String result = "Processed: " + request;
                // Record success
                circuitBreaker.recordSuccess();
                return result;
            } catch (Exception e) {
                // Record failure
                circuitBreaker.recordFailure();
                System.out.printf(
                    "Request failed: %s%n",
                    e.getMessage()
                );
                throw new RuntimeException(e);
            }
        }
    }

    public static class CircuitBreaker {
        private final int failureThreshold;
        private final long resetTimeoutMs;
        private int failureCount = 0;
        private long lastFailureTime = 0;
        private boolean isOpen = false;

        public CircuitBreaker(int failureThreshold, long resetTimeoutMs) {
            this.failureThreshold = failureThreshold;
            this.resetTimeoutMs = resetTimeoutMs;
        }

        public boolean allowRequest() {
            if (!isOpen) {
                return true;
            }
            // Check whether we should try to close the circuit
            if (System.currentTimeMillis() - lastFailureTime > resetTimeoutMs) {
                System.out.println("Circuit breaker attempting to close");
                isOpen = false;
                failureCount = 0;
                return true;
            }
            return false;
        }

        public void recordSuccess() {
            failureCount = 0;
            isOpen = false;
        }

        public void recordFailure() {
            failureCount++;
            lastFailureTime = System.currentTimeMillis();
            if (failureCount >= failureThreshold) {
                isOpen = true;
                System.out.printf(
                    "Circuit breaker OPENED after %d failures%n",
                    failureCount
                );
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("=== Chaos Engineering Demo ===\n");
        ChaosInjector chaos = new ChaosInjector();
        ResilientService service = new ResilientService(chaos);

        // Normal operation
        System.out.println("Normal operation:");
        for (int i = 0; i < 3; i++) {
            service.processRequest("request-" + i);
        }

        // Start chaos experiments
        System.out.println("\nStarting chaos experiment...");
        chaos.startExperiment(ChaosExperiment.RANDOM_FAILURES);
        chaos.startExperiment(ChaosExperiment.HIGH_LATENCY);

        // Test resilience
        System.out.println("\nTesting with chaos:");
        for (int i = 0; i < 10; i++) {
            try {
                service.processRequest("chaos-request-" + i);
            } catch (Exception e) {
                // Service handles failures
            }
            Thread.sleep(100);
        }

        // Stop chaos
        chaos.stopExperiment(ChaosExperiment.RANDOM_FAILURES);
        chaos.stopExperiment(ChaosExperiment.HIGH_LATENCY);
        System.out.println("\nChaos experiment completed");
    }
}
Best Practices for Reliability#
Design for Failure:
- Assume everything will fail
- Implement timeouts and retries with backoff
- Use circuit breakers
- Design for graceful degradation
Test Failure Scenarios:
- Regular chaos engineering
- Partition testing
- Resource exhaustion testing
- Load testing beyond capacity
Monitoring and Alerting:
- Track error rates and latency
- Monitor resource utilization
- Alert on anomalies
- Distributed tracing
Incident Response:
- Clear escalation procedures
- Runbooks for common issues
- Blameless post-mortems
- Learn from every incident
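"Timeouts and retries with backoff" is worth pinning down, because naive fixed-interval retries are exactly what turn a brief blip into a retry storm. A minimal sketch of capped exponential backoff with full jitter (the helper name `retry_with_backoff` is illustrative):

```python
# Retry with capped exponential backoff and full jitter: each failed
# attempt doubles the maximum wait, and sleeping a random fraction of
# it decorrelates clients so they do not retry in lockstep.
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```

Pairing this with a circuit breaker (as in the chaos framework above) gives both client-side politeness and server-side protection.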
Summary#
Real-world distributed system failures teach valuable lessons about resilience, testing, and operational practices. Most incidents result from combinations of factors: timing, scale, configuration errors, and cascading failures. Building reliable systems requires designing for failure, comprehensive testing including chaos engineering, robust monitoring, and learning from incidents through post-mortems.
Key takeaways:
- Human errors are inevitable; design systems to limit their impact
- Test at production scale and in failure conditions
- Implement gradual rollouts for all changes
- Use chaos engineering to find weaknesses before production
- Learn from others’ failures to avoid repeating them
- Maintain detailed runbooks and incident response procedures