Introduction#
Learning from production failures is critical for building reliable distributed systems. This post analyzes real-world incidents from major tech companies, examining root causes, cascading failures, and lessons learned. Understanding these failure modes helps engineers anticipate problems and design more resilient systems.
The AWS S3 Outage (February 2017)#
What Happened#
An engineer executing a playbook to remove a small number of S3 servers accidentally removed a much larger set, including servers supporting two critical subsystems: the index subsystem, which manages metadata for all objects in the region, and the placement subsystem, which manages allocation of new storage. S3 in the affected region was unavailable for roughly four hours.
Root Cause#
A typo in the command-line argument removed far more capacity than intended. The subsystems that were taken offline needed to restart and rebuild their state, which took longer than expected at the new scale.
# Python: Simulating the S3 outage scenario
from typing import Set, Dict
from dataclasses import dataclass
from enum import Enum

class ServerState(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    REBUILDING = "rebuilding"

@dataclass
class Server:
    server_id: str
    state: ServerState
    data_shards: Set[str]

class DistributedStorageCluster:
    """Simulates the S3 subsystem failure scenario"""

    def __init__(self, num_servers: int):
        self.servers: Dict[str, Server] = {}
        self.index_system_servers: Set[str] = set()
        self.placement_system_servers: Set[str] = set()
        # Initialize servers
        for i in range(num_servers):
            server_id = f"server-{i}"
            self.servers[server_id] = Server(
                server_id=server_id,
                state=ServerState.RUNNING,
                data_shards=set()
            )
            # First 20% belong to the index subsystem
            if i < num_servers * 0.2:
                self.index_system_servers.add(server_id)
            # Next 10% belong to the placement subsystem
            elif i < num_servers * 0.3:
                self.placement_system_servers.add(server_id)

    def remove_servers_command(self, count: int):
        """
        DANGEROUS: Command to remove servers.
        Simulates the typo that caused the outage.
        """
        print(f"\n=== Executing remove command for {count} servers ===")
        # INTENDED: remove 'count' servers
        # ACTUAL: typo causes 'count' to be interpreted as a percentage
        actual_removed = int(len(self.servers) * (count / 100.0))
        print(f"INTENDED to remove: {count} servers")
        print(f"ACTUALLY removed: {actual_removed} servers due to typo!")
        removed_servers = list(self.servers.keys())[:actual_removed]
        critical_systems_affected = False
        for server_id in removed_servers:
            self.servers[server_id].state = ServerState.STOPPED
            # Check whether critical subsystems are affected
            if server_id in self.index_system_servers:
                critical_systems_affected = True
                print(f"WARNING: Index system server {server_id} removed!")
            if server_id in self.placement_system_servers:
                critical_systems_affected = True
                print(f"WARNING: Placement system server {server_id} removed!")
        if critical_systems_affected:
            print("\n!!! CRITICAL: Essential subsystems taken offline !!!")
            self.begin_recovery()

    def begin_recovery(self):
        """Simulate the slow recovery process"""
        print("\n=== Beginning Recovery Process ===")
        # The index subsystem must rebuild its state before serving requests
        print("Index system rebuilding state...")
        rebuild_time_hours = 4.5  # on the order of the actual incident's duration
        print(f"Estimated recovery time: {rebuild_time_hours} hours")
        print("Issue: System had grown significantly since last restart")
        print("Issue: State rebuilding not optimized for current scale")
        # In the real incident, this caused cascading failures
        print("\nCascading effects:")
        print("- S3 PUT/GET/DELETE requests failing")
        print("- AWS Console unable to load (uses S3 for assets)")
        print("- Many AWS services degraded (depend on S3)")
        print("- Public internet affected (many sites use S3)")

# Demonstrate the incident
print("=== AWS S3 Outage Simulation (Feb 2017) ===")
cluster = DistributedStorageCluster(num_servers=1000)
# The fateful command with the typo:
# intended to remove 40 servers; actually removed 40% of the fleet
cluster.remove_servers_command(count=40)
print("\n=== Lessons Learned ===")
print("1. Command-line tools need better input validation")
print("2. Critical operations should require confirmation")
print("3. Implement rate limiting on destructive operations")
print("4. Test recovery procedures at current scale")
print("5. Design for faster state rebuilding")
print("6. Gradual rollout of infrastructure changes")
Lessons Learned#
- Input Validation: Implement safeguards on destructive operations
- Confirmation Steps: Multi-step confirmation for critical commands
- Rate Limiting: Limit the speed of destructive operations
- Recovery Testing: Regularly test recovery at production scale
- Graceful Degradation: Design subsystems to function with reduced capacity
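The first two lessons can be sketched together: a removal command that checks its argument against a blast-radius cap and demands an explicit confirmation phrase before touching anything. This is a minimal illustration, not real AWS tooling; the function name `safe_remove_servers` and the 5% cap are hypothetical.

```python
# Hypothetical guarded removal command: validate the blast radius and
# require explicit confirmation before destroying capacity.
def safe_remove_servers(fleet, count, confirm, max_fraction=0.05):
    if count <= 0:
        raise ValueError("count must be positive")
    # Blast-radius cap: refuse to remove more than a small fraction at once
    if count > len(fleet) * max_fraction:
        raise ValueError(
            f"refusing to remove {count} of {len(fleet)} servers: "
            f"exceeds blast-radius cap of {max_fraction:.0%}"
        )
    # Confirmation step: the operator must type back the exact intent
    expected = f"remove {count}"
    if confirm != expected:
        raise ValueError(f"confirmation mismatch: expected '{expected}'")
    return fleet[:count]

fleet = [f"server-{i}" for i in range(1000)]
removed = safe_remove_servers(fleet, 4, "remove 4")  # 4 of 1000: allowed
```

A fat-fingered `count=400` would be rejected by the cap rather than silently draining 40% of the fleet.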
The GitHub Outage (October 2018)#
What Happened#
A network partition between East Coast and West Coast data centers lasted just 43 seconds. It caused a split-brain condition: both sides elected themselves primary and accepted writes independently, leading to data inconsistencies that took roughly 24 hours to fully resolve.
// Java: Simulating the GitHub split-brain scenario
import java.util.*;
import java.util.concurrent.*;
import java.time.Instant;

public class SplitBrainScenario {

    enum DatacenterLocation {
        EAST_COAST,
        WEST_COAST
    }

    static class Datacenter {
        private final DatacenterLocation location;
        private boolean isPrimary;
        private final Map<String, RepositoryData> repositories;
        private boolean canReachPeer;
        private long primaryElectedAt;

        public Datacenter(DatacenterLocation location) {
            this.location = location;
            this.isPrimary = false;
            this.repositories = new ConcurrentHashMap<>();
            this.canReachPeer = true;
        }

        public void detectPartition() {
            this.canReachPeer = false;
            System.out.printf(
                "[%s] Network partition detected! Cannot reach peer%n",
                location
            );
            // Both datacenters attempt to become primary
            electSelfAsPrimary();
        }

        private void electSelfAsPrimary() {
            if (!isPrimary) {
                this.isPrimary = true;
                this.primaryElectedAt = System.currentTimeMillis();
                System.out.printf(
                    "[%s] Elected self as PRIMARY (split-brain!)%n",
                    location
                );
            }
        }

        public void acceptWrite(String repoId, String data) {
            if (!isPrimary) {
                System.out.printf(
                    "[%s] Rejecting write - not primary%n",
                    location
                );
                return;
            }
            RepositoryData repo = repositories.computeIfAbsent(
                repoId,
                id -> new RepositoryData(id, location)
            );
            repo.commits.add(new Commit(
                UUID.randomUUID().toString(),
                data,
                Instant.now(),
                location
            ));
            System.out.printf(
                "[%s] Accepted write to %s: %s%n",
                location, repoId, data
            );
        }

        public void healPartition(Datacenter peer) {
            this.canReachPeer = true;
            System.out.printf(
                "[%s] Partition healed, can reach peer again%n",
                location
            );
            // Detect split-brain
            if (this.isPrimary && peer.isPrimary) {
                System.out.println("\n!!! SPLIT-BRAIN DETECTED !!!");
                System.out.println("Both datacenters accepted writes independently");
                resolveInconsistencies(peer);
            }
        }

        private void resolveInconsistencies(Datacenter peer) {
            System.out.println("\n=== Inconsistency Resolution ===");
            Set<String> allRepos = new HashSet<>();
            allRepos.addAll(this.repositories.keySet());
            allRepos.addAll(peer.repositories.keySet());
            for (String repoId : allRepos) {
                RepositoryData thisRepo = this.repositories.get(repoId);
                RepositoryData peerRepo = peer.repositories.get(repoId);
                if (thisRepo != null && peerRepo != null) {
                    if (!thisRepo.commits.equals(peerRepo.commits)) {
                        System.out.printf(
                            "Repository %s has DIVERGED:%n",
                            repoId
                        );
                        System.out.printf(
                            "  %s: %d commits%n",
                            this.location,
                            thisRepo.commits.size()
                        );
                        System.out.printf(
                            "  %s: %d commits%n",
                            peer.location,
                            peerRepo.commits.size()
                        );
                        // The real GitHub incident required manual reconciliation
                        System.out.println(
                            "  Resolution: Manual reconciliation required"
                        );
                    }
                }
            }
            System.out.printf(
                "%nResolution took: ~24 hours in actual incident%n"
            );
        }
    }

    static class RepositoryData {
        String repoId;
        DatacenterLocation primaryLocation;
        List<Commit> commits;

        public RepositoryData(String repoId, DatacenterLocation location) {
            this.repoId = repoId;
            this.primaryLocation = location;
            this.commits = new ArrayList<>();
        }
    }

    static class Commit {
        String commitId;
        String data;
        Instant timestamp;
        DatacenterLocation origin;

        public Commit(
            String commitId,
            String data,
            Instant timestamp,
            DatacenterLocation origin
        ) {
            this.commitId = commitId;
            this.data = data;
            this.timestamp = timestamp;
            this.origin = origin;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("=== GitHub Split-Brain Incident (Oct 2018) ===\n");
        Datacenter eastCoast = new Datacenter(DatacenterLocation.EAST_COAST);
        Datacenter westCoast = new Datacenter(DatacenterLocation.WEST_COAST);

        // Initially, east coast is primary
        eastCoast.isPrimary = true;

        // Normal operation
        eastCoast.acceptWrite("repo-1", "commit-A");
        Thread.sleep(100);

        // Network partition occurs (43 seconds in real incident)
        System.out.println("\n--- NETWORK PARTITION BEGINS ---\n");
        eastCoast.detectPartition();
        westCoast.detectPartition();

        // Both datacenters accept writes independently (split-brain!)
        Thread.sleep(100);
        eastCoast.acceptWrite("repo-1", "commit-B-east");
        westCoast.acceptWrite("repo-1", "commit-C-west");
        Thread.sleep(100);
        eastCoast.acceptWrite("repo-1", "commit-D-east");
        westCoast.acceptWrite("repo-1", "commit-E-west");
        Thread.sleep(100);

        // Partition heals
        System.out.println("\n--- NETWORK PARTITION HEALS ---\n");
        eastCoast.healPartition(westCoast);

        System.out.println("\n=== Lessons Learned ===");
        System.out.println("1. Need fencing tokens to prevent split-brain");
        System.out.println("2. Automatic failover must be carefully designed");
        System.out.println("3. Monitor and alert on inconsistencies");
        System.out.println("4. Test partition scenarios regularly");
        System.out.println("5. Have runbooks for manual reconciliation");
        System.out.println("6. Consider strongly consistent coordination (Raft/Paxos)");
    }
}
Lessons Learned#
- Fencing Tokens: Use increasing tokens to detect stale primaries
- Consensus Protocols: Implement proper leader election (Raft/Paxos)
- Monitoring: Detect split-brain conditions automatically
- Testing: Regular chaos engineering with partition testing
- Reconciliation: Have procedures for data divergence resolution
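The fencing-token idea is simple enough to sketch: a coordination service hands out a monotonically increasing token with each leadership grant, and replicas reject any write carrying a token lower than the highest one they have seen. A minimal in-memory illustration (the class names `LockService` and `Replica` are hypothetical, not from any real system):

```python
# Minimal fencing-token sketch: the lock service issues increasing
# tokens, and the replica fences off writes from stale leaders.
class LockService:
    def __init__(self):
        self.token = 0

    def acquire_leadership(self) -> int:
        self.token += 1          # each new leader gets a strictly higher token
        return self.token

class Replica:
    def __init__(self):
        self.highest_seen = 0
        self.log = []

    def write(self, token: int, entry: str) -> bool:
        if token < self.highest_seen:
            return False         # stale leader: write rejected (fenced)
        self.highest_seen = token
        self.log.append(entry)
        return True

lock = LockService()
replica = Replica()
old = lock.acquire_leadership()  # original primary holds token 1
new = lock.acquire_leadership()  # failover elects a new primary with token 2
replica.write(new, "commit-from-new-primary")   # accepted
replica.write(old, "commit-from-stale-primary") # rejected
```

With a scheme like this, the partitioned East Coast primary in the GitHub scenario could not have kept writing once the other side obtained a higher token.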
The Cloudflare Outage (July 2019)#
What Happened#
A regex rule deployed to the Web Application Firewall (WAF) exhibited catastrophic backtracking, consuming excessive CPU on every server worldwide and causing a global outage that lasted about 27 minutes.
// Node.js: Simulating catastrophic backtracking
class ReDoSExample {
    /**
     * Demonstrates catastrophic backtracking,
     * similar to the Cloudflare incident.
     */
    static demonstrateCatastrophicBacktracking() {
        console.log("=== Cloudflare ReDoS Incident (July 2019) ===\n");
        // Simplified version of the problematic regex.
        // Actual regex was: (?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
        const badRegex = /^(a+)+$/;
        // Test strings; the final, non-matching one is the worst case
        const testStrings = [
            "aaaaaa",
            "aaaaaaaaaaa",
            "aaaaaaaaaaaaaaaa",
            "aaaaaaaaaaaaaaaaaaaaaX" // Doesn't match - worst case
        ];
        console.log("Testing regex: /^(a+)+$/\n");
        // NOTE: regex matching in JavaScript is synchronous and cannot be
        // interrupted by a timer, so we simply measure how long each match
        // takes. The inputs are deliberately short - a few more characters
        // on the non-matching string would hang this process for minutes.
        testStrings.forEach(str => {
            const start = Date.now();
            const match = badRegex.test(str);
            const duration = Date.now() - start;
            console.log(`Input: "${str}" (length: ${str.length})`);
            console.log(`  Match: ${match}`);
            console.log(`  Time: ${duration}ms`);
            console.log(`  Complexity: O(2^n) on non-matching input - exponential!\n`);
        });
        this.showImpact();
        this.showSolution();
    }

    static showImpact() {
        console.log("=== Real Incident Impact ===");
        console.log("1. WAF rule deployed globally");
        console.log("2. Regex consumed excessive CPU on all servers");
        console.log("3. CPU exhaustion caused request failures");
        console.log("4. Global outage for 27 minutes");
        console.log("5. Affected millions of websites\n");
    }

    static showSolution() {
        console.log("=== Solutions ===");
        console.log("1. Validate regex patterns before deployment");
        console.log("2. Use regex analyzers to detect backtracking");
        console.log("3. Set CPU/time limits on regex execution");
        console.log("4. Gradual rollout instead of global deployment");
        console.log("5. Use specialized parsing libraries instead of regex");
        console.log("6. Implement circuit breakers\n");
    }

    static demonstrateSafeAlternative() {
        console.log("=== Safe Alternative ===\n");
        // Safe regex without nested quantifiers
        const safeRegex = /^a+$/;
        const testString = "a".repeat(100000); // Very long string
        const start = Date.now();
        const match = safeRegex.test(testString);
        const duration = Date.now() - start;
        console.log(`Safe regex: /^a+$/`);
        console.log(`Input length: ${testString.length}`);
        console.log(`Match: ${match}`);
        console.log(`Time: ${duration}ms`);
        console.log(`Complexity: O(n) - linear!`);
    }
}

// Demonstrate the incident
ReDoSExample.demonstrateCatastrophicBacktracking();
ReDoSExample.demonstrateSafeAlternative();
Lessons Learned#
- Regex Validation: Analyze regex for backtracking before deployment
- Gradual Rollout: Never deploy changes globally instantly
- Resource Limits: Implement CPU/time limits on operations
- Circuit Breakers: Detect and stop runaway processes
- Specialized Tools: Use parsers instead of complex regex
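One way to enforce the resource-limits lesson in Python is to run a suspect pattern under a wall-clock budget in a separate process, since a synchronous regex match cannot be interrupted from within its own process. A sketch, assuming a POSIX system where `multiprocessing` uses fork (the function names are illustrative):

```python
# Run a regex under a wall-clock budget in a worker process; if the
# budget is exceeded, the pool is terminated and the pattern flagged.
import re
from multiprocessing import Pool, TimeoutError as MpTimeout

def _match(args):
    pattern, text = args
    return bool(re.match(pattern, text))

def match_with_timeout(pattern, text, seconds=1.0):
    """Return True/False for a match, or None if the time budget is exceeded."""
    with Pool(processes=1) as pool:
        result = pool.apply_async(_match, ((pattern, text),))
        try:
            return result.get(timeout=seconds)
        except MpTimeout:
            # Likely catastrophic backtracking; the worker is killed
            # when the pool context exits.
            return None
```

A vetting pipeline could run every new WAF rule through a check like this against adversarial inputs before allowing deployment.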
The Google Cloud Networking Outage (June 2019)#
What Happened#
A configuration change intended for a small group of servers in a single region was accidentally applied to servers across multiple regions, causing widespread network congestion and routing failures for roughly four hours.
// C#: Configuration management failure scenario
using System;
using System.Collections.Generic;
using System.Linq;

public class ConfigurationManagementFailure
{
    public enum Region
    {
        US_EAST_1,
        US_WEST_1,
        EU_WEST_1,
        ASIA_EAST_1
    }

    public class NetworkConfiguration
    {
        public Region TargetRegion { get; set; }
        public bool ApplyGlobally { get; set; }
        public Dictionary<string, string> RoutingRules { get; set; }
        public int Version { get; set; }

        public NetworkConfiguration()
        {
            RoutingRules = new Dictionary<string, string>();
        }
    }

    public class ConfigurationManager
    {
        private readonly Dictionary<Region, NetworkConfiguration> _regionConfigs;
        private NetworkConfiguration _globalConfig;

        public ConfigurationManager()
        {
            _regionConfigs = new Dictionary<Region, NetworkConfiguration>();
            // Initialize regional configs
            foreach (Region region in Enum.GetValues(typeof(Region)))
            {
                _regionConfigs[region] = new NetworkConfiguration
                {
                    TargetRegion = region,
                    ApplyGlobally = false,
                    Version = 1
                };
            }
            _globalConfig = new NetworkConfiguration
            {
                ApplyGlobally = true,
                Version = 1
            };
        }

        public void ApplyConfiguration(NetworkConfiguration config)
        {
            Console.WriteLine($"\n=== Applying Configuration ===");
            Console.WriteLine($"Target Region: {config.TargetRegion}");
            Console.WriteLine($"Apply Globally: {config.ApplyGlobally}");
            Console.WriteLine($"Version: {config.Version}");
            // BUG: Flag was incorrectly set or interpreted
            if (config.ApplyGlobally)
            {
                Console.WriteLine("\n!!! WARNING: Applying to ALL REGIONS !!!");
                // Apply configuration globally (UNINTENDED)
                foreach (var region in _regionConfigs.Keys.ToList())
                {
                    _regionConfigs[region] = config;
                    Console.WriteLine($"  Applied to {region}");
                }
                Console.WriteLine("\n!!! OUTAGE: All regions affected !!!");
                SimulateGlobalOutage();
            }
            else
            {
                // Apply to single region (INTENDED)
                _regionConfigs[config.TargetRegion] = config;
                Console.WriteLine($"Applied to {config.TargetRegion} only");
            }
        }

        private void SimulateGlobalOutage()
        {
            Console.WriteLine("\nOutage Effects:");
            Console.WriteLine("- Network routing disrupted globally");
            Console.WriteLine("- Services unable to reach backends");
            Console.WriteLine("- Inter-region communication broken");
            Console.WriteLine("- User-facing services down");
            Console.WriteLine("\nDuration: ~4 hours (actual incident)");
        }
    }

    public static void Main()
    {
        Console.WriteLine("=== Google Cloud Networking Outage (June 2019) ===\n");
        var configManager = new ConfigurationManager();

        // Engineer intends to update US-EAST-1 only
        var regionalUpdate = new NetworkConfiguration
        {
            TargetRegion = Region.US_EAST_1,
            ApplyGlobally = false, // INTENDED
            Version = 2
        };
        regionalUpdate.RoutingRules["rule-1"] = "new-routing-config";

        // BUG: Flag somehow becomes true (UI bug, API bug, or misunderstanding)
        regionalUpdate.ApplyGlobally = true; // ACTUAL (bug)
        configManager.ApplyConfiguration(regionalUpdate);

        Console.WriteLine("\n=== Lessons Learned ===");
        Console.WriteLine("1. Separate regional and global configuration systems");
        Console.WriteLine("2. Require explicit approval for global changes");
        Console.WriteLine("3. Implement dry-run mode for configuration changes");
        Console.WriteLine("4. Gradual rollout with automatic rollback");
        Console.WriteLine("5. Clear UI/API design to prevent misinterpretation");
        Console.WriteLine("6. Configuration validation and diff review");
        Console.WriteLine("7. Blast radius limitation for changes");
    }
}
Lessons Learned#
- Separation of Concerns: Separate global and regional configs
- Change Validation: Require review for wide-impact changes
- Gradual Rollout: Never apply config changes globally at once
- Automated Rollback: Detect issues and rollback automatically
- Clear Interfaces: Design APIs/UIs to prevent misinterpretation
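The gradual-rollout-with-rollback lesson can be sketched generically: apply the change one region at a time, verify health after each step, and unwind everything already applied on the first failure. A toy illustration; `staged_rollout` and its callback parameters are hypothetical placeholders, not any real deployment API:

```python
# Hypothetical staged rollout: apply the change per region, run a
# health check after each step, and roll back everything applied so
# far if any check fails.
def staged_rollout(regions, apply_change, health_check, rollback):
    applied = []
    for region in regions:
        apply_change(region)
        applied.append(region)
        if not health_check(region):
            # Unwind in reverse order to restore the previous state
            for r in reversed(applied):
                rollback(r)
            return False, applied
    return True, applied

# Simulated run: the second region fails its health check,
# so both touched regions are rolled back.
state = {}
ok, touched = staged_rollout(
    ["us-east-1", "us-west-1", "eu-west-1"],
    apply_change=lambda r: state.__setitem__(r, "v2"),
    health_check=lambda r: r != "us-west-1",
    rollback=lambda r: state.pop(r),
)
```

The key property is that a bad change never reaches the third region at all; the blast radius is bounded by how far the rollout got before the first failed check.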
Common Failure Patterns#
# Python: Taxonomy of distributed system failures
from enum import Enum
from typing import List
from dataclasses import dataclass

class FailureCategory(Enum):
    CASCADING_FAILURE = "cascading_failure"
    RESOURCE_EXHAUSTION = "resource_exhaustion"
    SPLIT_BRAIN = "split_brain"
    TIMING_BUG = "timing_bug"
    CONFIGURATION_ERROR = "configuration_error"
    DEPENDENCY_FAILURE = "dependency_failure"

@dataclass
class FailurePattern:
    name: str
    category: FailureCategory
    description: str
    real_examples: List[str]
    prevention: List[str]
    detection: List[str]

def analyze_failure_patterns():
    patterns = [
        FailurePattern(
            name="Thundering Herd",
            category=FailureCategory.CASCADING_FAILURE,
            description="Many clients retry simultaneously after failure",
            real_examples=[
                "Facebook cache invalidation (2010)",
                "AWS Lambda cold starts under load"
            ],
            prevention=[
                "Exponential backoff with jitter",
                "Circuit breakers",
                "Request rate limiting",
                "Queue-based retry"
            ],
            detection=[
                "Spike in retry rate",
                "Correlated request patterns",
                "Backend overload metrics"
            ]
        ),
        FailurePattern(
            name="Retry Storm",
            category=FailureCategory.CASCADING_FAILURE,
            description="Retries overwhelm system preventing recovery",
            real_examples=[
                "Cloudflare API outage (2020)",
                "GitHub webhook delivery delays"
            ],
            prevention=[
                "Exponential backoff",
                "Max retry limits",
                "Circuit breakers",
                "Separate retry queues"
            ],
            detection=[
                "Increasing retry queue depth",
                "High retry-to-new-request ratio",
                "Latency increase despite low success rate"
            ]
        ),
        FailurePattern(
            name="Resource Leak",
            category=FailureCategory.RESOURCE_EXHAUSTION,
            description="Gradual resource consumption until exhaustion",
            real_examples=[
                "Netflix Hystrix thread pool exhaustion",
                "Memory leaks in long-running services"
            ],
            prevention=[
                "Resource limits and quotas",
                "Automatic resource cleanup",
                "Regular restarts for stateless services",
                "Memory profiling"
            ],
            detection=[
                "Gradual resource utilization increase",
                "Correlation with uptime",
                "Performance degradation over time"
            ]
        ),
        FailurePattern(
            name="Clock Skew",
            category=FailureCategory.TIMING_BUG,
            description="Clock drift causes incorrect ordering or timeouts",
            real_examples=[
                "AWS authentication failures",
                "Certificate validation errors"
            ],
            prevention=[
                "NTP synchronization",
                "Logical clocks (Lamport/Vector)",
                "Lease-based coordination",
                "Monitor clock drift"
            ],
            detection=[
                "Authentication failures",
                "Out-of-order events",
                "Unexpected timeouts"
            ]
        ),
        FailurePattern(
            name="Metadata Corruption",
            category=FailureCategory.SPLIT_BRAIN,
            description="Inconsistent metadata leads to incorrect routing",
            real_examples=[
                "GitHub split-brain (2018)",
                "Elasticsearch split-brain scenarios"
            ],
            prevention=[
                "Consensus protocols (Raft/Paxos)",
                "Fencing tokens",
                "Quorum-based operations",
                "Regular consistency checks"
            ],
            detection=[
                "Divergent state across replicas",
                "Multiple nodes claiming leadership",
                "Inconsistent query results"
            ]
        )
    ]

    print("=== Common Distributed System Failure Patterns ===\n")
    for pattern in patterns:
        print(f"{pattern.name} ({pattern.category.value})")
        print(f"  Description: {pattern.description}")
        print(f"  Real Examples:")
        for example in pattern.real_examples:
            print(f"    - {example}")
        print(f"  Prevention:")
        for prevention in pattern.prevention:
            print(f"    - {prevention}")
        print(f"  Detection:")
        for detection in pattern.detection:
            print(f"    - {detection}")
        print()

analyze_failure_patterns()
Chaos Engineering: Testing for Failures#
// Java: Chaos engineering framework
import java.util.*;

public class ChaosEngineeringFramework {

    public enum ChaosExperiment {
        NETWORK_PARTITION,
        HIGH_LATENCY,
        RESOURCE_EXHAUSTION,
        RANDOM_FAILURES,
        CLOCK_SKEW
    }

    public static class ChaosInjector {
        private final Random random = new Random();
        private final Set<ChaosExperiment> activeExperiments;

        public ChaosInjector() {
            this.activeExperiments = new HashSet<>();
        }

        public void startExperiment(ChaosExperiment experiment) {
            activeExperiments.add(experiment);
            System.out.printf(
                "Started chaos experiment: %s%n",
                experiment
            );
        }

        public void stopExperiment(ChaosExperiment experiment) {
            activeExperiments.remove(experiment);
            System.out.printf(
                "Stopped chaos experiment: %s%n",
                experiment
            );
        }

        public void injectLatency(int baseLatencyMs) throws InterruptedException {
            if (activeExperiments.contains(ChaosExperiment.HIGH_LATENCY)) {
                int injectedLatency = baseLatencyMs + random.nextInt(5000);
                System.out.printf(
                    "Injecting latency: %dms%n",
                    injectedLatency
                );
                Thread.sleep(injectedLatency);
            }
        }

        public boolean shouldFailRequest() {
            if (activeExperiments.contains(ChaosExperiment.RANDOM_FAILURES)) {
                return random.nextDouble() < 0.1; // 10% failure rate
            }
            return false;
        }

        public boolean isPartitioned(String nodeA, String nodeB) {
            if (activeExperiments.contains(ChaosExperiment.NETWORK_PARTITION)) {
                // Simulate a partition between east and west
                boolean aIsEast = nodeA.contains("east");
                boolean bIsWest = nodeB.contains("west");
                return aIsEast && bIsWest || !aIsEast && !bIsWest;
            }
            return false;
        }
    }

    public static class ResilientService {
        private final ChaosInjector chaos;
        private final CircuitBreaker circuitBreaker;

        public ResilientService(ChaosInjector chaos) {
            this.chaos = chaos;
            this.circuitBreaker = new CircuitBreaker(5, 30000);
        }

        public String processRequest(String request) {
            try {
                // Check circuit breaker
                if (!circuitBreaker.allowRequest()) {
                    System.out.println("Circuit breaker OPEN - rejecting request");
                    return "Service unavailable";
                }
                // Inject chaos
                chaos.injectLatency(100);
                if (chaos.shouldFailRequest()) {
                    throw new RuntimeException("Chaos-induced failure");
                }
                // Process request
                String result = "Processed: " + request;
                // Record success
                circuitBreaker.recordSuccess();
                return result;
            } catch (Exception e) {
                // Record failure
                circuitBreaker.recordFailure();
                System.out.printf(
                    "Request failed: %s%n",
                    e.getMessage()
                );
                throw new RuntimeException(e);
            }
        }
    }

    public static class CircuitBreaker {
        private final int failureThreshold;
        private final long resetTimeoutMs;
        private int failureCount = 0;
        private long lastFailureTime = 0;
        private boolean isOpen = false;

        public CircuitBreaker(int failureThreshold, long resetTimeoutMs) {
            this.failureThreshold = failureThreshold;
            this.resetTimeoutMs = resetTimeoutMs;
        }

        public boolean allowRequest() {
            if (!isOpen) {
                return true;
            }
            // Check whether we should try to close the circuit
            if (System.currentTimeMillis() - lastFailureTime > resetTimeoutMs) {
                System.out.println("Circuit breaker attempting to close");
                isOpen = false;
                failureCount = 0;
                return true;
            }
            return false;
        }

        public void recordSuccess() {
            failureCount = 0;
            isOpen = false;
        }

        public void recordFailure() {
            failureCount++;
            lastFailureTime = System.currentTimeMillis();
            if (failureCount >= failureThreshold) {
                isOpen = true;
                System.out.printf(
                    "Circuit breaker OPENED after %d failures%n",
                    failureCount
                );
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("=== Chaos Engineering Demo ===\n");
        ChaosInjector chaos = new ChaosInjector();
        ResilientService service = new ResilientService(chaos);

        // Normal operation
        System.out.println("Normal operation:");
        for (int i = 0; i < 3; i++) {
            service.processRequest("request-" + i);
        }

        // Start chaos experiments
        System.out.println("\nStarting chaos experiment...");
        chaos.startExperiment(ChaosExperiment.RANDOM_FAILURES);
        chaos.startExperiment(ChaosExperiment.HIGH_LATENCY);

        // Test resilience
        System.out.println("\nTesting with chaos:");
        for (int i = 0; i < 10; i++) {
            try {
                service.processRequest("chaos-request-" + i);
            } catch (Exception e) {
                // Service handles failures
            }
            Thread.sleep(100);
        }

        // Stop chaos
        chaos.stopExperiment(ChaosExperiment.RANDOM_FAILURES);
        chaos.stopExperiment(ChaosExperiment.HIGH_LATENCY);
        System.out.println("\nChaos experiment completed");
    }
}
Best Practices for Reliability#
Design for Failure:
- Assume everything will fail
- Implement timeouts and retries with backoff
- Use circuit breakers
- Design for graceful degradation
Test Failure Scenarios:
- Regular chaos engineering
- Partition testing
- Resource exhaustion testing
- Load testing beyond capacity
Monitoring and Alerting:
- Track error rates and latency
- Monitor resource utilization
- Alert on anomalies
- Distributed tracing
Incident Response:
- Clear escalation procedures
- Runbooks for common issues
- Blameless post-mortems
- Learn from every incident
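"Timeouts and retries with backoff" is worth pinning down, because naive fixed-interval retries are exactly what turn a brief blip into a retry storm. A minimal sketch of capped exponential backoff with full jitter (the helper name `retry_with_backoff` is illustrative):

```python
# Retry with capped exponential backoff and full jitter: each failed
# attempt doubles the maximum wait, and sleeping a random fraction of
# it decorrelates clients so they do not retry in lockstep.
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```

Pairing this with a circuit breaker (as in the chaos framework above) gives both client-side politeness and server-side protection.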
Summary#
Real-world distributed system failures teach valuable lessons about resilience, testing, and operational practices. Most incidents result from combinations of factors: timing, scale, configuration errors, and cascading failures. Building reliable systems requires designing for failure, comprehensive testing including chaos engineering, robust monitoring, and learning from incidents through post-mortems.
Key takeaways:
- Human errors are inevitable; design systems to limit their impact
- Test at production scale and in failure conditions
- Implement gradual rollouts for all changes
- Use chaos engineering to find weaknesses before production
- Learn from others’ failures to avoid repeating them
- Maintain detailed runbooks and incident response procedures