## Introduction
Chaos engineering is the practice of intentionally introducing failures into a system to discover weaknesses before they cause unplanned outages. Netflix popularized the concept with Chaos Monkey. The core idea: if your system will inevitably experience failures, you are better off discovering how it behaves under failure on your terms, not your customers’.
## Principles of Chaos Engineering
1. Define "steady state" — measurable normal behavior (e.g., p99 latency < 200ms, error rate < 0.1%).
2. Hypothesize: "We believe the system will maintain steady state when X fails."
3. Introduce failure in a controlled way:
   - Start in staging
   - Keep the blast radius small
   - Make it easy to halt
4. Observe: does steady state hold?
5. If not, you found a real weakness before it caused an outage. Fix it, add monitoring, and improve resilience.
6. Gradually expand scope (staging → canary → production).
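The loop above can be sketched as a small harness. This is a minimal sketch, not any framework's API: the steady-state check and the failure injector are assumed to be pluggable callables, and all names here are illustrative.

```python
import contextlib
import time
from dataclasses import dataclass
from typing import Callable, ContextManager


@dataclass
class ExperimentResult:
    held_during_chaos: bool  # did steady state survive the failure?
    recovered: bool          # did the system return to steady state?


def run_experiment(
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], ContextManager],
    settle_seconds: float = 0.0,
) -> ExperimentResult:
    """Steps 1-5: verify steady state, inject a failure, observe, verify recovery."""
    if not steady_state():
        # Never start an experiment against an already-unhealthy system.
        raise RuntimeError("baseline steady state not met; aborting")
    with inject_failure():
        time.sleep(settle_seconds)  # let the failure propagate
        held = steady_state()
    time.sleep(settle_seconds)      # allow recovery
    return ExperimentResult(held_during_chaos=held, recovered=steady_state())
```

The injector being a context manager is deliberate: the rollback runs in `__exit__` even if the observation code raises, which is the "easy to halt" property from step 3.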
## Simple Chaos Toolkit Experiment
```json
# chaos-experiment.json — using the Chaos Toolkit framework
{
  "version": "1.0.0",
  "title": "API handles database connection timeout",
  "description": "Verify the API returns 503 and not 500 when DB times out",
  "steady-state-hypothesis": {
    "title": "API is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "api-responds-to-health",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://api.production.svc/health",
          "timeout": 3
        }
      },
      {
        "type": "probe",
        "name": "error-rate-is-low",
        "tolerance": {"type": "range", "target": 0.5, "range": [0, 1]},
        "provider": {
          "type": "prometheus",
          "url": "http://prometheus:9090",
          "query": "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m])) * 100"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "add-network-delay-to-database",
      "provider": {
        "type": "process",
        "path": "tc",
        "arguments": "qdisc add dev eth0 root netem delay 5000ms"
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "remove-network-delay",
      "provider": {
        "type": "process",
        "path": "tc",
        "arguments": "qdisc del dev eth0 root"
      }
    }
  ]
}
```
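An experiment file like this is run with the Chaos Toolkit CLI: `chaos run chaos-experiment.json`. The steady-state probes are simple pass/fail checks; as a rough illustration of the semantics (not Chaos Toolkit's actual implementation), a range tolerance amounts to:

```python
def within_range(value: float, low: float, high: float) -> bool:
    """A range tolerance passes when the probed value falls inside [low, high]."""
    return low <= value <= high


# The error-rate probe above tolerates an error rate between 0% and 1%.
print(within_range(0.5, 0.0, 1.0))  # → True
print(within_range(2.3, 0.0, 1.0))  # → False
```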
## Python-Based Chaos Injection
```python
import subprocess
import time
import contextlib
import logging

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def network_latency(interface: str, latency_ms: int, jitter_ms: int = 10):
    """Add network latency to an interface using tc netem."""
    add_cmd = [
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "delay", f"{latency_ms}ms", f"{jitter_ms}ms",
    ]
    del_cmd = ["tc", "qdisc", "del", "dev", interface, "root"]
    logger.info("Adding %dms latency to %s", latency_ms, interface)
    subprocess.run(add_cmd, check=True)
    try:
        yield
    finally:
        logger.info("Removing latency from %s", interface)
        subprocess.run(del_cmd, check=True)


@contextlib.contextmanager
def packet_loss(interface: str, loss_percent: float):
    """Simulate packet loss."""
    subprocess.run([
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "loss", f"{loss_percent}%",
    ], check=True)
    try:
        yield
    finally:
        subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


@contextlib.contextmanager
def kill_process(service_name: str):
    """Stop a systemd service for the duration of the block."""
    subprocess.run(["systemctl", "stop", service_name], check=True)
    logger.info("Stopped %s", service_name)
    try:
        yield
    finally:
        subprocess.run(["systemctl", "start", service_name], check=True)
        logger.info("Restarted %s", service_name)


# Experiment
def test_api_handles_db_latency():
    from prometheus_api_client import PrometheusConnect

    prom = PrometheusConnect("http://prometheus:9090")

    def error_rate() -> float:
        result = prom.custom_query(
            'sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) * 100'
        )
        return float(result[0]["value"][1]) if result else 0.0

    # Verify steady state
    baseline = error_rate()
    assert baseline < 1.0, f"Baseline error rate already high: {baseline}%"

    # Inject failure
    with network_latency("eth0", latency_ms=3000):
        time.sleep(30)  # let failure propagate
        chaos_rate = error_rate()

    # Allow recovery
    time.sleep(30)
    recovered_rate = error_rate()
    logger.info(
        "Error rates — baseline: %.2f%%, during chaos: %.2f%%, recovered: %.2f%%",
        baseline, chaos_rate, recovered_rate,
    )
    assert recovered_rate < 1.0, f"System did not recover: {recovered_rate}%"
```
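The tc-based helpers above need root privileges and a real network interface. The same context-manager pattern also works for application-level fault injection that needs no privileges at all. Here is a hedged sketch (`flaky` is an illustrative helper, not a library API) that temporarily wraps a method so calls fail or slow down:

```python
import contextlib
import random
import time


@contextlib.contextmanager
def flaky(obj, attr: str, failure_rate: float = 0.5, delay_s: float = 0.0,
          exc=ConnectionError):
    """Temporarily wrap obj.attr so calls randomly fail and/or slow down."""
    original = getattr(obj, attr)

    def wrapper(*args, **kwargs):
        if delay_s:
            time.sleep(delay_s)  # injected latency
        if random.random() < failure_rate:
            raise exc(f"chaos: injected failure in {attr}")
        return original(*args, **kwargs)

    setattr(obj, attr, wrapper)
    try:
        yield
    finally:
        setattr(obj, attr, original)  # always restore the real method
```

With `failure_rate=1.0` the fault is deterministic, e.g. `with flaky(db_client, "query", failure_rate=1.0): ...` exercises the error-handling path on every call.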
## Kubernetes Chaos with Chaos Mesh
```yaml
# Chaos Mesh: Kubernetes-native chaos engineering
# Install: helm install chaos-mesh chaos-mesh/chaos-mesh

# Pod failure experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-failure
  namespace: production
spec:
  action: pod-kill       # or pod-failure (makes pods unavailable)
  mode: one              # kill one pod at a time
  selector:
    namespaces: [production]
    labelSelectors:
      app: api
  scheduler:
    cron: "@every 10m"   # run every 10 minutes
---
# Network partition: block traffic between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-to-db-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: api
  direction: to
  target:
    selector:
      namespaces: [production]
      labelSelectors:
        app: postgres
  duration: "30s"
---
# CPU stress test
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-cpu-stress
spec:
  mode: all
  selector:
    labelSelectors:
      app: api
  stressors:
    cpu:
      workers: 2
      load: 80           # 80% CPU load per worker
  duration: "60s"
```
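The `mode` field is what bounds the blast radius of each experiment. As a rough illustration of the semantics (this is not Chaos Mesh's code, just a sketch of how the common modes pick targets from the pods matched by the selector):

```python
import random


def select_targets(pods: list, mode: str, value: int = 0) -> list:
    """Illustrate Chaos Mesh selection modes: which matching pods get the fault."""
    if mode == "one":
        return [random.choice(pods)]        # one random pod
    if mode == "all":
        return list(pods)                   # every matching pod
    if mode == "fixed":
        return random.sample(pods, value)   # an exact number of pods
    if mode == "fixed-percent":
        count = max(1, len(pods) * value // 100)
        return random.sample(pods, count)   # a percentage of pods
    raise ValueError(f"unknown mode: {mode}")
```

Starting with `mode: one` and widening toward `mode: all` mirrors the "small blast radius first" principle from earlier.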
## AWS Fault Injection Simulator
```python
import time

import boto3

fis = boto3.client("fis", region_name="us-east-1")

# Create an experiment template (clientToken is required, for idempotency)
experiment_template = fis.create_experiment_template(
    clientToken="ec2-termination-recovery-v1",
    description="Test EC2 instance termination recovery",
    targets={
        "ec2-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Environment": "staging", "App": "api"},
            "selectionMode": "PERCENT(25)",  # terminate 25% of instances
        }
    },
    actions={
        "terminate-ec2": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "ec2-instances"},
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123:alarm:ApiErrorRateHigh",
        }
    ],
    roleArn="arn:aws:iam::123:role/FISRole",
)
template_id = experiment_template["experimentTemplate"]["id"]

# Run the experiment
experiment = fis.start_experiment(experimentTemplateId=template_id)
experiment_id = experiment["experiment"]["id"]

# Monitor until the experiment reaches a terminal state
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    print(f"Experiment status: {status}")
    if status in ("completed", "failed", "stopped"):
        break
    time.sleep(10)
```
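The CloudWatch stop condition lets AWS halt the experiment automatically, but you should also be able to halt it yourself from the monitoring loop. `stop_experiment` is the real FIS call; the guard function below is an illustrative sketch that takes the client as a parameter so it can be exercised with a stub:

```python
def halt_if_unhealthy(fis_client, experiment_id: str, is_unhealthy) -> bool:
    """Stop a running FIS experiment when a guard condition trips.

    is_unhealthy: a zero-argument callable returning True when the
    experiment must be halted (e.g. an error-rate check).
    """
    if is_unhealthy():
        fis_client.stop_experiment(id=experiment_id)
        return True
    return False
```

Inside the loop above this would look like `halt_if_unhealthy(fis, experiment_id, lambda: error_rate() > 5.0)`, with `error_rate` being whatever metric probe you already use for the steady-state check.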
## Chaos Engineering Runbook
```markdown
# Chaos Experiment Runbook

## Before the Experiment
- [ ] Inform the on-call engineer
- [ ] Verify monitoring dashboards are open
- [ ] Confirm you know how to halt the experiment
- [ ] Verify the steady-state hypothesis metrics
- [ ] Choose the smallest possible blast radius

## During the Experiment
- [ ] Monitor key metrics continuously
- [ ] Document observations in real time
- [ ] Halt immediately if:
  - Error rate exceeds 5%
  - Latency p99 exceeds 2x normal
  - Any data loss is observed
  - The experiment is not behaving as expected

## After the Experiment
- [ ] Verify the system returned to steady state
- [ ] Document: what failed, what held, what was surprising
- [ ] File issues for weaknesses discovered
- [ ] Add monitoring for failure modes discovered
- [ ] Schedule follow-up experiments after fixes
```
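The halt conditions in the runbook are worth encoding as a single guard so the decision is never made ad hoc mid-incident. The thresholds here mirror the checklist above and are only defaults to be tuned per system:

```python
def should_halt(error_rate_pct: float, p99_ms: float, baseline_p99_ms: float,
                data_loss: bool = False) -> bool:
    """Encode the runbook's halt conditions as one check."""
    return (
        error_rate_pct > 5.0              # error rate exceeds 5%
        or p99_ms > 2 * baseline_p99_ms   # p99 latency exceeds 2x normal
        or data_loss                      # any data loss observed
    )
```

Wire this into whatever loop is watching the experiment, and treat any `True` as an immediate rollback, not a judgment call.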
## Conclusion
Chaos engineering is not about breaking things randomly — it is about forming hypotheses, designing controlled experiments, and learning from the results. Start with staging, define clear steady-state metrics, and ensure you have a halt mechanism. The most valuable discoveries are the ones that reveal silent failure modes: services that fail without proper error propagation, circuit breakers that are not configured, or fallback paths that have never been tested. Each weakness found and fixed before a real incident is an outage prevented.