## Introduction
Chaos engineering is the practice of intentionally introducing failures into a system to discover weaknesses before they cause unplanned outages. Netflix popularized the concept with Chaos Monkey. The core idea: if your system will inevitably experience failures, you are better off discovering how it behaves under failure on your terms, not your customers’.
## Principles of Chaos Engineering
1. Define "steady state" — measurable normal behavior (e.g., p99 latency < 200ms, error rate < 0.1%).
2. Hypothesize: "We believe the system will maintain steady state when X fails."
3. Introduce failure in a controlled way:
   - Start in staging
   - Keep the blast radius small
   - Make it easy to halt
4. Observe: does steady state hold?
5. If not, you found a real weakness before it caused an outage. Fix it, add monitoring, and improve resilience.
6. Gradually expand scope (staging → canary → production).
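The loop above can be sketched as a small harness. This is a minimal sketch, not any framework's API: the steady-state check and the failure injector are assumed to be pluggable callables, and all names here are illustrative.

```python
import contextlib
import time
from dataclasses import dataclass
from typing import Callable, ContextManager


@dataclass
class ExperimentResult:
    held_during_chaos: bool  # did steady state survive the failure?
    recovered: bool          # did the system return to steady state?


def run_experiment(
    steady_state: Callable[[], bool],
    inject_failure: Callable[[], ContextManager],
    settle_seconds: float = 0.0,
) -> ExperimentResult:
    """Steps 1-5: verify steady state, inject a failure, observe, verify recovery."""
    if not steady_state():
        # Never start an experiment against an already-unhealthy system.
        raise RuntimeError("baseline steady state not met; aborting")
    with inject_failure():
        time.sleep(settle_seconds)  # let the failure propagate
        held = steady_state()
    time.sleep(settle_seconds)      # allow recovery
    return ExperimentResult(held_during_chaos=held, recovered=steady_state())
```

The injector being a context manager is deliberate: the rollback runs in `__exit__` even if the observation code raises, which is the "easy to halt" property from step 3.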
## Simple Chaos Toolkit Experiment
```json
# chaos-experiment.json — using the Chaos Toolkit framework
{
  "version": "1.0.0",
  "title": "API handles database connection timeout",
  "description": "Verify the API returns 503 and not 500 when DB times out",
  "steady-state-hypothesis": {
    "title": "API is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "api-responds-to-health",
        "tolerance": 200,
        "provider": {
          "type": "http",
          "url": "http://api.production.svc/health",
          "timeout": 3
        }
      },
      {
        "type": "probe",
        "name": "error-rate-is-low",
        "tolerance": {"type": "range", "target": 0.5, "range": [0, 1]},
        "provider": {
          "type": "prometheus",
          "url": "http://prometheus:9090",
          "query": "sum(rate(http_requests_total{status=~'5..'}[1m])) / sum(rate(http_requests_total[1m])) * 100"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "add-network-delay-to-database",
      "provider": {
        "type": "process",
        "path": "tc",
        "arguments": "qdisc add dev eth0 root netem delay 5000ms"
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "remove-network-delay",
      "provider": {
        "type": "process",
        "path": "tc",
        "arguments": "qdisc del dev eth0 root"
      }
    }
  ]
}
```
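An experiment file like this is run with the Chaos Toolkit CLI: `chaos run chaos-experiment.json`. The steady-state probes are simple pass/fail checks; as a rough illustration of the semantics (not Chaos Toolkit's actual implementation), a range tolerance amounts to:

```python
def within_range(value: float, low: float, high: float) -> bool:
    """A range tolerance passes when the probed value falls inside [low, high]."""
    return low <= value <= high


# The error-rate probe above tolerates an error rate between 0% and 1%.
print(within_range(0.5, 0.0, 1.0))  # → True
print(within_range(2.3, 0.0, 1.0))  # → False
```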
## Python-Based Chaos Injection
```python
import subprocess
import time
import contextlib
import logging

logger = logging.getLogger(__name__)


@contextlib.contextmanager
def network_latency(interface: str, latency_ms: int, jitter_ms: int = 10):
    """Add network latency to an interface using tc netem."""
    add_cmd = [
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "delay", f"{latency_ms}ms", f"{jitter_ms}ms",
    ]
    del_cmd = ["tc", "qdisc", "del", "dev", interface, "root"]
    logger.info("Adding %dms latency to %s", latency_ms, interface)
    subprocess.run(add_cmd, check=True)
    try:
        yield
    finally:
        logger.info("Removing latency from %s", interface)
        subprocess.run(del_cmd, check=True)


@contextlib.contextmanager
def packet_loss(interface: str, loss_percent: float):
    """Simulate packet loss."""
    subprocess.run([
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "loss", f"{loss_percent}%",
    ], check=True)
    try:
        yield
    finally:
        subprocess.run(["tc", "qdisc", "del", "dev", interface, "root"], check=True)


@contextlib.contextmanager
def kill_process(service_name: str):
    """Stop a systemd service for the duration of the block."""
    subprocess.run(["systemctl", "stop", service_name], check=True)
    logger.info("Stopped %s", service_name)
    try:
        yield
    finally:
        subprocess.run(["systemctl", "start", service_name], check=True)
        logger.info("Restarted %s", service_name)


# Experiment
def test_api_handles_db_latency():
    from prometheus_api_client import PrometheusConnect

    prom = PrometheusConnect("http://prometheus:9090")

    def error_rate() -> float:
        result = prom.custom_query(
            'sum(rate(http_requests_total{status=~"5.."}[1m])) / sum(rate(http_requests_total[1m])) * 100'
        )
        return float(result[0]["value"][1]) if result else 0.0

    # Verify steady state
    baseline = error_rate()
    assert baseline < 1.0, f"Baseline error rate already high: {baseline}%"

    # Inject failure
    with network_latency("eth0", latency_ms=3000):
        time.sleep(30)  # let failure propagate
        chaos_rate = error_rate()

    # Allow recovery
    time.sleep(30)
    recovered_rate = error_rate()
    logger.info(
        "Error rates — baseline: %.2f%%, during chaos: %.2f%%, recovered: %.2f%%",
        baseline, chaos_rate, recovered_rate,
    )
    assert recovered_rate < 1.0, f"System did not recover: {recovered_rate}%"
```
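The tc-based helpers above need root privileges and a real network interface. The same context-manager pattern also works for application-level fault injection that needs no privileges at all. Here is a hedged sketch (`flaky` is an illustrative helper, not a library API) that temporarily wraps a method so calls fail or slow down:

```python
import contextlib
import random
import time


@contextlib.contextmanager
def flaky(obj, attr: str, failure_rate: float = 0.5, delay_s: float = 0.0,
          exc=ConnectionError):
    """Temporarily wrap obj.attr so calls randomly fail and/or slow down."""
    original = getattr(obj, attr)

    def wrapper(*args, **kwargs):
        if delay_s:
            time.sleep(delay_s)  # injected latency
        if random.random() < failure_rate:
            raise exc(f"chaos: injected failure in {attr}")
        return original(*args, **kwargs)

    setattr(obj, attr, wrapper)
    try:
        yield
    finally:
        setattr(obj, attr, original)  # always restore the real method
```

With `failure_rate=1.0` the fault is deterministic, e.g. `with flaky(db_client, "query", failure_rate=1.0): ...` exercises the error-handling path on every call.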
## Kubernetes Chaos with Chaos Mesh
```yaml
# Chaos Mesh: Kubernetes-native chaos engineering
# Install: helm install chaos-mesh chaos-mesh/chaos-mesh

# Pod failure experiment
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: api-pod-failure
  namespace: production
spec:
  action: pod-kill       # or pod-failure (makes pods unavailable)
  mode: one              # kill one pod at a time
  selector:
    namespaces: [production]
    labelSelectors:
      app: api
  scheduler:
    cron: "@every 10m"   # run every 10 minutes
---
# Network partition: block traffic between services
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: api-to-db-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      app: api
  direction: to
  target:
    selector:
      namespaces: [production]
      labelSelectors:
        app: postgres
  duration: "30s"
---
# CPU stress test
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: api-cpu-stress
spec:
  mode: all
  selector:
    labelSelectors:
      app: api
  stressors:
    cpu:
      workers: 2
      load: 80           # 80% CPU load per worker
  duration: "60s"
```
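The `mode` field is what bounds the blast radius of each experiment. As a rough illustration of the semantics (this is not Chaos Mesh's code, just a sketch of how the common modes pick targets from the pods matched by the selector):

```python
import random


def select_targets(pods: list, mode: str, value: int = 0) -> list:
    """Illustrate Chaos Mesh selection modes: which matching pods get the fault."""
    if mode == "one":
        return [random.choice(pods)]        # one random pod
    if mode == "all":
        return list(pods)                   # every matching pod
    if mode == "fixed":
        return random.sample(pods, value)   # an exact number of pods
    if mode == "fixed-percent":
        count = max(1, len(pods) * value // 100)
        return random.sample(pods, count)   # a percentage of pods
    raise ValueError(f"unknown mode: {mode}")
```

Starting with `mode: one` and widening toward `mode: all` mirrors the "small blast radius first" principle from earlier.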
## AWS Fault Injection Simulator
```python
import time

import boto3

fis = boto3.client("fis", region_name="us-east-1")

# Create an experiment template (clientToken is required, for idempotency)
experiment_template = fis.create_experiment_template(
    clientToken="ec2-termination-recovery-v1",
    description="Test EC2 instance termination recovery",
    targets={
        "ec2-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"Environment": "staging", "App": "api"},
            "selectionMode": "PERCENT(25)",  # terminate 25% of instances
        }
    },
    actions={
        "terminate-ec2": {
            "actionId": "aws:ec2:terminate-instances",
            "targets": {"Instances": "ec2-instances"},
        }
    },
    stopConditions=[
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123:alarm:ApiErrorRateHigh",
        }
    ],
    roleArn="arn:aws:iam::123:role/FISRole",
)
template_id = experiment_template["experimentTemplate"]["id"]

# Run the experiment
experiment = fis.start_experiment(experimentTemplateId=template_id)
experiment_id = experiment["experiment"]["id"]

# Monitor until the experiment reaches a terminal state
while True:
    status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
    print(f"Experiment status: {status}")
    if status in ("completed", "failed", "stopped"):
        break
    time.sleep(10)
```
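The CloudWatch stop condition lets AWS halt the experiment automatically, but you should also be able to halt it yourself from the monitoring loop. `stop_experiment` is the real FIS call; the guard function below is an illustrative sketch that takes the client as a parameter so it can be exercised with a stub:

```python
def halt_if_unhealthy(fis_client, experiment_id: str, is_unhealthy) -> bool:
    """Stop a running FIS experiment when a guard condition trips.

    is_unhealthy: a zero-argument callable returning True when the
    experiment must be halted (e.g. an error-rate check).
    """
    if is_unhealthy():
        fis_client.stop_experiment(id=experiment_id)
        return True
    return False
```

Inside the loop above this would look like `halt_if_unhealthy(fis, experiment_id, lambda: error_rate() > 5.0)`, with `error_rate` being whatever metric probe you already use for the steady-state check.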
## Chaos Engineering Runbook
```markdown
# Chaos Experiment Runbook

## Before the Experiment
- [ ] Inform the on-call engineer
- [ ] Verify monitoring dashboards are open
- [ ] Confirm you know how to halt the experiment
- [ ] Verify the steady-state hypothesis metrics
- [ ] Choose the smallest possible blast radius

## During the Experiment
- [ ] Monitor key metrics continuously
- [ ] Document observations in real time
- [ ] Halt immediately if:
  - Error rate exceeds 5%
  - Latency p99 exceeds 2x normal
  - Any data loss is observed
  - The experiment is not behaving as expected

## After the Experiment
- [ ] Verify the system returned to steady state
- [ ] Document: what failed, what held, what was surprising
- [ ] File issues for weaknesses discovered
- [ ] Add monitoring for failure modes discovered
- [ ] Schedule follow-up experiments after fixes
```
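The halt conditions in the runbook are worth encoding as a single guard so the decision is never made ad hoc mid-incident. The thresholds here mirror the checklist above and are only defaults to be tuned per system:

```python
def should_halt(error_rate_pct: float, p99_ms: float, baseline_p99_ms: float,
                data_loss: bool = False) -> bool:
    """Encode the runbook's halt conditions as one check."""
    return (
        error_rate_pct > 5.0              # error rate exceeds 5%
        or p99_ms > 2 * baseline_p99_ms   # p99 latency exceeds 2x normal
        or data_loss                      # any data loss observed
    )
```

Wire this into whatever loop is watching the experiment, and treat any `True` as an immediate rollback, not a judgment call.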
## Conclusion
Chaos engineering is not about breaking things randomly — it is about forming hypotheses, designing controlled experiments, and learning from the results. Start with staging, define clear steady-state metrics, and ensure you have a halt mechanism. The most valuable discoveries are the ones that reveal silent failure modes: services that fail without proper error propagation, circuit breakers that are not configured, or fallback paths that have never been tested. Each weakness found and fixed before a real incident is an outage prevented.