Chaos Engineering: Practical Guide

Introduction#

Chaos engineering validates that your system can tolerate real-world failures. The goal is not to break production, but to expose weak assumptions under controlled conditions and reduce the blast radius before a real incident occurs.

Define Steady-State Metrics#

Start with a measurable steady state, such as request success rate, latency, or queue depth. Without this baseline you cannot tell whether an injected failure actually degraded the system, so the experiment produces no actionable outcome.
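As a rough illustration, a baseline can be reduced to a couple of numbers computed over a window of observed requests. The sketch below is a minimal example, not a prescribed implementation: the `RequestSample` and `SteadyState` types, the choice of success rate plus p99 latency, and the nearest-rank percentile method are all assumptions made for this example. It uses Java records, so it assumes Java 16 or later.

```java
import java.util.List;

/** One observed request: whether it succeeded and how long it took (hypothetical type). */
record RequestSample(boolean success, long latencyMs) {}

/** Baseline summary computed from a window of samples (hypothetical type). */
record SteadyState(double successRate, long p99LatencyMs) {

    /** Summarizes a non-empty window of samples. */
    static SteadyState from(List<RequestSample> samples) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("need at least one sample");
        }
        long ok = samples.stream().filter(RequestSample::success).count();
        long[] sorted = samples.stream()
                .mapToLong(RequestSample::latencyMs)
                .sorted()
                .toArray();
        // Nearest-rank percentile: rank = ceil(0.99 * n), computed with integer arithmetic.
        int rank = (99 * sorted.length + 99) / 100;
        return new SteadyState((double) ok / sorted.length, sorted[rank - 1]);
    }

    /** True when this window stays within the given tolerances. */
    boolean within(double minSuccessRate, long maxP99LatencyMs) {
        return successRate >= minSuccessRate && p99LatencyMs <= maxP99LatencyMs;
    }
}
```

A call such as `SteadyState.from(window).within(0.999, 300)` then gives a single yes/no answer to compare before and during an experiment. The thresholds here are placeholders; real values should come from your own SLOs.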

Experiment Design#

A mature experiment has a clear hypothesis, a blast radius limit, and a rollback plan.

  • Define the hypothesis and expected steady-state.
  • Choose a single failure mode to inject.
  • Limit scope to a canary or small region.
  • Automate rollback when SLOs are violated.
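The checklist above can be sketched as a small runner that injects one failure, polls an SLO check, and always undoes the injection. This is an illustrative outline under stated assumptions: `ChaosExperiment`, `ExperimentRunner`, and the `Outcome` values are hypothetical names, and the fixed number of checks stands in for whatever polling or alerting mechanism you actually use.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical description of one experiment; the field names are illustrative. */
record ChaosExperiment(String hypothesis, String failureMode, String scope) {}

final class ExperimentRunner {
    /** Possible results of a run. */
    enum Outcome { HYPOTHESIS_HELD, ABORTED }

    /**
     * Injects the failure, evaluates the SLO check up to {@code checks} times,
     * and runs the rollback action in every case, including early aborts.
     */
    static Outcome run(ChaosExperiment experiment, Runnable inject, Runnable rollback,
                       BooleanSupplier sloHealthy, int checks) {
        System.out.println("Starting experiment: " + experiment);
        inject.run();
        try {
            for (int i = 0; i < checks; i++) {
                if (!sloHealthy.getAsBoolean()) {
                    // SLO violated: stop early; the finally block undoes the injection.
                    return Outcome.ABORTED;
                }
            }
            return Outcome.HYPOTHESIS_HELD;
        } finally {
            rollback.run();
        }
    }
}
```

Putting the rollback in a `finally` block means the injection is reverted even if the SLO check itself throws, which keeps the rollback path from depending on the happy path.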

Java Example: Latency Injection Filter#

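This Spring Boot servlet filter adds a fixed delay to every request when a feature flag is enabled. The delay is controlled by two properties, chaos.latency.enabled and chaos.latency.ms, and both default to values that leave the filter inactive. Try it in a staging environment before running production experiments.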

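// Imports assume Spring Boot 3 (Jakarta Servlet API); on Spring Boot 2, use javax.servlet.* instead.
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import java.io.IOException;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class ChaosLatencyFilter implements Filter {
    // Feature flag; defaults to off so the filter does nothing unless explicitly enabled.
    @Value("${chaos.latency.enabled:false}")
    private boolean enabled;

    // Fixed delay in milliseconds added to every request while the flag is on.
    @Value("${chaos.latency.ms:0}")
    private long latencyMs;

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (enabled && latencyMs > 0) {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can still observe the interruption.
                Thread.currentThread().interrupt();
            }
        }
        // Always continue the chain; the filter only delays requests, it never rejects them.
        chain.doFilter(request, response);
    }
}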

Safe Execution in Production#

Use guardrails like automatic aborts, limited concurrency, and an emergency stop. Always log the experiment state so incident reviews can correlate anomalies with chaos events.
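One possible shape for these guardrails is sketched below, using only the JDK. The `ChaosGuardrails` class and its method names are hypothetical; a real setup would more likely tie the stop switch to your alerting or feature-flag system. A `Semaphore` caps concurrent injections, an `AtomicBoolean` acts as the emergency stop, and every state change is logged with the experiment ID.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

/** Hypothetical guardrail wrapper around fault injections. */
final class ChaosGuardrails {
    private static final Logger LOG = Logger.getLogger(ChaosGuardrails.class.getName());

    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Semaphore slots;

    ChaosGuardrails(int maxConcurrentInjections) {
        this.slots = new Semaphore(maxConcurrentInjections);
    }

    /** Emergency stop: once called, no further injections are started. */
    void emergencyStop(String reason) {
        if (stopped.compareAndSet(false, true)) {
            LOG.warning("chaos experiments stopped: " + reason);
        }
    }

    /** Runs the injection only if not stopped and a slot is free; returns whether it ran. */
    boolean tryInject(String experimentId, Runnable injection) {
        if (stopped.get() || !slots.tryAcquire()) {
            LOG.info("chaos injection skipped: " + experimentId);
            return false;
        }
        try {
            LOG.info("chaos injection started: " + experimentId);
            injection.run();
            return true;
        } finally {
            // Free the slot and record the end of the injection, even if it threw.
            slots.release();
            LOG.info("chaos injection finished: " + experimentId);
        }
    }
}
```

Note that the stop flag only prevents new injections from starting; an injection already in progress still needs its own rollback.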

Conclusion#

Chaos engineering is a reliability practice, not a stunt. With precise hypotheses, scoped blast radius, and automated rollback, it becomes a controlled way to build resilience.
