Chaos Engineering: Practical Guide
Introduction
Chaos engineering validates that your system can tolerate real-world failures. The goal is not to break production but to expose weak assumptions under controlled conditions, and to reduce the blast radius of a real incident before it occurs.
Define Steady-State Metrics
Start with a measurable steady state, such as request success rate, latency, or queue depth. Without this baseline you cannot tell whether an injected failure changed system behavior, so the experiment produces no actionable outcome.
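A steady-state check can be as small as two thresholds compared against recent measurements. The sketch below is illustrative only: the class name `SteadyState`, the threshold parameters, and the nearest-rank percentile method are assumptions, not part of any particular monitoring stack.

```java
import java.util.Arrays;

/** Hypothetical steady-state check; names and thresholds are illustrative. */
final class SteadyState {
    private final double minSuccessRate; // e.g. 0.99 means 99% of requests must succeed
    private final long maxP99LatencyMs;  // upper bound on 99th-percentile latency

    SteadyState(double minSuccessRate, long maxP99LatencyMs) {
        this.minSuccessRate = minSuccessRate;
        this.maxP99LatencyMs = maxP99LatencyMs;
    }

    /** Fraction of successful requests; an empty window is treated as healthy. */
    static double successRate(long succeeded, long total) {
        return total == 0 ? 1.0 : (double) succeeded / total;
    }

    /** Nearest-rank percentile, computed on a sorted copy of the samples. */
    static long percentile(long[] latenciesMs, double pct) {
        if (latenciesMs.length == 0) {
            return 0L;
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    /** True when both the success rate and the p99 latency are within bounds. */
    boolean holds(long succeeded, long total, long[] latenciesMs) {
        return successRate(succeeded, total) >= minSuccessRate
                && percentile(latenciesMs, 99.0) <= maxP99LatencyMs;
    }
}
```

In practice the inputs would come from your metrics system. The point is that the baseline is written down as an explicit, testable predicate before any failure is injected.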
Experiment Design
A mature experiment has a clear hypothesis, a blast radius limit, and a rollback plan.
- Define the hypothesis and expected steady-state.
- Choose a single failure mode to inject.
- Limit scope to a canary or small region.
- Automate rollback when SLOs are violated.
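The checklist above can be sketched as a small experiment runner. This is a minimal illustration under stated assumptions: `ChaosExperiment`, the `MAX_BLAST_RADIUS_PERCENT` cap of 10, and the callback-based injection and rollback hooks are all hypothetical names chosen for the example.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical experiment runner: one failure mode, capped scope, automatic rollback. */
final class ChaosExperiment {
    enum Outcome { HYPOTHESIS_HELD, ABORTED }

    /** Illustrative cap on affected traffic; pick a value that fits your system. */
    static final int MAX_BLAST_RADIUS_PERCENT = 10;

    private final String hypothesis;
    private final int blastRadiusPercent;
    private final Runnable inject;                   // starts the single failure mode
    private final Runnable rollback;                 // removes it again
    private final BooleanSupplier steadyStateHolds;  // SLO check

    ChaosExperiment(String hypothesis, int blastRadiusPercent,
                    Runnable inject, Runnable rollback, BooleanSupplier steadyStateHolds) {
        if (blastRadiusPercent < 1 || blastRadiusPercent > MAX_BLAST_RADIUS_PERCENT) {
            throw new IllegalArgumentException("blast radius out of bounds: " + blastRadiusPercent);
        }
        this.hypothesis = hypothesis;
        this.blastRadiusPercent = blastRadiusPercent;
        this.inject = inject;
        this.rollback = rollback;
        this.steadyStateHolds = steadyStateHolds;
    }

    String hypothesis() {
        return hypothesis;
    }

    int blastRadiusPercent() {
        return blastRadiusPercent;
    }

    /** Injects the fault, checks the steady state repeatedly, and always rolls back. */
    Outcome run(int observations) {
        inject.run();
        try {
            for (int i = 0; i < observations; i++) {
                if (!steadyStateHolds.getAsBoolean()) {
                    return Outcome.ABORTED; // SLO violated: stop observing immediately
                }
            }
            return Outcome.HYPOTHESIS_HELD;
        } finally {
            rollback.run(); // runs on success, abort, and unexpected exceptions
        }
    }
}
```

Putting the rollback in a `finally` block is one way to make it automatic: the fault is removed whether the hypothesis held, the SLO check failed, or the check itself threw. A real runner would also wait between observations, which this sketch omits.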
Java Example: Latency Injection Filter
This Spring Boot filter adds a fixed delay to every request when a feature flag is enabled. Run it in a staging environment first, before any production experiment.
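import java.io.IOException;

import jakarta.servlet.Filter;          // use javax.servlet.* on Spring Boot 2.x
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class ChaosLatencyFilter implements Filter {

    // Feature flag; defaults to off, so the filter is a no-op unless enabled.
    @Value("${chaos.latency.enabled:false}")
    private boolean enabled;

    // Fixed delay, in milliseconds, added to every request.
    @Value("${chaos.latency.ms:0}")
    private long latencyMs;

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (enabled && latencyMs > 0) {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag, then let the request proceed.
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(request, response);
    }
}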
Safe Execution in Production
Use guardrails like automatic aborts, limited concurrency, and an emergency stop. Always log the experiment state so incident reviews can correlate anomalies with chaos events.
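These guardrails can be combined in one small gate that every fault injection must pass through. The following is a sketch under stated assumptions: `ChaosGuard`, `tryInject`, and `emergencyStop` are hypothetical names, and the log format is only an example of recording experiment state.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

/** Hypothetical guardrail: emergency stop, bounded concurrency, and state logging. */
final class ChaosGuard {
    private static final Logger LOG = Logger.getLogger(ChaosGuard.class.getName());

    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Semaphore slots;

    ChaosGuard(int maxConcurrentFaults) {
        this.slots = new Semaphore(maxConcurrentFaults);
    }

    /** Kill switch: once set, no further faults are injected. */
    void emergencyStop(String reason) {
        if (stopped.compareAndSet(false, true)) {
            LOG.warning("chaos stopped reason=" + reason);
        }
    }

    /** Runs the fault only if not stopped and a slot is free; returns whether it ran. */
    boolean tryInject(String experimentId, Runnable fault) {
        if (stopped.get() || !slots.tryAcquire()) {
            LOG.info("chaos skipped experiment=" + experimentId);
            return false;
        }
        try {
            LOG.info("chaos injecting experiment=" + experimentId);
            fault.run();
            return true;
        } finally {
            slots.release(); // free the slot even if the fault throws
        }
    }
}
```

Logging the experiment identifier on both the injected and the skipped path gives incident reviewers a timeline they can line up against anomalies. An automatic abort would call `emergencyStop` from whatever process monitors the steady-state metrics.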
Conclusion
Chaos engineering is a reliability practice, not a stunt. With precise hypotheses, scoped blast radius, and automated rollback, it becomes a controlled way to build resilience.