Introduction#
Chaos engineering tests whether your system can tolerate real-world failures. The goal is not to break production, but to expose weak assumptions under controlled conditions, so that you can shrink the blast radius of a failure before a real incident occurs.
Define Steady-State Metrics#
Start with a measurable steady state, such as request success rate, latency, or queue depth. Without this baseline you cannot tell whether an injected failure changed the system's behavior, so experiments do not produce actionable outcomes.
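As an illustration, a baseline can be captured as a small value object and compared against the window observed during an experiment. This is a minimal sketch, not a prescribed implementation: the SteadyState class, its fields, and the tolerance parameters are hypothetical names chosen for this example, and the 99th percentile uses the simple nearest-rank method.

```java
import java.util.Arrays;

/** Hypothetical helper: summarizes a window of request samples into steady-state metrics. */
final class SteadyState {
    final double successRate;  // fraction of successful requests, 0.0 to 1.0
    final long p99LatencyMs;   // 99th percentile latency (nearest-rank method)

    SteadyState(double successRate, long p99LatencyMs) {
        this.successRate = successRate;
        this.p99LatencyMs = p99LatencyMs;
    }

    /** Builds metrics from parallel arrays: one latency and one outcome per request. */
    static SteadyState measure(long[] latenciesMs, boolean[] succeeded) {
        if (latenciesMs.length == 0 || latenciesMs.length != succeeded.length) {
            throw new IllegalArgumentException("need one latency and one outcome per request");
        }
        int ok = 0;
        for (boolean s : succeeded) {
            if (s) {
                ok++;
            }
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.99 * sorted.length); // 1-based nearest rank
        return new SteadyState((double) ok / succeeded.length, sorted[rank - 1]);
    }

    /** True if this window stays within the given tolerances relative to the baseline. */
    boolean within(SteadyState baseline, double maxSuccessDrop, double maxLatencyFactor) {
        return successRate >= baseline.successRate - maxSuccessDrop
                && p99LatencyMs <= baseline.p99LatencyMs * maxLatencyFactor;
    }
}
```

The tolerances make the hypothesis explicit: for example, "success rate drops by at most one percentage point and p99 latency at most doubles while the failure is injected."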
Experiment Design#
A mature experiment has a clear hypothesis, a blast radius limit, and a rollback plan.
- Define the hypothesis and expected steady state.
- Choose a single failure mode to inject.
- Limit scope to a canary or small region.
- Automate rollback when SLOs are violated.
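The checklist above can be sketched as a small runner that injects one failure, polls an SLO check, and always rolls back. This is a sketch under stated assumptions: ChaosExperiment and its constructor arguments are hypothetical names, and the inject, rollback, and sloHealthy callbacks stand in for whatever tooling actually scopes the failure to a canary and reads your metrics.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical runner: one failure mode, periodic SLO checks, guaranteed rollback. */
final class ChaosExperiment {
    private final String hypothesis;
    private final Runnable inject;             // starts the single failure mode in the limited scope
    private final Runnable rollback;           // undoes the injection
    private final BooleanSupplier sloHealthy;  // returns true while SLOs hold

    ChaosExperiment(String hypothesis, Runnable inject, Runnable rollback, BooleanSupplier sloHealthy) {
        this.hypothesis = hypothesis;
        this.inject = inject;
        this.rollback = rollback;
        this.sloHealthy = sloHealthy;
    }

    /** Runs up to maxChecks SLO checks; returns true only if the hypothesis held throughout. */
    boolean run(int maxChecks, long checkIntervalMs) {
        System.out.println("chaos experiment start: " + hypothesis);
        try {
            inject.run();
            for (int i = 0; i < maxChecks; i++) {
                if (!sloHealthy.getAsBoolean()) {
                    return false; // SLO violated: abort early, the finally block rolls back
                }
                Thread.sleep(checkIntervalMs);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // treat interruption as an abort
            return false;
        } finally {
            rollback.run(); // runs on success, abort, and exceptions alike
            System.out.println("chaos experiment rolled back: " + hypothesis);
        }
    }
}
```

Putting the rollback in a finally block means there is no code path, including an unexpected exception from the injection itself, that leaves the failure active.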
Java Example: Latency Injection Filter#
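This servlet filter, registered as a Spring Boot component, adds a fixed delay to every request when the chaos.latency.enabled property is true. The delay is set in milliseconds by chaos.latency.ms. Both values are injected once at startup, so changing them requires a restart or a configuration refresh. Use it in a staging environment before running production experiments.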
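import java.io.IOException;

import jakarta.servlet.Filter;  // use javax.servlet.* on Spring Boot 2.x
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;

import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
public class ChaosLatencyFilter implements Filter {

    // Feature flag; off unless explicitly enabled.
    @Value("${chaos.latency.enabled:false}")
    private boolean enabled;

    // Fixed delay, in milliseconds, added to every request.
    @Value("${chaos.latency.ms:0}")
    private long latencyMs;

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (enabled && latencyMs > 0) {
            try {
                Thread.sleep(latencyMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and continue with the request.
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(request, response);
    }
}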
Safe Execution in Production#
Use guardrails like automatic aborts, limited concurrency, and an emergency stop. Always log the experiment state so incident reviews can correlate anomalies with chaos events.
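One possible shape for these guardrails is a small gate that every injection must pass through. The sketch below is an assumption-laden illustration: ChaosGuard, tryInject, and emergencyStop are hypothetical names, and it uses only java.util.concurrent and java.util.logging from the standard library. A Semaphore caps concurrent injections, an AtomicBoolean acts as the emergency stop, and each injection is logged with an experiment id so that incident reviews can correlate anomalies with it.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Logger;

/** Hypothetical gate: limits concurrent injections and supports an emergency stop. */
final class ChaosGuard {
    private static final Logger LOG = Logger.getLogger(ChaosGuard.class.getName());

    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Semaphore slots;

    ChaosGuard(int maxConcurrentInjections) {
        this.slots = new Semaphore(maxConcurrentInjections);
    }

    /** Emergency stop: every later call to tryInject on this guard is refused. */
    void emergencyStop(String reason) {
        if (stopped.compareAndSet(false, true)) {
            LOG.warning("chaos emergency stop: " + reason);
        }
    }

    /** Runs the injection only if not stopped and a slot is free; returns whether it ran. */
    boolean tryInject(String experimentId, Runnable injection) {
        if (stopped.get() || !slots.tryAcquire()) {
            return false;
        }
        try {
            LOG.info("chaos injection start id=" + experimentId);
            injection.run();
            return true;
        } finally {
            LOG.info("chaos injection end id=" + experimentId);
            slots.release();
        }
    }
}
```

An automatic abort can then be wired by calling emergencyStop from whatever component detects an SLO violation. Injections that are already running finish their current call, while new ones are refused.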
Conclusion#
Chaos engineering is a reliability practice, not a stunt. With precise hypotheses, scoped blast radius, and automated rollback, it becomes a controlled way to build resilience.