Infrastructure Drift: Detection and Prevention
Introduction
Infrastructure drift occurs when real-world resources diverge from the declared state in infrastructure-as-code (IaC). Drift erodes reliability, makes deployments unpredictable, and complicates incident response. Preventing it requires both detection and process discipline.
Common Causes of Drift
- Manual changes in the console during incidents.
- Emergency fixes applied without IaC updates.
- Hidden defaults or provider updates that alter configuration.
- Auto-scaling groups or managed services that mutate resource properties.
Drift Detection Strategies
Continuous IaC Validation
Run plan or diff operations on a schedule, not only at deploy time, and alert whenever they report a change that no commit accounts for.
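As a minimal sketch of such a scheduled check, assume Terraform is the IaC tool (the text above does not name one). `terraform plan -detailed-exitcode` exits with 0 when the diff is empty, 1 on error, and 2 when changes are pending. The names `PlanResult`, `classify_plan_exit_code`, and `run_drift_check` are hypothetical, and the alerting step is left out.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    IN_SYNC = "in_sync"
    DRIFT = "drift"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> PlanResult:
    """Map the exit code of `terraform plan -detailed-exitcode` to a result."""
    if code == 0:
        return PlanResult.IN_SYNC  # plan succeeded, empty diff
    if code == 2:
        return PlanResult.DRIFT    # plan succeeded, non-empty diff
    return PlanResult.ERROR        # 1 (or anything unexpected): the plan failed


def run_drift_check(workdir: str) -> PlanResult:
    """Run a plan in `workdir`; assumes `terraform init` has already been run there."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(proc.returncode)
```

A non-empty diff on an already-applied branch suggests drift, but it can also mean a merged change has not been applied yet, so an alert would ideally include the plan output for a human to judge.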
Configuration Baselines
Encode baseline security controls as policy-as-code so that deviations from the baseline are detected automatically and can be remediated.
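The idea can be illustrated with a small sketch in which each policy is a plain function over a resource's configuration. The policy names, the `encrypted` and `public` fields, and the sample resources are all assumptions made for illustration; real deployments would more likely rely on a dedicated policy engine than on hand-written checks like these.

```python
from typing import Callable

Resource = dict[str, object]
Policy = Callable[[Resource], bool]

# Hypothetical baseline: each policy returns True when the resource complies.
BASELINE: dict[str, Policy] = {
    "encryption-enabled": lambda r: r.get("encrypted") is True,
    "no-public-access": lambda r: r.get("public") is not True,
}


def evaluate_baseline(resources: dict[str, Resource]) -> list[tuple[str, str]]:
    """Return (resource_name, policy_name) pairs for every violation."""
    violations = []
    for name, resource in sorted(resources.items()):
        for policy_name, check in BASELINE.items():
            if not check(resource):
                violations.append((name, policy_name))
    return violations


# Illustrative snapshot; in practice this would come from the live environment.
resources = {
    "bucket-logs": {"encrypted": True, "public": False},
    "bucket-tmp": {"encrypted": False, "public": True},
}

for resource_name, policy_name in evaluate_baseline(resources):
    print(f"{resource_name} violates {policy_name}")
```

Keeping policies as data (a name mapped to a check) makes it straightforward to report which control failed, which is what a remediation workflow needs.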
Resource Inventory and Tagging
Use centralized inventory services and mandatory tagging policies to identify unmanaged resources.
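A rough sketch of how inventory and tagging data might be combined is shown below. It assumes the inventory is available as a mapping from resource ID to tags and that the set of IaC-managed IDs is known (for example, extracted from state). The tag names in `REQUIRED_TAGS`, the function `audit_inventory`, and the sample IDs are hypothetical.

```python
# Hypothetical mandatory tagging policy.
REQUIRED_TAGS = {"owner", "environment", "managed-by"}


def audit_inventory(inventory, managed_ids):
    """Return (unmanaged resource IDs, {resource ID: missing required tags})."""
    # Anything in the inventory that IaC does not know about is unmanaged.
    unmanaged = sorted(set(inventory) - set(managed_ids))
    missing_tags = {}
    for resource_id, tags in inventory.items():
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            missing_tags[resource_id] = sorted(missing)
    return unmanaged, missing_tags


# Illustrative data; a real inventory would come from a cloud inventory service.
inventory = {
    "i-001": {"owner": "platform", "environment": "prod", "managed-by": "iac"},
    "i-002": {"owner": "data"},
    "i-003": {},
}
managed_ids = {"i-001", "i-002"}

unmanaged, missing_tags = audit_inventory(inventory, managed_ids)
print("Unmanaged:", unmanaged)
print("Missing tags:", missing_tags)
```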
Drift Prevention Mechanisms
- Enforce change management through CI/CD pipelines.
- Restrict console access to break-glass accounts.
- Automate remediation via pull requests rather than manual edits.
- Include drift alerts in operational dashboards.
Example: Drift Snapshot Comparison
This Python example shows a simplified drift check that compares an expected configuration with a live snapshot. In practice, you would use a cloud SDK to gather the live state.
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExpectedSubnet:
    cidr: str
    public: bool


# Declared state, as it would be derived from IaC.
expected = {
    "subnet-a": ExpectedSubnet(cidr="10.0.1.0/24", public=True),
    "subnet-b": ExpectedSubnet(cidr="10.0.2.0/24", public=False),
}

# Live state; hard-coded here, normally gathered through a cloud SDK.
live_snapshot = {
    "subnet-a": {"cidr": "10.0.1.0/24", "public": True},
    "subnet-b": {"cidr": "10.0.2.0/24", "public": True},  # drifted: now public
}

for name, expected_config in expected.items():
    live_config = live_snapshot.get(name)
    if live_config is None:
        raise RuntimeError(f"Missing subnet: {name}")
    # Compare every declared field, not only "public".
    for field, want in asdict(expected_config).items():
        if live_config.get(field) != want:
            raise RuntimeError(f"Drift detected in {name}: {field}")
Operational Response
When drift is detected, decide whether to:
- Revert the live environment to the IaC state.
- Update IaC to reflect the intentional change.
- Escalate for security review if the change is unauthorized.
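The three options above can be sketched as a small triage function. The `DriftEvent` fields and the order of the checks (authorization first, then intent) are assumptions for illustration, not a prescribed process; in practice the two flags would be set by a human reviewer or derived from audit logs.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REVERT = "revert the live environment to the IaC state"
    UPDATE_IAC = "update IaC to reflect the change"
    ESCALATE = "escalate for security review"


@dataclass(frozen=True)
class DriftEvent:
    resource: str
    authorized: bool   # was the change made through an approved path?
    intentional: bool  # should the change be kept going forward?


def triage(event: DriftEvent) -> Action:
    """Pick a response for a drift event (hypothetical decision order)."""
    if not event.authorized:
        return Action.ESCALATE
    if event.intentional:
        return Action.UPDATE_IAC
    return Action.REVERT


event = DriftEvent(resource="subnet-b", authorized=True, intentional=False)
print(f"{event.resource}: {triage(event).value}")
```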
Conclusion
Drift is inevitable without automation. Combine pipeline enforcement, continuous drift detection, and strict access controls to keep infrastructure aligned with your declared state.