Introduction
Infrastructure drift occurs when real-world resources diverge from the declared state in infrastructure-as-code (IaC). Drift erodes reliability, makes deployments unpredictable, and complicates incident response. Preventing it requires both detection and process discipline.
Common Causes of Drift
- Manual changes in the console during incidents.
- Emergency fixes applied without IaC updates.
- Hidden defaults or provider updates that alter configuration.
- Auto-scaling groups or managed services that mutate resource properties.
Drift Detection Strategies
Continuous IaC Validation
Run plan or diff operations on a schedule, not only at deploy time, and alert whenever they report changes that no pending commit explains.
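One way to run such a check on a schedule is to wrap a plan command and classify its exit code. The sketch below assumes Terraform as the IaC tool (the text above does not name one) and relies on `terraform plan -detailed-exitcode`, which exits with 0 when there is no diff, 1 on error, and 2 when changes are present. The `check_plan` function, the status labels, and the injectable `run` parameter are illustrative names, not part of any tool.

```python
import subprocess
from typing import Callable

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = no diff, 1 = error, 2 = diff present.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "changes_pending"}


def check_plan(
    workdir: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run a plan in `workdir` and classify the outcome.

    `run` is injectable so the function can be exercised without
    Terraform installed (an assumption made for testability).
    """
    cmd = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]
    result = run(cmd, cwd=workdir, capture_output=True, text=True)
    # Treat any unknown exit code as an error rather than as "in sync".
    return PLAN_STATUS.get(result.returncode, "error")
```

A scheduler could call `check_plan` against the main branch and raise an alert whenever it returns `changes_pending` while no deployment is in progress.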
Configuration Baselines
Express baseline security controls as policy-as-code and evaluate live resources against them, so that deviations from the baseline are detected and can be remediated.
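As a rough illustration of the idea, a baseline can be modeled as a set of named rules that each accept or reject a resource. The resource shape (`id`, `type`, and property keys such as `encrypted`, `port`, `source`) and the two rules below are hypothetical; a real setup would more likely use a dedicated policy engine.

```python
from typing import Callable

Resource = dict  # hypothetical shape: {"id": ..., "type": ..., <properties>}


def bucket_encrypted(r: Resource) -> bool:
    # Only buckets are in scope; every bucket must be encrypted.
    return r["type"] != "bucket" or r.get("encrypted") is True


def no_open_ssh(r: Resource) -> bool:
    # Firewall rules must not expose port 22 to the whole internet.
    if r["type"] != "firewall_rule":
        return True
    return not (r.get("port") == 22 and r.get("source") == "0.0.0.0/0")


# Rule name -> check; a check returns True when the baseline is satisfied.
BASELINE: dict[str, Callable[[Resource], bool]] = {
    "bucket-encrypted": bucket_encrypted,
    "no-open-ssh": no_open_ssh,
}


def violations(resources: list[Resource]) -> list[tuple[str, str]]:
    """Return (resource id, rule name) for every failed rule."""
    return [
        (r["id"], rule_name)
        for r in resources
        for rule_name, check in BASELINE.items()
        if not check(r)
    ]
```

Keeping rules as small named functions makes each violation traceable to a single control, which helps when deciding how to remediate it.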
Resource Inventory and Tagging
Use centralized inventory services and mandatory tagging policies to identify unmanaged resources.
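The check described above might look like the following sketch. It assumes the inventory is already available as a list of dictionaries and that the set of IaC-managed resource IDs can be exported from state; the required tag names and the `find_unmanaged` helper are hypothetical.

```python
REQUIRED_TAGS = {"owner", "managed-by"}  # hypothetical tagging policy


def find_unmanaged(inventory: list[dict], managed_ids: set[str]) -> dict[str, list[str]]:
    """Map each flagged resource id to the reasons it was flagged."""
    flagged: dict[str, list[str]] = {}
    for resource in inventory:
        reasons = []
        # Tags required by policy but absent on the resource.
        missing = REQUIRED_TAGS - set(resource.get("tags", {}))
        if missing:
            reasons.append("missing tags: " + ", ".join(sorted(missing)))
        # Resources that exist in the account but not in IaC state.
        if resource["id"] not in managed_ids:
            reasons.append("not in IaC state")
        if reasons:
            flagged[resource["id"]] = reasons
    return flagged
```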
Drift Prevention Mechanisms
- Enforce change management through CI/CD pipelines.
- Restrict console access to break-glass accounts.
- Automate remediation via pull requests rather than manual edits.
- Include drift alerts in operational dashboards.
Example: Drift Snapshot Comparison
This Python example shows a simplified drift check that compares an expected configuration with a live snapshot. In practice, you would use a cloud SDK to gather the live state.
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedSubnet:
    cidr: str
    public: bool


# Declared state, as it would be derived from IaC.
expected = {
    "subnet-a": ExpectedSubnet(cidr="10.0.1.0/24", public=True),
    "subnet-b": ExpectedSubnet(cidr="10.0.2.0/24", public=False),
}

# Live state; subnet-b has drifted from private to public.
live_snapshot = {
    "subnet-a": {"cidr": "10.0.1.0/24", "public": True},
    "subnet-b": {"cidr": "10.0.2.0/24", "public": True},
}

for name, expected_config in expected.items():
    live_config = live_snapshot.get(name)
    if live_config is None:
        raise RuntimeError(f"Missing subnet: {name}")
    # Compare every declared field, not only `public`.
    if (
        live_config["cidr"] != expected_config.cidr
        or live_config["public"] != expected_config.public
    ):
        raise RuntimeError(f"Drift detected in {name}")
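The check above stops at the first problem it finds. For alerting, it is often more useful to collect every difference in one pass. The variant below is a sketch of that idea, not a replacement for a real diff tool; `find_drift` and its message format are illustrative, and it reuses the same `ExpectedSubnet` shape.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExpectedSubnet:
    cidr: str
    public: bool


def find_drift(expected: dict, live: dict) -> list[str]:
    """Return one finding per missing, changed, or undeclared subnet field."""
    findings = []
    for name, want in expected.items():
        have = live.get(name)
        if have is None:
            findings.append(f"{name}: missing from live snapshot")
            continue
        # Field-by-field comparison so the report names what changed.
        for field, value in asdict(want).items():
            if have.get(field) != value:
                findings.append(
                    f"{name}.{field}: expected {value!r}, found {have.get(field)!r}"
                )
    # Resources that exist live but were never declared.
    for name in sorted(live.keys() - expected.keys()):
        findings.append(f"{name}: not declared")
    return findings
```

An empty list means the snapshot matches the declared state; anything else can be sent to an alerting channel as is.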
Operational Response
When drift is detected, decide whether to:
- Revert the live environment to the IaC state.
- Update IaC to reflect the intentional change.
- Escalate for security review if the change is unauthorized.
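The three options above can be written down as a small triage function so that the decision is applied consistently. This is a sketch under assumptions: the two inputs (`authorized`, `should_persist`) are hypothetical signals that a reviewer or an audit log would have to supply, and the ordering, with security escalation first, is one reasonable choice rather than a prescribed process.

```python
from enum import Enum


class Response(Enum):
    REVERT = "revert the live environment to the IaC state"
    UPDATE_IAC = "update IaC to reflect the change"
    ESCALATE = "escalate for security review"


def triage(authorized: bool, should_persist: bool) -> Response:
    """Pick a response for a detected drift."""
    # Unauthorized changes go to security review regardless of content.
    if not authorized:
        return Response.ESCALATE
    # Authorized changes that should stay are codified in IaC.
    if should_persist:
        return Response.UPDATE_IAC
    # Authorized but temporary changes (e.g. an incident workaround) are reverted.
    return Response.REVERT
```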
Conclusion
Drift is inevitable without automation. Combine pipeline enforcement, continuous drift detection, and strict access controls to keep infrastructure aligned with your declared state.