Infrastructure Drift: Detection and Prevention

Introduction

Infrastructure drift occurs when real-world resources diverge from the declared state in infrastructure-as-code (IaC). Drift erodes reliability, makes deployments unpredictable, and complicates incident response. Preventing it requires both detection and process discipline.

Common Causes of Drift

  • Manual changes in the console during incidents.
  • Emergency fixes applied without IaC updates.
  • Hidden defaults or provider updates that alter configuration.
  • Auto-scaling groups or managed services that mutate resource properties.

Drift Detection Strategies

Continuous IaC Validation

Run plan or diff operations on a schedule, not only at deploy time, and alert whenever they report changes that no pending commit explains.
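A scheduled check of this kind can be sketched as below, assuming Terraform is the IaC tool (the post does not name one). With `-detailed-exitcode`, `terraform plan` exits with 0 when nothing would change, 1 on error, and 2 when changes are pending. The function names and outcome labels are illustrative, not part of any tool.

```python
import subprocess

# Exit codes of `terraform plan -detailed-exitcode`:
# 0 = no changes, 1 = error, 2 = changes pending.
PLAN_OUTCOMES = {0: "in_sync", 1: "error", 2: "changes_pending"}


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a coarse outcome; unknown codes count as errors."""
    return PLAN_OUTCOMES.get(code, "error")


def run_drift_check(workdir: str) -> str:
    """Run a plan in `workdir` (assumed to be an initialized Terraform directory)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A `changes_pending` result on a branch with no unapplied commits is a likely sign of drift; on a branch with unapplied commits it may simply reflect those commits, so the alerting logic has to tell the two apart.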

Configuration Baselines

Encode baseline security controls as policy-as-code so that any deviation from the baseline is detected automatically and can be remediated.
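The idea can be illustrated with a minimal sketch in plain Python. A real setup would more likely use a dedicated policy engine. The rule names and the resource fields (`encrypted`, `public`) are assumptions made for the example.

```python
from typing import Callable

Resource = dict
Rule = Callable[[Resource], bool]

# Hypothetical baseline: each rule returns True when the resource complies.
BASELINE: dict[str, Rule] = {
    "encryption-enabled": lambda r: r.get("encrypted") is True,
    "no-public-access": lambda r: r.get("public") is not True,
}


def violations(resource: Resource) -> list[str]:
    """Return the names of the baseline rules that the resource fails."""
    return [name for name, rule in BASELINE.items() if not rule(resource)]
```

Running `violations` over every resource in a live snapshot yields a list of baseline deviations that can feed alerts or automated remediation.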

Resource Inventory and Tagging

Use centralized inventory services and mandatory tagging policies to identify unmanaged resources.
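As a rough sketch, unmanaged resources can be flagged by checking an inventory export against a set of mandatory tags. The tag names (`managed-by`, `owner`) and the inventory shape (a list of dicts with `id` and `tags`) are assumptions for illustration. A real inventory would come from the cloud provider's inventory service.

```python
# Hypothetical mandatory tags; adjust to your own tagging policy.
REQUIRED_TAGS = {"managed-by", "owner"}


def find_unmanaged(inventory: list[dict]) -> dict[str, list[str]]:
    """Map each resource id to the sorted list of mandatory tags it lacks."""
    unmanaged = {}
    for resource in inventory:
        missing = REQUIRED_TAGS - set(resource.get("tags", {}))
        if missing:
            unmanaged[resource["id"]] = sorted(missing)
    return unmanaged
```

Resources that lack the `managed-by` tag are the most likely candidates for having been created outside the pipeline.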

Drift Prevention Mechanisms

  • Enforce change management through CI/CD pipelines.
  • Restrict console access to break-glass accounts.
  • Automate remediation via pull requests rather than manual edits.
  • Include drift alerts in operational dashboards.

Example: Drift Snapshot Comparison

This Python example shows a simplified drift check that compares an expected configuration with a live snapshot. In practice, you would use a cloud SDK to gather the live state.

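from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExpectedSubnet:
    cidr: str
    public: bool


# Declared state, as it would be derived from IaC.
expected = {
    "subnet-a": ExpectedSubnet(cidr="10.0.1.0/24", public=True),
    "subnet-b": ExpectedSubnet(cidr="10.0.2.0/24", public=False),
}

# Live state, hard-coded here in place of a cloud SDK call.
live_snapshot = {
    "subnet-a": {"cidr": "10.0.1.0/24", "public": True},
    "subnet-b": {"cidr": "10.0.2.0/24", "public": True},  # drifted
}

# Collect every finding instead of stopping at the first one.
findings = []
for name, expected_config in expected.items():
    live_config = live_snapshot.get(name)
    if live_config is None:
        findings.append(f"Missing subnet: {name}")
        continue
    # Compare every declared field, not only "public".
    for field, want in asdict(expected_config).items():
        have = live_config.get(field)
        if have != want:
            findings.append(f"Drift in {name}.{field}: expected {want!r}, found {have!r}")

if findings:
    raise RuntimeError("Drift detected:\n" + "\n".join(findings))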

Operational Response

When drift is detected, decide whether to:

  • Revert the live environment to the IaC state.
  • Update IaC to reflect the intentional change.
  • Escalate for security review if the change is unauthorized.
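The three options above can be sketched as a small triage function. The two inputs (`authorized`, `intentional`) are hypothetical flags that a reviewer or ticketing system would supply, and the ordering reflects one reasonable policy rather than a prescribed one.

```python
from enum import Enum


class Action(Enum):
    REVERT = "revert the live environment to the IaC state"
    UPDATE_IAC = "update IaC to reflect the intentional change"
    ESCALATE = "escalate for security review"


def triage(authorized: bool, intentional: bool) -> Action:
    """Pick a response to detected drift; unauthorized changes always escalate."""
    if not authorized:
        return Action.ESCALATE
    if intentional:
        return Action.UPDATE_IAC
    return Action.REVERT
```

Checking authorization first means an unauthorized change is never silently absorbed into IaC, even if it looks deliberate.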

Conclusion

Drift is inevitable without automation. Combine pipeline enforcement, continuous drift detection, and strict access controls to keep infrastructure aligned with your declared state.

This post is licensed under CC BY 4.0 by the author.