## Introduction
Toil is the manual, repetitive, automatable operational work that scales linearly with service growth. It is the opposite of engineering work that adds lasting value. Google’s SRE book defines the goal as keeping toil below 50% of an SRE’s time. Measuring, categorizing, and eliminating toil is a core SRE practice.
## What Counts as Toil
Toil characteristics:
- Manual: requires a human to execute
- Repetitive: same steps run again and again
- Automatable: a script could do it
- Tactical: reacting to events, not improving the system
- No lasting value: doing it again doesn't leave things better
- O(n) with service growth: more services = more toil
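The characteristics above can be read as a checklist. As a quick illustration (a hypothetical helper, not something from the SRE book), a task is flagged as toil only when every toil trait holds and the work leaves nothing lasting behind:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Illustrative task record; field names are assumptions for this sketch."""
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    lasting_value: bool

def is_toil(t: Task) -> bool:
    # Toil: manual, repetitive, automatable, tactical, and no lasting value
    return (t.manual and t.repetitive and t.automatable
            and t.tactical and not t.lasting_value)

restart = Task(manual=True, repetitive=True, automatable=True,
               tactical=True, lasting_value=False)
runbook = Task(manual=True, repetitive=False, automatable=False,
               tactical=False, lasting_value=True)
print(is_toil(restart), is_toil(runbook))  # True False
```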
Common toil categories:
- Manually restarting services after OOM crashes
- Responding to high-disk-usage alerts with cleanup scripts
- Updating config files across dozens of servers
- Provisioning new environments by hand
- Running the same runbook step-by-step every incident
- Copy-pasting access tokens for team members
- Manual certificate renewals
## Measuring Toil
```python
# Track toil hours per sprint or week
TOIL_LOG = {
    "week": "2025-W47",
    "engineer": "alice",
    "entries": [
        {
            "task": "restarted api service after OOM",
            "duration_minutes": 15,
            "is_toil": True,
            "category": "manual_intervention",
            "recurring": True,
            "automatable": True,
        },
        {
            "task": "certificate renewal for api.example.com",
            "duration_minutes": 30,
            "is_toil": True,
            "category": "certificate_management",
            "recurring": True,
            "automatable": True,
        },
        {
            "task": "wrote runbook for new database failover procedure",
            "duration_minutes": 120,
            "is_toil": False,  # engineering work, lasting value
            "category": "documentation",
            "recurring": False,
        },
    ],
}

def toil_percentage(entries: list[dict]) -> float:
    total = sum(e["duration_minutes"] for e in entries)
    toil = sum(e["duration_minutes"] for e in entries if e["is_toil"])
    return (toil / total * 100) if total else 0

print(f"Toil: {toil_percentage(TOIL_LOG['entries']):.1f}%")  # 27.3%
```
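The same kind of log can also be rolled up per category, which tells you which automation to build first. A minimal standalone sketch with illustrative entries:

```python
from collections import defaultdict

def toil_by_category(entries: list[dict]) -> dict[str, int]:
    """Sum toil minutes per category to rank automation targets."""
    totals: dict[str, int] = defaultdict(int)
    for e in entries:
        if e.get("is_toil"):
            totals[e["category"]] += e["duration_minutes"]
    return dict(totals)

entries = [
    {"is_toil": True, "category": "manual_intervention", "duration_minutes": 15},
    {"is_toil": True, "category": "certificate_management", "duration_minutes": 30},
    {"is_toil": False, "category": "documentation", "duration_minutes": 120},
]
print(toil_by_category(entries))  # {'manual_intervention': 15, 'certificate_management': 30}
```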
## Eliminating OOM Restarts with Proper Limits
```yaml
# TOIL: manually restarting services that OOM
# ROOT CAUSE: no memory limits set, no auto-restart policy
# Kubernetes: set proper resource limits and restart policy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"  # OOM kill instead of unbounded growth
              cpu: "500m"
# Kubernetes restarts the container automatically on failure
# restartPolicy: Always (default for Deployments)

# Also fix the root cause: find and fix the memory leak
# pprof profile → find allocation hotspot → fix it
```
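With limits in place, OOM kills show up in pod status rather than in a human's pager memory. A monitor can scan `kubectl get pods -o json` output for containers whose last termination reason was `OOMKilled`; here is a minimal sketch (the helper name and wiring are illustrative, the status fields follow the Kubernetes pod schema):

```python
def oomkilled_pods(pods_json: dict) -> list[str]:
    """Return names of pods whose containers were last terminated by the OOM killer."""
    hits = []
    for pod in pods_json.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            last = cs.get("lastState", {}).get("terminated") or {}
            if last.get("reason") == "OOMKilled":
                hits.append(pod["metadata"]["name"])
    return hits

# Example input shaped like `kubectl get pods -o json`
pods = {"items": [{
    "metadata": {"name": "api-service-abc12"},
    "status": {"containerStatuses": [
        {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
    ]},
}]}
print(oomkilled_pods(pods))  # ['api-service-abc12']
```

Feed the result into a ticketing or alerting hook so that repeat offenders get a memory-leak investigation instead of a silent restart loop.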
## Automating Certificate Renewal
```bash
#!/bin/bash
# TOIL: manual certificate renewals
# SOLUTION: automate with certbot + cron/systemd

# Install certbot for Let's Encrypt
apt-get install certbot

# Initial certificate
certbot certonly --standalone -d api.example.com --non-interactive \
  --agree-tos --email ops@example.com

# Auto-renewal via systemd timer (certbot installs this automatically)
# Check it's enabled:
systemctl status certbot.timer
# certbot.timer - Run certbot twice daily
#   Loaded: loaded (/lib/systemd/system/certbot.timer; enabled)

# Verify renewal works:
certbot renew --dry-run

# For internal services using a private CA via Vault:
# vault write pki/issue/my-role common_name="internal.example.com" ttl="720h"
# Automate with vault-agent or cert-manager in Kubernetes
```
## Self-Healing: Automatic Disk Cleanup
```python
#!/usr/bin/env python3
"""
Auto-remediation for high disk usage.
Eliminates: "respond to high-disk alert, manually clean up logs/tmp"
"""
import logging
import shutil
import time
from pathlib import Path

logger = logging.getLogger(__name__)

CLEANUP_DIRS = [
    {"path": "/var/log/app", "max_age_days": 7, "pattern": "*.log.*"},
    {"path": "/tmp/builds", "max_age_days": 1, "pattern": "*"},
    {"path": "/var/cache/apt", "max_age_days": 30, "pattern": "*.deb"},
]

def page_oncall(alert: str, message: str) -> None:
    """Placeholder: wire this to your paging system (PagerDuty, Opsgenie, etc.)."""
    logger.critical("PAGE %s: %s", alert, message)

def get_disk_usage(path: str) -> float:
    """Return disk usage percentage for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return (usage.used / usage.total) * 100

def cleanup_old_files(directory: str, max_age_days: int, pattern: str) -> int:
    """Delete files matching pattern older than max_age_days. Returns bytes freed."""
    cutoff = time.time() - (max_age_days * 86400)
    freed = 0
    for path in Path(directory).rglob(pattern):
        try:
            if path.is_file() and path.stat().st_mtime < cutoff:
                size = path.stat().st_size
                path.unlink()
                freed += size
                logger.info("Deleted %s (%d bytes)", path, size)
        except PermissionError:
            logger.warning("Cannot delete %s: permission denied", path)
    return freed

def auto_remediate(threshold_pct: float = 80.0) -> None:
    usage = get_disk_usage("/")
    if usage < threshold_pct:
        return  # no action needed
    logger.warning("Disk usage %.1f%% exceeds threshold %.1f%%", usage, threshold_pct)
    total_freed = 0
    for config in CLEANUP_DIRS:
        freed = cleanup_old_files(config["path"], config["max_age_days"], config["pattern"])
        total_freed += freed
        logger.info("Freed %d bytes from %s", freed, config["path"])
    new_usage = get_disk_usage("/")
    logger.info(
        "Cleanup complete: %.1f%% → %.1f%%, freed %.1f MB",
        usage, new_usage, total_freed / (1024 * 1024)
    )
    if new_usage >= threshold_pct:
        logger.error(
            "Disk still at %.1f%% after cleanup — escalating to on-call",
            new_usage
        )
        page_oncall("disk_high_after_remediation", f"Disk at {new_usage:.1f}% after auto-cleanup")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    auto_remediate()
```
## Automating Access Provisioning
```hcl
# TOIL: manually adding team members to groups, rotating their credentials
# SOLUTION: infrastructure-as-code for access management
# Terraform for AWS IAM (stored in git, reviewed, applied automatically)

# iam.tf
resource "aws_iam_user" "engineers" {
  for_each = toset(var.engineer_emails)
  name     = each.value
}

resource "aws_iam_user_group_membership" "engineers" {
  for_each = aws_iam_user.engineers
  user     = each.value.name
  groups   = ["engineers", "readonly-production"]
}

# New team member: add to var.engineer_emails in variables.tf
# Open PR → review → merge → CI applies terraform
# No manual console work
```

For temporary access (break-glass scenarios):

```python
import logging

import boto3

logger = logging.getLogger(__name__)

def grant_temporary_access(user_arn: str, role_arn: str, duration_hours: int = 4) -> str:
    """Grant temporary elevated access via STS assume-role."""
    sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=f"breakglass-{user_arn.split('/')[-1]}",
        DurationSeconds=duration_hours * 3600,
    )
    expiry = response["Credentials"]["Expiration"]
    logger.info(
        "Granted %s access to %s until %s",
        user_arn, role_arn, expiry.isoformat()
    )
    return response["Credentials"]["SessionToken"]
```
## Runbook Automation with Ansible
```yaml
# TOIL: 15-step runbook that engineers follow manually
# SOLUTION: Ansible playbook that executes the same steps
# runbook-high-latency.yml
---
- name: Automated high-latency remediation
  hosts: "{{ target_host | default('production-api') }}"
  gather_facts: no
  tasks:
    - name: Check connection pool saturation
      shell: |
        psql -U app -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
      register: pg_stats

    - name: Kill long-running idle connections
      shell: |
        psql -U app -c "
          SELECT pg_terminate_backend(pid)
          FROM pg_stat_activity
          WHERE state = 'idle'
            AND query_start < NOW() - INTERVAL '5 minutes';
        "
      when: "'idle' in pg_stats.stdout"

    - name: Check for memory pressure
      shell: free -m | awk '/Mem:/ {print $3/$2 * 100}'
      register: mem_usage

    - name: Restart service if memory critical
      systemd:
        name: api-service
        state: restarted
      when: mem_usage.stdout | float > 90

    - name: Notify Slack
      uri:
        url: "{{ slack_webhook }}"
        method: POST
        body_format: json
        body:
          text: "Auto-remediation ran on {{ inventory_hostname }}"
```
## Toil Reduction Checklist
Identify:
- [ ] Track time spent on each type of operational task for 2 sprints
- [ ] Flag tasks that recur, are manual, and could be automated
- [ ] Calculate toil percentage (target: < 50%)
Eliminate:
- [ ] OOM crashes → set memory limits, fix root cause memory leak
- [ ] Disk full alerts → automate cleanup, add retention policies
- [ ] Manual deployments → CI/CD pipeline
- [ ] Certificate renewals → cert-manager or certbot auto-renewal
- [ ] Manual config changes → Terraform/Ansible IaC
- [ ] Access provisioning → IaC + self-service portal
- [ ] Recurring runbook steps → Ansible playbooks or auto-remediation scripts
Prevent:
- [ ] New services must have runbooks that can be executed without human judgment
- [ ] Alerts must have automated remediation or clear escalation path
- [ ] Any task done more than 3 times gets a ticket to automate it
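The "done more than 3 times" rule can itself be enforced mechanically rather than by memory. A minimal sketch with an illustrative task log (the function name and threshold are assumptions for this example):

```python
from collections import Counter

def automation_candidates(task_log: list[str], threshold: int = 3) -> list[str]:
    """Tasks performed more than `threshold` times are tickets waiting to be filed."""
    return sorted(t for t, n in Counter(task_log).items() if n > threshold)

log = ["restart api", "renew cert", "restart api", "restart api", "restart api"]
print(automation_candidates(log))  # ['restart api']
```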
## Conclusion
Toil accumulates invisibly and crowds out engineering work. Measure it explicitly, categorize it, and treat it as technical debt. The most impactful eliminations are usually certificate management, disk cleanup, access provisioning, and service restarts — all automatable with off-the-shelf tools. Each hour invested in automation returns hours of toil eliminated every week. The goal is not zero toil but a declining toil percentage as the service grows.