## Introduction
A canary deployment releases a new version to a small percentage of traffic before rolling out to everyone. It limits the blast radius of a bad deployment: if the new version has a bug, only a fraction of users are affected and rollback is fast. This post covers implementation patterns in Kubernetes and common pitfalls.
## Traffic Splitting Strategies
### Pod-Count Weighting with a Single Service
```yaml
# Two deployments: stable (current) and canary (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-stable
  namespace: production
spec:
  replicas: 9  # 90% of traffic
  selector:
    matchLabels:
      app: api
      version: stable
  template:
    metadata:
      labels:
        app: api
        version: stable
    spec:
      containers:
        - name: api
          image: api:v2.3.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
  namespace: production
spec:
  replicas: 1  # 10% of traffic (1 out of 10 pods)
  selector:
    matchLabels:
      app: api
      version: canary
  template:
    metadata:
      labels:
        app: api
        version: canary
    spec:
      containers:
        - name: api
          image: api:v2.4.0
---
# One Service routes to both (weighted by pod count)
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api  # matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 8080
```
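Pod-count weighting only gives you percentages that are multiples of `100 / total_replicas`, so the replica math is worth getting right. A small helper, a sketch rather than part of the manifests above, that computes the stable/canary replica split for a target canary percentage:

```shell
# replica_split TOTAL PCT: print "stable canary" replica counts whose ratio
# approximates PCT% canary traffic. Granularity is limited to 100/TOTAL percent.
replica_split() {
  local total=$1 pct=$2
  local canary=$(( (total * pct + 50) / 100 ))  # integer rounding to nearest
  if (( canary < 1 )); then canary=1; fi        # always keep at least one canary pod
  echo "$(( total - canary )) $canary"
}

replica_split 10 10   # prints "9 1", as in the manifests above
replica_split 20 25   # prints "15 5"
```

With only 10 pods the smallest step is 10%; if you want a 1% canary, you need 100 pods or a traffic-splitting proxy like the ingress approach below.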
### Nginx Ingress Weight-Based Canary
```yaml
# Stable ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-stable
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-stable
                port:
                  number: 80
---
# Canary ingress with weight annotation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% of traffic
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80
```
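With the weight annotation in place, ramping the canary is just an annotation update. A sketch of a helper (hypothetical, assuming the `api-canary` Ingress above) using `kubectl annotate --overwrite`:

```shell
# set_canary_weight PCT: route PCT% of api.example.com traffic to the canary.
# --overwrite is required because the annotation already exists on the Ingress.
set_canary_weight() {
  kubectl annotate ingress api-canary --overwrite \
    "nginx.ingress.kubernetes.io/canary-weight=$1"
}

# Typical ramp: set_canary_weight 10, watch metrics, then 30, then 60.
# set_canary_weight 0 sends everything back to stable (instant rollback).
```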
## Argo Rollouts: Automated Progressive Delivery
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10            # 10% to canary
        - pause: {duration: 10m}   # wait 10 minutes
        - analysis:
            templates:
              - templateName: error-rate-analysis
        - setWeight: 30            # promote to 30%
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 5m}
        # Full rollout after all steps
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        nginx:
          stableIngress: api-stable
---
# AnalysisTemplate: automated success criteria
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-analysis
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01  # error rate < 1%
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{version="canary"}[5m]))
```
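Day-to-day interaction with a Rollout goes through the Argo Rollouts kubectl plugin. A few wrapper functions, a sketch assuming the plugin is installed and the `api` Rollout above, around the commands you will reach for most often:

```shell
# Requires the Argo Rollouts kubectl plugin ("kubectl argo rollouts ...").
watch_rollout() { kubectl argo rollouts get rollout api --watch; }  # live step-by-step view
promote()       { kubectl argo rollouts promote api; }              # skip the current pause step
abort()         { kubectl argo rollouts abort api; }                # shift all traffic back to stable
```

`abort` is the manual escape hatch; a failed `error-rate-analysis` run triggers the same rollback automatically.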
## Monitoring Canary Health
```python
# Compare canary metrics against the stable baseline
import requests
from dataclasses import dataclass

PROMETHEUS_URL = "http://prometheus:9090"


@dataclass
class CanaryHealth:
    error_rate_stable: float
    error_rate_canary: float
    p99_latency_stable: float
    p99_latency_canary: float

    @property
    def is_healthy(self) -> bool:
        # Canary should not be significantly worse than stable
        return (
            self.error_rate_canary <= self.error_rate_stable * 2
            and self.p99_latency_canary <= self.p99_latency_stable * 1.2
        )


def query_prometheus(query: str) -> float:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:  # no samples yet, e.g. right after the canary starts
        raise RuntimeError(f"no data for query: {query}")
    return float(result[0]["value"][1])


def check_canary_health() -> CanaryHealth:
    return CanaryHealth(
        error_rate_stable=query_prometheus(
            'sum(rate(http_requests_total{version="stable",status=~"5.."}[5m])) / sum(rate(http_requests_total{version="stable"}[5m]))'
        ),
        error_rate_canary=query_prometheus(
            'sum(rate(http_requests_total{version="canary",status=~"5.."}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))'
        ),
        p99_latency_stable=query_prometheus(
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{version="stable"}[5m])) by (le))'
        ),
        p99_latency_canary=query_prometheus(
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m])) by (le))'
        ),
    )
```
## Rollout and Rollback Script
```bash
#!/bin/bash
# canary-rollout.sh: progressive rollout with health checks
set -euo pipefail

IMAGE="$1"
NAMESPACE=production

echo "Deploying canary: $IMAGE"
kubectl set image deployment/api-canary "api=$IMAGE" -n "$NAMESPACE"

echo "Waiting for canary rollout..."
kubectl rollout status deployment/api-canary -n "$NAMESPACE"

echo "Canary at 10%, monitoring for 10 minutes..."
sleep 600

# Check the canary error rate over the last 5 minutes
CANARY_ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))/sum(rate(http_requests_total{version="canary"}[5m]))' \
  | jq -r '.data.result[0].value[1]')

if (( $(echo "$CANARY_ERROR_RATE > 0.02" | bc -l) )); then
  echo "ERROR: Canary error rate $CANARY_ERROR_RATE > 2%. Rolling back."
  kubectl rollout undo deployment/api-canary -n "$NAMESPACE"
  exit 1
fi

echo "Canary healthy. Promoting to full rollout."
kubectl set image deployment/api-stable "api=$IMAGE" -n "$NAMESPACE"
kubectl rollout status deployment/api-stable -n "$NAMESPACE"
echo "Rollout complete."
```
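The fixed `sleep 600` means a bad canary keeps serving errors for up to ten minutes before the single check runs. A polling variant fails fast; this is a sketch, where `canary_error_rate` is a hypothetical helper wrapping the Prometheus curl from the script above:

```shell
# monitor_canary SAMPLES INTERVAL: sample the canary error rate SAMPLES times,
# INTERVAL seconds apart, failing the moment it exceeds 2%.
monitor_canary() {
  local samples=$1 interval=$2 rate
  for _ in $(seq "$samples"); do
    rate=$(canary_error_rate) || return 1   # fail closed if the query itself errors
    if awk -v r="$rate" 'BEGIN { exit !(r > 0.02) }'; then
      echo "canary error rate $rate exceeds 2%" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# In the script above: replace "sleep 600" + the one-shot check with
#   monitor_canary 10 60 || { kubectl rollout undo deployment/api-canary -n "$NAMESPACE"; exit 1; }
```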
## Common Pitfalls
- **Database migrations:** ensure the new schema is backward compatible — both old and new versions run simultaneously during a canary. See the Expand-Contract pattern.
- **Sticky sessions:** if users are pinned to a version, canary metrics may be biased. For A/B testing purposes, use header-based routing instead of weight-based.
- **Insufficient sample size:** 10% canary traffic on low-volume endpoints may not generate enough signal in 10 minutes. Extend monitoring windows for low-traffic services.
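For the header-based routing mentioned above, nginx ingress has a dedicated annotation: `canary-by-header` routes deterministically by request header instead of by random weight. A sketch of enabling it (assumes the `api-canary` Ingress from earlier; the header name `X-Canary` is an arbitrary choice):

```shell
# enable_header_canary HEADER: requests sending "HEADER: always" go to the
# canary, "HEADER: never" forces stable; all other traffic follows the weight.
enable_header_canary() {
  kubectl annotate ingress api-canary --overwrite \
    "nginx.ingress.kubernetes.io/canary-by-header=$1"
}

# Example: enable_header_canary X-Canary, then verify with
#   curl -H "X-Canary: always" https://api.example.com/
```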
## Conclusion
Canary deployments reduce deployment risk without requiring feature flags or blue-green infrastructure duplication. Start at 5-10% traffic, monitor error rate and latency compared to baseline, and either automate promotion/rollback (Argo Rollouts) or run it manually with explicit health gates. The most important thing is having the rollback path rehearsed before you need it.