## Introduction
A canary deployment releases a new version to a small percentage of traffic before rolling out to everyone. It limits the blast radius of a bad deployment: if the new version has a bug, only a fraction of users are affected and rollback is fast. This post covers implementation patterns in Kubernetes and common pitfalls.
## Traffic Splitting Strategies
### Pod-Count Weighting with a Single Service
```yaml
# Two deployments: stable (current) and canary (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-stable
  namespace: production
spec:
  replicas: 9  # 90% of traffic
  selector:
    matchLabels:
      app: api
      version: stable
  template:
    metadata:
      labels:
        app: api
        version: stable
    spec:
      containers:
        - name: api
          image: api:v2.3.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary
  namespace: production
spec:
  replicas: 1  # 10% of traffic (1 out of 10 pods)
  selector:
    matchLabels:
      app: api
      version: canary
  template:
    metadata:
      labels:
        app: api
        version: canary
    spec:
      containers:
        - name: api
          image: api:v2.4.0
---
# One Service routes to both (weighted by pod count)
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api  # matches both stable and canary pods
  ports:
    - port: 80
      targetPort: 8080
```
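Pod-count weighting only gives you percentages that are multiples of `100 / total_replicas`, so the replica math is worth getting right. A small helper, a sketch rather than part of the manifests above, that computes the stable/canary replica split for a target canary percentage:

```shell
# replica_split TOTAL PCT: print "stable canary" replica counts whose ratio
# approximates PCT% canary traffic. Granularity is limited to 100/TOTAL percent.
replica_split() {
  local total=$1 pct=$2
  local canary=$(( (total * pct + 50) / 100 ))  # integer rounding to nearest
  if (( canary < 1 )); then canary=1; fi        # always keep at least one canary pod
  echo "$(( total - canary )) $canary"
}

replica_split 10 10   # prints "9 1", as in the manifests above
replica_split 20 25   # prints "15 5"
```

With only 10 pods the smallest step is 10%; if you want a 1% canary, you need 100 pods or a traffic-splitting proxy like the ingress approach below.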
### Nginx Ingress Weight-Based Canary
```yaml
# Stable ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-stable
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-stable
                port:
                  number: 80
---
# Canary ingress with weight annotation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% of traffic
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-canary
                port:
                  number: 80
```
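With the weight annotation in place, ramping the canary is just an annotation update. A sketch of a helper (hypothetical, assuming the `api-canary` Ingress above) using `kubectl annotate --overwrite`:

```shell
# set_canary_weight PCT: route PCT% of api.example.com traffic to the canary.
# --overwrite is required because the annotation already exists on the Ingress.
set_canary_weight() {
  kubectl annotate ingress api-canary --overwrite \
    "nginx.ingress.kubernetes.io/canary-weight=$1"
}

# Typical ramp: set_canary_weight 10, watch metrics, then 30, then 60.
# set_canary_weight 0 sends everything back to stable (instant rollback).
```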
## Argo Rollouts: Automated Progressive Delivery
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10            # 10% to canary
        - pause: {duration: 10m}   # wait 10 minutes
        - analysis:
            templates:
              - templateName: error-rate-analysis
        - setWeight: 30            # promote to 30%
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 5m}
        # Full rollout after all steps
      canaryService: api-canary
      stableService: api-stable
      trafficRouting:
        nginx:
          stableIngress: api-stable
---
# AnalysisTemplate: automated success criteria
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-analysis
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01  # error rate < 1%
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{version="canary"}[5m]))
```
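Day-to-day interaction with a Rollout goes through the Argo Rollouts kubectl plugin. A few wrapper functions, a sketch assuming the plugin is installed and the `api` Rollout above, around the commands you will reach for most often:

```shell
# Requires the Argo Rollouts kubectl plugin ("kubectl argo rollouts ...").
watch_rollout() { kubectl argo rollouts get rollout api --watch; }  # live step-by-step view
promote()       { kubectl argo rollouts promote api; }              # skip the current pause step
abort()         { kubectl argo rollouts abort api; }                # shift all traffic back to stable
```

`abort` is the manual escape hatch; a failed `error-rate-analysis` run triggers the same rollback automatically.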
## Monitoring Canary Health
```python
# Compare canary metrics against the stable baseline
import requests
from dataclasses import dataclass

PROMETHEUS_URL = "http://prometheus:9090"


@dataclass
class CanaryHealth:
    error_rate_stable: float
    error_rate_canary: float
    p99_latency_stable: float
    p99_latency_canary: float

    @property
    def is_healthy(self) -> bool:
        # Canary should not be significantly worse than stable
        return (
            self.error_rate_canary <= self.error_rate_stable * 2
            and self.p99_latency_canary <= self.p99_latency_stable * 1.2
        )


def query_prometheus(query: str) -> float:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:  # no samples yet, e.g. right after the canary starts
        raise RuntimeError(f"no data for query: {query}")
    return float(result[0]["value"][1])


def check_canary_health() -> CanaryHealth:
    return CanaryHealth(
        error_rate_stable=query_prometheus(
            'sum(rate(http_requests_total{version="stable",status=~"5.."}[5m])) / sum(rate(http_requests_total{version="stable"}[5m]))'
        ),
        error_rate_canary=query_prometheus(
            'sum(rate(http_requests_total{version="canary",status=~"5.."}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))'
        ),
        p99_latency_stable=query_prometheus(
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{version="stable"}[5m])) by (le))'
        ),
        p99_latency_canary=query_prometheus(
            'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m])) by (le))'
        ),
    )
```
## Rollout and Rollback Script
```bash
#!/bin/bash
# canary-rollout.sh: progressive rollout with health checks
set -euo pipefail

IMAGE="$1"
NAMESPACE=production

echo "Deploying canary: $IMAGE"
kubectl set image deployment/api-canary "api=$IMAGE" -n "$NAMESPACE"

echo "Waiting for canary rollout..."
kubectl rollout status deployment/api-canary -n "$NAMESPACE"

echo "Canary at 10%, monitoring for 10 minutes..."
sleep 600

# Check the canary error rate over the last 5 minutes
CANARY_ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))/sum(rate(http_requests_total{version="canary"}[5m]))' \
  | jq -r '.data.result[0].value[1]')

if (( $(echo "$CANARY_ERROR_RATE > 0.02" | bc -l) )); then
  echo "ERROR: Canary error rate $CANARY_ERROR_RATE > 2%. Rolling back."
  kubectl rollout undo deployment/api-canary -n "$NAMESPACE"
  exit 1
fi

echo "Canary healthy. Promoting to full rollout."
kubectl set image deployment/api-stable "api=$IMAGE" -n "$NAMESPACE"
kubectl rollout status deployment/api-stable -n "$NAMESPACE"
echo "Rollout complete."
```
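The fixed `sleep 600` means a bad canary keeps serving errors for up to ten minutes before the single check runs. A polling variant fails fast; this is a sketch, where `canary_error_rate` is a hypothetical helper wrapping the Prometheus curl from the script above:

```shell
# monitor_canary SAMPLES INTERVAL: sample the canary error rate SAMPLES times,
# INTERVAL seconds apart, failing the moment it exceeds 2%.
monitor_canary() {
  local samples=$1 interval=$2 rate
  for _ in $(seq "$samples"); do
    rate=$(canary_error_rate) || return 1   # fail closed if the query itself errors
    if awk -v r="$rate" 'BEGIN { exit !(r > 0.02) }'; then
      echo "canary error rate $rate exceeds 2%" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# In the script above: replace "sleep 600" + the one-shot check with
#   monitor_canary 10 60 || { kubectl rollout undo deployment/api-canary -n "$NAMESPACE"; exit 1; }
```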
## Common Pitfalls
- **Database migrations:** ensure the new schema is backward compatible — both old and new versions run simultaneously during a canary. See the Expand-Contract pattern.
- **Sticky sessions:** if users are pinned to a version, canary metrics may be biased. For A/B testing purposes, use header-based routing instead of weight-based.
- **Insufficient sample size:** 10% canary traffic on low-volume endpoints may not generate enough signal in 10 minutes. Extend monitoring windows for low-traffic services.
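For the header-based routing mentioned above, nginx ingress has a dedicated annotation: `canary-by-header` routes deterministically by request header instead of by random weight. A sketch of enabling it (assumes the `api-canary` Ingress from earlier; the header name `X-Canary` is an arbitrary choice):

```shell
# enable_header_canary HEADER: requests sending "HEADER: always" go to the
# canary, "HEADER: never" forces stable; all other traffic follows the weight.
enable_header_canary() {
  kubectl annotate ingress api-canary --overwrite \
    "nginx.ingress.kubernetes.io/canary-by-header=$1"
}

# Example: enable_header_canary X-Canary, then verify with
#   curl -H "X-Canary: always" https://api.example.com/
```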
## Conclusion
Canary deployments reduce deployment risk without requiring feature flags or blue-green infrastructure duplication. Start at 5-10% traffic, monitor error rate and latency compared to baseline, and either automate promotion/rollback (Argo Rollouts) or run it manually with explicit health gates. The most important thing is having the rollback path rehearsed before you need it.