## Introduction
Kubernetes provides two autoscaling mechanisms: Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on metrics; Vertical Pod Autoscaler (VPA) adjusts resource requests and limits on existing pods. Understanding when to use each — and how to configure them well — is essential for cost-efficient, reliable deployments.
## Horizontal Pod Autoscaler (HPA)
HPA adjusts replica count based on observed metrics vs target values.
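Under the hood, the controller computes the desired replica count from the ratio of current to target metric value: `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`. A minimal arithmetic sketch with illustrative numbers (integer ceiling division stands in for the controller's `ceil()`):

```bash
# HPA core formula: desiredReplicas = ceil(currentReplicas * current / target)
current_replicas=4
current_cpu_pct=140   # average utilization across pods (illustrative)
target_cpu_pct=70     # the HPA target
# integer ceiling division in bash
desired=$(( (current_replicas * current_cpu_pct + target_cpu_pct - 1) / target_cpu_pct ))
echo "$desired"       # 8 — load is 2x the target, so replicas double
```

In practice the controller also applies a tolerance band around the target so small fluctuations do not trigger a resize.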
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale when avg CPU > 70% of request
  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric: requests per second from Prometheus
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # react quickly to spikes
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min before scaling down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60              # remove at most 2 pods per minute
```
### HPA Requirements
HPA requires resource requests to be set on containers — it calculates utilization as actual_usage / request.
```yaml
spec:
  containers:
  - name: api
    resources:
      requests:
        cpu: 250m        # HPA targets are relative to this
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 512Mi
```
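Because utilization is measured relative to the request, the same percentage target means a different absolute threshold for every request value. A quick back-of-the-envelope check for the manifest above (250m request, 70% target):

```bash
# averageUtilization: 70 against requests.cpu: 250m means HPA scales out
# once average usage across pods exceeds 70% of the request
request_millicores=250
target_pct=70
threshold=$(( request_millicores * target_pct / 100 ))
echo "${threshold}m"   # 175m — the effective scale-out trigger
```

This is also why changing a request silently changes scaling behavior: raising the request to 500m would move the trigger to 350m without touching the HPA.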
### Custom Metrics with Prometheus Adapter
```yaml
# prometheus-adapter config: expose http_requests_total as a custom metric
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```
```bash
# Verify the custom metric is available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second"
```
## Vertical Pod Autoscaler (VPA)
VPA recommends (or automatically sets) resource requests based on historical usage. It does not change replica count.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # Off     = recommend only, no automatic updates
                        # Auto    = automatically evict pods to apply new recommendations
                        # Initial = only apply on new pods
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
```
```bash
# View VPA recommendations
kubectl describe vpa api-vpa
# Recommendation:
#   Container Recommendations:
#     Container Name: api
#     Lower Bound:  cpu: 150m  memory: 200Mi
#     Target:       cpu: 350m  memory: 420Mi
#     Upper Bound:  cpu: 800m  memory: 900Mi
```
Use VPA in Off mode first to gather recommendations before enabling automatic updates.
## HPA vs VPA: When to Use Each
| Scenario | Use |
|---|---|
| Stateless API with variable traffic | HPA |
| Batch workers with variable load | HPA |
| Service with stable load but uncertain right-sizing | VPA (Off mode) |
| Single-replica service (databases, leaders) | VPA (Auto) |
| Services needing both | HPA for replicas, VPA for sizing |
Do not use HPA and VPA (Auto) on the same deployment for CPU — they conflict. If using both, set VPA to Off for CPU and only auto-update memory.
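One way to apply that split, sketched against the VPA manifest above: restrict VPA to memory via `controlledResources`, so a CPU-based HPA keeps sole control of CPU (the `api` container name is carried over from the earlier example):

```yaml
resourcePolicy:
  containerPolicies:
  - containerName: api
    controlledResources: ["memory"]   # VPA manages memory only; HPA owns CPU
```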
## Cluster Autoscaler
HPA scales pods; Cluster Autoscaler scales nodes when pods cannot be scheduled.
```yaml
# Pod-level annotation telling cluster autoscaler not to evict this pod;
# on a Deployment, set it on the pod template (spec.template.metadata)
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```
```bash
# View scaling events
kubectl describe configmap cluster-autoscaler-status -n kube-system
# Check why nodes are not scaling down
kubectl get nodes -l kubernetes.io/role=node
kubectl describe node <node-name> | grep -A10 "Conditions:"
```
## Scaling Metrics Best Practices
- CPU utilization target: 60-70% leaves headroom for traffic spikes before scale-out.
- Scale-down stabilization: 3-5 minutes prevents thrashing on transient traffic drops.
- minReplicas >= 2: a single replica provides no high availability; any pod restart, rollout, or node drain causes downtime.
- Custom metrics: Prefer request-rate over CPU for HTTP services — CPU can spike during initialization unrelated to load.
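The 60-70% guidance follows from simple headroom arithmetic; a sketch with illustrative numbers:

```bash
# At a 70% utilization target, each pod can absorb extra load before
# CPU saturates at 100% of the request
target_pct=70
headroom_pct=$(( (100 - target_pct) * 100 / target_pct ))
echo "${headroom_pct}%"   # ~42% more traffic absorbed while new pods start
```

That buffer is what covers the gap between a spike arriving and new replicas becoming ready; a 90% target leaves only ~11% headroom, which image pulls and cold starts can easily outlast.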
## Conclusion
HPA is the primary tool for handling traffic variability. Configure scale-down conservatively to avoid thrashing. Use VPA in Off mode to right-size resource requests — over-requested pods waste money and under-requested pods get throttled or OOM-killed. The combination of correct requests (from VPA recommendations) and HPA on custom metrics gives you both efficient resource usage and responsive scaling.