## Introduction
Kubernetes provides two autoscaling mechanisms: Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on metrics; Vertical Pod Autoscaler (VPA) adjusts resource requests and limits on existing pods. Understanding when to use each — and how to configure them well — is essential for cost-efficient, reliable deployments.
## Horizontal Pod Autoscaler (HPA)
HPA adjusts replica count based on observed metrics vs target values.
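Under the hood, the controller computes the desired replica count from the ratio of current to target metric value: `desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue)`. A minimal arithmetic sketch with illustrative numbers (integer ceiling division stands in for the controller's `ceil()`):

```bash
# HPA core formula: desiredReplicas = ceil(currentReplicas * current / target)
current_replicas=4
current_cpu_pct=140   # average utilization across pods (illustrative)
target_cpu_pct=70     # the HPA target
# integer ceiling division in bash
desired=$(( (current_replicas * current_cpu_pct + target_cpu_pct - 1) / target_cpu_pct ))
echo "$desired"       # 8 — load is 2x the target, so replicas double
```

In practice the controller also applies a tolerance band around the target so small fluctuations do not trigger a resize.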
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale when avg CPU > 70% of request
  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Custom metric: requests per second from Prometheus
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # react quickly to spikes
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min before scaling down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60              # remove at most 2 pods per minute
```
### HPA Requirements
HPA requires resource requests to be set on containers — it calculates utilization as actual_usage / request.
```yaml
spec:
  containers:
  - name: api
    resources:
      requests:
        cpu: 250m        # HPA targets are relative to this
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 512Mi
```
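Because utilization is measured relative to the request, the same percentage target means a different absolute threshold for every request value. A quick back-of-the-envelope check for the manifest above (250m request, 70% target):

```bash
# averageUtilization: 70 against requests.cpu: 250m means HPA scales out
# once average usage across pods exceeds 70% of the request
request_millicores=250
target_pct=70
threshold=$(( request_millicores * target_pct / 100 ))
echo "${threshold}m"   # 175m — the effective scale-out trigger
```

This is also why changing a request silently changes scaling behavior: raising the request to 500m would move the trigger to 350m without touching the HPA.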
### Custom Metrics with Prometheus Adapter
```yaml
# prometheus-adapter config: expose http_requests_total as a custom metric
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```
```bash
# Verify the custom metric is available
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second"
```
## Vertical Pod Autoscaler (VPA)
VPA recommends (or automatically sets) resource requests based on historical usage. It does not change replica count.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # Off     = recommend only, no automatic updates
                        # Auto    = automatically evict pods to apply new recommendations
                        # Initial = only apply on new pods
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
```
```bash
# View VPA recommendations
kubectl describe vpa api-vpa
# Recommendation:
#   Container Recommendations:
#     Container Name: api
#     Lower Bound:  cpu: 150m  memory: 200Mi
#     Target:       cpu: 350m  memory: 420Mi
#     Upper Bound:  cpu: 800m  memory: 900Mi
```
Use VPA in Off mode first to gather recommendations before enabling automatic updates.
## HPA vs VPA: When to Use Each
| Scenario | Use |
|---|---|
| Stateless API with variable traffic | HPA |
| Batch workers with variable load | HPA |
| Service with stable load but uncertain right-sizing | VPA (Off mode) |
| Single-replica service (databases, leaders) | VPA (Auto) |
| Services needing both | HPA for replicas, VPA for sizing |
Do not use HPA and VPA (Auto) on the same deployment for CPU — they conflict. If using both, set VPA to Off for CPU and only auto-update memory.
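One way to apply that split, sketched against the VPA manifest above: restrict VPA to memory via `controlledResources`, so a CPU-based HPA keeps sole control of CPU (the `api` container name is carried over from the earlier example):

```yaml
resourcePolicy:
  containerPolicies:
  - containerName: api
    controlledResources: ["memory"]   # VPA manages memory only; HPA owns CPU
```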
## Cluster Autoscaler
HPA scales pods; Cluster Autoscaler scales nodes when pods cannot be scheduled.
```yaml
# Pod-level annotation telling cluster autoscaler not to evict this pod;
# on a Deployment, set it on the pod template (spec.template.metadata)
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```
```bash
# View scaling events
kubectl describe configmap cluster-autoscaler-status -n kube-system
# Check why nodes are not scaling down
kubectl get nodes -l kubernetes.io/role=node
kubectl describe node <node-name> | grep -A10 "Conditions:"
```
## Scaling Metrics Best Practices
- CPU utilization target: 60-70% leaves headroom for traffic spikes before scale-out.
- Scale-down stabilization: 3-5 minutes prevents thrashing on transient traffic drops.
- minReplicas >= 2: a single replica provides no high availability; any pod restart, rollout, or node drain causes downtime.
- Custom metrics: Prefer request-rate over CPU for HTTP services — CPU can spike during initialization unrelated to load.
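The 60-70% guidance follows from simple headroom arithmetic; a sketch with illustrative numbers:

```bash
# At a 70% utilization target, each pod can absorb extra load before
# CPU saturates at 100% of the request
target_pct=70
headroom_pct=$(( (100 - target_pct) * 100 / target_pct ))
echo "${headroom_pct}%"   # ~42% more traffic absorbed while new pods start
```

That buffer is what covers the gap between a spike arriving and new replicas becoming ready; a 90% target leaves only ~11% headroom, which image pulls and cold starts can easily outlast.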
## Conclusion
HPA is the primary tool for handling traffic variability. Configure scale-down conservatively to avoid thrashing. Use VPA in Off mode to right-size resource requests — over-requested pods waste money and under-requested pods get throttled or OOM-killed. The combination of correct requests (from VPA recommendations) and HPA on custom metrics gives you both efficient resource usage and responsive scaling.