Kubernetes Probes: Liveness, Readiness, and Startup

Introduction

Kubernetes uses three probe types to manage pod lifecycle: liveness (is the container healthy?), readiness (is the container ready to serve traffic?), and startup (has the container finished initializing?). Misconfigured probes cause unnecessary restarts, traffic to unready pods, and slow rolling deployments.

Probe Types

Liveness Probe

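Determines whether the container is still running correctly. After failureThreshold consecutive failures, the kubelet kills the container and restarts it according to the pod's restartPolicy.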

Use case: detect deadlocks, infinite loops, or memory leaks that leave the process running but unresponsive, conditions that a simple "is the process up?" check would miss.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # wait before first probe
  periodSeconds: 15          # probe every 15s
  timeoutSeconds: 5
  failureThreshold: 3        # restart after 3 consecutive failures
  successThreshold: 1

Readiness Probe

Determines if the container is ready to serve traffic. Failure removes the pod from Service endpoints (no traffic routed to it).

Use case: warm-up time after restart, temporarily overloaded pods, maintenance mode, dependency unavailability.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
  successThreshold: 2    # require 2 consecutive successes to mark ready again
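
The interaction between failureThreshold and successThreshold is easier to see in a small simulation. The sketch below is a simplified model, not the kubelet's actual implementation: the `ProbeTracker` class and its field names are hypothetical, and it assumes the pod starts out not ready and that only consecutive results count.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Simplified model of how consecutive probe results flip readiness."""
    failure_threshold: int = 3
    success_threshold: int = 1
    ready: bool = False      # assumption: the pod starts out not ready
    _successes: int = 0      # current run of consecutive successes
    _failures: int = 0       # current run of consecutive failures

    def record(self, success: bool) -> bool:
        """Record one probe result and return the resulting ready state."""
        if success:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.failure_threshold:
                self.ready = False
        return self.ready


# Same thresholds as the readiness probe above.
tracker = ProbeTracker(failure_threshold=3, success_threshold=2)
results = [True, True, False, False, False, True, True]
states = [tracker.record(r) for r in results]
print(states)  # [False, True, True, True, False, False, True]
```

With these settings a single failed probe does not remove the pod from endpoints, and a single success after an outage does not put it back.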

Startup Probe

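Handles slow-starting containers. Liveness and readiness probes are disabled until the startup probe succeeds, which prevents premature liveness failures during long initialization. If the startup probe does not succeed within roughly failureThreshold * periodSeconds, the container is killed and restarted.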

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allow up to 30 * 10s = 5 minutes to start
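
The startup budget in the comment above is simple arithmetic, and the same arithmetic can be run in reverse to pick a failureThreshold. The helper names below are hypothetical, and the estimate is approximate: it ignores timeoutSeconds and scheduling jitter in the kubelet.

```python
import math


def startup_budget_seconds(period_seconds: int, failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to start before it is restarted."""
    return initial_delay_seconds + period_seconds * failure_threshold


def failure_threshold_for(max_startup_seconds: int, period_seconds: int) -> int:
    """Smallest failureThreshold that covers the desired startup time."""
    return math.ceil(max_startup_seconds / period_seconds)


# The startup probe above: 30 attempts, 10 seconds apart.
print(startup_budget_seconds(period_seconds=10, failure_threshold=30))   # 300

# Working backwards: a 2-minute budget probed every 5 seconds.
print(failure_threshold_for(max_startup_seconds=120, period_seconds=5))  # 24
```

A shorter periodSeconds with a proportionally larger failureThreshold keeps the same budget while letting a fast start be noticed sooner.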

Implementing Health Endpoints

# FastAPI: health endpoints that do meaningful checks
from fastapi import FastAPI, Response
import asyncpg
import redis.asyncio as aioredis

app = FastAPI()
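db_pool = None        # asyncpg pool, assumed to be created at application startup
redis_client = None   # redis.asyncio client, assumed to be created at application startup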

@app.get("/healthz")
async def liveness():
    # Liveness: check that the app is not deadlocked
    # Keep this SIMPLE — avoid external dependency checks
    return {"status": "alive"}

@app.get("/ready")
async def readiness(response: Response):
    checks = {}

    # Check database connectivity
    try:
        await db_pool.fetchval("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {e}"
        response.status_code = 503

    # Check Redis connectivity
    try:
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = f"error: {e}"
        response.status_code = 503

    return checks
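
One gap in the handler above is that a hanging dependency can make the readiness endpoint itself exceed the probe's timeoutSeconds. A minimal sketch of one way to bound each check with `asyncio.wait_for` follows. The helpers `run_check` and `readiness_checks` are hypothetical names, and the fake checks stand in for real calls such as `db_pool.fetchval("SELECT 1")` and `redis_client.ping()`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Check = Callable[[], Awaitable[object]]


async def run_check(check: Check, timeout: float) -> str:
    """Run one dependency check, bounding how long it may take."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return "ok"
    except asyncio.TimeoutError:
        return "error: timed out"
    except Exception as exc:
        # Report only the exception type so internal details stay out of the response.
        return f"error: {type(exc).__name__}"


async def readiness_checks(checks: Dict[str, Check],
                           timeout: float = 1.0) -> Tuple[int, Dict[str, str]]:
    """Run all checks concurrently and return (HTTP status, per-check report)."""
    names = list(checks)
    results = await asyncio.gather(*(run_check(checks[n], timeout) for n in names))
    report = dict(zip(names, results))
    status = 200 if all(v == "ok" for v in report.values()) else 503
    return status, report


async def fake_db() -> int:
    return 1                      # stands in for a fast, healthy database


async def slow_redis() -> None:
    await asyncio.sleep(5)        # stands in for a dependency that hangs


status, report = asyncio.run(
    readiness_checks({"database": fake_db, "redis": slow_redis}, timeout=0.1)
)
print(status, report)  # 503 {'database': 'ok', 'redis': 'error: timed out'}
```

In the FastAPI handler, the returned status would be assigned to `response.status_code` and the report returned as the body. The per-check timeout should stay below the probe's timeoutSeconds so the kubelet receives a 503 rather than a timeout.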

Go: Health Endpoint

package main

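import (
    "context"
    "database/sql"
    "encoding/json"
    "net/http"
    "time"
)

// db is assumed to be opened with sql.Open (and a registered driver) during startup.
var db *sql.DB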

func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{"status": "alive"})
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
    defer cancel()

    if err := db.PingContext(ctx); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        json.NewEncoder(w).Encode(map[string]string{
            "status": "not ready",
            "error":  err.Error(),
        })
        return
    }

    w.WriteHeader(http.StatusOK)
    json.NewEncoder(w).Encode(map[string]string{"status": "ready"})
}

Common Mistakes

Liveness Probe Checking External Dependencies

# BAD: if the database is slow, the pod restarts unnecessarily
# A database outage would cause all pods to restart in a loop
livenessProbe:
  httpGet:
    path: /health-with-db-check
    port: 8080

# GOOD: liveness checks only that the app process is alive
livenessProbe:
  httpGet:
    path: /healthz  # returns 200 always unless the app is hung
    port: 8080
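
An endpoint that always returns 200 detects a hung web server, but it cannot see a stuck background worker. One internal-only option is a heartbeat that the worker updates and the liveness handler inspects. The sketch below is an illustration under assumptions: the `Heartbeat` class is hypothetical and the 30-second limit is arbitrary.

```python
import time
from typing import Optional


class Heartbeat:
    """Tracks the last time a worker loop reported progress."""

    def __init__(self, max_age_seconds: float) -> None:
        self.max_age_seconds = max_age_seconds
        self._last_beat = time.monotonic()

    def beat(self) -> None:
        """Call this from the main work loop on every iteration."""
        self._last_beat = time.monotonic()

    def is_alive(self, now: Optional[float] = None) -> bool:
        """True if the loop has reported progress recently enough."""
        current = time.monotonic() if now is None else now
        return (current - self._last_beat) <= self.max_age_seconds


heartbeat = Heartbeat(max_age_seconds=30)
heartbeat.beat()

print(heartbeat.is_alive())                            # True: the loop just reported in
print(heartbeat.is_alive(now=time.monotonic() + 60))   # False: simulated 60s of silence
```

A `/healthz` handler built on this would return 503 when `heartbeat.is_alive()` is false. The check depends only on in-process state, so a database or cache outage cannot trigger a restart.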

Too-Aggressive Liveness Settings

# BAD: will restart pods during brief GC pauses or CPU spikes
livenessProbe:
  periodSeconds: 5
  timeoutSeconds: 1
  failureThreshold: 1

# BETTER: tolerant of transient slowness
livenessProbe:
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3   # allows 30 seconds of failure before restart

Missing Startup Probe for Slow Services

# BAD: liveness probe may fire before the JVM finishes starting
# initialDelaySeconds is a fixed wait — wrong for variable startup times
livenessProbe:
  initialDelaySeconds: 60   # hardcoded guess
  httpGet:
    path: /healthz
    port: 8080

# GOOD: startup probe handles variable initialization time
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 60    # allow up to 5 minutes

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
  # No initialDelaySeconds needed — startup probe gates liveness

Full Configuration Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
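spec:
  # apps/v1 requires a selector that matches the pod template labels;
  # the "app: api" label is illustrative
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers: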
      - name: api
        image: my-api:latest
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 24    # 2 minutes max startup
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 15
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1

Conclusion

Keep liveness probes simple and internal-only: they should detect a hung app, not dependency failures. Use readiness probes to check real dependencies and gate traffic. Use startup probes for slow-starting services instead of guessing an initialDelaySeconds value. Many production problems trace back to liveness probes that are too aggressive or that check external services, which causes restart loops during dependency outages.