Monitoring and Observability with Prometheus and Grafana


Monitoring and observability are critical components of modern infrastructure and application management. Prometheus and Grafana have become the de facto standard for metrics collection, monitoring, and visualization in cloud-native environments. In this comprehensive guide, we will explore Prometheus architecture, metric types, PromQL queries, Grafana dashboards, alerting rules, service discovery, exporters, best practices, and real-world examples.

Understanding Monitoring vs Observability

Before diving into the tools, let us understand the difference:

Monitoring tells you when something is wrong. It is about collecting predefined metrics and setting alerts on known failure modes.

Observability tells you why something is wrong. It is about understanding the internal state of your system from its external outputs through metrics, logs, and traces.

Prometheus and Grafana together provide both monitoring and observability capabilities.

Prometheus Architecture

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability.

Core Components

  1. Prometheus Server: Scrapes and stores time series data
  2. Client Libraries: Instrument application code
  3. Push Gateway: For short-lived jobs
  4. Exporters: Expose metrics from third-party systems
  5. Alertmanager: Handles alerts
  6. Service Discovery: Automatically discovers targets

How Prometheus Works

Prometheus uses a pull model where it scrapes metrics from instrumented targets at specified intervals. Metrics are stored in a time series database with a flexible query language (PromQL) for analysis.
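
To make the pull model concrete, here is an illustrative (made-up) sample of what a target's /metrics endpoint returns in the Prometheus text exposition format; Prometheus parses these lines into time series on every scrape:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 1027
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.4576e+07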

Installing Prometheus

Docker Installation

# Create prometheus configuration
cat > prometheus.yml <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
EOF

# Run Prometheus
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --web.console.libraries=/usr/share/prometheus/console_libraries \
  --web.console.templates=/usr/share/prometheus/consoles
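
Once the container is running, a quick sanity check confirms that Prometheus is healthy and scraping its targets; the commands below assume the default port mapping from the run command above:

# Check Prometheus health and list the discovered scrape targets
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9090/api/v1/targets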

Kubernetes Installation with Helm

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Operator
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123
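
After the chart installs, it is worth verifying that all components of the stack came up before moving on; a minimal check, assuming the namespace used above:

# Confirm that Prometheus, Alertmanager, Grafana and the exporters are running
kubectl get pods -n monitoring
kubectl get svc -n monitoring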

Custom values for production deployment:

# values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "45GB"
    replicas: 2
    
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi
    
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    
    additionalScrapeConfigs:
      - job_name: 'custom-app'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - production
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

alertmanager:
  alertmanagerSpec:
    replicas: 3
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi

grafana:
  replicas: 2
  adminPassword: "ChangeMe123!"
  
  persistence:
    enabled: true
    size: 10Gi
  
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-operated:9090
          access: proxy
          isDefault: true

Metric Types in Prometheus

Prometheus supports four metric types:

Counter

A counter is a cumulative metric that only increases. Use for counting requests, errors, completed tasks.

from prometheus_client import Counter

# Create counter
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Increment counter
http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
http_requests_total.labels(method='POST', endpoint='/api/users', status='201').inc()

Gauge

A gauge is a metric that can go up or down. Use for temperature, memory usage, concurrent requests.

from prometheus_client import Gauge

# Create gauge
active_connections = Gauge(
    'active_connections',
    'Number of active connections',
    ['service']
)

# Set gauge value
active_connections.labels(service='database').set(42)

# Increment/decrement
active_connections.labels(service='cache').inc()
active_connections.labels(service='cache').dec()

Histogram

A histogram samples observations and counts them in configurable buckets. Use for request durations, response sizes.

from prometheus_client import Histogram

# Create histogram
request_duration_seconds = Histogram(
    'request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Observe value
request_duration_seconds.labels(method='GET', endpoint='/api/users').observe(0.234)

# Use as decorator
@request_duration_seconds.labels(method='GET', endpoint='/api/status').time()
def get_status():
    # Function code
    return {"status": "ok"}
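
Under the hood, a single histogram is exported as several time series: cumulative _bucket counters (one per le bound plus +Inf), a _sum of all observations, and a _count. The sample below is illustrative (the numbers are made up) and is what histogram_quantile() later operates on:

request_duration_seconds_bucket{method="GET",endpoint="/api/users",le="0.1"} 240
request_duration_seconds_bucket{method="GET",endpoint="/api/users",le="0.5"} 512
request_duration_seconds_bucket{method="GET",endpoint="/api/users",le="+Inf"} 530
request_duration_seconds_sum{method="GET",endpoint="/api/users"} 112.4
request_duration_seconds_count{method="GET",endpoint="/api/users"} 530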

Summary

A summary samples observations and calculates configurable quantiles on the client side. It is similar to a histogram, but its quantiles are computed in the application and cannot be meaningfully aggregated across instances, so histograms are usually preferred when you need server-side aggregation.

from prometheus_client import Summary

# Create summary
request_latency = Summary(
    'request_latency_seconds',
    'Request latency in seconds',
    ['endpoint']
)

# Observe value
request_latency.labels(endpoint='/api/data').observe(0.156)

Instrumenting Applications

Python Application with Flask

from flask import Flask, request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total requests',
    ['method', 'endpoint', 'http_status']
)

REQUEST_DURATION = Histogram(
    'app_request_duration_seconds',
    'Request duration',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'app_active_requests',
    'Active requests',
    ['endpoint']
)

# Middleware for automatic instrumentation
@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.labels(endpoint=request.path).inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time
    
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        http_status=response.status_code
    ).inc()
    
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.path
    ).observe(request_duration)
    
    ACTIVE_REQUESTS.labels(endpoint=request.path).dec()
    
    return response

# Business logic endpoints
@app.route('/')
def index():
    return {"message": "Hello World"}

@app.route('/api/users')
def get_users():
    # Simulate database query
    time.sleep(0.1)
    return {"users": ["Alice", "Bob"]}

# Metrics endpoint
@app.route('/metrics')
def metrics():
    return Response(generate_latest(REGISTRY), mimetype='text/plain')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
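
With the app running, you can confirm that the instrumentation works by requesting the metrics endpoint directly (the port matches the app.run call above):

# Inspect the exposed metrics for the Flask app
curl -s http://localhost:5000/metrics | grep app_requests_total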

Go Application

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )
    
    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
    
    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
)

func instrumentedHandler(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeConnections.Inc()
        defer activeConnections.Dec()
        
        // Call the actual handler
        next(w, r)
        
        // Record metrics (status is hardcoded to "200" for brevity; wrap the
        // ResponseWriter to capture the real status code in production)
        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
    }
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Hello, World!"))
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/", instrumentedHandler(indexHandler))

    log.Fatal(http.ListenAndServe(":8080", nil))
}

PromQL Queries

PromQL is Prometheus’ query language for selecting and aggregating time series data.

Basic Queries

# Get current value of a metric
http_requests_total

# Filter by labels
http_requests_total{method="GET", status="200"}

# Range vector (last 5 minutes)
http_requests_total[5m]

# Rate of increase (per second)
rate(http_requests_total[5m])

# Increase over time range
increase(http_requests_total[1h])

Aggregation Operators

# Sum across all instances
sum(http_requests_total)

# Sum by label
sum by(endpoint) (http_requests_total)

# Average by label
avg by(instance) (cpu_usage_percent)

# Maximum value
max(memory_usage_bytes)

# Minimum value
min(memory_usage_bytes)

# Count number of time series
count(up)

# Standard deviation
stddev(response_time_seconds)

Complex Queries

# Request rate by endpoint
sum by(endpoint) (rate(http_requests_total[5m]))

# Error rate (errors per second)
sum(rate(http_requests_total{status=~"5.."}[5m]))

# Success rate percentage
sum(rate(http_requests_total{status="200"}[5m])) / 
sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency
histogram_quantile(0.95, 
  sum by(le) (rate(http_request_duration_seconds_bucket[5m]))
)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / 
node_memory_MemTotal_bytes * 100

# CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / 
node_filesystem_size_bytes * 100

# Predict when disk will be full (linear regression)
predict_linear(node_filesystem_avail_bytes[4h], 24*3600) < 0

Comparison and Arithmetic

# Compare current vs 1 hour ago
rate(http_requests_total[5m]) / 
rate(http_requests_total[5m] offset 1h)

# Calculate ratio
sum(rate(http_requests_total{status="500"}[5m])) / 
sum(rate(http_requests_total[5m]))

# Subtract metrics
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Combine multiple metrics
avg(cpu_usage) * avg(memory_usage) * avg(disk_io_utilization)

Exporters

Exporters expose metrics from third-party systems in Prometheus format.

Node Exporter

Node Exporter exposes hardware and OS metrics.

# Run Node Exporter
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

Kubernetes DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostNetwork: true
      hostPID: true
      containers:
      - name: node-exporter
        image: quay.io/prometheus/node-exporter:latest
        args:
          - --path.procfs=/host/proc
          - --path.sysfs=/host/sys
          - --path.rootfs=/host/root
          - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
        ports:
        - containerPort: 9100
          name: metrics
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        - name: root
          mountPath: /host/root
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      - name: root
        hostPath:
          path: /

Custom Exporter

Create a custom exporter in Python:

from prometheus_client import start_http_server, Gauge
import time
import psutil

# Define metrics
cpu_usage = Gauge('custom_cpu_usage_percent', 'CPU usage percentage')
memory_usage = Gauge('custom_memory_usage_percent', 'Memory usage percentage')
disk_usage = Gauge('custom_disk_usage_percent', 'Disk usage percentage', ['mountpoint'])

def collect_metrics():
    """Collect system metrics"""
    while True:
        # CPU usage
        cpu_usage.set(psutil.cpu_percent(interval=1))
        
        # Memory usage
        memory = psutil.virtual_memory()
        memory_usage.set(memory.percent)
        
        # Disk usage
        for partition in psutil.disk_partitions():
            try:
                usage = psutil.disk_usage(partition.mountpoint)
                disk_usage.labels(mountpoint=partition.mountpoint).set(usage.percent)
            except PermissionError:
                continue
        
        time.sleep(15)

if __name__ == '__main__':
    # Start metrics server
    start_http_server(8000)
    
    # Collect metrics
    collect_metrics()
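
The exporter now serves metrics on port 8000, so Prometheus just needs a scrape job pointing at it; a minimal sketch, assuming the exporter is reachable as localhost:8000 from Prometheus:

scrape_configs:
  - job_name: 'custom-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8000']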

Alerting with Alertmanager

Alertmanager handles alerts sent by Prometheus server.
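
For the rules below to fire anywhere, Prometheus itself must know where Alertmanager runs and which rule files to load. A minimal sketch of that wiring in prometheus.yml, assuming Alertmanager is reachable as alertmanager:9093 and the rules are mounted under /etc/prometheus/rules/ (the kube-prometheus-stack chart configures this automatically):

# prometheus.yml (excerpt)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yaml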

Alerting Rules

# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yaml: |
    groups:
      - name: instance_alerts
        interval: 30s
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance  down"
              description: " of job  has been down for more than 5 minutes."
          
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on "
              description: "CPU usage is above 80% (current value: %)"
          
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on "
              description: "Memory usage is above 85% (current value: %)"
          
          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on "
              description: "Disk space is below 15% on "
          
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is above 5% (current value: )"
          
          - alert: HighLatency
            expr: histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High request latency"
              description: "95th percentile latency is above 2 seconds (current value: s)"
      
      - name: application_alerts
        interval: 30s
        rules:
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod / is crash looping"
              description: "Pod has restarted  times in the last 15 minutes"
          
          - alert: PodNotReady
            expr: kube_pod_status_phase{phase!="Running"} > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Pod / not ready"
              description: "Pod has been in  state for more than 10 minutes"

Alertmanager Configuration

# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'
  
  - name: 'critical-alerts'
    slack_configs:
      - channel: '#critical-alerts'
        title: 'Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }} - {{ .Annotations.description }}{{ end }}'
        send_resolved: true
    
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
  
  - name: 'warning-alerts'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }} - {{ .Annotations.description }}{{ end }}'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Grafana Dashboards

Grafana provides beautiful visualizations for Prometheus data.

Installing Grafana

# Run Grafana with Docker
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  -e "GF_SECURITY_ADMIN_PASSWORD=admin" \
  grafana/grafana:latest

Configuring Prometheus Data Source

# grafana-datasource.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server:9090
        isDefault: true
        editable: true
        jsonData:
          timeInterval: "15s"
          queryTimeout: "60s"
          httpMethod: "POST"

Creating Dashboards with JSON

{
  "dashboard": {
    "title": "Application Metrics",
    "tags": ["application", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": ""
          }
        ],
        "yaxes": [
          {
            "label": "requests/sec",
            "format": "short"
          }
        ]
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
            "legendFormat": "Error Rate %"
          }
        ]
      },
      {
        "id": 3,
        "title": "Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p99"
          }
        ]
      }
    ],
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    }
  }
}

Service Discovery

Prometheus supports various service discovery mechanisms.

Kubernetes Service Discovery

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      
      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

Consul Service Discovery

scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        datacenter: 'dc1'
        services: []
    
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      
      - source_labels: [__meta_consul_tags]
        regex: '.*,monitoring,.*'
        action: keep

Best Practices

Metric Naming Conventions

# Good metric names
http_requests_total  # counter
http_request_duration_seconds  # histogram
database_connection_pool_size  # gauge
queue_messages_processed_total  # counter

# Bad metric names
httpRequests  # Use snake_case, not camelCase
request_time  # Missing unit
total_requests  # Should end with _total for counters

Label Best Practices

# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}

# Bad: High cardinality labels (user IDs, timestamps)
http_requests_total{user_id="12345", timestamp="2024-01-01T12:00:00Z"}

# Limit number of labels
# Good: 5-10 labels
# Bad: 20+ labels
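
If you suspect a cardinality problem, Prometheus itself can tell you which metrics carry the most series; an illustrative PromQL query to run in the Prometheus UI:

# Top 10 metric names by number of active series
topk(10, count by (__name__) ({__name__=~".+"}))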

Recording Rules

Use recording rules to precompute expensive queries:

groups:
  - name: request_rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by(job) (rate(http_requests_total[5m]))
      
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m])))
      
      - record: instance:node_cpu:utilization
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
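
Dashboards and alerting rules can then reference the precomputed series instead of re-evaluating the raw expressions; the threshold below is illustrative:

# Alert or graph against the recorded series rather than the raw counters
job:http_requests:rate5m > 100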

Real-World Example: Complete Monitoring Stack

# complete-monitoring-stack.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: app
        image: sample-app:latest
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8081
          name: metrics
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  namespace: default
  labels:
    app: sample-app
spec:
  selector:
    app: sample-app
  ports:
  - name: http
    port: 80
    targetPort: 8080
  - name: metrics
    port: 8081
    targetPort: 8081

Conclusion

Prometheus and Grafana provide a powerful, scalable, and flexible monitoring solution for modern cloud-native applications. By understanding the architecture, metric types, PromQL queries, alerting capabilities, and best practices, you can build comprehensive monitoring and observability systems.

Key takeaways:

  • Understand the four metric types and when to use each
  • Instrument your applications properly with client libraries
  • Use PromQL effectively for querying and analysis
  • Implement comprehensive alerting rules
  • Create meaningful Grafana dashboards
  • Follow metric naming and labeling best practices
  • Use recording rules for expensive queries
  • Leverage service discovery for dynamic environments
  • Monitor the metrics that matter to your business

References

  • Prometheus Documentation: https://prometheus.io/docs/
  • Prometheus Best Practices: https://prometheus.io/docs/practices/
  • PromQL Documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/
  • Grafana Documentation: https://grafana.com/docs/
  • Prometheus Operator: https://github.com/prometheus-operator/prometheus-operator
  • Node Exporter: https://github.com/prometheus/node_exporter
  • Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager/
  • Client Libraries: https://prometheus.io/docs/instrumenting/clientlibs/
  • Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/
  • Prometheus Exporters: https://prometheus.io/docs/instrumenting/exporters/