Monitoring and Observability with Prometheus and Grafana
Monitoring and observability are critical components of modern infrastructure and application management. Prometheus and Grafana have become the de facto standard for metrics collection, monitoring, and visualization in cloud-native environments. In this comprehensive guide, we will explore Prometheus architecture, metric types, PromQL queries, Grafana dashboards, alerting rules, service discovery, exporters, best practices, and real-world examples.
Understanding Monitoring vs Observability
Before diving into the tools, let us understand the difference:
Monitoring tells you when something is wrong. It is about collecting predefined metrics and setting alerts on known failure modes.
Observability tells you why something is wrong. It is about understanding the internal state of your system from its external outputs through metrics, logs, and traces.
Prometheus and Grafana together provide both monitoring and observability capabilities.
Prometheus Architecture
Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability.
Core Components
- Prometheus Server: Scrapes and stores time series data
- Client Libraries: Instrument application code to expose metrics
- Pushgateway: Lets short-lived batch jobs push metrics for Prometheus to scrape
- Exporters: Expose metrics from third-party systems (databases, hardware, message queues)
- Alertmanager: Deduplicates, groups, and routes alerts to notification channels
- Service Discovery: Automatically discovers scrape targets
How Prometheus Works
Prometheus uses a pull model where it scrapes metrics from instrumented targets at specified intervals. Metrics are stored in a time series database with a flexible query language (PromQL) for analysis.
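When Prometheus scrapes a target, the target answers with metrics in a plain-text exposition format. A response from the /metrics endpoint of an instrumented service might look like this (names and values are illustrative):
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 1027
http_requests_total{method="POST",endpoint="/api/users",status="201"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.4576e+07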
Installing Prometheus
Docker Installation
# Create prometheus configuration
cat > prometheus.yml <<EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
EOF

# Run Prometheus
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --web.console.libraries=/usr/share/prometheus/console_libraries \
  --web.console.templates=/usr/share/prometheus/consoles
Kubernetes Installation with Helm
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Operator (kube-prometheus-stack)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123
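Once the chart is installed, you can reach the bundled UIs locally with kubectl port-forward; the service names below assume the release name prometheus used above and may differ in your cluster:
# Grafana UI (log in as admin with the password set above)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

# Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090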
Custom values for production deployment:
# values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "45GB"
    replicas: 2
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: fast-ssd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      - job_name: 'custom-app'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - production
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__

alertmanager:
  alertmanagerSpec:
    replicas: 3
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 10Gi

grafana:
  replicas: 2
  adminPassword: "ChangeMe123!"
  persistence:
    enabled: true
    size: 10Gi
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-operated:9090
          access: proxy
          isDefault: true
Metric Types in Prometheus
Prometheus supports four metric types:
Counter
A counter is a cumulative metric that only increases. Use for counting requests, errors, completed tasks.
from prometheus_client import Counter

# Create counter
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Increment counter
http_requests_total.labels(method='GET', endpoint='/api/users', status='200').inc()
http_requests_total.labels(method='POST', endpoint='/api/users', status='201').inc()
Gauge
A gauge is a metric that can go up or down. Use for temperature, memory usage, concurrent requests.
from prometheus_client import Gauge

# Create gauge
active_connections = Gauge(
    'active_connections',
    'Number of active connections',
    ['service']
)

# Set gauge value
active_connections.labels(service='database').set(42)

# Increment/decrement
active_connections.labels(service='cache').inc()
active_connections.labels(service='cache').dec()
Histogram
A histogram samples observations and counts them in configurable buckets. Use for request durations, response sizes.
from prometheus_client import Histogram

# Create histogram
request_duration_seconds = Histogram(
    'request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Observe a value
request_duration_seconds.labels(method='GET', endpoint='/api/users').observe(0.234)

# Use as a decorator
@request_duration_seconds.labels(method='GET', endpoint='/api/status').time()
def get_status():
    # Function code
    return {"status": "ok"}
Summary
A summary samples observations and calculates configurable quantiles on the client side. It is similar to a histogram, but its quantiles are more expensive to compute and cannot be aggregated across instances.
from prometheus_client import Summary

# Create summary
request_latency = Summary(
    'request_latency_seconds',
    'Request latency in seconds',
    ['endpoint']
)

# Observe value
request_latency.labels(endpoint='/api/data').observe(0.156)
Instrumenting Applications
Python Application with Flask
from flask import Flask, request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'app_requests_total',
    'Total requests',
    ['method', 'endpoint', 'http_status']
)

REQUEST_DURATION = Histogram(
    'app_request_duration_seconds',
    'Request duration',
    ['method', 'endpoint']
)

ACTIVE_REQUESTS = Gauge(
    'app_active_requests',
    'Active requests',
    ['endpoint']
)

# Middleware for automatic instrumentation
@app.before_request
def before_request():
    request.start_time = time.time()
    ACTIVE_REQUESTS.labels(endpoint=request.path).inc()

@app.after_request
def after_request(response):
    request_duration = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        http_status=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=request.path
    ).observe(request_duration)
    ACTIVE_REQUESTS.labels(endpoint=request.path).dec()
    return response

# Business logic endpoints
@app.route('/')
def index():
    return {"message": "Hello World"}

@app.route('/api/users')
def get_users():
    # Simulate database query
    time.sleep(0.1)
    return {"users": ["Alice", "Bob"]}

# Metrics endpoint
@app.route('/metrics')
def metrics():
    return Response(generate_latest(REGISTRY), mimetype='text/plain')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
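If the app runs outside Kubernetes, a minimal scrape job for it might look like the following; the hostname app and port 5000 match the code above and are otherwise assumptions:
scrape_configs:
  - job_name: 'flask-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['app:5000']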
Go Application
package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )

    activeConnections = promauto.NewGauge(
        prometheus.GaugeOpts{
            Name: "active_connections",
            Help: "Number of active connections",
        },
    )
)

func instrumentedHandler(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        activeConnections.Inc()
        defer activeConnections.Dec()

        // Call the actual handler
        next(w, r)

        // Record metrics
        duration := time.Since(start).Seconds()
        httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
        httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, "200").Inc()
    }
}

func indexHandler(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("Hello, World!"))
}

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/", instrumentedHandler(indexHandler))
    http.ListenAndServe(":8080", nil)
}
PromQL Queries
PromQL is Prometheus’ query language for selecting and aggregating time series data.
Basic Queries
# Get current value of a metric
http_requests_total
# Filter by labels
http_requests_total{method="GET", status="200"}
# Range vector (last 5 minutes)
http_requests_total[5m]
# Rate of increase (per second)
rate(http_requests_total[5m])
# Increase over time range
increase(http_requests_total[1h])
Aggregation Operators
# Sum across all instances
sum(http_requests_total)
# Sum by label
sum by(endpoint) (http_requests_total)
# Average by label
avg by(instance) (cpu_usage_percent)
# Maximum value
max(memory_usage_bytes)
# Minimum value
min(memory_usage_bytes)
# Count number of time series
count(up)
# Standard deviation
stddev(response_time_seconds)
Complex Queries
# Request rate by endpoint
sum by(endpoint) (rate(http_requests_total[5m]))
# Error rate (errors per second)
sum(rate(http_requests_total{status=~"5.."}[5m]))
# Success rate percentage
sum(rate(http_requests_total{status="200"}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# 95th percentile latency
histogram_quantile(0.95,
sum by(le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) /
node_memory_MemTotal_bytes * 100
# CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_avail_bytes) /
node_filesystem_size_bytes * 100
# Predict when disk will be full (linear regression)
predict_linear(node_filesystem_avail_bytes[4h], 24*3600) < 0
Comparison and Arithmetic
# Compare current vs 1 hour ago
rate(http_requests_total[5m]) /
rate(http_requests_total[5m] offset 1h)
# Calculate ratio
sum(rate(http_requests_total{status="500"}[5m])) /
sum(rate(http_requests_total[5m]))
# Subtract metrics
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Combine multiple metrics
avg(cpu_usage) * avg(memory_usage) * avg(disk_io_utilization)
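These expressions can also be evaluated outside the web UI through the Prometheus HTTP API. Two illustrative curl calls, assuming Prometheus listens on localhost:9090:
# Instant query: current value of `up` for every target
curl -s 'http://localhost:9090/api/v1/query?query=up'

# Range query: request rate over the last hour at 30s resolution (GNU date syntax)
curl -s 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=30'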
Exporters
Exporters expose metrics from third-party systems in Prometheus format.
Node Exporter
Node Exporter exposes hardware and OS metrics.
# Run Node Exporter
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
Kubernetes DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"
    spec:
      hostNetwork: true
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:latest
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
            - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
          ports:
            - containerPort: 9100
              name: metrics
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
        - name: root
          hostPath:
            path: /
Custom Exporter
Create a custom exporter in Python:
from prometheus_client import start_http_server, Gauge
import time
import psutil

# Define metrics
cpu_usage = Gauge('custom_cpu_usage_percent', 'CPU usage percentage')
memory_usage = Gauge('custom_memory_usage_percent', 'Memory usage percentage')
disk_usage = Gauge('custom_disk_usage_percent', 'Disk usage percentage', ['mountpoint'])

def collect_metrics():
    """Collect system metrics"""
    while True:
        # CPU usage
        cpu_usage.set(psutil.cpu_percent(interval=1))

        # Memory usage
        memory = psutil.virtual_memory()
        memory_usage.set(memory.percent)

        # Disk usage
        for partition in psutil.disk_partitions():
            try:
                usage = psutil.disk_usage(partition.mountpoint)
                disk_usage.labels(mountpoint=partition.mountpoint).set(usage.percent)
            except PermissionError:
                continue

        time.sleep(15)

if __name__ == '__main__':
    # Start metrics server
    start_http_server(8000)
    # Collect metrics
    collect_metrics()
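Prometheus scrapes this exporter like any other target; a minimal scrape job for it could look like this (the target port matches start_http_server(8000) above, and the exporter is assumed to run on the same host as Prometheus):
scrape_configs:
  - job_name: 'custom-exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8000']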
Alerting with Alertmanager
Alertmanager handles alerts sent by Prometheus server.
Alerting Rules
# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yaml: |
    groups:
      - name: instance_alerts
        interval: 30s
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% (current value: {{ $value }}%)"

          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85% (current value: {{ $value }}%)"

          - alert: DiskSpaceLow
            expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk space is below 15% on {{ $labels.mountpoint }}"

          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is above 5% (current value: {{ $value }})"

          - alert: HighLatency
            expr: histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "High request latency"
              description: "95th percentile latency is above 2 seconds (current value: {{ $value }}s)"

      - name: application_alerts
        interval: 30s
        rules:
          - alert: PodCrashLooping
            expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
              description: "Pod {{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes"

          - alert: PodNotReady
            expr: kube_pod_status_phase{phase!="Running"} > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
              description: "Pod {{ $labels.pod }} has been in {{ $labels.phase }} state for more than 10 minutes"
Alertmanager Configuration
# alertmanager-config.yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'password'

  - name: 'critical-alerts'
    slack_configs:
      - channel: '#critical-alerts'
        title: 'Critical Alert'
        text: ''
        send_resolved: true
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'

  - name: 'warning-alerts'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning Alert'
        text: ''
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
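To confirm that routing and receivers behave as expected without waiting for a real incident, you can post a synthetic alert to Alertmanager's v2 API; the address and labels below are illustrative:
# Fire a test alert (Alertmanager assumed to listen on localhost:9093)
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {"alertname": "TestAlert", "severity": "warning", "instance": "test-host"},
        "annotations": {"summary": "Synthetic alert to exercise routing"}
      }]'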
Grafana Dashboards
Grafana provides beautiful visualizations for Prometheus data.
Installing Grafana
# Run Grafana with Docker
docker run -d \
  --name=grafana \
  -p 3000:3000 \
  -e "GF_SECURITY_ADMIN_PASSWORD=admin" \
  grafana/grafana:latest
Configuring Prometheus Data Source
# grafana-datasource.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server:9090
        isDefault: true
        editable: true
        jsonData:
          timeInterval: "15s"
          queryTimeout: "60s"
          httpMethod: "POST"
Creating Dashboards with JSON
{
  "dashboard": {
    "title": "Application Metrics",
    "tags": ["application", "monitoring"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{endpoint}}"
          }
        ],
        "yaxes": [
          {
            "label": "requests/sec",
            "format": "short"
          }
        ]
      },
      {
        "id": 2,
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
            "legendFormat": "Error Rate %"
          }
        ]
      },
      {
        "id": 3,
        "title": "Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p99"
          }
        ]
      }
    ],
    "refresh": "30s",
    "time": {
      "from": "now-1h",
      "to": "now"
    }
  }
}
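A dashboard in this format can be imported through the Grafana UI, provisioned from files, or pushed to Grafana's HTTP API. For example, assuming the JSON above is saved as dashboard.json (it already contains the required "dashboard" wrapper) and an API token is available:
# Push the dashboard to Grafana (URL and token are placeholders)
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d @dashboard.json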
Service Discovery
Prometheus supports various service discovery mechanisms.
Kubernetes Service Discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom metrics path if specified
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use custom port if specified
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
Consul Service Discovery
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul.service.consul:8500'
        datacenter: 'dc1'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_tags]
        regex: '.*,monitoring,.*'
        action: keep
Best Practices
Metric Naming Conventions
# Good metric names
http_requests_total # counter
http_request_duration_seconds # histogram
database_connection_pool_size # gauge
queue_messages_processed_total # counter
# Bad metric names
httpRequests # Use snake_case, not camelCase
request_time # Missing unit
total_requests # Should end with _total for counters
Label Best Practices
# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# Bad: High cardinality labels (user IDs, timestamps)
http_requests_total{user_id="12345", timestamp="2024-01-01T12:00:00Z"}
# Limit number of labels
# Good: 5-10 labels
# Bad: 20+ labels
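When cardinality does get out of hand, it shows up as an explosion in the number of time series. A query like the following (expensive, so run it sparingly) lists which metric names contribute the most series:
# Top 10 metric names by number of time series
topk(10, count by (__name__)({__name__=~".+"}))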
Recording Rules
Use recording rules to precompute expensive queries:
groups:
  - name: request_rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by(job) (rate(http_requests_total[5m]))

      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by(job, le) (rate(http_request_duration_seconds_bucket[5m])))

      - record: instance:node_cpu:utilization
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
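The recorded series can then be used anywhere a raw expression would be, which keeps alerts and dashboards cheap to evaluate. For example, the HighCPUUsage alert from earlier could reference the precomputed series (threshold unchanged):
groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80   # uses the recording rule above
        for: 10m
        labels:
          severity: warning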
Real-World Example: Complete Monitoring Stack
# complete-monitoring-stack.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8081"   # scrape the dedicated metrics port
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: app
          image: sample-app:latest
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 8081
              name: metrics
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  namespace: default
  labels:
    app: sample-app
spec:
  selector:
    app: sample-app
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics
      port: 8081
      targetPort: 8081
Conclusion
Prometheus and Grafana provide a powerful, scalable, and flexible monitoring solution for modern cloud-native applications. By understanding the architecture, metric types, PromQL queries, alerting capabilities, and best practices, you can build comprehensive monitoring and observability systems.
Key takeaways:
- Understand the four metric types and when to use each
- Instrument your applications properly with client libraries
- Use PromQL effectively for querying and analysis
- Implement comprehensive alerting rules
- Create meaningful Grafana dashboards
- Follow metric naming and labeling best practices
- Use recording rules for expensive queries
- Leverage service discovery for dynamic environments
- Monitor the metrics that matter to your business
References
- Prometheus Documentation: https://prometheus.io/docs/
- Prometheus Best Practices: https://prometheus.io/docs/practices/
- PromQL Documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Grafana Documentation: https://grafana.com/docs/
- Prometheus Operator: https://github.com/prometheus-operator/prometheus-operator
- Node Exporter: https://github.com/prometheus/node_exporter
- Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager/
- Client Libraries: https://prometheus.io/docs/instrumenting/clientlibs/
- Google SRE Book - Monitoring Distributed Systems: https://sre.google/sre-book/monitoring-distributed-systems/
- Prometheus Exporters: https://prometheus.io/docs/instrumenting/exporters/