Linux Performance Tools: Diagnosing CPU, Memory, and I/O Bottlenecks

Introduction#

When a service is slow, you need a systematic approach to identify whether the bottleneck is CPU, memory, disk I/O, network, or application code. Linux provides a rich set of performance tools at every level of the stack. This post covers the essential tools for diagnosing production performance issues.

The USE Method#

For every resource (CPU, memory, disk, network):
  Utilization: how busy is the resource?
  Saturation: is work queued because the resource is maxed?
  Errors: are there errors affecting the resource?

CPU: top/mpstat (U), vmstat run queue (S), dmesg/perf (E)
Memory: free/vmstat (U), vmstat si/so swap (S), dmesg OOM (E)
Disk: iostat %util (U), iostat await high (S), smartctl (E)
Network: ifstat/sar (U), tc -s qdisc drops / ss listen overflows (S), ip -s errors (E)
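The checklist above can be turned into a quick first pass. This is a minimal sketch that reads raw /proc counters directly, so it works even on boxes without sysstat installed:

```shell
#!/bin/sh
# USE first pass from /proc (no extra packages needed)

# CPU: saturation proxy (1-min load vs. core count)
load1=$(awk '{print $1}' /proc/loadavg)
ncpu=$(nproc)
echo "CPU: load(1m)=$load1 cores=$ncpu"

# Memory: utilization (MemAvailable vs. MemTotal, in kB)
awk '/^MemTotal|^MemAvailable/ {printf "Memory: %s %s kB\n", $1, $2}' /proc/meminfo

# Memory saturation: cumulative pages swapped in/out since boot
awk '/^pswpin|^pswpout/ {printf "Swap: %s=%s\n", $1, $2}' /proc/vmstat
```

Errors (the E in USE) still need per-resource checks (dmesg, ip -s, smartctl), but this covers utilization and saturation in one shot.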

CPU Analysis#

# Overall CPU usage
top -b -n 1 -d 1         # single snapshot, non-interactive
htop                      # interactive, color, per-core view

# Per-CPU utilization (spot imbalance)
mpstat -P ALL 1 5         # 5 samples, 1 second apart
# %usr %sys %iowait %steal %idle per CPU

# High %iowait: disk I/O is blocking CPUs
# High %steal: VM host is oversubscribed
# High %sys: kernel work (syscalls, interrupts)

# Find which process is using CPU
ps aux --sort=-%cpu | head -20
pidstat 1 5               # CPU usage per process, sampled

# CPU stealing (in VMs): common in cloud instances under load
# Watch steal column in top/vmstat: >5% = VM host problem

# CPU load vs utilization
uptime
# load averages: 1min 5min 15min
# On a 4-core system: load ≈ 4.0 → all cores busy
# load > num_cpus → saturation (tasks waiting for CPU)
# Caveat: Linux load also counts tasks in uninterruptible (D-state) I/O wait
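The load-vs-cores check is easy to script; the only wrinkle is that the POSIX shell can't compare floats, so awk does the comparison (a sketch, threshold per taste):

```shell
# Flag CPU saturation: 1-minute load average above the core count
ncpu=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
# awk handles the floating-point comparison the shell cannot
saturated=$(awk -v l="$load1" -v n="$ncpu" 'BEGIN {print (l+0 > n+0) ? "yes" : "no"}')
echo "load=$load1 cores=$ncpu saturated=$saturated"
```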

Memory Analysis#

# Memory overview
free -h
vmstat 1 5

# vmstat columns:
# r: processes waiting for CPU (runqueue)
# b: processes in uninterruptible sleep (I/O wait)
# si/so: swap in/out (kb/s) — non-zero = memory pressure
# us/sy/id/wa/st: CPU percentages

# Detailed memory breakdown
cat /proc/meminfo

# Find memory-hungry processes
ps aux --sort=-%mem | head -20
smem -k -p               # includes shared memory

# OOM kills (out of memory killer)
dmesg -T | grep -i "oom\|killed"
journalctl -k | grep -i "out of memory"

# Memory leak detection: watch process RSS over time
watch -n 2 'ps -o pid,rss,vsz,pmem -p "$(pgrep -d, myapp)"'  # -d, joins PIDs with commas for ps
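The watch loop is for eyeballing; for a scripted check, the same RSS figure can be read from /proc/&lt;pid&gt;/status and diffed between samples. A sketch, using the current shell's own PID as a stand-in target:

```shell
# Sample a PID's resident set size (kB) twice and report the delta
pid=$$  # stand-in target; substitute the PID you're watching
rss_kb() { awk '/^VmRSS:/ {print $2}' "/proc/$1/status"; }
start=$(rss_kb "$pid")
sleep 1
end=$(rss_kb "$pid")
echo "rss: start=${start}kB end=${end}kB delta=$((end - start))kB"
```

Run with a longer interval in a loop and log the output; a steadily positive delta under constant load is the leak signature.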

Disk I/O Analysis#

# I/O overview
iostat -xz 1 5
# r/s, w/s: reads/writes per second
# rkB/s, wkB/s: throughput
# await: average request time (ms) — good: <10ms SSD, <20ms HDD
# %util: device saturation (>80% = saturated for most workloads)
# aqu-sz: average queue depth; sustained >1 means requests are queueing (svctm is deprecated and removed in newer sysstat)

# Per-process I/O
iotop -b -n 5             # top-style I/O per process
pidstat -d 1 5            # I/O per process, sampled

# File-level I/O (which files are being accessed)
# strace (high overhead, use on a single process)
strace -f -e trace=read,write,open,openat,close -p $(pgrep myapp)  # modern libc opens via openat

# Open files / file descriptors
lsof -p $(pgrep myapp) | wc -l
ls -la /proc/$(pgrep myapp)/fd | wc -l
cat /proc/sys/fs/file-max  # system-wide fd limit

# Disk latency heatmap with perf
perf record -e block:block_rq_complete -a sleep 5
perf script | ... # analyze with FlameGraph
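One more disk-side signal worth checking: tasks stuck in uninterruptible sleep (state D) are almost always blocked on I/O, and a persistent population of them points at storage. A quick sketch:

```shell
# Count and list processes in uninterruptible sleep (D state)
dcount=$(ps -eo state= | grep -c '^D' || true)
echo "processes in D state: $dcount"
if [ "$dcount" -gt 0 ]; then
  ps -eo state=,pid=,comm= | awk '$1 ~ /^D/'
fi
```

A D count of zero in one sample proves little; sample a few times, since D state is often transient.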

Network Analysis#

# Network throughput
ip -s link show eth0
sar -n DEV 1 5           # per-interface stats
ifstat -i eth0 1         # live throughput

# Connection states
ss -s                    # summary: established, time-wait, etc.
ss -tuanp                # all TCP connections with process info

# Too many TIME_WAIT: port exhaustion risk
# Check: ss -s (timewait count) or: ss -tan state time-wait | wc -l
# Fix: tune net.ipv4.tcp_tw_reuse, net.ipv4.ip_local_port_range

# TCP retransmits (packet loss indicator)
netstat -s | grep -i retransmit
ss -ti dst :5432 | grep retrans  # retransmits to PostgreSQL

# Bandwidth usage per connection
nethogs eth0              # per-process network bandwidth
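For a single retransmit-ratio number, the cumulative counters in /proc/net/snmp can be read directly (a sketch; these are since-boot totals, so diff two samples for a rate):

```shell
# Retransmit ratio = RetransSegs / OutSegs (cumulative since boot)
# Field positions follow the Tcp: header line in /proc/net/snmp
tcp=$(awk '$1 == "Tcp:" && $2 ~ /^[0-9]/ {print $12, $13}' /proc/net/snmp)
outseg=${tcp% *}
retrans=${tcp#* }
pct=$(awk -v o="$outseg" -v r="$retrans" 'BEGIN {printf "%.2f", (o > 0) ? 100*r/o : 0}')
echo "OutSegs=$outseg RetransSegs=$retrans retransmit%=$pct"
```

Anything persistently above a fraction of a percent on a healthy LAN deserves a look at packet loss.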

Application Profiling with perf#

# CPU flame graph: where is time being spent?
# Install perf, FlameGraph
git clone https://github.com/brendangregg/FlameGraph

# Sample CPU at 99Hz for 30 seconds
perf record -F 99 -p $(pgrep myapp) -g -- sleep 30

# Generate flame graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg

# For Python processes
py-spy record -o profile.svg --pid $(pgrep python3)

# For Go processes (pprof)
go tool pprof 'http://localhost:6060/debug/pprof/profile?seconds=30'  # quote the URL: ? is a shell glob

# For Java processes (async-profiler)
./profiler.sh -d 30 -f cpu.svg $(pgrep java)

System Call Overhead#

# Identify syscall-heavy processes
perf top -e syscalls:sys_enter_write  # specific syscall
strace -c -p $(pgrep myapp)  # count syscalls by type

# Example output:
# % time  seconds  calls   errors  syscall
#  45.23    0.123   5234        0  write
#  30.12    0.082   2100        0  read
#  10.45    0.028    890        0  epoll_wait

# High write() count: buffering issue (small writes)
# Fix: increase write buffer size or use writev()

# Context switches
vmstat 1 5 | awk 'NR>2 {print $11, $12}'  # in = interrupts/s, cs = context switches/s
pidstat -w 1 5                        # per-process context switches
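The small-write problem is easy to demonstrate: the two dd invocations below produce identical files, but the first issues roughly 4096 write() calls where the second issues one (run each under strace -c to see the counts in the syscall table).

```shell
# Same 4 KiB written two ways: 4096 one-byte write()s vs. one 4 KiB write()
dd if=/dev/zero of=/tmp/unbuffered.bin bs=1 count=4096 2>/dev/null
dd if=/dev/zero of=/tmp/buffered.bin bs=4096 count=1 2>/dev/null
a=$(wc -c < /tmp/unbuffered.bin)
b=$(wc -c < /tmp/buffered.bin)
echo "unbuffered=$a bytes, buffered=$b bytes (same data, ~4096x the syscalls)"
rm -f /tmp/unbuffered.bin /tmp/buffered.bin
```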

One-Liner Cheatsheet#

# Runqueue depth (CPU saturation)
vmstat 1 5 | awk 'NR>2 {print $1}'  # r column: tasks waiting for CPU

# Top 5 CPU processes
ps -eo pcpu,pid,user,args --sort=-pcpu | head -5

# Top 5 memory processes
ps -eo pmem,rss,pid,args --sort=-pmem | head -5

# Disk I/O utilization
iostat -xz 1 3 | awk '/^sd|^nvme/ {print $1, $NF"%"}'

# Open TCP connections per state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

# Recent OOM kills
dmesg -T | grep -A2 "Out of memory"

# File descriptor exhaustion
for pid in $(ls /proc | grep '^[0-9]'); do
  count=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
  if [ "$count" -gt 100 ]; then
    echo "$count $(cat /proc/$pid/cmdline 2>/dev/null | tr '\0' ' ')"
  fi
done | sort -rn | head -10

# Disk inodes (can be exhausted independently of space)
df -i

# Network errors and drops
ip -s link show | grep -A1 "errors"  # header row plus the counter row beneath it
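The fd-exhaustion loop above uses an arbitrary threshold; what actually matters is each process's count relative to its own soft limit, which /proc exposes per PID. A sketch, using the current shell as the stand-in target:

```shell
# Compare a process's open-fd count against its soft limit
pid=$$  # stand-in; substitute the PID you care about
nfd=$(ls "/proc/$pid/fd" | wc -l)
limit=$(awk '/Max open files/ {print $4}' "/proc/$pid/limits")
echo "pid=$pid fds=$nfd soft_limit=$limit"
```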

Conclusion#

Performance debugging is systematic: start with the USE method to identify which resource is the bottleneck, then drill in with the appropriate tool. CPU bottlenecks call for perf and flame graphs. Memory pressure shows up in vmstat swap columns and OOM logs. Disk I/O shows in iostat await times. Network issues appear in TCP retransmits and connection state counts. Always measure before optimizing — assumptions about bottlenecks are frequently wrong, and the right tool tells you exactly where time is being spent.
