Linux Networking Stack Internals for Backend Engineers


Introduction

Every network call your application makes passes through the Linux networking stack. Knowing how it works helps you understand TCP tuning, diagnose ETIMEDOUT and ECONNREFUSED errors, and interpret ss, netstat, and tcpdump output.

Packet Path: Inbound

NIC Hardware → Driver (interrupt or NAPI poll) → sk_buff
→ Network Layer (IP: routing, fragmentation)
→ Transport Layer (TCP/UDP: port demux, checksums)
→ Socket receive buffer
→ Application: recv() / read()
# View receive buffer sizes
cat /proc/sys/net/core/rmem_default   # default receive buffer
cat /proc/sys/net/core/rmem_max       # max receive buffer

# Per-socket receive buffer
ss -nm | head -20
# Recv-Q: bytes in receive buffer not yet read by application

TCP State Machine

Understanding TCP states is essential for diagnosing connection issues.

# Show all TCP connections and states
ss -tan
# LISTEN    0    128    0.0.0.0:8080    0.0.0.0:*
# ESTAB     0    0      10.0.0.1:8080   10.0.0.2:54321
# TIME-WAIT 0    0      10.0.0.1:8080   10.0.0.3:54322
# CLOSE-WAIT 12  0      ...

# Count by state
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn

Common states and what they mean:

  • LISTEN: accepting new connections
  • ESTABLISHED: active connection
  • TIME-WAIT: closed by local side, waiting for delayed packets (2*MSL = ~60s)
  • CLOSE-WAIT: remote side closed, local side has not called close() — often a leak
  • FIN-WAIT-2: local side sent FIN, waiting for remote FIN

A large number of CLOSE-WAIT sockets indicates a connection leak in your application (failing to close connections).

Socket Backlog and Accept Queue

Incoming SYN → SYN queue (incomplete connections, 3-way handshake in progress)
            → Accept queue (completed connections waiting for accept())
            → Application calls accept()
# Backlog = max length of accept queue
# listen(fd, backlog)
# nginx default: listen 80 backlog=511

# Show accept queue depth
ss -tlnp
# Recv-Q on LISTEN socket = current accept queue depth
# Send-Q on LISTEN socket = max accept queue depth (backlog)

# SYN backlog (incomplete connections)
cat /proc/sys/net/ipv4/tcp_max_syn_backlog  # default scales with memory; 128 on small systems, 1024+ otherwise

# Accept queue max (listen() backlog is silently capped at this value)
cat /proc/sys/net/core/somaxconn  # default: 128 before kernel 5.4, 4096 since
# Increase for high-traffic servers
sysctl -w net.core.somaxconn=65535

TCP Buffer Tuning

A single connection's throughput is bounded by window_size / RTT; to keep a link full, socket buffers must be at least the bandwidth-delay product (bandwidth × RTT).

# Current TCP buffer sizes
cat /proc/sys/net/ipv4/tcp_rmem  # min default max for receive
# 4096  131072  6291456

cat /proc/sys/net/ipv4/tcp_wmem  # min default max for send

# For high-bandwidth, high-latency links (e.g., cross-region):
sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"
sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"

# Enable auto-tuning (on by default in modern kernels)
cat /proc/sys/net/ipv4/tcp_moderate_rcvbuf  # should be 1

TIME-WAIT and Port Exhaustion

Every outbound connection uses an ephemeral port. When connections close, ports enter TIME-WAIT for 60 seconds.

# Ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range
# 32768   60999  (28232 ports available)

# When making many short-lived connections (HTTP without keep-alive):
# If rate exceeds ~470 connections/sec, you exhaust the port range

# Fix: enable port reuse (applies to outbound connections only; requires TCP timestamps)
sysctl -w net.ipv4.tcp_tw_reuse=1

# Widen the ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Check current usage (subtract 1 for the header line)
ss -tan state time-wait | wc -l

Useful Diagnostic Commands

# Real-time packet counters
watch -n1 'cat /proc/net/dev'

# TCP statistics (errors, retransmits)
netstat -s | grep -E "(retrans|error|fail)"
ss -s  # summary

# Capture traffic (requires root)
tcpdump -i eth0 port 8080 -w capture.pcap
tcpdump -r capture.pcap -nn

# Trace connection establishment
strace -e trace=network -p $(pgrep nginx | head -1)

# Check for dropped packets in iptables
iptables -L -n -v | grep -v "0     0"

# Network interface stats
ip -s link show eth0

Keep-Alive Settings

# TCP keep-alive: detect dead connections
cat /proc/sys/net/ipv4/tcp_keepalive_time     # 7200s (2 hours) — too long
cat /proc/sys/net/ipv4/tcp_keepalive_intvl    # 75s between probes
cat /proc/sys/net/ipv4/tcp_keepalive_probes   # 9 probes before giving up

# Recommended for servers behind load balancers
sysctl -w net.ipv4.tcp_keepalive_time=60
sysctl -w net.ipv4.tcp_keepalive_intvl=10
sysctl -w net.ipv4.tcp_keepalive_probes=6

Conclusion

The Linux networking stack processes packets through hardware interrupts, builds TCP state machines for each connection, and exposes everything through /proc. For backend engineers, the most actionable items are: set somaxconn and tcp_max_syn_backlog to match your traffic, enable tcp_tw_reuse for services making many outbound connections, tune TCP buffers for high-latency links, and use ss with state filters to diagnose connection issues.
