## Introduction
TCP congestion control prevents senders from overwhelming the network. It is the reason a fresh TCP connection starts slowly and why a connection over a lossy link performs poorly. Understanding it helps you tune network timeouts, choose the right congestion control algorithm, and diagnose slow transfers.
## The Congestion Window
TCP limits how much data can be in-flight (sent but not yet acknowledged) using two windows:
- rwnd (receiver window): how much the receiver can buffer (flow control)
- cwnd (congestion window): how much the sender estimates the network can handle
The effective in-flight limit is min(rwnd, cwnd).
```
Throughput ≈ cwnd / RTT

For 100ms RTT and cwnd = 1MB:
Throughput ≈ 1MB / 0.1s = 10MB/s

To achieve 100MB/s over a 100ms link:
Required cwnd = 100MB/s × 0.1s = 10MB
```
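This is the bandwidth-delay product, and it is easy to sanity-check numerically. A minimal sketch in plain Python (no actual TCP involved; the function names are for illustration only):

```python
def throughput(cwnd_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput for a given congestion window."""
    return cwnd_bytes / rtt_s

def required_cwnd(target_bytes_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight
    to sustain the target rate over a path with the given RTT."""
    return target_bytes_per_s * rtt_s

MB = 1_000_000
print(throughput(1 * MB, 0.1) / MB)       # 10.0  (MB/s with a 1MB window)
print(required_cwnd(100 * MB, 0.1) / MB)  # 10.0  (MB window for 100MB/s)
```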
## Slow Start
When a connection begins, cwnd starts small: 1 segment originally, 10 segments in modern implementations (RFC 6928). Each ACK received increases cwnd by 1 MSS (maximum segment size), which doubles cwnd every round trip.
```
RTT 1: send 1 segment  → receive 1 ACK  → cwnd = 2
RTT 2: send 2 segments → receive 2 ACKs → cwnd = 4
RTT 3: send 4 segments → receive 4 ACKs → cwnd = 8
...until cwnd reaches ssthresh (slow start threshold)
```
Despite the name, slow start is exponential growth. A connection exits slow start when:
- cwnd reaches ssthresh (it transitions to congestion avoidance), or
- a packet loss is detected
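The doubling-per-RTT behavior can be sketched as a toy simulation. This is a simplification in segment units (no delayed ACKs or pacing), and the ssthresh values in the examples are illustrative:

```python
def slow_start(initial_cwnd: int, ssthresh: int) -> list[int]:
    """cwnd (in segments) at the start of each RTT until ssthresh is reached."""
    history = [initial_cwnd]
    cwnd = initial_cwnd
    while cwnd < ssthresh:
        # each of the cwnd segments is ACKed, and each ACK adds 1 segment
        cwnd = min(cwnd * 2, ssthresh)
        history.append(cwnd)
    return history

print(slow_start(1, 16))   # [1, 2, 4, 8, 16]
print(slow_start(10, 64))  # [10, 20, 40, 64]
```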
## Congestion Avoidance

Once cwnd reaches ssthresh, growth becomes linear: +1 MSS per round trip.
```
cwnd = 20 MSS (at ssthresh)
RTT 1: cwnd = 21
RTT 2: cwnd = 22
...
```
## Packet Loss and Reaction

When a packet loss is detected (by a retransmission timeout or by 3 duplicate ACKs):

- Timeout (severe): ssthresh = cwnd/2, cwnd resets to 1 MSS, and the connection re-enters slow start.
- 3 duplicate ACKs (fast retransmit): ssthresh = cwnd/2, cwnd = ssthresh. No slow start; the sender continues in congestion avoidance. Less disruptive.
```
cwnd = 32 MSS, loss detected:
  ssthresh = 16 MSS
  cwnd = 16 MSS (fast recovery) or 1 MSS (timeout)
```
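Combining the pieces gives the classic AIMD sawtooth. A toy Reno-style trace at segment granularity, with loss injected at a fixed RTT purely for illustration:

```python
def aimd_trace(rtts: int, loss_at: set[int], ssthresh: int = 32) -> list[int]:
    """cwnd (in segments) at each RTT: slow start, congestion avoidance,
    and multiplicative decrease on fast-retransmit-style loss."""
    cwnd, trace = 1, []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:             # 3 dup ACKs: halve ssthresh, no slow start
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:          # slow start: double each RTT
            cwnd = min(cwnd * 2, ssthresh)
        else:                          # congestion avoidance: +1 MSS per RTT
            cwnd += 1
    return trace

print(aimd_trace(12, loss_at={8}))
# [1, 2, 4, 8, 16, 32, 33, 34, 35, 17, 18, 19]
```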
## Modern Congestion Control Algorithms

### CUBIC (default on Linux)
```bash
# Check current algorithm
sysctl net.ipv4.tcp_congestion_control
# net.ipv4.tcp_congestion_control = cubic

# CUBIC grows cwnd as a cubic function of time since the last loss.
# Better than Reno/Tahoe for high-bandwidth, high-latency links.
```
### BBR (Bottleneck Bandwidth and Round-trip propagation time)
BBR (developed by Google) models the network rather than reacting to loss. It maintains a model of the bottleneck bandwidth and minimum RTT, and keeps the network pipe full without filling buffers.
```bash
# Enable BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq  # required for BBR

# Persist across reboots
echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf
echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf
```
BBR provides significantly better throughput on high-latency links and in the presence of shallow buffers (e.g., datacenter switches).
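Linux also lets an application choose the congestion control algorithm per socket, via the `TCP_CONGESTION` socket option, without changing the system default. A minimal sketch (Linux only; the requested algorithm must be loaded and listed in `tcp_allowed_congestion_control` for unprivileged processes):

```python
import socket

def make_socket(algorithm: str = "cubic") -> socket.socket:
    """TCP socket using the given congestion control algorithm (Linux only)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algorithm.encode())
    return sock

sock = make_socket("cubic")
# Read it back; the kernel returns a NUL-padded buffer.
raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print(raw.split(b"\x00")[0].decode())  # cubic
sock.close()
```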
## Diagnosing Connection Issues
```bash
# View per-socket TCP statistics including cwnd
ss -tin dst :443
# State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
# ESTAB  0       0       10.0.0.1:54321       93.184.216.34:443
# cubic rto:204 rtt:22.234/11.117 cwnd:10 ssthresh:14 bytes_sent:1234
#                                 ^^^^^^^ ^^^^^^^^^^^
#                  congestion window      slow start threshold

# High rto (retransmission timeout): network is lossy
# cwnd stuck low: repeated loss events limiting throughput
# ssthresh very low: recent severe congestion event

# Measure connection throughput
iperf3 -c server -t 30 -P 4  # 4 parallel streams, 30-second test
```
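For automated monitoring, the per-socket detail line from `ss -tin` can be parsed with a small helper. A sketch that covers only the fields shown above; the field names are the ones `ss` emits:

```python
import re

def parse_ss_line(line: str) -> dict[str, float]:
    """Extract rto, cwnd, ssthresh, and rtt/rttvar from an `ss -tin` detail line."""
    stats: dict[str, float] = {}
    for key in ("rto", "cwnd", "ssthresh"):
        m = re.search(rf"\b{key}:([\d.]+)", line)
        if m:
            stats[key] = float(m.group(1))
    m = re.search(r"\brtt:([\d.]+)/([\d.]+)", line)  # smoothed RTT / variance, ms
    if m:
        stats["rtt"], stats["rttvar"] = float(m.group(1)), float(m.group(2))
    return stats

line = "cubic rto:204 rtt:22.234/11.117 cwnd:10 ssthresh:14 bytes_sent:1234"
print(parse_ss_line(line))
# {'rto': 204.0, 'cwnd': 10.0, 'ssthresh': 14.0, 'rtt': 22.234, 'rttvar': 11.117}
```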
## Impact on Application Design
```python
# HTTP connection reuse avoids slow-start overhead:
# each new connection starts at cwnd = 10 segments,
# while an established connection has already ramped up.
import httpx

# BAD: new connection per request (slow start every time)
async def fetch_price(product_id: int) -> float:
    async with httpx.AsyncClient() as client:  # new connection each call
        resp = await client.get(f"https://api.example.com/price/{product_id}")
        return resp.json()["price"]

# GOOD: reuse connections (one slow start, then full speed)
_client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=20, max_keepalive_connections=10)
)

async def fetch_price(product_id: int) -> float:
    resp = await _client.get(f"https://api.example.com/price/{product_id}")
    return resp.json()["price"]
```
## TCP Initial Congestion Window

Google's research showed that increasing the initial cwnd from 3 to 10 segments improves average page load time by roughly 10%, by shortening slow start for short connections.
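The effect is easy to see with a back-of-the-envelope count of round trips needed to deliver a short response during slow start (a simplification that ignores ssthresh and losses; the 60KB response size is an illustrative example):

```python
def rtts_to_send(segments: int, initcwnd: int) -> int:
    """Round trips needed to deliver `segments` if cwnd doubles each RTT."""
    rtts, cwnd, sent = 0, initcwnd, 0
    while sent < segments:
        sent += cwnd   # everything the current window allows goes out this RTT
        cwnd *= 2      # slow start: window doubles per round trip
        rtts += 1
    return rtts

# A 60KB response is ~41 segments at a 1460-byte MSS
print(rtts_to_send(41, 3))   # 4 RTTs
print(rtts_to_send(41, 10))  # 3 RTTs
```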
```bash
# Check initial cwnd on a route
ip route show | grep default
# default via 10.0.0.1 dev eth0 proto dhcp initcwnd 10

# Set initial cwnd on the default route
ip route change default via 10.0.0.1 initcwnd 10
```
Modern Linux kernels default to initcwnd=10.
## Conclusion

Slow start protects the network but causes poor initial throughput for short-lived connections. HTTP keep-alive and connection pooling amortize slow start cost across requests. BBR outperforms CUBIC on high-latency or lossy links and is worth enabling on servers that make cross-region connections. Use `ss -tin` to inspect per-connection congestion state during performance investigations.