Linux Namespaces and cgroups: How Containers Work

Introduction

Containers are not magic. Docker and Kubernetes rely on two Linux kernel primitives: namespaces (isolation) and cgroups (resource limits). Understanding these mechanisms makes container behavior predictable and aids debugging.

Namespaces: Isolation

A namespace wraps a global resource and makes processes inside the namespace see their own isolated copy.

Types of namespaces:

Namespace   Isolates
pid         Process IDs
net         Network interfaces, routes, iptables
mnt         Mount points (filesystems)
uts         Hostname and domain name
ipc         System V IPC, POSIX message queues
user        User and group IDs
cgroup      cgroup root

# Create a new network namespace and run a command inside it
ip netns add my-ns
ip netns exec my-ns ip link list
# Only the loopback interface (lo) is listed: the new namespace is isolated from the host network

# List namespaces of a running process
ls -la /proc/$(pgrep nginx | head -1)/ns/

# Enter a running container's namespaces
nsenter --target $(docker inspect --format='{{.State.Pid}}' my-container) \
  --net --pid --mount -- bash
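
Each entry under /proc/<pid>/ns is a symbolic link whose target names the namespace type and its inode number, so two processes share a namespace when the link targets match. A minimal sketch of that comparison follows. The helper name same_ns is hypothetical, and reading the links of another user's process usually requires root.

```shell
# same_ns PID_A PID_B TYPE
# Exit 0 when both processes are in the same namespace of TYPE
# (pid, net, mnt, uts, ipc, user, cgroup), 1 when they differ,
# and 2 when either link cannot be read. Hypothetical helper name.
same_ns() {
  ns_a=$(readlink "/proc/$1/ns/$3") || return 2
  ns_b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$ns_a" = "$ns_b" ]
}
```

For example, same_ns $$ 12345 net succeeds only if the current shell and PID 12345 (a placeholder) share a network namespace. A containerized process on a default bridge network should fail this check against a host shell.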

PID Namespace

A process inside a PID namespace sees only the process tree of that namespace, and the first process started in it gets PID 1. The host sits in the parent PID namespace, so it still sees every container process under its host PID.

# Inside container: only container processes are visible
ps aux
# PID 1: /bin/sh
# PID 5: ps aux

# On host: container processes visible with host PIDs
ps aux | grep nginx
# host_pid 12345: nginx: master process
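
The two views are tied together by the NSpid line of /proc/<pid>/status (available since Linux 4.1), which lists a process's PID in each nested PID namespace, outermost first. The sketch below extracts the innermost value, which is the PID the process sees for itself. The helper name inner_pid is an assumption made for illustration.

```shell
# inner_pid: read the contents of /proc/<pid>/status on stdin and print the
# last NSpid field, i.e. the PID inside the innermost PID namespace.
# Hypothetical helper name.
inner_pid() {
  awk '$1 == "NSpid:" { print $NF }'
}
```

Run on the host as inner_pid < /proc/12345/status, where 12345 stands in for the host PID of a container's first process, this would be expected to print 1.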

Network Namespace

Each container gets its own network stack. On the default bridge network, Docker creates a veth pair: one end is placed in the container (usually as eth0) and the other end is attached to the host bridge (docker0).

# Show veth pairs on host
ip link show type veth

# Show the container's IP address on the default bridge network
docker inspect --format='{{.NetworkSettings.IPAddress}}' my-container
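
To find which host-side veth belongs to a given container, one common approach is to read the peer's interface index from inside the container (the file /sys/class/net/eth0/iflink) and look that index up in the host's ip -o link output. The sketch below covers only the lookup step. The helper name iface_by_index is hypothetical, and eth0 is assumed to be the interface name inside the container.

```shell
# iface_by_index INDEX: read `ip -o link` output on stdin and print the name
# of the interface with that index, without the "@ifN" peer suffix.
# Hypothetical helper name.
iface_by_index() {
  awk -F': ' -v idx="$1" '$1 == idx { sub(/@.*/, "", $2); print $2 }'
}
```

On the host, the index could come from docker exec my-container cat /sys/class/net/eth0/iflink (this needs cat in the image), followed by ip -o link | iface_by_index "$idx".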

cgroups: Resource Limits

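Control groups (cgroups) limit and account for resource usage such as CPU time, memory, block I/O, and the number of processes. Network bandwidth is not limited by cgroups directly; the cgroup v1 net_cls and net_prio controllers only tag or prioritize traffic for other tools such as tc.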

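# Show the cgroup membership of a container's main process
cat /proc/$(docker inspect --format='{{.State.Pid}}' my-container)/cgroup

# The paths below assume cgroup v1 with the cgroupfs driver.
# On cgroup v2 the equivalent files are memory.max and memory.current.

# Memory limit that Docker set through the cgroup
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes

# Current memory usage
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes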
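
Each line of /proc/<pid>/cgroup has the form hierarchy-ID:controllers:path. On a cgroup v2 (unified) host there is a single line that starts with 0::, and its path is relative to /sys/fs/cgroup. A minimal sketch that extracts that path, with cgroup_v2_path as a hypothetical helper name:

```shell
# cgroup_v2_path: read the contents of /proc/<pid>/cgroup on stdin and print
# the cgroup v2 path (the part after "0::"). Prints nothing for v1-only input.
# Hypothetical helper name.
cgroup_v2_path() {
  sed -n 's/^0:://p'
}
```

On a cgroup v2 host, the result can be appended to /sys/fs/cgroup to locate files such as memory.max for the process in question.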

Setting Resource Limits

# Run a container with CPU and memory limits
docker run --cpus=0.5 --memory=256m nginx

# What Docker actually does (cgroup v1 file names; cgroup v2 uses cpu.max and memory.max):
# CPU: sets cpu.cfs_quota_us and cpu.cfs_period_us
# Memory: sets memory.limit_in_bytes
cat /sys/fs/cgroup/cpu/docker/<id>/cpu.cfs_quota_us   # 50000 (50 ms of CPU time per 100 ms period = 0.5 CPU)
cat /sys/fs/cgroup/memory/docker/<id>/memory.limit_in_bytes  # 268435456 (256 MiB)
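
The arithmetic behind those two values can be sketched as follows. This is an illustration of the conversion, not Docker's own code. The function names are hypothetical, the 100000-microsecond period is the value shown above, and only integer sizes with a single b, k, m, or g suffix are handled.

```shell
# cpus_to_quota CPUS [PERIOD_US]
# CFS quota in microseconds for a --cpus value: quota = cpus * period.
cpus_to_quota() {
  awk -v cpus="$1" -v period="${2:-100000}" 'BEGIN { printf "%.0f\n", cpus * period }'
}

# mem_to_bytes SIZE
# Bytes for an integer size with an optional b, k, m, or g suffix
# (binary multiples, so 1k = 1024 bytes).
mem_to_bytes() {
  mem_value=${1%[bkmgBKMG]}      # strip one trailing unit letter, if any
  mem_suffix=${1#"$mem_value"}   # whatever was stripped
  case "$mem_suffix" in
    k|K) echo $((mem_value * 1024)) ;;
    m|M) echo $((mem_value * 1024 * 1024)) ;;
    g|G) echo $((mem_value * 1024 * 1024 * 1024)) ;;
    *)   echo "$mem_value" ;;
  esac
}

cpus_to_quota 0.5    # prints 50000
mem_to_bytes 256m    # prints 268435456
```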

OOM Killer

When the memory usage of a container’s cgroup reaches its limit and the kernel cannot reclaim enough memory, the kernel’s Out-Of-Memory (OOM) killer terminates a process inside that cgroup.

# Check if a container was OOM killed
docker inspect my-container --format='{{.State.OOMKilled}}'

# View OOM events in kernel logs
dmesg | grep -i "out of memory"
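
On cgroup v2 hosts the same information is kept per cgroup in the memory.events file, whose oom_kill line counts the processes killed in that cgroup. A small sketch that reads the counter, assuming cgroup v2 and using the hypothetical helper name oom_kill_count:

```shell
# oom_kill_count: read a cgroup v2 memory.events file on stdin and print the
# value of its oom_kill counter. Hypothetical helper name.
oom_kill_count() {
  awk '$1 == "oom_kill" { print $2 }'
}
```

It would be fed the memory.events file from the container's cgroup directory under /sys/fs/cgroup. A non-zero result means at least one process in that cgroup was OOM killed.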

Union Filesystems

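Container images use a union filesystem (overlayfs on modern systems) to stack read-only image layers beneath a single writable top layer. Writes made by the container land in that top layer and leave the image layers unchanged.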

# Show overlay mounts for running containers
mount | grep overlay

# Typical overlay mount:
# overlay on /var/lib/docker/overlay2/<id>/merged
#   lowerdir=<image-layers>:<base-layer>
#   upperdir=<container-writable-layer>
#   workdir=<workdir>

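# Inspect layer structure (println emits one layer digest per line;
# a literal \n in the template would be printed as-is)
docker image inspect nginx --format='{{range .RootFS.Layers}}{{println .}}{{end}}'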
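
The lowerdir option packs all read-only layers into one colon-separated string, which is hard to read in raw mount output. The sketch below splits it into one layer per line. The helper name overlay_lowerdirs is hypothetical, and it assumes the layer paths contain no spaces, commas, or colons.

```shell
# overlay_lowerdirs: read mount output (or /proc/mounts lines) on stdin and
# print each lowerdir entry on its own line. overlayfs stacks lower layers
# from right to left, so the first line printed is the topmost read-only layer.
# Hypothetical helper name.
overlay_lowerdirs() {
  tr ',() ' '\n\n\n\n' | sed -n 's/^lowerdir=//p' | tr ':' '\n'
}
```

A typical use would be mount | grep overlay | overlay_lowerdirs.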

seccomp and Capabilities

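Docker adds two more security layers: Linux capabilities and seccomp profiles. Capabilities split root's privileges into units that can be dropped individually, and seccomp filters which system calls a process may make.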

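# By default, Docker drops many capabilities, including
# CAP_NET_ADMIN, CAP_SYS_ADMIN, and CAP_SYS_MODULE

# Run with a specific capability added
docker run --cap-add NET_ADMIN nginx

# View the capability sets of the oldest nginx process (the master);
# a bare pgrep can return several PIDs and break the path
grep Cap /proc/$(pgrep -o nginx)/status

# The default seccomp profile, applied automatically, blocks around 44 syscalls.
# To use a custom profile instead, pass the path to its JSON file:
docker run --security-opt seccomp=default.json nginx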
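
The Cap lines in /proc/<pid>/status are hexadecimal bitmasks in which bit N stands for capability number N from linux/capability.h (for example CAP_NET_BIND_SERVICE is 10, CAP_NET_ADMIN is 12, and CAP_SYS_ADMIN is 21). A minimal sketch for testing one bit follows. The helper name has_cap is hypothetical, and the sample mask is an assumed CapEff value for a container running with Docker's default capability set.

```shell
# has_cap HEXMASK BIT
# Exit 0 when capability number BIT is set in HEXMASK, a hexadecimal value
# such as the CapEff field of /proc/<pid>/status. Hypothetical helper name.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Assumed sample mask for a default Docker container
mask=00000000a80425fb
has_cap "$mask" 10 && echo "CAP_NET_BIND_SERVICE is set"
has_cap "$mask" 12 || echo "CAP_NET_ADMIN is not set"
```

Where libcap's capsh tool is installed, capsh --decode=<mask> prints the full list of names for a mask.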

Putting It Together

When you run docker run nginx:

  1. Docker creates new namespaces: pid, net, mnt, uts, ipc
  2. Sets up a veth pair, connects container net namespace to docker0 bridge
  3. Creates cgroup entries with the resource limits you specified
  4. Mounts the image layers using overlayfs
  5. Applies seccomp profile and drops capabilities
  6. Runs the process as PID 1 in the new namespace
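
Steps 1 and 6 can be approximated by hand with the unshare tool from util-linux. The sketch below only builds the command line, mapping the namespace names used in this article to unshare's flags. The helper name unshare_cmd is hypothetical, and the result leaves out networking, cgroups, overlayfs, seccomp, and capabilities, so it is far from a full container runtime.

```shell
# unshare_cmd NS...
# Print an unshare command line that creates the given namespaces
# (pid, net, mnt, uts, ipc, user, cgroup). --fork is included so that the
# launched program becomes PID 1 when a new PID namespace is requested.
# Hypothetical helper name.
unshare_cmd() {
  cmd="unshare --fork"
  for ns in "$@"; do
    case "$ns" in
      mnt) cmd="$cmd --mount" ;;
      pid|net|uts|ipc|user|cgroup) cmd="$cmd --$ns" ;;
      *) echo "unknown namespace: $ns" >&2; return 1 ;;
    esac
  done
  echo "$cmd"
}

unshare_cmd pid net mnt uts ipc
# prints: unshare --fork --pid --net --mount --uts --ipc
```

Running the printed command as root with --mount-proc sh appended should start a shell that sees itself as PID 1 and has only a loopback interface.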

Conclusion

Containers are processes with isolated views of the OS (namespaces) and bounded resource consumption (cgroups). This model has no hypervisor overhead because container processes run directly on the host kernel. The shared kernel is also why container escape vulnerabilities are serious. And because Docker applies no CPU or memory limit unless you ask for one, resource limits must be set explicitly.
