Container Internals

What a container actually is under the hood. If you can’t explain this without saying “lightweight VM,” you don’t understand containers.

Containers Are Not VMs

A container is a regular Linux process with restricted visibility and limited resources. That’s it. No hypervisor, no guest kernel, no hardware emulation.

Aspect	VM	Container
Isolation mechanism	Hypervisor + guest OS kernel	Linux namespaces + cgroups
Boot time	30-60 seconds	Milliseconds
Size overhead	GB (full OS)	MB (just app + dependencies)
Kernel	Separate kernel per VM	Shares host kernel
Security boundary	Strong (hardware-level)	Weaker (kernel-level)
Resource overhead	5-15% CPU/memory for hypervisor	Near-zero

ELI5: A VM is like renting a separate apartment — your own walls, plumbing, electricity meter. A container is like getting a desk in a co-working space — you share the building’s infrastructure but you can only see your own desk. Cheaper and faster to set up, but the walls are thinner.

Common mistake: “Containers are just lightweight VMs.” No. A VM virtualizes hardware and boots its own kernel. A container shares the host kernel and relies on kernel features (namespaces, cgroups) for isolation. This is why you can’t run Windows containers on a Linux host without a VM layer: the workload needs a different kernel.

Linux Namespaces — The Isolation Layer

Namespaces control what a process can see. Each namespace type isolates a different system resource.

Namespace	Flag	What It Isolates	Why It Matters
PID	`CLONE_NEWPID`	Process IDs	Container sees only its own processes. PID 1 inside the container is an ordinary, higher-numbered PID on the host.
NET	`CLONE_NEWNET`	Network stack	Container gets its own interfaces, IP addresses, routing tables, iptables rules.
MNT	`CLONE_NEWNS`	Mount points	Container sees its own filesystem tree. Host mounts invisible.
UTS	`CLONE_NEWUTS`	Hostname/domain	Container can have its own hostname without affecting host.
IPC	`CLONE_NEWIPC`	Inter-process communication	Shared memory, semaphores, message queues isolated per container.
USER	`CLONE_NEWUSER`	User/group IDs	Root inside container can map to unprivileged user on host. Essential for rootless.
CGROUP	`CLONE_NEWCGROUP`	Cgroup root	Container can’t see or modify host’s cgroup hierarchy.
TIME	`CLONE_NEWTIME`	System clocks	Container can have its own `CLOCK_MONOTONIC` and `CLOCK_BOOTTIME` offsets. Linux 5.6+.

ELI5: Imagine you put on VR goggles that show you a fake desktop, fake coworkers, and a fake clock. You think you’re alone in an office, but actually you’re in a crowded room. Namespaces are those VR goggles for a process — they change what the process perceives without changing the actual system.
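
You can check which goggles a process is wearing from the outside. On Linux, each process lists its namespace memberships as symlinks under /proc/<pid>/ns, and two processes share a namespace when the inode numbers match. A minimal Python sketch, assuming a Linux host with /proc mounted; the helper names are illustrative, not part of any library:

```python
import os


def parse_ns_link(link):
    """Split a /proc/<pid>/ns/* link target such as 'pid:[4026531836]'."""
    ns_type, _, rest = link.partition(":[")
    return ns_type, int(rest.rstrip("]"))


def read_namespaces(pid="self"):
    """Return {entry name: namespace inode} for one process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a, b):
    """Entries where both processes point at the same namespace inode."""
    return {name for name in a.keys() & b.keys() if a[name] == b[name]}
```

Comparing read_namespaces() for a host shell against read_namespaces(pid) for a container's host PID would typically show few or no shared entries. Reading another user's /proc/<pid>/ns usually requires root.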

PID Namespace Deep Dive

Host:       PID 1 (systemd) → PID 4521 (containerd-shim) → PID 4822 (container's PID 1)
Container:  PID 1 (your app)  → PID 2 (worker)           → PID 3 (worker)

The container process sees itself as PID 1. On the host, it’s PID 4822. This matters because:

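PID 1 gets special signal treatment: the kernel skips default signal actions for it, so a SIGTERM with no installed handler is silently ignored
If your app doesn’t install a SIGTERM handler as PID 1, docker stop waits the full 10 seconds and then sends SIGKILL
This is why tini and dumb-init exist: they run as PID 1, forward signals to your app, and reap zombies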

Common mistake: Running your app directly as PID 1 without an init process. If your app spawns child processes, zombie processes accumulate because PID 1 is responsible for reaping orphans. Use --init flag or tini in your Dockerfile.
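
To make the PID 1 duties concrete, here is a rough Python sketch of what a minimal init does: start the real workload as a child, forward termination signals to it, and reap exited children. It is an illustration only, not how tini is implemented, and run_as_init is a made-up name:

```python
import os
import signal
import subprocess


def run_as_init(argv):
    """Run argv as a child, forward SIGTERM/SIGINT to it, and reap children."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal to the real workload instead of ignoring it
        child.send_signal(signum)

    handled = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in handled}
    try:
        returncode = child.wait()
        # Reap any other exited children. Orphans are re-parented to this
        # process only when it really is PID 1 (or a child subreaper).
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break  # no children left at all
            if pid == 0:
                break  # children remain, but none has exited yet
    finally:
        for sig, handler in previous.items():
            # A handler of None means it was not installed from Python
            signal.signal(sig, signal.SIG_DFL if handler is None else handler)
    # Shell convention: death by signal N is reported as 128 + N
    return 128 - returncode if returncode < 0 else returncode
```

In practice you would not write this yourself: docker run --init or tini does the same job in a small static binary.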

NET Namespace Deep Dive

Each container gets:

Its own eth0 interface (a veth pair — one end in container, one end on host bridge)
Its own IP address (typically from 172.17.0.0/16 for default bridge)
Its own routing table
Its own iptables rules

Container eth0 ←→ veth pair ←→ docker0 bridge ←→ Host eth0 ←→ Internet

Think of it this way: The container’s eth0 is connected to the host’s network bridge like plugging an ethernet cable from your laptop into a switch. The bridge (docker0) acts as a switch that connects all containers on the host. Traffic bound for the outside world is routed by the host kernel and NATed (an iptables MASQUERADE rule) out through the host’s real network interface.
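
The default bridge subnet is easy to reason about with Python's standard ipaddress module. The sketch below assumes the stock 172.17.0.0/16 range mentioned above and the usual convention that docker0 takes the first usable address; both can be changed in the daemon configuration, and the function names are illustrative:

```python
import ipaddress

# Default bridge range from the text; configurable in the Docker daemon
DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def bridge_gateway(network=DEFAULT_BRIDGE):
    """First usable address, conventionally assigned to the bridge itself."""
    return next(network.hosts())


def usable_addresses(network=DEFAULT_BRIDGE):
    """Addresses left after removing the network and broadcast addresses."""
    return network.num_addresses - 2


def on_default_bridge(addr):
    """True if an IP address falls inside the default bridge subnet."""
    return ipaddress.ip_address(addr) in DEFAULT_BRIDGE
```

A container address such as 172.17.0.2 passes on_default_bridge, while an address on a user-defined network (often 172.18.0.0/16 and up) does not.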

USER Namespace

Maps container UIDs to host UIDs:

Container root (UID 0) → Host UID 100000
Container user (UID 1000) → Host UID 101000

Why it matters: Without user namespaces, UID 0 inside the container is UID 0 on the host. If a process escapes the container (for example via a kernel exploit), it is root on the host. With user namespaces, container root maps to an unprivileged host UID, so an escape lands in an account with no special privileges, which sharply limits the damage.
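
The kernel exposes this mapping in /proc/<pid>/uid_map (and gid_map) as rows of three numbers: first ID inside the namespace, first ID outside, and range length. A small Python sketch of the lookup, assuming the single-range mapping from the example above; the function names are illustrative:

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) rows."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            rows.append((inside, outside, length))
    return rows


def to_host_id(container_id, id_map):
    """Translate an ID inside the namespace to the ID the host sees."""
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None  # this ID has no mapping in the namespace


# Example row: container IDs 0..65535 map to host IDs 100000..165535
EXAMPLE_MAP = parse_id_map("         0     100000      65536\n")
```

With EXAMPLE_MAP, container UID 0 resolves to host UID 100000 and container UID 1000 to 101000, matching the mapping shown above.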

Cgroups — The Resource Limit Layer

Namespaces control visibility. Cgroups control how much a process can use.

Resource	Cgroup Controller	What It Limits
CPU	`cpu`, `cpuacct`, `cpuset`	CPU time, CPU pinning, accounting
Memory	`memory`	RAM usage, swap, OOM behavior
I/O	`blkio` (v1), `io` (v2)	Disk read/write bandwidth, IOPS
PIDs	`pids`	Max number of processes (fork bomb protection)
Network	`net_cls`, `net_prio`	Traffic classification and priority

ELI5: Namespaces are blinders (what you can see), cgroups are handcuffs (what you can use). A container might think it has the whole machine, but cgroups ensure it can only use 512MB RAM and 0.5 CPU cores.
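
Those two limits end up as plain text in cgroup files. The sketch below shows roughly what a runtime writes on cgroups v2 for "512MB RAM and 0.5 CPU", assuming 512 means MiB (as Docker's 512m does) and the default 100ms CPU period; memory_max and cpu_max are illustrative helper names that mirror the v2 file names memory.max and cpu.max:

```python
MIB = 1024 * 1024
DEFAULT_PERIOD_US = 100_000  # 100ms scheduling period, the usual default


def memory_max(mebibytes):
    """Value for memory.max: the limit in bytes."""
    return str(mebibytes * MIB)


def cpu_max(cpus, period_us=DEFAULT_PERIOD_US):
    """Value for cpu.max: '<quota_us> <period_us>'."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"
```

So 0.5 CPUs becomes a quota of 50000 microseconds in every 100000-microsecond period, and 512 MiB becomes 536870912 bytes.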

Cgroups v1 vs v2

Feature	Cgroups v1	Cgroups v2
Hierarchy	Multiple trees (one per controller)	Single unified tree
Resource distribution	Per-controller	Unified across controllers
Per-cgroup Pressure Stall Information (PSI)	No (system-wide `/proc/pressure` only)	Yes, tells you how much a resource is contended
eBPF integration	Limited	Full support
Default in	Older distros (Ubuntu before 21.10, RHEL 8 and earlier)	Modern distros (Ubuntu 21.10+ including 22.04 LTS, RHEL 9+)

Why v2 matters for interviews: Cgroups v2 exposes PSI (Pressure Stall Information) per cgroup, which tells you not just “are we at the limit?” but “how much time are processes stalling waiting for this resource?” That makes it a better signal than raw utilization for capacity planning and autoscaling decisions in Kubernetes.
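
PSI data is plain text: system-wide under /proc/pressure/{cpu,memory,io}, and per cgroup in cpu.pressure, memory.pressure and io.pressure on cgroups v2. Each file has a "some" line and usually a "full" line with rolling averages in percent. A hedged Python sketch of reading it; the 10% threshold is an arbitrary assumption for illustration, not a recommended value:

```python
def parse_psi(text):
    """Parse PSI text into {'some': {...}, 'full': {...}} with float values."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def is_under_pressure(psi, threshold_pct=10.0):
    """True if some task stalled at least threshold_pct of the last 10s."""
    return psi.get("some", {}).get("avg10", 0.0) >= threshold_pct


# Sample content in the kernel's PSI format (numbers are made up)
SAMPLE = (
    "some avg10=12.50 avg60=4.10 avg300=1.00 total=123456\n"
    "full avg10=3.00 avg60=1.00 avg300=0.20 total=4567\n"
)
```

Feeding it the contents of a container's memory.pressure file gives a direct answer to "is this workload stalling on memory right now?".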

Common mistake: Setting memory limits without understanding OOM killer behavior. When a container exceeds its memory limit, the kernel’s OOM killer sends SIGKILL to a process in that cgroup. SIGKILL cannot be caught, so there is no graceful shutdown and no signal handling. Your app just disappears. Set memory limits with headroom and configure your app to stay well under.
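
On cgroups v2 you can watch for this from inside the container: memory.max holds the limit, memory.current the usage, and memory.events an oom_kill counter. A minimal Python sketch that works on the file contents, so the paths stay an assumption about your environment; the helper names are illustrative:

```python
def parse_memory_max(text):
    """memory.max holds a byte count, or the literal 'max' for no limit."""
    value = text.strip()
    return None if value == "max" else int(value)


def headroom_fraction(current_bytes, limit_bytes):
    """Fraction of the limit still free, or None when there is no limit."""
    if limit_bytes is None:
        return None
    return 1.0 - current_bytes / limit_bytes


def oom_kill_count(memory_events_text):
    """memory.events is 'key value' per line; oom_kill counts kills so far."""
    for line in memory_events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":
            return int(value)
    return 0
```

A non-zero oom_kill count after an unexplained restart is usually the quickest confirmation that the limit, not the app, ended the process.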

CPU Limits: Shares vs Quota

# docker run --cpus=0.5          → 50% of one core (quota)
# docker run --cpu-shares=512    → relative weight (shares)

Mechanism	How It Works	When to Use
`--cpus` (quota)	Hard limit. 0.5 = 50ms every 100ms period	Prevent noisy neighbors. Production.
`--cpu-shares` (shares)	Proportional. Only matters when CPU is contended	Development. Soft priority.
`--cpuset-cpus` (pinning)	Pin to specific CPU cores	Latency-sensitive, NUMA-aware workloads

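Why this matters: If you set --cpus=1 and your app tries to use 2 cores, it burns through its quota partway through each 100ms period and the kernel pauses it until the next period starts. This causes latency spikes that look like application bugs but are actually cgroup throttling. Check the nr_throttled counter in cpu.stat: /sys/fs/cgroup/cpu.stat inside a container on cgroups v2, typically /sys/fs/cgroup/cpu/cpu.stat on v1.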
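
A quick way to quantify throttling is the share of scheduling periods in which the cgroup hit its quota. The Python sketch below parses cpu.stat content in the cgroups v2 layout (v1 uses throttled_time in nanoseconds instead of throttled_usec); the sample numbers are made up and the function names are illustrative:

```python
def parse_cpu_stat(text):
    """Parse 'key value' lines from cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def throttled_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Sample cgroups v2 cpu.stat content (made-up numbers)
SAMPLE = (
    "usage_usec 1000000\n"
    "user_usec 800000\n"
    "system_usec 200000\n"
    "nr_periods 200\n"
    "nr_throttled 50\n"
    "throttled_usec 1250000\n"
)
```

Here the cgroup was throttled in 50 of 200 periods. Sampling the counters twice and comparing the deltas tells you whether throttling is happening now rather than at some point since the container started.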

Union Filesystem — The Image Layer System

Container images are not single files. They’re stacks of read-only layers with a thin writable layer on top.

[Writable layer]     ← Container changes go here (copy-on-write)
[Layer 4: COPY app]  ← Your application code
[Layer 3: RUN pip]   ← Installed dependencies  
[Layer 2: RUN apt]   ← System packages
[Layer 1: base image] ← Ubuntu/Alpine/etc

ELI5: Imagine a stack of transparent sheets. Each sheet has some drawing on it. When you look down through the stack, you see the combined picture. If you want to change something, you put a new transparent sheet on top and draw over it — the original sheets are untouched. That’s how container image layers work.

OverlayFS (Default Storage Driver)

merged/   ← What the container sees (unified view)
upper/    ← Writable layer (container changes)
work/     ← Internal bookkeeping for OverlayFS
lower/    ← Read-only image layers (stacked)

Copy-on-Write (CoW): When a container modifies a file from a lower layer, the entire file is copied to the upper (writable) layer first. This means:

Small modification to a 1GB file = 1GB copied to writable layer
Frequent writes to files from lower layers = slow performance
This is why databases should NEVER store data in the container’s writable layer — use volumes
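
The lookup and copy-up rules can be captured in a toy model. This is a Python illustration of the behavior described above, not the real OverlayFS code or API; layers are plain dicts and ToyOverlay is a made-up class:

```python
class ToyOverlay:
    """Toy model of union lookup and copy-up; not the real OverlayFS."""

    def __init__(self, lower_layers):
        # lower_layers: list of {path: bytes}, ordered bottom to top, read-only
        self.lower = lower_layers
        self.upper = {}            # the writable layer
        self.bytes_copied_up = 0   # how much copy-on-write has cost so far

    def read(self, path):
        """Resolve a path top-down: upper layer first, then lower layers."""
        if path in self.upper:
            return self.upper[path]
        for layer in reversed(self.lower):  # topmost lower layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Modify an existing file; the first write copies the whole file up."""
        if path not in self.upper:
            original = self.read(path)
            self.upper[path] = original
            self.bytes_copied_up += len(original)
        self.upper[path] += data
```

Appending one byte to a 1000-byte file from a lower layer copies all 1000 bytes into the upper layer first. Later writes to the same file are cheap because it already lives there, and the lower layers never change.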

Storage Driver	Backing FS	Performance	Use Case
overlay2	ext4, xfs	Best general purpose	Default. Use this unless you have a reason not to.
devicemapper	direct-lvm	Good for direct-lvm, bad for loop	Legacy. RHEL/CentOS before overlay2 support.
btrfs	btrfs	Good for snapshots	When host already uses btrfs.
zfs	zfs	Good for snapshots	When host already uses zfs.

Decision framework: Use overlay2. Period. The only exceptions are if your host filesystem is already btrfs/zfs and you want native snapshot support.

OCI Specification — The Standards

OCI (Open Container Initiative) defines three specs that make the container ecosystem interoperable:

Spec	What It Defines	Why It Matters
Image Spec	Image format: layers, manifest, config	Any tool can build images any runtime can run
Runtime Spec	How to run a container: namespaces, cgroups, mounts	Docker, Podman, CRI-O all use the same container format
Distribution Spec	How registries serve images	Push/pull works across Docker Hub, ECR, GCR, ACR

Think of it this way: OCI is like USB standards. Before USB, every device had a different connector. OCI ensures that an image built with Docker can run on Podman, containerd, CRI-O, or any OCI-compliant runtime. Build once, run anywhere (for real this time).

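Interview question: “What’s the difference between a Docker image and an OCI image?” Mostly a trick question. The OCI image spec was derived from Docker’s image manifest v2 schema 2 (introduced in Docker 1.10), which Docker donated to OCI. The two formats are nearly identical, differing mainly in media type strings, and current runtimes and registries generally accept both.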
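
In practice the visible difference is the mediaType string in the manifest. A small Python sketch that classifies a manifest by that field; the media type constants are the published ones, while the digests and sizes in the example are placeholders rather than a real image:

```python
OCI_MANIFEST = "application/vnd.oci.image.manifest.v1+json"
DOCKER_MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def manifest_family(manifest):
    """Classify a parsed image manifest by its top-level mediaType."""
    media_type = manifest.get("mediaType", "")
    if media_type == OCI_MANIFEST:
        return "oci"
    if media_type == DOCKER_MANIFEST_V2:
        return "docker-v2"
    return "unknown"  # e.g. an index/manifest list, or mediaType omitted


def layer_digests(manifest):
    """Content digests of the layers, in bottom-to-top order."""
    return [layer["digest"] for layer in manifest.get("layers", [])]


# Placeholder manifest: the digests and sizes are not from a real image
EXAMPLE = {
    "schemaVersion": 2,
    "mediaType": OCI_MANIFEST,
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "digest": "sha256:" + "0" * 64,
        "size": 1024,
    },
    "layers": [
        {
            "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "digest": "sha256:" + "1" * 64,
            "size": 2048,
        }
    ],
}
```

Both families share the same shape (a config blob plus an ordered list of layer blobs), which is why tooling can usually convert between them without touching the layer contents.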

Container Lifecycle

Created → Running → Paused → Running → Stopped → Removed
                                          ↓
                                       Restarted

State	What’s Happening	Key Detail
Created	Container config and writable layer prepared, no process running yet	`docker create` does this
Running	PID 1 process executing	`docker start` or `docker run`
Paused	All processes frozen via the cgroup freezer (not `SIGSTOP`, so the app can’t observe it)	CPU usage drops to zero, memory still held
Stopped	PID 1 exited (or killed)	Writable layer still exists, can restart
Removed	Everything cleaned up	Writable layer deleted. Data gone unless in volumes.
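
The lifecycle can be written down as a small transition table. This Python sketch encodes only the states and arrows from the diagram and table above, plus removal of a never-started container; it is not an exhaustive model of what the Docker CLI permits (for example, forced removal of a running container is left out), and the names are illustrative:

```python
# Allowed next states for each state, following the lifecycle diagram
TRANSITIONS = {
    "created": {"running", "removed"},
    "running": {"paused", "stopped"},
    "paused": {"running"},
    "stopped": {"running", "removed"},  # a stopped container can be restarted
    "removed": set(),
}


def can_transition(current, target):
    """True if the diagram has an arrow from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_valid_history(states):
    """True if every consecutive pair of states is an allowed transition."""
    return all(can_transition(a, b) for a, b in zip(states, states[1:]))
```

The happy path from the diagram (created, running, paused, running, stopped, removed) validates, while jumping from paused straight to stopped does not in this simplified model.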

Common mistake: Assuming docker stop kills your process instantly. It sends SIGTERM, waits 10 seconds (configurable with -t/--time), then sends SIGKILL. If your app doesn’t handle SIGTERM, you pay the full 10-second delay on every stop/deploy. Handle SIGTERM properly.
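
Handling SIGTERM properly usually means: set a flag in the handler, let the main loop notice it, finish in-flight work, and exit before the grace period runs out. A minimal Python sketch of that pattern; GracefulShutdown and serve are illustrative names, and the polling interval is an arbitrary choice:

```python
import signal
import time


class GracefulShutdown:
    """Records that SIGTERM arrived so the main loop can exit cleanly."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame):
        # Only set a flag here; do the real cleanup in the main loop
        self.requested = True


def serve(shutdown, do_work, poll_seconds=0.1):
    """Call do_work until shutdown is requested; return the iteration count."""
    done = 0
    while not shutdown.requested:
        do_work()
        done += 1
        time.sleep(poll_seconds)
    return done
```

After serve returns, close connections and flush state, then exit. As long as that takes less than the stop timeout, docker stop returns promptly and SIGKILL is never sent.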

Key Takeaways for Interviews

“What is a container?” → A Linux process with namespaces (isolation) and cgroups (resource limits) running on a union filesystem. Not a VM.
“How is container networking implemented?” → NET namespace with veth pairs connected to a bridge (docker0). Each container gets its own IP, routing table, iptables.
“Why do we need an init process?” → PID 1 in a container must handle signals and reap zombies. Most applications don’t do this correctly. Use tini or --init.
“What happens when a container exceeds its memory limit?” → OOM killer terminates it immediately. No graceful shutdown. Set limits with headroom.
“Cgroups v1 vs v2?” → v2 has unified hierarchy and PSI (pressure stall information). Modern distros use v2. Matters for Kubernetes resource-aware scheduling.
