Container Internals
What a container actually is under the hood. If you can’t explain this without saying “lightweight VM,” you don’t understand containers.
Containers Are Not VMs
A container is a regular Linux process with restricted visibility and limited resources. That’s it. No hypervisor, no guest kernel, no hardware emulation.
| Aspect | VM | Container |
|---|---|---|
| Isolation mechanism | Hypervisor + guest OS kernel | Linux namespaces + cgroups |
| Boot time | Typically tens of seconds (full OS boot) | Typically under a second (process start) |
| Size overhead | GB (full OS) | MB (just app + dependencies) |
| Kernel | Separate kernel per VM | Shares host kernel |
| Security boundary | Strong (hardware-level) | Weaker (kernel-level) |
| Resource overhead | 5-15% CPU/memory for hypervisor | Near-zero |
ELI5: A VM is like renting a separate apartment — your own walls, plumbing, electricity meter. A container is like getting a desk in a co-working space — you share the building’s infrastructure but you can only see your own desk. Cheaper and faster to set up, but the walls are thinner.
Common mistake: “Containers are just lightweight VMs.” No. VMs emulate hardware and run a separate kernel. Containers share the host kernel and use kernel features (namespaces, cgroups) for isolation. This is why you can’t run Windows containers on a Linux host without a VM layer — they need different kernels.
Linux Namespaces — The Isolation Layer
Namespaces control what a process can see. Each namespace type isolates a different system resource.
| Namespace | Flag | What It Isolates | Why It Matters |
|---|---|---|---|
| PID | CLONE_NEWPID | Process IDs | Container sees only its own processes. PID 1 inside = something else on host. |
| NET | CLONE_NEWNET | Network stack | Container gets its own interfaces, IP addresses, routing tables, iptables rules. |
| MNT | CLONE_NEWNS | Mount points | Container sees its own filesystem tree. Host mounts invisible. |
| UTS | CLONE_NEWUTS | Hostname/domain | Container can have its own hostname without affecting host. |
| IPC | CLONE_NEWIPC | Inter-process communication | Shared memory, semaphores, message queues isolated per container. |
| USER | CLONE_NEWUSER | User/group IDs | Root inside container can map to unprivileged user on host. Essential for rootless. |
| CGROUP | CLONE_NEWCGROUP | Cgroup root | Container can’t see or modify host’s cgroup hierarchy. |
| TIME | CLONE_NEWTIME | System clocks | Container can have different CLOCK_MONOTONIC offset. Linux 5.6+. |
ELI5: Imagine you put on VR goggles that show you a fake desktop, fake coworkers, and a fake clock. You think you’re alone in an office, but actually you’re in a crowded room. Namespaces are those VR goggles for a process — they change what the process perceives without changing the actual system.
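To make the goggles concrete: on Linux, each process exposes its namespaces as symlinks under /proc/<pid>/ns, with targets of the form `pid:[4026531836]`. Two processes share a namespace exactly when those inode numbers match. The sketch below parses that format; the function names are illustrative, not part of any library, and `namespaces_of` simply returns an empty dict on systems without /proc.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid="self"):
    """Return {entry name: inode} for a process, or {} where /proc/<pid>/ns is unavailable."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable entry or unexpected format: skip it
        result[name] = inode
    return result
```

Comparing `namespaces_of()` for a shell on the host with the same call for a containerized process (reading another process's entries may require privileges) should show different inodes for the namespace types the runtime unshared.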
PID Namespace Deep Dive
Host: PID 1 (systemd) → PID 4521 (containerd) → PID 4822 (container's PID 1)
Container: PID 1 (your app) → PID 2 (worker) → PID 3 (worker)
The container process sees itself as PID 1. On the host, it’s PID 4822. This matters because:
- PID 1 has special signal handling: the kernel does not apply default signal actions to it, so a SIGTERM with no registered handler is ignored
- If your app doesn’t handle signals properly as PID 1, docker stop hangs for 10 seconds and then sends SIGKILL
- This is why tini and dumb-init exist: they run as PID 1 and forward signals properly
Common mistake: Running your app directly as PID 1 without an init process. If your app spawns child processes, zombie processes accumulate because PID 1 is responsible for reaping orphans. Use --init flag or tini in your Dockerfile.
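The two jobs of an init process, forwarding signals and reaping children, fit in a few lines. What follows is a minimal sketch of the idea, not how tini is actually implemented; `reap_zombies` and `run_as_init` are hypothetical names, and `os.waitstatus_to_exitcode` assumes Python 3.9 or newer.

```python
import os
import signal


def reap_zombies():
    """Collect every child that has already exited; return their (pid, status) pairs."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append((pid, status))
    return reaped


def run_as_init(argv):
    """Run argv as a child, forward SIGTERM/SIGINT to it, and reap until it exits."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)  # replaces the child process image
        finally:
            os._exit(127)  # only reached if exec failed
    for sig in (signal.SIGTERM, signal.SIGINT):
        # Must run in the main thread; the handler just relays the signal.
        signal.signal(sig, lambda signum, _frame: os.kill(child, signum))
    while True:
        pid, status = os.wait()  # blocks; also collects orphans adopted by PID 1
        if pid == child:
            return os.waitstatus_to_exitcode(status)
```

A wrapper script would call `sys.exit(run_as_init(sys.argv[1:]))`. In practice, prefer --init or tini over a hand-rolled init; the sketch only shows what they do for you.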
NET Namespace Deep Dive
Each container gets:
- Its own eth0 interface (a veth pair: one end in the container, one end on the host bridge)
- Its own IP address (typically from 172.17.0.0/16 for the default bridge)
- Its own routing table
- Its own iptables rules
Container eth0 ←→ veth pair ←→ docker0 bridge ←→ Host eth0 ←→ Internet
Think of it this way: The container’s eth0 is connected to the host’s network bridge like an ethernet cable plugged from your laptop into a switch. The bridge (docker0) acts as a switch that connects all containers, and the host routes their traffic to the outside world via its real network interface.
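The address arithmetic for the default bridge can be checked with Python's `ipaddress` module. This is only a sketch of the numbering, assuming the 172.17.0.0/16 subnet quoted above and the common layout where the bridge itself takes the first usable address. Docker's real address allocator is more involved, and `bridge_addresses` is a hypothetical helper.

```python
import ipaddress

# Assumption: the default bridge subnet quoted above; real hosts may be configured differently.
BRIDGE_SUBNET = ipaddress.ip_network("172.17.0.0/16")


def bridge_addresses(subnet, containers):
    """Return (gateway, [container addresses]), giving the first usable host to the bridge."""
    hosts = iter(subnet.hosts())  # usable addresses, skipping network and broadcast
    gateway = next(hosts)
    return gateway, [next(hosts) for _ in range(containers)]
```

With two containers this yields a gateway of 172.17.0.1 and container addresses 172.17.0.2 and 172.17.0.3, which matches what you typically see in `ip addr` output on a default install.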
USER Namespace
Maps container UIDs to host UIDs:
Container root (UID 0) → Host UID 100000
Container user (UID 1000) → Host UID 101000
Why it matters: Without user namespaces, UID 0 inside the container is UID 0 on the host. If a process escapes the container (for example via a kernel exploit), it can end up with root access to the host. With user namespaces, container root maps to an unprivileged host user, so an escape yields only that user’s limited privileges.
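The mapping itself lives in /proc/<pid>/uid_map (and gid_map), one range per line: the first ID inside the namespace, the first ID outside it, and the range length. A minimal sketch of the translation, using the example offsets above; the helper names are illustrative.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            ranges.append((inside, outside, length))
    return ranges


def to_host_id(ranges, container_id):
    """Translate an ID inside the namespace to the host ID, or None if it is unmapped."""
    for inside, outside, length in ranges:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None
```

For the map line `0 100000 65536`, container UID 0 becomes host UID 100000 and container UID 1000 becomes 101000; an ID outside the range has no host identity at all.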
Cgroups — The Resource Limit Layer
Namespaces control visibility. Cgroups control how much a process can use.
| Resource | Cgroup Controller | What It Limits |
|---|---|---|
| CPU | cpu, cpuacct, cpuset | CPU time, CPU pinning, accounting |
| Memory | memory | RAM usage, swap, OOM behavior |
| I/O | blkio (v1), io (v2) | Disk read/write bandwidth, IOPS |
| PIDs | pids | Max number of processes (fork bomb protection) |
| Network | net_cls, net_prio (v1 only) | Traffic classification and priority |
ELI5: Namespaces are blinders (what you can see), cgroups are handcuffs (what you can use). A container might think it has the whole machine, but cgroups ensure it can only use 512MB RAM and 0.5 CPU cores.
Cgroups v1 vs v2
| Feature | Cgroups v1 | Cgroups v2 |
|---|---|---|
| Hierarchy | Multiple trees (one per controller) | Single unified tree |
| Resource distribution | Per-controller | Unified across controllers |
| Per-cgroup Pressure Stall Information (PSI) | No (system-wide only, via /proc/pressure) | Yes: reports how contended a resource is for each cgroup |
| eBPF integration | Limited | Full support |
| Default in | Older distros (Ubuntu before 21.10, RHEL 8 and earlier) | Modern distros (Ubuntu 21.10+, RHEL 9+) |
Why v2 matters for interviews: Cgroups v2 exposes PSI (Pressure Stall Information) per cgroup, which tells you not just “are we at the limit?” but “how much time are processes stalling while waiting for this resource?” That makes it a useful signal for capacity and autoscaling decisions in Kubernetes.
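PSI data is plain text: a cgroup v2 directory contains cpu.pressure, memory.pressure, and io.pressure files with `some` and `full` lines carrying 10-, 60-, and 300-second stall averages (percent) plus a cumulative total in microseconds. A small parser sketch follows; the 10% threshold in `is_stalling` is a made-up policy for illustration, not a recommended value.

```python
def parse_pressure(text):
    """Parse PSI text into {'some': {...}, 'full': {...}} with float averages and int totals."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            key: int(val) if key == "total" else float(val)
            for key, val in values.items()
        }
    return result


def is_stalling(pressure, threshold_pct=10.0):
    """Hypothetical policy: flag the cgroup when the 'some' avg10 exceeds threshold_pct."""
    return pressure.get("some", {}).get("avg10", 0.0) > threshold_pct
```

Feeding it the contents of memory.pressure for a container's cgroup tells you whether tasks are actually waiting on memory, which a plain usage number cannot.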
Common mistake: Setting memory limits without understanding OOM killer behavior. When a container exceeds its memory limit, the kernel’s OOM killer terminates it — no graceful shutdown, no signal handling. Your app just disappears. Set memory limits with headroom and configure your app to stay well under.
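“Headroom” can be measured rather than guessed. Under cgroups v2 the limit is in memory.max (a byte count, or the literal `max` for unlimited) and current usage is in memory.current. A minimal sketch of the calculation, with hypothetical helper names:

```python
def parse_memory_max(text):
    """memory.max holds a byte count or the literal 'max' (no limit); return int or None."""
    value = text.strip()
    return None if value == "max" else int(value)


def memory_headroom(current_bytes, limit_bytes):
    """Fraction of the limit still free (0.0 to 1.0), or None when there is no limit."""
    if limit_bytes is None:
        return None
    return max(0.0, 1.0 - current_bytes / limit_bytes)
```

A container using 384 MiB of a 512 MiB limit has a headroom of 0.25. Alerting when that fraction gets small gives you a warning the OOM killer never will.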
CPU Limits: Shares vs Quota
# docker run --cpus=0.5 → 50% of one core (quota)
# docker run --cpu-shares=512 → relative weight (shares)
| Mechanism | How It Works | When to Use |
|---|---|---|
| --cpus (quota) | Hard limit. 0.5 = 50ms every 100ms period | Prevent noisy neighbors. Production. |
| --cpu-shares (shares) | Proportional. Only matters when CPU is contended | Development. Soft priority. |
| --cpuset-cpus (pinning) | Pin to specific CPU cores | Latency-sensitive, NUMA-aware workloads |
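Why this matters: If you set --cpus=1 and your app tries to use 2 cores, it gets throttled: once the quota for the current period is used up, the kernel stops scheduling it until the next period begins. This causes latency spikes that look like application bugs but are actually cgroup throttling. Check the nr_throttled counter in /sys/fs/cgroup/cpu.stat (the cgroup v2 path as seen from inside the container).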
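Both sides of the quota mechanism are easy to compute. Under cgroups v2 the quota is stored in cpu.max as a `quota period` pair in microseconds, and cpu.stat reports nr_periods and nr_throttled. The sketch below converts a --cpus value and derives a throttling ratio; the function names are illustrative and the 100 ms period is the default mentioned in the table above.

```python
def cpus_to_cpu_max(cpus, period_us=100_000):
    """Express a --cpus value as the 'quota period' pair that cgroup v2 stores in cpu.max."""
    return f"{round(cpus * period_us)} {period_us}"


def parse_cpu_stat(text):
    """Parse cpu.stat ('key value' per line) into a dict of ints."""
    return {
        key: int(value)
        for key, value in (line.split() for line in text.splitlines() if line.strip())
    }


def throttled_ratio(stat):
    """Share of enforcement periods in which the cgroup hit its quota."""
    periods = stat.get("nr_periods", 0)
    return stat.get("nr_throttled", 0) / periods if periods else 0.0
```

A ratio that climbs during latency spikes is strong evidence that the quota, not the application, is the bottleneck.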
Union Filesystem — The Image Layer System
Container images are not single files. They’re stacks of read-only layers with a thin writable layer on top.
[Writable layer] ← Container changes go here (copy-on-write)
[Layer 4: COPY app] ← Your application code
[Layer 3: RUN pip] ← Installed dependencies
[Layer 2: RUN apt] ← System packages
[Layer 1: base image] ← Ubuntu/Alpine/etc
ELI5: Imagine a stack of transparent sheets. Each sheet has some drawing on it. When you look down through the stack, you see the combined picture. If you want to change something, you put a new transparent sheet on top and draw over it — the original sheets are untouched. That’s how container image layers work.
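The transparent-sheets picture maps directly onto a chained lookup: search the top layer first and fall through to the ones below. A toy model using `collections.ChainMap` follows. The layer contents are invented for illustration, and real layers are filesystem trees rather than dicts.

```python
from collections import ChainMap

# Hypothetical layer contents: each dict maps a path to what that layer wrote there.
base = {"/etc/os-release": "base image", "/bin/sh": "shell"}
apt = {"/usr/bin/python3": "interpreter"}
pip = {"/usr/lib/python3/site-packages/flask": "dependency"}
app = {"/app/main.py": "v1", "/etc/os-release": "overridden by app layer"}


def merged_view(*layers_bottom_to_top):
    """Return a lookup where the topmost layer containing a path wins."""
    return ChainMap(*reversed(layers_bottom_to_top))
```

`merged_view(base, apt, pip, app)["/etc/os-release"]` returns the app layer's version, while the `base` dict still holds the original, just as the lower sheets stay untouched.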
OverlayFS (Default Storage Driver)
merged/ ← What the container sees (unified view)
upper/ ← Writable layer (container changes)
work/ ← Internal bookkeeping for OverlayFS
lower/ ← Read-only image layers (stacked)
Copy-on-Write (CoW): When a container modifies a file from a lower layer, the entire file is copied to the upper (writable) layer first. This means:
- Small modification to a 1GB file = 1GB copied to writable layer
- Frequent writes to files from lower layers = slow performance
- This is why databases should NEVER store data in the container’s writable layer — use volumes
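The cost of copy-up is easiest to see by counting bytes. The class below is a toy model of the behavior described above, not OverlayFS itself: the first write to a file that exists only in the lower layer copies the whole file into the upper layer, and later writes touch only the upper copy. `OverlaySketch` and its methods are hypothetical names.

```python
class OverlaySketch:
    """Toy model of OverlayFS: a read-only lower dict plus a writable upper dict."""

    def __init__(self, lower):
        self.lower = dict(lower)   # path -> bytes, never modified
        self.upper = {}            # path -> bytearray, the container's writable layer
        self.bytes_copied_up = 0

    def read(self, path):
        """Return the upper copy if one exists, else the lower original."""
        return bytes(self.upper[path]) if path in self.upper else self.lower[path]

    def write_at(self, path, offset, data):
        """Overwrite bytes in place; the first write to a lower file copies all of it up."""
        if path not in self.upper:
            original = self.lower.get(path, b"")
            self.upper[path] = bytearray(original)
            self.bytes_copied_up += len(original)
        buf = self.upper[path]
        buf[offset:offset + len(data)] = data
```

Changing one byte of a 1,000,000-byte file leaves `bytes_copied_up` at 1,000,000, which is the same effect, at a smaller scale, as the 1GB example in the list above.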
| Storage Driver | Backing FS | Performance | Use Case |
|---|---|---|---|
| overlay2 | ext4, xfs | Best general purpose | Default. Use this unless you have a reason not to. |
| devicemapper | direct-lvm | Good for direct-lvm, bad for loop | Legacy. RHEL/CentOS before overlay2 support. |
| btrfs | btrfs | Good for snapshots | When host already uses btrfs. |
| zfs | zfs | Good for snapshots | When host already uses zfs. |
Decision framework: Use overlay2. Period. The only exceptions are if your host filesystem is already btrfs/zfs and you want native snapshot support.
OCI Specification — The Standards
OCI (Open Container Initiative) defines three specs that make the container ecosystem interoperable:
| Spec | What It Defines | Why It Matters |
|---|---|---|
| Image Spec | Image format: layers, manifest, config | Any tool can build images any runtime can run |
| Runtime Spec | How to run a container: namespaces, cgroups, mounts | Docker, Podman, CRI-O all use the same container format |
| Distribution Spec | How registries serve images | Push/pull works across Docker Hub, ECR, GCR, ACR |
Think of it this way: OCI is like USB standards. Before USB, every device had a different connector. OCI ensures that an image built with Docker can run on Podman, containerd, CRI-O, or any OCI-compliant runtime. Build once, run anywhere (for real this time).
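Interview question: “What’s the difference between a Docker image and an OCI image?” Mostly a trick question. The OCI image spec is based on the image manifest format Docker introduced with Docker 1.10 and later donated to OCI. The two formats differ mainly in media types, and OCI-compliant registries and runtimes generally handle both.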
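What the image spec standardizes is mostly content addressing: every blob (config or layer) is referenced by a descriptor holding its media type, its `sha256:` digest, and its size, and the manifest is a small JSON document listing those descriptors. A minimal sketch that assembles such a manifest as a Python dict; the helper names are illustrative, and real manifests can carry additional optional fields.

```python
import hashlib


def oci_digest(blob):
    """Content address in the '<algorithm>:<hex>' form used by OCI descriptors."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def descriptor(media_type, blob):
    """A descriptor points at a blob by media type, digest, and size."""
    return {"mediaType": media_type, "digest": oci_digest(blob), "size": len(blob)}


def build_manifest(config_blob, layer_blobs):
    """Assemble a minimal image manifest: one config plus an ordered list of layers."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
            for blob in layer_blobs
        ],
    }
```

Because every reference is a digest, any registry or runtime can verify a blob after download by hashing it again, which is a large part of why push and pull interoperate across tools.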
Container Lifecycle
Created → Running → Paused → Running → Stopped → Removed
↓
Restarted
| State | What’s Happening | Key Detail |
|---|---|---|
| Created | Writable layer and configuration prepared, no process running | docker create does this |
| Running | PID 1 process executing | docker start or docker run |
| Paused | Processes frozen via the cgroup freezer (not SIGSTOP, so the process cannot observe it) | CPU usage drops to zero, memory still held |
| Stopped | PID 1 exited (or killed) | Writable layer still exists, can restart |
| Removed | Everything cleaned up | Writable layer deleted. Data gone unless in volumes. |
Common mistake: Assuming docker stop kills your process instantly. It sends SIGTERM, waits 10 seconds (configurable with --time), then SIGKILL. If your app doesn’t handle SIGTERM, you always get the 10-second delay on every stop/deploy. Handle SIGTERM properly.
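Handling SIGTERM properly usually means very little code: record that the signal arrived, let the main loop notice, and exit cleanly before the grace period runs out. A minimal sketch under those assumptions follows. `ShutdownFlag`, `install_sigterm_handler`, and `serve` are hypothetical names, and the handler must be installed from the main thread.

```python
import signal
import time


class ShutdownFlag:
    """Signal handler object: records that SIGTERM arrived and does nothing else."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def install_sigterm_handler():
    """Register a handler for SIGTERM and return the flag it will set."""
    flag = ShutdownFlag()
    signal.signal(signal.SIGTERM, flag)
    return flag


def serve(flag, do_work, poll_seconds=0.1):
    """Hypothetical main loop: run do_work() until SIGTERM, then return so cleanup can run."""
    iterations = 0
    while not flag.requested:
        do_work()
        iterations += 1
        time.sleep(poll_seconds)
    return iterations
```

Keeping the handler to a single assignment avoids doing real work in signal context. Connection draining and flushing belong after `serve` returns, and they need to finish within the stop timeout.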
Key Takeaways for Interviews
- “What is a container?” → A Linux process with namespaces (isolation) and cgroups (resource limits) running on a union filesystem. Not a VM.
- “How is container networking implemented?” → NET namespace with veth pairs connected to a bridge (docker0). Each container gets its own IP, routing table, iptables.
- “Why do we need an init process?” → PID 1 in a container must handle signals and reap zombies. Most applications don’t do this correctly. Use tini or --init.
- “What happens when a container exceeds its memory limit?” → OOM killer terminates it immediately. No graceful shutdown. Set limits with headroom.
- “Cgroups v1 vs v2?” → v2 has unified hierarchy and PSI (pressure stall information). Modern distros use v2. Matters for Kubernetes resource-aware scheduling.