Container Internals

What a container actually is under the hood. If you can’t explain this without saying “lightweight VM,” you don’t understand containers.


Containers Are Not VMs

A container is a regular Linux process with restricted visibility and limited resources. That’s it. No hypervisor, no guest kernel, no hardware emulation.

| Aspect | VM | Container |
| --- | --- | --- |
| Isolation mechanism | Hypervisor + guest OS kernel | Linux namespaces + cgroups |
| Boot time | 30-60 seconds | Milliseconds |
| Size overhead | GB (full OS) | MB (just app + dependencies) |
| Kernel | Separate kernel per VM | Shares host kernel |
| Security boundary | Strong (hardware-level) | Weaker (kernel-level) |
| Resource overhead | 5-15% CPU/memory for hypervisor | Near-zero |

ELI5: A VM is like renting a separate apartment — your own walls, plumbing, electricity meter. A container is like getting a desk in a co-working space — you share the building’s infrastructure but you can only see your own desk. Cheaper and faster to set up, but the walls are thinner.

Common mistake: “Containers are just lightweight VMs.” No. VMs emulate hardware and run a separate kernel. Containers share the host kernel and use kernel features (namespaces, cgroups) for isolation. This is why you can’t run Windows containers on a Linux host without a VM layer — they need different kernels.


Linux Namespaces — The Isolation Layer

Namespaces control what a process can see. Each namespace type isolates a different system resource.

| Namespace | Flag | What It Isolates | Why It Matters |
| --- | --- | --- | --- |
| PID | CLONE_NEWPID | Process IDs | Container sees only its own processes. PID 1 inside = something else on host. |
| NET | CLONE_NEWNET | Network stack | Container gets its own interfaces, IP addresses, routing tables, iptables rules. |
| MNT | CLONE_NEWNS | Mount points | Container sees its own filesystem tree. Host mounts invisible. |
| UTS | CLONE_NEWUTS | Hostname/domain | Container can have its own hostname without affecting host. |
| IPC | CLONE_NEWIPC | Inter-process communication | Shared memory, semaphores, message queues isolated per container. |
| USER | CLONE_NEWUSER | User/group IDs | Root inside container can map to unprivileged user on host. Essential for rootless. |
| CGROUP | CLONE_NEWCGROUP | Cgroup root | Container can’t see or modify host’s cgroup hierarchy. |
| TIME | CLONE_NEWTIME | System clocks | Container can have different CLOCK_MONOTONIC offset. Linux 5.6+. |

ELI5: Imagine you put on VR goggles that show you a fake desktop, fake coworkers, and a fake clock. You think you’re alone in an office, but actually you’re in a crowded room. Namespaces are those VR goggles for a process — they change what the process perceives without changing the actual system.
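
To make the namespace idea concrete: on Linux, every process exposes its namespaces as symlinks under /proc/<pid>/ns, with targets such as pid:[4026531836]. Two processes are in the same namespace exactly when those inode numbers match. The sketch below assumes that layout; the helper names (parse_ns_link, namespaces_of, shared_namespaces) are made up for illustration.

```python
import os
import re

# Symlink targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace symlink target into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a: dict[str, int], b: dict[str, int]) -> set[str]:
    """Namespace entries for which two processes see the same world."""
    return {name for name in a.keys() & b.keys() if a[name] == b[name]}
```

Calling shared_namespaces(namespaces_of("1"), namespaces_of(str(container_pid))) as root on a host would typically show that a containerized process shares few or none of its namespaces with the host's PID 1, depending on how the container was started.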

PID Namespace Deep Dive

Host:       PID 1 (systemd) → PID 4521 (containerd) → PID 4822 (container's PID 1)
Container:  PID 1 (your app)  → PID 2 (worker)      → PID 3 (worker)

The container process sees itself as PID 1. On the host, it’s PID 4822. This matters because:

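  • The kernel treats PID 1 specially: a signal whose disposition is still the default action is ignored, so an app with no SIGTERM handler never gets the default “terminate” behavior
  • If your app doesn’t handle signals properly as PID 1, docker stop waits out its 10-second grace period and then sends SIGKILL
  • This is why tini and dumb-init exist: they run as PID 1 and forward signals to your app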

Common mistake: Running your app directly as PID 1 without an init process. If your app spawns child processes, zombies accumulate because orphaned processes are re-parented to PID 1, which is then responsible for reaping them. Use the --init flag (docker run --init) or add tini to your Dockerfile.
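
The reaping duty described above can be sketched in a few lines. This is only the zombie-collection half of what an init such as tini does (signal forwarding is not shown), and reap_children is a hypothetical helper name. It assumes a POSIX system.

```python
import os


def reap_children() -> list[int]:
    """Collect every child that has already exited, without blocking.

    Returns the PIDs that were reaped. An init process would call this
    whenever it receives SIGCHLD so that zombies never accumulate.
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

An application that never calls waitpid (or an equivalent) while running as PID 1 leaves every exited child in the zombie state until the container stops.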

NET Namespace Deep Dive

Each container gets:

  • Its own eth0 interface (a veth pair — one end in container, one end on host bridge)
  • Its own IP address (typically from 172.17.0.0/16 for default bridge)
  • Its own routing table
  • Its own iptables rules

Container eth0 ←→ veth pair ←→ docker0 bridge ←→ Host eth0 ←→ Internet

Think of it this way: The container’s eth0 is connected to the host’s network bridge like an ethernet cable plugged from your laptop into a switch. The bridge (docker0) acts as a switch that connects all containers on the host; the host kernel then routes (and, by default, NATs) their traffic to the outside world through its real network interface.
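
As a small illustration of the addressing mentioned above, the standard ipaddress module can check whether an address belongs to the default bridge range. The subnet is the one quoted in the text; a Docker daemon can be configured with a different one, and the function names are made up for this sketch.

```python
import ipaddress

# Default bridge subnet quoted in the text; real daemons may be configured differently.
DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")


def on_default_bridge(address: str) -> bool:
    """True if the address falls inside the default docker0 subnet."""
    return ipaddress.ip_address(address) in DEFAULT_BRIDGE


def usable_hosts(network: ipaddress.IPv4Network = DEFAULT_BRIDGE) -> int:
    """Addresses left after removing the network and broadcast addresses."""
    return network.num_addresses - 2
```

One of those usable addresses is typically taken by the bridge itself, which acts as the containers' default gateway.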

USER Namespace

Maps container UIDs to host UIDs:

Container root (UID 0) → Host UID 100000
Container user (UID 1000) → Host UID 101000

Why it matters: Without user namespaces, UID 0 inside the container is UID 0 on the host. If a process escapes the container (for example via a kernel exploit), it runs as host root. With user namespaces, container root maps to an unprivileged host UID, so an escape lands with only that unprivileged user’s permissions.
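
The mapping above is what the kernel exposes in /proc/<pid>/uid_map, where each line reads "first ID inside, first ID outside, range length". A minimal sketch of the translation follows, using the 100000 offset from the example; the function names are hypothetical.

```python
from typing import Optional


def parse_id_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map/gid_map content: 'inside outside length' on each line."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            ranges.append((inside, outside, length))
    return ranges


def to_host_uid(container_uid: int, ranges: list[tuple[int, int, int]]) -> Optional[int]:
    """Translate a UID seen inside the namespace to the UID the host sees."""
    for inside, outside, length in ranges:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # no range covers this UID, so it has no host equivalent


# Example mapping matching the text: container 0..65535 -> host 100000..165535
EXAMPLE_MAP = "         0     100000      65536\n"
```

With EXAMPLE_MAP, to_host_uid(0, ...) gives 100000 and to_host_uid(1000, ...) gives 101000, matching the two lines in the example above.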


Cgroups — The Resource Limit Layer

Namespaces control visibility. Cgroups control how much a process can use.

| Resource | Cgroup Controller | What It Limits |
| --- | --- | --- |
| CPU | cpu, cpuacct, cpuset | CPU time, CPU pinning, accounting |
| Memory | memory | RAM usage, swap, OOM behavior |
| I/O | blkio (v1), io (v2) | Disk read/write bandwidth, IOPS |
| PIDs | pids | Max number of processes (fork bomb protection) |
| Network | net_cls, net_prio | Traffic classification and priority |

ELI5: Namespaces are blinders (what you can see), cgroups are handcuffs (what you can use). A container might think it has the whole machine, but cgroups ensure it can only use 512MB RAM and 0.5 CPU cores.
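
A process can read its own limit rather than guess. On cgroups v2 the memory limit is a single file, memory.max, holding either a byte count or the word max. The sketch below assumes the cgroup v2 filesystem is mounted at /sys/fs/cgroup, as it usually is inside a container; the function names are illustrative.

```python
from typing import Optional


def parse_memory_max(content: str) -> Optional[int]:
    """Interpret cgroup v2 memory.max: a byte count, or 'max' for no limit."""
    value = content.strip()
    return None if value == "max" else int(value)


def read_memory_limit(path: str = "/sys/fs/cgroup/memory.max") -> Optional[int]:
    """Read this cgroup's memory limit in bytes (None means unlimited)."""
    with open(path) as handle:
        return parse_memory_max(handle.read())


def describe(limit: Optional[int]) -> str:
    """Human-readable form of a limit returned by the functions above."""
    return "unlimited" if limit is None else f"{limit / (1024 * 1024):.0f} MiB"
```

For a container started with a 512 MiB limit, the file would contain 536870912, which describe renders as "512 MiB".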

Cgroups v1 vs v2

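| Feature | Cgroups v1 | Cgroups v2 |
| --- | --- | --- |
| Hierarchy | Multiple trees (one per controller) | Single unified tree |
| Resource distribution | Per-controller | Unified across controllers |
| Per-cgroup Pressure Stall Information (PSI) | No (only system-wide in /proc/pressure) | Yes: reports how contended a resource is for that cgroup |
| eBPF integration | Limited | Full support |
| Default in | Older distros (Ubuntu 20.04 LTS and earlier, RHEL 8 and earlier) | Modern distros (Ubuntu 21.10+ including 22.04 LTS, RHEL 9+) |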

Why v2 matters for interviews: Cgroups v2 exposes PSI (Pressure Stall Information) per cgroup, which tells you not just “are we at the limit?” but “how much time are processes stalled waiting for this resource?” That makes it a useful signal for capacity and autoscaling decisions, including in Kubernetes.
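
PSI data is plain text, whether read from /proc/pressure/memory or from a cgroup v2 memory.pressure file: a "some" line and a "full" line, each with 10, 60 and 300 second averages (percent of time stalled) plus a cumulative total in microseconds. A minimal parser is sketched below; the SAMPLE values are made up for illustration.

```python
def parse_pressure(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI output into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


# Illustrative numbers only, not measurements from a real system.
SAMPLE = (
    "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
)
```

Here "some" means at least one task was stalled on the resource, while "full" means all non-idle tasks were stalled at the same time.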

Common mistake: Setting memory limits without understanding OOM killer behavior. When a container exceeds its memory limit, the kernel’s OOM killer sends SIGKILL to a process in it. SIGKILL cannot be caught, so there is no graceful shutdown and no signal handling: your app just disappears. Set memory limits with headroom and configure your app to stay well under them.
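
An OOM kill usually surfaces as exit code 137, which follows the shell convention of 128 plus the signal number (9 for SIGKILL). The helper below decodes such codes; the name killing_signal is made up for this sketch, and signal names assume a Linux host.

```python
import signal
from typing import Optional


def killing_signal(exit_code: int) -> Optional[str]:
    """Name of the signal behind a '128 + N' style exit code, if any."""
    if exit_code <= 128:
        return None  # ordinary exit status, not a signal
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform
```

Exit code 137 alone does not prove an OOM kill, since anything can send SIGKILL. The OOMKilled field in docker inspect output is the more direct evidence.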

CPU Limits: Shares vs Quota

# docker run --cpus=0.5          → 50% of one core (quota)
# docker run --cpu-shares=512    → relative weight (shares)

| Mechanism | How It Works | When to Use |
| --- | --- | --- |
| --cpus (quota) | Hard limit. 0.5 = 50ms every 100ms period | Prevent noisy neighbors. Production. |
| --cpu-shares (shares) | Proportional. Only matters when CPU is contended | Development. Soft priority. |
| --cpuset-cpus (pinning) | Pin to specific CPU cores | Latency-sensitive, NUMA-aware workloads |
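
The quota arithmetic in the table can be written out directly. This sketch assumes the default scheduler period of 100 ms (100000 microseconds) and the cgroup v2 cpu.max format of "quota period"; the function names are illustrative.

```python
from typing import Optional

DEFAULT_PERIOD_US = 100_000  # 100 ms, the default CFS bandwidth period


def cpus_to_quota(cpus: float, period_us: int = DEFAULT_PERIOD_US) -> tuple[int, int]:
    """Convert a --cpus style value into (quota, period) in microseconds."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return int(round(cpus * period_us)), period_us


def parse_cpu_max(content: str) -> Optional[float]:
    """Convert cgroup v2 cpu.max ('quota period' or 'max period') back to CPUs."""
    quota, period = content.split()
    if quota == "max":
        return None  # no quota configured
    return int(quota) / int(period)
```

So --cpus=0.5 corresponds to 50000 microseconds of CPU time per 100000 microsecond period, and --cpus=1.5 to 150000 per period spread over more than one core.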

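Why this matters: If you set --cpus=1 and your app tries to use 2 cores, it gets throttled: once the quota for the current period is used up, the kernel stops scheduling it until the next period begins. This causes latency spikes that look like application bugs but are actually cgroup throttling. Check the nr_throttled counter in cpu.stat (/sys/fs/cgroup/cpu.stat inside the container on cgroups v2, /sys/fs/cgroup/cpu/cpu.stat on v1).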
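
Throttling is easy to quantify from cpu.stat, which is a list of "key value" lines. A minimal sketch follows: the share of scheduler periods in which the cgroup was throttled is nr_throttled divided by nr_periods. The SAMPLE content uses cgroup v2 field names with invented numbers.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse cpu.stat content ('key value' per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats: dict[str, int]) -> float:
    """Fraction of enforcement periods in which the cgroup hit its quota."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Invented numbers in the cgroup v2 cpu.stat layout.
SAMPLE = """usage_usec 8000000
user_usec 6000000
system_usec 2000000
nr_periods 400
nr_throttled 100
throttled_usec 2500000
"""
```

With SAMPLE, a quarter of all periods were throttled. Sampling the file twice and comparing the deltas gives the rate over a specific time window rather than since the cgroup was created.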


Union Filesystem — The Image Layer System

Container images are not single files. An image is a stack of read-only layers; each running container adds a thin writable layer on top.

[Writable layer]     ← Container changes go here (copy-on-write)
[Layer 4: COPY app]  ← Your application code
[Layer 3: RUN pip]   ← Installed dependencies  
[Layer 2: RUN apt]   ← System packages
[Layer 1: base image] ← Ubuntu/Alpine/etc

ELI5: Imagine a stack of transparent sheets. Each sheet has some drawing on it. When you look down through the stack, you see the combined picture. If you want to change something, you put a new transparent sheet on top and draw over it — the original sheets are untouched. That’s how container image layers work.
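
The transparent-sheets picture maps neatly onto Python's collections.ChainMap, which looks keys up in the first mapping that has them and sends all writes to the first mapping only. This is an analogy in code, not how a storage driver is implemented; the paths and contents are placeholders.

```python
from collections import ChainMap


def merged_view(*layers_bottom_to_top: dict) -> ChainMap:
    """Stack read-only layers under a fresh, empty writable layer.

    Lookups search from the top down and stop at the first layer that
    contains the path; writes land only in the new top layer.
    """
    return ChainMap({}, *reversed(layers_bottom_to_top))


# Placeholder layers, bottom to top.
base = {"/etc/os-release": "Ubuntu", "/app/main.py": "placeholder"}
app = {"/app/main.py": "print('v1')"}

view = merged_view(base, app)
view["/etc/os-release"] = "edited inside the container"
```

After the write, view shows the edited file, but base is untouched: the change exists only in view.maps[0], the stand-in for the container's writable layer.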

OverlayFS (Default Storage Driver)

merged/   ← What the container sees (unified view)
upper/    ← Writable layer (container changes)
work/     ← Internal bookkeeping for OverlayFS
lower/    ← Read-only image layers (stacked)

Copy-on-Write (CoW): When a container modifies a file from a lower layer, the entire file is copied to the upper (writable) layer first. This means:

  • Small modification to a 1GB file = 1GB copied to writable layer
  • Frequent writes to files from lower layers = slow performance
  • This is why databases should NEVER store data in the container’s writable layer — use volumes
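
The cost of copy-up in the bullets above can be shown with a toy model. OverlaySim is a made-up class that tracks only how many bytes get copied from the lower layer on first write; it is a sketch of the behavior described, not of OverlayFS internals.

```python
class OverlaySim:
    """Toy model of copy-on-write between a read-only lower and a writable upper layer."""

    def __init__(self, lower: dict[str, bytes]):
        self.lower = lower                       # read-only image content
        self.upper: dict[str, bytearray] = {}    # container's writable layer
        self.bytes_copied_up = 0

    def write(self, path: str, offset: int, data: bytes) -> None:
        if path not in self.upper:
            # First write to this file: the whole file is copied up,
            # no matter how small the modification is.
            original = self.lower.get(path, b"")
            self.upper[path] = bytearray(original)
            self.bytes_copied_up += len(original)
        buf = self.upper[path]
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data

    def read(self, path: str) -> bytes:
        if path in self.upper:
            return bytes(self.upper[path])
        return self.lower[path]
```

Writing a single byte to a 1,000,000-byte file from the lower layer makes bytes_copied_up jump to 1,000,000; later writes to the same file copy nothing more, because the file already lives in the upper layer.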

| Storage Driver | Backing FS | Performance | Use Case |
| --- | --- | --- | --- |
| overlay2 | ext4, xfs | Best general purpose | Default. Use this unless you have a reason not to. |
| devicemapper | direct-lvm | Good for direct-lvm, bad for loop | Legacy. RHEL/CentOS before overlay2 support. |
| btrfs | btrfs | Good for snapshots | When host already uses btrfs. |
| zfs | zfs | Good for snapshots | When host already uses zfs. |

Decision framework: Use overlay2. Period. The only exceptions are if your host filesystem is already btrfs/zfs and you want native snapshot support.


OCI Specification — The Standards

OCI (Open Container Initiative) defines three specs that make the container ecosystem interoperable:

| Spec | What It Defines | Why It Matters |
| --- | --- | --- |
| Image Spec | Image format: layers, manifest, config | Any tool can build images any runtime can run |
| Runtime Spec | How to run a container: namespaces, cgroups, mounts | Docker, Podman, CRI-O all use the same container format |
| Distribution Spec | How registries serve images | Push/pull works across Docker Hub, ECR, GCR, ACR |

Think of it this way: OCI is like USB standards. Before USB, every device had a different connector. OCI ensures that an image built with Docker can run on Podman, containerd, CRI-O, or any OCI-compliant runtime. Build once, run anywhere (for real this time).
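
To give a feel for what the Image Spec standardizes, the sketch below assembles a minimal manifest: a JSON document that points at a config blob and an ordered list of layer blobs, each identified by media type, sha256 digest and size. The blobs here are placeholder bytes rather than real tarballs, and the helper names are made up.

```python
import hashlib


def descriptor(media_type: str, blob: bytes) -> dict:
    """OCI content descriptor: media type, digest and size of one blob."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(config_blob: bytes, layer_blobs: list[bytes]) -> dict:
    """Assemble a minimal OCI image manifest for the given blobs."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
            for blob in layer_blobs
        ],
    }
```

Because every blob is addressed by its digest, any registry or runtime that follows the spec can fetch and verify the same layers independently of the tool that built them.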

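Interview question: “What’s the difference between a Docker image and an OCI image?” Mostly a trick question. Docker donated its image format (Image Manifest V2, Schema 2, introduced with Docker 1.10) to OCI, where it became the basis of the OCI Image Spec. The two formats are nearly identical and differ mainly in their media types, and current runtimes and registries generally accept both.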


Container Lifecycle

Created → Running → Paused → Running → Stopped → Removed
                                          ↓
                                       Restarted

| State | What’s Happening | Key Detail |
| --- | --- | --- |
| Created | Container configured and writable layer prepared, no process running | docker create does this |
| Running | PID 1 process executing | docker start or docker run |
| Paused | Processes frozen via the cgroup freezer (no SIGSTOP is sent, so the app cannot observe it) | CPU usage drops to zero, memory still held |
| Stopped | PID 1 exited (or killed) | Writable layer still exists, can restart |
| Removed | Everything cleaned up | Writable layer deleted. Data gone unless in volumes. |
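
The lifecycle diagram can be read as a small state machine. The sketch below encodes only the transitions drawn above, keyed by the docker subcommand that triggers them; it is a model of the diagram, not of everything the Docker engine allows, and the names TRANSITIONS and run are made up.

```python
# (current state, docker subcommand) -> next state, following the diagram above.
TRANSITIONS = {
    ("created", "start"): "running",
    ("running", "pause"): "paused",
    ("paused", "unpause"): "running",
    ("running", "stop"): "stopped",
    ("stopped", "start"): "running",   # the "Restarted" arrow
    ("stopped", "rm"): "removed",
}


def run(state: str, commands: list[str]) -> str:
    """Apply a sequence of commands, rejecting transitions the diagram lacks."""
    for command in commands:
        try:
            state = TRANSITIONS[(state, command)]
        except KeyError:
            raise ValueError(f"cannot {command!r} a {state} container") from None
    return state
```

Walking the full path, run("created", ["start", "pause", "unpause", "stop", "rm"]) ends in "removed".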

Common mistake: Assuming docker stop kills your process instantly. It sends SIGTERM, waits 10 seconds (configurable with --time), then SIGKILL. If your app doesn’t handle SIGTERM, you always get the 10-second delay on every stop/deploy. Handle SIGTERM properly.
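
Handling SIGTERM does not take much code. The sketch below shows one common pattern, assuming a single-process Python service: the handler only sets a flag, and the main loop notices it and exits cleanly before the grace period runs out. The names shutdown_requested, install_handlers and serve are illustrative.

```python
import signal
import threading

# Set by the signal handler; checked by the main loop.
shutdown_requested = threading.Event()


def _on_signal(signum, frame):
    """Keep the handler tiny: record the request and return."""
    shutdown_requested.set()


def install_handlers() -> None:
    """Register handlers; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _on_signal)  # sent by docker stop
    signal.signal(signal.SIGINT, _on_signal)   # Ctrl+C during local runs


def serve(do_work, poll_seconds: float = 0.5) -> None:
    """Run do_work repeatedly until a shutdown signal arrives."""
    while not shutdown_requested.is_set():
        do_work()
        # Sleep, but wake up immediately if shutdown is requested.
        shutdown_requested.wait(poll_seconds)
    # Flush buffers, close connections, etc. here, then return.
```

Installing an explicit handler also sidesteps the PID 1 special case, since the kernel only ignores signals whose disposition is still the default.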


Key Takeaways for Interviews

  1. “What is a container?” → A Linux process with namespaces (isolation) and cgroups (resource limits) running on a union filesystem. Not a VM.
  2. “How is container networking implemented?” → NET namespace with veth pairs connected to a bridge (docker0). Each container gets its own IP, routing table, iptables.
  3. “Why do we need an init process?” → PID 1 in a container must handle signals and reap zombies. Most applications don’t do this correctly. Use tini or --init.
  4. “What happens when a container exceeds its memory limit?” → OOM killer terminates it immediately. No graceful shutdown. Set limits with headroom.
  5. “Cgroups v1 vs v2?” → v2 has a unified hierarchy and per-cgroup PSI (pressure stall information). Modern distros use v2. It matters for Kubernetes because container requests and limits are enforced through cgroups.