Container Security

A single misconfigured container can be enough to compromise an entire cluster. Container security is not optional; it is table stakes for any production deployment.


The Threat Model

Supply Chain          Runtime              Escape
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Malicious     │    │ Container    │    │ Kernel       │
│ base image    │    │ runs as root │    │ exploit      │
│ Vulnerable    │    │ Excessive    │    │ Mounted host │
│ dependencies  │    │ capabilities │    │ filesystem   │
│ Leaked secrets│    │ No seccomp   │    │ Privileged   │
│ in layers     │    │ No AppArmor  │    │ mode         │
└──────────────┘    └──────────────┘    └──────────────┘

ELI5: Container security is like airport security with three checkpoints. Supply chain = checking your bags before you board (are there dangerous items baked into the image?). Runtime = in-flight rules (what passengers can and can’t do during the flight). Escape prevention = making sure nobody can open the emergency exit mid-flight (break out of the container to the host).


Running as Non-Root

Running as a non-root user is the single most impactful security measure. By default, containers run as root (UID 0), and unless user namespaces are enabled, that is the same UID 0 as root on the host.

# Create a non-root user and switch to it
FROM node:20-slim
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
COPY --chown=appuser:appuser . .
USER appuser
CMD ["node", "server.js"]

Why this matters: If an attacker exploits your app and gets code execution inside the container, they run as the container's user, which by default is root. Add a host mount, a shared namespace, or a kernel vulnerability, and root inside can become root outside.

Common mistake: “But my app needs root to bind to port 80.” No, it doesn’t. Listen on a port above 1024 inside the container (8080, for example) and map it with -p 80:8080. Or run setcap cap_net_bind_service=+ep on the binary to grant just that one capability.
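
As a defense-in-depth complement to the USER instruction, an entrypoint script can refuse to start when it finds itself running as root. This is a minimal sketch, not part of the original Dockerfile: the function name require_non_root is hypothetical, and it accepts the UID as an optional argument only so it can be tried outside a container.

```shell
#!/bin/sh
# Hypothetical entrypoint guard: fail fast if the effective UID is 0.
require_non_root() {
    # Use the UID passed as $1, or fall back to the current effective UID.
    uid="${1:-$(id -u)}"
    if [ "$uid" -eq 0 ]; then
        echo "refusing to run as root (UID 0)" >&2
        return 1
    fi
    echo "running as UID $uid"
}

# In a real entrypoint this would be: require_non_root && exec "$@"
require_non_root 1000
```

The guard does not replace USER in the Dockerfile; it only catches the case where someone overrides the user at run time (for example with --user 0).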


Linux Capabilities

Traditional Linux: either you’re root (can do everything) or you’re not. Capabilities split root’s power into ~40 individual permissions.

Capability        What It Allows                     Should You Grant It?
----------------  ---------------------------------  --------------------------------------------------
NET_BIND_SERVICE  Bind to ports < 1024               Usually fine
SYS_PTRACE        Debug/trace other processes        Only for debugging containers
NET_ADMIN         Modify network config              Network tools only
SYS_ADMIN         Mount filesystems, many admin ops  Almost NEVER; it’s nearly equivalent to full root
NET_RAW           Raw sockets (ping, tcpdump)        Drop in production (used for ARP spoofing attacks)
MKNOD             Create device files                Drop in production
AUDIT_WRITE       Write to kernel audit log          Drop unless needed

Docker’s default set drops the most dangerous capabilities but still keeps 14, including NET_RAW, MKNOD, and AUDIT_WRITE from the table above. Best practice:

# Drop ALL capabilities, add back only what you need
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp

ELI5: Capabilities are like giving someone specific keys instead of a master key. Instead of “here’s full admin access” (root), you say “you can open the mailbox (NET_BIND_SERVICE) but not the vault (SYS_ADMIN).” Drop all keys first, then hand back only the ones actually needed.
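
To see which keys a process actually holds, you can decode the CapEff bitmask from /proc/<pid>/status. The sketch below is an illustration under stated assumptions: decode_caps is a hypothetical helper, it only knows the first 32 capabilities (bits 0 to 31), and the sample mask is the value commonly reported for root in a container started with Docker's default capability set.

```shell
#!/bin/sh
# Decode a hex capability bitmask (as shown in the CapEff line of
# /proc/<pid>/status) into capability names. Bit N set means the
# capability with number N is present. Only bits 0-31 are covered.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL \
                SETGID SETUID SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE \
                NET_BROADCAST NET_ADMIN NET_RAW IPC_LOCK IPC_OWNER \
                SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT \
                SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME \
                SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL SETFCAP; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "CAP_$name"
        fi
        bit=$((bit + 1))
    done
}

# Sample mask: root inside a container with Docker's default capabilities.
decode_caps 00000000a80425fb
```

Inside a running container, the real value can be read with grep CapEff /proc/1/status and passed to the function. After --cap-drop=ALL --cap-add=NET_BIND_SERVICE, a root process would show only bit 10 set (0000000000000400).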


Seccomp Profiles

Seccomp restricts which system calls a container can make. Docker’s default profile blocks roughly 44 of the 300+ available syscalls.

Blocked by default (dangerous):

  • mount: mount filesystems
  • reboot: reboot the host
  • kexec_load: load a new kernel
  • ptrace: trace other processes (blocked only on kernels older than 4.8, unless the SYS_PTRACE capability is added)
  • unshare: create new namespaces

# Docker's default profile is applied automatically when no seccomp option is given
docker run myapp

# Custom profile (restrict further)
docker run --security-opt seccomp=custom-profile.json myapp

# Disable seccomp (NEVER in production)
docker run --security-opt seccomp=unconfined myapp

Think of it this way: If capabilities control which “rooms” a process can enter, seccomp controls which “tools” it can use inside those rooms. Even if a process has the capability to do networking, seccomp can block the specific system calls needed for raw socket operations.

Common mistake: Disabling seccomp because an app “doesn’t work.” Instead, run with strace to find which syscalls are blocked, and add only those to a custom profile.
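
The strace approach can be sketched as follows. This is an illustration, not a complete workflow: denied_syscalls is a hypothetical helper, the log content is made up, and it assumes the log was written with strace -f -o trace.log and that blocked calls fail with EPERM. EPERM can also come from a missing capability or a file permission, so treat the result as a list of candidates to review rather than a ready-made allowlist.

```shell
#!/bin/sh
# Print the unique syscall names that failed with EPERM in an strace log.
# Lines look like: "<pid>  <syscall>(<args>) = -1 EPERM (Operation not permitted)"
denied_syscalls() {
    sed -E -n 's/^([0-9]+ +)?([a-z_][a-z0-9_]*)\(.*= -1 EPERM .*$/\2/p' "$1" | sort -u
}

# Hypothetical sample log, for illustration only.
cat > trace.log <<'EOF'
1234  execve("/usr/bin/node", ["node", "server.js"], 0x7ffd1c /* 12 vars */) = 0
1234  openat(AT_FDCWD, "/app/server.js", O_RDONLY) = 3
1234  keyctl(KEYCTL_GET_KEYRING_ID, KEY_SPEC_SESSION_KEYRING, 0) = -1 EPERM (Operation not permitted)
1235  mount("none", "/mnt", "tmpfs", 0, NULL) = -1 EPERM (Operation not permitted)
1235  keyctl(KEYCTL_GET_KEYRING_ID, KEY_SPEC_SESSION_KEYRING, 0) = -1 EPERM (Operation not permitted)
1235  +++ exited with 0 +++
EOF

denied_syscalls trace.log
```

For the sample log this prints keyctl and mount. Only the calls the app genuinely needs should then be allowed in the custom profile.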


AppArmor and SELinux

AppArmor and SELinux are Mandatory Access Control (MAC) systems that restrict file, network, and capability access at the kernel level.

Feature         AppArmor                   SELinux
--------------  -------------------------  -----------------------
Default on      Ubuntu, Debian, SUSE       RHEL, CentOS, Fedora
Policy model    Path-based                 Label-based
Complexity      Simpler to write profiles  Steeper learning curve
Docker support  Default profile applied    Default profile applied
K8s support     Pod annotation             Pod seLinuxOptions

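Docker applies a default AppArmor profile that prevents:

  • Writing to /proc and /sys (except allowed paths)
  • Mounting filesystems
  • Accessing raw devices
  • Changing network configuration

# Kubernetes: apply a custom AppArmor profile (it must already be loaded on the node).
# This annotation is the older form; Kubernetes 1.30+ also provides the
# securityContext.appArmorProfile field and deprecates the annotation.
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/mycontainer: localhost/my-custom-profile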

Image Security & Supply Chain

Vulnerability Scanning

Tool          Type        Integrates With         Best For
------------  ----------  ----------------------  -----------------------------------
Docker Scout  SaaS + CLI  Docker Desktop, CI/CD   Docker ecosystem, SBOM
Trivy         OSS CLI     CI/CD, K8s admission    Fast, comprehensive, free
Grype         OSS CLI     CI/CD, Syft (SBOM)      Anchore ecosystem
Snyk          SaaS        CI/CD, IDE, registries  Developer-friendly, fix suggestions

# Scan with Trivy
trivy image myapp:latest

# Scan with Docker Scout
docker scout cves myapp:latest

# Generate SBOM with Syft
syft myapp:latest -o spdx-json > sbom.json
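
Scanning only helps if a bad result stops the pipeline. A minimal sketch of a CI gate, assuming Trivy is installed on the runner; scan_gate is a hypothetical helper name, and the severity threshold and image tag are illustrative choices rather than recommendations.

```shell
#!/bin/sh
# Hypothetical CI helper: return non-zero when the image has HIGH or
# CRITICAL vulnerabilities for which a fixed version is available.
scan_gate() {
    trivy image --exit-code 1 --severity HIGH,CRITICAL --ignore-unfixed "$1"
}

# Example (requires trivy and access to the image):
#   scan_gate myapp:1.4.2 || exit 1
```

Using --ignore-unfixed keeps the gate actionable: it fails only on findings the team can resolve by upgrading a package.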

Why SBOM matters: A Software Bill of Materials lists every package in your image. When a new CVE drops (like Log4Shell), you can query your stored SBOMs to see which images are affected without rescanning everything. SBOMs are also increasingly required for software sold to the US government (Executive Order 14028).
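
Here is a rough sketch of that lookup, assuming one SPDX JSON file per image in a local sboms/ directory. The helper name images_with_package, the file names, and the trimmed file contents are all hypothetical. It is a coarse text match on the "name" key; a JSON-aware tool such as jq would be more precise.

```shell
#!/bin/sh
# List the SBOM files (one per image) that mention a given package name.
# Assumes SPDX JSON as produced by: syft <image> -o spdx-json
images_with_package() {
    pkg="$1"; shift
    grep -l -E "\"name\"[[:space:]]*:[[:space:]]*\"$pkg\"" "$@"
}

# Hypothetical, heavily trimmed SBOMs for two images.
mkdir -p sboms
cat > sboms/billing-api.json <<'EOF'
{"packages": [{"name": "log4j-core", "versionInfo": "2.14.1"}]}
EOF
cat > sboms/frontend.json <<'EOF'
{"packages": [{"name": "express", "versionInfo": "4.18.2"}]}
EOF

images_with_package log4j-core sboms/*.json
```

With the two sample files, only sboms/billing-api.json is listed.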

Image Signing and Trust

# Sign with Cosign (Sigstore)
cosign sign --key cosign.key myregistry.io/myapp:v1.0

# Verify
cosign verify --key cosign.pub myregistry.io/myapp:v1.0

Admission control: Block unsigned/unscanned images from deploying:

  • K8s: Kyverno or OPA/Gatekeeper policies
  • Docker: Content Trust (DOCKER_CONTENT_TRUST=1)

Layer Secrets Leak

# BAD: secret is baked into a layer (RUN arguments show up in docker history; copied files can be extracted from the layer)
COPY .env /app/.env
RUN echo "password123" > /app/secret.txt

# ALSO BAD: deleting in a later layer doesn't remove from previous layer
COPY credentials.json /tmp/creds.json
RUN use-creds /tmp/creds.json && rm /tmp/creds.json  # STILL IN LAYER

# GOOD: use a multi-stage build ("builder" and "runtime" are placeholder image names)
FROM builder AS build
COPY credentials.json /tmp/creds.json
RUN use-creds /tmp/creds.json

FROM runtime
COPY --from=build /app/output /app/output
# credentials.json never appears in final image

Common mistake: Thinking RUN rm secret.txt removes the secret. It doesn’t — each layer is immutable. The file exists in the layer where it was added. Use multi-stage builds or BuildKit secrets (--mount=type=secret).
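
A BuildKit secret mount might look like the sketch below. It writes an example Dockerfile to Dockerfile.secret; the secret id creds and the file name are assumptions, and use-creds is the same placeholder command used in the examples above.

```shell
#!/bin/sh
# Write a Dockerfile that reads a secret through a BuildKit secret mount.
cat > Dockerfile.secret <<'EOF'
# syntax=docker/dockerfile:1
FROM node:20-slim
# The secret is mounted at /run/secrets/creds for this RUN step only
# and is not written to any image layer.
RUN --mount=type=secret,id=creds use-creds /run/secrets/creds
EOF

# Build with the secret supplied from a local file (requires BuildKit):
#   docker build --secret id=creds,src=credentials.json -f Dockerfile.secret .
```

Unlike the multi-stage approach, the secret never enters any stage's filesystem layers, so it cannot leak if the build stage is cached or pushed.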


Kubernetes Security

RBAC (Role-Based Access Control)

User/ServiceAccount → RoleBinding → Role → Resources + Verbs
                      (who)         (what they can do)

Scope         Role Type    Binding Type        Applies To
------------  -----------  ------------------  ---------------------------------
Namespace     Role         RoleBinding         Resources in one namespace
Cluster-wide  ClusterRole  ClusterRoleBinding  All namespaces, cluster resources

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]  # read-only, no create/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
- kind: ServiceAccount
  name: monitoring-sa
  namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Principle of least privilege: Every ServiceAccount should have only the permissions it needs. The default ServiceAccount in every namespace has no permissions by default — don’t add any. Create specific ServiceAccounts for each workload.

Common mistake: Binding cluster-admin ClusterRole to a ServiceAccount because “it works.” This gives full cluster access. If that pod is compromised, the attacker owns the entire cluster.
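
One way to check that a ServiceAccount has only what it needs is to ask the API server directly with kubectl auth can-i while impersonating the account. A minimal sketch: sa_can is a hypothetical helper, and the expected answers in the comments assume monitoring-sa has only the pod-reader binding shown above.

```shell
#!/bin/sh
# Hypothetical helper: ask whether a ServiceAccount may perform VERB on
# RESOURCE in NAMESPACE, by impersonating it.
sa_can() {
    # usage: sa_can NAMESPACE SERVICEACCOUNT VERB RESOURCE
    kubectl auth can-i "$3" "$4" \
        --namespace "$1" \
        --as "system:serviceaccount:$1:$2"
}

# Examples (require kubectl, a cluster, and impersonation rights):
#   sa_can production monitoring-sa list pods      # expected: yes
#   sa_can production monitoring-sa delete pods    # expected: no
```

Running the same check for create, delete, and access to secrets is a quick way to spot a binding that is broader than intended.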

Pod Security Standards (PSS)

Pod Security Standards, enforced by the built-in Pod Security Admission controller, replaced PodSecurityPolicy (deprecated in 1.21, removed in 1.25).

Level       What It Allows                                      Use Case
----------  --------------------------------------------------  ---------------------------------------------
Privileged  Everything                                          System-level workloads (CNI, storage drivers)
Baseline    Blocks known privilege escalations                  General workloads with minor restrictions
Restricted  Blocks all privilege escalation, enforces non-root  Production workloads, security-sensitive

# Apply restricted policy to a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
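
For reference, a Pod that is intended to pass the restricted level could look like the sketch below, written to pod-restricted.yaml. The pod name, image reference, and UID 10001 are illustrative assumptions. A numeric runAsUser is set because runAsNonRoot cannot be verified against an image whose USER is a name rather than a number.

```shell
#!/bin/sh
# Write an example Pod manifest aimed at the "restricted" Pod Security level.
cat > pod-restricted.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: production
spec:
  containers:
  - name: myapp
    image: myregistry.io/myapp:v1.0
    securityContext:
      runAsNonRoot: true
      runAsUser: 10001                 # illustrative numeric UID
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true     # not required by "restricted", but good practice
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault
EOF

# Apply it (requires kubectl and a cluster):
#   kubectl apply -f pod-restricted.yaml
```

Because the namespace also carries the warn and audit labels, a non-compliant manifest produces a warning at apply time and an audit record in addition to being rejected.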

Secrets Management

Method                            Security Level                                        Use Case
--------------------------------  ----------------------------------------------------  --------------------
K8s Secrets (base64)              Low: not encrypted at rest by default                 Non-sensitive config
K8s Secrets + encryption at rest  Medium                                                Moderate sensitivity
External Secrets Operator         High: fetches from Vault/AWS SM/GCP SM                Production secrets
Sealed Secrets                    High: encrypted in Git, decrypted in cluster          GitOps workflows
CSI Secret Store                  High: mounts secrets as files from external provider  Direct integration

Why this matters: K8s Secrets are base64-encoded, NOT encrypted. Anyone with kubectl get secret access can read them, so restrict get and list on Secrets with RBAC. Encryption at rest (EncryptionConfiguration) protects the copy stored in etcd, not reads through the API. For real security, enable it AND use an external secrets manager.
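
A quick demonstration of why base64 is not protection. The encoded value below is a made-up example (it is the password123 string used earlier) of the kind that appears under data: in kubectl get secret ... -o yaml.

```shell
#!/bin/sh
# A Secret value as it would appear in the "data" section of the manifest
# (hypothetical example value).
encoded="cGFzc3dvcmQxMjM="

# base64 is an encoding, not encryption: no key is needed to reverse it.
decoded="$(printf '%s' "$encoded" | base64 -d)"
echo "$decoded"
```

This prints password123. Anyone who can read the Secret object, or an unencrypted etcd backup, can do the same.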


The Security Checklist (Interview Ready)

Image Build Time

  • Use minimal base images (distroless, Alpine, scratch)
  • Run as non-root USER in Dockerfile
  • No secrets in image layers (use BuildKit --mount=type=secret)
  • Pin base image versions (no :latest)
  • Scan for CVEs in CI pipeline (Trivy/Scout)
  • Generate and store SBOM
  • Sign images (Cosign/Notary)

Runtime

  • Drop all capabilities, add back only needed
  • Enable seccomp (default or custom profile)
  • Read-only root filesystem (--read-only)
  • No privileged mode (never pass --privileged)
  • Resource limits (memory, CPU, PIDs)
  • No host namespace sharing unless required
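
The runtime items can be combined into a single docker run invocation. The wrapper below is a sketch: hardened_run is a hypothetical name, and the UID and resource limits are illustrative values, not recommendations. It also sets no-new-privileges, which is not on the list above but fits the same goal. Docker's default seccomp profile applies automatically, so no seccomp flag is needed.

```shell
#!/bin/sh
# Hypothetical wrapper that applies the runtime checklist to `docker run`.
# Extra arguments (ports, image name, command) are passed through.
hardened_run() {
    docker run \
        --user 10001:10001 \
        --cap-drop=ALL \
        --security-opt no-new-privileges \
        --read-only \
        --tmpfs /tmp \
        --memory 256m \
        --cpus 0.5 \
        --pids-limit 100 \
        "$@"
}

# Example (requires Docker):
#   hardened_run -p 80:8080 myapp
```

The --tmpfs /tmp mount gives the app a writable scratch directory while the root filesystem stays read-only.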

Kubernetes

  • Pod Security Standards: restricted for prod namespaces
  • RBAC: per-workload ServiceAccounts, least privilege
  • Network Policies: default deny, explicit allow
  • Secrets: external secrets manager, encryption at rest
  • Admission control: block unsigned/vulnerable images
  • Audit logging enabled
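
The "default deny, explicit allow" item for Network Policies can be sketched as a single manifest, written here to default-deny.yaml. The policy name is an assumption, and it only takes effect if the cluster's network plugin enforces NetworkPolicy.

```shell
#!/bin/sh
# Write a NetworkPolicy that selects every pod in the namespace and
# allows no ingress or egress traffic until other policies permit it.
cat > default-deny.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF

# Apply it (requires kubectl and a cluster):
#   kubectl apply -f default-deny.yaml
```

Denying egress also blocks DNS lookups, so a follow-up policy that allows traffic to the cluster DNS service is usually needed.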

Key Takeaways for Interviews

  1. “How do you secure a container?” → Non-root user, drop all capabilities + add back needed, seccomp default profile, read-only rootfs, resource limits, scan images.
  2. “Explain RBAC” → Users/ServiceAccounts → RoleBindings → Roles → Resources+Verbs. Namespace-scoped or cluster-wide. Least privilege always.
  3. “How do you handle secrets?” → Never in image layers. K8s Secrets + encryption at rest minimum. External secrets manager (Vault, AWS SM) for production.
  4. “What’s the difference between capabilities and seccomp?” → Capabilities = which permission categories (coarse). Seccomp = which system calls (fine-grained). Use both.
  5. “How do you prevent container escape?” → Non-root + user namespaces + no privileged mode + drop SYS_ADMIN + seccomp + AppArmor/SELinux + patched kernel.