Container Security
One misconfigured container can be enough to compromise an entire cluster. Container security is not optional; it is table stakes for any production deployment.
The Threat Model
| Supply Chain | Runtime | Escape |
|---|---|---|
| Malicious base image | Container runs as root | Kernel exploit |
| Vulnerable dependencies | Excessive capabilities | Mounted host filesystem |
| Leaked secrets in layers | No seccomp, no AppArmor | Privileged mode |
ELI5: Container security is like airport security with three checkpoints. Supply chain = checking your bags before you board (are there dangerous items baked into the image?). Runtime = in-flight rules (what passengers can and can’t do during the flight). Escape prevention = making sure nobody can open the emergency exit mid-flight (break out of the container to the host).
Running as Non-Root
The single most impactful security measure. By default, containers run as root (UID 0), which maps to root on the host unless user namespaces are enabled.
# Create a non-root user and switch to it
FROM node:20-slim
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
COPY --chown=appuser:appuser . .
USER appuser
CMD ["node", "server.js"]
Why this matters: If an attacker exploits your app and gets code execution inside the container, they’re root. If the container has any host mounts, shared namespaces, or a kernel vulnerability, root inside = root outside.
Common mistake: “But my app needs root to bind to port 80.” No it doesn’t. Use port 8080+ inside the container and map with `-p 80:8080`. Or use `setcap cap_net_bind_service` to grant just that one capability.
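One way to catch images that still run as root is to check which user a Dockerfile ends with. The sketch below is a minimal, hypothetical linter: the function names and the sample text are illustrative, it only looks at `USER` instructions in the final build stage, and it does not resolve `ARG` substitutions or the base image's own default user.

```python
from typing import Optional


def final_user(dockerfile_text: str) -> Optional[str]:
    """Return the value of the last USER instruction in the final build stage.

    Returns None when the final stage has no USER instruction, in which case
    the image inherits the base image's user (often root).
    """
    user = None
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            user = None  # a new stage starts; earlier USER lines no longer apply
        elif instruction == "USER" and len(parts) > 1:
            user = parts[1]
    return user


def may_run_as_root(dockerfile_text: str) -> bool:
    """True when the final stage sets no USER or explicitly selects root."""
    user = final_user(dockerfile_text)
    if user is None:
        return True  # conservative: assume the base image defaults to root
    name = user.split(":", 1)[0]  # USER may be written as user:group
    return name in ("root", "0")


SAMPLE = """\
FROM node:20-slim
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
COPY --chown=appuser:appuser . .
USER appuser
CMD ["node", "server.js"]
"""

if __name__ == "__main__":
    print(may_run_as_root(SAMPLE))  # prints False
```

A check like this fits in CI as a cheap first gate; it complements, rather than replaces, inspecting the built image's configured user.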
Linux Capabilities
Traditional Linux: either you’re root (can do everything) or you’re not. Capabilities split root’s power into ~40 individual permissions.
| Capability | What It Allows | Should You Grant It? |
|---|---|---|
| NET_BIND_SERVICE | Bind to ports < 1024 | Usually fine |
| SYS_PTRACE | Debug/trace other processes | Only for debugging containers |
| NET_ADMIN | Modify network config | Network tools only |
| SYS_ADMIN | Mount filesystems, many admin ops | Almost NEVER: it’s nearly equivalent to full root |
| NET_RAW | Raw sockets (ping, tcpdump) | Drop in production (used for ARP spoofing attacks) |
| MKNOD | Create device files | Drop in production |
| AUDIT_WRITE | Write to kernel audit log | Drop unless needed |
Docker’s default: drops most dangerous capabilities but keeps ~14. Best practice:
# Drop ALL capabilities, add back only what you need
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp
ELI5: Capabilities are like giving someone specific keys instead of a master key. Instead of “here’s full admin access” (root), you say “you can open the mailbox (NET_BIND_SERVICE) but not the vault (SYS_ADMIN).” Drop all keys first, then hand back only the ones actually needed.
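To see which keys a process actually holds, the kernel exposes its capability sets as hex bitmasks (for example the `CapEff` line in `/proc/self/status`). The sketch below decodes such a mask into names, using the bit numbering from `<linux/capability.h>`. The function name `decode_caps` is illustrative, and the sample mask is an assumption: it is the value a container started with Docker's default capability set typically reports.

```python
from typing import List

# Capability names indexed by bit number, as defined in <linux/capability.h>.
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> List[str]:
    """Turn a hex capability mask (e.g. the CapEff value) into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:  # stop once no higher bits remain
        if mask & (1 << bit):
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"UNKNOWN_{bit}")
        bit += 1
    return names


if __name__ == "__main__":
    default_caps = decode_caps("00000000a80425fb")  # assumed Docker default mask
    print(len(default_caps), "NET_RAW" in default_caps, "SYS_ADMIN" in default_caps)
```

After `--cap-drop=ALL --cap-add=NET_BIND_SERVICE`, the same decoding applied to the container's `CapEff` value should yield only `NET_BIND_SERVICE` for a root process.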
Seccomp Profiles
Seccomp restricts which system calls a container can make. Docker’s default profile blocks ~44 of ~300+ syscalls.
Blocked by default (dangerous):
- `mount`: mount filesystems
- `reboot`: reboot the host
- `kexec_load`: load a new kernel
- `ptrace`: debug processes (blocked only on kernels older than 4.8, unless the `SYS_PTRACE` capability is added)
- `unshare`: create new namespaces
# Docker's default profile (recommended baseline) is applied automatically
# when no seccomp option is passed; the option value is otherwise a file path
docker run myapp
# Custom profile (restrict further)
docker run --security-opt seccomp=custom-profile.json myapp
# Disable seccomp (NEVER in production)
docker run --security-opt seccomp=unconfined myapp
Think of it this way: If capabilities control which “rooms” a process can enter, seccomp controls which “tools” it can use inside those rooms. Even if a process has the capability to do networking, seccomp can block the specific system calls needed for raw socket operations.
Common mistake: Disabling seccomp because an app “doesn’t work.” Instead, run with strace to find which syscalls are blocked, and add only those to a custom profile.
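The strace workflow can be partly automated. The sketch below extracts syscall names from `strace -f` style output and wraps them in a deny-by-default profile using the JSON layout Docker accepts (`defaultAction` plus a `syscalls` list). The helper names and the sample trace lines are made up for illustration. A trace rarely exercises every code path, and the container runtime needs some syscalls of its own, so treat the result as input for merging into Docker's default profile, not as a complete profile.

```python
import json
import re
from typing import Dict, Iterable, List

# Syscall name at the start of an strace line, with an optional
# "[pid 123] " prefix as printed by `strace -f`.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(lines: Iterable[str]) -> List[str]:
    """Collect the distinct syscall names seen in strace output, sorted."""
    seen = set()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # signal and exit lines ("--- ... ---", "+++ ... +++") do not match
            seen.add(match.group(1))
    return sorted(seen)


def allowlist_profile(syscalls: List[str]) -> Dict:
    """Build a seccomp profile that denies everything except `syscalls`."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": syscalls, "action": "SCMP_ACT_ALLOW"}],
    }


SAMPLE_TRACE = [
    'execve("/usr/bin/node", ["node", "server.js"], 0x7ffd) = 0',
    '[pid  4242] openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3',
    '[pid  4242] read(3, "127.0.0.1 localhost\\n", 4096) = 20',
    "--- SIGCHLD {si_signo=SIGCHLD} ---",
    "+++ exited with 0 +++",
]

if __name__ == "__main__":
    profile = allowlist_profile(syscalls_from_strace(SAMPLE_TRACE))
    print(json.dumps(profile, indent=2))
```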
AppArmor and SELinux
Mandatory Access Control (MAC) systems that restrict file/network/capability access at the kernel level.
| Feature | AppArmor | SELinux |
|---|---|---|
| Default on | Ubuntu, Debian, SUSE | RHEL, CentOS, Fedora |
| Policy model | Path-based | Label-based |
| Complexity | Simpler to write profiles | Steeper learning curve |
| Docker support | Default profile applied | Default profile applied |
| K8s support | Pod annotation | Pod seLinuxOptions |
Docker applies a default AppArmor profile that prevents:
- Writing to `/proc` and `/sys` (except allowed paths)
- Mounting filesystems
- Accessing raw devices
- Changing network configuration
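# Kubernetes: apply custom AppArmor profile (annotation form; since v1.30 the
# securityContext.appArmorProfile field is the preferred way to set this)
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/mycontainer: localhost/my-custom-profile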
Image Security & Supply Chain
Vulnerability Scanning
| Tool | Type | Integrates With | Best For |
|---|---|---|---|
| Docker Scout | SaaS + CLI | Docker Desktop, CI/CD | Docker ecosystem, SBOM |
| Trivy | OSS CLI | CI/CD, K8s admission | Fast, comprehensive, free |
| Grype | OSS CLI | CI/CD, Syft (SBOM) | Anchore ecosystem |
| Snyk | SaaS | CI/CD, IDE, registries | Developer-friendly, fix suggestions |
# Scan with Trivy
trivy image myapp:latest
# Scan with Docker Scout
docker scout cves myapp:latest
# Generate SBOM with Syft
syft myapp:latest -o spdx-json > sbom.json
Why SBOM matters: A Software Bill of Materials lists every package in your image. When a new CVE drops (like Log4Shell), you can instantly know which images are affected without scanning everything. SBOM is becoming mandatory for government contracts (US Executive Order 14028).
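Answering "which images contain package X?" from stored SBOMs can be a few lines of code. The sketch below assumes the SPDX JSON layout, where a top-level `packages` array holds entries with `name` and `versionInfo` fields. The function name `find_package` and the sample SBOM contents are invented for illustration.

```python
from typing import Dict, List, Tuple


def find_package(sbom: Dict, name: str) -> List[Tuple[str, str]]:
    """Return (name, version) for every SPDX package whose name contains `name`."""
    hits = []
    for pkg in sbom.get("packages", []):
        pkg_name = pkg.get("name", "")
        if name.lower() in pkg_name.lower():
            hits.append((pkg_name, pkg.get("versionInfo", "unknown")))
    return hits


# Made-up SBOM fragment; a real one would be loaded with json.load() from a
# file such as the sbom.json generated by Syft.
SAMPLE_SBOM = {
    "spdxVersion": "SPDX-2.3",
    "packages": [
        {"name": "log4j-core", "versionInfo": "2.14.1"},
        {"name": "openssl", "versionInfo": "3.0.13"},
    ],
}

if __name__ == "__main__":
    print(find_package(SAMPLE_SBOM, "log4j"))
```

Running the same lookup over every stored SBOM gives the list of affected images without pulling or rescanning them.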
Image Signing and Trust
# Sign with Cosign (Sigstore)
cosign sign --key cosign.key myregistry.io/myapp:v1.0
# Verify
cosign verify --key cosign.pub myregistry.io/myapp:v1.0
Admission control: Block unsigned/unscanned images from deploying:
- K8s: Kyverno or OPA/Gatekeeper policies
- Docker: Content Trust (`DOCKER_CONTENT_TRUST=1`)
Layer Secrets Leak
# BAD: secret is baked into a layer (visible via docker history)
COPY .env /app/.env
RUN echo "password123" > /app/secret.txt
# ALSO BAD: deleting in a later layer doesn't remove from previous layer
COPY credentials.json /tmp/creds.json
RUN use-creds /tmp/creds.json && rm /tmp/creds.json # STILL IN LAYER
# GOOD: use multi-stage build
FROM builder AS build
COPY credentials.json /tmp/creds.json
RUN use-creds /tmp/creds.json
FROM runtime
COPY --from=build /app/output /app/output
# credentials.json never appears in final image
Common mistake: Thinking `RUN rm secret.txt` removes the secret. It doesn’t: each layer is immutable. The file exists in the layer where it was added. Use multi-stage builds or BuildKit secrets (`--mount=type=secret`).
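A toy model makes the layer behavior concrete. In the sketch below each layer is a dictionary of path to content, and a deletion is recorded as a marker in a later layer, loosely mirroring how union filesystems use whiteout entries. The data structures and function names are a deliberate simplification for illustration, not how any image format stores data.

```python
from typing import Dict, List, Optional

# Each layer maps a path to file content; None marks a deletion ("whiteout").
Layer = Dict[str, Optional[str]]


def merged_view(layers: List[Layer]) -> Dict[str, str]:
    """Files visible in the running container: later layers override earlier ones."""
    view: Dict[str, str] = {}
    for layer in layers:
        for path, content in layer.items():
            if content is None:
                view.pop(path, None)  # the deletion only hides the file
            else:
                view[path] = content
    return view


def layers_containing(layers: List[Layer], path: str) -> List[int]:
    """Indexes of the layers that still store content for `path`."""
    return [i for i, layer in enumerate(layers) if layer.get(path) is not None]


LAYERS: List[Layer] = [
    {"/tmp/creds.json": '{"token": "example"}'},        # COPY credentials.json ...
    {"/app/output": "built", "/tmp/creds.json": None},  # RUN use-creds ... && rm ...
]

if __name__ == "__main__":
    print(merged_view(LAYERS))                           # the secret is not visible
    print(layers_containing(LAYERS, "/tmp/creds.json"))  # but layer 0 still holds it
```

Anyone who can pull the image can read layer 0 directly, which is why the secret has to stay out of every layer of the final image.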
Kubernetes Security
RBAC (Role-Based Access Control)
User/ServiceAccount → RoleBinding → Role → Resources + Verbs
(who) (what they can do)
| Scope | Role Type | Binding Type | Applies To |
|---|---|---|---|
| Namespace | Role | RoleBinding | Resources in one namespace |
| Cluster-wide | ClusterRole | ClusterRoleBinding | All namespaces, cluster resources |
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"] # read-only, no create/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
- kind: ServiceAccount
  name: monitoring-sa
  namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Principle of least privilege: Every ServiceAccount should have only the permissions it needs. The default ServiceAccount in every namespace has no permissions by default — don’t add any. Create specific ServiceAccounts for each workload.
Common mistake: Binding the `cluster-admin` ClusterRole to a ServiceAccount because “it works.” This gives full cluster access. If that pod is compromised, the attacker owns the entire cluster.
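How a set of rules turns into an allow or deny decision can be sketched in a few lines. The evaluator below is a simplified model, not the Kubernetes authorizer: it ignores `resourceNames`, subresources, and non-resource URLs, and the names `is_allowed`, `POD_READER`, and `CLUSTER_ADMIN` are illustrative. It does capture two properties worth remembering: rules are purely additive, and `*` matches everything.

```python
from typing import Dict, List

Rule = Dict[str, List[str]]


def _matches(allowed: List[str], value: str) -> bool:
    """A rule field matches on an exact entry or on the "*" wildcard."""
    return "*" in allowed or value in allowed


def is_allowed(rules: List[Rule], api_group: str, resource: str, verb: str) -> bool:
    """True if any rule grants `verb` on `resource` in `api_group`.

    There are no deny rules: a request is allowed as soon as one rule
    matches and denied when none do.
    """
    return any(
        _matches(rule.get("apiGroups", []), api_group)
        and _matches(rule.get("resources", []), resource)
        and _matches(rule.get("verbs", []), verb)
        for rule in rules
    )


# Rules equivalent to a read-only pod Role, and a wildcard rule in the
# style of cluster-admin.
POD_READER: List[Rule] = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}
]
CLUSTER_ADMIN: List[Rule] = [{"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}]

if __name__ == "__main__":
    print(is_allowed(POD_READER, "", "pods", "list"))
    print(is_allowed(POD_READER, "", "pods", "delete"))
    print(is_allowed(CLUSTER_ADMIN, "", "secrets", "delete"))
```

On a real cluster, `kubectl auth can-i` answers the same question against the actual bindings.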
Pod Security Standards (PSS)
Pod Security Standards are enforced by the built-in Pod Security Admission controller, which replaced PodSecurityPolicy (deprecated in 1.21, removed in 1.25).
| Level | What It Allows | Use Case |
|---|---|---|
| Privileged | Everything | System-level workloads (CNI, storage drivers) |
| Baseline | Blocks known privilege escalations | General workloads with minor restrictions |
| Restricted | Blocks all privilege escalation, enforces non-root | Production workloads, security-sensitive |
# Apply restricted policy to a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
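To get a feel for what the Restricted level rejects, the sketch below checks a pod spec (as a plain dictionary) against a handful of its controls: non-root, no privileged mode, no privilege escalation, all capabilities dropped, and a seccomp profile set. This is a partial, illustrative subset with an invented function name; the real admission controller checks more fields (volume types, host namespaces, init and ephemeral containers, and so on).

```python
from typing import Dict, List


def restricted_violations(pod_spec: Dict) -> List[str]:
    """Report violations of a few Restricted-level controls in a pod spec dict."""
    problems = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        name = container.get("name", "<unnamed>")
        # Container-level settings take precedence over pod-level ones.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        if sc.get("privileged") is True:
            problems.append(f"{name}: privileged must not be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
        seccomp = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


# Hypothetical pod spec that satisfies the checks above.
HARDENED_POD = {
    "securityContext": {"runAsNonRoot": True, "seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [
        {
            "name": "app",
            "securityContext": {
                "allowPrivilegeEscalation": False,
                "capabilities": {"drop": ["ALL"]},
            },
        }
    ],
}

if __name__ == "__main__":
    print(restricted_violations(HARDENED_POD))
    print(restricted_violations({"containers": [{"name": "legacy"}]}))
```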
Secrets Management
| Method | Security Level | Use Case |
|---|---|---|
| K8s Secrets (base64) | Low — not encrypted at rest by default | Non-sensitive config |
| K8s Secrets + encryption at rest | Medium | Moderate sensitivity |
| External Secrets Operator | High — fetches from Vault/AWS SM/GCP SM | Production secrets |
| Sealed Secrets | High — encrypted in Git, decrypted in cluster | GitOps workflows |
| CSI Secret Store | High — mounts secrets as files from external provider | Direct integration |
Why this matters: K8s Secrets are base64-encoded, NOT encrypted. Anyone with `kubectl get secret` access can read them. For real security, enable encryption at rest (EncryptionConfiguration) AND use an external secrets manager.
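The difference between encoding and encryption is easy to demonstrate: base64 needs no key to reverse. The sketch below decodes the `data` map of a Secret the way anyone with read access could. The helper name and the sample values are made up for illustration.

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Decode the base64 values found under a Secret's `data` field."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}


# Made-up Secret payload, shaped like the `data` field in `kubectl get secret -o json`.
SAMPLE_DATA = {"password": base64.b64encode(b"password123").decode("ascii")}

if __name__ == "__main__":
    print(SAMPLE_DATA)                      # looks opaque
    print(decode_secret_data(SAMPLE_DATA))  # recovered without any key
```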
The Security Checklist (Interview Ready)
Image Build Time
- Use minimal base images (distroless, Alpine, scratch)
- Run as non-root USER in Dockerfile
- No secrets in image layers (use BuildKit `--mount=type=secret`)
- Pin base image versions (no `:latest`)
- Scan for CVEs in CI pipeline (Trivy/Scout)
- Generate and store SBOM
- Sign images (Cosign/Notary)
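Some of these build-time items can be checked mechanically. As one example, the sketch below flags `FROM` lines whose image uses an implicit or explicit `:latest` tag. It is a hypothetical, simplified check: the function name is invented, it skips `scratch`, digests, and references to earlier build stages, and it does not expand `ARG` variables.

```python
from typing import List


def unpinned_base_images(dockerfile_text: str) -> List[str]:
    """Return base images that use an implicit or explicit :latest tag."""
    stages = set()  # names introduced with "FROM ... AS name"
    flagged = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].upper() != "FROM":
            continue
        args = [p for p in parts[1:] if not p.startswith("--")]  # skip --platform=...
        if not args:
            continue
        image = args[0]
        is_stage_ref = image.lower() in stages
        if image != "scratch" and not is_stage_ref and "@" not in image:
            last_segment = image.rsplit("/", 1)[-1]  # ignore ":" in a registry host:port
            tag = last_segment.split(":", 1)[1] if ":" in last_segment else "latest"
            if tag == "latest":
                flagged.append(image)
        if len(args) >= 3 and args[1].upper() == "AS":
            stages.add(args[2].lower())
    return flagged


SAMPLE = """\
FROM node:20-slim AS build
FROM build AS test
FROM nginx
FROM localhost:5000/app
FROM ubuntu:latest
FROM alpine@sha256:abc
FROM scratch
"""

if __name__ == "__main__":
    print(unpinned_base_images(SAMPLE))
```

Pinning by digest is stricter than pinning by tag, since a tag can be moved to a different image; the check above accepts both.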
Runtime
- Drop all capabilities, add back only needed
- Enable seccomp (default or custom profile)
- Read-only root filesystem (`--read-only`)
- No privileged mode (`--privileged=false`)
- Resource limits (memory, CPU, PIDs)
- No host namespace sharing unless required
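The runtime items map onto `docker run` flags, so a launcher script can apply them consistently. The sketch below only assembles the argument list and runs nothing. The function name, the default user, and the limit values are assumptions to adjust per workload; seccomp is left at Docker's default, which applies when no seccomp option is passed.

```python
import shlex
from typing import List, Sequence


def hardened_run_args(
    image: str,
    add_caps: Sequence[str] = (),
    user: str = "1000:1000",   # assumed non-root UID:GID
    memory: str = "256m",      # assumed limits; tune per workload
    cpus: str = "0.5",
    pids_limit: int = 100,
) -> List[str]:
    """Build a `docker run` argument list that applies the runtime checklist."""
    args = [
        "docker", "run",
        "--user", user,            # non-root
        "--cap-drop=ALL",          # drop every capability first
        "--read-only",             # read-only root filesystem
        "--memory", memory,
        "--cpus", cpus,
        "--pids-limit", str(pids_limit),
    ]
    args += [f"--cap-add={cap}" for cap in add_caps]  # add back only what is needed
    args.append(image)
    return args


if __name__ == "__main__":
    print(shlex.join(hardened_run_args("myapp", add_caps=["NET_BIND_SERVICE"])))
```

Keeping the flags in one function means a new workload gets the hardened defaults unless someone deliberately loosens them.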
Kubernetes
- Pod Security Standards: restricted for prod namespaces
- RBAC: per-workload ServiceAccounts, least privilege
- Network Policies: default deny, explicit allow
- Secrets: external secrets manager, encryption at rest
- Admission control: block unsigned/vulnerable images
- Audit logging enabled
Key Takeaways for Interviews
- “How do you secure a container?” → Non-root user, drop all capabilities + add back needed, seccomp default profile, read-only rootfs, resource limits, scan images.
- “Explain RBAC” → Users/ServiceAccounts → RoleBindings → Roles → Resources+Verbs. Namespace-scoped or cluster-wide. Least privilege always.
- “How do you handle secrets?” → Never in image layers. K8s Secrets + encryption at rest minimum. External secrets manager (Vault, AWS SM) for production.
- “What’s the difference between capabilities and seccomp?” → Capabilities = which permission categories (coarse). Seccomp = which system calls (fine-grained). Use both.
- “How do you prevent container escape?” → Non-root + user namespaces + no privileged mode + drop SYS_ADMIN + seccomp + AppArmor/SELinux + patched kernel.