Container Security

A single misconfigured container can be enough to compromise an entire cluster. Container security is not optional; it is table stakes for any production deployment.

The Threat Model

Supply Chain          Runtime              Escape
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Malicious     │    │ Container    │    │ Kernel       │
│ base image    │    │ runs as root │    │ exploit      │
│ Vulnerable    │    │ Excessive    │    │ Mounted host │
│ dependencies  │    │ capabilities │    │ filesystem   │
│ Leaked secrets│    │ No seccomp   │    │ Privileged   │
│ in layers     │    │ No AppArmor  │    │ mode         │
└──────────────┘    └──────────────┘    └──────────────┘

ELI5: Container security is like airport security with three checkpoints. Supply chain = checking your bags before you board (are there dangerous items baked into the image?). Runtime = in-flight rules (what passengers can and can’t do during the flight). Escape prevention = making sure nobody can open the emergency exit mid-flight (break out of the container to the host).

Running as Non-Root

Running as a non-root user is the single most impactful security measure. By default, containers run as root (UID 0), which maps to root on the host unless user namespaces are enabled.

# Create a non-root user and switch to it
FROM node:20-slim
RUN groupadd -r appuser && useradd -r -g appuser appuser
WORKDIR /app
COPY --chown=appuser:appuser . .
USER appuser
CMD ["node", "server.js"]

Why this matters: If an attacker exploits your app and gets code execution inside the container, they’re root. If the container has any host mounts, shared namespaces, or a kernel vulnerability, root inside = root outside.

Common mistake: “But my app needs root to bind to port 80.” No it doesn’t. Use port 8080+ inside the container and map with -p 80:8080. Or use setcap cap_net_bind_service to grant just that one capability.
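
As a defence-in-depth step, the application can also refuse to start as root, so a missing USER line fails loudly instead of silently. A minimal Python sketch, assuming a Unix-like container (`os.geteuid` does not exist on Windows); the function names are hypothetical:

```python
import os
import sys


def is_root(euid: int) -> bool:
    """Return True when the given effective UID is 0 (root)."""
    return euid == 0


def refuse_root() -> None:
    """Exit with an error message if the process was started as root."""
    if is_root(os.geteuid()):
        sys.exit("refusing to run as root; set USER in the Dockerfile")
```

Calling `refuse_root()` at the top of the entrypoint turns a configuration mistake into an immediate startup failure.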

Linux Capabilities

Traditional Linux: either you’re root (can do everything) or you’re not. Capabilities split root’s power into ~40 individual permissions.

Capability	What It Allows	Should You Grant It?
`NET_BIND_SERVICE`	Bind to ports < 1024	Usually fine
`SYS_PTRACE`	Debug/trace other processes	Only for debugging containers
`NET_ADMIN`	Modify network config	Network tools only
`SYS_ADMIN`	Mount filesystems, many admin ops	Almost NEVER — it’s nearly equivalent to full root
`NET_RAW`	Raw sockets (ping, tcpdump)	Drop in production (used for ARP spoofing attacks)
`MKNOD`	Create device files	Drop in production
`AUDIT_WRITE`	Write to kernel audit log	Drop unless needed

Docker’s default: drops most dangerous capabilities but keeps ~14. Best practice:

# Drop ALL capabilities, add back only what you need
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp

ELI5: Capabilities are like giving someone specific keys instead of a master key. Instead of “here’s full admin access” (root), you say “you can open the mailbox (NET_BIND_SERVICE) but not the vault (SYS_ADMIN).” Drop all keys first, then hand back only the ones actually needed.
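
To see which keys a process actually holds, you can read the `CapEff` line from `/proc/self/status` inside the container and decode the hex bitmask. The sketch below is illustrative: the bit numbers follow the Linux capability list, `decode_caps` is a hypothetical helper, and the sample mask `00000000a80425fb` is the value typically seen with Docker's default capability set, which you should verify on your own runtime.

```python
# Capability names indexed by bit number (CAP_CHOWN is bit 0, and so on).
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID",
    "KILL", "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE",
    "NET_BIND_SERVICE", "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK",
    "IPC_OWNER", "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE",
    "SYS_PACCT", "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE",
    "SYS_TIME", "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE",
    "AUDIT_CONTROL", "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG",
    "WAKE_ALARM", "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF",
    "CHECKPOINT_RESTORE",
]


def decode_caps(hex_mask: str) -> list[str]:
    """Turn a CapEff-style hex bitmask into a list of capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


# Assumed sample: mask commonly reported under Docker's default capability set.
default_caps = decode_caps("00000000a80425fb")
# Mask with only bit 10 set, as after --cap-drop=ALL --cap-add=NET_BIND_SERVICE.
minimal_caps = decode_caps("0000000000000400")

print(default_caps)
print(minimal_caps)
```

Comparing the two lists makes the effect of `--cap-drop=ALL` visible: `NET_RAW` and `MKNOD` appear in the first list but not in the second.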

Seccomp Profiles

Seccomp restricts which system calls a container can make. Docker’s default profile blocks ~44 of ~300+ syscalls.

Blocked by default (dangerous):

mount — mount filesystems
reboot — reboot the host
kexec_load — load new kernel
ptrace — debug processes (unless SYS_PTRACE cap added)
unshare — create new namespaces

# Docker's default profile is applied automatically; no flag is needed (recommended baseline)
docker run myapp

# Custom profile (restrict further)
docker run --security-opt seccomp=custom-profile.json myapp

# Disable seccomp (NEVER in production)
docker run --security-opt seccomp=unconfined myapp

Think of it this way: If capabilities control which “rooms” a process can enter, seccomp controls which “tools” it can use inside those rooms. Even if a process has the capability to do networking, seccomp can block the specific system calls needed for raw socket operations.

Common mistake: Disabling seccomp because an app “doesn’t work.” Instead, run with strace to find which syscalls are blocked, and add only those to a custom profile.
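
Once you know which syscalls the app needs, a custom profile is just a JSON file with a default-deny action and an allowlist. The sketch below generates one in the JSON layout Docker accepts for `--security-opt seccomp=<file>`. The helper name and the syscall list are hypothetical, and the list is far too short for a real application; treat it as a placeholder for whatever your own tracing shows.

```python
import json


def build_seccomp_profile(allowed_syscalls: list[str]) -> dict:
    """Build a default-deny seccomp profile that allows only the given syscalls."""
    return {
        # Any syscall not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical allowlist; a real one comes from tracing your application.
EXAMPLE_SYSCALLS = ["read", "write", "exit_group", "futex", "read"]

profile = build_seccomp_profile(EXAMPLE_SYSCALLS)
print(json.dumps(profile, indent=2))
```

Saving the printed JSON as custom-profile.json gives a file in the shape the custom profile command above expects.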

AppArmor and SELinux

Mandatory Access Control (MAC) systems that restrict file/network/capability access at the kernel level.

Feature	AppArmor	SELinux
Default on	Ubuntu, Debian, SUSE	RHEL, CentOS, Fedora
Policy model	Path-based	Label-based
Complexity	Simpler to write profiles	Steeper learning curve
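Docker support	Default profile applied	Default `container_t` label applied when the daemon runs with SELinux support enabled
K8s support	`securityContext.appArmorProfile` (v1.30+), pod annotation on older versions	Pod `seLinuxOptions`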

Docker applies a default AppArmor profile that prevents:

Writing to /proc and /sys (except allowed paths)
Mounting filesystems
Accessing raw devices
Changing network configuration

# Kubernetes: apply custom AppArmor profile (annotation form; v1.30+ prefers the securityContext.appArmorProfile field)
metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/mycontainer: localhost/my-custom-profile

Image Security & Supply Chain

Vulnerability Scanning

Tool	Type	Integrates With	Best For
Docker Scout	SaaS + CLI	Docker Desktop, CI/CD	Docker ecosystem, SBOM
Trivy	OSS CLI	CI/CD, K8s admission	Fast, comprehensive, free
Grype	OSS CLI	CI/CD, Syft (SBOM)	Anchore ecosystem
Snyk	SaaS	CI/CD, IDE, registries	Developer-friendly, fix suggestions

# Scan with Trivy
trivy image myapp:latest

# Scan with Docker Scout
docker scout cves myapp:latest

# Generate SBOM with Syft
syft myapp:latest -o spdx-json > sbom.json

Why SBOM matters: A Software Bill of Materials lists every package in your image. When a new CVE drops (like Log4Shell), you can instantly know which images are affected without scanning everything. SBOM is becoming mandatory for government contracts (US Executive Order 14028).
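
The "instantly know which images are affected" step amounts to a lookup over stored SBOM files. A minimal sketch, assuming SPDX 2.x JSON where each entry under `packages` carries `name` and `versionInfo`; the helper name and the inline sample data are made up for illustration:

```python
def find_package(sbom: dict, name: str) -> list[str]:
    """Return every recorded version of the named package in an SPDX JSON SBOM."""
    return [
        pkg.get("versionInfo", "unknown")
        for pkg in sbom.get("packages", [])
        if pkg.get("name") == name
    ]


# Hypothetical, heavily trimmed SBOM; a real file has many more fields.
sbom = {
    "spdxVersion": "SPDX-2.3",
    "packages": [
        {"name": "log4j-core", "versionInfo": "2.14.1"},
        {"name": "openssl", "versionInfo": "3.0.13"},
    ],
}

print(find_package(sbom, "log4j-core"))
print(find_package(sbom, "zlib"))
```

Running the same lookup over the SBOM of every deployed image answers "where is this package?" without pulling or rescanning any image.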

Image Signing and Trust

# Sign with Cosign (Sigstore)
cosign sign --key cosign.key myregistry.io/myapp:v1.0

# Verify
cosign verify --key cosign.pub myregistry.io/myapp:v1.0

Admission control: Block unsigned/unscanned images from deploying:

K8s: Kyverno or OPA/Gatekeeper policies
Docker: Content Trust (DOCKER_CONTENT_TRUST=1)

Layer Secrets Leak

# BAD: secret is baked into a layer (visible via docker history)
COPY .env /app/.env
RUN echo "password123" > /app/secret.txt

# ALSO BAD: deleting in a later layer doesn't remove from previous layer
COPY credentials.json /tmp/creds.json
RUN use-creds /tmp/creds.json && rm /tmp/creds.json  # STILL IN LAYER

# GOOD: use multi-stage build
FROM builder AS build
COPY credentials.json /tmp/creds.json
RUN use-creds /tmp/creds.json

FROM runtime
COPY --from=build /app/output /app/output
# credentials.json never appears in final image

Common mistake: Thinking RUN rm secret.txt removes the secret. It doesn’t — each layer is immutable. The file exists in the layer where it was added. Use multi-stage builds or BuildKit secrets (--mount=type=secret).
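
The immutability of layers can be simulated without Docker. In the sketch below, each layer is a tar archive and a deletion is recorded as a whiteout marker (`.wh.<name>`), which is how OCI image layers represent removed files. The helper names and the secret value are hypothetical, and the merge logic is a simplification of what a real union filesystem does.

```python
import io
import tarfile


def make_layer(files: dict[str, bytes]) -> bytes:
    """Pack the given files into an in-memory tar archive (one image layer)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def files_in_layer(layer: bytes) -> dict[str, bytes]:
    """Read back every regular file stored in a single layer."""
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers() if m.isfile()}


def merged_view(layers: list[bytes]) -> dict[str, bytes]:
    """Apply layers in order; a '.wh.<name>' entry hides <name> from the result."""
    view: dict[str, bytes] = {}
    for layer in layers:
        for path, data in files_in_layer(layer).items():
            directory, _, name = path.rpartition("/")
            if name.startswith(".wh."):
                hidden = name[len(".wh."):]
                view.pop(f"{directory}/{hidden}" if directory else hidden, None)
            else:
                view[path] = data
    return view


# Layer 1: COPY credentials.json /tmp/creds.json
layer1 = make_layer({"tmp/creds.json": b'{"token": "hypothetical-secret"}'})
# Layer 2: RUN rm /tmp/creds.json (recorded only as a whiteout marker)
layer2 = make_layer({"tmp/.wh.creds.json": b""})

print("running container sees:", merged_view([layer1, layer2]))
print("layer 1 still contains:", files_in_layer(layer1))
```

The merged view is empty, yet anyone who pulls the image and unpacks the first layer gets the credentials back.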

Kubernetes Security

RBAC (Role-Based Access Control)

User/ServiceAccount → RoleBinding → Role → Resources + Verbs
                      (who)         (what they can do)

Scope	Role Type	Binding Type	Applies To
Namespace	Role	RoleBinding	Resources in one namespace
Cluster-wide	ClusterRole	ClusterRoleBinding	All namespaces, cluster resources

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]  # read-only, no create/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: read-pods
subjects:
- kind: ServiceAccount
  name: monitoring-sa
  namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

Principle of least privilege: Every ServiceAccount should have only the permissions it needs. The default ServiceAccount in every namespace has no permissions by default — don’t add any. Create specific ServiceAccounts for each workload.

Common mistake: Binding cluster-admin ClusterRole to a ServiceAccount because “it works.” This gives full cluster access. If that pod is compromised, the attacker owns the entire cluster.
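
RBAC is purely additive: a request is allowed only if some rule matches its API group, resource, and verb, and there are no deny rules. The sketch below models that matching for the pod-reader Role and for a wildcard rule similar in spirit to cluster-admin. It is a simplified model with hypothetical names, not the API server's implementation, and it ignores resourceNames, subresources, and non-resource URLs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    api_groups: tuple[str, ...]
    resources: tuple[str, ...]
    verbs: tuple[str, ...]


def _matches(allowed: tuple[str, ...], value: str) -> bool:
    """A rule field matches on an exact value or on the '*' wildcard."""
    return "*" in allowed or value in allowed


def is_allowed(rules: list[PolicyRule], verb: str, resource: str, api_group: str = "") -> bool:
    """Allow the request if any single rule matches group, resource, and verb."""
    return any(
        _matches(r.api_groups, api_group)
        and _matches(r.resources, resource)
        and _matches(r.verbs, verb)
        for r in rules
    )


# Mirrors the pod-reader Role: read-only access to pods in the core API group.
pod_reader = [PolicyRule(api_groups=("",), resources=("pods",), verbs=("get", "list", "watch"))]
# Wildcard rule, comparable to what cluster-admin grants.
admin_like = [PolicyRule(api_groups=("*",), resources=("*",), verbs=("*",))]

print(is_allowed(pod_reader, "get", "pods"))
print(is_allowed(pod_reader, "delete", "pods"))
print(is_allowed(admin_like, "delete", "secrets"))
```

The wildcard rule matches every request, which is why a compromised pod bound to it can read secrets and delete workloads anywhere in the cluster.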

Pod Security Standards (PSS)

Pod Security Standards, enforced by the built-in Pod Security Admission controller, replaced PodSecurityPolicy (deprecated in 1.21, removed in 1.25).

Level	What It Allows	Use Case
Privileged	Everything	System-level workloads (CNI, storage drivers)
Baseline	Blocks known privilege escalations	General workloads with minor restrictions
Restricted	Blocks all privilege escalation, enforces non-root	Production workloads, security-sensitive

# Apply restricted policy to a namespace
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
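
To get a feel for what the restricted level rejects, the sketch below checks one container spec (as a Python dict) against a handful of its requirements. It is a rough approximation with a hypothetical function name: the real admission controller checks more fields and also accepts some of these settings at pod level, which this sketch ignores.

```python
def restricted_violations(container: dict) -> list[str]:
    """List the ways a container's securityContext falls short of 'restricted'."""
    sc = container.get("securityContext", {})
    problems = []
    if sc.get("privileged") is True:
        problems.append("privileged must not be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        problems.append("allowPrivilegeEscalation must be false")
    if sc.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if "ALL" not in sc.get("capabilities", {}).get("drop", []):
        problems.append("capabilities.drop must include ALL")
    if sc.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
        problems.append("seccompProfile.type must be RuntimeDefault or Localhost")
    return problems


# Hypothetical container that sets every field this sketch checks.
hardened = {
    "name": "app",
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "runAsNonRoot": True,
        "capabilities": {"drop": ["ALL"]},
        "seccompProfile": {"type": "RuntimeDefault"},
    },
}
# A container with no securityContext at all.
bare = {"name": "app"}

print(restricted_violations(hardened))
print(restricted_violations(bare))
```

A container with no securityContext fails four of these checks, which is why workloads usually need explicit hardening before a namespace can enforce restricted.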

Secrets Management

Method	Security Level	Use Case
K8s Secrets (base64)	Low — not encrypted at rest by default	Non-sensitive config
K8s Secrets + encryption at rest	Medium	Moderate sensitivity
External Secrets Operator	High — fetches from Vault/AWS SM/GCP SM	Production secrets
Sealed Secrets	High — encrypted in Git, decrypted in cluster	GitOps workflows
CSI Secret Store	High — mounts secrets as files from external provider	Direct integration

Why this matters: K8s Secrets are base64-encoded, NOT encrypted. Anyone with kubectl get secret access can read them. For real security, enable encryption at rest (EncryptionConfiguration) AND use an external secrets manager.
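
The difference between encoding and encryption is easy to demonstrate: base64 needs no key, so anyone who can read the Secret object can reverse it. A small sketch using only the Python standard library, with a made-up password value:

```python
import base64

# Hypothetical secret value, as it would be supplied by a user.
plaintext = b"password123"

# This is the form stored in the Secret's `data` field.
encoded = base64.b64encode(plaintext)

# Decoding requires no key or credential, only the encoded string.
decoded = base64.b64decode(encoded)

print(encoded.decode("ascii"))
print(decoded.decode("ascii"))
```

Because the round trip needs nothing but the encoded string, protection has to come from RBAC on Secret objects, encryption at rest, and an external secrets manager.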

The Security Checklist (Interview Ready)

Image Build Time

Use minimal base images (distroless, Alpine, scratch)
Run as non-root USER in Dockerfile
No secrets in image layers (use BuildKit --mount=type=secret)
Pin base image versions (no :latest)
Scan for CVEs in CI pipeline (Trivy/Scout)
Generate and store SBOM
Sign images (Cosign/Notary)

Runtime

Drop all capabilities, add back only needed
Enable seccomp (default or custom profile)
Read-only root filesystem (--read-only)
No privileged mode (--privileged=false)
Resource limits (memory, CPU, PIDs)
No host namespace sharing unless required

Kubernetes

Pod Security Standards: restricted for prod namespaces
RBAC: per-workload ServiceAccounts, least privilege
Network Policies: default deny, explicit allow
Secrets: external secrets manager, encryption at rest
Admission control: block unsigned/vulnerable images
Audit logging enabled

Key Takeaways for Interviews

“How do you secure a container?” → Non-root user, drop all capabilities + add back needed, seccomp default profile, read-only rootfs, resource limits, scan images.
“Explain RBAC” → Users/ServiceAccounts → RoleBindings → Roles → Resources+Verbs. Namespace-scoped or cluster-wide. Least privilege always.
“How do you handle secrets?” → Never in image layers. K8s Secrets + encryption at rest minimum. External secrets manager (Vault, AWS SM) for production.
“What’s the difference between capabilities and seccomp?” → Capabilities = which permission categories (coarse). Seccomp = which system calls (fine-grained). Use both.
“How do you prevent container escape?” → Non-root + user namespaces + no privileged mode + drop SYS_ADMIN + seccomp + AppArmor/SELinux + patched kernel.
