Production Operations

Production operations is the difference between “it works locally” and “it works at 3AM when you’re on-call.” This section covers health checks, graceful shutdown, resource management, logging, monitoring, and debugging running containers.


Health Checks

Without health checks, the orchestrator only knows if your process is alive — not if it’s healthy. A process can be running but deadlocked, OOM-thrashing, or unable to serve requests.

Docker HEALTHCHECK

HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1

| Parameter | Default | What It Means |
| --- | --- | --- |
| --interval | 30s | Time between checks |
| --timeout | 30s | Max time for check to complete |
| --retries | 3 | Failures before marking unhealthy |
| --start-period | 0s | Grace period on startup (failures don’t count) |
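
Slim and distroless images often ship without curl. As a rough sketch, the same check could be done by a small Python script, assuming the image already contains a Python interpreter. The function names and the default URL are illustrative, not part of the original setup. Docker treats exit code 0 as healthy and 1 as unhealthy.

```python
import urllib.error
import urllib.request


def check(url: str, timeout: float = 5.0) -> int:
    """Return 0 if the endpoint answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP 4xx/5xx, or a malformed URL
        # all count as unhealthy.
        return 1


def main(argv: list) -> int:
    """Entry point: optional argv[1] overrides the (assumed) default URL."""
    target = argv[1] if len(argv) > 1 else "http://localhost:8080/health"
    return check(target)
```

To wire this into HEALTHCHECK, the script would end with `sys.exit(main(sys.argv))` and be invoked as `CMD python /healthcheck.py` (a hypothetical path).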

Kubernetes Probes

| Probe | Purpose | Failure Action |
| --- | --- | --- |
| Startup | Is the app finished initializing? | Keep waiting (don’t run liveness/readiness) |
| Liveness | Is the app alive? | Restart the container |
| Readiness | Can the app serve traffic? | Remove from Service endpoints (stop sending traffic) |

containers:
- name: api
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
    failureThreshold: 2
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30    # 30 * 10s = 5 minutes to start
    periodSeconds: 10

ELI5: Liveness = “Are you conscious?” (if no → call an ambulance / restart). Readiness = “Can you take customers?” (if no → close the shop window / stop sending traffic). Startup = “Are you done getting dressed?” (don’t bug me until I’m ready).

Common mistake: Using liveness probes that depend on external services (database, cache). If the database goes down, the liveness probe fails, K8s restarts ALL your pods, which creates a thundering herd that makes the database even worse. Liveness should check only the process itself. Use readiness for dependency checks.
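
The split between liveness and readiness can be sketched on the application side as two endpoints. This is a minimal illustration using only the standard library. The `dependency_checks` registry and the `probe_status` helper are hypothetical names, and real checks would ping the database or cache.

```python
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict


def probe_status(path: str, dependency_checks: Dict[str, Callable[[], bool]]) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: being able to answer at all shows the process is not stuck.
        # No external dependency is consulted here on purpose.
        return 200
    if path == "/ready":
        # Readiness: every dependency must currently be reachable.
        ok = all(check() for check in dependency_checks.values())
        return 200 if ok else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    # Hypothetical registry, e.g. {"db": ping_db, "cache": ping_cache}.
    dependency_checks: Dict[str, Callable[[], bool]] = {}

    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, self.dependency_checks))
        self.end_headers()
```

With this shape, a database outage turns `/ready` into 503 (pods leave the Service) while `/healthz` keeps returning 200, so nothing is restarted. The handler could be served with `HTTPServer(("", 8080), ProbeHandler).serve_forever()`.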

Common mistake #2: No startup probe for slow-starting apps (JVM, large ML models). Without it, the liveness probe starts failing while the app is still loading, and the container is restarted before it ever becomes healthy. Use startupProbe with a high failureThreshold; liveness and readiness checks only begin once it succeeds.
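
The startup budget is simply failureThreshold × periodSeconds. A small sketch of the arithmetic follows; the `headroom` factor is an assumed safety margin, not a Kubernetes setting.

```python
import math


def startup_budget_s(failure_threshold: int, period_s: int) -> int:
    """Total time the startup probe allows before the container is killed."""
    return failure_threshold * period_s


def startup_failure_threshold(worst_case_start_s: float,
                              period_s: int = 10,
                              headroom: float = 1.5) -> int:
    """Smallest failureThreshold covering the worst-case start plus headroom."""
    return math.ceil(worst_case_start_s * headroom / period_s)
```

For the probe shown earlier, `startup_budget_s(30, 10)` gives 300 seconds, i.e. the 5 minutes noted in the YAML comment.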


Graceful Shutdown

When a container stops (deploy, scale-down, node drain), the orchestrator sends SIGTERM. Your app must:

  1. Stop accepting new requests
  2. Finish in-flight requests
  3. Close connections (DB, message queue)
  4. Exit cleanly
SIGTERM → app starts graceful shutdown → finishes work → exits 0
                                                          ↓
                                              (if too slow)
                                         SIGKILL after grace period
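
The first two steps (stop accepting, finish in-flight work) can be sketched in Python as follows. `ShutdownCoordinator` and its method names are hypothetical. A real server would call `try_begin_request` and `end_request` around each request and close its connections after the drain completes.

```python
import signal
import threading


class ShutdownCoordinator:
    """Tracks in-flight requests and refuses new ones after SIGTERM."""

    def __init__(self) -> None:
        self.stopping = False  # plain flag: safe to set from a signal handler
        self._in_flight = 0
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)

    def handle_signal(self, signum, frame) -> None:
        # Only flip the flag here; no locks are taken inside the handler.
        self.stopping = True

    def try_begin_request(self) -> bool:
        """Return False once shutdown has started (step 1)."""
        with self._lock:
            if self.stopping:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._idle:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def wait_for_drain(self, timeout: float) -> bool:
        """Block until in-flight work finishes (step 2) or timeout expires."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)


def install(coordinator: ShutdownCoordinator) -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, coordinator.handle_signal)
```

The drain timeout should stay below the orchestrator's grace period so the process exits on its own before SIGKILL arrives.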

The PID 1 Problem

If your Dockerfile uses shell form CMD node server.js, the process tree is:

PID 1: /bin/sh -c "node server.js"
  PID 2: node server.js

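SIGTERM goes to PID 1 (sh). The shell does not forward it to node, and because the kernel skips default signal actions for PID 1, sh itself keeps running too. After docker stop's default 10-second timeout, everything gets SIGKILL. Your app never gets a chance to shut down gracefully.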

Fix 1: Use exec form: CMD ["node", "server.js"] — Node runs as PID 1 directly.

Fix 2: Run an init process as PID 1, either with docker run --init or by adding tini to the image: ENTRYPOINT ["/tini", "--", "node", "server.js"]

Why tini/dumb-init matter: PID 1 has two special responsibilities: (a) handle signals properly (default signal dispositions don’t apply to PID 1), (b) reap zombie processes. Most applications don’t implement either. tini handles both and forwards signals to your app.

Kubernetes Grace Period

spec:
  terminationGracePeriodSeconds: 60  # default: 30

Timeline on pod deletion:

  1. Pod marked Terminating
  2. Removed from Service endpoints (no new traffic once proxies catch up; this runs in parallel with steps 3 and 4)
  3. preStop hook runs (if defined)
  4. SIGTERM sent to containers
  5. Wait up to terminationGracePeriodSeconds, counted from step 1
  6. SIGKILL

Common mistake: Setting terminationGracePeriodSeconds too low for long-running requests. If your API has requests that take 30+ seconds, you need a grace period longer than that. Also: the preStop hook time counts against the grace period — not in addition to it.
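
Because the preStop time is part of the grace period, the budget can be worked out with simple addition. This is a sketch; `cleanup_s` is an assumed allowance for closing connections, and the function names are illustrative.

```python
def required_grace_period_s(longest_request_s: int,
                            pre_stop_s: int = 0,
                            cleanup_s: int = 5) -> int:
    """Minimum terminationGracePeriodSeconds for a clean shutdown."""
    return pre_stop_s + longest_request_s + cleanup_s


def grace_period_is_sufficient(grace_period_s: int,
                               longest_request_s: int,
                               pre_stop_s: int = 0,
                               cleanup_s: int = 5) -> bool:
    """True if the configured grace period covers preStop, requests, cleanup."""
    needed = required_grace_period_s(longest_request_s, pre_stop_s, cleanup_s)
    return grace_period_s >= needed
```

With 30-second requests and a 5-second preStop sleep, the result is 40 seconds, so the default of 30 would cut requests short.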

preStop Hook for Zero-Downtime Deploys

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]

Why the sleep? When a pod is terminated, endpoint removal and SIGTERM happen concurrently. Some kube-proxies/ingress controllers may still send traffic for a few seconds after SIGTERM. The sleep gives them time to update.


Resource Management

Docker Resource Limits

# --memory: hard memory limit
# --memory-reservation: soft limit (enforced when the host runs low on memory)
# --cpus: 0.5 = 50% of one CPU
# --pids-limit: max 100 processes (fork bomb protection)
docker run \
  --memory=512m \
  --memory-reservation=256m \
  --cpus=0.5 \
  --pids-limit=100 \
  myapp

Kubernetes Resource Requests and Limits

resources:
  requests:          # Scheduler uses this for placement
    memory: "256Mi"
    cpu: "250m"      # 250 millicores = 0.25 CPU
  limits:            # Hard ceiling
    memory: "512Mi"
    cpu: "500m"

| Concept | What It Does | What Happens If Exceeded |
| --- | --- | --- |
| Request | Guarantees minimum resources; scheduler uses for placement | N/A (it’s a minimum) |
| Memory limit | Hard ceiling | OOM killed immediately |
| CPU limit | Throttling ceiling | Throttled (slowed down, not killed) |

ELI5: Requests = your reserved seat on a bus. You’re guaranteed that seat. Limits = the maximum space you’re allowed to take up. Memory limit exceeded = you’re kicked off the bus (OOM kill). CPU limit exceeded = you have to walk slower (throttled) but you can stay.
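
The units in the manifest above ("250m", "512Mi") trip people up. Here is a minimal sketch of how they translate to plain numbers. It handles only the common suffixes, and `parse_cpu` and `parse_memory` are illustrative helpers, not a Kubernetes API.

```python
_FACTORS = {
    # Binary (power-of-two) suffixes
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4,
    # Decimal (power-of-ten) suffixes
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3, "T": 1000 ** 4,
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as "250m" or "0.5" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1G" to bytes."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already bytes
```

Note that "512Mi" (536,870,912 bytes) and "512M" (512,000,000 bytes) are different amounts.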

Decision framework for requests vs limits:

| Resource | Set request? | Set limit? | Reasoning |
| --- | --- | --- | --- |
| Memory | Yes (always) | Yes (always) | Without a limit, one leaking pod can exhaust the node's memory |
| CPU | Yes (always) | Controversial | CPU limits cause throttling latency spikes. Many teams set requests but not limits. |

Common mistake: Setting CPU limits too tight causes latency spikes that look like application bugs. The container gets throttled mid-request once it has used up its quota for the current scheduling period, even when CPU is available on the node. Many production teams, and some Kubernetes maintainers, recommend NOT setting CPU limits, only CPU requests.
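
The throttling comes from the kernel's CFS bandwidth control: a CPU limit becomes a quota of CPU time per scheduling period (100 ms by default). The following is a simplified model of that arithmetic, assuming all threads are busy for the whole period. The function names are illustrative.

```python
CFS_PERIOD_MS = 100.0  # default CFS period


def quota_ms(cpu_limit_cores: float) -> float:
    """CPU time the container may consume in each period."""
    return cpu_limit_cores * CFS_PERIOD_MS


def throttled_ms_per_period(cpu_limit_cores: float, busy_threads: int) -> float:
    """Wall-clock time per period spent throttled (simplified model)."""
    wall_until_exhausted = quota_ms(cpu_limit_cores) / busy_threads
    if wall_until_exhausted >= CFS_PERIOD_MS:
        return 0.0  # quota is never used up within the period
    return CFS_PERIOD_MS - wall_until_exhausted
```

Under this model, a 500m limit with four busy threads burns its 50 ms quota in 12.5 ms of wall time and then sits idle for the remaining 87.5 ms of every period, which is where the latency spikes come from.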

Quality of Service (QoS) Classes

K8s assigns QoS based on requests/limits:

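| QoS Class | Condition | Eviction Priority |
| --- | --- | --- |
| Guaranteed | Every container sets CPU and memory requests and limits, with requests == limits | Last to evict |
| Burstable | At least one container sets a request or limit, but the pod does not qualify as Guaranteed | Middle |
| BestEffort | No requests or limits set on any container | First to evict |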

Under memory pressure, K8s evicts BestEffort first, then Burstable, then Guaranteed. Always set at least requests.
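
The classification rules can be sketched as a small function. It is a simplification: quantities are compared as strings, so "500m" and "0.5" would count as different, and `qos_class` is an illustrative helper rather than anything Kubernetes exposes.

```python
from typing import Dict, List

# One entry per container: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Approximate the QoS class Kubernetes would assign to a pod."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        # A missing request defaults to the limit for that resource.
        requests = {**limits, **c.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"
```

The manifest shown earlier (256Mi/250m requests, 512Mi/500m limits) comes out as Burstable under these rules.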


Logging

Docker Logging Drivers

| Driver | Destination | Use Case |
| --- | --- | --- |
| json-file (default) | /var/lib/docker/containers/<id>/*.log | Development, simple setups |
| journald | systemd journal | Systemd-based hosts |
| fluentd | Fluentd collector | Centralized logging (EFK stack) |
| awslogs | CloudWatch Logs | AWS deployments |
| gcplogs | Google Cloud Logging | GCP deployments |
| splunk | Splunk HEC | Enterprise logging |

Best practice: Write logs to stdout/stderr (not files). Docker captures stdout/stderr automatically. The logging driver handles shipping. If you write to files, you need separate log collection (sidecar or host agent).

Think of it this way: Your app should just shout its logs into the void (stdout). Docker catches the shout and writes it down. Where it writes it down (file, CloudWatch, Fluentd) is configured at the infrastructure level, not in your app.
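
On the application side, "shouting into stdout" can look like the sketch below: one JSON object per line, which log agents can parse without custom rules. The field names (`ts`, `level`, `logger`, `message`) are an assumed schema, not a standard.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Logger that writes JSON lines to stdout and nowhere else."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would be `build_logger().info("order %s shipped", order_id)`. Where the line ends up is then decided by the logging driver or node agent, not by the app.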

Kubernetes Logging Architecture

App → stdout → Container runtime captures → Node log file
                                              ↓
                                     DaemonSet log agent
                                     (Fluent Bit/Fluentd)
                                              ↓
                                     Centralized logging
                                     (Elasticsearch/Loki/CloudWatch)

Pattern 1: Node-level agent (DaemonSet) — Fluent Bit runs on every node, reads container log files, ships to backend. Most common, lowest overhead.

Pattern 2: Sidecar container — Fluent Bit runs as a sidecar in each pod. More flexible (per-pod config) but higher resource usage.

Pattern 3: Application-level — App ships logs directly. Most control, but couples app to logging infra.


Monitoring and Observability

The Three Pillars

| Pillar | What It Tells You | Key Tools |
| --- | --- | --- |
| Metrics | What’s happening (numbers over time) | Prometheus, Grafana, Datadog |
| Logs | Why it happened (event details) | EFK, Loki, CloudWatch |
| Traces | How it happened (request flow across services) | Jaeger, Zipkin, Tempo |

Container Metrics to Monitor

| Metric | Why | Alert Threshold |
| --- | --- | --- |
| CPU usage vs request | Overprovisioned? Underprovisioned? | >80% sustained = scale up |
| Memory usage vs limit | Approaching OOM? | >85% = investigate |
| Restart count | Crash loops, OOM kills | >0 in production = investigate |
| Network errors | Connectivity, DNS issues | Any sustained errors |
| Disk I/O | Storage bottleneck | High latency = check storage driver |
| Container ready time | Slow starts, failed health checks | >expected startup time |

Prometheus + Grafana Stack

# Prometheus scrape config for K8s pods
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true

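Add the prometheus.io/scrape: "true" annotation to your pods and expose a /metrics endpoint. The annotation is a convention rather than a built-in: the relabel rule above is what makes Prometheus keep only annotated pods when it discovers and scrapes them.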
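
A /metrics endpoint returns plain text in the Prometheus exposition format. Real services would normally use an official client library; the hand-rolled function below is only a sketch of what that text looks like for simple counters. `render_metrics` is a hypothetical helper.

```python
from typing import Dict


def render_metrics(counters: Dict[str, float]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    # The format is line-oriented and ends with a trailing newline.
    return "\n".join(lines) + "\n"
```

The application would return this string as the body of GET /metrics with a text/plain content type.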


Debugging Running Containers

Essential Debug Commands

# Execute shell in running container
docker exec -it <container> sh

# View real-time logs
docker logs -f --tail 100 <container>

# Inspect container details (network, mounts, env)
docker inspect <container>

# View resource usage
docker stats <container>

# Copy files out for analysis
docker cp <container>:/app/core.dump ./

# K8s equivalents
kubectl exec -it <pod> -- sh
kubectl logs -f <pod> --tail=100
kubectl describe pod <pod>
kubectl top pod <pod>
kubectl cp <pod>:/path/to/file ./local-file

Debug Containers (K8s 1.23+)

When a production container has no shell (distroless, scratch):

# Attach a debug container to a running pod
kubectl debug -it <pod> --image=busybox:1.36 --target=<container>

# Debug with network tools
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>

# Create a copy of the pod with debug image
kubectl debug <pod> --copy-to=debug-pod --container=debug --image=busybox

Why this matters: You followed security best practices and used distroless images. Now something is broken in production and there’s no shell to debug with. Ephemeral debug containers solve this without compromising your security posture.

Common Debug Scenarios

| Problem | Debug Approach |
| --- | --- |
| Container crash loops | kubectl logs <pod> --previous (see logs from crashed instance) |
| OOM killed | kubectl describe pod <pod> → look for OOMKilled in lastState |
| Can’t reach service | kubectl exec <pod> -- nslookup <svc>, check endpoints: kubectl get ep |
| Slow responses | kubectl top pod, check CPU throttling: cat /sys/fs/cgroup/cpu.stat (cgroup v2; on cgroup v1 the file is /sys/fs/cgroup/cpu/cpu.stat) |
| Mount permission denied | Check runAsUser vs file ownership, use fsGroup in securityContext |
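
For the slow-response case, the interesting fields in cpu.stat are nr_periods and nr_throttled. Below is a small sketch that turns the file's "key value" lines into a throttling ratio. The helper names are illustrative, and what counts as too much throttling is left to the reader.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the "key value" lines of a cgroup cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: Dict[str, int]) -> float:
    """Fraction of scheduling periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit set, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods
```

Usage inside the container would be `throttled_ratio(parse_cpu_stat(open("/sys/fs/cgroup/cpu.stat").read()))`. A ratio that keeps climbing under load points at the CPU limit rather than the application.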

Key Takeaways for Interviews

  1. “How do you handle zero-downtime deploys?” → Readiness probes + preStop sleep hook + rolling update strategy + terminationGracePeriodSeconds longer than your longest request plus the preStop sleep.
  2. “How do you set resource limits?” → Always set memory request + limit. Set CPU request, consider skipping CPU limit (throttling causes latency). Use VPA recommendations for right-sizing.
  3. “How do you debug a distroless container?” → kubectl debug ephemeral containers. Attach a debug image (busybox, netshoot) to the running pod’s namespaces.
  4. “Logging strategy?” → App writes to stdout. Node-level DaemonSet (Fluent Bit) ships to centralized backend (Loki/ES). Structured JSON logs. Don’t write to files inside containers.
  5. “What metrics do you monitor?” → CPU/memory vs requests, restart count, network errors, request latency (app-level). Use Prometheus + Grafana. Alert on symptoms (latency, errors) not causes (CPU%).