Production Operations
The difference between “it works locally” and “it works at 3AM when you’re on-call.” This covers logging, monitoring, resource management, health checks, graceful shutdown, and debugging running containers.
Health Checks
Without health checks, the orchestrator only knows if your process is alive — not if it’s healthy. A process can be running but deadlocked, OOM-thrashing, or unable to serve requests.
Docker HEALTHCHECK
HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
CMD curl -f http://localhost:8080/health || exit 1
| Parameter | Default | What It Means |
|---|---|---|
| --interval | 30s | Time between checks |
| --timeout | 30s | Max time for check to complete |
| --retries | 3 | Consecutive failures before marking unhealthy |
| --start-period | 0s | Grace period on startup (failures don’t count) |
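To see how these parameters interact, here is a simplified Python model that replays a sequence of check results and reports the resulting status. The function name `health_status` and the `checks_in_start_period` argument are illustrative assumptions, not Docker's implementation; the model ignores timeouts and wall-clock timing and treats each list entry as one completed check.

```python
def health_status(results, retries=3, checks_in_start_period=0):
    """Return the status after replaying `results` (True = check passed).

    Simplified model: failures inside the start period are ignored until
    the first success; after that, `retries` consecutive failures mark
    the container unhealthy. A later success flips it back to healthy.
    """
    status = "starting"
    consecutive_failures = 0
    started = False
    for index, passed in enumerate(results):
        if passed:
            started = True
            consecutive_failures = 0
            status = "healthy"
            continue
        if index < checks_in_start_period and not started:
            continue  # grace period: this failure does not count
        consecutive_failures += 1
        if consecutive_failures >= retries:
            status = "unhealthy"
    return status
```

With the example above (--interval=30s, --start-period=60s), roughly the first two checks fall inside the grace period, so `health_status([False, False, True], checks_in_start_period=2)` ends up healthy instead of burning through retries during boot.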
Kubernetes Probes
| Probe | Purpose | Failure Action |
|---|---|---|
| Startup | Is the app finished initializing? | Keep waiting (don’t run liveness/readiness) |
| Liveness | Is the app alive? | Restart the container |
| Readiness | Can the app serve traffic? | Remove from Service endpoints (stop sending traffic) |
containers:
  - name: api
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30  # 30 * 10s = 5 minutes to start
      periodSeconds: 10
ELI5: Liveness = “Are you conscious?” (if no → call an ambulance / restart). Readiness = “Can you take customers?” (if no → close the shop window / stop sending traffic). Startup = “Are you done getting dressed?” (don’t bug me until I’m ready).
Common mistake: Using liveness probes that depend on external services (database, cache). If the database goes down, the liveness probe fails, K8s restarts ALL your pods, which creates a thundering herd that makes the database even worse. Liveness should check only the process itself. Use readiness for dependency checks.
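A minimal Python sketch of that split, using only the standard library: the liveness path never touches a dependency, while the readiness path does. The `/healthz` and `/ready` paths match the manifest above; `database_reachable` is a hypothetical placeholder for whatever dependency ping your app actually needs.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_reachable():
    # Hypothetical dependency check; replace with a real ping.
    return True


def probe_status(path, dependency_ok):
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        return 200  # liveness: the process answered, nothing else is checked
    if path == "/ready":
        return 200 if dependency_ok() else 503  # readiness: dependencies count
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, database_reachable))
        self.end_headers()


def serve(port=8080):
    # Blocks forever; call from your entrypoint.
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

When the database is down, `/ready` returns 503 and the pod is pulled from Service endpoints, but `/healthz` keeps returning 200, so nothing gets restarted.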
Common mistake #2: No startup probe for slow-starting apps (JVM, large ML models). Without it, the liveness probe starts immediately and kills the container before it’s done loading. Use startupProbe with high failureThreshold.
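The startup budget is just failureThreshold × periodSeconds, as in the manifest comment above. A small sketch for picking a threshold from a measured startup time; the helper names and the 1.5× `safety_factor` are assumptions for illustration, not a Kubernetes recommendation.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds):
    """Total time the startup probe allows before the container is killed."""
    return failure_threshold * period_seconds


def min_failure_threshold(expected_startup_seconds, period_seconds, safety_factor=1.5):
    """Smallest failureThreshold that covers the expected startup with headroom."""
    return math.ceil(expected_startup_seconds * safety_factor / period_seconds)
```

For example, `startup_budget_seconds(30, 10)` gives 300 seconds, and an app that usually needs 200 seconds to load would want `min_failure_threshold(200, 10)`, which is 30.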
Graceful Shutdown
When a container stops (deploy, scale-down, node drain), the orchestrator sends SIGTERM. Your app must:
- Stop accepting new requests
- Finish in-flight requests
- Close connections (DB, message queue)
- Exit cleanly
SIGTERM → app starts graceful shutdown → finishes work → exits 0
↓
(if too slow)
SIGKILL after grace period
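The four steps above can be sketched in Python roughly like this. It is a skeleton under stated assumptions: `in_flight` and `close_connections` are hypothetical stand-ins for your server's real request tracking and connection cleanup, and the signal handler only sets a flag so the actual work happens outside the handler.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Keep the handler tiny: just record that shutdown was requested.
    shutdown_requested.set()


def install_handlers():
    # Must be called from the main thread.
    signal.signal(signal.SIGTERM, handle_sigterm)


def drain_and_exit(in_flight, close_connections):
    """Block until SIGTERM, then run the shutdown steps in order.

    in_flight: callables that complete already-accepted requests (hypothetical).
    close_connections: callable that closes DB / queue connections (hypothetical).
    """
    steps = []
    shutdown_requested.wait()
    steps.append("stop accepting")      # e.g. close the listening socket
    for finish in in_flight:
        finish()
    steps.append("finish in-flight")
    close_connections()
    steps.append("close connections")
    return steps                        # caller then exits with code 0
```

Everything in `drain_and_exit` has to finish inside the grace period, otherwise SIGKILL cuts it short.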
The PID 1 Problem
If your Dockerfile uses shell form CMD node server.js, the process tree is:
PID 1: /bin/sh -c "node server.js"
PID 2: node server.js
SIGTERM goes to PID 1 (sh), which does not pass it on to Node. After the stop timeout (10 seconds by default for docker stop), the container gets SIGKILL. Your app never gets a chance to shut down gracefully.
Fix 1: Use exec form: CMD ["node", "server.js"] — Node runs as PID 1 directly.
Fix 2: Run an init process as PID 1: either docker run --init, or tini as the entrypoint: ENTRYPOINT ["/tini", "--", "node", "server.js"]
Why tini/dumb-init matter: PID 1 has two special responsibilities: (a) handle signals properly (default signal dispositions don’t apply to PID 1), (b) reap zombie processes. Most applications don’t implement either. tini handles both and forwards signals to your app.
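To make the signal-forwarding half concrete, here is a toy Python wrapper that starts a child and passes SIGTERM/SIGINT through to it. This is an illustration only, not how tini is implemented: the name `run_and_forward` is made up, and the sketch does not reap orphaned zombie processes, which a real init also does.

```python
import signal
import subprocess
import sys


def run_and_forward(argv):
    """Start argv as a child, forward SIGTERM/SIGINT to it, return its exit code."""
    child = subprocess.Popen(argv)

    def forward(signum, frame):
        # Pass the signal on instead of swallowing it like `sh -c` does.
        child.send_signal(signum)

    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {s: signal.signal(s, forward) for s in forwarded}
    try:
        return child.wait()
    finally:
        # Restore whatever handlers were installed before.
        for s, handler in previous.items():
            if handler is not None:
                signal.signal(s, handler)
```

The wrapper returns the child's exit code, so the orchestrator still sees whether the app exited cleanly.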
Kubernetes Grace Period
spec:
  terminationGracePeriodSeconds: 60  # default: 30
Timeline on pod deletion:
- Pod marked Terminating
- Removed from Service endpoints (no new traffic)
- preStop hook runs (if defined)
- SIGTERM sent to containers
- Wait up to terminationGracePeriodSeconds
- SIGKILL
Common mistake: Setting terminationGracePeriodSeconds too low for long-running requests. If your API has requests that take 30+ seconds, you need a grace period longer than that. Also: the preStop hook time counts against the grace period — not in addition to it.
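Since the preStop time comes out of the same budget, the arithmetic is worth writing down. A tiny sketch, with a made-up helper name and illustrative numbers:

```python
def shutdown_budget(grace_period_s, pre_stop_s, longest_request_s):
    """Seconds left after preStop and draining; negative means SIGKILL risk."""
    return grace_period_s - pre_stop_s - longest_request_s
```

With the default 30-second grace period, a 5-second preStop sleep, and 30-second requests, `shutdown_budget(30, 5, 30)` is -5: the slowest requests get killed. Raising the grace period to 60 leaves 25 seconds of headroom.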
preStop Hook for Zero-Downtime Deploys
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]
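Why the sleep? When a pod is terminated, endpoint removal and the container shutdown sequence start concurrently, and endpoint updates take time to propagate. Some kube-proxies/ingress controllers may still send traffic for a few seconds after the pod starts terminating. Because preStop runs before SIGTERM, the sleep keeps the app serving while they update.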
Resource Management
Docker Resource Limits
# --memory: hard memory limit
# --memory-reservation: soft limit (applies under host memory contention)
# --cpus: 50% of one CPU
# --pids-limit: max 100 processes (fork bomb protection)
docker run \
  --memory=512m \
  --memory-reservation=256m \
  --cpus=0.5 \
  --pids-limit=100 \
  myapp
Kubernetes Resource Requests and Limits
resources:
  requests:          # Scheduler uses this for placement
    memory: "256Mi"
    cpu: "250m"      # 250 millicores = 0.25 CPU
  limits:            # Hard ceiling
    memory: "512Mi"
    cpu: "500m"
| Concept | What It Does | What Happens If Exceeded |
|---|---|---|
| Request | Guarantees minimum resources; scheduler uses for placement | N/A (it’s a minimum) |
| Memory limit | Hard ceiling | OOM killed immediately |
| CPU limit | Throttling ceiling | Throttled (slowed down, not killed) |
ELI5: Requests = your reserved seat on a bus. You’re guaranteed that seat. Limits = the maximum space you’re allowed to take up. Memory limit exceeded = you’re kicked off the bus (OOM kill). CPU limit exceeded = you have to walk slower (throttled) but you can stay.
Decision framework for requests vs limits:
| Resource | Set request? | Set limit? | Reasoning |
|---|---|---|---|
| Memory | Yes (always) | Yes (always) | Without limit, one pod OOMs the whole node |
| CPU | Yes (always) | Controversial | CPU limits cause throttling latency spikes. Many teams set requests but not limits. |
Common mistake: Setting CPU limits too tight, which causes latency spikes that look like application bugs. The container gets throttled mid-request once it uses up its quota for the current scheduling period, even when CPU is available on the node. Many production teams (including Google) recommend NOT setting CPU limits, only CPU requests.
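A back-of-the-envelope model shows where the latency comes from. The sketch below assumes a single-threaded request, the common 100 ms CFS scheduling period, and that the container burns its quota at the start of each period and then waits for the next one. Real scheduling is messier, so treat the numbers as illustrative.

```python
import math


def wall_clock_ms(cpu_ms_needed, limit_millicores, period_ms=100):
    """Estimated wall-clock time for a request needing `cpu_ms_needed` of CPU."""
    if cpu_ms_needed <= 0:
        return 0.0
    # One thread can use at most a full period of CPU time per period.
    quota_ms = min(limit_millicores / 1000 * period_ms, period_ms)
    full_periods = math.ceil(cpu_ms_needed / quota_ms) - 1
    remainder = cpu_ms_needed - full_periods * quota_ms
    return full_periods * period_ms + remainder
```

Under this model a request that needs 200 ms of CPU takes 200 ms with a 1000m limit but 350 ms with a 500m limit, because it sits throttled for half of each period.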
Quality of Service (QoS) Classes
K8s assigns QoS based on requests/limits:
| QoS Class | Condition | Eviction Priority |
|---|---|---|
| Guaranteed | CPU and memory requests == limits for every container | Last to evict |
| Burstable | At least one request or limit set, but not meeting Guaranteed | Middle |
| BestEffort | No requests or limits set | First to evict |
Under memory pressure, K8s evicts BestEffort first, then Burstable, then Guaranteed. Always set at least requests.
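The classification rules can be sketched as a small Python function. This is a simplified model, not the kubelet's code: the dict layout is an assumption made for the example, and quantities are compared as literal strings, so "500m" and "0.5" would not count as equal here.

```python
def qos_class(containers):
    """Classify a pod from a list of container resource dicts.

    Each container looks like (assumed layout):
    {"requests": {"cpu": "250m", "memory": "256Mi"}, "limits": {...}}
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

The requests/limits example earlier in this section (256Mi/250m requested, 512Mi/500m limit) comes out as Burstable.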
Logging
Docker Logging Drivers
| Driver | Destination | Use Case |
|---|---|---|
| json-file (default) | /var/lib/docker/containers/<id>/*.log | Development, simple setups |
| journald | systemd journal | Systemd-based hosts |
| fluentd | Fluentd collector | Centralized logging (EFK stack) |
| awslogs | CloudWatch Logs | AWS deployments |
| gcplogs | Google Cloud Logging | GCP deployments |
| splunk | Splunk HEC | Enterprise logging |
Best practice: Write logs to stdout/stderr (not files). Docker captures stdout/stderr automatically. The logging driver handles shipping. If you write to files, you need separate log collection (sidecar or host agent).
Think of it this way: Your app should just shout its logs into the void (stdout). Docker catches the shout and writes it down. Where it writes it down (file, CloudWatch, Fluentd) is configured at the infrastructure level, not in your app.
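In Python, "shouting into stdout" in a structured way can look like the sketch below, using only the standard `logging` and `json` modules. The field names (`level`, `logger`, `message`) are an illustrative choice, not a standard schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name="app"):
    handler = logging.StreamHandler(sys.stdout)  # stdout, not a file
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `build_logger().info("listening on %s", 8080)` writes a single JSON line to stdout; where that line ends up is the logging driver's problem.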
Kubernetes Logging Architecture
App → stdout → Container runtime captures → Node log file
↓
DaemonSet log agent
(Fluent Bit/Fluentd)
↓
Centralized logging
(Elasticsearch/Loki/CloudWatch)
Pattern 1: Node-level agent (DaemonSet) — Fluent Bit runs on every node, reads container log files, ships to backend. Most common, lowest overhead.
Pattern 2: Sidecar container — Fluent Bit runs as a sidecar in each pod. More flexible (per-pod config) but higher resource usage.
Pattern 3: Application-level — App ships logs directly. Most control, but couples app to logging infra.
Monitoring and Observability
The Three Pillars
| Pillar | What It Tells You | Key Tools |
|---|---|---|
| Metrics | What’s happening (numbers over time) | Prometheus, Grafana, Datadog |
| Logs | Why it happened (event details) | EFK, Loki, CloudWatch |
| Traces | How it happened (request flow across services) | Jaeger, Zipkin, Tempo |
Container Metrics to Monitor
| Metric | Why | Alert Threshold |
|---|---|---|
| CPU usage vs request | Overprovisioned? Underprovisioned? | >80% sustained = scale up |
| Memory usage vs limit | Approaching OOM? | >85% = investigate |
| Restart count | Crash loops, OOM kills | >0 in production = investigate |
| Network errors | Connectivity, DNS issues | Any sustained errors |
| Disk I/O | Storage bottleneck | High latency = check storage driver |
| Container ready time | Slow starts, failed health checks | >expected startup time |
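The first three thresholds in the table translate directly into a check like the one below. It is a sketch only: the function name and message strings are made up, and it looks at a single sample, so the "sustained" part would have to come from your alerting system.

```python
def container_alerts(cpu_usage, cpu_request, mem_usage, mem_limit, restarts):
    """Return alert messages for one sample of container metrics.

    CPU values are in cores, memory values in any consistent unit.
    """
    alerts = []
    if cpu_request > 0 and cpu_usage / cpu_request > 0.80:
        alerts.append("cpu above 80% of request: consider scaling up")
    if mem_limit > 0 and mem_usage / mem_limit > 0.85:
        alerts.append("memory above 85% of limit: investigate")
    if restarts > 0:
        alerts.append("container restarted: investigate")
    return alerts
```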
Prometheus + Grafana Stack
# Prometheus scrape config for K8s pods
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
With this relabel rule in place, add the prometheus.io/scrape: "true" annotation to your pods and expose a /metrics endpoint; Prometheus then discovers and scrapes them automatically.
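What the /metrics endpoint returns is plain text. In practice you would normally use an official Prometheus client library; the hand-rolled sketch below only shows the shape of the output for a counter. The metric name and labels are examples, and label values are not escaped here.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples: dict mapping a tuple of (label, value) pairs to a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"
```

Serving the returned string from GET /metrics is enough for Prometheus to pick the series up.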
Debugging Running Containers
Essential Debug Commands
# Execute shell in running container
docker exec -it <container> sh
# View real-time logs
docker logs -f --tail 100 <container>
# Inspect container details (network, mounts, env)
docker inspect <container>
# View resource usage
docker stats <container>
# Copy files out for analysis
docker cp <container>:/app/core.dump ./
# K8s equivalents
kubectl exec -it <pod> -- sh
kubectl logs -f <pod> --tail=100
kubectl describe pod <pod>
kubectl top pod <pod>
kubectl cp <pod>:/path/to/file ./local-file
Debug Containers (K8s 1.23+)
When a production container has no shell (distroless, scratch):
# Attach a debug container to a running pod
kubectl debug -it <pod> --image=busybox:1.36 --target=<container>
# Debug with network tools
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>
# Create a copy of the pod with debug image
kubectl debug <pod> --copy-to=debug-pod --container=debug --image=busybox
Why this matters: You followed security best practices and used distroless images. Now something is broken in production and there’s no shell to debug with. Ephemeral debug containers solve this without compromising your security posture.
Common Debug Scenarios
| Problem | Debug Approach |
|---|---|
| Container crash loops | kubectl logs <pod> --previous (see logs from crashed instance) |
| OOM killed | kubectl describe pod → look for OOMKilled in lastState |
| Can’t reach service | kubectl exec <pod> -- nslookup <svc>, check endpoints: kubectl get ep <svc> |
| Slow responses | kubectl top pod, check CPU throttling: cat /sys/fs/cgroup/cpu.stat (cgroup v2; on cgroup v1 use /sys/fs/cgroup/cpu/cpu.stat) |
| Mount permission denied | Check runAsUser vs file ownership, use fsGroup in securityContext |
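For the slow-responses case, the useful numbers in cpu.stat are nr_periods and nr_throttled. A small Python sketch that turns the file's contents into a throttling ratio; the function name is made up, and the sample text in the usage note is invented for illustration, not real output.

```python
def throttle_ratio(cpu_stat_text):
    """Fraction of scheduling periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if not periods:
        return 0.0
    return stats.get("nr_throttled", 0) / periods
```

Feeding it text such as "nr_periods 200" and "nr_throttled 50" on separate lines gives 0.25: the container was throttled in a quarter of its periods, which is a strong hint that the CPU limit is the bottleneck.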
Key Takeaways for Interviews
- “How do you handle zero-downtime deploys?” → Readiness probes + preStop sleep hook + rolling update strategy + terminationGracePeriodSeconds longer than your longest request plus the preStop sleep.
- “How do you set resource limits?” → Always set memory request + limit. Set CPU request, consider skipping CPU limit (throttling causes latency). Use VPA recommendations for right-sizing.
- “How do you debug a distroless container?” → kubectl debug ephemeral containers. Attach a debug image (busybox, netshoot) to the running pod’s namespaces.
- “Logging strategy?” → App writes to stdout. Node-level DaemonSet (Fluent Bit) ships to centralized backend (Loki/ES). Structured JSON logs. Don’t write to files inside containers.
- “What metrics do you monitor?” → CPU/memory vs requests, restart count, network errors, request latency (app-level). Use Prometheus + Grafana. Alert on symptoms (latency, errors) not causes (CPU%).