Production Operations

The difference between “it works locally” and “it works at 3AM when you’re on-call.” This covers logging, monitoring, resource management, health checks, graceful shutdown, and debugging running containers.

Health Checks

Without health checks, the orchestrator only knows if your process is alive — not if it’s healthy. A process can be running but deadlocked, OOM-thrashing, or unable to serve requests.

Docker HEALTHCHECK

HEALTHCHECK --interval=30s --timeout=5s --retries=3 --start-period=60s \
  CMD curl -f http://localhost:8080/health || exit 1

Parameter	Default	What It Means
`--interval`	30s	Time between checks
`--timeout`	30s	Max time for check to complete
`--retries`	3	Failures before marking unhealthy
`--start-period`	0s	Grace period on startup (failures don’t count)
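The curl-based check above only works if curl is installed in the image, which slim and distroless images often lack. As a rough sketch, assuming the image already ships a Python interpreter, the same check can be done with the standard library. The function name check_health and the script path in the usage note are illustrative, not part of any Docker API.

```python
import urllib.request


def check_health(url: str, timeout: float = 5.0) -> int:
    """Return 0 if the endpoint answers with a 2xx status, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except Exception:
        # Connection refused, timeout, HTTP 4xx/5xx, malformed URL:
        # for a health check, every failure means "unhealthy".
        return 1
```

A script that ends with sys.exit(check_health("http://localhost:8080/health")) could then be wired in as HEALTHCHECK CMD ["python", "/app/healthcheck.py"], where /app/healthcheck.py is a hypothetical location.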

Kubernetes Probes

Probe	Purpose	Failure Action
Startup	Is the app finished initializing?	Hold off liveness/readiness until it passes; restart the container if failureThreshold runs out
Liveness	Is the app alive?	Restart the container
Readiness	Can the app serve traffic?	Remove from Service endpoints (stop sending traffic)

containers:
- name: api
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 15
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080
    periodSeconds: 5
    failureThreshold: 2
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    failureThreshold: 30    # 30 * 10s = 5 minutes to start
    periodSeconds: 10

ELI5: Liveness = “Are you conscious?” (if no → call an ambulance / restart). Readiness = “Can you take customers?” (if no → close the shop window / stop sending traffic). Startup = “Are you done getting dressed?” (don’t bug me until I’m ready).

Common mistake: Using liveness probes that depend on external services (database, cache). If the database goes down, the liveness probe fails, K8s restarts ALL your pods, which creates a thundering herd that makes the database even worse. Liveness should check only the process itself. Use readiness for dependency checks.
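One way to keep the two probes separate is to give them separate endpoints with separate logic. The sketch below assumes the /healthz and /ready paths from the manifest above; database_reachable is a hypothetical placeholder for whatever dependency check your app needs.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_reachable() -> bool:
    """Hypothetical dependency check; replace with a real ping to your database."""
    return True


def probe_status(path: str, dependencies_ok: bool) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: answering at all shows the process is not deadlocked.
        return 200
    if path == "/ready":
        # Readiness: only take traffic while dependencies are reachable.
        return 200 if dependencies_ok else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The dependency is only consulted for readiness, never for liveness.
        deps = database_reachable() if self.path == "/ready" else True
        self.send_response(probe_status(self.path, deps))
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the access log


def serve(port: int = 8080) -> None:
    """Blocking call: serve probe endpoints until the process exits."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

With this split, a database outage turns /ready into 503 (pods leave the Service endpoints) while /healthz keeps returning 200, so nothing gets restarted.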

Common mistake #2: No startup probe for slow-starting apps (JVM, large ML models). Without it, the liveness probe starts as soon as initialDelaySeconds elapses and can kill the container before it’s done loading. Use startupProbe with a high failureThreshold.
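To see why the startup probe matters, it helps to work out how long each setup tolerates a slow start. The helpers below are a back-of-the-envelope sketch using the values from the manifest above; the liveness figure is a rough upper bound that ignores probe timeouts.

```python
def startup_probe_budget(failure_threshold: int, period_seconds: int) -> int:
    """Longest startup (seconds) the startupProbe tolerates before a restart."""
    return failure_threshold * period_seconds


def liveness_only_budget(initial_delay_seconds: int,
                         period_seconds: int,
                         failure_threshold: int) -> int:
    """Rough upper bound on startup time when only a livenessProbe exists."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Values taken from the manifest above.
print(startup_probe_budget(30, 10))     # 300
print(liveness_only_budget(15, 10, 3))  # 45
```

An app that needs 90 seconds to load would be restarted in a loop with the liveness probe alone, but fits comfortably inside the 300-second startup budget.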

Graceful Shutdown

When a container stops (deploy, scale-down, node drain), the orchestrator sends SIGTERM. Your app must:

Stop accepting new requests
Finish in-flight requests
Close connections (DB, message queue)
Exit cleanly

SIGTERM → app starts graceful shutdown → finishes work → exits 0
                                                          ↓
                                              (if too slow)
                                         SIGKILL after grace period
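The four steps above can be sketched for a simple queue-based worker. This is a minimal illustration, not a drop-in implementation: GracefulWorker and its attributes are hypothetical names, and the "close connections" step is a placeholder flag.

```python
import queue
import signal
import threading


class GracefulWorker:
    """Processes jobs from a queue until a shutdown is requested."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.stopping = threading.Event()
        self.completed = []
        self.closed = False

    def handle_sigterm(self, signum, frame):
        # Signal handlers should do very little: just flip the flag.
        self.stopping.set()

    def submit(self, job) -> bool:
        # Step 1: stop accepting new work once shutdown has begun.
        if self.stopping.is_set():
            return False
        self.jobs.put(job)
        return True

    def run(self) -> int:
        while not self.stopping.is_set():
            try:
                job = self.jobs.get(timeout=0.1)
            except queue.Empty:
                continue
            self.completed.append(job())
        # Step 2: finish work that was accepted before the signal.
        while not self.jobs.empty():
            self.completed.append(self.jobs.get_nowait()())
        # Step 3: close connections (placeholder for DB / queue clients).
        self.closed = True
        # Step 4: return 0 so the caller can exit cleanly.
        return 0


def install(worker: GracefulWorker) -> None:
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, worker.handle_sigterm)
```

In a real entrypoint you would call install(worker) and then sys.exit(worker.run()), making sure the drain step finishes well inside the grace period.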

The PID 1 Problem

If your Dockerfile uses shell form CMD node server.js, the process tree is:

PID 1: /bin/sh -c "node server.js"
  PID 2: node server.js

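SIGTERM goes to PID 1 (sh), which does not forward it to node and, as PID 1 with no handler installed, ignores it itself. After the stop timeout (10 seconds by default with docker stop), the container gets SIGKILL. Your app never gets a chance to shut down gracefully.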

Fix 1: Use exec form: CMD ["node", "server.js"] — Node runs as PID 1 directly.

Fix 2: Run with docker run --init, or add tini as the entrypoint: ENTRYPOINT ["/tini", "--", "node", "server.js"]

Why tini/dumb-init matter: PID 1 has two special responsibilities: (a) handle signals explicitly (the kernel does not apply default signal actions to PID 1, so an unhandled SIGTERM is simply ignored), (b) reap zombie processes. Most applications don’t implement either. tini handles both and forwards signals to your app.
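To make the signal-forwarding half of that job concrete, here is a small sketch of what an init wrapper does. It is only an illustration: run_as_init and make_forwarder are made-up names, and unlike tini this sketch does not reap orphaned grandchildren.

```python
import signal
import subprocess


def make_forwarder(child: subprocess.Popen):
    """Build a signal handler that relays the received signal to the child."""
    def forward(signum, frame):
        if child.poll() is None:  # only signal a child that is still running
            child.send_signal(signum)
    return forward


def run_as_init(argv: list) -> int:
    """Start the real app as a child and relay SIGTERM/SIGINT to it."""
    child = subprocess.Popen(argv)
    handler = make_forwarder(child)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
    # wait() reaps this one child; a real init also reaps every orphan.
    return child.wait()
```

In practice, prefer tini or --init over hand-rolled wrappers; the point here is only that someone has to pass the signal along.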

Kubernetes Grace Period

spec:
  terminationGracePeriodSeconds: 60  # default: 30

Timeline on pod deletion:

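Pod marked Terminating (the grace period clock starts here)
Removed from Service endpoints (happens in parallel with the steps below, so traffic can still arrive briefly)
preStop hook runs (if defined)
SIGTERM sent to containers
Wait for whatever remains of terminationGracePeriodSeconds
SIGKILL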

Common mistake: Setting terminationGracePeriodSeconds too low for long-running requests. If your API has requests that take 30+ seconds, you need a grace period longer than that. Also: the preStop hook time counts against the grace period — not in addition to it.

preStop Hook for Zero-Downtime Deploys

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 5"]

Why the sleep? When a pod is terminated, endpoint removal and SIGTERM happen concurrently. Some kube-proxies/ingress controllers may still send traffic for a few seconds after SIGTERM. The sleep gives them time to update.
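Because the preStop time comes out of the same budget, the grace period has to cover the sleep plus your slowest request. A quick sanity check, where the 5-second safety margin is an assumption of this sketch and not a Kubernetes default:

```python
def required_grace_period(pre_stop_seconds: int,
                          longest_request_seconds: int,
                          margin_seconds: int = 5) -> int:
    """Smallest terminationGracePeriodSeconds that avoids a mid-request SIGKILL."""
    return pre_stop_seconds + longest_request_seconds + margin_seconds


# A 5s preStop sleep plus 30s requests needs more than the default of 30.
print(required_grace_period(5, 30))  # 40
```

With these numbers the default of 30 seconds is too short, while the 60 seconds used in the example above leaves room to spare.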

Resource Management

Docker Resource Limits

# --memory:             hard memory limit
# --memory-reservation: soft limit, enforced when the host is short on memory
# --cpus:               0.5 = 50% of one CPU
# --pids-limit:         max 100 processes (fork bomb protection)
docker run \
  --memory=512m \
  --memory-reservation=256m \
  --cpus=0.5 \
  --pids-limit=100 \
  myapp
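Docker enforces these flags through cgroups, so you can verify them from inside the container. The sketch below assumes a cgroup v2 host, where the limits show up in /sys/fs/cgroup/memory.max and /sys/fs/cgroup/cpu.max; the function names are illustrative.

```python
from typing import Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Parse cgroup v2 memory.max: a byte count, or 'max' for unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max ('<quota> <period>' in microseconds) into CPUs."""
    quota, period = text.split()
    return None if quota == "max" else int(quota) / int(period)


# What the two files contain for --memory=512m and --cpus=0.5.
print(parse_memory_max("536870912\n"))  # 536870912
print(parse_cpu_max("50000 100000\n"))  # 0.5
```

Inside a container you would feed these functions the file contents, for example open("/sys/fs/cgroup/cpu.max").read().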

Kubernetes Resource Requests and Limits

resources:
  requests:          # Scheduler uses this for placement
    memory: "256Mi"
    cpu: "250m"      # 250 millicores = 0.25 CPU
  limits:            # Hard ceiling
    memory: "512Mi"
    cpu: "500m"

Concept	What It Does	What Happens If Exceeded
Request	Guarantees minimum resources; scheduler uses for placement	N/A (it’s a minimum)
Memory limit	Hard ceiling	OOM killed immediately
CPU limit	Throttling ceiling	Throttled (slowed down, not killed)

ELI5: Requests = your reserved seat on a bus. You’re guaranteed that seat. Limits = the maximum space you’re allowed to take up. Memory limit exceeded = you’re kicked off the bus (OOM kill). CPU limit exceeded = you have to walk slower (throttled) but you can stay.
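The quantity strings in the manifest ("250m", "512Mi") trip people up, so here is a small sketch that converts them to plain numbers. It handles only the common suffixes shown and is not a full implementation of the Kubernetes quantity format.

```python
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,  # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,     # decimal suffixes
}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '0.5' into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores
    return float(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' into bytes."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already bytes


print(parse_cpu("250m"))      # 0.25
print(parse_memory("512Mi"))  # 536870912
```

Note the difference between "512Mi" (mebibytes) and "512M" (megabytes): the first is about 5% larger.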

Decision framework for requests vs limits:

Resource	Set request?	Set limit?	Reasoning
Memory	Yes (always)	Yes (always)	Without a limit, one leaking pod can exhaust the node’s memory and take its neighbors down with it
CPU	Yes (always)	Controversial	CPU limits cause throttling latency spikes. Many teams set requests but not limits.

Common mistake: Setting CPU limits too tight causes latency spikes that look like application bugs. The container gets throttled mid-request even when idle CPU is available on the node. Many production teams recommend NOT setting CPU limits, only CPU requests.

Quality of Service (QoS) Classes

K8s assigns QoS based on requests/limits:

QoS Class	Condition	Eviction Priority
Guaranteed	Every container sets CPU and memory limits, with requests == limits	Last to evict
Burstable	At least one container sets a request or limit, but the pod doesn’t qualify as Guaranteed	Middle
BestEffort	No requests or limits set	First to evict

Under memory pressure, K8s evicts BestEffort first, then Burstable, then Guaranteed. Always set at least requests.
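The classification rules can be written down as a short function. This is a simplified sketch: it takes each container’s resources block as a plain dict and compares quantity strings literally, so "500m" and "0.5" would count as different even though Kubernetes treats them as equal.

```python
def qos_class(containers: list) -> str:
    """Classify a pod from a list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# The resources block shown earlier (requests lower than limits).
print(qos_class([{
    "requests": {"memory": "256Mi", "cpu": "250m"},
    "limits": {"memory": "512Mi", "cpu": "500m"},
}]))  # Burstable
```

Skipping the CPU limit, as discussed above, also means the pod can never be Guaranteed; that is a trade-off worth making consciously.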

Logging

Docker Logging Drivers

Driver	Destination	Use Case
json-file (default)	`/var/lib/docker/containers/<id>/*.log`	Development, simple setups
journald	systemd journal	Systemd-based hosts
fluentd	Fluentd collector	Centralized logging (EFK stack)
awslogs	CloudWatch Logs	AWS deployments
gcplogs	Google Cloud Logging	GCP deployments
splunk	Splunk HEC	Enterprise logging

Best practice: Write logs to stdout/stderr (not files). Docker captures stdout/stderr automatically. The logging driver handles shipping. If you write to files, you need separate log collection (sidecar or host agent).

Think of it this way: Your app should just shout its logs into the void (stdout). Docker catches the shout and writes it down. Where it writes it down (file, CloudWatch, Fluentd) is configured at the infrastructure level, not in your app.
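On the application side, "shouting into stdout" can look like the following sketch: one JSON object per line, written with the standard logging module. The field names (time, level, logger, message) are a choice made for this example, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout (or a given stream)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any existing handlers
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate lines via the root logger
    return logger
```

Usage is the familiar build_logger("api").info("order %s shipped", 42); the logging driver or node agent takes it from there.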

Kubernetes Logging Architecture

App → stdout → Container runtime captures → Node log file
                                              ↓
                                     DaemonSet log agent
                                     (Fluent Bit/Fluentd)
                                              ↓
                                     Centralized logging
                                     (Elasticsearch/Loki/CloudWatch)

Pattern 1: Node-level agent (DaemonSet) — Fluent Bit runs on every node, reads container log files, ships to backend. Most common, lowest overhead.

Pattern 2: Sidecar container — Fluent Bit runs as a sidecar in each pod. More flexible (per-pod config) but higher resource usage.

Pattern 3: Application-level — App ships logs directly. Most control, but couples app to logging infra.

Monitoring and Observability

The Three Pillars

Pillar	What It Tells You	Key Tools
Metrics	What’s happening (numbers over time)	Prometheus, Grafana, Datadog
Logs	Why it happened (event details)	EFK, Loki, CloudWatch
Traces	How it happened (request flow across services)	Jaeger, Zipkin, Tempo

Container Metrics to Monitor

Metric	Why	Alert Threshold
CPU usage vs request	Overprovisioned? Underprovisioned?	>80% sustained = scale up
Memory usage vs limit	Approaching OOM?	>85% = investigate
Restart count	Crash loops, OOM kills	>0 in production = investigate
Network errors	Connectivity, DNS issues	Any sustained errors
Disk I/O	Storage bottleneck	High latency = check storage driver
Container ready time	Slow starts, failed health checks	>expected startup time
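The first three thresholds in the table translate directly into a check. The sketch below evaluates a single sample, so it ignores the "sustained" part; in practice that logic belongs in your alerting rules. The function name and message strings are illustrative.

```python
def check_container(cpu_usage: float, cpu_request: float,
                    memory_usage: int, memory_limit: int,
                    restarts: int) -> list:
    """Return human-readable findings for one sample of container metrics."""
    findings = []
    if cpu_usage / cpu_request > 0.80:
        findings.append("cpu above 80% of request: consider scaling up")
    if memory_usage / memory_limit > 0.85:
        findings.append("memory above 85% of limit: investigate before OOM")
    if restarts > 0:
        findings.append("container restarted: check for crash loops or OOM kills")
    return findings


# 0.24 of 0.25 cores, 500 of 512 MiB, 2 restarts: all three checks fire.
print(len(check_container(0.24, 0.25, 500, 512, 2)))  # 3
```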

Prometheus + Grafana Stack

# Prometheus scrape config for K8s pods
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true

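Add the prometheus.io/scrape: "true" annotation to your pods and expose a /metrics endpoint, and Prometheus auto-discovers and scrapes them. The annotation is a convention, not something built into Prometheus: it works because the relabel rule above keeps only pods that carry it.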
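For a feel of what Prometheus expects at /metrics, here is a stdlib-only sketch that renders counters in the text exposition format. A real service would normally use the official prometheus_client library; REQUEST_COUNTS and the handler class are hypothetical names for this example.

```python
from http.server import BaseHTTPRequestHandler

REQUEST_COUNTS = {"http_requests_total": 0}  # hypothetical in-process counter


def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```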

Debugging Running Containers

Essential Debug Commands

# Execute shell in running container
docker exec -it <container> sh

# View real-time logs
docker logs -f --tail 100 <container>

# Inspect container details (network, mounts, env)
docker inspect <container>

# View resource usage
docker stats <container>

# Copy files out for analysis
docker cp <container>:/app/core.dump ./

# K8s equivalents
kubectl exec -it <pod> -- sh
kubectl logs -f <pod> --tail=100
kubectl describe pod <pod>
kubectl top pod <pod>
kubectl cp <pod>:/path/to/file ./local-file

Debug Containers (K8s 1.23+)

When a production container has no shell (distroless, scratch):

# Attach a debug container to a running pod
kubectl debug -it <pod> --image=busybox:1.36 --target=<container>

# Debug with network tools
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>

# Create a copy of the pod with debug image
kubectl debug <pod> --copy-to=debug-pod --container=debug --image=busybox

Why this matters: You followed security best practices and used distroless images. Now something is broken in production and there’s no shell to debug with. Ephemeral debug containers solve this without compromising your security posture.

Common Debug Scenarios

Problem	Debug Approach
Container crash loops	`kubectl logs <pod> --previous` (see logs from crashed instance)
OOM killed	`kubectl describe pod` → look for `OOMKilled` in `lastState`
Can’t reach service	`kubectl exec <pod> -- nslookup <svc>`, check endpoints: `kubectl get ep`
Slow responses	`kubectl top pod`, check CPU throttling: `cat /sys/fs/cgroup/cpu.stat` (cgroup v2; `/sys/fs/cgroup/cpu/cpu.stat` on v1)
Mount permission denied	Check `runAsUser` vs file ownership, use `fsGroup` in securityContext
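For the slow-responses case, the raw cpu.stat numbers are easier to judge as a ratio. The sketch below assumes the cgroup v2 format, where nr_periods and nr_throttled count enforcement periods; the helper names are illustrative.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines from cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def throttled_fraction(stats: dict) -> float:
    """Share of CPU periods in which the container was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


sample = "usage_usec 1000\nnr_periods 200\nnr_throttled 50\nthrottled_usec 12345\n"
print(throttled_fraction(parse_cpu_stat(sample)))  # 0.25
```

A fraction that stays well above zero while the node has idle CPU is the signature of a CPU limit that is too tight.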

Key Takeaways for Interviews

“How do you handle zero-downtime deploys?” → Readiness probes + preStop sleep hook + rolling update strategy + terminationGracePeriodSeconds matching your longest request.
“How do you set resource limits?” → Always set memory request + limit. Set CPU request, consider skipping CPU limit (throttling causes latency). Use VPA recommendations for right-sizing.
“How do you debug a distroless container?” → kubectl debug ephemeral containers. Attach a debug image (busybox, netshoot) to the running pod’s namespaces.
“Logging strategy?” → App writes to stdout. Node-level DaemonSet (Fluent Bit) ships to centralized backend (Loki/ES). Structured JSON logs. Don’t write to files inside containers.
“What metrics do you monitor?” → CPU/memory vs requests, restart count, network errors, request latency (app-level). Use Prometheus + Grafana. Alert on symptoms (latency, errors) not causes (CPU%).
