Load Balancing & Proxying
Load balancing is one of those topics engineers learn a surface-level version of (“just put nginx in front of it”) and then never revisit until something catastrophic happens in prod. This note goes from the fundamentals all the way to the patterns that separate well-engineered distributed systems from ones that fall over on deploy day.
Why Load Balancing
A single server is both a capacity ceiling and a single point of failure. You can vertical-scale it (bigger machine) to a point, but at some threshold you hit hardware limits, and the server still goes down for maintenance or crashes under a spike.
Load balancing solves this by spreading traffic across a fleet of backends:
- Availability: if one backend dies, others absorb the traffic
- Throughput: aggregate capacity = sum of individual servers
- Graceful deploys: drain connections off one server, update it, bring it back — users see nothing
That last point is underappreciated. Without a load balancer, deploying to a single server usually means a restart and a window of downtime. With one, rolling deploys become a standard operation.
ELI5: Imagine one cashier at a supermarket. On a slow Tuesday, fine. On Christmas Eve, the queue wraps around the store. Load balancing is opening more checkout lanes and having a person at the door directing customers to the shortest one. If one cashier goes on break, the door person just stops sending customers there.
Three Levels of Load Balancing
| Type | Operates on | Sees | Example |
|---|---|---|---|
| DNS-based | DNS | Hostnames | Route 53 latency routing |
| L4 (transport) | TCP/UDP | IP + port | AWS NLB, HAProxy TCP mode |
| L7 (application) | HTTP/gRPC | Headers, URL, body | nginx, Envoy, AWS ALB |
DNS load balancing is the bluntest instrument — you return multiple A records and let the client pick. TTL is the killer: a failing server stays in rotation until TTL expires. You can’t react to real-time health.
L4 vs L7 Load Balancing
This distinction matters more than most engineers realize when choosing infrastructure.
An L4 load balancer sees only IP addresses and port numbers. It forwards TCP segments without caring what’s inside them. It cannot see HTTP headers, URL paths, or cookies. It is fast because it does minimal parsing.
An L7 load balancer parses the application protocol (HTTP, gRPC, etc.), so it can route based on the content of the request. This enables a whole class of smart routing decisions.
L4 LB view:
[IP: 10.0.0.1:54321] → [IP: 10.0.1.5:8080]
(that's all it knows)
L7 LB view:
GET /api/v2/users HTTP/1.1
Host: api.example.com
Cookie: session=abc123
(it reads all of this before deciding where to send)
ELI5: An L4 load balancer is like a postal sorting machine that reads only the zip code and drops letters into bins. An L7 load balancer is a human mail sorter who opens the letter, reads what department it’s for, stamps it “URGENT” if needed, and walks it to the right desk.
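To make the difference concrete, here is a minimal Python sketch of what each kind of balancer has to work with. The pool names (`api-pool`, `web-pool`, `canary-pool`) and the `X-Canary` header are hypothetical, chosen only for illustration; real balancers implement this in their own config languages.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Connection:
    """What an L4 balancer can see: addresses and ports only."""
    src_ip: str
    src_port: int
    dst_port: int


@dataclass
class HttpRequest:
    """What an L7 balancer can see once it has parsed the request."""
    method: str
    path: str
    headers: dict = field(default_factory=dict)


def route_l4(conn: Connection, backends: list[str]) -> str:
    # Only the connection tuple is available, so hash it to pick a backend.
    key = f"{conn.src_ip}:{conn.src_port}:{conn.dst_port}".encode()
    return backends[zlib.crc32(key) % len(backends)]


def route_l7(req: HttpRequest) -> str:
    # Content-aware decisions: headers first, then URL path.
    if req.headers.get("X-Canary") == "1":   # hypothetical header
        return "canary-pool"
    if req.path.startswith("/api/"):
        return "api-pool"
    return "web-pool"
```

`route_l4` cannot tell `/api/v2/users` from `/index.html`, because it never sees the request. `route_l7` can, at the cost of parsing every request first.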
Feature Comparison
| Feature | L4 | L7 |
|---|---|---|
| Routing by URL path | No | Yes |
| Routing by HTTP header | No | Yes |
| TLS termination | Limited | Yes |
| Session stickiness (cookies) | No (IP hash only) | Yes |
| Header injection (X-Request-ID) | No | Yes |
| A/B testing / canary | No | Yes |
| Authentication at edge | No | Yes |
| Throughput per dollar | Higher | Lower |
| Connection overhead | Low | Higher |
| Protocol agnostic | Yes | No |
When to choose L4: high-throughput non-HTTP workloads (databases, raw TCP, game servers), or when you need maximum performance and don’t need content routing.
When to choose L7: HTTP APIs, microservices, anything needing canary deploys, auth at the edge, or path-based routing to different backend clusters.
Common mistake: Running a database (PostgreSQL, MySQL) behind an HTTP load balancer because “it’s what the team knows.” Use L4 (or a dedicated proxy like PgBouncer/ProxySQL) — your database does not speak HTTP.
Load Balancing Algorithms
The algorithm determines which backend receives the next request. The wrong choice leads to hot spots, uneven memory usage, or broken sessions.
Algorithm Overview
| Algorithm | How it works | Best for |
|---|---|---|
| Round Robin | Rotate through backends in order | Uniform, stateless requests |
| Weighted Round Robin | Same but servers have weights | Mixed-capacity backends |
| Least Connections | Send to server with fewest open conns | Variable request duration |
| Weighted Least Connections | Least connections factored by weight | Mixed capacity + variable duration |
| IP Hash | Hash client IP to pick backend | When you need client affinity |
| Random | Pick a random backend | High-scale stateless services |
| Consistent Hashing | Hash key maps to a ring of nodes | Caching, minimize redistribution |
The One You Should Default To: Least Connections
Round Robin ignores the fact that requests have different costs. A 10ms request and a 5-second database query count the same under round robin. Least Connections naturally routes away from overloaded backends.
ELI5: Round Robin is “next in line.” Least Connections is “join the shortest queue.” At a bank with tellers handling both quick questions and complex loans, shortest-queue wins every time.
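The two strategies can be sketched side by side in a few lines of Python. This is a single-process toy with hypothetical class names; a real balancer also has to handle concurrency and backends joining or leaving.

```python
import itertools


class RoundRobin:
    """'Next in line': ignores how busy each backend is."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """'Join the shortest queue': tracks open connections per backend."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        # min() returns the first minimum, so ties go to the earliest backend.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request or connection finishes.
        self.active[backend] -= 1
```

The important detail is `release`: a backend stuck on a 5-second query keeps its count high, so new requests flow to the others until it finishes.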
Power of Two Choices (P2C)
A variant of random that gets close to least-connections performance without the coordination overhead: pick two random backends, send to the one with fewer connections. At scale this eliminates hot spots almost as well as pure least-connections, with much lower synchronization cost.
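A minimal sketch of the selection step, assuming the caller maintains a dict of open-connection counts per backend (the function name and the `rng` parameter are illustrative, not taken from any particular proxy):

```python
import random


def pick_p2c(active, rng=random):
    """Pick two distinct backends at random and return the less loaded one.

    active: dict mapping backend name -> open connection count
            (needs at least two entries).
    """
    first, second = rng.sample(list(active), 2)
    return first if active[first] <= active[second] else second
```

Because the two candidates are always distinct, the most loaded backend in the pool can never be chosen: whatever it is paired with has fewer connections.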
Consistent Hashing
Used when you want the same key (user ID, cache key, session) to always land on the same backend. Normal modulo hashing (server = hash(key) % N) means adding or removing one server remaps roughly (N-1)/N of all keys, which is almost all of them. Consistent hashing places servers on a ring; adding or removing one server moves only about 1/N of the keys.
0
/|\
270-+ | +-90
\|/
180
Servers placed at positions on ring.
Key hashes to a position → walks clockwise to next server.
Add a server → only keys between old predecessor and new server move.
ELI5: Normal hashing is like assigning classroom seats by dividing student number by class size. Move one student, everyone’s seat changes. Consistent hashing is like seats on a circular train: add one car and only the passengers in the cars around it shift.
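The ring can be sketched in Python with a sorted list and binary search. This version assumes virtual nodes (each server placed at many ring positions, a common refinement that evens out the split) and SHA-256 as the hash; both are illustrative choices rather than anything the ring itself requires.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # First 8 bytes of SHA-256 as a 64-bit position on the ring.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # Walk clockwise: first position after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

If you map a set of keys, add a fourth server to a three-server ring, and map them again, every key that changed owner now belongs to the new server. Nothing is reshuffled between the existing three.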
Common mistake: Using IP hash for session stickiness in a mobile app. Mobile users’ IPs change constantly (cell tower handoffs, IPv6 rotation). Use cookie-based stickiness at L7 instead.
Health Checks
A load balancer sending traffic to a dead backend is worse than no load balancer. Health checks are the mechanism that prevents this.
┌─────────────────────────────┐
│ Load Balancer │
│ │
│ [active check loop] │
│ every 10s: GET /health │
└──────┬───────────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
Backend A Backend B Backend C
✓ 200ms ✗ timeout ✓ 150ms
[healthy] [unhealthy] [healthy]
Active vs Passive Health Checks
Active: the LB proactively pings backends on a schedule — HTTP GET, TCP connect, or gRPC health check. Simple, predictable, but adds overhead and has a lag (you might send traffic to a dead server for up to one check interval).
Passive: the LB watches real traffic. If 5 consecutive requests to backend X return 5xx or time out, mark it unhealthy. Faster reaction, but the initial failures are real user errors. Good LBs use both together.
/health vs /ready
Two distinct concepts that get conflated:
- /health (liveness): is the process alive? If this fails, the process should be killed and restarted. A true binary: the app is alive or it isn’t.
- /ready (readiness): is this instance ready to receive traffic? An app can be alive but warming up a cache, running DB migrations, or recovering from a partial failure. Readiness failing means “don’t route here yet, but don’t kill me.”
Use /ready for LB health checks. Use /health for your orchestrator’s liveness probe (Kubernetes, ECS).
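The split can be sketched as a single function that maps a probe path to a status code. The `AppState` fields are hypothetical stand-ins for whatever "ready" means in your service.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    cache_warm: bool = False
    migrations_done: bool = False


def probe(path: str, state: AppState) -> int:
    """Return the HTTP status code a probe to `path` should receive."""
    if path == "/health":
        # Liveness: being able to answer at all means the process is alive.
        return 200
    if path == "/ready":
        # Readiness: alive AND able to serve real traffic right now.
        ready = state.cache_warm and state.migrations_done
        return 200 if ready else 503
    return 404
```

During startup this returns 200 on `/health` and 503 on `/ready`, which is exactly the "don't route here yet, but don't kill me" signal.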
Flapping and Thresholds
A backend that oscillates between healthy/unhealthy causes chaos — traffic bounces in and out of rotation, users see intermittent failures, and logs become noise.
Fix with thresholds:
- Healthy threshold: must pass 3 consecutive checks to come back into rotation
- Unhealthy threshold: must fail 3 consecutive checks before being removed
This trades a small amount of reaction speed for stability. Almost always worth it.
ELI5: Don’t fire an employee after one bad day. Put them on a performance plan (unhealthy threshold). Similarly, don’t rehire someone after one good interview — they need to consistently perform (healthy threshold).
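A small state tracker shows how the two thresholds damp flapping. This is a sketch with a hypothetical `HealthTracker` class; the default of 3 mirrors the numbers above.

```python
class HealthTracker:
    def __init__(self, healthy_threshold=3, unhealthy_threshold=3, healthy=True):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = healthy
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed in one check result; return the (possibly updated) state."""
        if check_passed == self.healthy:
            # Result agrees with the current state: any opposing streak is broken.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

A single failed check followed by a pass leaves the backend in rotation; only an unbroken run of failures removes it, and only an unbroken run of passes brings it back.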
Connection Draining
When you mark a backend for removal (deploy, scale-down), you don’t want to hard-kill active connections. Draining means:
- Stop sending new connections to that backend
- Let existing connections finish (up to a configurable timeout, e.g., 30s)
- Then remove the backend
AWS calls this “deregistration delay.” Without it, users in the middle of a file upload or long API call get a hard disconnect on every deploy.
Common mistake: Setting drain timeout to 0 seconds to make deploys faster. You just turned every deploy into a user-facing error.
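The three draining steps can be sketched as a small per-backend state object. The class and method names are hypothetical; time is passed in explicitly so the logic stays easy to test.

```python
class DrainingBackend:
    def __init__(self, name):
        self.name = name
        self.draining = False
        self.in_flight = 0

    def try_acquire(self) -> bool:
        """Step 1: a draining backend accepts no new connections."""
        if self.draining:
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Called when an existing connection finishes."""
        self.in_flight -= 1

    def start_drain(self):
        self.draining = True

    def safe_to_remove(self, seconds_since_drain_started, drain_timeout=30.0) -> bool:
        """Steps 2 and 3: remove once idle, or once the timeout has passed."""
        if not self.draining:
            return False
        return self.in_flight == 0 or seconds_since_drain_started >= drain_timeout
```

With `drain_timeout=0`, `safe_to_remove` is true the moment draining starts, regardless of `in_flight`, which is the failure mode described above.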
Reverse Proxy
These terms get confused constantly:
Forward proxy: sits between clients and the internet. The client configures it. The server sees the proxy’s IP, not the client’s. Used for: corporate filtering, anonymous browsing, egress control.
Reverse proxy: sits in front of your servers. The server admin deploys it. The client talks to the proxy, unaware of the actual backends. Used for: TLS termination, caching, routing, rate limiting.
Forward Proxy:
Client → [Forward Proxy] → Internet
(client knows about proxy, server doesn't)
Reverse Proxy:
Internet → [Reverse Proxy] → Backend Server
(server knows about proxy, client doesn't)
ELI5: A forward proxy is a middleman you hire to shop on your behalf — stores know the middleman, not you. A reverse proxy is a receptionist at a company — visitors talk to the receptionist, who routes them to the right employee. Visitors don’t know the org chart.
Reverse Proxy Comparison
| Proxy | Type | Strengths | Weaknesses |
|---|---|---|---|
| nginx | Web server + proxy | Static files, high performance, widely known | Config is declarative, not programmable |
| HAProxy | Pure LB / proxy | Extremely mature, detailed stats, L4+L7 | No built-in service discovery |
| Envoy | Service proxy (sidecar) | Dynamic config via xDS API, observability | Complex to operate standalone |
| Traefik | Cloud-native proxy | Auto-discovers containers, ACME TLS | Performance ceiling lower than nginx/HAProxy |
| Caddy | Web server + proxy | Automatic HTTPS out of the box | Smaller ecosystem |
nginx vs HAProxy: if you need a web server that can also proxy, nginx. If you need a pure high-performance load balancer with rich health check and ACL controls, HAProxy. Envoy is the right choice when you’re building a service mesh or need dynamic config via an API.
TLS Termination
TLS between client and server is standard, but where you decrypt matters.
Three Models
1. TLS Termination at LB:
Client ──(TLS)──► LB ──(plaintext)──► Backend
2. TLS Passthrough:
Client ──(TLS)──► LB ──(TLS, unchanged)──► Backend
3. TLS Re-encryption (mTLS to backend):
Client ──(TLS)──► LB ──(new TLS)──► Backend
Termination is simplest. Certificate management is centralized. Backends communicate over plain HTTP inside a trusted VPC. Easy to inspect, log, and modify requests. Downside: if someone gets inside your network, traffic is unencrypted.
Passthrough gives end-to-end encryption. The LB can’t inspect or modify the payload (so no header injection, no routing by content). Useful for non-HTTP protocols or strict compliance requirements.
Re-encryption is the best of both worlds and the most operationally complex. The LB decrypts, inspects, routes, then re-encrypts to the backend over a new TLS session, often with mTLS so each side authenticates the other. Zero-trust networks generally require it.
ELI5: Termination is like opening a sealed letter at the mailroom, reading it, then handing it unsealed to the recipient inside the building. Passthrough is the mailroom just handing it along still sealed — they can’t read it but also can’t add a sticky note. Re-encryption is opening it, stamping it, then re-sealing it in a new envelope.
Preserving Client Information
When TLS terminates at the LB, the backend loses sight of the real client. Preserve it with headers:
- X-Forwarded-For: 1.2.3.4, 10.0.0.1 (the original client IP, followed by any intermediate proxies; can be spoofed, validate carefully)
- X-Forwarded-Proto: https (the original protocol, so your app knows the request came in as HTTPS)
- X-Real-IP: 1.2.3.4 (nginx’s simpler alternative to X-Forwarded-For)
For L4 (TCP), headers don’t exist. Use PROXY protocol instead — a small plaintext preamble prepended to the TCP stream that contains source/destination IP and port. HAProxy and nginx both support it. Backends must be configured to read and strip the preamble.
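For a sense of what "read and strip the preamble" involves, here is a sketch of a parser for the human-readable version 1 of the PROXY protocol, where the preamble is a single line such as `PROXY TCP4 192.0.2.1 10.0.1.5 54321 8080` terminated by CRLF. The function name is hypothetical, and the binary version 2 format is not handled.

```python
def parse_proxy_v1(data: bytes):
    """Split a PROXY protocol v1 preamble from the start of a TCP stream.

    Returns ((src_ip, src_port, dst_ip, dst_port) or None, remaining_bytes).
    """
    line, sep, rest = data.partition(b"\r\n")
    if not sep or not line.startswith(b"PROXY "):
        raise ValueError("missing PROXY protocol v1 header")
    parts = line.decode("ascii").split(" ")
    if parts[1] == "UNKNOWN":
        # The proxy could not or would not describe the connection.
        return None, rest
    if parts[1] not in ("TCP4", "TCP6") or len(parts) != 6:
        raise ValueError("malformed PROXY protocol v1 header")
    _, _, src_ip, dst_ip, src_port, dst_port = parts
    return (src_ip, int(src_port), dst_ip, int(dst_port)), rest
```

Only accept this preamble on listeners that are reachable exclusively from your load balancer; otherwise any client can claim an arbitrary source address.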
Common mistake: Trusting X-Forwarded-For for security decisions (rate limiting, IP allowlisting) without validating that the request actually came through your LB. Clients can set this header directly. Only trust headers that your LB overwrites (not appends).
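One common way to validate the header is to walk it from right to left and stop at the first address that is not one of your own proxies. The sketch below assumes you know your proxies' CIDR ranges; the function name and parameters are illustrative.

```python
import ipaddress


def client_ip(peer_ip, xff_header, trusted_proxies):
    """Best-effort real client IP.

    peer_ip:         address of the TCP peer that connected to us
    xff_header:      raw X-Forwarded-For value, or None
    trusted_proxies: CIDR strings for our own LBs/proxies (an assumption)
    """
    trusted = [ipaddress.ip_network(n) for n in trusted_proxies]

    def is_trusted(ip):
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            return False
        return any(addr in net for net in trusted)

    if not is_trusted(peer_ip):
        # Not from our LB: the header is client-controlled, so ignore it.
        return peer_ip
    hops = [h.strip() for h in (xff_header or "").split(",") if h.strip()]
    # The rightmost entries were appended by our own proxies; skip those.
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return peer_ip
```

A client that sends its own `X-Forwarded-For: 6.6.6.6` through your LB ends up with `6.6.6.6, <real ip>`, and the right-to-left walk returns the real address, not the forged one.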
Service Discovery
A static list of backend IPs hardcoded in your LB config doesn’t survive autoscaling, container scheduling, or routine instance replacement. Service discovery is how the LB stays current.
From Static to Dynamic
Static:
upstream backend {
server 10.0.1.1:8080;
server 10.0.1.2:8080;
}
(reload nginx every time you scale)
DNS-based:
upstream backend {
server myapp.internal:8080 resolve;
}
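(LB re-resolves DNS periodically; in nginx, resolve also needs a resolver directive and a shared-memory zone in the upstream block, and was NGINX Plus-only before open-source 1.27.3)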
Registry-based (Consul/etcd):
Backends register themselves on startup.
LB polls or watches the registry.
Instant updates, no DNS TTL delays.
DNS-based is simple but TTL causes lag (new backends aren’t routable until TTL expires; removed backends stay in rotation). For anything dynamic, use a registry.
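The registry idea reduces to a table of heartbeats with an expiry. The sketch below is a generic in-memory toy and not the Consul or etcd API; the class name, the 15-second TTL, and the method names are all assumptions for illustration.

```python
class Registry:
    def __init__(self, ttl=15.0):
        self.ttl = ttl
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service, address, now):
        """Backends call this on startup and then periodically."""
        self._last_seen[(service, address)] = now

    def deregister(self, service, address):
        """Clean shutdown: remove immediately instead of waiting for expiry."""
        self._last_seen.pop((service, address), None)

    def lookup(self, service, now):
        """What the LB sees: only instances with a recent heartbeat."""
        return sorted(
            addr
            for (svc, addr), seen in self._last_seen.items()
            if svc == service and now - seen <= self.ttl
        )
```

A crashed backend simply stops sending heartbeats and drops out of `lookup` after one TTL, with no config reload on the LB side.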
Kubernetes Service Discovery
In Kubernetes, the control plane handles all of this:
- Service: a stable virtual IP (ClusterIP) + DNS name that selects pods by label
- Endpoints/EndpointSlice: the LB-like mapping from Service VIP to actual pod IPs (updated by the controller as pods come and go)
- kube-proxy: programs iptables/ipvs rules on each node to forward Service VIP traffic to real pods
- Ingress: L7 HTTP routing into the cluster — maps hostnames and paths to Services
- Gateway API: the next-gen replacement for Ingress, with richer routing models and proper role separation
ELI5: A Kubernetes Service is like a department’s phone extension. People call extension 200 for “support.” The PBX routes the call to whichever support agent is available. When agents start/stop work, the PBX list updates automatically. The callers never need to know the agents’ direct numbers.
Service Mesh
When you have many services talking to each other (not just external traffic in), a service mesh injects a sidecar proxy (commonly Envoy) into each pod. The sidecar handles all outbound and inbound traffic transparently:
- Load balancing between instances of a service
- mTLS between every pair of services (zero-trust)
- Retries, timeouts, circuit breaking
- Distributed tracing headers
- Traffic splitting for canary deployments
Control planes: Istio (full-featured, complex), Linkerd (simpler, lower overhead), Cilium (eBPF-based, no sidecar).
Advanced Patterns
Global Load Balancing
Traffic from a user in Tokyo shouldn’t have to travel to us-east-1. Global LB routes users to the nearest healthy region.
- Anycast: same IP announced from multiple datacenters via BGP. Traffic routes to the closest one at the routing layer. Cloudflare and AWS Global Accelerator use this. Failover can be fast: once a site withdraws its route, BGP typically reconverges within seconds to a few minutes.
- GeoDNS: return different A records based on client’s geographic location (resolver’s IP as proxy for client location). Simple, but DNS TTL means failover is slow.
- GSLB (Global Server Load Balancing): health-aware GeoDNS that removes unhealthy regions from DNS responses. F5, Azure Traffic Manager.
Rate Limiting Algorithms
| Algorithm | How it works | Burst behavior |
|---|---|---|
| Token bucket | Tokens fill at rate R, consume 1 per request | Allows bursts up to bucket size |
| Leaky bucket | Requests queue, drain at fixed rate | Smooths bursts, adds latency |
| Fixed window | Count requests per window (1s, 1m) | Burst at window boundaries |
| Sliding window | Rolling count over last N seconds | More accurate, higher memory cost |
Token bucket is the most common at the edge. It allows short bursts (good for humans, bad for scrapers) while bounding long-term rate.
ELI5: Token bucket: you get 10 tokens per second in a bucket that holds 20. Each request costs 1 token. You can fire 20 requests instantly if you’ve been idle, but not 21. Leaky bucket: all requests go into a queue that drains at exactly 10/second — no bursting, perfectly smooth.
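A token bucket fits in a dozen lines. This sketch takes the current time as an argument instead of reading a clock, which is an illustrative choice that keeps it deterministic; the class name is hypothetical.

```python
class TokenBucket:
    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity) # start full
        self.updated = now

    def allow(self, now, cost=1.0) -> bool:
        # Refill lazily, based on time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(rate=10, capacity=20)`, an idle client can fire 20 requests at once and the 21st is rejected; half a second later, 5 more tokens have accumulated.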
Zero-Downtime Deployment Patterns
Canary deploy: route a small percentage (1%, 5%, 10%) to the new version. Watch error rates, latency, business metrics. If healthy, increase the percentage. Roll back instantly by setting the weight to 0.
Blue-green deploy: two identical environments, “blue” serving live traffic and “green” running the new version. Switch the LB to send 100% to green. Rollback is switching back to blue. Fast, but requires double the infrastructure.
Rolling deploy: replace backends one at a time. Works with any load balancer. Slower than blue-green, but doesn’t require double capacity.
Canary routing in nginx (weights only apply among servers inside the same upstream block, so both versions go in one block):
upstream backend {
server v1:8080 weight=95;
server v2:8080 weight=5;
}
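Weighted routing picks a version per request, so one user can bounce between versions. A common refinement is to hash a stable identifier into a bucket so each user consistently sees the same version. The sketch below assumes a string user ID and uses CRC32 purely as a cheap, stable hash; the function name is hypothetical.

```python
import zlib


def choose_version(user_id: str, canary_percent: float) -> str:
    """Deterministically assign a user to 'v1' or 'v2'.

    canary_percent: share of users sent to v2, from 0 to 100.
    """
    bucket = zlib.crc32(user_id.encode()) % 10000   # 0..9999
    return "v2" if bucket < canary_percent * 100 else "v1"
```

Because the bucket for a user never changes, raising the percentage from 5 to 10 only adds users to the canary; nobody already on v2 is moved back. Setting it to 0 is the instant rollback.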
Circuit Breaker
If a backend is slow or returning errors, the LB can stop sending requests to it immediately rather than queuing them up (which makes things worse). A circuit breaker has three states:
- Closed: normal operation, requests go through
- Open: backend failed threshold, requests are rejected immediately (fail fast)
- Half-open: after a timeout, allow a probe request — if it succeeds, close the circuit
This prevents cascading failures. When service B is slow, service A backs up, which blocks service C, which exhausts connections everywhere. A circuit breaker at the LB or client stops the cascade at the source.
ELI5: Circuit breaker is exactly what it sounds like — like in your electrical panel. Too much current (errors), the breaker trips. Power stops flowing immediately rather than the wires burning. After a few minutes, you flip it back on to test if the problem is resolved.
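The three states translate directly into a small state machine. In this sketch the breaker trips on consecutive failures and time is passed in explicitly; the class name, thresholds, and method names are illustrative assumptions.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "closed"
        self.failures = 0        # consecutive failures
        self.opened_at = None

    def allow_request(self, now) -> bool:
        if self.state == "open" and now - self.opened_at >= self.reset_timeout:
            # Timeout elapsed: let exactly one probe request through.
            self.state = "half-open"
            return True
        # Closed passes everything; open and half-open reject (fail fast).
        return self.state == "closed"

    def record_success(self):
        self.state = "closed"
        self.failures = 0

    def record_failure(self, now):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            # A failed probe or too many failures: (re)open the circuit.
            self.state = "open"
            self.opened_at = now
```

Callers check `allow_request` before each call and report the outcome afterwards. While the circuit is open, requests fail immediately without touching the struggling backend.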
Request Coalescing (Collapse)
When the same uncached resource is requested by 100 concurrent users at once, a naive proxy fires 100 requests at the backend. Request coalescing means the proxy forwards only the first request to the backend, holds the other 99, and when the answer comes back, fans it out to all 100. This protects backends from cache stampedes. Varnish, nginx proxy_cache_lock, and Envoy all support variations.
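The same idea can be sketched with asyncio: concurrent callers asking for the same key share a single in-flight task. The `Coalescer` class and the `fetch` callback are hypothetical; a production version would also need to think about errors, timeouts, and cancellation of individual waiters.

```python
import asyncio


class Coalescer:
    def __init__(self, fetch):
        self._fetch = fetch      # async function: key -> value (the backend call)
        self._in_flight = {}     # key -> Task shared by all current waiters

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the one real backend request.
            task = asyncio.ensure_future(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once done so later requests fetch fresh data.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # Everyone, including the first caller, waits on the same task.
        return await task
```

A hundred concurrent `get("page")` calls produce one call to `fetch` and a hundred identical results. This only deduplicates overlapping requests, so it complements a cache and does not replace one.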
Summary: Decision Framework
| Situation | Recommendation |
|---|---|
| Need URL-path or header routing | L7 (nginx, Envoy, ALB) |
| High-throughput TCP (DB, game server) | L4 (HAProxy, NLB) |
| Simple stateless HTTP, even load | Round Robin, or Least Connections if request cost varies (L7) |
| Caching layer, minimize redistribution | Consistent Hashing |
| Mobile clients, need stickiness | Cookie-based (not IP hash) |
| Zero-downtime deploy | Canary or Blue-Green |
| Service-to-service in Kubernetes | Service (ClusterIP) or service mesh; Ingress for traffic entering the cluster |
| Global multi-region HA | Anycast or GeoDNS + health checks |
| Backend going slow, protect others | Circuit breaker |
| Same cert everywhere, simple ops | TLS termination at LB |
| Zero-trust internal network | mTLS re-encryption |
| Frequent autoscaling (containers) | Registry-based discovery (Consul, k8s Endpoints) |