
Load Balancing & Proxying


Load balancing is one of those topics engineers learn a surface-level version of (“just put nginx in front of it”) and then never revisit until something catastrophic happens in prod. This note goes from the fundamentals all the way to the patterns that separate well-engineered distributed systems from ones that fall over on deploy day.

Why Load Balancing

A single server is both a capacity ceiling and a single point of failure. You can vertical-scale it (bigger machine) to a point, but at some threshold you hit hardware limits, and the server still goes down for maintenance or crashes under a spike.

Load balancing solves this by spreading traffic across a fleet of backends:

Availability: if one backend dies, others absorb the traffic
Throughput: aggregate capacity = sum of individual servers
Graceful deploys: drain connections off one server, update it, bring it back — users see nothing

That last point is underappreciated. With a single server and no load balancer, most deploys involve at least a brief outage. With a load balancer, rolling deploys become a standard operation.

ELI5: Imagine one cashier at a supermarket. On a slow Tuesday, fine. On Christmas Eve, the queue wraps around the store. Load balancing is opening more checkout lanes and having a person at the door directing customers to the shortest one. If one cashier goes on break, the door person just stops sending customers there.

Three Levels of Load Balancing

Type	Layer	Sees	Example
DNS-based	Application	Hostnames	Route 53 latency routing
L4 (transport)	TCP/UDP	IP + port	AWS NLB, HAProxy TCP mode
L7 (application)	HTTP/gRPC	Headers, URL, body	nginx, Envoy, AWS ALB

DNS load balancing is the bluntest instrument: you return multiple A records and let the client pick. TTL is the killer. A failing server stays in rotation until cached records expire, and some resolvers and clients hold records longer than the TTL. You can’t react to real-time health.

L4 vs L7 Load Balancing

This distinction matters more than most engineers realize when choosing infrastructure.

L4 load balancer sees only IP addresses and port numbers. It forwards TCP segments without caring what’s inside them. It cannot see HTTP headers, URL paths, or cookies. It is fast because it does minimal parsing.

L7 load balancer parses the application protocol (HTTP, gRPC, etc.), so it can route based on the content of the request. This enables a whole class of smart routing decisions.

L4 LB view:
  [IP: 10.0.0.1:54321] → [IP: 10.0.1.5:8080]
  (that's all it knows)

L7 LB view:
  GET /api/v2/users HTTP/1.1
  Host: api.example.com
  Cookie: session=abc123
  (it reads all of this before deciding where to send)

ELI5: An L4 load balancer is like a postal sorting machine that reads only the zip code and drops letters into bins. An L7 load balancer is a human mail sorter who opens the letter, reads what department it’s for, stamps it “URGENT” if needed, and walks it to the right desk.

Feature Comparison

Feature	L4	L7
Routing by URL path	No	Yes
Routing by HTTP header	No	Yes
TLS termination	Limited	Yes
Session stickiness (cookies)	No (IP hash only)	Yes
Header injection (X-Request-ID)	No	Yes
A/B testing / canary	No	Yes
Authentication at edge	No	Yes
Throughput per dollar	Higher	Lower
Connection overhead	Low	Higher
Protocol agnostic	Yes	No

When to choose L4: high-throughput non-HTTP workloads (databases, raw TCP, game servers), or when you need maximum performance and don’t need content routing.

When to choose L7: HTTP APIs, microservices, anything needing canary deploys, auth at the edge, or path-based routing to different backend clusters.

Common mistake: Running a database (PostgreSQL, MySQL) behind an HTTP load balancer because “it’s what the team knows.” Use L4 (or a dedicated proxy like PgBouncer/ProxySQL) — your database does not speak HTTP.

Load Balancing Algorithms

The algorithm determines which backend receives the next request. The wrong choice leads to hot spots, uneven memory usage, or broken sessions.

Algorithm Overview

Algorithm	How it works	Best for
Round Robin	Rotate through backends in order	Uniform, stateless requests
Weighted Round Robin	Same but servers have weights	Mixed-capacity backends
Least Connections	Send to server with fewest open conns	Variable request duration
Weighted Least Connections	Least connections factored by weight	Mixed capacity + variable duration
IP Hash	Hash client IP to pick backend	When you need client affinity
Random	Pick a random backend	High-scale stateless services
Consistent Hashing	Hash key maps to a ring of nodes	Caching, minimize redistribution

The One You Should Default To: Least Connections

Round Robin ignores the fact that requests have different costs. A 10ms request and a 5-second database query count the same under round robin. Least Connections naturally routes away from overloaded backends.

ELI5: Round Robin is “next in line.” Least Connections is “join the shortest queue.” At a bank with tellers handling both quick questions and complex loans, shortest-queue wins every time.
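A minimal sketch of the selection step, assuming the balancer keeps an in-flight counter per backend. The `Backend` class and `pick_least_connections` are hypothetical names for illustration, not any particular proxy's API.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int = 0  # requests currently in flight on this backend


def pick_least_connections(backends):
    """Return the backend with the fewest in-flight requests (first wins ties)."""
    return min(backends, key=lambda b: b.active)


pool = [Backend("a"), Backend("b"), Backend("c")]

# Backend "a" is stuck on a slow query; "b" and "c" are idle.
pool[0].active = 1

chosen = pick_least_connections(pool)
chosen.active += 1   # the balancer increments on dispatch...
print(chosen.name)   # prints "b"
# ...and must decrement when the request completes, or counts drift upward.
```

Round robin would have sent this request to whichever backend was next in rotation, busy or not. The counter is the only extra state least connections needs.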

Power of Two Choices (P2C)

A variant of random that gets close to least-connections performance without the coordination overhead: pick two random backends, send to the one with fewer connections. At scale this eliminates hot spots almost as well as pure least-connections, with much lower synchronization cost.
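The effect is easy to see in a toy simulation. The sketch below assumes a simplified model in which requests never finish, so the counters only grow, and it compares the gap between the busiest and idlest backend under pure random selection and under P2C. All function names are illustrative.

```python
import random


def pick_random(active, rng=random):
    """Baseline: choose any backend uniformly at random."""
    return rng.choice(list(active))


def pick_p2c(active, rng=random):
    """Sample two distinct backends and keep the less loaded one."""
    first, second = rng.sample(list(active), 2)
    return first if active[first] <= active[second] else second


def simulate(picker, n_backends=10, n_requests=10_000, seed=1):
    """Return max-minus-min load after dispatching n_requests."""
    rng = random.Random(seed)
    active = {f"b{i}": 0 for i in range(n_backends)}
    for _ in range(n_requests):
        active[picker(active, rng)] += 1   # toy model: requests never complete
    return max(active.values()) - min(active.values())


print("random spread:", simulate(pick_random))
print("p2c spread:   ", simulate(pick_p2c))
```

P2C only reads two counters per decision, which is why it suits setups where many balancer instances each hold a partial or slightly stale view of backend load.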

Consistent Hashing

Used when you want the same key (user ID, cache key, session) to always land on the same backend. Normal modulo hashing (server = hash(key) % N) means adding or removing one server remaps almost every key: going from N to N+1 servers moves roughly N/(N+1) of them. Consistent hashing places servers on a ring; adding or removing one server moves only about 1/N of the keys.

         0
        /|\
   270-+ | +-90
        \|/
        180

Servers placed at positions on ring.
Key hashes to a position → walks clockwise to next server.
Add a server → only keys between old predecessor and new server move.

ELI5: Normal hashing is like assigning classroom seats by dividing student number by class size. Move one student, everyone’s seat changes. Consistent hashing is like seats on a circular train: add one car and only the passengers in the cars around it shift.
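The difference in key movement can be measured directly. This is a sketch under a few assumptions: MD5 is used purely as a stable hash, each server gets 100 virtual points on the ring to even out arc sizes, and `HashRing` is a hypothetical class rather than a library API.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # MD5 is used only as a stable, well-spread hash here, not for security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring so arcs are roughly even.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def lookup(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping past the end.
        idx = bisect.bisect_right(self._hashes, _hash(key)) % len(self._hashes)
        return self._points[idx][1]


def moved_fraction(before, after, keys):
    """Fraction of keys whose assigned server differs between two mappings."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
four = ["s1", "s2", "s3", "s4"]
five = four + ["s5"]

ring4, ring5 = HashRing(four), HashRing(five)
ring_moved = moved_fraction(ring4.lookup, ring5.lookup, keys)
mod_moved = moved_fraction(
    lambda k: four[_hash(k) % 4], lambda k: five[_hash(k) % 5], keys
)

print(f"consistent hashing moved {ring_moved:.0%} of keys")
print(f"modulo hashing moved     {mod_moved:.0%} of keys")
```

Going from four to five servers, the ring should move about a fifth of the keys, all of them onto the new server, while modulo hashing moves about four fifths.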

Common mistake: Using IP hash for session stickiness in a mobile app. Mobile users’ IPs change constantly (cell tower handoffs, IPv6 rotation). Use cookie-based stickiness at L7 instead.

Health Checks

A load balancer sending traffic to a dead backend is worse than no load balancer. Health checks are the mechanism that prevents this.

                  ┌─────────────────────────────┐
                  │         Load Balancer        │
                  │                              │
                  │  [active check loop]         │
                  │    every 10s: GET /health    │
                  └──────┬───────────────────────┘
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
      Backend A      Backend B      Backend C
       ✓ 200ms        ✗ timeout       ✓ 150ms
       [healthy]     [unhealthy]     [healthy]

Active vs Passive Health Checks

Active: the LB proactively probes backends on a schedule (HTTP GET, TCP connect, or gRPC health check). Simple and predictable, but it adds overhead and has a lag: you may keep sending traffic to a dead server for up to one check interval, or several intervals when an unhealthy threshold requires consecutive failures.

Passive: the LB watches real traffic. If 5 consecutive requests to backend X return 5xx or time out, mark it unhealthy. Faster reaction, but the initial failures are real user errors. Good LBs use both together.

`/health` vs `/ready`

Two distinct concepts that get conflated:

/health (liveness): is the process alive? If this fails, the process should be killed and restarted. A true binary: the app is alive or it isn’t.
/ready (readiness): is this instance ready to receive traffic? An app can be alive but warming up a cache, running DB migrations, or recovering from a partial failure. Readiness failing means “don’t route here yet, but don’t kill me.”

Use /ready for LB health checks. Use /health for your orchestrator’s liveness probe (Kubernetes, ECS).

Flapping and Thresholds

A backend that oscillates between healthy/unhealthy causes chaos — traffic bounces in and out of rotation, users see intermittent failures, and logs become noise.

Fix with thresholds:

Healthy threshold: must pass 3 consecutive checks to come back into rotation
Unhealthy threshold: must fail 3 consecutive checks before being removed

This trades a small amount of reaction speed for stability. Almost always worth it.
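The two thresholds amount to a small state machine. A minimal sketch, assuming one tracker per backend that is fed the result of each check; `HealthTracker` is a hypothetical name.

```python
class HealthTracker:
    """Flip state only after N consecutive results that contradict it."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results contradicting the current state

    def record(self, check_passed: bool) -> bool:
        if check_passed == self.healthy:
            self._streak = 0          # result agrees with state: reset the streak
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
results = [True, False, True, False, False, False, True, True, True]
states = [tracker.record(r) for r in results]
print(states)
# [True, True, True, True, True, False, False, False, True]
```

The single failure at position two does not remove the backend. It takes three failures in a row to remove it and three passes in a row to bring it back.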

ELI5: Don’t fire an employee after one bad day. Put them on a performance plan (unhealthy threshold). Similarly, don’t rehire someone after one good interview — they need to consistently perform (healthy threshold).

Connection Draining

When you mark a backend for removal (deploy, scale-down), you don’t want to hard-kill active connections. Draining means:

Stop sending new connections to that backend
Let existing connections finish (up to a configurable timeout, e.g., 30s)
Then remove the backend

AWS calls this “deregistration delay.” Without it, users in the middle of a file upload or long API call get a hard disconnect on every deploy.

Common mistake: Setting drain timeout to 0 seconds to make deploys faster. You just turned every deploy into a user-facing error.
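The three draining steps can be sketched as a loop with a deadline. This is an illustration only: `DrainableBackend` and `drain` are hypothetical, and a real balancer tracks in-flight connections itself rather than polling a counter.

```python
import time


class DrainableBackend:
    def __init__(self, name):
        self.name = name
        self.draining = False
        self.in_flight = 0      # open connections or requests

    def accepts_new(self):
        return not self.draining


def drain(backend, timeout_s=30.0, poll_s=0.5, clock=time.monotonic, sleep=time.sleep):
    """Return True if the backend emptied before the timeout, else False."""
    backend.draining = True                  # step 1: no new connections
    deadline = clock() + timeout_s
    while backend.in_flight > 0 and clock() < deadline:
        sleep(poll_s)                        # step 2: let existing work finish
    return backend.in_flight == 0            # step 3: caller removes the backend


b = DrainableBackend("app-1")
b.in_flight = 2


def fake_sleep(_seconds):
    b.in_flight -= 1    # stand-in for one request completing while we wait


drained = drain(b, sleep=fake_sleep)
print(drained, b.accepts_new())   # prints "True False"
```

With `timeout_s=0` the loop never waits, which is the failure mode described above: anything still in flight is cut off.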

Reverse Proxy

These terms get confused constantly:

Forward proxy: sits between clients and the internet. The client configures it. The server sees the proxy’s IP, not the client’s. Used for: corporate filtering, anonymous browsing, egress control.

Reverse proxy: sits in front of your servers. The server admin deploys it. The client talks to the proxy, unaware of the actual backends. Used for: TLS termination, caching, routing, rate limiting.

Forward Proxy:
Client → [Forward Proxy] → Internet
(client knows about proxy, server doesn't)

Reverse Proxy:
Internet → [Reverse Proxy] → Backend Server
(server knows about proxy, client doesn't)

ELI5: A forward proxy is a middleman you hire to shop on your behalf — stores know the middleman, not you. A reverse proxy is a receptionist at a company — visitors talk to the receptionist, who routes them to the right employee. Visitors don’t know the org chart.

Reverse Proxy Comparison

Proxy	Type	Strengths	Weaknesses
nginx	Web server + proxy	Static files, high performance, widely known	Config is declarative, not programmable
HAProxy	Pure LB / proxy	Extremely mature, detailed stats, L4+L7	No built-in service discovery
Envoy	Service proxy (sidecar)	Dynamic config via xDS API, observability	Complex to operate standalone
Traefik	Cloud-native proxy	Auto-discovers containers, ACME TLS	Performance ceiling lower than nginx/HAProxy
Caddy	Web server + proxy	Automatic HTTPS out of the box	Smaller ecosystem

nginx vs HAProxy: if you need a web server that can also proxy, nginx. If you need a pure high-performance load balancer with rich health check and ACL controls, HAProxy. Envoy is the right choice when you’re building a service mesh or need dynamic config via an API.

TLS Termination

TLS between client and server is standard, but where you decrypt matters.

Three Models

1. TLS Termination at LB:
   Client ──(TLS)──► LB ──(plaintext)──► Backend
   
2. TLS Passthrough:
   Client ──(TLS)──► LB ──(TLS, unchanged)──► Backend
   
3. TLS Re-encryption (new TLS session to backend, optionally mTLS):
   Client ──(TLS)──► LB ──(new TLS)──► Backend

Termination is simplest. Certificate management is centralized. Backends communicate over plain HTTP inside a trusted VPC. Easy to inspect, log, and modify requests. Downside: if someone gets inside your network, traffic is unencrypted.

Passthrough gives end-to-end encryption. The LB can’t inspect or modify the payload (so no header injection, no routing by content). Useful for non-HTTP protocols or strict compliance requirements.

Re-encryption combines the benefits of both and is the most operationally complex. The LB decrypts, inspects, routes, then re-encrypts to the backend over a new TLS session, often with mTLS so each side authenticates the other. Zero-trust networks generally require it.

ELI5: Termination is like opening a sealed letter at the mailroom, reading it, then handing it unsealed to the recipient inside the building. Passthrough is the mailroom just handing it along still sealed — they can’t read it but also can’t add a sticky note. Re-encryption is opening it, stamping it, then re-sealing it in a new envelope.

Preserving Client Information

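When the LB terminates the client connection (as any L7 proxy does), the backend sees the LB’s IP instead of the real client’s. Preserve the original details with headers:

X-Forwarded-For: 1.2.3.4, 10.0.0.1 (the client IP followed by each proxy the request passed through; the leftmost entries are client-supplied and can be spoofed, so validate carefully)
X-Forwarded-Proto: https (the original protocol, so your app knows the request came in as HTTPS)
X-Real-IP: 1.2.3.4 (a single-address alternative to X-Forwarded-For, conventionally set in nginx configs)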

For L4 (TCP), headers don’t exist. Use PROXY protocol instead: a small preamble prepended to the TCP stream that carries the source/destination IP and port (human-readable text in version 1, binary in version 2). HAProxy and nginx both support it. Backends must be configured to read and strip the preamble.
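To make the preamble concrete, here is a sketch of what a backend does with it, assuming the human-readable version 1 format (a single `PROXY ...` line ending in CRLF). `parse_proxy_v1` is a hypothetical helper, and a production parser would also enforce the length limit and validate addresses.

```python
def parse_proxy_v1(data: bytes):
    """Split a PROXY protocol v1 header from the payload that follows it.

    Returns (info, payload). info is None when the proxy sent "PROXY UNKNOWN".
    """
    header, sep, payload = data.partition(b"\r\n")
    if not sep or not header.startswith(b"PROXY "):
        raise ValueError("not a PROXY v1 header")
    fields = header.decode("ascii").split(" ")
    if fields[1] == "UNKNOWN":
        return None, payload
    # Field order: protocol, source addr, destination addr, source port, dest port.
    _, proto, src_ip, dst_ip, src_port, dst_port = fields
    info = {
        "proto": proto,
        "src": (src_ip, int(src_port)),
        "dst": (dst_ip, int(dst_port)),
    }
    return info, payload


raw = b"PROXY TCP4 192.168.0.1 192.168.0.11 56324 443\r\nGET / HTTP/1.1\r\n\r\n"
info, rest = parse_proxy_v1(raw)
print(info["src"])   # ('192.168.0.1', 56324)
print(rest[:14])     # b'GET / HTTP/1.1'
```

The application only ever sees `rest`. If the backend is not configured to expect the preamble, the `PROXY ...` line is treated as application data and the request fails.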

Common mistake: Trusting X-Forwarded-For for security decisions (rate limiting, IP allowlisting) without validating that the request actually came through your LB. Clients can set this header directly. Only trust headers that your LB overwrites (not appends).
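One common way to validate the header is to walk it from the right and skip only addresses that belong to your own proxies. The sketch below assumes the LBs live in 10.0.0.0/8, which is an illustrative value, and `client_ip` is a hypothetical helper.

```python
import ipaddress
from typing import Optional

# Assumption for illustration: every LB / proxy you operate lives in this range.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def _trusted(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)   # raises ValueError on malformed input
    return any(addr in net for net in TRUSTED_PROXIES)


def client_ip(peer_ip: str, xff: Optional[str]) -> str:
    """Best-effort real client IP for a request arriving from peer_ip."""
    if not xff or not _trusted(peer_ip):
        return peer_ip                # direct connection: ignore the header
    for hop in reversed([h.strip() for h in xff.split(",")]):
        try:
            if not _trusted(hop):
                return hop            # first address our proxies did not add
        except ValueError:
            break                     # malformed entry: stop trusting the header
    return peer_ip


print(client_ip("10.0.0.5", "1.2.3.4, 10.0.0.1"))            # 1.2.3.4
print(client_ip("203.0.113.9", "1.2.3.4"))                   # 203.0.113.9
print(client_ip("10.0.0.5", "6.6.6.6, 1.2.3.4, 10.0.0.1"))   # 1.2.3.4
```

In the third call the client supplied `6.6.6.6` itself. Reading from the right stops at the address the first trusted proxy actually observed, so the forged entry is never reached.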

Service Discovery

A static list of backend IPs hardcoded in your LB config doesn’t survive autoscaling, container scheduling, or routine instance replacement. Service discovery is how the LB stays current.

From Static to Dynamic

Static:
  upstream backend {
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
  }
  (reload nginx every time you scale)

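DNS-based:
  resolver 10.0.0.2 valid=10s;   # example resolver address
  upstream backend {
    zone backend 64k;
    server myapp.internal:8080 resolve;
  }
  (LB re-resolves DNS periodically; `resolve` needs a resolver and a shared-memory zone, and older open-source nginx releases offered it only in NGINX Plus)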

Registry-based (Consul/etcd):
  Backends register themselves on startup.
  LB polls or watches the registry.
  Instant updates, no DNS TTL delays.

DNS-based is simple but TTL causes lag (new backends aren’t routable until TTL expires; removed backends stay in rotation). For anything dynamic, use a registry.

Kubernetes Service Discovery

In Kubernetes, the control plane handles all of this:

Service: a stable virtual IP (ClusterIP) + DNS name that selects pods by label
Endpoints/EndpointSlice: the LB-like mapping from Service VIP to actual pod IPs (updated by the controller as pods come and go)
kube-proxy: programs iptables/ipvs rules on each node to forward Service VIP traffic to real pods
Ingress: L7 HTTP routing into the cluster — maps hostnames and paths to Services
Gateway API: the next-gen replacement for Ingress, with richer routing models and proper role separation

ELI5: A Kubernetes Service is like a department’s phone extension. People call extension 200 for “support.” The PBX routes the call to whichever support agent is available. When agents start/stop work, the PBX list updates automatically. The callers never need to know the agents’ direct numbers.

Service Mesh

When you have many services talking to each other (not just external traffic in), a service mesh typically injects a sidecar proxy (often Envoy) into each pod. The sidecar handles all outbound and inbound traffic transparently:

Load balancing between instances of a service
mTLS between every pair of services (zero-trust)
Retries, timeouts, circuit breaking
Distributed tracing headers
Traffic splitting for canary deployments

Control planes: Istio (full-featured, complex), Linkerd (simpler, lower overhead), Cilium (eBPF-based, no sidecar).

Advanced Patterns

Global Load Balancing

Traffic from a user in Tokyo shouldn’t have to travel to us-east-1. Global LB routes users to the nearest healthy region.

Anycast: same IP announced from multiple datacenters via BGP. Traffic routes to the closest one at the routing layer. Cloudflare and AWS Global Accelerator use this. Failover is fast: once a site withdraws its route, BGP typically reconverges within seconds to a few minutes.
GeoDNS: return different A records based on client’s geographic location (resolver’s IP as proxy for client location). Simple, but DNS TTL means failover is slow.
GSLB (Global Server Load Balancing): health-aware GeoDNS that removes unhealthy regions from DNS responses. F5, Azure Traffic Manager.

Rate Limiting Algorithms

Algorithm	How it works	Burst behavior
Token bucket	Tokens fill at rate R, consume 1 per request	Allows bursts up to bucket size
Leaky bucket	Requests queue, drain at fixed rate	Smooths bursts, adds latency
Fixed window	Count requests per window (1s, 1m)	Burst at window boundaries
Sliding window	Rolling count over last N seconds	More accurate, higher memory cost

Token bucket is the most common at the edge. It allows short bursts (good for humans, bad for scrapers) while bounding long-term rate.

ELI5: Token bucket: you get 10 tokens per second in a bucket that holds 20. Each request costs 1 token. You can fire 20 requests instantly if you’ve been idle, but not 21. Leaky bucket: all requests go into a queue that drains at exactly 10/second — no bursting, perfectly smooth.
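The same numbers (10 tokens per second, bucket of 20) in a minimal sketch. The clock is injectable so the example runs instantly, and `TokenBucket` is a hypothetical class, not a library API.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity) # start full
        self._clock = clock
        self._last = clock()

    def allow(self, cost=1.0) -> bool:
        now = self._clock()
        # Refill lazily based on elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


fake_now = [0.0]   # fake clock so the demo needs no real waiting
bucket = TokenBucket(rate=10, capacity=20, clock=lambda: fake_now[0])

burst = sum(bucket.allow() for _ in range(21))    # 20 allowed, the 21st rejected
fake_now[0] += 0.5                                # half a second later: 5 tokens back
refill = sum(bucket.allow() for _ in range(21))
print(burst, refill)   # 20 5
```

Refilling lazily on each call avoids a background timer, so a limiter like this can keep one small record per client.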

Zero-Downtime Deployment Patterns

Canary deploy: route a small percentage (1%, 5%, 10%) to the new version. Watch error rates, latency, business metrics. If healthy, increase the percentage. Roll back instantly by setting the weight to 0.

Blue-green deploy: two identical environments, “blue” serving live traffic and “green” running the new version. Switch the LB to send 100% to green. Rollback is switching back to blue. Fast, but requires double the infrastructure.

Rolling deploy: replace backends one at a time. Works with any load balancer. Slower than blue-green, but doesn’t require double capacity.

Canary routing in nginx (weights are relative within a single upstream):
  upstream app_backend {
    server v1:8080 weight=95;
    server v2:8080 weight=5;
  }
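Weighted selection picks a version per request. When a given user should stay on the same version throughout a canary, one alternative is to hash a stable identifier into a bucket. The sketch below assumes a request or user ID is available; `pick_version` is a hypothetical helper.

```python
import hashlib


def pick_version(request_id: str, canary_percent: float) -> str:
    """Deterministically map an ID to "v2" for roughly canary_percent% of IDs."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000   # 0..9999
    return "v2" if bucket < canary_percent * 100 else "v1"


ids = [f"req-{i}" for i in range(10_000)]
share = sum(pick_version(i, 5) == "v2" for i in ids) / len(ids)
print(f"canary share: {share:.1%}")

# Rollback is a single parameter change: at 0% nothing reaches v2.
print(any(pick_version(i, 0) == "v2" for i in ids))   # False
```

Because the bucket depends only on the ID, raising the percentage from 5 to 10 keeps everyone already on v2 there and adds new IDs on top.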

Circuit Breaker

If a backend is slow or returning errors, the LB can stop sending requests to it immediately rather than queuing them up (which makes things worse). A circuit breaker has three states:

Closed: normal operation, requests go through
Open: backend failed threshold, requests are rejected immediately (fail fast)
Half-open: after a timeout, allow a probe request — if it succeeds, close the circuit

This prevents cascading failures. When service B is slow, service A backs up, which blocks service C, which exhausts connections everywhere. A circuit breaker at the LB or client stops the cascade at the source.

ELI5: Circuit breaker is exactly what it sounds like — like in your electrical panel. Too much current (errors), the breaker trips. Power stops flowing immediately rather than the wires burning. After a few minutes, you flip it back on to test if the problem is resolved.
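The three states map onto a small wrapper around any call. A minimal single-threaded sketch: `CircuitBreaker` and `CircuitOpenError` are hypothetical names, and the thresholds are illustrative defaults.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0        # consecutive failures while closed
        self._opened_at = None
        self.state = "closed"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "half-open"          # timeout elapsed: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._failures = 0                    # success closes the circuit
        self.state = "closed"
        return result

    def _on_failure(self):
        self._failures += 1
        if self.state == "half-open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()


def flaky():
    raise RuntimeError("backend down")


now = [0.0]   # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(flaky)
    except RuntimeError:
        pass
print(breaker.state)                  # open
now[0] = 11.0                         # past the reset timeout
print(breaker.call(lambda: "ok"))     # the probe succeeds: ok
print(breaker.state)                  # closed
```

A failed probe in the half-open state reopens the circuit at once, so a backend that is still struggling receives one request per timeout instead of a flood.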

Request Coalescing (Collapse)

When the same uncached resource is requested by 100 concurrent users at once, a naive proxy fires 100 requests at the backend. Request coalescing means the proxy forwards the first request, holds the other 99, and when the answer comes back, fans it out to all 100. This protects backends from cache stampedes. Varnish, nginx proxy_cache_lock, and Envoy all support variations.
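The leader/follower idea can be sketched with threads: the first caller for a key performs the fetch and everyone who arrives while it is in flight waits for that result. `Coalescer` is a hypothetical class, and the 0.5 s sleep stands in for a slow backend.

```python
import threading
import time


class Coalescer:
    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._in_flight = {}   # key -> (done_event, result_holder)

    def get(self, key):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = self._fetch(key)   # the only backend call
            except Exception as exc:
                holder["error"] = exc                # followers see the same error
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()                           # wake every follower
        else:
            done.wait()
        if "error" in holder:
            raise holder["error"]
        return holder["value"]


calls = {"n": 0}


def slow_fetch(key):
    calls["n"] += 1
    time.sleep(0.5)            # simulated slow backend
    return f"body-for-{key}"


coalescer = Coalescer(slow_fetch)
results = []
threads = [
    threading.Thread(target=lambda: results.append(coalescer.get("/hot")))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 50 callers arrive while the first fetch is still sleeping, so the
# backend is expected to see a single call.
print(calls["n"], len(results))
```

This sketch does not cache: a request arriving after the fetch completes starts a new one. Pairing coalescing with a cache is what turns a stampede into a single miss.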

Summary: Decision Framework

Situation	Recommendation
Need URL-path or header routing	L7 (nginx, Envoy, ALB)
High-throughput TCP (DB, game server)	L4 (HAProxy, NLB)
Simple stateless HTTP	Round Robin if request cost is uniform, otherwise Least Connections (L7)
Caching layer, minimize redistribution	Consistent Hashing
Mobile clients, need stickiness	Cookie-based (not IP hash)
Zero-downtime deploy	Canary or Blue-Green
Service-to-service in Kubernetes	Service (ClusterIP) or service mesh; Ingress/Gateway API for external traffic
Global multi-region HA	Anycast or GeoDNS + health checks
Backend going slow, protect others	Circuit breaker
Same cert everywhere, simple ops	TLS termination at LB
Zero-trust internal network	mTLS re-encryption
Frequent autoscaling (containers)	Registry-based discovery (Consul, k8s Endpoints)
