Advanced Network Patterns

You have the fundamentals. You understand TCP, DNS, HTTP. Now comes the part where architecture decisions get expensive when you get them wrong — service meshes, zero-trust, CDNs, resilience patterns. This is principal-level territory: understanding not just what these things do, but when to use them and when to walk away.


Service Mesh

The Problem It Solves

You have 50 microservices. Every one of them needs:

  • Mutual TLS to encrypt inter-service traffic
  • Retries with backoff when downstream services hiccup
  • Timeouts to avoid cascading failures
  • Circuit breakers to stop hammering dead services
  • Metrics and distributed traces so you can debug anything

Your options: implement all of this in every service (painful, inconsistent, requires library updates across 50 repos) — or extract it to the infrastructure layer. That’s a service mesh.

The Sidecar Pattern

Every pod gets an Envoy proxy injected alongside the application container. The app thinks it’s talking directly to the network. In reality, all traffic flows through the sidecar.

┌─────────────────────────┐     ┌─────────────────────────┐
│  Pod A                  │     │  Pod B                  │
│  ┌───────────┐          │     │  ┌───────────┐          │
│  │   App     │◄─────────┼─────┼──│   App     │          │
│  └─────┬─────┘          │     │  └─────┬─────┘          │
│        │ localhost      │     │        │ localhost      │
│  ┌─────▼─────┐          │     │  ┌─────▼─────┐          │
│  │  Envoy    │◄──mTLS───┼─────┼──│  Envoy    │          │
│  │  Sidecar  │──────────┼─────┼──►  Sidecar  │          │
│  └───────────┘          │     │  └───────────┘          │
└─────────────────────────┘     └─────────────────────────┘
         Data Plane                      Data Plane
                    ▲                 ▲
                    └────────┬────────┘
                    ┌────────▼────────┐
                    │  Control Plane  │
                    │ (Istio/Linkerd) │
                    │  - Policy push  │
                    │  - Cert mgmt    │
                    │  - Telemetry    │
                    └─────────────────┘

Data plane = the sidecars doing the actual work (Envoy). Control plane = the management layer pushing config to sidecars (Istio, Linkerd, Consul Connect).

ELI5: Imagine every employee in a company needs to follow security procedures: show ID, log visitor access, follow safe communication protocols. You could train each person individually and hope they all do it correctly. Or you could put a trained security guard at every desk who handles all of that automatically, and a central security director who updates the guards’ procedures. That’s a service mesh — the guards are sidecars, the director is the control plane.

What You Get

| Feature | Without Mesh | With Mesh |
| --- | --- | --- |
| mTLS between services | Manual cert management per service | Automatic, rotated certs |
| Retries / timeouts | Per-library config in each service | Central policy in YAML |
| Traffic shifting (canary) | Requires code changes | Pure config |
| Distributed tracing | Manual instrumentation | Spans generated by sidecars (apps still forward trace headers) |
| Circuit breaking | Per-service library (Hystrix) | Mesh policy |

The Real Cost

  • Latency: Each hop through a sidecar adds roughly 1–3ms. A deep call chain (A → B → C → D) has three hops, so that’s about 3–9ms of added latency per request. For low-latency systems, this matters.
  • Memory: Envoy uses 50–150MB per pod. At 1,000 pods, that’s 50–150GB overhead.
  • Operational complexity: You now operate Istiod, manage CRDs, debug Envoy configs. This is a full-time job at scale.
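
To make the latency and memory arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes one 1–3ms cost per hop and one 50–150MB sidecar per pod; the function name and defaults are illustrative, and the figures are ballpark ranges rather than measurements.

```python
def sidecar_overhead(hops, pods, ms_per_hop=(1, 3), mb_per_pod=(50, 150)):
    """Return ((min_ms, max_ms), (min_gb, max_gb)) of added mesh overhead.

    The default ranges are rough estimates; substitute numbers
    measured in your own cluster.
    """
    latency_ms = (hops * ms_per_hop[0], hops * ms_per_hop[1])
    memory_gb = (pods * mb_per_pod[0] / 1000, pods * mb_per_pod[1] / 1000)
    return latency_ms, memory_gb


# A -> B -> C -> D is three hops; 1,000 pods each carry one sidecar.
latency_ms, memory_gb = sidecar_overhead(hops=3, pods=1000)
print(latency_ms, memory_gb)  # (3, 9) (50.0, 150.0)
```

Plugging in your own hop count and pod count is a quick way to see whether the overhead fits your latency budget before committing to a mesh.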

Common mistake: Teams adopt a service mesh at 10 services because it sounds enterprise-grade. They spend 3 months fighting Istio configs, half their engineering capacity goes to mesh ops, and the actual app doesn’t get better. Meshes earn their keep at 50+ services with strict security requirements.

When to Use It

Use a mesh when:

  • 50+ microservices where consistent policy matters
  • Compliance requires mTLS everywhere (PCI, HIPAA, SOC 2)
  • You need fine-grained traffic control (canary, A/B, weighted routing)
  • You have a platform/infra team dedicated to operating it

Skip the mesh when:

  • Fewer than 20 services
  • Small team with no platform specialization
  • Latency budget is tight (financial, gaming)
  • Kubernetes-native NetworkPolicy handles your security needs

Zero-Trust Networking

The Old Model Is Broken

Classic perimeter security: castle with a moat. If you’re inside the VPN, you’re trusted. The flaw: once an attacker (or a compromised credential) gets past the firewall, they can move laterally to anything inside.

Zero-trust flips this: never trust any network, always verify every request, regardless of where it originates.

BeyondCorp: Google’s Model

Google published the first BeyondCorp paper in 2014, having started the initiative in response to the 2009 Operation Aurora attacks. The core insight: access is based on identity and device health, not network location. No VPN. Employees work from anywhere; an identity-aware proxy verifies every request.

User Request Flow (Zero-Trust):

 [ User ] ──► [ Identity-Aware Proxy ] ──► checks:
                                           1. Who are you? (mTLS cert / OAuth token)
                                           2. Is your device healthy? (MDM check)
                                           3. Are you allowed this resource? (policy engine)
                                           └─► Allow / Deny
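
The three checks in the flow above can be sketched as a single decision function. This is a minimal illustration, not how any particular identity-aware proxy is implemented: the `Request` type, the `POLICY` table, and the example identities are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Request:
    identity: Optional[str]   # verified from an mTLS cert or OAuth token, else None
    device_healthy: bool      # result of a device-management (MDM) check
    resource: str


# Hypothetical policy table: identity -> resources it may access.
POLICY = {
    "alice@example.com": {"wiki", "payroll"},
    "bob@example.com": {"wiki"},
}


def authorize(req: Request) -> Tuple[bool, str]:
    """Run the three checks in order; network location is never consulted."""
    if req.identity is None:
        return False, "unauthenticated"
    if not req.device_healthy:
        return False, "device failed health check"
    if req.resource not in POLICY.get(req.identity, set()):
        return False, "not permitted by policy"
    return True, "allowed"


print(authorize(Request("bob@example.com", True, "payroll")))
```

Note what is absent: there is no source IP or "is on VPN" input anywhere in the decision.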

Implementation Stack

  • mTLS between services: each service has a certificate (identity). Mutual authentication — both sides verify.
  • Identity-aware proxy (IAP): Google IAP, Cloudflare Access, Pomerium — sits in front of internal apps and enforces policy.
  • Policy engine: Open Policy Agent (OPA) evaluates: “Can service X call endpoint Y on service Z?”
  • SPIFFE/SPIRE: standards for service identity. SPIFFE defines the spec (URI-based identity: spiffe://cluster/ns/default/sa/orders). SPIRE is the reference implementation that issues SVIDs (short-lived X.509 certificates or JWTs) to workloads.

ELI5: Old security is like a gated community — show your badge at the gate, and inside you can go anywhere. Zero-trust is like an office where every door has a card reader. Being inside the building doesn’t help you get into the server room. Every access point checks: who you are, why you’re here, and whether you should be allowed in.

Kubernetes Network Policies

Kubernetes by default allows all pod-to-pod traffic in a cluster. Network policies restrict that.

# Only allow pods labeled app=frontend to reach app=backend on TCP port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend   # illustrative name
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

Calico and Cilium extend this. Cilium uses eBPF to enforce policies at kernel level — faster than iptables, richer L7 policy support.

Common mistake: Deploying Kubernetes without any NetworkPolicy. By default, every pod can talk to every other pod on any port. A compromised pod can reach your database directly. Start with a default-deny policy and explicitly allow traffic.
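
A default-deny policy is small enough to generate per namespace. The sketch below builds one as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The policy name and the "prod" namespace are assumptions for illustration.

```python
import json


def default_deny_ingress(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no ingress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {},           # empty selector = all pods in the namespace
            "policyTypes": ["Ingress"],  # no ingress rules listed, so all ingress is denied
        },
    }


print(json.dumps(default_deny_ingress("prod"), indent=2))
```

With this in place, each allow rule (such as frontend to backend on 8080) is added as a separate, more specific policy.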


CDN Architecture

What CDNs Do

CDN = Content Delivery Network. The problem: your origin server is in us-east-1. A user in Tokyo pays roughly 180ms of round-trip latency before the first byte of the response arrives. CDNs solve this by caching content at edge locations near users.

Without CDN:                    With CDN:
User (Tokyo)                    User (Tokyo)
    │ 180ms RTT                     │ 5ms RTT
    ▼                               ▼
Origin (us-east-1)             Edge (Tokyo POP)
                                    │ cache HIT → immediate response
                                    │ cache MISS → 180ms to origin (once)
                                    ▼              then cached for future users
                               Origin (us-east-1)

How It Works

  1. DNS resolves cdn.example.com to the nearest edge node (anycast IP or GeoDNS)
  2. Edge checks its cache
  3. Cache hit: response served immediately from edge — fast
  4. Cache miss (origin pull): edge fetches from origin, caches it, serves to user. Subsequent users hit the cache.

Cache Invalidation

The classic hard problem. Three mechanisms:

  • TTL expiry: Cache-Control: max-age=3600 — content expires after 1 hour. Simple, but stale content until TTL expires.
  • Purge API: explicitly invalidate a URL or path. CloudFront, Cloudflare, Fastly all have purge APIs. Use for deployments.
  • Stale-while-revalidate: serve stale content immediately, revalidate in background. Best UX — user never waits for cache miss. Cache-Control: max-age=60, stale-while-revalidate=300
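
The three mechanisms can be sketched together in a toy edge cache. This is a simplified model under stated assumptions: `fetch_origin` is a stand-in for the origin pull, the clock is injectable so the behaviour can be tested, and the revalidation happens inline where a real edge would do it in the background.

```python
import time


class EdgeCache:
    """Toy edge cache honoring max-age and stale-while-revalidate (seconds)."""

    def __init__(self, fetch_origin, max_age, swr, clock=time.monotonic):
        self.fetch_origin = fetch_origin  # callable: url -> body (the origin pull)
        self.max_age = max_age
        self.swr = swr
        self.clock = clock
        self.store = {}                   # url -> (body, stored_at)

    def get(self, url):
        now = self.clock()
        entry = self.store.get(url)
        if entry is not None:
            body, stored_at = entry
            age = now - stored_at
            if age <= self.max_age:
                return body, "HIT"
            if age <= self.max_age + self.swr:
                # Serve the stale copy now and refresh the stored one.
                # A real edge would revalidate asynchronously.
                self.store[url] = (self.fetch_origin(url), now)
                return body, "STALE"
        body = self.fetch_origin(url)     # miss or too stale: blocking origin pull
        self.store[url] = (body, now)
        return body, "MISS"

    def purge(self, url):
        """Explicit invalidation, as a purge API would do."""
        self.store.pop(url, None)
```

Only the "MISS" path makes the user wait on the origin; "HIT" and "STALE" both answer from the edge.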

CDN Comparison

| CDN | Strength | Edge Locations | Edge Compute | DDoS Protection |
| --- | --- | --- | --- | --- |
| Cloudflare | Best free tier, DDoS, Workers | 300+ PoPs | Workers (V8 isolates) | Included, excellent |
| Akamai | Enterprise, media streaming | 4,000+ PoPs | EdgeWorkers | Kona Site Defender |
| Fastly | Real-time purge, Varnish VCL | 80+ PoPs | Compute@Edge (Wasm) | Signal Sciences |
| AWS CloudFront | AWS-native, Lambda@Edge | 450+ PoPs | Lambda@Edge | Shield Standard/Advanced |

Edge Compute

CDN vendors now let you run code at the edge. Instead of: user → CDN (cache) → origin (auth + logic) — you can run auth, A/B testing, personalization at the CDN edge.

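  • Cloudflare Workers: V8 isolates. ~0ms cold start. 10ms CPU limit per request on the free plan; paid plans allow substantially more. JavaScript/WASM.
  • Lambda@Edge: runs Node.js/Python at CloudFront. Cold starts are noticeably slower than on isolate-based platforms (typically tens to hundreds of milliseconds). 5s execution limit on viewer-facing triggers, 30s on origin-facing triggers.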
  • Fastly Compute@Edge: WebAssembly. Language-agnostic. Sub-millisecond init.

ELI5: A CDN is like a franchise fast food chain. Instead of everyone driving to the one main kitchen (your origin server), there are locations in every city. Most people get their order from the local branch. Only things the local branch doesn’t have go back to headquarters. Edge compute is like giving the local branch its own small kitchen — not just caching, but actually cooking some dishes locally.


API Gateway Patterns

What an API Gateway Is

A single entry point for all external requests. It sits in front of your services and handles cross-cutting concerns: routing, auth, rate limiting, request/response transformation, aggregation.

Clients ──► [ API Gateway ] ──► Service A
                    │          ──► Service B
                    │          ──► Service C
                    │
               Handles:
               - Auth (JWT verify)
               - Rate limiting
               - SSL termination
               - Request logging
               - Protocol translation

Gateway vs Service Mesh

This confuses people. They’re complementary, not competing:

| Concern | API Gateway | Service Mesh |
| --- | --- | --- |
| Traffic direction | North-South (client → cluster) | East-West (service → service) |
| Primary users | External clients, mobile apps | Internal services |
| Auth | Client auth (API keys, OAuth) | Service-to-service mTLS |
| Focus | External API management | Internal reliability |

Common mistake: Using an API gateway for east-west traffic between internal services. That’s what a service mesh is for. The gateway becomes a bottleneck and single point of failure for all internal calls.

Backend for Frontend (BFF) Pattern

Different clients have different needs. A mobile app needs compact payloads with fewer fields. A web app needs richer data. A public API has different auth requirements than an internal dashboard.

Instead of one gateway for all, create a dedicated gateway per client type:

Mobile App     ──► [ BFF: Mobile Gateway  ] ──► Services
Web Frontend   ──► [ BFF: Web Gateway     ] ──► Services
Partner API    ──► [ BFF: Partner Gateway ] ──► Services

Each BFF is thin — it aggregates and shapes data for its client without bloating the downstream services.

Rate Limiting Algorithms

Token bucket: bucket holds N tokens, refills at rate R. Each request consumes a token. Allows bursts up to bucket size. Many CDNs and gateways use a variant of this.
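
A minimal token bucket sketch, assuming a single process and an injectable clock (the class and parameter names are illustrative, and it is not thread-safe):

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity          # N: maximum burst size
        self.refill_rate = refill_rate    # R: tokens added per second
        self.clock = clock
        self.tokens = float(capacity)     # start with a full bucket
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Credit tokens for the time elapsed, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Tokens are credited lazily on each call, so no background timer is needed.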

Sliding window: count requests in a rolling time window. Smoother than fixed windows, which can admit up to twice the limit when a burst straddles a window boundary. More memory-intensive.
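
One way to implement this is a sliding window log, sketched below with illustrative names. It stores one timestamp per accepted request, which is where the extra memory goes.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.timestamps = deque()   # one entry per accepted request

    def allow(self):
        now = self.clock()
        # Evict requests that have slid out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Production limiters often approximate this with weighted counters to avoid storing every timestamp.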

Leaky bucket: requests enter a queue and are processed at a constant rate. Smooths traffic spikes. If queue fills, excess requests are dropped. Good for protecting downstream services from bursts.
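
A compact sketch of the leaky bucket's admission logic. It tracks only the queue depth, which drains at a constant rate, rather than holding real request objects; names are illustrative.

```python
import time


class LeakyBucket:
    def __init__(self, queue_size, leak_rate, clock=time.monotonic):
        self.queue_size = queue_size   # how many requests may wait
        self.leak_rate = leak_rate     # requests processed per second
        self.clock = clock
        self.level = 0.0               # requests currently waiting
        self.last = clock()

    def offer(self):
        """Return True if the request is queued, False if it is dropped."""
        now = self.clock()
        # Drain the queue at the constant processing rate.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.queue_size:
            self.level += 1
            return True
        return False
```

Unlike the token bucket, a burst here is never passed downstream at full speed; it either waits or is dropped.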

ELI5: Token bucket is like a subway turnstile with stored credits — you can use credits you’ve saved up to rush through fast. Leaky bucket is like a drip irrigation system — water goes in at any rate, but only drips out slowly, steadily. Sliding window is like a bouncer counting how many people entered in the last hour, continuously.


DNS Architecture at Scale

Beyond Single-Record DNS

Simple DNS: one A record pointing to one IP. At scale, DNS becomes an active traffic management layer.

Key Patterns

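DNS failover: a health check monitors your primary endpoint. If it goes down, Route 53 (or equivalent) automatically switches the record to a backup. Recovery time is roughly the detection time (health check interval × failure threshold) plus the record TTL: around 30–60 seconds with aggressive settings, and a few minutes with defaults.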
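
A rough way to estimate failover time: detection takes several consecutive failed checks, and resolvers keep serving the old answer until the TTL expires. The parameter values below are example settings, not recommendations.

```python
def failover_seconds(check_interval, failure_threshold, ttl):
    """Approximate worst-case seconds before clients stop resolving the failed endpoint."""
    detection = check_interval * failure_threshold  # consecutive failed checks
    return detection + ttl                          # plus cached answers expiring


print(failover_seconds(check_interval=10, failure_threshold=3, ttl=30))  # 60
print(failover_seconds(check_interval=30, failure_threshold=3, ttl=60))  # 150
```

The TTL term is why failover records are usually given short TTLs: it is the part of the budget you control most directly.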

GeoDNS: route based on user’s geographic location. EU users go to eu-west-1, APAC users go to ap-southeast-1. Route 53 calls this “geolocation routing.”

Latency-based routing: instead of geo guess, measure actual latency. Route 53 maintains a latency database between AWS regions and resolver locations. Sends user to genuinely lowest-latency region.

Weighted routing: send 10% of traffic to new stack (canary), 90% to old. DNS-level traffic splitting.
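
Conceptually, a weighted policy answers each query with a record chosen in proportion to its weight. The simulation below illustrates that idea; the record names are hypothetical, and real traffic will deviate from the configured split because resolvers cache answers.

```python
import random


def pick_target(weights, rng=random):
    """Pick one record name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


weights = {"old-stack": 90, "new-stack": 10}   # hypothetical record names
rng = random.Random(42)                        # seeded for repeatability
sample = [pick_target(weights, rng) for _ in range(10_000)]
print(sample.count("new-stack") / len(sample))  # close to 0.10
```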

Authoritative DNS Comparison

| Provider | Strength | Unique Feature |
| --- | --- | --- |
| Route 53 | AWS-native, health checks | Deep AWS integration, alias records |
| Cloudflare DNS | Fast anycast resolution, free tier | Global anycast network, CNAME flattening |
| NS1 | Programmable DNS, filter chains | Data-driven routing decisions |
| Google Cloud DNS | GCP-native | Managed DNSSEC |

Consul DNS Interface

HashiCorp Consul can be your service registry AND a DNS server. Services register with Consul; other services discover them via DNS: orders.service.consul resolves to healthy order service instances. Built-in health checking, no external DNS provider needed for internal service discovery.

ELI5: DNS at scale isn’t just a phonebook — it’s a smart call center. When you call a company’s main number (the domain), the routing system checks: where are you calling from? What time is it? Is the local office open? Then it routes your call to the right branch. If a branch goes down, calls get rerouted automatically.


Network Resilience Patterns

Circuit Breaker

When a downstream service starts failing, stop calling it. The circuit breaker has three states:

Closed (normal) ──► [failures exceed threshold] ──► Open (blocking)
                                                          │
                                                    [timeout expires]
                                                          ▼
                                                  Half-Open (probe)
                                                    │          │
                                              [success]   [failure]
                                                 │              │
                                             Closed         Open

Implementations: resilience4j (Java), Polly (.NET), gobreaker (Go). At the infrastructure level: Envoy/Istio circuit breaking via outlierDetection.
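
The state machine can be sketched in a few lines. This is a simplified, single-threaded illustration with made-up names and defaults. The libraries above add sliding failure windows, concurrency control, and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            self.state = "half-open"      # timeout expired: allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.state = "closed"             # any success closes the circuit
        self.failures = 0
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a timeout, which is what protects both sides.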

ELI5: Imagine calling a restaurant to order delivery. If every time you call the line is busy or the order never arrives, you stop calling for a while and try a different restaurant. After some time, you try again to see if they’ve fixed their issues. That’s a circuit breaker — don’t waste time on things that are obviously broken, give them time to recover, then re-check.

Retry with Exponential Backoff + Jitter

Naive retry: fail → wait 1s → retry. If 1,000 services all fail simultaneously and retry at exactly 1s, you create a thundering herd — 1,000 simultaneous retries hammering an already-struggling system.

Exponential backoff: wait 1s, 2s, 4s, 8s… Spreads out retries. Jitter adds randomness: wait random(0, 2^attempt * base_delay). Breaks synchronization across clients.

Rule: retry only on transient failures (timeouts, 503s). Never retry on 400/401/403/404 — those won’t self-heal.
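
A minimal sketch of retry with exponential backoff and full jitter. `TransientError` is a hypothetical stand-in for timeouts and 503s, and the `cap` on the delay is an added assumption not mentioned above; anything that is not a `TransientError` propagates immediately without a retry.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a timeout or HTTP 503 (hypothetical, for this sketch)."""


def backoff_delay(attempt, base_delay=1.0, cap=30.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base_delay * 2**attempt)]."""
    return rng.uniform(0, min(cap, base_delay * 2 ** attempt))


def retry(fn, max_attempts=5, sleep=time.sleep, **backoff_kwargs):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise                 # out of attempts: surface the error
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

Passing `sleep` as a parameter keeps the function testable without real waiting.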

Timeout Propagation (Deadlines)

Set a deadline at the ingress point and propagate it through the call chain. If the overall request budget is 500ms, every downstream call knows it too. When the deadline expires, all in-flight calls cancel.

gRPC has this built in with deadline propagation. For HTTP services, pass the deadline in a custom header (for example X-Request-Deadline) or use context cancellation.

Common mistake: Setting timeouts on individual calls but not propagating the overall deadline. Service A times out its call to B at 5s. But A’s caller timed out the whole request at 1s. A wastes 4 more seconds on a request the caller already gave up on.
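
One way to avoid this is to carry an absolute deadline through the call chain and derive each downstream timeout from what is left. The `Deadline` class below is an illustrative sketch, not a standard library type.

```python
import time


class DeadlineExceeded(Exception):
    pass


class Deadline:
    """An absolute deadline created once at ingress and passed down the call chain."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + budget_seconds

    def remaining(self):
        return self.expires_at - self.clock()

    def timeout_for(self, local_timeout):
        """Timeout for one downstream call: never longer than the overall budget."""
        remaining = self.remaining()
        if remaining <= 0:
            raise DeadlineExceeded("caller has already given up")
        return min(local_timeout, remaining)
```

In the scenario above, service A would call `deadline.timeout_for(5.0)` and get at most the remaining fraction of the 1s budget instead of waiting the full 5s.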

Bulkhead

Isolate failures by giving each dependency its own resource pool. Without a bulkhead: a slow service B exhausts the shared thread pool, and calls to services A and C, which draw from the same pool, fail too. With a bulkhead: each dependency gets its own connection pool, so service B’s slowness only affects its own pool.
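
A bulkhead can be as simple as a per-dependency concurrency cap. The sketch below uses a semaphore per dependency and rejects excess calls immediately. The dependency names and limits are illustrative.

```python
import threading


class BulkheadFull(Exception):
    pass


class Bulkhead:
    """Caps concurrent calls to one dependency; excess calls are rejected at once."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("pool for this dependency is exhausted")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency, so a slow service_b cannot starve service_c.
bulkheads = {"service_b": Bulkhead(2), "service_c": Bulkhead(2)}
```

Rejecting rather than queueing is a design choice: a caller that fails fast can fall back or degrade, while a caller stuck in a queue just adds to the pile-up.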

Chaos Engineering

Proactively inject failures to find weaknesses before production does it for you. Tools:

| Tool | What it does |
| --- | --- |
| Chaos Monkey (Netflix) | Randomly terminates EC2 instances in production |
| Litmus | Kubernetes-native chaos experiments (pod kill, network partition) |
| Gremlin | Commercial, structured chaos with rollback |
| AWS Fault Injection Simulator | Managed chaos for AWS services |

Start small: kill one pod in staging, verify your health checks + restart policies work. Graduate to: network partition between services, latency injection, disk full simulation.


Network Security Architecture

Defense in Depth

No single control protects everything. Layer multiple independent controls so that bypassing one doesn’t compromise the system.

Internet
   │
[DDoS Protection / CDN — L3/L4/L7 volumetric scrubbing]
   │
[WAF — L7 OWASP Top 10 filtering]
   │
[Load Balancer / API Gateway — auth, rate limiting]
   │
[VPC / Security Groups — network-layer allow/deny]
   │
[Service Mesh / mTLS — zero-trust east-west]
   │
[Application — input validation, authz checks]
   │
[Data — encryption at rest, field-level encryption]

Network Segmentation

VPC: isolated virtual network. All your resources live inside; nothing enters or exits without explicit rules.

Subnets: divide the VPC. Public subnets have routes to the internet gateway. Private subnets don’t.

Security groups (AWS): stateful L3/L4 firewall on each resource. Allow specific IP ranges and ports. Stateful = return traffic automatically allowed.

NACLs: stateless L3/L4 rules at the subnet level. Inbound traffic is evaluated against the NACL before it reaches security groups. Because NACLs are stateless, you must explicitly allow return traffic (typically the ephemeral port range).

WAF

Web Application Firewall operates at L7. Protects against OWASP Top 10: SQL injection, XSS, path traversal, etc. Inspects HTTP request bodies, headers, query parameters.

Options: AWS WAF, Cloudflare WAF, Imperva, ModSecurity (open source). Managed rule sets save you from writing rules yourself — AWS Managed Rules, Cloudflare’s OWASP ruleset.

DDoS Attack Types

| Layer | Attack type | Mitigation |
| --- | --- | --- |
| L3/L4 | UDP flood, SYN flood, volumetric | Scrubbing centers, anycast routing |
| L7 | HTTP flood, Slowloris, cache busting | Rate limiting, WAF, CAPTCHA, bot detection |

L3/L4 attacks are about raw volume: floods of 1Tbps or more that exhaust bandwidth. L7 attacks are low-volume but expensive to process: each request looks legitimate but hits expensive endpoints (search, cart). L7 attacks are harder to defend against because malicious requests are difficult to tell apart from real ones.

ELI5: DDoS is like someone hiring a thousand people to sit in every seat at your restaurant and never order anything. Volume attacks (L3/L4) are like a mob blocking your door so real customers can’t get in. Application attacks (L7) are like the mob sitting down, ordering water, and sending it back repeatedly — using your staff’s time but paying nothing.


Emerging Network Technologies

eBPF

Extended Berkeley Packet Filter lets you run sandboxed programs inside the Linux kernel — without writing kernel modules or rebooting. This is revolutionary for networking.

What eBPF enables:

  • Cilium: eBPF-based Kubernetes CNI. Replaces iptables, whose sequentially evaluated rule chains slow down as they grow into the thousands of rules, with eBPF map lookups. Published benchmarks show large throughput gains at scale; L7-aware policy is also available.
  • Falco: runtime security that can use an eBPF probe. Detects exec, file access, and network calls at kernel level with low overhead.
  • Pixie: auto-instrumentation of Kubernetes services using eBPF. No code changes, full request/response visibility.

The pitch: all the observability and security of a service mesh sidecar, without the sidecar tax (latency + memory).

DPDK

Data Plane Development Kit. Moves packet processing from kernel space to user space. Kernel networking overhead: context switches, interrupt handling, memory copies. DPDK bypasses all of this.

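Result: on the order of 10–100M packets/second, versus roughly 1M with the standard kernel network stack (ballpark figures that depend heavily on hardware and packet size). Used by telecom (5G), high-frequency trading, NFV.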

Not for typical web services — this is for when you’re building network infrastructure itself (firewalls, load balancers, telco equipment).

IPv6-Only Networks

AWS launched IPv6-only subnet support in 2021. Apple has required IPv6 support in iOS apps since 2016. The world is slowly, painfully moving.

For services not reachable via IPv6: NAT64/DNS64. DNS64 synthesizes AAAA records from A records. NAT64 translates IPv6 traffic to IPv4 at the network boundary. An IPv6-only client can reach an IPv4-only server transparently.
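
The address mapping behind DNS64/NAT64 is simple: the 32-bit IPv4 address is embedded in the low bits of a /96 IPv6 prefix. The sketch below assumes the well-known prefix 64:ff9b::/96 from RFC 6052 (deployments may use a network-specific prefix instead); the function names are illustrative.

```python
import ipaddress

WELL_KNOWN_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")  # RFC 6052


def synthesize_aaaa(ipv4: str, prefix=WELL_KNOWN_PREFIX) -> str:
    """DNS64 side: embed the IPv4 address in the low 32 bits of the prefix."""
    v4 = ipaddress.IPv4Address(ipv4)
    return str(ipaddress.IPv6Address(int(prefix.network_address) | int(v4)))


def extract_ipv4(ipv6: str) -> str:
    """NAT64 side: recover the IPv4 destination from the synthesized address."""
    low_32_bits = int(ipaddress.IPv6Address(ipv6)) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(low_32_bits))


print(synthesize_aaaa("192.0.2.33"))        # 64:ff9b::c000:221
print(extract_ipv4("64:ff9b::c000:221"))    # 192.0.2.33
```

Because the mapping is stateless and reversible, the DNS64 resolver and the NAT64 gateway only need to agree on the prefix.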

eBPF vs Sidecar: The Future of Service Mesh

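| Approach | Latency | Memory | Visibility | Maturity |
| --- | --- | --- | --- | --- |
| Sidecar (Istio/Envoy) | +1–3ms per hop | 50–150MB/pod | Full L7 | Production-proven |
| eBPF (Cilium, Hubble) | Near-zero added overhead for L3/L4 | Shared per node (no per-pod proxy) | L3/L4 in kernel; L7 via a per-node proxy | Maturing fast |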

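The trajectory: eBPF-based, sidecar-free data planes look likely to replace sidecar meshes for many use cases over the next 3–5 years, though that is a prediction rather than a settled outcome. Cilium with Hubble already covers much of Istio’s observability feature set. Isovalent (the company behind Cilium) is working on sidecar-free mutual authentication.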


When to Reach for What — Summary

| Pattern | Use when | Skip when |
| --- | --- | --- |
| Service mesh | 50+ services, strict mTLS, complex traffic mgmt | Small team, <20 services |
| Zero-trust / SPIFFE | Multi-cloud, compliance-driven, high-value targets | Simple internal monolith |
| CDN | Static assets, global users, DDoS risk | Internal APIs, single-region |
| Edge compute | Auth/personalization at edge, latency-sensitive | Simple static content only |
| API gateway | External API surface, client-facing auth | Internal service-to-service |
| BFF pattern | Multiple distinct client types (mobile, web, partner) | Single client type |
| Circuit breaker | Any distributed system with external deps | Monolith, single-process |
| Chaos engineering | Mature SRE practice, redundancy exists | Pre-redundancy systems |
| WAF | Public-facing web app, compliance requirement | Internal services |
| eBPF (Cilium) | New Kubernetes deployments, sidecar overhead concern | Legacy kernels below Cilium’s minimum supported version |
| GeoDNS | Multi-region active-active, compliance data residency | Single-region |
| Bulkhead | Multiple unrelated external dependencies | Single dependency |