TCP & UDP Deep Dive

TCP and UDP are the two workhorses of the transport layer. TCP gets all the glory because it’s used everywhere, but understanding both — and why each design choice was made — is what separates people who “know networking” from engineers who can actually diagnose production incidents.

TCP Fundamentals

TCP is connection-oriented, reliable, ordered, byte-stream. Each word matters:

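  • Connection-oriented: both ends set up shared state (sequence numbers, windows) with a handshake before data flows
  • Reliable: every byte is acknowledged, and lost data is retransmitted until it is acknowledged or the connection fails with an error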
  • Ordered: bytes arrive in the sequence they were sent
  • Byte-stream: no message boundaries — the application decides where records start/end
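Because TCP delivers a byte stream, the application has to mark record boundaries itself. One common approach is a length prefix. The sketch below assumes a 4-byte big-endian prefix (an illustrative framing choice, not something TCP defines) and simulates the stream arriving in arbitrary chunks:

```python
import struct

# Assumed framing: each record is preceded by a 4-byte big-endian length.
HEADER = struct.Struct("!I")


def frame(payload: bytes) -> bytes:
    """Prefix a record with its length so the receiver can find where it ends."""
    return HEADER.pack(len(payload)) + payload


def deframe(buffer: bytearray) -> list[bytes]:
    """Pop every complete record from buffer; leave any partial tail in place."""
    records = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, 0)
        end = HEADER.size + length
        if len(buffer) < end:
            break  # the rest of this record has not arrived yet
        records.append(bytes(buffer[HEADER.size:end]))
        del buffer[:end]
    return records


# Simulate TCP handing over two records split at arbitrary byte offsets.
stream = frame(b"hello") + frame(b"world!")
buf = bytearray()
received = []
for chunk in (stream[:3], stream[3:11], stream[11:]):
    buf += chunk
    received += deframe(buf)

print(received)  # [b'hello', b'world!']
```

The chunk boundaries never line up with the record boundaries, yet the receiver recovers both records intact. That is the work "byte-stream" pushes onto the application.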

ELI5: TCP is like sending a package via certified mail. You get a confirmation when it’s delivered, the carrier will resend if it gets lost, and if you send boxes 1, 2, 3 they arrive as 1, 2, 3. UDP is like dropping flyers from a plane — fast, cheap, but you have no idea who caught one.

The 3-Way Handshake

Client                        Server
  |                              |
  |------- SYN (seq=x) -------->|   Client picks ISN x
  |                              |
  |<-- SYN-ACK (seq=y, ack=x+1)-|   Server picks ISN y, acks x
  |                              |
  |------- ACK (ack=y+1) ------->|   Client acks y
  |                              |
  |     [data can flow now]      |

Why 3 ways, not 2? Both sides need to synchronize their sequence numbers AND confirm the other side received them. After two messages the client knows the server received its ISN (the SYN-ACK acknowledges it), but the server has no proof that the client received the server’s ISN. The third ACK completes the proof. Total cost: 1 RTT before the client can send data.

ELI5: The handshake is like two people meeting on a walkie-talkie. “Can you hear me?” (SYN) → “Yes I hear you, can you hear me?” (SYN-ACK) → “Yes.” (ACK). You need all three so both sides know the radio works in both directions. Skip step 3 and the first person never confirmed they heard the reply.

The 4-Way Teardown and TIME_WAIT

Client                        Server
  |------- FIN ----------------->|   "I'm done sending"
  |<------ ACK ------------------|   "Got it"
  |<------ FIN ------------------|   "I'm done sending too"
  |------- ACK ----------------->|   "Got it"  ← Client enters TIME_WAIT here
  |                              |
  [Client waits 2*MSL before port is released]

Why TIME_WAIT? The final ACK might get lost. If it does, the server resends its FIN. If the client had already discarded the connection, it would answer with RST, and a new connection that reused the same addresses and ports could be hit by delayed segments from the old one. TIME_WAIT holds the port for 2×MSL (Maximum Segment Lifetime) so any delayed duplicates from the old connection expire before a new one starts. RFC 793 sets MSL at 2 minutes, giving a 4-minute hold; Linux hardcodes TIME_WAIT to 60 seconds.

Why TIME_WAIT causes port exhaustion: The side that closes first enters TIME_WAIT. In high-throughput services (load balancers, proxies), that is usually the proxy acting as the client of its upstream, so every closed connection parks one ephemeral port. Linux defaults to about 28,000 ephemeral ports (32768-60999), and the limit applies per destination IP and port. With a 4-minute hold you run out at a little over 100 connections/sec to one destination; with Linux’s 60-second hold, at roughly 470/sec. Fix: net.ipv4.ip_local_port_range = 1024 65535 + tcp_tw_reuse.
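A back-of-the-envelope check of those rates. It assumes the default Linux ephemeral range of 32768-60999 (about 28,000 ports), that every connection goes to a single destination IP and port, and that each closed connection holds its port for the full TIME_WAIT period. The function name is illustrative.

```python
def max_sustained_rate(port_low: int, port_high: int, time_wait_s: float) -> float:
    """Connections/sec to one destination before every ephemeral port sits in TIME_WAIT."""
    ports = port_high - port_low + 1
    return ports / time_wait_s


# 4-minute hold (2 x MSL with the RFC 793 MSL of 2 minutes)
print(round(max_sustained_rate(32768, 60999, 240)))  # 118

# 60-second hold, the TIME_WAIT duration Linux uses
print(round(max_sustained_rate(32768, 60999, 60)))   # 471

# Widened range (1024-65535) with the 60-second hold
print(round(max_sustained_rate(1024, 65535, 60)))    # 1075
```

Widening the port range only buys about a 2x margin. Reusing connections removes the closes altogether, which is why pooling is usually the more durable fix.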

ELI5: TIME_WAIT is like the post office holding a tracking number for a few minutes after delivery. If someone sends a duplicate package with the same tracking number by mistake, you can say “we already handled this” instead of getting confused. Without the hold period, old duplicate packets could corrupt a brand new connection that reused the same port.

TCP Header Fields That Matter

Field            | Size         | Why it matters
Source/Dest Port | 16 bits each | Connection demultiplexing
Sequence number  | 32 bits      | Byte ordering and reliability
Ack number       | 32 bits      | “I received everything up to X”
Flags            | 8 bits       | CWR, ECE, URG, ACK, PSH, RST, SYN, FIN
Window size      | 16 bits      | Flow control (receiver’s buffer space)
Checksum         | 16 bits      | Error detection
Options          | variable     | Timestamps, MSS, window scaling, SACK

The sequence number tracks bytes, not segments. If you send 1000 bytes starting at seq=100, the next segment starts at seq=1100. This is what enables ordered reassembly and reliable delivery.
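The arithmetic above can be sketched in a few lines. Sequence numbers are 32-bit and wrap around, so both the addition and the "is this earlier?" comparison are done modulo 2^32. The helper names here are illustrative.

```python
SEQ_MOD = 2**32  # sequence numbers are 32-bit and wrap around


def next_seq(seq: int, payload_len: int) -> int:
    """Sequence number of the byte that follows a segment of payload_len bytes."""
    return (seq + payload_len) % SEQ_MOD


def seq_before(a: int, b: int) -> bool:
    """True if a comes before b, treating the 32-bit space as a circle."""
    return 0 < (b - a) % SEQ_MOD < 2**31


print(next_seq(100, 1000))        # 1100
print(next_seq(2**32 - 10, 100))  # 90 (wrapped past zero)
print(seq_before(2**32 - 10, 90)) # True: 90 is "after" the wrap
```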


TCP Flow Control

Flow control answers: “how fast can the receiver absorb data?” It’s a receiver-side concern.

Sliding Window

The sender can have up to rwnd bytes “in flight” (sent but not yet ACKed). This is called the send window. As ACKs arrive, the window slides forward.

Bytes: [ 1  2  3  4 | 5  6  7  8 | 9  10  11  12 ]
           ACKed      in flight      can send
                    ^            ^
                last ack     last sent
                    |<-------- rwnd = 8 -------->|

If the receive buffer fills up, the receiver advertises rwnd=0. The sender stops and starts a persist timer: it periodically sends a small window probe (typically 1 byte), backing off between attempts, to check if space has freed up. When the receiver has room again, it sends a window update.
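How much the sender may still transmit follows from the advertised window and what is already in flight. A minimal sketch, borrowing the RFC 793 names SND.UNA (oldest unacknowledged byte) and SND.NXT (next byte to send) as parameter names. It ignores sequence wraparound and congestion control.

```python
def usable_window(snd_una: int, snd_nxt: int, rwnd: int) -> int:
    """Bytes the sender may still transmit under receiver flow control alone."""
    in_flight = snd_nxt - snd_una  # sent but not yet ACKed
    return max(0, rwnd - in_flight)


# Bytes 1-4 ACKed, bytes 5-8 in flight, receiver advertises rwnd=8.
print(usable_window(snd_una=5, snd_nxt=9, rwnd=8))  # 4

# Receiver buffer full: the sender must stop and fall back to window probes.
print(usable_window(snd_una=5, snd_nxt=9, rwnd=0))  # 0
```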

ELI5: The sliding window is like a garden hose. The receiver controls the nozzle — they can let water flow freely, slow it down, or shut it off completely. The sender can keep pushing water in but only as fast as the nozzle allows. If the bucket at the other end fills up, the receiver closes the nozzle (window=0) until there’s room.

Window Scaling (RFC 7323)

The 16-bit window field caps the advertised window at 64KB. On a 1 Gbps link with 50ms RTT, the bandwidth-delay product is 6.25MB, roughly 100x larger than the max window, so an unscaled connection tops out near 10 Mbps. The window scaling TCP option (negotiated during the handshake) multiplies the advertised window by a power of 2 (up to 2^14), enabling windows up to about 1GB.
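The numbers in that paragraph can be reproduced directly. This sketch computes the bandwidth-delay product, the throughput ceiling of an unscaled 16-bit window, and the smallest scale shift that covers the BDP. The function names are illustrative.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value the 16-bit window field can carry
MAX_SHIFT = 14               # largest window scale shift the option allows


def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Best-case throughput when at most window_bytes can be in flight per RTT."""
    return window_bytes * 8 / rtt_s


def min_window_shift(window_bytes: float) -> int:
    """Smallest shift s such that (65535 << s) covers window_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window exceeds the maximum scaled window")


bdp = bdp_bytes(1e9, 0.050)
print(bdp)                                                       # 6250000.0
print(window_limited_throughput_bps(MAX_UNSCALED_WINDOW, 0.050)) # about 10.5 Mbps
print(min_window_shift(bdp))                                     # 7
print(MAX_UNSCALED_WINDOW << MAX_SHIFT)                          # 1073725440, about 1 GiB
```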

Common mistake: Window scaling only works if both sides negotiate it during the SYN/SYN-ACK. A middlebox (firewall, NAT device) that strips TCP options will silently kill window scaling, causing mysterious throughput caps at 64KB regardless of your kernel settings.


TCP Congestion Control

Flow control is about the receiver. Congestion control is about the network — don’t send faster than the bottleneck link can handle.

Why It Exists

In 1986, the internet nearly collapsed. Gateways were dropping packets, senders were retransmitting, which caused more drops. Positive feedback loop → complete meltdown. Van Jacobson designed the original congestion control algorithms that saved the internet. The insight: packet loss = congestion signal. Slow down when you see loss.

ELI5: Imagine everyone at a conference talking at the same time. The more people who can’t hear responses, the louder everyone talks. Eventually it’s total noise and nobody understands anything. TCP congestion control is the “raise your hand and wait to be called on” rule — when you detect the network is overwhelmed (dropped packets), you shut up for a bit and start slowly again.

The States

       cwnd
         |
         |                      ..---''   congestion avoidance: +1 MSS per RTT
ssthresh +- - - - - - - - ..--''
         |              .'
         |            .'     slow start: cwnd doubles per RTT
         |         _.'
   1 MSS +------''
         +-------------------------------> time

Slow Start: Start with a small cwnd (1 MSS in the original design; modern stacks typically start at 10 MSS per RFC 6928). Double cwnd every RTT until cwnd reaches ssthresh. “Slow” is relative: it’s slow compared to dumping everything at once, but exponential growth is still fast.

Congestion Avoidance: After ssthresh, grow cwnd by 1 MSS per RTT. Linear, conservative.

On timeout (packet loss): ssthresh = cwnd/2, cwnd = 1 MSS. Start over from slow start.
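The three rules above can be traced with a toy simulation. This is a simplified per-RTT model in MSS units, not how a real stack updates cwnd (real stacks adjust per ACK). It starts at 1 MSS as in the text and floors ssthresh at 2 MSS; the function name and the timeout schedule are made up for illustration.

```python
def simulate_cwnd(rtts: int, ssthresh: int, timeout_at: set[int]) -> list[int]:
    """Return cwnd (in MSS) at the start of each RTT under the textbook rules."""
    cwnd = 1
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if rtt in timeout_at:
            # Timeout: halve the threshold and restart from slow start.
            ssthresh = max(cwnd // 2, 2)
            cwnd = 1
        elif cwnd < ssthresh:
            # Slow start: double each RTT, but do not overshoot ssthresh.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: grow by 1 MSS per RTT.
            cwnd += 1
    return history


print(simulate_cwnd(rtts=12, ssthresh=16, timeout_at={8}))
# [1, 2, 4, 8, 16, 17, 18, 19, 20, 1, 2, 4]
```

The trace shows exponential growth up to ssthresh, linear growth after it, and the collapse to 1 MSS after the timeout in RTT 8.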

Fast Retransmit and Fast Recovery

The problem with timeout-based loss detection: TCP’s retransmit timeout (RTO) is in the hundreds of milliseconds. Waiting for timeout on every lost packet kills throughput.

Fast retransmit: If the sender receives 3 duplicate ACKs for the same sequence number, a packet was lost (later packets arrived, causing the receiver to keep ACKing the hole). Retransmit immediately without waiting for timeout.

Fast recovery: After fast retransmit, don’t go back to slow start (cwnd=1). Instead, set ssthresh = cwnd/2, set cwnd to the new ssthresh, and continue in congestion avoidance from there. The 3 dup-ACKs prove the network is still delivering packets, so there is no need for the nuclear option.
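The duplicate-ACK rule is easy to state in code. This sketch only covers the detection side: it scans a sequence of incoming ack numbers and reports where a sender using the 3-dup-ACK threshold would retransmit. It ignores SACK and the window adjustments of fast recovery. The function name and the ack trace are illustrative.

```python
def fast_retransmit_points(acks: list[int], threshold: int = 3) -> list[int]:
    """Return the ack numbers that would trigger a fast retransmit."""
    triggers = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                # Third duplicate: the segment starting at `ack` is presumed lost.
                triggers.append(ack)
        else:
            last_ack = ack
            dup_count = 0
    return triggers


# The segment starting at byte 2000 is lost; later segments each produce a duplicate ACK.
print(fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 6000]))  # [2000]

# Only two duplicates (mild reordering): no retransmit.
print(fast_retransmit_points([1000, 2000, 2000, 2000, 3000]))        # []
```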

ELI5: Fast retransmit is like a waiter taking orders at a table. If three people say “where’s the appetizer for seat 4?”, the kitchen knows seat 4’s order got lost and immediately resends it — they don’t wait for a timer to go off. Fast recovery means they don’t then reset the entire kitchen workflow; they just fix that one missing dish.

CUBIC vs BBR

         | CUBIC                                 | BBR
Signal   | Packet loss                           | Bandwidth + RTT measurement
Approach | Grow aggressively, back off on loss   | Model the bottleneck bandwidth
Good for | Datacenter / low-loss links           | Long-haul / lossy links
Bad at   | Lossy WiFi (false congestion signals) | Competing with CUBIC flows
Default  | Linux default since kernel 2.6.19     | Used at Google; opt-in on Linux (kernel 4.9+)

CUBIC assumes loss = congestion. On a lossy WiFi link, random loss causes constant backoff even when the link has bandwidth to spare. BBR estimates the actual bottleneck bandwidth and RTT, then paces traffic to fill that bandwidth without creating a queue. BBR wins on long-fat pipes and handles random loss better. CUBIC wins in highly contested datacenter links where loss really does mean congestion.


TCP Performance Tuning

This is where most people waste hours with trial and error. Know these.

Nagle + Delayed ACK = 40ms Latency Bug

Nagle’s algorithm: Buffer small writes. Only send a new segment if: (a) buffer ≥ MSS, or (b) previous sent data is fully ACKed. Goal: reduce small-packet overhead.

Delayed ACK: The receiver holds an ACK briefly (about 40ms on Linux, up to 200ms on some other stacks), hoping to piggyback it on a data segment going the other direction.

Together they’re a disaster: The sender writes a small payload, and its next small write is held until the first is ACKed (Nagle). The receiver has nothing to send back, so it sits on the ACK for the delayed-ACK interval. Result: about 40ms of added latency on write-write-read patterns. This kills latency-sensitive protocols (databases, game servers, interactive SSH).

Fix: TCP_NODELAY on the socket disables Nagle. For database drivers and messaging systems, this is almost always what you want.
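Setting the option is one call on the socket, and it can be done before connecting. A minimal sketch using Python's standard socket module; the helper name is illustrative.

```python
import socket


def create_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 tells the kernel to send small writes immediately
    # instead of holding them until earlier data has been ACKed.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = create_low_latency_socket()
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)  # True
sock.close()
```

With Nagle off, the application becomes responsible for not flooding the network with tiny packets, so it helps to assemble a full request in memory and send it with a single write.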

ELI5: Nagle’s algorithm is like a delivery driver who won’t leave until the truck is full. Delayed ACK is a warehouse manager who takes 40 minutes to sign the receipt. Put them together: driver waits for a signed receipt from the last delivery before loading the next one, and the manager takes 40 minutes to sign. Every delivery takes 40 minutes extra.

SO_REUSEADDR and SO_REUSEPORT

Option       | What it actually does
SO_REUSEADDR | Allow bind to a port in TIME_WAIT state. Essential for servers that restart.
SO_REUSEPORT | Allow multiple sockets to bind to the same port. Each gets a share of incoming connections. Enables per-core accept queues.

Common mistake: Thinking SO_REUSEADDR lets you run two services on the same port. It doesn’t: its main effect is letting you bind while old sockets sit in TIME_WAIT. SO_REUSEPORT is what enables load-balancing across worker processes (NGINX and Envoy do this).
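A short sketch of both options with Python's socket module. It binds to loopback on an OS-chosen port, and it assumes a platform that exposes SO_REUSEPORT (Linux and macOS do, Windows does not), so the second bind is guarded. The helper name is illustrative.

```python
import socket

HAS_REUSEPORT = hasattr(socket, "SO_REUSEPORT")


def make_listener(port: int, reuse_port: bool = False) -> socket.socket:
    """Create a listening TCP socket on loopback with the reuse options set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Lets a restarted server bind even if old connections are in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if reuse_port:
        # Lets several sockets (e.g. one per worker) listen on the same port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock


first = make_listener(0, reuse_port=HAS_REUSEPORT)  # port 0: let the OS pick
port = first.getsockname()[1]

if HAS_REUSEPORT:
    # A second listener on the same port succeeds only because both set SO_REUSEPORT.
    second = make_listener(port, reuse_port=True)
    second.close()

first.close()
print(port)
```

In a real multi-worker server each worker process would create its own listener this way, and the kernel would spread incoming connections across them.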

tcp_tw_reuse vs tcp_tw_recycle

  • tcp_tw_reuse=1: Allow reusing a TIME_WAIT socket for a new outbound connection if the new connection’s timestamp is newer. Safe. Use this.
  • tcp_tw_recycle=1: Aggressively recycle TIME_WAIT sockets based on timestamps. Dangerous with NAT — multiple clients behind NAT can have different timestamps, causing the kernel to silently drop their SYNs. Removed in Linux 4.12.

UDP Fundamentals

UDP strips everything down: no connection, no reliability, no ordering, no congestion control.

UDP Header (8 bytes total):
+--------+--------+--------+--------+
|  Src   |  Dst   | Length | Chksum |
|  Port  |  Port  |        |        |
+--------+--------+--------+--------+

That’s it. 8 bytes. Compare to TCP’s minimum 20 bytes (usually 32+ with options).
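The whole header is four 16-bit fields, which makes it easy to build and parse by hand. A minimal sketch with Python's struct module. It leaves the checksum at 0, which in IPv4 means "not computed"; a real sender would normally fill it in. The function names and port numbers are illustrative.

```python
import struct

# Four unsigned 16-bit fields in network byte order: src port, dst port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_header(src_port: int, dst_port: int, payload: bytes, checksum: int = 0) -> bytes:
    """Build the 8-byte UDP header; the length field covers header plus payload."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum)


def parse_udp_header(datagram: bytes) -> dict:
    """Decode the first 8 bytes of a UDP datagram."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return {"src_port": src_port, "dst_port": dst_port, "length": length, "checksum": checksum}


payload = b"query"
datagram = build_udp_header(53000, 53, payload) + payload

print(UDP_HEADER.size)             # 8
print(parse_udp_header(datagram))  # length is 13: 8 header bytes + 5 payload bytes
```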

ELI5: UDP is a postcard with no tracking. You write it, drop it in the mailbox, and move on. It might arrive, might not. It might arrive twice. Two postcards might arrive out of order. But you can write 1000 postcards in the time it takes to set up a single certified mail shipment. For things like “current GPS location” or “is the server alive?”, stale data is worse than no data — so you don’t care if a packet gets lost.

When Unreliable Is the Right Choice

Use case                   | Why UDP wins
DNS queries                | Single packet in, single packet out. A TCP handshake and teardown would cost several extra packets and a round trip for a one-packet exchange.
Live video/audio streaming | Late packets are useless: skip ahead, don’t wait
Online gaming              | Position updates: old position data is worthless, send a new one
VoIP                       | One-way delay above about 150ms (the ITU-T G.114 guideline) hurts call quality, so there is no time to wait for a retransmit
QUIC                       | UDP + application-layer reliability, which can outperform TCP
Multicast/broadcast        | TCP is point-to-point; UDP can send one packet to many hosts

Common mistake: Assuming you need TCP for reliability. You can build application-level reliability on UDP (QUIC does this) and get better performance because you control the retransmit logic, congestion algorithm, and connection semantics without the OS kernel getting in the way.
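To make "application-level reliability" concrete, here is one small piece of it: a receiver that uses per-datagram sequence numbers to drop duplicates and deliver payloads in order. This is a simplified illustration, not how QUIC implements it. The class name is hypothetical, and retransmission, acknowledgements, and wraparound are left out.

```python
class ReorderBuffer:
    """Deliver sequence-numbered datagrams in order, dropping duplicates."""

    def __init__(self) -> None:
        self.next_seq = 0  # next sequence number owed to the application
        self.pending: dict[int, bytes] = {}

    def receive(self, seq: int, payload: bytes) -> list[bytes]:
        """Accept one datagram; return any payloads that are now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate: already delivered or already buffered
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready


buf = ReorderBuffer()
print(buf.receive(1, b"b"))  # [] (arrived early, held back)
print(buf.receive(0, b"a"))  # [b'a', b'b']
print(buf.receive(1, b"b"))  # [] (duplicate, dropped)
print(buf.receive(2, b"c"))  # [b'c']
```

Because this logic lives in the application, it can be relaxed where that makes sense, for example by skipping a gap after a deadline instead of waiting for it indefinitely.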


TCP vs UDP — The Real Decision

ELI5: Choosing TCP vs UDP is like choosing certified mail vs a flyer campaign. Certified mail (TCP) guarantees delivery and order but costs more time. Flyers (UDP) are cheap and fast but you don’t know if anyone read them. Use certified mail for contracts (financial transactions, file transfers). Use flyers for promotions where approximate reach is good enough (live video, game state updates).

Decision Table

Scenario                   | Use | Why
HTTP/HTTPS                 | TCP | Every byte must be correct
SSH, SFTP                  | TCP | Interactive session, ordering critical
Database connections       | TCP | Transactional correctness
DNS lookup                 | UDP | One packet query/response, retry is cheap
Video streaming (HLS/DASH) | TCP | Buffered, reliability over latency
Live video (WebRTC)        | UDP | Latency over reliability
VoIP (SIP/RTP)             | UDP | Real-time, stale audio is useless
QUIC (HTTP/3)              | UDP | Custom reliability on UDP
Online gaming state        | UDP | Tolerate loss, can’t tolerate delay
Service health checks      | UDP | Simple, low overhead

The Hybrid Approach: Build on UDP

QUIC (used by HTTP/3) is the modern case study. It runs on UDP and implements:

  • Connection IDs (survive IP changes — critical for mobile)
  • Multiplexed streams without head-of-line blocking
  • 0-RTT reconnects for repeat visitors
  • Pluggable congestion control

Why not just fix TCP? TCP is implemented in OS kernels. Deploying changes requires kernel upgrades across millions of servers and clients — years of rollout. QUIC runs in userspace (inside Chrome, in nginx, in envoy), deployable with a software update.


Connection Lifecycle Patterns

Why New TCP Connections Are Expensive

A new connection costs: 1 RTT for the handshake + TLS negotiation (1 RTT with TLS 1.3, 2 with TLS 1.2). At 100ms RTT, that’s 200-300ms before your first byte of application data.

Connection pooling solves this: create N connections at startup, reuse them. Every request that reuses a pooled connection saves those 2-3 RTTs. This is why database drivers, HTTP clients, and gRPC channels all implement pooling. Not using a connection pool is one of the most common performance bugs in backend code.
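The core of a pool is small: create the connections once, hand them out, and take them back. The sketch below is a minimal illustration with a stand-in factory instead of a real network connection. The class and function names are hypothetical, and a production pool would also need health checks, reconnects, and idle timeouts.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are created once and reused."""

    def __init__(self, factory, size: int) -> None:
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # pay the setup cost up front

    @contextmanager
    def connection(self, timeout=None):
        """Borrow a connection; it goes back to the pool when the block exits."""
        conn = self._idle.get(timeout=timeout)  # blocks if all are in use
        try:
            yield conn
        finally:
            self._idle.put(conn)


# Stand-in for an expensive connect (TCP handshake + TLS): just count the calls.
created = []


def fake_connect():
    conn = object()
    created.append(conn)
    return conn


pool = ConnectionPool(fake_connect, size=2)
for _ in range(100):
    with pool.connection() as conn:
        pass  # send the request over conn here

print(len(created))  # 2: one hundred requests, two connection setups
```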

HTTP Keep-Alive

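HTTP/1.0 opened a new TCP connection per request by default (Connection: keep-alive was an optional extension). HTTP/1.1 made persistent connections the default: the connection is reused for multiple requests unless one side sends Connection: close. HTTP/2 went further: multiplex multiple requests on a single connection simultaneously.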

Head-of-Line Blocking at TCP Level

HTTP/2 over TCP has a subtle performance problem: TCP guarantees order at the byte level. If one packet is lost, all streams on that connection stall waiting for retransmission, even streams that have no data in that lost packet. This is TCP-level head-of-line blocking.

QUIC solves this: streams are independent at the QUIC layer. A lost UDP packet only blocks the streams whose data was in it.

TCP Fast Open (TFO)

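Standard TCP can’t deliver data to the application until the 3-way handshake completes. TFO (RFC 7413) allows the client to send data in the SYN packet on repeat connections, using a cryptographic cookie obtained on an earlier connection. Saves 1 full RTT. Used by Google and supported on iOS/macOS and Linux (net.ipv4.tcp_fastopen=3 enables both the client and server sides). Data carried in the SYN can be replayed, so it should be limited to idempotent requests.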


Debugging TCP/UDP Issues

The Tools You Actually Use

# Connection state summary — your first stop
ss -s

# All active connections with state
ss -tan   # TCP
ss -uan   # UDP

# Capture packets on interface eth0, port 80
tcpdump -i eth0 port 80 -nn

# Capture and save for Wireshark analysis
tcpdump -i eth0 -w capture.pcap

# Watch TCP retransmission counters
netstat -s | grep -i retransmit

# Check kernel TCP tuning
sysctl net.ipv4 | grep -E "tcp_(rmem|wmem|tw|syn)"

Recognizing Patterns

What you see                  | What it means
Many connections in TIME_WAIT | Normal for short-lived connections; check port exhaustion if > 28k
SYN_SENT stuck                | Server not listening, firewall dropping, or server overloaded
CLOSE_WAIT pile-up            | Application bug: code is not closing connections after receiving FIN
Retransmission spike          | Packet loss somewhere: check switch errors, NIC stats
Zero window events            | Receiver buffer full: application not reading fast enough
RST storm                     | Connection refused, NAT timeout, or bad load balancer config
SYN flood                     | DoS attack: check SYN cookies (net.ipv4.tcp_syncookies=1)

ELI5: CLOSE_WAIT piling up is almost always an application bug, not a network problem. It means the remote side said “I’m done talking” (sent FIN), your kernel ACKed it, but your application code never called close() on the socket. The connection is stuck waiting for your app to close its side. If the count from ss -tan state close-wait keeps growing over time, you have a socket leak. (ss prints the state as CLOSE-WAIT, with a hyphen, so grep CLOSE_WAIT only matches netstat output.)
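To watch for a leak over time, it helps to turn the connection list into per-state counts. The sketch below parses text in the column layout that ss -tan prints, where the first column is the state and ss spells it with a hyphen (CLOSE-WAIT). The sample text is made up for illustration and is not captured from a real host; the function name is hypothetical.

```python
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Count connections per state from `ss -tan` style output."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "State":
            continue  # skip blank lines and the header row
        counts[fields[0]] += 1
    return counts


# Illustrative sample only (hypothetical addresses).
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:443        10.0.0.9:51514
CLOSE-WAIT 1      0      10.0.0.5:443        10.0.0.9:51520
CLOSE-WAIT 1      0      10.0.0.5:443        10.0.0.7:40022
TIME-WAIT  0      0      10.0.0.5:443        10.0.0.8:39000
"""

states = count_tcp_states(SAMPLE)
print(states["CLOSE-WAIT"])  # 2
```

Sampling this count every minute and alerting when it only ever goes up is a cheap way to catch a socket leak before the process runs out of file descriptors.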

TCP vs Application Problem

Symptom                                         | Likely TCP              | Likely Application
Latency spike with retransmissions in tcpdump   | Yes                     | No
High latency but no packet loss                 | No                      | Yes (query is slow)
Connection refused                              | Partial (port not open) | Yes (service crashed)
Intermittent timeouts under load                | Maybe                   | Yes (thread pool exhaustion)
High TIME_WAIT count                            | Yes                     | Maybe (connection not pooled)

Common mistake: Blaming the network when tcpdump shows clean traffic with no retransmissions. If packets are flowing without loss and you’re still seeing high latency, the problem is in your application layer — slow queries, lock contention, serialization overhead.


Summary Decision Table

Question                                               | Answer
Do I need guaranteed delivery?                         | TCP (or UDP + app-level reliability like QUIC)
Is my data real-time and loss-tolerant?                | UDP
Am I building a service that handles 10k+ connections? | Use connection pooling regardless of protocol
Seeing 40ms latency spikes with no loss?               | Check Nagle + Delayed ACK, set TCP_NODELAY
Port exhaustion on a proxy/load balancer?              | Raise port range, enable tcp_tw_reuse
Need to multiplex many streams?                        | HTTP/2 (TCP) or QUIC/HTTP3 (UDP)
Mysterious throughput cap at ~64KB?                    | Check window scaling, inspect firewall option stripping
CLOSE_WAIT connections accumulating?                   | Application code is not closing sockets
TCP vs UDP for a custom protocol?                      | Start with UDP if you need custom congestion or multicast; TCP if you want OS-managed reliability