Network Performance & Debugging

Performance bugs are the worst: the app “works” but feels broken. Slow response times, intermittent timeouts, mysterious packet loss — these don’t show up in application logs. You need to understand the network layer, know which tool to reach for, and be able to read raw packet traces. This file covers everything from first principles to hands-on debugging.

1. Latency — The Silent Killer

Latency is the time for a message to travel from A to B. It has four components, and confusing them leads to bad diagnoses.

The Four Components

Component	Cause	Typical magnitude	Variable?
Propagation delay	Speed of light in medium	1ms–150ms	No
Serialization delay	Pushing bits onto the wire	<1ms at high BW	No
Processing delay	Router/NIC/server CPU	<1ms (modern hardware)	Slightly
Queuing delay	Waiting in buffers	0ms–seconds	Yes — a lot

Propagation delay is physics. Light travels ~300,000 km/s in a vacuum, but in fiber it moves at about 2/3 of that, roughly 200,000 km/s. New York to London is ~5,500 km, so one-way propagation is ~27ms and RTT is ~54ms, and that’s in a straight line with zero queuing. Real cross-Atlantic RTT is higher, typically ~70–100ms, because cables don’t run in a straight line and every hop adds processing and queuing delay.

ELI5: Propagation delay is like the speed of sound. You can’t shout faster than ~340 m/s no matter how loud you are. Similarly, you can’t get data from San Francisco to Tokyo (~8,300 km) over fiber in under ~41ms one way no matter how good your hardware is; physics won’t allow it. Every millisecond of “irreducible” latency is just distance.

Serialization delay is how long it takes to push all the bits of a packet onto the wire. A 1,500 byte packet on a 1 Mbps link takes $\frac{1500 \times 8}{1{,}000{,}000} = 12\text{ ms}$. On a 1 Gbps link, it’s 12 microseconds. It only matters on slow links (DSL, satellite).
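The propagation and serialization figures above are simple enough to check by hand, but a small sketch makes it easy to plug in your own distances and link speeds. The function names are illustrative, and the 200,000 km/s constant is the approximate fiber speed quoted above.

```python
FIBER_KM_PER_S = 200_000  # approximate speed of light in fiber (~2/3 of c)


def propagation_delay_ms(distance_km, speed_km_per_s=FIBER_KM_PER_S):
    """One-way propagation delay in milliseconds for a given path length."""
    return distance_km / speed_km_per_s * 1000


def serialization_delay_ms(packet_bytes, link_bps):
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000


if __name__ == "__main__":
    print(f"5,500 km one way: {propagation_delay_ms(5500):.1f} ms")
    print(f"1,500 B at 1 Mbps: {serialization_delay_ms(1500, 1_000_000):.3f} ms")
    print(f"1,500 B at 1 Gbps: {serialization_delay_ms(1500, 1_000_000_000):.3f} ms")
```

Neither number can be improved by tuning software: propagation depends only on path length, serialization only on packet size and link rate.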

Queuing delay is the dangerous one — it’s variable and causes jitter. When a router’s output buffer is backed up, packets sit and wait. This is where bufferbloat lives (see Section 8). Jitter is the variance in delay; it destroys real-time applications like VoIP and gaming even when average latency is acceptable.

Bandwidth-Delay Product

$$\text{BDP} = \text{Bandwidth} \times \text{RTT}$$

BDP is the amount of data that can be “in flight” at any moment. On a 1 Gbps link with 100ms RTT: $1{,}000{,}000{,}000 \text{ bits/s} \times 0.1 \text{ s} = 100{,}000{,}000$ bits = 12.5 MB in flight.

Why does this matter? TCP’s send window (the smaller of the congestion window and the receiver’s advertised window) limits how much unacknowledged data can be in flight. If your window is smaller than BDP, you can never fill the pipe: your throughput is capped at $\frac{\text{window size}}{\text{RTT}}$. This is why a file transfer over a high-latency link (satellite) is agonizingly slow even with “100 Mbps” bandwidth: a small window is ACKed only once per RTT, so the pipe stays mostly empty.
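A minimal sketch of the BDP and window-limit arithmetic. The satellite figures used here (100 Mbps, 600 ms RTT, a 65,535-byte window without window scaling) are illustrative assumptions, not measurements.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product in bytes: bits in flight divided by 8."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes, rtt_s):
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s


# Assumed example values, chosen only to illustrate the effect
link_bps = 100_000_000   # "100 Mbps" satellite link
rtt_s = 0.6              # geostationary-style round trip
window_bytes = 65_535    # largest window without TCP window scaling

if __name__ == "__main__":
    print(f"BDP: {bdp_bytes(link_bps, rtt_s) / 1e6:.1f} MB")
    cap = window_limited_throughput_bps(window_bytes, rtt_s)
    print(f"Window-limited throughput: {cap / 1e6:.2f} Mbps")
```

With these assumed values the pipe holds about 7.5 MB while the window allows under 1 Mbps, which is why window scaling and larger socket buffers matter on long paths.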

ELI5: BDP is like a highway. A 10-lane highway (bandwidth) between two cities 500 km apart (RTT) can have thousands of cars on it at once (BDP). If the on-ramp only lets 10 cars in at a time (small TCP window), the highway is mostly empty and you’re wasting capacity. You need to open the on-ramp (increase window size) to actually use the road.

Tail Latency — P50 / P95 / P99

Never report average latency. Use percentiles:

Metric	Meaning	Who cares
P50	Median — half of requests are faster	Marketing
P95	95th percentile — 1 in 20 requests	Engineering
P99	99th percentile — 1 in 100 requests	SLAs, real users
P999	1 in 1,000	High-scale systems

At 1,000 RPS, your P99 fires 10 times per second. Consider a call chain with 5 microservices: the fraction of requests that hit P99-or-worse latency in at least one service is $1 - (1 - 0.01)^5 \approx 4.9\%$, assuming the services behave independently. Tail latency compounds across service calls.

Why this matters: If your P99 is 2 seconds and average is 50ms, your average looks great but 1% of requests are waiting 2 seconds. At a million requests per day, that’s 10,000 requests getting a bad experience. Averages hide the outliers that define user satisfaction.
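To make the percentile and fan-out arithmetic concrete, here is a small sketch. It uses the nearest-rank percentile definition (one of several in common use), and the latency samples are made up for illustration.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # round() guards against floating-point noise before taking the ceiling
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def fraction_hitting_tail(num_services, tail_fraction=0.01):
    """Chance a request sees tail latency in at least one of N independent services."""
    return 1 - (1 - tail_fraction) ** num_services


# Made-up samples: 98 fast requests and two 2-second outliers
latencies_ms = [50] * 98 + [2000] * 2

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    print("P50:", percentile(latencies_ms, 50))
    print("P99:", percentile(latencies_ms, 99))
    print(f"5-service fan-out: {fraction_hitting_tail(5):.3f}")
```

The mean of this sample is 89 ms, which looks healthy, while P99 is 2,000 ms.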

2. Bandwidth vs Throughput vs Goodput

These three words are often used interchangeably. They are not the same thing.

Term	Definition	Layer
Bandwidth	Maximum theoretical capacity of the link	Physical
Throughput	Actual data transferred per second	Transport
Goodput	Application-layer useful data per second	Application

Why 1 Gbps doesn’t mean 1 Gbps file transfers:

Protocol overhead: TCP/IP headers add ~40 bytes per ~1,460 bytes of payload. ~2.7% overhead.
Congestion window growth: TCP starts slow (slow start) and ramps up. Short-lived connections never reach full speed.
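Retransmissions: Lost packets must be resent, and with loss-based congestion control each loss also shrinks the congestion window, so a 1% loss rate usually costs far more than 1% of throughput.
ACK traffic: Data flowing one way generates ACKs in the reverse direction (typically one ACK per two segments when delayed ACKs are in use).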
Application behavior: If the app sends data in small batches (Nagle’s algorithm, small writes), the pipe is mostly idle.
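Two of the factors above can be estimated with quick arithmetic. The sketch below computes the payload share of bytes on the wire and applies the Mathis approximation, a common rough model for how loss caps a Reno-style TCP flow; CUBIC and BBR behave differently, so treat it as an order-of-magnitude guide. The 38 bytes of Ethernet framing (preamble, header, FCS, inter-frame gap) is an addition not mentioned in the list.

```python
import math


def goodput_fraction(payload_bytes=1460, tcp_ip_header_bytes=40, framing_bytes=38):
    """Share of on-the-wire bytes that is application payload."""
    return payload_bytes / (payload_bytes + tcp_ip_header_bytes + framing_bytes)


def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Mathis approximation: (MSS / RTT) * (C / sqrt(p)) with C = sqrt(3/2)."""
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)


if __name__ == "__main__":
    print(f"Payload share on Ethernet: {goodput_fraction():.1%}")
    est = mathis_throughput_bps(mss_bytes=1460, rtt_s=0.1, loss_rate=0.01)
    print(f"Rough per-flow ceiling at 1% loss, 100 ms RTT: {est / 1e6:.2f} Mbps")
```

Under this model a single flow at 1% loss and 100 ms RTT tops out at roughly 1.4 Mbps regardless of link speed, which is why loss on long paths hurts so much.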

ELI5: Bandwidth is the maximum speed on a highway. Throughput is how fast you’re actually driving (you slow for traffic, tolls, and construction). Goodput is how much useful cargo arrives — not counting the truck itself, the fuel, and the packaging. A “1 Gbps” file transfer might only move 700–900 Mbps of actual file data when you account for all the overhead.

Common mistake: Measuring throughput with wget and declaring the link “fine.” wget measures transfer rate for one connection. You need iperf3 with multiple parallel streams to actually saturate a high-bandwidth link:

# Server side
iperf3 -s

# Client side — 8 parallel streams, 30 second test
iperf3 -c 192.168.1.1 -P 8 -t 30

3. tcpdump — Packet Capture

tcpdump is your first responder. It runs on any Linux/macOS system with no GUI, captures on remote servers, and can write pcap files for offline analysis.

Basic Syntax

# Capture all traffic on eth0, no DNS resolution (-n), verbose (-v)
tcpdump -i eth0 -n -v

# Capture only port 443 traffic
tcpdump -i eth0 -n port 443

# Capture HTTP traffic from a specific host
tcpdump -i eth0 -n 'host 10.0.1.5 and port 80'

# Save to file for Wireshark analysis (-s 0 = full packet)
tcpdump -i eth0 -n -s 0 -w /tmp/capture.pcap port 443

# Read back a saved capture
tcpdump -r /tmp/capture.pcap -n

Filter Expression Reference

Filter	Example	Captures
`host`	`host 10.0.0.1`	Traffic to/from IP
`src` / `dst`	`src 10.0.0.1`	One direction only
`port`	`port 5432`	Any traffic on port
`net`	`net 10.0.0.0/24`	Entire subnet
`tcp[tcpflags]`	`tcp[tcpflags] & tcp-rst != 0`	RST packets
Logical	`and`, `or`, `not`	Combine filters

What to Look For

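# Catch SYNs and RSTs: repeated SYNs to the same host or bursts of RSTs signal connection problems
tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-rst|tcp-syn) != 0'

# Zero window advertisements: receiver buffer full, sender blocked
# (RSTs are excluded because they often carry a zero window field)
tcpdump -i eth0 -n 'tcp[14:2] == 0 and tcp[tcpflags] & tcp-rst == 0'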

# SYN flood detection — too many SYNs without SYN-ACK
tcpdump -i eth0 -n 'tcp[tcpflags] == tcp-syn'
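Filters such as `tcp[tcpflags]` and `tcp[14:2]` are byte offsets into the TCP header: the flags sit in byte 13 and the 16-bit window in bytes 14 and 15. The sketch below decodes the fixed 20-byte header the same way, which can help when reading those expressions. The sample header is constructed by hand for illustration and is not taken from a real capture.

```python
import struct

TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(header):
    """Decode the fixed 20-byte part of a TCP header (network byte order)."""
    (src_port, dst_port, seq, ack, offset_reserved, flags, window,
     checksum, urgent) = struct.unpack("!HHIIBBHHH", header[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "flags": {name for name, bit in TCP_FLAGS.items() if flags & bit},
        "window": window,
    }


def is_zero_window(header):
    """Same test as the filter 'tcp[14:2] == 0 and tcp[tcpflags] & tcp-rst == 0'."""
    return header[14:16] == b"\x00\x00" and not header[13] & TCP_FLAGS["RST"]


# Hand-built sample: an ACK from port 44321 to 443 advertising a zero window
sample = struct.pack("!HHIIBBHHH", 44321, 443, 1000, 2000, 5 << 4, 0x10, 0, 0, 0)

if __name__ == "__main__":
    print(parse_tcp_header(sample))
    print("zero window:", is_zero_window(sample))
```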

ELI5: tcpdump is like a wiretap on a phone line — everything going in or out of a network interface gets recorded. Filters let you focus on the suspicious calls. On a busy server, you’d drown in noise without filters, so you narrow to “only show me calls from this IP on this port.”

Common mistake: Running tcpdump without -n on a busy server. DNS reverse-lookups for every IP address add latency and CPU pressure, and can cause tcpdump to drop packets. Always use -n in production.

4. Wireshark — Deep Packet Analysis

Wireshark is the GUI counterpart to tcpdump. Use it for analysis, not capture — capture on the server with tcpdump, copy the .pcap file, analyze locally in Wireshark.

Key Features

Following TCP streams — Right-click any packet → “Follow → TCP Stream”. Wireshark reassembles the full conversation. You see the raw HTTP request/response, or whatever protocol is in use. Invaluable for debugging “what exactly did the app send?”

Protocol decoders — Wireshark understands 2,000+ protocols. It will automatically decode HTTP, DNS, TLS (if you have keys), gRPC, Redis protocol, etc. The Info column shows you a human-readable summary.

Display filters vs capture filters — Capture filters (the BPF syntax tcpdump uses) decide which packets are recorded. Display filters are applied after capture: they don’t reduce what’s captured, only what’s shown, and their syntax is richer. Use display filters for analysis:

# Show only retransmissions
tcp.analysis.retransmission

# Show only DNS failures (NXDOMAIN)
dns.flags.rcode == 3

# Show TLS handshake failures
tls.alert_message.level == 2

# Slow requests > 1 second
http.time > 1

Expert Information — Analyze menu → Expert Information. Wireshark auto-flags anomalies: retransmissions, duplicate ACKs, zero windows, TCP resets, protocol violations. Start here when you open an unfamiliar capture.

TLS decryption with SSLKEYLOGFILE — If you set SSLKEYLOGFILE=/tmp/keys.log in the environment before running Chrome/Firefox/curl, TLS session keys get logged. Import this file in Wireshark (Edit → Preferences → Protocols → TLS → (Pre)-Master-Secret log filename) and you can see decrypted HTTPS traffic.

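# Start the capture first, in the background, so the handshake is recorded
tcpdump -i any -w /tmp/capture.pcap port 443 &
SSLKEYLOGFILE=/tmp/keys.log curl https://api.example.com/endpoint
kill $!   # stop tcpdump once the request has completed
# Then open capture.pcap + keys.log in Wireshark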

ELI5: Wireshark is like watching a conversation through a magnifying glass instead of a keyhole. tcpdump captures the raw bytes; Wireshark translates them into “Client said: GET /api/users HTTP/1.1, Server replied: 200 OK, body is 1,234 bytes.” Without TLS keys, HTTPS traffic looks like gibberish. With the keys file, you can read the actual application data.

5. Network Debugging Tools

Quick Reference

Tool	Use case	When to reach for it
`ping`	Connectivity, RTT	First step in any diagnosis
`traceroute` / `mtr`	Path analysis, where packets die	When ping fails or latency is high
`dig`	DNS debugging	When hostname resolution is wrong
`curl -v`	HTTP debugging	When the application layer is the question
`ss` / `netstat`	Socket state	When “connection refused” or “too many connections”
`iperf3`	Bandwidth testing	When you need to prove link capacity
`nmap`	Port scanning	When you don’t know what’s listening
`openssl s_client`	TLS debugging	When TLS handshakes fail

Practical Examples

mtr — continuous traceroute with packet loss per hop:

mtr --report --report-cycles 100 8.8.8.8
# Shows RTT and loss% at each hop. A hop with 10% loss that 
# doesn't affect subsequent hops = ICMP rate-limiting (normal).
# A hop with 10% loss where all subsequent hops also have 10% = real loss.
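The rule of thumb in the comments above can be written as a small heuristic. This is a sketch under simple assumptions: the input is the loss percentage per hop in path order, and the 1% threshold is an arbitrary illustrative choice.

```python
def classify_hop_loss(loss_by_hop, threshold_pct=1.0):
    """Label each hop's loss as 'ok', 'real loss', or 'likely ICMP rate limiting'.

    Loss counts as real only if every later hop (including the destination)
    also shows loss; otherwise the router is probably just deprioritizing
    its own ICMP replies.
    """
    verdicts = []
    for i, loss in enumerate(loss_by_hop):
        if loss < threshold_pct:
            verdicts.append("ok")
        elif all(later >= threshold_pct for later in loss_by_hop[i + 1:]):
            verdicts.append("real loss")
        else:
            verdicts.append("likely ICMP rate limiting")
    return verdicts


if __name__ == "__main__":
    print(classify_hop_loss([0, 10, 0, 0]))
    print(classify_hop_loss([0, 0, 10, 10, 10]))
```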

dig — DNS debugging:

# Query specific DNS server to bypass cache
dig @8.8.8.8 api.example.com A

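# Trace the delegation path from the root servers down
# (CNAME chains show up in the ANSWER section of a plain query)
dig +trace api.example.com

# Find who's authoritative (query the zone apex)
dig example.com NS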

openssl s_client — TLS debugging:

# Full TLS handshake details
openssl s_client -connect api.example.com:443 -servername api.example.com

# Check certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null \
  | openssl x509 -noout -dates

# Test specific TLS version
openssl s_client -connect api.example.com:443 -tls1_2

ss — socket states:

# Summary of socket counts by protocol and state
ss -s

# Who's listening on which port
ss -tlnp

# Count TIME_WAIT connections (port exhaustion indicator); -H omits the header line
ss -Htan state time-wait | wc -l
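When you want a per-state breakdown rather than a single count, the output of `ss -tan` is easy to tally. A minimal sketch, assuming the first column of each row is the socket state, as in iproute2's default layout:

```python
import subprocess
from collections import Counter


def count_socket_states(ss_output):
    """Tally the first column (socket state) of `ss -tan` output."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "State":  # skip blanks and the header row
            continue
        counts[fields[0]] += 1
    return counts


def snapshot():
    """Run `ss -tan` on this machine and return the per-state counts (Linux only)."""
    result = subprocess.run(["ss", "-tan"], capture_output=True, text=True, check=True)
    return count_socket_states(result.stdout)
```

Calling `snapshot()` periodically and watching the TIME-WAIT count grow is a cheap way to spot a client that opens a new connection per request.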

ELI5: These tools are like a doctor’s toolkit. ping is checking if the patient is alive (pulse). traceroute is tracing where the blood flow stops. dig is testing if the phone directory works. openssl s_client is verifying the ID badge is valid before letting someone in the door. You always start with the simplest check and escalate.

6. Common Network Problems

Diagnostic Patterns

DNS resolution failures:

dig returns NXDOMAIN → hostname doesn’t exist or wrong resolver configured
dig hangs → resolver unreachable (firewall, wrong IP in /etc/resolv.conf)
Works with IP, fails with hostname → DNS-only problem
Works from laptop, fails from server → check /etc/resolv.conf and /etc/hosts on the server

TCP connection timeouts vs refused:

Symptom	Diagnosis
`Connection refused` immediately	Server is reachable but nothing listening on that port
`Connection timed out` after 1–2 minutes (about 127s with Linux defaults, ~75s on BSD/macOS)	Firewall is silently dropping SYN packets (no RST sent)
`Connection timed out` after 3–5s	Custom timeout, or application-level rejection
`No route to host`	Routing problem, or firewall sending ICMP unreachable
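The symptoms in this table map directly onto the exceptions a client sees. The sketch below classifies a single connection attempt with Python's standard socket module; the function name and the returned labels are illustrative, and the short timeout is an assumption chosen so the probe returns quickly instead of waiting for the OS default.

```python
import errno
import socket


def classify_connect(host, port, timeout_s=3.0):
    """Attempt one TCP connection and name the failure mode."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        return "refused"        # RST received: host reachable, nothing listening
    except (socket.timeout, TimeoutError):
        return "timed out"      # no reply: SYNs are probably being dropped
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no route"   # routing problem or ICMP unreachable
        return f"error: {exc}"  # includes DNS failures (socket.gaierror)


if __name__ == "__main__":
    print(classify_connect("127.0.0.1", 22))
```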

TLS handshake failures — read the error message carefully:

certificate verify failed → Cert chain issue, expired cert, wrong hostname, or missing CA
handshake failure (40) → Cipher suite mismatch — server and client share no common cipher
unrecognized name (112) → SNI mismatch — you’re hitting the wrong virtual host
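certificate unknown (46) → The peer could not accept the certificate for an unspecified reason, often a CA it doesn’t trust. A required client certificate that was not sent shows up as certificate required (116) in TLS 1.3, or handshake failure (40) in TLS 1.2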

Connection resets (RST):

Mid-connection RST from client: app crashed, or connection pool returning bad connections
RST from direction of firewall: stateful firewall rule, IDS blocking, or asymmetric routing causing firewall to see half a connection

MTU issues — one of the sneakiest problems:

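Large packets get fragmented or dropped silently if an intermediate router has a lower MTU and ICMP “fragmentation needed” messages are blocked. Symptom: small requests work, large ones hang (the TCP handshake succeeds, data never arrives). To confirm, send don’t-fragment pings of increasing size and see where they stop getting through:

# Test MTU with increasing payload sizes
# (-s is the ICMP payload; IP + ICMP headers add 28 bytes, so 1472 fills a 1500-byte MTU)
ping -M do -s 1400 gateway_ip   # Linux: don't fragment
ping -D -s 1400 gateway_ip      # macOS: don't fragment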

ELI5: MTU problems are like trying to ship a king-size mattress through a doorway that’s only wide enough for a twin. The handshake (small packages) gets through fine. The actual data (big packages) gets stuck. The system should automatically send a “door too small” message (ICMP fragmentation needed) but if that message is blocked by a firewall, you’re stuck with a mystery: “why does my connection establish but never transfer data?”
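Probing by hand gets tedious, so here is a sketch of the search itself. The `probe` callable is hypothetical: it stands in for whatever sends one don't-fragment ping of a given packet size and reports whether a reply came back. The search assumes the `low` size always gets through.

```python
IP_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def ping_payload_for_mtu(mtu):
    """ICMP payload size (the value for ping -s) that produces a packet of exactly `mtu` bytes."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def find_path_mtu(probe, low=576, high=1500):
    """Binary-search the largest packet size for which probe(size) returns True.

    probe: caller-supplied function (hypothetical here) that sends one
    don't-fragment packet of `size` bytes and returns True if it got through.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid fits, try larger
        else:
            high = mid - 1   # mid is too big, try smaller
    return low


if __name__ == "__main__":
    # Simulated path with a 1420-byte bottleneck (for example, a tunnel)
    print(find_path_mtu(lambda size: size <= 1420))
    print(ping_payload_for_mtu(1500))
```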

Common mistake: Blaming the application for slow responses when the network is the culprit. To distinguish:

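Run curl -s -o /dev/null -w "%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n" https://api.example.com/endpoint to get cumulative timestamps (seconds since the start) for DNS, TCP connect, TLS handshake, TTFB, and total
If time_connect minus time_namelookup is high → network latency or packet loss
If time_starttransfer is high but time_connect and time_appconnect are fine → application/server processing slow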
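curl's timing variables are cumulative, measured from the start of the transfer, so the per-phase durations come from subtracting neighbours. A small sketch of that subtraction; the phase names are illustrative, and `time_appconnect` (the end of the TLS handshake) is reported by curl as 0 for plain HTTP.

```python
def curl_phases(namelookup, connect, appconnect, starttransfer, total):
    """Convert curl's cumulative timestamps (seconds) into per-phase durations."""
    handshake_done = appconnect if appconnect > 0 else connect
    return {
        "dns": namelookup,
        "tcp_connect": connect - namelookup,
        "tls_handshake": appconnect - connect if appconnect > 0 else 0.0,
        # request send + one network round trip + server processing
        "server_wait": starttransfer - handshake_done,
        "download": total - starttransfer,
    }


if __name__ == "__main__":
    # Made-up sample values for illustration
    for phase, seconds in curl_phases(0.010, 0.060, 0.170, 0.420, 0.450).items():
        print(f"{phase:>14}: {seconds * 1000:.0f} ms")
```

In this made-up sample the network phases are modest and most of the time is spent in `server_wait`, which points at the application rather than the path.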

7. Performance Optimization

Connection Management

Keep-alive / connection reuse — opening a new TCP connection costs 1 RTT (TCP handshake) + 1 RTT (TLS 1.3 handshake; TLS 1.2 needs 2) = at least 2 RTTs before the first byte of request data. At 100ms RTT, that’s 200ms of pure overhead on every request if you don’t reuse connections. HTTP/1.1 keep-alive and HTTP/2 multiplexing both avoid this.
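A back-of-the-envelope sketch of what connection reuse saves. It counts only handshake round trips, under the assumption of one TCP RTT plus one RTT for TLS 1.3 (two for TLS 1.2), and ignores DNS and slow start.

```python
TLS_HANDSHAKE_RTTS = {None: 0, "1.2": 2, "1.3": 1}


def setup_rtts(tls_version="1.3"):
    """Round trips spent on handshakes before the first request byte is sent."""
    return 1 + TLS_HANDSHAKE_RTTS[tls_version]  # 1 RTT for the TCP handshake


def total_setup_overhead_ms(requests, rtt_ms, reuse, tls_version="1.3"):
    """Total handshake time across `requests` sequential requests."""
    handshakes = 1 if reuse else requests
    return handshakes * setup_rtts(tls_version) * rtt_ms


if __name__ == "__main__":
    print("no reuse:", total_setup_overhead_ms(10, 100, reuse=False), "ms")
    print("reuse:   ", total_setup_overhead_ms(10, 100, reuse=True), "ms")
```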

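TLS session resumption — After a full TLS handshake, the server issues a session ticket (an encrypted blob with the session keys). On reconnect, the client sends the ticket; the server decrypts it and resumes without a full handshake. This saves 1 RTT on TLS 1.2 (2 RTTs down to 1); on TLS 1.3 the handshake is already 1 RTT, so resumption mainly skips the certificate exchange and signature checks. Check if your server has it enabled: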

# -reconnect repeats the handshake 5 times with the same session; look for "Reused, TLSv1.3, Cipher is ..."
openssl s_client -connect api.example.com:443 -reconnect </dev/null 2>&1 | grep -E "Reused|New|Session"

TCP Tuning

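Nagle’s algorithm — TCP holds back a small write while earlier data is still unacknowledged, so that several small writes can be combined into one larger segment. This improves efficiency on slow links, but combined with delayed ACKs on the receiver (roughly 40ms on Linux, up to 200ms on some other stacks) it can stall interactive apps (SSH, game servers, API calls that send small payloads) for that long on each small write.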

Disable for latency-sensitive apps:

/* sock is an already-created TCP socket; needs <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h> */
int flag = 1;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));

Or in most frameworks: enable “TCP_NODELAY” or “no delay” socket option.
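For comparison, the same option set from Python's standard socket module. This is a minimal sketch; the helper name is illustrative, and most HTTP client libraries expose their own switch for this.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Send small writes immediately instead of waiting to coalesce them
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_low_latency_socket()
    print("TCP_NODELAY:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    s.close()
```

The option only affects the sending side, so each end of the connection that sends small messages needs to set it.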

ELI5: Nagle’s algorithm is like a bus that waits 5 minutes for more passengers before departing. Great for filling up the bus (throughput), terrible if you’re alone and in a hurry (latency). Disable it when each “passenger” (packet) needs to leave immediately, even if the bus is mostly empty.

TCP congestion control algorithms:

Algorithm	Best for	Note
CUBIC (default Linux)	General-purpose; high-bandwidth paths with little random loss	Good default
BBR	Long-distance / lossy links	Google’s algorithm, better on WAN
RENO	Legacy, avoid	Slow recovery from loss

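# Check the current algorithm and the ones the kernel has available
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control

# Switch to BBR (needs root; if bbr is not listed above, load it with: modprobe tcp_bbr)
sysctl -w net.ipv4.tcp_congestion_control=bbr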

Application-Level Wins

Optimization	Saves	When to use
gzip/brotli compression	60–90% bandwidth	Text responses (HTML, JSON, CSS)
protobuf vs JSON	50–80% bandwidth + CPU	High-frequency API calls
CDN / edge caching	Propagation delay	Static assets, read-heavy APIs
DNS prefetch (`<link rel="dns-prefetch">`)	One DNS RTT	Known third-party domains
`preconnect`	TCP + TLS RTT	Critical third-party origins
HTTP/2 server push	1 RTT	Critical assets (CSS, fonts); now deprecated in practice (Chrome removed support), so prefer `preload` or 103 Early Hints

ELI5: CDNs are like Amazon distribution warehouses. Instead of shipping everything from one giant factory in the middle of the country, they put smaller warehouses near big cities. Your package ships from 50 miles away instead of 2,000 miles. Same content, dramatically less propagation delay.

8. Bufferbloat and Queuing

What Is Bufferbloat?

When a router’s output buffer fills up during congestion, instead of dropping packets (which would signal TCP to slow down), it holds them in a huge queue. Result: packets eventually get delivered, but after waiting in a queue for hundreds of milliseconds or even seconds. Latency skyrockets while throughput stays the same.

The irony: bufferbloat was caused by cheap RAM making huge buffers affordable. “Bigger buffers = better performance” seemed intuitive. It was wrong.

ELI5: Bufferbloat is like a grocery store that handles checkout lines by adding an infinitely long waiting area. Nobody leaves (no dropped packets) but everyone waits forever. The right fix is to occasionally close a checkout lane (drop packets), which signals people to come back at a less busy time (TCP backs off). Holding everyone in line forever doesn’t help — it just makes everyone miserable.

How to Detect Bufferbloat

# Baseline ping with no load
ping -c 20 8.8.8.8

# Start a large download, THEN ping
wget -q -O /dev/null "$LARGE_FILE_URL" &   # LARGE_FILE_URL: any large file you are allowed to download
ping -c 20 8.8.8.8

# Bufferbloat: baseline = 15ms, under load = 800ms
# Healthy: baseline = 15ms, under load = 20ms

Typical symptom: everything feels fine at idle, but when someone on the same network starts a Netflix stream or large download, everyone’s latency jumps by 500ms+.
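Once you have the two sets of ping times, the comparison is a subtraction of medians. A minimal sketch; the 30 ms cutoff for "healthy" is an assumed threshold for illustration, not a standard.

```python
from statistics import median


def added_latency_ms(idle_rtts_ms, loaded_rtts_ms):
    """Extra latency under load: median loaded RTT minus median idle RTT."""
    return median(loaded_rtts_ms) - median(idle_rtts_ms)


def verdict(added_ms, healthy_below_ms=30):
    """Classify the result; the cutoff is an illustrative assumption."""
    return "healthy" if added_ms < healthy_below_ms else "bufferbloat"


if __name__ == "__main__":
    idle = [14, 15, 16]          # made-up baseline RTTs in ms
    loaded = [790, 800, 810]     # made-up RTTs during a large download
    extra = added_latency_ms(idle, loaded)
    print(extra, verdict(extra))
```

Medians are used rather than means so that a single lost or delayed ping does not dominate the result.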

AQM — The Fix

Active Queue Management (AQM) deliberately drops or marks packets before the queue is full, giving TCP early feedback to slow down. This keeps queues short (and therefore latency low).

AQM Algorithm	Description	Status
CoDel (Controlled Delay)	Drops packets if min latency > 5ms for >100ms	Deployed widely
FQ-CoDel	CoDel + fair queuing (each flow gets equal share)	Best current default
CAKE	Successor to FQ-CoDel, handles shaping too	Linux 4.19+
RED (Random Early Detection)	Old standard, parameter-sensitive	Legacy

# Check your current qdisc
tc qdisc show dev eth0

# Enable fq_codel on an interface
tc qdisc replace dev eth0 root fq_codel

On OpenWrt routers, FQ-CoDel and CAKE are available through the SQM (Smart Queue Management) package. On a typical cable connection, enabling SQM can cut latency under load from hundreds of milliseconds to a few tens.

ELI5: AQM is like a smart toll booth that notices traffic backing up and briefly stops letting cars onto the highway before the merge becomes a disaster. Yes, a few cars wait at the booth (dropped packets get retransmitted). But the highway itself stays flowing freely. Without AQM, all the cars get on the highway and sit bumper-to-bumper for miles.

Debugging Decision Table

When something is “slow” or “broken,” work through this table top-to-bottom:

Question	Tool	Positive result means
Can we reach the host at all?	`ping <host>`	Basic IP connectivity
Where does the path break?	`mtr <host>`	Find the failing hop
Is DNS resolving correctly?	`dig <hostname>`	DNS not the problem
Is the port open?	`nc -zv host port` or `nmap`	Process is listening
Is the TCP handshake completing?	`tcpdump` + SYN/SYN-ACK	No firewall blocking
Is TLS working?	`openssl s_client -connect`	Cert/cipher OK
How long does each HTTP phase take?	`curl -w "%{time_*}"`	Isolate DNS/TCP/TLS/App
Are there retransmissions?	Wireshark Expert Info	Packet loss present
Is the bandwidth what we expect?	`iperf3 -c server -P 8`	Link capacity confirmed
Is latency spiking under load?	`ping` while running `iperf3`	Bufferbloat present

When you’re stuck: capture with tcpdump -w, analyze in Wireshark, start at Expert Information, follow the TCP stream for the failing request. The answer is almost always in the packet capture.
