Network Performance & Debugging
Performance bugs are the worst: the app “works” but feels broken. Slow response times, intermittent timeouts, mysterious packet loss — these don’t show up in application logs. You need to understand the network layer, know which tool to reach for, and be able to read raw packet traces. This file covers everything from first principles to hands-on debugging.
1. Latency — The Silent Killer
Latency is the time for a message to travel from A to B. It has four components, and confusing them leads to bad diagnoses.
The Four Components
| Component | Cause | Typical magnitude | Variable? |
|---|---|---|---|
| Propagation delay | Speed of light in medium | 1ms–150ms | No |
| Serialization delay | Pushing bits onto the wire | <1ms at high BW | No |
| Processing delay | Router/NIC/server CPU | <1ms (modern hardware) | Slightly |
| Queuing delay | Waiting in buffers | 0ms–seconds | Yes — a lot |
Propagation delay is physics. Light travels ~300,000 km/s in a vacuum, but fiber is about 2/3 of that — roughly 200,000 km/s. New York to London is ~5,500 km, so one-way propagation is ~27ms. RTT is ~54ms — and that’s in a straight line with zero queuing. Real cross-Atlantic RTT is ~85–100ms.
ELI5: Propagation delay is like the speed of sound. You can’t shout faster than ~340 m/s no matter how loud you are. Similarly, at ~200,000 km/s in fiber you can’t get data across the ~8,300 km from San Francisco to Tokyo in under ~41ms one way, no matter how good your hardware is. Physics won’t allow it. Every millisecond of “irreducible” latency is just distance.
Serialization delay is how long it takes to push all the bits of a packet onto the wire. A 1,500 byte packet on a 1 Mbps link takes $\frac{1500 \times 8}{1{,}000{,}000} = 12ms$. On a 1 Gbps link, it’s 12 microseconds. Only matters on slow links (DSL, satellite).
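The two fixed components can be checked with a few lines of arithmetic. The sketch below simply re-derives the figures quoted above; the function names are illustrative, and the 200,000 km/s fiber speed is the approximation used in the text.

```python
# Approximate speed of light in fiber (about 2/3 of c), as used in the text.
FIBER_KM_PER_S = 200_000


def propagation_delay_ms(distance_km: float) -> float:
    """One-way propagation delay over fiber, in milliseconds."""
    return distance_km / FIBER_KM_PER_S * 1000


def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000


print(f"NY to London, one way: {propagation_delay_ms(5_500):.1f} ms")           # 27.5 ms
print(f"1500 B at 1 Mbps:      {serialization_delay_ms(1_500, 1e6):.3f} ms")    # 12.000 ms
print(f"1500 B at 1 Gbps:      {serialization_delay_ms(1_500, 1e9):.3f} ms")    # 0.012 ms
```

Neither number changes with load, which is why they are marked "not variable" in the table.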
Queuing delay is the dangerous one — it’s variable and causes jitter. When a router’s output buffer is backed up, packets sit and wait. This is where bufferbloat lives (see Section 8). Jitter is the variance in delay; it destroys real-time applications like VoIP and gaming even when average latency is acceptable.
Bandwidth-Delay Product
$$\text{BDP} = \text{Bandwidth} \times \text{RTT}$$
BDP is the amount of data that can be “in flight” at any moment. On a 1 Gbps link with 100ms RTT: $1{,}000{,}000{,}000 \text{ bits/s} \times 0.1 \text{ s} = 100{,}000{,}000$ bits = 12.5 MB in flight.
Why does this matter? TCP’s congestion window limits how much unacknowledged data can be in flight. If your window is smaller than BDP, you can never fill the pipe — your throughput is capped at $\frac{\text{window size}}{\text{RTT}}$. This is why a file transfer over a high-latency link (satellite) is agonizingly slow even with “100 Mbps” bandwidth — the small default window gets ACKed slowly and the pipe stays mostly empty.
ELI5: BDP is like a highway. A 10-lane highway (bandwidth) between two cities 500 km apart (RTT) can have thousands of cars on it at once (BDP). If the on-ramp only lets 10 cars in at a time (small TCP window), the highway is mostly empty and you’re wasting capacity. You need to open the on-ramp (increase window size) to actually use the road.
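To make the window cap concrete, here is a minimal sketch of both formulas. The 65,535-byte window (the largest TCP window without window scaling) and the 600ms satellite RTT are illustrative assumptions, not measurements from this document.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product, converted from bits to bytes."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s


# 1 Gbps link, 100 ms RTT
print(f"BDP: {bdp_bytes(1e9, 0.1) / 1e6:.1f} MB")  # 12.5 MB

# Assumed: 65,535-byte window over a 600 ms satellite path
capped = window_limited_throughput_bps(65_535, 0.6)
print(f"Window-limited throughput: {capped / 1e6:.2f} Mbps")  # 0.87 Mbps
```

Under those assumptions a "100 Mbps" satellite link delivers less than 1 Mbps per connection, because the window is far smaller than the BDP.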
Tail Latency — P50 / P95 / P99
Never report average latency. Use percentiles:
| Metric | Meaning | Who cares |
|---|---|---|
| P50 | Median — half of requests are faster | Marketing |
| P95 | 95th percentile — 1 in 20 requests | Engineering |
| P99 | 99th percentile — 1 in 100 requests | SLAs, real users |
| P999 | 1 in 1,000 | High-scale systems |
At 1,000 RPS, 10 requests every second land at or beyond your P99. Tail latency also compounds across service calls: if a request passes through a chain of 5 services and each has an independent 1% chance of hitting its P99-or-worse latency, then $1 - (1 - 0.01)^5 \approx 4.9\%$ of requests see the worst case from at least one service.
Why this matters: If your P99 is 2 seconds and average is 50ms, your average looks great but 1% of requests are waiting 2 seconds. At a million requests per day, that’s 10,000 requests a day delivering a bad experience. Averages hide the outliers that define user satisfaction.
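A small sketch of how the mean hides the tail, and how the tail compounds across a call chain. The latency samples are made up for illustration, the nearest-rank percentile is one of several common definitions, and the chain formula assumes the services are independent.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tail_hit_probability(per_service_tail: float, n_services: int) -> float:
    """Chance that at least one of n independent services lands in its own tail."""
    return 1 - (1 - per_service_tail) ** n_services


# Hypothetical sample: 98 fast requests and 2 slow ones
latencies_ms = [50] * 98 + [2000] * 2

print(f"mean: {sum(latencies_ms) / len(latencies_ms):.1f} ms")  # 89.0 ms
print(f"P50:  {percentile(latencies_ms, 50)} ms")               # 50 ms
print(f"P99:  {percentile(latencies_ms, 99)} ms")               # 2000 ms
print(f"5-service chain: {tail_hit_probability(0.01, 5):.3f}")  # 0.049
```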
2. Bandwidth vs Throughput vs Goodput
These three words are often used interchangeably. They are not the same thing.
| Term | Definition | Layer |
|---|---|---|
| Bandwidth | Maximum theoretical capacity of the link | Physical |
| Throughput | Actual data transferred per second | Transport |
| Goodput | Application-layer useful data per second | Application |
Why 1 Gbps doesn’t mean 1 Gbps file transfers:
- Protocol overhead: TCP/IP headers add ~40 bytes per ~1,460 bytes of payload. ~2.7% overhead.
- Congestion window growth: TCP starts slow (slow start) and ramps up. Short-lived connections never reach full speed.
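- Retransmissions: Every lost segment is sent twice, and loss-based congestion control (Reno, CUBIC) also shrinks the window on each loss, so a 1% loss rate usually costs well over 1% of throughput, especially at high RTT.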
- ACK traffic: Every data packet generates ACK traffic in the reverse direction.
- Application behavior: If the app sends data in small batches (Nagle’s algorithm, small writes), the pipe is mostly idle.
ELI5: Bandwidth is the maximum speed on a highway. Throughput is how fast you’re actually driving (you slow for traffic, tolls, and construction). Goodput is how much useful cargo arrives — not counting the truck itself, the fuel, and the packaging. A “1 Gbps” file transfer might only move 700–900 Mbps of actual file data when you account for all the overhead.
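The header-overhead item can be put into numbers. This sketch assumes option-free IPv4 and TCP headers (40 bytes in total, as in the list above) and standard Ethernet framing of 38 bytes per frame (preamble 8, header 14, FCS 4, inter-frame gap 12); both are assumptions about the link rather than facts from this document.

```python
TCP_IP_HEADERS = 40      # 20 B IPv4 + 20 B TCP, no options (assumed)
ETHERNET_OVERHEAD = 38   # preamble 8 + header 14 + FCS 4 + inter-frame gap 12 (assumed)


def payload_efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Fraction of transmitted bytes that is application payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


ip_level = payload_efficiency(1460, TCP_IP_HEADERS)
on_wire = payload_efficiency(1460, TCP_IP_HEADERS + ETHERNET_OVERHEAD)

print(f"Efficiency counting TCP/IP headers only: {ip_level:.1%}")   # 97.3%
print(f"Efficiency counting Ethernet framing:    {on_wire:.1%}")    # 94.9%
print(f"Best-case goodput on 1 Gbps: {on_wire * 1000:.0f} Mbps")    # 949 Mbps
```

This is only the ceiling from framing. Slow start, loss, and application stalls push real goodput lower.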
Common mistake: Measuring throughput with wget and declaring the link “fine.” wget measures transfer rate for one connection. You need iperf3 with multiple parallel streams to actually saturate a high-bandwidth link:
# Server side
iperf3 -s
# Client side — 8 parallel streams, 30 second test
iperf3 -c 192.168.1.1 -P 8 -t 30
3. tcpdump — Packet Capture
tcpdump is your first responder. It runs on any Linux/macOS system with no GUI, captures on remote servers, and can write pcap files for offline analysis.
Basic Syntax
# Capture all traffic on eth0, no DNS resolution (-n), verbose (-v)
tcpdump -i eth0 -n -v
# Capture only port 443 traffic
tcpdump -i eth0 -n port 443
# Capture HTTP traffic from a specific host
tcpdump -i eth0 -n 'host 10.0.1.5 and port 80'
# Save to file for Wireshark analysis (-s 0 = full packet)
tcpdump -i eth0 -n -s 0 -w /tmp/capture.pcap port 443
# Read back a saved capture
tcpdump -r /tmp/capture.pcap -n
Filter Expression Reference
| Filter | Example | Captures |
|---|---|---|
| host | host 10.0.0.1 | Traffic to/from IP |
| src / dst | src 10.0.0.1 | One direction only |
| port | port 5432 | Any traffic on port |
| net | net 10.0.0.0/24 | Entire subnet |
| tcp[tcpflags] | tcp[tcpflags] & tcp-rst != 0 | RST packets |
| Logical | and, or, not | Combine filters |
What to Look For
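# Catch SYNs and RSTs: connection attempts and abrupt teardowns, a sign of connection problems
# (a capture filter cannot detect retransmissions; use Wireshark's tcp.analysis.retransmission for that)
tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-rst|tcp-syn) != 0'
# Zero window advertisements: receiver buffer full, sender blocked (RSTs excluded, they often carry window 0)
tcpdump -i eth0 -n 'tcp[14:2] == 0 and tcp[tcpflags] & tcp-rst == 0'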
# SYN flood detection — too many SYNs without SYN-ACK
tcpdump -i eth0 -n 'tcp[tcpflags] == tcp-syn'
ELI5: tcpdump is like a wiretap on a phone line — everything going in or out of a network interface gets recorded. Filters let you focus on the suspicious calls. On a busy server, you’d drown in noise without filters, so you narrow to “only show me calls from this IP on this port.”
Common mistake: Running tcpdump without -n on a busy server. DNS reverse-lookups for every IP address add latency and CPU pressure, and can cause tcpdump to drop packets. Always use -n in production.
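The flag filters above are plain bit tests on the TCP flags byte (offset 13 in the TCP header). The sketch below mirrors the two filter styles in Python to show why `tcp[tcpflags] & tcp-syn != 0` also matches SYN-ACKs while `tcp[tcpflags] == tcp-syn` does not. The helper names are illustrative.

```python
# Bit values of the TCP flags byte, lowest bit first.
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def decode_flags(flags_byte: int) -> list[str]:
    """Names of all flags set in the byte."""
    return [name for name, bit in TCP_FLAGS.items() if flags_byte & bit]


def matches_any(flags_byte: int, mask: int) -> bool:
    """Like 'tcp[tcpflags] & mask != 0'. Parentheses are required in Python."""
    return (flags_byte & mask) != 0


def matches_exact(flags_byte: int, value: int) -> bool:
    """Like 'tcp[tcpflags] == value': no other flag may be set."""
    return flags_byte == value


syn_ack = TCP_FLAGS["SYN"] | TCP_FLAGS["ACK"]  # 0x12
print(decode_flags(syn_ack))                     # ['SYN', 'ACK']
print(matches_any(syn_ack, TCP_FLAGS["SYN"]))    # True
print(matches_exact(syn_ack, TCP_FLAGS["SYN"]))  # False
```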
4. Wireshark — Deep Packet Analysis
Wireshark is the GUI counterpart to tcpdump. Use it for analysis, not capture — capture on the server with tcpdump, copy the .pcap file, analyze locally in Wireshark.
Key Features
Following TCP streams — Right-click any packet → “Follow → TCP Stream”. Wireshark reassembles the full conversation. You see the raw HTTP request/response, or whatever protocol is in use. Invaluable for debugging “what exactly did the app send?”
Protocol decoders — Wireshark understands 2,000+ protocols. It will automatically decode HTTP, DNS, TLS (if you have keys), gRPC, Redis protocol, etc. The Info column shows you a human-readable summary.
Display filters vs capture filters — Display filters are applied after capture; they don’t reduce what’s captured, just what’s shown. More powerful syntax. Use display filters for analysis:
# Show only retransmissions
tcp.analysis.retransmission
# Show only DNS failures (NXDOMAIN)
dns.flags.rcode == 3
# Show TLS handshake failures
tls.alert_message.level == 2
# Slow requests > 1 second
http.time > 1
Expert Information — Analyze menu → Expert Information. Wireshark auto-flags anomalies: retransmissions, duplicate ACKs, zero windows, TCP resets, protocol violations. Start here when you open an unfamiliar capture.
TLS decryption with SSLKEYLOGFILE — If you set SSLKEYLOGFILE=/tmp/keys.log in the environment before running Chrome/Firefox/curl, TLS session keys get logged. Import this file in Wireshark (Edit → Preferences → TLS → (Pre)-Master-Secret log filename) and you can see decrypted HTTPS traffic.
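# Start the capture first (in the background or in another terminal)
tcpdump -i any -w /tmp/capture.pcap port 443 &
# Then generate the traffic with key logging enabled
SSLKEYLOGFILE=/tmp/keys.log curl https://api.example.com/endpoint
# Stop tcpdump, then open capture.pcap + keys.log in Wireshark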
ELI5: Wireshark is like watching a conversation through a magnifying glass instead of a keyhole. tcpdump captures the raw bytes; Wireshark translates them into “Client said: GET /api/users HTTP/1.1, Server replied: 200 OK, body is 1,234 bytes.” Without TLS keys, HTTPS traffic looks like gibberish. With the keys file, you can read the actual application data.
5. Network Debugging Tools
Quick Reference
| Tool | Use case | When to reach for it |
|---|---|---|
| ping | Connectivity, RTT | First step in any diagnosis |
| traceroute / mtr | Path analysis, where packets die | When ping fails or latency is high |
| dig | DNS debugging | When hostname resolution is wrong |
| curl -v | HTTP debugging | When the application layer is the question |
| ss / netstat | Socket state | When “connection refused” or “too many connections” |
| iperf3 | Bandwidth testing | When you need to prove link capacity |
| nmap | Port scanning | When you don’t know what’s listening |
| openssl s_client | TLS debugging | When TLS handshakes fail |
Practical Examples
mtr — continuous traceroute with packet loss per hop:
mtr --report --report-cycles 100 8.8.8.8
# Shows RTT and loss% at each hop. A hop with 10% loss that
# doesn't affect subsequent hops = ICMP rate-limiting (normal).
# A hop with 10% loss where all subsequent hops also have 10% = real loss.
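The interpretation rule in those comments is mechanical enough to write down. This is a rough sketch, assuming you have already extracted the per-hop loss percentages from an mtr report into a list ordered from the first hop to the destination; the function name and the 1% threshold are illustrative choices.

```python
def classify_hops(loss_pct: list[float], threshold: float = 1.0) -> list[str]:
    """Label each hop: loss that persists to the destination is real,
    loss that disappears at later hops is most likely ICMP rate limiting."""
    labels = []
    for i, loss in enumerate(loss_pct):
        if loss < threshold:
            labels.append("ok")
        elif all(later >= threshold for later in loss_pct[i + 1:]):
            labels.append("real loss")
        else:
            labels.append("likely ICMP rate limiting")
    return labels


# Hop 2 drops probes but later hops are clean: rate limiting
print(classify_hops([0, 10, 0, 0]))
# Loss starts at hop 3 and carries through to the destination: real loss
print(classify_hops([0, 0, 10, 10, 10]))
```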
dig — DNS debugging:
# Query specific DNS server to bypass cache
dig @8.8.8.8 api.example.com A
# Trace the delegation path from the root servers down to the authoritative answer
dig +trace api.example.com
# Find who's authoritative
dig api.example.com NS
openssl s_client — TLS debugging:
# Full TLS handshake details
openssl s_client -connect api.example.com:443 -servername api.example.com
# Check certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null \
| openssl x509 -noout -dates
# Test specific TLS version
openssl s_client -connect api.example.com:443 -tls1_2
ss — socket states:
# Summary of socket counts by protocol and state
ss -s
# Who's listening on which port
ss -tlnp
# Count TIME_WAIT connections (port exhaustion indicator); -H drops the header line from the count
ss -H -tan state time-wait | wc -l
ELI5: These tools are like a doctor’s toolkit. ping is checking if the patient is alive (pulse). traceroute is tracing where the blood flow stops. dig is testing if the phone directory works. openssl s_client is verifying the ID badge is valid before letting someone in the door. You always start with the simplest check and escalate.
6. Common Network Problems
Diagnostic Patterns
DNS resolution failures:
- dig returns NXDOMAIN → hostname doesn’t exist or wrong resolver configured
- dig hangs → resolver unreachable (firewall, wrong IP in /etc/resolv.conf)
- Works with IP, fails with hostname → DNS-only problem
- Works from laptop, fails from server → check /etc/resolv.conf and /etc/hosts on the server
TCP connection timeouts vs refused:
| Symptom | Diagnosis |
|---|---|
| Connection refused immediately | Server is reachable but nothing listening on that port |
| Connection timed out after a long wait (~75s on BSD-derived stacks, ~2 min with Linux defaults) | Firewall is silently dropping SYN packets (no RST sent) |
| Connection timed out after 3–5s | Custom timeout, or application-level rejection |
| No route to host | Routing problem, or firewall sending ICMP unreachable |
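The same distinction is visible from code, because each row of the table surfaces as a different exception. A minimal sketch using only the standard library; the `probe` helper, its 3-second timeout, and the returned labels are illustrative choices.

```python
import errno
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connect and report which failure mode occurred."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # An RST came back: the host is reachable but nothing is listening.
        return "refused"
    except socket.timeout:
        # No answer at all: SYNs are probably being dropped silently.
        return "timeout"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"
        return f"error: {exc}"


if __name__ == "__main__":
    # Port 1 on localhost is normally closed, so this usually reports a refusal.
    print(probe("127.0.0.1", 1))
```

A fast "refused" points at the service; a slow "timeout" points at something on the path.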
TLS handshake failures — read the error message carefully:
- certificate verify failed → Cert chain issue, expired cert, wrong hostname, or missing CA
- handshake failure (40) → Usually a cipher suite or protocol version mismatch: server and client share no common parameters
- unrecognized name (112) → SNI mismatch: you’re hitting the wrong virtual host
- certificate unknown (46) → The peer rejected the certificate for an unspecified reason, often a client certificate it does not trust
Connection resets (RST):
- Mid-connection RST from client: app crashed, or connection pool returning bad connections
- RST from direction of firewall: stateful firewall rule, IDS blocking, or asymmetric routing causing firewall to see half a connection
MTU issues — one of the sneakiest problems:
Large packets get fragmented or dropped silently if an intermediate router has a lower MTU and ICMP “fragmentation needed” messages are blocked. Symptom: small requests work, large ones hang (the TCP handshake succeeds, data never arrives). To confirm, send don’t-fragment pings of increasing size (1472 bytes of payload is the largest that fits a 1500-byte MTU):
# Test MTU with increasing packet sizes
ping -M do -s 1400 gateway_ip # Linux: don't fragment
ping -D -s 1400 gateway_ip # macOS
ELI5: MTU problems are like trying to ship a king-size mattress through a doorway that’s only wide enough for a twin. The handshake (small packages) gets through fine. The actual data (big packages) gets stuck. The system should automatically send a “door too small” message (ICMP fragmentation needed) but if that message is blocked by a firewall, you’re stuck with a mystery: “why does my connection establish but never transfer data?”
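Stepping through ping sizes by hand is a search problem, so it can be sketched as a binary search. The `probe` callable here is hypothetical: in practice it would wrap a don't-fragment ping and return whether a reply came back. The 576-byte lower bound is an assumption that the smallest size always gets through.

```python
IPV4_HEADER = 20
ICMP_HEADER = 8


def ping_payload_for_mtu(mtu: int) -> int:
    """Value to pass to 'ping -s' so the full IPv4 packet is exactly mtu bytes."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def find_path_mtu(probe, low: int = 576, high: int = 1500) -> int:
    """Largest MTU in [low, high] for which probe(mtu) succeeds.
    Assumes probe(low) succeeds and that success is monotonic in size."""
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Simulated path whose narrowest link has a 1400-byte MTU
print(find_path_mtu(lambda mtu: mtu <= 1400))  # 1400
print(ping_payload_for_mtu(1500))              # 1472
```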
Common mistake: Blaming the application for slow responses when the network is the culprit. To distinguish:
curl -s -o /dev/null -w "%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n" <url> breaks down DNS, TCP, TLS, TTFB, and total. Each value is cumulative seconds since the start of the request.
- If time_connect is high → network latency or packet loss
- If time_starttransfer is high but time_connect is fine → application/server processing slow
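Because curl's timers are cumulative, the per-phase cost is the difference between neighbours. A short sketch, assuming the `-w` format also includes `%{time_appconnect}` (the TLS-complete timestamp, which curl reports as 0 for plain HTTP). The sample numbers are invented.

```python
def curl_phases(t: dict[str, float]) -> dict[str, float]:
    """Convert curl's cumulative timestamps (seconds) into per-phase durations."""
    # time_appconnect is 0 when there is no TLS; fall back to time_connect.
    tls_done = t.get("time_appconnect") or t["time_connect"]
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "tls": tls_done - t["time_connect"],
        "server_wait": t["time_starttransfer"] - tls_done,
        "download": t["time_total"] - t["time_starttransfer"],
    }


# Hypothetical measurement: fast network, slow application
sample = {
    "time_namelookup": 0.012,
    "time_connect": 0.062,
    "time_appconnect": 0.170,
    "time_starttransfer": 0.950,
    "time_total": 0.960,
}
for phase, seconds in curl_phases(sample).items():
    print(f"{phase:12s} {seconds * 1000:7.1f} ms")
```

In this sample the server wait dominates at 780ms, which points at the application rather than the network.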
7. Performance Optimization
Connection Management
Keep-alive / connection reuse: opening a new connection costs 1 RTT (TCP handshake) + 1 RTT (TLS 1.3 handshake; TLS 1.2 needs 2) = at least 2 RTTs before the first byte of request data can be sent. At 100ms RTT, that’s 200ms of pure overhead on every request if you don’t reuse connections. HTTP/1.1 keep-alive and HTTP/2 multiplexing both avoid this.
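The cost adds up quickly over many requests. A back-of-the-envelope sketch, assuming 2 setup RTTs per new connection as described above and ignoring everything else (request time, slow start, parallelism):

```python
def setup_overhead_ms(rtt_ms: float, requests: int, reuse: bool, setup_rtts: int = 2) -> float:
    """Total time spent on connection setup alone for a batch of requests."""
    connections = 1 if reuse else requests
    return connections * setup_rtts * rtt_ms


# 100 sequential requests at 100 ms RTT
print(setup_overhead_ms(100, 100, reuse=False))  # 20000 ms of handshakes
print(setup_overhead_ms(100, 100, reuse=True))   # 200 ms of handshakes
```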
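TLS session resumption: After a full TLS handshake, the server issues a session ticket (an encrypted blob with the session keys). On reconnect, the client sends the ticket; the server decrypts it and resumes without a full handshake. On TLS 1.2 this saves 1 RTT. On TLS 1.3 the full handshake is already 1 RTT, so resumption mainly saves the certificate exchange and public-key CPU work. Check if your server has it enabled:
# -reconnect makes 5 further connections with the first session; look for lines starting with "Reused,"
# (if every line says "New" on a TLS 1.3 server, retry with -tls1_2: some OpenSSL versions may not resume TLS 1.3 here)
openssl s_client -connect api.example.com:443 -reconnect </dev/null 2>&1 | grep -E "Reused|New|Session"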
TCP Tuning
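Nagle’s algorithm: TCP holds back a small write while earlier data is still unacknowledged, so that it can combine small writes into a larger segment. Nagle itself has no fixed timer; the notorious stalls come from its interaction with delayed ACKs on the receiver (typically ~40ms on Linux, up to 200ms on other stacks). This improves efficiency on slow links but can add tens to hundreds of milliseconds of latency for interactive apps (SSH, game servers, API calls that send small payloads).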
Disable for latency-sensitive apps:
int flag = 1;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
Or in most frameworks: enable “TCP_NODELAY” or “no delay” socket option.
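For comparison, the same option set through Python's standard `socket` module. The helper name is illustrative; the option itself is the same `TCP_NODELAY` used in the C call above.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# Read the option back to confirm the kernel accepted it (non-zero means enabled).
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)
sock.close()
```

Set the option before sending data. It applies per socket, so pooled connections each need it.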
ELI5: Nagle’s algorithm is like a bus that waits 5 minutes for more passengers before departing. Great for filling up the bus (throughput), terrible if you’re alone and in a hurry (latency). Disable it when each “passenger” (packet) needs to leave immediately, even if the bus is mostly empty.
TCP congestion control algorithms:
| Algorithm | Best for | Note |
|---|---|---|
| CUBIC (default Linux) | General-purpose; built for high-bandwidth, high-RTT paths | Good default; loss-based, so it backs off on random loss |
| BBR | Long-distance / lossy links | Google’s algorithm, better on WAN |
| RENO | Legacy, avoid | Slow recovery from loss |
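# Check and set congestion algorithm
sysctl net.ipv4.tcp_congestion_control
# bbr must appear in this list first (load it with: modprobe tcp_bbr)
sysctl net.ipv4.tcp_available_congestion_control
sysctl -w net.ipv4.tcp_congestion_control=bbr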
Application-Level Wins
| Optimization | Saves | When to use |
|---|---|---|
| gzip/brotli compression | 60–90% bandwidth | Text responses (HTML, JSON, CSS) |
| protobuf vs JSON | 50–80% bandwidth + CPU | High-frequency API calls |
| CDN / edge caching | Propagation delay | Static assets, read-heavy APIs |
| DNS prefetch (<link rel="dns-prefetch">) | One DNS RTT | Known third-party domains |
| preconnect | TCP + TLS RTT | Critical third-party origins |
| HTTP/2 server push | 1 RTT | Largely deprecated (Chrome has removed support); prefer preload or 103 Early Hints for critical assets |
ELI5: CDNs are like Amazon distribution warehouses. Instead of shipping everything from one giant factory in the middle of the country, they put smaller warehouses near big cities. Your package ships from 50 miles away instead of 2,000 miles. Same content, dramatically less propagation delay.
8. Bufferbloat and Queuing
What Is Bufferbloat?
When a router’s output buffer fills up during congestion, instead of dropping packets (which would signal TCP to slow down), it holds them in a huge queue. Result: packets eventually get delivered, but after waiting in a queue for hundreds of milliseconds or even seconds. Latency skyrockets while throughput stays the same.
The irony: bufferbloat was caused by cheap RAM making huge buffers affordable. “Bigger buffers = better performance” seemed intuitive. It was wrong.
ELI5: Bufferbloat is like a grocery store that handles checkout lines by adding an infinitely long waiting area. Nobody leaves (no dropped packets) but everyone waits forever. The right fix is to occasionally close a checkout lane (drop packets), which signals people to come back at a less busy time (TCP backs off). Holding everyone in line forever doesn’t help — it just makes everyone miserable.
How to Detect Bufferbloat
# Baseline ping with no load
ping -c 20 8.8.8.8
# Start a large download, THEN ping
wget -q -O /dev/null http://speedtest.net/largefile.bin &
ping -c 20 8.8.8.8
# Bufferbloat: baseline = 15ms, under load = 800ms
# Healthy: baseline = 15ms, under load = 20ms
Typical symptom: everything feels fine at idle, but when someone on the same network starts a Netflix stream or large download, everyone’s latency jumps by 500ms+.
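The comparison in that test reduces to "how much latency does load add?". A minimal sketch, assuming you have collected RTT samples (in ms) from both ping runs; the 30ms tolerance is an arbitrary illustrative threshold, not a standard.

```python
import statistics


def bufferbloat_report(idle_rtts_ms, loaded_rtts_ms, tolerance_ms: float = 30.0) -> dict:
    """Compare median RTT at idle and under load."""
    idle = statistics.median(idle_rtts_ms)
    loaded = statistics.median(loaded_rtts_ms)
    added = loaded - idle
    return {
        "idle_ms": idle,
        "loaded_ms": loaded,
        "added_ms": added,
        "bufferbloat": added > tolerance_ms,
    }


# Invented samples matching the two cases in the comments above
print(bufferbloat_report([14, 15, 16, 15], [780, 800, 820, 810]))  # bufferbloat: True
print(bufferbloat_report([14, 15, 16, 15], [19, 20, 21, 20]))      # bufferbloat: False
```

The median is used rather than the mean so that a single lost or delayed probe does not skew the verdict.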
AQM — The Fix
Active Queue Management (AQM) deliberately drops or marks packets before the queue is full, giving TCP early feedback to slow down. This keeps queues short (and therefore latency low).
| AQM Algorithm | Description | Status |
|---|---|---|
| CoDel (Controlled Delay) | Starts dropping when the minimum queue delay stays above 5ms for a full 100ms interval | Deployed widely |
| FQ-CoDel | CoDel + fair queuing (each flow gets equal share) | Best current default |
| CAKE | Successor to FQ-CoDel, handles shaping too | Linux 4.19+ |
| RED (Random Early Detection) | Old standard, parameter-sensitive | Legacy |
# Check your current qdisc
tc qdisc show dev eth0
# Enable fq_codel on an interface
tc qdisc replace dev eth0 root fq_codel
Most modern home routers with OpenWrt support FQ-CoDel via the SQM (Smart Queue Management) package. On a typical cable connection, enabling SQM reduces latency-under-load from 500ms to <20ms.
ELI5: AQM is like a smart toll booth that notices traffic backing up and briefly stops letting cars onto the highway before the merge becomes a disaster. Yes, a few cars wait at the booth (dropped packets get retransmitted). But the highway itself stays flowing freely. Without AQM, all the cars get on the highway and sit bumper-to-bumper for miles.
Debugging Decision Table
When something is “slow” or “broken,” work through this table top-to-bottom:
| Question | Tool | Positive result means |
|---|---|---|
| Can we reach the host at all? | ping <host> | Basic IP connectivity |
| Where does the path break? | mtr <host> | Find the failing hop |
| Is DNS resolving correctly? | dig <hostname> | DNS not the problem |
| Is the port open? | nc -zv host port or nmap | Process is listening |
| Is the TCP handshake completing? | tcpdump + SYN/SYN-ACK | No firewall blocking |
| Is TLS working? | openssl s_client -connect | Cert/cipher OK |
| How long does each HTTP phase take? | curl -w "%{time_*}" | Isolate DNS/TCP/TLS/App |
| Are there retransmissions? | Wireshark Expert Info | Packet loss present |
| Is the bandwidth what we expect? | iperf3 -c server -P 8 | Link capacity confirmed |
| Is latency spiking under load? | ping while running iperf3 | Bufferbloat present |
When you’re stuck: capture with tcpdump -w, analyze in Wireshark, start at Expert Information, follow the TCP stream for the failing request. The answer is almost always in the packet capture.