
Network Performance & Debugging

Performance bugs are the worst: the app “works” but feels broken. Slow response times, intermittent timeouts, mysterious packet loss — these don’t show up in application logs. You need to understand the network layer, know which tool to reach for, and be able to read raw packet traces. This file covers everything from first principles to hands-on debugging.


1. Latency — The Silent Killer

Latency is the time for a message to travel from A to B. It has four components, and confusing them leads to bad diagnoses.

The Four Components

| Component | Cause | Typical magnitude | Variable? |
| --- | --- | --- | --- |
| Propagation delay | Speed of light in medium | 1ms–150ms | No |
| Serialization delay | Pushing bits onto the wire | <1ms at high BW | No |
| Processing delay | Router/NIC/server CPU | <1ms (modern hardware) | Slightly |
| Queuing delay | Waiting in buffers | 0ms–seconds | Yes, a lot |

Propagation delay is physics. Light travels ~300,000 km/s in a vacuum, but fiber is about 2/3 of that — roughly 200,000 km/s. New York to London is ~5,500 km, so one-way propagation is ~27ms. RTT is ~54ms — and that’s in a straight line with zero queuing. Real cross-Atlantic RTT is ~85–100ms.

ELI5: Propagation delay is like the speed of sound. You can’t shout faster than ~340 m/s no matter how loud you are. Similarly, you can’t get data from San Francisco to Tokyo (~8,300 km) in under ~41ms one way over fiber, no matter how good your hardware is; physics won’t allow it. Every millisecond of “irreducible” latency is just distance.

Serialization delay is how long it takes to push all the bits of a packet onto the wire. A 1,500 byte packet on a 1 Mbps link takes $\frac{1500 \times 8}{1{,}000{,}000} = 12ms$. On a 1 Gbps link, it’s 12 microseconds. Only matters on slow links (DSL, satellite).
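
The two fixed components can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the fiber speed is the ~2/3 c approximation used above, not a measured value.

```python
# Approximate signal speed in fiber (about 2/3 of c); an assumption, not a measurement.
FIBER_SPEED_KM_PER_S = 200_000


def propagation_delay_ms(distance_km: float,
                         speed_km_per_s: float = FIBER_SPEED_KM_PER_S) -> float:
    """One-way propagation delay in milliseconds."""
    return distance_km / speed_km_per_s * 1000


def serialization_delay_ms(packet_bytes: int, link_bps: float) -> float:
    """Time to clock one packet onto the wire, in milliseconds."""
    return packet_bytes * 8 / link_bps * 1000


if __name__ == "__main__":
    print(propagation_delay_ms(5_500))         # New York to London, one way
    print(serialization_delay_ms(1_500, 1e6))  # 1,500 bytes at 1 Mbps
    print(serialization_delay_ms(1_500, 1e9))  # 1,500 bytes at 1 Gbps
```

Doubling the propagation figure gives the theoretical minimum RTT; anything a real ping shows above that is routing detours, processing, and queuing.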

Queuing delay is the dangerous one — it’s variable and causes jitter. When a router’s output buffer is backed up, packets sit and wait. This is where bufferbloat lives (see Section 8). Jitter is the variance in delay; it destroys real-time applications like VoIP and gaming even when average latency is acceptable.

Bandwidth-Delay Product

$$\text{BDP} = \text{Bandwidth} \times \text{RTT}$$

BDP is the amount of data that can be “in flight” at any moment. On a 1 Gbps link with 100ms RTT: $1{,}000{,}000{,}000 \text{ bits/s} \times 0.1 \text{ s} = 100{,}000{,}000$ bits = 12.5 MB in flight.

Why does this matter? TCP limits unacknowledged in-flight data to the smaller of the sender’s congestion window and the receiver’s advertised window. If that window is smaller than BDP, you can never fill the pipe, and your throughput is capped at $\frac{\text{window size}}{\text{RTT}}$. This is why a file transfer over a high-latency link (satellite) is agonizingly slow even with “100 Mbps” bandwidth: each window’s worth of data has to wait a full RTT for its ACKs, so the pipe stays mostly empty.
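
A short sketch of both formulas, with bandwidth in bits per second and windows in bytes. The function names are illustrative; the 65,535-byte window is the largest TCP can advertise without window scaling and is used here only as an example of a small window.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product, converted from bits to bytes."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    # 1 Gbps link, 100 ms RTT
    print(bdp_bytes(1e9, 0.1))                         # bytes needed in flight to fill the pipe
    print(window_limited_throughput_bps(65_535, 0.1))  # ceiling with an unscaled 64 KiB window
```

With these inputs the pipe needs 12.5 MB in flight, while an unscaled window tops out at roughly 5.2 Mbps regardless of link speed.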

ELI5: BDP is like a highway. A 10-lane highway (bandwidth) between two cities 500 km apart (RTT) can have thousands of cars on it at once (BDP). If the on-ramp only lets 10 cars in at a time (small TCP window), the highway is mostly empty and you’re wasting capacity. You need to open the on-ramp (increase window size) to actually use the road.

Tail Latency — P50 / P95 / P99

Never report average latency. Use percentiles:

| Metric | Meaning | Who cares |
| --- | --- | --- |
| P50 | Median: half of requests are faster | Marketing |
| P95 | 95th percentile: 1 in 20 requests | Engineering |
| P99 | 99th percentile: 1 in 100 requests | SLAs, real users |
| P999 | 1 in 1,000 | High-scale systems |

At 1,000 RPS, your P99 fires 10 times per second. Tail latency also compounds across service calls: if a request touches 5 services, the probability that at least one of them responds at its P99 or worse is $1 - (1 - 0.01)^5 \approx 4.9\%$. Roughly 1 request in 20 sees a worst-case response from somewhere in the chain.

Why this matters: If your P99 is 2 seconds and average is 50ms, your average looks great but 1% of requests are waiting 2 seconds. At a million requests per day, that’s 10,000 bad experiences. Averages hide the outliers that define user satisfaction.
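
The gap between mean and tail is easy to reproduce. The sketch below uses the nearest-rank percentile definition (one of several in common use) and a hypothetical sample of 98 fast requests and 2 slow ones; the helper names are illustrative.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def chance_of_hitting_tail(tail_fraction: float, services: int) -> float:
    """Probability that at least one of several independent calls lands in its tail."""
    return 1 - (1 - tail_fraction) ** services


# Hypothetical latencies in milliseconds: 98 fast requests, 2 slow ones
latencies_ms = [50] * 98 + [2000] * 2

if __name__ == "__main__":
    print(statistics.mean(latencies_ms))   # pulled up a little by the outliers
    print(percentile(latencies_ms, 50))    # the median hides them completely
    print(percentile(latencies_ms, 99))    # the tail exposes them
    print(chance_of_hitting_tail(0.01, 5))
```

The fan-out calculation assumes the services' latencies are independent, which real systems only approximate.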


2. Bandwidth vs Throughput vs Goodput

These three words are often used interchangeably. They are not the same thing.

| Term | Definition | Layer |
| --- | --- | --- |
| Bandwidth | Maximum theoretical capacity of the link | Physical |
| Throughput | Actual data transferred per second | Transport |
| Goodput | Application-layer useful data per second | Application |

Why 1 Gbps doesn’t mean 1 Gbps file transfers:

  1. Protocol overhead: TCP/IP headers add ~40 bytes per ~1,460 bytes of payload. ~2.7% overhead.
  2. Congestion window growth: TCP starts slow (slow start) and ramps up. Short-lived connections never reach full speed.
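  3. Retransmissions: Lost segments must be resent, and loss-based congestion control (Reno, CUBIC) also shrinks the window on every loss, so a 1% loss link usually costs far more than 1% of throughput.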
  4. ACK traffic: Every data packet generates ACK traffic in the reverse direction.
  5. Application behavior: If the app sends data in small batches (Nagle’s algorithm, small writes), the pipe is mostly idle.
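
Two of these factors can be estimated numerically. The sketch below computes header efficiency from the figures in item 1 and applies the Mathis et al. approximation, a well-known rough ceiling for Reno-style loss-based TCP. It is an assumption brought in for illustration, not a model of CUBIC or BBR, and the function names are hypothetical.

```python
import math


def header_efficiency(payload_bytes: int = 1460, header_bytes: int = 40) -> float:
    """Fraction of each TCP/IP packet that is application payload."""
    return payload_bytes / (payload_bytes + header_bytes)


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Rough throughput ceiling for loss-based TCP: (MSS / RTT) * sqrt(3/2) / sqrt(p)."""
    if loss_rate <= 0:
        raise ValueError("loss_rate must be positive")
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)


if __name__ == "__main__":
    print(header_efficiency())                      # share of bytes that are payload
    print(mathis_throughput_bps(1460, 0.1, 0.01))   # 100 ms RTT, 1% loss
```

On a 100ms path with 1% loss the estimate comes out around 1.4 Mbps per connection, whatever the link speed, which is why random loss on long paths matters far more than header overhead.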

ELI5: Bandwidth is the maximum speed on a highway. Throughput is how fast you’re actually driving (you slow for traffic, tolls, and construction). Goodput is how much useful cargo arrives — not counting the truck itself, the fuel, and the packaging. A “1 Gbps” file transfer might only move 700–900 Mbps of actual file data when you account for all the overhead.

Common mistake: Measuring throughput with wget and declaring the link “fine.” wget measures transfer rate for one connection. You need iperf3 with multiple parallel streams to actually saturate a high-bandwidth link:

# Server side
iperf3 -s

# Client side — 8 parallel streams, 30 second test
iperf3 -c 192.168.1.1 -P 8 -t 30

3. tcpdump — Packet Capture

tcpdump is your first responder. It runs on any Linux/macOS system with no GUI, captures on remote servers, and can write pcap files for offline analysis.

Basic Syntax

# Capture all traffic on eth0, no DNS resolution (-n), verbose (-v)
tcpdump -i eth0 -n -v

# Capture only port 443 traffic
tcpdump -i eth0 -n port 443

# Capture HTTP traffic from a specific host
tcpdump -i eth0 -n 'host 10.0.1.5 and port 80'

# Save to file for Wireshark analysis (-s 0 = full packet)
tcpdump -i eth0 -n -s 0 -w /tmp/capture.pcap port 443

# Read back a saved capture
tcpdump -r /tmp/capture.pcap -n

Filter Expression Reference

| Filter | Example | Captures |
| --- | --- | --- |
| host | host 10.0.0.1 | Traffic to/from IP |
| src / dst | src 10.0.0.1 | One direction only |
| port | port 5432 | Any traffic on port |
| net | net 10.0.0.0/24 | Entire subnet |
| tcp[tcpflags] | tcp[tcpflags] & tcp-rst != 0 | RST packets |
| Logical | and, or, not | Combine filters |

What to Look For

# Catch SYNs and RSTs. Repeated SYNs to the same host:port are handshake
# retransmissions; RSTs are refused or aborted connections.
tcpdump -i eth0 -n 'tcp[tcpflags] & (tcp-rst|tcp-syn) != 0'

# Zero window advertisements: receiver buffer full, sender blocked
# (RSTs are excluded because they often carry a zero window field)
tcpdump -i eth0 -n 'tcp[14:2] == 0 and tcp[tcpflags] & tcp-rst == 0'

# SYN flood detection — too many SYNs without SYN-ACK
tcpdump -i eth0 -n 'tcp[tcpflags] == tcp-syn'

ELI5: tcpdump is like a wiretap on a phone line — everything going in or out of a network interface gets recorded. Filters let you focus on the suspicious calls. On a busy server, you’d drown in noise without filters, so you narrow to “only show me calls from this IP on this port.”
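
The byte-offset filters above make more sense once you see the header layout they index into. The following sketch decodes the fixed 20-byte TCP header with Python's struct module: byte 13 holds the flags (what tcpdump calls tcp[tcpflags]) and bytes 14 and 15 hold the advertised window (tcp[14:2]). The function name and the sample packet are hypothetical.

```python
import struct

# Flag bits in byte 13 of the TCP header
TCP_FIN, TCP_SYN, TCP_RST, TCP_PSH, TCP_ACK = 0x01, 0x02, 0x04, 0x08, 0x10


def parse_tcp_header(header: bytes) -> dict:
    """Decode the fixed part of a TCP header; options are ignored."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    (src_port, dst_port, seq, ack,
     offset_reserved, flags, window, checksum, urgent) = struct.unpack(
        "!HHIIBBHHH", header[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "flags": flags,
        "window": window,
        "syn_only": flags == TCP_SYN,   # same test as 'tcp[tcpflags] == tcp-syn'
        "rst": bool(flags & TCP_RST),   # same test as 'tcp[tcpflags] & tcp-rst != 0'
        "zero_window": window == 0,     # same test as 'tcp[14:2] == 0'
    }


# Hypothetical SYN from port 54321 to port 443 (data offset 5 words, window 64240)
syn = struct.pack("!HHIIBBHHH", 54321, 443, 1000, 0, 5 << 4, TCP_SYN, 64240, 0, 0)

if __name__ == "__main__":
    print(parse_tcp_header(syn))
```

The equality test matches only packets whose sole flag is SYN, while the bitwise AND matches any packet with that bit set. That is why the SYN flood filter uses `==` and the RST filter uses `&`.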

Common mistake: Running tcpdump without -n on a busy server. DNS reverse-lookups for every IP address add latency and CPU pressure, and can cause tcpdump to drop packets. Always use -n in production.


4. Wireshark — Deep Packet Analysis

Wireshark is the GUI counterpart to tcpdump. Use it for analysis, not capture — capture on the server with tcpdump, copy the .pcap file, analyze locally in Wireshark.

Key Features

Following TCP streams — Right-click any packet → “Follow → TCP Stream”. Wireshark reassembles the full conversation. You see the raw HTTP request/response, or whatever protocol is in use. Invaluable for debugging “what exactly did the app send?”

Protocol decoders — Wireshark understands 2,000+ protocols. It will automatically decode HTTP, DNS, TLS (if you have keys), gRPC, Redis protocol, etc. The Info column shows you a human-readable summary.

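Display filters vs capture filters: Capture filters (the BPF syntax tcpdump uses) decide which packets are recorded. Display filters are applied after capture; they don’t reduce what’s captured, just what’s shown, and their syntax is richer because it can reference decoded protocol fields. Use display filters for analysis: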

# Show only retransmissions
tcp.analysis.retransmission

# Show only DNS failures (NXDOMAIN)
dns.flags.rcode == 3

# Show TLS handshake failures
tls.alert_message.level == 2

# Slow requests > 1 second
http.time > 1

Expert Information — Analyze menu → Expert Information. Wireshark auto-flags anomalies: retransmissions, duplicate ACKs, zero windows, TCP resets, protocol violations. Start here when you open an unfamiliar capture.

TLS decryption with SSLKEYLOGFILE — If you set SSLKEYLOGFILE=/tmp/keys.log in the environment before running Chrome/Firefox/curl, TLS session keys get logged. Import this file in Wireshark (Edit → Preferences → TLS → (Pre)-Master-Secret log filename) and you can see decrypted HTTPS traffic.

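# Start the capture first, in the background
tcpdump -i any -w /tmp/capture.pcap port 443 &
# Make the request with key logging enabled
SSLKEYLOGFILE=/tmp/keys.log curl https://api.example.com/endpoint
# Stop the capture, then open capture.pcap + keys.log in Wireshark
kill %1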

ELI5: Wireshark is like watching a conversation through a magnifying glass instead of a keyhole. tcpdump captures the raw bytes; Wireshark translates them into “Client said: GET /api/users HTTP/1.1, Server replied: 200 OK, body is 1,234 bytes.” Without TLS keys, HTTPS traffic looks like gibberish. With the keys file, you can read the actual application data.


5. Network Debugging Tools

Quick Reference

| Tool | Use case | When to reach for it |
| --- | --- | --- |
| ping | Connectivity, RTT | First step in any diagnosis |
| traceroute / mtr | Path analysis, where packets die | When ping fails or latency is high |
| dig | DNS debugging | When hostname resolution is wrong |
| curl -v | HTTP debugging | When the application layer is the question |
| ss / netstat | Socket state | When “connection refused” or “too many connections” |
| iperf3 | Bandwidth testing | When you need to prove link capacity |
| nmap | Port scanning | When you don’t know what’s listening |
| openssl s_client | TLS debugging | When TLS handshakes fail |

Practical Examples

mtr — continuous traceroute with packet loss per hop:

mtr --report --report-cycles 100 8.8.8.8
# Shows RTT and loss% at each hop. A hop with 10% loss that 
# doesn't affect subsequent hops = ICMP rate-limiting (normal).
# A hop with 10% loss where all subsequent hops also have 10% = real loss.
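
The rule of thumb in those comments can be written down as a small classifier. This is a simplified sketch: it takes a list of per-hop loss percentages (as you might copy from an mtr report), treats any nonzero loss as loss, and uses a hypothetical function name.

```python
def classify_hop_loss(loss_percent_by_hop):
    """Label each hop: 'none', 'likely rate limiting', or 'real loss'."""
    labels = []
    for i, loss in enumerate(loss_percent_by_hop):
        if loss == 0:
            labels.append("none")
        elif all(later > 0 for later in loss_percent_by_hop[i + 1:]):
            # Loss that persists through every later hop, including the destination
            labels.append("real loss")
        else:
            # Later hops are clean, so packets were forwarded fine; the router
            # was probably just deprioritizing its own ICMP replies
            labels.append("likely rate limiting")
    return labels


if __name__ == "__main__":
    print(classify_hop_loss([0, 10, 0, 0]))
    print(classify_hop_loss([0, 10, 10, 10]))
```

A real report is noisier than this, so compare loss magnitudes across hops before trusting the label.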

dig — DNS debugging:

# Query specific DNS server to bypass cache
dig @8.8.8.8 api.example.com A

# Trace the delegation path from the root servers down to the authoritative answer
dig +trace api.example.com

# Find who's authoritative
dig api.example.com NS

openssl s_client — TLS debugging:

# Full TLS handshake details
openssl s_client -connect api.example.com:443 -servername api.example.com

# Check certificate expiry
echo | openssl s_client -connect api.example.com:443 2>/dev/null \
  | openssl x509 -noout -dates

# Test specific TLS version
openssl s_client -connect api.example.com:443 -tls1_2

ss — socket states:

# Summary of socket counts by protocol and state
ss -s

# Who's listening on which port
ss -tlnp

# Count TIME_WAIT connections (port exhaustion indicator); -H omits the header line
ss -Htan state time-wait | wc -l

ELI5: These tools are like a doctor’s toolkit. ping is checking if the patient is alive (pulse). traceroute is tracing where the blood flow stops. dig is testing if the phone directory works. openssl s_client is verifying the ID badge is valid before letting someone in the door. You always start with the simplest check and escalate.


6. Common Network Problems

Diagnostic Patterns

DNS resolution failures:

  • dig returns NXDOMAIN → hostname doesn’t exist or wrong resolver configured
  • dig hangs → resolver unreachable (firewall, wrong IP in /etc/resolv.conf)
  • Works with IP, fails with hostname → DNS-only problem
  • Works from laptop, fails from server → check /etc/resolv.conf and /etc/hosts on the server

TCP connection timeouts vs refused:

| Symptom | Diagnosis |
| --- | --- |
| Connection refused immediately | Server is reachable but nothing listening on that port |
| Connection timed out after a minute or more (~75s on macOS/BSD, ~2 min with Linux defaults) | Firewall is silently dropping SYN packets (no RST sent) |
| Connection timed out after 3–5s | Custom timeout, or application-level rejection |
| No route to host | Routing problem, or firewall sending ICMP unreachable |
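
The same distinction can be made programmatically. Below is a minimal sketch using only the standard library; the function name and the 3-second default timeout are illustrative choices, not a recommendation.

```python
import socket


def probe_tcp(host: str, port: int, timeout_s: float = 3.0) -> str:
    """Attempt a TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"
    except ConnectionRefusedError:
        # An RST came back: the host is reachable, nothing is listening
        return "refused"
    except socket.timeout:
        # No answer at all: typically a firewall silently dropping the SYN
        return "timeout"
    except OSError as exc:
        # Covers "No route to host", DNS failures, and similar errors
        return f"error: {exc}"


if __name__ == "__main__":
    print(probe_tcp("127.0.0.1", 22, timeout_s=1.0))
```

Because the timeout here is set by the caller, a "timeout" result arrives after `timeout_s` rather than after the kernel's full SYN retry schedule.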

TLS handshake failures — read the error message carefully:

  • certificate verify failed → Cert chain issue, expired cert, wrong hostname, or missing CA
  • handshake failure (40) → Cipher suite mismatch — server and client share no common cipher
  • unrecognized name (112) → SNI mismatch — you’re hitting the wrong virtual host
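  • certificate unknown (46) → The peer rejected the certificate for an unspecified reason, often an untrusted CA. A missing client certificate is reported as certificate required (116) in TLS 1.3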

Connection resets (RST):

  • Mid-connection RST from client: app crashed, or connection pool returning bad connections
  • RST from direction of firewall: stateful firewall rule, IDS blocking, or asymmetric routing causing firewall to see half a connection

MTU issues — one of the sneakiest problems:

Large packets get fragmented or dropped silently if an intermediate router has a lower MTU and ICMP “fragmentation needed” messages are blocked. Symptom: small requests work, large ones hang (the TCP handshake succeeds, data never arrives). To confirm, send pings with the don’t-fragment bit set and find the largest size that gets through:

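# Test MTU with increasing packet sizes. -s is the ICMP payload size;
# add 28 bytes of IP + ICMP headers (so -s 1472 tests a 1500-byte MTU).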
ping -M do -s 1400 gateway_ip   # Linux: don't fragment
ping -D -s 1400 gateway_ip      # macOS

ELI5: MTU problems are like trying to ship a king-size mattress through a doorway that’s only wide enough for a twin. The handshake (small packages) gets through fine. The actual data (big packages) gets stuck. The system should automatically send a “door too small” message (ICMP fragmentation needed) but if that message is blocked by a firewall, you’re stuck with a mystery: “why does my connection establish but never transfer data?”

Common mistake: Blaming the application for slow responses when the network is the culprit. To distinguish:

  1. curl -s -o /dev/null -w "%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n" <url> prints cumulative timestamps for DNS, TCP, TLS, TTFB, and total
  2. If time_connect is high → network latency or packet loss
  3. If time_starttransfer is high but time_connect is fine → application/server processing slow
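
One detail that trips people up: curl's time_* values are cumulative from the start of the request, so each phase is the difference between neighbouring values. The sketch below does that subtraction. The function name and the sample numbers are hypothetical, and time_appconnect (the point where the TLS handshake finished) is treated as optional because curl reports it as 0 for plain HTTP.

```python
def phase_durations(t: dict) -> dict:
    """Convert curl's cumulative timings (seconds) into per-phase durations."""
    appconnect = t.get("time_appconnect", 0.0)
    tls = appconnect - t["time_connect"] if appconnect > 0 else 0.0
    handshakes_done = max(appconnect, t["time_connect"])
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "tls": tls,
        # Request sent, waiting for the first response byte (server time plus one RTT)
        "wait": t["time_starttransfer"] - handshakes_done,
        "download": t["time_total"] - t["time_starttransfer"],
    }


# Hypothetical measurement of a slow HTTPS endpoint
sample = {
    "time_namelookup": 0.012,
    "time_connect": 0.045,
    "time_appconnect": 0.120,
    "time_starttransfer": 0.620,
    "time_total": 0.650,
}

if __name__ == "__main__":
    for phase, seconds in phase_durations(sample).items():
        print(f"{phase:12s} {seconds * 1000:7.1f} ms")
```

In this sample the connect and TLS phases are tens of milliseconds while the wait phase is half a second, which points at the server and not the network.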

7. Performance Optimization

Connection Management

Keep-alive / connection reuse: opening a new HTTPS connection costs 1 RTT (TCP handshake) + 1 RTT (TLS 1.3 handshake; TLS 1.2 needs 2) = at least 2 RTTs before the first byte of request data can be sent. At 100ms RTT, that’s 200ms or more of pure overhead on every request if you don’t reuse connections. HTTP/1.1 keep-alive and HTTP/2 multiplexing both avoid this.
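
To put a number on it, here is a back-of-the-envelope sketch. It assumes two setup round trips per new connection (the figure used above) and sequential requests; the function name is illustrative.

```python
def setup_overhead_ms(requests: int, rtt_ms: float,
                      setup_rtts: int = 2, reuse: bool = False) -> float:
    """Total time spent on connection setup across a series of requests."""
    connections = 1 if reuse else requests
    return connections * setup_rtts * rtt_ms


if __name__ == "__main__":
    print(setup_overhead_ms(50, 100))              # new connection per request
    print(setup_overhead_ms(50, 100, reuse=True))  # one connection, reused
```

For 50 requests at 100ms RTT, that is 10 seconds of handshakes without reuse against 200ms with it.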

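TLS session resumption: After a full TLS handshake, the server issues a session ticket (an encrypted blob with the session keys). On reconnect, the client sends the ticket; the server decrypts it and resumes without a full handshake. With TLS 1.2 this saves 1 RTT. With TLS 1.3 the handshake is already 1 RTT, so resumption mainly saves the certificate exchange and signature verification, and optionally allows 0-RTT early data. Check if your server has it enabled: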

# Connect twice and look for "Reused, TLSv1.3, Cipher is ..."
openssl s_client -connect api.example.com:443 -reconnect 2>&1 | grep -E "Reused|New|Session"

TCP Tuning

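Nagle’s algorithm: TCP holds back a small write while earlier data is still unacknowledged, so that consecutive small writes coalesce into one larger segment. On its own this costs about one RTT, but combined with the receiver’s delayed ACK (typically 40ms on Linux, up to 200ms on other stacks) a write-write-read pattern can stall for tens to hundreds of milliseconds. This improves efficiency on slow links but hurts interactive apps (SSH, game servers, API calls that send small payloads).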

Disable for latency-sensitive apps:

/* Needs <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h>; sock is a connected TCP socket */
int flag = 1;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));

Or in most frameworks: enable “TCP_NODELAY” or “no delay” socket option.
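
For example, the same option in Python's standard socket module looks like this. It is a minimal sketch with a hypothetical helper name; the socket is created but not connected, which is enough to show the option being set and read back.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    sock = make_low_latency_socket()
    # Nonzero means TCP_NODELAY is enabled
    print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
    sock.close()
```

Set the option before the first write; it applies per socket, so sockets returned by accept() on a server need it set as well.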

ELI5: Nagle’s algorithm is like a bus that won’t leave half-empty until the previous bus has radioed in that it arrived. Great for filling up the bus (throughput), terrible if you’re alone and in a hurry (latency). Disable it when each “passenger” (small write) needs to leave immediately, even if the bus is mostly empty.

TCP congestion control algorithms:

| Algorithm | Best for | Note |
| --- | --- | --- |
| CUBIC (default Linux) | High-bandwidth, low-loss networks | Good default |
| BBR | Long-distance / lossy links | Google’s algorithm, better on WAN |
| Reno | Legacy, avoid | Slow recovery from loss |

# Check and set congestion algorithm
sysctl net.ipv4.tcp_congestion_control
sysctl -w net.ipv4.tcp_congestion_control=bbr

Application-Level Wins

| Optimization | Saves | When to use |
| --- | --- | --- |
| gzip/brotli compression | 60–90% bandwidth | Text responses (HTML, JSON, CSS) |
| protobuf vs JSON | 50–80% bandwidth + CPU | High-frequency API calls |
| CDN / edge caching | Propagation delay | Static assets, read-heavy APIs |
| DNS prefetch (`<link rel="dns-prefetch">`) | One DNS RTT | Known third-party domains |
| preconnect | TCP + TLS RTT | Critical third-party origins |
| HTTP/2 server push | 1 RTT | Critical assets (CSS, fonts); note that Chrome has removed support, so check client support first |

ELI5: CDNs are like Amazon distribution warehouses. Instead of shipping everything from one giant factory in the middle of the country, they put smaller warehouses near big cities. Your package ships from 50 miles away instead of 2,000 miles. Same content, dramatically less propagation delay.


8. Bufferbloat and Queuing

What Is Bufferbloat?

When a router’s output buffer fills up during congestion, instead of dropping packets (which would signal TCP to slow down), it holds them in a huge queue. Result: packets eventually get delivered, but after waiting in a queue for hundreds of milliseconds or even seconds. Latency skyrockets while throughput stays the same.

The irony: bufferbloat was caused by cheap RAM making huge buffers affordable. “Bigger buffers = better performance” seemed intuitive. It was wrong.

ELI5: Bufferbloat is like a grocery store that handles checkout lines by adding an infinitely long waiting area. Nobody leaves (no dropped packets) but everyone waits forever. The right fix is to occasionally close a checkout lane (drop packets), which signals people to come back at a less busy time (TCP backs off). Holding everyone in line forever doesn’t help — it just makes everyone miserable.

How to Detect Bufferbloat

# Baseline ping with no load
ping -c 20 8.8.8.8

# Start a large download, THEN ping (placeholder URL; substitute any large file)
wget -q -O /dev/null http://speedtest.net/largefile.bin &
ping -c 20 8.8.8.8

# Bufferbloat: baseline = 15ms, under load = 800ms
# Healthy: baseline = 15ms, under load = 20ms

Typical symptom: everything feels fine at idle, but when someone on the same network starts a Netflix stream or large download, everyone’s latency jumps by 500ms+.
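
Once you have the two sets of ping times, the comparison is simple to automate. The sketch below takes RTT samples in milliseconds as plain lists; the function name and the 100ms threshold are assumptions chosen for illustration, not a standard.

```python
import statistics


def latency_under_load(baseline_ms, loaded_ms, threshold_ms: float = 100.0) -> dict:
    """Compare median RTT at idle against median RTT while the link is saturated."""
    base = statistics.median(baseline_ms)
    loaded = statistics.median(loaded_ms)
    increase = loaded - base
    return {
        "baseline_ms": base,
        "loaded_ms": loaded,
        "increase_ms": increase,
        "bufferbloat_suspected": increase > threshold_ms,
    }


if __name__ == "__main__":
    # Hypothetical samples matching the two cases described above
    print(latency_under_load([14, 15, 16], [780, 800, 820]))
    print(latency_under_load([14, 15, 16], [19, 20, 21]))
```

Medians are used so that one stray ping does not decide the verdict; what matters is the increase, since the baseline RTT is fixed by distance.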

AQM — The Fix

Active Queue Management (AQM) deliberately drops or marks packets before the queue is full, giving TCP early feedback to slow down. This keeps queues short (and therefore latency low).

| AQM Algorithm | Description | Status |
| --- | --- | --- |
| CoDel (Controlled Delay) | Drops packets when the minimum queue delay stays above 5ms for more than 100ms | Deployed widely |
| FQ-CoDel | CoDel + fair queuing (each flow gets equal share) | Best current default |
| CAKE | Successor to FQ-CoDel, handles shaping too | Linux 4.19+ |
| RED (Random Early Detection) | Old standard, parameter-sensitive | Legacy |

# Check your current qdisc
tc qdisc show dev eth0

# Enable fq_codel on an interface
tc qdisc replace dev eth0 root fq_codel

Most modern home routers with OpenWrt support FQ-CoDel via the SQM (Smart Queue Management) package. On a typical cable connection, enabling SQM can cut latency-under-load from hundreds of milliseconds to a few tens.

ELI5: AQM is like a smart toll booth that notices traffic backing up and briefly stops letting cars onto the highway before the merge becomes a disaster. Yes, a few cars wait at the booth (dropped packets get retransmitted). But the highway itself stays flowing freely. Without AQM, all the cars get on the highway and sit bumper-to-bumper for miles.


Debugging Decision Table

When something is “slow” or “broken,” work through this table top-to-bottom:

| Question | Tool | Positive result means |
| --- | --- | --- |
| Can we reach the host at all? | ping <host> | Basic IP connectivity |
| Where does the path break? | mtr <host> | Find the failing hop |
| Is DNS resolving correctly? | dig <hostname> | DNS not the problem |
| Is the port open? | nc -zv host port or nmap | Process is listening |
| Is the TCP handshake completing? | tcpdump + SYN/SYN-ACK | No firewall blocking |
| Is TLS working? | openssl s_client -connect | Cert/cipher OK |
| How long does each HTTP phase take? | curl -w "%{time_*}" | Isolate DNS/TCP/TLS/App |
| Are there retransmissions? | Wireshark Expert Info | Packet loss present |
| Is the bandwidth what we expect? | iperf3 -c server -P 8 | Link capacity confirmed |
| Is latency spiking under load? | ping while running iperf3 | Bufferbloat present |

When you’re stuck: capture with tcpdump -w, analyze in Wireshark, start at Expert Information, follow the TCP stream for the failing request. The answer is almost always in the packet capture.