DNS & Name Resolution

DNS is the phone book of the internet — except the phone book has multiple layers of phone books, caches everything for a while, can be wrong, can be lied to, and your app silently breaks when it misbehaves. Understanding DNS deeply means you stop treating “DNS issue” as a mystery and start treating it as a solvable, debuggable system.


1. How DNS Works

The Hierarchy

DNS is a distributed, hierarchical system. There is no single machine that knows all domain-to-IP mappings. Instead:

Root (.) ─── knows where to find .com, .org, .io, etc.
  │
TLD (.com) ─── knows where to find google.com, github.com, etc.
  │
Authoritative (google.com) ─── knows the actual IPs for google.com subdomains

There are 13 root server identities (a.root-servers.net through m.root-servers.net), run by 12 organizations (Verisign, ICANN, and others) and served from more than a thousand anycast instances worldwide. Your laptop normally never talks to a root server directly; your recursive resolver does.

The Full Query Journey

When you type api.github.com in your browser:

  1. Browser cache: checked first (Chrome has its own DNS cache)
  2. OS stub resolver: checks /etc/hosts, then any local caching service (systemd-resolved, nscd, mDNSResponder on macOS, the DNS Client service on Windows)
  3. Recursive resolver: your configured DNS server (ISP’s, or 8.8.8.8, or 1.1.1.1). This is the workhorse.
  4. Root servers: recursive resolver asks “who handles .com?” → gets NS records for .com TLD servers
  5. TLD servers: recursive resolver asks “who handles github.com?” → gets NS records pointing to GitHub’s authoritative servers
  6. Authoritative servers: recursive resolver asks “what’s the IP for api.github.com?” → gets the A record
  7. Response travels back, gets cached at each layer, returns to your browser

The total elapsed time for a cold query is typically 50–150ms. Warm cache hits are sub-millisecond.

ELI5: It’s like asking for a phone number at a library. You ask the librarian (recursive resolver). The librarian doesn’t know but checks with the head librarian (root), who says “go look in the ‘businesses’ section” (TLD). That section says “look under G for GitHub.” The GitHub shelf has the actual phone number (authoritative). The librarian writes it on a Post-it (cache) so they don’t have to run around next time.

Recursive vs Iterative Queries

| Query Type | Who Does the Work | Used By |
| --- | --- | --- |
| Recursive | Server does all lookups on your behalf, returns final answer | Stub resolver → recursive resolver |
| Iterative | Server returns a referral (“ask this server next”), client does the walking | Recursive resolver → root/TLD/auth |

Your laptop sends a recursive query to your recursive resolver: “go figure out github.com and tell me the answer.” The recursive resolver then sends iterative queries: “root, who handles .com? … TLD, who handles github.com? … auth, what’s api.github.com?”
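The referral walk can be sketched as a toy model. This is only an illustration of the control flow: the server names and zone data below are invented, nothing touches the network, and a real resolver also handles caching, glue records, CNAMEs, and timeouts.

```python
# Toy model of iterative resolution. Server names and zone data are
# made up for illustration; no DNS traffic is sent.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"github.com.": "auth-github"}},
    "auth-github": {"answers": {"api.github.com.": "192.0.2.10"}},
}

def resolve_iteratively(name, start="root"):
    """Follow referrals from `start` until a server returns an answer."""
    server, path = start, []
    while True:
        path.append(server)
        data = SERVERS[server]
        if name in data.get("answers", {}):
            return data["answers"][name], path
        # Pick the referral whose zone is the longest suffix of the name.
        referrals = data.get("referrals", {})
        matches = [z for z in referrals if name == z or name.endswith("." + z)]
        if not matches:
            raise LookupError(f"{server} has no answer or referral for {name}")
        server = referrals[max(matches, key=len)]

address, path = resolve_iteratively("api.github.com.")
print(address, path)
```

The stub resolver on your laptop sees none of this; it sends one recursive query and receives only the final address.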

Why UDP Port 53 — and When It Falls Back to TCP

DNS uses UDP for most queries because:

  • Most DNS responses fit in a single datagram (historically capped at 512 bytes; with EDNS0 the client advertises a larger buffer, commonly 1232 or 4096 bytes)
  • UDP has no connection overhead — no handshake, no teardown
  • DNS queries are stateless — if no response, just retry

DNS falls back to TCP port 53 when:

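  • Response exceeds the advertised EDNS0 buffer size, or 512 bytes without EDNS0 (the server sets the truncated flag, TC, and the client retries over TCP)
  • Zone transfers (AXFR/IXFR): always TCP, can be megabytes of data
  • DNSSEC responses: signatures add significant size, so they often exceed the UDP limit and trigger the truncation path above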
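To make the truncation path concrete, here is a minimal sketch that encodes a query by hand and checks the truncated (TC) bit in a reply header. The function names are illustrative, the reply is faked locally rather than received from a server, and the encoder ignores edge cases such as the root name or internationalized names.

```python
import struct

def build_query(name, qtype=1, txid=0x1234):
    """Encode a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def is_truncated(message):
    """Return True when the TC bit (0x0200) is set in the header flags."""
    flags = struct.unpack("!H", message[2:4])[0]
    return bool(flags & 0x0200)

query = build_query("api.github.com")
print(len(query))  # 32

# Fake a truncated reply: QR (0x8000) + TC (0x0200) + RD (0x0100).
truncated_reply = query[:2] + struct.pack("!H", 0x8300) + query[4:]
print(is_truncated(truncated_reply))  # True
```

A client that sees `is_truncated(...)` return True is expected to repeat the same question over TCP port 53, which is why blocking TCP/53 breaks exactly the large answers.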

Common mistake: Firewall rules that allow UDP/53 but block TCP/53. Works fine until DNSSEC or large responses trigger TCP fallback. The symptom: some queries silently fail or return truncated results.


2. DNS Record Types

Core Records

| Record | Purpose | Example |
| --- | --- | --- |
| A | IPv4 address | api.example.com → 93.184.216.34 |
| AAAA | IPv6 address | api.example.com → 2606:2800:220:1:248:1893:25c8:1946 |
| CNAME | Alias to another name | www.example.com → example.com |
| MX | Mail server (with priority) | example.com → 10 mail.example.com |
| NS | Authoritative nameservers for a zone | example.com → ns1.example.com |
| TXT | Arbitrary text (SPF, DKIM, verification) | "v=spf1 include:sendgrid.net ~all" |
| PTR | Reverse DNS: IP → hostname | 34.216.184.93.in-addr.arpa → api.example.com |
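The PTR owner name is the address with its octets reversed under in-addr.arpa (or reversed nibbles under ip6.arpa for IPv6). A small sketch using Python's standard `ipaddress` module; the helper name `ptr_name` is just for illustration.

```python
import ipaddress

def ptr_name(ip):
    """Return the reverse-DNS owner name to query for a PTR record."""
    return ipaddress.ip_address(ip).reverse_pointer

print(ptr_name("93.184.216.34"))  # 34.216.184.93.in-addr.arpa
print(ptr_name("2001:db8::1"))
```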

SOA — Start of Authority

Every DNS zone has exactly one SOA record. It contains:

  • Primary nameserver: canonical NS for the zone
  • Responsible email: zone admin contact, with the @ written as a dot (admin.example.com. means admin@example.com)
  • Serial number: incremented every time the zone changes. Secondary nameservers use this to detect updates.
  • Refresh / Retry / Expire: zone transfer timing parameters for secondaries
  • Minimum: the TTL resolvers use for negative (NXDOMAIN) caching

example.com. SOA ns1.example.com. admin.example.com. (
    2024010501  ; serial
    3600        ; refresh (check for updates every hour)
    900         ; retry (if refresh fails, retry after 15 min)
    604800      ; expire (stop serving after 7 days without contact)
    300         ; minimum TTL (negative caching)
)

The serial number format YYYYMMDDNN is a convention, not a requirement — it just needs to be monotonically increasing.
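A minimal sketch of bumping a serial under the YYYYMMDDNN convention while keeping it strictly increasing. The function name is hypothetical, and the choice to start each day at NN = 00 is an assumption; some operators start at 01.

```python
from datetime import date

def next_serial(current, today):
    """Return the next YYYYMMDDNN serial, always greater than `current`."""
    base = int(today.strftime("%Y%m%d")) * 100  # today's serial with NN = 00
    # If the zone already changed today, just add one to the current serial.
    return max(base, current + 1)

print(next_serial(2024010501, date(2024, 1, 5)))  # 2024010502
print(next_serial(2024010501, date(2024, 1, 6)))  # 2024010600
```

Taking the maximum means a wrong system clock or more than 99 edits in a day still yields a larger serial, which is the only property secondaries depend on.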

SRV — Service Discovery

SRV records tell you not just the hostname, but also the port and protocol for a service:

_https._tcp.example.com. SRV 10 5 443 server1.example.com.
                              ↑  ↑  ↑
                           priority weight port

Kubernetes uses SRV records internally. Active Directory relies on them heavily for domain controller discovery. If you’re debugging AD join failures, check SRV records first.
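How a client might choose among several SRV records can be sketched as follows: take the lowest priority value first, then pick at random in proportion to weight within that group. This is a simplified reading of the selection rules, not any particular client's implementation; the record tuples and target names are invented.

```python
import random

def pick_srv(records, rng=random):
    """records: list of (priority, weight, port, target) tuples."""
    lowest = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == lowest]
    weights = [r[1] for r in candidates]
    if sum(weights) == 0:
        # All weights zero: treat the candidates as equally likely.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]

records = [
    (10, 5, 443, "server1.example.com."),
    (10, 15, 443, "server2.example.com."),
    (20, 100, 443, "backup.example.com."),  # used only if priority 10 is down
]
print(pick_srv(records, random.Random(0)))
```

The backup record is never returned here because failure handling is out of scope; a real client would fall through to priority 20 only after every priority 10 target failed.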

CAA — Certificate Authority Authorization

Tells CAs which ones are allowed to issue certs for your domain:

example.com. CAA 0 issue "letsencrypt.org"
example.com. CAA 0 issuewild ";"  ; nobody can issue wildcards

Many engineers first learn that CAA exists only after an unexpected CA issues a certificate for their domain. Publishing CAA records narrows which CAs will accept an issuance request in the first place.

CNAME vs A Record Trade-offs

| Scenario | Use CNAME | Use A Record |
| --- | --- | --- |
| Pointing to a service with changing IPs (CDN, SaaS) | Yes | No |
| Zone apex (root domain, example.com) | No (can’t CNAME the apex) | Yes |
| Want to add other records at same name | No (CNAME is exclusive) | Yes |
| Internal aliases | Fine | Fine |

ALIAS / ANAME records are a non-standard extension offered by some DNS providers (Route 53 calls them alias records; Cloudflare does the same job with CNAME flattening). They behave like a CNAME at the apex: the provider resolves the target itself and returns the resulting address records. Useful for pointing example.com to a load balancer hostname.

ELI5: CNAME is “forward my mail to this other address.” A record is “here’s my actual street address.” You can’t have an apartment building whose entire mailing address is “forward to somewhere else” (no CNAME at apex) — it doesn’t make sense. But you can have individual apartments forwarded.


3. DNS Caching & TTL

The Cache Chain

Browser cache (seconds to minutes)
     ↓ miss
OS / stub resolver cache (minutes)
     ↓ miss
Recursive resolver cache (shared across many users)
     ↓ miss
CDN edge resolver (if using CDN)
     ↓ miss
Authoritative server

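Each layer caches the response for at most the record’s TTL. A TTL of 300 means “cache this for up to 5 minutes.” A downstream cache that receives an answer already 2 minutes old keeps it only for the remaining 3. When the TTL expires, the next request triggers a fresh lookup.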
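A single cache layer can be sketched as a dictionary with expiry times. This is an illustrative toy, not how any specific resolver stores records; the class name and the injectable `clock` parameter are assumptions made so the behavior can be demonstrated without waiting.

```python
import time

class TtlCache:
    """Minimal TTL cache: entries vanish once their TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        """Return (value, remaining_ttl), or None on a miss or expiry."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        remaining = expires_at - self._clock()
        if remaining <= 0:
            del self._entries[name]
            return None
        return value, remaining

# Demonstrate with a fake clock instead of sleeping.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
cache.put("api.example.com", "93.184.216.34", ttl=300)
now[0] = 120.0
print(cache.get("api.example.com"))  # ('93.184.216.34', 180.0)
now[0] = 300.0
print(cache.get("api.example.com"))  # None
```

Returning the remaining TTL matters: a layer that answers from cache should hand downstream clients the time left, not the original 300 seconds.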

TTL Trade-offs

| TTL Value | Pros | Cons |
| --- | --- | --- |
| Low (30–300s) | Fast failover, fast rollout | More DNS queries, more load on auth servers |
| High (3600–86400s) | Fewer queries, cheaper | Stale data persists for hours, slow disaster recovery |

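A common strategy: lower the TTL to 60s ahead of a planned migration, at least one full old-TTL period in advance (with a 3600s TTL that is an hour or more before; with 86400s, a full day) so caches holding the old value have expired. Do the migration. Raise the TTL back to 3600s after confirming stability.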
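The scheduling arithmetic behind this strategy, as a small sketch that assumes caches honor TTLs: a record fetched just before you lower the TTL can stay cached for the full old TTL, so the lowering has to land at least that long before cutover. The function name is illustrative.

```python
from datetime import datetime, timedelta

def latest_time_to_lower_ttl(cutover, old_ttl_seconds):
    """Latest moment to publish the low TTL so old cached copies expire by cutover."""
    return cutover - timedelta(seconds=old_ttl_seconds)

cutover = datetime(2024, 1, 10, 12, 0)
print(latest_time_to_lower_ttl(cutover, 3600))   # 2024-01-10 11:00:00
print(latest_time_to_lower_ttl(cutover, 86400))  # 2024-01-09 12:00:00
```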

ELI5: TTL is like the “use by” date on food in your fridge. If it says “good for 5 minutes,” you check if it’s still fresh after 5 minutes. Long TTL = food lasts all week (convenient but risky if it goes bad). Short TTL = check the grocery store every minute (fresh but exhausting).

Negative Caching

When a name doesn’t exist (NXDOMAIN response), resolvers cache that non-existence too. The negative TTL is the smaller of the SOA record’s own TTL and its minimum field. This means that if clients query a hostname before you have created it, the “doesn’t exist” answer gets cached, and the new record stays invisible to those users until the negative entry expires.

Cache Poisoning

An attacker tricks a recursive resolver into caching a false record. Classic approach:

  1. Attacker triggers a query for bank.com to a target resolver
  2. Attacker floods the resolver with forged responses, trying to win the race against the real authoritative server
  3. If successful, the resolver caches the attacker’s IP for bank.com
  4. All users behind that resolver get sent to the attacker

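The Kaminsky attack (2008) dramatically sped this up by querying a fresh random subdomain on every attempt, so the attacker never has to wait for a cached answer to expire and can retry the race immediately. The fix: source port randomization, which raises the guessing space from ~65K (the 16-bit transaction ID alone) to as many as ~4 billion combinations (transaction ID × source port).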
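The numbers work out as follows. This sketch assumes the full 16-bit port range is usable, which overstates real deployments slightly since low ports are usually excluded, and it models the attacker as sending distinct guesses within a single race window.

```python
TXID_SPACE = 2 ** 16               # 16-bit transaction ID
PORT_SPACE = 2 ** 16               # assumption: every source port is usable
COMBINED_SPACE = TXID_SPACE * PORT_SPACE

def spoof_success_probability(forged_packets, space):
    """Chance that one of `forged_packets` distinct guesses matches."""
    return min(1.0, forged_packets / space)

print(TXID_SPACE)      # 65536
print(COMBINED_SPACE)  # 4294967296
print(spoof_success_probability(65536, TXID_SPACE))
print(spoof_success_probability(65536, COMBINED_SPACE))
```

With a fixed source port, a burst of 65,536 forged replies covers every transaction ID; with randomized ports the same burst covers only 1 in 65,536 of the space.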

Common mistake: Running open resolvers (resolvers that answer queries from any IP). These can be used in cache poisoning attacks and DNS amplification DDoS.


4. DNSSEC

The Problem

DNS was designed in 1983 with no authentication. A resolver has no way to verify that a response came from the legitimate authoritative server and wasn’t tampered with in transit. That’s the gap DNSSEC fills.

How It Works

DNSSEC adds digital signatures to DNS records. A zone typically has two key pairs:

  • ZSK (Zone Signing Key): signs the zone’s record sets
  • KSK (Key Signing Key): signs the DNSKEY record set, which contains the ZSK

New record types:

  • RRSIG — the signature for a record set
  • DNSKEY — the public key for the zone
  • DS — hash of child zone’s KSK, stored in parent zone

Chain of Trust

Root zone (IANA) ─── signs DS records for .com
.com TLD ─────────── signs DS records for example.com
example.com ──────── signs its own A, MX, etc. records

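A DNSSEC-validating resolver follows this chain from the root (which it trusts as a “trust anchor”) down to the record. If a signature in a signed zone is invalid or missing, the resolver returns SERVFAIL rather than a potentially forged answer. Zones with no DS record in the parent are treated as unsigned and resolve normally, without validation.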
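The link between parent and child can be illustrated with the DS digest itself: for digest type 2, the parent publishes a SHA-256 hash over the child's owner name (in wire format) followed by the DNSKEY record data. The sketch below covers only that one step; it does not verify any RRSIG, and the key bytes are placeholders rather than a real key.

```python
import hashlib
import struct

def wire_name(name):
    """Encode a domain name in lowercase DNS wire format."""
    labels = [l for l in name.lower().rstrip(".").split(".") if l]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"

def dnskey_rdata(flags, algorithm, public_key):
    """DNSKEY RDATA: flags, protocol (always 3), algorithm, key bytes."""
    return struct.pack("!HBB", flags, 3, algorithm) + public_key

def ds_digest_sha256(owner, rdata):
    """DS digest type 2: SHA-256 over owner name followed by DNSKEY RDATA."""
    return hashlib.sha256(wire_name(owner) + rdata).hexdigest()

# Flags 257 marks a KSK; algorithm 13 is ECDSA P-256. Placeholder key bytes.
ksk = dnskey_rdata(257, 13, b"\x01" * 64)
published_ds = ds_digest_sha256("example.com.", ksk)

# A resolver recomputes the digest from the DNSKEY it fetched and compares.
tampered = dnskey_rdata(257, 13, b"\x02" * 64)
print(ds_digest_sha256("example.com.", ksk) == published_ds)       # True
print(ds_digest_sha256("example.com.", tampered) == published_ds)  # False
```

Because the DS record lives in the parent zone and is signed by the parent, swapping the child's key without updating the DS breaks the chain, which is exactly the KSK rollover risk.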

ELI5: Normal DNS is like receiving a signed check with no way to verify the signature is real. DNSSEC is like having a notarized certificate chain — the bank (root) vouches for the notary (.com), who vouches for the person signing (your domain). Break any link in the chain and the check bounces.

Why DNSSEC Adoption Is Low

  • Complexity: key rollovers are risky. If you mess up a KSK rollover, your entire domain becomes unresolvable.
  • Broken resolvers: some firewalls strip DNSSEC-related records, breaking validation
  • NSEC walking: DNSSEC’s authenticated denial of existence (NSEC records) exposes all names in a zone to enumeration. NSEC3 replaces the names with salted hashes, which mitigates but doesn’t eliminate this, since guessable names can still be recovered by an offline dictionary attack.
  • No encryption: DNSSEC authenticates responses but doesn’t encrypt them. An observer can still see what you queried.

DANE

DANE (DNS-Based Authentication of Named Entities) uses DNSSEC to publish TLS certificate fingerprints in DNS (TLSA records). This lets you validate a TLS certificate without relying on any CA — useful for email (SMTP DANE) where the CA model is weak.


5. DNS over HTTPS (DoH) and DNS over TLS (DoT)

The Privacy Problem

Traditional DNS is plaintext UDP. Your ISP, network admin, or anyone on-path can see every domain you resolve. Even if your actual traffic is HTTPS, your DNS queries reveal your browsing patterns.

DoT — DNS over TLS (Port 853)

Wraps DNS queries in a TLS session:

  • Encrypted — no eavesdropping
  • Authenticated — resolver identity verified by certificate
  • Easy to detect and block — it’s a distinct port (853)

DoH — DNS over HTTPS (Port 443)

Sends DNS queries as HTTPS requests to a DoH endpoint:

  • Encrypted — same as DoT
  • Hard to block — looks like regular HTTPS traffic to port 443
  • Bypasses network DNS policies — which is both the feature and the problem

GET /dns-query?dns=<base64url-encoded-query> HTTP/2
Host: cloudflare-dns.com
Accept: application/dns-message
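The `dns` parameter is the raw DNS query message, base64url-encoded with the trailing padding removed. A minimal sketch of building that path; the helper names are illustrative, the transaction ID is set to 0 (which is cache-friendly for HTTP), and no request is actually sent.

```python
import base64
import struct

def build_query(name, qtype=1, txid=0):
    """Encode a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)

def doh_get_path(name):
    """Return the request path for a DoH GET lookup of `name` (A record)."""
    raw = build_query(name)
    encoded = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return "/dns-query?dns=" + encoded

print(doh_get_path("www.example.com"))
# /dns-query?dns=AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
```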

ELI5: Regular DNS is like shouting your destination to a taxi dispatcher over a radio — everyone in the room hears it. DoT puts it in a sealed envelope. DoH puts it in a sealed envelope that looks exactly like every other envelope in the office — nobody can even tell you’re sending something sensitive.

The Centralization Problem

DoH shifts DNS from “many ISP resolvers” to “a few big providers” (Cloudflare 1.1.1.1, Google 8.8.8.8). This creates:

  • A larger blast radius when one provider has an outage
  • Concentration of query data with a handful of companies
  • Bypassing of enterprise DNS policies and split-horizon setups

Oblivious DoH (ODoH)

ODoH adds a proxy between the client and the DoH resolver:

  • Client encrypts query for the resolver, sends to proxy
  • Proxy forwards to resolver (can’t see query content)
  • Resolver answers (doesn’t know who asked)
  • As long as the proxy and the resolver are run by different parties that don’t collude, neither one can tie your identity to your queries
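
| Protocol | Port | Encrypted | Blockability | Privacy |
| --- | --- | --- | --- | --- |
| DNS | 53 | No | Easy to block | None |
| DoT | 853 | Yes | Blockable | Good |
| DoH | 443 | Yes | Hard to block | Good |
| ODoH | 443 | Yes | Hard to block | Excellent |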

6. DNS in Practice

Essential Commands

# Basic query — A record for github.com
dig github.com

# Specific record type
dig github.com MX
dig github.com TXT

# Follow the full resolution chain from root
dig +trace github.com

# Query a specific resolver (not your configured one)
dig @8.8.8.8 github.com

# Reverse DNS lookup
dig -x 140.82.114.4

# Short output (IP only)
dig +short github.com

# Show only the answer section; the TTL is the second column
dig +noall +answer github.com

# nslookup (non-interactive form)
nslookup github.com
nslookup -type=MX github.com
nslookup github.com 1.1.1.1  # query specific resolver

# host (clean output)
host github.com
host -t MX github.com

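dig +trace is the most valuable debugging tool. It skips your recursive resolver’s cache and walks the delegation chain itself starting from the root, showing you exactly which servers were queried, what they returned, and where the resolution chain breaks.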

Debugging “It Works for Some Users”

This is almost always a TTL/caching problem. Checklist:

  1. Different TTL stages: User A cached the old record 2 minutes ago (TTL expires in 3 minutes). User B just queried (sees new record). Both are correct — they’re just at different cache states.
  2. ISP resolver lag: Some ISP resolvers ignore TTL and over-cache (especially common with low TTLs).
  3. Browser cache: Chrome caches DNS independently. chrome://net-internals/#dns to clear/inspect.
  4. NXDOMAIN negative cache: If the domain didn’t exist before, the negative cache may persist even after you create the record.

Split-Horizon DNS

Returning different answers based on who’s asking:

  • Internal clients get 10.0.1.45 (private IP)
  • External clients get 93.184.216.34 (public IP)

Common implementations: BIND views, AWS Route 53 private hosted zones, CoreDNS in Kubernetes. The footgun: developers test from outside, see the public IP, and assume everything works. Production traffic goes internal and hits a firewall rule nobody expected.
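The view selection itself is simple, which is part of why it is easy to forget. A sketch using the two addresses above; the 10.0.0.0/8 internal range and the function name are assumptions for illustration, not how BIND or Route 53 are configured.

```python
import ipaddress

# Assumption: clients inside 10.0.0.0/8 count as "internal".
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]
VIEWS = {"internal": "10.0.1.45", "external": "93.184.216.34"}

def answer_for(client_ip):
    """Return the A record a split-horizon server would give this client."""
    addr = ipaddress.ip_address(client_ip)
    is_internal = any(addr in net for net in INTERNAL_NETS)
    return VIEWS["internal" if is_internal else "external"]

print(answer_for("10.2.3.4"))     # 10.0.1.45
print(answer_for("203.0.113.9"))  # 93.184.216.34
```

When debugging, run the lookup from a host in each view; a test from only one side tells you nothing about the other.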

GeoDNS

The authoritative server returns different A records based on the querying resolver’s location. CDNs use this to route users to the nearest PoP. The gotcha: the decision is based on the recursive resolver’s location, not the end user’s. A user in Singapore whose queries leave from a resolver site in another region may be routed to a PoP near the resolver rather than near themselves. ECS (EDNS Client Subnet) partially fixes this by forwarding a truncated version of the user’s IP to the authoritative server, provided both the resolver and the authoritative server support it.
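The "truncated version of the user's IP" is a network prefix rather than the full address. A sketch of that truncation, assuming /24 for IPv4 and /56 for IPv6, which are commonly used prefix lengths; actual resolvers choose their own.

```python
import ipaddress

def ecs_subnet(client_ip, v4_prefix=24, v6_prefix=56):
    """Return the client subnet a resolver might send in an ECS option."""
    addr = ipaddress.ip_address(client_ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    # strict=False zeroes the host bits instead of raising an error.
    return str(ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False))

print(ecs_subnet("203.0.113.77"))           # 203.0.113.0/24
print(ecs_subnet("2001:db8:1234:5678::1"))  # 2001:db8:1234:5600::/56
```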


7. DNS for Service Discovery

Kubernetes Internal DNS

Kubernetes runs CoreDNS as the cluster DNS server. Every Service gets a DNS entry:

<service>.<namespace>.svc.cluster.local

| Resource | DNS Name |
| --- | --- |
| Service api in default namespace | api.default.svc.cluster.local |
| Service api from another namespace | api.default (short form works within cluster) |
| Headless service (no ClusterIP) | Returns individual pod IPs as A records |
| StatefulSet pod web-0 | web-0.web.default.svc.cluster.local |

Headless services (clusterIP: None) return one A/AAAA record per ready pod, plus SRV records for named ports, which gives stateful workloads stable per-pod addressing. This is critical for databases like Cassandra and Elasticsearch that need to know all peers.

ELI5: Inside Kubernetes, every service gets a consistent phone number that never changes, even if the actual pods behind it change every deployment. It’s like dialing the “Pizza department” extension instead of memorizing each pizza chef’s personal number.

SRV Records for Port Discovery

SRV records encode protocol, hostname, and port:

dig _https._tcp.example.com SRV
# Returns: priority weight port target
# 10 5 443 server1.example.com.

Used by: Active Directory (DC discovery), Kubernetes (headless services), Consul, etcd, Kafka.

Health-Check-Aware DNS

Route 53 health checks + DNS failover:

  1. Route 53 health checkers probe your endpoint over HTTP, HTTPS, or TCP at a configurable interval (30 seconds by default)
  2. If endpoint fails health checks, Route 53 automatically removes that record from responses
  3. Surviving endpoints absorb traffic

This gives you “DNS-level failover” without needing a load balancer. TTL matters here: set it low (60s) so clients don’t hold stale IPs when a host fails.
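The core of health-aware answering is filtering the record set before responding. The sketch below is a generic illustration, not Route 53's implementation; the addresses and the fail-open choice (return everything when all checks fail) are assumptions worth deciding deliberately in your own setup.

```python
def healthy_answers(records, health):
    """Return only the records whose latest health check passed."""
    healthy = [ip for ip in records if health.get(ip, False)]
    # Fail open: if every check fails, answer with all records rather
    # than an empty response, since the checker itself may be broken.
    return healthy or list(records)

records = ["198.51.100.10", "198.51.100.11", "198.51.100.12"]
health = {"198.51.100.10": True, "198.51.100.11": False, "198.51.100.12": True}
print(healthy_answers(records, health))
# ['198.51.100.10', '198.51.100.12']
```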

DNS-Based Load Balancing

Round-robin A records: return multiple IPs for a single name. Clients pick one (usually the first). Problems:

  • No health awareness — dead servers stay in rotation
  • Client-side caching breaks the round-robin
  • “Sticky” clients often ignore the round-robin entirely

Weighted records (Route 53, Cloudflare): assign traffic percentages per record. Useful for canary deployments: send 5% of traffic to new version by putting its IP at weight 5 vs weight 95 for stable.


8. DNS Attacks and Defenses

DNS Amplification (DDoS)

  1. Attacker sends DNS queries with the victim’s IP as the source (IP spoofing)
  2. Queries are for large responses (DNSSEC-signed zones, ANY queries)
  3. DNS servers flood the victim with responses 50–100x larger than the original query

Amplification factor: a 60-byte query can return a 3000-byte response. With thousands of open resolvers, this generates massive traffic toward the victim.

Defense: Don’t run open resolvers. Rate-limit responses per source IP (Response Rate Limiting, RRL). BCP38 ingress filtering to block spoofed source IPs at the network edge.

Kaminsky Attack (Cache Poisoning)

Dan Kaminsky’s 2008 discovery: by racing to answer a query for a non-existent subdomain (a1b2c3.bank.com), an attacker could inject a forged NS record for the parent zone (bank.com), poisoning the entire zone in one shot.

Defense: Source port randomization (makes winning the race require guessing both transaction ID AND source port, up to 64K × 64K combinations). DNSSEC defeats the attack for signed zones when the resolver validates, because forged records fail signature validation.

DNS Tunneling

DNS is almost never blocked at firewalls. Attackers exploit this:

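  • Encode data as DNS query names: NBSWY3DP.exfil.attacker.com (base32 for “hello”; tunneling tools tend to prefer base32 because it survives DNS case-insensitivity)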
  • The authoritative server for attacker.com receives the queries and extracts the data
  • Response can also carry encoded data back

Used for: data exfiltration from tightly locked-down networks where DNS is the only protocol allowed out, C2 communication from malware behind strict firewalls.

Detection: Unusually long subdomain labels, high query rate to a single domain, queries for domains with high entropy names, large TXT responses.
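Two of these signals, label length and entropy, are easy to sketch. The thresholds below (40 characters, 3.5 bits per character, a 16-character minimum before entropy is considered) are illustrative guesses, not tuned values; real detection also needs per-domain query rates and allowlists for CDNs that use hashed hostnames.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Bits of entropy per character in `text`."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def looks_like_tunnel(qname, max_label=40, entropy_threshold=3.5):
    """Flag names whose longest label is very long or long and random-looking."""
    labels = qname.rstrip(".").split(".")
    longest = max(labels, key=len)
    if len(longest) > max_label:
        return True
    return len(longest) >= 16 and shannon_entropy(longest) > entropy_threshold

print(looks_like_tunnel("www.example.com"))
print(looks_like_tunnel("abcdefghijklmnopqrst.exfil.attacker.com"))
```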

DNS Rebinding

  1. Attacker registers evil.com with a very short TTL (1 second)
  2. Initial response points to a public IP attacker controls — browser allows connection
  3. After TTL expires, DNS response changes to 192.168.1.1 (victim’s internal router)
  4. Browser’s same-origin policy compares scheme, hostname, and port, not the IP, so it allows the rebind
  5. Attacker’s JavaScript now makes requests to the victim’s internal network

Defense: DNS rebinding protection in resolvers (refuse to resolve public names to private IPs). Bind to specific interfaces. Use private TLDs for internal services.
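The resolver-side check can be sketched with Python's `ipaddress` module: drop any answer for a public name that points into private, loopback, or link-local space. This is a simplified illustration; the function names are hypothetical, and a real resolver needs an allowlist for internal domains that legitimately resolve to private addresses.

```python
import ipaddress

def is_rebinding_suspect(answer_ip):
    """True when an answer points at private, loopback, or link-local space."""
    addr = ipaddress.ip_address(answer_ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local

def filter_answers(answers):
    """Keep only answers that are safe to return for a public name."""
    return [ip for ip in answers if not is_rebinding_suspect(ip)]

print(filter_answers(["93.184.216.34", "192.168.1.1", "127.0.0.1"]))
# ['93.184.216.34']
```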

ELI5: DNS rebinding is like a thief giving you a business card for a legitimate store, getting you to trust them, then quietly swapping the address on the card to their warehouse after you’ve already decided to trust them. Your rules only check the name, not where it actually leads.

DNS Firewalls (RPZ — Response Policy Zones)

RPZ lets resolvers override DNS responses based on policy:

  • Block known malware domains (return NXDOMAIN or redirect to sinkhole)
  • Block adult content
  • Block data exfiltration domains

Used by enterprise security products (Cisco Umbrella, Palo Alto DNS Security) and some ISPs. Also used by governments for censorship — same mechanism, different intent.
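The override logic amounts to a policy lookup that runs before the upstream answer is returned. A toy sketch: the blocklist entries are hypothetical, and matching every parent domain is a simplification of how real RPZ zones express wildcards.

```python
# Hypothetical policy entries; a real RPZ is distributed as a DNS zone.
BLOCKLIST = {"malware.example", "exfil.example"}

def apply_policy(qname, upstream_answer, sinkhole=None):
    """Return the policy response if qname is blocked, else the real answer."""
    labels = qname.rstrip(".").lower().split(".")
    # Check the name itself and each parent domain against the policy.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in BLOCKLIST:
            return sinkhole if sinkhole else "NXDOMAIN"
    return upstream_answer

print(apply_policy("c2.malware.example.", "198.51.100.7"))  # NXDOMAIN
print(apply_policy("www.example.com", "93.184.216.34"))     # 93.184.216.34
```

Passing a `sinkhole` address instead of returning NXDOMAIN lets you log which internal hosts tried to reach a blocked domain.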


Summary Reference Table

| Concept | Key Detail | Principal-Level Gotcha |
| --- | --- | --- |
| Recursive resolver | Does all the legwork for your stub | Resolver cache is shared: one poisoned response affects all users |
| TTL | Cache lifetime in seconds | Lower before migrations; some resolvers over-cache regardless |
| CNAME | Alias, not an IP | Can’t use at zone apex; can’t share name with other record types |
| SOA serial | Must increment on every zone change | Secondaries skip the transfer if the serial doesn’t increase (e.g. after switching between date-based and counter formats) |
| DNSSEC | Signs records, not the wire | Doesn’t encrypt queries; NSEC enables zone enumeration |
| DoH | DNS over HTTPS port 443 | Bypasses enterprise DNS policies and split-horizon setups |
| SRV | Service + port discovery | Weight is relative, not a percentage, and only applies among records of equal priority |
| GeoDNS | Routes by resolver location, not user | ECS (EDNS Client Subnet) needed for accurate user geolocation |
| DNS tunneling | Data exfil over DNS queries | Almost never blocked; monitor for high-entropy subdomains |
| Cache poisoning | Inject fake records into resolver cache | Source port randomization + DNSSEC are the real fixes |