DNS & Name Resolution

DNS is the phone book of the internet — except the phone book has multiple layers of phone books, caches everything for a while, can be wrong, can be lied to, and your app silently breaks when it misbehaves. Understanding DNS deeply means you stop treating “DNS issue” as a mystery and start treating it as a solvable, debuggable system.

1. How DNS Works

The Hierarchy

DNS is a distributed, hierarchical system. There is no single machine that knows all domain-to-IP mappings. Instead:

Root (.) ─── knows where to find .com, .org, .io, etc.
  │
TLD (.com) ─── knows where to find google.com, github.com, etc.
  │
Authoritative (google.com) ─── knows the actual IPs for google.com subdomains

There are 13 root server identities (a.root-servers.net through m.root-servers.net), run by 12 operators (Verisign, ICANN, and others) and spread across well over a thousand physical instances via anycast. Your laptop never talks to a root server directly; your recursive resolver does.

The Full Query Journey

When you type api.github.com in your browser:

Browser cache — checked first (Chrome has its own DNS cache)
OS stub resolver — checks /etc/hosts, then any OS-level cache (for example systemd-resolved on Linux, mDNSResponder on macOS, or the DNS Client service on Windows)
Recursive resolver — your configured DNS server (ISP’s, or 8.8.8.8, or 1.1.1.1). This is the workhorse.
Root servers — recursive resolver asks: “who handles .com?” → gets NS records for .com TLD servers
TLD servers — recursive resolver asks: “who handles github.com?” → gets NS records pointing to GitHub’s authoritative servers
Authoritative servers — recursive resolver asks: “what’s the IP for api.github.com?” → gets the A record
Response travels back, gets cached at each layer, returns to your browser

The total elapsed time for a cold query is typically 50–150ms. Warm cache hits are sub-millisecond.

ELI5: It’s like asking for a phone number at a library. You ask the librarian (recursive resolver). The librarian doesn’t know but checks with the head librarian (root), who says “go look in the ‘businesses’ section” (TLD). That section says “look under G for GitHub.” The GitHub shelf has the actual phone number (authoritative). The librarian writes it on a Post-it (cache) so they don’t have to run around next time.

Recursive vs Iterative Queries

Query Type	Who Does the Work	Used By
Recursive	Server does all lookups on your behalf, returns final answer	Stub resolver → recursive resolver
Iterative	Server returns a referral (“ask this server next”), client does the walking	Recursive resolver → root/TLD/auth

Your laptop sends a recursive query to your recursive resolver: “go figure out github.com and tell me the answer.” The recursive resolver then sends iterative queries: “root, who handles .com? … TLD, who handles github.com? … auth, what’s api.github.com?”
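The referral walk described above can be sketched as a toy model. The server names (`root`, `tld-com`, `auth-github`) and the 192.0.2.10 address are made up for illustration, and no network traffic is involved; a real resolver would also handle caching, timeouts, and glue records.

```python
# Toy model of iterative resolution over an in-memory hierarchy.
# All server names and the IP below are hypothetical.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"github.com.": "auth-github"}},
    "auth-github": {"answers": {"api.github.com.": "192.0.2.10"}},
}


def resolve_iteratively(name, start="root"):
    """Follow referrals like a recursive resolver; return (ip, servers_asked)."""
    server, path = start, []
    while True:
        path.append(server)
        zone = SERVERS[server]
        # An authoritative server answers directly.
        if name in zone.get("answers", {}):
            return zone["answers"][name], path
        # Otherwise follow the most specific referral covering the name,
        # matching on label boundaries only.
        matches = [
            suffix for suffix in zone.get("referrals", {})
            if name == suffix or name.endswith("." + suffix)
        ]
        if not matches:
            raise LookupError(f"no answer or referral for {name}")
        server = zone["referrals"][max(matches, key=len)]


ip, path = resolve_iteratively("api.github.com.")
print(ip, path)
```

The stub resolver on a laptop only ever sees the final `ip`; the `path` list is the work the recursive resolver does on its behalf.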

Why UDP Port 53 — and When It Falls Back to TCP

DNS uses UDP for most queries because:

Most DNS responses fit in a single packet (historically capped at 512 bytes; EDNS0 lets a client advertise a larger buffer, commonly 1232 or 4096 bytes)
UDP has no connection overhead — no handshake, no teardown
DNS queries are stateless — if no response, just retry

DNS falls back to TCP port 53 when:

Response exceeds the advertised EDNS0 buffer size (truncated flag is set, client retries over TCP)
Zone transfers (AXFR/IXFR) — always TCP, can be megabytes of data
Large DNSSEC responses — signatures add significant size and often push a response past the UDP limit

Common mistake: Firewall rules that allow UDP/53 but block TCP/53. Works fine until DNSSEC or large responses trigger TCP fallback. The symptom: some queries silently fail or return truncated results.
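To make the truncation fallback concrete, here is a minimal sketch of the wire-level pieces involved: encoding a query, checking the TC (truncated) bit in a response header, and adding the 2-byte length prefix that DNS over TCP requires. The function names and the query ID are illustrative choices, not part of any library.

```python
import struct

TC_FLAG = 0x0200  # "truncated" bit in the 16-bit header flags field


def build_query(name, qtype=1, query_id=0x1234):
    """Encode a minimal DNS query: RD set, one question, class IN."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def is_truncated(response):
    """Return True when the TC bit is set, meaning: retry over TCP."""
    (flags,) = struct.unpack("!H", response[2:4])
    return bool(flags & TC_FLAG)


def frame_for_tcp(message):
    """DNS over TCP prefixes each message with its length as 2 bytes."""
    return struct.pack("!H", len(message)) + message
```

A client would send `build_query(...)` over UDP, and if `is_truncated(reply)` is true, resend `frame_for_tcp(query)` over a TCP connection to port 53. That second step is exactly what a UDP-only firewall rule breaks.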

2. DNS Record Types

Core Records

Record	Purpose	Example
A	IPv4 address	`api.example.com → 93.184.216.34`
AAAA	IPv6 address	`api.example.com → 2606:2800:220:1:248:1893:25c8:1946`
CNAME	Alias to another name	`www.example.com → example.com`
MX	Mail server (with priority)	`example.com → 10 mail.example.com`
NS	Authoritative nameservers for a zone	`example.com → ns1.example.com`
TXT	Arbitrary text (SPF, DKIM, verification)	`"v=spf1 include:sendgrid.net ~all"`
PTR	Reverse DNS: IP → hostname	`34.216.184.93.in-addr.arpa → api.example.com`

SOA — Start of Authority

Every DNS zone has exactly one SOA record. It contains:

Primary nameserver — canonical NS for the zone
Responsible email — zone admin contact, with the @ written as a dot (admin.example.com. means admin@example.com)
Serial number — incremented every time the zone changes. Secondary nameservers use this to detect updates.
Refresh / Retry / Expire / Minimum TTL — zone transfer timing parameters

example.com. SOA ns1.example.com. admin.example.com. (
    2024010501  ; serial
    3600        ; refresh (check for updates every hour)
    900         ; retry (if refresh fails, retry after 15 min)
    604800      ; expire (stop serving after 7 days without contact)
    300         ; minimum TTL (negative caching)
)

The serial number format YYYYMMDDNN is a convention, not a requirement — it just needs to be monotonically increasing.
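As an illustration of the serial rules, the sketch below bumps a YYYYMMDDNN serial without ever going backwards, and compares two serials the way secondaries do (32-bit serial number arithmetic from RFC 1982). The function names are hypothetical.

```python
from datetime import date


def next_serial(current, today):
    """Return the next YYYYMMDDNN serial, never lower than current + 1."""
    base = int(today.strftime("%Y%m%d")) * 100
    return max(base, current + 1)


def serial_is_newer(candidate, current):
    """RFC 1982 comparison for 32-bit serials, which tolerates wraparound."""
    diff = (candidate - current) % 2**32
    return 0 < diff < 2**31


print(next_serial(2024010501, date(2024, 1, 5)))  # second edit on the same day
print(serial_is_newer(1, 2024010501))             # a reset counter is not "newer"
```

The second call shows why switching serial schemes needs care: a secondary that holds 2024010501 will not treat serial 1 as an update.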

SRV — Service Discovery

SRV records tell you not just the hostname, but also the port and protocol for a service:

_https._tcp.example.com. SRV 10 5 443 server1.example.com.
                              ↑  ↑  ↑
                           priority weight port

Kubernetes uses SRV records internally. Active Directory relies on them heavily for domain controller discovery. If you’re debugging AD join failures, check SRV records first.
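A client that consumes SRV records has to honor priority and weight. The following is a simplified sketch of that selection (lowest priority first, then a weighted random pick); it does not reproduce the full ordering procedure from RFC 2782, and the record tuples are made-up examples.

```python
import random


def pick_srv_target(records, rng=random):
    """Pick one (target, port) from tuples of (priority, weight, port, target).

    The lowest priority value wins; weight is a relative share within it.
    """
    best = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best]
    weights = [r[1] for r in candidates]
    if sum(weights) == 0:
        chosen = rng.choice(candidates)  # all-zero weights: pick uniformly
    else:
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return chosen[3], chosen[2]


records = [
    (10, 5, 443, "server1.example.com."),
    (10, 15, 443, "server2.example.com."),
    (20, 100, 443, "backup.example.com."),  # only used if priority 10 is gone
]
print(pick_srv_target(records))
```

With these weights, `server2` should receive roughly three quarters of the picks, and `backup` none while any priority-10 record is present.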

CAA — Certificate Authority Authorization

Tells CAs which ones are allowed to issue certs for your domain:

example.com. CAA 0 issue "letsencrypt.org"
example.com. CAA 0 issuewild ";"  ; nobody can issue wildcards

Most engineers don’t know this exists until they’re surprised by a CA issuing a cert for their domain after a phishing attack.

CNAME vs A Record Trade-offs

Scenario	Use CNAME	Use A Record
Pointing to a service with changing IPs (CDN, SaaS)	Yes	No
Zone apex (root domain, `example.com`)	No — can’t CNAME apex	Yes
Want to add other records at same name	No — CNAME is exclusive	Yes
Internal aliases	Fine	Fine
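The two CNAME restrictions in the table are easy to check mechanically before a zone is published. This is a minimal lint sketch over hypothetical `(name, type)` pairs; it ignores the DNSSEC record types that are allowed to sit next to a CNAME.

```python
def cname_violations(zone_apex, records):
    """Return human-readable problems for a list of (name, rtype) pairs."""
    types_by_name = {}
    for name, rtype in records:
        types_by_name.setdefault(name, set()).add(rtype)

    problems = []
    for name, rtypes in sorted(types_by_name.items()):
        if "CNAME" not in rtypes:
            continue
        if name == zone_apex:
            problems.append(f"{name}: CNAME at zone apex")
        others = rtypes - {"CNAME"}
        if others:
            problems.append(f"{name}: CNAME coexists with {sorted(others)}")
    return problems


records = [
    ("example.com.", "SOA"),
    ("example.com.", "NS"),
    ("example.com.", "CNAME"),      # invalid: apex already holds SOA and NS
    ("www.example.com.", "CNAME"),  # fine
    ("mail.example.com.", "CNAME"),
    ("mail.example.com.", "MX"),    # invalid: CNAME must be alone
]
for problem in cname_violations("example.com.", records):
    print(problem)
```

The apex case fails twice over, which is the underlying reason for the rule: the apex always carries SOA and NS records, and a CNAME cannot coexist with them.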

ALIAS / ANAME records are a non-standard extension offered by some DNS providers (Route 53 calls them alias records; Cloudflare’s equivalent is CNAME flattening) that behave like CNAME at the apex: the provider resolves the target and returns A/AAAA records. Useful for pointing example.com to a load balancer hostname.

ELI5: CNAME is “forward my mail to this other address.” A record is “here’s my actual street address.” You can’t have an apartment building whose entire mailing address is “forward to somewhere else” (no CNAME at apex) — it doesn’t make sense. But you can have individual apartments forwarded.

3. DNS Caching & TTL

The Cache Chain

Browser cache (seconds to minutes)
     ↓ miss
OS / stub resolver cache (minutes)
     ↓ miss
Recursive resolver cache (shared across many users)
     ↓ miss
CDN edge resolver (if using CDN)
     ↓ miss
Authoritative server

Each layer caches the response for the duration of the record’s TTL. A TTL of 300 means “cache this for 5 minutes.” When it expires, the next request triggers a fresh lookup.
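Every layer in that chain behaves roughly like the small cache below. This is a sketch with an injectable clock so expiry can be demonstrated without waiting; the class name and method names are illustrative.

```python
import time


class TtlCache:
    """Minimal DNS-style cache: entries expire TTL seconds after insertion."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (name, rtype) -> (value, expires_at)

    def put(self, name, rtype, value, ttl):
        self._entries[(name, rtype)] = (value, self._clock() + ttl)

    def get(self, name, rtype):
        """Return (value, remaining_ttl), or None on a miss or expiry."""
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None
        value, expires_at = entry
        remaining = expires_at - self._clock()
        if remaining <= 0:
            del self._entries[(name, rtype)]
            return None
        return value, remaining


now = [0.0]  # fake clock so the example is deterministic
cache = TtlCache(clock=lambda: now[0])
cache.put("api.example.com.", "A", "93.184.216.34", ttl=300)
now[0] = 120.0
print(cache.get("api.example.com.", "A"))  # still cached, 180 seconds left
now[0] = 300.0
print(cache.get("api.example.com.", "A"))  # expired, next lookup goes upstream
```

Note that the remaining TTL counts down: a downstream cache that fills from this one at the 120-second mark should only hold the record for the remaining 180 seconds.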

TTL Trade-offs

TTL Value	Pros	Cons
Low (30–300s)	Fast failover, fast rollout	More DNS queries, more load on auth servers
High (3600–86400s)	Fewer queries, cheaper	Stale data persists for hours, slow disaster recovery

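A common strategy: lower the TTL to 60s at least one full old-TTL period before a planned migration (an hour ahead if the old TTL was 3600s, a day ahead if it was 86400s), so every cache has picked up the short TTL. Do the migration. Raise the TTL back to 3600s after confirming stability.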

ELI5: TTL is like the “use by” date on food in your fridge. If it says “good for 5 minutes,” you check if it’s still fresh after 5 minutes. Long TTL = food lasts all week (convenient but risky if it goes bad). Short TTL = check the grocery store every minute (fresh but exhausting).

Negative Caching

When a domain doesn’t exist (NXDOMAIN response), resolvers cache that non-existence for the lesser of the SOA record’s own TTL and its minimum field (RFC 2308). This means if a hostname is queried before its record exists, for example because of a typo in the zone, the “doesn’t exist” answer gets cached, and fixing the record may take minutes to reach users who already got the NXDOMAIN.
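The negative-caching lifetime is a one-line calculation. Per RFC 2308 it is the smaller of the SOA record's own TTL and the SOA minimum field; the optional `resolver_cap` parameter is an assumption standing in for the upper bound many resolvers apply on top.

```python
def negative_cache_ttl(soa_record_ttl, soa_minimum, resolver_cap=None):
    """Seconds an NXDOMAIN may be cached (RFC 2308 rule, plus an optional cap)."""
    ttl = min(soa_record_ttl, soa_minimum)
    if resolver_cap is not None:
        ttl = min(ttl, resolver_cap)
    return ttl


# SOA served with TTL 3600 and a minimum field of 300, as in the example zone.
print(negative_cache_ttl(3600, 300))
```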

Cache Poisoning

An attacker tricks a recursive resolver into caching a false record. Classic approach:

Attacker triggers a query for bank.com to a target resolver
Attacker floods the resolver with forged responses, trying to win the race against the real authoritative server
If successful, the resolver caches the attacker’s IP for bank.com
All users behind that resolver get sent to the attacker

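The Kaminsky attack (2008) dramatically sped this up by querying a fresh random subdomain on every attempt, so the attacker never has to wait for a cached answer to expire and can retry back to back. The fix: source port randomization, which adds up to 16 bits of entropy on top of the 16-bit transaction ID and raises the guessing difficulty from ~65K to as many as ~4 billion combinations.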
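The effect of port randomization on the attacker's odds can be worked through numerically. This sketch assumes the attacker sends distinct guesses within a single race window and treats the full 16-bit port range as an upper bound (real ephemeral port ranges are smaller).

```python
TXID_SPACE = 2**16  # 16-bit transaction ID
PORT_SPACE = 2**16  # upper bound on source ports; usable ranges are smaller


def spoof_success_probability(forged_responses, randomize_port):
    """Chance that one of N distinct forged replies matches in a single race."""
    space = TXID_SPACE * (PORT_SPACE if randomize_port else 1)
    return min(1.0, forged_responses / space)


for randomized in (False, True):
    p = spoof_success_probability(100, randomized)
    print(f"port randomization={randomized}: {p:.2e} per race")
```

Because the Kaminsky technique lets the attacker start a new race immediately, the per-race probability matters more than it looks: without port randomization, a few hundred races are enough to make success likely.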

Common mistake: Running open resolvers (resolvers that answer queries from any IP). These can be used in cache poisoning attacks and DNS amplification DDoS.

4. DNSSEC

The Problem

DNS was designed in 1983 with no authentication. A resolver has no way to verify that a response came from the legitimate authoritative server and wasn’t tampered with in transit. That’s the gap DNSSEC fills.

How It Works

DNSSEC adds digital signatures to DNS records. Each zone typically has two key pairs:

ZSK (Zone Signing Key) — signs the zone’s record sets
KSK (Key Signing Key) — signs the DNSKEY record set, which contains the ZSK

New record types:

RRSIG — the signature for a record set
DNSKEY — the public key for the zone
DS — hash of child zone’s KSK, stored in parent zone

Chain of Trust

Root zone (IANA) ─── signs DS records for .com
.com TLD ─────────── signs DS records for example.com
example.com ──────── signs its own A, MX, etc. records

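A DNSSEC-validating resolver follows this chain from the root (which it trusts as a “trust anchor”) down to the record. If a signature is invalid, or missing from a zone the parent says is signed, the resolver returns SERVFAIL rather than a potentially forged answer. Zones with no DS record in the parent are treated as unsigned and resolve normally.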
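One link of that chain, the DS record in the parent matching the child's KSK, can be reproduced with the standard library. The sketch below follows the DS digest and key tag definitions in RFC 4034 using SHA-256; the key bytes are placeholders, and signature verification itself (RRSIG checking) is out of scope because it needs a cryptography library.

```python
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in lowercase DNS wire format."""
    labels = [l for l in name.lower().rstrip(".").split(".") if l]
    return b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"


def dnskey_rdata(flags, algorithm, public_key):
    """DNSKEY RDATA: flags (2 bytes), protocol (always 3), algorithm, key."""
    return struct.pack("!HBB", flags, 3, algorithm) + public_key


def key_tag(rdata):
    """Key tag checksum from RFC 4034 Appendix B."""
    acc = sum(b if i & 1 else b << 8 for i, b in enumerate(rdata))
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_digest_sha256(owner, rdata):
    """DS digest type 2: SHA-256 over owner name plus DNSKEY RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()


def ds_matches(parent_ds_hex, owner, rdata):
    """True when the parent's DS record vouches for this child DNSKEY."""
    return parent_ds_hex.lower() == ds_digest_sha256(owner, rdata)


# Placeholder key material: flags 257 marks a KSK, 13 is ECDSA P-256.
ksk = dnskey_rdata(257, 13, b"\x01" * 64)
ds = ds_digest_sha256("example.com.", ksk)
print(key_tag(ksk), ds_matches(ds, "example.com.", ksk))
```

A botched KSK rollover is the case where `ds_matches` returns false for every key the zone publishes: the parent still vouches for the old key, so validating resolvers reject the whole zone.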

ELI5: Normal DNS is like receiving a signed check with no way to verify the signature is real. DNSSEC is like having a notarized certificate chain — the bank (root) vouches for the notary (.com), who vouches for the person signing (your domain). Break any link in the chain and the check bounces.

Why DNSSEC Adoption Is Low

Complexity: key rollovers are risky. If you mess up a KSK rollover, your entire domain becomes unresolvable.
Broken resolvers: some firewalls strip DNSSEC-related records, breaking validation
NSEC walking: DNSSEC’s authenticated denial of existence (NSEC records) accidentally exposes all names in a zone to enumeration. NSEC3 replaces the names with hashes, which mitigates but doesn’t eliminate this, since the hashes can be brute-forced offline.
No encryption: DNSSEC authenticates responses but doesn’t encrypt them. An observer can still see what you queried.

DANE

DANE (DNS-Based Authentication of Named Entities) uses DNSSEC to publish TLS certificate fingerprints in DNS (TLSA records). This lets you validate a TLS certificate without relying on any CA — useful for email (SMTP DANE) where the CA model is weak.

5. DNS over HTTPS (DoH) and DNS over TLS (DoT)

The Privacy Problem

Traditional DNS is plaintext UDP. Your ISP, network admin, or anyone on-path can see every domain you resolve. Even if your actual traffic is HTTPS, your DNS queries reveal your browsing patterns.

DoT — DNS over TLS (Port 853)

Wraps DNS queries in a TLS session:

Encrypted — no eavesdropping
Authenticated — resolver identity verified by certificate
Easy to detect and block — it’s a distinct port (853)

DoH — DNS over HTTPS (Port 443)

Sends DNS queries as HTTPS requests to a DoH endpoint:

Encrypted — same as DoT
Hard to block — looks like regular HTTPS traffic to port 443
Bypasses network DNS policies — which is both the feature and the problem

GET /dns-query?dns=<base64url-encoded-query> HTTP/2
Host: cloudflare-dns.com
Accept: application/dns-message
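The `dns` parameter in that request is an ordinary wire-format DNS query, encoded as base64url with the padding stripped (RFC 8484). A minimal sketch of building the URL, using the endpoint from the request above as the default; the function name is illustrative and no request is actually sent:

```python
import base64
import struct


def doh_get_url(name, qtype=1, endpoint="https://cloudflare-dns.com/dns-query"):
    """Build an RFC 8484 GET URL carrying a wire-format DNS query."""
    # A query ID of 0 keeps identical queries byte-identical, which helps
    # HTTP caches.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    message = header + qname + struct.pack("!HH", qtype, 1)
    # base64url alphabet, with the trailing "=" padding removed.
    encoded = base64.urlsafe_b64encode(message).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


print(doh_get_url("www.example.com"))
```

The response body comes back as `application/dns-message`, which is the same binary format a UDP reply would use.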

ELI5: Regular DNS is like shouting your destination to a taxi dispatcher over a radio — everyone in the room hears it. DoT puts it in a sealed envelope. DoH puts it in a sealed envelope that looks exactly like every other envelope in the office — nobody can even tell you’re sending something sensitive.

The Centralization Problem

DoH shifts DNS from “many ISP resolvers” to “a few big providers” (Cloudflare 1.1.1.1, Google 8.8.8.8). This creates:

Single points of failure
Concentration of query data with a handful of companies
Bypassing of enterprise DNS policies and split-horizon setups

Oblivious DoH (ODoH)

ODoH adds a proxy between the client and the DoH resolver:

Client encrypts query for the resolver, sends to proxy
Proxy forwards to resolver (can’t see query content)
Resolver answers (doesn’t know who asked)
Result: as long as the proxy and the resolver don’t collude, even Cloudflare can’t correlate your identity with your queries

Protocol	Port	Encrypted	Blockability	Privacy
DNS	53	No	Easy to block	None
DoT	853	Yes	Blockable	Good
DoH	443	Yes	Hard to block	Good
ODoH	443	Yes	Hard to block	Excellent

6. DNS in Practice

Essential Commands

# Basic query — A record for github.com
dig github.com

# Specific record type
dig github.com MX
dig github.com TXT

# Follow the full resolution chain from root
dig +trace github.com

# Query a specific resolver (not your configured one)
dig @8.8.8.8 github.com

# Reverse DNS lookup
dig -x 140.82.114.4

# Short output (IP only)
dig +short github.com

# Check the remaining TTL on a cached response (second column of the answer)
dig +noall +answer github.com

# nslookup (one-shot, non-interactive form)
nslookup github.com
nslookup -type=MX github.com
nslookup github.com 1.1.1.1  # query specific resolver

# host (clean output)
host github.com
host -t MX github.com

dig +trace is the most valuable debugging tool. It shows you exactly which servers were queried, what they returned, and where the resolution chain breaks.

Debugging “It Works for Some Users”

This is almost always a TTL/caching problem. Checklist:

Different TTL stages: User A cached the old record 2 minutes ago (TTL expires in 3 minutes). User B just queried (sees new record). Both are correct — they’re just at different cache states.
ISP resolver lag: Some ISP resolvers ignore TTL and over-cache (especially common with low TTLs).
Browser cache: Chrome caches DNS independently. Use chrome://net-internals/#dns to clear it.
NXDOMAIN negative cache: If the domain didn’t exist before, the negative cache may persist even after you create the record.

Split-Horizon DNS

Returning different answers based on who’s asking:

Internal clients get 10.0.1.45 (private IP)
External clients get 93.184.216.34 (public IP)

Common implementations: BIND views, AWS Route 53 private hosted zones, CoreDNS in Kubernetes. The footgun: developers test from outside, see the public IP, assume everything works. Production traffic goes internal and hits a firewall rule nobody expected.

GeoDNS

Authoritative server returns different A records based on the querying resolver’s location. Used by CDNs to route users to the nearest PoP. The gotcha: resolution is based on the recursive resolver’s location, not the end user’s. A user in Singapore using a public resolver such as Google’s 8.8.8.8 may get routed to a distant PoP if the resolver node that handles the query sits in another region. ECS (EDNS Client Subnet) partially fixes this by forwarding a truncated version of the user’s IP to the authoritative server.
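The truncation step in ECS is simple to illustrate. This sketch assumes the prefix lengths recommended in RFC 7871 (/24 for IPv4, /56 for IPv6); actual resolvers may be configured differently, and the addresses are documentation examples.

```python
import ipaddress


def ecs_subnet(client_ip, v4_prefix=24, v6_prefix=56):
    """Truncate a client address to the prefix a resolver might send in ECS."""
    addr = ipaddress.ip_address(client_ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network)


print(ecs_subnet("203.0.113.77"))
print(ecs_subnet("2001:db8:abcd:1234::1"))
```

The authoritative server sees enough of the address to geolocate the user, but not the exact host.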

7. DNS for Service Discovery

Kubernetes Internal DNS

Kubernetes runs CoreDNS as the cluster DNS server. Every Service gets a DNS entry:

<service>.<namespace>.svc.cluster.local

Resource	DNS Name
Service `api` in `default` namespace	`api.default.svc.cluster.local`
Service `api` from another namespace	`api.default` (short form works within cluster)
Headless service (no ClusterIP)	Returns individual pod IPs as A records
StatefulSet pod `web-0`	`web-0.web.default.svc.cluster.local`

Headless services (clusterIP: None) return one A record per ready pod, plus SRV records for named ports. Paired with a StatefulSet, they give each pod a stable DNS name, which is critical for databases like Cassandra and Elasticsearch that need to know all peers.
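The naming scheme in the table is regular enough to generate. A small sketch, assuming the default `cluster.local` cluster domain (clusters can be configured with a different one); the helper names are illustrative:

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """<service>.<namespace>.svc.<cluster-domain>"""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def statefulset_pod_fqdns(statefulset, replicas, governing_service, namespace,
                          cluster_domain="cluster.local"):
    """Stable per-pod names: <statefulset>-<ordinal>.<service FQDN>."""
    base = service_fqdn(governing_service, namespace, cluster_domain)
    return [f"{statefulset}-{i}.{base}" for i in range(replicas)]


print(service_fqdn("api", "default"))
print(statefulset_pod_fqdns("web", 3, "web", "default"))
```

A clustered database can use the second list as its seed-node configuration, since the names survive pod restarts even though the pod IPs do not.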

ELI5: Inside Kubernetes, every service gets a consistent phone number that never changes, even if the actual pods behind it change every deployment. It’s like dialing the “Pizza department” extension instead of memorizing each pizza chef’s personal number.

SRV Records for Port Discovery

SRV records encode protocol, hostname, and port:

dig _https._tcp.example.com SRV
# Returns: priority weight port target
# 10 5 443 server1.example.com.

Used by: Active Directory (DC discovery), Kubernetes (headless services), Consul, etcd.

Health-Check-Aware DNS

Route 53 health checks + DNS failover:

Route 53 health checkers probe your endpoint over HTTP, HTTPS, or TCP every 30 seconds by default (10 seconds with fast health checks)
If endpoint fails health checks, Route 53 automatically removes that record from responses
Surviving endpoints absorb traffic

This gives you “DNS-level failover” without needing a load balancer. TTL matters here: set it low (60s) so clients don’t hold stale IPs when a host fails.
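The core of health-check-aware DNS is a filter applied at answer time. The sketch below is a generic model, not Route 53's implementation; the `fail_open` behavior (answer with everything when nothing is healthy) is an assumption chosen for the example, and the IPs are documentation addresses.

```python
def healthy_answers(records, health, fail_open=True):
    """Return the IPs a health-aware authoritative server would hand out.

    records: list of IPs for the name.
    health:  dict of IP -> bool, as reported by some external checker.
    """
    alive = [ip for ip in records if health.get(ip, False)]
    if not alive and fail_open:
        # Assumption: if every check fails, the checker is as likely to be
        # wrong as the hosts, so answer with all records rather than none.
        return list(records)
    return alive


records = ["192.0.2.10", "192.0.2.11"]
print(healthy_answers(records, {"192.0.2.10": True, "192.0.2.11": False}))
```

Clients that cached the old answer keep using the dead IP until their TTL runs out, which is why the failover time is roughly the detection time plus the record's TTL.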

DNS-Based Load Balancing

Round-robin A records: return multiple IPs for a single name. Clients pick one (usually the first). Problems:

No health awareness — dead servers stay in rotation
Client-side caching breaks the round-robin
“Sticky” clients often ignore the round-robin entirely

Weighted records (Route 53, Cloudflare): assign traffic percentages per record. Useful for canary deployments: send 5% of traffic to new version by putting its IP at weight 5 vs weight 95 for stable.

8. DNS Attacks and Defenses

DNS Amplification (DDoS)

Attacker sends DNS queries with the victim’s IP as the source (IP spoofing)
Queries are for large responses (DNSSEC-signed zones, ANY queries)
DNS servers flood the victim with responses 50–100x larger than the original query

Amplification factor: a 60-byte query can return a 3000-byte response. With thousands of open resolvers, this generates massive traffic toward the victim.

Defense: Don’t run open resolvers. Rate-limit responses per source IP (Response Rate Limiting, RRL). BCP38 ingress filtering to block spoofed source IPs at the network edge.

Kaminsky Attack (Cache Poisoning)

Dan Kaminsky’s 2008 discovery: by racing to answer a query for a non-existent subdomain (a1b2c3.bank.com), an attacker could inject a forged NS record for the parent zone (bank.com), poisoning the entire zone in one shot.

Defense: Source port randomization (makes winning the race require guessing both transaction ID AND source port, up to 64K × 64K combinations). DNSSEC closes the hole for signed zones when the resolver validates (forged records fail signature validation).

DNS Tunneling

DNS is almost never blocked at firewalls. Attackers exploit this:

Encode data as DNS query names: aGVsbG8=.exfil.attacker.com
The authoritative server for attacker.com receives the queries and extracts the data
Response can also carry encoded data back

Used for: data exfiltration from tightly restricted networks that still allow DNS out, C2 communication from malware behind strict firewalls.

Detection: Unusually long subdomain labels, high query rate to a single domain, queries for domains with high entropy names, large TXT responses.
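Two of those signals, label length and entropy, can be computed per query. This is a rough heuristic sketch; the thresholds (`max_label=40`, `max_entropy=3.5`) are illustrative assumptions that would need tuning against real traffic, and the domain names are made up.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Bits of entropy per character of the string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_tunnel(qname, base_domain, max_label=40, max_entropy=3.5):
    """Flag query names whose subdomain part is unusually long or random."""
    suffix = "." + base_domain.strip(".")
    name = qname.rstrip(".")
    if not name.endswith(suffix):
        return False
    sub = name[: -len(suffix)]
    longest = max(len(label) for label in sub.split("."))
    return longest > max_label or shannon_entropy(sub) > max_entropy


print(looks_like_tunnel("www.example.com", "example.com"))
print(looks_like_tunnel("0123456789abcdef.exfil.attacker.com", "exfil.attacker.com"))
```

In practice this would run over resolver query logs grouped by registered domain, combined with the query-rate signal, since a single high-entropy name (a CDN hash, for instance) is not suspicious on its own.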

DNS Rebinding

Attacker registers evil.com with a very short TTL (1 second)
Initial response points to a public IP attacker controls — browser allows connection
After TTL expires, DNS response changes to 192.168.1.1 (victim’s internal router)
Browser’s same-origin policy compares scheme, hostname, and port, not the resolved IP, so it allows the rebind
Attacker’s JavaScript now makes requests to the victim’s internal network

Defense: DNS rebinding protection in resolvers (refuse to resolve public names to private IPs). Validate the Host header and require authentication on internal services. Bind to specific interfaces.
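The resolver-side protection amounts to inspecting answers for public names before returning them. A minimal sketch of that check using the standard `ipaddress` module; the function name is hypothetical, and a real resolver would need an allowlist for internal zones that legitimately return private addresses.

```python
import ipaddress


def is_rebinding_suspect(answer_ips):
    """True if any answer for a public name points at a non-public address."""
    for ip in answer_ips:
        addr = ipaddress.ip_address(ip)
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            return True
    return False


print(is_rebinding_suspect(["93.184.216.34"]))  # ordinary public answer
print(is_rebinding_suspect(["192.168.1.1"]))    # the rebind step in the attack
```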

ELI5: DNS rebinding is like a thief giving you a business card for a legitimate store, getting you to trust them, then quietly swapping the address on the card to their warehouse after you’ve already decided to trust them. Your rules only check the name, not where it actually leads.

DNS Firewalls (RPZ — Response Policy Zones)

RPZ lets resolvers override DNS responses based on policy:

Block known malware domains (return NXDOMAIN or redirect to sinkhole)
Block adult content
Block data exfiltration domains

Used by enterprise security products (Cisco Umbrella, Palo Alto DNS Security) and some ISPs. Also used by governments for censorship — same mechanism, different intent.

Summary Reference Table

Concept	Key Detail	Principal-Level Gotcha
Recursive resolver	Does all the legwork for your stub	Resolver cache is shared — one poisoned response affects all users
TTL	Cache lifetime in seconds	Lower before migrations; some resolvers over-cache regardless
CNAME	Alias, not an IP	Can’t use at zone apex; can’t share name with other record types
SOA serial	Must increment on every zone change	A serial that goes backwards (for example after switching from YYYYMMDDNN to a plain counter) makes secondaries skip transfers
DNSSEC	Signs records, not the wire	Doesn’t encrypt queries; NSEC allows zone enumeration
DoH	DNS over HTTPS port 443	Bypasses enterprise DNS policies and split-horizon setups
SRV	Service + port discovery	Weight is a relative share among records of the same priority, not a percentage; `10 10` is 50/50 only on average
GeoDNS	Routes by resolver location, not user	ECS (EDNS Client Subnet) needed for accurate user geolocation
DNS tunneling	Data exfil over DNS queries	Almost never blocked — monitor for high-entropy subdomains
Cache poisoning	Inject fake records into resolver cache	Source port randomization + DNSSEC are the real fixes
