DNS & Name Resolution
DNS is the phone book of the internet — except the phone book has multiple layers of phone books, caches everything for a while, can be wrong, can be lied to, and your app silently breaks when it misbehaves. Understanding DNS deeply means you stop treating “DNS issue” as a mystery and start treating it as a solvable, debuggable system.
1. How DNS Works
The Hierarchy
DNS is a distributed, hierarchical system. There is no single machine that knows all domain-to-IP mappings. Instead:
Root (.) ─── knows where to find .com, .org, .io, etc.
│
TLD (.com) ─── knows where to find google.com, github.com, etc.
│
Authoritative (google.com) ─── knows the actual IPs for google.com subdomains
The root zone is served by 13 named root server identities (a.root-servers.net through m.root-servers.net), run by 12 operator organizations (Verisign, ICANN, etc.) and spread across thousands of physical machines via anycast. Your laptop never talks to a root server directly; your recursive resolver does.
The Full Query Journey
When you type api.github.com in your browser:
- Browser cache: checked first (Chrome has its own DNS cache)
- OS stub resolver: checks /etc/hosts, then the OS-level DNS cache
- Recursive resolver: your configured DNS server (your ISP’s, or 8.8.8.8, or 1.1.1.1). This is the workhorse.
- Root servers: the recursive resolver asks “who handles .com?” → gets NS records for the .com TLD servers
- TLD servers: the recursive resolver asks “who handles github.com?” → gets NS records pointing to GitHub’s authoritative servers
- Authoritative servers: the recursive resolver asks “what’s the IP for api.github.com?” → gets the A record
- The response travels back, gets cached at each layer, and returns to your browser
The total elapsed time for a cold query is typically 50–150ms. Warm cache hits are sub-millisecond.
ELI5: It’s like asking for a phone number at a library. You ask the librarian (recursive resolver). The librarian doesn’t know but checks with the head librarian (root), who says “go look in the ‘businesses’ section” (TLD). That section says “look under G for GitHub.” The GitHub shelf has the actual phone number (authoritative). The librarian writes it on a Post-it (cache) so they don’t have to run around next time.
Recursive vs Iterative Queries
| Query Type | Who Does the Work | Used By |
|---|---|---|
| Recursive | Server does all lookups on your behalf, returns final answer | Stub resolver → recursive resolver |
| Iterative | Server returns a referral (“ask this server next”), client does the walking | Recursive resolver → root/TLD/auth |
Your laptop sends a recursive query to your recursive resolver: “go figure out github.com and tell me the answer.” The recursive resolver then sends iterative queries: “root, who handles .com? … TLD, who handles github.com? … auth, what’s api.github.com?”
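The referral walk can be sketched with a toy in-memory hierarchy. This is only an illustration: the server labels, zone data, and the 192.0.2.10 address are made up, and a real resolver speaks the DNS wire protocol and caches every step.

```python
# Toy model of iterative resolution: each "server" either answers or refers.
# All server labels and the address below are illustrative, not real DNS data.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"github.com.": "auth-github"}},
    "auth-github": {"answers": {"api.github.com.": "192.0.2.10"}},
}


def resolve_iteratively(name, start="root", max_hops=10):
    """Walk referrals the way a recursive resolver does; return (ip, path)."""
    server, path = start, []
    for _ in range(max_hops):
        path.append(server)
        data = SERVERS[server]
        if name in data.get("answers", {}):
            return data["answers"][name], path
        # Follow the referral whose zone is the longest suffix of the name.
        matches = [
            zone for zone in data.get("referrals", {})
            if name == zone or name.endswith("." + zone)
        ]
        if not matches:
            return None, path  # the chain breaks at this server
        server = data["referrals"][max(matches, key=len)]
    return None, path
```

The returned path plays the same role as the server list in a trace: it shows which server was asked at each hop and where the walk stopped.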
Why UDP Port 53 — and When It Falls Back to TCP
DNS uses UDP for most queries because:
- DNS responses fit in a single packet (historically under 512 bytes, now up to 4096 bytes with EDNS0)
- UDP has no connection overhead — no handshake, no teardown
- DNS queries are stateless — if no response, just retry
DNS falls back to TCP port 53 when:
- Response exceeds the advertised EDNS0 buffer size (truncated flag is set, client retries over TCP)
- Zone transfers (AXFR/IXFR) — always TCP, can be megabytes of data
- DNSSEC responses: signatures add significant size, so these answers are the ones most likely to exceed the UDP limit and come back truncated
Common mistake: Firewall rules that allow UDP/53 but block TCP/53. Works fine until DNSSEC or large responses trigger TCP fallback. The symptom: some queries silently fail or return truncated results.
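To make the truncation check concrete, here is a minimal sketch that builds a query by hand and inspects the TC bit in a response header. The function names and the fixed transaction ID are assumptions for illustration; the header layout follows the standard 12-byte DNS header.

```python
import struct


def build_query(name, qtype=1, txid=0x1234):
    """Encode a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def is_truncated(message):
    """Return True if the TC bit (0x0200 in the flags word) is set."""
    (flags,) = struct.unpack("!H", message[2:4])
    return bool(flags & 0x0200)
```

A client that sees the TC bit on a UDP response is expected to repeat the same query over TCP, which is exactly the path a UDP-only firewall rule breaks.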
2. DNS Record Types
Core Records
| Record | Purpose | Example |
|---|---|---|
| A | IPv4 address | api.example.com → 93.184.216.34 |
| AAAA | IPv6 address | api.example.com → 2606:2800:220:1:248:1893:25c8:1946 |
| CNAME | Alias to another name | www.example.com → example.com |
| MX | Mail server (with priority) | example.com → 10 mail.example.com |
| NS | Authoritative nameservers for a zone | example.com → ns1.example.com |
| TXT | Arbitrary text (SPF, DKIM, verification) | "v=spf1 include:sendgrid.net ~all" |
| PTR | Reverse DNS: IP → hostname | 34.216.184.93.in-addr.arpa → api.example.com |
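The PTR name in the last row is just the address with its octets reversed under in-addr.arpa. A small sketch using Python's standard ipaddress module (the helper name is hypothetical):

```python
import ipaddress


def reverse_name(ip):
    """Return the reverse-DNS owner name used for a PTR lookup."""
    return ipaddress.ip_address(ip).reverse_pointer
```

For example, reverse_name("93.184.216.34") yields 34.216.184.93.in-addr.arpa, and IPv6 addresses expand nibble by nibble under ip6.arpa.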
SOA — Start of Authority
Every DNS zone has exactly one SOA record. It contains:
- Primary nameserver — canonical NS for the zone
- Responsible email: zone admin contact (the @ is written as a dot, so admin@example.com becomes admin.example.com)
- Serial number: incremented every time the zone changes. Secondary nameservers use this to detect updates.
- Refresh / Retry / Expire / Minimum TTL: zone transfer timing parameters
example.com. SOA ns1.example.com. admin.example.com. (
2024010501 ; serial
3600 ; refresh (check for updates every hour)
900 ; retry (if refresh fails, retry after 15 min)
604800 ; expire (stop serving after 7 days without contact)
300 ; minimum TTL (negative caching)
)
The serial number format YYYYMMDDNN is a convention, not a requirement — it just needs to be monotonically increasing.
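One way to keep a YYYYMMDDNN serial strictly increasing is sketched below. The helper name and the choice to start each day at NN=01 are assumptions; the only hard requirement from the text is that the new value is larger than the old one.

```python
from datetime import date


def next_serial(current, today):
    """Return the next YYYYMMDDNN-style serial, always greater than `current`."""
    base = int(today.strftime("%Y%m%d")) * 100
    if current < base:
        return base + 1  # first change of the day (assumed to start at 01)
    # Already at or past today's range: keep counting up so the value
    # never goes backwards, even after more than 99 changes in one day.
    return current + 1
```

With the serial from the example above, next_serial(2024010501, date(2024, 1, 5)) gives 2024010502.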
SRV — Service Discovery
SRV records tell you not just the hostname, but also the port and protocol for a service:
_https._tcp.example.com. SRV 10 5 443 server1.example.com.
The fields after SRV are, in order: priority (10), weight (5), port (443), and target (server1.example.com.).
Kubernetes uses SRV records internally. Active Directory relies on them heavily for domain controller discovery. If you’re debugging AD join failures, check SRV records first.
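How a client might choose among several SRV records can be sketched as follows. This is a simplification of the ordering described in RFC 2782 (it picks one target instead of producing a full ordered list), and the function name and record tuples are illustrative.

```python
import random


def pick_srv(records, rng=random):
    """records: list of (priority, weight, port, target) tuples.

    The lowest priority value wins; weight splits traffic among records
    that share that priority.
    """
    best = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best]
    weights = [r[1] for r in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)  # all-zero weights: plain random pick
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Records with a higher priority value act as backups: in this sketch they are only reachable if the caller removes the failed lower-priority targets and calls the function again.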
CAA — Certificate Authority Authorization
Tells CAs which ones are allowed to issue certs for your domain:
example.com. CAA 0 issue "letsencrypt.org"
example.com. CAA 0 issuewild ";" ; nobody can issue wildcards
Most engineers don’t know this exists until they’re surprised by a CA issuing a cert for their domain after a phishing attack.
CNAME vs A Record Trade-offs
| Scenario | Use CNAME | Use A Record |
|---|---|---|
| Pointing to a service with changing IPs (CDN, SaaS) | Yes | No |
| Zone apex (root domain, example.com) | No, can’t CNAME the apex | Yes |
| Want to add other records at same name | No, CNAME is exclusive | Yes |
| Internal aliases | Fine | Fine |
ALIAS / ANAME records are a non-standard extension offered by some DNS providers (Route 53 calls them alias records; Cloudflare calls the feature CNAME flattening) that behave like a CNAME at the apex: the provider resolves the target itself and returns A/AAAA records. Useful for pointing example.com to a load balancer hostname.
ELI5: CNAME is “forward my mail to this other address.” A record is “here’s my actual street address.” You can’t have an apartment building whose entire mailing address is “forward to somewhere else” (no CNAME at apex) — it doesn’t make sense. But you can have individual apartments forwarded.
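The two CNAME rules in the table lend themselves to a quick zone lint. The sketch below assumes a zone represented as (name, type) pairs, which is a made-up format, and it ignores the DNSSEC record types that are allowed to sit next to a CNAME.

```python
def cname_violations(records, apex):
    """records: list of (name, rtype) pairs. Return a list of rule violations."""
    by_name = {}
    for name, rtype in records:
        by_name.setdefault(name, set()).add(rtype)

    problems = []
    for name, types in sorted(by_name.items()):
        if "CNAME" not in types:
            continue
        if name == apex:
            # The apex always carries SOA and NS, so a CNAME can never fit.
            problems.append(f"{name}: CNAME at zone apex")
        elif len(types) > 1:
            others = sorted(types - {"CNAME"})
            problems.append(f"{name}: CNAME coexists with {others}")
    return problems
```

Running it over a zone with a CNAME at example.com. and a CNAME plus MX at mail.example.com. reports both names and leaves a lone www CNAME alone.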
3. DNS Caching & TTL
The Cache Chain
Browser cache (seconds to minutes)
↓ miss
OS / stub resolver cache (minutes)
↓ miss
Recursive resolver cache (shared across many users)
↓ miss
CDN edge resolver (if using CDN)
↓ miss
Authoritative server
Each layer caches the response for the duration of the record’s TTL. A TTL of 300 means “cache this for 5 minutes.” When it expires, the next request triggers a fresh lookup.
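The expiry behaviour of a single cache layer can be sketched in a few lines. The class name and the injectable clock are assumptions made so the example is easy to test; real resolver caches also count the TTL down in the answers they hand out.

```python
import time


class TtlCache:
    """Minimal TTL cache: entries vanish once their TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the caller must look it up again
            return None
        return value
```

A record stored with ttl=300 is served from the cache for five minutes, after which get returns None and the layer above has to ask upstream.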
TTL Trade-offs
| TTL Value | Pros | Cons |
|---|---|---|
| Low (30–300s) | Fast failover, fast rollout | More DNS queries, more load on auth servers |
| High (3600–86400s) | Fewer queries, cheaper | Stale data persists for hours, slow disaster recovery |
A common strategy: lower TTL to 60s an hour before a planned migration. Do the migration. Raise TTL back to 3600s after confirming stability.
ELI5: TTL is like the “use by” date on food in your fridge. If it says “good for 5 minutes,” you check if it’s still fresh after 5 minutes. Long TTL = food lasts all week (convenient but risky if it goes bad). Short TTL = check the grocery store every minute (fresh but exhausting).
Negative Caching
When a domain doesn’t exist (NXDOMAIN response), resolvers cache that non-existence for a period taken from the zone’s SOA record: the lower of the SOA record’s own TTL and its minimum field (RFC 2308). This means that if you query a hostname before its record exists, the “doesn’t exist” answer gets cached, and creating the record may take minutes to reach users who already got the NXDOMAIN.
Cache Poisoning
An attacker tricks a recursive resolver into caching a false record. Classic approach:
- Attacker triggers a query for bank.com to a target resolver
- Attacker floods the resolver with forged responses, trying to win the race against the real authoritative server
- If successful, the resolver caches the attacker’s IP for bank.com
- All users behind that resolver get sent to the attacker
The Kaminsky attack (2008) dramatically sped this up by randomizing the subdomain queried, allowing many attempts in parallel. The fix: source port randomization (0–65535) to increase the guessing difficulty from ~65K to ~4 billion combinations.
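The jump from ~65K to ~4 billion is simple arithmetic, sketched below. The probability helper assumes uniform, independent guesses, and 2^16 ports is an upper bound, since real resolvers draw from a smaller usable port range.

```python
TXID_SPACE = 2 ** 16  # 16-bit transaction ID
PORT_SPACE = 2 ** 16  # upper bound; usable source ports are fewer in practice

txid_only = TXID_SPACE
txid_and_port = TXID_SPACE * PORT_SPACE


def hit_probability(search_space, forged_packets):
    """Chance that at least one of n uniform, independent guesses matches."""
    return 1 - (1 - 1 / search_space) ** forged_packets
```

With only the transaction ID to guess, 65,536 forged packets give the attacker roughly a 63% chance of a hit; with the port added, the same flood succeeds well under 0.01% of the time.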
Common mistake: Running open resolvers (resolvers that answer queries from any IP). These can be used in cache poisoning attacks and DNS amplification DDoS.
4. DNSSEC
The Problem
DNS was designed in 1983 with no authentication. A resolver has no way to verify that a response came from the legitimate authoritative server and wasn’t tampered with in transit. That’s the gap DNSSEC fills.
How It Works
DNSSEC adds digital signatures to DNS records. Each zone has a key pair:
- ZSK (Zone Signing Key) — signs the actual records
- KSK (Key Signing Key) — signs the ZSK
New record types:
- RRSIG — the signature for a record set
- DNSKEY — the public key for the zone
- DS — hash of child zone’s KSK, stored in parent zone
Chain of Trust
Root zone (IANA) ─── signs DS records for .com
.com TLD ─────────── signs DS records for example.com
example.com ──────── signs its own A, MX, etc. records
A DNSSEC-validating resolver follows this chain from the root (which it trusts as a “trust anchor”) down to the record. If any signature is invalid or missing, the resolver returns SERVFAIL rather than a potentially forged answer.
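The link between parent and child is the DS digest, which can be computed with nothing but a hash function. The sketch below follows the SHA-256 digest construction from RFC 4034 section 5.1.4 (owner name in wire form followed by the DNSKEY RDATA); the function names are hypothetical and any key bytes passed in are placeholders, not a real key.

```python
import hashlib
import struct


def name_to_wire(name):
    """Canonical (lowercase) wire form of a domain name."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"


def ds_digest_sha256(owner, flags, protocol, algorithm, public_key):
    """Hex SHA-256 digest over owner name + DNSKEY RDATA."""
    rdata = struct.pack("!HBB", flags, protocol, algorithm) + public_key
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest()
```

A validating resolver recomputes this digest from the child's DNSKEY (flags 257 marks a KSK, protocol is always 3) and compares it with the DS record signed by the parent; a mismatch breaks the chain.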
ELI5: Normal DNS is like receiving a signed check with no way to verify the signature is real. DNSSEC is like having a notarized certificate chain — the bank (root) vouches for the notary (.com), who vouches for the person signing (your domain). Break any link in the chain and the check bounces.
Why DNSSEC Adoption Is Low
- Complexity: key rollovers are risky. If you mess up a KSK rollover, your entire domain becomes unresolvable.
- Broken resolvers: some firewalls strip DNSSEC-related records, breaking validation
- NSEC walking: DNSSEC’s authenticated denial of existence (NSEC records) accidentally exposes all names in a zone to enumeration. NSEC3 replaces the names with hashes, which mitigates but doesn’t eliminate this, because the hashes can be attacked offline with a dictionary.
- No encryption: DNSSEC authenticates responses but doesn’t encrypt them. An observer can still see what you queried.
DANE
DANE (DNS-Based Authentication of Named Entities) uses DNSSEC to publish TLS certificate fingerprints in DNS (TLSA records). This lets you validate a TLS certificate without relying on any CA — useful for email (SMTP DANE) where the CA model is weak.
5. DNS over HTTPS (DoH) and DNS over TLS (DoT)
The Privacy Problem
Traditional DNS is plaintext UDP. Your ISP, network admin, or anyone on-path can see every domain you resolve. Even if your actual traffic is HTTPS, your DNS queries reveal your browsing patterns.
DoT — DNS over TLS (Port 853)
Wraps DNS queries in a TLS session:
- Encrypted — no eavesdropping
- Authenticated — resolver identity verified by certificate
- Easy to detect and block — it’s a distinct port (853)
DoH — DNS over HTTPS (Port 443)
Sends DNS queries as HTTPS requests to a DoH endpoint:
- Encrypted — same as DoT
- Hard to block — looks like regular HTTPS traffic to port 443
- Bypasses network DNS policies — which is both the feature and the problem
GET /dns-query?dns=<base64url-encoded-query> HTTP/2
Host: cloudflare-dns.com
Accept: application/dns-message
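The dns parameter in that request is the raw DNS message, base64url-encoded with the padding stripped (RFC 8484). A minimal sketch of building such a URL, without sending anything; the function names are hypothetical and the endpoint is the one from the request above:

```python
import base64
import struct


def wire_query(name, qtype=1):
    """Encode a one-question DNS query. RFC 8484 suggests ID 0 for cacheability."""
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(name, endpoint="https://cloudflare-dns.com/dns-query"):
    """Return the GET URL for a DoH A-record lookup of `name`."""
    encoded = base64.urlsafe_b64encode(wire_query(name)).rstrip(b"=")
    return f"{endpoint}?dns={encoded.decode('ascii')}"
```

For www.example.com this produces the same dns value as the worked example in RFC 8484, which is a convenient sanity check for any hand-rolled encoder.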
ELI5: Regular DNS is like shouting your destination to a taxi dispatcher over a radio — everyone in the room hears it. DoT puts it in a sealed envelope. DoH puts it in a sealed envelope that looks exactly like every other envelope in the office — nobody can even tell you’re sending something sensitive.
The Centralization Problem
DoH shifts DNS from “many ISP resolvers” to “a few big providers” (Cloudflare 1.1.1.1, Google 8.8.8.8). This creates:
- Single points of failure
- Concentration of query data with a handful of companies
- Bypassing of enterprise DNS policies and split-horizon setups
Oblivious DoH (ODoH)
ODoH adds a proxy between the client and the DoH resolver:
- Client encrypts query for the resolver, sends to proxy
- Proxy forwards to resolver (can’t see query content)
- Resolver answers (doesn’t know who asked)
- Even Cloudflare can’t correlate your identity with your queries
| Protocol | Port | Encrypted | Blockability | Privacy |
|---|---|---|---|---|
| DNS | 53 | No | Easy to block | None |
| DoT | 853 | Yes | Blockable | Good |
| DoH | 443 | Yes | Hard to block | Good |
| ODoH | 443 | Yes | Hard to block | Excellent |
6. DNS in Practice
Essential Commands
# Basic query — A record for github.com
dig github.com
# Specific record type
dig github.com MX
dig github.com TXT
# Follow the full resolution chain from root
dig +trace github.com
# Query a specific resolver (not your configured one)
dig @8.8.8.8 github.com
# Reverse DNS lookup
dig -x 140.82.114.4
# Short output (IP only)
dig +short github.com
# Check TTL on a cached response
dig +ttlid github.com
# nslookup (interactive)
nslookup github.com
nslookup -type=MX github.com
nslookup github.com 1.1.1.1 # query specific resolver
# host (clean output)
host github.com
host -t MX github.com
dig +trace is the most valuable debugging tool. It shows you exactly which servers were queried, what they returned, and where the resolution chain breaks.
Debugging “It Works for Some Users”
This is almost always a TTL/caching problem. Checklist:
- Different TTL stages: User A cached the old record 2 minutes ago (TTL expires in 3 minutes). User B just queried (sees new record). Both are correct; they’re just at different cache states.
- ISP resolver lag: Some ISP resolvers ignore TTL and over-cache (especially common with low TTLs).
- Browser cache: Chrome caches DNS independently. Use chrome://net-internals/#dns to clear/inspect.
- NXDOMAIN negative cache: If the domain didn’t exist before, the negative cache may persist even after you create the record.
Split-Horizon DNS
Returning different answers based on who’s asking:
- Internal clients get 10.0.1.45 (private IP)
- External clients get 93.184.216.34 (public IP)
Common implementations: BIND views, AWS Route 53 private hosted zones, CoreDNS in Kubernetes. The footgun: developers test from outside, see the public IP, assume everything works. Production traffic goes internal and hits a firewall rule nobody expected.
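The view selection itself is a source-address match. A minimal sketch, assuming 10.0.0.0/8 is the internal range (an assumption for this example) and reusing the two addresses from the list above:

```python
import ipaddress

# Ordered views: the first matching network wins. The networks are assumptions.
VIEWS = [
    (ipaddress.ip_network("10.0.0.0/8"), "10.0.1.45"),     # internal clients
    (ipaddress.ip_network("0.0.0.0/0"), "93.184.216.34"),  # everyone else
]


def answer_for(client_ip):
    """Return the A record a split-horizon server would hand to this client."""
    addr = ipaddress.ip_address(client_ip)
    for network, answer in VIEWS:
        if addr in network:
            return answer
    return None
```

Calling answer_for with both an internal and an external source address during testing is a cheap way to catch the footgun described above.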
GeoDNS
Authoritative server returns different A records based on the querying resolver’s location. Used by CDNs to route users to the nearest PoP. The gotcha: resolution is based on the recursive resolver’s location, not the end user’s. A user in Singapore using Google’s 8.8.8.8 resolver may get routed to the US because 8.8.8.8 has infrastructure there. ECS (EDNS Client Subnet) partially fixes this by forwarding a truncated version of the user’s IP to the authoritative server.
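The truncation ECS applies can be sketched with the ipaddress module. The /24 and /56 prefix lengths are the values RFC 7871 recommends; resolvers are free to send less, and the function name is hypothetical.

```python
import ipaddress


def ecs_subnet(client_ip, v4_prefix=24, v6_prefix=56):
    """Return the truncated client subnet a resolver might forward via ECS."""
    addr = ipaddress.ip_address(client_ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    return str(ipaddress.ip_network(f"{addr}/{prefix}", strict=False))
```

The authoritative server sees only the subnet, for example 203.0.113.0/24, which is enough for coarse geolocation without exposing the full client address.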
7. DNS for Service Discovery
Kubernetes Internal DNS
Kubernetes runs CoreDNS as the cluster DNS server. Every Service gets a DNS entry:
<service>.<namespace>.svc.cluster.local
| Resource | DNS Name |
|---|---|
| Service api in default namespace | api.default.svc.cluster.local |
| Service api from another namespace | api.default (short form works within cluster) |
| Headless service (no ClusterIP) | Returns individual pod IPs as A records |
| StatefulSet pod web-0 | web-0.web.default.svc.cluster.local |
Headless services (ClusterIP: None) use SRV records for stable pod addressing — critical for databases like Cassandra and Elasticsearch that need to know all peers.
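The naming scheme in the table can be expressed as two small helpers. The function names are hypothetical, and cluster.local is only the default cluster domain; a cluster may be configured with a different one.

```python
def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Fully qualified DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def statefulset_pod_fqdn(pod, service, namespace="default",
                         cluster_domain="cluster.local"):
    """Stable per-pod name behind a headless Service."""
    return f"{pod}.{service_fqdn(service, namespace, cluster_domain)}"
```

Building peer lists from these names, rather than from pod IPs, is what lets clustered databases survive pod restarts.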
ELI5: Inside Kubernetes, every service gets a consistent phone number that never changes, even if the actual pods behind it change every deployment. It’s like dialing the “Pizza department” extension instead of memorizing each pizza chef’s personal number.
SRV Records for Port Discovery
SRV records encode protocol, hostname, and port:
dig _https._tcp.example.com SRV
# Returns: priority weight port target
# 10 5 443 server1.example.com.
Used by: Active Directory (DC discovery), Kubernetes (headless services), Consul, etcd, Kafka.
Health-Check-Aware DNS
Route 53 health checks + DNS failover:
- Route 53 probes your endpoint (HTTP, HTTPS, or TCP checks) every 30 seconds by default
- If endpoint fails health checks, Route 53 automatically removes that record from responses
- Surviving endpoints absorb traffic
This gives you “DNS-level failover” without needing a load balancer. TTL matters here: set it low (60s) so clients don’t hold stale IPs when a host fails.
DNS-Based Load Balancing
Round-robin A records: return multiple IPs for a single name. Clients pick one (usually the first). Problems:
- No health awareness — dead servers stay in rotation
- Client-side caching breaks the round-robin
- “Sticky” clients often ignore the round-robin entirely
Weighted records (Route 53, Cloudflare): assign traffic percentages per record. Useful for canary deployments: send 5% of traffic to new version by putting its IP at weight 5 vs weight 95 for stable.
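A quick simulation shows what a 5/95 weighting means in practice. The addresses are documentation placeholders and the helper name is hypothetical; note that the weights apply per DNS response, so client and resolver caching will skew the real traffic split.

```python
import random


def pick_weighted(records, rng=random):
    """records: list of (ip, weight) pairs. Return one ip, chosen by weight."""
    targets = [ip for ip, _ in records]
    weights = [weight for _, weight in records]
    return rng.choices(targets, weights=weights, k=1)[0]


def canary_share(records, canary_ip, draws=10_000, seed=0):
    """Fraction of simulated responses that point at the canary."""
    rng = random.Random(seed)
    hits = sum(pick_weighted(records, rng) == canary_ip for _ in range(draws))
    return hits / draws
```

With records [("192.0.2.10", 95), ("192.0.2.20", 5)], the canary share comes out close to 5% of responses, not exactly 5% of user requests.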
8. DNS Attacks and Defenses
DNS Amplification (DDoS)
- Attacker sends DNS queries with the victim’s IP as the source (IP spoofing)
- Queries are for large responses (DNSSEC-signed zones, ANY queries)
- DNS servers flood the victim with responses 50–100x larger than the original query
Amplification factor: a 60-byte query can return a 3000-byte response. With thousands of open resolvers, this generates massive traffic toward the victim.
Defense: Don’t run open resolvers. Rate-limit responses per source IP (Response Rate Limiting, RRL). BCP38 ingress filtering to block spoofed source IPs at the network edge.
Kaminsky Attack (Cache Poisoning)
Dan Kaminsky’s 2008 discovery: by racing to answer a query for a non-existent subdomain (a1b2c3.bank.com), an attacker could inject a forged NS record for the parent zone (bank.com), poisoning the entire zone in one shot.
Defense: Source port randomization (winning the race now requires guessing both the transaction ID and the source port: 64K × 64K, roughly 4 billion combinations). DNSSEC defeats the attack for signed zones when the resolver validates, because forged records fail signature validation.
DNS Tunneling
DNS is almost never blocked at firewalls. Attackers exploit this:
- Encode data as DNS query names: aGVsbG8=.exfil.attacker.com
- The authoritative server for attacker.com receives the queries and extracts the data
- Response can also carry encoded data back
Used for: data exfiltration from locked-down networks that still allow outbound DNS, C2 communication from malware behind strict firewalls.
Detection: Unusually long subdomain labels, high query rate to a single domain, queries for domains with high entropy names, large TXT responses.
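Two of those signals, label length and entropy, are easy to prototype. The thresholds below (40 characters, 3.5 bits per character, minimum length 16) are illustrative guesses, not tuned values; real detectors also look at query rate and response sizes.

```python
import math
from collections import Counter


def shannon_entropy(label):
    """Bits of entropy per character in a DNS label."""
    total = len(label)
    counts = Counter(label)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_suspicious(qname, max_label_len=40, entropy_threshold=3.5):
    """Flag query names with very long or very random-looking labels."""
    labels = qname.rstrip(".").split(".")
    return any(
        len(label) > max_label_len
        or (len(label) >= 16 and shannon_entropy(label) > entropy_threshold)
        for label in labels
    )
```

Ordinary hostnames such as www.example.com pass, while long encoded payload labels trip one of the two checks. Expect false positives from CDNs and tracking domains, which also use random-looking labels.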
DNS Rebinding
- Attacker registers evil.com with a very short TTL (1 second)
- Initial response points to a public IP the attacker controls, so the browser allows the connection
- After the TTL expires, the DNS response changes to 192.168.1.1 (victim’s internal router)
- The browser’s same-origin policy compares scheme, hostname, and port, not the resolved IP, so it allows the rebind
- Attacker’s JavaScript now makes requests to the victim’s internal network
Defense: DNS rebinding protection in resolvers (refuse to resolve public names to private IPs). Bind to specific interfaces. Use private TLDs for internal services.
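The resolver-side check amounts to refusing private answers for names outside your own zones. A minimal sketch; the allow-listed suffix .corp.example. and the function name are hypothetical:

```python
import ipaddress


def is_rebinding_answer(qname, answer_ip, internal_suffixes=(".corp.example.",)):
    """Return True if a non-internal name resolves to an internal address."""
    name = qname if qname.endswith(".") else qname + "."
    if name.endswith(internal_suffixes):
        return False  # our own internal zones may use private addresses
    addr = ipaddress.ip_address(answer_ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local
```

A resolver applying this rule would drop the second response in the attack above, because evil.com is not an internal name and 192.168.1.1 is a private address.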
ELI5: DNS rebinding is like a thief giving you a business card for a legitimate store, getting you to trust them, then quietly swapping the address on the card to their warehouse after you’ve already decided to trust them. Your rules only check the name, not where it actually leads.
DNS Firewalls (RPZ — Response Policy Zones)
RPZ lets resolvers override DNS responses based on policy:
- Block known malware domains (return NXDOMAIN or redirect to sinkhole)
- Block adult content
- Block data exfiltration domains
Used by enterprise security products (Cisco Umbrella, Palo Alto DNS Security) and some ISPs. Also used by governments for censorship — same mechanism, different intent.
Summary Reference Table
| Concept | Key Detail | Principal-Level Gotcha |
|---|---|---|
| Recursive resolver | Does all the legwork for your stub | Resolver cache is shared — one poisoned response affects all users |
| TTL | Cache lifetime in seconds | Lower before migrations; some resolvers over-cache regardless |
| CNAME | Alias, not an IP | Can’t use at zone apex; can’t share name with other record types |
| SOA serial | Must increment on every zone change | Binary/decimal format mismatch between tools causes missed transfers |
| DNSSEC | Signs records, not the wire | Doesn’t encrypt queries; NSEC exposes zone enumeration |
| DoH | DNS over HTTPS port 443 | Bypasses enterprise DNS policies and split-horizon setups |
| SRV | Service + port discovery | Weight is a relative share within one priority level, not a percentage; lower priority values are always tried first |
| GeoDNS | Routes by resolver location, not user | ECS (EDNS Client Subnet) needed for accurate user geolocation |
| DNS tunneling | Data exfil over DNS queries | Almost never blocked — monitor for high-entropy subdomains |
| Cache poisoning | Inject fake records into resolver cache | Source port randomization + DNSSEC are the real fixes |