Domain 3D: Deep Learning & Neural Networks
Deep Learning & Neural Networks
Exam Domain: 3, ML Model Development (Deep Learning)
Task: Understand deep learning architectures and when to apply them
Neural Network Fundamentals
The Single Neuron
A neuron is the atomic unit of a neural network. It computes:
$$output = activation\left(\sum_{i} w_i x_i + b\right)$$
where $x_i$ are inputs, $w_i$ are learnable weights, $b$ is a bias term, and $activation$ introduces non-linearity.
Single Neuron:
x₁ ──w₁──┐
x₂ ──w₂──┤
x₃ ──w₃──┼──► Σ(wᵢxᵢ + b) ──► activation(z) ──► output
x₄ ──w₄──┤
x₅ ──w₅──┘
Each input is scaled by its weight.
The sum passes through an activation function.
The result is one number: how much this neuron "fires."
ELI5: A neuron is a tiny decision maker. It takes inputs, weights them by importance, sums them up, and passes the result through a “squish” function to decide how much signal to send forward. A high weight on input $x_3$ means “pay a lot of attention to $x_3$.” The activation function prevents the network from just doing endless linear arithmetic — it adds the bends and curves that let it learn complex patterns.
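The neuron formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the sigmoid activation, the function names, and the input values are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Hypothetical values, chosen only for illustration.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2

output = neuron(x, w, b)  # z = 0.5 - 0.5 + 0.3 + 0.2 = 0.5
print(output)             # sigmoid(0.5), roughly 0.622
```

Raising a weight in `w` increases how strongly the matching input moves `z`, which is the "pay more attention to this input" idea from the ELI5.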
Layers: Input → Hidden → Output
Multi-Layer Neural Network (3-layer):
Input Layer Hidden Layer 1 Hidden Layer 2 Output Layer
(raw features) (simple patterns) (complex patterns) (prediction)
x₁ ──────┐ ┌── h₁₁ ──┐ ┌── h₂₁ ──┐
x₂ ──────┼──► ├── h₁₂ ──┼──► ├── h₂₂ ──┼──► ŷ
x₃ ──────┤ ├── h₁₃ ──┤ ├── h₂₃ ──┤
x₄ ──────┘ └── h₁₄ ──┘ └── h₂₄ ──┘
"Depth" = number of hidden layers
"Width" = number of neurons per layer
What each layer learns (in a CNN processing images):
- Layer 1: edges and gradients
- Layer 2: corners, textures, simple shapes
- Layer 3: object parts (eyes, wheels, doors)
- Layer 4+: full objects and high-level concepts
ELI5: Deep learning is powerful because each layer builds on the previous one. Layer 1 sees edges. Layer 2 sees shapes. Layer 3 sees objects. It’s like building understanding from pixels to meaning — the same way a child learns that dots and lines form letters, letters form words, and words form meaning. The “deep” in deep learning just means many of these stacked layers.
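Universal Approximation Theorem: A network with a single hidden layer, a suitable non-linear activation, and enough neurons can approximate any continuous function on a bounded input region to arbitrary accuracy. This is why neural networks are called universal approximators. The theorem says nothing about how many neurons are needed or whether training will find the right weights; in practice, depth (more layers) is usually more parameter-efficient than width (more neurons in one layer) for learning complex patterns.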
Activation Functions
Activation functions introduce non-linearity. Without them, stacking layers is mathematically equivalent to a single linear layer — useless for complex problems.
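The claim that stacked linear layers collapse into one can be checked numerically. The sketch below assumes two small weight matrices with made-up values and omits biases for brevity; the helper names are illustrative only.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]
x = [0.5, -2.0]

two_linear_layers = matvec(W2, matvec(W1, x))   # [-4.5, 5.25]
one_merged_layer = matvec(matmul(W2, W1), x)    # [-4.5, 5.25], identical
with_relu = matvec(W2, relu(matvec(W1, x)))     # [0.0, 2.0], no longer linear

print(two_linear_layers, one_merged_layer, with_relu)
```

Without the activation, the two layers are exactly the single matrix `W2 · W1`. Inserting ReLU between them breaks that equivalence, which is what gives depth its extra expressive power.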
Sigmoid
$$\sigma(x) = \frac{1}{1+e^{-x}}$$
Sigmoid shape: Range: (0, 1)
1.0 ─────────────────────────
╱
0.5 ──────────────────╱───
╱
0.0 ─────────╱────────────
──────────────────────→ x
-4 -2 0 2 4
- Use: output layer for binary classification
- Problem 1 — Vanishing gradient: derivatives are near zero at extremes → gradients shrink to nothing through many layers → network stops learning
- Problem 2 — Not zero-centered: outputs always positive → inefficient gradient updates
Tanh
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
- Range: $(-1, 1)$, zero-centered (better than sigmoid for hidden layers)
- Still suffers from vanishing gradients at extremes
ReLU (Rectified Linear Unit)
$$f(x) = \max(0, x)$$
ReLU shape: Range: [0, ∞)
╱
╱
───────╱───────────→ x
negative positive
= 0 = x
- Why it’s the default: no vanishing gradient for positive values, computationally trivial ($\max$ operation), sparse activations (many neurons output 0 → efficient)
- Dying ReLU problem: if a neuron always receives negative inputs, it always outputs 0 and its gradient is also 0 → it never learns again (“dead neuron”). Caused by large learning rates or poor initialization.
Leaky ReLU
$$f(x) = \max(0.01x,\ x)$$
- Fixes dying ReLU: negative inputs get a small non-zero gradient (0.01 slope)
- The leak coefficient (0.01) can itself be learned → PReLU (Parametric ReLU)
ELU and SELU
- ELU: smooth negative region, negative values push mean activations toward zero → self-normalizing tendencies
- SELU (Scaled ELU): mathematically proven to self-normalize activations (mean ≈ 0, variance ≈ 1) across layers when used with specific initialization — eliminates need for batch normalization in many cases
Softmax
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
- Outputs sum to 1 → interpretable as probabilities
- Use: output layer for multi-class classification
- Key property: amplifies the largest value relative to others (temperature controls “sharpness”)
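A minimal sketch of softmax with a temperature parameter, assuming the common convention of dividing the logits by the temperature before exponentiating. The logit values are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                 # hypothetical class scores
probs = softmax(logits)                  # standard softmax
sharp = softmax(logits, temperature=0.5) # low temperature: more peaked
flat = softmax(logits, temperature=5.0)  # high temperature: closer to uniform

print(probs, sharp, flat)
```

Lower temperatures push more probability mass onto the largest logit, and higher temperatures spread it out. The ranking of the classes never changes.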
GELU (Gaussian Error Linear Unit)
$$\text{GELU}(x) = x \cdot \Phi(x)$$
- Smooth probabilistic approximation of ReLU
- Used in Transformers (BERT, GPT) because smooth derivatives help with very deep training
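For a standard normal $\Phi$, GELU can be written with the error function as $x \cdot \frac{1}{2}\left(1 + \text{erf}(x/\sqrt{2})\right)$. The sketch below uses that exact form (frameworks often use a faster approximation) and compares it with ReLU at a few sample points.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

# Compare the two activations at a few sample points.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, relu(x), gelu(x))
```

For large positive inputs GELU is close to ReLU. For negative inputs it dips slightly below zero and then returns smoothly toward zero instead of cutting off with a hard corner.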
Activation Function Comparison
| Function | Range | Zero-centered | Vanishing Gradient | Use Case |
|---|---|---|---|---|
| Sigmoid | (0,1) | No | Yes (severe) | Binary output layer |
| Tanh | (-1,1) | Yes | Yes (moderate) | RNN hidden states |
| ReLU | [0,∞) | No | No (for x>0) | Default hidden layer |
| Leaky ReLU | (-∞,∞) | Approx | No | When dying ReLU is a problem |
| SELU | (≈ -1.76, ∞) | Self-normalizes | No | Deep FFN without BatchNorm |
| Softmax | (0,1) sum=1 | No | — | Multi-class output layer |
| GELU | (≈ -0.17, ∞) | Approx | No | Transformers |
Decision guide:
- Hidden layers → ReLU (default), Leaky ReLU if neurons are dying
- Binary output → Sigmoid
- Multi-class output → Softmax
- RNN hidden states → Tanh (the standard choice inside LSTM/GRU cells)
- Transformers → GELU
Backpropagation
First Principles
Training a neural network means finding weights $W$ that minimize a loss function $L$. Backpropagation computes $\frac{\partial L}{\partial W}$ for every weight using the chain rule of calculus.
Forward pass: input $\rightarrow$ activations layer by layer $\rightarrow$ prediction $\rightarrow$ compute loss
Backward pass: loss $\rightarrow$ gradient through output layer $\rightarrow$ gradient through each hidden layer (in reverse) $\rightarrow$ update weights
Chain rule applied: if $L$ depends on $z$, and $z$ depends on $W$, then:
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W}$$
For a deep network with layers $1, 2, \ldots, n$:
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot \ldots \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_1}$$
ELI5: Imagine a factory assembly line. The product at the end is wrong. Backpropagation traces backward through every station asking “how much did YOU contribute to the error?” Station 5 answers honestly. Station 4 uses that answer to figure out its own contribution. And so on back to station 1. Then each station adjusts its process slightly to reduce the error. Repeat millions of times and the factory learns to make perfect products.
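The chain rule above can be traced by hand on a tiny network. The sketch below assumes one input, one sigmoid hidden unit, a linear output, and a squared-error loss, all with made-up values. It then checks the hand-derived gradients against a numerical finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    a1 = sigmoid(w1 * x)      # hidden activation
    y_hat = w2 * a1           # linear output
    return a1, y_hat

def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return 0.5 * (y_hat - y) ** 2

def backward(w1, w2, x, y):
    """Apply the chain rule from the loss back to each weight."""
    a1, y_hat = forward(w1, w2, x)
    dL_dyhat = y_hat - y
    dL_dw2 = dL_dyhat * a1                 # dL/dy_hat * dy_hat/dw2
    dL_da1 = dL_dyhat * w2                 # push the error back to the hidden unit
    da1_dz1 = a1 * (1.0 - a1)              # derivative of sigmoid
    dL_dw1 = dL_da1 * da1_dz1 * x          # dz1/dw1 = x
    return dL_dw1, dL_dw2

w1, w2, x, y = 0.8, -1.5, 2.0, 1.0         # hypothetical values
g1, g2 = backward(w1, w2, x, y)

# Numerical check with central differences.
eps = 1e-6
num_g1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
num_g2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
print(g1, num_g1, g2, num_g2)
```

A gradient check like this is a common way to validate a hand-written backward pass before trusting it in training.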
Vanishing Gradient Problem
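In deep networks using sigmoid or tanh activations, the local derivative at each layer is small: at most 0.25 for sigmoid and at most 1 for tanh, and far smaller once units saturate. Backpropagation multiplies one such factor per layer, and a product of many factors below 1 shrinks quickly: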
$$0.3 \times 0.3 \times 0.3 \times 0.3 \times 0.3 = 0.00243$$
After 10 layers the product is about $6 \times 10^{-6}$, effectively zero. Early layers receive near-zero gradients → they barely learn → deep sigmoid networks fail.
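A short sketch of how fast these products shrink. It uses the sigmoid derivative at its peak (z = 0), which is the best case for sigmoid; real networks typically see smaller factors.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

peak = sigmoid_derivative(0.0)        # 0.25, the largest value the derivative can take
saturated = sigmoid_derivative(5.0)   # far smaller once the unit saturates

# Best-case gradient scale after passing through `depth` sigmoid layers.
for depth in (5, 10, 20):
    print(depth, peak ** depth)
```

Even in the best case, ten sigmoid layers scale the gradient by less than one millionth, before any effect of the weights is counted.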
Solutions: ReLU activations, batch normalization, residual connections (skip connections), LSTM/GRU for sequences.
Exploding Gradient Problem
The opposite: gradients grow exponentially through layers. Manifests as NaN loss or wildly oscillating training.
Solutions:
- Gradient clipping: cap gradient norm at a maximum value (e.g., 1.0)
- Weight initialization techniques (Xavier/He initialization)
- Batch normalization
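Gradient clipping by norm, the first solution listed above, can be sketched as follows. The function name and the flat list of gradients are assumptions for illustration; deep learning frameworks provide their own clipping utilities.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

grads = [3.0, -4.0]                                  # hypothetical gradients, norm = 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)                          # 5.0 and roughly [0.6, -0.8]
```

Clipping rescales the whole vector, so the update direction is preserved and only the step size is capped.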
Exam tip: Vanishing gradients → use ReLU, skip connections, or LSTM. Exploding gradients → use gradient clipping. Both are common issues in deep and recurrent networks.
Convolutional Neural Networks (CNNs)
Why Regular Networks Fail on Images
A 224×224 RGB image has $224 \times 224 \times 3 = 150{,}528$ inputs. A single dense hidden layer with 1,000 neurons would need about 150 million weights, just for layer 1. Problems:
- Too many parameters → overfitting
- No spatial awareness: pixel at position (10,10) has no relationship to pixel at (11,11) in a fully connected layer
- Not translation invariant: a cat in the top-left corner looks completely different from a cat in the bottom-right
Convolution Layer
A filter (kernel) is a small matrix (e.g., 3×3) that slides across the image. At each position, it computes the dot product with the underlying image patch.
3×3 filter sliding across a 5×5 input (stride=1, no padding):
Input (5×5): Filter (3×3): Feature Map (3×3):
┌─────────────┐ ┌───────┐
│ 1 2 3 4 5│ │ 1 0 -1│ ┌─────────┐
│ 6 7 8 9 10│ × │ 1 0 -1│ ──► │ ? ? ?│
│11 12 13 14 15│ │ 1 0 -1│ │ ? ? ?│
│16 17 18 19 20│ └───────┘ │ ? ? ?│
│21 22 23 24 25│ └─────────┘
└─────────────┘
↑ ↑
5×5 = 25 values (5-3+1)×(5-3+1) = 3×3 = 9 values
At top-left position:
(1×1)+(2×0)+(3×-1) + (6×1)+(7×0)+(8×-1) + (11×1)+(12×0)+(13×-1)
= (1+0-3) + (6+0-8) + (11+0-13) = -2 + -2 + -2 = -6
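The worked example can be completed for every filter position with a short sketch. It assumes a square input and a square kernel and, like most deep learning frameworks, slides the kernel without flipping it (cross-correlation), which matches the arithmetic above.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide a square kernel over a square image with no padding."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            r, c = i * stride, j * stride
            # Dot product between the kernel and the patch under it.
            row.append(sum(image[r + a][c + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

image = [[5 * r + c + 1 for c in range(5)] for r in range(5)]   # values 1..25
kernel = [[1, 0, -1] for _ in range(3)]                          # vertical-edge filter

feature_map = conv2d_valid(image, kernel)
print(feature_map)   # [[-6, -6, -6], [-6, -6, -6], [-6, -6, -6]]
```

Every entry is -6 because this input increases by exactly 1 per column everywhere, so the left-minus-right filter sees the same horizontal gradient at every position.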
Parameter sharing: The same 3×3 filter (9 parameters) is applied at every position in the image. This is the key insight — instead of learning a separate detector for every position, one filter learns to detect a feature anywhere in the image. Drastically reduces parameters.
Feature maps: Each filter produces one feature map. With 64 filters, you get 64 feature maps, each detecting a different low-level pattern.
Stride: How many pixels the filter moves each step. Stride=1 → dense output. Stride=2 → halves spatial dimensions.
Padding:
- Valid: no padding → output smaller than input
- Same: zero-pad input so output has same spatial dimensions as input
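Stride and padding determine the output size through the standard formula $\lfloor (n + 2p - k)/s \rfloor + 1$ per spatial dimension. The sketch below applies it and also compares parameter counts for the dense layer and a 64-filter convolution mentioned earlier. The helper names are illustrative.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output size along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out

def conv_params(kernel, in_channels, n_filters):
    """Each filter has kernel*kernel*in_channels weights plus one bias."""
    return kernel * kernel * in_channels * n_filters + n_filters

valid_5x5 = conv_output_size(5, 3)                            # 3
same_224 = conv_output_size(224, 3, padding=1)                # 224
strided_224 = conv_output_size(224, 3, stride=2, padding=1)   # 112

dense_count = dense_params(224 * 224 * 3, 1000)   # 150,529,000
conv_count = conv_params(3, 3, 64)                # 1,792

print(valid_5x5, same_224, strided_224, dense_count, conv_count)
```

For a 3×3 kernel, "same" padding corresponds to `padding=1`. The parameter gap (about 150 million versus under 2,000) is the practical payoff of parameter sharing.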
Pooling Layer
Downsamples the spatial dimensions of feature maps.
Max Pooling (2×2, stride=2):
Input (4×4): Output (2×2):
┌──┬──┬──┬──┐ ┌────┬────┐
│ 1│ 3│ 2│ 4│ │ 6 │ 8 │
├──┼──┼──┼──┤ ──► ├────┼────┤
│ 5│ 6│ 1│ 8│ │ 9 │ 7 │
├──┼──┼──┼──┤ └────┴────┘
│ 2│ 9│ 3│ 1│
├──┼──┼──┼──┤
│ 4│ 3│ 7│ 2│
└──┴──┴──┴──┘
Each 2×2 block → its maximum value
- Max pooling (preferred): keeps the strongest activation, provides translation invariance
- Average pooling: takes the mean — preserves more information, sometimes used in final layers
Why pooling helps: reduces spatial resolution → fewer parameters in subsequent layers, and slight translation invariance (the cat can move a few pixels without changing the feature map drastically).
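The max pooling diagram above can be reproduced with a few lines of Python. This is a minimal sketch assuming a 2D list as the feature map.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Keep the maximum of each size x size window."""
    rows = (len(fmap) - size) // stride + 1
    cols = (len(fmap[0]) - size) // stride + 1
    return [[max(fmap[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(cols)]
            for i in range(rows)]

fmap = [[1, 3, 2, 4],
        [5, 6, 1, 8],
        [2, 9, 3, 1],
        [4, 3, 7, 2]]

pooled = max_pool2d(fmap)
print(pooled)   # [[6, 8], [9, 7]]
```

Pooling has no learnable parameters. It only shrinks the feature map that later layers have to process.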
Full CNN Architecture
Full CNN Pipeline:
Input Image
│
▼
┌─────────────────┐
│ Conv Layer │ ── 32 filters of 3×3 → 32 feature maps
│ ReLU │
│ Max Pool 2×2 │ ── halves spatial dims
└────────┬────────┘
│
▼
┌─────────────────┐
│ Conv Layer │ ── 64 filters of 3×3 → 64 feature maps
│ ReLU │
│ Max Pool 2×2 │ ── halves again
└────────┬────────┘
│
▼
┌─────────────────┐
│ Flatten │ ── 3D feature maps → 1D vector
│ Dense Layer │ ── fully connected (128 neurons)
│ ReLU │
│ Dropout │ ── regularization
│ Dense Output │ ── softmax for N classes
└─────────────────┘
Famous CNN Architectures
| Architecture | Year | Key Innovation | Depth |
|---|---|---|---|
| LeNet | 1998 | First successful CNN | 5 layers |
| AlexNet | 2012 | Deep CNN on GPU, ReLU, Dropout | 8 layers |
| VGG | 2014 | All 3×3 convolutions, very deep | 16-19 layers |
| Inception | 2014 | Parallel filter sizes in one layer | 22 layers |
| ResNet | 2015 | Skip connections | 50-152+ layers |
ResNet skip connections — the critical insight:
ResNet Skip Connection (Residual Block):
Input x ──────────────────────────┐
│ │
▼ │
┌────────┐ │
│ Conv │ │
│ BN │ │ (identity shortcut)
│ ReLU │ │
└───┬────┘ │
│ │
┌────────┐ │
│ Conv │ │
│ BN │ │
└───┬────┘ │
│ │
└──────────── + ◄─────────────┘
│
ReLU
│
Output = F(x) + x
Instead of learning the desired mapping $H(x)$ directly, the block learns the residual $F(x) = H(x) - x$ and outputs $F(x) + x$. If the layer is not needed, $F(x)$ can simply go to zero and the identity passes through unchanged. The skip connection also provides a gradient highway directly to early layers, which largely removes the vanishing-gradient barrier to training very deep networks (100+ layers).
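A simplified sketch of the residual idea. It assumes small dense layers in place of the convolutions and omits batch normalization, so it illustrates only the `F(x) + x` structure, not a full ResNet block. All weights are made up.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x), where F is two small linear layers with a ReLU between."""
    f = linear(W2, relu(linear(W1, x)))          # the residual branch F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# With all-zero weights, F(x) = 0 and the block passes x through unchanged.
identity_out = residual_block(x, zeros, zeros)   # [1.0, 2.0]

# With non-zero weights the block adds a learned correction on top of x.
W1 = [[0.5, 0.0], [0.0, 0.5]]
W2 = [[1.0, 0.0], [0.0, -1.0]]
learned_out = residual_block(x, W1, W2)          # [1.5, 1.0]

print(identity_out, learned_out)
```

The identity case shows why extra residual blocks are cheap to add: doing nothing is the default behaviour, so a block only has to learn a correction.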
Transfer Learning with CNNs
Transfer Learning Strategy:
ImageNet pre-trained model (millions of images, 1000 classes)
┌──────────────────────────────────────┐
│ Conv layers (frozen or fine-tuned) │ ← Universal features
│ Dense layers (frozen) │ (edges, textures, objects)
└──────────────────────────────────────┘
│
▼ Replace final layer
┌──────────────────────────────────────┐
│ New Dense Output Layer (your N classes) │ ← Train from scratch
└──────────────────────────────────────┘
Why it works: early CNN layers learn universal visual features (edges, corners, textures) that transfer across domains. Only the final classification layer needs to learn your specific classes.
When to use transfer learning:
- Small labeled dataset (< 10,000 images) + similar domain to pre-training data → freeze most layers, retrain final layers only
- Medium dataset + different domain → fine-tune more layers
- Large dataset → full training from scratch may beat transfer learning
ELI5: Instead of teaching a baby to see from scratch, start with someone who already knows what edges and shapes are, and just teach them to recognize YOUR specific objects. The pre-trained weights encode years of “visual education” from millions of images — you’re piggybacking on that knowledge.
1D CNNs for sequences: The convolution slides along the time/sequence axis instead of spatial dimensions. Effective for time series and text when local patterns matter more than long-range dependencies. Faster than RNNs, parallelizable.
CNN Use Cases
| Task | Architecture | Output |
|---|---|---|
| Image classification | ResNet, VGG, EfficientNet | Class label + probability |
| Object detection | YOLO, SSD, Faster R-CNN | Bounding boxes + labels |
| Semantic segmentation | FCN, U-Net, DeepLab | Pixel-level label mask |
| Medical imaging | U-Net | Organ/tumor segmentation |
| Time series | 1D CNN | Local pattern features |
Recurrent Neural Networks (RNNs)
Why Feedforward Networks Fail on Sequences
A feedforward network treats each input independently. For the sentence “The cat sat on the mat because it was tired,” understanding that “it” refers to “cat” requires memory of previous words. Feedforward networks have no such memory.
The Recurrence Equation
$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b)$$
where $h_t$ is the hidden state at time step $t$ (the “memory”), $x_t$ is the current input, and $h_{t-1}$ is the previous hidden state.
RNN Unrolled Through Time:
x₁ ──► [RNN] ──h₁──► [RNN] ──h₂──► [RNN] ──h₃──► ... ──► output
↑ ↑ ↑
same weights same weights same weights
W_hh, W_xh (shared across all time steps)
The same weight matrices $W_{hh}$ and $W_{xh}$ are applied at every time step — analogous to convolutional parameter sharing, but across time instead of space.
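The recurrence equation can be sketched directly, assuming tanh as the activation $f$ and small made-up weights. The same `W_hh`, `W_xh`, and `b` are reused at every step.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """One step of h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    h_t = []
    for i in range(len(b)):
        z = (b[i]
             + sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
             + sum(W_xh[i][k] * x_t[k] for k in range(len(x_t))))
        h_t.append(math.tanh(z))
    return h_t

def run_rnn(xs, h0, W_hh, W_xh, b):
    """Apply the same weights at every time step and collect the hidden states."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(h, x_t, W_hh, W_xh, b)
        states.append(h)
    return states

# Hypothetical weights: hidden size 2, input size 1.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
b = [0.0, 0.0]
xs = [[1.0], [0.0], [-1.0]]

states = run_rnn(xs, [0.0, 0.0], W_hh, W_xh, b)
reversed_states = run_rnn(xs[::-1], [0.0, 0.0], W_hh, W_xh, b)
print(states[-1], reversed_states[-1])
```

Feeding the same inputs in reverse order produces a different final hidden state, which is the "memory of order" that a feedforward network lacks.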
Vanishing Gradient in RNNs
For a sequence of length 100, the gradient of the loss with respect to the hidden state at step 1 passes through 99 repeated multiplications by $W_{hh}$ (each also scaled by the activation derivative). Roughly, if the largest eigenvalue magnitude of $W_{hh}$ is below 1, gradients vanish; if it is above 1, they can explode.
ELI5: Playing “telephone” with 50 people — by the time the message reaches the end, the original meaning is completely lost. The RNN’s gradient is the message traveling backwards. By the time it reaches step 1, it contains almost nothing useful. So early steps don’t learn from events that happened long ago.
LSTM (Long Short-Term Memory)
LSTM introduces a cell state $C_t$ — a separate memory highway that runs alongside the hidden state, protected from the repeated multiplication that causes vanishing gradients.
The gates and state updates:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \quad \text{(candidate values)}$$
$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \quad \text{(cell state update)}$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}$$
$$h_t = o_t \cdot \tanh(C_t) \quad \text{(hidden state)}$$
LSTM Cell — All Gates:
┌──────────────── C_{t-1} (cell state in) ──────────────┐
│ │
h_{t-1} ──┐ ▼ ▼
├──► [Forget Gate f_t] ──► × ──────────────────────────── + ──► C_t ──► tanh ──► ×──► h_t
x_t ──────┘ │ ↑ ↑ ↑
│ ┌───┘ ┌────┘ ────┘
│ [Input Gate i_t] ──► × ──┘ [Output Gate o_t]
│ [Candidate Ĉ_t] ──┘
│
Gate summary:
Forget gate: "How much of old cell state to keep?" (0=forget all, 1=keep all)
Input gate: "How much of new candidate to write?"
Output gate: "How much of cell state to expose as hidden state?"
Cell state: The long-term memory highway
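Why LSTM mitigates vanishing gradients: the cell state update $C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$ is additive rather than a repeated matrix multiplication followed by a squashing activation. Along the cell state, the gradient from $C_t$ to $C_{t-1}$ is just $f_t$, so when the forget gate stays close to 1, gradients flow nearly unchanged across many time steps.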
ELI5: LSTM is a person with a notebook. The forget gate crosses things out. The input gate writes new things. The output gate decides what to read aloud. The notebook itself is the long-term memory that persists. The key insight: information in the notebook can survive for hundreds of time steps without being distorted — unlike the “telephone game” of vanilla RNNs.
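The LSTM equations can be sketched with scalar states to keep the arithmetic visible. Real cells use vectors and weight matrices. The parameter layout `(w_h, w_x, b)` per gate and all values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x_t, params):
    """One LSTM step with scalar hidden and cell states."""
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))              # forget gate
    i_t = sigmoid(pre("i"))              # input gate
    c_tilde = math.tanh(pre("c"))        # candidate values
    c_t = f_t * c_prev + i_t * c_tilde   # additive cell state update
    o_t = sigmoid(pre("o"))              # output gate
    h_t = o_t * math.tanh(c_t)           # hidden state
    return h_t, c_t

# Hypothetical parameters: forget gate near 1, input gate near 0.
params = {
    "f": (0.0, 0.0, 10.0),
    "i": (0.0, 0.0, -10.0),
    "c": (0.5, 0.5, 0.0),
    "o": (0.0, 0.0, 0.0),
}

h, c = 0.0, 0.7
for _ in range(100):
    h, c = lstm_step(h, c, x_t=1.0, params=params)

print(c)   # still close to the initial 0.7 after 100 steps
```

With the forget gate held near 1 and the input gate near 0, the cell state survives 100 steps almost untouched. This is the "notebook" behaviour described above.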
GRU (Gated Recurrent Unit)
GRU simplifies LSTM into two gates: reset gate ($r_t$) and update gate ($z_t$).
$$z_t = \sigma(W_z [h_{t-1}, x_t]) \quad \text{(update gate)}$$
$$r_t = \sigma(W_r [h_{t-1}, x_t]) \quad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh(W [r_t \cdot h_{t-1}, x_t]) \quad \text{(candidate hidden state)}$$
$$h_t = (1-z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t$$
No separate cell state — just one hidden state with gating.
RNN vs LSTM vs GRU Comparison
| Aspect | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | None | 3 gates + cell state | 2 gates |
| Parameters | Fewest | Most | Fewer than LSTM |
| Long-term memory | Poor | Excellent | Good |
| Training speed | Fastest | Slowest | Middle |
| Vanishing gradient | Yes | Largely mitigated | Largely mitigated |
| When to use | Toy examples | Long sequences, complex patterns | Smaller datasets, faster training |
Decision rule: Start with GRU for faster iteration. Switch to LSTM if sequence dependencies are very long (> 100 steps) or accuracy matters more than speed.
Bidirectional RNNs
Bidirectional RNN:
x₁ x₂ x₃ x₄
│ │ │ │
▼ ▼ ▼ ▼
[→]─────►[→]─────►[→]─────►[→] (forward pass)
▲ ▲ ▲ ▲
[←]◄─────[←]◄─────[←]◄─────[←] (backward pass)
│ │ │ │
▼ ▼ ▼ ▼
concat concat concat concat
│ │ │ │
y₁ y₂ y₃ y₄
Each output sees BOTH past AND future context.
The output at each time step is the concatenation of forward and backward hidden states. Essential for tasks that need future context: NER (“Bank” is a company if followed by “Inc.”), speech recognition, machine translation encoder.
Transformers & Attention
The Revolution
Transformers (2017, “Attention Is All You Need”) discarded recurrence entirely. Instead, every token attends directly to every other token via the attention mechanism. Key advantages over RNNs:
- Fully parallelizable — no sequential dependency
- Better long-range dependencies — direct path between any two tokens
- No vanishing gradient — attention creates direct connections
Self-Attention Mechanism
For each token, compute three vectors: Query (Q), Key (K), Value (V) from learned weight matrices.
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Interpretation:
- $QK^T$: dot product of query with all keys → measures relevance of each token to the current token
- $\sqrt{d_k}$: scaling to prevent dot products from growing too large (which would saturate softmax)
- Softmax: turn relevance scores into probabilities (attention weights)
- Multiply by $V$: weighted sum of value vectors — the output is a blend of all tokens, weighted by relevance
ELI5: In the sentence “The cat sat on the mat because it was tired”, self-attention helps the model understand that “it” refers to “cat” by computing a relevance score between “it” and every other word in the sentence. “it” has a high score with “cat” and low scores with “mat.” The output representation of “it” is then a weighted blend of all word representations — heavily blending in “cat.” This is how meaning is captured from context.
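Scaled dot-product attention can be sketched in plain Python for a handful of tokens. The Q, K, V values below are made up; in a real model they come from learned projections of the token embeddings.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted blend of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Three hypothetical tokens with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
print(weights[0])   # token 0 attends more to tokens 0 and 2 than to token 1
```

Each row of `weights` is a probability distribution over all tokens, and each output row is the corresponding blend of value vectors.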
Multi-Head Attention
Run self-attention $h$ times in parallel with different learned weight matrices, then concatenate:
$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$
Each head can focus on different types of relationships (syntactic, semantic, coreference) simultaneously.
Positional Encoding
Since there is no recurrence, the Transformer has no inherent sense of position. Positional encodings (sinusoidal functions or learned embeddings) are added to token embeddings to encode order.
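A sketch of the sinusoidal variant, assuming the scheme from the original Transformer paper: sine on even dimensions, cosine on odd dimensions, with wavelengths growing geometrically up to a base of 10000.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=4)
print(pe[0])   # [0.0, 1.0, 0.0, 1.0]
print(pe[1])
```

Each row is added element-wise to the token embedding at that position, so two identical tokens at different positions end up with different input vectors.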
Encoder-Decoder Architecture
Transformer Encoder-Decoder:
"Je suis étudiant" "I am a student"
│ │
┌────▼────┐ ┌────▼────┐
│ Encoder │ │ Decoder │
│ │ ─── context ──► │ │
│ Self- │ (K, V from │ Masked │
│ Attention│ encoder) │ Self-Att│
│ │ │ Cross- │
│ Feed- │ │ Attention│
│ Forward │ │ Feed- │
└─────────┘ │ Forward │
└─────────┘
│
Output tokens
(autoregressive)
Encoder: reads the entire input simultaneously, produces rich contextual representations (used in BERT)
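Decoder: generates output one token at a time, attending to previous outputs and, in encoder-decoder models such as machine translation, to the encoder context. Decoder-only models such as GPT keep the masked self-attention and drop the cross-attention.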
BERT vs GPT
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Training | Masked language modeling (bidirectional) | Next token prediction (left-to-right) |
| Strength | Understanding tasks (classification, NER) | Generation tasks (text, code, chat) |
| Context | Sees full sequence bidirectionally | Sees only past tokens |
RNN vs Transformer Comparison
RNN Processing (sequential — must wait):
x₁ ──► h₁ ──► h₂ ──► h₃ ──► ... ──► hₙ
Transformer Processing (parallel — all at once):
x₁ ──────────────────────────────►
x₂ ──── Self-Attention (all tokens simultaneously) ────►
x₃ ──────────────────────────────►
xₙ ──────────────────────────────►
RNN processes tokens one at a time → slow for long sequences, hard to parallelize. Transformer processes all tokens simultaneously → much faster training on modern hardware (GPU/TPU).
Autoencoders
Architecture
Autoencoder:
Input x Bottleneck z Reconstructed x̂
(high-dim) (low-dim) (high-dim)
┌───────┐ ┌──────┐ ┌───────┐
│ 784 │ │ 32 │ │ 784 │
│ dims │──► │ dims │──► │ dims │
│ │ │ │ │ │
│Encoder│ │Latent│ │Decoder│
└───────┘ │Space │ └───────┘
└──────┘
Loss = reconstruction error: ||x̂ - x||²
Training signal: how well can you reconstruct x from a tiny z?
The bottleneck forces the encoder to learn a compressed representation. The decoder learns to reconstruct from that compression. The reconstruction itself is usually not the goal: the learned representation in the bottleneck is the useful output.
ELI5: An autoencoder is like summarizing a book (encoder) then reconstructing it from the summary (decoder). The goal isn’t the reconstruction — it’s the summary itself. That compressed summary is the “embedding” or “latent representation” that captures the essence of the data. A good summary can reconstruct a good approximation of the original.
Use cases:
- Dimensionality reduction: non-linear alternative to PCA
- Anomaly detection: reconstruct normal data well; anomalies have high reconstruction error
- Denoising: train with noisy input → clean output; forces learning of clean structure
- Pre-training: learn representations before fine-tuning on a downstream task
Variational Autoencoders (VAE)
Instead of encoding to a fixed point $z$, VAE encodes to a distribution: mean $\mu$ and variance $\sigma^2$. Samples from this distribution become the decoder input.
Loss = reconstruction error + KL divergence (keeps the latent space smooth and continuous)
Since the latent space is smooth, you can interpolate between points and generate new data by sampling from the latent space. VAEs are generative models.
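The two VAE-specific pieces can be sketched briefly: sampling $z$ from the encoded distribution, and the KL term. This assumes the usual setup of a diagonal Gaussian encoder, a standard normal prior, and a log-variance parameterization; the function names are illustrative.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, 0.0], rng)

kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])   # 0.0
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])    # 0.5
print(z, kl_at_prior, kl_shifted)
```

The KL term is zero when the encoder outputs exactly the prior and grows as the encoded distribution drifts away from it. That pressure is what keeps the latent space compact and continuous.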
Generative Adversarial Networks (GANs)
Architecture
Two networks compete in a minimax game:
GAN Training Loop:
Real data ──────────────────────────┐
│
Random noise z ▼
│ ┌───────────────────────────┐
▼ │ Discriminator D │
┌─────────┐ │ "Is this real or fake?" │──► Real/Fake score
│Generator│──fake──►│ │
│ G │ └───────────────────────────┘
└─────────┘ │
↑ │ gradient
└──────── loss ◄────── ┘
(fool D)
Generator ($G$): takes random noise $z$, outputs a fake sample. Trained to maximize $D$’s classification error (make fakes look real).
Discriminator ($D$): takes real or fake samples, outputs probability of being real. Trained to correctly distinguish real from fake.
Training objective:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Nash equilibrium: when $G$ produces perfect fakes, $D$ can only guess randomly (probability 0.5). At this point, training has converged.
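The objective can be evaluated on a few discriminator outputs to see what equilibrium looks like numerically. The scores below are made up; `d_real` holds $D(x)$ on real samples and `d_fake` holds $D(G(z))$ on generated ones.

```python
import math

def gan_value(d_real, d_fake):
    """Estimate E[log D(x)] + E[log(1 - D(G(z)))] from batches of scores."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well: value close to 0.
confident = gan_value([0.9, 0.95], [0.1, 0.05])

# A discriminator reduced to guessing: every score is 0.5.
equilibrium = gan_value([0.5, 0.5], [0.5, 0.5])

print(confident, equilibrium)   # equilibrium equals -2 * log(2), about -1.386
```

The discriminator tries to push this value up toward 0 and the generator tries to push it down. At the point where the discriminator outputs 0.5 everywhere, the value is $-\log 4$.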
Mode collapse: $G$ finds one type of output that consistently fools $D$ and gets stuck producing only that type. Solutions: Wasserstein GAN, minibatch discrimination, historical averaging.
ELI5: A counterfeiter (generator) tries to make fake money. A detective (discriminator) tries to spot fakes. They both get better through competition — the counterfeiter learns from every time they get caught, the detective learns from every fake that slipped through. Eventually the fakes are indistinguishable from real bills. GANs use this adversarial competition to generate extremely realistic data.
Use cases: image generation, data augmentation (generate more training examples), style transfer, image-to-image translation (pix2pix), super-resolution.
Practical Deep Learning on AWS
GPU Instance Types
| Family | GPU | Best For |
|---|---|---|
| P3 | NVIDIA V100 | Standard DL training |
| P4d | NVIDIA A100 | Large model training |
| P5 | NVIDIA H100 | Transformer / LLM training |
| G4dn | NVIDIA T4 | Cost-effective inference + training |
| G5 | NVIDIA A10G | Inference, vision models |
| Inf1/Inf2 | AWS Inferentia | Inference only (custom chip, very cheap) |
Distributed Training Strategies
Data Parallelism (most common):
GPU 1: full model copy ──► batch 1 ──► gradient 1 ──┐
GPU 2: full model copy ──► batch 2 ──► gradient 2 ──┼──► average gradients ──► update weights
GPU 3: full model copy ──► batch 3 ──► gradient 3 ──┤
GPU 4: full model copy ──► batch 4 ──► gradient 4 ──┘
Model Parallelism (for huge models):
GPU 1: layers 1-10 ──► activations ──► GPU 2: layers 11-20 ──► GPU 3: layers 21-30
(model too big to fit on one GPU → split across GPUs)
- SageMaker Data Parallelism library: optimized AllReduce implementation for multi-GPU/multi-node
- SageMaker Model Parallelism library: automatic pipeline parallelism for models > GPU memory
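The core of data parallelism, averaging per-worker gradients and then applying one shared update, can be sketched in plain Python. This is a conceptual stand-in for AllReduce, not the SageMaker library API. The function names and gradient values are assumptions.

```python
def average_gradients(worker_grads):
    """Element-wise mean of the gradient vectors computed on each worker."""
    n = len(worker_grads)
    return [sum(parts) / n for parts in zip(*worker_grads)]

def sgd_update(weights, grads, lr):
    """Apply the same averaged gradient to the shared weights."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Hypothetical gradients from 4 GPUs, each computed on a different batch.
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

avg = average_gradients(worker_grads)            # [4.0, 5.0]
updated = sgd_update([1.0, 1.0], avg, lr=0.1)    # roughly [0.6, 0.5]
print(avg, updated)
```

Because every worker applies the same averaged gradient, all model copies stay identical after each step.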
Cost Optimization
- Managed Spot Training: use EC2 Spot instances (up to 90% cheaper than On-Demand); SageMaker handles interruptions and, when checkpointing is configured, resumes from the latest checkpoint
- Checkpointing: save model state to S3 periodically; critical for spot training since instances can be reclaimed
- SageMaker Training Compiler: optimizes DL training code (up to 50% speedup, up to 50% cost reduction) by optimizing GPU utilization — works with TensorFlow and PyTorch
Training Observability
- SageMaker Debugger: captures tensors during training, monitors for issues (vanishing/exploding gradients, overfit, poor weight initialization) and can auto-stop unhealthy training jobs
- CloudWatch: training metrics (loss, accuracy) per epoch
- SageMaker Experiments: track hyperparameters, metrics, artifacts across training runs
Architecture Selection Guide
Why this matters for the exam: Every scenario question about model design requires mapping problem type → architecture. Memorize this mapping.
| Problem Type | Data | Architecture | Key Reason |
|---|---|---|---|
| Image classification | 2D images | CNN (ResNet/EfficientNet) | Spatial feature hierarchy |
| Object detection | 2D images | CNN (YOLO/SSD) | Bounding box regression |
| Pixel segmentation | 2D images | FCN/U-Net | Dense prediction |
| Sequence classification | Text, time series | LSTM/GRU or 1D CNN | Temporal patterns |
| Text understanding | Documents | BERT (Transformer encoder) | Bidirectional context |
| Text generation | Language | GPT (Transformer decoder) | Autoregressive |
| Translation | Sequence pairs | Transformer (enc-dec) | Long-range alignment |
| Anomaly detection | Any | Autoencoder | Reconstruction error |
| Data generation | Any modality | GAN or VAE | Generative modeling |
| Tabular data | Structured | XGBoost (not deep learning) | Gradient boosting dominates |
| Multi-series forecast | Time series | DeepAR (LSTM) | Cross-series patterns |
Exam tip: For tabular/structured data, XGBoost typically outperforms deep learning and is cheaper to train. Deep learning advantages kick in for unstructured data (images, text, audio) where feature engineering is impractical.
Quick Reference
Deep Learning Architecture Decision Tree:
What is your data type?
│
├─ Images / Video ──────────────────► CNN
│ ├─ Classify whole image? → ResNet + transfer learning
│ ├─ Find objects with boxes? → SSD / Faster R-CNN
│ └─ Label every pixel? → U-Net / DeepLab
│
├─ Sequential / Time series ────────► RNN family
│ ├─ Short sequences? → GRU (faster)
│ ├─ Long sequences? → LSTM (better memory)
│ ├─ Bidirectional context? → Bidirectional LSTM
│ └─ Multiple related series? → DeepAR
│
├─ Text / Language ─────────────────► Transformers
│ ├─ Understand / classify text? → BERT (encoder)
│ ├─ Generate text? → GPT (decoder)
│ └─ Translate / summarize? → Encoder-Decoder Transformer
│
├─ Generate new data ───────────────► Generative models
│ ├─ Realistic samples? → GAN
│ └─ Smooth latent space? → VAE
│
└─ Compress / detect anomalies ─────► Autoencoder