Domain 3D: Deep Learning & Neural Networks

Deep Learning & Neural Networks

Exam Domain: 3, ML Model Development (Deep Learning)

Task: Understand deep learning architectures and when to apply them

Neural Network Fundamentals

The Single Neuron

A neuron is the atomic unit of a neural network. It computes:

$$output = activation\left(\sum_{i} w_i x_i + b\right)$$

where $x_i$ are inputs, $w_i$ are learnable weights, $b$ is a bias term, and $activation$ introduces non-linearity.

Single Neuron:

  x₁ ──w₁──┐
  x₂ ──w₂──┤
  x₃ ──w₃──┼──► z = Σ wᵢxᵢ + b ──► activation(z) ──► output
  x₄ ──w₄──┤
  x₅ ──w₅──┘

  Each input is scaled by its weight.
  The sum passes through an activation function.
  The result is one number: how much this neuron "fires."

ELI5: A neuron is a tiny decision maker. It takes inputs, weights them by importance, sums them up, and passes the result through a “squish” function to decide how much signal to send forward. A high weight on input $x_3$ means “pay a lot of attention to $x_3$.” The activation function prevents the network from just doing endless linear arithmetic — it adds the bends and curves that let it learn complex patterns.
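A minimal sketch of the computation above in plain Python, assuming a sigmoid activation. The function names and the sample inputs, weights, and bias are illustrative and do not come from any library:

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Illustrative values only: the third input carries the largest weight.
output = neuron([1.0, 2.0, 3.0], [0.1, -0.2, 0.5], bias=-1.2)
```

With these values the weighted sum is approximately zero, so the neuron outputs roughly 0.5, the midpoint of the sigmoid.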

Layers: Input → Hidden → Output

Multi-Layer Neural Network (3-layer):

  Input Layer       Hidden Layer 1     Hidden Layer 2    Output Layer
  (raw features)    (simple patterns)  (complex patterns) (prediction)

     x₁ ──────┐    ┌── h₁₁ ──┐    ┌── h₂₁ ──┐
     x₂ ──────┼──► ├── h₁₂ ──┼──► ├── h₂₂ ──┼──► ŷ
     x₃ ──────┤    ├── h₁₃ ──┤    ├── h₂₃ ──┤
     x₄ ──────┘    └── h₁₄ ──┘    └── h₂₄ ──┘

  "Depth" = number of hidden layers
  "Width" = number of neurons per layer

What each layer learns (in a CNN processing images):

Layer 1: edges and gradients
Layer 2: corners, textures, simple shapes
Layer 3: object parts (eyes, wheels, doors)
Layer 4+: full objects and high-level concepts

ELI5: Deep learning is powerful because each layer builds on the previous one. Layer 1 sees edges. Layer 2 sees shapes. Layer 3 sees objects. It’s like building understanding from pixels to meaning — the same way a child learns that dots and lines form letters, letters form words, and words form meaning. The “deep” in deep learning just means many of these stacked layers.

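Universal Approximation Theorem: A single hidden layer with enough neurons and a non-linear activation can approximate any continuous function on a bounded input range to arbitrary accuracy. This is why neural networks are theoretically universal. The theorem does not say how many neurons are needed or whether training will find the right weights; in practice, depth (more layers) is usually more parameter-efficient than width (more neurons in one layer) for learning complex patterns.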

Activation Functions

Activation functions introduce non-linearity. Without them, stacking layers is mathematically equivalent to a single linear layer — useless for complex problems.
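To see why, consider two stacked layers with no activation between them (the symbols $W_1, b_1, W_2, b_2$ are introduced here only for illustration):

$$W_2(W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b'$$

The composition is again a single linear map with weight $W' = W_2 W_1$ and bias $b' = W_2 b_1 + b_2$, so extra depth adds no expressive power until a non-linearity sits between the layers.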

Sigmoid

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

Sigmoid shape:     Range: (0, 1)

    1.0 ─────────────────────────
                              ╱
    0.5 ──────────────────╱───
                      ╱
    0.0 ─────────╱────────────
         ──────────────────────→ x
         -4   -2    0    2    4

Use: output layer for binary classification
Problem 1 — Vanishing gradient: derivatives are near zero at extremes → gradients shrink to nothing through many layers → network stops learning
Problem 2 — Not zero-centered: outputs always positive → inefficient gradient updates

Tanh

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Range: $(-1, 1)$, zero-centered (better than sigmoid for hidden layers)
Still suffers from vanishing gradients at extremes

ReLU (Rectified Linear Unit)

$$f(x) = \max(0, x)$$

ReLU shape:     Range: [0, ∞)

         ╱
        ╱
───────╱───────────→ x
  negative  positive
  = 0       = x

Why it’s the default: no vanishing gradient for positive values, computationally trivial ($\max$ operation), sparse activations (many neurons output 0 → efficient)
Dying ReLU problem: if a neuron always receives negative inputs, it always outputs 0 and its gradient is also 0 → it never learns again (“dead neuron”). Caused by large learning rates or poor initialization.

Leaky ReLU

$$f(x) = \max(0.01x,\ x)$$

Fixes dying ReLU: negative inputs get a small non-zero gradient (0.01 slope)
The leak coefficient (0.01) can itself be learned → PReLU (Parametric ReLU)
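A small sketch contrasting the gradients of ReLU and Leaky ReLU for a negative input; the function names are illustrative, and the 0.01 slope matches the formula above:

```python
def relu(x: float) -> float:
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Zero for non-positive inputs: a neuron stuck here receives no updates."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x: float, slope: float = 0.01) -> float:
    return max(slope * x, x)


def leaky_relu_grad(x: float, slope: float = 0.01) -> float:
    """A small non-zero gradient survives on the negative side."""
    return 1.0 if x > 0 else slope
```

For an input of -3.0, `relu_grad` returns 0.0 while `leaky_relu_grad` returns 0.01, which is what lets a "dead" neuron recover.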

ELU and SELU

ELU: smooth negative region, negative values push mean activations toward zero → self-normalizing tendencies
SELU (Scaled ELU): shown to keep activations self-normalized (mean ≈ 0, variance ≈ 1) across the layers of a fully connected network when paired with LeCun normal initialization, which removes the need for batch normalization in many cases

Softmax

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Outputs sum to 1 → interpretable as probabilities
Use: output layer for multi-class classification
Key property: amplifies the largest value relative to others (temperature controls “sharpness”)
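A sketch of softmax with a temperature parameter, assuming the common convention of dividing the logits by the temperature before exponentiating. Subtracting the maximum first is a standard numerical-stability trick that does not change the result:

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() never overflows
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]          # illustrative scores
probs = softmax(logits)           # temperature 1.0
sharp = softmax(logits, 0.5)      # lower temperature -> more peaked
flat = softmax(logits, 5.0)       # higher temperature -> closer to uniform
```

Lower temperatures push more probability mass onto the largest logit, while higher temperatures flatten the distribution.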

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(x) = x \cdot \Phi(x)$$

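Smooth approximation of ReLU that scales each input by $\Phi(x)$, the standard normal CDF (the probability that a standard normal variable is at most $x$)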
Used in Transformers (BERT, GPT) because smooth derivatives help with very deep training

Activation Function Comparison

Function	Range	Zero-centered	Vanishing Gradient	Use Case
Sigmoid	(0,1)	No	Yes (severe)	Binary output layer
Tanh	(-1,1)	Yes	Yes (moderate)	RNN hidden states
ReLU	[0,∞)	No	No (for x>0)	Default hidden layer
Leaky ReLU	(-∞,∞)	Approx	No	When dying ReLU is a problem
SELU	(≈-1.76,∞)	Self-normalizes	No	Deep FFN without BatchNorm
Softmax	(0,1) sum=1	No	—	Multi-class output layer
GELU	(≈-0.17,∞)	Approx	No	Transformers

Decision guide:

Hidden layers → ReLU (default), Leaky ReLU if neurons are dying
Binary output → Sigmoid
Multi-class output → Softmax
RNN hidden states → Tanh (historically) or GELU
Transformers → GELU

Backpropagation

First Principles

Training a neural network means finding weights $W$ that minimize a loss function $L$. Backpropagation computes $\frac{\partial L}{\partial W}$ for every weight using the chain rule of calculus.

Forward pass: input $\rightarrow$ activations layer by layer $\rightarrow$ prediction $\rightarrow$ compute loss

Backward pass: loss $\rightarrow$ gradient through output layer $\rightarrow$ gradient through each hidden layer (in reverse) $\rightarrow$ update weights

Chain rule applied: if $L$ depends on $z$, and $z$ depends on $W$, then:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W}$$

For a deep network with layers $1, 2, \ldots, n$:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot \ldots \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_1}$$

ELI5: Imagine a factory assembly line. The product at the end is wrong. Backpropagation traces backward through every station asking “how much did YOU contribute to the error?” Station 5 answers honestly. Station 4 uses that answer to figure out its own contribution. And so on back to station 1. Then each station adjusts its process slightly to reduce the error. Repeat millions of times and the factory learns to make perfect products.
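The chain rule can be traced by hand on a tiny network. The sketch below assumes a hypothetical two-weight model (one sigmoid hidden unit, a linear output, squared-error loss) and checks the analytic gradient against a finite-difference estimate:

```python
import math
from typing import Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1: float, w2: float, x: float, y: float) -> Tuple[float, float, float]:
    a1 = sigmoid(w1 * x)          # hidden activation
    a2 = w2 * a1                  # linear output
    loss = 0.5 * (a2 - y) ** 2    # squared error
    return a1, a2, loss


def backward(w1: float, w2: float, x: float, y: float) -> Tuple[float, float]:
    a1, a2, _ = forward(w1, w2, x, y)
    dL_da2 = a2 - y                        # loss -> output
    dL_dw2 = dL_da2 * a1                   # output -> w2
    dL_da1 = dL_da2 * w2                   # output -> hidden activation
    dL_dw1 = dL_da1 * a1 * (1.0 - a1) * x  # hidden activation -> w1
    return dL_dw1, dL_dw2


def numeric_dw1(w1: float, w2: float, x: float, y: float, eps: float = 1e-6) -> float:
    """Central finite difference, used only to sanity-check backward()."""
    up = forward(w1 + eps, w2, x, y)[2]
    down = forward(w1 - eps, w2, x, y)[2]
    return (up - down) / (2.0 * eps)


analytic, _ = backward(0.5, -1.5, 2.0, 1.0)
numeric = numeric_dw1(0.5, -1.5, 2.0, 1.0)
```

Each line of `backward` multiplies in one more factor of the chain, mirroring the product in the formula above. Real frameworks automate this bookkeeping with automatic differentiation.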

Vanishing Gradient Problem

In deep networks using sigmoid or tanh activations, the derivative at each layer is a number between 0 and 1 (at most 0.25 for sigmoid, at most 1 for tanh). Multiplying many such numbers together:

$$0.3 \times 0.3 \times 0.3 \times 0.3 \times 0.3 = 0.00243$$

After 10 layers the product is $0.3^{10} \approx 5.9 \times 10^{-6}$, effectively zero. Early layers receive near-zero gradients → they barely learn → deep sigmoid networks fail.

Solutions: ReLU activations, batch normalization, residual connections (skip connections), LSTM/GRU for sequences.

Exploding Gradient Problem

The opposite: gradients grow exponentially through layers. Manifests as NaN loss or wildly oscillating training.

Solutions:

Gradient clipping: cap gradient norm at a maximum value (e.g., 1.0)
Weight initialization techniques (Xavier/He initialization)
Batch normalization

Exam tip: Vanishing gradients → use ReLU, skip connections, or LSTM. Exploding gradients → use gradient clipping. Both are common issues in deep and recurrent networks.
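Gradient clipping by norm can be sketched in a few lines. This is a plain-Python illustration over a flat list of gradient values, not the API of any particular framework; the function name is hypothetical:

```python
import math
from typing import List, Sequence


def clip_by_global_norm(grads: Sequence[float], max_norm: float = 1.0) -> List[float]:
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)        # small gradients pass through untouched
    scale = max_norm / norm
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)  # original norm is 0.5
```

Scaling the whole vector preserves the gradient's direction and only limits its length, which is why norm clipping is usually preferred over clamping each component independently.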

Convolutional Neural Networks (CNNs)

Why Regular Networks Fail on Images

A 224×224 RGB image has $224 \times 224 \times 3 = 150{,}528$ inputs. A single dense hidden layer with 1,000 neurons would need about 150.5 million weights for layer 1 alone. Problems:

Too many parameters → overfitting
No spatial awareness: pixel at position (10,10) has no relationship to pixel at (11,11) in a fully connected layer
Not translation invariant: a cat in the top-left corner looks completely different from a cat in the bottom-right

Convolution Layer

A filter (kernel) is a small matrix (e.g., 3×3) that slides across the image. At each position, it computes the dot product with the underlying image patch.

3×3 filter sliding across a 5×5 input (stride=1, no padding):

Input (5×5):          Filter (3×3):       Feature Map (3×3):
┌─────────────┐       ┌───────┐
│ 1  2  3  4  5│      │ 1 0 -1│            ┌─────────┐
│ 6  7  8  9 10│  ×   │ 1 0 -1│   ──►     │  ?  ?  ?│
│11 12 13 14 15│      │ 1 0 -1│            │  ?  ?  ?│
│16 17 18 19 20│      └───────┘            │  ?  ?  ?│
│21 22 23 24 25│                           └─────────┘
└─────────────┘
     ↑                                      ↑
  5×5 = 25 values               (5-3+1)×(5-3+1) = 3×3 = 9 values

At top-left position:
  (1×1)+(2×0)+(3×-1) + (6×1)+(7×0)+(8×-1) + (11×1)+(12×0)+(13×-1)
= (1+0-3) + (6+0-8) + (11+0-13) = -2 + -2 + -2 = -6
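The remaining entries of the feature map follow the same pattern. A plain-Python sketch, assuming a square single-channel input and a square kernel (like most deep learning frameworks, it computes cross-correlation, meaning the kernel is not flipped); the function name is illustrative:

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Slide the kernel over the image with no padding and sum the products."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# The 5x5 input (values 1..25) and the vertical-edge filter from the diagram.
image = [[5 * r + c + 1 for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
feature_map = conv2d_valid(image, kernel)
```

For this particular input every one of the nine outputs is -6, because the values increase by the same amount from left to right everywhere, so the filter sees the same horizontal gradient at every position.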

Parameter sharing: The same 3×3 filter (9 parameters) is applied at every position in the image. This is the key insight — instead of learning a separate detector for every position, one filter learns to detect a feature anywhere in the image. Drastically reduces parameters.

Feature maps: Each filter produces one feature map. With 64 filters, you get 64 feature maps, each detecting a different low-level pattern.

Stride: How many pixels the filter moves each step. Stride=1 → dense output. Stride=2 → halves spatial dimensions.

Padding:

Valid: no padding → output smaller than input
Same: zero-pad input so output has same spatial dimensions as input
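Stride and padding combine into the standard output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$, where $n$ is the input size, $k$ the kernel size, $p$ the padding on each side, and $s$ the stride (symbols introduced here for illustration). A small helper to experiment with it:

```python
def conv_output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size of a convolution output along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


valid_5x5 = conv_output_size(5, kernel=3)                          # no padding
same_224 = conv_output_size(224, kernel=3, stride=1, padding=1)    # "same" padding
strided_224 = conv_output_size(224, kernel=3, stride=2, padding=1)  # stride 2
```

For a 3×3 kernel with stride 1, one pixel of padding on each side is what keeps the output the same size as the input.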

Pooling Layer

Downsamples the spatial dimensions of feature maps.

Max Pooling (2×2, stride=2):

  Input (4×4):         Output (2×2):
  ┌──┬──┬──┬──┐       ┌────┬────┐
  │ 1│ 3│ 2│ 4│       │ 6  │ 8  │
  ├──┼──┼──┼──┤  ──►  ├────┼────┤
  │ 5│ 6│ 1│ 8│       │ 9  │ 7  │
  ├──┼──┼──┼──┤       └────┴────┘
  │ 2│ 9│ 3│ 1│
  ├──┼──┼──┼──┤
  │ 4│ 3│ 7│ 2│
  └──┴──┴──┴──┘

  Each 2×2 block → its maximum value

Max pooling (preferred): keeps the strongest activation, provides translation invariance
Average pooling: takes the mean — preserves more information, sometimes used in final layers

Why pooling helps: reduces spatial resolution → fewer parameters in subsequent layers, and slight translation invariance (the cat can move a few pixels without changing the feature map drastically).
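The max-pooling example above can be reproduced with a short plain-Python sketch; the function name is illustrative:

```python
from typing import List

Matrix = List[List[float]]


def max_pool2d(feature_map: Matrix, size: int = 2, stride: int = 2) -> Matrix:
    """Replace each size x size window with its maximum value."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# The 4x4 input from the diagram.
pooled = max_pool2d([[1, 3, 2, 4], [5, 6, 1, 8], [2, 9, 3, 1], [4, 3, 7, 2]])
```

The result matches the 2×2 output in the diagram: 6 and 8 on the top row, 9 and 7 on the bottom.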

Full CNN Architecture

Full CNN Pipeline:

  Input Image
      │
      ▼
  ┌─────────────────┐
  │  Conv Layer     │ ── 32 filters of 3×3 → 32 feature maps
  │  ReLU           │
  │  Max Pool 2×2   │ ── halves spatial dims
  └────────┬────────┘
           │
      ▼
  ┌─────────────────┐
  │  Conv Layer     │ ── 64 filters of 3×3 → 64 feature maps
  │  ReLU           │
  │  Max Pool 2×2   │ ── halves again
  └────────┬────────┘
           │
      ▼
  ┌─────────────────┐
  │  Flatten        │ ── 3D feature maps → 1D vector
  │  Dense Layer    │ ── fully connected (128 neurons)
  │  ReLU           │
  │  Dropout        │ ── regularization
  │  Dense Output   │ ── softmax for N classes
  └─────────────────┘

Famous CNN Architectures

Architecture	Year	Key Innovation	Depth
LeNet	1998	First successful CNN	5 layers
AlexNet	2012	Deep CNN on GPU, ReLU, Dropout	8 layers
VGG	2014	All 3×3 convolutions, very deep	16-19
ResNet	2015	Skip connections	50-152+
Inception	2014	Parallel filter sizes in one layer	22

ResNet skip connections — the critical insight:

ResNet Skip Connection (Residual Block):

  Input x ──────────────────────────┐
      │                             │
      ▼                             │
  ┌────────┐                        │
  │ Conv   │                        │
  │ BN     │                        │ (identity shortcut)
  │ ReLU   │                        │
  └───┬────┘                        │
      │                             │
  ┌────────┐                        │
  │ Conv   │                        │
  │ BN     │                        │
  └───┬────┘                        │
      │                             │
      └──────────── + ◄─────────────┘
                    │
                  ReLU
                    │
                 Output = F(x) + x

Instead of learning the desired mapping $H(x)$ directly, the block learns the residual $F(x) = H(x) - x$ and outputs $F(x) + x$. If the layer is not needed, $F(x)$ simply goes to zero and the identity passes through unchanged. The skip connection also provides a gradient highway directly to early layers, which mitigates vanishing gradients in very deep networks (100+ layers).
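A toy sketch of the residual computation on plain lists. The `residual_fn` argument stands in for the Conv/BN stack in the diagram and is purely hypothetical; the final ReLU follows the diagram:

```python
from typing import Callable, List


def residual_block(x: List[float], residual_fn: Callable[[List[float]], List[float]]) -> List[float]:
    """Compute ReLU(F(x) + x), where F is the learned residual branch."""
    fx = residual_fn(x)
    return [max(0.0, f + xi) for f, xi in zip(fx, x)]


# If the residual branch outputs zeros, a non-negative input passes through unchanged.
identity_out = residual_block([1.0, 2.0], lambda v: [0.0 for _ in v])

# A non-zero residual nudges the input rather than replacing it.
nudged_out = residual_block([1.0, -2.0], lambda v: [0.5 * a for a in v])
```

Because the shortcut adds `x` back in, the branch only has to learn a correction to the input, which is an easier target than the full mapping when the ideal layer is close to the identity.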

Transfer Learning with CNNs

Transfer Learning Strategy:

  ImageNet pre-trained model (millions of images, 1000 classes)
  ┌──────────────────────────────────────┐
  │  Conv layers (frozen or fine-tuned)  │ ← Universal features
  │  Dense layers (frozen)               │   (edges, textures, objects)
  └──────────────────────────────────────┘
                      │
                      ▼ Replace final layer
  ┌──────────────────────────────────────┐
  │  New Dense Output Layer (your N classes) │ ← Train from scratch
  └──────────────────────────────────────┘

Why it works: early CNN layers learn universal visual features (edges, corners, textures) that transfer across domains. Only the final classification layer needs to learn your specific classes.

When to use transfer learning:

Small labeled dataset (< 10,000 images) + similar domain to pre-training data → freeze most layers, retrain final layers only
Medium dataset + different domain → fine-tune more layers
Large dataset → full training from scratch may beat transfer learning

ELI5: Instead of teaching a baby to see from scratch, start with someone who already knows what edges and shapes are, and just teach them to recognize YOUR specific objects. The pre-trained weights encode years of “visual education” from millions of images — you’re piggybacking on that knowledge.

1D CNNs for sequences: The convolution slides along the time/sequence axis instead of spatial dimensions. Effective for time series and text when local patterns matter more than long-range dependencies. Faster than RNNs, parallelizable.

CNN Use Cases

Task	Architecture	Output
Image classification	ResNet, VGG, EfficientNet	Class label + probability
Object detection	YOLO, SSD, Faster R-CNN	Bounding boxes + labels
Semantic segmentation	FCN, U-Net, DeepLab	Pixel-level label mask
Medical imaging	U-Net	Organ/tumor segmentation
Time series	1D CNN	Local pattern features

Recurrent Neural Networks (RNNs)

Why Feedforward Networks Fail on Sequences

A feedforward network treats each input independently. For the sentence “The cat sat on the mat because it was tired,” understanding that “it” refers to “cat” requires memory of previous words. Feedforward networks have no such memory.

The Recurrence Equation

$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b)$$

where $h_t$ is the hidden state at time step $t$ (the “memory”), $x_t$ is the current input, and $h_{t-1}$ is the previous hidden state.

RNN Unrolled Through Time:

  x₁ ──► [RNN] ──h₁──► [RNN] ──h₂──► [RNN] ──h₃──► ... ──► output
              ↑             ↑             ↑
         same weights  same weights  same weights
         W_hh, W_xh   (shared across all time steps)

The same weight matrices $W_{hh}$ and $W_{xh}$ are applied at every time step — analogous to convolutional parameter sharing, but across time instead of space.
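A plain-Python sketch of the recurrence, assuming $f = \tanh$ and a zero initial hidden state. The weight values are illustrative, and the helper names are not from any library:

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_step(h_prev: Vector, x_t: Vector, W_hh: Matrix, W_xh: Matrix, b: Vector) -> Vector:
    """One application of h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    recurrent = matvec(W_hh, h_prev)
    from_input = matvec(W_xh, x_t)
    return [math.tanh(r + i + bias) for r, i, bias in zip(recurrent, from_input, b)]


def run_rnn(xs: List[Vector], W_hh: Matrix, W_xh: Matrix, b: Vector) -> List[Vector]:
    h = [0.0] * len(b)            # assumed initial state
    states = []
    for x_t in xs:                # the same weights are reused at every step
        h = rnn_step(h, x_t, W_hh, W_xh, b)
        states.append(h)
    return states


W_hh = [[0.5, 0.0], [0.0, 0.5]]   # hidden size 2
W_xh = [[1.0], [-1.0]]            # input size 1
states = run_rnn([[1.0], [0.0], [0.0]], W_hh, W_xh, b=[0.0, 0.0])
```

Only the first input is non-zero here, so the later states show how its influence fades as it is repeatedly passed through `W_hh` and `tanh`.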

Vanishing Gradient in RNNs

For a sequence of length 100, the gradient of the loss with respect to the hidden state at step 1 involves 99 repeated multiplications by $W_{hh}$ (together with the activation derivative at each step). If the largest eigenvalue magnitude of $W_{hh}$ is < 1, gradients tend to vanish. If it is > 1, they tend to explode.

ELI5: Playing “telephone” with 50 people — by the time the message reaches the end, the original meaning is completely lost. The RNN’s gradient is the message traveling backwards. By the time it reaches step 1, it contains almost nothing useful. So early steps don’t learn from events that happened long ago.

LSTM (Long Short-Term Memory)

LSTM introduces a cell state $C_t$ — a separate memory highway that runs alongside the hidden state, protected from the repeated multiplication that causes vanishing gradients.

The gates and state updates:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}$$

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}$$

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \quad \text{(candidate values)}$$

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \quad \text{(cell state update)}$$

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}$$

$$h_t = o_t \cdot \tanh(C_t) \quad \text{(hidden state)}$$

LSTM Cell — All Gates:

                            ┌──────────────── C_{t-1} (cell state in) ──────────────┐
                            │                                                         │
  h_{t-1} ──┐              ▼                                                         ▼
             ├──► [Forget Gate f_t] ──►  × ──────────────────────────── + ──► C_t ──► tanh ──► ×──► h_t
  x_t ──────┘              │            ↑                                ↑            ↑
                            │       ┌───┘                          ┌────┘         ────┘
                            │  [Input Gate i_t] ──► × ──┘     [Output Gate o_t]
                            │  [Candidate Ĉ_t]  ──┘
                            │
  Gate summary:
    Forget gate:  "How much of old cell state to keep?" (0=forget all, 1=keep all)
    Input gate:   "How much of new candidate to write?"
    Output gate:  "How much of cell state to expose as hidden state?"
    Cell state:   The long-term memory highway

Why LSTM solves vanishing gradient: the cell state update $C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$ is additive rather than a repeated matrix multiplication followed by a squashing activation. Along the cell state path the gradient is scaled only by the forget gate $f_t$, so when $f_t$ stays close to 1 it flows nearly unchanged across many time steps.
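A scalar sketch of one LSTM step that follows the equations above term by term. It assumes hidden and input size 1, so each weight matrix collapses to a pair of numbers plus a bias; the parameter layout is a simplification chosen for illustration:

```python
import math
from typing import Dict, Tuple

Gate = Tuple[float, float, float]  # (weight on h_prev, weight on x_t, bias)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(h_prev: float, c_prev: float, x_t: float,
              params: Dict[str, Gate]) -> Tuple[float, float]:
    def pre_activation(name: str) -> float:
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))        # forget gate
    i_t = sigmoid(pre_activation("i"))        # input gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde        # additive cell state update
    o_t = sigmoid(pre_activation("o"))        # output gate
    h_t = o_t * math.tanh(c_t)                # hidden state
    return h_t, c_t


# Illustrative parameters: forget gate pinned open, input gate pinned shut.
keep_everything = {"f": (0.0, 0.0, 20.0), "i": (0.0, 0.0, -20.0),
                   "c": (0.0, 0.0, 0.0), "o": (0.0, 0.0, 0.0)}
h_t, c_t = lstm_step(h_prev=0.0, c_prev=0.7, x_t=1.0, params=keep_everything)
```

With the forget gate near 1 and the input gate near 0, the cell state is carried forward almost exactly, which is the "memory highway" behavior described above.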

ELI5: LSTM is a person with a notebook. The forget gate crosses things out. The input gate writes new things. The output gate decides what to read aloud. The notebook itself is the long-term memory that persists. The key insight: information in the notebook can survive for hundreds of time steps without being distorted — unlike the “telephone game” of vanilla RNNs.

GRU (Gated Recurrent Unit)

GRU simplifies LSTM into two gates: reset gate ($r_t$) and update gate ($z_t$).

$$z_t = \sigma(W_z [h_{t-1}, x_t]) \quad \text{(update gate)}$$

$$r_t = \sigma(W_r [h_{t-1}, x_t]) \quad \text{(reset gate)}$$

$$\tilde{h}_t = \tanh(W [r_t \cdot h_{t-1}, x_t]) \quad \text{(candidate hidden state)}$$

$$h_t = (1-z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t$$

No separate cell state — just one hidden state with gating.
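A scalar sketch of one GRU step under the same equations (no biases, as written above). It assumes hidden and input size 1, so each weight matrix collapses to a pair of numbers; the parameter layout is illustrative:

```python
import math
from typing import Tuple

Pair = Tuple[float, float]  # (weight on the hidden term, weight on x_t)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gru_step(h_prev: float, x_t: float, w_z: Pair, w_r: Pair, w_h: Pair) -> float:
    z_t = sigmoid(w_z[0] * h_prev + w_z[1] * x_t)                 # update gate
    r_t = sigmoid(w_r[0] * h_prev + w_r[1] * x_t)                 # reset gate
    h_tilde = math.tanh(w_h[0] * (r_t * h_prev) + w_h[1] * x_t)   # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # blend old and new


# Update gate pinned near 0: the old hidden state is kept.
kept = gru_step(h_prev=0.4, x_t=1.0, w_z=(0.0, -20.0), w_r=(0.0, 0.0), w_h=(1.0, 1.0))

# Update gate pinned near 1: the candidate replaces the old state.
replaced = gru_step(h_prev=0.4, x_t=1.0, w_z=(0.0, 20.0), w_r=(0.0, 0.0), w_h=(1.0, 1.0))
```

The single update gate plays the role of both the forget and input gates in the LSTM: whatever fraction of the old state is dropped is filled by the candidate.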

RNN vs LSTM vs GRU Comparison

Aspect	Vanilla RNN	LSTM	GRU
Gates	None	3 gates + cell state	2 gates
Parameters	Fewest	Most	Fewer than LSTM
Long-term memory	Poor	Excellent	Good
Training speed	Fastest	Slowest	Middle
Vanishing gradient	Yes	Largely mitigated	Largely mitigated
When to use	Toy examples	Long sequences, complex patterns	Smaller datasets, faster training

Decision rule: Start with GRU for faster iteration. Switch to LSTM if sequence dependencies are very long (> 100 steps) or accuracy matters more than speed.

Bidirectional RNNs

Bidirectional RNN:

  x₁       x₂       x₃       x₄
   │        │        │        │
   ▼        ▼        ▼        ▼
  [→]─────►[→]─────►[→]─────►[→]   (forward pass)
   ▲        ▲        ▲        ▲
  [←]◄─────[←]◄─────[←]◄─────[←]   (backward pass)
   │        │        │        │
   ▼        ▼        ▼        ▼
 concat   concat   concat   concat
   │        │        │        │
  y₁       y₂       y₃       y₄

Each output sees BOTH past AND future context.

The output at each time step is the concatenation of forward and backward hidden states. Essential for tasks that need future context: NER (“Bank” is a company if followed by “Inc.”), speech recognition, machine translation encoder.

Transformers & Attention

The Revolution

Transformers (2017, “Attention Is All You Need”) discarded recurrence entirely. Instead, every token attends directly to every other token via the attention mechanism. Key advantages over RNNs:

Fully parallelizable: no sequential dependency between positions during training
Better long-range dependencies: a direct path between any two tokens
Less gradient decay over distance: gradients follow those direct paths instead of passing through every intermediate step

Self-Attention Mechanism

For each token, compute three vectors: Query (Q), Key (K), Value (V) from learned weight matrices.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Interpretation:

$QK^T$: dot product of query with all keys → measures relevance of each token to the current token
$\sqrt{d_k}$: scaling to prevent dot products from growing too large (which would saturate softmax)
Softmax: turn relevance scores into probabilities (attention weights)
Multiply by $V$: weighted sum of value vectors — the output is a blend of all tokens, weighted by relevance
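The four steps above can be sketched in plain Python on small lists. The function names and the toy Q, K, V values are illustrative; real implementations use batched matrix operations:

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Step 1 and 2: dot product with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: turn scores into attention weights.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs


# Two identical keys get equal weight, so the output is the average of the values.
blended = attention(Q=[[1.0, 0.0]],
                    K=[[1.0, 0.0], [1.0, 0.0]],
                    V=[[2.0, 0.0], [4.0, 0.0]])
```

When one key matches the query far better than the others, its softmax weight approaches 1 and the output is dominated by that key's value vector.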

ELI5: In the sentence “The cat sat on the mat because it was tired”, self-attention helps the model understand that “it” refers to “cat” by computing a relevance score between “it” and every other word in the sentence. “it” has a high score with “cat” and low scores with “mat.” The output representation of “it” is then a weighted blend of all word representations — heavily blending in “cat.” This is how meaning is captured from context.

Multi-Head Attention

Run self-attention $h$ times in parallel with different learned weight matrices, then concatenate:

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

Each head can focus on different types of relationships (syntactic, semantic, coreference) simultaneously.

Positional Encoding

Since there is no recurrence, the Transformer has no inherent sense of position. Positional encodings (sinusoidal functions or learned embeddings) are added to token embeddings to encode order.
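A sketch of the sinusoidal variant, assuming the formulation from "Attention Is All You Need": even dimensions use sine, odd dimensions use cosine, and each pair of dimensions shares a wavelength set by the base 10000. The function name is illustrative:

```python
import math
from typing import List


def positional_encoding(num_positions: int, d_model: int) -> List[List[float]]:
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=4)
```

Each row of `pe` would be added element-wise to the token embedding at that position, giving every position a distinct signature without any learned parameters.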

Encoder-Decoder Architecture

Transformer Encoder-Decoder:

  "Je suis étudiant"         "I am a student"
       │                           │
  ┌────▼────┐                 ┌────▼────┐
  │ Encoder │                 │ Decoder │
  │         │ ─── context ──► │         │
  │ Self-   │    (K, V from   │ Masked  │
  │ Attention│   encoder)     │ Self-Att│
  │         │                 │ Cross-  │
  │ Feed-   │                 │ Attention│
  │ Forward │                 │ Feed-   │
  └─────────┘                 │ Forward │
                              └─────────┘
                                   │
                              Output tokens
                              (autoregressive)

Encoder: reads the entire input simultaneously, produces rich contextual representations (used in BERT)

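Decoder: generates output one token at a time, attending to previous outputs and, in encoder-decoder models such as machine translation, to the encoder context. GPT uses a decoder-only stack: masked self-attention with no encoder and no cross-attention.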

BERT vs GPT

Aspect	BERT	GPT
Architecture	Encoder-only	Decoder-only
Training	Masked language modeling (bidirectional)	Next token prediction (left-to-right)
Strength	Understanding tasks (classification, NER)	Generation tasks (text, code, chat)
Context	Sees full sequence bidirectionally	Sees only past tokens

RNN vs Transformer Comparison

RNN Processing (sequential — must wait):
  x₁ ──► h₁ ──► h₂ ──► h₃ ──► ... ──► hₙ

Transformer Processing (parallel — all at once):
  x₁ ──────────────────────────────►
  x₂ ──── Self-Attention (all tokens simultaneously) ────►
  x₃ ──────────────────────────────►
  xₙ ──────────────────────────────►

RNN processes tokens one at a time → slow for long sequences, hard to parallelize. Transformer processes all tokens simultaneously → much faster training on modern hardware (GPU/TPU).

Autoencoders

Architecture

Autoencoder:

  Input x        Bottleneck z         Reconstructed x̂
  (high-dim)     (low-dim)            (high-dim)

  ┌───────┐    ┌──────┐    ┌───────┐
  │ 784   │    │  32  │    │ 784   │
  │ dims  │──► │ dims │──► │ dims  │
  │       │    │      │    │       │
  │Encoder│    │Latent│    │Decoder│
  └───────┘    │Space │    └───────┘
               └──────┘

  Loss = reconstruction error: ||x̂ - x||²
  Training signal: how well can you reconstruct x from a tiny z?

The bottleneck forces the encoder to learn a compressed representation, and the decoder learns to reconstruct the input from it. Reconstruction is only the training signal: the useful product is the representation learned at the bottleneck.

ELI5: An autoencoder is like summarizing a book (encoder) then reconstructing it from the summary (decoder). The goal isn’t the reconstruction — it’s the summary itself. That compressed summary is the “embedding” or “latent representation” that captures the essence of the data. A good summary can reconstruct a good approximation of the original.

Use cases:

Dimensionality reduction: non-linear alternative to PCA
Anomaly detection: reconstruct normal data well; anomalies have high reconstruction error
Denoising: train with noisy input → clean output; forces learning of clean structure
Pre-training: learn representations before fine-tuning on a downstream task

Variational Autoencoders (VAE)

Instead of encoding to a fixed point $z$, VAE encodes to a distribution: mean $\mu$ and variance $\sigma^2$. Samples from this distribution become the decoder input.

Loss = reconstruction error + KL divergence (keeps the latent space smooth and continuous)

Since the latent space is smooth, you can interpolate between points and generate new data by sampling from the latent space. VAEs are generative models.
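A sketch of the two VAE-specific pieces, assuming a diagonal Gaussian encoder and a standard normal prior. It also assumes the encoder outputs the log-variance rather than the variance, which is a common implementation choice and not something stated above; the function names are illustrative:

```python
import math
import random
from typing import List, Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1), summed over dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def sample_latent(mu: Sequence[float], log_var: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Reparameterization: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)                                    # seeded for repeatability
z = sample_latent([0.0, 1.0], [0.0, -2.0], rng)           # decoder input
no_penalty = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # encoder matches the prior
penalty = kl_to_standard_normal([1.0], [0.0])               # mean shifted away from 0
```

The KL term is zero only when the encoder's distribution equals the prior and grows as the mean drifts or the variance departs from 1, which is what keeps the latent space compact and continuous.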

Generative Adversarial Networks (GANs)

Architecture

Two networks compete in a minimax game:

GAN Training Loop:

  Real data ──────────────────────────┐
                                       │
  Random noise z                       ▼
       │              ┌───────────────────────────┐
       ▼              │     Discriminator D        │
  ┌─────────┐         │  "Is this real or fake?"  │──► Real/Fake score
  │Generator│──fake──►│                           │
  │    G    │         └───────────────────────────┘
  └─────────┘                 │
       ↑                      │ gradient
       └──────── loss ◄────── ┘
         (fool D)

Generator ($G$): takes random noise $z$, outputs a fake sample. Trained to maximize $D$’s classification error (make fakes look real).

Discriminator ($D$): takes real or fake samples, outputs probability of being real. Trained to correctly distinguish real from fake.

Training objective:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Nash equilibrium: when $G$ produces perfect fakes, $D$ can only guess randomly (probability 0.5). At this point, training has converged.
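As a quick check of the value the objective takes at that point, substitute $D(\cdot) = 0.5$ for every input (natural logarithm assumed):

$$\mathbb{E}[\log 0.5] + \mathbb{E}[\log(1 - 0.5)] = 2\log 0.5 = -\log 4 \approx -1.386$$

In practice GAN training rarely settles exactly at this value, so treat it as a reference point rather than a stopping rule.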

Mode collapse: $G$ finds one type of output that consistently fools $D$ and gets stuck producing only that type. Solutions: Wasserstein GAN, minibatch discrimination, historical averaging.

ELI5: A counterfeiter (generator) tries to make fake money. A detective (discriminator) tries to spot fakes. They both get better through competition — the counterfeiter learns from every time they get caught, the detective learns from every fake that slipped through. Eventually the fakes are indistinguishable from real bills. GANs use this adversarial competition to generate extremely realistic data.

Use cases: image generation, data augmentation (generate more training examples), style transfer, image-to-image translation (pix2pix), super-resolution.

Practical Deep Learning on AWS

GPU Instance Types

Family	GPU	Best For
P3	NVIDIA V100	Standard DL training
P4d	NVIDIA A100	Large model training
P5	NVIDIA H100	Transformer / LLM training
G4dn	NVIDIA T4	Cost-effective inference + training
G5	NVIDIA A10G	Inference, vision models
Inf1/Inf2	AWS Inferentia	Inference only (custom chip, very cheap)

Distributed Training Strategies

Data Parallelism (most common):

  GPU 1: full model copy ──► batch 1 ──► gradient 1 ──┐
  GPU 2: full model copy ──► batch 2 ──► gradient 2 ──┼──► average gradients ──► update weights
  GPU 3: full model copy ──► batch 3 ──► gradient 3 ──┤
  GPU 4: full model copy ──► batch 4 ──► gradient 4 ──┘

Model Parallelism (for huge models):

  GPU 1: layers 1-10 ──► activations ──► GPU 2: layers 11-20 ──► GPU 3: layers 21-30
  (model too big to fit on one GPU → split across GPUs)

SageMaker Data Parallelism library: optimized AllReduce implementation for multi-GPU/multi-node
SageMaker Model Parallelism library: automatic pipeline parallelism for models > GPU memory
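The gradient-averaging step in data parallelism can be simulated in a single process. This toy sketch is not the SageMaker library API; the function names and values are hypothetical and only illustrate what an AllReduce mean followed by an identical update on every replica achieves:

```python
from typing import List, Sequence


def all_reduce_mean(per_worker_grads: Sequence[Sequence[float]]) -> List[float]:
    """Average the gradient for each parameter across all workers."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


def sgd_update(weights: Sequence[float], grad: Sequence[float], lr: float) -> List[float]:
    return [w - lr * g for w, g in zip(weights, grad)]


# Four workers, each with a gradient computed on its own mini-batch.
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg_grad = all_reduce_mean(worker_grads)

# Every replica applies the same averaged gradient, so the copies stay in sync.
new_weights = sgd_update([10.0, 10.0], avg_grad, lr=0.5)
```

Because every replica starts from the same weights and applies the same averaged gradient, the result is equivalent to one step on the combined batch.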

Cost Optimization

Managed Spot Training: use EC2 Spot instances (70-90% cheaper), SageMaker handles interruptions and auto-resumes from checkpoints
Checkpointing: save model state to S3 periodically; critical for spot training since instances can be reclaimed
SageMaker Training Compiler: optimizes DL training code (up to 50% speedup, up to 50% cost reduction) by optimizing GPU utilization — works with TensorFlow and PyTorch

Training Observability

SageMaker Debugger: captures tensors during training, monitors for issues (vanishing/exploding gradients, overfit, poor weight initialization) and can auto-stop unhealthy training jobs
CloudWatch: training metrics (loss, accuracy) per epoch
SageMaker Experiments: track hyperparameters, metrics, artifacts across training runs

Architecture Selection Guide

Why this matters for the exam: Every scenario question about model design requires mapping problem type → architecture. Memorize this mapping.

Problem Type	Data	Architecture	Key Reason
Image classification	2D images	CNN (ResNet/EfficientNet)	Spatial feature hierarchy
Object detection	2D images	CNN (YOLO/SSD)	Bounding box regression
Pixel segmentation	2D images	FCN/U-Net	Dense prediction
Sequence classification	Text, time series	LSTM/GRU or 1D CNN	Temporal patterns
Text understanding	Documents	BERT (Transformer encoder)	Bidirectional context
Text generation	Language	GPT (Transformer decoder)	Autoregressive
Translation	Sequence pairs	Transformer (enc-dec)	Long-range alignment
Anomaly detection	Any	Autoencoder	Reconstruction error
Data generation	Any modality	GAN or VAE	Generative modeling
Tabular data	Structured	XGBoost (not deep learning)	Gradient boosting dominates
Multi-series forecast	Time series	DeepAR (LSTM)	Cross-series patterns

Exam tip: For tabular/structured data, XGBoost typically outperforms deep learning and is cheaper to train. Deep learning advantages kick in for unstructured data (images, text, audio) where feature engineering is impractical.

Quick Reference

Deep Learning Architecture Decision Tree:

  What is your data type?
  │
  ├─ Images / Video ──────────────────► CNN
  │   ├─ Classify whole image?         → ResNet + transfer learning
  │   ├─ Find objects with boxes?      → SSD / Faster R-CNN
  │   └─ Label every pixel?            → U-Net / DeepLab
  │
  ├─ Sequential / Time series ────────► RNN family
  │   ├─ Short sequences?              → GRU (faster)
  │   ├─ Long sequences?               → LSTM (better memory)
  │   ├─ Bidirectional context?        → Bidirectional LSTM
  │   └─ Multiple related series?      → DeepAR
  │
  ├─ Text / Language ─────────────────► Transformers
  │   ├─ Understand / classify text?   → BERT (encoder)
  │   ├─ Generate text?                → GPT (decoder)
  │   └─ Translate / summarize?        → Encoder-Decoder Transformer
  │
  ├─ Generate new data ───────────────► Generative models
  │   ├─ Realistic samples?            → GAN
  │   └─ Smooth latent space?          → VAE
  │
  └─ Compress / detect anomalies ─────► Autoencoder
