AWS MLS-C01: ML Specialty

Domain 3D: Deep Learning & Neural Networks

Exam Domain: 3, ML Model Development (Deep Learning)

Task: Understand deep learning architectures and when to apply them


Neural Network Fundamentals

The Single Neuron

A neuron is the atomic unit of a neural network. It computes:

$$output = activation\left(\sum_{i} w_i x_i + b\right)$$

where $x_i$ are inputs, $w_i$ are learnable weights, $b$ is a bias term, and $activation$ introduces non-linearity.

Single Neuron:

  x₁ ──w₁──┐
  x₂ ──w₂──┤
  x₃ ──w₃──┼──► z = Σwᵢxᵢ + b ──► activation(z) ──► output
  x₄ ──w₄──┤
  x₅ ──w₅──┘

  Each input is scaled by its weight.
  The sum passes through an activation function.
  The result is one number: how much this neuron "fires."

ELI5: A neuron is a tiny decision maker. It takes inputs, weights them by importance, sums them up, and passes the result through a “squish” function to decide how much signal to send forward. A high weight on input $x_3$ means “pay a lot of attention to $x_3$.” The activation function prevents the network from just doing endless linear arithmetic — it adds the bends and curves that let it learn complex patterns.
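The neuron formula can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the function names `sigmoid` and `neuron` and the input values are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only; they are not taken from the text.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]   # a weight of 0.0 means "ignore this input"
b = 0.0

out = neuron(x, w, b)   # z = 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
print(out)
```

Note that the bias is added once, after the weighted sum, which matches the formula $\sum_i w_i x_i + b$.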

Layers: Input → Hidden → Output

Multi-Layer Neural Network (3-layer):

  Input Layer       Hidden Layer 1     Hidden Layer 2    Output Layer
  (raw features)    (simple patterns)  (complex patterns) (prediction)

     x₁ ──────┐    ┌── h₁₁ ──┐    ┌── h₂₁ ──┐
     x₂ ──────┼──► ├── h₁₂ ──┼──► ├── h₂₂ ──┼──► ŷ
     x₃ ──────┤    ├── h₁₃ ──┤    ├── h₂₃ ──┤
     x₄ ──────┘    └── h₁₄ ──┘    └── h₂₄ ──┘

  "Depth" = number of hidden layers
  "Width" = number of neurons per layer

What each layer learns (in a CNN processing images):

  • Layer 1: edges and gradients
  • Layer 2: corners, textures, simple shapes
  • Layer 3: object parts (eyes, wheels, doors)
  • Layer 4+: full objects and high-level concepts

ELI5: Deep learning is powerful because each layer builds on the previous one. Layer 1 sees edges. Layer 2 sees shapes. Layer 3 sees objects. It’s like building understanding from pixels to meaning — the same way a child learns that dots and lines form letters, letters form words, and words form meaning. The “deep” in deep learning just means many of these stacked layers.

Universal Approximation Theorem: A network with a single hidden layer, a non-linear (non-polynomial) activation, and enough neurons can approximate any continuous function on a bounded input region to arbitrary accuracy. The theorem only says that such a network exists, not that training will find it or that it will be small. In practice, depth (more layers) is usually more parameter-efficient than width (more neurons in one layer) for learning complex patterns.


Activation Functions

Activation functions introduce non-linearity. Without them, stacking layers is mathematically equivalent to a single linear layer — useless for complex problems.
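The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses two small weight matrices with made-up values (`W1`, `W2`, and `x` are assumptions for illustration) and shows that applying them in sequence gives the same result as applying their product once.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs, no activation.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.5]]
x = [2.0, -1.0]

two_layers = matvec(W2, matvec(W1, x))   # layer 1, then layer 2
one_layer = matvec(matmul(W2, W1), x)    # a single layer with weights W2 @ W1

print(two_layers, one_layer)             # both are [10.0, 3.5]
```

Inserting a non-linear function between the two multiplications is what breaks this equivalence and lets extra layers add expressive power.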

Sigmoid

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

Sigmoid shape:     Range: (0, 1)

    1.0 ─────────────────────────
                              ╱
    0.5 ──────────────────╱───
                      ╱
    0.0 ─────────╱────────────
         ──────────────────────→ x
         -4   -2    0    2    4
  • Use: output layer for binary classification
  • Problem 1 — Vanishing gradient: derivatives are near zero at extremes → gradients shrink to nothing through many layers → network stops learning
  • Problem 2 — Not zero-centered: outputs always positive → inefficient gradient updates

Tanh

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

  • Range: $(-1, 1)$, zero-centered (better than sigmoid for hidden layers)
  • Still suffers from vanishing gradients at extremes

ReLU (Rectified Linear Unit)

$$f(x) = \max(0, x)$$

ReLU shape:     Range: [0, ∞)

         ╱
        ╱
───────╱───────────→ x
  negative  positive
  = 0       = x
  • Why it’s the default: no vanishing gradient for positive values, computationally trivial ($\max$ operation), sparse activations (many neurons output 0 → efficient)
  • Dying ReLU problem: if a neuron always receives negative inputs, it always outputs 0 and its gradient is also 0 → it never learns again (“dead neuron”). Caused by large learning rates or poor initialization.

Leaky ReLU

$$f(x) = \max(0.01x,\ x)$$

  • Fixes dying ReLU: negative inputs get a small non-zero gradient (0.01 slope)
  • The leak coefficient (0.01) can itself be learned → PReLU (Parametric ReLU)

ELU and SELU

  • ELU: smooth negative region; negative outputs pull mean activations toward zero, which tends to speed up learning
  • SELU (Scaled ELU): shown mathematically to keep activations self-normalized (mean ≈ 0, variance ≈ 1) across the layers of a fully connected network when paired with LeCun normal initialization, which removes the need for batch normalization in many cases

Softmax

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

  • Outputs sum to 1 → interpretable as probabilities
  • Use: output layer for multi-class classification
  • Key property: amplifies the largest value relative to others (temperature controls “sharpness”)

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(x) = x \cdot \Phi(x)$$

  • Smooth probabilistic approximation of ReLU
  • Used in Transformers (BERT, GPT) because smooth derivatives help with very deep training
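The formulas in this section translate almost directly into code. The following is a minimal standard-library sketch for scalar inputs; the function names and the `temperature` parameter are assumptions for illustration, and real frameworks provide vectorized versions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Equivalent to max(slope * x, x) whenever 0 < slope < 1.
    return x if x > 0 else slope * x

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(z, temperature=1.0):
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    scaled = [v / temperature for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
sharp = softmax([1.0, 2.0, 3.0], temperature=0.5)
print(probs, sharp)   # the lower temperature puts more mass on the largest logit
```

`math.tanh` already covers tanh, so it is not redefined here.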

Activation Function Comparison

| Function | Range | Zero-centered | Vanishing Gradient | Use Case |
|---|---|---|---|---|
| Sigmoid | (0, 1) | No | Yes (severe) | Binary output layer |
| Tanh | (-1, 1) | Yes | Yes (moderate) | RNN hidden states |
| ReLU | [0, ∞) | No | No (for x > 0) | Default hidden layer |
| Leaky ReLU | (-∞, ∞) | Approx | No | When dying ReLU is a problem |
| SELU | (≈ -1.76, ∞) | Self-normalizes | No | Deep FFN without BatchNorm |
| Softmax | (0, 1), sum = 1 | No | | Multi-class output layer |
| GELU | (≈ -0.17, ∞) | Approx | No | Transformers |

Decision guide:

  • Hidden layers → ReLU (default), Leaky ReLU if neurons are dying
  • Binary output → Sigmoid
  • Multi-class output → Softmax
  • RNN hidden states → Tanh (the standard choice inside vanilla RNN, LSTM, and GRU cells)
  • Transformers → GELU

Backpropagation

First Principles

Training a neural network means finding weights $W$ that minimize a loss function $L$. Backpropagation computes $\frac{\partial L}{\partial W}$ for every weight using the chain rule of calculus.

Forward pass: input $\rightarrow$ activations layer by layer $\rightarrow$ prediction $\rightarrow$ compute loss

Backward pass: loss $\rightarrow$ gradient through output layer $\rightarrow$ gradient through each hidden layer (in reverse) $\rightarrow$ update weights

Chain rule applied: if $L$ depends on $z$, and $z$ depends on $W$, then:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial W}$$

For a deep network with layers $1, 2, \ldots, n$:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdot \ldots \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_1}$$

ELI5: Imagine a factory assembly line. The product at the end is wrong. Backpropagation traces backward through every station asking “how much did YOU contribute to the error?” Station 5 answers honestly. Station 4 uses that answer to figure out its own contribution. And so on back to station 1. Then each station adjusts its process slightly to reduce the error. Repeat millions of times and the factory learns to make perfect products.
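As a concrete sketch of the chain rule at work, the snippet below backpropagates through a tiny two-weight network and compares the result with a finite-difference estimate. The network shape (one sigmoid hidden unit, a linear output, squared-error loss) and all values are assumptions chosen to keep the example small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x, y):
    a1 = sigmoid(w1 * x)          # hidden activation
    a2 = w2 * a1                  # linear output
    loss = 0.5 * (a2 - y) ** 2    # squared error
    return a1, a2, loss

def backward(w1, w2, x, y):
    """Apply the chain rule from the loss back to each weight."""
    a1, a2, _ = forward(w1, w2, x, y)
    dL_da2 = a2 - y
    dL_dw2 = dL_da2 * a1                     # a2 = w2 * a1
    dL_da1 = dL_da2 * w2
    dL_dw1 = dL_da1 * a1 * (1.0 - a1) * x    # sigmoid'(z) = a1 * (1 - a1)
    return dL_dw1, dL_dw2

def numeric_grad(w1, w2, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g1 = (forward(w1 + eps, w2, x, y)[2] - forward(w1 - eps, w2, x, y)[2]) / (2 * eps)
    g2 = (forward(w1, w2 + eps, x, y)[2] - forward(w1, w2 - eps, x, y)[2]) / (2 * eps)
    return g1, g2

w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
analytic = backward(w1, w2, x, y)
numeric = numeric_grad(w1, w2, x, y)

# One gradient-descent step with a hypothetical learning rate of 0.1.
old_loss = forward(w1, w2, x, y)[2]
new_loss = forward(w1 - 0.1 * analytic[0], w2 - 0.1 * analytic[1], x, y)[2]
print(analytic, numeric, old_loss, new_loss)
```

The analytic and numeric gradients should agree to several decimal places, and the loss should drop after the update step.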

Vanishing Gradient Problem

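In deep networks using sigmoid or tanh activations, each layer multiplies the gradient by an activation derivative that is at most 1 (at most 0.25 for sigmoid) and often much smaller. Multiplying many such numbers together shrinks the gradient exponentially. With five layers and a typical factor of 0.3:

$$0.3 \times 0.3 \times 0.3 \times 0.3 \times 0.3 = 0.00243$$

With 10 layers the product is $0.3^{10} \approx 6 \times 10^{-6}$, which is effectively zero. Early layers receive near-zero gradients → they barely learn → deep sigmoid networks fail.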

Solutions: ReLU activations, batch normalization, residual connections (skip connections), LSTM/GRU for sequences.
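The shrinkage is easy to reproduce. This sketch treats the gradient reaching the first layer as a simple product of identical per-layer factors, which is a simplifying assumption (real networks also multiply by weight matrices); the function names are illustrative.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(per_layer_factor, n_layers):
    """Overall scaling after multiplying the same factor once per layer."""
    return per_layer_factor ** n_layers

five = gradient_scale(0.3, 5)                             # 0.00243
ten = gradient_scale(0.3, 10)                             # about 5.9e-06
best_sigmoid_ten = gradient_scale(sigmoid_grad(0.0), 10)  # 0.25 ** 10

print(five, ten, best_sigmoid_ten)
```

Even in the best case for sigmoid (every unit sitting exactly at $z = 0$), ten layers scale the gradient by less than one millionth.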

Exploding Gradient Problem

The opposite: gradients grow exponentially through layers. Manifests as NaN loss or wildly oscillating training.

Solutions:

  • Gradient clipping: cap gradient norm at a maximum value (e.g., 1.0)
  • Weight initialization techniques (Xavier/He initialization)
  • Batch normalization

Exam tip: Vanishing gradients → use ReLU, skip connections, or LSTM. Exploding gradients → use gradient clipping. Both are common issues in deep and recurrent networks.
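Gradient clipping by norm can be sketched as follows. This is a plain-Python illustration over a flat list of gradient values; the function name `clip_by_norm` is an assumption, and deep learning frameworks ship their own clipping utilities.

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # small gradients pass through untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [3.0, 4.0]                # L2 norm is 5.0
clipped = clip_by_norm(exploding, max_norm=1.0)
print(clipped)                        # roughly [0.6, 0.8]: same direction, norm 1.0
```

Clipping changes only the length of the gradient, not its direction, so the update still points the same way but cannot take an arbitrarily large step.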


Convolutional Neural Networks (CNNs)

Why Regular Networks Fail on Images

A 224×224 RGB image has $224 \times 224 \times 3 = 150{,}528$ inputs. A single dense hidden layer with 1,000 neurons would need about 150 million weights for layer 1 alone. Problems:

  1. Too many parameters → overfitting
  2. No spatial awareness: pixel at position (10,10) has no relationship to pixel at (11,11) in a fully connected layer
  3. Not translation invariant: a cat in the top-left corner looks completely different from a cat in the bottom-right

Convolution Layer

A filter (kernel) is a small matrix (e.g., 3×3) that slides across the image. At each position, it computes the dot product with the underlying image patch.

3×3 filter sliding across a 5×5 input (stride=1, no padding):

Input (5×5):          Filter (3×3):       Feature Map (3×3):
┌─────────────┐       ┌───────┐
│ 1  2  3  4  5│      │ 1 0 -1│            ┌─────────┐
│ 6  7  8  9 10│  ×   │ 1 0 -1│   ──►     │  ?  ?  ?│
│11 12 13 14 15│      │ 1 0 -1│            │  ?  ?  ?│
│16 17 18 19 20│      └───────┘            │  ?  ?  ?│
│21 22 23 24 25│                           └─────────┘
└─────────────┘
     ↑                                      ↑
  5×5 = 25 values               (5-3+1)×(5-3+1) = 3×3 = 9 values

At top-left position:
  (1×1)+(2×0)+(3×-1) + (6×1)+(7×0)+(8×-1) + (11×1)+(12×0)+(13×-1)
= (1+0-3) + (6+0-8) + (11+0-13) = -2 + -2 + -2 = -6
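The worked example can be reproduced with a short loop. This is a minimal sketch assuming a square single-channel input and no padding; like deep learning frameworks, it slides the filter without flipping it (strictly speaking, a cross-correlation).

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide a square kernel over a square image with no padding."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out

# The 5x5 input (values 1..25) and the 3x3 filter from the diagram.
image = [[5 * r + c + 1 for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1]] * 3

feature_map = conv2d_valid(image, kernel)
print(feature_map)   # [[-6, -6, -6], [-6, -6, -6], [-6, -6, -6]]
```

Every entry is -6 because this filter subtracts the right column from the left column, and in this input the two always differ by 2 in each of the three rows.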

Parameter sharing: The same 3×3 filter (9 parameters) is applied at every position in the image. This is the key insight — instead of learning a separate detector for every position, one filter learns to detect a feature anywhere in the image. Drastically reduces parameters.

Feature maps: Each filter produces one feature map. With 64 filters, you get 64 feature maps, each detecting a different low-level pattern.
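To put numbers on the savings from parameter sharing, the sketch below compares the dense layer from the start of this section with a convolutional layer of 64 filters. It assumes 3×3 filters spanning all 3 RGB channels (so 27 weights per filter rather than 9) and one bias per neuron or filter; the helper names are illustrative.

```python
def dense_params(n_inputs, n_neurons):
    """One weight per input per neuron, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

def conv_params(kernel_size, in_channels, n_filters):
    """Each filter has kernel_size x kernel_size x in_channels weights plus one bias."""
    return kernel_size * kernel_size * in_channels * n_filters + n_filters

dense = dense_params(224 * 224 * 3, 1000)   # 150,529,000
conv = conv_params(3, 3, 64)                # 1,792

print(dense, conv, dense // conv)
```

The convolutional layer's parameter count does not depend on the image size at all, only on the filter size, the number of input channels, and the number of filters.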

Stride: How many pixels the filter moves each step. Stride=1 → dense output. Stride=2 → halves spatial dimensions.

Padding:

  • Valid: no padding → output smaller than input
  • Same: zero-pad input so output has same spatial dimensions as input
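Stride and padding combine into one standard output-size formula, $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, kernel size $k$, padding $p$, and stride $s$. A small helper (the name `conv_output_size` is an assumption) makes it easy to check layer shapes:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size along one dimension of a convolution or pooling layer."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(5, 3))                          # 3: the 5x5 example, valid padding
print(conv_output_size(224, 3, stride=1, padding=1))   # 224: "same" padding for a 3x3 kernel
print(conv_output_size(224, 3, stride=2, padding=1))   # 112: stride 2 halves the size
```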

Pooling Layer

Downsamples the spatial dimensions of feature maps.

Max Pooling (2×2, stride=2):

  Input (4×4):         Output (2×2):
  ┌──┬──┬──┬──┐       ┌────┬────┐
  │ 1│ 3│ 2│ 4│       │ 6  │ 8  │
  ├──┼──┼──┼──┤  ──►  ├────┼────┤
  │ 5│ 6│ 1│ 8│       │ 9  │ 7  │
  ├──┼──┼──┼──┤       └────┴────┘
  │ 2│ 9│ 3│ 1│
  ├──┼──┼──┼──┤
  │ 4│ 3│ 7│ 2│
  └──┴──┴──┴──┘

  Each 2×2 block → its maximum value
  • Max pooling (preferred): keeps the strongest activation, provides translation invariance
  • Average pooling: takes the mean — preserves more information, sometimes used in final layers

Why pooling helps: reduces spatial resolution → fewer parameters in subsequent layers, and slight translation invariance (the cat can move a few pixels without changing the feature map drastically).
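The max pooling example above can be reproduced with a short function. This is a plain-Python sketch for a single 2D feature map; the name `max_pool2d` is illustrative.

```python
def max_pool2d(x, size=2, stride=2):
    """Take the maximum of each size x size window of a 2D list."""
    rows = (len(x) - size) // stride + 1
    cols = (len(x[0]) - size) // stride + 1
    return [
        [
            max(x[i * stride + di][j * stride + dj]
                for di in range(size) for dj in range(size))
            for j in range(cols)
        ]
        for i in range(rows)
    ]

# The 4x4 input from the diagram.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 8],
    [2, 9, 3, 1],
    [4, 3, 7, 2],
]
pooled = max_pool2d(feature_map)
print(pooled)   # [[6, 8], [9, 7]]
```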

Full CNN Architecture

Full CNN Pipeline:

  Input Image
      │
      ▼
  ┌─────────────────┐
  │  Conv Layer     │ ── 32 filters of 3×3 → 32 feature maps
  │  ReLU           │
  │  Max Pool 2×2   │ ── halves spatial dims
  └────────┬────────┘
           │
           ▼
  ┌─────────────────┐
  │  Conv Layer     │ ── 64 filters of 3×3 → 64 feature maps
  │  ReLU           │
  │  Max Pool 2×2   │ ── halves again
  └────────┬────────┘
           │
           ▼
  ┌─────────────────┐
  │  Flatten        │ ── 3D feature maps → 1D vector
  │  Dense Layer    │ ── fully connected (128 neurons)
  │  ReLU           │
  │  Dropout        │ ── regularization
  │  Dense Output   │ ── softmax for N classes
  └─────────────────┘
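To see how shapes and parameter counts evolve through this pipeline, the sketch below traces it by hand. Several details are assumptions not stated in the diagram: a 32×32 RGB input, "same" padding on both convolutions, and N = 10 output classes.

```python
def conv_same(shape, filters, k=3):
    """3x3 conv with 'same' padding: spatial size kept, channels become `filters`."""
    h, w, c_in = shape
    params = k * k * c_in * filters + filters
    return (h, w, filters), params

def pool(shape, size=2):
    """2x2 max pooling halves height and width and has no parameters."""
    h, w, c = shape
    return (h // size, w // size, c)

def dense(n_in, n_out):
    """Fully connected layer: returns (output width, parameter count)."""
    return n_out, n_in * n_out + n_out

shape = (32, 32, 3)                    # assumed input size
shape, p1 = conv_same(shape, 32)       # (32, 32, 32)
shape = pool(shape)                    # (16, 16, 32)
shape, p2 = conv_same(shape, 64)       # (16, 16, 64)
shape = pool(shape)                    # (8, 8, 64)
flat = shape[0] * shape[1] * shape[2]  # 4096 values after Flatten
n, p3 = dense(flat, 128)
n, p4 = dense(n, 10)                   # assumed 10 classes

total = p1 + p2 + p3 + p4
print(shape, flat, (p1, p2, p3, p4), total)
```

Under these assumptions the first dense layer holds the large majority of the parameters, which is one reason dropout is placed there.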

Famous CNN Architectures

| Architecture | Year | Key Innovation | Depth (layers) |
|---|---|---|---|
| LeNet | 1998 | First successful CNN | 5 |
| AlexNet | 2012 | Deep CNN on GPU, ReLU, Dropout | 8 |
| VGG | 2014 | All 3×3 convolutions, very deep | 16-19 |
| ResNet | 2015 | Skip connections | 50-152+ |
| Inception | 2014 | Parallel filter sizes in one layer | 22 |

ResNet skip connections — the critical insight:

ResNet Skip Connection (Residual Block):

  Input x ──────────────────────────┐
      │                             │
      ▼                             │
  ┌────────┐                        │
  │ Conv   │                        │
  │ BN     │                        │ (identity shortcut)
  │ ReLU   │                        │
  └───┬────┘                        │
      │                             │
  ┌────────┐                        │
  │ Conv   │                        │
  │ BN     │                        │
  └───┬────┘                        │
      │                             │
      └──────────── + ◄─────────────┘
                    │
                  ReLU
                    │
                 Output = F(x) + x

Instead of learning the desired mapping $H(x)$ directly, the block learns the residual $F(x) = H(x) - x$ and outputs $F(x) + x$. If the layer is not needed, $F(x)$ can simply go to zero and the identity passes through unchanged. The skip connection also provides a gradient highway directly to early layers, which largely removes the vanishing gradient problem for very deep networks (100+ layers).
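The identity shortcut can be illustrated without any framework. In this sketch the residual branch is an arbitrary function passed in as `residual_fn`, a hypothetical stand-in for the Conv → BN → ReLU → Conv → BN stack in the diagram.

```python
def relu_vec(v):
    return [max(0.0, a) for a in v]

def residual_block(x, residual_fn):
    """Compute ReLU(F(x) + x), where F is the learned residual branch."""
    fx = residual_fn(x)
    return relu_vec([f + xi for f, xi in zip(fx, x)])

def zero_branch(x):
    """A residual branch that has learned to contribute nothing."""
    return [0.0] * len(x)

x = [1.0, 2.0, 0.5]
out = residual_block(x, zero_branch)
print(out)   # [1.0, 2.0, 0.5]: the block acts as the identity
```

Because the output is $F(x) + x$, its derivative with respect to $x$ is $F'(x) + 1$. The constant 1 is the "gradient highway": even when $F'(x)$ is tiny, the gradient still reaches earlier layers.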

Transfer Learning with CNNs

Transfer Learning Strategy:

  ImageNet pre-trained model (millions of images, 1000 classes)
  ┌──────────────────────────────────────┐
  │  Conv layers (frozen or fine-tuned)  │ ← Universal features
  │  Dense layers (frozen)               │   (edges, textures, objects)
  └──────────────────────────────────────┘
                      │
                      ▼ Replace final layer
  ┌──────────────────────────────────────┐
  │  New Dense Output Layer (your N classes) │ ← Train from scratch
  └──────────────────────────────────────┘

Why it works: early CNN layers learn universal visual features (edges, corners, textures) that transfer across domains. Only the final classification layer needs to learn your specific classes.

When to use transfer learning:

  • Small labeled dataset (< 10,000 images) + similar domain to pre-training data → freeze most layers, retrain final layers only
  • Medium dataset + different domain → fine-tune more layers
  • Large dataset → full training from scratch may beat transfer learning

ELI5: Instead of teaching a baby to see from scratch, start with someone who already knows what edges and shapes are, and just teach them to recognize YOUR specific objects. The pre-trained weights encode years of “visual education” from millions of images — you’re piggybacking on that knowledge.

1D CNNs for sequences: The convolution slides along the time/sequence axis instead of spatial dimensions. Effective for time series and text when local patterns matter more than long-range dependencies. Faster than RNNs, parallelizable.

CNN Use Cases

| Task | Architecture | Output |
|---|---|---|
| Image classification | ResNet, VGG, EfficientNet | Class label + probability |
| Object detection | YOLO, SSD, Faster R-CNN | Bounding boxes + labels |
| Semantic segmentation | FCN, U-Net, DeepLab | Pixel-level label mask |
| Medical imaging | U-Net | Organ/tumor segmentation |
| Time series | 1D CNN | Local pattern features |

Recurrent Neural Networks (RNNs)

Why Feedforward Networks Fail on Sequences

A feedforward network treats each input independently. For the sentence “The cat sat on the mat because it was tired,” understanding that “it” refers to “cat” requires memory of previous words. Feedforward networks have no such memory.

The Recurrence Equation

$$h_t = f(W_{hh} h_{t-1} + W_{xh} x_t + b)$$

where $h_t$ is the hidden state at time step $t$ (the “memory”), $x_t$ is the current input, and $h_{t-1}$ is the previous hidden state.

RNN Unrolled Through Time:

  x₁ ──► [RNN] ──h₁──► [RNN] ──h₂──► [RNN] ──h₃──► ... ──► output
              ↑             ↑             ↑
         same weights  same weights  same weights
         W_hh, W_xh   (shared across all time steps)

The same weight matrices $W_{hh}$ and $W_{xh}$ are applied at every time step — analogous to convolutional parameter sharing, but across time instead of space.
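The recurrence can be sketched as a loop that reuses one set of weights at every step. The example assumes $f = \tanh$ and uses small made-up weight matrices; `rnn_step` and `run_rnn` are illustrative names.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b) for plain Python lists."""
    h_t = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        z += sum(W_xh[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(z))
    return h_t

def run_rnn(xs, h0, W_hh, W_xh, b):
    """Apply the same weights at every time step and collect the hidden states."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(h, x_t, W_hh, W_xh, b)
        states.append(h)
    return states

# Hypothetical weights: hidden size 2, input size 1.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_xh = [[1.0], [-1.0]]
b = [0.0, 0.0]

xs = [[1.0], [0.0], [0.0]]   # one input pulse followed by silence
states = run_rnn(xs, [0.0, 0.0], W_hh, W_xh, b)
print(states)                # the pulse fades a little more at every step
```

The fading of the first input across later hidden states is a small-scale view of why vanilla RNNs struggle with long-range dependencies.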

Vanishing Gradient in RNNs

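For a sequence of length 100, the gradient of the loss with respect to the hidden state at step 1 involves 99 repeated multiplications by $W_{hh}$ (and by the activation derivative). Roughly speaking, if the largest eigenvalue magnitude of $W_{hh}$ is below 1, gradients shrink exponentially and vanish; if it is above 1, they can grow exponentially and explode.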

ELI5: Playing “telephone” with 50 people — by the time the message reaches the end, the original meaning is completely lost. The RNN’s gradient is the message traveling backwards. By the time it reaches step 1, it contains almost nothing useful. So early steps don’t learn from events that happened long ago.

LSTM (Long Short-Term Memory)

LSTM introduces a cell state $C_t$ — a separate memory highway that runs alongside the hidden state, protected from the repeated multiplication that causes vanishing gradients.

The gates and state updates:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)}$$

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)}$$

$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \quad \text{(candidate values)}$$

$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t \quad \text{(cell state update)}$$

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)}$$

$$h_t = o_t \cdot \tanh(C_t) \quad \text{(hidden state)}$$

LSTM Cell — All Gates:

                            ┌──────────────── C_{t-1} (cell state in) ──────────────┐
                            │                                                         │
  h_{t-1} ──┐              ▼                                                         ▼
             ├──► [Forget Gate f_t] ──►  × ──────────────────────────── + ──► C_t ──► tanh ──► ×──► h_t
  x_t ──────┘              │            ↑                                ↑            ↑
                            │       ┌───┘                          ┌────┘         ────┘
                            │  [Input Gate i_t] ──► × ──┘     [Output Gate o_t]
                            │  [Candidate Ĉ_t]  ──┘
                            │
  Gate summary:
    Forget gate:  "How much of old cell state to keep?" (0=forget all, 1=keep all)
    Input gate:   "How much of new candidate to write?"
    Output gate:  "How much of cell state to expose as hidden state?"
    Cell state:   The long-term memory highway

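Why LSTM mitigates vanishing gradients: the cell state update $C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t$ is additive rather than a repeated matrix multiplication followed by a squashing function. Along the cell state path, the gradient from $C_t$ back to $C_{t-1}$ is scaled only by the forget gate $f_t$, so when $f_t$ stays close to 1, gradients flow nearly unchanged across many time steps.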

ELI5: LSTM is a person with a notebook. The forget gate crosses things out. The input gate writes new things. The output gate decides what to read aloud. The notebook itself is the long-term memory that persists. The key insight: information in the notebook can survive for hundreds of time steps without being distorted — unlike the “telephone game” of vanilla RNNs.
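The LSTM equations can be followed step by step in a scalar sketch (hidden size 1, input size 1). The parameter layout `params[name] = (w_h, w_x, bias)` and the specific values are assumptions for illustration: the forget gate is biased open and the input gate biased shut, so the cell state should persist.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x_t, params):
    """One LSTM step for scalar h, c, and x. params[name] = (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))              # forget gate
    i_t = sigmoid(pre("i"))              # input gate
    c_tilde = math.tanh(pre("c"))        # candidate values
    c_t = f_t * c_prev + i_t * c_tilde   # additive cell state update
    o_t = sigmoid(pre("o"))              # output gate
    h_t = o_t * math.tanh(c_t)           # hidden state
    return h_t, c_t

# Hypothetical parameters: forget gate near 1, input gate near 0.
params = {
    "f": (0.0, 0.0, 10.0),
    "i": (0.0, 0.0, -10.0),
    "c": (0.0, 0.0, 0.0),
    "o": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(h, c, x_t=0.0, params=params)
print(h, c)   # c is still close to its initial value of 1.0 after 100 steps
```

Compare this with a vanilla RNN, where a value passed through 100 tanh-and-multiply steps with weights below 1 would have decayed to almost nothing.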

GRU (Gated Recurrent Unit)

GRU simplifies LSTM into two gates: reset gate ($r_t$) and update gate ($z_t$).

$$z_t = \sigma(W_z [h_{t-1}, x_t]) \quad \text{(update gate)}$$

$$r_t = \sigma(W_r [h_{t-1}, x_t]) \quad \text{(reset gate)}$$

$$\tilde{h}_t = \tanh(W [r_t \cdot h_{t-1}, x_t]) \quad \text{(candidate hidden state)}$$

$$h_t = (1-z_t) \cdot h_{t-1} + z_t \cdot \tilde{h}_t$$

No separate cell state — just one hidden state with gating.
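A scalar sketch of the GRU equations above, again with hidden size 1 and input size 1. The parameter layout `params[name] = (w_h, w_x)` is an assumption for illustration, and biases are omitted to match the equations as written.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h_prev, x_t, params):
    """One GRU step for scalar h and x. params[name] = (w_h, w_x)."""
    wz_h, wz_x = params["z"]
    wr_h, wr_x = params["r"]
    w_h, w_x = params["h"]

    z_t = sigmoid(wz_h * h_prev + wz_x * x_t)             # update gate
    r_t = sigmoid(wr_h * h_prev + wr_x * x_t)             # reset gate
    h_tilde = math.tanh(w_h * (r_t * h_prev) + w_x * x_t) # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# With all weights at zero, both gates sit at 0.5 and the candidate is 0,
# so the new state is exactly half the old one.
zeros = {"z": (0.0, 0.0), "r": (0.0, 0.0), "h": (0.0, 0.0)}
half = gru_step(0.8, 1.0, zeros)

# A strongly negative update-gate weight drives z_t toward 0: keep the old state.
keep = {"z": (0.0, -20.0), "r": (0.0, 0.0), "h": (0.0, 0.0)}
kept = gru_step(0.8, 1.0, keep)
print(half, kept)
```

The update gate plays the role of LSTM's forget and input gates at once: whatever fraction of the old state is dropped is replaced by the candidate.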

RNN vs LSTM vs GRU Comparison

| Aspect | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | None | 3 gates + cell state | 2 gates |
| Parameters | Fewest | Most | Fewer than LSTM |
| Long-term memory | Poor | Excellent | Good |
| Training speed | Fastest | Slowest | Middle |
| Vanishing gradient | Yes | Solved | Solved |
| When to use | Toy examples | Long sequences, complex patterns | Smaller datasets, faster training |

Decision rule: Start with GRU for faster iteration. Switch to LSTM if sequence dependencies are very long (> 100 steps) or accuracy matters more than speed.

Bidirectional RNNs

Bidirectional RNN:

  x₁       x₂       x₃       x₄
   │        │        │        │
   ▼        ▼        ▼        ▼
  [→]─────►[→]─────►[→]─────►[→]   (forward pass)
   ▲        ▲        ▲        ▲
  [←]◄─────[←]◄─────[←]◄─────[←]   (backward pass)
   │        │        │        │
   ▼        ▼        ▼        ▼
 concat   concat   concat   concat
   │        │        │        │
  y₁       y₂       y₃       y₄

Each output sees BOTH past AND future context.

The output at each time step is the concatenation of forward and backward hidden states. Essential for tasks that need future context: NER (“Bank” is a company if followed by “Inc.”), speech recognition, machine translation encoder.


Transformers & Attention

The Revolution

Transformers (2017, “Attention Is All You Need”) discarded recurrence entirely. Instead, every token attends directly to every other token via the attention mechanism. Key advantages over RNNs:

  1. Fully parallelizable — no sequential dependency
  2. Better long-range dependencies — direct path between any two tokens
  3. No vanishing gradient — attention creates direct connections

Self-Attention Mechanism

For each token, compute three vectors: Query (Q), Key (K), Value (V) from learned weight matrices.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Interpretation:

  • $QK^T$: dot product of query with all keys → measures relevance of each token to the current token
  • $\sqrt{d_k}$: scaling to prevent dot products from growing too large (which would saturate softmax)
  • Softmax: turn relevance scores into probabilities (attention weights)
  • Multiply by $V$: weighted sum of value vectors — the output is a blend of all tokens, weighted by relevance
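The four steps above map onto a few lines of code. This is a plain-Python sketch of scaled dot-product attention for small lists of vectors; the function names and the toy Q, K, V values are assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query at a time."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted blend of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Toy example: one query that matches the first key better than the second.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [4.0, 0.0]]
out, attn = attention(Q, K, V)
print(attn, out)   # more weight on the first token, so out[0][0] lies nearer 2.0 than 4.0
```

Each row of attention weights sums to 1, so every output vector is a weighted average of the value vectors.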

ELI5: In the sentence “The cat sat on the mat because it was tired”, self-attention helps the model understand that “it” refers to “cat” by computing a relevance score between “it” and every other word in the sentence. “it” has a high score with “cat” and low scores with “mat.” The output representation of “it” is then a weighted blend of all word representations — heavily blending in “cat.” This is how meaning is captured from context.

Multi-Head Attention

Run self-attention $h$ times in parallel with different learned weight matrices, then concatenate:

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

Each head can focus on different types of relationships (syntactic, semantic, coreference) simultaneously.

Positional Encoding

Since there is no recurrence, the Transformer has no inherent sense of position. Positional encodings (sinusoidal functions or learned embeddings) are added to token embeddings to encode order.
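The sinusoidal variant can be sketched as below. It assumes the formulation from the original Transformer paper, where even dimensions use $\sin(pos / 10000^{2i/d})$ and odd dimensions use the matching cosine; the function name is illustrative.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=3, d_model=4)
print(pe[0])   # [0.0, 1.0, 0.0, 1.0] at position 0
print(pe[1])   # position 1 gets a different pattern
```

Each position receives a distinct vector, and these vectors are added element-wise to the token embeddings before the first attention layer.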

Encoder-Decoder Architecture

Transformer Encoder-Decoder:

  "Je suis étudiant"         "I am a student"
       │                           │
  ┌────▼────┐                 ┌────▼────┐
  │ Encoder │                 │ Decoder │
  │         │ ─── context ──► │         │
  │ Self-   │    (K, V from   │ Masked  │
  │ Attention│   encoder)     │ Self-Att│
  │         │                 │ Cross-  │
  │ Feed-   │                 │ Attention│
  │ Forward │                 │ Feed-   │
  └─────────┘                 │ Forward │
                              └─────────┘
                                   │
                              Output tokens
                              (autoregressive)

Encoder: reads the entire input simultaneously, produces rich contextual representations (used in BERT)

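Decoder: generates output one token at a time, attending to its own previous outputs through masked self-attention and, in encoder-decoder models such as machine translation, to the encoder context through cross-attention. Decoder-only models such as GPT keep the masked self-attention and drop the cross-attention.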

BERT vs GPT

| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Training | Masked language modeling (bidirectional) | Next token prediction (left-to-right) |
| Strength | Understanding tasks (classification, NER) | Generation tasks (text, code, chat) |
| Context | Sees full sequence bidirectionally | Sees only past tokens |

RNN vs Transformer Comparison

RNN Processing (sequential — must wait):
  x₁ ──► h₁ ──► h₂ ──► h₃ ──► ... ──► hₙ

Transformer Processing (parallel — all at once):
  x₁ ──────────────────────────────►
  x₂ ──── Self-Attention (all tokens simultaneously) ────►
  x₃ ──────────────────────────────►
  xₙ ──────────────────────────────►

RNN processes tokens one at a time → slow for long sequences, hard to parallelize. Transformer processes all tokens simultaneously → much faster training on modern hardware (GPU/TPU).


Autoencoders

Architecture

Autoencoder:

  Input x        Bottleneck z         Reconstructed x̂
  (high-dim)     (low-dim)            (high-dim)

  ┌───────┐    ┌──────┐    ┌───────┐
  │ 784   │    │  32  │    │ 784   │
  │ dims  │──► │ dims │──► │ dims  │
  │       │    │      │    │       │
  │Encoder│    │Latent│    │Decoder│
  └───────┘    │Space │    └───────┘
               └──────┘

  Loss = reconstruction error: ||x̂ - x||²
  Training signal: how well can you reconstruct x from a tiny z?

The bottleneck forces the encoder to learn a compressed representation, and the decoder learns to reconstruct the input from it. The reconstruction is mostly a training signal: the learned representation in the bottleneck is usually the output you actually want.

ELI5: An autoencoder is like summarizing a book (encoder) then reconstructing it from the summary (decoder). The goal isn’t the reconstruction — it’s the summary itself. That compressed summary is the “embedding” or “latent representation” that captures the essence of the data. A good summary can reconstruct a good approximation of the original.

Use cases:

  • Dimensionality reduction: non-linear alternative to PCA
  • Anomaly detection: reconstruct normal data well; anomalies have high reconstruction error
  • Denoising: train with noisy input → clean output; forces learning of clean structure
  • Pre-training: learn representations before fine-tuning on a downstream task

Variational Autoencoders (VAE)

Instead of encoding to a fixed point $z$, VAE encodes to a distribution: mean $\mu$ and variance $\sigma^2$. Samples from this distribution become the decoder input.

Loss = reconstruction error + KL divergence (keeps the latent space smooth and continuous)

Since the latent space is smooth, you can interpolate between points and generate new data by sampling from the latent space. VAEs are generative models.
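The two VAE-specific pieces, sampling from the encoded distribution and the KL term, can be sketched as follows. This assumes the common setup of a diagonal Gaussian encoder output compared against a standard normal prior, with the encoder predicting log-variance; the function names are illustrative, and the sampling form $z = \mu + \sigma \epsilon$ is the usual way to keep the sample differentiable.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

rng = random.Random(0)
z = sample_latent([0.0, 1.0], [0.0, 0.0], rng)

kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])   # already standard normal
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
print(z, kl_zero, kl_shifted)
```

The KL term is zero only when the encoded distribution already matches the prior and grows as the mean drifts away or the variance departs from 1, which is what keeps the latent space compact and smooth.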


Generative Adversarial Networks (GANs)

Architecture

Two networks compete in a minimax game:

GAN Training Loop:

  Real data ──────────────────────────┐
                                       │
  Random noise z                       ▼
       │              ┌───────────────────────────┐
       ▼              │     Discriminator D        │
  ┌─────────┐         │  "Is this real or fake?"  │──► Real/Fake score
  │Generator│──fake──►│                           │
  │    G    │         └───────────────────────────┘
  └─────────┘                 │
       ↑                      │ gradient
       └──────── loss ◄────── ┘
         (fool D)

Generator ($G$): takes random noise $z$, outputs a fake sample. Trained to maximize $D$’s classification error (make fakes look real).

Discriminator ($D$): takes real or fake samples, outputs probability of being real. Trained to correctly distinguish real from fake.

Training objective:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Nash equilibrium: when $G$ produces perfect fakes, $D$ can only guess randomly (probability 0.5). At this point, training has converged.
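The objective can be evaluated directly from discriminator outputs. The sketch below estimates both expectations by averaging over small batches of made-up scores (the function name and the scores are assumptions) and shows the value at equilibrium, where $D$ outputs 0.5 everywhere.

```python
import math

def gan_value(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))]."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# At equilibrium D cannot tell real from fake and outputs 0.5 for everything.
equilibrium = gan_value([0.5] * 4, [0.5] * 4)   # log(0.5) + log(0.5) = -2 log 2

# A discriminator that is winning scores real samples high and fakes low.
winning_d = gan_value([0.9] * 4, [0.1] * 4)
print(equilibrium, winning_d)
```

$D$ tries to push this value up, $G$ tries to push it down, and $-2 \log 2 \approx -1.386$ is where the two settle when the fakes are perfect.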

Mode collapse: $G$ finds one type of output that consistently fools $D$ and gets stuck producing only that type. Solutions: Wasserstein GAN, minibatch discrimination, historical averaging.

ELI5: A counterfeiter (generator) tries to make fake money. A detective (discriminator) tries to spot fakes. They both get better through competition — the counterfeiter learns from every time they get caught, the detective learns from every fake that slipped through. Eventually the fakes are indistinguishable from real bills. GANs use this adversarial competition to generate extremely realistic data.

Use cases: image generation, data augmentation (generate more training examples), style transfer, image-to-image translation (pix2pix), super-resolution.


Practical Deep Learning on AWS

GPU Instance Types

| Family | GPU | Best For |
|---|---|---|
| P3 | NVIDIA V100 | Standard DL training |
| P4d | NVIDIA A100 | Large model training |
| P5 | NVIDIA H100 | Transformer / LLM training |
| G4dn | NVIDIA T4 | Cost-effective inference + training |
| G5 | NVIDIA A10G | Inference, vision models |
| Inf1/Inf2 | AWS Inferentia | Inference only (custom chip, very cheap) |

Distributed Training Strategies

Data Parallelism (most common):

  GPU 1: full model copy ──► batch 1 ──► gradient 1 ──┐
  GPU 2: full model copy ──► batch 2 ──► gradient 2 ──┼──► average gradients ──► update weights
  GPU 3: full model copy ──► batch 3 ──► gradient 3 ──┤
  GPU 4: full model copy ──► batch 4 ──► gradient 4 ──┘

Model Parallelism (for huge models):

  GPU 1: layers 1-10 ──► activations ──► GPU 2: layers 11-20 ──► GPU 3: layers 21-30
  (model too big to fit on one GPU → split across GPUs)
  • SageMaker Data Parallelism library: optimized AllReduce implementation for multi-GPU/multi-node
  • SageMaker Model Parallelism library: automatic pipeline parallelism for models > GPU memory
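The core of data parallelism, averaging per-GPU gradients before a single shared update, can be sketched without any distributed library. The function names, gradient values, and learning rate below are assumptions for illustration; real systems perform the averaging with an AllReduce across devices.

```python
def average_gradients(per_worker_grads):
    """Element-wise mean of the gradient vectors computed on each worker."""
    n = len(per_worker_grads)
    return [sum(values) / n for values in zip(*per_worker_grads)]

def sgd_step(weights, grads, lr):
    """Every worker applies the same averaged gradient, keeping model copies in sync."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Hypothetical gradients from 4 GPUs, each computed on a different mini-batch.
worker_grads = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 8.0],
]
avg = average_gradients(worker_grads)            # [4.0, 5.0]
new_weights = sgd_step([0.5, 0.5], avg, lr=0.1)
print(avg, new_weights)
```

Because every copy starts from the same weights and applies the same averaged gradient, the result matches training on one large batch made of all four mini-batches.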

Cost Optimization

  • Managed Spot Training: use EC2 Spot instances (up to 90% cheaper than On-Demand); SageMaker handles interruptions and, when checkpointing is configured, resumes from the latest checkpoint
  • Checkpointing: save model state to S3 periodically; critical for spot training since instances can be reclaimed
  • SageMaker Training Compiler: optimizes DL training code (up to 50% speedup, up to 50% cost reduction) by optimizing GPU utilization — works with TensorFlow and PyTorch

Training Observability

  • SageMaker Debugger: captures tensors during training, monitors for issues (vanishing/exploding gradients, overfitting, poor weight initialization) and can auto-stop unhealthy training jobs
  • CloudWatch: training metrics (loss, accuracy) per epoch
  • SageMaker Experiments: track hyperparameters, metrics, artifacts across training runs

Architecture Selection Guide

Why this matters for the exam: Every scenario question about model design requires mapping problem type → architecture. Memorize this mapping.

| Problem Type | Data | Architecture | Key Reason |
|---|---|---|---|
| Image classification | 2D images | CNN (ResNet/EfficientNet) | Spatial feature hierarchy |
| Object detection | 2D images | CNN (YOLO/SSD) | Bounding box regression |
| Pixel segmentation | 2D images | FCN/U-Net | Dense prediction |
| Sequence classification | Text, time series | LSTM/GRU or 1D CNN | Temporal patterns |
| Text understanding | Documents | BERT (Transformer encoder) | Bidirectional context |
| Text generation | Language | GPT (Transformer decoder) | Autoregressive |
| Translation | Sequence pairs | Transformer (enc-dec) | Long-range alignment |
| Anomaly detection | Any | Autoencoder | Reconstruction error |
| Data generation | Any modality | GAN or VAE | Generative modeling |
| Tabular data | Structured | XGBoost (not deep learning) | Gradient boosting dominates |
| Multi-series forecast | Time series | DeepAR (LSTM) | Cross-series patterns |

Exam tip: For tabular/structured data, XGBoost typically outperforms deep learning and is cheaper to train. Deep learning advantages kick in for unstructured data (images, text, audio) where feature engineering is impractical.


Quick Reference

Deep Learning Architecture Decision Tree:

  What is your data type?
  │
  ├─ Images / Video ──────────────────► CNN
  │   ├─ Classify whole image?         → ResNet + transfer learning
  │   ├─ Find objects with boxes?      → SSD / Faster R-CNN
  │   └─ Label every pixel?            → U-Net / DeepLab
  │
  ├─ Sequential / Time series ────────► RNN family
  │   ├─ Short sequences?              → GRU (faster)
  │   ├─ Long sequences?               → LSTM (better memory)
  │   ├─ Bidirectional context?        → Bidirectional LSTM
  │   └─ Multiple related series?      → DeepAR
  │
  ├─ Text / Language ─────────────────► Transformers
  │   ├─ Understand / classify text?   → BERT (encoder)
  │   ├─ Generate text?                → GPT (decoder)
  │   └─ Translate / summarize?        → Encoder-Decoder Transformer
  │
  ├─ Generate new data ───────────────► Generative models
  │   ├─ Realistic samples?            → GAN
  │   └─ Smooth latent space?          → VAE
  │
  └─ Compress / detect anomalies ─────► Autoencoder