Domain 2D: Generative AI & Amazon Bedrock
Generative AI & Amazon Bedrock
Exam Domain: 2 — ML Model Development (26%) Task: Understand foundation models, transformers, and Bedrock services
Transformer Architecture
The architecture behind all modern LLMs (GPT, Claude, Llama, etc.)
┌─────────────────────────────────────────────────────────┐
│ TRANSFORMER │
│ │
│ Input: "The cat sat on the" │
│ ↓ │
│ ┌─────────────────┐ │
│ │ Tokenization │ "The" "cat" "sat" "on" "the" │
│ └────────┬────────┘ │
│ ↓ │
│ ┌─────────────────┐ │
│ │ Embeddings │ Each token → dense vector │
│ └────────┬────────┘ │
│ ↓ │
│ ┌─────────────────┐ │
│ │ Positional │ Add position info to embeddings │
│ │ Encoding │ (sin/cos or learned) │
│ └────────┬────────┘ │
│ ↓ │
│ ┌─────────────────┐ │
│ │ Self-Attention │ Each token attends to all others │
│ │ (Multi-Head) │ "How relevant is each word to │
│ │ │ every other word?" │
│ └────────┬────────┘ │
│ ↓ │
│ ┌─────────────────┐ │
│ │ Feed-Forward │ Process each position │
│ │ Network │ │
│ └────────┬────────┘ │
│ ↓ │
│ (repeat N layers) │
│ ↓ │
│ Output: "mat" (next token prediction) │
└─────────────────────────────────────────────────────────┘
ELI5: The transformer revolutionized AI with one key insight: “self-attention.” Instead of reading a sentence word by word (like older RNNs), the transformer looks at all words simultaneously and figures out which words relate to each other. In “The animal didn’t cross the street because it was too tired,” a transformer immediately knows “it” refers to “animal” — not “street” — because it can see both words at once. It’s like reading a whole page at a time instead of one word.
Self-Attention Mechanism
For each token, compute:
Q (Query) = What am I looking for?
K (Key) = What do I contain?
V (Value) = What information do I provide?
Attention(Q,K,V) = softmax(QK^T / √d_k) × V
Multi-Head = run multiple attention computations in parallel
(each head learns different relationships)
ELI5: Self-attention is like being at a noisy party. For every word (token) in the sentence, you check how relevant every other word is to understanding it. “Bank” in “river bank” gets high attention scores toward “river” and low scores toward “money.” The Q/K/V math is just a way of computing those relevance scores efficiently — Q asks the question, K holds the answers, V holds the actual information to pass forward.
Transformer Variants
| Type | Architecture | Examples | Use Case |
|---|---|---|---|
| Encoder-only | Processes full input bidirectionally | BERT | Classification, NER, sentiment |
| Decoder-only | Generates text auto-regressively | GPT, Claude, Llama | Text generation, chat, code |
| Encoder-Decoder | Encodes input, decodes output | T5, BART | Translation, summarization |
ELI5: BERT reads a sentence in both directions simultaneously (great for understanding). GPT/Claude only read left-to-right, one token at a time (great for generating). T5/BART do both — read the input fully, then generate the output. For the exam: decoder-only = text generation, encoder-only = understanding tasks, encoder-decoder = transformation tasks (translate, summarize).
LLM Key Concepts
Tokens & Tokenization
"Hello, how are you?" → ["Hello", ",", " how", " are", " you", "?"]
6 tokens
Rule of thumb: ~1 token ≈ 4 characters ≈ 0.75 words
Key LLM Parameters
| Parameter | Controls | Range |
|---|---|---|
| Temperature | Randomness/creativity | 0 (deterministic) → 1+ (creative) |
| Top-p (nucleus) | Cumulative probability cutoff | 0.0 → 1.0 |
| Top-k | Number of top tokens to consider | 1 → vocabulary size |
| Max tokens | Maximum output length | Depends on model |
| Stop sequences | When to stop generating | Custom strings |
Temperature effect:
temp=0.0: "The capital of France is Paris." (always same answer)
temp=0.5: "The capital of France is Paris, a beautiful city."
temp=1.0: "The capital of France is Paris, known for its vibrant culture..."
temp=1.5: "The glorious heart of France beats in Paris, where..."
ELI5: Temperature is the creativity knob. Turn it to 0 for factual, deterministic answers (“What is 2+2?” should always be 4). Turn it up for creative writing or brainstorming. Top-p and Top-k limit which words the model even considers at each step — like giving it a smaller vocabulary to choose from. Low top-k means the model sticks to the most likely words; high top-k lets it explore weirder options.
Embeddings
- Dense vector representations of tokens/text
- Capture semantic meaning — similar meanings → similar vectors
- Used for: semantic search, RAG, clustering, similarity
- Bedrock provides embedding models (Titan Embeddings, Cohere Embed)
Fine-Tuning & Transfer Learning
Transfer Learning
Pre-trained Model (ImageNet, GPT, etc.)
↓
Freeze early layers (keep general knowledge)
↓
Replace/add final layers for your task
↓
Train on YOUR data with small learning rate
↓
Fine-tuned Model (your domain)
Fine-Tuning Approaches
| Approach | What Changes | Data Needed | Cost |
|---|---|---|---|
| Full fine-tuning | All weights | Large dataset | Expensive |
| Feature extraction | Only final layers | Small dataset | Cheap |
| LoRA (Low-Rank Adaptation) | Small adapter matrices | Small-medium | Medium |
| Continuous pre-training | Extend base knowledge | Domain corpus | Expensive |
| Prompt tuning | Learned prompt embeddings | Small dataset | Cheap |
ELI5: Full fine-tuning is like repainting an entire house — effective but expensive. LoRA is like just touching up the trim — you add small adapter matrices alongside the original weights and only train those, so it’s much cheaper while still teaching the model new behavior. Prompt tuning is like putting a welcome mat at the door — minimal change (just a few learned tokens prepended to every prompt), yet surprisingly effective for steering model behavior.
SageMaker JumpStart
- Model hub with 400+ pre-trained foundation models
- One-click deploy and fine-tune
- Models from: Hugging Face, Meta (Llama), Stability AI, AI21, Cohere
- Provides: notebooks, training scripts, inference code
- Use case: quick start with foundation models without Bedrock
Amazon Bedrock
Fully managed service for accessing foundation models via API.
Architecture

┌───────────────────────────────────────────────────────┐
│ Amazon Bedrock │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Foundation Models (FMs) │ │
│ │ ├─ Amazon Nova (Micro/Lite/Pro/Premier) │ │
│ │ ├─ Amazon Titan (Text, Embeddings V2, Image) │ │
│ │ ├─ Anthropic Claude (Haiku/Sonnet/Opus) │ │
│ │ ├─ Meta Llama (open weights, edge + cloud) │ │
│ │ ├─ Cohere (Command R+, Embed v4 — RAG focus) │ │
│ │ ├─ AI21 Labs Jamba (256K context, JSON native) │ │
│ │ ├─ Stability AI (SDXL — image generation) │ │
│ │ └─ Mistral AI (MoE — cost-efficient inference) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐ │
│ │Knowledge │ │ Agents │ │Guardrails│ │ Model │ │
│ │ Bases │ │ │ │ │ │ Eval │ │
│ │ (RAG) │ │ (actions)│ │ (safety) │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └─────────┘ │
│ │
│ Data Privacy: Your data is NEVER used to train │
│ base models. Data stays in your AWS account. │
└───────────────────────────────────────────────────────┘
Bedrock Model Families (Exam Focus)
| Family | Key Models | Differentiator | Exam Focus |
|---|---|---|---|
| Amazon Nova | Micro, Lite, Pro, Premier | AWS-native, fine-tunable, watermarking | Model selection, distillation |
| Amazon Titan | Embeddings V2, Image G1 | 100+ language embeddings | RAG embedding selection |
| Anthropic Claude | Haiku, Sonnet, Opus | Best reasoning, 200K-1M context | Agent orchestration, tool use |
| Meta Llama | 3.1-3.3 (1B-405B) | Open weights, edge deploy | Fine-tuning, on-prem options |
| Mistral | 7B, Mixtral, Large | MoE (sparse activation = lower cost) | Cost optimization |
| Cohere | Command R+, Embed v4 | RAG specialist, multimodal embed | RAG pipeline, embedding choice |
| AI21 Jamba | 1.5, 2 | 256K context (longest), native JSON | Long-context, structured output |
| Stability AI | SDXL, SD3.5 | Image generation | Image generation params |
Not all models support fine-tuning on Bedrock — Nova (Micro/Lite/Pro), Llama, Mistral do. Open-weight models (Llama, Mistral) can also deploy via SageMaker for more control.
Bedrock Pricing Models
| Model | How It Works | Best For | Exam Signal |
|---|---|---|---|
| On-Demand | Pay per input/output token | Variable, unpredictable workloads | “bursty”, “testing” |
| Provisioned Throughput | Reserve model capacity | Consistent, high-volume workloads | “steady”, “SLA” |
| Batch Inference | Process large batches (50% discount) | Offline processing | “batch”, “non-real-time” |
| Intelligent Prompt Routing | Auto-select model size per query | Cost optimization | “optimize cost”, “30% savings” |
Additional Bedrock Features
| Feature | Purpose |
|---|---|
| Intelligent Prompt Routing | Auto-route between model sizes per query (up to 30% savings) |
| Model Distillation | Use large teacher model to train smaller student model |
| Watermarking | Detect AI-generated content (Nova models) |
Bedrock Custom Models
| Approach | What It Does | Data Needed |
|---|---|---|
| Continued Pre-Training | Teach new domain knowledge | Large unlabeled corpus |
| Fine-Tuning | Teach specific task format | Labeled prompt-completion pairs |
RAG (Retrieval-Augmented Generation)
Why RAG?
Problem: LLMs have knowledge cutoff, hallucinate, lack your private data
Solution: Retrieve relevant docs → inject into prompt → generate grounded answer
Without RAG: "What is our refund policy?" → Hallucinated answer
With RAG: "What is our refund policy?" → Retrieves policy doc → Accurate answer
ELI5: RAG solves the biggest LLM problem: hallucination. Instead of asking the model “What’s our refund policy?” and getting a confident but made-up answer, RAG first searches your actual documents, finds the relevant section, then hands it to the model along with the question. The model generates an answer grounded in real text — with citations. It’s like giving a student an open-book test instead of asking them to guess from memory.
Bedrock Knowledge Bases (Managed RAG)
Setup:
1. Data Sources (S3)
↓
2. Chunking (split documents into pieces)
↓
3. Embedding (convert chunks to vectors)
↓
4. Vector Store (index for similarity search)
↓
Ready for queries!
Query Flow:
User question
↓
Embed question → Search vector store → Top-K relevant chunks
↓
Augmented prompt = System prompt + Retrieved chunks + User question
↓
Foundation model generates grounded answer
↓
Response with source citations
Supported Vector Stores
| Store | Type | Best For |
|---|---|---|
| OpenSearch Serverless | AWS managed | Default, easy setup |
| Aurora PostgreSQL (pgvector) | AWS managed | Already using Aurora |
| Pinecone | Third-party | Dedicated vector DB |
| Redis Enterprise | Third-party | Low-latency requirements |
| MongoDB Atlas | Third-party | Already using MongoDB |
Chunking Strategies
| Strategy | How It Works |
|---|---|
| Fixed-size | Split every N tokens (default) |
| Semantic | Split by meaning/topic boundaries |
| Hierarchical | Parent-child chunks (context + detail) |
| No chunking | Treat each file as one chunk |
Bedrock Agents
Automate multi-step tasks by combining LLM reasoning with actions.

User: "Book me a flight from NYC to London next Friday under $500"
Agent reasoning:
1. Search flights (call flight API via Action Group)
2. Filter by price < $500
3. Check availability for next Friday
4. Book the cheapest option (call booking API)
5. Return confirmation to user
┌───────────────────────────────────────────────┐
│ Bedrock Agent │
│ │
│ ┌────────────┐ ┌────────────────────────┐ │
│ │ FM │ │ Action Groups │ │
│ │ (reasoning │←→ │ ├─ Lambda functions │ │
│ │ & planning│ │ ├─ API schemas │ │
│ │ ) │ │ └─ Return results │ │
│ └────────────┘ └────────────────────────┘ │
│ ↕ │
│ ┌────────────┐ ┌────────────────────────┐ │
│ │ Knowledge │ │ Guardrails │ │
│ │ Bases │ │ (safety filters) │ │
│ └────────────┘ └────────────────────────┘ │
└───────────────────────────────────────────────┘
- Action Groups: Lambda functions the agent can invoke
- Knowledge Bases: RAG for the agent’s domain knowledge
- Session memory: Retains context across turns
- Guardrails: Apply content safety to agent responses
Bedrock Guardrails
Content safety framework applied to FM inputs and outputs.
| Filter Type | What It Blocks |
|---|---|
| Content filters | Hate, insults, sexual, violence, misconduct, prompt attacks (6 categories, adjustable severity) |
| Denied topics | Custom topics to avoid (e.g., “competitor pricing”) |
| Word filters | Block specific words/phrases |
| Sensitive info (PII) | Detect and redact PII (SSN, email, phone, etc.) |
| Contextual grounding | Block hallucinated or irrelevant responses |
Guardrails Flow:
User Input → [Input Guardrail] → FM → [Output Guardrail] → Response
↓ ↓
Block/modify Block/modify
if policy if policy
violated violated
ELI5: Guardrails are like parental controls for your AI — they filter what goes in and what comes out. You can block topics (“never discuss competitors”), automatically redact PII before it reaches the model or leaves the response, and prevent toxic content. Unlike prompt engineering (asking the model nicely to behave), Guardrails enforce rules at the infrastructure level — the model doesn’t even see the blocked content.
Bedrock Model Evaluation
| Evaluation Type | How It Works |
|---|---|
| Automatic | Built-in metrics (accuracy, robustness, toxicity) on benchmark datasets |
| Human | Subject matter experts rate model outputs |
| Custom | Your own evaluation criteria and datasets |
Quick Reference: Bedrock vs SageMaker JumpStart
| Feature | Bedrock | SageMaker JumpStart |
|---|---|---|
| Hosting | Fully managed API | You manage endpoint |
| Models | Curated FMs (Titan, Claude, Llama) | 400+ open-source models |
| Customization | Fine-tune, continued pre-training | Full training control |
| RAG | Knowledge Bases (managed) | Build your own |
| Agents | Managed agents | Build your own |
| Cost model | Per token | Per instance hour |
| Best for | GenAI applications | Custom ML + GenAI experiments |
Exam tip: If the question is about GenAI applications (chatbots, RAG, content generation) → Bedrock. If it’s about custom model training/deployment → SageMaker.
ELI5: Bedrock is a vending machine — pick a model, call the API, pay per token, done. SageMaker JumpStart is a workshop — pick a model, customize everything, manage your own deployment, pay per instance hour. Bedrock requires zero infrastructure knowledge; JumpStart gives you full control. For the exam, “build a GenAI app quickly” = Bedrock; “fine-tune and self-host” = JumpStart.