Domain 2D: Generative AI & Amazon Bedrock

10 min read 2113 words

Generative AI & Amazon Bedrock

Exam Domain: 2 — ML Model Development (26%) Task: Understand foundation models, transformers, and Bedrock services

Transformer Architecture

The architecture behind all modern LLMs (GPT, Claude, Llama, etc.)

┌─────────────────────────────────────────────────────────┐
│                  TRANSFORMER                             │
│                                                         │
│  Input: "The cat sat on the"                            │
│      ↓                                                  │
│  ┌─────────────────┐                                    │
│  │  Tokenization   │  "The" "cat" "sat" "on" "the"     │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Embeddings     │  Each token → dense vector         │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Positional     │  Add position info to embeddings   │
│  │  Encoding       │  (sin/cos or learned)              │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Self-Attention  │  Each token attends to all others │
│  │  (Multi-Head)    │  "How relevant is each word to    │
│  │                  │   every other word?"               │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Feed-Forward   │  Process each position             │
│  │  Network        │                                    │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  (repeat N layers)                                      │
│           ↓                                             │
│  Output: "mat"  (next token prediction)                 │
└─────────────────────────────────────────────────────────┘

ELI5: The transformer revolutionized AI with one key insight: “self-attention.” Instead of reading a sentence word by word (like older RNNs), the transformer looks at all words simultaneously and figures out which words relate to each other. In “The animal didn’t cross the street because it was too tired,” a transformer immediately knows “it” refers to “animal” — not “street” — because it can see both words at once. It’s like reading a whole page at a time instead of one word.

Self-Attention Mechanism

For each token, compute:
  Q (Query)  = What am I looking for?
  K (Key)    = What do I contain?
  V (Value)  = What information do I provide?

Attention(Q,K,V) = softmax(QK^T / √d_k) × V

Multi-Head = run multiple attention computations in parallel
             (each head learns different relationships)

ELI5: Self-attention is like being at a noisy party. For every word (token) in the sentence, you check how relevant every other word is to understanding it. “Bank” in “river bank” gets high attention scores toward “river” and low scores toward “money.” The Q/K/V math is just a way of computing those relevance scores efficiently — Q asks the question, K holds the answers, V holds the actual information to pass forward.

Transformer Variants

Type	Architecture	Examples	Use Case
Encoder-only	Processes full input bidirectionally	BERT	Classification, NER, sentiment
Decoder-only	Generates text auto-regressively	GPT, Claude, Llama	Text generation, chat, code
Encoder-Decoder	Encodes input, decodes output	T5, BART	Translation, summarization

ELI5: BERT reads a sentence in both directions simultaneously (great for understanding). GPT/Claude only read left-to-right, one token at a time (great for generating). T5/BART do both — read the input fully, then generate the output. For the exam: decoder-only = text generation, encoder-only = understanding tasks, encoder-decoder = transformation tasks (translate, summarize).

LLM Key Concepts

Tokens & Tokenization

"Hello, how are you?" → ["Hello", ",", " how", " are", " you", "?"]
                         6 tokens

Rule of thumb: ~1 token ≈ 4 characters ≈ 0.75 words

Key LLM Parameters

Parameter	Controls	Range
Temperature	Randomness/creativity	0 (deterministic) → 1+ (creative)
Top-p (nucleus)	Cumulative probability cutoff	0.0 → 1.0
Top-k	Number of top tokens to consider	1 → vocabulary size
Max tokens	Maximum output length	Depends on model
Stop sequences	When to stop generating	Custom strings

Temperature effect:
  temp=0.0: "The capital of France is Paris."  (always same answer)
  temp=0.5: "The capital of France is Paris, a beautiful city."
  temp=1.0: "The capital of France is Paris, known for its vibrant culture..."
  temp=1.5: "The glorious heart of France beats in Paris, where..."

ELI5: Temperature is the creativity knob. Turn it to 0 for factual, deterministic answers (“What is 2+2?” should always be 4). Turn it up for creative writing or brainstorming. Top-p and Top-k limit which words the model even considers at each step — like giving it a smaller vocabulary to choose from. Low top-k means the model sticks to the most likely words; high top-k lets it explore weirder options.

Embeddings

Dense vector representations of tokens/text
Capture semantic meaning — similar meanings → similar vectors
Used for: semantic search, RAG, clustering, similarity
Bedrock provides embedding models (Titan Embeddings, Cohere Embed)

Fine-Tuning & Transfer Learning

Transfer Learning

Pre-trained Model (ImageNet, GPT, etc.)
    ↓
  Freeze early layers (keep general knowledge)
    ↓
  Replace/add final layers for your task
    ↓
  Train on YOUR data with small learning rate
    ↓
  Fine-tuned Model (your domain)

Fine-Tuning Approaches

Approach	What Changes	Data Needed	Cost
Full fine-tuning	All weights	Large dataset	Expensive
Feature extraction	Only final layers	Small dataset	Cheap
LoRA (Low-Rank Adaptation)	Small adapter matrices	Small-medium	Medium
Continuous pre-training	Extend base knowledge	Domain corpus	Expensive
Prompt tuning	Learned prompt embeddings	Small dataset	Cheap

ELI5: Full fine-tuning is like repainting an entire house — effective but expensive. LoRA is like just touching up the trim — you add small adapter matrices alongside the original weights and only train those, so it’s much cheaper while still teaching the model new behavior. Prompt tuning is like putting a welcome mat at the door — minimal change (just a few learned tokens prepended to every prompt), yet surprisingly effective for steering model behavior.

SageMaker JumpStart

Model hub with 400+ pre-trained foundation models
One-click deploy and fine-tune
Models from: Hugging Face, Meta (Llama), Stability AI, AI21, Cohere
Provides: notebooks, training scripts, inference code
Use case: quick start with foundation models without Bedrock

Amazon Bedrock

Fully managed service for accessing foundation models via API.

Architecture

Bedrock Architecture

┌───────────────────────────────────────────────────────┐
│                   Amazon Bedrock                       │
│                                                       │
│  ┌─────────────────────────────────────────────────┐  │
│  │  Foundation Models (FMs)                        │  │
│  │  ├─ Amazon Nova (Micro/Lite/Pro/Premier)        │  │
│  │  ├─ Amazon Titan (Text, Embeddings V2, Image)   │  │
│  │  ├─ Anthropic Claude (Haiku/Sonnet/Opus)        │  │
│  │  ├─ Meta Llama (open weights, edge + cloud)     │  │
│  │  ├─ Cohere (Command R+, Embed v4 — RAG focus)   │  │
│  │  ├─ AI21 Labs Jamba (256K context, JSON native) │  │
│  │  ├─ Stability AI (SDXL — image generation)      │  │
│  │  └─ Mistral AI (MoE — cost-efficient inference) │  │
│  └─────────────────────────────────────────────────┘  │
│                                                       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐  │
│  │Knowledge │ │  Agents  │ │Guardrails│ │  Model  │  │
│  │  Bases   │ │          │ │          │ │  Eval   │  │
│  │  (RAG)   │ │ (actions)│ │ (safety) │ │         │  │
│  └──────────┘ └──────────┘ └──────────┘ └─────────┘  │
│                                                       │
│  Data Privacy: Your data is NEVER used to train       │
│  base models. Data stays in your AWS account.         │
└───────────────────────────────────────────────────────┘

Bedrock Model Families (Exam Focus)

Family	Key Models	Differentiator	Exam Focus
Amazon Nova	Micro, Lite, Pro, Premier	AWS-native, fine-tunable, watermarking	Model selection, distillation
Amazon Titan	Embeddings V2, Image G1	100+ language embeddings	RAG embedding selection
Anthropic Claude	Haiku, Sonnet, Opus	Best reasoning, 200K-1M context	Agent orchestration, tool use
Meta Llama	3.1-3.3 (1B-405B)	Open weights, edge deploy	Fine-tuning, on-prem options
Mistral	7B, Mixtral, Large	MoE (sparse activation = lower cost)	Cost optimization
Cohere	Command R+, Embed v4	RAG specialist, multimodal embed	RAG pipeline, embedding choice
AI21 Jamba	1.5, 2	256K context (longest), native JSON	Long-context, structured output
Stability AI	SDXL, SD3.5	Image generation	Image generation params

Not all models support fine-tuning on Bedrock — Nova (Micro/Lite/Pro), Llama, Mistral do. Open-weight models (Llama, Mistral) can also deploy via SageMaker for more control.

Bedrock Pricing Models

Model	How It Works	Best For	Exam Signal
On-Demand	Pay per input/output token	Variable, unpredictable workloads	“bursty”, “testing”
Provisioned Throughput	Reserve model capacity	Consistent, high-volume workloads	“steady”, “SLA”
Batch Inference	Process large batches (50% discount)	Offline processing	“batch”, “non-real-time”
Intelligent Prompt Routing	Auto-select model size per query	Cost optimization	“optimize cost”, “30% savings”

Additional Bedrock Features

Feature	Purpose
Intelligent Prompt Routing	Auto-route between model sizes per query (up to 30% savings)
Model Distillation	Use large teacher model to train smaller student model
Watermarking	Detect AI-generated content (Nova models)

Bedrock Custom Models

Approach	What It Does	Data Needed
Continued Pre-Training	Teach new domain knowledge	Large unlabeled corpus
Fine-Tuning	Teach specific task format	Labeled prompt-completion pairs

RAG (Retrieval-Augmented Generation)

Why RAG?

Problem: LLMs have knowledge cutoff, hallucinate, lack your private data

Solution: Retrieve relevant docs → inject into prompt → generate grounded answer

Without RAG:  "What is our refund policy?" → Hallucinated answer
With RAG:     "What is our refund policy?" → Retrieves policy doc → Accurate answer

ELI5: RAG solves the biggest LLM problem: hallucination. Instead of asking the model “What’s our refund policy?” and getting a confident but made-up answer, RAG first searches your actual documents, finds the relevant section, then hands it to the model along with the question. The model generates an answer grounded in real text — with citations. It’s like giving a student an open-book test instead of asking them to guess from memory.

Bedrock Knowledge Bases (Managed RAG)

Setup:
  1. Data Sources (S3)
      ↓
  2. Chunking (split documents into pieces)
      ↓
  3. Embedding (convert chunks to vectors)
      ↓
  4. Vector Store (index for similarity search)
      ↓
  Ready for queries!

Query Flow:
  User question
      ↓
  Embed question → Search vector store → Top-K relevant chunks
      ↓
  Augmented prompt = System prompt + Retrieved chunks + User question
      ↓
  Foundation model generates grounded answer
      ↓
  Response with source citations

Supported Vector Stores

Store	Type	Best For
OpenSearch Serverless	AWS managed	Default, easy setup
Aurora PostgreSQL (pgvector)	AWS managed	Already using Aurora
Pinecone	Third-party	Dedicated vector DB
Redis Enterprise	Third-party	Low-latency requirements
MongoDB Atlas	Third-party	Already using MongoDB

Chunking Strategies

Strategy	How It Works
Fixed-size	Split every N tokens (default)
Semantic	Split by meaning/topic boundaries
Hierarchical	Parent-child chunks (context + detail)
No chunking	Treat each file as one chunk

Bedrock Agents

Automate multi-step tasks by combining LLM reasoning with actions.

Bedrock Agents Architecture

User: "Book me a flight from NYC to London next Friday under $500"

Agent reasoning:
  1. Search flights (call flight API via Action Group)
  2. Filter by price < $500
  3. Check availability for next Friday
  4. Book the cheapest option (call booking API)
  5. Return confirmation to user

┌───────────────────────────────────────────────┐
│  Bedrock Agent                                │
│                                               │
│  ┌────────────┐   ┌────────────────────────┐  │
│  │   FM       │   │  Action Groups         │  │
│  │ (reasoning │←→ │  ├─ Lambda functions    │  │
│  │  & planning│   │  ├─ API schemas         │  │
│  │  )         │   │  └─ Return results      │  │
│  └────────────┘   └────────────────────────┘  │
│        ↕                                      │
│  ┌────────────┐   ┌────────────────────────┐  │
│  │ Knowledge  │   │  Guardrails            │  │
│  │ Bases      │   │  (safety filters)      │  │
│  └────────────┘   └────────────────────────┘  │
└───────────────────────────────────────────────┘

Action Groups: Lambda functions the agent can invoke
Knowledge Bases: RAG for the agent’s domain knowledge
Session memory: Retains context across turns
Guardrails: Apply content safety to agent responses

Bedrock Guardrails

Content safety framework applied to FM inputs and outputs.

Filter Type	What It Blocks
Content filters	Hate, insults, sexual, violence, misconduct, prompt attacks (6 categories, adjustable severity)
Denied topics	Custom topics to avoid (e.g., “competitor pricing”)
Word filters	Block specific words/phrases
Sensitive info (PII)	Detect and redact PII (SSN, email, phone, etc.)
Contextual grounding	Block hallucinated or irrelevant responses

Guardrails Flow:
  User Input → [Input Guardrail] → FM → [Output Guardrail] → Response
                    ↓                          ↓
               Block/modify               Block/modify
               if policy                  if policy
               violated                   violated

ELI5: Guardrails are like parental controls for your AI — they filter what goes in and what comes out. You can block topics (“never discuss competitors”), automatically redact PII before it reaches the model or leaves the response, and prevent toxic content. Unlike prompt engineering (asking the model nicely to behave), Guardrails enforce rules at the infrastructure level — the model doesn’t even see the blocked content.

Bedrock Model Evaluation

Evaluation Type	How It Works
Automatic	Built-in metrics (accuracy, robustness, toxicity) on benchmark datasets
Human	Subject matter experts rate model outputs
Custom	Your own evaluation criteria and datasets

Quick Reference: Bedrock vs SageMaker JumpStart

Feature	Bedrock	SageMaker JumpStart
Hosting	Fully managed API	You manage endpoint
Models	Curated FMs (Titan, Claude, Llama)	400+ open-source models
Customization	Fine-tune, continued pre-training	Full training control
RAG	Knowledge Bases (managed)	Build your own
Agents	Managed agents	Build your own
Cost model	Per token	Per instance hour
Best for	GenAI applications	Custom ML + GenAI experiments

Exam tip: If the question is about GenAI applications (chatbots, RAG, content generation) → Bedrock. If it’s about custom model training/deployment → SageMaker.

ELI5: Bedrock is a vending machine — pick a model, call the API, pay per token, done. SageMaker JumpStart is a workshop — pick a model, customize everything, manage your own deployment, pay per instance hour. Bedrock requires zero infrastructure knowledge; JumpStart gives you full control. For the exam, “build a GenAI app quickly” = Bedrock; “fine-tune and self-host” = JumpStart.

Generative AI & Amazon Bedrock#

Transformer Architecture#

Self-Attention Mechanism#

Transformer Variants#

LLM Key Concepts#

Tokens & Tokenization#

Key LLM Parameters#

Embeddings#

Fine-Tuning & Transfer Learning#

Transfer Learning#

Fine-Tuning Approaches#

SageMaker JumpStart#

Amazon Bedrock#

Architecture#

Bedrock Model Families (Exam Focus)#

Bedrock Pricing Models#

Additional Bedrock Features#

Bedrock Custom Models#

RAG (Retrieval-Augmented Generation)#

Why RAG?#

Bedrock Knowledge Bases (Managed RAG)#

Supported Vector Stores#

Chunking Strategies#

Bedrock Agents#

Bedrock Guardrails#

Bedrock Model Evaluation#

Quick Reference: Bedrock vs SageMaker JumpStart#