← AWS MLA-C01 — ML Engineer Associate

Domain 2D: Generative AI & Amazon Bedrock

Generative AI & Amazon Bedrock

Exam Domain: 2 — ML Model Development (26%) Task: Understand foundation models, transformers, and Bedrock services


Transformer Architecture

The architecture behind all modern LLMs (GPT, Claude, Llama, etc.)

┌─────────────────────────────────────────────────────────┐
│                  TRANSFORMER                             │
│                                                         │
│  Input: "The cat sat on the"                            │
│      ↓                                                  │
│  ┌─────────────────┐                                    │
│  │  Tokenization   │  "The" "cat" "sat" "on" "the"     │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Embeddings     │  Each token → dense vector         │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Positional     │  Add position info to embeddings   │
│  │  Encoding       │  (sin/cos or learned)              │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Self-Attention  │  Each token attends to all others │
│  │  (Multi-Head)    │  "How relevant is each word to    │
│  │                  │   every other word?"               │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  ┌─────────────────┐                                    │
│  │  Feed-Forward   │  Process each position             │
│  │  Network        │                                    │
│  └────────┬────────┘                                    │
│           ↓                                             │
│  (repeat N layers)                                      │
│           ↓                                             │
│  Output: "mat"  (next token prediction)                 │
└─────────────────────────────────────────────────────────┘

ELI5: The transformer revolutionized AI with one key insight: “self-attention.” Instead of reading a sentence word by word (like older RNNs), the transformer looks at all words simultaneously and figures out which words relate to each other. In “The animal didn’t cross the street because it was too tired,” a transformer immediately knows “it” refers to “animal” — not “street” — because it can see both words at once. It’s like reading a whole page at a time instead of one word.

Self-Attention Mechanism

For each token, compute:
  Q (Query)  = What am I looking for?
  K (Key)    = What do I contain?
  V (Value)  = What information do I provide?

Attention(Q,K,V) = softmax(QK^T / √d_k) × V

Multi-Head = run multiple attention computations in parallel
             (each head learns different relationships)

ELI5: Self-attention is like being at a noisy party. For every word (token) in the sentence, you check how relevant every other word is to understanding it. “Bank” in “river bank” gets high attention scores toward “river” and low scores toward “money.” The Q/K/V math is just a way of computing those relevance scores efficiently — Q asks the question, K holds the answers, V holds the actual information to pass forward.

Transformer Variants

TypeArchitectureExamplesUse Case
Encoder-onlyProcesses full input bidirectionallyBERTClassification, NER, sentiment
Decoder-onlyGenerates text auto-regressivelyGPT, Claude, LlamaText generation, chat, code
Encoder-DecoderEncodes input, decodes outputT5, BARTTranslation, summarization

ELI5: BERT reads a sentence in both directions simultaneously (great for understanding). GPT/Claude only read left-to-right, one token at a time (great for generating). T5/BART do both — read the input fully, then generate the output. For the exam: decoder-only = text generation, encoder-only = understanding tasks, encoder-decoder = transformation tasks (translate, summarize).


LLM Key Concepts

Tokens & Tokenization

"Hello, how are you?" → ["Hello", ",", " how", " are", " you", "?"]
                         6 tokens

Rule of thumb: ~1 token ≈ 4 characters ≈ 0.75 words

Key LLM Parameters

ParameterControlsRange
TemperatureRandomness/creativity0 (deterministic) → 1+ (creative)
Top-p (nucleus)Cumulative probability cutoff0.0 → 1.0
Top-kNumber of top tokens to consider1 → vocabulary size
Max tokensMaximum output lengthDepends on model
Stop sequencesWhen to stop generatingCustom strings
Temperature effect:
  temp=0.0: "The capital of France is Paris."  (always same answer)
  temp=0.5: "The capital of France is Paris, a beautiful city."
  temp=1.0: "The capital of France is Paris, known for its vibrant culture..."
  temp=1.5: "The glorious heart of France beats in Paris, where..."

ELI5: Temperature is the creativity knob. Turn it to 0 for factual, deterministic answers (“What is 2+2?” should always be 4). Turn it up for creative writing or brainstorming. Top-p and Top-k limit which words the model even considers at each step — like giving it a smaller vocabulary to choose from. Low top-k means the model sticks to the most likely words; high top-k lets it explore weirder options.

Embeddings

  • Dense vector representations of tokens/text
  • Capture semantic meaning — similar meanings → similar vectors
  • Used for: semantic search, RAG, clustering, similarity
  • Bedrock provides embedding models (Titan Embeddings, Cohere Embed)

Fine-Tuning & Transfer Learning

Transfer Learning

Pre-trained Model (ImageNet, GPT, etc.)
    ↓
  Freeze early layers (keep general knowledge)
    ↓
  Replace/add final layers for your task
    ↓
  Train on YOUR data with small learning rate
    ↓
  Fine-tuned Model (your domain)

Fine-Tuning Approaches

ApproachWhat ChangesData NeededCost
Full fine-tuningAll weightsLarge datasetExpensive
Feature extractionOnly final layersSmall datasetCheap
LoRA (Low-Rank Adaptation)Small adapter matricesSmall-mediumMedium
Continuous pre-trainingExtend base knowledgeDomain corpusExpensive
Prompt tuningLearned prompt embeddingsSmall datasetCheap

ELI5: Full fine-tuning is like repainting an entire house — effective but expensive. LoRA is like just touching up the trim — you add small adapter matrices alongside the original weights and only train those, so it’s much cheaper while still teaching the model new behavior. Prompt tuning is like putting a welcome mat at the door — minimal change (just a few learned tokens prepended to every prompt), yet surprisingly effective for steering model behavior.

SageMaker JumpStart

  • Model hub with 400+ pre-trained foundation models
  • One-click deploy and fine-tune
  • Models from: Hugging Face, Meta (Llama), Stability AI, AI21, Cohere
  • Provides: notebooks, training scripts, inference code
  • Use case: quick start with foundation models without Bedrock

Amazon Bedrock

Fully managed service for accessing foundation models via API.

Architecture

Bedrock Architecture

┌───────────────────────────────────────────────────────┐
│                   Amazon Bedrock                       │
│                                                       │
│  ┌─────────────────────────────────────────────────┐  │
│  │  Foundation Models (FMs)                        │  │
│  │  ├─ Amazon Nova (Micro/Lite/Pro/Premier)        │  │
│  │  ├─ Amazon Titan (Text, Embeddings V2, Image)   │  │
│  │  ├─ Anthropic Claude (Haiku/Sonnet/Opus)        │  │
│  │  ├─ Meta Llama (open weights, edge + cloud)     │  │
│  │  ├─ Cohere (Command R+, Embed v4 — RAG focus)   │  │
│  │  ├─ AI21 Labs Jamba (256K context, JSON native) │  │
│  │  ├─ Stability AI (SDXL — image generation)      │  │
│  │  └─ Mistral AI (MoE — cost-efficient inference) │  │
│  └─────────────────────────────────────────────────┘  │
│                                                       │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────┐  │
│  │Knowledge │ │  Agents  │ │Guardrails│ │  Model  │  │
│  │  Bases   │ │          │ │          │ │  Eval   │  │
│  │  (RAG)   │ │ (actions)│ │ (safety) │ │         │  │
│  └──────────┘ └──────────┘ └──────────┘ └─────────┘  │
│                                                       │
│  Data Privacy: Your data is NEVER used to train       │
│  base models. Data stays in your AWS account.         │
└───────────────────────────────────────────────────────┘

Bedrock Model Families (Exam Focus)

FamilyKey ModelsDifferentiatorExam Focus
Amazon NovaMicro, Lite, Pro, PremierAWS-native, fine-tunable, watermarkingModel selection, distillation
Amazon TitanEmbeddings V2, Image G1100+ language embeddingsRAG embedding selection
Anthropic ClaudeHaiku, Sonnet, OpusBest reasoning, 200K-1M contextAgent orchestration, tool use
Meta Llama3.1-3.3 (1B-405B)Open weights, edge deployFine-tuning, on-prem options
Mistral7B, Mixtral, LargeMoE (sparse activation = lower cost)Cost optimization
CohereCommand R+, Embed v4RAG specialist, multimodal embedRAG pipeline, embedding choice
AI21 Jamba1.5, 2256K context (longest), native JSONLong-context, structured output
Stability AISDXL, SD3.5Image generationImage generation params

Not all models support fine-tuning on Bedrock — Nova (Micro/Lite/Pro), Llama, Mistral do. Open-weight models (Llama, Mistral) can also deploy via SageMaker for more control.

Bedrock Pricing Models

ModelHow It WorksBest ForExam Signal
On-DemandPay per input/output tokenVariable, unpredictable workloads“bursty”, “testing”
Provisioned ThroughputReserve model capacityConsistent, high-volume workloads“steady”, “SLA”
Batch InferenceProcess large batches (50% discount)Offline processing“batch”, “non-real-time”
Intelligent Prompt RoutingAuto-select model size per queryCost optimization“optimize cost”, “30% savings”

Additional Bedrock Features

FeaturePurpose
Intelligent Prompt RoutingAuto-route between model sizes per query (up to 30% savings)
Model DistillationUse large teacher model to train smaller student model
WatermarkingDetect AI-generated content (Nova models)

Bedrock Custom Models

ApproachWhat It DoesData Needed
Continued Pre-TrainingTeach new domain knowledgeLarge unlabeled corpus
Fine-TuningTeach specific task formatLabeled prompt-completion pairs

RAG (Retrieval-Augmented Generation)

Why RAG?

Problem: LLMs have knowledge cutoff, hallucinate, lack your private data

Solution: Retrieve relevant docs → inject into prompt → generate grounded answer

Without RAG:  "What is our refund policy?" → Hallucinated answer
With RAG:     "What is our refund policy?" → Retrieves policy doc → Accurate answer

ELI5: RAG solves the biggest LLM problem: hallucination. Instead of asking the model “What’s our refund policy?” and getting a confident but made-up answer, RAG first searches your actual documents, finds the relevant section, then hands it to the model along with the question. The model generates an answer grounded in real text — with citations. It’s like giving a student an open-book test instead of asking them to guess from memory.

Bedrock Knowledge Bases (Managed RAG)

Setup:
  1. Data Sources (S3)
      ↓
  2. Chunking (split documents into pieces)
      ↓
  3. Embedding (convert chunks to vectors)
      ↓
  4. Vector Store (index for similarity search)
      ↓
  Ready for queries!

Query Flow:
  User question
      ↓
  Embed question → Search vector store → Top-K relevant chunks
      ↓
  Augmented prompt = System prompt + Retrieved chunks + User question
      ↓
  Foundation model generates grounded answer
      ↓
  Response with source citations

Supported Vector Stores

StoreTypeBest For
OpenSearch ServerlessAWS managedDefault, easy setup
Aurora PostgreSQL (pgvector)AWS managedAlready using Aurora
PineconeThird-partyDedicated vector DB
Redis EnterpriseThird-partyLow-latency requirements
MongoDB AtlasThird-partyAlready using MongoDB

Chunking Strategies

StrategyHow It Works
Fixed-sizeSplit every N tokens (default)
SemanticSplit by meaning/topic boundaries
HierarchicalParent-child chunks (context + detail)
No chunkingTreat each file as one chunk

Bedrock Agents

Automate multi-step tasks by combining LLM reasoning with actions.

Bedrock Agents Architecture

User: "Book me a flight from NYC to London next Friday under $500"

Agent reasoning:
  1. Search flights (call flight API via Action Group)
  2. Filter by price < $500
  3. Check availability for next Friday
  4. Book the cheapest option (call booking API)
  5. Return confirmation to user

┌───────────────────────────────────────────────┐
│  Bedrock Agent                                │
│                                               │
│  ┌────────────┐   ┌────────────────────────┐  │
│  │   FM       │   │  Action Groups         │  │
│  │ (reasoning │←→ │  ├─ Lambda functions    │  │
│  │  & planning│   │  ├─ API schemas         │  │
│  │  )         │   │  └─ Return results      │  │
│  └────────────┘   └────────────────────────┘  │
│        ↕                                      │
│  ┌────────────┐   ┌────────────────────────┐  │
│  │ Knowledge  │   │  Guardrails            │  │
│  │ Bases      │   │  (safety filters)      │  │
│  └────────────┘   └────────────────────────┘  │
└───────────────────────────────────────────────┘
  • Action Groups: Lambda functions the agent can invoke
  • Knowledge Bases: RAG for the agent’s domain knowledge
  • Session memory: Retains context across turns
  • Guardrails: Apply content safety to agent responses

Bedrock Guardrails

Content safety framework applied to FM inputs and outputs.

Filter TypeWhat It Blocks
Content filtersHate, insults, sexual, violence, misconduct, prompt attacks (6 categories, adjustable severity)
Denied topicsCustom topics to avoid (e.g., “competitor pricing”)
Word filtersBlock specific words/phrases
Sensitive info (PII)Detect and redact PII (SSN, email, phone, etc.)
Contextual groundingBlock hallucinated or irrelevant responses
Guardrails Flow:
  User Input → [Input Guardrail] → FM → [Output Guardrail] → Response
                    ↓                          ↓
               Block/modify               Block/modify
               if policy                  if policy
               violated                   violated

ELI5: Guardrails are like parental controls for your AI — they filter what goes in and what comes out. You can block topics (“never discuss competitors”), automatically redact PII before it reaches the model or leaves the response, and prevent toxic content. Unlike prompt engineering (asking the model nicely to behave), Guardrails enforce rules at the infrastructure level — the model doesn’t even see the blocked content.


Bedrock Model Evaluation

Evaluation TypeHow It Works
AutomaticBuilt-in metrics (accuracy, robustness, toxicity) on benchmark datasets
HumanSubject matter experts rate model outputs
CustomYour own evaluation criteria and datasets

Quick Reference: Bedrock vs SageMaker JumpStart

FeatureBedrockSageMaker JumpStart
HostingFully managed APIYou manage endpoint
ModelsCurated FMs (Titan, Claude, Llama)400+ open-source models
CustomizationFine-tune, continued pre-trainingFull training control
RAGKnowledge Bases (managed)Build your own
AgentsManaged agentsBuild your own
Cost modelPer tokenPer instance hour
Best forGenAI applicationsCustom ML + GenAI experiments

Exam tip: If the question is about GenAI applications (chatbots, RAG, content generation) → Bedrock. If it’s about custom model training/deploymentSageMaker.

ELI5: Bedrock is a vending machine — pick a model, call the API, pay per token, done. SageMaker JumpStart is a workshop — pick a model, customize everything, manage your own deployment, pay per instance hour. Bedrock requires zero infrastructure knowledge; JumpStart gives you full control. For the exam, “build a GenAI app quickly” = Bedrock; “fine-tune and self-host” = JumpStart.