← Claude Code & Certification

Claude with Google Cloud's Vertex AI - Certification Study Guide

Table of Contents

Claude with Google Cloud’s Vertex AI - Certification Study Guide

Course: Anthropic - Claude with Google Cloud’s Vertex AI
Modules: 9 (Introduction through Anthropic Apps & Agents)
Target: Certification Prep


MODULE 1: Introduction

Key Notes

  • Course scope: Using Claude via Google Cloud Vertex AI — same models, GCP-native auth and infrastructure
  • Claude models on Vertex AI:
    • claude-3-5-sonnet@20241022 — latest Sonnet, best balance of speed and capability
    • claude-3-5-haiku@20241022 — fast and cost-efficient, great for classification and routing
    • claude-3-opus@20240229 — most capable, highest cost, use for complex reasoning
    • claude-3-haiku@20240307 — older Haiku, lower cost baseline
    • claude-3-sonnet@20240229 — mid-tier option, older generation
  • Model ID format: claude-{family}@{date} — note @ not -v suffix (unlike Bedrock)
  • Authentication: Google Cloud Application Default Credentials (ADC)
    • Local dev: gcloud auth application-default login
    • Production: Service Account attached to compute resource with Vertex AI permissions
    • CI/CD: Workload Identity Federation preferred over long-lived keys
  • SDK: from anthropic import AnthropicVertex — uses the anthropic Python library, not a GCP-specific client
  • Prerequisites:
    • GCP project with billing enabled
    • Vertex AI API enabled (gcloud services enable aiplatform.googleapis.com)
    • Claude model access requested in Vertex AI Model Garden
    • IAM role: roles/aiplatform.user or roles/vertexai.user
  • Enable model access: Vertex AI Model Garden → Claude → Enable (one-time per project)

Vertex AI vs Direct API — Key Differences:

AspectDirect APIVertex AI
AuthANTHROPIC_API_KEY env varGoogle ADC or Service Account
Client classAnthropic()AnthropicVertex(region, project_id)
Model IDclaude-3-5-sonnet-20241022claude-3-5-sonnet@20241022
BillingThrough AnthropicThrough Google Cloud
IntegrationStandaloneGCP-native (BigQuery, GCS, etc.)
Messages APIIdenticalIdentical
StreamingIdenticalIdentical
Tool useIdenticalIdentical

Everything else — Messages API, tools, streaming, extended thinking — is identical across both.

GCP Setup Checklist

# 1. Install Google Cloud SDK
curl https://sdk.cloud.google.com | bash

# 2. Initialize and authenticate
gcloud init
gcloud auth application-default login

# 3. Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com

# 4. Install the Anthropic Python library
pip install anthropic

# 5. Verify access
python -c "from anthropic import AnthropicVertex; print('OK')"

Available Regions

  • us-east5 (Ohio) — primary recommended region
  • us-central1 (Iowa) — general US
  • europe-west4 (Netherlands) — EU
  • asia-southeast1 (Singapore) — APAC
  • Check Vertex AI Model Garden for the latest region availability per model version

Best Practices

  • Enable Vertex AI API and request Claude model access in Model Garden before running any code
  • Use service accounts with roles/aiplatform.user or roles/vertexai.user in production
  • Pin model IDs with the @date suffix to avoid unexpected version changes
  • Choose region closest to your users for lowest latency
  • Set ANTHROPIC_VERTEX_PROJECT_ID and CLOUD_ML_REGION env vars for environment-portable code

MODULE 2: Accessing Claude with the API

Key Notes

2.1 — AnthropicVertex Client Setup

from anthropic import AnthropicVertex

# Explicit configuration — recommended for production
client = AnthropicVertex(
    region="us-east5",           # GCP region with Claude access
    project_id="my-gcp-project"  # Your GCP project ID
)

From environment variables:

import os
# Set env vars: ANTHROPIC_VERTEX_PROJECT_ID, CLOUD_ML_REGION
os.environ["ANTHROPIC_VERTEX_PROJECT_ID"] = "my-gcp-project"
os.environ["CLOUD_ML_REGION"] = "us-east5"
client = AnthropicVertex()  # reads from env

ADC Authentication flow:

Code → AnthropicVertex → google-auth library → ADC
                                                 ├── GOOGLE_APPLICATION_CREDENTIALS env var (JSON key file)
                                                 ├── gcloud auth application-default login (local dev)
                                                 ├── GCE/GKE metadata server (compute resources)
                                                 └── Cloud Run / App Engine built-in SA

2.2 — Basic Request

from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

message = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)

# Accessing response content
print(message.content[0].text)           # "Paris is the capital of France."
print(f"Stop reason: {message.stop_reason}")  # "end_turn"
print(f"Input tokens:  {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
print(f"Model: {message.model}")

Response object fields:

message.id              — unique message ID
message.type            — "message"
message.role            — "assistant"
message.content         — list of content blocks
message.model           — model version used
message.stop_reason     — "end_turn" | "max_tokens" | "stop_sequence" | "tool_use"
message.stop_sequence   — which stop sequence triggered (if applicable)
message.usage           — input_tokens, output_tokens

2.3 — Multi-Turn Conversations

The API is stateless. You maintain conversation history client-side and send the entire array each request.

from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

conversation_history = []
system_prompt = "You are a helpful Python tutor. Be concise and give working code examples."

def chat(user_input: str) -> str:
    # Add user turn
    conversation_history.append({"role": "user", "content": user_input})

    response = client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=2048,
        system=system_prompt,
        messages=conversation_history
    )

    # Extract and store assistant turn
    assistant_msg = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_msg})
    return assistant_msg

# Example multi-turn session
print(chat("What is a list comprehension?"))
# Returns explanation

print(chat("Show me three examples."))
# References the previous exchange (list comprehensions)

print(chat("Now convert the last example to a generator expression."))
# Refers to the third example from the prior turn

Conversation history management — trimming to stay within context:

def trim_history(history: list, max_pairs: int = 10) -> list:
    """Keep only the most recent N user/assistant pairs."""
    # Always keep pairs (user + assistant = 2 items)
    max_items = max_pairs * 2
    if len(history) > max_items:
        return history[-max_items:]
    return history

Message structure rules:

  • Must alternate: user, assistant, user, assistant…
  • First message must be role user
  • Last message before API call must be role user
  • Do NOT put system prompt in the messages array — use the system parameter

2.4 — System Prompts

# Simple string system prompt
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=512,
    system="You are a senior software engineer. Always provide code examples in Python.",
    messages=[{"role": "user", "content": "How do I read a file?"}]
)

# Structured system prompt with XML tags
system = """
<role>You are a customer support agent for Acme Corp.</role>
<tone>Professional, empathetic, solution-focused</tone>
<constraints>
- Never promise refunds without authorization code
- Escalate billing issues to finance team
- Always end with: "Is there anything else I can help you with?"
</constraints>
<knowledge_cutoff>You have product information current as of 2024-Q4.</knowledge_cutoff>
"""

# System prompt as list (for prompt caching — covered in Module 7)
system_as_list = [
    {
        "type": "text",
        "text": "You are a helpful assistant with a large corpus of knowledge.",
        "cache_control": {"type": "ephemeral"}  # Cache this block
    }
]

2.5 — Temperature and Sampling Controls

# Deterministic output — best for extraction, classification, structured data
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=256,
    temperature=0.0,   # Fully deterministic
    messages=[{"role": "user", "content": "Is 'The weather is nice' positive or negative?"}]
)

# Creative output — stories, brainstorming, variation generation
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    temperature=1.0,   # Maximum creativity
    messages=[{"role": "user", "content": "Write an unexpected plot twist for a mystery novel."}]
)

# Top-p (nucleus) sampling — alternative to temperature
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=256,
    top_p=0.9,    # Consider tokens comprising top 90% probability mass
    messages=[{"role": "user", "content": "Write a tagline for a coffee brand."}]
)

# Top-k — limit to top k tokens
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=256,
    top_k=50,     # Only sample from the 50 most likely next tokens
    messages=[{"role": "user", "content": "Continue this sentence: The rocket launched and"}]
)

Temperature guide:

0.0       → Classification, extraction, structured output, Q&A
0.1–0.3   → Factual summarization, translation, code generation
0.5–0.7   → Conversational assistants, explanations, paraphrasing
0.8–1.0   → Creative writing, brainstorming, ideation, roleplay

Important: temperature and top_p cannot both be set in the same request on Vertex AI (same constraint as direct API). Use one or the other.

2.6 — Stop Sequences

# Stop generation when a specific string is produced
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    stop_sequences=["</answer>", "###", "STOP"],
    messages=[{
        "role": "user",
        "content": "Solve this: What is 15% of 200? Wrap your answer in <answer></answer> tags."
    }]
)
# Claude generates until it produces </answer> — then stops
# Useful to extract content between tags without trailing text
print(response.stop_reason)       # "stop_sequence"
print(response.stop_sequence)     # "</answer>"

2.7 — Streaming Responses

from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

# Simple streaming — iterate over text chunks
with client.messages.stream(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about Google Cloud Platform."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()

# Access final message after streaming completes
with client.messages.stream(
    model="claude-3-5-sonnet@20241022",
    max_tokens=2048,
    messages=[{"role": "user", "content": "Explain Vertex AI in detail."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    final_message = stream.get_final_message()
    print(f"\nTotal output tokens: {final_message.usage.output_tokens}")

# Raw event streaming — for fine-grained control
with client.messages.stream(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Count to 5."}]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            print(f"[delta]: {event.delta.text}", end="")
        elif event.type == "message_stop":
            print("\n[stream complete]")

SSE event types during streaming:

message_start          → metadata about the message
content_block_start    → new content block beginning
content_block_delta    → incremental text or thinking delta
content_block_stop     → content block ended
message_delta          → stop_reason, stop_sequence, usage update
message_stop           → stream complete

2.8 — Structured Output

import json
from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

# Method 1: System prompt instruction + JSON parsing
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=512,
    system="Always respond with valid JSON only. No explanation, no markdown code fences.",
    messages=[{
        "role": "user",
        "content": "Extract: name, age, city from: 'John is 30 years old from London.'"
    }]
)
data = json.loads(response.content[0].text)
print(data)  # {"name": "John", "age": 30, "city": "London"}

# Method 2: Prefill the assistant turn to force JSON
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "List three GCP regions as JSON array of strings."},
        {"role": "assistant", "content": "["}  # Prefill forces Claude to continue JSON
    ]
)
# Response will complete the JSON array
raw = "[" + response.content[0].text
regions = json.loads(raw)

# Method 3: XML-delimited extraction
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": """Extract the sentiment and confidence score.
Respond ONLY with XML:
<result>
  <sentiment>POSITIVE|NEGATIVE|NEUTRAL</sentiment>
  <confidence>0.0-1.0</confidence>
</result>

Text: "This product completely exceeded my expectations!"
"""
    }]
)

import xml.etree.ElementTree as ET
root = ET.fromstring(response.content[0].text.strip())
sentiment = root.find("sentiment").text
confidence = float(root.find("confidence").text)

2.9 — Async Usage

import asyncio
from anthropic import AsyncAnthropicVertex

async def main():
    client = AsyncAnthropicVertex(region="us-east5", project_id="my-gcp-project")

    # Single async request
    response = await client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain async/await in Python."}]
    )
    print(response.content[0].text)

    # Parallel async requests
    tasks = [
        client.messages.create(
            model="claude-3-5-haiku@20241022",
            max_tokens=256,
            messages=[{"role": "user", "content": f"Summarize: Topic {i}"}]
        )
        for i in range(5)
    ]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(r.content[0].text)

asyncio.run(main())

2.10 — Error Handling

from anthropic import AnthropicVertex
from anthropic import APIStatusError, APIConnectionError, APITimeoutError, RateLimitError
import time

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

def create_with_retry(messages: list, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-3-5-sonnet@20241022",
                max_tokens=1024,
                messages=messages
            )
            return response.content[0].text

        except RateLimitError as e:
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
            print(f"Rate limited. Waiting {wait}s before retry {attempt+1}/{max_retries}")
            time.sleep(wait)

        except APIConnectionError as e:
            print(f"Connection error: {e}. Retrying...")
            time.sleep(1)

        except APITimeoutError as e:
            print(f"Request timed out on attempt {attempt+1}")
            time.sleep(1)

        except APIStatusError as e:
            print(f"API error {e.status_code}: {e.message}")
            if e.status_code in (400, 401, 403):
                raise  # Don't retry auth/validation errors
            time.sleep(2 ** attempt)

    raise RuntimeError(f"Failed after {max_retries} retries")

Best Practices

  • Use AnthropicVertex — do not try to use Anthropic client with a custom base URL for Vertex
  • ADC resolves automatically on GCE/Cloud Run/GKE — no extra config needed in those environments
  • Specify project_id explicitly in code for portability across environments
  • Stream for interactive applications; collect full response for batch processing
  • Set temperature=0 for classification, extraction, and structured output tasks
  • Always handle RateLimitError with exponential backoff in production
  • Use AsyncAnthropicVertex when running many concurrent requests to avoid blocking
  • Monitor token usage (message.usage) for cost tracking and quota management

MODULE 3: Prompt Evaluation

Key Notes

3.1 — Evaluation Philosophy

Prompt evaluation is the discipline of measuring how well a prompt + model combination performs on a defined task. Without evaluation, you cannot know if a prompt change helps or hurts.

Core principle: You need a fixed test set to compare prompt versions. If you change the test set between experiments, the comparison is invalid.

3.2 — Evaluation Workflow

Step 1: Define Task
  └── Clear input format
  └── Clear expected output format
  └── Success criterion (what "correct" means)

Step 2: Build Test Dataset
  └── 20-50 examples minimum for initial evaluation
  └── 100+ for production-grade signal
  └── Cover: normal cases, edge cases, adversarial inputs
  └── Label expected outputs precisely

Step 3: Run Evaluation
  └── Run all test cases through Claude
  └── Collect (input, expected, actual) triples
  └── Log: timestamp, model, prompt version, all outputs

Step 4: Grade Outputs
  └── Apply grading function to each output
  └── Compute aggregate metric (accuracy, F1, etc.)

Step 5: Analyze Failures
  └── Group failures by type/category
  └── Identify patterns (what inputs fail?)
  └── Formulate hypothesis for improvement

Step 6: Iterate
  └── Modify prompt based on failure analysis
  └── Re-run same test set
  └── Compare metrics across versions

3.3 — Test Dataset Generation

from anthropic import AnthropicVertex
import json

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

def generate_test_dataset(task_description: str, n_examples: int = 30) -> list:
    """Use Claude to generate labeled test cases for a classification task."""
    response = client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"""Generate {n_examples} diverse test examples for this task:
{task_description}

Return as JSON array where each item has:
- "input": the test input text
- "expected_label": the correct answer
- "category": one of [normal, edge_case, adversarial]

Ensure balanced distribution across categories and classes.
Return ONLY the JSON array."""
        }]
    )
    return json.loads(response.content[0].text)

# Example usage
dataset = generate_test_dataset(
    "Sentiment classification: classify customer reviews as POSITIVE, NEGATIVE, or NEUTRAL"
)

3.4 — Grading Methods

MethodWhen to UseProsCons
Exact matchSingle-word/fixed answersFast, cheap, 100% reliableFails on paraphrase
RegexFormat validation (dates, IDs)Flexible, fastRequires regex writing
JSON schemaStructured data tasksValidates shape + typesDoesn’t check semantic correctness
Substring/keywordKeyword presence checkSimpleCan false-positive
Model-basedOpen-ended, subjectiveCan handle nuanceCosts money, can be wrong
Human reviewGold standard, calibrationMost accurateSlow, expensive

Exact match grading:

def grade_exact(output: str, expected: str) -> bool:
    return output.strip().upper() == expected.strip().upper()

Regex grading:

import re

def grade_regex(output: str, pattern: str) -> bool:
    return bool(re.search(pattern, output, re.IGNORECASE))

# Example: check if output contains a valid date format
grade_regex(output, r'\d{4}-\d{2}-\d{2}')

JSON schema grading:

import jsonschema
import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"}
    },
    "required": ["name", "age", "city"]
}

def grade_json_schema(output: str, schema: dict) -> bool:
    try:
        data = json.loads(output)
        jsonschema.validate(data, schema)
        return True
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return False

Model-based grading (LLM-as-judge):

def grade_with_claude(output: str, expected: str, criterion: str) -> dict:
    """Use Haiku for cost-efficient grading."""
    grading_response = client.messages.create(
        model="claude-3-5-haiku@20241022",   # Use cheaper model for grading
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Evaluate if the response meets the criterion.

Criterion: {criterion}
Expected (reference): {expected}
Actual response: {output}

Respond with JSON only:
{{"pass": true/false, "score": 0.0-1.0, "reason": "one sentence explanation"}}"""
        }]
    )
    return json.loads(grading_response.content[0].text)

# Usage
result = grade_with_claude(
    output="Paris is the capital city of France, located in northern France.",
    expected="Paris",
    criterion="The response correctly identifies the capital of France."
)
# {"pass": true, "score": 1.0, "reason": "Response correctly names Paris as France's capital."}

3.5 — Running a Full Evaluation

import json
import time
from datetime import datetime

def run_evaluation(
    test_dataset: list,
    prompt_template: str,
    model: str = "claude-3-5-sonnet@20241022"
) -> dict:
    """Run full evaluation and return metrics."""
    results = []

    for i, example in enumerate(test_dataset):
        # Format prompt with input
        user_message = prompt_template.format(input=example["input"])

        response = client.messages.create(
            model=model,
            max_tokens=256,
            temperature=0,
            messages=[{"role": "user", "content": user_message}]
        )

        actual = response.content[0].text.strip()
        correct = grade_exact(actual, example["expected_label"])

        results.append({
            "input": example["input"],
            "expected": example["expected_label"],
            "actual": actual,
            "correct": correct,
            "category": example.get("category", "normal"),
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens
        })

        # Rate limit respect
        if i > 0 and i % 10 == 0:
            time.sleep(1)

    # Compute metrics
    total = len(results)
    correct = sum(1 for r in results if r["correct"])
    accuracy = correct / total

    # Break down by category
    by_category = {}
    for r in results:
        cat = r["category"]
        if cat not in by_category:
            by_category[cat] = {"correct": 0, "total": 0}
        by_category[cat]["total"] += 1
        if r["correct"]:
            by_category[cat]["correct"] += 1

    return {
        "timestamp": datetime.utcnow().isoformat(),
        "model": model,
        "accuracy": accuracy,
        "correct": correct,
        "total": total,
        "by_category": by_category,
        "avg_input_tokens": sum(r["input_tokens"] for r in results) / total,
        "avg_output_tokens": sum(r["output_tokens"] for r in results) / total,
        "results": results
    }

3.6 — Key Metrics to Track

MetricFormulaUse Case
Accuracycorrect / totalSingle-class classification
PrecisionTP / (TP + FP)When false positives are costly
RecallTP / (TP + FN)When false negatives are costly
F1 Score2 * P * R / (P + R)Balanced precision/recall
Token efficiencymeaningful tokens / total output tokensVerbosity control
Cost per correcttotal cost / correct answersBudget optimization
p50/p95 latencymedian / 95th percentile response timeSLA monitoring

Best Practices

  • Use a cheaper model (Haiku) for grading to reduce evaluation cost by 5-10x
  • Log all evaluation runs: timestamp, model version, prompt version, per-example results
  • Compare prompt versions on the exact same test set for valid comparisons
  • Track cost per correct answer, not just accuracy — a prompt with 95% accuracy at $0.10/answer may beat 98% at $0.50/answer
  • Run evaluations after every significant prompt change — small wording changes can cause large accuracy shifts
  • Categorize test cases: normal, edge case, adversarial — measure accuracy per category, not just overall
  • Hold out 20% of examples for final validation; use 80% for prompt development

MODULE 4: Prompt Engineering Techniques

Key Notes

4.1 — Core Principles

PrincipleDescriptionExample
ClarityExplicit about format, length, tone, constraints“Respond in exactly 2 sentences”
SpecificityPrecise about what you want“List 5 bullet points, each under 15 words”
ContextProvide all necessary backgroundInclude domain glossary for specialized tasks
RoleAssign a persona or expertise level“You are a senior security engineer…”
ExamplesShow don’t just tellInclude 2-5 input→output demonstrations

4.2 — XML Tag Structuring

XML tags organize complex prompts into labeled, scannable sections. Claude parses them naturally.

<system_context>
You are a content moderation assistant for a B2B SaaS platform.
Your task is to classify support tickets by urgency.
</system_context>

<urgency_definitions>
<level name="P0">Production down, all users affected — respond in 15 minutes</level>
<level name="P1">Partial outage, major feature broken — respond in 1 hour</level>
<level name="P2">Non-blocking bug or feature question — respond in 24 hours</level>
<level name="P3">Feature request or cosmetic — respond in 1 week</level>
</urgency_definitions>

<examples>
<example>
<ticket>Our login system is completely broken, nobody can access the app!</ticket>
<priority>P0</priority>
</example>
<example>
<ticket>The export CSV button shows wrong column headers</ticket>
<priority>P2</priority>
</example>
</examples>

<task>
Classify this ticket. Respond with only: P0, P1, P2, or P3.
<ticket>{ticket_text}</ticket>
</task>

4.3 — Few-Shot Examples

few_shot_system = """
You extract structured data from natural language. Return JSON only.

Examples:
Input: "Sarah Johnson, 42, lives in Austin Texas"
Output: {"name": "Sarah Johnson", "age": 42, "city": "Austin", "state": "Texas"}

Input: "Bob, 28, from New York"
Output: {"name": "Bob", "age": 28, "city": "New York", "state": "NY"}

Input: "Dr. Emily Chen is 55 and based in Chicago"
Output: {"name": "Emily Chen", "age": 55, "city": "Chicago", "state": "IL"}
"""

response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=256,
    system=few_shot_system,
    messages=[{
        "role": "user",
        "content": "Maria Rodriguez, 31, from Los Angeles California"
    }]
)

Few-shot placement strategies:

  • System prompt — use when examples apply to all turns in a conversation; cache them for efficiency
  • User message — use when examples are task-specific and vary per request
  • Optimal count — 2-5 examples; more is often not better unless task is highly ambiguous

4.4 — Chain of Thought

# Standard CoT — ask Claude to think step by step
cot_response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": """Solve this problem step by step, then give your final answer.

A store sells apples for $0.50 each and oranges for $0.75 each.
If Alex bought 8 apples and 5 oranges, and paid with a $10 bill,
how much change did Alex receive?

Show your work clearly, then state the final answer."""
    }]
)

# Structured CoT with XML — extract reasoning and answer separately
structured_cot = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": """Analyze whether this investment is worthwhile.
Put your reasoning in <thinking> tags and your recommendation in <recommendation> tags.

Investment: $50,000 upfront, returns $8,000/year for 8 years"""
    }]
)

import re
text = structured_cot.content[0].text
thinking = re.search(r'<thinking>(.*?)</thinking>', text, re.DOTALL)
recommendation = re.search(r'<recommendation>(.*?)</recommendation>', text, re.DOTALL)

4.5 — Role Prompting

# Technical expert role
technical_system = """
You are a principal Google Cloud architect with 10 years of experience designing
enterprise-scale distributed systems on GCP. You specialize in Vertex AI, BigQuery,
and Cloud Spanner. When asked about architecture, you consider: scalability, cost,
reliability, security, and operational complexity. You cite specific GCP services
and their limitations.
"""

# Pedagogical role
teacher_system = """
You are an expert at explaining complex technical concepts to beginners.
You use analogies from everyday life. You check understanding by asking
a simple follow-up question at the end of each explanation.
You never use jargon without defining it first.
"""

# Reviewer role with specific constraints
reviewer_system = """
You are a strict code reviewer who follows Google's Python Style Guide.
Flag issues in these categories: correctness, security, performance, readability.
For each issue: state category, line reference, problem, and recommended fix.
Do not praise — only flag issues. If no issues found, respond with "LGTM".
"""

4.6 — Output Format Control

# Force specific structure with format specification
format_prompt = """
Analyze the following customer review. Respond using EXACTLY this format:

SUMMARY: [1-2 sentence summary]
SENTIMENT: [POSITIVE / NEGATIVE / NEUTRAL / MIXED]
SCORE: [1-10]
KEY_ISSUES: [comma-separated list, or "none"]
RECOMMENDED_ACTION: [one action for the team]

Review: {review}
"""

# Markdown output
markdown_prompt = """
Create a technical comparison. Format as markdown with:
- H2 header for each technology
- Bullet points for pros and cons
- A comparison table at the end with columns: Feature | Tech A | Tech B

Keep the table under 8 rows. Use checkmarks (✓) and X marks for binary features.
"""

4.7 — Prefilling and Output Forcing

# Prefill the assistant turn to start output in a specific format
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "List 3 GCP regions as a JSON array."},
        {"role": "assistant", "content": '["'}   # Force JSON array start
    ]
)
# Response continues from the prefill — guaranteed JSON array format
raw_response = '["' + response.content[0].text
regions = json.loads(raw_response)

# Prefill for SQL — force SELECT keyword
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Write a BigQuery SQL to count rows in dataset.table"},
        {"role": "assistant", "content": "SELECT"}  # Force SELECT start
    ]
)

Best Practices

  • Start with zero-shot; add few-shot only if zero-shot quality is insufficient
  • Test prompts on edge cases before production deployment
  • Use temperature=0 for classification/extraction; higher for generation tasks
  • Separate static instructions (system prompt) from dynamic content (user message)
  • Format few-shot examples identically to real inputs — format mismatch degrades quality
  • Use XML tags for long, multi-part prompts to aid Claude’s parsing
  • When using CoT, ask for reasoning before the final answer — this produces better answers
  • Prefill with [ or { to force JSON; prefill with to prevent unwanted preamble

MODULE 5: Tool Use with Claude

Key Notes

5.1 — Tool Schema Definition

Tools are defined as JSON schemas. Claude uses the name and description to decide when to call a tool, and the input_schema to know what arguments to provide.

from anthropic import AnthropicVertex
import json

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather conditions for a specified city. Returns temperature, humidity, and conditions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name (e.g., 'San Francisco', 'London')"
                },
                "units": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature units. Default: celsius",
                    "default": "celsius"
                }
            },
            "required": ["city"]
        }
    },
    {
        "name": "search_database",
        "description": "Search the internal product database for matching items. Use for product lookups, inventory checks, and pricing.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query string"
                },
                "category": {
                    "type": "string",
                    "enum": ["electronics", "clothing", "food", "all"],
                    "description": "Product category to search within"
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results (1-20)",
                    "minimum": 1,
                    "maximum": 20,
                    "default": 5
                }
            },
            "required": ["query"]
        }
    }
]
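
The same `input_schema` can be reused on the client side to check the arguments Claude supplies before a tool runs. The sketch below is a deliberately small validator, not a full JSON Schema implementation: it covers only `required`, `type`, `enum`, `minimum` and `maximum`, and it rejects unknown fields, which is stricter than JSON Schema's default. The name `validate_tool_input` is hypothetical.

```python
from typing import Any, Dict, List

# JSON Schema type names mapped to the Python types they accept
_JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def validate_tool_input(schema: Dict[str, Any], inputs: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in `inputs`; an empty list means valid."""
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in inputs:
            errors.append(f"missing required field: {name}")
    for name, value in inputs.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric fields
        bool_as_number = isinstance(value, bool) and spec.get("type") in ("integer", "number")
        if expected and (not isinstance(value, expected) or bool_as_number):
            errors.append(f"{name}: expected {spec['type']}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            errors.append(f"{name}: above maximum {spec['maximum']}")
    return errors


# Trimmed copy of the search_database schema defined above
search_schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "category": {"type": "string", "enum": ["electronics", "clothing", "food", "all"]},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 20},
    },
    "required": ["query"],
}

ok = validate_tool_input(search_schema, {"query": "laptop", "max_results": 5})
bad = validate_tool_input(search_schema, {"category": "toys", "max_results": 50})
print(ok, bad)
```

For production use, a maintained JSON Schema library is likely a better choice than a hand-written check like this one.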

5.2 — Tool Use Message Flow

┌─────────────────────────────────────────────────────────────┐
│                   Tool Use Message Flow                      │
│                                                             │
│  Client                              Claude                 │
│    │                                   │                    │
│    │── messages + tools ──────────────▶│                    │
│    │                                   │ (decides to        │
│    │                                   │  use a tool)       │
│    │◀─ stop_reason="tool_use" ─────────│                    │
│    │   content=[tool_use block]        │                    │
│    │                                   │                    │
│    │  [Client executes tool]           │                    │
│    │                                   │                    │
│    │── messages + tool_result ────────▶│                    │
│    │                                   │ (formulates        │
│    │                                   │  final answer)     │
│    │◀─ stop_reason="end_turn" ─────────│                    │
│    │   content=[text block]            │                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Content block types in tool use:

  • tool_use — Claude is requesting a tool call (has id, name, input)
  • tool_result — Client’s response to a tool call (has tool_use_id, content)
  • text — Claude’s text response (final answer or intermediate thinking)

5.3 — Complete Tool Use Loop

import json
from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

# Tool definitions
tools = [
    {
        "name": "get_stock_price",
        "description": "Get the current stock price for a given ticker symbol. Returns price in USD.",
        "input_schema": {
            "type": "object",
            "properties": {
                "ticker": {
                    "type": "string",
                    "description": "Stock ticker symbol (e.g. GOOG, AAPL, MSFT)"
                }
            },
            "required": ["ticker"]
        }
    },
    {
        "name": "calculate_portfolio_value",
        "description": "Calculate total value of a stock portfolio given tickers and share counts.",
        "input_schema": {
            "type": "object",
            "properties": {
                "holdings": {
                    "type": "array",
                    "description": "List of holdings with ticker and shares",
                    "items": {
                        "type": "object",
                        "properties": {
                            "ticker": {"type": "string"},
                            "shares": {"type": "number"}
                        },
                        "required": ["ticker", "shares"]
                    }
                }
            },
            "required": ["holdings"]
        }
    }
]

# Tool implementation functions
def get_stock_price(ticker: str) -> dict:
    """Simulated stock price lookup."""
    prices = {"GOOG": 175.20, "AAPL": 189.50, "MSFT": 415.30, "AMZN": 195.00}
    if ticker.upper() in prices:
        return {"ticker": ticker.upper(), "price": prices[ticker.upper()], "currency": "USD"}
    return {"error": f"Unknown ticker: {ticker}"}

def calculate_portfolio_value(holdings: list) -> dict:
    """Calculate total portfolio value."""
    total = 0.0
    breakdown = []
    for holding in holdings:
        price_data = get_stock_price(holding["ticker"])
        if "error" in price_data:
            return price_data
        value = price_data["price"] * holding["shares"]
        total += value
        breakdown.append({
            "ticker": holding["ticker"],
            "shares": holding["shares"],
            "price": price_data["price"],
            "value": value
        })
    return {"total_value": total, "breakdown": breakdown, "currency": "USD"}

# Dispatch table
TOOL_FUNCTIONS = {
    "get_stock_price": get_stock_price,
    "calculate_portfolio_value": calculate_portfolio_value
}

def execute_tool(name: str, inputs: dict) -> str:
    """Execute a tool and return result as string."""
    if name not in TOOL_FUNCTIONS:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        result = TOOL_FUNCTIONS[name](**inputs)
        return json.dumps(result)
    except Exception as e:
        return json.dumps({"error": str(e)})

def run_agent(user_message: str) -> str:
    """Run the full tool-use agent loop."""
    messages = [{"role": "user", "content": user_message}]
    max_iterations = 10

    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-3-5-sonnet@20241022",
            max_tokens=2048,
            tools=tools,
            messages=messages
        )

        print(f"[Iteration {iteration+1}] Stop reason: {response.stop_reason}")

        # Task complete — return final text response
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "[No text response]"

        # Handle tool use
        if response.stop_reason == "tool_use":
            # Append Claude's turn (includes tool_use blocks)
            messages.append({"role": "assistant", "content": response.content})

            # Execute all tool calls (Claude may request multiple in parallel)
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    print(f"  Calling tool: {block.name}({block.input})")
                    result = execute_tool(block.name, block.input)
                    print(f"  Result: {result}")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result
                    })

            # Append tool results and continue
            messages.append({"role": "user", "content": tool_results})

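        else:
            # Unexpected stop reason (for example "max_tokens"): report it rather than
            # falling through to the max-iterations message
            return f"Stopped with unexpected stop_reason: {response.stop_reason}"

    return "Max iterations reached without completing task."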

# Test the agent
result = run_agent("What is the total value of a portfolio with 100 shares of Google and 50 shares of Apple?")
print(f"\nFinal answer: {result}")

5.4 — tool_choice Parameter

# Default — Claude decides whether and which tool to use
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "auto"},  # Default behavior
    messages=messages
)

# Force Claude to use at least one tool
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "any"},  # Must call a tool
    messages=messages
)

# Force a specific tool
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "get_stock_price"},
    messages=messages
)

# Disable all tool use
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "none"},  # No tools, text only
    messages=messages
)

5.5 — Parallel Tool Calls

Claude may request multiple tool calls in a single response. Each has a unique id.

# Claude's response content when calling multiple tools:
# [
#   ToolUseBlock(id="toolu_01", name="get_weather", input={"city": "London"}),
#   ToolUseBlock(id="toolu_02", name="get_weather", input={"city": "Tokyo"}),
#   ToolUseBlock(id="toolu_03", name="get_weather", input={"city": "Sydney"})
# ]

# You must provide ALL tool results before Claude can continue:
tool_results = [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": '{"city": "London", "temp": 15}'},
    {"type": "tool_result", "tool_use_id": "toolu_02", "content": '{"city": "Tokyo", "temp": 22}'},
    {"type": "tool_result", "tool_use_id": "toolu_03", "content": '{"city": "Sydney", "temp": 28}'}
]
messages.append({"role": "user", "content": tool_results})
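
Since every `tool_use` block needs a matching `tool_result`, a quick check before sending the next message can catch a dropped result early. A minimal sketch, assuming plain dictionaries as stand-ins for the SDK's block objects and a hypothetical helper name `missing_tool_results`:

```python
from typing import Any, Dict, List


def missing_tool_results(
    assistant_content: List[Dict[str, Any]],
    tool_results: List[Dict[str, Any]],
) -> List[str]:
    """Return the ids of tool_use blocks that have no matching tool_result."""
    requested = [b["id"] for b in assistant_content if b.get("type") == "tool_use"]
    answered = {r["tool_use_id"] for r in tool_results if r.get("type") == "tool_result"}
    return [tool_id for tool_id in requested if tool_id not in answered]


# Plain-dict stand-ins for the blocks shown above
assistant_content = [
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather", "input": {"city": "London"}},
    {"type": "tool_use", "id": "toolu_02", "name": "get_weather", "input": {"city": "Tokyo"}},
]
partial_results = [
    {"type": "tool_result", "tool_use_id": "toolu_01", "content": '{"city": "London", "temp": 15}'},
]
unanswered = missing_tool_results(assistant_content, partial_results)
print(unanswered)
```

If the returned list is not empty, the loop can either run the missing tools or send an error result for each id rather than an incomplete message.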

5.6 — Specialized Tools

Bash tool (for Claude Code / computer use contexts):

bash_tool = {
    "name": "bash",
    "description": "Execute shell commands and return stdout/stderr",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "description": "Shell command to execute"
            }
        },
        "required": ["command"]
    }
}

Text editor tool:

text_editor_tool = {
    "name": "str_replace_editor",
    "description": "View, create, and edit files",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {
                "type": "string",
                "enum": ["view", "create", "str_replace", "insert", "undo_edit"]
            },
            "path": {"type": "string"},
            "file_text": {"type": "string"},
            "old_str": {"type": "string"},
            "new_str": {"type": "string"},
            "insert_line": {"type": "integer"},
            "new_str_after_insert": {"type": "string"}
        },
        "required": ["command", "path"]
    }
}

Web search tool:

web_search_tool = {
    "name": "web_search",
    "description": "Search the web for current information. Use when you need recent data, news, or facts not in your training data.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Search query"
            }
        },
        "required": ["query"]
    }
}

5.7 — Error Handling in Tool Use

def execute_tool_safe(name: str, inputs: dict) -> str:
    """Execute tool with error handling — return error as JSON string, not exception."""
    try:
        if name == "get_stock_price":
            result = get_stock_price(**inputs)
        elif name == "calculate_portfolio_value":
            result = calculate_portfolio_value(**inputs)
        else:
            result = {"error": f"Unknown tool: {name}"}
        return json.dumps(result)
    except KeyError as e:
        return json.dumps({"error": f"Missing required parameter: {e}"})
    except ValueError as e:
        return json.dumps({"error": f"Invalid parameter value: {e}"})
    except Exception as e:
        return json.dumps({"error": f"Tool execution failed: {str(e)}"})

# Tool error response pattern — Claude will acknowledge the error and may retry or explain
tool_error_result = {
    "type": "tool_result",
    "tool_use_id": "toolu_01",
    "content": '{"error": "Stock price service unavailable. Please try again."}',
    "is_error": True  # Optional flag to mark as error result
}
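
The error pattern above can be folded into one wrapper that always yields a well-formed `tool_result`, whether the tool succeeds or fails. This is a sketch under assumptions: `run_tool_as_result` and the `divide` tool are hypothetical names used only for illustration.

```python
import json
from typing import Any, Callable, Dict


def run_tool_as_result(
    tool_use_id: str,
    func: Callable[..., Any],
    inputs: Dict[str, Any],
) -> Dict[str, Any]:
    """Run func(**inputs) and wrap the outcome in a tool_result block.

    Failures are reported with is_error=True instead of being raised.
    """
    try:
        content = json.dumps(func(**inputs))
        failed = False
    except Exception as exc:  # broad on purpose: the agent loop must keep running
        content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
        failed = True
    result = {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
    if failed:
        result["is_error"] = True
    return result


def divide(a: float, b: float) -> dict:
    """Hypothetical tool used only for this demonstration."""
    return {"quotient": a / b}


good = run_tool_as_result("toolu_01", divide, {"a": 6, "b": 3})
failed_call = run_tool_as_result("toolu_02", divide, {"a": 1, "b": 0})
print(good)
print(failed_call)
```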

Best Practices

  • Write descriptive tool names and descriptions — Claude’s tool routing depends entirely on them
  • Handle tool execution failures gracefully — return error message in tool_result.content, never raise exceptions that crash the loop
  • Always loop on stop_reason == "tool_use" until end_turn — never assume a single cycle is sufficient
  • Validate tool inputs server-side before executing — do not trust model-provided values for security-sensitive operations
  • Use tool_choice={"type": "any"} only when a tool call is required for correctness
  • Return all parallel tool results before sending the next message — partial results will cause API errors
  • Cap agent loops with max_iterations to prevent runaway execution
  • Log every tool call and result for debugging and audit trails

MODULE 6: Retrieval Augmented Generation (RAG)

Key Notes

6.1 — RAG Architecture on GCP

Documents (PDFs, web pages, docs)
         │
         ▼
  ┌─────────────┐
  │   Chunking  │  ← Split into 512-1000 token segments
  └──────┬──────┘
         │
         ▼
  ┌──────────────────────┐
  │ Vertex AI Embeddings │  ← text-embedding-004 (768 dims)
  └──────────┬───────────┘
             │
             ▼
  ┌─────────────────────┐
  │ Vertex AI Vector    │  ← ANN index for similarity search
  │ Search (Index)      │
  └─────────────────────┘

Query Time:
  User Query
      │
      ▼
  Embed Query (Vertex AI Embeddings)
      │
      ▼
  ANN Search → Top-K chunk IDs + scores
      │
      ▼
  Fetch chunk text from storage (GCS / Firestore)
      │
      ▼
  (Optional) Rerank with Vertex AI Ranking API
      │
      ▼
  Assemble Context (top-5 chunks + metadata)
      │
      ▼
  Claude generates grounded answer

6.2 — Text Chunking Strategies

from typing import List

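def chunk_fixed_size(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Fixed-size chunking with overlap to avoid losing context at boundaries.

    Sizes are counted in whitespace-separated words, not model tokens.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # last chunk reached; avoids a redundant overlap-only tail
        start = end - overlap  # step back so adjacent chunks share context
    return chunks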

def chunk_by_paragraph(text: str, max_words: int = 1000) -> List[str]:
    """Split at paragraph boundaries — preserves natural structure."""
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = []
    current_count = 0

    for para in paragraphs:
        words = para.split()
        if current_count + len(words) > max_words and current_chunk:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = [para]
            current_count = len(words)
        else:
            current_chunk.append(para)
            current_count += len(words)

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))
    return chunks

def chunk_by_sections(text: str) -> List[dict]:
    """Chunk by markdown headers — preserves document structure."""
    import re
    sections = re.split(r'\n(?=#{1,3} )', text)
    chunks = []
    for section in sections:
        if section.strip():
            title_match = re.match(r'(#{1,3})\s+(.+)', section)
            title = title_match.group(2) if title_match else "Section"
            chunks.append({"title": title, "content": section.strip()})
    return chunks

Chunking strategy guide:

Fixed-size    → Simple implementation, consistent sizes, use as baseline
Paragraph     → Preserves natural reading flow, variable sizes
Section-based → Best for structured documents (docs, reports)
Semantic      → Best quality, requires additional NLP model
Sliding window → Use when information spans paragraph boundaries
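
The sliding-window row in the guide above has no accompanying code. One possible reading, sketched here, moves a window over sentences so that consecutive chunks share content. The sentence splitter is a naive regex and the function name `chunk_sliding_sentences` is hypothetical.

```python
import re
from typing import List


def chunk_sliding_sentences(text: str, window: int = 3, stride: int = 2) -> List[str]:
    """Group sentences into overlapping windows.

    With window=3 and stride=2, consecutive chunks share one sentence.
    """
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be at least 1")
    # Naive sentence split: ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for start in range(0, len(sentences), stride):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # this window already reaches the end of the text
    return chunks


sample = "A one. B two. C three. D four. E five."
windows = chunk_sliding_sentences(sample)
print(windows)
```

A stride smaller than the window gives overlap; a stride equal to the window reduces this to plain non-overlapping sentence groups.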

6.3 — Vertex AI Embeddings API

from typing import List

import vertexai
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

# Initialize Vertex AI
vertexai.init(project="my-gcp-project", location="us-central1")

# Load embedding model
embedding_model = TextEmbeddingModel.from_pretrained("text-embedding-004")

def embed_text(text: str) -> List[float]:
    """Embed a single text string."""
    embeddings = embedding_model.get_embeddings([text])
    return embeddings[0].values  # 768-dimensional float vector

def embed_batch(texts: List[str], batch_size: int = 250) -> List[List[float]]:
    """Embed multiple texts with batching."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings = embedding_model.get_embeddings(batch)
        all_embeddings.extend([e.values for e in embeddings])
    return all_embeddings

# Task-specific embeddings (text-embedding-004 supports task types)
def embed_for_retrieval(query: str) -> List[float]:
    """Embed a query optimized for retrieval tasks."""
    inputs = [TextEmbeddingInput(query, task_type="RETRIEVAL_QUERY")]
    embeddings = embedding_model.get_embeddings(inputs)
    return embeddings[0].values

def embed_document_for_indexing(doc: str) -> List[float]:
    """Embed a document optimized for storage in vector index."""
    inputs = [TextEmbeddingInput(doc, task_type="RETRIEVAL_DOCUMENT")]
    embeddings = embedding_model.get_embeddings(inputs)
    return embeddings[0].values

Vertex AI Embedding Models:

Model | Dimensions | Notes
text-embedding-004 | 768 | Latest, best quality, supports task types
textembedding-gecko@003 | 768 | Previous generation
textembedding-gecko-multilingual@001 | 768 | 100+ languages
text-multilingual-embedding-002 | 768 | Latest multilingual

6.4 — Vertex AI Vector Search

from google.cloud import aiplatform

# Initialize
aiplatform.init(project="my-gcp-project", location="us-central1")

# Create an index (one-time setup — can take 30-60 minutes)
my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="rag-document-index",
    dimensions=768,
    approximate_neighbors_count=100,
    distance_measure_type="COSINE_DISTANCE",
    index_update_method="STREAM_UPDATE"  # or BATCH_UPDATE
)

# Deploy index to endpoint
my_index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="rag-index-endpoint",
    public_endpoint_enabled=True
)
my_index_endpoint.deploy_index(
    index=my_index,
    deployed_index_id="rag_deployed_index"
)

# Upsert vectors (add documents to index)
# IndexDatapoint is the message type that upsert_datapoints expects
from google.cloud.aiplatform_v1.types import IndexDatapoint

my_index.upsert_datapoints(
    datapoints=[
        IndexDatapoint(
            datapoint_id=str(chunk_id),
            feature_vector=embedding_vector,  # list of 768 floats
            restricts=[]                       # optional filtering metadata
        )
        for chunk_id, embedding_vector in chunk_embeddings.items()
    ]
)

# Query the index at runtime
def vector_search(query: str, top_k: int = 20) -> List[dict]:
    query_embedding = embed_for_retrieval(query)
    response = my_index_endpoint.find_neighbors(
        deployed_index_id="rag_deployed_index",
        queries=[query_embedding],
        num_neighbors=top_k
    )
    neighbors = response[0]  # First query's results
    return [
        {"id": n.id, "distance": n.distance}
        for n in neighbors
    ]
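
To make the role of the ANN index concrete, the sketch below performs the exact version of the same search: it computes cosine distance against every stored vector and keeps the closest ones. An ANN index approximates this result without scanning everything. The three-dimensional vectors and the names `cosine_distance` and `exact_top_k` are illustrative assumptions, not part of any Vertex AI API.

```python
import math
from typing import Dict, List, Tuple


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / norm


def exact_top_k(
    query: List[float],
    vectors: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Brute-force nearest neighbours, sorted from closest to farthest."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}
nearest = exact_top_k([1.0, 0.0, 0.0], toy_vectors, top_k=2)
print(nearest)
```

Brute force is fine for a few thousand chunks and is a useful baseline for checking ANN recall on a sample of queries.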

6.5 — BM25 Keyword Search

BM25 excels at exact keyword matches that vector search may miss (product IDs, proper nouns, technical terms).

from rank_bm25 import BM25Okapi

class BM25Index:
    def __init__(self):
        self.corpus = []
        self.chunk_ids = []
        self.bm25 = None

    def add_documents(self, chunks: List[dict]):
        """Index a list of chunks with {id, text}."""
        for chunk in chunks:
            tokens = self._tokenize(chunk["text"])
            self.corpus.append(tokens)
            self.chunk_ids.append(chunk["id"])
        self.bm25 = BM25Okapi(self.corpus)

    def _tokenize(self, text: str) -> List[str]:
        return text.lower().split()

    def search(self, query: str, top_k: int = 20) -> List[dict]:
        tokens = self._tokenize(query)
        scores = self.bm25.get_scores(tokens)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [
            {"id": self.chunk_ids[i], "score": scores[i]}
            for i in top_indices if scores[i] > 0
        ]
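
For intuition about what the library computes, here is a from-scratch scoring sketch. It uses the widely cited BM25 formula with a non-negative IDF variant, ln((N - n + 0.5) / (n + 0.5) + 1), which may differ in detail from the IDF handling inside `rank_bm25`. The function name `bm25_scores` and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    corpus: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every tokenized document in `corpus` against the query."""
    n_docs = len(corpus)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    ["sku", "ab-1234", "laptop"],
    ["laptop", "sleeve"],
    ["coffee", "mug"],
]
id_scores = bm25_scores(["ab-1234"], toy_corpus)
laptop_scores = bm25_scores(["laptop"], toy_corpus)
print(id_scores, laptop_scores)
```

The product-ID query matches only the first document, which is the exact-match behaviour described above. For the shared term `laptop`, the shorter document scores higher because of length normalization.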

6.6 — Hybrid Search with Reciprocal Rank Fusion

def reciprocal_rank_fusion(
    *result_lists: List[dict],
    k: int = 60,
    id_field: str = "id"
) -> List[dict]:
    """Combine multiple ranked result lists using RRF."""
    scores = {}
    for results in result_lists:
        for rank, result in enumerate(results):
            doc_id = result[id_field]
            if doc_id not in scores:
                scores[doc_id] = 0
            scores[doc_id] += 1 / (k + rank + 1)

    sorted_ids = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"id": doc_id, "rrf_score": score} for doc_id, score in sorted_ids]

def hybrid_search(query: str, top_k: int = 10) -> List[dict]:
    """Combine vector and BM25 search with RRF."""
    vector_results = vector_search(query, top_k=50)   # More candidates
    bm25_results = bm25_index.search(query, top_k=50)

    combined = reciprocal_rank_fusion(vector_results, bm25_results)
    return combined[:top_k]
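
A small worked example may help show how the fusion behaves. It re-implements the same scoring rule, 1 / (k + rank) with ranks starting at 1, on two toy rankings of bare document IDs. The names `rrf_scores`, `vector_ranking` and `keyword_ranking` are made up for this sketch.

```python
from typing import Dict, List


def rrf_scores(result_lists: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) for each document across all ranked lists."""
    scores: Dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


vector_ranking = ["a", "b", "c"]   # best match first
keyword_ranking = ["b", "c", "d"]

scores = rrf_scores([vector_ranking, keyword_ranking])
fused_order = sorted(scores, key=scores.get, reverse=True)
print(fused_order)
```

Documents `b` and `c` appear in both lists, so they outrank `a` even though `a` was the top vector hit. Only rank positions matter, which is why RRF can combine distances and BM25 scores that live on different scales.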

6.7 — Contextual Retrieval (Anthropic’s Technique)

Generate a short context for each chunk and prepend it before indexing, so that each chunk stays interpretable when retrieved in isolation; this improves retrieval precision.

def generate_chunk_context(document_summary: str, chunk: str) -> str:
    """Generate a contextual description of where this chunk fits in the document."""
    response = client.messages.create(
        model="claude-3-5-haiku@20241022",  # Use fast model for indexing
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""<document_summary>
{document_summary}
</document_summary>

<chunk>
{chunk}
</chunk>

Write a 2-3 sentence context that explains what this chunk is about, which section it belongs to, and what broader topic it covers. This context will be prepended to the chunk for retrieval purposes."""
        }]
    )
    context = response.content[0].text
    return f"{context}\n\n{chunk}"  # Prepend context to chunk

# Index with contextual enrichment
def index_document_with_context(document: dict, chunks: List[str]):
    doc_summary = summarize_document(document["full_text"])
    enriched_chunks = []
    for chunk in chunks:
        enriched = generate_chunk_context(doc_summary, chunk)
        enriched_chunks.append({
            "id": generate_id(),
            "text": enriched,
            "original_chunk": chunk,
            "source": document["title"]
        })
    # Embed enriched text, store originals for retrieval
    return enriched_chunks
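
The snippet above calls `generate_id` without defining it and leaves the final split between indexed text and stored text as a comment. One way to fill both in is sketched below, with the caveat that the UUID-based ID format and the helper name `split_for_indexing` are assumptions. `summarize_document` would be another Claude call and is not shown.

```python
import uuid
from typing import Dict, List, Tuple


def generate_id() -> str:
    """Return a random unique chunk ID (hypothetical format: 32 hex characters)."""
    return uuid.uuid4().hex


def split_for_indexing(
    enriched_chunks: List[dict],
) -> Tuple[Dict[str, str], Dict[str, dict]]:
    """Separate what gets embedded from what gets served to Claude.

    Returns (to_embed, chunks_store):
      to_embed      maps chunk id to the context-enriched text for the index
      chunks_store  maps chunk id to the original text and its source
    """
    to_embed = {c["id"]: c["text"] for c in enriched_chunks}
    chunks_store = {
        c["id"]: {"text": c["original_chunk"], "source": c["source"]}
        for c in enriched_chunks
    }
    return to_embed, chunks_store


example_chunks = [{
    "id": generate_id(),
    "text": "From the pricing section of the handbook.\n\nPlan B costs $20.",
    "original_chunk": "Plan B costs $20.",
    "source": "Handbook",
}]
to_embed, chunks_store = split_for_indexing(example_chunks)
print(chunks_store)
```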

6.8 — Reranking

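from google.cloud import discoveryengine_v1 as discoveryengine

# The Vertex AI Ranking API is served through the Discovery Engine client library
rank_client = discoveryengine.RankServiceClient()
ranking_config = rank_client.ranking_config_path(
    project="my-gcp-project",
    location="global",
    ranking_config="default_ranking_config"
)

def rerank_results(query: str, candidates: List[dict], top_n: int = 5) -> List[dict]:
    """Rerank candidates using Vertex AI Ranking API."""
    records = [
        discoveryengine.RankingRecord(
            id=str(c["id"]),
            content=c["text"][:512]  # crude cap: this slices characters, not tokens
        )
        for c in candidates
    ]
    response = rank_client.rank(
        request=discoveryengine.RankRequest(
            ranking_config=ranking_config,
            model="semantic-ranker-512@latest",
            query=query,
            records=records,
            top_n=top_n
        )
    )
    # Map back to original candidates, preserving rank order
    id_to_candidate = {str(c["id"]): c for c in candidates}
    return [id_to_candidate[r.id] for r in response.records]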

6.9 — Assembling Context and Generating Answers

def rag_answer(user_query: str, chunks_store: dict) -> str:
    """Full RAG pipeline: retrieve → rerank → generate."""

    # Step 1: Retrieve candidates
    candidates = hybrid_search(user_query, top_k=20)

    # Step 2: Fetch chunk text from store
    candidate_chunks = [
        {"id": c["id"], "text": chunks_store[c["id"]]["text"],
         "source": chunks_store[c["id"]]["source"]}
        for c in candidates
        if c["id"] in chunks_store
    ]

    # Step 3: Rerank to top-5
    top_chunks = rerank_results(user_query, candidate_chunks, top_n=5)

    # Step 4: Build context string with citations
    context_parts = []
    for i, chunk in enumerate(top_chunks):
        context_parts.append(f"[Source {i+1}: {chunk['source']}]\n{chunk['text']}")
    context = "\n\n---\n\n".join(context_parts)

    # Step 5: Generate grounded answer
    response = client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=1024,
        system="""You are a helpful assistant. Answer questions based solely on the provided context.
If the context doesn't contain enough information, say so clearly.
Always cite which source(s) you used (e.g., [Source 1], [Source 2]).""",
        messages=[{
            "role": "user",
            "content": f"<context>\n{context}\n</context>\n\nQuestion: {user_query}"
        }]
    )
    return response.content[0].text

6.10 — Vertex AI Search (Managed RAG)

Fully managed alternative: skip manual vector search infrastructure.

from google.cloud import discoveryengine_v1 as discoveryengine

def vertex_ai_search(project_id: str, location: str, data_store_id: str, query: str) -> list:
    """Query Vertex AI Search (Discovery Engine) managed search."""
    client = discoveryengine.SearchServiceClient()

    serving_config = f"projects/{project_id}/locations/{location}/collections/default_collection/dataStores/{data_store_id}/servingConfigs/default_config"

    request = discoveryengine.SearchRequest(
        serving_config=serving_config,
        query=query,
        page_size=5,
        content_search_spec=discoveryengine.SearchRequest.ContentSearchSpec(
            snippet_spec=discoveryengine.SearchRequest.ContentSearchSpec.SnippetSpec(
                return_snippet=True
            ),
            extractive_content_spec=discoveryengine.SearchRequest.ContentSearchSpec.ExtractiveContentSpec(
                max_extractive_answer_count=3,
                max_extractive_segment_count=5
            )
        )
    )
    response = client.search(request)
    results = []
    for result in response.results:
        doc = result.document
        results.append({
            "id": doc.id,
            # derived_struct_data behaves like a mapping, so entries are read by key
            "snippets": [s.get("snippet", "") for s in doc.derived_struct_data.get("snippets", [])]
        })
    return results

Best Practices

  • Retrieve more than needed (top-20), rerank, then send top-5 to Claude — avoid overwhelming the context
  • Keep retrieved context compact, roughly under 10K tokens as a rule of thumb: extra low-relevance chunks add cost and can dilute answer quality
  • Include document metadata (title, source URL, section) for citation support
  • Use hybrid search by default — pure vector search misses exact keyword matches
  • Use Vertex AI Search when you want managed ingestion + search with minimal operational overhead
  • Use Contextual Retrieval for ambiguous queries in large document corpora: Anthropic reports a 35% reduction in failed retrievals from contextual embeddings alone, and 49% when combined with contextual BM25
  • Store original chunk text separately from enriched indexing text — serve the original to Claude
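
The context-size guideline above can be enforced with a simple budget check before the prompt is assembled. The sketch assumes a rough heuristic of about 4 characters per token, which is an approximation and not the model's real tokenizer. The names `estimate_tokens` and `trim_to_budget` are hypothetical.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def trim_to_budget(chunks: List[Dict[str, str]], max_tokens: int = 10_000) -> List[Dict[str, str]]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    kept = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk["text"])
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept


ranked = [
    {"id": "c1", "text": "x" * 400},  # about 100 tokens under the heuristic
    {"id": "c2", "text": "y" * 400},
    {"id": "c3", "text": "z" * 400},
]
selected = trim_to_budget(ranked, max_tokens=250)
print([c["id"] for c in selected])
```

Stopping at the first chunk that does not fit preserves the reranker's ordering; skipping it and trying smaller chunks further down is an alternative that trades ordering for fuller use of the budget.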

MODULE 7: Features of Claude

Key Notes

7.1 — Extended Thinking

Extended thinking exposes Claude’s chain-of-thought reasoning before the final answer, which improves accuracy on complex reasoning tasks.

from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

response = client.messages.create(
    model="claude-3-7-sonnet@20250219",       # Extended thinking requires Claude 3.7 Sonnet or later
    max_tokens=8000,                          # Must be > budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 5000                 # Max tokens Claude can "think" with
    },
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
)

# Response contains thinking blocks AND text blocks
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking - {len(block.thinking)} chars]")
        print(block.thinking[:300] + "...")  # Show preview of reasoning
    elif block.type == "text":
        print(f"[Answer]")
        print(block.text)

# Check token usage for thinking
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")  # Includes thinking tokens

Extended thinking rules:

  • max_tokens must be greater than budget_tokens
  • Minimum budget_tokens: 1024
  • Maximum budget_tokens: must stay below max_tokens, which is itself capped by the model’s output limit
  • thinking blocks appear before text blocks in content
  • Cannot use temperature parameter when thinking is enabled
  • Streaming works — thinking block content is streamed as thinking deltas
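
The rules listed above lend themselves to a pre-flight check that runs before the request is sent. A minimal sketch, assuming a hypothetical helper `check_thinking_params` that mirrors only the constraints stated in this section:

```python
from typing import List, Optional

MIN_THINKING_BUDGET = 1024  # minimum budget_tokens stated in the rules above


def check_thinking_params(
    max_tokens: int,
    budget_tokens: int,
    temperature: Optional[float] = None,
) -> List[str]:
    """Return a list of rule violations; an empty list means the request looks valid."""
    problems = []
    if budget_tokens < MIN_THINKING_BUDGET:
        problems.append(f"budget_tokens must be at least {MIN_THINKING_BUDGET}")
    if max_tokens <= budget_tokens:
        problems.append("max_tokens must be greater than budget_tokens")
    if temperature is not None:
        problems.append("temperature cannot be set when thinking is enabled")
    return problems


valid = check_thinking_params(max_tokens=8000, budget_tokens=5000)
too_small = check_thinking_params(max_tokens=4000, budget_tokens=5000)
print(valid, too_small)
```

Catching these locally gives a clearer message than the validation error the API would otherwise return.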

When to use extended thinking:

Use extended thinking for:          Skip for:
✓ Mathematical proofs               ✗ Simple Q&A
✓ Multi-step logical reasoning      ✗ Classification tasks
✓ Complex planning / strategy       ✗ Text generation
✓ Code debugging                    ✗ Extraction tasks
✓ Scientific analysis               ✗ Summarization

7.2 — Vision: Image Input

import base64
from pathlib import Path

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

# Base64 image input
def encode_image(path: str) -> tuple[str, str]:
    """Returns (base64_data, media_type)."""
    suffix_to_type = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif",
        ".webp": "image/webp"
    }
    path_obj = Path(path)
    media_type = suffix_to_type.get(path_obj.suffix.lower(), "image/jpeg")
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return data, media_type

img_data, media_type = encode_image("architecture_diagram.png")

response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": img_data
                }
            },
            {"type": "text", "text": "Describe this architecture diagram and identify any potential bottlenecks."}
        ]
    }]
)
print(response.content[0].text)

# URL-based image input (for publicly accessible images)
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "url",
                    "url": "https://storage.googleapis.com/my-bucket/chart.png"
                }
            },
            {"type": "text", "text": "What trend does this chart show?"}
        ]
    }]
)

# Multiple images in one message
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two UI designs:"},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img1_data}},
            {"type": "text", "text": "vs"},
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img2_data}},
            {"type": "text", "text": "Which design is more intuitive and why?"}
        ]
    }]
)

Vision token costs: Images are tokenized based on pixel dimensions. Resize large images before sending to reduce costs:

from PIL import Image
import io

def resize_image_for_claude(image_path: str, max_dimension: int = 1568) -> bytes:
    """Resize image to stay within Claude's recommended dimensions."""
    with Image.open(image_path) as img:
        # Maintain aspect ratio
        img.thumbnail((max_dimension, max_dimension), Image.Resampling.LANCZOS)
        buffer = io.BytesIO()
        img.save(buffer, format="PNG")
        return buffer.getvalue()
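
To get a feel for how image size drives cost, the sketch below estimates image tokens with the approximation tokens ≈ (width × height) / 750 given in Anthropic's vision documentation. The helper name `estimate_image_tokens` and the integer rounding are assumptions made for illustration; billed tokens may differ, and the estimate ignores any further downscaling the API may apply.

```python
def estimate_image_tokens(width: int, height: int, max_dimension: int = 1568) -> int:
    """Rough token estimate for an image (hypothetical helper, not an SDK function)."""
    longest = max(width, height)
    if longest > max_dimension:
        # Mirror the resize step above: shrink the longest side, keep the aspect ratio
        width = width * max_dimension // longest
        height = height * max_dimension // longest
    # Approximation: tokens ~= pixels / 750
    return (width * height) // 750

print(estimate_image_tokens(1000, 1000))   # 1333
print(estimate_image_tokens(4000, 3000))   # 2458 (estimated at 1568 x 1176 after resizing)
```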

7.3 — PDF Support

import base64

def encode_pdf(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

pdf_data = encode_pdf("annual_report_2024.pdf")

response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                },
                "title": "Annual Report 2024",       # Optional metadata
                "context": "This is our company's financial report"  # Optional context
            },
            {"type": "text", "text": "What was the total revenue for Q3 2024? Provide the exact figure."}
        ]
    }]
)
print(response.content[0].text)

7.4 — Citations

# Enable citations for document-grounded answers
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    "data": """Section 3.2 - Revenue
Q1 revenue was $4.2M. Q2 increased to $5.1M.
Q3 revenue reached $6.8M, a 33% quarter-over-quarter increase.
Q4 is projected at $7.5M based on current pipeline."""
                },
                "title": "Quarterly Financial Summary",
                "citations": {"enabled": True}  # Citations must be enabled on each document block
            },
            {
                "type": "text",
                "text": "What was the Q3 revenue and how did it compare to Q2?"
            }
        ]
    }]
)

# Citations are attached to text blocks (block.citations), not returned as separate blocks
for block in response.content:
    if block.type == "text":
        print(block.text)
        for citation in (block.citations or []):
            print(f"[Citation: {citation.cited_text} from {citation.document_title}]")

7.5 — Prompt Caching

Prompt caching reduces cost and latency for prompts that repeat a large, unchanged prefix (system prompt, document, or few-shot examples) across requests.

# Cache a large system prompt or document that repeats across many requests
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT_OR_DOCUMENT,  # 10,000+ tokens
            "cache_control": {"type": "ephemeral"}     # Cache this block
        }
    ],
    messages=[{"role": "user", "content": "Question 1 about the document..."}]
)

# First request: cache MISS — 25% premium on cached tokens (cache write)
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")   # 0

# Second request (within 5 min TTL): cache HIT — 90% discount on cached tokens
response2 = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT_OR_DOCUMENT,  # Same text = cache hit
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Question 2 about the document..."}]
)
print(f"Cache write tokens: {response2.usage.cache_creation_input_tokens}")  # 0
print(f"Cache read tokens: {response2.usage.cache_read_input_tokens}")       # Large number

# Cache in conversation turns (for few-shot examples)
messages_with_cache = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": STATIC_FEW_SHOT_EXAMPLES,
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": f"Now classify this: {dynamic_input}"
            }
        ]
    }
]

Prompt caching economics:

Cache write:  1.25x normal input token price  (25% premium)
Cache read:   0.10x normal input token price  (90% discount)
Cache TTL:    5 minutes (ephemeral)

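Break-even:   ~1.28 requests (solve 1.25 + 0.10(N-1) = N), so caching pays off from the 2nd request within the TTL
Ideal for:    Static documents, system prompts, few-shot examples above the minimum cacheable length
              (1024 tokens for the Sonnet and Opus models in this guide, 2048 for Haiku)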
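
The multipliers above can be turned into a quick comparison. This is a minimal sketch that assumes every request reuses the same prefix and arrives within the 5-minute TTL; `relative_input_cost` is a hypothetical helper, and costs are expressed in units of one uncached read of the prefix.

```python
def relative_input_cost(n_requests: int, cached: bool) -> float:
    """Cost of the shared prefix over n requests, in units of one uncached read."""
    if n_requests <= 0:
        return 0.0
    if not cached:
        return float(n_requests)
    # One cache write (1.25x), then cache reads (0.10x) for the remaining requests
    return 1.25 + 0.10 * (n_requests - 1)

for n in (1, 2, 10):
    uncached = relative_input_cost(n, cached=False)
    cached = round(relative_input_cost(n, cached=True), 2)
    print(f"{n} request(s): uncached={uncached}, cached={cached}")
```

A single request costs more with caching (1.25 vs 1.00), two requests already cost less (1.35 vs 2.00), and the gap widens with every further cache hit.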

Best Practices

  • Use extended thinking for math, logic, multi-step planning — not for simple Q&A
  • Cache system prompts and static documents that repeat across many requests: send the same cache_control marker on every request, since each cache hit refreshes the 5-minute TTL
  • Resize images before sending — smaller images = fewer tokens; max recommended: 1568px on longest side
  • Enable citations when source attribution matters for trust and auditability (legal, medical, compliance)
  • Monitor cache_read_input_tokens vs cache_creation_input_tokens to measure cache effectiveness
  • Place cacheable content at the start of the prompt (prefix caching: everything up to and including the block carrying the cache marker is cached)
  • Extended thinking increases output tokens significantly — adjust max_tokens accordingly

MODULE 8: Model Context Protocol (MCP)

Key Notes

8.1 — MCP Overview

MCP (Model Context Protocol) is an open standard for connecting LLMs to external tools, data sources, and capabilities. It separates tool implementation (server) from tool usage (client/LLM).

Three MCP primitives:

  • Tools — functions Claude can call (like API tool use, but standardized)
  • Resources — data sources Claude can read (files, database rows, configs)
  • Prompts — reusable prompt templates that can be invoked by name

┌─────────────────────────────────────────────────────────────────┐
│                        MCP Architecture                         │
│                                                                 │
│  Your App               MCP Client          MCP Server         │
│  ─────────              ──────────          ──────────         │
│  AnthropicVertex   ──▶  Tool schemas   ──▶  Tool impl          │
│  .messages.create()     (auto-loaded)       (BigQuery,          │
│  with tools=            from server         GCS, APIs, etc.)   │
│                                                                 │
│  Claude decides         Routes tool_use     Executes + returns  │
│  which tools to use     block to server     results to client   │
└─────────────────────────────────────────────────────────────────┘

8.2 — Building an MCP Server

from mcp.server.fastmcp import FastMCP
from google.cloud import bigquery, storage
import json

mcp = FastMCP("gcp-data-server")

# --- TOOLS ---

@mcp.tool()
def query_bigquery(sql: str, project_id: str = "my-project") -> str:
    """
    Execute a SQL query against BigQuery and return results as JSON.
    Use this to analyze data, run aggregations, and explore datasets.
    """
    client = bigquery.Client(project=project_id)
    try:
        results = client.query(sql).result()
        rows = [dict(row) for row in results]
        # Limit to 100 rows; default=str serializes dates, timestamps, and decimals
        return json.dumps(rows[:100], default=str)
    except Exception as e:
        return json.dumps({"error": str(e)})

@mcp.tool()
def list_gcs_files(bucket_name: str, prefix: str = "") -> str:
    """
    List files in a Google Cloud Storage bucket with optional prefix filter.
    Returns file names, sizes, and last modified dates.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=prefix)
    files = [
        {"name": b.name, "size": b.size, "updated": str(b.updated)}
        for b in blobs
    ]
    return json.dumps(files)

@mcp.tool()
def read_gcs_file(bucket_name: str, file_path: str) -> str:
    """
    Read text content from a Google Cloud Storage file.
    Returns file content as a string. Use for reading configs, logs, and documents.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(file_path)
    return blob.download_as_text()

# --- RESOURCES ---

@mcp.resource("gcp://project/{project_id}/config")
def get_project_config(project_id: str) -> str:
    """Current GCP project configuration and enabled services."""
    return json.dumps({
        "project_id": project_id,
        "region": "us-east5",
        "enabled_apis": ["aiplatform.googleapis.com", "bigquery.googleapis.com"]
    })

# --- PROMPTS ---

@mcp.prompt()
def sql_analyzer(dataset: str, question: str) -> str:
    """Generate a prompt for analyzing a BigQuery dataset."""
    return f"""You are a BigQuery SQL expert. The user wants to analyze the dataset: {dataset}

Their question is: {question}

First use the query_bigquery tool to explore the schema by running:
SELECT column_name, data_type FROM `{dataset}.INFORMATION_SCHEMA.COLUMNS` LIMIT 50

Then write and execute a query to answer the question."""

if __name__ == "__main__":
    mcp.run()

8.3 — MCP Transport Options

# stdio transport — local process (most common for development)
python my_mcp_server.py

# HTTP/SSE transport: remote server (for production deployments)
# In the server script, replace mcp.run() with mcp.run(transport="sse").
# FastMCP then serves over HTTP (port 8000 by default) and clients connect over the network.
python my_mcp_server.py

# Connecting to an MCP server programmatically
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from anthropic import AnthropicVertex
import asyncio

async def run_with_mcp():
    server_params = StdioServerParameters(
        command="python",
        args=["my_mcp_server.py"]
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools_list = await session.list_tools()
            # Convert to Anthropic tool format
            tools = [
                {
                    "name": tool.name,
                    "description": tool.description,
                    "input_schema": tool.inputSchema
                }
                for tool in tools_list.tools
            ]

            # Now use with AnthropicVertex
            client = AnthropicVertex(region="us-east5", project_id="my-project")
            response = client.messages.create(
                model="claude-3-5-sonnet@20241022",
                max_tokens=2048,
                tools=tools,
                messages=[{"role": "user", "content": "List files in the my-data-bucket bucket"}]
            )

            # Route tool_use back through MCP session
            if response.stop_reason == "tool_use":
                for block in response.content:
                    if block.type == "tool_use":
                        result = await session.call_tool(block.name, block.input)
                        # Continue conversation with result...

asyncio.run(run_with_mcp())
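
The "continue conversation with result" step is left open above. One way to fill it, sketched under assumptions, is to convert the object returned by `session.call_tool` into an Anthropic `tool_result` block. The helper name `mcp_result_to_tool_result` is hypothetical, only text content is handled, and a `SimpleNamespace` stands in for a real MCP result so the example runs without a server.

```python
from types import SimpleNamespace

def mcp_result_to_tool_result(tool_use_id: str, result) -> dict:
    """Convert an MCP call_tool result into an Anthropic tool_result block."""
    # Keep only text content items; other content types are ignored in this sketch
    text_parts = [
        item.text for item in result.content
        if getattr(item, "type", None) == "text"
    ]
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": "\n".join(text_parts),
        "is_error": bool(getattr(result, "isError", False)),
    }

# Stand-in for a real MCP result (assumption: .content items expose .type and .text)
fake_result = SimpleNamespace(
    content=[SimpleNamespace(type="text", text='[{"name": "a.csv"}]')],
    isError=False,
)
block = mcp_result_to_tool_result("toolu_123", fake_result)
print(block)
```

In the loop above, the assistant's `response.content` would be appended to the message history, followed by a user message whose content is the list of these blocks, before calling `messages.create` again.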

8.4 — MCP Inspector for Debugging

# Install and run MCP Inspector
npx @modelcontextprotocol/inspector python my_mcp_server.py

# Opens a web UI on localhost; the URL is printed at startup (port 5173 in older releases, 6274 in newer ones)
# - Lists all tools with their schemas
# - Lists all resources
# - Lists all prompts
# - Test tool calls interactively
# - See raw request/response payloads

8.5 — MCP with GCP Services Pattern

# Pattern: MCP server wraps GCP services with proper ADC auth
# Authentication is handled inside the MCP server — Claude never sees credentials

@mcp.tool()
def run_vertex_prediction(
    endpoint_id: str,
    instances: list,
    project_id: str = "my-project",
    location: str = "us-central1"
) -> str:
    """
    Run a prediction against a deployed Vertex AI model endpoint.
    Returns predictions as JSON.
    """
    from google.cloud import aiplatform
    aiplatform.init(project=project_id, location=location)
    endpoint = aiplatform.Endpoint(endpoint_id)
    prediction = endpoint.predict(instances=instances)
    return json.dumps({"predictions": prediction.predictions})

Best Practices

  • Write descriptive tool names and descriptions — Claude picks tools by name + description alone
  • Validate all inputs inside tool functions — never trust model-provided values for security-sensitive ops
  • Return structured error strings (not exceptions) so Claude can handle failures gracefully and retry
  • Use MCP for reusable tool servers shared across multiple Claude applications (build once, use everywhere)
  • Test with MCP Inspector before wiring into Claude to verify schemas and responses
  • Group related tools in the same MCP server (all GCS tools together, all BigQuery tools together)
  • Use MCP Resources for read-only data access; use Tools for actions that change state
  • Authenticate inside the MCP server using ADC — never pass credentials through tool inputs
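
As an illustration of validating model-provided input, a tool such as query_bigquery could refuse anything that does not look like a single read-only statement. The sketch below is a deliberately conservative, assumption-laden check (`is_read_only_sql` and the keyword list are hypothetical). It can reject some valid queries, and it complements rather than replaces a read-only IAM role on the service account.

```python
import re

# Keywords that indicate a write or DDL statement (illustrative, not exhaustive)
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|merge|drop|create|alter|truncate|grant|revoke)\b",
    re.IGNORECASE,
)

def is_read_only_sql(sql: str) -> bool:
    """Conservative check that a query looks like a single read-only statement."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input, or more than one statement
    if not re.match(r"(?i)^(select|with)\b", statement):
        return False  # must start with SELECT or WITH
    return FORBIDDEN.search(statement) is None

print(is_read_only_sql("SELECT name, created_at FROM users LIMIT 10"))
print(is_read_only_sql("SELECT 1; DELETE FROM users"))
```

A tool would run this check first and return a structured error string when it fails, so the model can rephrase the query.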

MODULE 9: Anthropic Apps & Agents

Key Notes

9.1 — Claude Code

Claude Code is an agentic coding assistant that operates via tool use in a loop: plan → tool call → observe result → continue until task is complete.

# Install Claude Code globally
npm install -g @anthropic-ai/claude-code

# Launch in a project directory
cd my-gcp-project/
claude

# Or run a specific task non-interactively
claude -p "Add error handling to the main.py file" --output-format json

# Configure to use Vertex AI backend (set through environment variables)
export CLAUDE_CODE_USE_VERTEX=1
export CLOUD_ML_REGION=us-east5
export ANTHROPIC_VERTEX_PROJECT_ID=my-gcp-project

# Request deeper reasoning for complex tasks by asking for it in the prompt
claude "Think hard about how to refactor the authentication module to use OAuth 2.0"

Claude Code built-in tools:

| Tool | Purpose |
|------|---------|
| read_file | Read file contents |
| write_file | Create or overwrite files |
| str_replace_editor | Edit existing files |
| bash | Execute shell commands |
| glob | Find files matching pattern |
| grep | Search file contents |
| list_dir | List directory contents |
| web_fetch | Fetch web page content |
| mcp_* | MCP server tools (if configured) |

Claude Code agent loop:

User: "Fix the authentication bug in login.py"
  │
  ▼
Claude: reads login.py (read_file tool)
  │
  ▼
Claude: identifies the bug, reads related files
  │
  ▼
Claude: writes the fix (str_replace_editor tool)
  │
  ▼
Claude: runs tests to verify (bash tool: "python -m pytest tests/")
  │
  ▼
Claude: reports results and explains what was changed

Custom Claude Code setup (CLAUDE.md):

# Project: My GCP Application

## Build & Test
- Install: `uv sync`
- Test: `uv run pytest`
- Lint: `uv run ruff check .`
- Type check: `uv run mypy .`

## Architecture
- FastAPI backend in `backend/`
- Vertex AI integration in `backend/ai/`
- Tests in `tests/`

## Conventions
- All Vertex AI calls use `AnthropicVertex(region="us-east5", project_id="my-project")`
- Model IDs use `@` separator: `claude-3-5-sonnet@20241022`
- Environment vars: `GCP_PROJECT_ID`, `GCP_REGION`

9.2 — Computer Use

Computer Use enables Claude to control a desktop GUI — take screenshots, move the mouse, click, type, and scroll.

Architecture:

┌─────────────────────────────────────────────────────┐
│              Computer Use Architecture               │
│                                                     │
│  Claude                    Execution Environment    │
│  ──────                    ──────────────────────   │
│  Sends screenshot request  ─▶ Takes screenshot      │
│  Analyzes screenshot        ◀─ Returns PNG          │
│  Decides: click (100, 200)  ─▶ Executes mouse click │
│  Analyzes result screenshot ◀─ Returns new PNG      │
│  Continues until done                               │
└─────────────────────────────────────────────────────┘

Computer Use tools:

computer_use_tools = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1
    },
    {
        "type": "text_editor_20241022",
        "name": "str_replace_editor"
    },
    {
        "type": "bash_20241022",
        "name": "bash"
    }
]

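# Computer use is a beta feature: call it through client.beta with the matching beta flag
response = client.beta.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=4096,
    tools=computer_use_tools,
    betas=["computer-use-2024-10-22"],
    messages=[{
        "role": "user",
        "content": "Open a browser, go to cloud.google.com, and take a screenshot of the Vertex AI page."
    }]
)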

# Handle computer action blocks
for block in response.content:
    if block.type == "tool_use" and block.name == "computer":
        action = block.input["action"]
        if action == "screenshot":
            screenshot = take_screenshot()  # Your implementation
            # Return screenshot as base64 image in tool_result
        elif action == "left_click":
            x, y = block.input["coordinate"]
            click(x, y)  # Your implementation
        elif action == "type":
            text = block.input["text"]
            type_text(text)  # Your implementation

Computer Use safety rules:

  • ALWAYS run in an isolated, sandboxed VM or container — never on production systems
  • Never expose Computer Use to external users without strict sandboxing
  • Never run on machines with access to sensitive credentials, databases, or production services
  • Implement a human-in-the-loop checkpoint for destructive actions
  • Set max_iterations — never run Computer Use loops unbounded

9.3 — Agent Workflow Patterns

Pattern 1: Parallelization

Run independent subtasks simultaneously to reduce total latency.

import concurrent.futures
from anthropic import AnthropicVertex

client = AnthropicVertex(region="us-east5", project_id="my-gcp-project")

def analyze_document(document: dict) -> dict:
    """Analyze a single document — can run in parallel."""
    response = client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=1024,
        system="You are a document analyst. Extract: summary, topics, sentiment, action items.",
        messages=[{
            "role": "user",
            "content": f"Analyze this document:\n\n{document['text']}"
        }]
    )
    return {
        "document_id": document["id"],
        "analysis": response.content[0].text
    }

# Process the documents in parallel (up to 5 concurrent requests)
documents = load_documents()  # list of {id, text} dicts
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    analyses = list(executor.map(analyze_document, documents))

# Merge results
merged_analysis = "\n\n".join(
    f"Document {a['document_id']}:\n{a['analysis']}"
    for a in analyses
)

# Final synthesis from all analyses
synthesis = client.messages.create(
    model="claude-3-5-sonnet@20241022",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": f"Synthesize these document analyses into an executive summary:\n\n{merged_analysis}"
    }]
)
print(synthesis.content[0].text)
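
One caveat with `executor.map`: if any call raises, the exception surfaces while the results are being collected, so a single failed document aborts the whole batch. The sketch below shows one possible alternative that records errors per item. `map_with_errors` and `fake_analyze` are hypothetical names, and the stub stands in for the real API call so the example runs offline.

```python
import concurrent.futures
from typing import Any, Callable

def map_with_errors(fn: Callable[[Any], Any], items: list, max_workers: int = 5) -> list[dict]:
    """Run fn over items in parallel; keep input order and capture errors per item."""
    results: list[dict] = [{} for _ in items]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_index = {executor.submit(fn, item): i for i, item in enumerate(items)}
        for future in concurrent.futures.as_completed(future_to_index):
            i = future_to_index[future]
            try:
                results[i] = {"ok": True, "value": future.result()}
            except Exception as exc:
                results[i] = {"ok": False, "error": str(exc)}
    return results

def fake_analyze(doc: dict) -> str:
    """Stub for analyze_document: fails on empty text."""
    if not doc["text"]:
        raise ValueError("empty document")
    return doc["text"].upper()

out = map_with_errors(fake_analyze, [{"id": 1, "text": "a"}, {"id": 2, "text": ""}])
print(out)
```

Failed items can then be retried or skipped before the synthesis step.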

Pattern 2: Chaining (Pipeline)

Pass output of each step as input to the next step.

def pipeline_step(instruction: str, input_text: str, model: str = "claude-3-5-sonnet@20241022") -> str:
    """Run a single pipeline step."""
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"{instruction}\n\n---\n\n{input_text}"
        }]
    )
    return response.content[0].text

def content_pipeline(raw_transcript: str) -> dict:
    """Multi-step content processing pipeline."""

    # Step 1: Extract key points
    key_points = pipeline_step(
        "Extract the 5-7 most important points from this meeting transcript. Be concise.",
        raw_transcript,
        model="claude-3-5-haiku@20241022"  # Fast model for extraction
    )

    # Step 2: Draft blog post from key points
    draft = pipeline_step(
        "Write a 500-word blog post based on these key points. Use a professional but engaging tone.",
        key_points
    )

    # Step 3: Polish and improve
    polished = pipeline_step(
        "Improve this blog post: fix any grammatical issues, improve flow, and make the intro more compelling.",
        draft
    )

    # Step 4: Generate SEO metadata
    metadata = pipeline_step(
        "Generate SEO metadata for this blog post as JSON: title (max 60 chars), meta_description (max 160 chars), keywords (5 items)",
        polished,
        model="claude-3-5-haiku@20241022"
    )

    return {
        "key_points": key_points,
        "draft": draft,
        "final_post": polished,
        "seo_metadata": metadata
    }

Pattern 3: Routing

Classify the incoming request and route to the appropriate specialized handler.

def classify_intent(user_message: str) -> str:
    """Classify user intent using a fast, cheap model."""
    response = client.messages.create(
        model="claude-3-5-haiku@20241022",  # Use fast model for routing
        max_tokens=20,
        temperature=0,
        system="Classify the user's intent. Respond with exactly one word: billing, technical, sales, or general",
        messages=[{"role": "user", "content": user_message}]
    )
    return response.content[0].text.strip().lower()

SPECIALIST_PROMPTS = {
    "billing": "You are a billing specialist. You help with invoices, payments, subscriptions, and refunds. You have access to billing records.",
    "technical": "You are a senior technical support engineer. You help debug issues, explain errors, and provide code solutions.",
    "sales": "You are a sales consultant. You explain product features, pricing tiers, and enterprise options. Focus on value proposition.",
    "general": "You are a helpful general assistant. Answer questions concisely and route to specialists when needed."
}

def route_and_respond(user_message: str) -> str:
    """Route to specialist and generate response."""
    intent = classify_intent(user_message)
    specialist_prompt = SPECIALIST_PROMPTS.get(intent, SPECIALIST_PROMPTS["general"])

    response = client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=1024,
        system=specialist_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    return f"[Routed to {intent} specialist]\n{response.content[0].text}"
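
The classifier is asked for exactly one word, but model output can still carry punctuation or extra text. A small normalization step, sketched here with a hypothetical `normalize_intent` helper, keeps the routing label within the known set and falls back to "general" otherwise.

```python
VALID_INTENTS = {"billing", "technical", "sales", "general"}

def normalize_intent(raw: str, default: str = "general") -> str:
    """Map raw classifier output onto a known intent, or fall back to the default."""
    words = raw.strip().lower().split()
    if not words:
        return default
    # Only the first word is considered; surrounding punctuation is stripped
    candidate = words[0].strip(".,!?:;\"'")
    return candidate if candidate in VALID_INTENTS else default

print(normalize_intent("Billing."))
print(normalize_intent("I think this is sales"))
```

The first call yields "billing". The second yields "general", because the output does not start with a valid label.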

Pattern 4: Agent Loop with Tool Use

import json
from typing import Callable

def run_tool_agent(
    task: str,
    tools: list,
    tool_functions: dict[str, Callable],
    max_iterations: int = 15,
    verbose: bool = False
) -> str:
    """Generic tool-use agent loop."""
    messages = [{"role": "user", "content": task}]

    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-3-5-sonnet@20241022",
            max_tokens=4096,
            tools=tools,
            messages=messages
        )

        if verbose:
            print(f"[Iter {iteration+1}] stop_reason={response.stop_reason}")

        if response.stop_reason == "end_turn":
            # Extract final text
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "[Task complete — no text output]"

        if response.stop_reason == "max_tokens":
            return "[Stopped: max_tokens reached — increase max_tokens]"

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})

            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    if verbose:
                        print(f"  Tool call: {block.name}({block.input})")

                    func = tool_functions.get(block.name)
                    if func is None:
                        result_str = json.dumps({"error": f"No implementation for tool: {block.name}"})
                    else:
                        try:
                            result = func(**block.input)
                            result_str = json.dumps(result) if not isinstance(result, str) else result
                        except Exception as e:
                            result_str = json.dumps({"error": str(e)})

                    if verbose:
                        print(f"  Result: {result_str[:200]}")

                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result_str
                    })

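            messages.append({"role": "user", "content": tool_results})
        else:
            # Any other stop_reason would otherwise resend the same request until the loop ends
            return f"[Stopped: unexpected stop_reason={response.stop_reason}]"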

    return f"[Agent loop ended after {max_iterations} iterations without completing]"
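
To make the `tools` and `tool_functions` arguments concrete, here is a minimal sketch of one tool definition and its implementation. The `word_count` tool is hypothetical and chosen only because it runs without external services. The schema follows the name / description / input_schema format used for tool use.

```python
def word_count(text: str) -> dict:
    """Hypothetical example tool: count the words in a string."""
    return {"words": len(text.split())}

tools = [{
    "name": "word_count",
    "description": "Count the number of words in a piece of text.",
    "input_schema": {
        "type": "object",
        "properties": {"text": {"type": "string", "description": "Text to count"}},
        "required": ["text"],
    },
}]
tool_functions = {"word_count": word_count}

# Sanity check: every declared tool should have an implementation
missing = [t["name"] for t in tools if t["name"] not in tool_functions]
print(missing)

# With a configured client, the loop above could then be called as:
# answer = run_tool_agent("How many words are in 'the quick brown fox'?", tools, tool_functions)
```

Keeping the schema list and the function mapping side by side makes it easier to catch a tool that is declared but not implemented.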

Pattern 5: Orchestrator-Worker

def orchestrator_worker_pipeline(complex_task: str) -> str:
    """
    Orchestrator breaks down task → workers execute subtasks → orchestrator synthesizes.
    """

    # Step 1: Orchestrator decomposes the task
    decomposition = client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Break this task into 3-5 independent subtasks that can be executed in parallel.
Return only a JSON array of strings, with no other text.

Task: {complex_task}"""
        }]
    )
    subtasks = json.loads(decomposition.content[0].text)

    # Step 2: Workers execute subtasks in parallel
    def worker(subtask: str) -> str:
        response = client.messages.create(
            model="claude-3-5-haiku@20241022",  # Workers use cheaper model
            max_tokens=1024,
            messages=[{"role": "user", "content": subtask}]
        )
        return response.content[0].text

    # max_workers must be at least 1, even if the orchestrator returned no subtasks
    with concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as executor:
        worker_results = list(executor.map(worker, subtasks))

    # Step 3: Orchestrator synthesizes results
    synthesis_input = "\n\n".join(
        f"Subtask {i+1}: {subtasks[i]}\nResult: {result}"
        for i, result in enumerate(worker_results)
    )
    final_response = client.messages.create(
        model="claude-3-5-sonnet@20241022",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Synthesize these subtask results into a comprehensive final answer:\n\n{synthesis_input}"
        }]
    )
    return final_response.content[0].text
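
The decomposition step passes the model's reply directly to `json.loads`, which fails if the reply contains anything besides the array. A more forgiving parse, sketched here with a hypothetical `parse_subtasks` helper, pulls the first bracketed span out of the text and validates its shape.

```python
import json
import re

def parse_subtasks(text: str) -> list[str]:
    """Extract a JSON array of strings from model output, tolerating extra text."""
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in model output")
    subtasks = json.loads(match.group(0))
    if not isinstance(subtasks, list) or not all(isinstance(s, str) for s in subtasks):
        raise ValueError("expected a JSON array of strings")
    return subtasks

raw = 'Here are the subtasks:\n["Research pricing", "Summarize reviews"]\nLet me know if you need more.'
print(parse_subtasks(raw))
```

The regular expression is greedy, so it spans from the first "[" to the last "]". This is adequate for a single array but not for replies that contain several.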

9.4 — Vertex AI Agent Builder

Fully managed agent orchestration on GCP — use when you want managed RAG + flows without custom code.

# Vertex AI Agent Builder's conversational agents are based on Dialogflow CX
# (client library: google-cloud-dialogflow-cx, imported as google.cloud.dialogflowcx_v3)
# For LLM-powered agents: use Vertex AI Generative AI Agent (Agent Builder)
# Console: console.cloud.google.com/vertex-ai/agents

# Query a deployed Vertex AI Agent via REST
import requests

def query_vertex_agent(
    agent_id: str,
    user_message: str,
    session_id: str,
    project_id: str = "my-project"
) -> dict:
    """Send message to a Vertex AI Agent Builder agent."""
    url = f"https://us-central1-aiplatform.googleapis.com/v1/projects/{project_id}/locations/us-central1/reasoningEngines/{agent_id}:query"

    import google.auth
    import google.auth.transport.requests
    creds, _ = google.auth.default()
    creds.refresh(google.auth.transport.requests.Request())

    headers = {
        "Authorization": f"Bearer {creds.token}",
        "Content-Type": "application/json"
    }
    body = {
        "input": {"messages": [{"role": "user", "content": user_message}]},
        "sessionId": session_id
    }
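    response = requests.post(url, headers=headers, json=body, timeout=60)
    response.raise_for_status()  # Surface HTTP errors instead of returning an error payload
    return response.json()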

9.5 — When to Use What

┌────────────────────────────────────────────────────────────────┐
│              Agent Pattern Decision Guide                      │
│                                                                │
│  Task Type                    Recommended Pattern             │
│  ─────────                    ──────────────────             │
│  Many independent items       Parallelization                 │
│  Sequential processing        Chaining (pipeline)            │
│  Variable input types         Routing                         │
│  Tool-dependent tasks         Agent loop                      │
│  Complex decomposable task    Orchestrator-worker             │
│  Desktop GUI automation       Computer Use                    │
│  Coding / file editing        Claude Code                     │
│  Managed RAG + flows          Vertex AI Agent Builder         │
└────────────────────────────────────────────────────────────────┘

Best Practices

  • Always set max_iterations on agent loops — never let them run unbounded
  • Log every tool call and its result for debugging and audit purposes
  • Use parallelization for independent subtasks; with N workers, total latency can drop by up to N-fold, subject to rate limits and quota
  • Restrict Computer Use to sandboxed, isolated environments only — never production
  • Design chaining pipelines with clear handoff points between agents
  • Use cheaper models (Haiku) for routing, classification, and worker agents; Sonnet/Opus for orchestration
  • Prefer Vertex AI Agent Builder for managed orchestration with built-in RAG; custom loops for full control
  • Add human-in-the-loop checkpoints for high-stakes agent actions (file deletion, payment, deployment)
  • Consider prompt caching for agent systems with large, repeated system prompts

Comparison: Direct API vs Bedrock vs Vertex AI

| Feature | Anthropic Direct API | Amazon Bedrock | Google Vertex AI |
|---------|----------------------|----------------|------------------|
| Auth | ANTHROPIC_API_KEY env var | AWS IAM roles / access keys | Google ADC / Service Account |
| SDK | anthropic Python library | boto3 (AWS SDK) | anthropic (AnthropicVertex class) |
| Client class | Anthropic() | boto3.client('bedrock-runtime') | AnthropicVertex(region, project_id) |
| Model ID format | claude-3-5-sonnet-20241022 | anthropic.claude-3-5-sonnet-20241022-v2:0 | claude-3-5-sonnet@20241022 |
| Extra required field | None | "anthropic_version": "bedrock-2023-05-31" in JSON body | None |
| Messages API call | client.messages.create() | client.invoke_model() with JSON body | client.messages.create() |
| Streaming | .stream() context manager | invoke_model_with_response_stream() | .stream() context manager |
| Async support | AsyncAnthropic() | aioboto3 | AsyncAnthropicVertex() |
| Billing | Anthropic direct billing | AWS bill | Google Cloud bill |
| Managed RAG | None built-in | Bedrock Knowledge Bases | Vertex AI Search |
| Managed Agents | None built-in | Bedrock Agents | Vertex AI Agent Builder |
| Embeddings | None native | Amazon Titan Embeddings | Vertex AI Embeddings API |
| Vector DB | Any third-party | OpenSearch Serverless, Aurora pgvector | Vertex AI Vector Search |
| Reranking | None native | None native | Vertex AI Ranking API |
| Regions | Global (no region choice) | us-east-1, us-west-2, eu-west-3, ap-northeast-1, etc. | us-east5, us-central1, europe-west4, asia-southeast1 |
| Batch inference | Batch API (async) | Model Invocation Jobs (S3 I/O) | Batch prediction jobs |
| Safety/Guardrails | None built-in | Bedrock Guardrails | Vertex AI safety filters |
| Prompt caching | Yes (cache_control) | Yes (cache_control) | Yes (cache_control) |
| Extended thinking | Yes | Yes | Yes |
| Computer Use | Yes | Yes | Yes |
| Vision (images) | Yes | Yes | Yes |
| PDF support | Yes | Yes | Yes |
| Citations | Yes | Yes | Yes |
| MCP support | Yes | Yes (via anthropic SDK) | Yes (via AnthropicVertex) |
| Network/VPC | Public API only | VPC endpoints available | VPC Service Controls available |
| Compliance | SOC 2, ISO 27001 | AWS compliance portfolio | GCP compliance portfolio (FedRAMP, HIPAA) |
| Monitoring | None built-in | CloudWatch | Cloud Monitoring + Cloud Logging |
| IAM integration | None (API keys only) | AWS IAM (roles, policies, SCPs) | Google IAM (roles, org policies) |
| Audit logging | None built-in | AWS CloudTrail | Cloud Audit Logs |

Code comparison — same task across all three:

# === DIRECT API ===
from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

# === BEDROCK ===
import boto3, json
client = boto3.client('bedrock-runtime', region_name='us-east-1')
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",   # Required!
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello!"}]
})
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # Bedrock model ID
    body=body
)
result = json.loads(response['body'].read())
print(result['content'][0]['text'])

# === VERTEX AI ===
from anthropic import AnthropicVertex
client = AnthropicVertex(region="us-east5", project_id="my-project")  # ADC auth
response = client.messages.create(
    model="claude-3-5-sonnet@20241022",           # @ separator
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)
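
The only difference between the direct API and Vertex AI model IDs shown above is the separator before the date. A minimal sketch of a converter follows, assuming every ID ends in an 8-digit date; the function names are hypothetical and not part of any SDK. Bedrock IDs are deliberately left out because their version suffix (for example -v2:0) cannot be derived from the other two formats.

```python
import re

# Assumed shape of a direct API model ID: "<family>-<YYYYMMDD>".
_DIRECT_ID = re.compile(r"^(?P<family>claude-[a-z0-9-]+)-(?P<date>\d{8})$")


def to_vertex_model_id(direct_id: str) -> str:
    """Turn 'claude-3-5-sonnet-20241022' into 'claude-3-5-sonnet@20241022'."""
    match = _DIRECT_ID.match(direct_id)
    if match is None:
        raise ValueError(f"unrecognized direct API model ID: {direct_id!r}")
    return f"{match['family']}@{match['date']}"


def to_direct_model_id(vertex_id: str) -> str:
    """Turn 'claude-3-5-sonnet@20241022' into 'claude-3-5-sonnet-20241022'."""
    family, sep, date = vertex_id.partition("@")
    if not sep or not re.fullmatch(r"\d{8}", date):
        raise ValueError(f"unrecognized Vertex AI model ID: {vertex_id!r}")
    return f"{family}-{date}"
```

A helper like this keeps one canonical model name in configuration and derives the platform-specific form at the call site.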

Decision Guide:

  • AWS-first shop → Bedrock: IAM, CloudWatch, S3 batch, Bedrock Agents, Bedrock Guardrails all integrate natively
  • GCP-first shop → Vertex AI: BigQuery, Cloud Storage, Vertex AI Search, Cloud Monitoring all integrate natively
  • Cloud-agnostic / startup → Direct API: simplest setup, billed through Anthropic, no cloud lock-in, new models typically available first
  • Multi-cloud / cost optimization → Direct API primary + Bedrock/Vertex for fallback or data-locality requirements
  • Compliance/government → Vertex AI for FedRAMP; Bedrock for AWS GovCloud

Comprehensive Study Checklist

MODULE 1 — Introduction

  • Know all available Claude models on Vertex AI and their @date model ID format
  • Understand ADC auth flow: local dev vs production vs GKE/Cloud Run
  • Know which IAM roles are needed: roles/aiplatform.user or roles/vertexai.user
  • Know how to enable Claude model access in Model Garden
  • Available regions: us-east5, us-central1, europe-west4, asia-southeast1

MODULE 2 — Accessing Claude with the API

  • Initialize AnthropicVertex(region, project_id) client
  • Set env vars ANTHROPIC_VERTEX_PROJECT_ID and CLOUD_ML_REGION
  • Build and maintain multi-turn conversation history
  • Use system parameter (not in messages array) for system prompts
  • Set temperature=0 for deterministic tasks
  • Stream responses with .stream() context manager
  • Extract structured output with JSON + prefilling
  • Handle RateLimitError with exponential backoff
  • Use AsyncAnthropicVertex for concurrent requests
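
The exponential backoff item above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SDK's own retry logic: call_with_backoff and its parameters are hypothetical names, and the exception types to retry on are passed in by the caller so the sketch runs without the anthropic package installed.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so concurrent clients do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")
```

With the real client, the call would look something like call_with_backoff(lambda: client.messages.create(...), retry_on=(anthropic.RateLimitError,)).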

MODULE 3 — Prompt Evaluation

  • 6-step evaluation workflow: define → dataset → run → grade → analyze → iterate
  • Know all grading methods: exact, regex, JSON schema, substring, model-based
  • Use Haiku for cost-efficient model-based grading
  • Log evaluation runs: timestamp, model version, prompt version, per-example results
  • Track cost per correct answer, not just accuracy
  • Use fixed test set for comparing prompt versions
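
The code-based grading methods and the cost-per-correct-answer metric listed above might look like the sketch below. All function names are hypothetical. The JSON grader is a simplification that only checks for required top-level keys; it is not full JSON Schema validation, and model-based grading is omitted because it needs a live model call.

```python
import json
import re
from typing import Iterable


def grade_exact(output: str, expected: str) -> bool:
    """Exact match after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def grade_substring(output: str, expected: str) -> bool:
    """Case-insensitive substring match."""
    return expected.lower() in output.lower()


def grade_regex(output: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def grade_json_keys(output: str, required_keys: Iterable[str]) -> bool:
    """True if the output parses as a JSON object with every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def cost_per_correct(total_cost_usd: float, num_correct: int) -> float:
    """Total spend divided by correct answers; infinite if none were correct."""
    if num_correct == 0:
        return float("inf")
    return total_cost_usd / num_correct
```

Comparing prompt versions on cost_per_correct rather than raw accuracy shows when a longer prompt buys only a marginal gain.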

MODULE 4 — Prompt Engineering

  • Five core techniques: clarity, specificity, XML tags, few-shot, chain-of-thought
  • XML tag structure for complex multi-part prompts
  • 2-5 few-shot examples, formatted exactly like real inputs
  • CoT: ask for reasoning before final answer
  • Prefill assistant turn to force output format
  • Stop sequences to prevent over-generation

MODULE 5 — Tool Use

  • Tool schema structure: name, description, input_schema (JSON Schema)
  • Complete tool use message flow: tool_use block → execute → tool_result → continue
  • Loop until stop_reason == "end_turn"
  • tool_choice: auto, any, tool (specific), none
  • Handle parallel tool calls (multiple tool_use blocks, one message with all results)
  • Return errors as JSON strings in tool_result.content, not exceptions
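
The parallel tool call and error-handling items above can be sketched as one function that turns every tool_use block in an assistant turn into a single user message of tool_result blocks. This is an illustration under assumptions: content blocks are plain dicts here (the SDK returns objects with the same field names), and build_tool_result_message and the handlers mapping are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


def build_tool_result_message(
    content_blocks: List[Dict[str, Any]],
    handlers: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    """Run each requested tool and return one user message with all results."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # skip text blocks that accompany the tool calls
        try:
            handler = handlers[block["name"]]
            payload = json.dumps(handler(**block["input"]))
            failed = False
        except Exception as exc:
            # Report the failure to the model as data instead of raising.
            payload = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
            failed = True
        result = {
            "type": "tool_result",
            "tool_use_id": block["id"],  # must echo the id of the tool_use block
            "content": payload,
        }
        if failed:
            result["is_error"] = True
        results.append(result)
    return {"role": "user", "content": results}
```

Appending this message to the history and calling the model again continues the loop until stop_reason is "end_turn".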

MODULE 6 — RAG

  • Full RAG pipeline on GCP: chunk → embed → index → search → rerank → generate
  • Vertex AI Embeddings API: text-embedding-004 (768 dims), task types
  • Vertex AI Vector Search: index creation, deployment, find_neighbors() query
  • BM25 for keyword search; RRF for hybrid search
  • Contextual Retrieval: generate chunk context summaries before indexing
  • Reranking with Vertex AI Ranking API: retrieve top-20, rerank, send top-5
  • Vertex AI Search as managed alternative
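
Reciprocal Rank Fusion (RRF), named above for hybrid search, merges a keyword ranking and a vector ranking by summing 1 / (k + rank) for each document across the lists. A minimal sketch follows, assuming each ranking is a list of document IDs ordered best first and using the commonly cited constant k = 60; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings into one list of document IDs."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]    # keyword search results, best first
vector_ranking = ["b", "d", "a"]  # vector search results, best first
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity; the fused list can then be truncated to the top 20 before reranking.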

MODULE 7 — Claude Features

  • Extended thinking: thinking={"type": "enabled", "budget_tokens": N}, min 1024
  • Vision: base64 image blocks, URL image blocks, multiple images per message
  • PDF: document content block with application/pdf media type
  • Citations: include document blocks, Claude cites specific text
  • Prompt caching: cache_control: ephemeral, 5-min TTL, 90% read discount, 25% write premium
  • Monitor cache_read_input_tokens and cache_creation_input_tokens
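
The caching arithmetic above (25% write premium, 90% read discount) can be worked through with the three usage counters. This is a sketch, not a billing tool: the function name is hypothetical and the base price in the usage lines is a placeholder, not a quoted rate.

```python
def effective_input_cost(
    base_price_per_mtok: float,
    input_tokens: int,
    cache_creation_input_tokens: int,
    cache_read_input_tokens: int,
    write_multiplier: float = 1.25,  # 25% premium on cache writes
    read_multiplier: float = 0.10,   # 90% discount on cache reads
) -> float:
    """Input-side cost in the same currency as base_price_per_mtok."""
    billable_tokens = (
        input_tokens
        + cache_creation_input_tokens * write_multiplier
        + cache_read_input_tokens * read_multiplier
    )
    return billable_tokens / 1_000_000 * base_price_per_mtok


# Placeholder price of 3.0 per million input tokens (an assumption).
first_call = effective_input_cost(3.0, 200, 10_000, 0)   # writes the cache
later_call = effective_input_cost(3.0, 200, 0, 10_000)   # reads the cache
```

The first call costs more than an uncached request would, so caching pays off only when the same prefix is reused within the 5-minute TTL.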

MODULE 8 — MCP

  • Three primitives: Tools (functions), Resources (data), Prompts (templates)
  • Build MCP server with FastMCP and @mcp.tool() decorator
  • Transport options: stdio (local) and HTTP/SSE (remote)
  • Debug with npx @modelcontextprotocol/inspector
  • Connect MCP server to AnthropicVertex via tool schema extraction
  • GCP auth inside MCP server via ADC

MODULE 9 — Anthropic Apps & Agents

  • Claude Code: install, launch, CLAUDE.md configuration, built-in tools
  • Computer Use: three tools (computer, text_editor, bash), isolation requirement
  • Agent patterns: parallelization, chaining, routing, agent loop, orchestrator-worker
  • Use cheaper models (Haiku) for workers and routing; Sonnet for orchestration
  • max_iterations guard on all agent loops
  • Vertex AI Agent Builder for managed orchestration
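
The max_iterations guard above can be sketched independently of any client. In this illustration, step is a hypothetical callable that performs one model call plus any tool execution and returns the stop reason together with the messages to append; run_agent_loop and MaxIterationsExceeded are likewise hypothetical names.

```python
from typing import Any, Callable, Dict, List, Tuple

Message = Dict[str, Any]
Step = Callable[[List[Message]], Tuple[str, List[Message]]]


class MaxIterationsExceeded(RuntimeError):
    """Raised when the agent does not finish within the iteration budget."""


def run_agent_loop(step: Step, messages: List[Message], max_iterations: int = 10) -> List[Message]:
    """Call step repeatedly until it reports end_turn or the budget runs out."""
    history = list(messages)  # copy so the caller's list is not mutated
    for _ in range(max_iterations):
        stop_reason, new_messages = step(history)
        history.extend(new_messages)
        if stop_reason == "end_turn":
            return history
    raise MaxIterationsExceeded(f"no end_turn after {max_iterations} iterations")
```

Raising instead of silently returning makes a runaway loop visible to the caller, which can then log it or fall back to a cheaper path.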

Comparison Table

  • Know all three auth methods: API key / IAM / ADC
  • Know all three model ID formats and the extra Bedrock field (anthropic_version)
  • Know platform-native services: Bedrock KB / Vertex AI Search for RAG
  • Know when to pick each: AWS shop / GCP shop / cloud-agnostic

Study tip: Focus on the AnthropicVertex client initialization (region + project_id), the @date model ID format, and the fact that the Messages API methods are identical to the direct API. Bedrock, by contrast, goes through the boto3 bedrock-runtime client's invoke_model() call, which takes a JSON body with a required anthropic_version field.