← Claude Code & Certification

Building with the Claude API - Certification Study Guide

Building with the Claude API - Certification Study Guide

Course: Anthropic - Building with the Claude API
Modules: 11 (Introduction through Agents & Workflows)
Target: Certification Prep


MODULE 1: Introduction - Claude Models Overview

Key Notes

  • Claude model family: Opus (most capable), Sonnet (balanced), Haiku (fastest/cheapest)
  • Model context windows: Haiku 200K, Sonnet 200K, Opus 200K tokens
  • Latest stable model: claude-3-5-sonnet-20241022 (or current latest version)
  • API version: Messages API is the standard (avoid legacy text completions)
  • Rate limits: Vary by plan (free, pro, enterprise); track via response headers anthropic-ratelimit-*
  • Pricing model: Input tokens cheaper than output tokens; cached tokens are 90% discount
  • Token counting: Use count_tokens endpoint before production calls
  • Supported formats: JSON, text, images (PNG, GIF, JPEG, WebP), PDFs, documents via Files API
  • Authentication: API key via ANTHROPIC_API_KEY env var or header
  • Base URL: https://api.anthropic.com/v1
  • Timeout defaults: 10s for SDK (configurable)

Best Practices

  • Choose model based on task complexity: Haiku for simple tasks, Sonnet for balanced, Opus for reasoning
  • Always set explicit max_tokens to avoid surprise token usage
  • Use system prompt for role/behavior, not in message history
  • Include version-specific features only when supported by chosen model
  • Cache large context (docs, few-shots) to reduce cost
  • Stream responses for better UX and token efficiency feedback
  • Test with cheaper models (Haiku) before scaling to Opus

Example (Python)

import anthropic
import os

client = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

# Basic request
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude!"}
    ]
)

print(f"Response: {message.content[0].text}")
print(f"Stop reason: {message.stop_reason}")
print(f"Usage: input={message.usage.input_tokens}, output={message.usage.output_tokens}")

MODULE 2: Accessing Claude with the API

Key Notes

API Authentication:

  • API key location: https://console.anthropic.com/
  • Header format: x-api-key: <key>
  • Or: ANTHROPIC_API_KEY=<key> environment variable
  • Keys are secret; rotate if exposed
  • Per-project keys available in org settings

Request Structure:

  • Method: POST /v1/messages
  • Content-Type: application/json
  • Required fields: model, max_tokens, messages
  • Optional: system, temperature, top_p, tools, tool_choice, etc.

Message Format:

  • messages array with role (user/assistant) and content (string or array of blocks)
  • role alternates: user → assistant → user
  • Each turn is stateless; send full conversation history for multi-turn

Multi-Turn Conversation:

  • Track message history client-side
  • System prompt applies to entire conversation
  • Assistant’s previous responses become assistant messages in next request
  • Tool results are inserted as user messages with tool_use_id reference

System Prompt:

  • Placed in separate system parameter, not messages array
  • Applies to all turns in conversation
  • Can be updated per request (creates new conversation context)
  • Best for role definition, rules, output format instructions
  • Costs same as message tokens but cached efficiently

Temperature & Top-P:

  • temperature (0.0-1.0): 0=deterministic, 1=random (default 1.0)
  • top_p (0.0-1.0): nucleus sampling, used with temperature
  • For structured output: temperature=0
  • For creative: temperature=0.8-1.0
  • Rarely combine with top_k (deprecated in favor of top_p)

Streaming:

  • Set stream=True in request
  • Returns server-sent events (SSE) with delta updates
  • Stream event types: content_block_start, content_block_delta, message_delta, message_stop
  • Rebuild message by accumulating deltas
  • Always consume stream fully before closing connection

Structured Output:

  • Use response_format parameter with type: "json_schema" (if model supports)
  • Define JSON schema in json_schema.schema property
  • Model will output valid JSON matching schema
  • Useful for function calls, data extraction, parsing

Best Practices

  • Always include explicit max_tokens (don’t rely on defaults)
  • Send full conversation history for multi-turn (Claude has no built-in memory)
  • Use system prompt for all conversations with consistent instructions
  • Set temperature=0 for deterministic tasks (classification, extraction)
  • Stream for interactive applications; collect full message for logging
  • Validate JSON schema compliance in client if structured output requested
  • Handle rate limiting with exponential backoff
  • Set appropriate timeouts for long-running requests

Example (Python)

Basic Request:

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ]
)
print(message.content[0].text)

Multi-Turn Conversation:

system_prompt = "You are a helpful coding assistant."
conversation_history = []

def chat(user_input):
    conversation_history.append({"role": "user", "content": user_input})
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        system=system_prompt,
        messages=conversation_history
    )
    
    assistant_message = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Multiple turns
print(chat("What is Python?"))
print(chat("How do I write a class?"))
print(chat("Can you show me an example?"))

Streaming:

print("Streaming response:")
with client.messages.stream(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are concise.",
    messages=[{"role": "user", "content": "Write a haiku about AI."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()

Structured Output (JSON):

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract the name and age from: John is 30 years old."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person_extractor",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"]
            },
            "strict": True
        }
    }
)

import json
data = json.loads(response.content[0].text)
print(f"Name: {data['name']}, Age: {data['age']}")

MODULE 3: Prompt Evaluation

Key Notes

Evaluation Workflow:

  1. Define evaluation task (classification, generation, reasoning)
  2. Create test dataset (labeled examples)
  3. Run model on test cases
  4. Grade outputs (automated or manual)
  5. Compute metrics (accuracy, F1, similarity, custom)
  6. Analyze failures and iterate

Test Dataset Design:

  • Minimum 20-50 examples for reliable signal
  • Include edge cases, ambiguous inputs, common errors
  • Label expected outputs for supervised evaluation
  • Stratify by category if multi-class
  • Version control datasets alongside prompts

Grading Methods:

Code-based grading:

  • Exact match (string equality)
  • Regex matching (pattern validation)
  • JSON schema validation
  • Custom Python function (flexible)
  • Numeric thresholds
  • Function-based scoring (0-1 range)

Model-based grading:

  • Use Claude to grade Claude’s outputs (consistent rubric)
  • Judges model: Opus 3.5 or Sonnet 3.5
  • Rubric: clear criteria, examples, scoring scale
  • Less code, captures semantic quality
  • Slower/more expensive than code-based but more reliable

Metrics to track:

  • Accuracy (% correct)
  • F1 score (precision × recall)
  • Token efficiency (tokens/task)
  • Latency (response time)
  • Cost (input + output tokens × pricing)
  • User satisfaction (if collecting feedback)

Best Practices

  • Start with code-based grading (fast iteration)
  • Use model-based grading for subjective tasks (quality, tone, correctness)
  • Separate test set from training/validation set
  • Run evals on multiple model versions before deployment
  • Log all eval runs with timestamps, model, prompt version
  • Aim for 95%+ accuracy before production
  • Document failure cases and plan improvements
  • A/B test prompt changes on held-out test set

Example (Python)

import anthropic
import json
from typing import Literal

client = anthropic.Anthropic()

# Test dataset
test_cases = [
    {"input": "Extract the color: The car is red.", "expected_output": "red"},
    {"input": "Extract the color: She wore a blue dress.", "expected_output": "blue"},
    {"input": "Extract the color: The sky is clear.", "expected_output": "no color mentioned"},
]

# Code-based grading (exact match)
def grade_exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()

# Model-based grading
def grade_with_claude(input_text: str, output: str, expected: str) -> float:
    """Use Claude as a judge: returns score 0-1."""
    rubric = f"""
    Task: Evaluate if the output correctly answers the input query.
    Expected answer: {expected}
    Actual output: {output}
    
    Score 1.0 if correct, 0.5 if partially correct, 0.0 if incorrect.
    Respond with ONLY the number (e.g., 1.0).
    """
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        system="You are an expert evaluator. Score the response accurately.",
        messages=[{"role": "user", "content": rubric}]
    )
    
    try:
        return float(response.content[0].text.strip())
    except:
        return 0.0

# Run evaluation
def evaluate_prompt():
    results = []
    
    for test in test_cases:
        # Get model output
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=100,
            system="Extract the color mentioned. If no color, respond 'no color mentioned'.",
            messages=[{"role": "user", "content": test["input"]}]
        )
        
        output = response.content[0].text
        
        # Grade with both methods
        exact_match = grade_exact_match(output, test["expected_output"])
        model_score = grade_with_claude(test["input"], output, test["expected_output"])
        
        results.append({
            "input": test["input"],
            "expected": test["expected_output"],
            "actual": output,
            "exact_match": exact_match,
            "model_score": model_score
        })
    
    # Compute metrics
    accuracy = sum(1 for r in results if r["exact_match"]) / len(results)
    avg_model_score = sum(r["model_score"] for r in results) / len(results)
    
    print(f"Accuracy: {accuracy:.2%}")
    print(f"Average Model Score: {avg_model_score:.2f}")
    print("\nDetailed Results:")
    for r in results:
        print(f"  Input: {r['input']}")
        print(f"  Expected: {r['expected']}, Actual: {r['actual']}")
        print(f"  Match: {r['exact_match']}, Model Score: {r['model_score']}\n")

evaluate_prompt()

MODULE 4: Prompt Engineering Techniques

Key Notes

Core Principles:

  1. Clarity: Be specific about task, not vague

    • Bad: “Summarize this”
    • Good: “Summarize in 3 bullet points, focusing on methodology”
  2. Specificity: Include constraints, format, examples

    • Output format (JSON, bullet points, code blocks)
    • Length (words, paragraphs, tokens)
    • Tone (formal, casual, technical)
    • Edge cases (“If N/A, respond ’not provided’”)
  3. XML Tags: Structure complex prompts

    • <task>, <context>, <rules>, <output_format>
    • Makes parsing easier, prevents confusion
    • Claude particularly responsive to well-structured XML
  4. Examples (Few-Shot): Dramatically improve performance

    • 2-5 examples usually sufficient
    • Show input-output pairs for task
    • Include edge cases in examples
    • More effective than long descriptions
  5. Chain of Thought: Encourage step-by-step reasoning

    • “Think step-by-step before answering”
    • Improves accuracy on reasoning tasks
    • Increases token usage but better results
  6. Iterative Refinement: Test, measure, improve

    • Evaluate on test set
    • Identify failure patterns
    • Adjust prompt, re-evaluate
    • Version prompts alongside evals

Best Practices

  • Separate instructions (system) from data (user messages)
  • Use XML tags for multi-part instructions
  • Include 2-5 diverse examples for complex tasks
  • Ask for step-by-step reasoning on logic/math tasks
  • Specify output format explicitly
  • For creative tasks, use higher temperature; for accuracy, use temperature=0
  • Test variations on same test set to measure impact
  • Document what changed and why in prompt versions

Example (Python)

Clarity & Specificity:

# Bad prompt
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=200,
    messages=[{"role": "user", "content": "Analyze this text."}]
)

# Good prompt with clarity and specificity
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": """Analyze the following customer review for sentiment.
        
Output format: JSON with fields: sentiment (positive/negative/neutral), confidence (0-1), key_phrases (list of strings).

Text: "The product arrived late but works great once I set it up. Would recommend despite shipping issues."
"""
    }]
)

XML Structure:

prompt = """
<task>
Extract structured information from a product review.
</task>

<rules>
- Output MUST be valid JSON
- If information is missing, use null
- Sentiment must be one of: positive, negative, neutral
- Confidence is a number 0-1
</rules>

<output_format>
{
  "product_name": string,
  "sentiment": string,
  "confidence": number,
  "pros": [string],
  "cons": [string],
  "rating": number or null
}
</output_format>

<examples>
Input: "Excellent phone! Fast processor, great camera. Battery life is average though."
Output: {
  "product_name": null,
  "sentiment": "positive",
  "confidence": 0.9,
  "pros": ["fast processor", "great camera"],
  "cons": ["average battery life"],
  "rating": null
}
</examples>

<input>
"These shoes are uncomfortable and overpriced. Not worth the hype."
</input>
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[{"role": "user", "content": prompt}]
)

Few-Shot Examples:

examples = [
    {
        "input": "The meeting is at 3 PM tomorrow.",
        "output": "DATETIME: 3:00 PM tomorrow"
    },
    {
        "input": "I have 5 apples.",
        "output": "QUANTITY: 5"
    },
    {
        "input": "The sky is blue.",
        "output": "ATTRIBUTE: sky, blue"
    }
]

prompt = "Extract the main entity from this sentence:\n\nThe project deadline is next Friday.\n\nUse examples:\n"
for ex in examples:
    prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
prompt += "Now extract: 'The project deadline is next Friday.'"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=100,
    messages=[{"role": "user", "content": prompt}]
)

Chain of Thought:

prompt = """Solve this math problem step-by-step.

Problem: If a train travels at 60 mph for 2.5 hours, how far does it go?

Before answering, think through:
1. What formula do I need?
2. What values do I have?
3. What's the calculation?
4. What's the final answer?
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=300,
    temperature=0,  # Deterministic for math
    messages=[{"role": "user", "content": prompt}]
)

MODULE 5: Tool Use with Claude

Key Notes

Tool Use Overview:

  • Claude can request function calls; you execute and return results
  • Supports up to 10,000 tools per request
  • Tool definitions are JSON schemas describing function signature
  • Tool results fed back as user messages with special role

Tool Schema Structure:

{
  "name": "function_name",
  "description": "What this function does",
  "input_schema": {
    "type": "object",
    "properties": {
      "param1": {"type": "string", "description": "..."},
      "param2": {"type": "number", "description": "..."}
    },
    "required": ["param1"]
  }
}

Message Flow:

  1. Send tool definitions in tools array
  2. Model responds with stop_reason: "tool_use"
  3. Extract tool_use blocks from response
  4. Execute tool, collect result
  5. Send result back as {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "...", "content": "..."}]}
  6. Model continues with updated information

Tool Use Block Structure:

{
  "type": "tool_use",
  "id": "unique_id",
  "name": "tool_name",
  "input": {...}
}

Tool Result Block:

{
  "type": "tool_result",
  "tool_use_id": "id_from_tool_use_block",
  "content": "result string or error"
}

Multiple Tools:

  • Can use multiple tools in single turn
  • Model decides which tools to call
  • All tools called in parallel, results returned together
  • Specify tool order in tools array to hint preference

Tool Choice Parameter:

  • "auto" (default): Model decides when to use tools
  • "required": Model must use a tool in response
  • {"type": "tool", "name": "specific_tool"}: Force specific tool
  • "none": Model won’t use any tools

Fine-Grained Tool Calling:

  • Set tool_choice={"type": "tool", "name": "exact_tool_name"} to force specific tool
  • Use "required" when tool use is essential for task
  • Useful for forcing function calling in agentic workflows

Text Edit Tool:

  • Built-in tool for editing text/code
  • Useful in agentic scenarios where Claude modifies documents
  • Not directly exposed; mention if needed for advanced workflows

Web Search Tool:

  • Built-in capability; can search web within tool use
  • Returns snippet results with citations
  • Used within tool_use blocks similar to custom tools

Best Practices

  • Keep tool descriptions concise but clear
  • Use descriptive parameter names and descriptions
  • Set required fields only for essential parameters
  • Include default values for optional parameters
  • Error handling: return error message as tool result, let Claude retry
  • For multi-step workflows, use tools to gather info, then summarize
  • Cache tool definitions (especially for long lists) using prompt caching
  • Test tool schemas with tools parameter before deployment
  • Use tool_choice="required" to enforce function calling for APIs

Example (Python)

Define Tool Schema & Basic Flow:

import anthropic
import json

client = anthropic.Anthropic()

# Define tools
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City or coordinates"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature units"}
            },
            "required": ["location"]
        }
    },
    {
        "name": "get_time",
        "description": "Get current time for a timezone",
        "input_schema": {
            "type": "object",
            "properties": {
                "timezone": {"type": "string", "description": "Timezone (e.g., 'America/New_York')"}
            },
            "required": ["timezone"]
        }
    }
]

# Simulate tool execution
def execute_tool(name: str, input_data: dict) -> str:
    if name == "get_weather":
        return json.dumps({
            "location": input_data["location"],
            "temperature": 22,
            "condition": "Sunny"
        })
    elif name == "get_time":
        return json.dumps({
            "timezone": input_data["timezone"],
            "time": "14:30:00"
        })
    return "Tool not found"

# Multi-turn tool use loop
def chat_with_tools(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    
    print(f"User: {user_message}\n")
    
    while True:
        # Send request with tools
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
        
        # Check if tool use is needed
        if response.stop_reason == "tool_use":
            # Extract all tool use blocks
            tool_uses = [block for block in response.content if block.type == "tool_use"]
            
            # Add assistant response to messages
            messages.append({"role": "assistant", "content": response.content})
            
            # Execute tools and collect results
            tool_results = []
            for tool_use in tool_uses:
                print(f"Tool: {tool_use.name}")
                print(f"Input: {json.dumps(tool_use.input, indent=2)}")
                
                result = execute_tool(tool_use.name, tool_use.input)
                print(f"Result: {result}\n")
                
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": result
                })
            
            # Send tool results back
            messages.append({"role": "user", "content": tool_results})
        
        elif response.stop_reason == "end_turn":
            # Model finished, no more tools
            final_response = next(
                (block.text for block in response.content if hasattr(block, "text")),
                "No response"
            )
            print(f"Assistant: {final_response}")
            break
        
        else:
            print(f"Unexpected stop reason: {response.stop_reason}")
            break

# Test
chat_with_tools("What's the weather in Paris and the time in London?")

Force Tool Use & Handle Multiple Tools:

# Force tool use
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice="required",  # Model must use a tool
    messages=[{"role": "user", "content": "Tell me about Paris."}]
)

# Or force specific tool
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "get_weather"},  # Force this tool
    messages=[{"role": "user", "content": "What's the weather?"}]
)

# Parallel tool execution (both tools called in same turn)
def handle_parallel_tools(user_message: str):
    messages = [{"role": "user", "content": user_message}]
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )
    
    if response.stop_reason == "tool_use":
        tool_uses = [block for block in response.content if block.type == "tool_use"]
        
        # Execute all tools in parallel
        results = []
        for tool_use in tool_uses:
            result = execute_tool(tool_use.name, tool_use.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,
                "content": result
            })
        
        # Send all results back at once
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": results})
        
        # Continue conversation
        final_response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
        return final_response

handle_parallel_tools("Get weather in Paris AND time in London")

Key Notes

RAG (Retrieval-Augmented Generation):

  • Retrieves relevant documents, passes to LLM for synthesis
  • Better than fine-tuning for up-to-date, dynamic content
  • Solves hallucination by grounding responses in real data

RAG Pipeline Steps:

  1. Chunking: Split documents into small pieces

    • Chunk size: 300-1000 tokens typical
    • Overlap: 100-200 tokens to preserve context
    • Split on semantic boundaries (paragraphs, sections)
  2. Embeddings: Convert chunks to vectors

    • Use embedding model (e.g., OpenAI, Anthropic, Cohere)
    • Dimension: 768-3072 typical
    • Store vectors in vector DB (Pinecone, Weaviate, Milvus)
  3. Indexing: Build search index

    • Vector index for semantic search
    • BM25 index for keyword search
    • Hybrid search combines both
  4. Retrieval: Find relevant chunks

    • Query embedding vs document embeddings (cosine similarity)
    • Top-K results (usually 3-10)
    • Filtering by metadata (date, category)
  5. Generation: Pass retrieved context to LLM

    • Include original query + retrieved chunks
    • Use system prompt to define task
    • Claude synthesizes answer with citations

Chunking Strategy:

  • Fixed size: Simple, consistent (e.g., 512 tokens)
  • Semantic: Split on headers, paragraphs (preserves meaning)
  • Overlapping: Maintain context across chunks
  • Hierarchical: Chunks with parent/child relationships

BM25 Search:

  • Keyword-based ranking algorithm
  • Good for exact matches, specific terms
  • Complement vector search for hybrid retrieval
  • Fast, no embeddings needed

Multi-Index Search:

  • Vector index: semantic similarity
  • BM25 index: keyword matching
  • Metadata index: filtering (date, source, category)
  • Combine results with reciprocal rank fusion or learned weights

Vector DB Selection:

  • Pinecone: Managed, serverless, easy to use
  • Weaviate: Open-source, flexible, local/cloud
  • Milvus: Open-source, high performance
  • Qdrant: Rust-based, performant, similar to Milvus

Best Practices

  • Chunk at semantic boundaries (paragraphs, sections), not randomly
  • Use 2-3 sources of retrieval (vector + BM25 + metadata)
  • Retrieve 5-10 top results; let model use most relevant
  • Include source/citation metadata with chunks
  • Test retrieval quality independently (check if relevant docs retrieved)
  • Combine with reranking (use LLM to rerank retrieved results)
  • Cache retrieved context if same query appears multiple times
  • Monitor retrieval performance: measure precision@k, recall, MRR

Example (Python)

Basic RAG Flow:

import anthropic
from typing import List

client = anthropic.Anthropic()

# Simulated document store (in production: vector DB)
documents = [
    {
        "id": "doc1",
        "text": "Python is a high-level programming language. It emphasizes readability.",
        "source": "Python Basics"
    },
    {
        "id": "doc2",
        "text": "JavaScript runs in browsers and enables interactive web pages.",
        "source": "Web Development"
    },
    {
        "id": "doc3",
        "text": "Python has a rich ecosystem of libraries like NumPy, Pandas, TensorFlow.",
        "source": "Python Libraries"
    }
]

def retrieve_documents(query: str, top_k: int = 3) -> List[str]:
    """Simple keyword-based retrieval (BM25-like)."""
    query_terms = query.lower().split()
    scored_docs = []
    
    for doc in documents:
        score = sum(1 for term in query_terms if term in doc["text"].lower())
        if score > 0:
            scored_docs.append((doc, score))
    
    # Sort by score and return top_k
    ranked = sorted(scored_docs, key=lambda x: x[1], reverse=True)
    return [doc["text"] for doc, _ in ranked[:top_k]]

def rag_query(user_query: str) -> str:
    """RAG pipeline: retrieve → pass to Claude → synthesize."""
    
    # Step 1: Retrieve relevant documents
    retrieved_docs = retrieve_documents(user_query, top_k=3)
    context = "\n\n".join([f"[Chunk {i+1}]\n{doc}" for i, doc in enumerate(retrieved_docs)])
    
    # Step 2: Build prompt with context
    system_prompt = """You are a helpful assistant. Answer based on the provided context.
    If context doesn't contain relevant information, say so clearly."""
    
    user_message = f"""Context:
{context}

Question: {user_query}

Answer based on the context above."""
    
    # Step 3: Get response from Claude
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    
    return response.content[0].text

# Test RAG
print(rag_query("What can I do with Python?"))
print("\n---\n")
print(rag_query("How does JavaScript work?"))

Multi-Index Hybrid Search:

from collections import Counter
import math

def bm25_score(doc_text: str, query_terms: List[str]) -> float:
    """Simple BM25-like scoring."""
    doc_terms = doc_text.lower().split()
    score = 0
    for term in query_terms:
        count = doc_terms.count(term)
        score += math.log(1 + count)
    return score

def retrieve_hybrid(query: str, top_k: int = 5):
    """Hybrid retrieval: BM25 + semantic (simulated vector)."""
    query_terms = query.lower().split()
    
    results = []
    for doc in documents:
        # BM25 score
        bm25 = bm25_score(doc["text"], query_terms)
        
        # Simulated vector similarity (0-1)
        # In production: actual embedding cosine similarity
        vector_sim = 0.8 if any(term in doc["text"].lower() for term in query_terms) else 0.2
        
        # Combine scores (weighted average)
        combined = 0.4 * bm25 / 10 + 0.6 * vector_sim  # Normalize BM25
        results.append((doc, combined))
    
    # Rank and return
    ranked = sorted(results, key=lambda x: x[1], reverse=True)
    return [doc["text"] for doc, _ in ranked[:top_k]]

docs = retrieve_hybrid("Python programming language", top_k=2)
print("Retrieved documents (hybrid):")
for i, doc in enumerate(docs, 1):
    print(f"{i}. {doc[:80]}...")

MODULE 7: Features of Claude

Key Notes

Extended Thinking:

  • Enables Claude to reason in “thinking” tokens (not shown to user)
  • Improves accuracy on complex reasoning, math, coding
  • Costs: thinking tokens = output tokens (not discounted)
  • Parameter: thinking with type="enabled" or type="disabled"
  • budget_tokens: max thinking tokens (default 10,000)
  • Response contains thinking block (shown to client) + text block

Image/PDF Support:

  • Support formats: PNG, GIF, JPEG, WebP, PDF
  • Images sent in content array as {"type": "image", "source": {...}}
  • Image source: base64, url, or media_type
  • PDFs: use Files API or base64 encode (max 20MB per file, 5 files per message)
  • Vision capability included in all models

Citations:

  • Claude can cite document snippets with precise locations
  • Requires extracting citation data from response
  • Citation format: document indices + character ranges
  • Use bblock_citations in response headers if enabled
  • Useful for Q&A, document analysis, compliance

Prompt Caching:

  • Cache frequently-used context (system prompt, documents, examples)
  • Cached tokens charged 90% less than new tokens (10% of input token cost)
  • Cache hits: reuse cached tokens without reprocessing
  • Cache key: hash of request up to cache control point
  • Minimum cache size: 1024 tokens to create cache

Cache Control Placement:

System prompt (usually cached)
    ↓
Optional long context (docs, examples, few-shot)
    ↓ [CACHE_CONTROL HERE]
    ↓
User query (not cached)

Code Execution:

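  • Claude can write and reason about code
  • The Messages API does not run code by default; either the client executes it, or the request enables Anthropic's server-side code execution tool (beta), which runs code in a sandboxed container
  • Client-side pattern: define a tool, let Claude return code in a tool_use block, and execute it in your own sandbox
  • Results are fed back as tool_result blocks for Claude to analyze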
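
The client-side pattern can be sketched as follows. This is an illustration under stated assumptions, not a production sandbox: run_python and to_tool_result are hypothetical helper names, and the subprocess call offers no isolation beyond a timeout.

```python
import subprocess
import sys


def run_python(code: str, timeout_s: float = 5.0) -> dict:
    """Run a code string in a separate Python process and capture its output.

    WARNING: this is not a sandbox. Real deployments should isolate execution
    (container, restricted user, no network) before running model-written code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": f"Timed out after {timeout_s}s"}
    if proc.returncode != 0:
        return {"ok": False, "output": proc.stderr.strip()}
    return {"ok": True, "output": proc.stdout.strip()}


def to_tool_result(tool_use_id: str, run: dict) -> dict:
    """Wrap an execution result as a tool_result block to send back to Claude."""
    block = {"type": "tool_result", "tool_use_id": tool_use_id, "content": run["output"]}
    if not run["ok"]:
        block["is_error"] = True  # tells Claude the execution failed
    return block


result = run_python("print(sum(range(10)))")
print(to_tool_result("toolu_example", result))
```

The tool_use_id shown is a placeholder; in a real loop it comes from the tool_use block in Claude's response.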

Files API:

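  • Upload documents once and reference them by file_id across requests (beta; requires the files-api-2025-04-14 beta header)
  • Document blocks accept PDF and plain text; images go in image blocks; other formats (DOCX, XLSX, CSV, JSON) are intended for the code execution tool or should be converted to text first
  • File size: up to 500 MB per file at the time of writing (check current limits)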
  • Solves: avoid base64 encoding, handle large docs efficiently
  • Reference in message: {"type": "document", "source": {"type": "file", "file_id": "..."}}

Best Practices

  • Use extended thinking for complex reasoning; measure cost vs accuracy improvement
  • Cache system prompts + few-shot examples for consistent savings
  • For images: include relevant metadata (describe what to look for)
  • Enable citations only if compliance/audit needed (adds overhead)
  • Use Files API for documents > 1MB or > 10k tokens
  • Test cache hit rate; measure savings before production
  • Combine caching (lower input cost) with streaming (lower perceived latency); streaming by itself does not reduce token usage
  • PDF handling: extract text if possible, use Files API as fallback

Example (Python)

Extended Thinking:

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # extended thinking requires a model that supports it
    max_tokens=8000,
    thinking={
        "type": "enabled",
        "budget_tokens": 5000  # Max thinking tokens; must be >= 1024 and < max_tokens
    },
    messages=[{
        "role": "user",
        "content": "Solve: If a train leaves NYC at 9 AM going 60 mph, and another leaves Boston at 10 AM going 50 mph, when do they meet?"
    }]
)

# Extract thinking and response
for block in response.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking}")
    elif block.type == "text":
        print(f"Answer: {block.text}")

Image Analysis:

import base64

# Load image as base64
with open("chart.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {
                "type": "text",
                "text": "What trends do you see in this chart?"
            }
        ]
    }]
)

print(response.content[0].text)

Prompt Caching:

system_prompt = """You are an expert analyst. Answer questions about the provided documents accurately."""

# Long context to cache (e.g., a large document)
cached_document = """
[Large document with thousands of tokens...]
Company History: Founded in 1990, specialized in cloud infrastructure.
Product Features: Load balancing, auto-scaling, monitoring, security.
Pricing: $99/month basic, $299/month pro, custom enterprise.
[... continues for many tokens ...]
"""

# Request with caching
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt
        },
        {
            "type": "text",
            "text": cached_document,
            "cache_control": {"type": "ephemeral"}  # Cache this content
        }
    ],
    messages=[{
        "role": "user",
        "content": "What is the pricing for the basic plan?"
    }]
)

# Check cache usage
print(f"Input tokens (new): {response.usage.input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

# Second request reuses cache
response2 = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt
        },
        {
            "type": "text",
            "text": cached_document,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{
        "role": "user",
        "content": "What are the main product features?"
    }]
)

print("\nSecond request:")
print(f"Cache read tokens (reused): {response2.usage.cache_read_input_tokens}")
# Expected: cache_read_input_tokens > 0 if the prefix meets the minimum cacheable size
# and the second call arrives within the cache lifetime

Files API:

# Upload a document
import os

with open("report.pdf", "rb") as f:
    file_response = client.beta.files.upload(
        file=(os.path.basename("report.pdf"), f, "application/pdf"),
        betas=["files-api-2025-04-14"]
    )

file_id = file_response.id
print(f"Uploaded file: {file_id}")

# Use file in message (the betas parameter is only accepted on client.beta.messages)
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "file",
                    "file_id": file_id
                },
                "title": "Q1 Report"
            },
            {
                "type": "text",
                "text": "Summarize the key financial metrics."
            }
        ]
    }],
    betas=["files-api-2025-04-14"]
)

print(response.content[0].text)

# Cleanup
client.beta.files.delete(file_id, betas=["files-api-2025-04-14"])

MODULE 8: Model Context Protocol (MCP)

Key Notes

MCP Overview:

  • Protocol for LLMs to interact with external tools, data, APIs
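  • An MCP client, embedded in a host application such as Claude Code or Claude Desktop, requests tools and resources from an MCP server on the model's behalf
  • Server provides tools, resources (files, databases), and prompts
  • Bidirectional communication over stdio (local) or Streamable HTTP (remote; earlier spec versions used HTTP with SSE)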

MCP Architecture:

Claude Client (LLM)
    ↓
MCP Client (proxy)
    ↓ [MCP Protocol]
    ↓
MCP Server (e.g., database, API)

Key Components:

  1. Tools: Functions MCP server exposes

    • Schema: name, description, input parameters
    • Claude calls tools, server executes, returns result
  2. Resources: Data/files MCP server provides

    • Read by the client (optionally with change notifications); writes and updates go through tools
    • Examples: database records, files, API responses
    • Use URI scheme (e.g., file://, db://)
  3. Prompts: Reusable prompt templates defined by the server

    • Chosen by the user (for example as slash commands) and filled in with arguments
    • Can include guidelines, examples, constraints

MCP Inspector:

  • Developer tool for discovering and testing an MCP server's capabilities
  • List all tools, resources, prompts available
  • Test tool execution
  • Useful for debugging and documentation

MCP Protocol Details:

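  • Request/response and notification messages encoded as JSON-RPC 2.0
  • Transport: stdio (local subprocess) or Streamable HTTP (remote); older servers may still use HTTP with SSE
  • Initialization: client sends an initialize request with its protocol version and capabilities, server responds with the negotiated version and its own capabilities, then the client sends an initialized notification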
  • Error handling: structured error responses with codes

Best Practices

  • Version MCP servers; indicate breaking changes
  • Keep tool/resource schemas concise and clear
  • Use descriptive names for tools and resources
  • Implement proper error handling and logging
  • Test the server with the MCP Inspector before deployment
  • Document resource URIs and tool parameters thoroughly
  • Cache MCP server responses if repeated calls expected

Example (Python)

Simple MCP Server Definition (conceptual):

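# MCP servers are usually built with an official SDK (Python, TypeScript, and others)
# This is a conceptual sketch of what a server advertises, not a runnable server
# Note: the MCP wire format names the schema field "inputSchema"; "input_schema" is the Claude API spelling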

mcp_server_tools = [
    {
        "name": "get_user_by_id",
        "description": "Retrieve user information by ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "user_id": {"type": "integer", "description": "User ID"}
            },
            "required": ["user_id"]
        }
    },
    {
        "name": "create_task",
        "description": "Create a new task for a user",
        "input_schema": {
            "type": "object",
            "properties": {
                "user_id": {"type": "integer"},
                "title": {"type": "string"},
                "description": {"type": "string"},
                "due_date": {"type": "string", "format": "date"}
            },
            "required": ["user_id", "title"]
        }
    }
]

mcp_resources = [
    {
        "uri": "database://users",
        "name": "Users Table",
        "description": "All user records",
        "mimeType": "application/json"
    },
    {
        "uri": "database://tasks",
        "name": "Tasks Table",
        "description": "All tasks",
        "mimeType": "application/json"
    }
]

# Claude would interact with this MCP server
# to call tools and access resources

Using MCP in Claude API Calls (with MCP client):

# When calling the API directly, your application runs the MCP client:
# it lists the server's tools and passes them in the tools parameter.
# (Claude Code does this wiring automatically for configured servers.)
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        # Tool definition obtained from the MCP server's tools/list result
        {
            "name": "get_user_by_id",
            "description": "Retrieve user information by ID",
            "input_schema": {
                "type": "object",
                "properties": {
                    "user_id": {"type": "integer"}
                },
                "required": ["user_id"]
            }
        }
    ],
    messages=[{
        "role": "user",
        "content": "Tell me about user 123."
    }]
)

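# Claude returns a tool_use block; the MCP client forwards the call to the server (tools/call),
# and the server's result is sent back to Claude as a tool_result block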
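
Bridging an MCP server to the Messages API mostly comes down to renaming one field. A minimal sketch, assuming the tool entry has the shape returned by tools/list (name, description, inputSchema); the function name and the sample entry are hypothetical.

```python
def mcp_tool_to_claude_tool(mcp_tool: dict) -> dict:
    """Map an MCP tool listing entry to the Claude API tool format."""
    return {
        "name": mcp_tool["name"],
        "description": mcp_tool.get("description", ""),
        "input_schema": mcp_tool["inputSchema"],  # camelCase in MCP, snake_case in the Claude API
    }


# Hypothetical entry as it might appear in a tools/list result
listed_tool = {
    "name": "get_user_by_id",
    "description": "Retrieve user information by ID",
    "inputSchema": {
        "type": "object",
        "properties": {"user_id": {"type": "integer"}},
        "required": ["user_id"],
    },
}

claude_tools = [mcp_tool_to_claude_tool(listed_tool)]
print(claude_tools[0]["name"], claude_tools[0]["input_schema"]["required"])
```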

MODULE 9: Anthropic Apps (Claude Code)

Key Notes

Claude Code Overview:

  • Agentic coding tool that runs in the terminal, with IDE integrations (such as VS Code and JetBrains)
  • Reads and edits files, runs commands, and works with git in the project directory
  • Features: code generation, refactoring, debugging, testing
  • Connects to MCP servers for external tool access

Setup:

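  • Install the CLI (for example: npm install -g @anthropic-ai/claude-code), then run claude in the project directory
  • Authenticate with a Claude account or an Anthropic Console API key
  • Configure in .claude/settings.json (shared, project-specific), .claude/settings.local.json (personal, not committed), and ~/.claude/settings.json (user-wide)
  • Hooks are registered under the hooks key of a settings file; the scripts they call can live anywhere, and .claude/hooks/ is a common convention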

MCP Server Integration:

  • Configure project-scoped MCP servers in .mcp.json at the project root, or register them with the claude mcp add command
  • Each server: name, command or URL, arguments, environment variables
  • Claude Code starts and connects to the configured servers
  • Tools/resources from MCP servers available in chat

Hooks (Automation):

  • SessionStart: Runs when a session starts or resumes
  • PreToolUse: Runs before Claude uses a tool; can block the call
  • PostToolUse: Runs after a tool call completes
  • UserPromptSubmit: Runs when the user submits a prompt, before Claude processes it
  • Typical uses: loop detection, build monitoring, token usage tracking
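
A hook is just a command that receives the event as JSON on stdin. The sketch below shows one possible PreToolUse hook in Python, assuming the event carries tool_name and tool_input fields and that exit code 2 blocks the tool call. The blocked-command list is an illustrative policy, not a recommendation.

```python
import json
import sys

BLOCKED_SUBSTRINGS = ("rm -rf", "git push --force")  # illustrative policy only


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message) for a PreToolUse event."""
    if event.get("tool_name") != "Bash":
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    for bad in BLOCKED_SUBSTRINGS:
        if bad in command:
            return 2, f"Blocked by policy: command contains '{bad}'"
    return 0, ""


def main() -> None:
    event = json.load(sys.stdin)         # the event arrives as JSON on stdin
    code, message = decide(event)
    if message:
        print(message, file=sys.stderr)  # stderr explains the block to Claude
    sys.exit(code)


# Quick check of the decision logic without stdin
print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf build"}}))
print(decide({"tool_name": "Read", "tool_input": {"file_path": "README.md"}}))
```

Saved as a script, the file would end with a call to main() under an if __name__ == "__main__": guard; it is left out here so the sketch runs without waiting on stdin.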

Rules & Docs:

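  • CLAUDE.md: project memory file, loaded automatically at the start of every session
  • .claude/rules/: guidelines loaded for every session
  • .claude/docs/: reference material (architecture, standards, patterns); a team convention that Claude reads when asked
  • .claude/skills/: domain-specific capabilities (research, planning, code review)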

Best Practices

  • Keep hooks lightweight (no heavy computation)
  • Use rules for enforcing standards, not for task-specific instructions
  • Organize MCP servers by domain (database, API, file system)
  • Version control .claude/ configuration across team
  • Use hooks for CI/CD integration, token monitoring, privacy enforcement
  • Document custom hooks and MCP servers in .claude/docs/

Example (Settings)

.mcp.json (sample, project root):

{
  "mcpServers": {
    "filesystem": {
      "command": "node",
      "args": ["./mcp-servers/filesystem.js"],
      "env": {
        "ALLOWED_PATHS": "/project/src,/project/docs"
      }
    },
    "database": {
      "command": "python",
      "args": ["./mcp-servers/database.py"],
      "env": {
        "DB_URL": "postgresql://localhost:5432/mydb"
      }
    }
  }
}

.claude/settings.json (sample):

{
  "hooks": {
    "SessionStart": [
      {"hooks": [{"type": "command", "command": "node .claude/hooks/token-monitor.cjs"}]}
    ],
    "PreToolUse": [
      {"matcher": "*", "hooks": [{"type": "command", "command": "node .claude/hooks/tool-validator.cjs"}]}
    ],
    "PostToolUse": [
      {"matcher": "*", "hooks": [{"type": "command", "command": "node .claude/hooks/result-logger.cjs"}]}
    ]
  }
}

MODULE 10: Agents and Workflows

Key Notes

Agents vs Workflows:

Agents:

  • Autonomous, goal-driven systems
  • Use tools to achieve objectives
  • Self-directed task planning and execution
  • Error recovery and retry logic
  • Examples: research agent, coding assistant, data analyst

Workflows:

  • Orchestrated sequences of steps
  • Fixed flow, deterministic routing
  • Human-in-the-loop decision points
  • Examples: approval pipelines, data pipelines, CI/CD chains

Parallelization:

  • Execute independent tasks simultaneously
  • No dependencies between tasks
  • Speed up overall execution
  • Example: retrieve 3 data sources in parallel, then synthesize

Chaining:

  • Sequential task execution with dependencies
  • Output of task N feeds into task N+1
  • Used for multi-step workflows
  • Example: research → design → implement → test

Routing:

  • Branch logic based on conditions
  • Route to different tasks based on input/output
  • Used for decision-making agents
  • Example: IF complex_issue THEN escalate ELSE resolve

Agentic Patterns:

  1. Loop Agent:

    • Perceive → Plan → Act → Repeat
    • Check goal achievement, loop until done
    • Tool use at each iteration
  2. Router Agent:

    • Classify input
    • Route to specialized agent/tool
    • Collect and synthesize results
  3. Delegator:

    • Break task into subtasks
    • Delegate to sub-agents
    • Aggregate results
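
The Delegator pattern can be sketched without any API calls by passing the model in as a function. Everything here is illustrative: delegate is a hypothetical helper, and fake_llm is a stand-in that would be replaced by a function wrapping client.messages.create.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def delegate(task: str, llm: Callable[[str], str], max_workers: int = 3) -> str:
    """Split a task into subtasks, run them on sub-agents, and aggregate."""
    # 1. Break the task into subtasks (one per line of the planner's reply)
    plan = llm(f"List the independent subtasks for: {task}. One per line.")
    subtasks = [line.strip("-• ").strip() for line in plan.splitlines() if line.strip()]

    # 2. Delegate each subtask to a sub-agent call, in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda s: llm(f"Complete this subtask: {s}"), subtasks))

    # 3. Aggregate the sub-agent outputs into one answer
    joined = "\n".join(f"{s}: {r}" for s, r in zip(subtasks, results))
    return llm(f"Combine these results into a final answer for '{task}':\n{joined}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call so the sketch runs offline."""
    if prompt.startswith("List the independent subtasks"):
        return "- gather data\n- analyze data"
    if prompt.startswith("Complete this subtask"):
        return "done(" + prompt.split(": ", 1)[1] + ")"
    return "FINAL\n" + prompt.split("\n", 1)[1]


print(delegate("quarterly report", fake_llm))
```

Parsing the plan line by line is fragile with a real model; a structured output format for the subtask list would be a more robust choice.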

Implementation Approaches:

  1. Scheduled Agents (with schedule tool):

    • Run on cron schedule or at specific time
    • Good for monitoring, cleanup, reports
    • Managed by harness
  2. Task Agents (ad-hoc):

    • Spawn when triggered by user/event
    • Run to completion then exit
    • Used for one-off work
  3. Loop Agents (persistent):

    • Long-running, check conditions periodically
    • Monitor/polling patterns
    • Use Monitor tool for streaming events

Best Practices

  • Parallelization: Identify independent tasks; spawn simultaneously; collect results
  • Chaining: Use task outputs as inputs to next; handle failures gracefully
  • Routing: Define clear decision criteria; ensure all routes have handlers
  • Agent Communication: Use file-based or API-based messaging between agents
  • Error Handling: Implement retry logic, fallback options, error logging
  • Monitoring: Log agent execution, measure latency, success rates
  • Testing: Test agent in isolation, then in composition
  • Scaling: Use queue systems (Bull, RabbitMQ) for high-volume agent execution
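
For the error-handling point above, a generic retry helper with exponential backoff might look like the sketch below. The helper name and defaults are assumptions. The Anthropic Python SDK already retries some failed requests by itself (see its max_retries option), so a wrapper like this is mainly useful around tool execution or other agent steps.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, waiting base * 2^attempt (plus jitter) between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay_s * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
    raise AssertionError("loop always returns or raises")


# Demo with a function that fails twice, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(with_retries(flaky, sleep=lambda s: None))  # sleep is stubbed out for the demo
```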

Example (Python & Pseudocode)

Parallel Execution:

import anthropic
import concurrent.futures

client = anthropic.Anthropic()

def research_topic(topic: str) -> str:
    """Research subtask: get info on topic."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Research and summarize: {topic}"
        }]
    )
    return response.content[0].text

def parallel_research(main_topic: str, subtopics: list) -> dict:
    """Execute research in parallel, synthesize results."""
    
    # Parallelize
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        futures = {
            executor.submit(research_topic, subtopic): subtopic
            for subtopic in subtopics
        }
        
        results = {}
        for future in concurrent.futures.as_completed(futures):
            subtopic = futures[future]
            results[subtopic] = future.result()
    
    # Synthesize
    synthesis_prompt = f"""
Given these research summaries on {main_topic}:
{chr(10).join(f'{topic}: {result}' for topic, result in results.items())}

Create a comprehensive summary combining all insights.
"""
    
    synthesis = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": synthesis_prompt}]
    )
    
    return {
        "subtopic_results": results,
        "synthesis": synthesis.content[0].text
    }

# Execute
result = parallel_research("AI Safety", ["Alignment", "Robustness", "Interpretability"])
print(result["synthesis"])

Chaining with Dependencies:

def chain_workflow(initial_data: str) -> str:
    """Execute tasks sequentially with data flow."""
    
    # Step 1: Analyze
    response1 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Analyze this data and identify patterns:\n{initial_data}"
        }]
    )
    analysis = response1.content[0].text
    
    # Step 2: Plan (uses analysis output)
    response2 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Based on this analysis:\n{analysis}\n\nCreate an action plan."
        }]
    )
    plan = response2.content[0].text
    
    # Step 3: Implement (uses plan output)
    response3 = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"Implement this plan with code:\n{plan}"
        }]
    )
    implementation = response3.content[0].text
    
    return implementation

result = chain_workflow("Sales data: 10% growth, high churn in Q2")
print(result)

Router Agent:

def router_agent(issue: str) -> str:
    """Route issue to appropriate handler."""
    
    # Step 1: Classify
    classifier = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        system="Classify issue as: TECHNICAL, BUSINESS, URGENT, OTHER. Respond with only the category.",
        messages=[{"role": "user", "content": issue}]
    )
    category = classifier.content[0].text.strip().upper()  # normalize so case or whitespace differences still match
    
    # Step 2: Route and handle
    if category == "TECHNICAL":
        handler_prompt = "You are a technical expert. Solve this technical issue:\n"
    elif category == "BUSINESS":
        handler_prompt = "You are a business analyst. Address this business concern:\n"
    elif category == "URGENT":
        handler_prompt = "This is urgent. Provide immediate action items:\n"
    else:
        handler_prompt = "Address this general inquiry:\n"
    
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": handler_prompt + issue}]
    )
    
    return response.content[0].text

result = router_agent("Our database is down and customers can't access their accounts!")
print(result)

MODULE 11: Conclusion & Course Summary

Key Takeaways

  1. API Fundamentals: Authentication, requests, multi-turn conversations, system prompts, streaming
  2. Prompt Engineering: Clarity, specificity, XML structure, examples, chain-of-thought
  3. Tool Integration: Schemas, message flow, multi-tool, tool results, fine-grained calling
  4. Retrieval & Search: Chunking, embeddings, BM25, hybrid search, RAG pipelines
  5. Advanced Features: Extended thinking, images/PDFs, citations, caching, code execution, Files API
  6. Architecture: MCP protocol, Claude Code setup, MCP servers, hooks
  7. Systems Design: Agents vs workflows, parallelization, chaining, routing patterns

Certification Prep Checklist

  • Build basic request flow (model selection, tokens, streaming)
  • Implement multi-turn conversation with state management
  • Design and test a prompt with examples and XML structure
  • Create tool schema and implement tool-use loop
  • Build simple RAG pipeline with retrieval and synthesis
  • Implement prompt caching and measure token savings
  • Use extended thinking for reasoning task; measure accuracy improvement
  • Analyze PDF or image using Files API or base64 encoding
  • Design an agent with tool use and loop control
  • Implement parallel task execution with results aggregation
  • Chain multiple model calls with data flow
  • Set up MCP server integration in Claude Code
  • Configure hooks for automation (SessionStart, PreToolUse, PostToolUse)
  • Measure evaluation metrics (accuracy, F1, BLEU) on test dataset
  • Optimize for cost: cache, use Haiku where possible, batch calls


Claude Model Comparison Table

Feature | Haiku 3.5 | Sonnet 3.5 | Opus 3
Context Window | 200K tokens | 200K tokens | 200K tokens
Input Pricing | $0.80/1M | $3.00/1M | $15.00/1M
Output Pricing | $4.00/1M | $15.00/1M | $75.00/1M
Speed | Fastest | Balanced | Slowest
Reasoning | Good | Very Good | Excellent
Coding | Good | Excellent | Excellent
Best For | Simple tasks, high volume, cost-sensitive | Balanced use, production APIs | Complex reasoning, multi-step tasks
Extended Thinking | No | No | No
Vision (Images) | Yes | Yes | Yes
Tool Use | Yes | Yes | Yes
Max Tokens | Recommend <2048 | Recommend <4096 | Up to 4096 (model output limit)
Streaming | Yes | Yes | Yes
Cached Tokens (reads) | $0.08/1M (90% discount) | $0.30/1M (90% discount) | $1.50/1M (90% discount)

Note: extended thinking requires Claude 3.7 Sonnet or a Claude 4 model. Prices are list prices at the time of writing; check the current pricing page.

When to Use Each Model

  • Haiku: Classification, simple Q&A, high-throughput systems, RAG retrieval ranking
  • Sonnet: Production APIs, chatbots, code generation, RAG synthesis, balanced latency/quality
  • Opus: Research tasks, complex reasoning, math/physics, novel problem-solving, cost-insensitive

Implementation Checklist for Certification

Basic API Usage

  • Create client with API key
  • Make basic request with model, max_tokens, messages
  • Handle response and extract text/stop_reason/usage
  • Implement error handling (rate limits, timeouts, auth)

Multi-Turn & Advanced Parameters

  • Build conversation history and multi-turn loop
  • Set system prompt and understand scope
  • Configure temperature and top_p
  • Implement streaming with event loop
  • Use structured output (JSON schema) and validate

Prompt Engineering

  • Write clear, specific prompts with examples
  • Use XML tags for complex instructions
  • Implement few-shot learning with examples
  • Add chain-of-thought prompts for reasoning
  • Test variations on evaluation set

Tool Use

  • Define tool schemas with input parameters
  • Implement tool-use loop: send tools → handle response → execute → return result
  • Handle multiple tools in single turn
  • Use tool_choice parameter (auto, any, tool for a specific tool, none)
  • Add error handling for tool failures

RAG

  • Chunk documents (semantic or fixed-size)
  • Build retrieval function (BM25 or vector-based)
  • Integrate retrieval with Claude request
  • Measure retrieval quality and relevance
  • Implement hybrid search if needed

Advanced Features

  • Use extended thinking for reasoning task; measure improvement
  • Upload and analyze image (base64 or URL)
  • Upload document via Files API
  • Implement prompt caching; measure cache hit rate
  • Add citations to responses where relevant

Agents & Workflows

  • Implement parallel task execution
  • Chain multiple model calls with data flow
  • Create router agent with classification + routing
  • Use tools in agentic loop with goal checking
  • Implement error recovery and retries

Evaluation & Optimization

  • Create test dataset (20+ examples)
  • Implement code-based grading (exact match, regex, schema)
  • Use model-based grading for subjective tasks
  • Measure accuracy, F1, or custom metrics
  • Track token usage and costs per task
  • Identify failure patterns and iterate

Deep Dives & Advanced Topics

Prompt Caching Strategy

When to cache:

  • System prompt (>1K tokens) used in all requests
  • Few-shot examples (>1K tokens) stable across queries
  • Large reference documents included in system
  • Long conversation history (>5 turns)

Cost calculation:

Without cache:
  5 requests × 5000 tokens = 25,000 input tokens
  Cost: 25,000 × $3.00 / 1M = $0.075

With cache (1x creation, 4x reads):
  Creation: 5000 tokens × $3.75 / 1M = $0.01875 (cache writes cost 1.25× base input)
  Reads: 4 × 5000 × $0.30 / 1M = $0.006
  Total: $0.02475 (67% savings)
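
The same arithmetic as a reusable sketch. The function name is hypothetical, and the multipliers are assumptions to adjust for your model and cache lifetime: 1.25× base input for a cache write and 0.10× for a cache read. It also assumes every request after the first is a cache hit.

```python
def caching_cost(
    requests: int,
    cached_tokens: int,
    input_price_per_mtok: float = 3.00,
    write_multiplier: float = 1.25,  # assumed premium for writing the cache
    read_multiplier: float = 0.10,   # cached reads billed at 10% of input price
) -> dict:
    """Compare input cost with and without caching for a repeated prompt prefix."""
    per_token = input_price_per_mtok / 1_000_000
    without = requests * cached_tokens * per_token
    write = cached_tokens * per_token * write_multiplier
    reads = (requests - 1) * cached_tokens * per_token * read_multiplier
    with_cache = write + reads
    return {
        "without_cache": round(without, 5),
        "with_cache": round(with_cache, 5),
        "savings_pct": round(100 * (1 - with_cache / without), 1),
    }


print(caching_cost(requests=5, cached_tokens=5000))
```

Setting write_multiplier=1.0 gives the simplified estimate that ignores the write premium. With a single request, caching costs more than not caching, which is why the hit rate should be measured before relying on it.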

Extended Thinking Budget

Recommended budgets by task:

  • Simple reasoning: 2,000-5,000 tokens
  • Medium complexity (coding, math): 5,000-10,000 tokens
  • Complex multi-step: 10,000-20,000 tokens or more (the budget must stay below max_tokens; the minimum is 1,024)

Token cost:

  • Thinking tokens = output tokens (NOT discounted)
  • Example: 5,000 thinking tokens + 1,000 response tokens = 6,000 output tokens charged
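
A quick way to turn the rule above into a dollar figure, assuming an output price of $15.00 per million tokens (the Sonnet rate used elsewhere in this guide). The function name is hypothetical.

```python
def output_cost(thinking_tokens: int, response_tokens: int, output_price_per_mtok: float) -> float:
    """Thinking and visible response tokens are both billed at the output rate."""
    billed = thinking_tokens + response_tokens
    return billed * output_price_per_mtok / 1_000_000


cost = output_cost(5000, 1000, output_price_per_mtok=15.00)
print(f"${cost:.2f}")
```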

Tool Use Best Practices

Schema design:

{
  "name": "calculate_revenue",
  "description": "Calculate total revenue for a given period and product category",
  "input_schema": {
    "type": "object",
    "properties": {
      "start_date": {"type": "string", "format": "date", "description": "ISO 8601 date"},
      "end_date": {"type": "string", "format": "date"},
      "category": {"type": "string", "enum": ["electronics", "clothing", "food"], "description": "Product category"}
    },
    "required": ["start_date", "end_date", "category"]
  }
}

Error handling in tool results:

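import json
from typing import Optional

def build_tool_result(use_id: str, payload: Optional[dict] = None, error: Optional[str] = None) -> dict:
    """Build a tool_result block; is_error tells Claude the tool call failed."""
    if error is not None:
        return {
            "type": "tool_result",
            "tool_use_id": use_id,
            "content": f"Error: {error}",
            "is_error": True
        }
    return {
        "type": "tool_result",
        "tool_use_id": use_id,
        "content": json.dumps(payload)
    }

# Failure: describe what went wrong so Claude can retry or explain
failed = build_tool_result(use_id, error="Database connection failed")
# Success: return the tool output as a JSON string
succeeded = build_tool_result(use_id, payload={"revenue": 12345, "units": 500})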

RAG Optimization

Chunking strategies:

  1. Semantic: Split on headers/paragraphs

    • Pros: Preserves context, reduces redundancy
    • Cons: Variable chunk sizes, harder to implement
  2. Fixed sliding window: 512 tokens with 256-token overlap

    • Pros: Consistent, predictable
    • Cons: May split important concepts
  3. Hierarchical: Section → subsection → paragraph

    • Pros: Enables different retrieval granularities
    • Cons: More complex indexing
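
The fixed sliding window strategy is the easiest to sketch. The example below works on a list of tokens and uses whitespace splitting as a stand-in for a real tokenizer, which is an assumption made to keep it dependency-free. The function name is hypothetical.

```python
def sliding_window_chunks(tokens: list[str], size: int = 512, overlap: int = 256) -> list[list[str]]:
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer here
words = "one two three four five six seven eight nine ten".split()
for chunk in sliding_window_chunks(words, size=4, overlap=2):
    print(chunk)
```

A larger overlap reduces the chance of splitting a concept across chunks but increases index size, since more tokens are stored twice.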

Retrieval quality metrics:

  • Precision@k: % of top-k results relevant to query
  • Recall: % of all relevant docs retrieved
  • MRR (Mean Reciprocal Rank): Average over queries of 1/rank of the first relevant result (1.0 means the first hit is always relevant)
  • NDCG (Normalized Discounted Cumulative Gain): Relevance ranking quality
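
The first three metrics are short enough to compute by hand. A minimal sketch with made-up document ids and relevance labels; the function names and the sample data are illustrative only.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant."""
    top = retrieved[:k]
    return sum(doc in relevant for doc in top) / k if k else 0.0


def recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of all relevant ids that were retrieved."""
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none found)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0


# Each run is (ranked retrieval result, set of relevant ids); data is made up
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5", "d9", "d4"], {"d5"}),        # first relevant hit at rank 1
]
print(precision_at_k(*runs[0], k=3))
print(recall(*runs[0]))
print(mean_reciprocal_rank(runs))
```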

Hybrid search weights:

  • BM25 weight: 0.3-0.5 (keyword precision)
  • Vector weight: 0.5-0.7 (semantic similarity)
  • Metadata weight: 0.0-0.2 (date/category filters)

Agent Design Patterns

Loop Agent with Tool Use:

def loop_agent(goal: str, tools: list, execute_tool, max_iterations: int = 10) -> str:
    """Run a tool-use loop until Claude stops requesting tools or the cap is hit.

    tools is a list of tool schemas; execute_tool(name, input) is a
    caller-supplied function that runs a tool and returns its result.
    """
    messages = [{"role": "user", "content": goal}]

    for _ in range(max_iterations):
        # Perceive & Plan: Claude sees the full history, including earlier tool results
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

        # Check if done: no further tool calls requested
        if response.stop_reason != "tool_use":
            return "".join(b.text for b in response.content if b.type == "text")

        # Act: run every requested tool and return all results in one user turn
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result)
                })
        messages.append({"role": "user", "content": tool_results})

    return "Max iterations reached"

Unresolved Questions / Topics for Further Study

  • How to handle very large documents (100K+ tokens) in RAG? (Hierarchical chunking strategies)
  • Fine-tuning vs RAG: When to use fine-tuning for domain-specific tasks?
  • Cost optimization for high-volume production: Batching, caching, model selection trade-offs?
  • Guardrails and content filtering: Implementing safety layers on top of Claude API?
  • Multi-language support: How well does Claude handle non-English prompts in tool use?
  • Real-time streaming UI patterns: Best practices for streaming multiple concurrent requests?
  • Agent memory persistence: Effective strategies for long-running agents with context limits?
  • MCP server scaling: Production deployment patterns for MCP servers with multiple clients?