Tools
Tools are the verbs of an agentic system — what the agent can DO. Tool design decides whether the agent feels capable or incompetent, fast or sluggish, trustworthy or risky.
Tool design is different from permissions (factor: permissions.md). Permissions control which tools the agent may use; tool design controls how the tools work and what they expose.
The core problem
Two extremes fail:
- One general tool (Bash): the agent has full power but the system can’t reason about what’s happening. Every shell command needs review. Permission rules are brittle. Concurrency is impossible.
- A million specialized tools: the agent can’t choose between them. Selection latency dominates. Models pick wrong tools when 60+ are visible.
The right design is a small core set of single-purpose tools with deterministic semantics, augmented by general tools for fallback.
Patterns observed in Claude Code
Single-purpose tool design
Claude Code replaces general shell routing with purpose-built tools:
- FileReadTool: read a file (typed input, line-range support)
- FileEditTool: edit a file (precise, atomic, reviewable)
- GrepTool: search file content (built on ripgrep)
- GlobTool: find files by pattern
- WebFetchTool: fetch a URL (with URL validation)
- …and so on, ~43 specialized tools
Each has:
- Typed inputs: no string parsing on the receiving end
- Constrained scope: does one thing
- Individual permission rules: Read(...) vs Edit(...) vs Bash(...) are different rule namespaces
- Predictable output format: structured, parseable
Why it matters: general shell commands are harder to review, permission, and execute correctly. To a reviewer or a permission rule, sed -i 's/foo/bar/' file.txt looks almost identical to sed -i 's/.*//' file.txt, which blanks every line in the file. Purpose-built tools remove that ambiguity.
Trade-off: purpose-built tools cannot cover every edge case. General shells (Bash) remain necessary as a fallback. The discipline is “specialized first, general only when needed.”
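A minimal sketch of what "typed input, constrained scope" could look like for an edit tool. The names EditInput, EditResult, and applyEdit are hypothetical, not Claude Code's actual API, and the sketch works on an in-memory string instead of the filesystem. The input is a schema, not a shell string, and a missing or ambiguous match is rejected instead of silently applied.

```typescript
// Hypothetical typed input for a single-purpose edit tool (illustrative names only).
interface EditInput {
  filePath: string;  // must be absolute
  oldString: string; // exact text to replace
  newString: string; // replacement text
}

type EditResult =
  | { ok: true; content: string }
  | { ok: false; error: string };

// Applies one exact-match replacement to file content held in memory.
// Rejects relative paths, missing matches, and ambiguous (multiple) matches.
function applyEdit(content: string, input: EditInput): EditResult {
  if (!input.filePath.startsWith("/")) {
    return { ok: false, error: "filePath must be absolute" };
  }
  if (input.oldString.length === 0) {
    return { ok: false, error: "oldString must not be empty" };
  }
  const occurrences = content.split(input.oldString).length - 1;
  if (occurrences === 0) {
    return { ok: false, error: "oldString not found" };
  }
  if (occurrences > 1) {
    return {
      ok: false,
      error: `oldString matches ${occurrences} times; add surrounding context to make it unique`,
    };
  }
  // A replacer function keeps "$" sequences in newString from being interpreted.
  return { ok: true, content: content.replace(input.oldString, () => input.newString) };
}
```

Unlike the sed example, there is no way to express "blank every line" by accident: the only operation available is one exact, unique replacement.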
Progressive tool expansion
Start with a small default set (Claude Code uses fewer than 20 default tools) and activate additional tools on demand. MCP tools, remote tools, and custom skills load only when:
- A skill is invoked that needs them
- A file path matches a skill’s path-based activation rule
- The user explicitly enables a tool
Why it matters: visible tools create selection problems. Studies of LLM tool use show that with 60 tools visible, models spend more time deciding and pick wrong ones more often than with 20. The cognitive load on the model is real.
Trade-off: expansion logic adds complexity. Activating too late wastes turns (“I would do X, but I don’t have the tool”). The loader has to balance lazy activation against predicting which tools the agent will need.
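The activation rules above can be sketched as a small registry. Everything here (ToolSpec, ToolRegistry, the tool names in the usage note) is hypothetical and only illustrates the shape: a default visible set, plus lazy tools that become visible through explicit enabling or a path match.

```typescript
// Hypothetical registry: a small default set plus tools activated on demand.
interface ToolSpec {
  name: string;
  // Optional path-based activation rule. Use a non-global RegExp,
  // because RegExp.test is stateful when the "g" flag is set.
  activateOnPath?: RegExp;
}

class ToolRegistry {
  private visible = new Set<string>();
  private lazy: ToolSpec[];

  constructor(defaults: string[], lazyTools: ToolSpec[]) {
    defaults.forEach((name) => this.visible.add(name));
    this.lazy = lazyTools;
  }

  // Explicit activation: the user enables a tool, or an invoked skill requests it.
  // Returns false for tools the registry does not know about.
  enable(name: string): boolean {
    const known = this.lazy.some((tool) => tool.name === name);
    if (known) this.visible.add(name);
    return known;
  }

  // Path-based activation: call this whenever the agent touches a file.
  onPathTouched(path: string): void {
    for (const tool of this.lazy) {
      if (tool.activateOnPath !== undefined && tool.activateOnPath.test(path)) {
        this.visible.add(tool.name);
      }
    }
  }

  // The tool names shown to the model on the next turn.
  visibleTools(): string[] {
    return Array.from(this.visible).sort();
  }
}
```

For example, a registry built with defaults ["Read", "Edit"] and a lazy tool { name: "SqlQuery", activateOnPath: /\.sql$/ } exposes SqlQuery only after onPathTouched("/repo/schema.sql") is called.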
Self-declared concurrency safety
Each tool implements isConcurrencySafe(input):
- Default: false (conservative, exclusive)
- Per-invocation, not per-tool-type
- The decision uses the actual arguments, not just the tool name
Examples:
- Bash("cat file.txt") → safe (read-only)
- Bash("npm install") → exclusive (mutates node_modules/)
- Bash("git status") → safe
- Bash("git commit") → exclusive
The system partitions a batch of tool calls at runtime:
Input:  [Read, Read, Grep, Write, Read, Read]
Output: [Read+Read+Grep]  [Write]      [Read+Read]
        ↑ concurrent      ↑ exclusive  ↑ concurrent
3 batches instead of 6 sequential calls
Greedy batching: consecutive safe tools are grouped into a concurrent batch; an exclusive tool breaks the batch.
Max concurrency: 10 (configurable via CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY).
Lesson: safety is per-invocation, not per-tool. Default exclusive (safe-by-default). The system partitions automatically. This applies to any system with mixed read/write operations: database calls, file ops, API calls, CI/CD steps.
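A minimal sketch of per-invocation safety plus greedy partitioning. Only the isConcurrencySafe name comes from the text above; ToolCall, Batch, partition, and the read-only command list are assumptions for illustration. The Bash classifier is deliberately naive, and the max-concurrency cap is not modelled.

```typescript
// Hypothetical shapes; only isConcurrencySafe is named in the text above.
interface ToolCall {
  tool: string;
  input: string;
}

// Illustrative allow-list of read-only shell commands. A real classifier would
// need to parse the command properly; this one only pattern-matches.
const READ_ONLY_BASH = [/^cat\s/, /^ls(\s|$)/, /^git\s+(status|diff|log)(\s|$)/];
// Pipes, redirects, chaining, and substitution make a command too hard to judge here.
const SHELL_METACHARACTERS = /[|;&<>`$]/;

// Per-invocation decision: looks at the actual arguments, not just the tool name.
function isConcurrencySafe(call: ToolCall): boolean {
  switch (call.tool) {
    case "Read":
    case "Grep":
    case "Glob":
      return true;
    case "Bash":
      if (SHELL_METACHARACTERS.test(call.input)) return false;
      return READ_ONLY_BASH.some((re) => re.test(call.input));
    default:
      return false; // conservative default: exclusive
  }
}

interface Batch {
  concurrent: boolean;
  calls: ToolCall[];
}

// Greedy partition: consecutive safe calls share a batch; an exclusive call gets its own.
function partition(calls: ToolCall[]): Batch[] {
  const batches: Batch[] = [];
  for (const call of calls) {
    const safe = isConcurrencySafe(call);
    const last = batches[batches.length - 1];
    if (safe && last !== undefined && last.concurrent) {
      last.calls.push(call);
    } else {
      batches.push({ concurrent: safe, calls: [call] });
    }
  }
  return batches;
}
```

Fed the sequence Read, Read, Grep, Write, Read, Read, this produces three batches of sizes 3, 1, and 2, matching the diagram above.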
Streaming tool execution
Tools execute immediately as tool_use blocks arrive from the stream, not after the full LLM response. A StreamingToolExecutor (530 LOC in the leak) manages a state machine per tool: queued → executing → completed → yielded.
Why it matters: a tool call emitted early in the response starts running while the model is still generating the rest, so tool latency overlaps with generation instead of adding to it. That saves time on every turn that uses tools.
Sibling abort rules:
- Bash errors cancel sibling tools (because they often share state — current directory, env vars, file locks)
- Read/Grep/WebFetch errors do NOT cancel siblings (they’re independent)
Comment from the leak:
Read/WebFetch/etc are independent — one failure shouldn’t nuke the rest.
Lesson: classify which failures cascade and which are independent. Don’t blindly cancel everything on a single failure.
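A sketch of the two ideas in this section: the per-tool state machine and the sibling abort classification. The state names come from the text; advance, Running, siblingsToCancel, and the choice to treat only Bash as cascading are assumptions for illustration, not the StreamingToolExecutor itself.

```typescript
type ToolState = "queued" | "executing" | "completed" | "yielded";

// Allowed transitions for the per-tool state machine described above.
const NEXT_STATE: Record<ToolState, ToolState | null> = {
  queued: "executing",
  executing: "completed",
  completed: "yielded",
  yielded: null,
};

function advance(state: ToolState): ToolState {
  const next = NEXT_STATE[state];
  if (next === null) throw new Error(`no transition out of "${state}"`);
  return next;
}

// Assumption for illustration: only Bash failures cascade to siblings.
const CASCADING_TOOLS = new Set<string>(["Bash"]);

interface Running {
  id: string;
  tool: string;
  state: ToolState;
}

// Returns the ids of sibling calls to cancel when `failed` errors.
// Calls that already finished are left alone.
function siblingsToCancel(failed: Running, siblings: Running[]): string[] {
  if (!CASCADING_TOOLS.has(failed.tool)) return []; // independent failure
  return siblings
    .filter((s) => s.id !== failed.id && (s.state === "queued" || s.state === "executing"))
    .map((s) => s.id);
}
```

The classification lives in one place (CASCADING_TOOLS), so adding a new stateful tool means adding one entry, not touching the executor.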
Context modifier chain
Tools can modify shared context for subsequent tools via contextModifier callbacks. Example: cd /some/path modifies the working directory for all subsequent Bash calls.
Constraint: only exclusive (non-concurrent) tools can return modifiers. This prevents non-deterministic state from parallel application: if two concurrent tools each try to cd, the result is undefined.
Pattern: immutable context snapshots with explicit transitions. Functional programming applied at the architecture level.
Lesson: never mutate shared state from concurrent operations. Use modifier functions applied sequentially, like Redux reducers but for agent state.
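The reducer analogy can be sketched directly. Only the contextModifier idea comes from the text; AgentContext, ToolResult, and applyModifiers are hypothetical names, and the context shape (cwd plus env) is an assumption.

```typescript
// Hypothetical context shape; treated as an immutable snapshot.
interface AgentContext {
  readonly cwd: string;
  readonly env: Readonly<Record<string, string>>;
}

type ContextModifier = (ctx: AgentContext) => AgentContext;

interface ToolResult {
  output: string;
  exclusive: boolean; // whether the call ran as an exclusive (non-concurrent) call
  contextModifier?: ContextModifier;
}

// Folds modifiers over the context in call order, like a reducer.
// A modifier returned by a concurrent call is rejected, not applied.
function applyModifiers(initial: AgentContext, results: ToolResult[]): AgentContext {
  return results.reduce<AgentContext>((ctx, result) => {
    if (result.contextModifier === undefined) return ctx;
    if (!result.exclusive) {
      throw new Error("concurrent tools may not return a context modifier");
    }
    // The modifier returns a new snapshot; the previous one is left untouched.
    return result.contextModifier(ctx);
  }, initial);
}
```

A cd result would carry a modifier such as (ctx) => ({ ...ctx, cwd: "/some/path" }); every later call sees the new snapshot, and the original object is unchanged.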
Tool input design
Single-purpose tools need well-designed inputs. Principles from Claude Code:
- Typed inputs: use schemas, not free-form strings
- Required + optional split: minimal required surface, rich optional surface
- Sensible defaults: head_limit is 250 by default; user can override
- Boundaries explicit: file paths must be absolute, line ranges 1-indexed, glob patterns documented
- No ambiguous overloads: Edit and MultiEdit are separate tools, not one tool with a flag
Example: the Grep tool input schema:
- pattern (required): regex
- path (optional): search root
- glob (optional): file filter
- type (optional): language type filter
- output_mode: content | files_with_matches | count
- -n, -A, -B, -C, -i: ripgrep-style flags for line numbers, context lines, and case-insensitive matching
- head_limit (default 250): output cap
- multiline (default false): cross-line patterns
This is a small schema with rich semantics. The agent learns it once, uses it everywhere.
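Written as a TypeScript type, the schema might look like the sketch below. The field names follow the list above; the concrete types, the validation rules, and the normalizeGrepInput helper are assumptions, not the tool's real definition.

```typescript
type OutputMode = "content" | "files_with_matches" | "count";

// Field names follow the list above; the TypeScript types are assumed.
interface GrepInput {
  pattern: string; // required: regex
  path?: string; // search root
  glob?: string; // file filter
  type?: string; // language type filter
  output_mode?: OutputMode;
  "-n"?: boolean; // line numbers
  "-A"?: number; // lines of context after a match
  "-B"?: number; // lines of context before a match
  "-C"?: number; // lines of context on both sides
  "-i"?: boolean; // case-insensitive
  head_limit?: number; // output cap
  multiline?: boolean; // cross-line patterns
}

type NormalizedGrepInput = GrepInput & { head_limit: number; multiline: boolean };

// Hypothetical helper: validates the required field and fills in the two stated defaults.
function normalizeGrepInput(input: GrepInput): NormalizedGrepInput {
  if (input.pattern.length === 0) {
    throw new Error("pattern is required and must not be empty");
  }
  const headLimit = input.head_limit ?? 250;
  if (!Number.isInteger(headLimit) || headLimit <= 0) {
    throw new Error("head_limit must be a positive integer");
  }
  return { ...input, head_limit: headLimit, multiline: input.multiline ?? false };
}
```

The only required field is pattern, so the minimal call is normalizeGrepInput({ pattern: "TODO" }); everything else is opt-in.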
Tool output design
Output design is as important as input design. Principles:
- Structured, not freeform. JSON, tables, key-value pairs. Parseable by the agent and the next tool.
- Truncated by default. Default head_limit (e.g., 250 lines). Caller can override for known-large operations.
- Self-describing. Include schema/structure clues so the agent doesn’t have to guess.
- Errors are first-class. Failed tool calls return structured errors, not stack traces. Include suggested next actions where possible.
- Pointer-friendly. Large outputs go to disk, returning a pointer. Saves context window. (See context.md Layer 1.)
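These principles can be combined in one result envelope. The sketch below is hypothetical throughout (ToolOutput, packOutput, toolError, and the field names are illustrative). It does not write anything to disk: it assumes the caller has already saved the full output at spillPath and only decides whether to hand that pointer back.

```typescript
// Hypothetical result envelope: structured, truncated, with first-class errors.
type ToolOutput =
  | {
      ok: true;
      lines: string[]; // at most headLimit lines
      totalLines: number; // lets the agent see how much was cut
      truncated: boolean;
      fullOutputPath: string | null; // pointer to the complete output, if any
    }
  | {
      ok: false;
      code: string; // machine-readable error category
      message: string;
      suggestion: string | null; // suggested next action
    };

// Truncates raw text to headLimit lines and reports what was cut.
function packOutput(raw: string, headLimit = 250, spillPath: string | null = null): ToolOutput {
  // Drop one trailing newline so "a\nb\n" counts as two lines, not three.
  const lines = raw.length === 0 ? [] : raw.replace(/\n$/, "").split("\n");
  const truncated = lines.length > headLimit;
  return {
    ok: true,
    lines: lines.slice(0, headLimit),
    totalLines: lines.length,
    truncated,
    // Only hand back a pointer when something was actually cut.
    fullOutputPath: truncated ? spillPath : null,
  };
}

// Builds a structured error in place of a thrown exception or a stack trace.
function toolError(code: string, message: string, suggestion: string | null = null): ToolOutput {
  return { ok: false, code, message, suggestion };
}
```

An error built this way might read toolError("FILE_NOT_FOUND", "no such file: /repo/x.ts", "run Glob to locate the file"), which gives the agent a category to branch on and a next step to try.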
Anti-patterns
- One general tool to rule them all. Bash for everything. Permission becomes impossible. Concurrency impossible. Audit impossible.
- Tools with overlapping responsibilities. Edit and Patch and Write all do similar things. Agent picks randomly.
- Tools without typed inputs. runCommand({command: "..."}): every input is a free-form string. The agent has to construct the string; the tool has to parse it back.
- Tools that mutate hidden state. cd without a context modifier. Subsequent tools have inconsistent state.
- Concurrency safety declared per-tool-type, not per-invocation. Bash is either always safe or never safe. Loses the partitioning win.
- Default concurrent. Tools assume safety unless told otherwise. First parallel npm install corrupts the lockfile.
- No streaming execution. Tools wait for the entire LLM response before starting. Latency wasted.
- Cascading aborts. One read fails → all parallel tools cancelled. Independent operations shouldn’t cascade.
- Tool output > 50 KB by default. Burns context. Cap the output and make the caller grep for what it needs instead.
- Tool errors as exceptions. The agent sees a stack trace, not a structured “what went wrong, try X next” message.
Takeaways for harness engineering
- Single-purpose tools are the default. General tools (Bash) are the fallback.
- Type the inputs. Schemas, not strings. The schema IS the documentation.
- Default concurrent-unsafe. Tools opt INTO concurrency, not out of it.
- Concurrency safety is per-invocation, not per-type. Bash("cat") and Bash("npm install") are different.
- Partition batches greedily. Consecutive safe tools → concurrent batch. Exclusive tool → breaks the batch.
- Stream execution from the LLM stream, not after it. Free latency win.
- Classify failure cascades. Stateful failures cancel siblings; stateless failures don’t.
- Context modifiers are exclusive-only. Concurrent state mutation is undefined.
- Output is structured, truncated, pointer-friendly. Everything that goes back to the agent has a token cost — minimize it.
- Errors are structured. Not exceptions, not stack traces. Tell the agent what went wrong and what to do next.
- Tool count matters cognitively. Keep the visible set small; lazy-load the rest.
What this repo does
- Claude Code’s built-in tools. This distribution doesn’t ship tools (those are part of the Claude Code binary), but the rules and skills are designed around them: Read, Edit, Write, Grep, Glob, Bash, WebFetch, Agent, etc.
- hooks/build-sensor.cjs: PostToolUse hook that runs the project’s build command after Edit/Write/MultiEdit. Implements the streaming feedback principle: tool result feedback is automatic, not manual. Context-efficient: success is silent; only failures surface to the agent.
- hooks/loop-detection.cjs: PreToolUse hook that detects when the agent is editing the same file repeatedly. Implements a soft circuit breaker for tool invocations.
- hooks/scout-block.cjs: restricts tool inputs (file paths, bash arguments) at the harness layer. The tool API is unchanged; the inputs are filtered before they reach the tool.
- hooks/descriptive-name.cjs: injects naming guidance into the Write tool’s call site. A pre-tool advisory, not a hard restriction.
- Skills as tool composers. qa, ship, review, decompose etc. don’t add new tools; they orchestrate existing tools into higher-level workflows. The “skill” is a structured prompt that uses the existing tool surface.
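To illustrate the soft circuit breaker idea behind loop detection, here is a pure-function sketch. It is not the code in hooks/loop-detection.cjs: EditEvent, detectEditLoop, the window size, and the threshold are all hypothetical, and the real hook's payload shape is not shown in this document.

```typescript
// Hypothetical record of one edit-type tool call; not the hook's real payload.
interface EditEvent {
  filePath: string;
}

// Returns an advisory message once a single file reaches `threshold` edits
// within the last `windowSize` edits, otherwise null. Both numbers are illustrative.
function detectEditLoop(history: EditEvent[], windowSize = 10, threshold = 5): string | null {
  const recent = history.slice(-windowSize);
  const counts = new Map<string, number>();
  for (const event of recent) {
    const count = (counts.get(event.filePath) ?? 0) + 1;
    counts.set(event.filePath, count);
    if (count >= threshold) {
      return (
        `${event.filePath} was edited ${count} times in the last ${recent.length} edits; ` +
        "consider re-reading the file or changing approach"
      );
    }
  }
  return null;
}
```

Returning a message and not an error keeps the breaker soft: the harness can surface the advice to the agent without blocking the next edit.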
Open problems
- No tool concurrency partitioning at the harness level. This is inside the Claude Code binary; the harness can’t tune it. If we ran our own loop, we’d partition.
- No tool-level error taxonomy in this repo. Errors come back as strings; the agent has to parse intent from prose. A structured error type would be cleaner.
- No tool usage telemetry. We don’t know which tools are used most, which fail most, which return the largest outputs. Instrumenting tools would surface optimization opportunities.
- No custom tool definitions. The repo extends Claude Code via hooks and skills, not custom tools. Tools come from Claude Code itself or MCP servers. If a true custom tool became necessary, it would need an MCP wrapper.