Tools

Tools are the verbs of an agentic system — what the agent can DO. Tool design decides whether the agent feels capable or incompetent, fast or sluggish, trustworthy or risky.

Tool design is different from permissions (factor: permissions.md). Permissions control which tools the agent may use; tool design controls how the tools work and what they expose.


The core problem

Two extremes fail:

  1. One general tool (Bash) — the agent has full power but the system can’t reason about what’s happening. Every shell command needs review. Permission rules are brittle. Concurrency is impossible.
  2. A million specialized tools — the agent can’t choose between them. Selection latency dominates. Models pick wrong tools when 60+ are visible.

The right design is a small core set of single-purpose tools with deterministic semantics, augmented by general tools for fallback.


Patterns observed in Claude Code

Single-purpose tool design

Claude Code replaces general shell routing with purpose-built tools:

  • FileReadTool — read a file (typed input, line-range support)
  • FileEditTool — edit a file (precise, atomic, reviewable)
  • GrepTool — search file content (built on ripgrep)
  • GlobTool — find files by pattern
  • WebFetchTool — fetch a URL (with URL validation)
  • …and so on, ~43 specialized tools

Each has:

  • Typed inputs — no string parsing on the receiving end
  • Constrained scope — does one thing
  • Individual permission rules: Read(...), Edit(...), and Bash(...) are different rule namespaces
  • Predictable output format — structured, parseable

Why it matters: general shell commands are harder to review, to write permission rules for, and to execute correctly. To a reviewer or a pattern-based permission rule, sed -i 's/foo/bar/' file.txt looks almost the same as sed -i 's/.*//' file.txt, which blanks every line in the file. Purpose-built tools remove that ambiguity.

Trade-off: purpose-built tools cannot cover every edge case. General shells (Bash) remain necessary as a fallback. The discipline is “specialized first, general only when needed.”

Progressive tool expansion

Start with a small default set (Claude Code uses fewer than 20 default tools) and activate additional tools on demand. MCP tools, remote tools, and custom skills load only when:

  • A skill is invoked that needs them
  • A file path matches a skill’s path-based activation rule
  • The user explicitly enables a tool

Why it matters: visible tools create selection problems. Studies of LLM tool use show that with 60 tools visible, models spend more time deciding and pick wrong ones more often than with 20. The cognitive load on the model is real.

Trade-off: expansion logic adds complexity. Activating too late wastes turns (“I would do X, but I don’t have the tool”), so the loader has to balance lazy loading against predicting what the agent will need next.
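The three activation triggers listed above can be sketched as a small registry. This is an illustrative Python sketch, not Claude Code's loader: ToolRegistry, LazyTool, and the tool names notebook_tool and browser_tool are all hypothetical.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class LazyTool:
    """A tool that stays hidden until one of its activation rules fires."""
    name: str
    skills: set[str] = field(default_factory=set)         # skills that need it
    path_globs: list[str] = field(default_factory=list)   # path-based activation


class ToolRegistry:
    """Hypothetical registry: a small default set plus lazily activated tools."""

    def __init__(self, default_tools: list[str], lazy_tools: list[LazyTool]):
        self.visible: list[str] = list(default_tools)
        self._lazy = {tool.name: tool for tool in lazy_tools}

    def _activate(self, name: str) -> None:
        if name in self._lazy and name not in self.visible:
            self.visible.append(name)

    def on_skill_invoked(self, skill: str) -> None:
        """Trigger 1: a skill is invoked that needs the tool."""
        for tool in self._lazy.values():
            if skill in tool.skills:
                self._activate(tool.name)

    def on_path_touched(self, path: str) -> None:
        """Trigger 2: a file path matches an activation rule."""
        for tool in self._lazy.values():
            if any(fnmatch(path, glob) for glob in tool.path_globs):
                self._activate(tool.name)

    def enable(self, name: str) -> None:
        """Trigger 3: the user explicitly enables the tool."""
        self._activate(name)


registry = ToolRegistry(
    default_tools=["Read", "Edit", "Grep", "Glob", "Bash"],
    lazy_tools=[
        LazyTool("notebook_tool", path_globs=["*.ipynb"]),
        LazyTool("browser_tool", skills={"qa"}),
    ],
)
registry.on_path_touched("analysis/report.ipynb")  # makes notebook_tool visible
```

The model only ever sees registry.visible, so the selection problem scales with the active set rather than with everything installed.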

Self-declared concurrency safety

Each tool implements isConcurrencySafe(input):

  • Default: false (conservative — exclusive)
  • Per-invocation, not per-tool-type
  • The decision uses the actual arguments, not just the tool name

Examples:

  • Bash("cat file.txt") → safe (read-only)
  • Bash("npm install") → exclusive (mutates node_modules/)
  • Bash("git status") → safe
  • Bash("git commit") → exclusive
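A per-invocation check of this kind might look like the sketch below. It is a Python rendering of the idea behind isConcurrencySafe(input), not the real classifier. The allowlists and the metacharacter rule are assumptions made for illustration, and anything the function cannot recognise falls back to exclusive.

```python
import shlex

# Hypothetical allowlists; a production classifier would need to be far more
# thorough about pipes, subshells, aliases, and command options.
READ_ONLY_COMMANDS = {"cat", "ls", "head", "tail", "wc", "pwd"}
READ_ONLY_GIT_SUBCOMMANDS = {"status", "log", "diff", "show"}
SHELL_METACHARACTERS = set(";|&><`$")


def is_concurrency_safe(command: str) -> bool:
    """Return True only when the command is recognisably read-only."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False  # chaining or redirection: assume it may mutate state
    try:
        argv = shlex.split(command)
    except ValueError:
        return False  # unbalanced quotes: cannot reason about the command
    if not argv:
        return False
    if argv[0] == "git":
        return len(argv) > 1 and argv[1] in READ_ONLY_GIT_SUBCOMMANDS
    return argv[0] in READ_ONLY_COMMANDS
```

The important property is the direction of the default: an unknown command is treated as exclusive, so a gap in the allowlist costs parallelism rather than correctness.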

The system partitions a batch of tool calls at runtime:

Input:  [Read, Read, Grep, Write, Read, Read]
Output: [Read+Read+Grep] [Write] [Read+Read]
        ↑ concurrent     ↑ exclusive ↑ concurrent
        3 batches instead of 6 sequential calls

Greedy batching: consecutive safe tools are grouped into a concurrent batch; an exclusive tool breaks the batch.

Max concurrency: 10 (configurable via CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY).

Lesson: safety is per-invocation, not per-tool. Default exclusive (safe-by-default). The system partitions automatically. This applies to any system with mixed read/write operations: database calls, file ops, API calls, CI/CD steps.
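The greedy partitioning described above fits in a few lines. The following Python sketch assumes a hypothetical ToolCall type and takes the safety check as a callback. Splitting a long safe run at max_concurrency is one possible way to honour the cap, chosen here for simplicity, and may not match how Claude Code applies it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical representation of one tool invocation."""
    tool: str
    args: dict = field(default_factory=dict)


def partition(
    calls: list[ToolCall],
    is_safe: Callable[[ToolCall], bool],
    max_concurrency: int = 10,
) -> list[list[ToolCall]]:
    """Group consecutive safe calls; every exclusive call gets its own batch."""
    batches: list[list[ToolCall]] = []
    current: list[ToolCall] = []
    for call in calls:
        if is_safe(call):
            current.append(call)
            if len(current) == max_concurrency:  # cap reached: close the batch
                batches.append(current)
                current = []
        else:
            if current:                          # exclusive call ends the safe run
                batches.append(current)
                current = []
            batches.append([call])
    if current:
        batches.append(current)
    return batches


calls = [ToolCall(name) for name in ["Read", "Read", "Grep", "Write", "Read", "Read"]]
batches = partition(calls, is_safe=lambda call: call.tool != "Write")
shape = [[call.tool for call in batch] for batch in batches]
# shape == [["Read", "Read", "Grep"], ["Write"], ["Read", "Read"]]
```

Order is preserved across batches, so a read that follows a write always sees the write's result.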

Streaming tool execution

Tools execute immediately as tool_use blocks arrive from the stream, not after the full LLM response. A StreamingToolExecutor (530 LOC in the leak) manages a state machine per tool: queued → executing → completed → yielded.

Why it matters: tool execution can begin before the LLM finishes generating its response. This shaves seconds off every turn that uses tools.
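The per-tool state machine can be illustrated with asyncio. This is a minimal Python sketch of the idea, not the StreamingToolExecutor itself: the class, the on_tool_use hook, and the choice to yield results in arrival order are assumptions, and the sketch ignores exclusive tools, which would wait in the queued state.

```python
import asyncio
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    EXECUTING = "executing"
    COMPLETED = "completed"
    YIELDED = "yielded"


class StreamingExecutor:
    """Hypothetical executor: starts each tool as soon as its block arrives."""

    def __init__(self, run_tool):
        self._run_tool = run_tool
        self._order: list[str] = []            # tool_use ids in arrival order
        self._tasks: dict[str, asyncio.Task] = {}
        self.state: dict[str, State] = {}

    def on_tool_use(self, tool_id: str, name: str, args: dict) -> None:
        """Call from inside the event loop when a complete tool_use block arrives."""
        self._order.append(tool_id)
        self.state[tool_id] = State.QUEUED
        self._tasks[tool_id] = asyncio.create_task(self._execute(tool_id, name, args))

    async def _execute(self, tool_id: str, name: str, args: dict):
        self.state[tool_id] = State.EXECUTING
        result = await self._run_tool(name, args)
        self.state[tool_id] = State.COMPLETED
        return result

    async def results(self):
        """Yield results in arrival order, whichever tool finished first."""
        for tool_id in self._order:
            result = await self._tasks[tool_id]
            self.state[tool_id] = State.YIELDED
            yield tool_id, result


async def fake_tool(name: str, args: dict) -> str:
    await asyncio.sleep(args["delay"])         # stand-in for real tool work
    return f"{name} done"


async def demo():
    executor = StreamingExecutor(fake_tool)
    executor.on_tool_use("t1", "Read", {"delay": 0.02})
    await asyncio.sleep(0)                     # the model is still generating
    executor.on_tool_use("t2", "Grep", {"delay": 0.0})
    collected = [item async for item in executor.results()]
    return collected, dict(executor.state)


results, final_states = asyncio.run(demo())
```

In the demo the first tool is already running before the second block has arrived, which is where the latency saving comes from.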

Sibling abort rules:

  • Bash errors cancel sibling tools (because they often share state — current directory, env vars, file locks)
  • Read/Grep/WebFetch errors do NOT cancel siblings (they’re independent)

Comment from the leak:

Read/WebFetch/etc are independent — one failure shouldn’t nuke the rest.

Lesson: classify which failures cascade and which are independent. Don’t blindly cancel everything on a single failure.
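One way to express the cascade classification is a set of tool names whose failure cancels siblings. The sketch below is a Python illustration under that assumption. CASCADING_TOOLS and run_siblings are hypothetical names, and only Bash is treated as stateful here.

```python
import asyncio

# Assumption: tools that share state with siblings (cwd, env vars, file locks).
CASCADING_TOOLS = {"Bash"}


async def run_siblings(calls):
    """Run (tool_name, coroutine) pairs concurrently; return one outcome each."""
    names = [name for name, _ in calls]
    tasks = [asyncio.create_task(coro) for _, coro in calls]

    async def watch(index: int):
        try:
            return ("ok", await tasks[index])
        except asyncio.CancelledError:
            return ("cancelled", None)         # a stateful sibling failed first
        except Exception as exc:
            if names[index] in CASCADING_TOOLS:
                for other, task in enumerate(tasks):
                    if other != index and not task.done():
                        task.cancel()          # stateful failure: abort siblings
            return ("error", str(exc))

    return await asyncio.gather(*(watch(i) for i in range(len(tasks))))


async def succeed(value: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return value


async def fail(message: str, delay: float) -> str:
    await asyncio.sleep(delay)
    raise RuntimeError(message)


async def demo():
    read_fails = await run_siblings([
        ("Read", fail("no such file", 0.0)),
        ("Grep", succeed("3 matches", 0.01)),
    ])
    bash_fails = await run_siblings([
        ("Bash", fail("exit code 1", 0.0)),
        ("Read", succeed("contents", 0.05)),
    ])
    return read_fails, bash_fails


read_fails, bash_fails = asyncio.run(demo())
```

When the Read fails, the Grep still returns its matches. When the Bash call fails, the pending Read is cancelled and reported as such, so the agent can tell a cancelled sibling from a failed one.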

Context modifier chain

Tools can modify shared context for subsequent tools via contextModifier callbacks. Example: cd /some/path modifies the working directory for all subsequent Bash calls.

Constraint: only exclusive (non-concurrent) tools can return modifiers. This prevents non-deterministic state from modifiers being applied in parallel: if two concurrent tools each try to cd, the resulting working directory is undefined.

Pattern: immutable context snapshots with explicit transitions. Functional programming applied at the architecture level.

Lesson: never mutate shared state from concurrent operations. Use modifier functions applied sequentially, like Redux reducers but for agent state.
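The reducer analogy can be made concrete with an immutable context and modifier functions. This Python sketch uses hypothetical names (Context, ToolResult, apply_result) and models only the working directory. It illustrates the pattern and is not the contextModifier API itself.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Context:
    """Immutable snapshot of shared agent state (only cwd is modelled here)."""
    cwd: str


ContextModifier = Callable[[Context], Context]


@dataclass(frozen=True)
class ToolResult:
    output: str
    context_modifier: Optional[ContextModifier] = None


def apply_result(ctx: Context, result: ToolResult, exclusive: bool) -> Context:
    """Return the next snapshot; reject modifiers from concurrent tools."""
    if result.context_modifier is None:
        return ctx
    if not exclusive:
        raise ValueError("concurrent tools may not return context modifiers")
    return result.context_modifier(ctx)


def cd(path: str) -> ToolResult:
    """A tool result that asks for the working directory to change."""
    return ToolResult(output="", context_modifier=lambda ctx: replace(ctx, cwd=path))


start = Context(cwd="/repo")
after = apply_result(start, cd("/repo/src"), exclusive=True)
# start is unchanged; after is a new snapshot with the updated cwd.
```

Because each transition returns a new snapshot, a tool that is already running keeps the context it started with, and the order in which modifiers are applied is explicit.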


Tool input design

Single-purpose tools need well-designed inputs. Principles from Claude Code:

  1. Typed inputs — use schemas, not free-form strings
  2. Required + optional split — minimal required surface, rich optional surface
  3. Sensible defaults: head_limit is 250 by default; the user can override it
  4. Boundaries explicit — file paths must be absolute, line ranges 1-indexed, glob patterns documented
  5. No ambiguous overloads: Edit and MultiEdit are separate tools, not one tool with a flag

Example: the Grep tool input schema:

  • pattern (required) — regex
  • path (optional) — search root
  • glob (optional) — file filter
  • type (optional) — language type filter
  • output_mode: content | files_with_matches | count
  • -n, -A, -B, -C, -i — context controls
  • head_limit (default 250) — output cap
  • multiline (default false) — cross-line patterns

This is a small schema with rich semantics. The agent learns it once, uses it everywhere.
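A typed version of this schema could be sketched as a Python dataclass. The field names follow the list above. The default for output_mode and the validation rules are assumptions made for the sketch, and the flag-style options (-n, -A, -B, -C, -i) are left out because they are not valid Python identifiers.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional

OUTPUT_MODES = ("content", "files_with_matches", "count")


@dataclass(frozen=True)
class GrepInput:
    """Hypothetical typed input for a Grep-style tool."""
    pattern: str                              # required: regex
    path: Optional[str] = None                # optional: search root
    glob: Optional[str] = None                # optional: file filter
    type: Optional[str] = None                # optional: language type filter
    output_mode: str = "files_with_matches"   # assumed default for this sketch
    head_limit: int = 250                     # output cap
    multiline: bool = False                   # cross-line patterns

    def __post_init__(self) -> None:
        # Validate at construction so the tool body never sees bad input.
        if not self.pattern:
            raise ValueError("pattern is required")
        if self.path is not None and not PurePosixPath(self.path).is_absolute():
            raise ValueError("path must be absolute")
        if self.output_mode not in OUTPUT_MODES:
            raise ValueError(f"output_mode must be one of {OUTPUT_MODES}")
        if self.head_limit <= 0:
            raise ValueError("head_limit must be positive")


query = GrepInput(pattern="TODO", path="/repo/src", glob="*.py")
```

Only pattern is required; every other field has a default, which is the required and optional split described in principle 2.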


Tool output design

Output design is as important as input design. Principles:

  1. Structured, not freeform. JSON, tables, key-value pairs. Parseable by the agent and the next tool.
  2. Truncated by default. Default head_limit (e.g., 250 lines). Caller can override for known-large operations.
  3. Self-describing. Include schema/structure clues so the agent doesn’t have to guess.
  4. Errors are first-class. Failed tool calls return structured errors, not stack traces. Include suggested next actions where possible.
  5. Pointer-friendly. Large outputs go to disk, returning a pointer. Saves context window. (See context.md Layer 1.)
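Several of these principles can be combined in one result type. The following Python sketch is illustrative only: ToolOutput, build_output, and build_error are hypothetical names, and spilling to a temporary file stands in for whatever storage the harness actually uses.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolOutput:
    """Hypothetical structured result returned to the agent."""
    ok: bool
    lines: list[str] = field(default_factory=list)
    truncated: bool = False
    total_lines: int = 0
    full_output_path: Optional[str] = None   # pointer to the untruncated output
    error: Optional[dict] = None             # structured error, never a stack trace


def build_output(raw: str, head_limit: int = 250) -> ToolOutput:
    """Return at most head_limit lines; spill the rest to disk behind a pointer."""
    lines = raw.splitlines()
    if len(lines) <= head_limit:
        return ToolOutput(ok=True, lines=lines, total_lines=len(lines))
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(raw)
    return ToolOutput(
        ok=True,
        lines=lines[:head_limit],
        truncated=True,
        total_lines=len(lines),
        full_output_path=path,
    )


def build_error(code: str, message: str, suggestion: str) -> ToolOutput:
    """Return a failure the agent can act on without parsing prose."""
    return ToolOutput(
        ok=False,
        error={"code": code, "message": message, "suggestion": suggestion},
    )
```

The truncated and total_lines fields make the result self-describing: the agent can see that it received a partial view and decide whether to read the file at full_output_path or narrow the query.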

Anti-patterns

  1. One general tool to rule them all. Bash for everything. Permission becomes impossible. Concurrency impossible. Audit impossible.
  2. Tools with overlapping responsibilities. Edit and Patch and Write all do similar things. Agent picks randomly.
  3. Tools without typed inputs. runCommand({command: "..."}) — every input is a free-form string. The agent has to construct the string; the tool has to parse it back.
  4. Tools that mutate hidden state. cd without a context modifier. Subsequent tools have inconsistent state.
  5. Concurrency safety declared per-tool-type, not per-invocation. Bash is either always safe or never safe. Loses the partitioning win.
  6. Default concurrent. Tools assume safety unless told otherwise. First parallel npm install corrupts the lockfile.
  7. No streaming execution. Tools wait for the entire LLM response before starting. Latency wasted.
  8. Cascading aborts. One read fails → all parallel tools cancelled. Independent operations shouldn’t cascade.
  9. Tool output > 50 KB by default. Burns context. Cap the output and make the caller narrow the request (for example with Grep) instead.
  10. Tool errors as exceptions. The agent sees a stack trace, not a structured “what went wrong, try X next” message.

Takeaways for harness engineering

  1. Single-purpose tools are the default. General tools (Bash) are the fallback.
  2. Type the inputs. Schemas, not strings. The schema IS the documentation.
  3. Default concurrent-unsafe. Tools opt INTO concurrency, not out of it.
  4. Concurrency safety is per-invocation, not per-type. Bash("cat") and Bash("npm install") are different.
  5. Partition batches greedily. Consecutive safe tools → concurrent batch. Exclusive tool → breaks the batch.
  6. Stream execution from the LLM stream, not after it. Free latency win.
  7. Classify failure cascades. Stateful failures cancel siblings; stateless failures don’t.
  8. Context modifiers are exclusive-only. Concurrent state mutation is undefined.
  9. Output is structured, truncated, pointer-friendly. Everything that goes back to the agent has a token cost — minimize it.
  10. Errors are structured. Not exceptions, not stack traces. Tell the agent what went wrong and what to do next.
  11. Tool count matters cognitively. Keep the visible set small; lazy-load the rest.

What this repo does

  • Claude Code’s built-in tools — this distribution doesn’t ship tools (those are part of the Claude Code binary), but the rules and skills are designed around them: Read, Edit, Write, Grep, Glob, Bash, WebFetch, Agent, etc.
  • hooks/build-sensor.cjs — PostToolUse hook that runs the project’s build command after Edit/Write/MultiEdit. Implements the streaming feedback principle: tool result feedback is automatic, not manual. Context-efficient: success is silent; only failures surface to the agent.
  • hooks/loop-detection.cjs — PreToolUse hook that detects when the agent is editing the same file repeatedly. Implements a soft circuit breaker for tool invocations.
  • hooks/scout-block.cjs — restricts tool inputs (file paths, bash arguments) at the harness layer. The tool API is unchanged; the inputs are filtered before they reach the tool.
  • hooks/descriptive-name.cjs — injects naming guidance into the Write tool’s call site. A pre-tool advisory, not a hard restriction.
  • Skills as tool composers: qa, ship, review, decompose etc. don’t add new tools; they orchestrate existing tools into higher-level workflows. The “skill” is a structured prompt that uses the existing tool surface.

Open problems

  • No tool concurrency partitioning at the harness level. This is inside the Claude Code binary; the harness can’t tune it. If we ran our own loop, we’d partition.
  • No tool-level error taxonomy in this repo. Errors come back as strings; the agent has to parse intent from prose. A structured error type would be cleaner.
  • No tool usage telemetry. We don’t know which tools are used most, which fail most, which return the largest outputs. Instrumenting tools would surface optimization opportunities.
  • No custom tool definitions. The repo extends Claude Code via hooks and skills, not custom tools. Tools come from Claude Code itself or MCP servers. If a true custom tool became necessary, it would need an MCP wrapper.