Context

Context is the live window the LLM sees on each turn: system prompt, prior conversation, tool results, injected memories, injected rules. The window is finite, and space in it is the single most important resource constraint in harness engineering.

Context is different from memory (factor: memory.md). Memory is the durable store; context is the per-turn working set.


The core problem

LLM context windows are finite. Agentic sessions can run for hours across hundreds of tool calls. Every token spent on boilerplate is a token unavailable for reasoning. Context engineering is the discipline of maximizing signal per token across the entire session, not just one turn.

The anti-pattern is treating context as unlimited: dumping everything into the prompt, loading all rules, reading whole files when a grep suffices. It works until the window fills; after that the session degrades to the point of being useless.


Patterns observed in Claude Code

Persistent instruction file pattern

A durable project-level configuration file (CLAUDE.md, CLAW.md, etc.) automatically loaded at session start. Defines build commands, test procedures, architecture rules, naming conventions, coding standards. Ships with the repository.

Without this, agents re-learn project conventions each session, and users re-explain them each session. The file IS the agent’s onboarding document.

Key trade-off: the instruction file must stay current as the project evolves. Stale rules teach outdated patterns — worse than no rules.

Scoped context assembly

Dynamic instruction loading from multiple files at different hierarchy levels:

  • Organization level
  • User level (~/.claude/CLAUDE.md)
  • Project root (./CLAUDE.md)
  • Parent directories (walk upward)
  • Child directories (walk downward into active file’s tree)

Rules vary based on the agent’s working location. When editing packages/api/, load packages/api/CLAUDE.md; when editing packages/web/, load packages/web/CLAUDE.md. Monorepos stop being a single-context problem.

Trade-off: discoverability suffers when instructions span multiple files. Conflicting rules across scopes produce surprising behavior. Needs a deduplication layer (content-hash based in Claude Code) so identical instructions from different scopes load once.
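The upward walk and the content-hash deduplication described above can be sketched roughly as follows. This is a minimal illustration, not Claude Code's implementation: the function names, the SHA-256 choice, and the broadest-scope-first ordering are all assumptions, and the organization, user, and child-directory levels are left out.

```python
import hashlib
from pathlib import Path

INSTRUCTION_FILENAME = "CLAUDE.md"  # filename taken from the text above


def collect_instruction_files(active_file: Path, project_root: Path) -> list[Path]:
    """Walk upward from the active file's directory to the project root.

    Returns instruction files ordered from broadest scope (project root)
    to narrowest (the active file's own directory). Ordering is an assumption.
    """
    root = project_root.resolve()
    current = active_file.resolve().parent
    found: list[Path] = []
    while True:
        candidate = current / INSTRUCTION_FILENAME
        if candidate.is_file():
            found.append(candidate)
        # Stop at the project root, or at the filesystem root as a safety net.
        if current == root or current == current.parent:
            break
        current = current.parent
    return list(reversed(found))


def dedupe_by_content(paths: list[Path]) -> list[tuple[Path, str]]:
    """Keep only the first file for each distinct instruction body."""
    seen: set[str] = set()
    unique: list[tuple[Path, str]] = []
    for path in paths:
        text = path.read_text(encoding="utf-8")
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, text))
    return unique
```

Editing packages/api/handler.py would pick up ./CLAUDE.md, packages/CLAUDE.md, and packages/api/CLAUDE.md if they exist; two of them with identical bodies would be injected once.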

Progressive context compaction (5-layer defense)

Multi-stage compression tuned for conversation age. The Claude Code leak showed 5 escalating layers, cheapest first:

| # | Layer | Mechanism | Cost |
|---|-------|-----------|------|
| 1 | Tool result truncation | Large outputs persisted to disk, only a pointer kept in context | ~Zero |
| 2 | Microcompact | Remove stale tool results at compact boundaries, no LLM needed | Low |
| 3 | Auto-compact | LLM summarizes the entire conversation, creates a compact boundary | 1 API call |
| 4 | Reactive compact | Emergency: triggered by prompt_too_long error mid-turn | 1 API call + retry |
| 5 | Context collapse | Read-time projection across old turns (stubbed in leak) | Medium |
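Layer 1 is the easiest to illustrate. The sketch below shows one way a tool layer might spill a large result to disk and keep only a preview and a pointer in context. The threshold, the file naming, and the pointer wording are hypothetical choices made for the example, not values from the source.

```python
import hashlib
from pathlib import Path

MAX_INLINE_CHARS = 2_000  # hypothetical threshold, not a value from the source


def offload_tool_result(output: str, spill_dir: Path, max_inline: int = MAX_INLINE_CHARS) -> str:
    """Return the text that should enter context for one tool result.

    Small outputs pass through unchanged. Large outputs are written to disk
    and replaced by a short preview plus a pointer to the full file.
    """
    if len(output) <= max_inline:
        return output
    spill_dir.mkdir(parents=True, exist_ok=True)
    # Name the file by content hash so identical outputs share one file.
    name = hashlib.sha256(output.encode("utf-8")).hexdigest()[:16] + ".txt"
    path = spill_dir / name
    path.write_text(output, encoding="utf-8")
    preview = output[:max_inline]
    return (
        f"{preview}\n"
        f"[output truncated: {len(output)} chars total, full result saved to {path}]"
    )
```

The pointer lets the agent re-read a slice of the saved file later if it turns out to matter, which is why this layer costs almost nothing.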

Design principles:

  • Cheapest layer first; 99% of requests fast-path at layer 1.
  • Each layer runs once per iteration only — prevents infinite loops.
  • Circuit breaker (hasAttemptedReactiveCompact) at layer 4.
  • Compact boundary = controlled forgetting. The summary IS the new history; details before it are unrecoverable.
  • Memory extraction runs alongside compact — “forget conversation, remember lessons” (see memory.md).

Empirical calibration: 20,000 tokens reserved for compact summary output, based on measured p99.99 = 17,387 tokens. Not guessed — measured.
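A calibration like this can be reproduced from logged summary sizes. The sketch below uses a nearest-rank percentile and rounds up to a coarse step for headroom; the rounding rule and the function names are assumptions, and only the 17,387 and 20,000 figures come from the text above.

```python
import math


def percentile(samples: list[int], pct: float) -> int:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def reserve_for(samples: list[int], pct: float = 99.99, round_to: int = 5_000) -> int:
    """Round the chosen percentile up to the next multiple of round_to (assumed headroom rule)."""
    observed = percentile(samples, pct)
    return math.ceil(observed / round_to) * round_to
```

With an observed p99.99 of 17,387 tokens, rounding up to the next 5,000 gives the 20,000-token reserve.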

Tiered memory (context side)

Memory index (200-line cap) is always in context; topic files load on demand. See memory.md for the storage side. The context side is: the index is a first-class context citizen, the topic files are lazy.
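In code, the split between the eager index and the lazy topic files might look like the sketch below. The file layout (one index file plus one `<topic>.md` per topic) and both function names are hypothetical; only the 200-line cap comes from the text.

```python
from pathlib import Path
from typing import Optional

MAX_INDEX_LINES = 200  # cap stated in the text above


def load_memory_index(index_path: Path, max_lines: int = MAX_INDEX_LINES) -> str:
    """Always-in-context part: the index, cut at the line cap."""
    lines = index_path.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[:max_lines])


def load_topic(memory_dir: Path, topic: str) -> Optional[str]:
    """Lazy part: read one topic file only when the agent asks for it."""
    path = memory_dir / f"{topic}.md"
    return path.read_text(encoding="utf-8") if path.is_file() else None
```

The index is paid for on every turn, so the cap bounds a fixed cost; topic files cost tokens only on the turns that load them.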

Dynamic boundary marker for prompt caching

From Claude Code’s system prompt structure:

Intro
System rules
Doing tasks
Actions care
__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__   ← marker
Environment
Project context
Instruction files
Runtime config

Everything before the boundary is static across turns → cacheable by the API. Everything after the boundary changes per turn (env vars, current dir, active file) → not cacheable.

Why it matters: API prompt caching bills cached prefix tokens at a steep discount. Every static section you move above the boundary saves money on every turn of every session. The boundary is a performance lever, not a cosmetic separator.
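One way to keep the boundary honest is to fingerprint the static prefix and check that it does not change between turns. The sketch below is illustrative only: the helper names are hypothetical, and how the prefix is actually marked cacheable is provider-specific and not shown.

```python
import hashlib

BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"  # marker name from the structure above


def build_prompt(static_sections: list[str], dynamic_sections: list[str]) -> str:
    """Join static sections, the marker, then dynamic sections."""
    return "\n".join([*static_sections, BOUNDARY, *dynamic_sections])


def split_at_boundary(system_prompt: str) -> tuple[str, str]:
    """Split a system prompt into (static prefix, dynamic suffix)."""
    static, marker, dynamic = system_prompt.partition(BOUNDARY)
    if not marker:
        # No marker: treat the whole prompt as dynamic, so nothing is assumed cacheable.
        return "", system_prompt
    return static, dynamic


def prefix_fingerprint(system_prompt: str) -> str:
    """Hash of the static prefix; it should stay constant from turn to turn."""
    static, _ = split_at_boundary(system_prompt)
    return hashlib.sha256(static.encode("utf-8")).hexdigest()
```

If the fingerprint changes between two turns of the same session, something dynamic has leaked above the boundary and the cache prefix will miss.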

Instruction file budgets

  • MAX_INSTRUCTION_FILE_CHARS = 4,000 per file
  • MAX_TOTAL_INSTRUCTION_CHARS = 12,000 total across all loaded instruction files
  • Truncation marker: [truncated]
  • Deduplication: content-hash based

Hard budgets prevent a single file or a recursive directory walk from blowing out the window.
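The two caps compose as in the sketch below. The constants and the marker come from the list above; the function names, the choice to count the marker toward the cap, and the handling of files that arrive after the total budget is spent are assumptions.

```python
MAX_INSTRUCTION_FILE_CHARS = 4_000
MAX_TOTAL_INSTRUCTION_CHARS = 12_000
TRUNCATION_MARKER = "[truncated]"


def clip(text: str, limit: int) -> str:
    """Cut text to at most `limit` chars, ending with the marker when cut."""
    if len(text) <= limit:
        return text
    keep = max(0, limit - len(TRUNCATION_MARKER))
    return (text[:keep] + TRUNCATION_MARKER)[:limit]


def apply_budgets(files: list[str]) -> tuple[list[str], int]:
    """Apply the per-file cap, then the total cap, in load order.

    Returns the texts to inject and the number of files omitted entirely,
    so the caller can report omissions instead of dropping them silently.
    """
    loaded: list[str] = []
    omitted = 0
    remaining = MAX_TOTAL_INSTRUCTION_CHARS
    for text in files:
        if remaining <= 0:
            omitted += 1
            continue
        clipped = clip(text, min(MAX_INSTRUCTION_FILE_CHARS, remaining))
        loaded.append(clipped)
        remaining -= len(clipped)
    return loaded, omitted
```

Load order matters under this scheme: files loaded first get their full per-file allowance, so the most specific or most important scope should probably come first.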


Compaction mechanics in depth

Compaction is not one technique — it’s a pipeline. In order:

  1. Truncate long tool results inline. grep ... | head -100 logic in the tool layer itself.
  2. Drop stale results — tool results that are no longer referenced by subsequent turns get microcompacted (deleted from context, not summarized).
  3. Summarize old turns — once the window crosses a threshold, an LLM summarization pass collapses old turns into a single “so far” block.
  4. Extract memories in the same summarization call — see memory.md.
  5. Retry on prompt_too_long error — if layer 3 didn’t compact aggressively enough, the next API call returns an error; the reactive compact layer 4 kicks in, summarizes harder, retries the failed turn.
  6. Context collapse — long-term turn projection that reduces N old turns into a compressed schematic. Mentioned in the leak, stubbed in source.

Each layer is one-shot per iteration with a circuit breaker. The agent never loops on “compact → fail → compact → fail”.
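The one-shot rule for the reactive layer can be sketched as a turn loop with a single boolean guard. This is a simplified model under stated assumptions: `PromptTooLong`, `call_model`, and `summarize` are hypothetical stand-ins, and the flag name is a Python rendering of hasAttemptedReactiveCompact.

```python
from typing import Callable


class PromptTooLong(Exception):
    """Stand-in for the API's prompt_too_long error."""


def run_turn(
    context: list[str],
    call_model: Callable[[list[str]], str],
    summarize: Callable[[list[str]], list[str]],
) -> str:
    """Run one turn, allowing at most one reactive compact and one retry."""
    has_attempted_reactive_compact = False  # circuit breaker for layer 4
    while True:
        try:
            return call_model(context)
        except PromptTooLong:
            if has_attempted_reactive_compact:
                # Already compacted once this turn: surface the error instead of looping.
                raise
            has_attempted_reactive_compact = True
            context = summarize(context)
```

If the summary still does not fit, the second failure propagates to the caller, which is the behavior that prevents a compact, fail, compact, fail cycle.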


Streaming tool execution as context management

Tools execute as soon as their tool_use blocks arrive from the API stream, without waiting for the full response. This is a context optimization: tool results can be interleaved into the next turn’s context as soon as they exist, rather than held in a buffer.

Sibling abort rules: only Bash errors cancel sibling tools. Read/Grep/WebFetch errors are independent. The stream keeps flowing.
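The sibling abort rule can be modeled with asyncio tasks. The sketch below covers only the cancellation policy, not the streaming itself, and it is an illustration rather than Claude Code's machinery: `run_siblings`, the result strings, and the `ABORTS_SIBLINGS` set are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable

ABORTS_SIBLINGS = {"Bash"}  # per the rule above; other tools fail independently


async def run_siblings(
    calls: list[tuple[str, Callable[[], Awaitable[str]]]],
) -> list[str]:
    """Run tool calls concurrently; a failing Bash call cancels the rest."""
    tasks = [asyncio.ensure_future(fn()) for _, fn in calls]

    def on_done(index: int) -> Callable[[asyncio.Future], None]:
        def callback(task: asyncio.Future) -> None:
            if task.cancelled() or task.exception() is None:
                return
            # Only errors from tools in ABORTS_SIBLINGS cancel the other tasks.
            if calls[index][0] in ABORTS_SIBLINGS:
                for other in tasks:
                    if other is not task:
                        other.cancel()
        return callback

    for i, task in enumerate(tasks):
        task.add_done_callback(on_done(i))

    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    results: list[str] = []
    for (name, _), outcome in zip(calls, outcomes):
        if isinstance(outcome, asyncio.CancelledError):
            results.append(f"{name}: cancelled")
        elif isinstance(outcome, Exception):
            results.append(f"{name}: error: {outcome}")
        else:
            results.append(outcome)
    return results
```

A failed Read leaves a concurrent Grep untouched, while a failed Bash command cancels whatever is still running alongside it.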

See patterns.md for the async-generator machinery that makes this possible.


Anti-patterns

  1. Fixed-size history trimming (drop oldest N turns). Loses context indiscriminately: the cut ignores relevance, so the key fact you need is often in a dropped turn.
  2. Single-pass summarization. One LLM call compresses everything. Loses nuance, hallucinates mid-compression. Use the 5-layer defense instead.
  3. No tool result truncation. cat large-file.json dumps 200 KB into context. Compactors can’t save you from this — fix it at the tool layer.
  4. System prompt with no dynamic boundary. Static and per-turn content are interleaved, so the cacheable prefix is invalidated on every turn. Burns money on every request.
  5. Rules in the system prompt as free-form prose. Unparseable by the compactor, unlinkable by the deduplication layer. Use structured instruction files.
  6. Loading all instruction files unconditionally. A recursive CLAUDE.md walk through a monorepo loads 40 files = 160 KB. Use path-based scoping.
  7. No compact boundary marker. Agent doesn’t know which “old turns” are safe to summarize vs which are load-bearing.

Takeaways for harness engineering

  1. Treat the window as a budget. Every hook, every rule, every injected line has a cost. Measure it.
  2. Layer the defense. Truncate → microcompact → summarize → reactive compact → collapse. Each layer has a specific cost band.
  3. Measure, don’t guess. Claude Code’s 20K reserved summary budget came from p99.99 = 17,387 measured tokens. Instrument your compactor before tuning it.
  4. Move static sections above the dynamic boundary. Every static token is a cacheable token, and a cached token is billed at a discount.
  5. Dedupe by content hash. Scoped assembly will load the same rule from 3 places — dedupe before injection.
  6. Hard-budget instruction files. Per-file cap and total cap. Truncate with a marker, don’t silently drop.
  7. Extract memories in the summarization pass. One API call, two artifacts. Never summarize without extracting.
  8. Circuit-break every compaction layer. One attempt per iteration. No loops.
  9. Fix tool output truncation at the tool layer, not the compactor. Compactors are emergency, not routine.
  10. Separate static rules (cacheable, auto-loaded) from detailed standards (read on-demand). In this repo: rules/ vs docs/.
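Takeaway 1 presupposes some accounting of what each source contributes to the window. A minimal ledger might look like the sketch below. The class and function names are hypothetical, and the four-characters-per-token estimate is a rough assumption rather than a tokenizer.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token. An assumption, not a tokenizer."""
    return (len(text) + 3) // 4


class ContextLedger:
    """Tracks estimated tokens contributed by each source in one turn."""

    def __init__(self, window_tokens: int) -> None:
        self.window_tokens = window_tokens
        self.by_source: Counter = Counter()

    def add(self, source: str, text: str) -> None:
        self.by_source[source] += estimate_tokens(text)

    def used(self) -> int:
        return sum(self.by_source.values())

    def report(self) -> list[tuple[str, int, float]]:
        """(source, tokens, share of window), largest first."""
        return [
            (source, tokens, tokens / self.window_tokens)
            for source, tokens in self.by_source.most_common()
        ]
```

Calling `add` from each hook or injection point and logging `report()` once per turn would show which source is spending the budget before any tuning starts.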

What this repo does

  • hooks/inject-rules.cjs — loads every .md from rules/ into context at session start. Auto-scoped: session-level rules are visible globally.
  • hooks/dev-rules-reminder.cjs — re-injects rules on every user prompt, plus active Plan Context. Defends against mid-session drift.
  • hooks/token-efficiency-reminder.cjs — injects token discipline on every prompt. “Match effort to request complexity. No preambles.”
  • hooks/usage-context-awareness.cjs — tracks remaining rate-limit budget, writes to cache, read by statusline and context builder. Lets the agent adapt behavior to remaining quota.
  • rules/ vs docs/ split: rules/ is auto-loaded (cost always); docs/ is read on-demand (cost per request). The split is itself a context discipline.
  • hooks/enforce-doc-rules.cjs — enforces doc conventions on docs/*.md edits. Prevents the rules from drifting into the docs or vice versa.
  • hooks/scout-block.cjs — blocks reads from .ckignore directories (node_modules/, dist/, etc.). Stops context pollution before it enters the window.
  • Claude Code’s built-in auto memory system handles the “extract before compact” principle via its own compaction-time memory extraction.

Open problems

  • No per-turn token budget telemetry in this repo. usage-context-awareness.cjs tracks API-level rate limits but not per-turn context use. A dashboard would help.
  • No dynamic boundary marker in this repo’s system prompt. Every turn’s prompt is rebuilt; no explicit split between cacheable and dynamic. Lost cache wins.
  • No empirical measurement of compact summary budgets. We don’t know what our p99 summary size is. Guessing.
  • Context collapse is still stubbed in Claude Code’s leak. Nobody has shipped a production context-collapse layer that works well. Open research problem.