Context
Context is the live window the LLM sees on each turn: system prompt, prior conversation, tool results, injected memories, injected rules. The context window is finite; competition for space is the single most important resource constraint in harness engineering.
Context is different from memory (factor: memory.md). Memory is the durable store; context is the per-turn working set.
The core problem
LLM context windows are finite. Agentic sessions can run for hours across hundreds of tool calls. Every token spent on boilerplate is a token unavailable for reasoning. Context engineering is the discipline of maximizing signal per token across the entire session, not just one turn.
The anti-pattern is treating context as unlimited: dumping everything into the prompt, loading all rules, reading whole files when a grep suffices. It works until it doesn’t — then the session becomes useless.
Patterns observed in Claude Code
Persistent instruction file pattern
A durable project-level configuration file (CLAUDE.md, CLAW.md, etc.) automatically loaded at session start. Defines build commands, test procedures, architecture rules, naming conventions, coding standards. Ships with the repository.
Without this, agents re-learn project conventions each session, and users re-explain them each session. The file IS the agent’s onboarding document.
Key trade-off: the instruction file must stay current as the project evolves. Stale rules teach outdated patterns — worse than no rules.
Scoped context assembly
Dynamic instruction loading from multiple files at different hierarchy levels:
- Organization level
- User level (~/.claude/CLAUDE.md)
- Project root (./CLAUDE.md)
- Parent directories (walk upward)
- Child directories (walk downward into active file’s tree)
Rules vary based on the agent’s working location. When editing packages/api/, load packages/api/CLAUDE.md; when editing packages/web/, load packages/web/CLAUDE.md. Monorepos stop being a single-context problem.
Trade-off: discoverability suffers when instructions span multiple files. Conflicting rules across scopes produce surprising behavior. Needs a deduplication layer (content-hash based in Claude Code) so identical instructions from different scopes load once.
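A minimal sketch of the project-root-to-active-directory part of this walk, followed by content-hash deduplication. The function names, the use of SHA-256, and the whitespace stripping before hashing are illustrative assumptions, not Claude Code's actual implementation; organization-level and user-level files are left out.

```python
import hashlib
from pathlib import PurePosixPath


def candidate_paths(project_root, active_dir, filename="CLAUDE.md"):
    """List instruction-file paths from the project root down to active_dir."""
    root = PurePosixPath(project_root)
    # relative_to raises ValueError if active_dir is outside project_root.
    rel = PurePosixPath(active_dir).relative_to(root)
    paths = [root / filename]
    current = root
    for part in rel.parts:
        current = current / part
        paths.append(current / filename)
    return paths


def dedupe_by_content(scoped_texts):
    """Keep the first occurrence of each distinct instruction text.

    scoped_texts: (scope_label, text) pairs, broadest scope first.
    """
    seen = set()
    kept = []
    for scope, text in scoped_texts:
        # Assumption: hash the stripped text so whitespace-only differences
        # at the edges do not defeat deduplication.
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((scope, text))
    return kept


paths = candidate_paths("/repo", "/repo/packages/api")
kept = dedupe_by_content([
    ("user", "Use tabs.\n"),
    ("project", "Run the test suite before committing.\n"),
    ("packages/api", "Use tabs."),  # same rule as the user scope
])
```

Here the packages/api entry is dropped because its content matches the user-level rule. Keeping the first occurrence means the broadest scope wins ties, which is a design choice of this sketch.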
Progressive context compaction (5-layer defense)
Multi-stage compression tuned for conversation age. The Claude Code leak showed 5 escalating layers, cheapest first:
| # | Layer | Mechanism | Cost |
|---|---|---|---|
| 1 | Tool result truncation | Large outputs persisted to disk, only a pointer kept in context | ~Zero |
| 2 | Microcompact | Remove stale tool results at compact boundaries, no LLM needed | Low |
| 3 | Auto-compact | LLM summarizes the entire conversation, creates a compact boundary | 1 API call |
| 4 | Reactive compact | Emergency: triggered by prompt_too_long error mid-turn | 1 API call + retry |
| 5 | Context collapse | Read-time projection across old turns (stubbed in leak) | Medium |
Design principles:
- Cheapest layer first; 99% of requests fast-path at layer 1.
- Each layer runs once per iteration only — prevents infinite loops.
- Circuit breaker (hasAttemptedReactiveCompact) at layer 4.
- Compact boundary = controlled forgetting. The summary IS the new history; details before it are unrecoverable.
- Memory extraction runs alongside compact: “forget conversation, remember lessons” (see memory.md).
Empirical calibration: 20,000 tokens reserved for compact summary output, based on measured p99.99 = 17,387 tokens. Not guessed — measured.
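One way to reproduce this kind of calibration on your own data is sketched below. The nearest-rank percentile, the round-up step of 5,000 tokens, and the 200,000-token window are assumptions for illustration; the source states only the measured 17,387 and the reserved 20,000.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Smallest sample with at least pct percent of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def reserve_for(measured_tokens, step=5_000):
    """Round a measured percentile up to the next multiple of step (assumed policy)."""
    return math.ceil(measured_tokens / step) * step


def auto_compact_threshold(context_window, reserve):
    """Input tokens available before the summary reserve is at risk."""
    return context_window - reserve


reserve = reserve_for(17_387)  # the p99.99 figure quoted above
threshold = auto_compact_threshold(200_000, reserve)  # hypothetical window size
median = nearest_rank_percentile([10, 20, 30, 40, 50], 50)
```

Feeding logged summary sizes into nearest_rank_percentile gives the measured figure; the reserve is then a rounded-up margin over it.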
Tiered memory (context side)
Memory index (200-line cap) is always in context; topic files load on demand. See memory.md for the storage side. The context side is: the index is a first-class context citizen, the topic files are lazy.
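The split can be sketched as follows, assuming the index is plain text and topic files are fetched through a caller-supplied reader. LazyTopics, index_for_context, and the in-session cache are hypothetical names and choices, not part of Claude Code.

```python
MAX_INDEX_LINES = 200  # cap stated above


def index_for_context(index_text, max_lines=MAX_INDEX_LINES):
    """Return the part of the memory index that is always injected."""
    return "\n".join(index_text.splitlines()[:max_lines])


class LazyTopics:
    """Read a topic file only when it is first requested."""

    def __init__(self, read_topic):
        self._read_topic = read_topic  # hypothetical callable: name -> text
        self._loaded = {}

    def get(self, name):
        if name not in self._loaded:
            self._loaded[name] = self._read_topic(name)
        return self._loaded[name]


reads = []


def fake_read(name):
    """Stand-in for a disk read; records each call."""
    reads.append(name)
    return f"notes about {name}"


index = index_for_context("\n".join(f"- topic-{i}" for i in range(250)))
topics = LazyTopics(fake_read)
first = topics.get("testing")
second = topics.get("testing")  # served from the in-session cache
```

Only the capped index costs tokens on every turn; a topic costs tokens only in the turns that ask for it.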
Dynamic boundary marker for prompt caching
From Claude Code’s system prompt structure:
Intro
System rules
Doing tasks
Actions care
__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__ ← marker
Environment
Project context
Instruction files
Runtime config
Everything before the boundary is static across turns → cacheable by the API. Everything after the boundary changes per turn (env vars, current dir, active file) → not cacheable.
Why it matters: API prompt caching bills cached prefix tokens at a fraction of the normal input price; they are discounted, not free. Every static section you move above the boundary saves money on every turn of every session. The boundary is a performance lever, not a cosmetic separator.
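A small sketch of how a harness could check that the section above the boundary really is stable across turns. The marker string comes from the structure above; split_at_boundary, prefix_fingerprint, and build_prompt are hypothetical helpers, and the fallback for a missing marker is an assumption.

```python
import hashlib

BOUNDARY = "__SYSTEM_PROMPT_DYNAMIC_BOUNDARY__"


def split_at_boundary(system_prompt):
    """Return (static_prefix, dynamic_suffix) around the boundary marker."""
    static, marker, dynamic = system_prompt.partition(BOUNDARY)
    if not marker:
        # No marker found: assume nothing is safe to cache.
        return "", system_prompt
    return static.rstrip("\n"), dynamic.lstrip("\n")


def prefix_fingerprint(system_prompt):
    """Hash of the static prefix; it should not change between turns."""
    static, _ = split_at_boundary(system_prompt)
    return hashlib.sha256(static.encode("utf-8")).hexdigest()


def build_prompt(cwd):
    """Hypothetical prompt builder with one per-turn value below the marker."""
    return "\n".join(["Intro", "System rules", BOUNDARY, f"Environment: cwd={cwd}"])


turn_1 = build_prompt("/repo/packages/api")
turn_2 = build_prompt("/repo/packages/web")
stable = prefix_fingerprint(turn_1) == prefix_fingerprint(turn_2)
```

If the fingerprint changes between turns, something dynamic has leaked above the boundary and the cached prefix is being invalidated.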
Instruction file budgets
- MAX_INSTRUCTION_FILE_CHARS = 4,000 per file
- MAX_TOTAL_INSTRUCTION_CHARS = 12,000 total across all loaded instruction files
- Truncation marker: [truncated]
- Deduplication: content-hash based
Hard budgets prevent a single file or a recursive directory walk from blowing out the window.
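The two caps can be applied together roughly as below. The constants and the marker come from the list above; whether the marker counts toward the cap, and what happens to files that arrive after the total is exhausted, are assumptions of this sketch.

```python
MAX_INSTRUCTION_FILE_CHARS = 4_000
MAX_TOTAL_INSTRUCTION_CHARS = 12_000
TRUNCATION_MARKER = "[truncated]"


def truncate(text, limit):
    """Clip text to limit characters, ending with the marker when clipped."""
    if len(text) <= limit:
        return text
    if limit <= len(TRUNCATION_MARKER):
        return TRUNCATION_MARKER[:limit]
    # Assumption: the marker counts toward the limit.
    return text[: limit - len(TRUNCATION_MARKER)] + TRUNCATION_MARKER


def apply_budgets(texts):
    """Apply the per-file cap, then the running total cap, in load order.

    Returns (loaded_texts, skipped_count) so callers can report files that
    did not fit instead of dropping them silently.
    """
    loaded = []
    remaining = MAX_TOTAL_INSTRUCTION_CHARS
    for text in texts:
        if remaining <= 0:
            break
        clipped = truncate(text, min(MAX_INSTRUCTION_FILE_CHARS, remaining))
        loaded.append(clipped)
        remaining -= len(clipped)
    return loaded, len(texts) - len(loaded)


loaded, skipped = apply_budgets(["a" * 5_000, "b" * 5_000, "c" * 5_000, "d" * 5_000])
```

With four oversized files, the first three are each clipped to 4,000 characters and the fourth is reported as skipped.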
Compaction mechanics in depth
Compaction is not one technique — it’s a pipeline. In order:
- Truncate long tool results inline. grep ... | head -100 logic in the tool layer itself.
- Drop stale results: tool results that are no longer referenced by subsequent turns get microcompacted (deleted from context, not summarized).
- Summarize old turns: once the window crosses a threshold, an LLM summarization pass collapses old turns into a single “so far” block.
- Extract memories in the same summarization call (see memory.md).
- Retry on prompt_too_long error: if layer 3 didn’t compact aggressively enough, the next API call returns an error; the reactive compact layer 4 kicks in, summarizes harder, retries the failed turn.
- Context collapse: long-term turn projection that reduces N old turns into a compressed schematic. Mentioned in the leak, stubbed in source.
Each layer is one-shot per iteration with a circuit breaker. The agent never loops on “compact → fail → compact → fail”.
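The one-shot rule for the reactive layer can be sketched like this. PromptTooLong, run_turn, and the fake model and compactors are stand-ins invented for the example; only the idea of a hasAttemptedReactiveCompact flag comes from the source.

```python
class PromptTooLong(Exception):
    """Stand-in for the API's prompt_too_long error."""


def run_turn(call_model, reactive_compact, context):
    """Call the model; on prompt_too_long, compact once and retry once."""
    has_attempted_reactive_compact = False  # circuit breaker, reset per turn
    while True:
        try:
            return call_model(context)
        except PromptTooLong:
            if has_attempted_reactive_compact:
                raise  # second failure: give up instead of looping
            has_attempted_reactive_compact = True
            context = reactive_compact(context)


def fake_model(context, limit=3):
    """Pretend model that rejects contexts longer than limit messages."""
    if len(context) > limit:
        raise PromptTooLong(f"{len(context)} messages")
    return f"answered with {len(context)} messages"


def summarize_hard(context):
    """Pretend compactor: one summary message plus the latest message."""
    return ["summary of earlier turns", context[-1]]


compact_calls = []


def useless_compact(context):
    """Pretend compactor that fails to shrink anything."""
    compact_calls.append(1)
    return context


history = ["m1", "m2", "m3", "m4", "m5"]
reply = run_turn(fake_model, summarize_hard, history)
try:
    run_turn(fake_model, useless_compact, history)
    gave_up = False
except PromptTooLong:
    gave_up = True
```

With a compactor that works, the turn succeeds after one retry. With one that does not, the error surfaces after exactly one compaction attempt.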
Streaming tool execution as context management
Tools execute immediately as tool_use blocks arrive from the API stream, not waiting for the full response. This is a context optimization: tool results can be interleaved into the next turn’s context as soon as they exist, rather than held in a buffer.
Sibling abort rules: only Bash errors cancel sibling tools. Read/Grep/WebFetch errors are independent. The stream keeps flowing.
See patterns.md for the async-generator machinery that makes this possible.
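A rough asyncio sketch of the two behaviors described above: tools start as their blocks arrive, and only a Bash failure cancels siblings. The block shape (id, name), run_streamed_tools, and the fake stream and tool runner are assumptions for illustration; this is not the async-generator machinery from patterns.md, and blocks that arrive after a Bash failure are still started here.

```python
import asyncio

ABORTS_SIBLINGS = {"Bash"}  # per the text: only Bash errors cancel siblings


async def run_streamed_tools(tool_use_stream, run_tool):
    """Start each tool when its block arrives; return outcomes keyed by id."""
    tasks = {}  # task -> (tool_use id, tool name)

    def cancel_siblings_on_bash_error(task):
        if task.cancelled() or task.exception() is None:
            return
        if tasks[task][1] in ABORTS_SIBLINGS:
            for other in tasks:
                if other is not task:
                    other.cancel()  # no-op for tasks that already finished

    async for block in tool_use_stream:
        # Schedule the tool now instead of waiting for the stream to finish.
        task = asyncio.create_task(run_tool(block))
        tasks[task] = (block["id"], block["name"])
        task.add_done_callback(cancel_siblings_on_bash_error)

    if tasks:
        await asyncio.wait(tasks)

    outcomes = {}
    for task, (tool_id, _name) in tasks.items():
        if task.cancelled():
            outcomes[tool_id] = "cancelled"
        elif task.exception() is not None:
            outcomes[tool_id] = f"error: {task.exception()}"
        else:
            outcomes[tool_id] = task.result()
    return outcomes


async def fake_stream(blocks):
    """Stand-in for tool_use blocks arriving from the API stream."""
    for block in blocks:
        yield block


async def fake_run_tool(block):
    """Stand-in tool runner: sleeps, then succeeds or fails as configured."""
    await asyncio.sleep(block["delay"])
    if block.get("fail"):
        raise RuntimeError("exit 1")
    return f"{block['name']} ok"


bash_fails = asyncio.run(run_streamed_tools(fake_stream([
    {"id": "t1", "name": "Read", "delay": 0},
    {"id": "t2", "name": "Bash", "delay": 0.01, "fail": True},
    {"id": "t3", "name": "Grep", "delay": 1},
]), fake_run_tool))
read_fails = asyncio.run(run_streamed_tools(fake_stream([
    {"id": "r1", "name": "Read", "delay": 0, "fail": True},
    {"id": "g1", "name": "Grep", "delay": 0.01},
]), fake_run_tool))
```

In the first run the slow Grep is cancelled when Bash fails; in the second, the Read failure leaves Grep untouched.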
Anti-patterns
- Fixed-size history trimming (drop oldest N turns). Loses context indiscriminately: the key fact you need is often in the dropped turn.
- Single-pass summarization. One LLM call compresses everything. Loses nuance, hallucinates mid-compression. Use the 5-layer defense instead.
- No tool result truncation. cat large-file.json dumps 200 KB into context. Compactors can’t save you from this; fix it at the tool layer.
- System prompt with no dynamic boundary. Static and per-turn content are interleaved, so the cacheable prefix changes on every turn. Burns money on every request.
- Rules in the system prompt as free-form prose. Unparseable by the compactor, unlinkable by the deduplication layer. Use structured instruction files.
- Loading all instruction files unconditionally. A recursive CLAUDE.md walk through a monorepo loads 40 files = 160 KB. Use path-based scoping.
- No compact boundary marker. Agent doesn’t know which “old turns” are safe to summarize vs which are load-bearing.
Takeaways for harness engineering
- Treat the window as a budget. Every hook, every rule, every injected line has a cost. Measure it.
- Layer the defense. Truncate → microcompact → summarize → reactive compact → collapse. Each layer has a specific cost band.
- Measure, don’t guess. Claude Code’s 20K reserved summary budget came from p99.99 = 17,387 measured tokens. Instrument your compactor before tuning it.
- Move static sections above the dynamic boundary. Every static token is a cached token is a cheaper token.
- Dedupe by content hash. Scoped assembly will load the same rule from 3 places — dedupe before injection.
- Hard-budget instruction files. Per-file cap and total cap. Truncate with a marker, don’t silently drop.
- Extract memories in the summarization pass. One API call, two artifacts. Never summarize without extracting.
- Circuit-break every compaction layer. One attempt per iteration. No loops.
- Fix tool output truncation at the tool layer, not the compactor. Compactors are emergency, not routine.
- Separate static rules (cacheable, auto-loaded) from detailed standards (read on-demand). In this repo: rules/ vs docs/.
What this repo does
- hooks/inject-rules.cjs: loads every .md from rules/ into context at session start. Auto-scoped: session-level rules are visible globally.
- hooks/dev-rules-reminder.cjs: re-injects rules on every user prompt, plus active Plan Context. Defends against mid-session drift.
- hooks/token-efficiency-reminder.cjs: injects token discipline on every prompt. “Match effort to request complexity. No preambles.”
- hooks/usage-context-awareness.cjs: tracks remaining rate-limit budget, writes to cache, read by statusline and context builder. Lets the agent adapt behavior to remaining quota.
- rules/ vs docs/ split: rules/ is auto-loaded (cost always); docs/ is read on-demand (cost per request). The split is itself a context discipline.
- hooks/enforce-doc-rules.cjs: enforces doc conventions on docs/*.md edits. Prevents the rules from drifting into the docs or vice versa.
- hooks/scout-block.cjs: blocks reads from .ckignore directories (node_modules/, dist/, etc.). Stops context pollution before it enters the window.
- Claude Code’s built-in auto memory system handles the “extract before compact” principle via its own compaction-time memory extraction.
Open problems
- No per-turn token budget telemetry in this repo. usage-context-awareness.cjs tracks API-level rate limits but not per-turn context use. A dashboard would help.
- No dynamic boundary marker in this repo’s system prompt. Every turn’s prompt is rebuilt; no explicit split between cacheable and dynamic. Lost cache wins.
- No empirical measurement of compact summary budgets. We don’t know what our p99 summary size is. Guessing.
- Context collapse is still stubbed in Claude Code’s leak. We know of no production context-collapse layer that works well. Open research problem.