Recovery
Recovery
Recovery is what the system does when things break: API errors, rate limits, prompt-too-long, model unavailability, tool failures, hook crashes, network drops. Recovery design is the difference between a session that survives a hiccup and one that dies.
The naive approach is binary: success or fail. The real problem requires graduated recovery — multiple escalation levels, each with its own budget and circuit breaker.
The core problem
Agentic sessions are long. Long means more API calls, more tools, more chances for things to go wrong. A binary success/fail design treats every failure the same way (abort) and gives the user a useless partial result. A naive retry loop treats every failure the same way (retry forever) and burns budget on unrecoverable errors.
The right design distinguishes failure types and applies the cheapest viable recovery for each. Transient errors retry. Permanent errors abort. Context errors compact. Rate limits switch models. Each strategy runs once per failure — never retry the same strategy that just failed.
Patterns observed in Claude Code
Escalating recovery chain
Five levels, in order of severity. Each runs once per failure.
| # | Strategy | When | Cost |
|---|---|---|---|
| 1 | Retry same config | Transient errors (network blip, rate-limit retry-after) | Free |
| 2 | Escalate output limit | Hit max_tokens cap | One-shot config change |
| 3 | Inject recovery message | LLM gave up mid-thought | One injected user turn |
| 4 | Switch to fallback model | Primary model unavailable | Model switch |
| 5 | Surface error to user | All previous strategies exhausted | Session interruption |
Each level has a per-level budget counter (e.g., retry x3) and a circuit breaker (don’t retry the same strategy that just failed).
Death spiral prevention: when an error occurs, skip hooks. Hooks inject tokens; tokens worsen prompt_too_long. The recovery path bypasses the normal hook pipeline.
The recovery message (prompt engineering gold)
When the LLM gives up mid-response (output token limit, mid-thought truncation), Claude Code injects this exact message:
“Output token limit hit. Resume directly — no apology, no recap of what you were doing. Pick up mid-thought if that is where the cut happened. Break remaining work into smaller pieces.”
Why every word matters:
- “Resume directly” — don’t restart from scratch
- “No apology” — burns tokens, signals defeat, often triggers more apologies
- “No recap” — re-summarizing prior work doubles the token cost
- “Pick up mid-thought” — explicitly permit the unusual continuation
- “Break remaining work into smaller pieces” — adapt to the constraint, don’t fight it
This is a specific recovery prompt for a specific failure mode. Generic recovery prompts fail because they don’t tell the model what to not do.
Per-strategy budget counters
Each escalation level has a budget:
- Retry: x3 max
- Escalate output: 1 attempt (8K → 64K, then no more increases)
- Recovery message: 1 injection per failure event
- Model fallback: 1 switch per session
- User surface: terminal
After the budget is exhausted, escalate to the next level. Never retry the same level twice in the same failure event.
Why: retrying a strategy that just failed wastes turns. The strategy didn’t fail because of bad luck — it failed because of a real problem the strategy can’t fix.
Circuit breakers
A circuit breaker prevents infinite loops in recovery. Examples from Claude Code:
hasAttemptedReactiveCompact— once reactive compact runs, it cannot run again in the same iterationstop_hook_active— once a Stop hook blocks, the next Stop pass is allowed through- Per-strategy budget counters (above)
The pattern: a flag set when a recovery strategy starts, checked before that strategy runs again. Reset at iteration boundaries.
Lesson: every recovery layer needs a circuit breaker. Otherwise the system can loop within recovery itself, never reaching the surface.
Error taxonomy
Claude Code’s leaked code distinguishes 7 distinct exit paths in Phase 4 (Stop or Continue), each with a different handler:
| Error type | Handler | Strategy |
|---|---|---|
| Transient (network, 5xx) | Retry | Same config x3 |
| Rate limit | Retry with backoff | Use API’s retry-after |
prompt_too_long | Reactive compact | Layer 4 of compaction defense |
max_tokens hit | Recovery message | “Resume directly” |
| Model unavailable | Fallback model | Switch and retry |
| Hook crash | Skip hook, log, continue | Don’t break the loop |
| Permanent (4xx auth, validation) | Abort | Surface to user |
Mixing these up causes death spirals. Treating prompt_too_long like a transient retry → retry → fail → retry → fail forever. Treating auth failure like a transient retry → retry → still 401 → retry → still 401.
Hook error handling
Hooks run outside the main loop. If a hook crashes, the loop continues. Specifically:
- Each hook is wrapped in a try/catch
- Crash logged to
hooks/.logs/hook-log.jsonl - Hook’s exit code = 0 (fail-open)
- Loop continues as if the hook didn’t run
Why fail-open: hooks must NEVER break the session. A bug in inject-rules.cjs should not prevent the agent from responding. The user can debug the hook offline.
Trade-off: fail-open hides hook bugs. Mitigation: log every crash, and surface the log location to the user during session start.
Skip hooks during recovery
When the agent is in error recovery mode, the normal hook pipeline is skipped. Specifically:
- API errors → no
UserPromptSubmithooks injected (those add tokens) prompt_too_long→ noinject-rules.cjs(that adds tokens)- Recovery messages bypass the hook chain entirely
Reason: hooks add tokens. Token addition during a token-limit error is the death spiral. Strip the harness during emergencies.
Anti-patterns
- Binary success/fail. No graduated recovery. Every failure aborts.
- Retry forever. Same strategy, infinite times. Wastes budget on unrecoverable errors.
- Generic recovery prompts. “Please continue.” Doesn’t tell the model what NOT to do.
- Recovery without circuit breakers. Recovery layer triggers itself, infinite loop.
- No error taxonomy. All errors handled the same way. Auth failures retried forever; transient blips give up immediately.
- Hooks crash kills the session. A buggy hook breaks every session for everyone.
- Hooks run during error recovery. Hooks add tokens, error is
prompt_too_long, recovery makes it worse. - No per-level budget. Each escalation can run unlimited times.
- Recovery state not reset between iterations. Circuit breaker stays tripped forever; recovery never runs again even when needed.
- No fallback model. Primary model goes down → entire session blocked.
Takeaways for harness engineering
- Graduated, not binary. 5 levels: retry → escalate config → inject recovery prompt → switch model → surface to user.
- Per-level budgets. Each level runs N times max. Then escalate.
- Each strategy runs once per failure. Don’t retry the strategy that just failed.
- Circuit breakers everywhere. Flag set when strategy starts, checked before re-entry. Reset at iteration boundaries.
- Skip hooks during recovery. Strip the harness in emergencies. Hooks add tokens; recovery often needs fewer tokens.
- Error taxonomy matters. Transient ≠ permanent ≠ context-overflow ≠ rate-limit. Different handlers.
- Recovery prompts are specific. Tell the model what NOT to do. Apologies, recaps, restarts are all forbidden in good recovery prompts.
- Hooks fail open. Crash → log → exit 0 → loop continues. The hook is best-effort.
- Always have a fallback model. Primary outage shouldn’t kill the session.
- Surface to user is the last resort. If you reach it, the session is paused, not failed. The user gets a structured error and a chance to override.
What this repo does
- Hook crash safety — every
.cjshook is wrapped in a top-level try/catch that logs crashes tohooks/.logs/hook-log.jsonland exits 0. Examples:session-init.cjs,subagent-init.cjs,dev-rules-reminder.cjs,descriptive-name.cjs,usage-context-awareness.cjs. Hooks NEVER break the session. hooks/stop-verify.cjs— usesstop_hook_activecircuit breaker to prevent infinite Stop loops. The first Stop pass injects a verify prompt; the second pass (with the flag set) lets the agent finish normally.- Graceful degradation pattern — the distribution’s hook scripts check prerequisites before running (e.g., venv existence, config flags) and no-op when unavailable. Self-disabling rather than crashing.
hooks/usage-context-awareness.cjs— throttles API calls (60s TTL, 1-min for prompts, 5-min for tool use). Avoids hammering the OAuth endpoint when something’s wrong upstream.hooks/loop-detection.cjs— soft circuit breaker on per-file edit counts. Doesn’t abort; injects a warning. The agent can override but is nudged toward stepping back.
Gaps in this repo
- No per-error-type handling. Hooks treat all crashes the same way (log + exit 0). No retry, no fallback, no escalation. Acceptable for hooks (they’re best-effort), insufficient for tools.
- No fallback model switching at the harness level. This is inside Claude Code’s runtime; the harness can’t intercept.
- No structured recovery prompts in skills. Skills assume the happy path. When a skill fails mid-execution (e.g.,
shipfails on test step), there’s no skill-level recovery — the user has to manually recover. - Hook crash log isn’t surfaced.
hooks/.logs/hook-log.jsonlexists but the user has to read it manually. Should be auto-checked at session start and surfaced if non-empty. - No death-spiral metric. We don’t count repeated failures of the same type to detect when the system is looping. Needs an instrumentation pass.
Open problems
- Recovery telemetry. We don’t know which recovery strategies fire most, which fail most, which save the most sessions. Instrumentation would help tune the budgets.
- Recovery prompt library. Each failure mode deserves a tuned prompt. Currently we have one (
stop-verifyretry message); a library would be useful. - Cross-session recovery. Process crashes, session dies. Resume from where we were. Claude Code has
session_store+/resume; this repo doesn’t model it explicitly. - Recovery during multi-agent fanout. If one fork child fails, do siblings retry, abort, or continue? No policy in this repo.