Recovery

8 min read 1520 words

Table of Contents

Recovery

Recovery

Recovery is what the system does when things break: API errors, rate limits, prompt-too-long, model unavailability, tool failures, hook crashes, network drops. Recovery design is the difference between a session that survives a hiccup and one that dies.

The naive approach is binary: success or fail. The real problem requires graduated recovery — multiple escalation levels, each with its own budget and circuit breaker.

The core problem

Agentic sessions are long. Long means more API calls, more tools, more chances for things to go wrong. A binary success/fail design treats every failure the same way (abort) and gives the user a useless partial result. A naive retry loop treats every failure the same way (retry forever) and burns budget on unrecoverable errors.

The right design distinguishes failure types and applies the cheapest viable recovery for each. Transient errors retry. Permanent errors abort. Context errors compact. Rate limits switch models. Each strategy runs once per failure — never retry the same strategy that just failed.

Patterns observed in Claude Code

Escalating recovery chain

Five levels, in order of severity. Each runs once per failure.

#	Strategy	When	Cost
1	Retry same config	Transient errors (network blip, rate-limit retry-after)	Free
2	Escalate output limit	Hit `max_tokens` cap	One-shot config change
3	Inject recovery message	LLM gave up mid-thought	One injected user turn
4	Switch to fallback model	Primary model unavailable	Model switch
5	Surface error to user	All previous strategies exhausted	Session interruption

Each level has a per-level budget counter (e.g., retry x3) and a circuit breaker (don’t retry the same strategy that just failed).

Death spiral prevention: when an error occurs, skip hooks. Hooks inject tokens; tokens worsen prompt_too_long. The recovery path bypasses the normal hook pipeline.

The recovery message (prompt engineering gold)

When the LLM gives up mid-response (output token limit, mid-thought truncation), Claude Code injects this exact message:

“Output token limit hit. Resume directly — no apology, no recap of what you were doing. Pick up mid-thought if that is where the cut happened. Break remaining work into smaller pieces.”

Why every word matters:

“Resume directly” — don’t restart from scratch
“No apology” — burns tokens, signals defeat, often triggers more apologies
“No recap” — re-summarizing prior work doubles the token cost
“Pick up mid-thought” — explicitly permit the unusual continuation
“Break remaining work into smaller pieces” — adapt to the constraint, don’t fight it

This is a specific recovery prompt for a specific failure mode. Generic recovery prompts fail because they don’t tell the model what to not do.

Per-strategy budget counters

Each escalation level has a budget:

Retry: x3 max
Escalate output: 1 attempt (8K → 64K, then no more increases)
Recovery message: 1 injection per failure event
Model fallback: 1 switch per session
User surface: terminal

After the budget is exhausted, escalate to the next level. Never retry the same level twice in the same failure event.

Why: retrying a strategy that just failed wastes turns. The strategy didn’t fail because of bad luck — it failed because of a real problem the strategy can’t fix.

Circuit breakers

A circuit breaker prevents infinite loops in recovery. Examples from Claude Code:

hasAttemptedReactiveCompact — once reactive compact runs, it cannot run again in the same iteration
stop_hook_active — once a Stop hook blocks, the next Stop pass is allowed through
Per-strategy budget counters (above)

The pattern: a flag set when a recovery strategy starts, checked before that strategy runs again. Reset at iteration boundaries.

Lesson: every recovery layer needs a circuit breaker. Otherwise the system can loop within recovery itself, never reaching the surface.

Error taxonomy

Claude Code’s leaked code distinguishes 7 distinct exit paths in Phase 4 (Stop or Continue), each with a different handler:

Error type	Handler	Strategy
Transient (network, 5xx)	Retry	Same config x3
Rate limit	Retry with backoff	Use API’s `retry-after`
`prompt_too_long`	Reactive compact	Layer 4 of compaction defense
`max_tokens` hit	Recovery message	“Resume directly”
Model unavailable	Fallback model	Switch and retry
Hook crash	Skip hook, log, continue	Don’t break the loop
Permanent (4xx auth, validation)	Abort	Surface to user

Mixing these up causes death spirals. Treating prompt_too_long like a transient retry → retry → fail → retry → fail forever. Treating auth failure like a transient retry → retry → still 401 → retry → still 401.

Hook error handling

Hooks run outside the main loop. If a hook crashes, the loop continues. Specifically:

Each hook is wrapped in a try/catch
Crash logged to hooks/.logs/hook-log.jsonl
Hook’s exit code = 0 (fail-open)
Loop continues as if the hook didn’t run

Why fail-open: hooks must NEVER break the session. A bug in inject-rules.cjs should not prevent the agent from responding. The user can debug the hook offline.

Trade-off: fail-open hides hook bugs. Mitigation: log every crash, and surface the log location to the user during session start.

Skip hooks during recovery

When the agent is in error recovery mode, the normal hook pipeline is skipped. Specifically:

API errors → no UserPromptSubmit hooks injected (those add tokens)
prompt_too_long → no inject-rules.cjs (that adds tokens)
Recovery messages bypass the hook chain entirely

Reason: hooks add tokens. Token addition during a token-limit error is the death spiral. Strip the harness during emergencies.

Anti-patterns

Binary success/fail. No graduated recovery. Every failure aborts.
Retry forever. Same strategy, infinite times. Wastes budget on unrecoverable errors.
Generic recovery prompts. “Please continue.” Doesn’t tell the model what NOT to do.
Recovery without circuit breakers. Recovery layer triggers itself, infinite loop.
No error taxonomy. All errors handled the same way. Auth failures retried forever; transient blips give up immediately.
Hooks crash kills the session. A buggy hook breaks every session for everyone.
Hooks run during error recovery. Hooks add tokens, error is prompt_too_long, recovery makes it worse.
No per-level budget. Each escalation can run unlimited times.
Recovery state not reset between iterations. Circuit breaker stays tripped forever; recovery never runs again even when needed.
No fallback model. Primary model goes down → entire session blocked.

Takeaways for harness engineering

Graduated, not binary. 5 levels: retry → escalate config → inject recovery prompt → switch model → surface to user.
Per-level budgets. Each level runs N times max. Then escalate.
Each strategy runs once per failure. Don’t retry the strategy that just failed.
Circuit breakers everywhere. Flag set when strategy starts, checked before re-entry. Reset at iteration boundaries.
Skip hooks during recovery. Strip the harness in emergencies. Hooks add tokens; recovery often needs fewer tokens.
Error taxonomy matters. Transient ≠ permanent ≠ context-overflow ≠ rate-limit. Different handlers.
Recovery prompts are specific. Tell the model what NOT to do. Apologies, recaps, restarts are all forbidden in good recovery prompts.
Hooks fail open. Crash → log → exit 0 → loop continues. The hook is best-effort.
Always have a fallback model. Primary outage shouldn’t kill the session.
Surface to user is the last resort. If you reach it, the session is paused, not failed. The user gets a structured error and a chance to override.

What this repo does

Hook crash safety — every .cjs hook is wrapped in a top-level try/catch that logs crashes to hooks/.logs/hook-log.jsonl and exits 0. Examples: session-init.cjs, subagent-init.cjs, dev-rules-reminder.cjs, descriptive-name.cjs, usage-context-awareness.cjs. Hooks NEVER break the session.
hooks/stop-verify.cjs — uses stop_hook_active circuit breaker to prevent infinite Stop loops. The first Stop pass injects a verify prompt; the second pass (with the flag set) lets the agent finish normally.
Graceful degradation pattern — the distribution’s hook scripts check prerequisites before running (e.g., venv existence, config flags) and no-op when unavailable. Self-disabling rather than crashing.
hooks/usage-context-awareness.cjs — throttles API calls (60s TTL, 1-min for prompts, 5-min for tool use). Avoids hammering the OAuth endpoint when something’s wrong upstream.
hooks/loop-detection.cjs — soft circuit breaker on per-file edit counts. Doesn’t abort; injects a warning. The agent can override but is nudged toward stepping back.

Gaps in this repo

No per-error-type handling. Hooks treat all crashes the same way (log + exit 0). No retry, no fallback, no escalation. Acceptable for hooks (they’re best-effort), insufficient for tools.
No fallback model switching at the harness level. This is inside Claude Code’s runtime; the harness can’t intercept.
No structured recovery prompts in skills. Skills assume the happy path. When a skill fails mid-execution (e.g., ship fails on test step), there’s no skill-level recovery — the user has to manually recover.
Hook crash log isn’t surfaced. hooks/.logs/hook-log.jsonl exists but the user has to read it manually. Should be auto-checked at session start and surfaced if non-empty.
No death-spiral metric. We don’t count repeated failures of the same type to detect when the system is looping. Needs an instrumentation pass.

Open problems

Recovery telemetry. We don’t know which recovery strategies fire most, which fail most, which save the most sessions. Instrumentation would help tune the budgets.
Recovery prompt library. Each failure mode deserves a tuned prompt. Currently we have one (stop-verify retry message); a library would be useful.
Cross-session recovery. Process crashes, session dies. Resume from where we were. Claude Code has session_store + /resume; this repo doesn’t model it explicitly.
Recovery during multi-agent fanout. If one fork child fails, do siblings retry, abort, or continue? No policy in this repo.

Recovery#

The core problem#

Patterns observed in Claude Code#

Escalating recovery chain#

The recovery message (prompt engineering gold)#

Per-strategy budget counters#

Circuit breakers#

Error taxonomy#

Hook error handling#

Skip hooks during recovery#

Anti-patterns#

Takeaways for harness engineering#

What this repo does#

Gaps in this repo#

Open problems#

Recovery

The core problem

Patterns observed in Claude Code

Escalating recovery chain

The recovery message (prompt engineering gold)

Per-strategy budget counters

Circuit breakers

Error taxonomy

Hook error handling

Skip hooks during recovery

Anti-patterns

Takeaways for harness engineering

What this repo does

Gaps in this repo

Open problems