Error Handling and Recovery

How Claude Code handles failures across API calls, tool execution, agent spawning, and MCP connections. The system is designed around a core principle: errors are contained, not propagated. Tool failures become conversation messages, not crashes. API failures trigger tiered retry with fallback. Agent failures preserve partial results.

The containment boundary

The tool execution layer (src/services/tools/toolExecution.ts) is the primary error containment boundary. Every tool error — including sub-agent crashes, MCP failures, and validation errors — is converted to a tool_result message with is_error: true, wrapped in <tool_use_error> tags. The model sees the error and decides how to react. The agent loop never crashes from a tool failure.
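A minimal sketch of this containment boundary, assuming a simplified synchronous tool call (real tools are asynchronous). The ContainedToolResult shape and the runToolContained helper are illustrative names, not the actual exports of toolExecution.ts; only the is_error flag and the <tool_use_error> wrapper come from the description above.

```typescript
// Hypothetical, simplified shape of a tool_result message.
interface ContainedToolResult {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error: boolean;
}

// Runs a tool and converts any thrown value into an error tool_result,
// so the caller (the agent loop) never sees an exception.
function runToolContained(toolUseId: string, run: () => string): ContainedToolResult {
  try {
    return { type: "tool_result", tool_use_id: toolUseId, content: run(), is_error: false };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `<tool_use_error>${message}</tool_use_error>`,
      is_error: true,
    };
  }
}

const containedOk = runToolContained("toolu_1", () => "file contents");
const containedFailure = runToolContained("toolu_2", () => {
  throw new Error("ENOENT: no such file");
});
```

Both calls return a value; the failure is data the model can read, not an exception the loop has to survive.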

Three validation layers run before any tool executes:

1. Tool existence check: is this tool registered? If not: "No such tool available: {name}"
2. Zod schema validation (inputSchema.safeParse): returns InputValidationError
3. Custom validation (tool.validateInput()): tool-specific checks
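The three layers can be sketched as a single function that returns the first failure. This is an illustration under assumptions: SketchTool, SketchParseResult, and validateToolCall are hypothetical names, and the hand-written safeParse only imitates the success/failure result shape of Zod's safeParse (real Zod returns a ZodError object, not a string).

```typescript
// Hypothetical stand-in for the result shape of a Zod-style safeParse.
type SketchParseResult =
  | { success: true; data: unknown }
  | { success: false; error: string };

interface SketchTool {
  name: string;
  inputSchema: { safeParse(input: unknown): SketchParseResult };
  validateInput?(input: unknown): string | null; // null means "valid"
}

// Returns an error message for the first failing layer, or null if all pass.
function validateToolCall(
  registry: Map<string, SketchTool>,
  name: string,
  input: unknown,
): string | null {
  // Layer 1: tool existence check.
  const tool = registry.get(name);
  if (!tool) return `No such tool available: ${name}`;

  // Layer 2: schema validation.
  const parsed = tool.inputSchema.safeParse(input);
  if (parsed.success === false) return `InputValidationError: ${parsed.error}`;

  // Layer 3: tool-specific checks, if the tool defines any.
  return tool.validateInput?.(parsed.data) ?? null;
}

// An illustrative tool: the schema requires a string, the custom check a non-empty one.
const echoTool: SketchTool = {
  name: "Echo",
  inputSchema: {
    safeParse(input: unknown): SketchParseResult {
      const text = (input as { text?: unknown } | null)?.text;
      return typeof text === "string"
        ? { success: true, data: { text } }
        : { success: false, error: "text must be a string" };
    },
  },
  validateInput(input: unknown): string | null {
    return (input as { text: string }).text === "" ? "text must not be empty" : null;
  },
};

const sketchRegistry = new Map<string, SketchTool>([[echoTool.name, echoTool]]);
```

Ordering matters here: the custom check can assume a well-typed input because the schema layer has already run.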

API retry strategy

withRetry() in src/services/api/withRetry.ts implements a three-tier escalation:

Tier 1: Exponential backoff retry

DEFAULT_MAX_RETRIES = 10 (overridable via CLAUDE_CODE_MAX_RETRIES)
BASE_DELAY_MS = 500 with 25% jitter
Honors retry-after header when present
Retries: connection errors, 408, 409, 429, 401 (with token refresh), 403, 5xx, 529 overloaded
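A sketch of how the Tier 1 delay could be computed from these constants. Several details are assumptions, since the text above does not specify them: the delay doubles per attempt, the 25% jitter is additive (0 to +25% of the base delay), and retry-after is given in seconds. The function names are hypothetical.

```typescript
const BASE_DELAY_MS = 500;
const JITTER_FRACTION = 0.25;

// Statuses listed above as retryable; connection errors are detected separately.
function isRetryableStatus(status: number): boolean {
  return [401, 403, 408, 409, 429].includes(status) || status >= 500;
}

// attempt is 1-based. retryAfterSeconds is the parsed retry-after header, if any.
// random is injectable so the jitter can be made deterministic.
function getRetryDelayMs(
  attempt: number,
  retryAfterSeconds: number | null = null,
  random: () => number = Math.random,
): number {
  // A server-provided retry-after takes precedence over the computed backoff.
  if (retryAfterSeconds !== null) return retryAfterSeconds * 1000;
  const exponential = BASE_DELAY_MS * 2 ** (attempt - 1);
  return exponential + exponential * JITTER_FRACTION * random();
}
```

With jitter pinned to zero, the first three attempts wait 500, 1000, and 2000 ms. The jitter spreads clients out so they do not retry in lockstep after a shared outage.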

Tier 2: Model fallback

After MAX_529_RETRIES = 3 consecutive 529 (overloaded) errors, withRetry() throws FallbackTriggeredError. The query loop in query.ts catches it, switches transparently to fallbackModel, clears assistant messages, and retries the entire request.
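The hand-off between the retry loop and the query loop can be sketched as follows. Only MAX_529_RETRIES, FallbackTriggeredError, and fallbackModel are names from the text; record529, queryWithFallback, FallbackChatMessage, and the model strings are hypothetical. The sketch assumes the third consecutive 529 triggers the fallback and that "clearing assistant messages" means dropping assistant entries before the retry.

```typescript
const MAX_529_RETRIES = 3;

class FallbackTriggeredError extends Error {
  constructor(public readonly fallbackModel: string) {
    super(`Falling back to ${fallbackModel} after repeated 529 errors`);
    this.name = "FallbackTriggeredError";
  }
}

interface FallbackChatMessage {
  role: "user" | "assistant";
  text: string;
}

// Retry-loop side: count consecutive 529s and escalate once the cap is reached.
function record529(consecutive529s: number, fallbackModel?: string): number {
  const next = consecutive529s + 1;
  if (next >= MAX_529_RETRIES && fallbackModel) {
    throw new FallbackTriggeredError(fallbackModel);
  }
  return next;
}

// Query-loop side: catch the escalation, drop assistant output, and retry
// the whole request on the fallback model.
function queryWithFallback(
  model: string,
  messages: FallbackChatMessage[],
  callModel: (model: string, messages: FallbackChatMessage[]) => string,
): { model: string; text: string } {
  try {
    return { model, text: callModel(model, messages) };
  } catch (err) {
    if (!(err instanceof FallbackTriggeredError)) throw err;
    const cleared = messages.filter((m) => m.role !== "assistant");
    return { model: err.fallbackModel, text: callModel(err.fallbackModel, cleared) };
  }
}

// Simulated primary model that answers every attempt with a 529.
const fallbackDemo = queryWithFallback(
  "primary-model",
  [
    { role: "user", text: "hi" },
    { role: "assistant", text: "partial" },
  ],
  (model, msgs) => {
    if (model === "primary-model") {
      let consecutive = 0;
      while (true) consecutive = record529(consecutive, "fallback-model");
    }
    return `${model} answered with ${msgs.length} message(s)`;
  },
);
```

Using an exception for the escalation keeps the retry loop unaware of how the caller recovers; it only reports that retrying the same model is no longer worthwhile.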

Tier 3: User-visible error

withRetry() throws CannotRetryError when retries are exhausted. getAssistantMessageFromError() converts any error into an AssistantMessage with isApiErrorMessage: true and actionable guidance.

Special behaviors

Background work drops silently on 529: non-foreground query sources (summaries, classifiers, suggestions) bail immediately instead of retrying, on the rationale that "Each retry is 3-10x gateway amplification."
Fast mode cooldown: long retry-after values trigger a 10-30 minute cooldown that switches to standard speed. Overage rejection permanently disables fast mode.
Persistent retry mode (CLAUDE_CODE_UNATTENDED_RETRY): retries 429/529 indefinitely with 30s heartbeat intervals.
Auth recovery: a 401 or 403 triggers OAuth token refresh, AWS/GCP credential cache clear, and fresh client creation.

Circuit breakers

Breaker	Threshold	Behavior
Auto-compact	3 consecutive failures	Stops compaction for the session. Before the fix, repeated failed attempts wasted ~250K API calls/day globally.
529 retry cap	3 consecutive 529s	Triggers model fallback or terminal error
Background 529	Immediate	Non-foreground sources don't retry at all
Fast mode cooldown	Long retry-after	10-30 min cooldown, switches to standard speed
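The auto-compact row follows the usual consecutive-failure breaker pattern. A minimal sketch, assuming a success resets the counter and an open breaker stays open for the rest of the session; ConsecutiveFailureBreaker and tryAutoCompact are hypothetical names, not the actual implementation.

```typescript
// Generic consecutive-failure breaker.
class ConsecutiveFailureBreaker {
  private failures = 0;

  constructor(private readonly threshold: number) {}

  get isOpen(): boolean {
    return this.failures >= this.threshold;
  }

  recordSuccess(): void {
    this.failures = 0; // only consecutive failures count
  }

  recordFailure(): void {
    this.failures += 1;
  }
}

// Auto-compact guarded by a breaker with the threshold from the table.
const compactBreaker = new ConsecutiveFailureBreaker(3);

function tryAutoCompact(compact: () => void): "compacted" | "failed" | "skipped" {
  if (compactBreaker.isOpen) return "skipped"; // no further API calls are spent
  try {
    compact();
    compactBreaker.recordSuccess();
    return "compacted";
  } catch {
    compactBreaker.recordFailure();
    return "failed";
  }
}
```

Once three attempts in a row have failed, later calls return "skipped" without invoking compact at all, which is what stops the wasted API calls.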

Agent error handling

Sub-agent failures are contained, not propagated:

Foreground agent errors: If the agent collected any assistant messages before failing, finalizeAgentTool() extracts partial results. Only zero-message failures re-throw (becoming tool_result is_error=true).
Background agent errors: failAsyncAgent() marks the task as failed. extractPartialResult() scans messages backward for the last assistant text. The notification system delivers failure to the parent.
AbortError: Always re-thrown so that interruption is handled properly; it signals cancellation, not a failure.
Worktree cleanup: Runs in finally blocks for both sync and async paths, including error/abort.
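The foreground rules above can be sketched in one function. This is a simplification under assumptions: the message shape, isAbortError, and runForegroundAgent are hypothetical, the agent is synchronous, and extractPartialResult borrows the name from the text with an assumed body (the last non-empty assistant text).

```typescript
interface AgentMessage {
  role: "user" | "assistant";
  text: string;
}

// Scans from the end of the transcript for the last non-empty assistant text.
function extractPartialResult(messages: AgentMessage[]): string | undefined {
  const last = [...messages]
    .reverse()
    .find((m) => m.role === "assistant" && m.text.trim() !== "");
  return last?.text;
}

function isAbortError(err: unknown): boolean {
  return err instanceof Error && err.name === "AbortError";
}

// Foreground path: return partial output if any exists, otherwise re-throw.
function runForegroundAgent(
  run: (transcript: AgentMessage[]) => void,
  cleanupWorktree: () => void,
): string {
  const transcript: AgentMessage[] = [];
  try {
    run(transcript);
    return extractPartialResult(transcript) ?? "";
  } catch (err) {
    if (isAbortError(err)) throw err; // interruption, not a failure
    const partial = extractPartialResult(transcript);
    if (partial === undefined) throw err; // zero-message failure
    return partial;
  } finally {
    cleanupWorktree(); // runs on success, error, and abort
  }
}
```

An agent that reports "Found 2 of 3 files" and then crashes still returns that text to the parent; an agent that crashes before producing any assistant message re-throws, and the tool layer turns that into an error result.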

MCP error handling

Error	Trigger	Recovery
McpAuthError	Auth failure during tool call	Server transitions to needs-auth state
McpSessionExpiredError	HTTP 404 + JSON-RPC -32001	Automatic reconnect + single retry
McpToolCallError	MCP tool returns isError: true	Surfaced as tool_result is_error=true
Connection lost	Transport disconnect	Exponential backoff (1s-30s, 5 attempts)

All MCP errors produce tool_result is_error=true — the agent continues.
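Two of the recovery rules in the table lend themselves to a short sketch: the reconnect schedule and the single retry after session expiry. The constant and function names are hypothetical, and the doubling factor is an assumption; the table only gives the 1s-30s range and the 5-attempt limit.

```typescript
// Hypothetical constant names; the values come from the table above.
const MCP_RECONNECT_ATTEMPTS = 5;
const MCP_MIN_DELAY_MS = 1_000;
const MCP_MAX_DELAY_MS = 30_000;

// Delay before reconnect attempt n (1-based), doubling from 1s and capped at 30s.
function reconnectDelayMs(attempt: number): number {
  return Math.min(MCP_MIN_DELAY_MS * 2 ** (attempt - 1), MCP_MAX_DELAY_MS);
}

// Session expiry is signalled by HTTP 404 together with JSON-RPC code -32001.
function isSessionExpired(httpStatus: number, jsonRpcCode: number | undefined): boolean {
  return httpStatus === 404 && jsonRpcCode === -32001;
}

class McpSessionExpiredError extends Error {}

// Reconnect once and retry once; a second expiry propagates to the caller.
function callWithSessionRetry<T>(call: () => T, reconnect: () => void): T {
  try {
    return call();
  } catch (err) {
    if (!(err instanceof McpSessionExpiredError)) throw err;
    reconnect();
    return call();
  }
}

const mcpSchedule = Array.from({ length: MCP_RECONNECT_ATTEMPTS }, (_, i) =>
  reconnectDelayMs(i + 1),
);
// mcpSchedule: [1000, 2000, 4000, 8000, 16000]
```

Under the doubling assumption, five attempts never reach the 30s cap; the cap would only apply with a steeper growth factor or more attempts.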

Reactive recovery

When the API returns prompt-too-long or media-size errors, query.ts withholds the error from the user and attempts recovery:

Prompt-too-long: Triggers reactive compact (tryReactiveCompact()) or context collapse (recoverFromOverflow()), then retries
Media size errors: Strips images from messages and retries
Only surfaces to user if recovery fails
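The withhold-recover-retry flow can be sketched as below. Everything here is illustrative: the message and error types, the error kind labels, and sendWithRecovery are hypothetical, the compact step is passed in as a function, and the sketch makes a single recovery attempt.

```typescript
type ContentBlock = { type: "text"; text: string } | { type: "image"; data: string };

interface QueryMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Hypothetical labels for the two recoverable error classes.
type RecoverableKind = "prompt_too_long" | "media_too_large";

class SketchApiError extends Error {
  constructor(public readonly kind: RecoverableKind) {
    super(kind);
  }
}

// Removes image blocks from every message, keeping all text.
function stripImages(messages: QueryMessage[]): QueryMessage[] {
  return messages.map((m) => ({
    ...m,
    content: m.content.filter((b) => b.type !== "image"),
  }));
}

// Sends once; on a recoverable error, transforms the messages and retries once.
// The error reaches the caller (and so the user) only if the retry also fails.
function sendWithRecovery(
  messages: QueryMessage[],
  send: (messages: QueryMessage[]) => string,
  compact: (messages: QueryMessage[]) => QueryMessage[],
): string {
  try {
    return send(messages);
  } catch (err) {
    if (!(err instanceof SketchApiError)) throw err;
    const recovered =
      err.kind === "prompt_too_long" ? compact(messages) : stripImages(messages);
    return send(recovered);
  }
}
```

The first error is never shown: the caller sees either a normal response or the error from the second attempt.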

Graceful degradation hierarchy

Silent fallback — background work dropped, model fallback, fast mode cooldown
Transparent recovery — reactive compact, auth refresh, image stripping
Partial result preservation — agent extracts whatever was accomplished before failure
User-visible error with guidance — actionable messages, not stack traces

Related

auto-compact: circuit breaker at 3 consecutive failures
cost-tracker: cached responses still count toward cost
mcp-client: session retry, auth caching
telemetry-system: error classification for analytics
agent-lifecycle: partial result preservation