Error Handling and Recovery

How Claude Code handles failures across API calls, tool execution, agent spawning, and MCP connections. The system is designed around a core principle: errors are contained, not propagated. Tool failures become conversation messages, not crashes. API failures trigger tiered retry with fallback. Agent failures preserve partial results.

The containment boundary

The tool execution layer (src/services/tools/toolExecution.ts) is the primary error containment boundary. Every tool error — including sub-agent crashes, MCP failures, and validation errors — is converted to a tool_result message with is_error: true, wrapped in <tool_use_error> tags. The model sees the error and decides how to react. The agent loop never crashes from a tool failure.
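The containment boundary can be sketched as a small conversion function. This is an illustrative shape, not the actual toolExecution.ts code; the function and interface names are assumptions, but the tool_result / is_error / `<tool_use_error>` conventions come from the text above.

```typescript
// Hypothetical sketch: any error thrown during tool execution becomes a
// tool_result content block the model can read and react to, so the
// agent loop itself never crashes.
interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error: boolean;
}

function containToolError(toolUseId: string, error: unknown): ToolResultBlock {
  const message = error instanceof Error ? error.message : String(error);
  return {
    type: "tool_result",
    tool_use_id: toolUseId,
    // Wrapped in tags so the model can distinguish errors from normal output.
    content: `<tool_use_error>${message}</tool_use_error>`,
    is_error: true,
  };
}
```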

Three validation layers run before any tool executes:

1. Tool existence check — is this tool registered? If not: "No such tool available: {name}"
2. Zod schema validation (inputSchema.safeParse) — returns InputValidationError on failure
3. Custom validation (tool.validateInput()) — tool-specific checks
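The three layers can be sketched as a single pipeline. This is a hedged sketch, not the real implementation: the Schema interface only mimics Zod's safeParse result shape, and the registry and return-string conventions are assumptions.

```typescript
// Illustrative three-layer validation pipeline. Returns an error string
// (which would become a tool_result with is_error=true) or null if the
// call may proceed.
interface SafeParseResult {
  success: boolean;
  error?: string;
}

interface Tool {
  name: string;
  // Mimics Zod's schema.safeParse() result shape.
  inputSchema: { safeParse(input: unknown): SafeParseResult };
  // Optional tool-specific check; null means the input is acceptable.
  validateInput?(input: unknown): string | null;
}

function validateToolCall(
  registry: Map<string, Tool>,
  name: string,
  input: unknown,
): string | null {
  // Layer 1: existence check.
  const tool = registry.get(name);
  if (!tool) return `No such tool available: ${name}`;
  // Layer 2: schema validation (Zod-style safeParse).
  const parsed = tool.inputSchema.safeParse(input);
  if (!parsed.success) return `InputValidationError: ${parsed.error}`;
  // Layer 3: custom, tool-specific validation.
  return tool.validateInput?.(input) ?? null;
}
```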

API retry strategy

withRetry() in src/services/api/withRetry.ts implements a three-tier escalation:

Tier 1: Exponential backoff retry
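A minimal sketch of this tier, assuming a generic retry loop; the attempt count, base delay, and jitter factor are illustrative defaults, not the actual withRetry() constants.

```typescript
// Generic exponential-backoff retry loop: delay grows as base * 2^attempt,
// with jitter to avoid synchronized retry storms.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = baseMs * 2 ** attempt * (0.5 + Math.random() / 2);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // exhausted: escalates to the next tier
}
```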

Tier 2: Model fallback

After MAX_529_RETRIES = 3 consecutive 529 overloaded errors, throws FallbackTriggeredError. The query loop in query.ts catches this and transparently switches to fallbackModel, clearing assistant messages and retrying the entire request.
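The catch-and-switch in the query loop might look like the following. The class name and threshold come from the text above; everything else (function names, the way a request is rerun) is an assumption for illustration.

```typescript
// Sketch of Tier 2: after MAX_529_RETRIES consecutive 529s, a
// FallbackTriggeredError escapes the retry loop and the caller
// transparently reruns the whole request on the fallback model.
const MAX_529_RETRIES = 3;

class FallbackTriggeredError extends Error {}

async function callWithFallback(
  run: (model: string) => Promise<string>,
  primaryModel: string,
  fallbackModel: string,
): Promise<string> {
  try {
    return await run(primaryModel);
  } catch (err) {
    if (err instanceof FallbackTriggeredError) {
      // Transparent to the user: switch models and retry the request.
      return run(fallbackModel);
    }
    throw err;
  }
}
```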

Tier 3: User-visible error

Throws CannotRetryError when retries are exhausted. getAssistantMessageFromError() converts any error into an AssistantMessage with isApiErrorMessage: true and actionable guidance.
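The conversion might look like this. Only the isApiErrorMessage flag is from the text above; the message shape and guidance wording are illustrative assumptions.

```typescript
// Sketch of Tier 3: an exhausted-retries error becomes a normal
// assistant message with guidance, not a stack trace or a crash.
interface AssistantMessage {
  role: "assistant";
  content: string;
  isApiErrorMessage: boolean;
}

function assistantMessageFromError(error: unknown): AssistantMessage {
  const detail = error instanceof Error ? error.message : String(error);
  return {
    role: "assistant",
    // Hypothetical guidance text; the real message is error-specific.
    content: `API request failed (${detail}). Check your network connection and try again.`,
    isApiErrorMessage: true,
  };
}
```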

Special behaviors

Circuit breakers

| Breaker | Threshold | Behavior |
|---|---|---|
| Auto-compact | 3 consecutive failures | Stops compaction for the session. Was wasting ~250K API calls/day globally before the fix. |
| 529 retry cap | 3 consecutive 529s | Triggers model fallback or terminal error |
| Background 529 | Immediate | Non-foreground sources don't retry at all |
| Fast mode cooldown | Long retry-after | 10-30 min cooldown, switches to standard speed |
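The auto-compact and 529 breakers are both consecutive-failure counters. A minimal sketch, assuming the simplest possible state machine (the class name is hypothetical):

```typescript
// Minimal consecutive-failure circuit breaker: after `threshold`
// consecutive failures it opens and stays open until a success resets it.
class ConsecutiveFailureBreaker {
  private failures = 0;

  constructor(private readonly threshold: number) {}

  get isOpen(): boolean {
    return this.failures >= this.threshold;
  }

  recordSuccess(): void {
    this.failures = 0; // any success resets the streak
  }

  recordFailure(): void {
    this.failures++;
  }
}
```

Callers check `isOpen` before attempting the guarded operation (e.g. skipping compaction for the rest of the session).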

Agent error handling

Sub-agent failures are contained, not propagated: a crashed sub-agent is reported to the parent as a tool_result with is_error: true, and whatever partial output it produced before failing is preserved rather than discarded.
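One way to sketch the partial-result preservation, under the same tool_result conventions as the containment boundary; the outcome shape and function name are assumptions:

```typescript
// Hypothetical sketch: a sub-agent failure still yields a tool_result,
// carrying both the error and the partial work completed before it.
interface AgentOutcome {
  partialText: string; // everything the sub-agent produced so far
  error?: string; // present only if the sub-agent failed
}

function subAgentResult(toolUseId: string, outcome: AgentOutcome) {
  if (outcome.error === undefined) {
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: outcome.partialText,
      is_error: false,
    };
  }
  // Failure path: keep the partial work so the parent agent can use it.
  return {
    type: "tool_result",
    tool_use_id: toolUseId,
    content: `<tool_use_error>${outcome.error}</tool_use_error>\nPartial output:\n${outcome.partialText}`,
    is_error: true,
  };
}
```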

MCP error handling

| Error | Trigger | Recovery |
|---|---|---|
| McpAuthError | Auth failure during tool call | Server transitions to needs-auth state |
| McpSessionExpiredError | HTTP 404 + JSON-RPC -32001 | Automatic reconnect + single retry |
| McpToolCallError | MCP tool returns isError: true | Surfaced as tool_result is_error=true |
| Connection lost | Transport disconnect | Exponential backoff (1s-30s, 5 attempts) |

All MCP errors produce tool_result is_error=true — the agent continues.
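The reconnect policy from the table (exponential backoff from 1s, capped at 30s, up to 5 attempts) can be sketched as follows; the function name and injectable sleep are illustrative, not the actual transport code:

```typescript
// Sketch of MCP reconnect-on-disconnect: exponential backoff starting at
// 1s, doubling each attempt, capped at 30s, for at most 5 attempts.
async function reconnectWithBackoff(
  connect: () => Promise<void>,
  attempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    const delay = Math.min(1000 * 2 ** i, 30_000); // 1s, 2s, 4s, ... capped at 30s
    await sleep(delay);
    try {
      await connect();
      return true;
    } catch {
      // fall through to the next attempt
    }
  }
  return false; // caller surfaces tool_result is_error=true and the agent continues
}
```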

Reactive recovery

When the API returns prompt-too-long or media-size errors, query.ts withholds the error from the user and attempts recovery:

  1. Prompt-too-long: Triggers reactive compact (tryReactiveCompact()) or context collapse (recoverFromOverflow()), then retries
  2. Media size errors: Strips images from messages and retries
  3. Only surfaces to user if recovery fails
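The steps above can be sketched as a classify-recover-retry control flow. The error-classification strings, function names, and single-retry policy here are assumptions for illustration, not the actual query.ts logic:

```typescript
// Hypothetical reactive-recovery flow: classify the API error, run the
// matching recovery (compact context or strip images), retry once, and
// only rethrow to the user when the error is not recoverable.
type Recovery = "compact" | "strip-images" | null;

function classifyApiError(message: string): Recovery {
  if (message.includes("prompt is too long")) return "compact";
  if (/image|media size/i.test(message)) return "strip-images";
  return null;
}

async function recoverAndRetry<T>(
  request: () => Promise<T>,
  recover: (kind: Exclude<Recovery, null>) => Promise<void>,
): Promise<T> {
  try {
    return await request();
  } catch (err) {
    const kind = classifyApiError(err instanceof Error ? err.message : "");
    if (kind === null) throw err; // not recoverable: surface to the user
    await recover(kind); // e.g. reactive compact, or strip images
    return request(); // retry once after recovery
  }
}
```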

Graceful degradation hierarchy

  1. Silent fallback — background work dropped, model fallback, fast mode cooldown
  2. Transparent recovery — reactive compact, auth refresh, image stripping
  3. Partial result preservation — agent extracts whatever was accomplished before failure
  4. User-visible error with guidance — actionable messages, not stack traces