Error Handling and Recovery
How Claude Code handles failures across API calls, tool execution, agent spawning, and MCP connections. The system is designed around a core principle: errors are contained, not propagated. Tool failures become conversation messages, not crashes. API failures trigger tiered retry with fallback. Agent failures preserve partial results.
The containment boundary
The tool execution layer (src/services/tools/toolExecution.ts) is the primary error containment boundary. Every tool error — including sub-agent crashes, MCP failures, and validation errors — is converted to a tool_result message with is_error: true, wrapped in <tool_use_error> tags. The model sees the error and decides how to react. The agent loop never crashes from a tool failure.
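The containment idea can be sketched as a wrapper that turns any thrown error into an error result. This is a minimal illustration, not the code in toolExecution.ts: the `ToolResult` shape, the `ToolFn` type, and the `runToolContained` name are assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the real types may differ.
interface ToolResult {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error: boolean;
}

type ToolFn = (input: unknown) => Promise<string>;

// Run a tool and convert any thrown error into an error tool_result
// instead of letting it escape into the agent loop.
async function runToolContained(
  toolUseId: string,
  tool: ToolFn,
  input: unknown,
): Promise<ToolResult> {
  try {
    const content = await tool(input);
    return { type: "tool_result", tool_use_id: toolUseId, content, is_error: false };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `<tool_use_error>${message}</tool_use_error>`,
      is_error: true,
    };
  }
}
```

The caller always receives a result object, so the loop can append it to the conversation and continue. The sketch leaves out the AbortError re-throw described under agent error handling.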
Three validation layers run before any tool executes:
1. Tool existence check: is this tool registered? If not, the call fails with "No such tool available: {name}"
2. Zod schema validation (inputSchema.safeParse): a failed parse returns InputValidationError
3. Custom validation (tool.validateInput()): tool-specific checks
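The three layers can be read as a short-circuiting pipeline. The sketch below is an assumption-laden illustration: the `Tool`, `ParseResult`, and `Validation` types and the `validateToolCall` function are hypothetical, and `safeParse` only mirrors the Zod result shape so that the example needs no imports.

```typescript
// Hypothetical minimal shapes; `safeParse` imitates Zod's success/failure result.
type ParseResult = { success: true; data: unknown } | { success: false; error: string };

interface Tool {
  name: string;
  inputSchema: { safeParse(input: unknown): ParseResult };
  validateInput?(input: unknown): { ok: true } | { ok: false; message: string };
}

type Validation = { ok: true; tool: Tool; input: unknown } | { ok: false; message: string };

function validateToolCall(registry: Map<string, Tool>, name: string, input: unknown): Validation {
  // 1. Tool existence check
  const tool = registry.get(name);
  if (!tool) return { ok: false, message: `No such tool available: ${name}` };

  // 2. Schema validation
  const parsed = tool.inputSchema.safeParse(input);
  if (!parsed.success) return { ok: false, message: `InputValidationError: ${parsed.error}` };

  // 3. Tool-specific validation (optional per tool)
  const custom = tool.validateInput?.(parsed.data);
  if (custom && !custom.ok) return { ok: false, message: custom.message };

  return { ok: true, tool, input: parsed.data };
}
```

Each failure returns a message rather than throwing, which fits the containment model: the message can be placed directly into an error tool_result.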
API retry strategy
withRetry() in src/services/api/withRetry.ts implements a three-tier escalation:
Tier 1: Exponential backoff retry
- DEFAULT_MAX_RETRIES = 10 (overridable via CLAUDE_CODE_MAX_RETRIES)
- BASE_DELAY_MS = 500 with 25% jitter
- Honors retry-after header when present
- Retries: connection errors, 408, 409, 429, 401 (with token refresh), 403, 5xx, 529 overloaded
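A delay calculation along these lines might look as follows. Only BASE_DELAY_MS and the 25% jitter figure come from the text; the doubling factor, the additive form of the jitter, the cap value, and the function name are assumptions for the sketch.

```typescript
const BASE_DELAY_MS = 500; // from the text above
const ASSUMED_MAX_DELAY_MS = 32000; // assumed cap, not stated in the source

// Compute the wait before retry `attempt` (1-based).
// `retryAfterSeconds` is the parsed retry-after header, if the server sent one.
// `random` is injectable so the jitter can be tested deterministically.
function getRetryDelayMs(
  attempt: number,
  retryAfterSeconds?: number,
  random: () => number = Math.random,
): number {
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000; // the server hint takes precedence
  }
  const base = Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), ASSUMED_MAX_DELAY_MS);
  const jitter = random() * 0.25 * base; // up to 25% added on top (assumed form)
  return base + jitter;
}
```

With zero jitter the schedule under these assumptions is 500 ms, 1 s, 2 s, and so on. Jitter spreads out clients that failed at the same moment so they do not retry in lockstep.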
Tier 2: Model fallback
After MAX_529_RETRIES = 3 consecutive 529 overloaded errors, throws FallbackTriggeredError. The query loop in query.ts catches this and transparently switches to fallbackModel, clearing assistant messages and retrying the entire request.
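The consecutive-529 trigger can be sketched with a small counter. MAX_529_RETRIES and the FallbackTriggeredError name come from the text; the constructor arguments, the `OverloadTracker` class, and the reset-on-other-status behavior are assumptions for illustration.

```typescript
const MAX_529_RETRIES = 3; // from the text above

// Error fields are hypothetical; the real class may carry different data.
class FallbackTriggeredError extends Error {
  constructor(
    public readonly originalModel: string,
    public readonly fallbackModel: string,
  ) {
    super(`Falling back from ${originalModel} to ${fallbackModel}`);
    this.name = "FallbackTriggeredError";
  }
}

// Tracks consecutive 529 responses; any other status resets the streak.
class OverloadTracker {
  private consecutive529 = 0;

  constructor(
    private readonly model: string,
    private readonly fallbackModel?: string,
  ) {}

  record(status: number): void {
    if (status !== 529) {
      this.consecutive529 = 0;
      return;
    }
    this.consecutive529 += 1;
    if (this.consecutive529 >= MAX_529_RETRIES && this.fallbackModel) {
      throw new FallbackTriggeredError(this.model, this.fallbackModel);
    }
  }
}
```

Throwing a dedicated error type lets the retry layer stay unaware of how the switch happens: the caller that owns the conversation state catches it and reissues the request with the other model.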
Tier 3: User-visible error
CannotRetryError when retries are exhausted. getAssistantMessageFromError() converts any error into an AssistantMessage with isApiErrorMessage: true and actionable guidance.
Special behaviors
- Background work drops silently on 529: non-foreground query sources (summaries, classifiers, suggestions) bail immediately. "Each retry is 3-10x gateway amplification."
- Fast mode cooldown: long retry-after values trigger a 10-30 minute cooldown, switching to standard speed. Overage rejection permanently disables fast mode.
- Persistent retry mode (CLAUDE_CODE_UNATTENDED_RETRY): retries 429/529 indefinitely with 30s heartbeat intervals.
- Auth recovery: 401/403 triggers OAuth token refresh, AWS/GCP credential cache clear, and fresh client creation.
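How the background-drop and persistent-retry rules might combine for an overloaded response is sketched below. The option names, the `RetryDecision` labels, and the order in which the rules are checked are all assumptions; the source does not say how the two modes interact.

```typescript
type RetryDecision = "drop" | "retry" | "exhausted";

interface OverloadContext {
  foreground: boolean; // false for summaries, classifiers, suggestions
  unattended: boolean; // true when persistent retry mode is enabled
  attempt: number; // retries already made
  maxRetries: number;
}

function decideOverloadRetry(ctx: OverloadContext): RetryDecision {
  if (!ctx.foreground) return "drop"; // background work bails immediately
  if (ctx.unattended) return "retry"; // persistent mode keeps going
  return ctx.attempt < ctx.maxRetries ? "retry" : "exhausted";
}
```

Keeping the decision in one pure function makes the policy easy to test without making any API calls.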
Circuit breakers
| Breaker | Threshold | Behavior |
|---|---|---|
| Auto-compact | 3 consecutive failures | Stops compaction for session. Was wasting ~250K API calls/day globally before fix. |
| 529 retry cap | 3 consecutive 529s | Triggers model fallback or terminal error |
| Background 529 | Immediate | Non-foreground sources don't retry at all |
| Fast mode cooldown | Long retry-after | 10-30 min cooldown, switches to standard speed |
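A consecutive-failure breaker of the kind listed for auto-compact can be sketched in a few lines. The class and method names are hypothetical, and the real implementation may track state differently.

```typescript
// Opens after `threshold` failures in a row; a success resets the count.
class ConsecutiveFailureBreaker {
  private failures = 0;

  constructor(private readonly threshold: number) {}

  get isOpen(): boolean {
    return this.failures >= this.threshold;
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
  }
}

// Usage: skip the operation entirely once the breaker is open.
const compactBreaker = new ConsecutiveFailureBreaker(3);
function shouldAttemptCompact(): boolean {
  return !compactBreaker.isOpen;
}
```

Because callers stop attempting the operation once the breaker is open, no further success can reset it, which matches the "stops for the session" behavior in the table.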
Agent error handling
Sub-agent failures are contained, not propagated:
- Foreground agent errors: If the agent collected any assistant messages before failing, finalizeAgentTool() extracts partial results. Only zero-message failures re-throw (becoming tool_result is_error=true).
- Background agent errors: failAsyncAgent() marks the task as failed. extractPartialResult() scans messages backward for the last assistant text. The notification system delivers failure to the parent.
- AbortError: Always re-thrown for proper interruption handling (not a real error).
- Worktree cleanup: Runs in finally blocks for both sync and async paths, including error/abort.
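The backward scan for the last assistant text could look roughly like this. The message and block shapes and the `extractLastAssistantText` name are assumptions for the sketch, not the actual extractPartialResult() signature.

```typescript
// Hypothetical message shapes for illustration.
type ContentBlock = { type: "text"; text: string } | { type: "tool_use"; name: string };

interface AgentMessage {
  role: "user" | "assistant";
  content: ContentBlock[];
}

// Walk the transcript from the end and return the most recent
// non-empty assistant text, or undefined if there is none.
function extractLastAssistantText(messages: AgentMessage[]): string | undefined {
  for (let i = messages.length - 1; i >= 0; i--) {
    const msg = messages[i];
    if (msg.role !== "assistant") continue;
    const text = msg.content
      .filter((b): b is { type: "text"; text: string } => b.type === "text")
      .map((b) => b.text)
      .join("\n")
      .trim();
    if (text) return text;
  }
  return undefined;
}
```

Scanning backward means an assistant turn that contains only tool calls is skipped in favor of the last turn that actually said something, which is usually the most useful summary of progress so far.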
MCP error handling
| Error | Trigger | Recovery |
|---|---|---|
| McpAuthError | Auth failure during tool call | Server transitions to needs-auth state |
| McpSessionExpiredError | HTTP 404 + JSON-RPC -32001 | Automatic reconnect + single retry |
| McpToolCallError | MCP tool returns isError: true | Surfaced as tool_result is_error=true |
| Connection lost | Transport disconnect | Exponential backoff (1s-30s, 5 attempts) |
All MCP errors produce tool_result is_error=true — the agent continues.
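The reconnect schedule for a lost connection can be sketched as a capped exponential series. The 1s start, 30s cap, and 5 attempts come from the table; the doubling factor and all names here are assumptions.

```typescript
const MCP_INITIAL_DELAY_MS = 1000; // 1s, from the table
const MCP_MAX_DELAY_MS = 30000; // 30s, from the table
const MCP_MAX_ATTEMPTS = 5; // from the table

// Delay before each reconnect attempt, doubling each time (assumed factor)
// and never exceeding the cap.
function mcpReconnectDelays(): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < MCP_MAX_ATTEMPTS; attempt++) {
    delays.push(Math.min(MCP_INITIAL_DELAY_MS * Math.pow(2, attempt), MCP_MAX_DELAY_MS));
  }
  return delays;
}
```

Under the doubling assumption, five attempts give delays of 1, 2, 4, 8, and 16 seconds, so the 30s cap would only matter with a larger factor or more attempts.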
Reactive recovery
When the API returns prompt-too-long or media-size errors, query.ts withholds the error from the user and attempts recovery:
- Prompt-too-long: Triggers reactive compact (tryReactiveCompact()) or context collapse (recoverFromOverflow()), then retries
- Media size errors: Strips images from messages and retries
- Only surfaces to user if recovery fails
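The image-stripping step for media size errors might be sketched as below. The block and message shapes, the `stripImages` name, and the placeholder text are assumptions; the source only states that images are stripped before the retry.

```typescript
// Hypothetical content shapes for illustration.
type MediaBlock = { type: "text"; text: string } | { type: "image"; source: unknown };

interface ChatMessage {
  role: "user" | "assistant";
  content: MediaBlock[];
}

// Return a copy of the messages with every image block replaced by a
// short text placeholder, leaving the input untouched.
function stripImages(messages: ChatMessage[]): ChatMessage[] {
  return messages.map((m) => ({
    ...m,
    content: m.content.map(
      (b): MediaBlock => (b.type === "image" ? { type: "text", text: "[image removed]" } : b),
    ),
  }));
}
```

Returning a copy rather than mutating in place keeps the original transcript available, which matters if recovery fails and the error has to be surfaced against the unmodified conversation.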
Graceful degradation hierarchy
- Silent fallback — background work dropped, model fallback, fast mode cooldown
- Transparent recovery — reactive compact, auth refresh, image stripping
- Partial result preservation — agent extracts whatever was accomplished before failure
- User-visible error with guidance — actionable messages, not stack traces
Related entities
- auto-compact — circuit breaker at 3 consecutive failures
- cost-tracker — cached responses still count toward cost
- mcp-client — session retry, auth caching
- telemetry-system — error classification for analytics
- agent-lifecycle — partial result preservation