Retry System
- Entity ID: ent-retry-system
- Type: service
- Scope: shared
What it is
Claude Code's retry system handles API failures with up to 10 retries under exponential backoff, automatic OAuth token refresh, a 90-second per-request watchdog timer, and automatic model fallback from Opus to Sonnet. It is the resilience layer between the agent loop and the Anthropic API.
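A minimal sketch of the retry-with-backoff behavior described above. Only the limit of 10 retries comes from this page. The base delay, the delay cap, and the names `backoffDelayMs` and `withRetry` are hypothetical, and the sketch assumes that "10 retries" means 10 attempts after the initial one.

```typescript
// Only MAX_RETRIES (10) comes from the text; the base delay and cap are
// illustrative assumptions, not values taken from Claude Code.
const MAX_RETRIES = 10;
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 32000;

/** Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped. */
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), MAX_DELAY_MS);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Runs `call` once, then retries up to MAX_RETRIES times, waiting longer
 * before each retry. Rethrows the last error when every attempt has failed.
 * `wait` is injectable so callers can substitute a different delay function.
 */
async function withRetry<T>(
  call: () => Promise<T>,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    if (attempt > 0) {
      await wait(backoffDelayMs(attempt)); // back off before each retry
    }
    try {
      return await call();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

With these assumed constants the delays double from 500 ms and reach the 32-second cap at the seventh retry. A production version would likely add jitter and skip retries for non-transient errors.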
Why it exists
API calls fail for many reasons: rate limits (HTTP 429), server overload (HTTP 529), authentication expiry, network issues, and context-too-large errors. Without a robust retry system, any one of these failures would terminate the user's session. The retry system ensures the agent loop survives transient failures and degrades gracefully under sustained load.
The model fallback (Opus to Sonnet after 3 consecutive HTTP 529 errors) is particularly important: during peak usage, Opus may be overloaded. Rather than failing the request, the system automatically drops to a smaller model to keep the session alive. This mirrors the error recovery cascade in query.ts.
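The fallback rule can be sketched as a small counter over response statuses. The threshold of 3 consecutive HTTP 529 responses comes from the text. The class name `FallbackTracker`, the placeholder model labels, and the choice to stay on the smaller model once the fallback has triggered are assumptions made for illustration.

```typescript
// Placeholder labels for illustration, not real model identifiers.
type Model = "opus" | "sonnet";

const FALLBACK_THRESHOLD = 3; // consecutive HTTP 529 responses, per the text

class FallbackTracker {
  private consecutive529 = 0;
  private model: Model = "opus";

  /** Records the latest HTTP status and returns the model to use next. */
  record(status: number): Model {
    if (status === 529) {
      this.consecutive529 += 1;
      if (this.consecutive529 >= FALLBACK_THRESHOLD) {
        // Assumption: once fallen back, the tracker never switches back.
        this.model = "sonnet";
      }
    } else {
      this.consecutive529 = 0; // any other status breaks the streak
    }
    return this.model;
  }
}
```

Because the counter resets on any non-529 status, two overload errors followed by a success do not count toward the threshold. Only an unbroken run of three triggers the switch.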
What depends on it
- Agent loop (query.ts) — every API call goes through the retry system
- KAIROS — autonomous agents are especially dependent on retry resilience since there's no human to manually restart
- Forked agents — parallel agent spawns multiply the API call volume, making retries more likely
- OAuth flow — auto-refresh on auth failure prevents session death when tokens expire
Trade-offs and limitations
- 90-second watchdog — if a request takes longer than 90 seconds, it's killed and retried. A legitimately slow completion that runs past 90 seconds is cut off and restarted rather than allowed to finish.
- Opus → Sonnet fallback — preserves availability but reduces capability. The user may not realize that responses are coming from Sonnet once the fallback has triggered.
- 10 retry limit — after 10 failures, the request fails permanently. For KAIROS, this could mean a silent death of the autonomous loop.
- No circuit breaker for auth — if OAuth refresh fails repeatedly, the system burns all 10 retries before giving up.
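The watchdog trade-off above can be illustrated with a timeout wrapper built on the standard `AbortController`. This is a sketch under stated assumptions: only the 90-second limit comes from this page, the name `withWatchdog` is hypothetical, and how Claude Code actually cancels a request is not described in the source.

```typescript
const WATCHDOG_MS = 90000; // 90 seconds, per the text above

/**
 * Runs `request` with an AbortSignal that fires if no result arrives within
 * `timeoutMs`. On timeout the returned promise rejects, and the caller (for
 * example a retry loop) decides whether to try again.
 */
async function withWatchdog<T>(
  request: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = WATCHDOG_MS,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // ask the in-flight request to stop
      reject(new Error(`watchdog: no result after ${timeoutMs} ms`));
    }, timeoutMs);
  });
  try {
    // Whichever settles first wins: the request or the watchdog.
    return await Promise.race([request(controller.signal), timedOut]);
  } finally {
    clearTimeout(timer); // avoid a stray timer after the request settles
  }
}
```

The sketch makes the limitation concrete: a completion that would have succeeded at 95 seconds is aborted at 90 and any partial output is discarded, so a retry starts from scratch.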
Key claims
- 10 retries with exponential backoff
- Auto OAuth refresh on authentication failure
- 90-second watchdog timer per request
- Automatic Opus → Sonnet fallback after 3 consecutive HTTP 529 errors
Relationships
- used_by: queryengine-ts, query-ts, forked-agent-pattern, kairos
- depends_on: anthropic-api, oauth-system
- related_to: error-recovery, cache-economics
Evidence
src-20260409-a14e9e98c3cd: Internals — Additional Architecture Insights