Cost and Token Economics
How Claude Code tracks, estimates, and optimizes token usage and cost. The system operates on three levels: real-time cost tracking per session, token estimation for threshold decisions, and architectural optimization for prompt cache economics.
Cost tracking
The cost tracker (src/cost-tracker.ts) accumulates per-model usage after every API response:
Pricing tiers
| Tier | Models | Input/Mtok | Output/Mtok | Cache Write | Cache Read |
|---|---|---|---|---|---|
| Sonnet | Claude Sonnet 4.6 | $3 | $15 | $3.75 | $0.30 |
| Opus standard | Opus 4.5/4.6 | $5 | $25 | $6.25 | $0.50 |
| Opus legacy | Opus 4/4.1 | $15 | $75 | $18.75 | $1.50 |
| Opus fast | Opus 4.6 fast mode | $30 | $150 | $37.50 | $3.00 |
| Haiku 3.5 | Haiku 3.5 | $0.80 | $4 | $1.00 | $0.08 |
| Haiku 4.5 | Haiku 4.5 | $1 | $5 | $1.25 | $0.10 |
Web search: $0.01 per request across all tiers. Unknown models default to Opus standard pricing.
What's tracked
Per model: inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, webSearchRequests, costUSD, contextWindow, maxOutputTokens.
Session costs persist to project config and restore on --resume, so cost display is accurate across session resumption.
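The per-response cost derivation can be sketched directly from the pricing table above. The tier values are from the table; the type and function names are illustrative, not the actual cost-tracker API:

```typescript
// Per-Mtok pricing for one tier (Sonnet values from the table above).
interface Pricing {
  inputPerMtok: number;
  outputPerMtok: number;
  cacheWritePerMtok: number;
  cacheReadPerMtok: number;
}

const SONNET: Pricing = {
  inputPerMtok: 3,
  outputPerMtok: 15,
  cacheWritePerMtok: 3.75,
  cacheReadPerMtok: 0.3,
};

// The usage fields tracked per model, per the "What's tracked" list.
interface Usage {
  inputTokens: number;
  outputTokens: number;
  cacheCreationInputTokens: number;
  cacheReadInputTokens: number;
  webSearchRequests: number;
}

const WEB_SEARCH_COST_USD = 0.01; // flat rate across all tiers

function costUSD(u: Usage, p: Pricing): number {
  const MTOK = 1_000_000;
  return (
    (u.inputTokens * p.inputPerMtok +
      u.outputTokens * p.outputPerMtok +
      u.cacheCreationInputTokens * p.cacheWritePerMtok +
      u.cacheReadInputTokens * p.cacheReadPerMtok) /
      MTOK +
    u.webSearchRequests * WEB_SEARCH_COST_USD
  );
}
```

Note how cache reads at $0.30/Mtok versus $3/Mtok fresh input is the 10x discount that makes the cache-preservation patterns below worth architectural effort.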
Token estimation
The token estimation service (src/services/tokenEstimation.ts) provides three accuracy levels:
| Method | Speed | Accuracy | Cost |
|---|---|---|---|
| API counting (countTokensWithAPI) | Slow | Exact | API call |
| Haiku fallback | Medium | Good | Cheap API call |
| Rough estimation (4 bytes/token) | Instant | ~2x variance | Free |
The rough estimation is file-type-aware: JSON uses 2 bytes/token (higher token density); everything else uses 4. Images and documents are a fixed 2000 tokens.
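The rough tier can be sketched as a byte-ratio heuristic. The constants come from the text; the helper name, signature, and file-extension matching are assumptions, not the real tokenEstimation API:

```typescript
// Rough token estimation: bytes divided by a file-type-aware ratio.
const FIXED_ATTACHMENT_TOKENS = 2000; // images and documents, per the text

function roughTokenEstimate(content: string, filename: string): number {
  // Extension matching here is a guess at how file types are detected.
  if (/\.(png|jpe?g|gif|webp|pdf|docx?)$/i.test(filename)) {
    return FIXED_ATTACHMENT_TOKENS;
  }
  // JSON is punctuation-dense, so it packs more tokens per byte:
  // 2 bytes/token versus the 4 bytes/token default.
  const bytesPerToken = /\.json$/i.test(filename) ? 2 : 4;
  const bytes = new TextEncoder().encode(content).length;
  return Math.ceil(bytes / bytesPerToken);
}
```

The ~2x variance the table cites is the price of this being instant and free: it never touches the API, so it is safe to call on every threshold check.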
These estimates drive threshold decisions throughout the system — auto-compact triggers, context window budgets, tool result persistence.
Prompt cache economics
Prompt caching is not an optimization — it's the cost model. Claude Code achieves 92% overall prompt cache prefix reuse. The architecture enforces this:
Cache-preserving patterns
- Tool registration order — built-in tools sorted alphabetically as a contiguous prefix, MCP tools as a sorted suffix. Adding/removing an MCP server invalidates only the suffix.
- Fork agent cache sharing — sub-agents inherit the parent's exact system prompt and tool array, producing byte-identical API request prefixes. The child gets cache hits from the parent's cached prefix.
- Micro-compaction with cache edits — removes old tool results via the API's cache_edits mechanism without invalidating the cached prefix.
- One-shot agent optimization — Explore and Plan agents skip the agentId/SendMessage/usage trailer, saving ~135 chars × 34M runs/week.
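The tool-ordering invariant from the first bullet can be sketched as a partition-then-sort. The names here are illustrative; the real registration lives in the tool system:

```typescript
interface Tool {
  name: string;
  isMcp: boolean;
}

// Built-in tools form a sorted, contiguous prefix; MCP tools form a
// sorted suffix. Adding or removing an MCP server perturbs only the
// suffix, so the cached built-in prefix stays byte-identical.
function orderForCache(tools: Tool[]): Tool[] {
  const byName = (a: Tool, b: Tool) => a.name.localeCompare(b.name);
  const builtIn = tools.filter((t) => !t.isMcp).sort(byName);
  const mcp = tools.filter((t) => t.isMcp).sort(byName);
  return [...builtIn, ...mcp];
}
```

The design choice is that prompt caching matches on exact byte prefixes, so the stable part of the request must come first and be deterministically ordered.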
Cache-breaking risks
| Risk | Impact | Mitigation |
|---|---|---|
| --resume | 10-20x cost on first turn (only system prompt cached) | Known limitation |
| /effort cascade | Changing effort in one terminal nukes the cache in all others | --debug detection only |
| MCP server changes | Invalidates MCP suffix cache | Partition keeps built-in prefix intact |
| Auto-compact | Rewrites message history, breaks cache | Forked agent reuses parent cache for the summarization itself |
| 5-minute TTL | Cache expires after inactivity | SleepTool must balance sleep vs cache window |
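The 5-minute TTL trade-off in the last row can be sketched as a simple guard. The TTL constant comes from the table; SleepTool's actual decision logic is not described here, so the function is purely illustrative:

```typescript
const CACHE_TTL_MS = 5 * 60 * 1000; // prompt cache expires after 5 min idle

// Would the requested sleep let the cached prefix expire, forcing a
// full (and expensive) cache re-write on the next API request?
function sleepBreaksCache(
  sleepMs: number,
  msSinceLastRequest: number,
): boolean {
  return msSinceLastRequest + sleepMs >= CACHE_TTL_MS;
}
```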
Scale impact
At Claude Code's scale (4% of all public GitHub commits), even small cache efficiency changes have massive cost implications. The autocompact death loop wasted ~250K API calls/day globally. A PR that dropped cache hit rate from 92.7% to 61% was immediately reverted.
Tool result storage economics
Large tool outputs are persisted to disk rather than kept in conversation context:
- maxResultSizeChars = 30,000 — threshold for disk persistence
- MAX_PERSISTED_SIZE = 64MB — absolute limit
- Images resized to max 20MB
This prevents memory bloat during long sessions with hundreds of file reads, and keeps the conversation context budget for actual reasoning.
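The placement decision can be sketched from the two thresholds above. The constants are from the text; the "rejected" branch is an assumption, since the doc does not say what happens past the 64MB cap (chars are treated as bytes for simplicity):

```typescript
const MAX_RESULT_SIZE_CHARS = 30_000;        // maxResultSizeChars
const MAX_PERSISTED_SIZE = 64 * 1024 * 1024; // 64MB absolute limit

type ToolResultPlacement = "inline" | "disk" | "rejected";

function placeToolResult(sizeChars: number): ToolResultPlacement {
  if (sizeChars > MAX_PERSISTED_SIZE) return "rejected"; // assumed behavior
  if (sizeChars > MAX_RESULT_SIZE_CHARS) return "disk";  // persist, keep a stub in context
  return "inline";                                       // small enough for the conversation
}
```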
Context window management
The auto-compact system manages context budget through three tiers:
- Microcompaction — continuous trimming of old tool results (zero API cost)
- Session memory compaction — uses extracted session memory as summary (zero API cost)
- Full compaction — LLM-generated structured summary (API cost, but uses cache-sharing fork)
Threshold: effectiveContextWindow - 13K buffer. For a 200K-context model: triggers at ~167K tokens.
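The trigger check can be sketched as follows, assuming effectiveContextWindow already excludes tokens reserved for output (which would explain a ~167K trigger on a 200K model: roughly 200K minus a ~20K output reserve minus the 13K buffer). The names are illustrative:

```typescript
const COMPACT_BUFFER_TOKENS = 13_000;

// Auto-compact fires when the estimated conversation size approaches
// the usable window, leaving the buffer as headroom.
function shouldAutoCompact(
  estimatedTokens: number,
  effectiveContextWindow: number,
): boolean {
  return estimatedTokens >= effectiveContextWindow - COMPACT_BUFFER_TOKENS;
}
```

Because this check runs on every turn, it is fed by the free rough-estimation tier rather than exact API counting.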
Related entities
- cost-tracker — per-session cost accumulation and display
- token-estimation-service — three-tier token counting
- cache-economics — the deep dive on cache preservation
- auto-compact — context window management
- tool-system — cache-aware tool registration
- forked-agent-pattern — cache-sharing fork model