Cost and Token Economics

How Claude Code tracks, estimates, and optimizes token usage and cost. The system operates on three levels: real-time cost tracking per session, token estimation for threshold decisions, and architectural optimization for prompt cache economics.

Cost tracking

The cost tracker (src/cost-tracker.ts) accumulates per-model usage after every API response.

Pricing tiers

| Tier | Models | Input/MTok | Output/MTok | Cache Write/MTok | Cache Read/MTok |
|---|---|---|---|---|---|
| Sonnet | Claude Sonnet 4.6 | $3 | $15 | $3.75 | $0.30 |
| Opus standard | Opus 4.5/4.6 | $5 | $25 | $6.25 | $0.50 |
| Opus legacy | Opus 4/4.1 | $15 | $75 | $18.75 | $1.50 |
| Opus fast | Opus 4.6 fast mode | $30 | $150 | $37.50 | $3.00 |
| Haiku 3.5 | Haiku 3.5 | $0.80 | $4 | $1.00 | $0.08 |
| Haiku 4.5 | Haiku 4.5 | $1 | $5 | $1.25 | $0.10 |

Web search: $0.01 per request across all tiers. Unknown models default to Opus standard pricing.
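The table above can be applied mechanically to a usage record. The sketch below is illustrative: the `Pricing` shape, tier keys, and `costUSD` signature are hypothetical, not the actual src/cost-tracker.ts API — only the rates and the Opus-standard fallback come from the text.

```typescript
// Rates are USD per million tokens, taken from the pricing table above.
// All names here are illustrative, not the real cost-tracker API.
interface Pricing {
  input: number;
  output: number;
  cacheWrite: number;
  cacheRead: number;
}

const PRICING: Record<string, Pricing> = {
  sonnet: { input: 3, output: 15, cacheWrite: 3.75, cacheRead: 0.3 },
  opusStandard: { input: 5, output: 25, cacheWrite: 6.25, cacheRead: 0.5 },
  opusLegacy: { input: 15, output: 75, cacheWrite: 18.75, cacheRead: 1.5 },
  opusFast: { input: 30, output: 150, cacheWrite: 37.5, cacheRead: 3 },
  haiku35: { input: 0.8, output: 4, cacheWrite: 1, cacheRead: 0.08 },
  haiku45: { input: 1, output: 5, cacheWrite: 1.25, cacheRead: 0.1 },
};

const WEB_SEARCH_COST = 0.01; // flat per request, all tiers

function costUSD(
  tier: string,
  usage: {
    inputTokens: number;
    outputTokens: number;
    cacheCreationInputTokens: number;
    cacheReadInputTokens: number;
    webSearchRequests: number;
  },
): number {
  // Unknown models fall back to Opus standard pricing.
  const p = PRICING[tier] ?? PRICING.opusStandard;
  const tokenCost =
    usage.inputTokens * p.input +
    usage.outputTokens * p.output +
    usage.cacheCreationInputTokens * p.cacheWrite +
    usage.cacheReadInputTokens * p.cacheRead;
  return tokenCost / 1_000_000 + usage.webSearchRequests * WEB_SEARCH_COST;
}
```

For example, a Sonnet turn consuming 1M uncached input tokens costs $3, while the same million tokens served from cache costs $0.30 — the 10x spread that makes prefix reuse the dominant cost lever.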

What's tracked

Per model: inputTokens, outputTokens, cacheReadInputTokens, cacheCreationInputTokens, webSearchRequests, costUSD, contextWindow, maxOutputTokens.

Session costs persist to project config and restore on --resume, so cost display is accurate across session resumption.

Token estimation

The token estimation service (src/services/tokenEstimation.ts) provides three accuracy levels:

| Method | Speed | Accuracy | Cost |
|---|---|---|---|
| API counting (countTokensWithAPI) | Slow | Exact | API call |
| Haiku fallback | Medium | Good | Cheap API call |
| Rough estimation (4 bytes/token) | Instant | ~2x variance | Free |

The rough estimation is file-type-aware: JSON uses 2 bytes/token (higher token density); everything else uses 4. Images and documents are a fixed 2000 tokens.
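The byte heuristic above is simple enough to sketch directly. The function name and signature are illustrative, not the actual tokenEstimation.ts API; only the ratios and the fixed attachment cost come from the text.

```typescript
// Sketch of the file-type-aware rough estimate described above.
const FIXED_ATTACHMENT_TOKENS = 2000; // images and documents

function roughEstimateTokens(byteLength: number, fileType: string): number {
  if (fileType === "image" || fileType === "document") {
    return FIXED_ATTACHMENT_TOKENS;
  }
  // JSON is token-dense (heavy punctuation), so ~2 bytes/token;
  // everything else uses the ~4 bytes/token heuristic.
  const bytesPerToken = fileType === "json" ? 2 : 4;
  return Math.ceil(byteLength / bytesPerToken);
}
```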

These estimates drive threshold decisions throughout the system — auto-compact triggers, context window budgets, tool result persistence.

Prompt cache economics

Prompt caching is not an optimization — it's the cost model. Claude Code achieves 92% overall prompt cache prefix reuse. The architecture enforces this:

Cache-preserving patterns

  1. Tool registration order — built-in tools sorted alphabetically as a contiguous prefix, MCP tools as a sorted suffix. Adding/removing an MCP server invalidates only the suffix.
  2. Fork agent cache sharing — sub-agents inherit the parent's exact system prompt and tool array, producing byte-identical API request prefixes. The child gets cache hits from the parent's cached prefix.
  3. Micro-compaction with cache edits — removes old tool results via the API's cache_edits mechanism without invalidating the cached prefix.
  4. One-shot agent optimization — Explore and Plan agents skip agentId/SendMessage/usage trailer, saving ~135 chars × 34M runs/week.
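Pattern 1 above can be sketched as a pure ordering function — the names are illustrative, not the actual registration code, but the invariant is the one described: a stable sorted prefix of built-ins, with MCP tools confined to a sorted suffix.

```typescript
// Cache-preserving tool ordering (sketch): built-in tools form a sorted,
// contiguous prefix and MCP tools a sorted suffix, so adding or removing
// an MCP server invalidates only the suffix of the cached prompt prefix.
function orderTools(builtIn: string[], mcp: string[]): string[] {
  const prefix = [...builtIn].sort(); // byte-identical across sessions
  const suffix = [...mcp].sort();     // changes only with MCP config
  return [...prefix, ...suffix];
}
```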

Cache-breaking risks

| Risk | Impact | Mitigation |
|---|---|---|
| --resume | 10-20x cost on first turn (only the system prompt is cached) | Known limitation |
| /effort cascade | Changing effort in one terminal nukes the cache in all others | --debug detection only |
| MCP server changes | Invalidates the MCP suffix cache | Partition keeps the built-in prefix intact |
| Auto-compact | Rewrites message history, breaking the cache | Forked agent reuses the parent cache for the summarization itself |
| 5-minute TTL | Cache expires after inactivity | SleepTool must balance sleep vs. the cache window |

Scale impact

At Claude Code's scale (4% of all public GitHub commits), even small cache efficiency changes have massive cost implications. The autocompact death loop wasted ~250K API calls/day globally. A PR that dropped cache hit rate from 92.7% to 61% was immediately reverted.

Tool result storage economics

Large tool outputs are persisted to disk rather than kept in conversation context:

  - maxResultSizeChars = 30,000 — threshold for disk persistence
  - MAX_PERSISTED_SIZE = 64MB — absolute limit
  - Images resized to a 20MB maximum

This prevents memory bloat during long sessions with hundreds of file reads, and keeps the conversation context budget for actual reasoning.
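The size thresholds above imply a simple placement decision. This sketch uses the constants from the text; the function name is hypothetical, and the "rejected" branch is an assumption about what the absolute limit means (the source does not say how over-limit results are handled).

```typescript
// Sketch of the size-based tool-result placement decision.
const maxResultSizeChars = 30_000;           // keep smaller results inline
const MAX_PERSISTED_SIZE = 64 * 1024 * 1024; // 64MB absolute limit

type Placement = "inline" | "disk" | "rejected";

function placeToolResult(sizeChars: number): Placement {
  if (sizeChars > MAX_PERSISTED_SIZE) return "rejected"; // assumed behavior
  if (sizeChars > maxResultSizeChars) return "disk";
  return "inline";
}
```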

Context window management

The auto-compact system manages context budget through three tiers:

  1. Microcompaction — continuous trimming of old tool results (zero API cost)
  2. Session memory compaction — uses extracted session memory as summary (zero API cost)
  3. Full compaction — LLM-generated structured summary (API cost, but uses cache-sharing fork)

Threshold: effectiveContextWindow minus a 13K buffer. For a 200K-context model, this triggers at ~167K tokens.
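For the stated numbers to line up (a 200K-context model triggering near 167K), effectiveContextWindow must already be smaller than the raw context window — presumably because some budget is reserved for model output. The sketch below assumes a ~20K output reserve purely to make the arithmetic consistent; the real derivation of effectiveContextWindow may differ.

```typescript
// Auto-compact threshold arithmetic (sketch). OUTPUT_RESERVE is an
// assumption chosen so a 200K-context model triggers near the stated
// ~167K; only the 13K buffer is given by the source.
const COMPACT_BUFFER = 13_000;
const OUTPUT_RESERVE = 20_000; // assumed reservation for model output

function autoCompactThreshold(contextWindow: number): number {
  const effectiveContextWindow = contextWindow - OUTPUT_RESERVE;
  return effectiveContextWindow - COMPACT_BUFFER;
}
```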