Testing and Quality Assurance
This page describes how Claude Code is tested and which quality mechanisms exist. The approach is unusual: few unit tests, heavy reliance on VCR fixture recording for API interactions, a multi-layer security classification system, and a built-in verification agent for post-implementation checks.
VCR: the primary test infrastructure
The VCR system (src/services/vcr.ts) is the dominant testing mechanism. It records and replays Anthropic API interactions as deterministic fixtures.
How it works
- Hash input messages (SHA-1) to create a fixture key
- On cache hit: replay stored response (no API call)
- On cache miss (non-CI): make real API call, write fixture
- On cache miss (CI without VCR_RECORD=1): throw error (no new fixtures in CI)
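The four steps above can be pictured as a small record/replay wrapper. This is a minimal sketch, not the code in src/services/vcr.ts: the `Message` and `Env` types, the `fixtureKey` and `withVCRSketch` names, and the in-memory `Map` standing in for fixture files are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical message shape; the real types may differ.
type Message = { role: "user" | "assistant"; content: string };

// Hypothetical view of the environment: CI detection and VCR_RECORD=1.
type Env = { isCI: boolean; vcrRecord: boolean };

// Derive a deterministic fixture key by hashing the input messages with SHA-1.
function fixtureKey(messages: Message[]): string {
  return createHash("sha1").update(JSON.stringify(messages)).digest("hex");
}

// Record/replay wrapper. `store` stands in for the on-disk fixture directory.
async function withVCRSketch<T>(
  messages: Message[],
  store: Map<string, T>,
  env: Env,
  callApi: () => Promise<T>,
): Promise<T> {
  const key = fixtureKey(messages);
  const cached = store.get(key);
  if (cached !== undefined) return cached; // cache hit: replay, no API call

  if (env.isCI && !env.vcrRecord) {
    // Cache miss in CI without VCR_RECORD=1: refuse to create new fixtures.
    throw new Error(`Missing VCR fixture ${key}`);
  }

  const response = await callApi(); // cache miss elsewhere: make the real call
  store.set(key, response); // write the fixture for later replays
  return response;
}
```

Because the key depends only on the serialized messages, any change to the prompt produces a new key and therefore a cache miss, which is what makes CI fail loudly when fixtures are out of date.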
Dehydration/hydration
Path-dependent values are replaced with placeholders so fixtures work across machines:
- cwd → [CWD]
- Config home → [CONFIG_HOME]
- UUIDs → [UUID]
- Timestamps → [TIMESTAMP]
Windows path normalization is handled explicitly.
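A rough sketch of the placeholder substitution follows. The function names, the `Paths` type, and both regular expressions are assumptions; in particular, the sketch only recognizes ISO-8601 timestamps and normalizes every backslash, which a real implementation would likely handle more selectively.

```typescript
// Hypothetical inputs: the machine-specific paths to scrub from a fixture.
type Paths = { cwd: string; configHome: string };

const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
// Assumption: timestamps are ISO-8601; other formats are not handled here.
const ISO_TS_RE = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?/g;

// Normalize Windows separators so one fixture matches on every platform.
const toPosix = (s: string): string => s.replace(/\\/g, "/");

// Replace machine-specific values with stable placeholders before writing.
function dehydrate(text: string, paths: Paths): string {
  const pairs: [string, string][] = [
    [toPosix(paths.cwd), "[CWD]"],
    [toPosix(paths.configHome), "[CONFIG_HOME]"],
  ];
  // Replace the longer path first so a nested path is not partially rewritten.
  pairs.sort((a, b) => b[0].length - a[0].length);

  let out = toPosix(text);
  for (const [path, placeholder] of pairs) {
    out = out.split(path).join(placeholder);
  }
  return out.replace(UUID_RE, "[UUID]").replace(ISO_TS_RE, "[TIMESTAMP]");
}

// Restore local paths when replaying. UUIDs and timestamps stay as
// placeholders because their original values are not recoverable.
function hydrate(text: string, paths: Paths): string {
  return text
    .split("[CONFIG_HOME]").join(paths.configHome)
    .split("[CWD]").join(paths.cwd);
}
```

Dehydration is lossy by design: only the two path placeholders can be mapped back to concrete values on the replaying machine.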
Three entry points
- withVCR: batch message recording
- withStreamingVCR: async generator streaming
- withTokenCountVCR: token counting calls
Cached responses still get their USD cost added to session totals via addCachedCostToTotalSessionCost — cost tracking is accurate even in test mode.
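The streaming entry point differs from the batch one mainly in that it has to replay a sequence of chunks rather than a single value. The sketch below illustrates that idea with an async generator; the name `streamingVCRSketch`, its signature, and the `Map`-based store are hypothetical and do not reflect the real withStreamingVCR API.

```typescript
// Hypothetical streaming record/replay wrapper built on an async generator.
async function* streamingVCRSketch<T>(
  key: string,
  store: Map<string, T[]>,
  live: () => AsyncIterable<T>,
): AsyncGenerator<T> {
  const recorded = store.get(key);
  if (recorded !== undefined) {
    yield* recorded; // replay chunks in their recorded order
    return;
  }

  const chunks: T[] = [];
  for await (const chunk of live()) {
    chunks.push(chunk); // buffer while passing each chunk through unchanged
    yield chunk;
  }
  store.set(key, chunks); // persist only after the stream completed
}
```

Writing the fixture only after the stream finishes means an interrupted stream never leaves a truncated recording behind.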
Why VCR over unit tests
The source contains only 2 test files in src/ (auth token validation, PTY session management, both using node:test). The VCR system is the primary test infrastructure because:
- Claude Code's behavior depends heavily on model responses — mocking the model defeats the purpose
- Fixture-recorded interactions capture the actual model behavior at a point in time
- Integration tests via recorded API calls catch regressions that unit tests would miss
Security classification
Auto-mode classifier (yoloClassifier.ts)
A side-query to Claude that classifies each tool invocation into allow/soft_deny categories. Uses a dedicated system prompt (auto_mode_system_prompt.txt) with separate permission templates for internal vs external users. Results typed as YoloClassifierResult.
Users can customize rules via settings.autoMode config (allow, soft_deny, environment sections).
Bash security (bashSecurity.ts)
23+ named security check IDs scanning for command injection patterns:
- Command substitution: $(, ${, $[, <(, >(
- Shell metacharacters, obfuscated flags, dangerous variables
- IFS injection, proc environ access, brace expansion
- Control characters, unicode whitespace
- Zsh-specific dangerous commands (zmodload, sysopen, ztcp, etc.)
Quote extraction strips quoted strings to analyze only the unquoted shell where injection is dangerous.
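To illustrate why quote extraction matters, here is a simplified sketch that removes single-quoted spans and escaped characters and then scans what remains for the substitution openers listed above. It is not the logic in bashSecurity.ts: the function names and pattern labels are hypothetical, and the sketch deliberately keeps double-quoted content because Bash still performs command substitution inside double quotes.

```typescript
// Return the parts of a command where the shell still interprets `$(` etc.
// Single-quoted text and backslash-escaped characters are literal, so drop them.
function stripLiteralText(command: string): string {
  let out = "";
  let inSingle = false;
  let inDouble = false;
  for (let i = 0; i < command.length; i++) {
    const ch = command.charAt(i);
    if (inSingle) {
      if (ch === "'") inSingle = false; // closing quote; content was dropped
      continue;
    }
    if (ch === "\\") {
      i++; // skip the backslash and the character it escapes
      continue;
    }
    if (ch === '"') {
      inDouble = !inDouble;
      out += ch;
      continue;
    }
    if (ch === "'" && !inDouble) {
      inSingle = true; // a quote inside double quotes is an ordinary character
      continue;
    }
    out += ch;
  }
  return out;
}

// Labels are illustrative; they are not the real security check IDs.
const SUBSTITUTION_PATTERNS: [string, RegExp][] = [
  ["command substitution $(", /\$\(/],
  ["parameter expansion ${", /\$\{/],
  ["legacy arithmetic expansion $[", /\$\[/],
  ["process substitution <(", /<\(/],
  ["process substitution >(", />\(/],
];

function findSubstitutions(command: string): string[] {
  const visible = stripLiteralText(command);
  return SUBSTITUTION_PATTERNS
    .filter(([, pattern]) => pattern.test(visible))
    .map(([label]) => label);
}
```

With this approach `echo '$(rm -rf /)'` yields no findings because the payload is inert inside single quotes, while `echo "$(whoami)"` is still flagged.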
Dangerous patterns (dangerousPatterns.ts)
Cross-platform code execution entry points (python, node, ruby, perl, ssh) and Bash-specific dangerous patterns (eval, exec, sudo) that bypass the classifier entirely.
PowerShell security
Parallel security checks for the PowerShell tool, with its own validation rules.
Verification agent
A built-in agent (src/tools/AgentTool/built-in/verificationAgent.ts) specifically for post-implementation verification:
- Read-only: cannot use FileEdit, FileWrite, NotebookEdit, or spawn nested agents
- Must run actual commands: "reading is not verification" — runs tests, curls endpoints, starts dev servers
- Structured output: Check reports with Command/Output/Result format
- Verdict: PASS | FAIL | PARTIAL
- Domain coverage: frontend (dev server + browser automation), backend, CLI, infrastructure, library, bug fixes, mobile, data/ML, database migrations, refactoring
- Anti-rationalization: explicitly documented patterns like "the code looks correct" are flagged as insufficient
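One way to picture the structured output is as a list of check records that roll up into a verdict. The sketch below is an assumption-heavy illustration: the `Check` type, the `SKIPPED` result value, the exact text layout, and the roll-up rule are all hypothetical, since the source only states that reports use a Command/Output/Result format and end in PASS, FAIL, or PARTIAL.

```typescript
type CheckResult = "PASS" | "FAIL" | "SKIPPED"; // "SKIPPED" is an assumption
type Verdict = "PASS" | "FAIL" | "PARTIAL";

interface Check {
  title: string;
  command: string; // the command that was actually executed
  output: string; // captured output from that command
  result: CheckResult;
}

// Render one check in Command/Output/Result form. A check with no command
// is rejected, mirroring the "reading is not verification" rule.
function formatCheck(check: Check): string {
  if (check.command.trim() === "") {
    throw new Error(`Check "${check.title}" has no command`);
  }
  return [
    `Check: ${check.title}`,
    `Command: ${check.command}`,
    `Output: ${check.output}`,
    `Result: ${check.result}`,
  ].join("\n");
}

// Hypothetical roll-up rule: any failure fails the run, and anything that
// could not be executed downgrades the verdict to PARTIAL.
function overallVerdict(checks: Check[]): Verdict {
  if (checks.some((c) => c.result === "FAIL")) return "FAIL";
  if (checks.length === 0 || checks.some((c) => c.result === "SKIPPED")) {
    return "PARTIAL";
  }
  return "PASS";
}
```

Requiring a command on every check makes a statement such as "the code looks correct" unrepresentable as evidence, which is the point of the anti-rationalization rule.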
Schema validation
- Zod v4 throughout for runtime schema validation — tool inputs, settings, MCP configs, hook definitions
- Tree-sitter AST analysis for structured bash command parsing
- Sed validation (sedValidation.ts): validates sed commands before execution
- Path validation: filesystem access control for both Bash and PowerShell
- Plugin schemas (src/utils/plugins/schemas.ts): validates plugin configurations
Feature flags and A/B testing
GrowthBook SDK provides feature flags and experiment infrastructure:
- User attributes for targeting: platform, org, subscription, rate limit tier, app version
- Three access patterns: async (always fresh), cached (may be stale), cached with refresh callback
- Experiment exposure dedup via loggedExposures Set
- Env overrides for internal testing: CLAUDE_INTERNAL_FC_OVERRIDES
- Cached features persist to global config file between sessions
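The three access patterns and the exposure dedup can be sketched without the GrowthBook SDK itself. Everything below is hypothetical: the `FlagClientSketch` class, its method names, and the `fetchRemote` callback are stand-ins that only illustrate the trade-off between freshness and latency described above, not the real integration.

```typescript
type Features = Record<string, unknown>;

class FlagClientSketch {
  private cache: Features;
  private fetchRemote: () => Promise<Features>;
  // Mirrors the loggedExposures Set: each feature is logged at most once.
  private loggedExposures = new Set<string>();
  readonly exposureLog: string[] = [];

  // `persisted` stands in for features cached in the global config file.
  constructor(fetchRemote: () => Promise<Features>, persisted: Features = {}) {
    this.fetchRemote = fetchRemote;
    this.cache = { ...persisted };
  }

  // Pattern 1, async: always waits for fresh values.
  async getFresh<T>(key: string, fallback: T): Promise<T> {
    this.cache = await this.fetchRemote();
    return this.read(key, fallback);
  }

  // Pattern 2, cached: returns immediately, possibly with a stale value.
  getCached<T>(key: string, fallback: T): T {
    return this.read(key, fallback);
  }

  // Pattern 3, cached with refresh: returns the cached value now and calls
  // `onRefresh` later if the fresh value turns out to be different.
  getCachedWithRefresh<T>(key: string, fallback: T, onRefresh: (fresh: T) => void): T {
    const stale = this.read(key, fallback);
    this.getFresh(key, fallback)
      .then((fresh) => {
        if (fresh !== stale) onRefresh(fresh);
      })
      .catch(() => {
        // A failed refresh keeps the cached value.
      });
    return stale;
  }

  private read<T>(key: string, fallback: T): T {
    if (!(key in this.cache)) return fallback;
    if (!this.loggedExposures.has(key)) {
      this.loggedExposures.add(key);
      this.exposureLog.push(key); // log the exposure only on first read
    }
    return this.cache[key] as T;
  }
}
```

Callers on a startup path would typically prefer the cached read, while code that must not act on stale data would pay the latency of the async read.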
Mock rate limits
Internal-only system (ant employees) with 20+ predefined scenarios for simulating rate limit conditions without hitting real API limits: session-limit-reached, weekly-limit-reached, overage-active, out-of-credits, fast-mode-limit, etc. Enables testing rate limit UX without waiting for actual limits.
Quality without conventional test suites
The quality strategy is layered:
- Prevention — Zod schemas, bash security checks, dangerous pattern blocking, permission pipeline
- Detection — auto-mode classifier, verification agent, tree-sitter AST analysis
- Regression — VCR fixtures capture known-good model interactions
- Simulation — mock rate limits test edge cases in rate limit handling
- Experimentation — GrowthBook enables gradual rollout with rollback
Related entities
- bash-security — 23+ security check IDs
- permission-pipeline — tool permission decisions
- telemetry-system — GrowthBook feature flags
- tool-system — schema validation via Zod
- agent-lifecycle — verification agent