Testing and Quality Assurance

How Claude Code is tested and what quality mechanisms exist. The approach is unusual: minimal unit tests, heavy reliance on VCR fixture recording for API interactions, a multi-layer security classification system, and a built-in verification agent for post-implementation checking.

VCR: the primary test infrastructure

The VCR system (src/services/vcr.ts) is the dominant testing mechanism. It records and replays Anthropic API interactions as deterministic fixtures.

How it works

  1. Hash input messages (SHA-1) to create a fixture key
  2. On cache hit: replay stored response (no API call)
  3. On cache miss (non-CI): make real API call, write fixture
  4. On cache miss (CI without VCR_RECORD=1): throw error (no new fixtures in CI)
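The flow above can be sketched as follows. All names here are illustrative; the real src/services/vcr.ts persists fixtures to disk and differs in detail:

```typescript
// Minimal sketch of the VCR record/replay flow (hypothetical names).
import { createHash } from "node:crypto";

type Fixture = { response: string };

const fixtures = new Map<string, Fixture>(); // stands in for on-disk fixture files

// 1. Hash the input messages (SHA-1) to form a deterministic fixture key.
function fixtureKey(messages: unknown): string {
  return createHash("sha1").update(JSON.stringify(messages)).digest("hex");
}

async function callWithVcr(
  messages: unknown,
  realApiCall: () => Promise<string>,
  env: { isCI: boolean; record: boolean },
): Promise<string> {
  const key = fixtureKey(messages);
  const hit = fixtures.get(key);
  if (hit) return hit.response; // 2. cache hit: replay, no API call
  if (env.isCI && !env.record) {
    // 4. cache miss in CI without VCR_RECORD=1: fail instead of recording
    throw new Error(`Missing VCR fixture ${key} in CI`);
  }
  const response = await realApiCall(); // 3. cache miss locally: record
  fixtures.set(key, { response });
  return response;
}
```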

Dehydration/hydration

Path-dependent values are replaced with placeholders so fixtures work across machines:

  - cwd → [CWD]
  - Config home → [CONFIG_HOME]
  - UUIDs → [UUID]
  - Timestamps → [TIMESTAMP]

Windows path normalization is handled explicitly.
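A dehydrate/hydrate pass might look like the sketch below. The substitutions mirror the placeholder table above; everything else (function names, regexes) is an assumption, and the real implementation also normalizes Windows paths before matching:

```typescript
// Illustrative dehydration: replace machine-local values with stable placeholders.
const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const TIMESTAMP_RE = /\d{4}-\d{2}-\d{2}T[\d:.]+Z/g;

function dehydrate(text: string, cwd: string, configHome: string): string {
  return text
    .split(cwd).join("[CWD]")
    .split(configHome).join("[CONFIG_HOME]")
    .replace(UUID_RE, "[UUID]")
    .replace(TIMESTAMP_RE, "[TIMESTAMP]");
}

// On replay, placeholders with concrete machine-local values are restored.
function hydrate(text: string, cwd: string, configHome: string): string {
  return text.split("[CWD]").join(cwd).split("[CONFIG_HOME]").join(configHome);
}
```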

Three entry points

Cached responses still get their USD cost added to session totals via addCachedCostToTotalSessionCost, so cost tracking stays accurate even in test mode.
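A minimal sketch of this bookkeeping. addCachedCostToTotalSessionCost is named in the source; the surrounding session state shown here is an assumption:

```typescript
// Hypothetical session-cost state; the real accumulator lives elsewhere.
let totalSessionCostUSD = 0;

function addCachedCostToTotalSessionCost(fixtureCostUSD: number): void {
  // Replayed fixtures carry the USD cost recorded at capture time, so
  // session totals match what a live run would have spent.
  totalSessionCostUSD += fixtureCostUSD;
}
```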

Why VCR over unit tests

The source contains only 2 test files in src/ (auth token validation, PTY session management, both using node:test). The VCR system is the primary test infrastructure because:

  - Claude Code's behavior depends heavily on model responses; mocking the model defeats the purpose
  - Fixture-recorded interactions capture the actual model behavior at a point in time
  - Integration tests via recorded API calls catch regressions that unit tests would miss

Security classification

Auto-mode classifier (yoloClassifier.ts)

A side-query to Claude that classifies each tool invocation into allow/soft_deny categories. Uses a dedicated system prompt (auto_mode_system_prompt.txt) with separate permission templates for internal vs external users. Results typed as YoloClassifierResult.

Users can customize rules via settings.autoMode config (allow, soft_deny, environment sections).
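The allow/soft_deny/environment section names come from the source; the field shapes and rule syntax below are illustrative assumptions, not the documented schema:

```typescript
// Hypothetical shape of the settings.autoMode block.
interface AutoModeSettings {
  allow?: string[];       // patterns the classifier may auto-approve
  soft_deny?: string[];   // patterns that require explicit user confirmation
  environment?: Record<string, string>; // extra context fed to the classifier
}

const autoMode: AutoModeSettings = {
  allow: ["Bash(git status)", "Read(*)"],
  soft_deny: ["Bash(rm *)"],
  environment: { PROJECT_KIND: "monorepo" },
};
```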

Bash security (bashSecurity.ts)

23+ named security check IDs scanning for command injection patterns:

  - Command substitution: $(, ${, $[, <(, >(
  - Shell metacharacters, obfuscated flags, dangerous variables
  - IFS injection, proc environ access, brace expansion
  - Control characters, unicode whitespace
  - Zsh-specific dangerous commands (zmodload, sysopen, ztcp, etc.)

Quote extraction strips quoted strings so the checks analyze only the unquoted portions of the command, where injection is dangerous.
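A minimal sketch of the idea. In bash, text inside single quotes cannot trigger expansion, so it is blanked out before scanning; the real bashSecurity.ts handles escapes, double quotes, and nesting far more carefully, and these function names are invented:

```typescript
// Remove single-quoted regions so injection checks only see unquoted text.
function stripSingleQuoted(command: string): string {
  return command.replace(/'[^']*'/g, "''");
}

// One example check: command substitution operators in the unquoted remainder.
function hasCommandSubstitution(command: string): boolean {
  const unquoted = stripSingleQuoted(command);
  return /\$\(|\$\{|<\(|>\(/.test(unquoted);
}
```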

Dangerous patterns (dangerousPatterns.ts)

Cross-platform code execution entry points (python, node, ruby, perl, ssh) and Bash-specific dangerous patterns (eval, exec, sudo) that are matched directly, bypassing the classifier entirely.
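An illustrative check for these entry points. The binary lists come from the paragraph above; the matching logic is an assumption, and the real dangerousPatterns.ts list is longer and more precise:

```typescript
// Interpreters and commands that skip classification entirely.
const CODE_EXECUTION_BINARIES = new Set(["python", "node", "ruby", "perl", "ssh"]);
const BASH_DANGEROUS = new Set(["eval", "exec", "sudo"]);

function bypassesClassifier(command: string): boolean {
  const first = command.trim().split(/\s+/)[0] ?? "";
  const base = first.split("/").pop() ?? first; // /usr/bin/python -> python
  return CODE_EXECUTION_BINARIES.has(base) || BASH_DANGEROUS.has(base);
}
```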

PowerShell security

The PowerShell tool has a parallel set of security checks with its own validation rules.

Verification agent

A built-in agent (src/tools/AgentTool/built-in/verificationAgent.ts) exists specifically for post-implementation verification.

Schema validation

Feature flags and A/B testing

GrowthBook SDK provides feature flags and experiment infrastructure:

  - User attributes for targeting: platform, org, subscription, rate limit tier, app version
  - Three access patterns: async (always fresh), cached (may be stale), cached with refresh callback
  - Experiment exposure dedup via loggedExposures Set
  - Env overrides for internal testing: CLAUDE_INTERNAL_FC_OVERRIDES
  - Cached features persist to global config file between sessions
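The exposure dedup and the three access patterns can be sketched as below. The loggedExposures name comes from the source; everything else is an illustrative assumption, not the real GrowthBook integration:

```typescript
// Dedup: each experiment exposure is logged at most once per session.
const loggedExposures = new Set<string>();

function logExposureOnce(experimentId: string, log: (id: string) => void): void {
  if (loggedExposures.has(experimentId)) return;
  loggedExposures.add(experimentId);
  log(experimentId);
}

let cachedFeatures: Record<string, boolean> = {};

// Pattern 1: async, always fresh.
async function getFeatureFresh(
  key: string,
  fetchFeatures: () => Promise<Record<string, boolean>>,
): Promise<boolean> {
  cachedFeatures = await fetchFeatures();
  return cachedFeatures[key] ?? false;
}

// Pattern 2: synchronous read of the cache (may be stale).
function getFeatureCached(key: string): boolean {
  return cachedFeatures[key] ?? false;
}

// Pattern 3: synchronous read plus a background refresh callback.
function getFeatureCachedWithRefresh(
  key: string,
  fetchFeatures: () => Promise<Record<string, boolean>>,
  onRefresh: () => void,
): boolean {
  void fetchFeatures().then((f) => { cachedFeatures = f; onRefresh(); });
  return cachedFeatures[key] ?? false;
}
```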

Mock rate limits

Internal-only system (ant employees) with 20+ predefined scenarios for simulating rate limit conditions without hitting real API limits: session-limit-reached, weekly-limit-reached, overage-active, out-of-credits, fast-mode-limit, etc. Enables testing rate limit UX without waiting for actual limits.
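A sketch of what a scenario table might look like. The scenario names come from the list above; the per-scenario shape (status codes, retry hints) is an assumption:

```typescript
// Hypothetical mock rate-limit scenarios, keyed by the names above.
interface RateLimitScenario {
  status: number;
  retryAfterSeconds?: number;
  message: string;
}

const MOCK_SCENARIOS: Record<string, RateLimitScenario> = {
  "session-limit-reached": { status: 429, retryAfterSeconds: 3600, message: "Session limit reached" },
  "weekly-limit-reached": { status: 429, retryAfterSeconds: 86400, message: "Weekly limit reached" },
  "out-of-credits": { status: 402, message: "Out of credits" },
};

// Returning undefined means "no mock active": fall through to the real API.
function mockRateLimitResponse(name: string): RateLimitScenario | undefined {
  return MOCK_SCENARIOS[name];
}
```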

Quality without conventional test suites

The quality strategy is layered:

  1. Prevention — Zod schemas, bash security checks, dangerous pattern blocking, permission pipeline
  2. Detection — auto-mode classifier, verification agent, tree-sitter AST analysis
  3. Regression — VCR fixtures capture known-good model interactions
  4. Simulation — mock rate limits test edge cases in rate limit handling
  5. Experimentation — GrowthBook enables gradual rollout with rollback