Testing and Quality Assurance

How Claude Code is tested and what quality mechanisms exist. The approach is unusual: minimal unit tests, heavy reliance on VCR fixture recording for API interactions, a multi-layer security classification system, and a built-in verification agent for post-implementation checking.

VCR: the primary test infrastructure

The VCR system (src/services/vcr.ts) is the dominant testing mechanism. It records and replays Anthropic API interactions as deterministic fixtures.

How it works

  1. Hash input messages (SHA-1) to create a fixture key
  2. On cache hit: replay stored response (no API call)
  3. On cache miss (non-CI): make real API call, write fixture
  4. On cache miss (CI without VCR_RECORD=1): throw error (no new fixtures in CI)
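The flow above can be sketched as follows. All names here are illustrative; the real src/services/vcr.ts persists fixtures to disk and differs in detail:

```typescript
// Minimal sketch of the VCR record/replay flow (hypothetical names).
import { createHash } from "node:crypto";

type Fixture = { response: string };

const fixtures = new Map<string, Fixture>(); // stands in for on-disk fixture files

// 1. Hash the input messages (SHA-1) to form a deterministic fixture key.
function fixtureKey(messages: unknown): string {
  return createHash("sha1").update(JSON.stringify(messages)).digest("hex");
}

async function callWithVcr(
  messages: unknown,
  realApiCall: () => Promise<string>,
  env: { isCI: boolean; record: boolean },
): Promise<string> {
  const key = fixtureKey(messages);
  const hit = fixtures.get(key);
  if (hit) return hit.response; // 2. cache hit: replay, no API call
  if (env.isCI && !env.record) {
    // 4. cache miss in CI without VCR_RECORD=1: fail instead of recording
    throw new Error(`Missing VCR fixture ${key} in CI`);
  }
  const response = await realApiCall(); // 3. cache miss locally: record
  fixtures.set(key, { response });
  return response;
}
```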

Dehydration/hydration

Path-dependent values are replaced with placeholders so fixtures work across machines:

  - cwd → [CWD]
  - Config home → [CONFIG_HOME]
  - UUIDs → [UUID]
  - Timestamps → [TIMESTAMP]

Windows path normalization is handled explicitly.
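A dehydrate/hydrate pass might look like the sketch below. The substitutions mirror the placeholder table above; everything else (function names, regexes) is an assumption, and the real implementation also normalizes Windows paths before matching:

```typescript
// Illustrative dehydration: replace machine-local values with stable placeholders.
const UUID_RE = /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/gi;
const TIMESTAMP_RE = /\d{4}-\d{2}-\d{2}T[\d:.]+Z/g;

function dehydrate(text: string, cwd: string, configHome: string): string {
  return text
    .split(cwd).join("[CWD]")
    .split(configHome).join("[CONFIG_HOME]")
    .replace(UUID_RE, "[UUID]")
    .replace(TIMESTAMP_RE, "[TIMESTAMP]");
}

// On replay, placeholders with concrete machine-local values are restored.
function hydrate(text: string, cwd: string, configHome: string): string {
  return text.split("[CWD]").join(cwd).split("[CONFIG_HOME]").join(configHome);
}
```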

Three entry points

Cached responses still get their USD cost added to session totals via addCachedCostToTotalSessionCost, so cost tracking stays accurate even in test mode.
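A minimal sketch of this bookkeeping. addCachedCostToTotalSessionCost is named in the source; the surrounding session state shown here is an assumption:

```typescript
// Hypothetical session-cost state; the real accumulator lives elsewhere.
let totalSessionCostUSD = 0;

function addCachedCostToTotalSessionCost(fixtureCostUSD: number): void {
  // Replayed fixtures carry the USD cost recorded at capture time, so
  // session totals match what a live run would have spent.
  totalSessionCostUSD += fixtureCostUSD;
}
```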

Why VCR over unit tests

The source contains only 2 test files in src/ (auth token validation, PTY session management, both using node:test). The VCR system is the primary test infrastructure because:

  - Claude Code's behavior depends heavily on model responses; mocking the model defeats the purpose
  - Fixture-recorded interactions capture the actual model behavior at a point in time
  - Integration tests via recorded API calls catch regressions that unit tests would miss

Security classification

Auto-mode classifier (yoloClassifier.ts)

A side-query to Claude that classifies each tool invocation into allow/soft_deny categories. Uses a dedicated system prompt (auto_mode_system_prompt.txt) with separate permission templates for internal vs external users. Results typed as YoloClassifierResult.

Users can customize rules via settings.autoMode config (allow, soft_deny, environment sections).
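The allow/soft_deny/environment section names come from the source; the field shapes and rule syntax below are illustrative assumptions, not the documented schema:

```typescript
// Hypothetical shape of the settings.autoMode block.
interface AutoModeSettings {
  allow?: string[];       // patterns the classifier may auto-approve
  soft_deny?: string[];   // patterns that require explicit user confirmation
  environment?: Record<string, string>; // extra context fed to the classifier
}

const autoMode: AutoModeSettings = {
  allow: ["Bash(git status)", "Read(*)"],
  soft_deny: ["Bash(rm *)"],
  environment: { PROJECT_KIND: "monorepo" },
};
```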

Bash security (bashSecurity.ts)

23+ named security check IDs scanning for command injection patterns:

  - Command substitution: $(, ${, $[, <(, >(
  - Shell metacharacters, obfuscated flags, dangerous variables
  - IFS injection, proc environ access, brace expansion
  - Control characters, unicode whitespace
  - Zsh-specific dangerous commands (zmodload, sysopen, ztcp, etc.)

Quote extraction strips quoted strings so the checks analyze only the unquoted portions of the command, where injection is dangerous.
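A minimal sketch of the idea. In bash, text inside single quotes cannot trigger expansion, so it is blanked out before scanning; the real bashSecurity.ts handles escapes, double quotes, and nesting far more carefully, and these function names are invented:

```typescript
// Remove single-quoted regions so injection checks only see unquoted text.
function stripSingleQuoted(command: string): string {
  return command.replace(/'[^']*'/g, "''");
}

// One example check: command substitution operators in the unquoted remainder.
function hasCommandSubstitution(command: string): boolean {
  const unquoted = stripSingleQuoted(command);
  return /\$\(|\$\{|<\(|>\(/.test(unquoted);
}
```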

Dangerous patterns (dangerousPatterns.ts)

Cross-platform code execution entry points (python, node, ruby, perl, ssh) and Bash-specific dangerous patterns (eval, exec, sudo) that are matched directly, bypassing the classifier entirely.
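An illustrative check for these entry points. The binary lists come from the paragraph above; the matching logic is an assumption, and the real dangerousPatterns.ts list is longer and more precise:

```typescript
// Interpreters and commands that skip classification entirely.
const CODE_EXECUTION_BINARIES = new Set(["python", "node", "ruby", "perl", "ssh"]);
const BASH_DANGEROUS = new Set(["eval", "exec", "sudo"]);

function bypassesClassifier(command: string): boolean {
  const first = command.trim().split(/\s+/)[0] ?? "";
  const base = first.split("/").pop() ?? first; // /usr/bin/python -> python
  return CODE_EXECUTION_BINARIES.has(base) || BASH_DANGEROUS.has(base);
}
```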

PowerShell security

The PowerShell tool has a parallel set of security checks with its own validation rules.

Verification agent

A built-in agent (src/tools/AgentTool/built-in/verificationAgent.ts) exists specifically for post-implementation verification.

Schema validation

Feature flags and A/B testing

GrowthBook SDK provides feature flags and experiment infrastructure:

  - User attributes for targeting: platform, org, subscription, rate limit tier, app version
  - Three access patterns: async (always fresh), cached (may be stale), cached with refresh callback
  - Experiment exposure dedup via loggedExposures Set
  - Env overrides for internal testing: CLAUDE_INTERNAL_FC_OVERRIDES
  - Cached features persist to global config file between sessions
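The exposure dedup and the three access patterns can be sketched as below. The loggedExposures name comes from the source; everything else is an illustrative assumption, not the real GrowthBook integration:

```typescript
// Dedup: each experiment exposure is logged at most once per session.
const loggedExposures = new Set<string>();

function logExposureOnce(experimentId: string, log: (id: string) => void): void {
  if (loggedExposures.has(experimentId)) return;
  loggedExposures.add(experimentId);
  log(experimentId);
}

let cachedFeatures: Record<string, boolean> = {};

// Pattern 1: async, always fresh.
async function getFeatureFresh(
  key: string,
  fetchFeatures: () => Promise<Record<string, boolean>>,
): Promise<boolean> {
  cachedFeatures = await fetchFeatures();
  return cachedFeatures[key] ?? false;
}

// Pattern 2: synchronous read of the cache (may be stale).
function getFeatureCached(key: string): boolean {
  return cachedFeatures[key] ?? false;
}

// Pattern 3: synchronous read plus a background refresh callback.
function getFeatureCachedWithRefresh(
  key: string,
  fetchFeatures: () => Promise<Record<string, boolean>>,
  onRefresh: () => void,
): boolean {
  void fetchFeatures().then((f) => { cachedFeatures = f; onRefresh(); });
  return cachedFeatures[key] ?? false;
}
```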

Mock rate limits

Internal-only system (ant employees) with 20+ predefined scenarios for simulating rate limit conditions without hitting real API limits: session-limit-reached, weekly-limit-reached, overage-active, out-of-credits, fast-mode-limit, etc. Enables testing rate limit UX without waiting for actual limits.
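A sketch of what a scenario table might look like. The scenario names come from the list above; the per-scenario shape (status codes, retry hints) is an assumption:

```typescript
// Hypothetical mock rate-limit scenarios, keyed by the names above.
interface RateLimitScenario {
  status: number;
  retryAfterSeconds?: number;
  message: string;
}

const MOCK_SCENARIOS: Record<string, RateLimitScenario> = {
  "session-limit-reached": { status: 429, retryAfterSeconds: 3600, message: "Session limit reached" },
  "weekly-limit-reached": { status: 429, retryAfterSeconds: 86400, message: "Weekly limit reached" },
  "out-of-credits": { status: 402, message: "Out of credits" },
};

// Returning undefined means "no mock active": fall through to the real API.
function mockRateLimitResponse(name: string): RateLimitScenario | undefined {
  return MOCK_SCENARIOS[name];
}
```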

Quality without conventional test suites

The quality strategy is layered:

  1. Prevention — Zod schemas, bash security checks, dangerous pattern blocking, permission pipeline
  2. Detection — auto-mode classifier, verification agent, tree-sitter AST analysis
  3. Regression — VCR fixtures capture known-good model interactions
  4. Simulation — mock rate limits test edge cases in rate limit handling
  5. Experimentation — GrowthBook enables gradual rollout with rollback