Auto Mode Classifier

The two-layer defense system for auto mode in claude-code, also known as ADVISOR. Ensures that every tool invocation in auto mode is consistent with the user's stated intent.

Architecture

Input Layer

A server-side prompt-injection probe scans tool outputs (file reads, web fetches, shell output) before they enter the agent's context. When content looks like a hijack attempt, it prepends a warning instructing the agent to treat the content as suspect.

Output Layer

A transcript classifier running on Sonnet 4.6 evaluates each action before execution. It sees only user messages and the agent's tool calls — assistant text and tool results are deliberately stripped. The classifier cannot be "talked into" a bad decision by Claude's own persuasive reasoning.

Two-Stage Evaluation

Stage Speed Purpose FPR FNR
1 — Fast filter Single token yes/no Err toward blocking 8.5%
2 — Chain-of-thought Full reasoning Reduce false positives 0.4% 17% on real overeager

Stage 2 runs only when Stage 1 flags. Because both stages use the same prompt with only the final instruction changed, Stage 2 is almost entirely a prompt-cache hit from Stage 1.

The Honest Number

Anthropic published the false-negative rate on real overeager actions: 17%. The classifier correctly identifies the danger in most missed cases — it sees the force-push, the deletion, the infrastructure change — but incorrectly judges whether the user's prior messages authorize the specific blast radius. "Clean up the PR" is present in the session, but the classifier incorrectly treats it as consent for a force-push.

Escalation Logic

When the classifier blocks an action, Claude receives the denial as a tool result plus an instruction to find a safer path: - 3 consecutive denials → escalate to human (or terminate in headless mode) - 20 total denials → same escalation

Cost

One extra Sonnet 4.6 inference per tool call. A session with 200 tool calls adds 200 classifier inferences. Costs tracked separately via getAdvisorUsage() and accumulated recursively alongside main session costs.

Relationship to Permission Pipeline

The classifier is Layer 4 of the permission-pipeline — reached only when the three cheaper checks (rules, acceptEdits simulation, read-only whitelist) are inconclusive. In yoloClassifier.ts (1,495 lines): - Read-only tools skip the classifier entirely - 3 consecutive denials trigger fallback to prompting mode - 20 total denials trigger the same fallback - Fail-open: API errors fall back to "ask the user" rather than auto-deny

Key Claims

Sources