Auto-Mode Threat Model (4 Categories)

Description

From the auto-mode classifier design (Hughes 2026). The design explicitly targets four risk categories: (1) overeager behavior, (2) honest mistakes, (3) prompt injection, and (4) model misalignment. This threat model drives the two-stage evaluation (fast filter plus chain-of-thought) in yoloClassifier.ts.
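
Illustrative sketch only, not the contents of yoloClassifier.ts. It shows one way the four categories and a two-stage check could fit together. Every name here (RiskCategory, ProposedAction, fastFilter, Reasoner, classify, demoReasoner, the tool names) is hypothetical. It also assumes that the fast filter runs first and clears obviously low-risk actions, and that everything else goes to a slower reasoning step; the source does not confirm that ordering or direction.

```typescript
// Hypothetical sketch: no identifier below is taken from yoloClassifier.ts.

// The four risk categories named in the threat model.
type RiskCategory =
  | "overeager_behavior"
  | "honest_mistake"
  | "prompt_injection"
  | "model_misalignment";

interface ProposedAction {
  tool: string;  // hypothetical tool name, e.g. "bash"
  input: string; // raw tool input
}

type Verdict =
  | { decision: "allow"; stage: "fast_filter" | "reasoning" }
  | { decision: "block"; stage: "reasoning"; category: RiskCategory; rationale: string };

// Stage 1 (assumed behavior): a cheap check that clears read-only tools.
const READ_ONLY_TOOLS = new Set<string>(["read_file", "list_dir", "grep"]);

function fastFilter(action: ProposedAction): boolean {
  return READ_ONLY_TOOLS.has(action.tool);
}

// Stage 2 (assumed behavior): stands in for chain-of-thought evaluation.
// A real implementation would presumably call a model; this is a plain callback.
type Reasoner = (
  action: ProposedAction
) => { category: RiskCategory; rationale: string } | null;

function classify(action: ProposedAction, reason: Reasoner): Verdict {
  if (fastFilter(action)) {
    return { decision: "allow", stage: "fast_filter" };
  }
  const finding = reason(action);
  if (finding === null) {
    return { decision: "allow", stage: "reasoning" };
  }
  return { decision: "block", stage: "reasoning", ...finding };
}

// Toy reasoner for demonstration: keyword rules instead of model reasoning.
const demoReasoner: Reasoner = (action) => {
  if (action.input.includes("rm -rf")) {
    return {
      category: "overeager_behavior",
      rationale: "destructive command beyond the stated task",
    };
  }
  if (action.input.toLowerCase().includes("ignore previous instructions")) {
    return {
      category: "prompt_injection",
      rationale: "embedded instruction in tool input",
    };
  }
  return null;
};
```

Under these assumptions, classify({ tool: "read_file", input: "README.md" }, demoReasoner) is cleared by the fast filter without invoking the reasoner, while a "bash" action containing "rm -rf" reaches stage 2 and is blocked with the category "overeager_behavior".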

Key claims

Relations

Sources

src-20260423-0cff68d3291b