Harness Underperformance Gap (77% vs 93%)

Description

Leaked benchmark data shows Opus 4.6 scoring 77% on agent benchmarks when run through Claude Code's native harness, versus 93% for the same model through Cursor's harness. The 16-percentage-point gap is attributed to bugs in tool reference expansion and stop sequence sampling: Capybara samples a stop token with ~10% probability when it encounters ... tags at the prompt tail, which terminates responses early. Cursor's harness does not trigger this pattern.
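
As a rough, back-of-envelope illustration of how a ~10% per-occurrence early stop could compound over a multi-step agent task, the sketch below computes the chance that a run survives a given number of trigger points. It rests on assumptions that are not in the source: each trigger is treated as independent, every early stop is treated as a failed task, and the function names and the number of triggers per task are hypothetical. It models only the stop-token effect, not tool reference expansion.

```python
import math


def survival_probability(p_stop: float, n_triggers: int) -> float:
    """Chance that no early stop occurs across n_triggers independent trigger points.

    ASSUMPTION: triggers are independent and each one stops the run with
    probability p_stop. The source only gives the ~10% per-occurrence figure.
    """
    if not 0.0 <= p_stop < 1.0:
        raise ValueError("p_stop must be in [0, 1)")
    if n_triggers < 0:
        raise ValueError("n_triggers must be non-negative")
    return (1.0 - p_stop) ** n_triggers


def implied_triggers(p_stop: float, score_ratio: float) -> float:
    """Average number of trigger points that would explain a given score ratio.

    ASSUMPTION: the whole gap comes from early stops and every early stop
    fails the task, so score_ratio = (1 - p_stop) ** n.
    """
    if not 0.0 < p_stop < 1.0:
        raise ValueError("p_stop must be in (0, 1)")
    if not 0.0 < score_ratio <= 1.0:
        raise ValueError("score_ratio must be in (0, 1]")
    return math.log(score_ratio) / math.log(1.0 - p_stop)


if __name__ == "__main__":
    p = 0.10  # ~10% stop-token probability reported in the note
    for n in range(0, 6):
        print(f"{n} trigger(s): {survival_probability(p, n):.3f} survival")
    # Ratio of the two reported scores (77% vs 93%)
    print(f"implied triggers per task: {implied_triggers(p, 0.77 / 0.93):.2f}")
```

Under these simplifying assumptions, fewer than two trigger points per task on average would be enough to produce a gap of the reported size. This is a consistency check on the claim, not evidence for it.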

Key claims

Relations

Sources

src-20260409-28c9af66ed0c