Voice System

Description

Voice input/output subsystem enabling hands-free interaction with Claude Code. The system implements a full speech-to-text pipeline: local audio capture feeds raw PCM into Anthropic's voice_stream WebSocket endpoint, which runs Deepgram-backed STT (with an optional Nova 3 path via GrowthBook feature gate tengu_cobalt_frost). Transcripts stream back as interim and final segments, rendered live in the prompt input via React state management.

The subsystem spans six layers:

  1. Audio capture (src/services/voice.ts) -- Records 16 kHz, 16-bit signed mono PCM. Uses a native NAPI module (audio-capture-napi, backed by cpal/CoreAudio/ALSA/WASAPI) as the primary backend, with fallbacks to arecord (ALSA utils) and SoX rec on Linux. The native module is lazy-loaded on first voice keypress to avoid blocking the event loop with a synchronous dlopen (1-8 seconds on macOS depending on coreaudiod state). On Windows, the native module is required with no fallback.

  2. STT streaming (src/services/voiceStreamSTT.ts) -- Connects to Anthropic's voice_stream WebSocket endpoint (/api/ws/speech_to_text/voice_stream) using OAuth Bearer authentication. The wire protocol sends binary audio frames and JSON control messages (KeepAlive, CloseStream). The server responds with TranscriptText (interim/progressive), TranscriptEndpoint (utterance boundary), and TranscriptError messages. Connection targets api.anthropic.com rather than claude.ai to avoid Cloudflare TLS fingerprinting challenges against non-browser clients. Includes a finalization protocol with three resolution paths: post-CloseStream endpoint (~300ms), no-data timeout (1.5s), and safety timeout (5s).

  3. Domain vocabulary (src/services/voiceKeyterms.ts) -- Builds a per-session list of up to 50 keyterms sent as query parameters to the STT endpoint for Deepgram keyword boosting. Combines hardcoded coding terms (MCP, grep, regex, TypeScript, OAuth, gRPC, etc.) with dynamic context: the project root basename, git branch name segments (split on camelCase/kebab-case/snake_case), and words from recently accessed file names. The terms "Claude" and "Anthropic" are boosted server-side.

  4. Core React hook (src/hooks/useVoice.ts) -- Manages the recording lifecycle through three states: idle, recording, processing. Implements hold-to-talk with release detection via auto-repeat key gap timing (200ms threshold). Audio is buffered in memory while the WebSocket connects, then flushed on onReady, eliminating 1-2s of latency. Computes RMS audio levels for a 16-bar waveform visualizer. Supports multi-language STT with 20 languages (BCP-47 codes mapped from language names in English and native scripts). Includes silent-drop detection and automatic replay: when the server accepts audio but returns zero transcripts (a ~1% session-sticky bug), the full audio buffer is replayed on a fresh WebSocket connection after a 250ms backoff. Also supports a focus mode where recording starts/stops automatically with terminal focus, enabling a "multi-clauding army" workflow with a 5-second silence timeout.

  5. Input integration (src/hooks/useVoiceIntegration.tsx) -- Bridges voice transcripts into the prompt input field. Tracks cursor position (prefix/suffix anchors) so interim transcripts insert at the cursor without clobbering surrounding text. Handles two keybinding modes: modifier combos (e.g., meta+k) activate on first press, while bare characters (e.g., space) require a hold threshold of 5 rapid presses to distinguish from normal typing. Provides an interimRange for the UI to dim not-yet-finalized text. Manages warmup-character flow-through and strips leaked hold-key characters (including the full-width space produced by CJK IMEs).

  6. Voice command and gating (src/commands/voice/, src/voice/voiceModeEnabled.ts) -- The /voice slash command toggles voice mode on/off. Before enabling, it runs pre-flight checks: GrowthBook kill-switch (tengu_amber_quartz_disabled), OAuth authentication (requires Claude.ai account, not API keys/Bedrock/Vertex), microphone permission probe (triggers OS TCC dialog on macOS), recording backend availability, and SoX dependency detection with auto-install hints for brew/apt/dnf/pacman. Visibility is gated by feature('VOICE_MODE') at compile time (dead code elimination in non-ant builds) and the GrowthBook kill-switch at runtime.
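The lazy-load strategy in layer (1) can be sketched with a generic deferral helper: the expensive loader (here, the synchronous dlopen of audio-capture-napi) runs only on the first call, so startup and the event loop stay responsive until the first voice keypress. The `lazy` utility below is illustrative, not the actual implementation in src/services/voice.ts.

```typescript
// Defer an expensive synchronous load until first use, then cache the result.
function lazy<T>(load: () => T): () => T {
  let cached: T | undefined;
  return () => {
    if (cached === undefined) cached = load(); // blocking work deferred to first call
    return cached;
  };
}

// Usage sketch (module name from the text; the require is the blocking dlopen):
// const getCapture = lazy(() => require("audio-capture-napi"));
```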
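The three-way finalization race in layer (2) can be modeled as a pure decision function evaluated on each tick after CloseStream is sent: an arrived TranscriptEndpoint wins immediately, otherwise the 1.5 s no-data timeout or the 5 s safety backstop fires. Names and the event shape below are illustrative, not the actual voiceStreamSTT.ts API.

```typescript
type FinalizeReason = "endpoint" | "no-data-timeout" | "safety-timeout";

interface FinalizeEvents {
  endpointAtMs?: number;      // server sent TranscriptEndpoint after CloseStream
  lastTranscriptAtMs: number; // time of the last TranscriptText frame
}

function resolveFinalization(
  ev: FinalizeEvents,
  nowMs: number,
  closeSentAtMs: number,
): FinalizeReason | null {
  if (ev.endpointAtMs !== undefined) return "endpoint";        // fast path, ~300 ms
  if (nowMs - ev.lastTranscriptAtMs >= 1500) return "no-data-timeout";
  if (nowMs - closeSentAtMs >= 5000) return "safety-timeout";  // hard backstop
  return null; // keep waiting
}
```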
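The branch-name splitting in layer (3) amounts to breaking identifiers on camelCase, kebab-case, and snake_case boundaries before adding the fragments to the keyterm list. A minimal sketch, assuming a helper named `splitIdentifier` (not the actual voiceKeyterms.ts export):

```typescript
// Split an identifier or branch segment into boostable keyterm words.
function splitIdentifier(name: string): string[] {
  return name
    // break camelCase boundaries: "fixAuth" -> "fix Auth"
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // treat kebab-case, snake_case, and path separators as spaces
    .replace(/[-_/]+/g, " ")
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 2); // drop short fragments unlikely to help boosting
}
```

The assembled list would then be capped at 50 entries before being serialized into the endpoint's query parameters.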
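The RMS level computation feeding the 16-bar waveform in layer (4) is straightforward over 16-bit signed PCM; a minimal sketch (function name is illustrative), normalizing against the int16 full-scale value so levels land in [0, 1]:

```typescript
// Root-mean-square amplitude of a window of 16-bit signed PCM samples.
function rmsLevel(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  // divide by int16 full scale (32768) to normalize into [0, 1]
  return Math.sqrt(sumSquares / samples.length) / 32768;
}
```

For the visualizer, the capture buffer would be chunked into windows and one level computed per bar.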
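The prefix/suffix anchoring in layer (5) can be sketched as two small pure functions: capture the text around the cursor once, then splice each interim revision between the anchors so surrounding text is never clobbered as the transcript rewrites itself. The names below are illustrative, not the actual useVoiceIntegration.tsx API.

```typescript
interface Anchors { prefix: string; suffix: string }

// Capture the text on either side of the cursor when recording starts.
function captureAnchors(text: string, cursor: number): Anchors {
  return { prefix: text.slice(0, cursor), suffix: text.slice(cursor) };
}

// Splice the latest interim transcript between the anchors and report the
// span the UI should dim until the segment is finalized.
function applyInterim(
  anchors: Anchors,
  interim: string,
): { text: string; interimRange: [number, number] } {
  const start = anchors.prefix.length;
  return {
    text: anchors.prefix + interim + anchors.suffix,
    interimRange: [start, start + interim.length],
  };
}
```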
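The pre-flight sequence in layer (6) is a fail-fast pipeline: the first blocking check aborts enabling voice mode and determines the error shown. A minimal sketch of that ordering (the `preflight` helper and check names are illustrative, not the actual src/commands/voice/ code):

```typescript
// Run checks in order; return the name of the first failing check, or null.
function preflight(checks: Array<[name: string, check: () => boolean]>): string | null {
  for (const [name, check] of checks) {
    if (!check()) return name; // first blocking check aborts enabling voice mode
  }
  return null; // all checks passed
}

// Illustrative ordering from the text: kill-switch, OAuth, mic permission,
// recording backend, SoX dependency.
```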

Voice state is managed through a dedicated React context (src/context/voice.tsx) using a synchronous external store pattern. The store holds voiceState (idle/recording/processing), voiceError, voiceInterimTranscript, voiceAudioLevels (number array for waveform), and voiceWarmingUp. Slice-based subscriptions via useVoiceState(selector) ensure components only re-render when their selected slice changes.
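The synchronous external store described above can be sketched as a plain getSnapshot/subscribe/setState object compatible with React's useSyncExternalStore contract (the store shape follows the fields listed; `createVoiceStore` is an assumed name, not the actual src/context/voice.tsx export):

```typescript
type VoiceState = "idle" | "recording" | "processing";

interface VoiceStore {
  voiceState: VoiceState;
  voiceError: string | null;
  voiceInterimTranscript: string;
  voiceAudioLevels: number[]; // per-bar levels for the waveform
  voiceWarmingUp: boolean;
}

function createVoiceStore(initial: VoiceStore) {
  let state = initial;
  const listeners = new Set<() => void>();
  return {
    getSnapshot: () => state,
    subscribe: (fn: () => void) => {
      listeners.add(fn);
      return () => { listeners.delete(fn); };
    },
    // shallow-merge update that notifies listeners synchronously,
    // matching the "synchronous external store" pattern described above
    setState: (patch: Partial<VoiceStore>) => {
      state = { ...state, ...patch };
      listeners.forEach((fn) => fn());
    },
  };
}
```

A hook like useVoiceState(selector) would then wrap useSyncExternalStore(store.subscribe, () => selector(store.getSnapshot())), which is what limits re-renders to components whose selected slice changed.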

Key claims

Relations

Sources