Voice System

Description

Voice input/output subsystem enabling hands-free interaction with Claude Code. The system implements a full speech-to-text pipeline: local audio capture feeds raw PCM into Anthropic's voice_stream WebSocket endpoint, which runs Deepgram-backed STT (with an optional Nova 3 path via GrowthBook feature gate tengu_cobalt_frost). Transcripts stream back as interim and final segments, rendered live in the prompt input via React state management.

The subsystem spans six layers:

  1. Audio capture (src/services/voice.ts) -- Records 16 kHz, 16-bit signed mono PCM. Uses a native NAPI module (audio-capture-napi, backed by cpal/CoreAudio/ALSA/WASAPI) as the primary backend, with fallbacks to arecord (ALSA utils) and SoX rec on Linux. The native module is lazy-loaded on first voice keypress to avoid blocking the event loop with a synchronous dlopen (1-8 seconds on macOS depending on coreaudiod state). On Windows, the native module is required with no fallback.

  2. STT streaming (src/services/voiceStreamSTT.ts) -- Connects to Anthropic's voice_stream WebSocket endpoint (/api/ws/speech_to_text/voice_stream) using OAuth Bearer authentication. The wire protocol sends binary audio frames and JSON control messages (KeepAlive, CloseStream). The server responds with TranscriptText (interim/progressive), TranscriptEndpoint (utterance boundary), and TranscriptError messages. Connection targets api.anthropic.com rather than claude.ai to avoid Cloudflare TLS fingerprinting challenges against non-browser clients. Includes a finalization protocol with three resolution paths: post-CloseStream endpoint (~300ms), no-data timeout (1.5s), and safety timeout (5s).

  3. Domain vocabulary (src/services/voiceKeyterms.ts) -- Builds a per-session list of up to 50 keyterms sent as query parameters to the STT endpoint for Deepgram keyword boosting. Combines hardcoded coding terms (MCP, grep, regex, TypeScript, OAuth, gRPC, etc.) with dynamic context: the project root basename, git branch name segments (split on camelCase/kebab-case/snake_case), and words from recently accessed file names. The terms "Claude" and "Anthropic" are boosted server-side.

  4. Core React hook (src/hooks/useVoice.ts) -- Manages the recording lifecycle through three states: idle, recording, processing. Implements hold-to-talk with release detection via auto-repeat key gap timing (200ms threshold). Audio is buffered in memory while the WebSocket connects, then flushed on onReady, eliminating 1-2s of latency. Computes RMS audio levels for a 16-bar waveform visualizer. Supports multi-language STT with 20 languages (BCP-47 codes mapped from language names in English and native scripts). Includes silent-drop detection and automatic replay: when the server accepts audio but returns zero transcripts (a ~1% session-sticky bug), the full audio buffer is replayed on a fresh WebSocket connection after a 250ms backoff. Also supports a focus mode where recording starts/stops automatically with terminal focus, enabling a "multi-clauding army" workflow with a 5-second silence timeout.

  5. Input integration (src/hooks/useVoiceIntegration.tsx) -- Bridges voice transcripts into the prompt input field. Tracks cursor position (prefix/suffix anchors) so interim transcripts insert at the cursor without clobbering surrounding text. Handles two keybinding modes: modifier combos (e.g., meta+k) activate on first press, while bare characters (e.g., space) require a hold threshold of 5 rapid presses to distinguish from normal typing. Provides an interimRange for the UI to dim not-yet-finalized text. Manages warmup-character flow-through and strips leaked hold-key characters (including the full-width space produced by CJK IMEs).

  6. Voice command and gating (src/commands/voice/, src/voice/voiceModeEnabled.ts) -- The /voice slash command toggles voice mode on/off. Before enabling, it runs pre-flight checks: GrowthBook kill-switch (tengu_amber_quartz_disabled), OAuth authentication (requires Claude.ai account, not API keys/Bedrock/Vertex), microphone permission probe (triggers OS TCC dialog on macOS), recording backend availability, and SoX dependency detection with auto-install hints for brew/apt/dnf/pacman. Visibility is gated by feature('VOICE_MODE') at compile time (dead code elimination in non-ant builds) and the GrowthBook kill-switch at runtime.
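The lazy-load strategy in layer (1) can be sketched with a generic deferral helper: the expensive loader (here, the synchronous dlopen of audio-capture-napi) runs only on the first call, so startup and the event loop stay responsive until the first voice keypress. The `lazy` utility below is illustrative, not the actual implementation in src/services/voice.ts.

```typescript
// Defer an expensive synchronous load until first use, then cache the result.
function lazy<T>(load: () => T): () => T {
  let cached: T | undefined;
  return () => {
    if (cached === undefined) cached = load(); // blocking work deferred to first call
    return cached;
  };
}

// Usage sketch (module name from the text; the require is the blocking dlopen):
// const getCapture = lazy(() => require("audio-capture-napi"));
```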
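The three-way finalization race in layer (2) can be modeled as a pure decision function evaluated on each tick after CloseStream is sent: an arrived TranscriptEndpoint wins immediately, otherwise the 1.5 s no-data timeout or the 5 s safety backstop fires. Names and the event shape below are illustrative, not the actual voiceStreamSTT.ts API.

```typescript
type FinalizeReason = "endpoint" | "no-data-timeout" | "safety-timeout";

interface FinalizeEvents {
  endpointAtMs?: number;      // server sent TranscriptEndpoint after CloseStream
  lastTranscriptAtMs: number; // time of the last TranscriptText frame
}

function resolveFinalization(
  ev: FinalizeEvents,
  nowMs: number,
  closeSentAtMs: number,
): FinalizeReason | null {
  if (ev.endpointAtMs !== undefined) return "endpoint";        // fast path, ~300 ms
  if (nowMs - ev.lastTranscriptAtMs >= 1500) return "no-data-timeout";
  if (nowMs - closeSentAtMs >= 5000) return "safety-timeout";  // hard backstop
  return null; // keep waiting
}
```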
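The branch-name splitting in layer (3) amounts to breaking identifiers on camelCase, kebab-case, and snake_case boundaries before adding the fragments to the keyterm list. A minimal sketch, assuming a helper named `splitIdentifier` (not the actual voiceKeyterms.ts export):

```typescript
// Split an identifier or branch segment into boostable keyterm words.
function splitIdentifier(name: string): string[] {
  return name
    // break camelCase boundaries: "fixAuth" -> "fix Auth"
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // treat kebab-case, snake_case, and path separators as spaces
    .replace(/[-_/]+/g, " ")
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 2); // drop short fragments unlikely to help boosting
}
```

The assembled list would then be capped at 50 entries before being serialized into the endpoint's query parameters.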
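The RMS level computation feeding the 16-bar waveform in layer (4) is straightforward over 16-bit signed PCM; a minimal sketch (function name is illustrative), normalizing against the int16 full-scale value so levels land in [0, 1]:

```typescript
// Root-mean-square amplitude of a window of 16-bit signed PCM samples.
function rmsLevel(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) sumSquares += samples[i] * samples[i];
  // divide by int16 full scale (32768) to normalize into [0, 1]
  return Math.sqrt(sumSquares / samples.length) / 32768;
}
```

For the visualizer, the capture buffer would be chunked into windows and one level computed per bar.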
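The prefix/suffix anchoring in layer (5) can be sketched as two small pure functions: capture the text around the cursor once, then splice each interim revision between the anchors so surrounding text is never clobbered as the transcript rewrites itself. The names below are illustrative, not the actual useVoiceIntegration.tsx API.

```typescript
interface Anchors { prefix: string; suffix: string }

// Capture the text on either side of the cursor when recording starts.
function captureAnchors(text: string, cursor: number): Anchors {
  return { prefix: text.slice(0, cursor), suffix: text.slice(cursor) };
}

// Splice the latest interim transcript between the anchors and report the
// span the UI should dim until the segment is finalized.
function applyInterim(
  anchors: Anchors,
  interim: string,
): { text: string; interimRange: [number, number] } {
  const start = anchors.prefix.length;
  return {
    text: anchors.prefix + interim + anchors.suffix,
    interimRange: [start, start + interim.length],
  };
}
```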
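The pre-flight sequence in layer (6) is a fail-fast pipeline: the first blocking check aborts enabling voice mode and determines the error shown. A minimal sketch of that ordering (the `preflight` helper and check names are illustrative, not the actual src/commands/voice/ code):

```typescript
// Run checks in order; return the name of the first failing check, or null.
function preflight(checks: Array<[name: string, check: () => boolean]>): string | null {
  for (const [name, check] of checks) {
    if (!check()) return name; // first blocking check aborts enabling voice mode
  }
  return null; // all checks passed
}

// Illustrative ordering from the text: kill-switch, OAuth, mic permission,
// recording backend, SoX dependency.
```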

Voice state is managed through a dedicated React context (src/context/voice.tsx) using a synchronous external store pattern. The store holds voiceState (idle/recording/processing), voiceError, voiceInterimTranscript, voiceAudioLevels (number array for waveform), and voiceWarmingUp. Slice-based subscriptions via useVoiceState(selector) ensure components only re-render when their selected slice changes.
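The synchronous external store described above can be sketched as a plain getSnapshot/subscribe/setState object compatible with React's useSyncExternalStore contract (the store shape follows the fields listed; `createVoiceStore` is an assumed name, not the actual src/context/voice.tsx export):

```typescript
type VoiceState = "idle" | "recording" | "processing";

interface VoiceStore {
  voiceState: VoiceState;
  voiceError: string | null;
  voiceInterimTranscript: string;
  voiceAudioLevels: number[]; // per-bar levels for the waveform
  voiceWarmingUp: boolean;
}

function createVoiceStore(initial: VoiceStore) {
  let state = initial;
  const listeners = new Set<() => void>();
  return {
    getSnapshot: () => state,
    subscribe: (fn: () => void) => {
      listeners.add(fn);
      return () => { listeners.delete(fn); };
    },
    // shallow-merge update that notifies listeners synchronously,
    // matching the "synchronous external store" pattern described above
    setState: (patch: Partial<VoiceStore>) => {
      state = { ...state, ...patch };
      listeners.forEach((fn) => fn());
    },
  };
}
```

A hook like useVoiceState(selector) would then wrap useSyncExternalStore(store.subscribe, () => selector(store.getSnapshot())), which is what limits re-renders to components whose selected slice changed.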

Key claims

Relations

Sources