Voice System
- Entity ID: ent-20260410-22c6dc232c2c
- Type: service
- Scope: shared
- Status: active
Description
Voice input/output subsystem enabling hands-free interaction with Claude Code. The system implements a full speech-to-text pipeline: local audio capture feeds raw PCM into Anthropic's voice_stream WebSocket endpoint, which runs Deepgram-backed STT (with an optional Nova 3 path via GrowthBook feature gate tengu_cobalt_frost). Transcripts stream back as interim and final segments, rendered live in the prompt input via React state management.
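As a rough illustration of how interim and final segments could be folded into the text shown in the prompt, here is a minimal sketch. The three message type names match the wire protocol described for the STT streaming layer; the field names (data, description), the state shape, and the function names are assumptions made for this example, and it assumes each interim message carries the full text of the current utterance so far.

```typescript
// Hypothetical message shapes: only the three type names come from this document.
type ServerMessage =
  | { type: "TranscriptText"; data: string }
  | { type: "TranscriptEndpoint" }
  | { type: "TranscriptError"; description: string };

interface TranscriptState {
  finalized: string[]; // completed utterances
  interim: string; // latest progressive text for the current utterance
  error: string | null;
}

function initialTranscriptState(): TranscriptState {
  return { finalized: [], interim: "", error: null };
}

function reduceTranscript(state: TranscriptState, msg: ServerMessage): TranscriptState {
  switch (msg.type) {
    case "TranscriptText":
      // Assumption: each interim message replaces the previous interim text.
      return { ...state, interim: msg.data };
    case "TranscriptEndpoint":
      // Utterance boundary: promote the interim text to a final segment.
      if (state.interim === "") return state;
      return { ...state, finalized: [...state.finalized, state.interim], interim: "" };
    case "TranscriptError":
      return { ...state, error: msg.description };
  }
}

// Text to display: final segments followed by the still-changing interim text.
function renderTranscript(state: TranscriptState): string {
  return [...state.finalized, state.interim].filter((s) => s.length > 0).join(" ");
}
```

Keeping the interim text separate from the finalized segments is what lets a UI dim the part that may still change.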
The subsystem spans six layers:
- Audio capture (src/services/voice.ts) -- Records 16 kHz, 16-bit signed mono PCM. Uses a native NAPI module (audio-capture-napi, backed by cpal/CoreAudio/ALSA/WASAPI) as the primary backend, with fallbacks to arecord (ALSA utils) and SoX rec on Linux. The native module is lazy-loaded on first voice keypress to avoid blocking the event loop with a synchronous dlopen (1-8 seconds on macOS depending on coreaudiod state). On Windows, the native module is required with no fallback.
- STT streaming (src/services/voiceStreamSTT.ts) -- Connects to Anthropic's voice_stream WebSocket endpoint (/api/ws/speech_to_text/voice_stream) using OAuth Bearer authentication. The wire protocol sends binary audio frames and JSON control messages (KeepAlive, CloseStream). The server responds with TranscriptText (interim/progressive), TranscriptEndpoint (utterance boundary), and TranscriptError messages. Connection targets api.anthropic.com rather than claude.ai to avoid Cloudflare TLS fingerprinting challenges against non-browser clients. Includes a finalization protocol with three resolution paths: post-CloseStream endpoint (~300ms), no-data timeout (1.5s), and safety timeout (5s).
- Domain vocabulary (src/services/voiceKeyterms.ts) -- Builds a per-session list of up to 50 keyterms sent as query parameters to the STT endpoint for Deepgram keyword boosting. Combines hardcoded coding terms (MCP, grep, regex, TypeScript, OAuth, gRPC, etc.) with dynamic context: the project root basename, git branch name segments (split on camelCase/kebab-case/snake_case), and words from recently accessed file names. Terms "Claude" and "Anthropic" are boosted server-side.
- Core React hook (src/hooks/useVoice.ts) -- Manages the recording lifecycle through three states: idle, recording, processing. Implements hold-to-talk with release detection via auto-repeat key gap timing (200ms threshold). Audio is buffered in memory while the WebSocket connects, then flushed on onReady, eliminating 1-2s of latency. Computes RMS audio levels for a 16-bar waveform visualizer. Supports multi-language STT with 20 languages (BCP-47 codes mapped from language names in English and native scripts). Includes silent-drop detection and automatic replay: when the server accepts audio but returns zero transcripts (a ~1% session-sticky bug), the full audio buffer is replayed on a fresh WebSocket connection after a 250ms backoff. Also supports a focus mode where recording starts/stops automatically with terminal focus, enabling a "multi-clauding army" workflow with a 5-second silence timeout.
- Input integration (src/hooks/useVoiceIntegration.tsx) -- Bridges voice transcripts into the prompt input field. Tracks cursor position (prefix/suffix anchors) so interim transcripts insert at the cursor without clobbering surrounding text. Handles two keybinding modes: modifier combos (e.g., meta+k) activate on first press, while bare characters (e.g., space) require a hold threshold of 5 rapid presses to distinguish from normal typing. Provides an interimRange for the UI to dim not-yet-finalized text. Manages flow-through of warmup characters and stripping of leaked hold-key characters (including full-width space from CJK IMEs).
- Voice command and gating (src/commands/voice/, src/voice/voiceModeEnabled.ts) -- The /voice slash command toggles voice mode on/off. Before enabling, it runs pre-flight checks: GrowthBook kill-switch (tengu_amber_quartz_disabled), OAuth authentication (requires Claude.ai account, not API keys/Bedrock/Vertex), microphone permission probe (triggers OS TCC dialog on macOS), recording backend availability, and SoX dependency detection with auto-install hints for brew/apt/dnf/pacman. Visibility is gated by feature('VOICE_MODE') at compile time (dead code elimination in non-ant builds) and the GrowthBook kill-switch at runtime.
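The domain vocabulary layer is the most algorithmic of the six, so a small sketch may help. The following is an illustrative approximation only: the function names, the short GLOBAL_TERMS subset, and the rule that drops fragments shorter than three characters are assumptions, not taken from src/services/voiceKeyterms.ts. Only the cap of 50 terms and the three dynamic sources (project root basename, branch name segments, recent file names) come from the description above.

```typescript
const MAX_KEYTERMS = 50; // cap stated in the layer description

// Small illustrative subset; the real hardcoded list is not reproduced here.
const GLOBAL_TERMS = ["MCP", "grep", "regex", "TypeScript", "OAuth", "gRPC"];

// Split an identifier on camelCase boundaries, kebab-case, snake_case,
// dots, and path separators.
function splitIdentifier(name: string): string[] {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // mark camelCase boundaries
    .split(/[\s\-_./]+/)
    .filter((w) => w.length > 2); // assumption: drop very short fragments
}

function buildKeyterms(
  projectRoot: string,
  branch: string | null,
  recentFiles: string[],
): string[] {
  const terms = new Set<string>(GLOBAL_TERMS);

  // Project root basename, kept whole.
  const basename = projectRoot.split(/[\\/]/).filter(Boolean).pop();
  if (basename) terms.add(basename);

  // Branch name segments.
  if (branch) {
    for (const w of splitIdentifier(branch)) terms.add(w);
  }

  // Words from recent file names, with directory and extension stripped.
  for (const file of recentFiles) {
    const stem = (file.split(/[\\/]/).pop() ?? "").replace(/\.[^.]+$/, "");
    for (const w of splitIdentifier(stem)) terms.add(w);
  }

  // A Set preserves insertion order, so hardcoded terms survive the cap first.
  return Array.from(terms).slice(0, MAX_KEYTERMS);
}
```

For example, buildKeyterms("/home/me/my-app", "feature/voiceReplay-fix_timeout", ["src/hooks/useVoice.ts"]) would include "my-app", "Replay", "timeout", and "Voice" alongside the hardcoded terms.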
Voice state is managed through a dedicated React context (src/context/voice.tsx) using a synchronous external store pattern. The store holds voiceState (idle/recording/processing), voiceError, voiceInterimTranscript, voiceAudioLevels (number array for waveform), and voiceWarmingUp. Slice-based subscriptions via useVoiceState(selector) ensure components only re-render when their selected slice changes.
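To make the slice-subscription idea concrete, here is a framework-free sketch of an external store. The field names follow the list above, but their exact types, createVoiceStore, and subscribeToSlice are assumptions for illustration; the real context in src/context/voice.tsx presumably wires a store like this into React rather than exposing it directly.

```typescript
// Field names come from the description; the exact types are assumed.
type VoiceState = {
  voiceState: "idle" | "recording" | "processing";
  voiceError: string | null;
  voiceInterimTranscript: string;
  voiceAudioLevels: number[];
  voiceWarmingUp: boolean;
};

type Listener = () => void;

function createVoiceStore(initial: VoiceState) {
  let state = initial;
  const listeners = new Set<Listener>();
  return {
    getState: () => state,
    // Synchronous update: listeners run before setState returns.
    setState(patch: Partial<VoiceState>) {
      state = { ...state, ...patch };
      listeners.forEach((l) => l());
    },
    subscribe(listener: Listener) {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
  };
}

// Hypothetical stand-in for useVoiceState(selector): the callback fires only
// when the selected slice changes identity, not on every store update.
function subscribeToSlice<T>(
  store: ReturnType<typeof createVoiceStore>,
  selector: (s: VoiceState) => T,
  onChange: (slice: T) => void,
): () => void {
  let current = selector(store.getState());
  return store.subscribe(() => {
    const next = selector(store.getState());
    if (!Object.is(next, current)) {
      current = next;
      onChange(next);
    }
  });
}
```

With this shape, a component that selects only voiceState is not notified by the frequent voiceAudioLevels updates that drive the waveform.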
Key claims
- Audio is captured at 16 kHz, 16-bit signed, mono (raw PCM) across all platforms.
- The native audio module (audio-capture-napi, cpal-based) is the primary capture backend on macOS, Linux, and Windows. SoX and arecord serve as Linux-only fallbacks.
- Native module loading is deferred to first voice keypress via lazy import() to avoid blocking the event loop at startup (dlopen costs 1-8s on macOS).
- STT uses Anthropic's voice_stream WebSocket endpoint backed by Deepgram, with an optional Nova 3 path gated on tengu_cobalt_frost.
- The WebSocket targets api.anthropic.com (not claude.ai) to bypass Cloudflare TLS fingerprint challenges that block non-browser clients.
- Audio is buffered locally and flushed to the WebSocket on connection ready, eliminating 1-2s of OAuth + handshake latency.
- Up to 50 domain-specific keyterms are sent per session for STT accuracy boosting, combining hardcoded coding terms with project name, git branch, and recent file names.
- Silent-drop detection replays the full audio buffer on a fresh connection when the server returns zero transcripts despite receiving audio (~1% of sessions).
- Voice mode requires OAuth authentication (Claude.ai subscriber); API keys, Bedrock, Vertex, and Foundry are not supported.
- The system supports 20 languages for dictation, with BCP-47 code normalization and graceful fallback to English for unsupported languages.
- Hold-to-talk release is detected via auto-repeat key gap timing (200ms threshold); modifier combos activate immediately while bare characters require 5 rapid presses.
- Focus mode enables automatic recording tied to terminal focus/blur with a 5-second silence timeout for teardown.
- Remote environments (Homespace, CLAUDE_CODE_REMOTE) are blocked from voice mode due to lack of local microphone access.
- The entire voice subsystem is compile-time gated via feature('VOICE_MODE') for dead code elimination in external builds, plus a runtime GrowthBook kill-switch (tengu_amber_quartz_disabled).
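The hold-to-talk claim relies on the fact that a terminal typically reports key presses but not key releases, so release has to be inferred from the auto-repeat stream going quiet. A minimal sketch of that inference follows, assuming a caller that supplies timestamps and polls periodically. The HoldDetector class and its method names are hypothetical; only the 200ms gap threshold comes from the claims above.

```typescript
const RELEASE_GAP_MS = 200; // threshold stated in the key claims

class HoldDetector {
  private lastKeyAt: number | null = null;

  // Call for every event of the push-to-talk key: the initial press
  // and each auto-repeat that follows while the key is held.
  onKey(nowMs: number): void {
    this.lastKeyAt = nowMs;
  }

  // Poll periodically. Returns true exactly once, when the gap since the
  // last key event exceeds the threshold and the key is presumed released.
  checkReleased(nowMs: number): boolean {
    if (this.lastKeyAt === null) return false;
    if (nowMs - this.lastKeyAt > RELEASE_GAP_MS) {
      this.lastKeyAt = null;
      return true;
    }
    return false;
  }

  isHolding(): boolean {
    return this.lastKeyAt !== null;
  }
}
```

Passing timestamps in explicitly, rather than reading the clock inside the class, keeps the logic deterministic and easy to test; a real hook would more likely reset a timer on each key event.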
Relations
- depends-on audio-capture-napi (native NAPI module wrapping cpal for cross-platform audio capture)
- depends-on voice_stream endpoint (Anthropic's WebSocket STT service at /api/ws/speech_to_text/voice_stream)
- depends-on Deepgram STT (server-side speech-to-text engine, with Nova 3 as an optional upgraded path)
- depends-on OAuth subsystem (src/utils/auth.ts -- requires valid Anthropic OAuth tokens for WebSocket auth)
- depends-on GrowthBook feature gating (tengu_amber_quartz_disabled kill-switch, tengu_cobalt_frost Nova 3 gate)
- depends-on React context/store system (src/context/voice.tsx -- VoiceProvider, useVoiceState, useSetVoiceState)
- depends-on Settings system (settings.voiceEnabled user preference toggle)
- depends-on Keybinding system (voice:pushToTalk action, default bound to Space in Chat scope)
- integrates-with Prompt input (useVoiceIntegration manages cursor-aware transcript insertion)
- integrates-with Notification system (voice errors surface as notifications via addNotification)
- integrates-with Analytics (tengu_voice_recording_started, tengu_voice_recording_completed, tengu_voice_toggled, tengu_voice_silent_drop_replay events)
- integrates-with Git utilities (getBranch for keyterm extraction from branch names)
- fallback-chain Native cpal module -> arecord (ALSA, Linux) -> SoX rec (Linux/macOS)
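The fallback chain can be read as an ordered preference list filtered by platform. The sketch below is a hedged illustration: the type names and both functions are hypothetical, availability is assumed to be probed elsewhere, and the per-platform candidates follow the fallback-chain line as written (arecord on Linux only, SoX on Linux and macOS, native only on Windows), which is an interpretation rather than confirmed behavior.

```typescript
type Platform = "darwin" | "linux" | "win32";
type Backend = "native" | "arecord" | "sox";

// Result of probing each backend; how the probing is done is out of scope here.
interface Availability {
  native: boolean;
  arecord: boolean;
  sox: boolean;
}

// Ordered candidates per platform, following the fallback-chain relation.
function candidateBackends(platform: Platform): Backend[] {
  switch (platform) {
    case "win32":
      return ["native"]; // no fallback on Windows
    case "linux":
      return ["native", "arecord", "sox"];
    case "darwin":
      return ["native", "sox"];
  }
}

// Pick the first available backend, or null when nothing can record.
function selectBackend(platform: Platform, available: Availability): Backend | null {
  for (const backend of candidateBackends(platform)) {
    if (available[backend]) return backend;
  }
  return null;
}
```

A null result corresponds to the pre-flight case where the /voice command would report that no recording backend is available and suggest installing a dependency.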
Sources
- src/services/voice.ts -- Audio recording service (native + SoX + arecord backends, dependency checks, mic permission probing)
- src/services/voiceStreamSTT.ts -- WebSocket STT client (connection management, wire protocol, finalization, keepalive)
- src/services/voiceKeyterms.ts -- Domain vocabulary builder (global terms, project context, git branch, recent files)
- src/hooks/useVoice.ts -- Core voice React hook (recording lifecycle, language normalization, audio buffering, silent-drop replay, focus mode)
- src/hooks/useVoiceEnabled.ts -- Voice enablement check hook (user intent + auth + GrowthBook, memoized)
- src/hooks/useVoiceIntegration.tsx -- Input integration hook (cursor anchoring, interim rendering, keybinding handler, hold detection)
- src/voice/voiceModeEnabled.ts -- Voice gating functions (GrowthBook kill-switch, auth check, compile-time feature gate)
- src/commands/voice/voice.ts -- /voice slash command (toggle with pre-flight checks, language hint, dependency install guidance)
- src/commands/voice/index.ts -- Command registration (availability, visibility gating)
- src/context/voice.tsx -- Voice state store (React context, synchronous external store, slice subscriptions)