Telemetry System

Entity ID: ent-20260410-613817e31735
Type: service
Scope: shared
Status: active
Aliases: analytics, telemetry, GrowthBook, event logging, observability

Description

The telemetry system (src/services/analytics/) provides event logging, feature flags, and observability for Claude Code. It uses a three-layer architecture: a public event API with a queue-before-sink pattern, a routing sink that dispatches to backends, and backend-specific exporters for Datadog and OpenTelemetry. GrowthBook provides feature flags and A/B testing.

Architecture

Layer 1: Public API (`index.ts`)

logEvent(name, metadata) and logEventAsync() are the only public entry points. Events are buffered in eventQueue[] until attachAnalyticsSink() is called during initialization; the buffered events are then drained asynchronously via queueMicrotask.
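
The queue-before-sink behavior described above can be sketched roughly as follows. This is a minimal illustration, not the contents of index.ts: the AnalyticsSink interface shape and the no-op on a second attach are assumptions made for the example.

```typescript
// Sketch of the queue-before-sink pattern. Type shapes are assumed, not taken from index.ts.
type Metadata = { [key: string]: boolean | number | undefined };
type QueuedEvent = { name: string; metadata: Metadata };

// Hypothetical sink interface; the real sink may expose more methods.
interface AnalyticsSink {
  logEvent(name: string, metadata: Metadata): void;
}

let sink: AnalyticsSink | null = null;
const eventQueue: QueuedEvent[] = [];

function logEvent(name: string, metadata: Metadata): void {
  if (sink === null) {
    // No sink yet: buffer the event instead of dropping it.
    eventQueue.push({ name, metadata });
    return;
  }
  sink.logEvent(name, metadata);
}

function attachAnalyticsSink(newSink: AnalyticsSink): void {
  if (sink !== null) return; // assumption: attaching twice is a no-op
  sink = newSink;
  if (eventQueue.length === 0) return;
  // Take ownership of the buffered events, then deliver them off the current call stack.
  const queued = eventQueue.splice(0, eventQueue.length);
  queueMicrotask(() => {
    for (const event of queued) newSink.logEvent(event.name, event.metadata);
  });
}
```

One consequence of draining in a microtask is that an event logged immediately after attachment can reach the sink before the buffered ones, so a sink written this way should not rely on strict ordering.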

Type safety for PII prevention: the metadata type is { [key]: boolean | number | undefined }, deliberately excluding strings. Logging a string (which could contain code or file paths) requires an explicit cast to AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS, a never type that serves as a compile-time documentation gate.
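
A small sketch of how such a gate behaves at the type level. The logEvent stand-in, the event name, and the metadata keys here are hypothetical; only the alias name and the boolean/number/undefined value type come from the description above.

```typescript
// A never-typed alias: never is assignable to every type, so a value cast to it
// fits the metadata value type even though plain strings do not.
type AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS = never;

type LogEventMetadata = { [key: string]: boolean | number | undefined };

// Stand-in for the real logEvent: records calls so the example is observable.
const recorded: Array<{ name: string; metadata: LogEventMetadata }> = [];
function logEvent(name: string, metadata: LogEventMetadata): void {
  recorded.push({ name, metadata });
}

// Booleans and numbers pass without ceremony.
logEvent("tengu_example_event", { durationMs: 12, success: true });

// A bare string is rejected by the compiler:
// logEvent("tengu_example_event", { model: "example-model" });

// The cast is the author's explicit statement that the string is safe to log.
const modelName: string = "example-model";
logEvent("tengu_example_event", {
  model: modelName as AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS,
});
```

The cast has no runtime effect; its value is that the long name shows up in code review and in searches, which makes every string that reaches analytics easy to audit.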

Layer 2: Routing sink (`sink.ts`)

initializeAnalyticsSink() creates and attaches the sink. Events are routed to:
- Datadog: only if the tengu_log_datadog_events GrowthBook gate is enabled, Datadog is not killed, and the first-party API is in use
- 1P event logging: always enabled, via OpenTelemetry

Metadata keys prefixed with _PROTO_ are stripped before events are sent to Datadog, because they carry PII-tagged values that are restricted to privileged 1P columns. Event sampling is decided by shouldSampleEvent(), which reads the GrowthBook dynamic config tengu_event_sampling_config.
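
The two routing steps above might look something like the sketch below. The helper names stripProtoKeys and isSampledIn, the sampling config shape, and the keep-by-default rule are all assumptions for illustration; the real shouldSampleEvent() signature and the structure of tengu_event_sampling_config are not documented here.

```typescript
type Metadata = { [key: string]: boolean | number | undefined };

// Remove keys carrying the _PROTO_ prefix before handing metadata to a non-privileged backend.
function stripProtoKeys(metadata: Metadata): Metadata {
  const result: Metadata = {};
  for (const key of Object.keys(metadata)) {
    if (!key.startsWith("_PROTO_")) result[key] = metadata[key];
  }
  return result;
}

// Hypothetical config shape: event name mapped to a sample rate in [0, 1].
type SamplingConfig = { [eventName: string]: { sample_rate: number } | undefined };

// Hypothetical sampling decision. Events without a config entry are kept (an assumption).
// The random source is injectable so the decision can be tested deterministically.
function isSampledIn(
  eventName: string,
  config: SamplingConfig,
  random: () => number = Math.random,
): boolean {
  const entry = config[eventName];
  if (entry === undefined) return true;
  return random() < entry.sample_rate;
}
```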

Layer 3: Backends

Datadog (datadog.ts):
- Endpoint: https://http-intake.logs.us5.datadoghq.com/api/v2/logs
- Client token: pubbbf48e6d78dae54bceaa4acf463299bf
- Service name: claude-code
- DEFAULT_FLUSH_INTERVAL_MS = 15000 (15s batch flush)
- MAX_BATCH_SIZE = 100
- NETWORK_TIMEOUT_MS = 5000
- NUM_USER_BUCKETS = 30 (privacy-preserving user bucketing via SHA-256 hash)
- Allowed events: DATADOG_ALLOWED_EVENTS allowlist (~40 event names, all tengu_* or chrome_bridge_*)
- MCP tool names normalized to 'mcp' for cardinality reduction
- Only fires in production, only for the first-party API provider
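
As an illustration of the user bucketing mentioned in the list above, here is a minimal sketch using Node's built-in crypto module. The function name getUserBucket and the way the digest is reduced to a bucket number (first four bytes, modulo the bucket count) are assumptions; datadog.ts may derive the bucket differently.

```typescript
import { createHash } from "node:crypto";

const NUM_USER_BUCKETS = 30;

// Map a user ID to one of NUM_USER_BUCKETS buckets so that only the bucket number,
// never the ID itself, needs to be attached to an event.
function getUserBucket(userId: string): number {
  const digest = createHash("sha256").update(userId).digest();
  // Assumed reduction: first 4 bytes as an unsigned big-endian integer, modulo the bucket count.
  return digest.readUInt32BE(0) % NUM_USER_BUCKETS;
}
```

With 30 buckets, each bucket groups many users together, so the value supports coarse per-cohort analysis without identifying an individual.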

OpenTelemetry 1P (firstPartyEventLogger.ts + exporter):
- Uses @opentelemetry/sdk-logs with BatchLogRecordProcessor
- Custom FirstPartyEventLoggingExporter (implements LogRecordExporter)
- Event types: ClaudeCodeInternalEvent, GrowthbookExperimentEvent (protobuf)
- Failed events stored in ~/.claude/telemetry/ as JSONL files (1p_failed_events.*)
- BATCH_UUID = randomUUID() per process run
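
Persisting failed events as JSONL could be sketched as below. Everything beyond "one JSON object per line in a 1p_failed_events.* file" is an assumption: the helper names, the FailedEvent shape, and the file-name suffix are hypothetical and not taken from the exporter.

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical event shape; the real exporter serializes its own record types.
type FailedEvent = { [key: string]: unknown };

// Serialize events as JSON Lines: one JSON object per line, each newline-terminated.
function toJsonl(events: FailedEvent[]): string {
  return events.map((event) => JSON.stringify(event) + "\n").join("");
}

// Parse JSON Lines back into objects, skipping blank lines.
function parseJsonl(text: string): FailedEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as FailedEvent);
}

// Append failed events to a file in `dir`. The suffix argument is a placeholder
// for whatever the real exporter puts after "1p_failed_events.".
function persistFailedEvents(dir: string, suffix: string, events: FailedEvent[]): string {
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `1p_failed_events.${suffix}.jsonl`);
  appendFileSync(file, toJsonl(events), "utf8");
  return file;
}
```

Appending whole lines keeps the file readable even if the process dies between writes, since every completed line is an independent JSON document.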

GrowthBook (`growthbook.ts`)

Feature flags and A/B testing via GrowthBook SDK.

User attributes for targeting: id, sessionId, deviceID, platform, apiBaseUrlHost, organizationUUID, accountUUID, userType, subscriptionType, rateLimitTier, firstTokenTime, email, appVersion, github.

Re-initialization on auth change (tracked via clientCreatedWithAuth)
Env overrides: CLAUDE_INTERNAL_FC_OVERRIDES (ant-only, JSON object)
Exposure dedup: loggedExposures Set prevents duplicate experiment logs
onGrowthBookRefresh(listener) — callback for long-lived objects that bake feature values
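
The exposure dedup and refresh-listener items above can be illustrated with the following sketch. The helper names logExposureOnce and notifyGrowthBookRefresh, and the unsubscribe function returned by onGrowthBookRefresh, are assumptions; only loggedExposures and onGrowthBookRefresh(listener) are named in the source.

```typescript
// Experiments whose exposure has already been logged in this process.
const loggedExposures = new Set<string>();

// Log an exposure at most once per experiment key. Returns true if it was logged.
function logExposureOnce(experimentKey: string, log: (key: string) => void): boolean {
  if (loggedExposures.has(experimentKey)) return false;
  loggedExposures.add(experimentKey);
  log(experimentKey);
  return true;
}

type RefreshListener = () => void;
const refreshListeners = new Set<RefreshListener>();

// Register a callback for objects that cached feature values at construction time.
// Returning an unsubscribe function is an assumption of this sketch.
function onGrowthBookRefresh(listener: RefreshListener): () => void {
  refreshListeners.add(listener);
  return () => {
    refreshListeners.delete(listener);
  };
}

// Hypothetical hook called after feature values are refreshed.
function notifyGrowthBookRefresh(): void {
  refreshListeners.forEach((listener) => listener());
}
```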

Event enrichment (`metadata.ts`)

sanitizeToolNameForAnalytics(toolName) — MCP tools become 'mcp_tool'
isToolDetailsLoggingEnabled() — checks OTEL_LOG_TOOL_DETAILS=1
Detailed logging is allowed only for: Cowork (local-agent), claude.ai connectors, official MCP registry URLs
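
A rough sketch of the two helpers above. The "mcp__" prefix check is an assumed way of recognizing MCP tools, and passing the environment as a parameter is a choice made here for testability; the real functions in metadata.ts may detect MCP tools and read the environment differently.

```typescript
// Assumed convention for this sketch: MCP tool names carry an "mcp__" prefix.
const MCP_TOOL_PREFIX = "mcp__";

// Collapse every MCP tool name into a single label so user-defined server and
// tool names never reach analytics.
function sanitizeToolNameForAnalytics(toolName: string): string {
  return toolName.startsWith(MCP_TOOL_PREFIX) ? "mcp_tool" : toolName;
}

// Detailed tool logging is opt-in via OTEL_LOG_TOOL_DETAILS=1.
// The env parameter is an assumption; the real helper likely reads the process environment.
function isToolDetailsLoggingEnabled(env: { [key: string]: string | undefined }): boolean {
  return env["OTEL_LOG_TOOL_DETAILS"] === "1";
}
```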

Lazy loading

Telemetry modules are lazy-loaded to minimize startup time:
- OpenTelemetry (~400KB + protobuf) loaded via await import() in init.ts
- gRPC exporters (~700KB) further lazy-loaded within instrumentation
- Total deferred: ~1.1MB of code
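
The deferred loading described above relies on dynamic import(). A minimal sketch of a memoized loader follows; the lazyOnce helper is hypothetical and is not claimed to exist in init.ts.

```typescript
// Wrap an async loader so the underlying work (for example a dynamic import)
// runs at most once, on first use, and later callers share the same promise.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) cached = load();
    return cached;
  };
}

// Example wiring, commented out so the sketch runs without the package installed:
// const loadOtelLogs = lazyOnce(() => import("@opentelemetry/sdk-logs"));
// const { BatchLogRecordProcessor } = await loadOtelLogs();
```

Caching the promise rather than the resolved module means concurrent callers during the first load do not trigger a second import.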

Trade-offs

Queue-before-sink: events logged before the sink attaches are buffered rather than dropped, but the queue grows unbounded until attachment, and a crash before the sink attaches loses every queued event.
No strings in metadata: effective PII prevention, but it makes legitimate text data harder to log. The never-cast workaround is ugly but intentional.
Datadog allowlist: only ~40 event names are forwarded, which prevents accidental PII leaks but requires a manual allowlist update for each new event.
15s batch flush: reduces network overhead, but up to 15s of events may be lost on a crash.
Lazy loading: defers ~1.1MB of code at startup, but early events (logged before telemetry init) may lack tracing context.

Depends on

GrowthBook SDK — feature flags and experiment tracking
@opentelemetry/sdk-logs — log export
Datadog Logs API — event ingestion

Key claims

No-string metadata type prevents accidental PII logging at compile time
Datadog allowlist restricts forwarding to ~40 approved event names
Failed OTel events are persisted to disk as JSONL for retry
GrowthBook re-initializes on auth change to pick up org-specific flags
~1.1MB of telemetry code is lazy-loaded after startup

Relations

used_by cost-tracker (OpenTelemetry counters)
used_by service-layer (all services log through analytics)
depends_on GrowthBook SDK
depends_on OpenTelemetry SDK

Sources

src-20260409-a5fc157bc756, source code analysis of src/services/analytics/