Observability

The whole point of the telemetry stack is to iterate fast on field issues — a device in someone's living room that you can't put your hands on. Everything funnels into one table and a set of Grafana dashboards. Engineering reference: docs/observability.md.

The telemetry funnel

[Figure: telemetry funnel]

One table, reassembled by turn_id

public.telemetry_events (migration 0006) holds one row per pipeline stage transition. Because every row carries a turn_id, a single utterance, chat turn, or WhatsApp message can be reassembled end-to-end across services. Columns:

  • surface: voice | chat | whatsapp | edge | system
  • source: edge | voice_bridge | sudo_api | hermes_plugin | livekit
  • stage: free text (new stages need no migration)
  • latency_ms, plus a meta JSONB blob
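
As a rough illustration of what "reassembled by turn_id" means, the sketch below groups already-fetched rows by turn_id and orders each turn's stages in time. It assumes each row also carries a timestamp, called ts here; that column name, the sample stage names, and both helper functions are hypothetical and not taken from the schema.

```python
from collections import defaultdict
from typing import Any

Row = dict[str, Any]


def reassemble_turns(rows: list[Row]) -> dict[str, list[Row]]:
    """Group event rows by turn_id and order each turn's events by time."""
    turns: dict[str, list[Row]] = defaultdict(list)
    for row in rows:
        turns[row["turn_id"]].append(row)
    for events in turns.values():
        # `ts` is an assumed timestamp column; swap in the real one.
        events.sort(key=lambda r: r["ts"])
    return dict(turns)


def stage_latencies(events: list[Row]) -> list[tuple[str, str, int]]:
    """Reduce one turn to (source, stage, latency_ms) triples for a quick read."""
    return [(e["source"], e["stage"], e["latency_ms"]) for e in events]


# Hypothetical rows; the stage names are made up for the example.
SAMPLE_ROWS: list[Row] = [
    {"turn_id": "t-1", "ts": 2.0, "surface": "voice", "source": "voice_bridge",
     "stage": "llm_first_token", "latency_ms": 420, "meta": {}},
    {"turn_id": "t-1", "ts": 1.0, "surface": "voice", "source": "edge",
     "stage": "wake", "latency_ms": 35, "meta": {}},
    {"turn_id": "t-2", "ts": 1.5, "surface": "chat", "source": "sudo_api",
     "stage": "received", "latency_ms": 4, "meta": {}},
]

if __name__ == "__main__":
    for turn_id, events in reassemble_turns(SAMPLE_ROWS).items():
        print(turn_id, stage_latencies(events))
```

Because the stage column is free text, a grouping like this needs no changes when a service starts emitting a new stage.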

Who emits, and how

  • sudo-api writes directly to the DB via cloud/_shared/telemetry.py (record_event / record_bg).
  • Everyone without a DB pool POSTs to ingest endpoints:
    • /v1/internal/telemetry (bearer-gated) — voice-bridge, plugins, the LiveKit webhook.
    • /v1/me/telemetry (device-JWT-gated) — the Pi (user_id/device_id forced from token).
  • The LiveKit webhook (/v1/internal/livekit/webhook) records room/participant/track lifecycle, the ground truth behind "the Pi thinks it connected but voice is silent."
  • voice-bridge subscribes to the agent session's own metrics, so LLM time to first token (TTFT), TTS time to first byte (TTFB), and end-of-utterance delay land as voice_metrics rows.
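
For services without a DB pool, emitting an event comes down to one authenticated POST. The sketch below only builds the request for the bearer-gated endpoint and does not send it. The JSON field names mirror the table columns, but the actual payload shape the endpoint accepts is an assumption, as are the helper name, the example host, and the token.

```python
import json
import urllib.request
from typing import Any

INGEST_PATH = "/v1/internal/telemetry"  # bearer-gated endpoint named above


def build_event_request(
    base_url: str, token: str, event: dict[str, Any]
) -> urllib.request.Request:
    """Build (but do not send) a POST carrying one telemetry event."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + INGEST_PATH,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical event; the field names are assumed to match the table columns.
EXAMPLE_EVENT = {
    "turn_id": "t-1",
    "surface": "voice",
    "source": "voice_bridge",
    "stage": "voice_metrics",
    "latency_ms": 420,
    "meta": {"metric": "llm_ttft"},
}

if __name__ == "__main__":
    req = build_event_request("https://api.example.invalid", "dummy-token", EXAMPLE_EVENT)
    print(req.get_method(), req.full_url)
    # Sending would be urllib.request.urlopen(req, timeout=5), ideally in a
    # try/except so that a telemetry failure never breaks the calling path.
```

The device path (/v1/me/telemetry) would look the same with a device JWT as the credential; since user_id and device_id are forced from the token there, a client has no reason to send them.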

Logs, dashboards, crashes

  • Loki + promtail — promtail ships every container's stdout/stderr to Loki via the docker socket (real container names as labels).
  • Grafana (grafana.sudohomes.com) — two datasources: Postgres sudo-telemetry and sudo-loki. Dashboards: turn-funnel and fleet-health (deploy/grafana/dashboards/). Login: admin / GRAFANA_ADMIN_PASSWORD.
  • Sentry: _init_sentry() in sudo-api and voice-bridge; a no-op unless SENTRY_DSN is set.
  • Edge self-telemetry: sudoedge/telemetry.py keeps a ring buffer and a background flusher. Crashes report via report_panic (synchronous, since the loop is dead by then), and the last few ring events ride home in the heartbeat.
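
The ring-buffer-plus-flusher pattern used on the edge can be sketched as follows. This is not the code in sudoedge/telemetry.py: the class, method names, and capacity are hypothetical, and a thread stands in for whatever loop the real flusher runs on.

```python
import threading
from collections import deque
from typing import Any, Callable

Event = dict[str, Any]


class EventRing:
    """Bounded buffer: keeps the newest events and tracks what is unsent."""

    def __init__(self, capacity: int = 256) -> None:
        self._recent: deque[Event] = deque(maxlen=capacity)   # for heartbeats
        self._pending: deque[Event] = deque(maxlen=capacity)  # awaiting flush
        self._lock = threading.Lock()

    def record(self, event: Event) -> None:
        with self._lock:
            self._recent.append(event)   # oldest entries fall off when full
            self._pending.append(event)

    def tail(self, n: int) -> list[Event]:
        """Last n events, e.g. to attach to a heartbeat."""
        if n <= 0:
            return []
        with self._lock:
            return list(self._recent)[-n:]

    def flush(self, send: Callable[[list[Event]], None]) -> int:
        """Hand all pending events to `send`; returns how many were handed over."""
        with self._lock:
            batch = list(self._pending)
            self._pending.clear()
        if batch:
            send(batch)  # sketch only: a failing send drops the batch
        return len(batch)


def start_flusher(
    ring: EventRing,
    send: Callable[[list[Event]], None],
    interval_s: float,
    stop: threading.Event,
) -> threading.Thread:
    """Flush every interval_s seconds until `stop` is set, then drain once."""

    def loop() -> None:
        while not stop.wait(interval_s):
            ring.flush(send)
        ring.flush(send)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

A crash handler cannot rely on the background flusher, which is why a synchronous path such as report_panic exists; in this sketch the equivalent would be calling ring.flush(send) directly from the handler.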

Raw transcripts are off by default — it's a kids' device

Voice transcripts are not logged by default. Opt in per user via global_settings.voice_debug_transcripts (true / "all" / a list of user_ids). Respect this when debugging.
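
A gate for the three accepted shapes of the setting might look like the sketch below. It assumes that true and "all" both mean every user, and that anything else (missing, false, an unrecognised string) keeps transcripts off; the function names are hypothetical and not the project's actual helpers.

```python
from typing import Any, Optional


def transcripts_enabled(setting: Any, user_id: str) -> bool:
    """Decide whether raw transcripts may be logged for this user.

    `setting` is the value of global_settings.voice_debug_transcripts:
    True, the string "all", or a list of user_ids. Everything else is off.
    """
    if setting is True:
        return True
    if isinstance(setting, str):
        return setting == "all"
    if isinstance(setting, (list, tuple, set)):
        # Compare as strings so numeric and string ids both match.
        return str(user_id) in {str(u) for u in setting}
    return False  # default: transcripts stay off


def loggable_transcript(text: str, setting: Any, user_id: str) -> Optional[str]:
    """Return the transcript only if the user is opted in, else None."""
    return text if transcripts_enabled(setting, user_id) else None
```

Failing closed on unknown values keeps a misconfigured setting from exposing transcripts.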

Two deploy gotchas

  • Changes to the LiveKit webhook config (deploy/livekit.yaml) take effect only after a force-recreate; the signing key is rendered from LIVEKIT_API_KEY at deploy time.
  • If Grafana's Postgres datasource errors on prepared statements against the transaction pooler (port 6543), point it at the Supabase session pooler (port 5432).