Observability
The whole point of the telemetry stack is to iterate fast on field issues — a device
in someone's living room that you can't put your hands on. Everything funnels into one
table and a set of Grafana dashboards. Engineering reference: docs/observability.md.
The telemetry funnel
One table, reassembled by turn_id
public.telemetry_events (migration 0006) holds one row per pipeline
stage-transition. Because every row carries a turn_id, a single utterance / chat turn
/ WhatsApp message can be reassembled end-to-end across services. Columns:
- surface ∈ voice | chat | whatsapp | edge | system
- source ∈ edge | voice_bridge | sudo_api | hermes_plugin | livekit
- stage: free text (new stages need no migration)
- latency_ms + a meta JSONB blob
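As a rough illustration of what "reassembled by turn_id" means, the sketch below groups a few in-memory rows by turn_id and orders each turn chronologically. The stage names and the created_at ordering column are assumptions made for the example; the real table may order events differently.

```python
from collections import defaultdict

# Hypothetical rows shaped like public.telemetry_events. The stage names and
# the created_at ordering column are assumptions made for this sketch.
ROWS = [
    {"turn_id": "t1", "surface": "voice", "source": "voice_bridge",
     "stage": "llm_first_token", "latency_ms": 420, "created_at": 2},
    {"turn_id": "t1", "surface": "voice", "source": "edge",
     "stage": "wake_detected", "latency_ms": 35, "created_at": 1},
    {"turn_id": "t2", "surface": "chat", "source": "sudo_api",
     "stage": "request_received", "latency_ms": 5, "created_at": 3},
]


def group_by_turn(rows):
    """Group event rows by turn_id and order each turn chronologically."""
    turns = defaultdict(list)
    for row in rows:
        turns[row["turn_id"]].append(row)
    for events in turns.values():
        events.sort(key=lambda r: r["created_at"])
    return dict(turns)


def timeline(events):
    """Reduce one turn's events to (source, stage, latency_ms) tuples."""
    return [(e["source"], e["stage"], e["latency_ms"]) for e in events]


turns = group_by_turn(ROWS)
for source, stage, latency_ms in timeline(turns["t1"]):
    print(f"{source:>12}  {stage:<16} {latency_ms} ms")
```

Because rows from different services share the same turn_id, the per-turn timeline crosses service boundaries without any join.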
Who emits, and how
- sudo-api writes directly to the DB via cloud/_shared/telemetry.py (record_event / record_bg).
- Everyone without a DB pool POSTs to ingest endpoints:
    - /v1/internal/telemetry (bearer-gated): voice-bridge, plugins, the LiveKit webhook.
    - /v1/me/telemetry (device-JWT-gated): the Pi (user_id/device_id forced from token).
- The LiveKit webhook → /v1/internal/livekit/webhook records room/participant/track lifecycle, which is the ground truth behind "the Pi thinks it connected but voice is silent."
- voice-bridge subscribes to the agent session's own metrics → LLM TTFT, TTS TTFB, and end-of-utterance delay land as voice_metrics rows.
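For services without a DB pool, an emit could look roughly like the sketch below, which builds (but does not send) a POST to /v1/internal/telemetry. Only the endpoint path and the fact that it is bearer-gated come from this page; the base URL, the token, the stage name, and the JSON body shape (keys mirroring the table columns) are assumptions.

```python
import json
import urllib.request

INGEST_PATH = "/v1/internal/telemetry"  # endpoint path named on this page


def build_ingest_request(base_url, token, event):
    """Build a bearer-authenticated POST for one telemetry event.

    The JSON body shape is an assumption; check the ingest handler for the
    fields it actually accepts.
    """
    return urllib.request.Request(
        base_url.rstrip("/") + INGEST_PATH,
        data=json.dumps(event).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical event; keys mirror the table columns listed earlier.
event = {
    "turn_id": "t1",
    "surface": "voice",
    "source": "voice_bridge",
    "stage": "llm_first_token",
    "latency_ms": 420,
    "meta": {},
}

# Placeholder host and token; nothing is sent over the network here.
req = build_ingest_request("https://api.example.invalid", "dummy-token", event)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req, timeout=...) would actually send it. A telemetry emitter would normally catch and swallow network errors so that a failed emit never breaks the pipeline stage being measured.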
Logs, dashboards, crashes
- Loki + promtail: promtail ships every container's stdout/stderr to Loki via the docker socket (real container names as labels).
- Grafana (grafana.sudohomes.com): two datasources, Postgres sudo-telemetry and sudo-loki. Dashboards: turn-funnel and fleet-health (deploy/grafana/dashboards/). Login: admin / GRAFANA_ADMIN_PASSWORD.
- Sentry: _init_sentry() in sudo-api + voice-bridge; no-op unless SENTRY_DSN set.
- Edge self-telemetry: sudoedge/telemetry.py keeps a ring buffer + background flusher; crashes report via report_panic (synchronous, since the loop is dead by then) and the last few ring events ride home in the heartbeat.
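The ring-buffer idea behind edge self-telemetry can be sketched with a bounded deque. This is not the code in sudoedge/telemetry.py: the class name, method names, and capacity are hypothetical, and the background flusher and report_panic path are left out.

```python
from collections import deque


class EventRing:
    """Fixed-size buffer that silently drops the oldest event when full."""

    def __init__(self, capacity=64):
        # deque(maxlen=...) evicts from the left once capacity is reached.
        self._events = deque(maxlen=capacity)

    def record(self, stage, **meta):
        """Append one event; never blocks and never grows past capacity."""
        self._events.append({"stage": stage, "meta": meta})

    def tail(self, n=5):
        """Return up to the last n events, e.g. to attach to a heartbeat."""
        if n <= 0:
            return []
        return list(self._events)[-n:]

    def drain(self):
        """Return all buffered events and clear the buffer (flusher side)."""
        batch = list(self._events)
        self._events.clear()
        return batch


ring = EventRing(capacity=3)
for i in range(5):
    ring.record(f"stage_{i}")

# Only the three newest events survive; the heartbeat carries the last two.
heartbeat_tail = ring.tail(2)
print([e["stage"] for e in heartbeat_tail])
```

A bounded buffer keeps memory use flat on the Pi even when the uplink is down for a long time, at the cost of losing the oldest events.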
Raw transcripts are off by default — it's a kids' device
Voice transcripts are not logged by default. Opt in per user via
global_settings.voice_debug_transcripts (true / "all" / a list of user_ids).
Respect this when debugging.
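The three accepted shapes of the setting (true, "all", or a list of user_ids) suggest a check along the following lines. The function name is hypothetical and the real evaluation logic may differ; the sketch only shows one reading in which anything else means "off".

```python
def transcripts_enabled(setting, user_id):
    """Decide whether raw transcripts may be logged for user_id.

    Assumed semantics: True or "all" enables logging for everyone, a list
    enables it only for the listed user_ids, and any other value (None,
    False, an unrecognised string) keeps the default of no logging.
    """
    if setting is True or setting == "all":
        return True
    if isinstance(setting, (list, tuple, set)):
        return user_id in setting
    return False


print(transcripts_enabled(None, "u1"))            # setting absent
print(transcripts_enabled(["u1", "u7"], "u1"))    # per-user opt-in
```

Failing closed on unknown values matches the default described above: a typo in the setting should never turn transcript logging on.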
Two deploy gotchas
- The LiveKit webhook config (deploy/livekit.yaml) needs a force-recreate to take effect; the signing key is rendered from LIVEKIT_API_KEY at deploy time.
- If Grafana's Postgres datasource errors on prepared statements against the transaction pooler (port 6543), point it at the Supabase session pooler (port 5432).