Runbooks
Field-tested triage for the failures we've actually hit. When something's on fire, start here. Each runbook says symptom → likely cause → fix.
"All surfaces are dead at once"
Symptom: voice, chat, and WhatsApp all stop replying simultaneously.
Because the LLM key is global, one bad key breaks everything. Two known causes:
- LLM key 402 / over quota. A free-tier or expired OpenRouter key starts returning 402. The OpenRouter path does not auto-shrink the request the way the Anthropic path does, so it just fails. Fix: rotate the key at /admin/settings (which triggers restart_all_runtimes() fleet-wide). See Secrets.
- A dead WhatsApp pairing crash-kills the whole gateway. A stale Baileys-style pairing could take the entire hermes gateway down (not just WhatsApp). Fix: ensure the legacy Baileys platform stays disabled; we're on Twilio now. See WhatsApp.
Triage order: check the LLM key first (Grafana errors / hermes logs in Loki), then confirm no gateway is crash-looping.
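The first triage step can be sketched as a small status-to-action mapping. This is a minimal sketch, not the hermes implementation: the function name and the returned hint strings are hypothetical, and only the 402 case comes from the runbook above. The other branches follow general HTTP status semantics.

```python
def triage_llm_status(status: int) -> str:
    """Map an HTTP status seen on LLM calls to a suggested next step.

    Hypothetical helper: only the 402 case is documented in this runbook.
    """
    if status in (401, 402, 403):
        # Bad, expired, or over-quota key: rotating it restarts all runtimes.
        return "rotate key at /admin/settings"
    if status == 429:
        return "rate limited: back off, then re-check quota"
    if 500 <= status < 600:
        return "provider error: the key is probably fine, check the provider"
    if 200 <= status < 300:
        return "key ok: check for a crash-looping gateway next"
    return "unexpected status: read the hermes logs in Loki"


if __name__ == "__main__":
    for code in (402, 200, 503):
        print(code, "->", triage_llm_status(code))
```

Feed it the status code that dominates the error panel in Grafana; if the answer is "key ok", move on to the gateway check.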
"Voice does nothing on a Pi"
Symptom: the device boots but never responds to "hey sudo," or connects but is silent.
Suspect audio first, before the cloud:
- Did it play the ready cue? If only the boot cue plays, the mic never came live: either an audio device pick problem or a reSpeaker that needs flashing (arecord -l shows no capture).
- If the mic is live but it's silent, check the LiveKit webhook telemetry; "Pi thinks it connected but voice is silent" shows up there. See Observability.
- Check sample-rate / device-index handling for that specific hardware.
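To make the "arecord -l shows no capture" check scriptable, the output can be parsed for capture cards. The sketch below assumes the usual ALSA listing format; the function name is hypothetical and the sample card name is only illustrative, so substitute the real output from the device under test.

```python
import re

# Matches lines such as:
#   card 1: ArrayUAC10 [ReSpeaker 4 Mic Array (UAC1.0)], device 0: USB Audio [USB Audio]
_CARD_LINE = re.compile(r"^card (\d+): (\S+) \[(.*?)\], device (\d+):")

# Illustrative sample only; real output varies by hardware.
SAMPLE = """**** List of CAPTURE Hardware Devices ****
card 1: ArrayUAC10 [ReSpeaker 4 Mic Array (UAC1.0)], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""


def capture_devices(arecord_output: str) -> list[tuple[int, int, str]]:
    """Return (card index, device index, card name) for each capture device."""
    devices = []
    for line in arecord_output.splitlines():
        match = _CARD_LINE.match(line)
        if match:
            devices.append((int(match.group(1)), int(match.group(4)), match.group(3)))
    return devices


if __name__ == "__main__":
    found = capture_devices(SAMPLE)
    print(found if found else "no capture device: suspect flashing or device pick")
```

An empty list means no capture device is visible, which points at flashing or device selection rather than the cloud.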
"Mic wedged — no frames at all"
Symptom: the XVF3800 DSP wedges and delivers no frames (frozen hw_ptr).
This is a documented hardware wedge. The edge now recovers the wedged mic in seconds
and never cues 'ready' before the mic is actually live, so a wedge no longer looks
like a healthy device. If you see it persist, capture logs and check the recovery path in
sudoedge. (Background: repo commits and docs/ cover the XVF3800 wedge, recovery, and
cue gating.)
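The two behaviours described above, stall detection and gating the ready cue on a live mic, can be illustrated with a small watchdog. This is a hypothetical sketch, not the recovery code in sudoedge: the class name, the 2-second threshold, and the injected timestamps are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MicWatchdog:
    """Tracks frame arrival times to tell a live mic from a wedged one."""

    started_at: float = 0.0
    stall_after_s: float = 2.0  # hypothetical threshold
    last_frame_at: Optional[float] = None

    def on_frame(self, now: float) -> None:
        # Call this for every audio frame delivered by the capture device.
        self.last_frame_at = now

    def is_live(self, now: float) -> bool:
        # Cue gating: 'ready' is allowed only after a recent frame has arrived.
        if self.last_frame_at is None:
            return False
        return now - self.last_frame_at <= self.stall_after_s

    def is_wedged(self, now: float) -> bool:
        # No frame since start, or since the last frame, for too long.
        reference = self.started_at if self.last_frame_at is None else self.last_frame_at
        return now - reference > self.stall_after_s


if __name__ == "__main__":
    dog = MicWatchdog(started_at=0.0)
    print(dog.is_live(0.5), dog.is_wedged(3.0))
    dog.on_frame(3.5)
    print(dog.is_live(4.0), dog.is_wedged(4.0))
```

Passing timestamps in explicitly keeps the logic testable; a real loop would supply a monotonic clock reading and trigger recovery when is_wedged returns true.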
"UndefinedTableError 500 right after a deploy"
Symptom: an endpoint 500s with UndefinedTableError immediately after deploying.
Cause: a migration is sitting unapplied; migrations don't auto-apply. Fix: paste the
pending supabase/migrations/000N_*.sql into the Supabase Studio SQL editor and run it. See
Database & migrations.
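Finding which files are pending can be sketched as below. It assumes a four-digit numeric prefix, as the 000N_*.sql pattern suggests, and assumes you already know the number of the last migration applied; how that is tracked is not stated here. The function name and the file names in the example are hypothetical.

```python
import re

# Assumes names like 0007_add_widgets.sql (four-digit prefix, underscore, .sql).
_MIGRATION = re.compile(r"^(\d{4})_.+\.sql$")


def pending_migrations(filenames: list[str], last_applied: int) -> list[str]:
    """Return migration files numbered above last_applied, in apply order."""
    numbered = []
    for name in filenames:
        match = _MIGRATION.match(name)
        if match and int(match.group(1)) > last_applied:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


if __name__ == "__main__":
    files = ["0003_c.sql", "0001_a.sql", "0002_b.sql", "README.md"]
    print(pending_migrations(files, last_applied=1))
```

Apply the returned files in the listed order, since a later migration may depend on a table created by an earlier one.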
"Orphan hermes containers on the VPS"
Symptom: docker ps shows hermes-user-usr_* containers.
Real per-user containers are named hermes-user-<32-char uuid hex> (no underscores after
the prefix). Anything hermes-user-usr_* is an orphan from an older deploy — safe to
docker rm -f. See Provisioning.
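The naming rule above is easy to encode, which helps when there are many containers to sort through. A minimal sketch, assuming the 32-character UUID hex is lowercase; the function name and labels are hypothetical.

```python
import re

# Real per-user containers: prefix plus exactly 32 lowercase hex characters.
_REAL = re.compile(r"^hermes-user-[0-9a-f]{32}$")
_ORPHAN_PREFIX = "hermes-user-usr_"


def classify_container(name: str) -> str:
    """Label a container name as 'real', 'orphan', or 'other'."""
    if _REAL.match(name):
        return "real"
    if name.startswith(_ORPHAN_PREFIX):
        return "orphan"
    return "other"


if __name__ == "__main__":
    names = [
        "hermes-user-" + "0123456789abcdef" * 2,
        "hermes-user-usr_abc123",
        "hermes-gateway",
    ]
    for name in names:
        print(name, "->", classify_container(name))
```

Names can be piped in from docker ps --format '{{.Names}}'. Review anything labelled "other" by hand instead of removing it.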
General triage toolkit
- Logs: Grafana → Loki (grafana.sudohomes.com), or make prod-logs.
- Per-turn trace: query public.telemetry_events by turn_id to see where a turn died. See Observability.
- A specific deploy: gh run list --workflow=deploy-cloud.yml, then gh run watch <id>.
- Container state on the VPS: make prod-status.
These came from real incidents
Every runbook here is a postmortem in disguise. If you hit something new, add it — the next on-call will thank you.