Runbooks

Field-tested triage for the failures we've actually hit. When something's on fire, start here. Each runbook says symptom → likely cause → fix.

"All surfaces are dead at once"

Symptom: voice, chat, and WhatsApp all stop replying simultaneously.

Because the LLM key is global, one bad key breaks everything. Two known causes:

  1. LLM key 402 / over quota. A free-tier or expired OpenRouter key starts returning 402. The OpenRouter path does not auto-shrink the request the way the Anthropic path does, so it just fails. Fix: rotate the key at /admin/settings (which triggers restart_all_runtimes() fleet-wide). See Secrets.
  2. A dead WhatsApp pairing crash-kills the whole gateway. A stale Baileys-style pairing can take the entire hermes gateway down (not just WhatsApp). Fix: ensure the legacy Baileys platform stays disabled; we're on Twilio now. See WhatsApp.

Triage order: check the LLM key first (Grafana errors / hermes logs in Loki), then confirm no gateway is crash-looping.

"Voice does nothing on a Pi"

Symptom: the device boots but never responds to "hey sudo," or connects but is silent.

Suspect audio first, before the cloud:

  1. Did it play the ready cue? If only the boot cue plays, the mic never came live: either an audio device selection problem or a reSpeaker that needs flashing (arecord -l shows no capture device).
  2. If the mic is live but the device stays silent, check the LiveKit webhook telemetry; "Pi thinks it connected but voice is silent" shows up there. See Observability.
  3. Check sample-rate / device-index handling for that specific hardware.
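
The capture-device check in step 1 can be scripted. This is a minimal sketch, assuming alsa-utils is installed on the Pi and that arecord -l prints its usual "card N: ... device M: ..." lines. The function names are illustrative and not part of sudoedge.

```python
import re
import subprocess

# Matches arecord -l lines such as:
#   card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
_CARD_LINE = re.compile(r"^card (\d+): (\S+) \[(.*?)\], device (\d+):")


def parse_capture_devices(arecord_output: str) -> list[tuple[int, int, str]]:
    """Return (card, device, card_name) for each capture device listed."""
    devices = []
    for line in arecord_output.splitlines():
        match = _CARD_LINE.match(line.strip())
        if match:
            card, _short_id, name, device = match.groups()
            devices.append((int(card), int(device), name))
    return devices


def list_capture_devices() -> list[tuple[int, int, str]]:
    """Run `arecord -l` and parse its stdout (hypothetical helper)."""
    result = subprocess.run(
        ["arecord", "-l"], capture_output=True, text=True, check=False
    )
    return parse_capture_devices(result.stdout)
```

An empty list from list_capture_devices() corresponds to the "no capture device" case above, so audio is the first suspect before anything in the cloud.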

"Mic wedged — no frames at all"

Symptom: the XVF3800 DSP wedges and delivers no frames (frozen hw_ptr).

This is a documented hardware wedge. The edge now recovers the wedged mic in seconds and never cues 'ready' before the mic is actually live, so a wedge no longer looks like a healthy device. If you see it persist, capture logs and check the recovery path in sudoedge. (Background: repo commits and docs/ cover the XVF3800 wedge, recovery, and cue gating.)

"UndefinedTableError 500 right after a deploy"

Symptom: an endpoint 500s with UndefinedTableError immediately after deploying.

Cause: a migration is sitting unapplied; migrations don't auto-apply. Fix: paste the pending supabase/migrations/000N_*.sql into Supabase Studio. See Database & migrations.
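
To find which file is pending, it can help to diff the migrations directory against what has already been pasted into Studio. This is a rough sketch only: the applied set is a hypothetical input you maintain yourself (nothing in this runbook says where applied migrations are recorded), and the function names are illustrative.

```python
from pathlib import Path


def list_migrations(migrations_dir: str) -> list[str]:
    """Return the .sql filenames in a migrations directory, sorted by name."""
    return sorted(p.name for p in Path(migrations_dir).glob("*.sql"))


def pending_migrations(all_names: list[str], applied: set[str]) -> list[str]:
    """Return filenames not yet applied, in sorted (numeric-prefix) order.

    `applied` is an assumption: a hand-maintained set of filenames that
    have already been run in Supabase Studio.
    """
    return [name for name in sorted(all_names) if name not in applied]
```

Usage would look like pending_migrations(list_migrations("supabase/migrations"), applied), with the result applied in order from lowest prefix to highest.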

"Orphan hermes containers on the VPS"

Symptom: docker ps shows hermes-user-usr_* containers.

Real per-user containers are named hermes-user-<32-char uuid hex> (no underscores after the prefix). Anything hermes-user-usr_* is an orphan from an older deploy — safe to docker rm -f. See Provisioning.
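
The naming rule is easy to check mechanically before removing anything. This is a small sketch, assuming the uuid hex is lowercase and that container names come from something like docker ps --format '{{.Names}}'. The classify function is illustrative, not an existing helper.

```python
import re

# Real per-user containers: prefix plus exactly 32 hex characters.
# Lowercase hex is an assumption; widen the class if yours differ.
_REAL_CONTAINER = re.compile(r"hermes-user-[0-9a-f]{32}")
_ORPHAN_PREFIX = "hermes-user-usr_"


def classify(name: str) -> str:
    """Classify a container name as 'real', 'orphan', or 'other'."""
    if _REAL_CONTAINER.fullmatch(name):
        return "real"
    if name.startswith(_ORPHAN_PREFIX):
        return "orphan"
    return "other"


def orphans(names: list[str]) -> list[str]:
    """Return only the names that match the orphan pattern."""
    return [name for name in names if classify(name) == "orphan"]
```

Only names that classify as "orphan" are candidates for docker rm -f. Anything that comes back as "other" is worth a look by hand rather than automatic removal.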

General triage toolkit

  • Logs: Grafana → Loki (grafana.sudohomes.com), or make prod-logs.
  • Per-turn trace: query public.telemetry_events by turn_id to see where a turn died. See Observability.
  • A specific deploy: gh run list --workflow=deploy-cloud.yml, gh run watch <id>.
  • Container state on the VPS: make prod-status.

These came from real incidents

Every runbook here is a postmortem in disguise. If you hit something new, add it — the next on-call will thank you.