Voice

Voice is the flagship surface: a family talks to the Pi in their home and hears a reply. This page covers the pipeline, the wake word, and the proactive path. For the hop-by-hop trace, see "A request, end to end".

The pipeline

[Figure: Voice pipeline, mic to speaker]

The Pi runs sudoedge, which handles only audio and wake detection. All STT, LLM, and TTS processing happens in the cloud, in voice-bridge (a livekit-agents worker). The Pi and voice-bridge meet inside a LiveKit room named room_<user_id>.
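As a minimal sketch of the room naming convention above: both sides need to derive the same room name from a user ID. The helper names `room_name_for` and `user_id_from_room` are hypothetical and only illustrate the `room_<user_id>` pattern; they are not taken from the repo.

```python
ROOM_PREFIX = "room_"  # from the room_<user_id> convention described above


def room_name_for(user_id: str) -> str:
    """Build the LiveKit room name that the Pi and voice-bridge would share.

    Hypothetical helper: the real code may build this string elsewhere.
    """
    if not user_id:
        raise ValueError("user_id must be non-empty")
    return f"{ROOM_PREFIX}{user_id}"


def user_id_from_room(room_name: str) -> str:
    """Recover the user ID from a room name, rejecting anything malformed."""
    if not room_name.startswith(ROOM_PREFIX) or len(room_name) == len(ROOM_PREFIX):
        raise ValueError(f"not a user room: {room_name!r}")
    return room_name[len(ROOM_PREFIX):]
```

Keeping the prefix in one place means the Pi-side client and the cloud worker cannot drift apart on the format.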

Wake word: "hey sudo"

  • Detected on-device by a small ONNX model at sudoedge/models/hey_sudo.onnx.
  • The detector keeps running even while the agent is speaking, so you can interrupt it. That creates two hazards the code handles: the detector self-triggering on the agent's own audio, and the voice-activity detector tripping on speaker bleed-through.
  • Training the model is its own topic — see Wake-word training.
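One common way to reduce self-triggering is to require a higher wake score while the agent's own audio is playing. The sketch below illustrates that idea only; it is not a description of what sudoedge/wake.py does. The class name `WakeGate` and both threshold values are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WakeGate:
    """Decide whether a wake-word score should count as a real wake.

    All names and numbers here are hypothetical, not taken from sudoedge.
    """

    idle_threshold: float = 0.5      # assumed cutoff when the speaker is silent
    speaking_threshold: float = 0.8  # assumed stricter cutoff during agent playback
    agent_speaking: bool = False     # set by whatever code drives the speaker

    def should_wake(self, score: float) -> bool:
        # While the agent talks, its own audio can leak into the mic,
        # so demand more confidence before treating a score as a wake.
        if self.agent_speaking:
            threshold = self.speaking_threshold
        else:
            threshold = self.idle_threshold
        return score >= threshold
```

With these assumed values, a score of 0.6 wakes the device when it is idle but is ignored during playback, while a confident 0.9 still interrupts the agent.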

Tuned for kids in a noisy home

The device is for families, including children, in real living rooms — not a quiet developer's desk. Wake sensitivity and turn-taking are tuned for that, not for clean studio speech.

The voice is Indic by default

In production the voice stack is Hindi via Sarvam (STT saaras:v3, TTS bulbul:v3), not English. This matters: any language-specific component (endpointing models, turn-detection, wake tuning) must match the configured language.

Don't default-enable English turn-detection/endpointing

Turn-detection and endpointing models are opt-in and language-matched. An English end-of-utterance model was once enabled by default and broke the live Indic setup. If you touch turn-taking, gate it behind config and match the language.
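The opt-in, language-matched rule can be sketched as a small config check. `VoiceConfig`, its field names, and the "hi" language code are assumptions for illustration; the real configuration keys may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VoiceConfig:
    """Hypothetical config shape, not the project's actual schema."""

    language: str = "hi"                           # configured voice language (assumed code)
    turn_detection_language: Optional[str] = None  # None means the model is not opted in


def turn_detection_enabled(cfg: VoiceConfig) -> bool:
    """Enable a turn-detection model only if opted in AND language-matched."""
    if cfg.turn_detection_language is None:
        return False  # opt-in: off unless explicitly configured
    return cfg.turn_detection_language == cfg.language
```

Under this sketch, the default config leaves turn detection off, and an English model configured against a Hindi stack stays disabled instead of silently changing turn-taking.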

Proactive voice

The agent can speak unprompted, for example for a cron reminder or a send_message. That goes through the sudo_voice plugin:

[Figure: Proactive voice, the agent speaks unprompted]

voice-bridge looks up the active AgentSession by room name and dispatches session.say(text) across threads onto the agents loop. If the device is offline there is no session, so voice-bridge returns 404 and the agent can choose WhatsApp instead.
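The lookup-then-dispatch flow can be sketched with the standard library's `asyncio.run_coroutine_threadsafe`. This is an illustration under assumptions, not the code in cloud/voice_bridge/main.py: `StubSession` stands in for the real AgentSession (whose `say` signature may differ), and `say_in_room`, `start_background_loop`, and the plain dict registry are hypothetical.

```python
import asyncio
import threading
from typing import Dict, List


class StubSession:
    """Stand-in for an AgentSession; it only records what it was asked to say."""

    def __init__(self) -> None:
        self.spoken: List[str] = []

    async def say(self, text: str) -> None:
        self.spoken.append(text)


def start_background_loop() -> asyncio.AbstractEventLoop:
    """Run an event loop on a daemon thread, mimicking a worker's own loop."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def say_in_room(
    sessions: Dict[str, StubSession],
    loop: asyncio.AbstractEventLoop,
    room_name: str,
    text: str,
) -> int:
    """Return an HTTP-style status: 404 if no session exists, 200 once spoken."""
    session = sessions.get(room_name)
    if session is None:
        return 404  # device offline: the caller can pick another channel
    # Hand the coroutine to the loop's thread and wait for it to complete.
    future = asyncio.run_coroutine_threadsafe(session.say(text), loop)
    future.result(timeout=5)
    return 200
```

The caller's thread never touches the session directly; it only schedules work on the loop that owns the session, which is the point of dispatching across threads.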

Where to look

Concern                                        File
Voice worker (STT/hermes/TTS, session mgmt)    cloud/voice_bridge/main.py
Pi-side LiveKit client + cues                  sudoedge/lk_client.py
Wake detection                                 sudoedge/wake.py, sudoedge/models/hey_sudo.onnx
Audio device selection                         sudoedge/audio_devices.py (see Audio devices)

For the original engineering notes, see docs/livekit-setup.md, docs/wake-interrupt.md, and docs/voice-bridge-sse.md in the repo.