Voice

Voice is the flagship surface: a family talks to the Pi in their home and hears a reply. This page covers the pipeline, the wake word, and the proactive path. For the hop-by-hop trace, see "A request, end to end".

The pipeline

[Diagram: voice pipeline, mic to speaker]

The Pi runs sudoedge, which handles only audio and wake detection. All STT, LLM, and TTS processing happens in the cloud, in voice-bridge (a livekit-agents worker). The Pi and voice-bridge meet in a LiveKit room named room_<user_id>.
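
Because both sides must derive the same room name independently, the naming rule is worth pinning down in one place. The following is a minimal sketch of that convention, not the project's actual code: the helper names `room_name_for` and `user_id_from_room` are hypothetical, and only the `room_<user_id>` format comes from this page.

```python
ROOM_PREFIX = "room_"  # format taken from this page: room_<user_id>


def room_name_for(user_id: str) -> str:
    """Build the LiveKit room name shared by the Pi and voice-bridge."""
    if not user_id:
        raise ValueError("user_id must be non-empty")
    return f"{ROOM_PREFIX}{user_id}"


def user_id_from_room(room_name: str) -> str:
    """Recover the user id from a room name of the form room_<user_id>."""
    if not room_name.startswith(ROOM_PREFIX) or len(room_name) == len(ROOM_PREFIX):
        raise ValueError(f"not a per-user room name: {room_name!r}")
    return room_name[len(ROOM_PREFIX):]
```

Keeping both directions in one module means the edge and the cloud cannot drift apart on the format.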

Wake word: "hey sudo"

Detected on-device by a small ONNX model at sudoedge/models/hey_sudo.onnx.
A wake un-gates the mic for one turn; the turn ends on a cloud lifecycle event. The cloud owns turn-taking, and the edge holds no silence timers. There is no mid-speech barge-in (no "say hey sudo to interrupt"): the user waits for the short reply, then wakes the device again.
Training the model is its own topic — see Wake-word training.
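
The gating rule above amounts to a two-state machine driven by exactly two events. The sketch below illustrates that idea only; the class and method names (`MicGate`, `on_wake`, `on_cloud_turn_end`) are hypothetical and are not taken from `sudoedge`. Ignoring a wake while the gate is already open is an assumption, made here to match the "no barge-in" behaviour.

```python
from enum import Enum


class GateState(Enum):
    GATED = "gated"  # mic audio is dropped; only the wake model listens
    OPEN = "open"    # mic audio is forwarded for the current turn


class MicGate:
    """One-turn mic gate: opened by a wake, closed by a cloud event."""

    def __init__(self) -> None:
        self.state = GateState.GATED

    def on_wake(self) -> None:
        # Assumption: a wake during an open turn is ignored (no barge-in).
        if self.state is GateState.GATED:
            self.state = GateState.OPEN

    def on_cloud_turn_end(self) -> None:
        # Hypothetical hook for the cloud lifecycle event that ends a turn.
        self.state = GateState.GATED

    def should_forward(self) -> bool:
        # No silence timer: only the two events above ever change the state.
        return self.state is GateState.OPEN
```

The point of the design is what is absent: the edge never decides on its own that the user has stopped speaking.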

Tuned for a noisy family home

The device is for families — adults, ageing parents, and children — in real living rooms, not a quiet developer's desk. Wake sensitivity and turn-taking are tuned for that, not for clean studio speech.

The voice is Indic by default

In production the voice stack is Hindi via Sarvam (STT saaras:v3, TTS bulbul:v3), not English. This matters: any language-specific component (endpointing models, turn-detection, wake tuning) must match the configured language.

Don't default-enable English turn-detection/endpointing

Turn-detection and endpointing models are opt-in and language-matched. An English end-of-utterance model was once enabled by default and broke the live Indic setup. If you touch turn-taking, gate it behind config and match the language.
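
One way to enforce "opt-in and language-matched" is to make the safe state the default and to fail loudly on a mismatch. This is a hedged sketch with hypothetical names (`TurnTakingConfig`, `select_turn_detector`); the language code `"hi"` is a placeholder, and the real configuration keys in voice-bridge may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TurnTakingConfig:
    language: str = "hi"                          # placeholder voice language code
    turn_detector_enabled: bool = False           # opt-in: off unless configured
    turn_detector_language: Optional[str] = None  # language the detector was built for


def select_turn_detector(cfg: TurnTakingConfig) -> Optional[str]:
    """Return the language of the detector to load, or None to load nothing."""
    if not cfg.turn_detector_enabled:
        return None
    if cfg.turn_detector_language != cfg.language:
        # Refuse to start rather than run a mismatched detector on live audio.
        raise ValueError(
            f"turn detector language {cfg.turn_detector_language!r} "
            f"does not match voice language {cfg.language!r}"
        )
    return cfg.turn_detector_language
```

With the defaults, `select_turn_detector(TurnTakingConfig())` returns `None`, so nothing language-specific is loaded unless someone asks for it explicitly.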

Proactive voice

The agent can speak unprompted, for example for a cron reminder or a send_message. That goes through the sudo_voice plugin:

[Diagram: proactive voice, the agent speaks unprompted]

voice-bridge looks up the active AgentSession by room name and dispatches session.say(text) across threads onto the livekit-agents event loop. If the device is offline there is no session, so voice-bridge returns 404 and the agent can fall back to WhatsApp instead.
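
The lookup-then-dispatch step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `cloud/voice_bridge/main.py`: `SessionRegistry` and `SpeakingSession` are hypothetical stand-ins, the session is modelled as any object with a `say(text)` method, and the real `AgentSession` API may differ. Only `asyncio.run_coroutine_threadsafe` is a real standard-library call.

```python
import asyncio
import threading
from typing import Dict, Protocol, Tuple


class SpeakingSession(Protocol):
    def say(self, text: str) -> object: ...


class SessionRegistry:
    """Maps room names to live sessions; a stand-in for voice-bridge state."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop  # the event loop the sessions run on
        self._sessions: Dict[str, SpeakingSession] = {}
        self._lock = threading.Lock()

    def register(self, room: str, session: SpeakingSession) -> None:
        with self._lock:
            self._sessions[room] = session

    def unregister(self, room: str) -> None:
        with self._lock:
            self._sessions.pop(room, None)

    def say(self, room: str, text: str, timeout: float = 10.0) -> Tuple[int, str]:
        """Call from a non-loop thread, e.g. an HTTP request handler."""
        with self._lock:
            session = self._sessions.get(room)
        if session is None:
            return 404, "no active session"  # device offline

        async def _speak() -> None:
            session.say(text)  # executes on the loop thread, not the caller's

        # Hand the work to the sessions' loop and wait until it has run.
        asyncio.run_coroutine_threadsafe(_speak(), self._loop).result(timeout=timeout)
        return 200, "ok"
```

A caller that receives 404 knows the device has no live session and can pick another channel, which is the WhatsApp fallback described above.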

Where to look

| Concern | File |
| --- | --- |
| Voice worker (STT/hermes/TTS, session mgmt) | `cloud/voice_bridge/main.py` |
| Pi-side persistent link + turn loop | `sudoedge/voice_link.py` |
| Pi-side LiveKit speaker + token | `sudoedge/lk_client.py` |
| Earcons / cues (incl. "still thinking") | `sudoedge/announce.py` |
| Wake detection | `sudoedge/wake.py`, `sudoedge/models/hey_sudo.onnx` |
| Audio device selection | `sudoedge/audio_devices.py` (see Audio devices) |

For the original engineering notes, see docs/livekit-setup.md and docs/voice-bridge-sse.md in the repo.