agtOS provides a full voice pipeline that converts speech to text, processes it through an LLM, and synthesizes a spoken response. The pipeline supports three distinct audio architectures, each with different latency, cost, and quality trade-offs.
Architecture Overview
The voice pipeline follows agtOS’s dual-layer design:
- Infrastructure Layer handles the technical audio pipeline: VAD, encoding/decoding, transport, and buffering.
- Orchestration Layer drives AI decisions: process speech, call tools, generate responses. The orchestration layer does not know or care which audio architecture is active.
Audio Modes
agtOS supports three audio processing architectures. The default is cascade, which provides the most flexibility and lowest cost.
Cascade
STT → LLM → TTS
Each component is independent and swappable. Best for development, low-cost operation, and maximum flexibility.
~500ms latency | ~$0.15/min
Half-Cascade
Audio LLM → TTS
The LLM processes audio tokens directly, eliminating the STT step. Preserves tone and emphasis.
~200-300ms latency | Lower than cascade
Native Audio
End-to-end model
No separate STT or TTS. The model handles audio input and output natively. Most natural-sounding.
~200-300ms latency | ~$1.50/min
Cascade (Default)
The cascade architecture processes audio through discrete stages:
- VAD (Silero) filters silence and noise locally before sending to the server
- STT (sherpa-onnx, speaches fallback) transcribes speech to text
- LLM (Claude or Ollama) generates a text response, optionally calling tools
- TTS (speaches/Kokoro) synthesizes the response as audio
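The four stages above can be sketched as a simple chain. This is a minimal illustration, not the agtOS implementation: the `CascadeStages` interface and `runCascade` function are hypothetical names, and real VAD works on speech segments rather than filtering frames one by one.

```typescript
// Hypothetical stage interfaces for illustration; agtOS's real types may differ.
interface CascadeStages {
  isSpeech(frame: Float32Array): boolean;              // VAD
  transcribe(audio: Float32Array[]): Promise<string>;  // STT
  respond(text: string): Promise<string>;              // LLM
  synthesize(text: string): Promise<Float32Array>;     // TTS
}

// Run one utterance through VAD -> STT -> LLM -> TTS.
// Returns null when the VAD finds no speech, so the later stages are skipped.
async function runCascade(
  frames: Float32Array[],
  stages: CascadeStages,
): Promise<Float32Array | null> {
  const voiced = frames.filter((frame) => stages.isSpeech(frame));
  if (voiced.length === 0) return null;
  const transcript = await stages.transcribe(voiced);
  const reply = await stages.respond(transcript);
  return stages.synthesize(reply);
}
```

Because each stage sits behind its own method, swapping one provider for another only changes the object passed as `stages`.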
Half-Cascade
The half-cascade architecture eliminates the STT step by using an audio-understanding LLM (such as Ultravox) that processes audio tokens directly. TTS remains separate. This preserves paralinguistic cues (tone, emphasis, hesitation) that text transcription loses.
Native Audio
Native audio models like GPT-4o Realtime and Gemini Live handle the entire audio pipeline end-to-end. No separate STT or TTS. The model directly produces speech output with natural intonation, laughter, and emotional tone.
In-Process Speech Engine (sherpa-onnx)
agtOS can run STT, TTS, and VAD entirely in-process using sherpa-onnx, eliminating the need for an external speaches server. This is configured via environment variables and falls back to speaches automatically if models are not available.
In-Process (sherpa-onnx)
STT + TTS + VAD in Node.js
No Python sidecar. 17+ STT models, 7 TTS families, Silero VAD. True streaming STT with real-time partial results.
External Server (speaches)
OpenAI-compatible HTTP server
Separate Docker container running faster-whisper + Kokoro. Original architecture, still fully supported.
Switching Providers
Model Management
Download models before first use:
Available STT Models
| Model | Size | Languages | Streaming | Best For |
|---|---|---|---|---|
| Moonshine Tiny EN (default) | 102MB | English | No | Fast English transcription |
| SenseVoice INT8 | 155MB | zh, en, ja, ko | No | Multilingual, quality |
| Zipformer Streaming EN | 121MB | English | Yes | Real-time partial results |
True Streaming STT
When the Zipformer streaming model is available, agtOS provides real-time partial transcription results while the user is still speaking. The orchestrator automatically detects streaming capability and feeds audio chunks to the recognizer incrementally.
If the streaming model is not downloaded, agtOS falls back to batch transcription (accumulate audio, transcribe on speech end). Both paths produce the same final result; streaming just provides earlier feedback.
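The choice between the streaming and batch paths might look roughly like the sketch below. The `StreamingRecognizer` and `BatchRecognizer` interfaces and the `UtteranceTranscriber` class are hypothetical stand-ins, not the sherpa-onnx or agtOS API; the point is only that the caller sees the same `push`/`end` surface on both paths.

```typescript
// Hypothetical recognizer shapes, for illustration only.
interface BatchRecognizer {
  transcribe(audio: Float32Array): string;
}
interface StreamingRecognizer {
  accept(chunk: Float32Array): string; // partial transcript so far
  finish(): string;                    // final transcript
}

class UtteranceTranscriber {
  private buffered: Float32Array[] = [];
  private batch: BatchRecognizer;
  private streaming: StreamingRecognizer | null;
  private onPartial: (text: string) => void;

  constructor(
    batch: BatchRecognizer,
    streaming: StreamingRecognizer | null, // null when no streaming model is present
    onPartial: (text: string) => void,
  ) {
    this.batch = batch;
    this.streaming = streaming;
    this.onPartial = onPartial;
  }

  // Called for every audio chunk while the user is speaking.
  push(chunk: Float32Array): void {
    if (this.streaming) {
      this.onPartial(this.streaming.accept(chunk));
    } else {
      this.buffered.push(chunk); // batch path: accumulate until speech end
    }
  }

  // Called when speech ends; returns the final transcript on either path.
  end(): string {
    if (this.streaming) return this.streaming.finish();
    const total = this.buffered.reduce((n, c) => n + c.length, 0);
    const audio = new Float32Array(total);
    let offset = 0;
    for (const chunk of this.buffered) {
      audio.set(chunk, offset);
      offset += chunk.length;
    }
    this.buffered = [];
    return this.batch.transcribe(audio);
  }
}
```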
STT (Speech-to-Text)
STT is provided by sherpa-onnx (default, in-process) or speaches (external server fallback).
sherpa-onnx STT (Default)
The default STT provider runs directly in the Node.js process via ONNX Runtime:
- 17+ models: Moonshine, SenseVoice, Zipformer, Paraformer, Whisper
- True streaming: Partial results while the user is still speaking (Zipformer)
- Word timestamps: Per-word timing for transcript display
- No network overhead: In-process inference, no HTTP round-trips
speaches STT (Fallback)
speaches is an external server that exposes an OpenAI-compatible /v1/audio/transcriptions endpoint:
- GPU-accelerated: CUDA/ROCm inference (4-6x faster than CPU)
- Faster Whisper models: Whisper large-v3, faster-whisper, distil-whisper
- Shared container: Same speaches server handles both STT and TTS
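Since the endpoint follows the OpenAI transcription API shape, a client request can be sketched as a multipart POST with `file` and `model` fields that reads `text` from the JSON response. The function name, the base URL, and the model name below are assumptions for illustration; the fetch implementation is injected so the sketch can be exercised without a running server.

```typescript
// Minimal fetch-like type so a fake can be injected; the global fetch also fits it.
type FetchLike = (
  url: string,
  init: { method: string; body: FormData },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

async function transcribeViaHttp(
  baseUrl: string,      // assumption: wherever the speaches container is reachable
  wav: Blob,            // recorded audio, already encoded as WAV
  model: string,        // assumption: a model name the server has loaded
  fetchImpl: FetchLike,
): Promise<string> {
  const form = new FormData();
  form.append("file", wav, "speech.wav");
  form.append("model", model);

  const res = await fetchImpl(`${baseUrl}/v1/audio/transcriptions`, {
    method: "POST",
    body: form,
  });
  if (!res.ok) throw new Error(`transcription failed: HTTP ${res.status}`);

  // The OpenAI-style JSON response carries the transcript in a `text` field.
  const body = (await res.json()) as { text?: unknown };
  if (typeof body.text !== "string") throw new Error("response has no text field");
  return body.text;
}
```

In a Node.js 18+ process the built-in `fetch` can be passed as `fetchImpl`.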
VAD Pre-Filtering
Voice Activity Detection runs locally (in the Node.js process or on the ESP32 device) before sending audio to the STT server. This prevents transcribing silence and background noise:
- Silero VAD: ONNX model (~2MB), classifies 30ms audio frames in under 1ms
- ESP32 VAD: Energy-based on-device detection for hardware clients
- A 200ms pre-buffer is included before the VAD trigger to avoid clipping the start of speech
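The pre-buffer can be pictured as a small ring of recent non-speech frames that is flushed ahead of the first speech frame. With 30ms frames, 200ms rounds up to 7 frames (about 210ms). The `PreBufferedGate` class below is a hypothetical sketch; a production gate would typically also add hangover time before declaring the end of speech.

```typescript
const FRAME_MS = 30;
const PRE_BUFFER_MS = 200;
const PRE_BUFFER_FRAMES = Math.ceil(PRE_BUFFER_MS / FRAME_MS); // 7 frames

class PreBufferedGate {
  private recent: Float32Array[] = []; // last few frames seen while idle
  private speaking = false;

  // Returns the frames to forward to STT for this input frame (possibly none).
  push(frame: Float32Array, isSpeech: boolean): Float32Array[] {
    if (isSpeech) {
      if (this.speaking) return [frame];
      // Speech just started: send the buffered lead-in first, then this frame.
      this.speaking = true;
      const out = [...this.recent, frame];
      this.recent = [];
      return out;
    }
    // Non-speech: keep only the most recent PRE_BUFFER_FRAMES frames.
    this.speaking = false;
    this.recent.push(frame);
    if (this.recent.length > PRE_BUFFER_FRAMES) this.recent.shift();
    return [];
  }
}
```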
TTS (Text-to-Speech)
TTS is provided by sherpa-onnx (default, in-process) or speaches (external server fallback). The default model is Kokoro INT8, which provides a good balance of quality, speed, and resource usage. Key capabilities:
- Sentence-based streaming: TTS begins on the first complete sentence while the LLM is still generating subsequent sentences, reducing perceived latency
- In-process synthesis: sherpa-onnx runs Kokoro directly in the Node.js process with no network overhead
- Thread pool: Multiple concurrent TTS instances (default: 3) for parallel sentence synthesis
- Configurable voice and speed: SHERPA_TTS_VOICE=af_heart, SHERPA_TTS_SPEED=1.0
The speaches fallback exposes an OpenAI-compatible /v1/audio/speech endpoint with Kokoro ONNX models.
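Parallel sentence synthesis with a fixed pool size can be sketched as a bounded set of workers that write results back by sentence index, so playback order does not depend on which synthesis finishes first. `synthesizeInOrder` and its `synthesize` callback are hypothetical names, and the default of 3 simply mirrors the pool size mentioned above.

```typescript
// Synthesize sentences with at most `poolSize` in flight, returning results
// in sentence order regardless of completion order.
async function synthesizeInOrder<T>(
  sentences: string[],
  synthesize: (sentence: string) => Promise<T>, // hypothetical TTS call
  poolSize = 3,
): Promise<T[]> {
  const results = new Array<T>(sentences.length);
  let next = 0; // index of the next sentence to claim

  async function worker(): Promise<void> {
    while (next < sentences.length) {
      const index = next++;
      const sentence = sentences[index];
      if (sentence === undefined) break;
      results[index] = await synthesize(sentence);
    }
  }

  const workerCount = Math.max(1, Math.min(poolSize, sentences.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```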
Stream Coordinator
The Stream Coordinator is the core integration point for low-latency voice responses. It manages:
- LLM token accumulation as the model streams its response
- Sentence boundary detection to identify complete sentences
- TTS dispatch queue to synthesize sentences as they arrive
- Audio chunk ordering to ensure correct playback sequence
- Interruption handling for barge-in (user speaks while agent is responding)
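Token accumulation and sentence boundary detection can be illustrated with a small accumulator. This is a simplified sketch rather than the coordinator's actual logic: the `SentenceAccumulator` class is hypothetical, and the punctuation-plus-whitespace rule will split incorrectly on abbreviations such as "e.g. ".

```typescript
class SentenceAccumulator {
  private buffer = "";

  // Feed one streamed LLM token; returns any sentences it completed.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // A sentence ends at ., ! or ? followed by whitespace. Requiring the
    // whitespace avoids splitting inside numbers such as "3.14".
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the LLM stream ends to emit the trailing sentence, if any.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }

  // Barge-in: discard text that has not been dispatched to TTS yet.
  reset(): void {
    this.buffer = "";
  }
}
```

Each sentence returned by `push` would be handed to the TTS dispatch queue while later tokens are still arriving.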
Transport
WebSocket Audio Streaming
The primary transport for real-time audio is WebSocket, served on port 3000. Clients connect and stream raw audio frames, receiving synthesized audio in return. The WebSocket transport supports:
- PTT (Push-to-Talk): Client explicitly signals start/stop of speech
- VAD mode: Server-side voice activity detection determines speech boundaries
- Authentication: Optional token validation on WebSocket upgrade
- Session management: Each connection gets a voice session with conversation context
WebRTC signaling is also available for browser-to-browser audio, but WebSocket is the primary transport for the MVP.
Browser Voice Client
The built-in web dashboard includes a voice client that connects via WebSocket with AudioWorklet-based capture (ScriptProcessorNode fallback for older browsers). It supports both PTT and VAD modes with real-time transcript display.
Configuration
Configure the voice pipeline through environment variables:
Pipeline Configuration Object
When initializing the orchestrator programmatically:
Routing rules for audio modes
The voice pipeline supports conditional routing between audio modes based on user preferences or operational constraints:
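As an illustration of conditional routing, the sketch below evaluates an ordered list of rules and returns the first matching mode, with cascade as the default. Everything here is an assumption made for the example: the `RoutingContext` fields, the `RoutingRule` shape, and the `selectMode` function are not the agtOS configuration format, and the cost threshold only reuses the approximate native-audio figure quoted earlier.

```typescript
type AudioMode = "cascade" | "half-cascade" | "native";

// Hypothetical request context; field names are illustrative assumptions.
interface RoutingContext {
  preferredMode?: AudioMode;  // user preference, if any
  maxCostPerMinute?: number;  // operational budget in USD
  nativeAvailable: boolean;   // whether a native audio API is reachable
}

interface RoutingRule {
  name: string;
  when(ctx: RoutingContext): boolean;
  mode: AudioMode;
}

const NATIVE_COST_PER_MINUTE = 1.5; // approximate figure from the Audio Modes section

// Rules are checked in order; the first match wins.
const rules: RoutingRule[] = [
  {
    name: "native-when-preferred-and-affordable",
    when: (c) =>
      c.preferredMode === "native" &&
      c.nativeAvailable &&
      (c.maxCostPerMinute === undefined || c.maxCostPerMinute >= NATIVE_COST_PER_MINUTE),
    mode: "native",
  },
  {
    name: "half-cascade-when-preferred",
    when: (c) => c.preferredMode === "half-cascade",
    mode: "half-cascade",
  },
];

function selectMode(ctx: RoutingContext): AudioMode {
  for (const rule of rules) {
    if (rule.when(ctx)) return rule.mode;
  }
  return "cascade"; // documented default
}
```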
Graceful Degradation
The voice pipeline degrades gracefully when components are unavailable:
- If a native audio API is down, fall back to cascade
- If the local STT/TTS server is down, fall back to cloud providers
- If Redis is unavailable, voice sessions still work (without memory persistence)
- Health checks for each component (STT, TTS, Ollama, Claude) are exposed via the /api/health endpoint
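The fallback behaviour described above amounts to trying providers in priority order and using the first one that succeeds. A minimal sketch, assuming a hypothetical `Provider` interface and `withFallback` helper rather than agtOS's own types:

```typescript
interface Provider<I, O> {
  name: string;
  run(input: I): Promise<O>;
}

// Try each provider in order and return the first successful result,
// along with the name of the provider that produced it.
async function withFallback<I, O>(
  providers: Provider<I, O>[],
  input: I,
): Promise<{ provider: string; output: O }> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      return { provider: provider.name, output: await provider.run(input) };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push(`${provider.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed (${errors.join("; ")})`);
}
```

Returning the provider name makes it possible to log or surface which tier actually served a request.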
What’s next
WebSocket Protocol
Detailed protocol reference for the audio transport layer.
Chat API
Text-based chat endpoint using the same agent reasoning loop.