Voice Pipeline

agtOS provides a full voice pipeline that converts speech to text, processes it through an LLM, and synthesizes a spoken response. The pipeline supports three distinct audio architectures, each with different latency, cost, and quality trade-offs.

Architecture Overview

The voice pipeline follows agtOS’s dual-layer design:

Infrastructure Layer handles the technical audio pipeline: VAD, encoding/decoding, transport, and buffering.
Orchestration Layer drives AI decisions: process speech, call tools, generate responses. The orchestration layer does not know or care which audio architecture is active.

This separation means switching between audio modes is a configuration change, not an architectural change.

Audio In → [VAD] → [STT] → [LLM + Tools] → [TTS] → Audio Out
              │         │          │              │
              └─ Infrastructure Layer ─────────────┘
                         │
              ┌─ Orchestration Layer ──┐
              │  "User said X"         │
              │  "Respond with Y"      │
              │  "Call tool Z"         │
              └────────────────────────┘
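
One way to picture this separation is an interface boundary that the orchestration code never looks behind. The sketch below is illustrative only: `AudioInfrastructure`, `TurnOrchestrator`, and `makeToyInfrastructure` are hypothetical names, not the agtOS API, and treating every mode as "audio to text, text to audio" is a simplification (a native audio model produces speech directly).

```typescript
// Hypothetical interfaces for illustration; not the agtOS API.
type Audio = number[]; // stand-in for PCM samples

// Infrastructure layer: everything that touches audio.
interface AudioInfrastructure {
  readonly mode: "cascade" | "half_cascade" | "native";
  audioToText(input: Audio): string;
  textToAudio(reply: string): Audio;
}

// Orchestration layer: decides what to say and never inspects `mode`.
class TurnOrchestrator {
  private readonly infra: AudioInfrastructure;

  constructor(infra: AudioInfrastructure) {
    this.infra = infra;
  }

  handleTurn(input: Audio, decide: (userSaid: string) => string): Audio {
    const userSaid = this.infra.audioToText(input);
    return this.infra.textToAudio(decide(userSaid));
  }
}

// Toy backend that treats character codes as "audio" so the sketch runs.
function makeToyInfrastructure(
  mode: AudioInfrastructure["mode"],
): AudioInfrastructure {
  return {
    mode,
    audioToText: (input) => String.fromCharCode(...input),
    textToAudio: (reply) => Array.from(reply, (ch) => ch.charCodeAt(0)),
  };
}
```

Swapping `makeToyInfrastructure("cascade")` for `makeToyInfrastructure("native")` leaves `TurnOrchestrator` untouched, which is the property the dual-layer design is after.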

Audio Modes

agtOS supports three audio processing architectures. The default is cascade, which provides the most flexibility and lowest cost.

Cascade

STT → LLM → TTS
Each component is independent and swappable. Best for development, low-cost operation, and maximum flexibility.
~500ms latency | ~$0.15/min

Half-Cascade

Audio LLM → TTS
The LLM processes audio tokens directly, eliminating the STT step. Preserves tone and emphasis.
~200-300ms latency | Lower than cascade

Native Audio

End-to-end model
No separate STT or TTS. The model handles audio input and output natively. Most natural-sounding.
~200-300ms latency | ~$1.50/min

Cascade (Default)

The cascade architecture processes audio through discrete stages:

VAD (Silero) filters silence and noise locally before sending to the server
STT (sherpa-onnx, speaches fallback) transcribes speech to text
LLM (Claude or Ollama) generates a text response, optionally calling tools
TTS (speaches/Kokoro) synthesizes the response as audio

Each component can be swapped independently. For example, you can use a local Ollama model for the LLM while keeping speaches for STT and TTS.

Half-Cascade

The half-cascade architecture eliminates the STT step by using an audio-understanding LLM (such as Ultravox) that processes audio tokens directly. TTS remains separate. This preserves paralinguistic cues (tone, emphasis, hesitation) that text transcription loses.

Native Audio

Native audio models like GPT-4o Realtime and Gemini Live handle the entire audio pipeline end-to-end. No separate STT or TTS. The model directly produces speech output with natural intonation, laughter, and emotional tone.

Native audio models are currently the most expensive option (~$1.50/min, roughly 10x cascade). Costs are dropping rapidly, but this mode is best suited for scenarios where latency and naturalness matter more than cost.
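
To make the cost gap concrete, the per-minute figures quoted above can be turned into a small estimator. This is a rough sketch: the rates are the approximate numbers from this page, `estimateCostUsd` is a hypothetical helper, and half-cascade is left out because no per-minute figure is given for it.

```typescript
// Approximate per-minute rates quoted in this document (USD).
const COST_PER_MINUTE_USD = { cascade: 0.15, native: 1.5 } as const;
type PricedMode = keyof typeof COST_PER_MINUTE_USD;

// Estimate the cost of a voice session of the given length.
function estimateCostUsd(mode: PricedMode, minutes: number): number {
  if (minutes < 0) throw new RangeError("minutes must be non-negative");
  return COST_PER_MINUTE_USD[mode] * minutes;
}

// How many times more expensive native audio is than cascade.
function nativeToCascadeRatio(): number {
  return COST_PER_MINUTE_USD.native / COST_PER_MINUTE_USD.cascade;
}
```

At these rates a ten-minute conversation comes to about $1.50 in cascade mode and about $15 in native mode.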

In-Process Speech Engine (sherpa-onnx)

agtOS can run STT, TTS, and VAD entirely in-process using sherpa-onnx, eliminating the need for an external speaches server. This is configured via environment variables and falls back to speaches automatically if models are not available.

In-Process (sherpa-onnx)

STT + TTS + VAD in Node.js
No Python sidecar. 17+ STT models, 7 TTS families, Silero VAD. True streaming STT with real-time partial results.

External Server (speaches)

OpenAI-compatible HTTP server
Separate Docker container running faster-whisper + Kokoro. Original architecture, still fully supported.

Switching Providers

# Use in-process sherpa-onnx (default, no external server needed)
STT_PROVIDER=sherpa-onnx
TTS_PROVIDER=sherpa-onnx

# Use external speaches server (fallback)
STT_PROVIDER=speaches
TTS_PROVIDER=speaches
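
The selection and automatic fallback described above could look roughly like the following. This is a sketch under assumptions: `resolveProvider` and the `localModelsAvailable` flag are hypothetical, and how agtOS actually detects missing models is not shown here.

```typescript
type SpeechProvider = "sherpa-onnx" | "speaches";

// Pick a provider from an env value, falling back to speaches when the
// in-process models have not been downloaded.
function resolveProvider(
  requested: string | undefined,
  localModelsAvailable: boolean,
): SpeechProvider {
  // Anything other than an explicit "speaches" is treated as the default.
  const wanted: SpeechProvider =
    requested === "speaches" ? "speaches" : "sherpa-onnx";
  if (wanted === "sherpa-onnx" && !localModelsAvailable) {
    return "speaches";
  }
  return wanted;
}
```

In a Node.js process this would be called as `resolveProvider(process.env.STT_PROVIDER, modelsPresent)`, and again with `TTS_PROVIDER` for the TTS side.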

Model Management

Download models before first use:

# Download the default model set (~460MB)
npx agtos models download --default

# List available models
npx agtos models list

# Download a specific model
npx agtos models download sensevoice-int8

Available STT Models

Model	Size	Languages	Streaming	Best For
Moonshine Tiny EN (default)	102MB	English	No	Fast English transcription
SenseVoice INT8	155MB	zh, en, ja, ko	No	Multilingual, quality
Zipformer Streaming EN	121MB	English	Yes	Real-time partial results

True Streaming STT

When the Zipformer streaming model is available, agtOS provides real-time partial transcription results while the user is still speaking. The orchestrator automatically detects streaming capability and feeds audio chunks to the recognizer incrementally.

User speaking: "What's the wea..."  → partial: "What's the wea"
User speaking: "What's the weather"  → partial: "What's the weather"
User stops speaking                  → final: "What's the weather like today?"
                                     → triggers LLM processing

If the streaming model is not downloaded, agtOS falls back to batch transcription (accumulate audio, transcribe on speech end). Both paths produce the same final result; streaming just provides earlier feedback.
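
The two paths can be sketched with a single loop that emits partial results only when the recognizer supports them. The `Recognizer` interface and `transcribeTurn` function are hypothetical stand-ins for illustration, not the sherpa-onnx or agtOS API.

```typescript
type Chunk = Float32Array;

// Hypothetical recognizer shape: `acceptChunk` exists only on
// streaming-capable models; `finalize` exists on every model.
interface Recognizer {
  acceptChunk?(chunk: Chunk): string; // partial transcript so far
  finalize(allChunks: Chunk[]): string; // final transcript
}

// Process one user turn. Audio is always accumulated so the batch path
// works; partials are reported only when streaming is available.
function transcribeTurn(
  chunks: Chunk[],
  recognizer: Recognizer,
  onPartial: (text: string) => void,
): string {
  const buffered: Chunk[] = [];
  for (const chunk of chunks) {
    buffered.push(chunk);
    if (recognizer.acceptChunk) {
      onPartial(recognizer.acceptChunk(chunk));
    }
  }
  // Speech ended: both paths finish with a single final transcript.
  return recognizer.finalize(buffered);
}
```

With a batch-only recognizer, `onPartial` is never called and the caller sees only the final transcript.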

STT (Speech-to-Text)

STT is provided by sherpa-onnx (default, in-process) or speaches (external server fallback).

sherpa-onnx STT (Default)

The default STT provider runs directly in the Node.js process via ONNX Runtime:

17+ models: Moonshine, SenseVoice, Zipformer, Paraformer, Whisper
True streaming: Partial results while the user is still speaking (Zipformer)
Word timestamps: Per-word timing for transcript display
No network overhead: In-process inference, no HTTP round-trips

speaches STT (Fallback)

speaches is an external server that exposes an OpenAI-compatible /v1/audio/transcriptions endpoint:

GPU-accelerated: CUDA/ROCm inference (4-6x faster than CPU)
Faster Whisper models: Whisper large-v3, faster-whisper, distil-whisper
Shared container: Same speaches server handles both STT and TTS

VAD Pre-Filtering

Voice Activity Detection runs locally (in the Node.js process or on the ESP32 device) before sending audio to the STT server. This prevents transcribing silence and background noise:

Silero VAD: ONNX model (~2MB), classifies 30ms audio frames in under 1ms
ESP32 VAD: Energy-based on-device detection for hardware clients
A 200ms pre-buffer is included before the VAD trigger to avoid clipping the start of speech
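
A pre-buffer of this kind is usually a small rolling window of recent frames that is flushed when the VAD fires. The sketch below assumes 30ms frames, so 200ms rounds up to 7 frames; `PreBufferedGate` is a hypothetical class, and it omits the hangover logic a production VAD gate would normally add at the end of speech.

```typescript
const FRAME_MS = 30;
const PRE_BUFFER_MS = 200;
const PRE_BUFFER_FRAMES = Math.ceil(PRE_BUFFER_MS / FRAME_MS); // 7 frames

// Gate that forwards speech frames, plus the few frames just before the
// VAD trigger so the start of an utterance is not clipped.
class PreBufferedGate<T> {
  private recent: T[] = [];
  private speaking = false;
  private readonly capacity: number;

  constructor(capacity: number = PRE_BUFFER_FRAMES) {
    this.capacity = capacity;
  }

  // Returns the frames to forward to STT for this input frame (maybe none).
  push(frame: T, isSpeech: boolean): T[] {
    if (this.speaking) {
      if (isSpeech) return [frame];
      // Speech ended: start a fresh rolling window with this frame.
      this.speaking = false;
      this.recent = [frame];
      return [];
    }
    if (isSpeech) {
      // VAD trigger: flush the rolling window ahead of the first frame.
      this.speaking = true;
      const out = [...this.recent, frame];
      this.recent = [];
      return out;
    }
    // Silence: keep only the most recent `capacity` frames.
    this.recent.push(frame);
    if (this.recent.length > this.capacity) this.recent.shift();
    return [];
  }
}
```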

TTS (Text-to-Speech)

TTS is provided by sherpa-onnx (default, in-process) or speaches (external server fallback). The default model is Kokoro INT8, which provides a good balance of quality, speed, and resource usage. Key capabilities:

Sentence-based streaming: TTS begins on the first complete sentence while the LLM is still generating subsequent sentences, reducing perceived latency
In-process synthesis: sherpa-onnx runs Kokoro directly in the Node.js process with no network overhead
Thread pool: Multiple concurrent TTS instances (default: 3) for parallel sentence synthesis
Configurable voice and speed: SHERPA_TTS_VOICE=af_heart, SHERPA_TTS_SPEED=1.0

When using the speaches fallback, TTS uses the OpenAI-compatible /v1/audio/speech endpoint with Kokoro ONNX models.
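
Sentence-based streaming depends on cutting the LLM's token stream into sentences as early as possible. A minimal sketch of that step follows; `SentenceAccumulator` is a hypothetical name, and the boundary rule (terminal punctuation followed by whitespace) is an assumption that will mis-split abbreviations such as "Dr. Smith".

```typescript
// Accumulates streamed LLM tokens and emits complete sentences.
class SentenceAccumulator {
  private buffer = "";

  // Feed one streamed token; returns any sentences it completed.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    // Terminal punctuation followed by whitespace marks a boundary.
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    // Keep the unfinished tail for the next token.
    this.buffer = this.buffer.slice(start);
    return sentences;
  }

  // Call when the LLM stream ends to emit any trailing text.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest ? [rest] : [];
  }
}
```

Each sentence returned by `push` can be handed to TTS immediately, while later tokens are still arriving.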

Stream Coordinator

The Stream Coordinator is the core integration point for low-latency voice responses. It manages:

LLM token accumulation as the model streams its response
Sentence boundary detection to identify complete sentences
TTS dispatch queue to synthesize sentences as they arrive
Audio chunk ordering to ensure correct playback sequence
Interruption handling for barge-in (user speaks while agent is responding)

This approach means TTS synthesis starts on the first sentence while the LLM is still generating the rest of the response, achieving sub-second perceived latency.
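
When several sentences are synthesized in parallel, a later sentence can finish before an earlier one, so the audio has to be re-sequenced before playback. The sketch below shows one way to do that; `OrderedAudioQueue` is a hypothetical class, and clearing the queue on barge-in is an assumption about how interruption might be handled.

```typescript
// Releases synthesized audio strictly in sentence order, even when
// synthesis jobs complete out of order.
class OrderedAudioQueue<T> {
  private pending = new Map<number, T>();
  private next = 0;

  // Accept the audio for sentence `index`; return everything that is
  // now playable, in order.
  add(index: number, audio: T): T[] {
    this.pending.set(index, audio);
    const ready: T[] = [];
    while (this.pending.has(this.next)) {
      ready.push(this.pending.get(this.next) as T);
      this.pending.delete(this.next);
      this.next++;
    }
    return ready;
  }

  // Barge-in: drop audio not yet released and start a new response.
  reset(): void {
    this.pending.clear();
    this.next = 0;
  }
}
```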

Transport

WebSocket Audio Streaming

The primary transport for real-time audio is WebSocket, served on port 3000. Clients connect and stream raw audio frames, receiving synthesized audio in return. The WebSocket transport supports:

PTT (Push-to-Talk): Client explicitly signals start/stop of speech
VAD mode: Server-side voice activity detection determines speech boundaries
Authentication: Optional token validation on WebSocket upgrade
Session management: Each connection gets a voice session with conversation context

WebRTC signaling is also available for browser-to-browser audio, but WebSocket is the primary transport for the MVP.

Browser Voice Client

The built-in web dashboard includes a voice client that connects via WebSocket with AudioWorklet-based capture (ScriptProcessorNode fallback for older browsers). It supports both PTT and VAD modes with real-time transcript display.

Configuration

Configure the voice pipeline through environment variables:

# Voice server port
VOICE_PORT=3000

# STT settings
SPEACHES_STT_MODEL=whisper-large-v3
SPEACHES_URL=http://localhost:8000

# TTS settings
SPEACHES_TTS_VOICE=af_heart
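
These variables could be read into a typed object along the following lines. This is only a sketch: `loadVoiceConfig` is a hypothetical helper, and using the values shown above as defaults when a variable is unset is an assumption, not documented agtOS behaviour.

```typescript
interface VoiceEnvConfig {
  port: number;
  speachesUrl: string;
  sttModel: string;
  ttsVoice: string;
}

// Read voice settings from an environment-like map, validating the port.
function loadVoiceConfig(
  env: Record<string, string | undefined>,
): VoiceEnvConfig {
  const port = Number(env.VOICE_PORT ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`Invalid VOICE_PORT: ${env.VOICE_PORT}`);
  }
  return {
    port,
    // Assumed defaults, taken from the example values in this section.
    speachesUrl: env.SPEACHES_URL ?? "http://localhost:8000",
    sttModel: env.SPEACHES_STT_MODEL ?? "whisper-large-v3",
    ttsVoice: env.SPEACHES_TTS_VOICE ?? "af_heart",
  };
}
```

In Node.js the call would be `loadVoiceConfig(process.env)`.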

Pipeline Configuration Object

When initializing the orchestrator programmatically:

const orchestrator = new VoicePipelineOrchestrator({
  webrtc: {
    signalingPort: 3000,
  },
  stt: {
    provider: 'sherpa-onnx',  // or 'speaches' for external server
    model: 'moonshine-tiny-en-int8',
  },
  tts: {
    provider: 'sherpa-onnx',  // or 'speaches' for external server
    voice: 'af_heart',
  },
  command: {
    provider: 'claude',  // or 'ollama'
  },
  redisUrl: 'redis://localhost:6379',
});

Routing Rules for Audio Modes

The voice pipeline supports conditional routing between audio modes based on user preferences or operational constraints:

voice:
  default_variant: cascade
  variants:
    cascade:
      stt: sherpa-onnx
      llm: claude-haiku-4.5
      tts: sherpa-onnx
    half_cascade:
      audio_llm: ultravox
      tts: sherpa-onnx
    native:
      provider: gemini-live
      model: gemini-2.5-flash
  routing:
    rules:
      - condition: "user.preference == 'natural'"
        variant: native
      - condition: "cost_budget < 0.50/hour"
        variant: cascade
      - condition: "emotion_detection_required"
        variant: half_cascade
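
Rule evaluation of this kind can be sketched as a first-match scan with a default. The example below mirrors the three rules above as plain predicates rather than parsing the condition strings; `RoutingContext`, `selectVariant`, and the first-match-wins ordering are assumptions, not documented agtOS behaviour.

```typescript
type Variant = "cascade" | "half_cascade" | "native";

// Hypothetical inputs the conditions above appear to depend on.
interface RoutingContext {
  userPreference?: string;
  costBudgetPerHour?: number;
  emotionDetectionRequired?: boolean;
}

interface RoutingRule {
  when: (ctx: RoutingContext) => boolean;
  variant: Variant;
}

// The three rules from the configuration, in the same order.
const routingRules: RoutingRule[] = [
  { when: (c) => c.userPreference === "natural", variant: "native" },
  {
    when: (c) => c.costBudgetPerHour !== undefined && c.costBudgetPerHour < 0.5,
    variant: "cascade",
  },
  { when: (c) => c.emotionDetectionRequired === true, variant: "half_cascade" },
];

// Return the variant of the first matching rule, else the default.
function selectVariant(
  ctx: RoutingContext,
  defaultVariant: Variant = "cascade",
): Variant {
  for (const rule of routingRules) {
    if (rule.when(ctx)) return rule.variant;
  }
  return defaultVariant;
}
```

Under first-match semantics, a user who prefers natural speech is routed to native audio even when a tight cost budget is also set, so rule order matters.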

Graceful Degradation

The voice pipeline degrades gracefully when components are unavailable:

If a native audio API is down, fall back to cascade
If the local STT/TTS server is down, fall back to cloud providers
If Redis is unavailable, voice sessions still work (without memory persistence)
Health checks for each component (STT, TTS, Ollama, Claude) are exposed via the /api/health endpoint
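
The fallback behaviour above amounts to walking a preference list and taking the first option whose health check passes. A minimal sketch follows; `firstHealthy` and the shape of the health map are hypothetical, since the response format of /api/health is not described here.

```typescript
// Component name -> whether its most recent health check passed.
type HealthStatus = Record<string, boolean>;

// Return the first preferred option that is currently healthy,
// or undefined when none of them are.
function firstHealthy(
  preferred: string[],
  health: HealthStatus,
): string | undefined {
  return preferred.find((name) => health[name] === true);
}
```

For example, `firstHealthy(["native", "cascade"], { native: false, cascade: true })` selects cascade when the native audio API is down.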
