agtOS provides a full voice pipeline that converts speech to text, processes it through an LLM, and synthesizes a spoken response. The pipeline supports three distinct audio architectures, each with different latency, cost, and quality trade-offs.

Architecture Overview

The voice pipeline follows agtOS’s dual-layer design:
  • Infrastructure Layer handles the technical audio pipeline: VAD, encoding/decoding, transport, and buffering.
  • Orchestration Layer drives AI decisions: process speech, call tools, generate responses. The orchestration layer does not know or care which audio architecture is active.
This separation means switching between audio modes is a configuration change, not an architectural change.
Audio In → [VAD] → [STT] → [LLM + Tools] → [TTS] → Audio Out
              │         │          │              │
              └─ Infrastructure Layer ─────────────┘

              ┌─ Orchestration Layer ──┐
              │  "User said X"         │
              │  "Respond with Y"      │
              │  "Call tool Z"         │
              └────────────────────────┘

Audio Modes

agtOS supports three audio processing architectures. The default is cascade, which provides the most flexibility and lowest cost.

Cascade

STT → LLM → TTS
Each component is independent and swappable. Best for development, low-cost operation, and maximum flexibility.
~500ms latency | ~$0.15/min

Half-Cascade

Audio LLM → TTS
The LLM processes audio tokens directly, eliminating the STT step. Preserves tone and emphasis.
~200-300ms latency | Lower than cascade

Native Audio

End-to-end model
No separate STT or TTS. The model handles audio input and output natively. Most natural-sounding.
~200-300ms latency | ~$1.50/min

Cascade (Default)

The cascade architecture processes audio through discrete stages:
  1. VAD (Silero) filters out silence and noise locally, before audio reaches the STT stage
  2. STT (sherpa-onnx, speaches fallback) transcribes speech to text
  3. LLM (Claude or Ollama) generates a text response, optionally calling tools
  4. TTS (Kokoro via sherpa-onnx, speaches fallback) synthesizes the response as audio
Each component can be swapped independently. For example, you can use a local Ollama model for the LLM while keeping speaches for STT and TTS.
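To illustrate why the stages are independently swappable, here is a minimal sketch of one cascade turn. The interfaces and the `runCascadeTurn` function are hypothetical names chosen for this example, not the agtOS API; the point is only that each stage depends on the output type of the previous one, not on its implementation.

```typescript
// Hypothetical stage interfaces (illustrative names, not the agtOS API).
interface SttStage {
  transcribe(audio: Float32Array): Promise<string>;
}
interface LlmStage {
  complete(prompt: string): Promise<string>;
}
interface TtsStage {
  synthesize(text: string): Promise<Float32Array>;
}
interface CascadeStages {
  stt: SttStage;
  llm: LlmStage;
  tts: TtsStage;
}

// One conversational turn: speech in, synthesized speech out.
// VAD is assumed to have already trimmed the input to a single utterance.
async function runCascadeTurn(
  stages: CascadeStages,
  speech: Float32Array,
): Promise<Float32Array> {
  const transcript = await stages.stt.transcribe(speech);
  const reply = await stages.llm.complete(transcript);
  return stages.tts.synthesize(reply);
}
```

Under this shape, replacing Claude with a local Ollama model would mean passing a different `llm` object while leaving `stt` and `tts` untouched.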

Half-Cascade

The half-cascade architecture eliminates the STT step by using an audio-understanding LLM (such as Ultravox) that processes audio tokens directly. TTS remains separate. This preserves paralinguistic cues (tone, emphasis, hesitation) that text transcription loses.

Native Audio

Native audio models like GPT-4o Realtime and Gemini Live handle the entire audio pipeline end-to-end, with no separate STT or TTS stage. The model directly produces speech output with natural intonation, laughter, and emotional tone.
Native audio models are currently the most expensive option (~$1.50/min, roughly 10x cascade). Costs are dropping rapidly, but this mode is best suited for scenarios where latency and naturalness matter more than cost.

In-Process Speech Engine (sherpa-onnx)

agtOS can run STT, TTS, and VAD entirely in-process using sherpa-onnx, eliminating the need for an external speaches server. This is configured via environment variables and falls back to speaches automatically if models are not available.

In-Process (sherpa-onnx)

STT + TTS + VAD in Node.js
No Python sidecar. 17+ STT models, 7 TTS families, Silero VAD. True streaming STT with real-time partial results.

External Server (speaches)

OpenAI-compatible HTTP server
Separate Docker container running faster-whisper + Kokoro. Original architecture, still fully supported.

Switching Providers

# Use in-process sherpa-onnx (default, no external server needed)
STT_PROVIDER=sherpa-onnx
TTS_PROVIDER=sherpa-onnx

# Use external speaches server (fallback)
STT_PROVIDER=speaches
TTS_PROVIDER=speaches
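
The selection and automatic fallback described above could be modeled roughly as follows. This is a sketch under assumptions: `resolveSpeechProvider` and the `modelsAvailable` flag are hypothetical, and the real check for downloaded models is not shown in this page.

```typescript
type SpeechProvider = "sherpa-onnx" | "speaches";

// requested: the raw value of STT_PROVIDER or TTS_PROVIDER (may be unset).
// modelsAvailable: whether the sherpa-onnx model files were found locally
// (how agtOS determines this is an assumption left out of the sketch).
function resolveSpeechProvider(
  requested: string | undefined,
  modelsAvailable: boolean,
): SpeechProvider {
  // sherpa-onnx is the documented default when nothing else is requested.
  const wanted: SpeechProvider =
    requested === "speaches" ? "speaches" : "sherpa-onnx";
  // Fall back to the external server when in-process models are missing.
  if (wanted === "sherpa-onnx" && !modelsAvailable) {
    return "speaches";
  }
  return wanted;
}
```

In a Node.js process this might be called as `resolveSpeechProvider(process.env.STT_PROVIDER, modelsOnDisk)`.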

Model Management

Download models before first use:
# Download the default model set (~460MB)
npx agtos models download --default

# List available models
npx agtos models list

# Download a specific model
npx agtos models download sensevoice-int8

Available STT Models

Model                       | Size  | Languages      | Streaming | Best For
Moonshine Tiny EN (default) | 102MB | English        | No        | Fast English transcription
SenseVoice INT8             | 155MB | zh, en, ja, ko | No        | Multilingual, quality
Zipformer Streaming EN      | 121MB | English        | Yes       | Real-time partial results

True Streaming STT

When the Zipformer streaming model is available, agtOS provides real-time partial transcription results while the user is still speaking. The orchestrator automatically detects streaming capability and feeds audio chunks to the recognizer incrementally.
User speaking: "What's the wea..."  → partial: "What's the wea"
User speaking: "What's the weather"  → partial: "What's the weather"
User stops speaking                  → final: "What's the weather like today?"
                                     → triggers LLM processing
If the streaming model is not downloaded, agtOS falls back to batch transcription (accumulate audio, transcribe on speech end). Both paths produce the same final result; streaming only provides earlier feedback.
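
The two paths can be sketched as a single function that branches on recognizer capability. The `BatchStt` and `StreamingStt` interfaces below are hypothetical stand-ins, not the sherpa-onnx or agtOS API, and the chunks arrive as an array only to keep the example short; a real pipeline would receive them incrementally.

```typescript
// Hypothetical recognizer shapes (illustrative, not a real library API).
interface BatchStt {
  kind: "batch";
  transcribe(audio: Float32Array): string;
}
interface StreamingStt {
  kind: "streaming";
  acceptChunk(chunk: Float32Array): string; // returns the current partial text
  finish(): string; // returns the final text once speech has ended
}
type SttEngine = BatchStt | StreamingStt;

function transcribeUtterance(
  engine: SttEngine,
  chunks: Float32Array[],
  onPartial: (text: string) => void,
): string {
  if (engine.kind === "streaming") {
    // Streaming path: feed each chunk and surface partial results early.
    for (const chunk of chunks) {
      onPartial(engine.acceptChunk(chunk));
    }
    return engine.finish();
  }
  // Batch path: accumulate all audio, then transcribe once at speech end.
  let total = 0;
  for (const chunk of chunks) total += chunk.length;
  const utterance = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    utterance.set(chunk, offset);
    offset += chunk.length;
  }
  return engine.transcribe(utterance);
}
```

Either way the caller receives one final string, which is what would trigger LLM processing.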

STT (Speech-to-Text)

STT is provided by sherpa-onnx (default, in-process) or speaches (external server fallback).

sherpa-onnx STT (Default)

The default STT provider runs directly in the Node.js process via ONNX Runtime:
  • 17+ models: Moonshine, SenseVoice, Zipformer, Paraformer, Whisper
  • True streaming: Partial results while the user is still speaking (Zipformer)
  • Word timestamps: Per-word timing for transcript display
  • No network overhead: In-process inference, no HTTP round-trips

speaches STT (Fallback)

speaches is an external server that exposes an OpenAI-compatible /v1/audio/transcriptions endpoint:
  • GPU-accelerated: CUDA/ROCm inference (4-6x faster than CPU)
  • Faster Whisper models: Whisper large-v3, faster-whisper, distil-whisper
  • Shared container: Same speaches server handles both STT and TTS
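
Because the endpoint is OpenAI-compatible, a request can be sketched with the standard `fetch`, `FormData`, and `Blob` globals (Node.js 18+). The `file` and `model` form fields and the `{ text }` response shape follow the OpenAI transcription API; the helper names and the `speech.wav` file name are assumptions made for this example, and the base URL and model come from the Configuration section of this page.

```typescript
// Builds the multipart request without sending it, so it can be inspected.
function buildTranscriptionRequest(
  baseUrl: string,
  wav: Blob,
  model: string,
): { url: string; form: FormData } {
  const form = new FormData();
  form.append("file", wav, "speech.wav"); // file name is an arbitrary placeholder
  form.append("model", model);
  const url = new URL("/v1/audio/transcriptions", baseUrl).toString();
  return { url, form };
}

// Sends the request and returns the transcript text.
async function transcribeWithSpeaches(
  baseUrl: string,
  wav: Blob,
  model: string,
): Promise<string> {
  const { url, form } = buildTranscriptionRequest(baseUrl, wav, model);
  const response = await fetch(url, { method: "POST", body: form });
  if (!response.ok) {
    throw new Error(`transcription failed with HTTP ${response.status}`);
  }
  const payload = (await response.json()) as { text: string };
  return payload.text;
}
```

A call might look like `transcribeWithSpeaches("http://localhost:8000", wavBlob, "whisper-large-v3")`.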

VAD Pre-Filtering

Voice Activity Detection runs locally (in the Node.js process or on the ESP32 device) before audio reaches the STT stage, so silence and background noise are never transcribed:
  • Silero VAD: ONNX model (~2MB), classifies 30ms audio frames in under 1ms
  • ESP32 VAD: Energy-based on-device detection for hardware clients
  • A 200ms pre-buffer is included before the VAD trigger to avoid clipping the start of speech
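
The pre-buffer idea can be sketched as a small gate that remembers the most recent non-speech frames and releases them when speech starts. `PreBufferedGate` is a hypothetical class, and the `isSpeech` callback stands in for Silero VAD; with 30ms frames, a 200ms pre-buffer rounds up to 7 frames. The sketch also ends speech on the first non-speech frame, which a production VAD would likely smooth with a hangover period.

```typescript
class PreBufferedGate {
  private readonly isSpeech: (frame: Float32Array) => boolean;
  private readonly maxPreFrames: number;
  private preBuffer: Float32Array[] = [];
  private speaking = false;

  constructor(
    isSpeech: (frame: Float32Array) => boolean,
    preBufferMs = 200,
    frameMs = 30,
  ) {
    this.isSpeech = isSpeech;
    this.maxPreFrames = Math.ceil(preBufferMs / frameMs); // 7 frames by default
  }

  // Returns the frames that should be forwarded to STT for this input frame.
  push(frame: Float32Array): Float32Array[] {
    if (this.isSpeech(frame)) {
      if (this.speaking) return [frame];
      // Speech just started: release the buffered lead-in plus this frame.
      this.speaking = true;
      const withLeadIn = this.preBuffer.concat([frame]);
      this.preBuffer = [];
      return withLeadIn;
    }
    // Non-speech: keep only the most recent frames as future lead-in.
    this.speaking = false;
    this.preBuffer.push(frame);
    if (this.preBuffer.length > this.maxPreFrames) this.preBuffer.shift();
    return [];
  }
}
```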

TTS (Text-to-Speech)

TTS is provided by sherpa-onnx (default, in-process) or speaches (external server fallback). The default model is Kokoro INT8, which provides a good balance of quality, speed, and resource usage. Key capabilities:
  • Sentence-based streaming: TTS begins on the first complete sentence while the LLM is still generating subsequent sentences, reducing perceived latency
  • In-process synthesis: sherpa-onnx runs Kokoro directly in the Node.js process with no network overhead
  • Thread pool: Multiple concurrent TTS instances (default: 3) for parallel sentence synthesis
  • Configurable voice and speed: SHERPA_TTS_VOICE=af_heart, SHERPA_TTS_SPEED=1.0
When using the speaches fallback, TTS uses the OpenAI-compatible /v1/audio/speech endpoint with Kokoro ONNX models.
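
Parallel sentence synthesis with a bounded pool can be sketched as below. `synthesizeInOrder` is a hypothetical helper and the `synthesize` callback stands in for a TTS instance; the pool size of 3 mirrors the documented default. Results are written by index, so playback order is preserved even when later sentences finish first.

```typescript
type SynthesizeFn = (sentence: string) => Promise<Float32Array>;

async function synthesizeInOrder(
  sentences: string[],
  synthesize: SynthesizeFn,
  poolSize = 3,
): Promise<Float32Array[]> {
  const results: Float32Array[] = new Array(sentences.length);
  let next = 0; // index of the next sentence to claim

  // Each worker claims sentences until none remain. JavaScript runs this
  // on a single thread, so the claim (next++) cannot race.
  async function worker(): Promise<void> {
    while (next < sentences.length) {
      const index = next++;
      results[index] = await synthesize(sentences[index]);
    }
  }

  const workers: Promise<void>[] = [];
  const workerCount = Math.min(poolSize, sentences.length);
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

This version waits for every sentence before returning; a streaming pipeline would instead hand each chunk to playback as soon as all earlier chunks are ready.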

Stream Coordinator

The Stream Coordinator is the core integration point for low-latency voice responses. It manages:
  1. LLM token accumulation as the model streams its response
  2. Sentence boundary detection to identify complete sentences
  3. TTS dispatch queue to synthesize sentences as they arrive
  4. Audio chunk ordering to ensure correct playback sequence
  5. Interruption handling for barge-in (user speaks while agent is responding)
This approach means TTS synthesis starts on the first sentence while the LLM is still generating the rest of the response, achieving sub-second perceived latency.
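
A rough sketch of three of these responsibilities follows: token accumulation with sentence boundary detection, ordered release of audio chunks, and dropping stale audio after a barge-in. Both classes are hypothetical illustrations rather than the actual Stream Coordinator, and the boundary rule (terminal punctuation followed by whitespace) is deliberately naive; it would split on abbreviations such as "Dr. Smith".

```typescript
class SentenceAccumulator {
  private pending = "";

  // Adds one LLM token and returns any sentences it completed.
  push(token: string): string[] {
    this.pending += token;
    const done: string[] = [];
    const boundary = /[.!?]+(?=\s)/;
    let match = boundary.exec(this.pending);
    while (match !== null) {
      const end = match.index + match[0].length;
      done.push(this.pending.slice(0, end).trim());
      this.pending = this.pending.slice(end).replace(/^\s+/, "");
      match = boundary.exec(this.pending);
    }
    return done;
  }

  // Call when the LLM stream ends to emit the trailing text.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}

class OrderedPlayback {
  private epoch = 0; // incremented on every barge-in
  private nextSeq = 0;
  private readonly ready = new Map<number, Float32Array>();

  currentEpoch(): number {
    return this.epoch;
  }

  // Accepts a synthesized chunk; returns the chunks playable now, in order.
  deliver(epoch: number, seq: number, audio: Float32Array): Float32Array[] {
    if (epoch !== this.epoch) return []; // synthesized before an interruption
    this.ready.set(seq, audio);
    const playable: Float32Array[] = [];
    let chunk = this.ready.get(this.nextSeq);
    while (chunk !== undefined) {
      playable.push(chunk);
      this.ready.delete(this.nextSeq);
      this.nextSeq++;
      chunk = this.ready.get(this.nextSeq);
    }
    return playable;
  }

  // Barge-in: discard queued audio and invalidate in-flight synthesis.
  interrupt(): void {
    this.epoch++;
    this.nextSeq = 0;
    this.ready.clear();
  }
}
```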

Transport

WebSocket Audio Streaming

The primary transport for real-time audio is WebSocket, served on port 3000. Clients connect and stream raw audio frames, receiving synthesized audio in return. The WebSocket transport supports:
  • PTT (Push-to-Talk): Client explicitly signals start/stop of speech
  • VAD mode: Server-side voice activity detection determines speech boundaries
  • Authentication: Optional token validation on WebSocket upgrade
  • Session management: Each connection gets a voice session with conversation context
WebRTC signaling is also available for browser-to-browser audio, but WebSocket is the primary transport for the MVP.

Browser Voice Client

The built-in web dashboard includes a voice client that connects via WebSocket with AudioWorklet-based capture (ScriptProcessorNode fallback for older browsers). It supports both PTT and VAD modes with real-time transcript display.

Configuration

Configure the voice pipeline through environment variables:
# Voice server port
VOICE_PORT=3000

# STT settings
SPEACHES_STT_MODEL=whisper-large-v3
SPEACHES_URL=http://localhost:8000

# TTS settings
SPEACHES_TTS_VOICE=af_heart

Pipeline Configuration Object

When initializing the orchestrator programmatically:
const orchestrator = new VoicePipelineOrchestrator({
  webrtc: {
    signalingPort: 3000,
  },
  stt: {
    provider: 'sherpa-onnx',  // or 'speaches' for external server
    model: 'moonshine-tiny-en-int8',
  },
  tts: {
    provider: 'sherpa-onnx',  // or 'speaches' for external server
    voice: 'af_heart',
  },
  command: {
    provider: 'claude',  // or 'ollama'
  },
  redisUrl: 'redis://localhost:6379',
});
The voice pipeline supports conditional routing between audio modes based on user preferences or operational constraints:
voice:
  default_variant: cascade
  variants:
    cascade:
      stt: sherpa-onnx
      llm: claude-haiku-4.5
      tts: sherpa-onnx
    half_cascade:
      audio_llm: ultravox
      tts: sherpa-onnx
    native:
      provider: gemini-live
      model: gemini-2.5-flash
  routing:
    rules:
      - condition: "user.preference == 'natural'"
        variant: native
      - condition: "cost_budget < 0.50/hour"
        variant: cascade
      - condition: "emotion_detection_required"
        variant: half_cascade
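
One way to read the routing rules above is as an ordered list where the first matching rule wins and `default_variant` applies when none match. That precedence is an assumption, as are the `RoutingContext` field names and the use of predicate functions in place of the string conditions; the sketch does not parse the condition expressions.

```typescript
type AudioVariant = "cascade" | "half_cascade" | "native";

// Hypothetical per-request context mirroring the condition variables above.
interface RoutingContext {
  userPreference?: string;
  costBudgetPerHour?: number;
  emotionDetectionRequired?: boolean;
}

interface RoutingRule {
  matches: (ctx: RoutingContext) => boolean;
  variant: AudioVariant;
}

const exampleRules: RoutingRule[] = [
  { matches: (ctx) => ctx.userPreference === "natural", variant: "native" },
  {
    matches: (ctx) =>
      ctx.costBudgetPerHour !== undefined && ctx.costBudgetPerHour < 0.5,
    variant: "cascade",
  },
  { matches: (ctx) => ctx.emotionDetectionRequired === true, variant: "half_cascade" },
];

// Returns the variant of the first matching rule, else the default.
function selectVariant(
  ctx: RoutingContext,
  rules: RoutingRule[],
  defaultVariant: AudioVariant = "cascade",
): AudioVariant {
  for (const rule of rules) {
    if (rule.matches(ctx)) return rule.variant;
  }
  return defaultVariant;
}
```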

Graceful Degradation

The voice pipeline degrades gracefully when components are unavailable:
  • If a native audio API is down, fall back to cascade
  • If the local STT/TTS server is down, fall back to cloud providers
  • If Redis is unavailable, voice sessions still work (without memory persistence)
  • Health checks for each component (STT, TTS, Ollama, Claude) are exposed via the /api/health endpoint
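
The fallback behavior in the first two bullets can be sketched as an ordered list of candidates that are tried until one succeeds. `firstAvailable` and the candidate labels are hypothetical; how agtOS actually detects an unavailable component is not described here.

```typescript
interface FallbackCandidate<T> {
  label: string;
  run: () => Promise<T>;
}

// Tries each candidate in order and returns the first successful result,
// along with the label of the candidate that produced it.
async function firstAvailable<T>(
  candidates: FallbackCandidate<T>[],
): Promise<{ label: string; value: T }> {
  const failures: string[] = [];
  for (const candidate of candidates) {
    try {
      const value = await candidate.run();
      return { label: candidate.label, value };
    } catch (err) {
      // Record the failure and move on to the next candidate.
      failures.push(`${candidate.label}: ${String(err)}`);
    }
  }
  throw new Error(`all candidates failed: ${failures.join("; ")}`);
}
```

For example, a native audio call could be listed first with the cascade pipeline second, so a native API outage would degrade to cascade instead of failing the turn.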

What’s next

WebSocket Protocol

Detailed protocol reference for the audio transport layer.

Chat API

Text-based chat endpoint using the same agent reasoning loop.