Dual-Layer Architecture
Infrastructure Layer
The infrastructure layer handles the technical audio pipeline and connectivity:

- Voice Activity Detection (VAD) — detects when the user is speaking
- Audio encoding/decoding — PCM, Opus, WAV format conversion
- Transport — WebSocket audio streaming, WebRTC signaling
- STT/TTS — speech-to-text and text-to-speech via speaches server
- Session management — Redis-backed session state with TTL and device indexing
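To make the session-management behavior concrete, here is a minimal in-memory sketch of TTL-based session state. The class and field names are assumptions for illustration; the real store is Redis-backed with device indexing, so expiry here is simulated lazily on read the way a Redis key TTL would drop the entry.

```typescript
interface Session {
  id: string;
  deviceId: string;
  expiresAt: number; // epoch milliseconds
}

class SessionStore {
  private sessions = new Map<string, Session>();

  create(id: string, deviceId: string, ttlMs: number): Session {
    const session = { id, deviceId, expiresAt: Date.now() + ttlMs };
    this.sessions.set(id, session);
    return session;
  }

  // Expire lazily on read, mirroring how a Redis TTL would drop the key.
  get(id: string): Session | undefined {
    const s = this.sessions.get(id);
    if (s && s.expiresAt <= Date.now()) {
      this.sessions.delete(id);
      return undefined;
    }
    return s;
  }
}
```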
Orchestration Layer
The orchestration layer is AI-driven and protocol-agnostic:

- Protocol Gateway — routes requests through adapters (MCP today, A2A and AG-UI ready)
- Model Router — three-tier routing: intent classification, local model, cloud model
- Agent Reasoning Loop — multi-step tool execution via CommandProtocol
- Working Memory — per-session conversation history with automatic summarization
- Episodic Memory — cross-session recall stored in Redis
- Semantic Memory — embedding-based vector search via Redis Vector Search + Ollama
- Tool Registry — in-memory catalog with category filtering and dynamic selection
- Workflow Engine — multi-step workflow execution
- Task Scheduler — Redis-backed cron, one-shot, and interval scheduling
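A sketch of the tool registry's category filtering and the optional `platforms` restriction mentioned later in this document. The interface shape is an assumption; only the filtering behavior described above is modeled.

```typescript
interface Tool {
  name: string;
  category: string;
  platforms?: string[]; // when present, the tool is restricted to these platforms
}

class ToolRegistry {
  private tools: Tool[] = [];

  register(tool: Tool): void {
    this.tools.push(tool);
  }

  // Category filtering for dynamic tool selection.
  byCategory(category: string): Tool[] {
    return this.tools.filter((t) => t.category === category);
  }

  // Tools without a platforms field are available everywhere.
  forPlatform(platform: string): Tool[] {
    return this.tools.filter((t) => !t.platforms || t.platforms.includes(platform));
  }
}
```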
Voice Pipeline Modes
agtOS supports three voice pipeline architectures through the infrastructure layer (ADR-008). The orchestration layer remains unchanged across all three:

- CASCADE (Default)
- HALF_CASCADE
- NATIVE
The CASCADE mode is currently the production implementation. HALF_CASCADE and NATIVE are architecturally supported and will be connected as providers mature.
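Mode selection with fallback can be sketched as follows. Only the readiness facts come from the text above (CASCADE is production; the other two await providers); the function and table names are assumptions.

```typescript
type PipelineMode = "CASCADE" | "HALF_CASCADE" | "NATIVE";

// CASCADE is the only production mode today; the other two are
// architecturally supported but their providers are not yet connected.
const PRODUCTION_READY: Record<PipelineMode, boolean> = {
  CASCADE: true,
  HALF_CASCADE: false,
  NATIVE: false,
};

function selectPipelineMode(requested: PipelineMode): PipelineMode {
  // Fall back to the production default when the requested mode is not ready.
  return PRODUCTION_READY[requested] ? requested : "CASCADE";
}
```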
Protocol Gateway
The protocol gateway (ADR-001) abstracts protocol-specific details behind a unified internal interface. Core orchestration logic never imports protocol types directly.

Why Protocol-Agnostic?
MCP moved to the Linux Foundation alongside Google's A2A, signaling an industry shift toward a multi-protocol future. By abstracting protocols behind adapters, agtOS can adopt new protocols by writing an adapter — not restructuring the application.

Platform-Aware Routing
The gateway supports platform-specific adapter overrides (ADR-015):

| Platform | Characteristics |
|---|---|
| Web | Default cascade pipeline via WebSocket |
| Desktop | Tauri 2 native shell with system tray, global hotkey |
| ESP32 | Low-bandwidth audio transport, platform-specific VAD |
| CLI | Text-only routing, no audio pipeline overhead |
When a request carries metadata.platform, the gateway checks for platform-specific adapter overrides before falling back to the default. Tools can also be restricted to specific platforms via an optional platforms field, reducing context window usage.
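The override-then-fallback lookup can be sketched in a few lines. The type and function names are assumptions; only the lookup order is taken from the description above.

```typescript
type Platform = "web" | "desktop" | "esp32" | "cli";

interface Adapter {
  name: string;
}

function resolveAdapter(
  overrides: Partial<Record<Platform, Adapter>>,
  fallback: Adapter,
  platform?: Platform,
): Adapter {
  // Check for a platform-specific override first, then fall back to the default.
  if (platform && overrides[platform]) return overrides[platform]!;
  return fallback;
}
```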
Data Flow
Here is the complete flow for a voice interaction using the cascade pipeline:

Audio Capture
User speaks into microphone (browser WebSocket or ESP32 hardware). Audio frames stream to the voice server on port 3000.
Speech-to-Text
The STT provider (speaches, Faster Whisper model) transcribes the audio stream into text.
Intent Classification
The model router’s Tier 1 classifier (Ollama, qwen3:1.7b) categorizes the request in under 50ms: simple query, tool use, complex reasoning, creative, or system command.
Model Routing
Based on classification, the request goes to Tier 2 (Ollama, local) or Tier 3 (Claude, cloud). Privacy-sensitive requests always stay local.
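The classification and routing steps can be condensed into a single decision function. The intent labels come from the step above; which intents go to the cloud tier is an assumption for illustration, while "privacy-sensitive requests always stay local" is stated directly.

```typescript
type Intent =
  | "simple_query"
  | "tool_use"
  | "complex_reasoning"
  | "creative"
  | "system_command";

interface RouteDecision {
  tier: 2 | 3;
  target: "ollama" | "claude";
}

function routeRequest(intent: Intent, privacySensitive: boolean): RouteDecision {
  // Privacy-sensitive requests always stay on the local model.
  if (privacySensitive) return { tier: 2, target: "ollama" };
  switch (intent) {
    case "complex_reasoning":
    case "creative":
      return { tier: 3, target: "claude" }; // cloud tier (assumed mapping)
    default:
      return { tier: 2, target: "ollama" }; // local tier (assumed mapping)
  }
}
```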
Tool Execution
If the LLM invokes tools, the agent reasoning loop executes them via the CommandProtocol. Tools are discovered through the MCP server and tool registry.
Response Streaming
The LLM response streams token-by-token. Sentence boundary detection splits the stream for TTS chunking — the first sentence starts synthesizing while later sentences are still generating.
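The sentence-boundary splitting described above can be sketched as a generator that emits each complete sentence as soon as its terminator arrives, keeping the remainder buffered for the next chunk. The boundary rule here is deliberately naive; a production detector would handle abbreviations, decimals, and similar edge cases.

```typescript
function* sentenceChunks(tokens: Iterable<string>): Generator<string> {
  let buffer = "";
  for (const token of tokens) {
    buffer += token;
    // Emit up to the first sentence terminator followed by whitespace;
    // whatever follows stays buffered while later tokens stream in.
    const match = buffer.match(/^(.+?[.!?])\s+(.*)$/s);
    if (match) {
      yield match[1];
      buffer = match[2];
    }
  }
  // Flush whatever remains once the stream ends.
  if (buffer.trim()) yield buffer.trim();
}
```

This is why the first sentence can start synthesizing while later sentences are still being generated: each yielded chunk can be handed to TTS immediately.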
Key Components
MCP Server
Streamable HTTP on port 4100. Exposes 9 built-in tools: voice.speak, voice.listen, system.health, session.status, workflow.run, workflow.list, schedule.create, schedule.list, schedule.cancel.

MCP Client
Connects to external MCP servers for tool discovery. Auto-discovery with reconnection support.
Memory System
Three-tier memory: working (session context), episodic (cross-session Redis), semantic (vector search via Ollama embeddings + Redis).
REST API
13 endpoints on port 4102 with rate limiting and optional API key auth. Serves health, sessions, memory, scheduler, workflows, chat, voice status, and tasks.
Web Dashboard
React 19 + Vite 6 management UI with pages for health, devices, tasks, voice, memory, conversations, configuration, and logs.
Desktop App
Tauri 2 native shell with system tray, global push-to-talk hotkey, health monitoring, and auto-start. Bundles agtOS as a Node SEA sidecar.
Tech Stack
| Layer | Technology | Purpose |
|---|---|---|
| Runtime | Node.js 20+ / TypeScript | Server, CLI, all business logic |
| Cloud LLM | Claude (Anthropic SDK) | Complex reasoning, agentic tasks |
| Local LLM | Ollama (Qwen3, etc.) | Intent classification, simple queries |
| STT | speaches (Faster Whisper) | Speech-to-text transcription |
| TTS | speaches (Kokoro ONNX) | Text-to-speech synthesis |
| State | Redis (node-redis v5) | Sessions, memory, scheduler, events, devices |
| Protocols | MCP (Streamable HTTP) | Tool integration, external server connections |
| Desktop | Tauri 2 | Native app shell with system tray, hotkeys |
| Dashboard | React 19, Vite 6 | Web-based management UI |
| Hardware | ESP32-S3 (XIAO Sense) | Wearable voice client firmware |
| Metrics | Prometheus | /metrics endpoint with latency percentiles |
Security Model
BYOK Credentials
agtOS uses Bring Your Own Key (BYOK) credential management. API keys are encrypted at rest with AES-256-GCM and a user-provided secret. Per-provider validation ensures keys are valid before storage.

API Authentication
The REST API supports opt-in Bearer token authentication via AGTOS_API_KEY. When set, all /api/* routes require the header Authorization: Bearer <key>. Timing-safe comparison prevents timing attacks. See Authentication for details.
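A sketch of a timing-safe Bearer check using Node's crypto module. The function name is an assumption; the point is the constant-time comparison. Hashing both sides to fixed-length digests means `timingSafeEqual` never throws on a length mismatch and the key length itself leaks nothing.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

function isAuthorized(header: string | undefined, apiKey: string): boolean {
  const prefix = "Bearer ";
  if (!header || !header.startsWith(prefix)) return false;
  const presented = header.slice(prefix.length);
  // Hash both values to equal-length digests, then compare in constant time.
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(apiKey).digest();
  return timingSafeEqual(a, b);
}
```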
Device Authentication
ESP32 and other hardware devices authenticate via per-device SHA-256 tokens. The device registry tracks capabilities, status lifecycle, and trust levels.

Rate Limiting
Token bucket rate limiting protects all API endpoints:

- General API: 100 requests/minute per client IP
- Chat and task endpoints: 20 requests/minute per client IP
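A minimal token-bucket sketch, with per-IP bookkeeping omitted. The class shape is an assumption; the refill rate shown in the usage note mirrors the general API limit above (100 tokens per 60-second window).

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerMs: number, // tokens added per millisecond
    now = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if a request may proceed, consuming one token.
  tryRemove(now = Date.now()): boolean {
    const elapsed = now - this.lastRefill;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// e.g. the general API limit: new TokenBucket(100, 100 / 60_000)
```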