agtOS uses a multi-provider architecture. Claude or OpenAI handles complex reasoning in the cloud, Ollama runs local models for fast responses and privacy, and speech processing runs either in-process via sherpa-onnx or via an external speaches server. The model router ties them together.
Claude Provider
Claude is the primary cloud LLM for agtOS, used for complex conversations, multi-step reasoning, and background agentic tasks. agtOS integrates two Anthropic SDKs (see ADR-003):
- Client SDK (@anthropic-ai/sdk): real-time voice path with streaming
- Agent SDK (@anthropic-ai/claude-agent-sdk): background autonomous tasks
Model Selection
| Model | ID | Best For | Input $/MTok | Output $/MTok |
|---|---|---|---|---|
| Claude Opus 4 | claude-opus-4-20250514 | Complex reasoning, analysis | $15.00 | $75.00 |
| Claude Sonnet 4 | claude-sonnet-4-20250514 | Balanced performance (default) | $3.00 | $15.00 |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 | Speed, low cost, voice pipeline | $1.00 | $5.00 |
Configuration Options
Provider Defaults
When no environment variables are set, the Claude provider uses these defaults:
| Setting | Default | Notes |
|---|---|---|
| model | claude-sonnet-4-20250514 | Sonnet 4 balances speed and capability |
| maxTokens | 4096 | Max output tokens per response |
| temperature | 0.7 | Generation temperature (0.0 - 1.0) |
| timeoutMs | 300000 | 5-minute request timeout |
| promptCache.maxEntries | 50 | LRU prompt cache size |
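As a rough illustration of how these defaults could be layered under caller overrides, here is a minimal sketch. The `ClaudeProviderSettings` interface and `resolveClaudeSettings` function are hypothetical names invented for this example; only the field names and default values come from the table above.

```typescript
// Hypothetical settings shape; field names mirror the defaults table,
// but the real agtOS types may differ.
interface ClaudeProviderSettings {
  model: string;
  maxTokens: number;
  temperature: number;
  timeoutMs: number;
  promptCache: { maxEntries: number };
}

const CLAUDE_DEFAULTS: ClaudeProviderSettings = {
  model: "claude-sonnet-4-20250514",
  maxTokens: 4096,
  temperature: 0.7,
  timeoutMs: 300000, // 5 minutes
  promptCache: { maxEntries: 50 },
};

// Layer caller overrides on top of the defaults and reject invalid values.
function resolveClaudeSettings(
  overrides: Partial<ClaudeProviderSettings> = {}
): ClaudeProviderSettings {
  const merged: ClaudeProviderSettings = {
    ...CLAUDE_DEFAULTS,
    ...overrides,
    promptCache: { ...CLAUDE_DEFAULTS.promptCache, ...overrides.promptCache },
  };
  if (merged.temperature < 0 || merged.temperature > 1) {
    throw new RangeError("temperature must be between 0.0 and 1.0");
  }
  if (merged.maxTokens <= 0) {
    throw new RangeError("maxTokens must be positive");
  }
  return merged;
}

// Example: a lower temperature for the voice path, everything else default.
const voiceSettings = resolveClaudeSettings({ temperature: 0.2 });
console.log(voiceSettings.model, voiceSettings.temperature);
```

Validating at merge time keeps an out-of-range value from reaching the API call, where it would otherwise surface as a request error.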
Dual-SDK Architecture
The voice pipeline uses the Client SDK for real-time streaming, targeting first-token latency under 200ms and first-sentence latency under 500ms. Background tasks like “research the best options for X” are dispatched to the Agent SDK, which handles multi-step tool execution autonomously. Both SDKs connect to the same MCP servers, so tools are defined once and available to both paths. See Authentication for API key setup.
OpenAI Provider
OpenAI is an alternative cloud LLM provider for agtOS, available as a drop-in replacement for Claude in the model router’s cloud tier (ADR-019). It uses the OpenAI Node SDK v6 with streaming support.
Model Selection
| Model | ID | Best For | Input $/MTok | Output $/MTok |
|---|---|---|---|---|
| GPT-4o | gpt-4o | Complex reasoning (default) | $2.50 | $10.00 |
| GPT-4o Mini | gpt-4o-mini | Speed, low cost | $0.15 | $0.60 |
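The per-MTok prices in the model tables translate into a per-request cost with simple arithmetic: tokens times price, divided by one million. The sketch below is illustrative only; `estimateCostUsd` and the `MODEL_PRICES` map are hypothetical helpers, and the prices are copied from the tables in this page for a subset of models.

```typescript
interface TokenPrice {
  inputPerMTok: number;  // USD per million input tokens
  outputPerMTok: number; // USD per million output tokens
}

// Prices copied from the model tables; hypothetical lookup structure.
const MODEL_PRICES: Record<string, TokenPrice> = {
  "claude-opus-4-20250514": { inputPerMTok: 15.0, outputPerMTok: 75.0 },
  "claude-sonnet-4-20250514": { inputPerMTok: 3.0, outputPerMTok: 15.0 },
  "gpt-4o": { inputPerMTok: 2.5, outputPerMTok: 10.0 },
  "gpt-4o-mini": { inputPerMTok: 0.15, outputPerMTok: 0.6 },
};

// cost = (inputTokens * inputPrice + outputTokens * outputPrice) / 1,000,000
function estimateCostUsd(modelId: string, inputTokens: number, outputTokens: number): number {
  if (!Object.prototype.hasOwnProperty.call(MODEL_PRICES, modelId)) {
    throw new Error(`No price listed for model: ${modelId}`);
  }
  const price = MODEL_PRICES[modelId];
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1e6;
}

// A 2,000-token prompt with a 500-token reply on Sonnet 4:
// (2000 * 3 + 500 * 15) / 1e6 = 0.0135 USD
console.log(estimateCostUsd("claude-sonnet-4-20250514", 2000, 500));
```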
Configuration
Configure OpenAI as the provider for one or more model slots in ~/.agtos/config.json, or run agtos setup to configure slots interactively. You also need to provide an OpenAI API key.
Features
- Streaming: Full streaming support via .stream() with finalChatCompletion()
- Tool calling: Function-based tool calls compatible with the agtOS tool registry
- Session management: 30-minute TTL sessions with token tracking
- Barge-in: Stream cancellation via AbortSignal for voice pipeline interrupts
- Health check: Probes the /v1/models endpoint when an API key is configured
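To make the session-management feature concrete, the following is a minimal sketch of a 30-minute TTL session store with token tracking. It is not the agtOS implementation: `SessionTracker`, `TrackedSession`, and the injectable clock are assumptions made for illustration.

```typescript
const SESSION_TTL_MS = 30 * 60 * 1000; // 30-minute TTL from the feature list

interface TrackedSession {
  id: string;
  lastActiveMs: number;
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical in-memory tracker; the real provider may persist sessions elsewhere.
class SessionTracker {
  private sessions: Record<string, TrackedSession> = Object.create(null);

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(private readonly now: () => number = () => Date.now()) {}

  // Return the live session, or undefined if it is missing or expired.
  get(id: string): TrackedSession | undefined {
    const session = this.sessions[id];
    if (session === undefined) return undefined;
    if (this.now() - session.lastActiveMs > SESSION_TTL_MS) {
      delete this.sessions[id];
      return undefined;
    }
    return session;
  }

  // Add one turn's token usage and refresh the TTL.
  recordTurn(id: string, inputTokens: number, outputTokens: number): TrackedSession {
    const session =
      this.get(id) || { id, lastActiveMs: 0, inputTokens: 0, outputTokens: 0 };
    session.inputTokens += inputTokens;
    session.outputTokens += outputTokens;
    session.lastActiveMs = this.now();
    this.sessions[id] = session;
    return session;
  }
}
```

Expiring lazily on read keeps the sketch free of timers; a long-running server would likely also sweep idle sessions periodically.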
When a slot is configured with "provider": "openai", the model router sends that slot’s requests to OpenAI. If the OpenAI provider fails to initialize, it falls back to Claude with a warning.
Ollama Provider
Ollama serves local models for intent classification and for handling simple queries without cloud API calls. This reduces cost and latency and keeps privacy-sensitive requests on-device.
Configuration
Intent Classifier
The intent classifier is a small, fast model that categorizes every incoming request before it reaches an LLM. It runs on CPU in under 50ms and determines:
| Category | Description | Route |
|---|---|---|
| simple_query | Factual queries, greetings, time/date | Local (Ollama) |
| system_command | Volume, timer, reminder | Local (Ollama) |
| tool_use | File ops, API calls | May need Claude |
| complex_reasoning | Analysis, code review | Claude |
| creative | Writing, brainstorming | Claude |
Classifier Defaults
| Setting | Default | Notes |
|---|---|---|
| host | http://localhost:11434 | Ollama API URL |
| model | qwen3:1.7b | Small model for fast classification |
| confidenceThreshold | 0.7 | Below this confidence, escalate to Claude |
| timeoutMs | 2000 | Max classification time before defaulting to cloud |
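The category table and the classifier defaults combine into a simple routing decision. The sketch below shows one plausible reading: low confidence or a missing result (for example, after the 2-second timeout) escalates to the cloud. The `chooseTier` function, the `null` convention for a failed classification, and the `toolNeedsCloud` flag are assumptions, not the documented agtOS API.

```typescript
type IntentCategory =
  | "simple_query"
  | "system_command"
  | "tool_use"
  | "complex_reasoning"
  | "creative";

type InferenceTier = "local" | "cloud";

interface IntentResult {
  category: IntentCategory;
  confidence: number; // 0.0 to 1.0
}

const CONFIDENCE_THRESHOLD = 0.7; // confidenceThreshold default

// `result` is null when classification failed or timed out (assumed convention).
// `toolNeedsCloud` is a hypothetical flag for the "may need Claude" case.
function chooseTier(result: IntentResult | null, toolNeedsCloud: boolean = false): InferenceTier {
  if (result === null) return "cloud";
  if (result.confidence < CONFIDENCE_THRESHOLD) return "cloud";
  switch (result.category) {
    case "simple_query":
    case "system_command":
      return "local";
    case "tool_use":
      return toolNeedsCloud ? "cloud" : "local";
    default:
      // complex_reasoning and creative always go to the cloud tier
      return "cloud";
  }
}

console.log(chooseTier({ category: "simple_query", confidence: 0.92 }));
```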
speaches STT/TTS (Fallback)
speaches is a self-hosted server that provides OpenAI-compatible speech-to-text and text-to-speech endpoints. agtOS can use it as a fallback when sherpa-onnx is not available.
Configuration
STT Defaults
| Setting | Default | Notes |
|---|---|---|
| baseUrl | http://localhost:8000 | speaches API URL |
| model | Systran/faster-whisper-small | Faster Whisper model |
| language | en | Language code (en, es, fr, etc.) |
| timeoutMs | 30000 | 30-second request timeout |
TTS Defaults
| Setting | Default | Notes |
|---|---|---|
| baseUrl | http://localhost:8000 | speaches API URL |
| model | speaches-ai/Kokoro-82M-v1.0-ONNX | Kokoro ONNX model for fast synthesis |
| voice | af_heart | Default voice ID |
| format | wav | Output format (wav or mp3) |
| speed | 1.0 | Speaking speed (0.25 - 4.0) |
| timeoutMs | 30000 | 30-second request timeout |
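Because speaches exposes OpenAI-compatible endpoints, a TTS call can be sketched as an OpenAI-style speech request built from the defaults above. This is a hedged illustration: `buildSpeechRequest` is a hypothetical helper, and the `/v1/audio/speech` path and body field names follow the OpenAI speech API on the assumption that speaches mirrors them.

```typescript
interface SpeachesTtsSettings {
  baseUrl: string;
  model: string;
  voice: string;
  format: string; // validated at runtime: "wav" or "mp3"
  speed: number;
  timeoutMs: number;
}

const SPEACHES_TTS_DEFAULTS: SpeachesTtsSettings = {
  baseUrl: "http://localhost:8000",
  model: "speaches-ai/Kokoro-82M-v1.0-ONNX",
  voice: "af_heart",
  format: "wav",
  speed: 1.0,
  timeoutMs: 30000,
};

// Build the URL and JSON body for a speech request, enforcing the
// documented format and speed limits before anything is sent.
function buildSpeechRequest(text: string, overrides: Partial<SpeachesTtsSettings> = {}) {
  const s: SpeachesTtsSettings = { ...SPEACHES_TTS_DEFAULTS, ...overrides };
  if (s.format !== "wav" && s.format !== "mp3") {
    throw new Error(`Unsupported audio format: ${s.format}`);
  }
  if (s.speed < 0.25 || s.speed > 4.0) {
    throw new RangeError("speed must be between 0.25 and 4.0");
  }
  return {
    url: `${s.baseUrl}/v1/audio/speech`, // assumed OpenAI-compatible path
    timeoutMs: s.timeoutMs,
    body: {
      model: s.model,
      input: text,
      voice: s.voice,
      response_format: s.format,
      speed: s.speed,
    },
  };
}

console.log(JSON.stringify(buildSpeechRequest("Hello from agtOS").body));
```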
speaches does not support opus or aac audio formats. Use wav (default) or mp3.
sherpa-onnx Provider (Default)
sherpa-onnx is the default STT/TTS/VAD provider. It runs directly in the Node.js process via a native ONNX Runtime addon. No Python, no external server, no HTTP round-trips. See ADR-017 for the decision rationale.
Why sherpa-onnx?
- In-process: No network latency for STT/TTS calls
- 17+ STT models: Whisper, Moonshine, SenseVoice, Zipformer, Paraformer
- True streaming STT: Partial results while the user is still speaking
- Voice cloning: PocketTTS and ZipVoice support (future)
- Apple Silicon: CoreML acceleration on macOS
Configuration
Available TTS Voices
The Kokoro TTS model includes 11 built-in voices. OpenAI voice names are mapped to their closest Kokoro equivalents.
| Voice ID | Description | OpenAI Alias |
|---|---|---|
| af_heart | Warm, natural American female | alloy |
| af_bella | Clear, expressive American female | nova |
| af_nicole | Calm, professional American female | - |
| af_sarah | Friendly, conversational American female | - |
| af_sky | Bright, energetic American female | - |
| am_adam | Steady, confident American male | echo |
| am_michael | Friendly, conversational American male | onyx |
| bf_emma | Polished British female | shimmer |
| bf_isabella | Elegant British female | - |
| bm_george | Authoritative British male | fable |
| bm_lewis | Warm British male | - |
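The alias column can be read as a lookup from OpenAI voice names to Kokoro voice IDs. A minimal sketch of that lookup follows; `resolveVoice` is a hypothetical helper, and falling back to af_heart for unrecognized names is an assumption rather than documented behavior.

```typescript
const KOKORO_VOICES: string[] = [
  "af_heart", "af_bella", "af_nicole", "af_sarah", "af_sky",
  "am_adam", "am_michael", "bf_emma", "bf_isabella", "bm_george", "bm_lewis",
];

// OpenAI voice name -> closest Kokoro voice, taken from the table above.
const OPENAI_VOICE_ALIASES: Record<string, string> = {
  alloy: "af_heart",
  nova: "af_bella",
  echo: "am_adam",
  onyx: "am_michael",
  shimmer: "bf_emma",
  fable: "bm_george",
};

const DEFAULT_VOICE = "af_heart";

// Accept either a Kokoro voice ID or an OpenAI alias.
// Unknown names fall back to the default voice (assumed behavior).
function resolveVoice(requested: string): string {
  if (KOKORO_VOICES.indexOf(requested) !== -1) return requested;
  if (Object.prototype.hasOwnProperty.call(OPENAI_VOICE_ALIASES, requested)) {
    return OPENAI_VOICE_ALIASES[requested];
  }
  return DEFAULT_VOICE;
}

console.log(resolveVoice("nova"));
```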
Model Router
The STT model router automatically selects the best model based on context:
| Context | Selected Model | Reason |
|---|---|---|
| English, fast mode | Moonshine Tiny | Lowest latency |
| Non-English | SenseVoice INT8 | Multilingual support |
| Streaming requested | Zipformer EN 20M | Real-time partial results |
| Quality mode | SenseVoice INT8 | Best accuracy |
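The selection table can be expressed as a small decision function. The table does not state which rule wins when several contexts apply at once, so the precedence in this sketch (language first, then streaming, then quality) and the default of fast mode are assumptions; `selectSttModel` and `SttContext` are hypothetical names.

```typescript
type SttModelName = "Moonshine Tiny" | "SenseVoice INT8" | "Zipformer EN 20M";

interface SttContext {
  language: string;          // e.g. "en", "es", "en-US"
  streaming?: boolean;       // caller wants partial results
  mode?: "fast" | "quality"; // assumed to default to "fast"
}

// Assumed precedence: non-English > streaming > quality > fast.
function selectSttModel(ctx: SttContext): SttModelName {
  const isEnglish = ctx.language.toLowerCase().slice(0, 2) === "en";
  if (!isEnglish) return "SenseVoice INT8";              // multilingual support
  if (ctx.streaming) return "Zipformer EN 20M";          // real-time partial results
  if (ctx.mode === "quality") return "SenseVoice INT8";  // best accuracy
  return "Moonshine Tiny";                               // lowest latency
}

console.log(selectSttModel({ language: "en", streaming: true }));
```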
sherpa-onnx requires downloading ONNX model files (~460MB for the default set). Run npx agtos models download --default before first use. Models are cached locally in ~/.agtos/models/.
Model Router
The model router implements ADR-004, a three-tier routing architecture that sends each request to the optimal inference tier.
How Routing Works
Router Configuration
The model router uses the Model Slot Registry (ADR-020) to route requests. Each slot maps to a provider and model, configured in ~/.agtos/config.json.
Built-in Slots
| Slot | Type | Purpose |
|---|---|---|
| chat | Conversation | General chat (required; the system won’t start without it) |
| reasoning | Conversation | Complex analysis and multi-step reasoning |
| coding | Conversation | Code generation and review |
| tool_calling | Conversation | Requests that require tool execution |
| creative | Conversation | Writing, brainstorming, creative tasks |
| embedding | Task | Vector embeddings for semantic memory |
| classifier | Task | Intent classification for routing |
| summarization | Task | Conversation summarization |
| consolidation | Task | Memory consolidation (Dreamer) |
| dialectic | Task | User reasoning (Dialectic engine) |
| maintenance | Task | Stage 3 LLM judge for the NLI hybrid contradiction pipeline (ADR-027). Defaults to fallback: 'consolidation' so existing single-provider setups keep working unchanged. |
Pattern Overrides
The router supports forceSlotPatterns, a set of regex patterns that force routing to a specific slot regardless of classification.
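A minimal sketch of how pattern overrides might be evaluated is shown below. The `ForceSlotPattern` shape, the first-match-wins rule, and case-insensitive matching are all assumptions for illustration; the actual configuration format may differ.

```typescript
// Hypothetical entry shape: a regex source string and the slot it forces.
interface ForceSlotPattern {
  pattern: string;
  slot: string;
}

// Return the forced slot for the first matching pattern,
// otherwise keep the slot chosen by the intent classifier.
function applyForceSlotPatterns(
  text: string,
  classifiedSlot: string,
  patterns: ForceSlotPattern[]
): string {
  for (const entry of patterns) {
    if (new RegExp(entry.pattern, "i").test(text)) {
      return entry.slot;
    }
  }
  return classifiedSlot;
}

// Illustrative patterns only.
const examplePatterns: ForceSlotPattern[] = [
  { pattern: "^/code\\b", slot: "coding" },
  { pattern: "\\b(poem|story)\\b", slot: "creative" },
];

console.log(applyForceSlotPatterns("/code fix the login bug", "chat", examplePatterns));
```

Checking overrides before trusting the classifier gives users a predictable escape hatch when a request is routinely misclassified.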
Fallback Strategy
The router handles failures gracefully via per-slot fallback chains:
- If a slot’s primary provider fails, the registry tries the slot’s fallback slot (max depth: 3, circular reference guard)
- The chat slot is the terminal fallback: it always exists and cannot be removed
- Classification errors are tracked so thresholds can be adjusted over time
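The fallback rules above can be sketched as a function that computes the ordered list of slots to try. This is one interpretation, not the registry's actual code: "max depth: 3" is read as at most three fallback hops, and chat is appended as the last resort when the chain does not already reach it. `fallbackChain` and `FallbackSlot` are hypothetical names.

```typescript
const MAX_FALLBACK_DEPTH = 3;

interface FallbackSlot {
  provider: string;
  fallback?: string; // name of the slot to try next
}

// Compute the order in which slots would be tried, starting at `start`.
function fallbackChain(start: string, slots: Record<string, FallbackSlot>): string[] {
  const chain: string[] = [start];
  let current = start;
  for (let depth = 0; depth < MAX_FALLBACK_DEPTH; depth++) {
    const slot = Object.prototype.hasOwnProperty.call(slots, current)
      ? slots[current]
      : undefined;
    const next = slot ? slot.fallback : undefined;
    // Stop at the end of the chain or when a slot repeats (circular guard).
    if (next === undefined || chain.indexOf(next) !== -1) break;
    chain.push(next);
    current = next;
  }
  // chat is the terminal fallback.
  if (chain.indexOf("chat") === -1) chain.push("chat");
  return chain;
}

const exampleSlots: Record<string, FallbackSlot> = {
  chat: { provider: "claude" },
  reasoning: { provider: "openai", fallback: "coding" },
  coding: { provider: "ollama", fallback: "reasoning" }, // circular on purpose
};

console.log(fallbackChain("reasoning", exampleSlots).join(" -> "));
```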
Per-Slot Metrics
Each slot is instrumented with Prometheus metrics:
| Metric | Labels | Description |
|---|---|---|
| agtos_slot_duration_seconds | slot | Request duration histogram |
| agtos_slot_requests_total | slot | Total request count |
| agtos_slot_errors_total | slot | Total error count |
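To show what the two counters look like on a scrape, here is a dependency-free sketch that records per-slot outcomes and renders them in the Prometheus text exposition format. `SlotCounters` is a hypothetical class; a real deployment would more likely use a Prometheus client library, and the duration histogram is omitted for brevity.

```typescript
// In-memory stand-in for the per-slot request and error counters.
class SlotCounters {
  private requests: Record<string, number> = Object.create(null);
  private errors: Record<string, number> = Object.create(null);

  // Count one request for `slot`; count an error too when it failed.
  record(slot: string, ok: boolean): void {
    this.requests[slot] = (this.requests[slot] || 0) + 1;
    if (!ok) this.errors[slot] = (this.errors[slot] || 0) + 1;
  }

  // Render both metric families, one family at a time, sorted by slot.
  render(): string {
    const slotNames = Object.keys(this.requests).sort();
    const lines: string[] = ["# TYPE agtos_slot_requests_total counter"];
    for (const slot of slotNames) {
      lines.push(`agtos_slot_requests_total{slot="${slot}"} ${this.requests[slot]}`);
    }
    lines.push("# TYPE agtos_slot_errors_total counter");
    for (const slot of slotNames) {
      lines.push(`agtos_slot_errors_total{slot="${slot}"} ${this.errors[slot] || 0}`);
    }
    return lines.join("\n");
  }
}

const counters = new SlotCounters();
counters.record("chat", true);
counters.record("chat", false);
counters.record("coding", true);
console.log(counters.render());
```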
Bypassing the Router
To skip the router and use a single provider directly:
Cognitive Task Providers
Beyond the main LLM and speech providers, agtOS has several specialized AI tasks that can each use a different provider (ADR-018). This allows fine-grained optimization: for example, using local Ollama for embeddings while routing reasoning tasks to Claude, or pinning a cheap fast model for the maintenance task slot’s LLM judge.
| Task | Variable | Options | Purpose |
|---|---|---|---|
| Embedding | AGTOS_EMBEDDING_PROVIDER | ollama, openrouter | Vector embeddings for semantic memory search |
| Classification | AGTOS_CLASSIFIER_PROVIDER | ollama, claude, openrouter | Intent classification for model routing |
| Consolidation | AGTOS_CONSOLIDATION_PROVIDER | ollama, claude, openrouter | Memory consolidation (Dreamer) — compresses episodic memories |
| Reasoning | AGTOS_REASONING_PROVIDER | ollama, claude, openrouter | Dialectic reasoning — synthesizes user profile conclusions |
| Summarization | AGTOS_SUMMARIZATION_PROVIDER | ollama, claude, openrouter | Conversation summarization for working memory |
Each task also has a matching _MODEL variable (e.g., AGTOS_EMBEDDING_MODEL) to override the default model.
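The per-task variables could be resolved along the following lines. This is a sketch under assumptions: `resolveTaskProvider` is a hypothetical helper, the `_MODEL` variable names other than AGTOS_EMBEDDING_MODEL are inferred from the naming pattern, and the "ollama" default used when a provider variable is unset is not stated in this page.

```typescript
type CognitiveTask =
  | "EMBEDDING"
  | "CLASSIFIER"
  | "CONSOLIDATION"
  | "REASONING"
  | "SUMMARIZATION";

// Allowed providers per task, copied from the Options column.
const TASK_PROVIDER_OPTIONS: Record<CognitiveTask, string[]> = {
  EMBEDDING: ["ollama", "openrouter"],
  CLASSIFIER: ["ollama", "claude", "openrouter"],
  CONSOLIDATION: ["ollama", "claude", "openrouter"],
  REASONING: ["ollama", "claude", "openrouter"],
  SUMMARIZATION: ["ollama", "claude", "openrouter"],
};

interface TaskProviderChoice {
  provider: string;
  model?: string; // only set when the _MODEL variable is present
}

// `env` is passed in (rather than read from process.env) to keep this testable.
// `defaultProvider` is an assumed fallback when the variable is unset.
function resolveTaskProvider(
  task: CognitiveTask,
  env: Record<string, string | undefined>,
  defaultProvider: string = "ollama"
): TaskProviderChoice {
  const provider = env[`AGTOS_${task}_PROVIDER`] || defaultProvider;
  if (TASK_PROVIDER_OPTIONS[task].indexOf(provider) === -1) {
    throw new Error(`Provider "${provider}" is not supported for ${task}`);
  }
  const model = env[`AGTOS_${task}_MODEL`];
  return model ? { provider, model } : { provider };
}

console.log(resolveTaskProvider("EMBEDDING", { AGTOS_EMBEDDING_PROVIDER: "openrouter" }));
```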
OpenRouter
OpenRouter is a first-class provider in agtOS (ADR-026). It proxies requests to Claude, GPT, Gemini, Llama, and many other models through a single API, and can be configured for any slot — conversation slots (chat, reasoning, coding, etc.) as well as task slots (embedding, classification, consolidation, dialectic, maintenance).
OpenRouter slots are configured in ~/.agtos/config.json.
OpenRouter has its own health check (provider-openrouter), distinct from provider-openai, and the client sets the HTTP-Referer and X-Title attribution headers required by the OpenRouter leaderboard. The OpenRouterCatalog pulls rich model metadata from /api/v1/models (context length, per-token pricing, supported parameters, and input modalities for vision / PDF / audio detection), while /api/v1/key powers the account info card in the dashboard.
These settings are also configurable at runtime via PUT /api/settings — see Environment Variables for the full list.
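For the attribution headers mentioned above, a request's header set might be assembled as in this sketch. `buildOpenRouterHeaders` is a hypothetical helper, and the site URL and title values are placeholders; only the HTTP-Referer and X-Title header names come from the text.

```typescript
// Assemble headers for an OpenRouter request, including the
// attribution headers used by the OpenRouter leaderboard.
function buildOpenRouterHeaders(
  apiKey: string,
  siteUrl: string,  // placeholder: the app's public URL
  appTitle: string  // placeholder: the app's display name
): Record<string, string> {
  if (apiKey.length === 0) {
    throw new Error("OpenRouter API key is required");
  }
  return {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    "HTTP-Referer": siteUrl,
    "X-Title": appTitle,
  };
}

const exampleHeaders = buildOpenRouterHeaders("sk-or-example", "https://example.com", "Example App");
console.log(Object.keys(exampleHeaders).join(", "));
```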
Provider Catalog
Every provider implements the ProviderCatalog interface (ADR-026) so the dashboard, the agtos setup wizard, and slot pickers can discover available models in a provider-agnostic way. listModels() returns a list of ModelInfo entries with context length, max output tokens, per-1M-token pricing, and a 13-entry capability union (including 'contradiction' for the NLI hybrid pipeline). Catalog results cache for one hour by default.
| Provider | Catalog implementation | Source |
|---|---|---|
| Claude | ClaudeCatalog | Auto-paginated client.models.list() with capabilities.{batch, code_execution, image_input, pdf_input, structured_outputs, thinking} flags |
| OpenAI | OpenAICatalog | Live /v1/models merged with a hand-maintained capability map (OpenAI’s API doesn’t expose capabilities) |
| Ollama | OllamaCatalog | list + show fan-out with family-prefixed model_info extraction |
| OpenRouter | OpenRouterCatalog | /api/v1/models with parsed per-token pricing and supported_parameters-derived capabilities |
A provider.catalog.refreshed lifecycle event fires whenever a catalog successfully fetches from the network (cache hits don’t emit). A provider.credentials.updated event fires on create/rotate/delete in CredentialManager.
Catalog freshness is tracked per provider via getLastFetchedMs(). The per-provider health checks (provider-claude, provider-openai, etc.) report staleness when the last fetch exceeds 10 minutes. The cache TTL is configurable via AGTOS_PROVIDER_CATALOG_CACHE_TTL_SECONDS (default 1 hour).
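The interaction between the one-hour cache TTL and the 10-minute staleness threshold can be illustrated with a small cache wrapper. This sketch is synchronous and uses an injectable clock for clarity; a real catalog would fetch asynchronously. `CachedCatalog`, `CatalogEntry`, and `isStale` are hypothetical names, while `listModels()` and `getLastFetchedMs()` echo the names used in the text.

```typescript
interface CatalogEntry {
  id: string;
  contextLength: number;
}

class CachedCatalog {
  private models: CatalogEntry[] | null = null;
  private lastFetchedMs: number | null = null;

  constructor(
    private readonly fetchModels: () => CatalogEntry[],
    private readonly ttlMs: number = 60 * 60 * 1000, // 1-hour cache TTL
    private readonly now: () => number = () => Date.now()
  ) {}

  // Serve from cache until the TTL expires, then refetch.
  listModels(): CatalogEntry[] {
    let models = this.models;
    const expired =
      this.lastFetchedMs === null || this.now() - this.lastFetchedMs >= this.ttlMs;
    if (models === null || expired) {
      models = this.fetchModels();
      this.models = models;
      this.lastFetchedMs = this.now();
    }
    return models;
  }

  getLastFetchedMs(): number | null {
    return this.lastFetchedMs;
  }

  // Health checks report staleness well before the cache itself expires.
  isStale(thresholdMs: number = 10 * 60 * 1000): boolean {
    return this.lastFetchedMs === null || this.now() - this.lastFetchedMs > thresholdMs;
  }
}
```

With these defaults a catalog can be reported as stale for up to 50 minutes while still serving cached results, which is consistent with the two separate thresholds described above.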
Credential Rotation
API keys can be rotated at runtime without restarting the server. The ProviderLifecycleManager handles the lifecycle:
- Update the credential via the dashboard Settings page or POST /api/credentials.
- The provider.credentials.updated event fires.
- The lifecycle manager calls updateCredentials() on the client provider instance.
- In-flight requests complete on the old client; new requests use the new credentials.
- Slot registry references are preserved, so no slot reconfiguration is needed.
The per-provider health checks (provider-claude, provider-openai, provider-ollama, provider-openrouter) report credential status, catalog freshness, and whether the client provider is initialized. Ollama is credential-less, so the lifecycle manager owns only its catalog and health check.
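The key property of rotation, that in-flight requests finish on the old client while new requests pick up the new key, can be shown with a toy provider that swaps its client reference. Everything here is a simplified assumption: `RotatableProvider` and `ToyClient` are hypothetical, and a request is modelled as a closure that captured its client at start time.

```typescript
// Stand-in for an SDK client bound to one API key.
interface ToyClient {
  apiKey: string;
}

class RotatableProvider {
  private client: ToyClient;

  constructor(apiKey: string) {
    this.client = { apiKey };
  }

  // Start a request: capture the current client once, up front.
  // The returned function stands in for "the request completing".
  beginRequest(): () => string {
    const client = this.client;
    return () => client.apiKey;
  }

  // Swap in a new client; requests already started keep their old one.
  updateCredentials(apiKey: string): void {
    this.client = { apiKey };
  }
}

const provider = new RotatableProvider("old-key");
const inFlight = provider.beginRequest();
provider.updateCredentials("new-key");
console.log(inFlight(), provider.beginRequest()());
```

Because the provider object itself is never replaced, anything holding a reference to it, such as a slot registry entry, keeps working after the swap.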
Provider Architecture Summary
Claude (Cloud)
Complex reasoning, multi-step tasks, creative generation. Sonnet 4 default, Haiku 4.5 for voice speed. Default cloud provider.
OpenAI (Cloud)
Alternative cloud provider. GPT-4o for reasoning, GPT-4o Mini for speed. Configure per slot in ~/.agtos/config.json.
OpenRouter (Cloud)
First-class cloud provider that proxies Claude, GPT, Gemini, Llama, and more through a single API. Rich catalog + pricing, per-slot config.
Ollama (Local)
Simple queries, privacy-sensitive requests, intent classification. Qwen3 models via local GPU.
sherpa-onnx (In-Process)
In-process STT, TTS, and VAD via ONNX Runtime. No external server. 17+ STT models, true streaming.
speaches (External)
Self-hosted STT (Faster Whisper) and TTS (Kokoro). OpenAI-compatible API on port 8000.
Model Router
Three-tier routing: classify intent, try local, fall back to cloud. Cost and privacy aware.
What’s next
Environment Variables
Complete reference for all 80+ configuration options.
Voice Pipeline
How STT, TTS, and VAD work together in the cascade pipeline.
Docker Deployment
Run agtOS with Docker Compose including Redis and GPU acceleration.