agtOS uses a multi-provider architecture. Claude or OpenAI handles complex reasoning in the cloud, Ollama runs local models for fast responses and privacy, and speech processing runs either in-process via sherpa-onnx or via an external speaches server. The model router ties them together.

Claude Provider

Claude is the primary cloud LLM for agtOS, used for complex conversations, multi-step reasoning, and background agentic tasks. agtOS integrates two Anthropic SDKs (see ADR-003):
  • Client SDK (@anthropic-ai/sdk) — real-time voice path with streaming
  • Agent SDK (@anthropic-ai/claude-agent-sdk) — background autonomous tasks

Model Selection

| Model | ID | Best For | Input $/MTok | Output $/MTok |
| --- | --- | --- | --- | --- |
| Claude Opus 4 | claude-opus-4-20250514 | Complex reasoning, analysis | $15.00 | $75.00 |
| Claude Sonnet 4 | claude-sonnet-4-20250514 | Balanced performance (default) | $3.00 | $15.00 |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 | Speed, low cost, voice pipeline | $0.80 | $4.00 |

Set the model via environment variable:
CLAUDE_MODEL=claude-sonnet-4-20250514
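
As a rough illustration of how the per-token prices in the table translate into per-request cost, the sketch below multiplies token counts by the listed rates. `CLAUDE_PRICING` and `estimateCostUsd` are hypothetical names for this example and are not part of agtOS.

```typescript
// Prices in USD per million tokens, copied from the Model Selection table.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

const CLAUDE_PRICING: Record<string, Pricing> = {
  "claude-opus-4-20250514": { inputPerMTok: 15.0, outputPerMTok: 75.0 },
  "claude-sonnet-4-20250514": { inputPerMTok: 3.0, outputPerMTok: 15.0 },
  "claude-haiku-4-5-20251001": { inputPerMTok: 0.8, outputPerMTok: 4.0 },
};

// Hypothetical helper: estimates the cost of one request in USD.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = CLAUDE_PRICING[model];
  if (price === undefined) {
    throw new Error(`Unknown model: ${model}`);
  }
  return (inputTokens / 1e6) * price.inputPerMTok + (outputTokens / 1e6) * price.outputPerMTok;
}
```

For example, a Sonnet 4 request with 2,000 input tokens and 500 output tokens comes to roughly $0.0135 under these rates.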

Configuration Options

# .env.local

# Model (defaults to Sonnet 4)
CLAUDE_MODEL=claude-sonnet-4-20250514

# Adaptive thinking — lets the model decide when to use extended reasoning
# Values: adaptive (recommended), enabled (legacy), disabled
CLAUDE_THINKING=adaptive

# Effort level for adaptive thinking
# Values: low, medium, high, max (max is Opus 4.6 only)
CLAUDE_EFFORT=medium

# Service tier — 'auto' uses priority capacity when available
# Values: auto, standard_only
CLAUDE_SERVICE_TIER=auto

# Custom API base URL (for proxies or CLI transport)
ANTHROPIC_BASE_URL=https://api.anthropic.com
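
The options above take a small set of allowed values, so it can help to validate them at startup. The following is a minimal sketch of that idea, not the agtOS loader: `loadClaudeEnv` and `pick` are hypothetical helpers, and the fallbacks for thinking, effort, and service tier are assumptions taken from the example values shown here (only the model default is stated by this page).

```typescript
type Env = Record<string, string | undefined>;

const THINKING = ["adaptive", "enabled", "disabled"] as const;
const EFFORT = ["low", "medium", "high", "max"] as const;
const SERVICE_TIER = ["auto", "standard_only"] as const;

// Returns the raw value if it is allowed, the fallback if unset, and throws otherwise.
function pick<T extends string>(
  name: string,
  raw: string | undefined,
  allowed: readonly T[],
  fallback: T,
): T {
  if (raw === undefined || raw === "") return fallback;
  if ((allowed as readonly string[]).includes(raw)) return raw as T;
  throw new Error(`${name} must be one of ${allowed.join(", ")}; got "${raw}"`);
}

// Hypothetical loader; pass process.env (or a test object) as `env`.
function loadClaudeEnv(env: Env) {
  return {
    model: env.CLAUDE_MODEL ?? "claude-sonnet-4-20250514",
    thinking: pick("CLAUDE_THINKING", env.CLAUDE_THINKING, THINKING, "adaptive"), // assumed fallback
    effort: pick("CLAUDE_EFFORT", env.CLAUDE_EFFORT, EFFORT, "medium"), // assumed fallback
    serviceTier: pick("CLAUDE_SERVICE_TIER", env.CLAUDE_SERVICE_TIER, SERVICE_TIER, "auto"), // assumed fallback
    baseUrl: env.ANTHROPIC_BASE_URL ?? "https://api.anthropic.com",
  };
}
```

Failing fast on an unrecognized value such as `CLAUDE_THINKING=sometimes` surfaces typos at startup rather than at the first request.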

Provider Defaults

When no environment variables are set, the Claude provider uses these defaults:
SettingDefaultNotes
modelclaude-sonnet-4-20250514Sonnet 4 balances speed and capability
maxTokens4096Max output tokens per response
temperature0.7Generation temperature (0.0 - 1.0)
timeoutMs3000005-minute request timeout
promptCache.maxEntries50LRU prompt cache size

Dual-SDK Architecture

The voice pipeline uses the Client SDK for real-time streaming, targeting first-token latency under 200ms and first-sentence latency under 500ms. Background tasks like “research the best options for X” are dispatched to the Agent SDK, which handles multi-step tool execution autonomously. Both SDKs connect to the same MCP servers, so tools are defined once and available to both paths. See Authentication for API key setup.

For voice interactions, Haiku 4.5 provides the best speed-to-cost ratio. The model router (below) automatically selects Haiku for simple queries and Sonnet for complex ones.

OpenAI Provider

OpenAI is an alternative cloud LLM provider for agtOS, available as a drop-in replacement for Claude in the model router’s cloud tier (ADR-019). It uses the OpenAI Node SDK v6 with streaming support.

Model Selection

| Model | ID | Best For | Input $/MTok | Output $/MTok |
| --- | --- | --- | --- | --- |
| GPT-4o | gpt-4o | Complex reasoning (default) | $2.50 | $10.00 |
| GPT-4o Mini | gpt-4o-mini | Speed, low cost | $0.15 | $0.60 |

Configuration

Configure OpenAI as the provider for one or more model slots in ~/.agtos/config.json:
{
  "slots": {
    "chat": { "provider": "openai", "model": "gpt-4o" },
    "reasoning": { "provider": "openai", "model": "gpt-4o" }
  }
}
Or run agtos setup to configure slots interactively. You also need the API key:
# .env.local
OPENAI_API_KEY=sk-your-key-here

Features

  • Streaming: Full streaming support via .stream() with finalChatCompletion()
  • Tool calling: Function-based tool calls compatible with the agtOS tool registry
  • Session management: 30-minute TTL sessions with token tracking
  • Barge-in: Stream cancellation via AbortSignal for voice pipeline interrupts
  • Health check: Probes the /v1/models endpoint when an API key is configured

When a slot is configured with "provider": "openai", the model router sends that slot’s requests to OpenAI. If the OpenAI provider fails to initialize, it falls back to Claude with a warning.
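
The initialization fallback described above could look roughly like the sketch below. `initCloudProvider` and the `initOpenAi` callback are hypothetical stand-ins for provider start-up and do not reflect agtOS internals.

```typescript
type CloudProvider = "openai" | "claude";

interface InitResult {
  provider: CloudProvider;
  warning?: string;
}

// initOpenAi is a hypothetical start-up hook that throws when OpenAI cannot be initialized.
function initCloudProvider(requested: CloudProvider, initOpenAi: () => void): InitResult {
  if (requested !== "openai") {
    return { provider: "claude" };
  }
  try {
    initOpenAi();
    return { provider: "openai" };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return {
      provider: "claude",
      warning: `OpenAI provider failed to initialize (${reason}); falling back to Claude`,
    };
  }
}
```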

Ollama Provider

Ollama serves local models for intent classification and for handling simple queries without cloud API calls. This reduces cost and latency and keeps privacy-sensitive requests on-device.

Configuration

# .env.local

# Ollama server URL
OLLAMA_HOST=http://localhost:11434

# Default model for local query execution
OLLAMA_DEFAULT_MODEL=qwen3:4b

# Model for intent classification (small and fast)
OLLAMA_INTENT_MODEL=qwen3:1.7b

Intent Classifier

The intent classifier is a small, fast model that categorizes every incoming request before it reaches an LLM. It runs on CPU in under 50ms and determines:
| Category | Description | Route |
| --- | --- | --- |
| simple_query | Factual queries, greetings, time/date | Local (Ollama) |
| system_command | Volume, timer, reminder | Local (Ollama) |
| tool_use | File ops, API calls | May need Claude |
| complex_reasoning | Analysis, code review | Claude |
| creative | Writing, brainstorming | Claude |

Classifier Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| host | http://localhost:11434 | Ollama API URL |
| model | qwen3:1.7b | Small model for fast classification |
| confidenceThreshold | 0.7 | Below this confidence, escalate to Claude |
| timeoutMs | 2000 | Max classification time before defaulting to cloud |

Ollama has a confirmed bug where enabling streaming and tools at the same time produces malformed output. agtOS automatically uses stream: false when tools are involved, which adds latency but produces correct results.
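
To make the classifier's role concrete, here is a minimal sketch of how a classification result might be turned into a route using the confidence threshold and timeout behavior described above. The type and function names are hypothetical, and treating tool_use as a cloud route is an assumption (the table only says it "may need Claude").

```typescript
type Category =
  | "simple_query"
  | "system_command"
  | "tool_use"
  | "complex_reasoning"
  | "creative";
type Route = "local" | "cloud";

interface Classification {
  category: Category;
  confidence: number; // 0.0 to 1.0
}

const CONFIDENCE_THRESHOLD = 0.7; // classifier default from the table

// Categories the table routes to Ollama.
const LOCAL_CATEGORIES: ReadonlySet<Category> = new Set<Category>([
  "simple_query",
  "system_command",
]);

// `null` models a classifier timeout or failure, which defaults to cloud.
function routeFor(result: Classification | null, threshold: number = CONFIDENCE_THRESHOLD): Route {
  if (result === null) return "cloud";
  if (result.confidence < threshold) return "cloud"; // low confidence: escalate
  return LOCAL_CATEGORIES.has(result.category) ? "local" : "cloud";
}

// Ollama requests that carry tools are sent with stream: false (see the note above).
function ollamaStreamFlag(hasTools: boolean): boolean {
  return !hasTools;
}
```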

speaches STT/TTS (Fallback)

speaches is a self-hosted server that provides OpenAI-compatible speech-to-text and text-to-speech endpoints. agtOS can use it as a fallback when sherpa-onnx is not available.

Configuration

# .env.local

# Use external speaches server (fallback)
STT_PROVIDER=speaches
TTS_PROVIDER=speaches

# speaches server URL (shared by STT and TTS)
SPEACHES_URL=http://localhost:8000

# --- Speech-to-Text ---
SPEACHES_STT_MODEL=Systran/faster-whisper-small

# --- Text-to-Speech ---
SPEACHES_TTS_MODEL=speaches-ai/Kokoro-82M-v1.0-ONNX
SPEACHES_TTS_VOICE=af_heart

STT Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| baseUrl | http://localhost:8000 | speaches API URL |
| model | Systran/faster-whisper-small | Faster Whisper model |
| language | en | Language code (en, es, fr, etc.) |
| timeoutMs | 30000 | 30-second request timeout |

TTS Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| baseUrl | http://localhost:8000 | speaches API URL |
| model | speaches-ai/Kokoro-82M-v1.0-ONNX | Kokoro ONNX model for fast synthesis |
| voice | af_heart | Default voice ID |
| format | wav | Output format (wav or mp3) |
| speed | 1.0 | Speaking speed (0.25 - 4.0) |
| timeoutMs | 30000 | 30-second request timeout |

speaches does not support opus or aac audio formats. Use wav (default) or mp3.
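
The format and speed limits above can be enforced before a request is sent. The sketch below builds a request body in the shape used by OpenAI's speech endpoint, on the assumption that speaches accepts the same fields because it is OpenAI-compatible. `buildSpeechRequest` is a hypothetical helper, not an agtOS function.

```typescript
type TtsFormat = "wav" | "mp3";

interface SpeechRequest {
  model: string;
  input: string;
  voice: string;
  response_format: TtsFormat;
  speed: number;
}

interface SpeechOptions {
  model?: string;
  voice?: string;
  format?: string;
  speed?: number;
}

function isTtsFormat(value: string): value is TtsFormat {
  return value === "wav" || value === "mp3";
}

// Applies the TTS defaults from the table and rejects unsupported values.
function buildSpeechRequest(input: string, opts: SpeechOptions = {}): SpeechRequest {
  const format = opts.format ?? "wav";
  if (!isTtsFormat(format)) {
    throw new Error(`Unsupported format "${format}": use wav or mp3`);
  }
  const speed = opts.speed ?? 1.0;
  if (speed < 0.25 || speed > 4.0) {
    throw new Error(`Speed ${speed} is outside the 0.25 - 4.0 range`);
  }
  return {
    model: opts.model ?? "speaches-ai/Kokoro-82M-v1.0-ONNX",
    input,
    voice: opts.voice ?? "af_heart",
    response_format: format,
    speed,
  };
}
```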

sherpa-onnx Provider (Default)

sherpa-onnx is the default STT/TTS/VAD provider. It runs directly in the Node.js process via a native ONNX Runtime addon. No Python, no external server, no HTTP round-trips. See ADR-017 for the decision rationale.

Why sherpa-onnx?

  • In-process: No network latency for STT/TTS calls
  • 17+ STT models: Whisper, Moonshine, SenseVoice, Zipformer, Paraformer
  • True streaming STT: Partial results while the user is still speaking
  • Voice cloning: PocketTTS and ZipVoice support (future)
  • Apple Silicon: CoreML acceleration on macOS

Configuration

# sherpa-onnx is the default — these are shown for clarity
STT_PROVIDER=sherpa-onnx
TTS_PROVIDER=sherpa-onnx

# Model selection
SHERPA_STT_MODEL=moonshine-tiny-en-int8    # Fast English (default)
SHERPA_TTS_MODEL=kokoro-int8-multi-v1      # Kokoro TTS (default)
SHERPA_TTS_VOICE=af_heart                  # Default voice

# Performance tuning
SHERPA_STT_NUM_THREADS=4
SHERPA_TTS_NUM_THREADS=2
SHERPA_TTS_POOL_SIZE=3                     # Concurrent TTS instances

Available TTS Voices

The Kokoro TTS model includes 11 built-in voices. OpenAI voice names are mapped to their closest Kokoro equivalents.
| Voice ID | Description | OpenAI Alias |
| --- | --- | --- |
| af_heart | Warm, natural American female | alloy |
| af_bella | Clear, expressive American female | nova |
| af_nicole | Calm, professional American female | |
| af_sarah | Friendly, conversational American female | |
| af_sky | Bright, energetic American female | |
| am_adam | Steady, confident American male | echo |
| am_michael | Friendly, conversational American male | onyx |
| bf_emma | Polished British female | shimmer |
| bf_isabella | Elegant British female | |
| bm_george | Authoritative British male | fable |
| bm_lewis | Warm British male | |
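
The alias column can be read as a simple lookup. The sketch below resolves either an OpenAI voice name or a Kokoro voice ID to a Kokoro voice ID. `resolveVoice` is a hypothetical helper, and falling back to af_heart for unrecognized names is an assumption rather than documented behavior.

```typescript
// OpenAI voice names mapped to Kokoro voice IDs, as listed in the table.
const OPENAI_TO_KOKORO: Record<string, string> = {
  alloy: "af_heart",
  nova: "af_bella",
  echo: "am_adam",
  onyx: "am_michael",
  shimmer: "bf_emma",
  fable: "bm_george",
};

const KOKORO_VOICES: ReadonlySet<string> = new Set([
  "af_heart", "af_bella", "af_nicole", "af_sarah", "af_sky",
  "am_adam", "am_michael",
  "bf_emma", "bf_isabella",
  "bm_george", "bm_lewis",
]);

// Accepts a Kokoro ID or an OpenAI alias; unknown names use the fallback (assumed behavior).
function resolveVoice(name: string, fallback: string = "af_heart"): string {
  if (KOKORO_VOICES.has(name)) return name;
  const mapped = OPENAI_TO_KOKORO[name];
  return mapped !== undefined ? mapped : fallback;
}
```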

STT Model Router

The STT model router automatically selects the best model based on context:
| Context | Selected Model | Reason |
| --- | --- | --- |
| English, fast mode | Moonshine Tiny | Lowest latency |
| Non-English | SenseVoice INT8 | Multilingual support |
| Streaming requested | Zipformer EN 20M | Real-time partial results |
| Quality mode | SenseVoice INT8 | Best accuracy |

The router is consulted automatically when the configured model is not available.

sherpa-onnx requires downloading ONNX model files (~460MB for the default set). Run npx agtos models download --default before first use. Models are cached locally in ~/.agtos/models/.
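
The selection table above can be sketched as a small decision function. The page does not say which rule wins when several contexts apply, so the precedence here (language first, then streaming, then quality) is an assumption, and `selectSttModel` is a hypothetical name. The returned strings are the display names from the table, not model IDs.

```typescript
interface SttContext {
  language: string; // language code such as "en" or "es"
  streaming?: boolean;
  mode?: "fast" | "quality";
}

// Assumed precedence: non-English, then streaming, then quality, then the fast English default.
function selectSttModel(ctx: SttContext): string {
  if (ctx.language !== "en") return "SenseVoice INT8"; // multilingual support
  if (ctx.streaming === true) return "Zipformer EN 20M"; // real-time partial results
  if (ctx.mode === "quality") return "SenseVoice INT8"; // best accuracy
  return "Moonshine Tiny"; // lowest latency
}
```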

Model Router

The model router implements ADR-004 — a three-tier routing architecture that sends each request to the optimal inference tier.

How Routing Works

User Request
     |
     v
Tier 1: Intent Classifier (<50ms, local)
     |
     +-- simple + tools ------> Tier 2: Ollama (stream: false)
     +-- simple + no tools ----> Tier 2: Ollama (stream: true)
     +-- complex --------------> Tier 3: Cloud (Claude or OpenAI, streaming)
     +-- very complex ---------> Tier 3: Cloud (Claude or OpenAI, streaming)
     +-- privacy sensitive ----> Tier 2: Ollama (regardless of complexity)
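
The diagram above can be expressed as a small pure function. This is a sketch of the decision table only, with hypothetical type and function names, and it assumes the classifier has already produced the complexity, tool, and privacy flags.

```typescript
type Complexity = "simple" | "complex" | "very_complex";

interface ClassifiedRequest {
  complexity: Complexity;
  needsTools: boolean;
  privacySensitive: boolean;
}

interface RoutingDecision {
  tier: 2 | 3;
  provider: "ollama" | "cloud";
  stream: boolean;
}

function routeRequest(req: ClassifiedRequest): RoutingDecision {
  // Privacy-sensitive requests stay local regardless of complexity.
  // Simple requests also go to Ollama. Streaming is disabled when tools are involved.
  if (req.privacySensitive || req.complexity === "simple") {
    return { tier: 2, provider: "ollama", stream: !req.needsTools };
  }
  // Complex and very complex requests go to the cloud tier (Claude or OpenAI), streaming.
  return { tier: 3, provider: "cloud", stream: true };
}
```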

Router Configuration

The model router uses the Model Slot Registry (ADR-020) to route requests. Each slot maps to a provider and model:
~/.agtos/config.json
{
  "slots": {
    "chat": { "provider": "claude", "model": "claude-sonnet-4-20250514" },
    "reasoning": { "provider": "claude", "model": "claude-sonnet-4-20250514" },
    "coding": { "provider": "claude", "model": "claude-sonnet-4-20250514" },
    "tool_calling": { "provider": "ollama", "model": "qwen3:4b" },
    "creative": { "provider": "claude", "model": "claude-sonnet-4-20250514" }
  }
}
The intent classifier routes each request to a named slot, and the registry resolves that slot to a provider + model pair. Slots can also define fallback chains:
{
  "slots": {
    "chat": { "provider": "openai", "model": "gpt-4o", "fallback": "reasoning" },
    "reasoning": { "provider": "claude", "model": "claude-sonnet-4-20250514" }
  }
}
Set the command provider via environment variable:
# .env.local
COMMAND_PROVIDER=model-router    # default — use the slot-based router
OLLAMA_DEFAULT_MODEL=qwen3:4b    # local model for Tier 2
OLLAMA_INTENT_MODEL=qwen3:1.7b   # intent classification model

Built-in Slots

| Slot | Type | Purpose |
| --- | --- | --- |
| chat | Conversation | General chat (required; the system won’t start without it) |
| reasoning | Conversation | Complex analysis and multi-step reasoning |
| coding | Conversation | Code generation and review |
| tool_calling | Conversation | Requests that require tool execution |
| creative | Conversation | Writing, brainstorming, creative tasks |
| embedding | Task | Vector embeddings for semantic memory |
| classifier | Task | Intent classification for routing |
| summarization | Task | Conversation summarization |
| consolidation | Task | Memory consolidation (Dreamer) |
| dialectic | Task | User reasoning (Dialectic engine) |
| maintenance | Task | Stage 3 LLM judge for the NLI hybrid contradiction pipeline (ADR-027). Defaults to fallback: 'consolidation' so existing single-provider setups keep working unchanged. |

Pattern Overrides

The router supports forceSlotPatterns — regex patterns that force routing to a specific slot regardless of classification:
{
  "forceSlotPatterns": {
    "reasoning": ["analyze.*code", "review.*pull.?request", "explain.*architecture"],
    "chat": ["^(hi|hello|hey)", "^what.*time", "^set.*timer"]
  }
}
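
A minimal sketch of how such patterns might be evaluated is shown below. `forcedSlot` is a hypothetical helper; matching case-insensitively and checking slots in declaration order are assumptions, since the page does not specify either.

```typescript
type ForceSlotPatterns = Record<string, string[]>;

const forceSlotPatterns: ForceSlotPatterns = {
  reasoning: ["analyze.*code", "review.*pull.?request", "explain.*architecture"],
  chat: ["^(hi|hello|hey)", "^what.*time", "^set.*timer"],
};

// Returns the first slot with a matching pattern, or null to defer to the classifier.
function forcedSlot(text: string, patterns: ForceSlotPatterns): string | null {
  for (const [slot, sources] of Object.entries(patterns)) {
    // The "i" flag (case-insensitive) is an assumption of this sketch.
    if (sources.some((source) => new RegExp(source, "i").test(text))) {
      return slot;
    }
  }
  return null;
}
```

Note that an unanchored prefix such as `^(hi|hello|hey)` also matches inputs like "history of Rome", so a word boundary (`\b`) may be worth adding in real configurations.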

Fallback Strategy

The router handles failures gracefully via per-slot fallback chains:
  1. If a slot’s primary provider fails, the registry tries the slot’s fallback slot (max depth: 3, circular reference guard)
  2. The chat slot is the terminal fallback — it always exists and cannot be removed
  3. Classification errors are tracked so thresholds can be adjusted over time
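
The fallback rules above can be sketched as a chain-resolution function. This is an illustration under stated assumptions, not the registry's code: `resolveFallbackChain` is a hypothetical name, "max depth: 3" is interpreted as at most three fallback hops after the starting slot, and the terminal chat slot is appended when the chain does not already include it.

```typescript
interface SlotConfig {
  provider: string;
  model: string;
  fallback?: string;
}
type Slots = Record<string, SlotConfig>;

const MAX_FALLBACK_DEPTH = 3;
const TERMINAL_SLOT = "chat";

// Returns the ordered list of slots to try for a request that starts at `start`.
function resolveFallbackChain(slots: Slots, start: string): string[] {
  const chain: string[] = [];
  const seen = new Set<string>();
  let current: string | undefined = start;
  while (current !== undefined && chain.length <= MAX_FALLBACK_DEPTH) {
    const config: SlotConfig | undefined = slots[current];
    // Stop on an unknown slot or a circular reference.
    if (config === undefined || seen.has(current)) break;
    chain.push(current);
    seen.add(current);
    current = config.fallback;
  }
  // The chat slot is the terminal fallback.
  if (!seen.has(TERMINAL_SLOT) && slots[TERMINAL_SLOT] !== undefined) {
    chain.push(TERMINAL_SLOT);
  }
  return chain;
}
```

A caller would try each slot in the returned order and stop at the first provider that succeeds.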

Per-Slot Metrics

Each slot is instrumented with Prometheus metrics:
| Metric | Labels | Description |
| --- | --- | --- |
| agtos_slot_duration_seconds | slot | Request duration histogram |
| agtos_slot_requests_total | slot | Total request count |
| agtos_slot_errors_total | slot | Total error count |

Bypassing the Router

To skip the router and use a single provider directly:
# Use Claude directly (no local routing)
COMMAND_PROVIDER=claude

# Use Ollama directly (no cloud fallback)
COMMAND_PROVIDER=ollama

Cognitive Task Providers

Beyond the main LLM and speech providers, agtOS has several specialized AI tasks that can each use a different provider (ADR-018). This allows fine-grained optimization — for example, using local Ollama for embeddings while routing reasoning tasks to Claude, or pinning a cheap fast model for the maintenance task slot’s LLM judge.
| Task | Variable | Options | Purpose |
| --- | --- | --- | --- |
| Embedding | AGTOS_EMBEDDING_PROVIDER | ollama, openrouter | Vector embeddings for semantic memory search |
| Classification | AGTOS_CLASSIFIER_PROVIDER | ollama, claude, openrouter | Intent classification for model routing |
| Consolidation | AGTOS_CONSOLIDATION_PROVIDER | ollama, claude, openrouter | Memory consolidation (Dreamer): compresses episodic memories |
| Reasoning | AGTOS_REASONING_PROVIDER | ollama, claude, openrouter | Dialectic reasoning: synthesizes user profile conclusions |
| Summarization | AGTOS_SUMMARIZATION_PROVIDER | ollama, claude, openrouter | Conversation summarization for working memory |

Each task also has a _MODEL variable (e.g., AGTOS_EMBEDDING_MODEL) to override the default model.
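
As an illustration of the naming pattern, the sketch below reads the provider and model overrides for one task and checks the provider against the options in the table. `readTaskOverride` is a hypothetical helper. Only AGTOS_EMBEDDING_MODEL is named on this page, so the other _MODEL variable names are inferred from the pattern and should be treated as assumptions.

```typescript
type Env = Record<string, string | undefined>;

// Allowed providers per task, keyed by the variable-name segment used in the table.
const TASK_OPTIONS = {
  EMBEDDING: ["ollama", "openrouter"],
  CLASSIFIER: ["ollama", "claude", "openrouter"],
  CONSOLIDATION: ["ollama", "claude", "openrouter"],
  REASONING: ["ollama", "claude", "openrouter"],
  SUMMARIZATION: ["ollama", "claude", "openrouter"],
} as const;

type Task = keyof typeof TASK_OPTIONS;

interface TaskOverride {
  provider: string | undefined;
  model: string | undefined;
}

// Reads AGTOS_<TASK>_PROVIDER and AGTOS_<TASK>_MODEL; undefined means "use the default".
function readTaskOverride(env: Env, task: Task): TaskOverride {
  const provider = env[`AGTOS_${task}_PROVIDER`];
  const model = env[`AGTOS_${task}_MODEL`];
  const allowed: readonly string[] = TASK_OPTIONS[task];
  if (provider !== undefined && !allowed.includes(provider)) {
    throw new Error(`AGTOS_${task}_PROVIDER must be one of ${allowed.join(", ")}; got "${provider}"`);
  }
  return { provider, model };
}
```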

OpenRouter

OpenRouter is a first-class provider in agtOS (ADR-026). It proxies requests to Claude, GPT, Gemini, Llama, and many other models through a single API, and can be configured for any slot: conversation slots (chat, reasoning, coding, etc.) as well as task slots (embedding, classifier, consolidation, dialectic, maintenance).
OPENROUTER_API_KEY=sk-or-your-key
~/.agtos/config.json
{
  "slots": {
    "chat": { "provider": "openrouter", "model": "anthropic/claude-sonnet-4" },
    "maintenance": { "provider": "openrouter", "model": "openai/gpt-4o-mini" }
  }
}
OpenRouter has its own credential scope (provider-openrouter) — distinct from provider-openai — and the client sets the HTTP-Referer and X-Title attribution headers required by the OpenRouter leaderboard. The OpenRouterCatalog pulls rich model metadata from /api/v1/models (context length, per-token pricing, supported parameters, and input modalities for vision / PDF / audio detection), while /api/v1/key powers the account info card in the dashboard. These settings are also configurable at runtime via PUT /api/settings — see Environment Variables for the full list.
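
For reference, the attribution headers mentioned above sit alongside the usual bearer token. The sketch below only builds the header map. `openRouterHeaders` is a hypothetical helper, and the application URL and title in the usage line are placeholders, not the values agtOS sends.

```typescript
interface AppAttribution {
  url: string; // sent as HTTP-Referer
  title: string; // sent as X-Title
}

// Builds request headers for an OpenRouter API call.
function openRouterHeaders(apiKey: string, app: AppAttribution): Record<string, string> {
  return {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    "HTTP-Referer": app.url,
    "X-Title": app.title,
  };
}

// Placeholder values for illustration only.
const exampleHeaders = openRouterHeaders("sk-or-your-key", {
  url: "https://example.com",
  title: "Example App",
});
```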

Provider Catalog

Every provider implements the ProviderCatalog interface (ADR-026) so the dashboard, the agtos setup wizard, and slot pickers can discover available models in a provider-agnostic way. listModels() returns a list of ModelInfo entries with context length, max output tokens, per-1M-token pricing, and a 13-entry capability union (including 'contradiction' for the NLI hybrid pipeline). Catalog results cache for one hour by default.
| Provider | Catalog implementation | Source |
| --- | --- | --- |
| Claude | ClaudeCatalog | Auto-paginated client.models.list() with capabilities.{batch, code_execution, image_input, pdf_input, structured_outputs, thinking} flags |
| OpenAI | OpenAICatalog | Live /v1/models merged with a hand-maintained capability map (OpenAI’s API doesn’t expose capabilities) |
| Ollama | OllamaCatalog | list + show fan-out with family-prefixed model_info extraction |
| OpenRouter | OpenRouterCatalog | /api/v1/models with parsed per-token pricing and supported_parameters-derived capabilities |

A provider.catalog.refreshed lifecycle event fires whenever a catalog successfully fetches from the network (cache hits don’t emit). A provider.credentials.updated event fires on create/rotate/delete in CredentialManager. Catalog freshness is tracked per provider via getLastFetchedMs(). The per-provider health checks (provider-claude, provider-openai, etc.) report staleness when the last fetch exceeds 10 minutes. The cache TTL is configurable via AGTOS_PROVIDER_CATALOG_CACHE_TTL_SECONDS (default 1 hour).
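
The interaction between the one-hour cache and the ten-minute staleness check can be sketched as follows. This is a simplified, synchronous illustration with hypothetical names (`createCachedCatalog`, `isStale`, `onRefreshed`); only listModels() and getLastFetchedMs() are names taken from this page, and a real catalog would fetch asynchronously.

```typescript
const DEFAULT_TTL_MS = 60 * 60 * 1000; // 1 hour cache TTL
const STALE_AFTER_MS = 10 * 60 * 1000; // health check staleness threshold

interface CatalogOptions {
  now?: () => number; // injectable clock for testing
  ttlMs?: number;
  onRefreshed?: () => void; // stands in for the provider.catalog.refreshed event
}

function createCachedCatalog<T>(fetchModels: () => T[], opts: CatalogOptions = {}) {
  const now = opts.now ?? (() => Date.now());
  const ttlMs = opts.ttlMs ?? DEFAULT_TTL_MS;
  const onRefreshed = opts.onRefreshed ?? (() => {});
  let cached: T[] | null = null;
  let lastFetchedMs: number | null = null;

  return {
    listModels(): T[] {
      if (cached !== null && lastFetchedMs !== null && now() - lastFetchedMs < ttlMs) {
        return cached; // cache hit: no event emitted
      }
      cached = fetchModels();
      lastFetchedMs = now();
      onRefreshed(); // only successful network fetches emit
      return cached;
    },
    getLastFetchedMs(): number | null {
      return lastFetchedMs;
    },
    isStale(): boolean {
      return lastFetchedMs === null || now() - lastFetchedMs > STALE_AFTER_MS;
    },
  };
}
```

Under these defaults a catalog can be reported as stale by the health check while still serving cached results, because the staleness threshold is shorter than the cache TTL.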

Credential Rotation

API keys can be rotated at runtime without restarting the server. The ProviderLifecycleManager handles the lifecycle:
  1. Update the credential via the dashboard Settings page or POST /api/credentials.
  2. The provider.credentials.updated event fires.
  3. The lifecycle manager calls updateCredentials() on the client provider instance.
  4. In-flight requests complete on the old client; new requests use the new credentials.
  5. Slot registry references are preserved — no slot reconfiguration needed.
Per-provider health checks (provider-claude, provider-openai, provider-ollama, provider-openrouter) report credential status, catalog freshness, and whether the client provider is initialized. Ollama is credential-less — the lifecycle manager owns only its catalog and health check.
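
Step 4 of the rotation flow (in-flight requests finish on the old client while new requests pick up the new credentials) follows naturally if each request captures a client reference when it starts. The sketch below illustrates that idea. `createRotatingProvider`, `acquire`, and the `Client` shape are hypothetical; only updateCredentials() is a name taken from this page.

```typescript
interface Client {
  apiKey: string;
}

function createRotatingProvider(initialKey: string, makeClient: (key: string) => Client) {
  let client = makeClient(initialKey);
  return {
    // A request calls acquire() once at start and keeps using that client until it completes.
    acquire(): Client {
      return client;
    },
    // Swaps in a new client. References already handed out are unaffected.
    updateCredentials(newKey: string): void {
      client = makeClient(newKey);
    },
  };
}
```

Because the provider object itself is never replaced, anything holding a reference to it, such as a slot registry entry, keeps working after a rotation.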

Provider Architecture Summary

Claude (Cloud)

Complex reasoning, multi-step tasks, creative generation. Sonnet 4 default, Haiku 4.5 for voice speed. Default cloud provider.

OpenAI (Cloud)

Alternative cloud provider. GPT-4o for reasoning, GPT-4o Mini for speed. Configure per slot in ~/.agtos/config.json.

OpenRouter (Cloud)

First-class cloud provider that proxies Claude, GPT, Gemini, Llama, and more through a single API. Rich catalog + pricing, per-slot config.

Ollama (Local)

Simple queries, privacy-sensitive requests, intent classification. Qwen3 models via local GPU.

sherpa-onnx (In-Process)

In-process STT, TTS, and VAD via ONNX Runtime. No external server. 17+ STT models, true streaming.

speaches (External)

Self-hosted STT (Faster Whisper) and TTS (Kokoro). OpenAI-compatible API on port 8000.

Model Router

Three-tier routing: classify intent, try local, fall back to cloud. Cost and privacy aware.

What’s next

Environment Variables

Complete reference for all 80+ configuration options.

Voice Pipeline

How STT, TTS, and VAD work together in the cascade pipeline.

Docker Deployment

Run agtOS with Docker Compose including Redis and GPU acceleration.