agtOS uses a multi-provider architecture. Claude handles complex reasoning in the cloud, Ollama runs local models for fast responses and privacy, and speaches provides speech-to-text and text-to-speech. The model router ties them together, sending each request to the optimal provider.

Claude Provider

Claude is the primary cloud LLM for agtOS, used for complex conversations, multi-step reasoning, and background agentic tasks. agtOS integrates two Anthropic SDKs (see ADR-003):
  • Client SDK (@anthropic-ai/sdk) — real-time voice path with streaming
  • Agent SDK (@anthropic-ai/claude-agent-sdk) — background autonomous tasks

Model Selection

| Model | ID | Best For | Input $/MTok | Output $/MTok |
| --- | --- | --- | --- | --- |
| Claude Opus 4 | claude-opus-4-20250514 | Complex reasoning, analysis | $15.00 | $75.00 |
| Claude Sonnet 4 | claude-sonnet-4-20250514 | Balanced performance (default) | $3.00 | $15.00 |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 | Speed, low cost, voice pipeline | $0.80 | $4.00 |
Set the model via environment variable:
CLAUDE_MODEL=claude-sonnet-4-20250514
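A provider would typically resolve the model in one place, falling back to the documented default when the variable is unset. A minimal sketch (`resolveClaudeModel` is an illustrative name, not the actual agtOS function):

```typescript
// Resolve the Claude model from an env map, falling back to Sonnet 4.
// resolveClaudeModel is an illustrative helper, not part of the agtOS API.
const DEFAULT_CLAUDE_MODEL = "claude-sonnet-4-20250514";

function resolveClaudeModel(env: Record<string, string | undefined>): string {
  const model = env.CLAUDE_MODEL?.trim();
  return model && model.length > 0 ? model : DEFAULT_CLAUDE_MODEL;
}
```

Passing `{ CLAUDE_MODEL: "claude-haiku-4-5-20251001" }` returns the override; an empty or missing value returns the default.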

Configuration Options

# .env.local

# Model (defaults to Sonnet 4)
CLAUDE_MODEL=claude-sonnet-4-20250514

# Adaptive thinking — lets the model decide when to use extended reasoning
# Values: adaptive (recommended), enabled (legacy), disabled
CLAUDE_THINKING=adaptive

# Effort level for adaptive thinking
# Values: low, medium, high, max (max is Opus 4.6 only)
CLAUDE_EFFORT=medium

# Service tier — 'auto' uses priority capacity when available
# Values: auto, standard_only
CLAUDE_SERVICE_TIER=auto

# Custom API base URL (for proxies or CLI transport)
ANTHROPIC_BASE_URL=https://api.anthropic.com

Provider Defaults

When no environment variables are set, the Claude provider uses these defaults:
| Setting | Default | Notes |
| --- | --- | --- |
| model | claude-sonnet-4-20250514 | Sonnet 4 balances speed and capability |
| maxTokens | 4096 | Max output tokens per response |
| temperature | 0.7 | Generation temperature (0.0 - 1.0) |
| timeoutMs | 300000 | 5-minute request timeout |
| promptCache.maxEntries | 50 | LRU prompt cache size |

Dual-SDK Architecture

The voice pipeline uses the Client SDK for real-time streaming, targeting first-token latency under 200ms and first-sentence latency under 500ms. Background tasks like “research the best options for X” are dispatched to the Agent SDK, which handles multi-step tool execution autonomously. Both SDKs connect to the same MCP servers, so tools are defined once and available to both paths. See Authentication for API key setup.
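The split between the two SDK paths can be sketched as a small dispatch function. The names below are illustrative assumptions, not the actual agtOS API:

```typescript
// Illustrative dispatch between the two SDK paths. Real-time voice turns
// go to the Client SDK's streaming path; long-running autonomous tasks go
// to the Agent SDK. These type and function names are assumptions.
type SdkPath = "client-sdk-stream" | "agent-sdk-background";

interface TaskHints {
  realtime: boolean;  // part of a live voice turn?
  multiStep: boolean; // needs autonomous multi-step tool execution?
}

function pickSdkPath(hints: TaskHints): SdkPath {
  // Latency wins: anything in the live voice loop streams via the Client SDK.
  if (hints.realtime) return "client-sdk-stream";
  // Background research-style tasks run under the Agent SDK.
  if (hints.multiStep) return "agent-sdk-background";
  return "client-sdk-stream";
}
```

Since both paths share the same MCP servers, this dispatch only chooses the execution style, not the available tools.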
For voice interactions, Haiku 4.5 provides the best speed-to-cost ratio. The model router (below) automatically selects Haiku for simple queries and Sonnet for complex ones.

Ollama Provider

Ollama serves local models for intent classification and for handling simple queries without cloud API calls, which cuts cost and latency and keeps privacy-sensitive requests on-device.

Configuration

# .env.local

# Ollama server URL
OLLAMA_HOST=http://localhost:11434

# Default model for local query execution
OLLAMA_DEFAULT_MODEL=qwen3:4b

# Model for intent classification (small and fast)
OLLAMA_INTENT_MODEL=qwen3:1.7b

Intent Classifier

The intent classifier is a small, fast model that categorizes every incoming request before it reaches an LLM. It runs on CPU in under 50ms and determines:
| Category | Description | Route |
| --- | --- | --- |
| simple_query | Factual queries, greetings, time/date | Local (Ollama) |
| system_command | Volume, timer, reminder | Local (Ollama) |
| tool_use | File ops, API calls | May need Claude |
| complex_reasoning | Analysis, code review | Claude |
| creative | Writing, brainstorming | Claude |
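The category-to-route mapping above can be sketched as a lookup table (illustrative names; the real router's internals may differ):

```typescript
// Sketch of mapping classifier categories to a first-pass route,
// mirroring the table above. Names are illustrative.
type Route = "ollama" | "claude" | "maybe-claude";

const CATEGORY_ROUTES: Record<string, Route> = {
  simple_query: "ollama",
  system_command: "ollama",
  tool_use: "maybe-claude", // escalated only if the tool needs cloud reasoning
  complex_reasoning: "claude",
  creative: "claude",
};

function routeForCategory(category: string): Route {
  // Unknown categories fall back to the cloud, the safer default.
  return CATEGORY_ROUTES[category] ?? "claude";
}
```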

Classifier Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| host | http://localhost:11434 | Ollama API URL |
| model | qwen3:1.7b | Small model for fast classification |
| confidenceThreshold | 0.7 | Below this confidence, escalate to Claude |
| timeoutMs | 2000 | Max classification time before defaulting to cloud |
Ollama has a confirmed bug where enabling streaming and tools at the same time produces malformed output. agtOS automatically uses stream: false when tools are involved, which adds latency but produces correct results.
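The workaround can be sketched as a request builder that disables streaming whenever tools are attached (illustrative shape, not the actual agtOS request builder):

```typescript
// Workaround sketch for the Ollama streaming + tools bug described above:
// disable streaming whenever tools are attached. Types are illustrative.
interface OllamaChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  tools?: unknown[];
  stream: boolean;
}

function buildOllamaRequest(
  model: string,
  messages: { role: string; content: string }[],
  tools?: unknown[],
): OllamaChatRequest {
  const hasTools = (tools?.length ?? 0) > 0;
  return {
    model,
    messages,
    tools: hasTools ? tools : undefined,
    stream: !hasTools, // never stream when tools are present
  };
}
```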

speaches STT/TTS

speaches is a self-hosted server that provides OpenAI-compatible speech-to-text and text-to-speech endpoints. agtOS uses it for the cascade voice pipeline.

Configuration

# .env.local

# speaches server URL (shared by STT and TTS)
SPEACHES_URL=http://localhost:8000

# --- Speech-to-Text ---
STT_PROVIDER=speaches
SPEACHES_STT_MODEL=Systran/faster-whisper-small

# --- Text-to-Speech ---
TTS_PROVIDER=speaches
SPEACHES_TTS_MODEL=speaches-ai/Kokoro-82M-v1.0-ONNX
SPEACHES_TTS_VOICE=af_heart
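Since speaches exposes OpenAI-compatible endpoints, a TTS request can be sketched against the standard audio-speech path. The URL path and body fields below follow the OpenAI audio API shape and are assumptions; adjust if your speaches version differs:

```typescript
// Sketch of a TTS request body for speaches' OpenAI-compatible endpoint.
// Defaults mirror the TTS Defaults table; the shape is an assumption.
interface TtsOptions {
  baseUrl?: string;
  model?: string;
  voice?: string;
  format?: "wav" | "mp3"; // speaches does not support opus or aac
  speed?: number;         // 0.25 - 4.0
}

function buildTtsRequest(text: string, opts: TtsOptions = {}) {
  const baseUrl = opts.baseUrl ?? "http://localhost:8000";
  return {
    url: `${baseUrl}/v1/audio/speech`,
    body: {
      model: opts.model ?? "speaches-ai/Kokoro-82M-v1.0-ONNX",
      input: text,
      voice: opts.voice ?? "af_heart",
      response_format: opts.format ?? "wav",
      speed: opts.speed ?? 1.0,
    },
  };
}
```

The result would be sent as a JSON POST (e.g. via `fetch` with `Content-Type: application/json`), with the audio returned in the response body.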

STT Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| baseUrl | http://localhost:8000 | speaches API URL |
| model | Systran/faster-whisper-small | Faster Whisper model |
| language | en | Language code (en, es, fr, etc.) |
| timeoutMs | 30000 | 30-second request timeout |

TTS Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| baseUrl | http://localhost:8000 | speaches API URL |
| model | speaches-ai/Kokoro-82M-v1.0-ONNX | Kokoro ONNX model for fast synthesis |
| voice | af_heart | Default voice ID |
| format | wav | Output format (wav or mp3) |
| speed | 1.0 | Speaking speed (0.25 - 4.0) |
| timeoutMs | 30000 | 30-second request timeout |
speaches does not support opus or aac audio formats. Use wav (default) or mp3.

Model Router

The model router implements ADR-004 — a three-tier routing architecture that sends each request to the optimal inference tier.

How Routing Works

User Request
     |
     v
Tier 1: Intent Classifier (<50ms, local)
     |
     +-- simple + tools ------> Tier 2: Ollama (stream: false)
     +-- simple + no tools ----> Tier 2: Ollama (stream: true)
     +-- complex --------------> Tier 3: Claude Haiku (streaming)
     +-- very complex ---------> Tier 3: Claude Sonnet (streaming)
     +-- privacy sensitive ----> Tier 2: Ollama (regardless of complexity)
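The diagram above can be sketched as a single routing function. This is an illustrative sketch of the decision logic, not the actual router implementation; model IDs mirror the defaults documented on this page:

```typescript
// Routing sketch for the three-tier diagram above (illustrative).
interface Classification {
  complexity: "simple" | "complex" | "very-complex";
  needsTools: boolean;
  privacySensitive: boolean;
}

interface Decision {
  tier: 2 | 3;
  provider: "ollama" | "claude";
  model: string;
  stream: boolean;
}

function route(c: Classification): Decision {
  // Privacy-sensitive requests stay local regardless of complexity.
  if (c.privacySensitive || c.complexity === "simple") {
    // Ollama disables streaming when tools are attached (see Ollama notes).
    return { tier: 2, provider: "ollama", model: "qwen3:4b", stream: !c.needsTools };
  }
  const model =
    c.complexity === "very-complex"
      ? "claude-sonnet-4-20250514"
      : "claude-haiku-4-5-20251001";
  return { tier: 3, provider: "claude", model, stream: true };
}
```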

Router Configuration

# .env.local

# Use the model router as the command provider (default)
COMMAND_PROVIDER=model-router

# Local model for Tier 2 execution
OLLAMA_DEFAULT_MODEL=qwen3:4b

# Cloud model for Tier 3 execution
CLAUDE_MODEL=claude-sonnet-4-20250514

Router Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| classificationEnabled | true | Enable intent classification |
| defaultProvider | claude | Fallback when classification is disabled |
| localModel | qwen3:4b | Ollama model for local execution |
| cloudModel | claude-sonnet-4-20250514 | Claude model for cloud execution |
| privacyMode | false | When true, never sends to cloud |
| maxCostPerRequest | 0 | Cost limit in USD (0 = unlimited) |

Pattern Overrides

The router supports regex patterns that force specific routing regardless of classification:
// Always route to Claude:
'analyze.*code'
'review.*pull.?request'
'refactor'
'debug'
'write.*test'
'explain.*architecture'
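A matcher for these overrides might look like the following sketch (patterns copied from the list above; the function name is illustrative):

```typescript
// Any match forces Tier 3 (Claude) before classification is consulted.
// Patterns come from the override list above; case-insensitive matching
// is an assumption here.
const FORCE_CLAUDE_PATTERNS: RegExp[] = [
  /analyze.*code/i,
  /review.*pull.?request/i,
  /refactor/i,
  /debug/i,
  /write.*test/i,
  /explain.*architecture/i,
];

function forcedToClaude(request: string): boolean {
  return FORCE_CLAUDE_PATTERNS.some((p) => p.test(request));
}
```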

Fallback Strategy

The router handles failures gracefully:
  1. If Tier 2 (local) fails or produces low-confidence output, it automatically escalates to Tier 3 (cloud)
  2. If Tier 3 (cloud) is unavailable (network down), it falls back to Tier 2 with a degraded capability warning
  3. Classification errors are tracked so thresholds can be adjusted over time
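The escalation logic in steps 1 and 2 can be sketched as a wrapper around two providers. Provider shapes and names below are illustrative, not the agtOS internals:

```typescript
// Fallback sketch: try local first, escalate to cloud on failure or low
// confidence; if the cloud is unreachable, degrade back to local.
interface Reply {
  text: string;
  confidence: number; // 0..1, as reported by the local tier
  degraded?: boolean; // set when falling back to local with reduced capability
}

type Provider = (prompt: string) => Promise<Reply>;

async function runWithFallback(
  prompt: string,
  local: Provider,
  cloud: Provider,
  confidenceThreshold = 0.7,
): Promise<Reply> {
  let localReply: Reply | undefined;
  try {
    localReply = await local(prompt);
    if (localReply.confidence >= confidenceThreshold) return localReply;
  } catch {
    // Local tier failed outright; fall through to the cloud.
  }
  try {
    return await cloud(prompt);
  } catch {
    // Cloud unreachable: return what the local tier gave us, flagged degraded.
    if (localReply) return { ...localReply, degraded: true };
    throw new Error("all providers failed");
  }
}
```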

Bypassing the Router

To skip the router and use a single provider directly:
# Use Claude directly (no local routing)
COMMAND_PROVIDER=claude

# Use Ollama directly (no cloud fallback)
COMMAND_PROVIDER=ollama

Provider Architecture Summary

Claude (Cloud)

Complex reasoning, multi-step tasks, creative generation. Sonnet 4 default, Haiku 4.5 for voice speed.

Ollama (Local)

Simple queries, privacy-sensitive requests, intent classification. Qwen3 models via local GPU.

speaches (Voice)

Self-hosted STT (Faster Whisper) and TTS (Kokoro). OpenAI-compatible API on port 8000.

Model Router

Three-tier routing: classify intent, try local, fall back to cloud. Cost and privacy aware.