agtOS uses a multi-provider architecture. Claude handles complex reasoning in the cloud, Ollama runs local models for fast responses and privacy, and speaches provides speech-to-text and text-to-speech. The model router ties them together, sending each request to the optimal provider.

Claude Provider

Claude is the primary cloud LLM for agtOS, used for complex conversations, multi-step reasoning, and background agentic tasks. agtOS integrates two Anthropic SDKs (see ADR-003):
  • Client SDK (@anthropic-ai/sdk) — real-time voice path with streaming
  • Agent SDK (@anthropic-ai/claude-agent-sdk) — background autonomous tasks

Model Selection

| Model | ID | Best For | Input $/MTok | Output $/MTok |
| --- | --- | --- | --- | --- |
| Claude Opus 4 | claude-opus-4-20250514 | Complex reasoning, analysis | $15.00 | $75.00 |
| Claude Sonnet 4 | claude-sonnet-4-20250514 | Balanced performance (default) | $3.00 | $15.00 |
| Claude Haiku 4.5 | claude-haiku-4-5-20251001 | Speed, low cost, voice pipeline | $0.80 | $4.00 |
Set the model via environment variable:
CLAUDE_MODEL=claude-sonnet-4-20250514
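A provider would typically resolve the model in one place, falling back to the documented default when the variable is unset. A minimal sketch (`resolveClaudeModel` is an illustrative name, not the actual agtOS function):

```typescript
// Resolve the Claude model from an env map, falling back to Sonnet 4.
// resolveClaudeModel is an illustrative helper, not part of the agtOS API.
const DEFAULT_CLAUDE_MODEL = "claude-sonnet-4-20250514";

function resolveClaudeModel(env: Record<string, string | undefined>): string {
  const model = env.CLAUDE_MODEL?.trim();
  return model && model.length > 0 ? model : DEFAULT_CLAUDE_MODEL;
}
```

Passing `{ CLAUDE_MODEL: "claude-haiku-4-5-20251001" }` returns the override; an empty or missing value returns the default.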

Configuration Options

# .env.local

# Model (defaults to Sonnet 4)
CLAUDE_MODEL=claude-sonnet-4-20250514

# Adaptive thinking — lets the model decide when to use extended reasoning
# Values: adaptive (recommended), enabled (legacy), disabled
CLAUDE_THINKING=adaptive

# Effort level for adaptive thinking
# Values: low, medium, high, max (max is Opus 4.6 only)
CLAUDE_EFFORT=medium

# Service tier — 'auto' uses priority capacity when available
# Values: auto, standard_only
CLAUDE_SERVICE_TIER=auto

# Custom API base URL (for proxies or CLI transport)
ANTHROPIC_BASE_URL=https://api.anthropic.com

Provider Defaults

When no environment variables are set, the Claude provider uses these defaults:
| Setting | Default | Notes |
| --- | --- | --- |
| model | claude-sonnet-4-20250514 | Sonnet 4 balances speed and capability |
| maxTokens | 4096 | Max output tokens per response |
| temperature | 0.7 | Generation temperature (0.0 - 1.0) |
| timeoutMs | 300000 | 5-minute request timeout |
| promptCache.maxEntries | 50 | LRU prompt cache size |

Dual-SDK Architecture

The voice pipeline uses the Client SDK for real-time streaming, targeting first-token latency under 200ms and first-sentence latency under 500ms. Background tasks like “research the best options for X” are dispatched to the Agent SDK, which handles multi-step tool execution autonomously. Both SDKs connect to the same MCP servers, so tools are defined once and available to both paths. See Authentication for API key setup.
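The split between the two SDK paths can be sketched as a small dispatch function. The names below are illustrative assumptions, not the actual agtOS API:

```typescript
// Illustrative dispatch between the two SDK paths. Real-time voice turns
// go to the Client SDK's streaming path; long-running autonomous tasks go
// to the Agent SDK. These type and function names are assumptions.
type SdkPath = "client-sdk-stream" | "agent-sdk-background";

interface TaskHints {
  realtime: boolean;  // part of a live voice turn?
  multiStep: boolean; // needs autonomous multi-step tool execution?
}

function pickSdkPath(hints: TaskHints): SdkPath {
  // Latency wins: anything in the live voice loop streams via the Client SDK.
  if (hints.realtime) return "client-sdk-stream";
  // Background research-style tasks run under the Agent SDK.
  if (hints.multiStep) return "agent-sdk-background";
  return "client-sdk-stream";
}
```

Since both paths share the same MCP servers, this dispatch only chooses the execution style, not the available tools.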
For voice interactions, Haiku 4.5 provides the best speed-to-cost ratio. The model router (below) automatically selects Haiku for simple queries and Sonnet for complex ones.

Ollama Provider

Ollama serves local models for intent classification and for handling simple queries without cloud API calls, which cuts cost and latency and keeps privacy-sensitive requests on-device.

Configuration

# .env.local

# Ollama server URL
OLLAMA_HOST=http://localhost:11434

# Default model for local query execution
OLLAMA_DEFAULT_MODEL=qwen3:4b

# Model for intent classification (small and fast)
OLLAMA_INTENT_MODEL=qwen3:1.7b

Intent Classifier

The intent classifier is a small, fast model that categorizes every incoming request before it reaches an LLM. It runs on CPU in under 50ms and determines:
| Category | Description | Route |
| --- | --- | --- |
| simple_query | Factual queries, greetings, time/date | Local (Ollama) |
| system_command | Volume, timer, reminder | Local (Ollama) |
| tool_use | File ops, API calls | May need Claude |
| complex_reasoning | Analysis, code review | Claude |
| creative | Writing, brainstorming | Claude |
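The category-to-route mapping above can be sketched as a lookup table (illustrative names; the real router's internals may differ):

```typescript
// Sketch of mapping classifier categories to a first-pass route,
// mirroring the table above. Names are illustrative.
type Route = "ollama" | "claude" | "maybe-claude";

const CATEGORY_ROUTES: Record<string, Route> = {
  simple_query: "ollama",
  system_command: "ollama",
  tool_use: "maybe-claude", // escalated only if the tool needs cloud reasoning
  complex_reasoning: "claude",
  creative: "claude",
};

function routeForCategory(category: string): Route {
  // Unknown categories fall back to the cloud, the safer default.
  return CATEGORY_ROUTES[category] ?? "claude";
}
```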

Classifier Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| host | http://localhost:11434 | Ollama API URL |
| model | qwen3:1.7b | Small model for fast classification |
| confidenceThreshold | 0.7 | Below this confidence, escalate to Claude |
| timeoutMs | 2000 | Max classification time before defaulting to cloud |
Ollama has a confirmed bug where enabling streaming and tools at the same time produces malformed output. agtOS automatically uses stream: false when tools are involved, which adds latency but produces correct results.
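The workaround can be sketched as a request builder that disables streaming whenever tools are attached (illustrative shape, not the actual agtOS request builder):

```typescript
// Workaround sketch for the Ollama streaming + tools bug described above:
// disable streaming whenever tools are attached. Types are illustrative.
interface OllamaChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  tools?: unknown[];
  stream: boolean;
}

function buildOllamaRequest(
  model: string,
  messages: { role: string; content: string }[],
  tools?: unknown[],
): OllamaChatRequest {
  const hasTools = (tools?.length ?? 0) > 0;
  return {
    model,
    messages,
    tools: hasTools ? tools : undefined,
    stream: !hasTools, // never stream when tools are present
  };
}
```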

speaches STT/TTS

speaches is a self-hosted server that provides OpenAI-compatible speech-to-text and text-to-speech endpoints. agtOS uses it for the cascade voice pipeline.

Configuration

# .env.local

# speaches server URL (shared by STT and TTS)
SPEACHES_URL=http://localhost:8000

# --- Speech-to-Text ---
STT_PROVIDER=speaches
SPEACHES_STT_MODEL=Systran/faster-whisper-small

# --- Text-to-Speech ---
TTS_PROVIDER=speaches
SPEACHES_TTS_MODEL=speaches-ai/Kokoro-82M-v1.0-ONNX
SPEACHES_TTS_VOICE=af_heart
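Since speaches exposes OpenAI-compatible endpoints, a TTS request can be sketched against the standard audio-speech path. The URL path and body fields below follow the OpenAI audio API shape and are assumptions; adjust if your speaches version differs:

```typescript
// Sketch of a TTS request body for speaches' OpenAI-compatible endpoint.
// Defaults mirror the TTS Defaults table; the shape is an assumption.
interface TtsOptions {
  baseUrl?: string;
  model?: string;
  voice?: string;
  format?: "wav" | "mp3"; // speaches does not support opus or aac
  speed?: number;         // 0.25 - 4.0
}

function buildTtsRequest(text: string, opts: TtsOptions = {}) {
  const baseUrl = opts.baseUrl ?? "http://localhost:8000";
  return {
    url: `${baseUrl}/v1/audio/speech`,
    body: {
      model: opts.model ?? "speaches-ai/Kokoro-82M-v1.0-ONNX",
      input: text,
      voice: opts.voice ?? "af_heart",
      response_format: opts.format ?? "wav",
      speed: opts.speed ?? 1.0,
    },
  };
}
```

The result would be sent as a JSON POST (e.g. via `fetch` with `Content-Type: application/json`), with the audio returned in the response body.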

STT Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| baseUrl | http://localhost:8000 | speaches API URL |
| model | Systran/faster-whisper-small | Faster Whisper model |
| language | en | Language code (en, es, fr, etc.) |
| timeoutMs | 30000 | 30-second request timeout |

TTS Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| baseUrl | http://localhost:8000 | speaches API URL |
| model | speaches-ai/Kokoro-82M-v1.0-ONNX | Kokoro ONNX model for fast synthesis |
| voice | af_heart | Default voice ID |
| format | wav | Output format (wav or mp3) |
| speed | 1.0 | Speaking speed (0.25 - 4.0) |
| timeoutMs | 30000 | 30-second request timeout |
speaches does not support opus or aac audio formats. Use wav (default) or mp3.

Model Router

The model router implements ADR-004 — a three-tier routing architecture that sends each request to the optimal inference tier.

How Routing Works

User Request
     |
     v
Tier 1: Intent Classifier (<50ms, local)
     |
     +-- simple + tools ------> Tier 2: Ollama (stream: false)
     +-- simple + no tools ----> Tier 2: Ollama (stream: true)
     +-- complex --------------> Tier 3: Claude Haiku (streaming)
     +-- very complex ---------> Tier 3: Claude Sonnet (streaming)
     +-- privacy sensitive ----> Tier 2: Ollama (regardless of complexity)
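The diagram above can be sketched as a single routing function. This is an illustrative sketch of the decision logic, not the actual router implementation; model IDs mirror the defaults documented on this page:

```typescript
// Routing sketch for the three-tier diagram above (illustrative).
interface Classification {
  complexity: "simple" | "complex" | "very-complex";
  needsTools: boolean;
  privacySensitive: boolean;
}

interface Decision {
  tier: 2 | 3;
  provider: "ollama" | "claude";
  model: string;
  stream: boolean;
}

function route(c: Classification): Decision {
  // Privacy-sensitive requests stay local regardless of complexity.
  if (c.privacySensitive || c.complexity === "simple") {
    // Ollama disables streaming when tools are attached (see Ollama notes).
    return { tier: 2, provider: "ollama", model: "qwen3:4b", stream: !c.needsTools };
  }
  const model =
    c.complexity === "very-complex"
      ? "claude-sonnet-4-20250514"
      : "claude-haiku-4-5-20251001";
  return { tier: 3, provider: "claude", model, stream: true };
}
```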

Router Configuration

# .env.local

# Use the model router as the command provider (default)
COMMAND_PROVIDER=model-router

# Local model for Tier 2 execution
OLLAMA_DEFAULT_MODEL=qwen3:4b

# Cloud model for Tier 3 execution
CLAUDE_MODEL=claude-sonnet-4-20250514

Router Defaults

| Setting | Default | Notes |
| --- | --- | --- |
| classificationEnabled | true | Enable intent classification |
| defaultProvider | claude | Fallback when classification is disabled |
| localModel | qwen3:4b | Ollama model for local execution |
| cloudModel | claude-sonnet-4-20250514 | Claude model for cloud execution |
| privacyMode | false | When true, never sends to cloud |
| maxCostPerRequest | 0 | Cost limit in USD (0 = unlimited) |

Pattern Overrides

The router supports regex patterns that force specific routing regardless of classification:
// Always route to Claude:
'analyze.*code'
'review.*pull.?request'
'refactor'
'debug'
'write.*test'
'explain.*architecture'
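A matcher for these overrides might look like the following sketch (patterns copied from the list above; the function name is illustrative):

```typescript
// Any match forces Tier 3 (Claude) before classification is consulted.
// Patterns come from the override list above; case-insensitive matching
// is an assumption here.
const FORCE_CLAUDE_PATTERNS: RegExp[] = [
  /analyze.*code/i,
  /review.*pull.?request/i,
  /refactor/i,
  /debug/i,
  /write.*test/i,
  /explain.*architecture/i,
];

function forcedToClaude(request: string): boolean {
  return FORCE_CLAUDE_PATTERNS.some((p) => p.test(request));
}
```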

Fallback Strategy

The router handles failures gracefully:
  1. If Tier 2 (local) fails or produces low-confidence output, it automatically escalates to Tier 3 (cloud)
  2. If Tier 3 (cloud) is unavailable (network down), it falls back to Tier 2 with a degraded capability warning
  3. Classification errors are tracked so thresholds can be adjusted over time
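The escalation logic in steps 1 and 2 can be sketched as a wrapper around two providers. Provider shapes and names below are illustrative, not the agtOS internals:

```typescript
// Fallback sketch: try local first, escalate to cloud on failure or low
// confidence; if the cloud is unreachable, degrade back to local.
interface Reply {
  text: string;
  confidence: number; // 0..1, as reported by the local tier
  degraded?: boolean; // set when falling back to local with reduced capability
}

type Provider = (prompt: string) => Promise<Reply>;

async function runWithFallback(
  prompt: string,
  local: Provider,
  cloud: Provider,
  confidenceThreshold = 0.7,
): Promise<Reply> {
  let localReply: Reply | undefined;
  try {
    localReply = await local(prompt);
    if (localReply.confidence >= confidenceThreshold) return localReply;
  } catch {
    // Local tier failed outright; fall through to the cloud.
  }
  try {
    return await cloud(prompt);
  } catch {
    // Cloud unreachable: return what the local tier gave us, flagged degraded.
    if (localReply) return { ...localReply, degraded: true };
    throw new Error("all providers failed");
  }
}
```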

Bypassing the Router

To skip the router and use a single provider directly:
# Use Claude directly (no local routing)
COMMAND_PROVIDER=claude

# Use Ollama directly (no cloud fallback)
COMMAND_PROVIDER=ollama

Provider Architecture Summary

Claude (Cloud)

Complex reasoning, multi-step tasks, creative generation. Sonnet 4 default, Haiku 4.5 for voice speed.

Ollama (Local)

Simple queries, privacy-sensitive requests, intent classification. Qwen3 models via local GPU.

speaches (Voice)

Self-hosted STT (Faster Whisper) and TTS (Kokoro). OpenAI-compatible API on port 8000.

Model Router

Three-tier routing: classify intent, try local, fall back to cloud. Cost and privacy aware.