Configure Claude, Ollama, speaches STT/TTS, and the model router
agtOS uses a multi-provider architecture. Claude handles complex reasoning in the cloud, Ollama runs local models for fast responses and privacy, and speaches provides speech-to-text and text-to-speech. The model router ties them together, sending each request to the optimal provider.
Claude is the primary cloud LLM for agtOS, used for complex conversations, multi-step reasoning, and background agentic tasks. agtOS integrates two Anthropic SDKs (see ADR-003):
Client SDK (@anthropic-ai/sdk) — real-time voice path with streaming
Agent SDK — autonomous multi-step tool execution for background tasks
```
# .env.local

# Model (defaults to Sonnet 4)
CLAUDE_MODEL=claude-sonnet-4-20250514

# Adaptive thinking — lets the model decide when to use extended reasoning
# Values: adaptive (recommended), enabled (legacy), disabled
CLAUDE_THINKING=adaptive

# Effort level for adaptive thinking
# Values: low, medium, high, max (max is Opus 4.6 only)
CLAUDE_EFFORT=medium

# Service tier — 'auto' uses priority capacity when available
# Values: auto, standard_only
CLAUDE_SERVICE_TIER=auto

# Custom API base URL (for proxies or CLI transport)
ANTHROPIC_BASE_URL=https://api.anthropic.com
```
The voice pipeline uses the Client SDK for real-time streaming, targeting first-token latency under 200 ms and first-sentence latency under 500 ms. Background tasks like “research the best options for X” are dispatched to the Agent SDK, which handles multi-step tool execution autonomously. Both SDKs connect to the same MCP servers, so tools are defined once and available to both paths. See Authentication for API key setup.
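The first-sentence latency target matters because a cascade pipeline can hand the first complete sentence to TTS while the rest of the response is still streaming in. A minimal sketch of that boundary detection (the function name and terminator set are illustrative, not from the agtOS codebase):

```typescript
// Return the first complete sentence in a streamed text buffer, or null
// if no sentence boundary has arrived yet. A real pipeline would call
// this on every streamed chunk and flush the sentence to TTS as soon as
// it returns non-null.
function firstSentence(buffer: string): string | null {
  // Illustrative terminator set; a production splitter would also handle
  // abbreviations, decimals, and closing quotes.
  const match = buffer.match(/^[\s\S]*?[.!?](?=\s|$)/);
  return match ? match[0].trim() : null;
}
```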
For voice interactions, Haiku 4.5 provides the best speed-to-cost ratio. The model router (below) automatically selects Haiku for simple queries and Sonnet for complex ones.
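The Haiku/Sonnet decision can be sketched as a small classifier (the heuristic and names here are illustrative; the router's actual criteria are agtOS-internal and not documented in this section):

```typescript
type CloudModel = "haiku" | "sonnet";

// Crude complexity heuristic: multi-step phrasing, tool use, or long
// prompts get Sonnet; short single-intent queries get Haiku. The labels
// map to model IDs in config (e.g. CLAUDE_MODEL), not hardcoded here.
function pickCloudModel(prompt: string, needsTools: boolean): CloudModel {
  const multiStep = /\b(then|after that|step by step|compare|research)\b/i.test(prompt);
  if (needsTools || multiStep || prompt.length > 280) return "sonnet";
  return "haiku";
}
```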
Ollama serves local models for intent classification and for handling simple queries without cloud API calls. This reduces cost and latency and keeps privacy-sensitive requests on-device.
```
# .env.local

# Ollama server URL
OLLAMA_HOST=http://localhost:11434

# Default model for local query execution
OLLAMA_DEFAULT_MODEL=qwen3:4b

# Model for intent classification (small and fast)
OLLAMA_INTENT_MODEL=qwen3:1.7b
```
The intent classifier is a small, fast model that categorizes every incoming request before it reaches an LLM. It runs on CPU in under 50 ms and determines how each request is routed.
Max classification time before defaulting to cloud
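The timeout fallback can be sketched with a `Promise.race` against the classification deadline (function and label names are hypothetical; the 50 ms budget mirrors the CPU target above):

```typescript
// Race a local intent classification against a deadline. If the local
// model does not answer within timeoutMs, fall through to the cloud
// path instead of blocking the voice pipeline.
async function classifyWithTimeout(
  classify: () => Promise<string>,
  timeoutMs: number,
): Promise<string> {
  const deadline = new Promise<string>((resolve) =>
    setTimeout(() => resolve("cloud"), timeoutMs),
  );
  return Promise.race([classify(), deadline]);
}
```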
Ollama has a confirmed bug where enabling streaming and tools at the same time produces malformed output. agtOS automatically sets stream: false whenever tools are involved, which adds latency but produces correct results.
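The workaround can be sketched as a request builder that forces the flag (the request shape follows Ollama's /api/chat body; the helper name is ours):

```typescript
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  tools?: object[];
  stream: boolean;
}

// Build an Ollama /api/chat body, forcing stream: false whenever tools
// are attached, to avoid the malformed-output bug described above.
function buildOllamaRequest(
  model: string,
  messages: ChatRequest["messages"],
  tools?: object[],
): ChatRequest {
  const hasTools = !!tools && tools.length > 0;
  return {
    model,
    messages,
    ...(hasTools ? { tools } : {}),
    stream: !hasTools, // stream only when no tools are involved
  };
}
```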
speaches is a self-hosted server that provides OpenAI-compatible speech-to-text and text-to-speech endpoints. agtOS uses it for the cascade voice pipeline.
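Because speaches is OpenAI-compatible, clients target the standard /v1/audio paths. A minimal sketch of the endpoint selection (the paths follow the OpenAI audio API; the base URL and helper name are assumptions, and model availability depends on what you have downloaded into speaches):

```typescript
// Resolve the speaches endpoint for a given direction of the voice
// pipeline: "stt" transcribes audio to text, "tts" synthesizes speech.
function speachesUrl(base: string, kind: "stt" | "tts"): string {
  const path =
    kind === "stt" ? "/v1/audio/transcriptions" : "/v1/audio/speech";
  return new URL(path, base).toString();
}
```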
```
# .env.local

# Use the model router as the command provider (default)
COMMAND_PROVIDER=model-router

# Local model for Tier 2 execution
OLLAMA_DEFAULT_MODEL=qwen3:4b

# Cloud model for Tier 3 execution
CLAUDE_MODEL=claude-sonnet-4-20250514
```
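The tier-to-model mapping in this config can be sketched as a small dispatch function (tier numbers come from the variable comments above; the function shape and defaults are illustrative):

```typescript
type Tier = 2 | 3;

interface Route {
  provider: "ollama" | "claude";
  model: string;
}

// Map a routing tier to its provider and model, reading the same
// variables as the config above. Fallback defaults mirror the
// documented example values.
function modelForTier(
  tier: Tier,
  env: Record<string, string | undefined>,
): Route {
  if (tier === 2) {
    return { provider: "ollama", model: env.OLLAMA_DEFAULT_MODEL ?? "qwen3:4b" };
  }
  return {
    provider: "claude",
    model: env.CLAUDE_MODEL ?? "claude-sonnet-4-20250514",
  };
}
```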