agtOS implements a three-tier memory system that gives the agent the ability to remember conversation context within a session, recall past interactions across sessions, and build long-term knowledge about users and their preferences.

Why Memory Matters

Without memory, a voice agent is stateless. It asks the same clarifying questions every session, cannot reference previous conversations, and feels impersonal. The memory system solves this by providing three layers of recall, each serving a different purpose.

Memory Tiers

Working Memory

Current session context: recent conversation turns, active tool results, current task state. Always available, with no external dependencies.

Episodic Memory

Cross-session recall: conversation summaries, extracted facts, user corrections. Stored in Redis with configurable TTL.

Semantic Memory

Long-term knowledge: user preferences, learned facts, entity relationships. Embedding-based vector search for semantic retrieval.

Working Memory

Working memory is the conversation context available to the LLM during the current session. It lives directly in the LLM’s context window.
  • Scope: Current session only
  • Storage: In-process (no external dependencies)
  • Contents: Recent conversation turns, active tool results, current task state
Working memory is managed by the session manager and passed to the LLM in the messages array. When the conversation grows long, automatic summarization compresses older turns into a summary, keeping the context window focused on recent and relevant information.
Turn 1: User asks about weather → stored in working memory
Turn 2: Agent responds with forecast → stored in working memory
Turn 3: User asks follow-up → LLM sees both previous turns
...
Turn 20: Older turns are summarized → "User discussed weather, then scheduling"
Working memory always works, even without Redis. It is the baseline that ensures every conversation has context, regardless of infrastructure availability.

Episodic Memory

Episodic memory preserves conversation knowledge after a session ends. When a session completes, the system uses an LLM to summarize the conversation and extract key facts, then stores these in Redis.
  • Scope: Retained across sessions
  • Storage: Redis with TTL-based expiration (default 30 days)
  • Contents: Conversation summaries, extracted facts, user corrections, task outcomes

How Memories Are Saved

Not every conversation is worth remembering. The episodic memory system uses heuristic save decisions to determine what to persist:
  • Conversations with user corrections or preferences are always saved
  • Task completions and their outcomes are saved
  • Short, trivial exchanges (greetings, single-turn lookups) may be skipped
  • An importance score (0.0-1.0) is assigned to each episode

Querying Episodes

Episodic memories can be retrieved by recency or keyword search:
# Get recent episodes
curl "http://localhost:4102/api/memory/episodes?limit=10"

# Search episodes by keyword
curl "http://localhost:4102/api/memory/search?q=weather+forecast&limit=5"
Example response:
{
  "episodes": [
    {
      "id": "ep-abc123",
      "summary": "User asked about weather forecast for San Francisco",
      "keywords": ["weather", "forecast", "san francisco"],
      "topic": "weather",
      "timestamp": 1711612800000,
      "importance": 0.7,
      "type": "conversation"
    }
  ],
  "count": 1,
  "available": true
}

Semantic Memory

Semantic memory provides long-term knowledge storage with embedding-based vector search. Unlike episodic memory (which stores summaries of specific conversations), semantic memory stores distilled facts, preferences, and patterns that persist indefinitely.
  • Scope: Cross-session, permanent until explicitly deleted
  • Storage: Redis Vector Search (HNSW indexing)
  • Contents: User preferences, learned facts, entity relationships, behavioral patterns

How It Works

  1. Embedding generation: When a memory is stored, its text is converted to a dense vector using an embedding model
  2. Vector indexing: The embedding is stored in Redis Vector Search with HNSW indexing for fast approximate nearest-neighbor search
  3. Semantic retrieval: When the agent needs context, the current query is embedded and compared against stored memories using cosine similarity
  4. Relevance threshold: Only memories above a configurable similarity threshold are included in the LLM context
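The retrieval steps above can be sketched as a cosine-similarity scan followed by a threshold filter. This is a minimal illustration, not the agtOS implementation: it scans every stored memory exhaustively rather than using an HNSW index, and the `StoredMemory` shape, the `retrieve` function, and its parameters are hypothetical names introduced here.

```typescript
// Hypothetical shape for a stored memory; not taken from the agtOS codebase.
interface StoredMemory {
  id: string;
  text: string;
  embedding: number[];
}

interface ScoredMemory {
  memory: StoredMemory;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions must match");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exhaustive scan (an assumption for clarity; a real deployment would query the vector index).
// Keeps only memories at or above the threshold, best match first.
function retrieve(
  queryEmbedding: number[],
  memories: StoredMemory[],
  threshold: number,
  limit: number
): ScoredMemory[] {
  return memories
    .map((memory) => ({ memory, score: cosineSimilarity(queryEmbedding, memory.embedding) }))
    .filter((scored) => scored.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

Raising the threshold trades recall for precision: fewer memories reach the LLM context, but those that do are closer to the query.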

Search Modes

The memory system supports three search strategies, configurable via AGTOS_MEMORY_SEARCH_MODE:
| Mode | Description | Default |
| --- | --- | --- |
| hybrid | Combines BM25 keyword matching with vector similarity using reciprocal rank fusion (RRF) | Yes |
| vector | Pure embedding-based semantic search (cosine similarity) | No |
| bm25 | Pure keyword/term frequency search | No |
Hybrid mode gives the best results for most queries — it finds exact keyword matches and semantically similar content simultaneously.
Enable query expansion with AGTOS_MEMORY_QUERY_EXPANSION=true to use Ollama to rewrite search queries for better recall. This adds ~50ms latency but can significantly improve results for vague queries.
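To make the hybrid mode concrete, here is a small sketch of reciprocal rank fusion: each result's fused score is the sum of `1 / (k + rank)` over every ranking in which it appears. The constant `k = 60` is the value commonly used for RRF, not a documented agtOS setting, and the function name and result shape are illustrative.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Fuses several ranked lists of IDs (best first) into one ranking.
// k dampens the influence of top ranks; 60 is a conventional default (assumption).
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score
  );
}

// Example: one list from keyword search, one from vector search (IDs are made up).
const bm25Ranking = ["ep-a", "ep-b", "ep-c"];
const vectorRanking = ["ep-b", "ep-c", "ep-a"];
const fused = reciprocalRankFusion([bm25Ranking, vectorRanking]);
```

Because RRF works on ranks rather than raw scores, it needs no normalization between BM25 scores and cosine similarities, which is one reason it is a popular choice for hybrid search.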

Embedding Providers

| Provider | Model | Dimensions | Use Case |
| --- | --- | --- | --- |
| Ollama | nomic-embed-text | 768 | Local/private, no external API calls (default) |
| OpenRouter | openai/text-embedding-3-small | 1536 | Cloud-based, highest quality |
Configure via:
AGTOS_EMBEDDING_PROVIDER=ollama          # or 'openrouter'
AGTOS_EMBEDDING_MODEL=nomic-embed-text   # or 'openai/text-embedding-3-small'
Semantic memory requires both Redis (with the RediSearch module) and an embedding provider (Ollama or OpenRouter). If either is unavailable, semantic memory falls back to episodic keyword search.

Context Assembly

Before each LLM call, the Memory Coordinator assembles a context packet that combines all three memory tiers:
  1. Working Memory: Current conversation turns are already in the messages array.
  2. Episodic Recall: Relevant session summaries are retrieved from Redis based on semantic similarity to the current query.
  3. Semantic Recall: Relevant long-term memories are retrieved from the vector store based on embedding similarity.
  4. Deduplication: Redundant information across layers is removed.
  5. Prioritization: Memories are ranked by relevance score and recency, then fit within a token budget.
  6. Context Injection: The assembled memories are formatted as a system prompt section for the LLM.
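The deduplication, prioritization, and injection steps can be sketched as follows. Everything here is an assumption made for illustration: the `RecalledMemory` shape, the 0.8/0.2 blend of relevance and recency, the 4-characters-per-token estimate, and the "Relevant memories:" header are not taken from agtOS.

```typescript
// Hypothetical shape for a memory returned by episodic or semantic recall.
interface RecalledMemory {
  text: string;
  tier: "episodic" | "semantic";
  relevance: number; // similarity score in [0, 1]
  timestamp: number; // milliseconds since epoch
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Rough token estimate (about 4 characters per token); an assumption, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function assembleContext(recalled: RecalledMemory[], tokenBudget: number, now: number): string {
  // Deduplication: keep the most relevant copy of each normalized text.
  const unique = new Map<string, RecalledMemory>();
  for (const memory of recalled) {
    const key = memory.text.trim().toLowerCase();
    const existing = unique.get(key);
    if (existing === undefined || memory.relevance > existing.relevance) {
      unique.set(key, memory);
    }
  }

  // Prioritization: blend relevance with a recency bonus that fades with age in days.
  const priority = (m: RecalledMemory): number =>
    0.8 * m.relevance + 0.2 / (1 + Math.max(0, now - m.timestamp) / DAY_MS);
  const ranked = Array.from(unique.values()).sort((a, b) => priority(b) - priority(a));

  // Token budget: take memories in priority order, skipping any that do not fit.
  const lines: string[] = [];
  let usedTokens = 0;
  for (const memory of ranked) {
    const cost = estimateTokens(memory.text);
    if (usedTokens + cost > tokenBudget) continue;
    lines.push(`- [${memory.tier}] ${memory.text}`);
    usedTokens += cost;
  }

  // Context injection: format as a section that can be appended to the system prompt.
  return lines.length === 0 ? "" : ["Relevant memories:"].concat(lines).join("\n");
}
```

Skipping an oversized memory instead of stopping at it lets smaller, lower-ranked memories still use the remaining budget.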

Memory Coordinator

The MemoryCoordinator is the unified facade for the three-tier system. It provides a single entry point for the orchestrator and agent loop, handling the complexity of coordinating across tiers.
Graceful degradation is built in:
  • Working memory (in-process) always works, with no external dependencies
  • Episodic memory (Redis) is optional; the system logs warnings if Redis is unavailable
  • Semantic memory (Redis plus an embedding provider, Ollama by default) is optional and falls back to episodic keyword search
This means agtOS works on a fresh install with no Redis and no Ollama. As you add infrastructure, memory capabilities unlock progressively.

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/memory/episodes | List recent episodic memories |
| GET | /api/memory/search?q=... | Semantic + keyword search across episodes |
| GET | /api/memory/profile | Get user profile (name, preferences, patterns) |
| PUT | /api/memory/profile | Update user profile fields |
| GET | /api/memory/conclusions | Get dialectic reasoning conclusions |
| DELETE | /api/memory/conclusions/:id | Delete a specific conclusion |
| POST | /api/memory/ask | Ask a question about the user (RAG via Dialectic engine) |
| GET | /api/memory/sources | Scan for importable external AI tool memories |
| POST | /api/memory/import | Import memories from external AI tools |
| POST | /api/memory/maintain | Trigger a memory maintenance sweep (memory lint) |
| GET | /api/memory/maintain/history | List recent maintenance reports |
| GET | /api/memory/maintain/history/:timestamp | Fetch a single maintenance report by timestamp |
See HTTP Endpoints for complete request/response documentation.

Query Parameters

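GET /api/memory/episodes

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| limit | number | 20 | Max results (1-100) |

GET /api/memory/search

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| q | string | | Required. Search query text |
| limit | number | 10 | Max results (1-50) |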
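A client can enforce these parameter ranges before sending a request. The sketch below builds the two URLs with the documented defaults and limits; the helper names and the `Episode` type (modelled on the example response earlier on this page) are hypothetical, and whether the server clamps or rejects out-of-range values is not stated here, so the client-side clamping is an assumption.

```typescript
// Field names follow the example response shown earlier; the type itself is illustrative.
interface Episode {
  id: string;
  summary: string;
  keywords: string[];
  topic: string;
  timestamp: number;
  importance: number;
  type: string;
}

interface EpisodesResponse {
  episodes: Episode[];
  count: number;
  available: boolean;
}

const BASE_URL = "http://localhost:4102";

// Truncates to an integer and keeps the value within [min, max].
function clampLimit(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, Math.trunc(value)));
}

// GET /api/memory/episodes: limit defaults to 20, allowed range 1-100.
function episodesUrl(limit: number = 20): string {
  return `${BASE_URL}/api/memory/episodes?limit=${clampLimit(limit, 1, 100)}`;
}

// GET /api/memory/search: q is required, limit defaults to 10, allowed range 1-50.
function searchUrl(q: string, limit: number = 10): string {
  if (q.trim() === "") {
    throw new Error("q is required");
  }
  return `${BASE_URL}/api/memory/search?q=${encodeURIComponent(q)}&limit=${clampLimit(limit, 1, 50)}`;
}
```

The resulting strings can be passed to any HTTP client, and the JSON body can be typed as `EpisodesResponse` if the response matches the example shown earlier.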

User Profile

The memory system builds a user profile from conversation history, tracking the user’s name, communication style, behavioral patterns, and preferences. The profile is used to personalize agent responses.

Dreamer (Memory Consolidation)

The Dreamer is a background process that consolidates episodic memories into higher-level user conclusions at the end of every voice session. It uses an LLM to synthesize patterns across multiple episodes. For example, it might notice that a user consistently asks about weather in the morning and infer “user is a morning person who checks weather first.”
The Dreamer is wired into endVoiceSession() and also listens for the server-level sessionEnded event as defense-in-depth (see ADR-021).
Conclusions have a confidence score (0.0-1.0) and a type:
| Type | Description |
| --- | --- |
| explicit | Directly stated by the user (“I prefer metric units”) |
| deductive | Logically derived from multiple facts (A + B implies C) |
| inductive | Generalized from observed patterns across sessions |
| abductive | Best explanation for observed user behavior |
Configure the Dreamer provider via the consolidation task slot in ~/.agtos/config.json:
{
  "slots": {
    "consolidation": { "provider": "ollama", "model": "qwen3:4b" }
  }
}
The legacy AGTOS_CONSOLIDATION_PROVIDER and AGTOS_CONSOLIDATION_MODEL env vars are still supported as fallbacks when the slot is not configured.

Query-as-Ingest

When the agent produces a high-quality synthesized response (multi-step reasoning, a comparison table, an analysis), that synthesis would otherwise live only in the current session’s working memory. Query-as-Ingest persists those responses as RESPONSE_INGEST episodes so future questions can build on prior synthesis (see ADR-021).
The importance score is computed heuristically:
| Signal | Importance bump |
| --- | --- |
| Baseline | 4 |
| Tool calls executed | +3 |
| Multi-step reasoning (≥ 2 steps) | +2 |
| Length > 1000 chars | +2 |
| Length > 500 chars | +1 |
| Synthesis patterns (“compared to”, “trade-offs”, “analysis shows”) | +1 |
A response is ingested when importance ≥ 6. A per-session rate limit (5 by default) and a 5-second dedup window prevent flooding and collisions with the normal recordInteraction() save path.
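The scoring table and the threshold of 6 can be expressed as a small function. This sketch makes several assumptions that the text above does not settle: the two length bumps are treated as mutually exclusive tiers, the tool-call and synthesis-pattern bumps are each applied at most once, and the `ResponseSignals` shape and function names are hypothetical.

```typescript
// Hypothetical input shape describing one agent response.
interface ResponseSignals {
  text: string;
  toolCallsExecuted: number;
  reasoningSteps: number;
}

const SYNTHESIS_PATTERNS = ["compared to", "trade-offs", "analysis shows"];
const INGEST_THRESHOLD = 6;

function scoreImportance(response: ResponseSignals): number {
  let score = 4; // baseline
  if (response.toolCallsExecuted > 0) score += 3;
  if (response.reasoningSteps >= 2) score += 2;
  // Assumption: only the highest matching length tier applies.
  if (response.text.length > 1000) {
    score += 2;
  } else if (response.text.length > 500) {
    score += 1;
  }
  // Assumption: the bump is applied once, however many patterns match.
  const lower = response.text.toLowerCase();
  if (SYNTHESIS_PATTERNS.some((pattern) => lower.includes(pattern))) score += 1;
  return score;
}

function shouldIngest(response: ResponseSignals): boolean {
  return scoreImportance(response) >= INGEST_THRESHOLD;
}
```

Under these assumptions, a short answer with no tool calls scores 4 and is not ingested, while any response that executed a tool call scores at least 7 and is.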

Maintenance Mode (Memory Lint)

The Dreamer also runs a periodic knowledge-base health sweep, the kind of pass that Karpathy’s “LLM Wiki” gist calls a lint operation. The sweep has six steps:
  1. Stale detection — conclusions older than staleThresholdDays are flagged.
  2. Confidence decay — stale conclusions have their confidence multiplied by 0.9.
  3. Redundancy merge — pairs above a Jaccard similarity threshold (0.85 default) are merged (higher-confidence wins, sources combined).
  4. Orphan episode flagging — episodes not referenced by any conclusion and with importance < 3 (configurable via orphanImportanceThreshold) are flagged.
  5. Contradiction detection — a contradiction detector runs against the conclusion set (see Contradiction pipeline below).
  6. Low-confidence pruning — conclusions below pruneConfidenceThreshold (0.3 default) are deleted.
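Steps 1, 2, 3, and 6 operate only on conclusions and can be sketched in a few lines. This is an illustrative outline under stated assumptions: the `Conclusion` shape, the word-level tokenization used for Jaccard similarity, and the `runSweep` function are hypothetical, and orphan flagging (step 4) and contradiction detection (step 5) are left out.

```typescript
// Hypothetical conclusion shape; field names are not taken from agtOS.
interface Conclusion {
  id: string;
  text: string;
  confidence: number;
  updatedAt: number; // milliseconds since epoch
  sources: string[];
}

interface SweepConfig {
  staleThresholdDays: number;
  decayFactor: number; // 0.9 in the steps above
  mergeSimilarity: number; // 0.85 default
  pruneConfidenceThreshold: number; // 0.3 default
}

const DAY_IN_MS = 24 * 60 * 60 * 1000;

// Jaccard similarity over lowercase word sets (tokenization is an assumption).
function jaccard(a: string, b: string): number {
  const tokens = (text: string) =>
    new Set(text.toLowerCase().split(/\W+/).filter((t) => t.length > 0));
  const setA = tokens(a);
  const setB = tokens(b);
  let shared = 0;
  setA.forEach((t) => {
    if (setB.has(t)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

function runSweep(conclusions: Conclusion[], cfg: SweepConfig, now: number) {
  // Steps 1-2: flag stale conclusions and decay their confidence.
  const decayed = conclusions.map((c) => {
    const stale = now - c.updatedAt > cfg.staleThresholdDays * DAY_IN_MS;
    return { ...c, confidence: stale ? c.confidence * cfg.decayFactor : c.confidence };
  });

  // Step 3: visit highest confidence first so the higher-confidence conclusion wins a merge.
  decayed.sort((a, b) => b.confidence - a.confidence);
  const merged: Conclusion[] = [];
  let mergeCount = 0;
  for (const candidate of decayed) {
    const winner = merged.find((m) => jaccard(m.text, candidate.text) > cfg.mergeSimilarity);
    if (winner) {
      winner.sources = Array.from(new Set(winner.sources.concat(candidate.sources)));
      mergeCount++;
    } else {
      merged.push(candidate);
    }
  }

  // Step 6: delete conclusions whose confidence fell below the pruning threshold.
  const kept = merged.filter((c) => c.confidence >= cfg.pruneConfidenceThreshold);
  const pruned = merged.filter((c) => c.confidence < cfg.pruneConfidenceThreshold).map((c) => c.id);
  return { kept, pruned, mergeCount };
}
```

Running decay before pruning means a conclusion that is never refreshed loses 10% of its confidence on each sweep in which it is stale, and is eventually deleted.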
The sweep is scheduled as a cron task at startup (default 0 3 * * * in the AGTOS_MAINTENANCE_TIMEZONE zone, default UTC) and can also be triggered on demand via POST /api/memory/maintain or the CLI:
agtos memory maintain
agtos memory maintain --verbose
A memory-maintenance health check reports unhealthy if the sweep has not run in more than 48 hours. The lastMaintenanceAt and totalSweeps counters are persisted to Redis, so a restart within the 48-hour window does not spuriously report a stale sweep. A separate memory-semantic health check probes the RediSearch vector index document count and size via FT.INFO with a 1-second timeout.
All maintenance runs are gated by the ResourceGuard (below) — they’re skipped on busy systems and retried on the next cron tick. Set AGTOS_MAINTENANCE_ENABLED=false to disable maintenance entirely (this kill switch also short-circuits manually-created schedules and direct event publishes).

Contradiction pipeline (NLI hybrid)

On large profiles, asking a single LLM call to audit every pair of conclusions suffers from attention dilution and hallucinated IDs. agtOS ships a mandatory 3-stage NLI hybrid contradiction pipeline (ADR-027) that powers Step 5:
  1. Stage 1 — Candidate selection: in-memory cosine similarity over conclusion embeddings, top-K nearest plus an “interesting pair” heuristic, deduplicated and truncated to 500.
  2. Stage 2 — NLI cross-encoder: a quantized DeBERTa-v3-base MNLI model (~223 MB, SHA-256 pinned in a manifest, atomic-rename download) runs locally via onnxruntime-node. Verdicts are cached in Redis via a PairCache (Redis 7.4+ HEXPIRE with STRING+EX fallback).
  3. Stage 3 — Batched LLM judge: the surviving candidates go to a batched judge (10 pairs per call, Zod-validated structured JSON), which runs through the maintenance task slot from ADR-026 so operators can pin a different model for the judge than for ordinary consolidation.
Telemetry lives under MaintenanceReport.summary.contradictionPipeline (6 counters + per-stage latencies). A new memory.contradiction.detected event fires once per confirmation. Set AGTOS_NLI_ENABLED=false to skip Stage 2 — candidates pass directly from Stage 1 to Stage 3 (the LLM judge), which increases token cost but removes the ONNX Runtime dependency. Prebuild the NLI model bundle with npm run prebuild:nli.
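The hand-off between the stages can be sketched as follows: cap the candidates at 500, optionally filter them through the NLI stage, and split the survivors into batches of 10 for the judge. The names here are hypothetical, the classifier is synchronous for brevity (a real ONNX call would be asynchronous), and the rule that only pairs with a "contradiction" verdict survive Stage 2 is an assumption, since the text above does not state the exact filter.

```typescript
// The three MNLI labels.
type NliVerdict = "entailment" | "neutral" | "contradiction";

// A pair of conclusion IDs selected in Stage 1 (hypothetical shape).
interface CandidatePair {
  a: string;
  b: string;
}

// Splits a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

function planJudgeBatches(
  candidates: CandidatePair[],
  nliEnabled: boolean,
  classify: (pair: CandidatePair) => NliVerdict,
  batchSize: number = 10
): CandidatePair[][] {
  // Stage 1 output is truncated to 500 candidates.
  const capped = candidates.slice(0, 500);
  // Stage 2: with NLI disabled, every candidate goes straight to the judge.
  // Assumption: with NLI enabled, only "contradiction" verdicts survive.
  const survivors = nliEnabled
    ? capped.filter((pair) => classify(pair) === "contradiction")
    : capped;
  // Stage 3: the judge receives the survivors in fixed-size batches.
  return chunk(survivors, batchSize);
}
```

The sketch also shows why disabling the NLI stage raises token cost: without the filter, all capped candidates reach the LLM judge, which means up to 50 judge calls per sweep at 10 pairs per call.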

ResourceGuard

Background LLM work on local Ollama can contend with an active voice session for GPU and VRAM: requesting a different model causes Ollama to unload the currently-loaded model, and the user’s next voice response then takes 10-20 seconds while the original model reloads. The ResourceGuard gates all background LLM calls via a deterministic decision tree (ADR-021):
  1. Policy override: policy === 'always' short-circuits to safe
  2. Cloud provider: Claude or OpenRouter short-circuits to safe (no local contention possible)
  3. Remote Ollama: hostname not in the local-host allowlist short-circuits to safe
  4. Active sessions: skip if any voice session is active
  5. Session cooldown: skip if fewer than 30 seconds have elapsed since the last session ended
  6. System load: skip if os.loadavg()[0] > cpuCount * 0.8 (no-op on Windows, where loadavg() returns zeros)
  7. Ollama VRAM probe: skip if GET /api/ps reports any loaded model with a non-expiring expires_at
Consolidation uses retry-with-backoff (5 × 60 s) when deferred, because stale consolidation is worse than brief contention. Maintenance uses skip-and-wait: the next cron tick retries naturally.
Configure the policy via AGTOS_BACKGROUND_WORK_POLICY:
| Policy | Behavior |
| --- | --- |
| auto | Default. Runs all seven checks. |
| always | Skip all checks. Use on dedicated GPU hosts. |
| idle-only | Stricter load threshold (0.3 × CPU count). POSIX hosts only: on Windows the load check is a no-op, so the VRAM probe becomes the sole strong signal. |
The idle-only strict-idle guarantee only holds on POSIX hosts. Operators running idle-only on Windows should rely on the Ollama VRAM probe and the active-session count.
ResourceGuard exposes a resource-guard health check (informational, always healthy), the agtos_background_work_safe Prometheus gauge, and the agtos_resource_guard_defer_count_total{reason} counter so dashboards can alert on sustained deferrals.
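The seven-step decision tree can be written as a pure function over a snapshot of system state. This is a sketch under assumptions rather than the actual ResourceGuard: the `GuardSnapshot` fields, the reason strings, and the contents of the local-host allowlist are hypothetical, and the snapshot is presumed to be filled in by the caller (for example from `os.loadavg()[0]` and the Ollama `GET /api/ps` probe).

```typescript
type Policy = "auto" | "always" | "idle-only";
type Provider = "ollama" | "claude" | "openrouter";

// Hypothetical snapshot of everything the decision tree inspects.
interface GuardSnapshot {
  policy: Policy;
  provider: Provider;
  ollamaHostname: string;
  activeVoiceSessions: number;
  secondsSinceLastSession: number;
  loadAvg1m: number; // zero on Windows, which makes the load check a no-op
  cpuCount: number;
  ollamaModelPinned: boolean; // true if a loaded model has a non-expiring expires_at
}

interface GuardDecision {
  safe: boolean;
  reason: string;
}

// Assumed allowlist; the real list is not documented here.
const LOCAL_HOSTS = ["localhost", "127.0.0.1", "::1"];

function decide(s: GuardSnapshot): GuardDecision {
  // Steps 1-3 short-circuit to safe.
  if (s.policy === "always") return { safe: true, reason: "policy-override" };
  if (s.provider !== "ollama") return { safe: true, reason: "cloud-provider" };
  if (LOCAL_HOSTS.indexOf(s.ollamaHostname) === -1) return { safe: true, reason: "remote-ollama" };
  // Steps 4-7 defer the background work.
  if (s.activeVoiceSessions > 0) return { safe: false, reason: "active-session" };
  if (s.secondsSinceLastSession < 30) return { safe: false, reason: "session-cooldown" };
  const loadFactor = s.policy === "idle-only" ? 0.3 : 0.8;
  if (s.loadAvg1m > s.cpuCount * loadFactor) return { safe: false, reason: "system-load" };
  if (s.ollamaModelPinned) return { safe: false, reason: "vram-busy" };
  return { safe: true, reason: "all-checks-passed" };
}
```

Keeping the function free of I/O makes the ordering of the checks easy to test: the cheap, always-safe cases run first, and the probes that depend on host state run last.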

Dialectic Engine

The Dialectic engine answers questions about the user by synthesizing information from the profile, conclusions, and episodic memories. Use POST /api/memory/ask to query it:
curl -X POST http://localhost:4102/api/memory/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What are this user'\''s preferred communication channels?"}'

Cross-Tool Memory Import

agtOS can import memories from other AI tools via the import pipeline. Supported sources:
| Source | What’s Imported |
| --- | --- |
| Claude Code | Conversation history and project memories |
| Cursor | Editor conversation context |
| Windsurf | AI assistant conversations |
| Aider | Git-based coding conversations |
| GitHub Copilot | Interaction history |
The CLI provides a convenient interface:
# Scan for available sources and import
npx agtos memory import
Or use the API directly:
# Scan available sources
curl http://localhost:4102/api/memory/sources

# Import from all sources
curl -X POST http://localhost:4102/api/memory/import

Configuration

# Redis URL (required for episodic and semantic memory)
REDIS_URL=redis://localhost:6379

# Ollama (provides embeddings for semantic memory)
OLLAMA_HOST=http://localhost:11434
Working memory settings are configured via code (WorkingMemoryConfig), not environment variables. The defaults work well for most deployments:
| Setting | Default | Description |
| --- | --- | --- |
| maxHistoryTokens | 4000 | Max tokens before summarization triggers |
| maxMessages | 30 | Max messages before summarization regardless of token count |
| preserveRecentCount | 6 | Recent messages preserved verbatim during summarization |
When conversation length exceeds these limits, older turns are summarized by the LLM into a condensed context, keeping the working memory focused and within token limits.
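The interaction of these three settings can be illustrated with a short sketch that decides which messages to summarize and which to keep verbatim. The setting names and defaults come from the table above, but the interface shape, the `Message` type, the `planSummarization` function, and the 4-characters-per-token estimate are assumptions made for the example.

```typescript
// Setting names and defaults follow the table above; the interface shape is assumed.
interface WorkingMemoryConfig {
  maxHistoryTokens: number;
  maxMessages: number;
  preserveRecentCount: number;
}

const DEFAULT_CONFIG: WorkingMemoryConfig = {
  maxHistoryTokens: 4000,
  maxMessages: 30,
  preserveRecentCount: 6,
};

interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Rough estimate of about 4 characters per token (assumption, not a real tokenizer).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function planSummarization(
  messages: Message[],
  cfg: WorkingMemoryConfig = DEFAULT_CONFIG
): { toSummarize: Message[]; toKeep: Message[] } {
  const totalTokens = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  // Either limit triggers summarization: token count or message count.
  const overLimit = totalTokens > cfg.maxHistoryTokens || messages.length > cfg.maxMessages;
  if (!overLimit || messages.length <= cfg.preserveRecentCount) {
    return { toSummarize: [], toKeep: messages };
  }
  // Everything except the most recent messages is handed to the summarizer.
  const cut = messages.length - cfg.preserveRecentCount;
  return { toSummarize: messages.slice(0, cut), toKeep: messages.slice(cut) };
}
```

With the defaults, a 31-message conversation would hand the oldest 25 messages to the summarizer and keep the latest 6 verbatim.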

Entity-Centric Memory (Knowledge Wiki)

On top of the three-tier memory system, agtOS builds a structured knowledge graph of entities and relationships extracted from conversations (ADR-030).

How It Works

When episodes are created, the system runs NER (Named Entity Recognition) via @huggingface/transformers (bert-base-NER) to automatically extract entities — people, places, organizations, events, and things. Entities are deduplicated via alias matching and embedding similarity (0.85 threshold), then stored in Redis JSON with RediSearch indexing.

Entity Types

| Type | Examples |
| --- | --- |
| person | People mentioned in conversations |
| place | Locations, cities, countries |
| organization | Companies, teams, institutions |
| event | Meetings, holidays, deadlines |
| thing | Projects, tools, concepts |

Relationships

Entities are connected via typed relationships (e.g., “Alice works-at Acme Corp”, “Project X belongs-to Team Alpha”). Relationships have confidence scores and source episode references, enabling graph-style queries.

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/entities | List/search entities by name or type |
| GET | /api/entities/stats | Entity counts grouped by type |
| GET | /api/entities/:id | Get entity details |
| PUT | /api/entities/:id | Update entity name, aliases, confidence |
| DELETE | /api/entities/:id | Soft-delete entity |
| POST | /api/entities/:id/merge | Merge duplicate entities |
| GET | /api/entities/:id/episodes | Episodes mentioning this entity |
| GET | /api/entities/:id/conclusions | Conclusions referencing this entity |
| GET | /api/entities/:id/relationships | Relationships involving this entity |

Dashboard

The Knowledge page in the dashboard provides a wiki-style browser for entities. You can search by name, filter by type, view relationship graphs, merge duplicates, and edit entity details. The Entity Detail page shows a single entity with its full context — episodes, conclusions, and relationships.
Entity-centric memory requires Redis. Entities are automatically extracted from conversations — no manual setup is needed.

Privacy and Data Control

The memory system includes privacy controls:
  • Explicit deletion: Memories can be removed via the forget() protocol method
  • TTL expiration: Episodic memories expire after a configurable period (default 30 days)
  • Per-user isolation: Memories are scoped to individual users via device-to-user mapping
  • User preferences: Privacy settings allow users to opt out of memory persistence entirely

What’s next

MCP Integration

How tools and external servers extend agent capabilities.

Memory API

REST endpoints for episodes, search, profile, and Dialectic reasoning.