agtOS implements a three-tier memory system that gives the agent the ability to remember conversation context within a session, recall past interactions across sessions, and build long-term knowledge about users and their preferences.
Why Memory Matters
Without memory, a voice agent is stateless. It asks the same clarifying questions every session, cannot reference previous conversations, and feels impersonal. The memory system solves this by providing three layers of recall, each serving a different purpose.
Memory Tiers
- Working Memory (current session context): recent conversation turns, active tool results, current task state. Always available, with no external dependencies.
- Episodic Memory (cross-session recall): conversation summaries, extracted facts, user corrections. Stored in Redis with configurable TTL.
- Semantic Memory (long-term knowledge): user preferences, learned facts, entity relationships. Embedding-based vector search for semantic retrieval.
Working Memory
Working memory is the conversation context available to the LLM during the current session. It lives directly in the LLM’s context window.
- Scope: Current session only
- Storage: In-process (no external dependencies)
- Contents: Recent conversation turns, active tool results, current task state
Working memory is managed by the session manager and passed in the `messages` array to the LLM. When the conversation grows long, automatic summarization compresses older turns into a summary, keeping the context window focused on recent and relevant information.
Episodic Memory
Episodic memory preserves conversation knowledge after a session ends. When a session completes, the system uses an LLM to summarize the conversation and extract key facts, then stores these in Redis.
- Scope: Retained across sessions
- Storage: Redis with TTL-based expiration (default 30 days)
- Contents: Conversation summaries, extracted facts, user corrections, task outcomes
How Memories Are Saved
Not every conversation is worth remembering. The episodic memory system uses heuristic save decisions to determine what to persist:
- Conversations with user corrections or preferences are always saved
- Task completions and their outcomes are saved
- Short, trivial exchanges (greetings, single-turn lookups) may be skipped
- An importance score (0.0-1.0) is assigned to each episode
Querying Episodes
Episodic memories can be retrieved by recency (`GET /api/memory/episodes`) or by keyword search (`GET /api/memory/search`).
Semantic Memory
Semantic memory provides long-term knowledge storage with embedding-based vector search. Unlike episodic memory (which stores summaries of specific conversations), semantic memory stores distilled facts, preferences, and patterns that persist indefinitely.
- Scope: Cross-session, permanent until explicitly deleted
- Storage: Redis Vector Search (HNSW indexing)
- Contents: User preferences, learned facts, entity relationships, behavioral patterns
How It Works
- Embedding generation: When a memory is stored, its text is converted to a dense vector using an embedding model
- Vector indexing: The embedding is stored in Redis Vector Search with HNSW indexing for fast approximate nearest-neighbor search
- Semantic retrieval: When the agent needs context, the current query is embedded and compared against stored memories using cosine similarity
- Relevance threshold: Only memories above a configurable similarity threshold are included in the LLM context
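The retrieval and threshold steps above can be sketched as follows. This is a minimal brute-force illustration, not the agtOS implementation: the real system uses an HNSW index in Redis, and the `StoredMemory` shape and function names here are assumptions made for the example.

```typescript
// Hypothetical shape of a stored semantic memory; field names are assumptions.
interface StoredMemory {
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force stand-in for the vector index lookup: score every memory,
// keep those above the threshold, and return them best-first.
function retrieveRelevant(
  queryEmbedding: number[],
  memories: StoredMemory[],
  threshold: number,
): StoredMemory[] {
  return memories
    .map((memory) => ({ memory, score: cosineSimilarity(queryEmbedding, memory.embedding) }))
    .filter((scored) => scored.score > threshold)
    .sort((x, y) => y.score - x.score)
    .map((scored) => scored.memory);
}
```

An approximate index such as HNSW trades a small amount of recall for speed; the exhaustive scan here is only meant to make the scoring and filtering visible.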
Search Modes
The memory system supports three search strategies, configurable via `AGTOS_MEMORY_SEARCH_MODE`:
| Mode | Description | Default |
|---|---|---|
| `hybrid` | Combines BM25 keyword matching with vector similarity using reciprocal rank fusion (RRF) | Yes |
| `vector` | Pure embedding-based semantic search (cosine similarity) | No |
| `bm25` | Pure keyword/term frequency search | No |
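To make the `hybrid` mode concrete, here is a minimal sketch of reciprocal rank fusion over two ranked result lists. The function name is hypothetical, and the constant `k = 60` is the value commonly used for RRF in the literature, not a confirmed agtOS default.

```typescript
// Fuse several ranked lists of document IDs with reciprocal rank fusion.
// Each document scores sum(1 / (k + rank)) over the lists it appears in,
// with rank starting at 1. k = 60 is a conventional choice (assumption).
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id);
}
```

Because RRF uses only ranks, it does not need the BM25 and cosine scores to be on comparable scales, which is the usual reason for choosing it over a weighted score sum.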
Embedding Providers
| Provider | Model | Dimensions | Use Case |
|---|---|---|---|
| Ollama | nomic-embed-text | 768 | Local/private, no external API calls (default) |
| OpenRouter | openai/text-embedding-3-small | 1536 | Cloud-based, highest quality |
Semantic memory requires both Redis (with the RediSearch module) and an embedding provider (Ollama or OpenRouter). If either is unavailable, semantic memory falls back to episodic keyword search.
Context Assembly
Before each LLM call, the Memory Coordinator assembles a context packet that combines all three memory tiers:
Episodic Recall
Relevant session summaries are retrieved from Redis based on semantic similarity to the current query.
Semantic Recall
Relevant long-term memories are retrieved from the vector store based on embedding similarity.
Memory Coordinator
The `MemoryCoordinator` is the unified facade for the three-tier system. It provides a single entry point for the orchestrator and agent loop, handling the complexity of coordinating across tiers.
Graceful degradation is built in:
- Working memory (in-process) always works, with no external dependencies
- Episodic memory (Redis) is optional and logs warnings if unavailable
- Semantic memory (Redis + an embedding provider, Ollama or OpenRouter) is optional and falls back to episodic keyword search
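The degradation order above can be illustrated with a small sketch. The `SearchFn` type, the `RecallResult` shape, and `recallWithFallback` are hypothetical names invented for this example; the actual `MemoryCoordinator` API is not shown on this page.

```typescript
// Hypothetical search functions; either may reject when Redis or the
// embedding provider is unreachable. These are not agtOS APIs.
type SearchFn = (query: string) => Promise<string[]>;

interface RecallResult {
  source: "semantic" | "episodic-keyword" | "none";
  memories: string[];
  warnings: string[];
}

async function recallWithFallback(
  query: string,
  semanticSearch: SearchFn,
  episodicKeywordSearch: SearchFn,
): Promise<RecallResult> {
  const warnings: string[] = [];
  try {
    return { source: "semantic", memories: await semanticSearch(query), warnings };
  } catch (err) {
    // Semantic tier is optional: record a warning and fall through.
    warnings.push(`semantic memory unavailable: ${String(err)}`);
  }
  try {
    return { source: "episodic-keyword", memories: await episodicKeywordSearch(query), warnings };
  } catch (err) {
    // Episodic tier is optional too.
    warnings.push(`episodic memory unavailable: ${String(err)}`);
  }
  // Working memory lives in-process, so the agent can still answer from it.
  return { source: "none", memories: [], warnings };
}
```

Returning warnings alongside the result, rather than throwing, keeps a Redis outage from interrupting the voice session.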
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/memory/episodes` | List recent episodic memories |
| `GET` | `/api/memory/search?q=...` | Semantic + keyword search across episodes |
| `GET` | `/api/memory/profile` | Get user profile (name, preferences, patterns) |
| `PUT` | `/api/memory/profile` | Update user profile fields |
| `GET` | `/api/memory/conclusions` | Get dialectic reasoning conclusions |
| `DELETE` | `/api/memory/conclusions/:id` | Delete a specific conclusion |
| `POST` | `/api/memory/ask` | Ask a question about the user (RAG via Dialectic engine) |
| `GET` | `/api/memory/sources` | Scan for importable external AI tool memories |
| `POST` | `/api/memory/import` | Import memories from external AI tools |
| `POST` | `/api/memory/maintain` | Trigger a memory maintenance sweep (memory lint) |
| `GET` | `/api/memory/maintain/history` | List recent maintenance reports |
| `GET` | `/api/memory/maintain/history/:timestamp` | Fetch a single maintenance report by timestamp |
Query Parameters
`GET /api/memory/episodes`
| Parameter | Type | Default | Description |
|---|---|---|---|
| `limit` | number | 20 | Max results (1-100) |
`GET /api/memory/search`
| Parameter | Type | Default | Description |
|---|---|---|---|
| `q` | string | (none) | Required. Search query text |
| `limit` | number | 10 | Max results (1-50) |
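As an illustration of the query parameters above, the following sketch builds request URLs for the two endpoints. The helper names and the base URL are placeholders, and clamping `limit` on the client is an assumption; the page does not say how the server treats out-of-range values.

```typescript
// Clamp a requested limit into the documented range for an endpoint.
function clampLimit(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, Math.trunc(value)));
}

// baseUrl is a placeholder for wherever the agtOS server is reachable.
function episodesUrl(baseUrl: string, limit = 20): string {
  const url = new URL("/api/memory/episodes", baseUrl);
  url.searchParams.set("limit", String(clampLimit(limit, 1, 100)));
  return url.toString();
}

function searchUrl(baseUrl: string, q: string, limit = 10): string {
  if (q.trim() === "") throw new Error("q is required");
  const url = new URL("/api/memory/search", baseUrl);
  url.searchParams.set("q", q);
  url.searchParams.set("limit", String(clampLimit(limit, 1, 50)));
  return url.toString();
}
```

The resulting string can be passed straight to `fetch`, for example `fetch(searchUrl("http://localhost:3000", "metric units"))`, where the host and port are again only an example.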
User Profile
The memory system builds a user profile from conversation history, tracking the user’s name, communication style, behavioral patterns, and preferences. The profile is used to personalize agent responses.
Dreamer (Memory Consolidation)
The Dreamer is a background process that consolidates episodic memories into higher-level user conclusions at the end of every voice session. It uses an LLM to synthesize patterns across multiple episodes, for example noticing that a user consistently asks about weather in the morning and inferring “user is a morning person who checks weather first.” The Dreamer is wired into `endVoiceSession()` and also listens for the server-level `sessionEnded` event as defense-in-depth (see ADR-021).
Conclusions have a confidence score (0.0-1.0) and a type:
| Type | Description |
|---|---|
| `explicit` | Directly stated by the user (“I prefer metric units”) |
| `deductive` | Logically derived from multiple facts (A + B implies C) |
| `inductive` | Generalized from observed patterns across sessions |
| `abductive` | Best explanation for observed user behavior |
The provider and model used for consolidation are set in the `consolidation` task slot in `~/.agtos/config.json`. The `AGTOS_CONSOLIDATION_PROVIDER` and `AGTOS_CONSOLIDATION_MODEL` env vars are still supported as fallbacks when the slot is not configured.
Query-as-Ingest
When the agent produces a high-quality synthesized response (multi-step reasoning, a comparison table, an analysis), that synthesis used to live only in the current session’s working memory. Query-as-Ingest persists those responses as `RESPONSE_INGEST` episodes so future questions can build on prior synthesis (see ADR-021).
The importance score is computed heuristically:
| Signal | Importance bump |
|---|---|
| Baseline | 4 |
| Tool calls executed | +3 |
| Multi-step reasoning (≥ 2 steps) | +2 |
| Length > 1000 chars | +2 |
| Length > 500 chars | +1 |
| Synthesis patterns (“compared to”, “trade-offs”, “analysis shows”) | +1 |
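The table above can be read as a simple additive scorer. The sketch below is one possible reading, not the agtOS code: the `ResponseSignals` shape and function name are hypothetical, the two length tiers are assumed to be mutually exclusive, and no upper cap is applied because the page does not mention one.

```typescript
// Hypothetical input shape; field names are illustrative, not agtOS types.
interface ResponseSignals {
  text: string;
  toolCallCount: number;
  reasoningSteps: number;
}

const SYNTHESIS_PATTERNS = ["compared to", "trade-offs", "analysis shows"];

function scoreResponseImportance(signals: ResponseSignals): number {
  let score = 4; // baseline
  if (signals.toolCallCount > 0) score += 3; // tool calls executed
  if (signals.reasoningSteps >= 2) score += 2; // multi-step reasoning
  // Assumption: only the highest matching length tier applies.
  if (signals.text.length > 1000) score += 2;
  else if (signals.text.length > 500) score += 1;
  const lower = signals.text.toLowerCase();
  if (SYNTHESIS_PATTERNS.some((pattern) => lower.includes(pattern))) score += 1;
  return score;
}
```

Under these assumptions a short reply with no tool calls scores the baseline of 4, and a long, tool-backed, multi-step analysis scores 12.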
recordInteraction() save path.
Maintenance Mode (Memory Lint)
The Dreamer also runs a periodic knowledge-base health sweep that Karpathy’s “LLM Wiki” gist calls a lint operation. The sweep is six steps:
1. Stale detection: conclusions older than `staleThresholdDays` are flagged.
2. Confidence decay: stale conclusions have their confidence multiplied by 0.9.
3. Redundancy merge: pairs above a Jaccard similarity threshold (0.85 default) are merged (higher-confidence wins, sources combined).
4. Orphan episode flagging: episodes not referenced by any conclusion and with `importance < 3` (configurable via `orphanImportanceThreshold`) are flagged.
5. Contradiction detection: a contradiction detector runs against the conclusion set (see Contradiction pipeline below).
6. Low-confidence pruning: conclusions below `pruneConfidenceThreshold` (0.3 default) are deleted.
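Three of the sweep steps (confidence decay, Jaccard similarity for the redundancy merge, and low-confidence pruning) reduce to a few lines of arithmetic. The sketch below is illustrative only: the `ConclusionRecord` shape is hypothetical, and computing Jaccard similarity over lowercase word sets is an assumption, since the page does not say how conclusions are tokenized.

```typescript
// Hypothetical conclusion shape for illustration.
interface ConclusionRecord {
  id: string;
  text: string;
  confidence: number;
  ageDays: number;
}

// Jaccard similarity over lowercase word sets: |A ∩ B| / |A ∪ B|.
function jaccardSimilarity(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  let shared = 0;
  setA.forEach((word) => {
    if (setB.has(word)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Decay stale conclusions by 0.9, then drop those below the prune threshold.
function decayAndPrune(
  conclusions: ConclusionRecord[],
  staleThresholdDays: number,
  pruneConfidenceThreshold = 0.3,
): ConclusionRecord[] {
  return conclusions
    .map((c) => (c.ageDays > staleThresholdDays ? { ...c, confidence: c.confidence * 0.9 } : c))
    .filter((c) => c.confidence >= pruneConfidenceThreshold);
}
```

Repeated sweeps compound the decay, so a stale conclusion at confidence 0.5 falls below the 0.3 threshold after five sweeps (0.5 × 0.9⁵ ≈ 0.295) unless something refreshes it.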
The sweep runs on a daily cron schedule (`0 3 * * *` in the `AGTOS_MAINTENANCE_TIMEZONE` zone, default UTC) and can also be triggered on demand via `POST /api/memory/maintain` or the CLI.
The `memory-maintenance` health check goes unhealthy if the sweep hasn’t run in more than 48 hours. The `lastMaintenanceAt` and `totalSweeps` counters are persisted to Redis so a restart within the 48-hour window doesn’t spuriously report stale. A `memory-semantic` health check separately probes the RediSearch vector index document count and size via `FT.INFO` with a 1-second timeout.
All maintenance runs are gated by the ResourceGuard (below): they’re skipped on busy systems and retried on the next cron tick. Set `AGTOS_MAINTENANCE_ENABLED=false` to disable maintenance entirely (this kill switch also short-circuits manually-created schedules and direct event publishes).
Contradiction pipeline (NLI hybrid)
On large profiles, asking a single LLM call to audit every pair of conclusions suffers from attention dilution and hallucinated IDs. agtOS ships a mandatory 3-stage NLI hybrid contradiction pipeline (ADR-027) that powers Step 5:
- Stage 1, candidate selection: in-memory cosine similarity over conclusion embeddings, top-K nearest plus an “interesting pair” heuristic, deduplicated and truncated to 500.
- Stage 2, NLI cross-encoder: a quantized DeBERTa-v3-base MNLI model (~223 MB, SHA-256 pinned in a manifest, atomic-rename download) runs locally via `onnxruntime-node`. Verdicts are cached in Redis via a `PairCache` (Redis 7.4+ `HEXPIRE` with STRING+EX fallback).
- Stage 3, batched LLM judge: the surviving candidates go to a batched judge (10 pairs per call, Zod-validated structured JSON), which runs through the `maintenance` task slot from ADR-026 so operators can pin a different model for the judge than for ordinary consolidation.
Pipeline statistics are reported in `MaintenanceReport.summary.contradictionPipeline` (6 counters plus per-stage latencies). A new `memory.contradiction.detected` event fires once per confirmation. Set `AGTOS_NLI_ENABLED=false` to skip Stage 2: candidates then pass directly from Stage 1 to Stage 3 (the LLM judge), which increases token cost but removes the ONNX Runtime dependency. Prebuild the NLI model bundle with `npm run prebuild:nli`.
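Two mechanical parts of the pipeline, the Stage 1 deduplication and truncation and the Stage 3 batching, can be sketched as below. The type and function names are hypothetical; only the limits (500 candidates, 10 pairs per judge call) come from the description above.

```typescript
type ConclusionPair = [string, string];

// Order-independent key so (a, b) and (b, a) count as the same pair.
function pairKey(pair: ConclusionPair): string {
  const [a, b] = pair;
  return JSON.stringify(a < b ? [a, b] : [b, a]);
}

// Drop self-pairs and duplicates, keeping at most maxPairs candidates.
function dedupeAndTruncate(pairs: ConclusionPair[], maxPairs = 500): ConclusionPair[] {
  const seen = new Set<string>();
  const unique: ConclusionPair[] = [];
  for (const pair of pairs) {
    const key = pairKey(pair);
    if (pair[0] === pair[1] || seen.has(key)) continue;
    seen.add(key);
    unique.push(pair);
    if (unique.length === maxPairs) break;
  }
  return unique;
}

// Split surviving candidates into fixed-size batches for the LLM judge.
function toBatches<T>(items: T[], batchSize = 10): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

With the documented limits, a full candidate set of 500 pairs produces at most 50 judge calls per sweep when nothing is filtered out by Stage 2.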
ResourceGuard
Background LLM work on local Ollama can contend with an active voice session for GPU and VRAM: requesting a different model causes Ollama to unload the currently-loaded model, and the user’s next voice response then takes 10-20 seconds while the original model reloads. The ResourceGuard gates all background LLM calls via a deterministic decision tree (ADR-021):
1. Policy override: `policy === 'always'` short-circuits to safe
2. Cloud provider: Claude or OpenRouter short-circuits to safe (no local contention possible)
3. Remote Ollama: hostname not in the local-host allowlist short-circuits to safe
4. Active sessions: skip if any voice session is active
5. Session cooldown: skip if fewer than 30 seconds have elapsed since the last session ended
6. System load: skip if `os.loadavg()[0] > cpuCount * 0.8` (no-op on Windows, where `loadavg()` returns zeros)
7. Ollama VRAM probe: skip if `GET /api/ps` reports any loaded model with a non-expiring `expires_at`
The policy is set via `AGTOS_BACKGROUND_WORK_POLICY`:
| Policy | Behavior |
|---|---|
| `auto` | Default. Runs all seven checks. |
| `always` | Skip all checks. Use on dedicated GPU hosts. |
| `idle-only` | Stricter load threshold (0.3 × CPU count). POSIX hosts only: on Windows the load check is a no-op, so the VRAM probe becomes the sole strong signal. |
The guard’s state is exposed through the `resource-guard` health check (informational, always healthy), the `agtos_background_work_safe` Prometheus gauge, and the `agtos_resource_guard_defer_count_total{reason}` counter so dashboards can alert on sustained deferrals.
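The ResourceGuard decision tree can be expressed as a pure function over a snapshot of system state. This is a sketch under assumptions: the `GuardSnapshot` fields and the reason strings are invented for the example, and the real guard gathers these inputs itself (for instance from `os.loadavg()` and the Ollama `GET /api/ps` probe).

```typescript
type BackgroundWorkPolicy = "auto" | "always" | "idle-only";

// Hypothetical snapshot of the inputs the guard inspects.
interface GuardSnapshot {
  policy: BackgroundWorkPolicy;
  providerIsCloud: boolean; // Claude or OpenRouter
  ollamaHostIsLocal: boolean; // hostname is in the local-host allowlist
  activeVoiceSessions: number;
  secondsSinceLastSession: number; // Infinity if no session has ended yet
  loadAverage1m: number; // os.loadavg()[0]; always 0 on Windows
  cpuCount: number;
  vramModelPinned: boolean; // a loaded model with a non-expiring expires_at
}

interface GuardDecision {
  safe: boolean;
  reason: string; // illustrative labels, not the real metric label values
}

function decideBackgroundWork(s: GuardSnapshot): GuardDecision {
  if (s.policy === "always") return { safe: true, reason: "policy-override" };
  if (s.providerIsCloud) return { safe: true, reason: "cloud-provider" };
  if (!s.ollamaHostIsLocal) return { safe: true, reason: "remote-ollama" };
  if (s.activeVoiceSessions > 0) return { safe: false, reason: "active-session" };
  if (s.secondsSinceLastSession < 30) return { safe: false, reason: "session-cooldown" };
  // idle-only tightens the load threshold from 0.8 to 0.3 of the CPU count.
  const loadFactor = s.policy === "idle-only" ? 0.3 : 0.8;
  if (s.loadAverage1m > s.cpuCount * loadFactor) return { safe: false, reason: "system-load" };
  if (s.vramModelPinned) return { safe: false, reason: "ollama-vram" };
  return { safe: true, reason: "all-checks-passed" };
}
```

Keeping the decision separate from the probing makes the ordering easy to test: the three short-circuits to safe are evaluated before any of the skip conditions.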
Dialectic Engine
The Dialectic engine answers questions about the user by synthesizing information from the profile, conclusions, and episodic memories. Use `POST /api/memory/ask` to query it.
Cross-Tool Memory Import
agtOS can import memories from other AI tools via the import pipeline. Supported sources:
| Source | What’s Imported |
|---|---|
| Claude Code | Conversation history and project memories |
| Cursor | Editor conversation context |
| Windsurf | AI assistant conversations |
| Aider | Git-based coding conversations |
| GitHub Copilot | Interaction history |
Configuration
Working memory configuration
Working memory settings are configured via code (`WorkingMemoryConfig`), not environment variables. The defaults work well for most deployments:
| Setting | Default | Description |
|---|---|---|
| `maxHistoryTokens` | 4000 | Max tokens before summarization triggers |
| `maxMessages` | 30 | Max messages before summarization regardless of token count |
| `preserveRecentCount` | 6 | Recent messages preserved verbatim during summarization |
When conversation length exceeds these limits, older turns are summarized by the LLM into a condensed context, keeping the working memory focused and within token limits.
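The three settings interact as a trigger plus a split point. The sketch below shows one plausible reading: the field names and defaults come from the table, but the interface shape, the `ChatMessage` type, the helper names, and the four-characters-per-token estimate are assumptions made for the example.

```typescript
// Field names and defaults are from the settings table; the exact shape of
// the real WorkingMemoryConfig type is an assumption.
interface WorkingMemoryConfig {
  maxHistoryTokens: number;
  maxMessages: number;
  preserveRecentCount: number;
}

const DEFAULT_WORKING_MEMORY: WorkingMemoryConfig = {
  maxHistoryTokens: 4000,
  maxMessages: 30,
  preserveRecentCount: 6,
};

interface ChatMessage {
  role: string;
  content: string;
}

// Rough token estimate (about 4 characters per token); an assumption,
// not the tokenizer agtOS uses.
function estimateTokens(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Decide which messages would be summarized and which stay verbatim.
function splitForSummarization(
  messages: ChatMessage[],
  config: WorkingMemoryConfig = DEFAULT_WORKING_MEMORY,
): { toSummarize: ChatMessage[]; toKeep: ChatMessage[] } {
  const overLimit =
    messages.length > config.maxMessages ||
    estimateTokens(messages) > config.maxHistoryTokens;
  if (!overLimit) return { toSummarize: [], toKeep: messages };
  const cut = Math.max(0, messages.length - config.preserveRecentCount);
  return { toSummarize: messages.slice(0, cut), toKeep: messages.slice(cut) };
}
```

With the defaults, a 31-message history would send the oldest 25 messages to the summarizer and keep the latest 6 verbatim.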
Entity-Centric Memory (Knowledge Wiki)
On top of the three-tier memory system, agtOS builds a structured knowledge graph of entities and relationships extracted from conversations (ADR-030).
How It Works
When episodes are created, the system runs NER (Named Entity Recognition) via `@huggingface/transformers` (bert-base-NER) to automatically extract entities: people, places, organizations, events, and things. Entities are deduplicated via alias matching and embedding similarity (0.85 threshold), then stored in Redis JSON with RediSearch indexing.
Entity Types
| Type | Examples |
|---|---|
| `person` | People mentioned in conversations |
| `place` | Locations, cities, countries |
| `organization` | Companies, teams, institutions |
| `event` | Meetings, holidays, deadlines |
| `thing` | Projects, tools, concepts |
Relationships
Entities are connected via typed relationships (e.g., “Alice works-at Acme Corp”, “Project X belongs-to Team Alpha”). Relationships have confidence scores and source episode references, enabling graph-style queries.
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/entities` | List/search entities by name or type |
| `GET` | `/api/entities/stats` | Entity counts grouped by type |
| `GET` | `/api/entities/:id` | Get entity details |
| `PUT` | `/api/entities/:id` | Update entity name, aliases, confidence |
| `DELETE` | `/api/entities/:id` | Soft-delete entity |
| `POST` | `/api/entities/:id/merge` | Merge duplicate entities |
| `GET` | `/api/entities/:id/episodes` | Episodes mentioning this entity |
| `GET` | `/api/entities/:id/conclusions` | Conclusions referencing this entity |
| `GET` | `/api/entities/:id/relationships` | Relationships involving this entity |
Dashboard
The Knowledge page in the dashboard provides a wiki-style browser for entities. You can search by name, filter by type, view relationship graphs, merge duplicates, and edit entity details. The Entity Detail page shows a single entity with its full context: episodes, conclusions, and relationships.
Entity-centric memory requires Redis. Entities are automatically extracted from conversations, so no manual setup is needed.
Privacy and Data Control
The memory system includes privacy controls:
- Explicit deletion: Memories can be removed via the `forget()` protocol method
- TTL expiration: Episodic memories expire after a configurable period (default 30 days)
- Per-user isolation: Memories are scoped to individual users via device-to-user mapping
- User preferences: Privacy settings allow users to opt out of memory persistence entirely
What’s next
MCP Integration
How tools and external servers extend agent capabilities.
Memory API
REST endpoints for episodes, search, profile, and Dialectic reasoning.