Memory System

agtOS implements a three-tier memory system that gives the agent the ability to remember conversation context within a session, recall past interactions across sessions, and build long-term knowledge about users and their preferences.

Why Memory Matters

Without memory, a voice agent is stateless. It asks the same clarifying questions every session, cannot reference previous conversations, and feels impersonal. The memory system solves this by providing three layers of recall, each serving a different purpose.

Memory Tiers

Working Memory

Current session context: recent conversation turns, active tool results, current task state. Always available, with no external dependencies.

Episodic Memory

Cross-session recall: conversation summaries, extracted facts, user corrections. Stored in Redis with configurable TTL.

Semantic Memory

Long-term knowledge: user preferences, learned facts, entity relationships. Embedding-based vector search for semantic retrieval.

Working Memory

Working memory is the conversation context available to the LLM during the current session. It lives directly in the LLM’s context window.

Scope: Current session only
Storage: In-process (no external dependencies)
Contents: Recent conversation turns, active tool results, current task state

Working memory is managed by the session manager and passed in the messages array to the LLM. When the conversation grows long, automatic summarization compresses older turns into a summary, keeping the context window focused on recent and relevant information.

Turn 1: User asks about weather → stored in working memory
Turn 2: Agent responds with forecast → stored in working memory
Turn 3: User asks follow-up → LLM sees both previous turns
...
Turn 20: Older turns are summarized → "User discussed weather, then scheduling"

Working memory always works, even without Redis. It is the baseline that ensures every conversation has context, regardless of infrastructure availability.

Episodic Memory

Episodic memory preserves conversation knowledge after a session ends. When a session completes, the system uses an LLM to summarize the conversation and extract key facts, then stores these in Redis.

Scope: Retained across sessions
Storage: Redis with TTL-based expiration (default 30 days)
Contents: Conversation summaries, extracted facts, user corrections, task outcomes

How Memories Are Saved

Not every conversation is worth remembering. The episodic memory system uses heuristic save decisions to determine what to persist:

Conversations with user corrections or preferences are always saved
Task completions and their outcomes are saved
Short, trivial exchanges (greetings, single-turn lookups) may be skipped
An importance score (0.0-1.0) is assigned to each episode
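
The save heuristics above could be sketched roughly as follows. This is an illustration only: the `SessionSignals` shape, the two-turn cutoff for a "trivial" exchange, and the specific importance values are all assumptions, not the agtOS implementation.

```typescript
// Hypothetical input shape; field names are illustrative, not the agtOS schema.
interface SessionSignals {
  hasUserCorrection: boolean;
  hasStatedPreference: boolean;
  completedTask: boolean;
  turnCount: number;
}

interface SaveDecision {
  save: boolean;
  importance: number; // 0.0-1.0, as described above
  reason: string;
}

// Assumed cutoff for a "short, trivial" exchange.
const TRIVIAL_TURN_LIMIT = 2;

function decideSave(s: SessionSignals): SaveDecision {
  // Corrections and preferences are always saved.
  if (s.hasUserCorrection || s.hasStatedPreference) {
    return { save: true, importance: 0.9, reason: "correction-or-preference" };
  }
  // Task completions and their outcomes are saved.
  if (s.completedTask) {
    return { save: true, importance: 0.7, reason: "task-completion" };
  }
  // Short, trivial exchanges may be skipped.
  if (s.turnCount <= TRIVIAL_TURN_LIMIT) {
    return { save: false, importance: 0.1, reason: "trivial-exchange" };
  }
  return { save: true, importance: 0.4, reason: "default" };
}
```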

Querying Episodes

Episodic memories can be retrieved by recency or keyword search:

# Get recent episodes
curl "http://localhost:4102/api/memory/episodes?limit=10"

# Search episodes by keyword
curl "http://localhost:4102/api/memory/search?q=weather+forecast&limit=5"

Example response from GET /api/memory/episodes:

{
  "episodes": [
    {
      "id": "ep-abc123",
      "summary": "User asked about weather forecast for San Francisco",
      "keywords": ["weather", "forecast", "san francisco"],
      "topic": "weather",
      "timestamp": 1711612800000,
      "importance": 0.7,
      "type": "conversation"
    }
  ],
  "count": 1,
  "available": true
}

Example response from GET /api/memory/search:

{
  "query": "weather forecast",
  "results": [
    {
      "id": "ep-abc123",
      "summary": "User asked about weather forecast for San Francisco",
      "keywords": ["weather", "forecast", "san francisco"],
      "topic": "weather",
      "timestamp": 1711612800000,
      "importance": 0.7,
      "type": "conversation",
      "score": 0.92,
      "matchType": "semantic"
    }
  ],
  "count": 1
}

Semantic Memory

Semantic memory provides long-term knowledge storage with embedding-based vector search. Unlike episodic memory (which stores summaries of specific conversations), semantic memory stores distilled facts, preferences, and patterns that persist indefinitely.

Scope: Cross-session, permanent until explicitly deleted
Storage: Redis Vector Search (HNSW indexing)
Contents: User preferences, learned facts, entity relationships, behavioral patterns

How It Works

Embedding generation: When a memory is stored, its text is converted to a dense vector using an embedding model
Vector indexing: The embedding is stored in Redis Vector Search with HNSW indexing for fast approximate nearest-neighbor search
Semantic retrieval: When the agent needs context, the current query is embedded and compared against stored memories using cosine similarity
Relevance threshold: Only memories above a configurable similarity threshold are included in the LLM context
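
The retrieval and threshold steps above can be illustrated with a brute-force sketch. In agtOS the search runs inside Redis with an HNSW index; the in-memory scan, the `StoredMemory` shape, and the `retrieve` function here are assumptions used only to show how cosine similarity and the relevance threshold interact.

```typescript
// Hypothetical in-memory record; the real store is Redis Vector Search.
interface StoredMemory {
  id: string;
  text: string;
  embedding: number[];
}

interface ScoredMemory {
  id: string;
  text: string;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Keep only memories at or above the threshold, most similar first.
function retrieve(query: number[], memories: StoredMemory[], threshold: number): ScoredMemory[] {
  return memories
    .map((m) => ({ id: m.id, text: m.text, score: cosineSimilarity(query, m.embedding) }))
    .filter((m) => m.score >= threshold)
    .sort((a, b) => b.score - a.score);
}
```

A brute-force scan like this is linear in the number of memories; an approximate index such as HNSW trades a small amount of recall for much faster lookups on large stores.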

Search Modes

The memory system supports three search strategies, configurable via AGTOS_MEMORY_SEARCH_MODE:

Mode	Description	Default
`hybrid`	Combines BM25 keyword matching with vector similarity using reciprocal rank fusion (RRF)	Yes
`vector`	Pure embedding-based semantic search (cosine similarity)	No
`bm25`	Pure keyword/term frequency search	No

Hybrid mode gives the best results for most queries — it finds exact keyword matches and semantically similar content simultaneously.
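
As a rough illustration of how reciprocal rank fusion combines two ranked lists, the sketch below scores each result as the sum of 1 / (k + rank) over the lists it appears in. The function name is hypothetical, and k = 60 is a commonly used constant for RRF; whether agtOS uses that value is an assumption.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per item,
// where rank is 1-based. Items found by several lists accumulate score.
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Example: one list from keyword search, one from vector search.
const bm25Ranking = ["a", "b", "c"];
const vectorRanking = ["b", "c", "d"];
const fused = reciprocalRankFusion([bm25Ranking, vectorRanking]);
// "b" and "c" appear in both lists, so they outrank "a" and "d".
```

Because RRF uses only ranks, it does not need BM25 scores and cosine similarities to be on comparable scales.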

Enable query expansion with AGTOS_MEMORY_QUERY_EXPANSION=true to use Ollama to rewrite search queries for better recall. This adds ~50ms latency but can significantly improve results for vague queries.

Embedding Providers

Provider	Model	Dimensions	Use Case
Ollama	`nomic-embed-text`	768	Local/private, no external API calls (default)
OpenRouter	`openai/text-embedding-3-small`	1536	Cloud-based, highest quality

Configure via:

AGTOS_EMBEDDING_PROVIDER=ollama          # or 'openrouter'
AGTOS_EMBEDDING_MODEL=nomic-embed-text   # or 'openai/text-embedding-3-small'

Semantic memory requires both Redis (with the RediSearch module) and an embedding provider (Ollama or OpenRouter). If either is unavailable, semantic memory falls back to episodic keyword search.

Context Assembly

Before each LLM call, the Memory Coordinator assembles a context packet that combines all three memory tiers:

Working Memory

Current conversation turns are already in the messages array.

Episodic Recall

Relevant session summaries are retrieved from Redis based on semantic similarity to the current query.

Semantic Recall

Relevant long-term memories are retrieved from the vector store based on embedding similarity.

Deduplication

Redundant information across layers is removed.

Prioritization

Memories are ranked by relevance score and recency, then fit within a token budget.

Context Injection

The assembled memories are formatted as a system prompt section for the LLM.
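
The deduplication, prioritization, and injection steps above might look something like the following sketch. The `RecalledMemory` shape, the exact-match deduplication, the four-characters-per-token estimate, and the output format are all assumptions made for illustration.

```typescript
// Hypothetical shape for a recalled memory from either tier.
interface RecalledMemory {
  text: string;
  relevance: number; // similarity score, 0.0-1.0
  timestamp: number; // epoch milliseconds
  tier: "episodic" | "semantic";
}

// Rough token estimate (about 4 characters per token); an assumption.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function assembleContext(memories: RecalledMemory[], tokenBudget: number): string {
  // Deduplication: drop entries whose normalized text was already seen.
  const seen = new Set<string>();
  const unique = memories.filter((m) => {
    const key = m.text.trim().toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });

  // Prioritization: higher relevance first, newer first on ties.
  unique.sort((a, b) => b.relevance - a.relevance || b.timestamp - a.timestamp);

  // Fit within the token budget, skipping entries that do not fit.
  const lines: string[] = [];
  let used = 0;
  for (const m of unique) {
    const cost = estimateTokens(m.text);
    if (used + cost > tokenBudget) continue;
    lines.push(`- [${m.tier}] ${m.text}`);
    used += cost;
  }

  // Context injection: format as a system prompt section.
  return lines.length === 0 ? "" : ["Relevant memories:"].concat(lines).join("\n");
}
```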

Memory Coordinator

The MemoryCoordinator is the unified facade for the three-tier system. It provides a single entry point for the orchestrator and agent loop, handling the complexity of coordinating across tiers. Graceful degradation is built in:

Working memory (in-process) always works — no external dependencies
Episodic memory (Redis) is optional — logs warnings if unavailable
Semantic memory (Redis + an embedding provider such as Ollama) is optional — falls back to episodic keyword search

This means agtOS works on a fresh install with no Redis and no Ollama. As you add infrastructure, memory capabilities unlock progressively.
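
The progressive unlocking described above can be summarized in a few lines. This is a sketch under assumed names (`Infrastructure`, `availableTiers`), not the MemoryCoordinator API.

```typescript
// Hypothetical view of what infrastructure is reachable.
interface Infrastructure {
  redisAvailable: boolean;
  embeddingProviderAvailable: boolean;
}

type Tier = "working" | "episodic" | "semantic";

function availableTiers(infra: Infrastructure): Tier[] {
  // Working memory is in-process and always available.
  const tiers: Tier[] = ["working"];
  if (infra.redisAvailable) {
    tiers.push("episodic");
    // Semantic memory needs both Redis and an embedding provider.
    if (infra.embeddingProviderAvailable) {
      tiers.push("semantic");
    }
  }
  return tiers;
}
```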

API Endpoints

Method	Endpoint	Description
`GET`	`/api/memory/episodes`	List recent episodic memories
`GET`	`/api/memory/search?q=...`	Semantic + keyword search across episodes
`GET`	`/api/memory/profile`	Get user profile (name, preferences, patterns)
`PUT`	`/api/memory/profile`	Update user profile fields
`GET`	`/api/memory/conclusions`	Get dialectic reasoning conclusions
`DELETE`	`/api/memory/conclusions/:id`	Delete a specific conclusion
`POST`	`/api/memory/ask`	Ask a question about the user (RAG via Dialectic engine)
`GET`	`/api/memory/sources`	Scan for importable external AI tool memories
`POST`	`/api/memory/import`	Import memories from external AI tools
`POST`	`/api/memory/maintain`	Trigger a memory maintenance sweep (memory lint)
`GET`	`/api/memory/maintain/history`	List recent maintenance reports
`GET`	`/api/memory/maintain/history/:timestamp`	Fetch a single maintenance report by timestamp

See HTTP Endpoints for complete request/response documentation.

Query Parameters

GET /api/memory/episodes

Parameter	Type	Default	Description
`limit`	number	`20`	Max results (1-100)

GET /api/memory/search

Parameter	Type	Default	Description
`q`	string	—	Required. Search query text
`limit`	number	`10`	Max results (1-50)
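
A client might build these request URLs as in the sketch below. The host and port come from the curl examples earlier on this page; the helper names are hypothetical, and clamping `limit` on the client side is an assumption, since this page does not say how the server treats out-of-range values.

```typescript
// Host and port taken from the curl examples on this page.
const BASE_URL = "http://localhost:4102";

// Clamp to the documented range, dropping any fractional part.
function clampLimit(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, Math.trunc(value)));
}

// GET /api/memory/episodes: limit defaults to 20, range 1-100.
function episodesUrl(limit: number = 20): string {
  return `${BASE_URL}/api/memory/episodes?limit=${clampLimit(limit, 1, 100)}`;
}

// GET /api/memory/search: q is required, limit defaults to 10, range 1-50.
function searchUrl(q: string, limit: number = 10): string {
  if (q.trim() === "") {
    throw new Error("q is required");
  }
  return `${BASE_URL}/api/memory/search?q=${encodeURIComponent(q)}&limit=${clampLimit(limit, 1, 50)}`;
}
```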

User Profile

The memory system builds a user profile from conversation history, tracking the user’s name, communication style, behavioral patterns, and preferences. The profile is used to personalize agent responses.

Dreamer (Memory Consolidation)

The Dreamer is a background process that consolidates episodic memories into higher-level user conclusions at the end of every voice session. It uses an LLM to synthesize patterns across multiple episodes. For example, it might notice that a user consistently asks about weather in the morning and infer “user is a morning person who checks weather first.”

The Dreamer is wired into endVoiceSession() and also listens for the server-level sessionEnded event as defense-in-depth (see ADR-021).

Conclusions have a confidence score (0.0-1.0) and a type:

Type	Description
`explicit`	Directly stated by the user (“I prefer metric units”)
`deductive`	Logically derived from multiple facts (A + B implies C)
`inductive`	Generalized from observed patterns across sessions
`abductive`	Best explanation for observed user behavior

Configure the Dreamer provider via the consolidation task slot in ~/.agtos/config.json:

{
  "slots": {
    "consolidation": { "provider": "ollama", "model": "qwen3:4b" }
  }
}

The legacy AGTOS_CONSOLIDATION_PROVIDER and AGTOS_CONSOLIDATION_MODEL env vars are still supported as fallbacks when the slot is not configured.
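
The precedence described above (slot first, legacy env vars as fallback) could be resolved along these lines. The type and function names are hypothetical, and requiring both env vars to be set before falling back is an assumption.

```typescript
interface SlotConfig {
  provider: string;
  model: string;
}

// Minimal view of ~/.agtos/config.json, limited to the fields shown above.
interface AgtosConfig {
  slots?: { consolidation?: SlotConfig };
}

function resolveConsolidationSlot(
  config: AgtosConfig,
  env: Record<string, string | undefined>
): SlotConfig | undefined {
  // The configured slot takes precedence.
  const slot = config.slots?.consolidation;
  if (slot) {
    return slot;
  }
  // Legacy env vars are used only when the slot is not configured.
  const provider = env.AGTOS_CONSOLIDATION_PROVIDER;
  const model = env.AGTOS_CONSOLIDATION_MODEL;
  if (provider && model) {
    return { provider, model };
  }
  return undefined;
}
```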

Query-as-Ingest

When the agent produces a high-quality synthesized response (multi-step reasoning, a comparison table, an analysis), that synthesis used to live only in the current session’s working memory. Query-as-Ingest persists those responses as RESPONSE_INGEST episodes so future questions can build on prior synthesis (see ADR-021).

The importance score is computed heuristically:

Signal	Importance bump
Baseline	4
Tool calls executed	+3
Multi-step reasoning (≥ 2 steps)	+2
Length > 1000 chars	+2
Length > 500 chars	+1
Synthesis patterns (“compared to”, “trade-offs”, “analysis shows”)	+1

A response is ingested when importance ≥ 6. A per-session rate limit (5 by default) and a 5-second dedup window prevent flooding and collisions with the normal recordInteraction() save path.
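
The scoring table above translates into a short function. This sketch uses hypothetical names (`ResponseSignals`, `scoreResponse`) and makes one interpretive assumption: the two length bumps are treated as mutually exclusive tiers, so a response over 1000 characters gets +2 rather than +3. The rate limit and dedup window are not modeled.

```typescript
// Hypothetical signals extracted from an agent response.
interface ResponseSignals {
  toolCallsExecuted: number;
  reasoningSteps: number;
  text: string;
}

const SYNTHESIS_PATTERNS = ["compared to", "trade-offs", "analysis shows"];
const INGEST_THRESHOLD = 6;

function scoreResponse(s: ResponseSignals): number {
  let importance = 4; // baseline
  if (s.toolCallsExecuted > 0) importance += 3;
  if (s.reasoningSteps >= 2) importance += 2;
  // Assumption: only the highest matching length tier applies.
  if (s.text.length > 1000) {
    importance += 2;
  } else if (s.text.length > 500) {
    importance += 1;
  }
  const lower = s.text.toLowerCase();
  if (SYNTHESIS_PATTERNS.some((p) => lower.includes(p))) importance += 1;
  return importance;
}

function shouldIngest(s: ResponseSignals): boolean {
  return scoreResponse(s) >= INGEST_THRESHOLD;
}
```

Under these rules a short answer with no tool calls stays at the baseline of 4 and is not ingested, while any response that executed a tool call reaches 7 and is.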

Maintenance Mode (Memory Lint)

The Dreamer also runs a periodic knowledge-base health sweep, the kind of operation that Karpathy’s “LLM Wiki” gist calls lint. The sweep has six steps:

Stale detection — conclusions older than staleThresholdDays are flagged.
Confidence decay — stale conclusions have their confidence multiplied by 0.9.
Redundancy merge — pairs above a Jaccard similarity threshold (0.85 default) are merged (higher-confidence wins, sources combined).
Orphan episode flagging — episodes not referenced by any conclusion and with importance < 3 (configurable via orphanImportanceThreshold) are flagged.
Contradiction detection — a contradiction detector runs against the conclusion set (see Contradiction pipeline below).
Low-confidence pruning — conclusions below pruneConfidenceThreshold (0.3 default) are deleted.
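
Three of the steps above (confidence decay, redundancy merge, and pruning) are simple enough to sketch directly. The `Conclusion` shape, the `updatedAt` field used to measure staleness, and computing Jaccard similarity over lowercase word sets are assumptions; this page does not specify what the similarity is computed over.

```typescript
// Hypothetical conclusion shape; field names are illustrative.
interface Conclusion {
  id: string;
  text: string;
  confidence: number; // 0.0-1.0
  updatedAt: number;  // epoch milliseconds
  sources: string[];  // episode IDs
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Steps 1-2: stale conclusions have their confidence multiplied by 0.9.
function decayStale(items: Conclusion[], now: number, staleThresholdDays: number): Conclusion[] {
  return items.map((c) =>
    now - c.updatedAt > staleThresholdDays * DAY_MS
      ? { ...c, confidence: c.confidence * 0.9 }
      : c
  );
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over word sets (tokenization assumed).
function jaccard(a: string, b: string): number {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  let shared = 0;
  setA.forEach((token) => {
    if (setB.has(token)) shared += 1;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Step 3: the higher-confidence conclusion wins and sources are combined.
function mergePair(a: Conclusion, b: Conclusion): Conclusion {
  const winner = a.confidence >= b.confidence ? a : b;
  const sources = Array.from(new Set(a.sources.concat(b.sources)));
  return { ...winner, sources };
}

// Step 6: conclusions below the threshold are deleted.
function pruneLowConfidence(items: Conclusion[], threshold: number = 0.3): Conclusion[] {
  return items.filter((c) => c.confidence >= threshold);
}
```

With repeated decay at 0.9 per sweep, a stale conclusion that starts at confidence 0.5 drops below the 0.3 pruning threshold on its fifth sweep (0.5 × 0.9⁵ ≈ 0.295).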

The sweep is scheduled as a cron task at startup (default 0 3 * * * in the AGTOS_MAINTENANCE_TIMEZONE zone, default UTC) and can also be triggered on demand via POST /api/memory/maintain or the CLI:

agtos memory maintain
agtos memory maintain --verbose

A memory-maintenance health check goes unhealthy if the sweep hasn’t run in more than 48 hours. The lastMaintenanceAt and totalSweeps counters are persisted to Redis so a restart within the 48-hour window doesn’t spuriously report stale. A memory-semantic health check separately probes the RediSearch vector index document count and size via FT.INFO with a 1-second timeout.

All maintenance runs are gated by the ResourceGuard (below) — they’re skipped on busy systems and retried on the next cron tick. Set AGTOS_MAINTENANCE_ENABLED=false to disable maintenance entirely (this kill switch also short-circuits manually-created schedules and direct event publishes).

Contradiction pipeline (NLI hybrid)

On large profiles, asking a single LLM call to audit every pair of conclusions suffers from attention dilution and hallucinated IDs. agtOS ships a mandatory 3-stage NLI hybrid contradiction pipeline (ADR-027) that powers Step 5:

Stage 1 — Candidate selection: in-memory cosine similarity over conclusion embeddings, top-K nearest plus an “interesting pair” heuristic, deduplicated and truncated to 500.
Stage 2 — NLI cross-encoder: a quantized DeBERTa-v3-base MNLI model (~223 MB, SHA-256 pinned in a manifest, atomic-rename download) runs locally via onnxruntime-node. Verdicts are cached in Redis via a PairCache (Redis 7.4+ HEXPIRE with STRING+EX fallback).
Stage 3 — Batched LLM judge: the surviving candidates go to a batched judge (10 pairs per call, Zod-validated structured JSON), which runs through the maintenance task slot from ADR-026 so operators can pin a different model for the judge than for ordinary consolidation.

Telemetry lives under MaintenanceReport.summary.contradictionPipeline (6 counters + per-stage latencies). A new memory.contradiction.detected event fires once per confirmation. Set AGTOS_NLI_ENABLED=false to skip Stage 2 — candidates pass directly from Stage 1 to Stage 3 (the LLM judge), which increases token cost but removes the ONNX Runtime dependency. Prebuild the NLI model bundle with npm run prebuild:nli.

ResourceGuard

Background LLM work on local Ollama can contend with an active voice session for GPU and VRAM: requesting a different model causes Ollama to unload the currently-loaded model, and the user’s next voice response then takes 10-20 seconds while the original model reloads. The ResourceGuard gates all background LLM calls via a deterministic decision tree (ADR-021):

Policy override — policy === 'always' short-circuits to safe
Cloud provider — Claude or OpenRouter short-circuits to safe (no local contention possible)
Remote Ollama — hostname not in the local-host allowlist short-circuits to safe
Active sessions — skip if any voice session is active
Session cooldown — skip if fewer than 30 seconds have elapsed since the last session ended
System load — skip if os.loadavg()[0] > cpuCount * 0.8 (no-op on Windows, where loadavg() returns zeros)
Ollama VRAM probe — skip if GET /api/ps reports any loaded model with a non-expiring expires_at

Consolidation uses retry-with-backoff (5 × 60 s) when deferred — stale consolidation is worse than brief contention. Maintenance uses skip-and-wait — the next cron tick retries naturally. Configure the policy via AGTOS_BACKGROUND_WORK_POLICY:

Policy	Behavior
`auto`	Default. Runs all seven checks.
`always`	Skip all checks. Use on dedicated GPU hosts.
`idle-only`	Stricter load threshold (0.3 × CPU count). POSIX hosts only — on Windows the load check is a no-op, so the VRAM probe becomes the sole strong signal.

The idle-only strict-idle guarantee only holds on POSIX hosts. Operators running idle-only on Windows should rely on the Ollama VRAM probe and the active-session count.
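
The seven-check decision tree and the three policies can be expressed as a pure function over pre-gathered inputs. This is a sketch: the `GuardInput` fields and the reason strings are hypothetical, and the real guard gathers these values itself (for example from os.loadavg() and the Ollama /api/ps endpoint).

```typescript
type Policy = "auto" | "always" | "idle-only";

// Hypothetical snapshot of everything the checks need.
interface GuardInput {
  policy: Policy;
  providerIsCloud: boolean;        // Claude or OpenRouter
  ollamaHostIsLocal: boolean;      // hostname is in the local-host allowlist
  activeVoiceSessions: number;
  msSinceLastSessionEnd: number;
  loadAverage1m: number;           // os.loadavg()[0]; 0 on Windows
  cpuCount: number;
  nonExpiringModelLoaded: boolean; // result of the Ollama VRAM probe
}

interface GuardDecision {
  safe: boolean;
  reason: string;
}

function isBackgroundWorkSafe(i: GuardInput): GuardDecision {
  // Checks 1-3 short-circuit to safe.
  if (i.policy === "always") return { safe: true, reason: "policy-always" };
  if (i.providerIsCloud) return { safe: true, reason: "cloud-provider" };
  if (!i.ollamaHostIsLocal) return { safe: true, reason: "remote-ollama" };
  // Checks 4-7 defer the work.
  if (i.activeVoiceSessions > 0) return { safe: false, reason: "active-session" };
  if (i.msSinceLastSessionEnd < 30000) return { safe: false, reason: "session-cooldown" };
  // idle-only uses the stricter 0.3 x CPU count threshold.
  const loadFactor = i.policy === "idle-only" ? 0.3 : 0.8;
  if (i.loadAverage1m > i.cpuCount * loadFactor) return { safe: false, reason: "system-load" };
  if (i.nonExpiringModelLoaded) return { safe: false, reason: "ollama-vram" };
  return { safe: true, reason: "all-checks-passed" };
}
```

Keeping the decision separate from the probing makes the ordering easy to test: on Windows, for instance, `loadAverage1m` is always 0, so the load check never defers and the outcome depends on the session and VRAM checks.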

ResourceGuard exposes a resource-guard health check (informational, always healthy), the agtos_background_work_safe Prometheus gauge, and the agtos_resource_guard_defer_count_total{reason} counter so dashboards can alert on sustained deferrals.

Dialectic Engine

The Dialectic engine answers questions about the user by synthesizing information from the profile, conclusions, and episodic memories. Use POST /api/memory/ask to query it:

curl -X POST http://localhost:4102/api/memory/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What are this user'\''s preferred communication channels?"}'

Cross-Tool Memory Import

agtOS can import memories from other AI tools via the import pipeline. Supported sources:

Source	What’s Imported
Claude Code	Conversation history and project memories
Cursor	Editor conversation context
Windsurf	AI assistant conversations
Aider	Git-based coding conversations
GitHub Copilot	Interaction history

The CLI provides a convenient interface:

# Scan for available sources and import
npx agtos memory import

Or use the API directly:

# Scan available sources
curl http://localhost:4102/api/memory/sources

# Import from all sources
curl -X POST http://localhost:4102/api/memory/import

Configuration

# Redis URL (required for episodic and semantic memory)
REDIS_URL=redis://localhost:6379

# Ollama (provides embeddings for semantic memory)
OLLAMA_HOST=http://localhost:11434

Working memory configuration

Working memory settings are configured via code (WorkingMemoryConfig), not environment variables. The defaults work well for most deployments:

Setting	Default	Description
`maxHistoryTokens`	`4000`	Max tokens before summarization triggers
`maxMessages`	`30`	Max messages before summarization regardless of token count
`preserveRecentCount`	`6`	Recent messages preserved verbatim during summarization

When conversation length exceeds these limits, older turns are summarized by the LLM into a condensed context, keeping the working memory focused and within token limits.
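
How the three settings interact can be shown with a small sketch. The field names follow the table above, but the interface layout, the four-characters-per-token estimate, and the use of strict "greater than" comparisons are assumptions rather than the actual WorkingMemoryConfig definition.

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Assumed layout; field names mirror the settings table above.
interface WorkingMemoryConfig {
  maxHistoryTokens: number;
  maxMessages: number;
  preserveRecentCount: number;
}

const DEFAULT_CONFIG: WorkingMemoryConfig = {
  maxHistoryTokens: 4000,
  maxMessages: 30,
  preserveRecentCount: 6,
};

// Rough token estimate (about 4 characters per token); an assumption.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Summarization triggers when either limit is exceeded.
function needsSummarization(messages: Message[], config: WorkingMemoryConfig): boolean {
  const totalTokens = messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  return messages.length > config.maxMessages || totalTokens > config.maxHistoryTokens;
}

// Older messages go to the summarizer; the most recent ones stay verbatim.
function splitForSummarization(
  messages: Message[],
  config: WorkingMemoryConfig
): { toSummarize: Message[]; preserved: Message[] } {
  const cut = Math.max(0, messages.length - config.preserveRecentCount);
  return { toSummarize: messages.slice(0, cut), preserved: messages.slice(cut) };
}
```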

Entity-Centric Memory (Knowledge Wiki)

On top of the three-tier memory system, agtOS builds a structured knowledge graph of entities and relationships extracted from conversations (ADR-030).

How It Works

When episodes are created, the system runs NER (Named Entity Recognition) via @huggingface/transformers (bert-base-NER) to automatically extract entities — people, places, organizations, events, and things. Entities are deduplicated via alias matching and embedding similarity (0.85 threshold), then stored in Redis JSON with RediSearch indexing.

Entity Types

Type	Examples
`person`	People mentioned in conversations
`place`	Locations, cities, countries
`organization`	Companies, teams, institutions
`event`	Meetings, holidays, deadlines
`thing`	Projects, tools, concepts

Relationships

Entities are connected via typed relationships (e.g., “Alice works-at Acme Corp”, “Project X belongs-to Team Alpha”). Relationships have confidence scores and source episode references, enabling graph-style queries.

API Endpoints

Method	Endpoint	Description
`GET`	`/api/entities`	List/search entities by name or type
`GET`	`/api/entities/stats`	Entity counts grouped by type
`GET`	`/api/entities/:id`	Get entity details
`PUT`	`/api/entities/:id`	Update entity name, aliases, confidence
`DELETE`	`/api/entities/:id`	Soft-delete entity
`POST`	`/api/entities/:id/merge`	Merge duplicate entities
`GET`	`/api/entities/:id/episodes`	Episodes mentioning this entity
`GET`	`/api/entities/:id/conclusions`	Conclusions referencing this entity
`GET`	`/api/entities/:id/relationships`	Relationships involving this entity

Dashboard

The Knowledge page in the dashboard provides a wiki-style browser for entities. You can search by name, filter by type, view relationship graphs, merge duplicates, and edit entity details. The Entity Detail page shows a single entity with its full context — episodes, conclusions, and relationships.

Entity-centric memory requires Redis. Entities are automatically extracted from conversations — no manual setup is needed.

Privacy and Data Control

The memory system includes privacy controls:

Explicit deletion: Memories can be removed via the forget() protocol method
TTL expiration: Episodic memories expire after a configurable period (default 30 days)
Per-user isolation: Memories are scoped to individual users via device-to-user mapping
User preferences: Privacy settings allow users to opt out of memory persistence entirely
