Organism
A local memory layer for AI agents. Organism gives any LLM persistent, searchable memory across sessions — without sending your data to external APIs.
Architecture
Organism uses a four-tier memory system. Tiers 0–2 run on every request. Tier 3 is offline and disabled by default.
Tier 0 Working memory Last-N messages from the database (no retrieval step)
Tier 1 RAG chunks Verbatim turns → rag_chunks — FTS5 full-text + HNSW vector
Tier 2 Facts LLM-extracted facts per session → facts + user_profile
Tier 3 Research Consolidation + LoRA sleep [experimental, off by default]
Online pipeline (every request):
User message
├─ Tier 0: recent messages (direct load)
├─ Tier 1: HybridRetriever on rag_chunks (FTS5 + HNSW)
├─ Tier 2: HybridRetriever on facts (FTS5 + HNSW)
└─ ContextAssembler → prompt → LLM → reply
└─ async: FactExtractor daemon writes to facts table
Full architecture details: docs/architecture.md
Quickstart
Prerequisites
- Python 3.10+
- CUDA 12.x with ≥ 2.5 GB VRAM for the default 4-bit local model, ≥ 8 GB for full bfloat16
- Without a GPU: use proxy mode and point Organism at any OpenAI-compatible server
Install
git clone https://github.com/sunnsten/organism-memory
cd organism-memory
pip install -r requirements.txt
cp organism_config.example.yaml organism_config.yaml
Local GPU mode
# organism_config.yaml
base_model:
type: qwen35
model_name: Qwen/Qwen3.5-4B
device_map: cuda
load_in_4bit: true # ~2.5 GB VRAM
max_new_tokens: 256
rag:
embedder_enabled: true
embedder_model: Qwen/Qwen3-Embedding-0.6B
from organism import Organism
from organism.config import OrganismConfig
org = Organism.from_config(OrganismConfig.from_yaml("organism_config.yaml"))
reply = org.chat(user_id="alice", user_message="Hello!")
print(reply.reply)
Proxy mode
Organism wraps an existing OpenAI-compatible endpoint and injects memory into every request — no local GPU needed for inference.
# Start your inference server (vLLM, Ollama, OpenAI, Anthropic, ...)
vllm serve Qwen/Qwen3.5-4B --port 8001
# Start Organism proxy
ORGANISM_CONFIG_PATH=organism_config.yaml uvicorn organism.proxy.server:app --port 9000
Point your client at http://localhost:9000 instead of the original endpoint. Organism intercepts each request, retrieves relevant memory, injects it into the system prompt, and forwards to your backend. See docs/claude-code-proxy.md for the Claude Code / Anthropic variant.
MCP server
python -m organism.mcp_server --config organism_config.yaml
Exposes six tools: organism_chat, memory.store_event, memory.query, memory.remember, memory.reset, memory.metrics. See docs/mcp.md for full reference and configuration examples.
Configuration
All options live in organism_config.yaml. The most common keys:
base_model:
type: qwen35 # qwen35 | vllm | llama_cpp | openai
model_name: Qwen/Qwen3.5-4B
device_map: cuda # cuda | cpu | auto
load_in_4bit: true # 4-bit NF4 quantization via bitsandbytes
max_new_tokens: 256
# Optional: dedicated small model for background fact extraction.
# When absent, fact extraction reuses base_model (shares GPU).
fact_llm:
type: openai
model_name: Qwen/Qwen3-4B
base_url: http://localhost:8001/v1
api_key: not-needed
max_new_tokens: 512
rag:
embedder_enabled: true
embedder_model: Qwen/Qwen3-Embedding-0.6B
embedder_base_url: ~ # leave blank for local; set for vLLM embedder endpoint
consolidation:
enabled: false # set true to activate Tier 3 (experimental)
Model selection via environment variables (useful in Docker):
| Variable | Effect | |---|---| | ORGANISM_MODEL_TYPE | Override base_model.type | | ORGANISM_MODEL_NAME | Override base_model.model_name |
Benchmarks
Evaluated on two long-term memory benchmarks using Qwen3.5-4B (4-bit) as both the chat and fact-extraction model.
LongMemEval
Single-user, multi-session QA. 500 questions across 5 categories.
| Category | Score | |---|---| | Single-session | ~80% | | Single-session preference | 6.7% | | Multi-session | 41.4% | | Temporal reasoning | 36.1% | | Overall | 53.4% |
LoCoMo
Conversational long-term memory. Two-person dialogues across multiple sessions.
| Category | Score | |---|---| | Single-hop | ~42% | | Multi-hop | 5.0% | | Temporal | 9.8% | | Overall | 28.0% |
Both benchmarks run with memory_mode: t2 (Tier 1 + Tier 2, no consolidation). See scripts/bench/ for benchmark scripts and runs/ for run artifacts.
API Reference
Organism.chat
reply = org.chat(
user_id="alice",
user_message="What did I say about my job last week?",
session_id=None, # auto-generated if omitted
system_prompt=None, # prepended before memory context
max_new_tokens=None, # overrides config value
)
# reply.reply → str
Organism.retrieve_context
Retrieval without LLM generation. Used by the proxy layer and for inspection.
facts = org.retrieve_context(
user_id="alice",
query="job location",
limit=8,
)
# → List[str] (Tier 1 chunks + Tier 2 facts, ranked by relevance)
Organism.remember
Explicit fact write, bypasses async extraction pipeline.
count = org.remember(user_id="alice", text="Alice is a software engineer in Berlin.")
# → int (number of new facts stored)
Limitations
- 4B model ceiling: LongMemEval ~54%, LoCoMo ~28%. Breaking these requires query expansion, session-aware retrieval, or a larger model (7B+).
- Temporal reasoning: The 4B model handles date arithmetic poorly. Temporal questions plateau at ~36% on LongMemEval.
- Fact extraction is async: Facts are written after the chat turn completes. A follow-up question in the same turn may not see newly extracted facts yet.
- Single-session preference recall: Currently 6.7% on LongMemEval — preference facts are over-deduplicated.
- No streaming:
Organism.chatreturns the full reply; streaming is not supported in direct mode (proxy mode inherits streaming from the backend). - SQLite only: The store is a single SQLite file (
organism_data/organism.db). Not designed for concurrent multi-process write access.
Research Layer (Tier 3)
Tier 3 includes ConsolidationWorker, LoRA sleep fine-tuning, and SSM neural memory. The infrastructure is present in organism/research/ but not active by default.
Enable consolidation:
consolidation:
enabled: true
summary_temperature: 0.0
summary_max_new_tokens: 1536
When enabled, WriteService writes experience_blocks after each turn, and ConsolidationWorker processes them into memory_items which are retrieved alongside Tier 2 facts. LoRA training is architecturally wired but requires a separate training run to activate.
Database
Single SQLite file, auto-initialized on first run.
| Table | Tier | Description | |---|---|---| | messages | 0 | Raw conversation turns | | sessions | — | Session metadata | | rag_chunks | 1 | Verbatim chunks, FTS5 + HNSW | | facts | 2 | LLM-extracted user facts | | user_profile | 2 | Key→value profile (derived from facts) | | experience_blocks | 3 | Raw experience feed (Tier 3, gated) | | memory_items | 3 | Consolidated memories (Tier 3) |
Tests
# Unit + integration (no GPU required)
python -m pytest tests/unit/ tests/integration/ -x -q
# With a real model (GPU required)
TEST_USE_REAL_MODEL=1 python -m pytest tests/sleep/ -m sleep -x -q
Contributing
Open issues and PRs welcome. Please run the unit + integration suite before submitting:
python -m pytest tests/unit/ tests/integration/ -x -q





