<h1 align="center">MemoryBench</h1>
<p align="center"> <b>A standardized, extensible benchmark for memory and continual learning in LLM systems.</b> </p>
<p align="center"> <a href="https://huggingface.co/datasets/THUIR/MemoryBench"> <img alt="HF Dataset" src="https://img.shields.io/badge/🤗%20Dataset-THUIR%2FMemoryBench-yellow"> </a> <a href="https://huggingface.co/datasets/THUIR/MemoryBench-Full"> <img alt="HF Dataset Full" src="https://img.shields.io/badge/🤗%20Dataset-Full-orange"> </a> <a href="https://arxiv.org/abs/2510.17281"> <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2510.17281-b31b1b"> </a> <a href="#license"> <img alt="License" src="https://img.shields.io/badge/license-MIT-blue"> </a> <a href="#citation"> <img alt="ICML 2026" src="https://img.shields.io/badge/ICML-2026%20Spotlight-red"> </a> <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-blue"> <a href="https://github.com/QingyaoAi/MemoryBench/stargazers"> <img alt="Stars" src="https://img.shields.io/github/stars/THUIR/MemoryBench?style=social"> </a> </p>
<p align="center"> <a href="#-quick-start">Quick Start</a> • <a href="#-datasets">Datasets</a> • <a href="#-baselines">Baselines</a> • <a href="#-experiments">Experiments</a> • <a href="frontend/README.md">Frontend</a> • <a href="#-extending-memorybench">Extending</a> • <a href="#citation">Citation</a> </p>
---
📢 News
- 2026-06-24: Off-policy experiment results released on THUIR/MemoryBench-Results, with Python APIs for loading result files and summary tables.
- 2026-05-26: 🎉 Accepted to ICML 2026 as a Spotlight paper.
- 2026-04-15: Streamlit frontend released. Configure and run experiments without touching any YAML. See frontend/README.md.
- 2025-12-08: Extended version released: THUIR/MemoryBench-Full.
- 2025-12-05: User-feedback simulator upgraded to Mistral-Small-3.2-24B-Instruct-2506.
---
📖 Overview
Scaling data, parameters, and test-time compute is hitting diminishing returns for LLM systems (LLMsys). MemoryBench evaluates a complementary axis: can LLM systems learn from accumulated user feedback during service time? Memory and continual-learning frameworks claim to enable this, but most existing benchmarks reduce the problem to long-form reading comprehension, a poor proxy for real feedback-driven adaptation.
MemoryBench tests the harder regime: multi-task, multi-domain, multilingual evaluation with simulated user feedback, across both off-policy (replay pre-recorded dialogs) and on-policy (generate dialogs on the fly) settings.
Highlights
- 28 datasets across 3 domains (Academic & Knowledge, Legal, Open-Domain) and 4 task shapes (Long-Long, Long-Short, Short-Long, Short-Short).
- 8 memory-system baselines with a one-call registry interface (vanilla, BM25-M/S, Emb-M/S, A-Mem, Mem0, MemoryOS).
- 4 experiment regimes: off-policy, stepwise off-policy, on-policy, and training-set performance.
- User-feedback simulator based on Mistral-Small-3.2-24B-Instruct-2506.
- LLM providers: vLLM, OpenAI-compatible, and Anthropic, all wired through one LlmFactory.
- Streamlit frontend with conditional UI and explicit dataset-path support.
- Plug-and-play extension via a single registry entry. See CONTRIBUTING.md.
> This repository hosts the lightweight benchmark interface and baseline implementations. The full reproduction code for the paper lives at LittleDinoC/MemoryBench-code.
---
📑 Table of Contents
- Quick Start
- Datasets
- Baselines
- Experiments
- Python API
- Frontend
- Extending MemoryBench
- Repository Layout
- Citation
- License
- Acknowledgements
---
🚀 Quick Start
Installation
conda create -n memorybench python=3.10 -y
conda activate memorybench
git clone https://github.com/THUIR/MemoryBench.git
cd MemoryBench
pip install -r requirements.txt
pip install -e baselines/mem0 # editable install required by Mem0
python -c "import nltk; [nltk.download(p) for p in ('punkt','wordnet','stopwords')]"
Quick Use
from memorybench import load_memory_bench, evaluate, summary_results
dataset = load_memory_bench(dataset_type="single", name="JRE-L")
predicts = [
{"test_idx": int(row["test_idx"]), "response": "...", "dataset": "JRE-L"}
for row in dataset.dataset["test"]
]
details = evaluate("single", "JRE-L", predicts)
print(summary_results("single", "JRE-L", predicts, details)["summary"])
Run an experiment
# Off-policy with BM25 on the Open-Domain split
python -m src.off-policy \
--memory_system bm25_message \
--dataset_type domain \
--set_name Open-Domain
---
📊 Datasets
The dataset is on Hugging Face: THUIR/MemoryBench.
| Domain | Task Shape | Datasets |
|---|---|---|
| Open-Domain | Long-Short | Locomo-0 … Locomo-9, DialSim-friends, DialSim-bigbang, DialSim-theoffice |
| Open-Domain | Long-Long | HelloBench-Creative&Design, WritingBench-Creative&Design |
| Open-Domain | Short-Long | WritingPrompts |
| Open-Domain | Short-Short | NFCats |
| Academic & Knowledge | Long-Short | LimitGen-Syn, IdeaBench |
| Academic & Knowledge | Long-Long | HelloBench-Academic&Knowledge-Writing, WritingBench-Academic&Engineering |
| Academic & Knowledge | Short-Long | HelloBench-Academic&Knowledge-QA |
| Academic & Knowledge | Short-Short | JRE-L |
| Legal | Long-Short | LexEval-Summarization |
| Legal | Long-Long | LexEval-Judge, WritingBench-Politics&Law |
| Legal | Short-Long | JuDGE |
| Legal | Short-Short | LexEval-QA |
The full list (28 datasets) lives in configs/datasets/each.json; domain and task groupings are in domain.json / task.json.
Corpus datasets. LoCoMo and DialSim ship a multi-session conversation corpus that the memory system must ingest before answering. MemoryBench dispatches per-corpus loading by an attribute on the dataset class:
class Locomo_Dataset(BaseDataset):
    corpus_format = "locomo"        # → solver.memory_locomo_conversation
    summary_group_name = "Locomo"   # → collapse Locomo-0..9 under one normalization key
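As a rough illustration of attribute-based dispatch, the sketch below routes a dataset to an ingestion handler keyed by its `corpus_format`. The handler function, the `CORPUS_HANDLERS` table, the `ingest_corpus` helper, and the `JREL_Dataset` class are hypothetical stand-ins for this example, not the repository's solver code.

```python
from typing import Callable, Dict, List, Optional


class BaseDataset:
    # Datasets without a conversation corpus leave this as None.
    corpus_format: Optional[str] = None


class Locomo_Dataset(BaseDataset):
    corpus_format = "locomo"


class JREL_Dataset(BaseDataset):  # hypothetical non-corpus dataset
    pass


def ingest_locomo(sessions: List[str]) -> List[str]:
    # Hypothetical handler: tag each session before it is stored in memory.
    return [f"locomo:{s}" for s in sessions]


# Hypothetical dispatch table: corpus_format value -> ingestion handler.
CORPUS_HANDLERS: Dict[str, Callable[[List[str]], List[str]]] = {
    "locomo": ingest_locomo,
}


def ingest_corpus(dataset: BaseDataset, sessions: List[str]) -> List[str]:
    fmt = dataset.corpus_format
    if fmt is None:
        return []  # nothing to ingest for non-corpus datasets
    if fmt not in CORPUS_HANDLERS:
        raise ValueError(f"no handler registered for corpus_format={fmt!r}")
    return CORPUS_HANDLERS[fmt](sessions)


print(ingest_corpus(Locomo_Dataset(), ["session-1", "session-2"]))
print(ingest_corpus(JREL_Dataset(), ["session-1"]))
```

Keeping the format name on the class means a new corpus-style dataset only needs a new attribute value and one handler, rather than a change to the calling code.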
---
🧠 Baselines
All baselines are registered in src/memory_systems.py. The runner CLI (--memory_system <name>), the frontend dropdown, and the run scripts all derive their lists from this single source of truth.
| Paper Name | Code Name | Type | Config File |
|---|---|---|---|
| Vanilla | wo_memory | No memory (baseline) | base.json |
| BM25-M | bm25_message | Lexical, message-level | bm25.json |
| BM25-S | bm25_dialog | Lexical, session-level | bm25.json |
| Emb-M | embedder_message | Dense, message-level | embedder.json |
| Emb-S | embedder_dialog | Dense, session-level | embedder.json |
| A-Mem | a_mem | Note-based associative | a_mem.json |
| Mem0 | mem0 | Fact-extraction memory | mem0.json |
| MemoryOS | memoryos | Hierarchical OS-style | memoryos.json |
Upstream sources for a_mem, mem0, and memoryos are vendored under baselines/.
---
🧪 Experiments
MemoryBench evaluates memory systems under four complementary regimes. Each one ships with both a per-experiment Python entry point (src/<experiment>.py) and a sweep driver (run_scripts/<experiment>.py) that iterates every registered memory system.
| Regime | Train→Memory | Test access | When to use | Entry point |
|---|---|---|---|---|
| Off-policy | Bulk replay | Read only | Compare baselines on a fixed training-dialog corpus | python -m src.off-policy |
| Stepwise off-policy | Replay in batches | Read between batches | Track scaling with training data | python -m src.stepwise_off-policy |
| On-policy | Live generation | Read between steps | Realistic continual-learning loop | python -m src.on-policy |
| Training perf. | Bulk replay | Re-eval on train | Detect overfit / catastrophic forgetting | python -m src.train_performance |
Common arguments: --memory_system <name>, --dataset_type single|domain|task, --set_name <name>. See --help on any entry point for the full list.
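To make the difference between the first two regimes concrete, here is a toy sketch. `ToyMemory`, `hit_rate`, `run_off_policy`, and `run_stepwise` are hypothetical names invented for this example; the real runners use an LLM and the benchmark metrics rather than substring matching. On-policy is omitted because it needs live dialog generation.

```python
from typing import List


class ToyMemory:
    """Hypothetical stand-in for a memory system: stores dialogs, substring lookup."""

    def __init__(self) -> None:
        self.dialogs: List[str] = []

    def add(self, dialog: str) -> None:
        self.dialogs.append(dialog)

    def retrieve(self, query: str) -> List[str]:
        return [d for d in self.dialogs if query in d]


def hit_rate(memory: ToyMemory, test_queries: List[str]) -> float:
    # Fraction of test queries for which memory returns at least one dialog.
    hits = sum(1 for q in test_queries if memory.retrieve(q))
    return hits / len(test_queries)


def run_off_policy(train: List[str], test: List[str]) -> float:
    memory = ToyMemory()
    for dialog in train:  # bulk replay of pre-recorded dialogs
        memory.add(dialog)
    return hit_rate(memory, test)  # memory is only read at test time


def run_stepwise(train: List[str], test: List[str], batch_size: int) -> List[float]:
    memory = ToyMemory()
    scores: List[float] = []
    for start in range(0, len(train), batch_size):
        for dialog in train[start:start + batch_size]:
            memory.add(dialog)
        scores.append(hit_rate(memory, test))  # evaluate between batches
    return scores


train = ["tax law basics", "contract law", "poetry tips", "tax filing"]
test = ["tax", "poetry", "contract"]
print(run_off_policy(train, test))
print(run_stepwise(train, test, batch_size=2))
```

The stepwise variant returns one score per batch, which is what lets it show how performance scales with the amount of replayed training data.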
<details> <summary><b>Example: off-policy run on the Open-Domain split</b></summary>
python -m src.off-policy \
--memory_system bm25_message \
--dataset_type domain \
--set_name Open-Domain \
--retrieve_k 5
Results are written to off-policy/results/domain/Open-Domain/bm25_message/start_at_<timestamp>/.
</details>
<details> <summary><b>Example: full sweep across all baselines × domains</b></summary>
python run_scripts/off-policy.py
The sweep iterates over memory_systems.all_names() × the sets listed in domain.json and task.json, automatically skipping known-incompatible combinations declared in the registry (e.g. mem0 on Open-Domain).
</details>
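The sweep logic can be pictured as a Cartesian product with a skip filter. This is a minimal sketch under stated assumptions: the `BASELINES`, `SET_NAMES`, and `SKIP` values are illustrative, and representing skip rules as `(baseline, set_name)` pairs is an assumption about their shape. Only the CLI flags come from the commands shown in this README.

```python
from itertools import product
from typing import Iterable, Iterator, List, Set, Tuple

# Illustrative values; the real lists come from the registry and the
# configs/datasets/domain.json and task.json files.
BASELINES: List[str] = ["wo_memory", "bm25_message", "mem0"]
SET_NAMES: List[str] = ["Open-Domain", "Legal"]
# Assumed shape for skip rules: (baseline, set_name) pairs.
SKIP: Set[Tuple[str, str]] = {("mem0", "Open-Domain")}


def sweep(
    baselines: Iterable[str],
    set_names: Iterable[str],
    skip: Set[Tuple[str, str]],
) -> Iterator[Tuple[str, str]]:
    # Yield every (baseline, set_name) pair that is not declared incompatible.
    for baseline, set_name in product(baselines, set_names):
        if (baseline, set_name) in skip:
            continue
        yield baseline, set_name


def to_command(baseline: str, set_name: str) -> List[str]:
    # Build the argv for one off-policy run on a domain set.
    return [
        "python", "-m", "src.off-policy",
        "--memory_system", baseline,
        "--dataset_type", "domain",
        "--set_name", set_name,
    ]


for baseline, set_name in sweep(BASELINES, SET_NAMES, SKIP):
    print(" ".join(to_command(baseline, set_name)))
```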
<details> <summary><b>Example: on-policy with live feedback generation</b></summary>
python -m src.on-policy \
--memory_system mem0 \
--dataset_type domain \
--set_name Legal \
--step 10 --batch_size 100 --max_rounds 3
</details>
<details> <summary><b>Default vLLM deployment (only required to reproduce paper results)</b></summary>
vllm serve Qwen/Qwen3-32B --port 12345 --chat-template qwen3_nonthinking.jinja # Main LLM
vllm serve Qwen/Qwen3-8B --port 12366 --chat-template qwen3_nonthinking.jinja # Memory-system LLM
vllm serve Qwen/Qwen3-Embedding-0.6B --port 12377 --task embed # Embedder
vllm serve AQuarterMile/WritingBench-Critic-Model-Qwen-7B --port 12388 # WritingBench evaluator
With these ports the default configs/memory_systems/*.json files work as-is.
</details>
---
🐍 Python API
The benchmark exposes dataset/evaluation helpers and a small result-loading client in memorybench.py.
Load MemoryBench
The API is load_memory_bench(dataset_type, name, eval_mode=False). It returns a BaseDataset (when dataset_type="single") or a list[BaseDataset] (for "domain" / "task").
from memorybench import load_memory_bench
ds = load_memory_bench("single", "JRE-L")
ds.dataset_name # "JRE-L"
ds.dataset # HF DatasetDict with "train" and "test" splits
ds.has_corpus # bool: True for LoCoMo/DialSim
ds.get_data(test_idx=42) # → row dict
Evaluate
The API is evaluate(dataset_type, name, predicts) -> list[dict].
from memorybench import evaluate
predicts = [{"test_idx": 0, "response": "...", "dataset": "JRE-L"}, ...]
details = evaluate("single", "JRE-L", predicts)
# [{"dataset": "JRE-L", "test_idx": 0, "metrics": {"Rouge-L": ..., ...}}, ...]
Summary Results
The API is summary_results(dataset_type, name, predicts, evaluate_details). For a single dataset it returns the mean of each metric; for a domain or task it also computes min-max-normalized and z-score-normalized aggregates across the member datasets.
from memorybench import summary_results
summary = summary_results("domain", "Open-Domain", predicts, details)
summary["summary"]["weighted_average"]
summary["minmax_normalized_average"]
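The two normalizations can be sketched as follows. This is an illustration, not the repository's aggregation code: it assumes per-dataset reference statistics (min, max, mu, sigma) like those described for configs/final_evaluate_summary_wo_details.json, uses made-up numbers, takes an unweighted mean across datasets, and the `z_score_average` key is a name chosen for this example.

```python
from statistics import mean
from typing import Dict

# Made-up per-dataset reference statistics (min, max, mu, sigma).
STATS: Dict[str, Dict[str, float]] = {
    "NFCats": {"min": 0.0, "max": 1.0, "mu": 0.5, "sigma": 0.25},
    "WritingPrompts": {"min": 0.0, "max": 10.0, "mu": 6.0, "sigma": 2.0},
}


def minmax(score: float, s: Dict[str, float]) -> float:
    # Rescale a raw score to [0, 1] using the dataset's reference range.
    return (score - s["min"]) / (s["max"] - s["min"])


def zscore(score: float, s: Dict[str, float]) -> float:
    # Express a raw score in standard deviations from the reference mean.
    return (score - s["mu"]) / s["sigma"]


def normalized_averages(scores: Dict[str, float]) -> Dict[str, float]:
    # Unweighted mean across datasets (an assumption for this sketch).
    return {
        "minmax_normalized_average": mean(minmax(v, STATS[k]) for k, v in scores.items()),
        "z_score_average": mean(zscore(v, STATS[k]) for k, v in scores.items()),
    }


result = normalized_averages({"NFCats": 0.75, "WritingPrompts": 8.0})
print(result)
```

Normalizing before averaging keeps a dataset scored on a 0 to 10 scale from dominating one scored on a 0 to 1 scale.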
Load Published Experiment Results
Published off-policy result files can be loaded from the Hugging Face dataset THUIR/MemoryBench-Results.
Key APIs:
- MemoryBenchResults.from_hf() loads the published results from Hugging Face.
- load_summary(...), load_predict(...), load_evaluate_details(...), and load_run_config(...) each load one result file.
- list_exps(), list_models(...), list_set_names(...), and list_baselines(...) inspect available results.
The example below loads the result for one off-policy run: Qwen3-8B on the Open-Domain group with the A-Mem baseline.
from memorybench import MemoryBenchResults
results = MemoryBenchResults.from_hf()
summary = results.load_summary(
exp="off-policy",
model="Qwen3-8B",
dataset_type="domain",
set_name="Open-Domain",
baseline="a_mem",
)
print(summary["summary"])
<details> <summary><b>More result-loading examples</b></summary>
from memorybench import MemoryBenchResults
results = MemoryBenchResults.from_hf()
predicts = results.load_predict(
exp="off-policy",
model="Qwen3-8B",
dataset_type="domain",
set_name="Open-Domain",
baseline="a_mem",
)
evaluate_details = results.load_evaluate_details(
exp="off-policy",
model="Qwen3-8B",
dataset_type="domain",
set_name="Open-Domain",
baseline="a_mem",
)
run_config = results.load_run_config(
exp="off-policy",
model="Qwen3-8B",
dataset_type="domain",
set_name="Open-Domain",
baseline="a_mem",
)
Read a local staged result directory that contains manifest.jsonl:
results = MemoryBenchResults.from_local("/path/to/MemoryBench-Results")
Inspect available results:
results.list_exps()
results.list_models(exp="off-policy")
results.list_set_names(exp="off-policy", model="Qwen3-8B", dataset_type="domain")
results.list_baselines(
exp="off-policy",
model="Qwen3-8B",
dataset_type="domain",
set_name="Open-Domain",
)
</details>
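The `list_*` helpers can be thought of as filtered, de-duplicated views over a manifest. The sketch below shows that idea only; it assumes each line of manifest.jsonl is a JSON object carrying `exp`, `model`, `dataset_type`, `set_name`, and `baseline` fields, which is a guess at the schema. `parse_manifest` and `list_values` are hypothetical helpers, not part of the `MemoryBenchResults` API.

```python
import json
from typing import Dict, List

# Hypothetical manifest content: one JSON object per line.
MANIFEST_TEXT = """\
{"exp": "off-policy", "model": "Qwen3-8B", "dataset_type": "domain", "set_name": "Open-Domain", "baseline": "a_mem"}
{"exp": "off-policy", "model": "Qwen3-8B", "dataset_type": "domain", "set_name": "Legal", "baseline": "bm25_message"}
{"exp": "off-policy", "model": "Qwen3-8B", "dataset_type": "task", "set_name": "Long-Short", "baseline": "a_mem"}
"""


def parse_manifest(text: str) -> List[Dict[str, str]]:
    # Skip blank lines; every other line must be a JSON object.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def list_values(rows: List[Dict[str, str]], field: str, **filters: str) -> List[str]:
    # Distinct values of `field` among rows matching every filter, sorted.
    matching = (r for r in rows if all(r.get(k) == v for k, v in filters.items()))
    return sorted({r[field] for r in matching})


rows = parse_manifest(MANIFEST_TEXT)
print(list_values(rows, "set_name", exp="off-policy", model="Qwen3-8B", dataset_type="domain"))
print(list_values(rows, "baseline", exp="off-policy"))
```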
Load Result Summaries as a Table
Use load_result_summary_table(...) to collect one metric from the summary field of matching summary.json files and organize the results as a pandas.DataFrame.
from memorybench import load_result_summary_table
table = load_result_summary_table(
metric="z_score",
exp="off-policy",
models=["Qwen3-8B"],
dataset_type=None, # None means both domain and task results.
set_name=None,
)
---
🖥 Frontend
python -m streamlit run frontend/streamlit_app.py
# → http://localhost:8501
The frontend covers off-policy and on-policy runs end to end. It auto-hides irrelevant fields:
- LLM provider dropdown is filtered per baseline: mem0 / a_mem / memoryos don't expose the Anthropic option because they route through their own provider abstractions.
- LLM base URL default updates when you switch providers (vllm / openai / anthropic).
- Embedder section only appears for baselines that consume embeddings (embedder_*, mem0).
- Retrieve k is hidden for wo_memory.
- Dataset source is an explicit radio: Hugging Face Hub vs Local path, with live path validation.
See frontend/README.md for the full walkthrough.
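The field-hiding rules listed above can be expressed as one pure function, which keeps them testable without launching Streamlit. This is a sketch of the rules as described in this README, not the frontend's actual code; `visible_fields` and the two constant sets are hypothetical names.

```python
from typing import Dict, List, Union

# Baselines that consume embeddings (embedder_* and mem0, per the list above).
EMBEDDING_BASELINES = {"embedder_message", "embedder_dialog", "mem0"}
# Baselines that route through their own provider abstractions.
OWN_PROVIDER_BASELINES = {"mem0", "a_mem", "memoryos"}


def visible_fields(baseline: str) -> Dict[str, Union[List[str], bool]]:
    providers: List[str] = ["vllm", "openai", "anthropic"]
    if baseline in OWN_PROVIDER_BASELINES:
        providers.remove("anthropic")  # Anthropic option is not exposed
    return {
        "providers": providers,
        "show_embedder": baseline in EMBEDDING_BASELINES,
        "show_retrieve_k": baseline != "wo_memory",
    }


for name in ("wo_memory", "bm25_message", "mem0"):
    print(name, visible_fields(name))
```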
---
🧰 Extending MemoryBench
Adding a new baseline or dataset is a single-file change plus one registry entry.
Add a new memory-system baseline:
1. Write src/agent/<name>.py (agent + pydantic config) and src/solver/<name>.py (solver).
2. Add configs/memory_systems/<name>.json.
3. Add one register(MemorySystemSpec(...)) call in src/memory_systems.py.
Everything else (CLI choices, frontend dropdowns, sweep scripts, dialog-field lookups, skip rules) picks the new entry up automatically.
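A minimal sketch of the registry pattern, to show why one `register(...)` call is enough for downstream consumers. The names `register`, `MemorySystemSpec`, `all_names`, and `skip_combinations` appear in this README, but the fields and signatures below are assumptions for illustration, and `my_memory` is a hypothetical new baseline.

```python
import argparse
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MemorySystemSpec:
    # Field names are illustrative assumptions, not the repository's schema.
    name: str
    config_file: str
    skip_combinations: Tuple[str, ...] = ()


_REGISTRY: Dict[str, MemorySystemSpec] = {}


def register(spec: MemorySystemSpec) -> MemorySystemSpec:
    if spec.name in _REGISTRY:
        raise ValueError(f"duplicate memory system: {spec.name}")
    _REGISTRY[spec.name] = spec
    return spec


def all_names() -> List[str]:
    return list(_REGISTRY)  # dicts preserve registration order


register(MemorySystemSpec("wo_memory", "base.json"))
register(MemorySystemSpec("bm25_message", "bm25.json"))
register(MemorySystemSpec("my_memory", "my_memory.json"))  # hypothetical new baseline

# A consumer such as the CLI derives its choices from the registry,
# so the new entry is accepted without editing the parser.
parser = argparse.ArgumentParser()
parser.add_argument("--memory_system", choices=all_names())
args = parser.parse_args(["--memory_system", "my_memory"])
print(args.memory_system)
```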
Add a new dataset: subclass BaseDataset, add one entry to configs/datasets/each.json, and (for corpus-style datasets) set corpus_format = "<name>" on the class.
Full step-by-step walkthrough: CONTRIBUTING.md.
The parametrized test tests/test_refactor.py::TestAllBaselinesContract walks every registered baseline and asserts the off-policy + on-policy method contract, so your new baseline is tested automatically.
---
📁 Repository Layout
MemoryBench/
├── memorybench.py                 # Public API: load_memory_bench, evaluate, summary_results
├── configs/
│   ├── datasets/                  # each.json, domain.json, task.json
│   ├── memory_systems/            # one JSON per baseline
│   └── final_evaluate_summary_wo_details.json   # min/max/mu/sigma stats
├── src/
│   ├── memory_systems.py          # central registry of baselines
│   ├── dataset/                   # BaseDataset + per-dataset subclasses
│   ├── agent/                     # Agent implementations
│   ├── solver/                    # Per-baseline solvers
│   ├── llms/                      # OpenAI / vLLM / Anthropic clients
│   ├── generate_dialogs/          # Dialog-generation scripts
│   ├── off-policy.py · on-policy.py · stepwise_off-policy.py · train_performance.py
│   └── utils.py
├── run_scripts/                   # Sweep drivers (loops over every registered baseline)
├── baselines/                     # Vendored upstream baselines (mem0, A-Mem, MemoryOS)
├── frontend/                      # Streamlit app
├── tests/                         # Unit + integration tests
├── CONTRIBUTING.md                # How to add baselines / datasets
└── README.md
---
📝 Notes & Caveats
- bert_score truncation bug. Some datasets (e.g. JRE-L) evaluate with bert_score. Locally-loaded models don't truncate inputs; load from Hugging Face Hub to avoid "exceeding max length" errors.
- WritingBench evaluator. Long-form writing datasets use a 7B critic; we recommend serving WritingBench-Critic-Model-Qwen-7B via vLLM and pointing WRITINGBENCH_EVAL_BASE_URL at it.
- Mem0 cost. mem0 is slow on Open-Domain and Long-Short; the run scripts skip these combinations by default via skip_combinations in the registry entry.
- Secrets. API_config.json, .env* (except .env.example), and frontend/runtime_configs/ are all gitignored; see .gitignore.
---
Citation
If you use MemoryBench in your research, please cite:
@inproceedings{memorybench2026,
title = {MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems},
author = {Qingyao Ai and Yichen Tang and Changyue Wang and Jianming Long and Weihang Su and Yiqun Liu},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026},
note = {Spotlight}
}
(Full BibTeX will be updated once the camera-ready DOI is available.)
---
License
Released under the MIT License. Upstream baseline code under baselines/ retains its original license; see each subdirectory's LICENSE file.
---
Acknowledgements
MemoryBench builds on prior datasets and memory systems from many open-source efforts: LoCoMo, DialSim, HelloBench, WritingBench, IdeaBench, LimitGen, JRE-L, JuDGE, LexEval, NFCats, WritingPrompts, A-Mem, Mem0, and MemoryOS. Thank you to all upstream authors.
For questions and feedback, open an issue on GitHub or contact the maintainers.





