<h1 align="center">MemoryBench</h1>

A standardized, extensible benchmark for memory and continual learning in LLM systems.

<a href="#-quick-start">Quick Start</a> • <a href="#-datasets">Datasets</a> • <a href="#-baselines">Baselines</a> • <a href="#-experiments">Experiments</a> • <a href="frontend/README.md">Frontend</a> • <a href="#-extending-memorybench">Extending</a> • <a href="#citation">Citation</a>

---

📢 News

2026-06-24 — Off-policy experiment results released on THUIR/MemoryBench-Results, with Python APIs for loading result files and summary tables.
2026-05-26 — 🌟 Accepted to ICML 2026 as a Spotlight paper.
2026-04-15 — Streamlit frontend released. Configure and run experiments without touching any YAML. See frontend/README.md.
2025-12-08 — Extended version released: THUIR/MemoryBench-Full.
2025-12-05 — User-feedback simulator upgraded to Mistral-Small-3.2-24B-Instruct-2506.

---

🔍 Overview

Scaling data, parameters, and test-time compute is hitting diminishing returns for LLM systems (LLMsys). MemoryBench evaluates a complementary axis: can LLM systems learn from accumulated user feedback during service time? Memory and continual-learning frameworks claim to enable this, but most existing benchmarks reduce the problem to long-form reading comprehension — a poor proxy for real feedback-driven adaptation.

MemoryBench tests the harder regime: multi-task, multi-domain, multilingual evaluation with simulated user feedback, across both off-policy (replay pre-recorded dialogs) and on-policy (generate dialogs on the fly) settings.

Highlights

28 datasets across 3 domains (Academic & Knowledge, Legal, Open-Domain) and 4 task shapes (Long-Long, Long-Short, Short-Long, Short-Short).
8 memory-system baselines with a one-call registry interface (vanilla, BM25-M/S, Emb-M/S, A-Mem, Mem0, MemoryOS).
4 experiment regimes: off-policy, stepwise off-policy, on-policy, and training-set performance.
User-feedback simulator based on Mistral-Small-3.2-24B-Instruct-2506.
LLM providers: vLLM, OpenAI-compatible, and Anthropic — wired through one LlmFactory.
Streamlit frontend with conditional UI and explicit dataset-path support.
Plug-and-play extension via a single registry entry. See CONTRIBUTING.md.

> This repository hosts the lightweight benchmark interface and baseline implementations. The full reproduction code for the paper lives at LittleDinoC/MemoryBench-code.

---

🚀 Quick Start

Installation

conda create -n memorybench python=3.10 -y
conda activate memorybench

git clone https://github.com/THUIR/MemoryBench.git
cd MemoryBench

pip install -r requirements.txt
pip install -e baselines/mem0          # editable install required by Mem0

python -c "import nltk; [nltk.download(p) for p in ('punkt','wordnet','stopwords')]"

Quick Use

from memorybench import load_memory_bench, evaluate, summary_results

dataset = load_memory_bench(dataset_type="single", name="JRE-L")
predicts = [
    {"test_idx": int(row["test_idx"]), "response": "...", "dataset": "JRE-L"}
    for row in dataset.dataset["test"]
]
details = evaluate("single", "JRE-L", predicts)
print(summary_results("single", "JRE-L", predicts, details)["summary"])

Run an experiment

# Off-policy with BM25 on the Open-Domain split
python -m src.off-policy \
    --memory_system bm25_message \
    --dataset_type domain \
    --set_name Open-Domain

---

📊 Datasets

The dataset is on Hugging Face: THUIR/MemoryBench.

| Domain | Task Shape | Datasets |
|----------------------|-------------|--------------------------------------------------------------------------|
| Open-Domain | Long-Short | Locomo-0 … Locomo-9, DialSim-friends, DialSim-bigbang, DialSim-theoffice |
| Open-Domain | Long-Long | HelloBench-Creative&Design, WritingBench-Creative&Design |
| Open-Domain | Short-Long | WritingPrompts |
| Open-Domain | Short-Short | NFCats |
| Academic & Knowledge | Long-Short | LimitGen-Syn, IdeaBench |
| Academic & Knowledge | Long-Long | HelloBench-Academic&Knowledge-Writing, WritingBench-Academic&Engineering |
| Academic & Knowledge | Short-Long | HelloBench-Academic&Knowledge-QA |
| Academic & Knowledge | Short-Short | JRE-L |
| Legal | Long-Short | LexEval-Summarization |
| Legal | Long-Long | LexEval-Judge, WritingBench-Politics&Law |
| Legal | Short-Long | JuDGE |
| Legal | Short-Short | LexEval-QA |

The full list (28 datasets) lives in configs/datasets/each.json; domain and task groupings are in domain.json / task.json.

Corpus datasets. LoCoMo and DialSim ship a multi-session conversation corpus that the memory system must ingest before answering. MemoryBench dispatches per-corpus loading by an attribute on the dataset class:

class Locomo_Dataset(BaseDataset):
    corpus_format = "locomo"      # → solver.memory_locomo_conversation
    summary_group_name = "Locomo" # → collapse Locomo-0..9 under one normalization key
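
To make the attribute-based dispatch concrete, here is a minimal sketch of how a corpus_format string could select a loader. Only corpus_format, summary_group_name, has_corpus, and memory_locomo_conversation appear in this README; the Solver class, the ingest_corpus method, the JREL_Dataset class, and the memory_<format>_conversation naming rule are assumptions made for illustration, not the repository's actual code.

```python
class BaseDataset:
    corpus_format = None        # None: the dataset has no corpus to ingest
    summary_group_name = None

    @property
    def has_corpus(self):
        return self.corpus_format is not None


class Locomo_Dataset(BaseDataset):
    corpus_format = "locomo"
    summary_group_name = "Locomo"


class JREL_Dataset(BaseDataset):  # hypothetical dataset without a corpus
    pass


class Solver:
    def memory_locomo_conversation(self, dataset):
        # Placeholder body: a real solver would feed the conversations to its memory.
        return "ingested locomo corpus"

    def ingest_corpus(self, dataset):
        """Look up the loader named after the dataset's corpus_format (assumed rule)."""
        if not dataset.has_corpus:
            return None
        handler = getattr(self, f"memory_{dataset.corpus_format}_conversation", None)
        if handler is None:
            raise ValueError(f"no loader for corpus_format={dataset.corpus_format!r}")
        return handler(dataset)


solver = Solver()
print(solver.ingest_corpus(Locomo_Dataset()))
print(solver.ingest_corpus(JREL_Dataset()))
```

Under this scheme a new corpus-style dataset only sets the class attribute and the solver gains one matching method, so no central if/else chain has to be edited.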

---

🧠 Baselines

All baselines are registered in src/memory_systems.py. The runner CLI (--memory_system <name>), the frontend dropdown, and the run scripts all derive their lists from this single source of truth.

| Paper Name | Code Name | Type | Config File |
|------------|------------------|------------------------|---------------|
| Vanilla | wo_memory | No memory (baseline) | base.json |
| BM25-M | bm25_message | Lexical, message-level | bm25.json |
| BM25-S | bm25_dialog | Lexical, session-level | bm25.json |
| Emb-M | embedder_message | Dense, message-level | embedder.json |
| Emb-S | embedder_dialog | Dense, session-level | embedder.json |
| A-Mem | a_mem | Note-based associative | a_mem.json |
| Mem0 | mem0 | Fact-extraction memory | mem0.json |
| MemoryOS | memoryos | Hierarchical OS-style | memoryos.json |

Upstream sources for a_mem, mem0, and memoryos are vendored under baselines/.

---

🧪 Experiments

MemoryBench evaluates memory systems under four complementary regimes. Each one ships with both a per-experiment Python entry point (src/<experiment>.py) and a sweep driver (run_scripts/<experiment>.py) that iterates every registered memory system.

| Regime | Train→Memory | Test access | When to use | Entry point |
|---------------------|-------------------|----------------------|-----------------------------------------------------|-----------------------------------|
| Off-policy | Bulk replay | Read only | Compare baselines on a fixed training-dialog corpus | python -m src.off-policy |
| Stepwise off-policy | Replay in batches | Read between batches | Track scaling with training data | python -m src.stepwise_off-policy |
| On-policy | Live generation | Read between steps | Realistic continual-learning loop | python -m src.on-policy |
| Training perf. | Bulk replay | Re-eval on train | Detect overfit / catastrophic forgetting | python -m src.train_performance |

Common arguments: --memory_system <name>, --dataset_type single|domain|task, --set_name <name>. See --help on any entry point for the full list.
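
The regimes differ mainly in when memory is written and when the test set is read. The sketch below illustrates that control flow with a toy memory system. ToyMemory, its add_dialog and answer methods, and the simulate_feedback callback are hypothetical stand-ins, not the interfaces used in src/.

```python
class ToyMemory:
    """Hypothetical memory system: stores dialogs and reports how many it has seen."""

    def __init__(self):
        self.dialogs = []

    def add_dialog(self, dialog):
        self.dialogs.append(dialog)

    def answer(self, question):
        return f"{question} (seen {len(self.dialogs)} dialogs)"


def off_policy(memory, train_dialogs, test_questions):
    # Bulk replay: ingest every pre-recorded dialog, then read memory only.
    for dialog in train_dialogs:
        memory.add_dialog(dialog)
    return [memory.answer(q) for q in test_questions]


def stepwise_off_policy(memory, train_dialogs, test_questions, batch_size):
    # Replay in batches and evaluate on the test set after each batch.
    checkpoints = []
    for start in range(0, len(train_dialogs), batch_size):
        for dialog in train_dialogs[start:start + batch_size]:
            memory.add_dialog(dialog)
        checkpoints.append([memory.answer(q) for q in test_questions])
    return checkpoints


def on_policy(memory, train_questions, test_questions, simulate_feedback):
    # Live generation: the system answers, a simulator reacts, the new dialog is stored.
    for question in train_questions:
        response = memory.answer(question)
        feedback = simulate_feedback(question, response)
        memory.add_dialog((question, response, feedback))
    return [memory.answer(q) for q in test_questions]


print(off_policy(ToyMemory(), ["d1", "d2", "d3"], ["q"]))
print(stepwise_off_policy(ToyMemory(), ["d1", "d2", "d3"], ["q"], batch_size=2))
print(on_policy(ToyMemory(), ["a", "b"], ["q"], lambda q, r: "thumbs up"))
```

The training-performance regime would reuse the off-policy replay and then evaluate on the training questions instead of the test questions.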

<details> <summary>Example: off-policy run on the Open-Domain split</summary>

python -m src.off-policy \
    --memory_system bm25_message \
    --dataset_type domain \
    --set_name Open-Domain \
    --retrieve_k 5

Results are written to off-policy/results/domain/Open-Domain/bm25_message/start_at_<timestamp>/.

</details>

<details> <summary>Example: full sweep across all baselines × domains</summary>

python run_scripts/off-policy.py

The sweep iterates over every name in memory_systems.all_names() combined with every set in domain.json and task.json, automatically skipping known-incompatible combinations declared in the registry (e.g. mem0 on Open-Domain).
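
As a rough illustration of that loop, the sketch below enumerates a small grid and drops the skipped pairs. The lists are hand-written subsets and the helper names plan_sweep and to_command are hypothetical; the real driver reads its inputs from the registry and the JSON configs.

```python
from itertools import product

MEMORY_SYSTEMS = ["wo_memory", "bm25_message", "mem0"]   # subset, for illustration
SET_NAMES = ["Open-Domain", "Legal"]                      # subset of the domain sets
SKIP_COMBINATIONS = {("mem0", "Open-Domain")}             # example skip rule from the text


def plan_sweep(memory_systems, set_names, skip):
    """Return every (memory_system, set_name) pair that is not declared incompatible."""
    return [pair for pair in product(memory_systems, set_names) if pair not in skip]


def to_command(memory_system, set_name, dataset_type="domain"):
    """Build the argv list for one off-policy run."""
    return [
        "python", "-m", "src.off-policy",
        "--memory_system", memory_system,
        "--dataset_type", dataset_type,
        "--set_name", set_name,
    ]


for memory_system, set_name in plan_sweep(MEMORY_SYSTEMS, SET_NAMES, SKIP_COMBINATIONS):
    print(" ".join(to_command(memory_system, set_name)))
```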

</details>

<details> <summary>Example: on-policy with live feedback generation</summary>

python -m src.on-policy \
    --memory_system mem0 \
    --dataset_type domain \
    --set_name Legal \
    --step 10 --batch_size 100 --max_rounds 3

</details>

<details> <summary>Default vLLM deployment (only required to reproduce paper results)</summary>

vllm serve Qwen/Qwen3-32B  --port 12345 --chat-template qwen3_nonthinking.jinja   # Main LLM
vllm serve Qwen/Qwen3-8B   --port 12366 --chat-template qwen3_nonthinking.jinja   # Memory-system LLM
vllm serve Qwen/Qwen3-Embedding-0.6B --port 12377 --task embed                    # Embedder
vllm serve AQuarterMile/WritingBench-Critic-Model-Qwen-7B --port 12388            # WritingBench evaluator

With these ports the default configs/memory_systems/*.json files work as-is.

</details>

---

🐍 Python API

The benchmark exposes dataset/evaluation helpers and a small result-loading client in memorybench.py.

Load MemoryBench

The API is load_memory_bench(dataset_type, name, eval_mode=False). It returns a BaseDataset (when dataset_type="single") or a list[BaseDataset] (for "domain" / "task").

from memorybench import load_memory_bench

ds = load_memory_bench("single", "JRE-L")
ds.dataset_name           # "JRE-L"
ds.dataset                # HF DatasetDict with "train" and "test" splits
ds.has_corpus             # bool — True for LoCoMo/DialSim
ds.get_data(test_idx=42)  # → row dict

Evaluate

The API is evaluate(dataset_type, name, predicts) → list[dict].

from memorybench import evaluate

predicts = [{"test_idx": 0, "response": "...", "dataset": "JRE-L"}, ...]
details  = evaluate("single", "JRE-L", predicts)
# [{"dataset": "JRE-L", "test_idx": 0, "metrics": {"Rouge-L": ..., ...}}, ...]

Summary Results

The API is summary_results(dataset_type, name, predicts, evaluate_details). For a single dataset it returns the mean of each metric; for a domain or task it also computes min-max-normalized and z-score-normalized aggregates across the member datasets.

from memorybench import summary_results

summary = summary_results("domain", "Open-Domain", predicts, details)
summary["summary"]["weighted_average"]
summary["minmax_normalized_average"]
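
For readers unfamiliar with the two normalizations, the following sketch shows the arithmetic on made-up numbers. It assumes per-dataset reference statistics (min, max, mu, sigma) like those stored in configs/final_evaluate_summary_wo_details.json and an unweighted mean across datasets; the function names, the z_normalized_average key, and the example values are illustrative, and the library's actual aggregation may differ.

```python
def minmax_normalize(value, lo, hi):
    """Rescale value using the reference range [lo, hi]."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def z_normalize(value, mu, sigma):
    """Express value in standard deviations from the reference mean."""
    return 0.0 if sigma == 0 else (value - mu) / sigma


def normalized_averages(mean_by_dataset, stats):
    """Normalize each dataset's mean metric, then average across datasets."""
    minmax = [
        minmax_normalize(v, stats[name]["min"], stats[name]["max"])
        for name, v in mean_by_dataset.items()
    ]
    z = [
        z_normalize(v, stats[name]["mu"], stats[name]["sigma"])
        for name, v in mean_by_dataset.items()
    ]
    return {
        "minmax_normalized_average": sum(minmax) / len(minmax),
        "z_normalized_average": sum(z) / len(z),
    }


# Made-up numbers for two hypothetical datasets with different metric scales.
means = {"dataset_a": 0.5, "dataset_b": 2.0}
stats = {
    "dataset_a": {"min": 0.0, "max": 1.0, "mu": 0.25, "sigma": 0.25},
    "dataset_b": {"min": 1.0, "max": 5.0, "mu": 3.0, "sigma": 1.0},
}
print(normalized_averages(means, stats))
```

Normalizing before averaging keeps a dataset scored on a 1 to 5 scale from dominating one scored between 0 and 1.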

Load Published Experiment Results

Published off-policy result files can be loaded from the Hugging Face dataset THUIR/MemoryBench-Results.

Key APIs:

MemoryBenchResults.from_hf() loads the published results from Hugging Face.
load_summary(...), load_predict(...), load_evaluate_details(...), and load_run_config(...) load one result file.
list_exps(), list_models(...), list_set_names(...), and list_baselines(...) inspect available results.

The example below loads the result for one off-policy run: Qwen3-8B on the Open-Domain group with the A-Mem baseline.

from memorybench import MemoryBenchResults

results = MemoryBenchResults.from_hf()
summary = results.load_summary(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)
print(summary["summary"])

<details> <summary>More result-loading examples</summary>

from memorybench import MemoryBenchResults

results = MemoryBenchResults.from_hf()

predicts = results.load_predict(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)

evaluate_details = results.load_evaluate_details(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)

run_config = results.load_run_config(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)

Read a local staged result directory that contains manifest.jsonl:

results = MemoryBenchResults.from_local("/path/to/MemoryBench-Results")

Inspect available results:

results.list_exps()
results.list_models(exp="off-policy")
results.list_set_names(exp="off-policy", model="Qwen3-8B", dataset_type="domain")
results.list_baselines(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
)

</details>

Load Result Summaries as a Table

Use load_result_summary_table(...) to collect one metric from the summary field of matching summary.json files and organize the results as a pandas.DataFrame.

from memorybench import load_result_summary_table

table = load_result_summary_table(
    metric="z_score",
    exp="off-policy",
    models=["Qwen3-8B"],
    dataset_type=None,  # None means both domain and task results.
    set_name=None,
)

---

🖥 Frontend

python -m streamlit run frontend/streamlit_app.py
# → http://localhost:8501

The frontend covers off-policy and on-policy runs end to end. It auto-hides irrelevant fields:

LLM provider dropdown is filtered per baseline — mem0 / a_mem / memoryos don't expose the Anthropic option because they route through their own provider abstractions.
LLM base URL default updates when you switch providers (vllm / openai / anthropic).
Embedder section only appears for baselines that consume embeddings (embedder_*, mem0).
Retrieve k is hidden for wo_memory.
Dataset source is an explicit radio: Hugging Face Hub vs Local path, with live path validation.
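
The visibility rules listed above can be summarized as a small pure function. This is a sketch of the logic only, assuming the baseline code names from the Baselines table; visible_fields and its return keys are hypothetical and no Streamlit calls are shown.

```python
ALL_PROVIDERS = ("vllm", "openai", "anthropic")
# Baselines that route through their own provider abstraction (per the list above).
OWN_PROVIDER_BASELINES = {"mem0", "a_mem", "memoryos"}


def visible_fields(baseline):
    """Return which form fields the UI would show for the chosen baseline."""
    providers = [
        p for p in ALL_PROVIDERS
        if not (p == "anthropic" and baseline in OWN_PROVIDER_BASELINES)
    ]
    return {
        "providers": providers,
        "show_embedder": baseline.startswith("embedder_") or baseline == "mem0",
        "show_retrieve_k": baseline != "wo_memory",
    }


for name in ("wo_memory", "bm25_message", "embedder_dialog", "mem0"):
    print(name, visible_fields(name))
```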

See frontend/README.md for the full walkthrough.

---

🧰 Extending MemoryBench

Adding a new baseline or dataset requires only the new component's own files plus one registry entry.

Add a new memory-system baseline:

1. Write src/agent/<name>.py (agent + pydantic config) and src/solver/<name>.py (solver).
2. Drop configs/memory_systems/<name>.json.
3. Add one register(MemorySystemSpec(...)) call in src/memory_systems.py.

Everything else — CLI choices, frontend dropdowns, sweep scripts, dialog-field lookups, skip rules — picks the new entry up automatically.

Add a new dataset: subclass BaseDataset, add one entry to configs/datasets/each.json, and (for corpus-style datasets) set corpus_format = "<name>" on the class.

Full step-by-step walkthrough: CONTRIBUTING.md.

The parametrized test tests/test_refactor.py::TestAllBaselinesContract walks every registered baseline and asserts the off-policy and on-policy method contract, so a new baseline is tested automatically.
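
The idea behind such a contract test can be shown without a test framework. In this sketch the required method names (replay_dialogs, answer) and the two solver classes are invented for illustration; the actual contract is whatever TestAllBaselinesContract asserts.

```python
REQUIRED_METHODS = ("replay_dialogs", "answer")  # hypothetical contract


class GoodSolver:
    def replay_dialogs(self, dialogs):
        return len(dialogs)

    def answer(self, question):
        return ""


class IncompleteSolver:
    def answer(self, question):
        return ""


SOLVERS = {"good": GoodSolver, "incomplete": IncompleteSolver}


def missing_methods(solver_cls, required=REQUIRED_METHODS):
    """List the contract methods that the class does not implement."""
    return [name for name in required if not callable(getattr(solver_cls, name, None))]


def check_all(solvers):
    """Map each non-conforming solver name to the methods it lacks."""
    report = {}
    for name, cls in solvers.items():
        missing = missing_methods(cls)
        if missing:
            report[name] = missing
    return report


print(check_all(SOLVERS))
```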

---

🏗 Repository Layout

MemoryBench/
├── memorybench.py              # Public API: load_memory_bench, evaluate, summary_results, result loaders
├── configs/
│   ├── datasets/               # each.json, domain.json, task.json
│   ├── memory_systems/         # one JSON per baseline
│   └── final_evaluate_summary_wo_details.json  # min/max/mu/sigma stats
├── src/
│   ├── memory_systems.py       # ← central registry of baselines
│   ├── dataset/                # BaseDataset + per-dataset subclasses
│   ├── agent/                  # Agent implementations
│   ├── solver/                 # Per-baseline solvers
│   ├── llms/                   # OpenAI / vLLM / Anthropic clients
│   ├── generate_dialogs/       # Dialog-generation scripts
│   ├── off-policy.py · on-policy.py · stepwise_off-policy.py · train_performance.py
│   └── utils.py
├── run_scripts/                # Sweep drivers (loops over every registered baseline)
├── baselines/                  # Vendored upstream baselines (mem0, A-Mem, MemoryOS)
├── frontend/                   # Streamlit app
├── tests/                      # Unit + integration tests
├── CONTRIBUTING.md             # How to add baselines / datasets
└── README.md

---

📝 Notes & Caveats

bert_score truncation bug. Some datasets (e.g. JRE-L) evaluate with bert_score. When the scoring model is loaded from a local path, inputs are not truncated; load it from the Hugging Face Hub instead to avoid "exceeding max length" errors.
WritingBench evaluator. Long-form writing datasets use a 7B critic; we recommend serving WritingBench-Critic-Model-Qwen-7B via vLLM and pointing WRITINGBENCH_EVAL_BASE_URL at it.
Mem0 cost. mem0 is slow on Open-Domain and Long-Short; the run scripts skip these combinations by default (see skip_combinations in its registry entry).
Secrets. API_config.json, .env* (except .env.example), and frontend/runtime_configs/ are all gitignored — see .gitignore.

---

Citation

If you use MemoryBench in your research, please cite:

@inproceedings{memorybench2026,
  title     = {MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems},
  author    = {Qingyao Ai and Yichen Tang and Changyue Wang and Jianming Long and Weihang Su and Yiqun Liu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  note      = {Spotlight}
}

(Full BibTeX will be updated once the camera-ready DOI is available.)

---

License

Released under the MIT License. Upstream baseline code under baselines/ retains its original license — see each subdirectory's LICENSE file.

---

Acknowledgements

MemoryBench builds on prior datasets and memory systems from many open-source efforts: LoCoMo, DialSim, HelloBench, WritingBench, IdeaBench, LimitGen, JRE-L, JuDGE, LexEval, NFCats, WritingPrompts, A-Mem, Mem0, and MemoryOS. Thank you to all upstream authors.

For questions and feedback, open an issue on GitHub or contact the maintainers.
