<h1 align="center">MemoryBench</h1>

<p align="center"> <b>A standardized, extensible benchmark for memory and continual learning in LLM systems.</b> </p>

<p align="center"> <a href="https://huggingface.co/datasets/THUIR/MemoryBench"> <img alt="HF Dataset" src="https://img.shields.io/badge/🤗%20Dataset-THUIR%2FMemoryBench-yellow"> </a> <a href="https://huggingface.co/datasets/THUIR/MemoryBench-Full"> <img alt="HF Dataset Full" src="https://img.shields.io/badge/🤗%20Dataset-Full-orange"> </a> <a href="https://arxiv.org/abs/2510.17281"> <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2510.17281-b31b1b"> </a> <a href="#license"> <img alt="License" src="https://img.shields.io/badge/license-MIT-blue"> </a> <a href="#citation"> <img alt="ICML 2026" src="https://img.shields.io/badge/ICML-2026%20Spotlight-red"> </a> <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-blue"> <a href="https://github.com/QingyaoAi/MemoryBench/stargazers"> <img alt="Stars" src="https://img.shields.io/github/stars/THUIR/MemoryBench?style=social"> </a> </p>

<p align="center"> <a href="#-quick-start">Quick Start</a> • <a href="#-datasets">Datasets</a> • <a href="#-baselines">Baselines</a> • <a href="#-experiments">Experiments</a> • <a href="frontend/README.md">Frontend</a> • <a href="#-extending-memorybench">Extending</a> • <a href="#citation">Citation</a> </p>

---

📢 News

  • 2026-06-24: Off-policy experiment results released on THUIR/MemoryBench-Results, with Python APIs for loading result files and summary tables.
  • 2026-05-26: 🌟 Accepted to ICML 2026 as a Spotlight paper.
  • 2026-04-15: Streamlit frontend released. Configure and run experiments without touching any YAML. See frontend/README.md.
  • 2025-12-08: Extended version released: THUIR/MemoryBench-Full.
  • 2025-12-05: User-feedback simulator upgraded to Mistral-Small-3.2-24B-Instruct-2506.

---

🔍 Overview

Scaling data, parameters, and test-time compute is hitting diminishing returns for LLM systems (LLMsys). MemoryBench evaluates a complementary axis: can LLM systems learn from accumulated user feedback during service time? Memory and continual-learning frameworks claim to enable this, but most existing benchmarks reduce the problem to long-form reading comprehension, a poor proxy for real feedback-driven adaptation.

MemoryBench tests the harder regime: multi-task, multi-domain, multilingual evaluation with simulated user feedback, across both off-policy (replay pre-recorded dialogs) and on-policy (generate dialogs on the fly) settings.

Highlights

  • 28 datasets across 3 domains (Academic & Knowledge, Legal, Open-Domain) and 4 task shapes (Long-Long, Long-Short, Short-Long, Short-Short).
  • 8 memory-system baselines with a one-call registry interface (vanilla, BM25-M/S, Emb-M/S, A-Mem, Mem0, MemoryOS).
  • 4 experiment regimes: off-policy, stepwise off-policy, on-policy, and training-set performance.
  • User-feedback simulator based on Mistral-Small-3.2-24B-Instruct-2506.
  • LLM providers: vLLM, OpenAI-compatible, and Anthropic, all wired through one LlmFactory.
  • Streamlit frontend with conditional UI and explicit dataset-path support.
  • Plug-and-play extension via a single registry entry. See CONTRIBUTING.md.

> This repository hosts the lightweight benchmark interface and baseline implementations. The full reproduction code for the paper lives at LittleDinoC/MemoryBench-code.

---

🚀 Quick Start

Installation

conda create -n memorybench python=3.10 -y
conda activate memorybench

git clone https://github.com/THUIR/MemoryBench.git
cd MemoryBench

pip install -r requirements.txt
pip install -e baselines/mem0          # editable install required by Mem0

python -c "import nltk; [nltk.download(p) for p in ('punkt','wordnet','stopwords')]"

Quick Use

from memorybench import load_memory_bench, evaluate, summary_results

dataset = load_memory_bench(dataset_type="single", name="JRE-L")
predicts = [
    {"test_idx": int(row["test_idx"]), "response": "...", "dataset": "JRE-L"}
    for row in dataset.dataset["test"]
]
details = evaluate("single", "JRE-L", predicts)
print(summary_results("single", "JRE-L", predicts, details)["summary"])

Run an experiment

# Off-policy with BM25 on the Open-Domain split
python -m src.off-policy \
    --memory_system bm25_message \
    --dataset_type domain \
    --set_name Open-Domain

---

📊 Datasets

The dataset is on Hugging Face: THUIR/MemoryBench.

| Domain | Task Shape | Datasets |
|----------------------|-------------|--------------------------------------------------------------------------|
| Open-Domain | Long-Short | Locomo-0 … Locomo-9, DialSim-friends, DialSim-bigbang, DialSim-theoffice |
| Open-Domain | Long-Long | HelloBench-Creative&Design, WritingBench-Creative&Design |
| Open-Domain | Short-Long | WritingPrompts |
| Open-Domain | Short-Short | NFCats |
| Academic & Knowledge | Long-Short | LimitGen-Syn, IdeaBench |
| Academic & Knowledge | Long-Long | HelloBench-Academic&Knowledge-Writing, WritingBench-Academic&Engineering |
| Academic & Knowledge | Short-Long | HelloBench-Academic&Knowledge-QA |
| Academic & Knowledge | Short-Short | JRE-L |
| Legal | Long-Short | LexEval-Summarization |
| Legal | Long-Long | LexEval-Judge, WritingBench-Politics&Law |
| Legal | Short-Long | JuDGE |
| Legal | Short-Short | LexEval-QA |

The full list (28 datasets) lives in configs/datasets/each.json; domain and task groupings are in domain.json / task.json.

Corpus datasets. LoCoMo and DialSim ship a multi-session conversation corpus that the memory system must ingest before answering. MemoryBench dispatches per-corpus loading by an attribute on the dataset class:

class Locomo_Dataset(BaseDataset):
    corpus_format = "locomo"      # → solver.memory_locomo_conversation
    summary_group_name = "Locomo" # → collapse Locomo-0..9 under one normalization key
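
The attribute-based dispatch described above might look roughly like the following sketch. Only `corpus_format`, `summary_group_name`, and the method name `memory_locomo_conversation` come from this README; the stub classes, the `ingest_corpus` helper, and the `memory_<format>_conversation` naming rule are illustrative assumptions, not the repository's actual code.

```python
class BaseDataset:
    """Stand-in for the real base class (assumption for this sketch)."""
    corpus_format = None        # None means the dataset ships no corpus
    summary_group_name = None


class Locomo_Dataset(BaseDataset):
    corpus_format = "locomo"
    summary_group_name = "Locomo"


class ToySolver:
    """Hypothetical solver exposing one ingestion method per corpus format."""

    def __init__(self):
        self.ingested = []

    def memory_locomo_conversation(self, corpus):
        self.ingested.extend(corpus)


def ingest_corpus(solver, dataset, corpus):
    """Route a corpus to the solver method named after dataset.corpus_format."""
    fmt = dataset.corpus_format
    if fmt is None:
        return False  # nothing to ingest for non-corpus datasets
    method = getattr(solver, f"memory_{fmt}_conversation", None)
    if method is None:
        raise NotImplementedError(f"solver has no handler for corpus format {fmt!r}")
    method(corpus)
    return True


solver = ToySolver()
handled = ingest_corpus(solver, Locomo_Dataset(), ["session 1", "session 2"])
```

Keying the lookup on a class attribute means a new corpus-style dataset only needs to declare its format; the calling code does not grow another `if` branch.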

---

🧠 Baselines

All baselines are registered in src/memory_systems.py. The runner CLI (--memory_system <name>), the frontend dropdown, and the run scripts all derive their lists from this single source of truth.

| Paper Name | Code Name | Type | Config File |
|------------|------------------|------------------------|---------------|
| Vanilla | wo_memory | No memory (baseline) | base.json |
| BM25-M | bm25_message | Lexical, message-level | bm25.json |
| BM25-S | bm25_dialog | Lexical, session-level | bm25.json |
| Emb-M | embedder_message | Dense, message-level | embedder.json |
| Emb-S | embedder_dialog | Dense, session-level | embedder.json |
| A-Mem | a_mem | Note-based associative | a_mem.json |
| Mem0 | mem0 | Fact-extraction memory | mem0.json |
| MemoryOS | memoryos | Hierarchical OS-style | memoryos.json |

Upstream sources for a_mem, mem0, and memoryos are vendored under baselines/.

---

🧪 Experiments

MemoryBench evaluates memory systems under four complementary regimes. Each one ships with both a per-experiment Python entry point (src/<experiment>.py) and a sweep driver (run_scripts/<experiment>.py) that iterates every registered memory system.

| Regime | Train→Memory | Test access | When to use | Entry point |
|---------------------|-------------------|----------------------|-----------------------------------------------------|-----------------------------------|
| Off-policy | Bulk replay | Read only | Compare baselines on a fixed training-dialog corpus | python -m src.off-policy |
| Stepwise off-policy | Replay in batches | Read between batches | Track scaling with training data | python -m src.stepwise_off-policy |
| On-policy | Live generation | Read between steps | Realistic continual-learning loop | python -m src.on-policy |
| Training perf. | Bulk replay | Re-eval on train | Detect overfit / catastrophic forgetting | python -m src.train_performance |

Common arguments: --memory_system <name>, --dataset_type single|domain|task, --set_name <name>. See --help on any entry point for the full list.
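
The difference between the off-policy and stepwise off-policy regimes listed above can be pictured with a toy loop. Everything in this sketch (`ToyMemory`, `accuracy`, `batches`, and the data) is hypothetical and far simpler than the real runners; it only shows where evaluation happens relative to replay.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class ToyMemory:
    """Hypothetical memory: stores dialogs and answers by exact lookup."""

    def __init__(self):
        self.store = {}

    def add(self, dialogs):
        for question, answer in dialogs:
            self.store[question] = answer

    def answer(self, question):
        return self.store.get(question, "")


def accuracy(memory, test_set):
    """Fraction of test questions answered correctly; reads memory only."""
    hits = sum(memory.answer(q) == a for q, a in test_set)
    return hits / len(test_set)


train = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
test = [("q1", "a1"), ("q4", "a4")]

# Off-policy: replay every training dialog, then evaluate once.
bulk = ToyMemory()
bulk.add(train)
off_policy_score = accuracy(bulk, test)

# Stepwise off-policy: evaluate after each replayed batch to see how
# the score changes as more training data reaches memory.
stepwise = ToyMemory()
stepwise_scores = []
for batch in batches(train, batch_size=2):
    stepwise.add(batch)
    stepwise_scores.append(accuracy(stepwise, test))
```

In this toy setup the stepwise curve ends at the same score as the bulk run, because both have seen the full training set by the last batch.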

<details> <summary><b>Example: off-policy run on the Open-Domain split</b></summary>

python -m src.off-policy \
    --memory_system bm25_message \
    --dataset_type domain \
    --set_name Open-Domain \
    --retrieve_k 5

Results are written to off-policy/results/domain/Open-Domain/bm25_message/start_at_<timestamp>/.

</details>

<details> <summary><b>Example: full sweep across all baselines × domains</b></summary>

python run_scripts/off-policy.py

The sweep iterates memory_systems.all_names() × domain.json and task.json, automatically skipping known-incompatible combinations declared in the registry (e.g. mem0 on Open-Domain).
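
As a rough picture of that cross product with skip rules, consider the sketch below. The lists and the `planned_runs` helper are hypothetical stand-ins; in the repository the inputs would come from the registry and the dataset config files, and the actual sweep driver may be organized differently.

```python
from itertools import product

# Hypothetical inputs for illustration only.
memory_systems = ["wo_memory", "bm25_message", "mem0"]
set_names = ["Open-Domain", "Legal", "Long-Short"]
skip_combinations = {"mem0": {"Open-Domain", "Long-Short"}}


def planned_runs(systems, sets, skips):
    """Return every (memory_system, set_name) pair that is not declared as skipped."""
    return [
        (system, set_name)
        for system, set_name in product(systems, sets)
        if set_name not in skips.get(system, set())
    ]


runs = planned_runs(memory_systems, set_names, skip_combinations)
```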

</details>

<details> <summary><b>Example: on-policy with live feedback generation</b></summary>

python -m src.on-policy \
    --memory_system mem0 \
    --dataset_type domain \
    --set_name Legal \
    --step 10 --batch_size 100 --max_rounds 3

</details>

<details> <summary><b>Default vLLM deployment (only required to reproduce paper results)</b></summary>

vllm serve Qwen/Qwen3-32B  --port 12345 --chat-template qwen3_nonthinking.jinja   # Main LLM
vllm serve Qwen/Qwen3-8B   --port 12366 --chat-template qwen3_nonthinking.jinja   # Memory-system LLM
vllm serve Qwen/Qwen3-Embedding-0.6B --port 12377 --task embed                    # Embedder
vllm serve AQuarterMile/WritingBench-Critic-Model-Qwen-7B --port 12388            # WritingBench evaluator

With these ports the default configs/memory_systems/*.json files work as-is.

</details>

---

🐍 Python API

The benchmark exposes dataset/evaluation helpers and a small result-loading client in memorybench.py.

Load MemoryBench

The API is load_memory_bench(dataset_type, name, eval_mode=False). It returns a BaseDataset (when dataset_type="single") or a list[BaseDataset] (for "domain" / "task").

from memorybench import load_memory_bench

ds = load_memory_bench("single", "JRE-L")
ds.dataset_name           # "JRE-L"
ds.dataset                # HF DatasetDict with "train" and "test" splits
ds.has_corpus             # bool; True for LoCoMo/DialSim
ds.get_data(test_idx=42)  # → row dict

Evaluate

The API is evaluate(dataset_type, name, predicts) → list[dict].

from memorybench import evaluate

predicts = [{"test_idx": 0, "response": "...", "dataset": "JRE-L"}, ...]
details  = evaluate("single", "JRE-L", predicts)
# [{"dataset": "JRE-L", "test_idx": 0, "metrics": {"Rouge-L": ..., ...}}, ...]
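
When inspecting the returned details by hand, a small helper that averages each metric can be convenient. The sketch below assumes only the record shape shown above; `mean_metrics` is a hypothetical helper rather than part of the MemoryBench API, and the numbers are made up.

```python
from collections import defaultdict


def mean_metrics(details):
    """Average each metric over all evaluated items."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item in details:
        for metric, value in item["metrics"].items():
            totals[metric] += value
            counts[metric] += 1
    return {metric: totals[metric] / counts[metric] for metric in totals}


# Made-up numbers in the shape shown above; not real benchmark output.
details = [
    {"dataset": "JRE-L", "test_idx": 0, "metrics": {"Rouge-L": 0.25}},
    {"dataset": "JRE-L", "test_idx": 1, "metrics": {"Rouge-L": 0.75}},
]
averages = mean_metrics(details)
```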

Summary Results

The API is summary_results(dataset_type, name, predicts, evaluate_details). For a single dataset it returns the mean of each metric; for a domain or task it also computes min-max-normalized and z-normalized aggregates.

from memorybench import summary_results

summary = summary_results("domain", "Open-Domain", predicts, details)
summary["summary"]["weighted_average"]
summary["minmax_normalized_average"]
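
Normalization matters because datasets in one group report metrics on different scales. The sketch below applies the textbook min-max and z-score formulas to made-up scores and reference statistics; the function names, the equal-weight averaging, and the zero-range convention are assumptions and may not match how summary_results aggregates internally.

```python
def minmax_normalize(score, lo, hi):
    """Rescale score to [0, 1] given a reference minimum and maximum."""
    if hi == lo:
        return 0.0  # degenerate range; arbitrary convention for this sketch
    return (score - lo) / (hi - lo)


def z_normalize(score, mu, sigma):
    """Express score in standard deviations from the reference mean."""
    if sigma == 0:
        return 0.0  # arbitrary convention for this sketch
    return (score - mu) / sigma


# Hypothetical per-dataset scores and reference statistics.
scores = {"dataset_a": 0.6, "dataset_b": 30.0}
stats = {
    "dataset_a": {"min": 0.2, "max": 1.0, "mu": 0.5, "sigma": 0.2},
    "dataset_b": {"min": 10.0, "max": 50.0, "mu": 25.0, "sigma": 10.0},
}

minmax_avg = sum(
    minmax_normalize(scores[d], stats[d]["min"], stats[d]["max"]) for d in scores
) / len(scores)
z_avg = sum(
    z_normalize(scores[d], stats[d]["mu"], stats[d]["sigma"]) for d in scores
) / len(scores)
```

Although the raw scores differ by a factor of fifty, both datasets contribute comparably once each is measured against its own reference statistics.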

Load Published Experiment Results

Published off-policy result files can be loaded from the Hugging Face dataset THUIR/MemoryBench-Results.

Key APIs:

  • MemoryBenchResults.from_hf() loads the published results from Hugging Face.
  • load_summary(...), load_predict(...), load_evaluate_details(...), and load_run_config(...) load one result file.
  • list_exps(), list_models(...), list_set_names(...), and list_baselines(...) inspect available results.

The example below loads the result for one off-policy run: Qwen3-8B on the Open-Domain group with the A-Mem baseline.

from memorybench import MemoryBenchResults

results = MemoryBenchResults.from_hf()
summary = results.load_summary(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)
print(summary["summary"])

<details> <summary><b>More result-loading examples</b></summary>

from memorybench import MemoryBenchResults

results = MemoryBenchResults.from_hf()

predicts = results.load_predict(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)

evaluate_details = results.load_evaluate_details(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)

run_config = results.load_run_config(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
    baseline="a_mem",
)

Read a local staged result directory that contains manifest.jsonl:

results = MemoryBenchResults.from_local("/path/to/MemoryBench-Results")

Inspect available results:

results.list_exps()
results.list_models(exp="off-policy")
results.list_set_names(exp="off-policy", model="Qwen3-8B", dataset_type="domain")
results.list_baselines(
    exp="off-policy",
    model="Qwen3-8B",
    dataset_type="domain",
    set_name="Open-Domain",
)

</details>

Load Result Summaries as a Table

Use load_result_summary_table(...) to collect one metric from the summary field of matching summary.json files and organize the results as a pandas.DataFrame.

from memorybench import load_result_summary_table

table = load_result_summary_table(
    metric="z_score",
    exp="off-policy",
    models=["Qwen3-8B"],
    dataset_type=None,  # None means both domain and task results.
    set_name=None,
)

---

🖥 Frontend

python -m streamlit run frontend/streamlit_app.py
# → http://localhost:8501

The frontend covers off-policy and on-policy runs end to end. It auto-hides irrelevant fields:

  • LLM provider dropdown is filtered per baseline: mem0 / a_mem / memoryos don't expose the Anthropic option because they route through their own provider abstractions.
  • LLM base URL default updates when you switch providers (vllm / openai / anthropic).
  • Embedder section only appears for baselines that consume embeddings (embedder_*, mem0).
  • Retrieve k is hidden for wo_memory.
  • Dataset source is an explicit radio: Hugging Face Hub vs Local path, with live path validation.
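
The visibility rules in the list above can be expressed as a few pure functions, sketched here. The function and constant names are hypothetical, and the actual Streamlit code may structure this differently; only the rules themselves are taken from the list.

```python
# Baselines that, per the list above, route through their own provider abstractions.
OWN_PROVIDER_BASELINES = {"mem0", "a_mem", "memoryos"}
ALL_PROVIDERS = ["vllm", "openai", "anthropic"]


def provider_options(baseline):
    """Providers offered in the dropdown for the given baseline."""
    if baseline in OWN_PROVIDER_BASELINES:
        return [p for p in ALL_PROVIDERS if p != "anthropic"]
    return list(ALL_PROVIDERS)


def shows_embedder_section(baseline):
    """The embedder section appears only for baselines that consume embeddings."""
    return baseline.startswith("embedder_") or baseline == "mem0"


def shows_retrieve_k(baseline):
    """Retrieve k is irrelevant when there is no memory to retrieve from."""
    return baseline != "wo_memory"
```

Keeping such rules as plain functions of the baseline name makes them easy to unit-test without launching the UI.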

See frontend/README.md for the full walkthrough.

---

🧰 Extending MemoryBench

Adding a new baseline or dataset is a single-file change plus one registry entry.

Add a new memory-system baseline:

  1. Write src/agent/<name>.py (agent + pydantic config) and src/solver/<name>.py (solver).
  2. Add configs/memory_systems/<name>.json.
  3. Add one register(MemorySystemSpec(...)) call in src/memory_systems.py.

Everything else (CLI choices, frontend dropdowns, sweep scripts, dialog-field lookups, skip rules) picks up the new entry automatically.

Add a new dataset: subclass BaseDataset, add one entry to configs/datasets/each.json, and (for corpus-style datasets) set corpus_format = "<name>" on the class.

Full step-by-step walkthrough: CONTRIBUTING.md.

The parametric test tests/test_refactor.py::TestAllBaselinesContract walks every registered baseline and asserts the off-policy + on-policy method contract, so your new baseline is tested automatically.

---

🏗 Repository Layout

MemoryBench/
├── memorybench.py              # Public API: load_memory_bench, evaluate, summary_results
├── configs/
│   ├── datasets/               # each.json, domain.json, task.json
│   ├── memory_systems/         # one JSON per baseline
│   └── final_evaluate_summary_wo_details.json  # min/max/mu/sigma stats
├── src/
│   ├── memory_systems.py       # ← central registry of baselines
│   ├── dataset/                # BaseDataset + per-dataset subclasses
│   ├── agent/                  # Agent implementations
│   ├── solver/                 # Per-baseline solvers
│   ├── llms/                   # OpenAI / vLLM / Anthropic clients
│   ├── generate_dialogs/       # Dialog-generation scripts
│   ├── off-policy.py · on-policy.py · stepwise_off-policy.py · train_performance.py
│   └── utils.py
├── run_scripts/                # Sweep drivers (loops over every registered baseline)
├── baselines/                  # Vendored upstream baselines (mem0, A-Mem, MemoryOS)
├── frontend/                   # Streamlit app
├── tests/                      # Unit + integration tests
├── CONTRIBUTING.md             # How to add baselines / datasets
└── README.md

---

📝 Notes & Caveats

  • bert_score truncation bug. Some datasets (e.g. JRE-L) evaluate with bert_score. Locally loaded models don't truncate inputs, so load the model from the Hugging Face Hub to avoid "exceeding max length" errors.
  • WritingBench evaluator. Long-form writing datasets use a 7B critic; we recommend serving WritingBench-Critic-Model-Qwen-7B via vLLM and pointing WRITINGBENCH_EVAL_BASE_URL at it.
  • Mem0 cost. mem0 is slow on Open-Domain and Long-Short; the run scripts skip these combinations by default (see skip_combinations in the registry entry).
  • Secrets. API_config.json, .env* (except .env.example), and frontend/runtime_configs/ are all gitignored; see .gitignore.

---

Citation

If you use MemoryBench in your research, please cite:

@inproceedings{memorybench2026,
  title     = {MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems},
  author    = {Qingyao Ai and Yichen Tang and Changyue Wang and Jianming Long and Weihang Su and Yiqun Liu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026},
  note      = {Spotlight}
}

(Full BibTeX will be updated once the camera-ready DOI is available.)

---

License

Released under the MIT License. Upstream baseline code under baselines/ retains its original license; see each subdirectory's LICENSE file.

---

Acknowledgements

MemoryBench builds on prior datasets and memory systems from many open-source efforts: LoCoMo, DialSim, HelloBench, WritingBench, IdeaBench, LimitGen, JRE-L, JuDGE, LexEval, NFCats, WritingPrompts, A-Mem, Mem0, and MemoryOS. Thank you to all upstream authors.

For questions and feedback, open an issue on GitHub or contact the maintainers.
