mcp-vl-msa-rs

Summary

Searchable agent memory: BM25 corpus recall, original-text injection, remember/forget capsules.

A searchable long-term memory for AI agents, exposed as an MCP stdio server. Index documents, notes and past conversations into collections; retrieve the top-k relevant chunks for a query and inject the original text back to the model; add or drop agent memories with msa_remember / msa_forget. Pure Rust, BM25 over tantivy, zero ML deps in the default build; optional in-process dense rerank.

Any MCP client (Claude Code, Codex, or anything speaking MCP stdio) gets the same memory: a queryable corpus that survives across sessions and model swaps, with no cloud account and no embedding service required. Use it to give an agent durable recall over a knowledge base, a docs tree, or its own chat history — retrieval that returns the original text, not just embeddings.

It is one half of a two-part memory: this server is the library (corpus recall), its companion mcp-memory-rs is the notebook (curated state). An agent that swaps models loses neither.

flowchart LR
    A["AI agent<br/>(any MCP client)"]
    A -->|"curated state<br/>read / write / sync"| M["mcp-memory-rs<br/><i>the notebook</i>"]
    A -->|"corpus recall<br/>index / search / fetch"| V["mcp-vl-msa-rs<br/><i>the library</i>"]
    M --- D1[("JSON categories<br/>SQLite FTS5")]
    V --- D2[("tantivy BM25<br/>collections")]

The name: msa is the retrieval pattern it borrows from the Memory Sparse Attention paper (arXiv:2603.23516) — an extrinsic approximation, not the neural model; distinct from MiniMax's MSA-architecture LLMs, which are intrinsic (in-model) generators. vl is for Vivling (codex-vl), its first adopter — but the server is fully AI-agnostic and depends on nothing from it.

Status: v0.4 — hybrid sparse+dense optional.

Why

The original Memory Sparse Attention paper (EverMind-AI) describes an end-to-end trainable sparse attention layer over chunk-pooled KV caches. That is a neural artifact and is not portable to a pure-Rust MCP server. What is portable, and what this repo aims to deliver, is the MSA macro pattern:

  1. Chunked storage of long-form text with a small fixed pool size (P=64 words by default, mirroring the paper).
  2. Top-k sparse routing over chunks (BM25 surrogate; learned routing is out of scope).
  3. Original text injection (paper §4.3, ablation -37.1% without): msa_search returns chunks, msa_fetch_doc returns the full document.
  4. Memory Interleave as a protocol (shipped in v0.3 as msa_search_iterative): the AI client orchestrates multi-hop retrieval through repeated tool calls with a server-side cursor.
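Step 1 of the pattern above (fixed-size word chunks) can be pictured with a short sketch. It is not the server's code: the function name `chunk_words` is hypothetical, and it assumes plain whitespace tokenization, which may differ from what the server actually does. The `chunk_size` and `overlap` parameters mirror the config keys of the same name.

```rust
/// Split `text` into chunks of at most `chunk_size` whitespace-separated
/// words, sharing `overlap` words between consecutive chunks.
/// Assumption: whitespace tokenization; the real tokenizer may differ.
pub fn chunk_words(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");

    let words: Vec<&str> = text.split_whitespace().collect();
    let step = chunk_size - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

With the defaults shown later in the config (`chunk_size = 64`, `overlap = 0`), a 150-word document would yield three chunks of 64, 64 and 22 words.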

Design and rationale are documented in the project notes (negative results, gate methodology); see docs/NEGATIVE_RESULTS.md.

Benchmarks

Retrieval changes are decided on pre-registered, paired deltas with bootstrap confidence intervals — not on absolute scores. Workloads: HotpotQA (extractive QA), MLDR-it (long-doc retrieval, Italian), LongMemEval-S (500 conversational-memory questions). Full methodology, acceptance gates and refuted hypotheses live in docs/NEGATIVE_RESULTS.md.

Headline measurements:

  • BM25 is the engine, not a placeholder. Three pre-registered attempts; no hybrid (BM25 + dense rerank) configuration beat the gate on these workloads. Dense rerank stays available (dense_alpha, off by default) for re-testing as encoders improve.
  • Rich capsules at ingestion (deterministic enrich, no LLM): **+7 to +20 recall@5** across every category.
  • Original-text injection (msa_fetch_doc after msa_search): +14.6 F1 exactly on the stratum where snippets miss the content.
  • Recency priors lose: handle time at serving, not in the retrieval score.

Reproduce:

crates/msa-bench/scripts/download-bench-datasets.sh   # fetch datasets
scripts/run-baseline-bench.sh                         # BM25 vs BM25+dense sweep
# results land under crates/msa-bench/results/ as JSON

Tool surface

| Tool | Since | Description |
|---|---|---|
| msa_index | v0.1 | Index a document; existing chunks for doc_id are replaced. |
| msa_search | v0.1 | Top-k chunks, score normalized 0.0–1.0. |
| msa_fetch_doc | v0.1 | Full original text of a document. |
| msa_delete | v0.1 | Remove a document and all its chunks. |
| msa_list_collections | v0.1 | Collections open in the registry. |
| msa_stats | v0.1 | Per-collection statistics (exact num_documents / total_tokens). |
| SearchFilter | v0.2 | Metadata filter (where_eq/where_in/created_*), post-retrieval. |
| msa_search_iterative | v0.3 | Memory Interleave with server-side cursor; dedups across rounds. |
| msa_drop_session | v0.3 | Force-evict a Memory Interleave session before TTL. |
| dense_alpha on msa_search | v0.4 | Hybrid BM25 + cosine rerank. Requires --features embeddings + [embeddings] config. |
| msa_remember / msa_forget | v0.4 | Agent-memory surface: enrich + low-signal gate + content-hash dedup; standard metadata (kind / source_id / created_at). |
| msa_sync_path | v0.4 | Mirror a directory into a collection (filesystem source; blake3 delta sync). |
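To make "post-retrieval" in the SearchFilter row concrete, here is a minimal sketch of filtering hits after the top-k search has run. The field names `where_eq` and `where_in` come from the table; the `Hit` struct, the flat string metadata and the function `filter_hits` are assumptions made for illustration, and the `created_*` range is left out.

```rust
use std::collections::HashMap;

/// A retrieved chunk with flat string metadata (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub doc_id: String,
    pub score: f32,
    pub metadata: HashMap<String, String>,
}

/// Illustrative stand-in for SearchFilter; the types are assumptions.
#[derive(Debug, Default)]
pub struct SearchFilter {
    /// Every listed key must equal the given value.
    pub where_eq: HashMap<String, String>,
    /// Every listed key must hold one of the allowed values.
    pub where_in: HashMap<String, Vec<String>>,
}

impl SearchFilter {
    pub fn matches(&self, hit: &Hit) -> bool {
        let eq_ok = self
            .where_eq
            .iter()
            .all(|(key, want)| hit.metadata.get(key) == Some(want));
        let in_ok = self.where_in.iter().all(|(key, allowed)| {
            hit.metadata.get(key).map_or(false, |v| allowed.contains(v))
        });
        eq_ok && in_ok
    }
}

/// Apply the filter to already-ranked hits, keeping rank order.
pub fn filter_hits(hits: Vec<Hit>, filter: &SearchFilter, top_k: usize) -> Vec<Hit> {
    hits.into_iter()
        .filter(|hit| filter.matches(hit))
        .take(top_k)
        .collect()
}
```

Because the filter runs on hits that were already retrieved, a very selective filter can leave fewer than `top_k` results, which is the trade-off a query-time pre-filter would address.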

Install

Prebuilt binary (recommended) — download the archive for your platform from the latest release, extract, and point your MCP client at the binary:

tar xzf mcp-vl-msa-rs-x86_64-unknown-linux-gnu.tar.gz
install -m755 mcp-vl-msa-rs-*/mcp-vl-msa-rs ~/.local/bin/

Prebuilt targets (Linux + Android): x86_64-unknown-linux-gnu, x86_64-unknown-linux-musl, aarch64-unknown-linux-gnu, aarch64-unknown-linux-musl (edge / ARM / Termux), aarch64-linux-android.

macOS: no prebuilt binary is shipped (it would need Apple code-signing). Install from source instead — cargo install below compiles it on your Mac in one command, no signing needed.

From source (Rust toolchain) — --locked is required (the workspace Cargo.lock pins a working time / tantivy-common resolution; a fresh resolve breaks the build), and mcp-msa-server is the package name (the binary it installs is mcp-vl-msa-rs):

cargo install --git https://github.com/DioNanos/mcp-vl-msa-rs \
  --locked --features source-fs mcp-msa-server

Build & test

cd mcp-vl-msa-rs

# Default: pure BM25, zero network deps
cargo build --release
cargo test

# Hybrid sparse + dense (in-process Candle rerank, no external service)
cargo build --release --features embeddings
cargo test  --features embeddings

Hybrid mode config

Add [embeddings] to MCP_MSA_CONFIG to activate dense rerank. Without this section the server stays in BM25-only mode even when the binary was built with --features embeddings.

The production backend is candle-modernbert: the encoder runs in-process (Candle), offline-deterministic, from a local model bundle — no daemon, no network at runtime, no automatic downloads. Prepare the bundle once with scripts/prepare-granite-r2-97m.sh.

[storage]
storage_dir = "~/.local/state/mcp-vl-msa-rs"

[chunking]
chunk_size = 64
overlap = 0

[embeddings]
backend   = "candle-modernbert"
model_dir = "~/.local/share/mcp-vl-msa-rs/models/granite-r2-97m"
dim       = 768
model_id  = "granite-r2-97m"

A transitional backend = "ollama" (HTTP to an Ollama-compatible service) still exists but is deprecated and scheduled for removal in v0.6 — do not build new setups on it.

The AI client opts into hybrid scoring per-call by passing dense_alpha to msa_search (or any future tool that supports it). dense_alpha = 1.0 (default) is BM25-only; 0.0 is dense-only; intermediate values are a linear blend α·bm25 + (1-α)·((cos+1)/2). Cosine is shifted to [0,1] so it composes linearly with the already max-normalized BM25 score.
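The blend above is small enough to write out. The sketch below is an illustration of the stated formula, not the server's implementation: the function names are hypothetical, and clamping the inputs to their valid ranges is an added assumption.

```rust
/// Max-normalize raw BM25 scores into [0, 1]; the best hit gets 1.0.
/// Assumption: raw scores are non-negative.
pub fn max_normalize(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(0.0_f32, f32::max);
    if max <= 0.0 {
        return vec![0.0; scores.len()];
    }
    scores.iter().map(|s| s / max).collect()
}

/// Linear blend: alpha * bm25 + (1 - alpha) * ((cos + 1) / 2).
/// `bm25_norm` is expected in [0, 1], `cosine` in [-1, 1].
/// Clamping is an assumption of this sketch, not documented behavior.
pub fn blend(bm25_norm: f32, cosine: f32, dense_alpha: f32) -> f32 {
    let alpha = dense_alpha.clamp(0.0, 1.0);
    // Shift cosine from [-1, 1] to [0, 1] so both terms share a scale.
    let dense = (cosine.clamp(-1.0, 1.0) + 1.0) / 2.0;
    alpha * bm25_norm + (1.0 - alpha) * dense
}
```

For instance, a chunk with normalized BM25 score 1.0 and cosine 0.0 blends to 0.5 · 1.0 + 0.5 · 0.5 = 0.75 at `dense_alpha = 0.5`, while at `dense_alpha = 1.0` the cosine term drops out entirely.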

Run as MCP stdio

# Default storage: ~/.local/state/mcp-vl-msa-rs/
./target/release/mcp-vl-msa-rs

# With explicit config
MCP_VL_MSA_CONFIG=~/.config/mcp-vl-msa-rs/config.toml \
MCP_DEVICE=my-node \
./target/release/mcp-vl-msa-rs

Example ~/.codex/config.toml entry:

[mcp_servers.vl_msa]
command = "/path/to/mcp-vl-msa-rs/target/release/mcp-vl-msa-rs"
env = { MCP_DEVICE = "my-node" }
# let the model call tools without a per-call approval prompt
default_tools_approval_mode = "approve"

Equivalent ~/.claude.json entry for Claude Code:

{
  "mcpServers": {
    "vl_msa": {
      "command": "/path/to/mcp-vl-msa-rs/target/release/mcp-vl-msa-rs",
      "env": { "MCP_DEVICE": "my-node" }
    }
  }
}

AI client compatibility

  • Clients with partial MCP support may not surface the server's instructions text. The tool descriptions and request-field descriptions are self-contained, so a model can work from those alone.
  • Read-only tools (msa_search, msa_fetch_doc, msa_stats, msa_list_collections, msa_manifest, msa_search_iterative, msa_interleave_round) carry the readOnlyHint annotation, which lets a gating client auto-approve them.
  • If a model reports an "unsupported call" or "user cancelled" on codex, that is the approval gate, not a server fault. Set default_tools_approval_mode (above) so tool calls are not blocked on a prompt.

Storage layout

~/.local/state/mcp-vl-msa-rs/
├── <collection_a>/        ← tantivy index directory
├── <collection_b>/
└── ...

Each collection is an independent tantivy index. Collection names are validated (rejected if they contain path separators, .., etc.) so a collection cannot escape the root.
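A validation of this kind might look like the sketch below. The exact rules the server applies are not listed here (the "etc." above is left open), so the checks and the names `is_safe_collection_name` and `collection_dir` are assumptions chosen for illustration.

```rust
use std::path::{Path, PathBuf};

/// Return true if `name` is safe to use as a single directory component.
/// Assumed rules: non-empty, not ".", no "..", no path separators, no NUL.
pub fn is_safe_collection_name(name: &str) -> bool {
    if name.is_empty() || name == "." || name.contains("..") {
        return false;
    }
    // Reject Unix and Windows separators as well as NUL bytes.
    !name.chars().any(|c| c == '/' || c == '\\' || c == '\0')
}

/// Resolve a collection directory under `root`, or None if the name
/// could escape the root.
pub fn collection_dir(root: &Path, name: &str) -> Option<PathBuf> {
    if is_safe_collection_name(name) {
        Some(root.join(name))
    } else {
        None
    }
}
```

Rejecting the name up front, rather than canonicalizing the joined path afterwards, keeps the check independent of what already exists on disk.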

Roadmap

Shipped:

  • v0.2: SearchFilter (where_eq / where_in / created range), post-retrieval.
  • v0.3: msa_search_iterative Memory Interleave with server-side cursor + TTL'd MsaSession registry.
  • v0.4: hybrid BM25 + dense rerank behind feature flag embeddings, Ollama backend, per-call dense_alpha; agent-memory surface (msa_remember / msa_forget); filesystem source metadata (created_at / source / ext / dir) at index time; exact num_documents / total_tokens in msa_stats; msa-bench reproducible benchmark crate; prebuilt-binary packaging.
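The v0.3 entry (a server-side cursor that dedups across rounds, held in a TTL'd session registry) can be pictured with the following sketch. It is not the MsaSession implementation: the type names, the use of chunk-id strings, and the sliding TTL measured from last use are all assumptions.

```rust
use std::collections::{HashMap, HashSet};
use std::time::{Duration, Instant};

/// Per-session cursor state: which chunk ids were already returned.
struct Session {
    seen: HashSet<String>,
    last_used: Instant,
}

/// Hypothetical registry of interleave sessions with a sliding TTL.
pub struct SessionRegistry {
    ttl: Duration,
    sessions: HashMap<String, Session>,
}

impl SessionRegistry {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, sessions: HashMap::new() }
    }

    /// Run one round: return only the candidate ids this session has
    /// not returned before, and remember them for later rounds.
    pub fn next_round(&mut self, session_id: &str, candidates: &[&str], now: Instant) -> Vec<String> {
        self.evict_expired(now);
        let session = self
            .sessions
            .entry(session_id.to_string())
            .or_insert_with(|| Session { seen: HashSet::new(), last_used: now });
        session.last_used = now;

        let mut fresh = Vec::new();
        for id in candidates {
            // `insert` returns false when the id was already seen.
            if session.seen.insert((*id).to_string()) {
                fresh.push((*id).to_string());
            }
        }
        fresh
    }

    /// Force-evict a session before its TTL; true if it existed.
    pub fn drop_session(&mut self, session_id: &str) -> bool {
        self.sessions.remove(session_id).is_some()
    }

    fn evict_expired(&mut self, now: Instant) {
        let ttl = self.ttl;
        self.sessions
            .retain(|_, s| now.saturating_duration_since(s.last_used) < ttl);
    }
}
```

In this sketch a second round that retrieves chunks `c2` and `c3` after a first round returned `c1` and `c2` yields only `c3`; once the TTL lapses the session is forgotten and earlier chunks can be returned again.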

Next (not yet built):

  • Query-time tantivy filter (today SearchFilter runs post-retrieval; fine for normal corpora, but a pre-filter would help when selectivity is high on a very large index).

  • ACL for multi-tenant collections.
  • Tool-description tuning.

Related work

  • Memory Sparse Attention (EverMind-AI, arXiv:2603.23516): architectural inspiration (neural, intrinsic); this repo is an extrinsic, pure-Rust approximation of the macro pattern.
  • Vivling (in codex-vl): the first downstream consumer; this server is its long-term memory.
  • mcp-memory-rs: the companion server for curated agent state (named JSON categories, per-device ACL, fleet sync). This server does corpus recall; together they cover both halves of agent memory: the curated notebook and the queryable library.

License

Apache-2.0. See LICENSE.
