nautilus-compass
<!-- mcp-name: io.github.chunxiaoxx/nautilus-compass -->
Reliability layer for multi-agent setups · keep multiple agents, or your own long-running sessions, coordinating reliably without an orchestrator. Cross-dialog contracts + drift detection + a 4-tier memory lifecycle schema (activation in progress). Plugin for Claude Code/Desktop · Cline · Cursor · Continue.dev · Zed.
When an agent drifts from a rule you set, takes a shortcut you flagged, or claims a prior agreement that never happened, compass catches it before the agent acts.
Why it holds up technically: the memory underneath is black-box. Raw text is embedded locally with BGE-m3, with no LLM extraction step, no graph, and no data leaving your machine (~14× cheaper to reproduce than white-box stacks like Mem0 / Letta / Cognee / Zep / MemOS). That same raw-prompt index is what lets compass score the next action against your past mistakes, a form of drift detection that white-box entity-graph memory structurally can't do. Full argument: paper/BLACKBOX_VS_WHITEBOX.md.
Built by Nautilus Platform · open agent ecosystem · join as agent →
🇬🇧 English (this file) · 🇨🇳 中文
        
---
30-second pitch
compass's #1 job is multi-agent reliability without an orchestrator. The reason it can do that — and not be just another memory store — is its black-box memory core:
White-box memory layers (Mem0, Letta, Cognee, Zep, MemOS, Smriti):
"I call an LLM to extract facts from your conversation,
then store them in a graph. Pay extraction tokens. Send
data to the provider."
Black-box memory (compass · this project):
"I embed raw text locally with BGE-m3. No extraction LLM.
No graph. No data leaving your machine. And because raw
prompts are still in the index, I can score the next
prompt against your past mistakes before the agent acts."
The trade is real: −30 points on LongMemEval-S vs white-box leaders that build entity graphs, in exchange for 14× cheaper reproduction, full local-deployment, cross-LLM portability, and drift detection that white-box systems can't offer. Full argument: paper/BLACKBOX_VS_WHITEBOX.md.
In one line: when the AI is about to forget a rule you set, take a shortcut you flagged, or fabricate a prior agreement, it gets stopped by its own history of failure patterns.
---
What's new in v2.1.0 · drift v2 + line reconciliation
v2.1.0 unifies two development lines (daemon/reliability + lifecycle/PoI) onto a single main and hardens the drift loop.
Drift v2 cutover (cry-wolf fix)
The old OR-vote firing (neg_cos ≥ 0.538) fired on 64.5% of events in 11.5k records of real traffic — benign prompts with high anti-anchor cosine overlapped genuine drift, so agents tuned out (act-on rate 9.87%). v2.1.0 makes firing high-signal:
should_alert = rule_hit (danger-command regex) OR drift_score < −0.07
Production-measured fire rate 0.5% · danger commands (rm -rf / force push / DROP / hardcoded key) always caught · the multi-signal drift/firing.py vote is retained behind an env flag for A/B.
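A minimal sketch of the firing rule above, for illustration only. The regex patterns and the names `DANGER_PATTERNS`, `rule_hit`, and `should_alert` are assumptions made for this example; the shipped danger-command set is not reproduced here. Only the threshold (−0.07) and the OR structure come from the rule as stated.

```python
import re

# Hypothetical danger-command patterns; the real regex set may differ.
DANGER_PATTERNS = [
    re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f"),          # rm -rf and variants
    re.compile(r"\bgit\s+push\b.*--force"),                # force push
    re.compile(r"\bDROP\s+(TABLE|DATABASE)\b", re.IGNORECASE),
]
DRIFT_THRESHOLD = -0.07  # from the v2.1.0 rule above


def rule_hit(text: str) -> bool:
    """True if any danger-command pattern matches the text."""
    return any(p.search(text) for p in DANGER_PATTERNS)


def should_alert(text: str, drift_score: float) -> bool:
    """Fire on a danger command OR on a drift score strictly below the threshold."""
    return rule_hit(text) or drift_score < DRIFT_THRESHOLD
```

Because the rule hit is OR-ed in, a danger command fires even when the drift score looks benign, which matches the "always caught" guarantee described above.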
Cross-agent contract scanner (L4 substrate)
- implicit contracts derived from inbound_/outbound_ handoff files
- auto-consume detection (1:1 greedy · receiver-authorship guarded · opt-in)
- idempotent contract ledger · 720h close-loop window
L3 tier promotion + Proof-of-Impact
- daily idempotent tier-promotion driver (impact-based · LLM-free) — shipped + unit-tested; not yet scheduled in production
- PoI candidate emission at recall time + impact-weighted ranking boost
- L1 session-summary overlay
Activation status (honest): the L3 lifecycle machinery (tier promotion, forget_at archival, the promotion driver) is shipped and unit-tested, but the production recall path does not yet promote tiers or apply forget_at at query time (query ranking currently uses file-age archived_at decay + an importance gate). PoI emission requires cross-agent outcome events, which depend on the L4 data pipeline now being wired. Treat the lifecycle below as a schema + tested functions, with production activation + validation in progress.
Daemon hardening (P4–P9)
bounded handler pool · in-flight semaphore (CLOSE_WAIT cure) · server-side recall cache · pkl warmup (cold-start CPU cure) · BM25 + vector RRF fusion (opt-in) · inotify cache invalidation.
---
What's new in v2.0.0 · Opinionated EvoMap
v2.0.0 ships a deterministic lifecycle layer on top of the black-box memory base: a paradigm fusion of llm-wiki2 (Karpathy v2), agentmemory (LongMemEval-S 95.2% R@5), and GBrain (Garry Tan · MIT).
The bet: every other memory project (Mem0, Letta, Cognee, Zep, MemOS, llm-wiki2, agentmemory) calls an LLM at some lifecycle decision — ingest, promotion, consolidation, or forgetting. compass v2.0.0 makes them all schema-declared.
5 new frontmatter fields (write-time LLM-free)
tier: working | episodic | semantic | procedural # 4 tiers verbatim from llm-wiki2
decay_rate: 0.5 # Ebbinghaus exponential decay
forget_at: 2026-06-01T00:00:00Z # null = never · soft-archive when reached
promote_after: "7d" | "5_access" # duration or access count
reinforce_count: 0 # access event counter
Deterministic promotion rule (no LLM call)
- reinforce_count >= promote_after → tier++
- access event → reset decay timer + reinforce_count++
- forget_at reached → soft-archive flag
- procedural (top tier) does not promote
Full design rationale in paper/LLM_WIKI2_FUSE_DESIGN.md; implementation at recall.py:708+.
The promotion rule above is implemented as promote_lifecycle_tier() and covered by tests/test_lifecycle_fuse.py, but is not yet invoked on the production recall path; see the activation-status note under L3 tier promotion above.
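To make the rule concrete, here is a hedged sketch of the access-count form of promotion. It is not the body of promote_lifecycle_tier(); the `Memory` dataclass, `on_access`, and `should_soft_archive` are hypothetical names, the duration form of promote_after ("7d") is omitted, and resetting the counter after a promotion is an assumption the rule does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

TIERS = ["working", "episodic", "semantic", "procedural"]


@dataclass
class Memory:
    tier: str = "working"
    reinforce_count: int = 0
    promote_after: int = 5            # access-count form, e.g. "5_access"
    forget_at: Optional[datetime] = None


def on_access(mem: Memory) -> Memory:
    """Count one access event, then promote one tier if the threshold is met."""
    mem.reinforce_count += 1
    at_top = mem.tier == TIERS[-1]    # procedural never promotes
    if not at_top and mem.reinforce_count >= mem.promote_after:
        mem.tier = TIERS[TIERS.index(mem.tier) + 1]
        mem.reinforce_count = 0       # assumption: counter restarts per tier
    return mem


def should_soft_archive(mem: Memory, now: datetime) -> bool:
    """forget_at of None means never; otherwise archive once the time is reached."""
    return mem.forget_at is not None and now >= mem.forget_at
```

Every decision here is a comparison on stored fields, so the same inputs always give the same tier, with no model call involved.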
Other v2.0.0 additions
- 9 agentmemory-verbatim lifecycle hooks in stop_hook.py for Claude Code: SessionStart, UserPromptSubmit, PreToolUse, PostToolUse, PostToolUseFailure, PreCompact, SubagentStart/Stop, SessionEnd
- add_worker(spec) MCP tool: super-agents register deterministic worker specs (cron / pubsub / queue / http / custom) to .cache/workers.jsonl
- RRF k=60 fusion in recall.py: combine BM25 + vector + KG ranked lists with session-diversified output (max 3 per session · agentmemory verbatim)
- npx nautilus-compass init: one-command workspace setup creating .compass/.env, sample anchors, and Claude Code hook templates
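The RRF item above can be sketched as follows. This is standard Reciprocal Rank Fusion (each list contributes 1 / (k + rank) per document) followed by a per-session cap; the function names `rrf_fuse` and `diversify` and the `session_of` mapping are assumptions for the example, not the code in recall.py.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Tie-break on doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def diversify(doc_ids, session_of, max_per_session=3):
    """Keep rank order but allow at most max_per_session docs per session."""
    kept_per_session = defaultdict(int)
    out = []
    for doc_id in doc_ids:
        session = session_of[doc_id]
        if kept_per_session[session] < max_per_session:
            kept_per_session[session] += 1
            out.append(doc_id)
    return out


bm25 = ["a", "b", "c"]
vector = ["b", "a", "d"]
kg = ["b", "c", "a"]
fused = rrf_fuse([bm25, vector, kg])
```

RRF uses only rank positions, so BM25 scores and cosine similarities never need to be normalized onto a common scale before fusion.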
"Opinionated" — what we declined
Frame borrowed from GBrain ("Garry's Opinionated OpenClaw/Hermes Agent Brain"). compass v2.0.0 takes a stance on what not to include:
- ❌ No LLM at ingest (USD 3.50 / 100M tokens · BGE-m3 embeds raw text)
- ❌ No LLM at tier promotion (deterministic schema only · reinforce_count + promote_after)
- ❌ No LLM at forgetting (ISO8601 forget_at + counter only)
- ❌ No vendoring of GBrain or OpenViking source · paradigms are rewritten from scratch in Python · GBrain (MIT, TypeScript) and OpenViking (AGPL-3.0, verified 2026-05-22) are paradigm references only
- ❌ No graph rerank for LongMemEval-style closed haystacks · cost us −6.2 pts in v0.8 (paper/RESULTS_v0.8.md)
---
What's coming · v3.0 / v3.5 fusion (dev branch preview)
Active development on the v3-full-fusion branch · not in any release. Plan: ~2 work weeks · 8 Sprints · each Sprint has a prove-or-kill gate (statistical · SQL/eval · not agent self-assessment).
Default-off byte-equal promise: with no opt-in env set, v3.0 / v3.5 behavior is byte-equal to v2.0.1. Verified by tests/test_llm_opt_in.py · the test_default_off_invariant_* family gates every PR into main.
v3.0 deterministic (Sprints 1-2 · no LLM)
- Typed knowledge graph layer (Sprint 1) · 6 entity types · 8 edge types · 2-pass extract (regex + BGE cosine) · backward-compat NO-OP when graph not built
- Confidence scoring + contradiction hook (Sprint 2 · deterministic formula over source count / recency / contradicted-by count)
- MEMORY_REPORT.md auto-gen (Sprint 2 · session-end hook · 4-tier distribution + cumulative_impact + drift summary)
- implementation_notes frontmatter (Sprint 2 · rationale + rejected: [{alt, why}])
v3.5 opt-in LLM features (Sprints 3-7 · all default-off)
| env var | tier | feature (Sprint) |
|---|---|---|
| COMPASS_USE_LLM_RESOLVE | 1 (session-end) | LLM contradiction resolution (Sprint 3) |
| COMPASS_USE_LLM_VERIFY | 4 (runtime) | anti-confabulation cite-or-refuse (Sprint 4) |
| COMPASS_USE_LLM_DRIFT_PAY | 4 (runtime) | drift × outcome anchor feedback (Sprint 5) |
| COMPASS_USE_LLM_REFLECT | 3 (periodic) | self-reflection semantic emit (Sprint 6) |
| COMPASS_USE_LLM_ECON | 4 (runtime) | memory-as-economy NAU budget (Sprint 7) |
Pattern mirrors the existing COMPASS_USE_GEMINI_FLASH opt-in (judges/gemini_flash.py) — env truthy (1/true/yes/on) activates · anything else disables. Registry: llm_opt_in.py.
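The opt-in check described above can be sketched in a few lines. The helper name `llm_opt_in` and the injectable `environ` argument are hypothetical, and treating values case-insensitively with surrounding whitespace stripped is an assumption; the source only states that 1/true/yes/on activates and anything else disables.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def llm_opt_in(name, environ=None):
    """Return True only when the named env var holds a truthy value."""
    env = os.environ if environ is None else environ
    # Assumption: comparison ignores case and surrounding whitespace.
    return env.get(name, "").strip().lower() in TRUTHY
```

An unset variable and an unrecognized value both return False, which is what keeps the default-off behavior the byte-equal promise depends on.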
Kill-gate semantics
Per-Sprint gates are pre-registered. If a Sprint's gate metric does not pass (e.g. Sprint 1: multi-hop +3pp on LongMemEval-S multi-session subset, n=133), that Sprint stops · no further Sprints attempted · the corresponding paper3 v2 novelty claim is removed. This protects against post-hoc rationalization of negative results.
---
Case study · 4-dialog OSS multi-agent reliability
Across 28 hours on 2026-05-30 / 31, four Claude Code dialogs (compass / Soul / V5 / nautilus-core) ran concurrently on shared filesystem-mediated protocols. The recorded run includes:
- Drift detection firing 314 times / 7d (76 / 24h) with act_on_rate measured at 9.87% / 7d · 40.79% / 24h
- Cross-dialog contract cnt_compass_soul_sub_a1 closing in 17.92h (vs 6d 21h budget · 5.8d slack)
- 13 plan-dup audits preventing ~40-50h of speculative re-implementation
- First cross-dialog L4 fire: Soul daemon-shipped PR #88 settled 50 NAU through the agent-first economy
- One verify-gap caught by the case study itself: a handoff claim of "22/22 tests GREEN" was actually 11/22 broken until scripts/__init__.py was added (commit pushed in the same change as the case study)
The full field log including 7 generalizable patterns for OSS multi-agent reliability is at docs/case_study_4dialog_compass.md.
---
What problem does this solve
A. Long sessions drift
You told Claude at session start: "never claim deployment success without verification." Fifty prompts later Claude says "deployed successfully ✅" — without verifying. The memory rule was there; the AI forgot it under context pressure.
B. White-box drift detection isn't reachable
Persona Vectors (Anthropic, 2025) showed that LLM activations contain directions for sycophancy and hallucination. But that approach requires model weights, which closed APIs (Claude, GPT-4) don't expose. There has been no production black-box equivalent that runs in a Claude Code hook.
C. Memory plugins solve only half the problem
Mem0, Letta, claude-mem, Zep all compete on "recall the most relevant past memory." But memory recalled doesn't stop the AI from breaking the rule this time — that other half has been unsolved.
---
How it works
User prompt: "Fix bug X for me"
│
▼
┌─────────────────────────────────────┐
│ UserPromptSubmit Hook (this plugin)│
└─────────────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌──────────┐
│ recall │ │ drift │ │ profile │
│ memory │ │ check │ │ aggregate│
└────────┘ └─────────┘ └──────────┘
│
▼
Hooks inject results into Claude's system prompt:
- Time-bucketed past memory (BGE-m3 semantic recall)
- Drift score + nearest negative anchor (if score < threshold)
- Profile facts ("you have 3 unfinished tasks in this repo")
│
▼
Claude answers — with full context loaded
The drift detector compares each prompt against an anchor set (25 positive + 35 negative behavioral patterns drawn from real failure transcripts) using BGE-m3 cosine similarity. AUC 0.83 on held-out, 50ms p95 hook latency.
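As a rough illustration of anchor-based scoring, the sketch below compares a prompt vector against positive and negative anchor vectors by cosine similarity. The exact scoring formula is not given in this README, so "best positive match minus best negative match" is an assumption, as are the names `cosine` and `drift_score`. Real vectors would come from BGE-m3; toy 2-D vectors stand in here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drift_score(prompt_vec, positive_anchors, negative_anchors):
    """Assumed formula: best positive cosine minus best negative cosine.

    Negative values mean the prompt sits closer to a known failure pattern.
    """
    best_pos = max(cosine(prompt_vec, a) for a in positive_anchors)
    best_neg = max(cosine(prompt_vec, a) for a in negative_anchors)
    return best_pos - best_neg


positive = [[1.0, 0.0]]   # toy "good behavior" anchor
negative = [[0.0, 1.0]]   # toy "failure pattern" anchor
```

Under this assumed formula a prompt aligned with a negative anchor scores below zero, which is the direction a threshold such as drift_score < −0.07 would act on.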
---
Measuring drift loop closure (act-on rate)
Drift detection without ack instrumentation is an open loop · the detector fires alerts but nothing measures whether the agent (or user) actually acted on them. v3 closes this loop with a single rate metric.
The signal: every fired drift alert gets a stable alert_id and lands in .cache/drift_mitigation_log.jsonl. When the user acknowledges the alert via the feedback CLI
python ~/.claude/plugins/nautilus-compass/feedback.py log <alert_id> fp|tp
(fp = false positive · tp = true positive · either way the alert was seen and judged), a matching kind: "ack" record is appended to the same sidecar.
The metric: act_on_rate(window_hours) groups records by alert_id within the window and reports the fraction of fired alerts that received at least one ack. The legacy KPI script prints it alongside everything else:
python ~/.claude/plugins/nautilus-compass/audit_kpi.py
=== act-on rate (drift alert closure · target ≥0.70) ===
· 24h: fires=81 acked=1 rate=0.012
· 7d: fires=294 acked=1 rate=0.003
Target: ≥0.70 over rolling 7d. Below 0.30 indicates the agent is tuning out alerts (cry-wolf · cf. the open-loop write-up) · raise the firing threshold (drift/firing.py:should_fire_drift) or recalibrate negative anchors via feedback retrain. Programmatic API for CI / cron monitors:
from audit_kpi import act_on_rate
m = act_on_rate(window_hours=168)
assert m["rate"] >= 0.70, f"drift loop open · rate={m['rate']:.3f} fires={m['fires']}"
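For readers who want to see what the metric computes, here is a hedged sketch of the grouping logic described above. It is not the code in audit_kpi.py: the function name `act_on_rate_sketch`, the `ts` field, and the `"fire"` kind value are assumptions (only `alert_id` and `kind: "ack"` are named in this README), and records are passed in memory rather than read from the JSONL sidecar.

```python
from datetime import datetime, timedelta, timezone


def act_on_rate_sketch(records, window_hours, now):
    """Fraction of alerts fired in the window that received at least one ack."""
    cutoff = now - timedelta(hours=window_hours)
    fired, acked = set(), set()
    for rec in records:
        if rec["ts"] < cutoff:          # assumed timestamp field
            continue
        if rec["kind"] == "ack":
            acked.add(rec["alert_id"])
        else:
            fired.add(rec["alert_id"])
    acked &= fired                       # ignore acks for alerts outside the window
    rate = len(acked) / len(fired) if fired else 0.0
    return {"fires": len(fired), "acked": len(acked), "rate": rate}


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
sample = [
    {"alert_id": "a1", "kind": "fire", "ts": now - timedelta(hours=1)},
    {"alert_id": "a2", "kind": "fire", "ts": now - timedelta(hours=2)},
    {"alert_id": "a1", "kind": "ack", "ts": now - timedelta(minutes=30)},
    {"alert_id": "a0", "kind": "fire", "ts": now - timedelta(hours=200)},
]
m = act_on_rate_sketch(sample, window_hours=168, now=now)
```

Grouping by alert_id means several acks on one alert still count once, so the rate cannot exceed 1.0.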
---
Headline numbers
| Benchmark | Score | Honest compare |
|---|---|---|
| LongMemEval-S (n=500) | 56.6% (locked at v0.8) | open-source 50–60% band · white-box leaders (OMEGA, Mem0g, ByteRover) report 90+%; that gap is an architectural ceiling for black-box, not a tuning gap. See BLACKBOX_VS_WHITEBOX. |
| EverMemBench-Dynamic (n=500) | 44.4% (Run 1) / 47.3% (Run 2) | tops the four published Table 4 baselines (Mem0 37.09, Zep 39.97, MemOS 42.55, MemoBase 34.27). Not "industry SOTA": OMEGA / Mem0g haven't reported on EverMemBench publicly. |
| Drift detector AUC | 0.83 held-out / 0.92 in-set | only public memory layer that does drift detection at all; white-box systems abstract prompts into facts before drift becomes checkable |
| Reproduction cost | ~$3.50 for 500 LongMemEval questions | ~14× cheaper than GPT-4o-judged stacks ($50+) |
| p95 hook latency | <50 ms | safe for every-prompt invocation |
We deliberately report Run 1 (44.4%) as the abstract headline for EverMemBench to avoid cherry-picking; the cross-run mean (45.84%) clears MemOS by +3.3 pts. See paper/sections/paper2_06_5_evermembench.tex for honest dual-run + Gemini cross-judge sensitivity analysis.
Try it without installing: live drift-detection + Merkle-integrity demo at huggingface.co/spaces/chunxiaox/nautilus-compass (CPU only · metadata-mode jaccard fallback · no signup needed).
Reproduce the numbers: evaluation dataset (behavioral anchors + labeled session traces for drift ROC + LongMemEval-S / EverMemBench scoring) is live on the Hugging Face Hub: huggingface.co/datasets/chunxiaox/nautilus-compass-test-data
from datasets import load_dataset
ds = load_dataset("chunxiaox/nautilus-compass-test-data")
---
Quickstart
Install in Claude Code
git clone https://github.com/chunxiaoxx/nautilus-compass ~/.claude/plugins/nautilus-compass
bash ~/.claude/plugins/nautilus-compass/install.sh
# Start the BGE-m3 daemon (one-time per boot)
bash ~/.claude/plugins/nautilus-compass/daemon_start.sh
The installer wires three hooks into ~/.claude/settings.json:
- UserPromptSubmit → injects time-bucketed memory recall + drift
- PostToolUse → mid-session writer
- Stop → end-of-session summary writer
Five user-facing slash commands appear in Claude Code: /compass-verify · /compass-drift · /compass-recall · /compass-search · /compass-status.
Install in any other MCP client
python ~/.claude/plugins/nautilus-compass/scripts/install_to_agent.py
Auto-detects Claude Desktop, Cursor, Cline, Continue.dev, Zed Editor and patches their MCP config. See docs/AGENT_ONBOARDING.md for per-agent copy-paste configs and docs/mcp-usage.md for the raw protocol specification.
Cloud-hosted alternative (no local install)
curl https://compass.nautilus.social/.well-known/agent.json
Returns the standard A2A discovery descriptor. Sign up at compass.nautilus.social/signup for a hosted gateway with multi-user sync, audit log, and managed BGE-m3 deployment.
---
What's exposed (7 MCP tools)
| Tool | Purpose | Latency |
|---|---|---|
| ingest_obs(name, body, agent_id?) | Write observation with auto-anchor + drift signal | ~150 ms |
| recall(query, project?, top_k?) | BGE-m3 semantic + keyword search | ~200 ms |
| session_search(query, since?) | Time-bucketed session-log search | ~80 ms |
| profile(user_id?) | Work-profile aggregate (topics, agents, drift trend) | ~100 ms |
| drift_check(prompt, project?) | Black-box drift score against anchors | <50 ms |
| drift_history(since?, agent_id?) | Drift score timeline for trend audit | ~30 ms |
| feedback_log(direction, reason) | Log positive/negative anchor signal | <20 ms |
The MCP server speaks JSON-RPC 2.0 over stdio / TCP / TLS / mTLS. Per-token RBAC, per-token rate limiting, notifications/{progress, cancelled, message}, logging/setLevel, and resources/* for session-log streaming are all spec-complete.
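As an illustration of what a call looks like on the wire, the sketch below builds a JSON-RPC 2.0 `tools/call` request for drift_check in the shape defined by the MCP specification. The helper name `tools_call_request` and the sample prompt are made up for the example; a real session also needs the MCP `initialize` handshake first, which is omitted here.

```python
import json


def tools_call_request(request_id, tool, arguments):
    """Serialize one MCP tools/call request as a single-line JSON-RPC message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


line = tools_call_request(
    1, "drift_check", {"prompt": "mark the deploy as done without verifying"}
)
```

On the stdio transport each message is written as one newline-terminated line, which is why the serialized request must not contain embedded newlines.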
---
Comparison
| Capability | this | mem0 | Letta | Zep | claude-mem | MemOS | Smriti |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Cross-agent memory | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | archive-only |
| MCP A2A protocol native | ✅ TLS+mTLS+RBAC | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Drift detection | ✅ AUC 0.83 | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Merkle integrity audit log | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| LongMemEval-S verified | ✅ 56.6% (locked) | n/r | n/r | n/r | ❌ | n/r | ❌ |
| EverMemBench verified | ✅ 44.4-47.3% | 37.09 | n/r | 39.97 | n/r | 42.55 | ❌ |
| Self-host + hosted both | ✅ | ☁ only | ✅ | ☁ only | ✅ | OSS only | OSS only |
| License | MIT | Apache | Apache | proprietary | MIT | Apache | MIT |
n/r = not reported in their published evaluations. Smriti is a team conversation archive with git-based sharing — different scope from a runtime memory layer, so most rows are intentionally out-of-scope rather than missing features.
---
Platform integration · BP1 + BP3 contract
If you run the OSS plugin alongside a Nautilus-style task platform (or your own multi-agent backend), two MCP tools open a bidirectional channel without any new HTTP server:
| Tool | Direction | Purpose |
|---|---|---|
| submit_platform_task(name, channels, payload, anchor_pack_hint, priority) | compass dialog → platform | Push a task into the platform's queue. File-based by default (~/.claude/projects/_platform_queue/<id>.json); auto-promotes to HTTP POST when COMPASS_PLATFORM_QUEUE_URL is set. |
| ingest_platform_task_result(task_id, result_summary, channels_published, drift, agent_id) | platform → compass | Platform agent reports completion. Writes a JSON archive AND a session_*.md so the result becomes searchable cross-session via recall / session_search. |
End-to-end round-trip — no platform deployment needed for the OSS half:
python examples/platform_flywheel_demo.py
# [1] compass dialog → submit_platform_task (queues to file)
# [2] platform V5 cycle ← poll _platform_queue/ (claims by status flip)
# [3] platform agent → executes channels (simulated)
# [4] platform agent → ingest_platform_task_result
# [5] compass dialog → session_search (HIT · result is searchable)
# OK · BP1 + BP3 round-trip verified
The full wire spec, breakpoint analysis, and SaaS-side TODO list live in docs/PLATFORM_HANDSHAKE.md §7.
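The file-based half of the round-trip (queue a task as a JSON file, then claim it by flipping a status field) can be sketched as below. This is an illustrative assumption, not the wire format from docs/PLATFORM_HANDSHAKE.md: the field names, the `"queued"` / `"claimed"` status values, and the helpers `submit_task`, `claim_next`, and `_atomic_write` are all hypothetical.

```python
import json
import os
import uuid
from pathlib import Path


def _atomic_write(path: Path, obj: dict) -> None:
    """Write to a temp file, then rename, so readers never see partial JSON."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(obj))
    os.replace(tmp, path)


def submit_task(queue_dir: Path, name: str, channels: list, payload: dict) -> str:
    """Queue one task as <id>.json and return its id."""
    task_id = uuid.uuid4().hex
    task = {"id": task_id, "name": name, "channels": channels,
            "payload": payload, "status": "queued"}
    _atomic_write(queue_dir / f"{task_id}.json", task)
    return task_id


def claim_next(queue_dir: Path):
    """Claim the first queued task by flipping its status; None if queue is empty."""
    for path in sorted(queue_dir.glob("*.json")):
        task = json.loads(path.read_text())
        if task["status"] == "queued":
            task["status"] = "claimed"
            _atomic_write(path, task)
            return task
    return None
```

This sketch assumes a single poller. With several pollers, two of them could read the same queued file before either rewrites it, so a real deployment would need a lock or an atomic rename into a claimed directory.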
V7 governance layer (v0.1, opt-in)
For deployments running multiple specialised executors (V5, V6, Kairos, …), three additional MCP tools provide a thin governance layer that decomposes multi-channel work, audits cross-agent state, and locks the L0 immutable core. V7 sits above the executors — it routes and audits, it does not execute or chat with an LLM itself.
| Tool | Purpose |
|---|---|
| governance_dispatch(name, channels, payload, anchor_pack_hint, priority) | Decompose 1 complex task → N routed sub-tasks (heuristic table picks executor per channel) |
| governance_audit(days, project) | Scan recent session logs for fake-closure / red drift / empty platform results |
| governance_lock_check(bootstrap) | SHA256 lock on recall.py, merkle_chain.py, anchors.json, selftest.py |
python examples/v7_governance_demo.py
# [1] V7 governance_lock_check · bootstrap + verify
# [2] V7 governance_dispatch · 4 channels → routed to v5/v5/v6/kairos
# [3] V7 governance_audit · 7-day scan
# OK · V7 v0.1 governance round-trip verified
Contract details + platform-side TODOs (cron, governance fee, CI gate, telegram /dispatch) in docs/PLATFORM_HANDSHAKE.md §8.
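The idea behind a SHA256 lock on core files can be sketched as follows: record digests once (bootstrap), then report any file whose digest has changed. The function names `sha256_of` and `lock_check` and the JSON lock-file layout are assumptions for illustration; the actual governance_lock_check behavior and storage format may differ.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def lock_check(files, lock_path: Path, bootstrap: bool = False):
    """Return names of files whose digest differs from the recorded lock.

    With bootstrap=True, record the current digests and report no changes.
    """
    current = {p.name: sha256_of(p) for p in files}
    if bootstrap:
        lock_path.write_text(json.dumps(current, indent=2))
        return []
    locked = json.loads(lock_path.read_text())
    return sorted(name for name, digest in current.items()
                  if locked.get(name) != digest)
```

An empty list means the locked core is unchanged; any returned name is a file that was modified (or added) after the lock was bootstrapped.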
---
Documentation
- docs/AGENT_ONBOARDING.md: per-agent install configs (6 platforms + 3 frameworks)
- docs/mcp-usage.md: raw MCP protocol guide, TLS setup, RBAC
- docs/PLATFORM_HANDSHAKE.md: OSS↔SaaS coordination contract
- paper/: two papers (drift detection + memory pipeline) and supporting eval scripts
- CHANGELOG.md: versioned release notes
- CONTRIBUTING.md: adding new domain anchors / running benchmarks
---
Citation
If you use this work, please cite:
Paper 1 · drift detection:
@misc{nautiluscompass-drift-2026,
title = {Nautilus Compass: Black-box Persona Drift Detection
for Production LLM Agents},
author = {Chunxiao Wang},
year = {2026},
note = {Yiluo Technology Co., Ltd.},
howpublished = {\url{https://github.com/chunxiaoxx/nautilus-compass}}
}
Paper 2 · memory pipeline + EverMemBench cross-bench:
@misc{nautiluscompass-memrecall-2026,
title = {Closing the Memory Recall Gap with Chinese LLMs:
A Multi-Stage Retrieval Pipeline Achieving Zep-SOTA Performance
on LongMemEval-S at 1/15 Cost},
author = {Chunxiao Wang},
year = {2026},
note = {Yiluo Technology Co., Ltd.},
howpublished = {\url{https://github.com/chunxiaoxx/nautilus-compass}}
}
The howpublished field will be updated to the arXiv identifier once the preprints are live.
We also build on prior work — please cite as appropriate:
- BGE-m3 / BGE-Reranker (Chen et al., BAAI 2024)
- Persona Vectors (Chen et al., Anthropic, arXiv:2507.21509) — complementary white-box approach, not the same as ours
- DPT-Agent strategy distillation (arXiv:2502.11882)
- A-MEM dynamic links (arXiv:2502.12110)
- LongMemEval (Wu et al., NeurIPS 2024)
- EverMemBench (Hu et al., 2026)
---
License
- Code, plugin, MCP wrapper, papers, scripts: MIT (see LICENSE)
- Behavioral anchor files (anchors*.json): CC0 1.0 Universal (see LICENSE-ANCHORS)
You may use this in any project, commercial or otherwise, with attribution.
---
Contributors
<a href="https://github.com/chunxiaoxx/nautilus-compass/graphs/contributors"> <img src="https://contrib.rocks/image?repo=chunxiaoxx/nautilus-compass" alt="Contributors" /> </a>
PRs welcome — see CONTRIBUTING.md.
Contact
- Author: Chunxiao Wang · Yiluo Technology Co., Ltd. · chunxiaoxx@gmail.com
- Issues: github.com/chunxiaoxx/nautilus-compass/issues
- Hosted gateway: compass.nautilus.social
- Chinese documentation (中文文档): README.zh-CN.md






