Knowledge RAG
<div align="center">
Your docs, your machine, zero cloud. Claude Code searches them natively.
Drop your PDFs, markdown, code, and notebooks: 1800+ files, 39K chunks, indexed in under 3 minutes.<br/> Hybrid search (BM25 + semantic vectors + cross-encoder reranking) through 13 MCP tools.<br/> Everything runs locally via ONNX. No Docker, no Ollama, no API keys, no data leaves your machine.
pip install knowledge-rag → restart Claude Code → search_knowledge("your query")
---
13 MCP Tools | Hybrid Search + Reranking | 20 File Formats | Optional NVIDIA GPU | 100% Local
What's New | Supported Formats | Installation | Configuration | API Reference | Architecture
</div>
---
Star History
<div align="center">
<a href="https://www.star-history.com/?repos=lyonzin%2Fknowledge-rag&type=date&legend=top-left"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=lyonzin/knowledge-rag&type=date&theme=dark&legend=top-left" /> <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=lyonzin/knowledge-rag&type=date&legend=top-left" /> <img alt="Star History Chart" src="https://api.star-history.com/chart?repos=lyonzin/knowledge-rag&type=date&legend=top-left" /> </picture> </a>
</div>
---
What's New in v4.2.0
Search Performance & Output Quality (v4.2.0)
128× faster BM25 search: replaced the rank-bm25 full-corpus scan with a custom inverted-index implementation. Only documents containing query terms are scored, using numpy.argpartition for O(n) top-k selection. Adjacent chunk fetching now uses a single batched ChromaDB call instead of N round-trips, and an O(1) reverse lookup (_source_to_docid) eliminates linear scans.
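The inverted-index idea can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the class name InvertedIndexBM25, the k1/b defaults, and the whitespace tokenization are all assumptions, and heapq.nlargest from the standard library stands in for numpy.argpartition so the sketch has no third-party dependencies.

```python
import heapq
import math
from collections import Counter, defaultdict


class InvertedIndexBM25:
    """Toy BM25 scorer that only touches documents containing a query term."""

    def __init__(self, docs, k1=1.5, b=0.75):
        # k1 and b are common BM25 defaults, assumed here for illustration.
        self.k1, self.b = k1, b
        self.n_docs = len(docs)
        self.doc_len = [len(tokens) for tokens in docs]
        self.avg_len = sum(self.doc_len) / len(docs)
        # term -> list of (doc_id, term_frequency) postings
        self.postings = defaultdict(list)
        for doc_id, tokens in enumerate(docs):
            for term, tf in Counter(tokens).items():
                self.postings[term].append((doc_id, tf))

    def search(self, query_tokens, top_k=5):
        scores = defaultdict(float)
        for term in set(query_tokens):
            plist = self.postings.get(term)
            if not plist:
                continue  # term absent from the corpus: nothing to score
            df = len(plist)
            idf = math.log(1 + (self.n_docs - df + 0.5) / (df + 0.5))
            for doc_id, tf in plist:
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        # Partial selection of the best candidates instead of a full sort.
        return heapq.nlargest(top_k, scores.items(), key=lambda kv: kv[1])


docs = [
    "mimikatz dumps credentials from lsass".split(),
    "sql injection cheat sheet".split(),
    "credential dumping with mimikatz sekurlsa".split(),
]
index = InvertedIndexBM25(docs)
hits = index.search(["mimikatz", "lsass"], top_k=2)
```

The second document never appears in `scores` because none of its terms are in the query, which is where the saving over a full-corpus scan comes from.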
Smarter output: two new parameters on search_knowledge:
- snippet_mode (default: true): truncates content to ~500 characters at natural break points, reducing token consumption by ~72%. Adds a content_length field with the original size; use get_document() for full content.
- min_score: filters results below a normalized relevance threshold (0.0-1.0). Eliminates low-quality noise from results. The response includes a filtered_by_score count for transparency.
Both parameters are fully backwards-compatible (existing callers see no change in behavior).
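A rough sketch of how the two output parameters could interact is shown below. The helper names make_snippet and postprocess, and the choice of break points (paragraph, then sentence, then word), are assumptions for illustration; only the field names content_length and filtered_by_score come from the description above.

```python
def make_snippet(text, limit=500):
    """Truncate near `limit` characters, preferring a paragraph, sentence, or word break."""
    if len(text) <= limit:
        return text
    window = text[:limit]
    for sep in ("\n\n", ". ", " "):
        cut = window.rfind(sep)
        if cut > limit // 2:  # do not cut absurdly early just to hit a break
            return window[:cut + len(sep)].rstrip()
    return window  # no usable break point: hard cut


def postprocess(results, min_score=0.0, snippet_mode=True):
    """Drop results below min_score, then optionally truncate the survivors."""
    kept = [r for r in results if r["score"] >= min_score]
    out = []
    for r in kept:
        item = dict(r)
        if snippet_mode:
            item["content_length"] = len(r["content"])  # original size
            item["content"] = make_snippet(r["content"])
        out.append(item)
    return {"results": out, "filtered_by_score": len(results) - len(kept)}


raw = [
    {"content": "A sentence. " * 100, "score": 0.9},
    {"content": "short", "score": 0.1},
]
response = postprocess(raw, min_score=0.3)
```

With min_score left at 0.0 nothing is filtered, which matches the backwards-compatibility note for that parameter.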
Enterprise Concurrent Access β SSE/HTTP Transport (v4.0.0)
The server now supports SSE and streamable-http transport modes. Instead of spawning a separate process per client (stdio), a single server process serves all clients with shared resources: 1 embedding model, 1 ChromaDB, 1 query cache.
# config.yaml
server:
  transport: "sse" # "stdio" | "sse" | "streamable-http"
  host: "127.0.0.1"
  port: 8179
Or via CLI: knowledge-rag --transport sse
Optional enterprise features (all disabled by default):
- Rate limiting: Sliding-window counter, configurable RPM and burst
- Prometheus metrics: /metrics endpoint on a separate port
- Bearer auth: Token validation for SSE/HTTP connections
All 13 MCP tools are instrumented with @rate_limited and @instrument decorators, with zero overhead when features are disabled. Default transport remains stdio for full backwards compatibility.
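To make the rate-limiting idea concrete, here is a minimal sliding-window sketch. It is not the project's code: the SlidingWindowLimiter class and this rate_limited signature are hypothetical, burst handling is omitted, and the sketch keeps a deque of timestamps (the sliding-window log variant) rather than an approximate counter. It is also not thread-safe as written.

```python
import time
from collections import deque
from functools import wraps


class SlidingWindowLimiter:
    """Allow at most max_calls within any window_seconds span."""

    def __init__(self, max_calls, window_seconds=60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls = deque()

    def allow(self):
        now = self.clock()
        # Drop timestamps that have slid out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


def rate_limited(limiter):
    """Hypothetical decorator: with limiter=None the function is returned untouched."""
    def decorate(fn):
        if limiter is None:
            return fn  # feature disabled: no wrapper, no per-call cost

        @wraps(fn)
        def wrapper(*args, **kwargs):
            if not limiter.allow():
                return {"status": "error", "error": "rate_limited"}
            return fn(*args, **kwargs)
        return wrapper
    return decorate
```

Returning the undecorated function when the limiter is absent is one simple way to get the "no overhead when disabled" property.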
Migration: Existing users need zero changes. SSE mode is opt-in via server.transport: "sse" in config.yaml. See Configuration for details.
Quality Gate: 7-Pillar PR Validation
Every PR (including dependabot bumps and one-line fixes) is now evaluated against 35+ automated checks spread across 7 pillars before any human review:
| Pillar | What it enforces | Tools |
|---|---|---|
| 1 Security | SAST, secrets, CVEs, supply chain | bandit, semgrep, gitleaks, pip-audit, dependency-review, Snyk, CodeQL, Socket |
| 2 Stability | Flake detection, coverage trend, test count, deterministic runs | pytest-rerunfailures, codecov ±0.5pp, test-count guard |
| 3 Memory Leak | RSS bounded under 1000-query load, no idle bloat | psutil-based baseline tests + nightly 50K-iteration soak |
| 4 Versatility | 9 OS×Python combos, 14 format parsers, 4 config presets, locale tolerance, property-based fuzzing | matrix CI on Linux+Windows+macOS × 3.11+3.12+3.13, Hypothesis |
| 5 Scalability | Performance regression > 10% blocks merge, public bench dashboard | pytest-benchmark, GH Pages chart |
| 6 Versioning | Atomic version sync, API surface diff, conventional commits, CHANGELOG enforcement, backwards compat | griffe-style AST diff, custom guards |
| 7 Quality | Type strictness, docstring coverage, complexity, dead code | mypy strict, interrogate ≥80%, radon, vulture |

Plus a nightly resilience workflow that runs chaos failure-injection (HF down, ChromaDB corruption, watchdog crash, ONNX zero-byte replay), a determinism check (full suite × 3), and mutation testing on selected modules.
Read the full philosophy in CONTRIBUTING.md. Report bugs via SECURITY.md or the issue templates.
Critical Hotfix: No More Silent Zero-Vector Corruption (v3.8.1)
FastEmbedEmbeddings.__call__ no longer swallows exceptions and returns [[0.0]*dim, ...] when the ONNX model fails to load. That bug pre-existed in master but was silent: ChromaDB happily stored zero embeddings, count() reported normal numbers, smart-reindex skipped them as "already indexed", and queries returned garbage similarity with no error visible. Now raises EmbeddingModelLoadError / EmbeddingError loudly. All v3.8.0 users should upgrade. Full details in Changelog.
Lazy-Loaded Embeddings: Cheaper Idle Processes (v3.8.0)
The FastEmbed ONNX model (~200MB resident) now loads on the first query, not at startup, so idle knowledge-rag processes are genuinely cheap. Why this matters: MCP stdio is one-process-per-client by protocol. Multiple Claude Code windows, Claude Desktop plus an IDE running simultaneously, or review/approval flows that open extra connections all spawn their own processes. Before v3.8.0, every one of them paid the full embedding-model cost up front. Now only processes that actually serve queries load the model. Public API is unchanged.
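The lazy-loading pattern described above can be sketched as follows. LazyModel and fake_loader are hypothetical names, and the loader here returns a stand-in function rather than a real FastEmbed model; the point is only that construction is free and the first call to get() pays the load cost exactly once.

```python
import threading


class LazyModel:
    """Defer an expensive load until the first real use."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    @property
    def loaded(self):
        return self._model is not None

    def get(self):
        if self._model is None:
            with self._lock:
                # Re-check inside the lock so concurrent first calls load once.
                if self._model is None:
                    self._model = self._loader()
        return self._model


load_count = []


def fake_loader():
    """Stand-in for loading an ONNX embedding model (assumption for the demo)."""
    load_count.append(1)
    return lambda texts: [[float(len(t))] for t in texts]


embedder = LazyModel(fake_loader)   # cheap: nothing is loaded yet
```

An idle process that never calls `embedder.get()` never runs the loader at all.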
Opt-In Single-Instance Guard (v3.8.0)
For users who measured their setup and want a hard cap of one server per data_dir:
export KNOWLEDGE_RAG_SINGLE_INSTANCE=1
A second instance exits immediately with code 75. OFF by default so multi-client MCP usage continues to work unchanged. Stale-PID recovery + SIGINT/SIGTERM cleanup wired correctly. Full guide in docs/single-instance.md. Sample MCP config in examples/mcp-config-single-instance.json.
5 Ways to Install
npx -y knowledge-rag # NPM: zero setup, auto-manages Python venv
pip install knowledge-rag # PyPI: classic Python install
curl -fsSL .../install.sh | bash # One-line installer (Linux/macOS/Windows)
docker pull ghcr.io/lyonzin/knowledge-rag # Docker: models pre-downloaded
git clone ... && pip install -r ... # From source
All methods produce the same MCP server. See Installation for full instructions.
Recent Highlights
- v4.0.0: Enterprise concurrent access with SSE/HTTP transport (1 server → N clients), thread-safe shared state, optional rate limiting + Prometheus metrics, ChromaDB WAL mode, --transport CLI
- v3.9.0: Quality Gate activated, with 35+ automated PR checks across 7 pillars (Security, Stability, Memory Leak, Versatility, Scalability, Versioning, Quality) + nightly resilience suite (chaos, soak, determinism, mutation)
- v3.8.1: Critical hotfix for loud-fail embeddings (no more silent zero-vector corruption); Windows CI flake eradicated (HF_HUB_OFFLINE + shell:bash + atexit wrapper)
- v3.8.0: Lazy-load embeddings, opt-in single-instance guard, version sync across PyPI/NPM/Docker
- v3.6.0: Multi-language code parsing (C/C++/JS/TS/XML), NPM wrapper, Docker image, automated release pipeline
- v3.5.2: CUDA DLL auto-discovery from pip packages, graceful GPU→CPU fallback, explicit CPU provider (no CUDA noise when gpu: false), BASE_DIR resolution fix for editable installs
- v3.5.1: Removed the Python <3.13 upper bound; 3.13 and 3.14 are now supported
- v3.5.0: Optional GPU acceleration, supported formats table, full README rewrite
- v3.4.3: MCP stdout save/restore fix (v3.4.2 broke JSON-RPC responses)
- v3.4.0: Persistent model cache, exclude patterns, Jupyter Notebook parser, inotify resilience, MetaTrader support
See Changelog for full history.
---
Supported Formats
| Format | Extension | Parser | Default | Notes |
|--------|-----------|--------|---------|-------|
| Markdown | .md | Section-aware (splits at ##) | Yes | Headers preserved as chunk boundaries |
| Plain Text | .txt | Fixed-size chunking | Yes | 1000 chars + 200 overlap |
| PDF | .pdf | PyMuPDF extraction | Yes | Text-based PDFs only (no OCR) |
| Python | .py | Code-aware parser | Yes | Functions/classes as chunks |
| JSON | .json | Structure-aware | Yes | Flattened key-value extraction |
| CSV | .csv | Row-based parser | Yes | Headers + rows as text |
| Word | .docx | python-docx | Yes | Headings preserved as markdown |
| Excel | .xlsx | openpyxl | Yes | Sheet-by-sheet extraction |
| PowerPoint | .pptx | python-pptx | Yes | Slide-by-slide extraction |
| Jupyter Notebook | .ipynb | Cell-aware parser | Yes | Markdown + code cells only, no outputs/base64 |
| C Source | .c | Code-aware parser | Yes | Functions/structs/includes extracted |
| C/C++ Header | .h | Code-aware parser | Yes | Function declarations/structs extracted |
| C++ Source | .cpp | Code-aware parser | Yes | Classes/structs/includes extracted |
| JavaScript | .js | Code-aware parser | Yes | Functions/classes/imports (ESM + CJS) |
| React JSX | .jsx | Code-aware parser | Yes | Same as JS parser |
| TypeScript | .ts | Code-aware parser | Yes | Functions/classes/interfaces/enums/imports |
| React TSX | .tsx | Code-aware parser | Yes | Same as TS parser |
| XML | .xml | XML parser | Yes | Root element and namespace extraction |
| MQL4 Header | .mqh | Code parser | No | MetaTrader; add to supported_formats to enable |
| MQL4 Source | .mq4 | Code parser | No | MetaTrader; add to supported_formats to enable |

Tip: The parser dispatch is extensible. Any format mapped in _parsers can be enabled via supported_formats in config.yaml.
---
Features
| Feature | Description |
|---------|-------------|
| Hybrid Search | Semantic + BM25 keyword search with Reciprocal Rank Fusion |
| Cross-Encoder Reranker | Xenova/ms-marco-MiniLM-L-6-v2 re-scores top candidates for precision |
| GPU Acceleration | Optional ONNX CUDA support for 5-10x faster indexing |
| YAML Configuration | Fully customizable via config.yaml with domain-specific presets |
| Query Expansion | Configurable synonym mappings (69 security-term defaults) |
| Markdown-Aware Chunking | .md files split by ##/### sections instead of fixed windows |
| In-Process Embeddings | FastEmbed ONNX Runtime (BAAI/bge-small-en-v1.5, 384D) |
| Keyword Routing | Word-boundary aware routing for domain-specific queries |
| 20 Format Parsers | MD, TXT, PDF, PY, C, H, CPP, JS, JSX, TS, TSX, JSON, XML, CSV, DOCX, XLSX, PPTX, IPYNB + opt-in MQH/MQ4 |
| Category Organization | Organize docs by folder, auto-tagged by path |
| Incremental Indexing | Change detection via mtime/size; only modified files are re-indexed |
| Chunk Deduplication | SHA256 content hashing prevents duplicate chunks |
| Query Cache | LRU cache with 5-min TTL for instant repeat queries |
| Document CRUD | Add, update, remove documents via MCP tools |
| URL Ingestion | Fetch URLs, strip HTML, convert to markdown, index |
| Similarity Search | Find documents similar to a reference document |
| Retrieval Evaluation | Built-in MRR@5 and Recall@5 metrics |
| File Watcher | Auto-reindex on document changes via watchdog (5s debounce) |
| Exclude Patterns | Glob-based file/directory exclusion during indexing |
| MMR Diversification | Maximal Marginal Relevance reduces redundant results |
| Persistent Model Cache | Embedding models cached in models_cache/ and kept across reboots |
| Auto-Migration | Detects embedding dimension mismatch and rebuilds automatically |
| 13 MCP Tools | Full CRUD + search + evaluation via Claude Code |
---
Architecture
System Overview
flowchart TB
subgraph MCP["MCP SERVER (FastMCP)"]
direction TB
TOOLS["13 MCP Tools<br/>search | get | add | update | remove<br/>reindex | reindex_status | list | stats | url | similar | evaluate"]
end
subgraph SEARCH["HYBRID SEARCH ENGINE"]
direction LR
ROUTER["Keyword Router<br/>(word boundaries)"]
SEMANTIC["Semantic Search<br/>(ChromaDB)"]
BM25["BM25 Keyword<br/>(inverted-index + expansion)"]
RRF["Reciprocal Rank<br/>Fusion (RRF)"]
RERANK["Cross-Encoder<br/>Reranker"]
ROUTER --> SEMANTIC
ROUTER --> BM25
SEMANTIC --> RRF
BM25 --> RRF
RRF --> RERANK
end
subgraph STORAGE["STORAGE LAYER"]
direction LR
CHROMA[("ChromaDB<br/>Vector Database")]
COLLECTIONS["Collections<br/>security | ctf<br/>logscale | development"]
CHROMA --- COLLECTIONS
end
subgraph EMBED["EMBEDDINGS (In-Process)"]
FASTEMBED["FastEmbed ONNX<br/>BAAI/bge-small-en-v1.5<br/>(384D, CPU or GPU)"]
CROSSENC["Cross-Encoder<br/>ms-marco-MiniLM-L-6-v2"]
FASTEMBED --- CROSSENC
end
subgraph INGEST["DOCUMENT INGESTION"]
PARSERS["20 Parsers<br/>MD | PDF | TXT | PY | C | H | CPP | JS | JSX | TS | TSX | JSON | XML | CSV<br/>DOCX | XLSX | PPTX | IPYNB | MQH | MQ4"]
CHUNKER["Chunking<br/>MD: section-aware<br/>Other: 1000 chars + 200 overlap"]
PARSERS --> CHUNKER
end
CLAUDE["Claude Code"] --> MCP
MCP --> SEARCH
SEARCH --> STORAGE
STORAGE --> EMBED
INGEST --> EMBED
EMBED --> STORAGE
Query Processing Flow
flowchart TB
QUERY["User Query<br/>'mimikatz credential dump'"] --> EXPAND
subgraph EXPANSION["Query Expansion"]
EXPAND["Synonym Expansion<br/>mimikatz -> mimikatz, sekurlsa, logonpasswords"]
end
EXPAND --> ROUTER
subgraph ROUTING["Keyword Routing"]
ROUTER["Keyword Router"]
MATCH{"Word Boundary<br/>Match?"}
CATEGORY["Filter: redteam"]
NOFILTER["No Filter"]
ROUTER --> MATCH
MATCH -->|Yes| CATEGORY
MATCH -->|No| NOFILTER
end
subgraph HYBRID["Hybrid Search"]
direction LR
SEMANTIC["Semantic Search<br/>(ChromaDB embeddings)<br/>Conceptual similarity"]
BM25["BM25 Inverted-Index<br/>(posting lists + numpy top-k)<br/>Exact term matching"]
end
subgraph FUSION["Result Fusion + Reranking"]
RRF["Reciprocal Rank Fusion<br/>score = alpha * 1/(k+rank_sem)<br/>+ (1-alpha) * 1/(k+rank_bm25)"]
RERANK["Cross-Encoder Reranker<br/>Re-scores top 3x candidates<br/>query+doc pair scoring"]
SORT["Sort by Reranker Score<br/>Normalize to 0-1"]
ADJ["Adjacent Chunk Expansion<br/>(batch fetch ±1 chunk)"]
RRF --> RERANK --> SORT --> ADJ
end
subgraph OUTPUT["Output Processing"]
MINSCORE["min_score Filter<br/>(discard below threshold)"]
SNIPPET["snippet_mode Truncation<br/>(~500 chars at natural break)"]
MINSCORE --> SNIPPET
end
CATEGORY --> HYBRID
NOFILTER --> HYBRID
SEMANTIC --> RRF
BM25 --> RRF
ADJ --> MINSCORE
SNIPPET --> RESULTS["Results<br/>search_method: hybrid|semantic|keyword<br/>score + filtered_by_score + content_length"]
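The fusion formula in the diagram above, score = alpha * 1/(k+rank_sem) + (1-alpha) * 1/(k+rank_bm25), can be written out as a short sketch. The function name rrf_fuse is hypothetical, k=60 is an assumed constant (a common choice for RRF; the value used by the project is not stated here), and a document missing from one ranking simply contributes nothing for that term.

```python
def rrf_fuse(semantic_ids, bm25_ids, alpha=0.3, k=60):
    """Weighted Reciprocal Rank Fusion over two ranked lists of document ids."""
    sem_rank = {doc: i for i, doc in enumerate(semantic_ids, start=1)}
    kw_rank = {doc: i for i, doc in enumerate(bm25_ids, start=1)}
    fused = {}
    for doc in sem_rank.keys() | kw_rank.keys():
        score = 0.0
        if doc in sem_rank:
            score += alpha / (k + sem_rank[doc])
        if doc in kw_rank:
            score += (1 - alpha) / (k + kw_rank[doc])
        fused[doc] = score
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


semantic = ["a", "b"]   # best semantic match first
keyword = ["b", "c"]    # best BM25 match first
balanced = rrf_fuse(semantic, keyword, alpha=0.5)
```

With alpha=0.0 only the BM25 ranking contributes and with alpha=1.0 only the semantic ranking does, which matches the two ends of the hybrid_alpha scale.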
Document Ingestion Flow
flowchart LR
subgraph INPUT["Input"]
FILES["documents/<br/>├── security/<br/>├── development/<br/>├── ctf/<br/>└── general/"]
end
subgraph PARSE["Parse (20 formats)"]
MD["Markdown"]
PDF["PDF<br/>(PyMuPDF)"]
OFFICE["DOCX | XLSX<br/>PPTX | CSV"]
CODE["PY | C | H | CPP | JS | JSX<br/>TS | TSX | JSON | XML | IPYNB"]
end
subgraph CHUNK["Chunk"]
MDSPLIT["MD: Section-Aware<br/>Split at ## headers"]
TXTSPLIT["Other: Fixed-Size<br/>1000 chars + 200 overlap"]
DEDUP["SHA256 Dedup<br/>Skip duplicate content"]
end
subgraph EMBED["Embed"]
FASTEMBED["FastEmbed ONNX<br/>bge-small-en-v1.5<br/>(384D, CPU or GPU)"]
end
subgraph STORE["Store"]
CHROMADB[("ChromaDB")]
BM25IDX["BM25 Index"]
end
FILES --> MD & PDF & OFFICE & CODE
MD --> MDSPLIT
PDF & OFFICE & CODE --> TXTSPLIT
MDSPLIT --> DEDUP
TXTSPLIT --> DEDUP
DEDUP --> EMBED
EMBED --> STORE
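The fixed-size chunking and SHA256 dedup steps in the flow above can be sketched like this. The function names chunk_text and dedup_chunks are hypothetical; the 1000/200 defaults are the ones quoted in the diagram, and the sketch ignores the section-aware Markdown path entirely.

```python
import hashlib


def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def dedup_chunks(chunks):
    """Keep the first occurrence of each chunk, compared by SHA256 of its content."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)
```

For a 2500-character input this yields windows starting at offsets 0, 800, and 1600, so each window repeats the last 200 characters of the previous one.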
hybrid_alpha Parameter Effect
flowchart LR
subgraph ALPHA["hybrid_alpha values"]
A0["0.0<br/>Pure BM25<br/>Instant"]
A3["0.3 (default)<br/>Keyword-heavy<br/>Fast"]
A5["0.5<br/>Balanced"]
A7["0.7<br/>Semantic-heavy"]
A10["1.0<br/>Pure Semantic"]
end
subgraph USE["Best For"]
U0["CVEs, tool names<br/>exact matches"]
U3["Technical queries<br/>specific terms"]
U5["General queries"]
U7["Conceptual queries<br/>related topics"]
U10["'How to...' questions<br/>conceptual search"]
end
A0 --- U0
A3 --- U3
A5 --- U5
A7 --- U7
A10 --- U10
---
Installation
Prerequisites
- Python 3.11+
- Claude Code CLI
- …or any other MCP client (Claude Desktop, Cursor, VS Code, Antigravity, opencode, Windsurf); see Use with other MCP clients
- ~200MB disk for model cache (auto-downloaded on first run)
- Optional: NVIDIA GPU + CUDA 12 for accelerated embeddings (see GPU Acceleration below)
GPU Acceleration
GPU mode accelerates embedding generation during indexing and search. It requires an NVIDIA GPU with CUDA 12 support. No GPU? No problem: the server runs on CPU by default and GPU is entirely optional.
Requirements:
| Component | Minimum | How to check / get it |
|-----------|---------|----------------------|
| NVIDIA GPU (Turing+) | RTX 20xx / 30xx / 40xx / 50xx, or Tesla T4+ | nvidia-smi |
| NVIDIA Driver | ≥ 525 | nvidia-smi → nvidia.com/drivers |
| CUDA 12 runtime | Provided by pip packages below | Automatic |
Setup (2 steps):
# 1. Install GPU dependencies (onnxruntime-gpu + all CUDA 12 runtime DLLs)
pip install knowledge-rag[gpu]
# 2. Enable in config.yaml
# models:
# embedding:
# gpu: true
The [gpu] extra installs onnxruntime-gpu plus the NVIDIA CUDA 12 runtime packages (cublas, cudnn, cuda-runtime, cufft, cusparse, cusolver, curand, nvjitlink), so you don't need a full CUDA Toolkit install.
Verify GPU is active:
On server startup, look for the GPU status banner:
============================================================
GPU STATUS: ACTIVE
Provider: CUDAExecutionProvider
Device: NVIDIA GeForce RTX 3080 Ti
VRAM: 12.0 GB
============================================================
Or programmatically:
python -c "import onnxruntime; print(onnxruntime.get_available_providers())"
# Should include: 'CUDAExecutionProvider'
Fallback: If CUDA is unavailable at runtime (wrong driver, missing DLLs, no GPU), the server falls back to CPU automatically with a [WARN] log; it never crashes. The gpu: true config is a preference, not a requirement.
Install Methods
Pick one; all produce the same running server.
Option A: NPX (fastest)
Requires Node.js 16+. Handles Python venv, pip install, and version upgrades automatically.
claude mcp add knowledge-rag -s user -- npx -y knowledge-rag
That's it. On first run, npx creates a venv at ~/.knowledge-rag/, installs the PyPI package, and starts the MCP server. Subsequent runs reuse the cached venv.
Option B: One-line installer
# Linux/macOS:
curl -fsSL https://raw.githubusercontent.com/lyonzin/knowledge-rag/master/install.sh | bash
# Windows (PowerShell):
irm https://raw.githubusercontent.com/lyonzin/knowledge-rag/master/install.ps1 | iex
Then configure Claude Code:
claude mcp add knowledge-rag -s user -- ~/knowledge-rag/venv/bin/python -m mcp_server.server
Windows:
claude mcp add knowledge-rag -s user -- %USERPROFILE%\knowledge-rag\venv\Scripts\python.exe -m mcp_server.server
Option C: pip install
mkdir ~/knowledge-rag && cd ~/knowledge-rag
python3 -m venv venv && source venv/bin/activate
pip install knowledge-rag
knowledge-rag init # Exports config template, presets, creates documents/
Then configure Claude Code:
claude mcp add knowledge-rag -s user -- ~/knowledge-rag/venv/bin/python -m mcp_server.server
Windows users: Use python instead of python3, and venv\Scripts\activate instead of source venv/bin/activate. Windows path: claude mcp add knowledge-rag -s user -- %USERPROFILE%\knowledge-rag\venv\Scripts\python.exe -m mcp_server.server
Option D: Clone from source
git clone https://github.com/lyonzin/knowledge-rag.git ~/knowledge-rag
cd ~/knowledge-rag
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
Then configure Claude Code:
claude mcp add knowledge-rag -s user -- ~/knowledge-rag/venv/bin/python -m mcp_server.server
Option E: Docker
docker pull ghcr.io/lyonzin/knowledge-rag:latest
claude mcp add knowledge-rag -s user -- \
docker run -i --rm \
-v ~/knowledge-rag/documents:/app/documents \
-v ~/knowledge-rag/data:/app/data \
ghcr.io/lyonzin/knowledge-rag:latest
Models are pre-downloaded in the image, so there is no first-run delay.
<details> <summary>Alternative: manual JSON config</summary>
Add to ~/.claude.json:
Windows:
{
  "mcpServers": {
    "knowledge-rag": {
      "command": "C:\\Users\\YOUR_USER\\knowledge-rag\\venv\\Scripts\\python.exe",
      "args": ["-m", "mcp_server.server"]
    }
  }
}
Linux / macOS:
{
  "mcpServers": {
    "knowledge-rag": {
      "command": "/home/YOUR_USER/knowledge-rag/venv/bin/python",
      "args": ["-m", "mcp_server.server"]
    }
  }
}
Replace YOUR_USER with your username, or use the full path from echo $HOME.
</details>
Option F: SSE Server Mode (multi-agent)
For multi-agent setups where multiple clients query the same knowledge base simultaneously:
pip install knowledge-rag[server] # Adds uvicorn for SSE/HTTP
knowledge-rag --transport sse # Starts on http://127.0.0.1:8179
Then configure each MCP client to connect via SSE:
{
"mcpServers": {
"knowledge-rag": {
"type": "sse",
"url": "http://127.0.0.1:8179/sse"
}
}
}
One server process serves all agents: shared embedding model, shared cache, shared ChromaDB. See Configuration > Server for rate limiting, metrics, and auth options.
Use with other MCP clients
knowledge-rag supports both stdio (default, 1:1) and SSE (1:N) transport modes. In stdio mode, it works with any MCP-compatible client, not only Claude Code. The launch command is the same everywhere (the python -m mcp_server.server from whichever install method you picked); only the config file location and JSON shape differ per client.
Clients using the standard mcpServers format
For Claude Desktop, Cursor, Antigravity, and Windsurf, use the same block; only the file location changes:
{
"mcpServers": {
"knowledge-rag": {
"command": "/home/YOUR_USER/knowledge-rag/venv/bin/python",
"args": ["-m", "mcp_server.server"]
}
}
}
Windows: set command to the full path of venv\Scripts\python.exe.

| Client | Config file | Notes |
|---|---|---|
| Claude Code | use claude mcp add … (see install methods above) | The CLI writes ~/.claude.json for you; manual edits to it aren't reliably picked up. |
| Claude Desktop | macOS: ~/Library/Application Support/Claude/claude_desktop_config.json · Windows: %APPDATA%\Claude\claude_desktop_config.json | Easiest: Settings → Developer → Edit Config opens the correct file (avoids the Windows Store/MSIX path quirk). |
| Cursor | ~/.cursor/mcp.json (global) or .cursor/mcp.json (per project) | - |
| Antigravity | macOS/Linux: ~/.gemini/antigravity/mcp_config.json · Windows: %USERPROFILE%\.gemini\antigravity\mcp_config.json | Open via Agent panel → "…" → Manage MCP Servers → View raw config. |
| Windsurf | ~/.codeium/windsurf/mcp_config.json (global only) | Easiest: Cascade panel → MCP → View raw config. |

VS Code: uses a servers key
VS Code (Copilot MCP) nests servers under servers, not mcpServers. Put this in .vscode/mcp.json (workspace) or the file opened by the MCP: Open User Configuration command:
{
"servers": {
"knowledge-rag": {
"type": "stdio",
"command": "/home/YOUR_USER/knowledge-rag/venv/bin/python",
"args": ["-m", "mcp_server.server"]
}
}
}
opencode: uses an mcp key
opencode nests servers under mcp, takes command as a single array, and uses environment instead of env. Put this in opencode.json (project root) or ~/.config/opencode/opencode.json (global):
{
"$schema": "https://opencode.ai/config.json",
"mcp": {
"knowledge-rag": {
"type": "local",
"command": ["/home/YOUR_USER/knowledge-rag/venv/bin/python", "-m", "mcp_server.server"],
"enabled": true
}
}
}
Any other MCP client: point it at the same command + args (…/venv/bin/python -m mcp_server.server). If it speaks stdio MCP, knowledge-rag works; only the config file's location and key naming differ. Check your client's docs for the exact path.
Verify
claude mcp list
On first start, the server will:
- Download the embedding model (~50MB, cached in models_cache/)
- Auto-index any documents in the documents/ directory
- Start watching for file changes (auto-reindex)
---
Usage
Adding Documents
Place your documents in the documents/ directory, organized by category:
documents/
├── security/ # Pentest, exploit, vulnerability docs
├── development/ # Code, APIs, frameworks
├── ctf/ # CTF writeups and methodology
├── logscale/ # LogScale/LQL documentation
└── general/ # Everything else
Or add documents programmatically via MCP tools:
# Add from content
add_document(
content="# My Document\n\nContent here...",
filepath="security/my-technique.md",
category="security"
)
# Add from URL
add_from_url(
url="https://example.com/article",
category="security",
title="Custom Title"
)
Searching
Claude uses the RAG system automatically when configured. You can also control search behavior:
# Pure keyword search: instant, no embedding needed
search_knowledge("gtfobins suid", hybrid_alpha=0.0)
# Keyword-heavy (default): fast, slight semantic boost
search_knowledge("mimikatz", hybrid_alpha=0.3)
# Balanced hybrid: both engines equally weighted
search_knowledge("SQL injection techniques", hybrid_alpha=0.5)
# Semantic-heavy: better for conceptual queries
search_knowledge("how to escalate privileges", hybrid_alpha=0.7)
# Pure semantic: embedding similarity only
search_knowledge("lateral movement strategies", hybrid_alpha=1.0)
Indexing
Documents are automatically indexed on first startup. All reindex operations run in the background: they return immediately, and you poll progress via get_reindex_status():
# Incremental: only re-index changed files (fast)
reindex_documents()
# Smart reindex: detect changes + rebuild BM25
reindex_documents(force=True)
# Nuclear rebuild: delete everything, re-embed all (use after model change)
reindex_documents(full_rebuild=True)
# Poll progress (lightweight, no full stats computation)
get_reindex_status()
# → {"reindex": {"active": true, "percent": 56, "progress": "2090/3734", ...}}
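The start-then-poll pattern above could be wrapped in a small helper. This is a sketch under assumptions: wait_for_reindex and its get_status argument are hypothetical, with get_status standing for whatever callable your client uses to invoke the get_reindex_status tool and return its parsed JSON.

```python
import time


def wait_for_reindex(get_status, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until reindex.active is false, then return the last status payload."""
    waited = 0.0
    while True:
        status = get_status()
        if not status.get("reindex", {}).get("active", False):
            return status
        if waited >= timeout:
            raise TimeoutError("reindex still running after %.0fs" % timeout)
        sleep(interval)
        waited += interval


# Demo with canned responses instead of a live server.
responses = iter([
    {"status": "success", "reindex": {"active": True, "percent": 56}},
    {"status": "success", "reindex": {"active": False}},
])
slept = []
final = wait_for_reindex(lambda: next(responses), interval=1.0, sleep=slept.append)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.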
Evaluating Retrieval Quality
evaluate_retrieval(test_cases='''[
{"query": "sql injection", "expected_filepath": "security/sqli-guide.md"},
{"query": "privilege escalation", "expected_filepath": "security/privesc.md"}
]''')
# Returns: MRR@5, Recall@5, per-query results
---
API Reference
Search & Query
search_knowledge
Hybrid search combining semantic search + BM25 keyword search with cross-encoder reranking.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| query | string | required | Search query text (1-3 keywords recommended) |
| max_results | int | 5 | Maximum results to return (1-20) |
| category | string | null | Filter by category |
| hybrid_alpha | float | 0.3 | Balance: 0.0 = keyword only, 1.0 = semantic only |
| min_score | float | 0.0 | Minimum relevance score (0.0-1.0) to include a result. Use 0.2-0.4 to cut noise |
| snippet_mode | bool | true | Truncate content to ~500 chars at natural break points. Adds content_length field |
Returns:
{
"status": "success",
"query": "mimikatz credential dump",
"hybrid_alpha": 0.5,
"result_count": 3,
"filtered_by_score": 2,
"cache_hit_rate": "0.0%",
"results": [
{
"content": "Mimikatz can extract credentials from memory...",
"source": "documents/security/credential-attacks.md",
"filename": "credential-attacks.md",
"category": "security",
"score": 0.9823,
"raw_rrf_score": 0.016393,
"reranker_score": 0.987654,
"semantic_rank": 2,
"bm25_rank": 1,
"search_method": "hybrid",
"keywords": ["mimikatz", "credential", "lsass"],
"routed_by": "redteam"
}
]
}
Search Method Values:
- hybrid: Found by both semantic and BM25 search (highest confidence)
- semantic: Found only by semantic search
- keyword: Found only by BM25 keyword search
---
get_document
Retrieve the full content of a specific document.
| Parameter | Type | Description |
|-----------|------|-------------|
| filepath | string | Path to the document file |
Returns: JSON with document content, metadata, keywords, and chunk count.
---
reindex_documents
Index or reindex all documents in the knowledge base. Runs in the background and returns immediately. Poll progress via get_reindex_status().

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| force | bool | false | Smart reindex: detects changes, rebuilds BM25. Fast. |
| full_rebuild | bool | false | Nuclear rebuild: deletes everything, re-embeds all documents. Use after model change. |
Returns: {"status": "started", "operation": "..."} immediately. If already running, returns {"status": "already_running", "progress": "1200/3734"}.
---
get_reindex_status
Get the current status of a background reindex operation. Lightweight: does not compute full index statistics.
Returns (active):
{
  "status": "success",
  "reindex": {
    "active": true,
    "operation": "nuclear_rebuild",
    "progress": "1200/3734",
    "percent": 32,
    "indexed": 1200,
    "skipped": 0,
    "errors": 0,
    "started_at": "2026-06-17T18:29:49"
  }
}
Returns (idle): {"status": "success", "reindex": {"active": false}}
---
list_categories
List all document categories with their document counts.
Returns:
{
"status": "success",
"categories": {
"security": 52,
"development": 8,
"ctf": 12,
"general": 3
},
"total_documents": 75
}
---
list_documents
List all indexed documents, optionally filtered by category.
| Parameter | Type | Description |
|-----------|------|-------------|
| category | string | Optional category filter |
Returns: JSON array of documents with id, source, category, format, chunks, and keywords.
---
get_index_stats
Get statistics about the knowledge base index.
Returns:
{
"status": "success",
"stats": {
"total_documents": 75,
"total_chunks": 9256,
"categories": {"security": 52, "development": 8},
"supported_formats": [".md", ".txt", ".pdf", ".py", ".json", ".docx", ".xlsx", ".pptx", ".csv", ".ipynb"],
"embedding_model": "BAAI/bge-small-en-v1.5",
"embedding_dim": 384,
"reranker_model": "Xenova/ms-marco-MiniLM-L-6-v2",
"chunk_size": 1000,
"chunk_overlap": 200,
"query_cache": {
"size": 12,
"max_size": 100,
"ttl_seconds": 300,
"hits": 45,
"misses": 23,
"hit_rate": "66.2%"
}
}
}
---
Document Management
add_document
Add a new document to the knowledge base from raw content. Saves the file to the documents directory and indexes it immediately.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| content | string | required | Full text content of the document |
| filepath | string | required | Relative path within documents dir (e.g., security/new-technique.md) |
| category | string | "general" | Document category |
---
update_document
Update an existing document. Removes old chunks from the index and re-indexes with new content.
| Parameter | Type | Description |
|-----------|------|-------------|
| filepath | string | Full path to the document file |
| content | string | New content for the document |
---
remove_document
Remove a document from the knowledge base index. Optionally deletes the file from disk.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| filepath | string | required | Path to the document file |
| delete_file | bool | false | If true, also delete the file from disk |
---
add_from_url
Fetch content from a URL, strip HTML (scripts, styles, nav, footer, header), convert to markdown, and add to the knowledge base.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| url | string | required | URL to fetch content from |
| category | string | "general" | Document category |
| title | string | null | Custom title (auto-detected from <title> tag if not provided) |
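The HTML-stripping step can be approximated with the standard library alone. This is only a sketch of the idea, not the tool's implementation: TextExtractor and html_to_text are hypothetical names, the sketch does not fetch the URL, and it emits plain text blocks rather than real markdown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/nav/footer/header subtrees."""

    SKIP = {"script", "style", "nav", "footer", "header"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.in_title = False
        self.title = ""
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data.strip()
        elif not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return (title, body_text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, "\n\n".join(parser.parts)


page = (
    "<html><head><title>Guide</title><style>p{}</style></head>"
    "<body><nav>Menu</nav><h1>Intro</h1><p>Body text.</p>"
    "<script>var x=1;</script><footer>(c)</footer></body></html>"
)
title, body = html_to_text(page)
```

The extracted title plays the role of the auto-detected title when no custom title is passed.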
---
search_similar
Find documents similar to a given document using embedding similarity.
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| filepath | string | required | Path to the reference document |
| max_results | int | 5 | Number of similar documents to return (1-20) |
---
evaluate_retrieval
Evaluate retrieval quality with test queries. Useful for tuning hybrid_alpha, testing query expansion effectiveness, or validating after reindexing.
| Parameter | Type | Description |
|-----------|------|-------------|
| test_cases | string (JSON) | Array of test cases: [{"query": "...", "expected_filepath": "..."}, ...] |
Metrics:
- MRR@5 (Mean Reciprocal Rank): Average of 1/rank for expected documents. 1.0 = always first result.
- Recall@5: Fraction of expected documents found in top 5 results. 1.0 = all found.
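The two metrics can be reproduced by hand to sanity-check a report. The sketch below assumes you already have, for each query, the ranked list of file paths returned by search; evaluate and results_by_query are hypothetical names, and a missed document is assumed to contribute 0 to MRR.

```python
import json


def evaluate(results_by_query, test_cases, k=5):
    """Compute MRR@k and Recall@k for one expected document per query."""
    reciprocal_ranks = []
    hits = 0
    for case in test_cases:
        ranked = results_by_query.get(case["query"], [])[:k]
        expected = case["expected_filepath"]
        if expected in ranked:
            rank = ranked.index(expected) + 1  # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
            hits += 1
        else:
            reciprocal_ranks.append(0.0)  # assumption: a miss scores 0
    n = len(test_cases)
    return {
        "mrr": sum(reciprocal_ranks) / n if n else 0.0,
        "recall": hits / n if n else 0.0,
    }


# The tool takes test_cases as a JSON string, so serialize the list first.
cases = [
    {"query": "q1", "expected_filepath": "a.md"},
    {"query": "q2", "expected_filepath": "b.md"},
]
test_cases_json = json.dumps(cases)
```

With q1 finding its document at rank 1 and q2 at rank 2, MRR is (1 + 0.5) / 2 = 0.75 and Recall is 1.0.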
---
Configuration
Knowledge RAG is fully configurable via a config.yaml file in the project root. If no config.yaml exists, sensible defaults are used, so the system works out of the box with zero configuration.
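One common way to get "defaults unless overridden" behavior is a recursive merge of the user's settings over a defaults dictionary. The sketch below illustrates that idea only; merge_config and DEFAULTS are hypothetical and are not taken from config.py. The default values shown are a subset of those listed in the Configuration Reference.

```python
import copy

# Subset of the documented defaults, for illustration.
DEFAULTS = {
    "paths": {"documents_dir": "./documents", "data_dir": "./data"},
    "documents": {"chunking": {"chunk_size": 1000, "chunk_overlap": 200}},
    "search": {"default_results": 5, "max_results": 20},
}


def merge_config(defaults, overrides):
    """Return defaults with overrides applied; nested dicts are merged key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A config.yaml that only sets chunk_size keeps every other default.
config = merge_config(DEFAULTS, {"documents": {"chunking": {"chunk_size": 500}}})
```

Note that with this merge rule an empty mapping would leave the defaults in place, so sections that are disabled by setting them to {} would need replace-instead-of-merge handling.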
Quick Start
# Option 1: Use a preset
cp presets/cybersecurity.yaml config.yaml # Offensive/defensive security, CTFs
cp presets/developer.yaml config.yaml # Software engineering, APIs, DevOps
cp presets/research.yaml config.yaml # Academic research, papers, studies
cp presets/general.yaml config.yaml # Blank slate, pure semantic search
# Option 2: Start from the documented template
cp config.example.yaml config.yaml
# Edit config.yaml to your needs
Restart Claude Code after changing config.yaml.
config.yaml Structure
# Paths: where your documents live
paths:
  documents_dir: "./documents"  # Scanned recursively
  data_dir: "./data"  # Index storage
  models_cache_dir: "./models_cache"  # Persistent embedding model cache
# Documents: what gets indexed and how
documents:
  supported_formats:  # File types to index
    - .md
    - .txt
    - .pdf
    - .docx
    - .ipynb
    # - .py  # Uncomment to index code
  exclude_patterns:  # Glob patterns to skip
    - "node_modules"
    - ".venv"
    - "__pycache__"
  chunking:
    chunk_size: 1000  # Max chars per chunk
    chunk_overlap: 200  # Shared chars between chunks
# Models: AI models for search (all run locally, no API keys)
models:
  embedding:
    model: "BAAI/bge-small-en-v1.5"  # ONNX, ~33MB, auto-downloaded
    dimensions: 384
    gpu: false  # Set true + pip install knowledge-rag[gpu]
  reranker:
    enabled: true  # Falls back to RRF if model is unavailable
    model: "Xenova/ms-marco-MiniLM-L-6-v2"
    top_k_multiplier: 3  # Candidates fetched before reranking
# Search: result limits and collection name
search:
  default_results: 5
  max_results: 20
  collection_name: "knowledge_base"  # Change for separate knowledge bases
# Categories: auto-tag documents by folder path
# Set to {} to disable categorization entirely
category_mappings:
  "security/redteam": "redteam"
  "security/blueteam": "blueteam"
  "notes": "notes"
# Keyword routing: prioritize categories based on query keywords
# Set to {} for pure semantic search with no routing bias
keyword_routes:
  redteam:
    - pentest
    - exploit
    - privilege escalation
# Query expansion: expand abbreviations for better BM25 recall
# Set to {} for no expansion (search terms used as-is)
query_expansions:
  sqli:
    - sql injection
    - sqli
  privesc:
    - privilege escalation
    - privesc
# Server: enterprise features (new in v4.0.0)
server:
  transport: "stdio"  # "stdio" | "sse" | "streamable-http"
  host: "127.0.0.1"  # Bind address (SSE/HTTP only)
  port: 8179  # Bind port (SSE/HTTP only)
  auth:
    bearer_token: ""  # Set a secret to enable auth (SSE/HTTP only)
  rate_limit:
    enabled: false
    requests_per_minute: 60
    burst: 10
  metrics:
    enabled: false
    port: 9179  # Separate port for Prometheus scraping
See config.example.yaml for the fully documented template with explanations for every field.
Presets
Pre-built configurations for common use cases:
| Preset | File | Categories | Keywords | Expansions | Best For |
|--------|------|-----------|----------|-----------|----------|
| Cybersecurity | presets/cybersecurity.yaml | 8 | 200+ | 69 | Red/Blue Team, CTFs, threat hunting, exploit dev |
| Developer | presets/developer.yaml | 9 | 150+ | 50+ | Full-stack dev, APIs, DevOps, cloud, databases |
| Research | presets/research.yaml | 9 | 100+ | 40+ | Academic papers, thesis, lab notebooks, datasets |
| General | presets/general.yaml | 0 | 0 | 0 | Blank slate: pure semantic search, no domain logic |
Creating your own preset: Copy config.example.yaml, fill in your categories/keywords/expansions, save to presets/your-domain.yaml.
Configuration Reference
Server
| Field | Default | Description |
|-------|---------|-------------|
| server.transport | "stdio" | Transport protocol: "stdio", "sse", or "streamable-http" |
| server.host | "127.0.0.1" | Bind address for SSE/HTTP mode |
| server.port | 8179 | Bind port for SSE/HTTP mode |
| server.auth.bearer_token | "" (disabled) | Bearer token for SSE/HTTP auth. Empty = no auth |
| server.rate_limit.enabled | false | Enable per-client rate limiting |
| server.rate_limit.requests_per_minute | 60 | Max requests per minute |
| server.rate_limit.burst | 10 | Burst allowance above steady rate |
| server.metrics.enabled | false | Enable Prometheus /metrics endpoint |
| server.metrics.port | 9179 | Port for metrics scraping |
In stdio mode (default), server settings are ignored. SSE/HTTP mode auto-enables the single-instance lock.
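To show how requests_per_minute and burst might interact, here is a sketch of a per-client sliding-window limiter. It is one possible reading of the settings, not the server's code: SlidingWindowLimiter is a hypothetical class, and treating burst as extra headroom inside the same 60-second window is an assumption.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most requests_per_minute + burst calls per client in any 60 s window."""

    def __init__(self, requests_per_minute=60, burst=10, clock=time.monotonic):
        self.limit = requests_per_minute + burst  # assumption: burst adds headroom
        self.window = 60.0
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.calls = {}  # client_id -> deque of call timestamps

    def allow(self, client_id):
        now = self.clock()
        calls = self.calls.setdefault(client_id, deque())
        # Drop timestamps that have left the window.
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.limit:
            return False
        calls.append(now)
        return True
```

Each client gets its own window, so one noisy client does not consume another client's budget.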
Paths
| Field | Default | Description |
|-------|---------|-------------|
| paths.documents_dir | ./documents | Root folder scanned recursively for documents |
| paths.data_dir | ./data | Internal storage for ChromaDB and index metadata |
| paths.models_cache_dir | ./models_cache | Persistent cache for embedding models (~250MB). Survives reboots |
Relative paths resolve from the project root. Absolute paths work too.
Documents
| Field | Default | Description |
|-------|---------|-------------|
| documents.supported_formats | .md .txt .pdf .py .json .docx .xlsx .pptx .csv .ipynb | File extensions to index |
| documents.exclude_patterns | [] (empty) | Glob patterns for files/dirs to skip during indexing |
| documents.chunking.chunk_size | 1000 | Max characters per chunk |
| documents.chunking.chunk_overlap | 200 | Characters shared between consecutive chunks |
Chunking guidelines (chunk_size/chunk_overlap): Short notes → 500/100. General use → 1000/200. Long technical docs → 1500/300.
For .md files, chunking splits at ## and ### header boundaries first. Sections larger than chunk_size are sub-chunked with overlap. Non-markdown files use fixed-size chunking.
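The chunking rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in ingestion.py: fixed_chunks and markdown_chunks are hypothetical names, and the exact boundary handling in the project may differ.

```python
import re


def fixed_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def markdown_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split at ## / ### headers first, then sub-chunk oversized sections."""
    # Zero-width split before every line that starts with "## " or "### ".
    sections = re.split(r"(?m)^(?=#{2,3} )", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if len(section) <= chunk_size:
            chunks.append(section)
        else:
            chunks.extend(fixed_chunks(section, chunk_size, chunk_overlap))
    return chunks
```

With chunk_size=4 and chunk_overlap=2, the text "abcdefghij" yields "abcd", "cdef", "efgh", "ghij": each chunk repeats the last two characters of the previous one.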
Models
| Field | Default | Description |
|-------|---------|-------------|
| models.embedding.model | BAAI/bge-small-en-v1.5 | Embedding model (ONNX, runs locally) |
| models.embedding.dimensions | 384 | Vector dimensions (must match model) |
| models.embedding.gpu | false | Enable CUDA GPU acceleration. See GPU Acceleration for full setup |
| models.reranker.enabled | true | Enable cross-encoder reranking |
| models.reranker.model | Xenova/ms-marco-MiniLM-L-6-v2 | Reranker model |
| models.reranker.top_k_multiplier | 3 | Fetch N*multiplier candidates for reranking |
If the reranker model is not available locally and the machine cannot download it, search now falls back to the RRF order from hybrid semantic+BM25 retrieval. This keeps search_knowledge available offline, but result ordering may be less precise for ambiguous queries until the reranker model is cached.
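The interplay of top_k_multiplier and the RRF fallback can be sketched as follows. The function and its retrieve/rerank callables are hypothetical stand-ins for the hybrid retriever and the cross-encoder; only the overall flow (over-fetch, rerank if possible, otherwise keep the retrieval order) comes from the description above.

```python
def search(query, n_results, retrieve, rerank=None, top_k_multiplier=3):
    """Over-fetch candidates, rerank them if a reranker works, then trim to n_results."""
    # retrieve() is assumed to return candidates already in RRF order.
    candidates = retrieve(query, n_results * top_k_multiplier)
    if rerank is not None:
        try:
            scores = rerank(query, candidates)  # one score per candidate
            ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
            candidates = [doc for doc, _ in ranked]
        except Exception:
            pass  # reranker unavailable: keep the RRF order
    return candidates[:n_results]
```

With the defaults, a request for 5 results fetches 15 candidates, so the reranker can promote a strong match that hybrid retrieval placed outside the top 5.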
Embedding model options (fastest → most accurate):
- BAAI/bge-small-en-v1.5: 384D, ~33MB (default)
- BAAI/bge-base-en-v1.5: 768D, ~130MB
- BAAI/bge-large-en-v1.5: 1024D, ~335MB
- intfloat/multilingual-e5-small: 384D, 100+ languages
Warning: Changing the embedding model after indexing requires reindex_documents(full_rebuild=True).
Search
| Field | Default | Description |
|-------|---------|-------------|
| search.default_results | 5 | Results returned when no limit specified |
| search.max_results | 20 | Hard cap even if client requests more |
| search.collection_name | knowledge_base | ChromaDB collection (change for separate KBs) |
Categories
Map folder paths to category names. Documents in matching folders get auto-tagged, enabling filtered searches.
category_mappings:
  "security/redteam": "redteam"
  "security": "security"
Set category_mappings: {} to disable categorization; documents are still searchable, just without category filters.
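A sketch of folder-based tagging, for intuition only. detect_category here is a hypothetical helper, and letting the longest matching folder win (so security/redteam beats security) is an assumption; the "general" fallback matches the one described in the changelog.

```python
from pathlib import PurePosixPath


def detect_category(relative_path, category_mappings):
    """Map a document path (relative to documents_dir) to a category name."""
    folder = PurePosixPath(relative_path).parent.as_posix()
    # Assumption: the most specific (longest) matching folder wins.
    for prefix in sorted(category_mappings, key=len, reverse=True):
        if folder == prefix or folder.startswith(prefix + "/"):
            return category_mappings[prefix]
    return "general"  # fallback when no mapping matches


mappings = {"security/redteam": "redteam", "security": "security"}
```

With these mappings, security/redteam/kerberoast.md is tagged redteam, security/notes/x.md is tagged security, and anything else falls back to general.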
Keyword Routing
Route queries to categories based on keywords. When a query contains listed keywords, results from that category are prioritized (not filtered: other categories still appear, ranked lower).
keyword_routes:
  redteam:
    - pentest
    - exploit
    - sqli
Single-word keywords use regex word boundaries (\b), so "api" won't match "RAPID". Multi-word keywords use substring matching.
Set keyword_routes: {} for pure semantic search.
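The two matching rules can be expressed compactly. The helpers below are hypothetical, and case-insensitive comparison is an assumption inferred from the "api" vs "RAPID" example.

```python
import re


def keyword_matches(keyword, query):
    """Word-boundary match for single words, substring match for phrases."""
    keyword, query = keyword.lower(), query.lower()  # assumption: case-insensitive
    if " " in keyword:
        return keyword in query
    return re.search(rf"\b{re.escape(keyword)}\b", query) is not None


def routed_categories(query, keyword_routes):
    """Return the categories whose keyword list matches the query."""
    return [
        category
        for category, keywords in keyword_routes.items()
        if any(keyword_matches(k, query) for k in keywords)
    ]
```

The returned categories would then be used to boost, not filter, the matching results.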
Query Expansion
Expand search terms with synonyms before BM25 search. Supports single tokens, bigrams, and full query matches.
query_expansions:
  sqli:
    - sql injection
    - sqli
  k8s:
    - kubernetes
    - k8s
Set query_expansions: {} for no expansion.
query_expansions is directional: only the key on the left triggers the terms on the right. If you need mutual expansion without duplicating entries, use query_expansion_groups.
query_expansion_groups:
- ["triple barrier", "tb", "trip_barr"]
- ["profit factor", "pf"]
Each group is interpreted symmetrically, so every term expands to the rest of the group. The final internal expansion table is built by merging both sources:
- query_expansions entries are loaded as-is.
- query_expansion_groups adds reciprocal links for every term in each group.
- Overlaps are merged by union with duplicate terms removed.
This keeps backward compatibility while allowing concise synonym groups.
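The merge rules above can be written out as a short sketch. build_expansion_table is a hypothetical name and the list-based table layout is an assumption; only the three rules themselves come from the text.

```python
def build_expansion_table(query_expansions, query_expansion_groups):
    """Merge directional expansions and symmetric groups into one lookup table."""
    table = {}

    def add(term, expansions):
        bucket = table.setdefault(term, [])
        for item in expansions:
            if item not in bucket:  # union with duplicates removed
                bucket.append(item)

    # Directional entries: only the key triggers its listed terms.
    for key, terms in query_expansions.items():
        add(key, terms)
    # Symmetric groups: every term expands to the rest of its group.
    for group in query_expansion_groups:
        for term in group:
            add(term, [other for other in group if other != term])
    return table


table = build_expansion_table(
    {"sqli": ["sql injection", "sqli"], "pf": ["profit factor"]},
    [["profit factor", "pf"]],
)
```

Here pf appears in both sources, so its expansions are merged into a single deduplicated entry, while "sql injection" gets no entry of its own because query_expansions is directional.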
Hybrid Search Tuning
| hybrid_alpha | Behavior | Best For |
|--------------|----------|----------|
| 0.0 | Pure BM25 keyword | Exact terms, CVEs, tool names |
| 0.3 | Keyword-heavy (default) | Technical queries with specific terms |
| 0.5 | Balanced | General queries |
| 0.7 | Semantic-heavy | Conceptual queries, related topics |
| 1.0 | Pure semantic | "How to..." questions, abstract concepts |
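For intuition about what hybrid_alpha does, the sketch below blends two rankings with a weighted Reciprocal Rank Fusion. How the project actually applies hybrid_alpha is not specified here, so the weighting scheme, the hybrid_rank name, and the constant k=60 (a conventional RRF value) are all assumptions.

```python
def hybrid_rank(semantic, bm25, hybrid_alpha=0.3, k=60):
    """Fuse two ranked lists of doc ids; alpha weights semantic, 1 - alpha weights BM25."""
    scores = {}
    weighted = ((hybrid_alpha, semantic), (1.0 - hybrid_alpha, bm25))
    for weight, ranking in weighted:
        if weight <= 0.0:
            continue  # a zero-weight list contributes nothing (pure mode)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


semantic_order = ["a", "b", "c"]
bm25_order = ["c", "b", "a"]
```

With these two opposite rankings, hybrid_alpha=0.0 reproduces the BM25 order, 1.0 reproduces the semantic order, and the default 0.3 still favors the BM25 order because keyword evidence carries more weight.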
---
Project Structure
knowledge-rag/
├── mcp_server/
│   ├── __init__.py # Stdout protection + version
│   ├── config.py # YAML config loader + defaults
│   ├── ingestion.py # 20 parsers, chunking, metadata extraction
│   └── server.py # MCP server, ChromaDB, BM25, reranker, 13 tools
├── config.example.yaml # Documented config template (copy to config.yaml)
├── config.yaml # Your active configuration (git-ignored)
├── presets/ # Ready-to-use domain configurations
│   ├── cybersecurity.yaml
│   ├── developer.yaml
│   ├── research.yaml
│   └── general.yaml
├── documents/ # Your documents (scanned recursively)
├── data/
│   ├── chroma_db/ # ChromaDB vector database
│   └── index_metadata.json # Incremental indexing state
├── models_cache/ # Persistent embedding model cache
├── tests/ # Test suite (82 tests)
├── install.sh # Linux/macOS installer
├── install.ps1 # Windows installer
├── venv/ # Python virtual environment
├── requirements.txt
├── pyproject.toml
├── LICENSE
└── README.md
---
Troubleshooting
Python version mismatch
Requires Python 3.11 or newer.
python --version # Must be 3.11+
FastEmbed model download fails
On first run, FastEmbed downloads models to models_cache/. If the download fails:
# Clear cache and retry
# Windows:
rmdir /s /q models_cache
# Linux/macOS:
rm -rf models_cache
# Then restart the MCP server
Reranker model download fails
The reranker is lazy-loaded on the first query. If the model is not cached and the machine is offline, search continues without reranking and uses the RRF order from hybrid retrieval. To keep reranking enabled offline, run one query while online or pre-populate models_cache/ on the target machine.
You can still disable reranking explicitly in config.yaml:
models:
  reranker:
    enabled: false
Disabling reranking reduces memory use and avoids first-query model loading. The tradeoff is lower ranking precision, especially when several chunks match the same terms but only one is the best answer.
ChromaDB index crashes on startup
Native ChromaDB failures can terminate Python before normal exception handling runs. Startup now probes ChromaDB in a child process before initializing the MCP server. If the probe crashes, the active chroma_db/ and index_metadata.json are moved to data/backups/auto-repair-*, and the next startup can rebuild a clean index.
The same guarded behavior is available through either console script:
knowledge-rag
knowledge-rag-guarded
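The child-process probe idea can be illustrated in isolation. This is a generic sketch, not the project's startup code: probe is a hypothetical helper that runs a snippet in a separate interpreter, so a native crash kills only the child and shows up as a non-zero exit code.

```python
import subprocess
import sys


def probe(code: str, timeout: float = 30.0) -> bool:
    """Run `code` in a child interpreter; True only if it exits cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,  # keep the child's output off our stdout
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# A healthy snippet passes; a hard exit (simulating a native crash) fails.
healthy = probe("print('ok')")
crashed = probe("import os; os._exit(139)")
```

In the real preflight the snippet would open the persistent ChromaDB index; a False result is what triggers moving chroma_db/ and index_metadata.json into data/backups/auto-repair-*.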
Index is empty
# Check documents directory has files
ls documents/
# Force reindex via Claude Code:
# reindex_documents(force=True)
# Or nuclear rebuild if model changed:
# reindex_documents(full_rebuild=True)
MCP server not loading
- Check ~/.claude.json exists and has valid JSON in the mcpServers section
- Verify paths use double backslashes (\\) on Windows
- Restart Claude Code completely
- Run claude mcp list to check connection status
"Failed to connect" error
The MCP server uses stdout for JSON-RPC communication. If a library prints to stdout during init, the stream gets corrupted. v3.4.3+ includes stdout protection that prevents this. If you're on an older version, upgrade:
pip install --upgrade knowledge-rag
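The failure mode and the general remedy can be shown with a small sketch. This is not the project's stdout protection; run_with_stdout_guard and noisy_init are hypothetical, and contextlib.redirect_stdout only catches Python-level writes (output written directly to file descriptor 1 by native code would need a lower-level redirect).

```python
import contextlib
import sys


def noisy_init() -> str:
    """Stand-in for a library that prints a banner while initializing."""
    print("library banner")
    return "ready"


def run_with_stdout_guard(func):
    """Call func() with Python-level stdout writes redirected to stderr."""
    with contextlib.redirect_stdout(sys.stderr):
        return func()
```

Wrapping initialization this way keeps stdout reserved for JSON-RPC frames while the banner still reaches the logs on stderr.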
Slow first query
The cross-encoder reranker model is lazy-loaded on the first query. This adds a one-time ~2-3 second delay for model download and loading. Subsequent queries are fast. If the model cannot be loaded, search falls back to RRF ordering and does not retry loading the reranker until the server restarts.
Memory usage
With ~200 documents, expect ~300-500MB RAM. The embedding model (~200MB ONNX runtime resident, lazy-loaded on first query since v3.8.0) and reranker (~25MB, lazy-loaded) are loaded into memory only when actually used. For very large knowledge bases (1000+ documents), consider enabling GPU acceleration and using exclude patterns to limit index scope.
Multiple MCP clients spawn duplicate servers
MCP stdio is one process per client by protocol: multiple Claude Code windows, Claude Desktop + IDE, etc. each spawn their own knowledge-rag process. Since v3.8.0 idle processes are cheap (no embedding model loaded until first query). If you've measured and want a hard cap of one server per data directory, opt in:
export KNOWLEDGE_RAG_SINGLE_INSTANCE=1
A second instance exits immediately with code 75. Default is OFF (multi-client friendly). Full guide: docs/single-instance.md. Sample MCP config: examples/mcp-config-single-instance.json.
SSE server won't start
# Check if port 8179 is already in use
# Windows:
netstat -aon | findstr :8179
# Linux/macOS:
lsof -i :8179
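If you prefer a cross-platform check, the same question can be answered from Python with the standard socket module. port_in_use is a hypothetical helper; it only reports whether something accepts TCP connections on the address, not which process owns the port.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 8179) -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("8179 in use:", port_in_use())
```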
If uvicorn is not found, install the server extras: pip install knowledge-rag[server]
Can't connect to SSE server
Verify the server is running and the URL is correct:
curl http://127.0.0.1:8179/sse
Common issues:
- Wrong URL: must end with /sse (not just the port)
- Firewall blocking the port
- Server started with a different host/port than configured in the MCP client
---
Changelog
Unreleased
v4.3.1 (2026-06-22): Hybrid Search Fixes
- FIX: Accept "general" as a valid category in search_knowledge. The parser hardcodes "general" as the fallback in _detect_category (ingestion.py), but the validator only built valid_categories from config.keyword_routes + config.category_mappings.values(), so users who customized config.yaml and dropped the default "general": "general" mapping hit Invalid category even though the index contained general documents. Validator now always tolerates "general". (#98, thanks @Hohlas)
- FIX: Skip BM25-only search results when Chroma can no longer resolve the chunk ID. Stale BM25 indices (typically right after remove_document or in the window between async reindex and BM25 rebuild) returned hits whose collection.get() came back empty; the previous fallback inserted entries with document="" / metadata={} into the reranker, polluting results with empty matches. The pipeline now continues past those, dropping the stale hit cleanly. (#98, thanks @Hohlas)
- TEST: Added tests/test_pr98_regression.py (4 tests) pinning both contracts so future refactors cannot silently revert either fix. Test count baseline: 227 → 231. (#99)
- CI: Bumped [tool.mypy] python_version from 3.11 to 3.12 to accept PEP 695 type statements in the numpy stub (numpy/__init__.pyi) which were breaking the Pillar 7 strict gate. Only affects static analysis; requires-python = ">=3.11" unchanged. (#100)
v4.3.0 (2026-06-17): Async Reindex, GPU CUDA 12, 13th MCP Tool
- NEW: get_reindex_status MCP tool for lightweight reindex progress polling without computing full index stats. Returns active/idle status, percent, processed/total, errors, and last result.
- NEW: reindex_documents now runs in background via daemon thread and returns immediately with {"status": "started"}. Eliminates MCP timeout on large document sets (5K+ files). Concurrent calls return already_running with current progress.
- NEW: GPU acceleration with full CUDA 12 support: onnxruntime-gpu + 7 NVIDIA pip packages (cublas, cudnn, cuda-runtime, cufft, cusparse, cusolver, curand, nvjitlink). Server auto-detects GPU on startup with 4-step verification (providers, DLLs, nvidia-smi, session creation). Falls back to CPU gracefully.
- NEW: _setup_cuda_dll_paths() adds NVIDIA pip package DLL directories to PATH automatically on Windows, so onnxruntime finds CUDA 12 DLLs without a full CUDA Toolkit install.
- DEPS: [gpu] extra expanded from 3 to 8 packages (added cufft, cusparse, cusolver, curand, nvjitlink).
- FIX: GPU status reporting now uses actual ONNX session creation test instead of just checking get_available_providers(), which prevents false "GPU ACTIVE" when CUDA DLLs are missing.
- DOCS: GPU Acceleration section rewritten with complete requirements table, setup steps, verification instructions, and fallback behavior.
- DOCS: Tool reference updated: reindex_documents async behavior documented, get_reindex_status reference added.
- TEST: Backwards-compat baseline updated for 13 MCP tools.
v4.2.0 (2026-06-17): Search Performance & Output Quality
- PERF: Custom inverted-index BM25 replaces rank-bm25 full-corpus scan: 128× faster keyword search on 50K+ chunk corpora. Only documents containing query terms are scored via posting lists.
- PERF: numpy.argpartition for O(n) top-k selection instead of O(n log n) sort.
- PERF: Batched adjacent chunk fetch: a single ChromaDB collection.get() call replaces N round-trips per result.
- PERF: O(1) reverse lookup via _source_to_docid dict eliminates linear scans of _indexed_docs in search_similar, update_document, remove_document, and _expand_with_adjacent_chunks.
- NEW: snippet_mode parameter on search_knowledge (default: true) truncates content to ~500 chars at natural break points with content_length field. Reduces token consumption by ~72%.
- NEW: min_score parameter on search_knowledge (default: 0.0) filters results below a normalized relevance threshold. Response includes filtered_by_score count.
- NEW: filtered_by_score field in search response JSON for transparency.
- DEPS: numpy added as direct dependency (was transitive via fastembed); rank-bm25 import removed from server.py.
- TEST: 6 new tests for min_score filtering and snippet_mode truncation.
- TEST: Updated backwards-compat baseline to include new search_knowledge parameters.
v4.1.2 (2026-06-17)
- FIX: _save_metadata dict snapshot prevents concurrent modification crash during file watcher events.
- STYLE: ruff format applied to server.py.
v4.1.1 (2026-06-17)
- FIX: All _indexed_docs iterations now use list() snapshot, preventing dictionary changed size during iteration crash when FileWatcher modifies the index concurrently with MCP tool calls (affects search_knowledge, search_similar, update_document, remove_document, evaluate_retrieval, list_categories, list_documents).
v4.1.0 (2026-06-17)
- Added: query_expansion_groups config for symmetric synonym expansion (#92)
- Improved: expand_query() now returns deterministic expansion order (set → ordered list with dedup)
v4.0.1 (2026-06-16)
- FIX: Orphan cleanup now runs before indexing loop, preventing chunk loss when files are moved (#90).
- FIX: Chunk deduplication is now per-document instead of global, preventing cross-document chunk deletion (#91).
- FIX: Added on_moved handler to DocumentWatcher for proper file move detection.
- FIX: Startup preflight probes ChromaDB in a child process and moves crashing persistent indexes to data/backups/auto-repair-* before MCP initialization.
- FIX: Reranker load failures now fall back to RRF ordering instead of failing search_knowledge on offline machines.
- FIX: Virtualenv project-root detection now handles Python symlinks that resolve to the system interpreter.
- NEW: knowledge-rag-guarded console script kept as an explicit guarded startup alias.
v4.0.0 (2026-06-09): Enterprise Concurrent Access
- NEW: SSE and streamable-http transport modes: 1 server serves N clients (server.transport: "sse" in config.yaml or --transport sse CLI).
- NEW: Thread-safe shared state for concurrent queries: QueryCache locking, BM25 build lock, orchestrator double-checked locking.
- NEW: ChromaDB WAL mode enabled automatically in SSE/HTTP mode for concurrent read performance.
- NEW: Optional rate limiting: sliding-window counter, configurable RPM and burst, disabled by default.
- NEW: Optional Prometheus metrics endpoint: tool call counts, latency histograms, separate port, disabled by default.
- NEW: All 13 MCP tools instrumented with @rate_limited and @instrument decorators (zero-cost when disabled).
- NEW: --transport CLI override for Docker/systemd deployments.
- NEW: pip install knowledge-rag[server] optional dependency for SSE/HTTP (uvicorn).
- CHANGED: SSE/HTTP mode auto-enables single-instance lock (port collision prev






