VelociRAG

HaseebKhalid1507/VelociRAG
10 starsMITCommunity

Install to Claude Code

This server doesn't publish a one-line install command. Follow the setup in the source repository.

Summary

Lightning-fast RAG for AI agents. 4-layer fusion, ONNX Runtime, sub-200ms search.

README.md

πŸ¦– VelociRAG

Lightning-fast RAG for AI agents.

_Four-layer retrieval fusion powered by ONNX Runtime. No PyTorch. Sub-200ms warm search. Incremental graph updates. MCP-ready._

---

Most RAG solutions either drag in 2GB+ of PyTorch or limit you to single-layer vector search. VelociRAG gives you four retrieval methods β€” vector similarity, BM25 keyword matching, knowledge graph traversal, and metadata filtering β€” fused through reciprocal rank fusion with cross-encoder reranking. All running on ONNX Runtime, no GPU, no API keys. Comes with an MCP server for agent integration, a Unix socket daemon for warm queries, and a CLI that just works.

πŸš€ Quick Start

MCP Server (Claude, Cursor, Windsurf)

pip install "velocirag[mcp]"
velocirag index ./my-docs
velocirag mcp

Claude Code β€” add to .mcp.json in your project root: ``json { "mcpServers": { "velocirag": { "command": "velocirag", "args": ["mcp"], "env": { "VELOCIRAG_DB": "/path/to/data" } } } } ` Then open /mcp in Claude Code and enable the velocirag server. If using a virtualenv, use the full path to the binary (e.g. .venv/bin/velocirag`).

Claude Desktop β€” add to claude_desktop_config.json: ``json { "mcpServers": { "velocirag": { "command": "velocirag", "args": ["mcp", "--db", "/path/to/data"] } } } ``

Cursor β€” add to .cursor/mcp.json: ``json { "mcpServers": { "velocirag": { "command": "velocirag", "args": ["mcp", "--db", "/path/to/data"] } } } ``

Python API

from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder()
store = VectorStore('./my-db', embedder)
store.add_directory('./my-docs')
searcher = Searcher(store, embedder)
results = searcher.search('query', limit=5)

CLI

pip install velocirag
velocirag index ./my-docs
velocirag search "your query here"

Search Daemon (warm engine for CLI users)

velocirag serve --db ./my-data        # start daemon (background)
velocirag search "query"              # auto-routes through daemon
velocirag status                      # check daemon health
velocirag stop                        # stop daemon

The daemon keeps the ONNX model + FAISS index warm over a Unix socket. First query loads the engine (~1s), subsequent queries return in ~180ms with full 4-layer fusion.

🎯 Why VelociRAG?

  • 4-layer search β€” vector + BM25 keyword + knowledge graph + metadata, fused with RRF
  • No LLM needed β€” search runs entirely on local models (MiniLM + TinyBERT, ~80MB total)
  • No GPU needed β€” pure ONNX inference, runs on any machine
  • ~3ms warm search β€” daemon keeps models + indices warm over Unix socket
  • Incremental indexing β€” add files without rebuilding the whole index
  • MCP server β€” plug into Claude, Cursor, Windsurf, any MCP client

Related Projects

  • Memkoshi β€” Agent memory system. Uses VelociRAG as its search engine.
  • Stelline β€” Session intelligence. Crafts memories from conversation logs.
  • Glyph β€” MCP security scanner and runtime protection.

πŸ—οΈ How It Works

The 4-layer pipeline: `` Query β†’ expand (acronyms, variants) β†’ [Vector] FAISS cosine similarity (384d, MiniLM-L6-v2 via ONNX) β†’ [Keyword] BM25 via SQLite FTS5 β†’ [Graph] Knowledge graph traversal β†’ [Metadata] Structured SQL filters (tags, status, project) β†’ RRF Fusion β†’ Cross-encoder rerank β†’ Results ``

What each layer catches:

| Query type | Vector | Keyword | Graph | Metadata | |-----------|:---:|:---:|:---:|:---:| | Conceptual ("improve error handling") | βœ… | β€” | β€” | β€” | | Exact match ("ERR_CONNECTION_REFUSED") | β€” | βœ… | β€” | β€” | | Connected concepts | β€” | β€” | βœ… | β€” | | Filtered ("#python status:active") | β€” | β€” | β€” | βœ… | | Combined ("React state management") | βœ… | βœ… | βœ… | βœ… |

✨ Features

  • ONNX Runtime β€” 184ms cold start, 3ms cached. No PyTorch, no GPU
  • Four-layer fusion β€” FAISS vector similarity + SQLite FTS5 (BM25) + knowledge graph + metadata filtering, merged via reciprocal rank fusion
  • Cross-encoder reranking β€” TinyBERT reranker via ONNX Runtime β€” included in base install, no PyTorch needed. Downloads ~17MB model on first use
  • Incremental graph updates β€” file-centric provenance tracking detects what changed and only rebuilds affected nodes/edges. Cascading deletes maintain consistency across all stores (vector, graph, metadata). Multi-source support with isolated provenance per source
  • MCP server β€” Five tools (search, index, add_document, health, list_sources) for Claude, Cursor, Windsurf
  • Search daemon β€” Unix socket server keeps ONNX model + FAISS index warm between queries
  • Knowledge graph β€” Analyzers build entity, temporal, topic, and explicit-link edges from markdown. Optional GLiNER NER. 418 files in 2.1s
  • Smart chunking β€” Header-aware splitting preserves document structure and parent context
  • Query expansion β€” Acronym registry, casing/spacing variants, underscore-aware tokenization
  • Runs anywhere β€” CPU-only, 8GB RAM, no API keys, no external services

πŸ€– MCP Server

VelociRAG exposes a Model Context Protocol server for seamless agent integration:

Available tools:

  • search β€” 4-layer fusion search with reranking
  • index β€” Add documents to the knowledge base
  • add_document β€” Insert single document
  • health β€” System diagnostics
  • list_sources β€” Show indexed document sources

The MCP server process stays alive between queries, so models load once and every subsequent search is warm. Works with any MCP-compatible client.

🐍 Python API

Full 4-layer unified search: ```python from velocirag import ( Embedder, VectorStore, Searcher, GraphStore, MetadataStore, UnifiedSearch, GraphPipeline )

Build the full stack

embedder = Embedder() store = VectorStore('./search-db', embedder) graph_store = GraphStore('./search-db/graph.db') metadata_store = MetadataStore('./search-db/metadata.db')

Index with graph + metadata

store.add_directory('./docs') pipeline = GraphPipeline(graph_store, embedder, metadata_store) pipeline.build('./docs', source_name='my-docs')

Unified search across all layers

searcher = Searcher(store, embedder) unified = UnifiedSearch(searcher, graph_store, metadata_store) results = unified.search( 'machine learning algorithms', limit=5, enrich_graph=True, filters={'tags': ['python'], 'status': 'active'} ) ```

Quick semantic search: ```python from velocirag import Embedder, VectorStore, Searcher

embedder = Embedder() store = VectorStore('./db', embedder) store.add_directory('./docs') searcher = Searcher(store, embedder) results = searcher.search('neural networks', limit=10) ```

Incremental graph updates: ```python from velocirag import Embedder, GraphStore, GraphPipeline

First run β€” full build, populates provenance

gs = GraphStore('./db/graph.db') pipeline = GraphPipeline(gs, embedder=Embedder()) pipeline.build('./docs', source_name='my-docs') # full build

Subsequent runs β€” only changed files get reprocessed

pipeline.build('./docs', source_name='my-docs') # incremental (automatic)

Force full rebuild

pipeline.build('./docs', source_name='my-docs', force_rebuild=True)

Multi-source graphs

pipeline.build('./project-a', source_name='project-a') pipeline.build('./project-b', source_name='project-b') # isolated provenance

Deleted files automatically cascade across all stores

(vector, FTS5, graph, metadata) on next build


## πŸ’» CLI Reference

Index documents (graph + metadata built by default)

velocirag index <path> [--no-graph] [--no-metadata] [--gliner] [--full-graph] [--force] [--source NAME] [--db PATH]

Search across all layers (auto-routes through daemon if running)

velocirag search <query> [--limit N] [--threshold F] [--format text|json]

Search daemon

velocirag serve [--db PATH] [-f] # start daemon (-f for foreground) velocirag stop # stop daemon velocirag status # check daemon health

Metadata queries

velocirag query [--tags TAG] [--status S] [--project P] [--recent N]

System health and status

velocirag health [--format text|json]

Start MCP server

velocirag mcp [--db PATH] [--transport stdio|sse] ```

Options:

  • --no-graph β€” Skip knowledge graph build
  • --no-metadata β€” Skip metadata extraction
  • --full-graph β€” Build graph WITH semantic similarity edges (~2GB extra RAM)
  • --source NAME β€” Label for multi-source provenance isolation
  • --force β€” Clear and rebuild from scratch
  • --gliner β€” Use GLiNER for entity extraction (requires pip install "velocirag[ner]")

πŸ“Š Performance

Real benchmarks on ByteByteGo/system-design-101 (418 files, 1,001 chunks):

| Metric | Value | |--------|-------| | Index (418 files) | 13.6s | | Search (warm, 5 results) | 35–90ms | | Graph build (light) | 2.1s β†’ 2,397 nodes, 8,717 edges | | Incremental update (1 file) | 1.3s | | Reranker | Cross-encoder TinyBERT via ONNX | | Install size | ~80MB (no PyTorch) | | RAM usage | <1GB with all models loaded |

Production deployment (6,300+ chunks, 3 sources, 950 files):

| Metric | Value | |--------|-------| | Full search (warm) | 16ms avg, 2ms min | | Full search (first run) | 22ms avg, 4ms min | | Search P50 / P95 | 17ms / 55ms | | Hit rate (100-query benchmark) | 99/100 | | Graph | 3,125 nodes, 132,320 edges | | Reranker | Cross-encoder TinyBERT via ONNX | | RAM | <1GB with all models loaded |

βš™οΈ Configuration

| Environment Variable | Default | Description | |---------------------|---------|-------------| | VELOCIRAG_DB | ./.velocirag | Database directory | | VELOCIRAG_SOCKET | /tmp/velocirag-daemon.sock | Daemon socket path | | NO_COLOR | β€” | Disable colored output |

Dependencies (all included in base install):

  • onnxruntime β€” ONNX inference (embedder + reranker)
  • tokenizers + huggingface-hub β€” model loading
  • faiss-cpu β€” vector similarity search
  • networkx + scikit-learn β€” knowledge graph + topic clustering
  • numpy, click, pyyaml, python-frontmatter

Optional extras:

  • pip install "velocirag[mcp]" β€” MCP server (adds fastmcp)
  • pip install "velocirag[ner]" β€” GLiNER entity extraction (adds gliner, requires PyTorch)

πŸ“š References

VelociRAG builds on these foundational works:

Core Fusion & Retrieval > Reciprocal Rank Fusion β€” Cormack, G. V., Clarke, C. L. A., & BΓΌttcher, S. (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods." _SIGIR '09_. > Core fusion algorithm for merging results across retrieval layers.

BM25 β€” Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1994). "Okapi at TREC-3." _TREC-3_. Keyword search foundation via SQLite FTS5.

Embeddings & Neural IR > Sentence-BERT β€” Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." _EMNLP 2019_. paper > Dense embedding architecture using all-MiniLM-L6-v2.

MiniLM β€” Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers." _NeurIPS 2020_. paper Efficient transformer distillation for production embedding models.

Reranking & Neural Models > Cross-Encoder Reranking β€” Nogueira, R., & Cho, K. (2019). "Passage Re-ranking with BERT." _arXiv:1901.04085_. paper > Cross-attention reranking with TinyBERT on MS MARCO.

TinyBERT β€” Jiao, X., et al. (2020). "TinyBERT: Distilling BERT for Natural Language Understanding." _Findings of EMNLP 2020_. paper Compressed BERT for fast reranking inference.

Vector Search & Systems > FAISS β€” Johnson, J., Douze, M., & JΓ©gou, H. (2019). "Billion-scale similarity search with GPUs." _IEEE Transactions on Big Data_. paper > High-performance vector similarity search engine.

GLiNER β€” Zaratiana, U., Nzeyimana, A., & Holat, P. (2023). "GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer." _arXiv:2311.08526_. paper Generalist NER for knowledge graph entity extraction (optional dependency).

πŸ“„ License

MIT β€” Use it anywhere, build anything.

Need agent integration help? Check AGENTS.md for machine-readable project context.

---

_Built for agents who think fast and remember faster._

<!-- mcp-name: io.github.HaseebKhalid1507/velocirag -->

Related MCP servers

Browse all β†’