MCP-Powered AI Assistant (Local — LlamaIndex + Ollama)

Privacy-first document intelligence: all models run locally via Ollama — no data leaves your machine.

---

What is MCP?

Model Context Protocol (MCP) is an open standard (by Anthropic) that defines how AI models discover and invoke tools at runtime. Think of it as a "USB-C port for AI" — any MCP-compatible client (Claude Desktop, your own agent, etc.) can connect to any MCP server and immediately use its tools.

MCP vs. Advanced RAG — What's the Difference?

| Dimension | Advanced RAG | MCP |
|-----------|--------------|-----|
| Purpose | Improve retrieval accuracy | Standardise tool/capability exposure |
| Core idea | Better chunking, re-ranking, hybrid search | JSON-RPC tool registry with discovery |
| What the LLM gets | Retrieved context injected into prompt | A menu of callable functions with schemas |
| Execution | Single pipeline (query → retrieve → generate) | Multi-step agent loop (plan → pick tool → call → observe → repeat) |
| Tools | Retrieval only | Any function: retrieval, APIs, databases, code |
| State | Stateless per query | Stateful agent sessions possible |
| This project | RAG is one tool inside the MCP server | MCP wraps 8 RAG tools, discoverable at runtime |

In short: Advanced RAG makes retrieval smarter. MCP makes the entire AI system composable and interoperable.
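
To make the "menu of callable functions with schemas" idea concrete, here is a minimal sketch of a tool registry in plain Python. It is not this project's server code: the `ToolRegistry` class, its method names, and the `echo` tool are hypothetical, and the entries only mirror the name / description / inputSchema shape that MCP tool listings use.

```python
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Hypothetical registry: maps tool names to a callable plus a JSON schema."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, description: str,
                 input_schema: Dict[str, Any], func: Callable[..., Any]) -> None:
        self._tools[name] = {
            "description": description,
            "inputSchema": input_schema,
            "func": func,
        }

    def list_tools(self) -> List[Dict[str, Any]]:
        """Discovery: what a client sees before it picks a tool."""
        return [
            {"name": name, "description": t["description"], "inputSchema": t["inputSchema"]}
            for name, t in self._tools.items()
        ]

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        """Invocation: look the tool up by name and pass the arguments through."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name]["func"](**arguments)


registry = ToolRegistry()
registry.register(
    name="echo",  # illustrative only, not one of this project's 8 tools
    description="Return the text it was given",
    input_schema={
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
    func=lambda text: text,
)
```

A client would first call `registry.list_tools()` to learn what exists, then `registry.call("echo", {"text": "hi"})` to invoke it. A plain RAG pipeline has no equivalent of the discovery step.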

---

Project Architecture

mcp_rag_assistant/
├── config.py               ← Central config (LLM, embed, chunking, server)
├── rag_engine.py           ← LlamaIndex: load docs → build index → query engine
├── main.py                 ← CLI entrypoint (serve / index / query / demo)
├── mcp_client.py           ← Example client that calls server tools
│
├── mcp_server/
│   └── server.py           ← HTTP JSON-RPC server exposing all tools
│
├── tools/
│   └── rag_tools.py        ← 8 MCP tool implementations
│
├── utils/
│   └── logger.py           ← Structured logging
│
├── my_data/                ← ⬅ DROP YOUR FILES HERE (PDF, DOCX, XLSX, CSV)
├── storage/                ← ChromaDB persistence (auto-created)
├── logs/                   ← Log files (auto-created)
│
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md

Data Flow

User Query
    │
    ▼
MCP Client (mcp_client.py or Claude Desktop or your agent)
    │  JSON-RPC POST /mcp  {"method": "tools/call", "params": {...}}
    ▼
MCP Server (mcp_server/server.py)
    │  dispatches to matching tool function
    ▼
Tool Function (tools/rag_tools.py)
    │  calls get_query_engine().query(...)
    ▼
LlamaIndex Query Engine (rag_engine.py)
    │  embeds query with qwen3-embedding:0.6b via Ollama
    ▼
ChromaDB Vector Store
    │  returns top-K similar chunks
    ▼
Ollama LLM (llama3 or mistral)
    │  synthesises answer from retrieved context
    ▼
JSON response back through MCP → Client
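
The "dispatches to matching tool function" step in the flow above can be sketched as follows. This is an assumption-laden illustration, not the contents of mcp_server/server.py: the `TOOLS` table, `handle_request`, and the stub tool are hypothetical, and the `params` layout (`name` plus `arguments`) is assumed from the MCP `tools/call` convention because the diagram elides it. The error codes are the standard JSON-RPC 2.0 ones.

```python
import json
from typing import Any, Callable, Dict, List


def list_indexed_files_stub() -> List[str]:
    # Stand-in for a real tool; returns fixed sample data so the sketch runs offline.
    return ["report.pdf", "sales.csv"]


# Hypothetical name-to-function table; the real server's registry may differ.
TOOLS: Dict[str, Callable[..., Any]] = {"list_indexed_files": list_indexed_files_stub}


def handle_request(raw: str) -> Dict[str, Any]:
    """Dispatch one JSON-RPC 2.0 request body to the matching tool."""
    req = json.loads(raw)
    req_id = req.get("id")
    if req.get("method") != "tools/call":
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32601, "message": "Method not found"}}
    params = req.get("params", {})
    tool = TOOLS.get(params.get("name"))
    if tool is None:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32602, "message": "Unknown tool"}}
    result = tool(**params.get("arguments", {}))
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "list_indexed_files", "arguments": {}},
})
response = handle_request(request)
```

The `id` field is echoed back so a client can match responses to requests when several calls are in flight.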

---

Available MCP Tools

| Tool | Description |
|------|-------------|
| query_documents | General Q&A over all indexed documents |
| list_indexed_files | Show files in my_data/ |
| rebuild_index | Re-index after adding/removing files |
| summarize_document | Summarise a specific file by name |
| analyze_data | Plain-English data analysis (CSV/XLSX) |
| generate_report | Generate summary / detailed / executive report |
| compare_documents | Compare two documents on a given aspect |
| extract_entities | Extract people, orgs, dates, numbers |

---

Prerequisites

Python 3.10+
Ollama running locally — ollama.com
Models already pulled with ollama pull:
llama3:latest
mistral:latest
qwen3-embedding:0.6b
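
A quick way to confirm the models above are present is to compare them against what Ollama reports. The sketch below assumes Ollama's default address (`http://localhost:11434`) and its `/api/tags` endpoint, which lists locally available models. The helper names `missing_models` and `fetch_tags` are hypothetical and not part of this project.

```python
import json
import urllib.request
from typing import Iterable, List

REQUIRED_MODELS = ["llama3:latest", "mistral:latest", "qwen3-embedding:0.6b"]


def missing_models(required: Iterable[str], tags_payload: dict) -> List[str]:
    """Return the required model names absent from an /api/tags response."""
    installed = {m["name"] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Query the local Ollama server (assumes the default port 11434)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Offline demonstration with a hand-written payload in the /api/tags shape.
sample = {"models": [{"name": "llama3:latest"}, {"name": "mistral:latest"}]}
print(missing_models(REQUIRED_MODELS, sample))  # ['qwen3-embedding:0.6b']
```

Against a running Ollama instance, `missing_models(REQUIRED_MODELS, fetch_tags())` returns the names that still need an `ollama pull`.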

---

Setup

# 1. Clone / unzip the project
cd mcp_rag_assistant

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure (optional — defaults work out of the box)
cp .env.example .env
# Edit .env to change models, ports, chunk sizes etc.

# 5. Add your documents
#    Copy PDFs, DOCX, XLSX, CSV files into:
#    my_data/

# 6. Build the index
python main.py index

# 7. Start the MCP server
python main.py serve

---

Usage

Start the server

python main.py serve
# MCP server listening on http://0.0.0.0:8080

One-shot query (no server needed)

python main.py query "What are the key findings in the Q1 report?"

Run the demo client (server must be running)

# In a second terminal:
python main.py demo

Rebuild index after adding new files

python main.py index
# or via MCP tool:
# call rebuild_index tool from any client
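
Calling `rebuild_index` from a client amounts to one JSON-RPC POST. The sketch below uses only the standard library. The `/mcp` path and `tools/call` method come from the data flow above and the port from the serve output, but the `params` layout (`name` plus `arguments`), the helper names, and the generous timeout are assumptions, not the behaviour of mcp_client.py.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

MCP_URL = "http://localhost:8080/mcp"


def build_tool_call(name: str, arguments: Optional[Dict[str, Any]] = None,
                    request_id: int = 1) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 body for tools/call (params shape is assumed)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


def post_tool_call(body: Dict[str, Any], url: str = MCP_URL) -> Dict[str, Any]:
    """POST the request and decode the JSON reply; needs the server running."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Re-indexing can be slow on large folders, hence the long timeout.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


rebuild_request = build_tool_call("rebuild_index")
```

With the server up, `post_tool_call(rebuild_request)` sends the request. Nothing is sent when the snippet is merely imported.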

Health check

GET http://localhost:8080/health
GET http://localhost:8080/tools
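
The two endpoints can also be probed from Python, which is handy in a startup script. This is a sketch that assumes both endpoints reply with JSON (the README does not state the response format). `probe` and `http_get_json` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Callable, Dict

BASE_URL = "http://localhost:8080"


def http_get_json(url: str) -> dict:
    """GET a URL and decode the body as JSON (assumes a JSON reply)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def probe(base_url: str = BASE_URL,
          fetch: Callable[[str], dict] = http_get_json) -> Dict[str, dict]:
    """Call /health and /tools and return the decoded bodies keyed by path."""
    return {path: fetch(f"{base_url}{path}") for path in ("/health", "/tools")}
```

The `fetch` parameter is injectable so the function can be exercised without a live server. Calling `probe()` with no arguments hits the real endpoints.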

---

Switching Models

Edit config.py or your .env:

# Use mistral instead of llama3
LLM_MODEL=mistral:latest

# Use nomic-embed-text for embeddings
EMBED_MODEL=nomic-embed-text:latest
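
For orientation, environment-driven model settings are typically read along these lines. This is a sketch, not the actual config.py: the `ModelConfig` class and `load_model_config` function are hypothetical, the fallback values are assumed defaults inferred from the models listed in this README, and loading `.env` into the process environment is assumed to happen elsewhere.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ModelConfig:
    llm_model: str
    embed_model: str


def load_model_config(env: Mapping[str, str] = os.environ) -> ModelConfig:
    """Read model names from the environment, falling back to assumed defaults."""
    return ModelConfig(
        llm_model=env.get("LLM_MODEL", "llama3:latest"),  # assumed default
        embed_model=env.get("EMBED_MODEL", "qwen3-embedding:0.6b"),  # assumed default
    )
```

Vectors produced by different embedding models are not comparable, so after changing EMBED_MODEL the index should be rebuilt with `python main.py index`.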

---

Tuning Chunk Size

In config.py or .env:

| Setting | Default | Notes |
|---------|---------|-------|
| CHUNK_SIZE | 256 | Tokens per chunk. Smaller = more precise retrieval |
| CHUNK_OVERLAP | 25 | Overlap between chunks. Helps preserve context at boundaries |
| SIMILARITY_TOP_K | 5 | Chunks retrieved per query |
| RESPONSE_MODE | compact | compact \| tree_summarize \| refine |
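
To see how CHUNK_SIZE and CHUNK_OVERLAP interact, the sketch below computes token offsets for a plain sliding window. It is a simplification: LlamaIndex's splitters also try to respect sentence boundaries, so real chunk edges will differ, and `chunk_spans` is a hypothetical helper rather than part of this project.

```python
from typing import List, Tuple


def chunk_spans(n_tokens: int, chunk_size: int = 256,
                overlap: int = 25) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for fixed-size chunks with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # each new chunk starts this many tokens later
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last chunk reached the end of the document
        start += stride
    return spans


print(chunk_spans(600))  # [(0, 256), (231, 487), (462, 600)]
```

With the defaults, a 600-token document yields three chunks, and each pair of neighbours shares 25 tokens. Raising the overlap shrinks the stride and so produces more chunks to embed and store.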
