mcp-codebase-index

MikeRecognex/mcp-codebase-index
57 starsAGPL-3.0Community

Install to Claude Code

This server doesn't publish a one-line install command. Follow the setup in the source repository.

Summary

Structural codebase indexer exposing 17 query tools (functions, classes, imports, dependency graphs, change impact) via MCP. Zero dependencies.

README.md

<!-- mcp-name: io.github.MikeRecognex/mcp-codebase-index -->

mcp-codebase-index

![PyPI version](https://pypi.org/project/mcp-codebase-index/) ![CI](https://github.com/MikeRecognex/mcp-codebase-index/actions/workflows/ci.yml) ![Python 3.11+](https://www.python.org/downloads/) ![License: AGPL-3.0](https://www.gnu.org/licenses/agpl-3.0) ![MCP](https://modelcontextprotocol.io) ![Zero Dependencies]()

A structural codebase indexer with an MCP server for AI-assisted development. Zero runtime dependencies — uses Python's ast module for Python analysis and regex-based parsing for TypeScript/JS, Go, Rust, and C#. Requires Python 3.11+.

What It Does

Indexes codebases by parsing source files into structural metadata -- functions, classes, imports, dependency graphs, and cross-file call chains -- then exposes 18 query tools via the Model Context Protocol, enabling Claude Code and other MCP clients to navigate codebases efficiently without reading entire files.

Automatic incremental re-indexing: In git repositories, the index stays up to date automatically. Before every query, the server checks git diff and git status (~1-2ms). If files changed, only those files are re-parsed and the dependency graph is rebuilt. No need to manually call reindex after edits, branch switches, or pulls.

Persistent disk cache: The index is saved to a pickle cache file (.codebase-index-cache.pkl) after every build. On subsequent server starts, the cache is loaded and validated against the current git HEAD — if the ref matches, startup is instant. If a small number of files changed (≤20), the cached index is loaded and incrementally updated instead of rebuilt from scratch. This eliminates the cold-start penalty when restarting Claude Code sessions, restarting the MCP server, or resuming work after context compaction.

Language Support

| Language | Method | Extracts | |----------|--------|----------| | Python (.py) | AST parsing | Functions, classes, methods, imports, dependency graph | | TypeScript/JS (.ts, .tsx, .js, .jsx) | Regex-based | Functions, arrow functions, classes, interfaces, type aliases, imports | | Go (.go) | Regex-based | Functions, methods (receiver-based), structs, interfaces, type aliases, imports, doc comments | | Rust (.rs) | Regex-based | Functions (pub/async/const/unsafe), structs, enums, traits, impl blocks, use statements, attributes, doc comments, macro_rules | | C# (.cs) | Regex-based | Classes, interfaces, structs, enums, records, methods, constructors, using directives, [Attributes], /// XML doc comments | | Markdown/Text (.md, .txt, .rst) | Heading detection | Sections (# headings, underlines, numbered, ALL-CAPS) | | Other | Generic | Line counts only |

Installation

pip install "mcp-codebase-index[mcp]"

The [mcp] extra includes the MCP server dependency. Omit it if you only need the programmatic API.

For development (from a local clone):

pip install -e ".[dev,mcp]"

MCP Server

Running

# As a console script
PROJECT_ROOT=/path/to/project mcp-codebase-index

# As a Python module
PROJECT_ROOT=/path/to/project python -m mcp_codebase_index.server

PROJECT_ROOT specifies which directory to index. Defaults to the current working directory.

Persistent Cache

In git repositories, the server automatically caches the index to .codebase-index-cache.pkl in the project root. On startup:

  1. Cache hit (exact match): If the cached git ref matches the current HEAD, the index loads instantly from disk — no parsing, no file walking.
  2. Cache hit (small changeset): If ≤20 files changed since the cached ref, the cached index is loaded and incrementally updated on the first query.
  3. Cache miss: If the changeset is large or no cache exists, a full rebuild runs and saves a new cache.

Add .codebase-index-cache.pkl to your .gitignore — it's a local-only build artifact.

Configuring with OpenClaw

Install the package on the machine where OpenClaw is running:

# Local install
pip install "mcp-codebase-index[mcp]"

# Or inside a Docker container / remote VPS
docker exec -it openclaw bash
pip install "mcp-codebase-index[mcp]"

Add the MCP server to your OpenClaw agent config (openclaw.json):

{
  "agents": {
    "list": [{
      "id": "main",
      "mcp": {
        "servers": [
          {
            "name": "codebase-index",
            "command": "mcp-codebase-index",
            "env": {
              "PROJECT_ROOT": "/path/to/project"
            }
          }
        ]
      }
    }]
  }
}

Restart OpenClaw and verify the connection:

openclaw mcp list

All 18 tools will be available to your agent.

Performance note: The server automatically detects file changes via git diff before every query (~1-2ms) and incrementally re-indexes only what changed. However, OpenClaw's default MCP integration via mcporter spawns a fresh server process per tool call, which discards the in-memory index and forces a full rebuild each time (~1-2s for small projects, longer for large ones). With persistent caching, these cold starts are now significantly faster — the server loads from the disk cache instead of re-parsing the entire codebase. For persistent connections (avoiding even the cache load overhead), use the openclaw-mcp-adapter plugin, which connects once at startup and keeps the server running:

pip install openclaw-mcp-adapter

Configuring with Claude Code

Add to your project's .mcp.json:

{
  "mcpServers": {
    "codebase-index": {
      "command": "mcp-codebase-index",
      "env": {
        "PROJECT_ROOT": "/path/to/project"
      }
    }
  }
}

Or using the Python module directly (useful if installed in a virtualenv):

{
  "mcpServers": {
    "codebase-index": {
      "command": "/path/to/.venv/bin/python3",
      "args": ["-m", "mcp_codebase_index.server"],
      "env": {
        "PROJECT_ROOT": "/path/to/project"
      }
    }
  }
}

Reinforcing Tool Usage with Hooks

Claude Code tends to default to built-in Glob/Grep/Read tools even when codebase-index is available. In addition to CLAUDE.md instructions (see below), you can add hooks that fire on every prompt to reinforce the behavior. Add this to .claude/settings.local.json:

{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo 'CRITICAL REMINDER: Use codebase-index MCP tools FIRST for ALL code navigation (find_symbol, get_function_source, search_codebase, get_dependencies, etc). Only fall back to Glob/Grep/Read for non-code files.'"
          }
        ]
      }
    ],
    "UserPromptSubmit": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "echo 'Use codebase-index MCP tools first for code navigation.'"
          }
        ]
      }
    ]
  }
}

Hook stdout is injected as context Claude sees before responding. SessionStart fires on startup, resume, and context compaction. UserPromptSubmit fires on every turn.

Important: Make the AI Actually Use Indexed Tools

By default, AI assistants will ignore the indexed tools and fall back to reading entire files with Glob/Grep/Read. Soft language like "prefer" gets rationalized away. Add this to your project's CLAUDE.md (or equivalent instructions file) with mandatory language:

## Codebase Navigation — MANDATORY

You MUST use codebase-index MCP tools FIRST when exploring or navigating the codebase. This is not optional.

- ALWAYS start with: get_project_summary, find_symbol, get_function_source, get_class_source,
  get_structure_summary, get_dependencies, get_dependents, get_change_impact, get_call_chain, search_codebase
- Only fall back to Read/Glob/Grep when codebase-index tools genuinely don't have what you need
  (e.g. reading non-code files, config, frontmatter)
- If you catch yourself reaching for Glob/Grep/Read to find or understand code, STOP and use
  codebase-index instead

The word "prefer" is too weak — models treat it as a suggestion and default to familiar tools. Mandatory language with explicit fallback criteria is what actually changes behavior.

Available Tools (18)

| Tool | Description | |------|-------------| | get_project_summary | File count, packages, top classes/functions | | list_files | List indexed files with optional glob filter | | get_structure_summary | Structure of a file or the whole project | | get_functions | List functions with name, lines, params | | get_classes | List classes with name, lines, methods, bases | | get_imports | List imports with module, names, line | | get_function_source | Full source of a function/method | | get_class_source | Full source of a class | | find_symbol | Find where a symbol is defined (file, line, type) | | get_dependencies | What a symbol calls/uses | | get_dependents | What calls/uses a symbol | | get_change_impact | Direct + transitive dependents | | get_call_chain | Shortest dependency path (BFS) | | get_file_dependencies | Files imported by a given file | | get_file_dependents | Files that import from a given file | | search_codebase | Regex search across all files (max 100 results) | | reindex | Force full re-index (rarely needed — incremental updates happen automatically in git repos) | | get_usage_stats | Session efficiency stats: tool calls, characters returned vs total source, estimated token savings |

Benchmarks

Tested across four real-world projects on an M-series MacBook Pro, from a small project to CPython itself (1.1 million lines):

Index Build Performance

| Project | Files | Lines | Functions | Classes | Index Time | Peak Memory | |---------|------:|------:|----------:|--------:|-----------:|------------:| | RMLPlus | 36 | 7,762 | 237 | 55 | 0.9s | 2.4 MB | | FastAPI | 2,556 | 332,160 | 4,139 | 617 | 5.7s | 55 MB | | Django | 3,714 | 707,493 | 29,995 | 7,371 | 36.2s | 126 MB | | CPython | 2,464 | 1,115,334 | 59,620 | 9,037 | 55.9s | 197 MB |

With persistent caching, subsequent startups bypass the full build entirely. Cache load time is negligible compared to parsing — a cache hit on CPython restores the full index in under a second instead of 56s.

Query Response Size vs Total Source

Querying CPython — 41 million characters of source code:

| Query | Response | Total Source | Reduction | |-------|-------:|------------:|----------:| | find_symbol("TestCase") | 67 chars | 41,077,561 chars | 99.9998% | | get_dependencies("compile") | 115 chars | 41,077,561 chars | 99.9997% | | get_change_impact("TestCase") | 16,812 chars | 41,077,561 chars | 99.96% | | get_function_source("compile") | 4,531 chars | 41,077,561 chars | 99.99% | | get_function_source("run_unittest") | 439 chars | 41,077,561 chars | 99.999% |

find_symbol returns 54-67 characters regardless of whether the project is 7K lines or 1.1M lines. Response size scales with the answer, not the codebase.

get_change_impact("TestCase") on CPython found 154 direct dependents and 492 transitive dependents in 0.45ms — the kind of query that's impossible without a dependency graph. Use max_direct and max_transitive to cap output to your token budget.

Query Response Time

All targeted queries return in sub-millisecond time, even on CPython's 1.1M lines:

| Query | RMLPlus | FastAPI | Django | CPython | |-------|--------:|--------:|-------:|--------:| | find_symbol | 0.01ms | 0.01ms | 0.03ms | 0.08ms | | get_dependencies | 0.00ms | 0.00ms | 0.00ms | 0.01ms | | get_change_impact | 0.02ms | 0.00ms | 2.81ms | 0.45ms | | get_function_source | 0.01ms | 0.02ms | 0.03ms | 0.10ms |

Run the benchmarks yourself: python benchmarks/benchmark.py

How Is This Different from LSP?

LSP answers "where is this function?" — mcp-codebase-index answers "what happens if I change it?" LSP is point queries: one symbol, one file, one position. It can tell you where LLMClient is defined and who references it. But ask "what breaks transitively if I refactor LLMClient?" and LSP has nothing. This tool returns 11 direct dependents and 31 transitive impacts in a single call — 204 characters. To get the same answer from LSP, the AI would need to chain dozens of find-reference calls recursively, reading files at every step, burning thousands of tokens to reconstruct what the dependency graph already knows.

LSP also requires you to install a separate language server for every language in your project — pyright for Python, vtsls for TypeScript, gopls for Go. Each one is a heavyweight binary with its own dependencies and configuration. mcp-codebase-index is zero dependencies, handles Python + TypeScript/JS + Go + Rust + C# + Markdown out of the box, and every response has built-in token budget controls (max_results, max_lines). LSP was built for IDEs. This was built for AI.

Programmatic Usage

from mcp_codebase_index.project_indexer import ProjectIndexer
from mcp_codebase_index.query_api import create_project_query_functions

indexer = ProjectIndexer("/path/to/project", include_patterns=["**/*.py"])
index = indexer.index()
query_funcs = create_project_query_functions(index)

# Use query functions
print(query_funcs["get_project_summary"]())
print(query_funcs["find_symbol"]("MyClass"))
print(query_funcs["get_change_impact"]("some_function"))

Development

pip install -e ".[dev,mcp]"
pytest tests/ -v
ruff check src/ tests/

References

The structural indexer was originally developed as part of the RMLPlus project, an implementation of the Recursive Language Models framework.

License

This project is dual-licensed:

If you're using mcp-codebase-index as a standalone MCP server for development, the AGPL-3.0 license applies at no cost. If you're embedding it in a proprietary product or offering it as part of a hosted service, you'll need a commercial license. See COMMERCIAL-LICENSE.md for details.

Related MCP servers

Browse all →