DocPulse

mkhanal/docpulse

Summary

DocPulse enables users to distill large documentation into compact, LLM-ready summaries by crawling web, PDF, and local sources, augmented with community insights from Reddit/StackOverflow, all running locally on Apple Silicon.

README.md

🩺 DocPulse

DocPulse is a local-first MCP (Model Context Protocol) Server that transforms massive documentation into high-density, LLM-ready "Implementation Manifestos."

Built for Apple Silicon, it leverages native MLX for local inference, combined with intelligent web crawling and community context search (Reddit/StackOverflow) to provide a 360-degree view of any library, standard, or regulation.


---

🎯 Why DocPulse?

  • Token Efficiency: Scraping and distilling docs locally cuts token usage. Instead of sending thousands of lines of raw HTML to a cloud LLM, you send only a dense, 2-3 page distilled manifesto.
  • Internal Network Support: Designed to work seamlessly within organizational networks. If your documentation is hosted on an internal wiki or API that doesn't require MFA/Auth, DocPulse can ingest it and provide context without exposing sensitive raw data to external scraping services.

🚀 Key Features

  • Multi-Source Ingestion:
      • Web: Intelligent crawling that bypasses JS-heavy UI noise.
      • PDF: Deep parsing of regulatory or technical PDF documents.
      • Local Files: Ingest single files (.md, .txt, .py, etc.).
      • Directories: Recursive scanning of entire folders (local or mounted remote drives like OneDrive).
  • Local LLM Distillation: Uses mlx-lm with DeepSeek models to extract API signatures, version constraints, and logical edge cases.
  • Community Augmentation: Automatically fetches recent community discussions to identify undocumented bugs or workarounds.
  • Fixed-Resource Optimal Sizing: By default, strictly utilizes a highly optimized 7B distillation model to maximize extraction speed and save RAM for other coding agents without losing extraction accuracy.
  • Human-in-the-Loop Feedback: Save human corrections that are injected into future distillation runs for the same subject.
  • File-System Caching: Fast retrieval of previously synthesized context.

---

🧠 Intelligent Defaults

DocPulse is designed for a seamless, zero-config startup experience.

  • Dynamic Model Selection: On launch, DocPulse detects your system's total RAM and automatically selects the most capable model from our curated DeepSeek-R1 Distill suite:
      • < 24GB RAM: 7B (High-speed, minimal overhead).
      • 24GB - 64GB RAM: 14B (Deep extraction & reasoning).
      • > 64GB RAM: 32B (Maximum fidelity for complex standards).
  • Auto-Bootstrapping: The system automatically initializes your local config at ~/.config/docpulse/, creates the required cache directories, and downloads MLX model weights on demand.
  • Environment Configuration: We provide a comprehensive .env.example template. Simply cp .env.example .env to manage optional search API keys (Brave/Google) or force a specific model size using the DOCPULSE_MODEL override.

---

🛠️ Requirements

  • Hardware: Apple Silicon (M1, M2, M3, M4).
  • Software: Python 3.10+, uv recommended.
  • Environment: macOS (optimized for unified memory).

---

📦 Installation

  1. Clone the repository:
   git clone https://github.com/your-username/docpulse.git
   cd docpulse
  2. Set up with uv:
   uv sync
  3. Install crawl4ai dependencies:
   uv run crawl4ai-setup
  4. Configure environment:
   cp .env.example .env
   # Edit .env with your preference/keys
  5. Run the CLI:
   uv run docpulse get fastapi --source="https://fastapi.tiangolo.com/tutorial/"

---

⚡ Quick Start (Claude Desktop)

  1. Install uv: curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone & Sync:
   git clone https://github.com/your-username/docpulse.git && cd docpulse && uv sync
  3. Configure: cp .env.example .env (Add search keys if desired)
  4. Add to Claude: Add the config below to your claude_desktop_config.json.
  5. Start Coding: Ask Claude: "Analyze the FastAPI documentation for memory management patterns."

---

⚙️ Configuration

DocPulse features a robust, tiered configuration system.

1. Configuration Tiers

Settings are loaded in the following priority (highest first):

  1. User Config: ~/.config/docpulse/config.toml
  2. Default Config: config.default.toml (bundled with the repo)
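
One way to picture the two tiers is a recursive overlay in which user values win key by key and everything else falls back to the bundled defaults. The sketch below works on already-parsed dictionaries. The `merge_config` name and the sample values are invented for illustration, and whether DocPulse merges per key or per table is an assumption.

```python
from copy import deepcopy


def merge_config(defaults: dict, user: dict) -> dict:
    """Return defaults overlaid with user values, merging nested tables."""
    merged = deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for the parsed contents of config.default.toml and
# ~/.config/docpulse/config.toml; the values are made up for this example.
defaults = {
    "app": {"name": "docpulse", "log_level": "INFO"},
    "distiller": {"max_tokens": 2048, "temperature": 0.2},
}
user = {"app": {"log_level": "DEBUG"}}

config = merge_config(defaults, user)
print(config["app"])
```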

2. Configurable Settings

You can override any of these keys in your config.toml:

[app]

  • name: The name of the MCP server.
  • log_level: Set to DEBUG, INFO, WARNING, or ERROR.

[harvester]

  • text_extensions: List of file extensions to include in recursive scans.
  • encoding: Default encoding for local files (default: utf-8).
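
To show how these two settings might interact during a recursive scan, here is a rough sketch. The `harvest_directory` function is hypothetical, and it assumes extensions are written with a leading dot (for example ".md"), which the configuration reference does not confirm.

```python
from pathlib import Path


def harvest_directory(
    root: Path, text_extensions: list[str], encoding: str = "utf-8"
) -> dict[str, str]:
    """Recursively read files whose suffix appears in text_extensions."""
    allowed = {ext.lower() for ext in text_extensions}
    documents: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in allowed:
            # errors="replace" keeps one badly encoded file from aborting the scan
            documents[str(path.relative_to(root))] = path.read_text(
                encoding=encoding, errors="replace"
            )
    return documents
```

Files with other suffixes (images, binaries) are skipped, so only text that can be distilled reaches the model.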

[distiller]

  • max_tokens: Maximum output length for the manifesto.
  • temperature: LLM sampling temperature.

[prompts]

Prompts are no longer stored in the TOML configuration. Instead, they live in the prompts/ directory.

  • Priority: ~/.config/docpulse/prompts/{name}.txt > prompts/{name}.txt.
  • Placeholder tokens: {raw_text}, {community_context}, {human_feedback}.
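
The lookup order and placeholder substitution described above could look roughly like this. Both function names and the prompt name used in the usage note are hypothetical. The sketch substitutes tokens in a single regex pass rather than with `str.format`, an assumed design choice so that braces inside harvested code or docs are left alone.

```python
import re
from pathlib import Path

# The three placeholder tokens named in the configuration reference.
_TOKEN = re.compile(r"\{(raw_text|community_context|human_feedback)\}")


def resolve_prompt(name: str, user_dir: Path, bundled_dir: Path) -> Path:
    """Return the user override if it exists, else the bundled prompt."""
    for directory in (user_dir, bundled_dir):
        candidate = directory / f"{name}.txt"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no prompt named {name!r}")


def render_prompt(template: str, **values: str) -> str:
    """Fill known placeholders once; unknown braces stay untouched."""
    return _TOKEN.sub(lambda match: values.get(match.group(1), ""), template)
```

For example, `resolve_prompt("distill", Path.home() / ".config/docpulse/prompts", Path("prompts"))` would prefer a user-level `distill.txt` when one is present.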

[models.entries]

Maps model repository strings to minimum RAM requirements (GB).

3. Environment Variables (.env)

Used for sensitive keys and quick overrides:

  • DOCPULSE_MODEL: The pipeline strictly defaults to a 7B model because data extraction relies heavily on deletion/formatting rather than novel synthesis. Overtaxing VRAM with a 32B model causes severe bottlenecks. If you _must_ override this, set this variable to 14B, 32B, or a full HuggingFace repo link.
  • DOCPULSE_CACHE_DIR: Set the directory where distilled documentation is saved (defaults to .docpulse_cache in the current working directory).
  • BRAVE_API_KEY: For Brave Search augmentation.
  • GOOGLE_API_KEY & GOOGLE_CSE_ID: For Google Search augmentation.
  • _Note: DuckDuckGo is the default and requires no key._

---

🧩 MCP Integration

Add to Claude Desktop

Add the following to your claude_desktop_config.json:

{
  "mcpServers": {
    "docpulse": {
      "command": "uv",
      "args": ["--directory", "/path/to/docpulse", "run", "python", "server.py"]
    }
  }
}
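
If you prefer to script this step, the following sketch adds the same entry to an existing config file without disturbing other servers. The helper name `add_docpulse_server` is hypothetical. On macOS the file typically lives at ~/Library/Application Support/Claude/claude_desktop_config.json, but confirm the location for your installation.

```python
import json
from pathlib import Path


def add_docpulse_server(config_path: Path, repo_dir: str) -> dict:
    """Insert the docpulse entry into an MCP client config, keeping other servers."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    servers["docpulse"] = {
        "command": "uv",
        "args": ["--directory", repo_dir, "run", "python", "server.py"],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```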

---

🧰 Tools Exposed

get_universal_context

Primary tool for creating or retrieving documentation context.

  • Arguments:
      • subject: Name (e.g., fastapi).
      • version: Version string (e.g., v0.115).
      • source: URL, absolute file path, or absolute directory path.
      • topic_keywords: (Optional) Keywords for community search.
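
For readers curious what an MCP client sends when it invokes this tool, the sketch below builds the JSON-RPC 2.0 `tools/call` request defined by the Model Context Protocol. The `build_tool_call` helper is illustrative, and passing `topic_keywords` as a single string is an assumption, since the argument's type is not documented here.

```python
import json
from itertools import count

_request_ids = count(1)  # simple incrementing JSON-RPC ids


def build_tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


request = build_tool_call(
    "get_universal_context",
    {
        "subject": "fastapi",
        "version": "v0.115",
        "source": "https://fastapi.tiangolo.com/tutorial/",
        "topic_keywords": "async client",  # optional; string type is assumed
    },
)
print(request)
```

In practice an MCP client such as Claude Desktop constructs this message for you; the sketch only shows how the arguments map onto the request.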

report_context_failure

Allows developers to correct the server's output.

  • Arguments:
      • subject: The subject being corrected.
      • feedback: Detailed workaround or bug fix.
      • _DocPulse will inject this feedback into the prompt the next time you request the same subject._
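
A feedback store of this kind can be as simple as one append-only text file per subject. The sketch below is an assumption about how such a store might work, not DocPulse's actual layout: the function names and the `<subject>.txt` naming are hypothetical.

```python
from pathlib import Path


def save_feedback(cache_dir: Path, subject: str, feedback: str) -> Path:
    """Append one correction to the subject's feedback file."""
    feedback_dir = cache_dir / "feedback"
    feedback_dir.mkdir(parents=True, exist_ok=True)
    path = feedback_dir / f"{subject}.txt"  # hypothetical naming scheme
    with path.open("a", encoding="utf-8") as handle:
        handle.write(feedback.strip() + "\n")
    return path


def load_feedback(cache_dir: Path, subject: str) -> str:
    """Return all saved corrections for a subject, or an empty string."""
    path = cache_dir / "feedback" / f"{subject}.txt"
    return path.read_text(encoding="utf-8") if path.exists() else ""
```

The text returned by `load_feedback` is what would be substituted into the prompt on the next request for that subject.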

---

🧪 Testing

DocPulse provides several ways to test the server without requiring an LLM:

1. Dedicated CLI (On-Demand)

Run DocPulse directly from your terminal for one-off distillations:

# Get context for a subject
uv run docpulse get fastapi --source "https://fastapi.tiangolo.com/tutorial/"

# Report a failure or add feedback
uv run docpulse report fastapi "The async client has changed in version 0.115"

2. Automated E2E Script

Run the provided E2E test suite which verifies CLI execution and cache persistence:

./scripts/test_e2e.sh

3. Visual Debugging (MCP Inspector)

Open the interactive MCP Inspector to test tools via a web UI:

uv run fastmcp dev server.py

4. Unit & Multi-Layer Tests

Run the standard pytest suite:

uv run pytest tests/ -v

---

💾 Persistence & Survival

DocPulse is designed to survive reboots and server restarts without losing data.

  • File-System Cache: All distilled context is saved to a local cache directory. By default, this is .docpulse_cache/ in the current working directory. You can change the location with the DOCPULSE_CACHE_DIR environment variable.
  • Automatic Directory Management: The application will automatically ensure the caching directory exists before saving files to it, keeping things zero-configuration.
  • Cache-First Logic: Before performing a new harvest or distillation, the server checks the caching directory. If a match is found, it returns the stored result instantly.
  • Feedback Loop: Human feedback and failure reports are persisted in the feedback/ subdirectory of the cache and are automatically injected into future distillation prompts for that subject.

🤝 Community & Governance

🌟 Pull Requests We'd Love to See

  • Platform Agnosticism: Currently, DocPulse is optimized for Apple Silicon via MLX. We invite PRs to support other backends (llama.cpp, ONNX, etc.) to make the system truly universal.
  • Integration Plugins: Right now, DocPulse works best with direct API/Web access. We welcome PRs for plugins that integrate with specific documentation platforms (Confluence, Notion, SharePoint, etc.) where documentation often lives.

---

📜 License

MIT
