PDF MCP

gzigurella/pdf-mcp

Summary

Enables PDF processing and analysis including text extraction, metadata retrieval, search, page manipulation, splitting/merging, conversion to images, and form handling.

MCP server for PDF processing and analysis using PyPDFium2.

Features

  • extract_text: Extract text content from PDF files with page range support
  • extract_metadata: Extract PDF metadata including title, author, and page count
  • search_text: Search for specific text within PDF files with context
  • get_page_count: Get the total number of pages in a PDF file
  • extract_pages: Extract specific pages from a PDF and save as a new PDF
  • split_pdf: Split a PDF into multiple page-based PDFs with base64 encoding
  • merge_pdfs: Merge multiple PDF files into a single PDF
  • pdf_to_images: Convert PDF pages to PNG images with configurable DPI
  • get_form_fields: Extract all form fields from a PDF including names, types, and values
  • fill_form: Fill form fields in a PDF with provided values and save to output path

Installation

From Git Repository

# Clone the repository
git clone https://github.com/gzigurella/pdf-mcp.git
cd pdf-mcp

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the package
pip install -e .

With uv (recommended)

# Clone and enter directory
git clone https://github.com/gzigurella/pdf-mcp.git
cd pdf-mcp

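# Create a virtual environment, then install with uv
uv venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
uv pip install -e .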

Integration

OpenCode

Add to your ~/.config/opencode/opencode.json:

{
  "mcp": {
    "pdf-mcp": {
      "type": "local",
      "command": [
        "/path/to/pdf-mcp/venv/bin/python",
        "-m",
        "pdf_mcp"
      ],
      "enabled": true
    }
  }
}

Claude Desktop

Add to your Claude Desktop config:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "/path/to/pdf-mcp/venv/bin/python",
      "args": ["-m", "pdf_mcp"]
    }
  }
}

Generic MCP Client

For any MCP-compatible client:

# Start the server directly
/path/to/venv/bin/python -m pdf_mcp

The server communicates over stdio using the Model Context Protocol (MCP).
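
As a rough illustration of what travels over stdio: MCP messages are JSON-RPC 2.0 objects, and a tool invocation uses the `tools/call` method. The sketch below only builds such a request. The helper name `build_tool_call` is hypothetical and not part of pdf-mcp, and a real client must complete the MCP `initialize` handshake before calling any tool (an MCP client library normally handles that).

```python
import json


def build_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request as a single line of JSON.

    Hypothetical helper for illustration; not part of pdf-mcp.
    """
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so one message must fit on
    # one line. json.dumps without `indent` never emits raw newlines.
    return json.dumps(message)


request = build_tool_call(1, "get_page_count", {"file_path": "/path/to/document.pdf"})
print(request)
```

The `name` and `arguments` keys are the same pair used in the example scenarios later in this README.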

Tools

extract_text

Extract text content from a PDF file. Supports PDFs with searchable text and can extract text from specific pages or ranges.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to extract text from |
| pages | string | No | "all" | Page range to extract (e.g., '1-5', '3,7,9', 'all') |

{
  "file_path": "/path/to/document.pdf",
  "pages": "1-5"
}
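
To make the `pages` syntax concrete, here is a minimal sketch of how such a string could be expanded into page numbers. `parse_page_spec` is a hypothetical helper, not the server's own parser. It assumes 1-based page numbers and rejects out-of-range values, and the real implementation may behave differently on edge cases.

```python
def parse_page_spec(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as '1-5', '3,7,9', or 'all' into 1-based page numbers.

    Hypothetical helper; the server's parser may treat duplicates or
    out-of-range pages differently.
    """
    if spec.strip().lower() == "all":
        return list(range(1, total_pages + 1))

    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"invalid page range {part!r} for {total_pages} pages")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_spec("1-3,7,9", total_pages=10))  # [1, 2, 3, 7, 9]
```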

extract_metadata

Extract metadata from a PDF file including title, author, subject, keywords, creator, producer, creation date, modification date, and page count.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to extract metadata from |

{
  "file_path": "/path/to/document.pdf"
}

search_text

Search for specific text within a PDF file. Returns page numbers and context around the found text. Useful for finding specific content in large documents.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to search within |
| query | string | Yes | - | Text to search for in the PDF |
| case_sensitive | boolean | No | false | Whether to perform case-sensitive search |
| context_words | integer | No | 10 | Number of words to include before and after each match |

{
  "file_path": "/path/to/document.pdf",
  "query": "important term",
  "case_sensitive": false,
  "context_words": 5
}
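
The effect of `context_words` and `case_sensitive` can be pictured with a small sketch over plain text. This is an assumption about the general idea, not the server's code: `find_with_context` is a hypothetical helper that matches whole whitespace-separated words, so it is stricter than a real substring search (for example, `clause,` would not match `clause`).

```python
def find_with_context(text: str, query: str, context_words: int = 10,
                      case_sensitive: bool = False) -> list[str]:
    """Return one snippet per match, with `context_words` words on each side.

    Hypothetical sketch; it compares whitespace-separated words exactly.
    """
    words = text.split()
    query_words = query.split()
    if not query_words:
        return []

    # Lowercase both sides unless a case-sensitive search was requested.
    normalize = (lambda s: s) if case_sensitive else str.lower
    haystack = [normalize(w) for w in words]
    needle = [normalize(w) for w in query_words]

    snippets = []
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            start = max(0, i - context_words)
            end = min(len(words), i + len(needle) + context_words)
            snippets.append(" ".join(words[start:end]))
    return snippets


sample = "The supplier accepts the liability clause in full and without limit"
print(find_with_context(sample, "Liability Clause", context_words=2))
```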

get_page_count

Get the total number of pages in a PDF file. Returns a simple integer count.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to count pages for |

{
  "file_path": "/path/to/document.pdf"
}

extract_pages

Extract specific pages from a PDF file and save as a new PDF. Supports page ranges and individual page selection.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the source PDF file |
| pages | string | Yes | - | Pages to extract (e.g., '1-5', '3,7,9', '1,3-5') |
| output_path | string | Yes | - | Path where the extracted pages will be saved as a new PDF |

{
  "file_path": "/path/to/source.pdf",
  "pages": "1,3,5-7",
  "output_path": "/path/to/output.pdf"
}

split_pdf

Split a PDF file into multiple separate PDF files based on page ranges. Returns a JSON object containing a base64-encoded PDF for each selected page. Supports single pages, page ranges, and all pages.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to split |
| page_range | string | Yes | - | Page range to split - 'all', single page (e.g., '1'), or range (e.g., '1-3', '2-5') |

{
  "file_path": "/path/to/document.pdf",
  "page_range": "1-3"
}
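
Because the result arrives as base64 text rather than as files, the caller has to decode each entry. The exact layout of the response is not documented here, so this sketch handles a single base64 string and leaves locating it in the response to you. `decode_base64_pdf` is a hypothetical helper.

```python
import base64


def decode_base64_pdf(encoded: str) -> bytes:
    """Decode one base64 string and check that it looks like a PDF.

    Hypothetical helper; how the string is keyed inside the split_pdf
    response is not shown in this README.
    """
    data = base64.b64decode(encoded, validate=True)
    # A well-formed PDF file begins with the "%PDF-" header.
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded data does not look like a PDF")
    return data
```

The returned bytes can then be written to disk, for example with `pathlib.Path("page_1.pdf").write_bytes(data)`.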

merge_pdfs

Merge multiple PDF files into a single PDF. Files are merged in the order provided.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_paths | array | Yes | - | List of PDF file paths to merge |
| output_path | string | Yes | - | Path where the merged PDF will be saved |

{
  "file_paths": ["/path/to/doc1.pdf", "/path/to/doc2.pdf", "/path/to/doc3.pdf"],
  "output_path": "/path/to/merged.pdf"
}

pdf_to_images

Convert PDF pages to PNG images. Returns a JSON object containing a base64-encoded PNG image for each page. Supports custom DPI settings for resolution control.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to convert to images |
| dpi | integer | No | 150 | Image resolution in dots per inch |
| format | string | No | "png" | Image format (PNG only) |

{
  "file_path": "/path/to/document.pdf",
  "dpi": 300,
  "format": "png"
}
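
To estimate how large the rendered images will be, recall that PDF page sizes are measured in points, with 72 points per inch by default. The sketch below assumes the server scales pages by `dpi / 72`, which is the usual approach but is not confirmed by this README. `rendered_size` is a hypothetical helper, and a renderer may round differently by a pixel.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
print(rendered_size(612, 792, dpi=150))  # (1275, 1650)
print(rendered_size(612, 792, dpi=300))  # (2550, 3300)
```

Doubling the DPI quadruples the pixel count, which matters because every image is returned inline as base64.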

get_form_fields

Extract all form fields from a PDF document including field names, types, current values, and available choices for dropdown fields.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to extract form fields from |

{
  "file_path": "/path/to/form.pdf"
}

Returns a JSON with field information:

{
  "fields": [
    {
      "name": "first_name",
      "type": "text",
      "value": "",
      "page": 1,
      "rect": {"x0": 50, "y0": 72, "x1": 150, "y1": 92}
    },
    {
      "name": "country",
      "type": "combobox",
      "value": "",
      "page": 1,
      "rect": {...},
      "choices": ["USA", "Canada", "UK"]
    },
    {
      "name": "accept_terms",
      "type": "checkbox",
      "value": "",
      "page": 1,
      "rect": {...},
      "on_state": "Yes"
    }
  ],
  "total_fields": 3
}
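
One way to use this response is to check a set of values before passing them to fill_form. The following is a minimal sketch that relies only on the `name` and `choices` keys shown above. `check_fill_values` is a hypothetical helper, and the server may apply its own, different validation.

```python
def check_fill_values(form_info: dict, values: dict) -> list[str]:
    """Report unknown field names and values outside a field's `choices`.

    Hypothetical client-side check based on the get_form_fields response.
    """
    known = {field["name"]: field for field in form_info["fields"]}
    problems = []
    for name, value in values.items():
        field = known.get(name)
        if field is None:
            problems.append(f"unknown field: {name}")
        elif "choices" in field and value not in field["choices"]:
            problems.append(f"{name}: {value!r} is not one of {field['choices']}")
    return problems


form_info = {
    "fields": [
        {"name": "first_name", "type": "text", "value": ""},
        {"name": "country", "type": "combobox", "value": "",
         "choices": ["USA", "Canada", "UK"]},
    ],
    "total_fields": 2,
}
print(check_fill_values(form_info, {"first_name": "John", "country": "France"}))
```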

fill_form

Fill form fields in a PDF document with provided values and save to output path. Supports text fields, checkboxes, radio buttons, and dropdowns.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the source PDF file |
| fields | object | Yes | - | Dictionary of field names and their values to fill |
| output_path | string | Yes | - | Path where the filled PDF will be saved |

{
  "file_path": "/path/to/form.pdf",
  "fields": {
    "first_name": "John",
    "last_name": "Doe",
    "country": "USA",
    "accept_terms": true
  },
  "output_path": "/path/to/filled_form.pdf"
}

Checkbox values may be given as true/false, "yes"/"no", or "1"/"0". For radio buttons, pass the field's on_state value, which you can look up by calling get_form_fields first.
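
The accepted checkbox spellings can be pictured as a small normalization step. This is a sketch under assumptions: `to_checkbox_state` is a hypothetical helper, and whether the server ignores case or surrounding whitespace is not stated in this README.

```python
def to_checkbox_state(value: object) -> bool:
    """Map true/false, "yes"/"no", or "1"/"0" to a boolean.

    Hypothetical helper; case-insensitive matching is an assumption.
    """
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("yes", "1"):
            return True
        if text in ("no", "0"):
            return False
    raise ValueError(f"unsupported checkbox value: {value!r}")


print(to_checkbox_state("yes"), to_checkbox_state("0"))  # True False
```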

Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PDF_MCP_DEBUG | false | Enable debug logging |

# Example
export PDF_MCP_DEBUG=true
python -m pdf_mcp
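
When an MCP client launches the server, the variable has to be set in the client's configuration rather than in your shell. The sketch below generates a Claude Desktop entry like the one in the Integration section, adding the optional `env` map that Claude Desktop supports for each server. `claude_desktop_entry` is a hypothetical helper used only to produce the JSON.

```python
import json


def claude_desktop_entry(python_path: str, debug: bool = False) -> dict:
    """Build the `mcpServers` config for pdf-mcp, optionally with debug logging."""
    entry = {"command": python_path, "args": ["-m", "pdf_mcp"]}
    if debug:
        # Environment variables passed to the server process by the client.
        entry["env"] = {"PDF_MCP_DEBUG": "true"}
    return {"mcpServers": {"pdf-mcp": entry}}


config = claude_desktop_entry("/path/to/pdf-mcp/venv/bin/python", debug=True)
print(json.dumps(config, indent=2))
```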

Development

Running Tests

source venv/bin/activate
pytest

# With coverage
pytest --cov=src --cov-report=html

Project Structure

pdf-mcp/
├── src/pdf_mcp/
│   ├── __init__.py
│   ├── __main__.py
│   ├── server.py
│   ├── config.py
│   └── tools/
│       ├── __init__.py
│       ├── extract_text.py
│       ├── extract_metadata.py
│       ├── search_text.py
│       ├── get_page_count.py
│       ├── extract_pages.py
│       ├── split_pdf.py
│       ├── merge_pdfs.py
│       ├── pdf_to_images.py
│       ├── get_form_fields.py
│       └── fill_form.py
├── tests/
├── pyproject.toml
└── README.md

Troubleshooting

Installation Issues

If you encounter installation errors, ensure you have Python 3.10 or later:

python --version

File Not Found Errors

Make sure the PDF file paths are correct and the files exist:

ls -l /path/to/your/document.pdf

Encrypted PDFs

The tools raise a RuntimeError when asked to process an encrypted PDF. Remove the password protection before passing a file to the server.

Memory Issues with Large PDFs

For very large PDF files, consider processing them in smaller chunks using the extract_pages or split_pdf tools.
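
A simple way to plan such chunks is to generate one page-range string per batch from the result of get_page_count. This is a minimal sketch: `chunk_page_ranges` is a hypothetical helper, and it assumes 1-based, inclusive ranges in the 'start-end' form shown for split_pdf.

```python
def chunk_page_ranges(total_pages: int, chunk_size: int) -> list[str]:
    """Split pages 1..total_pages into consecutive 'start-end' range strings."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A one-page chunk is written as a single page number.
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(chunk_page_ranges(120, 50))  # ['1-50', '51-100', '101-120']
```

Each string can then be passed as `page_range` to split_pdf or as `pages` to extract_pages, one call per chunk.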

Permission Errors (Linux)

If you encounter permission errors, ensure the PDF files are readable:

chmod +r /path/to/your/document.pdf

Security Considerations

  • File Access: The server only processes files that exist and are readable by the running process
  • Path Validation: All file paths are validated before processing
  • No Network Access: The server does not make any network requests
  • Temporary Files: Temporary files are properly cleaned up after processing
  • Error Handling: Sensitive information is not exposed in error messages
  • Encrypted PDFs: Password-protected PDFs are rejected with appropriate error messages

Example Usage Scenarios

Scenario 1: Extract Text from Specific Pages

{
  "name": "extract_text",
  "arguments": {
    "file_path": "/documents/report.pdf",
    "pages": "1-3,7,9"
  }
}

Scenario 2: Search and Extract Context

{
  "name": "search_text",
  "arguments": {
    "file_path": "/documents/contract.pdf",
    "query": "liability clause",
    "case_sensitive": true,
    "context_words": 15
  }
}

Scenario 3: Merge Multiple Reports

{
  "name": "merge_pdfs",
  "arguments": {
    "file_paths": [
      "/reports/q1.pdf",
      "/reports/q2.pdf",
      "/reports/q3.pdf",
      "/reports/q4.pdf"
    ],
    "output_path": "/reports/annual.pdf"
  }
}

Scenario 4: Convert PDF to Images

{
  "name": "pdf_to_images",
  "arguments": {
    "file_path": "/documents/presentation.pdf",
    "dpi": 300
  }
}

License

MIT
