PDF MCP
MCP server for PDF processing and analysis using pypdfium2.
Features
- extract_text: Extract text content from PDF files with page range support
- extract_metadata: Extract PDF metadata including title, author, and page count
- search_text: Search for specific text within PDF files with context
- get_page_count: Get the total number of pages in a PDF file
- extract_pages: Extract specific pages from a PDF and save as a new PDF
- split_pdf: Split a PDF into per-page PDFs returned as base64-encoded data
- merge_pdfs: Merge multiple PDF files into a single PDF
- pdf_to_images: Convert PDF pages to PNG images with configurable DPI
- get_form_fields: Extract all form fields from a PDF including names, types, and values
- fill_form: Fill form fields in a PDF with provided values and save to output path
Installation
From Git Repository
# Clone the repository
git clone https://github.com/gzigurella/pdf-mcp.git
cd pdf-mcp
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the package
pip install -e .
With uv (recommended)
# Clone and enter directory
git clone https://github.com/gzigurella/pdf-mcp.git
cd pdf-mcp
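# Create a virtual environment and install with uv
uv venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
uv pip install -e .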
Integration
OpenCode
Add to your ~/.config/opencode/opencode.json:
{
"mcpServers": {
"pdf-mcp": {
"type": "local",
"command": [
"/path/to/pdf-mcp/venv/bin/python",
"-m",
"pdf_mcp"
],
"enabled": true
}
}
}
Claude Desktop
Add to your Claude Desktop config:
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%\Claude\claude_desktop_config.json
- Linux: ~/.config/Claude/claude_desktop_config.json
{
"mcpServers": {
"pdf-mcp": {
"command": "/path/to/pdf-mcp/venv/bin/python",
"args": ["-m", "pdf_mcp"]
}
}
}
Generic MCP Client
For any MCP-compatible client:
# Start the server directly
/path/to/venv/bin/python -m pdf_mcp
The server communicates via stdio using the MCP protocol.
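As a rough illustration of what travels over stdio: MCP messages are JSON-RPC 2.0 objects, and a tool invocation uses the tools/call method. The sketch below only builds and serializes such a request. The helper name build_tool_call and the request id are illustrative, and a real client must first complete the MCP initialize handshake, which is omitted here.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP "tools/call" method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(1, "get_page_count", {"file_path": "/path/to/document.pdf"})

# The stdio transport carries one JSON message per line, so the
# serialized request must not contain embedded newlines.
line = json.dumps(request) + "\n"
print(line, end="")
```

In practice an MCP client library handles this framing; the sketch is only meant to show the shape of a call.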
Tools
extract_text
Extract text content from a PDF file. Supports PDFs with searchable text and can extract text from specific pages or ranges.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to extract text from |
| pages | string | No | "all" | Page range to extract (e.g., '1-5', '3,7,9', 'all') |
{
"file_path": "/path/to/document.pdf",
"pages": "1-5"
}
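To make the pages syntax concrete, here is a minimal sketch of how a specification such as '1-3,7,9' or 'all' could be expanded into page numbers. It is not the server's actual parser: the function name parse_page_spec is hypothetical, and it assumes pages are numbered from 1 and that out-of-range values should be rejected.

```python
def parse_page_spec(spec, total_pages):
    """Expand a spec such as "1-3,7" or "all" into sorted 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, total_pages + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # ASSUMPTION: ranges outside 1..total_pages are treated as errors.
        if not 1 <= start <= end <= total_pages:
            raise ValueError(
                f"invalid page range {part!r} for a {total_pages}-page document"
            )
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_spec("1-3,7,9", total_pages=10))  # [1, 2, 3, 7, 9]
```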
extract_metadata
Extract metadata from a PDF file including title, author, subject, keywords, creator, producer, creation date, modification date, and page count.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to extract metadata from |
{
"file_path": "/path/to/document.pdf"
}
search_text
Search for specific text within a PDF file. Returns page numbers and context around the found text. Useful for finding specific content in large documents.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to search within |
| query | string | Yes | - | Text to search for in the PDF |
| case_sensitive | boolean | No | false | Whether to perform case-sensitive search |
| context_words | integer | No | 10 | Number of words to include before and after each match |
{
"file_path": "/path/to/document.pdf",
"query": "important term",
"case_sensitive": false,
"context_words": 5
}
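The effect of context_words and case_sensitive can be pictured with a small sketch that works on already-extracted text. This is an illustration under simplifying assumptions, not the tool's implementation: find_with_context is a hypothetical name, and the matching is done on whole whitespace-separated words, so punctuation attached to a word would prevent a match.

```python
def find_with_context(text, query, context_words=10, case_sensitive=False):
    """Return snippets with `context_words` words on each side of every match."""
    words = text.split()
    query_words = query.split()
    if not query_words:
        return []

    def norm(word):
        return word if case_sensitive else word.lower()

    target = [norm(w) for w in query_words]
    snippets = []
    for i in range(len(words) - len(target) + 1):
        window = [norm(w) for w in words[i:i + len(target)]]
        if window == target:
            # Clamp the context window to the bounds of the text.
            start = max(0, i - context_words)
            end = min(len(words), i + len(target) + context_words)
            snippets.append(" ".join(words[start:end]))
    return snippets


sample = "The supplier accepts the liability clause as written in section four"
print(find_with_context(sample, "liability clause", context_words=2))
# ['accepts the liability clause as written']
```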
get_page_count
Get the total number of pages in a PDF file. Returns a simple integer count.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to count pages for |
{
"file_path": "/path/to/document.pdf"
}
extract_pages
Extract specific pages from a PDF file and save as a new PDF. Supports page ranges and individual page selection.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the source PDF file |
| pages | string | Yes | - | Pages to extract (e.g., '1-5', '3,7,9', '1,3-5') |
| output_path | string | Yes | - | Path where the extracted pages will be saved as a new PDF |
{
"file_path": "/path/to/source.pdf",
"pages": "1,3,5-7",
"output_path": "/path/to/output.pdf"
}
split_pdf
Split a PDF file into separate PDF files based on a page range. Returns a JSON object containing one base64-encoded PDF per selected page. Supports single pages, page ranges, and all pages.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to split |
| page_range | string | Yes | - | Page range to split - 'all', single page (e.g., '1'), or range (e.g., '1-3', '2-5') |
{
"file_path": "/path/to/document.pdf",
"page_range": "1-3"
}
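Because split_pdf returns base64-encoded data rather than writing files, the caller has to decode the payloads. The sketch below shows one way to do that. The response layout is an assumption: this README does not document the key names, so "pages", "page", and "data" are hypothetical, as is the helper name save_split_result.

```python
import base64
import tempfile
from pathlib import Path


def save_split_result(result, out_dir):
    """Decode each base64 payload and write it to out_dir as page_<n>.pdf."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # ASSUMPTION: "pages", "page", and "data" are hypothetical key names.
    for entry in result["pages"]:
        target = out_dir / f"page_{entry['page']}.pdf"
        target.write_bytes(base64.b64decode(entry["data"]))
        written.append(target)
    return written


# Stand-in response; a real one would carry complete PDF files.
fake_result = {
    "pages": [
        {"page": 1, "data": base64.b64encode(b"%PDF-1.4 stub").decode("ascii")}
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    for path in save_split_result(fake_result, tmp):
        print(path.name, len(path.read_bytes()), "bytes")
```

Check the actual tool output before relying on these key names.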
merge_pdfs
Merge multiple PDF files into a single PDF. Files are merged in the order provided.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_paths | array | Yes | - | List of PDF file paths to merge |
| output_path | string | Yes | - | Path where the merged PDF will be saved |
{
"file_paths": ["/path/to/doc1.pdf", "/path/to/doc2.pdf", "/path/to/doc3.pdf"],
"output_path": "/path/to/merged.pdf"
}
pdf_to_images
Convert PDF pages to PNG images. Returns a JSON object containing one base64-encoded PNG image per page. The dpi parameter controls the output resolution.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to convert to images |
| dpi | integer | No | 150 | Image resolution in dots per inch |
| format | string | No | "png" | Image format (PNG only) |
{
"file_path": "/path/to/document.pdf",
"dpi": 300,
"format": "png"
}
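To estimate how large the images will be for a given dpi, recall that PDF page sizes are expressed in points of 1/72 inch, so the scale factor is dpi / 72. The sketch below applies that arithmetic; rendered_size is a hypothetical helper, and the renderer's own rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user-space units default to 1/72 inch


def rendered_size(width_pt, height_pt, dpi=150):
    """Return the approximate (width, height) in pixels of a page at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))           # (1275, 1650)
print(rendered_size(612, 792, dpi=300))  # (2550, 3300)
```

Doubling the dpi quadruples the pixel count, which matters because the images are returned inline as base64.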
get_form_fields
Extract all form fields from a PDF document including field names, types, current values, and available choices for dropdown fields.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the PDF file to extract form fields from |
{
"file_path": "/path/to/form.pdf"
}
Returns a JSON object with field information:
{
  "fields": [
    {
      "name": "first_name",
      "type": "text",
      "value": "",
      "page": 1,
      "rect": {"x0": 50, "y0": 72, "x1": 150, "y1": 92}
    },
    {
      "name": "country",
      "type": "combobox",
      "value": "",
      "page": 1,
      "rect": {...},
      "choices": ["USA", "Canada", "UK"]
    },
    {
      "name": "accept_terms",
      "type": "checkbox",
      "value": "",
      "page": 1,
      "rect": {...},
      "on_state": "Yes"
    }
  ],
  "total_fields": 3
}
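The field list returned by get_form_fields can be used to sanity-check values before they are passed to fill_form. The sketch below relies only on the "fields", "name", and "choices" keys shown in the sample response; the helper name check_values is hypothetical, and the checks it performs are a suggestion rather than something the server requires.

```python
def check_values(form_info, values):
    """Return a list of problems found when comparing values to the form fields."""
    known = {field["name"]: field for field in form_info["fields"]}
    problems = []
    for name, value in values.items():
        field = known.get(name)
        if field is None:
            problems.append(f"unknown field: {name}")
        elif "choices" in field and value not in field["choices"]:
            problems.append(f"{name}: {value!r} is not one of {field['choices']}")
    return problems


# Trimmed version of the sample response (rect and page omitted).
form_info = {
    "fields": [
        {"name": "first_name", "type": "text", "value": ""},
        {"name": "country", "type": "combobox", "value": "",
         "choices": ["USA", "Canada", "UK"]},
    ],
    "total_fields": 2,
}

for problem in check_values(form_info, {"country": "France", "surname": "Doe"}):
    print(problem)
```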
fill_form
Fill form fields in a PDF document with provided values and save to output path. Supports text fields, checkboxes, radio buttons, and dropdowns.
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| file_path | string | Yes | - | Path to the source PDF file |
| fields | object | Yes | - | Dictionary of field names and their values to fill |
| output_path | string | Yes | - | Path where the filled PDF will be saved |
{
"file_path": "/path/to/form.pdf",
"fields": {
"first_name": "John",
"last_name": "Doe",
"country": "USA",
"accept_terms": true
},
"output_path": "/path/to/filled_form.pdf"
}
Checkboxes accept true/false, "yes"/"no", or "1"/"0". For radio buttons, pass the on_state value reported by get_form_fields, so call get_form_fields first.
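The accepted checkbox spellings can be folded into a single boolean before building the fields object. This is a sketch of one possible normalization, not the server's code: normalize_checkbox is a hypothetical name, and treating the strings case-insensitively is an assumption.

```python
def normalize_checkbox(value):
    """Map true/false, "yes"/"no", or "1"/"0" to a Python bool."""
    if isinstance(value, bool):
        return value
    # ASSUMPTION: string values are compared case-insensitively.
    text = str(value).strip().lower()
    if text in {"yes", "1"}:
        return True
    if text in {"no", "0"}:
        return False
    raise ValueError(f"unsupported checkbox value: {value!r}")


print(normalize_checkbox("yes"))  # True
print(normalize_checkbox("0"))    # False
```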
Configuration
Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| PDF_MCP_DEBUG | false | Enable debug logging |
# Example
export PDF_MCP_DEBUG=true
python -m pdf_mcp
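For readers wondering how a flag like PDF_MCP_DEBUG is typically interpreted, the sketch below reads it as a boolean. It is not taken from config.py: the function name debug_enabled and the set of strings treated as true are assumptions, and only the variable name and its default of false come from the table above.

```python
import os


def debug_enabled(environ=os.environ):
    """Return True when PDF_MCP_DEBUG holds a truthy string."""
    raw = environ.get("PDF_MCP_DEBUG", "false")
    # ASSUMPTION: these spellings count as "on"; the server may accept fewer.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


print(debug_enabled({"PDF_MCP_DEBUG": "true"}))  # True
print(debug_enabled({}))                         # False
```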
Development
Running Tests
source venv/bin/activate
pytest
# With coverage
pytest --cov=src --cov-report=html
Project Structure
pdf-mcp/
├── src/pdf_mcp/
│ ├── __init__.py
│ ├── __main__.py
│ ├── server.py
│ ├── config.py
│ └── tools/
│ ├── __init__.py
│ ├── extract_text.py
│ ├── extract_metadata.py
│ ├── search_text.py
│ ├── get_page_count.py
│ ├── extract_pages.py
│ ├── split_pdf.py
│ ├── merge_pdfs.py
│ ├── pdf_to_images.py
│ ├── get_form_fields.py
│ └── fill_form.py
├── tests/
├── pyproject.toml
└── README.md
Troubleshooting
Installation Issues
If you encounter installation errors, ensure you have Python 3.10 or later:
python --version
File Not Found Errors
Make sure the PDF file paths are correct and the files exist:
ls -l /path/to/your/document.pdf
Encrypted PDFs
The tools raise a RuntimeError when asked to process an encrypted PDF. Make sure your PDFs are not password-protected before passing them to the server.
Memory Issues with Large PDFs
For very large PDF files, consider processing them in smaller chunks using the extract_pages or split_pdf tools.
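One way to plan such chunks is to generate page-range strings in the '1-50' form that the pages and page_range parameters accept, using the total from get_page_count. The helper name chunk_ranges and the chunk size of 50 are illustrative choices.

```python
def chunk_ranges(total_pages, chunk_size):
    """Return page-range strings such as "1-50" that cover pages 1..total_pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size must be >= 1")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A one-page chunk is written as a single page number.
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(chunk_ranges(120, 50))  # ['1-50', '51-100', '101-120']
```

Each string can then be passed to a separate extract_pages or split_pdf call.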
Permission Errors (Linux)
If you encounter permission errors, ensure the PDF files are readable:
chmod +r /path/to/your/document.pdf
Security Considerations
- File Access: The server only processes files that exist and are readable by the running process
- Path Validation: All file paths are validated before processing
- No Network Access: The server does not make any network requests
- Temporary Files: Temporary files are properly cleaned up after processing
- Error Handling: Sensitive information is not exposed in error messages
- Encrypted PDFs: Password-protected PDFs are rejected with appropriate error messages
Example Usage Scenarios
Scenario 1: Extract Text from Specific Pages
{
"name": "extract_text",
"arguments": {
"file_path": "/documents/report.pdf",
"pages": "1-3,7,9"
}
}
Scenario 2: Search and Extract Context
{
"name": "search_text",
"arguments": {
"file_path": "/documents/contract.pdf",
"query": "liability clause",
"case_sensitive": true,
"context_words": 15
}
}
Scenario 3: Merge Multiple Reports
{
"name": "merge_pdfs",
"arguments": {
"file_paths": [
"/reports/q1.pdf",
"/reports/q2.pdf",
"/reports/q3.pdf",
"/reports/q4.pdf"
],
"output_path": "/reports/annual.pdf"
}
}
Scenario 4: Convert PDF to Images
{
"name": "pdf_to_images",
"arguments": {
"file_path": "/documents/presentation.pdf",
"dpi": 300
}
}
License
MIT






