<div align="center"> <h1>@cyanheads/internet-archive-mcp-server</h1> <p><b>Search the Wayback Machine and IA library (40M+ items), fetch archived snapshots, retrieve item metadata and full text via MCP. STDIO or Streamable HTTP.</b> <div>5 Tools • 1 Resource</div> </p> </div>
---
Tools
Five tools covering two Internet Archive pillars — Wayback Machine snapshot discovery and retrieval, and IA library search and content access:
| Tool | Description |
|:---|:---|
| `ia_find_snapshots` | Find Wayback Machine snapshots of a URL. Mode `closest` returns the nearest capture to a given timestamp. Mode `history` returns the full capture list via CDX with date range, status, and MIME filters, collapsed by default to one capture per day. Supports resume-key pagination for large histories. |
| `ia_get_snapshot` | Fetch the archived content of a URL at a specific Wayback timestamp. Strips HTML to readable text and returns the canonical replay URL. |
| `ia_search_items` | Search the IA library (40M+ items). Filter by media type, collection, creator, date range, and language. Sort by relevance, date, or downloads. Returns identifiers, titles, types, and pagination context (`total_found`, `page`, `rows`). |
| `ia_get_item` | Retrieve full metadata and the file manifest for an Archive item by identifier: title, creator, description, subjects, collections, license, and every file with its format, size, and direct download URL. |
| `ia_get_text` | Retrieve readable OCR text (DjVuTXT or plain text) from a text item. Length-aware truncation with a continuation pointer (`char_offset`) for paging through large documents. |
ia_find_snapshots
Discover what the Wayback Machine has captured for any URL.
- `closest` mode: single fast lookup via the Availability API; returns the nearest capture to a given timestamp
- `history` mode: full capture list via the CDX API, filterable by date range (`from`/`to`), HTTP status (`status_filter`), and MIME type
- Default collapse of `timestamp:8` (one capture per day) keeps responses tractable for popular URLs; adjust with the `collapse` parameter (`timestamp:N`, N=1–14)
- Resume-key pagination (`resume_key`) for stepping through large CDX histories without re-scanning
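As a rough illustration of how the `history` options above could map onto a CDX request, here is a minimal sketch. It is not the server's actual implementation: the `CdxHistoryOptions` interface and `buildCdxHistoryUrl` function are hypothetical names, and only the query parameters (`collapse`, `from`, `to`, `filter`, `limit`, `showResumeKey`, `resumeKey`) come from the public CDX API.

```typescript
// Hypothetical option names for illustration; the server's real input schema may differ.
interface CdxHistoryOptions {
  url: string;
  from?: string; // e.g. "2020" or "20200101"
  to?: string;
  statusFilter?: string; // e.g. "200"
  mimeFilter?: string; // e.g. "text/html"
  collapse?: string; // "timestamp:8" keeps one capture per day
  limit?: number;
  resumeKey?: string; // value returned by a previous page
}

function buildCdxHistoryUrl(opts: CdxHistoryOptions): string {
  const params = new URLSearchParams();
  params.set("url", opts.url);
  params.set("output", "json");
  // Comparing only the first 8 timestamp digits (YYYYMMDD) collapses to one row per day.
  params.set("collapse", opts.collapse ?? "timestamp:8");
  // Ask the CDX server to append a resume key so the next page can continue from there.
  params.set("showResumeKey", "true");
  if (opts.from) params.set("from", opts.from);
  if (opts.to) params.set("to", opts.to);
  if (opts.statusFilter) params.append("filter", `statuscode:${opts.statusFilter}`);
  if (opts.mimeFilter) params.append("filter", `mimetype:${opts.mimeFilter}`);
  if (opts.limit !== undefined) params.set("limit", String(opts.limit));
  if (opts.resumeKey) params.set("resumeKey", opts.resumeKey);
  return `https://web.archive.org/cdx/search/cdx?${params.toString()}`;
}

console.log(buildCdxHistoryUrl({ url: "example.com", from: "2020", statusFilter: "200", limit: 100 }));
```

Passing the `resumeKey` from one response into the next request is what lets a caller step through a long history without re-scanning earlier captures.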
---
ia_get_snapshot
Retrieve what a page actually said at a point in time.
- Resolves to the nearest available capture when the exact timestamp has no snapshot
- Strips Wayback banner injections and extracts readable text — returns clean content alongside the canonical replay URL for browser access
- Useful for fact-checking, citation verification, and tracing how content changed over time
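For orientation, the sketch below shows the public Wayback replay URL convention that a timestamp and URL resolve to. The helper names are hypothetical and the server's own banner-stripping logic is not shown here; the `id_` modifier is the Wayback Machine's standard way to request the original capture without the injected toolbar.

```typescript
// Convert a Date to the 14-digit Wayback timestamp (YYYYMMDDhhmmss, UTC).
function toWaybackTimestamp(date: Date): string {
  // "2020-01-02T03:04:05.000Z" -> "20200102030405.000Z" -> first 14 characters
  return date.toISOString().replace(/[-:T]/g, "").slice(0, 14);
}

// Build a replay URL. With raw=true the "id_" modifier asks Wayback for the
// original bytes, without the toolbar and link rewriting.
function buildReplayUrl(timestamp: string, url: string, raw: boolean = false): string {
  const modifier = raw ? "id_" : "";
  return `https://web.archive.org/web/${timestamp}${modifier}/${url}`;
}

const ts = toWaybackTimestamp(new Date(Date.UTC(2020, 0, 2, 3, 4, 5)));
console.log(buildReplayUrl(ts, "https://example.com/"));
console.log(buildReplayUrl(ts, "https://example.com/", true));
```

When no capture exists at the exact timestamp, Wayback redirects the replay URL to the nearest capture, which matches the nearest-capture behavior described above.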
---
ia_search_items
Search across 40M+ Archive items by keyword and metadata filters.
- Full-text Solr query syntax plus structured filters: `mediatype` (texts, audio, video, software, image), `collection`, `creator`, `language`, and date range
- Sort by relevance, date added, or download count
- Pagination via `page` and `rows`; output includes `total_found` and current `page`/`rows` so agents can paginate correctly without guessing
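The pagination fields above are enough to decide whether another page exists. The sketch below builds a request against the public `advancedsearch.php` endpoint and checks for remaining results. The function names and the chosen `fl[]` fields are assumptions for illustration, not the server's code.

```typescript
// Pagination context as described above; the field names mirror the tool output.
interface SearchPage {
  total_found: number;
  page: number; // 1-based
  rows: number;
}

// Hypothetical helper: build an Advanced Search request URL.
function buildSearchUrl(query: string, page: number, rows: number, sort?: string): string {
  const params = new URLSearchParams();
  params.set("q", query); // Solr syntax, e.g. 'mediatype:texts AND language:eng'
  params.append("fl[]", "identifier");
  params.append("fl[]", "title");
  params.append("fl[]", "mediatype");
  if (sort) params.append("sort[]", sort); // e.g. "downloads desc"
  params.set("rows", String(rows));
  params.set("page", String(page));
  params.set("output", "json");
  return `https://archive.org/advancedsearch.php?${params.toString()}`;
}

// True when results remain beyond the current page.
function hasMorePages(p: SearchPage): boolean {
  return p.page * p.rows < p.total_found;
}

console.log(buildSearchUrl("mediatype:texts AND language:eng", 1, 50, "downloads desc"));
console.log(hasMorePages({ total_found: 120, page: 2, rows: 50 }));
```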
---
ia_get_item
Fetch the complete metadata and file manifest for any Archive item.
- Returns structured fields: `title`, `creator`, `description`, `subjects`, `collections`, `date`, `license`, and more
- `files[]` includes every file in the item with its `format`, `size`, and direct download URL; this is the primary way to act on a search result
- An empty metadata response (`{}`) for an unknown identifier is surfaced as a typed `item_not_found` error
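A minimal sketch of the behavior above: turning a raw Metadata API response into a file manifest, and treating an empty object as `item_not_found`. The `ManifestFile` shape, `ItemNotFoundError` class, and `toFileManifest` function are hypothetical; the `https://archive.org/download/{identifier}/{file}` URL pattern is the Archive's public download convention.

```typescript
// Hypothetical output shape; missing fields stay null rather than being guessed.
interface ManifestFile {
  name: string;
  format: string | null;
  size: number | null; // bytes
  url: string;
}

// Hypothetical typed error carrying a machine-readable reason.
class ItemNotFoundError extends Error {
  readonly reason = "item_not_found";
  constructor(identifier: string) {
    super(`No Archive item with identifier "${identifier}"`);
  }
}

function toFileManifest(identifier: string, raw: unknown): ManifestFile[] {
  // The Metadata API answers unknown identifiers with an empty object.
  if (typeof raw !== "object" || raw === null || Object.keys(raw).length === 0) {
    throw new ItemNotFoundError(identifier);
  }
  const files = (raw as { files?: unknown }).files;
  if (!Array.isArray(files)) return [];
  const manifest: ManifestFile[] = [];
  for (const entry of files) {
    if (typeof entry !== "object" || entry === null) continue;
    const rec = entry as Record<string, unknown>;
    const name = rec["name"];
    if (typeof name !== "string") continue; // skip entries we cannot address
    const format = rec["format"];
    const sizeRaw = rec["size"]; // the API reports sizes as strings
    const size = typeof sizeRaw === "string" ? Number(sizeRaw) : NaN;
    // Encode each path segment but keep "/" so nested files stay addressable.
    const path = name.split("/").map(encodeURIComponent).join("/");
    manifest.push({
      name,
      format: typeof format === "string" ? format : null,
      size: isFinite(size) ? size : null,
      url: `https://archive.org/download/${encodeURIComponent(identifier)}/${path}`,
    });
  }
  return manifest;
}
```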
---
ia_get_text
Read the OCR text of public-domain books, documents, and transcripts.
- Locates the best available text file in the item's manifest (DjVuTXT preferred, falls back to plain text)
- `max_chars` and `char_offset` enable efficient paging through long documents without re-fetching
- Surfaces `download_forbidden` (HTTP 403) as a typed error for restricted collections rather than failing silently
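The paging contract can be sketched as a pure slice over the full text. Only `max_chars` and `char_offset` come from the tool description; the `TextChunk` shape, `next_char_offset`, `total_chars`, and the `sliceText` function are hypothetical names used for illustration.

```typescript
// Hypothetical response shape for one page of text.
interface TextChunk {
  text: string;
  char_offset: number; // where this chunk starts
  total_chars: number;
  next_char_offset: number | null; // null when the document is exhausted
}

function sliceText(full: string, charOffset: number, maxChars: number): TextChunk {
  // Clamp the offset into the valid range so out-of-range requests yield an empty chunk.
  const start = Math.min(Math.max(charOffset, 0), full.length);
  const end = Math.min(start + Math.max(maxChars, 0), full.length);
  return {
    text: full.slice(start, end),
    char_offset: start,
    total_chars: full.length,
    next_char_offset: end < full.length ? end : null,
  };
}

// Walk a document page by page until the continuation pointer runs out.
let offset: number | null = 0;
while (offset !== null) {
  const chunk: TextChunk = sliceText("abcdefghij", offset, 4);
  console.log(chunk.text);
  offset = chunk.next_char_offset;
}
```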
Resource
| Type | Name | Description |
|:---|:---|:---|
| Resource | `ia://item/{identifier}` | Metadata snapshot for an Archive item: title, creator, mediatype, description, subjects, collections, date, license, and file count. Stable URIs for injectable context. |
All resource data is also reachable via ia_get_item. The resource provides a stable, injectable URI for referencing a specific item across workflows.
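A client that wants to fall back from the resource to the tool needs the identifier out of the URI. The following is a small sketch under the assumption that identifiers contain no `/`, `?`, or `#`; `parseItemUri` is a hypothetical helper, not part of the server.

```typescript
// Extract the identifier from an ia://item/{identifier} URI, or return null
// when the URI does not match the template.
function parseItemUri(uri: string): string | null {
  const match = /^ia:\/\/item\/([^/?#]+)$/.exec(uri);
  return match ? decodeURIComponent(match[1]) : null;
}

console.log(parseItemUri("ia://item/example-item"));
console.log(parseItemUri("ia://collection/example"));
```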
Features
Built on @cyanheads/mcp-ts-core:
- Declarative tool, resource, and prompt definitions — single file per primitive, framework handles registration and validation
- Unified error handling — handlers throw, framework catches, classifies, and formats
- Pluggable auth: `none`, `jwt`, `oauth`
- Swappable storage backends: `in-memory`, `filesystem`, `Supabase`, `Cloudflare KV/R2/D1`
- Structured logging with optional OpenTelemetry tracing
- STDIO and Streamable HTTP transports
Internet Archive-specific:
- No credentials required — all four APIs are public
- Three service layers: `WaybackService` (Availability + CDX), `ArchiveSearchService` (Solr), `ArchiveMetadataService` (Metadata + downloads)
- CDX collapse-by-day default and configurable `limit` keep responses tractable for high-capture URLs
- Sends an identifying User-Agent on every request, as required by IA's terms; configurable via `IA_USER_AGENT`
Agent-friendly output:
- Pagination context on every list response: `total_found`, `page`, `rows` (search) and `resume_key` (CDX history), so agents never have to guess whether results are complete
- Typed error reasons (`no_snapshots`, `no_snapshot_available`, `item_not_found`, `no_text_file`, `download_forbidden`) with recovery hints so callers can retry or explain to users without parsing text
- Structured file manifests: every `ia_get_item` response includes file-level metadata (format, size, URL), enabling agents to select the right file without a follow-up call
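On the calling side, typed reasons let an agent branch without parsing message text. The sketch below is illustrative only: the five reason strings come from the list above, but the hint wording and the `explainError` helper are invented here and are not the server's actual recovery hints.

```typescript
type IaErrorReason =
  | "no_snapshots"
  | "no_snapshot_available"
  | "item_not_found"
  | "no_text_file"
  | "download_forbidden";

// Illustrative hint text (an assumption); the server ships its own hints.
const RECOVERY_HINTS: Record<IaErrorReason, string> = {
  no_snapshots: "Widen the date range or relax the status and MIME filters.",
  no_snapshot_available: "Try a different timestamp or list captures with history mode.",
  item_not_found: "Check the identifier, or search for the item first.",
  no_text_file: "Inspect the file manifest; the item may not contain OCR text.",
  download_forbidden: "The item is access-restricted; report this rather than retrying.",
};

function isIaErrorReason(value: unknown): value is IaErrorReason {
  return typeof value === "string" && Object.prototype.hasOwnProperty.call(RECOVERY_HINTS, value);
}

// Map a reason to a next step, falling back to a generic message for unknown values.
function explainError(reason: unknown): string {
  return isIaErrorReason(reason)
    ? RECOVERY_HINTS[reason]
    : "Unrecognized error reason; surface the original message to the user.";
}

console.log(explainError("item_not_found"));
```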
Getting started
No API key required — the Internet Archive's APIs are fully public.
Add the following to your MCP client configuration file:
{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "bunx",
      "args": ["@cyanheads/internet-archive-mcp-server@latest"],
      "env": {
        "MCP_TRANSPORT_TYPE": "stdio",
        "MCP_LOG_LEVEL": "info"
      }
    }
  }
}
Or with npx (no Bun required):
{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "npx",
      "args": ["-y", "@cyanheads/internet-archive-mcp-server@latest"],
      "env": {
        "MCP_TRANSPORT_TYPE": "stdio",
        "MCP_LOG_LEVEL": "info"
      }
    }
  }
}
Or with Docker:
{
  "mcpServers": {
    "internet-archive-mcp-server": {
      "type": "stdio",
      "command": "docker",
      "args": [
        "run", "-i", "--rm",
        "-e", "MCP_TRANSPORT_TYPE=stdio",
        "ghcr.io/cyanheads/internet-archive-mcp-server:latest"
      ]
    }
  }
}
For Streamable HTTP, set the transport and start the server:
MCP_TRANSPORT_TYPE=http MCP_HTTP_PORT=3010 bun run start:http
# Server listens at http://localhost:3010/mcp
Prerequisites
- Bun v1.3.2 or higher (or Node.js v24+).
- No external accounts or API keys required.
Installation
- Clone the repository:
git clone https://github.com/cyanheads/internet-archive-mcp-server.git
- Navigate into the directory:
cd internet-archive-mcp-server
- Install dependencies:
bun install
- Configure environment:
cp .env.example .env
# Optional: edit .env for custom User-Agent, timeouts, etc.
Configuration
All configuration is validated at startup via Zod schemas in src/config/server-config.ts.
| Variable | Description | Default |
|:---------|:------------|:--------|
| `MCP_TRANSPORT_TYPE` | Transport: `stdio` or `http` | `stdio` |
| `MCP_HTTP_PORT` | HTTP server port | `3010` |
| `MCP_AUTH_MODE` | Auth mode: `none`, `jwt`, or `oauth` | `none` |
| `MCP_LOG_LEVEL` | Log level (`debug`, `info`, `notice`, `warning`, `error`) | `info` |
| `LOGS_DIR` | Directory for log files (Node.js only) | `<project-root>/logs` |
| `STORAGE_PROVIDER_TYPE` | Storage backend | `in-memory` |
| `OTEL_ENABLED` | Enable OpenTelemetry instrumentation | `false` |
| `IA_USER_AGENT` | Custom User-Agent for IA API requests | `internet-archive-mcp-server/{version} (github.com/cyanheads/internet-archive-mcp-server)` |
| `IA_REQUEST_TIMEOUT_MS` | HTTP request timeout in milliseconds | `30000` |
| `IA_MAX_SNAPSHOT_CHARS` | Default character cap for `ia_get_text` responses | `50000` |
See .env.example for the full list of optional overrides.
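To show what startup validation of the `IA_*` variables amounts to, here is a dependency-free stand-in. The real server uses Zod schemas in `src/config/server-config.ts`; this sketch deliberately swaps in plain TypeScript, and the `IaConfig` shape, `parsePositiveInt`, and `loadIaConfig` are hypothetical names. The defaults are the ones listed in the table above.

```typescript
// Hypothetical config shape; null userAgent means "use the server's built-in default".
interface IaConfig {
  userAgent: string | null;
  requestTimeoutMs: number;
  maxSnapshotChars: number;
}

// Parse a positive integer or fall back to the default; reject anything else loudly.
function parsePositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!isFinite(n) || n <= 0 || Math.floor(n) !== n) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

function loadIaConfig(env: Record<string, string | undefined>): IaConfig {
  return {
    userAgent: env["IA_USER_AGENT"] ?? null,
    requestTimeoutMs: parsePositiveInt("IA_REQUEST_TIMEOUT_MS", env["IA_REQUEST_TIMEOUT_MS"], 30000),
    maxSnapshotChars: parsePositiveInt("IA_MAX_SNAPSHOT_CHARS", env["IA_MAX_SNAPSHOT_CHARS"], 50000),
  };
}

console.log(loadIaConfig({ IA_REQUEST_TIMEOUT_MS: "5000" }));
```

Failing at startup on a malformed value keeps a typo in `.env` from surfacing later as a confusing request timeout.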
Running the server
Local development
- Build and run:
# One-time build
bun run rebuild
# Run the built server
bun run start:stdio
# or
bun run start:http
- Run checks and tests:
bun run devcheck # Lint, format, typecheck, security
bun run test # Vitest test suite
bun run lint:mcp # Validate MCP definitions against spec
Docker
docker build -t internet-archive-mcp-server .
docker run --rm -p 3010:3010 internet-archive-mcp-server
The Dockerfile defaults to HTTP transport, stateless session mode, and logs to /var/log/internet-archive-mcp-server. OpenTelemetry peer dependencies are installed by default — build with --build-arg OTEL_ENABLED=false to omit them.
Project structure
| Directory | Purpose |
|:----------|:--------|
| `src/index.ts` | `createApp()` entry point: registers tools, resource, and inits services. |
| `src/config` | Server-specific environment variable parsing and validation with Zod. |
| `src/mcp-server/tools` | Tool definitions (`*.tool.ts`). Five tools across Wayback and IA library. |
| `src/mcp-server/resources` | Resource definitions. `ia://item/{identifier}` item metadata resource. |
| `src/services/wayback` | `WaybackService`: Availability API + CDX API client. |
| `src/services/archive-search` | `ArchiveSearchService`: Solr Advanced Search client. |
| `src/services/archive-metadata` | `ArchiveMetadataService`: Metadata API + file download client. |
| `tests/` | Unit and integration tests mirroring `src/`. |
Development guide
See CLAUDE.md for development guidelines and architectural rules. The short version:
- Handlers throw, framework catches: no `try/catch` in tool logic
- Use `ctx.log` for request-scoped logging, `ctx.state` for tenant-scoped storage
- Register new tools and resources via the barrels in `src/mcp-server/*/definitions/index.ts`
- Wrap external API calls: validate raw → normalize to domain type → return output schema; never fabricate missing fields
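The "validate raw → normalize → return" rule can be illustrated with the Availability API (`https://archive.org/wayback/available`), whose JSON nests the nearest capture under `archived_snapshots.closest`. This is a sketch, not the project's code: `ClosestSnapshot` and `normalizeAvailability` are hypothetical names, and the project itself validates with Zod rather than hand-written checks.

```typescript
// Hypothetical domain type; optional data stays null instead of being invented.
interface ClosestSnapshot {
  url: string;
  timestamp: string; // 14-digit Wayback timestamp
  status: string | null;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Validate the raw payload step by step; return null when no usable capture exists.
function normalizeAvailability(raw: unknown): ClosestSnapshot | null {
  if (!isRecord(raw)) return null;
  const snapshots = raw["archived_snapshots"];
  if (!isRecord(snapshots)) return null;
  const closest = snapshots["closest"];
  if (!isRecord(closest) || closest["available"] !== true) return null;
  const url = closest["url"];
  const timestamp = closest["timestamp"];
  const status = closest["status"];
  // Required fields must be present and well-typed; never fill them with guesses.
  if (typeof url !== "string" || typeof timestamp !== "string") return null;
  return { url, timestamp, status: typeof status === "string" ? status : null };
}

console.log(
  normalizeAvailability({
    archived_snapshots: {
      closest: {
        available: true,
        url: "http://web.archive.org/web/20200102030405/https://example.com/",
        timestamp: "20200102030405",
        status: "200",
      },
    },
  }),
);
```

A tool handler following the rules above would turn a `null` result into a thrown typed error such as `no_snapshot_available` and let the framework format it.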
Contributing
Issues and pull requests are welcome. Run checks and tests before submitting:
bun run devcheck
bun run test
License
Apache-2.0 — see LICENSE for details.






