Firecrawl Toolkit

Turn web pages and PDFs into local, searchable Markdown for agents.

Firecrawl Toolkit is a CLI-first web capture tool built on top of the Firecrawl API. It is designed for shell-based agents such as Codex, Claude Code, OpenCode, and other local automation workflows.

Instead of dumping long web pages into model context, the CLI saves the page as a local Markdown file and prints only a minimal success/failure signal to stdout.

firecrawl scrape --url "https://example.com/article" --output article --path ./Temp-Scrape
# true

Then let your agent inspect the file with normal local tools:

rg -n "pricing|risk|governance|download|PDF" ./Temp-Scrape
bat --paging=never --line-range 40:120 ./Temp-Scrape/article.md

Scrape once. Search locally. Keep context clean.

Why this toolkit exists

Most web-search or scrape tools treat the web page as immediate model input. That works for short pages, but it breaks down quickly with long reports, news articles, PDFs, documentation pages, and research material.

This toolkit uses a different workflow:

remote URL / PDF
→ local Markdown file
→ rg / bat / sed / awk / local file tools
→ agent reads only the relevant sections

This is especially useful when an agent needs to collect multiple sources, compare reports, inspect citations, or build a local research folder.

The goal is not to produce a perfectly clean article-only extraction. The goal is to produce complete, searchable, agent-readable local source material with reduced web noise and minimal context pollution.

Core workflow

1. Scrape a web page or PDF into local Markdown

firecrawl scrape \
  --url "https://example.com/report-or-article" \
  --output report \
  --path ./Temp-Scrape

On success:

true

The file is saved as:

./Temp-Scrape/report.md

The generated Markdown begins with source metadata:

## title:
## description:
## url:
## language:
## creditsUsed:

markdown content

2. Search locally

rg -n "agentic|governance|pricing|risk|EBIT|download" ./Temp-Scrape

3. Read only the relevant section

bat --paging=never --line-range 80:150 ./Temp-Scrape/report.md

Or with standard Unix tools:

sed -n '80,150p' ./Temp-Scrape/report.md

Why stdout is intentionally small

For file-producing commands such as scrape, stdout is deliberately minimal.

On success:

true

On failure:

false
<short error reason>

Large page content is written to disk, not printed to stdout by default. This is intentional: agents should not accidentally ingest a 50k, 100k, or 500k character page into context.

The recommended pattern is:

scrape to file
→ inspect file size / headings / keywords
→ read only the useful ranges

Built-in boilerplate filtering

The scrape command applies a built-in noise filter by default.

It reduces common non-content regions such as:

scripts, styles, forms, inputs, buttons
nav bars, headers, footers, asides
menus and navigation blocks
logos and brand blocks
accessibility skip links and visually hidden elements
ads and advertisement containers
sidebars
breadcrumbs and pagination
related, recommended, and trending sections
common layout offset/module blocks

This is not meant to aggressively delete every non-article element. The priority is high recall with reduced noise: keep the source material useful for local search while removing obvious boilerplate.

If a page contains useful content in an unusual region, you can disable the built-in filter:

firecrawl scrape \
  --url "https://example.com/page" \
  --output page-raw \
  --empty-tags

You can also add your own exclusions:

firecrawl scrape \
  --url "https://example.com/page" \
  --output page \
  --exclude-tags ".newsletter,.promo,aside.related"

Installation

Python package

The Python package provides the MCP server.

uvx firecrawl-toolkit

Package:

firecrawl-toolkit

Go CLI

The standalone CLI is located in the cli directory.

Build from source:

cd cli
go test ./...
go build -o firecrawl .

Run:

./firecrawl --help

Build Windows example:

cd cli
go build -o firecrawl.exe .

API key

The Go CLI reads the Firecrawl API key from:

FIRECRAWL_KEY

Linux/macOS:

export FIRECRAWL_KEY="fc-..."

Windows PowerShell:

$env:FIRECRAWL_KEY="fc-..."

The Python MCP server uses:

FIRECRAWL_API_KEY

API base URL

By default, both the Go CLI and the Python MCP server use the official Firecrawl API base URL:

https://api.firecrawl.dev/v2

For a self-hosted Firecrawl-compatible service, set:

FIRECRAWL_BASE_URL

Linux/macOS:

export FIRECRAWL_BASE_URL="https://your-firecrawl.example/v2"

Windows PowerShell:

$env:FIRECRAWL_BASE_URL="https://your-firecrawl.example/v2"

CLI commands

firecrawl aggregated
firecrawl web
firecrawl news
firecrawl image
firecrawl scrape
firecrawl credit-usage

Scrape command

Basic usage

firecrawl scrape \
  --url "https://example.com/article" \
  --output article

This writes:

article.md

Save into a directory

firecrawl scrape \
  --url "https://example.com/article" \
  --output article \
  --path ./Temp-Scrape

This writes:

./Temp-Scrape/article.md

If the directory does not exist, the CLI tries to create it.

Scrape a PDF

firecrawl scrape \
  --url "https://example.com/report.pdf" \
  --output report-pdf \
  --path ./Temp-Scrape

The CLI requests Markdown output and enables Firecrawl’s PDF parser.

Scrape options

`--output`

Required. Export name.

firecrawl scrape --url "https://example.com" --output example

The CLI writes:

example.md

If the provided name already ends with .md, it is preserved.

`--path`

Optional. Output directory.

firecrawl scrape --url "https://example.com" --output example --path ./exports

`--url`

Required. Target web page or PDF URL.

firecrawl scrape --url "https://example.com" --output example

`--include-tags`

Optional. CSS selectors to include.

Use this when you know the useful content is inside a specific region:

firecrawl scrape \
  --url "https://example.com" \
  --output page \
  --include-tags "article"

Multiple selectors:

firecrawl scrape \
  --url "https://example.com" \
  --output page \
  --include-tags ".article-body,#content,main"

JSON array form is recommended when selectors contain spaces, quotes, or commas:

firecrawl scrape \
  --url "https://example.com" \
  --output page \
  --include-tags '["main article",".post-content","#content"]'

`--exclude-tags`

Optional. Additional CSS selectors to exclude.

firecrawl scrape \
  --url "https://example.com" \
  --output page \
  --exclude-tags ".nav,.footer,#sidebar"

This is merged with the built-in boilerplate filter.

`--empty-tags`

Optional. Disable the built-in exclude selector list for this request.

firecrawl scrape \
  --url "https://example.com" \
  --output page-raw \
  --empty-tags

User-provided --exclude-tags are still applied:

firecrawl scrape \
  --url "https://example.com" \
  --output page-custom \
  --empty-tags \
  --exclude-tags ".nav"

`--skip-tls`

Optional. Skip TLS certificate verification for the upstream scrape target.

By default, TLS certificate verification is enabled.

firecrawl scrape \
  --url "https://example.com" \
  --output page \
  --skip-tls

`--headers`

Optional. JSON object of request headers.

firecrawl scrape \
  --url "https://example.com" \
  --output page \
  --headers '{"X-Trace-Id":"abc123"}'

Use sensitive headers carefully. Avoid passing credentials, cookies, or authorization tokens unless you understand the risk.

`--timeout`

Optional. Request timeout in seconds. Default is 120.

firecrawl scrape \
  --url "https://example.com" \
  --output page \
  --timeout 180

Recommended agent usage

Give your agent a narrow workflow instead of exposing every scrape option.

Default capture

firecrawl scrape --url "$URL" --output "$NAME" --path ./Temp-Scrape

Inspect the captured source

rg -n "$KEYWORDS" ./Temp-Scrape
bat --paging=never --line-range 1:120 "./Temp-Scrape/$NAME.md"

If the page is noisy

Try adding exclusions:

firecrawl scrape \
  --url "$URL" \
  --output "$NAME-clean" \
  --path ./Temp-Scrape \
  --exclude-tags ".newsletter,.promo,aside.related"

If the built-in filter removes something useful

Capture a raw version:

firecrawl scrape \
  --url "$URL" \
  --output "$NAME-raw" \
  --path ./Temp-Scrape \
  --empty-tags

Search commands

Search commands return compact single-line JSON.

firecrawl aggregated --query "AI governance 2026" --country US --search-num 10
firecrawl web        --query "AI governance 2026" --country US --search-num 10
firecrawl news       --query "OpenAI news" --search-time week
firecrawl image      --query "firecrawl logo" --search-num 10

Available search commands:

aggregated  web + news + images
web         web results
news        news results
image       image results

Search options

`--query`

Required. Search keywords.

firecrawl web --query "AI pricing SaaS"

`--country`

Optional. Country or region name / ISO code. Default is US.

firecrawl web --query "AI policy" --country "United States"
firecrawl web --query "AI policy" --country US

`--search-num`

Optional. Number of results, from 1 to 100. Default is 20.

firecrawl web --query "AI policy" --search-num 5

`--search-time`

Optional. Time filter.

Allowed values:

hour
day
week
month
year

Example:

firecrawl news --query "AI regulation" --search-time week

`--timeout`

Optional. Request timeout in seconds. Default is 120.

Search output

Search commands output compact JSON:

{"success":true,"data":{"web":[],"news":[],"images":[]},"creditsUsed":1}

Mapped fields:

data.web[]:
  title
  description
  url

data.news[]:
  title
  snippet
  url
  date

data.images[]:
  title
  imageUrl
  url

Search results are intended to help agents discover URLs. For detailed reading, scrape selected URLs into local Markdown files.

Recommended pattern:

firecrawl web --query "AI trust maturity survey 2026" --search-num 5
firecrawl scrape --url "<selected-url>" --output ai-trust-survey --path ./Temp-Scrape
rg -n "governance|risk|agentic|maturity" ./Temp-Scrape

Credit usage

Check Firecrawl team credit usage:

firecrawl credit-usage

Pretty-print:

firecrawl credit-usage --pretty

Default output is JSON:

{"success":true,"data":{"remainingCredits":1000,"planCredits":500000,"billingPeriodStart":"2025-01-01T00:00:00Z","billingPeriodEnd":"2025-01-31T23:59:59Z"}}

Exit behavior

Scrape

Success:

true

Failure:

false
<error reason>

The CLI writes the Markdown file only after a successful scrape. Existing files are not created or overwritten on scrape failure.

Search and credit usage

Search and credit usage commands output JSON.

Example: local research folder

mkdir -p ./Temp-Scrape

firecrawl scrape \
  --url "https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/tech-forward/state-of-ai-trust-in-2026-shifting-to-the-agentic-era" \
  --output mckinsey-ai-trust-2026 \
  --path ./Temp-Scrape

firecrawl scrape \
  --url "https://www.reuters.com/business/world-at-work/ai-will-lead-labour-shortages-jeff-bezos-says-vivatech-2026-06-17/" \
  --output reuters-bezos-ai-labor \
  --path ./Temp-Scrape

rg -n "agentic|governance|risk|labor shortage|AI" ./Temp-Scrape
bat --paging=never --line-range 1:120 ./Temp-Scrape/mckinsey-ai-trust-2026.md

This creates a local source folder that can be searched and revisited without repeatedly fetching or pasting web pages into context.

Python MCP server

The project also includes a Python MCP server.

Run with:

uvx firecrawl-toolkit

Example MCP client configuration:

{
  "mcpServers": {
    "firecrawl": {
      "command": "uvx",
      "args": ["firecrawl-toolkit"],
      "env": {
        "FIRECRAWL_API_KEY": "<Your Firecrawl API key>",
        "FIRECRAWL_MCP_ENABLE_STDIO": "1"
      }
    }
  }
}

MCP environment variables:

| Variable | Default | Description | | ----------------------------------- | ------------------------------- | ------------------------------------------------------------ | | FIRECRAWL_API_KEY | fc-xxx | Firecrawl API key. | | FIRECRAWL_BASE_URL | https://api.firecrawl.dev/v2 | Firecrawl API base URL. Set this for self-hosted services. | | FIRECRAWL_HTTP2 | 0 | Enable HTTP/2 with 1. | | FIRECRAWL_MAX_WORKERS | 10 | Number of worker processes. | | FIRECRAWL_MAX_CONNECTIONS | 200 | Maximum HTTP connections. | | FIRECRAWL_MAX_CONCURRENT_REQUESTS | 200 | Maximum concurrent requests. | | FIRECRAWL_KEEPALIVE | 20 | Maximum keepalive connections. | | FIRECRAWL_RETRY_COUNT | 3 | Maximum retry count. | | FIRECRAWL_RETRY_BASE_DELAY | 0.5 | Base retry delay in seconds. | | FIRECRAWL_ENDPOINT_CONCURRENCY | {"search":10,"scrape":2} | Per-endpoint concurrency limits. | | FIRECRAWL_ENDPOINT_RETRYABLE | {"scrape": false} | Per-endpoint retry configuration. | | FIRECRAWL_MCP_ENABLE_STDIO | 0 | Enable STDIO transport. | | FIRECRAWL_MCP_ENABLE_HTTP | 0 | Enable HTTP transport. | | FIRECRAWL_MCP_ENABLE_SSE | 0 | Enable SSE transport. | | FIRECRAWL_MCP_HTTP_HOST | 127.0.0.1 | HTTP host. | | FIRECRAWL_MCP_HTTP_PORT | 7001 | HTTP port. | | FIRECRAWL_MCP_SSE_HOST | 127.0.0.1 | SSE host. | | FIRECRAWL_MCP_SSE_PORT | 7001 | SSE port. | | FIRECRAWL_MCP_LOCK_FILE | /tmp/firecrawl_mcp.lock | Lock file path. |

STDIO, HTTP, and SSE should be used one at a time. Start separate services with different lock files if multiple transports are needed.

MCP tools

The MCP server provides:

| Tool | Description | | ----------------------------- | ------------------------------------------------ | | firecrawl-aggregated-search | Aggregated web, news, and image search. | | firecrawl-web-search | Web search. | | firecrawl-news-search | News search. | | firecrawl-image-search | Image search. | | firecrawl-scrape | Scrape a URL and return mapped Markdown content. |

For local shell-based agents, the Go CLI is usually the simpler and safer interface because it writes large scrape results to files instead of returning them directly to model context.

Development

Run Go CLI tests:

cd cli
go test ./...

Build the CLI:

cd cli
go build -o firecrawl .

Run Python tests:

pytest

License

This project is licensed under the GNU General Public License v3.0 or later.

Firecrawl MCP Toolkit

Firecrawl Toolkit

Why this toolkit exists

Core workflow

1. Scrape a web page or PDF into local Markdown

2. Search locally

3. Read only the relevant section

Why stdout is intentionally small

Built-in boilerplate filtering

Installation

Python package

Go CLI

API key

API base URL

CLI commands

Scrape command

Basic usage

Save into a directory

Scrape a PDF

Scrape options

--output

--path

--url

--include-tags

--exclude-tags

--empty-tags

--skip-tls

--headers

--timeout

Recommended agent usage

Default capture

Inspect the captured source

If the page is noisy

If the built-in filter removes something useful

Search commands

Search options

--query

--country

--search-num

--search-time

--timeout

Search output

Credit usage

Exit behavior

Scrape

Search and credit usage

Example: local research folder

Python MCP server

MCP tools

Development

License

Related MCP servers

MCP servers by category

`--output`

`--path`

`--url`

`--include-tags`

`--exclude-tags`

`--empty-tags`

`--skip-tls`

`--headers`

`--timeout`

`--query`

`--country`

`--search-num`

`--search-time`

`--timeout`