token-compressor

base76-research-lab/token-compressor
9 starsMITCommunity

Install to Claude Code

This server doesn't publish a one-line install command. Follow the setup in the source repository.

Summary

Compress prompts 40-60% using local LLM + embedding validation. Preserves all conditionals.

README.md

token-compressor

Reduce LLM prompt tokens by 30–70% while preserving semantic meaning.

mcp-name: io.github.base76-research-lab/token-compressor

Semantic prompt compression for LLM workflows. Reduce token usage by 40–60% without losing meaning.

![License: MIT](LICENSE) ![Requires: Ollama](https://ollama.com) ![MCP Compatible](https://modelcontextprotocol.io)

Built by Base76 Research Lab — research into epistemic AI architecture.

---

Live demo

Intent Compiler MVP is now live and uses this project as part of the idea -> spec -> compressed output flow:

  • Live: https://intent-compiler-mvp.pages.dev
  • Product repo: https://github.com/base76-research-lab/token-compressor

---

What it does

token-compressor is a two-stage pipeline that compresses prompts before they reach an LLM:

  1. LLM compression — a local model (llama3.2:1b via Ollama) rewrites the prompt to its semantic minimum, preserving all conditionals and negations
  2. Embedding validation — cosine similarity between original and compressed embeddings must exceed a threshold (default: 0.85) — if not, the original is sent unchanged

The result: shorter prompts, lower costs, same intent.

Input prompt (300 tokens)
        ↓
  LLM compresses
        ↓
  Embedding validates (cosine ≥ 0.85?)
        ↓
  Pass → compressed (120 tokens)   Fail → original (300 tokens)

Key design principle: conditionality is never sacrificed. If your prompt says "only do X if Y", that constraint survives compression.

---

Requirements

  • Python 3.10+
  • Ollama running locally
  • Two models pulled:
ollama pull llama3.2:1b
ollama pull nomic-embed-text
  • Python dependencies:
pip install ollama numpy

---

Quick start

from compressor import LLMCompressEmbedValidate

pipeline = LLMCompressEmbedValidate()
result = pipeline.process("Your prompt text here...")

print(result.output_text)   # compressed (or original if validation failed)
print(result.report())      # MODE / COVERAGE / TOKENS saved

Result object:

| Field | Description | |-------|-------------| | output_text | Text to send to your LLM | | mode | compressed / raw_fallback / skipped | | coverage | Cosine similarity (0.0–1.0) | | tokens_in | Estimated input tokens | | tokens_out | Estimated output tokens | | tokens_saved | Difference |

---

CLI usage

echo "Your long prompt here..." | python3 cli.py

Output: compressed text on stdout, stats on stderr.

---

Claude Code hook (recommended setup)

Add to your ~/.claude/settings.json under hooks → UserPromptSubmit:

{
  "type": "command",
  "command": "echo \"${CLAUDE_USER_PROMPT:-}\" | python3 /path/to/token-compressor/cli.py > /tmp/compressed_prompt.txt 2>/tmp/compress.log || true"
}

This runs on every prompt submission and writes the compressed version to a temp file, which can be injected back into context via a second hook or MCP server.

---

MCP server

The MCP server exposes compression as a tool callable from Claude Code and any MCP-compatible client.

Install:

pip install token-compressor-mcp

Tool: compress_prompt

  • Input: text (string)
  • Output: compressed text + stats footer

Claude Code MCP config (~/.claude/settings.json):

{
  "mcpServers": {
    "token-compressor": {
      "command": "uvx",
      "args": ["token-compressor-mcp"]
    }
  }
}

Or from source:

{
  "mcpServers": {
    "token-compressor": {
      "command": "python3",
      "args": ["-m", "token_compressor_mcp"],
      "cwd": "/path/to/token-compressor"
    }
  }
}

---

Configuration

pipeline = LLMCompressEmbedValidate(
    threshold=0.85,          # cosine similarity floor (lower = more aggressive)
    min_tokens=80,           # skip pipeline below this (not worth compressing)
    compress_model="llama3.2:1b",
    embed_model="nomic-embed-text",
)

---

How it works

Stage 1 — LLM compression

The compression prompt instructs the model to:

  • Preserve all conditionals (if, only if, unless, when, but only)
  • Preserve all negations
  • Remove filler, hedging, redundancy
  • Target 40–60% of original length

Stage 2 — Embedding validation

Computes cosine similarity between the original and compressed text using nomic-embed-text. If similarity falls below threshold, the original is returned unchanged. This prevents silent meaning loss.

---

Results

Tested across Swedish and English prompts, technical and natural language:

| Input | Tokens in | Tokens out | Saved | |-------|-----------|------------|-------| | Research abstract (EN) | 89 | 38 | 57% | | Session intent (SV) | 32 | 18 | 44% | | Technical instruction | 47 | 22 | 53% | | Short command (<80t) | — | — | skipped |

---

Research background

This tool implements the architecture from:

Wikström, B. (2026). When Alignment Reduces Uncertainty: Epistemic Variance Collapse and Its Implications for Metacognitive AI. DOI: 10.5281/zenodo.18731535

Part of the Base76 Research Lab toolchain for epistemic AI infrastructure.

---

License

MIT — Base76 Research Lab, Sweden

Related MCP servers

Browse all →