
<h1 align="center">datoon</h1>

<p align="center"> <strong>smart structured-data→TOON gateway — converts only when it actually saves tokens</strong> </p>

<p align="center"> <a href="#before--after">Before/After</a> • <a href="#install">Install</a> • <a href="#what-you-get">What You Get</a> • <a href="#how-it-works">How It Works</a> • <a href="#benchmarks">Benchmarks</a> • <a href="./INSTALL.md">Full install guide</a> </p>

______________________________________________________________________

Raw structured data is often verbose in LLM prompts. TOON can save tokens — but blind conversion can also make payloads worse. datoon adds a decision layer: convert when structure and savings justify it, skip when they don't, and always explain why.

Supports JSON, CSV, JSONL, YAML, XML, Parquet, Avro, ORC, Excel, and Apple Numbers — auto-detected from file extension.

Before / After

<table> <tr> <td width="50%">

JSON in the prompt (43 tokens)

{"users":[
  {"id":1,"name":"Ada","role":"admin"},
  {"id":2,"name":"Lin","role":"analyst"},
  {"id":3,"name":"Grace","role":"viewer"}
]}

</td> <td width="50%">

datoon converts → TOON (24 tokens)

users[3]{id,name,role}:
  1,Ada,admin
  2,Lin,analyst
  3,Grace,viewer

{"decision":"convert","reason":"Estimated savings 44.19% (threshold 15.00%)."}

</td> </tr> <tr> <td>

CSV from a data pipeline (111 tokens as JSON)

id,name,role
1,Ada,admin
2,Lin,analyst
3,Grace,viewer

</td> <td>

datoon auto-converts → TOON (24 tokens)

datoon data.csv --report-stdout

Same result. Zero JSON serialization in your code.

</td> </tr> <tr> <td>

Non-uniform payload (26 tokens)

{"config":{"debug":true},"tags":["a","b"]}

</td> <td>

datoon skips → keeps JSON

{"decision":"skip","reason":"No uniform object arrays found with at least 3 rows."}

No Node.js call. No silent corruption.

</td> </tr> </table>

Same data. Right format. Always explained.
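
To make the first example concrete, here is a minimal sketch of how a uniform array of flat objects collapses into the tabular layout shown above. It is an illustration only, not datoon's implementation and not the TOON CLI: `encode_uniform_rows` is a hypothetical helper, and it assumes flat rows with identical keys in the same order and values that need no quoting or escaping.

```python
import json


def encode_uniform_rows(key: str, rows: list[dict]) -> str:
    """Render uniform flat rows in the tabular layout from the example above.

    Hypothetical helper for illustration: no quoting, escaping, or nesting.
    """
    if not rows:
        raise ValueError("need at least one row")
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows are not uniform")
    # Header carries the row count and the field names exactly once.
    header = key + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    lines = [",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *("  " + line for line in lines)])


payload = json.loads(
    '{"users":[{"id":1,"name":"Ada","role":"admin"},'
    '{"id":2,"name":"Lin","role":"analyst"},'
    '{"id":3,"name":"Grace","role":"viewer"}]}'
)
toon_text = encode_uniform_rows("users", payload["users"])
print(toon_text)
```

The saving comes from writing each field name once in the header instead of once per row, which is why non-uniform or deeply nested payloads gain little.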

┌──────────────────────────────────────────────────┐
│  PAYLOAD SAVINGS (auto avg)    ████░░░░░░   28%  │
│  PAYLOAD SAVINGS (agent skill) ████████░░   62%  │
│  DECISION ACCURACY             ██████████  100%  │
│  HARMFUL CONVERSIONS BLOCKED   ██████████  100%  │
└──────────────────────────────────────────────────┘

> [!IMPORTANT]
> datoon saves payload tokens, meaning the structured-data portion of your prompt. Token savings depend on payload shape: uniform tabular data converts well; deeply nested or non-uniform structures are skipped. Every decision includes a reason so pipelines can log, debug, and trust the outcome.

Install

# core (JSON, CSV, JSONL, XML — no extra deps)
uv add datoon
pip install datoon

# with YAML support
pip install "datoon[yaml]"

# with Excel support
pip install "datoon[excel]"

# with Parquet / ORC / Avro support
pip install "datoon[columnar]"

# with Apple Numbers support
pip install "datoon[numbers]"

# with tiktoken-based token counting
pip install "datoon[tokens]"

# with MCP server
pip install "datoon[mcp]"

# everything
pip install "datoon[all]"

Requires Python 3.12+. TOON conversion requires Node.js with npx in PATH — analysis and format reading work without it.

For Claude Code plugin, Codex, and MCP configuration, see INSTALL.md.

What You Get

| | What |
|---|---|
| datoon CLI | Auto-gate any supported format → TOON from terminal or scripts |
| Python API | convert_json_for_llm() + read_tabular() for any LLM pipeline |
| MCP Server | convert_json, convert_text, analyze_json tools for Claude Desktop, Cursor, Windsurf |
| Claude Code Plugin | /datoon in-session trigger, installs from GitHub in one command |
| Codex Plugin | Marketplace plugin: structured-data mode for Codex |

Supported input formats

| Format | Extension | Extra needed |
|---|---|---|
| JSON | .json | none |
| JSONL | .jsonl, .ndjson | none |
| CSV | .csv | none |
| XML | .xml | none |
| YAML | .yaml, .yml | datoon[yaml] |
| Excel | .xlsx, .xls | datoon[excel] |
| Parquet | .parquet | datoon[columnar] |
| Avro | .avro | datoon[columnar] |
| ORC | .orc | datoon[columnar] |
| Apple Numbers | .numbers | datoon[numbers] |

How It Works

1. Detect format: from the --format flag, the file extension, or a default of JSON for stdin
2. Read + normalize: parse the source into a list of row dicts; serialize to compact JSON
3. Analyze structure: uniform object arrays? acceptable depth? minimum rows?
4. Gate early: non-candidates skip before any CLI call, so there is no Node.js overhead
5. Convert + estimate: the TOON CLI runs and token savings are calculated
6. Gate savings: below threshold → return JSON; above → return TOON with report

Every path returns a ConversionReport with decision, reason, and token estimates. Pipelines never get silent surprises.
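
The final savings gate can be sketched as a pure function of two token counts. This is a hedged illustration, not datoon's code: `gate_on_savings` is a hypothetical name, the reason string mirrors the report shown in the Before / After example, and both the reuse of that wording for skips and the `>=` comparison are assumptions.

```python
def gate_on_savings(json_tokens: int, toon_tokens: int, min_savings: float = 0.15) -> dict:
    """Decide convert vs. skip from token estimates (hypothetical helper)."""
    if json_tokens <= 0:
        raise ValueError("json_tokens must be positive")
    savings = (json_tokens - toon_tokens) / json_tokens
    reason = f"Estimated savings {savings:.2%} (threshold {min_savings:.2%})."
    # Assumption: savings exactly equal to the threshold still convert.
    decision = "convert" if savings >= min_savings else "skip"
    return {"decision": decision, "reason": reason, "savings": savings}


first = gate_on_savings(43, 24)   # token counts from the first Before / After example
nested = gate_on_savings(93, 91)  # orders-nested counts from the benchmark table
print(first["decision"], first["reason"])
print(nested["decision"], nested["reason"])
```

Because a negative saving (TOON larger than JSON) is also below the threshold, the same comparison blocks conversions that would make the payload worse.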

______________________________________________________________________

Quick Start

JSON (stdin):

echo '{"users":[{"id":1,"name":"Ada"},{"id":2,"name":"Lin"},{"id":3,"name":"Grace"}]}' | datoon --report-stdout

CSV (auto-detected from extension):

datoon data.csv --report-stdout

JSONL:

datoon data.jsonl -o output.toon

YAML (requires datoon[yaml]):

datoon data.yaml --report-stdout

Parquet (requires datoon[columnar]):

datoon data.parquet --report ./report.json

Explicit format override:

datoon --format csv < data.csv --report-stdout

Force conversion (bypass gating — for experiments):

datoon data.json --force --report-stdout

______________________________________________________________________

Python API

JSON conversion:

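from datoon import convert_json_for_llm, ConversionConfig, DatoonError

# Example payload; any JSON string works here.
raw_json = '{"users":[{"id":1,"name":"Ada"},{"id":2,"name":"Lin"},{"id":3,"name":"Grace"}]}'

config = ConversionConfig(min_savings_ratio=0.15, max_depth=6, min_uniform_rows=3)

try:
    outcome = convert_json_for_llm(raw_json, config)
except DatoonError:
    # Conversion failed: log it, fall back to raw_json, or re-raise as here.
    raise

# outcome.payload_text: TOON or original JSON
# outcome.report.decision: "convert" | "skip"
# outcome.report.reason: human-readable explanation
send_to_model(outcome.payload_text)  # send_to_model stands in for your own LLM call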

Any format via read_tabular:

import json
from pathlib import Path
from datoon import read_tabular, convert_json_for_llm, ConversionConfig

# text formats: csv, jsonl, yaml, xml
rows = read_tabular("csv", text=csv_string)

# binary formats: excel, parquet, orc, avro, numbers
rows = read_tabular("parquet", path=Path("data.parquet"))

json_text = json.dumps(rows, separators=(",", ":"))
outcome = convert_json_for_llm(json_text, ConversionConfig())
send_to_model(outcome.payload_text)
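
For readers who want to see what "parse into row dicts, then serialize to compact JSON" amounts to, here is a stdlib-only sketch for CSV. It does not use or reproduce `read_tabular`; `csv_to_compact_json` is a hypothetical helper, and since this README does not say whether datoon keeps CSV values as strings or infers types, the sketch simply leaves them as strings.

```python
import csv
import io
import json


def csv_to_compact_json(csv_text: str) -> str:
    """Parse CSV text into row dicts and dump them as compact JSON."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    return json.dumps(rows, separators=(",", ":"))


csv_string = "id,name,role\n1,Ada,admin\n2,Lin,analyst\n3,Grace,viewer\n"
json_text = csv_to_compact_json(csv_string)
print(json_text)
```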

Structure-only analysis (no Node.js required):

from datoon.analyzer import analyze_payload
from datoon.models import ConversionConfig

analysis = analyze_payload(parsed_data, ConversionConfig())
print(analysis.is_candidate, analysis.reason)
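
The structural questions this analysis answers (nesting depth, uniform object arrays, minimum rows) can be approximated in plain Python. The sketch below is not `analyze_payload`: `max_depth` and `find_uniform_arrays` are hypothetical helpers, and the depth convention and uniformity rules datoon actually applies may differ.

```python
import json


def max_depth(value) -> int:
    """Nesting depth: scalars are 0, each dict or list adds one level."""
    if isinstance(value, dict):
        return 1 + max((max_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((max_depth(v) for v in value), default=0)
    return 0


def find_uniform_arrays(value, min_rows: int = 3, path: str = "$") -> list[str]:
    """Return paths of lists whose items are dicts sharing one key set."""
    found = []
    if isinstance(value, list):
        if (
            len(value) >= min_rows
            and all(isinstance(item, dict) for item in value)
            and len({frozenset(item) for item in value}) == 1
        ):
            found.append(path)
        for index, item in enumerate(value):
            found += find_uniform_arrays(item, min_rows, f"{path}[{index}]")
    elif isinstance(value, dict):
        for key, item in value.items():
            found += find_uniform_arrays(item, min_rows, f"{path}.{key}")
    return found


users = json.loads('{"users":[{"id":1,"name":"Ada"},{"id":2,"name":"Lin"},{"id":3,"name":"Grace"}]}')
mixed = json.loads('{"config":{"debug":true},"tags":["a","b"]}')
print(find_uniform_arrays(users), max_depth(users))
print(find_uniform_arrays(mixed), max_depth(mixed))
```

The second payload is the non-uniform example from Before / After: it has no list of at least three same-shaped objects, so nothing qualifies as a conversion candidate.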

______________________________________________________________________

MCP Server

datoon ships an MCP server with three tools:

| Tool | Description |
|---|---|
| convert_json | Full JSON conversion with policy gating |
| convert_text | Converts CSV, YAML, XML, or JSONL text with policy gating |
| analyze_json | Structure analysis only; no Node.js needed |

Claude Desktop / Cursor / Windsurf config:

{
  "mcpServers": {
    "datoon": {
      "command": "uvx",
      "args": ["--from", "datoon[mcp]", "datoon", "mcp"]
    }
  }
}

Run locally:

datoon mcp     # or the standalone script: datoon-mcp

Listed on the MCP Registry, Smithery, and Glama. See MARKETPLACES.md.

______________________________________________________________________

Claude Code Plugin

Install directly from GitHub:

claude plugin marketplace add andrii-su/datoon
claude plugin install datoon@datoon

Trigger in-session:

/datoon
convert this JSON to TOON if it saves tokens
use datoon mode for structured data

______________________________________________________________________

CLI Reference

| Flag | Default | Description |
|---|---|---|
| --format | auto | Input format: json, csv, jsonl, yaml, xml, excel, parquet, avro, orc, numbers |
| --force | false | Bypass gating and minimum savings threshold |
| --min-savings | 0.15 | Minimum relative token savings required |
| --max-depth | 6 | Maximum nesting depth for auto-conversion |
| --min-uniform-rows | 3 | Minimum rows in uniform object arrays |
| --timeout | 30 | Seconds before TOON CLI call is aborted |
| --report <path> | n/a | Write JSON conversion report to file |
| --report-stdout | n/a | Print JSON conversion report to stderr |
| -o <path> | stdout | Output file path |
| --version | n/a | Print version and exit |

Format is auto-detected from file extension. Use --format to override or when reading from stdin.

______________________________________________________________________

Benchmarks

PYTHONPATH=src python benchmarks/run.py --dry-run
PYTHONPATH=src python benchmarks/run.py
PYTHONPATH=src python benchmarks/run.py --update-readme

Why auto mode outperforms forced conversion

Auto mode avoids low-benefit and high-risk payloads (orders-nested, mixed-non-uniform) while matching forced TOON's average token count on suitable ones. Every decision comes with a reasoned report.

| Scenario | JSON Baseline | Forced TOON | datoon Auto |
|---|---:|---:|---:|
| Average tokens | 103 | 67 | 67 |
| Avg tokens saved | 0.0% | 27.5% | 28.2% |
| Decision quality | n/a | Converts all | Converts 3/5, skips harmful cases |

| Dataset | JSON | TOON (forced) | Raw Saved | Auto | Auto Tokens | Auto Saved |
|---|---:|---:|---:|---|---:|---:|
| users-small | 56 | 31 | 44.6% | convert | 31 | 44.6% |
| events-medium | 198 | 111 | 43.9% | convert | 111 | 43.9% |
| orders-nested | 93 | 91 | 2.2% | skip | 93 | 0.0% |
| mixed-non-uniform | 35 | 37 | -5.7% | skip | 35 | 0.0% |
| metrics-wide | 133 | 63 | 52.6% | convert | 63 | 52.6% |
| Average | 103 | 67 | 27.5% | 3/5 convert | 67 | 28.2% |

Forced conversion succeeded for 5/5 payloads.
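
The averages in the per-dataset table can be reproduced from its rows, which also shows where auto mode's small edge comes from: on mixed-non-uniform, forced conversion costs 2 extra tokens, while auto keeps the JSON. The snippet is a plain arithmetic check on the table's numbers; nothing in it calls datoon.

```python
# (json_tokens, forced_toon_tokens, auto_decision) per dataset, from the table above.
DATASETS = {
    "users-small": (56, 31, "convert"),
    "events-medium": (198, 111, "convert"),
    "orders-nested": (93, 91, "skip"),
    "mixed-non-uniform": (35, 37, "skip"),
    "metrics-wide": (133, 63, "convert"),
}


def saved_pct(before: int, after: int) -> float:
    """Percent of tokens saved, rounded to one decimal like the table."""
    return round(100 * (before - after) / before, 1)


# Forced mode always uses the TOON count; auto keeps JSON when it skips.
forced = [saved_pct(j, t) for j, t, _ in DATASETS.values()]
auto = [saved_pct(j, t if d == "convert" else j) for j, t, d in DATASETS.values()]

forced_avg = round(sum(forced) / len(forced), 1)
auto_avg = round(sum(auto) / len(auto), 1)
print(forced_avg, auto_avg)
```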

Format conversion benchmark

Token savings when converting from common structured formats (CSV, JSONL, XML, YAML). Baseline is the JSON representation of the same data — what an LLM would receive without datoon.

| Dataset | Format | JSON Tokens | TOON (forced) | Auto | Auto Tokens | Auto Saved |
|---|---|---:|---:|---|---:|---:|
| users-csv | csv | 53 | 29 | convert | 29 | 45.3% |
| events-jsonl | jsonl | 194 | 109 | convert | 109 | 43.8% |
| catalog-xml | xml | 96 | 50 | convert | 50 | 47.9% |
| metrics-yaml | yaml | 129 | 61 | convert | 61 | 52.7% |
| Average | n/a | 118 | 62 | 4/4 convert | 62 | 47.4% |

Forced conversion succeeded for 4/4 payloads.

Agent skill evaluation

Artifact-based subagent comparison — identical analysis tasks, two modes:

with_skill: agent received the datoon skill and followed the conversion workflow.
without_skill: agent used JSON directly, no TOON or datoon.

3 payload sizes × 3 iterations × 2 modes = 18 total agent runs. Both modes: 100% correct answers.

| Scenario | Avg JSON Tokens | Avg TOON Tokens | Avg Payload Saved |
|---|---:|---:|---:|
| small | 225 | 118 | 47.6% |
| medium | 2,972 | 1,138 | 61.7% |
| large | 17,757 | 6,673 | 62.4% |

Full report and raw outputs: benchmarks/agent_skill_eval/. Savings are payload-token estimates, not full end-to-end model-token usage.

______________________________________________________________________

Development

Contributor workflow: CONTRIBUTING.md. Maintainer/agent notes: CLAUDE.md.

Setup:

uv sync --extra dev
uvx pre-commit install

Tests:

pytest -m "not integration"   # unit only (102 tests)
pytest                        # with integration (requires Node.js + npx)

Skill sync + plugin metadata:

python scripts/validate_skill_sync.py
python scripts/validate_plugin_metadata.py

______________________________________________________________________

License

MIT
