kokoro-tts-mcp


Text-to-speech using the Kokoro-82M model, accelerated with MLX on Apple Silicon. Works three ways:

  • MCP server — gives local Claude and Codex clients (Claude Chat/Code/Cowork, Codex App, Codex CLI) the ability to speak text aloud and convert text to audio.
  • ChatGPT Mac App — supported via kokoro-clipboard + Keyboard Maestro workaround (not MCP-native yet).
  • Command-line tools: kokoro and kokoro-clipboard commands for use in scripts, the terminal, or piped workflows.

The MCP server and the command-line tools share the same generation engine and playback code, so pause/stop controls (via Stream Deck, hotkeys, etc.) work identically regardless of how audio was started.

The MCP server lazy-loads the model on first use and keeps it resident in memory (~600 MB), so subsequent requests start instantly. The CLI loads the model fresh each invocation (~3s startup), which is negligible for longer text.
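The lazy-loading behavior described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: `LazyModel` and `fake_loader` are hypothetical names, and the placeholder loader stands in for the real model load, which this README does not show.

```python
class LazyModel:
    """Build an expensive resource on first use, then keep it in memory."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable that builds the model
        self._model = None      # nothing is loaded until get() is first called
        self.load_count = 0     # counts how many times the loader actually ran

    def get(self):
        if self._model is None:
            self._model = self._loader()
            self.load_count += 1
        return self._model


def fake_loader():
    # Placeholder for the real (slow) model load; an assumption for this sketch.
    return {"name": "placeholder model"}


model = LazyModel(fake_loader)
first = model.get()    # pays the load cost once
second = model.get()   # reuses the resident object
```

A long-running server process benefits from this pattern because the object survives between requests, whereas a CLI process exits after each invocation and has to load again.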

Requirements

  • macOS on Apple Silicon (M1/M2/M3/M4)
  • Python 3.12 (not 3.13+ due to spacy/pydantic incompatibility)
  • espeak (brew install espeak)
  • ffmpeg (optional, only needed for MP3 export)

Setup

git clone https://github.com/scottschram/kokoro-tts-mcp.git
cd kokoro-tts-mcp

python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

After installing, download the spaCy English model:

python -m spacy download en_core_web_sm

Usage

Command Line (kokoro)

kokoro "Hello, world."                         # play immediately
cat article.txt | kokoro                       # pipe input
kokoro -v bm_fable "Good morning, London."     # British male voice
kokoro -f article.txt -o article.wav           # save to WAV
kokoro -f article.txt --mp3                    # save as MP3 to /tmp
kokoro -o talk.wav -p "Hello"                  # save AND play
kokoro -s 1.3 "A bit faster."                 # speed adjustment
kokoro -v list                                 # show all voices
kokoro -h                                      # full help

Playback via the MCP speak() tool: text of about 2500 words or fewer starts within a few seconds. Beyond that, first-audio latency grows roughly linearly with text size (~3 min at 3000 words, ~4 min at 5000). The delay sits in the MCP client's tool-call dispatch, not in the Kokoro pipeline, which streams audio within seconds at any size when driven via the CLI or a direct Python import. For long reads, use the CLI instead: kokoro -f file.txt -o file.wav (play with your preferred audio player) or cat file.txt | kokoro. Pause and stop work at any point during playback. See CLAUDE.md for the bisection.

To make kokoro available globally, symlink it:

ln -sf /path/to/kokoro-tts-mcp/kokoro ~/bin/kokoro

Command Line (kokoro-clipboard)

kokoro-clipboard                                # speak current clipboard
kokoro-clipboard --dry-run                      # preview cleaned speech text
kokoro-clipboard --silent-nontext               # do not speak non-text clipboard
kokoro-clipboard --raw                          # skip markdown cleanup
kokoro-clipboard --max-chars 20000              # character cap before truncation
kokoro-clipboard --text "[kokoro]Hello[/kokoro]" --dry-run

kokoro-clipboard reads the current macOS clipboard and speaks it with markdown cleanup. If [kokoro]...[/kokoro] markers are present, only the text between markers is spoken. If markers are absent, the full clipboard text is spoken.
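The marker rule can be illustrated with a short sketch. It is not taken from the kokoro-clipboard source: the function name `select_speech_text` is hypothetical, and joining several marked segments with a space is an assumption, since the README does not say how multiple marker pairs are handled.

```python
import re

# Non-greedy match so each [kokoro]...[/kokoro] pair is captured separately.
MARKER_RE = re.compile(r"\[kokoro\](.*?)\[/kokoro\]", re.DOTALL)


def select_speech_text(clipboard_text: str) -> str:
    """Return only the marked text if markers exist, else the full text."""
    segments = MARKER_RE.findall(clipboard_text)
    if segments:
        # Assumption: multiple marked segments are joined with a single space.
        return " ".join(segment.strip() for segment in segments)
    return clipboard_text


marked = select_speech_text("Intro. [kokoro]Hello[/kokoro] Outro.")
unmarked = select_speech_text("No markers here.")
```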

If clipboard content is non-text (image/PDF/file/URL), it speaks a short type message unless --silent-nontext is used.

Arguments:

| Argument | Description |
|----------|-------------|
| -v, --voice | Voice name (default: af_heart) |
| -s, --speed | Speed multiplier (default: 1.0) |
| --kokoro-cmd | Command/path used to invoke kokoro |
| --raw | Skip markdown cleanup |
| --silent-nontext | Exit without speaking when clipboard is non-text |
| --max-chars | Character cap before truncation (default: 20000) |
| --dry-run | Print final text instead of speaking |
| --text | Use provided text instead of reading clipboard |

To make kokoro-clipboard available globally, symlink it:

ln -sf /path/to/kokoro-tts-mcp/kokoro-clipboard ~/bin/kokoro-clipboard

Keyboard Maestro (ChatGPT Mac workaround)

If ChatGPT Mac does not have MCP support for your account/workflow, you can still get spoken output by triggering kokoro-clipboard from Keyboard Maestro.

  1. Create a new Keyboard Maestro macro group limited to ChatGPT (com.openai.chat).
  2. Create a macro named Speak Clipboard.
  3. Set trigger: The clipboard changes.
  4. Add action: If Then Else with If All Conditions Met:
       • The clipboard contains [kokoro]
       • The clipboard contains [/kokoro]
  5. In the Then branch, add action: Execute Shell Script.
  6. Configure shell script:
       • Shell: /bin/zsh
       • Input: None
       • Script:
~/bin/kokoro-clipboard

Optional variants:

~/bin/kokoro-clipboard --silent-nontext
~/bin/kokoro-clipboard -v bm_fable -s 1.1

Usage notes:

  1. This If Then Else setup is marker-only: it speaks only when both markers exist.
  2. Inside the copied text, kokoro-clipboard speaks only the text between [kokoro]...[/kokoro].
  3. If you remove the If Then Else gate, kokoro-clipboard will speak any copied ChatGPT text.
  4. Non-text clipboard items (images/files/PDF) are announced unless --silent-nontext is set.

MCP Server (Claude Code)

Register the MCP server:

claude mcp add kokoro-tts -- \
    /path/to/kokoro-tts-mcp/.venv/bin/python3.12 \
    /path/to/kokoro-tts-mcp/mcp_server.py

Then in Claude Code, you can ask Claude to speak:

  • "Say hello"
  • "Read that summary aloud using the British male voice bm_george"
  • "Save that explanation as an MP3"

MCP Server (Claude Desktop — Chat / Cowork)

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "kokoro-tts": {
      "command": "/path/to/kokoro-tts-mcp/.venv/bin/python3.12",
      "args": ["/path/to/kokoro-tts-mcp/mcp_server.py"]
    }
  }
}

Restart the Claude app after editing.

MCP Server (Codex CLI)

Register the MCP server:

codex mcp add kokoro-tts -- \
    /path/to/kokoro-tts-mcp/.venv/bin/python3.12 \
    /path/to/kokoro-tts-mcp/mcp_server.py

Then in Codex CLI, you can ask Codex to speak:

  • "Say hello"
  • "Read that summary aloud using the British male voice bm_george"
  • "Save that explanation as an MP3"

MCP Server (Codex Mac App)

Codex Mac App and Codex CLI share the same global Codex config (~/.codex/config.toml). After registering kokoro-tts with codex mcp add ... in a terminal, restart the Codex app.

Smoke Test

A quick test script to verify the TTS pipeline without MCP or the full CLI:

./test-tts                          # default test phrase
./test-tts "Custom text"            # speak custom text
./test-tts "Cheerio" bm_fable       # specify voice

Tools

| Tool | Description |
|------|-------------|
| speak(text, voice?, speed?) | Play text aloud (non-blocking, returns immediately) |
| pause() | Pause current playback |
| resume() | Resume paused playback |
| stop() | Stop playback immediately |
| status() | Return current state: idle, playing, or paused |
| user_stop_requested() | Check if the user stopped playback externally (returns True once, then clears) |
| speak_and_save(text, output_path?, voice?, speed?, mp3?) | Generate and save audio to a file |
| list_voices() | List all available voices |

Voices

28 English voices are available. The naming convention is: first letter = accent (a = American, b = British), second letter = gender (f = female, m = male).

American Female: af_heart (default), af_alloy, af_aoede, af_bella, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky

American Male: am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa

British Female: bf_alice, bf_emma, bf_isabella, bf_lily

British Male: bm_daniel, bm_fable, bm_george, bm_lewis
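The naming convention can be decoded mechanically. The helper below is a hypothetical illustration (`describe_voice` is not part of the project) that only covers the two accent codes and two gender codes listed above.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice: str) -> tuple[str, str, str]:
    """Split a voice ID such as 'bm_fable' into (accent, gender, name)."""
    prefix, _, name = voice.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice name: {voice!r}")
    accent_code, gender_code = prefix
    if accent_code not in ACCENTS or gender_code not in GENDERS:
        raise ValueError(f"unknown accent or gender code in: {voice!r}")
    return ACCENTS[accent_code], GENDERS[gender_code], name


default_voice = describe_voice("af_heart")
british_voice = describe_voice("bm_fable")
```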

Playback Control

Two shell scripts control playback from outside Claude (e.g., via Stream Deck, Keyboard Maestro, or a hotkey). They work with both the MCP server and the CLI — whichever is currently playing:

  • kokoro-pause — Toggle pause/resume. Also supports kokoro-pause pause, kokoro-pause resume, and kokoro-pause status.
  • kokoro-stop — Stop playback immediately and discard audio.

These work by creating/removing sentinel files (/tmp/kokoro-tts-pause, /tmp/kokoro-tts-stop) that the playback loop monitors.
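A rough sketch of how a playback loop might watch sentinel files is shown below. It is an illustration under stated assumptions, not the project's playback code: `playback_loop`, the chunk-based structure, and the choice to delete the stop file once it has been honored are all hypothetical. The demo uses a temporary directory rather than the real /tmp paths so it does not interfere with a running instance.

```python
import tempfile
from pathlib import Path


def playback_loop(chunks, pause_file, stop_file, play, sleep):
    """Play chunks in order, checking the sentinel files before each one."""
    for chunk in chunks:
        # Hold while the pause sentinel exists; a stop request still wins.
        while pause_file.exists() and not stop_file.exists():
            sleep(0.05)
        if stop_file.exists():
            stop_file.unlink()  # assumption: the loop consumes the request
            return "stopped"
        play(chunk)
    return "finished"


with tempfile.TemporaryDirectory() as tmp:
    pause_file = Path(tmp) / "kokoro-tts-pause"
    stop_file = Path(tmp) / "kokoro-tts-stop"
    played = []

    def play(chunk):
        played.append(chunk)
        if chunk == "two":
            stop_file.touch()  # simulates an external stop during playback

    result = playback_loop(
        ["one", "two", "three"], pause_file, stop_file, play, sleep=lambda s: None
    )
```

Because the signal is just a file on disk, any tool that can run a shell command (Stream Deck, Keyboard Maestro, a hotkey utility) can control playback without talking to the playing process directly.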

Multi-Segment Playback

When Claude plays multiple segments sequentially (e.g., reading a list of items one by one), it polls status() until idle before starting the next segment. If the user stops playback externally (via kokoro-stop, Stream Deck, etc.), user_stop_requested() returns True once, signaling Claude to skip remaining segments instead of immediately starting the next one. The MCP stop() tool does not set this flag — it only applies to external stops, so Claude can distinguish "user wants silence" from "Claude decided to stop."
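The polling pattern described above might look like the following sketch. The tool names speak(), status(), and user_stop_requested() come from the Tools table; everything else is an assumption for illustration, including `speak_segments`, the polling interval, and `FakeClient`, which stands in for a real MCP client and pretends each segment plays for two status polls.

```python
import time


def speak_segments(client, segments, poll_interval=0.2, sleep=time.sleep):
    """Speak segments one by one, stopping early on an external user stop."""
    completed = []
    for segment in segments:
        client.speak(segment)  # non-blocking: returns immediately
        while client.status() != "idle":
            sleep(poll_interval)
        if client.user_stop_requested():  # True once, then cleared
            break
        completed.append(segment)
    return completed


class FakeClient:
    """Hypothetical stand-in for the MCP tools, used only for this demo."""

    def __init__(self, stop_during=None):
        self.stop_during = stop_during
        self._remaining_polls = 0
        self._stop_flag = False

    def speak(self, text):
        self._remaining_polls = 2
        if text == self.stop_during:
            # Simulate the user running kokoro-stop during this segment.
            self._stop_flag = True
            self._remaining_polls = 0

    def status(self):
        if self._remaining_polls > 0:
            self._remaining_polls -= 1
            return "playing"
        return "idle"

    def user_stop_requested(self):
        flag, self._stop_flag = self._stop_flag, False
        return flag


completed = speak_segments(
    FakeClient(stop_during="b"), ["a", "b", "c"], sleep=lambda s: None
)
```

In this run the loop finishes segment "a", sees the external stop during "b", and never starts "c".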

Text Preprocessing

MCP server and CLI — Negative numbers (e.g., -3) are expanded to words (minus 3) before generation. The Kokoro phonemizer silently drops bare negative-sign tokens, so without this preprocessing, -3 degrees would be spoken as just degrees.
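A minimal sketch of this kind of expansion is shown below. The regular expression and the function name `expand_negative_numbers` are assumptions for illustration and may differ from the project's actual rule. The lookbehind is there so that hyphens inside ranges or dates such as 10-3 or 2020-01-15 are left alone.

```python
import re

# A minus sign that is followed by a digit and not preceded by a word
# character or a dot is treated as a negative sign (an assumed heuristic).
NEGATIVE_NUMBER_RE = re.compile(r"(?<![\w.])-(?=\d)")


def expand_negative_numbers(text: str) -> str:
    """Replace a leading negative sign with the word 'minus'."""
    return NEGATIVE_NUMBER_RE.sub("minus ", text)


spoken = expand_negative_numbers("It was -3 degrees.")
untouched = expand_negative_numbers("Pages 10-3 on 2020-01-15")
```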

kokoro-clipboard — Clipboard text goes through additional preprocessing to improve listening quality:

  • Markdown syntax stripped (headings, bold, italic, links, fences, tables, etc.)
  • URLs expanded to speakable form (https://foo.com/path → https colon slash slash foo dot com slash path)
  • Negative numbers expanded (-3 → minus 3)
  • Punctuation between digits/words preserved (3.14, 10:30, $1,299.99 stay intact)
  • [kokoro]...[/kokoro] markers supported to limit what gets spoken
  • Use --dry-run to preview the cleaned text without audio
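The URL expansion in the list above can be approximated with the sketch below. It is an illustration only: `speak_url`, `expand_urls`, and the URL pattern are hypothetical, and the mapping covers just the three symbols shown in the README's example (colon, slash, dot). Other characters are passed through unchanged, and trailing sentence punctuation attached to a URL is not handled.

```python
import re

URL_RE = re.compile(r"https?://\S+")
SPOKEN_SYMBOLS = {":": " colon ", "/": " slash ", ".": " dot "}


def speak_url(url: str) -> str:
    """Spell out the punctuation in a URL so it can be read aloud."""
    words = "".join(SPOKEN_SYMBOLS.get(ch, ch) for ch in url)
    return " ".join(words.split())  # collapse the doubled spaces


def expand_urls(text: str) -> str:
    """Replace every URL in the text with its speakable form."""
    return URL_RE.sub(lambda match: speak_url(match.group(0)), text)


spoken_url = speak_url("https://foo.com/path")
sentence = expand_urls("See https://foo.com/path for details")
```

The real tool can be checked against this with kokoro-clipboard --text "..." --dry-run, which prints the cleaned text without producing audio.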

Known Issues

  • Python 3.13+ not supported — spacy and pydantic have incompatibilities on 3.13+. Use Python 3.12.
  • Short text workaround — Text under 25 characters is automatically padded to avoid an mlx-audio hang bug. This is handled transparently.
  • Do not install phonemizer — The phonemizer package conflicts with phonemizer-fork (pulled in by mlx-audio). Installing it causes out-of-dictionary words to be silently skipped. See requirements.txt for details.
  • misaki must be <0.9 — Version 0.9+ breaks EspeakWrapper.set_data_path. This is pinned in requirements.txt.

License

MIT
