kokoro-tts-mcp
Text-to-speech using the Kokoro-82M model, accelerated with MLX on Apple Silicon. Works three ways:
- MCP server: gives local Claude and Codex clients (Claude Chat/Code/Cowork, Codex App, Codex CLI) the ability to speak text aloud and convert text to audio.
- ChatGPT Mac App: supported via a kokoro-clipboard + Keyboard Maestro workaround (not MCP-native yet).
- Command-line tools: kokoro and kokoro-clipboard commands for use in scripts, the terminal, or piped workflows.
The MCP server and the command-line tools share the same generation engine and playback code, so pause/stop controls (via Stream Deck, hotkeys, etc.) work identically regardless of how audio was started.
The MCP server lazy-loads the model on first use and keeps it resident in memory (~600 MB), so subsequent requests start instantly. The CLI loads the model fresh each invocation (~3s startup), which is negligible for longer text.
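The lazy-load-then-stay-resident behavior described above can be sketched as follows. This is a minimal illustration of the pattern, not the server's actual code: the LazyModel class, fake_loader function, and load_count counter are hypothetical names, and the real loader would call into mlx-audio instead of returning a dict.

```python
import threading
from typing import Any, Callable, Optional


class LazyModel:
    """Load an expensive resource on first use, then keep it in memory."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()
        self.load_count = 0  # demonstration only: counts real loads

    def get(self) -> Any:
        # Cheap check first; take the lock only when a load may be needed.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
                    self.load_count += 1
        return self._model


def fake_loader() -> dict:
    # Hypothetical stand-in for the slow model load.
    return {"name": "stand-in model"}


model = LazyModel(fake_loader)
first = model.get()   # triggers the load
second = model.get()  # reuses the resident object
```

The CLI, by contrast, would pay the load cost on every invocation because each run is a new process.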
Requirements
- macOS on Apple Silicon (M1/M2/M3/M4)
- Python 3.12 (not 3.13+ due to spacy/pydantic incompatibility)
- espeak (brew install espeak)
- ffmpeg (optional, only needed for MP3 export)
Setup
git clone https://github.com/scottschram/kokoro-tts-mcp.git
cd kokoro-tts-mcp
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
After installing, download the spaCy English model:
python -m spacy download en_core_web_sm
Usage
Command Line (kokoro)
kokoro "Hello, world." # play immediately
cat article.txt | kokoro # pipe input
kokoro -v bm_fable "Good morning, London." # British male voice
kokoro -f article.txt -o article.wav # save to WAV
kokoro -f article.txt --mp3 # save as MP3 to /tmp
kokoro -o talk.wav -p "Hello" # save AND play
kokoro -s 1.3 "A bit faster." # speed adjustment
kokoro -v list # show all voices
kokoro -h # full help
Playback via the MCP speak() tool: text of about 2,500 words or fewer starts within a few seconds; beyond that, first-audio latency grows roughly linearly with text size (~3 min at 3000 words, ~4 min at 5000). The delay sits in the MCP client's tool-call dispatch, not in the Kokoro pipeline, which streams audio within seconds at any size when driven via the CLI or a direct Python import. For long reads, use the CLI: kokoro -f file.txt -o file.wav (play with your preferred audio player) or cat file.txt | kokoro. Pause and stop work at any point during playback. See CLAUDE.md for the bisection.
To make kokoro available globally, symlink it:
ln -sf /path/to/kokoro-tts-mcp/kokoro ~/bin/kokoro
Command Line (kokoro-clipboard)
kokoro-clipboard # speak current clipboard
kokoro-clipboard --dry-run # preview cleaned speech text
kokoro-clipboard --silent-nontext # do not speak non-text clipboard
kokoro-clipboard --raw # skip markdown cleanup
kokoro-clipboard --max-chars 20000 # character cap before truncation
kokoro-clipboard --text "[kokoro]Hello[/kokoro]" --dry-run
kokoro-clipboard reads the current macOS clipboard and speaks it with markdown cleanup. If [kokoro]...[/kokoro] markers are present, only the text between markers is spoken. If markers are absent, the full clipboard text is spoken.
If clipboard content is non-text (image/PDF/file/URL), it speaks a short type message unless --silent-nontext is used.
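The marker rule described above can be sketched with a regular expression. This is an illustrative approximation, not the code inside kokoro-clipboard: the function name select_spoken_text is hypothetical, and joining multiple marked segments with newlines is an assumption about behavior the README does not specify.

```python
import re

# Non-greedy match so that several marker pairs are captured separately.
MARKER_RE = re.compile(r"\[kokoro\](.*?)\[/kokoro\]", re.DOTALL)


def select_spoken_text(clipboard_text: str) -> str:
    """Return only the marked text, or the full text if no markers exist."""
    segments = [segment.strip() for segment in MARKER_RE.findall(clipboard_text)]
    if segments:
        # Assumption: multiple marked segments are spoken in order.
        return "\n".join(segments)
    return clipboard_text


marked = select_spoken_text("Intro [kokoro]Hello there[/kokoro] outro")
unmarked = select_spoken_text("No markers here")
```

Running kokoro-clipboard with --dry-run remains the reliable way to see what the real tool will speak.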
Arguments:
| Argument | Description |
|----------|-------------|
| -v, --voice | Voice name (default: af_heart) |
| -s, --speed | Speed multiplier (default: 1.0) |
| --kokoro-cmd | Command/path used to invoke kokoro |
| --raw | Skip markdown cleanup |
| --silent-nontext | Exit without speaking when clipboard is non-text |
| --max-chars | Character cap before truncation (default: 20000) |
| --dry-run | Print final text instead of speaking |
| --text | Use provided text instead of reading clipboard |
To make kokoro-clipboard available globally, symlink it:
ln -sf /path/to/kokoro-tts-mcp/kokoro-clipboard ~/bin/kokoro-clipboard
Keyboard Maestro (ChatGPT Mac workaround)
If ChatGPT Mac does not have MCP support for your account/workflow, you can still get spoken output by triggering kokoro-clipboard from Keyboard Maestro.
- Create a new Keyboard Maestro macro group limited to ChatGPT (com.openai.chat).
- Create a macro named Speak Clipboard.
- Set trigger: The clipboard changes.
- Add action: If Then Else with If All Conditions Met:
  - The clipboard contains [kokoro]
  - The clipboard contains [/kokoro]
- In the Then branch, add action: Execute Shell Script.
- Configure shell script:
  - Shell: /bin/zsh
  - Input: None
  - Script: ~/bin/kokoro-clipboard
Optional variants:
~/bin/kokoro-clipboard --silent-nontext
~/bin/kokoro-clipboard -v bm_fable -s 1.1
Usage notes:
- This If Then Else setup is marker-only: it speaks only when both markers exist.
- Inside the copied text, kokoro-clipboard speaks only the text between [kokoro]...[/kokoro].
- If you remove the If Then Else gate, kokoro-clipboard will speak any copied ChatGPT text.
- Non-text clipboard items (images/files/PDF) are announced unless --silent-nontext is set.
MCP Server (Claude Code)
Register the MCP server:
claude mcp add kokoro-tts -- \
/path/to/kokoro-tts-mcp/.venv/bin/python3.12 \
/path/to/kokoro-tts-mcp/mcp_server.py
Then in Claude Code, you can ask Claude to speak:
- "Say hello"
- "Read that summary aloud using the British male voice bm_george"
- "Save that explanation as an MP3"
MCP Server (Claude Desktop — Chat / Cowork)
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
{
"mcpServers": {
"kokoro-tts": {
"command": "/path/to/kokoro-tts-mcp/.venv/bin/python3.12",
"args": ["/path/to/kokoro-tts-mcp/mcp_server.py"]
}
}
}
Restart the Claude app after editing.
MCP Server (Codex CLI)
Register the MCP server:
codex mcp add kokoro-tts -- \
/path/to/kokoro-tts-mcp/.venv/bin/python3.12 \
/path/to/kokoro-tts-mcp/mcp_server.py
Then in Codex CLI, you can ask Codex to speak:
- "Say hello"
- "Read that summary aloud using the British male voice bm_george"
- "Save that explanation as an MP3"
MCP Server (Codex Mac App)
Codex Mac App and Codex CLI share the same global Codex config (~/.codex/config.toml). After registering kokoro-tts with codex mcp add ... in a terminal, restart the Codex app.
Smoke Test
A quick test script to verify the TTS pipeline without MCP or the full CLI:
./test-tts # default test phrase
./test-tts "Custom text" # speak custom text
./test-tts "Cheerio" bm_fable # specify voice
Tools
| Tool | Description |
|------|-------------|
| speak(text, voice?, speed?) | Play text aloud (non-blocking, returns immediately) |
| pause() | Pause current playback |
| resume() | Resume paused playback |
| stop() | Stop playback immediately |
| status() | Return current state: idle, playing, or paused |
| user_stop_requested() | Check if the user stopped playback externally (returns True once, then clears) |
| speak_and_save(text, output_path?, voice?, speed?, mp3?) | Generate and save audio to a file |
| list_voices() | List all available voices |
Voices
28 English voices are available. The naming convention is: first letter = accent (a = American, b = British), second letter = gender (f = female, m = male).
American Female: af_heart (default), af_alloy, af_aoede, af_bella, af_jessica, af_kore, af_nicole, af_nova, af_river, af_sarah, af_sky
American Male: am_adam, am_echo, am_eric, am_fenrir, am_liam, am_michael, am_onyx, am_puck, am_santa
British Female: bf_alice, bf_emma, bf_isabella, bf_lily
British Male: bm_daniel, bm_fable, bm_george, bm_lewis
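The naming convention can be decoded mechanically. The helper below is a hypothetical illustration (describe_voice is not part of this project) that reads the two-letter prefix and rejects names that do not follow the pattern.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice: str) -> str:
    """Turn a voice ID such as 'bm_fable' into a readable description."""
    prefix, separator, name = voice.partition("_")
    valid = (
        len(prefix) == 2
        and separator == "_"
        and name != ""
        and prefix[0] in ACCENTS
        and prefix[1] in GENDERS
    )
    if not valid:
        raise ValueError(f"not a recognized voice name: {voice!r}")
    return f"{ACCENTS[prefix[0]]} {GENDERS[prefix[1]]} voice '{name}'"


description = describe_voice("bm_fable")
```

For the authoritative list of voices, use kokoro -v list or the list_voices() tool.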
Playback Control
Two shell scripts control playback from outside Claude (e.g., via Stream Deck, Keyboard Maestro, or a hotkey). They work with both the MCP server and the CLI — whichever is currently playing:
- kokoro-pause: Toggle pause/resume. Also supports kokoro-pause pause, kokoro-pause resume, and kokoro-pause status.
- kokoro-stop: Stop playback immediately and discard audio.
These work by creating/removing sentinel files (/tmp/kokoro-tts-pause, /tmp/kokoro-tts-stop) that the playback loop monitors.
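A playback loop that honors sentinel files might look like the sketch below. It is an assumption-laden illustration rather than the project's actual loop: play_chunks and fake_play are hypothetical names, the sentinel paths are passed in (the real tool uses the /tmp paths named above), and deleting the stop file once it has been seen is a guess about how the request is consumed.

```python
import tempfile
import time
from pathlib import Path
from typing import Callable, Iterable


def play_chunks(
    chunks: Iterable[str],
    pause_file: Path,
    stop_file: Path,
    play_chunk: Callable[[str], None],
    poll_interval: float = 0.05,
) -> int:
    """Play chunks in order, checking sentinel files between chunks."""
    played = 0
    for chunk in chunks:
        # Wait while the pause sentinel exists, but let a stop cut through.
        while pause_file.exists() and not stop_file.exists():
            time.sleep(poll_interval)
        if stop_file.exists():
            stop_file.unlink(missing_ok=True)  # assumption: consume the request
            break
        play_chunk(chunk)
        played += 1
    return played


# Demonstration using a temporary directory instead of the real /tmp paths.
with tempfile.TemporaryDirectory() as tmp:
    pause_file = Path(tmp) / "pause"
    stop_file = Path(tmp) / "stop"
    heard = []

    def fake_play(chunk: str) -> None:
        heard.append(chunk)
        if chunk == "two":
            stop_file.touch()  # simulate an external stop request

    played = play_chunks(["one", "two", "three"], pause_file, stop_file, fake_play)
```

Because the signal is just a file, any tool that can create or remove a file (a shell script, Stream Deck, Keyboard Maestro) can control playback without talking to the process directly.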
Multi-Segment Playback
When Claude plays multiple segments sequentially (e.g., reading a list of items one by one), it polls status() until idle before starting the next segment. If the user stops playback externally (via kokoro-stop, Stream Deck, etc.), user_stop_requested() returns True once, signaling Claude to skip remaining segments instead of immediately starting the next one. The MCP stop() tool does not set this flag — it only applies to external stops, so Claude can distinguish "user wants silence" from "Claude decided to stop."
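The polling protocol above can be sketched as a small loop. The tool names speak(), status(), and user_stop_requested() come from the Tools table; everything else (read_segments, FakeClient, the poll interval) is hypothetical scaffolding so the loop can run without audio.

```python
import time
from typing import Iterable, List


class FakeClient:
    """In-memory stand-in for the MCP tools, for demonstration only."""

    def __init__(self, stop_after: int) -> None:
        self.spoken: List[str] = []
        self._polls_left = 0
        self._stop_after = stop_after
        self._stop_flag = False

    def speak(self, text: str) -> None:
        self.spoken.append(text)
        self._polls_left = 2  # pretend playback lasts two status polls
        if len(self.spoken) == self._stop_after:
            self._stop_flag = True  # simulate an external kokoro-stop

    def status(self) -> str:
        if self._polls_left > 0:
            self._polls_left -= 1
            return "playing"
        return "idle"

    def user_stop_requested(self) -> bool:
        # Returns True once, then clears, as the Tools table describes.
        flag, self._stop_flag = self._stop_flag, False
        return flag


def read_segments(client, segments: Iterable[str], poll_interval: float = 0.2) -> int:
    """Speak segments one by one; skip the rest after an external stop."""
    started = 0
    for segment in segments:
        client.speak(segment)
        started += 1
        while client.status() != "idle":
            time.sleep(poll_interval)
        if client.user_stop_requested():
            break
    return started


client = FakeClient(stop_after=2)
started = read_segments(client, ["one", "two", "three"], poll_interval=0)
```

Checking the flag only after playback reaches idle keeps the loop simple: an external stop makes status() return to idle early, and the flag then explains why.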
Text Preprocessing
MCP server and CLI — Negative numbers (e.g., -3) are expanded to words (minus 3) before generation. The Kokoro phonemizer silently drops bare negative-sign tokens, so without this preprocessing, -3 degrees would be spoken as just degrees.
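A regular-expression sketch of this expansion is shown below. It is one plausible way to do it, not necessarily the project's implementation: the name expand_negative_numbers and the exact boundary rules (a minus sign counts only when it is not preceded by a word character or a dot) are assumptions.

```python
import re

# A minus sign directly before a digit, not preceded by a word character
# or dot, so ranges such as "10-3" and words such as "well-known" are untouched.
NEGATIVE_RE = re.compile(r"(?<![\w.])-(?=\d)")


def expand_negative_numbers(text: str) -> str:
    """Replace a leading minus sign on a number with the word 'minus'."""
    return NEGATIVE_RE.sub("minus ", text)


spoken = expand_negative_numbers("-3 degrees")
```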
kokoro-clipboard — Clipboard text goes through additional preprocessing to improve listening quality:
- Markdown syntax stripped (headings, bold, italic, links, fences, tables, etc.)
- URLs expanded to speakable form (https://foo.com/path → https colon slash slash foo dot com slash path)
- Negative numbers expanded (-3 → minus 3)
- Punctuation between digits/words preserved (3.14, 10:30, $1,299.99 stay intact)
- [kokoro]...[/kokoro] markers supported to limit what gets spoken
- Use --dry-run to preview the cleaned text without audio
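The URL expansion in the list above could be approximated as follows. This is a simplified sketch under stated assumptions: speak_url and expand_urls are hypothetical names, only colon, slash, and dot are spelled out, and trailing sentence punctuation is left untouched. The real cleanup in kokoro-clipboard may handle more symbols.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>()]+")
SYMBOL_WORDS = {":": "colon", "/": "slash", ".": "dot"}


def speak_url(url: str) -> str:
    """Spell out the punctuation in a single URL."""
    parts = re.split(r"([:/.])", url)  # keep the separators as tokens
    return " ".join(SYMBOL_WORDS.get(part, part) for part in parts if part)


def expand_urls(text: str) -> str:
    """Replace each URL in the text with its speakable form."""

    def replace(match: re.Match) -> str:
        url = match.group(0)
        trimmed = url.rstrip(".,;:!?")  # keep sentence punctuation out of the URL
        return speak_url(trimmed) + url[len(trimmed):]

    return URL_RE.sub(replace, text)


speakable = speak_url("https://foo.com/path")
```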
Known Issues
- Python 3.13+ not supported: spacy and pydantic have incompatibilities on 3.13+. Use Python 3.12.
- Short text workaround: Text under 25 characters is automatically padded to avoid an mlx-audio hang bug. This is handled transparently.
- Do not install phonemizer: The phonemizer package conflicts with phonemizer-fork (pulled in by mlx-audio). Installing it causes out-of-dictionary words to be silently skipped. See requirements.txt for details.
- misaki must be <0.9: Version 0.9+ breaks EspeakWrapper.set_data_path. This is pinned in requirements.txt.






