tts-audio-mcp
MCP server that analyzes TTS audio recordings — enabling Claude Code, OpenCode, and Qwen Code to debug voice call center audio the same way they debug code errors.
Feed it an audio file, get back a structured report with transcription, quality scores, pacing analysis, and mispronunciation detection.
What It Does
Audio File (.wav/.mp3/.m4a)
        |
        v
+--------------------------------+
| tts-audio-mcp server           |
|                                |
| transcribe    -> whisper.cpp   |
| quality_score -> librosa       |
| compare_tts   -> whisper+diff  |
| analyze_tts   -> all combined  |
|                                |
| Transport: stdio (MCP)         |
+--------------------------------+
        |
        v
Structured report -> LLM reasons about fixes
Tools
transcribe
Transcribe audio to text with word-level timestamps.
Input: audio_path (string), language (string, default: "en")
Output: Full transcription text, per-word timestamps (start_ms, end_ms), segments, detected language, duration.
quality_score
Analyze speech quality — pitch variation, energy dynamics, silence ratio.
Input: audio_path (string)
Output:
- Pitch: mean/std/range Hz, monotone risk flag, interpretation
- Energy: RMS level, dynamic range dB, interpretation
- Silence ratio: percentage of audio that is silent
- Overall assessment with list of detected issues
compare_tts
Compare TTS output against expected text to find mispronunciations.
Input: audio_path (string), expected_text (string), language (string, default: "en")
Output: Word Error Rate (WER), substitutions, insertions, deletions with positions.
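As a rough illustration of what a word-level diff with WER involves, here is a minimal sketch using a standard Levenshtein alignment over normalized words. It is not the server's actual implementation: the function names `normalize` and `word_diff`, the normalization rules, and the shape of the returned dictionary are all assumptions made for this example.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation from word edges (assumed normalization)."""
    words = (raw.strip(string.punctuation) for raw in text.lower().split())
    return [w for w in words if w]


def word_diff(expected: str, actual: str) -> dict:
    """Align expected vs. transcribed words and report edits plus WER.

    Hypothetical helper: positions are indices into the normalized expected text.
    """
    ref, hyp = normalize(expected), normalize(actual)
    n, m = len(ref), len(hyp)

    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right corner to recover the edit operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append({"type": "substitution", "position": i - 1,
                        "expected": ref[i - 1], "got": hyp[j - 1]})
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append({"type": "deletion", "position": i - 1,
                        "expected": ref[i - 1], "got": None})
            i -= 1
        else:
            ops.append({"type": "insertion", "position": i,
                        "expected": None, "got": hyp[j - 1]})
            j -= 1
    ops.reverse()

    # WER = (substitutions + insertions + deletions) / number of expected words
    return {"wer": len(ops) / max(n, 1), "operations": ops}
```

Calling `word_diff(script, transcription)` with a script and the transcribed text returns the edit operations in reading order, which is the kind of information the tool's substitutions, insertions, and deletions output conveys.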
analyze_tts
Full composite analysis — runs all of the above and returns a single structured report.
Input: audio_path (string), expected_text (string, optional), language (string, default: "en")
Output: Combined report with transcription, quality scores, pacing analysis (WPM, rushed words, long pauses), and pronunciation diff.
Example Output
TTS Analysis Report
File: /tmp/tts-test-speech.wav
Duration: 4.15s | Words: 13 | Rate: 195 WPM
--- Transcription ---
Hello. Thank you for calling Acme Support. How can I help you today?
--- Quality Scores ---
Pitch: mean 258.5Hz, std 61.1Hz, range 264.3Hz
Good variation — expressive
Energy: RMS 0.0912, dynamic range 80dB
Wide dynamic range
Silence ratio: 37.3%
Overall: Minor issues detected (1)
! Very wide dynamic range — may clip
--- Pacing Analysis ---
Speaking rate: 195 WPM (natural: 120-180)
Minor pacing issues
Rushed words:
"How" spoken in 40ms
! Speaking rate too fast: 195 WPM (natural: 120-180)
! 1 rushed word(s) detected (<80ms)
--- Pronunciation Check ---
Expected: Hello. Thank you for calling Acme Support. How can I help you today?
Got: Hello. Thank you for calling Acme Support. How can I help you today?
WER: 0.0%
Perfect match — no mispronunciations detected
--- Issues Summary ---
1. Very wide dynamic range — may clip
2. Speaking rate too fast: 195 WPM (natural: 120-180)
3. 1 rushed word(s) detected (<80ms)
Prerequisites
- whisper.cpp with Metal acceleration: brew install whisper-cpp
- Whisper model: ggml-large-v3-turbo.bin (~1.5 GB) in models/
- Python 3.12 with librosa: .venv/bin/python3 with pip install librosa
- ffmpeg for audio format conversion: brew install ffmpeg
- Node.js 18+
Installation
git clone https://github.com/reactiongears/tts-audio-mcp.git
cd tts-audio-mcp
# Node dependencies
npm install
# Python venv for audio analysis
python3.12 -m venv .venv
.venv/bin/pip install librosa 'setuptools<82'
# Download whisper model
mkdir -p models
curl -L -o models/ggml-large-v3-turbo.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
# Build
npm run build
Integration
Claude Code
Add to ~/.claude/.mcp.json:
{
  "mcpServers": {
    "tts-audio": {
      "command": "node",
      "args": ["/path/to/tts-audio-mcp/dist/index.js"]
    }
  }
}
OpenCode
Add to ~/.config/opencode/opencode.json under "mcp":
"tts-audio": {
  "type": "local",
  "command": ["node", "/path/to/tts-audio-mcp/dist/index.js"],
  "enabled": true
}
Qwen Code
Add to ~/.qwen/settings.json under "mcpServers":
"tts-audio": {
  "command": "node",
  "args": ["/path/to/tts-audio-mcp/dist/index.js"]
}
Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| WHISPER_BINARY | whisper-cli | Path to whisper.cpp binary |
| WHISPER_MODEL_PATH | ~/Documents/_dev/tts-audio-mcp/models/ggml-large-v3-turbo.bin | Path to Whisper model file |
| TTS_PYTHON_BIN | .venv/bin/python3 | Python binary with librosa installed |
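To make the override-or-default behavior concrete, the sketch below resolves the three variables from the table against an environment mapping. The server itself is a Node.js program, so this Python version only illustrates the precedence. The function name `resolve_config`, the dictionary keys, and the `~` expansion are assumptions, not the server's code.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "WHISPER_BINARY": "whisper-cli",
    "WHISPER_MODEL_PATH": "~/Documents/_dev/tts-audio-mcp/models/ggml-large-v3-turbo.bin",
    "TTS_PYTHON_BIN": ".venv/bin/python3",
}


def resolve_config(env=None) -> dict:
    """Return the effective settings: an environment value wins over the default.

    Hypothetical helper; pass a dict to test without touching os.environ.
    """
    env = os.environ if env is None else env
    return {
        "whisper_binary": env.get("WHISPER_BINARY", DEFAULTS["WHISPER_BINARY"]),
        # Expanding "~" is an assumption about how the default path is interpreted.
        "whisper_model_path": os.path.expanduser(
            env.get("WHISPER_MODEL_PATH", DEFAULTS["WHISPER_MODEL_PATH"])
        ),
        "python_bin": env.get("TTS_PYTHON_BIN", DEFAULTS["TTS_PYTHON_BIN"]),
    }
```

For example, `resolve_config({"WHISPER_MODEL_PATH": "/opt/models/custom.bin"})` keeps the default binary names and swaps only the model path.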
Usage
Once the MCP server is configured in your coding assistant, the tools are available automatically. You talk to your assistant in natural language — it decides when to call the tools and interprets the results for you.
Quick Start
Generate a test audio file to try it out:
# macOS — use the built-in TTS engine
say -o /tmp/test-greeting.wav --data-format=LEI16@16000 \
"Hello. Thank you for calling Acme Support. How can I help you today?"
Then in Claude Code, OpenCode, or Qwen Code:
> Analyze the audio at /tmp/test-greeting.wav
The assistant calls analyze_tts behind the scenes and returns a full report with transcription, quality scores, pacing, and issues.
Debugging TTS Problems
"It sounds robotic" — Check pitch variation and monotone risk:
> Run quality_score on /recordings/agent-greeting.wav — customers say it sounds robotic
If the report shows a pitch standard deviation below 20 Hz, it flags monotone risk. That tells you to increase prosody variation in your TTS config.
"Words are getting swallowed" — Compare against expected script:
> Compare /recordings/transfer-prompt.wav against the expected text:
> "Thank you for your patience. I'll transfer you to a specialist now."
The tool transcribes the audio, diffs it against your script, and reports substitutions ("specialist" → "specialist's"), deletions, and WER. You know exactly which words the TTS is mangling.
"It's talking too fast / has weird pauses" — Check pacing:
> Analyze /recordings/ivr-menu.wav — callers are complaining it's too fast
The report flags speaking rate (natural range: 120-180 WPM), individual rushed words (< 80ms), and unnatural pauses (> 500ms). You know where to add SSML breaks or adjust rate.
"Something is off but I'm not sure what" — Full analysis:
> Run a full analysis on /recordings/hold-message.wav
> The expected text is: "Your call is important to us. Please hold and an agent will be with you shortly."
The report covers everything: transcription, quality metrics, pacing analysis, pronunciation diff, and a prioritized issues summary.
Batch Debugging
You can analyze multiple recordings in a conversation:
> Compare these three recordings against their scripts and tell me which one has the most issues:
> 1. /recordings/greeting.wav — "Welcome to Acme Support"
> 2. /recordings/hold.wav — "Please hold while I look that up"
> 3. /recordings/goodbye.wav — "Thank you for calling. Have a great day!"
The assistant calls compare_tts for each file and summarizes which recordings need attention.
Using Individual Tools
You can also ask for specific analysis:
| What you want | What to ask |
|---------------|-------------|
| Just the transcription | "Transcribe /path/to/audio.wav" |
| Just quality metrics | "Check the audio quality of /path/to/audio.wav" |
| Just pronunciation accuracy | "Compare /path/to/audio.wav against 'expected text here'" |
| Everything at once | "Full TTS analysis on /path/to/audio.wav" |
Supported Audio Formats
- .wav: processed directly (best performance)
- .mp3: auto-converted to WAV via ffmpeg
- .m4a: auto-converted to WAV via ffmpeg
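The conversion step can be pictured as building an ffmpeg command for non-WAV inputs. The sketch below is an assumption about how such a step might look, not the server's actual code: the function name `conversion_command`, the output directory, and the choice of 16 kHz mono 16-bit PCM (the format the Quick Start sample uses) are all illustrative. The ffmpeg options themselves (`-y`, `-i`, `-ar`, `-ac`, `-c:a`) are standard.

```python
from pathlib import Path

# Formats the README lists as auto-converted.
CONVERTIBLE = {".mp3", ".m4a"}


def conversion_command(audio_path: str, out_dir: str = "/tmp"):
    """Return an ffmpeg argv list for formats that need conversion, or None for .wav.

    Hypothetical helper: the target format (16 kHz, mono, 16-bit PCM) is an assumption.
    """
    src = Path(audio_path)
    suffix = src.suffix.lower()
    if suffix == ".wav":
        return None  # processed directly, no conversion needed
    if suffix not in CONVERTIBLE:
        raise ValueError(f"unsupported audio format: {suffix or '(none)'}")
    dst = Path(out_dir) / f"{src.stem}.wav"
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", str(src),        # input file
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the argv separately keeps the format decision easy to test without invoking ffmpeg.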
Interpreting Results
Quality Scores:

| Metric | Good | Concerning |
|--------|------|------------|
| Pitch std | 25-80 Hz (natural variation) | < 15 Hz (monotone/robotic) |
| Dynamic range | 10-50 dB | < 10 dB (flat) or > 70 dB (may clip) |
| Silence ratio | 10-50% | > 50% (too much dead air) or < 10% (no breathing room) |
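The "Concerning" column can be read as a set of threshold checks. The sketch below applies only those thresholds and leaves values between the "Good" and "Concerning" ranges unflagged. The function name `quality_issues` and the issue wording are assumptions for illustration, not the tool's actual output strings.

```python
def quality_issues(pitch_std_hz: float, dynamic_range_db: float, silence_ratio_pct: float) -> list[str]:
    """Flag metrics that fall in the 'Concerning' column of the table above.

    Hypothetical helper; thresholds are taken from the table, messages are illustrative.
    """
    issues = []
    if pitch_std_hz < 15:
        issues.append("pitch variation too low (monotone/robotic)")
    if dynamic_range_db < 10:
        issues.append("dynamic range too narrow (flat)")
    elif dynamic_range_db > 70:
        issues.append("dynamic range very wide (may clip)")
    if silence_ratio_pct > 50:
        issues.append("too much dead air")
    elif silence_ratio_pct < 10:
        issues.append("no breathing room")
    return issues
```

With the values from the Example Output section (pitch std 61.1 Hz, dynamic range 80 dB, silence ratio 37.3%), only the dynamic range check fires, which agrees with the single quality issue that report lists.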
Pacing:

| Metric | Natural range | Flag |
|--------|---------------|------|
| Speaking rate | 120-180 WPM | Outside range |
| Word duration | > 80ms | < 80ms = rushed |
| Inter-word gap | < 500ms | > 500ms = unnatural pause |
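Given per-word timestamps like those the transcribe tool returns (start_ms, end_ms), the pacing checks reduce to a few comparisons. The following is a minimal sketch under stated assumptions: the `Word` dataclass, the function name `analyze_pacing`, and computing WPM over the full audio duration are illustrative choices, and the server may compute these differently.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int
    end_ms: int


def analyze_pacing(words: list[Word], duration_ms: int,
                   min_word_ms: int = 80, max_gap_ms: int = 500,
                   wpm_range: tuple[int, int] = (120, 180)) -> dict:
    """Apply the pacing thresholds from the table above (hypothetical helper)."""
    # Words per minute over the whole clip (assumption: silence counts toward duration).
    wpm = len(words) * 60000 / duration_ms if duration_ms > 0 else 0.0

    # Words whose spoken duration is shorter than the rushed threshold.
    rushed = [w.text for w in words if (w.end_ms - w.start_ms) < min_word_ms]

    # Gaps between the end of one word and the start of the next.
    long_pauses = [
        (prev.text, nxt.text, nxt.start_ms - prev.end_ms)
        for prev, nxt in zip(words, words[1:])
        if nxt.start_ms - prev.end_ms > max_gap_ms
    ]

    return {
        "wpm": wpm,
        "rate_in_range": wpm_range[0] <= wpm <= wpm_range[1],
        "rushed_words": rushed,
        "long_pauses": long_pauses,
    }
```

Each entry in `long_pauses` names the words on either side of the gap, which is where an SSML break or rate adjustment would go.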
Pronunciation (WER):

| WER | Interpretation |
|-----|----------------|
| 0% | Perfect match |
| 1-5% | Minor issues (articles, contractions) |
| 5-15% | Noticeable mispronunciations |
| > 15% | Significant problems |
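The WER bands above can be expressed as a small lookup. The table leaves two boundaries open (values between 0% and 1%, and exactly 5%), so this sketch assumes both belong to the "minor" band. The function name `interpret_wer` is hypothetical.

```python
def interpret_wer(wer_pct: float) -> str:
    """Map a WER percentage to the interpretation bands in the table above.

    Assumption: anything above 0% and up to 5% counts as 'minor'.
    """
    if wer_pct < 0:
        raise ValueError("WER cannot be negative")
    if wer_pct == 0:
        return "Perfect match"
    if wer_pct <= 5:
        return "Minor issues (articles, contractions)"
    if wer_pct <= 15:
        return "Noticeable mispronunciations"
    return "Significant problems"
```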
Real-World Workflow
A typical voice call center debugging session:
- Customer reports: "The bot sounds weird when it says the account number"
- You pull the call recording: /recordings/call-1234-segment.wav
- You know the expected script: "Your account number is 7 8 4 2 0 1 3"
- In Claude Code:
> Compare /recordings/call-1234-segment.wav against "Your account number is 7 8 4 2 0 1 3"
> What's wrong and how should I fix the TTS config?
- Claude calls compare_tts, sees the TTS is running digits together ("seven eight four" → "seventy-eight four"), and suggests adding SSML <say-as interpret-as="digits"> tags or inter-digit pauses to your TTS configuration
The LLM doesn't just report the numbers — it reasons about the root cause and suggests specific fixes to your TTS code or configuration.
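As an illustration of the kind of fix suggested in the workflow above, the sketch below wraps runs of digits in an SSML say-as tag so a TTS engine reads them one at a time. This is a hypothetical helper, not part of tts-audio-mcp. The function name `wrap_digits` is an assumption, and support for `interpret-as="digits"` varies by TTS engine, so check your provider's SSML documentation.

```python
import re
from xml.sax.saxutils import escape

# A run of digits, optionally separated by single spaces: "7842013" or "7 8 4 2 0 1 3".
DIGIT_RUN = re.compile(r"\d(?: ?\d)*")


def wrap_digits(text: str) -> str:
    """Return SSML in which every digit run is marked to be read digit by digit."""
    def tag(match: re.Match) -> str:
        digits = match.group(0).replace(" ", "")
        return f'<say-as interpret-as="digits">{digits}</say-as>'

    # Escape XML special characters first so the script text stays valid SSML.
    return f"<speak>{DIGIT_RUN.sub(tag, escape(text))}</speak>"
```

After applying a change like this, re-running compare_tts on the regenerated audio shows whether the digits are now read individually.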
License
MIT






