OpenClaw · Skill
Pocket TTS
Lightweight CPU-friendly text-to-speech with voice cloning. No GPU required.
Install
Start with the primary install command. Alternate entrypoints are included below for ClawHub and OpenClaw CLI users.
Primary command
clawhub install leonaaardob/lb-pocket-tts-skill
ClawHub installer
npx clawhub@latest install leonaaardob/lb-pocket-tts-skill
OpenClaw CLI
openclaw skills install leonaaardob/lb-pocket-tts-skill
Direct OpenClaw install
openclaw install leonaaardob/lb-pocket-tts-skill
What this skill does
Lightweight CPU-friendly text-to-speech with voice cloning. No GPU required.
Why it matters
Runs entirely on CPU without cloud API dependencies, making it faster to set up and cheaper to operate than hosted TTS services for local or offline workflows.
Typical use cases
- Narrating documentation without a cloud TTS subscription
- Cloning a custom voice for consistent audio branding
- Generating speech offline on a laptop during travel
- Prototyping voice interfaces without API rate limits
- Batch-converting written content to audio files
Source instructions
When to Use
- Generating speech from text on CPU without GPU
- Voice cloning from audio samples
- Streaming audio generation (low latency)
- Local TTS without API dependencies
- Real-time speech synthesis (~6x faster than real-time)
Key Features
- 100M parameters - Small, efficient model
- CPU-optimized - No GPU needed, uses only 2 cores
- ~6x real-time - Fast generation on modern CPUs
- ~200ms latency - To first audio chunk (streaming)
- Voice cloning - From 3-10s audio samples
- 24kHz mono WAV - High-quality output
- English only - More languages planned
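The speed figures above translate into simple arithmetic. The sketch below is only an illustration of what "~6x real-time" and "24kHz mono" imply; the constants are the approximate values quoted in the list, and the function names are hypothetical helpers, not part of pocket-tts.

```python
# Approximate figures taken from the feature list; real numbers depend on the CPU.
REAL_TIME_FACTOR = 6.0       # ~6x faster than real time
SAMPLE_RATE_HZ = 24_000      # 24 kHz mono output


def estimated_generation_seconds(audio_seconds: float, rtf: float = REAL_TIME_FACTOR) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of speech at a given real-time factor."""
    return audio_seconds / rtf


def samples_for(audio_seconds: float, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of mono samples contained in `audio_seconds` of audio."""
    return round(audio_seconds * sample_rate)


if __name__ == "__main__":
    # A 60 s narration at ~6x real time takes roughly 10 s to generate.
    print(estimated_generation_seconds(60.0))
    print(samples_for(60.0))
```

Treat the result as a rough planning estimate: the ~6x figure is stated for modern CPUs and will vary with hardware and text length.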
Installation
pip install pocket-tts
# or
uv add pocket-tts
CLI Commands
Generate Speech
# Basic generation (default voice)
pocket-tts generate --text "Hello world"
# Custom voice (local file, URL, or safetensors)
pocket-tts generate --voice ./my_voice.wav
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
pocket-tts generate --voice ./voice.safetensors
# Quality tuning
pocket-tts generate --temperature 0.7 --lsd-decode-steps 3
See docs/generate.md for full CLI reference.
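When calling the CLI from a script, it can help to assemble the arguments in one place. The sketch below builds an argument list using only the flags shown above (--text, --voice, --temperature, --lsd-decode-steps). The helper names are hypothetical, and combining all of these flags in a single call is an assumption; check docs/generate.md before relying on it.

```python
import shlex
import subprocess
from typing import List, Optional


def build_generate_command(
    text: str,
    voice: Optional[str] = None,
    temperature: Optional[float] = None,
    lsd_decode_steps: Optional[int] = None,
) -> List[str]:
    """Assemble an argv list for `pocket-tts generate` (hypothetical helper)."""
    cmd = ["pocket-tts", "generate", "--text", text]
    if voice is not None:
        cmd += ["--voice", voice]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if lsd_decode_steps is not None:
        cmd += ["--lsd-decode-steps", str(lsd_decode_steps)]
    return cmd


def run_generate(text: str, **options) -> None:
    """Run the command; this requires the pocket-tts CLI to be on PATH."""
    subprocess.run(build_generate_command(text, **options), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works without pocket-tts installed.
    print(shlex.join(build_generate_command("Hello world", voice="./my_voice.wav", temperature=0.7)))
```

Passing a list to subprocess.run avoids shell quoting problems when the text contains spaces or quotes.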
Start Web Server
# Start FastAPI server with web UI
pocket-tts serve
# Custom host/port
pocket-tts serve --host localhost --port 8080
See docs/serve.md for server options.
Export Voice Embeddings
Convert audio files to .safetensors for faster loading:
# Single file
pocket-tts export-voice voice.mp3 voice.safetensors
# Batch conversion
pocket-tts export-voice voices/ embeddings/ --truncate
See docs/export_voice.md for export options.
Python API
Basic Usage
from pocket_tts import TTSModel
import scipy.io.wavfile
# Load model
model = TTSModel.load_model()
# Get voice state
voice = model.get_state_for_audio_prompt(
    "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)
# Generate audio
audio = model.generate_audio(voice, "Hello world!")
# Save
scipy.io.wavfile.write("output.wav", model.sample_rate, audio.numpy())
Load Model
model = TTSModel.load_model(
    config="b6369a24",     # Model variant
    temp=0.7,              # Temperature (0.5-1.0)
    lsd_decode_steps=1,    # Generation steps (1-5)
    eos_threshold=-4.0,    # End-of-sequence threshold
)
Voice State
# From audio file/URL
voice = model.get_state_for_audio_prompt("./voice.wav")
voice = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")
# From safetensors (fast loading)
voice = model.get_state_for_audio_prompt("./voice.safetensors")
Streaming Generation
# Stream audio chunks
for chunk in model.generate_audio_stream(voice, "Long text..."):
    # Process/save/play each chunk as generated
    print(f"Chunk: {chunk.shape[0]} samples")
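To check streaming behaviour on your own machine, you can time the first chunk and total the samples as they arrive. The sketch below is a hypothetical helper (consume_stream is not part of pocket-tts) that works on any iterable of chunks; it assumes each chunk is one-dimensional, so len(chunk) matches chunk.shape[0]. The demo uses plain lists as stand-in chunks.

```python
import time
from typing import Iterable, Sequence, Tuple


def consume_stream(chunks: Iterable[Sequence], sample_rate: int) -> Tuple[int, float, float]:
    """Drain a chunk iterator and return (chunk_count, audio_seconds, seconds_to_first_chunk)."""
    start = time.perf_counter()
    first_chunk_delay = 0.0
    chunk_count = 0
    total_samples = 0
    for chunk in chunks:
        if chunk_count == 0:
            # Time elapsed between starting iteration and receiving the first chunk.
            first_chunk_delay = time.perf_counter() - start
        chunk_count += 1
        total_samples += len(chunk)  # assumes 1-D chunks
    return chunk_count, total_samples / sample_rate, first_chunk_delay


if __name__ == "__main__":
    # Stand-in stream: five chunks of 1920 samples each (0.4 s of audio at 24 kHz).
    fake_stream = ([0.0] * 1920 for _ in range(5))
    count, seconds, delay = consume_stream(fake_stream, 24_000)
    print(count, seconds, delay)
```

With a loaded model, the same helper could be called as consume_stream(model.generate_audio_stream(voice, "Long text..."), model.sample_rate).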
Multi-Voice Management
# Preload multiple voices
voices = {
    "casual": model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav"),
    "announcer": model.get_state_for_audio_prompt("./announcer.safetensors"),
}
# Use different voices
audio1 = model.generate_audio(voices["casual"], "Hey there!")
audio2 = model.generate_audio(voices["announcer"], "Breaking news!")
See docs/python-api.md for complete API reference.
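If you have many voices and do not want to load them all up front, a small lazy registry is one option. This is a sketch under assumptions: VoiceRegistry is a hypothetical class, not part of pocket-tts, and it only assumes a loader callable that takes a path or URL and returns a voice state (for example model.get_state_for_audio_prompt). The demo uses a stub loader so it runs without the library.

```python
from typing import Any, Callable, Dict


class VoiceRegistry:
    """Load each named voice on first use and reuse the cached state afterwards."""

    def __init__(self, loader: Callable[[str], Any], sources: Dict[str, str]) -> None:
        self._loader = loader            # e.g. model.get_state_for_audio_prompt
        self._sources = dict(sources)    # voice name -> path or URL
        self._states: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        if name not in self._sources:
            raise KeyError(f"unknown voice: {name!r}")
        if name not in self._states:
            self._states[name] = self._loader(self._sources[name])
        return self._states[name]


if __name__ == "__main__":
    calls = []

    def stub_loader(source: str) -> str:
        # Stand-in for the real loader; records each call so caching is visible.
        calls.append(source)
        return f"state:{source}"

    registry = VoiceRegistry(stub_loader, {"announcer": "./announcer.safetensors"})
    registry.get("announcer")
    registry.get("announcer")
    print(len(calls))  # the loader ran only once
```

With a real model, audio would then be produced with model.generate_audio(registry.get("announcer"), "Breaking news!").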
Available Voices
Pre-made voices from hf://kyutai/tts-voices/:
- alba-mackenna/casual.wav (default, female)
- jessica-jian/casual.wav (female)
- voice-donations/Selfie.wav (male, marius)
- voice-donations/Butter.wav (male, javert)
- ears/p010/freeform_speech_01.wav (male, jean)
- vctk/p244_023.wav (female, fantine)
- vctk/p262_023.wav (female, eponine)
- vctk/p303_023.wav (female, azelma)
Or clone any voice from your own audio samples.
Voice Cloning Tips
- Clean audio - Remove background noise first (for example with Adobe Podcast Enhance)
- Length - 3-10 seconds of speech is ideal
- Quality - Input quality directly affects output quality
- Format - WAV, MP3, and other common audio formats are supported
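The length tip above is easy to check before cloning. The sketch below uses Python's standard wave module to measure a WAV file and compare it against the 3-10 second range. The helper names are hypothetical, and it only handles uncompressed WAV input; an MP3 sample would need a different decoder.

```python
import os
import tempfile
import wave

MIN_PROMPT_S = 3.0    # lower bound suggested in the tips
MAX_PROMPT_S = 10.0   # upper bound suggested in the tips


def wav_duration_seconds(path: str) -> float:
    """Duration of a WAV file, computed from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_prompt_length(path: str) -> bool:
    """True if the sample falls inside the suggested 3-10 second range."""
    return MIN_PROMPT_S <= wav_duration_seconds(path) <= MAX_PROMPT_S


def write_silence(path: str, seconds: float, sample_rate: int = 24_000) -> None:
    """Demo helper: write `seconds` of 16-bit mono silence so the check can be tried out."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.wav")
        write_silence(sample, 5.0)
        print(wav_duration_seconds(sample), is_good_prompt_length(sample))
```

This checks length only; it says nothing about noise or recording quality, which still need a listen.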
Performance Tips
- CPU-only - GPU provides no speedup (model too small, batch size 1)
- 2 cores - Uses only 2 CPU cores efficiently
- Streaming - Low latency (~200ms to first chunk)
- Safetensors - Pre-process voices to .safetensors for instant loading
Output Format
All commands output WAV files:
- Sample rate: 24 kHz
- Channels: Mono
- Bit depth: 16-bit PCM
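If you produce audio through the Python API and want a file in this same layout, you can write it explicitly. The sketch below uses only the standard library to store float samples in the range [-1.0, 1.0] as a 24 kHz mono 16-bit PCM WAV. write_pcm16_mono is a hypothetical helper, and the assumption that your samples are floats in that range is yours to verify. Note that scipy.io.wavfile.write picks the bit depth from the array dtype, so a float array is saved as a 32-bit float WAV rather than 16-bit PCM.

```python
import array
import math
import os
import sys
import tempfile
import wave
from typing import Iterable

SAMPLE_RATE_HZ = 24_000


def write_pcm16_mono(path: str, samples: Iterable[float], sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Write float samples in [-1.0, 1.0] as a 16-bit PCM mono WAV; return the frame count."""
    # Clamp each sample, then scale to the signed 16-bit range.
    pcm = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 2 bytes = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return len(pcm)


if __name__ == "__main__":
    # One second of a 440 Hz sine tone as stand-in audio.
    tone = (math.sin(2 * math.pi * 440 * n / SAMPLE_RATE_HZ) for n in range(SAMPLE_RATE_HZ))
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "tone.wav")
        print(write_pcm16_mono(out, tone))
```

At this format, one second of audio occupies 24,000 samples x 2 bytes = 48,000 bytes, plus the WAV header.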