A Dockerized Coqui XTTS server with multilingual support, voice cloning, smart text preprocessing, and robust GPU memory management.
Features:

- `.mp3` → `.wav` conversion with staleness detection
- Acronyms such as KI, EU, US, BRD are expanded to phonetic letter-spellings appropriate for German or English (Ka-I, E-U, U-Es …)
- `empty_cache()` after every chunk and after every completed request keeps VRAM usage flat across long documents
- CPU multi-core fallback (`torch.set_num_threads`)

Endpoints:

- `/tts`, `/api/tts` — full-file synthesis
- `/tts_stream`, `/api/tts_stream` — progressive streaming synthesis
- `/health` — liveness + VRAM status

Project layout:

```
coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py            # Production server
├── models/                  # XTTS model weights (host-mount recommended)
├── voices/                  # Speaker reference files (.wav or .mp3)
└── cache/                   # Persistent speaker embeddings (.pkl)
```
```bash
./build.sh
./run.sh
```
The scripts handle GPU detection, volume mounts for /models, /voices, /cache, and acceptance of the Coqui TTS license terms.
### GET /health

Returns server liveness and current VRAM status.
Example response (GPU):
```json
{
  "status": "ok",
  "device": "cuda",
  "vram_free_mb": 2814,
  "vram_total_mb": 4096,
  "vram_used_pct": 31.3
}
```
Example response (CPU):
```json
{ "status": "ok", "device": "cpu" }
```
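The `vram_used_pct` field is just the free/total ratio rounded to one decimal place. A minimal sketch (the helper name is hypothetical; only the field names come from the response above):

```python
def vram_status(free_mb: int, total_mb: int) -> dict:
    """Derive the /health VRAM fields from free/total MiB figures."""
    return {
        "vram_free_mb": free_mb,
        "vram_total_mb": total_mb,
        # percentage of VRAM currently in use, one decimal place
        "vram_used_pct": round(100 * (1 - free_mb / total_mb), 1),
    }

print(vram_status(2814, 4096))  # reproduces the example response fields
```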
### GET /tts · GET /api/tts

Synthesise speech and return the complete audio file.
Returns: audio/wav
| Parameter | Default | Description |
|---|---|---|
| `text` | required | Text to synthesise (plain or markdown) |
| `voice` | `default` | Voice name (stem of file in `/voices`) |
| `lang` | `en` | BCP-47 language code (`en`, `de`, `fr` …) |
Example:
```bash
curl "http://localhost:5002/tts?text=Hello+world&voice=alice&lang=en" --output out.wav
```
### GET /tts_stream · GET /api/tts_stream

Streams synthesised audio progressively as each chunk completes.
Useful for conversational agents, low-latency playback, or very long documents.
Parameters are identical to /tts.
Example:
```bash
curl "http://localhost:5002/tts_stream?text=Hello+world&voice=alice&lang=en" --output out.wav
```
Streaming works best with audio players that support progressive WAV input (e.g. VLC, ffplay, most browser `<audio>` elements).
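A common trick for streaming WAV is to emit a header whose size fields are maxed out, since progressive players ignore them and read until the stream ends. A minimal sketch of such a header; the 24000 Hz sample rate is an assumption, not taken from this server's code:

```python
import struct

def wav_stream_header(sample_rate: int = 24000, bits: int = 16,
                      channels: int = 1) -> bytes:
    """Build a 44-byte WAV header usable before total length is known.

    RIFF and data chunk sizes are set to 0xFFFFFFFF; progressive
    players ignore them and play until the connection closes.
    """
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (b"RIFF" + struct.pack("<I", 0xFFFFFFFF) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels,
                                    sample_rate, byte_rate,
                                    block_align, bits)
            + b"data" + struct.pack("<I", 0xFFFFFFFF))
```

PCM chunks can then be written to the response socket as each text chunk finishes synthesising.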
### GET /voices

Returns available voices (deduplicated, sorted).
Example response:
```json
{ "voices": ["alice", "default", "narrator"] }
```
All text passes through a two-stage pipeline before synthesis:
**Stage 1: Markdown → prosody.** Markdown structure is translated into punctuation cues that XTTS responds to prosodically — no spoken labels, just natural pauses:
| Markdown element | Spoken rendering |
|---|---|
| `# H1` | Long pause (`...`) before and after |
| `## H2` / `### H3` | Medium pause (`.`) before and after |
| `#### H4`–`H6` | Short pause after |
| `**bold**` / `*italic*` | Comma-breath either side: `, text,` |
| `- bullet` / `1. item` | Comma-breath before, period after |
| Blank line | Full stop (paragraph break) |
| `---` horizontal rule | Long section-break pause (`...`) |
| `` `code` `` / block | Plain text, fences stripped |
| `[link](url)` | Label text only |
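Rules like these can be implemented as a line-by-line regex pass. A hypothetical sketch — the real rules live in `tts_server.py`, and the function name and exact regexes here are illustrative:

```python
import re

def markdown_to_prosody(text: str) -> str:
    """Illustrative stage-1 pass: markdown structure -> punctuation cues."""
    out = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if re.fullmatch(r"([-*_])\s*(\1\s*){2,}", line):       # horizontal rule
            out.append("...")
            continue
        m = re.match(r"(#{1,6})\s+(.*)", line)
        if m:                                                  # headings -> pauses
            level, title = len(m.group(1)), m.group(2)
            pause = "..." if level == 1 else "." if level <= 3 else ""
            tail = "." if level >= 4 else ""
            out.append(f"{pause} {title}{tail} {pause}".strip())
            continue
        line = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", line)   # [label](url) -> label
        line = re.sub(r"\*{1,2}([^*]+)\*{1,2}", r", \1,", line)  # emphasis -> breaths
        line = re.sub(r"^(?:[-*+]|\d+\.)\s+(.*)", r", \1.", line)  # list items
        line = line.replace("`", "")                           # strip inline code ticks
        out.append(line if line else ".")                      # blank line -> full stop
    return " ".join(out)
```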
**Stage 2: Acronym expansion.** Acronyms and symbols are expanded to phonetic spellings before tokenisation, preventing the CUDA device-side assert errors caused by out-of-range token IDs.
Expansion is language-aware — pass lang=de for German rules, any other value for English rules.
German examples (lang=de):
| Input | Expanded |
|---|---|
| KI | Ka-I |
| EU | E-U |
| US | U-Es |
| ARD | A-Er-De |
| z.B. | zum Beispiel |
| € | Euro |
English examples (lang=en):
| Input | Expanded |
|---|---|
| KI | Kay Eye |
| EU | E-U |
| € | euros |
To add a new term, edit `ACRONYMS_DE` or `ACRONYMS_EN` at the top of `tts_server.py` — no other changes needed.
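The lookup-and-substitute step can be sketched as follows. The table contents come from the examples above; the function name and regex details are illustrative, not the server's actual code:

```python
import re

# Hypothetical excerpts of the lookup tables described above; the real
# (larger) tables live at the top of tts_server.py.
ACRONYMS_DE = {"KI": "Ka-I", "EU": "E-U", "US": "U-Es", "ARD": "A-Er-De",
               "z.B.": "zum Beispiel", "€": "Euro"}
ACRONYMS_EN = {"KI": "Kay Eye", "EU": "E-U", "€": "euros"}

def expand_acronyms(text: str, lang: str) -> str:
    """Language-aware expansion: German rules for lang=de, else English."""
    table = ACRONYMS_DE if lang == "de" else ACRONYMS_EN
    # Longest keys first so "z.B." wins over any shorter overlap.
    alts = "|".join(re.escape(k) for k in sorted(table, key=len, reverse=True))
    # Only whole tokens: no letter/digit directly on either side,
    # so "US" never fires inside a word like "MUSIK".
    rx = re.compile(rf"(?<![A-Za-z0-9])(?:{alts})(?![A-Za-z0-9])")
    return rx.sub(lambda m: table[m.group(0)], text)
```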
The server is designed to run stably on small GPUs (~4 GB VRAM) across long documents and back-to-back requests.
**Proactive headroom check** (before synthesis starts)
If free VRAM is below 20% when a request arrives, the entire request — model, embeddings, and inference — is pinned to CPU from the start. This avoids the more expensive mid-document fallback.
**Per-chunk cache flush** (during synthesis)
torch.cuda.empty_cache() is called after every successfully synthesised chunk. XTTS leaves GPT decoder KV-cache and attention buffers behind; without explicit flushing these accumulate across a 20-chunk document and starve subsequent requests.
**End-of-request flush** (after synthesis)
A final empty_cache() after the full request completes catches anything the per-chunk flushes missed, including output tensors that lived in the accumulation buffer.
**OOM recovery** (safety net)
If a chunk triggers a CUDA OOM despite the above, the model moves to CPU for that chunk and returns to GPU immediately after, inside a finally block so a CPU-side failure cannot strand the model on CPU permanently.
**CPU multi-core utilisation**
When running on CPU (fallback or CPU-only host), torch.set_num_threads(os.cpu_count()) is set at startup so all available cores are used.
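The recovery flow above can be illustrated with stand-in callables — no torch required here; `gpu_synth`, `cpu_synth`, `to_cpu`, and `to_gpu` are hypothetical hooks for the real model calls:

```python
def synthesize_chunks(chunks, gpu_synth, cpu_synth, to_cpu, to_gpu):
    """Per-chunk synthesis with the OOM safety net described above."""
    audio = []
    for chunk in chunks:
        try:
            audio.append(gpu_synth(chunk))
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise                      # only handle CUDA OOM
            to_cpu()                       # move the model to CPU for this chunk
            try:
                audio.append(cpu_synth(chunk))
            finally:
                to_gpu()                   # finally: never strand the model on CPU
        # the real server calls torch.cuda.empty_cache() here, once per chunk
    return audio
```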
XTTS performs zero-shot voice cloning — no training required. Any speaker reference
audio dropped into the /voices folder is immediately available as a voice.
Drop speaker reference audio into `/voices` as either:

- `<voice-name>.wav` — used directly
- `<voice-name>.mp3` — converted to `.wav` automatically on first use

Call the API with `voice=<voice-name>`:
```bash
curl "http://localhost:5002/tts?text=Hello&voice=alice&lang=en" --output out.wav
```
No server restart required.
| Step | Detail |
|---|---|
| Format conversion | .mp3 → .wav (22050 Hz mono) via ffmpeg on first request |
| Reconversion | If the .mp3 is newer than the .wav (i.e. you replaced it), the .wav is regenerated |
| Embedding computation | Speaker embedding extracted from the .wav and saved to /cache/<voice-name>.pkl |
| Cache invalidation | The .pkl is keyed by SHA-256 hash of the .wav — replacing the audio automatically triggers recomputation |
| In-memory caching | After first use the embedding lives in memory; subsequent requests for the same voice pay no disk cost |
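The hash-keyed cache step can be sketched as follows. The paths and the `compute` callable are stand-ins; the real embedding extraction is XTTS-specific:

```python
import hashlib
import os
import pickle

def wav_fingerprint(path: str) -> str:
    """SHA-256 of the reference audio, the cache key described above."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 16), b""):
            h.update(block)
    return h.hexdigest()

def load_or_compute_embedding(voice, wav_path, compute, cache_dir="/cache"):
    """Reuse /cache/<voice>.pkl while the .wav content is unchanged."""
    pkl = os.path.join(cache_dir, voice + ".pkl")
    digest = wav_fingerprint(wav_path)
    if os.path.exists(pkl):
        with open(pkl, "rb") as f:
            cached = pickle.load(f)
        if cached["sha256"] == digest:     # audio unchanged -> cache hit
            return cached["embedding"]
    embedding = compute(wav_path)          # stand-in for XTTS extraction
    with open(pkl, "wb") as f:
        pickle.dump({"sha256": digest, "embedding": embedding}, f)
    return embedding
```

Replacing the audio file changes the digest, so the stale `.pkl` is overwritten on the next request with no manual invalidation.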
- `ffmpeg` resamples to 22050 Hz automatically
- The reference clip does not need to be in the same language as the text, but accent and vocal character transfer better when it is

To replace a voice, drop a new file with the same name into `/voices`. The cache key is a content hash, so the stale embedding is detected automatically on the next request and recomputed — no manual cache clearing needed.
Notes:

- `lang` should match the language of the `text` parameter — XTTS uses it for tokenisation, not just accent
- `/tts` and `/api/tts` are fully interchangeable (same handler)
- `/models`, `/voices`, and `/cache` should be Docker volume mounts so data persists across image rebuilds
- The free-VRAM threshold for the proactive CPU fallback is set by `VRAM_HEADROOM` at the top of `tts_server.py`
- Text chunk size is controlled by `MAX_CHUNK_LEN`
- For commercial use of the XTTS model, contact licensing@coqui.ai