
Coqui TTS Docker Server

A Dockerized Coqui XTTS server with multilingual support, voice cloning, smart text preprocessing, and robust GPU memory management.


Features

  • Multilingual TTS via Coqui XTTS v2
  • Multi-speaker support and voice cloning
  • Automatic .mp3 → .wav conversion with staleness detection
  • Persistent speaker embeddings cache for fast repeated synthesis
  • Markdown-aware text preprocessing — headings, bold, lists and paragraphs are converted to natural prosody cues (pauses) before synthesis
  • Language-aware acronym expansion — KI, EU, US, BRD etc. are expanded to phonetic letter-spellings appropriate for German or English (Ka-I, E-U, U-Es …)
  • Automatic chunking of large inputs with sentence- and word-boundary awareness
  • Proactive VRAM headroom check — entire request pins to CPU if GPU is already low before synthesis starts
  • Automatic per-chunk CPU fallback on CUDA OOM, with model restored to GPU afterward
  • Per-chunk VRAM flush — empty_cache() after every chunk and after every completed request keeps VRAM usage flat across long documents
  • Full CPU multi-core utilisation when running on CPU (torch.set_num_threads)
  • /tts, /api/tts — full-file synthesis
  • /tts_stream, /api/tts_stream — progressive streaming synthesis
  • /health — liveness + VRAM status
  • Optimised for small VRAM GPUs (GTX 1650 / ~4 GB) and larger

Repository Structure

coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py   # Production server
├── models/         # XTTS model weights (host-mount recommended)
├── voices/         # Speaker reference files (.wav or .mp3)
└── cache/          # Persistent speaker embeddings (.pkl)

Setup

1. Build the Docker image

./build.sh

2. Run the server

./run.sh

The scripts handle GPU detection, volume mounts for /models, /voices, /cache, and acceptance of the Coqui TTS license terms.


API

GET /health

Returns server liveness and current VRAM status.

Example response (GPU):

{
  "status": "ok",
  "device": "cuda",
  "vram_free_mb": 2814,
  "vram_total_mb": 4096,
  "vram_used_pct": 31.3
}

Example response (CPU):

{ "status": "ok", "device": "cpu" }

GET /tts · GET /api/tts

Synthesise speech and return the complete audio file.

Returns: audio/wav

Parameter  Default   Description
text       required  Text to synthesise (plain or markdown)
voice      default   Voice name (stem of a file in /voices)
lang       en        BCP-47 language code (en, de, fr …)

Example:

curl "http://localhost:5002/tts?text=Hello+world&voice=alice&lang=en" --output out.wav

GET /tts_stream · GET /api/tts_stream

Streams synthesised audio progressively as each chunk completes.
Useful for conversational agents, low-latency playback, or very long documents.

Parameters are identical to /tts.

Example:

curl "http://localhost:5002/tts_stream?text=Hello+world&voice=alice&lang=en" --output out.wav

Streaming works best with audio players that support progressive WAV input (e.g. VLC, ffplay, most browser <audio> elements).


GET /voices

Returns available voices (deduplicated, sorted).

Example response:

{ "voices": ["alice", "default", "narrator"] }
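
Listing voices amounts to collecting the unique stems of the files in /voices. A minimal sketch (the helper name is illustrative, not taken from tts_server.py):

```python
from pathlib import Path

def list_voices(voices_dir: str = "/voices") -> list[str]:
    """Unique voice names from .wav/.mp3 files, deduplicated and sorted."""
    stems = {
        p.stem
        for p in Path(voices_dir).iterdir()
        if p.suffix.lower() in {".wav", ".mp3"}
    }
    return sorted(stems)
```

Using a set over stems is what deduplicates a voice that exists as both alice.mp3 and its converted alice.wav.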

Text Preprocessing

All text passes through a two-stage pipeline before synthesis:

1. Markdown → prosody

Markdown structure is translated into punctuation cues that XTTS responds to prosodically — no spoken labels, just natural pauses:

Markdown element       Spoken rendering
# H1                   Long pause (...) before and after
## H2 / ### H3         Medium pause (.) before and after
#### H4–H6             Short pause after
**bold** / *italic*    Comma-breath either side: , text,
- bullet / 1. item     Comma-breath before, period after
Blank line             Full stop (paragraph break)
--- horizontal rule    Long section-break pause (...)
`code` / block         Plain text, fences stripped
[link](url)            Label text only
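
A condensed sketch of how such a mapping can be implemented with plain regular expressions. The rule subset and function name here are illustrative; the server's actual pipeline covers more cases (e.g. H4–H6, fenced code blocks):

```python
import re

def markdown_to_prosody(text: str) -> str:
    """Translate markdown structure into punctuation cues (pauses)."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:                           # blank line -> paragraph break
            out.append(".")
        elif re.fullmatch(r"-{3,}", stripped):     # horizontal rule -> long pause
            out.append("...")
        elif stripped.startswith("# "):            # H1 -> long pause around text
            out.append(f"... {stripped[2:]} ...")
        elif re.match(r"#{2,3} ", stripped):       # H2/H3 -> medium pause
            title = stripped.lstrip("#").strip()
            out.append(f". {title} .")
        elif re.match(r"[-*] |\d+\. ", stripped):  # list item -> breath + stop
            item = re.sub(r"^([-*]|\d+\.) ", "", stripped)
            out.append(f", {item}.")
        else:
            out.append(stripped)
    joined = " ".join(out)
    joined = re.sub(r"\*\*?([^*]+)\*\*?", r", \1,", joined)   # bold/italic
    joined = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", joined)  # [link](url)
    return re.sub(r"`([^`]*)`", r"\1", joined)                # strip inline code
```

The key design point is that every rule emits only punctuation XTTS already understands, never spoken labels like "heading".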

2. Acronym expansion

Acronyms and symbols are expanded to phonetic spellings before tokenisation, preventing the CUDA device-side assert errors caused by out-of-range token IDs.

Expansion is language-aware — pass lang=de for German rules, any other value for English rules.

German examples (lang=de):

Input   Expanded
KI      Ka-I
EU      E-U
US      U-Es
ARD     A-Er-De
z.B.    zum Beispiel
€       Euro

English examples (lang=en):

Input   Expanded
KI      Kay Eye
EU      E-U
€       euros

To add a new term, edit ACRONYMS_DE or ACRONYMS_EN at the top of tts_server.py — no other changes needed.
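
The expansion itself can be sketched as a word-boundary lookup. ACRONYMS_DE and ACRONYMS_EN are the table names from tts_server.py; the entries and helper shown here are a minimal illustrative subset:

```python
import re

# Illustrative subsets; the real tables in tts_server.py are larger.
ACRONYMS_DE = {"KI": "Ka-I", "EU": "E-U", "US": "U-Es",
               "ARD": "A-Er-De", "z.B.": "zum Beispiel"}
ACRONYMS_EN = {"KI": "Kay Eye", "EU": "E-U"}

def expand_acronyms(text: str, lang: str = "en") -> str:
    """Replace known acronyms with phonetic spellings, longest term first."""
    table = ACRONYMS_DE if lang == "de" else ACRONYMS_EN
    for term in sorted(table, key=len, reverse=True):
        # re.escape handles dotted terms like "z.B."; the lookarounds stop
        # matches inside larger words (e.g. "US" inside "BUSH").
        text = re.sub(rf"(?<!\w){re.escape(term)}(?!\w)", table[term], text)
    return text
```

Replacing longer terms first avoids partial matches when one entry is a prefix of another.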


GPU Memory Management

The server is designed to run stably on small GPUs (~4 GB VRAM) across long documents and back-to-back requests.

Proactive headroom check (before synthesis starts)
If free VRAM is below 20% when a request arrives, the entire request — model, embeddings, and inference — is pinned to CPU from the start. This avoids the more expensive mid-document fallback.
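
The decision itself reduces to a ratio check. A sketch, where the helper name is illustrative and the free/total byte counts would come from torch.cuda.mem_get_info() in the real server:

```python
VRAM_HEADROOM = 0.20  # default threshold described above

def should_pin_to_cpu(free_bytes: int, total_bytes: int,
                      headroom: float = VRAM_HEADROOM) -> bool:
    """Pin the whole request to CPU when free VRAM is below the threshold."""
    return (free_bytes / total_bytes) < headroom
```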

Per-chunk cache flush (during synthesis)
torch.cuda.empty_cache() is called after every successfully synthesised chunk. XTTS leaves GPT decoder KV-cache and attention buffers behind; without explicit flushing these accumulate across a 20-chunk document and starve subsequent requests.

End-of-request flush (after synthesis)
A final empty_cache() after the full request completes catches anything the per-chunk flushes missed, including output tensors that lived in the accumulation buffer.

OOM recovery (safety net)
If a chunk triggers a CUDA OOM despite the above, the model moves to CPU for that chunk and returns to GPU immediately after, inside a finally block so a CPU-side failure cannot strand the model on CPU permanently.
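
The fallback pattern can be sketched as follows. A stand-in exception class keeps the sketch self-contained; in tts_server.py the caught type would be torch's CUDA out-of-memory error, and the to_cpu/to_gpu hooks would be model.to(...) calls:

```python
class CudaOOM(RuntimeError):
    """Stand-in for torch's CUDA out-of-memory error in this sketch."""

def synthesise_chunk(chunk, synth_gpu, synth_cpu, to_cpu, to_gpu):
    """Try the chunk on GPU; on OOM, retry on CPU and always restore GPU."""
    try:
        return synth_gpu(chunk)
    except CudaOOM:
        to_cpu()                  # move the model off the GPU
        try:
            return synth_cpu(chunk)
        finally:
            to_gpu()              # restore even if the CPU pass fails
```

The finally block is the important part: it is what prevents a CPU-side failure from stranding the model on CPU permanently.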

CPU multi-core utilisation
When running on CPU (fallback or CPU-only host), torch.set_num_threads(os.cpu_count()) is set at startup so all available cores are used.


Adding a New Voice (Voice Cloning)

XTTS performs zero-shot voice cloning — no training required. Any speaker reference audio dropped into the /voices folder is immediately available as a voice.

Quick start

  1. Take any recording of the speaker — an interview, a voice memo, a podcast clip.
  2. Aim for 10–30 seconds of clean speech. Longer is fine; shorter degrades quality.
  3. Drop the file into /voices as either:
    • <voice-name>.wav — used directly
    • <voice-name>.mp3 — converted to .wav automatically on first use
  4. Call the API with voice=<voice-name>:

    curl "http://localhost:5002/tts?text=Hello&voice=alice&lang=en" --output out.wav
    

No server restart required.

What happens automatically

Step                   Detail
Format conversion      .mp3 → .wav (22050 Hz mono) via ffmpeg on first request
Reconversion           If the .mp3 is newer than the .wav (i.e. you replaced it), the .wav is regenerated
Embedding computation  Speaker embedding extracted from the .wav and saved to /cache/<voice-name>.pkl
Cache invalidation     The .pkl is keyed by the SHA-256 hash of the .wav — replacing the audio automatically triggers recomputation
In-memory caching      After first use the embedding lives in memory; subsequent requests for the same voice pay no disk cost
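
The content-hash cache key from the table above can be sketched in a few lines (the helper name is illustrative):

```python
import hashlib
from pathlib import Path

def embedding_cache_key(wav_path: str) -> str:
    """SHA-256 of the .wav bytes; changes whenever the audio is replaced."""
    return hashlib.sha256(Path(wav_path).read_bytes()).hexdigest()
```

Because the key is derived from content rather than the filename or mtime, replacing a voice file invalidates the cached embedding with no bookkeeping.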

Tips for best quality

  • Mono, minimal background noise — music, reverb, and crosstalk confuse the encoder
  • Consistent tone — avoid clips that mix whispering and shouting
  • Trim silence — remove long silent gaps at the start and end before dropping in
  • Sample rate — any rate works; ffmpeg resamples to 22050 Hz automatically
  • Language match — the reference audio does not need to be in the same language as the text, but accent and vocal character transfer better when it is

Replacing a voice

Drop a new file with the same name into /voices. The cache key is a content hash, so the stale embedding is detected automatically on the next request and recomputed — no manual cache clearing needed.


Notes

  • lang should match the language of the text parameter — XTTS uses it for tokenisation, not just accent
  • /tts and /api/tts are fully interchangeable (same handler)
  • /models, /voices, and /cache should be Docker volume mounts so data persists across image rebuilds
  • The VRAM threshold (default 20%) can be tuned via VRAM_HEADROOM at the top of tts_server.py
  • The chunk size (default 200 chars) can be tuned via MAX_CHUNK_LEN
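
The sentence- and word-boundary-aware chunking mentioned above can be sketched like this. MAX_CHUNK_LEN is the tunable named in tts_server.py; the function itself is an illustrative simplification:

```python
import re

MAX_CHUNK_LEN = 200  # default maximum chunk size in characters

def chunk_text(text: str, max_len: int = MAX_CHUNK_LEN) -> list[str]:
    """Split text into chunks at sentence boundaries, falling back to words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Flush the running chunk if this sentence would overflow it.
        if current and len(current) + 1 + len(sentence) > max_len:
            chunks.append(current)
            current = ""
        # A single over-long sentence is split at word boundaries.
        while len(sentence) > max_len:
            cut = sentence.rfind(" ", 0, max_len + 1)
            if cut <= 0:
                cut = max_len  # no space found: hard split
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Preferring sentence boundaries keeps prosody natural at chunk joins; the word-boundary fallback only triggers for sentences longer than the limit.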

License

  • Non-commercial use: Coqui CPML
  • Commercial licensing: licensing@coqui.ai