
Coqui TTS Docker Server

A Dockerized Coqui XTTS server with multilingual support, voice cloning, smart text preprocessing, and robust GPU memory management.


Features

  • Multilingual TTS via Coqui XTTS v2
  • Multi-speaker support and voice cloning
  • Automatic .mp3 → .wav conversion with staleness detection
  • Persistent speaker embeddings cache for fast repeated synthesis
  • Markdown-aware text preprocessing — headings, bold, lists and paragraphs are converted to natural prosody cues (pauses) before synthesis
  • Language-aware acronym expansion — KI, EU, US, BRD etc. are expanded to phonetic letter-spellings appropriate for German or English (Ka-I, E-U, U-Es …)
  • Automatic chunking of large inputs with sentence- and word-boundary awareness
  • Proactive VRAM headroom check — entire request pins to CPU if GPU is already low before synthesis starts
  • Automatic per-chunk CPU fallback on CUDA OOM, with model restored to GPU afterward
  • Per-chunk VRAM flush — empty_cache() after every chunk and after every completed request keeps VRAM usage flat across long documents
  • Full CPU multi-core utilisation when running on CPU (torch.set_num_threads)
  • /tts, /api/tts — full-file synthesis
  • /tts_stream, /api/tts_stream — progressive streaming synthesis
  • /health — liveness + VRAM status
  • Optimised for small VRAM GPUs (GTX 1650 / ~4 GB) and larger

Repository Structure

coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py   # Production server
├── models/         # XTTS model weights (host-mount recommended)
├── voices/         # Speaker reference files (.wav or .mp3)
└── cache/          # Persistent speaker embeddings (.pkl)

Setup

1. Build the Docker image

./build.sh

2. Run the server

./run.sh

The scripts handle GPU detection, volume mounts for /models, /voices, /cache, and acceptance of the Coqui TTS license terms.


API

GET /health

Returns server liveness and current VRAM status.

Example response (GPU):

{
  "status": "ok",
  "device": "cuda",
  "vram_free_mb": 2814,
  "vram_total_mb": 4096,
  "vram_used_pct": 31.3
}

Example response (CPU):

{ "status": "ok", "device": "cpu" }

GET /tts · GET /api/tts

Synthesise speech and return the complete audio file.

Returns: audio/wav

Parameter  Default   Description
text       required  Text to synthesise (plain or markdown)
voice      default   Voice name (stem of a file in /voices)
lang       en        BCP-47 language code (en, de, fr …)

Example:

curl "http://localhost:5002/tts?text=Hello+world&voice=alice&lang=en" --output out.wav

GET /tts_stream · GET /api/tts_stream

Streams synthesised audio progressively as each chunk completes.
Useful for conversational agents, low-latency playback, or very long documents.

Parameters are identical to /tts.

Example:

curl "http://localhost:5002/tts_stream?text=Hello+world&voice=alice&lang=en" --output out.wav

Streaming works best with audio players that support progressive WAV input (e.g. VLC, ffplay, most browser <audio> elements).


GET /voices

Returns available voices (deduplicated, sorted).

Example response:

{ "voices": ["alice", "default", "narrator"] }
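
Listing voices amounts to collecting the unique stems of the files in /voices. A minimal sketch (the helper name is illustrative, not taken from tts_server.py):

```python
from pathlib import Path

def list_voices(voices_dir: str = "/voices") -> list[str]:
    """Unique voice names from .wav/.mp3 files, deduplicated and sorted."""
    stems = {
        p.stem
        for p in Path(voices_dir).iterdir()
        if p.suffix.lower() in {".wav", ".mp3"}
    }
    return sorted(stems)
```

Using a set over stems is what deduplicates a voice that exists as both alice.mp3 and its converted alice.wav.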

Text Preprocessing

All text passes through a two-stage pipeline before synthesis:

1. Markdown → prosody

Markdown structure is translated into punctuation cues that XTTS responds to prosodically — no spoken labels, just natural pauses:

Markdown element       Spoken rendering
# H1                   Long pause (...) before and after
## H2 / ### H3         Medium pause (.) before and after
#### H4–H6             Short pause after
**bold** / *italic*    Comma-breath either side: , text,
- bullet / 1. item     Comma-breath before, period after
Blank line             Full stop (paragraph break)
--- horizontal rule    Long section-break pause (...)
`code` / block         Plain text, fences stripped
[link](url)            Label text only
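
A condensed sketch of how such a mapping can be implemented with plain regular expressions. The rule subset and function name here are illustrative; the server's actual pipeline covers more cases (e.g. H4–H6, fenced code blocks):

```python
import re

def markdown_to_prosody(text: str) -> str:
    """Translate markdown structure into punctuation cues (pauses)."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:                           # blank line -> paragraph break
            out.append(".")
        elif re.fullmatch(r"-{3,}", stripped):     # horizontal rule -> long pause
            out.append("...")
        elif stripped.startswith("# "):            # H1 -> long pause around text
            out.append(f"... {stripped[2:]} ...")
        elif re.match(r"#{2,3} ", stripped):       # H2/H3 -> medium pause
            title = stripped.lstrip("#").strip()
            out.append(f". {title} .")
        elif re.match(r"[-*] |\d+\. ", stripped):  # list item -> breath + stop
            item = re.sub(r"^([-*]|\d+\.) ", "", stripped)
            out.append(f", {item}.")
        else:
            out.append(stripped)
    joined = " ".join(out)
    joined = re.sub(r"\*\*?([^*]+)\*\*?", r", \1,", joined)   # bold/italic
    joined = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", joined)  # [link](url)
    return re.sub(r"`([^`]*)`", r"\1", joined)                # strip inline code
```

The key design point is that every rule emits only punctuation XTTS already understands, never spoken labels like "heading".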

2. Acronym expansion

Acronyms and symbols are expanded to phonetic spellings before tokenisation, preventing the CUDA device-side assert errors caused by out-of-range token IDs.

Expansion is language-aware — pass lang=de for German rules, any other value for English rules.

German examples (lang=de):

Input   Expanded
KI      Ka-I
EU      E-U
US      U-Es
ARD     A-Er-De
z.B.    zum Beispiel
€       Euro

English examples (lang=en):

Input   Expanded
KI      Kay Eye
EU      E-U
€       euros

To add a new term, edit ACRONYMS_DE or ACRONYMS_EN at the top of tts_server.py — no other changes needed.
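
The expansion itself can be sketched as a word-boundary lookup. ACRONYMS_DE and ACRONYMS_EN are the table names from tts_server.py; the entries and helper shown here are a minimal illustrative subset:

```python
import re

# Illustrative subsets; the real tables in tts_server.py are larger.
ACRONYMS_DE = {"KI": "Ka-I", "EU": "E-U", "US": "U-Es",
               "ARD": "A-Er-De", "z.B.": "zum Beispiel"}
ACRONYMS_EN = {"KI": "Kay Eye", "EU": "E-U"}

def expand_acronyms(text: str, lang: str = "en") -> str:
    """Replace known acronyms with phonetic spellings, longest term first."""
    table = ACRONYMS_DE if lang == "de" else ACRONYMS_EN
    for term in sorted(table, key=len, reverse=True):
        # re.escape handles dotted terms like "z.B."; the lookarounds stop
        # matches inside larger words (e.g. "US" inside "BUSH").
        text = re.sub(rf"(?<!\w){re.escape(term)}(?!\w)", table[term], text)
    return text
```

Replacing longer terms first avoids partial matches when one entry is a prefix of another.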


GPU Memory Management

The server is designed to run stably on small GPUs (~4 GB VRAM) across long documents and back-to-back requests.

Proactive headroom check (before synthesis starts)
If free VRAM is below 20% when a request arrives, the entire request — model, embeddings, and inference — is pinned to CPU from the start. This avoids the more expensive mid-document fallback.
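
The decision itself reduces to a ratio check. A sketch, where the helper name is illustrative and the free/total byte counts would come from torch.cuda.mem_get_info() in the real server:

```python
VRAM_HEADROOM = 0.20  # default threshold described above

def should_pin_to_cpu(free_bytes: int, total_bytes: int,
                      headroom: float = VRAM_HEADROOM) -> bool:
    """Pin the whole request to CPU when free VRAM is below the threshold."""
    return (free_bytes / total_bytes) < headroom
```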

Per-chunk cache flush (during synthesis)
torch.cuda.empty_cache() is called after every successfully synthesised chunk. XTTS leaves GPT decoder KV-cache and attention buffers behind; without explicit flushing these accumulate across a 20-chunk document and starve subsequent requests.

End-of-request flush (after synthesis)
A final empty_cache() after the full request completes catches anything the per-chunk flushes missed, including output tensors that lived in the accumulation buffer.

OOM recovery (safety net)
If a chunk triggers a CUDA OOM despite the above, the model moves to CPU for that chunk and returns to GPU immediately after, inside a finally block so a CPU-side failure cannot strand the model on CPU permanently.
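
The fallback pattern can be sketched as follows. A stand-in exception class keeps the sketch self-contained; in tts_server.py the caught type would be torch's CUDA out-of-memory error, and the to_cpu/to_gpu hooks would be model.to(...) calls:

```python
class CudaOOM(RuntimeError):
    """Stand-in for torch's CUDA out-of-memory error in this sketch."""

def synthesise_chunk(chunk, synth_gpu, synth_cpu, to_cpu, to_gpu):
    """Try the chunk on GPU; on OOM, retry on CPU and always restore GPU."""
    try:
        return synth_gpu(chunk)
    except CudaOOM:
        to_cpu()                  # move the model off the GPU
        try:
            return synth_cpu(chunk)
        finally:
            to_gpu()              # restore even if the CPU pass fails
```

The finally block is the important part: it is what prevents a CPU-side failure from stranding the model on CPU permanently.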

CPU multi-core utilisation
When running on CPU (fallback or CPU-only host), torch.set_num_threads(os.cpu_count()) is set at startup so all available cores are used.


Adding a New Voice (Voice Cloning)

XTTS performs zero-shot voice cloning — no training required. Any speaker reference audio dropped into the /voices folder is immediately available as a voice.

Quick start

  1. Take any recording of the speaker — an interview, a voice memo, a podcast clip.
  2. Aim for 10–30 seconds of clean speech. Longer is fine; shorter degrades quality.
  3. Drop the file into /voices as either:
    • <voice-name>.wav — used directly
    • <voice-name>.mp3 — converted to .wav automatically on first use
  4. Call the API with voice=<voice-name>:

    curl "http://localhost:5002/tts?text=Hello&voice=alice&lang=en" --output out.wav
    

No server restart required.

What happens automatically

Step                   Detail
Format conversion      .mp3 → .wav (22050 Hz mono) via ffmpeg on first request
Reconversion           If the .mp3 is newer than the .wav (i.e. you replaced it), the .wav is regenerated
Embedding computation  Speaker embedding extracted from the .wav and saved to /cache/<voice-name>.pkl
Cache invalidation     The .pkl is keyed by the SHA-256 hash of the .wav — replacing the audio automatically triggers recomputation
In-memory caching      After first use the embedding lives in memory; subsequent requests for the same voice pay no disk cost
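
The content-hash cache key from the table above can be sketched in a few lines (the helper name is illustrative):

```python
import hashlib
from pathlib import Path

def embedding_cache_key(wav_path: str) -> str:
    """SHA-256 of the .wav bytes; changes whenever the audio is replaced."""
    return hashlib.sha256(Path(wav_path).read_bytes()).hexdigest()
```

Because the key is derived from content rather than the filename or mtime, replacing a voice file invalidates the cached embedding with no bookkeeping.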

Tips for best quality

  • Mono, minimal background noise — music, reverb, and crosstalk confuse the encoder
  • Consistent tone — avoid clips that mix whispering and shouting
  • Trim silence — remove long silent gaps at the start and end before dropping in
  • Sample rate — any rate works; ffmpeg resamples to 22050 Hz automatically
  • Language match — the reference audio does not need to be in the same language as the text, but accent and vocal character transfer better when it is

Replacing a voice

Drop a new file with the same name into /voices. The cache key is a content hash, so the stale embedding is detected automatically on the next request and recomputed — no manual cache clearing needed.


Notes

  • lang should match the language of the text parameter — XTTS uses it for tokenisation, not just accent
  • /tts and /api/tts are fully interchangeable (same handler)
  • /models, /voices, and /cache should be Docker volume mounts so data persists across image rebuilds
  • The VRAM threshold (default 20%) can be tuned via VRAM_HEADROOM at the top of tts_server.py
  • The chunk size (default 200 chars) can be tuned via MAX_CHUNK_LEN
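
The sentence- and word-boundary-aware chunking mentioned above can be sketched like this. MAX_CHUNK_LEN is the tunable named in tts_server.py; the function itself is an illustrative simplification:

```python
import re

MAX_CHUNK_LEN = 200  # default maximum chunk size in characters

def chunk_text(text: str, max_len: int = MAX_CHUNK_LEN) -> list[str]:
    """Split text into chunks at sentence boundaries, falling back to words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Flush the running chunk if this sentence would overflow it.
        if current and len(current) + 1 + len(sentence) > max_len:
            chunks.append(current)
            current = ""
        # A single over-long sentence is split at word boundaries.
        while len(sentence) > max_len:
            cut = sentence.rfind(" ", 0, max_len + 1)
            if cut <= 0:
                cut = max_len  # no space found: hard split
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Preferring sentence boundaries keeps prosody natural at chunk joins; the word-boundary fallback only triggers for sentences longer than the limit.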

License

  • Non-commercial use: Coqui CPML
  • Commercial licensing: licensing@coqui.ai