# Coqui TTS Docker Server

A Dockerized Coqui XTTS server with multilingual support, voice cloning, smart text preprocessing, and robust GPU memory management.

---

## Features

* Multilingual TTS via [Coqui XTTS v2](https://github.com/coqui-ai/TTS)
* Multi-speaker support and voice cloning
* Automatic `.mp3` → `.wav` conversion with staleness detection
* Persistent speaker embeddings cache for fast repeated synthesis
* **Markdown-aware text preprocessing** — headings, bold, lists, and paragraphs are converted to natural prosody cues (pauses) before synthesis
* **Language-aware acronym expansion** — `KI`, `EU`, `US`, `BRD`, etc. are expanded to phonetic letter-spellings appropriate for German or English (`Ka-I`, `E-U`, `U-Es` …)
* **Automatic chunking** of large inputs with sentence- and word-boundary awareness
* **Proactive VRAM headroom check** — the entire request is pinned to CPU if the GPU is already low on memory before synthesis starts
* **Automatic per-chunk CPU fallback** on CUDA OOM, with the model restored to GPU afterward
* **Per-chunk VRAM flush** — `empty_cache()` after every chunk and after every completed request keeps VRAM usage flat across long documents
* **Full CPU multi-core utilisation** when running on CPU (`torch.set_num_threads`)
* `/tts`, `/api/tts` — full-file synthesis
* `/tts_stream`, `/api/tts_stream` — progressive streaming synthesis
* `/health` — liveness + VRAM status
* Optimised for small-VRAM GPUs (GTX 1650 / ~4 GB) and larger

---

## Repository Structure

```text
coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py   # Production server
├── models/         # XTTS model weights (host-mount recommended)
├── voices/         # Speaker reference files (.wav or .mp3)
└── cache/          # Persistent speaker embeddings (.pkl)
```

---

## Setup

### 1. Build the Docker image

```bash
./build.sh
```

### 2. Run the server

```bash
./run.sh
```

The scripts handle GPU detection, volume mounts for `/models`, `/voices`, and `/cache`, and acceptance of the Coqui TTS license terms.
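The sentence- and word-boundary-aware chunking described above can be sketched as follows. This is a minimal illustration, not the server's actual implementation; the function name `chunk_text` and the character limit are hypothetical, and the real server may split on different limits or rules:

```python
import re

def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Split text into chunks, preferring sentence boundaries and
    falling back to word boundaries for oversized sentences.
    (Illustrative sketch; the server's real limits may differ.)"""
    # Naive sentence split on ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Sentence too long on its own: split at word boundaries
            for word in sentence.split():
                if len(current) + len(word) + 1 > max_chars:
                    if current:
                        chunks.append(current)
                    current = word
                else:
                    current = f"{current} {word}".strip()
        elif len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be synthesised (and its VRAM flushed) independently, which is what keeps memory usage flat on long documents.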
---

## API

### `GET /health`

Returns server liveness and current VRAM status.

**Example response (GPU):**

```json
{
  "status": "ok",
  "device": "cuda",
  "vram_free_mb": 2814,
  "vram_total_mb": 4096,
  "vram_used_pct": 31.3
}
```

**Example response (CPU):**

```json
{
  "status": "ok",
  "device": "cpu"
}
```

---

### `GET /tts` · `GET /api/tts`

Synthesise speech and return the complete audio file.

**Returns:** `audio/wav`

| Parameter | Default   | Description                                 |
|-----------|-----------|---------------------------------------------|
| `text`    | required  | Text to synthesise (plain or markdown)      |
| `voice`   | `default` | Voice name (stem of a file in `/voices`)    |
| `lang`    | `en`      | BCP-47 language code (`en`, `de`, `fr`, …)  |

**Example:**

```bash
curl "http://localhost:5002/tts?text=Hello+world&voice=alice&lang=en" --output out.wav
```

---

### `GET /tts_stream` · `GET /api/tts_stream`

Streams synthesised audio progressively as each chunk completes. Useful for conversational agents, low-latency playback, or very long documents. Parameters are identical to `/tts`.

**Example:**

```bash
curl "http://localhost:5002/tts_stream?text=Hello+world&voice=alice&lang=en" --output out.wav
```

> Streaming works best with audio players that support progressive WAV input (e.g. VLC, ffplay, most browser `<audio>` implementations).
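Because both endpoints are plain GET requests, any HTTP client can drive them. A minimal Python sketch that builds the same query string as the curl examples above (the helper `build_tts_url` is ours for illustration, not part of the server):

```python
from urllib.parse import urlencode

def build_tts_url(base: str, text: str, voice: str = "default",
                  lang: str = "en", stream: bool = False) -> str:
    """Build a request URL for the /tts or /tts_stream endpoint.
    (Hypothetical client helper; parameter names match the API table.)"""
    endpoint = "/tts_stream" if stream else "/tts"
    query = urlencode({"text": text, "voice": voice, "lang": lang})
    return f"{base.rstrip('/')}{endpoint}?{query}"

# Fetching could then look like this (requires a running server):
#   import urllib.request
#   url = build_tts_url("http://localhost:5002", "Hello world", voice="alice")
#   with urllib.request.urlopen(url) as resp, open("out.wav", "wb") as f:
#       for block in iter(lambda: resp.read(8192), b""):
#           f.write(block)  # writes as data arrives; useful with stream=True
```

Reading the response in blocks, as in the commented fetch, is what lets a client start playback before `/tts_stream` has finished synthesising later chunks.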