# Coqui TTS Docker Server

A Dockerized Coqui TTS server with multilingual XTTS, multi-speaker support, voice caching, and automatic audio conversion.

---

# Features

* Multilingual TTS via [Coqui XTTS](https://github.com/coqui-ai/TTS)
* Multi-speaker support, ready for voice cloning
* Automatic `.mp3` → `.wav` conversion
* Persistent embeddings cache for fast synthesis
* Adjustable `speed`, `pitch`, and `language`
* **Automatic chunking of large input texts** to prevent memory overflow
* **Automatic CPU fallback if GPU VRAM is exhausted**
* **Streaming endpoint for progressive audio output**
* Optimized for **small-VRAM GPUs (e.g. GTX 1650, ~4 GB VRAM)**
* `/tts` and `/api/tts` endpoints

---

# Repository Structure

```text
coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py            # Production server with caching and fallback
├── tts_server_simple.py     # Simplified version
├── tts_server_noncaching.py # Legacy version without embedding caching
├── models/                  # XTTS models (host mount recommended)
├── voices/                  # User voices (.wav or .mp3)
└── cache/                   # Persistent embeddings
```

---

# Setup

## 1. Build the Docker image

```bash
./build.sh
```

## 2. Run the server

```bash
./run.sh
```

The scripts handle:

* GPU detection
* volume mounts for `/models`, `/voices`, and `/cache`
* accepting the Coqui TTS license terms

---

# API

## `/tts` or `/api/tts`

Synthesizes speech and returns the full audio file.

**Method:** `GET`

### Parameters

| Parameter | Default   | Description             |
| --------- | --------- | ----------------------- |
| `text`    | required  | Text to synthesize      |
| `voice`   | `default` | Voice name in `/voices` |
| `lang`    | `en`      | Language code           |

**Returns:** `audio/wav`

### Example

```bash
curl "http://localhost:5002/tts?text=Hello%20world&voice=trump" --output hello.wav
```

---

# Streaming Endpoint

## `/tts_stream` or `/api/tts_stream`

Streams generated audio while synthesis is still in progress.
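As a minimal, stdlib-only client sketch, the stream can be consumed progressively from Python. The host, port, and parameter names follow the examples in this README; the chunked read loop is an assumption about how the response is delivered:

```python
import urllib.parse
import urllib.request


def stream_tts(text, voice="default", lang="en",
               host="http://localhost:5002", chunk_size=4096):
    """Yield raw audio bytes from /tts_stream as they arrive."""
    params = urllib.parse.urlencode(
        {"text": text, "voice": voice, "lang": lang})
    with urllib.request.urlopen(f"{host}/tts_stream?{params}") as resp:
        # Read in small chunks so playback can begin before
        # synthesis of the full text has finished.
        while chunk := resp.read(chunk_size):
            yield chunk


# Usage (assumes a running server):
# with open("hello.wav", "wb") as f:
#     for chunk in stream_tts("Hello world", voice="trump"):
#         f.write(chunk)
```

Because `stream_tts` is a generator, no network request is made until the first chunk is pulled, which keeps the client responsive for conversational use.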
This is useful for:

* conversational agents
* low-latency playback
* long text generation

**Method:** `GET`

### Parameters

Same as `/tts`.

### Example

```bash
curl "http://localhost:5002/tts_stream?text=Hello%20world&voice=trump" --output hello.wav
```

Streaming works best with audio players that can handle progressive WAV streams.

---

# Voices

## `/voices`

Returns the list of available voices.

**Method:** `GET`

### Example response

```json
{ "voices": ["trump", "narrator", "alice"] }
```

---

# Voice Handling

* `.wav` is the canonical internal format
* `.mp3` files are converted automatically when needed
* If an `.mp3` is newer than its `.wav`, reconversion is triggered
* Voice embeddings are cached in `/cache` for faster synthesis
* Cached embeddings persist across container restarts

---

# Large Text Handling

Long inputs are automatically **split into smaller chunks** before synthesis. This provides several advantages:

* prevents **CUDA out-of-memory errors**
* improves reliability on **low-VRAM GPUs**
* allows long paragraphs or documents to be synthesized safely

Chunked outputs are automatically concatenated into a single audio stream.

---

# GPU Memory Handling

The server is designed to work even on **small GPUs (~4 GB VRAM)** such as:

* GTX 1650
* GTX 1050 Ti
* low-end cloud GPUs

If the GPU runs out of memory:

1. The system automatically catches the CUDA OOM error
2. The synthesis request **falls back to CPU mode**
3. Audio generation continues without crashing the server

This allows stable operation even with long text inputs.

---

# Notes

* A GPU is recommended for real-time XTTS synthesis
* CPU fallback ensures stability even on limited hardware
* `/models`, `/voices`, and `/cache` should be mounted as Docker volumes
* The `/tts` endpoint is backward-compatible with `/api/tts`
* Set `DEFAULT_VOICE = "default"` in `tts_server.py` to handle requests with a missing `voice` parameter

---

# License

* Non-commercial use: [Coqui CPML](https://coqui.ai/cpml)
* Commercial licensing: `licensing@coqui.ai`

---
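# Appendix: Text Chunking Sketch

The automatic chunking described under **Large Text Handling** can be approximated as follows. This is a minimal sketch, not the actual logic in `tts_server.py`; the sentence-boundary split and the `max_chars` budget are assumptions:

```python
import re


def chunk_text(text, max_chars=250):
    """Split text into sentence-aligned chunks of at most max_chars,
    so each synthesis call stays within a small VRAM budget."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is synthesized independently and the resulting audio segments are concatenated, matching the behavior described above. A single sentence longer than `max_chars` is emitted as its own oversized chunk rather than being split mid-sentence.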