# Coqui TTS Docker Server

A Dockerized Coqui TTS server with multilingual XTTS, multi-speaker support, voice caching, and automatic audio conversion.

---

# Features

* Multilingual TTS via [Coqui XTTS](https://github.com/coqui-ai/TTS)
* Multi-speaker support, ready for voice cloning
* Automatic `.mp3` → `.wav` conversion
* Persistent embeddings cache for fast synthesis
* Adjustable `speed`, `pitch`, and `language`
* **Automatic chunking of large input texts** to prevent memory overflow
* **Automatic CPU fallback if GPU VRAM is exhausted**
* **Streaming endpoint for progressive audio output**
* Optimized for **small-VRAM GPUs (e.g. GTX 1650, ~4 GB VRAM)**
* `/tts` and `/api/tts` endpoints

---

# Repository Structure

```text
coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py            # Production server with caching and fallback
├── tts_server_simple.py     # Simplified version
├── tts_server_noncaching.py # Legacy version without embedding caching
├── models/                  # XTTS models (host mount recommended)
├── voices/                  # User voices (.wav or .mp3)
└── cache/                   # Persistent embeddings
```

---

# Setup

## 1. Build the Docker image

```bash
./build.sh
```

## 2. Run the server

```bash
./run.sh
```

The scripts handle:

* GPU detection
* volume mounts for `/models`, `/voices`, and `/cache`
* accepting the Coqui TTS license terms

---

# API

## `/tts` or `/api/tts`

Synthesizes speech and returns the full audio file.

**Method:** `GET`

### Parameters

| Parameter | Default   | Description             |
| --------- | --------- | ----------------------- |
| `text`    | required  | Text to synthesize      |
| `voice`   | `default` | Voice name in `/voices` |
| `lang`    | `en`      | Language code           |

**Returns:** `audio/wav`

### Example

```bash
curl "http://localhost:5002/tts?text=Hello%20world&voice=trump" --output hello.wav
```

---

# Streaming Endpoint

## `/tts_stream` or `/api/tts_stream`

Streams generated audio while synthesis is still in progress.
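As a minimal, stdlib-only client sketch, the stream can be consumed progressively from Python. The host, port, and parameter names follow the examples in this README; the chunked read loop is an assumption about how the response is delivered:

```python
import urllib.parse
import urllib.request


def stream_tts(text, voice="default", lang="en",
               host="http://localhost:5002", chunk_size=4096):
    """Yield raw audio bytes from /tts_stream as they arrive."""
    params = urllib.parse.urlencode(
        {"text": text, "voice": voice, "lang": lang})
    with urllib.request.urlopen(f"{host}/tts_stream?{params}") as resp:
        # Read in small chunks so playback can begin before
        # synthesis of the full text has finished.
        while chunk := resp.read(chunk_size):
            yield chunk


# Usage (assumes a running server):
# with open("hello.wav", "wb") as f:
#     for chunk in stream_tts("Hello world", voice="trump"):
#         f.write(chunk)
```

Because `stream_tts` is a generator, no network request is made until the first chunk is pulled, which keeps the client responsive for conversational use.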
This is useful for:

* conversational agents
* low-latency playback
* long text generation

**Method:** `GET`

### Parameters

Same as `/tts`.

### Example

```bash
curl "http://localhost:5002/tts_stream?text=Hello%20world&voice=trump" --output hello.wav
```

Streaming works best with audio players that can handle progressive WAV streams.

---

# Voices

## `/voices`

Returns the list of available voices.

**Method:** `GET`

### Example response

```json
{ "voices": ["trump", "narrator", "alice"] }
```

---

# Voice Handling

* `.wav` is the canonical internal format
* `.mp3` files are converted automatically when needed
* If an `.mp3` is newer than its `.wav`, reconversion is triggered
* Voice embeddings are cached in `/cache` for faster synthesis
* Cached embeddings persist across container restarts

---

# Large Text Handling

Long inputs are automatically **split into smaller chunks** before synthesis. This provides several advantages:

* prevents **CUDA out-of-memory errors**
* improves reliability on **low-VRAM GPUs**
* allows long paragraphs or documents to be synthesized safely

Chunked outputs are automatically concatenated into a single audio stream.

---

# GPU Memory Handling

The server is designed to work even on **small GPUs (~4 GB VRAM)** such as:

* GTX 1650
* GTX 1050 Ti
* low-end cloud GPUs

If the GPU runs out of memory:

1. The system automatically catches the CUDA OOM error
2. The synthesis request **falls back to CPU mode**
3. Audio generation continues without crashing the server

This allows stable operation even with long text inputs.

---

# Notes

* A GPU is recommended for real-time XTTS synthesis
* CPU fallback ensures stability even on limited hardware
* `/models`, `/voices`, and `/cache` should be mounted as Docker volumes
* The `/tts` endpoint is backward-compatible with `/api/tts`
* Set `DEFAULT_VOICE = "default"` in `tts_server.py` to handle requests with a missing `voice` parameter

---

# License

* Non-commercial use: [Coqui CPML](https://coqui.ai/cpml)
* Commercial licensing: `licensing@coqui.ai`

---
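# Appendix: Text Chunking Sketch

The automatic chunking described under **Large Text Handling** can be approximated as follows. This is a minimal sketch, not the actual logic in `tts_server.py`; the sentence-boundary split and the `max_chars` budget are assumptions:

```python
import re


def chunk_text(text, max_chars=250):
    """Split text into sentence-aligned chunks of at most max_chars,
    so each synthesis call stays within a small VRAM budget."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is synthesized independently and the resulting audio segments are concatenated, matching the behavior described above. A single sentence longer than `max_chars` is emitted as its own oversized chunk rather than being split mid-sentence.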