# Coqui TTS Docker Server

A Dockerized Coqui XTTS server with multilingual support, voice cloning, smart text preprocessing, and robust GPU memory management.

---

## Features

* Multilingual TTS via [Coqui XTTS v2](https://github.com/coqui-ai/TTS)
* Multi-speaker support and voice cloning
* Automatic `.mp3` → `.wav` conversion with staleness detection
* Persistent speaker embeddings cache for fast repeated synthesis
* **Markdown-aware text preprocessing** — headings, bold, lists, and paragraphs are converted to natural prosody cues (pauses) before synthesis
* **Language-aware acronym expansion** — `KI`, `EU`, `US`, `BRD`, etc. are expanded to phonetic letter-spellings appropriate for German or English (`Ka-I`, `E-U`, `U-Es` …)
* **Automatic chunking** of large inputs with sentence- and word-boundary awareness
* **Proactive VRAM headroom check** — the entire request is pinned to CPU if the GPU is already low on memory before synthesis starts
* **Automatic per-chunk CPU fallback** on CUDA OOM, with the model restored to GPU afterward
* **Per-chunk VRAM flush** — `empty_cache()` after every chunk and after every completed request keeps VRAM usage flat across long documents
* **Full CPU multi-core utilisation** when running on CPU (`torch.set_num_threads`)
* `/tts`, `/api/tts` — full-file synthesis
* `/tts_stream`, `/api/tts_stream` — progressive streaming synthesis
* `/health` — liveness + VRAM status
* Optimised for small-VRAM GPUs (GTX 1650 / ~4 GB) and larger

---

## Repository Structure

```text
coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py   # Production server
├── models/         # XTTS model weights (host-mount recommended)
├── voices/         # Speaker reference files (.wav or .mp3)
└── cache/          # Persistent speaker embeddings (.pkl)
```

---

## Setup

### 1. Build the Docker image

```bash
./build.sh
```

### 2. Run the server

```bash
./run.sh
```

The scripts handle GPU detection, volume mounts for `/models`, `/voices`, and `/cache`, and acceptance of the Coqui TTS license terms.
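The sentence- and word-boundary-aware chunking described above can be sketched as follows. This is a minimal illustration, not the server's actual implementation; the function name `chunk_text` and the character limit are hypothetical, and the real server may split on different limits or rules:

```python
import re

def chunk_text(text: str, max_chars: int = 250) -> list[str]:
    """Split text into chunks, preferring sentence boundaries and
    falling back to word boundaries for oversized sentences.
    (Illustrative sketch; the server's real limits may differ.)"""
    # Naive sentence split on ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Sentence too long on its own: split at word boundaries
            for word in sentence.split():
                if len(current) + len(word) + 1 > max_chars:
                    if current:
                        chunks.append(current)
                    current = word
                else:
                    current = f"{current} {word}".strip()
        elif len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be synthesised (and its VRAM flushed) independently, which is what keeps memory usage flat on long documents.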
---

## API

### `GET /health`

Returns server liveness and current VRAM status.

**Example response (GPU):**

```json
{
  "status": "ok",
  "device": "cuda",
  "vram_free_mb": 2814,
  "vram_total_mb": 4096,
  "vram_used_pct": 31.3
}
```

**Example response (CPU):**

```json
{
  "status": "ok",
  "device": "cpu"
}
```

---

### `GET /tts` · `GET /api/tts`

Synthesise speech and return the complete audio file.

**Returns:** `audio/wav`

| Parameter | Default   | Description                                 |
|-----------|-----------|---------------------------------------------|
| `text`    | required  | Text to synthesise (plain or markdown)      |
| `voice`   | `default` | Voice name (stem of a file in `/voices`)    |
| `lang`    | `en`      | BCP-47 language code (`en`, `de`, `fr`, …)  |

**Example:**

```bash
curl "http://localhost:5002/tts?text=Hello+world&voice=alice&lang=en" --output out.wav
```

---

### `GET /tts_stream` · `GET /api/tts_stream`

Streams synthesised audio progressively as each chunk completes. Useful for conversational agents, low-latency playback, or very long documents. Parameters are identical to `/tts`.

**Example:**

```bash
curl "http://localhost:5002/tts_stream?text=Hello+world&voice=alice&lang=en" --output out.wav
```

> Streaming works best with audio players that support progressive WAV input (e.g. VLC, ffplay, most browser `<audio>` implementations).
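Because both endpoints are plain GET requests, any HTTP client can drive them. A minimal Python sketch that builds the same query string as the curl examples above (the helper `build_tts_url` is ours for illustration, not part of the server):

```python
from urllib.parse import urlencode

def build_tts_url(base: str, text: str, voice: str = "default",
                  lang: str = "en", stream: bool = False) -> str:
    """Build a request URL for the /tts or /tts_stream endpoint.
    (Hypothetical client helper; parameter names match the API table.)"""
    endpoint = "/tts_stream" if stream else "/tts"
    query = urlencode({"text": text, "voice": voice, "lang": lang})
    return f"{base.rstrip('/')}{endpoint}?{query}"

# Fetching could then look like this (requires a running server):
#   import urllib.request
#   url = build_tts_url("http://localhost:5002", "Hello world", voice="alice")
#   with urllib.request.urlopen(url) as resp, open("out.wav", "wb") as f:
#       for block in iter(lambda: resp.read(8192), b""):
#           f.write(block)  # writes as data arrives; useful with stream=True
```

Reading the response in blocks, as in the commented fetch, is what lets a client start playback before `/tts_stream` has finished synthesising later chunks.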