Lukas Goldschmidt b534ee03ff all features implemented (2 months ago)

Coqui TTS Docker Server

A Dockerized Coqui TTS server with multilingual XTTS, multi-speaker support, voice caching, and automatic audio conversion.

Features

Multilingual TTS via Coqui XTTS
Multi-speaker support, ready for voice cloning
Automatic .mp3 → .wav conversion
Persistent embeddings cache for fast synthesis
Adjustable speed, pitch, and language
Automatic chunking of large input texts to prevent memory overflow
Automatic CPU fallback if GPU VRAM is exhausted
Streaming endpoint for progressive audio output
Optimized for small VRAM GPUs (e.g. GTX 1650 / ~4GB VRAM)
/tts and /api/tts endpoints

Repository Structure

coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py            # Production server with caching and fallback
├── tts_server_simple.py     # Simple version
├── tts_server_noncaching.py # Legacy version without caching
├── models/                  # XTTS models (host mount recommended)
├── voices/                  # User voices (.wav or .mp3)
└── cache/                   # Persistent embeddings

Setup

1. Build the Docker image

./build.sh

2. Run the server

./run.sh

The scripts handle:

GPU detection
volume mounts for /models, /voices, /cache
accepting Coqui TTS license terms

API

`/tts` or `/api/tts`

Synthesize speech and return the full audio file.

Method: GET

Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `text` | required | Text to synthesize |
| `voice` | `default` | Voice name in `/voices` |
| `lang` | `en` | Language code |

Returns: audio/wav

Example

curl "http://localhost:5002/tts?text=Hello%20world&voice=trump" --output hello.wav
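
The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_tts_url` and `synthesize_to_file` are hypothetical, and the host and port are assumed from the curl example above.

```python
from urllib.parse import urlencode, quote
from urllib.request import urlopen

BASE_URL = "http://localhost:5002"  # assumed from the curl example


def build_tts_url(text, voice="default", lang="en", base_url=BASE_URL):
    """Build the GET URL for /tts from the documented parameters."""
    # quote_via=quote encodes spaces as %20, matching the curl example.
    query = urlencode({"text": text, "voice": voice, "lang": lang}, quote_via=quote)
    return f"{base_url}/tts?{query}"


def synthesize_to_file(text, path, **params):
    """Request synthesis and write the returned audio/wav body to `path`."""
    with urlopen(build_tts_url(text, **params)) as response:
        audio = response.read()
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the server running, `synthesize_to_file("Hello world", "hello.wav", voice="trump")` should produce the same file as the curl command.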

Streaming Endpoint

`/tts_stream` or `/api/tts_stream`

Streams audio as it is generated, before synthesis of the full text has finished.

This is useful for:

conversational agents
low-latency playback
long text generation

Method: GET

Parameters

Same as /tts.

Example

curl "http://localhost:5002/tts_stream?text=Hello%20world&voice=trump" --output hello.wav

Streaming works best with audio players capable of handling progressive WAV streams.
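
A client can consume the stream incrementally instead of waiting for the whole body. The sketch below is one way to do that with any file-like HTTP response (for example the object returned by `urllib.request.urlopen`); the function names and the chunk size are assumptions, not part of the server.

```python
def iter_chunks(response, chunk_size=4096):
    """Yield successive byte chunks from a file-like response until it is exhausted."""
    while True:
        chunk = response.read(chunk_size)
        if not chunk:
            break
        yield chunk


def save_stream(response, path, chunk_size=4096):
    """Write the stream to `path` as chunks arrive; return the number of bytes written."""
    total = 0
    with open(path, "wb") as out:
        for chunk in iter_chunks(response, chunk_size):
            out.write(chunk)
            total += len(chunk)
    return total
```

A player that supports progressive input could be fed from `iter_chunks` directly rather than writing to disk first.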

Voices

`/voices`

Returns available voices.

Method: GET

Example response

{
  "voices": ["trump","narrator","alice"]
}
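
A client may want to validate a voice name against this list before calling `/tts`. The following sketch parses the response shown above; falling back to `default` for an unknown voice is an assumption made for illustration, not documented server behavior.

```python
import json


def parse_voices(body):
    """Return the list of voice names from a /voices JSON response body."""
    data = json.loads(body)
    return list(data.get("voices", []))


def pick_voice(requested, available, fallback="default"):
    """Return `requested` if the server lists it, otherwise the fallback name."""
    return requested if requested in available else fallback
```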

Voice Handling

.wav is the canonical internal format
.mp3 is converted automatically when needed
If a .mp3 is newer than the .wav, reconversion is triggered
Voice embeddings are cached in /cache for faster synthesis
Cached embeddings persist across container restarts
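
The reconversion rule above can be expressed as a comparison of file modification times. This is a sketch of the idea only; the function name is hypothetical and the actual check in tts_server.py may differ.

```python
import os


def needs_conversion(mp3_path, wav_path):
    """Return True if the .wav is missing or older than the .mp3 it derives from."""
    if not os.path.exists(mp3_path):
        return False  # no source file, nothing to convert
    if not os.path.exists(wav_path):
        return True  # never converted yet
    # Reconvert only when the .mp3 was modified after the .wav was written.
    return os.path.getmtime(mp3_path) > os.path.getmtime(wav_path)
```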

Large Text Handling

Long inputs are automatically split into smaller chunks before synthesis.

This provides several advantages:

prevents CUDA out-of-memory errors
improves reliability on low VRAM GPUs
allows long paragraphs or documents to be synthesized safely

Chunked outputs are automatically concatenated into a single audio stream.
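
One common way to split long input is at sentence boundaries, packing sentences into chunks up to a character limit. The sketch below illustrates that approach; the function name, the 200-character default, and the splitting rule are assumptions and not necessarily what tts_server.py does.

```python
import re


def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio appended in order.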

GPU Memory Handling

The server is designed to work even on small GPUs (~4GB VRAM) such as:

GTX 1650
GTX 1050 Ti
low-end cloud GPUs

If the GPU runs out of memory:

The system automatically catches the CUDA OOM error
The synthesis request falls back to CPU mode
Audio generation continues without crashing the server

This allows stable operation even with long text inputs.
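
The fallback amounts to a try/except around the GPU call. The sketch below shows the pattern with a caller-supplied `synthesize(text, device=...)` function, which is hypothetical. In a PyTorch-based server the exception to catch would likely be `torch.cuda.OutOfMemoryError`; the sketch defaults to the built-in `MemoryError` so that it runs without PyTorch installed.

```python
def synthesize_with_fallback(synthesize, text, oom_error=MemoryError):
    """Try synthesis on the GPU; on an out-of-memory error, retry the same text on CPU."""
    try:
        return synthesize(text, device="cuda")
    except oom_error:
        # The request is retried rather than failed, so the server keeps running.
        return synthesize(text, device="cpu")
```

Errors other than `oom_error` still propagate, so genuine bugs are not hidden by the fallback.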

Notes

GPU recommended for real-time XTTS synthesis
CPU fallback ensures stability even on limited hardware
/models, /voices, and /cache should be mounted as Docker volumes
/tts endpoint is backward-compatible with /api/tts
Set DEFAULT_VOICE = "default" in tts_server.py to choose the voice used when the voice parameter is omitted

License

Non-commercial use: Coqui CPML
Commercial license available: licensing@coqui.ai
