Lukas Goldschmidt b534ee03ff all features implemented 1 week ago

Coqui TTS Docker Server

A Dockerized Coqui TTS server with multilingual XTTS, multi-speaker support, voice caching, and automatic audio conversion.


Features

  • Multilingual TTS via Coqui XTTS
  • Multi-speaker support, ready for voice cloning
  • Automatic .mp3 → .wav conversion
  • Persistent embeddings cache for fast synthesis
  • Adjustable speed, pitch, and language
  • Automatic chunking of large input texts to prevent memory overflow
  • Automatic CPU fallback if GPU VRAM is exhausted
  • Streaming endpoint for progressive audio output
  • Optimized for small VRAM GPUs (e.g. GTX 1650 / ~4GB VRAM)
  • /tts and /api/tts endpoints

Repository Structure

coqui-docker/
├── Dockerfile
├── build.sh
├── run.sh
├── README.md
├── tts_server.py            # Production server with caching and fallback
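├── tts_server_simple.py     # Simple version
├── tts_server_nochunks.py   # Legacy stage (no chunking, judging by the file name)
├── tts_server_noncaching.py # Legacy stage (no caching, judging by the file name)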
├── models/                  # XTTS models (host mount recommended)
├── voices/                  # User voices (.wav or .mp3)
└── cache/                   # Persistent embeddings

Setup

1. Build the Docker image

./build.sh

2. Run the server

./run.sh

The scripts handle:

  • GPU detection
  • volume mounts for /models, /voices, /cache
  • accepting Coqui TTS license terms

API

/tts or /api/tts

Synthesize speech and return the full audio file.

Method: GET

Parameters

Parameter   Default    Description
text        required   Text to synthesize
voice       default    Voice name in /voices
lang        en         Language code

Returns: audio/wav

Example

curl "http://localhost:5002/tts?text=Hello%20world&voice=trump" --output hello.wav
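
The same request can be made from Python. The following is a minimal sketch using only the standard library. The helper names build_tts_url and fetch_tts are illustrative and not part of the server; the host and port are taken from the curl example, and the parameter defaults mirror the table above.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:5002"  # host and port from the curl example


def build_tts_url(text, voice="default", lang="en", endpoint="/tts"):
    """Build a GET URL for the synthesis endpoint (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20, matching the curl example.
    query = urlencode({"text": text, "voice": voice, "lang": lang},
                      quote_via=quote)
    return f"{BASE_URL}{endpoint}?{query}"


def fetch_tts(text, out_path, **params):
    """Request synthesis and write the returned audio/wav bytes to out_path."""
    with urlopen(build_tts_url(text, **params)) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)  # number of bytes written
```

With a running server, fetch_tts("Hello world", "hello.wav", voice="trump") should produce the same file as the curl command, provided that voice exists in /voices.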

Streaming Endpoint

/tts_stream or /api/tts_stream

Streams generated audio while synthesis is happening.

This is useful for:

  • conversational agents
  • low-latency playback
  • long text generation

Method: GET

Parameters

Same as /tts.

Example

curl "http://localhost:5002/tts_stream?text=Hello%20world&voice=trump" --output hello.wav

Streaming works best with audio players capable of handling progressive WAV streams.
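
A client can consume the stream progressively instead of waiting for the full response. The sketch below reads the HTTP response in fixed-size chunks and writes each one as it arrives. The function names and the 4096-byte chunk size are assumptions chosen for illustration; a real player would feed the chunks to an audio device rather than a file.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def copy_stream(src, dst, chunk_size=4096):
    """Copy src to dst chunk by chunk; return the total bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means the stream has ended
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def stream_tts(text, out_path, voice="default", lang="en",
               base_url="http://localhost:5002"):
    """Call /tts_stream and write audio to out_path as it is received."""
    query = urlencode({"text": text, "voice": voice, "lang": lang},
                      quote_via=quote)
    url = f"{base_url}/tts_stream?{query}"
    with urlopen(url) as resp, open(out_path, "wb") as out:
        return copy_stream(resp, out)
```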


Voices

/voices

Returns available voices.

Method: GET

Example response

{
  "voices": ["trump","narrator","alice"]
}
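
A client might check this list before requesting synthesis. A small sketch, assuming the response shape shown above; parse_voices and list_voices are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def parse_voices(body):
    """Extract the voice list from a /voices JSON response body."""
    return list(json.loads(body).get("voices", []))


def list_voices(base_url="http://localhost:5002"):
    """Fetch the available voices from a running server."""
    with urlopen(f"{base_url}/voices") as resp:
        return parse_voices(resp.read())
```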

Voice Handling

  • .wav is the canonical internal format
  • .mp3 is converted automatically when needed
  • If a .mp3 is newer than the .wav, reconversion is triggered
  • Voice embeddings are cached in /cache for faster synthesis
  • Cached embeddings persist across container restarts
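
The reconversion rule above can be sketched as a modification-time comparison. This is an illustration of the idea, not the server's code: needs_conversion and ensure_wav are hypothetical names, and the converter is passed in as a callable because this README does not say which tool the server uses.

```python
import os


def needs_conversion(mp3_path, wav_path):
    """Return True if the .wav is missing or older than the .mp3."""
    if not os.path.exists(wav_path):
        return True
    return os.path.getmtime(mp3_path) > os.path.getmtime(wav_path)


def ensure_wav(voice_dir, name, convert):
    """Return the .wav path for a voice, converting from .mp3 if stale.

    convert(mp3_path, wav_path) is supplied by the caller, for example a
    wrapper around an external audio converter (an assumption here).
    """
    wav = os.path.join(voice_dir, name + ".wav")
    mp3 = os.path.join(voice_dir, name + ".mp3")
    if os.path.exists(mp3) and needs_conversion(mp3, wav):
        convert(mp3, wav)
    if not os.path.exists(wav):
        raise FileNotFoundError(f"no .wav or .mp3 found for voice {name!r}")
    return wav
```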

Large Text Handling

Long inputs are automatically split into smaller chunks before synthesis.

This provides several advantages:

  • prevents CUDA out-of-memory errors
  • improves reliability on low VRAM GPUs
  • allows long paragraphs or documents to be synthesized safely

Chunked outputs are automatically concatenated into a single audio stream.
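
One way to implement this split-and-join approach is sketched below. The server's actual splitting rules and chunk size are not documented here, so the sentence-boundary regex, the max_chars=200 limit, and the function names are all assumptions. The concatenation step relies on every chunk being a WAV with the same channel count, sample width, and sample rate.

```python
import io
import re
import wave


def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def concat_wav(wav_blobs):
    """Join WAV byte strings that share one audio format into a single WAV."""
    if not wav_blobs:
        raise ValueError("no audio chunks to concatenate")
    out = io.BytesIO()
    fmt = None
    with wave.open(out, "wb") as writer:
        for blob in wav_blobs:
            with wave.open(io.BytesIO(blob), "rb") as reader:
                this_fmt = (reader.getnchannels(), reader.getsampwidth(),
                            reader.getframerate())
                if fmt is None:
                    fmt = this_fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif this_fmt != fmt:
                    raise ValueError("chunks have different audio formats")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

Splitting on sentence boundaries keeps pauses natural at the joins; the hard wrap is only a last resort for text without punctuation.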


GPU Memory Handling

The server is designed to work even on small GPUs (~4GB VRAM) such as:

  • GTX 1650
  • GTX 1050 Ti
  • low-end cloud GPUs

If the GPU runs out of memory:

  1. The system automatically catches the CUDA OOM error
  2. The synthesis request falls back to CPU mode
  3. Audio generation continues without crashing the server

This allows stable operation even with long text inputs.
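
The fallback steps above can be sketched as a retry wrapper. This is a simplified illustration, not the code in tts_server.py: synthesize(text, device) is a hypothetical callable standing in for the model call, and the OOM check relies on PyTorch reporting CUDA out-of-memory conditions as a RuntimeError whose message contains "out of memory".

```python
def is_cuda_oom(exc):
    """Heuristic check for a CUDA out-of-memory error raised by PyTorch."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def synthesize_with_fallback(synthesize, text, free_gpu_memory=None):
    """Try synthesis on the GPU and retry once on the CPU after a CUDA OOM.

    synthesize(text, device) is a placeholder for the real model call.
    free_gpu_memory is an optional cleanup hook, e.g. torch.cuda.empty_cache.
    """
    try:
        return synthesize(text, "cuda")
    except RuntimeError as exc:
        if not is_cuda_oom(exc):
            raise  # unrelated errors are not masked by the fallback
    if free_gpu_memory is not None:
        free_gpu_memory()
    return synthesize(text, "cpu")
```

Only the OOM case triggers the retry, so other failures still surface to the caller instead of silently running twice.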


Notes

  • GPU recommended for real-time XTTS synthesis
  • CPU fallback ensures stability even on limited hardware
  • /models, /voices, and /cache should be mounted as Docker volumes
  • /tts and /api/tts serve the same endpoint; both paths are available for backward compatibility
  • Set DEFAULT_VOICE in tts_server.py (e.g. DEFAULT_VOICE = "default") to choose the voice used when a request omits the voice parameter

License

  • Non-commercial use: Coqui CPML
  • Commercial license available: licensing@coqui.ai