# Whisper Transcription HTTP Server

A minimal HTTP API for **speech-to-text transcription** built with:

* **Flask**
* **whisper.cpp**
* **ffmpeg**

The service accepts audio uploads and returns a **single transcription result** using the Whisper model. It is designed to be:

* simple
* reliable
* easy to containerize
* suitable for internal automation pipelines

Typical use cases include:

* voice message transcription
* voice assistant pipelines
* transcription preprocessing
* automation workflows (Node-RED, etc.)

---

# Architecture

The server works in four stages:

1. **Upload audio**
   A client sends a `multipart/form-data` POST request containing an audio file.

2. **Convert audio**
   The server converts the audio into **16 kHz mono WAV** using `ffmpeg`. This ensures compatibility and stable input for Whisper.

3. **Transcribe**
   The server calls the **whisper.cpp CLI** (`whisper-cli`) with the specified model.

4. **Return text**
   The transcription result is returned as JSON.

If the transcription result is empty, the server returns **diagnostic information** to aid debugging.

---

# API

## Health Check

```
GET /health
```

Returns server status and verifies that:

* the whisper binary exists
* the model file exists
* ffmpeg is available

Example response:

```json
{
  "ok": true,
  "problems": []
}
```

---

## Transcribe Audio

```
POST /transcribe
```

### Request

`multipart/form-data`

Field name:

```
file
```

Example:

```
file=@audio.wav
```

Supported formats (handled by ffmpeg):

* wav
* mp3
* ogg
* m4a
* most other common audio formats

---

### Response

Success:

```json
{
  "text": "Hello this is a transcription."
}
```

If the transcription is empty, the server returns diagnostics:

```json
{
  "text": "",
  "note": "empty transcript; returning diagnostics",
  "stdout": "...",
  "stderr": "...",
  "cmd": [...]
}
```

---

# Project Structure

```
.
├── server.py
├── requirements.txt
├── Dockerfile
└── README.md
```

At runtime the container will also contain:

```
/app/whisper.cpp
/app/whisper.cpp/build/bin/whisper-cli
/app/whisper.cpp/models/ggml-small.bin
```

---

# Dependencies

Runtime components:

* Python 3.11
* Flask
* Gunicorn
* ffmpeg
* whisper.cpp
* Whisper model (`ggml-small.bin`)

The Docker image builds whisper.cpp automatically.

---

# Configuration

The server reads these environment variables:

```
WHISPER_BIN
MODEL_PATH
```

Defaults inside the container:

```
WHISPER_BIN=/app/whisper.cpp/build/bin/whisper-cli
MODEL_PATH=/app/whisper.cpp/models/ggml-small.bin
```

---

# Build Docker Image

From the project directory:

```bash
docker build -t whisper-api .
```

---

# Run Container

```
docker run -d -it -p 5005:5005 --name whisper-server whisper-api
```

The API will be available at:

```
http://localhost:5005
```

---

# Test the Server

### Health Check

```
curl http://localhost:5005/health
```

---

### Transcribe Audio

```
curl -X POST \
  -F "file=@test.wav" \
  http://localhost:5005/transcribe
```

Example response:

```
{"text":"hello this is a test"}
```

---

# Development (Without Docker)

Create a Python virtual environment:

```
python -m venv venv
source venv/bin/activate
```

Install dependencies:

```
pip install -r requirements.txt
```

Run the server:

```
python server.py
```

---

# Model Choice

The default Docker build downloads:

```
ggml-small.bin
```

You can switch models by modifying the Dockerfile:

| Model  | Speed     | Accuracy |
| ------ | --------- | -------- |
| tiny   | very fast | low      |
| base   | fast      | moderate |
| small  | balanced  | good     |
| medium | slow      | high     |
| large  | very slow | best     |

---

# Performance Notes

* whisper.cpp runs **fully on CPU**
* transcription speed depends on:
  * CPU cores
  * CPU vector extensions (AVX/AVX2)
  * model size

Typical small-model performance on modern CPUs:

```
~0.5x – 2x realtime
```

---

# Security Notes

This server:

* accepts arbitrary audio uploads
* runs ffmpeg on them

For
production deployments, consider:

* a reverse proxy (nginx / traefik)
* request size limits
* authentication
* rate limiting

---

# License

This project uses:

* **whisper.cpp** — MIT License
* **Flask** — BSD License

See their respective repositories for details.
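---

# Appendix: Pipeline Sketch

For reference, the convert and transcribe stages described under *Architecture* can be sketched in Python roughly as follows. This is a minimal illustration, not the actual `server.py`: the ffmpeg flags (`-ar 16000 -ac 1`) match the 16 kHz mono conversion described above, but the exact `whisper-cli` flags (`-m`, `-f`, `-nt`) and helper names are assumptions.

```python
import os
import subprocess
import tempfile

# Defaults mirror the container environment (see Configuration).
WHISPER_BIN = os.environ.get("WHISPER_BIN", "/app/whisper.cpp/build/bin/whisper-cli")
MODEL_PATH = os.environ.get("MODEL_PATH", "/app/whisper.cpp/models/ggml-small.bin")


def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """Stage 2: resample to 16 kHz (-ar 16000) and downmix to mono (-ac 1)."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]


def whisper_cmd(wav: str) -> list[str]:
    """Stage 3: run whisper-cli on the converted WAV (-nt: no timestamps)."""
    return [WHISPER_BIN, "-m", MODEL_PATH, "-f", wav, "-nt"]


def transcribe_file(src: str) -> str:
    """Run both stages and return the raw transcript printed on stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        wav = os.path.join(tmp, "audio.wav")
        subprocess.run(ffmpeg_cmd(src, wav), check=True, capture_output=True)
        out = subprocess.run(whisper_cmd(wav), check=True,
                             capture_output=True, text=True)
        return out.stdout.strip()
```

An empty `out.stdout` here corresponds to the diagnostics response shown under *Transcribe Audio*.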
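---

# Appendix: Python Client Sketch

Beyond `curl`, the `/transcribe` endpoint can be called from Python using only the standard library. The helpers below are an illustrative sketch (the function names are not part of this project); they hand-build the `multipart/form-data` body with the `file` field the server expects.

```python
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "application/octet-stream"):
    """Encode a single file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, url: str = "http://localhost:5005/transcribe") -> str:
    """POST an audio file to the server and return the transcribed text."""
    with open(path, "rb") as f:
        body, ctype = build_multipart("file", path, f.read())
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

In a real deployment you would likely use a client library such as `requests` instead; the stdlib version is shown only to keep the sketch dependency-free.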