
Local Two-Stage Reranker Server

A lightweight self-hosted reranking service optimized for small GPUs and CPU fallback. The server exposes a simple REST API that reranks documents for a given query.

It is designed to integrate easily with RAG pipelines, mem0, LangChain, or custom retrieval systems.

Features

  • Two-stage reranking pipeline
  • Fast CPU FlashRank filtering
  • Accurate MiniLM GPU cross-encoder
  • Automatic GPU detection
  • VRAM safety check
  • CUDA OOM fallback
  • Optimized for small GPUs (e.g. GTX 1650 4GB)
  • Works fully CPU-only
  • Simple FastAPI REST interface
  • Built-in health endpoint
  • Docker support with persistent model cache

    Architecture

    The server uses a two-stage reranking pipeline to reduce GPU load while maintaining high ranking quality.

    incoming documents
        │
        ▼
    FlashRank CPU reranker
        │
        ▼
    top 10 candidates
        │
        ▼
    MiniLM cross-encoder (GPU if available)
        │
        ▼
    final ranking
    

    Advantages:

  • ~70-90% less GPU usage, since only the shortlist reaches the cross-encoder

  • low added latency

  • stable operation on small GPUs

  • safe fallback when GPU memory is unavailable
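The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration of the filter-then-rescore idea, not the actual server code; `cheap_score` and `accurate_score` are hypothetical stand-ins for the FlashRank and cross-encoder scorers:

```python
def two_stage_rerank(query, documents, cheap_score, accurate_score,
                     shortlist=10, top_k=5):
    """Stage 1: a cheap CPU scorer trims the candidate set.
    Stage 2: the expensive (GPU) scorer only sees the shortlist."""
    # Stage 1: rank all documents with the fast scorer, keep the best `shortlist`
    stage1 = sorted(documents, key=lambda d: cheap_score(query, d),
                    reverse=True)[:shortlist]
    # Stage 2: rescore only the shortlist with the accurate scorer
    stage2 = sorted(stage1, key=lambda d: accurate_score(query, d),
                    reverse=True)
    return stage2[:top_k]
```

Because the expensive scorer never sees more than `shortlist` documents, GPU cost stays bounded regardless of how many candidates retrieval returns.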

    Project Structure

    reranker-server/
    │
    ├── reranker_server.py
    ├── Dockerfile
    ├── requirements.txt
    └── README.md
    

    Installation

    1. Clone the repository

    git clone <repo-url>
    cd reranker-server
    

    2. Create a Python environment

    python3 -m venv venv
    

    Activate the environment. Linux / macOS:

    source venv/bin/activate
    

    Windows:

    venv\Scripts\activate
    

    3. Install dependencies

    pip install --upgrade pip
    pip install -r requirements.txt
    

    Required packages:

  • fastapi

  • uvicorn

  • sentence-transformers

  • flashrank

  • torch

    Running the Server

    Start the server with:

    uvicorn reranker_server:app --host 0.0.0.0 --port 5200
    

    The server will run at:

    http://localhost:5200
    

    Interactive API docs:

    http://localhost:5200/docs
    

    Docker

    Build the image

    From the repository root:

    docker build -t reranker-server .
    

    Run the container (CPU only)

    docker run -d \
      --name reranker-server \
      -p 5200:5200 \
      -v "$(pwd)/hf_cache:/app/hf_cache" \
      reranker-server
    

The hf_cache bind-mount maps the local ./hf_cache directory into the container. Downloaded models are written there and reused on every subsequent start — no re-downloading required.


Run the container (GPU)

Requires the NVIDIA Container Toolkit to be installed on the host.

docker run -d \
  --name reranker-server \
  --gpus all \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server

Pass --gpus '"device=0"' instead of --gpus all to pin a specific GPU.


Stop / remove the container

docker stop reranker-server
docker rm reranker-server

Rebuild after code changes

docker build --no-cache -t reranker-server .

API

POST /rerank

Rerank a list of documents for a query.

Request

{
  "query": "What is a reranker?",
  "documents": [
    "A reranker sorts retrieved documents by relevance.",
    "Cats are mammals.",
    "Rerankers improve search pipelines."
  ],
  "top_k": 2
}

Response

{
  "results": [
    {
      "text": "A reranker sorts retrieved documents by relevance.",
      "score": 0.84
    },
    {
      "text": "Rerankers improve search pipelines.",
      "score": 0.79
    }
  ],
  "backend": "gpu"
}

The backend field indicates which stage produced the final ranking:

  • gpu → cross-encoder used
  • cpu → FlashRank only
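A minimal Python client for this endpoint might look like the following. It uses only the standard library; the payload shape matches the request schema above, and the URL assumes the default port from this README:

```python
import json
import urllib.request

def build_payload(query, documents, top_k):
    """Assemble the JSON body expected by POST /rerank."""
    return {"query": query, "documents": documents, "top_k": top_k}

def rerank(query, documents, top_k=5, url="http://localhost:5200/rerank"):
    """Send a rerank request and return the parsed JSON response."""
    body = json.dumps(build_payload(query, documents, top_k)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The returned dict carries the `results` list and the `backend` field shown in the response example above.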

    Health Check

    GET /health
    

    Example response:

    {
      "status": "ok",
      "cuda_available": true,
      "gpu_model_loaded": true
    }
    

    GPU Usage

    The server automatically detects CUDA. GPU reranking is used only when:

  • CUDA is available

  • sufficient VRAM is free

  • no CUDA errors occur

    If the GPU path fails (for example on a CUDA out-of-memory error), the server falls back to CPU automatically.
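The fallback behavior can be illustrated with a generic pattern. This is a sketch, not the server's actual implementation; `gpu_fn` and `cpu_fn` stand in for the cross-encoder and FlashRank stages:

```python
def rerank_with_fallback(query, documents, gpu_fn, cpu_fn):
    """Try the GPU reranker; on a CUDA out-of-memory error, fall back to CPU.

    Returns (results, backend) so callers can see which path ran,
    mirroring the `backend` field in the API response.
    """
    try:
        return gpu_fn(query, documents), "gpu"
    except RuntimeError as exc:
        # PyTorch reports CUDA OOM as a RuntimeError whose message
        # contains "out of memory"
        if "out of memory" in str(exc).lower():
            return cpu_fn(query, documents), "cpu"
        raise  # unrelated errors still surface
```

Catching only the OOM case keeps genuine bugs visible instead of silently masking them behind the CPU path.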

    Recommended RAG Settings

    Typical pipeline:

    vector search → top 20 documents
    reranker → top 5 documents
    LLM context
    

    This keeps latency low while improving retrieval quality.
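That flow can be wired together roughly like this. It is a sketch of the recommended settings above; `vector_search`, `rerank`, and `build_context` are placeholders for your own retrieval stack:

```python
def rag_answer(query, vector_search, rerank, build_context):
    # Stage 1: broad recall from the vector store
    candidates = vector_search(query, top_k=20)
    # Stage 2: precise ordering via the reranker, keep the best few
    best = rerank(query, candidates, top_k=5)
    # Stage 3: hand only the top documents to the LLM
    return build_context(best)
```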

    Testing the Server

    Example curl request:

    curl http://localhost:5200/rerank \
      -X POST \
      -H "Content-Type: application/json" \
      -d '{
        "query": "reranker",
        "documents": [
          "cats",
          "reranking documents improves retrieval"
        ],
        "top_k": 1
      }'
    

    Model Information

    CPU Stage

    FlashRank model:

    ms-marco-MiniLM-L-12-v2
    

    Fast ONNX reranker optimized for CPU.

    GPU Stage

    Cross-encoder model:

    cross-encoder/ms-marco-MiniLM-L-6-v2
    

    A widely used cross-encoder reranker with a good accuracy/latency trade-off.

    Future Improvements

    Possible enhancements:

  • result caching

  • Docker Compose support

  • batch reranking API

  • integration with mem0

  • Prometheus metrics

  • request logging

    License

    MIT License