
Local Two-Stage Reranker Server

A lightweight, self-hosted reranking service optimized for small GPUs, with automatic CPU fallback.

The server exposes a simple REST API that reranks documents for a given query. It is designed to integrate easily with RAG pipelines, mem0, LangChain, or custom retrieval systems.


Features

  • Two-stage reranking pipeline
  • Fast CPU FlashRank filtering
  • Accurate MiniLM GPU cross-encoder
  • Automatic GPU detection
  • VRAM safety check
  • CUDA OOM fallback
  • Optimized for small GPUs (e.g. GTX 1650 4GB)
  • Works fully CPU-only
  • Simple FastAPI REST interface
  • Built-in health endpoint

Architecture

The server uses a two-stage reranking pipeline to reduce GPU load while maintaining high ranking quality.

incoming documents
        │
        ▼
FlashRank CPU reranker
        │
        ▼
top 10 candidates
        │
        ▼
MiniLM cross-encoder (GPU if available)
        │
        ▼
final ranking

Advantages:

  • ~70-90% less GPU load, since only the top stage-1 candidates reach the cross-encoder
  • low added latency
  • stable operation on small GPUs
  • safe fallback to the CPU ranking when GPU memory is unavailable
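The two-stage flow above can be sketched in a few lines of Python. The scoring functions here are hypothetical stand-ins for FlashRank and the MiniLM cross-encoder, not the actual server code:

```python
from typing import Callable, List, Tuple

def two_stage_rerank(
    query: str,
    documents: List[str],
    cheap_score: Callable[[str, str], float],    # stage 1: fast CPU scorer (FlashRank stand-in)
    precise_score: Callable[[str, str], float],  # stage 2: accurate scorer (cross-encoder stand-in)
    stage1_top_k: int = 10,
    final_top_k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap CPU filter keeps only the best candidates.
    candidates = sorted(
        documents, key=lambda d: cheap_score(query, d), reverse=True
    )[:stage1_top_k]
    # Stage 2: the expensive scorer runs only on the survivors.
    scored = [(d, precise_score(query, d)) for d in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:final_top_k]
```

With 100 input documents and stage1_top_k=10, the expensive model scores 10 documents instead of 100, which is where the GPU savings come from.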

Project Structure

reranker-server/
│
├── reranker_server.py
├── requirements.txt
└── README.md

Installation

1. Clone the repository

git clone <repo-url>
cd reranker-server

2. Create a Python environment

python3 -m venv venv

Activate the environment.

Linux / macOS:

source venv/bin/activate

Windows:

venv\Scripts\activate

3. Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Required packages:

  • fastapi
  • uvicorn
  • sentence-transformers
  • flashrank
  • torch

Running the Server

Start the server with:

uvicorn reranker_server:app --host 0.0.0.0 --port 5200

The server will be available at:

http://localhost:5200

Interactive API docs:

http://localhost:5200/docs

API

POST /rerank

Rerank a list of documents for a query.

Request

{
  "query": "What is a reranker?",
  "documents": [
    "A reranker sorts retrieved documents by relevance.",
    "Cats are mammals.",
    "Rerankers improve search pipelines."
  ],
  "top_k": 2
}

Response

{
  "results": [
    {
      "text": "A reranker sorts retrieved documents by relevance.",
      "score": 0.84
    },
    {
      "text": "Rerankers improve search pipelines.",
      "score": 0.79
    }
  ],
  "backend": "gpu"
}

The backend field indicates which stage produced the final ranking:

  • gpu → cross-encoder used
  • cpu → FlashRank only
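A minimal Python client for this endpoint might look as follows; the helper names are illustrative, and only the request/response shapes shown above are assumed:

```python
import json
import urllib.request

def build_rerank_payload(query, documents, top_k):
    # Matches the request schema of POST /rerank shown above.
    return {"query": query, "documents": documents, "top_k": top_k}

def rerank(query, documents, top_k, url="http://localhost:5200/rerank"):
    # Post the payload and decode the JSON response
    # ({"results": [...], "backend": "gpu" | "cpu"}).
    data = json.dumps(build_rerank_payload(query, documents, top_k)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```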

Health Check

GET /health

Example response:

{
  "status": "ok",
  "cuda_available": true,
  "gpu_model_loaded": true
}

GPU Usage

The server automatically detects CUDA.

GPU reranking is used only when:

  • CUDA is available
  • sufficient VRAM is free
  • no CUDA errors occur

If the GPU stage fails, the server falls back to the CPU ranking automatically.
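The fallback can be pictured as a try/except around the GPU stage. This is a schematic, not the server's actual code; in PyTorch, a CUDA out-of-memory error surfaces as a RuntimeError (newer versions raise torch.cuda.OutOfMemoryError, a RuntimeError subclass):

```python
def rerank_with_fallback(gpu_rerank, cpu_results):
    """Run the GPU cross-encoder stage; on a runtime failure, keep the CPU ranking."""
    try:
        return gpu_rerank(), "gpu"
    except RuntimeError:
        # e.g. CUDA out of memory — the stage-1 CPU ranking is still usable.
        return cpu_results, "cpu"
```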


Recommended RAG Settings

Typical pipeline:

vector search → top 20 documents
reranker → top 5 documents
LLM context

This keeps latency low while improving retrieval quality.
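Wired into a retrieval pipeline, the recommended numbers look like this. vector_search is a hypothetical stand-in for whatever retriever the pipeline uses, and rerank would call the server's POST /rerank endpoint:

```python
def retrieve_context(query, vector_search, rerank, search_k=20, context_k=5):
    # Stage 1: broad vector search over-fetches candidates (top 20).
    candidates = vector_search(query, top_k=search_k)
    # Stage 2: the reranker narrows them to what goes into the LLM context (top 5).
    return rerank(query, candidates, top_k=context_k)
```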


Testing the Server

Example curl request:

curl http://localhost:5200/rerank \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "query": "reranker",
    "documents": [
      "cats",
      "reranking documents improves retrieval"
    ],
    "top_k": 1
  }'

Model Information

CPU Stage

FlashRank model:

ms-marco-MiniLM-L-12-v2

A fast ONNX-based reranker optimized for CPU inference.

GPU Stage

Cross-encoder model:

cross-encoder/ms-marco-MiniLM-L-6-v2

A widely used MS MARCO cross-encoder with a good accuracy/speed trade-off.


Future Improvements

Possible enhancements:

  • result caching
  • Docker container
  • batch reranking API
  • integration with mem0
  • Prometheus metrics
  • request logging

License

MIT License