
Local Two-Stage Reranker Server

A lightweight, self-hosted reranking service optimized for small GPUs, with automatic CPU fallback.

The server exposes a simple REST API that reranks documents for a given query. It is designed to integrate easily with RAG pipelines, mem0, LangChain, or custom retrieval systems.


Features

  • Two-stage reranking pipeline
  • Fast CPU FlashRank filtering
  • Accurate MiniLM GPU cross-encoder
  • Automatic GPU detection
  • VRAM safety check
  • CUDA OOM fallback
  • Optimized for small GPUs (e.g. GTX 1650 4GB)
  • Works fully CPU-only
  • Simple FastAPI REST interface
  • Built-in health endpoint

Architecture

The server uses a two-stage reranking pipeline to reduce GPU load while maintaining high ranking quality.

incoming documents
        │
        ▼
FlashRank CPU reranker
        │
        ▼
top 10 candidates
        │
        ▼
MiniLM cross-encoder (GPU if available)
        │
        ▼
final ranking

Advantages:

  • ~70-90% less GPU load, since only the top stage-1 candidates reach the cross-encoder
  • low added latency
  • stable operation on small GPUs
  • safe fallback to the CPU ranking when GPU memory is unavailable
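The two-stage flow above can be sketched in a few lines of Python. The scoring functions here are hypothetical stand-ins for FlashRank and the MiniLM cross-encoder, not the actual server code:

```python
from typing import Callable, List, Tuple

def two_stage_rerank(
    query: str,
    documents: List[str],
    cheap_score: Callable[[str, str], float],    # stage 1: fast CPU scorer (FlashRank stand-in)
    precise_score: Callable[[str, str], float],  # stage 2: accurate scorer (cross-encoder stand-in)
    stage1_top_k: int = 10,
    final_top_k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: cheap CPU filter keeps only the best candidates.
    candidates = sorted(
        documents, key=lambda d: cheap_score(query, d), reverse=True
    )[:stage1_top_k]
    # Stage 2: the expensive scorer runs only on the survivors.
    scored = [(d, precise_score(query, d)) for d in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:final_top_k]
```

With 100 input documents and stage1_top_k=10, the expensive model scores 10 documents instead of 100, which is where the GPU savings come from.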

Project Structure

reranker-server/
│
├── reranker_server.py
├── requirements.txt
└── README.md

Installation

1. Clone the repository

git clone <repo-url>
cd reranker-server

2. Create a Python environment

python3 -m venv venv

Activate the environment.

Linux / macOS:

source venv/bin/activate

Windows:

venv\Scripts\activate

3. Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Required packages:

  • fastapi
  • uvicorn
  • sentence-transformers
  • flashrank
  • torch

Running the Server

Start the server with:

uvicorn reranker_server:app --host 0.0.0.0 --port 5200

The server will be available at:

http://localhost:5200

Interactive API docs:

http://localhost:5200/docs

API

POST /rerank

Rerank a list of documents for a query.

Request

{
  "query": "What is a reranker?",
  "documents": [
    "A reranker sorts retrieved documents by relevance.",
    "Cats are mammals.",
    "Rerankers improve search pipelines."
  ],
  "top_k": 2
}

Response

{
  "results": [
    {
      "text": "A reranker sorts retrieved documents by relevance.",
      "score": 0.84
    },
    {
      "text": "Rerankers improve search pipelines.",
      "score": 0.79
    }
  ],
  "backend": "gpu"
}

The backend field indicates which stage produced the final ranking:

  • gpu → cross-encoder used
  • cpu → FlashRank only
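A minimal Python client for this endpoint might look as follows; the helper names are illustrative, and only the request/response shapes shown above are assumed:

```python
import json
import urllib.request

def build_rerank_payload(query, documents, top_k):
    # Matches the request schema of POST /rerank shown above.
    return {"query": query, "documents": documents, "top_k": top_k}

def rerank(query, documents, top_k, url="http://localhost:5200/rerank"):
    # Post the payload and decode the JSON response
    # ({"results": [...], "backend": "gpu" | "cpu"}).
    data = json.dumps(build_rerank_payload(query, documents, top_k)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```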

Health Check

GET /health

Example response:

{
  "status": "ok",
  "cuda_available": true,
  "gpu_model_loaded": true
}

GPU Usage

The server automatically detects CUDA.

GPU reranking is used only when:

  • CUDA is available
  • sufficient VRAM is free
  • no CUDA errors occur

If the GPU stage fails, the server falls back to the CPU ranking automatically.
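The fallback can be pictured as a try/except around the GPU stage. This is a schematic, not the server's actual code; in PyTorch, a CUDA out-of-memory error surfaces as a RuntimeError (newer versions raise torch.cuda.OutOfMemoryError, a RuntimeError subclass):

```python
def rerank_with_fallback(gpu_rerank, cpu_results):
    """Run the GPU cross-encoder stage; on a runtime failure, keep the CPU ranking."""
    try:
        return gpu_rerank(), "gpu"
    except RuntimeError:
        # e.g. CUDA out of memory — the stage-1 CPU ranking is still usable.
        return cpu_results, "cpu"
```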


Recommended RAG Settings

Typical pipeline:

vector search → top 20 documents
reranker → top 5 documents
LLM context

This keeps latency low while improving retrieval quality.
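Wired into a retrieval pipeline, the recommended numbers look like this. vector_search is a hypothetical stand-in for whatever retriever the pipeline uses, and rerank would call the server's POST /rerank endpoint:

```python
def retrieve_context(query, vector_search, rerank, search_k=20, context_k=5):
    # Stage 1: broad vector search over-fetches candidates (top 20).
    candidates = vector_search(query, top_k=search_k)
    # Stage 2: the reranker narrows them to what goes into the LLM context (top 5).
    return rerank(query, candidates, top_k=context_k)
```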


Testing the Server

Example curl request:

curl http://localhost:5200/rerank \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "query": "reranker",
    "documents": [
      "cats",
      "reranking documents improves retrieval"
    ],
    "top_k": 1
  }'

Model Information

CPU Stage

FlashRank model:

ms-marco-MiniLM-L-12-v2

A fast ONNX-based reranker optimized for CPU inference.

GPU Stage

Cross-encoder model:

cross-encoder/ms-marco-MiniLM-L-6-v2

A widely used MS MARCO cross-encoder with a good accuracy/speed trade-off.


Future Improvements

Possible enhancements:

  • result caching
  • Docker container
  • batch reranking API
  • integration with mem0
  • Prometheus metrics
  • request logging

License

MIT License