
Local Two-Stage Reranker Server

A lightweight self-hosted reranking service optimized for small GPUs and CPU fallback. The server exposes a simple REST API that reranks documents for a given query.

It is designed to integrate easily with RAG pipelines, mem0, LangChain, or custom retrieval systems.

Features

  • Two-stage reranking pipeline
  • Fast CPU FlashRank filtering
  • Accurate MiniLM GPU cross-encoder
  • Automatic GPU detection
  • VRAM safety check
  • CUDA OOM fallback
  • Optimized for small GPUs (e.g. GTX 1650 4GB)
  • Works fully CPU-only
  • Simple FastAPI REST interface
  • Built-in health endpoint
  • Docker support with persistent model cache

    Architecture

    The server uses a two-stage reranking pipeline to reduce GPU load while maintaining high ranking quality.

    incoming documents
        │
        ▼
    FlashRank CPU reranker
        │
        ▼
    top 10 candidates
        │
        ▼
    MiniLM cross-encoder (GPU if available)
        │
        ▼
    final ranking
    

    Advantages:

  • ~70-90% less GPU usage, since only the shortlist reaches the cross-encoder

  • low added latency

  • stable operation on small GPUs

  • safe fallback when GPU memory is unavailable
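The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration of the filter-then-rescore idea, not the actual server code; `cheap_score` and `accurate_score` are hypothetical stand-ins for the FlashRank and cross-encoder scorers:

```python
def two_stage_rerank(query, documents, cheap_score, accurate_score,
                     shortlist=10, top_k=5):
    """Stage 1: a cheap CPU scorer trims the candidate set.
    Stage 2: the expensive (GPU) scorer only sees the shortlist."""
    # Stage 1: rank all documents with the fast scorer, keep the best `shortlist`
    stage1 = sorted(documents, key=lambda d: cheap_score(query, d),
                    reverse=True)[:shortlist]
    # Stage 2: rescore only the shortlist with the accurate scorer
    stage2 = sorted(stage1, key=lambda d: accurate_score(query, d),
                    reverse=True)
    return stage2[:top_k]
```

Because the expensive scorer never sees more than `shortlist` documents, GPU cost stays bounded regardless of how many candidates retrieval returns.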

    Project Structure

    reranker-server/
    │
    ├── reranker_server.py
    ├── Dockerfile
    ├── requirements.txt
    └── README.md
    

    Installation

    1. Clone the repository

    git clone <repo-url>
    cd reranker-server
    

    2. Create a Python environment

    python3 -m venv venv
    

    Activate the environment. Linux / macOS:

    source venv/bin/activate
    

    Windows:

    venv\Scripts\activate
    

    3. Install dependencies

    pip install --upgrade pip
    pip install -r requirements.txt
    

    Required packages:

  • fastapi

  • uvicorn

  • sentence-transformers

  • flashrank

  • torch

    Running the Server

    Start the server with:

    uvicorn reranker_server:app --host 0.0.0.0 --port 5200
    

    The server will run at:

    http://localhost:5200
    

    Interactive API docs:

    http://localhost:5200/docs
    

    Docker

    Build the image

    From the repository root:

    docker build -t reranker-server .
    

    Run the container (CPU only)

    docker run -d \
      --name reranker-server \
      -p 5200:5200 \
      -v "$(pwd)/hf_cache:/app/hf_cache" \
      reranker-server
    

The hf_cache bind-mount maps the local ./hf_cache directory into the container. Downloaded models are written there and reused on every subsequent start — no re-downloading required.


Run the container (GPU)

Requires the NVIDIA Container Toolkit to be installed on the host.

docker run -d \
  --name reranker-server \
  --gpus all \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server

Pass --gpus '"device=0"' instead of --gpus all to pin a specific GPU.


Stop / remove the container

docker stop reranker-server
docker rm reranker-server

Rebuild after code changes

docker build --no-cache -t reranker-server .

API

POST /rerank

Rerank a list of documents for a query.

Request

{
  "query": "What is a reranker?",
  "documents": [
    "A reranker sorts retrieved documents by relevance.",
    "Cats are mammals.",
    "Rerankers improve search pipelines."
  ],
  "top_k": 2
}

Response

{
  "results": [
    {
      "text": "A reranker sorts retrieved documents by relevance.",
      "score": 0.84
    },
    {
      "text": "Rerankers improve search pipelines.",
      "score": 0.79
    }
  ],
  "backend": "gpu"
}

The backend field indicates which stage produced the final ranking:

  • gpu → cross-encoder used
  • cpu → FlashRank only
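A minimal Python client for this endpoint might look like the following. It uses only the standard library; the payload shape matches the request schema above, and the URL assumes the default port from this README:

```python
import json
import urllib.request

def build_payload(query, documents, top_k):
    """Assemble the JSON body expected by POST /rerank."""
    return {"query": query, "documents": documents, "top_k": top_k}

def rerank(query, documents, top_k=5, url="http://localhost:5200/rerank"):
    """Send a rerank request and return the parsed JSON response."""
    body = json.dumps(build_payload(query, documents, top_k)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The returned dict carries the `results` list and the `backend` field shown in the response example above.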

    Health Check

    GET /health
    

    Example response:

    {
      "status": "ok",
      "cuda_available": true,
      "gpu_model_loaded": true
    }
    

    GPU Usage

    The server automatically detects CUDA. GPU reranking is used only when:

  • CUDA is available

  • sufficient VRAM is free

  • no CUDA errors occur

    If the GPU path fails (for example on a CUDA out-of-memory error), the server falls back to CPU automatically.
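The fallback behavior can be illustrated with a generic pattern. This is a sketch, not the server's actual implementation; `gpu_fn` and `cpu_fn` stand in for the cross-encoder and FlashRank stages:

```python
def rerank_with_fallback(query, documents, gpu_fn, cpu_fn):
    """Try the GPU reranker; on a CUDA out-of-memory error, fall back to CPU.

    Returns (results, backend) so callers can see which path ran,
    mirroring the `backend` field in the API response.
    """
    try:
        return gpu_fn(query, documents), "gpu"
    except RuntimeError as exc:
        # PyTorch reports CUDA OOM as a RuntimeError whose message
        # contains "out of memory"
        if "out of memory" in str(exc).lower():
            return cpu_fn(query, documents), "cpu"
        raise  # unrelated errors still surface
```

Catching only the OOM case keeps genuine bugs visible instead of silently masking them behind the CPU path.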

    Recommended RAG Settings

    Typical pipeline:

    vector search → top 20 documents
    reranker → top 5 documents
    LLM context
    

    This keeps latency low while improving retrieval quality.
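That flow can be wired together roughly like this. It is a sketch of the recommended settings above; `vector_search`, `rerank`, and `build_context` are placeholders for your own retrieval stack:

```python
def rag_answer(query, vector_search, rerank, build_context):
    # Stage 1: broad recall from the vector store
    candidates = vector_search(query, top_k=20)
    # Stage 2: precise ordering via the reranker, keep the best few
    best = rerank(query, candidates, top_k=5)
    # Stage 3: hand only the top documents to the LLM
    return build_context(best)
```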

    Testing the Server

    Example curl request:

    curl http://localhost:5200/rerank \
      -X POST \
      -H "Content-Type: application/json" \
      -d '{
        "query": "reranker",
        "documents": [
          "cats",
          "reranking documents improves retrieval"
        ],
        "top_k": 1
      }'
    

    Model Information

    CPU Stage

    FlashRank model:

    ms-marco-MiniLM-L-12-v2
    

    Fast ONNX reranker optimized for CPU.

    GPU Stage

    Cross-encoder model:

    cross-encoder/ms-marco-MiniLM-L-6-v2
    

    A widely used cross-encoder reranker with a good accuracy/latency trade-off.

    Future Improvements

    Possible enhancements:

  • result caching

  • Docker Compose support

  • batch reranking API

  • integration with mem0

  • Prometheus metrics

  • request logging

    License

    MIT License