# Local Two-Stage Reranker Server

A lightweight **self-hosted reranking service** optimized for small GPUs and CPU fallback. The server exposes a simple REST API that reranks documents for a given query. It is designed to integrate easily with **RAG pipelines**, **mem0**, **LangChain**, or custom retrieval systems.

---

# Features

* Two-stage reranking pipeline
* Fast **CPU FlashRank filtering**
* Accurate **MiniLM GPU cross-encoder**
* **Automatic GPU detection**
* **VRAM safety check**
* **CUDA OOM fallback**
* Optimized for **small GPUs (e.g. GTX 1650 4GB)**
* Works fully **CPU-only**
* Simple **FastAPI REST interface**
* Built-in **health endpoint**
* **Docker support** with persistent model cache

---

# Architecture

The server uses a **two-stage reranking pipeline** to reduce GPU load while maintaining high ranking quality.

```
incoming documents
        │
        ▼
FlashRank CPU reranker
        │
        ▼
 top 10 candidates
        │
        ▼
MiniLM cross-encoder (GPU if available)
        │
        ▼
  final ranking
```

Advantages:

* ~70-90% less GPU usage
* very low latency
* stable operation on small GPUs
* safe fallback when GPU memory is unavailable

---

# Project Structure

```
reranker-server/
│
├── reranker_server.py
├── Dockerfile
├── requirements.txt
└── README.md
```

---

# Installation

## 1. Clone the repository

```
git clone <repository-url>
cd reranker-server
```

---

## 2. Create a Python environment

```
python3 -m venv venv
```

Activate the environment.

Linux / macOS:

```
source venv/bin/activate
```

Windows:

```
venv\Scripts\activate
```

---

## 3. Install dependencies

```
pip install --upgrade pip
pip install -r requirements.txt
```

Required packages:

* fastapi
* uvicorn
* sentence-transformers
* flashrank
* torch

---

# Running the Server

Start the server with:

```
uvicorn reranker_server:app --host 0.0.0.0 --port 5200
```

The server will be available at:

```
http://localhost:5200
```

Interactive API docs:

```
http://localhost:5200/docs
```

---

# Docker

## Build the image

From the repository root:

```
docker build -t reranker-server .
```

---

## Run the container (CPU only)

```
docker run -d \
  --name reranker-server \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server
```

The `hf_cache` bind-mount maps the local `./hf_cache` directory into the container. Downloaded models are written there and reused on every subsequent start, so nothing is re-downloaded.

---

## Run the container (GPU)

Requires the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to be installed on the host.

```
docker run -d \
  --name reranker-server \
  --gpus all \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server
```

Pass `--gpus '"device=0"'` instead of `--gpus all` to pin a specific GPU.

---

## Stop / remove the container

```
docker stop reranker-server
docker rm reranker-server
```

---

## Rebuild after code changes

```
docker build --no-cache -t reranker-server .
```

---

# API

## POST `/rerank`

Rerank a list of documents for a query.

### Request

```
{
  "query": "What is a reranker?",
  "documents": [
    "A reranker sorts retrieved documents by relevance.",
    "Cats are mammals.",
    "Rerankers improve search pipelines."
  ],
  "top_k": 2
}
```

### Response

```
{
  "results": [
    {
      "text": "A reranker sorts retrieved documents by relevance.",
      "score": 0.84
    },
    {
      "text": "Rerankers improve search pipelines.",
      "score": 0.79
    }
  ],
  "backend": "gpu"
}
```

The `backend` field indicates which stage produced the final ranking:

* `gpu` → cross-encoder used
* `cpu` → FlashRank only

---

# Health Check

```
GET /health
```

Example response:

```
{
  "status": "ok",
  "cuda_available": true,
  "gpu_model_loaded": true
}
```

---

# GPU Usage

The server automatically detects CUDA. GPU reranking is used only when:

* CUDA is available
* sufficient VRAM is free
* no CUDA errors occur

If the GPU stage fails, the server **falls back to CPU automatically**.
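The two-stage flow and the CPU fallback can be sketched in plain Python. This is an illustrative sketch only: the function names (`rerank`, `overlap`) and the generic `RuntimeError` are assumptions standing in for the server's real FlashRank call, cross-encoder call, and CUDA out-of-memory exception.

```python
# Sketch of the two-stage rerank with automatic CPU fallback.
# The scoring functions are injected stand-ins, not the server's
# actual FlashRank / cross-encoder code.

def rerank(query, documents, top_k, cpu_score, gpu_score=None):
    """Stage 1: CPU filter to 10 candidates. Stage 2: rescore on the
    GPU scorer if available, falling back to CPU scores on failure."""
    # Stage 1: keep the 10 best candidates by CPU score (FlashRank's role)
    candidates = sorted(
        documents, key=lambda d: cpu_score(query, d), reverse=True
    )[:10]

    score, backend = cpu_score, "cpu"
    if gpu_score is not None:
        try:
            # Probe the GPU scorer; a CUDA OOM would raise here
            gpu_score(query, candidates[0])
            score, backend = gpu_score, "gpu"
        except RuntimeError:
            pass  # stands in for catching torch.cuda.OutOfMemoryError

    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return {
        "results": [
            {"text": d, "score": score(query, d)} for d in ranked[:top_k]
        ],
        "backend": backend,
    }


def overlap(query, doc):
    """Toy CPU scorer: count words shared between query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

The returned dict mirrors the `/rerank` response shape above: ranked `results` plus a `backend` flag reporting whether the GPU stage ran.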
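For scripted use, the `/rerank` endpoint can also be called from Python with only the standard library. A minimal sketch, assuming the server is running on the default port 5200; the helper names `build_payload` and `rerank_remote` are illustrative, not part of the server:

```python
import json
import urllib.request

RERANK_URL = "http://localhost:5200/rerank"  # default host/port from above


def build_payload(query, documents, top_k=5):
    """Encode a /rerank request body as JSON bytes."""
    return json.dumps(
        {"query": query, "documents": documents, "top_k": top_k}
    ).encode("utf-8")


def rerank_remote(query, documents, top_k=5, url=RERANK_URL):
    """POST a rerank request and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(query, documents, top_k),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

`rerank_remote` returns the same JSON shape shown in the Response example above (`results` plus `backend`).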
---

# Recommended RAG Settings

Typical pipeline:

```
vector search → top 20 documents
reranker      → top 5 documents
LLM context
```

This keeps latency low while improving retrieval quality.

---

# Testing the Server

Example curl request:

```
curl http://localhost:5200/rerank \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "query": "reranker",
    "documents": [
      "cats",
      "reranking documents improves retrieval"
    ],
    "top_k": 1
  }'
```

---

# Model Information

### CPU Stage

FlashRank model:

```
ms-marco-MiniLM-L-12-v2
```

Fast ONNX reranker optimized for CPU.

### GPU Stage

Cross-encoder model:

```
cross-encoder/ms-marco-MiniLM-L-6-v2
```

Widely used semantic reranker with good performance.

---

# Future Improvements

Possible enhancements:

* result caching
* Docker Compose support
* batch reranking API
* integration with mem0
* Prometheus metrics
* request logging

---

# License

MIT License