# Local Two-Stage Reranker Server

A lightweight **self-hosted reranking service** optimized for small GPUs and CPU fallback. The server exposes a simple REST API that reranks documents for a given query. It is designed to integrate easily with **RAG pipelines**, **mem0**, **LangChain**, or custom retrieval systems.

---

# Features

* Two-stage reranking pipeline
* Fast **CPU FlashRank filtering**
* Accurate **MiniLM GPU cross-encoder**
* **Automatic GPU detection**
* **VRAM safety check**
* **CUDA OOM fallback**
* Optimized for **small GPUs (e.g. GTX 1650 4GB)**
* Works fully **CPU-only**
* Simple **FastAPI REST interface**
* Built-in **health endpoint**
* **Docker support** with persistent model cache

---

# Architecture

The server uses a **two-stage reranking pipeline** to reduce GPU load while maintaining high ranking quality.

```
incoming documents
        │
        ▼
FlashRank CPU reranker
        │
        ▼
 top 10 candidates
        │
        ▼
MiniLM cross-encoder (GPU if available)
        │
        ▼
  final ranking
```

Advantages:

* ~70-90% less GPU usage
* very low latency
* stable operation on small GPUs
* safe fallback when GPU memory is unavailable

---

# Project Structure

```
reranker-server/
│
├── reranker_server.py
├── Dockerfile
├── requirements.txt
└── README.md
```

---

# Installation

## 1. Clone the repository

```
git clone <repository-url>
cd reranker-server
```

---

## 2. Create a Python environment

```
python3 -m venv venv
```

Activate the environment.

Linux / macOS:

```
source venv/bin/activate
```

Windows:

```
venv\Scripts\activate
```

---

## 3. Install dependencies

```
pip install --upgrade pip
pip install -r requirements.txt
```

Required packages:

* fastapi
* uvicorn
* sentence-transformers
* flashrank
* torch

---

# Running the Server

Start the server with:

```
uvicorn reranker_server:app --host 0.0.0.0 --port 5200
```

The server will be available at:

```
http://localhost:5200
```

Interactive API docs:

```
http://localhost:5200/docs
```

---

# Docker

## Build the image

From the repository root:

```
docker build -t reranker-server .
```

---

## Run the container (CPU only)

```
docker run -d \
  --name reranker-server \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server
```

The `hf_cache` bind-mount maps the local `./hf_cache` directory into the container. Downloaded models are written there and reused on every subsequent start, so nothing is re-downloaded.

---

## Run the container (GPU)

Requires the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) to be installed on the host.

```
docker run -d \
  --name reranker-server \
  --gpus all \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server
```

Pass `--gpus '"device=0"'` instead of `--gpus all` to pin a specific GPU.

---

## Stop / remove the container

```
docker stop reranker-server
docker rm reranker-server
```

---

## Rebuild after code changes

```
docker build --no-cache -t reranker-server .
```

---

# API

## POST `/rerank`

Rerank a list of documents for a query.

### Request

```
{
  "query": "What is a reranker?",
  "documents": [
    "A reranker sorts retrieved documents by relevance.",
    "Cats are mammals.",
    "Rerankers improve search pipelines."
  ],
  "top_k": 2
}
```

### Response

```
{
  "results": [
    {
      "text": "A reranker sorts retrieved documents by relevance.",
      "score": 0.84
    },
    {
      "text": "Rerankers improve search pipelines.",
      "score": 0.79
    }
  ],
  "backend": "gpu"
}
```

The `backend` field indicates which stage produced the final ranking:

* `gpu` → cross-encoder used
* `cpu` → FlashRank only

---

# Health Check

```
GET /health
```

Example response:

```
{
  "status": "ok",
  "cuda_available": true,
  "gpu_model_loaded": true
}
```

---

# GPU Usage

The server automatically detects CUDA. GPU reranking is used only when:

* CUDA is available
* sufficient VRAM is free
* no CUDA errors occur

If the GPU stage fails, the server **falls back to CPU automatically**.
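The two-stage flow and the CPU fallback can be sketched in plain Python. This is an illustrative sketch only: the function names (`rerank`, `overlap`) and the generic `RuntimeError` are assumptions standing in for the server's real FlashRank call, cross-encoder call, and CUDA out-of-memory exception.

```python
# Sketch of the two-stage rerank with automatic CPU fallback.
# The scoring functions are injected stand-ins, not the server's
# actual FlashRank / cross-encoder code.

def rerank(query, documents, top_k, cpu_score, gpu_score=None):
    """Stage 1: CPU filter to 10 candidates. Stage 2: rescore on the
    GPU scorer if available, falling back to CPU scores on failure."""
    # Stage 1: keep the 10 best candidates by CPU score (FlashRank's role)
    candidates = sorted(
        documents, key=lambda d: cpu_score(query, d), reverse=True
    )[:10]

    score, backend = cpu_score, "cpu"
    if gpu_score is not None:
        try:
            # Probe the GPU scorer; a CUDA OOM would raise here
            gpu_score(query, candidates[0])
            score, backend = gpu_score, "gpu"
        except RuntimeError:
            pass  # stands in for catching torch.cuda.OutOfMemoryError

    ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
    return {
        "results": [
            {"text": d, "score": score(query, d)} for d in ranked[:top_k]
        ],
        "backend": backend,
    }


def overlap(query, doc):
    """Toy CPU scorer: count words shared between query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))
```

The returned dict mirrors the `/rerank` response shape above: ranked `results` plus a `backend` flag reporting whether the GPU stage ran.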
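For scripted use, the `/rerank` endpoint can also be called from Python with only the standard library. A minimal sketch, assuming the server is running on the default port 5200; the helper names `build_payload` and `rerank_remote` are illustrative, not part of the server:

```python
import json
import urllib.request

RERANK_URL = "http://localhost:5200/rerank"  # default host/port from above


def build_payload(query, documents, top_k=5):
    """Encode a /rerank request body as JSON bytes."""
    return json.dumps(
        {"query": query, "documents": documents, "top_k": top_k}
    ).encode("utf-8")


def rerank_remote(query, documents, top_k=5, url=RERANK_URL):
    """POST a rerank request and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=build_payload(query, documents, top_k),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

`rerank_remote` returns the same JSON shape shown in the Response example above (`results` plus `backend`).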
---

# Recommended RAG Settings

Typical pipeline:

```
vector search → top 20 documents
reranker      → top 5 documents
LLM context
```

This keeps latency low while improving retrieval quality.

---

# Testing the Server

Example curl request:

```
curl http://localhost:5200/rerank \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "query": "reranker",
    "documents": [
      "cats",
      "reranking documents improves retrieval"
    ],
    "top_k": 1
  }'
```

---

# Model Information

### CPU Stage

FlashRank model:

```
ms-marco-MiniLM-L-12-v2
```

Fast ONNX reranker optimized for CPU.

### GPU Stage

Cross-encoder model:

```
cross-encoder/ms-marco-MiniLM-L-6-v2
```

Widely used semantic reranker with good performance.

---

# Future Improvements

Possible enhancements:

* result caching
* Docker Compose support
* batch reranking API
* integration with mem0
* Prometheus metrics
* request logging

---

# License

MIT License