A lightweight self-hosted reranking service optimized for small GPUs, with CPU fallback. The server exposes a simple REST API that reranks documents for a given query.
The server uses a two-stage reranking pipeline to reduce GPU load while maintaining high ranking quality.
```
incoming documents
        │
        ▼
FlashRank CPU reranker
        │
        ▼
top 10 candidates
        │
        ▼
MiniLM cross-encoder (GPU if available)
        │
        ▼
final ranking
```
Advantages:

- ~70–90% less GPU usage
- very low latency
- stable operation on small GPUs
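The two-stage flow above can be sketched in a few lines of plain Python. The scoring functions here are simple stand-ins (token overlap), not the actual FlashRank or cross-encoder models, and the function names are hypothetical — the real implementation lives in `reranker_server.py`:

```python
# Two-stage reranking sketch: a cheap CPU-friendly scorer prunes the
# candidate set, then a more accurate (hypothetically GPU-backed) scorer
# ranks only the survivors. Both scorers are illustrative stand-ins.

def cheap_score(query, doc):
    # Stage 1 stand-in (FlashRank in the real server): query-term recall
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def accurate_score(query, doc):
    # Stage 2 stand-in (MiniLM cross-encoder in the real server): Jaccard
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q | d) or 1)

def two_stage_rerank(query, documents, stage1_k=10, top_k=5):
    # Stage 1: keep only the best stage1_k candidates on CPU
    survivors = sorted(documents, key=lambda d: cheap_score(query, d),
                       reverse=True)[:stage1_k]
    # Stage 2: run the heavier scorer on the survivors only
    ranked = sorted(survivors, key=lambda d: accurate_score(query, d),
                    reverse=True)
    return ranked[:top_k]

docs = ["a reranker sorts documents", "cats are mammals",
        "reranking improves search"]
print(two_stage_rerank("reranker documents", docs, stage1_k=2, top_k=1))
# → ['a reranker sorts documents']
```

Because stage 2 only ever sees `stage1_k` documents, the expensive model's workload is bounded regardless of how many documents arrive.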
```
reranker-server/
│
├── reranker_server.py
├── Dockerfile
├── requirements.txt
└── README.md
```
Clone the repository and create a virtual environment:

```bash
git clone <repo-url>
cd reranker-server
python3 -m venv venv
```

Activate the environment. Linux / macOS:

```bash
source venv/bin/activate
```

Windows:

```bash
venv\Scripts\activate
```

Install the dependencies:

```bash
pip install --upgrade pip
pip install -r requirements.txt
```
Required packages:

- fastapi
- uvicorn
- sentence-transformers
- flashrank
Start the server with:

```bash
uvicorn reranker_server:app --host 0.0.0.0 --port 5200
```

The server will be available at `http://localhost:5200`, with interactive API docs at `http://localhost:5200/docs`.
From the repository root:

```bash
docker build -t reranker-server .

docker run -d \
  --name reranker-server \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server
```
The `hf_cache` bind mount maps the local `./hf_cache` directory into the container. Downloaded models are written there and reused on every subsequent start, so nothing is re-downloaded.
Requires the NVIDIA Container Toolkit to be installed on the host.

```bash
docker run -d \
  --name reranker-server \
  --gpus all \
  -p 5200:5200 \
  -v "$(pwd)/hf_cache:/app/hf_cache" \
  reranker-server
```

Pass `--gpus '"device=0"'` instead of `--gpus all` to pin a specific GPU.
To rebuild from scratch, stop and remove the running container first (the name must match the one passed to `docker run`):

```bash
docker stop reranker-server
docker rm reranker-server
docker build --no-cache -t reranker-server .
```
`POST /rerank`

Rerank a list of documents for a query. Example request body:
```json
{
  "query": "What is a reranker?",
  "documents": [
    "A reranker sorts retrieved documents by relevance.",
    "Cats are mammals.",
    "Rerankers improve search pipelines."
  ],
  "top_k": 2
}
```
Example response:

```json
{
  "results": [
    {
      "text": "A reranker sorts retrieved documents by relevance.",
      "score": 0.84
    },
    {
      "text": "Rerankers improve search pipelines.",
      "score": 0.79
    }
  ],
  "backend": "gpu"
}
```
The `backend` field indicates which system produced the final ranking:

- `gpu` → cross-encoder used
- `cpu` → FlashRank only

`GET /health`

Returns the service status.
Example response:
```json
{
  "status": "ok",
  "cuda_available": true,
  "gpu_model_loaded": true
}
```
The server automatically detects CUDA. GPU reranking is used only when:

- CUDA is available
- sufficient VRAM is free
- no CUDA errors occur
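A minimal sketch of this fallback behaviour — assuming PyTorch is used for CUDA detection, which is plausible but not confirmed by the source; the real server also checks free VRAM and catches CUDA errors during inference:

```python
def pick_backend():
    """Return "gpu" if the CUDA cross-encoder can be used, else "cpu"."""
    try:
        import torch  # assumption: CUDA is detected via PyTorch
        if torch.cuda.is_available():
            return "gpu"
    except Exception:
        pass  # torch missing or CUDA error -> FlashRank CPU fallback
    return "cpu"

print(pick_backend())
```

The key property is that every failure path degrades to `"cpu"` rather than raising, so a broken GPU setup never takes the service down.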
Typical pipeline:

1. vector search → top 20 documents
2. reranker → top 5 documents
3. LLM context
Example curl request:

```bash
curl http://localhost:5200/rerank \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "query": "reranker",
    "documents": [
      "cats",
      "reranking documents improves retrieval"
    ],
    "top_k": 1
  }'
```
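The same request from Python, using only the standard library (a sketch; the URL assumes a local deployment on the default port):

```python
import json
from urllib import request

RERANK_URL = "http://localhost:5200/rerank"  # assumes a local deployment

def build_payload(query, documents, top_k=5):
    # Mirrors the JSON request body shown above
    return {"query": query, "documents": documents, "top_k": top_k}

def rerank(query, documents, top_k=5, url=RERANK_URL):
    """POST to /rerank and return the ranked document texts."""
    data = json.dumps(build_payload(query, documents, top_k)).encode("utf-8")
    req = request.Request(url, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return [r["text"] for r in body["results"]]
```

Dropping this into a retrieval step is then a one-liner: `context = rerank(user_query, hits, top_k=5)`.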
- FlashRank model: `ms-marco-MiniLM-L-12-v2`, a fast ONNX reranker optimized for CPU
- Cross-encoder model: `cross-encoder/ms-marco-MiniLM-L-6-v2`
Possible enhancements:

- result caching
- Docker Compose support
- batch reranking API
- integration with mem0
- Prometheus metrics
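As an illustration of the result-caching idea, a memoized wrapper could look like the sketch below. The ranking function is a placeholder, not the planned implementation; a real version would call the two-stage pipeline and would need a cache-invalidation story:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_rerank(query, documents, top_k):
    # documents is a tuple so all arguments are hashable for the cache key.
    # Placeholder ranking: count query occurrences in each document.
    ranked = sorted(documents,
                    key=lambda d: d.lower().count(query.lower()),
                    reverse=True)
    return tuple(ranked[:top_k])

r1 = cached_rerank("rerank", ("cats", "reranking improves retrieval"), 1)
r2 = cached_rerank("rerank", ("cats", "reranking improves retrieval"), 1)  # cache hit
```

Repeated identical requests then skip the model entirely, which matters most for hot queries in a RAG setup.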
MIT License