A lightweight self-hosted reranking service optimized for small GPUs and CPU fallback.
The server exposes a simple REST API that reranks documents for a given query. It is designed to integrate easily with RAG pipelines, mem0, LangChain, or custom retrieval systems.
The server uses a two-stage reranking pipeline to reduce GPU load while maintaining high ranking quality.
incoming documents
│
▼
FlashRank CPU reranker
│
▼
top 10 candidates
│
▼
MiniLM cross-encoder (GPU if available)
│
▼
final ranking
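The flow above can be sketched in a few lines of Python. This is only an illustration of the two-stage idea, not the code in reranker_server.py: the function name two_stage_rerank, its parameters, and the two toy scorers (stand-ins for FlashRank and the cross-encoder) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# A scorer maps (query, document) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def two_stage_rerank(
    query: str,
    documents: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    shortlist_size: int = 10,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Prefilter with a cheap scorer, then rescore only the shortlist."""
    # Stage 1: rank every document with the cheap (CPU) scorer.
    prefiltered = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)
    shortlist = prefiltered[:shortlist_size]

    # Stage 2: the expensive scorer only ever sees the shortlist.
    rescored = [(doc, precise_score(query, doc)) for doc in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]


# Hypothetical toy scorers, used here only so the sketch runs without models.
def word_overlap(query: str, doc: str) -> float:
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)


def overlap_with_length_penalty(query: str, doc: str) -> float:
    return word_overlap(query, doc) - 0.01 * len(doc.split())
```

The expensive scorer is called at most shortlist_size times per request, regardless of how many documents come in, which is where the reduction in GPU load comes from.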
Advantages:
reranker-server/
│
├── reranker_server.py
├── requirements.txt
└── README.md
git clone <repo-url>
cd reranker-server
python3 -m venv venv
Activate the environment.
Linux / macOS:
source venv/bin/activate
Windows:
venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
Required packages:
Start the server with:
uvicorn reranker_server:app --host 0.0.0.0 --port 5200
Server will run at:
http://localhost:5200
Interactive API docs:
http://localhost:5200/docs
POST /rerank
Rerank a list of documents for a query.
Example request:
{
"query": "What is a reranker?",
"documents": [
"A reranker sorts retrieved documents by relevance.",
"Cats are mammals.",
"Rerankers improve search pipelines."
],
"top_k": 2
}
Example response:
{
"results": [
{
"text": "A reranker sorts retrieved documents by relevance.",
"score": 0.84
},
{
"text": "Rerankers improve search pipelines.",
"score": 0.79
}
],
"backend": "gpu"
}
Field backend indicates which system produced the final ranking:
gpu → cross-encoder used
cpu → FlashRank only
GET /health
Example response:
{
"status": "ok",
"cuda_available": true,
"gpu_model_loaded": true
}
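A script that launches the server may want to wait until /health reports ok before sending traffic. A minimal sketch using only the standard library follows; the helper names fetch_health and wait_until_ready, the retry count, and the delay are assumptions for illustration.

```python
import json
import time
import urllib.request
from typing import Callable, Optional


def fetch_health(base_url: str = "http://localhost:5200") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,
    delay_s: float = 1.0,
) -> Optional[dict]:
    """Poll until status is "ok"; return the payload, or None if it never is."""
    for _ in range(attempts):
        try:
            payload = fetch()
            if payload.get("status") == "ok":
                return payload
        except OSError:
            # Connection refused or timed out: the server is probably still starting.
            pass
        time.sleep(delay_s)
    return None
```

The returned payload can then be inspected, for example checking gpu_model_loaded to see whether the cross-encoder stage is expected to run on the GPU.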
The server automatically detects CUDA.
GPU reranking is used only when:
If GPU fails, the system falls back to CPU automatically.
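One way such a fallback could be structured is shown below. This is a sketch under assumptions, not the server's actual logic: rerank_with_fallback and both callables are hypothetical, and catching RuntimeError reflects how PyTorch typically surfaces CUDA failures such as out-of-memory.

```python
from typing import Callable, List, Optional, Tuple

Ranked = List[Tuple[str, float]]
RerankFn = Callable[[str, List[str]], Ranked]


def rerank_with_fallback(
    query: str,
    documents: List[str],
    cpu_rerank: RerankFn,
    gpu_rerank: Optional[RerankFn] = None,
) -> Tuple[Ranked, str]:
    """Return (ranking, backend), where backend is "gpu" or "cpu"."""
    if gpu_rerank is not None:
        try:
            return gpu_rerank(query, documents), "gpu"
        except RuntimeError:
            # Assumed failure mode: a CUDA error raised during inference.
            pass
    # No GPU path configured, or it failed: use the CPU result instead.
    return cpu_rerank(query, documents), "cpu"
```

The second element of the returned tuple corresponds to the backend field in the /rerank response.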
Typical pipeline:
vector search → top 20 documents
reranker → top 5 documents
LLM context
This keeps latency low while improving retrieval quality.
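In code, the pipeline above might be wired together roughly like this. The function build_llm_context and the two callables are placeholders for whatever vector store and reranker client a project already has; none of these names come from this repository.

```python
from typing import Callable, List

# vector_search(query, k) returns up to k candidate documents.
SearchFn = Callable[[str, int], List[str]]
# rerank(query, documents, top_k) returns the top_k documents, best first.
RerankFn = Callable[[str, List[str], int], List[str]]


def build_llm_context(
    query: str,
    vector_search: SearchFn,
    rerank: RerankFn,
    retrieve_k: int = 20,
    context_k: int = 5,
    separator: str = "\n\n",
) -> str:
    """Retrieve broadly, rerank narrowly, and join the survivors into one context string."""
    candidates = vector_search(query, retrieve_k)
    best = rerank(query, candidates, context_k)
    return separator.join(best)
```

Only context_k documents reach the LLM prompt, so the prompt stays short even when the vector search is deliberately generous.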
Example curl request:
curl http://localhost:5200/rerank \
-X POST \
-H "Content-Type: application/json" \
-d '{
"query": "reranker",
"documents": [
"cats",
"reranking documents improves retrieval"
],
"top_k": 1
}'
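The same request can be sent from Python with only the standard library. This is a minimal client sketch; the helper names build_rerank_request and rerank are made up for the example, while the URL, payload fields, and header mirror the curl call above.

```python
import json
import urllib.request
from typing import List


def build_rerank_request(
    query: str,
    documents: List[str],
    top_k: int,
    base_url: str = "http://localhost:5200",
) -> urllib.request.Request:
    """Build the POST /rerank request without sending it."""
    body = json.dumps({"query": query, "documents": documents, "top_k": top_k})
    return urllib.request.Request(
        f"{base_url}/rerank",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(
    query: str,
    documents: List[str],
    top_k: int,
    base_url: str = "http://localhost:5200",
) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_rerank_request(query, documents, top_k, base_url)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

With the server running, rerank("reranker", ["cats", "reranking documents improves retrieval"], top_k=1) should return a dict with the results and backend keys described earlier.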
FlashRank model:
ms-marco-MiniLM-L-12-v2
Fast ONNX reranker optimized for CPU.
Cross-encoder model:
cross-encoder/ms-marco-MiniLM-L-6-v2
Widely used semantic reranker with good performance.
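For reference, both models can be tried on their own outside the server. The sketch below assumes the flashrank and sentence-transformers packages are installed; whether reranker_server.py loads the models exactly this way is not stated here, and the helper names are invented for the example.

```python
from typing import List, Tuple


def to_passages(documents: List[str]) -> List[dict]:
    """FlashRank takes passages as dicts with a "text" key; ids help map results back."""
    return [{"id": i, "text": text} for i, text in enumerate(documents)]


def flashrank_shortlist(query: str, documents: List[str], keep: int = 10) -> List[str]:
    """Stage 1: CPU ranking with the FlashRank ONNX model."""
    from flashrank import Ranker, RerankRequest  # imported lazily; needs flashrank

    ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2")
    ranked = ranker.rerank(RerankRequest(query=query, passages=to_passages(documents)))
    return [item["text"] for item in ranked[:keep]]


def cross_encoder_scores(query: str, documents: List[str]) -> List[Tuple[str, float]]:
    """Stage 2: score (query, document) pairs with the cross-encoder, best first."""
    from sentence_transformers import CrossEncoder  # needs sentence-transformers

    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in documents])
    pairs = zip(documents, (float(s) for s in scores))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

These helpers load the model on every call to keep the example short; a long-running service would normally load each model once at startup and reuse it.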
Possible enhancements:
MIT License