Reranking
This feature and its corresponding model are currently in preview mode. Implementation details may change before the official release. Access to the reranking model is available upon request.
Reranking models enable developers to score and reorder documents based on their relevance to a query. These examples will use the BAAI/bge-reranker-v2-m3 model, which is a state-of-the-art reranking model that supports multilingual text reranking.
Overview
The Reranking API allows you to score a list of documents against a query and return them in order of relevance. This is particularly useful for improving search results, filtering content, and enhancing retrieval-augmented generation (RAG) systems. The API is aligned with the Cohere rerank API specification.
Use Cases
Reranking improves search results (e-commerce product ranking, documentation prioritization, content discovery), enhances RAG systems (two-stage retrieval, context selection, answer quality improvement), and enables content filtering (recommendation systems, content moderation, personalization).
Quick Start
Endpoint
POST https://api.inference.nebul.io/v1/rerank
Alternatively, you can use the direct endpoint:
POST https://api.inference.nebul.io/rerank
Both endpoints behave identically.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | String | Yes | The model ID to use (e.g., BAAI/bge-reranker-v2-m3). |
| query | String | Yes | The search query to rank documents against. |
| documents | Array[String] | Yes | List of documents to rerank. |
| top_n | Integer | No | Number of top results to return. If not specified, all documents are returned sorted by relevance score. |
| return_documents | Boolean | No | Whether to return the original documents in the response. Defaults to false. |
| raw_scores | Boolean | No | Whether to return raw scores instead of normalized scores. Defaults to false. |
Code Examples
- Python
- cURL
Using the requests library:
```python
import requests

url = "https://api.inference.nebul.io/v1/rerank"

headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}

payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
        "Machine learning is a subset of artificial intelligence.",
        "The weather today is sunny and warm.",
        "Deep learning uses neural networks with multiple layers.",
        "Python is a popular programming language."
    ],
    "top_n": 3,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

for result in data["results"]:
    print(f"Score: {result['relevance_score']:.4f}")
    print(f"Document: {result['document']}")
    print(f"Index: {result['index']}\n")
```
```bash
curl -X POST https://api.inference.nebul.io/v1/rerank \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence.",
      "The weather today is sunny and warm.",
      "Deep learning uses neural networks with multiple layers.",
      "Python is a popular programming language."
    ],
    "top_n": 3,
    "return_documents": true
  }'
```
Top N Selection: Use the top_n parameter to limit results to the most relevant documents. This reduces response size and improves performance when you only need the top results. If not specified, all documents are returned sorted by relevance score.
Score Interpretation: Relevance scores are normalized between 0 and 1, with higher scores indicating better relevance. Scores above 0.7 typically indicate strong relevance, while scores below 0.3 suggest weak relevance.
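As an illustration of the thresholds above, you can filter results locally after a rerank call. The response data below is a made-up sample in the shape described under Response Format; the 0.7 cutoff follows the guidance in this section.

```python
# Illustrative rerank response; field names follow the API's response format.
data = {
    "results": [
        {"index": 0, "relevance_score": 0.9876},
        {"index": 2, "relevance_score": 0.8543},
        {"index": 3, "relevance_score": 0.1234},
    ]
}

# Keep only strongly relevant documents (score above 0.7).
strong = [r for r in data["results"] if r["relevance_score"] > 0.7]
print([r["index"] for r in strong])  # → [0, 2]
```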
Response Format
The API returns a JSON object containing the reranked results:
```json
{
  "object": "rerank",
  "id": "rerank-12345",
  "model": "BAAI/bge-reranker-v2-m3",
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9876,
      "document": "Machine learning is a subset of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.8543,
      "document": "Deep learning uses neural networks with multiple layers."
    },
    {
      "index": 3,
      "relevance_score": 0.1234,
      "document": "Python is a popular programming language."
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "total_tokens": 45
  },
  "created": 1731500000
}
```
Response Fields
| Field | Type | Description |
|---|---|---|
| object | String | Always "rerank" for reranking responses. |
| id | String | Unique identifier for the reranking request. |
| model | String | The model ID used for reranking. |
| results | Array | List of reranked documents, sorted by relevance score (highest first). |
| results[].index | Integer | Original index of the document in the input array. |
| results[].relevance_score | Number | Relevance score between the query and document (higher is more relevant). |
| results[].document | String | The document text (only included if return_documents is true). |
| usage | Object | Token usage information. |
| usage.prompt_tokens | Integer | Number of tokens in the input. |
| usage.total_tokens | Integer | Total tokens processed. |
| created | Integer | Unix timestamp of when the request was created. |
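Because each result carries the original index, you can recover the document text locally even when return_documents is false, which keeps responses small. A minimal sketch (the results list is illustrative):

```python
documents = [
    "Machine learning is a subset of artificial intelligence.",
    "The weather today is sunny and warm.",
    "Deep learning uses neural networks with multiple layers.",
]

# Illustrative response with return_documents=false: no "document" field.
results = [
    {"index": 2, "relevance_score": 0.91},
    {"index": 0, "relevance_score": 0.88},
]

# Map each result back to the original document by its index.
ranked_docs = [documents[r["index"]] for r in results]
print(ranked_docs[0])  # → "Deep learning uses neural networks with multiple layers."
```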
Advanced Usage
Limiting Results with top_n
Use top_n to return only the most relevant documents:
- Python
- cURL
```python
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is artificial intelligence?",
    "documents": [
        "AI is the simulation of human intelligence.",
        "The sky is blue.",
        "Machine learning enables AI systems to learn.",
        "Today is Monday.",
        "Neural networks are used in AI."
    ],
    "top_n": 2,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

print(f"Top {len(data['results'])} most relevant documents:")
for result in data["results"]:
    print(f"  - {result['document']} (score: {result['relevance_score']:.4f})")
```
```bash
curl -X POST https://api.inference.nebul.io/v1/rerank \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is artificial intelligence?",
    "documents": [
      "AI is the simulation of human intelligence.",
      "The sky is blue.",
      "Machine learning enables AI systems to learn.",
      "Today is Monday.",
      "Neural networks are used in AI."
    ],
    "top_n": 2,
    "return_documents": true
  }'
```
Using Raw Scores
Get raw (unnormalized) scores instead of normalized relevance scores:
- Python
- cURL
```python
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "Python programming",
    "documents": [
        "Python is a high-level programming language.",
        "Java is another programming language.",
        "Python supports multiple programming paradigms."
    ],
    "raw_scores": True,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

for result in data["results"]:
    print(f"Raw score: {result['relevance_score']}")
    print(f"Document: {result['document']}\n")
```
```bash
curl -X POST https://api.inference.nebul.io/v1/rerank \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "Python programming",
    "documents": [
      "Python is a high-level programming language.",
      "Java is another programming language.",
      "Python supports multiple programming paradigms."
    ],
    "raw_scores": true,
    "return_documents": true
  }'
```
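Cross-encoder rerankers in the bge family typically emit raw scores as unbounded logits. A common convention for normalizing such logits into the 0-1 range is the sigmoid function; note that this is an assumption about the relationship between the two score types, not something the API documentation above specifies.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to the (0, 1) range; a common normalization for reranker scores."""
    return 1.0 / (1.0 + math.exp(-x))

# A raw score of 0.0 maps to 0.5; large positive logits approach 1.
print(sigmoid(0.0))  # → 0.5
print(sigmoid(4.0))
```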
Multilingual Reranking
The model supports multilingual queries and documents:
- Python
- cURL
```python
# Dutch query with mixed-language documents
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "Wat is kunstmatige intelligentie?",
    "documents": [
        "Artificial intelligence simulates human intelligence.",
        "Kunstmatige intelligentie simuleert menselijke intelligentie.",
        "L'intelligence artificielle simule l'intelligence humaine.",
        "The weather is nice today."
    ],
    "top_n": 2,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

for result in data["results"]:
    print(f"Score: {result['relevance_score']:.4f}")
    print(f"Document: {result['document']}\n")
```
```bash
curl -X POST https://api.inference.nebul.io/v1/rerank \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "Wat is kunstmatige intelligentie?",
    "documents": [
      "Artificial intelligence simulates human intelligence.",
      "Kunstmatige intelligentie simuleert menselijke intelligentie.",
      "L'\''intelligence artificielle simule l'\''intelligence humaine.",
      "The weather is nice today."
    ],
    "top_n": 2,
    "return_documents": true
  }'
```
RAG Integration Example
Reranking is commonly used to improve retrieval-augmented generation (RAG) systems:
- Python
- cURL
```python
# Step 1: Retrieve candidate documents (e.g., from a vector database)
candidate_docs = [
    "Machine learning algorithms can be supervised or unsupervised.",
    "The capital of France is Paris.",
    "Neural networks consist of layers of interconnected nodes.",
    "Python has a large ecosystem of libraries for data science.",
    "Deep learning is a subset of machine learning."
]

# Step 2: Rerank documents based on the user query
user_query = "How do neural networks work?"

payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": user_query,
    "documents": candidate_docs,
    "top_n": 2,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

# Step 3: Use top-ranked documents for context
top_documents = [result["document"] for result in data["results"]]
context = "\n\n".join(top_documents)

print("Top relevant documents for context:")
for i, doc in enumerate(top_documents, 1):
    print(f"{i}. {doc}")
```
```bash
curl -X POST https://api.inference.nebul.io/v1/rerank \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "How do neural networks work?",
    "documents": [
      "Machine learning algorithms can be supervised or unsupervised.",
      "The capital of France is Paris.",
      "Neural networks consist of layers of interconnected nodes.",
      "Python has a large ecosystem of libraries for data science.",
      "Deep learning is a subset of machine learning."
    ],
    "top_n": 2,
    "return_documents": true
  }'
```
Model Specifications
The following reranking models are available:
- BAAI/bge-reranker-v2-m3: 568M parameters, 8K context window, float16 precision; supports multilingual text reranking in 100+ languages. (Preview)
Best Practices
- Batch Processing: Rerank multiple queries in separate requests rather than batching queries together.
- Document Length: Keep documents reasonably sized. Very long documents may be truncated or split.
- Top N Selection: Use top_n to limit results when you only need the most relevant documents, which can improve response time.
- Score Interpretation:
  - Normalized scores (default): typically range from 0 to 1, higher is more relevant
  - Raw scores: unnormalized; the range can vary depending on the model
- Multilingual Queries: The model handles multilingual queries well, but matching the query language to the document language can improve results.
- RAG Workflow: Combine reranking with vector search for best results:
  - Use vector search to retrieve a larger candidate set (e.g., 50-100 documents)
  - Rerank the candidates to get the top N most relevant (e.g., top 5-10)
  - Use the reranked results as context for your LLM
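The two-stage RAG workflow above can be sketched as a small function. Here, rerank_fn is a stand-in for a call to the rerank endpoint (as shown in the Quick Start); the stub below is a toy word-overlap scorer used purely for illustration, not a real reranking model.

```python
def two_stage_retrieve(query, candidates, rerank_fn, top_n=5):
    """Stage 2 of a RAG pipeline: rerank a candidate set from vector search.

    rerank_fn(query, documents, top_n) is expected to return results in the
    API's shape: a list of {"index": ..., "relevance_score": ...} dicts,
    sorted by relevance score (highest first).
    """
    results = rerank_fn(query, candidates, top_n)
    # Map result indices back to the candidate documents for LLM context.
    return [candidates[r["index"]] for r in results]

def stub_rerank(query, documents, top_n):
    """Toy scorer (word overlap) standing in for the rerank API call."""
    scored = [
        {"index": i,
         "relevance_score": float(len(set(query.lower().split()) & set(d.lower().split())))}
        for i, d in enumerate(documents)
    ]
    scored.sort(key=lambda r: r["relevance_score"], reverse=True)
    return scored[:top_n]

candidates = [
    "Neural networks consist of layers of interconnected nodes.",
    "The capital of France is Paris.",
]
top_docs = two_stage_retrieve("how do neural networks work", candidates, stub_rerank, top_n=1)
context = "\n\n".join(top_docs)  # feed this to your LLM as context
```

In production, rerank_fn would POST the query and candidates to the rerank endpoint and return the parsed results array; only the scoring stub changes.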