Reranking

PREVIEW MODE

This feature and its corresponding model are currently in preview mode. Implementation details may change before the official release. Access to the reranking model is available upon request.

Reranking models enable developers to score and reorder documents based on their relevance to a query. The examples in this guide use BAAI/bge-reranker-v2-m3, a state-of-the-art reranking model with multilingual support.

Overview

The Reranking API allows you to score a list of documents against a query and return them in order of relevance. This is particularly useful for improving search results, filtering content, and enhancing retrieval-augmented generation (RAG) systems. The API is aligned with the Cohere rerank API specification.

Use Cases

Reranking is useful for:

  • Improving search results: e-commerce product ranking, documentation prioritization, content discovery
  • Enhancing RAG systems: two-stage retrieval, context selection, answer quality improvement
  • Content filtering: recommendation systems, content moderation, personalization

Quick Start

Endpoint

POST https://api.inference.nebul.io/v1/rerank

Alternatively, you can use the direct endpoint:

POST https://api.inference.nebul.io/rerank

Both endpoints behave identically.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | String | Yes | The model ID to use (e.g., BAAI/bge-reranker-v2-m3). |
| query | String | Yes | The search query to rank documents against. |
| documents | Array[String] | Yes | List of documents to rerank. |
| top_n | Integer | No | Number of top results to return. If not specified, all documents are returned sorted by relevance score. |
| return_documents | Boolean | No | Whether to return the original documents in the response. Defaults to false. |
| raw_scores | Boolean | No | Whether to return raw scores instead of normalized scores. Defaults to false. |

Code Examples

Using the requests library:

```python
import requests

url = "https://api.inference.nebul.io/v1/rerank"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is machine learning?",
    "documents": [
        "Machine learning is a subset of artificial intelligence.",
        "The weather today is sunny and warm.",
        "Deep learning uses neural networks with multiple layers.",
        "Python is a popular programming language."
    ],
    "top_n": 3,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

for result in data["results"]:
    print(f"Score: {result['relevance_score']:.4f}")
    print(f"Document: {result['document']}")
    print(f"Index: {result['index']}\n")
```
Tip: Use the top_n parameter to limit results to the most relevant documents. This reduces response size and improves performance when you only need the top results. If omitted, all documents are returned, sorted by relevance score.

Tip: Relevance scores are normalized between 0 and 1, with higher scores indicating greater relevance. Scores above 0.7 typically indicate strong relevance, while scores below 0.3 suggest weak relevance.
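As an illustration of these thresholds, a small helper can drop weak matches from a reranked result set. The helper and the example result dicts below are a sketch (the API itself has no filtering parameter beyond top_n); the field names match the response format documented in this guide.

```python
# Sketch: keep only results whose normalized score clears a relevance
# threshold. The 0.7 / 0.3 cut-offs follow the guidance above.
STRONG, WEAK = 0.7, 0.3

def filter_by_relevance(results, threshold=WEAK):
    """Keep only results at or above the threshold; order is preserved."""
    return [r for r in results if r["relevance_score"] >= threshold]

results = [  # shaped like the API's "results" array, sorted by score
    {"index": 0, "relevance_score": 0.9876},
    {"index": 2, "relevance_score": 0.8543},
    {"index": 3, "relevance_score": 0.1234},
]
kept = filter_by_relevance(results)  # drops the 0.1234 result
print([r["index"] for r in kept])
```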

Response Format

The API returns a JSON object containing the reranked results:

```json
{
  "object": "rerank",
  "id": "rerank-12345",
  "model": "BAAI/bge-reranker-v2-m3",
  "results": [
    {
      "index": 0,
      "relevance_score": 0.9876,
      "document": "Machine learning is a subset of artificial intelligence."
    },
    {
      "index": 2,
      "relevance_score": 0.8543,
      "document": "Deep learning uses neural networks with multiple layers."
    },
    {
      "index": 3,
      "relevance_score": 0.1234,
      "document": "Python is a popular programming language."
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "total_tokens": 45
  },
  "created": 1731500000
}
```

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| object | String | Always "rerank" for reranking responses. |
| id | String | Unique identifier for the reranking request. |
| model | String | The model ID used for reranking. |
| results | Array | List of reranked documents, sorted by relevance score (highest first). |
| results[].index | Integer | Original index of the document in the input array. |
| results[].relevance_score | Number | Relevance score between the query and document (higher is more relevant). |
| results[].document | String | The document text (only included if return_documents is true). |
| usage | Object | Token usage information. |
| usage.prompt_tokens | Integer | Number of tokens in the input. |
| usage.total_tokens | Integer | Total tokens processed. |
| created | Integer | Unix timestamp of when the request was created. |
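When return_documents is false, each result carries only its original index. A small helper (illustrative, not part of the API) can join the scores back to your input list:

```python
# Sketch: re-attach input documents to rerank results via the `index`
# field when return_documents=False. Field names follow the response
# format above; the helper name is ours.
def attach_documents(results, documents):
    """Return (document, relevance_score) pairs in ranked order."""
    return [(documents[r["index"]], r["relevance_score"]) for r in results]

docs = [
    "Machine learning is a subset of artificial intelligence.",
    "The weather today is sunny and warm.",
    "Deep learning uses neural networks with multiple layers.",
    "Python is a popular programming language.",
]
results = [  # as returned by the API, sorted by score
    {"index": 0, "relevance_score": 0.9876},
    {"index": 2, "relevance_score": 0.8543},
    {"index": 3, "relevance_score": 0.1234},
]
for doc, score in attach_documents(results, docs):
    print(f"{score:.4f}  {doc}")
```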

Advanced Usage

Limiting Results with top_n

Use top_n to return only the most relevant documents:

```python
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "What is artificial intelligence?",
    "documents": [
        "AI is the simulation of human intelligence.",
        "The sky is blue.",
        "Machine learning enables AI systems to learn.",
        "Today is Monday.",
        "Neural networks are used in AI."
    ],
    "top_n": 2,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

print(f"Top {len(data['results'])} most relevant documents:")
for result in data["results"]:
    print(f"  - {result['document']} (score: {result['relevance_score']:.4f})")
```

Using Raw Scores

Get raw (unnormalized) scores instead of normalized relevance scores:

```python
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "Python programming",
    "documents": [
        "Python is a high-level programming language.",
        "Java is another programming language.",
        "Python supports multiple programming paradigms."
    ],
    "raw_scores": True,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

for result in data["results"]:
    print(f"Raw score: {result['relevance_score']}")
    print(f"Document: {result['document']}\n")
```

Multilingual Reranking

The model supports multilingual queries and documents:

```python
# Dutch query with mixed-language documents
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": "Wat is kunstmatige intelligentie?",
    "documents": [
        "Artificial intelligence simulates human intelligence.",
        "Kunstmatige intelligentie simuleert menselijke intelligentie.",
        "L'intelligence artificielle simule l'intelligence humaine.",
        "The weather is nice today."
    ],
    "top_n": 2,
    "return_documents": True
}

response = requests.post(url, headers=headers, json=payload)
data = response.json()

for result in data["results"]:
    print(f"Score: {result['relevance_score']:.4f}")
    print(f"Document: {result['document']}\n")
```

RAG Integration Example

Reranking is commonly used to improve retrieval-augmented generation (RAG) systems:

```python
# Step 1: Retrieve candidate documents (e.g., from a vector database)
candidate_docs = [
    "Machine learning algorithms can be supervised or unsupervised.",
    "The capital of France is Paris.",
    "Neural networks consist of layers of interconnected nodes.",
    "Python has a large ecosystem of libraries for data science.",
    "Deep learning is a subset of machine learning."
]

# Step 2: Rerank documents based on the user query
user_query = "How do neural networks work?"
payload = {
    "model": "BAAI/bge-reranker-v2-m3",
    "query": user_query,
    "documents": candidate_docs,
    "top_n": 2,
    "return_documents": True
}
response = requests.post(url, headers=headers, json=payload)
data = response.json()

# Step 3: Use top-ranked documents as context
top_documents = [result["document"] for result in data["results"]]
context = "\n\n".join(top_documents)

print("Top relevant documents for context:")
for i, doc in enumerate(top_documents, 1):
    print(f"{i}. {doc}")
```

Model Specifications

The following reranking models are available:

  • BAAI/bge-reranker-v2-m3 - 568M parameters, 8K context window, float16 precision, multilingual text support (100+ languages) (Preview)

Best Practices

  1. Batch Processing: Rerank multiple queries in separate requests rather than batching queries together.

  2. Document Length: Keep documents reasonably sized. Very long documents may be truncated or split.

  3. Top N Selection: Use top_n to limit results when you only need the most relevant documents, which can improve response time.

  4. Score Interpretation:

    • Normalized scores (default): Range typically 0-1, higher is more relevant
    • Raw scores: Unnormalized, can vary in range depending on the model
  5. Multilingual Queries: The model handles multilingual queries well, but matching the query language to document languages can improve results.

  6. RAG Workflow: Combine with vector search for best results:

    • Use vector search to retrieve a larger candidate set (e.g., 50-100 documents)
    • Rerank the candidates to get the top N most relevant (e.g., top 5-10)
    • Use the reranked results as context for your LLM
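The two-stage workflow above can be sketched end to end. The helper names (build_rerank_payload, rerank_top) are illustrative, and vector_search stands in for your own retrieval layer; only the endpoint, model ID, and request body come from this guide.

```python
import requests

API_URL = "https://api.inference.nebul.io/v1/rerank"

def build_rerank_payload(query, candidates, top_n=5):
    """Stage-2 request body, matching the parameters table in this guide."""
    return {
        "model": "BAAI/bge-reranker-v2-m3",
        "query": query,
        "documents": candidates,
        "top_n": top_n,
        "return_documents": True,
    }

def rerank_top(api_key, query, candidates, top_n=5):
    """Rerank vector-search candidates; return the top_n document texts."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        json=build_rerank_payload(query, candidates, top_n),
        timeout=30,
    )
    resp.raise_for_status()  # surface 4xx/5xx errors early
    return [r["document"] for r in resp.json()["results"]]

# Typical use, assuming `vector_search` is your own retrieval function:
#   candidates = vector_search(query, k=50)   # stage 1: broad recall
#   context = "\n\n".join(rerank_top(API_KEY, query, candidates, top_n=5))
```

Keeping the payload construction separate from the HTTP call makes the request body easy to inspect and unit-test without network access.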