Completions & Responses API

The Completions and Responses APIs provide OpenAI-compatible endpoints for text generation. The /v1/completions endpoint is designed for simple text completion tasks, while the /v1/responses endpoint offers advanced features like asynchronous processing and reasoning support.

Overview

The API provides two main endpoints:

  • /v1/completions: OpenAI-compatible text completion endpoint for generating text from a prompt.
  • /v1/responses: Advanced responses API with support for asynchronous processing, reasoning, and extended capabilities.

Completions Endpoint

Endpoint

POST https://api.inference.nebul.io/v1/completions

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | String | Yes | The model ID to use. |
| prompt | String or Array | Yes | The prompt(s) to complete. Can be a string or array of strings. |
| temperature | Number | No | Sampling temperature (0.0-2.0). Defaults to 1.0. |
| top_p | Number | No | Nucleus sampling parameter (0.0-1.0). Defaults to 1.0. |
| n | Integer | No | Number of completions to generate. Defaults to 1. |
| stream | Boolean | No | Whether to stream responses. Defaults to false. |
| stop | String or Array | No | Stop sequences. Can be a string or array of strings. |
| max_tokens | Integer | No | Maximum number of tokens to generate. Defaults to 16. |
| presence_penalty | Number | No | Penalty for token presence (-2.0 to 2.0). Defaults to 0.0. |
| frequency_penalty | Number | No | Penalty for token frequency (-2.0 to 2.0). Defaults to 0.0. |
| seed | Integer | No | Random seed for reproducible outputs. |
| echo | Boolean | No | Echo back the prompt in addition to the completion. Defaults to false. |
| suffix | String | No | Suffix that comes after a completion of inserted text. |
| best_of | Integer | No | Generates best_of completions and returns the one with the highest log probability. |
| logprobs | Integer | No | Include log probabilities on the logprobs most likely tokens. |
| user | String | No | Unique identifier for the end user. |
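The numeric ranges above can be checked client-side before a request is sent, which surfaces mistakes earlier than a server-side 400. A minimal sketch (the `build_completion_payload` helper is illustrative, not part of the API):

```python
def build_completion_payload(model, prompt, **options):
    """Assemble a /v1/completions request body, validating a few ranges.

    Only temperature and top_p are checked here; the server validates
    the full parameter set.
    """
    if not 0.0 <= options.get("temperature", 1.0) <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if not 0.0 <= options.get("top_p", 1.0) <= 1.0:
        raise ValueError("top_p must be between 0.0 and 1.0")
    return {"model": model, "prompt": prompt, **options}
```

This payload can then be passed directly as the `json` argument of `requests.post`.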

Code Examples

```python
import requests

url = "https://api.inference.nebul.io/v1/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "prompt": "The capital of France is",
    "max_tokens": 10,
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```
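When `stream` is set to true, OpenAI-compatible servers typically deliver the completion as server-sent events: one `data: {...}` line per chunk, terminated by a `data: [DONE]` sentinel. A sketch of a chunk parser (the helper name is ours, and the exact event framing may vary by deployment):

```python
import json

def iter_completion_chunks(lines):
    """Parse server-sent-event lines ("data: {...}") into JSON chunks.

    Skips blank keep-alive lines and stops at the "[DONE]" sentinel
    used by OpenAI-compatible streaming endpoints.
    """
    for line in lines:
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        yield json.loads(data)

# With requests, a streaming call might look like:
# response = requests.post(url, headers=headers,
#                          json={**payload, "stream": True}, stream=True)
# for chunk in iter_completion_chunks(response.iter_lines(decode_unicode=True)):
#     print(chunk["choices"][0]["text"], end="", flush=True)
```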

> **Tip — Asynchronous Processing:** The /v1/responses endpoint is designed for long-running requests. Use the request ID returned in the response to check status and retrieve results asynchronously, especially for reasoning models that may take longer to process.

> **Tip — Reasoning Models:** Some models support extended reasoning capabilities. When using reasoning models, the response may include intermediate reasoning steps and extended metadata beyond standard completions.

Response Format

```json
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1677652288,
  "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
  "choices": [
    {
      "text": " Paris.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 2,
    "total_tokens": 7
  }
}
```
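Reading the generated text out of this structure is a matter of walking the `choices` array; for example:

```python
def extract_completion_texts(response_json):
    """Return the generated text of each choice, ordered by index."""
    choices = sorted(response_json["choices"], key=lambda c: c["index"])
    return [c["text"] for c in choices]

result = {
    "id": "cmpl-abc123",
    "choices": [
        {"text": " Paris.", "index": 0, "logprobs": None, "finish_reason": "stop"}
    ],
}
print(extract_completion_texts(result))  # [' Paris.']
```

With `n` greater than 1, the same helper returns one string per requested completion.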

Responses Endpoint

The /v1/responses endpoint provides advanced features including asynchronous processing, reasoning support, and extended capabilities beyond the standard completions API.

Endpoint

POST https://api.inference.nebul.io/v1/responses

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | String | Yes | The model ID to use. |
| input | String or Array | Yes | The input(s) to process. Can be a string or array of inputs. |
| temperature | Number | No | Sampling temperature (0.0-2.0). Defaults to 1.0. |
| top_p | Number | No | Nucleus sampling parameter (0.0-1.0). Defaults to 1.0. |
| stream | Boolean | No | Whether to stream responses. Defaults to false. |
| max_output_tokens | Integer | No | Maximum number of tokens to generate. Defaults to 16. |
| max_tool_calls | Integer | No | Maximum number of tool calls. |
| user | String | No | Unique identifier for the end user. |

Features

The Responses API is designed for more complex use cases that require:

  • Asynchronous request processing
  • Reasoning model support
  • Extended response metadata
  • Request cancellation and status tracking

Basic Usage

```python
import requests

url = "https://api.inference.nebul.io/v1/responses"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}

# Create a response request
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "input": "Explain quantum computing in simple terms.",
    "max_output_tokens": 200
}
response = requests.post(url, headers=headers, json=payload)
response_data = response.json()

# The response includes a response_id for tracking
response_id = response_data.get("id")

# Retrieve the response status
if response_id:
    status_url = f"https://api.inference.nebul.io/v1/responses/{response_id}"
    status_response = requests.get(status_url, headers=headers)
    print(status_response.json())
```

Retrieving Responses

After creating a response request, you can retrieve the status and results using the response ID:

Endpoint: GET https://api.inference.nebul.io/v1/responses/{response_id}

```python
import requests

# response_id comes from the id field of the create call's response
url = f"https://api.inference.nebul.io/v1/responses/{response_id}"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
}
response = requests.get(url, headers=headers)
print(response.json())
```
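In practice this status check is wrapped in a polling loop. A sketch with an injectable fetcher so the loop itself stays testable (note: the `status` field and terminal values such as `"completed"` are assumptions about the response shape, not documented above; adjust them to what your deployment actually returns):

```python
import time

def poll_response(fetch_status, response_id, interval=1.0, timeout=60.0):
    """Poll fetch_status(response_id) until it reports a terminal state.

    fetch_status should return the parsed JSON of
    GET /v1/responses/{response_id}. Assumed terminal states:
    "completed", "failed", "cancelled".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = fetch_status(response_id)
        if data.get("status") in ("completed", "failed", "cancelled"):
            return data
        time.sleep(interval)
    raise TimeoutError(f"response {response_id} did not finish within {timeout}s")
```

A real fetcher would be `lambda rid: requests.get(f"{base_url}/v1/responses/{rid}", headers=headers).json()`.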

Canceling Responses

You can cancel an in-progress response request:

Endpoint: POST https://api.inference.nebul.io/v1/responses/{response_id}/cancel

```python
import requests

# response_id comes from the id field of the create call's response
url = f"https://api.inference.nebul.io/v1/responses/{response_id}/cancel"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
}
response = requests.post(url, headers=headers)
print(response.json())
```

Model Specifications

The Responses API supports the same LLM models as the Chat Completions API. See the Chat Completions Model Specifications for a complete list of available models.

Reasoning Support

The Responses API supports reasoning models that provide step-by-step reasoning before generating a final answer. See the Reasoning Models guide for detailed information about reasoning-capable models and how to use them.

Best Practices

  1. Use Completions for Simple Tasks: Use /v1/completions for straightforward text completion tasks.
  2. Use Responses for Complex Workflows: Use /v1/responses when you need asynchronous processing, status tracking, or reasoning support.
  3. Polling: When using the Responses API, implement appropriate polling intervals to check response status.
  4. Error Handling: Always handle errors and check response status codes.
  5. Cancellation: Cancel long-running requests when they're no longer needed to free up resources.
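Points 3 and 4 often combine into a small retry helper around the request call. A minimal sketch (real code would narrow the caught exceptions to transient failures such as connection errors or HTTP 429/5xx rather than catching everything):

```python
import time

def with_retries(call, attempts=3, backoff=0.5):
    """Invoke a request-issuing callable, retrying with exponential backoff.

    Re-raises the last exception once all attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

# Usage: with_retries(lambda: requests.post(url, headers=headers, json=payload))
```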