# Completions & Responses API

The Completions and Responses APIs provide OpenAI-compatible endpoints for text generation. The `/v1/completions` endpoint is designed for simple text completion tasks, while the `/v1/responses` endpoint offers advanced features like asynchronous processing and reasoning support.
## Overview
The API provides two main endpoints:
- `/v1/completions`: OpenAI-compatible text completion endpoint for generating text from a prompt.
- `/v1/responses`: Advanced responses API with support for asynchronous processing, reasoning, and extended capabilities.
## Completions Endpoint

### Endpoint

```
POST https://api.inference.nebul.io/v1/completions
```

### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | String | Yes | The model ID to use. |
| `prompt` | String or Array | Yes | The prompt(s) to complete. Can be a string or array of strings. |
| `temperature` | Number | No | Sampling temperature (0.0 to 2.0). Defaults to 1.0. |
| `top_p` | Number | No | Nucleus sampling parameter (0.0 to 1.0). Defaults to 1.0. |
| `n` | Integer | No | Number of completions to generate. Defaults to 1. |
| `stream` | Boolean | No | Whether to stream responses. Defaults to false. |
| `stop` | String or Array | No | Stop sequences. Can be a string or array of strings. |
| `max_tokens` | Integer | No | Maximum number of tokens to generate. Defaults to 16. |
| `presence_penalty` | Number | No | Penalty for token presence (-2.0 to 2.0). Defaults to 0.0. |
| `frequency_penalty` | Number | No | Penalty for token frequency (-2.0 to 2.0). Defaults to 0.0. |
| `seed` | Integer | No | Random seed for reproducible outputs. |
| `echo` | Boolean | No | Echo back the prompt in addition to the completion. Defaults to false. |
| `suffix` | String | No | Suffix that comes after a completion of inserted text. |
| `best_of` | Integer | No | Generates `best_of` completions and returns the one with the highest log probability. |
| `logprobs` | Integer | No | Include log probabilities on the `logprobs` most likely tokens. |
| `user` | String | No | Unique identifier for the end user. |
### Code Examples

**Python**

```python
import requests

url = "https://api.inference.nebul.io/v1/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "prompt": "The capital of France is",
    "max_tokens": 10,
    "temperature": 0.7,
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```

**cURL**

```bash
curl -X POST https://api.inference.nebul.io/v1/completions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "prompt": "The capital of France is",
    "max_tokens": 10,
    "temperature": 0.7
  }'
```
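When `stream` is set to true, the server returns tokens incrementally instead of a single JSON body. The sketch below assumes the endpoint follows the OpenAI-style server-sent-events format (`data: {json}` lines terminated by `data: [DONE]`); verify the exact wire format against your deployment before relying on it.

```python
import json


def parse_sse_line(line: str):
    """Decode one server-sent-events line into a JSON chunk.

    Returns None for blank keep-alive lines and for the terminal
    "data: [DONE]" marker.
    """
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):].strip()
    if data == "[DONE]":
        return None
    return json.loads(data)


def stream_completion(api_key: str):
    # Deferred import so parse_sse_line stays usable without the dependency.
    import requests

    payload = {
        "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "prompt": "The capital of France is",
        "max_tokens": 10,
        "stream": True,
    }
    with requests.post(
        "https://api.inference.nebul.io/v1/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for raw in resp.iter_lines(decode_unicode=True):
            chunk = parse_sse_line(raw or "")
            if chunk:
                # Each streamed chunk carries a partial completion text.
                print(chunk["choices"][0]["text"], end="", flush=True)
```

Keeping the line parser separate from the network call makes the stream handling easy to unit-test without a live endpoint.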
**Asynchronous Processing**: The `/v1/responses` endpoint is designed for long-running requests. Use the request ID returned in the response to check status and retrieve results asynchronously, especially for reasoning models that may take longer to process.

**Reasoning Models**: Some models support extended reasoning capabilities. When using a reasoning model, the response may include intermediate reasoning steps and extended metadata beyond standard completions.
### Response Format

```json
{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1677652288,
  "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
  "choices": [
    {
      "text": " Paris.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 2,
    "total_tokens": 7
  }
}
```
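The generated text lives under `choices[i].text`. A small helper like the following (an illustrative sketch, not part of any SDK) collects the completions in index order, which matters when `n > 1`:

```python
def extract_completions(response_json: dict) -> list:
    """Return the generated text from each choice, ordered by index."""
    choices = sorted(response_json["choices"], key=lambda c: c["index"])
    return [c["text"] for c in choices]


# The response format documented above, as a Python dict.
sample = {
    "id": "cmpl-abc123",
    "object": "text_completion",
    "created": 1677652288,
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "choices": [
        {"text": " Paris.", "index": 0, "logprobs": None, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
}

print(extract_completions(sample))  # [' Paris.']
```

Checking `finish_reason` alongside the text is also worthwhile: a value of `"length"` rather than `"stop"` means the output was cut off by `max_tokens`.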
## Responses Endpoint

The `/v1/responses` endpoint provides advanced features including asynchronous processing, reasoning support, and extended capabilities beyond the standard completions API.

### Endpoint

```
POST https://api.inference.nebul.io/v1/responses
```

### Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | String | Yes | The model ID to use. |
| `input` | String or Array | Yes | The input(s) to process. Can be a string or array of inputs. |
| `temperature` | Number | No | Sampling temperature (0.0 to 2.0). Defaults to 1.0. |
| `top_p` | Number | No | Nucleus sampling parameter (0.0 to 1.0). Defaults to 1.0. |
| `stream` | Boolean | No | Whether to stream responses. Defaults to false. |
| `max_output_tokens` | Integer | No | Maximum number of tokens to generate. Defaults to 16. |
| `max_tool_calls` | Integer | No | Maximum number of tool calls. |
| `user` | String | No | Unique identifier for the end user. |
### Features
The Responses API is designed for more complex use cases that require:
- Asynchronous request processing
- Reasoning model support
- Extended response metadata
- Request cancellation and status tracking
### Basic Usage

**Python**

```python
import requests

url = "https://api.inference.nebul.io/v1/responses"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}

# Create a response request
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "input": "Explain quantum computing in simple terms.",
    "max_output_tokens": 200,
}
response = requests.post(url, headers=headers, json=payload)
response_data = response.json()

# The response includes a response ID for tracking
response_id = response_data.get("id")

# Retrieve the response status
if response_id:
    status_url = f"https://api.inference.nebul.io/v1/responses/{response_id}"
    status_response = requests.get(status_url, headers=headers)
    print(status_response.json())
```

**cURL**

```bash
# Create a response request
RESPONSE_ID=$(curl -X POST https://api.inference.nebul.io/v1/responses \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "input": "Explain quantum computing in simple terms.",
    "max_output_tokens": 200
  }' | jq -r '.id')

# Retrieve the response status
curl -X GET "https://api.inference.nebul.io/v1/responses/${RESPONSE_ID}" \
  -H "Authorization: Bearer <YOUR_API_KEY>"
```
### Retrieving Responses

After creating a response request, you can retrieve its status and results using the response ID:

**Endpoint:** `GET https://api.inference.nebul.io/v1/responses/{response_id}`

**Python**

```python
import requests

response_id = "<RESPONSE_ID>"  # the "id" returned when the request was created
url = f"https://api.inference.nebul.io/v1/responses/{response_id}"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}

response = requests.get(url, headers=headers)
print(response.json())
```

**cURL**

```bash
curl -X GET "https://api.inference.nebul.io/v1/responses/{response_id}" \
  -H "Authorization: Bearer <YOUR_API_KEY>"
```
### Canceling Responses

You can cancel an in-progress response request:

**Endpoint:** `POST https://api.inference.nebul.io/v1/responses/{response_id}/cancel`

**Python**

```python
import requests

response_id = "<RESPONSE_ID>"  # the "id" returned when the request was created
url = f"https://api.inference.nebul.io/v1/responses/{response_id}/cancel"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}

response = requests.post(url, headers=headers)
print(response.json())
```

**cURL**

```bash
curl -X POST "https://api.inference.nebul.io/v1/responses/{response_id}/cancel" \
  -H "Authorization: Bearer <YOUR_API_KEY>"
```
### Model Specifications
The Responses API supports the same LLM models as the Chat Completions API. See the Chat Completions Model Specifications for a complete list of available models.
### Reasoning Support
The Responses API supports reasoning models that provide step-by-step reasoning before generating a final answer. See the Reasoning Models guide for detailed information about reasoning-capable models and how to use them.
## Best Practices

- **Use Completions for Simple Tasks**: Use `/v1/completions` for straightforward text completion tasks.
- **Use Responses for Complex Workflows**: Use `/v1/responses` when you need asynchronous processing, status tracking, or reasoning support.
- **Polling**: When using the Responses API, implement appropriate polling intervals to check response status.
- **Error Handling**: Always handle errors and check response status codes.
- **Cancellation**: Cancel long-running requests when they're no longer needed to free up resources.
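A reasonable polling strategy backs off exponentially rather than hammering the endpoint at a fixed interval. The sketch below assumes the retrieved response body carries a `status` field with terminal values such as `"completed"`, `"failed"`, or `"cancelled"`; check your deployment's actual schema before relying on those names.

```python
import time


def backoff_delays(initial=1.0, factor=2.0, cap=30.0, attempts=6):
    """Yield an exponentially growing delay for each polling attempt, capped."""
    delay = initial
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def poll_response(response_id: str, api_key: str) -> dict:
    """Poll the Responses API until a terminal status, then return the body."""
    # Deferred import so backoff_delays stays usable without the dependency.
    import requests

    url = f"https://api.inference.nebul.io/v1/responses/{response_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    for delay in backoff_delays():
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        body = resp.json()
        # Assumed terminal statuses; adjust to match the real schema.
        if body.get("status") in ("completed", "failed", "cancelled"):
            return body
        time.sleep(delay)
    raise TimeoutError(f"response {response_id} did not finish in time")
```

With the defaults above the waits are 1 s, 2 s, 4 s, 8 s, 16 s, then 30 s, which keeps request volume low while still noticing fast completions quickly.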