Chat Completions
The Chat Completions API enables developers to interact with language models through conversational interfaces. This OpenAI-compatible endpoint supports multi-turn conversations, system instructions, and advanced features like function calling and structured outputs.
Overview
The API allows you to send a series of messages and receive model-generated responses. It supports various roles (system, user, assistant) and can handle complex conversational flows with context management.
Quick Start
Endpoint
```
POST https://api.inference.nebul.io/v1/chat/completions
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model` | String | Yes | The model ID to use (e.g., `Qwen/Qwen3-30B-A3B-Instruct-2507`). |
| `messages` | Array | Yes | Array of message objects with `role` and `content`. Roles: `system`, `user`, `assistant`. |
| `temperature` | Number | No | Sampling temperature (0.0-2.0). Higher values make output more random. Defaults to 1.0. |
| `top_p` | Number | No | Nucleus sampling parameter (0.0-1.0). Defaults to 1.0. |
| `n` | Integer | No | Number of completions to generate. Defaults to 1. |
| `stream` | Boolean | No | Whether to stream responses. Defaults to false. |
| `stop` | String or Array | No | Stop sequences. Can be a string or an array of strings. |
| `max_completion_tokens` | Integer | No | Maximum number of tokens to generate. |
| `presence_penalty` | Number | No | Penalty for token presence (-2.0 to 2.0). Defaults to 0.0. |
| `frequency_penalty` | Number | No | Penalty for token frequency (-2.0 to 2.0). Defaults to 0.0. |
| `seed` | Integer | No | Random seed for reproducible outputs. |
| `tools` | Array | No | Array of tool/function definitions. See Function Calling & Tools. |
| `tool_choice` | String or Object | No | Tool choice strategy. See Function Calling & Tools. |
| `response_format` | Object | No | Response format constraints. See Structured Output & JSON. |
| `user` | String | No | Unique identifier for the end user. |
Code Examples
**Python**

```python
import requests

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,
    "max_completion_tokens": 100,
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```

**cURL**

```bash
curl -X POST https://api.inference.nebul.io/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_completion_tokens": 100
  }'
```
API Key Security: Store your API key in an environment variable rather than hardcoding it. API keys always start with `sk-` ("secret key").
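For example, a minimal sketch of loading the key from an environment variable (the variable name `NEBUL_API_KEY` is an assumption for illustration, not an API requirement):

```python
import os

def build_headers() -> dict:
    # NEBUL_API_KEY is a hypothetical variable name; any name works,
    # as long as the key never appears in source code.
    api_key = os.environ.get("NEBUL_API_KEY")
    if not api_key:
        raise RuntimeError("NEBUL_API_KEY is not set")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

The returned dict can be passed directly as the `headers` argument in the examples above.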
Temperature Tuning: Lower temperatures (0.0-0.3) produce more focused and deterministic outputs, while higher temperatures (0.7-1.0) increase creativity and variability. Use temperature: 0 for tasks requiring factual accuracy.
Response Format
The API returns a JSON object with the following structure:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 8,
    "total_tokens": 23
  }
}
```
Response Fields
| Field | Type | Description |
|---|---|---|
| `id` | String | Unique identifier for the completion. |
| `object` | String | Object type, always `chat.completion`. |
| `created` | Integer | Unix timestamp of when the completion was created. |
| `model` | String | Model ID used for the completion. |
| `choices` | Array | Array of completion choices. |
| `choices[].index` | Integer | Index of the choice in the array. |
| `choices[].message` | Object | Message object with `role` and `content`. |
| `choices[].message.role` | String | Role of the message (always `assistant`). |
| `choices[].message.content` | String | Content of the message. |
| `choices[].finish_reason` | String | Reason the completion ended (`stop`, `length`, `tool_calls`, etc.). |
| `usage` | Object | Token usage statistics. |
| `usage.prompt_tokens` | Integer | Number of tokens in the prompt. |
| `usage.completion_tokens` | Integer | Number of tokens in the completion. |
| `usage.total_tokens` | Integer | Total number of tokens used. |
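For illustration, the fields above can be extracted from a completion object like this (using the sample response shown earlier in this section):

```python
# Sample completion, matching the Response Format example above.
sample = {
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The capital of France is Paris."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 8, "total_tokens": 23},
}

def extract_reply(completion: dict) -> tuple[str, str, int]:
    # Pull out the assistant text, the finish reason, and the total token count.
    choice = completion["choices"][0]
    return (
        choice["message"]["content"],
        choice["finish_reason"],
        completion["usage"]["total_tokens"],
    )

text, reason, tokens = extract_reply(sample)
```

Checking `finish_reason` is worthwhile in practice: a value of `length` means the reply was truncated by `max_completion_tokens`.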
Message Roles
The messages array supports three role types:
- `system`: Sets the behavior and context for the assistant. Typically used at the beginning of the conversation.
- `user`: Represents the user's input or question.
- `assistant`: Represents the model's previous responses. Used for multi-turn conversations.
Multi-turn Conversations
To maintain conversation context, include previous messages in the messages array:
**Python**

```python
import requests

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "What is its population?"},
    ],
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```

**cURL**

```bash
curl -X POST https://api.inference.nebul.io/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "The capital of France is Paris."},
      {"role": "user", "content": "What is its population?"}
    ]
  }'
```
Streaming Responses
Enable streaming by setting `stream: true`. The response is sent as a series of Server-Sent Events (SSE):
**Python**

```python
import requests
import json

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": True,
}

response = requests.post(url, headers=headers, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        line_text = line.decode("utf-8")
        if line_text.startswith("data: "):
            data = line_text[6:]
            if data == "[DONE]":
                break
            try:
                chunk = json.loads(data)
                if "choices" in chunk and len(chunk["choices"]) > 0:
                    delta = chunk["choices"][0].get("delta", {})
                    if "content" in delta:
                        print(delta["content"], end="", flush=True)
            except json.JSONDecodeError:
                pass
```

**cURL**

```bash
curl -X POST https://api.inference.nebul.io/v1/chat/completions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "stream": true
  }' \
  --no-buffer
```
Model Specifications
The following LLM models support chat completions:
- `mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4`: 675B parameters, 256K context, NVFP4 precision, supports Text, Image, Tools, JSON (Preview)
- `Qwen/Qwen3-30B-A3B-Instruct-2507`: 30B parameters, 262K context, bfloat16 precision, supports Text
- `openai/gpt-oss-120b`: 120B parameters, 131K context, bfloat16 precision, supports Text, Tools
- `Qwen/Qwen3-VL-235B-A22B-Thinking`: 235B parameters, 262K context, fp8 precision, supports Text, Image, Tools
- `meta-llama/Llama-4-Maverick-17B-128E-Instruct`: 400B parameters, 300K context, float8 precision, supports Text, Image, Tools
- `Qwen/Qwen3-VL-235B-A22B-Instruct-FP8`: 235B parameters, 262K context, FP8 precision, supports Text, Image, Tools, JSON (Preview)
- `mistralai/Devstral-2-123B-Instruct-2512`: 123B parameters, 256K context, FP8 precision, supports Text, Image, Tools, JSON (Preview)
Advanced Features
Function Calling
The Chat Completions API supports function calling, allowing models to request execution of external functions. See the Function Calling & Tools guide for detailed information.
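As a quick sketch of the request shape (the `get_weather` tool here is hypothetical; the schema follows the OpenAI-compatible tools format, and the full details are in that guide):

```python
# Hypothetical tool definition in the OpenAI-compatible schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. Paris"}
                },
                "required": ["city"],
            },
        },
    }
]

payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
```

If the model chooses to call the tool, the choice's `finish_reason` is `tool_calls` and the arguments arrive in the assistant message rather than as plain text.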
Structured Output
You can constrain model outputs to follow specific JSON schemas. See the Structured Output & JSON guide for details.
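As a sketch, a `json_schema` response format in the OpenAI-compatible style (the schema name and fields below are illustrative, not prescribed by the API):

```python
# Hypothetical extraction schema; see the Structured Output & JSON guide
# for the authoritative response_format options.
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "user", "content": "Extract the city from: 'I live in Paris.'"}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
}
```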
Reasoning Models
Some models support chain-of-thought reasoning. See the Reasoning Models guide for information about reasoning-capable models.
Best Practices
- System Messages: Use system messages to set the assistant's behavior and context at the beginning of conversations.
- Context Management: Include relevant conversation history in the `messages` array to maintain context.
- Token Limits: Be mindful of `max_completion_tokens` and model context limits to avoid truncation.
- Temperature: Adjust `temperature` based on your use case:
  - Lower values (0.0-0.3) for deterministic, factual responses
  - Medium values (0.7-1.0) for balanced creativity
  - Higher values (1.0-2.0) for more creative outputs
- Error Handling: Always check response status codes and handle errors appropriately.
- Streaming: Use streaming for long responses to improve perceived latency.
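A minimal sketch of the error-handling advice above, using only the standard library (the set of retryable status codes and the backoff schedule are illustrative choices, not part of the API contract):

```python
import json
import time
import urllib.error
import urllib.request

# Rate limiting and transient server errors are worth retrying;
# client errors such as 400/401/404 are not.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}

def should_retry(status: int) -> bool:
    return status in RETRYABLE_STATUS

def post_with_retries(url: str, headers: dict, payload: dict,
                      max_retries: int = 3) -> dict:
    body = json.dumps(payload).encode("utf-8")
    for attempt in range(max_retries):
        req = urllib.request.Request(url, data=body, headers=headers, method="POST")
        try:
            with urllib.request.urlopen(req, timeout=60) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if should_retry(err.code) and attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s...
                continue
            raise
```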