Chat Completions

The Chat Completions API enables developers to interact with language models through conversational interfaces. This OpenAI-compatible endpoint supports multi-turn conversations, system instructions, and advanced features like function calling and structured outputs.

Overview

The API allows you to send a series of messages and receive model-generated responses. It supports various roles (system, user, assistant) and can handle complex conversational flows with context management.

Quick Start

Endpoint

POST https://api.inference.nebul.io/v1/chat/completions

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | String | Yes | The model ID to use (e.g., Qwen/Qwen3-30B-A3B-Instruct-2507). |
| messages | Array | Yes | Array of message objects with role and content. Roles: system, user, assistant. |
| temperature | Number | No | Sampling temperature (0.0-2.0). Higher values make output more random. Defaults to 1.0. |
| top_p | Number | No | Nucleus sampling parameter (0.0-1.0). Defaults to 1.0. |
| n | Integer | No | Number of completions to generate. Defaults to 1. |
| stream | Boolean | No | Whether to stream responses. Defaults to false. |
| stop | String or Array | No | Stop sequences. Can be a string or an array of strings. |
| max_completion_tokens | Integer | No | Maximum number of tokens to generate. |
| presence_penalty | Number | No | Penalty for token presence (-2.0 to 2.0). Defaults to 0.0. |
| frequency_penalty | Number | No | Penalty for token frequency (-2.0 to 2.0). Defaults to 0.0. |
| seed | Integer | No | Random seed for reproducible outputs. |
| tools | Array | No | Array of tool/function definitions. See Function Calling & Tools. |
| tool_choice | String or Object | No | Tool choice strategy. See Function Calling & Tools. |
| response_format | Object | No | Response format constraints. See Structured Output & JSON. |
| user | String | No | Unique identifier for the end user. |

Code Examples

```python
import requests

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,
    "max_completion_tokens": 100,
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```
Tip (API Key Security): Store your API key in environment variables rather than hardcoding it. API keys always start with sk- ("secret key-").
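Reading the key from the environment keeps it out of source control. A minimal sketch (the NEBUL_API_KEY variable name is illustrative, not an official convention):

```python
import os


def auth_headers() -> dict:
    """Build request headers with the API key taken from the environment.

    NEBUL_API_KEY is an illustrative variable name; use whatever your
    deployment convention dictates.
    """
    api_key = os.environ["NEBUL_API_KEY"]
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

If the variable is unset, the lookup raises KeyError at request time, which fails fast instead of sending an empty bearer token.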

Tip (Temperature Tuning): Lower temperatures (0.0-0.3) produce more focused and deterministic outputs, while higher temperatures (0.7-1.0) increase creativity and variability. Use temperature: 0 for tasks requiring factual accuracy.

Response Format

The API returns a JSON object with the following structure:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 8,
    "total_tokens": 23
  }
}
```

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the completion. |
| object | String | Object type, always "chat.completion". |
| created | Integer | Unix timestamp of when the completion was created. |
| model | String | Model ID used for the completion. |
| choices | Array | Array of completion choices. |
| choices[].index | Integer | Index of the choice in the array. |
| choices[].message | Object | Message object with role and content. |
| choices[].message.role | String | Role of the message (assistant). |
| choices[].message.content | String | Content of the message. |
| choices[].finish_reason | String | Reason for completion (stop, length, tool_calls, etc.). |
| usage | Object | Token usage statistics. |
| usage.prompt_tokens | Integer | Number of tokens in the prompt. |
| usage.completion_tokens | Integer | Number of tokens in the completion. |
| usage.total_tokens | Integer | Total number of tokens used. |
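Given the structure above, pulling the assistant's text and the token count out of a parsed response is a one-liner each. A sketch (extract_reply is a hypothetical helper, not part of any SDK):

```python
def extract_reply(completion: dict) -> tuple[str, int]:
    """Return the assistant's text and total token count from a
    chat.completion object with the structure documented above."""
    message = completion["choices"][0]["message"]
    total = completion["usage"]["total_tokens"]
    return message["content"], total
```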

Message Roles

The messages array supports three role types:

  • system: Sets the behavior and context for the assistant. Typically used at the beginning of the conversation.
  • user: Represents the user's input or question.
  • assistant: Represents the model's previous responses. Used for multi-turn conversations.

Multi-turn Conversations

To maintain conversation context, include previous messages in the messages array:

```python
import requests

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "What is its population?"},
    ],
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```
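Because the API is stateless, the client owns the history. A small accumulator makes it harder to forget a turn; a minimal sketch (the Conversation class is illustrative, not part of any SDK):

```python
class Conversation:
    """Accumulates messages so each request carries the full context."""

    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, text: str) -> list:
        """Record a user turn and return the list to send as "messages"."""
        self.messages.append({"role": "user", "content": text})
        return self.messages

    def add_assistant(self, text: str) -> None:
        """Record the model's reply so the next turn sees it."""
        self.messages.append({"role": "assistant", "content": text})
```

After each response, append the assistant's content before the next add_user call; otherwise the model loses its own side of the conversation.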

Streaming Responses

Enable streaming by setting stream: true. The response will be sent as a series of Server-Sent Events (SSE):

```python
import requests
import json

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "user", "content": "Tell me a short story."},
    ],
    "stream": True,
}

response = requests.post(url, headers=headers, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        line_text = line.decode("utf-8")
        if line_text.startswith("data: "):
            data = line_text[6:]
            if data == "[DONE]":
                break
            try:
                chunk = json.loads(data)
                if "choices" in chunk and len(chunk["choices"]) > 0:
                    delta = chunk["choices"][0].get("delta", {})
                    if "content" in delta:
                        print(delta["content"], end="", flush=True)
            except json.JSONDecodeError:
                pass
```
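The SSE parsing in the loop above can be factored into a standalone helper, which keeps the network loop trivial and makes the parsing logic easy to unit-test. A sketch (parse_sse_line is a hypothetical helper mirroring the loop's behavior):

```python
import json


def parse_sse_line(line: bytes):
    """Return the content delta carried by one SSE line, or None.

    Returns the literal string "[DONE]" for the stream-end marker,
    and None for keep-alive comments, empty lines, or chunks that
    carry no content delta.
    """
    text = line.decode("utf-8")
    if not text.startswith("data: "):
        return None
    data = text[6:]
    if data == "[DONE]":
        return "[DONE]"
    try:
        chunk = json.loads(data)
    except json.JSONDecodeError:
        return None
    choices = chunk.get("choices") or []
    if choices:
        return choices[0].get("delta", {}).get("content")
    return None
```

The streaming loop then reduces to: call parse_sse_line on each line from iter_lines, break on "[DONE]", and print any non-None result.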

Model Specifications

The following LLM models support chat completions:

  • mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 - 675B parameters, 256K context, NVFP4 precision, supports Text, Image, Tools, JSON (Preview)
  • Qwen/Qwen3-30B-A3B-Instruct-2507 - 30B parameters, 262K context, bfloat16 precision, supports Text
  • openai/gpt-oss-120b - 120B parameters, 131K context, bfloat16 precision, supports Text, Tools
  • Qwen/Qwen3-VL-235B-A22B-Thinking - 235B parameters, 262K context, fp8 precision, supports Text, Image, Tools
  • meta-llama/Llama-4-Maverick-17B-128E-Instruct - 400B parameters, 300K context, float8 precision, supports Text, Image, Tools
  • Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 - 235B parameters, 262K context, FP8 precision, supports Text, Image, Tools, JSON (Preview)
  • mistralai/Devstral-2-123B-Instruct-2512 - 123B parameters, 256K context, FP8 precision, supports Text, Image, Tools, JSON (Preview)

Advanced Features

Function Calling

The Chat Completions API supports function calling, allowing models to request execution of external functions. See the Function Calling & Tools guide for detailed information.

Structured Output

You can constrain model outputs to follow specific JSON schemas. See the Structured Output & JSON guide for details.
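As a rough illustration, the widely used OpenAI-compatible json_object mode looks like this in a request body; whether this endpoint supports it, and which schema-constrained variants exist, is covered in the linked guide:

```python
# Illustrative request body only; see the Structured Output & JSON
# guide for the modes this endpoint actually supports.
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "Reply only with JSON."},
        {"role": "user", "content": "List three primary colors."},
    ],
    # "json_object" is the generic OpenAI-compatible JSON mode.
    "response_format": {"type": "json_object"},
}
```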

Reasoning Models

Some models support chain-of-thought reasoning. See the Reasoning Models guide for information about reasoning-capable models.

Best Practices

  1. System Messages: Use system messages to set the assistant's behavior and context at the beginning of conversations.
  2. Context Management: Include relevant conversation history in the messages array to maintain context.
  3. Token Limits: Be mindful of max_completion_tokens and model context limits to avoid truncation.
  4. Temperature: Adjust temperature based on your use case:
    • Lower values (0.0-0.3) for deterministic, factual responses
    • Medium values (0.7-1.0) for balanced creativity
    • Higher values (1.0-2.0) for more creative outputs
  5. Error Handling: Always check response status codes and handle errors appropriately.
  6. Streaming: Use streaming for long responses to improve perceived latency.
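The error-handling advice in point 5 can be sketched as a small status classifier; the retry-on-429/5xx choice below is a common client convention, not documented behavior of this API:

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse handling strategy.

    The retry choices here are conventions (retry rate limits and
    transient server errors, fail fast on client errors), not
    documented guarantees of this API.
    """
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429 or status_code >= 500:
        return "retry"  # rate limited or transient server error
    return "fail"       # client error: fix the request, don't retry
```

A request wrapper would call response.json() on "ok", back off and retry on "retry" (ideally with exponential backoff), and raise immediately on "fail".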