Chat Completions

The Chat Completions API enables developers to interact with language models through conversational interfaces. This OpenAI-compatible endpoint supports multi-turn conversations, system instructions, and advanced features like function calling and structured outputs.

Overview

The API allows you to send a series of messages and receive model-generated responses. It supports various roles (system, user, assistant) and can handle complex conversational flows with context management.

Quick Start

Endpoint

POST https://api.inference.nebul.io/v1/chat/completions

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | String | Yes | The model ID to use (e.g., Qwen/Qwen3-30B-A3B-Instruct-2507). |
| messages | Array | Yes | Array of message objects with role and content. Roles: system, user, assistant. |
| temperature | Number | No | Sampling temperature (0.0-2.0). Higher values make output more random. Defaults to 1.0. |
| top_p | Number | No | Nucleus sampling parameter (0.0-1.0). Defaults to 1.0. |
| n | Integer | No | Number of completions to generate. Defaults to 1. |
| stream | Boolean | No | Whether to stream responses. Defaults to false. |
| stop | String or Array | No | Stop sequences. Can be a string or an array of strings. |
| max_completion_tokens | Integer | No | Maximum number of tokens to generate. |
| presence_penalty | Number | No | Penalty for token presence (-2.0 to 2.0). Defaults to 0.0. |
| frequency_penalty | Number | No | Penalty for token frequency (-2.0 to 2.0). Defaults to 0.0. |
| seed | Integer | No | Random seed for reproducible outputs. |
| tools | Array | No | Array of tool/function definitions. See Function Calling & Tools. |
| tool_choice | String or Object | No | Tool choice strategy. See Function Calling & Tools. |
| response_format | Object | No | Response format constraints. See Structured Output & JSON. |
| user | String | No | Unique identifier for the end user. |

Code Examples

```python
import requests

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "temperature": 0.7,
    "max_completion_tokens": 100,
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```
Tip (API Key Security): Store your API key in environment variables rather than hardcoding it. API keys always start with sk- ("secret key-").
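Reading the key from the environment keeps it out of source control. A minimal sketch (the NEBUL_API_KEY variable name is illustrative, not an official convention):

```python
import os


def auth_headers() -> dict:
    """Build request headers with the API key taken from the environment.

    NEBUL_API_KEY is an illustrative variable name; use whatever your
    deployment convention dictates.
    """
    api_key = os.environ["NEBUL_API_KEY"]
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

If the variable is unset, the lookup raises KeyError at request time, which fails fast instead of sending an empty bearer token.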

Tip (Temperature Tuning): Lower temperatures (0.0-0.3) produce more focused and deterministic outputs, while higher temperatures (0.7-1.0) increase creativity and variability. Use temperature: 0 for tasks requiring factual accuracy.

Response Format

The API returns a JSON object with the following structure:

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 8,
    "total_tokens": 23
  }
}
```

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| id | String | Unique identifier for the completion. |
| object | String | Object type, always "chat.completion". |
| created | Integer | Unix timestamp of when the completion was created. |
| model | String | Model ID used for the completion. |
| choices | Array | Array of completion choices. |
| choices[].index | Integer | Index of the choice in the array. |
| choices[].message | Object | Message object with role and content. |
| choices[].message.role | String | Role of the message (assistant). |
| choices[].message.content | String | Content of the message. |
| choices[].finish_reason | String | Reason for completion (stop, length, tool_calls, etc.). |
| usage | Object | Token usage statistics. |
| usage.prompt_tokens | Integer | Number of tokens in the prompt. |
| usage.completion_tokens | Integer | Number of tokens in the completion. |
| usage.total_tokens | Integer | Total number of tokens used. |
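Given the structure above, pulling the assistant's text and the token count out of a parsed response is a one-liner each. A sketch (extract_reply is a hypothetical helper, not part of any SDK):

```python
def extract_reply(completion: dict) -> tuple[str, int]:
    """Return the assistant's text and total token count from a
    chat.completion object with the structure documented above."""
    message = completion["choices"][0]["message"]
    total = completion["usage"]["total_tokens"]
    return message["content"], total
```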

Message Roles

The messages array supports three role types:

  • system: Sets the behavior and context for the assistant. Typically used at the beginning of the conversation.
  • user: Represents the user's input or question.
  • assistant: Represents the model's previous responses. Used for multi-turn conversations.

Multi-turn Conversations

To maintain conversation context, include previous messages in the messages array:

```python
import requests

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
        {"role": "user", "content": "What is its population?"},
    ],
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```
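Because the API is stateless, the client owns the history. A small accumulator makes it harder to forget a turn; a minimal sketch (the Conversation class is illustrative, not part of any SDK):

```python
class Conversation:
    """Accumulates messages so each request carries the full context."""

    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, text: str) -> list:
        """Record a user turn and return the list to send as "messages"."""
        self.messages.append({"role": "user", "content": text})
        return self.messages

    def add_assistant(self, text: str) -> None:
        """Record the model's reply so the next turn sees it."""
        self.messages.append({"role": "assistant", "content": text})
```

After each response, append the assistant's content before the next add_user call; otherwise the model loses its own side of the conversation.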

Streaming Responses

Enable streaming by setting stream: true. The response will be sent as a series of Server-Sent Events (SSE):

```python
import requests
import json

url = "https://api.inference.nebul.io/v1/chat/completions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "user", "content": "Tell me a short story."},
    ],
    "stream": True,
}

response = requests.post(url, headers=headers, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        line_text = line.decode("utf-8")
        if line_text.startswith("data: "):
            data = line_text[6:]
            if data == "[DONE]":
                break
            try:
                chunk = json.loads(data)
                if "choices" in chunk and len(chunk["choices"]) > 0:
                    delta = chunk["choices"][0].get("delta", {})
                    if "content" in delta:
                        print(delta["content"], end="", flush=True)
            except json.JSONDecodeError:
                pass
```
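The SSE parsing in the loop above can be factored into a standalone helper, which keeps the network loop trivial and makes the parsing logic easy to unit-test. A sketch (parse_sse_line is a hypothetical helper mirroring the loop's behavior):

```python
import json


def parse_sse_line(line: bytes):
    """Return the content delta carried by one SSE line, or None.

    Returns the literal string "[DONE]" for the stream-end marker,
    and None for keep-alive comments, empty lines, or chunks that
    carry no content delta.
    """
    text = line.decode("utf-8")
    if not text.startswith("data: "):
        return None
    data = text[6:]
    if data == "[DONE]":
        return "[DONE]"
    try:
        chunk = json.loads(data)
    except json.JSONDecodeError:
        return None
    choices = chunk.get("choices") or []
    if choices:
        return choices[0].get("delta", {}).get("content")
    return None
```

The streaming loop then reduces to: call parse_sse_line on each line from iter_lines, break on "[DONE]", and print any non-None result.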

Model Specifications

The following LLM models support chat completions:

  • mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4 - 675B parameters, 256K context, NVFP4 precision, supports Text, Image, Tools, JSON (Preview)
  • Qwen/Qwen3-30B-A3B-Instruct-2507 - 30B parameters, 262K context, bfloat16 precision, supports Text
  • openai/gpt-oss-120b - 120B parameters, 131K context, bfloat16 precision, supports Text, Tools
  • Qwen/Qwen3-VL-235B-A22B-Thinking - 235B parameters, 262K context, fp8 precision, supports Text, Image, Tools
  • meta-llama/Llama-4-Maverick-17B-128E-Instruct - 400B parameters, 300K context, float8 precision, supports Text, Image, Tools
  • Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 - 235B parameters, 262K context, FP8 precision, supports Text, Image, Tools, JSON (Preview)
  • mistralai/Devstral-2-123B-Instruct-2512 - 123B parameters, 256K context, FP8 precision, supports Text, Image, Tools, JSON (Preview)

Advanced Features

Function Calling

The Chat Completions API supports function calling, allowing models to request execution of external functions. See the Function Calling & Tools guide for detailed information.

Structured Output

You can constrain model outputs to follow specific JSON schemas. See the Structured Output & JSON guide for details.
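As a rough illustration, the widely used OpenAI-compatible json_object mode looks like this in a request body; whether this endpoint supports it, and which schema-constrained variants exist, is covered in the linked guide:

```python
# Illustrative request body only; see the Structured Output & JSON
# guide for the modes this endpoint actually supports.
payload = {
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
        {"role": "system", "content": "Reply only with JSON."},
        {"role": "user", "content": "List three primary colors."},
    ],
    # "json_object" is the generic OpenAI-compatible JSON mode.
    "response_format": {"type": "json_object"},
}
```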

Reasoning Models

Some models support chain-of-thought reasoning. See the Reasoning Models guide for information about reasoning-capable models.

Best Practices

  1. System Messages: Use system messages to set the assistant's behavior and context at the beginning of conversations.
  2. Context Management: Include relevant conversation history in the messages array to maintain context.
  3. Token Limits: Be mindful of max_completion_tokens and model context limits to avoid truncation.
  4. Temperature: Adjust temperature based on your use case:
    • Lower values (0.0-0.3) for deterministic, factual responses
    • Medium values (0.7-1.0) for balanced creativity
    • Higher values (1.0-2.0) for more creative outputs
  5. Error Handling: Always check response status codes and handle errors appropriately.
  6. Streaming: Use streaming for long responses to improve perceived latency.
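The error-handling advice in point 5 can be sketched as a small status classifier; the retry-on-429/5xx choice below is a common client convention, not documented behavior of this API:

```python
def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse handling strategy.

    The retry choices here are conventions (retry rate limits and
    transient server errors, fail fast on client errors), not
    documented guarantees of this API.
    """
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429 or status_code >= 500:
        return "retry"  # rate limited or transient server error
    return "fail"       # client error: fix the request, don't retry
```

A request wrapper would call response.json() on "ok", back off and retry on "retry" (ideally with exponential backoff), and raise immediately on "fail".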