# Getting Started with the Private Inference API
## Overview
Private Inference API is a secure inferencing service run on Nebul's private NeoCloud, ensuring compliance and data protection. It offers open-source and fine-tuned AI models, ideal for industries handling sensitive information, with seamless integration and transparent pricing.
## Prerequisites
- API Key for authentication.
- Familiarity with OpenAI-compatible APIs (e.g., GPT-4, ChatCompletion, Completion endpoints).
- curl, Postman, or any HTTP client (e.g., Python `requests`, OpenAI SDK).
## Authentication
Authenticate requests using a Bearer token:
Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY
Note: API keys begin with `sk-` (e.g., `sk-dummy`).
## Base URL
https://api.chat.nebul.io/v1
This base URL mimics the OpenAI format. All endpoints align with OpenAI's structure to ensure minimal friction for integration.
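Because of this, any OpenAI-compatible client can simply be pointed at the base URL above. A minimal sketch with the official Python SDK (the full examples later in this guide expand on this):

```python
from openai import OpenAI

# Point the standard OpenAI client at the Private Inference API.
client = OpenAI(
    api_key="YOUR_PRIVATE_INFERENCE_API_KEY",
    base_url="https://api.chat.nebul.io/v1",
)
```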
## Quickstart Example (Chat Completion)
### Request
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [{"role": "user", "content": "What is the capital of the Netherlands?"}],
"temperature": 0.7
}'
### Response
{
"id": "chatcmpl-abc123xyz",
"object": "chat.completion",
"created": 1685580297,
"model": "demo/gemma-3-27b-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of the Netherlands is Amsterdam. However, The Hague is the seat of government and home to the Dutch parliament."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 25,
"total_tokens": 37
}
}
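If you prefer raw HTTP over the OpenAI SDK, the same request can be sent with Python's `requests` library. A minimal sketch (plain HTTP against the endpoint above, no SDK required):

```python
import os
import requests

API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")

# Same chat completion request as the curl example above, sent as plain HTTP.
resp = requests.post(
    "https://api.chat.nebul.io/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "demo/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "What is the capital of the Netherlands?"}],
        "temperature": 0.7,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```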
## Supported Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat-based model responses |
| `/v1/completions` | POST | Classic completion model endpoint (example below) |
| `/v1/embeddings` | POST | Generate vector embeddings |
| `/v1/models` | GET | List available models |
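The chat endpoint is demonstrated throughout this guide. As a sketch of the classic completions endpoint, a call could look like the following; whether a given model accepts plain-text prompts depends on your deployment, so check `/v1/models` and your model's documentation first (the model name below is only illustrative):

```python
import os
from openai import OpenAI

API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")

# Classic (non-chat) completion; the model must support plain-text prompts.
response = client.completions.create(
    model="demo/gemma-3-27b-it",
    prompt="The capital of the Netherlands is",
    max_tokens=16,
)
print(response.choices[0].text)
```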
## Using Python (via OpenAI SDK)
### Example: Listing Available Models
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
models = client.models.list()
for model in models.data:
print(model.id)
Equivalent bash:
curl -X GET "https://api.chat.nebul.io/v1/models" \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY"
### Example: Sending an Image for Analysis
import base64
import os
from openai import OpenAI
def encode_image_to_base64(image_path):
    """Read an image file and return a base64-encoded string, or None if the file is missing."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except FileNotFoundError:
        print(f"Error: image file not found at {image_path}")
        return None
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY") # Use environment variable for API key
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
image_path = "your_path_to_image.jpg"
base64_image = encode_image_to_base64(image_path)
if base64_image:
stream = client.chat.completions.create(
model="demo/gemma-3-27b-it",
messages=[
{"role": "user", "content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
]}
],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta and chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print()
Equivalent bash:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,PASTE_YOUR_BASE64_IMAGE_HERE"}}
]}
],
"stream": true
}'
### Notes
- Model selection: Make sure the model you use supports vision/image input. Use `/v1/models` to list available models.
- Image format: This example assumes JPEG. For PNG, change the MIME type to `image/png` (or guess it automatically, as in the sketch after these notes).
- Base64 encoding: The image is encoded and sent as a data URL in the message content.
- File path: Replace `"your_path_to_image.jpg"` with the actual path to your image file on your local system.
- Error handling: The helper returns `None` when the image file is missing; wrap the API call in `try`/`except` if you also want to handle API errors.
## Streaming & Non-Streaming Usage
### Minimal Python Example (Non-Streaming)
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY") # Use environment variable for API key
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
response = client.chat.completions.create(
model="demo/gemma-3-27b-it",
messages=[{"role": "user", "content": "What is the capital of France?"}],
stream=False,
)
print(response.choices[0].message.content.strip())
Equivalent bash:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"stream": false
}'
### Minimal Python Example (Streaming)
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY") # Use environment variable for API key
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
stream = client.chat.completions.create(
model="demo/gemma-3-27b-it",
messages=[{"role": "user", "content": "Write a short sentence."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta and chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print()
Equivalent bash:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [{"role": "user", "content": "Write a short sentence."}],
"stream": true
}'
Notes:
- Replace the API key with your own.
- The streaming curl response will be in Server-Sent Events (SSE) format (see the sketch after these notes).
- Only use models listed in `/v1/models` for best compatibility.
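For illustration, here is a sketch of reading that SSE stream without the SDK, using Python's `requests`; it assumes the usual OpenAI-compatible convention where each event line starts with `data: ` and the stream ends with `data: [DONE]`:

```python
import json
import os
import requests

API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")

resp = requests.post(
    "https://api.chat.nebul.io/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "model": "demo/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "Write a short sentence."}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue  # skip SSE keep-alive blank lines
    text = line.decode("utf-8")
    if not text.startswith("data: "):
        continue
    payload = text[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    if delta.get("content"):
        print(delta["content"], end="", flush=True)
print()
```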
## Error Response Format
API errors follow the LiteLLM format:
{
"error": {
"message": "Authentication Error, LiteLLM Virtual Key expected. Received=INVALID_API_KEY, expected to start with 'sk-'.",
"type": "auth_error",
"param": "None",
"code": "401"
}
}
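When using the OpenAI Python SDK, these errors surface as exceptions rather than raw JSON. A minimal sketch of inspecting them (using the exception classes from the v1 `openai` package):

```python
import openai
from openai import OpenAI

client = OpenAI(api_key="INVALID_API_KEY", base_url="https://api.chat.nebul.io/v1")

try:
    client.chat.completions.create(
        model="demo/gemma-3-27b-it",
        messages=[{"role": "user", "content": "ping"}],
    )
except openai.AuthenticationError as e:
    # 401 responses, such as the invalid-key error shown above, end up here.
    print("Authentication error:", e)
except openai.APIStatusError as e:
    # Other non-2xx responses; the body follows the error format shown above.
    print("API error", e.status_code, e.response.text)
```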
## Qwen-3 Tool Calling
Qwen-3 models on Nebul already run with automatic function calling.
All you do is add a "tools" array to your normal request; the endpoint takes care of the rest.
### Same Auth & URL as the other endpoints
| Item | Value |
|---|---|
| Base URL | https://api.chat.nebul.io/v1 |
| Header | Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY |
### Step-by-Step (curl)
1. Send your prompt + tool spec:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/qwen-3-14b",
"messages": [
{ "role": "user",
"content": "Remind me to submit the report at 17:00." }
],
"tools": [{
"type": "function",
"function": {
"name": "set_reminder",
"description": "Create a reminder",
"parameters": {
"type": "object",
"properties": {
"text": { "type": "string" },
"time": { "type": "string" }
},
"required": ["text", "time"]
}
}
}]
}'
2. Parse the first reply:
{
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_0",
"function": {
"name": "set_reminder",
"arguments": "{ \"text\": \"Submit the report\", \"time\": \"17:00\" }"
}
}]
},
"finish_reason": "tool_calls"
}]
}
3. Execute the function in your app and return the result:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/qwen-3-14b",
"messages": [
{ "role": "user", "content": "Remind me to submit the report at 17:00." },
{ "role": "assistant", "content": null,
"tool_calls": [{
"id": "call_0",
"function": {
"name": "set_reminder",
"arguments": "{ \"text\": \"Submit the report\", \"time\": \"17:00\" }"
}
}]
},
{ "role": "tool",
"tool_call_id": "call_0",
"name": "set_reminder",
"content": "{ \"status\": \"Reminder set for 17:00.\" }"
}
]
}'
4. Get the final, human-readable answer:
{
"choices": [{
"message": {
"role": "assistant",
"content": "Got it! Iβve scheduled the reminder for 17:00. Anything else?"
}
}]
}
### End-to-End in Python
import json
from openai import OpenAI
client = OpenAI(
api_key="YOUR_PRIVATE_INFERENCE_API_KEY",
base_url="https://api.chat.nebul.io/v1"
)
tools = [{
"type": "function",
"function": {
"name": "set_reminder",
"description": "Create a reminder",
"parameters": {
"type": "object",
"properties": {
"text": {"type": "string"},
"time": {"type": "string"}
},
"required": ["text", "time"]
}
}
}]
msgs = [{"role": "user",
"content": "Remind me to call Alice at 15:00."}]
# 1. Ask the model
r1 = client.chat.completions.create(
model="demo/qwen-3-14b",
messages=msgs,
tools=tools
)
call = r1.choices[0].message.tool_calls[0]
msgs.append(r1.choices[0].message.model_dump())
# 2. Run the function yourself
args = json.loads(call.function.arguments)
result = {"status": f"Reminder set for {args['time']}."}
# 3. Send back the result, get the final reply
msgs.append({
"role": "tool",
"tool_call_id": call.id,
"name": call.function.name,
"content": json.dumps(result)
})
r2 = client.chat.completions.create(
model="demo/qwen-3-14b",
messages=msgs
)
print(r2.choices[0].message.content)
### Quick Troubleshooting
| Symptom | Check |
|---|---|
| No `tool_calls` returned | Ensure you passed the `"tools"` array. |
| Arguments not JSON | Validate your JSON Schema (type, required, spelling). |
| Multiple calls | Send one `role:"tool"` message per call before the final request (see the sketch after this table). |
| Need streaming | Add "stream": true to either request. |
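For the "multiple calls" case, the sketch below loops over every tool call before asking for the final answer. It continues from the end-to-end Python example above (so `client`, `msgs`, `tools`, and `r1` are assumed to exist), and `run_tool` is a placeholder dispatcher you would implement yourself:

```python
import json

# r1 is the first response from client.chat.completions.create(..., tools=tools)
assistant_msg = r1.choices[0].message
msgs.append(assistant_msg.model_dump())

# Return one role:"tool" message per tool call before requesting the final reply.
for call in assistant_msg.tool_calls or []:
    args = json.loads(call.function.arguments)
    result = run_tool(call.function.name, args)  # placeholder: your own dispatcher
    msgs.append({
        "role": "tool",
        "tool_call_id": call.id,
        "name": call.function.name,
        "content": json.dumps(result),
    })

r2 = client.chat.completions.create(model="demo/qwen-3-14b", messages=msgs)
print(r2.choices[0].message.content)
```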
## DeepSeek-V3.1 Chat API: Requests & Reasoning Toggle (Streaming / Non-Streaming)
- Base URL: `https://api.chat.nebul.io/v1`
- Auth: `Authorization: Bearer YOUR_PRIVATE_API_KEY`
- Model: `deepseek-ai/DeepSeek-V3.1`
- Toggle reasoning: `"chat_template_kwargs": {"thinking": true | false}`
### Non-Streaming (Chat Completions)
#### Python (reasoning ON)
from openai import OpenAI
client = OpenAI(
api_key="YOUR_PRIVATE_API_KEY",
base_url="https://api.chat.nebul.io/v1",
)
model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}]
resp = client.chat.completions.create(
model=model,
messages=messages,
extra_body={"chat_template_kwargs": {"thinking": True}},
)
print("reasoning_content:", resp.choices[0].message.reasoning_content)
print("content:", resp.choices[0].message.content)
#### Python (reasoning OFF)
from openai import OpenAI
client = OpenAI(
api_key="YOUR_PRIVATE_API_KEY",
base_url="https://api.chat.nebul.io/v1",
)
model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}]
resp = client.chat.completions.create(
model=model,
messages=messages,
extra_body={"chat_template_kwargs": {"thinking": False}},
)
print("content:", resp.choices[0].message.reasoning_content)
Note: for now, even when reasoning is turned off, the content is still returned in the `reasoning_content` field.
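Until that changes, a small helper can pick the answer from whichever field is populated. A sketch, continuing from the non-streaming examples above (`resp` is the response object):

```python
def final_answer(message):
    """Return the model's answer, preferring content and falling back to reasoning_content."""
    if getattr(message, "content", None):
        return message.content
    return getattr(message, "reasoning_content", None)

print("content:", final_answer(resp.choices[0].message))
```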
#### cURL (reasoning ON)
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.1",
"messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
"chat_template_kwargs": {"thinking": true}
}'
#### cURL (reasoning OFF)
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.1",
"messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
"chat_template_kwargs": {"thinking": false}
}'
### Streaming (Python)
Use stream=True. When reasoning is enabled, chunks may include delta.reasoning_content; final answer chunks arrive in delta.content.
from openai import OpenAI
# Point to your OpenAI-compatible endpoint and auth.
client = OpenAI(
api_key="YOUR_PRIVATE_API_KEY",
base_url="https://api.chat.nebul.io/v1",
)
model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}]
def stream_chat(thinking: bool):
"""Stream a chat completion and return (final_answer, chain_of_thought)."""
stream = client.chat.completions.create(
model=model,
messages=messages,
stream=True,
extra_body={"chat_template_kwargs": {"thinking": thinking}},
)
cot = []
ans = []
printed_reasoning_header = False
printed_answer_header = False
for chunk in stream:
delta = chunk.choices[0].delta
# reasoning_content is an extra attribute on the SDK delta object, present when thinking=True.
rc = getattr(delta, "reasoning_content", None)
if rc:
if not printed_reasoning_header:
printed_reasoning_header = True
print("Chain of thought: ", end="", flush=True)
cot.append(rc)
print(rc, end="", flush=True)
continue
content = getattr(delta, "content", None)
if content:
if not printed_answer_header:
printed_answer_header = True
print("\nAnswer: ", end="", flush=True)
ans.append(content)
print(content, end="", flush=True)
print() # newline after streaming completes
chain_of_thought = "".join(cot) if cot else None
final_answer = "".join(ans) if ans else None
# Fallback for rare cases where the answer lands in reasoning_content
if (not final_answer or final_answer.strip() == "") and chain_of_thought:
if "</think>" in chain_of_thought:
final_answer = chain_of_thought.split("</think>", 1)[-1].strip()
else:
final_answer = chain_of_thought.strip()
return final_answer, chain_of_thought
print("=== With reasoning (thinking=True) ===")
final_on, cot_on = stream_chat(thinking=True)
print("\n=== Without reasoning (thinking=False) ===")
final_off, cot_off = stream_chat(thinking=False)
### Minimal Response Shape (non-streaming)
{
"id": "chatcmpl-β¦",
"object": "chat.completion",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "β¦present when thinking=trueβ¦",
"content": "β¦final answerβ¦"
},
"finish_reason": "stop"
}]
}
That's it: send your messages, set `"thinking": true|false` per request, and read `reasoning_content` (when enabled) and `content` for the final answer.
## Embeddings
The embeddings endpoint converts text into vector representations for similarity search, clustering, and other machine learning tasks.
### Available Models
Use the /v1/models endpoint to find embedding models. Currently available:
- `demo/multilingual-e5-large-instruct` - Multilingual embedding model supporting many languages
### Example: Generate Embeddings
Python:
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
response = client.embeddings.create(
model="demo/multilingual-e5-large-instruct",
input="Machine learning is transforming the world"
)
embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
print(f"Tokens used: {response.usage.total_tokens}")
Equivalent bash:
curl https://api.chat.nebul.io/v1/embeddings \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/multilingual-e5-large-instruct",
"input": "Machine learning is transforming the world"
}'
### Response Format
{
"model": "intfloat/multilingual-e5-large-instruct",
"data": [
{
"embedding": [0.0057, 0.0143, -0.0203, ...],
"index": 0,
"object": "embedding"
}
],
"object": "list",
"usage": {
"completion_tokens": 0,
"prompt_tokens": 42,
"total_tokens": 42
}
}
### Multiple Inputs
You can process multiple texts in a single request:
Python:
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
texts = [
"Happy customer review",
"Error: Connection failed",
"Neutral product description"
]
response = client.embeddings.create(
model="demo/multilingual-e5-large-instruct",
input=texts
)
for i, data in enumerate(response.data):
print(f"Text {i+1}: {len(data.embedding)} dimensions")
Equivalent bash:
curl https://api.chat.nebul.io/v1/embeddings \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/multilingual-e5-large-instruct",
"input": ["Happy customer review", "Error: Connection failed", "Neutral product description"]
}'
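To use these vectors for similarity search, compare them with cosine similarity. A minimal sketch with no extra dependencies, continuing from the multiple-inputs example above (`response` holds the three embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vectors = [d.embedding for d in response.data]
for i in range(1, len(vectors)):
    print(f"Similarity of text 1 and text {i + 1}: {cosine_similarity(vectors[0], vectors[i]):.4f}")
```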
### Error Handling
Invalid Model Error:
{
"error": {
"message": "Team not allowed to access model. Team=f697846b-bc9b-4fc0-8e0e-3c2ee0162289, Model=invalid-model. Allowed team models = ['demo/gemma-3-27b-it', 'demo/multilingual-e5-large-instruct']",
"type": "team_model_access_denied",
"param": "model",
"code": "401"
}
}
Authentication Error:
{
"error": {
"message": "Authentication Error, Invalid proxy server token passed. valid_token=None.",
"type": "auth_error",
"param": "None",
"code": "401"
}
}
Python Error Handling Example:
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
try:
response = client.embeddings.create(
model="demo/multilingual-e5-large-instruct",
input="Your text here"
)
embedding = response.data[0].embedding
print(f"Success: {len(embedding)} dimensions")
except Exception as e:
print(f"Error: {e}")
---
## Support
For support, reach out to your enterprise contact or email: `engineering@nebul.com`.
---