# Getting Started with the Private Inference API
## Overview
Private Inference API is a secure inferencing service run on Nebul's private NeoCloud, ensuring compliance and data protection. It offers open-source and fine-tuned AI models, ideal for industries handling sensitive information, with seamless integration and transparent pricing.
## Prerequisites
- API Key for authentication.
- Familiarity with OpenAI-compatible APIs (e.g., GPT-4, ChatCompletion, Completion endpoints).
- curl, Postman, or any HTTP client (e.g., Python `requests`, OpenAI SDK).
## Authentication
Authenticate requests using a Bearer token:
Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY
Note: API keys begin with `sk-` (e.g., `sk-dummy`).
## Base URL
https://api.chat.nebul.io/v1
This base URL mimics the OpenAI format. All endpoints align with OpenAI's structure to ensure minimal friction for integration.
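Because of this, any OpenAI-compatible client can simply be pointed at the base URL above. A minimal sketch with the official Python SDK (the full examples later in this guide expand on this):

```python
from openai import OpenAI

# Point the standard OpenAI client at the Private Inference API.
client = OpenAI(
    api_key="YOUR_PRIVATE_INFERENCE_API_KEY",
    base_url="https://api.chat.nebul.io/v1",
)
```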
## Quickstart Example (Chat Completion)
### Request
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [{"role": "user", "content": "What is the capital of the Netherlands?"}],
"temperature": 0.7
}'
### Response
{
"id": "chatcmpl-abc123xyz",
"object": "chat.completion",
"created": 1685580297,
"model": "demo/gemma-3-27b-it",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of the Netherlands is Amsterdam. However, The Hague is the seat of government and home to the Dutch parliament."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 12,
"completion_tokens": 25,
"total_tokens": 37
}
}
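If you prefer raw HTTP over the OpenAI SDK, the same request can be sent with Python's `requests` library. A minimal sketch (plain HTTP against the endpoint above, no SDK required):

```python
import os
import requests

API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")

# Same chat completion request as the curl example above, sent as plain HTTP.
resp = requests.post(
    "https://api.chat.nebul.io/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "model": "demo/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "What is the capital of the Netherlands?"}],
        "temperature": 0.7,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```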
## Supported Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat-based model responses |
| `/v1/completions` | POST | Classic completion model endpoint (example below) |
| `/v1/embeddings` | POST | Generate vector embeddings |
| `/v1/models` | GET | List available models |
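The chat endpoint is demonstrated throughout this guide. As a sketch of the classic completions endpoint, a call could look like the following; whether a given model accepts plain-text prompts depends on your deployment, so check `/v1/models` and your model's documentation first (the model name below is only illustrative):

```python
import os
from openai import OpenAI

API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")

# Classic (non-chat) completion; the model must support plain-text prompts.
response = client.completions.create(
    model="demo/gemma-3-27b-it",
    prompt="The capital of the Netherlands is",
    max_tokens=16,
)
print(response.choices[0].text)
```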
## Using Python (via OpenAI SDK)
### Example: Listing Available Models
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
models = client.models.list()
for model in models.data:
print(model.id)
Equivalent bash:
curl -X GET "https://api.chat.nebul.io/v1/models" \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY"
### Example: Sending an Image for Analysis
import base64
import os
from openai import OpenAI
def encode_image_to_base64(image_path):
    """Read an image file and return a base64-encoded string, or None if the file is missing."""
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except FileNotFoundError:
        print(f"Error: image file not found at {image_path}")
        return None
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY") # Use environment variable for API key
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
image_path = "your_path_to_image.jpg"
base64_image = encode_image_to_base64(image_path)
if base64_image:
stream = client.chat.completions.create(
model="demo/gemma-3-27b-it",
messages=[
{"role": "user", "content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
]}
],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta and chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print()
Equivalent bash:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "Describe this image"},
{"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,PASTE_YOUR_BASE64_IMAGE_HERE"}}
]}
],
"stream": true
}'
### Notes
- Model selection: Make sure the model you use supports vision/image input. Use `/v1/models` to list available models.
- Image format: This example assumes JPEG. For PNG, change the MIME type to `image/png` (or guess it automatically, as in the sketch after these notes).
- Base64 encoding: The image is encoded and sent as a data URL in the message content.
- File path: Replace `"your_path_to_image.jpg"` with the actual path to your image file on your local system.
- Error handling: The helper returns `None` when the image file is missing; wrap the API call in `try`/`except` if you also want to handle API errors.
## Streaming & Non-Streaming Usage
### Minimal Python Example (Non-Streaming)
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY") # Use environment variable for API key
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
response = client.chat.completions.create(
model="demo/gemma-3-27b-it",
messages=[{"role": "user", "content": "What is the capital of France?"}],
stream=False,
)
print(response.choices[0].message.content.strip())
Equivalent bash:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"stream": false
}'
### Minimal Python Example (Streaming)
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY") # Use environment variable for API key
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
stream = client.chat.completions.create(
model="demo/gemma-3-27b-it",
messages=[{"role": "user", "content": "Write a short sentence."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta and chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
print()
Equivalent bash:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/gemma-3-27b-it",
"messages": [{"role": "user", "content": "Write a short sentence."}],
"stream": true
}'
Notes:
- Replace the API key with your own.
- The streaming curl response will be in Server-Sent Events (SSE) format (see the sketch after these notes).
- Only use models listed in `/v1/models` for best compatibility.
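For illustration, here is a sketch of reading that SSE stream without the SDK, using Python's `requests`; it assumes the usual OpenAI-compatible convention where each event line starts with `data: ` and the stream ends with `data: [DONE]`:

```python
import json
import os
import requests

API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")

resp = requests.post(
    "https://api.chat.nebul.io/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "model": "demo/gemma-3-27b-it",
        "messages": [{"role": "user", "content": "Write a short sentence."}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue  # skip SSE keep-alive blank lines
    text = line.decode("utf-8")
    if not text.startswith("data: "):
        continue
    payload = text[len("data: "):]
    if payload == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    if delta.get("content"):
        print(delta["content"], end="", flush=True)
print()
```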
## Error Response Format
API errors follow the LiteLLM format:
{
"error": {
"message": "Authentication Error, LiteLLM Virtual Key expected. Received=INVALID_API_KEY, expected to start with 'sk-'.",
"type": "auth_error",
"param": "None",
"code": "401"
}
}
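When using the OpenAI Python SDK, these errors surface as exceptions rather than raw JSON. A minimal sketch of inspecting them (using the exception classes from the v1 `openai` package):

```python
import openai
from openai import OpenAI

client = OpenAI(api_key="INVALID_API_KEY", base_url="https://api.chat.nebul.io/v1")

try:
    client.chat.completions.create(
        model="demo/gemma-3-27b-it",
        messages=[{"role": "user", "content": "ping"}],
    )
except openai.AuthenticationError as e:
    # 401 responses, such as the invalid-key error shown above, end up here.
    print("Authentication error:", e)
except openai.APIStatusError as e:
    # Other non-2xx responses; the body follows the error format shown above.
    print("API error", e.status_code, e.response.text)
```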
## Qwen-3 Tool Calling
Qwen-3 models on Nebul already run with automatic function calling.
All you do is add a "tools" array to your normal request; the endpoint takes care of the rest.
### Same Auth & URL as the other endpoints
| Item | Value |
|---|---|
| Base URL | https://api.chat.nebul.io/v1 |
| Header | Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY |
### Step-by-Step (curl)
1. Send your prompt + tool spec:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/qwen-3-14b",
"messages": [
{ "role": "user",
"content": "Remind me to submit the report at 17:00." }
],
"tools": [{
"type": "function",
"function": {
"name": "set_reminder",
"description": "Create a reminder",
"parameters": {
"type": "object",
"properties": {
"text": { "type": "string" },
"time": { "type": "string" }
},
"required": ["text", "time"]
}
}
}]
}'
2. Parse the first reply:
{
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_0",
"function": {
"name": "set_reminder",
"arguments": "{ \"text\": \"Submit the report\", \"time\": \"17:00\" }"
}
}]
},
"finish_reason": "tool_calls"
}]
}
3. Execute the function in your app and return the result:
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/qwen-3-14b",
"messages": [
{ "role": "user", "content": "Remind me to submit the report at 17:00." },
{ "role": "assistant", "content": null,
"tool_calls": [{
"id": "call_0",
"function": {
"name": "set_reminder",
"arguments": "{ \"text\": \"Submit the report\", \"time\": \"17:00\" }"
}
}]
},
{ "role": "tool",
"tool_call_id": "call_0",
"name": "set_reminder",
"content": "{ \"status\": \"Reminder set for 17:00.\" }"
}
]
}'
4. Get the final, human-readable answer:
{
"choices": [{
"message": {
"role": "assistant",
"content": "Got it! Iβve scheduled the reminder for 17:00. Anything else?"
}
}]
}
### End-to-End in Python
import json
from openai import OpenAI
client = OpenAI(
api_key="YOUR_PRIVATE_INFERENCE_API_KEY",
base_url="https://api.chat.nebul.io/v1"
)
tools = [{
"type": "function",
"function": {
"name": "set_reminder",
"description": "Create a reminder",
"parameters": {
"type": "object",
"properties": {
"text": {"type": "string"},
"time": {"type": "string"}
},
"required": ["text", "time"]
}
}
}]
msgs = [{"role": "user",
"content": "Remind me to call Alice at 15:00."}]
# 1. Ask the model
r1 = client.chat.completions.create(
model="demo/qwen-3-14b",
messages=msgs,
tools=tools
)
call = r1.choices[0].message.tool_calls[0]
msgs.append(r1.choices[0].message.model_dump())
# 2. Run the function yourself
args = json.loads(call.function.arguments)
result = {"status": f"Reminder set for {args['time']}."}
# 3. Send back the result, get the final reply
msgs.append({
"role": "tool",
"tool_call_id": call.id,
"name": call.function.name,
"content": json.dumps(result)
})
r2 = client.chat.completions.create(
model="demo/qwen-3-14b",
messages=msgs
)
print(r2.choices[0].message.content)
### Quick Troubleshooting
| Symptom | Check |
|---|---|
| No `tool_calls` returned | Ensure you passed the `"tools"` array. |
| Arguments not JSON | Validate your JSON Schema (type, required, spelling). |
| Multiple calls | Send one `role:"tool"` message per call before the final request (see the sketch after this table). |
| Need streaming | Add "stream": true to either request. |
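For the "multiple calls" case, the sketch below loops over every tool call before asking for the final answer. It continues from the end-to-end Python example above (so `client`, `msgs`, `tools`, and `r1` are assumed to exist), and `run_tool` is a placeholder dispatcher you would implement yourself:

```python
import json

# r1 is the first response from client.chat.completions.create(..., tools=tools)
assistant_msg = r1.choices[0].message
msgs.append(assistant_msg.model_dump())

# Return one role:"tool" message per tool call before requesting the final reply.
for call in assistant_msg.tool_calls or []:
    args = json.loads(call.function.arguments)
    result = run_tool(call.function.name, args)  # placeholder: your own dispatcher
    msgs.append({
        "role": "tool",
        "tool_call_id": call.id,
        "name": call.function.name,
        "content": json.dumps(result),
    })

r2 = client.chat.completions.create(model="demo/qwen-3-14b", messages=msgs)
print(r2.choices[0].message.content)
```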
## DeepSeek-V3.1 Chat API: Requests & Reasoning Toggle (Streaming / Non-Streaming)
- Base URL: `https://api.chat.nebul.io/v1`
- Auth: `Authorization: Bearer YOUR_PRIVATE_API_KEY`
- Model: `deepseek-ai/DeepSeek-V3.1`
- Toggle reasoning: `"chat_template_kwargs": {"thinking": true | false}`
### Non-Streaming (Chat Completions)
#### Python (reasoning ON)
from openai import OpenAI
client = OpenAI(
api_key="YOUR_PRIVATE_API_KEY",
base_url="https://api.chat.nebul.io/v1",
)
model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}]
resp = client.chat.completions.create(
model=model,
messages=messages,
extra_body={"chat_template_kwargs": {"thinking": True}},
)
print("reasoning_content:", resp.choices[0].message.reasoning_content)
print("content:", resp.choices[0].message.content)
#### Python (reasoning OFF)
from openai import OpenAI
client = OpenAI(
api_key="YOUR_PRIVATE_API_KEY",
base_url="https://api.chat.nebul.io/v1",
)
model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}]
resp = client.chat.completions.create(
model=model,
messages=messages,
extra_body={"chat_template_kwargs": {"thinking": False}},
)
print("content:", resp.choices[0].message.reasoning_content)
Note: for now, even when reasoning is turned off, the content is still returned in the `reasoning_content` field.
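Until that changes, a small helper can pick the answer from whichever field is populated. A sketch, continuing from the non-streaming examples above (`resp` is the response object):

```python
def final_answer(message):
    """Return the model's answer, preferring content and falling back to reasoning_content."""
    if getattr(message, "content", None):
        return message.content
    return getattr(message, "reasoning_content", None)

print("content:", final_answer(resp.choices[0].message))
```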
#### cURL (reasoning ON)
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.1",
"messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
"chat_template_kwargs": {"thinking": true}
}'
#### cURL (reasoning OFF)
curl https://api.chat.nebul.io/v1/chat/completions \
-H "Authorization: Bearer YOUR_PRIVATE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V3.1",
"messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
"chat_template_kwargs": {"thinking": false}
}'
### Streaming (Python)
Use stream=True. When reasoning is enabled, chunks may include delta.reasoning_content; final answer chunks arrive in delta.content.
from openai import OpenAI
# Point to your OpenAI-compatible endpoint and auth.
client = OpenAI(
api_key="YOUR_PRIVATE_API_KEY",
base_url="https://api.chat.nebul.io/v1",
)
model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}]
def stream_chat(thinking: bool):
"""Stream a chat completion and return (final_answer, chain_of_thought)."""
stream = client.chat.completions.create(
model=model,
messages=messages,
stream=True,
extra_body={"chat_template_kwargs": {"thinking": thinking}},
)
cot = []
ans = []
printed_reasoning_header = False
printed_answer_header = False
for chunk in stream:
delta = chunk.choices[0].delta
# reasoning_content is an extra attribute on the SDK delta object, present when thinking=True.
rc = getattr(delta, "reasoning_content", None)
if rc:
if not printed_reasoning_header:
printed_reasoning_header = True
print("Chain of thought: ", end="", flush=True)
cot.append(rc)
print(rc, end="", flush=True)
continue
content = getattr(delta, "content", None)
if content:
if not printed_answer_header:
printed_answer_header = True
print("\nAnswer: ", end="", flush=True)
ans.append(content)
print(content, end="", flush=True)
print() # newline after streaming completes
chain_of_thought = "".join(cot) if cot else None
final_answer = "".join(ans) if ans else None
# Fallback for rare cases where the answer lands in reasoning_content
if (not final_answer or final_answer.strip() == "") and chain_of_thought:
if "</think>" in chain_of_thought:
final_answer = chain_of_thought.split("</think>", 1)[-1].strip()
else:
final_answer = chain_of_thought.strip()
return final_answer, chain_of_thought
print("=== With reasoning (thinking=True) ===")
final_on, cot_on = stream_chat(thinking=True)
print("\n=== Without reasoning (thinking=False) ===")
final_off, cot_off = stream_chat(thinking=False)
### Minimal Response Shape (non-streaming)
{
"id": "chatcmpl-β¦",
"object": "chat.completion",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "β¦present when thinking=trueβ¦",
"content": "β¦final answerβ¦"
},
"finish_reason": "stop"
}]
}
That's it: send your messages, set `"thinking": true|false` per request, and read `reasoning_content` (when enabled) and `content` for the final answer.
## Embeddings
The embeddings endpoint converts text into vector representations for similarity search, clustering, and other machine learning tasks.
### Available Models
Use the /v1/models endpoint to find embedding models. Currently available:
- `demo/multilingual-e5-large-instruct` - Multilingual embedding model supporting many languages
### Example: Generate Embeddings
Python:
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
response = client.embeddings.create(
model="demo/multilingual-e5-large-instruct",
input="Machine learning is transforming the world"
)
embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
print(f"Tokens used: {response.usage.total_tokens}")
Equivalent bash:
curl https://api.chat.nebul.io/v1/embeddings \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/multilingual-e5-large-instruct",
"input": "Machine learning is transforming the world"
}'
### Response Format
{
"model": "intfloat/multilingual-e5-large-instruct",
"data": [
{
"embedding": [0.0057, 0.0143, -0.0203, ...],
"index": 0,
"object": "embedding"
}
],
"object": "list",
"usage": {
"completion_tokens": 0,
"prompt_tokens": 42,
"total_tokens": 42
}
}
### Multiple Inputs
You can process multiple texts in a single request:
Python:
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
texts = [
"Happy customer review",
"Error: Connection failed",
"Neutral product description"
]
response = client.embeddings.create(
model="demo/multilingual-e5-large-instruct",
input=texts
)
for i, data in enumerate(response.data):
print(f"Text {i+1}: {len(data.embedding)} dimensions")
Equivalent bash:
curl https://api.chat.nebul.io/v1/embeddings \
-H "Authorization: Bearer YOUR_PRIVATE_INFERENCE_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "demo/multilingual-e5-large-instruct",
"input": ["Happy customer review", "Error: Connection failed", "Neutral product description"]
}'
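To use these vectors for similarity search, compare them with cosine similarity. A minimal sketch with no extra dependencies, continuing from the multiple-inputs example above (`response` holds the three embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vectors = [d.embedding for d in response.data]
for i in range(1, len(vectors)):
    print(f"Similarity of text 1 and text {i + 1}: {cosine_similarity(vectors[0], vectors[i]):.4f}")
```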
### Error Handling
Invalid Model Error:
{
"error": {
"message": "Team not allowed to access model. Team=f697846b-bc9b-4fc0-8e0e-3c2ee0162289, Model=invalid-model. Allowed team models = ['demo/gemma-3-27b-it', 'demo/multilingual-e5-large-instruct']",
"type": "team_model_access_denied",
"param": "model",
"code": "401"
}
}
Authentication Error:
{
"error": {
"message": "Authentication Error, Invalid proxy server token passed. valid_token=None.",
"type": "auth_error",
"param": "None",
"code": "401"
}
}
Python Error Handling Example:
import os
from openai import OpenAI
API_KEY = os.environ.get("YOUR_PRIVATE_INFERENCE_API_KEY")
client = OpenAI(api_key=API_KEY, base_url="https://api.chat.nebul.io/v1")
try:
response = client.embeddings.create(
model="demo/multilingual-e5-large-instruct",
input="Your text here"
)
embedding = response.data[0].embedding
print(f"Success: {len(embedding)} dimensions")
except Exception as e:
print(f"Error: {e}")
---
## Support
For support, reach out to your enterprise contact or email: `engineering@nebul.com`.
---