Structured Output & JSON
Force models to respond with valid JSON following a specific schema. This makes it easy to parse model outputs and integrate them into your application logic.
Overview
By default, models respond with natural language text. Using the response_format parameter, you can constrain the output to:
- JSON Schema: Follows a strict schema you provide
- JSON Object: Valid JSON with no specific schema
This is useful for:
- Extracting structured data from text
- Building pipelines that require typed outputs
- Ensuring consistent response formats across requests
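As a quick sketch, the two modes correspond to these response_format payloads, shown here as plain Python dicts (the "person" schema is a hypothetical example, not part of the API):

```python
# JSON Schema mode: output must conform to the provided schema.
# The "person" schema below is a made-up example.
json_schema_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "person",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
}

# JSON Object mode: any well-formed JSON, no schema enforced
json_object_format = {"type": "json_object"}
```

Either dict can be passed as the response_format argument to chat.completions.create, as shown in the full examples below.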
Nebul enforces a strict limit of 200 flattened fields in response_format. Requests that exceed this limit are blocked.
To avoid performance issues and schema-size failures, see Schema Size, Memory Usage, and Reliability below.
JSON Schema Mode
Provide a JSON schema and the model will produce output that conforms to it.
Python

```python
import json
from typing import List, Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1",
)

# Define your schema using Pydantic
class MovieInfo(BaseModel):
    title: str
    year: int
    director: str
    genres: List[Literal["drama", "comedy", "thriller", "sci-fi", "horror", "action"]]
    rating: float

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[
        {
            "role": "system",
            "content": "Extract movie information from the user's description. Respond only with valid JSON matching the schema.",
        },
        {
            "role": "user",
            "content": "The Dark Knight from 2008, directed by Christopher Nolan. It's an action thriller rated 9.0",
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "movie_info", "schema": MovieInfo.model_json_schema()},
    },
)

print("JSON output")
print(response.choices[0].message.content)

print("Filtered output")
movie = json.loads(response.choices[0].message.content)
print(f"{movie['title']} ({movie['year']}) - {movie['rating']}/10")
```

cURL

```shell
curl https://api.inference.nebul.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key-here" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {
        "role": "system",
        "content": "Extract movie information from the user'\''s description. Respond only with valid JSON matching the schema."
      },
      {
        "role": "user",
        "content": "The Dark Knight from 2008, directed by Christopher Nolan. It'\''s an action thriller rated 9.0"
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "movie_info",
        "schema": {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
            "director": {"type": "string"},
            "genres": {
              "type": "array",
              "items": {
                "type": "string",
                "enum": ["drama", "comedy", "thriller", "sci-fi", "horror", "action"]
              }
            },
            "rating": {"type": "number"}
          },
          "required": ["title", "year", "director", "genres", "rating"]
        }
      }
    }
  }'
```
Example Output
```json
{
  "title": "The Dark Knight",
  "year": 2008,
  "director": "Christopher Nolan",
  "genres": ["action", "thriller"],
  "rating": 9.0
}
```
JSON Object Mode
When you need valid JSON but don't want to enforce a specific schema, use json_object mode. The model will produce well-formed JSON based on your prompt instructions.
Python

```python
import json

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[
        {
            "role": "system",
            "content": "You are an API that returns product information as JSON. Include name, price, and availability.",
        },
        {"role": "user", "content": "Tell me about the iPhone 15 Pro"},
    ],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
print(data)
```

cURL

```shell
curl https://api.inference.nebul.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key-here" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {
        "role": "system",
        "content": "You are an API that returns product information as JSON. Include name, price, and availability."
      },
      {
        "role": "user",
        "content": "Tell me about the iPhone 15 Pro"
      }
    ],
    "response_format": {"type": "json_object"}
  }'
```
Always instruct the model to output JSON in your system or user prompt, even when using response_format. This improves reliability and output quality.
Schema Best Practices
Use Descriptive Field Names
```python
from pydantic import BaseModel

# Good - self-documenting
class OrderSummary(BaseModel):
    order_id: str
    total_amount_usd: float
    items_count: int

# Avoid - ambiguous
class Order(BaseModel):
    id: str
    total: float
    count: int
```
Include Field Descriptions
```python
from typing import List, Literal

from pydantic import BaseModel, Field

class CustomerFeedback(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Overall sentiment of the customer feedback"
    )
    key_points: List[str] = Field(
        description="Main points or concerns raised by the customer"
    )
    urgency: int = Field(
        description="Urgency level from 1 (low) to 5 (critical)",
        ge=1,
        le=5,
    )
```
Handle Potential Refusals
Some requests may result in the model refusing to respond. Always check for refusals:
```python
output = response.choices[0].message

if output.refusal:
    print(f"Model refused: {output.refusal}")
elif output.content:
    try:
        data = json.loads(output.content)
        # Process data
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
```
Schema Size, Memory Usage, and Reliability
When using response_format, it’s important to remember that the provided schema is not just a validation hint — it becomes part of the request itself and actively participates in how the model generates output.
In json_schema mode, the schema is serialized, transmitted, kept in memory, and consulted throughout the entire decoding process. This allows the model to produce strongly typed, well-formed JSON, but it also means that very large or complex schemas can significantly increase memory usage and processing time.
Even if the model’s final output is small, the cost of enforcing the schema can be large.
As schema size grows, you may notice slower request initialization, longer generation times, or reduced reliability. In more extreme cases, requests may fail due to memory exhaustion or internal timeouts. These failures can be difficult to diagnose because they are often caused by schema complexity rather than the prompt or the model output itself.
- json_schema mode actively constrains decoding, not just validation
- Schema complexity affects memory usage and latency even for small outputs
- Reliability issues may originate from the schema, not the prompt
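To make the asymmetry concrete, the sketch below compares the serialized size of a hypothetical enum-heavy schema against a typical output that conforms to it. The schema and the generated country codes are made-up examples:

```python
import json

# Hypothetical schema: a single field, but a large enum inflates the payload
schema = {
    "type": "object",
    "properties": {
        "country_code": {
            "type": "string",
            # Stand-in for a real code list (e.g. ISO 3166); 50 entries add up
            "enum": [f"C{i:02d}" for i in range(50)],
        }
    },
    "required": ["country_code"],
}

output = {"country_code": "C07"}  # a conforming model response

schema_bytes = len(json.dumps(schema))
output_bytes = len(json.dumps(output))

# The schema is transmitted and held in memory for the whole generation,
# even though the final output is a couple of dozen bytes.
print(f"schema: {schema_bytes} bytes, output: {output_bytes} bytes")
```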
What Causes Schemas to Become Large
Schemas often grow unintentionally. Deep nesting, large enumerations, or heavily reused model definitions can all expand the serialized schema far beyond what is obvious from the original Python or JSON definition.
This commonly happens when Pydantic models are reused directly from application logic. Database models, API response objects, or domain models frequently contain far more structure than is required for inference.
Schemas that rely heavily on $defs, anyOf, or oneOf are particularly expensive, as they force the model to reason over many possible output shapes during generation.
- Deep nesting, large enums, and unions are the biggest contributors
- Reusing “full” application or database models often causes schema bloat
- $defs combined with anyOf/oneOf can rapidly increase complexity
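A rough way to see this effect without any tooling is to count the leaf values in a schema, since every alternative in a union contributes its full shape. The schemas and the count_leaves helper below are illustrative examples, not part of any SDK:

```python
from typing import Any

def count_leaves(obj: Any) -> int:
    """Count leaf values in a JSON-like structure (a rough complexity proxy)."""
    if isinstance(obj, dict):
        return sum(count_leaves(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_leaves(v) for v in obj)
    return 1

# A simple object: only a handful of leaves
simple = {
    "type": "object",
    "properties": {"id": {"type": "string"}, "total": {"type": "number"}},
}

# The same idea with one field turned into a union over several shapes:
# the model must now track every alternative during decoding
address = {
    "type": "object",
    "properties": {"street": {"type": "string"}, "city": {"type": "string"}},
}
union = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "contact": {
            "anyOf": [
                {"type": "string"},
                address,
                {"type": "object", "properties": {"email": {"type": "string"}}},
            ]
        },
    },
}

print(count_leaves(simple), count_leaves(union))
```

Each branch added to the anyOf grows the count by the full size of that branch, which is why unions and $defs-heavy schemas expand so quickly.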
Designing Schemas for Stability
For reliable structured output, schemas should be treated as interfaces, not full representations of internal data models.
Include only the fields required for downstream logic. Prefer identifiers or compact categorical values over deeply embedded objects. If a constraint can be validated safely in your application code, it is often better to keep the schema simpler and enforce that constraint outside the model.
When extracting large or complex structures, consider splitting the task across multiple calls with smaller schemas rather than enforcing everything in a single request.
- Keep schemas minimal and task-focused
- Prefer IDs or references over embedded objects
- Split complex extractions into multiple smaller calls
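One way to sketch the multi-call pattern, assuming a call_model(messages, response_format) helper that wraps your client and returns the message content string (hypothetical, not part of the OpenAI SDK):

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical wrapper around client.chat.completions.create that
# returns response.choices[0].message.content as a string.
CallModel = Callable[[List[Dict[str, str]], Dict[str, Any]], str]

# Two small, task-focused schemas instead of one monolithic invoice schema
HEADER_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice_header",
        "schema": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}, "date": {"type": "string"}},
            "required": ["invoice_id", "date"],
        },
    },
}

LINES_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice_lines",
        "schema": {
            "type": "object",
            "properties": {
                "lines": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {"sku": {"type": "string"}, "qty": {"type": "integer"}},
                        "required": ["sku", "qty"],
                    },
                }
            },
            "required": ["lines"],
        },
    },
}

def extract_invoice(text: str, call_model: CallModel) -> Dict[str, Any]:
    """Run two small-schema calls and merge the results in application code."""
    prompt = [{"role": "user", "content": text}]
    header = json.loads(call_model(prompt, HEADER_SCHEMA))
    lines = json.loads(call_model(prompt, LINES_SCHEMA))
    return {**header, **lines}
```

Each call carries a schema that is cheap to enforce, and the merge happens in your own code where no constrained decoding is involved.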
Inspecting and Debugging the response_format
Schema size is not always obvious, especially when using Pydantic models with nested references. Inspecting the final response_format payload can help identify unexpected complexity before it causes issues in production.
The OpenAI SDK exposes an internal helper that converts a Python type into the final schema sent to the API.
- Inspect the generated response_format, not just your model definition
- Nested Pydantic models can produce much larger schemas than expected
Flattening the Generated Schema (Python)
```python
from typing import Any, Dict

from openai.lib._parsing import type_to_response_format_param

def flatten_json(obj: Any, prefix: str = "", out: Dict[str, Any] | None = None) -> Dict[str, Any]:
    """Flatten a JSON-like structure (dicts + lists) into a single-level dict."""
    if out is None:
        out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}.{k}" if prefix else str(k)
            flatten_json(v, key, out)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            key = f"{prefix}[{i}]"
            flatten_json(v, key, out)
    else:
        out[prefix] = obj
    return out

my_response_format_flattened = flatten_json(
    type_to_response_format_param(MY_PYDANTIC_MODEL_CLASS)
)
```
Flattening the schema makes it easier to see how deep it goes, how many definitions are involved, and where complexity is accumulating — especially when schemas are composed from multiple nested models.
- Flattening exposes hidden depth and repetition
- Useful for spotting large enums and deeply nested paths
- Useful for checking what hidden configs/definitions are sent with the request
Checking Field Count Limits
To protect performance and reliability, Nebul enforces a hard limit on schema complexity.
Requests are blocked when the flattened response_format contains more than 200 fields. This prevents extremely large schemas from consuming excessive memory or causing long-running constrained decoding.
You can check this locally before sending a request:
```python
from openai.lib._parsing import type_to_response_format_param

rf = type_to_response_format_param(MyPydanticModel)
flat = flatten_json(rf)
field_count = len(flat)

print(f"Flattened response_format fields: {field_count}")

if field_count > 200:
    raise ValueError(
        f"response_format too large: {field_count} fields (limit: 200). "
        "Reduce nesting, remove unused fields, or split the task into multiple calls."
    )
```
- Nebul blocks schemas with more than 200 flattened fields
- Always validate locally to avoid runtime request failures
- If you hit the limit, reduce nesting or split the task