Structured Output & JSON

Force models to respond with valid JSON following a specific schema. This makes it easy to parse model outputs and integrate them into your application logic.

Overview

By default, models respond with natural language text. Using the response_format parameter, you can constrain the output to:

  • JSON Schema: Follows a strict schema you provide
  • JSON Object: Valid JSON with no specific schema

This is useful for:

  • Extracting structured data from text
  • Building pipelines that require typed outputs
  • Ensuring consistent response formats across requests
Response Format Limit

Nebul enforces a strict limit of 200 flattened fields in response_format. Requests that exceed this limit are blocked.

To avoid performance issues and schema-size failures, see the Schema Size, Memory Usage, and Reliability guidance later on this page.

JSON Schema Mode

Provide a JSON schema and the model will produce output that conforms to it.

python
import json
from typing import List, Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1"
)

# Define your schema using Pydantic
class MovieInfo(BaseModel):
    title: str
    year: int
    director: str
    genres: List[Literal["drama", "comedy", "thriller", "sci-fi", "horror", "action"]]
    rating: float

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[
        {
            "role": "system",
            "content": "Extract movie information from the user's description. Respond only with valid JSON matching the schema.",
        },
        {
            "role": "user",
            "content": "The Dark Knight from 2008, directed by Christopher Nolan. It's an action thriller rated 9.0",
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "movie_info", "schema": MovieInfo.model_json_schema()},
    },
)

print("JSON output")
print(response.choices[0].message.content)

print("Parsed output")
movie = json.loads(response.choices[0].message.content)
print(f"{movie['title']} ({movie['year']}) - {movie['rating']}/10")

Example Output

json
{
  "title": "The Dark Knight",
  "year": 2008,
  "director": "Christopher Nolan",
  "genres": ["action", "thriller"],
  "rating": 9.0
}
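
Instead of calling json.loads directly, you can validate the raw output back against the same Pydantic model, which catches missing or mistyped fields at parse time. A minimal sketch, reusing the MovieInfo model and the example output above:

```python
from typing import List, Literal

from pydantic import BaseModel, ValidationError

class MovieInfo(BaseModel):
    title: str
    year: int
    director: str
    genres: List[Literal["drama", "comedy", "thriller", "sci-fi", "horror", "action"]]
    rating: float

# Raw model output, as in the example above
raw = (
    '{"title": "The Dark Knight", "year": 2008, "director": "Christopher Nolan", '
    '"genres": ["action", "thriller"], "rating": 9.0}'
)

try:
    # model_validate_json parses and validates in one step (Pydantic v2)
    movie = MovieInfo.model_validate_json(raw)
    print(f"{movie.title} ({movie.year})")
except ValidationError as e:
    print(f"Output did not match schema: {e}")
```

This gives you typed attribute access (movie.title, movie.rating) instead of dictionary lookups.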

JSON Object Mode

When you need valid JSON but don't want to enforce a specific schema, use json_object mode. The model will produce well-formed JSON based on your prompt instructions.

python
import json

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1"
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[
        {
            "role": "system",
            "content": "You are an API that returns product information as JSON. Include name, price, and availability."
        },
        {
            "role": "user",
            "content": "Tell me about the iPhone 15 Pro"
        }
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
Tip

Always instruct the model to output JSON in your system or user prompt, even when using response_format. This improves reliability and output quality.
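
Even with response_format set, defensive parsing helps. One common pattern, sketched here with a hypothetical parse_json_with_retry helper (not part of any SDK), is to retry the completion call when the output fails to parse:

```python
import json

def parse_json_with_retry(generate, max_attempts=3):
    """Call `generate()` (any function returning a model's text output)
    until it yields valid JSON, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        text = generate()
        try:
            return json.loads(text)
        except json.JSONDecodeError as e:
            last_error = e
    raise ValueError(f"No valid JSON after {max_attempts} attempts: {last_error}")

# Stand-in for a real completion call: fails once, then succeeds
attempts = iter(['not json', '{"name": "iPhone 15 Pro", "price": 999}'])
data = parse_json_with_retry(lambda: next(attempts))
print(data)
```

In practice, `generate` would wrap your client.chat.completions.create call.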

Schema Best Practices

Use Descriptive Field Names

python
from pydantic import BaseModel

# Good - self-documenting
class OrderSummary(BaseModel):
    order_id: str
    total_amount_usd: float
    items_count: int

# Avoid - ambiguous
class Order(BaseModel):
    id: str
    total: float
    count: int

Include Field Descriptions

python
from typing import List, Literal

from pydantic import BaseModel, Field

class CustomerFeedback(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Overall sentiment of the customer feedback"
    )
    key_points: List[str] = Field(
        description="Main points or concerns raised by the customer"
    )
    urgency: int = Field(
        description="Urgency level from 1 (low) to 5 (critical)",
        ge=1,
        le=5
    )
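
Descriptions and numeric bounds declared with Field are carried into the generated JSON schema, so the model sees them at generation time. A small sketch (hypothetical Rating model) showing what Pydantic emits for the urgency field:

```python
import json

from pydantic import BaseModel, Field

class Rating(BaseModel):
    urgency: int = Field(
        description="Urgency level from 1 (low) to 5 (critical)",
        ge=1,
        le=5,
    )

# In Pydantic v2, ge/le are serialized as "minimum"/"maximum"
urgency_schema = Rating.model_json_schema()["properties"]["urgency"]
print(json.dumps(urgency_schema, indent=2))
```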

Handle Potential Refusals

Some requests may result in the model refusing to respond. Always check for refusals:

python
output = response.choices[0].message

if output.refusal:
    print(f"Model refused: {output.refusal}")
elif output.content:
    try:
        data = json.loads(output.content)
        # Process data
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")

Schema Size, Memory Usage, and Reliability

When using response_format, it’s important to remember that the provided schema is not just a validation hint — it becomes part of the request itself and actively participates in how the model generates output.

In json_schema mode, the schema is serialized, transmitted, kept in memory, and consulted throughout the entire decoding process. This allows the model to produce strongly typed, well-formed JSON, but it also means that very large or complex schemas can significantly increase memory usage and processing time.

Even if the model’s final output is small, the cost of enforcing the schema can be large.

As schema size grows, you may notice slower request initialization, longer generation times, or reduced reliability. In more extreme cases, requests may fail due to memory exhaustion or internal timeouts. These failures can be difficult to diagnose because they are often caused by schema complexity rather than the prompt or the model output itself.

Takeaways
  • json_schema mode actively constrains decoding, not just validation
  • Schema complexity affects memory usage and latency even for small outputs
  • Reliability issues may originate from the schema, not the prompt
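
A rough proxy for schema cost is the size of its serialized form. In this sketch (hypothetical Item/Order models), a single level of nesting already pulls a $defs section into the schema:

```python
import json
from typing import List

from pydantic import BaseModel

class Item(BaseModel):
    sku: str
    quantity: int

class Order(BaseModel):
    order_id: str
    items: List[Item]

# Nested models are serialized into the schema via $defs
schema = Order.model_json_schema()
serialized = json.dumps(schema)
print(f"Serialized schema: {len(serialized)} bytes")
```

Checking this size before sending requests can flag schemas that grew beyond what the model definition suggests.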

What Causes Schemas to Become Large

Schemas often grow unintentionally. Deep nesting, large enumerations, or heavily reused model definitions can all expand the serialized schema far beyond what is obvious from the original Python or JSON definition.

This commonly happens when Pydantic models are reused directly from application logic. Database models, API response objects, or domain models frequently contain far more structure than is required for inference.

Schemas that rely heavily on $defs, anyOf, or oneOf are particularly expensive, as they force the model to reason over many possible output shapes during generation.

Takeaways
  • Deep nesting, large enums, and unions are the biggest contributors
  • Reusing “full” application or database models often causes schema bloat
  • $defs combined with anyOf / oneOf can rapidly increase complexity
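
The effect is easy to measure. In this sketch (hypothetical Card/BankTransfer models), replacing an embedded union with a compact reference field shrinks the serialized schema considerably:

```python
import json
from typing import Optional, Union

from pydantic import BaseModel

class Card(BaseModel):
    number: str

class BankTransfer(BaseModel):
    iban: str

# Union + Optional produce anyOf branches and $defs entries
class WithUnion(BaseModel):
    payment: Union[Card, BankTransfer]
    note: Optional[str] = None

# Compact alternative: a plain reference resolved in application code
class Flat(BaseModel):
    payment_ref: str

union_schema = json.dumps(WithUnion.model_json_schema())
flat_schema = json.dumps(Flat.model_json_schema())
print(len(union_schema), len(flat_schema))
```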

Designing Schemas for Stability

For reliable structured output, schemas should be treated as interfaces, not full representations of internal data models.

Include only the fields required for downstream logic. Prefer identifiers or compact categorical values over deeply embedded objects. If a constraint can be validated safely in your application code, it is often better to keep the schema simpler and enforce that constraint outside the model.

When extracting large or complex structures, consider splitting the task across multiple calls with smaller schemas rather than enforcing everything in a single request.

Takeaways
  • Keep schemas minimal and task-focused
  • Prefer IDs or references over embedded objects
  • Split complex extractions into multiple smaller calls
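
As one illustration of treating the schema as an interface, a hypothetical slim view model exposes only the fields downstream logic needs, rather than the full internal record:

```python
from pydantic import BaseModel

# Full internal record: more structure than inference needs
class CustomerRecord(BaseModel):
    customer_id: str
    name: str
    email: str
    billing_address: str
    shipping_address: str
    loyalty_points: int

# Interface model for structured output: only what downstream logic uses
class CustomerMention(BaseModel):
    customer_id: str
    name: str

n_full = len(CustomerRecord.model_json_schema()["properties"])
n_slim = len(CustomerMention.model_json_schema()["properties"])
print(f"Full: {n_full} fields, slim: {n_slim} fields")
```

The customer_id lets application code join back to the full record after extraction.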

Inspecting and Debugging the response_format

Schema size is not always obvious, especially when using Pydantic models with nested references. Inspecting the final response_format payload can help identify unexpected complexity before it causes issues in production.

The OpenAI SDK exposes an internal helper that converts a Python type into the final schema sent to the API.

Takeaways
  • Inspect the generated response_format, not just your model definition
  • Nested Pydantic models can produce much larger schemas than expected

Flattening the Generated Schema (Python)

python
from typing import Any, Dict

from openai.lib._parsing import type_to_response_format_param

def flatten_json(obj: Any, prefix: str = "", out: Dict[str, Any] | None = None) -> Dict[str, Any]:
    """
    Flatten a JSON-like structure (dicts + lists) into a single-level dict.
    """
    if out is None:
        out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}.{k}" if prefix else str(k)
            flatten_json(v, key, out)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            key = f"{prefix}[{i}]"
            flatten_json(v, key, out)
    else:
        out[prefix] = obj
    return out

my_response_format_flattened = flatten_json(
    type_to_response_format_param(MY_PYDANTIC_MODEL_CLASS)
)

Flattening the schema makes it easier to see how deep it goes, how many definitions are involved, and where complexity is accumulating — especially when schemas are composed from multiple nested models.

Takeaways
  • Flattening exposes hidden depth and repetition
  • Useful for spotting large enums and deeply nested paths
  • Useful for checking which hidden configurations and definitions are sent with the request

Checking Field Count Limits

To protect performance and reliability, Nebul enforces a hard limit on schema complexity.

Requests are blocked when the flattened response_format contains more than 200 fields. This prevents extremely large schemas from consuming excessive memory or causing long-running constrained decoding.

You can check this locally before sending a request:

python
from openai.lib._parsing import type_to_response_format_param

rf = type_to_response_format_param(MyPydanticModel)
flat = flatten_json(rf)

field_count = len(flat)
print(f"Flattened response_format fields: {field_count}")

if field_count > 200:
    raise ValueError(
        f"response_format too large: {field_count} fields (limit: 200). "
        "Reduce nesting, remove unused fields, or split the task into multiple calls."
    )
Important
  • Nebul blocks schemas with more than 200 flattened fields
  • Always validate locally to avoid runtime request failures
  • If you hit the limit, reduce nesting or split the task