Structured Output & JSON
Force models to respond with valid JSON following a specific schema. This makes it easy to parse model outputs and integrate them into your application logic.
Overview
By default, models respond with natural language text. Using the response_format parameter, you can constrain the output to:
- JSON Schema: Follows a strict schema you provide
- JSON Object: Valid JSON with no specific schema
This is useful for:
- Extracting structured data from text
- Building pipelines that require typed outputs
- Ensuring consistent response formats across requests
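As a quick sketch, the two modes correspond to these response_format payloads, shown here as plain Python dicts (the "person" schema is a hypothetical example, not part of the API):

```python
# JSON Schema mode: output must conform to the provided schema.
# The "person" schema below is a made-up example.
json_schema_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "person",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
}

# JSON Object mode: any well-formed JSON, no schema enforced
json_object_format = {"type": "json_object"}
```

Either dict can be passed as the response_format argument to chat.completions.create, as shown in the full examples below.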
Nebul enforces a strict limit of 200 flattened fields in response_format. Requests that exceed this limit are blocked.
To avoid performance issues and schema-size failures, see Schema Size, Memory Usage, and Reliability below.
JSON Schema Mode
Provide a JSON schema and the model will produce output that conforms to it.
Python

```python
import json
from typing import List, Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1",
)

# Define your schema using Pydantic
class MovieInfo(BaseModel):
    title: str
    year: int
    director: str
    genres: List[Literal["drama", "comedy", "thriller", "sci-fi", "horror", "action"]]
    rating: float

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[
        {
            "role": "system",
            "content": "Extract movie information from the user's description. Respond only with valid JSON matching the schema.",
        },
        {
            "role": "user",
            "content": "The Dark Knight from 2008, directed by Christopher Nolan. It's an action thriller rated 9.0",
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "movie_info", "schema": MovieInfo.model_json_schema()},
    },
)

print("JSON output")
print(response.choices[0].message.content)

print("Filtered output")
movie = json.loads(response.choices[0].message.content)
print(f"{movie['title']} ({movie['year']}) - {movie['rating']}/10")
```

cURL

```shell
curl https://api.inference.nebul.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key-here" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {
        "role": "system",
        "content": "Extract movie information from the user'\''s description. Respond only with valid JSON matching the schema."
      },
      {
        "role": "user",
        "content": "The Dark Knight from 2008, directed by Christopher Nolan. It'\''s an action thriller rated 9.0"
      }
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "movie_info",
        "schema": {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
            "director": {"type": "string"},
            "genres": {
              "type": "array",
              "items": {
                "type": "string",
                "enum": ["drama", "comedy", "thriller", "sci-fi", "horror", "action"]
              }
            },
            "rating": {"type": "number"}
          },
          "required": ["title", "year", "director", "genres", "rating"]
        }
      }
    }
  }'
```
Example Output
```json
{
  "title": "The Dark Knight",
  "year": 2008,
  "director": "Christopher Nolan",
  "genres": ["action", "thriller"],
  "rating": 9.0
}
```
JSON Object Mode
When you need valid JSON but don't want to enforce a specific schema, use json_object mode. The model will produce well-formed JSON based on your prompt instructions.
Python

```python
import json

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[
        {
            "role": "system",
            "content": "You are an API that returns product information as JSON. Include name, price, and availability.",
        },
        {"role": "user", "content": "Tell me about the iPhone 15 Pro"},
    ],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
print(data)
```

cURL

```shell
curl https://api.inference.nebul.io/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-api-key-here" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {
        "role": "system",
        "content": "You are an API that returns product information as JSON. Include name, price, and availability."
      },
      {
        "role": "user",
        "content": "Tell me about the iPhone 15 Pro"
      }
    ],
    "response_format": {"type": "json_object"}
  }'
```
Always instruct the model to output JSON in your system or user prompt, even when using response_format. This improves reliability and output quality.
Schema Best Practices
Use Descriptive Field Names
```python
from pydantic import BaseModel

# Good - self-documenting
class OrderSummary(BaseModel):
    order_id: str
    total_amount_usd: float
    items_count: int

# Avoid - ambiguous
class Order(BaseModel):
    id: str
    total: float
    count: int
```
Include Field Descriptions
```python
from typing import List, Literal

from pydantic import BaseModel, Field

class CustomerFeedback(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Overall sentiment of the customer feedback"
    )
    key_points: List[str] = Field(
        description="Main points or concerns raised by the customer"
    )
    urgency: int = Field(
        description="Urgency level from 1 (low) to 5 (critical)",
        ge=1,
        le=5,
    )
```
Handle Potential Refusals
Some requests may result in the model refusing to respond. Always check for refusals:
```python
output = response.choices[0].message

if output.refusal:
    print(f"Model refused: {output.refusal}")
elif output.content:
    try:
        data = json.loads(output.content)
        # Process data
    except json.JSONDecodeError as e:
        print(f"Invalid JSON: {e}")
```
Schema Size, Memory Usage, and Reliability
When using response_format, it’s important to remember that the provided schema is not just a validation hint — it becomes part of the request itself and actively participates in how the model generates output.
In json_schema mode, the schema is serialized, transmitted, kept in memory, and consulted throughout the entire decoding process. This allows the model to produce strongly typed, well-formed JSON, but it also means that very large or complex schemas can significantly increase memory usage and processing time.
Even if the model’s final output is small, the cost of enforcing the schema can be large.
As schema size grows, you may notice slower request initialization, longer generation times, or reduced reliability. In more extreme cases, requests may fail due to memory exhaustion or internal timeouts. These failures can be difficult to diagnose because they are often caused by schema complexity rather than the prompt or the model output itself.
- json_schema mode actively constrains decoding, not just validation
- Schema complexity affects memory usage and latency even for small outputs
- Reliability issues may originate from the schema, not the prompt
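To make the asymmetry concrete, the sketch below compares the serialized size of a hypothetical enum-heavy schema against a typical output that conforms to it. The schema and the generated country codes are made-up examples:

```python
import json

# Hypothetical schema: a single field, but a large enum inflates the payload
schema = {
    "type": "object",
    "properties": {
        "country_code": {
            "type": "string",
            # Stand-in for a real code list (e.g. ISO 3166); 50 entries add up
            "enum": [f"C{i:02d}" for i in range(50)],
        }
    },
    "required": ["country_code"],
}

output = {"country_code": "C07"}  # a conforming model response

schema_bytes = len(json.dumps(schema))
output_bytes = len(json.dumps(output))

# The schema is transmitted and held in memory for the whole generation,
# even though the final output is a couple of dozen bytes.
print(f"schema: {schema_bytes} bytes, output: {output_bytes} bytes")
```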
What Causes Schemas to Become Large
Schemas often grow unintentionally. Deep nesting, large enumerations, or heavily reused model definitions can all expand the serialized schema far beyond what is obvious from the original Python or JSON definition.
This commonly happens when Pydantic models are reused directly from application logic. Database models, API response objects, or domain models frequently contain far more structure than is required for inference.
Schemas that rely heavily on $defs, anyOf, or oneOf are particularly expensive, as they force the model to reason over many possible output shapes during generation.
- Deep nesting, large enums, and unions are the biggest contributors
- Reusing “full” application or database models often causes schema bloat
- $defs combined with anyOf/oneOf can rapidly increase complexity
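A rough way to see this effect without any tooling is to count the leaf values in a schema, since every alternative in a union contributes its full shape. The schemas and the count_leaves helper below are illustrative examples, not part of any SDK:

```python
from typing import Any

def count_leaves(obj: Any) -> int:
    """Count leaf values in a JSON-like structure (a rough complexity proxy)."""
    if isinstance(obj, dict):
        return sum(count_leaves(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(count_leaves(v) for v in obj)
    return 1

# A simple object: only a handful of leaves
simple = {
    "type": "object",
    "properties": {"id": {"type": "string"}, "total": {"type": "number"}},
}

# The same idea with one field turned into a union over several shapes:
# the model must now track every alternative during decoding
address = {
    "type": "object",
    "properties": {"street": {"type": "string"}, "city": {"type": "string"}},
}
union = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "contact": {
            "anyOf": [
                {"type": "string"},
                address,
                {"type": "object", "properties": {"email": {"type": "string"}}},
            ]
        },
    },
}

print(count_leaves(simple), count_leaves(union))
```

Each branch added to the anyOf grows the count by the full size of that branch, which is why unions and $defs-heavy schemas expand so quickly.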
Designing Schemas for Stability
For reliable structured output, schemas should be treated as interfaces, not full representations of internal data models.
Include only the fields required for downstream logic. Prefer identifiers or compact categorical values over deeply embedded objects. If a constraint can be validated safely in your application code, it is often better to keep the schema simpler and enforce that constraint outside the model.
When extracting large or complex structures, consider splitting the task across multiple calls with smaller schemas rather than enforcing everything in a single request.
- Keep schemas minimal and task-focused
- Prefer IDs or references over embedded objects
- Split complex extractions into multiple smaller calls
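One way to sketch the multi-call pattern, assuming a call_model(messages, response_format) helper that wraps your client and returns the message content string (hypothetical, not part of the OpenAI SDK):

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical wrapper around client.chat.completions.create that
# returns response.choices[0].message.content as a string.
CallModel = Callable[[List[Dict[str, str]], Dict[str, Any]], str]

# Two small, task-focused schemas instead of one monolithic invoice schema
HEADER_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice_header",
        "schema": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}, "date": {"type": "string"}},
            "required": ["invoice_id", "date"],
        },
    },
}

LINES_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice_lines",
        "schema": {
            "type": "object",
            "properties": {
                "lines": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {"sku": {"type": "string"}, "qty": {"type": "integer"}},
                        "required": ["sku", "qty"],
                    },
                }
            },
            "required": ["lines"],
        },
    },
}

def extract_invoice(text: str, call_model: CallModel) -> Dict[str, Any]:
    """Run two small-schema calls and merge the results in application code."""
    prompt = [{"role": "user", "content": text}]
    header = json.loads(call_model(prompt, HEADER_SCHEMA))
    lines = json.loads(call_model(prompt, LINES_SCHEMA))
    return {**header, **lines}
```

Each call carries a schema that is cheap to enforce, and the merge happens in your own code where no constrained decoding is involved.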
Inspecting and Debugging the response_format
Schema size is not always obvious, especially when using Pydantic models with nested references. Inspecting the final response_format payload can help identify unexpected complexity before it causes issues in production.
The OpenAI SDK exposes an internal helper that converts a Python type into the final schema sent to the API.
- Inspect the generated response_format, not just your model definition
- Nested Pydantic models can produce much larger schemas than expected
Flattening the Generated Schema (Python)
```python
from typing import Any, Dict

from openai.lib._parsing import type_to_response_format_param

def flatten_json(obj: Any, prefix: str = "", out: Dict[str, Any] | None = None) -> Dict[str, Any]:
    """Flatten a JSON-like structure (dicts + lists) into a single-level dict."""
    if out is None:
        out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}.{k}" if prefix else str(k)
            flatten_json(v, key, out)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            key = f"{prefix}[{i}]"
            flatten_json(v, key, out)
    else:
        out[prefix] = obj
    return out

my_response_format_flattened = flatten_json(
    type_to_response_format_param(MY_PYDANTIC_MODEL_CLASS)
)
```
Flattening the schema makes it easier to see how deep it goes, how many definitions are involved, and where complexity is accumulating — especially when schemas are composed from multiple nested models.
- Flattening exposes hidden depth and repetition
- Useful for spotting large enums and deeply nested paths
- Useful for checking what hidden configs/definitions are sent with the request
Checking Field Count Limits
To protect performance and reliability, Nebul enforces a hard limit on schema complexity.
Requests are blocked when the flattened response_format contains more than 200 fields. This prevents extremely large schemas from consuming excessive memory or causing long-running constrained decoding.
You can check this locally before sending a request:
```python
from openai.lib._parsing import type_to_response_format_param

rf = type_to_response_format_param(MyPydanticModel)
flat = flatten_json(rf)
field_count = len(flat)

print(f"Flattened response_format fields: {field_count}")

if field_count > 200:
    raise ValueError(
        f"response_format too large: {field_count} fields (limit: 200). "
        "Reduce nesting, remove unused fields, or split the task into multiple calls."
    )
```
- Nebul blocks schemas with more than 200 flattened fields
- Always validate locally to avoid runtime request failures
- If you hit the limit, reduce nesting or split the task