Rate Limits & Scaling
Understand how rate limiting works and how to handle limits gracefully.
Overview
The Nebul inference-api applies rate limits to ensure fair usage and platform stability. Limits are applied per API key and vary based on your account tier.
When you exceed your rate limit, the API returns an HTTP 429 Too Many Requests response.
Rate Limit Headers
Every API response includes headers that help you track your current usage:
| Header | Description |
|---|---|
| `x-ratelimit-limit-requests` | Maximum requests allowed per minute |
| `x-ratelimit-limit-tokens` | Maximum tokens allowed per minute |
| `x-ratelimit-remaining-requests` | Requests remaining in the current window |
| `x-ratelimit-remaining-tokens` | Tokens remaining in the current window |
| `x-ratelimit-reset-requests` | Time until the request limit resets |
| `x-ratelimit-reset-tokens` | Time until the token limit resets |
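One practical use of these headers is to throttle proactively before a 429 ever occurs. The sketch below checks the remaining-request budget from a response's headers; `should_throttle` and the threshold of 5 are illustrative choices, not part of any SDK.

```python
# Sketch: decide when to slow down based on the rate-limit headers above.
# The header names come from the table; the threshold is arbitrary.

def should_throttle(headers, min_remaining=5):
    """Return True when the remaining-request budget is nearly exhausted."""
    remaining = int(headers.get("x-ratelimit-remaining-requests", min_remaining + 1))
    return remaining <= min_remaining

# Example: headers as they might appear on a response near the limit
headers = {
    "x-ratelimit-limit-requests": "60",
    "x-ratelimit-remaining-requests": "3",
    "x-ratelimit-reset-requests": "12s",
}
print(should_throttle(headers))  # True: only 3 requests left in the window
```

Pausing (or queueing work) when this returns `True` keeps you under the limit instead of reacting to 429s after the fact.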
Handling 429 Responses
When you hit a rate limit, implement exponential backoff:
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1"
)

def make_request_with_retry(messages, max_retries=5):
    """Retry a chat completion with exponential backoff on 429 responses."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Qwen/Qwen3-VL-235B-A22B-Thinking",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + 1  # Exponential backoff
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
```
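Instead of guessing a backoff, you can sleep for exactly the duration reported in the reset headers. The parser below assumes reset values are duration strings such as `250ms`, `12s`, or `1m30s`; that format is an assumption here, so treat `parse_reset` as an illustrative helper rather than a guaranteed contract.

```python
import re

def parse_reset(value):
    """Convert a duration string such as '1m30s' into seconds (float)."""
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    total = 0.0
    # Match number/unit pairs; "ms" is listed before "m"/"s" so it wins.
    for amount, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value):
        total += float(amount) * units[unit]
    return total

print(parse_reset("1m30s"))  # 90.0
print(parse_reset("250ms"))  # 0.25
```

Combined with the retry loop above, you would read `x-ratelimit-reset-requests` from the 429 response and `time.sleep(parse_reset(...))` before the next attempt.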
Checking Your Limits
Your current rate limits and usage are visible in your Nebul AI Studio dashboard under the API section. Here you can:
- View your current tier and associated limits
- Monitor real-time usage against your quotas
- Request limit increases if needed
Best Practices
- Monitor headers: Track the `x-ratelimit-remaining-*` headers to stay under your limits
- Implement backoff: Always use exponential backoff when retrying after 429s
- Batch requests: Combine multiple prompts where possible to reduce request count
- Cache responses: Cache identical requests to avoid redundant API calls
- Use streaming: For long responses, streaming doesn't change token limits but improves perceived latency
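The caching practice above can be sketched in a few lines. Identical requests are detected by hashing the model name plus a canonical JSON encoding of the messages; `cache_key` and `cached_completion` are illustrative helpers, not part of the OpenAI SDK, and this only makes sense for deterministic requests (e.g. `temperature=0`).

```python
import hashlib
import json

_cache = {}

def cache_key(model, messages):
    """Stable key for an identical request: model + canonical JSON of messages."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client_call, model, messages):
    """Call the API only on a cache miss; repeat requests are served locally."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = client_call(model=model, messages=messages)
    return _cache[key]
```

A process-local dict is the simplest option; for multi-worker deployments the same keying scheme works against a shared store such as Redis.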
Need Higher Limits?
If you consistently hit rate limits, you can:
- Check your tier: Upgrade your account tier in AI Studio for higher limits
- Contact us: Reach out for custom enterprise limits tailored to your workload
Tip: For high-volume async workloads, consider using Batch Inference (coming soon), which offers significantly higher limits at reduced cost.