Rate Limits & Scaling

Understand how rate limiting works and how to handle limits gracefully.

Overview

The Nebul inference API applies rate limits to ensure fair usage and platform stability. Limits are applied per API key and vary with your account tier.

When you exceed your rate limit, the API returns an HTTP 429 Too Many Requests response.

Rate Limit Headers

Every API response includes headers that help you track your current usage:

| Header | Description |
| --- | --- |
| `x-ratelimit-limit-requests` | Maximum requests allowed per minute |
| `x-ratelimit-limit-tokens` | Maximum tokens allowed per minute |
| `x-ratelimit-remaining-requests` | Requests remaining in the current window |
| `x-ratelimit-remaining-tokens` | Tokens remaining in the current window |
| `x-ratelimit-reset-requests` | Time until the request limit resets |
| `x-ratelimit-reset-tokens` | Time until the token limit resets |
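These headers arrive as strings, so a small parsing helper is useful for monitoring. The sketch below assumes a plain dict of headers; the helper and its name are illustrative, not part of the API:

```python
def parse_rate_limit_headers(headers):
    """Turn rate-limit headers into integers; missing headers become None."""
    keys = {
        "limit_requests": "x-ratelimit-limit-requests",
        "limit_tokens": "x-ratelimit-limit-tokens",
        "remaining_requests": "x-ratelimit-remaining-requests",
        "remaining_tokens": "x-ratelimit-remaining-tokens",
    }
    parsed = {}
    for name, header in keys.items():
        value = headers.get(header)
        parsed[name] = int(value) if value is not None else None
    return parsed

# Example with the kind of values a response might carry (numbers are made up):
sample = {
    "x-ratelimit-limit-requests": "600",
    "x-ratelimit-remaining-requests": "599",
    "x-ratelimit-limit-tokens": "150000",
    "x-ratelimit-remaining-tokens": "149500",
}
print(parse_rate_limit_headers(sample)["remaining_requests"])  # 599
```

Feeding this into your metrics system lets you throttle proactively instead of waiting for a 429.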

Handling 429 Responses

When you hit a rate limit, implement exponential backoff:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1",
)

def make_request_with_retry(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Qwen/Qwen3-VL-235B-A22B-Thinking",
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + 1  # Exponential backoff: 2, 3, 5, 9s...
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
```
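A fixed backoff schedule can cause many clients to retry in lockstep after a shared limit resets. Adding random jitter spreads retries out; this is a common refinement, not something the API requires. A minimal sketch (function name is ours):

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=60.0):
    """Full-jitter backoff: a random wait in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

# The upper bound doubles each attempt until it reaches the cap:
for attempt in range(5):
    wait = backoff_with_jitter(attempt)
    assert 0 <= wait <= min(60.0, 2 ** attempt)
```

Swap `time.sleep(wait_time)` in the retry loop above for `time.sleep(backoff_with_jitter(attempt))` to use it.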

Checking Your Limits

Your current rate limits and usage are visible in your Nebul AI Studio dashboard under the API section. Here you can:

  • View your current tier and associated limits
  • Monitor real-time usage against your quotas
  • Request limit increases if needed

Best Practices

  1. Monitor headers: Track x-ratelimit-remaining-* headers to stay under limits
  2. Implement backoff: Always use exponential backoff when retrying after 429s
  3. Batch requests: Combine multiple prompts where possible to reduce request count
  4. Cache responses: Cache identical requests to avoid redundant API calls
  5. Use streaming: For long responses, streaming doesn't change token limits but improves perceived latency
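Practice 4 can be as simple as memoizing on the request payload. The sketch below assumes deterministic outputs (e.g. temperature 0) are acceptable for repeated prompts; the helper names and the stand-in API function are hypothetical:

```python
import hashlib
import json

_cache = {}

def cache_key(model, messages):
    """Build a stable key from the request payload."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model, messages, call_api):
    """Serve identical requests from the cache; call the API only on a miss."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = call_api(model, messages)
    return _cache[key]

# With a stand-in for the real API call, the second identical request is free:
calls = []
def fake_api(model, messages):
    calls.append(1)
    return "response"

cached_completion("m", [{"role": "user", "content": "hi"}], fake_api)
cached_completion("m", [{"role": "user", "content": "hi"}], fake_api)
print(len(calls))  # 1
```

In production you would bound the cache size and expire entries, but even this shape removes redundant calls from batch jobs that repeat prompts.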

Need Higher Limits?

If you consistently hit rate limits, you can:

  1. Check your tier: Upgrade your account tier in AI Studio for higher limits
  2. Contact us: Reach out for custom enterprise limits tailored to your workload
Tip: For high-volume async workloads, consider using Batch Inference (coming soon), which offers significantly higher limits at reduced cost.