Rate Limits & Scaling
Understand how rate limiting works and how to handle limits gracefully.
Overview
The Nebul inference-api applies rate limits to ensure fair usage and platform stability. Limits are applied per API key and vary based on your account tier.
When you exceed your rate limit, the API returns an HTTP 429 Too Many Requests response.
Rate Limit Headers
Every API response includes headers that help you track your current usage:
| Header | Description |
|---|---|
| `x-ratelimit-limit-requests` | Maximum requests allowed per minute |
| `x-ratelimit-limit-tokens` | Maximum tokens allowed per minute |
| `x-ratelimit-remaining-requests` | Requests remaining in the current window |
| `x-ratelimit-remaining-tokens` | Tokens remaining in the current window |
| `x-ratelimit-reset-requests` | Time until the request limit resets |
| `x-ratelimit-reset-tokens` | Time until the token limit resets |
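One practical use of these headers is to throttle proactively before a 429 ever occurs. The sketch below checks the remaining-request budget from a response's headers; `should_throttle` and the threshold of 5 are illustrative choices, not part of any SDK.

```python
# Sketch: decide when to slow down based on the rate-limit headers above.
# The header names come from the table; the threshold is arbitrary.

def should_throttle(headers, min_remaining=5):
    """Return True when the remaining-request budget is nearly exhausted."""
    remaining = int(headers.get("x-ratelimit-remaining-requests", min_remaining + 1))
    return remaining <= min_remaining

# Example: headers as they might appear on a response near the limit
headers = {
    "x-ratelimit-limit-requests": "60",
    "x-ratelimit-remaining-requests": "3",
    "x-ratelimit-reset-requests": "12s",
}
print(should_throttle(headers))  # True: only 3 requests left in the window
```

Pausing (or queueing work) when this returns `True` keeps you under the limit instead of reacting to 429s after the fact.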
Handling 429 Responses
When you hit a rate limit, implement exponential backoff:
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    api_key="sk-your-api-key-here",
    base_url="https://api.inference.nebul.io/v1"
)

def make_request_with_retry(messages, max_retries=5):
    """Retry a chat completion with exponential backoff on 429 responses."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="Qwen/Qwen3-VL-235B-A22B-Thinking",
                messages=messages
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = (2 ** attempt) + 1  # Exponential backoff
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
```
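Instead of guessing a backoff, you can sleep for exactly the duration reported in the reset headers. The parser below assumes reset values are duration strings such as `250ms`, `12s`, or `1m30s`; that format is an assumption here, so treat `parse_reset` as an illustrative helper rather than a guaranteed contract.

```python
import re

def parse_reset(value):
    """Convert a duration string such as '1m30s' into seconds (float)."""
    units = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}
    total = 0.0
    # Match number/unit pairs; "ms" is listed before "m"/"s" so it wins.
    for amount, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|s|m|h)", value):
        total += float(amount) * units[unit]
    return total

print(parse_reset("1m30s"))  # 90.0
print(parse_reset("250ms"))  # 0.25
```

Combined with the retry loop above, you would read `x-ratelimit-reset-requests` from the 429 response and `time.sleep(parse_reset(...))` before the next attempt.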
Checking Your Limits
Your current rate limits and usage are visible in your Nebul AI Studio dashboard under the API section. Here you can:
- View your current tier and associated limits
- Monitor real-time usage against your quotas
- Request limit increases if needed
Best Practices
- Monitor headers: Track the `x-ratelimit-remaining-*` headers to stay under your limits
- Implement backoff: Always use exponential backoff when retrying after 429s
- Batch requests: Combine multiple prompts where possible to reduce request count
- Cache responses: Cache identical requests to avoid redundant API calls
- Use streaming: For long responses, streaming doesn't change token limits but improves perceived latency
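The caching practice above can be sketched in a few lines. Identical requests are detected by hashing the model name plus a canonical JSON encoding of the messages; `cache_key` and `cached_completion` are illustrative helpers, not part of the OpenAI SDK, and this only makes sense for deterministic requests (e.g. `temperature=0`).

```python
import hashlib
import json

_cache = {}

def cache_key(model, messages):
    """Stable key for an identical request: model + canonical JSON of messages."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client_call, model, messages):
    """Call the API only on a cache miss; repeat requests are served locally."""
    key = cache_key(model, messages)
    if key not in _cache:
        _cache[key] = client_call(model=model, messages=messages)
    return _cache[key]
```

A process-local dict is the simplest option; for multi-worker deployments the same keying scheme works against a shared store such as Redis.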
Need Higher Limits?
If you consistently hit rate limits, you can:
- Check your tier: Upgrade your account tier in AI Studio for higher limits
- Contact us: Reach out for custom enterprise limits tailored to your workload
Tip: For high-volume async workloads, consider using Batch Inference (coming soon), which offers significantly higher limits at reduced cost.