Text to Speech
This feature and its corresponding model are currently in preview mode. Implementation details may change before the official release. Access to the text-to-speech model is available upon request.
Text-to-speech (TTS) models enable developers to convert text into natural-sounding speech. This capability is powered by models like Chatterbox Multilingual, a state-of-the-art TTS model that supports 23 languages and voice cloning.
Overview
The Text to Speech API allows you to generate high-quality speech audio from text input. It supports multiple languages, voice cloning via reference audio, and fine-grained control over speech characteristics like emotional intensity, pacing, and variability.
Quick Start
Endpoint
POST https://api.inference.nebul.io/v1/audio/speech
Alternatively, you can use the alias endpoint:
POST https://api.inference.nebul.io/v1/audio/generations
Both endpoints behave identically.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | String | Yes | The model ID to use (e.g., resemble-ai/chatterbox-multilingual). |
| input | String | Yes | The text to convert to speech. |
| language | String | No | ISO 639-1 language code (e.g., en, fr, es). Defaults to en. |
| response_format | String | No | Audio output format: wav or mp3. Defaults to wav. |
| voice | String | No | Voice identifier (OpenAI-compatible, currently unused). |
| speed | Number | No | Speech speed multiplier. Defaults to 1.0. |
| seed | Integer | No | Random seed for reproducible outputs. |
| exaggeration | Number | No | Controls emotional intensity (0.25-2.0). Defaults to 0.5. Lower values (0.3-0.4) produce neutral tones; higher values (0.7-0.8) are more expressive; above 1.0 is very dramatic. |
| cfg_weight | Number | No | Adjusts speech pacing (0.0-1.0). Defaults to 0.5. Lower values (0.2-0.3) speed up speech; higher values (0.7-0.8) slow it down. |
| temperature | Number | No | Controls randomness and creativity (0.05-5.0). Defaults to 0.8. Lower values (0.4-0.6) yield more consistent outputs; higher values (1.0+) introduce more variability. |
| audio_prompt | Object | No | Optional reference audio for voice cloning. See Voice Cloning below. |
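The documented ranges can be checked on the client before a request is sent. The sketch below is not part of the API: `RANGES` and `build_payload` are hypothetical names, and the bounds are copied from the table above. How the server treats out-of-range values is not stated here, so treat this check as a convenience only.

```python
# Hypothetical client-side helper; bounds are taken from the parameter table.
RANGES = {
    "exaggeration": (0.25, 2.0),
    "cfg_weight": (0.0, 1.0),
    "temperature": (0.05, 5.0),
}


def build_payload(text, model="resemble-ai/chatterbox-multilingual", **options):
    """Return a request body, raising ValueError for out-of-range options."""
    for name, (low, high) in RANGES.items():
        value = options.get(name)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    payload = {"model": model, "input": text}
    # Leave unset options out so the server applies its documented defaults.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload
```

For example, `build_payload("Hello", language="en", exaggeration=0.7)` returns a dictionary that can be passed as the JSON body of the request.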
Code Examples
Python, using the requests library:

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/speech"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hello, this is a test of the Chatterbox TTS service.",
    "language": "en",
    "response_format": "wav",
}

response = requests.post(url, headers=headers, json=payload)
# Fail on an HTTP error instead of writing an error body to the audio file.
response.raise_for_status()

with open("output.wav", "wb") as audio_file:
    audio_file.write(response.content)
```

cURL:

```bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hello, this is a test of the Chatterbox TTS service.",
    "language": "en",
    "response_format": "wav"
  }' \
  --output output.wav
```
Language Codes: Use ISO 639-1 two-letter language codes (e.g., en, fr, de, nl). The model defaults to en if no language is specified. Supported languages include English, French, German, Dutch, Spanish, Italian, and 17 others.
Audio Formats: Use wav for highest quality and compatibility, or mp3 for smaller file sizes. WAV files are uncompressed and larger, while MP3 files are compressed and more suitable for web applications.
Response Format
The API returns raw audio bytes with the appropriate content type:
- WAV format: Content-Type: audio/wav
- MP3 format: Content-Type: audio/mpeg
The response body contains the audio file bytes that can be saved directly to disk or streamed to audio players.
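Since the content type identifies the audio format, a client can pick the file extension from the response header instead of hard-coding it. A minimal sketch, assuming only the two content types listed above; `EXTENSIONS` and `save_audio` are hypothetical names, not part of the API.

```python
from pathlib import Path

# Content types listed above, mapped to file extensions.
EXTENSIONS = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def save_audio(content_type, body, stem="output", directory="."):
    """Write audio bytes to <directory>/<stem><ext> and return the path."""
    # Drop any header parameters (text after ";") and normalise case.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in EXTENSIONS:
        raise ValueError(f"unexpected Content-Type: {content_type!r}")
    path = Path(directory) / f"{stem}{EXTENSIONS[media_type]}"
    path.write_bytes(body)
    return path
```

With the requests library this could be called as `save_audio(response.headers["Content-Type"], response.content)`. Raising on any other content type keeps an error body from being saved as if it were audio.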
Supported Languages
The service supports 23 languages using ISO 639-1 codes:
| Code | Language | Code | Language |
|---|---|---|---|
| ar | Arabic | nl | Dutch |
| da | Danish | no | Norwegian |
| de | German | pl | Polish |
| el | Greek | pt | Portuguese |
| en | English | ru | Russian |
| es | Spanish | sv | Swedish |
| fi | Finnish | sw | Swahili |
| fr | French | tr | Turkish |
| he | Hebrew | zh | Chinese |
| hi | Hindi | | |
| it | Italian | | |
| ja | Japanese | | |
| ko | Korean | | |
| ms | Malay | | |
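The language list can be mirrored on the client to catch unsupported codes before a request is made. This is a sketch only: `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names, and lowercasing the code is an assumption, since the documentation does not say whether the API accepts uppercase codes.

```python
# The 23 codes from the table above.
SUPPORTED_LANGUAGES = {
    "ar": "Arabic", "da": "Danish", "de": "German", "el": "Greek",
    "en": "English", "es": "Spanish", "fi": "Finnish", "fr": "French",
    "he": "Hebrew", "hi": "Hindi", "it": "Italian", "ja": "Japanese",
    "ko": "Korean", "ms": "Malay", "nl": "Dutch", "no": "Norwegian",
    "pl": "Polish", "pt": "Portuguese", "ru": "Russian", "sv": "Swedish",
    "sw": "Swahili", "tr": "Turkish", "zh": "Chinese",
}


def normalize_language(code=None):
    """Return a supported ISO 639-1 code, using the documented default of en."""
    if code is None:
        return "en"
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized
```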
Advanced Usage
Dramatic/Expressive Speech
Use higher exaggeration values and lower cfg_weight for dramatic, expressive speech:
Python:

```python
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a dramatic and expressive speech!",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 1.2,
    "cfg_weight": 0.3,
    "temperature": 0.9,
}
response = requests.post(url, headers=headers, json=payload)
```

cURL:

```bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a dramatic and expressive speech!",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 1.2,
    "cfg_weight": 0.3,
    "temperature": 0.9
  }' \
  --output dramatic.wav
```
Professional/Neutral Speech
Use lower exaggeration and higher cfg_weight for professional, neutral tones:
Python:

```python
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a professional, neutral tone.",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 0.3,
    "cfg_weight": 0.7,
}
response = requests.post(url, headers=headers, json=payload)
```

cURL:

```bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a professional, neutral tone.",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 0.3,
    "cfg_weight": 0.7
  }' \
  --output professional.wav
```
Multilingual Examples
Python:

```python
# Dutch
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hallo, hoe gaat het? Dit is een test van de TTS-service.",
    "language": "nl",
    "response_format": "wav",
}

# Spanish
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hola, ¿cómo estás? Esta es una prueba del servicio TTS.",
    "language": "es",
    "response_format": "wav",
}

# Chinese
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "你好,今天天气真不错。",
    "language": "zh",
    "response_format": "wav",
}
```

cURL:

```bash
# Dutch
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hallo, hoe gaat het?",
    "language": "nl",
    "response_format": "wav"
  }' \
  --output dutch.wav

# Spanish
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hola, ¿cómo estás?",
    "language": "es",
    "response_format": "wav"
  }' \
  --output spanish.wav
```
Reproducible Outputs
Use the seed parameter to generate the same audio output for the same input:
Python:

```python
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will produce the same audio with the same seed.",
    "language": "en",
    "response_format": "wav",
    "seed": 42,
}
response = requests.post(url, headers=headers, json=payload)
```

cURL:

```bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will produce the same audio with the same seed.",
    "language": "en",
    "response_format": "wav",
    "seed": 42
  }' \
  --output seeded.wav
```
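One way to confirm that a seed behaves as expected is to send the same request more than once and compare digests of the returned bytes. The sketch below assumes "the same audio output" means byte-identical responses, which this page does not state explicitly. `check_reproducible` is a hypothetical helper, and `synthesize` stands for any function that takes a payload and returns the audio bytes.

```python
import hashlib


def check_reproducible(synthesize, payload, runs=2):
    """Call synthesize(payload) `runs` times; True if all results hash the same."""
    digests = {
        hashlib.sha256(synthesize(payload)).hexdigest()
        for _ in range(runs)
    }
    return len(digests) == 1
```

With the requests library, `synthesize` could be `lambda p: requests.post(url, headers=headers, json=p).content`. If the check returns False while the audio sounds the same, the difference may lie in encoding details and not in the generated speech.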
Voice Cloning
You can clone a voice by providing a reference audio clip using the audio_prompt parameter. The reference audio should be a WAV file encoded as base64.
```python
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will clone the voice from the reference audio.",
    "language": "en",
    "response_format": "wav",
    "audio_prompt": {
        "data": audio_b64,  # Base64-encoded WAV file
        "mime_type": "audio/wav",
    },
}
```
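The base64 string for the reference audio can be produced with the Python standard library. A minimal sketch: `make_audio_prompt` is a hypothetical helper, and the header check is an extra safeguard based on the standard RIFF/WAVE file layout, not a documented API requirement.

```python
import base64
from pathlib import Path


def make_audio_prompt(wav_path):
    """Read a WAV file and return it in the audio_prompt shape shown above."""
    raw = Path(wav_path).read_bytes()
    # WAV files are RIFF containers: bytes 0-3 are "RIFF", bytes 8-11 are "WAVE".
    if raw[:4] != b"RIFF" or raw[8:12] != b"WAVE":
        raise ValueError(f"{wav_path} does not look like a WAV file")
    return {
        "data": base64.b64encode(raw).decode("ascii"),
        "mime_type": "audio/wav",
    }
```

The result can be assigned directly, for example `payload["audio_prompt"] = make_audio_prompt("reference.wav")`, where `reference.wav` is a placeholder file name.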
For detailed information about voice cloning, including usage examples, reference audio requirements, limitations, and best practices, see the Voice Cloning guide.
MP3 Output Format
To receive audio in MP3 format instead of WAV:
Python:

```python
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Testing MP3 output format.",
    "language": "en",
    "response_format": "mp3",
}
response = requests.post(url, headers=headers, json=payload)
# Fail on an HTTP error instead of writing an error body to the audio file.
response.raise_for_status()

with open("output.mp3", "wb") as audio_file:
    audio_file.write(response.content)
```

cURL:

```bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Testing MP3 output format.",
    "language": "en",
    "response_format": "mp3"
  }' \
  --output output.mp3
```
Parameter Guidelines
Exaggeration
Controls the emotional intensity and expressiveness of the speech:
- 0.25-0.4: Neutral, professional tones suitable for business or educational content
- 0.5 (default): Balanced expressiveness
- 0.7-0.8: More expressive, suitable for storytelling or creative content
- 1.0-2.0: Very dramatic, theatrical speech
CFG Weight
Adjusts the pacing and rhythm of speech:
- 0.0-0.3: Faster speech, more casual delivery
- 0.5 (default): Normal pacing
- 0.7-1.0: Slower, more deliberate speech
Tip: If you increase exaggeration for more expressiveness, consider lowering cfg_weight to maintain natural pacing.
Temperature
Controls randomness and variability in the output:
- 0.05-0.6: More consistent, deterministic outputs
- 0.8 (default): Balanced variability
- 1.0-5.0: More creative and varied outputs
Tip: Use lower temperatures for consistent voice characteristics across multiple generations, and higher temperatures for more natural variation.
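These guidelines can be collected into named presets so that callers do not repeat the numbers. This is a sketch under stated assumptions: `PRESETS` and `apply_preset` are hypothetical names. The neutral and dramatic values are taken from the examples on this page, and the balanced preset uses the documented defaults.

```python
# Hypothetical presets built from the values used on this page.
PRESETS = {
    "neutral": {"exaggeration": 0.3, "cfg_weight": 0.7},
    "balanced": {"exaggeration": 0.5, "cfg_weight": 0.5, "temperature": 0.8},
    "dramatic": {"exaggeration": 1.2, "cfg_weight": 0.3, "temperature": 0.9},
}


def apply_preset(payload, name):
    """Return a new payload with preset values filled in.

    Keys already present in payload take precedence over the preset.
    """
    return {**PRESETS[name], **payload}
```

Letting explicit values override the preset means a caller can start from "dramatic" and still adjust a single parameter, such as cfg_weight, for one request.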
Model Specifications
The following text-to-speech models are available:
- resemble-ai/chatterbox-multilingual: 500M parameters, 4K context, float16 precision, supports Audio, Multilingual (23 languages) (Preview)