Text to Speech

PREVIEW MODE

This feature and its corresponding model are currently in preview mode. Implementation details may change before the official release. Access to the text-to-speech model is available upon request.

Text-to-speech (TTS) models enable developers to convert text into natural-sounding speech. This capability is powered by models like Chatterbox Multilingual, a state-of-the-art TTS model that supports 23 languages and voice cloning.

Overview

The Text to Speech API allows you to generate high-quality speech audio from text input. It supports multiple languages, voice cloning via reference audio, and fine-grained control over speech characteristics like emotional intensity, pacing, and variability.

Quick Start

Endpoint

POST https://api.inference.nebul.io/v1/audio/speech

Alternatively, you can use the alias endpoint:

POST https://api.inference.nebul.io/v1/audio/generations

Both endpoints behave identically.

Parameters

Parameter	Type	Required	Description
`model`	String	Yes	The model ID to use (e.g., `resemble-ai/chatterbox-multilingual`).
`input`	String	Yes	The text to convert to speech.
`language`	String	No	ISO 639-1 language code (e.g., `en`, `fr`, `es`). Defaults to `en`.
`response_format`	String	No	Audio output format: `wav` or `mp3`. Defaults to `wav`.
`voice`	String	No	Voice identifier. Accepted for OpenAI API compatibility but currently unused.
`speed`	Number	No	Speech speed multiplier. Defaults to `1.0`.
`seed`	Integer	No	Random seed for reproducible outputs.
`exaggeration`	Number	No	Controls emotional intensity (0.25-2.0). Defaults to `0.5`. Lower values (0.3-0.4) produce neutral tones; higher values (0.7-0.8) are more expressive; above 1.0 is very dramatic.
`cfg_weight`	Number	No	Adjusts speech pacing (0.0-1.0). Defaults to `0.5`. Lower values (0.2-0.3) speed up speech; higher values (0.7-0.8) slow it down.
`temperature`	Number	No	Controls randomness and creativity (0.05-5.0). Defaults to `0.8`. Lower values (0.4-0.6) yield more consistent outputs; higher values (1.0+) introduce more variability.
`audio_prompt`	Object	No	Optional reference audio for voice cloning. See Voice Cloning below.
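
The ranges in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: `build_payload` and `RANGES` are hypothetical names, and the bounds are copied from the table above. This page does not say whether the server rejects or clamps out-of-range values, so the helper simply fails early.

```python
# Hypothetical client-side helper; bounds are taken from the parameter table.
RANGES = {
    "exaggeration": (0.25, 2.0),
    "cfg_weight": (0.0, 1.0),
    "temperature": (0.05, 5.0),
}


def build_payload(text, model="resemble-ai/chatterbox-multilingual", **options):
    """Return a request body, raising ValueError for out-of-range options."""
    if not text:
        raise ValueError("input text must not be empty")
    response_format = options.get("response_format", "wav")
    if response_format not in ("wav", "mp3"):
        raise ValueError(f"unsupported response_format: {response_format!r}")
    for name, (low, high) in RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    # Only send what the caller set, so server-side defaults still apply.
    return {"model": model, "input": text, **options}
```

For example, `build_payload("Hello", exaggeration=0.3, cfg_weight=0.7)` returns a dictionary that can be passed as the JSON body of the request.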

Code Examples

Using the requests library:

python
import requests

url = "https://api.inference.nebul.io/v1/audio/speech"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}

payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hello, this is a test of the Chatterbox TTS service.",
    "language": "en",
    "response_format": "wav"
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # Stop on an HTTP error instead of saving an error body as audio

with open("output.wav", "wb") as audio_file:
    audio_file.write(response.content)

bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hello, this is a test of the Chatterbox TTS service.",
    "language": "en",
    "response_format": "wav"
  }' \
  --output output.wav

Tip (language codes): Use ISO 639-1 two-letter language codes (e.g., en, fr, de, nl). The model defaults to en if no language is specified. Supported languages include English, French, German, Dutch, Spanish, Italian, and 17 others.

Tip (audio formats): Use wav for the highest quality and broadest compatibility, or mp3 for smaller files. WAV is uncompressed and therefore larger; MP3 is compressed, which suits web applications.

Response Format

The API returns raw audio bytes with the appropriate content type:

WAV format: Content-Type: audio/wav
MP3 format: Content-Type: audio/mpeg

The response body contains the audio file bytes that can be saved directly to disk or streamed to audio players.
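
Because the content type identifies the format, a client can derive the file extension from the response rather than hard-coding it. A small sketch under that assumption; `extension_for` and `EXTENSIONS` are hypothetical names, and only the two content types listed above are recognised.

```python
# Map the documented Content-Type values to file extensions.
EXTENSIONS = {"audio/wav": ".wav", "audio/mpeg": ".mp3"}


def extension_for(content_type):
    """Return the file extension for a Content-Type header value."""
    # Drop any parameters such as "; charset=..." and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    try:
        return EXTENSIONS[media_type]
    except KeyError:
        raise ValueError(f"unexpected content type: {content_type!r}") from None
```

With the requests library this could be used as `"output" + extension_for(response.headers.get("Content-Type", ""))`. Any other content type raises an error, so a non-audio body is never written to disk as audio.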

Supported Languages

The service supports 23 languages using ISO 639-1 codes:

Code	Language
`ar`	Arabic
`da`	Danish
`de`	German
`el`	Greek
`en`	English
`es`	Spanish
`fi`	Finnish
`fr`	French
`he`	Hebrew
`hi`	Hindi
`it`	Italian
`ja`	Japanese
`ko`	Korean
`ms`	Malay
`nl`	Dutch
`no`	Norwegian
`pl`	Polish
`pt`	Portuguese
`ru`	Russian
`sv`	Swedish
`sw`	Swahili
`tr`	Turkish
`zh`	Chinese
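
The code list above can be turned into a client-side check so that a typo is caught before a request is made. This is an illustrative sketch: `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names, and trimming region subtags such as `pt-BR` down to `pt` is a convenience assumed here, not documented API behavior.

```python
# The 23 ISO 639-1 codes listed in the table above.
SUPPORTED_LANGUAGES = frozenset({
    "ar", "da", "de", "el", "en", "es", "fi", "fr", "he", "hi", "it", "ja",
    "ko", "ms", "nl", "no", "pl", "pt", "ru", "sv", "sw", "tr", "zh",
})


def normalize_language(code="en"):
    """Return a lowercase two-letter code, or raise ValueError if unsupported."""
    # Accept tags such as "pt-BR" or "zh_CN" by keeping only the primary subtag.
    primary = code.strip().lower().replace("_", "-").split("-", 1)[0]
    if primary not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {code!r}")
    return primary
```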

Advanced Usage

Dramatic/Expressive Speech

Use higher exaggeration values and lower cfg_weight for dramatic, expressive speech:

python
# Reuses requests, url, and headers from the Code Examples section.
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a dramatic and expressive speech!",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 1.2,
    "cfg_weight": 0.3,
    "temperature": 0.9
}

response = requests.post(url, headers=headers, json=payload)

bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a dramatic and expressive speech!",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 1.2,
    "cfg_weight": 0.3,
    "temperature": 0.9
  }' \
  --output dramatic.wav

Professional/Neutral Speech

Use lower exaggeration and higher cfg_weight for professional, neutral tones:

python
# Reuses requests, url, and headers from the Code Examples section.
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a professional, neutral tone.",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 0.3,
    "cfg_weight": 0.7
}

response = requests.post(url, headers=headers, json=payload)

bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This is a professional, neutral tone.",
    "language": "en",
    "response_format": "wav",
    "exaggeration": 0.3,
    "cfg_weight": 0.7
  }' \
  --output professional.wav

Multilingual Examples

python
# Reuses requests, url, and headers from the Code Examples section.
# One sample sentence per language code.
samples = {
    "nl": "Hallo, hoe gaat het? Dit is een test van de TTS-service.",
    "es": "Hola, ¿cómo estás? Esta es una prueba del servicio TTS.",
    "zh": "你好，今天天气真不错。",
}

for language, text in samples.items():
    payload = {
        "model": "resemble-ai/chatterbox-multilingual",
        "input": text,
        "language": language,
        "response_format": "wav"
    }
    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    # Writes nl.wav, es.wav, and zh.wav.
    with open(f"{language}.wav", "wb") as audio_file:
        audio_file.write(response.content)

bash
# Dutch
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hallo, hoe gaat het?",
    "language": "nl",
    "response_format": "wav"
  }' \
  --output dutch.wav

# Spanish
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Hola, ¿cómo estás?",
    "language": "es",
    "response_format": "wav"
  }' \
  --output spanish.wav

Reproducible Outputs

Use the seed parameter to generate the same audio output for the same input:

python
# Reuses requests, url, and headers from the Code Examples section.
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will produce the same audio with the same seed.",
    "language": "en",
    "response_format": "wav",
    "seed": 42
}

response = requests.post(url, headers=headers, json=payload)

bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will produce the same audio with the same seed.",
    "language": "en",
    "response_format": "wav",
    "seed": 42
  }' \
  --output seeded.wav
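
One way to confirm that a seed behaves as expected is to send the same request twice and compare the two response bodies. The sketch below hashes the raw bytes; `audio_fingerprint` and `is_reproducible` are hypothetical helpers. This page does not state whether "the same audio" means byte-identical files, so treat a mismatch as something to investigate rather than proof that the seed was ignored.

```python
import hashlib


def audio_fingerprint(audio_bytes):
    """Return a SHA-256 hex digest of the raw audio bytes."""
    return hashlib.sha256(audio_bytes).hexdigest()


def is_reproducible(first, second):
    """True when two response bodies are byte-for-byte identical."""
    return audio_fingerprint(first) == audio_fingerprint(second)
```

With the requests library, this could be called as `is_reproducible(first_response.content, second_response.content)` after two requests that share the same payload and seed.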

Voice Cloning

You can clone a voice by providing a reference audio clip using the audio_prompt parameter. The reference audio should be a WAV file encoded as base64.

python
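import base64

# Read the reference clip and base64-encode it ("reference.wav" is a placeholder path).
with open("reference.wav", "rb") as reference_file:
    audio_b64 = base64.b64encode(reference_file.read()).decode("ascii")

payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will clone the voice from the reference audio.",
    "language": "en",
    "response_format": "wav",
    "audio_prompt": {
        "data": audio_b64,  # Base64-encoded WAV file
        "mime_type": "audio/wav"
    }
}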

For detailed information about voice cloning, including usage examples, reference audio requirements, limitations, and best practices, see the Voice Cloning guide.
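
Before sending a cloning request, it can help to confirm that the reference clip really is a readable WAV file. The following is a minimal sketch using only the Python standard library. `make_audio_prompt` is a hypothetical helper, and the only field names it relies on are the `data` and `mime_type` keys shown in the example above. Requirements such as clip length or sample rate are not checked here.

```python
import base64
import io
import wave


def make_audio_prompt(wav_bytes):
    """Build an audio_prompt object from the bytes of a WAV file."""
    # wave.open raises wave.Error (or EOFError for truncated input)
    # when the bytes are not a WAV file it can read.
    with wave.open(io.BytesIO(wav_bytes), "rb") as clip:
        if clip.getnframes() == 0:
            raise ValueError("reference audio contains no samples")
    return {
        "data": base64.b64encode(wav_bytes).decode("ascii"),
        "mime_type": "audio/wav",
    }
```

The returned dictionary can be assigned directly to the `audio_prompt` key of the request payload.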

MP3 Output Format

To receive audio in MP3 format instead of WAV:

python
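# Reuses requests, url, and headers from the Code Examples section.
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Testing MP3 output format.",
    "language": "en",
    "response_format": "mp3"
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # Stop on an HTTP error instead of saving an error body as audio

with open("output.mp3", "wb") as audio_file:
    audio_file.write(response.content)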

bash
curl -X POST https://api.inference.nebul.io/v1/audio/speech \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "Testing MP3 output format.",
    "language": "en",
    "response_format": "mp3"
  }' \
  --output output.mp3

Parameter Guidelines

Exaggeration

Controls the emotional intensity and expressiveness of the speech:

0.25-0.4: Neutral, professional tones suitable for business or educational content
0.5 (default): Balanced expressiveness
0.7-0.8: More expressive, suitable for storytelling or creative content
1.0-2.0: Very dramatic, theatrical speech

CFG Weight

Adjusts the pacing and rhythm of speech:

0.0-0.3: Faster speech, more casual delivery
0.5 (default): Normal pacing
0.7-1.0: Slower, more deliberate speech

Tip: If you increase exaggeration for more expressiveness, consider lowering cfg_weight to maintain natural pacing.

Temperature

Controls randomness and variability in the output:

0.05-0.6: More consistent, predictable outputs
0.8 (default): Balanced variability
1.0-5.0: More creative and varied outputs

Tip: Use lower temperatures for consistent voice characteristics across multiple generations, and higher temperatures for more natural variation.
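
The guidelines above can be captured as named presets so that callers choose a style instead of individual numbers. This is a sketch rather than a feature of the API: `PRESETS` and `apply_preset` are hypothetical names, and the values are taken from the examples and defaults in this guide.

```python
# Hypothetical presets; the numbers come from the examples and defaults in this guide.
PRESETS = {
    "neutral": {"exaggeration": 0.3, "cfg_weight": 0.7},
    "balanced": {"exaggeration": 0.5, "cfg_weight": 0.5, "temperature": 0.8},
    "dramatic": {"exaggeration": 1.2, "cfg_weight": 0.3, "temperature": 0.9},
}


def apply_preset(payload, style):
    """Return a copy of payload with the preset values for style filled in."""
    if style not in PRESETS:
        raise ValueError(f"unknown style: {style!r}")
    # Values already set by the caller win over the preset.
    return {**PRESETS[style], **payload}
```

Letting explicit values override the preset keeps the helper predictable: a caller can start from "dramatic" and still pin `cfg_weight` to a specific value.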

Model Specifications

The following text-to-speech models are available:

resemble-ai/chatterbox-multilingual - 500M parameters, 4K context, float16 precision, supports Audio, Multilingual (23 languages) (Preview)
