Voice Cloning

Voice cloning allows you to generate speech in a specific voice by providing a reference audio clip. The Text to Speech API supports voice cloning through the audio_prompt parameter, enabling you to create personalized voice experiences.

Overview

Voice cloning uses a reference audio sample to capture voice characteristics such as tone, accent, and speaking style. The API then generates new speech that matches the reference voice while speaking the provided text.

Voice cloning is useful for:

  • Personalized voice assistants
  • Content localization with specific voices
  • Accessibility applications
  • Voice preservation
  • Character voice generation for media

How Voice Cloning Works

The voice cloning process:

  1. Reference Audio: You provide a reference audio file (WAV format) encoded as base64
  2. Voice Analysis: The model extracts voice characteristics from the reference
  3. Speech Generation: New speech is generated matching the reference voice characteristics
  4. Output: The generated audio maintains the voice characteristics while speaking your text

Usage

To use voice cloning, provide a reference audio file using the audio_prompt parameter:

```python
import requests
import base64

# Read and encode the reference audio file
with open("reference_voice.wav", "rb") as f:
    audio_bytes = f.read()
audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")

url = "https://api.inference.nebul.io/v1/audio/speech"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}

payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will clone the voice from the reference audio.",
    "language": "en",
    "response_format": "wav",
    "audio_prompt": {
        "data": audio_b64,
        "mime_type": "audio/wav"
    }
}

response = requests.post(url, headers=headers, json=payload)

with open("cloned_voice.wav", "wb") as audio_file:
    audio_file.write(response.content)
```

Reference Audio Requirements

Format

  • File format: WAV (uncompressed)
  • Encoding: Base64-encoded in the request
  • MIME type: audio/wav

Duration

  • Recommended: Approximately 10 seconds
  • Minimum: 3-5 seconds for basic voice characteristics
  • Maximum: No strict limit, but longer files don't necessarily improve results

Quality Guidelines

  • Clear speech: Use audio with clear, intelligible speech
  • Minimal noise: Avoid background noise, music, or overlapping voices
  • Consistent speaker: Ensure the reference audio contains only one speaker
  • Language match: Use reference audio in the same language as the target text for best results
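The format and duration requirements above can be checked programmatically before sending a request. Below is a minimal sketch using Python's standard wave module; the function name and the exact thresholds (3 s minimum, 30 s upper bound, 16 kHz minimum sample rate) are illustrative choices based on the guidelines in this section, not API-enforced limits.

```python
import wave

def validate_reference(path, min_seconds=3.0, max_seconds=30.0, min_rate=16000):
    """Check a WAV file against the reference-audio guidelines above.

    Returns (ok, message). Thresholds mirror the documented
    recommendations; adjust them for your use case.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        channels = wav.getnchannels()
    if duration < min_seconds:
        return False, f"too short ({duration:.1f}s); aim for ~10s of clear speech"
    if duration > max_seconds:
        return False, f"long clip ({duration:.1f}s); extra length rarely helps"
    if rate < min_rate:
        return False, f"sample rate {rate} Hz is below {min_rate} Hz"
    if channels != 1:
        return False, "use a single-channel (mono) recording"
    return True, f"ok: {duration:.1f}s mono at {rate} Hz"
```

Running this check locally is cheaper than discovering a problem after a failed or poor-quality generation.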

Parameters

The audio_prompt parameter accepts an object with:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| data | String | Yes | Base64-encoded audio data |
| mime_type | String | Yes | MIME type of the audio (e.g., "audio/wav") |
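Since every request builds this same two-field object, it can be worth wrapping the read-and-encode step in a small helper. This is a convenience sketch, not part of the API; the function name is our own.

```python
import base64

def make_audio_prompt(path, mime_type="audio/wav"):
    """Read an audio file and build the audio_prompt object:
    base64-encoded data plus its MIME type."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"data": data, "mime_type": mime_type}
```

You can then write `"audio_prompt": make_audio_prompt("reference_voice.wav")` in the payload instead of encoding inline.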

Combining with Other Parameters

You can combine voice cloning with other TTS parameters for fine-grained control:

```python
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This combines voice cloning with custom parameters.",
    "language": "en",
    "response_format": "wav",
    "audio_prompt": {
        "data": audio_b64,
        "mime_type": "audio/wav"
    },
    "exaggeration": 0.4,  # Neutral tone
    "cfg_weight": 0.6,    # Slightly slower pacing
    "temperature": 0.7    # Consistent output
}
response = requests.post(url, headers=headers, json=payload)
```

Limitations

  • Language matching: For best results, the reference audio should match the target language to avoid accent transfer
  • Voice quality: The quality of the cloned voice depends on the quality of the reference audio
  • Speaker consistency: The reference audio should contain only one consistent speaker
  • Processing time: Voice cloning may take slightly longer than standard TTS generation

Best Practices

  1. Reference audio quality: Use high-quality, clear reference audio (16kHz or higher sample rate)
  2. Duration: Aim for 10 seconds of clear speech in the reference audio
  3. Language consistency: Match the reference audio language to the target text language
  4. Single speaker: Ensure reference audio contains only one speaker
  5. Parameter tuning: Adjust exaggeration, cfg_weight, and temperature to fine-tune the cloned voice output
  6. Testing: Test with different reference audio samples to find the best match for your use case
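For the last point, testing several reference clips is easiest when you prepare one request payload per candidate and generate an output from each. The sketch below only builds the payloads; POST each one to the /v1/audio/speech endpoint as in the Usage example and compare the resulting audio by ear. The function name and the comparison workflow are our own suggestion, not part of the API.

```python
import base64

def payloads_for_references(ref_paths, text,
                            model="resemble-ai/chatterbox-multilingual"):
    """Build one TTS payload per candidate reference clip, so each
    can be sent to the speech endpoint and the outputs compared."""
    payloads = []
    for path in ref_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        payloads.append({
            "model": model,
            "input": text,
            "language": "en",
            "response_format": "wav",
            "audio_prompt": {"data": b64, "mime_type": "audio/wav"},
        })
    return payloads
```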

Use Cases

  • Personalized assistants: Create custom voice assistants with specific voice characteristics
  • Content creation: Generate voiceovers matching specific character voices
  • Accessibility: Provide voice options that match user preferences
  • Localization: Maintain consistent voice characteristics across different languages
  • Voice preservation: Preserve and replicate specific voices for archival or memorial purposes