Voice Cloning

Voice cloning allows you to generate speech in a specific voice by providing a reference audio clip. The Text to Speech API supports voice cloning through the audio_prompt parameter, enabling you to create personalized voice experiences.

Overview

Voice cloning uses a reference audio sample to capture voice characteristics such as tone, accent, and speaking style. The API then generates new speech that matches the reference voice while speaking the provided text.

Voice cloning is useful for:

  • Personalized voice assistants
  • Content localization with specific voices
  • Accessibility applications
  • Voice preservation
  • Character voice generation for media

How Voice Cloning Works

The voice cloning process:

  1. Reference Audio: You provide a reference audio file (WAV format) encoded as base64
  2. Voice Analysis: The model extracts voice characteristics from the reference
  3. Speech Generation: New speech is generated matching the reference voice characteristics
  4. Output: The generated audio maintains the voice characteristics while speaking your text

Usage

To use voice cloning, provide a reference audio file using the audio_prompt parameter:

```python
import requests
import base64

# Read and encode the reference audio file
with open("reference_voice.wav", "rb") as f:
    audio_bytes = f.read()
audio_b64 = base64.b64encode(audio_bytes).decode("utf-8")

url = "https://api.inference.nebul.io/v1/audio/speech"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>",
    "Content-Type": "application/json",
}

payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This will clone the voice from the reference audio.",
    "language": "en",
    "response_format": "wav",
    "audio_prompt": {
        "data": audio_b64,
        "mime_type": "audio/wav"
    }
}

response = requests.post(url, headers=headers, json=payload)

with open("cloned_voice.wav", "wb") as audio_file:
    audio_file.write(response.content)
```

Reference Audio Requirements

Format

  • File format: WAV (uncompressed)
  • Encoding: Base64-encoded in the request
  • MIME type: audio/wav

Duration

  • Recommended: Approximately 10 seconds
  • Minimum: 3-5 seconds for basic voice characteristics
  • Maximum: No strict limit, but longer files don't necessarily improve results

Quality Guidelines

  • Clear speech: Use audio with clear, intelligible speech
  • Minimal noise: Avoid background noise, music, or overlapping voices
  • Consistent speaker: Ensure the reference audio contains only one speaker
  • Language match: Use reference audio in the same language as the target text for best results
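The format and duration requirements above can be checked programmatically before sending a request. Below is a minimal sketch using Python's standard wave module; the function name and the exact thresholds (3 s minimum, 30 s upper bound, 16 kHz minimum sample rate) are illustrative choices based on the guidelines in this section, not API-enforced limits.

```python
import wave

def validate_reference(path, min_seconds=3.0, max_seconds=30.0, min_rate=16000):
    """Check a WAV file against the reference-audio guidelines above.

    Returns (ok, message). Thresholds mirror the documented
    recommendations; adjust them for your use case.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
        channels = wav.getnchannels()
    if duration < min_seconds:
        return False, f"too short ({duration:.1f}s); aim for ~10s of clear speech"
    if duration > max_seconds:
        return False, f"long clip ({duration:.1f}s); extra length rarely helps"
    if rate < min_rate:
        return False, f"sample rate {rate} Hz is below {min_rate} Hz"
    if channels != 1:
        return False, "use a single-channel (mono) recording"
    return True, f"ok: {duration:.1f}s mono at {rate} Hz"
```

Running this check locally is cheaper than discovering a problem after a failed or poor-quality generation.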

Parameters

The audio_prompt parameter accepts an object with:

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| data | String | Yes | Base64-encoded audio data |
| mime_type | String | Yes | MIME type of the audio (e.g., "audio/wav") |
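Since every request builds this same two-field object, it can be worth wrapping the read-and-encode step in a small helper. This is a convenience sketch, not part of the API; the function name is our own.

```python
import base64

def make_audio_prompt(path, mime_type="audio/wav"):
    """Read an audio file and build the audio_prompt object:
    base64-encoded data plus its MIME type."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"data": data, "mime_type": mime_type}
```

You can then write `"audio_prompt": make_audio_prompt("reference_voice.wav")` in the payload instead of encoding inline.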

Combining with Other Parameters

You can combine voice cloning with other TTS parameters for fine-grained control:

```python
payload = {
    "model": "resemble-ai/chatterbox-multilingual",
    "input": "This combines voice cloning with custom parameters.",
    "language": "en",
    "response_format": "wav",
    "audio_prompt": {
        "data": audio_b64,
        "mime_type": "audio/wav"
    },
    "exaggeration": 0.4,  # Neutral tone
    "cfg_weight": 0.6,    # Slightly slower pacing
    "temperature": 0.7    # Consistent output
}
response = requests.post(url, headers=headers, json=payload)
```

Limitations

  • Language matching: For best results, the reference audio should match the target language to avoid accent transfer
  • Voice quality: The quality of the cloned voice depends on the quality of the reference audio
  • Speaker consistency: The reference audio should contain only one consistent speaker
  • Processing time: Voice cloning may take slightly longer than standard TTS generation

Best Practices

  1. Reference audio quality: Use high-quality, clear reference audio (16kHz or higher sample rate)
  2. Duration: Aim for 10 seconds of clear speech in the reference audio
  3. Language consistency: Match the reference audio language to the target text language
  4. Single speaker: Ensure reference audio contains only one speaker
  5. Parameter tuning: Adjust exaggeration, cfg_weight, and temperature to fine-tune the cloned voice output
  6. Testing: Test with different reference audio samples to find the best match for your use case
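For the last point, testing several reference clips is easiest when you prepare one request payload per candidate and generate an output from each. The sketch below only builds the payloads; POST each one to the /v1/audio/speech endpoint as in the Usage example and compare the resulting audio by ear. The function name and the comparison workflow are our own suggestion, not part of the API.

```python
import base64

def payloads_for_references(ref_paths, text,
                            model="resemble-ai/chatterbox-multilingual"):
    """Build one TTS payload per candidate reference clip, so each
    can be sent to the speech endpoint and the outputs compared."""
    payloads = []
    for path in ref_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        payloads.append({
            "model": model,
            "input": text,
            "language": "en",
            "response_format": "wav",
            "audio_prompt": {"data": b64, "mime_type": "audio/wav"},
        })
    return payloads
```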

Use Cases

  • Personalized assistants: Create custom voice assistants with specific voice characteristics
  • Content creation: Generate voiceovers matching specific character voices
  • Accessibility: Provide voice options that match user preferences
  • Localization: Maintain consistent voice characteristics across different languages
  • Voice preservation: Preserve and replicate specific voices for archival or memorial purposes