Speaker Diarization

Speaker diarization identifies different speakers in audio recordings. When enabled, transcriptions include speaker labels for each segment, allowing you to distinguish between multiple speakers in conversations, interviews, or meetings.

Overview

The Speech to Text API supports speaker diarization through the diarized_json response format. This feature uses NeMo's ClusteringDiarizer to automatically detect and label speakers in audio files.

Speaker diarization is particularly useful for:

  • Meeting transcriptions
  • Interview analysis
  • Multi-speaker conversations
  • Podcast transcription
  • Call center recordings
  • Conference calls

How Diarization Works

NeMo's ClusteringDiarizer processes audio in four stages:

  1. Voice Activity Detection (VAD): Identifies speech segments in the audio
  2. Speaker Embeddings: Extracts speaker characteristics from each segment
  3. Clustering: Groups segments by speaker similarity
  4. Labeling: Assigns speaker labels (A, B, C, etc.) to each segment

Speaker labels are typically formatted as A, B, C, etc., or as SPEAKER_00, SPEAKER_01, etc.
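
The clustering stage (steps 2–3) can be illustrated with a toy sketch: treat each segment's speaker embedding as a vector and group segments whose embeddings are similar. The greedy threshold scheme, 2-D vectors, and 0.8 cutoff below are illustrative assumptions only, not NeMo's actual clustering algorithm:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each segment embedding to the first cluster whose
    representative it resembles; otherwise start a new cluster.
    Clusters are labeled A, B, C, ... in order of first appearance."""
    clusters = []  # one representative embedding per cluster
    labels = []
    for emb in embeddings:
        for i, rep in enumerate(clusters):
            if cosine(emb, rep) >= threshold:
                labels.append(chr(ord("A") + i))
                break
        else:
            clusters.append(emb)
            labels.append(chr(ord("A") + len(clusters) - 1))
    return labels

# Toy 2-D "speaker embeddings": two segments from one voice, one from another
segments = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]
print(greedy_cluster(segments))  # ['A', 'A', 'B']
```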

Usage

To enable speaker diarization, use the diarized_json response format when making transcription requests:

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}

# Request the diarized_json response format to enable speaker labels
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "diarized_json"
}

with open("/path/to/meeting_audio.mp3", "rb") as audio_file:
    response = requests.post(
        url, headers=headers, files={"file": audio_file}, data=data
    )

result = response.json()
print(f"Full text: {result['text']}")
for segment in result.get('segments', []):
    print(f"Speaker {segment['speaker']}: {segment['text']}")
```

Response Format

When diarization is enabled, the response includes speaker information in each segment. The response format follows the OpenAI diarized JSON structure:

```json
{
  "task": "transcribe",
  "duration": 10.2,
  "text": "A: Hello, welcome to the meeting.\nB: Thank you for having me.\nA: Let's start with the agenda.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 3.0,
      "end": 5.8,
      "text": "Thank you for having me.",
      "speaker": "B"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 6.2,
      "end": 8.5,
      "text": "Let's start with the agenda.",
      "speaker": "A"
    }
  ],
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 10.2
  }
}
```

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| `task` | String | Always `"transcribe"` for transcription tasks |
| `duration` | Number | Total duration of the audio in seconds |
| `text` | String | Full transcription with speaker labels (e.g., `"A: Hello\nB: Hi"`) |
| `segments` | Array | Array of segment objects with speaker information |
| `segments[].type` | String | Segment type, typically `"transcript.text.segment"` |
| `segments[].id` | String | Unique identifier for the segment |
| `segments[].start` | Number | Start time of the segment in seconds |
| `segments[].end` | Number | End time of the segment in seconds |
| `segments[].text` | String | Transcribed text for this segment |
| `segments[].speaker` | String | Speaker label (`A`, `B`, `C`, etc.) |
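
The segment fields above are enough to compute simple per-speaker statistics, such as talk time and turn count. A minimal sketch (the `speaker_stats` helper name is ours; the example payload mirrors the sample response above):

```python
from collections import defaultdict

def speaker_stats(result):
    """Aggregate per-speaker talk time and turn count from a
    diarized_json response body."""
    stats = defaultdict(lambda: {"seconds": 0.0, "turns": 0})
    for seg in result.get("segments", []):
        entry = stats[seg["speaker"]]
        entry["seconds"] += seg["end"] - seg["start"]
        entry["turns"] += 1
    return dict(stats)

example = {
    "segments": [
        {"speaker": "A", "start": 0.0, "end": 2.5},
        {"speaker": "B", "start": 3.0, "end": 5.8},
        {"speaker": "A", "start": 6.2, "end": 8.5},
    ]
}
print(speaker_stats(example))
```

Speaker A accumulates two turns totaling 4.8 seconds; speaker B one turn of 2.8 seconds.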

Limitations

  • Minimum audio length: Diarization works best with audio files longer than 10 seconds
  • Speaker count: The system can identify multiple speakers, but accuracy may decrease with more than 5-6 speakers
  • Audio quality: High-quality audio with clear speaker separation produces better results
  • Language support: Diarization is available for all languages supported by the v3 model
  • Processing time: Diarization adds additional processing time compared to standard transcription
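
For WAV input, the minimum-length limitation can be enforced client-side by reading the duration from the file header with Python's standard-library `wave` module before requesting diarization. The helper names below are ours; the 10-second threshold mirrors the note above:

```python
import os
import tempfile
import wave

MIN_DIARIZATION_SECONDS = 10.0  # per the minimum-length limitation above

def wav_duration(path):
    """Duration of a WAV file in seconds (header only, no decoding)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def long_enough_for_diarization(path):
    return wav_duration(path) >= MIN_DIARIZATION_SECONDS

# Demo: write 12 seconds of 16-bit mono silence and check it passes
demo = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000 * 12)
print(long_enough_for_diarization(demo))  # True
```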

Best Practices

  1. Audio quality: Use high-quality audio recordings with minimal background noise
  2. Speaker separation: Ensure speakers are clearly separated in the audio (avoid overlapping speech when possible)
  3. Minimum duration: Use audio files longer than 10 seconds for reliable speaker identification
  4. Language specification: Specify the language parameter for better accuracy
  5. Multiple speakers: For meetings with many speakers, consider splitting the audio into smaller segments
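
Point 5 above can be sketched as a small helper that computes chunk boundaries with a short overlap (so a speaker turn is less likely to be cut exactly at a boundary), which you could then feed to an audio-splitting tool. The 300-second chunk size and 5-second overlap are illustrative assumptions, not API requirements:

```python
def chunk_bounds(duration, chunk_seconds=300.0, overlap=5.0):
    """Compute (start, end) times, in seconds, for splitting a long
    recording into overlapping chunks."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start = end - overlap  # back up slightly to overlap the chunks
    return bounds

print(chunk_bounds(700.0))
# [(0.0, 300.0), (295.0, 595.0), (590.0, 700.0)]
```

Each chunk is then transcribed separately; note that speaker labels are assigned per request, so speaker "A" in one chunk is not guaranteed to be the same person as "A" in another.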