Speaker Diarization
Speaker diarization identifies different speakers in audio recordings. When enabled, transcriptions include speaker labels for each segment, allowing you to distinguish between multiple speakers in conversations, interviews, or meetings.
Overview
The Speech to Text API supports speaker diarization through the diarized_json response format. This feature uses NeMo's ClusteringDiarizer to automatically detect and label speakers in audio files.
Speaker diarization is particularly useful for:
- Meeting transcriptions
- Interview analysis
- Multi-speaker conversations
- Podcast transcription
- Call center recordings
- Conference calls
How Diarization Works
The diarization pipeline runs in four stages:
- Voice Activity Detection (VAD): Identifies speech segments in the audio
- Speaker Embeddings: Extracts speaker characteristics from each segment
- Clustering: Groups segments by speaker similarity
- Labeling: Assigns speaker labels (A, B, C, etc.) to each segment
Speaker labels are typically formatted as A, B, C, etc., or as SPEAKER_00, SPEAKER_01, etc.
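To build intuition for the clustering and labeling stages, here is a simplified sketch (not NeMo's actual implementation): given one embedding vector per speech segment, group segments whose embeddings are similar and assign letters in order of first appearance. The greedy threshold approach below stands in for the real spectral/agglomerative clustering.

```python
def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def label_segments(embeddings, threshold=0.8):
    """Greedy clustering: each segment joins the first existing
    cluster whose representative embedding is similar enough,
    otherwise it starts a new cluster. Labels are A, B, C, ...
    in order of first appearance (illustrative only)."""
    centroids = []  # one representative embedding per speaker
    labels = []
    for emb in embeddings:
        for idx, centroid in enumerate(centroids):
            if cosine_similarity(emb, centroid) >= threshold:
                labels.append(chr(ord("A") + idx))
                break
        else:
            centroids.append(emb)
            labels.append(chr(ord("A") + len(centroids) - 1))
    return labels

# Toy embeddings: segments 0 and 2 sound alike, segment 1 differs.
print(label_segments([(1.0, 0.0), (0.0, 1.0), (0.9, 0.1)]))  # ['A', 'B', 'A']
```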
Usage
To enable speaker diarization, use the diarized_json response format when making transcription requests:
Python

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}
files = {"file": open("/path/to/meeting_audio.mp3", "rb")}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "diarized_json",
}

response = requests.post(url, headers=headers, files=files, data=data)
result = response.json()

print(f"Full text: {result['text']}")
for segment in result.get('segments', []):
    print(f"Speaker {segment['speaker']}: {segment['text']}")
```

cURL

```bash
curl -X POST https://api.inference.nebul.io/v1/audio/transcriptions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -F "file=@/path/to/meeting_audio.mp3" \
  -F "model=nvidia/parakeet-tdt-0.6b-v3" \
  -F "language=en" \
  -F "response_format=diarized_json"
```
Response Format
When diarization is enabled, the response includes speaker information in each segment. The response format follows the OpenAI diarized JSON structure:
```json
{
  "task": "transcribe",
  "duration": 10.2,
  "text": "A: Hello, welcome to the meeting.\nB: Thank you for having me.\nA: Let's start with the agenda.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 3.0,
      "end": 5.8,
      "text": "Thank you for having me.",
      "speaker": "B"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 6.2,
      "end": 8.5,
      "text": "Let's start with the agenda.",
      "speaker": "A"
    }
  ],
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 10.2
  }
}
```
Response Fields
| Field | Type | Description |
|---|---|---|
| task | String | Always "transcribe" for transcription tasks |
| duration | Number | Total duration of the audio in seconds |
| text | String | Full transcription with speaker labels (e.g., "A: Hello\nB: Hi") |
| segments | Array | Array of segment objects with speaker information |
| segments[].type | String | Segment type, typically "transcript.text.segment" |
| segments[].id | String | Unique identifier for the segment |
| segments[].start | Number | Start time of the segment in seconds |
| segments[].end | Number | End time of the segment in seconds |
| segments[].text | String | Transcribed text for this segment |
| segments[].speaker | String | Speaker label (A, B, C, etc.) |
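The segments array is straightforward to post-process. As an illustration, this sketch computes total speaking time per speaker from a response shaped like the example above (the data is hypothetical):

```python
from collections import defaultdict

# A response shaped like the diarized_json example above.
response = {
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "Hello, welcome to the meeting.", "speaker": "A"},
        {"start": 3.0, "end": 5.8, "text": "Thank you for having me.", "speaker": "B"},
        {"start": 6.2, "end": 8.5, "text": "Let's start with the agenda.", "speaker": "A"},
    ]
}

def talk_time(segments):
    """Sum (end - start) per speaker label, rounded to avoid
    floating-point noise in the totals."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(t, 2) for speaker, t in totals.items()}

print(talk_time(response["segments"]))  # {'A': 4.8, 'B': 2.8}
```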
Limitations
- Minimum audio length: Diarization works best with audio files longer than 10 seconds
- Speaker count: The system can identify multiple speakers, but accuracy may decrease with more than 5-6 speakers
- Audio quality: High-quality audio with clear speaker separation produces better results
- Language support: Diarization is available for all languages supported by the v3 model
- Processing time: Diarization adds additional processing time compared to standard transcription
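Because short clips diarize poorly, one pragmatic client-side guard (a sketch, not part of the API) is to check a WAV file's duration locally and fall back to standard transcription for clips under the 10-second mark:

```python
import io
import wave

def wav_duration_seconds(data: bytes) -> float:
    """Duration of a WAV file, computed from its raw bytes."""
    with wave.open(io.BytesIO(data)) as wav:
        return wav.getnframes() / wav.getframerate()

def should_diarize(data: bytes, min_seconds: float = 10.0) -> bool:
    """Request diarized_json only when the clip is long enough."""
    return wav_duration_seconds(data) >= min_seconds

# Demo: build a 12-second silent mono 16 kHz WAV in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000 * 12)

print(should_diarize(buf.getvalue()))  # True
```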
Best Practices
- Audio quality: Use high-quality audio recordings with minimal background noise
- Speaker separation: Ensure speakers are clearly separated in the audio (avoid overlapping speech when possible)
- Minimum duration: Use audio files longer than 10 seconds for reliable speaker identification
- Language specification: Specify the language parameter for better accuracy
- Multiple speakers: For meetings with many speakers, consider splitting the audio into smaller segments
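For long recordings with many speakers, splitting can be as simple as cutting the file into slightly overlapping windows and transcribing each one. A small helper (illustrative only; chunk and overlap lengths are arbitrary choices) that computes chunk boundaries in seconds:

```python
def chunk_boundaries(duration, chunk_len=300.0, overlap=5.0):
    """Return (start, end) pairs covering `duration` seconds.
    Consecutive chunks overlap slightly so speech at a cut point
    is not lost; the last chunk is clipped to the total duration."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start = end - overlap
    return bounds

# A ~12-minute recording split into 5-minute chunks with 5 s overlap.
print(chunk_boundaries(700.0))  # [(0.0, 300.0), (295.0, 595.0), (590.0, 700.0)]
```

Note that speaker labels are assigned independently per chunk, so "A" in one chunk is not guaranteed to be "A" in the next; stitching labels across chunks requires matching speakers between them.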