Speaker Diarization

Speaker diarization identifies different speakers in audio recordings. When enabled, transcriptions include speaker labels for each segment, allowing you to distinguish between multiple speakers in conversations, interviews, or meetings.

Overview

The Speech to Text API supports speaker diarization through the diarized_json response format. This feature uses NeMo's ClusteringDiarizer to automatically detect and label speakers in audio files.

Speaker diarization is particularly useful for:

  • Meeting transcriptions
  • Interview analysis
  • Multi-speaker conversations
  • Podcast transcription
  • Call center recordings
  • Conference calls

How Diarization Works

NeMo's ClusteringDiarizer processes audio in four stages:

  1. Voice Activity Detection (VAD): Identifies speech segments in the audio
  2. Speaker Embeddings: Extracts speaker characteristics from each segment
  3. Clustering: Groups segments by speaker similarity
  4. Labeling: Assigns speaker labels (A, B, C, etc.) to each segment

Speaker labels are typically formatted as A, B, C, etc., or as SPEAKER_00, SPEAKER_01, etc.
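
The clustering stage (steps 2–3) can be illustrated with a toy sketch: treat each segment's speaker embedding as a vector and group segments whose embeddings are similar. The greedy threshold scheme, 2-D vectors, and 0.8 cutoff below are illustrative assumptions only, not NeMo's actual clustering algorithm:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each segment embedding to the first cluster whose
    representative it resembles; otherwise start a new cluster.
    Clusters are labeled A, B, C, ... in order of first appearance."""
    clusters = []  # one representative embedding per cluster
    labels = []
    for emb in embeddings:
        for i, rep in enumerate(clusters):
            if cosine(emb, rep) >= threshold:
                labels.append(chr(ord("A") + i))
                break
        else:
            clusters.append(emb)
            labels.append(chr(ord("A") + len(clusters) - 1))
    return labels

# Toy 2-D "speaker embeddings": two segments from one voice, one from another
segments = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]
print(greedy_cluster(segments))  # ['A', 'A', 'B']
```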

Usage

To enable speaker diarization, use the diarized_json response format when making transcription requests:

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}

# Request the diarized_json response format to enable speaker labels
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "diarized_json"
}

with open("/path/to/meeting_audio.mp3", "rb") as audio_file:
    response = requests.post(
        url, headers=headers, files={"file": audio_file}, data=data
    )

result = response.json()
print(f"Full text: {result['text']}")
for segment in result.get('segments', []):
    print(f"Speaker {segment['speaker']}: {segment['text']}")
```

Response Format

When diarization is enabled, the response includes speaker information in each segment. The response format follows the OpenAI diarized JSON structure:

```json
{
  "task": "transcribe",
  "duration": 10.2,
  "text": "A: Hello, welcome to the meeting.\nB: Thank you for having me.\nA: Let's start with the agenda.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 3.0,
      "end": 5.8,
      "text": "Thank you for having me.",
      "speaker": "B"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 6.2,
      "end": 8.5,
      "text": "Let's start with the agenda.",
      "speaker": "A"
    }
  ],
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 10.2
  }
}
```

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| `task` | String | Always `"transcribe"` for transcription tasks |
| `duration` | Number | Total duration of the audio in seconds |
| `text` | String | Full transcription with speaker labels (e.g., `"A: Hello\nB: Hi"`) |
| `segments` | Array | Array of segment objects with speaker information |
| `segments[].type` | String | Segment type, typically `"transcript.text.segment"` |
| `segments[].id` | String | Unique identifier for the segment |
| `segments[].start` | Number | Start time of the segment in seconds |
| `segments[].end` | Number | End time of the segment in seconds |
| `segments[].text` | String | Transcribed text for this segment |
| `segments[].speaker` | String | Speaker label (`A`, `B`, `C`, etc.) |
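
The segment fields above are enough to compute simple per-speaker statistics, such as talk time and turn count. A minimal sketch (the `speaker_stats` helper name is ours; the example payload mirrors the sample response above):

```python
from collections import defaultdict

def speaker_stats(result):
    """Aggregate per-speaker talk time and turn count from a
    diarized_json response body."""
    stats = defaultdict(lambda: {"seconds": 0.0, "turns": 0})
    for seg in result.get("segments", []):
        entry = stats[seg["speaker"]]
        entry["seconds"] += seg["end"] - seg["start"]
        entry["turns"] += 1
    return dict(stats)

example = {
    "segments": [
        {"speaker": "A", "start": 0.0, "end": 2.5},
        {"speaker": "B", "start": 3.0, "end": 5.8},
        {"speaker": "A", "start": 6.2, "end": 8.5},
    ]
}
print(speaker_stats(example))
```

Speaker A accumulates two turns totaling 4.8 seconds; speaker B one turn of 2.8 seconds.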

Limitations

  • Minimum audio length: Diarization works best with audio files longer than 10 seconds
  • Speaker count: The system can identify multiple speakers, but accuracy may decrease with more than 5-6 speakers
  • Audio quality: High-quality audio with clear speaker separation produces better results
  • Language support: Diarization is available for all languages supported by the v3 model
  • Processing time: Diarization adds additional processing time compared to standard transcription
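
For WAV input, the minimum-length limitation can be enforced client-side by reading the duration from the file header with Python's standard-library `wave` module before requesting diarization. The helper names below are ours; the 10-second threshold mirrors the note above:

```python
import os
import tempfile
import wave

MIN_DIARIZATION_SECONDS = 10.0  # per the minimum-length limitation above

def wav_duration(path):
    """Duration of a WAV file in seconds (header only, no decoding)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def long_enough_for_diarization(path):
    return wav_duration(path) >= MIN_DIARIZATION_SECONDS

# Demo: write 12 seconds of 16-bit mono silence and check it passes
demo = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000 * 12)
print(long_enough_for_diarization(demo))  # True
```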

Best Practices

  1. Audio quality: Use high-quality audio recordings with minimal background noise
  2. Speaker separation: Ensure speakers are clearly separated in the audio (avoid overlapping speech when possible)
  3. Minimum duration: Use audio files longer than 10 seconds for reliable speaker identification
  4. Language specification: Specify the language parameter for better accuracy
  5. Multiple speakers: For meetings with many speakers, consider splitting the audio into smaller segments
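
Point 5 above can be sketched as a small helper that computes chunk boundaries with a short overlap (so a speaker turn is less likely to be cut exactly at a boundary), which you could then feed to an audio-splitting tool. The 300-second chunk size and 5-second overlap are illustrative assumptions, not API requirements:

```python
def chunk_bounds(duration, chunk_seconds=300.0, overlap=5.0):
    """Compute (start, end) times, in seconds, for splitting a long
    recording into overlapping chunks."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        start = end - overlap  # back up slightly to overlap the chunks
    return bounds

print(chunk_bounds(700.0))
# [(0.0, 300.0), (295.0, 595.0), (590.0, 700.0)]
```

Each chunk is then transcribed separately; note that speaker labels are assigned per request, so speaker "A" in one chunk is not guaranteed to be the same person as "A" in another.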