
Speech to Text

PREVIEW MODE

This feature and its corresponding model are currently in preview mode. Implementation details may change before the official release. Access to the speech-to-text model is available upon request.

Speech-to-text (ASR) models enable developers to convert audio into text. This capability is powered by models like NVIDIA Parakeet, which provide high-accuracy transcription with support for multiple languages. Our implementation includes word-level and segment-level timestamps, streaming transcription, and speaker diarization.

Overview

The Speech to Text API allows you to transcribe audio files into text with high accuracy. It supports:

  • Multiple languages: English-only (v2) and multilingual (v3) models
  • Timestamping: Word-level and segment-level timestamps for precise timing
  • Streaming: Real-time transcription for both file uploads and continuous audio streams
  • Multiple formats: JSON, plain text, SRT, and WebVTT subtitle formats
  • Long audio handling: Automatic splitting of long audio files using Voice Activity Detection (VAD)
  • Speaker diarization: Identification of different speakers in audio (when available)

Quick Start

Endpoint

POST https://api.inference.nebul.io/v1/audio/transcriptions

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | File | Yes | The audio file to transcribe (supports .mp3, .wav, .flac, .opus). |
| model | String | Yes | The model ID to use. Use nvidia/parakeet-tdt-0.6b-v2 for English-only or nvidia/parakeet-tdt-0.6b-v3 for multilingual. |
| language | String | No | Language code (ISO 639-1). Use 2-letter codes like nl for Dutch, de for German, etc. Defaults to en. For the v2 model, only en is supported. |
| prompt | String | No | Optional text prompt to guide the transcription. |
| response_format | String | No | Response format. Options: json (default), text, srt, vtt, verbose_json, diarized_json. Use diarized_json for speaker diarization. |
| temperature | Float | No | Sampling temperature between 0 and 1. Defaults to 0.0. |

Supported Languages (v3 model)

The v3 model supports the following languages: auto, bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv, ru, uk.
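Because the v2 model accepts only en, it can save a round trip to validate the language code client-side before issuing a request. A minimal sketch (the language sets mirror the lists in this document; the helper name is illustrative, not part of the API):

```python
# Language codes accepted by each Parakeet model, per the lists above.
V3_LANGUAGES = {
    "auto", "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de",
    "el", "hu", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl",
    "es", "sv", "ru", "uk",
}
V2_LANGUAGES = {"en"}

def validate_language(model: str, language: str) -> str:
    """Return the language code if the model supports it, else raise."""
    supported = V3_LANGUAGES if model.endswith("-v3") else V2_LANGUAGES
    if language not in supported:
        raise ValueError(f"{model} does not support language {language!r}")
    return language

validate_language("nvidia/parakeet-tdt-0.6b-v3", "nl")  # accepted
```

Running this check locally turns a server-side error into an immediate, descriptive exception.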

Code Examples

Basic transcription:

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "json"
}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
```

Transcription with verbose JSON (includes word and segment timestamps):

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "nl",
    "response_format": "verbose_json"
}

response = requests.post(url, headers=headers, files=files, data=data)
result = response.json()
print(f"Text: {result['text']}")
print(f"Language: {result['language']}")
print(f"Duration: {result['duration']} seconds")
print(f"Words: {len(result.get('words', []))}")
print(f"Segments: {len(result.get('segments', []))}")
```

Generate SRT subtitles:

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "response_format": "srt"
}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.text)  # SRT subtitle content
```
Tip: Audio format support: The API supports .mp3, .wav, .flac, and .opus formats. For best accuracy, use high-quality audio files (16 kHz or higher sample rate, mono or stereo).

Tip: Language detection: For the v3 multilingual model, you can use "language": "auto" to detect the language automatically. For v2, only English ("en") is supported.

Tip: Long audio files: The API automatically handles long audio files by splitting them using Voice Activity Detection (VAD). For very long files, consider using the streaming endpoints for real-time processing.

Response Format

The API returns different response formats based on the response_format parameter:

JSON (default)

```json
{
  "text": "Hello. This is a test of the Parakeet speech recognition system.",
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 8
  }
}
```

Verbose JSON

Includes word-level and segment-level timestamps:

```json
{
  "text": "Hello. This is a test of the Parakeet speech recognition system.",
  "language": "en",
  "duration": 8.016,
  "words": [
    {
      "word": "Hello.",
      "start": 0.08,
      "end": 0.8
    },
    {
      "word": "This",
      "start": 1.12,
      "end": 1.36
    }
  ],
  "segments": [
    {
      "text": "Hello.",
      "start": 0.08,
      "end": 0.8
    },
    {
      "text": "This is a test of the Parakeet speech recognition system.",
      "start": 1.12,
      "end": 4.72
    }
  ],
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 8
  }
}
```
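The word-level timestamps in a verbose JSON response are plain start/end pairs, so they can be post-processed directly. As an illustration, here is a small client-side helper (not part of the API) that estimates the speaking rate from a verbose result:

```python
def words_per_minute(result: dict) -> float:
    """Estimate speaking rate from a verbose_json transcription result."""
    words = result.get("words", [])
    if not words:
        return 0.0
    # Span from the first word's start to the last word's end, in seconds.
    span = words[-1]["end"] - words[0]["start"]
    return len(words) / (span / 60) if span > 0 else 0.0

# Word entries shaped like the verbose_json sample above.
sample = {
    "words": [
        {"word": "Hello.", "start": 0.08, "end": 0.8},
        {"word": "This", "start": 1.12, "end": 1.36},
    ]
}
rate = words_per_minute(sample)  # 2 words over 1.28 s
```

The same pattern works for any derived metric (pauses, per-segment rates) since the timestamps are absolute offsets into the audio.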

Text

Plain text transcription:

```text
Hello. This is a test of the Parakeet speech recognition system.
```

SRT (SubRip Subtitle)

Subtitle format for video players:

```srt
1
00:00:00,080 --> 00:00:00,800
Hello.

2
00:00:01,120 --> 00:00:04,720
This is a test of the Parakeet speech recognition system.
```
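If you request verbose_json instead, the segment timestamps can be rendered as SRT locally, for example to tweak cue splitting. A sketch of the conversion (hypothetical helpers, not an API feature):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json segments as numbered SRT subtitle entries."""
    entries = []
    for i, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        entries.append(f"{i}\n{start} --> {end}\n{seg['text']}")
    return "\n\n".join(entries)
```

Note that SRT uses a comma before the milliseconds while WebVTT uses a period; only the timestamp formatter would differ between the two.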

WebVTT

Web Video Text Tracks format:

```vtt
WEBVTT

00:00:00.080 --> 00:00:00.800
Hello.

00:00:01.120 --> 00:00:04.720
This is a test of the Parakeet speech recognition system.
```

Diarized JSON

Speaker diarization format with speaker labels. Returns speaker-labeled text and segments with speaker information:

```json
{
  "task": "transcribe",
  "duration": 10.2,
  "text": "A: Hello, welcome to the meeting.\nB: Thank you for having me.\nA: Let's start with the agenda.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 3.0,
      "end": 5.8,
      "text": "Thank you for having me.",
      "speaker": "B"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 6.2,
      "end": 8.1,
      "text": "Let's start with the agenda.",
      "speaker": "A"
    }
  ]
}
```
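Because every diarized segment carries a speaker label plus start/end times, per-speaker statistics fall out directly. For instance, a small client-side helper (illustrative, not part of the API) that sums talk time per speaker:

```python
from collections import defaultdict

def talk_time_by_speaker(result: dict) -> dict:
    """Sum segment durations per speaker from a diarized_json result."""
    totals = defaultdict(float)
    for seg in result.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)
```

Applied to the sample above, speaker A accumulates 2.5 s + 1.9 s and speaker B 2.8 s.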

Streaming Transcription

The API supports streaming transcription for real-time processing. There are two streaming endpoints:

File Streaming (/v1/audio/transcriptions/stream)

Streams transcription results as Server-Sent Events (SSE) while processing an uploaded audio file:

```python
import requests
import json

url = "https://api.inference.nebul.io/v1/audio/transcriptions/stream"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "response_format": "verbose_json"
}

response = requests.post(url, headers=headers, files=files, data=data, stream=True)
for line in response.iter_lines():
    if line:
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            data_str = line_str[6:]  # Remove the 'data: ' prefix
            if data_str == "[DONE]":
                break
            try:
                chunk = json.loads(data_str)
                print(f"Chunk: {chunk.get('text', '')}")
            except json.JSONDecodeError:
                pass
```

Input Streaming (/v1/audio/transcriptions/stream_input)

Accepts continuous audio stream and uses Voice Activity Detection (VAD) to detect silence and process chunks:

```python
import requests
import json

url = "https://api.inference.nebul.io/v1/audio/transcriptions/stream_input"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "verbose_json",
    "silence_threshold_seconds": 1.0,
    "min_chunk_seconds": 2.0,
    "max_chunk_seconds": 10.0
}

# Stream audio data in fixed-size chunks
def audio_stream():
    with open("/path/to/audio.wav", "rb") as f:
        while True:
            chunk = f.read(16000)  # Read in chunks
            if not chunk:
                break
            yield chunk

response = requests.post(
    url,
    headers=headers,
    data=data,
    files={"file": audio_stream()},
    stream=True
)
for line in response.iter_lines():
    if line:
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            data_str = line_str[6:]
            if data_str == "[DONE]":
                break
            try:
                chunk = json.loads(data_str)
                print(f"Transcription: {chunk.get('text', '')}")
            except json.JSONDecodeError:
                pass
```

Speaker Diarization

The API supports speaker diarization to identify different speakers in audio. When enabled, the transcription includes speaker labels for each segment, allowing you to distinguish between multiple speakers in conversations, interviews, or meetings.

To enable speaker diarization, use the diarized_json response format:

```python
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "diarized_json"
}
```

For detailed information about speaker diarization, including usage examples, response format, limitations, and best practices, see the Speaker Diarization guide.

Model Specifications

The following speech-to-text models are available:

  • nvidia/parakeet-tdt-0.6b-v2 - 0.6B parameters, 4K context, float16 precision, audio input, English-only (Preview)
  • nvidia/parakeet-tdt-0.6b-v3 - 0.6B parameters, 4K context, float16 precision, audio input, multilingual (25 languages) (Preview)

Model Versions

The API supports two Parakeet model versions:

  • v2 (nvidia/parakeet-tdt-0.6b-v2): English-only model, optimized for English transcription
  • v3 (nvidia/parakeet-tdt-0.6b-v3): Multilingual model supporting 25 languages

Choose the appropriate model based on your language requirements. The v3 model is recommended for non-English audio, while v2 may provide better accuracy for English-only use cases.
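This selection rule can be encoded in a one-line helper if you route mixed workloads. A sketch (the helper name is illustrative; the model IDs come from the list above):

```python
def pick_model(language: str) -> str:
    """Pick a Parakeet model: v2 for English-only audio, v3 otherwise."""
    if language == "en":
        return "nvidia/parakeet-tdt-0.6b-v2"
    return "nvidia/parakeet-tdt-0.6b-v3"

pick_model("en")  # English audio routes to the v2 model
pick_model("nl")  # everything else routes to the multilingual v3 model
```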

Long Audio Handling

The API automatically handles long audio files by splitting them into manageable segments using Voice Activity Detection (VAD). This ensures accurate transcription even for very long recordings without manual intervention.

Processing information is included in verbose JSON responses:

```json
{
  "processing_info": {
    "original_duration_seconds": 1800.0,
    "total_segments": 3,
    "split_performed": true,
    "split_points": [600.0, 1200.0]
  }
}
```
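When logging or debugging long-audio jobs, it can be handy to condense this block into a single line. A small client-side sketch (field names taken from the sample above; the helper itself is not part of the API):

```python
def summarize_processing(result: dict) -> str:
    """Build a one-line summary of how a long audio file was split."""
    info = result.get("processing_info")
    if not info or not info.get("split_performed"):
        return "audio processed in a single pass"
    points = ", ".join(f"{p:.0f}s" for p in info["split_points"])
    return (
        f"{info['original_duration_seconds']:.0f}s audio split into "
        f"{info['total_segments']} segments at {points}"
    )
```

For the sample response above this yields "1800s audio split into 3 segments at 600s, 1200s".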