Speech to Text
This feature and its corresponding model are currently in preview mode. Implementation details may change before the official release. Access to the speech-to-text model is available upon request.
Speech-to-text (ASR) models enable developers to convert audio into text. This capability is powered by models like NVIDIA Parakeet, which provide high-accuracy transcription with support for multiple languages. Our implementation includes word-level and segment-level timestamps, streaming transcription, and speaker diarization.
Overview
The Speech to Text API allows you to transcribe audio files into text with high accuracy. It supports:
- Multiple languages: English-only (v2) and multilingual (v3) models
- Timestamping: Word-level and segment-level timestamps for precise timing
- Streaming: Real-time transcription for both file uploads and continuous audio streams
- Multiple formats: JSON, plain text, SRT, and WebVTT subtitle formats
- Long audio handling: Automatic splitting of long audio files using Voice Activity Detection (VAD)
- Speaker diarization: Identification of different speakers in audio (when available)
Quick Start
Endpoint
POST https://api.inference.nebul.io/v1/audio/transcriptions
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | The audio file to transcribe (supports .mp3, .wav, .flac, .opus). |
| model | String | Yes | The model ID to use: nvidia/parakeet-tdt-0.6b-v2 for English-only or nvidia/parakeet-tdt-0.6b-v3 for multilingual. |
| language | String | No | Language code (ISO 639-1). Use 2-letter codes like nl for Dutch or de for German. Defaults to en. The v2 model supports en only. |
| prompt | String | No | Optional text prompt to guide the transcription. |
| response_format | String | No | Response format: json (default), text, srt, vtt, verbose_json, or diarized_json. Use diarized_json for speaker diarization. |
| temperature | Float | No | Sampling temperature between 0 and 1. Defaults to 0.0. |
Supported Languages (v3 model)
The v3 model supports the following languages: auto, bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv, ru, uk.
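If you want to fail fast on an unsupported code before uploading audio, the list above can be checked client-side. A minimal sketch (validate_language is an illustrative helper, not part of the API):

```python
# Language codes accepted by the v3 multilingual model (from the list above).
PARAKEET_V3_LANGUAGES = {
    "auto", "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de", "el",
    "hu", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl", "es", "sv",
    "ru", "uk",
}

def validate_language(code: str) -> str:
    """Return the code unchanged if the v3 model supports it, else raise ValueError."""
    if code not in PARAKEET_V3_LANGUAGES:
        raise ValueError(f"Unsupported language code for v3: {code!r}")
    return code
```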
Code Examples
Python
Basic transcription:
```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}
files = {"file": open("/path/to/speech_test.mp3", "rb")}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "json",
}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
```
Transcription with verbose JSON (includes word and segment timestamps):
```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}
files = {"file": open("/path/to/speech_test.mp3", "rb")}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "nl",
    "response_format": "verbose_json",
}

response = requests.post(url, headers=headers, files=files, data=data)
result = response.json()
print(f"Text: {result['text']}")
print(f"Language: {result['language']}")
print(f"Duration: {result['duration']} seconds")
print(f"Words: {len(result.get('words', []))}")
print(f"Segments: {len(result.get('segments', []))}")
```
Generate SRT subtitles:
```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}
files = {"file": open("/path/to/speech_test.mp3", "rb")}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "response_format": "srt",
}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.text)  # SRT subtitle content
```
cURL
Basic transcription:
```bash
curl -X POST https://api.inference.nebul.io/v1/audio/transcriptions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -F "file=@/path/to/speech_test.mp3" \
  -F "model=nvidia/parakeet-tdt-0.6b-v3" \
  -F "language=en" \
  -F "response_format=json"
```
Transcription with verbose JSON:
```bash
curl -X POST https://api.inference.nebul.io/v1/audio/transcriptions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -F "file=@/path/to/speech_test.mp3" \
  -F "model=nvidia/parakeet-tdt-0.6b-v3" \
  -F "language=nl" \
  -F "response_format=verbose_json"
```
Generate WebVTT subtitles:
```bash
curl -X POST https://api.inference.nebul.io/v1/audio/transcriptions \
  -H "Authorization: Bearer <YOUR_API_KEY>" \
  -F "file=@/path/to/speech_test.mp3" \
  -F "model=nvidia/parakeet-tdt-0.6b-v3" \
  -F "response_format=vtt"
```
Audio Format Support: The API supports .mp3, .wav, .flac, and .opus formats. For best accuracy, use high-quality audio files (16kHz or higher sample rate, mono or stereo).
Language Detection: For the v3 multilingual model, you can use "language": "auto" to automatically detect the language. For v2, only English ("en") is supported.
Long Audio Files: The API automatically handles long audio files by splitting them using Voice Activity Detection (VAD). For very long files, consider using the streaming endpoints for real-time processing.
Response Format
The API returns different response formats based on the response_format parameter:
JSON (default)
```json
{
  "text": "Hello. This is a test of the Parakeet speech recognition system.",
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 8
  }
}
```
Verbose JSON
Includes word-level and segment-level timestamps:
```json
{
  "text": "Hello. This is a test of the Parakeet speech recognition system.",
  "language": "en",
  "duration": 8.016,
  "words": [
    {"word": "Hello.", "start": 0.08, "end": 0.8},
    {"word": "This", "start": 1.12, "end": 1.36}
  ],
  "segments": [
    {"text": "Hello.", "start": 0.08, "end": 0.8},
    {"text": "This is a test of the Parakeet speech recognition system.", "start": 1.12, "end": 4.72}
  ],
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 8
  }
}
```
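The API can produce SRT directly via response_format, but the segment timestamps in a verbose JSON response are also easy to post-process yourself, for example to filter or renumber cues. A sketch (srt_timestamp and segments_to_srt are illustrative helpers, not part of the API):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, e.g. 0.08 -> '00:00:00,080'."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT cues from the 'segments' entries of a verbose JSON response."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(cues)
```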
Text
Plain text transcription:
Hello. This is a test of the Parakeet speech recognition system.
SRT (SubRip Subtitle)
Subtitle format for video players:
```
1
00:00:00,080 --> 00:00:00,800
Hello.

2
00:00:01,120 --> 00:00:04,720
This is a test of the Parakeet speech recognition system.
```
WebVTT
Web Video Text Tracks format:
```
WEBVTT

00:00:00.080 --> 00:00:00.800
Hello.

00:00:01.120 --> 00:00:04.720
This is a test of the Parakeet speech recognition system.
```
Diarized JSON
Returns the transcript with speaker labels, plus per-segment speaker information:
```json
{
  "task": "transcribe",
  "duration": 10.2,
  "text": "A: Hello, welcome to the meeting.\nB: Thank you for having me.\nA: Let's start with the agenda.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 3.0,
      "end": 5.8,
      "text": "Thank you for having me.",
      "speaker": "B"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 6.2,
      "end": 8.1,
      "text": "Let's start with the agenda.",
      "speaker": "A"
    }
  ]
}
```
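The segment list lends itself to simple post-processing, such as totaling how long each speaker talks. A sketch using the segment fields shown above (talk_time_by_speaker is an illustrative helper, not part of the API):

```python
from collections import defaultdict

def talk_time_by_speaker(segments: list[dict]) -> dict[str, float]:
    """Sum the spoken duration (end - start) per speaker label from diarized segments."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)
```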
Streaming Transcription
The API supports streaming transcription for real-time processing. There are two streaming endpoints:
File Streaming (/v1/audio/transcriptions/stream)
Streams transcription results as Server-Sent Events (SSE) while processing an uploaded audio file:
```python
import requests
import json

url = "https://api.inference.nebul.io/v1/audio/transcriptions/stream"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}
files = {"file": open("/path/to/speech_test.mp3", "rb")}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "response_format": "verbose_json",
}

response = requests.post(url, headers=headers, files=files, data=data, stream=True)
for line in response.iter_lines():
    if line:
        line_str = line.decode('utf-8')
        if line_str.startswith('data: '):
            data_str = line_str[6:]  # Remove 'data: ' prefix
            if data_str == '[DONE]':
                break
            try:
                chunk = json.loads(data_str)
                print(f"Chunk: {chunk.get('text', '')}")
            except json.JSONDecodeError:
                pass
```
Input Streaming (/v1/audio/transcriptions/stream_input)
Accepts a continuous audio stream and uses Voice Activity Detection (VAD) to detect silence boundaries and transcribe chunks as they complete:
```python
import requests
import json

url = "https://api.inference.nebul.io/v1/audio/transcriptions/stream_input"
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "verbose_json",
    "silence_threshold_seconds": 1.0,
    "min_chunk_seconds": 2.0,
    "max_chunk_seconds": 10.0,
}

# Stream audio data
def audio_stream():
    with open("/path/to/audio.wav", "rb") as f:
        while True:
            chunk = f.read(16000)  # Read in chunks
            if not chunk:
                break
            yield chunk

response = requests.post(
    url,
    headers=headers,
    data=data,
    files={"file": audio_stream()},
    stream=True,
)
for line in response.iter_lines():
    if line:
        line_str = line.decode('utf-8')
        if line_str.startswith('data: '):
            data_str = line_str[6:]
            if data_str == '[DONE]':
                break
            try:
                chunk = json.loads(data_str)
                print(f"Transcription: {chunk.get('text', '')}")
            except json.JSONDecodeError:
                pass
```
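Note that a 16000-byte read corresponds to half a second of audio only if the stream is 16 kHz, 16-bit mono PCM. If your audio differs, the read size for a target chunk duration can be computed; a sketch assuming raw PCM input (pcm_chunk_bytes is an illustrative helper, not part of the API):

```python
def pcm_chunk_bytes(seconds: float, sample_rate: int = 16000,
                    sample_width: int = 2, channels: int = 1) -> int:
    """Bytes needed to hold `seconds` of raw PCM audio.

    sample_width is bytes per sample (2 for 16-bit audio).
    """
    return int(seconds * sample_rate * sample_width * channels)
```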
Speaker Diarization
The API supports speaker diarization to identify different speakers in audio. When enabled, the transcription includes speaker labels for each segment, allowing you to distinguish between multiple speakers in conversations, interviews, or meetings.
To enable speaker diarization, use the diarized_json response format:
```python
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "diarized_json",
}
```
For detailed information about speaker diarization, including usage examples, response format, limitations, and best practices, see the Speaker Diarization guide.
Model Specifications
The following speech-to-text models are available:
- nvidia/parakeet-tdt-0.6b-v2: 0.6B parameters, 4K context, float16 precision, supports audio input, English-only (Preview)
- nvidia/parakeet-tdt-0.6b-v3: 0.6B parameters, 4K context, float16 precision, supports audio input, multilingual (24 languages) (Preview)
Model Versions
The API supports two Parakeet model versions:
- v2 (nvidia/parakeet-tdt-0.6b-v2): English-only model, optimized for English transcription
- v3 (nvidia/parakeet-tdt-0.6b-v3): Multilingual model supporting 24 languages
Choose the appropriate model based on your language requirements. The v3 model is recommended for non-English audio, while v2 may provide better accuracy for English-only use cases.
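That recommendation can be encoded in a small helper when your application handles multiple languages; a sketch (pick_model is illustrative, not part of the API):

```python
def pick_model(language: str) -> str:
    """Choose a Parakeet model ID following the recommendation above:
    v2 for English-only audio, v3 for everything else (including auto-detect)."""
    if language == "en":
        return "nvidia/parakeet-tdt-0.6b-v2"
    return "nvidia/parakeet-tdt-0.6b-v3"
```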
Long Audio Handling
The API automatically handles long audio files by splitting them into manageable segments using Voice Activity Detection (VAD). This ensures accurate transcription even for very long recordings without manual intervention.
Processing information is included in verbose JSON responses:
```json
{
  "processing_info": {
    "original_duration_seconds": 1800.0,
    "total_segments": 3,
    "split_performed": true,
    "split_points": [600.0, 1200.0]
  }
}
```
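The split points can be combined with the original duration to recover the time range each segment covers in the source recording; a sketch (segment_ranges is an illustrative helper, not part of the API):

```python
def segment_ranges(duration: float, split_points: list[float]) -> list[tuple[float, float]]:
    """Convert VAD split points into (start, end) ranges covering the full duration."""
    bounds = [0.0, *split_points, duration]
    return list(zip(bounds[:-1], bounds[1:]))
```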