
Speech to Text

PREVIEW MODE

This feature and its corresponding model are currently in preview mode. Implementation details may change before the official release. Access to the speech-to-text model is available upon request.

Speech-to-text (ASR) models enable developers to convert audio into text. This capability is powered by models like NVIDIA Parakeet, which provide high-accuracy transcription with support for multiple languages. Our implementation includes word-level and segment-level timestamps, streaming transcription, and speaker diarization.

Overview

The Speech to Text API allows you to transcribe audio files into text with high accuracy. It supports:

  • Multiple languages: English-only (v2) and multilingual (v3) models
  • Timestamping: Word-level and segment-level timestamps for precise timing
  • Streaming: Real-time transcription for both file uploads and continuous audio streams
  • Multiple formats: JSON, plain text, SRT, and WebVTT subtitle formats
  • Long audio handling: Automatic splitting of long audio files using Voice Activity Detection (VAD)
  • Speaker diarization: Identification of different speakers in audio (when available)

Quick Start

Endpoint

POST https://api.inference.nebul.io/v1/audio/transcriptions

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| file | File | Yes | The audio file to transcribe (supports .mp3, .wav, .flac, .opus). |
| model | String | Yes | The model ID to use. Use nvidia/parakeet-tdt-0.6b-v2 for English-only or nvidia/parakeet-tdt-0.6b-v3 for multilingual. |
| language | String | No | Language code (ISO 639-1). Use 2-letter codes like nl for Dutch, de for German, etc. Defaults to en. For the v2 model, only en is supported. |
| prompt | String | No | Optional text prompt to guide the transcription. |
| response_format | String | No | Response format. Options: json (default), text, srt, vtt, verbose_json, diarized_json. Use diarized_json for speaker diarization. |
| temperature | Float | No | Sampling temperature between 0 and 1. Defaults to 0.0. |

Supported Languages (v3 model)

The v3 model supports the following languages: auto, bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv, ru, uk.
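Because the v2 model accepts only en, it can save a round trip to validate the language code client-side before issuing a request. A minimal sketch (the language sets mirror the lists in this document; the helper name is illustrative, not part of the API):

```python
# Language codes accepted by each Parakeet model, per the lists above.
V3_LANGUAGES = {
    "auto", "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de",
    "el", "hu", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk", "sl",
    "es", "sv", "ru", "uk",
}
V2_LANGUAGES = {"en"}

def validate_language(model: str, language: str) -> str:
    """Return the language code if the model supports it, else raise."""
    supported = V3_LANGUAGES if model.endswith("-v3") else V2_LANGUAGES
    if language not in supported:
        raise ValueError(f"{model} does not support language {language!r}")
    return language

validate_language("nvidia/parakeet-tdt-0.6b-v3", "nl")  # accepted
```

Running this check locally turns a server-side error into an immediate, descriptive exception.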

Code Examples

Basic transcription:

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "json"
}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
```

Transcription with verbose JSON (includes word and segment timestamps):

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "nl",
    "response_format": "verbose_json"
}

response = requests.post(url, headers=headers, files=files, data=data)
result = response.json()
print(f"Text: {result['text']}")
print(f"Language: {result['language']}")
print(f"Duration: {result['duration']} seconds")
print(f"Words: {len(result.get('words', []))}")
print(f"Segments: {len(result.get('segments', []))}")
```

Generate SRT subtitles:

```python
import requests

url = "https://api.inference.nebul.io/v1/audio/transcriptions"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "response_format": "srt"
}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.text)  # SRT subtitle content
```
Tip: Audio format support: The API supports .mp3, .wav, .flac, and .opus formats. For best accuracy, use high-quality audio files (16 kHz or higher sample rate, mono or stereo).

Tip: Language detection: For the v3 multilingual model, you can use "language": "auto" to detect the language automatically. For v2, only English ("en") is supported.

Tip: Long audio files: The API automatically handles long audio files by splitting them using Voice Activity Detection (VAD). For very long files, consider using the streaming endpoints for real-time processing.

Response Format

The API returns different response formats based on the response_format parameter:

JSON (default)

```json
{
  "text": "Hello. This is a test of the Parakeet speech recognition system.",
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 8
  }
}
```

Verbose JSON

Includes word-level and segment-level timestamps:

```json
{
  "text": "Hello. This is a test of the Parakeet speech recognition system.",
  "language": "en",
  "duration": 8.016,
  "words": [
    {
      "word": "Hello.",
      "start": 0.08,
      "end": 0.8
    },
    {
      "word": "This",
      "start": 1.12,
      "end": 1.36
    }
  ],
  "segments": [
    {
      "text": "Hello.",
      "start": 0.08,
      "end": 0.8
    },
    {
      "text": "This is a test of the Parakeet speech recognition system.",
      "start": 1.12,
      "end": 4.72
    }
  ],
  "usage": {
    "input_tokens": 8,
    "output_tokens": 12,
    "total_tokens": 20,
    "seconds": 8
  }
}
```
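The word-level timestamps in a verbose JSON response are plain start/end pairs, so they can be post-processed directly. As an illustration, here is a small client-side helper (not part of the API) that estimates the speaking rate from a verbose result:

```python
def words_per_minute(result: dict) -> float:
    """Estimate speaking rate from a verbose_json transcription result."""
    words = result.get("words", [])
    if not words:
        return 0.0
    # Span from the first word's start to the last word's end, in seconds.
    span = words[-1]["end"] - words[0]["start"]
    return len(words) / (span / 60) if span > 0 else 0.0

# Word entries shaped like the verbose_json sample above.
sample = {
    "words": [
        {"word": "Hello.", "start": 0.08, "end": 0.8},
        {"word": "This", "start": 1.12, "end": 1.36},
    ]
}
rate = words_per_minute(sample)  # 2 words over 1.28 s
```

The same pattern works for any derived metric (pauses, per-segment rates) since the timestamps are absolute offsets into the audio.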

Text

Plain text transcription:

```text
Hello. This is a test of the Parakeet speech recognition system.
```

SRT (SubRip Subtitle)

Subtitle format for video players:

```srt
1
00:00:00,080 --> 00:00:00,800
Hello.

2
00:00:01,120 --> 00:00:04,720
This is a test of the Parakeet speech recognition system.
```
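If you request verbose_json instead, the segment timestamps can be rendered as SRT locally, for example to tweak cue splitting. A sketch of the conversion (hypothetical helpers, not an API feature):

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Render verbose_json segments as numbered SRT subtitle entries."""
    entries = []
    for i, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        entries.append(f"{i}\n{start} --> {end}\n{seg['text']}")
    return "\n\n".join(entries)
```

Note that SRT uses a comma before the milliseconds while WebVTT uses a period; only the timestamp formatter would differ between the two.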

WebVTT

Web Video Text Tracks format:

```vtt
WEBVTT

00:00:00.080 --> 00:00:00.800
Hello.

00:00:01.120 --> 00:00:04.720
This is a test of the Parakeet speech recognition system.
```

Diarized JSON

Speaker diarization format with speaker labels. Returns speaker-labeled text and segments with speaker information:

```json
{
  "task": "transcribe",
  "duration": 10.2,
  "text": "A: Hello, welcome to the meeting.\nB: Thank you for having me.\nA: Let's start with the agenda.",
  "segments": [
    {
      "type": "transcript.text.segment",
      "id": "seg_001",
      "start": 0.0,
      "end": 2.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "A"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_002",
      "start": 3.0,
      "end": 5.8,
      "text": "Thank you for having me.",
      "speaker": "B"
    },
    {
      "type": "transcript.text.segment",
      "id": "seg_003",
      "start": 6.2,
      "end": 8.1,
      "text": "Let's start with the agenda.",
      "speaker": "A"
    }
  ]
}
```
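Because every diarized segment carries a speaker label plus start/end times, per-speaker statistics fall out directly. For instance, a small client-side helper (illustrative, not part of the API) that sums talk time per speaker:

```python
from collections import defaultdict

def talk_time_by_speaker(result: dict) -> dict:
    """Sum segment durations per speaker from a diarized_json result."""
    totals = defaultdict(float)
    for seg in result.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)
```

Applied to the sample above, speaker A accumulates 2.5 s + 1.9 s and speaker B 2.8 s.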

Streaming Transcription

The API supports streaming transcription for real-time processing. There are two streaming endpoints:

File Streaming (/v1/audio/transcriptions/stream)

Streams transcription results as Server-Sent Events (SSE) while processing an uploaded audio file:

```python
import requests
import json

url = "https://api.inference.nebul.io/v1/audio/transcriptions/stream"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
files = {
    "file": open("/path/to/speech_test.mp3", "rb")
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "response_format": "verbose_json"
}

response = requests.post(url, headers=headers, files=files, data=data, stream=True)
for line in response.iter_lines():
    if line:
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            data_str = line_str[6:]  # Remove the 'data: ' prefix
            if data_str == "[DONE]":
                break
            try:
                chunk = json.loads(data_str)
                print(f"Chunk: {chunk.get('text', '')}")
            except json.JSONDecodeError:
                pass
```

Input Streaming (/v1/audio/transcriptions/stream_input)

Accepts continuous audio stream and uses Voice Activity Detection (VAD) to detect silence and process chunks:

```python
import requests
import json

url = "https://api.inference.nebul.io/v1/audio/transcriptions/stream_input"
headers = {
    "Authorization": "Bearer <YOUR_API_KEY>"
}
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "verbose_json",
    "silence_threshold_seconds": 1.0,
    "min_chunk_seconds": 2.0,
    "max_chunk_seconds": 10.0
}

# Stream audio data in fixed-size chunks
def audio_stream():
    with open("/path/to/audio.wav", "rb") as f:
        while True:
            chunk = f.read(16000)  # Read in chunks
            if not chunk:
                break
            yield chunk

response = requests.post(
    url,
    headers=headers,
    data=data,
    files={"file": audio_stream()},
    stream=True
)
for line in response.iter_lines():
    if line:
        line_str = line.decode("utf-8")
        if line_str.startswith("data: "):
            data_str = line_str[6:]
            if data_str == "[DONE]":
                break
            try:
                chunk = json.loads(data_str)
                print(f"Transcription: {chunk.get('text', '')}")
            except json.JSONDecodeError:
                pass
```

Speaker Diarization

The API supports speaker diarization to identify different speakers in audio. When enabled, the transcription includes speaker labels for each segment, allowing you to distinguish between multiple speakers in conversations, interviews, or meetings.

To enable speaker diarization, use the diarized_json response format:

```python
data = {
    "model": "nvidia/parakeet-tdt-0.6b-v3",
    "language": "en",
    "response_format": "diarized_json"
}
```

For detailed information about speaker diarization, including usage examples, response format, limitations, and best practices, see the Speaker Diarization guide.

Model Specifications

The following speech-to-text models are available:

  • nvidia/parakeet-tdt-0.6b-v2 - 0.6B parameters, 4K context, float16 precision, audio input, English-only (Preview)
  • nvidia/parakeet-tdt-0.6b-v3 - 0.6B parameters, 4K context, float16 precision, audio input, multilingual (25 languages) (Preview)

Model Versions

The API supports two Parakeet model versions:

  • v2 (nvidia/parakeet-tdt-0.6b-v2): English-only model, optimized for English transcription
  • v3 (nvidia/parakeet-tdt-0.6b-v3): Multilingual model supporting 25 languages

Choose the appropriate model based on your language requirements. The v3 model is recommended for non-English audio, while v2 may provide better accuracy for English-only use cases.
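This selection rule can be encoded in a one-line helper if you route mixed workloads. A sketch (the helper name is illustrative; the model IDs come from the list above):

```python
def pick_model(language: str) -> str:
    """Pick a Parakeet model: v2 for English-only audio, v3 otherwise."""
    if language == "en":
        return "nvidia/parakeet-tdt-0.6b-v2"
    return "nvidia/parakeet-tdt-0.6b-v3"

pick_model("en")  # English audio routes to the v2 model
pick_model("nl")  # everything else routes to the multilingual v3 model
```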

Long Audio Handling

The API automatically handles long audio files by splitting them into manageable segments using Voice Activity Detection (VAD). This ensures accurate transcription even for very long recordings without manual intervention.

Processing information is included in verbose JSON responses:

```json
{
  "processing_info": {
    "original_duration_seconds": 1800.0,
    "total_segments": 3,
    "split_performed": true,
    "split_points": [600.0, 1200.0]
  }
}
```
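When logging or debugging long-audio jobs, it can be handy to condense this block into a single line. A small client-side sketch (field names taken from the sample above; the helper itself is not part of the API):

```python
def summarize_processing(result: dict) -> str:
    """Build a one-line summary of how a long audio file was split."""
    info = result.get("processing_info")
    if not info or not info.get("split_performed"):
        return "audio processed in a single pass"
    points = ", ".join(f"{p:.0f}s" for p in info["split_points"])
    return (
        f"{info['original_duration_seconds']:.0f}s audio split into "
        f"{info['total_segments']} segments at {points}"
    )
```

For the sample response above this yields "1800s audio split into 3 segments at 600s, 1200s".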