Merged
Changes from 3 commits
6 changes: 6 additions & 0 deletions .gitignore
@@ -15,6 +15,12 @@ packages/sdk/bun.lock
NOTICE_LOG.txt
NOTICE_FULL_REPORT.txt

# Local TTS / transcription example output artifacts (generated by
# packages/sdk/examples/**). Never commit these - they're large WAVs
# produced by running the demos locally.
packages/sdk/*-output.wav
packages/sdk/examples/**/*-output.wav

# Slack/Discord copy-paste announcement posts generated by the
# changelog skill (see scripts/sdk/generate-changelog-sdk-pod.cjs
# --generate-announcement-post). These are local working artifacts,
79 changes: 70 additions & 9 deletions docs/website/content/docs/ai-capabilities/transcription.mdx
@@ -6,11 +6,11 @@ schemaType: HowTo

## Overview

Transcription uses your choice of either [`qvac-ext-lib-whisper.cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp) or [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) (via [ONNX Runtime](https://onnxruntime.ai)) as the inference engine. Load a model using `modelType: "whisper"` for `qvac-ext-lib-whisper.cpp`, or `modelType: "parakeet"` for Parakeet. Parakeet supports multilingual transcription (TDT), English-only transcription (CTC), and speaker diarization (Sortformer).
Transcription uses your choice of either [`qvac-ext-lib-whisper.cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp) or [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (via the GGML-based [`parakeet-cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp/tree/main/parakeet-cpp) engine) as the inference engine. Load a model using `modelType: "whisper"` for `qvac-ext-lib-whisper.cpp`, or `modelType: "parakeet"` for Parakeet. Parakeet supports multilingual transcription (TDT), English-only transcription (CTC), speaker diarization (Sortformer), and end-of-utterance detection (EOU) for duplex streaming.

Provide audio input as `audioChunk`, either as a file path (string) or an in-memory audio buffer.

`transcribe()` returns the full transcription as a single `string`. If you need partial results as they become available, use `transcribeStream()` to receive text chunks in real time.
`transcribe()` returns the full transcription as a single `string`. If you need partial results as they become available, use `transcribeStream()` to receive text chunks in real time. Both whisper and parakeet expose duplex `transcribeStream()` sessions; see "Streaming with `transcribeStream()`" below.

## Functions

@@ -31,23 +31,84 @@ You should load two models:

### Parakeet

Parakeet requires multiple model artifacts. The required files depend on the model variant:
- **TDT** (multilingual, ~25 languages): encoder, encoder data, decoder, vocabulary, and preprocessor files.
- **CTC** (English-only): model, model data, and tokenizer files.
- **Sortformer** (speaker diarization): a single model file.
As of `@qvac/transcription-parakeet` 0.6.0, Parakeet ships as a **single GGUF** per variant: the addon auto-detects TDT / CTC / Sortformer / EOU from the `parakeet.model.type` GGUF metadata. There is no `modelConfig.modelType` discriminator, no per-variant `parakeet*Src` artifact fields, and no `ParakeetArtifactsRequiredError`. Just supply the GGUF via the top-level `modelSrc`:

Pass the model variant via `modelConfig.modelType` (`"tdt"`, `"ctc"`, or `"sortformer"`) and provide the corresponding source fields in `modelConfig`. See [`loadModel()` - Parakeet `modelConfig`](/reference/api#loadmodel).
```ts
await loadModel({
  modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0, // multilingual, ~750MB
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_CTC_0_6B_Q8_0, // English-only, streaming-capable
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_SORTFORMER_4SPK_V1_Q8_0, // 4-speaker diarization
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_EOU_120M_V1_Q8_0, // end-of-utterance detection
  modelType: "parakeet",
});
```

For model artifacts available as constants, see [SDK - Models](/introduction#models).

<Callout type="success">
**Tip:** if you are not using SDK model constants, download the required files from a compatible model repository, e.g., [`parakeet-ctc-0.6b-ONNX` on Hugging Face](https://huggingface.co/onnx-community/parakeet-ctc-0.6b-ONNX/tree/main/onnx). For a list of compatible repositories, see [Addons - Parakeet](https://github.com/tetherto/qvac/tree/main/packages/transcription-parakeet#models).
<Callout type="info">
**Migrating from pre-0.6 Parakeet (ONNX multi-file):** the legacy multi-file ONNX `modelConfig` shape (`parakeetEncoderSrc` / `parakeetDecoderSrc` / `parakeetVocabSrc` / `parakeetPreprocessorSrc`, plus `parakeetCtcModelSrc` / `parakeetTokenizerSrc` and `parakeetSortformerSrc` for the CTC/Sortformer variants) is no longer supported. Passing any of those fields raises a structured `LegacyParakeetModelDeprecatedError` with a migration message. The legacy ONNX constants (e.g. `PARAKEET_TDT_ENCODER_INT8`, `PARAKEET_CTC_FP32`, `PARAKEET_SORTFORMER_FP32`) remain exported for one minor cycle for codemod migrations only and will be removed in a future release.
</Callout>
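
The migration note above lists the field names that are no longer accepted. As a rough pre-flight sketch, a caller could scan its own `modelConfig` for those names before calling `loadModel()`. The helper `findLegacyParakeetFields` is hypothetical and not part of the SDK; only the field names come from the note.

```ts
// Legacy multi-file ONNX field names, copied from the migration note.
const LEGACY_PARAKEET_FIELDS = [
  "parakeetEncoderSrc",
  "parakeetDecoderSrc",
  "parakeetVocabSrc",
  "parakeetPreprocessorSrc",
  "parakeetCtcModelSrc",
  "parakeetTokenizerSrc",
  "parakeetSortformerSrc",
] as const;

// Hypothetical helper (not an SDK function): returns the legacy fields
// that are set on a config object, in the order listed above.
function findLegacyParakeetFields(
  modelConfig: Record<string, unknown>,
): string[] {
  return LEGACY_PARAKEET_FIELDS.filter(
    (field) => modelConfig[field] !== undefined,
  );
}
```

An empty result means none of the listed fields are present. This check is only a convenience for codemods and does not replace the `LegacyParakeetModelDeprecatedError` raised by the SDK itself.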

<Callout type="info">
**On VAD:** when using `qvac-ext-lib-whisper.cpp`, you can optionally provide a separate model for voice activity detection (VAD); this is recommended. Parakeet, by contrast, handles VAD internally, so no additional model or configuration is required.
</Callout>

## Streaming with `transcribeStream()`

`transcribeStream()` opens a duplex session for both engines: write audio chunks via `session.write(...)` and iterate events with `for await (const event of session) { ... }`. Events are typed as a discriminated union keyed on `type`:

- `{ type: "text", text }`: incremental transcript text.
- `{ type: "segment", segment }`: segment metadata (whisper-only when `metadata: true`).
- `{ type: "vad", speaking, probability }`: voice-activity-detection state (whisper-only).
- `{ type: "endOfTurn", source: "whisper", silenceDurationMs }`: turn boundary detected from a measured silence window (whisper).
- `{ type: "endOfTurn", source: "parakeet" }`: turn boundary detected from the EOU model's `<EOU>` token (parakeet; there is no silence window because the event is token-driven).

The `source` field on `endOfTurn` lets consumers narrow the union: whisper events always carry a numeric `silenceDurationMs`; parakeet events never do.
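
A minimal sketch of that narrowing, assuming local stand-in types that mirror the two `endOfTurn` shapes listed above (the type names and the `describeEndOfTurn` helper are illustrative, not SDK exports):

```ts
// Local stand-ins for the two endOfTurn shapes (assumed, not SDK imports).
type WhisperEndOfTurn = {
  type: "endOfTurn";
  source: "whisper";
  silenceDurationMs: number;
};
type ParakeetEndOfTurn = { type: "endOfTurn"; source: "parakeet" };
type EndOfTurnEvent = WhisperEndOfTurn | ParakeetEndOfTurn;

// Hypothetical helper: checking `source` narrows the union.
function describeEndOfTurn(event: EndOfTurnEvent): string {
  if (event.source === "whisper") {
    // Narrowed to WhisperEndOfTurn: silenceDurationMs is a number here.
    return `whisper turn ended after ${event.silenceDurationMs} ms of silence`;
  }
  // Narrowed to ParakeetEndOfTurn: reading silenceDurationMs here
  // would be a compile-time error.
  return "parakeet turn ended on <EOU> token";
}
```

Because the check is on a literal-typed field, TypeScript does the narrowing at compile time and no runtime `in` test is needed.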

### Parakeet duplex streaming

Pass `parakeetStreamingConfig` to `transcribeStream()` to override per-call streaming knobs (each falls back to its `parakeetConfig.streaming*` load-time counterpart):

```ts
const session = await transcribeStream({
  modelId,
  parakeetStreamingConfig: {
    chunkMs: 1000, // encoder cadence
    historyMs: 30000, // sortformer rolling-history window
    leftContextMs: 500, // ASR encoder left-context window
    rightLookaheadMs: 200, // ASR encoder right-lookahead window
    emitPartials: true, // emit partial segments before chunk boundaries
    emitEnergyVad: false, // CTC/TDT energy-based VAD hint (engine-internal)
  },
});

for await (const event of session) {
  switch (event.type) {
    case "text":
      process.stdout.write(event.text);
      break;
    case "endOfTurn":
      // event.source: "whisper" | "parakeet"
      console.log("\n[endOfTurn] turn boundary detected\n");
      break;
  }
}
```
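
The fallback rule described before the example (a per-call value wins, otherwise the load-time value applies) can be pictured as a per-key merge. This is only a sketch of that rule, not the SDK's implementation. The load-time key names under `parakeetConfig.streaming*` are not shown on this page, so the sketch reuses the per-call names for both sides, and `resolveStreamingKnobs` is a hypothetical helper.

```ts
// Key names mirror the per-call example; reusing them for the load-time
// side is an assumption made for illustration.
interface StreamingKnobs {
  chunkMs: number;
  historyMs: number;
  leftContextMs: number;
  rightLookaheadMs: number;
  emitPartials: boolean;
  emitEnergyVad: boolean;
}

// Hypothetical helper: each knob falls back to its load-time value.
// `??` is used so that explicit `0` or `false` overrides are kept.
function resolveStreamingKnobs(
  loadTime: StreamingKnobs,
  perCall: Partial<StreamingKnobs> = {},
): StreamingKnobs {
  return {
    chunkMs: perCall.chunkMs ?? loadTime.chunkMs,
    historyMs: perCall.historyMs ?? loadTime.historyMs,
    leftContextMs: perCall.leftContextMs ?? loadTime.leftContextMs,
    rightLookaheadMs: perCall.rightLookaheadMs ?? loadTime.rightLookaheadMs,
    emitPartials: perCall.emitPartials ?? loadTime.emitPartials,
    emitEnergyVad: perCall.emitEnergyVad ?? loadTime.emitEnergyVad,
  };
}
```

Using `??` rather than `||` matters here: a per-call `emitPartials: false` should override a load-time `true` instead of being treated as missing.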

The synthetic `{ type: "endOfTurn", source: "parakeet" }` event surfaces whenever the EOU model emits an `<EOU>` token, and is the parakeet equivalent of whisper's silence-window end-of-turn detection. Pair it with the `PARAKEET_EOU_120M_V1_Q8_0` checkpoint when you need explicit turn boundaries from parakeet.
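
One way to use these turn boundaries is to fold `text` events into per-turn strings. The sketch below assumes local stand-in types covering only the `text` and `endOfTurn` events from the list above, and a hypothetical pure helper `applyEvent` that would be called once per event inside the `for await` loop.

```ts
// Local stand-ins for a subset of the event union (assumed, not SDK imports).
type StreamEvent =
  | { type: "text"; text: string }
  | { type: "endOfTurn"; source: "whisper"; silenceDurationMs: number }
  | { type: "endOfTurn"; source: "parakeet" };

interface TurnState {
  current: string; // text accumulated since the last turn boundary
  completed: string[]; // finished turns, oldest first
}

// Hypothetical helper: folds one event into the state without mutating it.
function applyEvent(state: TurnState, event: StreamEvent): TurnState {
  switch (event.type) {
    case "text":
      return { ...state, current: state.current + event.text };
    case "endOfTurn": {
      const turn = state.current.trim();
      // Ignore boundaries that arrive before any text.
      return turn.length === 0
        ? { ...state, current: "" }
        : { current: "", completed: [...state.completed, turn] };
    }
  }
}
```

Keeping the reducer free of I/O means the same logic works for either engine, since both report a turn boundary through `endOfTurn`.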

## Examples

### `qvac-ext-lib-whisper.cpp`