Merged
Changes from 12 commits
6 changes: 6 additions & 0 deletions .gitignore
@@ -15,6 +15,12 @@ packages/sdk/bun.lock
NOTICE_LOG.txt
NOTICE_FULL_REPORT.txt

# Local TTS / transcription example output artifacts (generated by
# packages/sdk/examples/**). Never commit these; they're large WAVs
# produced by running the demos locally.
packages/sdk/*-output.wav
packages/sdk/examples/**/*-output.wav

# Slack/Discord copy-paste announcement posts generated by the
# changelog skill (see scripts/sdk/generate-changelog-sdk-pod.cjs
# --generate-announcement-post). These are local working artifacts,
83 changes: 74 additions & 9 deletions docs/website/content/docs/ai-capabilities/transcription.mdx
@@ -6,11 +6,11 @@ schemaType: HowTo

## Overview

Transcription uses your choice of either [`qvac-ext-lib-whisper.cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp) or [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) (via [ONNX Runtime](https://onnxruntime.ai)) as inference engine. Load a model using `modelType: "whisper"` for `qvac-ext-lib-whisper.cpp`, or `modelType: "parakeet"` for Parakeet. Parakeet supports multilingual transcription (TDT), english-only transcription (CTC), and speaker diarization (Sortformer).
Transcription uses your choice of either [`qvac-ext-lib-whisper.cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp) or [NVIDIA Parakeet](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) (via the GGML-based [`parakeet-cpp`](https://github.com/tetherto/qvac-ext-lib-whisper.cpp/tree/main/parakeet-cpp) engine) as the inference engine. Load a model using `modelType: "whisper"` for `qvac-ext-lib-whisper.cpp`, or `modelType: "parakeet"` for Parakeet. Parakeet supports multilingual transcription (TDT), English-only transcription (CTC), speaker diarization (Sortformer), and end-of-utterance detection (EOU) for duplex streaming.

Provide audio input as `audioChunk`, either as a file path (string) or an in-memory audio buffer.

`transcribe()` returns the full transcription as a single `string`. If you need partial results as they become available, use `transcribeStream()` to receive text chunks in real-time.
`transcribe()` returns the full transcription as a single `string`. If you need partial results as they become available, use `transcribeStream()` to receive text chunks in real-time. Both whisper and parakeet expose duplex `transcribeStream()` sessions; see "Streaming with `transcribeStream()`" below.

## Functions

@@ -31,23 +31,88 @@ You should load two models:

### Parakeet

Parakeet requires multiple model artifacts. The required files depend on the model variant:
- **TDT** (multilingual, ~25 languages): encoder, encoder data, decoder, vocabulary, and preprocessor files.
- **CTC** (english-only): model, model data, and tokenizer files.
- **Sortformer** (speaker diarization): a single model file.
As of `@qvac/transcription-parakeet` 0.6.0, Parakeet ships as a **single GGUF** per variant: the addon auto-detects TDT / CTC / Sortformer / EOU from `parakeet.model.type` GGUF metadata. There is no `modelConfig.modelType` discriminator, no per-variant `parakeet*Src` artifact fields, and no `ParakeetArtifactsRequiredError`. Just supply the GGUF via the top-level `modelSrc`:

Pass the model variant via `modelConfig.modelType` (`"tdt"`, `"ctc"`, or `"sortformer"`) and provide the corresponding source fields in `modelConfig`. See [`loadModel()` - Parakeet `modelConfig`](/reference/api#loadmodel).
```ts
await loadModel({
  modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0, // multilingual, ~750MB
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_CTC_0_6B_Q8_0, // english-only, streaming-capable
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_SORTFORMER_4SPK_V1_Q8_0, // 4-speaker diarization
  modelType: "parakeet",
});

await loadModel({
  modelSrc: PARAKEET_EOU_120M_V1_Q8_0, // end-of-utterance detection
  modelType: "parakeet",
});
```

For model artifacts available as constants, see [SDK - Models](/introduction#models).

<Callout type="success">
**Tip:** if you are not using SDK model constants, download the required files from a compatible model repository, e.g., [`parakeet-ctc-0.6b-ONNX` on Hugging Face](https://huggingface.co/onnx-community/parakeet-ctc-0.6b-ONNX/tree/main/onnx). For a list of compatible repositories, see [Addons - Parakeet](https://github.com/tetherto/qvac/tree/main/packages/transcription-parakeet#models).
<Callout type="info">
**Migrating from pre-0.6 Parakeet (ONNX multi-file):** the legacy multi-file ONNX `modelConfig` shape (`parakeetEncoderSrc` / `parakeetDecoderSrc` / `parakeetVocabSrc` / `parakeetPreprocessorSrc`, plus `parakeetCtcModelSrc` / `parakeetTokenizerSrc` and `parakeetSortformerSrc` for the CTC/Sortformer variants) is no longer supported. Passing any of those fields raises a structured `LegacyParakeetModelDeprecatedError` with a migration message. The legacy ONNX constants (e.g. `PARAKEET_TDT_ENCODER_INT8`, `PARAKEET_CTC_FP32`, `PARAKEET_SORTFORMER_FP32`) remain exported for one minor cycle for codemod migrations only and will be removed in a future release.
</Callout>
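
As a rough illustration of the migration note above, a pre-flight check for the legacy fields might look like the following sketch. `findLegacyParakeetFields` and `LEGACY_PARAKEET_FIELDS` are hypothetical names, not part of `@qvac/sdk`; only the field names themselves come from the migration note.

```typescript
// Hypothetical helper (not part of @qvac/sdk): reports legacy ONNX-era
// Parakeet fields so a caller can migrate before loadModel() rejects them.
const LEGACY_PARAKEET_FIELDS = [
  "parakeetEncoderSrc",
  "parakeetDecoderSrc",
  "parakeetVocabSrc",
  "parakeetPreprocessorSrc",
  "parakeetCtcModelSrc",
  "parakeetTokenizerSrc",
  "parakeetSortformerSrc",
] as const;

type LegacyParakeetField = (typeof LEGACY_PARAKEET_FIELDS)[number];

function findLegacyParakeetFields(
  modelConfig: Record<string, unknown> | undefined,
): LegacyParakeetField[] {
  if (!modelConfig) return [];
  // Keep only the legacy keys that are actually present on the config object.
  return LEGACY_PARAKEET_FIELDS.filter((field) => field in modelConfig);
}

// Example: a pre-0.6 CTC config that would need migrating.
const legacyFields = findLegacyParakeetFields({
  modelType: "ctc",
  parakeetCtcModelSrc: "model.onnx",
  parakeetTokenizerSrc: "tokenizer.json",
});
console.log(legacyFields);
```

An empty result suggests the config is already in the single-GGUF shape; a non-empty one lists the fields to drop in favor of the top-level `modelSrc`.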

<Callout type="info">
**On VAD:** when using `qvac-ext-lib-whisper.cpp`, you can optionally provide a separate model for voice activity detection (VAD); this is recommended. In turn, Parakeet handles VAD internally, so no additional model or configuration is required.
</Callout>

## Streaming with `transcribeStream()`

`transcribeStream()` opens a duplex session for both engines: write audio chunks via `session.write(...)` and iterate events with `for await (const event of session) { ... }`. Events are typed as a discriminated union `{ type }`:

- `{ type: "text", text }`: incremental transcript text.
- `{ type: "segment", segment }`: segment metadata (whisper-only when `metadata: true`).
- `{ type: "vad", speaking, probability }`: voice-activity-detection state (whisper-only).
- `{ type: "endOfTurn", source: "whisper", silenceDurationMs }`: turn boundary detected from a measured silence window (whisper).
- `{ type: "endOfTurn", source: "parakeet" }`: turn boundary detected from the EOU model's `<EOU>` token (parakeet; there is no silence window because the event is token-driven).

The `source` field on `endOfTurn` lets consumers narrow the union: whisper events always carry a numeric `silenceDurationMs`; parakeet events never do.
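
The narrowing described above can be sketched as follows. The type names (`WhisperEndOfTurn`, `ParakeetEndOfTurn`, `EndOfTurnEvent`) and the `describeEndOfTurn` helper are local stand-ins assumed for illustration; only the event shapes are taken from the list above.

```typescript
// Local stand-ins for the SDK event types. The shapes mirror the documented
// endOfTurn events, but the type names are assumptions for this sketch.
type WhisperEndOfTurn = {
  type: "endOfTurn";
  source: "whisper";
  silenceDurationMs: number;
};
type ParakeetEndOfTurn = { type: "endOfTurn"; source: "parakeet" };
type EndOfTurnEvent = WhisperEndOfTurn | ParakeetEndOfTurn;

function describeEndOfTurn(event: EndOfTurnEvent): string {
  switch (event.source) {
    case "whisper":
      // Narrowed to WhisperEndOfTurn: silenceDurationMs is a number here.
      return `whisper turn ended after ${event.silenceDurationMs} ms of silence`;
    case "parakeet":
      // Narrowed to ParakeetEndOfTurn: there is no silenceDurationMs to read.
      return "parakeet turn ended on an <EOU> token";
  }
}

console.log(
  describeEndOfTurn({ type: "endOfTurn", source: "whisper", silenceDurationMs: 640 }),
);
console.log(describeEndOfTurn({ type: "endOfTurn", source: "parakeet" }));
```

Switching on `source` rather than probing for `silenceDurationMs` keeps the check exhaustive, so the compiler can flag a missing branch if another source is ever added to the union.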

<Callout type="info">
**Wire compatibility:** post-0.6 servers emit `source` on every `endOfTurn` frame. SDK parsers still accept the legacy whisper wire shape `{ silenceDurationMs }` (no `source`) and normalize it to `source: "whisper"`. Upgrade client and server together when using parakeet `source: "parakeet"` events, because older servers never emit that branch.
</Callout>
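
The normalization described in the compatibility note can be pictured with a small sketch. This is not the SDK's parser: `WireEndOfTurn`, `NormalizedEndOfTurn`, and `normalizeEndOfTurn` are hypothetical names, and the wire shapes are inferred from the note rather than copied from the SDK schema.

```typescript
// Assumed wire shapes, inferred from the compatibility note (not the SDK schema).
type WireEndOfTurn =
  | { source?: undefined; silenceDurationMs: number } // legacy whisper frame
  | { source: "whisper"; silenceDurationMs: number } // post-0.6 whisper frame
  | { source: "parakeet" }; // post-0.6 parakeet frame

type NormalizedEndOfTurn =
  | { type: "endOfTurn"; source: "whisper"; silenceDurationMs: number }
  | { type: "endOfTurn"; source: "parakeet" };

// Hypothetical normalizer: maps every accepted wire frame to a typed event.
function normalizeEndOfTurn(wire: WireEndOfTurn): NormalizedEndOfTurn {
  if (wire.source === "parakeet") {
    return { type: "endOfTurn", source: "parakeet" };
  }
  // Both the legacy frame (no source) and the post-0.6 whisper frame carry
  // silenceDurationMs, so they collapse into the same whisper event.
  return {
    type: "endOfTurn",
    source: "whisper",
    silenceDurationMs: wire.silenceDurationMs,
  };
}

console.log(normalizeEndOfTurn({ silenceDurationMs: 500 }));
console.log(normalizeEndOfTurn({ source: "parakeet" }));
```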

### Parakeet duplex streaming

Pass `parakeetStreamingConfig` to `transcribeStream()` to override per-call streaming knobs (each falls back to its `parakeetConfig.streaming*` load-time counterpart):

```ts
const session = await transcribeStream({
  modelId,
  parakeetStreamingConfig: {
    chunkMs: 1000, // encoder cadence
    historyMs: 30000, // sortformer rolling-history window
    leftContextMs: 500, // ASR encoder left-context window
    rightLookaheadMs: 200, // ASR encoder right-lookahead window
    emitPartials: true, // emit partial segments before chunk boundaries
    emitEnergyVad: false, // CTC/TDT energy-based VAD hint (engine-internal)
  },
});

for await (const event of session) {
  switch (event.type) {
    case "text":
      process.stdout.write(event.text);
      break;
    case "endOfTurn":
      // event.source: "whisper" | "parakeet"
      console.log("\n[endOfTurn] turn boundary detected\n");
      break;
  }
}
```

The synthetic `{ type: "endOfTurn", source: "parakeet" }` event surfaces whenever the EOU model emits an `<EOU>` token, and is the parakeet equivalent of whisper's silence-window EOU. Pair it with the `PARAKEET_EOU_120M_V1_Q8_0` checkpoint when you need explicit turn boundaries from parakeet.
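
One way to consume these turn boundaries is to buffer `text` events and flush the buffer on each `endOfTurn`. The sketch below assumes the event shapes listed earlier; `StreamEvent`, `collectTurns`, and `fakeSession` are hypothetical names, and `fakeSession` stands in for a real `transcribeStream()` session so the example runs without a model.

```typescript
// Assumed subset of the stream event union, mirroring the documented shapes.
type StreamEvent =
  | { type: "text"; text: string }
  | { type: "endOfTurn"; source: "whisper"; silenceDurationMs: number }
  | { type: "endOfTurn"; source: "parakeet" };

// Hypothetical helper: buffers text events and yields one string per turn.
async function* collectTurns(
  events: AsyncIterable<StreamEvent>,
): AsyncGenerator<string> {
  let buffer = "";
  for await (const event of events) {
    if (event.type === "text") {
      buffer += event.text;
    } else {
      // endOfTurn: emit whatever was buffered, skipping empty turns.
      const turn = buffer.trim();
      if (turn) yield turn;
      buffer = "";
    }
  }
  // Flush trailing text if the stream ends without a final endOfTurn.
  const rest = buffer.trim();
  if (rest) yield rest;
}

// Stand-in for a real session so the sketch runs without loading a model.
async function* fakeSession(): AsyncGenerator<StreamEvent> {
  yield { type: "text", text: "hello " };
  yield { type: "text", text: "world" };
  yield { type: "endOfTurn", source: "parakeet" };
  yield { type: "text", text: "second turn" };
}

async function main(): Promise<void> {
  for await (const turn of collectTurns(fakeSession())) {
    console.log(`[turn] ${turn}`);
  }
}

void main();
```

Because the helper only switches on `type`, the same loop should work for whisper sessions, where the boundary comes from a silence window instead of an `<EOU>` token.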

## Examples

### `qvac-ext-lib-whisper.cpp`
2 changes: 1 addition & 1 deletion docs/website/content/docs/reference/api/index.mdx
Comment thread
GustavoA1604 marked this conversation as resolved.
@@ -1518,7 +1518,7 @@ and read `error.code` / `error.cause`. Code ranges:
| `VAD_MODEL_REQUIRED` | 52205 | VAD model source is required for this configuration |
| `TTS_ARTIFACTS_REQUIRED` | 52208 | TTS (Chatterbox) requires ttsTokenizerSrc, ttsSpeechEncoderSrc, ttsEmbedTokensSrc, ttsConditionalDecoderSrc, and ttsLanguageModelSrc |
| `TTS_REFERENCE_AUDIO_REQUIRED` | 52209 | TTS (Chatterbox) requires referenceAudioSrc (path or URL to a WAV file for voice cloning) |
| `PARAKEET_ARTIFACTS_REQUIRED` | 52210 | Parakeet model sources are missing. TDT requires parakeetEncoderSrc, parakeetDecoderSrc, parakeetVocabSrc, parakeetPreprocessorSrc. CTC requires parakeetCtcModelSrc, parakeetTokenizerSrc. Sortformer requires parakeetSortformerSrc. |
| `LEGACY_PARAKEET_MODEL_DEPRECATED` | 52210 | Legacy parakeet ONNX modelConfig fields are no longer supported. As of `@qvac/transcription-parakeet` 0.6.0 the addon ships as a single GGUF that auto-detects TDT / CTC / EOU / Sortformer from GGUF metadata. Supply the GGUF via the top-level `modelSrc` (e.g. `loadModel({ modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0, modelType: "parakeet" })`). |
| `MODEL_UNLOAD_FAILED` | 52400 | Failed to unload model… |
| `EMBED_FAILED` | 52401 | Failed to generate embeddings… |
| `EMBED_NO_EMBEDDINGS` | 52402 | No embeddings returned from model |
25 changes: 23 additions & 2 deletions packages/sdk/client/api/transcribe.ts
@@ -168,6 +168,14 @@ export function transcribeStream(
params: TranscribeStreamClientParams & { emitVadEvents: true },
options?: RPCOptions,
): Promise<TranscribeStreamConversationSession>;
export function transcribeStream(
params: TranscribeStreamClientParams & {
parakeetStreamingConfig: NonNullable<
TranscribeStreamClientParams["parakeetStreamingConfig"]
>;
},
options?: RPCOptions,
): Promise<TranscribeStreamConversationSession>;
export function transcribeStream(
params: TranscribeStreamClientParams & { metadata: true },
options?: RPCOptions,
@@ -196,7 +204,10 @@ export function transcribeStream(
return transcribeStreamWithAudio(params, options);
}
const streamParams = params as TranscribeStreamClientParams;
if (streamParams.emitVadEvents === true) {
if (
streamParams.emitVadEvents === true ||
streamParams.parakeetStreamingConfig !== undefined
) {
return transcribeStreamDuplexConversation(streamParams, options);
}
if (streamParams.metadata === true) {
@@ -257,6 +268,9 @@ function buildTranscribeStreamRequest(
...(params.vadRunIntervalMs !== undefined && {
vadRunIntervalMs: params.vadRunIntervalMs,
}),
...(params.parakeetStreamingConfig && {
parakeetStreamingConfig: params.parakeetStreamingConfig,
}),
};
}

@@ -435,9 +449,16 @@ function processLineConversation(
};
}
if (response.endOfTurn) {
if (response.endOfTurn.source === "whisper") {
return {
type: "endOfTurn",
source: "whisper",
silenceDurationMs: response.endOfTurn.silenceDurationMs,
};
}
return {
type: "endOfTurn",
silenceDurationMs: response.endOfTurn.silenceDurationMs,
source: "parakeet",
};
}
if (wantsMetadata) {
31 changes: 18 additions & 13 deletions packages/sdk/examples/transcription/parakeet-ctc-filesystem.ts
@@ -1,36 +1,41 @@
/**
* Parakeet CTC transcription from a WAV file.
*
* Usage:
* bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> [parakeet-ctc-gguf]
*
* Loads a single GGUF checkpoint (`PARAKEET_CTC_0_6B_Q8_0` by default) and
* transcribes the file with the batch `transcribe` API. Omit the model
* argument to use the registry constant.
*
* Audio should be 16 kHz mono PCM in a WAV container.
*/
import {
loadModel,
unloadModel,
transcribe,
PARAKEET_CTC_FP32,
PARAKEET_CTC_TOKENIZER,
PARAKEET_CTC_0_6B_Q8_0,
} from "@qvac/sdk";

const args = process.argv.slice(2);

if (!args[0]) {
console.error(
"Usage: bun run examples/transcription/parakeet-ctc-filesystem.ts <wav-file> " +
"[model.onnx] [tokenizer.json]",
"[parakeet-ctc-gguf]",
);
console.error("\nIf model paths are omitted, defaults to registry models.");
console.error("\nIf the model path is omitted, defaults to the registry model.");
process.exit(1);
}

const audioFilePath = args[0];
const parakeetCtcModelSrc = args[1] ?? PARAKEET_CTC_FP32;
const parakeetTokenizerSrc = args[2] ?? PARAKEET_CTC_TOKENIZER;
const parakeetModelSrc = args[1] ?? PARAKEET_CTC_0_6B_Q8_0;

try {
console.log("Loading Parakeet CTC model...");
const modelId = await loadModel({
modelSrc: parakeetCtcModelSrc,
modelSrc: parakeetModelSrc,
modelType: "parakeet",
modelConfig: {
modelType: "ctc",
parakeetCtcModelSrc,
parakeetTokenizerSrc,
},
onProgress: (progress) => {
console.log(`Download progress: ${progress.percentage.toFixed(1)}%`);
},
@@ -48,6 +53,6 @@ try {
await unloadModel({ modelId });
console.log("Done");
} catch (error) {
console.error("Error:", error);
console.error("❌ Error:", error);
process.exit(1);
}
@@ -1,21 +1,19 @@
/**
* Microphone → Parakeet transcription using chunked `transcribe` calls.
* Microphone → Parakeet batch transcription (chunked `transcribe`).
*
* Usage: bun run examples/transcription/parakeet-microphone-record.ts
* Usage:
* bun run examples/transcription/parakeet-microphone-record.ts
*
* Captures 3-second audio chunks from the microphone and sends each to the
* batch `transcribe` API. Press Ctrl+C to quit.
* Captures 3 s s16le chunks from the microphone and sends each to `transcribe`
* with the TDT model. Press Ctrl+C to stop.
*
* Requirements: FFmpeg installed, microphone access.
*/
import {
loadModel,
unloadModel,
transcribe,
PARAKEET_TDT_ENCODER_FP32,
PARAKEET_TDT_DECODER_FP32,
PARAKEET_TDT_VOCAB,
PARAKEET_TDT_PREPROCESSOR_FP32,
PARAKEET_TDT_0_6B_V3_Q8_0,
} from "@qvac/sdk";
import { spawnSync } from "child_process";
import { startMicrophone } from "../audio/mic-input";
@@ -37,14 +35,8 @@ try {

console.log("Loading Parakeet model...");
const modelId = await loadModel({
modelSrc: PARAKEET_TDT_ENCODER_FP32,
modelSrc: PARAKEET_TDT_0_6B_V3_Q8_0,
modelType: "parakeet",
modelConfig: {
parakeetEncoderSrc: PARAKEET_TDT_ENCODER_FP32,
parakeetDecoderSrc: PARAKEET_TDT_DECODER_FP32,
parakeetVocabSrc: PARAKEET_TDT_VOCAB,
parakeetPreprocessorSrc: PARAKEET_TDT_PREPROCESSOR_FP32,
},
onProgress: (p) => console.log(`Download: ${p.percentage.toFixed(1)}%`),
});
console.log("Model loaded.\n");
@@ -0,0 +1,93 @@
/**
* Microphone → Parakeet duplex streaming (`transcribeStream`).
*
* Usage:
* bun run examples/transcription/parakeet-microphone-stream.ts
*
* Streams microphone audio through `transcribeStream` with
* `parakeetStreamingConfig`. Uses the EOU checkpoint so you may see
* `{ type: "endOfTurn", source: "parakeet" }` events; CTC/TDT models
* emit transcript text only. Parakeet does not yield standalone VAD events.
*
* Requirements: FFmpeg installed, microphone access.
*/
import {
loadModel,
unloadModel,
transcribeStream,
PARAKEET_EOU_120M_V1_Q8_0,
} from "@qvac/sdk";
import { spawnSync } from "child_process";
import { startMicrophone } from "../audio/mic-input";

const SAMPLE_RATE = 16000;

try {
const r = spawnSync("ffmpeg", ["-version"], { stdio: "ignore" });
if (r.error || r.status !== 0) throw new Error("FFmpeg not found");
} catch {
console.error("Error: FFmpeg is required. Install it and try again.");
process.exit(1);
}

let modelId: string | null = null;
let ffmpeg: ReturnType<typeof startMicrophone> | null = null;

async function cleanup() {
console.log("\n\nStopping...");
ffmpeg?.kill();
if (modelId) await unloadModel({ modelId });
console.log("Done.");
}

process.on("SIGINT", () => {
void cleanup().finally(() => process.exit(0));
});
process.on("SIGTERM", () => {
void cleanup().finally(() => process.exit(0));
});

try {
console.log("Loading Parakeet (EOU) streaming model...");
modelId = await loadModel({
modelSrc: PARAKEET_EOU_120M_V1_Q8_0,
modelType: "parakeet",
onProgress: (p) => console.log(`Download: ${p.percentage.toFixed(1)}%`),
});
console.log("Model loaded.\n");

ffmpeg = startMicrophone({ sampleRate: SAMPLE_RATE, format: "s16le" });

const session = await transcribeStream({
modelId,
parakeetStreamingConfig: {
chunkMs: 1000,
emitPartials: true,
},
});

ffmpeg.stdout.on("data", (chunk: Buffer) => session.write(chunk));

console.log(
"Listening... speak and pause to see transcripts. End-of-turn boundaries fire when the EOU model emits an <EOU> token.\n",
);

for await (const event of session) {
switch (event.type) {
case "text":
if (event.text.trim()) {
process.stdout.write(`${event.text}`);
}
break;
case "endOfTurn":
console.log("\n[endOfTurn] turn boundary detected\n");
break;
}
}
await cleanup();
process.exit(0);
} catch (error) {
console.error("Error:", error);
await cleanup();
process.exit(1);
}