18 changes: 18 additions & 0 deletions packages/transcription-parakeet/CHANGELOG.md
@@ -5,6 +5,24 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.5.0]

In this release we expose the v2.1 streaming Sortformer model with the NeMo-port AOSC (Arrival-Order Speaker Cache) through the addon's public API. AOSC anchors each speaker to a stable cache slot across silence and re-entry, fixing the speaker-slot drift that v1's sliding-window streaming exhibits once two voices have been seen (its per-chunk decisions are permutation-invariant). v2.1 becomes the recommended streaming Sortformer; v1 stays the offline-batch default. Six new optional config knobs surface the cache geometry for tuning and A/B comparison; defaults mirror parakeet-cpp's NeMo-port tuning, so a bare `streaming: true` against a v2.1 GGUF works without further configuration.

### Added
- **AOSC config knobs.** `ParakeetConfig` gains six optional fields — `streamingSpkCacheEnable` (default `true`), `streamingSpkCacheLen` (188), `streamingFifoLen` (188), `streamingChunkLeftContextMs` (80), `streamingChunkRightContextMs` (560), `streamingSpkCacheUpdatePeriod` (144) — forwarded into `parakeet::SortformerStreamingOptions` for both the in-process Mode-3 streaming path (`ParakeetModel::runStreamingProcess_`) and the duplex `runStreaming()` processor (`ParakeetStreamingProcessor`). Mirrored as per-call overrides on `StreamingRunConfig` (`spkCacheEnable`, `spkCacheLen`, `fifoLen`, `chunkLeftContextMs`, `chunkRightContextMs`, `spkCacheUpdatePeriod`). parakeet-cpp ignores these on v1 / v2 Sortformer GGUFs and on non-Sortformer engines, so always-forward is safe.
- **v2.1 Sortformer auto-detection.** When a `diar_streaming_sortformer_4spk-v2.1.*` GGUF is loaded, parakeet-cpp's engine recognises it from the GGUF metadata tag `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"` and enables AOSC by default. Setting `streamingSpkCacheEnable: false` forces the v1 sliding-window code path on a v2.1 model (A/B comparison).
- **`examples/live-mic-diarized-aosc.js`** — v2.1-focused dual-stream live mic example mirroring `live-mic-diarized.js`'s ASR + Sortformer pattern, with CLI flags for every AOSC knob (`--spk-cache-enable`, `--spk-cache-len`, `--fifo-len`, `--chunk-left-context-ms`, `--chunk-right-context-ms`, `--spk-cache-update-period`).
- **`test/integration/sortformer-aosc-streaming.test.js`** — covers default-AOSC streaming and `streamingSpkCacheEnable=false` fallback. The full AOSC slot-stability contract (same physical speaker → same `Speaker N` tag across non-contiguous re-entries) is verified at C++ level in `parakeet-cpp/test/test_sortformer_aosc_speakers.cpp`; this JS-level test focuses on wiring correctness — that the override actually reaches the engine and the engine emits well-formed segments in both modes.
- **`MODEL_CONFIGS.sortformerStreaming`** entry in `test/integration/helpers.js` pointing at `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. Tests skip cleanly when the GGUF isn't staged via `npm run setup-models` / `QVAC_TEST_GGUF_*`.
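
The naming relationship between the load-time `ParakeetConfig` fields and the per-call `StreamingRunConfig` overrides listed above can be sketched as follows. `toRunOverrides` is a hypothetical helper written for illustration only (it is not part of the addon); it simply shows that each override name is the config field with the `streaming` prefix dropped.

```javascript
// Load-time ParakeetConfig field names, as listed in this changelog entry.
const AOSC_CONFIG_FIELDS = [
  'streamingSpkCacheEnable',
  'streamingSpkCacheLen',
  'streamingFifoLen',
  'streamingChunkLeftContextMs',
  'streamingChunkRightContextMs',
  'streamingSpkCacheUpdatePeriod'
];

// Hypothetical helper (not part of the addon): converts load-time field
// names into the matching per-call override names.
function toRunOverrides(parakeetConfig) {
  const overrides = {};
  for (const field of AOSC_CONFIG_FIELDS) {
    if (parakeetConfig[field] === undefined) continue;
    // 'streamingSpkCacheLen' -> 'SpkCacheLen' -> 'spkCacheLen'
    const bare = field.slice('streaming'.length);
    overrides[bare[0].toLowerCase() + bare.slice(1)] = parakeetConfig[field];
  }
  return overrides;
}

console.log(toRunOverrides({ streamingSpkCacheLen: 256, streamingFifoLen: 188 }));
```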

### Changed
- **parakeet-cpp dep bumped** to `version>= 2026-05-20` (was `2026-05-05#1`) across all three platform branches in `vcpkg.json`. The new port (qvac-registry-vcpkg PR #156) pulls in PRs #22 + #24 of `qvac-ext-lib-whisper.cpp`, which introduce the v2.1 Sortformer support, AOSC engine implementation, strict variant detection via the `parakeet.model_variant` GGUF tag, and review-fixup cleanups (magic-number elimination, dead-code removal, test utility consolidation, Windows `<algorithm>` include).
- **`index.js::_buildConfigurationParams()`** now forwards the 6 new AOSC fields (and explicit defaults for unset values) into `createInstance` / `reload`. Without this, JSDoc + native plumbing would exist but JS-layer overrides would never reach C++.
- **`examples/live-mic-diarized.js`** header: recommends the v2.1 GGUF as `--diar-model` and notes that `streamingHistoryMs` is superseded by AOSC on v2.1 models (kept for v1 back-compat). Points to the new `live-mic-diarized-aosc.js` for explicit knob control.
- **`examples/diarized-transcribe.js`** header: notes v1 remains the recommended OFFLINE diarization model — AOSC's slot-stability benefit only applies to continuous streaming and is wasted in batch mode.
- **`README.md`** — extended Model Variants table with v1 (offline default) and v2.1 + AOSC (streaming default) rows; new `streamingSpkCache*` rows in the ParakeetConfig table; dedicated "Sortformer Streaming Diarization (v2.1 + AOSC)" section explaining the v1-drift problem AOSC solves, the model-variant auto-detection, and when to leave the defaults alone.
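
As a rough illustration of the "explicit defaults for unset values" behaviour described for `_buildConfigurationParams()`, the sketch below fills in the documented defaults for any AOSC field the caller leaves out. `withAoscDefaults` is a hypothetical stand-in, not the actual implementation in `index.js`.

```javascript
// Defaults as documented in this changelog entry.
const AOSC_DEFAULTS = Object.freeze({
  streamingSpkCacheEnable: true,
  streamingSpkCacheLen: 188,
  streamingFifoLen: 188,
  streamingChunkLeftContextMs: 80,
  streamingChunkRightContextMs: 560,
  streamingSpkCacheUpdatePeriod: 144
});

// Hypothetical stand-in for the forwarding step: every AOSC field is always
// present in the result, using the caller's value when one was given.
function withAoscDefaults(userConfig = {}) {
  const params = {};
  for (const [key, fallback] of Object.entries(AOSC_DEFAULTS)) {
    // `??` keeps an explicit `false` or `0` from the caller.
    params[key] = userConfig[key] ?? fallback;
  }
  return params;
}

console.log(withAoscDefaults({ streamingSpkCacheEnable: false }));
```

Using `??` rather than `||` matters here: `streamingSpkCacheEnable: false` is a meaningful value (it forces the v1 path) and must not be replaced by the default.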

## [0.4.0]

In this release, we have replaced the onnxruntime backend with a pure C++/ggml engine, added a duplex-streaming entry point that bypasses the framework's batch-then-process lifecycle for live use cases, and surfaced two new per-segment signals (`isEndOfTurn`, `startsWord`) so consumers can build cleaner live transcripts. The release also exposes per-engine backend stats (`backendDevice`, `backendId`) so callers can verify the GPU path actually engaged, and consolidates the examples / docs / mock fixtures into a single duplex-aware surface.
53 changes: 49 additions & 4 deletions packages/transcription-parakeet/README.md
@@ -214,9 +214,46 @@ Most users interact with the package through `index.js`. From that entrypoint we
| | `streamingEnergyVad` | CTC/TDT energy-VAD events (default: `false`) |
| | `streamingLeftContextMs` | ASR encoder left-context window in ms; `-1` keeps parakeet-cpp's default of 10000. ASR sessions only (Sortformer ignores it). |
| | `streamingRightLookaheadMs` | ASR encoder right-lookahead window in ms; `-1` keeps parakeet-cpp's default of 2000. Adds directly to the per-segment latency floor (`chunk_ms + right_lookahead_ms`). ASR sessions only. |
| | `streamingSpkCacheEnable` | AOSC: enable v2.1 Sortformer's speaker-cache streaming (default: `true`). Ignored on v1/v2 Sortformer GGUFs and on non-Sortformer models. Set `false` to force a v2.1 GGUF onto the v1 sliding-window path (A/B comparison). |
| | `streamingSpkCacheLen` | AOSC: long-term speaker-cache rows (~15 s of encoder frames). Default: 188. |
| | `streamingFifoLen` | AOSC: FIFO warmup buffer rows. Default: 188. |
| | `streamingChunkLeftContextMs` | AOSC: encoder left-context window (ms; ~1 encoder frame). Default: 80. |
| | `streamingChunkRightContextMs` | AOSC: encoder right-context window (ms; ~7 encoder frames). Default: 560. |
| | `streamingSpkCacheUpdatePeriod` | AOSC: FIFO-overflow pop-out count. Default: 144. |

The model type (CTC / TDT / EOU / Sortformer) is **auto-detected from the GGUF metadata**, so callers don't need to pass `modelType`. Other knobs (`captionEnabled`, `timestampsEnabled`, `seed`, `sampleRate`, `channels`) keep sensible defaults.
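
The frame counts quoted in the AOSC rows can be checked with a little arithmetic. The sketch below assumes an encoder frame of 80 ms, which is inferred from the table's own "80 ms is about 1 encoder frame" note rather than taken from parakeet-cpp; the helper names are illustrative.

```javascript
// Assumption: one encoder frame covers 80 ms (inferred from the table rows).
const ENCODER_FRAME_MS = 80;

// Duration covered by a cache or FIFO of `rows` encoder frames, in seconds.
function rowsToSeconds(rows) {
  return (rows * ENCODER_FRAME_MS) / 1000;
}

// Number of encoder frames spanned by a context window of `ms` milliseconds.
function msToFrames(ms) {
  return ms / ENCODER_FRAME_MS;
}

console.log(rowsToSeconds(188)); // default cache length: about 15 s
console.log(msToFrames(80));     // default left context: 1 frame
console.log(msToFrames(560));    // default right context: 7 frames
```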

**Sortformer Streaming Diarization (v2.1 + AOSC).** parakeet-cpp ships
two streaming-diarization paths picked automatically by the GGUF:

- **v1** uses a fixed-size sliding-history window inside the engine.
Once two voices have been seen, the per-chunk decisions are
permutation-invariant; if a speaker goes silent long enough to roll
out of the window, the slot can drift onto a different physical voice
when they return. Fine for short, stable clips; ships as
`sortformer-4spk-v1.q8_0.gguf`.
- **v2.1** replaces the sliding window with AOSC (Arrival-Order Speaker
Cache, ported from NVIDIA NeMo), which anchors each slot to its
accumulated embedding. The same physical speaker comes back to the same
`Speaker N` tag across silences. Default for live capture; ships as
`diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. The engine detects
v2.1 via the GGUF metadata tag
`parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`; you
don't need to opt in via config.
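
The selection rule described in the two bullets above can be modelled roughly as follows. This is a simplified sketch, not the engine's code: the real check lives inside parakeet-cpp, and `usesAosc` plus the plain-object metadata shape are assumptions made for illustration.

```javascript
// Tag value quoted in this README section.
const AOSC_VARIANT_TAG = 'sortformer-streaming-v2.1-aosc';

// Hypothetical model of the decision: AOSC runs only when the GGUF carries
// the v2.1 tag AND the caller has not disabled the speaker cache.
function usesAosc(ggufMetadata, config = {}) {
  const isV21 = ggufMetadata['parakeet.model_variant'] === AOSC_VARIANT_TAG;
  const enabled = config.streamingSpkCacheEnable ?? true; // default: true
  return isV21 && enabled;
}

const v21 = { 'parakeet.model_variant': AOSC_VARIANT_TAG };
console.log(usesAosc(v21));                                     // AOSC path
console.log(usesAosc(v21, { streamingSpkCacheEnable: false })); // forced v1 path
console.log(usesAosc({}));                                      // untagged GGUF: knob ignored
```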

The defaults in the `streamingSpkCache*` / `streamingFifo*` /
`streamingChunk{Left,Right}ContextMs` table rows above are the NeMo-port
tuning parakeet-cpp ships -- you almost always want to keep them. The
knobs are exposed for A/B comparison (e.g. `--spk-cache-enable false`
in `examples/live-mic-diarized-aosc.js` to force a v2.1 GGUF onto the
v1 path) and for tuning unusual audio (longer cache, larger
right-context window for higher latency tolerance, etc.).
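
When overriding the knobs per call, out-of-range values are dropped rather than rejected. The sketch below mirrors, in JavaScript, the checks visible in this PR's native `startStreaming` changes (lengths and update period must be positive, context windows must be non-negative). `applyRunOverrides` is a hypothetical helper for reasoning about the behaviour, not an exported function.

```javascript
const RUN_DEFAULTS = Object.freeze({
  spkCacheEnable: true,
  spkCacheLen: 188,
  fifoLen: 188,
  chunkLeftContextMs: 80,
  chunkRightContextMs: 560,
  spkCacheUpdatePeriod: 144
});

// Hypothetical helper: applies per-call overrides on top of `base`,
// silently keeping the base value when an override is out of range.
function applyRunOverrides(base, overrides = {}) {
  const out = { ...base };
  if (typeof overrides.spkCacheEnable === 'boolean') {
    out.spkCacheEnable = overrides.spkCacheEnable;
  }
  // Row counts and the update period must be strictly positive.
  for (const key of ['spkCacheLen', 'fifoLen', 'spkCacheUpdatePeriod']) {
    if (typeof overrides[key] !== 'number') continue;
    const v = Math.trunc(overrides[key]); // native side truncates to int
    if (v > 0) out[key] = v;
  }
  // Context windows may be zero but not negative.
  for (const key of ['chunkLeftContextMs', 'chunkRightContextMs']) {
    if (typeof overrides[key] !== 'number') continue;
    const v = Math.trunc(overrides[key]);
    if (v >= 0) out[key] = v;
  }
  return out;
}

console.log(applyRunOverrides(RUN_DEFAULTS, { spkCacheLen: 256, fifoLen: 0 }));
```

In this sketch `fifoLen: 0` is ignored and the default of 188 stays in place, while `chunkLeftContextMs: 0` would be accepted.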

For offline diarization (single batch over a finite clip) v1 remains
the recommended GGUF -- AOSC's slot-stability benefit only applies to
continuous streaming and offers no measurable improvement when the
entire clip is available at once.

#### Configuration Example

```javascript
@@ -408,10 +445,16 @@ bare examples/diarized-transcribe.js \
# Live mic transcription
bare examples/live-mic.js --model models/parakeet-eou-120m-v1.q8_0.gguf --accumulate

# Live mic + speaker tagging
# Live mic + speaker tagging (recommended: v2.1 diar GGUF, AOSC auto-on)
bare examples/live-mic-diarized.js \
--asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \
--diar-model models/sortformer-4spk-v1.q8_0.gguf --accumulate
--diar-model models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf --accumulate

# Same as above, with explicit AOSC tuning knobs exposed as CLI flags
bare examples/live-mic-diarized-aosc.js \
--asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \
--diar-model models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf \
--spk-cache-len 256 --chunk-right-context-ms 480 --accumulate
```

> If you use `npm run example:* -- ...` instead of `bare`, remember the `--` separator -- without it npm interprets `--model` as one of its own config flags.
@@ -425,14 +468,16 @@ The live-mic examples capture the default input device via `sox -d` (install: `b
| **CTC** | English | argmax CTC | ~ 700 MiB | Fast, no PnC. |
| **TDT** | ~25 | RNN-T greedy + duration | ~ 715 MiB | Recommended default; PnC + auto-detect. |
| **EOU** | English | RNN-T greedy + `<EOU>` | ~ 132 MiB | Streaming-trained; native end-of-turn token. |
| **Sortformer** | n/a | Diarization head | ~ 141 MiB | 4-speaker. |
| **Sortformer v1** | n/a | Diarization head (sliding history) | ~ 141 MiB | 4-speaker. **Default for offline diarization.** |
| **Sortformer v2.1 + AOSC** | n/a | Diarization head + speaker cache | ~ 141 MiB | 4-speaker. **Default for streaming diarization.** AOSC anchors speaker slots across silence/re-entry; auto-detected via GGUF metadata tag `parakeet.model_variant`. |

## Other examples

- [`examples/transcribe.js`](examples/transcribe.js) -- universal single-file transcribe / diarize (any GGUF, all model types).
- [`examples/diarized-transcribe.js`](examples/diarized-transcribe.js) -- combined Sortformer + ASR pipeline ("who said what").
- [`examples/live-mic.js`](examples/live-mic.js) -- live microphone transcription via `sox` and the streaming session.
- [`examples/live-mic-diarized.js`](examples/live-mic-diarized.js) -- live mic with parallel Sortformer + ASR for speaker-tagged transcripts.
- [`examples/live-mic-diarized.js`](examples/live-mic-diarized.js) -- live mic with parallel Sortformer + ASR for speaker-tagged transcripts. Pass a v2.1 Sortformer GGUF to get AOSC speaker-cache streaming automatically.
- [`examples/live-mic-diarized-aosc.js`](examples/live-mic-diarized-aosc.js) -- same as above but with CLI flags for the AOSC tuning knobs (`--spk-cache-len`, `--fifo-len`, `--chunk-right-context-ms`, `--spk-cache-enable`, etc.). Useful for A/B comparing AOSC vs the v1 sliding-window code path on the same v2.1 GGUF.
- [`examples/decode-audio.js`](examples/decode-audio.js) -- decode + transcribe in one step. Same flag surface as `transcribe.js` but pipes the input through `@qvac/decoder-audio` (FFmpeg) first, so any container / codec FFmpeg supports (mp3, m4a, ogg, flac, mp4, ...) works -- not just 16 kHz mono `.wav` / raw s16le PCM.
- [`examples/utils.js`](examples/utils.js) -- shared helpers used by the examples (`loadWeights` streaming, `Output`/`JobEnded` race resolution).
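
The AOSC CLI flags are the kebab-case spellings of the per-call override names. The sketch below shows one way such flags could be turned into an overrides object; it is an illustrative parser under that assumption and is not taken from `live-mic-diarized-aosc.js`, whose actual argument handling is not shown here.

```javascript
// Flags listed for examples/live-mic-diarized-aosc.js.
const AOSC_FLAGS = [
  '--spk-cache-enable',
  '--spk-cache-len',
  '--fifo-len',
  '--chunk-left-context-ms',
  '--chunk-right-context-ms',
  '--spk-cache-update-period'
];

// '--spk-cache-len' -> 'spkCacheLen'
function flagToOverrideName(flag) {
  return flag.replace(/^--/, '').replace(/-([a-z])/g, (_, c) => c.toUpperCase());
}

// Hypothetical parser: collects "--flag value" pairs for the AOSC flags and
// ignores everything else (e.g. --accumulate).
function parseAoscFlags(argv) {
  const overrides = {};
  for (let i = 0; i < argv.length - 1; i++) {
    if (!AOSC_FLAGS.includes(argv[i])) continue;
    const name = flagToOverrideName(argv[i]);
    const raw = argv[++i];
    overrides[name] = name === 'spkCacheEnable' ? raw !== 'false' : Number(raw);
  }
  return overrides;
}

console.log(parseAoscFlags(['--spk-cache-len', '256', '--chunk-right-context-ms', '480', '--accumulate']));
```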

43 changes: 43 additions & 0 deletions packages/transcription-parakeet/addon/src/addon/AddonJs.hpp
@@ -163,6 +163,13 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try {
      parakeetModel.getDiarMinDurationOn() * 1000.0F);
  config.leftContextMs = parakeetModel.getStreamingLeftContextMs();
  config.rightLookaheadMs = parakeetModel.getStreamingRightLookaheadMs();
  // AOSC defaults sourced from the model's load-time ParakeetConfig.
  config.spkCacheEnable = parakeetModel.getStreamingSpkCacheEnable();
  config.spkCacheLen = parakeetModel.getStreamingSpkCacheLen();
  config.fifoLen = parakeetModel.getStreamingFifoLen();
  config.chunkLeftContextMs = parakeetModel.getStreamingChunkLeftContextMs();
  config.chunkRightContextMs = parakeetModel.getStreamingChunkRightContextMs();
  config.spkCacheUpdatePeriod = parakeetModel.getStreamingSpkCacheUpdatePeriod();

  if (auto chunkMs =
          configObj.getOptionalProperty<js::Number>(env, "chunkMs");
@@ -198,6 +205,42 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try {
      emitEnergyVad.has_value()) {
    config.emitEnergyVad = emitEnergyVad.value().as<bool>(env);
  }
  // AOSC per-call overrides (v2.1+ Sortformer only).
  // Out-of-range values (non-positive lengths / update period, negative
  // context windows) are ignored, leaving the load-time value in place.
  if (auto spkCacheEnable =
          configObj.getOptionalProperty<js::Boolean>(env, "spkCacheEnable");
      spkCacheEnable.has_value()) {
    config.spkCacheEnable = spkCacheEnable.value().as<bool>(env);
  }
  if (auto spkCacheLen =
          configObj.getOptionalProperty<js::Number>(env, "spkCacheLen");
      spkCacheLen.has_value()) {
    const auto v = static_cast<int>(spkCacheLen.value().as<double>(env));
    if (v > 0) config.spkCacheLen = v;
  }
  if (auto fifoLen =
          configObj.getOptionalProperty<js::Number>(env, "fifoLen");
      fifoLen.has_value()) {
    const auto v = static_cast<int>(fifoLen.value().as<double>(env));
    if (v > 0) config.fifoLen = v;
  }
  if (auto chunkLeftContextMs =
          configObj.getOptionalProperty<js::Number>(env, "chunkLeftContextMs");
      chunkLeftContextMs.has_value()) {
    const auto v = static_cast<int>(chunkLeftContextMs.value().as<double>(env));
    if (v >= 0) config.chunkLeftContextMs = v;
  }
  if (auto chunkRightContextMs =
          configObj.getOptionalProperty<js::Number>(env, "chunkRightContextMs");
      chunkRightContextMs.has_value()) {
    const auto v = static_cast<int>(chunkRightContextMs.value().as<double>(env));
    if (v >= 0) config.chunkRightContextMs = v;
  }
  if (auto spkCacheUpdatePeriod =
          configObj.getOptionalProperty<js::Number>(env, "spkCacheUpdatePeriod");
      spkCacheUpdatePeriod.has_value()) {
    const auto v = static_cast<int>(spkCacheUpdatePeriod.value().as<double>(env));
    if (v > 0) config.spkCacheUpdatePeriod = v;
  }

  {
    std::lock_guard<std::mutex> lock(g_streamingMtx);
@@ -107,6 +107,53 @@ auto JSAdapter::loadFromJSObject(js::Object jsObject, js_env_t* env)
        streamingRightLookaheadMsOpt.value().as<int32_t>(env);
  }

  // AOSC (v2.1+ Sortformer only). All optional; unspecified values keep
  // ParakeetConfig's defaults. Forwarded into parakeet::SortformerStreamingOptions
  // by ParakeetModel / ParakeetStreamingProcessor; ignored for v1/v2/non-Sortformer.
  auto streamingSpkCacheEnableOpt =
      jsObject.getOptionalProperty<js::Boolean>(env, "streamingSpkCacheEnable");
  if (streamingSpkCacheEnableOpt.has_value()) {
    config.streamingSpkCacheEnable =
        streamingSpkCacheEnableOpt.value().as<bool>(env);
  }

  auto streamingSpkCacheLenOpt =
      jsObject.getOptionalProperty<js::Number>(env, "streamingSpkCacheLen");
  if (streamingSpkCacheLenOpt.has_value()) {
    config.streamingSpkCacheLen =
        streamingSpkCacheLenOpt.value().as<int32_t>(env);
  }

  auto streamingFifoLenOpt =
      jsObject.getOptionalProperty<js::Number>(env, "streamingFifoLen");
  if (streamingFifoLenOpt.has_value()) {
    config.streamingFifoLen = streamingFifoLenOpt.value().as<int32_t>(env);
  }

  auto streamingChunkLeftContextMsOpt =
      jsObject.getOptionalProperty<js::Number>(
          env, "streamingChunkLeftContextMs");
  if (streamingChunkLeftContextMsOpt.has_value()) {
    config.streamingChunkLeftContextMs =
        streamingChunkLeftContextMsOpt.value().as<int32_t>(env);
  }

  auto streamingChunkRightContextMsOpt =
      jsObject.getOptionalProperty<js::Number>(
          env, "streamingChunkRightContextMs");
  if (streamingChunkRightContextMsOpt.has_value()) {
    config.streamingChunkRightContextMs =
        streamingChunkRightContextMsOpt.value().as<int32_t>(env);
  }

  auto streamingSpkCacheUpdatePeriodOpt =
      jsObject.getOptionalProperty<js::Number>(
          env, "streamingSpkCacheUpdatePeriod");
  if (streamingSpkCacheUpdatePeriodOpt.has_value()) {
    config.streamingSpkCacheUpdatePeriod =
        streamingSpkCacheUpdatePeriodOpt.value().as<int32_t>(env);
  }

  auto innerConfigOpt = jsObject.getOptionalProperty<js::Object>(env, "config");
  if (innerConfigOpt.has_value()) {
    loadModelParams(innerConfigOpt.value(), env, config);
@@ -46,6 +46,15 @@ ParakeetStreamingProcessor::ParakeetStreamingProcessor(
  opts.threshold = config_.diarOnsetThreshold;
  opts.min_segment_ms = config_.diarMinSegmentMs;
  opts.emit_partials = config_.emitPartials;
  // AOSC (v2.1+ Sortformer only). parakeet-cpp ignores these fields for
  // v1/v2 GGUFs (variant detected from `parakeet.model_variant` metadata
  // or the encoder shape heuristic), so always-forward is safe.
  opts.spkcache_enable = config_.spkCacheEnable;
  opts.spkcache_len = config_.spkCacheLen;
  opts.fifo_len = config_.fifoLen;
  opts.chunk_left_context_ms = config_.chunkLeftContextMs;
  opts.chunk_right_context_ms = config_.chunkRightContextMs;
  opts.spkcache_update_period = config_.spkCacheUpdatePeriod;

  diar_session_ = model_.createDuplexDiarizationSession(
      opts,
@@ -54,6 +54,18 @@ class ParakeetStreamingProcessor {
    // parakeet engine default in place" (10000 / 2000 ms respectively).
    int leftContextMs = -1;
    int rightLookaheadMs = -1;
    // === AOSC (v2.1+ Sortformer only) ====================================
    // Forwarded into parakeet::SortformerStreamingOptions when the loaded
    // model is a v2.1 Sortformer GGUF (auto-detected from the GGUF's
    // `parakeet.model_variant` metadata tag). parakeet-cpp ignores these
    // fields on v1/v2 GGUFs and on non-Sortformer engines, so they are
    // always safe to forward.
    bool spkCacheEnable = true;
    int spkCacheLen = 188;
    int fifoLen = 188;
    int chunkLeftContextMs = 80;
    int chunkRightContextMs = 560;
    int spkCacheUpdatePeriod = 144;
  };

  ParakeetStreamingProcessor(