From 402422c2a4abd5423f7b3c0db537a7c6ca549f74 Mon Sep 17 00:00:00 2001
From: Pratik Narola
Date: Wed, 20 May 2026 16:00:47 +0530
Subject: [PATCH 1/5] feat[api]: add Sortformer v2.1 + AOSC streaming diarization support
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bumps @qvac/transcription-parakeet 0.4.0 -> 0.5.0 (MINOR -- additive API only; no breaking changes).

## 🎯 What problem does this PR solve?

- v1 Sortformer streaming uses a fixed-size sliding-history window; once a speaker goes silent long enough to roll out of the window, their slot identity drifts onto a different physical voice when they return.
- Continuous single-speaker stretches collapse all voices onto `sortformer_0` once two speakers have been seen, breaking live speaker-tagged transcripts.
- v2.1 + AOSC (Audio-Online Speaker Cache, NeMo-ported) fixes this in parakeet-cpp, but until now there was no way to consume it from the JS layer.

## 📝 How does it solve it?

- Bump `parakeet-cpp` to `version>= 2026-05-20` (the qvac-registry-vcpkg bump in PR #156 pulls in PRs #22 / #24 of qvac-ext-lib-whisper.cpp).
- Plumb 6 AOSC knobs (`streamingSpkCacheEnable`, `streamingSpkCacheLen`, `streamingFifoLen`, `streamingChunkLeftContextMs`, `streamingChunkRightContextMs`, `streamingSpkCacheUpdatePeriod`) from JS through `ParakeetConfig` -> `ParakeetModel` / `ParakeetStreamingProcessor` -> `parakeet::SortformerStreamingOptions`, for both the in-process Mode-3 streaming path and the duplex `runStreaming()` processor.
- v2.1 is auto-detected by the engine via the GGUF metadata tag `parakeet.model_variant`; AOSC defaults mirror parakeet-cpp's NeMo-port tuning (188 / 188 / 80 / 560 / 144, enabled).
- Defaults: v2.1 becomes the streaming Sortformer; v1 stays the offline default. Both GGUFs remain registered.
- New `examples/live-mic-diarized-aosc.js` exposes every AOSC knob as a CLI flag for A/B comparison against the v1 sliding-window path.

## 🧪 How was it tested?

- Built locally against a vcpkg overlay pointing at the PR #156 branch; addon compiled cleanly with all 6 new AOSC field references through `ParakeetStreamingProcessor.cpp`, `ParakeetModel.cpp`, `AddonJs.hpp`, and `JSAdapter.cpp`.
- Full integration suite: **37/37 tests pass, 72/72 assertions in 145s** (macOS arm64, all q8_0 GGUFs staged including v2.1 Sortformer).
- New `test/integration/sortformer-aosc-streaming.test.js` covers default-AOSC streaming + `streamingSpkCacheEnable=false` fallback to the v1 sliding-window code path. Confirmed via engine logs that the override actually disables the cache (`Sortformer AOSC enabled` line only prints when AOSC is active).
- v1 Sortformer desktop integration + GPU smoke tests still pass -- no regression to the existing diarization path.

## 🔌 API Changes

New optional fields on `ParakeetConfig`, mirrored as per-call overrides on `StreamingRunConfig`. All default to parakeet-cpp's NeMo-port tuning; specifying them is opt-in. Ignored on v1 / v2 Sortformer and on non-Sortformer engines (no-op forwarding is safe).

```typescript
import { TranscriptionParakeet } from "@qvac/transcription-parakeet";

const model = new TranscriptionParakeet({
  files: { model: "diar_streaming_sortformer_4spk-v2.1.q8_0.gguf" },
  config: {
    parakeetConfig: {
      streaming: true,
      streamingChunkMs: 2000,
      // AOSC (v2.1+ only; auto-detected via GGUF metadata)
      streamingSpkCacheEnable: true,       // default
      streamingSpkCacheLen: 188,           // long-term cache rows
      streamingFifoLen: 188,               // warmup FIFO rows
      streamingChunkLeftContextMs: 80,     // ~1 encoder frame
      streamingChunkRightContextMs: 560,   // ~7 encoder frames
      streamingSpkCacheUpdatePeriod: 144,  // FIFO-overflow pop count
    },
  },
});
```

## Depends on

- qvac-registry-vcpkg #156 (parakeet-cpp 2026-05-20 bump). CI will not resolve the new `version>=` constraint until that PR merges.
- Separate registry-server PR for the v2.1 GGUF entry in `models.prod.json` (out of scope for this PR -- handled independently).
- Upload of `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf` to S3 (the GGUF the new test resolves via `MODEL_CONFIGS.sortformerStreaming`). ## Follow-up (separate PR, not in scope here) SDK adoption (`@qvac/sdk` schema + plugin + example) lands in a separate PR after this addon is published and the v2.1 GGUF entry has synced into `sdk/models/registry/models.ts`. The SDK needs both pieces in place before its schema can meaningfully forward AOSC knobs. --- packages/transcription-parakeet/CHANGELOG.md | 18 + packages/transcription-parakeet/README.md | 53 ++- .../addon/src/addon/AddonJs.hpp | 43 ++ .../addon/src/js-interface/JSAdapter.cpp | 47 +++ .../ParakeetStreamingProcessor.cpp | 9 + .../ParakeetStreamingProcessor.hpp | 12 + .../parakeet/ParakeetConfig.hpp | 29 +- .../parakeet/ParakeetModel.cpp | 9 + .../parakeet/ParakeetModel.hpp | 21 + .../examples/diarized-transcribe.js | 10 +- .../examples/live-mic-diarized-aosc.js | 369 ++++++++++++++++++ .../examples/live-mic-diarized.js | 40 +- packages/transcription-parakeet/index.d.ts | 36 ++ packages/transcription-parakeet/index.js | 14 +- packages/transcription-parakeet/package.json | 2 +- packages/transcription-parakeet/parakeet.js | 20 + .../test/integration/helpers.js | 13 + .../sortformer-aosc-streaming.test.js | 222 +++++++++++ packages/transcription-parakeet/vcpkg.json | 8 +- 19 files changed, 951 insertions(+), 24 deletions(-) create mode 100644 packages/transcription-parakeet/examples/live-mic-diarized-aosc.js create mode 100644 packages/transcription-parakeet/test/integration/sortformer-aosc-streaming.test.js diff --git a/packages/transcription-parakeet/CHANGELOG.md b/packages/transcription-parakeet/CHANGELOG.md index caad384708..67dbabd684 100644 --- a/packages/transcription-parakeet/CHANGELOG.md +++ b/packages/transcription-parakeet/CHANGELOG.md @@ -5,6 +5,24 @@ All notable changes to this project will be documented in this file. 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.5.0]
+
+In this release we expose the v2.1 streaming Sortformer model with NeMo-port AOSC (Audio-Online Speaker Cache) through the addon's public API. AOSC anchors each speaker to a stable cache slot across silence and re-entry, fixing the per-chunk permutation-invariance drift v1's sliding-window streaming exhibits once two voices have been seen. v2.1 becomes the recommended streaming Sortformer; v1 stays the offline-batch default. Six new optional config knobs surface the cache geometry for tuning and A/B comparison; defaults mirror parakeet-cpp's NeMo-port tuning so a bare `streaming: true` against a v2.1 GGUF Just Works.
+
+### Added
+- **AOSC config knobs.** `ParakeetConfig` gains six optional fields -- `streamingSpkCacheEnable` (default `true`), `streamingSpkCacheLen` (188), `streamingFifoLen` (188), `streamingChunkLeftContextMs` (80), `streamingChunkRightContextMs` (560), `streamingSpkCacheUpdatePeriod` (144) -- forwarded into `parakeet::SortformerStreamingOptions` for both the in-process Mode-3 streaming path (`ParakeetModel::runStreamingProcess_`) and the duplex `runStreaming()` processor (`ParakeetStreamingProcessor`). Mirrored as per-call overrides on `StreamingRunConfig` (`spkCacheEnable`, `spkCacheLen`, `fifoLen`, `chunkLeftContextMs`, `chunkRightContextMs`, `spkCacheUpdatePeriod`). parakeet-cpp ignores these on v1 / v2 Sortformer GGUFs and on non-Sortformer engines, so always-forward is safe.
+- **v2.1 Sortformer auto-detection.** When a `diar_streaming_sortformer_4spk-v2.1.*` GGUF is loaded, parakeet-cpp's engine recognises it from the GGUF metadata tag `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"` and enables AOSC by default. Setting `streamingSpkCacheEnable: false` forces the v1 sliding-window code path on a v2.1 model (A/B comparison).
+- **`examples/live-mic-diarized-aosc.js`** -- v2.1-focused dual-stream live mic example mirroring `live-mic-diarized.js`'s ASR + Sortformer pattern, with CLI flags for every AOSC knob (`--spk-cache-enable`, `--spk-cache-len`, `--fifo-len`, `--chunk-left-context-ms`, `--chunk-right-context-ms`, `--spk-cache-update-period`).
+- **`test/integration/sortformer-aosc-streaming.test.js`** -- covers default-AOSC streaming and `streamingSpkCacheEnable=false` fallback. The full AOSC slot-stability contract (same physical speaker → same `Speaker N` tag across non-contiguous re-entries) is verified at C++ level in `parakeet-cpp/test/test_sortformer_aosc_speakers.cpp`; this JS-level test focuses on wiring correctness -- that the override actually reaches the engine and the engine emits well-formed segments in both modes.
+- **`MODEL_CONFIGS.sortformerStreaming`** entry in `test/integration/helpers.js` pointing at `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. Tests skip cleanly when the GGUF isn't staged via `npm run setup-models` / `QVAC_TEST_GGUF_*`.
+
+### Changed
+- **parakeet-cpp dep bumped** to `version>= 2026-05-20` (was `2026-05-05#1`) across all three platform branches in `vcpkg.json`. The new port (qvac-registry-vcpkg PR #156) pulls in PRs #22 + #24 of `qvac-ext-lib-whisper.cpp`, which introduce the v2.1 Sortformer support, AOSC engine implementation, strict variant detection via the `parakeet.model_variant` GGUF tag, and review-fixup cleanups (magic-number elimination, dead-code removal, test utility consolidation, Windows `` include).
+- **`index.js::_buildConfigurationParams()`** now forwards the 6 new AOSC fields (and explicit defaults for unset values) into `createInstance` / `reload`. Without this, JSDoc + native plumbing would exist but JS-layer overrides would never reach C++.
+- **`examples/live-mic-diarized.js`** header: recommends the v2.1 GGUF as `--diar-model` and notes that `streamingHistoryMs` is superseded by AOSC on v2.1 models (kept for v1 back-compat). Points to the new `live-mic-diarized-aosc.js` for explicit knob control.
+- **`examples/diarized-transcribe.js`** header: notes v1 remains the recommended OFFLINE diarization model -- AOSC's slot-stability benefit only applies to continuous streaming and is wasted in batch mode.
+- **`README.md`** -- extended Model Variants table with v1 (offline default) and v2.1 + AOSC (streaming default) rows; new `streamingSpkCache*` rows in the ParakeetConfig table; dedicated "Sortformer Streaming Diarization (v2.1 + AOSC)" section explaining the v1-drift problem AOSC solves, the model-variant auto-detection, and when to leave the defaults alone.
+
 ## [0.4.0]
 In this release, we have replaced the onnxruntime backend with a pure C++/ggml engine, added a duplex-streaming entry point that bypasses the framework's batch-then-process lifecycle for live use cases, and surfaced two new per-segment signals (`isEndOfTurn`, `startsWord`) so consumers can build cleaner live transcripts. The release also exposes per-engine backend stats (`backendDevice`, `backendId`) so callers can verify the GPU path actually engaged, and consolidates the examples / docs / mock fixtures into a single duplex-aware surface.
diff --git a/packages/transcription-parakeet/README.md b/packages/transcription-parakeet/README.md
index 548668823c..c3ae749e0f 100644
--- a/packages/transcription-parakeet/README.md
+++ b/packages/transcription-parakeet/README.md
@@ -214,9 +214,46 @@ Most users interact with the package through `index.js`. From that entrypoint we
 | | `streamingEnergyVad` | CTC/TDT energy-VAD events (default: `false`) |
 | | `streamingLeftContextMs` | ASR encoder left-context window in ms; `-1` keeps parakeet-cpp's default of 10000. ASR sessions only (Sortformer ignores it).
| | | `streamingRightLookaheadMs` | ASR encoder right-lookahead window in ms; `-1` keeps parakeet-cpp's default of 2000. Adds directly to the per-segment latency floor (`chunk_ms + right_lookahead_ms`). ASR sessions only. | +| | `streamingSpkCacheEnable` | AOSC: enable v2.1 Sortformer's speaker-cache streaming (default: `true`). Ignored on v1/v2 Sortformer GGUFs and on non-Sortformer models. Set `false` to force a v2.1 GGUF onto the v1 sliding-window path (A/B comparison). | +| | `streamingSpkCacheLen` | AOSC: long-term speaker-cache rows (~15 s of encoder frames). Default: 188. | +| | `streamingFifoLen` | AOSC: FIFO warmup buffer rows. Default: 188. | +| | `streamingChunkLeftContextMs` | AOSC: encoder left-context window (ms; ~1 encoder frame). Default: 80. | +| | `streamingChunkRightContextMs` | AOSC: encoder right-context window (ms; ~7 encoder frames). Default: 560. | +| | `streamingSpkCacheUpdatePeriod` | AOSC: FIFO-overflow pop-out count. Default: 144. | The model type (CTC / TDT / EOU / Sortformer) is **auto-detected from the GGUF metadata**, so callers don't need to pass `modelType`. Other knobs (`captionEnabled`, `timestampsEnabled`, `seed`, `sampleRate`, `channels`) keep sensible defaults. +**Sortformer Streaming Diarization (v2.1 + AOSC).** parakeet-cpp ships +two streaming-diarization paths picked automatically by the GGUF: + +- **v1** uses a fixed-size sliding-history window inside the engine. + Once two voices have been seen, the per-chunk decisions are + permutation-invariant; if a speaker goes silent long enough to roll + out of the window, the slot can drift onto a different physical voice + when they return. Fine for short, stable clips; ships as + `sortformer-4spk-v1.q8_0.gguf`. +- **v2.1** replaces the sliding window with AOSC (Audio-Online Speaker + Cache, ported from NVIDIA NeMo) which anchors each slot to its + accumulated embedding. Same physical speaker comes back to the same + `Speaker N` tag across silences. 
Default for live capture; ships as + `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. The engine detects + v2.1 via the GGUF metadata tag + `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`; you + don't need to opt in via config. + +The defaults in the `streamingSpkCache*` / `streamingFifo*` / +`streamingChunk{Left,Right}ContextMs` table rows above are the NeMo-port +tuning parakeet-cpp ships -- you almost always want to keep them. The +knobs are exposed for A/B comparison (e.g. `--spk-cache-enable false` +in `examples/live-mic-diarized-aosc.js` to force a v2.1 GGUF onto the +v1 path) and for tuning unusual audio (longer cache, larger +right-context window for higher latency tolerance, etc.). + +For offline diarization (single batch over a finite clip) v1 remains +the recommended GGUF -- AOSC's slot-stability benefit only applies to +continuous streaming and offers no measurable improvement when the +entire clip is available at once. + #### Configuration Example ```javascript @@ -408,10 +445,16 @@ bare examples/diarized-transcribe.js \ # Live mic transcription bare examples/live-mic.js --model models/parakeet-eou-120m-v1.q8_0.gguf --accumulate -# Live mic + speaker tagging +# Live mic + speaker tagging (recommended: v2.1 diar GGUF, AOSC auto-on) bare examples/live-mic-diarized.js \ --asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \ - --diar-model models/sortformer-4spk-v1.q8_0.gguf --accumulate + --diar-model models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf --accumulate + +# Same as above, with explicit AOSC tuning knobs exposed as CLI flags +bare examples/live-mic-diarized-aosc.js \ + --asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \ + --diar-model models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf \ + --spk-cache-len 256 --chunk-right-context-ms 480 --accumulate ``` > If you use `npm run example:* -- ...` instead of `bare`, remember the `--` separator -- without it npm interprets `--model` as one of its own config flags. 
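Reviewer sketch (not part of the patch): the AOSC defaults documented in the README hunk above can be sanity-checked with a few lines of JavaScript. The helpers below are hypothetical; `buildAoscConfig` and `rowsToSeconds` do not exist in `@qvac/transcription-parakeet`. The sketch merges overrides over the documented defaults and applies the same sign rules as the `startStreaming` per-call guards in `AddonJs.hpp` (lengths and the update period must be positive, context windows non-negative), except that it throws where the native code would quietly keep the previous value. The 80 ms frame duration is an assumption inferred from the README's "80 ms is about 1 encoder frame" and "560 ms is about 7 encoder frames"; under that assumption 188 cache rows come to roughly 15 s, which matches the table.

```javascript
'use strict'

// Defaults quoted from the ParakeetConfig table in this patch.
const AOSC_DEFAULTS = Object.freeze({
  streamingSpkCacheEnable: true,
  streamingSpkCacheLen: 188,
  streamingFifoLen: 188,
  streamingChunkLeftContextMs: 80,
  streamingChunkRightContextMs: 560,
  streamingSpkCacheUpdatePeriod: 144
})

// Hypothetical helper, not part of the package API. Merges overrides over
// the defaults and rejects values the per-call override path in
// AddonJs.hpp would skip (v > 0 for lengths / update period, v >= 0 for
// the context windows).
function buildAoscConfig (overrides = {}) {
  const cfg = { ...AOSC_DEFAULTS }
  for (const [key, value] of Object.entries(overrides)) {
    if (!(key in AOSC_DEFAULTS)) {
      throw new TypeError(`unknown AOSC knob: ${key}`)
    }
    if (key === 'streamingSpkCacheEnable') {
      if (typeof value !== 'boolean') {
        throw new TypeError(`${key} must be a boolean`)
      }
    } else {
      const min = key.endsWith('ContextMs') ? 0 : 1
      if (!Number.isInteger(value) || value < min) {
        throw new RangeError(`${key} must be an integer >= ${min}`)
      }
    }
    cfg[key] = value
  }
  return cfg
}

// Assumption: one encoder frame (one cache / FIFO row) spans 80 ms.
const ASSUMED_FRAME_MS = 80

function rowsToSeconds (rows) {
  return (rows * ASSUMED_FRAME_MS) / 1000
}

// Same overrides as the README's live-mic-diarized-aosc.js command line.
const cfg = buildAoscConfig({
  streamingSpkCacheLen: 256,
  streamingChunkRightContextMs: 480
})

console.log(cfg)
console.log(rowsToSeconds(AOSC_DEFAULTS.streamingSpkCacheLen)) // 15.04
```

The resulting object could be spread into `parakeetConfig` next to `streaming: true`; whether rejecting bad values up front is preferable to the native layer's keep-the-default behaviour is a design choice this patch does not settle.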
@@ -425,14 +468,16 @@ The live-mic examples capture the default input device via `sox -d` (install: `b | **CTC** | English | argmax CTC | ~ 700 MiB | Fast, no PnC. | | **TDT** | ~25 | RNN-T greedy + duration | ~ 715 MiB | Recommended default; PnC + auto-detect. | | **EOU** | English | RNN-T greedy + `` | ~ 132 MiB | Streaming-trained; native end-of-turn token. | -| **Sortformer** | n/a | Diarization head | ~ 141 MiB | 4-speaker. | +| **Sortformer v1** | n/a | Diarization head (sliding history) | ~ 141 MiB | 4-speaker. **Default for offline diarization.** | +| **Sortformer v2.1 + AOSC** | n/a | Diarization head + speaker cache | ~ 141 MiB | 4-speaker. **Default for streaming diarization.** AOSC anchors speaker slots across silence/re-entry; auto-detected via GGUF metadata tag `parakeet.model_variant`. | ## Other examples - [`examples/transcribe.js`](examples/transcribe.js) -- universal single-file transcribe / diarize (any GGUF, all model types). - [`examples/diarized-transcribe.js`](examples/diarized-transcribe.js) -- combined Sortformer + ASR pipeline ("who said what"). - [`examples/live-mic.js`](examples/live-mic.js) -- live microphone transcription via `sox` and the streaming session. -- [`examples/live-mic-diarized.js`](examples/live-mic-diarized.js) -- live mic with parallel Sortformer + ASR for speaker-tagged transcripts. +- [`examples/live-mic-diarized.js`](examples/live-mic-diarized.js) -- live mic with parallel Sortformer + ASR for speaker-tagged transcripts. Pass a v2.1 Sortformer GGUF to get AOSC speaker-cache streaming automatically. +- [`examples/live-mic-diarized-aosc.js`](examples/live-mic-diarized-aosc.js) -- same as above but with CLI flags for the AOSC tuning knobs (`--spk-cache-len`, `--fifo-len`, `--chunk-right-context-ms`, `--spk-cache-enable`, etc.). Useful for A/B comparing AOSC vs the v1 sliding-window code path on the same v2.1 GGUF. - [`examples/decode-audio.js`](examples/decode-audio.js) -- decode + transcribe in one step. 
Same flag surface as `transcribe.js` but pipes the input through `@qvac/decoder-audio` (FFmpeg) first, so any container / codec FFmpeg supports (mp3, m4a, ogg, flac, mp4, ...) works -- not just 16 kHz mono `.wav` / raw s16le PCM. - [`examples/utils.js`](examples/utils.js) -- shared helpers used by the examples (`loadWeights` streaming, `Output`/`JobEnded` race resolution). diff --git a/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp b/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp index 12cfea1766..e91833e726 100644 --- a/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp +++ b/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp @@ -163,6 +163,13 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try { parakeetModel.getDiarMinDurationOn() * 1000.0F); config.leftContextMs = parakeetModel.getStreamingLeftContextMs(); config.rightLookaheadMs = parakeetModel.getStreamingRightLookaheadMs(); + // AOSC defaults sourced from the model's load-time ParakeetConfig. + config.spkCacheEnable = parakeetModel.getStreamingSpkCacheEnable(); + config.spkCacheLen = parakeetModel.getStreamingSpkCacheLen(); + config.fifoLen = parakeetModel.getStreamingFifoLen(); + config.chunkLeftContextMs = parakeetModel.getStreamingChunkLeftContextMs(); + config.chunkRightContextMs = parakeetModel.getStreamingChunkRightContextMs(); + config.spkCacheUpdatePeriod = parakeetModel.getStreamingSpkCacheUpdatePeriod(); if (auto chunkMs = configObj.getOptionalProperty(env, "chunkMs"); @@ -198,6 +205,42 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try { emitEnergyVad.has_value()) { config.emitEnergyVad = emitEnergyVad.value().as(env); } + // AOSC per-call overrides (v2.1+ Sortformer only). 
+ if (auto spkCacheEnable = + configObj.getOptionalProperty(env, "spkCacheEnable"); + spkCacheEnable.has_value()) { + config.spkCacheEnable = spkCacheEnable.value().as(env); + } + if (auto spkCacheLen = + configObj.getOptionalProperty(env, "spkCacheLen"); + spkCacheLen.has_value()) { + const auto v = static_cast(spkCacheLen.value().as(env)); + if (v > 0) config.spkCacheLen = v; + } + if (auto fifoLen = + configObj.getOptionalProperty(env, "fifoLen"); + fifoLen.has_value()) { + const auto v = static_cast(fifoLen.value().as(env)); + if (v > 0) config.fifoLen = v; + } + if (auto chunkLeftContextMs = + configObj.getOptionalProperty(env, "chunkLeftContextMs"); + chunkLeftContextMs.has_value()) { + const auto v = static_cast(chunkLeftContextMs.value().as(env)); + if (v >= 0) config.chunkLeftContextMs = v; + } + if (auto chunkRightContextMs = + configObj.getOptionalProperty(env, "chunkRightContextMs"); + chunkRightContextMs.has_value()) { + const auto v = static_cast(chunkRightContextMs.value().as(env)); + if (v >= 0) config.chunkRightContextMs = v; + } + if (auto spkCacheUpdatePeriod = + configObj.getOptionalProperty(env, "spkCacheUpdatePeriod"); + spkCacheUpdatePeriod.has_value()) { + const auto v = static_cast(spkCacheUpdatePeriod.value().as(env)); + if (v > 0) config.spkCacheUpdatePeriod = v; + } { std::lock_guard lock(g_streamingMtx); diff --git a/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp b/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp index 269899fcd8..be3d9ee42c 100644 --- a/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp +++ b/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp @@ -107,6 +107,53 @@ auto JSAdapter::loadFromJSObject(js::Object jsObject, js_env_t* env) streamingRightLookaheadMsOpt.value().as(env); } + // AOSC (v2.1+ Sortformer only). All optional; unspecified values keep + // ParakeetConfig's defaults. 
Forwarded into parakeet::SortformerStreamingOptions + // by ParakeetModel / ParakeetStreamingProcessor; ignored for v1/v2/non-Sortformer. + auto streamingSpkCacheEnableOpt = + jsObject.getOptionalProperty(env, "streamingSpkCacheEnable"); + if (streamingSpkCacheEnableOpt.has_value()) { + config.streamingSpkCacheEnable = + streamingSpkCacheEnableOpt.value().as(env); + } + + auto streamingSpkCacheLenOpt = + jsObject.getOptionalProperty(env, "streamingSpkCacheLen"); + if (streamingSpkCacheLenOpt.has_value()) { + config.streamingSpkCacheLen = + streamingSpkCacheLenOpt.value().as(env); + } + + auto streamingFifoLenOpt = + jsObject.getOptionalProperty(env, "streamingFifoLen"); + if (streamingFifoLenOpt.has_value()) { + config.streamingFifoLen = streamingFifoLenOpt.value().as(env); + } + + auto streamingChunkLeftContextMsOpt = + jsObject.getOptionalProperty( + env, "streamingChunkLeftContextMs"); + if (streamingChunkLeftContextMsOpt.has_value()) { + config.streamingChunkLeftContextMs = + streamingChunkLeftContextMsOpt.value().as(env); + } + + auto streamingChunkRightContextMsOpt = + jsObject.getOptionalProperty( + env, "streamingChunkRightContextMs"); + if (streamingChunkRightContextMsOpt.has_value()) { + config.streamingChunkRightContextMs = + streamingChunkRightContextMsOpt.value().as(env); + } + + auto streamingSpkCacheUpdatePeriodOpt = + jsObject.getOptionalProperty( + env, "streamingSpkCacheUpdatePeriod"); + if (streamingSpkCacheUpdatePeriodOpt.has_value()) { + config.streamingSpkCacheUpdatePeriod = + streamingSpkCacheUpdatePeriodOpt.value().as(env); + } + auto innerConfigOpt = jsObject.getOptionalProperty(env, "config"); if (innerConfigOpt.has_value()) { loadModelParams(innerConfigOpt.value(), env, config); diff --git a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp index 8161375bb5..9298d2c81d 100644 --- 
a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp +++ b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp @@ -46,6 +46,15 @@ ParakeetStreamingProcessor::ParakeetStreamingProcessor( opts.threshold = config_.diarOnsetThreshold; opts.min_segment_ms = config_.diarMinSegmentMs; opts.emit_partials = config_.emitPartials; + // AOSC (v2.1+ Sortformer only). parakeet-cpp ignores these fields for + // v1/v2 GGUFs (variant detected from `parakeet.model_variant` metadata + // or the encoder shape heuristic), so always-forward is safe. + opts.spkcache_enable = config_.spkCacheEnable; + opts.spkcache_len = config_.spkCacheLen; + opts.fifo_len = config_.fifoLen; + opts.chunk_left_context_ms = config_.chunkLeftContextMs; + opts.chunk_right_context_ms = config_.chunkRightContextMs; + opts.spkcache_update_period = config_.spkCacheUpdatePeriod; diar_session_ = model_.createDuplexDiarizationSession( opts, diff --git a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp index f611172eb6..c2712c0cfc 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp @@ -54,6 +54,18 @@ class ParakeetStreamingProcessor { // parakeet engine default in place" (10000 / 2000 ms respectively). int leftContextMs = -1; int rightLookaheadMs = -1; + // === AOSC (v2.1+ Sortformer only) ==================================== + // Forwarded into parakeet::SortformerStreamingOptions when the loaded + // model is a v2.1 Sortformer GGUF (auto-detected from the GGUF's + // `parakeet.model_variant` metadata tag). parakeet-cpp ignores these + // fields on v1/v2 GGUFs and on non-Sortformer engines, so they are + // always safe to forward. 
+    bool spkCacheEnable = true;
+    int spkCacheLen = 188;
+    int fifoLen = 188;
+    int chunkLeftContextMs = 80;
+    int chunkRightContextMs = 560;
+    int spkCacheUpdatePeriod = 144;
   };
   ParakeetStreamingProcessor(
diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp
index a0d487242e..b08b7e520b 100644
--- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp
+++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp
@@ -57,6 +57,27 @@ struct ParakeetConfig {
   int streamingLeftContextMs = -1;
   int streamingRightLookaheadMs = -1;
+  // === AOSC (Audio-Online Speaker Cache; v2.1+ Sortformer only) ───────────
+  // Forwarded to parakeet::SortformerStreamingOptions.spkcache_* /
+  // fifo_len / chunk_{left,right}_context_ms / spkcache_update_period.
+  // Ignored on non-Sortformer models and on v1/v2 Sortformer GGUFs;
+  // parakeet-cpp auto-enables AOSC for v2.1 via the GGUF metadata tag
+  // `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`.
+  //
+  // The cache anchors speaker-slot identity across silence and re-entry,
+  // fixing the per-chunk permutation-invariance drift that v1's sliding
+  // window suffers from. Defaults mirror parakeet-cpp's own (NeMo-port
+  // tuning); override only when A/B comparing or for specialised audio.
+  //
+  // Setting streamingSpkCacheEnable = false on a v2.1 model forces the
+  // v1 sliding-window code path (useful for regression comparison).
+ bool streamingSpkCacheEnable = true; + int streamingSpkCacheLen = 188; // long-term speaker rows (~15s) + int streamingFifoLen = 188; // FIFO warmup buffer rows + int streamingChunkLeftContextMs = 80; // encoder left context (~1 frame) + int streamingChunkRightContextMs = 560; // encoder right context (~7 frames) + int streamingSpkCacheUpdatePeriod = 144; // FIFO-overflow pop-out count + ParakeetConfig() = default; explicit ParakeetConfig(const std::string& path) : modelPath(path) {} @@ -73,7 +94,13 @@ struct ParakeetConfig { streamingEmitPartials == other.streamingEmitPartials && streamingEnergyVad == other.streamingEnergyVad && streamingLeftContextMs == other.streamingLeftContextMs && - streamingRightLookaheadMs == other.streamingRightLookaheadMs; + streamingRightLookaheadMs == other.streamingRightLookaheadMs && + streamingSpkCacheEnable == other.streamingSpkCacheEnable && + streamingSpkCacheLen == other.streamingSpkCacheLen && + streamingFifoLen == other.streamingFifoLen && + streamingChunkLeftContextMs == other.streamingChunkLeftContextMs && + streamingChunkRightContextMs == other.streamingChunkRightContextMs && + streamingSpkCacheUpdatePeriod == other.streamingSpkCacheUpdatePeriod; } bool operator!=(const ParakeetConfig& other) const { return !(*this == other); } diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp index 96cc3dcdc3..6292d617b7 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp @@ -683,6 +683,15 @@ void ParakeetModel::openStreamingSession_() { opts.threshold = diarConfig_.onset; opts.min_segment_ms = static_cast(diarConfig_.minDurationOn * 1000.0f); opts.emit_partials = cfg_.streamingEmitPartials; + // AOSC (v2.1+ Sortformer only; ignored for v1/v2 GGUFs). 
The engine + // detects v2.1 via the GGUF metadata tag `parakeet.model_variant` and + // only consults these fields then -- safe to forward unconditionally. + opts.spkcache_enable = cfg_.streamingSpkCacheEnable; + opts.spkcache_len = cfg_.streamingSpkCacheLen; + opts.fifo_len = cfg_.streamingFifoLen; + opts.chunk_left_context_ms = cfg_.streamingChunkLeftContextMs; + opts.chunk_right_context_ms = cfg_.streamingChunkRightContextMs; + opts.spkcache_update_period = cfg_.streamingSpkCacheUpdatePeriod; auto session = engine->diarize_start( opts, [this](const parakeet::StreamingDiarizationSegment& seg) { diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp index 2cd7c5f993..27c62be9bb 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp @@ -139,6 +139,27 @@ class ParakeetModel : public qvac_lib_inference_addon_cpp::model::IModel, bool getStreamingEnergyVad() const { return cfg_.streamingEnergyVad; } + // AOSC accessors (v2.1+ Sortformer only). Forwarded verbatim from + // ParakeetConfig; parakeet-cpp ignores them for non-Sortformer engines + // and for v1/v2 Sortformer GGUFs. 
+ bool getStreamingSpkCacheEnable() const { + return cfg_.streamingSpkCacheEnable; + } + int getStreamingSpkCacheLen() const { + return cfg_.streamingSpkCacheLen; + } + int getStreamingFifoLen() const { + return cfg_.streamingFifoLen; + } + int getStreamingChunkLeftContextMs() const { + return cfg_.streamingChunkLeftContextMs; + } + int getStreamingChunkRightContextMs() const { + return cfg_.streamingChunkRightContextMs; + } + int getStreamingSpkCacheUpdatePeriod() const { + return cfg_.streamingSpkCacheUpdatePeriod; + } bool isSortformer() const { return cfg_.modelType == ModelType::SORTFORMER; } diff --git a/packages/transcription-parakeet/examples/diarized-transcribe.js b/packages/transcription-parakeet/examples/diarized-transcribe.js index 424092d470..8f25c8f131 100644 --- a/packages/transcription-parakeet/examples/diarized-transcribe.js +++ b/packages/transcription-parakeet/examples/diarized-transcribe.js @@ -1,13 +1,21 @@ 'use strict' /** - * Combined ASR + diarization example. + * Combined ASR + diarization example (offline). * * Runs Sortformer to find speaker time-segments, then transcribes * each speaker's audio slice with the ASR model. Output is a * "Speaker N: ..." per-segment transcript. Both engines run * through the public `TranscriptionParakeet` class. * + * Recommended `--diar-model`: the v1 Sortformer GGUF + * (`sortformer-4spk-v1.q8_0.gguf`). v2.1 also works but the AOSC + * speaker cache it brings is a *streaming* optimisation -- in batch / + * offline mode the entire clip is available at once, so AOSC's slot + * stability across silence/re-entry provides no additional benefit + * over v1. For live capture, use `examples/live-mic-diarized.js` + * (or `examples/live-mic-diarized-aosc.js`) with the v2.1 GGUF. 
+ *
  * Usage:
  *   bare examples/diarized-transcribe.js \
  *     --asr-model --diar-model --audio
diff --git a/packages/transcription-parakeet/examples/live-mic-diarized-aosc.js b/packages/transcription-parakeet/examples/live-mic-diarized-aosc.js
new file mode 100644
index 0000000000..41970a3026
--- /dev/null
+++ b/packages/transcription-parakeet/examples/live-mic-diarized-aosc.js
@@ -0,0 +1,369 @@
+'use strict'
+
+/**
+ * Live-mic transcription + diarization example with full AOSC control.
+ *
+ * This is the v2.1-focused counterpart of `examples/live-mic-diarized.js`.
+ * Both files share the same duplex pattern (two `runStreaming()`
+ * sessions fanned from a single sox capture, with the ASR transcript
+ * tagged by the latest Sortformer speaker_id). What this file adds is
+ * explicit CLI control of the AOSC (Audio-Online Speaker Cache) knobs
+ * parakeet-cpp exposes for v2.1 Sortformer streaming:
+ *
+ *   --spk-cache-enable {true|false}  Toggle AOSC. Defaults to true.
+ *                                    Set false to force a v2.1 GGUF
+ *                                    onto the v1 sliding-window
+ *                                    path (A/B comparison).
+ *   --spk-cache-len                  Long-term speaker-cache rows
+ *                                    (default 188 ≈ 15 s).
+ *   --fifo-len                       FIFO warmup buffer rows
+ *                                    (default 188).
+ *   --chunk-left-context-ms          Encoder left context, ~1 frame
+ *                                    (default 80).
+ *   --chunk-right-context-ms         Encoder right context, ~7 frames
+ *                                    (default 560). Adds directly to
+ *                                    per-chunk emission latency.
+ *   --spk-cache-update-period        FIFO-overflow pop-out count
+ *                                    (default 144). How many frames
+ *                                    get promoted into the long-term
+ *                                    cache each time the FIFO fills.
+ *
+ * Background -- what AOSC fixes:
+ *   v1 / v2 Sortformer streams use a fixed-size sliding-history window
+ *   inside the engine. Once two voices have been seen, the model's
+ *   per-chunk decisions are permutation-invariant; if one speaker goes
+ *   silent long enough to roll out of the window, its slot identity can
+ *   silently drift onto a different physical voice when it returns.
v2.1 + * replaces the sliding window with a NeMo-port speaker cache that + * anchors each slot to its accumulated embedding, so the same physical + * speaker comes back to the same `Speaker N` tag across silences. + * + * For the upstream API + algorithm details, see + * `parakeet-cpp/include/parakeet/diarization.h` and the upstream PRs + * that introduced this feature in qvac-ext-lib-whisper.cpp (PR #22 + * commit e6ba38c, PR #24 commit 08df2e7). + * + * Usage: + * bare examples/live-mic-diarized-aosc.js \ + * --asr-model \ + * --diar-model \ + * [--accumulate] [--chunk-ms ] [--capture ""] \ + * [--spk-cache-enable {true|false}] [--spk-cache-len ] \ + * [--fifo-len ] [--chunk-left-context-ms ] \ + * [--chunk-right-context-ms ] [--spk-cache-update-period ] + * + * Notes: + * - The AOSC knobs are silently ignored on v1/v2 GGUFs and on + * non-Sortformer models. The engine detects v2.1 via the GGUF + * metadata tag `parakeet.model_variant`. + * - On Windows, if sox exits without producing audio, override capture: + * --capture "sox -t waveaudio default -t raw -r 16000 -b 16 -c 1 -e signed-integer -L -" + */ + +/* global Bare */ +const path = require('bare-path') +const process = require('bare-process') +const subprocess = require('bare-subprocess') +const TranscriptionParakeet = require('../index.js') +const addonLogging = require('../addonLogging.js') +const { setupLogger, validatePaths, pushableStream } = require('./utils.js') + +const CAPTURE_CMD = 'sox -d -t raw -r 16000 -b 16 -c 1 -e signed-integer -L -' + +const SILENCE_SENTINELS = new Set([ + '[No speech detected]', + '[Audio too short]', + '[Model not ready]', + '[No speakers detected]' +]) + +function isSilenceText (text) { + return text.length === 0 || SILENCE_SENTINELS.has(text) +} + +function buildSegmentText (items) { + let text = '' + let firstStartsWord = true + let isFirst = true + for (const s of items) { + if (!s || !s.text || !s.toAppend) continue + const sw = s.startsWord !== false + if (isFirst) 
{ + firstStartsWord = sw + text = s.text + isFirst = false + } else { + text += (sw ? ' ' : '') + s.text + } + } + return { text: text.replace(/\s+/g, ' '), firstStartsWord } +} + +function parseSortformerSpeakerId (text) { + const m = typeof text === 'string' + ? text.match(/Speaker\s+(\d+)/) + : null + return m ? parseInt(m[1], 10) : -1 +} + +function parseBoolFlag (value) { + if (value === undefined || value === null) return undefined + const normalised = String(value).toLowerCase() + if (normalised === 'true' || normalised === '1' || normalised === 'yes') return true + if (normalised === 'false' || normalised === '0' || normalised === 'no') return false + return undefined +} + +function parsePositiveInt (value) { + const n = parseInt(value, 10) + return Number.isFinite(n) && n > 0 ? n : null +} + +function parseNonNegativeInt (value) { + const n = parseInt(value, 10) + return Number.isFinite(n) && n >= 0 ? n : null +} + +function parseArgs () { + const args = { + asrModel: null, + diarModel: null, + accumulate: false, + capture: null, + chunkMs: null, + spkCacheEnable: undefined, + spkCacheLen: null, + fifoLen: null, + chunkLeftContextMs: null, + chunkRightContextMs: null, + spkCacheUpdatePeriod: null + } + const argv = Bare.argv.slice(2) + for (let i = 0; i < argv.length; i++) { + const a = argv[i] + if (a === '--asr-model' || a === '-m') args.asrModel = argv[++i] + else if (a === '--diar-model' || a === '-d') args.diarModel = argv[++i] + else if (a === '--accumulate') args.accumulate = true + else if (a === '--capture' || a === '-c') args.capture = argv[++i] + else if (a === '--chunk-ms') { + const v = parsePositiveInt(argv[++i]) + if (v !== null && v >= 200) args.chunkMs = v + } else if (a === '--spk-cache-enable') { + const v = parseBoolFlag(argv[++i]) + if (v !== undefined) args.spkCacheEnable = v + } else if (a === '--spk-cache-len') args.spkCacheLen = parsePositiveInt(argv[++i]) + else if (a === '--fifo-len') args.fifoLen = parsePositiveInt(argv[++i]) + 
else if (a === '--chunk-left-context-ms') args.chunkLeftContextMs = parseNonNegativeInt(argv[++i]) + else if (a === '--chunk-right-context-ms') args.chunkRightContextMs = parseNonNegativeInt(argv[++i]) + else if (a === '--spk-cache-update-period') args.spkCacheUpdatePeriod = parsePositiveInt(argv[++i]) + } + return args +} + +function buildDiarConfig (args) { + const config = { + streaming: true, + streamingChunkMs: args.chunkMs ?? 2000, + useGPU: true + } + if (args.spkCacheEnable !== undefined) config.streamingSpkCacheEnable = args.spkCacheEnable + if (args.spkCacheLen !== null) config.streamingSpkCacheLen = args.spkCacheLen + if (args.fifoLen !== null) config.streamingFifoLen = args.fifoLen + if (args.chunkLeftContextMs !== null) config.streamingChunkLeftContextMs = args.chunkLeftContextMs + if (args.chunkRightContextMs !== null) config.streamingChunkRightContextMs = args.chunkRightContextMs + if (args.spkCacheUpdatePeriod !== null) config.streamingSpkCacheUpdatePeriod = args.spkCacheUpdatePeriod + return config +} + +function describeAoscConfig (config) { + const parts = [] + if ('streamingSpkCacheEnable' in config) parts.push(`spkCacheEnable=${config.streamingSpkCacheEnable}`) + if ('streamingSpkCacheLen' in config) parts.push(`spkCacheLen=${config.streamingSpkCacheLen}`) + if ('streamingFifoLen' in config) parts.push(`fifoLen=${config.streamingFifoLen}`) + if ('streamingChunkLeftContextMs' in config) parts.push(`chunkLeftContextMs=${config.streamingChunkLeftContextMs}`) + if ('streamingChunkRightContextMs' in config) parts.push(`chunkRightContextMs=${config.streamingChunkRightContextMs}`) + if ('streamingSpkCacheUpdatePeriod' in config) parts.push(`spkCacheUpdatePeriod=${config.streamingSpkCacheUpdatePeriod}`) + return parts.length === 0 ? 
'(all AOSC defaults)' : parts.join(' ') +} + +async function main () { + const args = parseArgs() + if (!args.asrModel || !args.diarModel) { + console.error('Usage: bare examples/live-mic-diarized-aosc.js --asr-model --diar-model [--accumulate] [--chunk-ms ] [--capture ""] [--spk-cache-enable {true|false}] [--spk-cache-len ] [--fifo-len ] [--chunk-left-context-ms ] [--chunk-right-context-ms ] [--spk-cache-update-period ]') + process.exit(1) + } + + setupLogger(addonLogging) + let stopping = false + + const asrPath = path.resolve(args.asrModel) + const diarPath = path.resolve(args.diarModel) + if (!validatePaths({ model: asrPath })) { addonLogging.releaseLogger(); process.exit(1) } + if (!validatePaths({ model: diarPath })) { addonLogging.releaseLogger(); process.exit(1) } + + console.log(`Loading ASR: ${asrPath}`) + console.log(`Loading DIAR: ${diarPath}`) + + const diarConfig = buildDiarConfig(args) + console.log(`AOSC config: ${describeAoscConfig(diarConfig)}`) + + const asr = new TranscriptionParakeet({ + files: { model: asrPath }, + config: { + parakeetConfig: { + streaming: true, + streamingChunkMs: args.chunkMs ?? 2000, + useGPU: true + } + } + }) + const diar = new TranscriptionParakeet({ + files: { model: diarPath }, + config: { parakeetConfig: diarConfig } + }) + + await asr.load() + await diar.load() + console.log('Listening (Ctrl-C to stop)...\n') + + const captureCmd = args.capture && args.capture.length > 0 ? 
args.capture : CAPTURE_CMD + const [captureBin, ...captureArgs] = captureCmd.split(' ') + let child + try { + child = subprocess.spawn(captureBin, captureArgs, + { stdio: ['ignore', 'pipe', 'pipe'] }) + } catch (err) { + if (err && err.code === 'ENOENT') { + console.error(`\n'${captureBin}' not found on PATH.`) + console.error('Install sox (brew install sox / apt install sox / choco install sox / winget install ChrisBagwell.SoX).') + } else { + console.error(`\nFailed to spawn capture command: ${err.message}`) + } + addonLogging.releaseLogger() + process.exit(1) + } + child.on('error', (err) => { + console.error(`\nCapture command failed: ${err.message}`) + process.exit(1) + }) + + let firstAudioSeen = false + let stderrBuf = '' + child.stderr.on('data', (chunk) => { + stderrBuf += chunk.toString('utf8') + if (stderrBuf.length > 8192) stderrBuf = stderrBuf.slice(-8192) + }) + + let lineOpen = false + let lineSpeaker = null + let lastSpeaker = -1 + + function flushLine () { + if (lineOpen) { + process.stdout.write('\n') + lineOpen = false + lineSpeaker = null + } + } + function emitTranscript (speaker, text, firstStartsWord) { + if (isSilenceText(text)) { + if (args.accumulate) flushLine() + return + } + const tag = speaker >= 0 ? `speaker_${speaker}` : 'speaker_?' + const ts = new Date().toISOString().slice(11, 19) + if (args.accumulate) { + if (lineOpen && lineSpeaker !== speaker) flushLine() + if (!lineOpen) { + process.stdout.write(`[${ts}] ${tag}: ${text}`) + lineOpen = true + lineSpeaker = speaker + } else { + process.stdout.write((firstStartsWord ? 
' ' : '') + text) + } + } else { + console.log(`[${ts}] ${tag}: ${text}`) + } + } + + const asrStream = pushableStream() + const diarStream = pushableStream() + child.stdout.on('data', (chunk) => { + if (!firstAudioSeen) firstAudioSeen = true + if (stopping) return + asrStream.push(chunk) + diarStream.push(chunk) + }) + + const streamingConfig = {} + if (args.chunkMs !== null) streamingConfig.chunkMs = args.chunkMs + + const diarRunPromise = (async () => { + const response = await diar.runStreaming(diarStream, streamingConfig) + await response + .onUpdate(out => { + const items = Array.isArray(out) ? out : [out] + for (let i = items.length - 1; i >= 0; i--) { + const s = items[i] + if (!s || !s.text || isSilenceText(s.text)) continue + const id = parseSortformerSpeakerId(s.text) + if (id >= 0) { + lastSpeaker = id + break + } + } + }) + .await() + })() + + const asrRunPromise = (async () => { + const response = await asr.runStreaming(asrStream, streamingConfig) + await response + .onUpdate(out => { + const items = Array.isArray(out) ? 
out : [out] + const { text, firstStartsWord } = buildSegmentText(items) + emitTranscript(lastSpeaker, text.trim(), firstStartsWord) + }) + .await() + })() + + async function shutdown () { + if (stopping) return + stopping = true + console.log('\nStopping...') + try { child.kill('SIGTERM') } catch (e) { /* ignore */ } + asrStream.end() + diarStream.end() + try { await Promise.all([asrRunPromise, diarRunPromise]) } catch (e) { /* swallow */ } + flushLine() + try { await asr.unload() } catch (e) { /* ignore */ } + try { await diar.unload() } catch (e) { /* ignore */ } + addonLogging.releaseLogger() + process.exit(0) + } + + process.once('SIGINT', shutdown) + process.once('SIGTERM', shutdown) + child.on('exit', (code, signal) => { + if (!firstAudioSeen && !stopping) { + console.error(`\nCapture command exited before producing audio (code=${code}, signal=${signal}).`) + const tail = stderrBuf.trim() + if (tail) { + console.error('--- sox stderr ---') + console.error(tail) + console.error('------------------') + } + console.error('Hints:') + console.error(' - On Windows, try: --capture "sox -t waveaudio default -t raw -r 16000 -b 16 -c 1 -e signed-integer -L -"') + console.error(' - Verify a default recording device exists (Settings -> System -> Sound -> Input).') + console.error(' - Confirm SoX can list audio devices: sox -V6 -d -t raw -r 16000 -c 1 -e signed-integer -b 16 -L - 2>&1 | head') + } + shutdown() + }) +} + +main().catch(err => { + console.error('Error:', err) + addonLogging.releaseLogger() + process.exit(1) +}) diff --git a/packages/transcription-parakeet/examples/live-mic-diarized.js b/packages/transcription-parakeet/examples/live-mic-diarized.js index 47808ea2f4..ada13e5b32 100644 --- a/packages/transcription-parakeet/examples/live-mic-diarized.js +++ b/packages/transcription-parakeet/examples/live-mic-diarized.js @@ -10,16 +10,27 @@ * Sortformer segment; the ASR side tags each printed transcript with * `lastSpeaker`. Press Ctrl-C to flush and exit. 
* - * Diarization tagging is best-effort. Sortformer's streaming session - * is permutation-invariant per chunk and prone to occasional - * speaker-ID drift on continuous single-speaker stretches once two - * voices have been seen in the rolling-history window. parakeet-cpp - * documents this behaviour in + * Recommended `--diar-model`: the v2.1 Sortformer GGUF + * (`diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`). parakeet-cpp + * detects v2.1 from the GGUF metadata tag + * `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"` and + * enables AOSC (Audio-Online Speaker Cache) automatically, which + * anchors speaker slots across silence and re-entry and largely + * removes the drift caveat described below. + * + * For an AOSC-aware variant that also exposes the speaker-cache + * tuning knobs from the CLI, see `examples/live-mic-diarized-aosc.js`. + * + * v1 caveat (kept for users running the older v1 GGUF): Sortformer's + * streaming session is permutation-invariant per chunk and prone to + * occasional speaker-ID drift on continuous single-speaker stretches + * once two voices have been seen in the rolling-history window. + * parakeet-cpp documents this behaviour in * `parakeet-cpp/include/parakeet/diarization.h:80-82`. Fixing it - * properly requires per-segment voice embeddings (currently not - * exposed by the engine) -- this example therefore renders the raw - * Sortformer ID and accepts the occasional mis-tag rather than try - * to second-guess the model in JS. + * properly required per-segment voice embeddings (now solved by v2.1's + * AOSC) -- this example therefore renders the raw Sortformer ID and + * accepts the occasional mis-tag rather than try to second-guess the + * model in JS. * * Usage: * bare examples/live-mic-diarized.js \ @@ -98,9 +109,14 @@ function parseArgs () { } // Pin the Sortformer rolling-history window at parakeet-cpp's default -// (30 s). 
Pushing past it puts the input outside the window the -// underlying model was trained on, which empirically causes the engine -// to collapse all voices onto sortformer_0. +// (30 s). Pushing past it on a v1 GGUF puts the input outside the +// window the underlying model was trained on, which empirically causes +// the engine to collapse all voices onto sortformer_0. +// +// On a v2.1 GGUF, AOSC is auto-enabled and supersedes this rolling +// window with a NeMo-port speaker cache. parakeet-cpp ignores +// `history_ms` for v2.1 sessions, so this constant is harmless either +// way and is kept for backwards compatibility with v1 GGUFs. const STREAMING_HISTORY_MS = 30000 // Pull the Sortformer speaker_id out of the addon's segment text diff --git a/packages/transcription-parakeet/index.d.ts b/packages/transcription-parakeet/index.d.ts index 3543cacbae..408b46bccc 100644 --- a/packages/transcription-parakeet/index.d.ts +++ b/packages/transcription-parakeet/index.d.ts @@ -71,6 +71,30 @@ declare interface ParakeetConfig { * (2000 ms). ASR sessions only. */ streamingRightLookaheadMs?: number + + /** + * AOSC (Audio-Online Speaker Cache): enable v2.1 Sortformer's + * speaker-cache streaming. Ignored on v1/v2 Sortformer GGUFs and on + * non-Sortformer models. Set false to force a v2.1 model onto the + * v1 sliding-window path (e.g. for A/B comparison). Default: true. + * + * The cache anchors each speaker to a stable slot across silence and + * re-entry, fixing the per-chunk permutation-invariance drift that v1 + * suffers from when two voices have been seen in the rolling window. + * v2.1 is auto-detected from the GGUF metadata tag + * `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`. + */ + streamingSpkCacheEnable?: boolean + /** AOSC: long-term speaker-cache rows (~15 s of encoder frames). Default: 188. */ + streamingSpkCacheLen?: number + /** AOSC: FIFO warmup buffer rows. Default: 188. 
*/ + streamingFifoLen?: number + /** AOSC: encoder left-context window (ms; ~1 encoder frame). Default: 80. */ + streamingChunkLeftContextMs?: number + /** AOSC: encoder right-context window (ms; ~7 encoder frames). Default: 560. */ + streamingChunkRightContextMs?: number + /** AOSC: FIFO-overflow pop-out count. Default: 144. */ + streamingSpkCacheUpdatePeriod?: number } /** @@ -175,6 +199,18 @@ declare interface StreamingRunConfig { emitPartials?: boolean /** CTC/TDT-only energy-VAD events. */ emitEnergyVad?: boolean + /** AOSC: enable/disable v2.1 speaker cache (overrides `streamingSpkCacheEnable`). */ + spkCacheEnable?: boolean + /** AOSC: long-term speaker-cache rows (overrides `streamingSpkCacheLen`). */ + spkCacheLen?: number + /** AOSC: FIFO warmup buffer rows (overrides `streamingFifoLen`). */ + fifoLen?: number + /** AOSC: encoder left-context window in ms (overrides `streamingChunkLeftContextMs`). */ + chunkLeftContextMs?: number + /** AOSC: encoder right-context window in ms (overrides `streamingChunkRightContextMs`). */ + chunkRightContextMs?: number + /** AOSC: FIFO-overflow pop-out count (overrides `streamingSpkCacheUpdatePeriod`). */ + spkCacheUpdatePeriod?: number } /** diff --git a/packages/transcription-parakeet/index.js b/packages/transcription-parakeet/index.js index 2f0fe0a896..74196d565a 100644 --- a/packages/transcription-parakeet/index.js +++ b/packages/transcription-parakeet/index.js @@ -111,7 +111,19 @@ class TranscriptionParakeet { streamingEmitPartials: this.params.streamingEmitPartials !== false, streamingEnergyVad: this.params.streamingEnergyVad === true, streamingLeftContextMs: this.params.streamingLeftContextMs ?? -1, - streamingRightLookaheadMs: this.params.streamingRightLookaheadMs ?? -1 + streamingRightLookaheadMs: this.params.streamingRightLookaheadMs ?? -1, + // AOSC (v2.1+ Sortformer only). parakeet-cpp ignores these on + // non-Sortformer engines and on v1/v2 GGUFs. 
Defaults mirror the + // C++ ParakeetConfig defaults; passing the field explicitly (vs + // letting C++ pick its own default) ensures user overrides at + // the JS layer reach the native engine instead of being silently + // discarded by _buildConfigurationParams. + streamingSpkCacheEnable: this.params.streamingSpkCacheEnable !== false, + streamingSpkCacheLen: this.params.streamingSpkCacheLen ?? 188, + streamingFifoLen: this.params.streamingFifoLen ?? 188, + streamingChunkLeftContextMs: this.params.streamingChunkLeftContextMs ?? 80, + streamingChunkRightContextMs: this.params.streamingChunkRightContextMs ?? 560, + streamingSpkCacheUpdatePeriod: this.params.streamingSpkCacheUpdatePeriod ?? 144 } } diff --git a/packages/transcription-parakeet/package.json b/packages/transcription-parakeet/package.json index 9804228c29..7856703f57 100644 --- a/packages/transcription-parakeet/package.json +++ b/packages/transcription-parakeet/package.json @@ -1,6 +1,6 @@ { "name": "@qvac/transcription-parakeet", - "version": "0.4.0", + "version": "0.5.0", "description": "High-performance speech-to-text inference addon using NVIDIA Parakeet models for Bare runtime", "addon": true, "engines": { diff --git a/packages/transcription-parakeet/parakeet.js b/packages/transcription-parakeet/parakeet.js index 79f0db2243..78bc56691e 100644 --- a/packages/transcription-parakeet/parakeet.js +++ b/packages/transcription-parakeet/parakeet.js @@ -57,6 +57,20 @@ class ParakeetInterface { * left context (parakeet default 10000 ms; -1 keeps the engine default). * @param {number} [configurationParams.streamingRightLookaheadMs] - ASR encoder * right lookahead (parakeet default 2000 ms; -1 keeps the engine default). + * @param {boolean} [configurationParams.streamingSpkCacheEnable=true] - AOSC: + * enable v2.1 Sortformer speaker-cache streaming. Ignored on v1/v2 GGUFs + * and on non-Sortformer models. Set false to force the v1 sliding-window + * path on a v2.1 model (A/B comparison). 
+ * @param {number} [configurationParams.streamingSpkCacheLen=188] - AOSC: + * long-term speaker-cache rows (~15 s of encoder frames). + * @param {number} [configurationParams.streamingFifoLen=188] - AOSC: FIFO + * warmup buffer rows. + * @param {number} [configurationParams.streamingChunkLeftContextMs=80] - + * AOSC: encoder left-context window (ms; ~1 encoder frame). + * @param {number} [configurationParams.streamingChunkRightContextMs=560] - + * AOSC: encoder right-context window (ms; ~7 encoder frames). + * @param {number} [configurationParams.streamingSpkCacheUpdatePeriod=144] - + * AOSC: FIFO-overflow pop-out count. * @param {Function} outputCallback - callback for transcription output events * @param {Function} [stateCallback] - callback for state transitions */ @@ -453,6 +467,12 @@ class ParakeetInterface { * @param {number} [config.rightLookaheadMs] - ASR encoder right lookahead (overrides cfg.streamingRightLookaheadMs) * @param {boolean} [config.emitPartials] - emit partial segments on chunk boundaries * @param {boolean} [config.emitEnergyVad] - surface energy-VAD events for CTC/TDT + * @param {boolean} [config.spkCacheEnable] - AOSC: enable/disable v2.1 speaker cache (overrides cfg.streamingSpkCacheEnable) + * @param {number} [config.spkCacheLen] - AOSC: long-term speaker-cache rows (overrides cfg.streamingSpkCacheLen) + * @param {number} [config.fifoLen] - AOSC: FIFO warmup buffer rows (overrides cfg.streamingFifoLen) + * @param {number} [config.chunkLeftContextMs] - AOSC: encoder left-context window in ms (overrides cfg.streamingChunkLeftContextMs) + * @param {number} [config.chunkRightContextMs] - AOSC: encoder right-context window in ms (overrides cfg.streamingChunkRightContextMs) + * @param {number} [config.spkCacheUpdatePeriod] - AOSC: FIFO-overflow pop-out count (overrides cfg.streamingSpkCacheUpdatePeriod) * @returns {Promise} jobId assigned to the streaming session */ async startStreaming (config = {}) { diff --git 
a/packages/transcription-parakeet/test/integration/helpers.js b/packages/transcription-parakeet/test/integration/helpers.js index 27960763bf..0d34f03245 100644 --- a/packages/transcription-parakeet/test/integration/helpers.js +++ b/packages/transcription-parakeet/test/integration/helpers.js @@ -802,6 +802,19 @@ const MODEL_CONFIGS = { mobileFile: 'sortformer-4spk-v1.q4_0.gguf', minSize: 50 * 1024 * 1024, url: null + }, + // Streaming-default Sortformer (v2.1 + NeMo-port AOSC). The AOSC + // speaker cache anchors slot identity across silence and re-entry, + // fixing the per-chunk drift v1 shows when two voices have been seen + // in the rolling-history window. Auto-enabled by parakeet-cpp when the + // GGUF carries `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`. + // The GGUF needs to be staged (npm run setup-models / QVAC_TEST_GGUF_DIR) + // before sortformer-streaming tests can run; otherwise they skip. + sortformerStreaming: { + file: 'diar_streaming_sortformer_4spk-v2.1.q8_0.gguf', + mobileFile: 'diar_streaming_sortformer_4spk-v2.1.q4_0.gguf', + minSize: 50 * 1024 * 1024, + url: null } } diff --git a/packages/transcription-parakeet/test/integration/sortformer-aosc-streaming.test.js b/packages/transcription-parakeet/test/integration/sortformer-aosc-streaming.test.js new file mode 100644 index 0000000000..f3749349a9 --- /dev/null +++ b/packages/transcription-parakeet/test/integration/sortformer-aosc-streaming.test.js @@ -0,0 +1,222 @@ +'use strict' + +/** + * Sortformer v2.1 + AOSC streaming integration test. + * + * Verifies that: + * 1. The v2.1 Sortformer GGUF loads and the JS-side AOSC config + * knobs flow through the native binding without errors. + * 2. A streaming diarization session with default AOSC config emits + * well-formed speaker segments matching the + * "Speaker N: HH:MM:SS.fff - HH:MM:SS.fff" pattern that the + * offline diarization path also produces. + * 3. 
Forcing `streamingSpkCacheEnable: false` on the same v2.1 GGUF + * falls back to the v1 sliding-window path cleanly (still emits + * segments; just without the AOSC stability guarantees). + * + * The full AOSC slot-stability contract (same speaker -> same hyp_ + * across non-contiguous re-entries) is verified at C++ level by + * `parakeet-cpp/test/test_sortformer_aosc_speakers.cpp` using the + * `abcba.wav` / `abcdba.wav` fixtures. This JS-level test focuses on + * wiring correctness; if it passes, the AOSC knobs are reaching the + * engine and parakeet-cpp's own regression tests cover the runtime + * behaviour. + * + * Skips cleanly when the v2.1 GGUF is missing + * (`MODEL_CONFIGS.sortformerStreaming`); the file isn't bundled with + * the repo -- stage it via `npm run setup-models` or by pointing + * `QVAC_TEST_GGUF_DIR` at a directory containing + * `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. + */ + +const test = require('brittle') +const fs = require('bare-fs') +const path = require('bare-path') +const { + binding, + TranscriptionParakeet, + setupJsLogger, + getTestPaths, + loadGgufOrSkip +} = require('./helpers.js') + +const { samplesDir } = getTestPaths() + +const SAMPLE_RATE = 16000 +const STREAM_CHUNK_MS = 2000 +const FEED_CHUNK_MS = 500 + +function loadAudioSample () { + const samplePath = path.join(samplesDir, 'sample.raw') + if (!fs.existsSync(samplePath)) return null + const rawBuffer = fs.readFileSync(samplePath) + const pcm = new Int16Array( + rawBuffer.buffer, rawBuffer.byteOffset, rawBuffer.length / 2) + const audio = new Float32Array(pcm.length) + for (let i = 0; i < pcm.length; i++) audio[i] = pcm[i] / 32768.0 + return audio +} + +function pushableStream () { + const queue = [] + let waiter = null + let ended = false + return { + push (chunk) { + if (ended) return + queue.push(chunk) + if (waiter) { const w = waiter; waiter = null; w() } + }, + end () { + ended = true + if (waiter) { const w = waiter; waiter = null; w() } + }, + async * 
[Symbol.asyncIterator] () { + while (true) { + if (queue.length > 0) { yield queue.shift(); continue } + if (ended) return + await new Promise(resolve => { waiter = resolve }) + } + } + } +} + +async function feedAndCollect (model, audio) { + const samplesPerChunk = Math.floor((FEED_CHUNK_MS / 1000) * SAMPLE_RATE) + const stream = pushableStream() + const segments = [] + + const response = await model.runStreaming(stream) + const updateDone = response + .onUpdate(out => { + const items = Array.isArray(out) ? out : [out] + for (const seg of items) { + if (!seg || !seg.text) continue + segments.push(seg) + } + }) + .await() + + for (let i = 0; i < audio.length; i += samplesPerChunk) { + const endIdx = Math.min(i + samplesPerChunk, audio.length) + const chunk = new Float32Array(audio.slice(i, endIdx)) + stream.push(chunk) + if (i + samplesPerChunk < audio.length) { + await new Promise(resolve => setTimeout(resolve, FEED_CHUNK_MS)) + } + } + stream.end() + await updateDone + + return segments +} + +// Pull "Speaker N" out of the addon's emitted text. Returns -1 when +// the text doesn't match (e.g. silence sentinels). Mirrors the parser +// used by examples/live-mic-diarized.js so the assertion below stays +// in sync with the actual contract consumers rely on. +function parseSpeakerId (text) { + const m = typeof text === 'string' ? text.match(/Speaker\s+(\d+)/) : null + return m ? 
parseInt(m[1], 10) : -1 +} + +test('Sortformer v2.1 AOSC — default config streams diarization segments', + { timeout: 600000 }, async (t) => { + const loggerBinding = setupJsLogger(binding) + + try { + const modelPath = await loadGgufOrSkip(t, 'sortformerStreaming') + if (!modelPath) return + + const audio = loadAudioSample() + if (!audio) { t.pass('sample.raw not found - skipping'); return } + + const model = new TranscriptionParakeet({ + files: { model: modelPath }, + config: { + parakeetConfig: { + streaming: true, + streamingChunkMs: STREAM_CHUNK_MS, + // streamingSpkCacheEnable defaults to true; left unset so + // the AOSC default path runs as it would for real users. + maxThreads: 4, + useGPU: false + } + } + }) + + try { + await model.load() + const segments = await feedAndCollect(model, audio) + + t.ok(segments.length > 0, + `AOSC streaming should emit at least one segment (got ${segments.length})`) + + const speakerIds = segments + .map(s => parseSpeakerId(s.text)) + .filter(id => id >= 0) + t.ok(speakerIds.length > 0, + 'segments should match the "Speaker N: ..."
format') + + const distinctIds = new Set(speakerIds) + console.log( + `[aosc/default] segments=${segments.length} ` + + `speakers=${distinctIds.size} ids=[${[...distinctIds].sort().join(',')}]`) + } finally { + try { await model.unload() } catch (e) { /* ignore */ } + } + } finally { + try { loggerBinding.releaseLogger() } catch (e) { /* ignore */ } + } + }) + +test('Sortformer v2.1 AOSC — streamingSpkCacheEnable=false falls back to v1 path', + { timeout: 600000 }, async (t) => { + const loggerBinding = setupJsLogger(binding) + + try { + const modelPath = await loadGgufOrSkip(t, 'sortformerStreaming') + if (!modelPath) return + + const audio = loadAudioSample() + if (!audio) { t.pass('sample.raw not found - skipping'); return } + + const model = new TranscriptionParakeet({ + files: { model: modelPath }, + config: { + parakeetConfig: { + streaming: true, + streamingChunkMs: STREAM_CHUNK_MS, + // Force the v1 sliding-window code path on the v2.1 GGUF. + // The engine must accept this without errors and continue + // to emit speaker segments; speaker IDs may drift in ways + // they would not with AOSC active. + streamingSpkCacheEnable: false, + maxThreads: 4, + useGPU: false + } + } + }) + + try { + await model.load() + const segments = await feedAndCollect(model, audio) + + t.ok(segments.length > 0, + 'v1-path streaming should still emit at least one segment ' + + `(got ${segments.length})`) + + const speakerIds = segments + .map(s => parseSpeakerId(s.text)) + .filter(id => id >= 0) + t.ok(speakerIds.length > 0, + 'segments should match the "Speaker N: ..."
format') + + console.log(`[aosc/disabled] segments=${segments.length}`) + } finally { + try { await model.unload() } catch (e) { /* ignore */ } + } + } finally { + try { loggerBinding.releaseLogger() } catch (e) { /* ignore */ } + } + }) diff --git a/packages/transcription-parakeet/vcpkg.json b/packages/transcription-parakeet/vcpkg.json index 087ec3c7ff..5aee8ff9c0 100644 --- a/packages/transcription-parakeet/vcpkg.json +++ b/packages/transcription-parakeet/vcpkg.json @@ -1,22 +1,22 @@ { "name": "transcription-parakeet", - "version": "0.4.0", + "version": "0.5.0", "dependencies": [ { "name": "parakeet-cpp", - "version>=": "2026-05-05#1", + "version>=": "2026-05-20", "features": ["metal"], "platform": "osx | ios" }, { "name": "parakeet-cpp", - "version>=": "2026-05-05#1", + "version>=": "2026-05-20", "features": ["vulkan", "opencl"], "platform": "android" }, { "name": "parakeet-cpp", - "version>=": "2026-05-05#1", + "version>=": "2026-05-20", "features": ["vulkan"], "platform": "!(osx | ios | android)" }, From decff740564d4c02cd4d4821c5f2f95afa1d1409 Mon Sep 17 00:00:00 2001 From: Pratik Narola Date: Wed, 20 May 2026 17:23:55 +0530 Subject: [PATCH 2/5] =?UTF-8?q?chore[notask]:=20address=20review=20?= =?UTF-8?q?=E2=80=94=20setup-models=20v2.1=20+=20CHANGELOG=20[Unreleased]?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two reviewer follow-ups on the v2.1 + AOSC PR: 1. `npm run setup-models` now fetches + converts v2.1 sortformer. - download-models.sh: new `sortformer-streaming-v2.1` type pulling from https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1/resolve/main/diar_streaming_sortformer_4spk-v2.1.nemo - convert-nemo.sh: matching type maps .nemo -> `diar_streaming_sortformer_4spk-v2.1.${q}.gguf`. - `--type all` (default) now includes the new type, so `npm run setup-models` stages v2.1 alongside the other models. 
- convert-nemo-to-gguf.py: surgically picked up PR #24's variant emission (the `detect_sortformer_variant(ckpt)` helper + `writer.add_string("parakeet.model_variant", ...)` call) without touching local qvac divergences (vendored attribution header, descriptive docstrings, `--quant f16` default, and the huggingface_hub import-error helper). The C++ engine's strict v2.1 detection now matches on `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"` instead of falling back to the encoder-shape heuristic. - Verified end-to-end locally: `bash scripts/convert-nemo.sh --type sortformer-streaming-v2.1 --quant q8_0 --force` produces models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf and the resulting GGUF carries `parakeet.model_variant = "sortformer-streaming-v2.1-aosc"` (confirmed via gguf reader). 2. CHANGELOG entry moved under `## [Unreleased]`; version bumps in package.json + vcpkg.json reverted to 0.4.0. The release PR will promote `[Unreleased]` -> `[0.5.0]` and bump the versions then. --- packages/transcription-parakeet/CHANGELOG.md | 2 +- packages/transcription-parakeet/package.json | 2 +- .../scripts/convert-nemo-to-gguf.py | 27 +++++++++++++++++-- .../scripts/convert-nemo.sh | 11 +++++--- .../scripts/download-models.sh | 10 ++++--- packages/transcription-parakeet/vcpkg.json | 2 +- 6 files changed, 41 insertions(+), 13 deletions(-) diff --git a/packages/transcription-parakeet/CHANGELOG.md b/packages/transcription-parakeet/CHANGELOG.md index 67dbabd684..e35c0b4c2f 100644 --- a/packages/transcription-parakeet/CHANGELOG.md +++ b/packages/transcription-parakeet/CHANGELOG.md @@ -5,7 +5,7 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 
-## [0.5.0] +## [Unreleased] In this release we expose the v2.1 streaming Sortformer model with NeMo-port AOSC (Audio-Online Speaker Cache) through the addon's public API. AOSC anchors each speaker to a stable cache slot across silence and re-entry, fixing the per-chunk permutation-invariance drift v1's sliding-window streaming exhibits once two voices have been seen. v2.1 becomes the recommended streaming Sortformer; v1 stays the offline-batch default. Six new optional config knobs surface the cache geometry for tuning and A/B comparison; defaults mirror parakeet-cpp's NeMo-port tuning so a bare `streaming: true` against a v2.1 GGUF Just Works. diff --git a/packages/transcription-parakeet/package.json b/packages/transcription-parakeet/package.json index 7856703f57..9804228c29 100644 --- a/packages/transcription-parakeet/package.json +++ b/packages/transcription-parakeet/package.json @@ -1,6 +1,6 @@ { "name": "@qvac/transcription-parakeet", - "version": "0.5.0", + "version": "0.4.0", "description": "High-performance speech-to-text inference addon using NVIDIA Parakeet models for Bare runtime", "addon": true, "engines": { diff --git a/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py b/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py index c707ee1788..e18d9f8467 100644 --- a/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py +++ b/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py @@ -217,7 +217,24 @@ def fuse_bn(weight, bias, running_mean, running_var, eps=1e-5): return scale.astype(np.float32), shift.astype(np.float32) -def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str): +def detect_sortformer_variant(ckpt: Path) -> str: + """ + Map a NeMo Sortformer .nemo filename to a stable variant tag the C++ + loader can match against. 
The tag is the only thing that distinguishes + cache-aware v2.1 from architecturally-identical v1 / v2 at GGUF time + (encoder shape alone is ambiguous against future variants). + """ + stem = ckpt.stem + if "streaming_sortformer" in stem and "-v2.1" in stem: + return "sortformer-streaming-v2.1-aosc" + if "streaming_sortformer" in stem and "-v2" in stem: + return "sortformer-streaming-v2" + if "diar_sortformer" in stem and "-v1" in stem: + return "sortformer-v1" + return "" + + +def write_gguf(out: Path, ckpt: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str): model_type = detect_model_type(cfg) enc = cfg["encoder"] @@ -349,6 +366,12 @@ def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str): writer.add_uint32("parakeet.sortformer.tf_n_heads", int(tfe["num_attention_heads"])) writer.add_bool ("parakeet.sortformer.tf_pre_ln", bool(tfe.get("pre_ln", False))) writer.add_string("parakeet.sortformer.tf_hidden_act", str(tfe.get("hidden_act", "relu"))) + # Variant tag (preferred over shape-based detection on the C++ side). + # Empty string = unknown checkpoint; loader falls back to encoder + # shape so older GGUFs continue to load. 
+ variant = detect_sortformer_variant(ckpt) + if variant: + writer.add_string("parakeet.model_variant", variant) else: pred_hidden = int(dec["prednet"]["pred_hidden"]) pred_rnn_layers = int(dec["prednet"]["pred_rnn_layers"]) @@ -628,7 +651,7 @@ def main(): ckpt = ensure_ckpt(args.ckpt, args.hf_repo) cfg, sd, tok_bytes = load_nemo(ckpt) args.out.parent.mkdir(parents=True, exist_ok=True) - write_gguf(args.out, cfg, sd, tok_bytes, args.quant) + write_gguf(args.out, ckpt, cfg, sd, tok_bytes, args.quant) if __name__ == "__main__": diff --git a/packages/transcription-parakeet/scripts/convert-nemo.sh b/packages/transcription-parakeet/scripts/convert-nemo.sh index cd7be608bd..33de47fb53 100644 --- a/packages/transcription-parakeet/scripts/convert-nemo.sh +++ b/packages/transcription-parakeet/scripts/convert-nemo.sh @@ -17,7 +17,8 @@ # ./scripts/convert-nemo.sh [flags] # # Flags: -# --type, -t Which model(s) (default: all) +# --type, -t +# Which model(s) (default: all) # --quant, -q Quant tier (default: q8_0) # --python Python interpreter (default: # $PYTHON, then ./venv/bin/python, @@ -62,8 +63,8 @@ while [[ $# -gt 0 ]]; do done case "$TYPE" in - ctc|tdt|eou|sortformer|all) ;; - *) echo "Error: --type must be ctc|tdt|eou|sortformer|all" >&2; exit 2;; + ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all) ;; + *) echo "Error: --type must be ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all" >&2; exit 2;; esac case "$QUANT" in f32|f16|q8_0|q5_0|q4_0) ;; @@ -128,6 +129,7 @@ nemo_filename() { tdt) echo "parakeet-tdt-0.6b-v3.nemo";; eou) echo "parakeet_realtime_eou_120m-v1.nemo";; sortformer) echo "diar_sortformer_4spk-v1.nemo";; + sortformer-streaming-v2.1) echo "diar_streaming_sortformer_4spk-v2.1.nemo";; esac } gguf_filename() { @@ -137,6 +139,7 @@ gguf_filename() { tdt) echo "parakeet-tdt-0.6b-v3.${q}.gguf";; eou) echo "parakeet-eou-120m-v1.${q}.gguf";; sortformer) echo "sortformer-4spk-v1.${q}.gguf";; + sortformer-streaming-v2.1) echo 
"diar_streaming_sortformer_4spk-v2.1.${q}.gguf";; esac } @@ -196,7 +199,7 @@ echo failures=0 if [[ "$TYPE" == "all" ]]; then - for t in ctc tdt eou sortformer; do + for t in ctc tdt eou sortformer sortformer-streaming-v2.1; do convert_one "$t" || failures=$((failures + 1)) done else diff --git a/packages/transcription-parakeet/scripts/download-models.sh b/packages/transcription-parakeet/scripts/download-models.sh index 5b2404117f..d9eeb00c05 100755 --- a/packages/transcription-parakeet/scripts/download-models.sh +++ b/packages/transcription-parakeet/scripts/download-models.sh @@ -12,7 +12,8 @@ # ./scripts/download-models.sh [flags] # # Flags: -# --type, -t Which model(s) (default: all) +# --type, -t +# Which model(s) (default: all) # --output, -o Destination dir (default: ./models/nemo) # --force, -f Re-download even if present # --help, -h Show this help @@ -43,8 +44,8 @@ while [[ $# -gt 0 ]]; do done case "$TYPE" in - ctc|tdt|eou|sortformer|all) ;; - *) echo "Error: --type must be ctc|tdt|eou|sortformer|all" >&2; exit 2;; + ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all) ;; + *) echo "Error: --type must be ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all" >&2; exit 2;; esac # Map model type -> { hf_repo, nemo_filename } @@ -54,6 +55,7 @@ nemo_url() { tdt) echo "https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3/resolve/main/parakeet-tdt-0.6b-v3.nemo";; eou) echo "https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1/resolve/main/parakeet_realtime_eou_120m-v1.nemo";; sortformer) echo "https://huggingface.co/nvidia/diar_sortformer_4spk-v1/resolve/main/diar_sortformer_4spk-v1.nemo";; + sortformer-streaming-v2.1) echo "https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1/resolve/main/diar_streaming_sortformer_4spk-v2.1.nemo";; esac } nemo_filename() { @@ -95,7 +97,7 @@ echo "Output: ${OUTPUT_DIR}" echo if [[ "$TYPE" == "all" ]]; then - for t in ctc tdt eou sortformer; do + for t in ctc tdt eou sortformer sortformer-streaming-v2.1; do 
fetch_nemo "$t" done else diff --git a/packages/transcription-parakeet/vcpkg.json b/packages/transcription-parakeet/vcpkg.json index 5aee8ff9c0..857968d084 100644 --- a/packages/transcription-parakeet/vcpkg.json +++ b/packages/transcription-parakeet/vcpkg.json @@ -1,6 +1,6 @@ { "name": "transcription-parakeet", - "version": "0.5.0", + "version": "0.4.0", "dependencies": [ { "name": "parakeet-cpp", From d4e5ccf5c5f6a43d90edb4c28de1e31849aadac9 Mon Sep 17 00:00:00 2001 From: Pratik Narola Date: Wed, 20 May 2026 18:09:44 +0530 Subject: [PATCH 3/5] fix[notask]: pin parakeet-cpp to 2026-05-20#1 to avoid orphan tree The registry's parakeet-cpp.json lists both 2026-05-20#0 and 2026-05-20#1 (PR #156 introduced both port-versions in its two commits before squash-merging). vcpkg's minimum-version-selection picks #0 when the manifest says `version>=: 2026-05-20`, but the #0 git-tree is orphaned by the squash merge -- unreachable from main, so `git fetch HEAD` doesn't pull it in. CI fails with: fatal: failed to unpack tree object 91a6fc169003b70dcc66b82ca8d1d23445343127 note: while loading parakeet-cpp@2026-05-20 Pinning `version>=: 2026-05-20#1` skips the orphan and resolves to the actual port content on main (tree 69619b43...). Matches the existing `qvac-lint-cpp >= 1.4.4#3` precedent in the same file. Local clean build (no overlay, no cached registry) succeeds. 
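
The resolution behaviour described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not vcpkg's resolver: it assumes the resolved version is the highest `version>=` lower bound in the manifest, that a constraint without a `#N` suffix means port-version 0, and it ignores the registry baseline. `PortVersion`, `parse`, `resolve`, and the `reachable` set are hypothetical names introduced for this note.

```python
from typing import NamedTuple


class PortVersion(NamedTuple):
    """A vcpkg-style version: date string plus port-version (the `#N` suffix)."""
    version: str
    port: int


def parse(spec: str) -> PortVersion:
    # "2026-05-20" -> port-version 0; "2026-05-20#1" -> port-version 1.
    version, _, port = spec.partition("#")
    return PortVersion(version, int(port) if port else 0)


def resolve(constraints: list[str]) -> PortVersion:
    # Simplified model (assumption): take the highest lower bound among the
    # `version>=` constraints and never pick anything newer than that.
    # ISO dates compare correctly as plain strings.
    return max(parse(c) for c in constraints)


# Hypothetical: the only port-version whose git-tree is reachable from main.
reachable = {PortVersion("2026-05-20", 1)}

before = resolve(["2026-05-20"])    # manifest before this commit
after = resolve(["2026-05-20#1"])   # manifest after this commit
print(before, before in reachable)
print(after, after in reachable)
```

Under these assumptions the unpinned constraint lands on port-version 0, which is not in the reachable set, while the `#1` pin resolves to the tree that actually exists on main.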
--- packages/transcription-parakeet/vcpkg.json | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/packages/transcription-parakeet/vcpkg.json b/packages/transcription-parakeet/vcpkg.json index 857968d084..3c9b3083e8 100644 --- a/packages/transcription-parakeet/vcpkg.json +++ b/packages/transcription-parakeet/vcpkg.json @@ -4,19 +4,19 @@ "dependencies": [ { "name": "parakeet-cpp", - "version>=": "2026-05-20", + "version>=": "2026-05-20#1", "features": ["metal"], "platform": "osx | ios" }, { "name": "parakeet-cpp", - "version>=": "2026-05-20", + "version>=": "2026-05-20#1", "features": ["vulkan", "opencl"], "platform": "android" }, { "name": "parakeet-cpp", - "version>=": "2026-05-20", + "version>=": "2026-05-20#1", "features": ["vulkan"], "platform": "!(osx | ios | android)" }, From 53c88d125fb683dd2ef4a85fb541984d0188f323 Mon Sep 17 00:00:00 2001 From: GustavoA1604 Date: Wed, 20 May 2026 13:06:05 -0300 Subject: [PATCH 4/5] cpp lint format --- .../addon/src/addon/AddonJs.hpp | 41 +++++++++++-------- .../addon/src/js-interface/JSAdapter.cpp | 5 ++- .../ParakeetStreamingProcessor.cpp | 8 ++-- .../ParakeetStreamingProcessor.hpp | 12 +++--- .../parakeet/ParakeetConfig.hpp | 15 +++---- .../parakeet/ParakeetModel.cpp | 8 ++-- .../parakeet/ParakeetModel.hpp | 16 +++----- 7 files changed, 55 insertions(+), 50 deletions(-) diff --git a/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp b/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp index e91833e726..02176d0311 100644 --- a/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp +++ b/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp @@ -164,12 +164,13 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try { config.leftContextMs = parakeetModel.getStreamingLeftContextMs(); config.rightLookaheadMs = parakeetModel.getStreamingRightLookaheadMs(); // AOSC defaults sourced from the model's load-time ParakeetConfig. 
- config.spkCacheEnable = parakeetModel.getStreamingSpkCacheEnable(); - config.spkCacheLen = parakeetModel.getStreamingSpkCacheLen(); - config.fifoLen = parakeetModel.getStreamingFifoLen(); - config.chunkLeftContextMs = parakeetModel.getStreamingChunkLeftContextMs(); - config.chunkRightContextMs = parakeetModel.getStreamingChunkRightContextMs(); - config.spkCacheUpdatePeriod = parakeetModel.getStreamingSpkCacheUpdatePeriod(); + config.spkCacheEnable = parakeetModel.getStreamingSpkCacheEnable(); + config.spkCacheLen = parakeetModel.getStreamingSpkCacheLen(); + config.fifoLen = parakeetModel.getStreamingFifoLen(); + config.chunkLeftContextMs = parakeetModel.getStreamingChunkLeftContextMs(); + config.chunkRightContextMs = parakeetModel.getStreamingChunkRightContextMs(); + config.spkCacheUpdatePeriod = + parakeetModel.getStreamingSpkCacheUpdatePeriod(); if (auto chunkMs = configObj.getOptionalProperty(env, "chunkMs"); @@ -215,31 +216,37 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try { configObj.getOptionalProperty(env, "spkCacheLen"); spkCacheLen.has_value()) { const auto v = static_cast(spkCacheLen.value().as(env)); - if (v > 0) config.spkCacheLen = v; + if (v > 0) + config.spkCacheLen = v; } - if (auto fifoLen = - configObj.getOptionalProperty(env, "fifoLen"); + if (auto fifoLen = configObj.getOptionalProperty(env, "fifoLen"); fifoLen.has_value()) { const auto v = static_cast(fifoLen.value().as(env)); - if (v > 0) config.fifoLen = v; + if (v > 0) + config.fifoLen = v; } if (auto chunkLeftContextMs = configObj.getOptionalProperty(env, "chunkLeftContextMs"); chunkLeftContextMs.has_value()) { const auto v = static_cast(chunkLeftContextMs.value().as(env)); - if (v >= 0) config.chunkLeftContextMs = v; + if (v >= 0) + config.chunkLeftContextMs = v; } if (auto chunkRightContextMs = configObj.getOptionalProperty(env, "chunkRightContextMs"); chunkRightContextMs.has_value()) { - const auto v = static_cast(chunkRightContextMs.value().as(env)); - if (v >= 0) 
config.chunkRightContextMs = v; + const auto v = + static_cast(chunkRightContextMs.value().as(env)); + if (v >= 0) + config.chunkRightContextMs = v; } - if (auto spkCacheUpdatePeriod = - configObj.getOptionalProperty(env, "spkCacheUpdatePeriod"); + if (auto spkCacheUpdatePeriod = configObj.getOptionalProperty( + env, "spkCacheUpdatePeriod"); spkCacheUpdatePeriod.has_value()) { - const auto v = static_cast(spkCacheUpdatePeriod.value().as(env)); - if (v > 0) config.spkCacheUpdatePeriod = v; + const auto v = + static_cast(spkCacheUpdatePeriod.value().as(env)); + if (v > 0) + config.spkCacheUpdatePeriod = v; } { diff --git a/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp b/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp index a34bb16b84..3d20e448ee 100644 --- a/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp +++ b/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp @@ -108,8 +108,9 @@ auto JSAdapter::loadFromJSObject(js::Object jsObject, js_env_t* env) } // AOSC (v2.1+ Sortformer only). All optional; unspecified values keep - // ParakeetConfig's defaults. Forwarded into parakeet::SortformerStreamingOptions - // by ParakeetModel / ParakeetStreamingProcessor; ignored for v1/v2/non-Sortformer. + // ParakeetConfig's defaults. Forwarded into + // parakeet::SortformerStreamingOptions by ParakeetModel / + // ParakeetStreamingProcessor; ignored for v1/v2/non-Sortformer. 
auto streamingSpkCacheEnableOpt = jsObject.getOptionalProperty(env, "streamingSpkCacheEnable"); if (streamingSpkCacheEnableOpt.has_value()) { diff --git a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp index 9298d2c81d..2d53a4ba3e 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp +++ b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp @@ -49,10 +49,10 @@ ParakeetStreamingProcessor::ParakeetStreamingProcessor( // AOSC (v2.1+ Sortformer only). parakeet-cpp ignores these fields for // v1/v2 GGUFs (variant detected from `parakeet.model_variant` metadata // or the encoder shape heuristic), so always-forward is safe. - opts.spkcache_enable = config_.spkCacheEnable; - opts.spkcache_len = config_.spkCacheLen; - opts.fifo_len = config_.fifoLen; - opts.chunk_left_context_ms = config_.chunkLeftContextMs; + opts.spkcache_enable = config_.spkCacheEnable; + opts.spkcache_len = config_.spkCacheLen; + opts.fifo_len = config_.fifoLen; + opts.chunk_left_context_ms = config_.chunkLeftContextMs; opts.chunk_right_context_ms = config_.chunkRightContextMs; opts.spkcache_update_period = config_.spkCacheUpdatePeriod; diff --git a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp index c2712c0cfc..559e9e9b04 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp @@ -60,12 +60,12 @@ class ParakeetStreamingProcessor { // `parakeet.model_variant` metadata tag). parakeet-cpp ignores these // fields on v1/v2 GGUFs and on non-Sortformer engines, so they are // always safe to forward. 
- bool spkCacheEnable = true; - int spkCacheLen = 188; - int fifoLen = 188; - int chunkLeftContextMs = 80; - int chunkRightContextMs = 560; - int spkCacheUpdatePeriod = 144; + bool spkCacheEnable = true; + int spkCacheLen = 188; + int fifoLen = 188; + int chunkLeftContextMs = 80; + int chunkRightContextMs = 560; + int spkCacheUpdatePeriod = 144; }; ParakeetStreamingProcessor( diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp index 3ad72100ef..b05d3e4672 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp @@ -71,12 +71,12 @@ struct ParakeetConfig { // // Setting streamingSpkCacheEnable = false on a v2.1 model forces the // v1 sliding-window code path (useful for regression comparison). - bool streamingSpkCacheEnable = true; - int streamingSpkCacheLen = 188; // long-term speaker rows (~15s) - int streamingFifoLen = 188; // FIFO warmup buffer rows - int streamingChunkLeftContextMs = 80; // encoder left context (~1 frame) - int streamingChunkRightContextMs = 560; // encoder right context (~7 frames) - int streamingSpkCacheUpdatePeriod = 144; // FIFO-overflow pop-out count + bool streamingSpkCacheEnable = true; + int streamingSpkCacheLen = 188; // long-term speaker rows (~15s) + int streamingFifoLen = 188; // FIFO warmup buffer rows + int streamingChunkLeftContextMs = 80; // encoder left context (~1 frame) + int streamingChunkRightContextMs = 560; // encoder right context (~7 frames) + int streamingSpkCacheUpdatePeriod = 144; // FIFO-overflow pop-out count // ── Dynamic-backend loading ──────────────────────────────────────────── // Forwarded to parakeet::EngineOptions::backends_dir / @@ -117,7 +117,8 @@ 
struct ParakeetConfig { streamingFifoLen == other.streamingFifoLen && streamingChunkLeftContextMs == other.streamingChunkLeftContextMs && streamingChunkRightContextMs == other.streamingChunkRightContextMs && - streamingSpkCacheUpdatePeriod == other.streamingSpkCacheUpdatePeriod && + streamingSpkCacheUpdatePeriod == + other.streamingSpkCacheUpdatePeriod && backendsDir == other.backendsDir && openclCacheDir == other.openclCacheDir; } diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp index 9188973c44..adf4a736a6 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp @@ -723,10 +723,10 @@ void ParakeetModel::openStreamingSession_() { // AOSC (v2.1+ Sortformer only; ignored for v1/v2 GGUFs). The engine // detects v2.1 via the GGUF metadata tag `parakeet.model_variant` and // only consults these fields then -- safe to forward unconditionally. 
- opts.spkcache_enable = cfg_.streamingSpkCacheEnable; - opts.spkcache_len = cfg_.streamingSpkCacheLen; - opts.fifo_len = cfg_.streamingFifoLen; - opts.chunk_left_context_ms = cfg_.streamingChunkLeftContextMs; + opts.spkcache_enable = cfg_.streamingSpkCacheEnable; + opts.spkcache_len = cfg_.streamingSpkCacheLen; + opts.fifo_len = cfg_.streamingFifoLen; + opts.chunk_left_context_ms = cfg_.streamingChunkLeftContextMs; opts.chunk_right_context_ms = cfg_.streamingChunkRightContextMs; opts.spkcache_update_period = cfg_.streamingSpkCacheUpdatePeriod; diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp index 27c62be9bb..5e94cb2b5c 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp @@ -142,22 +142,18 @@ class ParakeetModel : public qvac_lib_inference_addon_cpp::model::IModel, // AOSC accessors (v2.1+ Sortformer only). Forwarded verbatim from // ParakeetConfig; parakeet-cpp ignores them for non-Sortformer engines // and for v1/v2 Sortformer GGUFs. 
- bool getStreamingSpkCacheEnable() const { + bool getStreamingSpkCacheEnable() const { return cfg_.streamingSpkCacheEnable; } - int getStreamingSpkCacheLen() const { - return cfg_.streamingSpkCacheLen; - } - int getStreamingFifoLen() const { - return cfg_.streamingFifoLen; - } - int getStreamingChunkLeftContextMs() const { + int getStreamingSpkCacheLen() const { return cfg_.streamingSpkCacheLen; } + int getStreamingFifoLen() const { return cfg_.streamingFifoLen; } + int getStreamingChunkLeftContextMs() const { return cfg_.streamingChunkLeftContextMs; } - int getStreamingChunkRightContextMs() const { + int getStreamingChunkRightContextMs() const { return cfg_.streamingChunkRightContextMs; } - int getStreamingSpkCacheUpdatePeriod() const { + int getStreamingSpkCacheUpdatePeriod() const { return cfg_.streamingSpkCacheUpdatePeriod; } bool isSortformer() const { From 86126f1e316ca60c684b19e6ffd71821f5cf7b1f Mon Sep 17 00:00:00 2001 From: GustavoA1604 Date: Wed, 20 May 2026 13:38:14 -0300 Subject: [PATCH 5/5] Bump version --- packages/transcription-parakeet/CHANGELOG.md | 10 ++++++++-- packages/transcription-parakeet/package.json | 2 +- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/packages/transcription-parakeet/CHANGELOG.md b/packages/transcription-parakeet/CHANGELOG.md index d7880ce606..c7b333537f 100644 --- a/packages/transcription-parakeet/CHANGELOG.md +++ b/packages/transcription-parakeet/CHANGELOG.md @@ -5,9 +5,9 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## [Unreleased] +## [0.6.0] -In this release we expose the v2.1 streaming Sortformer model with NeMo-port AOSC (Audio-Online Speaker Cache) through the addon's public API and overhaul the Android prebuild to ship the ggml backends as separately-loadable MODULE `.so` files. 
AOSC anchors each speaker to a stable cache slot across silence and re-entry, fixing the per-chunk permutation-invariance drift v1's sliding-window streaming exhibits once two voices have been seen -- v2.1 becomes the recommended streaming Sortformer; v1 stays the offline-batch default. Six new optional config knobs surface the cache geometry for tuning and A/B comparison; defaults mirror parakeet-cpp's NeMo-port tuning so a bare `streaming: true` against a v2.1 GGUF Just Works. On the Android side, Vulkan and OpenCL ship as runtime-discovered `.so` files (qvac-ext-ggml@speech's `GGML_BACKEND_DL=ON`), alongside per-arch CPU variants (`libqvac-speech-ggml-cpu-android_armv{8.0,8.2,8.6,9.0,9.2}_*.so`); inference still runs on CPU there pending Vulkan/Mali + OpenCL/Adreno driver fixes (`useGPU` is overridden at the engine boundary), but the GPU `.so` files are in place for when the override is lifted. +In this release we reestablish the GGML implementation from `0.4.0` with extra additions. The main features are exposing the v2.1 streaming Sortformer model with NeMo-port AOSC (Audio-Online Speaker Cache) through the addon's public API and overhauling the Android prebuild to ship the ggml backends as separately-loadable MODULE `.so` files. v2.1 becomes the recommended streaming Sortformer model; v1 stays the offline-batch default. On the Android side, Vulkan and OpenCL ship as runtime-discovered `.so` files (qvac-ext-ggml@speech's `GGML_BACKEND_DL=ON`), alongside per-arch CPU variants (`libqvac-speech-ggml-cpu-android_armv{8.0,8.2,8.6,9.0,9.2}_*.so`); inference still runs on CPU there pending Vulkan/Mali + OpenCL/Adreno driver fixes (`useGPU` is overridden at the engine boundary), but the GPU `.so` files are in place for when the override is lifted. 
### Added - **AOSC config knobs.** `ParakeetConfig` gains six optional fields -- `streamingSpkCacheEnable` (default `true`), `streamingSpkCacheLen` (188), `streamingFifoLen` (188), `streamingChunkLeftContextMs` (80), `streamingChunkRightContextMs` (560), `streamingSpkCacheUpdatePeriod` (144) -- forwarded into `parakeet::SortformerStreamingOptions` for both the in-process Mode-3 streaming path (`ParakeetModel::runStreamingProcess_`) and the duplex `runStreaming()` processor (`ParakeetStreamingProcessor`). Mirrored as per-call overrides on `StreamingRunConfig` (`spkCacheEnable`, `spkCacheLen`, `fifoLen`, `chunkLeftContextMs`, `chunkRightContextMs`, `spkCacheUpdatePeriod`). parakeet-cpp ignores these on v1 / v2 Sortformer GGUFs and on non-Sortformer engines, so always-forward is safe. @@ -28,6 +28,12 @@ In this release we expose the v2.1 streaming Sortformer model with NeMo-port AOS - **`examples/diarized-transcribe.js`** header: notes v1 remains the recommended OFFLINE diarization model -- AOSC's slot-stability benefit only applies to continuous streaming and is wasted in batch mode. - **`README.md`** -- extended Model Variants table with v1 (offline default) and v2.1 + AOSC (streaming default) rows; new `streamingSpkCache*` rows in the ParakeetConfig table; dedicated "Sortformer Streaming Diarization (v2.1 + AOSC)" section explaining the v1-drift problem AOSC solves, the model-variant auto-detection, and when to leave the defaults alone. +## [0.5.0] + +- Temporarily reverted to the ONNX implementation of `0.3.3` to ensure stability in SDK `0.11.*`. +- Bumped `inference-addon-cpp` dependency version to `1.1.7#1`. +- Bumped `onnx` dependency version to `0.15.0`. 
+ ## [0.4.0] In this release, we have replaced the onnxruntime backend with a pure C++/ggml engine, added a duplex-streaming entry point that bypasses the framework's batch-then-process lifecycle for live use cases, and surfaced two new per-segment signals (`isEndOfTurn`, `startsWord`) so consumers can build cleaner live transcripts. The release also exposes per-engine backend stats (`backendDevice`, `backendId`) so callers can verify the GPU path actually engaged, and consolidates the examples / docs / mock fixtures into a single duplex-aware surface. diff --git a/packages/transcription-parakeet/package.json b/packages/transcription-parakeet/package.json index 59fb9fc9d1..04f8c3d728 100644 --- a/packages/transcription-parakeet/package.json +++ b/packages/transcription-parakeet/package.json @@ -1,6 +1,6 @@ { "name": "@qvac/transcription-parakeet", - "version": "0.4.0", + "version": "0.6.0", "description": "High-performance speech-to-text inference addon using NVIDIA Parakeet models for Bare runtime", "addon": true, "engines": {