diff --git a/packages/transcription-parakeet/CHANGELOG.md b/packages/transcription-parakeet/CHANGELOG.md index 05b48234e9..c7b333537f 100644 --- a/packages/transcription-parakeet/CHANGELOG.md +++ b/packages/transcription-parakeet/CHANGELOG.md @@ -5,18 +5,35 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). -## [Unreleased] +## [0.6.0] -Update Android prebuild to ship Vulkan and OpenCL as separately-loadable MODULE `.so` files (qvac-ext-ggml@speech's `GGML_BACKEND_DL=ON`) discovered at runtime via `ggml_backend_load_all_from_path()`, as well as per-arch CPU variants (`libqvac-speech-ggml-cpu-android_armv{8.0,8.2,8.6,9.0,9.2}_*.so`). +In this release we restore the GGML implementation from `0.4.0` and build on it. The two main changes expose the v2.1 streaming Sortformer model with NeMo-port AOSC (Audio-Online Speaker Cache) through the addon's public API and overhaul the Android prebuild to ship the ggml backends as separately-loadable MODULE `.so` files. v2.1 becomes the recommended streaming Sortformer model; v1 stays the offline-batch default. On the Android side, Vulkan and OpenCL ship as runtime-discovered `.so` files (qvac-ext-ggml@speech's `GGML_BACKEND_DL=ON`), alongside per-arch CPU variants (`libqvac-speech-ggml-cpu-android_armv{8.0,8.2,8.6,9.0,9.2}_*.so`); inference still runs on CPU there pending Vulkan/Mali + OpenCL/Adreno driver fixes (`useGPU` is overridden at the engine boundary), but the GPU `.so` files are in place for when the override is lifted. 
### Added - +- **AOSC config knobs.** `ParakeetConfig` gains six optional fields — `streamingSpkCacheEnable` (default `true`), `streamingSpkCacheLen` (188), `streamingFifoLen` (188), `streamingChunkLeftContextMs` (80), `streamingChunkRightContextMs` (560), `streamingSpkCacheUpdatePeriod` (144) — forwarded into `parakeet::SortformerStreamingOptions` for both the in-process Mode-3 streaming path (`ParakeetModel::runStreamingProcess_`) and the duplex `runStreaming()` processor (`ParakeetStreamingProcessor`). Mirrored as per-call overrides on `StreamingRunConfig` (`spkCacheEnable`, `spkCacheLen`, `fifoLen`, `chunkLeftContextMs`, `chunkRightContextMs`, `spkCacheUpdatePeriod`). parakeet-cpp ignores these on v1 / v2 Sortformer GGUFs and on non-Sortformer engines, so always-forward is safe. +- **v2.1 Sortformer auto-detection.** When a `diar_streaming_sortformer_4spk-v2.1.*` GGUF is loaded, parakeet-cpp's engine recognises it from the GGUF metadata tag `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"` and enables AOSC by default. Setting `streamingSpkCacheEnable: false` forces the v1 sliding-window code path on a v2.1 model (A/B comparison). +- **`examples/live-mic-diarized-aosc.js`** — v2.1-focused dual-stream live mic example mirroring `live-mic-diarized.js`'s ASR + Sortformer pattern, with CLI flags for every AOSC knob (`--spk-cache-enable`, `--spk-cache-len`, `--fifo-len`, `--chunk-left-context-ms`, `--chunk-right-context-ms`, `--spk-cache-update-period`). +- **`test/integration/sortformer-aosc-streaming.test.js`** — covers default-AOSC streaming and `streamingSpkCacheEnable=false` fallback. The full AOSC slot-stability contract (same physical speaker → same `Speaker N` tag across non-contiguous re-entries) is verified at C++ level in `parakeet-cpp/test/test_sortformer_aosc_speakers.cpp`; this JS-level test focuses on wiring correctness — that the override actually reaches the engine and the engine emits well-formed segments in both modes. 
+- **`MODEL_CONFIGS.sortformerStreaming`** entry in `test/integration/helpers.js` pointing at `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. Tests skip cleanly when the GGUF isn't staged via `npm run setup-models` / `QVAC_TEST_GGUF_*`. - **`backendsDir` ParakeetConfig field.** Directory the native addon scans for dynamically-loaded ggml backend libraries (`libqvac-speech-ggml-vulkan.so`, `libqvac-speech-ggml-opencl.so`, per-arch `libqvac-speech-ggml-cpu-android_armv*_*.so`). -- **`openclCacheDir` ParakeetConfig field.** Persistent directory for ggml-opencl's `clCreateProgramWithBinary` cache. +- **`openclCacheDir` ParakeetConfig field.** Persistent directory for ggml-opencl's `clCreateProgramWithBinary` cache. - **CMake install plumbing for dynamic ggml backends.** Two complementary install paths cover the full backend set that the `ggml-speech` vcpkg port emits on Android. - **`BACKENDS_SUBDIR` compile define** on the addon target. Derived from cmake-bare's `bare_target()` + `bare_module_target()` so the addon can join `/` onto the host-provided `backendsDir` root without the host needing to know the per-target shape. - **Mobile dynamic-backend coverage.** `test/mobile/integration-runtime.cjs` now sets `NO_GPU=false` so Device Farm runs `gpu-smoke` and `mobile-perf-*-gpu` tests that exercise backend dlopen / discovery (Vulkan, OpenCL, and per-arch CPU `.so` loading). On Android, inference still runs on CPU (`useGPU` is overridden at the engine boundary and gpu-smoke passes early); iOS may engage Metal when `useGPU: true`. +### Changed +- **parakeet-cpp dep bumped** to `version>= 2026-05-20#2` (was `2026-05-05#1`) across all three platform branches in `vcpkg.json`. 
The new port (qvac-registry-vcpkg PR #156 + the `ggml-speech#3` follow-up) pulls in PRs #22 + #24 of `qvac-ext-lib-whisper.cpp`, which introduce the v2.1 Sortformer support, AOSC engine implementation, strict variant detection via the `parakeet.model_variant` GGUF tag, and review-fixup cleanups (magic-number elimination, dead-code removal, test utility consolidation, Windows `` include), and tightens the `ggml-speech` constraint to the per-arch Android CPU build (`GGML_CPU_ALL_VARIANTS=ON`). +- **`index.js::_buildConfigurationParams()`** now forwards the 6 new AOSC fields (and explicit defaults for unset values) into `createInstance` / `reload`. Without this, JSDoc + native plumbing would exist but JS-layer overrides would never reach C++. +- **`examples/live-mic-diarized.js`** header: recommends the v2.1 GGUF as `--diar-model` and notes that `streamingHistoryMs` is superseded by AOSC on v2.1 models (kept for v1 back-compat). Points to the new `live-mic-diarized-aosc.js` for explicit knob control. +- **`examples/diarized-transcribe.js`** header: notes v1 remains the recommended OFFLINE diarization model — AOSC's slot-stability benefit only applies to continuous streaming and is wasted in batch mode. +- **`README.md`** — extended Model Variants table with v1 (offline default) and v2.1 + AOSC (streaming default) rows; new `streamingSpkCache*` rows in the ParakeetConfig table; dedicated "Sortformer Streaming Diarization (v2.1 + AOSC)" section explaining the v1-drift problem AOSC solves, the model-variant auto-detection, and when to leave the defaults alone. + +## [0.5.0] + +- Temporarily reverted to the ONNX implementation from `0.3.3` to ensure stability in SDK `0.11.*`. +- Bumped `inference-addon-cpp` dependency version to `1.1.7#1`. +- Bumped `onnx` dependency version to `0.15.0`. 
+ ## [0.4.0] In this release, we have replaced the onnxruntime backend with a pure C++/ggml engine, added a duplex-streaming entry point that bypasses the framework's batch-then-process lifecycle for live use cases, and surfaced two new per-segment signals (`isEndOfTurn`, `startsWord`) so consumers can build cleaner live transcripts. The release also exposes per-engine backend stats (`backendDevice`, `backendId`) so callers can verify the GPU path actually engaged, and consolidates the examples / docs / mock fixtures into a single duplex-aware surface. diff --git a/packages/transcription-parakeet/README.md b/packages/transcription-parakeet/README.md index bf2b2a2e7c..5bcb7995e9 100644 --- a/packages/transcription-parakeet/README.md +++ b/packages/transcription-parakeet/README.md @@ -214,11 +214,48 @@ Most users interact with the package through `index.js`. From that entrypoint we | | `streamingEnergyVad` | CTC/TDT energy-VAD events (default: `false`) | | | `streamingLeftContextMs` | ASR encoder left-context window in ms; `-1` keeps parakeet-cpp's default of 10000. ASR sessions only (Sortformer ignores it). | | | `streamingRightLookaheadMs` | ASR encoder right-lookahead window in ms; `-1` keeps parakeet-cpp's default of 2000. Adds directly to the per-segment latency floor (`chunk_ms + right_lookahead_ms`). ASR sessions only. | +| | `streamingSpkCacheEnable` | AOSC: enable v2.1 Sortformer's speaker-cache streaming (default: `true`). Ignored on v1/v2 Sortformer GGUFs and on non-Sortformer models. Set `false` to force a v2.1 GGUF onto the v1 sliding-window path (A/B comparison). | +| | `streamingSpkCacheLen` | AOSC: long-term speaker-cache rows (~15 s of encoder frames). Default: 188. | +| | `streamingFifoLen` | AOSC: FIFO warmup buffer rows. Default: 188. | +| | `streamingChunkLeftContextMs` | AOSC: encoder left-context window (ms; ~1 encoder frame). Default: 80. | +| | `streamingChunkRightContextMs` | AOSC: encoder right-context window (ms; ~7 encoder frames). 
Default: 560. | +| | `streamingSpkCacheUpdatePeriod` | AOSC: FIFO-overflow pop-out count. Default: 144. | | | `backendsDir` | Root directory for dynamically-loaded ggml backend `.so` files (Vulkan, OpenCL, per-arch CPU variants on Android). Defaults to the package's `prebuilds/` folder; the native addon appends `/` before scanning. Pass an explicit path when prebuilds live elsewhere — e.g. Android `ApplicationInfo.nativeLibraryDir` when backend libs ship inside the APK. No-op on Apple (statically linked). | | | `openclCacheDir` | Persistent directory for ggml-opencl's compiled program-binary cache (`$GGML_OPENCL_CACHE_DIR`). Android-only; pass the host app's cache directory (e.g. `Context.getCacheDir()`) to skip cold `clBuildProgram` on every process start. Ignored on other platforms. | The model type (CTC / TDT / EOU / Sortformer) is **auto-detected from the GGUF metadata**, so callers don't need to pass `modelType`. Other knobs (`captionEnabled`, `timestampsEnabled`, `seed`, `sampleRate`, `channels`) keep sensible defaults. +**Sortformer Streaming Diarization (v2.1 + AOSC).** parakeet-cpp ships +two streaming-diarization paths, selected automatically from the loaded GGUF: + +- **v1** uses a fixed-size sliding-history window inside the engine. + Once two voices have been seen, the per-chunk decisions are + permutation-invariant; if a speaker goes silent long enough to roll + out of the window, the slot can drift onto a different physical voice + when they return. Fine for short, stable clips; ships as + `sortformer-4spk-v1.q8_0.gguf`. +- **v2.1** replaces the sliding window with AOSC (Audio-Online Speaker + Cache, ported from NVIDIA NeMo), which anchors each slot to its + accumulated embedding. The same physical speaker returns to the same + `Speaker N` tag across silences. Default for live capture; ships as + `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. 
The engine detects + v2.1 via the GGUF metadata tag + `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`; you + don't need to opt in via config. + +The defaults in the `streamingSpkCache*` / `streamingFifo*` / +`streamingChunk{Left,Right}ContextMs` table rows above are the NeMo-port +tuning parakeet-cpp ships -- you almost always want to keep them. The +knobs are exposed for A/B comparison (e.g. `--spk-cache-enable false` +in `examples/live-mic-diarized-aosc.js` to force a v2.1 GGUF onto the +v1 path) and for tuning unusual audio (longer cache, larger +right-context window for higher latency tolerance, etc.). + +For offline diarization (single batch over a finite clip) v1 remains +the recommended GGUF -- AOSC's slot-stability benefit only applies to +continuous streaming and offers no measurable improvement when the +entire clip is available at once. + #### Configuration Example ```javascript @@ -410,10 +447,16 @@ bare examples/diarized-transcribe.js \ # Live mic transcription bare examples/live-mic.js --model models/parakeet-eou-120m-v1.q8_0.gguf --accumulate -# Live mic + speaker tagging +# Live mic + speaker tagging (recommended: v2.1 diar GGUF, AOSC auto-on) bare examples/live-mic-diarized.js \ --asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \ - --diar-model models/sortformer-4spk-v1.q8_0.gguf --accumulate + --diar-model models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf --accumulate + +# Same as above, with explicit AOSC tuning knobs exposed as CLI flags +bare examples/live-mic-diarized-aosc.js \ + --asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \ + --diar-model models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf \ + --spk-cache-len 256 --chunk-right-context-ms 480 --accumulate ``` > If you use `npm run example:* -- ...` instead of `bare`, remember the `--` separator -- without it npm interprets `--model` as one of its own config flags. 
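As a minimal sketch of how the AOSC knobs from the ParakeetConfig table could be assembled on the JS side, the snippet below merges caller overrides over the documented defaults. `buildAoscConfig` and `AOSC_DEFAULTS` are hypothetical names, not part of the package. The lower bounds (lengths and update period above zero, context windows at or above zero) mirror the per-call checks in `AddonJs.hpp`; applying them to load-time config is an assumption.

```javascript
// Defaults as documented in the ParakeetConfig table of this README.
const AOSC_DEFAULTS = Object.freeze({
  streamingSpkCacheEnable: true,
  streamingSpkCacheLen: 188,
  streamingFifoLen: 188,
  streamingChunkLeftContextMs: 80,
  streamingChunkRightContextMs: 560,
  streamingSpkCacheUpdatePeriod: 144
})

// Hypothetical helper (illustration only): merge overrides over the defaults
// and reject values early instead of letting them be silently dropped.
function buildAoscConfig (overrides = {}) {
  const config = { ...AOSC_DEFAULTS }
  for (const [key, value] of Object.entries(overrides)) {
    if (!Object.prototype.hasOwnProperty.call(AOSC_DEFAULTS, key)) {
      throw new Error(`Unknown AOSC field: ${key}`)
    }
    if (key === 'streamingSpkCacheEnable') {
      if (typeof value !== 'boolean') {
        throw new TypeError(`${key} must be a boolean`)
      }
    } else {
      // Assumed bounds: context windows may be 0, everything else must be > 0.
      const min = key.endsWith('ContextMs') ? 0 : 1
      if (!Number.isInteger(value) || value < min) {
        throw new RangeError(`${key} must be an integer >= ${min}`)
      }
    }
    config[key] = value
  }
  return config
}

// Same tuning as the live-mic-diarized-aosc.js invocation shown above.
const tuned = buildAoscConfig({
  streamingSpkCacheLen: 256,
  streamingChunkRightContextMs: 480
})
console.log(tuned)
```

The returned object contains only the six `streaming*` AOSC fields, so it can be spread into whatever config object the caller already builds for the model. Unset fields keep the NeMo-port defaults.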
@@ -427,14 +470,16 @@ The live-mic examples capture the default input device via `sox -d` (install: `b | **CTC** | English | argmax CTC | ~ 700 MiB | Fast, no PnC. | | **TDT** | ~25 | RNN-T greedy + duration | ~ 715 MiB | Recommended default; PnC + auto-detect. | | **EOU** | English | RNN-T greedy + `` | ~ 132 MiB | Streaming-trained; native end-of-turn token. | -| **Sortformer** | n/a | Diarization head | ~ 141 MiB | 4-speaker. | +| **Sortformer v1** | n/a | Diarization head (sliding history) | ~ 141 MiB | 4-speaker. **Default for offline diarization.** | +| **Sortformer v2.1 + AOSC** | n/a | Diarization head + speaker cache | ~ 141 MiB | 4-speaker. **Default for streaming diarization.** AOSC anchors speaker slots across silence/re-entry; auto-detected via GGUF metadata tag `parakeet.model_variant`. | ## Other examples - [`examples/transcribe.js`](examples/transcribe.js) -- universal single-file transcribe / diarize (any GGUF, all model types). - [`examples/diarized-transcribe.js`](examples/diarized-transcribe.js) -- combined Sortformer + ASR pipeline ("who said what"). - [`examples/live-mic.js`](examples/live-mic.js) -- live microphone transcription via `sox` and the streaming session. -- [`examples/live-mic-diarized.js`](examples/live-mic-diarized.js) -- live mic with parallel Sortformer + ASR for speaker-tagged transcripts. +- [`examples/live-mic-diarized.js`](examples/live-mic-diarized.js) -- live mic with parallel Sortformer + ASR for speaker-tagged transcripts. Pass a v2.1 Sortformer GGUF to get AOSC speaker-cache streaming automatically. +- [`examples/live-mic-diarized-aosc.js`](examples/live-mic-diarized-aosc.js) -- same as above but with CLI flags for the AOSC tuning knobs (`--spk-cache-len`, `--fifo-len`, `--chunk-right-context-ms`, `--spk-cache-enable`, etc.). Useful for A/B comparing AOSC vs the v1 sliding-window code path on the same v2.1 GGUF. - [`examples/decode-audio.js`](examples/decode-audio.js) -- decode + transcribe in one step. 
Same flag surface as `transcribe.js` but pipes the input through `@qvac/decoder-audio` (FFmpeg) first, so any container / codec FFmpeg supports (mp3, m4a, ogg, flac, mp4, ...) works -- not just 16 kHz mono `.wav` / raw s16le PCM. - [`examples/utils.js`](examples/utils.js) -- shared helpers used by the examples (`loadWeights` streaming, `Output`/`JobEnded` race resolution). diff --git a/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp b/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp index 12cfea1766..02176d0311 100644 --- a/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp +++ b/packages/transcription-parakeet/addon/src/addon/AddonJs.hpp @@ -163,6 +163,14 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try { parakeetModel.getDiarMinDurationOn() * 1000.0F); config.leftContextMs = parakeetModel.getStreamingLeftContextMs(); config.rightLookaheadMs = parakeetModel.getStreamingRightLookaheadMs(); + // AOSC defaults sourced from the model's load-time ParakeetConfig. + config.spkCacheEnable = parakeetModel.getStreamingSpkCacheEnable(); + config.spkCacheLen = parakeetModel.getStreamingSpkCacheLen(); + config.fifoLen = parakeetModel.getStreamingFifoLen(); + config.chunkLeftContextMs = parakeetModel.getStreamingChunkLeftContextMs(); + config.chunkRightContextMs = parakeetModel.getStreamingChunkRightContextMs(); + config.spkCacheUpdatePeriod = + parakeetModel.getStreamingSpkCacheUpdatePeriod(); if (auto chunkMs = configObj.getOptionalProperty(env, "chunkMs"); @@ -198,6 +206,48 @@ startStreaming(js_env_t* env, js_callback_info_t* info) try { emitEnergyVad.has_value()) { config.emitEnergyVad = emitEnergyVad.value().as(env); } + // AOSC per-call overrides (v2.1+ Sortformer only). 
+ if (auto spkCacheEnable = + configObj.getOptionalProperty(env, "spkCacheEnable"); + spkCacheEnable.has_value()) { + config.spkCacheEnable = spkCacheEnable.value().as(env); + } + if (auto spkCacheLen = + configObj.getOptionalProperty(env, "spkCacheLen"); + spkCacheLen.has_value()) { + const auto v = static_cast(spkCacheLen.value().as(env)); + if (v > 0) + config.spkCacheLen = v; + } + if (auto fifoLen = configObj.getOptionalProperty(env, "fifoLen"); + fifoLen.has_value()) { + const auto v = static_cast(fifoLen.value().as(env)); + if (v > 0) + config.fifoLen = v; + } + if (auto chunkLeftContextMs = + configObj.getOptionalProperty(env, "chunkLeftContextMs"); + chunkLeftContextMs.has_value()) { + const auto v = static_cast(chunkLeftContextMs.value().as(env)); + if (v >= 0) + config.chunkLeftContextMs = v; + } + if (auto chunkRightContextMs = + configObj.getOptionalProperty(env, "chunkRightContextMs"); + chunkRightContextMs.has_value()) { + const auto v = + static_cast(chunkRightContextMs.value().as(env)); + if (v >= 0) + config.chunkRightContextMs = v; + } + if (auto spkCacheUpdatePeriod = configObj.getOptionalProperty( + env, "spkCacheUpdatePeriod"); + spkCacheUpdatePeriod.has_value()) { + const auto v = + static_cast(spkCacheUpdatePeriod.value().as(env)); + if (v > 0) + config.spkCacheUpdatePeriod = v; + } { std::lock_guard lock(g_streamingMtx); diff --git a/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp b/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp index bedbf481fe..3d20e448ee 100644 --- a/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp +++ b/packages/transcription-parakeet/addon/src/js-interface/JSAdapter.cpp @@ -107,6 +107,54 @@ auto JSAdapter::loadFromJSObject(js::Object jsObject, js_env_t* env) streamingRightLookaheadMsOpt.value().as(env); } + // AOSC (v2.1+ Sortformer only). All optional; unspecified values keep + // ParakeetConfig's defaults. 
Forwarded into + // parakeet::SortformerStreamingOptions by ParakeetModel / + // ParakeetStreamingProcessor; ignored for v1/v2/non-Sortformer. + auto streamingSpkCacheEnableOpt = + jsObject.getOptionalProperty(env, "streamingSpkCacheEnable"); + if (streamingSpkCacheEnableOpt.has_value()) { + config.streamingSpkCacheEnable = + streamingSpkCacheEnableOpt.value().as(env); + } + + auto streamingSpkCacheLenOpt = + jsObject.getOptionalProperty(env, "streamingSpkCacheLen"); + if (streamingSpkCacheLenOpt.has_value()) { + config.streamingSpkCacheLen = + streamingSpkCacheLenOpt.value().as(env); + } + + auto streamingFifoLenOpt = + jsObject.getOptionalProperty(env, "streamingFifoLen"); + if (streamingFifoLenOpt.has_value()) { + config.streamingFifoLen = streamingFifoLenOpt.value().as(env); + } + + auto streamingChunkLeftContextMsOpt = + jsObject.getOptionalProperty( + env, "streamingChunkLeftContextMs"); + if (streamingChunkLeftContextMsOpt.has_value()) { + config.streamingChunkLeftContextMs = + streamingChunkLeftContextMsOpt.value().as(env); + } + + auto streamingChunkRightContextMsOpt = + jsObject.getOptionalProperty( + env, "streamingChunkRightContextMs"); + if (streamingChunkRightContextMsOpt.has_value()) { + config.streamingChunkRightContextMs = + streamingChunkRightContextMsOpt.value().as(env); + } + + auto streamingSpkCacheUpdatePeriodOpt = + jsObject.getOptionalProperty( + env, "streamingSpkCacheUpdatePeriod"); + if (streamingSpkCacheUpdatePeriodOpt.has_value()) { + config.streamingSpkCacheUpdatePeriod = + streamingSpkCacheUpdatePeriodOpt.value().as(env); + } + // Dynamic-backend loading knobs. 
Both forwarded to // parakeet::EngineOptions and consumed once per-process on the // first Engine construction (the ggml-backend registry + the diff --git a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp index 8161375bb5..2d53a4ba3e 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp +++ b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.cpp @@ -46,6 +46,15 @@ ParakeetStreamingProcessor::ParakeetStreamingProcessor( opts.threshold = config_.diarOnsetThreshold; opts.min_segment_ms = config_.diarMinSegmentMs; opts.emit_partials = config_.emitPartials; + // AOSC (v2.1+ Sortformer only). parakeet-cpp ignores these fields for + // v1/v2 GGUFs (variant detected from `parakeet.model_variant` metadata + // or the encoder shape heuristic), so always-forward is safe. + opts.spkcache_enable = config_.spkCacheEnable; + opts.spkcache_len = config_.spkCacheLen; + opts.fifo_len = config_.fifoLen; + opts.chunk_left_context_ms = config_.chunkLeftContextMs; + opts.chunk_right_context_ms = config_.chunkRightContextMs; + opts.spkcache_update_period = config_.spkCacheUpdatePeriod; diar_session_ = model_.createDuplexDiarizationSession( opts, diff --git a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp index f611172eb6..559e9e9b04 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/ParakeetStreamingProcessor.hpp @@ -54,6 +54,18 @@ class ParakeetStreamingProcessor { // parakeet engine default in place" (10000 / 2000 ms respectively). 
int leftContextMs = -1; int rightLookaheadMs = -1; + // === AOSC (v2.1+ Sortformer only) ==================================== + // Forwarded into parakeet::SortformerStreamingOptions when the loaded + // model is a v2.1 Sortformer GGUF (auto-detected from the GGUF's + // `parakeet.model_variant` metadata tag). parakeet-cpp ignores these + // fields on v1/v2 GGUFs and on non-Sortformer engines, so they are + // always safe to forward. + bool spkCacheEnable = true; + int spkCacheLen = 188; + int fifoLen = 188; + int chunkLeftContextMs = 80; + int chunkRightContextMs = 560; + int spkCacheUpdatePeriod = 144; }; ParakeetStreamingProcessor( diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp index 744490632c..b05d3e4672 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetConfig.hpp @@ -57,6 +57,27 @@ struct ParakeetConfig { int streamingLeftContextMs = -1; int streamingRightLookaheadMs = -1; + // === AOSC (Audio-Online Speaker Cache; v2.1+ Sortformer only) ─────────── + // Forwarded to parakeet::SortformerStreamingOptions.spkcache_* / + // fifo_len / chunk_{left,right}_context_ms / spkcache_update_period. + // Ignored on non-Sortformer models and on v1/v2 Sortformer GGUFs; + // parakeet-cpp auto-enables AOSC for v2.1 via the GGUF metadata tag + // `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`. + // + // The cache anchors speaker-slot identity across silence and re-entry, + // fixing the per-chunk permutation-invariance drift that v1's sliding + // window suffers from. Defaults mirror parakeet-cpp's own (NeMo-port + // tuning); override only when A/B comparing or for specialised audio. 
+ // + // Setting streamingSpkCacheEnable = false on a v2.1 model forces the + // v1 sliding-window code path (useful for regression comparison). + bool streamingSpkCacheEnable = true; + int streamingSpkCacheLen = 188; // long-term speaker rows (~15s) + int streamingFifoLen = 188; // FIFO warmup buffer rows + int streamingChunkLeftContextMs = 80; // encoder left context (~1 frame) + int streamingChunkRightContextMs = 560; // encoder right context (~7 frames) + int streamingSpkCacheUpdatePeriod = 144; // FIFO-overflow pop-out count + // ── Dynamic-backend loading ──────────────────────────────────────────── // Forwarded to parakeet::EngineOptions::backends_dir / // opencl_cache_dir. On Android (and any other GGML_BACKEND_DL=ON @@ -91,6 +112,13 @@ struct ParakeetConfig { streamingEnergyVad == other.streamingEnergyVad && streamingLeftContextMs == other.streamingLeftContextMs && streamingRightLookaheadMs == other.streamingRightLookaheadMs && + streamingSpkCacheEnable == other.streamingSpkCacheEnable && + streamingSpkCacheLen == other.streamingSpkCacheLen && + streamingFifoLen == other.streamingFifoLen && + streamingChunkLeftContextMs == other.streamingChunkLeftContextMs && + streamingChunkRightContextMs == other.streamingChunkRightContextMs && + streamingSpkCacheUpdatePeriod == + other.streamingSpkCacheUpdatePeriod && backendsDir == other.backendsDir && openclCacheDir == other.openclCacheDir; } diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp index e1b4e4acfb..adf4a736a6 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.cpp @@ -720,6 +720,15 @@ void ParakeetModel::openStreamingSession_() { opts.threshold = diarConfig_.onset; opts.min_segment_ms = static_cast(diarConfig_.minDurationOn * 1000.0f); 
opts.emit_partials = cfg_.streamingEmitPartials; + // AOSC (v2.1+ Sortformer only; ignored for v1/v2 GGUFs). The engine + // detects v2.1 via the GGUF metadata tag `parakeet.model_variant` and + // only consults these fields then -- safe to forward unconditionally. + opts.spkcache_enable = cfg_.streamingSpkCacheEnable; + opts.spkcache_len = cfg_.streamingSpkCacheLen; + opts.fifo_len = cfg_.streamingFifoLen; + opts.chunk_left_context_ms = cfg_.streamingChunkLeftContextMs; + opts.chunk_right_context_ms = cfg_.streamingChunkRightContextMs; + opts.spkcache_update_period = cfg_.streamingSpkCacheUpdatePeriod; auto session = engine->diarize_start( opts, [this](const parakeet::StreamingDiarizationSegment& seg) { diff --git a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp index 2cd7c5f993..5e94cb2b5c 100644 --- a/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp +++ b/packages/transcription-parakeet/addon/src/model-interface/parakeet/ParakeetModel.hpp @@ -139,6 +139,23 @@ class ParakeetModel : public qvac_lib_inference_addon_cpp::model::IModel, bool getStreamingEnergyVad() const { return cfg_.streamingEnergyVad; } + // AOSC accessors (v2.1+ Sortformer only). Forwarded verbatim from + // ParakeetConfig; parakeet-cpp ignores them for non-Sortformer engines + // and for v1/v2 Sortformer GGUFs. 
+ bool getStreamingSpkCacheEnable() const { + return cfg_.streamingSpkCacheEnable; + } + int getStreamingSpkCacheLen() const { return cfg_.streamingSpkCacheLen; } + int getStreamingFifoLen() const { return cfg_.streamingFifoLen; } + int getStreamingChunkLeftContextMs() const { + return cfg_.streamingChunkLeftContextMs; + } + int getStreamingChunkRightContextMs() const { + return cfg_.streamingChunkRightContextMs; + } + int getStreamingSpkCacheUpdatePeriod() const { + return cfg_.streamingSpkCacheUpdatePeriod; + } bool isSortformer() const { return cfg_.modelType == ModelType::SORTFORMER; } diff --git a/packages/transcription-parakeet/examples/diarized-transcribe.js b/packages/transcription-parakeet/examples/diarized-transcribe.js index 424092d470..8f25c8f131 100644 --- a/packages/transcription-parakeet/examples/diarized-transcribe.js +++ b/packages/transcription-parakeet/examples/diarized-transcribe.js @@ -1,13 +1,21 @@ 'use strict' /** - * Combined ASR + diarization example. + * Combined ASR + diarization example (offline). * * Runs Sortformer to find speaker time-segments, then transcribes * each speaker's audio slice with the ASR model. Output is a * "Speaker N: ..." per-segment transcript. Both engines run * through the public `TranscriptionParakeet` class. * + * Recommended `--diar-model`: the v1 Sortformer GGUF + * (`sortformer-4spk-v1.q8_0.gguf`). v2.1 also works but the AOSC + * speaker cache it brings is a *streaming* optimisation -- in batch / + * offline mode the entire clip is available at once, so AOSC's slot + * stability across silence/re-entry provides no additional benefit + * over v1. For live capture, use `examples/live-mic-diarized.js` + * (or `examples/live-mic-diarized-aosc.js`) with the v2.1 GGUF. 
+ * * Usage: * bare examples/diarized-transcribe.js \ * --asr-model --diar-model --audio diff --git a/packages/transcription-parakeet/examples/live-mic-diarized-aosc.js b/packages/transcription-parakeet/examples/live-mic-diarized-aosc.js new file mode 100644 index 0000000000..41970a3026 --- /dev/null +++ b/packages/transcription-parakeet/examples/live-mic-diarized-aosc.js @@ -0,0 +1,369 @@ +'use strict' + +/** + * Live-mic transcription + diarization example with full AOSC control. + * + * This is the v2.1-focused counterpart of `examples/live-mic-diarized.js`. + * Both files share the same duplex pattern (two `runStreaming()` + * sessions fanned from a single sox capture, with the ASR transcript + * tagged by the latest Sortformer speaker_id). What this file adds is + * explicit CLI control of the AOSC (Audio-Online Speaker Cache) knobs + * parakeet-cpp exposes for v2.1 Sortformer streaming: + * + * --spk-cache-enable {true|false} Toggle AOSC. Defaults to true. + * Set false to force a v2.1 GGUF + * onto the v1 sliding-window + * path (A/B comparison). + * --spk-cache-len Long-term speaker-cache rows + * (default 188 ≈ 15 s). + * --fifo-len FIFO warmup buffer rows + * (default 188). + * --chunk-left-context-ms Encoder left context, ~1 frame + * (default 80). + * --chunk-right-context-ms Encoder right context, ~7 frames + * (default 560). Adds directly to + * per-chunk emission latency. + * --spk-cache-update-period FIFO-overflow pop-out count + * (default 144). How many frames + * get promoted into the long-term + * cache each time the FIFO fills. + * + * Background -- what AOSC fixes: + * v1 / v2 Sortformer streams use a fixed-size sliding-history window + * inside the engine. Once two voices have been seen, the model's + * per-chunk decisions are permutation-invariant; if one speaker goes + * silent long enough to roll out of the window, its slot identity can + * silently drift onto a different physical voice when it returns. 
v2.1 + * replaces the sliding window with a NeMo-port speaker cache that + * anchors each slot to its accumulated embedding, so the same physical + * speaker comes back to the same `Speaker N` tag across silences. + * + * For the upstream API + algorithm details, see + * `parakeet-cpp/include/parakeet/diarization.h` and the upstream PRs + * that introduced this feature in qvac-ext-lib-whisper.cpp (PR #22 + * commit e6ba38c, PR #24 commit 08df2e7). + * + * Usage: + * bare examples/live-mic-diarized-aosc.js \ + * --asr-model \ + * --diar-model \ + * [--accumulate] [--chunk-ms ] [--capture ""] \ + * [--spk-cache-enable {true|false}] [--spk-cache-len ] \ + * [--fifo-len ] [--chunk-left-context-ms ] \ + * [--chunk-right-context-ms ] [--spk-cache-update-period ] + * + * Notes: + * - The AOSC knobs are silently ignored on v1/v2 GGUFs and on + * non-Sortformer models. The engine detects v2.1 via the GGUF + * metadata tag `parakeet.model_variant`. + * - On Windows, if sox exits without producing audio, override capture: + * --capture "sox -t waveaudio default -t raw -r 16000 -b 16 -c 1 -e signed-integer -L -" + */ + +/* global Bare */ +const path = require('bare-path') +const process = require('bare-process') +const subprocess = require('bare-subprocess') +const TranscriptionParakeet = require('../index.js') +const addonLogging = require('../addonLogging.js') +const { setupLogger, validatePaths, pushableStream } = require('./utils.js') + +const CAPTURE_CMD = 'sox -d -t raw -r 16000 -b 16 -c 1 -e signed-integer -L -' + +const SILENCE_SENTINELS = new Set([ + '[No speech detected]', + '[Audio too short]', + '[Model not ready]', + '[No speakers detected]' +]) + +function isSilenceText (text) { + return text.length === 0 || SILENCE_SENTINELS.has(text) +} + +function buildSegmentText (items) { + let text = '' + let firstStartsWord = true + let isFirst = true + for (const s of items) { + if (!s || !s.text || !s.toAppend) continue + const sw = s.startsWord !== false + if (isFirst) 
{ + firstStartsWord = sw + text = s.text + isFirst = false + } else { + text += (sw ? ' ' : '') + s.text + } + } + return { text: text.replace(/\s+/g, ' '), firstStartsWord } +} + +function parseSortformerSpeakerId (text) { + const m = typeof text === 'string' + ? text.match(/Speaker\s+(\d+)/) + : null + return m ? parseInt(m[1], 10) : -1 +} + +function parseBoolFlag (value) { + if (value === undefined || value === null) return undefined + const normalised = String(value).toLowerCase() + if (normalised === 'true' || normalised === '1' || normalised === 'yes') return true + if (normalised === 'false' || normalised === '0' || normalised === 'no') return false + return undefined +} + +function parsePositiveInt (value) { + const n = parseInt(value, 10) + return Number.isFinite(n) && n > 0 ? n : null +} + +function parseNonNegativeInt (value) { + const n = parseInt(value, 10) + return Number.isFinite(n) && n >= 0 ? n : null +} + +function parseArgs () { + const args = { + asrModel: null, + diarModel: null, + accumulate: false, + capture: null, + chunkMs: null, + spkCacheEnable: undefined, + spkCacheLen: null, + fifoLen: null, + chunkLeftContextMs: null, + chunkRightContextMs: null, + spkCacheUpdatePeriod: null + } + const argv = Bare.argv.slice(2) + for (let i = 0; i < argv.length; i++) { + const a = argv[i] + if (a === '--asr-model' || a === '-m') args.asrModel = argv[++i] + else if (a === '--diar-model' || a === '-d') args.diarModel = argv[++i] + else if (a === '--accumulate') args.accumulate = true + else if (a === '--capture' || a === '-c') args.capture = argv[++i] + else if (a === '--chunk-ms') { + const v = parsePositiveInt(argv[++i]) + if (v !== null && v >= 200) args.chunkMs = v + } else if (a === '--spk-cache-enable') { + const v = parseBoolFlag(argv[++i]) + if (v !== undefined) args.spkCacheEnable = v + } else if (a === '--spk-cache-len') args.spkCacheLen = parsePositiveInt(argv[++i]) + else if (a === '--fifo-len') args.fifoLen = parsePositiveInt(argv[++i]) + 
else if (a === '--chunk-left-context-ms') args.chunkLeftContextMs = parseNonNegativeInt(argv[++i]) + else if (a === '--chunk-right-context-ms') args.chunkRightContextMs = parseNonNegativeInt(argv[++i]) + else if (a === '--spk-cache-update-period') args.spkCacheUpdatePeriod = parsePositiveInt(argv[++i]) + } + return args +} + +function buildDiarConfig (args) { + const config = { + streaming: true, + streamingChunkMs: args.chunkMs ?? 2000, + useGPU: true + } + if (args.spkCacheEnable !== undefined) config.streamingSpkCacheEnable = args.spkCacheEnable + if (args.spkCacheLen !== null) config.streamingSpkCacheLen = args.spkCacheLen + if (args.fifoLen !== null) config.streamingFifoLen = args.fifoLen + if (args.chunkLeftContextMs !== null) config.streamingChunkLeftContextMs = args.chunkLeftContextMs + if (args.chunkRightContextMs !== null) config.streamingChunkRightContextMs = args.chunkRightContextMs + if (args.spkCacheUpdatePeriod !== null) config.streamingSpkCacheUpdatePeriod = args.spkCacheUpdatePeriod + return config +} + +function describeAoscConfig (config) { + const parts = [] + if ('streamingSpkCacheEnable' in config) parts.push(`spkCacheEnable=${config.streamingSpkCacheEnable}`) + if ('streamingSpkCacheLen' in config) parts.push(`spkCacheLen=${config.streamingSpkCacheLen}`) + if ('streamingFifoLen' in config) parts.push(`fifoLen=${config.streamingFifoLen}`) + if ('streamingChunkLeftContextMs' in config) parts.push(`chunkLeftContextMs=${config.streamingChunkLeftContextMs}`) + if ('streamingChunkRightContextMs' in config) parts.push(`chunkRightContextMs=${config.streamingChunkRightContextMs}`) + if ('streamingSpkCacheUpdatePeriod' in config) parts.push(`spkCacheUpdatePeriod=${config.streamingSpkCacheUpdatePeriod}`) + return parts.length === 0 ? 
'(all AOSC defaults)' : parts.join(' ') +} + +async function main () { + const args = parseArgs() + if (!args.asrModel || !args.diarModel) { + console.error('Usage: bare examples/live-mic-diarized-aosc.js --asr-model --diar-model [--accumulate] [--chunk-ms ] [--capture ""] [--spk-cache-enable {true|false}] [--spk-cache-len ] [--fifo-len ] [--chunk-left-context-ms ] [--chunk-right-context-ms ] [--spk-cache-update-period ]') + process.exit(1) + } + + setupLogger(addonLogging) + let stopping = false + + const asrPath = path.resolve(args.asrModel) + const diarPath = path.resolve(args.diarModel) + if (!validatePaths({ model: asrPath })) { addonLogging.releaseLogger(); process.exit(1) } + if (!validatePaths({ model: diarPath })) { addonLogging.releaseLogger(); process.exit(1) } + + console.log(`Loading ASR: ${asrPath}`) + console.log(`Loading DIAR: ${diarPath}`) + + const diarConfig = buildDiarConfig(args) + console.log(`AOSC config: ${describeAoscConfig(diarConfig)}`) + + const asr = new TranscriptionParakeet({ + files: { model: asrPath }, + config: { + parakeetConfig: { + streaming: true, + streamingChunkMs: args.chunkMs ?? 2000, + useGPU: true + } + } + }) + const diar = new TranscriptionParakeet({ + files: { model: diarPath }, + config: { parakeetConfig: diarConfig } + }) + + await asr.load() + await diar.load() + console.log('Listening (Ctrl-C to stop)...\n') + + const captureCmd = args.capture && args.capture.length > 0 ? 
args.capture : CAPTURE_CMD + const [captureBin, ...captureArgs] = captureCmd.split(' ') + let child + try { + child = subprocess.spawn(captureBin, captureArgs, + { stdio: ['ignore', 'pipe', 'pipe'] }) + } catch (err) { + if (err && err.code === 'ENOENT') { + console.error(`\n'${captureBin}' not found on PATH.`) + console.error('Install sox (brew install sox / apt install sox / choco install sox / winget install ChrisBagwell.SoX).') + } else { + console.error(`\nFailed to spawn capture command: ${err.message}`) + } + addonLogging.releaseLogger() + process.exit(1) + } + child.on('error', (err) => { + console.error(`\nCapture command failed: ${err.message}`) + process.exit(1) + }) + + let firstAudioSeen = false + let stderrBuf = '' + child.stderr.on('data', (chunk) => { + stderrBuf += chunk.toString('utf8') + if (stderrBuf.length > 8192) stderrBuf = stderrBuf.slice(-8192) + }) + + let lineOpen = false + let lineSpeaker = null + let lastSpeaker = -1 + + function flushLine () { + if (lineOpen) { + process.stdout.write('\n') + lineOpen = false + lineSpeaker = null + } + } + function emitTranscript (speaker, text, firstStartsWord) { + if (isSilenceText(text)) { + if (args.accumulate) flushLine() + return + } + const tag = speaker >= 0 ? `speaker_${speaker}` : 'speaker_?' + const ts = new Date().toISOString().slice(11, 19) + if (args.accumulate) { + if (lineOpen && lineSpeaker !== speaker) flushLine() + if (!lineOpen) { + process.stdout.write(`[${ts}] ${tag}: ${text}`) + lineOpen = true + lineSpeaker = speaker + } else { + process.stdout.write((firstStartsWord ? 
' ' : '') + text) + } + } else { + console.log(`[${ts}] ${tag}: ${text}`) + } + } + + const asrStream = pushableStream() + const diarStream = pushableStream() + child.stdout.on('data', (chunk) => { + if (!firstAudioSeen) firstAudioSeen = true + if (stopping) return + asrStream.push(chunk) + diarStream.push(chunk) + }) + + const streamingConfig = {} + if (args.chunkMs !== null) streamingConfig.chunkMs = args.chunkMs + + const diarRunPromise = (async () => { + const response = await diar.runStreaming(diarStream, streamingConfig) + await response + .onUpdate(out => { + const items = Array.isArray(out) ? out : [out] + for (let i = items.length - 1; i >= 0; i--) { + const s = items[i] + if (!s || !s.text || isSilenceText(s.text)) continue + const id = parseSortformerSpeakerId(s.text) + if (id >= 0) { + lastSpeaker = id + break + } + } + }) + .await() + })() + + const asrRunPromise = (async () => { + const response = await asr.runStreaming(asrStream, streamingConfig) + await response + .onUpdate(out => { + const items = Array.isArray(out) ? 
out : [out] + const { text, firstStartsWord } = buildSegmentText(items) + emitTranscript(lastSpeaker, text.trim(), firstStartsWord) + }) + .await() + })() + + async function shutdown () { + if (stopping) return + stopping = true + console.log('\nStopping...') + try { child.kill('SIGTERM') } catch (e) { /* ignore */ } + asrStream.end() + diarStream.end() + try { await Promise.all([asrRunPromise, diarRunPromise]) } catch (e) { /* swallow */ } + flushLine() + try { await asr.unload() } catch (e) { /* ignore */ } + try { await diar.unload() } catch (e) { /* ignore */ } + addonLogging.releaseLogger() + process.exit(0) + } + + process.once('SIGINT', shutdown) + process.once('SIGTERM', shutdown) + child.on('exit', (code, signal) => { + if (!firstAudioSeen && !stopping) { + console.error(`\nCapture command exited before producing audio (code=${code}, signal=${signal}).`) + const tail = stderrBuf.trim() + if (tail) { + console.error('--- sox stderr ---') + console.error(tail) + console.error('------------------') + } + console.error('Hints:') + console.error(' - On Windows, try: --capture "sox -t waveaudio default -t raw -r 16000 -b 16 -c 1 -e signed-integer -L -"') + console.error(' - Verify a default recording device exists (Settings -> System -> Sound -> Input).') + console.error(' - Confirm SoX can list audio devices: sox -V6 -d -t raw -r 16000 -c 1 -e signed-integer -b 16 -L - 2>&1 | head') + } + shutdown() + }) +} + +main().catch(err => { + console.error('Error:', err) + addonLogging.releaseLogger() + process.exit(1) +}) diff --git a/packages/transcription-parakeet/examples/live-mic-diarized.js b/packages/transcription-parakeet/examples/live-mic-diarized.js index 47808ea2f4..ada13e5b32 100644 --- a/packages/transcription-parakeet/examples/live-mic-diarized.js +++ b/packages/transcription-parakeet/examples/live-mic-diarized.js @@ -10,16 +10,27 @@ * Sortformer segment; the ASR side tags each printed transcript with * `lastSpeaker`. Press Ctrl-C to flush and exit. 
* - * Diarization tagging is best-effort. Sortformer's streaming session - * is permutation-invariant per chunk and prone to occasional - * speaker-ID drift on continuous single-speaker stretches once two - * voices have been seen in the rolling-history window. parakeet-cpp - * documents this behaviour in + * Recommended `--diar-model`: the v2.1 Sortformer GGUF + * (`diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`). parakeet-cpp + * detects v2.1 from the GGUF metadata tag + * `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"` and + * enables AOSC (Audio-Online Speaker Cache) automatically, which + * anchors speaker slots across silence and re-entry and largely + * removes the drift caveat described below. + * + * For an AOSC-aware variant that also exposes the speaker-cache + * tuning knobs from the CLI, see `examples/live-mic-diarized-aosc.js`. + * + * v1 caveat (kept for users running the older v1 GGUF): Sortformer's + * streaming session is permutation-invariant per chunk and prone to + * occasional speaker-ID drift on continuous single-speaker stretches + * once two voices have been seen in the rolling-history window. + * parakeet-cpp documents this behaviour in * `parakeet-cpp/include/parakeet/diarization.h:80-82`. Fixing it - * properly requires per-segment voice embeddings (currently not - * exposed by the engine) -- this example therefore renders the raw - * Sortformer ID and accepts the occasional mis-tag rather than try - * to second-guess the model in JS. + * properly requires per-segment voice embeddings, which the v1 path + * does not expose (v2.1's AOSC addresses this) -- on v1 this example + * therefore renders the raw Sortformer ID and accepts the occasional + * mis-tag rather than try to second-guess the model in JS. * * Usage: * bare examples/live-mic-diarized.js \ @@ -98,9 +109,14 @@ function parseArgs () { } // Pin the Sortformer rolling-history window at parakeet-cpp's default -// (30 s). 
Pushing past it puts the input outside the window the -// underlying model was trained on, which empirically causes the engine -// to collapse all voices onto sortformer_0. +// (30 s). Pushing past it on a v1 GGUF puts the input outside the +// window the underlying model was trained on, which empirically causes +// the engine to collapse all voices onto sortformer_0. +// +// On a v2.1 GGUF, AOSC is auto-enabled and supersedes this rolling +// window with a NeMo-port speaker cache. parakeet-cpp ignores +// `history_ms` for v2.1 sessions, so this constant is harmless either +// way and is kept for backwards compatibility with v1 GGUFs. const STREAMING_HISTORY_MS = 30000 // Pull the Sortformer speaker_id out of the addon's segment text diff --git a/packages/transcription-parakeet/index.d.ts b/packages/transcription-parakeet/index.d.ts index c99897efdc..cf8d38f5ee 100644 --- a/packages/transcription-parakeet/index.d.ts +++ b/packages/transcription-parakeet/index.d.ts @@ -71,6 +71,30 @@ declare interface ParakeetConfig { * (2000 ms). ASR sessions only. */ streamingRightLookaheadMs?: number + + /** + * AOSC (Audio-Online Speaker Cache): enable v2.1 Sortformer's + * speaker-cache streaming. Ignored on v1/v2 Sortformer GGUFs and on + * non-Sortformer models. Set false to force a v2.1 model onto the + * v1 sliding-window path (e.g. for A/B comparison). Default: true. + * + * The cache anchors each speaker to a stable slot across silence and + * re-entry, fixing the per-chunk permutation-invariance drift that v1 + * suffers from when two voices have been seen in the rolling window. + * v2.1 is auto-detected from the GGUF metadata tag + * `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`. + */ + streamingSpkCacheEnable?: boolean + /** AOSC: long-term speaker-cache rows (~15 s of encoder frames). Default: 188. */ + streamingSpkCacheLen?: number + /** AOSC: FIFO warmup buffer rows. Default: 188. 
*/ + streamingFifoLen?: number + /** AOSC: encoder left-context window (ms; ~1 encoder frame). Default: 80. */ + streamingChunkLeftContextMs?: number + /** AOSC: encoder right-context window (ms; ~7 encoder frames). Default: 560. */ + streamingChunkRightContextMs?: number + /** AOSC: FIFO-overflow pop-out count. Default: 144. */ + streamingSpkCacheUpdatePeriod?: number /** * Directory the native addon scans for dynamically-loaded ggml * backend libraries (`libqvac-speech-ggml-vulkan.so`, @@ -196,6 +220,18 @@ declare interface StreamingRunConfig { emitPartials?: boolean /** CTC/TDT-only energy-VAD events. */ emitEnergyVad?: boolean + /** AOSC: enable/disable v2.1 speaker cache (overrides `streamingSpkCacheEnable`). */ + spkCacheEnable?: boolean + /** AOSC: long-term speaker-cache rows (overrides `streamingSpkCacheLen`). */ + spkCacheLen?: number + /** AOSC: FIFO warmup buffer rows (overrides `streamingFifoLen`). */ + fifoLen?: number + /** AOSC: encoder left-context window in ms (overrides `streamingChunkLeftContextMs`). */ + chunkLeftContextMs?: number + /** AOSC: encoder right-context window in ms (overrides `streamingChunkRightContextMs`). */ + chunkRightContextMs?: number + /** AOSC: FIFO-overflow pop-out count (overrides `streamingSpkCacheUpdatePeriod`). */ + spkCacheUpdatePeriod?: number } /** diff --git a/packages/transcription-parakeet/index.js b/packages/transcription-parakeet/index.js index 9b65e806ab..f335647367 100644 --- a/packages/transcription-parakeet/index.js +++ b/packages/transcription-parakeet/index.js @@ -126,6 +126,18 @@ class TranscriptionParakeet { streamingEnergyVad: this.params.streamingEnergyVad === true, streamingLeftContextMs: this.params.streamingLeftContextMs ?? -1, streamingRightLookaheadMs: this.params.streamingRightLookaheadMs ?? -1, + // AOSC (v2.1+ Sortformer only). parakeet-cpp ignores these on + // non-Sortformer engines and on v1/v2 GGUFs. 
Defaults mirror the + // C++ ParakeetConfig defaults; passing the field explicitly (vs + // letting C++ pick its own default) ensures user overrides at + // the JS layer reach the native engine instead of being silently + // discarded by _buildConfigurationParams. + streamingSpkCacheEnable: this.params.streamingSpkCacheEnable !== false, + streamingSpkCacheLen: this.params.streamingSpkCacheLen ?? 188, + streamingFifoLen: this.params.streamingFifoLen ?? 188, + streamingChunkLeftContextMs: this.params.streamingChunkLeftContextMs ?? 80, + streamingChunkRightContextMs: this.params.streamingChunkRightContextMs ?? 560, + streamingSpkCacheUpdatePeriod: this.params.streamingSpkCacheUpdatePeriod ?? 144, // Forwarded as-is; ParakeetInterface fills in a per-package // default for `backendsDir` (`path.join(__dirname, 'prebuilds')`) // when the host doesn't pass one, so explicit `undefined` diff --git a/packages/transcription-parakeet/package.json b/packages/transcription-parakeet/package.json index 59fb9fc9d1..04f8c3d728 100644 --- a/packages/transcription-parakeet/package.json +++ b/packages/transcription-parakeet/package.json @@ -1,6 +1,6 @@ { "name": "@qvac/transcription-parakeet", - "version": "0.4.0", + "version": "0.6.0", "description": "High-performance speech-to-text inference addon using NVIDIA Parakeet models for Bare runtime", "addon": true, "engines": { diff --git a/packages/transcription-parakeet/parakeet.js b/packages/transcription-parakeet/parakeet.js index bcd2dcb04d..541d5f9055 100644 --- a/packages/transcription-parakeet/parakeet.js +++ b/packages/transcription-parakeet/parakeet.js @@ -59,6 +59,20 @@ class ParakeetInterface { * left context (parakeet default 10000 ms; -1 keeps the engine default). * @param {number} [configurationParams.streamingRightLookaheadMs] - ASR encoder * right lookahead (parakeet default 2000 ms; -1 keeps the engine default). 
+ * @param {boolean} [configurationParams.streamingSpkCacheEnable=true] - AOSC: + * enable v2.1 Sortformer speaker-cache streaming. Ignored on v1/v2 GGUFs + * and on non-Sortformer models. Set false to force the v1 sliding-window + * path on a v2.1 model (A/B comparison). + * @param {number} [configurationParams.streamingSpkCacheLen=188] - AOSC: + * long-term speaker-cache rows (~15 s of encoder frames). + * @param {number} [configurationParams.streamingFifoLen=188] - AOSC: FIFO + * warmup buffer rows. + * @param {number} [configurationParams.streamingChunkLeftContextMs=80] - + * AOSC: encoder left-context window (ms; ~1 encoder frame). + * @param {number} [configurationParams.streamingChunkRightContextMs=560] - + * AOSC: encoder right-context window (ms; ~7 encoder frames). + * @param {number} [configurationParams.streamingSpkCacheUpdatePeriod=144] - + * AOSC: FIFO-overflow pop-out count. * @param {string} [configurationParams.backendsDir] - root directory * for dynamically-loaded ggml backends. 
JS defaults to * `/prebuilds`; the native addon appends @@ -494,6 +508,12 @@ class ParakeetInterface { * @param {number} [config.rightLookaheadMs] - ASR encoder right lookahead (overrides cfg.streamingRightLookaheadMs) * @param {boolean} [config.emitPartials] - emit partial segments on chunk boundaries * @param {boolean} [config.emitEnergyVad] - surface energy-VAD events for CTC/TDT + * @param {boolean} [config.spkCacheEnable] - AOSC: enable/disable v2.1 speaker cache (overrides cfg.streamingSpkCacheEnable) + * @param {number} [config.spkCacheLen] - AOSC: long-term speaker-cache rows (overrides cfg.streamingSpkCacheLen) + * @param {number} [config.fifoLen] - AOSC: FIFO warmup buffer rows (overrides cfg.streamingFifoLen) + * @param {number} [config.chunkLeftContextMs] - AOSC: encoder left-context window in ms (overrides cfg.streamingChunkLeftContextMs) + * @param {number} [config.chunkRightContextMs] - AOSC: encoder right-context window in ms (overrides cfg.streamingChunkRightContextMs) + * @param {number} [config.spkCacheUpdatePeriod] - AOSC: FIFO-overflow pop-out count (overrides cfg.streamingSpkCacheUpdatePeriod) * @returns {Promise} jobId assigned to the streaming session */ async startStreaming (config = {}) { diff --git a/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py b/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py index c707ee1788..e18d9f8467 100644 --- a/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py +++ b/packages/transcription-parakeet/scripts/convert-nemo-to-gguf.py @@ -217,7 +217,24 @@ def fuse_bn(weight, bias, running_mean, running_var, eps=1e-5): return scale.astype(np.float32), shift.astype(np.float32) -def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str): +def detect_sortformer_variant(ckpt: Path) -> str: + """ + Map a NeMo Sortformer .nemo filename to a stable variant tag the C++ + loader can match against. 
The tag is the only thing that distinguishes + cache-aware v2.1 from architecturally-identical v1 / v2 at GGUF time + (encoder shape alone is ambiguous against future variants). + """ + stem = ckpt.stem + if "streaming_sortformer" in stem and "-v2.1" in stem: + return "sortformer-streaming-v2.1-aosc" + if "streaming_sortformer" in stem and "-v2" in stem: + return "sortformer-streaming-v2" + if "diar_sortformer" in stem and "-v1" in stem: + return "sortformer-v1" + return "" + + +def write_gguf(out: Path, ckpt: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str): model_type = detect_model_type(cfg) enc = cfg["encoder"] @@ -349,6 +366,12 @@ def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str): writer.add_uint32("parakeet.sortformer.tf_n_heads", int(tfe["num_attention_heads"])) writer.add_bool ("parakeet.sortformer.tf_pre_ln", bool(tfe.get("pre_ln", False))) writer.add_string("parakeet.sortformer.tf_hidden_act", str(tfe.get("hidden_act", "relu"))) + # Variant tag (preferred over shape-based detection on the C++ side). + # Empty string = unknown checkpoint; loader falls back to encoder + # shape so older GGUFs continue to load. 
+ variant = detect_sortformer_variant(ckpt) + if variant: + writer.add_string("parakeet.model_variant", variant) else: pred_hidden = int(dec["prednet"]["pred_hidden"]) pred_rnn_layers = int(dec["prednet"]["pred_rnn_layers"]) @@ -628,7 +651,7 @@ def main(): ckpt = ensure_ckpt(args.ckpt, args.hf_repo) cfg, sd, tok_bytes = load_nemo(ckpt) args.out.parent.mkdir(parents=True, exist_ok=True) - write_gguf(args.out, cfg, sd, tok_bytes, args.quant) + write_gguf(args.out, ckpt, cfg, sd, tok_bytes, args.quant) if __name__ == "__main__": diff --git a/packages/transcription-parakeet/scripts/convert-nemo.sh b/packages/transcription-parakeet/scripts/convert-nemo.sh index cd7be608bd..33de47fb53 100644 --- a/packages/transcription-parakeet/scripts/convert-nemo.sh +++ b/packages/transcription-parakeet/scripts/convert-nemo.sh @@ -17,7 +17,8 @@ # ./scripts/convert-nemo.sh [flags] # # Flags: -# --type, -t Which model(s) (default: all) +# --type, -t +# Which model(s) (default: all) # --quant, -q Quant tier (default: q8_0) # --python Python interpreter (default: # $PYTHON, then ./venv/bin/python, @@ -62,8 +63,8 @@ while [[ $# -gt 0 ]]; do done case "$TYPE" in - ctc|tdt|eou|sortformer|all) ;; - *) echo "Error: --type must be ctc|tdt|eou|sortformer|all" >&2; exit 2;; + ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all) ;; + *) echo "Error: --type must be ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all" >&2; exit 2;; esac case "$QUANT" in f32|f16|q8_0|q5_0|q4_0) ;; @@ -128,6 +129,7 @@ nemo_filename() { tdt) echo "parakeet-tdt-0.6b-v3.nemo";; eou) echo "parakeet_realtime_eou_120m-v1.nemo";; sortformer) echo "diar_sortformer_4spk-v1.nemo";; + sortformer-streaming-v2.1) echo "diar_streaming_sortformer_4spk-v2.1.nemo";; esac } gguf_filename() { @@ -137,6 +139,7 @@ gguf_filename() { tdt) echo "parakeet-tdt-0.6b-v3.${q}.gguf";; eou) echo "parakeet-eou-120m-v1.${q}.gguf";; sortformer) echo "sortformer-4spk-v1.${q}.gguf";; + sortformer-streaming-v2.1) echo 
"diar_streaming_sortformer_4spk-v2.1.${q}.gguf";; esac } @@ -196,7 +199,7 @@ echo failures=0 if [[ "$TYPE" == "all" ]]; then - for t in ctc tdt eou sortformer; do + for t in ctc tdt eou sortformer sortformer-streaming-v2.1; do convert_one "$t" || failures=$((failures + 1)) done else diff --git a/packages/transcription-parakeet/scripts/download-models.sh b/packages/transcription-parakeet/scripts/download-models.sh index 5b2404117f..d9eeb00c05 100755 --- a/packages/transcription-parakeet/scripts/download-models.sh +++ b/packages/transcription-parakeet/scripts/download-models.sh @@ -12,7 +12,8 @@ # ./scripts/download-models.sh [flags] # # Flags: -# --type, -t Which model(s) (default: all) +# --type, -t +# Which model(s) (default: all) # --output, -o Destination dir (default: ./models/nemo) # --force, -f Re-download even if present # --help, -h Show this help @@ -43,8 +44,8 @@ while [[ $# -gt 0 ]]; do done case "$TYPE" in - ctc|tdt|eou|sortformer|all) ;; - *) echo "Error: --type must be ctc|tdt|eou|sortformer|all" >&2; exit 2;; + ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all) ;; + *) echo "Error: --type must be ctc|tdt|eou|sortformer|sortformer-streaming-v2.1|all" >&2; exit 2;; esac # Map model type -> { hf_repo, nemo_filename } @@ -54,6 +55,7 @@ nemo_url() { tdt) echo "https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3/resolve/main/parakeet-tdt-0.6b-v3.nemo";; eou) echo "https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1/resolve/main/parakeet_realtime_eou_120m-v1.nemo";; sortformer) echo "https://huggingface.co/nvidia/diar_sortformer_4spk-v1/resolve/main/diar_sortformer_4spk-v1.nemo";; + sortformer-streaming-v2.1) echo "https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1/resolve/main/diar_streaming_sortformer_4spk-v2.1.nemo";; esac } nemo_filename() { @@ -95,7 +97,7 @@ echo "Output: ${OUTPUT_DIR}" echo if [[ "$TYPE" == "all" ]]; then - for t in ctc tdt eou sortformer; do + for t in ctc tdt eou sortformer sortformer-streaming-v2.1; do 
fetch_nemo "$t" done else diff --git a/packages/transcription-parakeet/test/integration/helpers.js b/packages/transcription-parakeet/test/integration/helpers.js index 27960763bf..0d34f03245 100644 --- a/packages/transcription-parakeet/test/integration/helpers.js +++ b/packages/transcription-parakeet/test/integration/helpers.js @@ -802,6 +802,19 @@ const MODEL_CONFIGS = { mobileFile: 'sortformer-4spk-v1.q4_0.gguf', minSize: 50 * 1024 * 1024, url: null + }, + // Streaming-default Sortformer (v2.1 + NeMo-port AOSC). The AOSC + // speaker cache anchors slot identity across silence and re-entry, + // fixing the per-chunk drift v1 shows when two voices have been seen + // in the rolling-history window. Auto-enabled by parakeet-cpp when the + // GGUF carries `parakeet.model_variant == "sortformer-streaming-v2.1-aosc"`. + // The GGUF needs to be staged (npm run setup-models / QVAC_TEST_GGUF_DIR) + // before sortformer-streaming tests can run; otherwise they skip. + sortformerStreaming: { + file: 'diar_streaming_sortformer_4spk-v2.1.q8_0.gguf', + mobileFile: 'diar_streaming_sortformer_4spk-v2.1.q4_0.gguf', + minSize: 50 * 1024 * 1024, + url: null } } diff --git a/packages/transcription-parakeet/test/integration/sortformer-aosc-streaming.test.js b/packages/transcription-parakeet/test/integration/sortformer-aosc-streaming.test.js new file mode 100644 index 0000000000..f3749349a9 --- /dev/null +++ b/packages/transcription-parakeet/test/integration/sortformer-aosc-streaming.test.js @@ -0,0 +1,222 @@ +'use strict' + +/** + * Sortformer v2.1 + AOSC streaming integration test. + * + * Verifies that: + * 1. The v2.1 Sortformer GGUF loads and the JS-side AOSC config + * knobs flow through the native binding without errors. + * 2. A streaming diarization session with default AOSC config emits + * well-formed speaker segments matching the + * "Speaker N: HH:MM:SS.fff - HH:MM:SS.fff" pattern that the + * offline diarization path also produces. + * 3. 
Forcing `streamingSpkCacheEnable: false` on the same v2.1 GGUF + * falls back to the v1 sliding-window path cleanly (still emits + * segments; just without the AOSC stability guarantees). + * + * The full AOSC slot-stability contract (same speaker -> same hyp_ + * across non-contiguous re-entries) is verified at C++ level by + * `parakeet-cpp/test/test_sortformer_aosc_speakers.cpp` using the + * `abcba.wav` / `abcdba.wav` fixtures. This JS-level test focuses on + * wiring correctness; if it passes, the AOSC knobs are reaching the + * engine and parakeet-cpp's own regression tests cover the runtime + * behaviour. + * + * Skips cleanly when the v2.1 GGUF is missing + * (`MODEL_CONFIGS.sortformerStreaming`); the file isn't bundled with + * the repo -- stage it via `npm run setup-models` or by pointing + * `QVAC_TEST_GGUF_DIR` at a directory containing + * `diar_streaming_sortformer_4spk-v2.1.q8_0.gguf`. + */ + +const test = require('brittle') +const fs = require('bare-fs') +const path = require('bare-path') +const { + binding, + TranscriptionParakeet, + setupJsLogger, + getTestPaths, + loadGgufOrSkip +} = require('./helpers.js') + +const { samplesDir } = getTestPaths() + +const SAMPLE_RATE = 16000 +const STREAM_CHUNK_MS = 2000 +const FEED_CHUNK_MS = 500 + +function loadAudioSample () { + const samplePath = path.join(samplesDir, 'sample.raw') + if (!fs.existsSync(samplePath)) return null + const rawBuffer = fs.readFileSync(samplePath) + const pcm = new Int16Array( + rawBuffer.buffer, rawBuffer.byteOffset, rawBuffer.length / 2) + const audio = new Float32Array(pcm.length) + for (let i = 0; i < pcm.length; i++) audio[i] = pcm[i] / 32768.0 + return audio +} + +function pushableStream () { + const queue = [] + let waiter = null + let ended = false + return { + push (chunk) { + if (ended) return + queue.push(chunk) + if (waiter) { const w = waiter; waiter = null; w() } + }, + end () { + ended = true + if (waiter) { const w = waiter; waiter = null; w() } + }, + async * 
[Symbol.asyncIterator] () { + while (true) { + if (queue.length > 0) { yield queue.shift(); continue } + if (ended) return + await new Promise(resolve => { waiter = resolve }) + } + } + } +} + +async function feedAndCollect (model, audio) { + const samplesPerChunk = Math.floor((FEED_CHUNK_MS / 1000) * SAMPLE_RATE) + const stream = pushableStream() + const segments = [] + + const response = await model.runStreaming(stream) + const updateDone = response + .onUpdate(out => { + const items = Array.isArray(out) ? out : [out] + for (const seg of items) { + if (!seg || !seg.text) continue + segments.push(seg) + } + }) + .await() + + for (let i = 0; i < audio.length; i += samplesPerChunk) { + const endIdx = Math.min(i + samplesPerChunk, audio.length) + const chunk = new Float32Array(audio.slice(i, endIdx)) + stream.push(chunk) + if (i + samplesPerChunk < audio.length) { + await new Promise(resolve => setTimeout(resolve, FEED_CHUNK_MS)) + } + } + stream.end() + await updateDone + + return segments +} + +// Pull "Speaker N" out of the addon's emitted text. Returns -1 when +// the text doesn't match (e.g. silence sentinels). Mirrors the parser +// used by examples/live-mic-diarized.js so the assertion below stays +// in sync with the actual contract consumers rely on. +function parseSpeakerId (text) { + const m = typeof text === 'string' ? text.match(/Speaker\s+(\d+)/) : null + return m ? 
parseInt(m[1], 10) : -1 +} + +test('Sortformer v2.1 AOSC — default config streams diarization segments', + { timeout: 600000 }, async (t) => { + const loggerBinding = setupJsLogger(binding) + + try { + const modelPath = await loadGgufOrSkip(t, 'sortformerStreaming') + if (!modelPath) return + + const audio = loadAudioSample() + if (!audio) { t.pass('sample.raw not found - skipping'); return } + + const model = new TranscriptionParakeet({ + files: { model: modelPath }, + config: { + parakeetConfig: { + streaming: true, + streamingChunkMs: STREAM_CHUNK_MS, + // streamingSpkCacheEnable defaults to true; left unset so + // the AOSC default path runs as it would for real users. + maxThreads: 4, + useGPU: false + } + } + }) + + try { + await model.load() + const segments = await feedAndCollect(model, audio) + + t.ok(segments.length > 0, + `AOSC streaming should emit at least one segment (got ${segments.length})`) + + const speakerIds = segments + .map(s => parseSpeakerId(s.text)) + .filter(id => id >= 0) + t.ok(speakerIds.length > 0, + 'segments should match the "Speaker N: ..." 
format') + + const distinctIds = new Set(speakerIds) + console.log( + `[aosc/default] segments=${segments.length} ` + + `speakers=${distinctIds.size} ids=[${[...distinctIds].sort().join(',')}]`) + } finally { + try { await model.unload() } catch (e) { /* ignore */ } + } + } finally { + try { loggerBinding.releaseLogger() } catch (e) { /* ignore */ } + } + }) + +test('Sortformer v2.1 AOSC — streamingSpkCacheEnable=false falls back to v1 path', + { timeout: 600000 }, async (t) => { + const loggerBinding = setupJsLogger(binding) + + try { + const modelPath = await loadGgufOrSkip(t, 'sortformerStreaming') + if (!modelPath) return + + const audio = loadAudioSample() + if (!audio) { t.pass('sample.raw not found - skipping'); return } + + const model = new TranscriptionParakeet({ + files: { model: modelPath }, + config: { + parakeetConfig: { + streaming: true, + streamingChunkMs: STREAM_CHUNK_MS, + // Force the v1 sliding-window code path on the v2.1 GGUF. + // The engine must accept this without errors and continue + // to emit speaker segments; speaker IDs may drift in ways + // they would not with AOSC active. + streamingSpkCacheEnable: false, + maxThreads: 4, + useGPU: false + } + } + }) + + try { + await model.load() + const segments = await feedAndCollect(model, audio) + + t.ok(segments.length > 0, + 'v1-path streaming should still emit at least one segment ' + + `(got ${segments.length})`) + + const speakerIds = segments + .map(s => parseSpeakerId(s.text)) + .filter(id => id >= 0) + t.ok(speakerIds.length > 0, + 'segments should match the "Speaker N: ..." format') + + console.log(`[aosc/disabled] segments=${segments.length}`) + } finally { + try { await model.unload() } catch (e) { /* ignore */ } + } + } finally { + try { loggerBinding.releaseLogger() } catch (e) { /* ignore */ } + } + })
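
The `index.d.ts` and `index.js` hunks above describe three layers for each AOSC knob: a per-call `StreamingRunConfig` field, the matching `ParakeetConfig` field, and the documented default. A minimal sketch of that precedence follows. The helper `resolveAoscOptions` and the two lookup tables are hypothetical names introduced for illustration only; they are not part of the addon, and the real resolution may happen partly in the native layer.

```javascript
'use strict'

// Documented AOSC defaults, copied from the ParakeetConfig JSDoc in this diff.
const AOSC_DEFAULTS = Object.freeze({
  spkCacheEnable: true,
  spkCacheLen: 188,
  fifoLen: 188,
  chunkLeftContextMs: 80,
  chunkRightContextMs: 560,
  spkCacheUpdatePeriod: 144
})

// Per-call key -> ParakeetConfig key, as paired in the StreamingRunConfig docs.
const CONFIG_KEYS = Object.freeze({
  spkCacheEnable: 'streamingSpkCacheEnable',
  spkCacheLen: 'streamingSpkCacheLen',
  fifoLen: 'streamingFifoLen',
  chunkLeftContextMs: 'streamingChunkLeftContextMs',
  chunkRightContextMs: 'streamingChunkRightContextMs',
  spkCacheUpdatePeriod: 'streamingSpkCacheUpdatePeriod'
})

// Hypothetical helper (assumption, not the addon's API): picks the first
// defined value in the order per-call override, constructor config, default.
function resolveAoscOptions (parakeetConfig = {}, runConfig = {}) {
  const resolved = {}
  for (const key of Object.keys(AOSC_DEFAULTS)) {
    resolved[key] =
      runConfig[key] ??
      parakeetConfig[CONFIG_KEYS[key]] ??
      AOSC_DEFAULTS[key]
  }
  return resolved
}

// Example: the constructor disables the cache, one call re-enables it.
const effective = resolveAoscOptions(
  { streamingSpkCacheEnable: false, streamingSpkCacheLen: 256 },
  { spkCacheEnable: true }
)
console.log(effective)
```

Using `??` rather than `||` keeps an explicit `false` or `0` from being replaced by a lower-priority value, which matters for `spkCacheEnable: false` and for a zero left or right context.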