feat(QVAC-18625): Add better support for long-term streaming diarization with v2.1 by pratiknarola-t · Pull Request #22 · tetherto/qvac-ext-lib-whisper.cpp

pratiknarola-t · 2026-05-18T15:58:38Z

`feat(QVAC-18625): Sortformer v2.1 streaming with NeMo Audio-Online Speaker Cache (AOSC)`

TL;DR

Adds v2.1 streaming Sortformer to parakeet-cpp via a port of NeMo's Audio-Online Speaker Cache (AOSC) algorithm, plus a slot-continuity fix on the existing v1 streaming path, plus a regression test suite. Speakers now keep their original speaker_id even after long silences. The v1 path is bit-identical apart from the slot-continuity fix on its own streaming code.

The original problem

In v1 streaming, the session only carries a rolling history_ms window of audio context. If a speaker goes silent for longer than that window and then returns, the model has no memory of them — they get whatever speaker_id slot happens to be free, which may or may not match their original slot. The user sees what looks like a speaker swap mid-conversation.

The fix (model-side)

The newer v2 / v2.1 NeMo Sortformer introduced a speaker cache: a small set of acoustic embeddings retained per speaker across the entire session, plus a FIFO of recent embeddings, plus a per-session "silence profile" embedding. The model then processes a concatenated row

[ cache_rows | fifo_rows | (mel → subsampling) ]

through the FastConformer blocks every chunk, so it always sees a fresh attention window over every speaker it has ever heard in this session. After arbitrarily long silences, the same voice rebinds to its original slot.
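As a rough illustration of the concatenated row above, the three pieces can be spliced into one row-major buffer before the conformer stack runs. This is a minimal sketch, not the PR's code: `concat_rows` is a hypothetical helper, and the row-major `T x D` float layout is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): splice [cache | fifo | chunk] rows
// into one row-major (T_enc x D) buffer. Every input holds rows of width D.
std::vector<float> concat_rows(const std::vector<float> & cache_rows,
                               const std::vector<float> & fifo_rows,
                               const std::vector<float> & chunk_rows,
                               std::size_t D) {
    assert(D > 0);
    assert(cache_rows.size() % D == 0);
    assert(fifo_rows.size() % D == 0);
    assert(chunk_rows.size() % D == 0);

    std::vector<float> out;
    out.reserve(cache_rows.size() + fifo_rows.size() + chunk_rows.size());
    out.insert(out.end(), cache_rows.begin(), cache_rows.end());
    out.insert(out.end(), fifo_rows.begin(), fifo_rows.end());
    out.insert(out.end(), chunk_rows.begin(), chunk_rows.end());
    return out;  // T_enc = out.size() / D
}
```

The resulting row count grows with the cache and FIFO occupancy, which is why the encoder input length is not constant early in a session.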

For v1 (which doesn't have the cache architecture), this PR also adds a separate overlap-based slot-continuity remap so cross-chunk drift inside the history window is suppressed.

What's added

Full AOSC port of NeMo's sortformer_modules.py + sortformer_diar_models.py into parakeet-cpp. Helpers in C++ carry // matches NeMo <fn> at <line> comments: _compress_spkcache, _get_silence_profile, _disable_low_scores / _boost_topk_scores, streaming_update, forward_streaming_step, and run_encoder_bypass_pre_encode (feeds pre-subsampled embeddings straight into the conformer stack — required because the cache lives in post-subsampling space).
Encoder context windowing (chunk_left_context_ms=80, chunk_right_context_ms=560), defaults from NeMo's e2e_diarize_speech.py inference YAML.
v2.1 detection by encoder shape (17 layers / 128 mels); cache_active=false for v1, so the v1 code path is bit-identical.
Overlap-based slot-continuity remap on the v1 streaming path (separate fix, additive).
New ctest test-sortformer-aosc-speakers-{abcba,abcdba} asserting three invariants against an RTTM ground truth: (a) every reference speaker is covered, (b) re-entering speakers land in the same hyp_<id> they were first assigned to (the AOSC contract), (c) frame-level DER ≤ 30 %.
Two redistributable ElevenLabs-generated fixtures with hand-built ground-truth RTTMs (LIFO re-entry patterns).
SortformerStreamSession::aosc_active() — runtime introspection getter so callers can tell v1 from v2.1+AOSC.
examples/live-mic: banner branches on aosc_active(); v2.1 prints spkcache/fifo/lc/rc, v1 keeps the legacy history-window banner unchanged.
README / PROGRESS docs updated (new v2.1 row in the model table; "Shipped / Not in-repo" status corrected; new Phase 17 closing §11.11.2's reservation).
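The context-window defaults in the list above can be checked with a little arithmetic. The sketch below assumes one encoder frame covers 80 ms (inferred from the commit message pairing 1 frame with 80 ms and 7 frames with 560 ms) and 16 kHz audio (from the live-mic banner). `can_emit_chunk` is a hypothetical stand-in for the lookahead wait, not the engine's actual function.

```cpp
#include <cassert>
#include <cstdint>

constexpr int k_sample_rate_hz = 16000;  // live-mic banner: "16 kHz mono"
constexpr int k_frame_ms       = 80;     // assumption: 1 frame = 80 ms, 7 frames = 560 ms

constexpr int frames_to_ms(int frames) { return frames * k_frame_ms; }

constexpr std::int64_t ms_to_samples(int ms) {
    return static_cast<std::int64_t>(ms) * k_sample_rate_hz / 1000;
}

// Hypothetical readiness check: a chunk starting at chunk_start (in samples)
// may be emitted only once the chunk plus its right-context lookahead
// has been buffered.
bool can_emit_chunk(std::int64_t buffered_samples, std::int64_t chunk_start,
                    int chunk_ms, int right_context_ms) {
    assert(chunk_ms > 0 && right_context_ms >= 0);
    const std::int64_t needed_end =
        chunk_start + ms_to_samples(chunk_ms) + ms_to_samples(right_context_ms);
    return buffered_samples >= needed_end;
}
```

With a 2000 ms chunk and 560 ms of right context, the first chunk needs 40960 buffered samples, so the lookahead adds 560 ms of latency per emitted chunk.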

The biggest blocker: `mean_sil_emb`

NeMo's algorithm uses a special "what does silence sound like in this room" baseline embedding to fill cache slots that aren't yet bound to a real speaker. Without it, those slots stay at zeros — out-of-distribution input for the model — and it panics and collapses every chunk to a single slot. Once the runtime EMA update rule from sortformer_modules.py was matched in C++, the cache-aware forward started behaving.
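To make the idea concrete, here is a minimal sketch of maintaining a silence-profile embedding across chunks. It is not the ported code. The PR text calls the rule an EMA, while this sketch uses a plain count-weighted running mean as a stand-in. The silence test (summed speaker probability below `sil_threshold`) and the `SilenceProfile` struct are assumptions; only the names `mean_sil_emb`, `n_sil_frames`, and `sil_threshold` come from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical state; field names follow the PR text.
struct SilenceProfile {
    std::vector<float> mean_sil_emb;  // D floats, starts at zero
    long n_sil_frames = 0;
};

// Fold the silent frames of one chunk into the running mean.
// embs: T x D row-major embeddings; preds: T x S speaker probabilities.
// Assumption: a frame is silent when its summed speaker probability is
// below sil_threshold (0.2 in the defaults listed in the commit message).
void update_silence_profile(SilenceProfile & sp,
                            const std::vector<float> & embs,
                            const std::vector<float> & preds,
                            std::size_t T, std::size_t D, std::size_t S,
                            float sil_threshold) {
    assert(embs.size() == T * D && preds.size() == T * S);
    if (sp.mean_sil_emb.empty()) sp.mean_sil_emb.assign(D, 0.0f);
    assert(sp.mean_sil_emb.size() == D);

    for (std::size_t t = 0; t < T; ++t) {
        float activity = 0.0f;
        for (std::size_t s = 0; s < S; ++s) activity += preds[t * S + s];
        if (activity >= sil_threshold) continue;  // speech frame, skip it

        // Incremental mean: mean += (x - mean) / n
        ++sp.n_sil_frames;
        const float inv_n = 1.0f / static_cast<float>(sp.n_sil_frames);
        for (std::size_t d = 0; d < D; ++d) {
            sp.mean_sil_emb[d] += (embs[t * D + d] - sp.mean_sil_emb[d]) * inv_n;
        }
    }
}
```

Unbound cache slots would then be filled with `mean_sil_emb` rather than zeros, which keeps the encoder input in-distribution.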

Results

Complete 5-way DER table (synthetic fixtures, q8_0 GGUFs, Apple M-series CPU):

| fixture | mode | DER% | speakers tracked |
| --- | --- | --- | --- |
| abcba (3-speaker, LIFO re-entry) | v1 streaming | 24.31 | 2 (no C) |
| abcba | v2.1+AOSC | 27.29 | 3 (all) |
| abcba | v2.1 no-cache | 23.74 | 2 (no C) |
| abcdba (4-speaker, LIFO re-entry) | v1 streaming | 66.28 | 2 (collapsed) |
| abcdba | v2.1+AOSC | 22.22 | 4 (all) |
| abcdba | v2.1 no-cache | 65.72 | 2 (collapsed) |

Voices used (free ElevenLabs API key for testing):
A=Sarah (♀US), B=Adam (♂US), C=Alice (♀UK), D=George (♂UK). 1 s silence between turns. Ground-truth RTTM auto-generated from clip durations.

Per-fixture analysis

three-english-speakers — no re-entry, no long silence; v1 offline almost perfect (2.14 %). v2.1 conflates the two female-adjacent voices and gets 26 %. v1 vs v2.1 streaming track their offline numbers closely because there's no silence-return event to trigger the cache.
abcab_real (distinct voices) — every mode handles this fine. Confirms the new code doesn't regress on easy material.
absaman (A→B→C→Aman→B→A, 4-speaker with long silences) — the failure mode this PR targets.
- v1 offline handles it (0.81 %) because the model sees the whole clip at once.
- v1 streaming drops to 51 % — by the time A returns, A's voice has aged out of the history window; A gets whatever slot was free.
- v2.1 + AOSC holds at 21.6 % — A re-binds to its original slot via the cache. Most of the residual error is the v2.1-side acoustic conflation of Aman with Alex (two male voices, the cache thinks "close enough" and reuses Alex's slot).
- v2.1 offline and v2.1 streaming nearly match — because v2.1's offline pipeline also runs the cache-aware forward, so streaming is consistent with offline.

Known limitation (and the trade-off)

v2.1 is not strictly better than v1. When two speakers are acoustically similar (notably two female voices in a row, or two male voices in a row), v2.1's cache + FIFO + attention layer thinks they're the same speaker and assigns them the same ID. v1 — without the cache — happens to separate them better in that exact scenario.

So in practice:

For ~80–90 % of use cases (distinct voices, normal conversation), v1 wins.
When the long-silence re-entry mismatch hits (≥ history_ms of silence between same-speaker turns), v1 catastrophically fails (~50 % DER) and v2.1 + AOSC catches it (~22 %).

This PR makes v2.1+AOSC available as an option that works correctly per the NeMo algorithm. It does not change the default model selection; downstream consumers keep choosing whichever variant fits their workload.

API additions (additive only)

SortformerStreamingOptions::spkcache_enable (default true), spkcache_len, fifo_len, chunk_left_context_ms, chunk_right_context_ms, spkcache_update_period — public knobs matching NeMo's inference YAML.
SortformerStreamSession::aosc_active() — runtime introspection.

No signature changes on existing methods. v1 GGUFs take the unchanged code path.

Validation

ctest -R sortformer-aosc-speakers passes both -abcba and -abcdba with the v2.1 GGUF in parakeet-cpp/models/.
v1 GGUF on the same harness fails with exit code 21 (continuity broken) — proving the test discriminates AOSC from non-AOSC.
live-mic on a v2.1 model prints [live-mic] listening at 16 kHz mono (v2.1 diarization, AOSC). chunk=2000 ms spkcache_len=188 fifo_len=188 lc=80 ms rc=560 ms. — verified end-to-end with a live mic capture that emitted multiple speaker IDs.
v1 path: existing test-sortformer-streaming stays bit-identical (4.17 % DER on the English fixture).

…ssion

SortformerStreamSession::Impl::process_chunk previously assigned each emitted segment's speaker_id directly from Sortformer's per-pass output (s.speaker_id), with no inter-chunk slot stabilisation. When a speaker aged out of the rolling history window, the model's per-pass slot ordering could permute and the consumer saw "the same speaker" under a different slot index.

On a synthetic 3-English-speaker 90s clip with the default history_ms=30000, the FIO089 monologue (30-90s) drifted twice: hyp_2 -> hyp_1 at t=44s (FIO084 ageing out of the 30s window) and hyp_1 -> hyp_0 at t=58s (FIO087 ageing out). Bumping history_ms to 90000 hid the bug only because the rolling window then matched the clip length and never emptied -- on real conversations longer than history_ms, drift always returned at the predicted age-out points.

This patch carries forward the previous chunk's session-stable segments and computes a remap[local_id] -> session_id by maximising overlap between the current chunk's local-ID segments and the previous chunk's session-ID segments. Greedy assignment (highest-overlap pair first) is sufficient for 4-speaker Sortformer; Hungarian would be optimal but overkill for a 4x4 cost matrix. Unmatched local slots get the lowest unused session ID. Identity remap on the first chunk (empty previous state).

Verification on synthetic three-english-speakers.wav with the v1 sortformer-4spk q8_0 GGUF:

| mode | DER% | speakerSwitches |
| --- | --- | --- |
| offline (baseline) | 4.95 | 0 |
| streaming hist=30s pre-fix | 50.34 | 2 (drift at t=44s, t=58s) |
| streaming hist=30s post-fix | 4.17 | 0 |
| streaming hist=60s post-fix | 3.60 | 0 |

Cross-language synthetic three-speakers.wav (control):

| mode | DER% | speakerSwitches |
| --- | --- | --- |
| offline (baseline) | 26.01 | 0 |
| streaming hist=30s pre-fix | 57.66 | 1 |
| streaming hist=30s post-fix | 23.76 | 0 |

The cross-language Croatian+French slot-collapse persists (model-side acoustic-similarity issue, intentionally not addressed by this patch).

Public APIs (SortformerStreamSession, SortformerStreamingOptions, StreamingDiarizationSegment) are unchanged.

Also extends test/test_sortformer_streaming.cpp with --history-ms, --chunk-ms, --rttm-out CLI flags so the streaming path can be exercised at multiple history values and a NIST RTTM dump consumed by external DER scoring.
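The greedy overlap remap described in this commit message can be sketched as follows. This is an illustration under assumptions, not the engine's `compute_slot_remap_`: the `Seg` and `Pair` structs, the function name, and the use of seconds for timestamps are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Seg {
    int    id;     // local slot (current chunk) or session slot (previous chunk)
    double start;  // seconds
    double end;    // seconds
};

struct Pair {
    double overlap;
    int    local_id;
    int    session_id;
};

// Returns remap[local_id] -> session_id, highest-overlap pair first.
std::vector<int> greedy_slot_remap(const std::vector<Seg> & prev,
                                   const std::vector<Seg> & cur,
                                   int n_slots) {
    assert(n_slots > 0);
    std::vector<int> remap(n_slots);
    for (int i = 0; i < n_slots; ++i) remap[i] = i;
    if (prev.empty()) return remap;  // first chunk: identity remap

    // Accumulate shared time for every (local, session) pair.
    std::vector<double> overlap(n_slots * n_slots, 0.0);
    for (const Seg & c : cur) {
        for (const Seg & p : prev) {
            assert(c.id >= 0 && c.id < n_slots && p.id >= 0 && p.id < n_slots);
            const double o = std::min(c.end, p.end) - std::max(c.start, p.start);
            if (o > 0.0) overlap[c.id * n_slots + p.id] += o;
        }
    }

    std::vector<Pair> pairs;
    for (int l = 0; l < n_slots; ++l)
        for (int s = 0; s < n_slots; ++s)
            if (overlap[l * n_slots + s] > 0.0)
                pairs.push_back({overlap[l * n_slots + s], l, s});
    std::sort(pairs.begin(), pairs.end(),
              [](const Pair & a, const Pair & b) { return a.overlap > b.overlap; });

    std::vector<bool> local_done(n_slots, false), session_used(n_slots, false);
    for (const Pair & p : pairs) {
        if (local_done[p.local_id] || session_used[p.session_id]) continue;
        remap[p.local_id] = p.session_id;
        local_done[p.local_id] = true;
        session_used[p.session_id] = true;
    }

    // Unmatched local slots take the lowest unused session ID.
    for (int l = 0; l < n_slots; ++l) {
        if (local_done[l]) continue;
        int s = 0;
        while (session_used[s]) ++s;
        remap[l] = s;
        session_used[s] = true;
    }
    return remap;
}
```

Because matched pairs are one-to-one, there are always as many unused session IDs as unmatched local slots, so the final loop cannot run past `n_slots`.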

… library

Faithful port of NeMo's Audio-Online Speaker Cache (AOSC) from sortformer_modules.py + sortformer_diar_models.py, replacing the previous shallow stub that collapsed v2.1 streaming output to a single speaker slot.

Key changes:

- Add run_encoder_bypass_pre_encode for the cache-aware streaming forward path. Lets callers feed pre-subsampled embeddings directly into the conformer layers (skipping the subsampling block), which is required for splicing the speaker cache + FIFO + chunk in the post-subsampling embedding space the way NeMo trained v2.1 with.
- Port _compress_spkcache, _get_silence_profile, _disable_low_scores, _boost_topk_scores, streaming_update, and forward_streaming_step end-to-end. Each C++ helper carries a comment naming the NeMo source line(s) it mirrors.
- Extend SortformerSpeakerCache with mean_sil_emb (runtime EMA over silence frames), spkcache_preds, fifo_preds, n_sil_frames. Add SortformerStreamingConfig with NeMo's e2e_diarize_speech.py inference defaults (spkcache_len=188, fifo_len=188, chunk_len=6, chunk_left_context=1, chunk_right_context=7, spkcache_update_period=144, spkcache_sil_frames_per_spk=3, sil_threshold=0.2, pred_score_threshold=0.25, scores_boost_latest=0.05, strong_boost_rate=0.75, weak_boost_rate=1.5, min_pos_scores_rate=0.5).
- Wire chunk left/right audio context windowing in the engine's streaming session: try_emit_chunks now waits for chunk_right_context_ms of lookahead audio before emitting, finalize uses left-context-only for the tail chunk, and diarize_start populates the new config fields from SortformerStreamingOptions.
- Public API: flip SortformerStreamingOptions::spkcache_enable default to true; add chunk_left_context_ms (=80) alongside the existing chunk_right_context_ms (now =560); switch fifo_len default to 188 and spkcache_update_period to 144.

v1 path is unchanged. cache_active=false for v1 GGUFs (detected via encoder shape: 18 layers / 80 mels for v1, 17 / 128 for v2.1). v1 streaming DER on the synthetic English regression fixture stays at 4.17% (bit-for-bit).

Behaviour on synthetic test fixtures:

- 3 distinct voices (Alex/Samantha/Daniel) re-entry test: v1 streaming 0.91% DER, v2.1+AOSC 0.45% DER.
- 4-speaker re-entry test where v1's overlap-remap fails: v1 streaming 47-51% DER, v2.1+AOSC 18-22% DER.
- Both Samantha (47-66s gap) and Alex (93s gap) cleanly recovered to their original hyp slots in the AOSC path; v1 collapses multiple speakers into one slot after the long silence.

QVAC-18625

Follow-up to 8f11c2a (the AOSC port itself). Locks the v2.1 streaming behaviour into ctest and surfaces it to the live-mic example user, so neither piece silently regresses.

Added regression suite:

- test/test_sortformer_aosc_speakers.cpp asserts three invariants against a reference RTTM: (a) every ref speaker has at least one hyp frame, (b) speakers that re-enter after a gap land in the SAME hyp_<id> they were first assigned to (the AOSC contract), (c) frame-level DER under the optimal hyp->ref permutation is below --der-max (default 30 %). Brute-force permutation, 10 ms frame grid, std-lib only.
- test/samples/abcba.{wav,rttm} (160.6 s, 3 speakers, A->B->C->B->A, A returns after a 97 s gap) and test/samples/abcdba.{wav,rttm} (191.2 s, 4 speakers, A->B->C->D->B->A, A returns after a 128 s gap, B after a 66 s gap). Generated from ElevenLabs TTS so the audio is redistributable; ground-truth RTTMs auto-built from clip durations.
- CMakeLists.txt registers two ctest entries test-sortformer-aosc-speakers-{abcba,abcdba} sharing one binary, REQUIRES-gated on the v2.1 GGUF so a fresh checkout without models/ shows them as DISABLED rather than failing.

Measured on q8_0 v2.1, M-series CPU backend: abcba DER 27.29 % (3 slots tracked, A and B re-bind correctly); abcdba DER 22.22 % (all 4 slots tracked, A and B re-bind). v1 streaming on the same fixtures collapses to 2 slots (abcdba 66.28 %), confirming the test distinguishes AOSC from non-AOSC.

Public API:

- SortformerStreamSession::aosc_active() -- small getter returning the engine's internal cache_active flag. Lets callers tell v2.1+AOSC from v1 / v2.x-without-cache in CLI banners and logs without duplicating the v2.1 detection logic.

live-mic example:

- Banner now branches on aosc_active(): on v2.1 prints "(v2.1 diarization, AOSC) chunk=... spkcache_len=... fifo_len=... lc=... rc=..."; on v1 keeps the existing "(v1 diarization) chunk=... history=..." line bit-identical.
- --history-ms help text clarifies the flag is v1-only and that v2.1 takes the AOSC path automatically. No new CLI flags.

Docs:

- README.md: new model-table row for diar_streaming_sortformer_4spk-v2.1 (v2 row left untouched); API table's diarize_start description distinguishes v1 sliding-history vs v2.1 AOSC; "Shipped / Not in-repo" status block moves Sortformer spkcache streaming to "Shipped".
- PROGRESS.md: new Phase 17 closing the §11.11.2 reservation. Covers the algorithm port (8 ported NeMo helpers), encoder context windowing, bypass_pre_encode forward, validation methodology, the measured DER table from above, files touched, and remaining follow-ups (engine n_finals end-of-session glitch; downstream qvac-addon plumbing).

v1 path is bit-identical to pre-commit; all existing tests stay green.

QVAC-18625
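For readers unfamiliar with the "brute-force permutation, 10 ms frame grid" scoring mentioned in this commit message, a simplified version might look like the sketch below. It is not the code in test_sortformer_aosc_speakers.cpp. It assumes one label per frame (`-1` for silence, no overlapped speech), and the function name `frame_der` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Simplified frame-level DER: one label per 10 ms frame, -1 = silence.
// Returns the minimum, over all hyp->ref speaker permutations, of
// (mismatched frames) / (reference speech frames).
double frame_der(const std::vector<int> & ref, const std::vector<int> & hyp, int n_spk) {
    assert(ref.size() == hyp.size() && n_spk > 0);
    std::size_t ref_speech = 0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        assert(ref[i] < n_spk && hyp[i] < n_spk);
        if (ref[i] >= 0) ++ref_speech;
    }
    if (ref_speech == 0) return 0.0;

    std::vector<int> perm(n_spk);
    std::iota(perm.begin(), perm.end(), 0);  // perm[hyp_id] = ref_id
    std::size_t best = ref.size() + 1;
    do {
        std::size_t errors = 0;
        for (std::size_t i = 0; i < ref.size(); ++i) {
            const int mapped = hyp[i] >= 0 ? perm[hyp[i]] : -1;
            if (mapped != ref[i]) ++errors;  // miss, false alarm, or confusion
        }
        best = std::min(best, errors);
    } while (std::next_permutation(perm.begin(), perm.end()));

    return static_cast<double>(best) / static_cast<double>(ref_speech);
}
```

With at most four speakers there are only 24 permutations, so exhaustive search is cheap and avoids pulling in an assignment solver.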

GustavoA1604

I checked locally and noticed:

- need to update download-all-models.sh to include the new model used
- need to update sortformer-streaming test to use v2.1 model instead of v2 model

Besides that, it worked on Windows and Mac.

Also, Cursor flagged the following. Not sure if it's relevant, but it could be solved by pasting this message in a new thread:

Critical / Must-Fix

spkcache_len_per_spk can go negative with small spkcache_len

In compress_speaker_cache (parakeet_sortformer.cpp):

    const int spkcache_len_per_spk = spkcache_len / num_spks - A_sil;
    const int strong_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.strong_boost_rate);
    const int weak_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.weak_boost_rate);

With the defaults spkcache_len=188, num_spks=4, A_sil=3 this is 47 - 3 = 44, which is fine. But if a caller passes spkcache_len < num_spks * A_sil (e.g. spkcache_len=8, num_spks=4, A_sil=3 → -1), the nth_element calls in boost_topk_scores receive a negative k after std::min(n_boost_per_spk, n_frames), and std::nth_element with a negative distance is UB. Add a guard at the top of the function:

    if (spkcache_len_per_spk <= 0) {
        // degenerate config; fill with silence profile and return
        ...
    }
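The arithmetic behind this guard can be checked at compile time. The sketch below only mirrors the expression quoted above; `is_degenerate_cache_config` is a hypothetical helper name, not something the PR defines.

```cpp
// Frames each speaker may keep in the cache once the silence rows are
// reserved. Mirrors the expression quoted in the review comment.
constexpr int spkcache_len_per_spk(int spkcache_len, int num_spks, int sil_frames_per_spk) {
    return spkcache_len / num_spks - sil_frames_per_spk;
}

// Hypothetical validation helper: true when the boost / top-K stages
// would be handed a non-positive per-speaker budget.
constexpr bool is_degenerate_cache_config(int spkcache_len, int num_spks, int sil_frames_per_spk) {
    return num_spks <= 0 ||
           spkcache_len_per_spk(spkcache_len, num_spks, sil_frames_per_spk) <= 0;
}

// Worked values from the review comment.
static_assert(spkcache_len_per_spk(188, 4, 3) == 44, "defaults leave 44 frames per speaker");
static_assert(spkcache_len_per_spk(8, 4, 3) == -1, "a tiny cache goes negative");
static_assert(is_degenerate_cache_config(8, 4, 3), "the guard should trip here");
```

Note that the boundary case `spkcache_len = 12` also yields a budget of zero, which is why the guard tests `<= 0` rather than `< 0`.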
4. run_encoder_bypass_pre_encode cache invalidation — every chunk misses until FIFO reaches steady-state

The bypass encoder graph is cached by (bypass_pre_encode, T_enc, n_run_layers). T_enc = spkcache_n + fifo_n + T_chunk_pre grows chunk-by-chunk as the FIFO fills (0 → fifo_len), so for the first fifo_len / chunk_len ≈ 188 / 6 ≈ 31 chunks a new ggml graph is built from scratch on every call. With k_encoder_graph_cache_max = 4, those graphs evict each other immediately and zero reuse occurs.

This is a performance bug, not a correctness bug, but it could make the first ~60 seconds of a session noticeably slower on slower hardware. Consider either:

Caching by (bypass_pre_encode, T_enc_max) and passing a mask / sequence-length argument, or
Pre-building the graph at diarize_start for the known steady-state size spkcache_len + fifo_len + max_chunk_pre_frames and always feeding that size (padding with silence rows when the FIFO isn't full yet).
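The thrashing claim can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not the engine's cache: the FIFO is assumed to grow by exactly `chunk_len` rows per chunk until it reaches `fifo_len`, speaker-cache rows and context frames are ignored, and `count_graph_misses` is a hypothetical function.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

// Count graph-cache misses over n_chunks calls using a small LRU keyed by T_enc.
// pad_to_max = true models the "always feed the steady-state size" option.
int count_graph_misses(int n_chunks, int fifo_len, int chunk_len,
                       int cache_capacity, bool pad_to_max) {
    assert(n_chunks >= 0 && fifo_len > 0 && chunk_len > 0 && cache_capacity > 0);
    std::deque<int> lru;  // front = most recently used key
    int misses = 0;
    int fifo_n = 0;
    for (int i = 0; i < n_chunks; ++i) {
        const int t_enc = (pad_to_max ? fifo_len : fifo_n) + chunk_len;
        std::deque<int>::iterator it = std::find(lru.begin(), lru.end(), t_enc);
        if (it == lru.end()) {
            ++misses;  // graph must be built from scratch
            if (static_cast<int>(lru.size()) == cache_capacity) lru.pop_back();
        } else {
            lru.erase(it);  // hit: refresh recency
        }
        lru.push_front(t_enc);
        fifo_n = std::min(fifo_len, fifo_n + chunk_len);
    }
    return misses;
}
```

Under this toy model with `fifo_len=188`, `chunk_len=6`, and a capacity of 4, every call misses until the FIFO is full, while the padded variant misses only once.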
Medium / Should-Fix
5. v2.1 detection by encoder shape is fragile

    const bool model_is_v2_1 =
        pimpl_->model.encoder_cfg.n_layers == 17 &&
        pimpl_->model.mel_cfg.n_mels == 128;

If NeMo ships a v3 or a v2.2 variant that happens to share {17 layers, 128 mels} but was not trained with the cache-aware concat forward, enabling AOSC on it will produce garbage silently. A GGUF metadata key (e.g. parakeet.model_variant = "sortformer-v2.1-aosc") set by the converter would be more robust. At minimum, document this assumption in diarization.h next to the detection logic and add a note that it must be revisited when a new variant is converted.
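One way a tag-first check with a shape fallback could look is sketched below. The function name is hypothetical. The tag string used here is the one listed in the follow-up commit message (`sortformer-streaming-v2.1-aosc`), which differs from the example value suggested in this comment.

```cpp
#include <cassert>
#include <string>

// Hypothetical detection helper: trust the GGUF variant tag when present,
// and fall back to the encoder shape only for untagged (legacy) files.
bool looks_like_v2_1_aosc(const std::string & model_variant, int n_layers, int n_mels) {
    assert(n_layers >= 0 && n_mels >= 0);
    if (!model_variant.empty()) {
        // A tagged model never opts in by shape alone.
        return model_variant == "sortformer-streaming-v2.1-aosc";
    }
    return n_layers == 17 && n_mels == 128;
}
```

The key property is that a future variant sharing the 17-layer / 128-mel shape but carrying a different tag is rejected instead of silently enabling AOSC.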

6. streaming_update parameter name chunk_pre_encode_lc is misleading

The function signature says chunk_pre_encode_lc but the call site passes the committed chunk slice (already offset past the left context):

    const float * chunk_pre_committed = chunk_pre_encode_embs + (size_t) lc * D;
    streaming_update(cache, chunk_pre_committed, chunk_len_eff, ...);

The name implies it includes the left context, which it does not. Rename to committed_chunk_pre_encode to match the call-site variable name and the comment in the function body.

7. load_wav_pcm16le_mono duplicated verbatim from test_sortformer_streaming.cpp

The comment in the new test file acknowledges this ("duplicated here on purpose"). For a 60-line helper this is borderline, but the two copies will drift. A shared test/test_utils.h header in the test/ directory would be the right solution. Not a blocker, but worth a TODO at minimum.

8. WAV fixtures committed as binary blobs (~11 MB total)

abcba.wav (~5.0 MB) and abcdba.wav (~5.9 MB) are committed directly into the repo. Git LFS would be the cleaner long-term approach, consistent with how the project will likely handle future audio test fixtures. If the project doesn't use LFS yet, at least leave a comment in CMakeLists.txt pointing to where the fixtures can be regenerated.

Low / Nice-to-Have
9. -std::numeric_limits<float>::infinity() is not UB

The comment /* very-negative sentinel; -inf is UB with current FP flags */ appears three times. IEEE 754 infinity is a well-defined value; the UB concern applies to operations like inf - inf, not to storing or comparing the value. Using std::numeric_limits<float>::lowest() (which is approximately -3.4e38) or -std::numeric_limits<float>::infinity() directly would both be more readable than the magic −1.0e30f sentinel. Not a bug, just misleading documentation.
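The difference between the two candidate sentinels can be demonstrated in a few lines. This sketch assumes default IEEE 754 semantics (for example, no `-ffast-math`); the constant and function names are illustrative, not the ones used in the codebase.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

constexpr float k_lowest  = std::numeric_limits<float>::lowest();     // about -3.4e38, finite
constexpr float k_neg_inf = -std::numeric_limits<float>::infinity();  // IEEE 754 -inf

// Storing and comparing either sentinel is well-defined.
inline bool sentinel_orders_below(float sentinel, float score) {
    return sentinel < score;
}

// Arithmetic is where they differ: (-inf) - (-inf) is NaN,
// while lowest - lowest is exactly 0.
inline bool difference_is_nan(float a, float b) {
    return std::isnan(a - b);
}
```

So a finite sentinel such as `lowest()` is the safer pick only when the value may later take part in subtraction; for store-and-compare use, both behave the same.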

10. ring.erase is O(n): pre-existing, but AOSC retention differs

AOSC retains only chunk_left_context_samples behind emit_end, which is much smaller than history_ms. So ring trims happen more aggressively and ring.erase is called more frequently (every chunk vs. lazily on v1). This amplifies the pre-existing O(n) cost. No action required now, but worth a note for a future std::deque refactor.

11. prev_chunk_full_segments populated on AOSC path unnecessarily

On the AOSC path slot_remap is always identity, but cur_full is still moved into prev_chunk_full_segments every chunk. This is harmless (just a small memory/copy overhead), but an if (!cache_active) guard around those two lines would clarify intent.

12. encoder_ms attribution is slightly surprising

    const double encoder_ms = ms_since(t_enc) - dres.decode_ms;

t_enc is set before run_subsampling, so ms_since(t_enc) covers subsampling + bypass-encode + diarize. Subtracting decode_ms leaves the "everything except the diarizer head" time, which is actually the subsampling + conformer-bypass time. The field name encoder_ms matches the existing convention in the non-AOSC path, so this is consistent; it is just worth a comment explaining what the subtraction is doing.

Resolves the review comments on the merged AOSC v2.1 PR (#22, merge commit e6ba38c). All eight changes are minimal and behaviour-preserving except the v2.1 detection upgrade (now strict-tag with shape fallback) and the degenerate-config guard (silence-only fallback instead of UB-adjacent boost arithmetic). Reviewer comments classified as "perf only / out of scope / would only add a TODO" are intentionally not addressed in this commit -- see the plan file referenced in the PR description.

src/parakeet_sortformer.cpp -- `compress_speaker_cache`

- Early-return when `spkcache_len_per_spk <= 0` (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K stages are mostly defended (`boost_topk_scores` already returns early on non-positive k), but the function was otherwise running a no-op pass that produced an all-silence cache via the slow path. Fall back to an explicit silence-only profile and bail.
- Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to `committed_chunk_pre_encode`. The call site already advances past the left context (`chunk_pre_committed = ... + lc * D`), so the old `_lc` suffix was misleading. `int lc` stays -- it's used inside the function to index into `preds_full`, which still contains the left-context preds.
- Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites) with named constants `k_score_neg_inf` / `k_score_pos_inf` backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped the inline "-inf is UB with current FP flags" comments: IEEE 754 +/-inf is well-defined; the original concern (avoiding NaN-on-arithmetic) still holds because we only store and compare the sentinels.

src/parakeet_engine.cpp

- On the AOSC path, skip the `for (cur_full) remap_id(...)` loop and the `prev_chunk_full_segments = std::move(cur_full)` store: `compute_slot_remap_` is never consulted when `cache_active` is true (AOSC anchors slot identity through the speaker cache), so the work was dead.
- Switched v2.1 detection from pure-shape to "prefer the converter's `parakeet.model_variant` GGUF tag; fall back to `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This prevents a future v2.2/v3 variant that happens to share v2.1's encoder shape from silently opting into AOSC.

include/parakeet/diarization.h

- Moved the v1-vs-v2.1 detection rationale comment out of parakeet_engine.cpp and into the `SortformerStreamingOptions::spkcache_enable` block, with a paragraph on the tag-first / shape-fallback policy.

src/parakeet_ctc.{h,cpp}

- Added `std::string ParakeetCtcModel::model_variant` (optional GGUF metadata mirror; empty on legacy GGUFs).
- Loader reads `parakeet.model_variant` next to the existing `parakeet.model.type` read; absent key -> empty string -> detection falls back to shape.

scripts/convert-nemo-to-gguf.py

- New `detect_sortformer_variant(ckpt: Path)` derives a stable variant tag from the source .nemo filename (`sortformer-v1` / `sortformer-streaming-v2` / `sortformer-streaming-v2.1-aosc`); empty string for unknown checkpoints.
- Sortformer branch of `write_gguf` writes `parakeet.model_variant` when the tag is non-empty.
- `write_gguf` signature extended with `ckpt: Path`; only the one internal call site adjusted.

scripts/download-all-models.sh

- Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the AOSC fine-tune that this PR's tests target); bumped the budget comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp

- Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default GGUF path is the matching v2.1 q8_0. Aligns the test with the line-299 comment that says the binary "reflects the production v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp

- Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists` duplicates into a shared inline header in the `parakeet_test` namespace. The duplicate copies and the "duplicated here on purpose" comment block in test_sortformer_aosc_speakers.cpp are gone; both tests `#include "test_utils.h"` and use `using parakeet_test::...`.

Build + ctest verification

- `cmake --build build -j` clean (no new warnings).
- `ctest -R 'test-sortformer-(streaming|aosc-speakers)'`:

      test-sortformer-streaming ........   Passed    8.23 s
      test-sortformer-aosc-speakers-abcba .   Passed   33.80 s
      test-sortformer-aosc-speakers-abcdba   Passed   36.91 s

The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant` key, so the AOSC tests passing here also verifies the shape-fallback path. Re-running the converter on the v2.1 .nemo will populate the new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):

- Encoder graph cache thrashing during FIFO ramp-up (#4): perf only; proper fix wants pre-build-at-diarize_start + silence padding or a mask argument, not minimal. Tracked for a follow-up perf PR.
- WAV fixtures committed as ~11 MB binaries (#8): project-wide Git LFS adoption decision, not a code change.
- `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing on the v1 path; wants a std::deque refactor, out of scope.
- `encoder_ms` attribution surprising (#12): code is correct and matches sibling paths; the user explicitly opted against comment-only "clarifications".

…ew-comments parakeet-cpp: address PR #22 AOSC v2.1 review comments

…er-aosc feat(QVAC-18625): Add better support for long-term streaming diarization with v2.1


Pratik Narola added 3 commits May 15, 2026 13:22

pratiknarola-t requested review from a team as code owners May 18, 2026 15:58

GustavoA1604 requested changes May 18, 2026


GustavoA1604 merged commit e6ba38c into tetherto:master May 18, 2026

pratiknarola-t mentioned this pull request May 19, 2026

parakeet-cpp: address PR #22 AOSC v2.1 review comments #24

Merged

GustavoA1604 added a commit that referenced this pull request May 19, 2026

Merge pull request #24 from pratiknarola-t/fix-parakeet-cpp-aosc-revi…

08df2e7

…ew-comments parakeet-cpp: address PR #22 AOSC v2.1 review comments

gianni-cor pushed a commit that referenced this pull request May 28, 2026

Merge pull request #22 from pratiknarola-t/feat-parakeet-cpp-sortform…

ddd47a4

…er-aosc feat(QVAC-18625): Add better support for long-term streaming diarization with v2.1

gianni-cor pushed a commit that referenced this pull request May 28, 2026

Merge pull request #24 from pratiknarola-t/fix-parakeet-cpp-aosc-revi…

15cd2a7

…ew-comments parakeet-cpp: address PR #22 AOSC v2.1 review comments

GustavoA1604 merged 3 commits into tetherto:master from pratiknarola-t:feat-parakeet-cpp-sortformer-aosc
