
feat(QVAC-18625): Add better support for long-term streaming diarization with v2.1 #22

Merged
GustavoA1604 merged 3 commits into
tetherto:master from
pratiknarola-t:feat-parakeet-cpp-sortformer-aosc
May 18, 2026

Conversation

@pratiknarola-t


feat(QVAC-18625): Sortformer v2.1 streaming with NeMo Audio-Online Speaker Cache (AOSC)

TL;DR

Adds v2.1 streaming Sortformer to parakeet-cpp via a port of NeMo's Audio-Online Speaker Cache (AOSC) algorithm, plus a slot-continuity fix on the existing v1 streaming path, plus a regression test suite. Speakers now keep their original speaker_id even after long silences. The v1 path is bit-identical apart from the slot-continuity fix on its own streaming code.

The original problem

In v1 streaming, the session only carries a rolling history_ms window of audio context. If a speaker goes silent for longer than that window and then returns, the model has no memory of them — they get whatever speaker_id slot happens to be free, which may or may not match their original slot. The user sees what looks like a speaker swap mid-conversation.

The fix (model-side)

The newer v2 / v2.1 NeMo Sortformer introduced a speaker cache: a small set of acoustic embeddings retained per speaker across the entire session, plus a FIFO of recent embeddings, plus a per-session "silence profile" embedding. The model then processes a concatenated row

[ cache_rows | fifo_rows | (mel → subsampling) ]

through the FastConformer blocks every chunk, so it always sees a fresh attention window over every speaker it has ever heard in this session. After arbitrarily long silences, the same voice rebinds to its original slot.
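
The row layout above can be sketched as a plain concatenation of row-major embedding matrices. This is only an illustration: `concat_rows` and its parameters are hypothetical names, not functions from parakeet-cpp, and it assumes all three inputs already live in the same D-dimensional post-subsampling space.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): builds the row-major
// [ cache_rows | fifo_rows | chunk_rows ] input. Every row is one
// D-dimensional post-subsampling embedding.
std::vector<float> concat_rows(const std::vector<float>& cache_rows,
                               const std::vector<float>& fifo_rows,
                               const std::vector<float>& chunk_rows,
                               std::size_t D) {
    assert(D > 0);
    assert(cache_rows.size() % D == 0);
    assert(fifo_rows.size() % D == 0);
    assert(chunk_rows.size() % D == 0);

    std::vector<float> out;
    out.reserve(cache_rows.size() + fifo_rows.size() + chunk_rows.size());
    // Order matters: cache first, then FIFO, then the fresh chunk.
    out.insert(out.end(), cache_rows.begin(), cache_rows.end());
    out.insert(out.end(), fifo_rows.begin(), fifo_rows.end());
    out.insert(out.end(), chunk_rows.begin(), chunk_rows.end());
    return out;
}
```

The resulting sequence length is the sum of the three row counts, which is why it changes from chunk to chunk while the FIFO is still filling.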

For v1 (which doesn't have the cache architecture), this PR also adds a separate overlap-based slot-continuity remap so cross-chunk drift inside the history window is suppressed.

What's added

  • Full AOSC port of NeMo's sortformer_modules.py + sortformer_diar_models.py into parakeet-cpp. Helpers in C++ carry // matches NeMo <fn> at <line> comments: _compress_spkcache, _get_silence_profile, _disable_low_scores / _boost_topk_scores, streaming_update, forward_streaming_step, and run_encoder_bypass_pre_encode (feeds pre-subsampled embeddings straight into the conformer stack — required because the cache lives in post-subsampling space).
  • Encoder context windowing (chunk_left_context_ms=80, chunk_right_context_ms=560), defaults from NeMo's e2e_diarize_speech.py inference YAML.
  • v2.1 detection by encoder shape (17 layers / 128 mels); cache_active=false for v1, so the v1 code path is bit-identical.
  • Overlap-based slot-continuity remap on the v1 streaming path (separate fix, additive).
  • New ctest test-sortformer-aosc-speakers-{abcba,abcdba} asserting three invariants against an RTTM ground truth: (a) every reference speaker is covered, (b) re-entering speakers land in the same hyp_<id> they were first assigned to (the AOSC contract), (c) frame-level DER ≤ 30 %.
  • Two redistributable ElevenLabs-generated fixtures with ground-truth RTTMs built from the clip durations (LIFO re-entry patterns).
  • SortformerStreamSession::aosc_active() — runtime introspection getter so callers can tell v1 from v2.1+AOSC.
  • examples/live-mic: banner branches on aosc_active(); v2.1 prints spkcache/fifo/lc/rc, v1 keeps the legacy history-window banner unchanged.
  • README / PROGRESS docs updated (new v2.1 row in the model table; "Shipped / Not in-repo" status corrected; new Phase 17 closing §11.11.2's reservation).
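
Invariant (b) from the test bullet above, that a re-entering speaker lands in the slot it was first assigned, can be sketched as follows. This is a simplified illustration, not the code in test_sortformer_aosc_speakers.cpp: `reentry_keeps_slot` is a hypothetical name, and it assumes each reference turn has already been reduced to one dominant hyp id.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// turns: (reference speaker label, dominant hyp id during that turn),
// in chronological order. Returns true when every speaker keeps the
// hyp id of its first turn on all later turns.
bool reentry_keeps_slot(const std::vector<std::pair<std::string, int>>& turns) {
    std::map<std::string, int> first_slot;
    for (const auto& t : turns) {
        assert(t.second >= 0);
        // emplace only inserts when the speaker has not been seen yet.
        auto ins = first_slot.emplace(t.first, t.second);
        if (!ins.second && ins.first->second != t.second) {
            return false;  // speaker came back under a different slot
        }
    }
    return true;
}
```

For an A->B->C->B->A pattern, the check passes only if the final A turn carries the same hyp id as the first one.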

The biggest blocker: mean_sil_emb

NeMo's algorithm uses a special "what does silence sound like in this room" baseline embedding to fill cache slots that aren't yet bound to a real speaker. Without it, those slots stay at zeros — out-of-distribution input for the model — and it panics and collapses every chunk to a single slot. Once the runtime EMA update rule from sortformer_modules.py was matched in C++, the cache-aware forward started behaving.
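
A minimal sketch of how such a silence baseline could be maintained at runtime. It assumes a cumulative running mean over frames whose summed speaker activity falls below `sil_threshold`; the PR calls the rule an EMA, so the exact weighting in sortformer_modules.py may differ. The struct and function names are hypothetical, chosen after the `mean_sil_emb` and `n_sil_frames` fields mentioned in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical state, named after the fields listed in the PR.
struct SilenceProfile {
    std::vector<float> mean_sil_emb;  // running mean embedding, length D
    long n_sil_frames = 0;            // silence frames averaged so far
};

// Folds the silence frames of one chunk into the running mean.
// emb:   T x D row-major embeddings.
// preds: T x S row-major per-speaker activity probabilities.
// Assumption: a frame counts as silence when its summed activity is
// below sil_threshold.
void update_silence_profile(SilenceProfile& sp,
                            const std::vector<float>& emb,
                            const std::vector<float>& preds,
                            std::size_t T, std::size_t D, std::size_t S,
                            float sil_threshold) {
    assert(emb.size() == T * D);
    assert(preds.size() == T * S);
    if (sp.mean_sil_emb.empty()) sp.mean_sil_emb.assign(D, 0.0f);
    assert(sp.mean_sil_emb.size() == D);

    for (std::size_t t = 0; t < T; ++t) {
        float activity = 0.0f;
        for (std::size_t s = 0; s < S; ++s) activity += preds[t * S + s];
        if (activity >= sil_threshold) continue;  // speech frame, skip

        ++sp.n_sil_frames;
        const float w = 1.0f / static_cast<float>(sp.n_sil_frames);
        // Incremental mean: mean += (x - mean) / n.
        for (std::size_t d = 0; d < D; ++d) {
            sp.mean_sil_emb[d] += (emb[t * D + d] - sp.mean_sil_emb[d]) * w;
        }
    }
}
```

Unbound cache slots would then be filled with `mean_sil_emb` rather than zeros, which is the behaviour the paragraph above describes.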

Results

Complete 5-way DER table (synthetic fixtures, q8_0 GGUFs, Apple M-series CPU):

fixture                             mode            DER%   speakers tracked
abcba (3-speaker, LIFO re-entry)    v1 streaming    24.31  2 (no C)
abcba                               v2.1+AOSC       27.29  3 (all)
abcba                               v2.1 no-cache   23.74  2 (no C)
abcdba (4-speaker, LIFO re-entry)   v1 streaming    66.28  2 (collapsed)
abcdba                              v2.1+AOSC       22.22  4 (all)
abcdba                              v2.1 no-cache   65.72  2 (collapsed)

Voices used (generated with a free ElevenLabs API key, for testing only):
A=Sarah (♀US), B=Adam (♂US), C=Alice (♀UK), D=George (♂UK). 1 s silence between turns. Ground-truth RTTM auto-generated from clip durations.

Per-fixture analysis

  • three-english-speakers — no re-entry, no long silence; v1 offline almost perfect (2.14 %). v2.1 conflates the two female-adjacent voices and gets 26 %. v1 vs v2.1 streaming track their offline numbers closely because there's no silence-return event to trigger the cache.
  • abcab_real (distinct voices) — every mode handles this fine. Confirms the new code doesn't regress on easy material.
  • absaman (A→B→C→Aman→B→A, 4-speaker with long silences) — the failure mode this PR targets.
    • v1 offline handles it (0.81 %) because the model sees the whole clip at once.
    • v1 streaming drops to 51 % — by the time A returns, A's voice has aged out of the history window; A gets whatever slot was free.
    • v2.1 + AOSC holds at 21.6 % — A re-binds to its original slot via the cache. Most of the residual error is the v2.1-side acoustic conflation of Aman with Alex (two male voices, the cache thinks "close enough" and reuses Alex's slot).
    • v2.1 offline and v2.1 streaming nearly match — because v2.1's offline pipeline also runs the cache-aware forward, so streaming is consistent with offline.

Known limitation (and the trade-off)

v2.1 is not strictly better than v1. When two speakers are acoustically similar (notably two female voices in a row, or two male voices in a row), v2.1's cache + FIFO + attention layer thinks they're the same speaker and assigns them the same ID. v1 — without the cache — happens to separate them better in that exact scenario.

So in practice:

  • For ~80–90 % of use cases (distinct voices, normal conversation), v1 wins.
  • When the long-silence re-entry mismatch hits (≥ history_ms of silence between same-speaker turns), v1 catastrophically fails (~50 % DER) and v2.1 + AOSC catches it (~22 %).

This PR makes v2.1+AOSC available as an option that works correctly per the NeMo algorithm. It does not change the default model selection; downstream consumers keep choosing whichever variant fits their workload.

API additions (additive only)

  • SortformerStreamingOptions::spkcache_enable (default true), spkcache_len, fifo_len, chunk_left_context_ms, chunk_right_context_ms, spkcache_update_period — public knobs matching NeMo's inference YAML.
  • SortformerStreamSession::aosc_active() — runtime introspection.

No signature changes on existing methods. v1 GGUFs take the unchanged code path.

Validation

  • ctest -R sortformer-aosc-speakers passes both -abcba and -abcdba with the v2.1 GGUF in parakeet-cpp/models/.
  • v1 GGUF on the same harness fails with exit code 21 (continuity broken) — proving the test discriminates AOSC from non-AOSC.
  • live-mic on a v2.1 model prints [live-mic] listening at 16 kHz mono (v2.1 diarization, AOSC). chunk=2000 ms spkcache_len=188 fifo_len=188 lc=80 ms rc=560 ms. — verified end-to-end with a live mic capture that emitted multiple speaker IDs.
  • v1 path: existing test-sortformer-streaming stays bit-identical (4.17 % DER on the English fixture).

Pratik Narola added 3 commits May 15, 2026 13:22
…ssion

SortformerStreamSession::Impl::process_chunk previously assigned each
emitted segment's speaker_id directly from Sortformer's per-pass output
(s.speaker_id), with no inter-chunk slot stabilisation. When a speaker
aged out of the rolling history window, the model's per-pass slot
ordering could permute and the consumer saw "the same speaker" under a
different slot index.

On a synthetic 3-English-speaker 90s clip with the default
history_ms=30000, the FIO089 monologue (30-90s) drifted twice:
hyp_2 -> hyp_1 at t=44s (FIO084 ageing out of the 30s window) and
hyp_1 -> hyp_0 at t=58s (FIO087 ageing out). Bumping history_ms to
90000 hid the bug only because the rolling window then matched the
clip length and never emptied -- on real conversations longer than
history_ms, drift always returned at the predicted age-out points.

This patch carries forward the previous chunk's session-stable segments
and computes a remap[local_id] -> session_id by maximising overlap
between the current chunk's local-ID segments and the previous chunk's
session-ID segments. Greedy assignment (highest-overlap pair first) is
sufficient for 4-speaker Sortformer; Hungarian would be optimal but
overkill for a 4x4 cost matrix. Unmatched local slots get the lowest
unused session ID. Identity remap on the first chunk (empty previous
state).
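
The greedy assignment described above could look roughly like this. It is a sketch under assumptions, not the code in process_chunk: `Seg`, `greedy_slot_remap`, and the use of seconds as the time unit are all illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Seg { int id; double start; double end; };  // hypothetical segment type

// Returns remap[local_id] -> session_id for n_slots speaker slots.
std::vector<int> greedy_slot_remap(const std::vector<Seg>& cur_local,
                                   const std::vector<Seg>& prev_session,
                                   int n_slots) {
    assert(n_slots > 0);
    std::vector<int> remap(n_slots);
    for (int i = 0; i < n_slots; ++i) remap[i] = i;
    if (prev_session.empty()) return remap;  // first chunk: identity

    // overlap[l][s] = total time local slot l overlaps session slot s.
    std::vector<std::vector<double>> overlap(n_slots, std::vector<double>(n_slots, 0.0));
    for (const Seg& c : cur_local) {
        for (const Seg& p : prev_session) {
            if (c.id < 0 || c.id >= n_slots || p.id < 0 || p.id >= n_slots) continue;
            const double o = std::min(c.end, p.end) - std::max(c.start, p.start);
            if (o > 0.0) overlap[c.id][p.id] += o;
        }
    }

    struct Pair { double o; int l; int s; };
    std::vector<Pair> pairs;
    for (int l = 0; l < n_slots; ++l)
        for (int s = 0; s < n_slots; ++s)
            if (overlap[l][s] > 0.0) pairs.push_back({overlap[l][s], l, s});
    // Highest-overlap pair first.
    std::sort(pairs.begin(), pairs.end(),
              [](const Pair& a, const Pair& b) { return a.o > b.o; });

    std::vector<bool> local_done(n_slots, false), session_used(n_slots, false);
    std::fill(remap.begin(), remap.end(), -1);
    for (const Pair& p : pairs) {
        if (local_done[p.l] || session_used[p.s]) continue;
        remap[p.l] = p.s;
        local_done[p.l] = true;
        session_used[p.s] = true;
    }
    // Unmatched local slots take the lowest unused session ID.
    for (int l = 0; l < n_slots; ++l) {
        if (remap[l] >= 0) continue;
        for (int s = 0; s < n_slots; ++s) {
            if (!session_used[s]) { remap[l] = s; session_used[s] = true; break; }
        }
    }
    return remap;
}
```

Because local and session slots have the same count, every unmatched local slot is guaranteed to find a free session ID in the final loop.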

Verification on synthetic three-english-speakers.wav with the v1
sortformer-4spk q8_0 GGUF:

                                 DER%   speakerSwitches
  offline (baseline)             4.95   0
  streaming hist=30s pre-fix    50.34   2  (drift at t=44s, t=58s)
  streaming hist=30s post-fix    4.17   0
  streaming hist=60s post-fix    3.60   0

Cross-language synthetic three-speakers.wav (control):

                                 DER%   speakerSwitches
  offline (baseline)            26.01   0
  streaming hist=30s pre-fix    57.66   1
  streaming hist=30s post-fix   23.76   0

The cross-language Croatian+French slot-collapse persists (model-side
acoustic-similarity issue, intentionally not addressed by this patch).
Public APIs (SortformerStreamSession, SortformerStreamingOptions,
StreamingDiarizationSegment) are unchanged.

Also extends test/test_sortformer_streaming.cpp with --history-ms,
--chunk-ms, --rttm-out CLI flags so the streaming path can be exercised
at multiple history values and a NIST RTTM dump consumed by external
DER scoring.
… library

Faithful port of NeMo's Audio-Online Speaker Cache (AOSC) from
sortformer_modules.py + sortformer_diar_models.py, replacing the
previous shallow stub that collapsed v2.1 streaming output to a
single speaker slot.

Key changes:

- Add run_encoder_bypass_pre_encode for the cache-aware streaming
  forward path. Lets callers feed pre-subsampled embeddings directly
  into the conformer layers (skipping the subsampling block), which
  is required for splicing the speaker cache + FIFO + chunk in the
  post-subsampling embedding space the way NeMo trained v2.1 with.

- Port _compress_spkcache, _get_silence_profile, _disable_low_scores,
  _boost_topk_scores, streaming_update, and forward_streaming_step
  end-to-end. Each C++ helper carries a comment naming the NeMo
  source line(s) it mirrors.

- Extend SortformerSpeakerCache with mean_sil_emb (runtime EMA over
  silence frames), spkcache_preds, fifo_preds, n_sil_frames. Add
  SortformerStreamingConfig with NeMo's e2e_diarize_speech.py
  inference defaults (spkcache_len=188, fifo_len=188, chunk_len=6,
  chunk_left_context=1, chunk_right_context=7, spkcache_update_period=144,
  spkcache_sil_frames_per_spk=3, sil_threshold=0.2,
  pred_score_threshold=0.25, scores_boost_latest=0.05,
  strong_boost_rate=0.75, weak_boost_rate=1.5,
  min_pos_scores_rate=0.5).

- Wire chunk left/right audio context windowing in the engine's
  streaming session: try_emit_chunks now waits for chunk_right_context_ms
  of lookahead audio before emitting, finalize uses left-context-only
  for the tail chunk, and diarize_start populates the new config
  fields from SortformerStreamingOptions.

- Public API: flip SortformerStreamingOptions::spkcache_enable
  default to true; add chunk_left_context_ms (=80) alongside the
  existing chunk_right_context_ms (now =560); switch fifo_len
  default to 188 and spkcache_update_period to 144.
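
The frame-based defaults and the millisecond options above are related by simple arithmetic. The sketch below assumes one encoder frame spans 80 ms (consistent with 1 frame = 80 ms and 7 frames = 560 ms in the lists above) and 16 kHz input, as in the live-mic banner. `can_emit_chunk` is a hypothetical stand-in for the readiness test inside try_emit_chunks, not its actual implementation.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kSampleRateHz = 16000;  // 16 kHz mono input
constexpr int kFrameMs = 80;          // assumed duration of one encoder frame

constexpr int frames_to_ms(int frames) { return frames * kFrameMs; }

constexpr std::int64_t ms_to_samples(int ms) {
    return static_cast<std::int64_t>(ms) * kSampleRateHz / 1000;
}

// Hypothetical readiness check: a chunk ending at emit_end_samples may
// be emitted only once the right-context lookahead is buffered too.
bool can_emit_chunk(std::int64_t buffered_samples,
                    std::int64_t emit_end_samples,
                    int right_context_ms) {
    assert(right_context_ms >= 0);
    return buffered_samples >= emit_end_samples + ms_to_samples(right_context_ms);
}
```

With these assumptions, the 560 ms right context corresponds to 8960 samples of lookahead, which is the extra latency the streaming path pays per chunk before emitting.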

v1 path is unchanged. cache_active=false for v1 GGUFs (detected
via encoder shape: 18 layers / 80 mels for v1, 17 / 128 for v2.1).
v1 streaming DER on the synthetic English regression fixture stays
at 4.17% (bit-for-bit).

Behaviour on synthetic test fixtures:
- 3 distinct voices (Alex/Samantha/Daniel) re-entry test:
    v1 streaming 0.91% DER, v2.1+AOSC 0.45% DER.
- 4-speaker re-entry test where v1's overlap-remap fails:
    v1 streaming 47-51% DER, v2.1+AOSC 18-22% DER.
- Both Samantha (47-66s gap) and Alex (93s gap) cleanly recovered
  to their original hyp slots in the AOSC path; v1 collapses
  multiple speakers into one slot after the long silence.

QVAC-18625
Follow-up to 8f11c2a (the AOSC port itself). Locks the v2.1 streaming
behaviour into ctest and surfaces it to the live-mic example user, so
neither piece silently regresses.

Added regression suite:

- test/test_sortformer_aosc_speakers.cpp asserts three invariants
  against a reference RTTM: (a) every ref speaker has at least one hyp
  frame, (b) speakers that re-enter after a gap land in the SAME
  hyp_<id> they were first assigned to (the AOSC contract), (c)
  frame-level DER under the optimal hyp->ref permutation is below
  --der-max (default 30 %). Brute-force permutation, 10 ms frame grid,
  std-lib only.

- test/samples/abcba.{wav,rttm} (160.6 s, 3 speakers, A->B->C->B->A,
  A returns after a 97 s gap) and test/samples/abcdba.{wav,rttm}
  (191.2 s, 4 speakers, A->B->C->D->B->A, A returns after a 128 s gap,
  B after a 66 s gap). Generated from ElevenLabs TTS so the audio is
  redistributable; ground-truth RTTMs auto-built from clip durations.

- CMakeLists.txt registers two ctest entries
  test-sortformer-aosc-speakers-{abcba,abcdba} sharing one binary,
  REQUIRES-gated on the v2.1 GGUF so a fresh checkout without models/
  shows them as DISABLED rather than failing.
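
Invariant (c), frame-level DER under the best hyp->ref permutation, can be sketched with std::next_permutation. This is a simplified illustration rather than the test's code: it assumes at most one active speaker per frame (label -1 for silence) and counts misses, false alarms, and confusions equally; `min_der` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// ref / hyp: one label per 10 ms frame, -1 = silence, 0..n_spk-1 = speaker.
// Returns the lowest error rate over all relabelings of the hyp speakers.
double min_der(const std::vector<int>& ref, const std::vector<int>& hyp, int n_spk) {
    assert(ref.size() == hyp.size());
    assert(n_spk > 0);

    long ref_speech = 0;
    for (int r : ref) if (r >= 0) ++ref_speech;
    if (ref_speech == 0) return 0.0;

    std::vector<int> perm(n_spk);
    std::iota(perm.begin(), perm.end(), 0);  // start from the identity mapping
    long best = -1;
    do {
        long err = 0;
        for (std::size_t i = 0; i < ref.size(); ++i) {
            assert(hyp[i] < n_spk && ref[i] < n_spk);
            const int h = hyp[i] >= 0 ? perm[hyp[i]] : -1;
            if (ref[i] != h) ++err;  // miss, false alarm, or confusion
        }
        if (best < 0 || err < best) best = err;
    } while (std::next_permutation(perm.begin(), perm.end()));

    return static_cast<double>(best) / static_cast<double>(ref_speech);
}
```

With four speaker slots there are only 24 permutations, so brute force stays cheap even on clips a few minutes long.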

Measured on q8_0 v2.1, M-series CPU backend: abcba DER 27.29 % (3
slots tracked, A and B re-bind correctly); abcdba DER 22.22 % (all 4
slots tracked, A and B re-bind). v1 streaming on the same fixtures
collapses to 2 slots (abcdba 66.28 %), confirming the test
distinguishes AOSC from non-AOSC.

Public API:

- SortformerStreamSession::aosc_active() — small getter returning the
  engine's internal cache_active flag. Lets callers tell v2.1+AOSC
  from v1 / v2.x-without-cache in CLI banners and logs without
  duplicating the v2.1 detection logic.

live-mic example:

- Banner now branches on aosc_active(): on v2.1 prints
  "(v2.1 diarization, AOSC)  chunk=... spkcache_len=... fifo_len=... lc=... rc=...";
  on v1 keeps the existing "(v1 diarization)  chunk=... history=..." line
  bit-identical. --history-ms help text clarifies the flag is v1-only
  and that v2.1 takes the AOSC path automatically. No new CLI flags.

Docs:

- README.md: new model-table row for diar_streaming_sortformer_4spk-v2.1
  (v2 row left untouched); API table's diarize_start description
  distinguishes v1 sliding-history vs v2.1 AOSC; "Shipped / Not in-repo"
  status block moves Sortformer spkcache streaming to "Shipped".

- PROGRESS.md: new Phase 17 closing the §11.11.2 reservation. Covers
  the algorithm port (8 ported NeMo helpers), encoder context
  windowing, bypass_pre_encode forward, validation methodology, the
  measured DER table from above, files touched, and remaining
  follow-ups (engine n_finals end-of-session glitch; downstream
  qvac-addon plumbing).

v1 path is bit-identical to pre-commit; all existing tests stay green.

QVAC-18625
@pratiknarola-t pratiknarola-t requested review from a team as code owners May 18, 2026 15:58

@GustavoA1604 GustavoA1604 left a comment


I checked locally and noticed two things:

  1. download-all-models.sh needs to be updated to include the new model.
  2. The sortformer-streaming test needs to use the v2.1 model instead of the v2 model.

Besides that, it worked on Windows and Mac.

Also Cursor flagged the following, not sure if relevant but could be solved by pasting this message in a new thread:


Critical / Must-Fix

  1. spkcache_len_per_spk can go negative with small spkcache_len

In compress_speaker_cache (parakeet_sortformer.cpp):

    const int spkcache_len_per_spk = spkcache_len / num_spks - A_sil;
    const int strong_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.strong_boost_rate);
    const int weak_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.weak_boost_rate);

With the defaults spkcache_len=188, num_spks=4, A_sil=3 this is 47 - 3 = 44, which is fine. But if a caller passes spkcache_len < num_spks * A_sil (e.g. spkcache_len=8, num_spks=4, A_sil=3 → -1), the nth_element calls in boost_topk_scores receive a negative k after std::min(n_boost_per_spk, n_frames), and std::nth_element with a negative distance is UB. Add a guard at the top of the function:

    if (spkcache_len_per_spk <= 0) {
        // degenerate config; fill with silence profile and return
        ...
    }
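
To make the arithmetic concrete, here is a small standalone sketch of the per-speaker budget and the degenerate-config test. The function names are hypothetical; only the formula is taken from the snippet quoted above.

```cpp
#include <cassert>

// Mirrors the quoted formula: integer division, then the silence
// frames reserved per speaker are subtracted.
int spkcache_len_per_spk(int spkcache_len, int num_spks, int sil_frames_per_spk) {
    assert(num_spks > 0);
    return spkcache_len / num_spks - sil_frames_per_spk;
}

// True when no cache rows are left for real speaker frames.
bool is_degenerate_cache_config(int spkcache_len, int num_spks, int sil_frames_per_spk) {
    return spkcache_len_per_spk(spkcache_len, num_spks, sil_frames_per_spk) <= 0;
}
```

Because of the integer division, the budget is also non-positive for some values just above num_spks * A_sil, for example 13 / 4 - 3 = 0, so a guard written as `<= 0` on the computed value covers slightly more configurations than a comparison against num_spks * A_sil alone.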

  4. run_encoder_bypass_pre_encode cache invalidation: every chunk misses until FIFO reaches steady-state

The bypass encoder graph is cached by (bypass_pre_encode, T_enc, n_run_layers). T_enc = spkcache_n + fifo_n + T_chunk_pre grows chunk-by-chunk as the FIFO fills (0 → fifo_len), so for the first fifo_len / chunk_len ≈ 188 / 6 ≈ 31 chunks a new ggml graph is built from scratch on every call. With k_encoder_graph_cache_max = 4, those graphs evict each other immediately and zero reuse occurs.

This is a performance bug, not a correctness bug, but it could make the first ~60 seconds of a session noticeably slower on slower hardware. Consider either:

  • Caching by (bypass_pre_encode, T_enc_max) and passing a mask / sequence-length argument, or
  • Pre-building the graph at diarize_start for the known steady-state size spkcache_len + fifo_len + max_chunk_pre_frames and always feeding that size (padding with silence rows when the FIFO isn't full yet).
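
The thrashing can be reproduced with a toy LRU simulation. This is a sketch under simplifying assumptions: the cache is keyed only by sequence length, the speaker-cache term is treated as constant and left out of the key, and the FIFO grows by exactly chunk_len rows per chunk. `count_graph_builds` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>

// Counts how many times a graph would be (re)built over n_chunks calls,
// using an LRU cache of cache_max entries keyed by sequence length.
int count_graph_builds(int n_chunks, int fifo_len, int chunk_len,
                       std::size_t cache_max, bool pad_to_steady_state) {
    assert(chunk_len > 0 && fifo_len >= 0 && cache_max > 0);
    std::deque<int> lru;  // front = most recently used key
    int builds = 0;
    int fifo_n = 0;
    for (int c = 0; c < n_chunks; ++c) {
        // Padding always feeds the steady-state length, so the key is fixed.
        const int key = (pad_to_steady_state ? fifo_len : fifo_n) + chunk_len;
        auto it = std::find(lru.begin(), lru.end(), key);
        if (it == lru.end()) {
            ++builds;  // cache miss: build a new graph
            if (lru.size() == cache_max) lru.pop_back();  // evict least recent
        } else {
            lru.erase(it);  // cache hit: refresh recency below
        }
        lru.push_front(key);
        fifo_n = std::min(fifo_len, fifo_n + chunk_len);
    }
    return builds;
}
```

With fifo_len=188, chunk_len=6 and a 4-entry cache, every call during ramp-up is a miss because each key is new, while the padded variant builds once and reuses that graph for the whole session.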

Medium / Should-Fix

  5. v2.1 detection by encoder shape is fragile

    const bool model_is_v2_1 =
        pimpl_->model.encoder_cfg.n_layers == 17 &&
        pimpl_->model.mel_cfg.n_mels == 128;

If NeMo ships a v3 or a v2.2 variant that happens to share {17 layers, 128 mels} but was not trained with the cache-aware concat forward, enabling AOSC on it will produce garbage silently. A GGUF metadata key (e.g. parakeet.model_variant = "sortformer-v2.1-aosc") set by the converter would be more robust. At minimum, document this assumption in diarization.h next to the detection logic and add a note that it must be revisited when a new variant is converted.

  6. streaming_update parameter name chunk_pre_encode_lc is misleading

The function signature says chunk_pre_encode_lc but the call site passes the committed chunk slice (already offset past the left context):

    const float * chunk_pre_committed = chunk_pre_encode_embs + (size_t) lc * D;
    streaming_update(cache, chunk_pre_committed, chunk_len_eff, ...);

The name implies it includes the left context, which it does not. Rename to committed_chunk_pre_encode to match the call-site variable name and the comment in the function body.

  7. load_wav_pcm16le_mono duplicated verbatim from test_sortformer_streaming.cpp

The comment in the new test file acknowledges this ("duplicated here on purpose"). For a 60-line helper this is borderline, but the two copies will drift. A shared test/test_utils.h header in the test/ directory would be the right solution. Not a blocker, but worth a TODO at minimum.

  8. WAV fixtures committed as binary blobs (~11 MB total)

abcba.wav (~5.0 MB) and abcdba.wav (~5.9 MB) are committed directly into the repo. Git LFS would be the cleaner long-term approach, consistent with how the project will likely handle future audio test fixtures. If the project doesn't use LFS yet, at least leave a comment in CMakeLists.txt pointing to where the fixtures can be regenerated.

Low / Nice-to-Have

  9. -std::numeric_limits<float>::infinity() is not UB

The comment /* very-negative sentinel; -inf is UB with current FP flags */ appears three times. IEEE 754 infinity is a well-defined value; the UB concern applies to operations like inf - inf, not to storing or comparing the value. Using std::numeric_limits<float>::lowest() (which is approximately -3.4e38) or -std::numeric_limits<float>::infinity() directly would both be more readable than the magic −1.0e30f sentinel. Not a bug, just misleading documentation.
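
A short sketch of the point being made: storing and comparing negative infinity is well-defined, and it orders below every finite float, including lowest(). The constant and function names here are hypothetical and are not the ones used in parakeet_sortformer.cpp.

```cpp
#include <cassert>
#include <limits>

constexpr float k_neg_inf = -std::numeric_limits<float>::infinity();
constexpr float k_lowest = std::numeric_limits<float>::lowest();  // most negative finite float

// Running maximum that uses -inf as the "no candidate yet" sentinel.
// Only stores and comparisons touch the sentinel, never arithmetic.
float max_score(const float* scores, int n) {
    assert(n >= 0);
    float best = k_neg_inf;
    for (int i = 0; i < n; ++i) {
        if (scores[i] > best) best = scores[i];
    }
    return best;
}
```

Either constant works as a sentinel for this pattern; the difference only matters if the value later takes part in arithmetic, where infinities can produce NaN and lowest() can overflow.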

  10. ring.erase is O(n) — pre-existing, but AOSC retention differs

AOSC retains only chunk_left_context_samples behind emit_end, which is much smaller than history_ms. So ring trims happen more aggressively and ring.erase is called more frequently (every chunk vs. lazily on v1). This amplifies the pre-existing O(n) cost. No action required now, but worth a note for a future std::deque refactor.

  11. prev_chunk_full_segments populated on AOSC path unnecessarily

On the AOSC path slot_remap is always identity, but cur_full is still moved into prev_chunk_full_segments every chunk. This is harmless (just a small memory/copy overhead), but an if (!cache_active) guard around those two lines would clarify intent.

  12. encoder_ms attribution is slightly surprising

    const double encoder_ms = ms_since(t_enc) - dres.decode_ms;

t_enc is set before run_subsampling, so ms_since(t_enc) covers subsampling + bypass-encode + diarize. Subtracting decode_ms leaves the "everything except the diarizer head" time, which is actually the subsampling + conformer-bypass time. The field name encoder_ms matches the existing convention in the non-AOSC path, so this is consistent; it is just worth a comment explaining what the subtraction is doing.

@GustavoA1604 GustavoA1604 merged commit e6ba38c into tetherto:master May 18, 2026
GustavoA1604 pushed a commit that referenced this pull request May 19, 2026
Resolves the review comments on the merged AOSC v2.1 PR
(#22, merge commit e6ba38c). All
eight changes are minimal and behaviour-preserving except the v2.1
detection upgrade (now strict-tag with shape fallback) and the
degenerate-config guard (silence-only fallback instead of UB-adjacent
boost arithmetic). Reviewer comments classified as "perf only / out
of scope / would only add a TODO" are intentionally not addressed in
this commit -- see the plan file referenced in the PR description.

src/parakeet_sortformer.cpp -- `compress_speaker_cache`
  - Early-return when `spkcache_len_per_spk <= 0`
    (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K
    stages are mostly defended (`boost_topk_scores` already returns
    early on non-positive k), but the function was otherwise running
    a no-op pass that produced an all-silence cache via the slow
    path. Fall back to an explicit silence-only profile and bail.
  - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to
    `committed_chunk_pre_encode`. The call site already advances
    past the left context (`chunk_pre_committed = ... + lc * D`),
    so the old `_lc` suffix was misleading. `int lc` stays -- it's
    used inside the function to index into `preds_full`, which
    still contains the left-context preds.
  - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites)
    with named constants `k_score_neg_inf` / `k_score_pos_inf`
    backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped
    the inline "-inf is UB with current FP flags" comments: IEEE
    754 +/-inf is well-defined; the original concern (avoiding
    NaN-on-arithmetic) still holds because we only store and
    compare the sentinels.

src/parakeet_engine.cpp
  - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop
    and the `prev_chunk_full_segments = std::move(cur_full)` store:
    `compute_slot_remap_` is never consulted when `cache_active` is
    true (AOSC anchors slot identity through the speaker cache), so
    the work was dead.
  - Switched v2.1 detection from pure-shape to "prefer the
    converter's `parakeet.model_variant` GGUF tag; fall back to
    `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This
    prevents a future v2.2/v3 variant that happens to share v2.1's
    encoder shape from silently opting into AOSC.
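
The tag-first / shape-fallback policy could be expressed roughly as below. This is a sketch, not the code in parakeet_engine.cpp: `detect_v21_aosc` is a hypothetical free function, and treating any non-empty tag other than the v2.1 one as "not AOSC" is an assumption about the intended strictness.

```cpp
#include <cassert>
#include <string>

// model_variant: value of the parakeet.model_variant GGUF key, or an
// empty string when the key is absent (legacy GGUFs).
bool detect_v21_aosc(const std::string& model_variant, int n_layers, int n_mels) {
    assert(n_layers > 0 && n_mels > 0);
    if (!model_variant.empty()) {
        // A tag is present: trust it and ignore the encoder shape.
        return model_variant == "sortformer-streaming-v2.1-aosc";
    }
    // Legacy GGUF without the tag: fall back to the shape heuristic.
    return n_layers == 17 && n_mels == 128;
}
```

Under this policy a tagged future variant with the same 17-layer / 128-mel shape stays off the AOSC path unless its tag says otherwise, while untagged legacy files keep their current behaviour.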

include/parakeet/diarization.h
  - Moved the v1-vs-v2.1 detection rationale comment out of
    parakeet_engine.cpp and into the `SortformerStreamingOptions::
    spkcache_enable` block, with a paragraph on the tag-first /
    shape-fallback policy.

src/parakeet_ctc.{h,cpp}
  - Added `std::string ParakeetCtcModel::model_variant` (optional
    GGUF metadata mirror; empty on legacy GGUFs).
  - Loader reads `parakeet.model_variant` next to the existing
    `parakeet.model.type` read; absent key -> empty string ->
    detection falls back to shape.

scripts/convert-nemo-to-gguf.py
  - New `detect_sortformer_variant(ckpt: Path)` derives a stable
    variant tag from the source .nemo filename
    (`sortformer-v1` / `sortformer-streaming-v2` /
    `sortformer-streaming-v2.1-aosc`); empty string for unknown
    checkpoints.
  - Sortformer branch of `write_gguf` writes
    `parakeet.model_variant` when the tag is non-empty.
  - `write_gguf` signature extended with `ckpt: Path`; only the
    one internal call site adjusted.

scripts/download-all-models.sh
  - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the
    AOSC fine-tune that this PR's tests target); bumped the budget
    comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the
    contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp
  - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was
    `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default
    GGUF path is the matching v2.1 q8_0. Aligns the test with the
    line-299 comment that says the binary "reflects the production
    v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp
  - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists`
    duplicates into a shared inline header in the `parakeet_test`
    namespace. The duplicate copies and the "duplicated here on
    purpose" comment block in test_sortformer_aosc_speakers.cpp
    are gone; both tests `#include "test_utils.h"` and use
    `using parakeet_test::...`.

Build + ctest verification
  - `cmake --build build -j` clean (no new warnings).
  - `ctest -R 'test-sortformer-(streaming|aosc-speakers)'`:
      test-sortformer-streaming ........  Passed   8.23 s
      test-sortformer-aosc-speakers-abcba . Passed  33.80 s
      test-sortformer-aosc-speakers-abcdba  Passed  36.91 s
    The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant`
    key, so the AOSC tests passing here also verify the shape-fallback
    path. Re-running the converter on the v2.1 .nemo will populate
    the new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):
  - Encoder graph cache thrashing during FIFO ramp-up (#4): perf
    only; proper fix wants pre-build-at-diarize_start + silence
    padding or a mask argument, not minimal. Tracked for a follow-up
    perf PR.
  - WAV fixtures committed as ~11 MB binaries (#8): project-wide
    Git LFS adoption decision, not a code change.
  - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing
    on the v1 path; wants a std::deque refactor, out of scope.
  - `encoder_ms` attribution surprising (#12): code is correct and
    matches sibling paths; the user explicitly opted against
    comment-only "clarifications".
GustavoA1604 added a commit that referenced this pull request May 19, 2026
…ew-comments

parakeet-cpp: address PR #22 AOSC v2.1 review comments
gianni-cor pushed a commit that referenced this pull request May 28, 2026
…er-aosc

feat(QVAC-18625): Add better support for long-term streaming diarization with v2.1
gianni-cor pushed a commit that referenced this pull request May 28, 2026
…ew-comments

parakeet-cpp: address PR #22 AOSC v2.1 review comments