feat(QVAC-18625): Add better support for long-term streaming diarization with v2.1#22
…ssion
SortformerStreamSession::Impl::process_chunk previously assigned each
emitted segment's speaker_id directly from Sortformer's per-pass output
(s.speaker_id), with no inter-chunk slot stabilisation. When a speaker
aged out of the rolling history window, the model's per-pass slot
ordering could permute and the consumer saw "the same speaker" under a
different slot index.
On a synthetic 3-English-speaker 90s clip with the default
history_ms=30000, the FIO089 monologue (30-90s) drifted twice:
hyp_2 -> hyp_1 at t=44s (FIO084 ageing out of the 30s window) and
hyp_1 -> hyp_0 at t=58s (FIO087 ageing out). Bumping history_ms to
90000 hid the bug only because the rolling window then matched the
clip length and never emptied -- on real conversations longer than
history_ms, drift always returned at the predicted age-out points.
This patch carries forward the previous chunk's session-stable segments
and computes a remap[local_id] -> session_id by maximising overlap
between the current chunk's local-ID segments and the previous chunk's
session-ID segments. Greedy assignment (highest-overlap pair first) is
sufficient for 4-speaker Sortformer; Hungarian would be optimal but
overkill for a 4x4 cost matrix. Unmatched local slots get the lowest
unused session ID. Identity remap on the first chunk (empty previous
state).
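The greedy remap described above can be sketched as follows. This is a minimal illustration, not the code in `process_chunk`: the `Seg` struct, `kMaxSpk`, and `greedy_slot_remap` are hypothetical names, and overlap is measured in seconds between time-stamped segments.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Hypothetical segment type for illustration; not the engine's real struct.
struct Seg { double start_s; double end_s; int speaker_id; };

constexpr int kMaxSpk = 4;  // 4-speaker Sortformer

// Returns remap[local_id] -> session_id for the current chunk.
std::array<int, kMaxSpk> greedy_slot_remap(const std::vector<Seg>& prev_session,
                                           const std::vector<Seg>& cur_local) {
    std::array<int, kMaxSpk> remap;
    remap.fill(-1);
    if (prev_session.empty()) {  // first chunk: identity remap
        for (int i = 0; i < kMaxSpk; ++i) remap[i] = i;
        return remap;
    }
    // overlap[local][session] = total overlapping seconds.
    double overlap[kMaxSpk][kMaxSpk] = {};
    for (const Seg& c : cur_local) {
        if (c.speaker_id < 0 || c.speaker_id >= kMaxSpk) continue;
        for (const Seg& p : prev_session) {
            if (p.speaker_id < 0 || p.speaker_id >= kMaxSpk) continue;
            const double ov = std::min(c.end_s, p.end_s) - std::max(c.start_s, p.start_s);
            if (ov > 0.0) overlap[c.speaker_id][p.speaker_id] += ov;
        }
    }
    struct Cand { double ov; int local; int session; };
    std::vector<Cand> cands;
    for (int l = 0; l < kMaxSpk; ++l)
        for (int s = 0; s < kMaxSpk; ++s)
            if (overlap[l][s] > 0.0) cands.push_back({overlap[l][s], l, s});
    // Highest-overlap pair first.
    std::stable_sort(cands.begin(), cands.end(),
                     [](const Cand& a, const Cand& b) { return a.ov > b.ov; });
    bool used[kMaxSpk] = {};
    for (const Cand& c : cands) {
        if (remap[c.local] != -1 || used[c.session]) continue;
        remap[c.local] = c.session;
        used[c.session] = true;
    }
    // Unmatched local slots take the lowest unused session ID.
    for (int l = 0; l < kMaxSpk; ++l) {
        if (remap[l] != -1) continue;
        for (int s = 0; s < kMaxSpk; ++s) {
            if (!used[s]) { remap[l] = s; used[s] = true; break; }
        }
    }
    return remap;
}
```

With at most 16 candidate pairs, the sort is negligible, which is why the greedy pass is a reasonable stand-in for a Hungarian solver at this size.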
Verification on synthetic three-english-speakers.wav with the v1
sortformer-4spk q8_0 GGUF:
                              DER%   speakerSwitches
offline (baseline)            4.95   0
streaming hist=30s pre-fix   50.34   2  (drift at t=44s, t=58s)
streaming hist=30s post-fix   4.17   0
streaming hist=60s post-fix   3.60   0
Cross-language synthetic three-speakers.wav (control):
                              DER%   speakerSwitches
offline (baseline)           26.01   0
streaming hist=30s pre-fix   57.66   1
streaming hist=30s post-fix  23.76   0
The cross-language Croatian+French slot-collapse persists (model-side
acoustic-similarity issue, intentionally not addressed by this patch).
Public APIs (SortformerStreamSession, SortformerStreamingOptions,
StreamingDiarizationSegment) are unchanged.
Also extends test/test_sortformer_streaming.cpp with --history-ms,
--chunk-ms, --rttm-out CLI flags so the streaming path can be exercised
at multiple history values and a NIST RTTM dump consumed by external
DER scoring.
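For readers unfamiliar with RTTM, the sketch below formats one diarization segment as a NIST RTTM `SPEAKER` record. The helper name `rttm_line`, the fixed channel `1`, and the `hyp_<id>` speaker naming are assumptions for illustration; the actual `--rttm-out` writer in the test binary may differ.

```cpp
#include <cstdio>
#include <string>

// Formats one segment as an RTTM SPEAKER line:
//   SPEAKER <file> <chan> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
// Hypothetical helper, not the test binary's real writer.
std::string rttm_line(const std::string& file_id, double start_s,
                      double dur_s, int speaker_id) {
    char buf[256];
    std::snprintf(buf, sizeof(buf),
                  "SPEAKER %s 1 %.3f %.3f <NA> <NA> hyp_%d <NA> <NA>",
                  file_id.c_str(), start_s, dur_s, speaker_id);
    return std::string(buf);
}
```

External DER scorers generally read the onset, duration, and speaker-name fields and ignore the `<NA>` placeholders.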
… library
Faithful port of NeMo's Arrival-Order Speaker Cache (AOSC) from
sortformer_modules.py + sortformer_diar_models.py, replacing the
previous shallow stub that collapsed v2.1 streaming output to a
single speaker slot.
Key changes:
- Add run_encoder_bypass_pre_encode for the cache-aware streaming
forward path. Lets callers feed pre-subsampled embeddings directly
into the conformer layers (skipping the subsampling block), which
is required for splicing the speaker cache + FIFO + chunk in the
post-subsampling embedding space, matching how NeMo trained v2.1.
- Port _compress_spkcache, _get_silence_profile, _disable_low_scores,
_boost_topk_scores, streaming_update, and forward_streaming_step
end-to-end. Each C++ helper carries a comment naming the NeMo
source line(s) it mirrors.
- Extend SortformerSpeakerCache with mean_sil_emb (runtime EMA over
silence frames), spkcache_preds, fifo_preds, n_sil_frames. Add
SortformerStreamingConfig with NeMo's e2e_diarize_speech.py
inference defaults (spkcache_len=188, fifo_len=188, chunk_len=6,
chunk_left_context=1, chunk_right_context=7, spkcache_update_period=144,
spkcache_sil_frames_per_spk=3, sil_threshold=0.2,
pred_score_threshold=0.25, scores_boost_latest=0.05,
strong_boost_rate=0.75, weak_boost_rate=1.5,
min_pos_scores_rate=0.5).
- Wire chunk left/right audio context windowing in the engine's
streaming session: try_emit_chunks now waits for chunk_right_context_ms
of lookahead audio before emitting, finalize uses left-context-only
for the tail chunk, and diarize_start populates the new config
fields from SortformerStreamingOptions.
- Public API: flip SortformerStreamingOptions::spkcache_enable
default to true; add chunk_left_context_ms (=80) alongside the
existing chunk_right_context_ms (now =560); switch fifo_len
default to 188 and spkcache_update_period to 144.
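The `mean_sil_emb` / `n_sil_frames` pair listed above can be pictured as a running mean over frames judged silent. The sketch below is an assumption-laden illustration rather than the ported code: it treats a frame as silent when its summed speaker probabilities fall below `sil_threshold`, and it uses a cumulative mean, which may not match the exact update rule in NeMo's `_get_silence_profile`. `SilenceProfile` and `update_silence_profile` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container mirroring the two cache fields named above.
struct SilenceProfile {
    std::vector<float> mean_sil_emb;  // length D
    long n_sil_frames = 0;
};

// embs:  T x D row-major embeddings
// preds: T x S row-major per-speaker probabilities
void update_silence_profile(SilenceProfile& p,
                            const std::vector<float>& embs,
                            const std::vector<float>& preds,
                            std::size_t D, std::size_t S,
                            float sil_threshold) {
    if (D == 0 || S == 0) return;
    if (p.mean_sil_emb.size() != D) p.mean_sil_emb.assign(D, 0.0f);
    const std::size_t T = embs.size() / D;
    for (std::size_t t = 0; t < T && (t + 1) * S <= preds.size(); ++t) {
        float prob_sum = 0.0f;
        for (std::size_t s = 0; s < S; ++s) prob_sum += preds[t * S + s];
        if (prob_sum >= sil_threshold) continue;  // speech frame, skip
        ++p.n_sil_frames;
        // Incremental mean: m += (x - m) / n.
        for (std::size_t d = 0; d < D; ++d) {
            const float x = embs[t * D + d];
            p.mean_sil_emb[d] += (x - p.mean_sil_emb[d]) / (float) p.n_sil_frames;
        }
    }
}
```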
v1 path is unchanged. cache_active=false for v1 GGUFs (detected
via encoder shape: 18 layers / 80 mels for v1, 17 / 128 for v2.1).
v1 streaming DER on the synthetic English regression fixture stays
at 4.17% (bit-for-bit).
Behaviour on synthetic test fixtures:
- 3 distinct voices (Alex/Samantha/Daniel) re-entry test:
v1 streaming 0.91% DER, v2.1+AOSC 0.45% DER.
- 4-speaker re-entry test where v1's overlap-remap fails:
v1 streaming 47-51% DER, v2.1+AOSC 18-22% DER.
- Both Samantha (47-66s gap) and Alex (93s gap) cleanly recovered
to their original hyp slots in the AOSC path; v1 collapses
multiple speakers into one slot after the long silence.
QVAC-18625
Follow-up to 8f11c2a (the AOSC port itself). Locks the v2.1 streaming behaviour into ctest and surfaces it to the live-mic example user, so neither piece silently regresses.

Added regression suite:
- test/test_sortformer_aosc_speakers.cpp asserts three invariants against a reference RTTM: (a) every ref speaker has at least one hyp frame, (b) speakers that re-enter after a gap land in the SAME hyp_<id> they were first assigned to (the AOSC contract), (c) frame-level DER under the optimal hyp->ref permutation is below --der-max (default 30 %). Brute-force permutation, 10 ms frame grid, std-lib only.
- test/samples/abcba.{wav,rttm} (160.6 s, 3 speakers, A->B->C->B->A, A returns after a 97 s gap) and test/samples/abcdba.{wav,rttm} (191.2 s, 4 speakers, A->B->C->D->B->A, A returns after a 128 s gap, B after a 66 s gap). Generated from ElevenLabs TTS so the audio is redistributable; ground-truth RTTMs auto-built from clip durations.
- CMakeLists.txt registers two ctest entries test-sortformer-aosc-speakers-{abcba,abcdba} sharing one binary, REQUIRES-gated on the v2.1 GGUF so a fresh checkout without models/ shows them as DISABLED rather than failing.

Measured on q8_0 v2.1, M-series CPU backend: abcba DER 27.29 % (3 slots tracked, A and B re-bind correctly); abcdba DER 22.22 % (all 4 slots tracked, A and B re-bind). v1 streaming on the same fixtures collapses to 2 slots (abcdba 66.28 %), confirming the test distinguishes AOSC from non-AOSC.

Public API:
- SortformerStreamSession::aosc_active(): small getter returning the engine's internal cache_active flag. Lets callers tell v2.1+AOSC from v1 / v2.x-without-cache in CLI banners and logs without duplicating the v2.1 detection logic.

live-mic example:
- Banner now branches on aosc_active(): on v2.1 prints "(v2.1 diarization, AOSC) chunk=... spkcache_len=... fifo_len=... lc=... rc=..."; on v1 keeps the existing "(v1 diarization) chunk=... history=..." line bit-identical.
- --history-ms help text clarifies the flag is v1-only and that v2.1 takes the AOSC path automatically. No new CLI flags.

Docs:
- README.md: new model-table row for diar_streaming_sortformer_4spk-v2.1 (v2 row left untouched); API table's diarize_start description distinguishes v1 sliding-history vs v2.1 AOSC; "Shipped / Not in-repo" status block moves Sortformer spkcache streaming to "Shipped".
- PROGRESS.md: new Phase 17 closing the §11.11.2 reservation. Covers the algorithm port (8 ported NeMo helpers), encoder context windowing, bypass_pre_encode forward, validation methodology, the measured DER table from above, files touched, and remaining follow-ups (engine n_finals end-of-session glitch; downstream qvac-addon plumbing).

v1 path is bit-identical to pre-commit; all existing tests stay green.

QVAC-18625
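Invariant (c) in the regression suite, frame-level DER under the optimal hyp->ref permutation, can be sketched as below. This is a simplified illustration and not the code in test_sortformer_aosc_speakers.cpp: it assumes one label per 10 ms frame (no overlapped speech), `-1` for silence, equal-length inputs, and the hypothetical function name `frame_der`.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// ref[i], hyp[i]: speaker index in [0, n_spk) or -1 for silence.
// Returns (miss + false alarm + confusion) / reference speech frames,
// minimised over all hyp->ref speaker permutations.
double frame_der(const std::vector<int>& ref, const std::vector<int>& hyp, int n_spk) {
    long ref_speech = 0;
    for (int r : ref) if (r >= 0) ++ref_speech;
    if (ref_speech == 0 || ref.size() != hyp.size()) return 0.0;

    std::vector<int> perm(n_spk);
    std::iota(perm.begin(), perm.end(), 0);  // start from the identity mapping
    long best = -1;
    do {
        long err = 0;
        for (std::size_t i = 0; i < ref.size(); ++i) {
            const int mapped = hyp[i] >= 0 ? perm[hyp[i]] : -1;
            if (mapped != ref[i]) ++err;  // miss, false alarm, or confusion
        }
        if (best < 0 || err < best) best = err;
    } while (std::next_permutation(perm.begin(), perm.end()));
    return (double) best / (double) ref_speech;
}
```

With four speakers there are only 24 permutations, so brute force stays cheap even on fixtures several minutes long.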
I checked locally and noticed:
- download-all-models.sh needs updating to include the new model
- the sortformer-streaming test needs updating to use the v2.1 model instead of the v2 model
Besides that, it worked on Windows and Mac.
Also, Cursor flagged the following. Not sure if it is relevant, but it could be solved by pasting this message in a new thread:
Critical / Must-Fix
- spkcache_len_per_spk can go negative with small spkcache_len
In compress_speaker_cache (parakeet_sortformer.cpp):
const int spkcache_len_per_spk = spkcache_len / num_spks - A_sil;
const int strong_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.strong_boost_rate);
const int weak_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.weak_boost_rate);
With the defaults spkcache_len=188, num_spks=4, A_sil=3 this is 47 - 3 = 44 — fine. But if a caller passes spkcache_len < num_spks * A_sil (e.g. spkcache_len=8, num_spks=4, A_sil=3 → -1), the nth_element calls in boost_topk_scores receive a negative k after std::min(n_boost_per_spk, n_frames), and std::nth_element with a negative distance is UB. Add a guard at the top of the function:
if (spkcache_len_per_spk <= 0) {
// degenerate config; fill with silence profile and return
...
}
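The arithmetic behind this guard can be checked in isolation. The sketch below is a standalone illustration with hypothetical names (`BoostBudget`, `boost_budget`); it reproduces the three expressions quoted above and returns an empty optional for degenerate configurations instead of letting a non-positive budget flow downstream.

```cpp
#include <cmath>
#include <optional>

// Hypothetical result type for illustration.
struct BoostBudget { int per_spk; int strong; int weak; };

// Returns std::nullopt when the per-speaker budget is not positive,
// i.e. when spkcache_len / num_spks <= sil_frames_per_spk.
std::optional<BoostBudget> boost_budget(int spkcache_len, int num_spks,
                                        int sil_frames_per_spk,
                                        float strong_boost_rate,
                                        float weak_boost_rate) {
    if (num_spks <= 0) return std::nullopt;
    const int per_spk = spkcache_len / num_spks - sil_frames_per_spk;
    if (per_spk <= 0) return std::nullopt;  // degenerate config
    return BoostBudget{
        per_spk,
        (int) std::floor((float) per_spk * strong_boost_rate),
        (int) std::floor((float) per_spk * weak_boost_rate)};
}
```

With the defaults (188, 4, 3, 0.75, 1.5) this yields a per-speaker budget of 44 with boosts of 33 and 66; with spkcache_len=8 the budget is 8 / 4 - 3 = -1 and the helper reports the configuration as degenerate.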
4. run_encoder_bypass_pre_encode cache invalidation — every chunk misses until FIFO reaches steady-state
The bypass encoder graph is cached by (bypass_pre_encode, T_enc, n_run_layers). T_enc = spkcache_n + fifo_n + T_chunk_pre grows chunk-by-chunk as the FIFO fills (0 → fifo_len), so for the first fifo_len / chunk_len ≈ 188 / 6 ≈ 31 chunks a new ggml graph is built from scratch on every call. With k_encoder_graph_cache_max = 4, those graphs evict each other immediately and zero reuse occurs.
This is a performance bug, not a correctness bug, but it could make the first ~60 seconds of a session noticeably slower on slower hardware. Consider either:
Caching by (bypass_pre_encode, T_enc_max) and passing a mask / sequence-length argument, or
Pre-building the graph at diarize_start for the known steady-state size spkcache_len + fifo_len + max_chunk_pre_frames and always feeding that size (padding with silence rows when the FIFO isn't full yet).
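The thrashing can be made concrete with a toy model. The sketch below is an assumption, not the engine's cache: it keys a small LRU on T_enc alone, ignores the speaker-cache contribution during ramp-up, and counts how many graph builds occur with and without padding to a fixed size. `count_graph_builds` is a hypothetical name.

```cpp
#include <algorithm>
#include <deque>

// Toy model: counts graph builds (cache misses) while the FIFO fills.
// pad_to_max = true simulates always feeding the steady-state size.
int count_graph_builds(int n_chunks, int fifo_len, int chunk_len,
                       int cache_max, bool pad_to_max) {
    std::deque<int> lru;  // most recently used key at the front
    int builds = 0;
    int fifo_n = 0;
    for (int i = 0; i < n_chunks; ++i) {
        const int t_enc = pad_to_max ? fifo_len + chunk_len : fifo_n + chunk_len;
        auto it = std::find(lru.begin(), lru.end(), t_enc);
        if (it != lru.end()) {
            lru.erase(it);  // hit: refresh recency below
        } else {
            ++builds;       // miss: a new graph would be built
            if ((int) lru.size() == cache_max) lru.pop_back();
        }
        lru.push_front(t_enc);
        fifo_n = std::min(fifo_len, fifo_n + chunk_len);
    }
    return builds;
}
```

Under these assumptions, 40 chunks with fifo_len=188, chunk_len=6 and a 4-entry cache produce 33 builds unpadded versus 1 build when every call is padded to the steady-state size.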
Medium / Should-Fix
5. v2.1 detection by encoder shape is fragile
const bool model_is_v2_1 =
pimpl_->model.encoder_cfg.n_layers == 17 &&
pimpl_->model.mel_cfg.n_mels == 128;
If NeMo ships a v3 or a v2.2 variant that happens to share {17 layers, 128 mels} but was not trained with the cache-aware concat forward, enabling AOSC on it will produce garbage silently. A GGUF metadata key (e.g. parakeet.model_variant = "sortformer-v2.1-aosc") set by the converter would be more robust. At minimum, document this assumption in diarization.h next to the detection logic and add a note that it must be revisited when a new variant is converted.
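A tag-first check with a shape fallback could look like the sketch below. The function name `is_v2_1_aosc` is hypothetical, and the tag string follows the `sortformer-streaming-v2.1-aosc` value that the follow-up commit's converter writes; treat both as illustrative rather than as the engine's actual code.

```cpp
#include <string>

// Prefer the GGUF variant tag when present; fall back to encoder shape
// only for legacy GGUFs that carry no tag (empty string).
bool is_v2_1_aosc(const std::string& model_variant, int n_layers, int n_mels) {
    if (!model_variant.empty()) {
        return model_variant == "sortformer-streaming-v2.1-aosc";
    }
    return n_layers == 17 && n_mels == 128;
}
```

A tagged v2 model with the same {17, 128} shape is rejected here, which is the case the shape-only check cannot distinguish.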
6. streaming_update parameter name chunk_pre_encode_lc is misleading
The function signature says chunk_pre_encode_lc but the call site passes the committed chunk slice (already offset past the left context):
const float * chunk_pre_committed = chunk_pre_encode_embs + (size_t) lc * D;
streaming_update(cache, chunk_pre_committed, chunk_len_eff, ...);
The name implies it includes the left context, which it does not. Rename to committed_chunk_pre_encode to match the call-site variable name and the comment in the function body.
7. load_wav_pcm16le_mono duplicated verbatim from test_sortformer_streaming.cpp
The comment in the new test file acknowledges this ("duplicated here on purpose"). For a 60-line helper this is borderline, but the two copies will drift. A shared test/test_utils.h header in the test/ directory would be the right solution. Not a blocker, but worth a TODO at minimum.
8. WAV fixtures committed as binary blobs (~11 MB total)
abcba.wav (~5.0 MB) and abcdba.wav (~5.9 MB) are committed directly into the repo. Git LFS would be the cleaner long-term approach, consistent with how the project will likely handle future audio test fixtures. If the project doesn't use LFS yet, at least leave a comment in CMakeLists.txt pointing to where the fixtures can be regenerated.
Low / Nice-to-Have
9. -std::numeric_limits<float>::infinity() is not UB
The comment /* very-negative sentinel; -inf is UB with current FP flags */ appears three times. IEEE 754 infinity is a well-defined value; the UB concern applies to operations like inf - inf, not to storing or comparing the value. Using std::numeric_limits<float>::lowest() (which is approximately -3.4e38) or -std::numeric_limits<float>::infinity() directly would both be more readable than the magic -1.0e30f sentinel. Not a bug, just misleading documentation.
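The distinction between storing or comparing a sentinel and doing arithmetic on it can be shown in a few lines. The constant names below follow the ones used in the follow-up commit; `argmax_unmasked` and `inf_minus_inf_is_nan` are hypothetical helpers, and the sketch assumes default IEEE 754 floating-point flags (no fast-math).

```cpp
#include <cmath>
#include <limits>

constexpr float k_score_neg_inf = std::numeric_limits<float>::lowest();
constexpr float k_score_pos_inf = std::numeric_limits<float>::max();

// Index of the best unmasked score, or -1 when every entry is masked.
// The sentinel is only stored and compared, never used in arithmetic.
int argmax_unmasked(const float* scores, const bool* keep, int n) {
    float best = k_score_neg_inf;
    int best_i = -1;
    for (int i = 0; i < n; ++i) {
        if (!keep[i]) continue;
        if (best_i < 0 || scores[i] > best) {
            best = scores[i];
            best_i = i;
        }
    }
    return best_i;
}

// Arithmetic on infinities is where trouble starts: inf - inf is NaN.
bool inf_minus_inf_is_nan() {
    const float inf = std::numeric_limits<float>::infinity();
    return std::isnan(inf - inf);
}
```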
10. ring.erase is O(n) — pre-existing, but AOSC retention differs
AOSC retains only chunk_left_context_samples behind emit_end, which is much smaller than history_ms. So ring trims happen more aggressively and ring.erase is called more frequently (every chunk vs. lazily on v1). This amplifies the pre-existing O(n) cost. No action required now, but worth a note for a future std::deque refactor.
11. prev_chunk_full_segments populated on AOSC path unnecessarily
On the AOSC path slot_remap is always identity, but cur_full is still moved into prev_chunk_full_segments every chunk. This is harmless (just a small memory/copy overhead), but an if (!cache_active) guard around those two lines would clarify intent.
12. encoder_ms attribution is slightly surprising
const double encoder_ms = ms_since(t_enc) - dres.decode_ms;
t_enc is set before run_subsampling, so ms_since(t_enc) covers subsampling + bypass-encode + diarize. Subtracting decode_ms leaves the "everything except the diarizer head" time, which is actually the subsampling + conformer-bypass time. The field name encoder_ms matches the existing convention in the non-AOSC path, so this is consistent — just worth a comment explaining what the subtraction is doing.
Resolves the review comments on the merged AOSC v2.1 PR (#22, merge commit e6ba38c). All eight changes are minimal and behaviour-preserving except the v2.1 detection upgrade (now strict-tag with shape fallback) and the degenerate-config guard (silence-only fallback instead of UB-adjacent boost arithmetic). Reviewer comments classified as "perf only / out of scope / would only add a TODO" are intentionally not addressed in this commit -- see the plan file referenced in the PR description.

src/parakeet_sortformer.cpp -- `compress_speaker_cache`
- Early-return when `spkcache_len_per_spk <= 0` (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K stages are mostly defended (`boost_topk_scores` already returns early on non-positive k), but the function was otherwise running a no-op pass that produced an all-silence cache via the slow path. Fall back to an explicit silence-only profile and bail.
- Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to `committed_chunk_pre_encode`. The call site already advances past the left context (`chunk_pre_committed = ... + lc * D`), so the old `_lc` suffix was misleading. `int lc` stays -- it's used inside the function to index into `preds_full`, which still contains the left-context preds.
- Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites) with named constants `k_score_neg_inf` / `k_score_pos_inf` backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped the inline "-inf is UB with current FP flags" comments: IEEE 754 +/-inf is well-defined; the original concern (avoiding NaN-on-arithmetic) still holds because we only store and compare the sentinels.

src/parakeet_engine.cpp
- On the AOSC path, skip the `for (cur_full) remap_id(...)` loop and the `prev_chunk_full_segments = std::move(cur_full)` store: `compute_slot_remap_` is never consulted when `cache_active` is true (AOSC anchors slot identity through the speaker cache), so the work was dead.
- Switched v2.1 detection from pure-shape to "prefer the converter's `parakeet.model_variant` GGUF tag; fall back to `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This prevents a future v2.2/v3 variant that happens to share v2.1's encoder shape from silently opting into AOSC.

include/parakeet/diarization.h
- Moved the v1-vs-v2.1 detection rationale comment out of parakeet_engine.cpp and into the `SortformerStreamingOptions::spkcache_enable` block, with a paragraph on the tag-first / shape-fallback policy.

src/parakeet_ctc.{h,cpp}
- Added `std::string ParakeetCtcModel::model_variant` (optional GGUF metadata mirror; empty on legacy GGUFs).
- Loader reads `parakeet.model_variant` next to the existing `parakeet.model.type` read; absent key -> empty string -> detection falls back to shape.

scripts/convert-nemo-to-gguf.py
- New `detect_sortformer_variant(ckpt: Path)` derives a stable variant tag from the source .nemo filename (`sortformer-v1` / `sortformer-streaming-v2` / `sortformer-streaming-v2.1-aosc`); empty string for unknown checkpoints.
- Sortformer branch of `write_gguf` writes `parakeet.model_variant` when the tag is non-empty.
- `write_gguf` signature extended with `ckpt: Path`; only the one internal call site adjusted.

scripts/download-all-models.sh
- Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the AOSC fine-tune that this PR's tests target); bumped the budget comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp
- Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default GGUF path is the matching v2.1 q8_0. Aligns the test with the line-299 comment that says the binary "reflects the production v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp
- Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists` duplicates into a shared inline header in the `parakeet_test` namespace. The duplicate copies and the "duplicated here on purpose" comment block in test_sortformer_aosc_speakers.cpp are gone; both tests `#include "test_utils.h"` and use `using parakeet_test::...`.

Build + ctest verification
- `cmake --build build -j` clean (no new warnings).
- `ctest -R 'test-sortformer-(streaming|aosc-speakers)'`:
  test-sortformer-streaming ........ Passed 8.23 s
  test-sortformer-aosc-speakers-abcba . Passed 33.80 s
  test-sortformer-aosc-speakers-abcdba Passed 36.91 s
  The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant` key, so the AOSC tests passing here also verifies the shape-fallback path. Re-running the converter on the v2.1 .nemo will populate the new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):
- Encoder graph cache thrashing during FIFO ramp-up (#4): perf only; proper fix wants pre-build-at-diarize_start + silence padding or a mask argument, not minimal. Tracked for a follow-up perf PR.
- WAV fixtures committed as ~11 MB binaries (#8): project-wide Git LFS adoption decision, not a code change.
- `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing on the v1 path; wants a std::deque refactor, out of scope.
- `encoder_ms` attribution surprising (#12): code is correct and matches sibling paths; the user explicitly opted against comment-only "clarifications".
…ew-comments parakeet-cpp: address PR #22 AOSC v2.1 review comments
…er-aosc feat(QVAC-18625): Add better support for long-term streaming diarization with v2.1
feat(QVAC-18625): Sortformer v2.1 streaming with NeMo Arrival-Order Speaker Cache (AOSC)
TL;DR
Adds v2.1 streaming Sortformer to parakeet-cpp via a port of NeMo's Arrival-Order Speaker Cache (AOSC) algorithm, plus a slot-continuity fix on the existing v1 streaming path, plus a regression test suite. Speakers now keep their original `speaker_id` even after long silences. The v1 path is bit-identical apart from the slot-continuity fix on its own streaming code.
The original problem
In v1 streaming, the session only carries a rolling `history_ms` window of audio context. If a speaker goes silent for longer than that window and then returns, the model has no memory of them: they get whatever `speaker_id` slot happens to be free, which may or may not match their original slot. The user sees what looks like a speaker swap mid-conversation.
The fix (model-side)
The newer v2 / v2.1 NeMo Sortformer introduced a speaker cache: a small set of acoustic embeddings retained per speaker across the entire session, plus a FIFO of recent embeddings, plus a per-session "silence profile" embedding. On every chunk the model processes a concatenated row (speaker cache + FIFO + current chunk) through the FastConformer blocks, so it always sees a fresh attention window over every speaker it has heard in this session. After arbitrarily long silences, the same voice rebinds to its original slot.
For v1 (which doesn't have the cache architecture), this PR also adds a separate overlap-based slot-continuity remap so cross-chunk drift inside the history window is suppressed.
What's added
- Port of NeMo's `sortformer_modules.py` + `sortformer_diar_models.py` into parakeet-cpp. Helpers in C++ carry `// matches NeMo <fn> at <line>` comments: `_compress_spkcache`, `_get_silence_profile`, `_disable_low_scores` / `_boost_topk_scores`, `streaming_update`, `forward_streaming_step`, and `run_encoder_bypass_pre_encode` (feeds pre-subsampled embeddings straight into the conformer stack; required because the cache lives in post-subsampling space).
- Chunk left/right context windowing (`chunk_left_context_ms=80`, `chunk_right_context_ms=560`), defaults from NeMo's `e2e_diarize_speech.py` inference YAML.
- `cache_active=false` for v1, so the v1 code path is bit-identical.
- `test-sortformer-aosc-speakers-{abcba,abcdba}` asserting three invariants against an RTTM ground truth: (a) every reference speaker is covered, (b) re-entering speakers land in the same `hyp_<id>` they were first assigned to (the AOSC contract), (c) frame-level DER ≤ 30 %.
- `SortformerStreamSession::aosc_active()`: runtime introspection getter so callers can tell v1 from v2.1+AOSC.
- `examples/live-mic`: banner branches on `aosc_active()`; v2.1 prints spkcache/fifo/lc/rc, v1 keeps the legacy history-window banner unchanged.
The biggest blocker: `mean_sil_emb`
NeMo's algorithm uses a special "what does silence sound like in this room" baseline embedding to fill cache slots that aren't yet bound to a real speaker. Without it, those slots stay at zeros, which is out-of-distribution input for the model, and the model collapses every chunk to a single slot. Once the runtime EMA update rule from `sortformer_modules.py` was matched in C++, the cache-aware forward started behaving.
Results
Complete 5-way DER table (synthetic fixtures, q8_0 GGUFs, Apple M-series CPU):
Voices used (free ElevenLabs API key for testing):
A=Sarah (♀US), B=Adam (♂US), C=Alice (♀UK), D=George (♂UK). 1 s silence between turns. Ground-truth RTTM auto-generated from clip durations.
Per-fixture analysis
- `three-english-speakers`: no re-entry, no long silence; v1 offline almost perfect (2.14 %). v2.1 conflates the two female-adjacent voices and gets 26 %. v1 and v2.1 streaming each track their offline numbers closely because there's no silence-return event to trigger the cache.
- `abcab_real` (distinct voices): every mode handles this fine. Confirms the new code doesn't regress on easy material.
- `absaman` (A→B→C→Aman→B→A, 4-speaker with long silences): the failure mode this PR targets.
Known limitation (and the trade-off)
v2.1 is not strictly better than v1. When two speakers are acoustically similar (notably two female voices in a row, or two male voices in a row), v2.1's cache + FIFO + attention layer thinks they're the same speaker and assigns them the same ID. v1 — without the cache — happens to separate them better in that exact scenario.
So in practice:
… `history_ms` of silence between same-speaker turns), v1 catastrophically fails (~50 % DER) and v2.1 + AOSC catches it (~22 %).
This PR makes v2.1+AOSC available as an option that works correctly per the NeMo algorithm. It does not change the default model selection; downstream consumers keep choosing whichever variant fits their workload.
API additions (additive only)
- `SortformerStreamingOptions::spkcache_enable` (default `true`), `spkcache_len`, `fifo_len`, `chunk_left_context_ms`, `chunk_right_context_ms`, `spkcache_update_period`: public knobs matching NeMo's inference YAML.
- `SortformerStreamSession::aosc_active()`: runtime introspection.
No signature changes on existing methods. v1 GGUFs take the unchanged code path.
Validation
- `ctest -R sortformer-aosc-speakers` passes both `-abcba` and `-abcdba` with the v2.1 GGUF in `parakeet-cpp/models/`.
- `live-mic` on a v2.1 model prints `[live-mic] listening at 16 kHz mono (v2.1 diarization, AOSC). chunk=2000 ms spkcache_len=188 fifo_len=188 lc=80 ms rc=560 ms.`; verified end-to-end with a live mic capture that emitted multiple speaker IDs.
- `test-sortformer-streaming` stays bit-identical (4.17 % DER on the English fixture).