From c41c0f19c7302b7085c74cf7f9f440278eded89a Mon Sep 17 00:00:00 2001
From: Pratik Narola <pratiknarola@Mac.bbrouter>
Date: Tue, 19 May 2026 10:45:16 +0530
Subject: [PATCH] parakeet-cpp: address PR #22 AOSC v2.1 review comments

Resolves the review comments on the merged AOSC v2.1 PR
(tetherto/qvac-ext-lib-whisper.cpp#22, merge commit e6ba38cf). All
eight changes are minimal and behaviour-preserving except the v2.1
detection upgrade (now strict-tag with shape fallback) and the
degenerate-config guard (silence-only fallback instead of UB-adjacent
boost arithmetic). Reviewer comments classified as "perf only / out
of scope / would only add a TODO" are intentionally not addressed in
this commit -- see the plan file referenced in the PR description.

src/parakeet_sortformer.cpp
  - `compress_speaker_cache`: early-return when
    `spkcache_len_per_spk <= 0`, i.e. `spkcache_len / num_spks <= A_sil`
    under integer division (always the case when
    `num_spks * A_sil >= spkcache_len`). The downstream boost/top-K
    stages are mostly defended (`boost_topk_scores` already returns
    early on non-positive k), but the function was otherwise running
    a no-op pass that produced an all-silence cache via the slow
    path. Fall back to an explicit silence-only profile and bail.
  - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to
    `committed_chunk_pre_encode`. The call site already advances
    past the left context (`chunk_pre_committed = ... + lc * D`),
    so the old `_lc` suffix was misleading. `int lc` stays -- it's
    used inside the function to index into `preds_full`, which
    still contains the left-context preds.
  - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (5 sites:
    four negative, one positive) with named constants
    `k_score_neg_inf` / `k_score_pos_inf` backed by
    `std::numeric_limits<float>::{lowest,max}()`. Dropped the
    inline "-inf is UB with current FP flags" comments, since IEEE
    754 +/-inf is well-defined. Finite extrema are kept so that
    stray arithmetic on a sentinel cannot produce NaN; the code
    only stores and compares them.
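
The guard condition from the first bullet can be sketched in Python.
This is a hedged model of the integer arithmetic only, not the C++
code: `per_speaker_budget`, `is_degenerate`, and the sample numbers
are hypothetical and chosen purely for illustration.

```python
def per_speaker_budget(spkcache_len: int, num_spks: int, a_sil: int) -> int:
    """Frames each speaker may retain once a_sil silence frames are reserved.

    Models `spkcache_len / num_spks - A_sil` from compress_speaker_cache.
    `//` matches C++ integer division here because both operands are
    assumed to be positive.
    """
    return spkcache_len // num_spks - a_sil


def is_degenerate(spkcache_len: int, num_spks: int, a_sil: int) -> bool:
    # Non-positive budget: nothing to rank, so fall back to silence only.
    return per_speaker_budget(spkcache_len, num_spks, a_sil) <= 0


# Illustrative values only (not taken from any shipped config).
print(per_speaker_budget(188, 4, 3))  # 188 // 4 - 3 = 44
print(is_degenerate(188, 4, 3))       # False
print(is_degenerate(8, 4, 2))         # True: 8 // 4 - 2 = 0
print(is_degenerate(9, 4, 2))         # True: 9 // 4 - 2 = 0
```

Because the division truncates, the guard can also fire when
`num_spks * A_sil` is slightly below `spkcache_len`, as in the last
line.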

src/parakeet_engine.cpp
  - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop
    and the `prev_chunk_full_segments = std::move(cur_full)` store:
    `compute_slot_remap_` is never consulted when `cache_active` is
    true (AOSC anchors slot identity through the speaker cache), so
    the work was dead.
  - Switched v2.1 detection from pure-shape to "prefer the
    converter's `parakeet.model_variant` GGUF tag; fall back to
    `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This
    prevents a future v2.2/v3 variant that happens to share v2.1's
    encoder shape from silently opting into AOSC.
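
The tag-first / shape-fallback policy can be modelled in a few lines
of Python. This is a sketch that mirrors the C++ ternary in
`Engine::diarize_start`, not the implementation itself; the function
names and the "sortformer-streaming-v3" tag are hypothetical.

```python
V2_1_TAG = "sortformer-streaming-v2.1-aosc"


def model_is_v2_1(variant: str, n_layers: int, n_mels: int) -> bool:
    """Tag-first detection with an encoder-shape fallback."""
    if variant:
        # A tagged GGUF is trusted outright; shape is not consulted.
        return variant == V2_1_TAG
    # Legacy GGUF without the tag: fall back to the v2.1 encoder shape.
    return n_layers == 17 and n_mels == 128


def cache_active(spkcache_enable: bool, variant: str,
                 n_layers: int, n_mels: int) -> bool:
    # AOSC needs both the user opt-in and a v2.1 model.
    return spkcache_enable and model_is_v2_1(variant, n_layers, n_mels)


print(model_is_v2_1(V2_1_TAG, 17, 128))                   # True (tagged)
print(model_is_v2_1("", 17, 128))                         # True (legacy, shape)
print(model_is_v2_1("", 18, 80))                          # False (v1 shape)
print(model_is_v2_1("sortformer-streaming-v3", 17, 128))  # False (hypothetical tag)
```

The last call is the case the tag exists for: a non-empty tag that is
not the v2.1 one disables AOSC even when the shape matches.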

include/parakeet/diarization.h
  - Moved the v1-vs-v2.1 detection rationale comment out of
    parakeet_engine.cpp and into the `SortformerStreamingOptions::
    spkcache_enable` block, with a paragraph on the tag-first /
    shape-fallback policy.

src/parakeet_ctc.{h,cpp}
  - Added `std::string ParakeetCtcModel::model_variant` (optional
    GGUF metadata mirror; empty on legacy GGUFs).
  - Loader reads `parakeet.model_variant` next to the existing
    `parakeet.model.type` read; absent key -> empty string ->
    detection falls back to shape.

scripts/convert-nemo-to-gguf.py
  - New `detect_sortformer_variant(ckpt: Path)` derives a stable
    variant tag from the source .nemo filename
    (`sortformer-v1` / `sortformer-streaming-v2` /
    `sortformer-streaming-v2.1-aosc`); empty string for unknown
    checkpoints.
  - Sortformer branch of `write_gguf` writes
    `parakeet.model_variant` when the tag is non-empty.
  - `write_gguf` signature extended with `ckpt: Path`; only the
    one internal call site adjusted.
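
To show what the filename mapping yields in practice, the snippet
below copies the body of `detect_sortformer_variant` from this patch
and runs it on a few names. The v1 filename is an assumption and
`some_other_model.nemo` is a made-up unknown checkpoint.

```python
from pathlib import Path


def detect_sortformer_variant(ckpt: Path) -> str:
    # Body copied from the converter change in this patch.
    stem = ckpt.stem
    # "-v2.1" must be tested first: "-v2" is a substring of "-v2.1".
    if "streaming_sortformer" in stem and "-v2.1" in stem:
        return "sortformer-streaming-v2.1-aosc"
    if "streaming_sortformer" in stem and "-v2" in stem:
        return "sortformer-streaming-v2"
    if "diar_sortformer" in stem and "-v1" in stem:
        return "sortformer-v1"
    return ""


for name in (
    "diar_streaming_sortformer_4spk-v2.1.nemo",
    "diar_streaming_sortformer_4spk-v2.nemo",
    "diar_sortformer_4spk-v1.nemo",  # assumed v1 filename
    "some_other_model.nemo",         # hypothetical unknown checkpoint
):
    print(name, "->", repr(detect_sortformer_variant(Path(name))))
```

Since the tag is derived from the filename alone, a renamed
checkpoint maps to the empty string and the loader would then rely
on the shape fallback.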

scripts/download-all-models.sh
  - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the
    AOSC fine-tune that this PR's tests target); bumped the budget
    comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the
    contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp
  - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was
    `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default
    GGUF path is the matching v2.1 q8_0. Aligns the test with the
    line-299 comment that says the binary "reflects the production
    v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp
  - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists`
    duplicates into a shared inline header in the `parakeet_test`
    namespace. The duplicate copies and the "duplicated here on
    purpose" comment block in test_sortformer_aosc_speakers.cpp
    are gone; both tests `#include "test_utils.h"` and use
    `using parakeet_test::...`.

Build + ctest verification
  - `cmake --build build -j` clean (no new warnings).
  - `ctest -R 'test-sortformer-(streaming|aosc-speakers)'`:
      test-sortformer-streaming ............  Passed   8.23 s
      test-sortformer-aosc-speakers-abcba ..  Passed  33.80 s
      test-sortformer-aosc-speakers-abcdba .  Passed  36.91 s
    The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant`
    key, so the AOSC tests passing here also verifies the shape-fallback
    path. Re-running the converter on the v2.1 .nemo will populate
    the new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):
  - Encoder graph cache thrashing during FIFO ramp-up (#4): perf
    only; proper fix wants pre-build-at-diarize_start + silence
    padding or a mask argument, not minimal. Tracked for a follow-up
    perf PR.
  - WAV fixtures committed as ~11 MB binaries (#8): project-wide
    Git LFS adoption decision, not a code change.
  - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing
    on the v1 path; wants a std::deque refactor, out of scope.
  - `encoder_ms` attribution surprising (#12): code is correct and
    matches sibling paths; the user explicitly opted against
    comment-only "clarifications".
---
 parakeet-cpp/CMakeLists.txt                   |  4 +-
 parakeet-cpp/include/parakeet/diarization.h   | 20 ++++--
 parakeet-cpp/scripts/convert-nemo-to-gguf.py  | 27 +++++++-
 parakeet-cpp/scripts/download-all-models.sh   | 11 ++-
 parakeet-cpp/src/parakeet_ctc.cpp             |  3 +
 parakeet-cpp/src/parakeet_ctc.h               |  7 ++
 parakeet-cpp/src/parakeet_engine.cpp          | 31 +++++----
 parakeet-cpp/src/parakeet_sortformer.cpp      | 40 +++++++++--
 .../test/test_sortformer_aosc_speakers.cpp    | 54 +--------------
 .../test/test_sortformer_streaming.cpp        | 50 ++------------
 parakeet-cpp/test/test_utils.h                | 69 +++++++++++++++++++
 11 files changed, 187 insertions(+), 129 deletions(-)
 create mode 100644 parakeet-cpp/test/test_utils.h
diff --git a/parakeet-cpp/CMakeLists.txt b/parakeet-cpp/CMakeLists.txt
index eac64cc6957..eecbae2c6f4 100644
--- a/parakeet-cpp/CMakeLists.txt
+++ b/parakeet-cpp/CMakeLists.txt
@@ -554,8 +554,8 @@ if (PARAKEET_BUILD_TESTS)
     parakeet_apply_ccache(test-sortformer-streaming)
     parakeet_register_test(test-sortformer-streaming
         LABEL    "fixture"
-        ARGS     "--model" "${_qvp_sfs_q8_gguf}" "--wav" "${_qvp_diar_wav}"
-        REQUIRES "${_qvp_sfs_q8_gguf}" "${_qvp_diar_wav}")
+        ARGS     "--model" "${_qvp_sfsv21_q8_gguf}" "--wav" "${_qvp_diar_wav}"
+        REQUIRES "${_qvp_sfsv21_q8_gguf}" "${_qvp_diar_wav}")
 
     # v2.1 AOSC speaker-correctness regression. Asserts speaker coverage,
     # re-entry slot continuity (the AOSC contract), and frame-level DER
diff --git a/parakeet-cpp/include/parakeet/diarization.h b/parakeet-cpp/include/parakeet/diarization.h
index 6c0498919ac..9ea09b06ab9 100644
--- a/parakeet-cpp/include/parakeet/diarization.h
+++ b/parakeet-cpp/include/parakeet/diarization.h
@@ -75,12 +75,20 @@ struct SortformerStreamingOptions {
 
     // === AOSC (Audio-Online Speaker Cache, Sortformer v2.1) ===
     // Cache-aware streaming forward (port of NeMo's `forward_streaming_step` +
-    // `streaming_update` + `_compress_spkcache`). On v2.1 models (auto-detected
-    // from encoder shape) and spkcache_enable=true, the engine concatenates the
-    // speaker cache + FIFO + current chunk's pre-encode embeddings, runs the
-    // conformer layers over the concat, then the diariser head, before updating
-    // the runtime cache. This preserves speaker identity across silences far
-    // longer than `history_ms`. v1 models always take the legacy path.
+    // `streaming_update` + `_compress_spkcache`). On v2.1 models with
+    // spkcache_enable=true, the engine concatenates the speaker cache + FIFO +
+    // current chunk's pre-encode embeddings, runs the conformer layers over the
+    // concat, then the diariser head, before updating the runtime cache. This
+    // preserves speaker identity across silences far longer than `history_ms`.
+    // v1 and v2 models always take the legacy path.
+    //
+    // Variant detection: prefers the converter's `parakeet.model_variant` GGUF
+    // metadata tag (a stable per-checkpoint string, e.g.
+    // `sortformer-streaming-v2.1-aosc`) so a future variant that happens to
+    // share the v2.1 encoder shape can't silently opt into AOSC. GGUFs that
+    // pre-date the tag fall back to the encoder-shape heuristic: v1 has
+    // n_layers=18 / n_mels=80, v2.1 has n_layers=17 / n_mels=128. Re-run the
+    // converter after upgrading to populate the tag.
     //
     // `mean_sil_emb` is RUNTIME state (zeros at session start, EMA of detected
     // silence frames), NOT a learned tensor -- no converter changes required.
diff --git a/parakeet-cpp/scripts/convert-nemo-to-gguf.py b/parakeet-cpp/scripts/convert-nemo-to-gguf.py
index 34693a0b528..aed3a2314e1 100644
--- a/parakeet-cpp/scripts/convert-nemo-to-gguf.py
+++ b/parakeet-cpp/scripts/convert-nemo-to-gguf.py
@@ -199,7 +199,24 @@ def fuse_bn(weight, bias, running_mean, running_var, eps=1e-5):
     return scale.astype(np.float32), shift.astype(np.float32)
 
 
-def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str):
+def detect_sortformer_variant(ckpt: Path) -> str:
+    """
+    Map a NeMo Sortformer .nemo filename to a stable variant tag the C++
+    loader can match against. The tag is the only thing that distinguishes
+    cache-aware v2.1 from the architecturally-identical v2 at GGUF time
+    (encoder shape alone is ambiguous against future variants).
+    """
+    stem = ckpt.stem
+    if "streaming_sortformer" in stem and "-v2.1" in stem:
+        return "sortformer-streaming-v2.1-aosc"
+    if "streaming_sortformer" in stem and "-v2" in stem:
+        return "sortformer-streaming-v2"
+    if "diar_sortformer" in stem and "-v1" in stem:
+        return "sortformer-v1"
+    return ""
+
+
+def write_gguf(out: Path, ckpt: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str):
     model_type = detect_model_type(cfg)
 
     enc = cfg["encoder"]
@@ -331,6 +348,12 @@ def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str):
         writer.add_uint32("parakeet.sortformer.tf_n_heads",      int(tfe["num_attention_heads"]))
         writer.add_bool  ("parakeet.sortformer.tf_pre_ln",       bool(tfe.get("pre_ln", False)))
         writer.add_string("parakeet.sortformer.tf_hidden_act",   str(tfe.get("hidden_act", "relu")))
+        # Variant tag (preferred over shape-based detection on the C++ side).
+        # Empty string = unknown checkpoint; loader falls back to encoder
+        # shape so older GGUFs continue to load.
+        variant = detect_sortformer_variant(ckpt)
+        if variant:
+            writer.add_string("parakeet.model_variant", variant)
     else:
         pred_hidden      = int(dec["prednet"]["pred_hidden"])
         pred_rnn_layers  = int(dec["prednet"]["pred_rnn_layers"])
@@ -610,7 +633,7 @@ def main():
     ckpt = ensure_ckpt(args.ckpt, args.hf_repo)
     cfg, sd, tok_bytes = load_nemo(ckpt)
     args.out.parent.mkdir(parents=True, exist_ok=True)
-    write_gguf(args.out, cfg, sd, tok_bytes, args.quant)
+    write_gguf(args.out, ckpt, cfg, sd, tok_bytes, args.quant)
 
 
 if __name__ == "__main__":
diff --git a/parakeet-cpp/scripts/download-all-models.sh b/parakeet-cpp/scripts/download-all-models.sh
index 4e2a434a7ae..5327e77e791 100644
--- a/parakeet-cpp/scripts/download-all-models.sh
+++ b/parakeet-cpp/scripts/download-all-models.sh
@@ -4,10 +4,10 @@
 # as `.nemo` archives, ready for `convert-nemo-to-gguf.py`.
 #
 # Idempotent: skips files that already exist on disk. Re-run any time to top up.
-# Total download budget on a clean machine: ~14 GiB at the time of writing
+# Total download budget on a clean machine: ~14.5 GiB at the time of writing
 # (TDT v3 + TDT 1.1b + CTC 0.6b + CTC 1.1b + TDT_CTC hybrid + EOU 120M +
-# Sortformer v1 + streaming Sortformer v2). Already-cached checkpoints are
-# untouched.
+# Sortformer v1 + streaming Sortformer v2 + streaming Sortformer v2.1).
+# Already-cached checkpoints are untouched.
 #
 # Usage:
 #     ./scripts/download-all-models.sh             # everything
@@ -99,6 +99,11 @@ if [[ "${1:-all}" != "tdt" ]]; then
   echo "== nemo: diar_streaming_sortformer_4spk-v2 (4-speaker, streaming-trained, ~470 MiB)"
   fetch "https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2/resolve/main/diar_streaming_sortformer_4spk-v2.nemo" \
         "$NEMO_DIR/diar_streaming_sortformer_4spk-v2.nemo"
+
+  hr
+  echo "== nemo: diar_streaming_sortformer_4spk-v2.1 (4-speaker, streaming + AOSC fine-tune, ~470 MiB)"
+  fetch "https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1/resolve/main/diar_streaming_sortformer_4spk-v2.1.nemo" \
+        "$NEMO_DIR/diar_streaming_sortformer_4spk-v2.1.nemo"
 fi
 
 hr
diff --git a/parakeet-cpp/src/parakeet_ctc.cpp b/parakeet-cpp/src/parakeet_ctc.cpp
index 62a95cf1c63..d8b6edb2d87 100644
--- a/parakeet-cpp/src/parakeet_ctc.cpp
+++ b/parakeet-cpp/src/parakeet_ctc.cpp
@@ -679,6 +679,9 @@ int load_from_gguf(const std::string & gguf_path,
     else if (mtype_str == "sortformer") out_model.model_type = ParakeetModelType::SORTFORMER;
     else                                out_model.model_type = ParakeetModelType::CTC;
 
+    // Optional variant tag (empty for legacy GGUFs that predate the key).
+    out_model.model_variant = get_str(g, "parakeet.model_variant", "");
+
     if (out_model.model_type == ParakeetModelType::TDT) {
         out_model.encoder_cfg.tdt_pred_hidden     = get_u32(g, "parakeet.tdt.pred_hidden",     640);
         out_model.encoder_cfg.tdt_pred_rnn_layers = get_u32(g, "parakeet.tdt.pred_rnn_layers", 2);
diff --git a/parakeet-cpp/src/parakeet_ctc.h b/parakeet-cpp/src/parakeet_ctc.h
index 32fefe2947d..1187e75ce3f 100644
--- a/parakeet-cpp/src/parakeet_ctc.h
+++ b/parakeet-cpp/src/parakeet_ctc.h
@@ -259,6 +259,13 @@ struct SortformerWeights {
 struct ParakeetCtcModel {
     ParakeetModelType model_type = ParakeetModelType::CTC;
 
+    // Optional GGUF metadata tag (key `parakeet.model_variant`). Carries
+    // a stable identifier for the converted checkpoint that the engine
+    // can match against -- preferred over shape-based heuristics where
+    // two variants share an encoder shape (e.g. sortformer-streaming-v2
+    // vs sortformer-streaming-v2.1-aosc). Empty if the GGUF predates it.
+    std::string model_variant;
+
     EncoderConfig encoder_cfg;
     MelConfig     mel_cfg;
     BpeVocab      vocab;
diff --git a/parakeet-cpp/src/parakeet_engine.cpp b/parakeet-cpp/src/parakeet_engine.cpp
index e78e19e333f..b789d62229c 100644
--- a/parakeet-cpp/src/parakeet_engine.cpp
+++ b/parakeet-cpp/src/parakeet_engine.cpp
@@ -1455,11 +1455,16 @@ void SortformerStreamSession::Impl::process_chunk(int64_t window_start_sample,
 
     // Remap cur_full into session-stable IDs and store as the new
     // baseline so the next chunk's `compute_slot_remap_` can match
-    // against today's emitted identity scheme.
-    for (auto & f : cur_full) {
-        f.speaker_id = remap_id(f.speaker_id);
+    // against today's emitted identity scheme. AOSC anchors slot
+    // identity through the speaker cache, so `compute_slot_remap_`
+    // is never consulted on that path -- skip the storage and the
+    // identity-remap loop entirely.
+    if (!cache_active) {
+        for (auto & f : cur_full) {
+            f.speaker_id = remap_id(f.speaker_id);
+        }
+        prev_chunk_full_segments = std::move(cur_full);
     }
-    prev_chunk_full_segments = std::move(cur_full);
 
     // VadStateChanged from speaker_probs: a frame speaks if any speaker exceeds threshold;
     // the chunk speaks if any emitting-frame qualifies; dominant speaker from mean probs.
@@ -1656,14 +1661,16 @@ std::unique_ptr<SortformerStreamSession> Engine::diarize_start(
     impl->history_samples = opts.sample_rate * opts.history_ms / 1000;
     impl->ring.reserve(impl->history_samples);
 
-    // v2.1 detection (Audio-Online Speaker Cache eligibility).
-    // v1 sortformer-4spk-v1.q8_0: encoder.n_layers=18, preproc.n_mels=80.
-    // v2.1 sortformer-streaming-v2.1.q8_0: encoder.n_layers=17, preproc.n_mels=128.
-    // The v2.1 fine-tune is what trained the cache-aware concat-then-graph
-    // forward path; enabling it on v1 would just be untrained noise.
-    const bool model_is_v2_1 =
-        pimpl_->model.encoder_cfg.n_layers == 17 &&
-        pimpl_->model.mel_cfg.n_mels == 128;
+    // v2.1 detection (Audio-Online Speaker Cache eligibility). Documented
+    // in detail next to SortformerStreamingOptions::spkcache_enable in
+    // include/parakeet/diarization.h. Prefer the explicit variant tag
+    // emitted by the converter; fall back to encoder shape for legacy
+    // GGUFs that pre-date the parakeet.model_variant key.
+    const std::string & variant = pimpl_->model.model_variant;
+    const bool model_is_v2_1 = !variant.empty()
+        ? (variant == "sortformer-streaming-v2.1-aosc")
+        : (pimpl_->model.encoder_cfg.n_layers == 17 &&
+           pimpl_->model.mel_cfg.n_mels == 128);
     impl->cache_active = opts.spkcache_enable && model_is_v2_1;
 
     if (impl->cache_active) {
diff --git a/parakeet-cpp/src/parakeet_sortformer.cpp b/parakeet-cpp/src/parakeet_sortformer.cpp
index 49de08ee595..c99fe1dd0a3 100644
--- a/parakeet-cpp/src/parakeet_sortformer.cpp
+++ b/parakeet-cpp/src/parakeet_sortformer.cpp
@@ -29,6 +29,14 @@ namespace parakeet {
 
 namespace {
 
+// Score sentinels for the speaker-cache compression top-K. We use finite
+// extrema (well-defined under FE_DIVBYZERO trapping FP modes that some
+// host builds enable) instead of std::numeric_limits<float>::infinity()
+// purely so that subsequent arithmetic on these values cannot produce
+// NaNs -- they are only stored and compared with == / !=, never added.
+constexpr float k_score_neg_inf = std::numeric_limits<float>::lowest();
+constexpr float k_score_pos_inf = std::numeric_limits<float>::max();
+
 // Threshold speaker probabilities into time-sorted segments.
 void sf_threshold_segments(const std::vector<float> & speaker_probs,
                            int T_enc, int num_spks,
@@ -256,7 +264,7 @@ static void compute_log_pred_scores(const float * preds, int n_frames, int num_s
 static void disable_low_scores(std::vector<float> & scores,
                                const float * preds, int n_frames, int num_spks,
                                int min_pos_scores_per_spk) {
-    const float neg_inf = -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */;
+    const float neg_inf = k_score_neg_inf;
 
     // First pass: non-speech -> -inf.
     for (int t = 0; t < n_frames; ++t) {
@@ -313,7 +321,7 @@ static void boost_topk_scores(std::vector<float> & scores,
         for (int i = 0; i < k; ++i) {
             const int t = idx_buf[i];
             float & s = scores[(size_t) t * num_spks + spk];
-            if (s != -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */) {
+            if (s != k_score_neg_inf) {
                 s += boost;
             }
         }
@@ -343,6 +351,24 @@ static void compress_speaker_cache(
 
     const int A_sil = cfg.spkcache_sil_frames_per_spk;
     const int spkcache_len_per_spk = spkcache_len / num_spks - A_sil;
+    if (spkcache_len_per_spk <= 0) {
+        // Degenerate config: num_spks * A_sil >= spkcache_len leaves no
+        // budget for retained frames, so the boost / top-K stages would
+        // run with non-positive k and (for nth_element) a negative
+        // distance. Fall back to a silence-only cache and bail.
+        cache.spkcache.assign((size_t) spkcache_len * D, 0.0f);
+        if (cache.mean_sil_emb.size() == (size_t) D) {
+            for (int r = 0; r < spkcache_len; ++r) {
+                std::memcpy(cache.spkcache.data() + (size_t) r * D,
+                            cache.mean_sil_emb.data(),
+                            (size_t) D * sizeof(float));
+            }
+        }
+        cache.spkcache_preds.assign((size_t) spkcache_len * num_spks, 0.0f);
+        cache.n_rows = spkcache_len;
+        cache.spkcache_preds_valid = true;
+        return;
+    }
     const int strong_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.strong_boost_rate);
     const int weak_boost   = (int) std::floor((float) spkcache_len_per_spk * cfg.weak_boost_rate);
     const int min_pos_per  = (int) std::floor((float) spkcache_len_per_spk * cfg.min_pos_scores_rate);
@@ -360,7 +386,7 @@ static void compress_speaker_cache(
         for (int t = spkcache_len; t < n_frames; ++t) {
             float * s = scores.data() + (size_t) t * num_spks;
             for (int i = 0; i < num_spks; ++i) {
-                if (s[i] != -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */) {
+                if (s[i] != k_score_neg_inf) {
                     s[i] += cfg.scores_boost_latest;
                 }
             }
@@ -378,7 +404,7 @@ static void compress_speaker_cache(
     const int n_total = n_frames + A_sil;
     if (A_sil > 0) {
         scores.resize((size_t) n_total * num_spks);
-        const float pos_inf = 1.0e30f /* very-positive sentinel; +inf is UB with current FP flags */;
+        const float pos_inf = k_score_pos_inf;
         for (int t = n_frames; t < n_total; ++t) {
             float * s = scores.data() + (size_t) t * num_spks;
             for (int i = 0; i < num_spks; ++i) s[i] = pos_inf;
@@ -409,7 +435,7 @@ static void compress_speaker_cache(
     // speaker blocks contiguous; `torch.remainder(idx, n_frames)` returns the
     // frame index; our `idx % n_total` does the same.)
     for (int & idx : topk) {
-        if (flat_score(idx) == -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */) {
+        if (flat_score(idx) == k_score_neg_inf) {
             idx = MAX_INDEX;
         }
     }
@@ -467,7 +493,7 @@ static void compress_speaker_cache(
 // `lc` is the left-context offset within the chunk region; the committed-chunk
 // preds start at index `prev_spkcache_n + prev_fifo_n + lc` and span `chunk_committed`.
 static void streaming_update(SortformerSpeakerCache & cache,
-                             const float * chunk_pre_encode_lc, int chunk_committed,
+                             const float * committed_chunk_pre_encode, int chunk_committed,
                              const float * preds_full,
                              int prev_spkcache_len_at_call, int prev_fifo_len_at_call,
                              int lc,
@@ -492,7 +518,7 @@ static void streaming_update(SortformerSpeakerCache & cache,
     const int new_fifo_after_append = cache.n_fifo + chunk_committed;
     cache.fifo.resize((size_t) new_fifo_after_append * D);
     std::memcpy(cache.fifo.data() + (size_t) cache.n_fifo * D,
-                chunk_pre_encode_lc,
+                committed_chunk_pre_encode,
                 (size_t) chunk_committed * D * sizeof(float));
     cache.fifo_preds.resize((size_t) new_fifo_after_append * num_spks);
     std::memcpy(cache.fifo_preds.data() + (size_t) cache.n_fifo * num_spks,
diff --git a/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp b/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp
index dc37faa883f..c4a27bb0306 100644
--- a/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp
+++ b/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp
@@ -47,6 +47,7 @@
 //   ctest fixtures behave when their fixtures aren't on disk.
 
 #include "parakeet/engine.h"
+#include "test_utils.h"
 
 #include <algorithm>
 #include <cstdio>
@@ -64,57 +65,8 @@ namespace {
 
 constexpr double FRAME_S = 0.01;  // 10 ms grid
 
-bool file_exists(const std::string & p) {
-    std::ifstream f(p, std::ios::binary);
-    return f.good();
-}
-
-// Pulled verbatim from test_sortformer_streaming.cpp (line 37-76 of that
-// file). parakeet-cpp has no shared test-util header today, so the
-// helper is duplicated here on purpose; it matches how the existing
-// streaming/parity tests are organised.
-bool load_wav_pcm16le_mono(const std::string & path,
-                           std::vector<float> & samples,
-                           int & sample_rate) {
-    std::ifstream f(path, std::ios::binary);
-    if (!f) return false;
-    char riff[4]; f.read(riff, 4);
-    if (std::memcmp(riff, "RIFF", 4) != 0) return false;
-    f.ignore(4);
-    char wave[4]; f.read(wave, 4);
-    if (std::memcmp(wave, "WAVE", 4) != 0) return false;
-
-    bool fmt_ok = false; uint16_t channels = 0; uint16_t bits = 0; uint32_t srate = 0;
-    std::vector<char> data;
-    while (f) {
-        char id[4]; f.read(id, 4);
-        if (!f) break;
-        uint32_t sz = 0; f.read((char *) &sz, 4);
-        if (std::memcmp(id, "fmt ", 4) == 0) {
-            std::vector<char> hdr(sz);
-            f.read(hdr.data(), sz);
-            uint16_t fmt = *(uint16_t *) hdr.data();
-            channels    = *(uint16_t *) (hdr.data() + 2);
-            srate       = *(uint32_t *) (hdr.data() + 4);
-            bits        = *(uint16_t *) (hdr.data() + 14);
-            if (fmt != 1 || channels != 1 || bits != 16) return false;
-            fmt_ok = true;
-        } else if (std::memcmp(id, "data", 4) == 0) {
-            data.resize(sz);
-            f.read(data.data(), sz);
-            break;
-        } else {
-            f.ignore(sz);
-        }
-    }
-    if (!fmt_ok || data.empty()) return false;
-    sample_rate = (int) srate;
-    const int n = (int) (data.size() / 2);
-    samples.resize(n);
-    const int16_t * s16 = reinterpret_cast<const int16_t *>(data.data());
-    for (int i = 0; i < n; ++i) samples[i] = (float) s16[i] / 32768.0f;
-    return true;
-}
+using parakeet_test::file_exists;
+using parakeet_test::load_wav_pcm16le_mono;
 
 struct RttmSeg {
     double      start_s;
diff --git a/parakeet-cpp/test/test_sortformer_streaming.cpp b/parakeet-cpp/test/test_sortformer_streaming.cpp
index 4fd60a65a20..cbc069cd63b 100644
--- a/parakeet-cpp/test/test_sortformer_streaming.cpp
+++ b/parakeet-cpp/test/test_sortformer_streaming.cpp
@@ -16,6 +16,7 @@
 // ingest the file directly against the matching reference RTTM.
 
 #include "parakeet/engine.h"
+#include "test_utils.h"
 
 #include <atomic>
 #include <cstdio>
@@ -29,51 +30,8 @@
 
 namespace {
 
-bool file_exists(const std::string & p) {
-    std::ifstream f(p, std::ios::binary);
-    return f.good();
-}
-
-bool load_wav_pcm16le_mono(const std::string & path, std::vector<float> & samples, int & sample_rate) {
-    std::ifstream f(path, std::ios::binary);
-    if (!f) return false;
-    char riff[4]; f.read(riff, 4);
-    if (std::memcmp(riff, "RIFF", 4) != 0) return false;
-    f.ignore(4);
-    char wave[4]; f.read(wave, 4);
-    if (std::memcmp(wave, "WAVE", 4) != 0) return false;
-
-    bool fmt_ok = false; uint16_t channels = 0; uint16_t bits = 0; uint32_t srate = 0;
-    std::vector<char> data;
-    while (f) {
-        char id[4]; f.read(id, 4);
-        if (!f) break;
-        uint32_t sz = 0; f.read((char *) &sz, 4);
-        if (std::memcmp(id, "fmt ", 4) == 0) {
-            std::vector<char> hdr(sz);
-            f.read(hdr.data(), sz);
-            uint16_t fmt = *(uint16_t *) hdr.data();
-            channels    = *(uint16_t *) (hdr.data() + 2);
-            srate       = *(uint32_t *) (hdr.data() + 4);
-            bits        = *(uint16_t *) (hdr.data() + 14);
-            if (fmt != 1 || channels != 1 || bits != 16) return false;
-            fmt_ok = true;
-        } else if (std::memcmp(id, "data", 4) == 0) {
-            data.resize(sz);
-            f.read(data.data(), sz);
-            break;
-        } else {
-            f.ignore(sz);
-        }
-    }
-    if (!fmt_ok || data.empty()) return false;
-    sample_rate = (int) srate;
-    const int n = (int) (data.size() / 2);
-    samples.resize(n);
-    const int16_t * s16 = reinterpret_cast<const int16_t *>(data.data());
-    for (int i = 0; i < n; ++i) samples[i] = (float) s16[i] / 32768.0f;
-    return true;
-}
+using parakeet_test::file_exists;
+using parakeet_test::load_wav_pcm16le_mono;
 
 using namespace parakeet;
 
@@ -290,7 +248,7 @@ int run_basic(const std::string & gguf_path,
 }
 
 int main(int argc, char ** argv) {
-    std::string gguf = "models/sortformer-4spk-v1.f16.gguf";
+    std::string gguf = "models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf";
     std::string wav  = "test/samples/diarization-sample-16k.wav";
     int         history_ms   = 30000;
     int         chunk_ms     = 2000;
diff --git a/parakeet-cpp/test/test_utils.h b/parakeet-cpp/test/test_utils.h
new file mode 100644
index 00000000000..f819e192641
--- /dev/null
+++ b/parakeet-cpp/test/test_utils.h
@@ -0,0 +1,69 @@
+// Tiny shared helpers for the C++ test binaries. Kept dependency-light
+// (just the standard headers below) so any test can include this without
+// pulling in the public Engine surface or any project-internal types.
+//
+// History: previously these helpers lived inline in
+// test_sortformer_streaming.cpp and test_sortformer_aosc_speakers.cpp.
+// Pulling them up here avoids drift between two near-identical copies.
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+#include <fstream>
+#include <string>
+#include <vector>
+
+namespace parakeet_test {
+
+inline bool file_exists(const std::string & p) {
+    std::ifstream f(p, std::ios::binary);
+    return f.good();
+}
+
+// Load a 16 kHz / mono / s16le RIFF/WAVE file into [-1, 1) float samples.
+// Returns false on any header mismatch (non-PCM, non-mono, non-16bit) or
+// missing chunk; on success writes the sample rate via `sample_rate`.
+inline bool load_wav_pcm16le_mono(const std::string & path,
+                                  std::vector<float> & samples,
+                                  int & sample_rate) {
+    std::ifstream f(path, std::ios::binary);
+    if (!f) return false;
+    char riff[4]; f.read(riff, 4);
+    if (std::memcmp(riff, "RIFF", 4) != 0) return false;
+    f.ignore(4);
+    char wave[4]; f.read(wave, 4);
+    if (std::memcmp(wave, "WAVE", 4) != 0) return false;
+
+    bool fmt_ok = false; uint16_t channels = 0; uint16_t bits = 0; uint32_t srate = 0;
+    std::vector<char> data;
+    while (f) {
+        char id[4]; f.read(id, 4);
+        if (!f) break;
+        uint32_t sz = 0; f.read((char *) &sz, 4);
+        if (std::memcmp(id, "fmt ", 4) == 0) {
+            std::vector<char> hdr(sz);
+            f.read(hdr.data(), sz);
+            uint16_t fmt = *(uint16_t *) hdr.data();
+            channels    = *(uint16_t *) (hdr.data() + 2);
+            srate       = *(uint32_t *) (hdr.data() + 4);
+            bits        = *(uint16_t *) (hdr.data() + 14);
+            if (fmt != 1 || channels != 1 || bits != 16) return false;
+            fmt_ok = true;
+        } else if (std::memcmp(id, "data", 4) == 0) {
+            data.resize(sz);
+            f.read(data.data(), sz);
+            break;
+        } else {
+            f.ignore(sz);
+        }
+    }
+    if (!fmt_ok || data.empty()) return false;
+    sample_rate = (int) srate;
+    const int n = (int) (data.size() / 2);
+    samples.resize(n);
+    const int16_t * s16 = reinterpret_cast<const int16_t *>(data.data());
+    for (int i = 0; i < n; ++i) samples[i] = (float) s16[i] / 32768.0f;
+    return true;
+}
+
+}  // namespace parakeet_test