From c41c0f19c7302b7085c74cf7f9f440278eded89a Mon Sep 17 00:00:00 2001
From: Pratik Narola
Date: Tue, 19 May 2026 10:45:16 +0530
Subject: [PATCH] parakeet-cpp: address PR #22 AOSC v2.1 review comments

Resolves the review comments on the merged AOSC v2.1 PR
(tetherto/qvac-ext-lib-whisper.cpp#22, merge commit e6ba38cf). All eight
changes are minimal and behaviour-preserving except the v2.1 detection
upgrade (now strict-tag with shape fallback) and the degenerate-config
guard (silence-only fallback instead of UB-adjacent boost arithmetic).

Reviewer comments classified as "perf only / out of scope / would only
add a TODO" are intentionally not addressed in this commit -- see the
plan file referenced in the PR description.

src/parakeet_sortformer.cpp -- `compress_speaker_cache`
  - Early-return when `spkcache_len_per_spk <= 0` (i.e.
    `spkcache_len / num_spks <= A_sil` under integer division, which
    includes the `num_spks * A_sil >= spkcache_len` case). The downstream
    boost/top-K stages are mostly defended (`boost_topk_scores` already
    returns early on non-positive k), but the function was otherwise
    running a no-op pass that produced an all-silence cache via the slow
    path. Fall back to an explicit silence-only profile and bail.
  - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to
    `committed_chunk_pre_encode`. The call site already advances past
    the left context (`chunk_pre_committed = ... + lc * D`), so the old
    `_lc` suffix was misleading. `int lc` stays -- it's used inside the
    function to index into `preds_full`, which still contains the
    left-context preds.
  - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites) with
    named constants `k_score_neg_inf` / `k_score_pos_inf` backed by
    `std::numeric_limits<float>::{lowest,max}()`. Dropped the inline
    "-inf is UB with current FP flags" comments: IEEE 754 +/-inf is
    well-defined. The original concern (avoiding NaN-on-arithmetic) is
    still addressed, because the sentinels are only stored and compared.
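
The degenerate-config guard above comes down to one integer expression. The sketch below is not part of the patch: `per_speaker_budget` and `is_degenerate` are hypothetical helper names, the numbers are illustrative rather than taken from any shipped config, and positive inputs are assumed. It only mirrors `spkcache_len / num_spks - A_sil` to show which configurations would hit the early return.

```cpp
// Hypothetical helper (not in the patch) mirroring the expression
//   spkcache_len_per_spk = spkcache_len / num_spks - A_sil
// Assumes num_spks > 0 and non-negative inputs.
constexpr int per_speaker_budget(int spkcache_len, int num_spks, int sil_frames_per_spk) {
    return spkcache_len / num_spks - sil_frames_per_spk;
}

// True when the early-return guard (`spkcache_len_per_spk <= 0`) would fire.
constexpr bool is_degenerate(int spkcache_len, int num_spks, int sil_frames_per_spk) {
    return per_speaker_budget(spkcache_len, num_spks, sil_frames_per_spk) <= 0;
}

// Illustrative values only.
static_assert(per_speaker_budget(188, 4, 3) == 44, "188 / 4 - 3 frames per speaker");
static_assert(!is_degenerate(188, 4, 3), "a roomy cache keeps a positive budget");
static_assert(is_degenerate(8, 4, 2), "4 * 2 >= 8: no room left after silence frames");
static_assert(is_degenerate(10, 4, 2), "10 / 4 truncates to 2, so the budget is also 0");
static_assert(!is_degenerate(12, 4, 2), "12 / 4 - 2 leaves 1 frame per speaker");
```

Because the division truncates, the guard fires slightly more often than the `num_spks * A_sil >= spkcache_len` shorthand suggests: the `(10, 4, 2)` case has `4 * 2 < 10` yet still leaves no per-speaker budget.
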

src/parakeet_engine.cpp
  - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop and
    the `prev_chunk_full_segments = std::move(cur_full)` store:
    `compute_slot_remap_` is never consulted when `cache_active` is true
    (AOSC anchors slot identity through the speaker cache), so the work
    was dead.
  - Switched v2.1 detection from pure-shape to "prefer the converter's
    `parakeet.model_variant` GGUF tag; fall back to
    `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This prevents a
    future v2.2/v3 variant that happens to share v2.1's encoder shape
    from silently opting into AOSC.

include/parakeet/diarization.h
  - Moved the v1-vs-v2.1 detection rationale comment out of
    parakeet_engine.cpp and into the
    `SortformerStreamingOptions::spkcache_enable` block, with a
    paragraph on the tag-first / shape-fallback policy.

src/parakeet_ctc.{h,cpp}
  - Added `std::string ParakeetCtcModel::model_variant` (optional GGUF
    metadata mirror; empty on legacy GGUFs).
  - Loader reads `parakeet.model_variant` next to the existing
    `parakeet.model.type` read; absent key -> empty string -> detection
    falls back to shape.

scripts/convert-nemo-to-gguf.py
  - New `detect_sortformer_variant(ckpt: Path)` derives a stable variant
    tag from the source .nemo filename (`sortformer-v1` /
    `sortformer-streaming-v2` / `sortformer-streaming-v2.1-aosc`); empty
    string for unknown checkpoints.
  - Sortformer branch of `write_gguf` writes `parakeet.model_variant`
    when the tag is non-empty.
  - `write_gguf` signature extended with `ckpt: Path`; only the one
    internal call site adjusted.

scripts/download-all-models.sh
  - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the AOSC
    fine-tune that this PR's tests target); bumped the budget comment
    from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp
  - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was
    `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default GGUF
    path is the matching v2.1 q8_0.
    Aligns the test with the line-299 comment that says the binary
    "reflects the production v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp
  - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists`
    duplicates into a shared inline header in the `parakeet_test`
    namespace. The duplicate copies and the "duplicated here on purpose"
    comment block in test_sortformer_aosc_speakers.cpp are gone; both
    tests `#include "test_utils.h"` and use `using parakeet_test::...`.

Build + ctest verification
  - `cmake --build build -j` clean (no new warnings).
  - `ctest -R 'test-sortformer-(streaming|aosc-speakers)'`:
      test-sortformer-streaming ........... Passed   8.23 s
      test-sortformer-aosc-speakers-abcba . Passed  33.80 s
      test-sortformer-aosc-speakers-abcdba  Passed  36.91 s
    The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant`
    key, so the AOSC tests passing here also verify the shape-fallback
    path. Re-running the converter on the v2.1 .nemo will populate the
    new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):
  - Encoder graph cache thrashing during FIFO ramp-up (#4): perf only;
    proper fix wants pre-build-at-diarize_start + silence padding or a
    mask argument, not minimal. Tracked for a follow-up perf PR.
  - WAV fixtures committed as ~11 MB binaries (#8): project-wide Git LFS
    adoption decision, not a code change.
  - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing
    on the v1 path; wants a std::deque refactor, out of scope.
  - `encoder_ms` attribution surprising (#12): code is correct and
    matches sibling paths; the user explicitly opted against
    comment-only "clarifications".
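
The tag-first / shape-fallback policy described under src/parakeet_engine.cpp can be sketched in isolation as follows. This is a minimal illustration, not the engine code: `VariantInfo` and `is_v2_1_aosc` are hypothetical names standing in for the fields read from `ParakeetCtcModel` (`model_variant`, `encoder_cfg.n_layers`, `mel_cfg.n_mels`).

```cpp
#include <string>

// Hypothetical stand-in for the three values the engine consults.
struct VariantInfo {
    std::string model_variant;  // empty when the GGUF predates the key
    int         n_layers;
    int         n_mels;
};

// Tag-first, shape-fallback check: a non-empty tag is authoritative, and
// the encoder shape is consulted only when the tag is absent.
inline bool is_v2_1_aosc(const VariantInfo & m) {
    if (!m.model_variant.empty()) {
        return m.model_variant == "sortformer-streaming-v2.1-aosc";
    }
    return m.n_layers == 17 && m.n_mels == 128;
}
```

Under this policy a GGUF tagged `sortformer-streaming-v2` is kept off the AOSC path even if it reports the 17 / 128 shape, while an untagged GGUF with that shape is still accepted. That is the reason re-running the converter matters for any checkpoint that shares the v2.1 encoder shape.
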
---
 parakeet-cpp/CMakeLists.txt                   |  4 +-
 parakeet-cpp/include/parakeet/diarization.h   | 20 ++++--
 parakeet-cpp/scripts/convert-nemo-to-gguf.py  | 27 +++++++-
 parakeet-cpp/scripts/download-all-models.sh   | 11 ++-
 parakeet-cpp/src/parakeet_ctc.cpp             |  3 +
 parakeet-cpp/src/parakeet_ctc.h               |  7 ++
 parakeet-cpp/src/parakeet_engine.cpp          | 31 +++++----
 parakeet-cpp/src/parakeet_sortformer.cpp      | 40 +++++++++--
 .../test/test_sortformer_aosc_speakers.cpp    | 54 +--------------
 .../test/test_sortformer_streaming.cpp        | 50 ++------------
 parakeet-cpp/test/test_utils.h                | 69 +++++++++++++++++++
 11 files changed, 187 insertions(+), 129 deletions(-)
 create mode 100644 parakeet-cpp/test/test_utils.h

diff --git a/parakeet-cpp/CMakeLists.txt b/parakeet-cpp/CMakeLists.txt
index eac64cc6957..eecbae2c6f4 100644
--- a/parakeet-cpp/CMakeLists.txt
+++ b/parakeet-cpp/CMakeLists.txt
@@ -554,8 +554,8 @@ if (PARAKEET_BUILD_TESTS)
     parakeet_apply_ccache(test-sortformer-streaming)
     parakeet_register_test(test-sortformer-streaming
         LABEL "fixture"
-        ARGS "--model" "${_qvp_sfs_q8_gguf}" "--wav" "${_qvp_diar_wav}"
-        REQUIRES "${_qvp_sfs_q8_gguf}" "${_qvp_diar_wav}")
+        ARGS "--model" "${_qvp_sfsv21_q8_gguf}" "--wav" "${_qvp_diar_wav}"
+        REQUIRES "${_qvp_sfsv21_q8_gguf}" "${_qvp_diar_wav}")
 
     # v2.1 AOSC speaker-correctness regression. Asserts speaker coverage,
     # re-entry slot continuity (the AOSC contract), and frame-level DER
diff --git a/parakeet-cpp/include/parakeet/diarization.h b/parakeet-cpp/include/parakeet/diarization.h
index 6c0498919ac..9ea09b06ab9 100644
--- a/parakeet-cpp/include/parakeet/diarization.h
+++ b/parakeet-cpp/include/parakeet/diarization.h
@@ -75,12 +75,20 @@ struct SortformerStreamingOptions {
 
     // === AOSC (Audio-Online Speaker Cache, Sortformer v2.1) ===
     // Cache-aware streaming forward (port of NeMo's `forward_streaming_step` +
-    // `streaming_update` + `_compress_spkcache`). On v2.1 models (auto-detected
-    // from encoder shape) and spkcache_enable=true, the engine concatenates the
-    // speaker cache + FIFO + current chunk's pre-encode embeddings, runs the
-    // conformer layers over the concat, then the diariser head, before updating
-    // the runtime cache. This preserves speaker identity across silences far
-    // longer than `history_ms`. v1 models always take the legacy path.
+    // `streaming_update` + `_compress_spkcache`). On v2.1 models with
+    // spkcache_enable=true, the engine concatenates the speaker cache + FIFO +
+    // current chunk's pre-encode embeddings, runs the conformer layers over the
+    // concat, then the diariser head, before updating the runtime cache. This
+    // preserves speaker identity across silences far longer than `history_ms`.
+    // v1 and v2 models always take the legacy path.
+    //
+    // Variant detection: prefers the converter's `parakeet.model_variant` GGUF
+    // metadata tag (a stable per-checkpoint string, e.g.
+    // `sortformer-streaming-v2.1-aosc`) so a future variant that happens to
+    // share the v2.1 encoder shape can't silently opt into AOSC. GGUFs that
+    // pre-date the tag fall back to the encoder-shape heuristic: v1 has
+    // n_layers=18 / n_mels=80, v2.1 has n_layers=17 / n_mels=128. Re-run the
+    // converter after upgrading to populate the tag.
     //
     // `mean_sil_emb` is RUNTIME state (zeros at session start, EMA of detected
     // silence frames), NOT a learned tensor -- no converter changes required.
diff --git a/parakeet-cpp/scripts/convert-nemo-to-gguf.py b/parakeet-cpp/scripts/convert-nemo-to-gguf.py
index 34693a0b528..aed3a2314e1 100644
--- a/parakeet-cpp/scripts/convert-nemo-to-gguf.py
+++ b/parakeet-cpp/scripts/convert-nemo-to-gguf.py
@@ -199,7 +199,24 @@ def fuse_bn(weight, bias, running_mean, running_var, eps=1e-5):
     return scale.astype(np.float32), shift.astype(np.float32)
 
 
-def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str):
+def detect_sortformer_variant(ckpt: Path) -> str:
+    """
+    Map a NeMo Sortformer .nemo filename to a stable variant tag the C++
+    loader can match against. The tag is the only thing that distinguishes
+    cache-aware v2.1 from architecturally-identical v1 / v2 at GGUF time
+    (encoder shape alone is ambiguous against future variants).
+    """
+    stem = ckpt.stem
+    if "streaming_sortformer" in stem and "-v2.1" in stem:
+        return "sortformer-streaming-v2.1-aosc"
+    if "streaming_sortformer" in stem and "-v2" in stem:
+        return "sortformer-streaming-v2"
+    if "diar_sortformer" in stem and "-v1" in stem:
+        return "sortformer-v1"
+    return ""
+
+
+def write_gguf(out: Path, ckpt: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str):
     model_type = detect_model_type(cfg)
 
     enc = cfg["encoder"]
@@ -331,6 +348,12 @@ def write_gguf(out: Path, cfg: dict, sd: dict, tok_bytes: bytes, quant: str):
         writer.add_uint32("parakeet.sortformer.tf_n_heads", int(tfe["num_attention_heads"]))
         writer.add_bool  ("parakeet.sortformer.tf_pre_ln", bool(tfe.get("pre_ln", False)))
         writer.add_string("parakeet.sortformer.tf_hidden_act", str(tfe.get("hidden_act", "relu")))
+        # Variant tag (preferred over shape-based detection on the C++ side).
+        # Empty string = unknown checkpoint; loader falls back to encoder
+        # shape so older GGUFs continue to load.
+        variant = detect_sortformer_variant(ckpt)
+        if variant:
+            writer.add_string("parakeet.model_variant", variant)
     else:
         pred_hidden = int(dec["prednet"]["pred_hidden"])
         pred_rnn_layers = int(dec["prednet"]["pred_rnn_layers"])
@@ -610,7 +633,7 @@ def main():
     ckpt = ensure_ckpt(args.ckpt, args.hf_repo)
     cfg, sd, tok_bytes = load_nemo(ckpt)
     args.out.parent.mkdir(parents=True, exist_ok=True)
-    write_gguf(args.out, cfg, sd, tok_bytes, args.quant)
+    write_gguf(args.out, ckpt, cfg, sd, tok_bytes, args.quant)
 
 
 if __name__ == "__main__":
diff --git a/parakeet-cpp/scripts/download-all-models.sh b/parakeet-cpp/scripts/download-all-models.sh
index 4e2a434a7ae..5327e77e791 100644
--- a/parakeet-cpp/scripts/download-all-models.sh
+++ b/parakeet-cpp/scripts/download-all-models.sh
@@ -4,10 +4,10 @@
 # as `.nemo` archives, ready for `convert-nemo-to-gguf.py`.
 #
 # Idempotent: skips files that already exist on disk. Re-run any time to top up.
-# Total download budget on a clean machine: ~14 GiB at the time of writing
+# Total download budget on a clean machine: ~14.5 GiB at the time of writing
 # (TDT v3 + TDT 1.1b + CTC 0.6b + CTC 1.1b + TDT_CTC hybrid + EOU 120M +
-# Sortformer v1 + streaming Sortformer v2). Already-cached checkpoints are
-# untouched.
+# Sortformer v1 + streaming Sortformer v2 + streaming Sortformer v2.1).
+# Already-cached checkpoints are untouched.
 #
 # Usage:
 #   ./scripts/download-all-models.sh              # everything
@@ -99,6 +99,11 @@ if [[ "${1:-all}" != "tdt" ]]; then
     echo "== nemo: diar_streaming_sortformer_4spk-v2 (4-speaker, streaming-trained, ~470 MiB)"
     fetch "https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2/resolve/main/diar_streaming_sortformer_4spk-v2.nemo" \
           "$NEMO_DIR/diar_streaming_sortformer_4spk-v2.nemo"
+
+    hr
+    echo "== nemo: diar_streaming_sortformer_4spk-v2.1 (4-speaker, streaming + AOSC fine-tune, ~470 MiB)"
+    fetch "https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1/resolve/main/diar_streaming_sortformer_4spk-v2.1.nemo" \
+          "$NEMO_DIR/diar_streaming_sortformer_4spk-v2.1.nemo"
 fi
 
 hr
diff --git a/parakeet-cpp/src/parakeet_ctc.cpp b/parakeet-cpp/src/parakeet_ctc.cpp
index 62a95cf1c63..d8b6edb2d87 100644
--- a/parakeet-cpp/src/parakeet_ctc.cpp
+++ b/parakeet-cpp/src/parakeet_ctc.cpp
@@ -679,6 +679,9 @@ int load_from_gguf(const std::string & gguf_path,
     else if (mtype_str == "sortformer") out_model.model_type = ParakeetModelType::SORTFORMER;
     else out_model.model_type = ParakeetModelType::CTC;
 
+    // Optional variant tag (empty for legacy GGUFs that predate the key).
+    out_model.model_variant = get_str(g, "parakeet.model_variant", "");
+
     if (out_model.model_type == ParakeetModelType::TDT) {
         out_model.encoder_cfg.tdt_pred_hidden = get_u32(g, "parakeet.tdt.pred_hidden", 640);
         out_model.encoder_cfg.tdt_pred_rnn_layers = get_u32(g, "parakeet.tdt.pred_rnn_layers", 2);
diff --git a/parakeet-cpp/src/parakeet_ctc.h b/parakeet-cpp/src/parakeet_ctc.h
index 32fefe2947d..1187e75ce3f 100644
--- a/parakeet-cpp/src/parakeet_ctc.h
+++ b/parakeet-cpp/src/parakeet_ctc.h
@@ -259,6 +259,13 @@ struct SortformerWeights {
 
 struct ParakeetCtcModel {
     ParakeetModelType model_type = ParakeetModelType::CTC;
+    // Optional GGUF metadata tag (key `parakeet.model_variant`). Carries
+    // a stable identifier for the converted checkpoint that the engine
+    // can match against -- preferred over shape-based heuristics where
+    // two variants share the same encoder shape (e.g. sortformer-v2 vs
+    // sortformer-v2.1-aosc). Empty if the GGUF predates the key.
+    std::string model_variant;
+
     EncoderConfig encoder_cfg;
     MelConfig mel_cfg;
     BpeVocab vocab;
diff --git a/parakeet-cpp/src/parakeet_engine.cpp b/parakeet-cpp/src/parakeet_engine.cpp
index e78e19e333f..b789d62229c 100644
--- a/parakeet-cpp/src/parakeet_engine.cpp
+++ b/parakeet-cpp/src/parakeet_engine.cpp
@@ -1455,11 +1455,16 @@ void SortformerStreamSession::Impl::process_chunk(int64_t window_start_sample,
 
         // Remap cur_full into session-stable IDs and store as the new
         // baseline so the next chunk's `compute_slot_remap_` can match
-        // against today's emitted identity scheme.
-        for (auto & f : cur_full) {
-            f.speaker_id = remap_id(f.speaker_id);
+        // against today's emitted identity scheme. AOSC anchors slot
+        // identity through the speaker cache, so `compute_slot_remap_`
+        // is never consulted on that path -- skip the storage and the
+        // identity-remap loop entirely.
+        if (!cache_active) {
+            for (auto & f : cur_full) {
+                f.speaker_id = remap_id(f.speaker_id);
+            }
+            prev_chunk_full_segments = std::move(cur_full);
         }
-        prev_chunk_full_segments = std::move(cur_full);
 
         // VadStateChanged from speaker_probs: a frame speaks if any speaker exceeds threshold;
         // the chunk speaks if any emitting-frame qualifies; dominant speaker from mean probs.
@@ -1656,14 +1661,16 @@ std::unique_ptr Engine::diarize_start(
     impl->history_samples = opts.sample_rate * opts.history_ms / 1000;
     impl->ring.reserve(impl->history_samples);
 
-    // v2.1 detection (Audio-Online Speaker Cache eligibility).
-    // v1 sortformer-4spk-v1.q8_0: encoder.n_layers=18, preproc.n_mels=80.
-    // v2.1 sortformer-streaming-v2.1.q8_0: encoder.n_layers=17, preproc.n_mels=128.
-    // The v2.1 fine-tune is what trained the cache-aware concat-then-graph
-    // forward path; enabling it on v1 would just be untrained noise.
-    const bool model_is_v2_1 =
-        pimpl_->model.encoder_cfg.n_layers == 17 &&
-        pimpl_->model.mel_cfg.n_mels == 128;
+    // v2.1 detection (Audio-Online Speaker Cache eligibility). Documented
+    // in detail next to SortformerStreamingOptions::spkcache_enable in
+    // include/parakeet/diarization.h. Prefer the explicit variant tag
+    // emitted by the converter; fall back to encoder shape for legacy
+    // GGUFs that pre-date the parakeet.model_variant key.
+    const std::string & variant = pimpl_->model.model_variant;
+    const bool model_is_v2_1 = !variant.empty()
+        ? (variant == "sortformer-streaming-v2.1-aosc")
+        : (pimpl_->model.encoder_cfg.n_layers == 17 &&
+           pimpl_->model.mel_cfg.n_mels == 128);
     impl->cache_active = opts.spkcache_enable && model_is_v2_1;
 
     if (impl->cache_active) {
diff --git a/parakeet-cpp/src/parakeet_sortformer.cpp b/parakeet-cpp/src/parakeet_sortformer.cpp
index 49de08ee595..c99fe1dd0a3 100644
--- a/parakeet-cpp/src/parakeet_sortformer.cpp
+++ b/parakeet-cpp/src/parakeet_sortformer.cpp
@@ -29,6 +29,14 @@ namespace parakeet {
 
 namespace {
 
+// Score sentinels for the speaker-cache compression top-K. We use finite
+// extrema (well-defined under FE_DIVBYZERO trapping FP modes that some
+// host builds enable) instead of std::numeric_limits<float>::infinity()
+// purely so that subsequent arithmetic on these values cannot produce
+// NaNs -- they are only stored and compared with == / !=, never added.
+constexpr float k_score_neg_inf = std::numeric_limits<float>::lowest();
+constexpr float k_score_pos_inf = std::numeric_limits<float>::max();
+
 // Threshold speaker probabilities into time-sorted segments.
 void sf_threshold_segments(const std::vector<float> & speaker_probs,
                            int T_enc, int num_spks,
@@ -256,7 +264,7 @@ static void compute_log_pred_scores(const float * preds, int n_frames, int num_s
 static void disable_low_scores(std::vector<float> & scores, const float * preds,
                                int n_frames, int num_spks,
                                int min_pos_scores_per_spk) {
-    const float neg_inf = -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */;
+    const float neg_inf = k_score_neg_inf;
 
     // First pass: non-speech -> -inf.
     for (int t = 0; t < n_frames; ++t) {
@@ -313,7 +321,7 @@ static void boost_topk_scores(std::vector<float> & scores,
         for (int i = 0; i < k; ++i) {
             const int t = idx_buf[i];
             float & s = scores[(size_t) t * num_spks + spk];
-            if (s != -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */) {
+            if (s != k_score_neg_inf) {
                 s += boost;
             }
         }
@@ -343,6 +351,24 @@ static void compress_speaker_cache(
 
     const int A_sil = cfg.spkcache_sil_frames_per_spk;
     const int spkcache_len_per_spk = spkcache_len / num_spks - A_sil;
+    if (spkcache_len_per_spk <= 0) {
+        // Degenerate config: num_spks * A_sil >= spkcache_len leaves no
+        // budget for retained frames, so the boost / top-K stages would
+        // run with non-positive k and (for nth_element) a negative
+        // distance. Fall back to a silence-only cache and bail.
+        cache.spkcache.assign((size_t) spkcache_len * D, 0.0f);
+        if (cache.mean_sil_emb.size() == (size_t) D) {
+            for (int r = 0; r < spkcache_len; ++r) {
+                std::memcpy(cache.spkcache.data() + (size_t) r * D,
+                            cache.mean_sil_emb.data(),
+                            (size_t) D * sizeof(float));
+            }
+        }
+        cache.spkcache_preds.assign((size_t) spkcache_len * num_spks, 0.0f);
+        cache.n_rows = spkcache_len;
+        cache.spkcache_preds_valid = true;
+        return;
+    }
     const int strong_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.strong_boost_rate);
     const int weak_boost = (int) std::floor((float) spkcache_len_per_spk * cfg.weak_boost_rate);
     const int min_pos_per = (int) std::floor((float) spkcache_len_per_spk * cfg.min_pos_scores_rate);
@@ -360,7 +386,7 @@ static void compress_speaker_cache(
         for (int t = spkcache_len; t < n_frames; ++t) {
             float * s = scores.data() + (size_t) t * num_spks;
             for (int i = 0; i < num_spks; ++i) {
-                if (s[i] != -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */) {
+                if (s[i] != k_score_neg_inf) {
                     s[i] += cfg.scores_boost_latest;
                 }
             }
@@ -378,7 +404,7 @@ static void compress_speaker_cache(
     const int n_total = n_frames + A_sil;
     if (A_sil > 0) {
         scores.resize((size_t) n_total * num_spks);
-        const float pos_inf = 1.0e30f /* very-positive sentinel; +inf is UB with current FP flags */;
+        const float pos_inf = k_score_pos_inf;
         for (int t = n_frames; t < n_total; ++t) {
             float * s = scores.data() + (size_t) t * num_spks;
             for (int i = 0; i < num_spks; ++i) s[i] = pos_inf;
@@ -409,7 +435,7 @@ static void compress_speaker_cache(
     // speaker blocks contiguous; `torch.remainder(idx, n_frames)` returns the
     // frame index; our `idx % n_total` does the same.)
     for (int & idx : topk) {
-        if (flat_score(idx) == -1.0e30f /* very-negative sentinel; -inf is UB with current FP flags */) {
+        if (flat_score(idx) == k_score_neg_inf) {
             idx = MAX_INDEX;
         }
     }
@@ -467,7 +493,7 @@ static void compress_speaker_cache(
 // `lc` is the left-context offset within the chunk region; the committed-chunk
 // preds start at index `prev_spkcache_n + prev_fifo_n + lc` and span `chunk_committed`.
 static void streaming_update(SortformerSpeakerCache & cache,
-                             const float * chunk_pre_encode_lc, int chunk_committed,
+                             const float * committed_chunk_pre_encode, int chunk_committed,
                              const float * preds_full,
                              int prev_spkcache_len_at_call, int prev_fifo_len_at_call,
                              int lc,
@@ -492,7 +518,7 @@ static void streaming_update(SortformerSpeakerCache & cache,
     const int new_fifo_after_append = cache.n_fifo + chunk_committed;
     cache.fifo.resize((size_t) new_fifo_after_append * D);
     std::memcpy(cache.fifo.data() + (size_t) cache.n_fifo * D,
-                chunk_pre_encode_lc,
+                committed_chunk_pre_encode,
                 (size_t) chunk_committed * D * sizeof(float));
     cache.fifo_preds.resize((size_t) new_fifo_after_append * num_spks);
     std::memcpy(cache.fifo_preds.data() + (size_t) cache.n_fifo * num_spks,
diff --git a/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp b/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp
index dc37faa883f..c4a27bb0306 100644
--- a/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp
+++ b/parakeet-cpp/test/test_sortformer_aosc_speakers.cpp
@@ -47,6 +47,7 @@
 // ctest fixtures behave when their fixtures aren't on disk.
 
 #include "parakeet/engine.h"
+#include "test_utils.h"
 
 #include
 #include
@@ -64,57 +65,8 @@ namespace {
 
 constexpr double FRAME_S = 0.01; // 10 ms grid
 
-bool file_exists(const std::string & p) {
-    std::ifstream f(p, std::ios::binary);
-    return f.good();
-}
-
-// Pulled verbatim from test_sortformer_streaming.cpp (line 37-76 of that
-// file). parakeet-cpp has no shared test-util header today, so the
-// helper is duplicated here on purpose; it matches how the existing
-// streaming/parity tests are organised.
-bool load_wav_pcm16le_mono(const std::string & path,
-                           std::vector<float> & samples,
-                           int & sample_rate) {
-    std::ifstream f(path, std::ios::binary);
-    if (!f) return false;
-    char riff[4]; f.read(riff, 4);
-    if (std::memcmp(riff, "RIFF", 4) != 0) return false;
-    f.ignore(4);
-    char wave[4]; f.read(wave, 4);
-    if (std::memcmp(wave, "WAVE", 4) != 0) return false;
-
-    bool fmt_ok = false; uint16_t channels = 0; uint16_t bits = 0; uint32_t srate = 0;
-    std::vector<char> data;
-    while (f) {
-        char id[4]; f.read(id, 4);
-        if (!f) break;
-        uint32_t sz = 0; f.read((char *) &sz, 4);
-        if (std::memcmp(id, "fmt ", 4) == 0) {
-            std::vector<char> hdr(sz);
-            f.read(hdr.data(), sz);
-            uint16_t fmt = *(uint16_t *) hdr.data();
-            channels = *(uint16_t *) (hdr.data() + 2);
-            srate = *(uint32_t *) (hdr.data() + 4);
-            bits = *(uint16_t *) (hdr.data() + 14);
-            if (fmt != 1 || channels != 1 || bits != 16) return false;
-            fmt_ok = true;
-        } else if (std::memcmp(id, "data", 4) == 0) {
-            data.resize(sz);
-            f.read(data.data(), sz);
-            break;
-        } else {
-            f.ignore(sz);
-        }
-    }
-    if (!fmt_ok || data.empty()) return false;
-    sample_rate = (int) srate;
-    const int n = (int) (data.size() / 2);
-    samples.resize(n);
-    const int16_t * s16 = reinterpret_cast<const int16_t *>(data.data());
-    for (int i = 0; i < n; ++i) samples[i] = (float) s16[i] / 32768.0f;
-    return true;
-}
+using parakeet_test::file_exists;
+using parakeet_test::load_wav_pcm16le_mono;
 
 struct RttmSeg {
     double start_s;
diff --git a/parakeet-cpp/test/test_sortformer_streaming.cpp b/parakeet-cpp/test/test_sortformer_streaming.cpp
index 4fd60a65a20..cbc069cd63b 100644
--- a/parakeet-cpp/test/test_sortformer_streaming.cpp
+++ b/parakeet-cpp/test/test_sortformer_streaming.cpp
@@ -16,6 +16,7 @@
 // ingest the file directly against the matching reference RTTM.
 
 #include "parakeet/engine.h"
+#include "test_utils.h"
 
 #include
 #include
@@ -29,51 +30,8 @@
 
 namespace {
 
-bool file_exists(const std::string & p) {
-    std::ifstream f(p, std::ios::binary);
-    return f.good();
-}
-
-bool load_wav_pcm16le_mono(const std::string & path, std::vector<float> & samples, int & sample_rate) {
-    std::ifstream f(path, std::ios::binary);
-    if (!f) return false;
-    char riff[4]; f.read(riff, 4);
-    if (std::memcmp(riff, "RIFF", 4) != 0) return false;
-    f.ignore(4);
-    char wave[4]; f.read(wave, 4);
-    if (std::memcmp(wave, "WAVE", 4) != 0) return false;
-
-    bool fmt_ok = false; uint16_t channels = 0; uint16_t bits = 0; uint32_t srate = 0;
-    std::vector<char> data;
-    while (f) {
-        char id[4]; f.read(id, 4);
-        if (!f) break;
-        uint32_t sz = 0; f.read((char *) &sz, 4);
-        if (std::memcmp(id, "fmt ", 4) == 0) {
-            std::vector<char> hdr(sz);
-            f.read(hdr.data(), sz);
-            uint16_t fmt = *(uint16_t *) hdr.data();
-            channels = *(uint16_t *) (hdr.data() + 2);
-            srate = *(uint32_t *) (hdr.data() + 4);
-            bits = *(uint16_t *) (hdr.data() + 14);
-            if (fmt != 1 || channels != 1 || bits != 16) return false;
-            fmt_ok = true;
-        } else if (std::memcmp(id, "data", 4) == 0) {
-            data.resize(sz);
-            f.read(data.data(), sz);
-            break;
-        } else {
-            f.ignore(sz);
-        }
-    }
-    if (!fmt_ok || data.empty()) return false;
-    sample_rate = (int) srate;
-    const int n = (int) (data.size() / 2);
-    samples.resize(n);
-    const int16_t * s16 = reinterpret_cast<const int16_t *>(data.data());
-    for (int i = 0; i < n; ++i) samples[i] = (float) s16[i] / 32768.0f;
-    return true;
-}
+using parakeet_test::file_exists;
+using parakeet_test::load_wav_pcm16le_mono;
 
 using namespace parakeet;
 
@@ -290,7 +248,7 @@ int run_basic(const std::string & gguf_path,
 }
 
 int main(int argc, char ** argv) {
-    std::string gguf = "models/sortformer-4spk-v1.f16.gguf";
+    std::string gguf = "models/diar_streaming_sortformer_4spk-v2.1.q8_0.gguf";
     std::string wav = "test/samples/diarization-sample-16k.wav";
     int history_ms = 30000;
     int chunk_ms = 2000;
diff --git a/parakeet-cpp/test/test_utils.h b/parakeet-cpp/test/test_utils.h
new file mode 100644
index 00000000000..f819e192641
--- /dev/null
+++ b/parakeet-cpp/test/test_utils.h
@@ -0,0 +1,69 @@
+// Tiny shared helpers for the C++ test binaries. Kept dependency-light
+// (just the standard headers below) so any test can include this without
+// pulling in the public Engine surface or any project-internal types.
+//
+// History: previously these helpers lived inline in
+// test_sortformer_streaming.cpp and test_sortformer_aosc_speakers.cpp.
+// Pulling them up here avoids drift between two near-identical copies.
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+#include <fstream>
+#include <string>
+#include <vector>
+
+namespace parakeet_test {
+
+inline bool file_exists(const std::string & p) {
+    std::ifstream f(p, std::ios::binary);
+    return f.good();
+}
+
+// Load a 16 kHz / mono / s16le RIFF/WAVE file into [-1, 1) float samples.
+// Returns false on any header mismatch (non-PCM, non-mono, non-16bit) or
+// missing chunk; on success writes the sample rate via `sample_rate`.
+inline bool load_wav_pcm16le_mono(const std::string & path,
+                                  std::vector<float> & samples,
+                                  int & sample_rate) {
+    std::ifstream f(path, std::ios::binary);
+    if (!f) return false;
+    char riff[4]; f.read(riff, 4);
+    if (std::memcmp(riff, "RIFF", 4) != 0) return false;
+    f.ignore(4);
+    char wave[4]; f.read(wave, 4);
+    if (std::memcmp(wave, "WAVE", 4) != 0) return false;
+
+    bool fmt_ok = false; uint16_t channels = 0; uint16_t bits = 0; uint32_t srate = 0;
+    std::vector<char> data;
+    while (f) {
+        char id[4]; f.read(id, 4);
+        if (!f) break;
+        uint32_t sz = 0; f.read((char *) &sz, 4);
+        if (std::memcmp(id, "fmt ", 4) == 0) {
+            std::vector<char> hdr(sz);
+            f.read(hdr.data(), sz);
+            uint16_t fmt = *(uint16_t *) hdr.data();
+            channels = *(uint16_t *) (hdr.data() + 2);
+            srate = *(uint32_t *) (hdr.data() + 4);
+            bits = *(uint16_t *) (hdr.data() + 14);
+            if (fmt != 1 || channels != 1 || bits != 16) return false;
+            fmt_ok = true;
+        } else if (std::memcmp(id, "data", 4) == 0) {
+            data.resize(sz);
+            f.read(data.data(), sz);
+            break;
+        } else {
+            f.ignore(sz);
+        }
+    }
+    if (!fmt_ok || data.empty()) return false;
+    sample_rate = (int) srate;
+    const int n = (int) (data.size() / 2);
+    samples.resize(n);
+    const int16_t * s16 = reinterpret_cast<const int16_t *>(data.data());
+    for (int i = 0; i < n; ++i) samples[i] = (float) s16[i] / 32768.0f;
+    return true;
+}
+
+} // namespace parakeet_test