chore: rebase fork to whisper.cpp v1.8.4 by sharmaraju352 · Pull Request #8 · tetherto/qvac-ext-lib-whisper.cpp

sharmaraju352 · 2026-03-30T05:21:25Z

Summary

Rebases our fork (qvac-ext-lib-whisper.cpp) from whisper.cpp v1.7.5 to v1.8.4 (upstream ggml-org/whisper.cpp)
Cherry-picks all 4 fork-specific commits onto the new v1.8.4 base:
- 80f95ec0 — Add seed parameter for reproducible sampling (whisper.h + whisper.cpp)
- e4215731 — Add CODEOWNERS file
- 962c3805 — Added approval check worker
- 06c478e7 — DEVOPS-916: Add ai-runtime-merge to CODEOWNERS
Tagged as v1.8.4.1 (already consumed by qvac-registry-vcpkg port)

Key upstream improvements (v1.7.6 → v1.8.4)

Flash Attention enabled by default (significant speed improvement on GPU)
ggml backend optimizations (Metal shaders, CPU norm scalar fix, threading)
VAD memory leak fixes
UTF-8 segment wrapping fix

Benchmark results (darwin-arm64, Metal)

Test case	v1.7.5.1	v1.8.4.1	Improvement
tiny / 20s audio	193ms	181ms	6% faster
tiny / 18s audio	200ms	145ms	27% faster
tiny / 328s audio	3876ms	3102ms	20% faster
small / 20s audio	502ms	405ms	19% faster

No accuracy regression — identical transcription output on all test files.
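
The "Improvement" column can be recomputed from the two timing columns. A minimal sketch, assuming the percentage is the relative reduction in wall time, (old - new) / old; the helper name `percent_faster` is illustrative and not part of any benchmark tooling used for this PR:

```python
# Timings copied from the benchmark table: (v1.7.5.1 ms, v1.8.4.1 ms).
TIMINGS_MS = {
    "tiny / 20s audio": (193, 181),
    "tiny / 18s audio": (200, 145),
    "tiny / 328s audio": (3876, 3102),
    "small / 20s audio": (502, 405),
}


def percent_faster(old_ms: float, new_ms: float) -> float:
    """Relative reduction in wall time, as a percentage of the old timing."""
    return (old_ms - new_ms) / old_ms * 100.0


if __name__ == "__main__":
    for case, (old_ms, new_ms) in TIMINGS_MS.items():
        print(f"{case}: {percent_faster(old_ms, new_ms):.1f}% faster")
```

The table's whole-number percentages agree with these values to within a percentage point.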

Related PRs

Registry: chore: update whisper-cpp port to v1.8.4.1 qvac-registry-vcpkg#108 (merged)
Monorepo: chore: upgrade whisper.cpp backend from v1.7.5.1 to v1.8.4 qvac#1141

Note

This PR replaces master history with a rebase onto upstream v1.8.4. It is not a standard merge — the branch has diverged from master (1348 ahead / 8 behind). The 8 "behind" commits are the old fork-specific commits that have been cherry-picked onto the new base.
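
The ahead/behind counts quoted in the note can be reproduced with `git rev-list --left-right --count`, which prints two whitespace-separated numbers for a symmetric-difference range. A small sketch, assuming the range is written `master...<branch>` so that the left count is "behind" and the right count is "ahead"; the helper names are hypothetical and the branch name is whatever this PR's head branch is called:

```python
import subprocess


def parse_left_right(output: str) -> tuple[int, int]:
    """Parse the '<left>\t<right>' output of `git rev-list --left-right --count`."""
    left, right = output.split()
    return int(left), int(right)


def ahead_behind(base: str, branch: str, repo: str = ".") -> tuple[int, int]:
    """Return (ahead, behind) of `branch` relative to `base`.

    Left side of `base...branch` counts commits only on `base` (behind);
    right side counts commits only on `branch` (ahead).
    """
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--left-right", "--count",
         f"{base}...{branch}"],
        check=True,
        capture_output=True,
        text=True,
    )
    behind, ahead = parse_left_right(result.stdout)
    return ahead, behind
```

For the state described in the note, the raw git output would be `8` and `1348`, which `ahead_behind` reports as `(1348, 8)`.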

Deduplication here relied on Vulkan returning a unique UUID for each physical GPU. At the moment that is not always the case. On a Mac Pro 2019 running macOS with 2 Vega II Duo cards (so, 4 GPUs in total), MoltenVK assigns the same UUID to pairs of GPUs unless they are connected with Infinity Fabric. See KhronosGroup/MoltenVK#2683 for details. The right fix belongs in MoltenVK, but until it lands, llama.cpp would recognize only 2 of the 4 GPUs in such a configuration. The deduplication logic is therefore changed to filter GPUs only if the UUID is the same but the driver is different.
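
The revised rule can be sketched as follows. This is an illustration of the behaviour described in the commit message, not the ggml-vulkan code: the `Gpu` record, its field names, and the keep-the-first-seen tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    name: str
    uuid: str
    driver: str


def dedup_gpus(gpus: list[Gpu]) -> list[Gpu]:
    """Drop a GPU only if an already-kept GPU has the same UUID but a different driver.

    Two entries with the same UUID *and* the same driver are both kept, which
    covers the case where one driver reports a shared UUID for distinct GPUs.
    """
    kept: list[Gpu] = []
    for gpu in gpus:
        is_duplicate = any(
            k.uuid == gpu.uuid and k.driver != gpu.driver for k in kept
        )
        if not is_duplicate:
            kept.append(gpu)
    return kept
```

With four devices that share two UUIDs under a single driver, all four survive; the same UUID exposed through two different drivers collapses to one entry.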

The syclcompat/math.hpp header is no longer used. The change that introduced it was successfully reverted (ggml-org/llama.cpp#17826). This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.

* vulkan: use coopmat for flash attention p*v matrix multiplication * fix P loading issue * fix barrier position * remove reduction that is no longer needed * move max thread reduction into loop * remove osh padding * add bounds checks and padding * remove unused code * fix shmem sizes, loop duration and accesses * don't overwrite Qf, add new shared psh buffer instead * add missing bounds checks * use subgroup reductions * optimize * move bounds check, reduce barriers * support other Bc values and other subgroup sizes * remove D_split * replace Of register array with shared memory Ofsh array * parallelize HSV across the rowgroups * go back to Of in registers, not shmem * vectorize sfsh * don't store entire K tile in shmem * fixes * load large k tiles to shmem on Nvidia * adapt shared memory host check function to shader changes * remove Bc 32 case * remove unused variable * fix missing mask reduction tmspsh barrier * fix mask bounds check * fix rowmax f16 under/overflow to inf * fix flash_attn_cm2 BLOCK_SIZE preprocessor directives

* refactor mmf rows_per_block * speed up compile * pass cdna compile * fix cuda error * clean up mmf * f32 mmf * clean float mma * fix mmf error * faster mmf * extend tile k * fix compile error * Revert "extend tile k" This reverts commit 4d2ef3d483932659801a59a5af0b6b48f6ffd5c7. * fix smem overflow * speed up compiling mmf * speed up compile for hip * 512 block for cdna * config pad size * fix as comment * update select logic * move some code to cuh * fix as comment * correct cdna3 config --------- Co-authored-by: zhang hui <you@example.com>

…/19150) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforcement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: further improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a PowerShell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

…g (llama/19151) * webgpu : pipeline flash_attn Q/K loads in WGSL * ggml-webgpu: unroll Q*K accumulation inner loop * ggml-webgpu: vectorization * ggml-webgpu: unrolling * ggml-webgpu: remove redundant unrolling * ggml-webgpu: restore the config * ggml-webgpu: remove redundant comments * ggml-webgpu: formatting * ggml-webgpu: formatting and remove vectorization * ggml-webgpu: remove unnecessary constants * ggml-webgpu: change QKV buffer to read_write to pass validation * ggml-webgpu: add explanation for the additional bracket around Q K accumulate * Indentation and for -> if for tail * Kick off CI on wgsl only commits --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>

…ggml-org#3633) * ruby : Bump version to 1.3.6 * Fix code in example * Add sample code to transcribe from MemoryView * Define GetVADContext macro * Use GetVADContext * Extract parse_full_args function * Use parse_full_args in ruby_whisper_full_parallel * Free samples after use * Check return value of parse_full_args() * Define GetVADParams macro * Add VAD::Context#segments_from_samples * Add tests for VAD::Context#segments_from_samples * Add signature for VAD::Context#segments_from_samples * Add sample code for VAD::Context#segments_from_samples * Add test for Whisper::Context#transcribe with Pathname * Make Whisper::Context#transcribe and Whisper::VAD::Context#detect accept Pathname * Update signature of Whisper::Context#transcribe * Fix variable name * Don't free memory view * Make parse_full_args return struct * Fallback when failed to get MemoryView * Add num of samples when too long * Check members of MemoryView * Fix a typo * Remove unnecessary include * Fix a typo * Fix a typo * Care the case of MemoryView doesn't fit spec * Add TODO comment * Add optimization option to compiler flags * Use ALLOC_N instead of malloc * Add description to sample code * Rename and change args: parse_full_args -> parse_samples * Free samples when exception raised * Assign type check result to a variable * Define wrapper function of whisper_full * Change signature of parse_samples for rb_ensure * Ensure release MemoryView * Extract fill_samples function * Free samples memory when filling it failed * Free samples memory when transcription failed * Prepare transcription in wrapper function * Change function name * Simplify function boundary

…gml-org#3647) * Don't convert to temporary VALUE * Define Whisper::Context::Params * Add test for Whisper::Context::Params * Implement Whisper::Context::Params * Add tests for Context::Params * Fix Whisper::Token memory management * Add test for token_timestamps * Make Context accept Context::Params * Make Context::Params.new accept keyword args * Add test for Context::Params.new with keyword args * Add signature of Context::Params * Add example for Whisper::Token * Fix typos * Revert "Don't convert to temporary VALUE" This reverts commit dee66e7. * Hold Token#text as Ruby object * Don't use pointer for ruby_whisper_context_params.params * Use RUBY_DEFAULT_FREE instead of custom function * Update bindings/ruby/README.md Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> * Add document for Whisper::Context::Params --------- Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* Add Q8_0 OpenCL kernel Co-authored-by: yunjie <yunjie@qti.qualcomm.com> * opencl: fix build for non-adreno * opencl: refactor q8_0 * opencl: enforce subgroup size of 64 for adreno for q8_0 * For A750 and older generations, subgroup size can be 64 or 128. This kernel assumes subgroup size 64. * opencl: suppress warning when adreno kernels are disabled --------- Co-authored-by: yunjie <yunjie@qti.qualcomm.com> Co-authored-by: Li He <lih@qti.qualcomm.com>

* wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * wip * ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation * ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations * wip * ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance * ggml-hexagon: refactor dot product functions to use a common loading function for improved readability * optimize vector dot product functions to use unified reduction for improved performance * hexagon: optimize reduce-sum for v75+ * hexagon: always keep row_sums in sf/fp32 * ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT * fix compiling error after rebase --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

sharmaraju352 · 2026-03-30T07:50:19Z

/review

jpgaribotti · 2026-03-30T08:37:36Z

/review

Squash-rebase of feat/metal-optimization-supertonic onto master post-#16 (OpenCL Supertonic merge). Combines: - Five custom fused Metal kernels (supertonic_depthwise_1d / layer_norm_channel / pw2_residual / bias_gelu / edge_pad_1d) with `_ct` and `_causal_ct` variants for [C, T] activation layout. Patches live upstream in qvac-ext-ggml@speech (PR #8, merged); our overlay-port redirects vcpkg to that branch. - Full Phase B2: every ConvNeXt block in vector_estimator (16 blocks) and vocoder (10 blocks) runs end-to-end on [C, T] activations. K=1 pointwise becomes direct ggml_mul_mat (no im2col). Single entry/exit permute spans each chain. - Phase B1: end-to-end f16 via asymmetric load (`:onnx::MatMul_*` stays f16 on Metal, expands to f32 elsewhere). - Phase A1+A2: 5-step CFM unrolled into one ggml_cgraph; latent stays in GPU memory step-to-step. - Phase A3: q8_0 storage on Metal, kernel_mul_mm_q8_0_f32 dispatches. - Tier 2 load-time matmul weight pretranspose. - Causal-pad mode in depthwise_1d_ct + K=7 support for vocoder. Coexists with master's OpenCL Supertonic work: - `supertonic_op_dispatch_scope` (master) toggles CPU custom_4d fast paths via thread-local; replaces our `use_cpu_fastpath` parameter plumbing. - F1/F2/F6/F13/F14/F16 audit optimisations from master preserved. - F7 vocoder convnext-block fusion (master) runs on the CPU path; Metal path runs our `_ct` chain. Bench on Apple M2 (5 runs, --steps 5, en/M1, "fox" prompt), post-rebase: Metal med 98.4 ms vec_est 65.6 vocoder 13.1 RTM 32.6x CPU (unchanged from master) ONNX CPU (unchanged from master) Net floor moved from ~88 ms (pre-rebase peak) to ~98 ms (post-rebase), ~10 ms slip absorbed where master's front_cache refactor replaced parts of our trace_proj step-builder per the agent's resolution rule "prefer master's cache pattern when refactored." Causal kernel intact; vocoder at 13.1 ms vs master's CPU 39.4 ms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ay-port Replaces the local vcpkg overlay-port machinery with a simpler bundled- ggml dev flow that clones tetherto/qvac-ext-ggml@speech directly into `tts-cpp/ggml/` and lets CMake's `add_subdirectory(ggml)` consume it. What's in / what's out: + tts-cpp/scripts/setup-ggml.sh — clones qvac-ext-ggml@speech at the pinned commit (currently 60a172e48f, the merge of #8) into tts-cpp/ggml/. Idempotent; re-run to bump the pin via the script's GGML_REF variable. + tts-cpp/CMakeLists.txt — bundled path (`TTS_CPP_USE_SYSTEM_GGML=OFF`) no longer requires a `patches/` directory. Speech branch is pre-patched at the commit level, so `add_subdirectory(ggml)` consumes the source directly. - tts-cpp/cmake/vcpkg-overlay-ports/ggml/ (all 4 files) - tts-cpp/vcpkg-configuration.json - tts-cpp/vcpkg.json Net diff: −250 lines of bridge plumbing, +50 lines of clone-and-build script. The vcpkg overlay was always a stopgap until the registry pin advanced past 60a172e (see qvac-registry-vcpkg#144); switching to the bundled flow side-steps that wait entirely for dev builds. Performance bonus: bundled `add_subdirectory(ggml)` defaults to GGML_NATIVE=ON (native ARM dotprod / SVE / wider SIMD on M-series), where the vcpkg port had GGML_NATIVE=OFF for portable redistributables. On Apple M2, the dev flow benches ~9 ms faster total median and ~30 ms tighter variance — back within 3 ms of the pre-rebase 88 ms peak: vcpkg-overlay (rebased): total med 100.48 range 96-125 ms 31.9x bundled-ggml (this): total med 91.15 range 88-92 ms 35.2x ^ +3.3x Downstream production builds still go through vcpkg via `TTS_CPP_USE_SYSTEM_GGML=ON` and find_package(ggml) — those pull from the `ggml` port in qvac-registry-vcpkg (which qvac-registry-vcpkg#144 bumps to the same speech commit). README §1 updated with the new dev flow as the canonical recipe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves the review comments on the merged AOSC v2.1 PR (#22, merge commit e6ba38c). All eight changes are minimal and behaviour-preserving except the v2.1 detection upgrade (now strict-tag with shape fallback) and the degenerate-config guard (silence-only fallback instead of UB-adjacent boost arithmetic). Reviewer comments classified as "perf only / out of scope / would only add a TODO" are intentionally not addressed in this commit -- see the plan file referenced in the PR description. src/parakeet_sortformer.cpp -- `compress_speaker_cache` - Early-return when `spkcache_len_per_spk <= 0` (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K stages are mostly defended (`boost_topk_scores` already returns early on non-positive k), but the function was otherwise running a no-op pass that produced an all-silence cache via the slow path. Fall back to an explicit silence-only profile and bail. - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to `committed_chunk_pre_encode`. The call site already advances past the left context (`chunk_pre_committed = ... + lc * D`), so the old `_lc` suffix was misleading. `int lc` stays -- it's used inside the function to index into `preds_full`, which still contains the left-context preds. - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites) with named constants `k_score_neg_inf` / `k_score_pos_inf` backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped the inline "-inf is UB with current FP flags" comments: IEEE 754 +/-inf is well-defined; the original concern (avoiding NaN-on-arithmetic) still holds because we only store and compare the sentinels. src/parakeet_engine.cpp - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop and the `prev_chunk_full_segments = std::move(cur_full)` store: `compute_slot_remap_` is never consulted when `cache_active` is true (AOSC anchors slot identity through the speaker cache), so the work was dead. 
- Switched v2.1 detection from pure-shape to "prefer the converter's `parakeet.model_variant` GGUF tag; fall back to `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This prevents a future v2.2/v3 variant that happens to share v2.1's encoder shape from silently opting into AOSC. include/parakeet/diarization.h - Moved the v1-vs-v2.1 detection rationale comment out of parakeet_engine.cpp and into the `SortformerStreamingOptions:: spkcache_enable` block, with a paragraph on the tag-first / shape-fallback policy. src/parakeet_ctc.{h,cpp} - Added `std::string ParakeetCtcModel::model_variant` (optional GGUF metadata mirror; empty on legacy GGUFs). - Loader reads `parakeet.model_variant` next to the existing `parakeet.model.type` read; absent key -> empty string -> detection falls back to shape. scripts/convert-nemo-to-gguf.py - New `detect_sortformer_variant(ckpt: Path)` derives a stable variant tag from the source .nemo filename (`sortformer-v1` / `sortformer-streaming-v2` / `sortformer-streaming-v2.1-aosc`); empty string for unknown checkpoints. - Sortformer branch of `write_gguf` writes `parakeet.model_variant` when the tag is non-empty. - `write_gguf` signature extended with `ckpt: Path`; only the one internal call site adjusted. scripts/download-all-models.sh - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the AOSC fine-tune that this PR's tests target); bumped the budget comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the contents line. CMakeLists.txt + test/test_sortformer_streaming.cpp - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default GGUF path is the matching v2.1 q8_0. Aligns the test with the line-299 comment that says the binary "reflects the production v2.1 AOSC config out of the box". 
test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists` duplicates into a shared inline header in the `parakeet_test` namespace. The duplicate copies and the "duplicated here on purpose" comment block in test_sortformer_aosc_speakers.cpp are gone; both tests `#include "test_utils.h"` and use `using parakeet_test::...`. Build + ctest verification - `cmake --build build -j` clean (no new warnings). - `ctest -R 'test-sortformer-(streaming |aosc-speakers)'`: test-sortformer-streaming ........ Passed 8.23 s test-sortformer-aosc-speakers-abcba . Passed 33.80 s test-sortformer-aosc-speakers-abcdba Passed 36.91 s The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant` key, so the AOSC tests passing here also verifies the shape-fallback path. Re-running the converter on the v2.1 .nemo will populate the new key for the strict-tag path. Reviewer comments deferred / skipped (rationale): - Encoder graph cache thrashing during FIFO ramp-up (#4): perf only; proper fix wants pre-build-at-diarize_start + silence padding or a mask argument, not minimal. Tracked for a follow-up perf PR. - WAV fixtures committed as ~11 MB binaries (#8): project-wide Git LFS adoption decision, not a code change. - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing on the v1 path; wants a std::deque refactor, out of scope. - `encoder_ms` attribution surprising (#12): code is correct and matches sibling paths; the user explicitly opted against comment-only "clarifications".
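
The tag-first, shape-fallback policy for v2.1 detection described in this commit message can be sketched like this. It is a hedged Python illustration only: the real check lives in the C++ engine, the function name is hypothetical, and treating `sortformer-streaming-v2.1-aosc` as the only tag that opts into AOSC is an assumption drawn from the tag list above.

```python
# Assumed v2.1 tag, taken from the converter's tag list in the commit message.
V21_TAG = "sortformer-streaming-v2.1-aosc"


def is_v21_aosc(model_variant: str, n_layers: int, n_mels: int) -> bool:
    """Decide whether a loaded model should take the AOSC (v2.1) path.

    A non-empty `parakeet.model_variant` tag is authoritative; the encoder
    shape is consulted only for legacy GGUFs that carry no tag.
    """
    if model_variant:
        return model_variant == V21_TAG
    # Legacy GGUF without the tag: fall back to the v2.1 encoder shape.
    return n_layers == 17 and n_mels == 128
```

Under this sketch a tagged v2 model with the v2.1 encoder shape does not opt into AOSC, while an untagged legacy file with that shape still does.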

kpouget and others added 30 commits January 30, 2026 15:56

ggml: new backend for Virglrenderer API Remoting acceleration (v2) (llama/18718)

531d7b6

sycl: fix norm kernels: l2_norm, group_norm, rms_norm by remove assert to support more cases (llama/19154)

f0e85bb

Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (llama/19126)

62ba8b5

ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (llama/19159)

e0a2182

cuda : fix nkvo, offload and cuda graph node properties matching (llama/19165)

b997e69

* cuda : fix nkvo * cont : more robust cuda graph node property matching * cont : restore pre-leafs implementation * cont : comments + static_assert

sycl: implement GGML_OP_TRI (llama/19089)

1b3c27e

* sycl: implement GGML_OP_TRI * docs: update ops.md for SYCL TRI * docs: regenerate ops.md * docs: update SYCL support for GGML_OP_TRI

sycl: implement GGML_UNARY_OP_SOFTPLUS (llama/19114)

2a16e7a

* sycl: add softplus unary op implementation
* sycl: add softplus unary op implementation
* docs(ops): mark SYCL SOFTPLUS as supported
* docs: update SYCL status for SOFTPLUS

add tensor type checking as part of cuda graph properties (llama/19186)

5dca0db

sync : ggml

b529c06

talk-llama : sync llama.cpp

953e503

cuda : fix compile warnings (#0)

acbace0

scripts : Fix dSYMs path case for macOS xcframework build (ggml-org#3630)

bf422cb

The script creates dSYMs/ but references dSYMS/ for macOS, causing build failures on case-sensitive filesystems.

cmake : remove unused file (ggml/1419)

fc1a3e5

ggml : bump version to 0.9.6 (ggml/1423)

06e3750

Correctly fetch q8_1 quantize pipeline in test as needed by 8a3519b (llama/19194)

efd6344

Bump cmake max version (needed for Windows on Snapdragon builds) (llama/19188)

aca5953

* Bump max cmake version (needed for Windows on Snapdragon builds)
* cmake: move max version setting into ggml/CMakeLists

Remove pipeline cache mutexes (llama/19195)

a0256b8

* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
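
The reasoning behind dropping the mutexes can be illustrated with a minimal sketch. It assumes a cache keyed by pipeline name and uses `thread_local` storage; the `pipeline` struct and `get_pipeline` function are hypothetical names for illustration, and the actual backend may organise its per-thread caches differently.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical pipeline handle; a real backend would hold a GPU object here.
struct pipeline {
    int id;
};

// Each thread owns its own cache instance, so lookups and inserts need no
// mutex: no other thread can ever observe or modify this map.
pipeline & get_pipeline(const std::string & key) {
    thread_local std::unordered_map<std::string, pipeline> cache;
    thread_local int next_id = 0;

    auto it = cache.find(key);
    if (it == cache.end()) {
        // First request on this thread: create and remember the pipeline.
        it = cache.emplace(key, pipeline{next_id++}).first;
    }
    return it->second;
}
```

The trade-off in a design like this is that every thread builds its own copy of each pipeline on first use, in exchange for lock-free access afterwards.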

docs : Minor cleanups (llama/19252)

0e219eb

* Update old URLs to github.com/ggml-org/
* Bump copyrights

ggml-backend: fix async set/get fallback sync (llama/19179)

625c8d8

metal : support virtual devices (llama/18919)

73e0455

* metal : support virtual devices
* cont : manage buffer type context memory
* metal : add events
* cont : implement cpy_tensor_async

ggerganov and others added 10 commits March 18, 2026 15:18

ggml : restore ggml_type_sizef() to aboid major version bump (ggml/1441)

945d315

ggml : bump version to 0.9.8 (ggml/1442)

b2be162

sync : ggml

f5b477a

benches : update

4bbce1e

ci : update workflows

ef3463b

release : v1.8.4

9386f23

Add seed parameter for reproducible sampling

7ce31d4

- Add seed field to whisper_full_params structure
- Default seed value is 0 (maintains backward compatibility)
- Each decoder uses seed + decoder_index for unique seeds
- Enables reproducible results when temperature > 0
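
The seeding scheme described above can be sketched as follows. This is a standalone illustration, not the fork's code: `sampling_params`, `n_decoders`, `make_decoder_rngs` and `sample_token` are hypothetical names, and only the `seed` field, its default of 0 and the `seed + decoder_index` rule come from the commit message. The use of `std::mt19937` is likewise an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for the relevant part of whisper_full_params.
struct sampling_params {
    uint32_t seed       = 0;  // default 0, as stated in the commit message
    int      n_decoders = 1;  // illustrative name, not the real field
};

// One RNG per decoder, seeded with seed + decoder_index, so decoders draw
// different streams that are still reproducible from run to run.
std::vector<std::mt19937> make_decoder_rngs(const sampling_params & p) {
    assert(p.n_decoders > 0);
    std::vector<std::mt19937> rngs;
    rngs.reserve(p.n_decoders);
    for (int i = 0; i < p.n_decoders; ++i) {
        rngs.emplace_back(p.seed + static_cast<uint32_t>(i));
    }
    return rngs;
}

// Draw a token index from a probability vector. This models the
// temperature > 0 path, where sampling is random rather than greedy.
int sample_token(const std::vector<double> & probs, std::mt19937 & rng) {
    assert(!probs.empty());
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return dist(rng);
}
```

With the same `seed`, two calls to `make_decoder_rngs` yield generators that produce identical sample sequences, which is what makes output repeatable when temperature is above zero.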

add_codeowners file

2cc2313

added approval check worker

6befb6f

DEVOPS-916: Add ai-runtime-merge to CODEOWNERS

2a94ba2

sharmaraju352 requested review from a team as code owners March 30, 2026 05:21

github-advanced-security AI found potential problems Mar 30, 2026

View reviewed changes

Comment thread ggml/src/ggml-hexagon/htp/hex-dump.h Dismissed

Comment thread ggml/src/ggml-hexagon/htp/hex-dump.h Dismissed

Comment thread ggml/src/ggml-hexagon/htp/hex-dump.h Dismissed

jpgaribotti approved these changes Mar 30, 2026

View reviewed changes

ishanvohra2 approved these changes Mar 30, 2026

View reviewed changes

Merge branch 'master' into rebase-v1.8.4

8519283

ogad-tether approved these changes Mar 30, 2026

View reviewed changes

sharmaraju352 merged commit e361028 into master Mar 31, 2026
69 of 77 checks passed

gianni-cor pushed a commit that referenced this pull request May 28, 2026

Merge pull request #8 from tetherto/rebase-v1.8.4

0effd71

chore: rebase fork to whisper.cpp v1.8.4

gianni-cor deleted the rebase-v1.8.4 branch May 28, 2026 13:44

sharmaraju352 merged 1349 commits into master from rebase-v1.8.4


20 participants
