QVAC-18991: pull latest whisper.cpp from upstream (+ VAD-streaming regression test) by Zbig9000 · Pull Request #25 · tetherto/qvac-ext-lib-whisper.cpp

Zbig9000 · 2026-05-19T15:17:57Z

QVAC-18991: pull latest whisper.cpp from upstream (+ VAD-streaming regression test)

Summary

Resyncs whisper.cpp with ggml-org/whisper.cpp:master up to v1.8.4.3, and adds the regression test that upstream PR ggml-org#3677 (whisper_vad_detect_speech_no_reset + whisper_vad_reset_state) shipped without.

Builds on top of @mariusz-rei's earlier sync work (PR #12), so the bulk of the diff is identical. Two material differences vs. that PR:

Recovers 21 commits that landed on tetherto/master after Mariusz's branch point (incl. tts-cpp and parakeet-cpp first appearing in-tree). His branch had diverged before those landed; merging it straight into master silently dropped them. I merged tetherto/master into his branch, which resolved cleanly.
Adds tests/test-vad-streaming.cpp — three cases covering the contract of the new VAD-streaming API: (a) whisper_vad_detect_speech is idempotent, (b) whisper_vad_reset_state correctly restores the LSTM state, (c) whisper_vad_detect_speech_no_reset on a chunk-boundary split of the input produces probabilities byte-identical to a single full-input call.
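
The three properties in (a) to (c) can be illustrated without the real model. The sketch below is a toy stand-in, not the whisper.cpp API and not the contents of tests/test-vad-streaming.cpp: `ToyVad`, `same_bytes` and `toy_contract_holds` are hypothetical names, and the "state" is a single float rather than an LSTM. It only shows why a split on a 512-sample boundary can reproduce the single-call output byte for byte, while a split elsewhere introduces mid-stream zero padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy stand-in for a stateful VAD. NOT the whisper.cpp API; all names are hypothetical.
constexpr std::size_t kWindow = 512;  // window size in samples, as quoted in the PR text

struct ToyVad {
    float state = 0.0f;        // stand-in for the recurrent (LSTM) state
    std::vector<float> probs;  // one value per processed window

    void reset_state() { state = 0.0f; }

    // Process the input window by window WITHOUT resetting the state first.
    // A short final window is zero-padded.
    void detect_no_reset(const float * samples, std::size_t n) {
        probs.clear();
        for (std::size_t off = 0; off < n; off += kWindow) {
            float energy = 0.0f;
            for (std::size_t i = 0; i < kWindow; ++i) {
                const float s = (off + i < n) ? samples[off + i] : 0.0f;
                energy += s * s;
            }
            state = 0.5f * state + 0.5f * (energy / kWindow);
            probs.push_back(state);
        }
    }

    // Resetting variant: reset, then process.
    void detect(const float * samples, std::size_t n) {
        reset_state();
        detect_no_reset(samples, n);
    }
};

// Byte-wise comparison, mirroring the "byte-identical" wording of the PR.
bool same_bytes(const std::vector<float> & a, const std::vector<float> & b) {
    return a.size() == b.size() &&
           (a.empty() || std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0);
}

// Returns true when all three properties hold for the toy model.
bool toy_contract_holds(const std::vector<float> & pcm, std::size_t split) {
    assert(split <= pcm.size());
    ToyVad vad;
    vad.detect(pcm.data(), pcm.size());
    const std::vector<float> full = vad.probs;

    // (a) the resetting call is idempotent
    vad.detect(pcm.data(), pcm.size());
    if (!same_bytes(full, vad.probs)) return false;

    // (b) explicit reset + no-reset call equals the resetting call
    vad.reset_state();
    vad.detect_no_reset(pcm.data(), pcm.size());
    if (!same_bytes(full, vad.probs)) return false;

    // (c) two no-reset calls split at `split` equal one full call
    vad.reset_state();
    vad.detect_no_reset(pcm.data(), split);
    std::vector<float> joined = vad.probs;
    vad.detect_no_reset(pcm.data() + split, pcm.size() - split);
    joined.insert(joined.end(), vad.probs.begin(), vad.probs.end());
    return same_bytes(full, joined);
}
```

In this toy, a split at a multiple of `kWindow` keeps every window's inputs identical to the single-call case, which is the same reason the real test splits on a 512-sample boundary.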

What changed at a glance

| Bucket | Files | Notes |
| --- | --- | --- |
| Upstream sync (ggml-org master → v1.8.4.3) | ~150 | Mariusz's PR #12 content, unchanged. 243 upstream commits brought in. |
| `tetherto/master` merge | 0 conflicts | recovers `tts-cpp` + `parakeet-cpp` |
| New regression test | `tests/test-vad-streaming.cpp` (+~120 lines), `tests/CMakeLists.txt` (+5 lines) | wired into `ctest` |

Validation

✅ Linux x64 CPU build: clean, ctest green incl. the new test-vad-streaming (3/3 cases).
✅ Linux x64 Vulkan build (against system LunarG SDK 1.4.341.1 once spirv.hpp is on the include path): clean.
✅ whisper-cli JFK transcription matches the golden text byte-for-byte on both CPU and Vulkan.
✅ On-device validation on OnePlus 7T Pro (Snapdragon 855+, Adreno 640, Android 12) via the cross-compiled artifacts from PR 3 — JFK transcription correct on CPU backend (auto-picked libggml-cpu-android_armv8.2_2.so, 1.81 s for 11 s of audio = ~6× realtime).
✅ CI green on all jobs that exercise the changed code (android, android_java, vad, bindings-java, ios-xcode-build, Linux/macOS/Windows builds). The 2 still-failing checks (Push Docker image to Docker Hub (main-intel) + (main-vulkan)) are pre-existing failures on tetherto/master — they fail on the 3 most recent master commits too. Unrelated to this PR.
⚠️ whisper-cpp/ggml/tests is not present in whisper.cpp's bundled ggml tree, so test-backend-ops can't run here (it does run as part of PR 2). Java bindings build was deferred to CI — confirmed green here. Documented as deferred in aiDocs/01-QVAC-18991.md.

Notes for reviewer

The merge of tetherto/master into the sync branch is a no-op for everything Mariusz already had — the new commit 9ead0b71 only adds the post-divergence tetherto/master content.
The new test hard-codes constexpr int kSileroWindow = 512, matching Silero v6.2.0's fixed 512-sample window at 16 kHz. An earlier attempt to derive the window size dynamically was off by one: the last chunk is zero-padded, so the probability count is ceil(n_samples / 512) and n_samples / probs.size() does not recover 512. A comment in the test explains why.
Branch is currently referenced by the whisper-cpp vcpkg port on PR 3 (Zbig9000 fork + commit SHA). Once this PR is merged + tagged v1.8.4.3 on tetherto, PR 3's port needs to flip REPO -> tetherto/qvac-ext-lib-whisper.cpp + REF -> v1.8.4.3 + recompute SHA512 (TODO in PR 3 description).
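
The off-by-one mentioned in the kSileroWindow note can be reproduced with plain integer arithmetic. This is a minimal sketch under stated assumptions: `n_windows` and `derived_window` are hypothetical helpers (not functions from the test), and 176000 samples is an illustrative figure for roughly 11 s of 16 kHz audio, not a measured length of samples/jfk.wav.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSileroWindow = 512;  // fixed window size quoted in the PR

// Number of probabilities when a short final window is zero-padded: ceil(n / 512).
constexpr std::size_t n_windows(std::size_t n_samples) {
    return (n_samples + kSileroWindow - 1) / kSileroWindow;
}

// Hypothetical "dynamic derive": recover the window size from the output count.
// Only exact when n_samples is a multiple of the window size.
inline std::size_t derived_window(std::size_t n_samples, std::size_t n_probs) {
    assert(n_probs > 0);
    return n_samples / n_probs;
}

// 176000 samples (illustrative) -> 343 full windows plus one padded window.
static_assert(n_windows(176000) == 344, "last window is zero-padded");
static_assert(n_windows(1024) == 2, "exact multiple: no padding");
```

With these numbers, `derived_window(176000, 344)` yields 511 rather than 512, which is why a hard-coded constant is the more robust choice when the final window is padded.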

Refs

Refs QVAC-18991 (Asana)
Supersedes Mariusz's local sync PR QVAC-18300: sync with upstream ggml-org/whisper.cpp master (v1.8.4.3 prep) #12 (changes contained here)
Upstream PR adding VAD streaming API: vad : add streaming detect + explicit state reset ggml-org/whisper.cpp#3677

…cription (ggml-org#3715) * Prevent dangling pointers * Use proper free function * Free callback containers * Set default log callback when nil is passed to log_set * Raise error if callbacks set when parallel transcription * Bump version to 1.3.7 * Make tests follow spec change * Add note on parallel transcription and callbacks * Update signature of Whisper.log_set [skip ci]

* kleidiai: add data type check to get_tensor_traits * Added check for F16 data type into get_tensor_traits path with input data not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8) Signed-off-by: Martin Klacer <martin.klacer@arm.com> Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7 * updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp updated kleidiai.cpp file as per suggestion Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Martin Klacer <martin.klacer@arm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


* vulkan: avoid graphics queue on non-RADV AMD drivers * avoid graphics queues on small GPUs * change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE * reenable transfer queue if graphics queue is not used

* kleidiai : fix MUL_MAT support for batched (3D) inputs The supports_op() check incorrectly rejected MUL_MAT operations with 3D inputs (ne[2] > 1), but the actual compute_forward_qx() implementation handles batched inputs correctly via a loop over ne12. This caused models with Q4_0/Q8_0 weights to crash during graph scheduling when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during loading (tested with 2D inputs) but the runtime used 3D inputs. Also relax the buffer check to allow supports_op() to be called during weight loading when src[0]->buffer is NULL. Fixes #20608 * Kleidiai support_ops should only return true for 3D inputs, not also 4D

* vulkan: fix event wait submission, event command buffer reset * fix event command buffer reset validation error * also reset command buffers before reuse * use timeline semaphores instead of fences for event_synchronize * don't use initializer list for semaphore wait info * use multiple events to avoid reset issues * fix event reuse issue with multiple vectors * add semaphore wait condition also if compute_ctx already exists * remove event pending stage


…/20701) Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear attention layers. These ops follow the existing unary-ops pattern with VTCM DMA double-buffering. - neg: negate via scale by -1.0 - exp: uses existing hvx_exp_f32 HVX intrinsics - sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics - softplus: log(1 + exp(x)) scalar fallback - CONT reuses the existing CPY infrastructure since making a tensor contiguous is equivalent to a same-type copy. - REPEAT implements tiled memory copy with multi-threaded execution via the worker pool, supporting f32 and f16 types. The kernel parallelizes across output rows and uses memcpy for each tile. Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
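
For orientation, the scalar math behind the unary ops named in the commit message above can be written down in a few lines. This is a reference sketch only, assuming plain float inputs: the `ref_*` functions and `apply_unary` are hypothetical names, and nothing here reflects the HVX intrinsics or the VTCM double-buffering of the actual kernels.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar reference definitions (hypothetical names, not the Hexagon kernels).
inline float ref_neg(float x)      { return -1.0f * x; }                       // scale by -1.0
inline float ref_exp(float x)      { return std::exp(x); }
inline float ref_sigmoid(float x)  { return 1.0f / (1.0f + std::exp(-x)); }
inline float ref_softplus(float x) { return std::log(1.0f + std::exp(x)); }    // log(1 + exp(x))

// Apply one of the ops element-wise over a contiguous buffer.
inline void apply_unary(float (*op)(float), const float * src, float * dst, std::size_t n) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = op(src[i]);
    }
}
```

A reference like this is mainly useful for checking a vectorized kernel's output within a tolerance; the direct softplus form shown here overflows for large x, which a production kernel would have to handle separately.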


…iBi slope offset (llama/20031) - Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2, then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp). - Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with 48 heads); fixes buffer overflow and large numerical errors in those cases.
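
The two arithmetic points in the commit message above (padding D up to a multiple of 16, and computing a byte offset from the tensor's element size rather than sizeof(float)) can be sketched as follows. `pad_up` follows the same round-up-to-multiple idea as ggml's GGML_PAD; `elem_offset_bytes` is a hypothetical helper and does not reproduce the actual aclnn_get_slope code.

```cpp
#include <cassert>
#include <cstddef>

// Round x up to a multiple of n; n must be a power of two.
constexpr std::size_t pad_up(std::size_t x, std::size_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Byte offset of element `index` for a given element size (hypothetical helper).
inline std::size_t elem_offset_bytes(std::size_t index, std::size_t elem_size) {
    assert(elem_size > 0);
    return index * elem_size;
}

static_assert(pad_up(72, 16) == 80, "head dim 72 is padded to 80");
static_assert(pad_up(64, 16) == 64, "already aligned: unchanged");
```

For a 2-byte F16 element, `elem_offset_bytes(32, 2)` is 64, whereas using sizeof(float) would give 128 and point past the intended data, which is the class of error the fix describes.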


RotaryPositionEmbedding on CANN fails when src and dst share the same non-contiguous buffer (inplace + view), because the operator overwrites source data before it is fully read. Add a branch that detects this case and uses contiguous temporary buffers: copy src to temp, run ROPE into another temp, then copy back to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1, inplace=1). Signed-off-by: noemotiovon <757486878@qq.com>


…_DELTA_NET) + GET_ROWS optimization (llama/20687) * Implement l2_norm, set, tri * Add DIAG/SOLVE_TRI * Add SSM_CONV * Better get_rows and gated_delta_net to support qwen3.5 * Clean up, update ops.md * Fix binding_index type for wasm * Fix read write annotations * cleanups

* CI: add hip quality check * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Revert "Update .github/workflows/hip-quality-check.yml" This reverts commit efa0bfcdb01dfac0feee674987a0482d50f46145. * scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs * scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list * Bump ccache version * Add mssing seperators to list --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…0693) * migrate(vtcm): unify VTCM management for HMX merge - Add HMX fields to htp_context (#ifdef HTP_HAS_HMX): hmx_enabled, hmx_dma, vtcm_scratch_size, exp2_table - Add HTP_VTCM_SESSION_HOLD CMake option (default ON): hold VTCM for entire session instead of per-op acquire/release - Add vtcm_op_acquire/vtcm_op_release inline wrappers: no-op in session-hold mode, delegate in per-op mode - Add VTCM tail reservation for precompute tables (256KB, 64KB aligned) in htp_iface_start under HTP_HAS_HMX - Add HMX init/cleanup hooks in htp_iface_start/stop - Add precompute table recovery in vtcm_acquire after VTCM preemption - Do NOT migrate vtcm_mgr from htp-ops-lib (replaced by tail reservation) * migrate(repack): replace x4x2 with HMX tile-permuted super-block format - Add hmx_block_q4_0/q8_0 struct definitions (scales-first + sequential quants) - Implement forward repack: repack_q4_0_to_hmx_superblock, repack_q8_0_to_hmx_superblock, repack_f16_to_tile_permuted - Implement inverse repack for get_tensor debug verification - Route set_tensor/get_tensor via opt_arch >= 73 to HMX path, else existing HVX x4x2 - MXFP4 on v73+ falls back to HVX x4x2 repack (not memcpy) - Extend supports_op: add IQ4_NL for v73+, F16 tile alignment checks - Tail blocks (K not multiple of 256): repack to x4x2 via pad-repack-truncate - Add CMake GGML_HEXAGON_HMX_TAIL_HVX option (default ON); OFF rejects non-256-aligned K in supports_op * migrate(dma): add dma_queue_push_1d() convenience wrapper for HMX ops Add 1D linear DMA transfer helper to hex-dma.h for upcoming HMX op migration. Reuses existing dma_queue_flush() for sync points instead of adding redundant dma_queue_drain(). * migrate(hmx): reorganize HMX files into htp/hmx/ and simplify HMX locking Move all 14 HMX-related files from htp/ to htp/hmx/ subdirectory for cleaner separation between HVX and HMX code. 
Simplify HMX hardware locking by replacing the two-level lock design (SHARED HAP lock + custom asm spin-lock) with direct HAP_compute_res_hmx_lock/unlock on the existing vtcm_rctx, which already has HMX capability. Key changes: - Create htp/hmx/ subdirectory with all HMX infrastructure and ops - Replace hmx_mgr_ctx_id + spin-lock with HAP_compute_res_hmx_lock(vtcm_rctx) - Remove hmx_manager_enable/disable_execution() (SHARED lock no longer needed) - Add hmx_set_vtcm_state() call in main.c (was missing, caused null globals) - Update main.c includes to use hmx/ prefix - Clean up duplicate declarations from hmx-worker-pool.h * migrate(hmx-infra): consolidate HMX infrastructure into htp_context - Remove hmx-mgr.c/h: eliminate global HMX state singleton, thread htp_context through all HMX ops - Remove hmx-worker-pool.c/h: replace separate HMX worker pool with main worker_pool API (worker_pool_run_func) - Replace hmx_unit_acquire/release with direct HAP_compute_res_hmx_lock/unlock on ctx->vtcm_rctx - Remove HTP_VTCM_SESSION_HOLD compile option: always use per-op vtcm_acquire/release - Remove hmx_dma from htp_context: HMX ops use ctx->dma[0] instead of separate DMA queue - Simplify main.c init/cleanup: remove hmx_manager_setup/reset and vtcm_op_acquire/release wrappers - Delete upstream llama.cpp AGENTS.md (not applicable to fork) * migrate(flash-attn): remove HTP_EXP2_TABLE_COPIES, use single exp2 table - Remove HTP_EXP2_TABLE_COPIES compile definition and CMake cache variable - Remove table duplication loop in precompute-table.c - Remove worker_index % N sub-table indexing in hmx-flash-attn-ops.c - Fix table_size to 65536 (single 64 KB copy) in main.c The exp2 lookup table is read-only; concurrent VTCM reads do not cause bank conflicts, so duplicating the table wastes 192 KB of VTCM for no benefit. 
* migrate(dsp-main): add HMX priority dispatch in packet_callback - Add proc_hmx_matmul_req() wrapper for HMX mat_mul (F16 and quantized types) - Add proc_hmx_flash_attn_req() wrapper for HMX simple_flash_attn (FP16 only, falls back to HVX for non-FP16) - Add proc_hmx_rms_norm_req() wrapper using hvx_rms_norm_f32 - Route MUL_MAT, FLASH_ATTN_EXT, RMS_NORM through HMX path when ctx->hmx_enabled - Split RMS_NORM and SCALE into separate case blocks for independent dispatch - All HMX wrappers guarded by #ifdef HTP_HAS_HMX * migrate(cmake-dsp): add HMX source files and -mhmx for v73+ skels Add HTP_VTCM_SESSION_HOLD option (default ON) and v73+ HMX build integration: compile hmx-matmul-ops, hmx-flash-attn-ops, hmx-rms-norm-ops and precompute-table into v73/v75/v79/v81 skels with -mhmx flag and HTP_HAS_HMX=1 definition. v68/v69 skels remain unchanged. * migrate(hmx-ops): fix compile errors in HMX ops for ggml struct compatibility - hmx-matmul-ops.c: include ggml-common.h for block_q4_0/block_q8_0 definitions - hmx-matmul-ops.c: rename quants->qs, scale->d to match upstream ggml field names - hmx-flash-attn-ops.c: suppress -Wunused-function/-Wunused-variable warnings - hmx-flash-attn-ops.c: inline ctx->n_threads, remove unused n_workers variable * hmx: set Q/O element type to fp16 for flash attention The llama.cpp integration passes fp16 Q/O tensors, so qo_fp32_element should be false to match the actual data layout. * hexagon: unify HMX weight format to x4x2, add IQ4_NL and DSP-side fallback Remove the v73+ HMX-specific super-block/tile-permuted weight format and unify all architectures on the HVX x4x2 packed format. The DSP now decides at runtime whether to use the HMX or HVX matmul path based on dimension constraints (M%32, N%32, K%256 alignment), rather than the host rejecting ops in supports_op. This simplifies the host repack logic, eliminates ~400 lines of HMX super-block code, and adds IQ4_NL quantization support across host and DSP. 
Key changes: - Remove hmx_block_q4_0/q8_0 types, repack functions, and F16 tile permutation (ggml-hexagon.cpp, hmx-quants.h) - Simplify set_tensor/get_tensor to always use x4x2 repack, add IQ4_NL - Force is_host=false so tensor copies go through format conversion - Add HTP_TYPE_IQ4_NL to DSP message protocol (htp-msg.h) - Rewrite DSP dequantizers to work directly on x4x2 layout (hmx-matmul-ops.c) - Fix mxclracc.hf placement: clear per output tile, not once globally - Move HMX eligibility checks to DSP proc_hmx_matmul_req (main.c) - Remove dma_queue_push_1d wrapper, use 2D DMA for weight sub-blocks - Add VTCM allocation overflow asserts - Remove GGML_HEXAGON_HMX_TAIL_HVX build option (CMakeLists.txt) * Enhance HMX debugging capabilities with new tile dumping functions - Introduced hmx_dump_tile_mem and hmx_dump_fp32_tile_region for improved memory layout visualization of tile data. - Updated hmx_dump_tile_rows to provide raw memory output for debugging. - Added debug logging for activation and weight tile pairs during processing to facilitate troubleshooting. - Refined existing macros for dumping HVX vector values to streamline debugging output. These changes aim to enhance the debugging experience for HMX matmul operations, ensuring better visibility into data handling and transformations. * OK for small mat mul * hexagon: fix UDMA roiwidth 16-bit overflow in HMX matmul DMA transfers The UDMA descriptor roiwidth field is 16-bit (max 65535), but large matrix DMA transfers (e.g. 32×2304 = 73728 bytes) exceeded this limit, causing truncated transfers and NaN results. Fix by using 2D DMA (per-row stride × n_rows) instead of 1D (total_size × 1) for all 4 DMA push calls in both x4x2 and fp16 weight paths. 
Also includes: - Use standard vlut16 instead of _nomatch variant for dequantization - Add per-tile vscatter drain barrier for correctness - Add compile-time HMX_DEBUG_TRACE_VALUES instrumentation (disabled by default) * hexagon: remove HMX RMS norm fallback and re-enable matmul pipeline Remove hmx-rms-norm-ops.c as the HVX RMS norm offers no benefit over the generic unary path. Re-enable DMA pipeline mode for QK matmul. * hexagon: guard all HMX matmul DMA transfers against UDMA 16-bit field overflow All UDMA type1 descriptor fields (roiwidth, roiheight, srcstride, dststride) are 16-bit (max 65535). Commit 40d2a9cc fixed roiwidth overflow in the non-pipeline path by switching from 1D to 2D DMA, but the pipeline path (3 call sites) was left unchanged and still used 1D DMA with chunk_size = n_cols * row_stride as roiwidth, which overflows for any practical matrix size when the pipeline is active. Add a local hmx_dma_push_safe() helper that transparently handles overflow: - Fast path (zero overhead): all params fit in 16 bits -> direct call. - Contiguous block: reshapes into a single 2D descriptor with sub_width that fits in 16 bits, preserving async DMA behavior. - Stride overflow: row-by-row fallback for future large-k models where per-row stride itself exceeds 65535. Convert all 8 external dma_queue_push calls in hmx-matmul-ops.c to use the safe helper, including the 3 pipeline sites (1D -> 2D fix), the FP16 and x4x2 weight paths, qweight_fetch sub-block DMA, and the output-stationary activation fetch. 
* hexagon: multithread activation/output transfer and add HMX matmul fallback - Replace single-threaded transfer_activation_chunk_fp32_to_fp16 with transfer_activation_chunk_multithread across all HMX matmul paths - Add multi-threaded transfer_output_chunk_multithread for FP16-to-FP32 output store, following the same worker pool pattern - Rename transfer_activation_chunk_no_prefetch back to transfer_activation_chunk_fp32_to_fp16 and clean up stale comments - Add HVX fallback in proc_hmx_matmul_req when HMX matmul returns error * [todo]: dynamic alloc vtcm, cause prefill regression. * hexagon: constrain HMX mxmem tile load region to avoid VTCM bank boundary faults Set activation/weight mxmem Rt to 2047 for single-tile loads and document the 4MB VTCM bank boundary constraint, preventing precise bus errors when dynamic VTCM allocation places tiles near bank edges. * hexagon: split unaligned-M HMX matmul into HMX+HVX phases - keep HMX for the 32-aligned head rows and process tail rows with HVX - force re-quantization for HVX tail after HMX phase to avoid stale VTCM state - preserve fallback behavior when N is unaligned or no aligned M rows exist * hexagon: batch-4 Q4_0 dequantize fast path and remove debug traces Add dequantize_x4x2_q4_0_x4groups_hvx() that processes 4 contiguous K-tiles with a single vmemu + vlut16 per row, reducing per-tile overhead. The dequantize loop now takes the batch-4 path when 4 aligned K-tiles are available within the same column tile, falling back to the original single-tile path otherwise. Also removes HMX_DEBUG_TRACE_VALUES instrumentation blocks that are no longer needed. * hexagon: abort on DSP error and fix HMX-to-HVX fallback quantize flag Promote DSP response error from log to GGML_ABORT for fail-fast behavior. Clear SKIP_QUANTIZE flag when falling back from HMX to HVX matmul so the HVX path correctly re-quantizes activations. * hexagon: support batch matmul. 
This fix perplexity issue The problem comes from Grouped-Query Attention(GQA). Strides between batches are not well respected TODO: optimize batch matmul to reuse weights between batches. * hexagon: reuse weights in fp16 batch matmul * hexagon: remove unused HMX flash attention operations and precomputation table, remove the log system for test * hexagon: remove unused HVX math helpers, debug infrastructure, and stale build options * hexagon: fix HMX not enabled due to missing force_hvx parameter in IDL * hexagon: remove the unnecessary changes not related to HMX * hexagon: bypass HMX by default * hexagon: add upstream repo link to htp-ops-lib ported file headers * hexagon: restore host buffer support * hexagon: add HMX=1 option for the adb scripts * hex-hmx: improve DMA pipelining * hex-hmx: further improvements to dma pipelining * hex-hmx: minor cleanup * hex-hmx: move hmx lock out of inner loops/calls * hex-hmx: remove unnecessary state and wrappers * hex-hmx: remove hmx dir and unify f32 to f16 conversions * hex-hmx: further unify hvx conversions * hex-hmx: revert f16 converter to the original for now * hex-hmx: minor cleanup for f16 to f32 converter * hex-mm: replace incorrect fp16-to-fp32 hmx converter and reformated related code * hex-dma: move chanied dma push into hex-dma.h header and update hmx-mm * hex-mm: use hex_is_aligned instead of a duplicated hmx_is_aligned * hex-mm: use hvx_vec_splat_f16 in the hmx code * hex-mm: use VLEN and HTP types in hmx-code * hex-mm: remove duplicate QK and defs * hexagon: pre-shuffle quants before vlut16 * hexagon: enable HMX by default * hex-mm: code indent fixes for hmx-matmul * hexagon: update hex-utils to include align/smin/etc helpers and use that in hmx mm * hex-mm: more formatting fixes * hex-mm: minor naming updates in hmx code * hex-mm: remove leftover from rebase conflict * Fix the incorrect indents --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
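
The UDMA overflow described in the Hexagon commit message above (16-bit descriptor fields, a 32×2304 = 73728-byte transfer exceeding 65535 when pushed as 1D) can be sketched as a small planning function. This is a simplified illustration, not the hmx_dma_push_safe() helper: `Dma2D` and `plan_transfer` are hypothetical, and the sub-width reshape and row-by-row stride fallback mentioned in the commit are only signalled by a `false` return.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kUdmaFieldMax = 65535;  // 16-bit descriptor field limit from the commit text

struct Dma2D {
    std::uint32_t width_bytes;  // bytes per row (roiwidth)
    std::uint32_t rows;         // number of rows (roiheight)
};

// Hypothetical planner: choose a descriptor shape whose fields all fit in 16 bits.
inline bool plan_transfer(std::uint32_t row_bytes, std::uint32_t n_rows, Dma2D * out) {
    assert(out != nullptr);
    const std::uint64_t total = static_cast<std::uint64_t>(row_bytes) * n_rows;
    if (total <= kUdmaFieldMax) {
        // Fast path: the whole block fits as a 1D transfer (total x 1).
        out->width_bytes = static_cast<std::uint32_t>(total);
        out->rows        = 1;
        return true;
    }
    if (row_bytes <= kUdmaFieldMax && n_rows <= kUdmaFieldMax) {
        // 2D transfer: per-row width x n_rows keeps each field below the limit.
        out->width_bytes = row_bytes;
        out->rows        = n_rows;
        return true;
    }
    return false;  // would need a further fallback (not modelled here)
}
```

For the case quoted in the commit, `plan_transfer(2304, 32, &d)` selects a 2304-byte by 32-row descriptor instead of a single 73728-byte row.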


* CANN: add BF16 support for core operators Add BF16 (bfloat16) type support to the CANN backend for the following operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and OUT_PROD. This enables BF16 models to run on Ascend NPUs. * CANN: skip NZ weight format for BF16 and add 310P compile guards NZ weight format conversion does not support BF16 tensors, skip it in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P guards for all BF16 operator support since 310P does not support BF16.

…lama/20662) * vulkan: change gated_delta_net to shard a column across a subgroup This is based on ggml-org/llama.cpp#20391, I used an LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of subgroup to invocation id, using subgroupAdd optionally, etc.). This fixes a perf regression from the transposing of the values in memory (!20443). * vulkan: Spread columns across fewer lanes to reduce the number of workgroups

…20791) Explicitly mark save_acc and add_save_Acc with always_inline in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator disassembly within kernel's register context, preventing un-necessary stask spills. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>

Two latent bugs surfaced together when whisper.cpp is built with -DWHISPER_COREML=ON, both reproducible at CMake configure time:

1. install(TARGETS whisper.coreml) did not join the whisper-targets export set. Since whisper PRIVATE-links to whisper.coreml and is itself in whisper-targets, CMake refuses to generate with

    install(EXPORT "whisper-targets" ...) includes target "whisper" which requires target "whisper.coreml" that is not in any export set.

   Add EXPORT whisper-targets to the install (must come before LIBRARY in CMake's install(TARGETS ...) signature).

2. Once whisper.coreml is in the export set, its PUBLIC include dirs are validated against the install interface. The current "." include dir is a raw source-tree path with no $<BUILD_INTERFACE>/$<INSTALL_INTERFACE> guards and CMake refuses with

    INTERFACE_INCLUDE_DIRECTORIES property contains path "..." which is prefixed in the source directory.

   The headers under coreml/ are internal implementation details only consumed by whisper.cpp (in the same directory), so the correct fix is to mark them PRIVATE rather than wrapping them in install/build generator expressions.

Verified locally with -DWHISPER_COREML=ON -DGGML_METAL=ON: configure clean, whisper.coreml + libwhisper.dylib build end-to-end. This unblocks the ios-xcode-build CI job on PR tetherto#12.

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>

The bindings-java tests testGetDefaultFullParams_Greedy / testGetDefaultFullParams_BeamSearch on PR tetherto#12 fail with

    expected: <5> but was: <0> (greedy.best_of)
    expected: <5> but was: <-1> (beam_search.beam_size)

while whisper_full_default_params() still returns 5 for both; the actual transcription test (testFullTranscribe) produces correct text.

Diagnosis: the Java JNA WhisperFullParams Structure is missing fields that exist in the C whisper_full_params struct, so JNA computes wrong offsets and reads garbage at greedy.best_of / beam_search.beam_size. Specifically the Java layout was missing:

1. int32_t seed: added by tetherto's local seed patch between no_speech_thold and greedy (include/whisper.h:553). This single omission shifts every subsequent field by 4 bytes and is the proximate cause of both failing assertions.
2. bool vad: added by upstream
3. const char * vad_model_path
4. whisper_vad_params vad_params (struct)

Fix:

* New WhisperVadParams.java JNA Structure mirroring whisper_vad_params {threshold, min_speech_duration_ms, min_silence_duration_ms, max_speech_duration_s, speech_pad_ms, samples_overlap}.
* Add `public int seed`, `public CBool vad`, `public String vad_model_path`, `public WhisperVadParams vad_params` fields and thread them into getFieldOrder() at the matching positions.

Field order in WhisperFullParams.getFieldOrder() now matches the C struct in include/whisper.h field-for-field, so JNA-computed offsets agree with the native side.

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
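
The offset shift behind this diagnosis can be shown on the native side alone. The sketch below uses two heavily reduced, hypothetical structs (`NativeParams`, `StaleMirror`) rather than the real whisper_full_params or the JNA class, and assumes 4-byte float and int32_t fields. It shows that omitting one 4-byte field from a mirror layout moves every later field by 4 bytes, so a read at the mirror's offset lands on the wrong native field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, reduced layouts; NOT the real whisper_full_params.
struct NativeParams {
    float        no_speech_thold;
    std::int32_t seed;       // the field the stale mirror does not know about
    std::int32_t best_of;
    std::int32_t beam_size;
};

struct StaleMirror {
    float        no_speech_thold;
    std::int32_t best_of;
    std::int32_t beam_size;
};

static_assert(sizeof(float) == 4, "sketch assumes 4-byte float");
static_assert(offsetof(NativeParams, best_of) - offsetof(StaleMirror, best_of) == 4,
              "one missing int32_t shifts best_of by 4 bytes");
static_assert(offsetof(NativeParams, beam_size) - offsetof(StaleMirror, beam_size) == 4,
              "and every later field by the same amount");

// Read a 32-bit integer at a byte offset, the way a binding with fixed offsets would.
inline std::int32_t read_i32(const void * base, std::size_t offset) {
    assert(base != nullptr);
    std::int32_t v;
    std::memcpy(&v, static_cast<const unsigned char *>(base) + offset, sizeof v);
    return v;
}
```

Reading `best_of` through the stale mirror's offset returns whatever the native `seed` field holds, which matches the shape of the failure: the native defaults are correct, but the binding looks in the wrong place.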


Upstream ggml-org/whisper.cpp PR ggml-org#3677 added the streaming VAD entry points but shipped no test. Lock the public contract on the tetherto fork so regressions surface immediately:

- whisper_vad_detect_speech idempotent (reset is implicit)
- whisper_vad_reset_state restores LSTM state exactly
- detect_speech == reset_state + detect_speech_no_reset
- detect_speech_no_reset on contiguous halves == single-shot detect_speech (state carries across no-reset call boundary)

Splits at a 512-sample boundary (Silero v6.2.0 window size) so no mid-stream zero padding is introduced. Uses the bundled silero VAD model and samples/jfk.wav; no whisper transcribe model needed.

QVAC-18991

Co-authored-by: Cursor <cursoragent@cursor.com>

Repoints the port at the latest tetherto/qvac-ext-lib-whisper.cpp@master tip (ef0f2ae637dc3be8bcd52b17374f9bb804beb06b), which folds in three PRs: * tetherto/qvac-ext-lib-whisper.cpp#23 -- parakeet-cpp: android dynamic backend loading + Adreno-tier GPU policy. The parakeet-cpp subtree now defaults Android builds to GGML_BACKEND_DL=ON + GGML_CPU_ALL_VARIANTS=ON + GGML_CPU_REPACK=ON + GGML_VULKAN=ON + GGML_OPENCL=ON, matching the qvac llm-llamacpp Android port. Vulkan and OpenCL ship as separately-loadable MODULE .so files; per-arch CPU variants ship as `libqvac-speech-ggml-cpu-android_armv*_*.so`. Backend selection is centralised in `init_gpu_backend()`: Adreno 700+ -> OpenCL, every other GPU -> Vulkan (or Metal / CUDA on matching platforms). No static GPU backend entry points are linked anywhere in libparakeet; the ggml-backend registry walk handles every case in both GGML_BACKEND_DL=ON and GGML_BACKEND_DL=OFF modes. Also adds public `set_backends_directory()` / `set_opencl_cache_dir()` entry points plus the matching `EngineOptions::backends_dir` / `opencl_cache_dir` fields and the `--backends-dir` CLI flag so embedded host apps can pin the backends scan directory and the ggml-opencl program-binary cache per-process. * tetherto/qvac-ext-lib-whisper.cpp#24 -- parakeet-cpp: address PR #22 AOSC v2.1 review comments (Sortformer streaming fixes that landed shortly after PR #23 merged; safe to fold in). * tetherto/qvac-ext-lib-whisper.cpp#25 -- Fix missing include for windows (compile-only follow-up to PR #23; needed for the Windows desktop dev path that exercises the new init_gpu_backend tier policy). Date-stamped rather than port-versioned because the upstream commits land Android-specific backend-loading machinery that previous pv1 builds genuinely lacked (not just a bugfix on the same source set). 
Consumers pinning to `2026-05-05#1` keep the StreamingSegment .starts_word baseline; consumers tracking the date-stamped baseline move forward to the dynamic-backend Android shape. Dependency floor on ggml-speech tightened from `2026-04-09#1` to `2026-04-09#2` -- the new Android CPU_ALL_VARIANTS path requires the per-arch CPU variant dlopen fallback that landed in ggml-speech pv2 (previous commit). Without that floor a downstream registry override could silently pull pv1 and fail to register any CPU backend at runtime under AGP's `useLegacyPackaging=false` (the universal Android default since 3.6). No behaviour change on macOS / iOS (Metal still statically linked into libggml-*) or desktop Linux / Windows (Vulkan / CUDA likewise static). The Android-defaults block in parakeet-cpp's CMakeLists.txt is gated on `CMAKE_SYSTEM_NAME STREQUAL "Android"` and only flips the dynamic-loading switches there. Verified by host build: `nm libparakeet.dylib | grep ggml_backend_(vulkan|opencl|metal|cuda|blas)_init` returns empty. git-tree for ports/parakeet-cpp: 2961794. Co-authored-by: Cursor <cursoragent@cursor.com>


KitaitiMakoto and others added 30 commits March 22, 2026 02:03

sycl : fix for untransposed GDA recurrent state (llama/20583)

1335dfa

CUDA: GDN hide memory latency (llama/20537)

dae7781

vulkan: fix flash attention dot product precision (llama/20589)

724ea71

ehance UPSCALE to support all UT cases (llama/20637)

6494251

* [SYCL] ehance UPSCALE to support more cases * rm test case result of SYCL1

ggml-cpu: fix RVV checks in quants and repacking (llama/20682)

c890a9d

* ggml-cpu: refactor quants.c; add rvv check * ggml-cpu: refactor; disable generic fallback

ggml-blas: set mkl threads from thread context (llama/20602)

906aef3

* ggml blas: set mkl threads from thread context * add code to run blas locally

vulkan: disable mmvq on Intel Windows driver (llama/20672)

16ca5e6

* vulkan: disable mmvq on Intel Windows driver * improve comment

HIP : ignore return of hipMemAdvise [no ci] (llama/20696)

61c7cd0

ggml-cpu/x86: fix unused changemask warning in repack (llama/20692)

14caedf

Move to no timeout for WaitAny in graph submission to avoid deadlocks…

d6a0f0d

… in some cases on llvm-pipe backends (llama/20618)

ggml-webgpu: Add supports for DIAG and TRI (llama/20664)

12015a2

* Add supports for DIAG and TRI. * Remove extra ttype and add a comment for TRI op.

ggml-webgpu: Update the RMS_NORM preprocessor and add L2_NORM (ll…

3d004fb

…ama/20665) * Update the preprocessor of RMS_NORM and add L2_NORM. * Fix the name of rms_norm to row_norm.

cmake : fix build warning when kleidiai is enabled (llama/20457)

fea629d

* cmake : fix build warning when kleidiai is enabled * remove LLAMA_ARG_THREADS from KleidiAI backend

vulkan: dequantize iq4_xs 4 at a time (llama/20657)

43c7c0f

hip: Avoid compiler bug in RDNA code generation during debug builds o…

e1cdce4

…n Windows (llama/20655)

ggml: guard KleidiAI DOWNLOAD_EXTRACT_TIMESTAMP for cmake < 3.24 (lla…

65d820a

…ma/20767)

ggerganov and others added 11 commits May 2, 2026 15:02

ggml : bump version to 0.10.2 (ggml/1474)

28f8534

ggml : remove obsoloete wgsl templates (ggml/0)

a5a8496

ggml : remove obsolete rms_norm.wgsl (ggml/0)

bbdaa21

sync : ggml

8384aa8

cmake : add FindNCCL.cmake (ggml/0)

18162bc

talk-llama : sync llama.cpp

4bf7336

Merge upstream ggml-org/whisper.cpp master into v1.8.5 prep

bcbaaae

merge tetherto/master into upstream-sync-v1.8.4.3 (pull in tts-cpp/pa…

9ead0b7

…rakeet-cpp work post-divergence)

Zbig9000 requested review from GustavoA1604, freddy311082, ishanvohra2 and ogad-tether May 19, 2026 15:17

Zbig9000 requested review from a team as code owners May 19, 2026 15:17

Zbig9000 changed the title ~~Qvac 18991 pull latest whisper cpp upstream~~ QVAC-18991: pull latest whisper.cpp from upstream (+ VAD-streaming regression test) May 19, 2026

This was referenced May 19, 2026

QVAC-18993: bundled-ggml Android dynamic-backend + tts-cpp <atomic> fix #26

Closed

whisper-cpp: bump to 1.8.4.3 with Android OpenCL + dynamic backends + MSVC fix (QVAC-18300, QVAC-18993) tetherto/qvac-registry-vcpkg#152

Merged

GustavoA1604 merged commit e5677f7 into tetherto:master May 20, 2026
59 of 66 checks passed

Zbig9000 mentioned this pull request May 20, 2026

QVAC-18992: merge ggml-org @ 19eac6f0 (v0.10.2) into speech tetherto/qvac-ext-ggml#13

Merged

gianni-cor pushed a commit that referenced this pull request May 28, 2026

Merge pull request #25 from Zbig9000/QVAC-18991-pull-latest-whisper-c…

cea114a

…pp-upstream QVAC-18991: pull latest whisper.cpp from upstream (+ VAD-streaming regression test)

GustavoA1604 merged 248 commits into tetherto:master from Zbig9000:QVAC-18991-pull-latest-whisper-cpp-upstream


20 participants
