QVAC-18300: sync with upstream ggml-org/whisper.cpp master (v1.8.4.3 prep) #12

Merged
GustavoA1604 merged 246 commits into tetherto:master from mario-rei:feat/upstream-sync-v1.8.4.3
May 20, 2026

mario-rei commented May 4, 2026

Summary

Sync our fork with upstream ggml-org/whisper.cpp:master (243 commits past
v1.8.4) so we can ship whisper-cpp 1.8.4.3 to qvac-registry-vcpkg with
a working OpenCL backend on Android.

  • git merge upstream/master produced zero conflicts.
  • All 25 tetherto-specific commits preserved (BCI patches is_bci,
    window_mask, compute_window_mask(), the per-layer flash-attention
    guard, and the older cmake / windows-pthread / vcpkg fixes).
  • Preserves the seed parameter and the ai-runtime-merge codeowners
    entry.

Test plan

Build sanity (host macOS arm64)

  • -DGGML_OPENCL=OFF configure + build → libwhisper.dylib produced
    end-to-end.
  • -DGGML_OPENCL=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DGGML_VULKAN_DISABLE_COOPMAT=ON -DGGML_VULKAN_DISABLE_COOPMAT2=ON
    configure clean; OpenCL TU itself can't compile on macOS without
    CL/cl.h (expected — Adreno target is Android). All CPU variants
    linked successfully.

Functional inference (host macOS arm64, Metal backend)

  • whisper-cli -m models/ggml-tiny.en.bin -f samples/jfk.wav
    output: And so my fellow Americans ask not what your country can do for you ask what you can do for your country. (exact JFK
    quote). Total wall time 486 ms; Metal backend on Apple M3 Ultra.

Bundled tests (ctest)

  • test-whisper-cli-tiny.en — PASS (Metal).
  • test-vad (silero VAD unit) — PASS (CPU/BLAS), correct VAD
    segments produced.
  • test-vad-full (VAD + whisper end-to-end) — PASS, output
    matches JFK quote.

Out of scope on host (will run as part of the downstream qvac PR)

  • Android NDK cross-compile with GGML_OPENCL=ON (covered by
    prebuilds-qvac-lib-infer-whispercpp.yml on the qvac PR).
  • OpenCL backend execution on a real Adreno device (mobile
    integration test on the qvac PR).
  • BCI-specific path on a BCI model (not in public ggml model bucket
    — source-level BCI guards / window_mask code intact, verified
    structurally).

Follow-up after merge

  • Tag this commit as v1.8.4.3 so the qvac-registry-vcpkg portfile
    in the downstream PR can switch from the temporary mario-rei/SHA
    pin back to tetherto REF=v1.8.4.3 (and SHA512 will be
    recomputed against the new tarball).
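
The SHA512 recomputation mentioned in this follow-up can be sketched as below. This is a minimal sketch, assuming the release tarball has already been downloaded to a local file; the helper name `tarball_sha512` is illustrative and does not come from the repository or from vcpkg.

```python
import hashlib


def tarball_sha512(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-512 digest of the file at `path`.

    Hypothetical helper: the file is read in chunks so that a large
    source archive does not have to fit in memory at once.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting 128-character hex string is the kind of value a portfile's SHA512 field would be compared against once the tag exists.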

Notes for reviewers

"unstable" merge state is structural, not a real CI failure

The 5 perpetually-pending jobs (ggml-ci-mac-metal, ggml-ci-mac-vulkan,
ggml-ci-x64-nvidia-cuda, ggml-ci-x64-nvidia-vulkan-cm,
ggml-ci-x64-nvidia-vulkan-cm2) are upstream ggml-org/whisper.cpp
self-hosted-runner checks that came along with the upstream sync.
Tetherto's fork doesn't host those runners, so they won't go green here
regardless of how long we wait — they just sit queued and eventually get
cancelled.

Run #25340972534
shows the actual signal: 53 success / 0 failure / 5 cancelled, and
all 5 cancellations are those upstream-only jobs. The substantive jobs —
ios-xcode-build, bindings-java, all windows-* variants, android,
android_java, the macOS/iOS xcframework builds, sanitized Linux x64,
arm64, ppc64le, etc. — are all green.
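
The reasoning above (only upstream-only jobs are non-green) can be expressed as a small check. This is a sketch under stated assumptions: the input shape, a mapping from job name to a conclusion string, is hypothetical, and only the five job names are taken from this description.

```python
# The five upstream-only checks named in this PR description.
UPSTREAM_ONLY_JOBS = {
    "ggml-ci-mac-metal",
    "ggml-ci-mac-vulkan",
    "ggml-ci-x64-nvidia-cuda",
    "ggml-ci-x64-nvidia-vulkan-cm",
    "ggml-ci-x64-nvidia-vulkan-cm2",
}


def blocking_jobs(conclusions):
    """Return the jobs that did not succeed and are not upstream-only.

    `conclusions` maps job name -> conclusion string such as "success",
    "failure", or "cancelled" (an assumed input shape for illustration).
    """
    return sorted(
        name
        for name, result in conclusions.items()
        if result != "success" and name not in UPSTREAM_ONLY_JOBS
    )
```

An empty result means every non-green job belongs to the upstream self-hosted set, which is the situation described for this run.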

Merge commit subject vs target tag

The merge commit reads "v1.8.5 prep" — that was the working name when
the merge was performed; we settled on v1.8.4.3 afterwards (continues
the existing v1.8.4.x post-fix scheme). Branch name and version files
in downstream PRs all use v1.8.4.3. Please use v1.8.4.3 when tagging
the merge commit.

KitaitiMakoto and others added 30 commits March 22, 2026 02:03
…cription (ggml-org#3715)

* Prevent dangling pointers

* Use proper free function

* Free callback containers

* Set default log callback when nil is passed to log_set

* Raise error if callbacks set when parallel transcription

* Bump version to 1.3.7

* Make tests follow spec change

* Add note on parallel transcription and callbacks

* Update signature of Whisper.log_set [skip ci]
* kleidiai: add data type check to get_tensor_traits

 * Added check for F16 data type into get_tensor_traits path with input data
   not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7

* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp

updated kleidiai.cpp file as per suggestion

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* [SYCL] enhance UPSCALE to support more cases

* rm test case result of SYCL1
* vulkan: avoid graphics queue on non-RADV AMD drivers

* avoid graphics queues on small GPUs

* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE

* reenable transfer queue if graphics queue is not used
* kleidiai : fix MUL_MAT support for batched (3D) inputs

The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.

This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.

Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.

Fixes #20608

* Kleidiai support_ops should only return true for 3D inputs, not also 4D
* vulkan: fix event wait submission, event command buffer reset

* fix event command buffer reset validation error

* also reset command buffers before reuse

* use timeline semaphores instead of fences for event_synchronize

* don't use initializer list for semaphore wait info

* use multiple events to avoid reset issues

* fix event reuse issue with multiple vectors

* add semaphore wait condition also if compute_ctx already exists

* remove event pending stage
* ggml-cpu: refactor quants.c; add rvv check

* ggml-cpu: refactor; disable generic fallback
* ggml blas: set mkl threads from thread context

* add code to run blas locally
* vulkan: disable mmvq on Intel Windows driver

* improve comment
…/20701)

Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.

- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
- CONT reuses the existing CPY infrastructure since making a tensor
  contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
  the worker pool, supporting f32 and f16 types. The kernel parallelizes
  across output rows and uses memcpy for each tile.
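
For reference, the scalar math behind the four unary ops listed above can be sketched as follows. This is plain Python, not the HVX kernels; the function names are illustrative, and the softplus form is the direct formula quoted in the commit message rather than a numerically hardened variant.

```python
import math


def neg(x):
    """Negate via a scale by -1.0, as the commit message describes."""
    return -1.0 * x


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """log(1 + exp(x)), the direct formula; exp(x) overflows for very large x."""
    return math.log(1.0 + math.exp(x))
```

A production kernel would typically guard the large-x case (where softplus(x) is approximately x); that guard is omitted here to keep the sketch aligned with the formula in the text.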

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
… in some cases on llvm-pipe backends (llama/20618)
…iBi slope offset (llama/20031)

- Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by
  padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2,
  then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp).
- Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of
  sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with
  48 heads); fixes buffer overflow and large numerical errors in those cases.
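
The pad, run, slice approach in the first bullet can be sketched in a few lines. This is an illustrative sketch, not the CANN code: `ggml_pad` mirrors the usual round-up-to-multiple behaviour of `GGML_PAD` for power-of-two multiples, and zero fill for the padding is an assumption made here because it leaves Q·K dot products unchanged.

```python
def ggml_pad(x, n):
    """Round x up to the next multiple of n (n must be a power of two)."""
    return (x + n - 1) & ~(n - 1)


def pad_row(row, multiple=16):
    """Zero-pad one head vector to D_padded; returns (padded_row, original_D)."""
    d = len(row)
    d_padded = ggml_pad(d, multiple)
    return row + [0.0] * (d_padded - d), d


def slice_back(row_padded, d):
    """Drop the padding columns again after the attention call."""
    return row_padded[:d]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

With zero padding, `dot(pad_row(q)[0], pad_row(k)[0])` equals `dot(q, k)`, so the attention scores are unaffected and only the output needs slicing back to D.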
* Add support for DIAG and TRI.

* Remove extra ttype and add a comment for TRI op.
…ama/20665)

* Update the preprocessor of RMS_NORM and add L2_NORM.

* Fix the name of rms_norm to row_norm.
RotaryPositionEmbedding on CANN fails when src and dst share the same
non-contiguous buffer (inplace + view), because the operator overwrites
source data before it is fully read.

Add a branch that detects this case and uses contiguous temporary
buffers: copy src to temp, run ROPE into another temp, then copy back
to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1,
inplace=1).

Signed-off-by: noemotiovon <757486878@qq.com>
* cmake : fix build warning when kleidiai is enabled

* remove LLAMA_ARG_THREADS from KleidiAI backend
…_DELTA_NET) + GET_ROWS optimization (llama/20687)

* Implement l2_norm, set, tri

* Add DIAG/SOLVE_TRI

* Add SSM_CONV

* Better get_rows and gated_delta_net to support qwen3.5

* Clean up, update ops.md

* Fix binding_index type for wasm

* Fix read write annotations

* cleanups
* CI: add hip quality check

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Revert "Update .github/workflows/hip-quality-check.yml"

This reverts commit efa0bfcdb01dfac0feee674987a0482d50f46145.

* scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs

* scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list

* Bump ccache version

* Add missing separators to list

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
…0693)

* migrate(vtcm): unify VTCM management for HMX merge

- Add HMX fields to htp_context (#ifdef HTP_HAS_HMX): hmx_enabled,
  hmx_dma, vtcm_scratch_size, exp2_table
- Add HTP_VTCM_SESSION_HOLD CMake option (default ON): hold VTCM for
  entire session instead of per-op acquire/release
- Add vtcm_op_acquire/vtcm_op_release inline wrappers: no-op in
  session-hold mode, delegate in per-op mode
- Add VTCM tail reservation for precompute tables (256KB, 64KB aligned)
  in htp_iface_start under HTP_HAS_HMX
- Add HMX init/cleanup hooks in htp_iface_start/stop
- Add precompute table recovery in vtcm_acquire after VTCM preemption
- Do NOT migrate vtcm_mgr from htp-ops-lib (replaced by tail reservation)

* migrate(repack): replace x4x2 with HMX tile-permuted super-block format

- Add hmx_block_q4_0/q8_0 struct definitions (scales-first + sequential quants)
- Implement forward repack: repack_q4_0_to_hmx_superblock, repack_q8_0_to_hmx_superblock, repack_f16_to_tile_permuted
- Implement inverse repack for get_tensor debug verification
- Route set_tensor/get_tensor via opt_arch >= 73 to HMX path, else existing HVX x4x2
- MXFP4 on v73+ falls back to HVX x4x2 repack (not memcpy)
- Extend supports_op: add IQ4_NL for v73+, F16 tile alignment checks
- Tail blocks (K not multiple of 256): repack to x4x2 via pad-repack-truncate
- Add CMake GGML_HEXAGON_HMX_TAIL_HVX option (default ON); OFF rejects non-256-aligned K in supports_op

* migrate(dma): add dma_queue_push_1d() convenience wrapper for HMX ops

Add 1D linear DMA transfer helper to hex-dma.h for upcoming HMX op
migration. Reuses existing dma_queue_flush() for sync points instead
of adding redundant dma_queue_drain().

* migrate(hmx): reorganize HMX files into htp/hmx/ and simplify HMX locking

Move all 14 HMX-related files from htp/ to htp/hmx/ subdirectory for
cleaner separation between HVX and HMX code. Simplify HMX hardware
locking by replacing the two-level lock design (SHARED HAP lock +
custom asm spin-lock) with direct HAP_compute_res_hmx_lock/unlock
on the existing vtcm_rctx, which already has HMX capability.

Key changes:
- Create htp/hmx/ subdirectory with all HMX infrastructure and ops
- Replace hmx_mgr_ctx_id + spin-lock with HAP_compute_res_hmx_lock(vtcm_rctx)
- Remove hmx_manager_enable/disable_execution() (SHARED lock no longer needed)
- Add hmx_set_vtcm_state() call in main.c (was missing, caused null globals)
- Update main.c includes to use hmx/ prefix
- Clean up duplicate declarations from hmx-worker-pool.h

* migrate(hmx-infra): consolidate HMX infrastructure into htp_context

- Remove hmx-mgr.c/h: eliminate global HMX state singleton, thread htp_context through all HMX ops
- Remove hmx-worker-pool.c/h: replace separate HMX worker pool with main worker_pool API (worker_pool_run_func)
- Replace hmx_unit_acquire/release with direct HAP_compute_res_hmx_lock/unlock on ctx->vtcm_rctx
- Remove HTP_VTCM_SESSION_HOLD compile option: always use per-op vtcm_acquire/release
- Remove hmx_dma from htp_context: HMX ops use ctx->dma[0] instead of separate DMA queue
- Simplify main.c init/cleanup: remove hmx_manager_setup/reset and vtcm_op_acquire/release wrappers
- Delete upstream llama.cpp AGENTS.md (not applicable to fork)

* migrate(flash-attn): remove HTP_EXP2_TABLE_COPIES, use single exp2 table

- Remove HTP_EXP2_TABLE_COPIES compile definition and CMake cache variable
- Remove table duplication loop in precompute-table.c
- Remove worker_index % N sub-table indexing in hmx-flash-attn-ops.c
- Fix table_size to 65536 (single 64 KB copy) in main.c

The exp2 lookup table is read-only; concurrent VTCM reads do not cause
bank conflicts, so duplicating the table wastes 192 KB of VTCM for no
benefit.

* migrate(dsp-main): add HMX priority dispatch in packet_callback

- Add proc_hmx_matmul_req() wrapper for HMX mat_mul (F16 and quantized types)
- Add proc_hmx_flash_attn_req() wrapper for HMX simple_flash_attn (FP16 only, falls back to HVX for non-FP16)
- Add proc_hmx_rms_norm_req() wrapper using hvx_rms_norm_f32
- Route MUL_MAT, FLASH_ATTN_EXT, RMS_NORM through HMX path when ctx->hmx_enabled
- Split RMS_NORM and SCALE into separate case blocks for independent dispatch
- All HMX wrappers guarded by #ifdef HTP_HAS_HMX

* migrate(cmake-dsp): add HMX source files and -mhmx for v73+ skels

Add HTP_VTCM_SESSION_HOLD option (default ON) and v73+ HMX build
integration: compile hmx-matmul-ops, hmx-flash-attn-ops,
hmx-rms-norm-ops and precompute-table into v73/v75/v79/v81 skels
with -mhmx flag and HTP_HAS_HMX=1 definition. v68/v69 skels remain
unchanged.

* migrate(hmx-ops): fix compile errors in HMX ops for ggml struct compatibility

- hmx-matmul-ops.c: include ggml-common.h for block_q4_0/block_q8_0 definitions
- hmx-matmul-ops.c: rename quants->qs, scale->d to match upstream ggml field names
- hmx-flash-attn-ops.c: suppress -Wunused-function/-Wunused-variable warnings
- hmx-flash-attn-ops.c: inline ctx->n_threads, remove unused n_workers variable

* hmx: set Q/O element type to fp16 for flash attention

The llama.cpp integration passes fp16 Q/O tensors, so qo_fp32_element
should be false to match the actual data layout.

* hexagon: unify HMX weight format to x4x2, add IQ4_NL and DSP-side fallback

Remove the v73+ HMX-specific super-block/tile-permuted weight format
and unify all architectures on the HVX x4x2 packed format. The DSP now
decides at runtime whether to use the HMX or HVX matmul path based on
dimension constraints (M%32, N%32, K%256 alignment), rather than the
host rejecting ops in supports_op. This simplifies the host repack
logic, eliminates ~400 lines of HMX super-block code, and adds IQ4_NL
quantization support across host and DSP.
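
The runtime decision described in this commit message can be sketched as below. This is only an illustration of the stated alignment rule (M%32, N%32, K%256); the function names are hypothetical and the real DSP-side check may consider more conditions.

```python
def hmx_eligible(m, n, k):
    """True when the dimensions satisfy the alignment rule quoted above."""
    return m % 32 == 0 and n % 32 == 0 and k % 256 == 0


def choose_matmul_path(m, n, k):
    """Pick the matmul path at runtime instead of rejecting the op on the host."""
    return "HMX" if hmx_eligible(m, n, k) else "HVX"
```

The design point is that the host always accepts the op and packs weights as x4x2; only the DSP decides which unit executes it.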

Key changes:
- Remove hmx_block_q4_0/q8_0 types, repack functions, and F16 tile
  permutation (ggml-hexagon.cpp, hmx-quants.h)
- Simplify set_tensor/get_tensor to always use x4x2 repack, add IQ4_NL
- Force is_host=false so tensor copies go through format conversion
- Add HTP_TYPE_IQ4_NL to DSP message protocol (htp-msg.h)
- Rewrite DSP dequantizers to work directly on x4x2 layout
  (hmx-matmul-ops.c)
- Fix mxclracc.hf placement: clear per output tile, not once globally
- Move HMX eligibility checks to DSP proc_hmx_matmul_req (main.c)
- Remove dma_queue_push_1d wrapper, use 2D DMA for weight sub-blocks
- Add VTCM allocation overflow asserts
- Remove GGML_HEXAGON_HMX_TAIL_HVX build option (CMakeLists.txt)

* Enhance HMX debugging capabilities with new tile dumping functions

- Introduced hmx_dump_tile_mem and hmx_dump_fp32_tile_region for improved memory layout visualization of tile data.
- Updated hmx_dump_tile_rows to provide raw memory output for debugging.
- Added debug logging for activation and weight tile pairs during processing to facilitate troubleshooting.
- Refined existing macros for dumping HVX vector values to streamline debugging output.

These changes aim to enhance the debugging experience for HMX matmul operations, ensuring better visibility into data handling and transformations.

* OK for small mat mul

* hexagon: fix UDMA roiwidth 16-bit overflow in HMX matmul DMA transfers

The UDMA descriptor roiwidth field is 16-bit (max 65535), but large matrix
DMA transfers (e.g. 32×2304 = 73728 bytes) exceeded this limit, causing
truncated transfers and NaN results. Fix by using 2D DMA (per-row stride ×
n_rows) instead of 1D (total_size × 1) for all 4 DMA push calls in both
x4x2 and fp16 weight paths.
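
The arithmetic behind this overflow is easy to check. The sketch below assumes the example means 32 rows of 2304 bytes each, and assumes a 16-bit field simply keeps the low 16 bits of the value; both are assumptions made for illustration.

```python
UDMA_FIELD_MAX = 0xFFFF  # largest value a 16-bit descriptor field can hold


def fits_16bit(value):
    return 0 <= value <= UDMA_FIELD_MAX


def as_1d(total_bytes):
    """1D form: the whole transfer is one row, (roiwidth, roiheight)."""
    return (total_bytes, 1)


def as_2d(row_bytes, n_rows):
    """2D form: one row per matrix row, (roiwidth, roiheight)."""
    return (row_bytes, n_rows)


n_rows, row_bytes = 32, 2304
total_bytes = n_rows * row_bytes          # 73728, does not fit in 16 bits
truncated = total_bytes & UDMA_FIELD_MAX  # what a 16-bit field would keep
```

In the 2D form both fields (2304 and 32) are far below 65535, which is why switching the descriptor shape removes the overflow without changing the bytes moved.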

Also includes:
- Use standard vlut16 instead of _nomatch variant for dequantization
- Add per-tile vscatter drain barrier for correctness
- Add compile-time HMX_DEBUG_TRACE_VALUES instrumentation (disabled by default)

* hexagon: remove HMX RMS norm fallback and re-enable matmul pipeline

Remove hmx-rms-norm-ops.c as the HVX RMS norm offers no benefit over
the generic unary path. Re-enable DMA pipeline mode for QK matmul.

* hexagon: guard all HMX matmul DMA transfers against UDMA 16-bit field overflow

All UDMA type1 descriptor fields (roiwidth, roiheight, srcstride, dststride)
are 16-bit (max 65535). Commit 40d2a9cc fixed roiwidth overflow in the
non-pipeline path by switching from 1D to 2D DMA, but the pipeline path
(3 call sites) was left unchanged and still used 1D DMA with
chunk_size = n_cols * row_stride as roiwidth, which overflows for any
practical matrix size when the pipeline is active.

Add a local hmx_dma_push_safe() helper that transparently handles overflow:
- Fast path (zero overhead): all params fit in 16 bits -> direct call.
- Contiguous block: reshapes into a single 2D descriptor with sub_width
  that fits in 16 bits, preserving async DMA behavior.
- Stride overflow: row-by-row fallback for future large-k models where
  per-row stride itself exceeds 65535.

Convert all 8 external dma_queue_push calls in hmx-matmul-ops.c to use
the safe helper, including the 3 pipeline sites (1D -> 2D fix), the
FP16 and x4x2 weight paths, qweight_fetch sub-block DMA, and the
output-stationary activation fetch.
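
The three cases handled by the safe helper can be sketched as a planning function. This is a hypothetical model, not the code of `hmx_dma_push_safe()`: the descriptor tuple layout, the contiguity test, and the divisor search are all assumptions chosen to illustrate the fast path, the contiguous reshape, and the row-by-row fallback.

```python
UDMA_FIELD_MAX = 0xFFFF  # every type1 descriptor field is 16-bit


def _fits(*values):
    return all(0 <= v <= UDMA_FIELD_MAX for v in values)


def plan_dma(width, height, src_stride, dst_stride):
    """Return descriptors as (src_off, dst_off, width, height, src_stride, dst_stride)."""
    # Fast path: everything already fits, emit the request unchanged.
    if _fits(width, height, src_stride, dst_stride):
        return [(0, 0, width, height, src_stride, dst_stride)]

    # Contiguous block: rows are back to back on both sides, so the same
    # bytes can be described as a different width x height rectangle.
    contiguous = height == 1 or (src_stride == width and dst_stride == width)
    if contiguous:
        total = width * height
        for sub_width in range(min(total, UDMA_FIELD_MAX), 0, -1):
            if total % sub_width == 0 and total // sub_width <= UDMA_FIELD_MAX:
                return [(0, 0, sub_width, total // sub_width, sub_width, sub_width)]

    # Stride overflow: fall back to one single-row descriptor per row.
    if not _fits(width):
        raise ValueError("row width exceeds 16 bits; not handled in this sketch")
    return [(r * src_stride, r * dst_stride, width, 1, width, width)
            for r in range(height)]
```

Keeping the contiguous case as a single descriptor matters because it preserves one asynchronous transfer, whereas the row-by-row fallback issues one transfer per row.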

* hexagon: multithread activation/output transfer and add HMX matmul fallback

- Replace single-threaded transfer_activation_chunk_fp32_to_fp16 with
  transfer_activation_chunk_multithread across all HMX matmul paths
- Add multi-threaded transfer_output_chunk_multithread for FP16-to-FP32
  output store, following the same worker pool pattern
- Rename transfer_activation_chunk_no_prefetch back to
  transfer_activation_chunk_fp32_to_fp16 and clean up stale comments
- Add HVX fallback in proc_hmx_matmul_req when HMX matmul returns error

* [todo]: dynamic alloc vtcm, cause prefill regression.

* hexagon: constrain HMX mxmem tile load region to avoid VTCM bank boundary faults

Set activation/weight mxmem Rt to 2047 for single-tile loads and document the 4MB VTCM bank boundary constraint, preventing precise bus errors when dynamic VTCM allocation places tiles near bank edges.

* hexagon: split unaligned-M HMX matmul into HMX+HVX phases

- keep HMX for the 32-aligned head rows and process tail rows with HVX
- force re-quantization for HVX tail after HMX phase to avoid stale VTCM state
- preserve fallback behavior when N is unaligned or no aligned M rows exist
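
The row split described in these bullets can be sketched as follows. The function name and return shape are illustrative; only the 32-row alignment and the fallback conditions come from the commit message.

```python
HMX_ROW_ALIGN = 32  # HMX handles rows in multiples of 32


def split_rows(m, n):
    """Return (hmx_rows, hvx_rows) for an M x N output.

    Falls back to HVX for everything when N is unaligned or when there
    are no aligned head rows, matching the fallback rule quoted above.
    """
    if n % HMX_ROW_ALIGN != 0:
        return (0, m)
    head = (m // HMX_ROW_ALIGN) * HMX_ROW_ALIGN
    return (head, m - head)
```

For example, an M of 100 with an aligned N yields 96 rows for the HMX phase and a 4-row tail for the HVX phase.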

* hexagon: batch-4 Q4_0 dequantize fast path and remove debug traces

Add dequantize_x4x2_q4_0_x4groups_hvx() that processes 4 contiguous
K-tiles with a single vmemu + vlut16 per row, reducing per-tile overhead.
The dequantize loop now takes the batch-4 path when 4 aligned K-tiles
are available within the same column tile, falling back to the original
single-tile path otherwise.

Also removes HMX_DEBUG_TRACE_VALUES instrumentation blocks that are no
longer needed.

* hexagon: abort on DSP error and fix HMX-to-HVX fallback quantize flag

Promote DSP response error from log to GGML_ABORT for fail-fast
behavior. Clear SKIP_QUANTIZE flag when falling back from HMX to HVX
matmul so the HVX path correctly re-quantizes activations.

* hexagon: support batch matmul. This fixes the perplexity issue.
The problem comes from Grouped-Query Attention (GQA): strides between batches were not respected.
TODO: optimize batch matmul to reuse weights between batches.

* hexagon: reuse weights in fp16 batch matmul

* hexagon: remove unused HMX flash attention operations and precomputation table, remove the log system for test

* hexagon: remove unused HVX math helpers, debug infrastructure, and stale build options

* hexagon: fix HMX not enabled due to missing force_hvx parameter in IDL

* hexagon: remove the unnecessary changes not related to HMX

* hexagon: bypass HMX by default

* hexagon: add upstream repo link to htp-ops-lib ported file headers

* hexagon: restore host buffer support

* hexagon: add HMX=1 option for the adb scripts

* hex-hmx: improve DMA pipelining

* hex-hmx: further improvements to dma pipelining

* hex-hmx: minor cleanup

* hex-hmx: move hmx lock out of inner loops/calls

* hex-hmx: remove unnecessary state and wrappers

* hex-hmx: remove hmx dir and unify f32 to f16 conversions

* hex-hmx: further unify hvx conversions

* hex-hmx: revert f16 converter to the original for now

* hex-hmx: minor cleanup for f16 to f32 converter

* hex-mm: replace incorrect fp16-to-fp32 hmx converter and reformatted related code

* hex-dma: move chained dma push into hex-dma.h header and update hmx-mm

* hex-mm: use hex_is_aligned instead of a duplicated hmx_is_aligned

* hex-mm: use hvx_vec_splat_f16 in the hmx code

* hex-mm: use VLEN and HTP types in hmx-code

* hex-mm: remove duplicate QK and defs

* hexagon: pre-shuffle quants before vlut16

* hexagon: enable HMX by default

* hex-mm: code indent fixes for hmx-matmul

* hexagon: update hex-utils to include align/smin/etc helpers and use that in hmx mm

* hex-mm: more formatting fixes

* hex-mm: minor naming updates in hmx code

* hex-mm: remove leftover from rebase conflict

* Fix the incorrect indents

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* CANN: add BF16 support for core operators

Add BF16 (bfloat16) type support to the CANN backend for the following
operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and
OUT_PROD. This enables BF16 models to run on Ascend NPUs.

* CANN: skip NZ weight format for BF16 and add 310P compile guards

NZ weight format conversion does not support BF16 tensors, skip it
in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID
and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P
guards for all BF16 operator support since 310P does not support BF16.
…lama/20662)

* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on ggml-org/llama.cpp#20391. I used an
LLM to port the CUDA code to Vulkan, and guided it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory
(!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
…20791)

Explicitly mark save_acc and add_save_Acc with always_inline
in tinyBLAS_PPC. This ensures the compiler keeps MMA accumulator
disassembly within the kernel's register context, preventing unnecessary
stack spills.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
jeffbolznv and others added 11 commits May 2, 2026 15:02
* vulkan: Support asymmetric FA in coopmat2 path

There has been some recent interest/experimentation with mixed quantization
types for FA. I had originally designed the cm2 FA shader with this in mind
(because I didn't realize it wasn't supported at the time!), this change
adds the missing pieces and enables it.

Also support Q1_0 since people have been trying that out (seems crazy, but
who knows).

We should be able to do similar things in the coopmat1/scalar path, but
there's another change open against the scalar path and I don't want to
conflict.

* reorder cases
…/22578)

* Fix vectorized condition of mul-mat-fast pipeline and add vectorized variant to mul-mat-id

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* hmx: extract shared interleave headers and unify matmul batched

* hmx: add HMX-accelerated flash attention for prefill

* hmx: replace asm wrappers with Q6_ intrinsics in hmx-utils.h

Switches three single-instruction helpers from inline asm to the matching
Q6_ intrinsics, matching the style established by aizip f8737609a and used
by the upstream PR #21554 hmx-matmul-ops.c rewrite:

  hmx_set_output_scales       asm "bias=mxmem2"  -> Q6_bias_mxmem2_A
  hmx_load_tile_pair_fp16     asm packet         -> Q6_activation_hf_mxmem_RR
                                                    + Q6_weight_hf_mxmem_RR
  hmx_consume_accumulator_fp16 asm "mxmem=acc"   -> Q6_mxmem_AR_after_hf

hmx_load_tiles_fp16 stays on inline asm: it uses ":deep" activation
streaming, and the mixed Q6_activation_hf_mxmem_RR_deep + non-deep
Q6_weight_hf_mxmem_RR pair fails the HMX backend constraint check
("activate weight pair (1) exceeds limit (1)"). The asm bundle keeps
both halves in one VLIW packet and avoids the diagnostic.

Functionally equivalent — same instructions emitted; the Q6_ intrinsics
just give the compiler more visibility for scheduling.

* hmx: drop the duplicate interleave_fp16_weight_chunk_to_tiles

* hmx: apply upstream optimization to hmx-flash-attn-ops.c

Apply restrict, __builtin_assume, and pointer accumulation to the three HMX workers (qk_dot, o_update, o_norm) and the matching inline HMX loops in op_hmx_flash_attn_ext.

* hmx: unify interleave helper

* hmx: multi-thread Q load / O store and enable prefill FA dispatch

Extract inline Q-load and O-store loops into worker_pool-parallel helpers
(fa_phase_q_load, fa_phase_o_store) so HVX threads split the F32↔F16
conversion work across row ranges.  Also relax the softmax threading
gate from n_row_vec_cnt >= n_threads to >= 2, which was unnecessarily
forcing single-thread fallback when n_rows_g < 512.

On the dispatch side, remove the ne[2] != 1 guard that blocked multi-head
(prefill) FA from reaching the HTP backend — GQA is already handled
internally by both the HMX and HVX flash-attention paths.

* hmx: relax matmul pipeline gate to cover k > n shapes (e.g. FFN_down)

* hmx: optimize FA softmax mask phase (no-ALiBi fast path + GQA dedup)

* hmx: Add an asm memory clobber at the phase boundary to prevent reorder bug

* [experimental]: fp16 softmax (EXP2_HF) to accelerate fa

Bake log2(e) into qk_scale and use hvx_exp2_hf directly for P and m_diff
(base-2 consistent, matches htp-ops-lib). ~22 ALU ops for 64 lanes vs
~44 for the F32 round-trip path.
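
Why baking log2(e) into the scale is safe can be shown with a small numeric sketch. This uses ordinary Python floats rather than fp16 HVX vectors, and the function names are illustrative; it only demonstrates that a base-2 softmax with a rescaled input matches the base-e softmax.

```python
import math

LOG2_E = math.log2(math.e)


def softmax_base_e(scores, qk_scale):
    """Reference softmax: exp of the scaled scores, max-subtracted."""
    scaled = [s * qk_scale for s in scores]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_base_2(scores, qk_scale):
    """Same softmax with log2(e) folded into the scale and exp2 used throughout."""
    scaled = [s * (qk_scale * LOG2_E) for s in scores]
    m = max(scaled)  # the running max lives in the same base-2 domain
    weights = [2.0 ** (x - m) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]
```

The identity used is exp(x) = 2^(x · log2(e)); because the max subtraction happens in the same rescaled domain, the two functions agree up to floating-point rounding.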

* hmx flash-attn: refine cost model coefficients based on profiling data

* hmx flash-attn: replace asm clobber with targeted volatile reads on vtcm_d_tiles

* hmx flash-attn: fix prefill correctness (dst indexing, softmax reduce, V stride)

* hmx flash-attn: fix p_tiles dual-tile OOB race; enable MT + pipeline

* hmx flash-attn: preserve additive mask bias in no-ALiBi fast path

The no-ALiBi fast path (max_bias==0) was skipping mask add entirely on
the assumption that mask values are only {0, -inf}.  This is wrong when
the mask carries additive positional bias — those terms were silently
dropped.  Keep the slope-mul skip (slope≡1.0) but add mask back so the
bias survives; vmux still clamps below -16 to -inf.

Also add HMX FA coverage to test-backend-ops: prefill shapes (nb=64,
nb=32) × {mask on/off} × {ALiBi on/off} × {softcap on/off}, F16 KV,
hs ∈ {64, 128}.

* hmx: fix softcap+EXP2_HF interaction, tighten matmul pipeline gate, add FA tests

- flash-attn: when EXP2_HF is on AND logit_softcap is active, fold
  log2(e) into the post-tanh multiplier (v_cap) instead of pre-baking
  it into qk_scale.  Pre-baking shifted the tanh knee from x≈c to
  x≈c/log2(e) and produced numerically wrong softcapped outputs
  whenever both knobs were enabled.
- flash-attn softmax (fa_softmax_thread): replace the union+memcpy
  scalar extract pattern with HVX vmux-based per-row accumulators on
  rowmax/rowsum.  Add hvx_vec_get_f16 helper in hvx-base.h.  Functional
  parity, less scalar code, clearer hf/qf16 lane-format contract.
- matmul (hmx_mat_mul_permuted_qk_0_d16a32): pick pipeline vs sequential
  layout based on whether the chunker actually yields >=2 n-chunks,
  instead of the static (m>=128 && n>=256) gate.  Avoids paying for
  output double-buffer + worker dispatch when there is no HMX/HVX
  overlap to gain (e.g. shapes that collapse to one n-chunk).
- tests: add HMX flash-attention coverage over the
  {mask, ALiBi (max_bias), logit_softcap} cross-product for the prefill
  path — head_dim 64/128, GQA 4×4, kv=512/nb=64 plus a kv=113/nb=32
  non-aligned case.
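
The softcap interaction in the first bullet can be checked numerically. This is a sketch in plain floats with hypothetical function names, assuming the usual softcap form cap · tanh(x / cap); it compares the exponent fed to exp2 when log2(e) is applied after the tanh against the pre-baked variant.

```python
import math

LOG2_E = math.log2(math.e)


def reference_exponent(qk, qk_scale, cap):
    """Natural-log domain: the value passed to exp() after softcapping."""
    return cap * math.tanh(qk * qk_scale / cap)


def exp2_exponent_post_tanh(qk, qk_scale, cap):
    """log2(e) folded into the post-tanh multiplier (the fix described above)."""
    return (LOG2_E * cap) * math.tanh(qk * qk_scale / cap)


def exp2_exponent_prebaked(qk, qk_scale, cap):
    """log2(e) pre-baked into qk_scale: the tanh sees a shifted argument."""
    return cap * math.tanh(qk * (qk_scale * LOG2_E) / cap)
```

Raising 2 to the post-tanh exponent reproduces exp() of the reference exponent, while the pre-baked form gives a different value whenever the tanh is away from its linear region, which is the knee shift the commit message describes.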

* [Help Wanted]: refactor D matrix computation into separate function for clarity and maintainability

* format code

* hexagon: looks like -O3 is causing issues with the large code base, switch to -O2 and -flto instead

* hexagon: use hex_ prefix for swap_ptr

* hexagon: move vtcm_seq_alloc into vtcm-utils.h

More vtcm allocator updates are coming so it makes sense to start the separate hdr for it.

* hmx-utils: add hmx_prefix for layout converters

* hmx-mm: move main hmx_mm functions to the end, remove unused fwd decls, etc

* hmx-mm: remove unused qweight_fetch_task_state_t and minor alignment fixes

* hmx-fa: minor alignment fixes

* hmx-fa: move hmx_flash_atten into hmx-ops.h

* hmx-fa: remove redundant workpool pointer in the hmx_fa_ctx, plus minor alignment updates

* hmx-fa: minor alignment and simplifications

* hexagon: move FA_EXP_F16 option to hostside CMake file

* hmx-fa: use hvx_vec_splat_f16 instead of fp16_to_bits

* hmx-fa: add hvx_splat_u16/u8 and use that in the fa instead of custom hvx_fill

* hmx-fa: some more alignment updates in the core fa function

* hmx-fa: keep slopes in vtcm in fp16

Saves malloc/free and removes the need for float -> fp16 downcast on every use.

* hexagon: consistent noinline usage (after static)

* hex-hmx: consistent use FARF_HIGH to enable debug output

* hmx-utils: no need for always_inline attr

* hex-hmx: consistent noinline usage (static noinline ...)

* hex-hmx: simplify init_col_scales

* hexagon: fix editorconfig errors

* hmx-mm: minor alignment fixes

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
@mario-rei mario-rei requested review from a team as code owners May 4, 2026 14:51
@mario-rei mario-rei marked this pull request as draft May 4, 2026 15:15
reichert-dev and others added 2 commits May 4, 2026 17:28
Two latent bugs surfaced together when whisper.cpp is built with
-DWHISPER_COREML=ON, both reproducible at CMake configure time:

1. install(TARGETS whisper.coreml) did not join the whisper-targets
   export set. Since whisper PRIVATE-links to whisper.coreml and is
   itself in whisper-targets, CMake refuses to generate with
       install(EXPORT "whisper-targets" ...) includes target "whisper"
       which requires target "whisper.coreml" that is not in any
       export set.
   Add EXPORT whisper-targets to the install (must come before LIBRARY
   in CMake's install(TARGETS ...) signature).

2. Once whisper.coreml is in the export set, its PUBLIC include dirs
   are validated against the install interface. The current "."
   include dir is a raw source-tree path with no
   $<BUILD_INTERFACE>/$<INSTALL_INTERFACE> guards and CMake refuses
   with
       INTERFACE_INCLUDE_DIRECTORIES property contains path "..."
       which is prefixed in the source directory.
   The headers under coreml/ are internal implementation details only
   consumed by whisper.cpp (in the same directory), so the correct fix
   is to mark them PRIVATE rather than wrapping them in install/build
   generator expressions.

Verified locally with -DWHISPER_COREML=ON -DGGML_METAL=ON: configure
clean, whisper.coreml + libwhisper.dylib build end-to-end.

This unblocks the ios-xcode-build CI job on PR tetherto#12.

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
The bindings-java tests testGetDefaultFullParams_Greedy /
testGetDefaultFullParams_BeamSearch on PR tetherto#12 fail with

    expected: <5> but was: <0>     (greedy.best_of)
    expected: <5> but was: <-1>    (beam_search.beam_size)

while whisper_full_default_params() still returns 5 for both — the
actual transcription test (testFullTranscribe) produces correct text.

Diagnosis: the Java JNA WhisperFullParams Structure is missing fields
that exist in the C whisper_full_params struct, so JNA computes wrong
offsets and reads garbage at greedy.best_of / beam_search.beam_size.

Specifically the Java layout was missing:

  1. int32_t seed           — added by tetherto's local seed patch
                              between no_speech_thold and greedy
                              (include/whisper.h:553). This single
                              omission shifts every subsequent field
                              by 4 bytes and is the proximate cause of
                              both failing assertions.
  2. bool vad               — added by upstream
  3. const char * vad_model_path
  4. whisper_vad_params vad_params (struct)

Fix:

* New WhisperVadParams.java JNA Structure mirroring
  whisper_vad_params {threshold, min_speech_duration_ms,
  min_silence_duration_ms, max_speech_duration_s, speech_pad_ms,
  samples_overlap}.
* Add `public int seed`, `public CBool vad`, `public String
  vad_model_path`, `public WhisperVadParams vad_params` fields and
  thread them into getFieldOrder() at the matching positions.

Field order in WhisperFullParams.getFieldOrder() now matches the C
struct in include/whisper.h field-for-field, so JNA-computed offsets
agree with the native side.
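
The offset shift described above can be reproduced in isolation. The following is a minimal sketch, assuming a heavily trimmed, hypothetical pair of structs (`native_layout`, `stale_binding_layout`, `greedy_params`) rather than the real `whisper_full_params`; only the relative order `no_speech_thold`, `seed`, `greedy` is taken from the commit message. It shows that a layout missing one `int32_t` resolves every later field 4 bytes too early, so a read of `greedy.best_of` through the stale offset lands on `seed` instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, trimmed stand-ins for whisper_full_params; not the real struct.
struct greedy_params {
    int best_of;
};

// Layout the native side compiles: seed sits between no_speech_thold and greedy.
struct native_layout {
    float         no_speech_thold;
    std::int32_t  seed;
    greedy_params greedy;
};

// Layout a binding assumes when it does not declare seed.
struct stale_binding_layout {
    float         no_speech_thold;
    greedy_params greedy;
};

constexpr std::size_t k_native_off = offsetof(native_layout, greedy);
constexpr std::size_t k_stale_off  = offsetof(stale_binding_layout, greedy);

// One missing int32_t moves every later field by exactly its size.
static_assert(k_native_off - k_stale_off == sizeof(std::int32_t),
              "missing field shifts later offsets by its size");

// Read best_of from native memory using the stale binding's offset.
inline int read_best_of_with_stale_offset(const native_layout & p) {
    int value = 0;
    std::memcpy(&value,
                reinterpret_cast<const unsigned char *>(&p) + k_stale_off,
                sizeof(value));
    return value;
}

// Self-check: the stale read returns seed, not greedy.best_of.
inline void demo_offset_shift() {
    native_layout p{0.6f, 0, {5}};
    assert(p.greedy.best_of == 5);
    assert(read_best_of_with_stale_offset(p) == p.seed);
}
```

Calling `demo_offset_shift()` exercises the mismatch on this toy layout. Whether the real binding read `seed` or some other bytes depends on the full struct, which this sketch does not model.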

QVAC-18300

Co-authored-by: Cursor <cursoragent@cursor.com>
@mario-rei mario-rei marked this pull request as ready for review May 5, 2026 09:44
@mario-rei

mario-rei commented May 5, 2026

Author

Successful run: https://github.com/tetherto/qvac-ext-lib-whisper.cpp/actions/runs/25340972534

Note: the run-level badge shows "cancelled" only because the 5 upstream ggml-ci-*
self-hosted-runner jobs (which don't run on tetherto's infra and don't test our
code, per earlier comment) eventually timed out in the queue and were cancelled.
The substantive jobs are all green: 53 success, 0 failures, including
ios-xcode-build, bindings-java, all windows variants, android, and the
macOS/iOS xcframework builds.

@ogad-tether ogad-tether left a comment

Review summary (automated spot-check + CI)

What I verified locally (branch pr12 @ 1318aee9)

  • Fork BCI patches still present in src/whisper.cpp (is_bci, window_mask, compute_window_mask, flash-attn path with layer_needs_mask).
  • seed still in include/whisper.h on whisper_full_params (with int32_t).
  • CODEOWNERS still lists @tetherto/ai-runtime-merge.
  • Java: WhisperFullParams includes seed; WhisperVadParams matches whisper_vad_params field order in whisper.h.

CI / mergeability

  • GitHub API: mergeable: true but mergeable_state: "unstable" — consistent with still-pending jobs last I checked: ggml-ci-mac-metal, ggml-ci-mac-vulkan, ggml-ci-x64-nvidia-cuda, ggml-ci-x64-nvidia-vulkan-cm, ggml-ci-x64-nvidia-vulkan-cm2. Recommend waiting for those to finish before merge if your branch protection keys off a green run.
  • Many other matrix jobs already passed (Android, ubuntu/windows/mac variants, VAD, etc.).

Non-blocking notes

  1. PR description typo: the CMake flag reads -DGGML_VULKAN_DISABLE_COOPMAT[2]=ON — likely meant -DGGML_VULKAN_DISABLE_COOPMAT2=ON (bracket looks like markdown/typo). Worth fixing so copy-paste builds don’t confuse people.
  2. Ruby bindings: ruby_whisper_params embeds struct whisper_full_params and defaults via whisper_full_default_params, but there is no Ruby accessor for seed (no seed in bindings/ruby). C/Java users can set it; Ruby users cannot without an extension change. Fine if Ruby isn’t a ship target; flagging in case you want parity with the fork’s C API.
  3. Merge commit message (v1.8.5 prep vs release v1.8.4.3): you already explained in the PR body — just ensure release/tag messaging matches to avoid audit confusion.

Verdict

No correctness issues spotted in the fork-specific paths I checked; I’m not issuing an Approve here only because pending CI + unstable merge state should be resolved per your org’s merge rules. After GPU runners go green, this looks reasonable to merge from a fork-parity spot-check perspective.

@mario-rei

mario-rei commented May 12, 2026

Author

Quick consumer-side validation update, plus answers to the inline notes.

Verification list

  • Fork of qvac-registry-vcpkg on main bumped to whisper-cpp 1.8.4.3#1 (plus spirv-headers dep for the Vulkan triplets).

    • mario-rei/qvac-registry-vcpkg@main HEAD: e36670e6
    • Portfile pins this PR's HEAD 1318aee9 as REF until the merge is tagged v1.8.4.3.
  • Branch in tetherto/qvac consuming it: tmp-whisper-184-3-validation. Both whispercpp addons point at the fork registry:

    • packages/transcription-whispercpp/vcpkg-configuration.json + vcpkg.json (override 1.8.4.3#1)
    • packages/bci-whispercpp/vcpkg-configuration.json + vcpkg.json (override 1.8.4.3#1)
  • Triggered addon CI / tests. Status on the validation branch:

    | Workflow | Run | Status |
    |---|---|---|
    | Prebuilds (Whispercpp) | 25714970981 | ✅ success (all 9 triplets, incl. SPIRV-fixed linux/win/android) |
    | Prebuilds (BCI Whispercpp) | 25714971248 | ✅ success |
    | On Merge (Whispercpp) | 25714964910 | ✅ success (prior HEAD) |
    | On Merge (Whispercpp) | 25730579334 | ⏳ queued on new HEAD |
    | On Merge (BCI Whispercpp) | 25730579316 | ✅ success on new HEAD |
    | On PR (Whispercpp) | 25730580796 | ⏳ in progress |
    | On PR (BCI Whispercpp) | 25730580716 | ❌ failing; orthogonal to this PR (see below), under triage |
    | Mobile Integration Tests (Whispercpp) | 25694631693 | ✅ success on parent SHA; new-HEAD mobile is dispatched via the parent on-PR workflow above |

    The BCI on-PR failure reproduces on the prior HEAD 163f3fe1 too and is unrelated to the whisper bump — on-pr-bci-whispercpp.yml runs as pull_request_target and calls a reusable workflow resolved against main, which installs LLVM 19 headers but leaves unversioned clang pointing at the runner-image default (14). The toolchain pin I pushed to the validation branch (packages/bci-whispercpp/vcpkg/toolchains/linux-clang.cmake) can't take effect there until it lands on main. Will land that separately; it's not gating this PR.

Inline notes

  1. -DGGML_VULKAN_DISABLE_COOPMAT[2]=ON typo. That's actually two flags in the description — …COOPMAT=ON -DGGML_VULKAN_DISABLE_COOPMAT2=ON. The [2] you saw was a line-wrap artifact; nothing to fix on our side, but happy to reword the bullet if it's confusing.
  2. Ruby seed parity. Acknowledged — no Ruby consumer in the qvac stack today, so this stays a known parity gap on the fork. Not in scope for this sync.
  3. Merge commit subject v1.8.5 prep vs v1.8.4.3. Will tag the squashed/merge commit v1.8.4.3 so the registry can flip from the SHA pin back to REF=v1.8.4.3 and audit messaging matches.

Consumer side is effectively green (prebuilds + on-merge across both addons). Re-review whenever convenient; I'll update once the two in-progress on-PR runs land.

GustavoA1604 pushed a commit that referenced this pull request May 19, 2026
Resolves the review comments on the merged AOSC v2.1 PR
(#22, merge commit e6ba38c). All
eight changes are minimal and behaviour-preserving except the v2.1
detection upgrade (now strict-tag with shape fallback) and the
degenerate-config guard (silence-only fallback instead of UB-adjacent
boost arithmetic). Reviewer comments classified as "perf only / out
of scope / would only add a TODO" are intentionally not addressed in
this commit -- see the plan file referenced in the PR description.

src/parakeet_sortformer.cpp -- `compress_speaker_cache`
  - Early-return when `spkcache_len_per_spk <= 0`
    (`num_spks * A_sil >= spkcache_len`). The downstream boost/top-K
    stages are mostly defended (`boost_topk_scores` already returns
    early on non-positive k), but the function was otherwise running
    a no-op pass that produced an all-silence cache via the slow
    path. Fall back to an explicit silence-only profile and bail.
  - Renamed `streaming_update`'s `chunk_pre_encode_lc` parameter to
    `committed_chunk_pre_encode`. The call site already advances
    past the left context (`chunk_pre_committed = ... + lc * D`),
    so the old `_lc` suffix was misleading. `int lc` stays -- it's
    used inside the function to index into `preds_full`, which
    still contains the left-context preds.
  - Replaced the magic `-1.0e30f` / `+1.0e30f` sentinels (4 sites)
    with named constants `k_score_neg_inf` / `k_score_pos_inf`
    backed by `std::numeric_limits<float>::{lowest,max}()`. Dropped
    the inline "-inf is UB with current FP flags" comments: IEEE
    754 +/-inf is well-defined; the original concern (avoiding
    NaN-on-arithmetic) still holds because we only store and
    compare the sentinels.
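
As an illustration of the sentinel change above, here is a minimal sketch, assuming a hypothetical helper `argmax_unmasked` that is not part of the commit. The constant names and their `std::numeric_limits<float>` backing follow the commit message; the helper only stores and compares the sentinel, which is the usage pattern the message describes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Named sentinels as described in the commit message: finite extremes
// of float rather than magic literals.
constexpr float k_score_neg_inf = std::numeric_limits<float>::lowest();
constexpr float k_score_pos_inf = std::numeric_limits<float>::max();

// Hypothetical helper: index of the best-scoring entry that is not masked,
// or -1 if every entry is masked. Expects masked.size() >= scores.size().
// Masked entries are treated as k_score_neg_inf; the sentinel is only
// stored and compared, never used in arithmetic.
inline int argmax_unmasked(const std::vector<float> & scores,
                           const std::vector<bool> &  masked) {
    float best     = k_score_neg_inf;
    int   best_idx = -1;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float s = masked[i] ? k_score_neg_inf : scores[i];
        if (s > best) {
            best     = s;
            best_idx = static_cast<int>(i);
        }
    }
    return best_idx;
}
```

Because the comparison is strict, an entry equal to the sentinel can never win, so a fully masked input yields -1 instead of an arbitrary index.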

src/parakeet_engine.cpp
  - On the AOSC path, skip the `for (cur_full) remap_id(...)` loop
    and the `prev_chunk_full_segments = std::move(cur_full)` store:
    `compute_slot_remap_` is never consulted when `cache_active` is
    true (AOSC anchors slot identity through the speaker cache), so
    the work was dead.
  - Switched v2.1 detection from pure-shape to "prefer the
    converter's `parakeet.model_variant` GGUF tag; fall back to
    `(n_layers == 17, n_mels == 128)` for legacy GGUFs". This
    prevents a future v2.2/v3 variant that happens to share v2.1's
    encoder shape from silently opting into AOSC.
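
The tag-first, shape-fallback policy above could look roughly like the following. This is a sketch under stated assumptions: the struct `model_info` and the function `is_v21_aosc` are hypothetical names, and comparing against the exact string `sortformer-streaming-v2.1-aosc` is an assumption based on the converter tags listed in this commit message, not a copy of the engine code.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the metadata the loader exposes.
struct model_info {
    std::string model_variant;  // empty on legacy GGUFs without the tag
    int         n_layers;
    int         n_mels;
};

// Prefer the explicit variant tag; fall back to the encoder shape only
// when the tag is absent.
inline bool is_v21_aosc(const model_info & m) {
    if (!m.model_variant.empty()) {
        // A tagged model is trusted as-is: a different variant that shares
        // the v2.1 encoder shape does not opt into AOSC.
        return m.model_variant == "sortformer-streaming-v2.1-aosc";
    }
    // Legacy GGUF: shape heuristic from the commit message.
    return m.n_layers == 17 && m.n_mels == 128;
}
```

For example, `is_v21_aosc({"sortformer-streaming-v2", 17, 128})` is false under this sketch even though the shape matches, which is the case the stricter detection is meant to cover.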

include/parakeet/diarization.h
  - Moved the v1-vs-v2.1 detection rationale comment out of
    parakeet_engine.cpp and into the `SortformerStreamingOptions::
    spkcache_enable` block, with a paragraph on the tag-first /
    shape-fallback policy.

src/parakeet_ctc.{h,cpp}
  - Added `std::string ParakeetCtcModel::model_variant` (optional
    GGUF metadata mirror; empty on legacy GGUFs).
  - Loader reads `parakeet.model_variant` next to the existing
    `parakeet.model.type` read; absent key -> empty string ->
    detection falls back to shape.

scripts/convert-nemo-to-gguf.py
  - New `detect_sortformer_variant(ckpt: Path)` derives a stable
    variant tag from the source .nemo filename
    (`sortformer-v1` / `sortformer-streaming-v2` /
    `sortformer-streaming-v2.1-aosc`); empty string for unknown
    checkpoints.
  - Sortformer branch of `write_gguf` writes
    `parakeet.model_variant` when the tag is non-empty.
  - `write_gguf` signature extended with `ckpt: Path`; only the
    one internal call site adjusted.

scripts/download-all-models.sh
  - Added the diar_streaming_sortformer_4spk-v2.1 fetch block (the
    AOSC fine-tune that this PR's tests target); bumped the budget
    comment from "~14 GiB" to "~14.5 GiB" and listed v2.1 in the
    contents line.

CMakeLists.txt + test/test_sortformer_streaming.cpp
  - Streaming ctest now consumes `${_qvp_sfsv21_q8_gguf}` (was
    `${_qvp_sfs_q8_gguf}`, the v2 model). The in-binary default
    GGUF path is the matching v2.1 q8_0. Aligns the test with the
    line-299 comment that says the binary "reflects the production
    v2.1 AOSC config out of the box".

test/test_utils.h (new) + test/test_sortformer_{streaming,aosc_speakers}.cpp
  - Extracted the two 40-line `load_wav_pcm16le_mono` / `file_exists`
    duplicates into a shared inline header in the `parakeet_test`
    namespace. The duplicate copies and the "duplicated here on
    purpose" comment block in test_sortformer_aosc_speakers.cpp
    are gone; both tests `#include "test_utils.h"` and use
    `using parakeet_test::...`.

Build + ctest verification
  - `cmake --build build -j` clean (no new warnings).
  - `ctest -R 'test-sortformer-(streaming|aosc-speakers)'`:
      test-sortformer-streaming ........  Passed   8.23 s
      test-sortformer-aosc-speakers-abcba . Passed  33.80 s
      test-sortformer-aosc-speakers-abcdba  Passed  36.91 s
    The locally-symlinked v2.1 GGUF predates the `parakeet.model_variant`
    key, so the AOSC tests passing here also verifies the shape-fallback
    path. Re-running the converter on the v2.1 .nemo will populate
    the new key for the strict-tag path.

Reviewer comments deferred / skipped (rationale):
  - Encoder graph cache thrashing during FIFO ramp-up (#4): perf
    only; proper fix wants pre-build-at-diarize_start + silence
    padding or a mask argument, not minimal. Tracked for a follow-up
    perf PR.
  - WAV fixtures committed as ~11 MB binaries (#8): project-wide
    Git LFS adoption decision, not a code change.
  - `ring.erase` O(n) under AOSC's aggressive trim (#10): pre-existing
    on the v1 path; wants a std::deque refactor, out of scope.
  - `encoder_ms` attribution surprising (#12): code is correct and
    matches sibling paths; the user explicitly opted against
    comment-only "clarifications".
@GustavoA1604 GustavoA1604 merged commit 9ead0b7 into tetherto:master May 20, 2026
57 of 62 checks passed
gianni-cor pushed a commit that referenced this pull request May 28, 2026
gianni-cor pushed a commit that referenced this pull request May 28, 2026
gianni-cor pushed a commit that referenced this pull request May 28, 2026