QVAC-18992: merge ggml-org @ 19eac6f0 (v0.10.2) into speech#13

Merged
Zbig9000 merged 141 commits into
tetherto:speech from
Zbig9000:QVAC-18992-merge-ggml-from-whisper-cpp
May 27, 2026

Zbig9000 commented May 19, 2026

QVAC-18992: merge ggml-org @ 19eac6f (v0.10.2) into speech

Day-4 update (2026-05-20). Group 1 (whisper-cpp PR #25) + Group 2 (whisper-cpp PR #27) merged. This PR is awaiting CI signal — see "CI gap notice" below: the speech-branch .github/workflows/ci.yml only triggers on master-targeted PRs/pushes, so this PR receives no automatic CI by design. Validation has been done out-of-band (full test-backend-ops 13465/13465 + on-device OnePlus 7T Pro + clean rebuild of tts-cpp and parakeet-cpp against the merged speech HEAD).

Day-2 update (2026-05-19, post review): Added a second merge commit 45dbdecd that pulls in upstream/speech HEAD at 2fa6e3d9 so this PR now also includes @GustavoA1604's PR #11 (9562ed04) "ggml-backend: android per-arch CPU variant dlopen fallback". That commit is QVAC-18993 territory and is required for APK consumers of the speech stack (parakeet-cpp / chatterbox / ...) to keep CPU init working after they flip to GGML_CPU_ALL_VARIANTS=ON. Merge was conflict-free (single-file, additive). No changes to the ggml-org sync content this PR was originally about.

Summary

Brings the speech branch up to the same ggml-org/master commit that qvac-ext-lib-whisper.cpp's bundled ggml is pinned to (scripts/sync-ggml.last = 19eac6f0, matching whisper.cpp v1.8.4.3). Single merge commit pulls in the full 0.9.11 → 0.10.2 upstream history — 137 commits, mostly llama.cpp sync-backs plus the v0.10.2 release bump. All speech-stack patches preserved.

Speech-stack patches preserved (verified post-merge)

| # | Patch | Where |
|---|-------|-------|
| 1 | Persistent VkPipelineCache | ggml-vulkan.cpp |
| 2 | OpenCL persistent kernel-binary cache via clCreateProgramWithBinary | ggml-opencl.cpp |
| 3 | GGML_OPENCL_ALLOW_UNKNOWN_GPU (Adreno/Intel whitelist relaxation) | ggml-opencl.cpp |
| 4 | GGML_BACKEND_DL_PROJECT_PREFIX (project-prefixed dl backend names) | ggml-backend-reg.cpp, CMakeLists.txt |
| 5 | GGML_METAL_FUSE_MV_BIAS (Q-variant mul_mv + bias/residual fusion) | ggml-metal/* |
| 6 | Supertonic Metal ops: depthwise_1d, layer_norm_channel, pw2_residual, bias_gelu, edge_pad_1d (+ ct/causal_ct variants) | ggml-metal/*, ggml.h |
| 7 | diag_mask_inf, pad_ext lp0..lp3 Metal kernels + conv_transpose_1d simd_sum rewrite | ggml-metal/* |

Manual conflict resolution

4 files in src/ggml-metal/, all resolved by union-merge (kept all speech-stack pipeline/dispatch entries and added upstream's new roll op alongside):

  • ggml-metal-device.h: ggml_metal_pipeline_with_params decls
  • ggml-metal-impl.h: ggml_metal_kargs_roll added next to supertonic kargs structs
  • ggml-metal-ops.h: op decls
  • ggml-metal-ops.cpp: dispatch cases (incl. GGML_OP_ROLL)

Side effect the reviewer should know about

The upstream v0.10.2 commit set adds ggml-vulkan.cpp code that uses spv::* enums unconditionally for SPIR-V capability injection (rounding-mode RTE) — Vulkan-Headers itself no longer bundles spirv.hpp, so the build now needs the standalone SPIRV-Headers tree on the include path. This is handled in vcpkg PR #152 (spirv-headers declared as a ggml-speech[vulkan] dep + an -isystem shim injected via CMAKE_CXX_FLAGS because ggml-vulkan's CMakeLists doesn't find_package(SpirvHeaders)).

Validation

  • ✅ Build: cmake -GNinja clean on Linux x64 (CPU + Vulkan), zero warnings.
  • test-backend-ops: 13465/13465 OK on Vulkan0 + Vulkan1 + CPU = 3/3 backends OK. Baseline pre-merge was 12075/12075; the +1390 delta consists of new tests added by the 137 upstream commits, all green.
  • whisper.cpp v1.8.4.3 (now merged into tetherto/master via PR #25) + parakeet-cpp build + link + run cleanly against this merged ggml-speech (installed locally via vcpkg PR #152's port).
  • Day-2: post-merge of upstream/speech (45dbdecd) rebuilt clean on Linux x64; the new cpu-android_armv*_* candidate names land in src/ggml-backend-reg.cpp as expected.
  • Day-3 — tts-cpp clean rebuild: clean configure + clean build of tts-cpp against -DBUILD_SHARED_LIBS=ON -DGGML_VULKAN=ON install of this branch's HEAD (45dbdecd). libtts-cpp.a produced clean, all three variants of chatterbox_tts.cpp.o (tts-cpp, test-streaming, test-cpu-caches) compile. Required a 1-line #include <atomic> fix that's now merged via whisper-cpp PR #27 (tts-cpp source lives inside the whisper.cpp repo). Root cause was a transitive-include change in the v0.10.2 sync that exposed a pre-existing missing-include — same crash reproduced on the pre-merge speech HEAD too.
  • Day-3 — parakeet-cpp clean rebuild: clean configure (-DPARAKEET_USE_SYSTEM_GGML=ON) + clean build against the same install. libparakeet.a + parakeet CLI built clean. No behavioural changes to parakeet.
  • Day-3 — OpenCL backend build: backend builds clean (124/124 compile units, no warnings) with -DGGML_OPENCL=ON -DGGML_OPENCL_ALLOW_UNKNOWN_GPU=ON. The bundled test-backend-ops correctly drops the local Adreno-incompatible device (RTX 5090) and exercises CPU + Vulkan instead — backend code path is in scope of the speech-stack OpenCL patch series and that series is byte-identical to its pre-merge form (git diff on ggml/src/ggml-opencl/ shows zero changes from the merge).
  • ⚠️ Metal not buildable on the local Linux host (needs macOS); the conflict-resolution is structurally trivial (union of decls/dispatch cases, both halves additive) — deferred to CI / device farm. The Metal patch series is documented in the commit message for 166c4e12.

CI gap notice

.github/workflows/ci.yml on the speech branch only triggers on push/pull_request for branches: [master]:

on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

This PR targets speech, so neither event fires. There is no workflow_dispatch trigger either, so the workflow can't be kicked off manually via gh workflow run. The only run that ever started on this branch (26106801732, day-1, on the older HEAD 166c4e12) is still stuck queued ~24h later because the ggml-ci-x64-nvidia-cuda / ggml-ci-mac-vulkan jobs require self-hosted runners that haven't picked up the work.

Three options to consider:

  1. Patch the workflow trigger (smallest blast radius): add branches: [master, speech] (and ideally workflow_dispatch:) to .github/workflows/ci.yml in a separate PR against tetherto/qvac-ext-ggml/speech. After merge, re-trigger this PR by pushing an empty commit.
  2. Indirect consumer-CI validation: the speech-stack ggml-speech port bumped by vcpkg PR #152 ultimately propagates into the parakeet-cpp and tts-cpp packages in the qvac monorepo. Running those packages' CIs against PR #152 transitively exercises this PR's content.
  3. Accept the out-of-band validation above (full test-backend-ops + on-device OnePlus 7T Pro + clean rebuilds of tts-cpp and parakeet-cpp).
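
To make the trigger gap and option 1 concrete, here is a minimal sketch in Python that models the branch filters as plain dictionaries. It is a simplification and not how GitHub evaluates workflows: it only does exact branch-name matching (real filters also accept glob patterns), and the names `CURRENT_TRIGGERS`, `PATCHED_TRIGGERS`, and `workflow_fires` are hypothetical helpers introduced for illustration.

```python
# Simplified model of GitHub Actions branch filters.
# Assumption: exact branch names only; glob patterns are ignored.

# Triggers as quoted from .github/workflows/ci.yml on the speech branch.
CURRENT_TRIGGERS = {
    "push": ["master"],
    "pull_request": ["master"],
}

# Option 1: add `speech` to both filters and allow manual dispatch.
PATCHED_TRIGGERS = {
    "push": ["master", "speech"],
    "pull_request": ["master", "speech"],
    "workflow_dispatch": None,  # None means "no branch filter"
}


def workflow_fires(triggers, event, branch=None):
    """Return True if `event` on `branch` would start the workflow.

    For `pull_request`, `branch` is the PR's base (target) branch.
    For `push`, it is the branch being pushed to.
    """
    if event not in triggers:
        return False
    allowed = triggers[event]
    return allowed is None or branch in allowed


for name, triggers in (("current", CURRENT_TRIGGERS), ("patched", PATCHED_TRIGGERS)):
    print(name,
          "PR->speech:", workflow_fires(triggers, "pull_request", "speech"),
          "manual:", workflow_fires(triggers, "workflow_dispatch"))
```

Under this model the current configuration never fires for a PR whose base is speech, while the patched one fires for both the PR event and a manual dispatch.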

Notes for reviewer

  • Two commits at the top of the branch: 166c4e12 (the ggml-org v0.10.2 merge, the original day-1 scope) and 45dbdecd (day-2 merge of upstream/speech to pick up Gustavo's PR #11, "ggml-backend: android per-arch CPU variant dlopen fallback"). Both are merge commits and stand on their own. All resolution decisions are in the commit messages + aiDocs/03-QVAC-18992.md + aiDocs/08-day2-fixes.md in the side-channel.
  • This branch is referenced by the ggml-speech vcpkg port on PR #152 (Zbig9000 fork + commit SHA 45dbdecd). Once this PR lands on tetherto/speech, that port needs REPO -> tetherto/qvac-ext-ggml + recompute SHA512 (already documented in PR #152's notes).
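
For the "recompute SHA512" step, a small sketch of hashing a downloaded source archive may help. It assumes the port's SHA512 field is the lowercase hex SHA-512 of the archive that vcpkg downloads for the pinned commit; the helper name `sha512_of_file` and the demo file are hypothetical, and the real archive path depends on how the tarball is fetched.

```python
import hashlib
import os
import tempfile


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-512 digest of the file at `path`."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for the real source archive.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example archive bytes")
demo_path = tmp.name
demo_digest = sha512_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

The printed value is what would be pasted into the port file once `demo_path` is replaced by the archive for the commit that lands on tetherto/speech.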

Refs

fairydreaming and others added 30 commits April 21, 2026 10:59
…y all return cudaError_t) (llama/21676)

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
* ggml: backend-agnostic tensor parallelism

* support for GPT-OSS, Qwen 3 MoE

* partial Vulkan fix

* add support for 4/8 GPUs

* unconditional peer access

* re-use buffers + ggml contexts

* fix output pattern

* NCCL support

* GGML: HIP: add RCCL support

* Remove shfl and AllReduce from backend interface

* move allocation workaround out of ggml-alloc.c

* 2d tensor set/get support

* Fix the seg fault without NCCL

* Apply suggestion from JohannesGaessler

* support for tensor dims % n_devs != 0

* fix view_offs scaling

* arbitrary num. of GPUs/tensor split

* fix compilation

* better granularity estimate

* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.

Fix compilation errors.

* partial Qwen 3 Next support

* Fix qwen3 30b (llama/8)

* Fix crash with Qwen-30B-A3B Q4_0

Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.

* Decide block size based on tensor quantization type
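
The uneven split described for Qwen-30B-A3B Q4_0 can be illustrated with a short sketch. It assumes rows are handed out in whole multiples of the granularity and that two GPUs are in use (the commit text does not state the device count); 32 is the Q4_0 block size in ggml. Reading "decide block size based on tensor quantization type" as "use the quantization block size as the granularity" is an interpretation, and `split_rows` is a hypothetical helper.

```python
def split_rows(dim, granularity, n_devs):
    """Split `dim` rows across `n_devs` devices in whole multiples of `granularity`."""
    if dim % granularity != 0:
        raise ValueError("dim must be a multiple of granularity")
    units = dim // granularity
    base, extra = divmod(units, n_devs)
    # The first `extra` devices each take one additional unit.
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devs)]


DIM = 768        # intermediate dimension quoted in the commit message
N_DEVS = 2       # assumption: GPU count is not given in the source
Q4_0_BLOCK = 32  # elements per Q4_0 block

coarse = split_rows(DIM, 256, N_DEVS)       # 3 units over 2 devices
fine = split_rows(DIM, Q4_0_BLOCK, N_DEVS)  # 24 units over 2 devices
print("granularity 256:", coarse)
print("granularity 32: ", fine)
```

With a granularity of 256 the 768 rows form three units, which cannot be divided evenly between two devices; with a granularity of 32 they form 24 units and split evenly.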

* Fix crashes due to KV cache serialization (llama/9)

KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.

* metal : fix build (llama/7)

* static memory allocations, fix usage count

* fix tensor granularity

* more even memory distribution

* use BF16 for allreduce

* rebase fixup

* better error message for unsupported architectures

* Fix device mismatch during scatter of allReduce. (llama/11)

There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies

* Enable the previous allreduce implementation. It is better in both perf and stability (llama/12)

* delay AllReduce for Moe for less I/O

* build : clean-up compile warnings

* backend : move most of the meta backend API to ggml-backend-impl.h

* cont : hide unused public API in the implementation

* llama : use llama_device + remove ggml_backend_dev_is_meta()

* ggml-backend : remove unused alloc include

* minor : remove regex include

* ggml : introduce ggml-ext.h for staging new APIs

* rebase fixup

* fix tests

* llama : more robust logic for determining Meta devices (llama/16)

* llama : more robust logic for determining Meta devices

* cont : fix devs size check

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* cont : fix log type

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* disable roundtrip for meta backend

* fix arch selection

* Qwen 3.5 support

* fix Gemma 4 MoE

* fix OpenVino, SYCL

* fix test-llama-archs for CPU-only builds

* Fix Qwen 3.5 MoE

* disable meta backend tests for WebGPU

* tests : filter CPU-based devices from the Meta backend tests (llama/17)

* meta : formatting, naming, indentation (llama/18)

* formatting : llama-model.cpp

* formatting : ggml-ext.h

* formatting : ggml-backend-meta.cpp

* meta : add TODO

* add documentation

* better error messages

* fix GPT-OSS

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…/21570)

Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support:

- vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__
- common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros
- mma.cuh: Route CDNA4 to compatible MFMA instructions:
  * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950)
  * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3)
  * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3)
- mmq.cuh: Include CDNA4 in stream-k kernel dispatch

CDNA4 is largely compatible with CDNA3 except:
- No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path
- Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here

Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1:
- Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950
- llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU):
  * f16+FA: 40,013 tok/s prefill, 254 tok/s decode
  * q8_0+FA: functional
- Flash attention: works correctly
- MMQ: works correctly with stream-k dispatch

Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>
* vulkan: Support Q1_0

* use get_dm
…agment (llama/21521)

* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log

* Merge with upstream

* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants

* Update Unary wgsl EXP and EXPM1 for f16 stability

* Fix GET_ROWS IQ4_XS struct for NaN f16 canonicalization

* Fix numerical precision for unary sqrt when working with f16

* Fix NaN canonicalization for packed integers using f16

* Update err threshold for binary div ops when using f16

* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend

* clean: uncomment existing code logs

* clean: clean the unnecessary debug info

* Refactor and generalize dequant helpers

* Remove deprecated quant structs

* Refactor shader defines to reduce repetition

* Remove error override for F16 type

* fix: fix the accidental removal of the proper initialization of ctx

* clean: clean legacy and format code

* fix: did not modify tests ops

---------

Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
* hexagon: introduce op request batching and rewrite buffer management

The host now prepares batches of requests and dispatches them via a single dspqueue message.

Buffers are mapped explicitly by NPU while processing batches.

* hex-dma: disable l2 bypass to work around a new issue caused by no flushes between Ops

* hex-utils: add explicit l2flush and l2clear helpers

* hex-opreq: use fine-grain per tensor l2 management

* hex-opreq: avoid redundant invalidates for tensors we already flushed

* hex-opreq: update debug messages

* htp-opreq: reuse ops_context

* hex-opreq: do not flush or invalidate cache lines beyond buffer boundry

* hex-opreq: fix errors in log message

* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"

This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.

* hexagon: limit l2 flushes to 1MB which covers l2 cache

* hex-opreq: limit cache flush to 4MB

Looks like 4MB cont. virtual space should cover the 1MB cache.

* hexagon: drop cache flush size to 2MB

* hex-opreq: start reworking opreq packing

* hex-opreq: introduce new way of packing opbatch where tensors are stored separately

* hex-opreq: add a simple fastrpc call to force unmap all buffers

* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size

* hex-opreq: bump opreq batch size to 256

* hex-mm: place src1 spad at the top of vtcm for easy reuse

* hex-ops: introduce internal types and disable src1 reuse for now

Nothing new just formalizing the repack / qyn.quant types we've been using.

* htp-opreq: use tensor pointers instead of copies

* hex-opreq: introduce more robust way for tracking vtcm/spad reuse

This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.

* hex-cumsum: fix error post opreq merge

* hex-opreq: move request batch handling into the session

Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.

* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx

* hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers

* hex-buf: add support for allocating shared/pinned buffer for opreqs

* hex-opbatch: make opbatches configurable

* hex-naming: better name for ggml_hexagon_shared_buffer

* hex-naming: add session->c_name() helper

* hex-opbatch: start using shm but still copy for now

* hex-opbatch: use shared buffer for packing opbatch

* hex-opbatch: better naming for opbatch related classes and code

* hex-opbatch: reuse batched tensors with same data/dims/strides

* hex-opbatch: update logging

* hex-opbatch: add support for vmem limit for op batching

* hex-opbatch: update htp side to properly support dynamic mmap/unmap

* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing

* hex-opbatch: fixed src1 handling in act ops

* hex-act: fix empty src1 handling in swiglu and friends

Simplify preamble macro while at it

* hex-mm: minor fix vtcm and dma handling in matmul

cleaning up some left-overs from merges

* hex-opbatch: allocate extra 1KB for dspqueue overhead

* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc

* hex-mm: properly handle hmx_disabled flag

* hex-ops: update comments

* hex-ops: add debug output for get/set-rows

* hex-mmap: optimize un/mapping of buffers

* hex-opreq: global cache flush and invalidate beyond 128KB threshold

* hex-ops: add super simple opfilter regex for debugging

If an Op matches the regex hex backend will reject it.

* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future

* hexagon: improved vtcm acquisition to remove inter-op overhead

Fully compatible with QNN-HTP coex

* hex-mm: fixed hvx fallback path

* hex-mm: lower the vmem threshold a bit further to ~3GB

* hexagon: update debug & error logs

This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointers to distinguish between buffer types.

* hexagon: move ops context into main context

Just a cleanup. We don't need separate contexts at this point.

* hex-opbatch: cleanup naming and headers for opbatch and related descriptors

* hex-fa: it's now better to enable FA during TG to reduce graph splits

* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var

It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.

* hexagon: fixed editorconfig check

* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* opencl: add general q5_k mv

* opencl: add flattened Q5_K mv and general Q5_K mm

* opencl: fix Q5_K unit tests
* mtmd: add Gemma 4 audio conformer encoder support

Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer.

Architecture:
- 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm
- Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm
- Full self-attention with sinusoidal RPE and sliding window mask (24)
- Logit softcapping at 50.0, ClippableLinear clamping
- Output: 1024 → 1536 → RMSNorm → multimodal embedder

Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a):
- HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3
- Standard periodic Hann window (320 samples), zero-padded to FFT size
- Semicausal left-padding (frame_length/2 samples)
- Frame count matched to PyTorch (unfold formula)
- No pre-emphasis, no Whisper-style normalization
- Mel cosine similarity vs PyTorch: 0.9998
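
Two of the preprocessing ingredients listed above have standard closed forms, sketched here in Python. The HTK mel mapping and the periodic Hann window are textbook formulas and are not taken from the mtmd source. The FFT size of 512 and the choice to append zeros on the right are assumptions for illustration only, and all function names are hypothetical.

```python
import math


def hz_to_mel_htk(freq_hz):
    """HTK mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel):
    """Inverse of hz_to_mel_htk."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def periodic_hann(length):
    """Periodic Hann window: the denominator is `length`, not `length - 1`."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def zero_pad_to_fft(window, fft_size):
    """Append zeros so the window fills one FFT frame (right-padding assumed)."""
    return window + [0.0] * (fft_size - len(window))


window = periodic_hann(320)            # 320 samples, as stated above
padded = zero_pad_to_fft(window, 512)  # 512 is an assumed FFT size

print(round(hz_to_mel_htk(1000.0), 3), len(padded), round(window[160], 6))
```

A periodic window differs from the symmetric variant only in its denominator, which matters when comparing frame-by-frame against a PyTorch reference.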

Key fixes:
- Tensor loading dedup: prevent get_tensor() from creating duplicate
  entries in ctx_data. Fixed with std::set guard.
- ClippableLinear clamp_info loading moved after per-layer tensors.
- Sliding window mask (24 positions) matching PyTorch context_size.
- Skip Whisper normalization for Gemma4 mel output.

Tested on E2B and E4B with CPU and Vulkan backends.
Transcribes: "Glad to see things are going well and business is starting
to pick up" (matching ground truth).

Ref: #21325
* CUDA: Limit DeviceSegmentedSort to immediate mode

DeviceSegmentedSort is currently not capturable in a cuda graph. Hence,
we have to go for the slower DeviceSegmentedRadixSort in that case.

Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)

  ARGSORT(type=f32,ne=[2048,512,1,1],order=1):                 12291 runs -   105.94 us/run -     8192 kB/run -   73.75 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1):                 10245 runs -   115.08 us/run -    16384 kB/run -  135.77 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1):                  5125 runs -   221.22 us/run -    32768 kB/run -  141.26 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1):                 2565 runs -   430.98 us/run -    65536 kB/run -  145.02 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1):                 1028 runs -  1185.83 us/run -   131072 kB/run -  105.41 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1):                  387 runs -  2748.62 us/run -   262144 kB/run -   90.95 GB/s

DeviceSegmentedSort in immediate mode

  ARGSORT(type=f32,ne=[2048,512,1,1],order=1):                 16388 runs -    71.17 us/run -     8192 kB/run -  109.78 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1):                 12294 runs -    81.38 us/run -    16384 kB/run -  192.00 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1):                  5125 runs -   240.81 us/run -    32768 kB/run -  129.77 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1):                 2565 runs -   406.60 us/run -    65536 kB/run -  153.71 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1):                 1285 runs -   873.23 us/run -   131072 kB/run -  143.15 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1):                  516 runs -  2288.46 us/run -   262144 kB/run -  109.24 GB/s
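
The two listings can be compared directly with a few lines of Python. The timings are copied from the figures above. Reading the last column as kB × 1024 bytes divided by 2^30 per second is an inference from the numbers rather than something the commit states, and the helper names are hypothetical.

```python
# us/run figures copied from the two listings above, keyed by ne[0].
RADIX_GRAPH_US = {2048: 105.94, 4096: 115.08, 8192: 221.22,
                  16384: 430.98, 32768: 1185.83, 65536: 2748.62}
SORT_IMMEDIATE_US = {2048: 71.17, 4096: 81.38, 8192: 240.81,
                     16384: 406.60, 32768: 873.23, 65536: 2288.46}
KB_PER_RUN = {2048: 8192, 4096: 16384, 8192: 32768,
              16384: 65536, 32768: 131072, 65536: 262144}


def bandwidth(kb_per_run, us_per_run):
    """Throughput in units of 2**30 bytes per second (assumed unit of the last column)."""
    bytes_per_run = kb_per_run * 1024
    seconds = us_per_run * 1e-6
    return bytes_per_run / seconds / 2**30


def speedup(ne0):
    """How many times faster immediate-mode DeviceSegmentedSort is than the graph-mode radix sort."""
    return RADIX_GRAPH_US[ne0] / SORT_IMMEDIATE_US[ne0]


for ne0 in sorted(RADIX_GRAPH_US):
    print(f"ne0={ne0:6d}  speedup={speedup(ne0):.2f}x  "
          f"radix={bandwidth(KB_PER_RUN[ne0], RADIX_GRAPH_US[ne0]):.2f}  "
          f"sort={bandwidth(KB_PER_RUN[ne0], SORT_IMMEDIATE_US[ne0]):.2f}")
```

On these figures the immediate-mode path is faster for five of the six shapes; ne0=8192 is the exception, where the radix sort in graph mode is slightly ahead.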

* Add test case for dispatch to DeviceSegmentedRadixSort

We currently lack a way to force graph mode in CUDA, patch callback to
invoke ggml_backend_compare_graph_backend twice to enforce each test to
run in graph mode
* use integer dot product for quantized KV flash attention

* small improvements

* fix SHMEM_STAGING indexing

* add missing KV type quants

* fixes

* add supported quants to FA tests

* readd fast paths for <8bit quants

* fix mmq gate and shmem checks
This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For
mul_mat, it does not add support for the dp4/q8_1 path, it's all via
fp16/fp32.
…lama/21644)

* Update register tiling matmul to use f32 accumulation

* fix profiling code

* Fix register tiling matmul for chrome, i'm blaming dawn

* Update batch tuning value for iOS

* compile fix

* Fix use of new load function
* cmake: fix CMP0194 warning on Windows with MSVC

Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.

The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.

This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).

Closes ggml-org/llama.cpp#20311

* cmake: apply cisc's formatting suggestion

---------

Co-authored-by: texasich <texasich@users.noreply.github.com>
* ci : re-enable mac workflows

* vulkan : fix compile warning
…device supports it (llama/21572)

* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it

* use FetchContent to get SPIRV-Headers

* Fetch spirv-headers unconditionally

* remove fetchcontent, rely on installed headers

* fix ubuntu job

* Update docs/build.md
* ggml: correct placement of ggml-ext.h

* ggml : remove ggml-ext.h

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* hexagon: add async HMX worker

Introduce hmx-worker (dedicated thread for HMX compute) to overlap HMX
matmul with HVX dequant/DMA stages in the pipeline path, replacing the
previous synchronous HMX calls that blocked the main thread.

* hexagon: cost-based VTCM chunk search for out-stationary matmul

* hexagon: fix futex race in hmx_worker_drain
Store the boolean in a local variable to avoid loading the atomic twice

* hex-mm: hmx optimize scatter/transpose and use HMX intrinsics

* hex-vmem: drop vmem limit a touch under 3GB on v73

* hexagon: add fwd declaration of htp_context

* hex-hmx: replace hmx-worker with hmx-queue that mimics dma-queue interface

Simplifies the overall implementation, reduces thread wakeup roundtrips.

* hex-mm: add debug log to hmx work func called from hmx-queue

* Update hmx-queue.h

Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <max.krasnyansky@gmail.com>
* more extensive ggml_rope documentation

* add more docs

* nits
* CUDA: manage NCCL communicators in context

* add check that all backends are CUDA

* remove unused vector, limit init to > 1 GPUs

* fix warnings

* fix cuda device, cache allreduce
yomaytk and others added 8 commits May 2, 2026 08:41
…/22578)

* Fix vectorized condition of mul-mat-fast pipeline and add vectorized variant to mul-mat-id

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* hmx: extract shared interleave headers and unify matmul batched

* hmx: add HMX-accelerated flash attention for prefill

* hmx: replace asm wrappers with Q6_ intrinsics in hmx-utils.h

Switches three single-instruction helpers from inline asm to the matching
Q6_ intrinsics, matching the style established by aizip f8737609a and used
by the upstream PR #21554 hmx-matmul-ops.c rewrite:

  hmx_set_output_scales       asm "bias=mxmem2"  -> Q6_bias_mxmem2_A
  hmx_load_tile_pair_fp16     asm packet         -> Q6_activation_hf_mxmem_RR
                                                    + Q6_weight_hf_mxmem_RR
  hmx_consume_accumulator_fp16 asm "mxmem=acc"   -> Q6_mxmem_AR_after_hf

hmx_load_tiles_fp16 stays on inline asm: it uses ":deep" activation
streaming, and the mixed Q6_activation_hf_mxmem_RR_deep + non-deep
Q6_weight_hf_mxmem_RR pair fails the HMX backend constraint check
("activate weight pair (1) exceeds limit (1)"). The asm bundle keeps
both halves in one VLIW packet and avoids the diagnostic.

Functionally equivalent — same instructions emitted; the Q6_ intrinsics
just give the compiler more visibility for scheduling.

* hmx: drop the duplicate interleave_fp16_weight_chunk_to_tiles

* hmx: apply upstream optimization to hmx-flash-attn-ops.c
apply restrict, __builtin_assume, and pointer accumulation to the three HMX workers (qk_dot, o_update, o_norm) and the matching inline HMX loops in op_hmx_flash_attn_ext.

* hmx: unify interleave helper

* hmx: multi-thread Q load / O store and enable prefill FA dispatch

Extract inline Q-load and O-store loops into worker_pool-parallel helpers
(fa_phase_q_load, fa_phase_o_store) so HVX threads split the F32↔F16
conversion work across row ranges.  Also relax the softmax threading
gate from n_row_vec_cnt >= n_threads to >= 2, which was unnecessarily
forcing single-thread fallback when n_rows_g < 512.

On the dispatch side, remove the ne[2] != 1 guard that blocked multi-head
(prefill) FA from reaching the HTP backend — GQA is already handled
internally by both the HMX and HVX flash-attention paths.

* hmx: relax matmul pipeline gate to cover k > n shapes (e.g. FFN_down)

* hmx: optimize FA softmax mask phase (no-ALiBi fast path + GQA dedup)

* hmx: Add an asm memory clobber at the phase boundary to prevent reorder bug

* [experimental]: fp16 softmax (EXP2_HF) to accelerate fa

Bake log2(e) into qk_scale and use hvx_exp2_hf directly for P and m_diff
(base-2 consistent, matches htp-ops-lib). ~22 ALU ops for 64 lanes vs
~44 for the F32 round-trip path.
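
The identity behind baking log2(e) into qk_scale is exp(x) = 2^(x · log2 e). A minimal numeric check in Python follows. It uses double precision rather than fp16, so it only illustrates the algebra and says nothing about the accuracy of hvx_exp2_hf; the function names are hypothetical.

```python
import math

LOG2_E = math.log2(math.e)


def weight_base_e(score, qk_scale):
    """Reference: natural exponential of the scaled score."""
    return math.exp(qk_scale * score)


def weight_base_2(score, qk_scale):
    """Same value via exp2, with log2(e) folded into the scale once."""
    scale_base2 = qk_scale * LOG2_E
    return 2.0 ** (scale_base2 * score)


for s in (-8.0, -0.5, 0.0, 3.0, 12.0):
    print(s, weight_base_e(s, 0.125), weight_base_2(s, 0.125))
```

Because the factor is folded into the scale ahead of time, the per-element work reduces to a single base-2 exponential.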

* hmx flash-attn: refine cost model coefficients based on profiling data

* hmx flash-attn: replace asm clobber with targeted volatile reads on vtcm_d_tiles

* hmx flash-attn: fix prefill correctness (dst indexing, softmax reduce, V stride)

* hmx flash-attn: fix p_tiles dual-tile OOB race; enable MT + pipeline

* hmx flash-attn: preserve additive mask bias in no-ALiBi fast path

The no-ALiBi fast path (max_bias==0) was skipping mask add entirely on
the assumption that mask values are only {0, -inf}.  This is wrong when
the mask carries additive positional bias — those terms were silently
dropped.  Keep the slope-mul skip (slope≡1.0) but add mask back so the
bias survives; vmux still clamps below -16 to -inf.

Also add HMX FA coverage to test-backend-ops: prefill shapes (nb=64,
nb=32) × {mask on/off} × {ALiBi on/off} × {softcap on/off}, F16 KV,
hs ∈ {64, 128}.

* hmx: fix softcap+EXP2_HF interaction, tighten matmul pipeline gate, add FA tests

- flash-attn: when EXP2_HF is on AND logit_softcap is active, fold
  log2(e) into the post-tanh multiplier (v_cap) instead of pre-baking
  it into qk_scale.  Pre-baking shifted the tanh knee from x≈c to
  x≈c/log2(e) and produced numerically wrong softcapped outputs
  whenever both knobs were enabled.
- flash-attn softmax (fa_softmax_thread): replace the union+memcpy
  scalar extract pattern with HVX vmux-based per-row accumulators on
  rowmax/rowsum.  Add hvx_vec_get_f16 helper in hvx-base.h.  Functional
  parity, less scalar code, clearer hf/qf16 lane-format contract.
- matmul (hmx_mat_mul_permuted_qk_0_d16a32): pick pipeline vs sequential
  layout based on whether the chunker actually yields >=2 n-chunks,
  instead of the static (m>=128 && n>=256) gate.  Avoids paying for
  output double-buffer + worker dispatch when there is no HMX/HVX
  overlap to gain (e.g. shapes that collapse to one n-chunk).
- tests: add HMX flash-attention coverage over the
  {mask, ALiBi (max_bias), logit_softcap} cross-product for the prefill
  path — head_dim 64/128, GQA 4×4, kv=512/nb=64 plus a kv=113/nb=32
  non-aligned case.
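
The softcap interaction in the first bullet can be reproduced numerically. The sketch below assumes the usual softcap form cap · tanh(x / cap) applied to the scaled logit, and compares a base-e reference with the two ways of introducing log2(e). It is a double-precision model of the arithmetic, not the HVX code; apart from v_cap, which the commit message names, the identifiers are hypothetical.

```python
import math

LOG2_E = math.log2(math.e)


def softcap(x, cap):
    """Assumed softcap form: cap * tanh(x / cap)."""
    return cap * math.tanh(x / cap)


def reference(qk, qk_scale, cap):
    """Base-e reference: exp of the softcapped, scaled logit."""
    return math.exp(softcap(qk_scale * qk, cap))


def prebaked(qk, qk_scale, cap):
    """log2(e) folded into qk_scale before tanh: the tanh argument is stretched."""
    return 2.0 ** softcap(qk_scale * LOG2_E * qk, cap)


def post_tanh(qk, qk_scale, cap):
    """log2(e) folded into the post-tanh multiplier (v_cap) instead."""
    v_cap = cap * LOG2_E
    return 2.0 ** (v_cap * math.tanh(qk_scale * qk / cap))


qk, qk_scale, cap = 400.0, 0.125, 50.0  # scaled logit equals cap, i.e. at the knee
print(reference(qk, qk_scale, cap))
print(post_tanh(qk, qk_scale, cap))
print(prebaked(qk, qk_scale, cap))
```

Near the knee the post-tanh variant matches the reference, while the pre-baked variant diverges because tanh saturates earlier when its argument is multiplied by log2(e).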

* [Help Wanted]: refactor D matrix computation into separate function for clarity and maintainability

* format code

* hexagon: looks like -O3 is causing issues with the large code base, switch to -O2 and -flto instead

* hexagon: use hex_ prefix for swap_ptr

* hexagon: move vtcm_seq_alloc into vtcm-utils.h

More vtcm allocator updates are coming so it makes sense to start the separate hdr for it.

* hmx-utils: add hmx_prefix for layout converters

* hmx-mm: move main hmx_mm functions to the end, remove unused fwd decls, etc

* hmx-mm: remove unused qweight_fetch_task_state_t and minor alignment fixes

* hmx-fa: minor alignment fixes

* hmx-fa: move hmx_flash_atten into hmx-ops.h

* hmx-fa: remove redundant workpool pointer in the hmx_fa_ctx, plus minor alignment updates

* hmx-fa: minor alignment and simplifications

* hexagon: move FA_EXP_F16 option to hostside CMake file

* hmx-fa: use hvx_vec_splat_f16 instead of fp16_to_bits

* hmx-fa: add hvx_splat_u16/u8 and use that in the fa instead custom hvx_fill

* hmx-fa: some more alignment updates in the core fa function

* hmx-fa: keep slopes in vtcm in fp16

Saves malloc/free and removes the need for float -> fp16 downcast on every use.

* hexagon: consistent noinline usage (after static)

* hex-hmx: consistent use FARF_HIGH to enable debug output

* hmx-utils: no need for always_inline attr

* hex-hmx: consistent noinline usage (static noinline ...)

* hex-hmx: simplify init_col_scales

* hexagon: fix editorconfig errors

* hmx-mm: minor alignment fixes

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Brings the speech branch up to the same ggml-org/master commit that
qvac-ext-lib-whisper.cpp's bundled ggml is pinned to (whisper.cpp's
scripts/sync-ggml.last = 19eac6f), pulling in the entire
0.9.11 -> 0.10.2 upstream history (137 commits, mostly llama.cpp
sync-backs plus the v0.10.2 release bump).

Speech-stack patches preserved (verified post-merge):
- Persistent VkPipelineCache (ggml-vulkan.cpp)
- OpenCL persistent kernel binary cache via clCreateProgramWithBinary
- GGML_OPENCL_ALLOW_UNKNOWN_GPU (Adreno/Intel whitelist relaxation)
- GGML_BACKEND_DL_PROJECT_PREFIX (project-prefixed dl backend names)
- GGML_METAL_FUSE_MV_BIAS (Q-variant mul_mv + bias/residual fusion)
- Supertonic Metal ops: depthwise_1d, layer_norm_channel, pw2_residual,
  bias_gelu, edge_pad_1d (+ ct/causal_ct variants in ggml.h)
- diag_mask_inf, pad_ext lp0..lp3 Metal kernels
- conv_transpose_1d simd_sum rewrite

Manual conflict resolution (4 files in src/ggml-metal/):
- ggml-metal-device.h:  union supertonic + diag_mask_inf + (new) roll pipeline decls
- ggml-metal-impl.h:    keep all supertonic kargs structs + add (new) ggml_metal_kargs_roll
- ggml-metal-ops.h:     union supertonic + diag_mask_inf + (new) roll op decls
- ggml-metal-ops.cpp:   union dispatch cases for supertonic + diag_mask_inf + (new) GGML_OP_ROLL

Validation:
- Build: cmake + ninja clean, no warnings (Linux x64, CPU+Vulkan).
- test-backend-ops: 13465/13465 OK on Vulkan0 + Vulkan1, CPU skipped,
  3/3 backends OK (baseline was 12075/12075 -- +1390 new upstream tests
  all green; no regressions in pre-existing tests).
- Metal/OpenCL not buildable on this Linux x64 host; deferred to CI/
  device farm.

Refs QVAC-18992.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ct_1d_f32

The Metal pad kernels were addressing the source's dim-0 with element
stride sizeof(float) (src0_ptr[i00]), implicitly assuming nb00 == 4.
For any non-contiguous source (e.g. a tensor reshaped via ggml_permute),
nb00 != sizeof(float) and the kernel reads from the wrong byte offsets,
producing garbage. CPU pad_ext / pad_reflect_1d honor nb[] correctly,
which is why test-backend-ops surfaces the mismatch on Metal only.

Concretely this fixes the failing Mac M2 test case:
  PAD(type=f32, ne_a=[11,22,33,44],
      lp0=1,rp0=2, lp1=3,rp1=4, lp2=5,rp2=6, lp3=7,rp3=8,
      tfrm=2, circular=0)  -> ERR = 1.93 > 1e-7

where tfrm=2 applies ggml_permute(a, 2, 1, 0, 3) so nb00 becomes the
original nb02 (968 bytes for [11,22,33,44] f32), and the old kernel
indexed past the row.
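
The stride arithmetic above can be reproduced with a small host-side sketch. The helper names contiguous_strides and permute_axes are hypothetical and are not part of ggml; the sketch only assumes the ggml convention that ggml_permute moves source axis i to position axis[i], and that sizeof(float) is 4.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte strides of a contiguous tensor:
// nb[0] = element size, nb[i] = nb[i-1] * ne[i-1].
std::array<std::size_t, 4> contiguous_strides(const std::array<int64_t, 4> & ne,
                                              std::size_t elem_size) {
    std::array<std::size_t, 4> nb{};
    nb[0] = elem_size;
    for (int i = 1; i < 4; ++i) {
        nb[i] = nb[i - 1] * static_cast<std::size_t>(ne[i - 1]);
    }
    return nb;
}

// Assumed to mirror ggml_permute: source axis i lands at position axis[i].
template <typename T>
std::array<T, 4> permute_axes(const std::array<T, 4> & src,
                              const std::array<int, 4> & axis) {
    std::array<T, 4> out{};
    for (int i = 0; i < 4; ++i) {
        assert(axis[i] >= 0 && axis[i] < 4);
        out[axis[i]] = src[i];
    }
    return out;
}
```

For ne = [11,22,33,44] with 4-byte elements, contiguous_strides gives nb = [4, 44, 968, 31944], and permute_axes with axis = (2, 1, 0, 3) puts 968 into slot 0, which is the nb00 value quoted above.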

Fix: address dim 0 via byte arithmetic on a `device const char *` row
pointer (src0_row + i00*args.nb00), then a single float load. This is
correct for both contiguous (nb00 = 4) and permuted (nb00 != 4) inputs,
and avoids alignment hazards because nb00 for an f32 ggml tensor is
always a multiple of 4.
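
A minimal CPU-side sketch of the two addressing schemes, for illustration only. This is not the Metal kernel source: the function names load_elem_indexed and load_byte_strided are hypothetical, and only the variable names src0_ptr, src0_row, i00 and nb00 are taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Old pattern: element indexing, which silently assumes nb00 == sizeof(float).
float load_elem_indexed(const float * src0_ptr, int64_t i00) {
    assert(i00 >= 0);
    return src0_ptr[i00];
}

// Fixed pattern: step through dim 0 in bytes using nb00, then do one float load.
// memcpy is used here so the host-side sketch has no aliasing concerns.
float load_byte_strided(const char * src0_row, int64_t i00, std::size_t nb00) {
    assert(i00 >= 0);
    float v;
    std::memcpy(&v, src0_row + static_cast<std::size_t>(i00) * nb00, sizeof(float));
    return v;
}
```

With a contiguous buffer {0, 1, 2, 3, 4, 5} viewed as a transposed 3x2 tensor (nb00 = 12 bytes), load_byte_strided(row, 1, 12) reads the value 3, while load_elem_indexed(ptr, 1) reads the value 1. The two agree only when nb00 equals sizeof(float).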

The same anti-pattern existed in kernel_pad_reflect_1d_f32; no current
test-backend-ops case exercises it (test_pad_reflect_1d is 2D and never
permutes its arg), but the failure mode would be identical, so the two
kernels are fixed together to keep them in step.

Destination indexing (dst_ptr[i0]) is unchanged: pad outputs are freshly
allocated and always contiguous (nb0 = sizeof(float)).

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000

Author

⚠️ Metal not buildable on the local Linux host (needs macOS); the conflict-resolution is structurally trivial (union of decls/dispatch cases, both halves additive) — deferred to CI / device farm. The Metal patch series is documented in the commit message for 166c4e1.
@ishanvohra2 Can you paste your results from building it on macOS?

@ishanvohra2


⚠️ Metal not buildable on the local Linux host (needs macOS); the conflict-resolution is structurally trivial (union of decls/dispatch cases, both halves additive) — deferred to CI / device farm. The Metal patch series is documented in the commit message for 166c4e1.
@ishanvohra2 Can you paste your results from building it on macOS?

I was able to build and run the tests on a Mac M2. All tests passed on all 3 backends.

…upertonic_depthwise_1d

The ggml-org v0.10.2 merge into speech (commit 166c4e1) dropped the
'typedef struct {' header line that opens the
ggml_metal_kargs_supertonic_depthwise_1d declaration in
src/ggml-metal/ggml-metal-impl.h. As a result, the struct fields became
file-scope variable redeclarations, and the closing '} ggml_metal_kargs_supertonic_depthwise_1d;'
parsed as an extraneous brace + invalid type declaration. The cascade
broke compilation of every Apple Metal target (darwin-arm64, ios-arm64,
ios-arm64-simulator, ios-x64-simulator) at:

  src/ggml-metal/ggml-metal-impl.h:1054:1: error: extraneous closing brace ('}')
  src/ggml-metal/ggml-metal-impl.h:1054:3: error: a type specifier is required for all declarations
  src/ggml-metal/ggml-metal-ops.cpp:4300:45: error: expected ';' after expression
  src/ggml-metal/ggml-metal-ops.cpp:4300:46: error: use of undeclared identifier 'args'

This was caught when the transcription-whispercpp addon pulled PR tetherto#13
via an overlay port and the qvac CI matrix exercised the Apple prebuild
triplets (which the qvac-ext-ggml repo does not currently test on its
own CI, per QVAC-18992 follow-ups).

Fix: re-insert the 'typedef struct {' line. Verified balanced:
62 'typedef struct {' openings, 62 '} ggml_metal_kargs_*;' closings.
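
A balance check of this kind could be sketched as follows. This is an assumption about how such a count might be automated, not the procedure actually used for the commit; count_occurrences and kargs_typedefs_balanced are hypothetical helpers, and the closing pattern is approximated by the prefix "} ggml_metal_kargs_".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count non-overlapping occurrences of needle in text.
std::size_t count_occurrences(const std::string & text, const std::string & needle) {
    assert(!needle.empty());
    std::size_t count = 0;
    for (std::size_t pos = text.find(needle); pos != std::string::npos;
         pos = text.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}

// True when every "} ggml_metal_kargs_" closing has a "typedef struct {" opening.
// This is a purely textual check; it does not parse the header.
bool kargs_typedefs_balanced(const std::string & header_text) {
    return count_occurrences(header_text, "typedef struct {") ==
           count_occurrences(header_text, "} ggml_metal_kargs_");
}
```

Feeding the contents of ggml-metal-impl.h to kargs_typedefs_balanced would flag a dropped opening line like the one described above, because the closing count would exceed the opening count by one.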

Tested locally on linux-x64 via the overlay-port path (107/107 C++
addon tests pass against d39c0d2 + this fix); Apple coverage will be
validated by qvac CI after the overlay is bumped to the new HEAD.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Zbig9000 Zbig9000 merged commit c9126af into tetherto:speech May 27, 2026