hexagon: minor refresh for HMX FA and MM by max-krasnyansky · Pull Request #23796 · ggml-org/llama.cpp

max-krasnyansky · 2026-05-28T03:27:13Z

Overview

Another pass at improving HMX FA and MM, and FA in general.
This includes a critical fix for Gemma-4 on Hexagon v79: v79-specific issues with INF and NaN handling were breaking Gemma-4 FA beyond a certain context size.

The changes also provide a small performance uplift, especially for token generation on older SoCs with fewer HVX threads.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes. Antigravity helped find the INF/NaN bug on v79 and helped with some refactoring of the HMX MM code (i.e., generating nice macros, etc.). Everything else was written, reviewed, and tested manually.

max-krasnyansky · 2026-05-28T04:07:00Z

@lhez @ggml-org/maintainers can I get some reviews/approvals, please?

* hex-fa: clean up qf32/fp32 handling and stride handling
* hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79
* hex-fa: vectorize leftover handling
* hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity
* hmx-mm: remove dead code
* hmx-mm: use fastdiv in x4x2 dequant
* hmx-mm: sandwich dequant and scatter to improve perf
* hmx-mm: fixed rebase conflicts
* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv
* hmx-mm: an even earlier dispatch for per-type dequant
* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs
  This is a bit faster than LUT.
* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
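
Two of the commits above mention fastdiv and precomputing it. As a rough illustration of the general technique (not the Hexagon backend's actual helper, whose names and layout are not shown here), the sketch below precomputes a multiplier and shift for a fixed divisor once, then replaces each division with a multiply, an add, and a shift. `fastdiv_t`, `fastdiv_init`, and `fastdiv_u32` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical precomputed constants for dividing by a fixed divisor d.
typedef struct {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift, ceil(log2(d))
} fastdiv_t;

// Precompute once per divisor (for example, once per tensor shape).
static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d != 0);

    uint32_t L = 0;
    while (L < 32 && ((uint64_t) 1 << L) < d) {
        L++;
    }

    fastdiv_t f;
    f.mp = (uint32_t) ((((uint64_t) 1 << 32) * (((uint64_t) 1 << L) - d)) / d + 1);
    f.L  = L;
    return f;
}

// n / d using only a multiply, an add, and a shift.
// The sum is kept in 64 bits so it cannot overflow for any 32-bit n.
static uint32_t fastdiv_u32(uint32_t n, fastdiv_t f) {
    const uint64_t hi = ((uint64_t) n * f.mp) >> 32; // high half of n * mp
    return (uint32_t) ((hi + n) >> f.L);
}
```

The constants depend only on the divisor, so computing them outside the inner loop leaves no division on the per-element path.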

max-krasnyansky and others added 12 commits May 27, 2026 16:42

hex-fa: clean up qf32/fp32 handling and stride handling

d3be6e8

hex-fa: fix corner case fp NAN issues that were cause bad output from gemma4 on v79

ec55d37

hex-fa: vectorize leftover handling

16affaa

hex-fa: avoid HVX fallback during token gen HMX has more FP16 compute capacity

7852e3a

hmx-mm: remove dead code

c248ae5

hmx-mm: use fastdiv in x4x2 dequant

ed85a36

hmx-mm: sandwich dequant and scatter to improve perf

c0597e5

hmx-mm: fixed rebase conflicts

cd1aa81

hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

2213cc5

hmx-mm: an even earlier dispatch for per-type dequant

b3ad38c

hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

519372d

This is a bit faster than LUT.

hex-cmake: one more tweak for lto

2fa1211
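
To illustrate what "linear types without the LUTs" can mean, here is a scalar sketch that dequantizes one q4_0 block two ways: through a 16-entry lookup table, and by computing `(nibble - 8) * d` directly. It uses the standard ggml q4_0 nibble order (low nibbles are elements 0..15, high nibbles are elements 16..31), not the repacked x4x2 layout or the HVX code from this PR. `block_q4_0_f32` is a hypothetical simplification that stores the scale as `float` instead of fp16, and both function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK4_0 32

// Simplified q4_0 block for illustration: ggml stores the scale as fp16.
typedef struct {
    float   d;             // per-block scale
    uint8_t qs[QK4_0 / 2]; // 32 4-bit quants, two per byte
} block_q4_0_f32;

// Table mapping each 4-bit code to its signed value (code - 8).
static const float q4_0_lut[16] = {
    -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7,
};

// LUT variant: one table lookup per nibble.
static void dequant_q4_0_lut(const block_q4_0_f32 *b, float *y) {
    assert(b != NULL && y != NULL);
    for (int j = 0; j < QK4_0 / 2; j++) {
        y[j]             = q4_0_lut[b->qs[j] & 0x0F] * b->d;
        y[j + QK4_0 / 2] = q4_0_lut[b->qs[j] >> 4] * b->d;
    }
}

// Linear variant: q4_0 is an affine map of the code, so no table is needed.
static void dequant_q4_0_linear(const block_q4_0_f32 *b, float *y) {
    assert(b != NULL && y != NULL);
    for (int j = 0; j < QK4_0 / 2; j++) {
        y[j]             = (float) ((int) (b->qs[j] & 0x0F) - 8) * b->d;
        y[j + QK4_0 / 2] = (float) ((int) (b->qs[j] >> 4) - 8) * b->d;
    }
}
```

Both variants produce identical output. The same idea carries over to q4_1, where each value is `d * nibble + m` with a per-block minimum `m`.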

max-krasnyansky requested a review from a team as a code owner May 28, 2026 03:27

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels May 28, 2026

lhez approved these changes May 28, 2026

CISC approved these changes May 28, 2026

max-krasnyansky merged commit a919001 into ggml-org:master May 28, 2026
36 checks passed

max-krasnyansky merged 12 commits into ggml-org:master from qualcomm:hexagon-hmx-revisit-fa-and-mm
