hexagon: minor refresh for HMX FA and MM #23796

Merged
max-krasnyansky merged 12 commits into ggml-org:master from qualcomm:hexagon-hmx-revisit-fa-and-mm
May 28, 2026

@max-krasnyansky
Member

Overview

Another pass at improving HMX FA and MM, and FA in general.
This includes a critical fix for Gemma-4 on Hexagon v79: v79-specific issues with INF and NaN handling were breaking Gemma-4 FA beyond a certain context size.

The changes also provide a small perf uplift, especially for token generation on older SoCs with fewer HVX threads.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, Antigravity helped find the INF/NaN bug on v79 and helped with some refactoring of the HMX MM code (i.e. generating nice macros, etc.). Otherwise written/reviewed/tested manually.

@max-krasnyansky max-krasnyansky requested a review from a team as a code owner May 28, 2026 03:27
@github-actions github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels May 28, 2026
@max-krasnyansky
Member Author

@lhez @ggml-org/maintainers can I get some reviews/approvals, please?

@max-krasnyansky max-krasnyansky merged commit a919001 into ggml-org:master May 28, 2026
36 checks passed
adrianhoehne pushed a commit to adrianhoehne/llama.cpp that referenced this pull request May 28, 2026
* hex-fa: clean up qf32/fp32 handling and stride handling

* hex-fa: fix corner case fp NAN issues that were causing bad output from gemma4 on v79

* hex-fa: vectorize leftover handling

* hex-fa: avoid HVX fallback during token gen; HMX has more FP16 compute capacity

* hmx-mm: remove dead code

* hmx-mm: use fastdiv in x4x2 dequant

* hmx-mm: sandwich dequant and scatter to improve perf

* hmx-mm: fixed rebase conflicts

* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv

* hmx-mm: an even earlier dispatch for per-type dequant

* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
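
As a rough illustration of the "dequant linear types like q4_0 and q4_1 without the LUTs" item above, here is a scalar sketch of the idea. It is not the HVX/HMX code from this PR. It assumes the usual ggml block layout (32 weights per block, low nibbles holding the first half and high nibbles the second half) and uses a plain `float` scale instead of fp16 to stay self-contained. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK 32 /* weights per block, as in ggml's q4_0 and q4_1 */

/* LUT variant: each 4-bit code is mapped to its signed value via a table. */
static const float q4_lut[16] = {
    -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7
};

static void dequant_q4_0_lut(const uint8_t *qs, float d, float *y) {
    assert(qs != NULL && y != NULL);
    for (size_t j = 0; j < QK / 2; j++) {
        y[j]          = q4_lut[qs[j] & 0x0F] * d; /* low nibble  */
        y[j + QK / 2] = q4_lut[qs[j] >> 4] * d;   /* high nibble */
    }
}

/* Linear variant: q4_0 is just (code - 8) * d, so no table is needed. */
static void dequant_q4_0_linear(const uint8_t *qs, float d, float *y) {
    assert(qs != NULL && y != NULL);
    for (size_t j = 0; j < QK / 2; j++) {
        y[j]          = (float)((int)(qs[j] & 0x0F) - 8) * d;
        y[j + QK / 2] = (float)((int)(qs[j] >> 4) - 8) * d;
    }
}

/* q4_1 is also linear: code * d + m, with a per-block minimum m. */
static void dequant_q4_1_linear(const uint8_t *qs, float d, float m, float *y) {
    assert(qs != NULL && y != NULL);
    for (size_t j = 0; j < QK / 2; j++) {
        y[j]          = (float)(qs[j] & 0x0F) * d + m;
        y[j + QK / 2] = (float)(qs[j] >> 4) * d + m;
    }
}
```

Both q4_0 variants produce identical values. The linear form replaces a table lookup with a subtract and a multiply, which is presumably what makes it a better fit for vector units, though the actual speedup depends on the real HVX implementation.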
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 28, 2026
* origin/master: (32 commits)
hexagon: basic/generic op fusion support and RMS_NORM+MUL fusion (ggml-org#23835)
mtmd-debug: add color and rainbow mode (ggml-org#23829)
mtmd: fix gemma 4 projector pre_norm (ggml-org#23822)
opencl: move backend info printing into its own function (ggml-org#23702)
ci : run ui publish on ubuntu-slim (ggml-org#23818)
ui: fix audio and video modality detection (ggml-org#23756)
ci : releases use Github-hosted builds for the UI (ggml-org#23823)
app : improve help output (ggml-org#23805)
mtmd: n_head_kv defaults to n_head (ggml-org#23782)
mtmd: fix gemma 4 audio rms norm eps (ggml-org#23815)
ci : change Vulkan builds to Release to reduce ccache (ggml-org#23820)
arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (ggml-org#23167)
test-llama-archs: fix table format [no release] (ggml-org#23810)
ggml: auto apply iGPU flag CUDA/HIP if integrated device (ggml-org#23007)
mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for … (ggml-org#23729)
CUDA: route batch>=4 quantized matmul to MMQ on AMD MFMA hardware (ggml-org#23227)
server: minor tweaks to use more cpp features (ggml-org#23785)
hexagon: minor refresh for HMX FA and MM (ggml-org#23796)
vulkan: fast path for walsh-hadamard transform (ggml-org#23687)
chat : add Granite 4.1 chat template (ggml-org#23518)
...
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
Labels

ggml (changes relating to the ggml tensor library for machine learning), Hexagon

4 participants