hexagon: HMX quantized matmul rework by max-krasnyansky · Pull Request #23368 · ggml-org/llama.cpp

max-krasnyansky · 2026-05-20T00:11:15Z

Overview

This PR updates the HMX matmul to use activation depth mode and simplifies the quantized HMX matmul implementation.
Based on testing with the latest models (see the sweep below), we no longer need the non-pipelined kernel flavors.
They may have provided benefits at some point, but after all the recent updates and fixes they no longer do.

Additional information

Details

## S26+ (Hex v81)

| model            |       size | dev  |    test |            t/s | hmx-rework   t/s |
| ---------------- | ---------: | ---- | ------: | -------------: | ---------------: |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |    pp25 |   56.58 ± 1.01 |     57.55 ± 0.28 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp128 |  368.03 ± 3.21 |    473.84 ± 9.24 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp200 |  474.88 ± 2.23 |    548.86 ± 2.79 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp256 |  559.54 ± 2.60 |    638.01 ± 9.17 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp300 |  521.03 ± 5.11 |    584.54 ± 2.14 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp400 |  617.81 ± 1.90 |    645.06 ± 2.47 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp512 |  652.48 ± 0.63 |    689.97 ± 4.18 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp800 |  580.48 ± 2.90 |    592.78 ± 3.14 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |  pp1024 |  598.40 ± 2.59 |    628.46 ± 3.36 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |  pp1334 |  544.45 ± 1.81 |    574.65 ± 2.57 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |  pp2048 |  533.06 ± 0.88 |    548.57 ± 4.73 |

| model            |       size | dev  |    test |            t/s | hmx-rework   t/s |
| ---------------- | ---------: | ---- | ------: | -------------: | ---------------: |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |    pp25 |   82.45 ± 0.44 |     81.58 ± 0.70 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp128 |  395.62 ± 2.65 |    418.83 ± 2.58 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp200 |  449.89 ± 2.01 |    441.39 ± 5.24 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp256 |  477.68 ± 2.66 |    461.36 ± 9.62 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp300 |  481.75 ± 4.84 |    465.00 ± 8.40 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp400 |  498.12 ± 3.78 |    470.01 ± 9.17 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp512 |  511.80 ± 3.57 |    502.04 ± 3.99 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp800 |  503.10 ± 4.39 |    487.99 ± 3.19 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1024 |  469.80 ± 4.61 |    486.48 ± 2.59 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1334 |  452.50 ± 1.39 |    470.13 ± 1.65 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp2048 |  446.86 ± 3.48 |    464.10 ± 1.26 |

| model            |       size | dev  |    test |            t/s | hmx-rework   t/s |
| ---------------- | ---------: | ---- | ------: | -------------: | ---------------: |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |    pp25 |   91.21 ± 0.90 |     91.29 ± 0.84 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp128 |  410.90 ± 4.29 |    520.26 ± 2.22 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp200 |  517.40 ± 4.48 |    625.35 ± 5.89 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp256 |  587.21 ± 7.23 |    700.74 ± 7.44 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp300 |  594.04 ± 6.83 |    658.50 ± 6.24 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp400 |  641.20 ± 4.94 |    707.10 ± 3.80 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp512 |  694.36 ± 6.46 |    763.82 ± 7.01 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp800 |  630.60 ± 5.71 |    677.29 ± 2.66 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1024 |  640.28 ± 1.00 |    690.08 ± 2.20 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1334 |  607.27 ± 2.98 |    662.12 ± 2.12 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp2048 |  577.48 ± 2.44 |    622.38 ± 4.18 |
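To read the sweep as relative change rather than raw throughput, the two t/s columns can be divided row by row. A minimal sketch in Python, using three llama 3B Q4_0 rows copied from the S26+ table above; the `rows` layout and the `speedup_pct` helper are illustrative assumptions and are not part of llama.cpp or llama-bench:

```python
# Rows copied from the S26+ llama 3B Q4_0 table:
# (test, baseline t/s, hmx-rework t/s). The tuple layout is an assumption
# made for this sketch, not a llama-bench output format.
rows = [
    ("pp25", 91.21, 91.29),
    ("pp128", 410.90, 520.26),
    ("pp2048", 577.48, 622.38),
]


def speedup_pct(baseline: float, rework: float) -> float:
    """Relative throughput change of the rework over the baseline, in percent."""
    return (rework / baseline - 1.0) * 100.0


for test, baseline, rework in rows:
    print(f"{test:>7}: {speedup_pct(baseline, rework):+6.1f}%")
```

On these three rows the gain is largest at pp128 and negligible at pp25.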

## S24U (Hex v75)

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |    pp25 |  108.16 ± 1.07 |   107.54 ± 0.66 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp128 |  877.65 ± 7.46 |   888.87 ± 0.90 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp200 |  970.85 ± 9.22 |  1005.83 ± 6.19 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp256 | 1087.36 ± 6.04 |  1119.79 ± 9.91 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp300 |  909.44 ± 9.01 |   946.88 ± 9.23 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp400 | 1004.08 ± 8.82 |  1029.45 ± 8.90 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp512 | 1023.82 ± 9.24 |  1060.55 ± 9.52 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp800 |  879.68 ± 6.11 |   907.78 ± 3.99 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |  pp1024 |  868.60 ± 3.75 |   895.50 ± 4.57 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |  pp1334 |  819.69 ± 7.21 |   850.83 ± 2.99 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |  pp2048 |  754.53 ± 2.69 |   786.59 ± 1.90 |

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |    pp25 |   39.60 ± 0.22 |    40.32 ± 0.14 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp128 |  342.47 ± 2.82 |   342.26 ± 2.52 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp200 |  411.13 ± 5.55 |   422.50 ± 3.39 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp256 |  458.27 ± 5.34 |   468.22 ± 7.34 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp300 |  421.36 ± 3.38 |   436.76 ± 1.48 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp400 |  460.33 ± 3.67 |   476.64 ± 1.38 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp512 |  487.08 ± 2.82 |   510.72 ± 2.04 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp800 |  411.20 ± 3.16 |   429.67 ± 1.53 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1024 |  429.99 ± 3.02 |   448.14 ± 4.66 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1334 |  400.51 ± 0.92 |   425.18 ± 2.94 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp2048 |  382.92 ± 1.62 |   410.36 ± 0.78 |

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |    pp25 |   41.64 ± 1.07 |    43.57 ± 0.88 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp128 |  252.32 ± 4.31 |   271.20 ± 0.08 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp200 |  276.95 ± 1.73 |   290.11 ± 1.44 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp256 |  291.52 ± 5.04 |   314.51 ± 5.57 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp300 |  303.51 ± 8.51 |   311.49 ± 1.96 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp400 |  306.40 ± 9.38 |   317.73 ± 2.08 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp512 |  321.93 ± 5.62 |   353.54 ± 0.78 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp800 |  293.84 ± 9.57 |   340.43 ± 0.66 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1024 |  296.90 ± 9.93 |   339.94 ± 0.72 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1334 |  291.18 ± 9.10 |   323.27 ± 4.19 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp2048 |  299.31 ± 7.84 |   327.53 ± 0.74 |

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |    pp25 |   25.86 ± 0.23 |    25.73 ± 0.17 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp128 |  260.53 ± 1.62 |   259.61 ± 1.74 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp200 |  286.95 ± 0.38 |   285.16 ± 1.22 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp256 |  324.01 ± 1.44 |   324.04 ± 1.24 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp300 |  307.37 ± 0.33 |   307.56 ± 0.51 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp400 |  322.96 ± 1.08 |   323.75 ± 1.13 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp512 |  339.22 ± 1.30 |   338.42 ± 1.28 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp800 |  295.73 ± 0.80 |   295.85 ± 0.47 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |  pp1024 |  302.61 ± 0.43 |   303.76 ± 0.55 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |  pp1334 |  288.09 ± 0.43 |   289.76 ± 0.48 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |  pp2048 |  265.52 ± 0.57 |   265.32 ± 0.68 |
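Some of the deltas in the sweep are smaller than the reported run-to-run spread. A rough way to separate real changes from noise is to compare each delta against the combined ± of both columns. The sketch below assumes the ± values are standard deviations over repeated runs and uses three qwen3 4B Q4_0 rows copied from the S24U table above; the `exceeds_noise` helper and its threshold `k` are hypothetical choices for illustration:

```python
import math

# Rows copied from the S24U qwen3 4B Q4_0 table:
# (test, baseline t/s, baseline ±, hmx-rework t/s, hmx-rework ±).
rows = [
    ("pp128", 260.53, 1.62, 259.61, 1.74),
    ("pp512", 339.22, 1.30, 338.42, 1.28),
    ("pp1334", 288.09, 0.43, 289.76, 0.48),
]


def exceeds_noise(base, base_err, new, new_err, k=2.0):
    """True if |new - base| is larger than k times the combined spread.

    The combined spread is sqrt(base_err^2 + new_err^2), which assumes the
    two measurements are independent. k=2.0 is an arbitrary threshold.
    """
    combined = math.hypot(base_err, new_err)
    return abs(new - base) > k * combined


for test, base, base_err, new, new_err in rows:
    verdict = "outside" if exceeds_noise(base, base_err, new, new_err) else "within"
    print(f"{test:>7}: delta {new - base:+.2f} t/s, {verdict} noise")
```

With `k=2.0`, the pp128 and pp512 deltas fall within noise, while the pp1334 delta does not.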

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO


max-krasnyansky · 2026-05-20T05:27:28Z

@ggml-org/maintainers can I get the second approval please

* hmx-mm: update debug logging in hmx-mm
* hmx-mm: update dequant logic to use HVX_vector_x2/4
* hmx-mm: remove non-pipelined version of the quantize matmul

  It seems that we don't really need non-pipelined version
* hmx-mm: use activation depth mode and update naming

  Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
* hex-mm: minor hmx matmul naming updates
* hmx-mm: remove unused vars
* snapdragon: scripts bump default ubatch-size to 1K
* hexagon: combine HMX and power and clock settings into a single set_power call
* hmx-mm: remove leftover of the scale repl helper
* hexagon: fix editconf error

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>


max-krasnyansky and others added 9 commits May 19, 2026 14:52

hmx-mm: update debug logging in hmx-mm

5cbab92

hmx-mm: update dequant logic to use HVX_vector_x2/4

7147fba

hmx-mm: remove non-pipelined version of the quantize matmul

3736e18

It seems that we don't really need non-pipelined version

hmx-mm: use activation depth mode and update naming

9820f1a

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>

hex-mm: minor hmx matmul naming updates

d48ce85

hmx-mm: remove unused vars

d84255c

snapdragon: scripts bump default ubatch-size to 1K

2649238

hexagon: combine HMX and power and clock settings into a single set_power call

9ecaf61

hmx-mm: remove leftover of the scale repl helper

7423b55

max-krasnyansky requested a review from a team as a code owner May 20, 2026 00:11

lhez approved these changes May 20, 2026

github-actions bot added the script, ggml, and Hexagon labels May 20, 2026

hexagon: fix editconf error

099c8e3

CISC approved these changes May 20, 2026

max-krasnyansky merged commit c9872a2 into ggml-org:master May 20, 2026
54 checks passed

nyo16 mentioned this pull request May 21, 2026

Bump llama.cpp to 52fb93a2b (30 commits) nyo16/llama_cpp_ex#42

Merged

4 tasks

a-ghorbani mentioned this pull request May 25, 2026

chore(deps): upgrade llama.rn to 0.12.4 a-ghorbani/pocketpal-ai#743

Merged

7 tasks

max-krasnyansky merged 10 commits into ggml-org:master from qualcomm:hexagon-hmx-matmul-rework