hexagon: HMX quantized matmul rework #23368

Merged
max-krasnyansky merged 10 commits into ggml-org:master from qualcomm:hexagon-hmx-matmul-rework
May 20, 2026
@max-krasnyansky
Member

Overview

This PR updates the HMX matmul to use activation depth mode and simplifies the quantized HMX matmul implementation.
Based on testing with the latest models (see the sweep below), we no longer need the non-pipelined kernel flavors.
They may have provided benefits at some point, but after all the recent updates and fixes they no longer do.

Additional information

## S26+ (Hex v81)

| model            |       size | dev  |    test |            t/s | hmx-rework   t/s |
| ---------------- | ---------: | ---- | ------: | -------------: | ---------------: |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |    pp25 |   56.58 ± 1.01 |     57.55 ± 0.28 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp128 |  368.03 ± 3.21 |    473.84 ± 9.24 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp200 |  474.88 ± 2.23 |    548.86 ± 2.79 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp256 |  559.54 ± 2.60 |    638.01 ± 9.17 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp300 |  521.03 ± 5.11 |    584.54 ± 2.14 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp400 |  617.81 ± 1.90 |    645.06 ± 2.47 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp512 |  652.48 ± 0.63 |    689.97 ± 4.18 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |   pp800 |  580.48 ± 2.90 |    592.78 ± 3.14 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |  pp1024 |  598.40 ± 2.59 |    628.46 ± 3.36 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |  pp1334 |  544.45 ± 1.81 |    574.65 ± 2.57 |
| qwen3 4B Q4_0    |   2.11 GiB | HTP0 |  pp2048 |  533.06 ± 0.88 |    548.57 ± 4.73 |

| model            |       size | dev  |    test |            t/s | hmx-rework   t/s |
| ---------------- | ---------: | ---- | ------: | -------------: | ---------------: |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |    pp25 |   82.45 ± 0.44 |     81.58 ± 0.70 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp128 |  395.62 ± 2.65 |    418.83 ± 2.58 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp200 |  449.89 ± 2.01 |    441.39 ± 5.24 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp256 |  477.68 ± 2.66 |    461.36 ± 9.62 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp300 |  481.75 ± 4.84 |    465.00 ± 8.40 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp400 |  498.12 ± 3.78 |    470.01 ± 9.17 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp512 |  511.80 ± 3.57 |    502.04 ± 3.99 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp800 |  503.10 ± 4.39 |    487.99 ± 3.19 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1024 |  469.80 ± 4.61 |    486.48 ± 2.59 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1334 |  452.50 ± 1.39 |    470.13 ± 1.65 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp2048 |  446.86 ± 3.48 |    464.10 ± 1.26 |

| model            |       size | dev  |    test |            t/s | hmx-rework   t/s |
| ---------------- | ---------: | ---- | ------: | -------------: | ---------------: |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |    pp25 |   91.21 ± 0.90 |     91.29 ± 0.84 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp128 |  410.90 ± 4.29 |    520.26 ± 2.22 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp200 |  517.40 ± 4.48 |    625.35 ± 5.89 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp256 |  587.21 ± 7.23 |    700.74 ± 7.44 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp300 |  594.04 ± 6.83 |    658.50 ± 6.24 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp400 |  641.20 ± 4.94 |    707.10 ± 3.80 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp512 |  694.36 ± 6.46 |    763.82 ± 7.01 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp800 |  630.60 ± 5.71 |    677.29 ± 2.66 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1024 |  640.28 ± 1.00 |    690.08 ± 2.20 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1334 |  607.27 ± 2.98 |    662.12 ± 2.12 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp2048 |  577.48 ± 2.44 |    622.38 ± 4.18 |

## S24U (Hex v75)

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |    pp25 |  108.16 ± 1.07 |   107.54 ± 0.66 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp128 |  877.65 ± 7.46 |   888.87 ± 0.90 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp200 |  970.85 ± 9.22 |  1005.83 ± 6.19 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp256 | 1087.36 ± 6.04 |  1119.79 ± 9.91 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp300 |  909.44 ± 9.01 |   946.88 ± 9.23 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp400 | 1004.08 ± 8.82 |  1029.45 ± 8.90 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp512 | 1023.82 ± 9.24 |  1060.55 ± 9.52 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |   pp800 |  879.68 ± 6.11 |   907.78 ± 3.99 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |  pp1024 |  868.60 ± 3.75 |   895.50 ± 4.57 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |  pp1334 |  819.69 ± 7.21 |   850.83 ± 2.99 |
| llama 1B Q4_0    | 729.75 MiB | HTP0 |  pp2048 |  754.53 ± 2.69 |   786.59 ± 1.90 |

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |    pp25 |   39.60 ± 0.22 |    40.32 ± 0.14 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp128 |  342.47 ± 2.82 |   342.26 ± 2.52 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp200 |  411.13 ± 5.55 |   422.50 ± 3.39 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp256 |  458.27 ± 5.34 |   468.22 ± 7.34 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp300 |  421.36 ± 3.38 |   436.76 ± 1.48 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp400 |  460.33 ± 3.67 |   476.64 ± 1.38 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp512 |  487.08 ± 2.82 |   510.72 ± 2.04 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |   pp800 |  411.20 ± 3.16 |   429.67 ± 1.53 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1024 |  429.99 ± 3.02 |   448.14 ± 4.66 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp1334 |  400.51 ± 0.92 |   425.18 ± 2.94 |
| llama 3B Q4_0    |   1.78 GiB | HTP0 |  pp2048 |  382.92 ± 1.62 |   410.36 ± 0.78 |

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |    pp25 |   41.64 ± 1.07 |    43.57 ± 0.88 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp128 |  252.32 ± 4.31 |   271.20 ± 0.08 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp200 |  276.95 ± 1.73 |   290.11 ± 1.44 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp256 |  291.52 ± 5.04 |   314.51 ± 5.57 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp300 |  303.51 ± 8.51 |   311.49 ± 1.96 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp400 |  306.40 ± 9.38 |   317.73 ± 2.08 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp512 |  321.93 ± 5.62 |   353.54 ± 0.78 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |   pp800 |  293.84 ± 9.57 |   340.43 ± 0.66 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1024 |  296.90 ± 9.93 |   339.94 ± 0.72 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp1334 |  291.18 ± 9.10 |   323.27 ± 4.19 |
| gemma4 E2B Q4_0  |   2.82 GiB | HTP0 |  pp2048 |  299.31 ± 7.84 |   327.53 ± 0.74 |

| model            |       size | dev  |    test |            t/s | hmx-rework  t/s |
| ---------------- | ---------: | -----| ------: | -------------: | --------------: |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |    pp25 |   25.86 ± 0.23 |    25.73 ± 0.17 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp128 |  260.53 ± 1.62 |   259.61 ± 1.74 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp200 |  286.95 ± 0.38 |   285.16 ± 1.22 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp256 |  324.01 ± 1.44 |   324.04 ± 1.24 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp300 |  307.37 ± 0.33 |   307.56 ± 0.51 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp400 |  322.96 ± 1.08 |   323.75 ± 1.13 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp512 |  339.22 ± 1.30 |   338.42 ± 1.28 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |   pp800 |  295.73 ± 0.80 |   295.85 ± 0.47 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |  pp1024 |  302.61 ± 0.43 |   303.76 ± 0.55 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |  pp1334 |  288.09 ± 0.43 |   289.76 ± 0.48 |
| qwen3 4B Q4_0    |   2.21 GiB | HTP0 |  pp2048 |  265.52 ± 0.57 |   265.32 ± 0.68 |
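The sweep is easier to compare as relative change per row. Below is a minimal sketch that does this for the pp512 rows. It assumes the plain `t/s` column is the baseline before this PR and `hmx-rework t/s` is this branch; the tables do not label the baseline explicitly. The names `PP512`, `relative_change`, and `summarize` are illustrative and not part of llama.cpp, and the calculation ignores the ± run-to-run spread.

```python
# Throughput pairs (assumed baseline t/s, hmx-rework t/s) copied from the pp512 rows above.
PP512 = {
    ("S26+ (v81)", "qwen3 4B Q4_0"): (652.48, 689.97),
    ("S26+ (v81)", "gemma4 E2B Q4_0"): (511.80, 502.04),
    ("S26+ (v81)", "llama 3B Q4_0"): (694.36, 763.82),
    ("S24U (v75)", "llama 1B Q4_0"): (1023.82, 1060.55),
    ("S24U (v75)", "llama 3B Q4_0"): (487.08, 510.72),
    ("S24U (v75)", "gemma4 E2B Q4_0"): (321.93, 353.54),
    ("S24U (v75)", "qwen3 4B Q4_0"): (339.22, 338.42),
}


def relative_change(before: float, after: float) -> float:
    """Return the percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


def summarize(rows):
    """Map each (device, model) key to its percentage change, rounded to 0.1."""
    return {key: round(relative_change(b, a), 1) for key, (b, a) in rows.items()}


if __name__ == "__main__":
    for (device, model), pct in summarize(PP512).items():
        print(f"{device:11s} {model:16s} {pct:+.1f}%")
```

Rows whose change is smaller than the reported ± spread (for example qwen3 4B on S24U) are best read as unchanged rather than as a regression or a gain.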

Requirements

@max-krasnyansky max-krasnyansky requested a review from a team as a code owner May 20, 2026 00:11
@github-actions github-actions bot added the script, ggml, and Hexagon labels May 20, 2026
@max-krasnyansky
Member Author

@ggml-org/maintainers can I get the second approval please

@max-krasnyansky max-krasnyansky merged commit c9872a2 into ggml-org:master May 20, 2026
54 checks passed
ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 20, 2026
* hmx-mm: update debug logging in hmx-mm

* hmx-mm: update dequant logic to use HVX_vector_x2/4

* hmx-mm: remove non-pipelined version of the quantize matmul

It seems that we don't really need the non-pipelined version

* hmx-mm: use activation depth mode and update naming

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>

* hex-mm: minor hmx matmul naming updates

* hmx-mm: remove unused vars

* snapdragon: scripts bump default ubatch-size to 1K

* hexagon: combine HMX and power and clock settings into a single set_power call

* hmx-mm: remove leftover of the scale repl helper

* hexagon: fix editconf error

---------

Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>
max-krasnyansky added a commit to qualcomm/llama.cpp that referenced this pull request May 20, 2026
dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request May 21, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request May 23, 2026
* upstream/HEAD: (38 commits)
  vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
  doc: fix spec mtp typo (ggml-org#23435)
  ui: Improve Git Hooks for UI development (ggml-org#23403)
  ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
  llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
  hexagon: ssm-conv fix for large prompts (ggml-org#23307)
  app : show version (ggml-org#23426)
  mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
  ui: Add max image size option (ggml-org#22849)
  Move to backend sampling for MTP draft path (ggml-org#23287)
  opencl: refactor backend initilization (ggml-org#23318)
  common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
  mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (ggml-org#23345)
  vulkan: optimize operations in the IM2COL shader (ggml-org#22685)
  feat: Add WAV MIME type variants and improve audio format detection (ggml-org#23396)
  hexagon: HMX quantized matmul rework (ggml-org#23368)
  Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (ggml-org#22522)
  app : introduce the llama unified executable (ggml-org#23296)
  refactor: Move text attachments up before the message content in chat completions payload (ggml-org#23406)
  mtmd: fit_params now take into account mmproj (ggml-org#21489)
  ...
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
Labels

ggml, Hexagon, script

3 participants