Skip to content

CUDA: enroll mul_mat_vec_q_moe into pdl#24087

Merged
JohannesGaessler merged 2 commits into
ggml-org:masterfrom
ORippler:osimons/enroll_mul_mat_vec_q_moe_into_PDL
Jun 5, 2026
Merged

CUDA: enroll mul_mat_vec_q_moe into pdl#24087
JohannesGaessler merged 2 commits into
ggml-org:masterfrom
ORippler:osimons/enroll_mul_mat_vec_q_moe_into_PDL

Conversation

@ORippler
Copy link
Copy Markdown
Collaborator

@ORippler ORippler commented Jun 3, 2026

Overview

Gives small perf boost in 1 < BS < 8 setting.

Additional information

Numbers collected on a B4500

build: b508845ca (9493)
+ ./scripts/compare-llama-bench.py -b master -c osimons/enroll_mul_mat_vec_q_moe_into_PDL --tool llama-bench -i llama-bench.sqlite
| CPU                         | Model                    |   Microbatch size | Test   |   t/s master |   t/s osimons/enroll_mul_mat_vec_q_moe_into_PDL |   Speedup |
|:----------------------------|:-------------------------|------------------:|:-------|-------------:|------------------------------------------------:|----------:|
| INTEL(R) XEON(R) GOLD 6542Y | qwen35moe 35B.A3B Q4_K_M |                 2 | pp256  |       320.55 |                                          324.65 |      1.01 |
| INTEL(R) XEON(R) GOLD 6542Y | qwen35moe 35B.A3B Q4_K_M |                 4 | pp256  |       490.29 |                                          494.64 |      1.01 |
| INTEL(R) XEON(R) GOLD 6542Y | qwen35moe 35B.A3B Q4_K_M |                 8 | pp256  |       739.61 |                                          741.59 |      1.00 |

Requirements

ORippler added 2 commits June 3, 2026 16:34
Data collected on a B4500:

Before
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```
After
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜  llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
    -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    -ngl all \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```
@ORippler ORippler requested a review from a team as a code owner June 3, 2026 19:23
@github-actions github-actions Bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jun 3, 2026
@JohannesGaessler JohannesGaessler merged commit 2154a0f into ggml-org:master Jun 5, 2026
20 of 22 checks passed
jimbothigpen pushed a commit to jimbothigpen/llama.cpp that referenced this pull request Jun 6, 2026
* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW

Data collected on a B4500:

Before
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```
After
```
(llama.cpp) ➜  llama.cpp git:(master) ✗ python mtp-bench.py
  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
  code_cpp           pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
  explain_concept    pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
  summarize          pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
  qa_factual         pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
  translation        pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
  creative_short     pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
  stepwise_math      pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
  long_code_review   pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜  llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
    -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    -ngl all \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

* LC to overlap with following kernels

(cherry picked from commit 2154a0f)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ggml changes relating to the ggml tensor library for machine learning Nvidia GPU Issues specific to Nvidia GPUs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants