
vulkan: Enable topk_moe fusion for GLM-4.7-Flash #18947

Closed
jeffbolznv wants to merge 1 commit into ggml-org:master from jeffbolznv:topk_moe_early_softmax_norm_bias_edges

Conversation

@jeffbolznv
Contributor

Just need to add the fusion detection logic; this is a combination of existing modes (early softmax, bias, norm, scale) and is covered by the existing backend tests.
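To make the combination concrete, here is a minimal, self-contained sketch of the routing math the fused op covers: an early softmax over the router logits, a per-expert bias applied only for top-k selection, renormalization of the selected weights, and a final scale. All names and values here are illustrative assumptions, not the actual ggml-vulkan detection code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int   n_expert      = 8;     // hypothetical expert count
    const int   n_expert_used = 2;     // top-k
    const float scale         = 1.5f;  // hypothetical routed scaling factor

    std::vector<float> logits = {0.1f, 2.0f, -1.0f, 0.5f, 1.2f, -0.3f, 0.8f, 0.0f};
    std::vector<float> bias   = {0.0f, 0.1f, 0.0f, -0.2f, 0.3f, 0.0f, 0.0f, 0.05f};

    // Early softmax: probabilities are computed before top-k selection.
    float mx = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_expert);
    float sum = 0.0f;
    for (int i = 0; i < n_expert; ++i) { probs[i] = std::exp(logits[i] - mx); sum += probs[i]; }
    for (float & p : probs) p /= sum;

    // Bias: selection ranks experts by biased score, but the weights
    // come from the unbiased probabilities.
    std::vector<int> idx(n_expert);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + n_expert_used, idx.end(),
        [&](int a, int b) { return probs[a] + bias[a] > probs[b] + bias[b]; });

    // Norm + scale: renormalize the selected weights to sum to 1, then scale.
    float sel = 0.0f;
    for (int k = 0; k < n_expert_used; ++k) sel += probs[idx[k]];
    for (int k = 0; k < n_expert_used; ++k) {
        printf("expert %d weight %.4f\n", idx[k], probs[idx[k]] / sel * scale);
    }
    return 0;
}
```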

before

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -r 10 -fa 1 -p 512 -n 128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-vulkan.dll
load_backend: loaded CPU backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-cpu.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8434.22 ± 37.67 |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           tg128 |       185.26 ± 16.12 |

after

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m c:\models\GLM-4.7-Flash-Q4_K_M.gguf -r 10 -fa 1 -p 512 -n 128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
load_backend: loaded Vulkan backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-vulkan.dll
load_backend: loaded CPU backend from Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo\ggml-cpu.dll
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           pp512 |      8504.07 ± 57.02 |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | Vulkan     |  99 |  1 |           tg128 |       206.38 ± 16.00 |

@jeffbolznv requested a review from 0cc4m as a code owner on January 20, 2026 04:39
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jan 20, 2026
@jeffbolznv
Contributor Author

This may not be needed after #18980; I'll check once that lands.

@jeffbolznv
Contributor Author

I verified the model is hitting TOPK_MOE_SIGMOID_NORM_BIAS now, without this change. So we don't really need the change.
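For context, the SIGMOID variant replaces the softmax in the sketch above with independent per-expert sigmoid gates; the bias, top-k, and normalization steps are analogous. A minimal illustration of the gating difference (hypothetical names, not the ggml-vulkan code):

```cpp
#include <cmath>
#include <cstdio>

// Sigmoid gating: each expert's score is an independent logistic of its
// logit, rather than one softmax normalized across all experts.
static float sigmoid_gate(float logit) {
    return 1.0f / (1.0f + std::exp(-logit));
}

int main() {
    const float logits[4] = {0.1f, 2.0f, -1.0f, 0.5f};
    for (float l : logits) printf("%.4f\n", sigmoid_gate(l));
    return 0;
}
```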

@0cc4m
Contributor

0cc4m commented Jan 22, 2026

So this can be closed?

@jeffbolznv
Contributor Author

I'm fine with abandoning it. If another model needs it in the future the code will still be here.

@jeffbolznv closed this Jan 22, 2026