UPSTREAM PR #18934: ggml-cuda: enable cuda-graphs for n-cpu-moe #971

Open
loci-dev wants to merge 2 commits into main from upstream-PR18934-branch_am17an-n-cpu-moe-piecewise

@loci-dev

Mirrored from ggml-org/llama.cpp#18934

Add piecewise CUDA graphs for the multiple-split case. Currently, CUDA graphs get disabled when the compute graph has splits, because only one CUDA graph is kept per device: repeated updates with differently sized splits/shapes trigger the disable.
This PR adds one CUDA graph per split (a split is keyed by the first node in the split).
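
To illustrate the keying idea, here is a minimal sketch of a per-split cache. It is not the code from this PR: `node`, `split_graph`, `graph_cache` and `run_split` are hypothetical names, and the CUDA capture/launch calls are replaced by counters so that the example runs on the host. It assumes a split can be identified by the address of its first node and that a cached graph is reusable only while the node shapes in that split are unchanged.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a graph node; only its address and shape matter here.
struct node {
    int              op;
    std::vector<int> shape;
};

// Hypothetical per-split state. A real backend would hold the captured
// CUDA graph and its executable instance; this sketch only counts events.
struct split_graph {
    std::vector<std::vector<int>> shapes;   // node shapes seen at capture time
    int captures = 0;                       // number of (re-)captures
    int replays  = 0;                       // number of cached replays
};

struct graph_cache {
    // One entry per split, keyed by the first node of the split.
    std::unordered_map<const node *, split_graph> by_first_node;

    // Returns true if the cached graph for this split was replayed,
    // false if the split had to be captured (or re-captured).
    bool run_split(const std::vector<const node *> & split) {
        if (split.empty()) {
            return false;
        }
        split_graph & g = by_first_node[split.front()];

        std::vector<std::vector<int>> shapes;
        for (const node * n : split) {
            shapes.push_back(n->shape);
        }

        if (g.captures > 0 && g.shapes == shapes) {
            g.replays++;                    // same split, same shapes: reuse
            return true;
        }
        g.shapes = std::move(shapes);       // first use or shapes changed: capture again
        g.captures++;
        return false;
    }
};
```

With a single cache entry per device, two alternating splits would overwrite each other on every call. With one entry per split, each split is captured once and then replayed as long as its shapes stay the same.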

Tested on 2x4090 and 1x 5090

| Model | n_cpu_moe | Test | t/s 3d55846 | t/s n-cpu-moe-piecewise | Speedup |
| --- | ---: | --- | ---: | ---: | ---: |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 8 | tg128 | 60.84 | 63.12 | 1.04 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 16 | tg128 | 48.25 | 50.75 | 1.05 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 32 | tg128 | 32.84 | 35.03 | 1.07 |
| glm4moe 106B.A12B IQ4_XS - 4.25 bpw | 64 | tg128 | 25.08 | 27.49 | 1.10 |
| gpt-oss 120B MXFP4 MoE | 8 | tg128 | 95.96 | 100.93 | 1.05 |
| gpt-oss 120B MXFP4 MoE | 16 | tg128 | 70.42 | 75.44 | 1.07 |
| gpt-oss 120B MXFP4 MoE | 32 | tg128 | 44.80 | 48.47 | 1.08 |
| gpt-oss 120B MXFP4 MoE | 64 | tg128 | 40.87 | 44.90 | 1.10 |

loci-review bot commented Jan 19, 2026

Explore the complete analysis inside the Version Insights

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from 5b137d4 to ab9ebfa Compare January 23, 2026 08:12
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 706d8e7 to 83ca7a9 Compare January 29, 2026 04:39