UPSTREAM PR #18740: POC: group gate_exps and up_exps + fix mxfp4 alignment for PP boost#881

Open
loci-dev wants to merge 1 commit into main from upstream-PR18740-branch_am17an-merge-qkv

Conversation

@loci-dev
Mirrored from ggml-org/llama.cpp#18740

Instead of doing the gate and up expert MUL_MAT_ID separately, we group the two together (interestingly, the HF model also stores them together, but interleaved), so we do one larger sparse matrix multiplication (plus add) instead of two. I uploaded a GGUF with this change for gpt-oss-20b only.
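The grouping can be illustrated with a small NumPy sketch (shapes and names here are hypothetical, not taken from the PR): concatenating the gate and up weight matrices lets one matmul produce both projections, which can then be split back apart.

```python
import numpy as np

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_ff, d_model))
w_up = rng.standard_normal((d_ff, d_model))

# Baseline: two separate matrix multiplications per expert.
gate = w_gate @ x
up = w_up @ x

# Grouped: one matmul over the concatenated weights, then split.
w_grouped = np.concatenate([w_gate, w_up], axis=0)
out = w_grouped @ x
gate_g, up_g = out[:d_ff], out[d_ff:]

assert np.allclose(gate, gate_g) and np.allclose(up, up_g)
```

The single larger matmul amortizes kernel-launch and weight-loading overhead, which is where the prompt-processing gain comes from.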

Results:

Master

Details

4090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 9858.26 ± 68.48 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 10464.32 ± 44.26 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 10591.82 ± 11.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 10229.44 ± 16.46 |

5090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 14544.29 ± 124.42 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 15873.04 ± 81.31 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 16310.87 ± 39.83 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 13459.03 ± 63.68 |

This PR

Details

4090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 10908.66 ± 16.40 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 11663.88 ± 46.38 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 11815.96 ± 22.13 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 11267.29 ± 4.64 |

5090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 16002.91 ± 12.40 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 17792.65 ± 47.00 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 17774.20 ± 127.46 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 14217.53 ± 88.59 |

So overall this is a gain of roughly 10% in prompt processing (PP).
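The "~10%" figure can be checked directly against the 4090 llama-bench numbers above (master vs. this PR):

```python
# t/s figures copied from the 4090 tables above.
master = {"pp2048": 9858.26, "pp4096": 10464.32,
          "pp8192": 10591.82, "pp16384": 10229.44}
this_pr = {"pp2048": 10908.66, "pp4096": 11663.88,
           "pp8192": 11815.96, "pp16384": 11267.29}

for test, base in master.items():
    gain = this_pr[test] / base - 1
    print(f"{test}: +{gain:.1%}")
# Each gain lands between +10% and +12%.
```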

I tried another simple but backwards-incompatible change: growing block_mxfp4 from 17 bytes to 20 bytes so that blocks are 4-byte aligned (9099a1b175c804a092f6b5431d5e752386e47163). Stacked on top of this PR, that gives:
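For context on the size cost of that padding, a quick back-of-the-envelope check (the 17-byte figure is the standard block_mxfp4 layout: one scale byte plus 32 packed FP4 values in 16 bytes):

```python
# block_mxfp4: 1 scale byte + 16 bytes of packed 4-bit values = 17 bytes.
OLD_BLOCK_BYTES = 17
NEW_BLOCK_BYTES = 20  # padded so every block starts 4-byte aligned

growth = NEW_BLOCK_BYTES / OLD_BLOCK_BYTES - 1
print(f"mxfp4 tensor growth: +{growth:.1%}")  # +17.6%

# The whole model grows less (~15%, 12.1 GB -> 13.9 GB in the PR),
# since tensors stored in other formats are unaffected by the padding.
```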

Details

4090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 11232.48 ± 20.12 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 12063.57 ± 36.07 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 12236.66 ± 24.62 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 11528.88 ± 30.15 |

5090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 16994.97 ± 143.78 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 18900.47 ± 6.48 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 18885.38 ± 74.63 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 14500.50 ± 85.92 |

Another ~5-10% gain for the 5090 and 1-3% for the 4090. I uploaded another GGUF for this variant. The padding increases the model size by ~15%, from 12.1 GB to 13.9 GB.


Does it make sense for optimisation 1 to happen during model load via some flag? (We could add something similar for QKV merging.) For optimisation 2 the best approach would be a repack in the CUDA backend, as discussed in #18427, but I'm finding that difficult to implement.

AI disclosure: since I don't intend to merge this code as is, it was mostly written by AI for quick experiments (guided by me).

@loci-review

loci-review bot commented Jan 10, 2026

Explore the complete analysis inside the Version Insights

I've generated a comprehensive summary report for your project. The analysis shows significant performance concerns for pull request #881 in the llama.cpp repository:

Key Highlights:

Critical Issues:

  • Multiple STL vector operations showing 70%+ throughput degradation
  • Response times increased by up to 68.34% in some functions
  • Most severe impacts in std::vector::begin() and std::vector::back() operations

Some Improvements:

  • std::unique_ptr::operator= shows 97.86% throughput improvement
  • Minor improvements in tree operations

Recommendation: The report suggests careful review of PR #881 before merging due to the severe performance regressions in critical STL operations. Further investigation into memory allocation patterns and compiler optimization flags is recommended.

Would you like me to provide more details on any specific aspect of this performance analysis?

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from 2d2b258 to 78dd122 Compare January 14, 2026 10:10
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 095e526 to db6cb7a Compare January 21, 2026 19:15