UPSTREAM PR #18740: POC: group gate_exps and up_exps + fix mxfp4 alignment for PP boost#881

Open
loci-dev wants to merge 1 commit into main from upstream-PR18740-branch_am17an-merge-qkv

Conversation

@loci-dev
Mirrored from ggml-org/llama.cpp#18740

Instead of doing the gate and up expert MUL_MAT_ID separately, we group the two together (interestingly, the HF model also stores them together, but interleaved), so we do one larger sparse matrix multiplication (plus add) instead of two. I uploaded a GGUF with this change for gpt-oss-20b only.
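The grouping can be illustrated with a small NumPy sketch (shapes and names here are hypothetical, not taken from the PR): concatenating the gate and up weight matrices lets one matmul produce both projections, which can then be split back apart.

```python
import numpy as np

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal(d_model)
w_gate = rng.standard_normal((d_ff, d_model))
w_up = rng.standard_normal((d_ff, d_model))

# Baseline: two separate matrix multiplications per expert.
gate = w_gate @ x
up = w_up @ x

# Grouped: one matmul over the concatenated weights, then split.
w_grouped = np.concatenate([w_gate, w_up], axis=0)
out = w_grouped @ x
gate_g, up_g = out[:d_ff], out[d_ff:]

assert np.allclose(gate, gate_g) and np.allclose(up, up_g)
```

The single larger matmul amortizes kernel-launch and weight-loading overhead, which is where the prompt-processing gain comes from.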

Results:

Master

Details

4090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 9858.26 ± 68.48 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 10464.32 ± 44.26 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 10591.82 ± 11.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 10229.44 ± 16.46 |

5090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 14544.29 ± 124.42 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 15873.04 ± 81.31 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 16310.87 ± 39.83 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 13459.03 ± 63.68 |

This PR

Details

4090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 10908.66 ± 16.40 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 11663.88 ± 46.38 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 11815.96 ± 22.13 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 11267.29 ± 4.64 |

5090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 16002.91 ± 12.40 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 17792.65 ± 47.00 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 17774.20 ± 127.46 |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 14217.53 ± 88.59 |

So overall this is a gain of roughly 10% in prompt processing (PP).
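The "~10%" figure can be checked directly against the 4090 llama-bench numbers above (master vs. this PR):

```python
# t/s figures copied from the 4090 tables above.
master = {"pp2048": 9858.26, "pp4096": 10464.32,
          "pp8192": 10591.82, "pp16384": 10229.44}
this_pr = {"pp2048": 10908.66, "pp4096": 11663.88,
           "pp8192": 11815.96, "pp16384": 11267.29}

for test, base in master.items():
    gain = this_pr[test] / base - 1
    print(f"{test}: +{gain:.1%}")
# Each gain lands between +10% and +12%.
```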

I tried another simple but backwards-incompatible change: growing block_mxfp4 from 17 bytes to 20 bytes so that blocks are 4-byte aligned (9099a1b175c804a092f6b5431d5e752386e47163). Stacked on top of this PR, that gives:
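For context on the size cost of that padding, a quick back-of-the-envelope check (the 17-byte figure is the standard block_mxfp4 layout: one scale byte plus 32 packed FP4 values in 16 bytes):

```python
# block_mxfp4: 1 scale byte + 16 bytes of packed 4-bit values = 17 bytes.
OLD_BLOCK_BYTES = 17
NEW_BLOCK_BYTES = 20  # padded so every block starts 4-byte aligned

growth = NEW_BLOCK_BYTES / OLD_BLOCK_BYTES - 1
print(f"mxfp4 tensor growth: +{growth:.1%}")  # +17.6%

# The whole model grows less (~15%, 12.1 GB -> 13.9 GB in the PR),
# since tensors stored in other formats are unaffected by the padding.
```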

Details

4090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 11232.48 ± 20.12 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 12063.57 ± 36.07 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 12236.66 ± 24.62 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 11528.88 ± 30.15 |

5090:

| model | size | params | backend | ngl | n_ubatch | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp2048 | 16994.97 ± 143.78 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp4096 | 18900.47 ± 6.48 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp8192 | 18885.38 ± 74.63 |
| gpt-oss 20B Q8_0 | 12.93 GiB | 20.91 B | CUDA | 99 | 2048 | 1 | pp16384 | 14500.50 ± 85.92 |

Another ~5-10% gain for the 5090 and 1-3% for the 4090. I uploaded another GGUF for this variant. The padding increases the model size by ~15%, from 12.1 GB to 13.9 GB.


Does it make sense for optimisation 1 to happen during model load via some flag? (We could add something similar for QKV merging.) For optimisation 2 the best approach would be a repack in the CUDA backend, as discussed in #18427, but I'm finding that difficult to implement.

AI disclosure: since I don't intend to merge this code as is, it was mostly written by AI for quick experiments (guided by me).

@loci-review

loci-review bot commented Jan 10, 2026

Explore the complete analysis inside the Version Insights

I've generated a comprehensive summary report for your project. The analysis shows significant performance concerns for pull request #881 in the llama.cpp repository:

Key Highlights:

Critical Issues:

  • Multiple STL vector operations showing 70%+ throughput degradation
  • Response times increased by up to 68.34% in some functions
  • Most severe impacts in std::vector::begin() and std::vector::back() operations

Some Improvements:

  • std::unique_ptr::operator= shows 97.86% throughput improvement
  • Minor improvements in tree operations

Recommendation: The report suggests careful review of PR #881 before merging due to the severe performance regressions in critical STL operations. Further investigation into memory allocation patterns and compiler optimization flags is recommended.

Would you like me to provide more details on any specific aspect of this performance analysis?

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from 2d2b258 to 78dd122 Compare January 14, 2026 10:10
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 095e526 to db6cb7a Compare January 21, 2026 19:15