
@ikawrakow (Owner)

This PR is a follow-up to #482 and applies the same dequantizing GEMM approach to MoE matrix multiplications.

For a DeepSeek-Lite model where only the ffn_up and ffn_gate tensors are quantized with IQ2_KT, I observe a ~35% improvement in prompt processing (PP) performance compared to the main branch.
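
For readers unfamiliar with the idea, the sketch below illustrates the general "dequantize, then run a dense GEMM" pattern for MoE layers. It is not the actual code from this PR or from #482: the quantization format here is a toy one (a per-row f32 scale plus int8 values) standing in for IQ2_KT, and all function and parameter names are hypothetical. The point it illustrates is that the dequantization cost is paid once per expert and amortized over all token rows routed to that expert, which is why the benefit shows up in prompt processing (many rows per expert) rather than in single-token generation.

```cpp
// moe_dequant_gemm_sketch.cpp -- illustrative only, not the ik_llama.cpp implementation.
#include <cstdint>
#include <cstring>
#include <vector>

// Toy dequantizer: each quantized row is [f32 scale][K int8 values].
// Stands in for the real IQ2_KT dequantization kernel.
static void dequantize_row_toy(const uint8_t * src, float * dst, int64_t K) {
    float scale;
    std::memcpy(&scale, src, sizeof(float));
    const int8_t * q = reinterpret_cast<const int8_t *>(src + sizeof(float));
    for (int64_t k = 0; k < K; ++k) dst[k] = scale * q[k];
}

// Naive f32 GEMM: out[m][n] = sum_k act[m][k] * w[n][k].
static void gemm_f32(const float * act, const float * w, float * out,
                     int64_t M, int64_t N, int64_t K) {
    for (int64_t m = 0; m < M; ++m)
        for (int64_t n = 0; n < N; ++n) {
            float s = 0.f;
            for (int64_t k = 0; k < K; ++k) s += act[m*K + k] * w[n*K + k];
            out[m*N + n] = s;
        }
}

// For each expert: dequantize its weight matrix once into an f32 scratch buffer,
// gather the activation rows routed to it, and run a single dense GEMM.
void moe_dequant_gemm(const std::vector<const uint8_t *> & expert_weights,   // quantized [N x K] per expert
                      const std::vector<std::vector<int64_t>> & routed_rows, // token rows per expert
                      const float * activations,  // [n_tokens x K]
                      float * output,             // [n_tokens x N]
                      int64_t N, int64_t K) {
    const size_t row_bytes = sizeof(float) + (size_t)K;   // toy row layout
    std::vector<float> w(N * K), gathered, out;
    for (size_t e = 0; e < expert_weights.size(); ++e) {
        const auto & rows = routed_rows[e];
        if (rows.empty()) continue;
        // 1) dequantize the whole expert weight matrix
        for (int64_t n = 0; n < N; ++n)
            dequantize_row_toy(expert_weights[e] + n*row_bytes, w.data() + n*K, K);
        // 2) gather the activation rows routed to this expert
        gathered.assign(rows.size() * K, 0.f);
        for (size_t r = 0; r < rows.size(); ++r)
            std::memcpy(gathered.data() + r*K, activations + rows[r]*K, K*sizeof(float));
        // 3) one dense GEMM per expert instead of per-row quantized dot products
        out.assign(rows.size() * N, 0.f);
        gemm_f32(gathered.data(), w.data(), out.data(), (int64_t)rows.size(), N, K);
        // 4) scatter results back to the routed token rows
        for (size_t r = 0; r < rows.size(); ++r)
            std::memcpy(output + rows[r]*N, out.data() + r*N, N*sizeof(float));
    }
}
```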

ikawrakow merged commit 0b10f74 into main on Jun 5, 2025.
