Metal: much faster MoE prompt processing #307
Merged
+2,118
−2,042
The prompt processing (PP) performance on Metal for MoE models with many experts (such as DeepSeek) is pathetic, both here and in mainline before the very recent PR 12612. That mainline PR brings PP performance to a more acceptable level by effectively using GEMV for matrix multiplications involving MoE tensors.
This PR does much better than that. On my M2-Max (30-core GPU) PP performance for DeepSeek-Lite is now 1.75X faster than mainline (build: a6f32f0b3 (5018)), and 5X compared to the main branch.
Also, on mainline I observe a very peculiar performance behavior as a function of u_batch:
Interesting, right? For u_batch = 512 (where performance is maximized) the matrix multiplication is done using GEMV. For u_batch = 128, 256, it is done using GEMM, but in an extremely inefficient way, where the inefficiency increases with u_batch size, so performance degrades.
Here is what we get with this PR:
The PR became much bigger than it should have been. But since token generation (TG) performance is now slightly lower than in mainline, and the only change that looked like a promising explanation for the difference was PR 9698, I decided to add that change. It made zero difference, but resulted in 2k lines of code being moved around.
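For readers less familiar with the GEMV vs. GEMM distinction discussed above, here is a minimal plain-Python sketch of the two ways an MoE matrix multiplication can be organized. It is not the Metal kernel from this PR or from mainline. All names (`moe_gemv`, `moe_grouped_gemm`, `routing`) are hypothetical, and the tiny integer matrices are made up for illustration. The GEMV variant walks an expert's weight matrix once per (token, expert) pair. The grouped variant first gathers the tokens routed to each expert and then walks that expert's weights once for the whole group.

```python
# Illustrative sketch only: hypothetical names, toy data, no relation to the
# actual Metal kernels. Matrices are lists of rows; tokens are lists of numbers.

def gemv(W, x):
    """Matrix-vector product: one full pass over W for a single token."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gemm(W, X):
    """Matrix-matrix product: each row of W is read once and reused for every token in X."""
    Y = [[0] * len(W) for _ in X]
    for r, row in enumerate(W):
        for t, x in enumerate(X):
            Y[t][r] = sum(w * v for w, v in zip(row, x))
    return Y

def moe_gemv(experts, tokens, routing):
    """Per-token path: one GEMV for every (token, selected expert) pair."""
    return [[gemv(experts[e], x) for e in ids] for x, ids in zip(tokens, routing)]

def moe_grouped_gemm(experts, tokens, routing):
    """Grouped path: gather the tokens of each expert, then one GEMM per active expert."""
    groups = {}
    for t, ids in enumerate(routing):
        for slot, e in enumerate(ids):
            groups.setdefault(e, []).append((t, slot))
    out = [[None] * len(ids) for ids in routing]
    for e, members in groups.items():
        X = [tokens[t] for t, _ in members]
        Y = gemm(experts[e], X)
        for (t, slot), y in zip(members, Y):
            out[t][slot] = y  # scatter results back to (token, slot) order
    return out

if __name__ == "__main__":
    experts = [[[1, 0], [0, 1]], [[2, 1], [0, 3]], [[1, 1], [1, -1]]]  # 3 toy experts
    tokens = [[1, 2], [3, 4], [5, 6], [7, 8]]                          # batch of 4 tokens
    routing = [[0, 1], [1, 2], [0, 2], [1, 0]]                         # top-2 expert ids per token
    assert moe_gemv(experts, tokens, routing) == moe_grouped_gemm(experts, tokens, routing)
    print("both paths agree")
```

In this toy setup the per-token path makes 8 passes over expert weights (4 tokens times 2 experts each), while the grouped path makes 3 (one per active expert). With many experts and few tokens per expert the groups become small, which is one plausible reason why a GEMV-style path can be competitive at some batch sizes. The sketch does not model the GPU-side costs that determine the actual numbers reported above.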