This PR adds FlashMLA on CUDA. It is enabled via `-mla 2 -fa`.
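For reference, a launch matching the setup described below might look like the following. This is illustrative only: the model file name is made up and the exact binary name may differ by build; `-mla 2 -fa` enables FlashMLA, `-fmoe` the fused MoE path, and `-ub 2048` the u_batch used here.

```bash
# FlashMLA on CUDA: MLA variant 2 + flash attention, fused MoE, u_batch = 2048
./bin/llama-cli -m deepseek-lite-iq4_nl.gguf -mla 2 -fa -fmoe -ub 2048 -c 32768 -p "Hello"
```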
I observe a very strange slowdown for TG that is caused by a very slow `ffn_gate_exps` matrix multiplication. As I was not able to resolve what causes this, for now TG will go via the regular `mla = 2` route, so TG performance remains the same as we had with `mla = 2, fa = 0`.

Prompt processing speed is massively improved for long contexts, and is almost on par with standard FA. The following table shows a comparison between `mla = 2` without FA and FlashMLA.
Model is `IQ4_NL`-quantized DeepSeek-Lite, GPU is RTX-4080. `fmoe` is on, `u_batch = 2048`.

The KV cache is the same size as `mla = 2` without FA (i.e., the smallest possible).
One no longer needs to worry about controlling the maximum compute buffer size via `-amb`.

Caveats:
* Only `f16` KV cache can be used for now. As explained in PR #246 (Faster FlashMLA prompt processing), we need to convert the KV cache to `fp32` to be able to do the required operations, and the CUDA back-end does not yet support this conversion for quantized data types.
* An additional compute buffer is needed for the KV cache converted to `f32` and other intermediate results. This is required on every GPU that performs attention computations. For DeepSeek-Lite and a context length of 32k tokens the CUDA compute buffer is 1404 MiB. It shouldn't be much bigger for DeepSeek-V3/R1.
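Below is a minimal CUDA sketch of the kind of conversion the first caveat refers to; it is illustrative only and not the actual kernel from this repo. For an `f16` cache the conversion is a trivial element-wise cast; a quantized cache would instead need the kernel body to decode each quantized block, and that per-type dequantize-to-`f32` support is what the CUDA back-end is still missing.

```cuda
// Sketch: convert an f16 KV-cache buffer to f32 before the FlashMLA ops.
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

// Element-wise f16 -> f32 cast. Grid-stride loop so any launch
// configuration covers the whole buffer.
__global__ void convert_f16_to_f32(const half * src, float * dst, size_t n) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x) {
        dst[i] = __half2float(src[i]);
    }
}

int main() {
    const size_t n = 1 << 20;  // stand-in for the number of cached K/V values
    half  * kv_f16 = nullptr;
    float * kv_f32 = nullptr;
    cudaMalloc(&kv_f16, n * sizeof(half));
    cudaMalloc(&kv_f32, n * sizeof(float));
    cudaMemset(kv_f16, 0, n * sizeof(half));  // dummy cache contents

    convert_f16_to_f32<<<256, 256>>>(kv_f16, kv_f32, n);
    cudaDeviceSynchronize();
    printf("converted %zu f16 values to f32\n", n);

    cudaFree(kv_f16);
    cudaFree(kv_f32);
    return 0;
}
```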