Make Q8_0 KV cache work with FlashMLA-2 on CUDA #264
Merged
For DeepSeek-V3/R1 this reduces KV cache size by ~2 GiB for a context of 65k tokens.
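As a rough sanity check (a back-of-the-envelope sketch, assuming the MLA KV cache stores 512 + 64 = 576 values per token per layer, 61 layers for DeepSeek-V3/R1, `fp16` at 2 bytes per value and `Q8_0` at 34 bytes per block of 32 values, i.e. 1.0625 bytes per value; the exact cache layout may differ):

$$
65536 \times 61 \times 576 \times (2 - 1.0625)\ \text{bytes} \approx 2.16 \times 10^{9}\ \text{bytes} \approx 2.0\ \text{GiB}
$$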
Using
one should now be able to use a 65k-token context with a single 24 GB GPU that processes all attention calculations and has all tensors other than the MoE experts offloaded to it. See PR #260 for the meaning and effect of the
`-amb` command line option. There is still an issue with one or more of the `GGML_OP_REPEAT`, `GGML_OP_CONCAT`, `GGML_OP_CPY` operations on CUDA, which are required to implement the entire attention computation with quantized tensors, so this PR takes the pragmatic approach of computing the attention operations with `fp16` on CUDA. The downside is that `fp16` will also be used on the CPU if the code was built with CUDA enabled, which is slower than using `Q8_0` directly, with the gap in performance increasing with context length.
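To illustrate the fallback described above, here is a minimal sketch, not the actual code in this PR: the helper name `pick_attn_compute_type` is hypothetical, and it assumes the `GGML_USE_CUDA` build flag. It shows the idea of selecting `fp16` for the intermediate attention tensors when the build has CUDA enabled, while a CPU-only build keeps using the quantized KV cache type directly.

```cpp
#include "ggml.h"

// Hypothetical helper illustrating the type selection described above.
// With CUDA enabled, the intermediate attention tensors are computed in f16,
// because GGML_OP_REPEAT / GGML_OP_CONCAT / GGML_OP_CPY do not yet handle
// quantized tensors on CUDA. In a CPU-only build the quantized cache type
// (e.g. Q8_0) can be used directly.
static ggml_type pick_attn_compute_type(ggml_type kv_cache_type) {
#ifdef GGML_USE_CUDA
    // Note: this branch is also taken when running on the CPU with a
    // CUDA-enabled build, which is what causes the slowdown mentioned above.
    (void) kv_cache_type;
    return GGML_TYPE_F16;
#else
    return kv_cache_type;
#endif
}
```

The trade-off is visible in the sketch: correct results on CUDA now, at the cost of some CPU-side throughput at long contexts until the missing quantized ops are implemented.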