@ikawrakow ikawrakow commented Mar 18, 2025

For DeepSeek-V3/R1 this reduces KV cache size by ~2 GiB for a context of 65k tokens.

Using

-amb 512 -mla 2 -fa -ctk q8_0

one should now be able to use a 65k context with a single 24 GB GPU, with all attention calculations processed on the GPU and all tensors other than the MoE experts offloaded to it. See PR #260 for the meaning and effect of the -amb command line option.
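
As an illustration only, a full command might look like the line below. The -amb 512 -mla 2 -fa -ctk q8_0 options are the ones discussed in this PR; the binary name, model file, and the -m and -c options are ordinary llama.cpp-style placeholders, and whatever tensor placement options you normally use to keep the routed experts on the CPU are omitted here.

./llama-cli -m DeepSeek-R1.gguf -c 65536 -amb 512 -mla 2 -fa -ctk q8_0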

There is still an issue with one or more of the GGML_OP_REPEAT, GGML_OP_CONCAT, GGML_OP_CPY operations on CUDA, which are required to implement the entire attention computation with quantized tensors, so this PR takes the pragmatic approach of computing the attention operations in fp16 on CUDA. The downside is that fp16 will also be used on the CPU if the code was built with CUDA enabled (and this is slower than using Q8_0 directly, with the performance gap increasing with context length).
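
For readers of the code, the fallback described above boils down to a type selection of roughly the following shape. This is a minimal sketch, not the actual ik_llama.cpp code: the function name and the GGML_USE_CUDA guard are assumed for illustration; only the behaviour (fp16 whenever the build has CUDA enabled, the quantized cache type otherwise) follows the paragraph above.

#include "ggml.h"

// Hypothetical helper: choose the type used for the intermediate attention
// tensors, following the behaviour described in this PR.
static ggml_type attn_intermediate_type(ggml_type cache_type) {
#ifdef GGML_USE_CUDA
    // CUDA builds: REPEAT/CONCAT/CPY do not yet handle the quantized case here,
    // so fall back to fp16. This branch is also taken on the CPU path of a
    // CUDA-enabled build, which is where the slowdown mentioned above comes from.
    (void) cache_type;
    return GGML_TYPE_F16;
#else
    // CPU-only builds: use the quantized KV cache type (e.g. Q8_0) directly.
    return cache_type;
#endif
}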

@ikawrakow ikawrakow merged commit 68a5b60 into main Mar 18, 2025