Make Q8_0 KV cache work with FlashMLA-2 on CUDA #264
Merged
For DeepSeek-V3/R1 this reduces KV cache size by ~2 GiB for a context of 65k tokens.
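As a rough sanity check (a back-of-the-envelope sketch, assuming the MLA KV cache stores 512 + 64 = 576 values per token per layer, 61 layers for DeepSeek-V3/R1, `fp16` at 2 bytes per value and `Q8_0` at 34 bytes per block of 32 values, i.e. 1.0625 bytes per value; the exact cache layout may differ):

$$
65536 \times 61 \times 576 \times (2 - 1.0625)\ \text{bytes} \approx 2.16 \times 10^{9}\ \text{bytes} \approx 2.0\ \text{GiB}
$$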
Using
one should now be able to use a 65k-token context with a single 24 GB GPU that processes all attention calculations and has all tensors other than the MoE experts offloaded to it. See PR #260 for the meaning and effect of the
`-amb` command line option. There is still an issue with one or more of the `GGML_OP_REPEAT`, `GGML_OP_CONCAT`, `GGML_OP_CPY` operations on CUDA, which are required to implement the entire attention computation with quantized tensors, so this PR takes the pragmatic approach of computing the attention operations with `fp16` on CUDA. The downside is that `fp16` will also be used on the CPU if the code was built with CUDA enabled, which is slower than using `Q8_0` directly, with the gap in performance increasing with context length.
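To illustrate the fallback described above, here is a minimal sketch, not the actual code in this PR: the helper name `pick_attn_compute_type` is hypothetical, and it assumes the `GGML_USE_CUDA` build flag. It shows the idea of selecting `fp16` for the intermediate attention tensors when the build has CUDA enabled, while a CPU-only build keeps using the quantized KV cache type directly.

```cpp
#include "ggml.h"

// Hypothetical helper illustrating the type selection described above.
// With CUDA enabled, the intermediate attention tensors are computed in f16,
// because GGML_OP_REPEAT / GGML_OP_CONCAT / GGML_OP_CPY do not yet handle
// quantized tensors on CUDA. In a CPU-only build the quantized cache type
// (e.g. Q8_0) can be used directly.
static ggml_type pick_attn_compute_type(ggml_type kv_cache_type) {
#ifdef GGML_USE_CUDA
    // Note: this branch is also taken when running on the CPU with a
    // CUDA-enabled build, which is what causes the slowdown mentioned above.
    (void) kv_cache_type;
    return GGML_TYPE_F16;
#else
    return kv_cache_type;
#endif
}
```

The trade-off is visible in the sketch: correct results on CUDA now, at the cost of some CPU-side throughput at long contexts until the missing quantized ops are implemented.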