Support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation #23346
* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation
ggerganov
left a comment
Add a TODO so I don't forget to do the refactor:
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index 0b0a56ce9..649269af6 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -93,6 +93,9 @@ public:
using slot_info_vec_t = std::vector<slot_info>;
+ // TODO: refactor the memory instances to not depend on `llama_model`
+ // instead pass all necessary info (e.g. hparams, dev layers, arch, etc.) directly
+ // likely through `struct llama_memory_params`
llama_kv_cache(
const llama_model & model,
const llama_hparams & hparams,

res->t_embd = cur;

// lm_head
cur = ggml_mul_mat(ctx0, model.output, cur);
Why not `build_lora_mm`?
I guess nobody ever cared enough to add this to the DeepSeek code that I copied and modified in this PR, so it's kind of inherited.
Are there any standard conventions of which tensor matmuls should be LoRAble and which should be left alone?
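For readers following the `build_lora_mm` question: a LoRA-aware matmul computes the base product and then adds a scaled low-rank correction for each active adapter, so a tensor multiplied with plain `ggml_mul_mat` simply never receives that correction. The sketch below shows the arithmetic on plain vectors. It is not llama.cpp code; `Matrix`, `LoraAdapter`, `matvec` and `lora_matvec` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major dense matrix with `rows` x `cols` elements (hypothetical helper type).
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;

    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// y = M * x
std::vector<float> matvec(const Matrix & m, const std::vector<float> & x) {
    assert(x.size() == m.cols);
    std::vector<float> y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t c = 0; c < m.cols; ++c) {
            y[r] += m.at(r, c) * x[c];
        }
    }
    return y;
}

// Hypothetical adapter: low-rank factors A (rank x n_in) and B (n_out x rank).
struct LoraAdapter {
    Matrix a;
    Matrix b;
    float  scale;
};

// LoRA-aware matmul: y = W x + sum_i scale_i * B_i (A_i x).
// With no adapters this reduces to the plain base matmul.
std::vector<float> lora_matvec(const Matrix & w,
                               const std::vector<LoraAdapter> & adapters,
                               const std::vector<float> & x) {
    std::vector<float> y = matvec(w, x);
    for (const LoraAdapter & ad : adapters) {
        const std::vector<float> delta = matvec(ad.b, matvec(ad.a, x));
        assert(delta.size() == y.size());
        for (std::size_t i = 0; i < y.size(); ++i) {
            y[i] += ad.scale * delta[i];
        }
    }
    return y;
}
```

With an empty adapter list the two paths give identical results, which is why the choice only matters once an adapter actually targets the tensor in question (here, the output projection).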
@fairydreaming sorry for my ignorance, but does the flash model work with this same architecture? That requires way less VRAM and I can also test it out on my machine (I have 128GB VRAM)
@am17an There is no DeepSeek V3.2 Flash model. I'm currently trying to get NVFP4 quant to work as @CISC suggested, but it's still almost 400GB. Edit: in case you meant DeepSeek V4 Flash then unfortunately the answer is no, it's something completely different from DeepSeek V3.2.
@fairydreaming yes I mean the DSV4 flash model. I just read up on it and you're right, it's completely different, but the lightning indexer work you're doing here will be useful there. I will try and work on the flash model in the meantime
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@fairydreaming GitHub UI messed up EOL again, please normalize to
30fdfe4 to 4643fda
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
CISC
left a comment
I still have the build_lora_mm question, but otherwise LGTM.
@CISC Yeah I noticed, force-pushed a fixed commit.
@CISC By the way, I managed to convert and run nvidia/DeepSeek-V3.2-NVFP4 with your NVFP4 changes and it seems to work fine. I only needed to regenerate model.safetensors.index.json, as it currently misses the NVFP4 scale tensors.
@am17an I thought about DSV4 too, but I still don't have a clear vision of how to integrate it with the llama.cpp memory subsystem without creating a bunch of new specialized classes. But it's definitely a good idea to keep common parts reusable in both. I suppose one obvious next step is to add a separate lightning indexer GGML OP, as it brings immense compute buffer size reductions. But since DS V3.2 is kind of obsolete now I can chill a bit and take it easy. Anyway, please keep me posted about any progress, wish you luck!
Weird, but great to hear it works, do you have BW hw, and if so how does performance compare?
@CISC I have an Epyc 9374F with a single RTX PRO 6000 Max-Q (BLACKWELL_NATIVE_FP4 = 1), experts were in RAM. Some llama-bench experiments I did:
Q8_0, --no-op-offload 0
Q8_0, --no-op-offload 1
NVFP4, --no-op-offload 0
NVFP4, --no-op-offload 1
From what I understand, NVFP4 has horrible performance on the CPU and this slows everything down. I added some mul_mat backend op tests and they seem to confirm it:
Q8_0:
NVFP4
So while Q8_0 on CPU works pretty fast, NVFP4 is like 8 times slower, basically unusable.
Yeah, it's only useful if you can fit all the NVFP4 tensors on GPU. :(
Also the current NVFP4 CPU path is the "generic" path, probably an AVX impl would bring it up to par with the rest of the quants
This PR broke the CI for |
* origin/master:
  vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
  graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
  server: remove obsolete scripts (ggml-org#23870)
  ci : update macos release to use macos-26 runner (ggml-org#23878)
  download: add option to skip_download (ggml-org#23059)
  mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
  CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
  server: bump timeout to 3600s (ggml-org#23842)
  model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
  llama: use f16 mask for FA to save VRAM (ggml-org#23764)
  sync : ggml
  ggml : bump version to 0.13.1 (ggml/1523)
  ngram-mod : Add missing include (ggml-org#23857)
  llama: add llm_graph_input_mtp (ggml-org#23643)
  app : move licences to llama-app (ggml-org#23824)
  cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
  meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
…se Attention (DSA) implementation (ggml-org#23346)

* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)
* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation
* model : merge two scale operations into one in DSA lightning indexer implementation
* chore : remove unused code
* model : support NVFP4 in DeepSeek V3.2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* memory : refactoring TODO

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>
Warning: The DeepSeek V3.2 model conversion currently fails with transformers 5.x (required by requirements.txt after #21617 was merged). Downgrade transformers to 4.x (for example to 4.57.6) to convert the model.
Overview
This PR adds support for DeepseekV32ForCausalLM (DeepSeek V3.2 Exp, DeepSeek V3.2, DeepSeek V3.2 Speciale) models. It implements the lightning indexer and DeepSeek Sparse Attention (DSA) in generic GGML without adding any new OPs.
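As background on what the lightning indexer computes: in DeepSeek's published description of DSA, each query token gets an index score against every earlier token, I(t, s) = sum over indexer heads j of w(t, j) * ReLU(q(t, j) · k(s)), and attention is then restricted to the top-k highest-scoring tokens. The following is a minimal scalar sketch of that selection for a single query token. It is not the GGML graph built by this PR; the function names `indexer_score` and `select_top_k` are hypothetical, and causal masking is assumed to be handled by passing only the keys the query is allowed to see.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Index score of one query token against one cached indexer key:
//   score = sum_j w[j] * ReLU(dot(q[j], k))
// q holds one query vector per indexer head, w holds one weight per head.
float indexer_score(const std::vector<std::vector<float>> & q,
                    const std::vector<float> & w,
                    const std::vector<float> & k) {
    assert(q.size() == w.size());
    float score = 0.0f;
    for (std::size_t j = 0; j < q.size(); ++j) {
        assert(q[j].size() == k.size());
        const float dot = std::inner_product(q[j].begin(), q[j].end(), k.begin(), 0.0f);
        score += w[j] * std::max(dot, 0.0f);
    }
    return score;
}

// Positions of the top_k highest-scoring keys, returned in ascending position order.
// `keys` must contain only the keys visible to this query (causal mask applied by the caller).
std::vector<std::size_t> select_top_k(const std::vector<std::vector<float>> & q,
                                      const std::vector<float> & w,
                                      const std::vector<std::vector<float>> & keys,
                                      std::size_t top_k) {
    std::vector<float> scores(keys.size());
    for (std::size_t s = 0; s < keys.size(); ++s) {
        scores[s] = indexer_score(q, w, keys[s]);
    }

    std::vector<std::size_t> idx(keys.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});

    // Keep everything when fewer than top_k keys are cached.
    const std::size_t n = std::min(top_k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(n), idx.end(),
                      [&scores](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
    idx.resize(n);
    std::sort(idx.begin(), idx.end());
    return idx;
}
```

The selected positions would then be used to mask or gather the MLA cache entries that the main attention reads; a graph-based implementation has to express the same steps with existing tensor ops instead of loops.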
This PR is a continuation of PR #21149 (now closed).
Additional information
Covered areas
Areas covered by this PR:
* `llama_kv_cache` constructor to include explicit `hparams` argument,
* `llama_kv_cache_dsa` class which aggregates two instances of `llama_kv_cache`: one for caching MLA latent representations, second for caching lightning indexer keys,
* `LLM_ARCH_DEEPSEEK32` architecture (mostly a copy of existing `LLM_ARCH_GLM_DSA`),
* `llama_model_deepseek32` implementation (mostly copied from `llama_model_glm_dsa` and `llama_model_deepseek2`)
Testing
GGUFs for testing (Q8_0/Q4_K_M):
You need over 700GB (Q8_0) or over 400GB (Q4_K_M) of RAM/VRAM to run these models. The generic lightning indexer implementation uses very large compute buffers, so if you encounter out-of-memory errors, reduce the context and/or ubatch size.
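To get a feel for why context and ubatch size matter here, the back-of-the-envelope sketch below estimates the size of one intermediate tensor. It rests on assumptions that are not confirmed by this PR: that the generic implementation materializes F32 per-head indexer scores of shape [n_kv, n_head_idx, n_ubatch] before reducing over heads, and that the indexer has 64 heads. The function name `indexer_scores_bytes` is hypothetical.

```cpp
#include <cstdint>

// Bytes of an F32 tensor holding per-head indexer scores with shape
// [n_kv, n_head_idx, n_ubatch].
// ASSUMPTION: the generic indexer materializes this tensor in full;
// the actual graph in the PR may differ.
constexpr std::uint64_t indexer_scores_bytes(std::uint64_t n_kv,
                                             std::uint64_t n_head_idx,
                                             std::uint64_t n_ubatch) {
    constexpr std::uint64_t f32_size = 4; // bytes per F32 element
    return n_kv * n_head_idx * n_ubatch * f32_size;
}

// Example under the assumptions above: 64 indexer heads,
// 65536 cached tokens, ubatch of 512 tokens -> 8 GiB for this one tensor.
static_assert(indexer_scores_bytes(65536, 64, 512) == (std::uint64_t{8} << 30),
              "64k context with ubatch 512");

// Shrinking to 16384 cached tokens and ubatch 128 -> 512 MiB.
static_assert(indexer_scores_bytes(16384, 64, 128) == (std::uint64_t{512} << 20),
              "16k context with ubatch 128");
```

Under these assumptions the estimate scales with the product of context length and ubatch size, so halving either one halves this buffer.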
There is also a tiny 16GB 4-layer DeepSeek V3.2 GGUF that does not produce coherent output but may be useful for testing the implementation.
Use the `models/templates/deepseek-ai-DeepSeek-V3.2.jinja` chat template when testing models.
Perplexity
I measured perplexity (on wiki.test.raw with 4k chunk size so that indexer does some actual work) of:
Final estimate: PPL = 2.9115 +/- 0.0146
Final estimate: PPL = 2.9126 +/- 0.01466
Final estimate: PPL = 3.0727 +/- 0.01577
Requirements