Support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation by fairydreaming · Pull Request #23346 · ggml-org/llama.cpp

fairydreaming · 2026-05-19T15:46:07Z

Warning: The DeepSeek V3.2 model conversion currently fails with transformers 5.x (required by requirements.txt after #21617 was merged). Downgrade transformers to 4.x (for example to 4.57.6) to convert the model.
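
A minimal sketch of a pre-flight check for the warning above, assuming the only constraint is the major version of transformers. The helper name `transformers_ok_for_conversion` is hypothetical and is not part of the conversion script.

```python
def transformers_ok_for_conversion(version: str) -> bool:
    """Return True if the given transformers version string is a pre-5.x release.

    Assumption: conversion of DeepSeek V3.2 works on 4.x and fails on 5.x,
    as stated in the warning; nothing finer-grained is checked here.
    """
    major = int(version.split(".")[0])
    return major < 5


# Example usage (requires transformers to be installed):
#   from importlib.metadata import version
#   if not transformers_ok_for_conversion(version("transformers")):
#       raise SystemExit("downgrade transformers to 4.x before converting")
```

The installed version is read with `importlib.metadata.version`, which is part of the Python standard library.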

Overview

This PR adds support for DeepseekV32ForCausalLM (DeepSeek V3.2 Exp, DeepSeek V3.2, DeepSeek V3.2 Speciale) models. It implements the lightning indexer and DeepSeek Sparse Attention (DSA) in generic GGML without adding any new OPs.

This PR is a continuation of PR #21149 (now closed).

Additional information

Covered areas

Areas covered by this PR:

conversion: support for DeepseekV32ForCausalLM architecture,
ggml-cpu: support for f16 GGML_OP_FILL,
memory: refactored llama_kv_cache constructor to include explicit hparams argument,
memory: added llama_kv_cache_dsa class, which aggregates two instances of llama_kv_cache: one for caching MLA latent representations and one for caching lightning indexer keys,
llama: added LLM_ARCH_DEEPSEEK32 architecture (mostly a copy of existing LLM_ARCH_GLM_DSA),
llama: implemented sparse attention by masking KQ mask elements corresponding to tokens not selected by the lightning indexer,
model: llama_model_deepseek32 implementation (mostly copied from llama_model_glm_dsa and llama_model_deepseek2)
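
To illustrate the masking approach listed above, here is a small pure-Python sketch of sparse attention expressed as a KQ mask. It is not the GGML graph from this PR: the function name, the `indexer_scores` layout and the `top_k` parameter are assumptions made for illustration only.

```python
import math


def sparse_kq_mask(indexer_scores, top_k):
    """Build an additive KQ mask that keeps only the top_k keys per query.

    indexer_scores[i][j] is a (hypothetical) lightning indexer score of key j
    for query i. Key j is a candidate for query i only if j <= i (causal).
    Selected positions get 0.0, all others get -inf.
    """
    mask = []
    for i, row in enumerate(indexer_scores):
        # causal candidates: keys 0..i
        visible = list(range(min(i + 1, len(row))))
        # keep the top_k candidates with the highest indexer score
        keep = set(sorted(visible, key=lambda j: row[j], reverse=True)[:top_k])
        mask.append([0.0 if j in keep else -math.inf for j in range(len(row))])
    return mask
```

Adding such a mask to the KQ logits before the softmax drives the weight of unselected tokens to zero, so the attention itself still runs as a dense operation. That matches the "no new OPs" goal, at the cost of not saving any attention compute.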

Testing

GGUFs for testing (Q8_0/Q4_K_M):

You need over 700GB (Q8_0) or over 400GB (Q4_K_M) of RAM/VRAM to run these models. The generic lightning indexer implementation uses very large compute buffers, so if you encounter out-of-memory errors, reduce the context size and/or the ubatch size.

There is also a tiny 16GB 4-layer DeepSeek V3.2 GGUF that does not produce coherent output but may be useful for testing the implementation.

Use the models/templates/deepseek-ai-DeepSeek-V3.2.jinja chat template when testing the models.

Perplexity

I measured perplexity (on wiki.test.raw with a 4k chunk size, so that the indexer does some actual work) of:

Q8_0 quant without lightning indexer (dense attention): Final estimate: PPL = 2.9115 +/- 0.0146
Q8_0 quant with lightning indexer (sparse attention): Final estimate: PPL = 2.9126 +/- 0.01466
NVFP4 quant with lightning indexer (sparse attention): Final estimate: PPL = 3.0727 +/- 0.01577
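
As a rough way to read these numbers, the sketch below compares two perplexity estimates against their reported uncertainties. Treating the two error terms as independent is an assumption (the runs share the same test set), and `ppl_delta` is a hypothetical helper, not something from llama.cpp.

```python
import math


def ppl_delta(ppl_a, err_a, ppl_b, err_b):
    """Return (absolute difference, relative difference, combined uncertainty).

    The combined uncertainty adds the two reported errors in quadrature,
    which assumes they are independent.
    """
    diff = ppl_b - ppl_a
    rel = diff / ppl_a
    combined = math.sqrt(err_a ** 2 + err_b ** 2)
    return diff, rel, combined


# Q8_0 dense vs. Q8_0 sparse, values taken from the list above
dense_vs_sparse = ppl_delta(2.9115, 0.0146, 2.9126, 0.01466)
# Q8_0 sparse vs. NVFP4 sparse
q8_vs_nvfp4 = ppl_delta(2.9126, 0.01466, 3.0727, 0.01577)
```

Under that assumption, the dense-to-sparse difference (0.0011) is far below the combined uncertainty (roughly 0.021), while the Q8_0-to-NVFP4 difference (about 0.16) is well above it.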

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation

ggerganov

Add a TODO so I don't forget to do the refactor:

diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index 0b0a56ce9..649269af6 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -93,6 +93,9 @@ public:
 
     using slot_info_vec_t = std::vector<slot_info>;
 
+    // TODO: refactor the memory instances to not depend on `llama_model`
+    //       instead pass all necessary info (e.g. hparams, dev layers, arch, etc.) directly
+    //       likely through `struct llama_memory_params`
     llama_kv_cache(
             const llama_model & model,
             const llama_hparams & hparams,

CISC

Support NVFP4 model.

CISC · 2026-05-25T08:34:05Z

+    res->t_embd = cur;
+
+    // lm_head
+    cur = ggml_mul_mat(ctx0, model.output, cur);


Why not build_lora_mm?

fairydreaming · 2026-05-28

> Why not build_lora_mm?

I guess nobody ever cared enough to add this to the DeepSeek code that I copied and modified in this PR, so it's kind of inherited.

Are there any standard conventions of which tensor matmuls should be LoRAble and which should be left alone?

am17an · 2026-05-28T05:20:49Z

@fairydreaming sorry for my ignorance, but does the flash model work with this same architecture? That requires way less VRAM and I can also test it out on my machine (I have 128GB vram)

fairydreaming · 2026-05-28T07:24:01Z

> @fairydreaming sorry for my ignorance, but does the flash model work with this same architecture? That requires way less VRAM and I can also test it out on my machine (I have 128GB vram)

@am17an There is no DeepSeek V3.2 Flash model. I'm currently trying to get NVFP4 quant to work as @CISC suggested, but it's still almost 400GB.

Edit: in case you meant DeepSeek V4 Flash then unfortunately the answer is no, it's something completely different from DeepSeek V3.2.

am17an · 2026-05-28T15:05:28Z

@fairydreaming yes I mean the DSV4 flash model. I just read up on it and you're right it's completely different, but the lightning indexer work you're doing here will be useful there. I will try and work on the flash model in the meantime

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

CISC · 2026-05-28T15:09:31Z

@fairydreaming GitHub UI messed up EOL again, please normalize to \n:
https://github.com/ggml-org/llama.cpp/actions/runs/26582700461/job/78319887350?pr=23346

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

CISC

I still have the build_lora_mm question, but otherwise LGTM.

fairydreaming · 2026-05-28T15:22:53Z

> @fairydreaming GitHub UI messed up EOL again, please normalize to \n: https://github.com/ggml-org/llama.cpp/actions/runs/26582700461/job/78319887350?pr=23346

@CISC Yeah I noticed, force-pushed a fixed commit.

fairydreaming · 2026-05-28T17:44:47Z

@CISC By the way I managed to convert and run nvidia/DeepSeek-V3.2-NVFP4 with your NVFP4 changes and it seems to work fine. Needed only regenerating model.safetensors.index.json as currently it misses NVFP4 scale tensors.
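
For reference, regenerating the index can be sketched as below. This is not the script used in this thread. It assumes the usual sharded-checkpoint layout (a `weight_map` from tensor name to shard file plus a `metadata.total_size` field) and reads only the safetensors headers (an 8-byte little-endian length followed by a JSON header). The function names are illustrative.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def build_index(model_dir):
    """Rebuild the contents of model.safetensors.index.json from all shards."""
    weight_map = {}
    total_size = 0
    for shard in sorted(Path(model_dir).glob("*.safetensors")):
        for name, info in read_safetensors_header(shard).items():
            if name == "__metadata__":  # optional free-form metadata entry
                continue
            weight_map[name] = shard.name
            begin, end = info["data_offsets"]
            total_size += end - begin
    return {"metadata": {"total_size": total_size}, "weight_map": weight_map}


# Example usage (the path is a placeholder):
#   index = build_index("path/to/DeepSeek-V3.2-NVFP4")
#   Path("path/to/DeepSeek-V3.2-NVFP4/model.safetensors.index.json").write_text(
#       json.dumps(index, indent=2))
```

Because every tensor present in a shard header ends up in `weight_map`, any scale tensors missing from the shipped index would be picked up automatically.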

@am17an I thought about DSV4 too, but still don't have a clear vision of how to integrate it with llama.cpp memory subsystem without creating a bunch of new specialized classes. But it's definitely a good idea to keep common parts reusable in both. I suppose one obvious next step is to add separate lightning indexer GGML OP as it brings immense compute buffer size reductions. But since DS V3.2 is kind of obsolete now I can chill a bit and take it easy. Anyway, please keep me posted about any progress, wish you luck!

CISC · 2026-05-28T18:07:02Z

> @CISC By the way I managed to convert and run nvidia/DeepSeek-V3.2-NVFP4 with your NVFP4 changes and it seems to work fine. Needed only regenerating model.safetensors.index.json as currently it misses NVFP4 scale tensors.

Weird, but great to hear it works, do you have BW hw, and if so how does performance compare?

fairydreaming · 2026-05-28T21:20:03Z

> > @CISC By the way I managed to convert and run nvidia/DeepSeek-V3.2-NVFP4 with your NVFP4 changes and it seems to work fine. Needed only regenerating model.safetensors.index.json as currently it misses NVFP4 scale tensors.
>
> Weird, but great to hear it works, do you have BW hw, and if so how does performance compare?

@CISC I have Epyc 9374F with a single RTX PRO 6000 Max-Q (BLACKWELL_NATIVE_FP4 = 1), experts were in RAM. Some llama-bench experiments I did:

Q8_0, --no-op-offload 0

./bin/llama-bench -m ../models/DeepSeek-V3.2-Q8_0.gguf -ncmoe 999 -ngl 99 -fa 1 -ub 512 -p 512 -n 32 -r 1 --no-op-offload 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -: | --------------: | -------------------: |
| deepseek32 685B.A37B Q8_0      | 678.56 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |           pp512 |         22.17 ± 0.00 |
| deepseek32 685B.A37B Q8_0      | 678.56 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |            tg32 |         10.91 ± 0.00 |

Q8_0, --no-op-offload 1

./bin/llama-bench -m ../models/DeepSeek-V3.2-Q8_0.gguf -ncmoe 999 -ngl 99 -fa 1 -ub 512 -p 512 -n 32 -r 1 --no-op-offload 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | fa | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -: | ---: | --------------: | -------------------: |
| deepseek32 685B.A37B Q8_0      | 678.56 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |    1 |           pp512 |         42.01 ± 0.00 |
| deepseek32 685B.A37B Q8_0      | 678.56 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |    1 |            tg32 |         10.97 ± 0.00 |

NVFP4, --no-op-offload 0

./bin/llama-bench -m ../models/DeepSeek-V3.2-NVFP4.gguf -ncmoe 999 -ngl 99 -fa 1 -ub 512 -p 512 -n 32 -r 1 --no-op-offload 0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -: | --------------: | -------------------: |
| deepseek32 685B.A37B NVFP4     | 386.79 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |           pp512 |         41.34 ± 0.00 |
| deepseek32 685B.A37B NVFP4     | 386.79 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |            tg32 |          1.82 ± 0.00 |

NVFP4, --no-op-offload 1

./bin/llama-bench -m ../models/DeepSeek-V3.2-NVFP4.gguf -ncmoe 999 -ngl 99 -fa 1 -ub 512 -p 512 -n 32 -r 1 --no-op-offload 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | fa | nopo |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -: | ---: | --------------: | -------------------: |
| deepseek32 685B.A37B NVFP4     | 386.79 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |    1 |           pp512 |          1.99 ± 0.00 |
| deepseek32 685B.A37B NVFP4     | 386.79 GiB |   685.36 B | CUDA       |  99 |        999 |  1 |    1 |            tg32 |          1.83 ± 0.00 |

build: 101bad432 (9403)

From what I understand, NVFP4 has horrible performance on the CPU and this slows everything down. I added some mul_mat backend op tests and they seem to confirm it:

Q8_0:

./bin/test-backend-ops perf -o "MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=512,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=512,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 319488 runs -     3.18 us/run -   4.19 MFLOP/run -   1.32 TFLOPS
  Backend CUDA0: OK
Backend 2/2: CPU
  Device description: AMD EPYC 9374F 32-Core Processor
  Device memory: 1160411 MB (1160411 MB free)

  MUL_MAT(type_a=q8_0,type_b=f32,m=4096,n=1,k=512,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 150732 runs -     6.67 us/run -   4.19 MFLOP/run - 628.48 GFLOPS
  Backend CPU: OK
2/2 backends passed
OK

NVFP4:

./bin/test-backend-ops perf -o "MUL_MAT(type_a=nvfp4,type_b=f32,m=4096,n=1,k=512,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
  MUL_MAT(type_a=nvfp4,type_b=f32,m=4096,n=1,k=512,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                294912 runs -     3.44 us/run -   4.19 MFLOP/run -   1.22 TFLOPS
  Backend CUDA0: OK
Backend 2/2: CPU
  Device description: AMD EPYC 9374F 32-Core Processor
  Device memory: 1160411 MB (1160411 MB free)

  MUL_MAT(type_a=nvfp4,type_b=f32,m=4096,n=1,k=512,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1):                 19080 runs -    55.78 us/run -   4.19 MFLOP/run -  75.20 GFLOPS
  Backend CPU: OK
2/2 backends passed
OK

so while Q8_0 on CPU works pretty fast, NVFP4 is like 8 times slower, basically unusable.
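
The figures above can be cross-checked with a few lines of arithmetic. The sketch assumes the usual 2*m*n*k FLOP count for a matrix multiplication, which is consistent with the 4.19 MFLOP/run reported by test-backend-ops. The helper names are illustrative.

```python
def mul_mat_flops(m, n, k):
    """FLOPs for an (m x k) by (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(flop, us_per_run):
    """Throughput in GFLOPS for a given FLOP count and time per run in microseconds."""
    return flop / (us_per_run * 1e-6) / 1e9


flop = mul_mat_flops(4096, 1, 512)      # m=4096, n=1, k=512 as in the tests above
cpu_q8_0 = gflops(flop, 6.67)           # CPU, Q8_0
cpu_nvfp4 = gflops(flop, 55.78)         # CPU, NVFP4
cpu_slowdown = 55.78 / 6.67             # NVFP4 vs. Q8_0 on the CPU
gpu_slowdown = 3.44 / 3.18              # NVFP4 vs. Q8_0 on CUDA0
```

The CPU slowdown comes out at about 8.4x, while on the GPU the two types are within roughly 10% of each other. Small deviations from the reported GFLOPS come from the rounded us/run values.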

CISC · 2026-05-29T07:01:38Z

> so while Q8_0 on CPU works pretty fast, NVFP4 is like 8 times slower, basically unusable.

Yeah, it's only useful if you can fit all the NVFP4 tensors on GPU. :(

am17an · 2026-05-29T07:03:03Z

Also the current NVFP4 CPU path is the "generic" path, probably an AVX impl would bring it up to par with the rest of the quants

JohannesGaessler · 2026-05-29T17:35:57Z

This PR broke the CI for test-llama-archs, see #23876.

CISC · 2026-05-29T17:54:30Z

> This PR broke the CI for test-llama-archs, see #23876.

See #23864 :)

…se Attention (DSA) implementation (ggml-org#23346)

* llama : support DeepSeek V3.2 model family (with DSA lightning indexer)
* convert : handle DeepseekV32ForCausalLM architecture
* ggml : support for f16 GGML_OP_FILL
* memory : separate hparams argument in llama_kv_cache constructor
* memory : add llama_kv_cache_dsa memory (KV cache + lightning indexer cache)
* llama : support for LLM_ARCH_DEEPSEEK32
* model : llama_model_deepseek32 implementation
* model : merge two scale operations into one in DSA lightning indexer implementation
* chore : remove unused code
* model : support NVFP4 in DeepSeek V3.2
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* memory : refactoring TODO
  Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

sszymczy added 3 commits May 19, 2026 07:22

model : merge two scale operations into one in DSA lightning indexer implementation

35b4d81

chore : remove unused code

069a774

fairydreaming requested review from CISC, JohannesGaessler and ggerganov as code owners May 19, 2026 15:46

github-actions bot added the model, testing, python and ggml labels May 19, 2026

Merge remote-tracking branch 'upstream/master' into deepseek-v32-minimal

cb855a3

ggerganov self-assigned this May 23, 2026

vanmilleru mentioned this pull request May 24, 2026

model: add DeepSeek V4 architecture #22607

Closed

ggerganov approved these changes May 25, 2026

CISC reviewed May 25, 2026

fairydreaming mentioned this pull request May 28, 2026

requirements : update transformers to 5.5.0 #21617

Merged

model : support NVFP4 in DeepSeek V3.2

4643fda

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

fairydreaming force-pushed the deepseek-v32-minimal branch from 30fdfe4 to 4643fda Compare May 28, 2026 15:09

sszymczy and others added 2 commits May 28, 2026 17:12

memory : refactoring TODO

4ce1f30

Co-authored-by: ggerganov <ggerganov@users.noreply.github.com>

Merge remote-tracking branch 'upstream/master' into deepseek-v32-minimal

101bad4

CISC approved these changes May 28, 2026

CISC merged commit 1f0aa2a into ggml-org:master May 29, 2026
32 checks passed

CISC mentioned this pull request May 29, 2026

graph : ensure DS32 kq_mask_lid is F32 #23864

Merged

JohannesGaessler mentioned this pull request May 29, 2026

test-llama-archs: disable DS 3.2 [no release] #23876

Closed
