Support for DeepseekV32ForCausalLM with DeepSeek Sparse Attention (DSA) by fairydreaming · Pull Request #21149 · ggml-org/llama.cpp

fairydreaming · 2026-03-29T12:56:48Z

Overview

This PR adds support for DeepseekV32ForCausalLM (DeepSeek V3.2 Exp, DeepSeek V3.2, DeepSeek V3.2 Speciale) models. It contains implementation of the lightning indexer and DeepSeek Sparse Attention (DSA) - both implemented in the simplest possible way as a proof of concept. So far only CPU and CUDA backends are supported.

~~Due to the way it's currently implemented it doesn't improve long context performance yet, more work is needed for this.~~ Long context performance was improved by using sparse top_k indices in CUDA flash attention MMA kernel.

Some GGUFs for testing are available here (-light models), I uploaded Q8_0/Q4_K_M quants, so you need over 700GB/400GB of RAM/VRAM to run them.

I also created a 16GB baby DeepSeek V3.2 GGUF for VRAM-deprived people. It outputs incoherent gibberish, but should be useful for testing and optimizing this implementation even with limited resources.

I really could use some help with verifying the implementation correctness. If you have large GPU cluster and can run some benchmarks to compare results with official reported benchmark results for DeepSeek V3.2 models then go for it. More details in #21183.

Fixes #16331, #20363

Additional information

Decisions I made when implementing this:

new model arch DEEPSEEK32 was added (mostly a copy of existing GLM_DSA arch),
sparse attention was implemented by masking KQ mask entries corresponding to tokens that are not in the set of top-k tokens selected by the lightning indexer,
~~for this purpose I added new GGML op GGML_OP_SCATTER that works similar to torch scatter_ operation but is currently limited to setting tensor elements at specified indices to a given scalar value,~~ replaced by SET_ROWS with 1-element rows
Hadamard transform ~~was added as another new GGML op GGML_OP_HADAMARD with implementation borrowed from ik_llama.cpp (thx @ikawrakow),~~ implementation from llama : rotate activations for better quantization #21038 was used in lightning indexer
KV cache was implemented as a new llama_kv_cache_dsa class which aggregates the usual llama_kv_cache that caches MLA latent representations (same as before for DeepSeek V3) and another new llama_ik_cache class (basically a copy of llama_kv_cache stripped of code related to V vector) that caches lightning indexer keys, two instances of llama_kv_cache - one for caching MLA latent representations, second for caching lightning indexer keys
since there are no official jinja templates for V3.2 and V3.2 Speciale, I simply decided to ignore this problem for now. You have to explicitly set chat template for these models (using jinja template from V3.2 Exp with these models will allow you to chat but tool calls won't work correctly). PR chat: dedicated DeepSeek v3.2 parser + "official" template #21785 added DeepSeek V3.2 chat template that you can use with --chat-template-file models/templates/deepseek-ai-DeepSeek-V3.2.jinja

Requirements

Due to limitations of the current CUDA ggml_top_k() implementation NVIDIA CUDA CCCL library (version >3.2) and enabling GGML_CUDA_USE_CUB during CUDA backend compilation is needed, otherwise the CUDA implementation will crash for context sizes larger than (I think) 1024 tokens. I use it with CUDA 13.2 and CCCL 13.2.27.
Bug in ggml_top_k() is now fixed, fix is merged, so it should work even on 2.[89] CUDA without CCCL.

Also if you want to convert the model by yourself, set add_bos_token to true in tokenizer_config.json before the model conversion - this is needed for DeepSeek V3.2 and DeepSeek V3.2 Speciale. The conversion script has assert that checks this.

Next Steps

I'd like to confirm my architectural choices regarding the implementation,
If they are accepted I will clean up the code if needed, merge with the current master and it will be ready for code review,
If not then So Long, and Thanks for All the Fish. Just joking, we can talk about this.

I have read and agree with the contributing guidelines
AI usage disclosure: YES, AI was used as an assistant helping me find bugs in CUDA kernel implementations.

…e attention). Needs manual change of add_bos_token to true in tokenizer_config.json before conversion.

…I think it's best not to quantize them.

…DeepSeek V3.2.

… ik_llama.cpp)

…er implementation

…indexer implementation since the former fails for large tensors even when using CCCL.

…ion.

… of llama_kv_cache and new llama_ik_cache (lightning indexer key cache). model : used new llama_kv_cache_dsa instead of modified llama_kv_cache with indexer keys in DeepseekV32ForCausalLM model : removed non-MLA path in DeepseekV32ForCausalLM

…lar to torch scatter_ operation.

…e can get rid of ggml_cast() calls in sparse attention implementation

…rm implementations

…orCausalLM-based models.

…rscores

… Ampere

…lash attention mma kernel.

…it does not improve the performance

…() to restrict cases where fattn-mma-f16 top_k optimization may be used.

CISC · 2026-05-06T09:37:56Z

+        if name.startswith("language_model."):
+            name = name.replace("language_model.", "")
+
+        # rename e_score_correction_bias tensors
+        if name.endswith("e_score_correction_bias"):
+            name = name.replace("e_score_correction_bias", "e_score_correction.bias")
+
+        # skip Multi-Token Prediction (MTP) layers


Suggested change

if name.startswith("language_model."):

name = name.replace("language_model.", "")

# rename e_score_correction_bias tensors

if name.endswith("e_score_correction_bias"):

name = name.replace("e_score_correction_bias", "e_score_correction.bias")

# skip Multi-Token Prediction (MTP) layers

# skip Multi-Token Prediction (MTP) layers

No longer needed after #22597

So, skip the check below.

CISC · 2026-05-06T09:44:00Z

+        if (num_nextn_predict_layers := self.hparams.get("num_nextn_predict_layers")) is not None:
+            self.gguf_writer.add_nextn_predict_layers(num_nextn_predict_layers)


Suggested change

if (num_nextn_predict_layers := self.hparams.get("num_nextn_predict_layers")) is not None:

self.gguf_writer.add_nextn_predict_layers(num_nextn_predict_layers)

if not self.skip_mtp:

if (num_nextn_predict_layers := self.hparams.get("num_nextn_predict_layers")) is not None:

self.gguf_writer.add_nextn_predict_layers(num_nextn_predict_layers)

I don't think it's a good idea, DeepSeek V3.2 model C++ code uses hparams.nextn_predict_layers to calculate number of non-mtp layers. Many other models does that, they all have in convert script:

self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)

so total number of layers includes MTP layers regardless of skip_mtp value. Then in C++ code:

int effective_n_layers = hparams.n_layer - hparams.nextn_predict_layers;

or similar.

Why DeepSeek V3.2 code shall behave differently?

Because skip_mtp strips those layers?

Point is you are mis-reporting the number of layers included in the GGUF.

CISC · 2026-05-06T11:46:49Z

+        self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)
+        self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)


Suggested change

self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)

self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)

if not self.skip_mtp:

self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)

self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)

Sorry, I added this earlier, then second-guessed it because of the additional check you had, but I think that check should go instead.

whoisjeremylam · 2026-05-07T02:59:09Z

Forgive my ignorance. Does a model need to be re-quantized to use this PR? ie GLM 5.1, which makes use of DSA.

fairydreaming · 2026-05-07T09:30:25Z

Forgive my ignorance. Does a model need to be re-quantized to use this PR? ie GLM 5.1, which makes use of DSA.

@whoisjeremylam As far as I know existing GLM 5.0/5.1 GGUFs already contain weights for indexer tensors, so it's only a matter of using them in glm-dsa.cpp implementation. I think there will be no need to requantize GGUFs to run them with DSA.

However, lightning indexer output may be sensitive to quantization of indexer tensors and for this reason it may be necessary to change their quantization level in the future.

DeepSeek V3.2 / V4-Flash use a sparse-attention 'lightning indexer' that scores compressed K vectors against per-head Q vectors via a fused mul_mat -> relu -> weighted-sum-over-heads pipeline. The graph emitted by build_attn_v4 today materializes that sequence as four discrete ggml ops (mul_mat, relu, mul, sum_rows), which costs multiple kernel launches per layer per token at decode and an intermediate [n_comp, n_heads, n_batch] score tensor that scales linearly with both context length and ubatch. This commit imports the WMMA + vector CUDA kernel originally written by Stanislaw Szymczyk for ggml-org/llama.cpp PR ggml-org#21149 (V3.2 DSA), later kept available on cchuter/llama.cpp (feat/v4-port). It does not yet wire the op into src/models/deepseek4.cpp -- the V4 indexer has three distinct shape regimes (decode, collapsed-q prefill, per-query prefill) that each need their own reshape adapter -- so this commit only: * adds GGML_OP_LIGHTNING_INDEXER to the op enum and bumps GGML_OP_COUNT/static_asserts to 98 * adds the ggml_lightning_indexer constructor in ggml.c with shape and dtype guards matching fairydreaming's reference * adds the CUDA dispatcher case + supports() entry. The supports() check restricts to the V3.2/V4 indexer config (n_embd=128, n_heads=64) and to the K dtypes the kernel actually instantiates (F32/F16/BF16/Q4_0/Q4_1/Q5_0/Q5_1/Q8_0). Other shapes return false so the scheduler keeps those ops on a backend that can run them. * imports the kernel implementation (WMMA path on Ampere+ NVIDIA, vector path on everything else, including HIP/MUSA stubs). Build clean; existing smoke tests still pass since the op isn't called yet. Co-authored-by: Stanislaw Szymczyk <sszymczy@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ggerganov

Edit: I just realized that by forcing ncols1 to 1 there will be no problem with mixing Q vectors from different tokens in one FA kernel Q tile, so top_k optimization should work for prompt processing as well. Will test it soon!

Nice. I suppose this is a sort of a stopgap solution until we implement a more optimized large-batch top-k kernel? Or do you think it is already the right approach for the prefill?

Also, I guess it works quite well with small batches (BS <= 8), for example for parallel decoding?

ggerganov · 2026-05-09T05:22:58Z

+    GGML_API struct ggml_tensor * ggml_lightning_indexer(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * q,
+        struct ggml_tensor  * k,
+        struct ggml_tensor  * weights,
+        float                 scale_embd,
+        float                 scale_heads);
+


Can this OP be used in DeepSeek v4?

fairydreaming · 2026-05-09T07:15:13Z

Edit: I just realized that by forcing ncols1 to 1 there will be no problem with mixing Q vectors from different tokens in one FA kernel Q tile, so top_k optimization should work for prompt processing as well. Will test it soon!

Nice. I suppose this is a sort of a stopgap solution until we implement a more optimized large-batch top-k kernel? Or do you think it is already the right approach for the prefill?

Also, I guess it works quite well with small batches (BS <= 8), for example for parallel decoding?

@ggerganov I think it's more like a stopgap solution - best performing one currently achievable with minimal changes. I tried to optimize it further by copying topk tiles to shared memory, but it reduced the performance. I'm hardly a CUDA expert, so waiting for @JohannesGaessler opinion on this.

I see that in V4 there's even more complicated approach - a dense part of attention with SWA (128 tokens) and a sparse top-k (1024) based part that uses compressed KV cache. But maybe a sparse kernel still could be used by always including the dense part in top-k indices.

ggerganov · 2026-05-10T09:50:21Z

+
+                // store indexer keys to KV cache
+                const auto * mctx_lid = inp_attn_dsa->mctx->get_lid();
+                const auto & k_idxs_lid = inp_attn_dsa->get_k_idxs_lid();


Suggested change

const auto & k_idxs_lid = inp_attn_dsa->get_k_idxs_lid();

const auto * k_idxs_lid = inp_attn_dsa->get_k_idxs_lid();

ggerganov · 2026-05-10T09:56:29Z


+    GGML_API void ggml_flash_attn_ext_add_top_k(
+            struct ggml_tensor * a,
+            struct ggml_tensor * top_k);


Semantically, is it important here that these are "Top-K" indices? If I understand correctly, this is more generic than top-k - it's just a list of any indices.

If yes, I think the name should reflect that. Instead of top_k, consider using ggml_flash_attn_ext_add_idxs().

JohannesGaessler · 2026-05-12T14:57:31Z

I tried to optimize it further by copying topk tiles to shared memory, but it reduced the performance. I'm hardly a CUDA expert, so waiting for @JohannesGaessler opinion on this.

My opinion is that you should initially only add CPU support as is clearly laid out in the contributing guidelines and add CUDA support in a follow-up PR. It is way more work for me to review the changes if they're in this PR.

fairydreaming · 2026-05-12T16:49:00Z

I tried to optimize it further by copying topk tiles to shared memory, but it reduced the performance. I'm hardly a CUDA expert, so waiting for @JohannesGaessler opinion on this.

My opinion is that you should initially only add CPU support as is clearly laid out in the contributing guidelines and add CUDA support in a follow-up PR. It is way more work for me to review the changes if they're in this PR.

Oh, sorry. I don't want to be a burden. Closing then.

sszymczy added 26 commits March 12, 2026 13:15

model : Initial support for DeepseekV32ForCausalLM (for now with dens…

a337ebd

…e attention). Needs manual change of add_bos_token to true in tokenizer_config.json before conversion.

model : added indexer q and k calculation in DeepseekV32ForCausalLM.

e467684

ggml : add Hadamard transform GGML OP and implementation

723f0ce

kv-cache : add cache for indexer keys (temporary solution)

72b7214

convert : DSA indexer weights are bf16 in the original fp8 model, so …

961bc95

…I think it's best not to quantize them.

model : crude proof-of-concept implementation of the DSA indexer for …

9a63e7a

…DeepSeek V3.2.

ggml : add CUDA Hadamard transformation implementation (borrowed from…

3eb340e

… ik_llama.cpp)

ggml : add new GGML_OP_WHERE_ID (akin to torch where but using indices)

08dc7fd

model : used new GGML_OP_WHERE_ID op in DeepSeek V3.2 lightning index…

998f496

…er implementation

model : handle multiple streams in DeepSeek V3.2 lightning indexer

6c9d773

ggml : handle multiple streams in CUDA GGML_OP_WHERE_ID implementation

cb94b56

kv-cache : fix crashes for models without indexer

02c2159

model : replaced ggml_argsort_top_k with ggml_top_k in DeepSeek V3.2 …

e7aa89a

…indexer implementation since the former fails for large tensors even when using CCCL.

model : added comments in DeepSeek V3.2 lightning indexer implementat…

1874ac9

…ion.

ggml : replaced GGML_OP_WHERE_ID with GGML_OP_SCATTER that works simi…

9b0a4ee

…lar to torch scatter_ operation.

ggml : added inplace version of GGML_OP_SCATTER and tests for this OP

0ee5d80

gguf-py : removed obsolete KV_B tensor from DEEPSEEK32 arch

7f5578f

convert : make pyright happy

54945c7

ggml : added f16 version of GGML_OP_SCATTER

5677f08

ggml : added f16 version of GGML_OP_FILL

1c830a1

model : GGML_OP_SCATTER AND GGML_OP_FILL now work with f16 data, so w…

83a0313

…e can get rid of ggml_cast() calls in sparse attention implementation

ggml : fix bug in CUDA Hadamard transform implementation

6011bdd

ggml : simplified testing for nh being power of 2 in Hadamard transfo…

4aec6a8

…rm implementations

ggml : added test for GGML_OP_HADAMARD

a74d83a

convert : check if add_bos_token is true when converting DeepseekV32F…

5b9ce6c

…orCausalLM-based models.

fairydreaming requested review from a team, CISC and ggerganov as code owners March 29, 2026 12:56

fairydreaming marked this pull request as draft March 29, 2026 12:56

sszymczy added 2 commits May 4, 2026 10:30

Merge remote-tracking branch 'upstream/master' into deepseek-dsa

d9e33a8

chore : rename lightning indexer kernel to use dashes instead of unde…

8e6256a

…rscores

ngxson mentioned this pull request May 4, 2026

model: add DeepSeek V4 architecture #22607

Closed

sszymczy added 8 commits May 4, 2026 16:20

ggml : reduce shared memory usage in lightning indexer vector kernel

ca6fc8e

ggml : empty lightning_indexer_kernel_wmma for compilation on archs <…

cdf27d2

… Ampere

Merge remote-tracking branch 'upstream/master' into deepseek-dsa

5278aa1

chore : whitespace formatting

7825cca

ggml : set tile_top_k parameter to nullptr if use_top_k is false in f…

1d68f58

…lash attention mma kernel.

ggml : remove half type usage from lightning indexer vector kernel - …

61e62be

…it does not improve the performance

Merge remote-tracking branch 'upstream/master' into deepseek-dsa

11ab41f

ggml : add constexpr ggml_cuda_flash_attn_ext_mma_f16_shall_use_top_k…

5d16ac1

…() to restrict cases where fattn-mma-f16 top_k optimization may be used.

fairydreaming marked this pull request as ready for review May 6, 2026 09:31

fairydreaming requested review from IMbackK and JohannesGaessler as code owners May 6, 2026 09:31

CISC reviewed May 6, 2026

View reviewed changes

sszymczy added 2 commits May 6, 2026 12:56

Merge remote-tracking branch 'upstream/master' into deepseek-dsa

4fd30b5

convert : no longer needed after ggml-org#22597

b257352

CISC reviewed May 6, 2026

View reviewed changes

convert : include MTP layer during conversion of DeepseekV32ForCausalLM.

91502cb

cchuter mentioned this pull request May 7, 2026

Model request: DeepSeek V4 Series #22319

Open

ggerganov reviewed May 9, 2026

View reviewed changes

ggerganov reviewed May 10, 2026

View reviewed changes

ggerganov self-assigned this May 10, 2026

fairydreaming closed this May 12, 2026

		if (num_nextn_predict_layers := self.hparams.get("num_nextn_predict_layers")) is not None:
		self.gguf_writer.add_nextn_predict_layers(num_nextn_predict_layers)

		self.block_count = self.hparams["num_hidden_layers"] + self.hparams.get("num_nextn_predict_layers", 0)
		self.tensor_map = gguf.get_tensor_name_map(self.model_arch, self.block_count)

	const auto & k_idxs_lid = inp_attn_dsa->get_k_idxs_lid();
	const auto * k_idxs_lid = inp_attn_dsa->get_k_idxs_lid();

Conversation

fairydreaming commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Additional information

Requirements

Next Steps

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

whoisjeremylam commented May 7, 2026

Uh oh!

fairydreaming commented May 7, 2026

Uh oh!

ggerganov left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fairydreaming commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JohannesGaessler commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fairydreaming commented May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

fairydreaming commented Mar 29, 2026 •

edited

Loading

fairydreaming commented May 9, 2026 •

edited

Loading

JohannesGaessler commented May 12, 2026 •

edited

Loading