graph : ensure DS32 kq_mask_lid is F32 #23864

Merged: CISC merged 1 commit into master from cisc/graph-ds32-lid-mask-fix on May 29, 2026

Conversation

CISC (Member) commented May 29, 2026

Overview

cont #23346
cont #23764

Additional information

Since build_attn_inp_kq_mask returns an F16 mask when flash attention is enabled, pass a modified copy of cparams when building kq_mask_lid, so that this mask is F32 at the point where it is added to indexer_score:

// mask indexer scores
ggml_tensor * indexer_kq_mask = inp_attn_dsa->get_kq_mask_lid();
indexer_score = ggml_add(ctx0, indexer_score, indexer_kq_mask);
cb(indexer_score, "indexer_score", il);

This is a bit hacky, open for better solutions. cc/ @am17an
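
The approach described above can be sketched outside of llama.cpp as follows. This is a minimal illustration only: `DType`, `Params`, `Tensor`, `build_mask`, `build_mask_lid` and `add` are hypothetical stand-ins, not the real llama.cpp or ggml types, and the assumption that the modified field is the flash attention flag is a guess based on the description, not taken from the diff.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins; none of these are the real llama.cpp/ggml types.
enum class DType { F16, F32 };

struct Params {
    bool flash_attn = false;
};

struct Tensor {
    DType type;
};

// Toy mask builder: like the behaviour described above, it picks F16
// whenever flash attention is enabled in the params it is given.
Tensor build_mask(const Params & p) {
    return Tensor{ p.flash_attn ? DType::F16 : DType::F32 };
}

// Toy add that rejects mixed operand types (assumed here to mirror the
// failing backend op; the real op's rules may differ).
Tensor add(const Tensor & a, const Tensor & b) {
    if (a.type != b.type) {
        throw std::invalid_argument("add: operand types differ");
    }
    return Tensor{ a.type };
}

// Build the indexer mask from a modified copy of the params, so the
// caller's settings stay untouched and the mask comes out as F32.
Tensor build_mask_lid(const Params & p) {
    Params p_lid = p;          // copy, do not mutate the original
    p_lid.flash_attn = false;  // assumption: this flag is what selects F16
    return build_mask(p_lid);
}
```

With `flash_attn` set, `add(score, build_mask(p))` throws for an F32 `score` in this toy model, while `add(score, build_mask_lid(p))` succeeds and `p` itself is unchanged.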

Requirements

CISC requested a review from ggerganov May 29, 2026 10:48
am17an (Contributor) commented May 29, 2026

Does this mask need to be f32?

CISC (Member, Author) commented May 29, 2026

> Does this mask need to be f32?

Either that or we have to cast indexer_score to F16.
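
The alternative mentioned here, casting the scores to the mask's type instead of forcing the mask to F32, might look roughly like this in a toy model. `Tensor`, `cast`, `add` and `add_mask_casting_scores` are hypothetical helpers for illustration; in ggml the cast would presumably be a `ggml_cast` on indexer_score, but that is not what the PR does. The toy `cast` only changes the type tag and does not model F16 rounding.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins; not the real ggml tensor or type enum.
enum class DType { F16, F32 };

struct Tensor {
    DType              type;
    std::vector<float> data;
};

// Toy cast: retags the tensor, values are kept as-is (no F16 rounding).
Tensor cast(const Tensor & t, DType to) {
    return Tensor{ to, t.data };
}

// Toy element-wise add that requires matching types and sizes.
Tensor add(const Tensor & a, const Tensor & b) {
    if (a.type != b.type || a.data.size() != b.data.size()) {
        throw std::invalid_argument("add: operand types or sizes differ");
    }
    Tensor out{ a.type, a.data };
    for (std::size_t i = 0; i < out.data.size(); ++i) {
        out.data[i] += b.data[i];
    }
    return out;
}

// Alternative to an F32 mask: bring the scores to the mask's type first.
Tensor add_mask_casting_scores(const Tensor & score, const Tensor & mask) {
    return add(cast(score, mask.type), mask);
}
```

In this sketch the result takes the mask's type, so everything downstream of the add would see F16 scores, which is the trade-off against keeping the mask in F32.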

fairydreaming (Collaborator) commented

So... I checked how DeepSeek V3.2 works in master (a couple of hours too late) and ended up here. But this PR helps: the ggml_cuda_op_add error is gone.

CISC merged commit 764f1e6 into master May 29, 2026
27 checks passed
CISC deleted the cisc/graph-ds32-lid-mask-fix branch May 29, 2026 17:55
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 29, 2026
* origin/master:
vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
server: remove obsolete scripts (ggml-org#23870)
ci : update macos release to use macos-26 runner (ggml-org#23878)
download: add option to skip_download (ggml-org#23059)
mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
server: bump timeout to 3600s (ggml-org#23842)
model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
llama: use f16 mask for FA to save VRAM (ggml-org#23764)
sync : ggml
ggml : bump version to 0.13.1 (ggml/1523)
ngram-mod : Add missing include (ggml-org#23857)
llama: add llm_graph_input_mtp (ggml-org#23643)
app : move licences to llama-app (ggml-org#23824)
cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
4 participants