[DSV4] Attention accumulation in model dtype #41533

Closed
kylesayrs wants to merge 2 commits into vllm-project:main from
neuralmagic:kylesayrs/bf16-accumulate-dsv4-indexer

Conversation

@kylesayrs
Contributor

@kylesayrs kylesayrs commented May 3, 2026

Purpose

  • Prerequisite for [WIP] [DSV4] Quantization Support #41276
    • NVFP4 kernels do not have a pathway to specify an output dtype; they instead follow the dtype of the activations (typically bf16)
  • Minor throughput/latency improvement with no accuracy loss (slight accuracy improvement)

Changes

  • Replace the out_dtype argument of the compressor and indexer operations with bfloat16
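As a minimal sketch of what the change means (not the PR's actual code; shapes and names are hypothetical), the affected matmuls previously requested a float32 output and now simply follow the bf16 model dtype:

```python
import torch

# Hypothetical shapes standing in for the indexer/compressor matmuls.
q = torch.randn(128, 64, dtype=torch.bfloat16)
w = torch.randn(64, 128, dtype=torch.bfloat16)

# Before: compute in float32, then cast back to the model dtype.
out_fp32 = (q.float() @ w.float()).to(torch.bfloat16)

# After: let the matmul follow the activation dtype (bf16), which is the
# behavior NVFP4 kernels require since they expose no out_dtype knob.
out_bf16 = q @ w

# The two results are close; most backends still accumulate the
# partial products internally in fp32 even for bf16 inputs.
max_err = (out_fp32.float() - out_bf16.float()).abs().max()
```

The eval numbers in the Testing section below suggest the reduced output precision is benign for this model.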

Testing

vllm serve deepseek-ai/DeepSeek-V4-Flash --tensor-parallel-size 4 --port 8089 --kv_cache_dtype="fp8"

lm_eval \
   --model local-completions \
   --model_args model=deepseek-ai/DeepSeek-V4-Flash,base_url=http://localhost:8089/v1/completions,tokenized_requests=False,trust_remote_code=True \
   --tasks longbench_summarization \
   --output_path ./

| Metric                    | out_dtype=float32 | out_dtype=bfloat16 |
|---------------------------|-------------------|--------------------|
| MMLU                      | 0.677             | 0.680              |
| GSM8K                     | 0.938             | 0.945              |
| LongBenchV2_Summarization | 0.2007            | 0.2029             |
| Tok/s (mmlu)              | 163.546           | 165.546            |

kylesayrs added 2 commits May 2, 2026 20:27
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes the explicit out_dtype=torch.float32 argument from torch.mm calls within the compressor_kv_score and indexer_compressor_kv_score functions in the DeepSeek V4 attention layer. I have no feedback to provide.

@kylesayrs
Contributor Author

Closing due to correctness concerns raised by @robertgshaw2-redhat, in favor of passing out_dtype to the linears instead.

@kylesayrs kylesayrs closed this May 7, 2026