[DSV4] Attention accumulation in model dtype #41533

Closed
kylesayrs wants to merge 2 commits into vllm-project:main from
neuralmagic:kylesayrs/bf16-accumulate-dsv4-indexer

Conversation

@kylesayrs
Contributor

@kylesayrs kylesayrs commented May 3, 2026

Purpose

  • Prerequisite for [WIP] [DSV4] Quantization Support #41276
    • NVFP4 kernels do not have a pathway to specify an output dtype; they instead follow the dtype of the activations (typically bf16)
  • Minor throughput/latency improvement with no accuracy loss (slight accuracy improvement)

Changes

  • Replace the out_dtype argument of the compressor and indexer operations with bfloat16
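As a minimal sketch of what the change means (not the PR's actual code; shapes and names are hypothetical), the affected matmuls previously requested a float32 output and now simply follow the bf16 model dtype:

```python
import torch

# Hypothetical shapes standing in for the indexer/compressor matmuls.
q = torch.randn(128, 64, dtype=torch.bfloat16)
w = torch.randn(64, 128, dtype=torch.bfloat16)

# Before: compute in float32, then cast back to the model dtype.
out_fp32 = (q.float() @ w.float()).to(torch.bfloat16)

# After: let the matmul follow the activation dtype (bf16), which is the
# behavior NVFP4 kernels require since they expose no out_dtype knob.
out_bf16 = q @ w

# The two results are close; most backends still accumulate the
# partial products internally in fp32 even for bf16 inputs.
max_err = (out_fp32.float() - out_bf16.float()).abs().max()
```

The eval numbers in the Testing section below suggest the reduced output precision is benign for this model.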

Testing

vllm serve deepseek-ai/DeepSeek-V4-Flash --tensor-parallel-size 4 --port 8089 --kv_cache_dtype="fp8"

lm_eval \
   --model local-completions \
   --model_args model=deepseek-ai/DeepSeek-V4-Flash,base_url=http://localhost:8089/v1/completions,tokenized_requests=False,trust_remote_code=True \
   --tasks longbench_summarization \
   --output_path ./

| Metric                    | out_dtype=float32 | out_dtype=bfloat16 |
|---------------------------|-------------------|--------------------|
| MMLU                      | 0.677             | 0.680              |
| GSM8K                     | 0.938             | 0.945              |
| LongBenchV2_Summarization | 0.2007            | 0.2029             |
| Tok/s (mmlu)              | 163.546           | 165.546            |

kylesayrs added 2 commits May 2, 2026 20:27
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes the explicit out_dtype=torch.float32 argument from torch.mm calls within the compressor_kv_score and indexer_compressor_kv_score functions in the DeepSeek V4 attention layer. I have no feedback to provide.

@kylesayrs
Contributor Author

Closing due to correctness concerns raised by @robertgshaw2-redhat, in favor of passing out_dtype to the linears instead.

@kylesayrs kylesayrs closed this May 7, 2026