[WIP] [DSV4] Quantization Support#41276
Conversation
Code Review
This pull request updates the DeepseekV4 model implementation by adding a packed_modules_mapping for fused layers and implementing a safe initialization for scale_fmt that defaults to 'ue8m0' when the quantization configuration is missing or not a dictionary. I have no feedback to provide.
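A minimal sketch of the defaulting behavior described above, assuming the scale format is read from the Hugging Face quantization config; the function and argument names here are illustrative, not the exact vLLM code:

```python
def resolve_scale_fmt(hf_config) -> str:
    """Return the configured scale_fmt, falling back to 'ue8m0'.

    Hypothetical helper: the fallback applies when the quantization
    config is missing or is not a dictionary.
    """
    quant_cfg = getattr(hf_config, "quantization_config", None)
    if not isinstance(quant_cfg, dict):
        return "ue8m0"
    return quant_cfg.get("scale_fmt", "ue8m0")
```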
dsikka
left a comment
This pathway is likely going to see multiple updates over the next few weeks, so it would be good to add some form of smoke test.
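For example, a minimal pytest sketch along these lines could serve as that smoke test; the checkpoint name and generation settings are placeholders, assuming the vLLM offline `LLM` API:

```python
import pytest
from vllm import LLM, SamplingParams


@pytest.mark.parametrize("model_id", ["RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8"])
def test_dsv4_quantized_smoke(model_id):
    # Load the quantized checkpoint and generate a few tokens greedily.
    llm = LLM(model=model_id, tensor_parallel_size=4)
    params = SamplingParams(max_tokens=8, temperature=0.0)
    outputs = llm.generate(["The capital of France is"], params)
    # A smoke test only asserts that generation produces non-empty text.
    assert outputs and outputs[0].outputs[0].text.strip()
```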
Force-pushed from 6a81e43 to 33f36d4
Force-pushed from 1e6b8a1 to f910a73
FYI, I'm seeing a slight accuracy loss with this model. I've ruled out output_dtype as the cause in #41533, which makes me suspect the quantization of the indexer/compressor wkv weights. I'm currently updating the checkpoint to skip that quantization and will post accuracy evaluations.
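A sketch of how the checkpoint could be regenerated with LLM Compressor so those weights are left unquantized, assuming a QuantizationModifier ignore list; the module-name regexes and base model id are assumptions, not the exact values used for this checkpoint:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize Linear weights to NVFP4 while skipping the indexer / compressor
# wkv projections. The regexes below are assumed module-name patterns.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*indexer.*",      # assumed indexer projection pattern
        "re:.*kv_a_proj.*",    # assumed compressor wkv projection pattern
    ],
)

# Base model id is illustrative; NVFP4 typically needs a calibration
# dataset for global scales, omitted here for brevity.
oneshot(model="deepseek-ai/DeepSeek-V4-Flash", recipe=recipe)
```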
Force-pushed from f910a73 to f5fc438
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed from f5fc438 to 322ca21
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
DeepSeek-V4-Flash-NVFP4-FP8
Model Optimizations
This model was obtained using the following LLM Compressor branch: vllm-project/llm-compressor#2647
Deployment
vllm serve RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8 --tensor-parallel-size 4 --port 8089 --kv_cache_dtype="fp8"
Evaluation
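A minimal sanity check against the served endpoint via the OpenAI-compatible API, assuming the port from the deployment command above; the prompt and client settings are illustrative and not an accuracy benchmark:

```python
from openai import OpenAI

# Port 8089 matches the serve command above; api_key is unused by the local server.
client = OpenAI(base_url="http://localhost:8089/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    max_tokens=16,
)
print(response.choices[0].message.content)
```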
For more details on how this model was created and run in LLM Compressor, please contact Kyle Sayers on the vLLM Slack: https://communityinviter.com/apps/vllm-dev/join-vllm-developers-slack