
[Feat] DeepSeek V4 Rebased #40860

Merged
ywang96 merged 25 commits into vllm-project:main from ivanium:feat/dsv4-support
Apr 27, 2026

Conversation

@ivanium (Contributor) commented Apr 25, 2026

Purpose

Rebased version of #40760

Roadmap: #40902

Co-authored-by: Bugen Zhao, Giancarlo Delfin, Jie Li, Kaichao You, Roy Wang, Woosuk Kwon, Yifan Qiao, Yongye Zhu, Zhewen Li, Zijing Liu, Zixi Qi

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@ywang96 (Member) commented Apr 27, 2026

Merging this PR since the failed tests are not related.

Great work, and many thanks to everyone who worked on the DSv4 effort!

@ywang96 ywang96 merged commit 4d51588 into vllm-project:main Apr 27, 2026
197 of 202 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 27, 2026
tonyliu312 pushed a commit to tonyliu312/vllm that referenced this pull request Apr 27, 2026
On SM 12.x (RTX 50-series, GB10/DGX Spark), Marlin and Marlin-MoE kernels
are currently absent from the compiled `_C.so` / `_moe_C.so`. The driver
JIT-promotes the `8.0+PTX` fallback to PTX-as-SM-12.x at first use, but
the resulting cubin produces silently-wrong outputs on Marlin-MoE
(observed: V4-Flash MoE forward emits gibberish tokens on a GB10 box,
while the same model on Hopper emits coherent text). Note that PTX-JIT
correctness is not guaranteed across major arch jumps; this is the
expected failure mode of relying on `8.0+PTX` for sm_120/sm_121.

`MARLIN_ARCHS`, `MARLIN_BF16_ARCHS`, and `MARLIN_MOE_ARCHS` in
CMakeLists.txt do not list `12.0;12.1`, so the build omits native
sm_120/sm_121 ELF entries from the kernel object. The neighbouring
`MARLIN_FP8_ARCHS` and `MARLIN_MOE_FP8_ARCHS` already include
`8.9;12.0;12.1`, so the precedent for SM 12.x in this file is set;
this change extends the same pattern to the BF16/FP16 paths.

Add `12.0;12.1` to the three arch lists. After rebuild on a GB10:
`cuobjdump --list-elf _moe_C.abi3.so | grep sm_121` returns 22 native
sm_121 ELF entries (was 0), and V4-Flash MoE forward output becomes
coherent (verified haiku generation, 6.28 t/s steady on dual DGX Spark
TP=2, max_tokens=80, single request).
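The `cuobjdump --list-elf` check above can be automated; the helper below is a hypothetical sketch (the sample listing is illustrative, not the real `_moe_C.abi3.so` output, which shows 22 `sm_121` entries on a fixed GB10 build):

```python
def count_native_elf(cuobjdump_output: str, arch: str) -> int:
    """Count native ELF entries for a given SM arch in the
    output of `cuobjdump --list-elf` (one entry per line)."""
    return sum(1 for line in cuobjdump_output.splitlines() if arch in line)

# Illustrative sample of a `cuobjdump --list-elf _moe_C.abi3.so` listing.
sample = """\
ELF file    1: _moe_C.1.sm_80.cubin
ELF file    2: _moe_C.2.sm_121.cubin
ELF file    3: _moe_C.3.sm_121.cubin
"""
print(count_native_elf(sample, "sm_121"))  # 2 in this sample; 0 before the fix
```

A count of zero for the target arch means the driver will fall back to PTX JIT, which is exactly the silently-wrong-output failure mode described above.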

Refs vllm-project#40860 (V4 rebase touches the build matrix, no overlap with this
arch-list change)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
tonyliu312 pushed a commit to tonyliu312/vllm that referenced this pull request Apr 27, 2026
`w8a8_triton_block_scaled_mm` falls back to a hardcoded default config
when no pre-tuned `configs/N=*,K=*,device_name=*.json` file matches the
GPU. The default uses `BLOCK_SIZE_M=64`, which wastes 98% of the M
dimension in single-request decode (M=1). GPUs without a pre-tuned JSON
file for their (N, K, device) tuple pay this cost.

Narrow the change: only specialize the M<=8 case (single-request decode
and short MTP-style draft batches). Larger M keeps the previous default
unchanged so non-decode paths and tuned configs are not perturbed.

  M <= 8 (CUDA)   -> BLOCK_SIZE_M=16, num_stages=3   (new)
  M <= 8 (ROCm)   -> BLOCK_SIZE_M=16, num_stages=2   (new)
  else            -> BLOCK_SIZE_M=64, num_stages=2   (previous default)

num_stages=3 is gated to non-ROCm because MI300/MI250X LDS (64 KB) is
borderline for 3-stage Triton pipelining at typical [128, 128] block
sizes; on ROCm we keep num_stages=2 so the M<=8 branch still gets the
BLOCK_SIZE_M=16 wave-quantisation win without LDS pressure.

Pre-tuned JSON configs are unaffected (they short-circuit before this
branch). Workloads that already have a JSON for their (N, K, device)
get the same kernel as before.
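The fallback selection described above reduces to a small dispatch. A minimal sketch, assuming the table above (the function name and dict shape are illustrative, not vLLM's actual API):

```python
def default_w8a8_block_mm_config(M: int, is_rocm: bool) -> dict:
    """Fallback Triton config, used only when no pre-tuned JSON matches."""
    if M <= 8:
        # Single-request decode / short MTP-style draft batches:
        # BLOCK_SIZE_M=64 would waste 98% of the M dimension at M=1.
        # ROCm keeps num_stages=2 to avoid LDS pressure on MI300/MI250X.
        return {"BLOCK_SIZE_M": 16, "num_stages": 2 if is_rocm else 3}
    # Larger M keeps the previous default so non-decode paths are untouched.
    return {"BLOCK_SIZE_M": 64, "num_stages": 2}

print(default_w8a8_block_mm_config(1, is_rocm=False))
# {'BLOCK_SIZE_M': 16, 'num_stages': 3}
```

Hosts with a tuned JSON for their (N, K, device) tuple never reach this function, so their kernel choice is unchanged.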

Verified on dual DGX Spark (GB10, sm_121, TP=2) running V4-Flash:
median single-request decode goes from 5.45 t/s to 6.73 t/s (+23%) with
no other changes. Output remains coherent. The win is expected to
generalize to other architectures lacking a pre-tuned JSON for the
target (N, K) pair, but only the GB10 case is verified here; reviewers
on Hopper/Ampere are welcome to confirm or push back.

Refs vllm-project#40860 (V4 rebase), vllm-project#40899 (jasl SM12x scope is orthogonal)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
@tonyliu312

Congrats @ivanium @zyongye @ywang96 — landing V4 in main is a big step. Quick heads-up for the sm_120 / sm_121 crowd (DGX Spark / GB10 / RTX 50-series users) who will pull main and try to deploy:

To get a working V4 / V4-Flash / V4-Pro on sm_12x out of post-#40860 main, two small follow-ups are still needed (both rebased clean on top of this merge):

  • [Kernel] Marlin MoE: include SM 12.x in default arch list #40923 [Kernel] Marlin MoE: include SM 12.x in default arch list — adds 12.0;12.1 to MARLIN_ARCHS / MARLIN_BF16_ARCHS / MARLIN_MOE_ARCHS, mirroring the existing MARLIN_FP8_ARCHS = "8.9;12.0;12.1" precedent. Without it, the 8.0+PTX JIT fallback produces no native sm_12x cubin and gives silently-wrong outputs on Marlin-MoE (verified end-to-end: gibberish → coherent on dual DGX Spark TP=2; independently re-verified by @idonati on 8× DGX Spark TP=8 running V4-Pro at DeepSeek V4 support on SM12x with Triton sparse MLA fallback #40899).
  • [Kernel] Tune default fp8 block-scaled Triton config for M<=8 decode #40925 [Kernel] Tune default fp8 block-scaled Triton config for M<=8 decode — narrow specialisation of the w8a8_triton_block_scaled_mm fallback default (only when no tuned JSON matches): BLOCK_SIZE_M=16, num_stages=3 for M <= 8, larger M unchanged. ROCm gated to num_stages=2 per gemini-code-assist review. +23% on GB10 V4-Flash single-request decode, no regression possible for M > 8 or for hosts with a tuned JSON.

Both are review-clean since 04-26 16:16 UTC (gemini-code-assist closed all concerns), CI gated only on first-time-contributor ready label (cc CODEOWNERs: @LucasWilkinson @tlrmchlsmth for the CMakeLists change, @mgoin @tlrmchlsmth for the fp8_utils tune). Posting here primarily as a heads-up for sm_12x users grabbing main today, not as a label nudge.

Thanks again for the V4 work.

```python
    return Mxfp4MoeBackend.NONE, None


def select_mxfp4_moe_backend(
```
Contributor

imo we shouldn't create separate select_gpt_oss_mxfp4_moe_backend and select_mxfp4_moe_backend

what's the reason that these two can't be merged?

cc @mgoin , @robertgshaw2-redhat


Labels

ci/build · cpu (Related to CPU backends) · deepseek (Related to DeepSeek models) · documentation (Improvements or additions to documentation) · frontend · gpt-oss (Related to GPT-OSS models) · kv-connector · new-model (Requests to new models) · nvidia · ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs) · speculative-decoding · tool-calling · v1 · verified (Run pre-commit for new contributors without triggering other tests)

Projects

Status: Done
