
[Feat] DeepSeek V4 Rebased #40860

Merged
ywang96 merged 25 commits into vllm-project:main from ivanium:feat/dsv4-support
Apr 27, 2026

Conversation

@ivanium (Contributor) commented Apr 25, 2026

Purpose

Rebased version of #40760

Roadmap: #40902

Co-authored-by: Bugen Zhao, Giancarlo Delfin, Jie Li, Kaichao You, Roy Wang, Woosuk Kwon, Yifan Qiao, Yongye Zhu, Zhewen Li, Zijing Liu, Zixi Qi

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@ywang96 (Member) commented Apr 27, 2026

Merging this PR since the failed tests are not related.

Great work, and many thanks to everyone who worked on the DSv4 effort!

@ywang96 ywang96 merged commit 4d51588 into vllm-project:main Apr 27, 2026
197 of 202 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Apr 27, 2026
tonyliu312 pushed a commit to tonyliu312/vllm that referenced this pull request Apr 27, 2026
On SM 12.x (RTX 50-series, GB10/DGX Spark), Marlin and Marlin-MoE kernels
are currently absent from the compiled `_C.so` / `_moe_C.so`. The driver
JIT-promotes the `8.0+PTX` fallback to PTX-as-SM-12.x at first use, but
the resulting cubin produces silently-wrong outputs on Marlin-MoE
(observed: V4-Flash MoE forward emits gibberish tokens on a GB10 box,
while the same model on Hopper emits coherent text). Note that PTX-JIT
correctness is not guaranteed across major arch jumps; this is the
expected failure mode of relying on `8.0+PTX` for sm_120/sm_121.

`MARLIN_ARCHS`, `MARLIN_BF16_ARCHS`, and `MARLIN_MOE_ARCHS` in
CMakeLists.txt do not list `12.0;12.1`, so the build omits native
sm_120/sm_121 ELF entries from the kernel object. The neighbouring
`MARLIN_FP8_ARCHS` and `MARLIN_MOE_FP8_ARCHS` already include
`8.9;12.0;12.1`, so the precedent for SM 12.x in this file is set;
this change extends the same pattern to the BF16/FP16 paths.

Add `12.0;12.1` to the three arch lists. After rebuild on a GB10:
`cuobjdump --list-elf _moe_C.abi3.so | grep sm_121` returns 22 native
sm_121 ELF entries (was 0), and V4-Flash MoE forward output becomes
coherent (verified haiku generation, 6.28 t/s steady on dual DGX Spark
TP=2, max_tokens=80, single request).
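The `cuobjdump --list-elf` check above can be automated; the helper below is a hypothetical sketch (the sample listing is illustrative, not the real `_moe_C.abi3.so` output, which shows 22 `sm_121` entries on a fixed GB10 build):

```python
def count_native_elf(cuobjdump_output: str, arch: str) -> int:
    """Count native ELF entries for a given SM arch in the
    output of `cuobjdump --list-elf` (one entry per line)."""
    return sum(1 for line in cuobjdump_output.splitlines() if arch in line)

# Illustrative sample of a `cuobjdump --list-elf _moe_C.abi3.so` listing.
sample = """\
ELF file    1: _moe_C.1.sm_80.cubin
ELF file    2: _moe_C.2.sm_121.cubin
ELF file    3: _moe_C.3.sm_121.cubin
"""
print(count_native_elf(sample, "sm_121"))  # 2 in this sample; 0 before the fix
```

A count of zero for the target arch means the driver will fall back to PTX JIT, which is exactly the silently-wrong-output failure mode described above.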

Refs vllm-project#40860 (V4 rebase touches the build matrix, no overlap with this
arch-list change)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
tonyliu312 pushed a commit to tonyliu312/vllm that referenced this pull request Apr 27, 2026
`w8a8_triton_block_scaled_mm` falls back to a hardcoded default config
when no pre-tuned `configs/N=*,K=*,device_name=*.json` file matches the
GPU. The default uses `BLOCK_SIZE_M=64`, which wastes 98% of the M
dimension in single-request decode (M=1). GPUs without a pre-tuned JSON
file for their (N, K, device) tuple pay this cost.

Narrow the change: only specialize the M<=8 case (single-request decode
and short MTP-style draft batches). Larger M keeps the previous default
unchanged so non-decode paths and tuned configs are not perturbed.

  M <= 8 (CUDA)   -> BLOCK_SIZE_M=16, num_stages=3   (new)
  M <= 8 (ROCm)   -> BLOCK_SIZE_M=16, num_stages=2   (new)
  else            -> BLOCK_SIZE_M=64, num_stages=2   (previous default)

num_stages=3 is gated to non-ROCm because MI300/MI250X LDS (64 KB) is
borderline for 3-stage Triton pipelining at typical [128, 128] block
sizes; on ROCm we keep num_stages=2 so the M<=8 branch still gets the
BLOCK_SIZE_M=16 wave-quantisation win without LDS pressure.

Pre-tuned JSON configs are unaffected (they short-circuit before this
branch). Workloads that already have a JSON for their (N, K, device)
get the same kernel as before.
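The fallback selection described above reduces to a small dispatch. A minimal sketch, assuming the table above (the function name and dict shape are illustrative, not vLLM's actual API):

```python
def default_w8a8_block_mm_config(M: int, is_rocm: bool) -> dict:
    """Fallback Triton config, used only when no pre-tuned JSON matches."""
    if M <= 8:
        # Single-request decode / short MTP-style draft batches:
        # BLOCK_SIZE_M=64 would waste 98% of the M dimension at M=1.
        # ROCm keeps num_stages=2 to avoid LDS pressure on MI300/MI250X.
        return {"BLOCK_SIZE_M": 16, "num_stages": 2 if is_rocm else 3}
    # Larger M keeps the previous default so non-decode paths are untouched.
    return {"BLOCK_SIZE_M": 64, "num_stages": 2}

print(default_w8a8_block_mm_config(1, is_rocm=False))
# {'BLOCK_SIZE_M': 16, 'num_stages': 3}
```

Hosts with a tuned JSON for their (N, K, device) tuple never reach this function, so their kernel choice is unchanged.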

Verified on dual DGX Spark (GB10, sm_121, TP=2) running V4-Flash:
median single-request decode goes from 5.45 t/s to 6.73 t/s (+23%) with
no other changes. Output remains coherent. The win is expected to
generalize to other architectures lacking a pre-tuned JSON for the
target (N, K) pair, but only the GB10 case is verified here; reviewers
on Hopper/Ampere are welcome to confirm or push back.

Refs vllm-project#40860 (V4 rebase), vllm-project#40899 (jasl SM12x scope is orthogonal)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
@tonyliu312

Congrats @ivanium @zyongye @ywang96 — landing V4 in main is a big step. Quick heads-up for the sm_120 / sm_121 crowd (DGX Spark / GB10 / RTX 50-series users) who will pull main and try to deploy:

To get a working V4 / V4-Flash / V4-Pro on sm_12x out of post-#40860 main, two small follow-ups are still needed (both rebased clean on top of this merge):

  • [Kernel] Marlin MoE: include SM 12.x in default arch list #40923 [Kernel] Marlin MoE: include SM 12.x in default arch list — adds 12.0;12.1 to MARLIN_ARCHS / MARLIN_BF16_ARCHS / MARLIN_MOE_ARCHS, mirroring the existing MARLIN_FP8_ARCHS = "8.9;12.0;12.1" precedent. Without it, the 8.0+PTX JIT fallback produces no native sm_12x cubin and gives silently-wrong outputs on Marlin-MoE (verified end-to-end: gibberish → coherent on dual DGX Spark TP=2; independently re-verified by @idonati on 8× DGX Spark TP=8 running V4-Pro at DeepSeek V4 support on SM12x with Triton sparse MLA fallback #40899).
  • [Kernel] Tune default fp8 block-scaled Triton config for M<=8 decode #40925 [Kernel] Tune default fp8 block-scaled Triton config for M<=8 decode — narrow specialisation of the w8a8_triton_block_scaled_mm fallback default (only when no tuned JSON matches): BLOCK_SIZE_M=16, num_stages=3 for M <= 8, larger M unchanged. ROCm gated to num_stages=2 per gemini-code-assist review. +23% on GB10 V4-Flash single-request decode, no regression possible for M > 8 or for hosts with a tuned JSON.

Both are review-clean since 04-26 16:16 UTC (gemini-code-assist closed all concerns), CI gated only on first-time-contributor ready label (cc CODEOWNERs: @LucasWilkinson @tlrmchlsmth for the CMakeLists change, @mgoin @tlrmchlsmth for the fp8_utils tune). Posting here primarily as a heads-up for sm_12x users grabbing main today, not as a label nudge.

Thanks again for the V4 work.

```python
    return Mxfp4MoeBackend.NONE, None


def select_mxfp4_moe_backend(
```
Contributor

imo we shouldn't create separate select_gpt_oss_mxfp4_moe_backend and select_mxfp4_moe_backend

what's the reason that these two can't be merged?

cc @mgoin , @robertgshaw2-redhat


Labels

ci/build · cpu (Related to CPU backends) · deepseek (Related to DeepSeek models) · documentation (Improvements or additions to documentation) · frontend · gpt-oss (Related to GPT-OSS models) · kv-connector · new-model (Requests to new models) · nvidia · ready-run-all-tests (Trigger CI with all tests for wide-ranging PRs) · speculative-decoding · tool-calling · v1 · verified (Run pre-commit for new contributors without triggering other tests)

Projects

Status: Done
