[Feat] DeepSeek V4 Rebased #40860
Conversation
Merging this PR since the failed tests are not related. Great work and many thanks to everyone who worked on the DSv4 effort!
On SM 12.x (RTX 50-series, GB10/DGX Spark), the Marlin and Marlin-MoE kernels are currently absent from the compiled `_C.so` / `_moe_C.so`. The driver JIT-promotes the `8.0+PTX` fallback to PTX-as-SM-12.x at first use, but the resulting cubin produces silently wrong outputs on Marlin-MoE (observed: V4-Flash MoE forward emits gibberish tokens on a GB10 box, while the same model on Hopper emits coherent text). PTX-JIT correctness is not guaranteed across major architecture jumps; this is the expected failure mode of relying on `8.0+PTX` for sm_120/sm_121.

Root cause: `MARLIN_ARCHS`, `MARLIN_BF16_ARCHS`, and `MARLIN_MOE_ARCHS` in `CMakeLists.txt` do not list `12.0;12.1`, so the build omits native sm_120/sm_121 ELF entries from the kernel object. The neighbouring `MARLIN_FP8_ARCHS` and `MARLIN_MOE_FP8_ARCHS` already include `8.9;12.0;12.1`, so the precedent for SM 12.x in this file is set; this change extends the same pattern to the BF16/FP16 paths.

Fix: add `12.0;12.1` to the three arch lists.

After a rebuild on a GB10, `cuobjdump --list-elf _moe_C.abi3.so | grep sm_121` returns 22 native sm_121 ELF entries (was 0), and V4-Flash MoE forward output becomes coherent (verified haiku generation, 6.28 t/s steady on dual DGX Spark, TP=2, max_tokens=80, single request).

Refs vllm-project#40860 (the V4 rebase touches the build matrix; no overlap with this arch-list change)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
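The shape of the change is a sketch like the following. The surrounding architecture values are illustrative placeholders, not the actual contents of vLLM's `CMakeLists.txt`; only the appended `12.0;12.1` entries are what this PR proposes, mirroring the FP8 lists that already carry them.

```cmake
# Hypothetical sketch -- pre-existing list contents are illustrative.
# Appending 12.0;12.1 makes the build emit native sm_120/sm_121 ELF
# entries instead of relying on the 8.0+PTX JIT fallback.
set(MARLIN_ARCHS      "8.0;8.6;8.9;9.0;12.0;12.1")
set(MARLIN_BF16_ARCHS "8.0;8.6;8.9;9.0;12.0;12.1")
set(MARLIN_MOE_ARCHS  "8.0;8.6;8.9;9.0;12.0;12.1")
```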
`w8a8_triton_block_scaled_mm` falls back to a hardcoded default config when no pre-tuned `configs/N=*,K=*,device_name=*.json` file matches the GPU. The default uses `BLOCK_SIZE_M=64`, which wastes 98% of the M dimension in single-request decode (M=1). Any GPU without a pre-tuned JSON file for its (N, K, device) tuple pays this cost.

The change is deliberately narrow: only the M <= 8 case (single-request decode and short MTP-style draft batches) is specialized. Larger M keeps the previous default unchanged, so non-decode paths and tuned configs are not perturbed.

M <= 8 (CUDA) -> BLOCK_SIZE_M=16, num_stages=3 (new)
M <= 8 (ROCm) -> BLOCK_SIZE_M=16, num_stages=2 (new)
else -> BLOCK_SIZE_M=64, num_stages=2 (previous default)

num_stages=3 is gated to non-ROCm because MI300/MI250X LDS (64 KB) is borderline for 3-stage Triton pipelining at typical [128, 128] block sizes; on ROCm we keep num_stages=2 so the M <= 8 branch still gets the BLOCK_SIZE_M=16 wave-quantisation win without LDS pressure.

Pre-tuned JSON configs are unaffected (they short-circuit before this branch), so workloads that already have a JSON for their (N, K, device) get the same kernel as before.

Verified on dual DGX Spark (GB10, sm_121, TP=2) running V4-Flash: median single-request decode goes from 5.45 t/s to 6.73 t/s (+23%) with no other changes, and output remains coherent. The win is expected to generalize to other architectures lacking a pre-tuned JSON for the target (N, K) pair, but only the GB10 case is verified here; reviewers on Hopper/Ampere are welcome to confirm or push back.

Refs vllm-project#40860 (V4 rebase), vllm-project#40899 (jasl SM12x scope is orthogonal)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
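The fallback selection above can be sketched as follows. This is a hypothetical illustration, not the actual vLLM code: the function name, dict keys beyond those named above, and the N/K block sizes are assumptions; only the M <= 8 / ROCm branching mirrors the proposed change.

```python
def default_block_scaled_mm_config(M: int, is_rocm: bool) -> dict:
    """Pick a default Triton config when no pre-tuned JSON matches.

    Pre-tuned configs (configs/N=*,K=*,device_name=*.json) are assumed
    to short-circuit before this function is ever reached.
    """
    if M <= 8:
        # Single-request decode / short MTP-style draft batches: a
        # 64-wide M block wastes ~98% of the M dimension at M=1.
        return {
            "BLOCK_SIZE_M": 16,
            "BLOCK_SIZE_N": 128,  # illustrative
            "BLOCK_SIZE_K": 128,  # illustrative
            # ROCm stays at 2 stages: 64 KB LDS on MI300/MI250X is
            # borderline for 3-stage pipelining at [128, 128] blocks.
            "num_stages": 2 if is_rocm else 3,
        }
    # Previous default, unchanged for larger M.
    return {
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 128,  # illustrative
        "BLOCK_SIZE_K": 128,  # illustrative
        "num_stages": 2,
    }
```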
Congrats @ivanium @zyongye @ywang96 — landing V4 in main is a big step. Quick heads-up for the sm_120 / sm_121 crowd (DGX Spark / GB10 / RTX 50-series users) who will pull main and try to deploy: To get a working V4 / V4-Flash / V4-Pro on sm_12x out of post-#40860 main, two small follow-ups are still needed (both rebased clean on top of this merge):
Both are review-clean since 04-26 16:16 UTC (gemini-code-assist closed all concerns); CI is gated only on first-time-contributor approval. Thanks again for the V4 work.
```python
        return Mxfp4MoeBackend.NONE, None


def select_mxfp4_moe_backend(
```
imo we shouldn't create separate `select_gpt_oss_mxfp4_moe_backend` and `select_mxfp4_moe_backend` functions.
What's the reason that these two can't be merged?
Purpose
Rebased version of #40760
Roadmap: #40902
Co-authored-by: Bugen Zhao, Giancarlo Delfin, Jie Li, Kaichao You, Roy Wang, Woosuk Kwon, Yifan Qiao, Yongye Zhu, Zhewen Li, Zijing Liu, Zixi Qi
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.