[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes #41834
jasl wants to merge 7 commits into vllm-project:main
Conversation
@zyongye
Code Review
This pull request implements support for DeepSeek V4 on SM12x (Blackwell) architectures by providing Triton-based fallbacks for DeepGEMM-dependent operations. Key enhancements include the introduction of specialized Triton kernels for sparse MLA, FP8 einsum, and MQA logits, as well as memory optimizations in the sparse attention indexer to compute top-k indices without materializing full logits. Additionally, the PR updates the model loader to support weight name filtering for skipping MTP weights and handles Blackwell-specific FP8 quantization scales. I have no feedback to provide.
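To make the "top-k without materializing full logits" idea concrete, here is a minimal PyTorch sketch of the general technique (not the PR's actual Triton kernel; the `chunked_topk_indices` name, shapes, and chunk size are illustrative assumptions). The running top-k buffer is merged chunk by chunk, so only a `[num_q, chunk]` slice of the logits exists at any time.

```python
# Illustrative sketch only -- not the PR's Triton kernel.
import torch

def chunked_topk_indices(q: torch.Tensor, k: torch.Tensor, topk: int,
                         chunk_size: int = 4096) -> torch.Tensor:
    """q: [num_q, d], k: [num_k, d] -> [num_q, topk] key indices."""
    num_q, num_k = q.shape[0], k.shape[0]
    best_vals = torch.full((num_q, topk), float("-inf"),
                           device=q.device, dtype=q.dtype)
    best_idx = torch.zeros((num_q, topk), dtype=torch.long, device=q.device)
    for start in range(0, num_k, chunk_size):
        end = min(start + chunk_size, num_k)
        # Logits for this chunk only: [num_q, end - start].
        chunk_logits = q @ k[start:end].T
        # Merge the running top-k with this chunk's candidates.
        merged_vals = torch.cat([best_vals, chunk_logits], dim=1)
        merged_idx = torch.cat(
            [best_idx,
             torch.arange(start, end, device=q.device).expand(num_q, -1)],
            dim=1)
        best_vals, pos = merged_vals.topk(topk, dim=1)
        best_idx = merged_idx.gather(1, pos)
    return best_idx
```

The peak extra memory is proportional to `num_q * chunk_size` rather than `num_q * num_k`, which is the same trade-off the indexer optimization described above relies on.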
💡 Codex Review
vllm/vllm/model_executor/layers/sparse_attn_indexer.py, lines 86 to 89 in 9596dbf: This helper now disables the DeepGEMM requirement for every SM120 run, but the FP4 indexer cache path still depends on DeepGEMM kernels (…).
vllm/vllm/model_executor/model_loader/default_loader.py, lines 236 to 240 in 9596dbf: The new pre-load …
Do you have a performance result somewhere compared to the last PR?
This is the minimum support PR (as the reviewer requested), so it only enables support with two essential patches to prevent the vLLM crash or device hang. I'm porting all the following changes to a new preview branch, which should be close to production-ready.
Fix the SM12x fp8 einsum custom-op registration import, skip unused DeepSeek V4 MTP checkpoint tensors before safetensors materialization, and release MXFP4 setup temporaries after kernel setup. Signed-off-by: jasl <jasl9187@hotmail.com>
Protect hybrid-aligned DeepSeek V4 MLA prompt cache blocks so they survive decode and unrelated cache churn. Release those protected references under admission pressure and before prefix-cache reset so they do not starve the block pool. Add regression coverage for reuse after decode pressure, admission under protected refs, and reset cleanup. Signed-off-by: jasl <jasl9187@hotmail.com>
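To illustrate the block-protection idea in this commit, here is a minimal sketch assuming a hypothetical block pool with per-block reference counts; vLLM's real prefix-cache manager is more involved, and the `Block` and `ProtectedBlocks` names are illustrative, not the actual classes touched by the commit.

```python
# Minimal sketch, assuming ref-counted cache blocks (hypothetical types).
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    ref_count: int = 0

class ProtectedBlocks:
    """Pin prompt-cache blocks so eviction under decode pressure skips them."""

    def __init__(self) -> None:
        self._protected: list[Block] = []

    def protect(self, blocks: list[Block]) -> None:
        for b in blocks:
            b.ref_count += 1  # extra reference keeps the block off the free list
        self._protected.extend(blocks)

    def release_all(self) -> None:
        # Called under admission pressure and before a prefix-cache reset,
        # so pinned blocks do not starve the block pool.
        for b in self._protected:
            b.ref_count -= 1
        self._protected.clear()
```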
Forward model skip_weight_name_before_load filters into the fastsafetensors iterator and skip filtered keys before materializing tensors. This keeps DeepSeek V4 non-MTP loads from reading MTP-only weights when users select --load-format fastsafetensors. Keep the regression coverage at behavior level by checking the DefaultModelLoader path and pruning private implementation-field assertions from the adjacent DeepSeek V4 prefix-cache tests. Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: jasl <jasl9187@hotmail.com>
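For illustration, a minimal sketch of filtering checkpoint keys before tensor materialization, using the safetensors `safe_open` API; the `iter_weights` helper, the `skip_substrings` parameter, and the `"mtp."` substring are assumptions for the example, not vLLM's actual fastsafetensors integration.

```python
# Illustrative sketch: skip filtered keys before materializing tensors.
from safetensors import safe_open

def iter_weights(checkpoint_path: str, skip_substrings: tuple[str, ...] = ()):
    with safe_open(checkpoint_path, framework="pt") as f:
        for name in f.keys():
            # Drop e.g. MTP-only weights without ever reading their data.
            if any(s in name for s in skip_substrings):
                continue
            yield name, f.get_tensor(name)

# Hypothetical usage: a non-MTP load skipping MTP-only weights.
# for name, tensor in iter_weights("model.safetensors", skip_substrings=("mtp.",)):
#     ...
```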
May I ask which is the final branch? How do we set it up and use it? Has the crash issue been resolved?
This PR's performance isn't great, because this is the minimum support PR (as the reviewer requested), but the key reliability patches have been included. I don't see any crash on my 2 * GB10 and 2 * RTX Pro 6000. It needs the following PRs to restore performance. I'm preparing the preview branch now. To avoid the mess I made earlier, I'll test it locally first, especially for GB10.
Do you have any recommended high-performance branches? Is there a deployment tutorial? I am using 4 * RTX Pro 6000, and I want to try setting them up and using them.
Hi, I opened a PR adding SM120 support for DeepGEMM: deepseek-ai/DeepGEMM#324 — might be helpful here |
@jasl Now that I understand how everything fits together, I have a suggestion. First, we can try our best to merge #38476. That PR only has DeepSeek v3.2 and GLM-5.x in scope for now, and DeepSeek v4 is not within scope yet; however, DeepSeek v4 still needs Sparse MLA. Once it lands, this PR can be substantially shortened by dropping the TRITON_SPARSE_MLA components. Does this sound good?
https://github.com/jasl/vllm/tree/ds4-sm120-preview |
I tested the branch a few days ago; however, it's not the fast path for SM12x, and its correctness isn't as good as that of my implementation.
Yes, I totally understand. So both PRs are required, each for a different purpose. Thank you for the response.
Note that a breakthrough regarding TTFT was recently found in this Docker image. |
Purpose
This PR adds support for DeepSeek V4 Flash on SM12x (DGX Spark and RTX Pro 6000).
Note 1: This supersedes #40991 with a smaller branch and a cleaner file layout.
Note 2: SM12x hardware struggles to run inference for DeepSeek V4 Pro, so I only test and focus on DeepSeek V4 Flash.
Note 3: GB10 requires a patch; otherwise, loading the DeepSeek model will cause the device to hang and require a hard reboot.
Note 4: Another essential patch also addresses usability: without it, a 2 * GB10 or 2 * RTX Pro 6000 configuration will crash on agentic use cases (such as OpenClaw and Open Code) within a single turn. The issue was reported by a tester in jasl#2.
Test Plan
Local/static checks:
Serve test, SM120 / 2x RTX PRO 6000, TP=2:
Full GSM8K accuracy:
Test Result
Static checks:
GB10 / DGX Spark startup smoke:
SM120 serve / max-context:
Long-context reliability smoke, SM120 / TP=2:
Full GSM8K, SM120 / TP=2 / EP enabled / FlashInfer autotune disabled:
Both runs completed with return code 0 and no server-side errors.