[DSv4][Nvidia] SM12x DeepSeek V4 support #40991
jasl wants to merge 7 commits into vllm-project:main from
Conversation
@WoosukKwon
Code Review
This pull request introduces support for DeepSeek V4 models, including updates to DeepGEMM integration, new FP8 einsum kernels for SM12x, and infrastructure for sparse MLA attention. However, there are two critical issues: the removal of the optional dependency check for tilelang in vllm/model_executor/layers/mhc.py will break installations on non-CUDA platforms, and the replacement of DeepseekV4MLP with DeepseekV2MLP for shared experts removes necessary swiglu_limit clamping, which is vital for numerical stability in FP8 inference.
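As a rough illustration of the first point, here is a minimal sketch of the kind of optional-dependency guard the review refers to (illustrative only; not the actual code in vllm/model_executor/layers/mhc.py):

```python
# Illustrative optional-import guard; the real module layout may differ.
try:
    import tilelang  # optional: only present on CUDA installs that ship it
    HAS_TILELANG = True
except ImportError:
    tilelang = None
    HAS_TILELANG = False

def require_tilelang() -> None:
    """Fail with a clear message only when the TileLang path is actually used,
    so non-CUDA installs that never touch it keep working."""
    if not HAS_TILELANG:
        raise RuntimeError(
            "tilelang is required for the sparse-MLA TileLang kernels; "
            "install it or disable this code path.")
```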
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4e2adf8a9f
The PR is ready to review.
Thanks to your work, the context length supported locally has increased significantly and decode speed has improved a lot. It's amazing!
@jasl My understanding is that your current approach supports SM120 through a combination of DeepGEMM and Triton. I wonder whether a pure Triton implementation, without depending on DeepGEMM at all, would be cleaner and perhaps worth considering as an alternative. I'd be interested to hear your thoughts.
I don't have a preference.
Reproduced on 2x RTX PRO 6000 Server Edition @ 300W TDP + PCIe 4.0: 88/91/70/87/84% of your reference numbers across the five workloads, which roughly fits the PCIe 4.0 speeds (I assume you run PCIe 5.0). Boots fine at 250K context with max-num-seqs=16, gpu-memory-utilization=0.97. Thank you for your work!
KV-cache reuse breaks under concurrent multi-session load (sparse-MLA + SWA page sharing)
TL;DR
Two parallel chat sessions at ~50 K context each behave as if the KV cache is wiped between turns — every request re-prefills from scratch. Single-session multi-turn is fine.
Smoking gun
A's parallel turn 1 = 297 ms (cache hit — A's blocks from the solo phase were still resident). A's parallel turn 2 = 30 311 ms (full re-prefill). Between A.1 and A.2, B's cold prefill ran. B's prefill destroyed A's blocks despite ~88 % free pool. Tested
My (AI-assisted) reading of the patch
DS-V4-Flash has
The hypothesis I landed on: upstream
This fits the index-level hit rate staying healthy on solo (66 %) but collapsing on concurrent (12–17 %) — the index thinks it has A's data, the memory says otherwise. I'm reasonably confident about the shape of the bug, much less so about the exact location. Could very well be wrong.
Reproducer
~250-line standalone Python script (
Setup
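For readers who want to see the shape of the measurement without the full ~250-line reproducer, here is a compressed sketch (endpoint, model name, and prompt sizes are assumptions, not the actual script): prefill session A, let a cold session B prefill run, then time A's next turn again.

```python
# Minimal illustration of the two-session cache-reuse measurement; not the
# commenter's reproducer. Requires a running OpenAI-compatible vLLM server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "deepseek-v4-flash"  # placeholder: whatever name the server registered

def turn(history, user_msg):
    history = history + [{"role": "user", "content": user_msg}]
    t0 = time.time()
    reply = client.chat.completions.create(model=MODEL, messages=history, max_tokens=64)
    latency = time.time() - t0
    history.append({"role": "assistant", "content": reply.choices[0].message.content})
    return history, latency

long_ctx = "lorem ipsum " * 8000  # long-context stand-in (the real repro uses ~50 K tokens)

a_hist, _ = turn([], long_ctx + "\nSummarize.")         # A turn 1: cold prefill
a_hist, t_a_before = turn(a_hist, "Continue.")          # A turn 2: should hit the prefix cache
b_hist, _ = turn([], long_ctx[::-1] + "\nSummarize.")   # B cold prefill with a different prefix
a_hist, t_a_after = turn(a_hist, "Continue again.")     # A turn 3: re-prefills if B evicted A's blocks
print(f"A before B: {t_a_before:.2f}s   A after B: {t_a_after:.2f}s")
```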
I think I have found the root cause, a stupid fault: I missed an argument...
Update to ds4-sm120-full and I'll test.
Now testing the DeepSeek-V4-Flash-W4A16-FP8 build (W4A16 GPTQ + FP8_BLOCK attention, calibrated against jasl's branch + #41276) on dual DGX Spark TP=2 — will post full
Note: this build uses Marlin INT4 kernels for the routed experts, not the SM12x sparse-MLA / FP4 path where the @wuwenthink garbled-output issue currently lives — so if it produces clean output on chat-smoke coding (
Anyone with 2× RTX PRO 6000 (SM 12.0) want to also run this on the same harness while we wait for jasl's fix?
The author said he might have found the problem and would test it after he pushes the fixed update.
I pulled down the latest version of these and now DS4-Flash works with opencode; speed is about 40-50 tok/s with dual RTX Pro 6000. Great progress! Using
For Opencode I needed to add these start parameters:
For some reason I need to make this patch to avoid the "unsupported architecture" error:
# DeepGEMM is required for the paged MQA logits on CUDA devices
Any ideas why this is needed?
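If the error really comes from an unconditional DeepGEMM requirement, one possible shape of a fix is to gate the requirement on device capability and fall back to the Triton path on SM 12.x. A sketch under that assumption (not the actual vLLM check):

```python
# Sketch only: gate the DeepGEMM requirement for paged MQA logits on device
# capability instead of asserting it unconditionally. Not the actual vLLM logic.
import torch

def deepgemm_supports_this_gpu() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # DeepGEMM's paged-MQA-logits kernels target SM90 (WGMMA) / SM100 (tcgen05);
    # SM 12.x workstation Blackwell would take the Triton fallback instead.
    return (major, minor) in {(9, 0), (10, 0)}

def paged_mqa_logits_backend() -> str:
    return "deepgemm" if deepgemm_supports_this_gpu() else "triton"
```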
@jasl Is the latest ds4-sm120-full branch supposed to be tested with export VLLM_DEEPSEEK_V4_USE_DEEPGEMM_SM12X_KERNELS=1? Author: can you share the environment variable exports and vLLM loading parameters that effectively improve prefill and decode speed?
I'm not focusing on performance right now, since you reported correctness and low-quality generation issues. GPT summary:
I'm fixing the KV cache issue and building a new SM120 baseline to start the performance tuning.
Pre-warm sparse-MLA TileLang kernels at boot to avoid first-request JIT spikes
Running
Excerpt from engine log on a fresh boot under live mixed traffic:
The compiled artifacts persist to
Suggested: pre-warm sparse-MLA TileLang kernels at engine init, ideally driven by a list analogous to
Is there already a knob for this I missed? Happy to test a patch on the same hardware.
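For concreteness, a sketch of what engine-init pre-warming could look like. `compile_sparse_mla_kernel` and the shape list are hypothetical placeholders, not actual vLLM or TileLang APIs; the real shapes would come from the serving config.

```python
# Sketch only: pre-compile the sparse-MLA TileLang kernels at engine init so the
# first live request does not pay the JIT cost.
import itertools
import logging

logger = logging.getLogger(__name__)

# Shapes the serving config can actually hit (head block size, query chunk, top-k chunk).
WARMUP_SHAPES = list(itertools.product((1, 2, 4), (256,), (512,)))

def prewarm_sparse_mla(compile_sparse_mla_kernel) -> None:
    """Call the (hypothetical) kernel compiler for every shape before serving starts."""
    for head_block, query_chunk, topk_chunk in WARMUP_SHAPES:
        logger.info("Pre-warming sparse-MLA kernel %s/%s/%s",
                    head_block, query_chunk, topk_chunk)
        compile_sparse_mla_kernel(
            head_block_size=head_block,
            query_chunk_size=query_chunk,
            topk_chunk_size=topk_chunk,
        )
```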
Thank you! I'll check it.
@wuwenthink another analysis from Opus 4.7 Max
Overall assessment: at the same tier of generation quality
Per-item observations
Flaws that actually appeared (verified by hand)
All three cases are sporadic sampling drift under zh long-context + temperature=1.0, hit in only 1 of 9 seeds; the other seeds in the same task family (including the same think-max mode) are clean, and on the en side I did not observe this class of flaw in my spot checks.
Is this within an acceptable range? Yes. Reasons:
Conclusion:
re: 77bbc16 — Validated for production tool-use and chat at TP=2 on DGX Spark with --max-model-len 16384 at ~14–17 tok/s decode, pending full harness confirmation. Long-context behavior beyond 16K and a 24-hour stability soak are still pending.
Reporting a workspace-allocator bug we hit deploying this PR's W4A16 quant on dual DGX Spark TP=2:
The comment at
Full report + patch + Spark TP=2 validation results (gsm8k 95.37%, HumanEval pass@1 80.49%) in #41700.
Protect hybrid-aligned DeepSeek V4 MLA prompt cache blocks so they survive decode and unrelated long-session cache churn. Keep common-prefix accounting aware of the extra protection reference and cover compressor-state SlidingWindowMLA groups in a regression test. Co-authored-by: OpenAI Codex <codex@openai.com>
Confirming
Cherry-picked the commit on top of
@pasta-paul @moshemalawach @v1b3coder |
MTP indexer performance fix for SM12x
@jasl We investigated the 1M-context MTP decode slowdown on SM120 (RTX PRO 6000 ×4) and found two issues in the paged-MQA-logits Triton kernels.
Problem
With
Root cause #1: MTP excluded from row-wise kernel
```diff
- if next_n == 1 and head_dim % 64 == 0 and num_heads % 4 == 0:
+ if head_dim % 64 == 0 and num_heads % 4 == 0:
```
Root cause #2: Grid sized to max_model_len
Even with the row-wise kernel,
Fix: early-exit in
```diff
+ if token_start + pid_n * BLOCK_N >= context_len:
+     tl.store(
+         logits_ptr + row * stride_lm + offs_local_n * stride_ln,
+         tl.full((BLOCK_N,), float("-inf"), dtype=tl.float32),
+         mask=valid_row & valid_n,
+     )
+     return
  context_mask = valid_n & (offs_n < context_len)
```
The grid stays at
Results (SM120, TP=4, DeepSeek-V4-Flash, MTP N=3)
Full commit: aabbccddwasd@b1b1b532c
Both changes are correctness-preserving: the paper defines the indexer scan as
Tested with
Investigation by DeepSeek-V4-Pro under Claude Code
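To show the early-exit pattern in isolation, here is a minimal, self-contained Triton sketch (a toy kernel with invented strides and a dummy score computation; not the vLLM kernel): column blocks that lie wholly past context_len write -inf and return before doing any real work.

```python
# Toy illustration of the early-exit guard described above; not the vLLM kernel.
import torch
import triton
import triton.language as tl

@triton.jit
def toy_paged_logits_kernel(logits_ptr, context_len, total_n,
                            stride_lm, stride_ln, BLOCK_N: tl.constexpr):
    row = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    valid_n = offs_n < total_n
    out_ptrs = logits_ptr + row * stride_lm + offs_n * stride_ln
    # Early exit: this whole block lies beyond the real context length.
    if pid_n * BLOCK_N >= context_len:
        tl.store(out_ptrs, tl.full((BLOCK_N,), float("-inf"), dtype=tl.float32),
                 mask=valid_n)
        return
    # Stand-in for the real score computation; mask columns past context_len.
    context_mask = valid_n & (offs_n < context_len)
    scores = tl.where(context_mask, offs_n.to(tl.float32), float("-inf"))
    tl.store(out_ptrs, scores, mask=valid_n)

rows, total_n, context_len, BLOCK_N = 4, 1024, 300, 128
logits = torch.empty((rows, total_n), device="cuda", dtype=torch.float32)
grid = (rows, triton.cdiv(total_n, BLOCK_N))
toy_paged_logits_kernel[grid](logits, context_len, total_n,
                              logits.stride(0), logits.stride(1), BLOCK_N=BLOCK_N)
print(torch.isinf(logits).sum().item(), "positions masked to -inf")
```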
Different failure mode under sustained streaming load — silent wedge, KV blocks not freed at request teardown
Reporting a wedge we hit after
Setup
Signature
Healthy operation at high prefix-hit rate, then in a single 10 s sampling window the engine transitions to a non-recovering wedge. Two captures, both showing the same shape:
After this point, new
The other capture (different image build, same hardware/config) hit the wedge at
Why I think this is distinct from the eviction race
This shape is consistent with a refcount leak on streaming-request teardown — possibly the new
What I can share
Not posting the full Docker logs publicly (they contain client IPs and internal endpoint paths), but happy to forward via DM or any channel that works for @jasl.
Speculation on next step
The new commits on
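To make the refcount-leak hypothesis concrete, a toy block-pool sketch (all names invented, nothing to do with vLLM internals): if the streaming-abort teardown path skips the release, blocks stay pinned and the free pool shrinks until the engine wedges.

```python
# Toy model of the hypothesized leak; all names are invented for illustration.
class BlockPool:
    def __init__(self, num_blocks: int):
        self.refcount = [0] * num_blocks
        self.free = set(range(num_blocks))

    def acquire(self, n: int) -> list[int]:
        blocks = [self.free.pop() for _ in range(n)]
        for b in blocks:
            self.refcount[b] += 1
        return blocks

    def release(self, blocks: list[int]) -> None:
        for b in blocks:
            self.refcount[b] -= 1
            if self.refcount[b] == 0:
                self.free.add(b)

pool = BlockPool(64)

def finish_request(blocks, aborted_mid_stream: bool, buggy: bool) -> None:
    # A teardown path that forgets to release on client disconnect leaks blocks.
    if aborted_mid_stream and buggy:
        return  # blocks never returned to the pool
    pool.release(blocks)

for i in range(100):
    blocks = pool.acquire(4)
    finish_request(blocks, aborted_mid_stream=(i % 3 == 0), buggy=True)
    if len(pool.free) < 4:
        print(f"wedged after {i + 1} requests: only {len(pool.free)} free blocks left")
        break
```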
@moshemalawach @aabbccddwasd |
The PR now incorporates #40929 and is DeepGEMM-free, thanks to @bbbearxyz!
UPDATE: To better align with the DeepSeek official API and the B200 code path, I made a harness to help measure correctness, performance, and quality: https://github.com/jasl/vllm-ds4-sm120-harness
I will post the latest report there for people to review.
Summary
This PR enables DeepSeek V4 Flash to serve on NVIDIA SM12x GPUs, tested on a
2x RTX PRO 6000 Blackwell Workstation Edition host.
The important change from the earlier prototype is that this PR no longer pins
or rewrites the DeepGEMM dependency. The branch keeps vLLM's upstream DeepGEMM
installer and CMake metadata intact, and implements the required SM12x runtime
fallbacks in vLLM:
fp8_ds_mla sparse MLA cache handling.
Motivation
DeepSeek V4 currently relies on kernels that are available on Hopper and
datacenter Blackwell paths, but not on SM120 / SM121 workstation and consumer
Blackwell GPUs. In particular, SM12x cannot directly reuse SM90 WGMMA kernels
or SM100 tcgen05 kernels.
This PR adds correctness-first portable kernels for the missing SM12x pieces,
then optimizes the hot sparse MLA paths enough for real serving. The result is
a reviewable vLLM-side compatibility layer that does not require maintainers to
accept a temporary DeepGEMM fork pin.
Scope
Included:
fp8_ds_mla packed cache decode for SWA and compressed sparse candidates.
model path.
Not included:
broad for this PR.
Runtime controls
The SM12x sparse MLA path registers its environment variables in
vllm.envs, so users should not see unknown-variable warnings for these knobs.
VLLM_TRITON_MLA_SPARSE: 1 forces the Triton sparse MLA path, 0 disables it. When unset, vLLM enables it on SM12x where FlashMLA sparse is unavailable.
VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE: 512
VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE: 256
VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE: 1, 2, and 4; benchmarks used 4.
VLLM_TRITON_MLA_SPARSE_MATMUL_DECODE
VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH: 1 forces allow, 0 disables.
Operational warning: do not set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True with the TP=2 CUDA graph configuration used below. In local testing it made custom all-reduce fail during CUDA graph address registration. Leaving it unset avoids that failure.
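As a quick illustration, the knobs above can be set from a small launcher before handing off to the normal CLI. The values mirror the list above; the model path and TP flag are placeholders, and the actual serving command is given later in this description.

```python
# Illustrative launcher: export the SM12x sparse-MLA knobs, then exec vLLM.
import os

os.environ.setdefault("VLLM_TRITON_MLA_SPARSE", "1")
os.environ.setdefault("VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE", "512")
os.environ.setdefault("VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE", "256")
os.environ.setdefault("VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE", "4")
os.environ.setdefault("VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH", "1")

# Placeholder model path and TP size; replace with the real serving command below.
os.execvp("vllm", ["vllm", "serve", "deepseek-ai/DeepSeek-V4-Flash",
                   "--tensor-parallel-size", "2"])
```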
Branches
Formal PR branch:
Preview / evaluation branch with extra community performance work and MTP fixes:
The preview branch is not intended as the review target. It exists so users can
try the broader optimization stack while this PR stays focused.
Test environment
Hardware:
Software:
Benchmark environment:
Validation
Formal PR branch checks:
Result:
Compile check:
Targeted tests:
Result:
Diff hygiene:
Result: clean.
Preview branch focused checks:
Result:
Serving command
Formal PR branch, no MTP:
Preview branch, MTP:
Benchmark command
The short-context benchmark uses 128 -> 512; the long-context benchmark uses 8192 -> 512. Each row uses 48 prompts and temperature=0.
Formal PR branch benchmark
Branch:
Server memory setting:
MTP is not included in this branch. Starting the formal branch with
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' fails
because the MTP fix stack is intentionally kept separate.
Result directory:
Preview branch benchmark
Branch:
Server memory setting:
This branch includes the separate MTP fixes and community performance patches.
It is for evaluation only, not the formal PR review target.
Startup notes:
Result directory:
Review notes
Changes made before this update:
clearly separated from serving kernels.
Known follow-ups
ds4-sm120-full can continue to carry community performance patches for
public evaluation.
indexer, MoE, collectives, sampling, and sparse MLA rather than broadening
this PR.