DeepSeek V4 support on SM12x with Triton sparse MLA fallback#40899
jasl wants to merge 70 commits into vllm-project:main from
Conversation
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Yongye Zhu <yongye@inferact.ai> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Simon Mo <simon@inferact.ai> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Giancarlo Delfin <gdelfin@inferact.ai> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Roy Wang <yasong.wang@inferact.ai> Co-authored-by: Woosuk Kwon <woosuk@inferact.ai> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Zhewen Li <jerven.vllm@gmail.com> Co-authored-by: Zijing Liu <liuzijing2014@gmail.com> Co-authored-by: khluu <khluu000@gmail.com> Co-authored-by: qizixi <zixi@inferact.ai>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Add an experimental SM120 DeepSeek V4 path that keeps the existing FlashMLA sparse metadata, KV-cache, and top-k/global-slot plumbing, but uses a correctness-first reference attention implementation instead of calling FlashMLA kernels that are unavailable on SM120. The prototype is gated behind VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION and leaves the SM90/SM100 FlashMLA path unchanged. It also adds diagnostic dumping via VLLM_SM120_DUMP_DEEPSEEK_V4_ATTENTION so shape and metadata issues can be captured without changing normal execution.

Implemented pieces:
- sink-aware reference sparse attention with online softmax state
- SWA-only decode reference path
- compressed decode reference paths for C4A and C128A
- chunked C128A top-k processing to avoid materializing full 8192-slot KV tensors per token batch
- prefill reference path over the existing gathered KV workspace and combined sparse/SWA indices
- fp8_ds_mla global-slot dequantization helper for arbitrary physical KV cache slots
- SM120 tile-scheduler bypass when the reference path is enabled
- torch.compile defunctionalization for DeepSeek V4 FP8/CUTLASS custom ops
- E8M0 scale upcast before CUTLASS scaled-mm calls

Validation on the SM120 host:
- python -m py_compile over the modified Python modules
- git diff --check
- GPU smoke tests for single-chunk, multi-chunk, and prefill reference attention all reported max_abs=0.0 against PyTorch golden references
- vllm serve DeepSeek-V4-Flash with --max-model-len 262144 started successfully and returned HTTP 200 for a one-token /v1/completions request

The default 1M context remains outside this prototype's current memory budget: after the reference path and CUDA graph capture, vLLM's KV-cache admission reports insufficient available KV memory for 1,048,576 tokens.
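The chunked top-k processing above boils down to a standard online-softmax accumulation over KV chunks. A minimal PyTorch sketch of that accumulation pattern, with tensor names, shapes, and the chunking purely illustrative rather than the code added by this commit:

```python
import torch

def chunked_online_softmax_attention(q, kv_chunks, scale):
    """Accumulate softmax(q @ k^T) @ v over KV chunks without materializing the
    full candidate set. q: [H, D]; each chunk is a (k [N, D], v [N, D]) pair.
    Returns the normalized output [H, D] and per-head log-sum-exp [H]."""
    H, D = q.shape
    m = torch.full((H,), float("-inf"), dtype=torch.float32, device=q.device)  # running max
    denom = torch.zeros(H, dtype=torch.float32, device=q.device)               # running sum of exp
    acc = torch.zeros(H, D, dtype=torch.float32, device=q.device)              # running weighted V sum
    for k, v in kv_chunks:
        logits = (q.float() @ k.float().T) * scale            # [H, N]
        new_m = torch.maximum(m, logits.max(dim=-1).values)
        correction = torch.exp(m - new_m)                     # rescale previous state
        p = torch.exp(logits - new_m[:, None])                # [H, N]
        denom = denom * correction + p.sum(dim=-1)
        acc = acc * correction[:, None] + p @ v.float()
        m = new_m
    out = acc / denom[:, None]
    lse = m + torch.log(denom)
    return out, lse
```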
DeepGEMM now has an experimental SM120/SM121 compatibility path in the local dependency branch, but vLLM's vendored DeepGEMM CMake integration was still filtering CUDA architectures down to SM90/SM100. On an SM120 host this made the _deep_gemm_C extension fall through the unsupported-architecture branch even when CUDA 13 and a compatible DeepGEMM source checkout were available.

Add 12.0f to the DeepGEMM supported architecture list for CUDA 13.0+. vLLM's cuda_archs_loose_intersection helper maps this to the requested SM12x target (for example CUDA_ARCH_LIST=120a resolves to DeepGEMM CUDA architectures: 12.0a), matching the SM120/SM121 compatibility model used by the prototype.

Validation on the SM120 host:
- git diff --check
- DEEPGEMM_SRC_DIR=~/tmp/DeepGEMM CCACHE_NOHASHDIR=true MAX_JOBS=64 pip install --verbose --no-build-isolation -e . with CUDA_ARCH_LIST=120a and TORCH_CUDA_ARCH_LIST=12.0a
- build log reported: DeepGEMM CUDA architectures: 12.0a
Refactor the experimental SM120 DeepSeek V4 reference attention path so the attention subsets compute no-sink normalized outputs plus log-sum-exp values, then apply the learnable attention sink exactly once in a small merge step. This makes the prototype line up with the intended optimization boundary: SWA and compressed top-k attention can be replaced independently by Triton or CUDA kernels that return (out_no_sink, lse_no_sink), while the shared sink-aware merge remains small and easy to verify. The decode path now computes compressed and SWA subsets separately and merges both LSEs with the sink denominator. SWA-only decode and prefill still use the same math, but go through the no-sink finalize plus merge helper so all reference paths share one sink application point.

Validation on the SM120 host:
- python -m py_compile vllm/model_executor/layers/deepseek_v4_attention.py
- git diff --check
- GPU smoke tests reported single_chunk_max_abs=0.0, multi_chunk_max_abs=0.0, and prefill_max_abs=0.0 against PyTorch golden references
- vllm serve DeepSeek-V4-Flash with --max-model-len 262144 started successfully and returned HTTP 200 for a one-token /v1/completions request
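The optimization boundary described here is easiest to see as a small merge function. A hedged PyTorch sketch of merging two (out_no_sink, lse_no_sink) subsets while the sink contributes only to the denominator; names and shapes are illustrative, not the shared helper's actual signature:

```python
import torch

def merge_subsets_with_sink(out_a, lse_a, out_b, lse_b, sink):
    """Merge two no-sink attention subsets (e.g. compressed top-k and SWA) and
    apply the attention sink exactly once, in the shared denominator.
    out_*: [H, D] normalized subset outputs (float32), lse_*: [H], sink: [H]."""
    m = torch.maximum(torch.maximum(lse_a, lse_b), sink)  # stabilizer for the exponentials
    w_a = torch.exp(lse_a - m)
    w_b = torch.exp(lse_b - m)
    w_sink = torch.exp(sink - m)                          # sink adds probability mass, no value
    denom = w_a + w_b + w_sink
    return (w_a[:, None] * out_a + w_b[:, None] * out_b) / denom[:, None]
```

Because each subset output is already normalized by its own denominator, re-weighting by exp(lse) recovers attention over the concatenated sparse candidates, which is exactly the equivalence the tests below assert.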
DeepSeek V4 chat prompts can enter the sparse attention indexer prefill path before the SM120 sparse-attention reference path runs. That path calls DeepGEMM fp8_fp4_mqa_logits, which still rejects SM12x in attention.hpp and kills the engine with Unsupported architecture. Add an SM120-only FP8 torch reference implementation for non-paged MQA logits. It dequantizes FP8 K rows with the existing per-row scale, accumulates relu(q @ k.T) weighted across heads in small head chunks, and preserves the existing clean_logits mask behavior. SM90/SM100 and FP4 continue to use the DeepGEMM implementation. Add an SM120 CUDA regression test comparing the fallback against an explicit PyTorch reference so the chat/prefill indexer path remains covered.
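A rough PyTorch sketch of what such a non-paged FP8 MQA-logits reference can look like: dequantize FP8 K rows with a per-row scale, then accumulate relu(q @ k.T) across heads in small chunks. Shapes, names, and the omitted clean_logits masking are assumptions here, not the actual vLLM signature:

```python
import torch

def mqa_logits_fp8_reference(q, k_fp8, k_scale, head_weights, head_chunk=8):
    """q: [H, L, D] bf16 queries, k_fp8: [N, D] float8_e4m3fn, k_scale: [N] per-row
    scales, head_weights: [H]. Returns relu(q @ k^T) weighted across heads: [L, N]."""
    # Dequantize FP8 K rows with the existing per-row scale.
    k = k_fp8.to(torch.float32) * k_scale.to(torch.float32)[:, None]
    logits = torch.zeros(q.shape[1], k.shape[0], dtype=torch.float32, device=q.device)
    for h0 in range(0, q.shape[0], head_chunk):                    # small head chunks
        qh = q[h0:h0 + head_chunk].to(torch.float32)               # [h, L, D]
        wh = head_weights[h0:h0 + head_chunk].to(torch.float32)    # [h]
        scores = torch.relu(torch.einsum("hld,nd->hln", qh, k))    # [h, L, N]
        logits += torch.einsum("h,hln->ln", wh, scores)
    # The clean_logits masking step from the real path is omitted in this sketch.
    return logits
```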
The SM120 DeepSeek V4 prototype reads several experimental controls directly from the environment. vLLM warns about VLLM_* variables that are not present in envs.environment_variables, so startup printed unknown-variable warnings even though the controls worked. Register the SM120 reference attention flags, dump path, and chunk-size controls in envs.py so startup validation recognizes them without changing the existing os.getenv-based behavior.
The SM120 vLLM prototype needs the DeepGEMM branch that contains the experimental SM120 reference fallbacks. The CUDA-13 CMake architecture patch only allows DeepGEMM to be built for SM12x; if DEEPGEMM_SRC_DIR is not set, the default upstream DeepGEMM tag still lacks those fallback kernels. Point the prototype vendored DeepGEMM fetch to jasl/DeepGEMM at the SM120 fallback commit so rebuilds without DEEPGEMM_SRC_DIR use the same dependency that was validated on the DGX Spark host.
Add a minimal pipeline-parallel path for DeepSeek V4 so the model can run with TP=1 and PP=2 on the SM120 prototype branch. The causal LM wrapper now advertises SupportsPP, creates the LM head only on the last pipeline rank, and exposes the model intermediate tensor factory.

Split the core model forward by pipeline rank: the first rank embeds tokens and expands them into the HyperConnection stream, intermediate ranks receive flattened HC hidden states, non-last ranks return IntermediateTensors, and the last rank applies the HC head and final RMSNorm. The model loader now skips parameters that belong to PP-missing layers, including stacked attention/MLP weights, per-expert MoE weights, attn sinks, embeddings, and final norm parameters.

Keep this intentionally scoped to the tested prototype configuration: TP=1, PP=2, no speculative/MTP pipeline support. The MTP hidden-state buffer is only allocated and populated on the final rank, so get_mtp_target_hidden_states returns None away from the last stage.

Verification:
- supports_pp(DeepseekV4ForCausalLM) changed from False before the patch to True after the patch.
- python -m py_compile vllm/model_executor/models/deepseek_v4.py tests/models/test_deepseek_v4_pp.py
- manual invocation of tests/models/test_deepseek_v4_pp.py assertion passed in the venv.
- Started DeepSeek-V4-Flash with --tensor-parallel-size 1 --pipeline-parallel-size 2 on port 8001; PP ranks initialized as PP0/PP1, layers split as [22,21], checkpoint loaded, CUDA graphs captured, and /v1/chat/completions returned HTTP 200 with normal assistant text.

Note: python -m pytest tests/models/test_deepseek_v4_pp.py -q was not runnable because the remote venv has no pytest module.
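The rank-split forward described above follows the usual pipeline-parallel shape. A simplified sketch under assumed names; the real model routes this through vLLM's IntermediateTensors and pipeline-group helpers, so this only illustrates the control flow:

```python
import torch
from typing import Optional

class PipelineStage(torch.nn.Module):
    """Hypothetical stand-in for one PP rank's slice of the model."""

    def __init__(self, layers, is_first: bool, is_last: bool,
                 embed=None, final_norm=None):
        super().__init__()
        self.layers, self.is_first, self.is_last = layers, is_first, is_last
        self.embed, self.final_norm = embed, final_norm

    def forward(self, input_ids: Optional[torch.Tensor],
                hidden_from_prev: Optional[torch.Tensor]):
        if self.is_first:
            hidden = self.embed(input_ids)       # first rank owns the embeddings
        else:
            hidden = hidden_from_prev            # intermediate ranks receive activations
        for layer in self.layers:                # only this rank's layer slice
            hidden = layer(hidden)
        if not self.is_last:
            return {"hidden_states": hidden}     # stands in for IntermediateTensors
        return self.final_norm(hidden)           # last rank finishes before the LM head
```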
Introduce VLLM_TRITON_MLA_SPARSE as the generic control for the correctness-first sparse MLA fallback, with dump and chunk-size knobs under the same namespace. The old VLLM_SM120_* names remain registered and are still honored as aliases so existing prototype scripts keep working without unknown-env warnings. When the new control is unset, SM12x devices now select the reference sparse MLA path automatically. This keeps SM90/SM100 on the FlashMLA sparse path while avoiding the unavailable FlashMLA tile-scheduler path on SM120/SM121. The fallback can still be force-disabled with VLLM_TRITON_MLA_SPARSE=0 or force-enabled for debugging with VLLM_TRITON_MLA_SPARSE=1. The attention helper names and diagnostics now describe the implementation as a sparse MLA reference fallback instead of a DeepSeek-V4-specific SM120 switch, which is a step toward a reviewable TRITON_MLA_SPARSE backend shape. Signed-off-by: jasl <jasl9187@hotmail.com>
Honor VLLM_TRITON_MLA_SPARSE_DUMP explicitly when it is set, even if the legacy VLLM_SM120_DUMP_DEEPSEEK_V4_ATTENTION alias is also present. This keeps the new generic sparse MLA namespace authoritative while preserving the old alias as a fallback when the new variable is unset. Signed-off-by: jasl <jasl9187@hotmail.com>
Keep the vLLM CMake change scoped to the CUDA 13 SM12x architecture entry. The prototype can still build against the SM120 DeepGEMM branch by passing DEEPGEMM_SRC_DIR, but the default FetchContent source now stays on the upstream DeepGEMM repository and tag so this patch is reviewable as a simple build-compatibility change. Signed-off-by: jasl <jasl9187@hotmail.com>
Restore the default FetchContent source to the SM120 DeepGEMM prototype branch so users can build this vLLM branch directly without setting DEEPGEMM_SRC_DIR. The separate CUDA 13 SM12x architecture entry remains in place for local-source and vendored builds. Signed-off-by: jasl <jasl9187@hotmail.com>
Cover the correctness-first reference path used by the SM12x sparse MLA fallback. The tests assert that the no-sink subset attention returns log-sum-exp state, that the sink contributes only to the denominator, and that merging SWA and compressed subsets by LSE is equivalent to attention over the concatenated sparse candidates. Also add a CUDA regression for the V4 fp8_ds_mla global-slot dequant path. It builds the 584-byte block-packed cache layout directly, checks UE8M0 scale decoding, verifies invalid slots are zeroed, and exercises both 2D and 3D slot-id inputs. Signed-off-by: jasl <jasl9187@hotmail.com>
Move the PyTorch reference sparse MLA math out of DeepSeek V4 attention into vllm.v1.attention.backends.mla.sparse_mla_reference. The model path now calls shared helpers for no-sink accumulation, LSE finalization, and sink-aware subset merging instead of carrying those methods on DeepseekV4MLAAttention. Update the sparse MLA correctness tests to target the shared helper module directly. This keeps the reference contract reusable for a future TRITON_MLA_SPARSE backend while preserving the current SM12x DeepSeek V4 fallback behavior. Signed-off-by: jasl <jasl9187@hotmail.com>
Extract the DeepSeek V4 prefill reference fallback into sparse_mla_reference.reference_sparse_mla_prefill. The model layer now passes the prepared combined sparse indices, lens, sink, scale, and chunk-size knobs into a shared helper instead of carrying the prefill accumulation loop inline. Add a direct prefill correctness test that compares the shared helper against a dense golden formulation across multiple query and top-k chunk sizes, including duplicate indices, invalid -1 entries, and an all-invalid row. Signed-off-by: jasl <jasl9187@hotmail.com>
Move VLLM_TRITON_MLA_SPARSE and legacy SM120 alias parsing into a shared sparse_mla_env module. DeepSeek V4 attention and the SWA metadata builder now use the same helpers for reference fallback enablement, diagnostic dumps, and chunk-size knobs. This preserves the current behavior where unset VLLM_TRITON_MLA_SPARSE auto-enables the fallback on SM12x, while an explicit new value overrides the legacy aliases. Signed-off-by: jasl <jasl9187@hotmail.com>
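The precedence rules this helper centralizes can be summarized in a few lines. A hedged sketch, with the function name and the single legacy alias chosen for illustration (the real sparse_mla_env module covers more knobs and aliases):

```python
import os

def sparse_mla_fallback_enabled(is_sm12x: bool) -> bool:
    """Illustrative precedence: an explicit VLLM_TRITON_MLA_SPARSE value wins,
    a legacy alias is honored only when the new name is unset, and leaving both
    unset auto-enables the fallback on SM12x devices."""
    new = os.getenv("VLLM_TRITON_MLA_SPARSE")
    if new is not None:
        return new == "1"
    legacy = os.getenv("VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION")
    if legacy is not None:
        return legacy == "1"
    return is_sm12x
```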
Allocate the intermediate wo_a einsum output through the V1 workspace manager instead of torch.empty. This removes one hot-path dynamic allocation after sparse MLA attention has completed while leaving the attention output buffer as a regular tensor, since that buffer must survive nested workspace use inside the attention implementation. Signed-off-by: jasl <jasl9187@hotmail.com>
Cover the shared sparse MLA environment helper semantics, including new-name precedence over the legacy SM120 aliases, legacy chunk-size fallbacks, invalid chunk-size defaults, and diagnostic dump path precedence. Signed-off-by: jasl <jasl9187@hotmail.com>
Add a small Triton kernel for merging the compressed and SWA sparse MLA subset outputs with the DeepSeek V4 attention sink. The kernel implements the stable LSE merge formula and is used by the SM12x compressed decode reference path after the two subset attentions have produced out/lse pairs. Keep the PyTorch merge helper as the correctness oracle and add a CUDA regression that compares the Triton merge against that reference, including -inf subset LSE entries. Signed-off-by: jasl <jasl9187@hotmail.com>
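For orientation, a minimal sketch of what a stable LSE sink-merge Triton kernel of this shape can look like. It is not the kernel added by this commit; it assumes contiguous float32 tensors on one CUDA device and one program per (token, head):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _sink_merge_kernel(out_a_ptr, lse_a_ptr, out_b_ptr, lse_b_ptr, sink_ptr,
                       out_ptr, head_dim, num_heads, BLOCK_D: tl.constexpr):
    # Merge two subset outputs; the sink contributes only to the denominator.
    pid = tl.program_id(0)
    head = pid % num_heads
    offs = tl.arange(0, BLOCK_D)
    mask = offs < head_dim
    lse_a = tl.load(lse_a_ptr + pid)
    lse_b = tl.load(lse_b_ptr + pid)
    sink = tl.load(sink_ptr + head)
    m = tl.maximum(tl.maximum(lse_a, lse_b), sink)   # max-stabilized merge
    w_a = tl.exp(lse_a - m)
    w_b = tl.exp(lse_b - m)
    denom = w_a + w_b + tl.exp(sink - m)
    a = tl.load(out_a_ptr + pid * head_dim + offs, mask=mask, other=0.0)
    b = tl.load(out_b_ptr + pid * head_dim + offs, mask=mask, other=0.0)
    tl.store(out_ptr + pid * head_dim + offs, (w_a * a + w_b * b) / denom, mask=mask)

def sink_merge(out_a, lse_a, out_b, lse_b, sink):
    """out_*: [T, H, D] float32, lse_*: [T, H] float32, sink: [H] float32."""
    T, H, D = out_a.shape
    out = torch.empty_like(out_a)
    BLOCK_D = triton.next_power_of_2(D)
    _sink_merge_kernel[(T * H,)](out_a, lse_a.reshape(-1), out_b, lse_b.reshape(-1),
                                 sink, out, D, H, BLOCK_D=BLOCK_D)
    return out
```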
Add a portable Triton online-softmax accumulator for gathered sparse MLA subsets. The kernel updates per-token/head max, denominator, and accumulator state across chunks, and a finish kernel emits the subset-normalized output and LSE. Route the DeepSeek V4 compressed decode reference path through the Triton accumulator for both compressed top-k and SWA subsets while keeping the existing fp8_ds_mla dequantize-by-slot kernels and sink-aware LSE merge boundary intact. Extend sparse MLA reference tests with CUDA parity coverage for chunked slot-id accumulation, the slot-id-free SWA path, and 512-dim heads. Signed-off-by: jasl <jasl9187@hotmail.com>
Add a Triton accumulator that reads DeepSeek V4 fp8_ds_mla packed KV-cache entries directly from global slot ids, dequantizes the 448 FP8 NoPE dimensions plus 64 BF16 RoPE dimensions, and updates the same online softmax state used by the portable sparse MLA fallback. Use the fused packed-cache accumulator for the compressed top-k decode subset so the SM12x fallback no longer materializes a BF16 compressed_kv scratch buffer before attention. The SWA subset still uses the gathered BF16 accumulator and the final sink-aware LSE merge remains unchanged. Cover the new path with CUDA parity tests against the PyTorch reference, including chunked slot-id accumulation, invalid slots, lens truncation, and the 584-byte fp8_ds_mla layout. Signed-off-by: jasl <jasl9187@hotmail.com>
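As a sanity check on the packed layout quoted here, the byte counts add up as shown below; attributing the remaining 8 bytes to per-entry scale metadata is an assumption, and the tests in this commit are the authoritative description of the layout:

```python
# Back-of-the-envelope check of the 584-byte fp8_ds_mla entry quoted above
# (448 FP8 NoPE dims + 64 BF16 RoPE dims); the 8 leftover bytes are presumably
# per-entry scale metadata (e.g. UE8M0 tile scales), an assumption, not the spec.
NOPE_FP8_BYTES = 448 * 1      # one byte per FP8 value
ROPE_BF16_BYTES = 64 * 2      # two bytes per BF16 value
SCALE_BYTES = 584 - NOPE_FP8_BYTES - ROPE_BF16_BYTES
assert NOPE_FP8_BYTES + ROPE_BF16_BYTES + SCALE_BYTES == 584
print(NOPE_FP8_BYTES, ROPE_BF16_BYTES, SCALE_BYTES)  # 448 128 8
```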
Add a Triton sparse MLA accumulator that reads DeepSeek V4 fp8_ds_mla packed KV-cache entries through seq_lens and block_table instead of pre-gathering the sliding-window subset into a BF16 scratch buffer. Use the paged accumulator for the SWA subset inside the compressed decode fallback, keeping the compressed top-k accumulator and sink-aware LSE merge unchanged. Cover the new path with a CUDA parity test using non-trivial block-table mappings plus a 128-token, 512-dim SWA smoke run against the PyTorch reference. Signed-off-by: jasl <jasl9187@hotmail.com>
Add a Triton single-subset sink merge for sparse MLA outputs and use it with the paged fp8_ds_mla accumulator to remove the BF16 gather/reference path from SWA-only decode. The SWA-only fallback now reads the packed paged cache directly, finishes the no-sink attention state, then applies the sink denominator in Triton. Tests cover the single-subset merge and paged SWA attention with sink against the PyTorch reference. Signed-off-by: jasl <jasl9187@hotmail.com>
Add a BF16 indexed sparse MLA accumulator that reads kv_flat by combined_indices and updates online softmax state without materializing gathered_kv. Route the DeepSeek V4 sparse MLA prefill fallback through query/top-k chunked Triton accumulation, finish, and single-subset sink merge. The PyTorch reference helper remains as a test oracle. Tests cover invalid indices, duplicates, all-invalid rows, query chunking, and top-k chunking against the reference prefill path. Signed-off-by: jasl <jasl9187@hotmail.com>
@tonyliu312 Please check my latest commit, and feel free to cherry-pick.
I nearly doubled the performance.
Use active prefill sequence and gather lengths to size the DeepSeek V4 sparse MLA staging workspace instead of reserving against max_model_len and max_num_batched_tokens. This keeps the gathered KV row stride bounded by the current request batch, which matters for long-prompt agent workloads and especially large max-model-len configurations. Also route sparse indexer prefill logits sizing through a shared helper. SM12x now defaults to a 256 MiB logits cap when VLLM_SPARSE_INDEXER_MAX_LOGITS_MB is unset, while explicit env overrides keep the previous behavior. The profiling dummy allocation uses the same helper as runtime chunking so the memory profile reflects the configured cap. Validation: python -m pytest -q tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py; python -m pytest -q tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py tests/v1/attention/test_sparse_mla_backends.py -k 'prefill_workspace_bounds or sparse_indexer_max_logits_bytes or split_indexer_prefill_chunks'; python -m ruff check vllm/model_executor/layers/deepseek_v4_attention.py vllm/v1/attention/backends/mla/sparse_swa.py vllm/v1/attention/backends/mla/indexer.py vllm/model_executor/layers/sparse_attn_indexer.py tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py tests/v1/attention/test_sparse_mla_backends.py; long-prompt smoke 15k input x10 on TP=2/EP/eager succeeded.
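To make the logits-cap knob concrete, a small illustrative helper showing how a MiB cap bounds the number of prefill query rows scored per chunk; the helper name, rounding, and logits dtype size are assumptions, not the shared helper added by this commit:

```python
import os

def indexer_prefill_chunk_rows(kv_len: int, logits_dtype_bytes: int = 4,
                               default_cap_mib: int = 256) -> int:
    """Each prefill query row produces kv_len logits, so a byte cap on the
    logits buffer directly bounds how many rows can be scored per chunk."""
    cap_mib = int(os.getenv("VLLM_SPARSE_INDEXER_MAX_LOGITS_MB", default_cap_mib))
    cap_bytes = cap_mib * 1024 * 1024
    return max(1, cap_bytes // (kv_len * logits_dtype_bytes))
```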
@BehindTheCartan UPDATE: I'm working on improving stability for long prefill, will update tomorrow.
Route multi-token speculative decode through global-slot sparse MLA accumulation for both SWA-only and compressed sparse paths while keeping single-token decode on the existing paged fast paths. Disable sparse MLA CUDA graph capture by default when speculative decoding is configured, preserving the explicit env override. Add regression coverage for MTP-shaped sparse MLA decode metadata and cudagraph policy.
```cmake
  list(APPEND DEEPGEMM_SUPPORT_ARCHS "10.0a")
endif()
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
  list(APPEND DEEPGEMM_SUPPORT_ARCHS "12.0f")
```
Do we also need 12.1a here for DGX Spark?
12.0f covers the whole 12.x family.
I see in the log:
```text
[gpu_model_runner.py:4884] Model loading took 39.77 GiB memory and 522.056257 seconds
Running NVCC command: cd /root/.cache/vllm/deep_gemm/tmp && /usr/local/cuda/bin/nvcc /root/.cache/vllm/deep_gemm/tmp/203-af1d4b4f-d7902775-636c4fd8/kernel.cu -cubin -o /root/.cache/vllm/deep_gemm/tmp/203-af1d4b4f-d7902775-636c4fd8/kernel.cubin -std=c++20 --diag-suppress=39,161,174,177,186,940 --ptxas-options=--register-usage-level=10 -I/usr/local/lib/python3.12/dist-packages/vllm/third_party/deep_gemm/include -gencode=arch=compute_120f,code=sm_120f --compiler-options=-fPIC,-O3,-fconcepts,-Wno-deprecated-declarations,-Wno-abi -O3 --expt-relaxed-constexpr --expt-extended-lambda
```
I was wondering whether this is an environment problem on my side; I'm not sure, but I would have expected to see 121f there :) OK, I'll continue with testing...
DeepSeek-V4-Pro working on 8× DGX Spark (sm_121, TP=8) — recipe + suggested upstream nudge

Hi @jasl @tonyliu312 — first, thanks for the SM12x V4 work in #40899 and the Stack
TL;DR

V4-Pro now fires up + serves coherently on the 8-Spark cluster with three changes:
Verified output (coherent,
What we observed (and why the third flag was the breaker)

After applying #40923 + the rebuild, V4-Flash worked immediately but V4-Pro hung in MoE weight materialization for 5/8→3/8 workers (depending on run). The hung workers were ALL stuck in

The slow workers weren't doing anything special —

Setting

Suggested upstream nudge

The

Suggested change: invert the condition — when checkpoint > 90% RAM, enable prefetch by default (regardless of FS), with the comment "user can

This would have saved us about 5 hours of diagnostic cycles. Happy to file as a separate vllm issue if useful.

Numbers (V4-Pro on 8× DGX Spark, TP=8, max-model-len=1024, gpu-memory-utilization=0.88)
vllm serve invocation that works

```shell
vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  --tensor-parallel-size 8 --pipeline-parallel-size 1 \
  --max-model-len 1024 --gpu-memory-utilization 0.88 \
  --safetensors-load-strategy prefetch \
  --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice --reasoning-parser deepseek_v4 \
  --host 0.0.0.0 --port 5001 \
  --distributed-executor-backend ray --enforce-eager
```

Container env (key flags carried over from V4-Flash recipe):

Attachments
Thanks

The combination of #40899 (SM12x V4 support), #40923 (Marlin sm_12x cubins),

— @idonati
Supporting evidence (inlined for searchability):
- Coherent V4-Pro samples (full, captured by diagnose-pro-marlin.sh after FIRED-UP)
- Per-node memory snapshot during V4-Pro inference (TP=8, max-model-len=1024, gpu-mem=0.88)
- Engine log around fired-up (last 80 lines)
@idonati this is fantastic data, thanks for the detailed write-up. A few quick reactions:

On #40923 reproduction. Independent confirmation from an 8-node TP=8 setup — using the exact same patch — is exactly the kind of validation that makes a small CMakeLists change credible. Two completely separate hardware/network configurations (dual-Spark TP=2 RoCE bonded + 8-node TP=8 RoCE multi-rail) hitting the same broken-PTX-JIT failure mode and being unblocked by the same arch-list addition is, IMO, the strongest signal we can give reviewers that the precedent already set by

On the V4-Flash vs V4-Pro divergence (256 vs 384 experts). Your point that V4-Flash was getting "lucky on the broken PTX path" while V4-Pro consistently hit the dead

On the prefetch nudge. +1 — the current heuristic optimises for the wrong axis (FS type). Inverting the condition (when checkpoint > 90% RAM, enable prefetch by default with

On our V4-Flash baseline. For reference, dual DGX Spark (TP=2, sm_121) running V4-Flash with

Thanks again — this is the kind of report that moves a PR forward.
This pull request has merge conflicts that must be resolved before it can be merged.
`w8a8_triton_block_scaled_mm` falls back to a hardcoded default config when no pre-tuned `configs/N=*,K=*,device_name=*.json` file matches the GPU. The default uses `BLOCK_SIZE_M=64`, which wastes 98% of the M dimension in single-request decode (M=1). GPUs without a pre-tuned JSON file for their (N, K, device) tuple pay this cost.

Narrow the change: only specialize the M<=8 case (single-request decode and short MTP-style draft batches). Larger M keeps the previous default unchanged so non-decode paths and tuned configs are not perturbed.

- M <= 8 (CUDA) -> BLOCK_SIZE_M=16, num_stages=3 (new)
- M <= 8 (ROCm) -> BLOCK_SIZE_M=16, num_stages=2 (new)
- else -> BLOCK_SIZE_M=64, num_stages=2 (previous default)

num_stages=3 is gated to non-ROCm because MI300/MI250X LDS (64 KB) is borderline for 3-stage Triton pipelining at typical [128, 128] block sizes; on ROCm we keep num_stages=2 so the M<=8 branch still gets the BLOCK_SIZE_M=16 wave-quantisation win without LDS pressure. Pre-tuned JSON configs are unaffected (they short-circuit before this branch). Workloads that already have a JSON for their (N, K, device) get the same kernel as before.

Verified on dual DGX Spark (GB10, sm_121, TP=2) running V4-Flash: median single-request decode goes from 5.45 t/s to 6.73 t/s (+23%) with no other changes. Output remains coherent. The win is expected to generalize to other architectures lacking a pre-tuned JSON for the target (N, K) pair, but only the GB10 case is verified here; reviewers on Hopper/Ampere are welcome to confirm or push back.

Refs vllm-project#40860 (V4 rebase), vllm-project#40899 (jasl SM12x scope is orthogonal)

Signed-off-by: Tony Liu <tonyliu0512@gmail.com>
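A condensed sketch of the fallback branch this commit message describes; the real config dict carries more fields (BLOCK_SIZE_N, BLOCK_SIZE_K, num_warps, and so on), so this only illustrates the M-dependent choice:

```python
def default_block_scaled_mm_config(M: int, is_rocm: bool) -> dict:
    """Shape of the fallback described above, used only when no pre-tuned
    configs/N=*,K=*,device_name=*.json entry matches the current GPU."""
    if M <= 8:  # single-request decode and short MTP-style draft batches
        return {"BLOCK_SIZE_M": 16, "num_stages": 2 if is_rocm else 3}
    return {"BLOCK_SIZE_M": 64, "num_stages": 2}  # previous default, unchanged
```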
youkaichao left a comment
for such a big change, we will not be able to review and accept it. you can keep it in a fork. thanks for your interest.
@jasl Can you actually rebase with the current main branch, so that we can see the diffs more clearly?
@WoosukKwon I'm re-organizing a new PR based on the latest main |
@tonyliu312 Thanks again for the detailed reply. I ran your three suggestions on the 8× DGX Spark cluster and have results to share. 1.
| Task | Metric | Score | Notes |
|---|---|---|---|
| GSM8K | flexible-extract | 96.0% ± 2.8% | 5-shot, no chat template applied |
| GSM8K | strict-match | 94.0% ± 3.4% | 5-shot, no chat template applied |
| MMLU | aggregate (acc) | 30.7% ± 0.84% | 0-shot, no chat template, gpt2 tokenizer used for length budgeting only — see caveat below |
| MMLU | stem | 37.4% | |
| MMLU | humanities | 28.3% | |
| MMLU | social sciences | 25.7% | |
| MMLU | other | 28.0% |
MMLU caveat: aggregate is suppressed by 0-shot + no chat template + tokenizer fallback (lm-eval can't load the deepseek_v4 tokenizer config in transformers 5.6 because of a model_type=deepseek_v4 registration gap; I used the gpt2 tokenizer for context-length budgeting since tokenized_requests=False sends raw text to vLLM and the server tokenizes correctly). The score should not be read as the model's true MMLU number — it confirms the serving stack handled 11,400 sequential loglikelihood requests over 80 minutes without crashing or returning empty content. The stem-vs-humanities gradient looks directionally plausible.
GSM8K is the meaningful number — 96.0% is in line with reasoning-focused SOTA and confirms inference quality is intact end-to-end on the 8-Spark TP=8 InstantTensor stack.
5. The compiled-DAG hang — characterized precisely
This is what I think you were really pointing at, and I can now reproduce it deterministically.
I first ran GSM8K with num_concurrent=8. At request ~24, vLLM threw:
```text
ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected
to take a long time, increase RAY_CGRAPH_get_timeout which is currently 300 seconds.
Otherwise, this may indicate that the execution is hanging.
```
Engine died, all subsequent requests returned HTTP 500. Full Ray actor cancellation + compiled DAG teardown in the logs.
Re-running the same task with num_concurrent=4 against the same engine: clean run, 50/50 success, the 96% / 94% scores above. Per-request latency 5–10s steady state, individual ranks never block the channel longer than 300s.
So under concurrent generative load with longer chain-of-thought outputs and 8 parallel client streams against an 8-rank TP=8 cluster, the compiled DAG channel timeout (default 300s) gets hit before the slowest rank completes its allocation. Two practical workarounds:
- Client-side: cap `num_concurrent` ≤ 4 in lm-evaluation-harness or any benchmark client.
- Server-side: `export RAY_CGRAPH_get_timeout=900` (or higher) at launch — a Ray-level env var that propagates fine through the vLLM → Ray executor stack.
I've gone with the server-side env in our recipe so production isn't sensitive to client tuning. Single-stream production traffic at TP=8 has never reproduced this regardless of duration (the original 30-min soak ran clean — it never had >1 in-flight request).
If a master toggle to fall back to the legacy non-compiled Ray executor is meant to still exist, I haven't found the right name in vllm/envs.py — happy to re-test if you point me at the current flag.
6. Separate finding worth a follow-up
--safetensors-load-strategy prefetch is load-bearing on EXT4 for V4-Pro on this cluster. Without it, post-shard-load weight materialization random-reads from NVMe per-tensor and 3 of 8 workers straggle past 60 minutes (effective hang). With prefetch, V4-Pro fires up in ~12 min. Want me to file a separate issue with the worker-stall trace so the heuristic can decide more aggressively when to default-on prefetch (probably "if model size > 500 GB AND filesystem != tmpfs")? Happy to tag you on it.
Cluster details for reference: 8× NVIDIA DGX Spark (GB10 / sm_121, 128 GiB unified memory, ARM64), dual-rail 200G RoCE multi-switch fabric (4× MikroTik CRS804-4DDQ uplinks). Image built on nvcr.io/nvidia/pytorch:25.11-py3 + your sm_12x Marlin patch (PR #40923) + TORCH_CUDA_ARCH_LIST="12.0;12.1" + the InstantTensor loader.
Re-test on
I created a new PR #40991 against the latest main. @BehindTheCartan |

This PR is based on #40760
Companion to deepseek-ai/DeepGEMM#318
Tested on 2 x RTX Pro 6000 (SM120)
Updated: 4.27
Summary
This PR is a runnable prototype for DeepSeek V4 Flash on NVIDIA SM12x GPUs, tested on RTX PRO 6000 Blackwell / SM120.
The main goal is to unblock correctness and end-to-end serving on workstation/consumer Blackwell GPUs where the existing FlashMLA / DeepGEMM SM90-SM100 paths are not available.
This stack includes:
- `fp8_ds_mla` cache support for sparse MLA.
- DeepGEMM pinned to jasl/DeepGEMM@7a7a41a1bac7dacabe74057e7600e59f98f85bce.

Why
DeepSeek V4 currently depends on kernels that are available on datacenter Hopper/Blackwell paths but not on SM120/SM121 GPUs.
In particular, SM12x cannot directly reuse the existing SM90 WGMMA or SM100 tcgen05-based implementations. This PR adds a portable fallback path so DeepSeek V4 can run on SM12x first, with performance optimization left as incremental follow-up work.
Scope
Implemented / included:
- `fp8_ds_mla` packed cache handling.

Not intended as final form:
Kernel capability matrix
This stack keeps the existing optimized SM90/SM100 paths intact and only adds
SM12x fallback coverage for the DeepSeek V4 blockers. The last column is listed
to make the dependency boundary explicit: this PR is tested with the pinned
DeepGEMM fork, not as a fully DeepGEMM-free stack.
- `fp8_ds_mla`, SWA + compressed sparse candidates, and sink-aware denominator merge. Enabled automatically on SM12x; `VLLM_TRITON_MLA_SPARSE=1` can force it.
- `fp8_gemm_nt`
- `fp8_m_grouped_gemm_nt` / FP8 MoE
- `fp8_einsum`: `bhr,hdr->bhd` with recipe `(1,128,128)` on SM12x; other recipes still use the DeepGEMM wrapper.
- `fp8_fp4_mqa_logits`
- `fp8_fp4_paged_mqa_logits`: `VLLM_DEEP_GEMM_SM120_PAGED_MQA_TILED=1` forwards `DG_SM120_PAGED_MQA_TILED=1` to use the tiled implementation when present.
- `tf32_hc_prenorm_gemm`

Validation
Test machine:
Static / unit checks
DeepGEMM pin checks:
Runtime environment knobs
Build / CUDA environment used for local editable installs and CUDA extension builds:
Sparse MLA / DeepGEMM runtime knobs:
| Knob | Default | Notes |
|---|---|---|
| `VLLM_TRITON_MLA_SPARSE` | unset (auto-enabled on SM12x) | `1` to force-enable it, or `0` to disable it. |
| `VLLM_DEEP_GEMM_SM120_PAGED_MQA_TILED` | `1` | `0` to force the simpler compatibility fallback. |
| `VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH` | `1` | `0` to disable compile/cudagraph capture for debugging. |
| `VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE` | `512` | |
| `VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE` | `256` | |
| `VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE` | unset | Accepts `1`, `2`, and `4`; unset uses `1` for single-token decode, `2` for small decode batches, and `4` for larger decode batches. |
| `VLLM_TRITON_MLA_SPARSE_DUMP` | `0` | |
| `VLLM_TRITON_MLA_SPARSE_DUMP_PATH` | `/tmp/deepseek_v4_triton_mla_sparse_dump.jsonl` | Only used when `VLLM_TRITON_MLA_SPARSE_DUMP=1`. |

Unsupported follow-up knobs checked
Two potentially useful runtime knobs were checked separately and are not currently usable on this SM120 path.
`use_fp4_indexer_cache` must be passed through the JSON attention config, not as a dotted argparse option:

```shell
--attention-config '{"use_fp4_indexer_cache":true}'
```

The dotted form is rejected by argparse:
```shell
--attention_config.use_fp4_indexer_cache=True  # error: unrecognized arguments
```

Even with the correct JSON form, startup fails on RTX PRO 6000 Blackwell / SM120 because the current vLLM indexer metadata builder gates this feature to datacenter Blackwell SM10x:
MTP speculative decoding must also use the dashed JSON CLI form:
```shell
--speculative-config '{"method":"mtp","num_speculative_tokens":2}'
```

The model-side MTP patch is present in this branch: the target model exposes the
pre-
`hc_head` residual buffer, and startup loads the MTP draft model and shares
lm_head/topk_indices_bufferweights. The SM12xsparse MLA decode fallback now also accepts MTP-shaped
q_len > 1decode byusing explicit global-slot sparse indices for the SWA subset instead of the
single-token paged SWA window path.
This is intentionally still experimental. It is useful as a correctness bridge
and as an optimization target for other developers, but current throughput is
below the non-MTP path.
Serving smoke
Short-context serving works with:
```shell
VLLM_TRITON_MLA_SPARSE=1 \
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.985 \
  --tensor-parallel-size 2 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4
```

1M context initialization also works with:
Relevant startup log:
Serving benchmark
Common benchmark shape:
Peak VRAM is the maximum
`memory.used` observed by a 1-second `nvidia-smi` sampler during each benchmark run.

TP=2, PP=1
Compared with the previous PR table, the same TP=2/PP=1 c=32 benchmark moved
from 626.69 to 781.47 output tok/s, and c=1 moved from 45.14 to 97.61 output
tok/s.
TP=1, PP=2
`fp8_fp4_mqa_logits` torch reference path, then EngineCore waited on a dead worker

PP=2/TP=1 is stable through
`max_concurrency=16` in this short-context benchmark,
max_concurrency=32still overcommits memory. The c=32 row is included as afailure case only and should not be interpreted as usable throughput.
MTP speculative decoding
At
gpu_memory_utilization=0.985, startup reaches readiness but the firstrequest OOMs inside the DeepSeek compressor Triton path. Lowering to
0.92leaves enough transient headroom for a short serving smoke and the benchmark
below.
A comparable c=1 MTP run was stopped after 5/32 requests because it was taking
roughly 40 seconds per request. No official throughput row is reported for that
partial run.
Known limitations
- `--gpu-memory-utilization`: tested working at `0.985`, while `0.99` is rejected by vLLM startup memory guard on this machine. `0.985` supports approximately one full 1M-token request, not multiple full-context concurrent requests.
- `q_len > 1` decode paths are included, but MTP speculative serving remains experimental. It currently needs extra memory headroom and regresses throughput versus the non-MTP path.
- `--attention-config '{"use_fp4_indexer_cache":true}'` is not supported on SM120; current vLLM gates it to datacenter Blackwell SM10x.