perf(sm120): add FP8 W8A8 Block GEMM autotune configs + EAGLE benchmark for RTX PRO 6000 #25696
Open
AdamPlatin123 wants to merge 6 commits into
Adds full SM120 (RTX PRO 6000 / RTX 5090 / DGX Spark) support for DeepSeek-V4 on SGLang, rebased onto main branch.

Key changes:
- Triton MXFP4 MoE kernel for SM120 (no MARLIN/tcgen05 on desktop Blackwell)
- Triton FlashMLA sparse decode kernel for SM120
- MQA wq-precompute with vectorized batch for CUDA graph compatibility
- DeepGEMM/PDL guards for SM120 (no TMEM/tcgen05)
- NSA backend SM120 dispatch (tilelang default, skip DeepGEMM metadata)
- FlashMLA SM120 adapter for deepseek_v4_backend
- 3 CUDA-graph-breaking paths fixed (MoE .unique/.item, NSA/Compressed MQA)

Results (8x RTX PRO 6000, TP=8):
- Decode: 10.26 tok/s BS=1 with CUDA graph (2.4x vs without)
- GSM8K 5-shot: 98.0% accuracy (200 questions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address all reviewer feedback from PR sgl-project#24692:
- Use is_sm120_supported() helper instead of raw sm_version checks
- Guard SGLANG_OPT_DEEPGEMM_HC_PRENORM and SGLANG_OPT_USE_TILELANG_MHC_PRE with `not is_sm120_supported()` in deepseek_v4.py
- Auto-select marlin MoE backend on SM120 in deepseek_v4_hook.py
- Minor cleanups in indexer, metadata, nsa_backend, mxfp4_marlin_moe

Fix FlashMLA Triton kernel garbled output on latest sglang:dev image:
- Root cause: upstream changed KV cache dtype from float8_e4m3fn to uint8. The Triton kernel's as_strided() preserved the input dtype, so tl.load interpreted FP8 bit patterns as raw integers, corrupting attention scores.
- Fix: explicitly view through uint8 → float8_e4m3fn before passing to Triton.

Verified on sglang:dev-cu13 (sgl-kernel 0.4.2.post1, PyTorch 2.11+cu130):
- GSM8K 5-shot 200q: 99.0%
- Decode BS=1: 11.40 tok/s, TPOT 87.7ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
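To illustrate the root cause described above, here is a minimal pure-Python sketch of how the same byte reads as a raw integer versus as a float8_e4m3fn value. The helper `decode_e4m3fn` is hypothetical and written only for this illustration; it is not part of the PR or of SGLang.

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one byte as float8_e4m3fn: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")  # the only NaN encoding; the format has no infinities
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


raw = [0x38, 0x40, 0xB8, 0x7E]

# What a kernel sees if the buffer keeps its uint8 dtype.
as_int = list(raw)
# What the attention math actually needs.
as_fp8 = [decode_e4m3fn(b) for b in raw]

print(as_int)  # [56, 64, 184, 126]
print(as_fp8)  # [1.0, 2.0, -1.0, 448.0]
```

The bytes are identical in both cases; only the interpretation differs, which is why a dtype reinterpretation (rather than a numeric conversion) is the appropriate fix. In PyTorch this kind of reinterpretation is typically expressed with `Tensor.view(torch.float8_e4m3fn)` on a uint8 tensor; the exact call site in the PR is not shown here.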
# Conflicts:
#   python/sglang/srt/layers/attention/dsv4/indexer.py
#   python/sglang/srt/layers/quantization/mxfp4_marlin_moe.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port 23 autotune configuration JSON files from the SM120 fork branch, eliminating all "Performance might be sub-optimal" warnings on RTX PRO 6000 Blackwell. These configs provide pre-tuned tile sizes for the Triton FP8 block-scaled GEMM kernel across all DeepSeek-V4 layer dimensions (N×K = 1280×5120 through 7168×18432). Verified: EAGLE 1/1/2 decode avg 44.4→45.0 tok/s, peak 52.4 tok/s on 2× RTX PRO 6000 (TP=2, ctx=8192).
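As a rough sketch of how such per-shape configs can be located, the snippet below builds a file name following the naming pattern listed under Changes and loads it if present. The helpers `config_filename` and `load_config` are hypothetical names chosen for this illustration, and the assumption that spaces in the device name map to underscores is inferred from the file name pattern, not taken from the SGLang source.

```python
import json
from pathlib import Path
from typing import Optional, Sequence


def config_filename(n: int, k: int, device_name: str,
                    block_shape: Sequence[int], dtype: str = "fp8_w8a8") -> str:
    """Build a config file name in the pattern used by the files in this PR."""
    device = device_name.replace(" ", "_")  # assumption: spaces become underscores
    return (f"N={n},K={k},device_name={device},"
            f"dtype={dtype},block_shape={list(block_shape)}.json")


def load_config(config_dir: str, n: int, k: int, device_name: str,
                block_shape: Sequence[int]) -> Optional[dict]:
    """Return the parsed config, or None when no tuned file exists for this shape."""
    path = Path(config_dir) / config_filename(n, k, device_name, block_shape)
    if not path.is_file():
        return None  # a caller would fall back to default tile sizes here
    with path.open() as f:
        return json.load(f)


name = config_filename(
    1280, 5120, "NVIDIA RTX PRO 6000 Blackwell Workstation Edition", [128, 128]
)
print(name)
```

A missing file for a given (N, K, device) combination is the situation in which a "Performance might be sub-optimal" warning would be expected, which is why covering every layer dimension matters.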
Summary
Adds an SM120 (RTX PRO 6000 Blackwell, CC 12.0) optimization layer on top of PR #24692 (AliceChenyy SM120 support). This PR contributes 23 pre-tuned FP8 W8A8 Block GEMM autotune configuration files for the RTX PRO 6000, eliminating all "Performance might be sub-optimal" warnings and improving decode throughput.

Depends on: #24692 (must be merged first)
Changes
New files (23 autotune configs):
configs/N=*,K=*,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Workstation_Edition,dtype=fp8_w8a8,block_shape=[128, 128].json

Benchmark Results
Hardware: 2× RTX PRO 6000 Blackwell (96 GB each), TP=2
Model: DeepSeek-V4-Flash (FP8, 158B params, unchanged open-source weights)
Container: CUDA 13, PyTorch 2.11+cu130, sglang-kernel 0.4.2.post1
Decode Speed — EAGLE Parameter Sweep (18 tests × 6 personas)
Per-Persona Speed (EAGLE 1/1/2 + autotune)
Comparison with PR #24692 on same hardware
Startup Command
python3 -m sglang.launch_server \
  --model-path /models/DeepSeek-V4-Flash \
  --tp 2 --trust-remote-code \
  --host 0.0.0.0 --port 30000 \
  --context-length 8192 \
  --mem-fraction-static 0.95 \
  --fp8-gemm-backend triton \
  --disable-custom-all-reduce \
  --cuda-graph-bs 1 2 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2

Key Findings
Model Info
Test plan
CI States
Latest PR Test (Base): ❌ Missing run-ci label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.