perf(sm120): add FP8 W8A8 Block GEMM autotune configs + EAGLE benchmark for RTX PRO 6000 by AdamPlatin123 · Pull Request #25696 · sgl-project/sglang

AdamPlatin123 · 2026-05-18T23:22:20Z

Summary

Adds SM120 (RTX PRO 6000 Blackwell, CC 12.0) optimization layer on top of PR #24692 (AliceChenyy SM120 support). This PR contributes 23 pre-tuned FP8 W8A8 Block GEMM autotune configuration files for the RTX PRO 6000, eliminating all "Performance might be sub-optimal" warnings and improving decode throughput.

Depends on: #24692 (must be merged first)

Changes

New files (23 autotune configs):

23 × configs/N=*,K=*,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Workstation_Edition,dtype=fp8_w8a8,block_shape=[128, 128].json
Covers all DeepSeek-V4-Flash FP8 GEMM dimensions: N×K from 512×7168 to 32768×512

Benchmark Results

Hardware: 2× RTX PRO 6000 Blackwell (96 GB each), TP=2
Model: DeepSeek-V4-Flash (FP8, 158B params, unchanged open-source weights)
Container: CUDA 13, PyTorch 2.11+cu130, sglang-kernel 0.4.2.post1

Decode Speed — EAGLE Parameter Sweep (18 tests × 6 personas)

Configuration	Avg tok/s	Median tok/s	Range	Success
No EAGLE (baseline)	33.6	34.4	24.7 – 34.7	17/18
EAGLE 3/1/4 (official)	36.7	36.5	34.6 – 39.5	18/18
EAGLE 2/1/3	42.3	42.5	34.0 – 53.0	17/18
EAGLE 1/1/2 (no autotune)	44.4	44.8	31.2 – 50.6	17/18
EAGLE 1/1/2 + FP8 autotune	45.0	46.0	26.8 – 52.4	17/18

Per-Persona Speed (EAGLE 1/1/2 + autotune)

Persona	No EAGLE	+ EAGLE + autotune	Speedup
Casual chat	33.4	39.0	+17%
Programmer	34.3	49.2	+43%
Writer	31.1	38.9	+25%
Researcher	34.4	46.3	+35%
Enterprise	34.1	49.1	+44%
Multi-turn	34.5	47.8	+39%

Comparison with PR #24692 on same hardware

Setup	Decode	Notes
PR #24692, 8× RTX PRO 6000, TP=8	11.4 tok/s	Author's benchmark
PR #24692, 2× RTX PRO 6000, TP=2	~2.3 tok/s	Author's comment, no EAGLE
This PR, 2× RTX PRO 6000, TP=2	45.0 avg / 52.4 peak	EAGLE 1/1/2 + autotune

Startup Command

python3 -m sglang.launch_server \
    --model-path /models/DeepSeek-V4-Flash \
    --tp 2 --trust-remote-code \
    --host 0.0.0.0 --port 30000 \
    --context-length 8192 \
    --mem-fraction-static 0.95 \
    --fp8-gemm-backend triton \
    --disable-custom-all-reduce \
    --cuda-graph-bs 1 2 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 2

Key Findings

EAGLE 1/1/2 is optimal for SM120 TP=2 — fewer draft tokens yield higher acceptance rate, lower overhead than 3/1/4 config
FP8 autotune configs eliminate all "sub-optimal" warnings — pre-tuned tile sizes for every DSV4 layer dimension
Context length 8192 is the hard maximum for 2× RTX PRO 6000 with this model (KV pool + activations ~185 GB)
Model weights are untouched — all optimizations are in SGLang runtime (Triton kernels, EAGLE spec dec, autotune configs)

Model Info

Field	Value
Model	DeepSeek-V4-Flash (open-source, unmodified)
Parameters	~158B
Quantization	FP8 (e4m3, dynamic, block_size [128, 128])
Layers	43
Experts	256 routed + 1 shared, 6 active per token
MTP layers	1 (used by EAGLE speculative decoding)

Test plan

18-test multi-persona benchmark (6 personas × 3 prompts each)
EAGLE parameter sweep (1/1/2, 2/1/3, 3/1/4)
FP8 autotune A/B test (with vs without configs)
Context capacity test (4096, 8192, 8704+ OOM boundary)
Long prompt test (7000+ input tokens with expandable_segments)
Output quality verification (all 18 tests produce correct, coherent output)
CI (no SM120 runner available — tested on local 2× RTX PRO 6000)

CI States

Latest PR Test (Base): ❌ Missing run-ci label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.

Adds full SM120 (RTX PRO 6000 / RTX 5090 / DGX Spark) support for DeepSeek-V4 on SGLang, rebased onto main branch. Key changes: - Triton MXFP4 MoE kernel for SM120 (no MARLIN/tcgen05 on desktop Blackwell) - Triton FlashMLA sparse decode kernel for SM120 - MQA wq-precompute with vectorized batch for CUDA graph compatibility - DeepGEMM/PDL guards for SM120 (no TMEM/tcgen05) - NSA backend SM120 dispatch (tilelang default, skip DeepGEMM metadata) - FlashMLA SM120 adapter for deepseek_v4_backend - 3 CUDA-graph-breaking paths fixed (MoE .unique/.item, NSA/Compressed MQA) Results (8x RTX PRO 6000, TP=8): - Decode: 10.26 tok/s BS=1 with CUDA graph (2.4x vs without) - GSM8K 5-shot: 98.0% accuracy (200 questions) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address all reviewer feedback from PR sgl-project#24692: - Use is_sm120_supported() helper instead of raw sm_version checks - Guard SGLANG_OPT_DEEPGEMM_HC_PRENORM and SGLANG_OPT_USE_TILELANG_MHC_PRE with `not is_sm120_supported()` in deepseek_v4.py - Auto-select marlin MoE backend on SM120 in deepseek_v4_hook.py - Minor cleanups in indexer, metadata, nsa_backend, mxfp4_marlin_moe Fix FlashMLA Triton kernel garbled output on latest sglang:dev image: - Root cause: upstream changed KV cache dtype from float8_e4m3fn to uint8. The Triton kernel's as_strided() preserved the input dtype, so tl.load interpreted FP8 bit patterns as raw integers, corrupting attention scores. - Fix: explicitly view through uint8 → float8_e4m3fn before passing to Triton. Verified on sglang:dev-cu13 (sgl-kernel 0.4.2.post1, PyTorch 2.11+cu130): - GSM8K 5-shot 200q: 99.0% - Decode BS=1: 11.40 tok/s, TPOT 87.7ms Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Conflicts: # python/sglang/srt/layers/attention/dsv4/indexer.py # python/sglang/srt/layers/quantization/mxfp4_marlin_moe.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Port 23 autotune configuration JSON files from the SM120 fork branch, eliminating all "Performance might be sub-optimal" warnings on RTX PRO 6000 Blackwell. These configs provide pre-tuned tile sizes for the Triton FP8 block-scaled GEMM kernel across all DeepSeek-V4 layer dimensions (N×K = 1280×5120 through 7168×18432). Verified: EAGLE 1/1/2 decode avg 44.4→45.0 tok/s, peak 52.4 tok/s on 2× RTX PRO 6000 (TP=2, ctx=8192).

gemini-code-assist · 2026-05-18T23:22:24Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

AliceChenyy and others added 6 commits May 12, 2026 06:02

Merge remote-tracking branch 'origin/main' into sm120-dsv4-rebase

0b95556

# Conflicts: # python/sglang/srt/layers/attention/dsv4/indexer.py # python/sglang/srt/layers/quantization/mxfp4_marlin_moe.py

style: fix pre-commit lint issues (isort, ruff, black)

beac7a0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: wrap pytest.main in sys.exit for CI exit code propagation

385a8d4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AdamPlatin123 requested review from 1am9trash, AniZpZ, BBuf, DarkSharpness, Edwardf0t1, FlamingoPg, Fridge003, HaiShaw, Qiaolin-Yu, Ying1123, celve, ch-wan, hebiao064, hlu1, hubertlu-tw, ispobock, kkHuang-amd and merrymercy as code owners May 18, 2026 23:22

AdamPlatin123 requested review from HydraQYH and yuan-luo as code owners May 18, 2026 23:22

github-actions Bot added deepseek jit-kernel labels May 18, 2026

AliceChenyy mentioned this pull request May 19, 2026

feat: SM120 (Blackwell Desktop) support for DeepSeek-V4 inference #24692

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(sm120): add FP8 W8A8 Block GEMM autotune configs + EAGLE benchmark for RTX PRO 6000#25696

perf(sm120): add FP8 W8A8 Block GEMM autotune configs + EAGLE benchmark for RTX PRO 6000#25696
AdamPlatin123 wants to merge 6 commits into
sgl-project:mainfrom
AdamPlatin123:feature/sm120-eagle-optimize

AdamPlatin123 commented May 18, 2026 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

AdamPlatin123 commented May 18, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Benchmark Results

Decode Speed — EAGLE Parameter Sweep (18 tests × 6 personas)

Per-Persona Speed (EAGLE 1/1/2 + autotune)

Comparison with PR #24692 on same hardware

Startup Command

Key Findings

Model Info

Test plan

CI States

Uh oh!

gemini-code-assist Bot commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AdamPlatin123 commented May 18, 2026 •

edited by github-actions Bot

Loading