Skip to content

perf(sm120): add FP8 W8A8 Block GEMM autotune configs + EAGLE benchmark for RTX PRO 6000#25696

Open
AdamPlatin123 wants to merge 6 commits into
sgl-project:mainfrom
AdamPlatin123:feature/sm120-eagle-optimize
Open

perf(sm120): add FP8 W8A8 Block GEMM autotune configs + EAGLE benchmark for RTX PRO 6000#25696
AdamPlatin123 wants to merge 6 commits into
sgl-project:mainfrom
AdamPlatin123:feature/sm120-eagle-optimize

Conversation

@AdamPlatin123
Copy link
Copy Markdown

@AdamPlatin123 AdamPlatin123 commented May 18, 2026

Summary

Adds SM120 (RTX PRO 6000 Blackwell, CC 12.0) optimization layer on top of PR #24692 (AliceChenyy SM120 support). This PR contributes 23 pre-tuned FP8 W8A8 Block GEMM autotune configuration files for the RTX PRO 6000, eliminating all "Performance might be sub-optimal" warnings and improving decode throughput.

Depends on: #24692 (must be merged first)

Changes

New files (23 autotune configs):

  • 23 × configs/N=*,K=*,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Workstation_Edition,dtype=fp8_w8a8,block_shape=[128, 128].json
  • Covers all DeepSeek-V4-Flash FP8 GEMM dimensions: N×K from 512×7168 to 32768×512

Benchmark Results

Hardware: 2× RTX PRO 6000 Blackwell (96 GB each), TP=2
Model: DeepSeek-V4-Flash (FP8, 158B params, unchanged open-source weights)
Container: CUDA 13, PyTorch 2.11+cu130, sglang-kernel 0.4.2.post1

Decode Speed — EAGLE Parameter Sweep (18 tests × 6 personas)

Configuration Avg tok/s Median tok/s Range Success
No EAGLE (baseline) 33.6 34.4 24.7 – 34.7 17/18
EAGLE 3/1/4 (official) 36.7 36.5 34.6 – 39.5 18/18
EAGLE 2/1/3 42.3 42.5 34.0 – 53.0 17/18
EAGLE 1/1/2 (no autotune) 44.4 44.8 31.2 – 50.6 17/18
EAGLE 1/1/2 + FP8 autotune 45.0 46.0 26.8 – 52.4 17/18

Per-Persona Speed (EAGLE 1/1/2 + autotune)

Persona No EAGLE + EAGLE + autotune Speedup
Casual chat 33.4 39.0 +17%
Programmer 34.3 49.2 +43%
Writer 31.1 38.9 +25%
Researcher 34.4 46.3 +35%
Enterprise 34.1 49.1 +44%
Multi-turn 34.5 47.8 +39%

Comparison with PR #24692 on same hardware

Setup Decode Notes
PR #24692, 8× RTX PRO 6000, TP=8 11.4 tok/s Author's benchmark
PR #24692, 2× RTX PRO 6000, TP=2 ~2.3 tok/s Author's comment, no EAGLE
This PR, 2× RTX PRO 6000, TP=2 45.0 avg / 52.4 peak EAGLE 1/1/2 + autotune

Startup Command

python3 -m sglang.launch_server \
    --model-path /models/DeepSeek-V4-Flash \
    --tp 2 --trust-remote-code \
    --host 0.0.0.0 --port 30000 \
    --context-length 8192 \
    --mem-fraction-static 0.95 \
    --fp8-gemm-backend triton \
    --disable-custom-all-reduce \
    --cuda-graph-bs 1 2 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 2

Key Findings

  • EAGLE 1/1/2 is optimal for SM120 TP=2 — fewer draft tokens yield higher acceptance rate, lower overhead than 3/1/4 config
  • FP8 autotune configs eliminate all "sub-optimal" warnings — pre-tuned tile sizes for every DSV4 layer dimension
  • Context length 8192 is the hard maximum for 2× RTX PRO 6000 with this model (KV pool + activations ~185 GB)
  • Model weights are untouched — all optimizations are in SGLang runtime (Triton kernels, EAGLE spec dec, autotune configs)

Model Info

Field Value
Model DeepSeek-V4-Flash (open-source, unmodified)
Parameters ~158B
Quantization FP8 (e4m3, dynamic, block_size [128, 128])
Layers 43
Experts 256 routed + 1 shared, 6 active per token
MTP layers 1 (used by EAGLE speculative decoding)

Test plan

  • 18-test multi-persona benchmark (6 personas × 3 prompts each)
  • EAGLE parameter sweep (1/1/2, 2/1/3, 3/1/4)
  • FP8 autotune A/B test (with vs without configs)
  • Context capacity test (4096, 8192, 8704+ OOM boundary)
  • Long prompt test (7000+ input tokens with expandable_segments)
  • Output quality verification (all 18 tests produce correct, coherent output)
  • CI (no SM120 runner available — tested on local 2× RTX PRO 6000)

CI States

Latest PR Test (Base): ❌ Missing run-ci label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.

AliceChenyy and others added 6 commits May 12, 2026 06:02
Adds full SM120 (RTX PRO 6000 / RTX 5090 / DGX Spark) support for
DeepSeek-V4 on SGLang, rebased onto main branch.

Key changes:
- Triton MXFP4 MoE kernel for SM120 (no MARLIN/tcgen05 on desktop Blackwell)
- Triton FlashMLA sparse decode kernel for SM120
- MQA wq-precompute with vectorized batch for CUDA graph compatibility
- DeepGEMM/PDL guards for SM120 (no TMEM/tcgen05)
- NSA backend SM120 dispatch (tilelang default, skip DeepGEMM metadata)
- FlashMLA SM120 adapter for deepseek_v4_backend
- 3 CUDA-graph-breaking paths fixed (MoE .unique/.item, NSA/Compressed MQA)

Results (8x RTX PRO 6000, TP=8):
- Decode: 10.26 tok/s BS=1 with CUDA graph (2.4x vs without)
- GSM8K 5-shot: 98.0% accuracy (200 questions)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address all reviewer feedback from PR sgl-project#24692:
- Use is_sm120_supported() helper instead of raw sm_version checks
- Guard SGLANG_OPT_DEEPGEMM_HC_PRENORM and SGLANG_OPT_USE_TILELANG_MHC_PRE
  with `not is_sm120_supported()` in deepseek_v4.py
- Auto-select marlin MoE backend on SM120 in deepseek_v4_hook.py
- Minor cleanups in indexer, metadata, nsa_backend, mxfp4_marlin_moe

Fix FlashMLA Triton kernel garbled output on latest sglang:dev image:
- Root cause: upstream changed KV cache dtype from float8_e4m3fn to uint8.
  The Triton kernel's as_strided() preserved the input dtype, so tl.load
  interpreted FP8 bit patterns as raw integers, corrupting attention scores.
- Fix: explicitly view through uint8 → float8_e4m3fn before passing to Triton.

Verified on sglang:dev-cu13 (sgl-kernel 0.4.2.post1, PyTorch 2.11+cu130):
- GSM8K 5-shot 200q: 99.0%
- Decode BS=1: 11.40 tok/s, TPOT 87.7ms

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Conflicts:
#	python/sglang/srt/layers/attention/dsv4/indexer.py
#	python/sglang/srt/layers/quantization/mxfp4_marlin_moe.py
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port 23 autotune configuration JSON files from the SM120 fork branch,
eliminating all "Performance might be sub-optimal" warnings on RTX PRO
6000 Blackwell. These configs provide pre-tuned tile sizes for the
Triton FP8 block-scaled GEMM kernel across all DeepSeek-V4 layer
dimensions (N×K = 1280×5120 through 7168×18432).

Verified: EAGLE 1/1/2 decode avg 44.4→45.0 tok/s, peak 52.4 tok/s
on 2× RTX PRO 6000 (TP=2, ctx=8192).
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants