Use Cute-DSL NVFP4 quantization kernels #23745

Merged
Fridge003 merged 1 commit into main from brayden/fp4-quant-cute-dsl on May 11, 2026

Conversation

@b8zhong (Collaborator) commented on Apr 26, 2026

Motivation


After the performance optimizations in flashinfer-ai/flashinfer#2904, this beats the original SGLang CUDA-based kernel in all scenarios. Note that this is with backend=cute-dsl; the TRT-LLM quantize_with_block_size path is still much slower, and we don't use that option.

Modifications

Add the cute-dsl backend on SM100 and set it as the default.
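
A minimal sketch of the selection logic (assumptions: the wrapper shape and the sglang import path; that flashinfer.fp4_quantize accepts a backend keyword is taken from this PR, and the backend expression matches the line quoted in the sed one-liner further down this thread):

import flashinfer
from sglang.srt.utils import is_sm100_supported  # import path assumed

def fp4_quantize(x, global_scale, **kwargs):
    # Prefer the Cute-DSL kernel on SM100 (Blackwell); keep the CUDA
    # backend everywhere else.
    backend = "cute-dsl" if is_sm100_supported() else "cuda"
    return flashinfer.fp4_quantize(x, global_scale, backend=backend, **kwargs)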

Accuracy Tests

python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-V3.2-NVFP4 \
  --tp 4 \
  --ep 4 \
  --quantization modelopt_fp4 \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3 \
  --port 30020 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 160 192 224 256 288 320 352 384 416 448 480 512 \
  --max-running-requests 512
python3 -m sglang.test.run_eval --port 30020 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3
[Screenshot: GPQA eval results]
➜  sglang git:(brayden/fp4-quant-cute-dsl) ✗ CUDA_VISIBLE_DEVICES=7 pytest test/registered/quant/test_nvfp4_gemm.py
================================================================================ test session starts ================================================================================
platform linux -- Python 3.12.3, pytest-9.0.3, pluggy-1.6.0
rootdir: /sgl-workspace/sglang/test
configfile: pytest.ini
plugins: anyio-4.13.0, typeguard-4.5.1
collected 5 items

test/registered/quant/test_nvfp4_gemm.py .....                                                                                                                                [100%]

================================================================================= warnings summary ==================================================================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1434
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1434: PytestConfigWarning: Unknown config option: asyncio_mode

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================== 5 passed, 3 warnings in 324.12s (0:05:24) =====================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Speed Tests and Profiling

BS = 128 decode

Before: [profile screenshot]

After (note that PDL, programmatic dependent launch, now also kicks in correctly): [profile screenshot]
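
For a rough reproduction of this comparison, a micro-benchmark sketch (not the harness used for the profiles above; the input shape, scale value, and timing loop are assumptions, and backend is the keyword this PR toggles):

import torch
import flashinfer

x = torch.randn(128, 7168, dtype=torch.bfloat16, device="cuda")  # BS=128 decode-like shape (assumed)
global_scale = torch.tensor([448.0], device="cuda")  # scalar global scale (value assumed)

def bench(backend, iters=100):
    # Time repeated quantize calls with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        flashinfer.fp4_quantize(x, global_scale, backend=backend)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

for backend in ("cuda", "cute-dsl"):
    print(backend, f"{bench(backend):.4f} ms")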

github-actions bot added the quant (LLM Quantization) and blackwell (SM100/SM120) labels on Apr 26, 2026
@b8zhong added the run-ci label on Apr 26, 2026
@b8zhong force-pushed the brayden/fp4-quant-cute-dsl branch from 91c67d6 to 1ff9ee2 on Apr 26, 2026
@nvpohanh (Collaborator) commented:

Note that this is with backend=cute-dsl; the TRT-LLM quantize_with_block_size path is still much slower, and we don't use that option.

@b8zhong Could you file a FlashInfer GitHub issue for this? I think we should also improve the perf of FlashInfer's NVFP4 quant kernel. Thanks! cc @bkryu

@@ -72,13 +76,71 @@

fp4_quantize = None
Collaborator:
A better place to put fp4_quantize would be python/sglang/srt/layers/quantization/fp4_utils.py

Collaborator (Author):
Good point. Just moved

[Two more review threads on python/sglang/srt/layers/quantization/modelopt_quant.py; one marked outdated]
@b8zhong force-pushed the brayden/fp4-quant-cute-dsl branch from 5f246e1 to dce712c on May 9, 2026
@Fridge003 merged commit 1d80a1a into main on May 11, 2026 (165 of 182 checks passed)
@Fridge003 deleted the brayden/fp4-quant-cute-dsl branch on May 11, 2026
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request on May 12, 2026 (co-authored by b8zhong)
xjpang pushed a commit to xjpang/sglang that referenced this pull request on May 13, 2026 (co-authored by b8zhong)
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
Earlier draft (b0591ab) blamed the flashinfer dependency bump d5f3254
(0.6.8.post1 -> 0.6.11). That was wrong, and PR sgl-project#25310 reverting the
bump did NOT fix the Mistral NVFP4 test - confirmed by running the
test locally at the post-revert SHA 0fde615 and getting the identical
"RuntimeError: shape '[1]' is invalid for input of size 128" traceback
under flashinfer 0.6.8.post1.

Corrected root cause: 1d80a1a (PR sgl-project#23745, "Use Cute-DSL NVFP4
quantization kernels", merged 2026-05-10) wraps flashinfer.fp4_quantize
in fp4_utils.py and unconditionally selects backend="cute-dsl" on
SM100/B200. The cute-dsl kernel does
`global_scale.float().reshape(1)`, which crashes when the MoE
apply_weights call site passes the per-expert
`layer.w13_input_scale_quant` (shape [num_experts] = 128). The same
kernel exists in both flashinfer 0.6.8.post1 and 0.6.11+, so the
flashinfer version is irrelevant to this failure path.

Evidence:
- CI metrics show Mistral NVFP4 was last green in run 607 (2026-05-11,
  SHA aa7a9af, flashinfer 0.6.8.post1) and first red in run 608
  (2026-05-12, SHA 74d70af, still flashinfer 0.6.8.post1). 1d80a1a
  lands in that window.
- Run 613 (2026-05-15, post-revert SHA 0fde615, flashinfer 0.6.8.post1)
  still fails on partition 3 - the Mistral NVFP4 row is absent from the
  per-partition metrics artifact for runs 608/609/610/613.
- Local experiment matrix:
  A) 13afe8a (pre-1d80a1a) + flashinfer 0.6.8.post1: PASS gsm8k 0.951
  B) 34c0029 (CI failure SHA) + cute-dsl backend + flashinfer 0.6.11.post1: FAIL reshape '[1]'
  C) 34c0029 + cuda backend patch + flashinfer 0.6.11.post1: FAIL globalScale strict check
  D) 0fde615 (post-revert) + cute-dsl backend + flashinfer 0.6.8.post1: FAIL reshape '[1]'
  E) 0fde615 + cuda backend one-line patch + flashinfer 0.6.8.post1: PASS gsm8k 0.949

One-line fix on post-revert main:
  sed -i 's|"cute-dsl" if is_sm100_supported() else "cuda"|"cuda"|' \
    python/sglang/srt/layers/quantization/fp4_utils.py

Verified passes gsm8k 0.949 in 23m 43s. The proper long-term fix is to
collapse layer.w13_input_scale_quant to scalar or per-token before
calling fp4_quantize in compressed_tensors_w4a4_nvfp4_moe.py:315.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
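
The crash described above reproduces in plain torch (a minimal sketch; the tensor name follows the commit message, and the scalar collapse at the end is only a shape fix, not necessarily the numerically correct reduction):

import torch

# Per-expert input scale as passed from the MoE apply_weights call
# site: shape [num_experts] = 128, per the analysis above.
w13_input_scale_quant = torch.ones(128)

# What the cute-dsl kernel effectively does with its global scale:
try:
    w13_input_scale_quant.float().reshape(1)
except RuntimeError as e:
    print(e)  # shape '[1]' is invalid for input of size 128

# A 0-dim scalar reshapes cleanly to [1]; whether to reduce with
# min/max or index per expert is the open question flagged above.
scalar_scale = w13_input_scale_quant.min()
print(scalar_scale.float().reshape(1).shape)  # torch.Size([1])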