Use Cute-DSL NVFP4 quantization kernels #23745

Merged
Fridge003 merged 1 commit into main from brayden/fp4-quant-cute-dsl on May 11, 2026

Conversation

@b8zhong (Collaborator) commented on Apr 26, 2026

Motivation


After the performance optimizations in flashinfer-ai/flashinfer#2904, this beats the original SGLang CUDA-based kernel in all scenarios. Note that this is with backend=cute-dsl; the TRT-LLM quantize_with_block_size path is still much slower, and we don't use that option.

Modifications

Add the cute-dsl backend on SM100 and set it as the default.
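
A minimal sketch of the selection logic (assumptions: the wrapper shape and the sglang import path; that flashinfer.fp4_quantize accepts a backend keyword is taken from this PR, and the backend expression matches the line quoted in the sed one-liner further down this thread):

import flashinfer
from sglang.srt.utils import is_sm100_supported  # import path assumed

def fp4_quantize(x, global_scale, **kwargs):
    # Prefer the Cute-DSL kernel on SM100 (Blackwell); keep the CUDA
    # backend everywhere else.
    backend = "cute-dsl" if is_sm100_supported() else "cuda"
    return flashinfer.fp4_quantize(x, global_scale, backend=backend, **kwargs)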

Accuracy Tests

python3 -m sglang.launch_server \
  --model-path nvidia/DeepSeek-V3.2-NVFP4 \
  --tp 4 \
  --ep 4 \
  --quantization modelopt_fp4 \
  --tool-call-parser deepseekv32 \
  --reasoning-parser deepseek-v3 \
  --port 30020 \
  --cuda-graph-bs 1 2 4 8 16 32 64 128 160 192 224 256 288 320 352 384 416 448 480 512 \
  --max-running-requests 512
python3 -m sglang.test.run_eval --port 30020 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3
[Screenshot: GPQA eval results]
➜  sglang git:(brayden/fp4-quant-cute-dsl) ✗ CUDA_VISIBLE_DEVICES=7 pytest test/registered/quant/test_nvfp4_gemm.py
================================================================================ test session starts ================================================================================
platform linux -- Python 3.12.3, pytest-9.0.3, pluggy-1.6.0
rootdir: /sgl-workspace/sglang/test
configfile: pytest.ini
plugins: anyio-4.13.0, typeguard-4.5.1
collected 5 items

test/registered/quant/test_nvfp4_gemm.py .....                                                                                                                                [100%]

================================================================================= warnings summary ==================================================================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1434
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1434: PytestConfigWarning: Unknown config option: asyncio_mode

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================== 5 passed, 3 warnings in 324.12s (0:05:24) =====================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

Speed Tests and Profiling

BS = 128 decode

Before: [profile screenshot]

After (note that PDL, programmatic dependent launch, now also kicks in correctly): [profile screenshot]
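
For a rough reproduction of this comparison, a micro-benchmark sketch (not the harness used for the profiles above; the input shape, scale value, and timing loop are assumptions, and backend is the keyword this PR toggles):

import torch
import flashinfer

x = torch.randn(128, 7168, dtype=torch.bfloat16, device="cuda")  # BS=128 decode-like shape (assumed)
global_scale = torch.tensor([448.0], device="cuda")  # scalar global scale (value assumed)

def bench(backend, iters=100):
    # Time repeated quantize calls with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        flashinfer.fp4_quantize(x, global_scale, backend=backend)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

for backend in ("cuda", "cute-dsl"):
    print(backend, f"{bench(backend):.4f} ms")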

github-actions bot added the quant (LLM Quantization) and blackwell (SM100/SM120) labels on Apr 26, 2026
@b8zhong added the run-ci label on Apr 26, 2026
@b8zhong force-pushed the brayden/fp4-quant-cute-dsl branch from 91c67d6 to 1ff9ee2 on Apr 26, 2026
@nvpohanh (Collaborator) commented:

Note that this is with backend=cute-dsl; the TRT-LLM quantize_with_block_size path is still much slower, and we don't use that option.

@b8zhong Could you file a FlashInfer GitHub issue for this? I think we should also improve the perf of FlashInfer's NVFP4 quant kernel. Thanks! cc @bkryu

@@ -72,13 +76,71 @@

fp4_quantize = None
Collaborator:
A better place to put fp4_quantize would be python/sglang/srt/layers/quantization/fp4_utils.py

Collaborator (Author):
Good point. Just moved

[Two more review threads on python/sglang/srt/layers/quantization/modelopt_quant.py; one marked outdated]
@b8zhong force-pushed the brayden/fp4-quant-cute-dsl branch from 5f246e1 to dce712c on May 9, 2026
@Fridge003 merged commit 1d80a1a into main on May 11, 2026 (165 of 182 checks passed)
@Fridge003 deleted the brayden/fp4-quant-cute-dsl branch on May 11, 2026
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request on May 12, 2026 (co-authored by b8zhong)
xjpang pushed a commit to xjpang/sglang that referenced this pull request on May 13, 2026 (co-authored by b8zhong)
Jiminator added a commit to Jiminator/sglang that referenced this pull request May 15, 2026
Earlier draft (b0591ab) blamed the flashinfer dependency bump d5f3254
(0.6.8.post1 -> 0.6.11). That was wrong, and PR sgl-project#25310 reverting the
bump did NOT fix the Mistral NVFP4 test - confirmed by running the
test locally at the post-revert SHA 0fde615 and getting the identical
"RuntimeError: shape '[1]' is invalid for input of size 128" traceback
under flashinfer 0.6.8.post1.

Corrected root cause: 1d80a1a (PR sgl-project#23745, "Use Cute-DSL NVFP4
quantization kernels", merged 2026-05-10) wraps flashinfer.fp4_quantize
in fp4_utils.py and unconditionally selects backend="cute-dsl" on
SM100/B200. The cute-dsl kernel does
`global_scale.float().reshape(1)`, which crashes when the MoE
apply_weights call site passes the per-expert
`layer.w13_input_scale_quant` (shape [num_experts] = 128). The same
kernel exists in both flashinfer 0.6.8.post1 and 0.6.11+, so the
flashinfer version is irrelevant to this failure path.

Evidence:
- CI metrics show Mistral NVFP4 was last green in run 607 (2026-05-11,
  SHA aa7a9af, flashinfer 0.6.8.post1) and first red in run 608
  (2026-05-12, SHA 74d70af, still flashinfer 0.6.8.post1). 1d80a1a
  lands in that window.
- Run 613 (2026-05-15, post-revert SHA 0fde615, flashinfer 0.6.8.post1)
  still fails on partition 3 - the Mistral NVFP4 row is absent from the
  per-partition metrics artifact for runs 608/609/610/613.
- Local experiment matrix:
  A) 13afe8a (pre-1d80a1a) + flashinfer 0.6.8.post1: PASS gsm8k 0.951
  B) 34c0029 (CI failure SHA) + cute-dsl backend + flashinfer 0.6.11.post1: FAIL reshape '[1]'
  C) 34c0029 + cuda backend patch + flashinfer 0.6.11.post1: FAIL globalScale strict check
  D) 0fde615 (post-revert) + cute-dsl backend + flashinfer 0.6.8.post1: FAIL reshape '[1]'
  E) 0fde615 + cuda backend one-line patch + flashinfer 0.6.8.post1: PASS gsm8k 0.949

One-line fix on post-revert main:
  sed -i 's|"cute-dsl" if is_sm100_supported() else "cuda"|"cuda"|' \
    python/sglang/srt/layers/quantization/fp4_utils.py

Verified passes gsm8k 0.949 in 23m 43s. The proper long-term fix is to
collapse layer.w13_input_scale_quant to scalar or per-token before
calling fp4_quantize in compressed_tensors_w4a4_nvfp4_moe.py:315.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
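
The crash described above reproduces in plain torch (a minimal sketch; the tensor name follows the commit message, and the scalar collapse at the end is only a shape fix, not necessarily the numerically correct reduction):

import torch

# Per-expert input scale as passed from the MoE apply_weights call
# site: shape [num_experts] = 128, per the analysis above.
w13_input_scale_quant = torch.ones(128)

# What the cute-dsl kernel effectively does with its global scale:
try:
    w13_input_scale_quant.float().reshape(1)
except RuntimeError as e:
    print(e)  # shape '[1]' is invalid for input of size 128

# A 0-dim scalar reshapes cleanly to [1]; whether to reduce with
# min/max or index per expert is the open question flagged above.
scalar_scale = w13_input_scale_quant.min()
print(scalar_scale.float().reshape(1).shape)  # torch.Size([1])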