
[CI] [FlashInfer v0.6.7] Use offline quantized checkpoint for MXFP8 Gemm tests#21625

Merged
Fridge003 merged 3 commits into sgl-project:main from zianglih:v067-ci
Mar 30, 2026

Conversation

@zianglih
Contributor

@zianglih zianglih commented Mar 29, 2026

Motivation


Use offline mxfp8 checkpoint for CI stability.

MXFP8 Gemm CI is unstable after FlashInfer v0.6.7 update:

pip uninstall -y flashinfer-jit-cache flashinfer-python flashinfer-cubin
pip install flashinfer-python==0.6.7 flashinfer-cubin==0.6.7
pip install flashinfer-jit-cache==0.6.7 --index-url https://flashinfer.ai/whl/cu129

python3 -m pytest -s -q test/registered/quant/test_fp8_blockwise_gemm.py -k MXFP8
========================================== short test summary info ==========================================
FAILED test/registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmTriton::test_gsm8k - AssertionError: np.float64(0.7619408642911296) not greater than or equal to 0.8
FAILED test/registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmFlashinferTrtllm::test_gsm8k - AssertionError: np.float64(0.7619408642911296) not greater than or equal to 0.8
2 failed, 4 deselected, 5 warnings in 121.20s (0:02:01)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
# If reverting to v0.6.6:
pip uninstall -y flashinfer-jit-cache flashinfer-python flashinfer-cubin
pip install flashinfer-python==0.6.6 flashinfer-cubin==0.6.6
pip install flashinfer-jit-cache==0.6.6 --index-url https://flashinfer.ai/whl/cu129

python3 -m pytest -s -q test/registered/quant/test_fp8_blockwise_gemm.py -k MXFP8
============================================= warnings summary ==============================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmTriton::test_gsm8k
registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmFlashinferTrtllm::test_gsm8k
  /sgl-workspace/sglang/python/sglang/test/few_shot_gsm8k.py:54: DeprecationWarning: Including the scheme in --host ('http://127.0.0.1') is deprecated. Pass just the hostname (e.g. '127.0.0.1') instead.
    set_default_backend(RuntimeEndpoint(normalize_base_url(args.host, args.port)))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2 passed, 4 deselected, 5 warnings in 138.05s (0:02:18)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
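For reference, the gate these tests apply is a plain accuracy threshold: the run fails when GSM8K accuracy drops below 0.8. A minimal sketch of such a check (function name and metrics dict are hypothetical, not SGLang's actual helper):

```python
def check_gsm8k_accuracy(metrics: dict, threshold: float = 0.8) -> None:
    """Fail with the same message shape seen in the pytest summary above."""
    accuracy = metrics["accuracy"]
    assert accuracy >= threshold, (
        f"{accuracy} not greater than or equal to {threshold}"
    )


# Offline-checkpoint accuracy from the benchmarks below clears the gate:
check_gsm8k_accuracy({"accuracy": 0.840})
```

The online-quantization accuracy of ~0.762 reported in the failures above would trip this assertion, which is exactly the flake this PR removes.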

After investigating, the root cause is the instability of the online quantization code path itself, not FlashInfer v0.6.7:

# v0.6.6 online quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model Qwen/Qwen3-4B-Instruct-2507 --quantization mxfp8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.716
Invalid: 0.000
Latency: 10.441 s
Output throughput: 27218.943 token/s

# v0.6.6 offline quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model zianglih/Qwen3-4B-Instruct-2507-MXFP8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.840
Invalid: 0.000
Latency: 8.577 s
Output throughput: 25232.410 token/s

# v0.6.7 online quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model Qwen/Qwen3-4B-Instruct-2507 --quantization mxfp8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.796
Invalid: 0.000
Latency: 8.984 s
Output throughput: 27016.746 token/s

# v0.6.7 offline quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model zianglih/Qwen3-4B-Instruct-2507-MXFP8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.841
Invalid: 0.000
Latency: 8.538 s
Output throughput: 25126.517 token/s
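To make the online/offline distinction concrete: MXFP8 stores e4m3 values with one shared power-of-two (e8m0) scale per block of 32 elements, and online quantization recomputes this at load time while an offline checkpoint ships pre-quantized weights. Below is a rough NumPy sketch of MXFP8-style blockwise quantization, an illustration only and not SGLang's or FlashInfer's kernel (the mantissa rounding is a crude e4m3 stand-in):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3


def quantize_block(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block (e.g. 32 values): e4m3-like values + shared 2**k scale."""
    amax = float(np.abs(x).max())
    # Shared scale is a power of two (e8m0 exponent), chosen so the block fits.
    scale = 2.0 ** np.ceil(np.log2(amax / FP8_E4M3_MAX)) if amax > 0 else 1.0
    v = x / scale
    m, e = np.frexp(v)              # v = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0   # keep ~3 mantissa bits (e4m3-like rounding)
    q = np.clip(np.ldexp(m, e), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale


def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale
```

Each round trip through `quantize_block`/`dequantize_block` loses a little precision per block; an offline checkpoint fixes those rounded weights once, which is consistent with the steadier accuracy in the benchmarks above.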

Note: TestMXFP8GemmTriton passes after the fix in #19835 was merged, but it remains temporarily disabled until the long PCG capture time is fixed.
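The temporary disable is a plain `unittest.skip` on the test class, as the review snippet later in this thread shows. A minimal sketch (the class name comes from the test logs; the body is elided here):

```python
import unittest


@unittest.skip(
    "Temporarily disabled until the long PCG capture time is fixed"
)
class TestMXFP8GemmTriton(unittest.TestCase):
    def test_gsm8k(self):
        pass  # real test launches a server and runs the GSM8K eval
```

Skipping at the class level keeps the test registered (it shows up as `skipped` in the pytest summary) so it is easy to re-enable once capture time improves.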

Modifications

Accuracy Tests

root@B200-124:/sgl-workspace/sglang# python3 -m pytest -s -q test/registered/quant/test_fp8_blockwise_gemm.py -k TestMXFP8
...
Accuracy: 0.850
Invalid: 0.000
Latency: 14.914 s
Output throughput: 15622.660 token/s
{'accuracy': np.float64(0.8498862774829417), 'invalid': np.float64(0.0), 'latency': 14.914425703696907, 'output_throughput': 15622.65987501245}
.
================== warnings summary ==================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmFlashinferTrtllm::test_gsm8k
  /sgl-workspace/sglang/python/sglang/test/few_shot_gsm8k.py:54: DeprecationWarning: Including the scheme in --host ('http://127.0.0.1') is deprecated. Pass just the hostname (e.g. '127.0.0.1') instead.
    set_default_backend(RuntimeEndpoint(normalize_base_url(args.host, args.port)))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
1 passed, 1 skipped, 4 deselected, 4 warnings in 64.06s (0:01:04)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
root@B200-124:/sgl-workspace/sglang# 

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the test suite to use a specific MXFP8 model path and simplifies the test setup by removing the redundant quantization flag. It also temporarily disables the Triton-based MXFP8 GEMM test. A review comment identifies a likely typo in the pull request number referenced in the skip message, which should be corrected to ensure the reason for skipping the test is properly documented.



@unittest.skip(
    "Temporarily disabled until https://github.com/sgl-project/sglang/pull/19835 is merged"
)
Contributor


medium

The link to the pull request seems to be broken as PR number 19835 does not exist. This appears to be a typo. Please correct the link to ensure the reason for skipping this test is clear.

@zianglih zianglih marked this pull request as draft March 29, 2026 06:39
@zianglih zianglih marked this pull request as ready for review March 29, 2026 06:39
@Fridge003
Collaborator

@zianglih #19835 has been merged

@zianglih
Contributor Author

Hi @Fridge003 ,

TestMXFP8GemmTriton works after #19835, but compiling PCG currently takes 5-7 minutes, so I disabled it again.

Accuracy: 0.853
Invalid: 0.000
Latency: 22.824 s
Output throughput: 10023.130 token/s
{'accuracy': np.float64(0.8529188779378317), 'invalid': np.float64(0.0), 'latency': 22.823908662889153, 'output_throughput': 10023.129840681795}
[2026-03-29 09:10:37] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192, 8704, 9216, 9728, 10240, 10752, 11264, 11776, 12288, 12800, 13312, 13824, 14336, 14848, 15360, 15872, 16384]
Compiling num tokens (num_tokens=16384):   0%|                                                          | 0/74 [00:00<?, ?it/s][2026-03-29 09:10:42] Compiling a graph for dynamic shape takes 0.42 s
Compiling num tokens (num_tokens=15872):   1%|▋                                                 | 1/74 [00:05<06:57,  5.71s/it]

CC @wolfcomos

@wolfcomos
Contributor


Thanks! I'm now working on the PR to improve the cuda graph capturing time.

@Fridge003
Collaborator

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@Fridge003 Fridge003 merged commit 1a4b383 into sgl-project:main Mar 30, 2026
57 of 63 checks passed
@zianglih zianglih deleted the v067-ci branch March 30, 2026 05:49
