
[CI] [FlashInfer v0.6.7] Use offline quantized checkpoint for MXFP8 Gemm tests#21625

Merged
Fridge003 merged 3 commits into sgl-project:main from zianglih:v067-ci
Mar 30, 2026

Conversation

@zianglih
Contributor

@zianglih zianglih commented Mar 29, 2026

Motivation


Use offline mxfp8 checkpoint for CI stability.

MXFP8 Gemm CI is unstable after FlashInfer v0.6.7 update:

pip uninstall -y flashinfer-jit-cache flashinfer-python flashinfer-cubin
pip install flashinfer-python==0.6.7 flashinfer-cubin==0.6.7
pip install flashinfer-jit-cache==0.6.7 --index-url https://flashinfer.ai/whl/cu129

python3 -m pytest -s -q test/registered/quant/test_fp8_blockwise_gemm.py -k MXFP8
========================================== short test summary info ==========================================
FAILED test/registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmTriton::test_gsm8k - AssertionError: np.float64(0.7619408642911296) not greater than or equal to 0.8
FAILED test/registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmFlashinferTrtllm::test_gsm8k - AssertionError: np.float64(0.7619408642911296) not greater than or equal to 0.8
2 failed, 4 deselected, 5 warnings in 121.20s (0:02:01)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
# If reverting to v0.6.6:
pip uninstall -y flashinfer-jit-cache flashinfer-python flashinfer-cubin
pip install flashinfer-python==0.6.6 flashinfer-cubin==0.6.6
pip install flashinfer-jit-cache==0.6.6 --index-url https://flashinfer.ai/whl/cu129

python3 -m pytest -s -q test/registered/quant/test_fp8_blockwise_gemm.py -k MXFP8
============================================= warnings summary ==============================================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmTriton::test_gsm8k
registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmFlashinferTrtllm::test_gsm8k
  /sgl-workspace/sglang/python/sglang/test/few_shot_gsm8k.py:54: DeprecationWarning: Including the scheme in --host ('http://127.0.0.1') is deprecated. Pass just the hostname (e.g. '127.0.0.1') instead.
    set_default_backend(RuntimeEndpoint(normalize_base_url(args.host, args.port)))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2 passed, 4 deselected, 5 warnings in 138.05s (0:02:18)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
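For reference, the gate these tests apply is a plain accuracy threshold: the run fails when GSM8K accuracy drops below 0.8. A minimal sketch of such a check (function name and metrics dict are hypothetical, not SGLang's actual helper):

```python
def check_gsm8k_accuracy(metrics: dict, threshold: float = 0.8) -> None:
    """Fail with the same message shape seen in the pytest summary above."""
    accuracy = metrics["accuracy"]
    assert accuracy >= threshold, (
        f"{accuracy} not greater than or equal to {threshold}"
    )


# Offline-checkpoint accuracy from the benchmarks below clears the gate:
check_gsm8k_accuracy({"accuracy": 0.840})
```

The online-quantization accuracy of ~0.762 reported in the failures above would trip this assertion, which is exactly the flake this PR removes.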

After investigating, the root cause is the instability of the online quantization code path itself, not FlashInfer v0.6.7:

# v0.6.6 online quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model Qwen/Qwen3-4B-Instruct-2507 --quantization mxfp8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.716
Invalid: 0.000
Latency: 10.441 s
Output throughput: 27218.943 token/s

# v0.6.6 offline quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model zianglih/Qwen3-4B-Instruct-2507-MXFP8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.840
Invalid: 0.000
Latency: 8.577 s
Output throughput: 25232.410 token/s

# v0.6.7 online quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model Qwen/Qwen3-4B-Instruct-2507 --quantization mxfp8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.796
Invalid: 0.000
Latency: 8.984 s
Output throughput: 27016.746 token/s

# v0.6.7 offline quantization
python3 -m sglang.launch_server --kv-cache-dtype bf16 --model zianglih/Qwen3-4B-Instruct-2507-MXFP8 --fp8-gemm-backend flashinfer_trtllm
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.841
Invalid: 0.000
Latency: 8.538 s
Output throughput: 25126.517 token/s
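To make the online/offline distinction concrete: MXFP8 stores e4m3 values with one shared power-of-two (e8m0) scale per block of 32 elements, and online quantization recomputes this at load time while an offline checkpoint ships pre-quantized weights. Below is a rough NumPy sketch of MXFP8-style blockwise quantization, an illustration only and not SGLang's or FlashInfer's kernel (the mantissa rounding is a crude e4m3 stand-in):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3


def quantize_block(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize one block (e.g. 32 values): e4m3-like values + shared 2**k scale."""
    amax = float(np.abs(x).max())
    # Shared scale is a power of two (e8m0 exponent), chosen so the block fits.
    scale = 2.0 ** np.ceil(np.log2(amax / FP8_E4M3_MAX)) if amax > 0 else 1.0
    v = x / scale
    m, e = np.frexp(v)              # v = m * 2**e with m in [0.5, 1)
    m = np.round(m * 16.0) / 16.0   # keep ~3 mantissa bits (e4m3-like rounding)
    q = np.clip(np.ldexp(m, e), -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale


def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale
```

Each round trip through `quantize_block`/`dequantize_block` loses a little precision per block; an offline checkpoint fixes those rounded weights once, which is consistent with the steadier accuracy in the benchmarks above.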

Note: TestMXFP8GemmTriton passes after the fix in #19835 was merged, but it remains temporarily disabled until the long PCG capture time is fixed.
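The temporary disable is a plain `unittest.skip` on the test class, as the review snippet later in this thread shows. A minimal sketch (the class name comes from the test logs; the body is elided here):

```python
import unittest


@unittest.skip(
    "Temporarily disabled until the long PCG capture time is fixed"
)
class TestMXFP8GemmTriton(unittest.TestCase):
    def test_gsm8k(self):
        pass  # real test launches a server and runs the GSM8K eval
```

Skipping at the class level keeps the test registered (it shows up as `skipped` in the pytest summary) so it is easy to re-enable once capture time improves.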

Modifications

Accuracy Tests

root@B200-124:/sgl-workspace/sglang# python3 -m pytest -s -q test/registered/quant/test_fp8_blockwise_gemm.py -k TestMXFP8
...
Accuracy: 0.850
Invalid: 0.000
Latency: 14.914 s
Output throughput: 15622.660 token/s
{'accuracy': np.float64(0.8498862774829417), 'invalid': np.float64(0.0), 'latency': 14.914425703696907, 'output_throughput': 15622.65987501245}
.
================== warnings summary ==================
../../usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428
  /usr/local/lib/python3.12/dist-packages/_pytest/config/__init__.py:1428: PytestConfigWarning: Unknown config option: asyncio_mode
  
    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

registered/quant/test_fp8_blockwise_gemm.py::TestMXFP8GemmFlashinferTrtllm::test_gsm8k
  /sgl-workspace/sglang/python/sglang/test/few_shot_gsm8k.py:54: DeprecationWarning: Including the scheme in --host ('http://127.0.0.1') is deprecated. Pass just the hostname (e.g. '127.0.0.1') instead.
    set_default_backend(RuntimeEndpoint(normalize_base_url(args.host, args.port)))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
1 passed, 1 skipped, 4 deselected, 4 warnings in 64.06s (0:01:04)
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
root@B200-124:/sgl-workspace/sglang# 

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the test suite to use a specific MXFP8 model path and simplifies the test setup by removing the redundant quantization flag. It also temporarily disables the Triton-based MXFP8 GEMM test. A review comment identifies a likely typo in the pull request number referenced in the skip message, which should be corrected to ensure the reason for skipping the test is properly documented.



@unittest.skip(
    "Temporarily disabled until https://github.com/sgl-project/sglang/pull/19835 is merged"
)
Contributor


medium

The link to the pull request seems to be broken as PR number 19835 does not exist. This appears to be a typo. Please correct the link to ensure the reason for skipping this test is clear.

@zianglih zianglih marked this pull request as draft March 29, 2026 06:39
@zianglih zianglih marked this pull request as ready for review March 29, 2026 06:39
@Fridge003
Collaborator

@zianglih #19835 has been merged

@zianglih
Contributor Author

Hi @Fridge003 ,

TestMXFP8GemmTriton works after #19835, but compiling PCG currently takes 5-7 minutes, so I disabled it again.

Accuracy: 0.853
Invalid: 0.000
Latency: 22.824 s
Output throughput: 10023.130 token/s
{'accuracy': np.float64(0.8529188779378317), 'invalid': np.float64(0.0), 'latency': 22.823908662889153, 'output_throughput': 10023.129840681795}
[2026-03-29 09:10:37] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192, 8704, 9216, 9728, 10240, 10752, 11264, 11776, 12288, 12800, 13312, 13824, 14336, 14848, 15360, 15872, 16384]
Compiling num tokens (num_tokens=16384):   0%|                                                          | 0/74 [00:00<?, ?it/s][2026-03-29 09:10:42] Compiling a graph for dynamic shape takes 0.42 s
Compiling num tokens (num_tokens=15872):   1%|▋                                                 | 1/74 [00:05<06:57,  5.71s/it]

CC @wolfcomos

@wolfcomos
Contributor


Thanks! I'm now working on the PR to improve the cuda graph capturing time.

@Fridge003
Collaborator

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@Fridge003 Fridge003 merged commit 1a4b383 into sgl-project:main Mar 30, 2026
57 of 63 checks passed
@zianglih zianglih deleted the v067-ci branch March 30, 2026 05:49
