
[WIP]enable mxfp8 on nvidia sm120 #19112

Merged
Kangyan-Zhou merged 7 commits into sgl-project:main from wolfcomos:mxfp8_sm120 on Mar 2, 2026


@wolfcomos
Contributor

@wolfcomos wolfcomos commented Feb 21, 2026

Motivation

Following up on #17449: according to the discussion in NVIDIA/TransformerEngine#2668, sm120 does support the mxfp8 forward pass.

Modifications

Enabled the mxfp8_block_scaled_matmul_triton kernel on sm120: added a selection flag for it and set the default stage count on sm120 machines to 1 because of the shared-memory constraint. Updated the MoE assertions to include sm120 support.
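The selection logic described above amounts to a small dispatch on the GPU's compute capability. The sketch below is hypothetical: `use_mxfp8_triton_kernel`, `default_num_stages`, and the set of supported major versions are illustrative names and assumptions, not sglang's actual code, and it assumes the "stage size" corresponds to Triton's `num_stages` launch parameter. It only shows the shape of the change: let sm120 through the gate and force a single stage there.

```python
from typing import Tuple

# Hypothetical sketch: names and values are illustrative, not sglang's API.
# Assumption: the MXFP8 Triton matmul is gated on the compute-capability major
# version, with 10 (sm100) from earlier work and 12 (sm120) added by this PR.
SUPPORTED_MXFP8_MAJORS = {10, 12}

# Assumption: the PR's "stage size" maps to Triton's num_stages parameter.
SM120_NUM_STAGES = 1


def use_mxfp8_triton_kernel(capability: Tuple[int, int]) -> bool:
    """Return True if the MXFP8 block-scaled Triton kernel should be selected.

    On a real machine, `capability` would come from
    torch.cuda.get_device_capability(), e.g. (12, 0) on sm120.
    """
    major, _minor = capability
    return major in SUPPORTED_MXFP8_MAJORS


def default_num_stages(capability: Tuple[int, int], fallback: int) -> int:
    """Pick the default pipeline stage count for the kernel launch.

    Per the PR, shared memory is the limiting factor on sm120: each extra
    stage buffers another set of A/B tiles, so use a single stage there and
    keep the existing default everywhere else.
    """
    major, _minor = capability
    if major == 12:
        return SM120_NUM_STAGES
    return fallback
```

Keying the override on the architecture, rather than lowering the stage count globally, leaves the existing configuration on other GPUs untouched.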

Accuracy Tests

Testing machine: RTX 5070 Ti

Benchmark command (same for both runs):
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1000 --parallel 10 --platinum

Baseline bf16:
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-0.5B-Instruct --port 30000
Accuracy: 0.359
Invalid: 0.003
Latency: 57.425 s
Output throughput: 2188.128 token/s

bf16 with online mxfp8 quantization:
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-0.5B-Instruct --quantization mxfp8 --port 30000
Accuracy: 0.348
Invalid: 0.003
Latency: 307.610 s
Output throughput: 414.304 token/s
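For context on what online mxfp8 quantization does: in the OCP Microscaling (MX) scheme, MXFP8 groups 32 consecutive values into a block that shares one power-of-two (E8M0) scale, and each element is stored as FP8. The pure-Python sketch below illustrates quantizing a row block by block using the shared-exponent rule from the OCP MX v1.0 spec. It is a simplified reference, not sglang's kernel; it assumes E4M3 elements with round-to-nearest-even and saturation at 448, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 32           # MX block size: 32 elements share one scale
E4M3_MAX = 448.0          # largest finite FP8 E4M3 value
E4M3_EMAX = 8             # exponent of E4M3_MAX (448 = 1.75 * 2**8)
E4M3_MIN_NORMAL_EXP = -6  # smallest normal exponent of E4M3


def floor_log2(x: float) -> int:
    """Exact floor(log2(x)) for x > 0, using frexp to avoid float log error."""
    _mantissa, exp = math.frexp(x)  # x = m * 2**exp with 0.5 <= m < 1
    return exp - 1


def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3 value (ties to even), saturating at +-448."""
    if v == 0.0:
        return 0.0
    mag = abs(v)
    # Spacing between neighbours: 3 mantissa bits in the normal range,
    # and a fixed 2**-9 in the subnormal range.
    exp = max(floor_log2(mag), E4M3_MIN_NORMAL_EXP)
    step = 2.0 ** (exp - 3)
    q = round(mag / step) * step  # Python's round() is round-half-to-even
    return math.copysign(min(q, E4M3_MAX), v)


def quantize_mxfp8_block(block: List[float]) -> Tuple[float, List[float]]:
    """Quantize one block to (shared power-of-two scale, E4M3 element values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # OCP MX rule: shared exponent = floor(log2(amax)) - emax of the element type.
    scale = 2.0 ** (floor_log2(amax) - E4M3_EMAX)
    return scale, [round_to_e4m3(v / scale) for v in block]


def quantize_mxfp8(row: List[float]) -> List[Tuple[float, List[float]]]:
    """Split a row (length divisible by 32) into blocks and quantize each."""
    assert len(row) % BLOCK_SIZE == 0, "row length must be a multiple of 32"
    return [quantize_mxfp8_block(row[i:i + BLOCK_SIZE])
            for i in range(0, len(row), BLOCK_SIZE)]


def dequantize_mxfp8(blocks: List[Tuple[float, List[float]]]) -> List[float]:
    """Multiply each element back by its block's shared scale."""
    return [scale * q for scale, qs in blocks for q in qs]
```

Because the shared scale is a power of two, dividing by it is exact; the rounding error comes from the 3-bit E4M3 mantissa. Note that under this rule a block maximum whose scaled value lands in [448, 512) is clamped to 448.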

Unit test (based on test_block_fp8.py):

python3 -m pytest python/sglang/test/test_block_fp8.py::TestMXFP8DenseLinear -v


==================================== test session starts ====================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /opt/venv-sglang-dev/bin/python3
cachedir: .pytest_cache
rootdir: /sglang-dev/sglang/python
configfile: pyproject.toml
plugins: anyio-4.12.1
collected 1 item                                                                            

python/sglang/test/test_block_fp8.py::TestMXFP8DenseLinear::test_mxfp8_dense_linear PASSED [100%]
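A dense-linear test of this kind typically compares the kernel's output against a reference. The plain-Python sketch below shows the semantics of an MX-style block-scaled matmul: both operands carry one scale per 32-element block along the reduction dimension K, and each block's partial dot product on the raw quantized values is multiplied by the two scales. The function names and the (N, K) weight layout are assumptions for illustration; this is not the contents of test_block_fp8.py.

```python
from typing import List

Matrix = List[List[float]]
BLOCK_SIZE = 32  # MX block size along the reduction dimension K


def block_scaled_matmul_ref(a_q: Matrix, a_scale: Matrix,
                            w_q: Matrix, w_scale: Matrix) -> Matrix:
    """Reference for C = A @ W^T with per-block scales along K.

    a_q: (M, K) quantized activations; a_scale: (M, K // 32)
    w_q: (N, K) quantized weights (linear-layer layout); w_scale: (N, K // 32)
    """
    m_dim, k_dim, n_dim = len(a_q), len(a_q[0]), len(w_q)
    assert k_dim % BLOCK_SIZE == 0, "K must be a multiple of the block size"
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for j in range(k_dim // BLOCK_SIZE):
                ks = range(j * BLOCK_SIZE, (j + 1) * BLOCK_SIZE)
                # Partial dot product on raw quantized values, then apply both scales.
                partial = sum(a_q[m][k] * w_q[n][k] for k in ks)
                acc += partial * a_scale[m][j] * w_scale[n][j]
            out[m][n] = acc
    return out


def dequantize(q: Matrix, scale: Matrix) -> Matrix:
    """Expand each per-block scale back onto its 32 elements."""
    return [[v * s_row[k // BLOCK_SIZE] for k, v in enumerate(row)]
            for row, s_row in zip(q, scale)]


def matmul_nt(a: Matrix, w: Matrix) -> Matrix:
    """Plain A @ W^T on already-dequantized values."""
    return [[sum(x * y for x, y in zip(a_row, w_row)) for w_row in w]
            for a_row in a]


def assert_close(x: Matrix, y: Matrix, rtol: float = 1e-6, atol: float = 1e-9) -> None:
    """Element-wise closeness check in the style of a kernel unit test."""
    for rx, ry in zip(x, y):
        for u, v in zip(rx, ry):
            assert abs(u - v) <= atol + rtol * abs(v), (u, v)
```

Accumulating each block's partial sum before applying the scales mirrors how a block-scaled kernel can keep its inner loop on raw FP8 values; since MX scales are powers of two, the result agrees with dequantize-then-matmul up to floating-point summation order.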

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@wolfcomos wolfcomos changed the title from "enable mxfp8 on nvidia sm120" to "[WIP]enable mxfp8 on nvidia sm120" on Feb 21, 2026
@b8zhong
Collaborator

b8zhong commented Mar 2, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Mar 2, 2026
@Kangyan-Zhou Kangyan-Zhou merged commit e5edf22 into sgl-project:main Mar 2, 2026
135 of 160 checks passed
Kangyan-Zhou pushed a commit to Kangyan-Zhou/sglang that referenced this pull request Mar 4, 2026
Co-authored-by: Your Name <you@example.com>
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Co-authored-by: Your Name <you@example.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: Your Name <you@example.com>
3 participants