
[RL] Support per-layer mixed FP8/BF16 serving for FP8 checkpoints #18742

Merged: ispobock merged 9 commits into sgl-project:main from zianglih:mxfp8-last-n on Mar 1, 2026

Conversation

@zianglih (Contributor) commented Feb 12, 2026

Motivation

@HumansAnd

As noted in https://arxiv.org/abs/2509.25149, keeping "a few sensitive linear layers in higher precision" can improve training convergence.

radixark/miles#614 adds --first-last-layers-bf16 for mxfp8 RL training.

This PR enables reliable serving of FP8 checkpoints with intentional per-layer BF16 retention.

Modifications

  • Normalize modules_to_not_convert to include both model. and non‑model. prefixes so mixed‑precision BF16 tail layers (including MoE experts) are reliably skipped (a sketch of this normalization follows the list).
  • Pass packed_modules_mapping through Fp8Config at construction time to keep fused‑name skip logic consistent.
  • Fix is_layer_skipped for MoE experts by preserving coarse prefix matches while still honoring expert‑specific entries; remove the redundant MoE skip check in Fp8Config.
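
A minimal sketch of the prefix normalization described in the first item above; the helper name is hypothetical, and the real logic lives in python/sglang/srt/layers/quantization/fp8.py:

# Hypothetical sketch: make every ignored layer name match both its
# "model."-prefixed and unprefixed form when deciding whether to skip FP8.
def normalize_ignored_layers(ignored_layers):
    normalized = set()
    for name in ignored_layers:
        normalized.add(name)
        if name.startswith("model."):
            normalized.add(name[len("model."):])
        else:
            normalized.add("model." + name)
    return sorted(normalized)

# Example: a BF16 tail layer listed without the "model." prefix still matches
# the prefixed module name seen at load time.
print(normalize_ignored_layers(["layers.40.mlp.experts"]))
# -> ['layers.40.mlp.experts', 'model.layers.40.mlp.experts']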

Accuracy Tests

Download the checkpoint and convert it to a mixed-precision checkpoint using radixark/miles@83ba755:

hf download Qwen/Qwen3-30B-A3B-Instruct-2507 --local-dir /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507
python /root/miles/tools/convert_hf_to_mxfp8.py --model-dir /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507 --save-dir /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507-MXFP8-last-8 --num-layers-at-end-in-bf16 8

Serving the mixed-precision PTQ checkpoint:

python -m sglang.launch_server --kv-cache-dtype bf16 --model /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507-MXFP8-last-8 --fp8-gemm-backend triton --moe-runner-backend cutlass &
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.964
Invalid: 0.000
Latency: 16.168 s
Output throughput: 10505.626 token/s

Weight update also works:

curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "/data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507-MXFP8-last-8",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.964
Invalid: 0.000
Latency: 15.775 s
Output throughput: 10767.157 token/s

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (bot) commented

Summary of Changes

Hello @zianglih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility of FP8 checkpoint serving by enabling mixed precision capabilities across different layers. It introduces a mechanism to selectively skip FP8 quantization for FusedMoE layers, allowing them to operate in unquantized precision (e.g., BF16) when needed. Additionally, the changes refine how layers are designated to be ignored from quantization, providing more robust prefix matching for module names and incorporating a new mapping for packed modules to ensure accurate quantization control.

Highlights

  • Mixed Precision for FusedMoE Layers: Introduced the ability to serve FP8 checkpoints with mixed precision, specifically allowing FusedMoE layers to fall back to unquantized (BF16) precision when specified.
  • Enhanced Ignored Layers Configuration: Improved the ignored_layers mechanism to robustly handle layer names, supporting both "model." prefixed and non-prefixed variants for more flexible exclusion from FP8 quantization.
  • Packed Modules Mapping Support: Added a packed_modules_mapping parameter to Fp8Config to better manage and identify modules that might be grouped or packed, influencing quantization decisions.


Changelog
  • python/sglang/srt/layers/quantization/fp8.py
    • Imported UnquantizedFusedMoEMethod to support unquantized FusedMoE layers.
    • Added packed_modules_mapping as a new configuration parameter to Fp8Config and its __init__ and from_config methods.
    • Modified the ignored_layers processing logic to normalize layer names, ensuring both "model." prefixed and non-prefixed versions are considered for robust matching.
    • Updated the get_quant_method function to utilize packed_modules_mapping in is_layer_skipped and to return UnquantizedFusedMoEMethod for FusedMoE layers that are explicitly skipped (a toy sketch of this dispatch follows below).
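
A toy sketch of that skip-then-dispatch behavior; the class names mirror the real ones, but the bodies are stubs and the helper is a simplified stand-in, not the SGLang implementation:

# Toy sketch only: stub classes stand in for the real FP8 / unquantized MoE methods.
class Fp8MoEMethod: ...
class UnquantizedFusedMoEMethod: ...

def layer_is_ignored(prefix, ignored_layers):
    # Simplified stand-in for is_layer_skipped's coarse prefix match.
    return any(ignored in prefix for ignored in ignored_layers)

def moe_quant_method(prefix, ignored_layers):
    if layer_is_ignored(prefix, ignored_layers):
        return UnquantizedFusedMoEMethod()  # BF16 fallback for skipped experts
    return Fp8MoEMethod()

# An expert module under a BF16 tail layer gets the unquantized MoE method.
method = moe_quant_method("model.layers.47.mlp.experts", ["model.layers.47."])
print(type(method).__name__)  # UnquantizedFusedMoEMethod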

@gemini-code-assist (bot) left a comment

Code Review

This pull request enhances FP8 quantization by allowing mixed precision across layers. This is achieved by introducing a mechanism to skip quantization for specified layers, including LinearBase and FusedMoE layers. The changes include updating Fp8Config to handle ignored_layers and packed_modules_mapping more robustly. My review focuses on improving the implementation details for clarity and efficiency.

Comment on lines +220 to +222
should_skip = is_layer_skipped(
prefix, self.ignored_layers, fused_mapping=self.packed_modules_mapping
) or any(ignored in prefix for ignored in self.ignored_layers)
Severity: medium

The logic to determine if a FusedMoE layer should be skipped appears to be redundant. The is_layer_skipped function is called, and its result is then OR-ed with any(ignored in prefix for ignored in self.ignored_layers). This second check is already part of is_layer_skipped's logic for most cases.

This addition seems to be a workaround for a potential bug in is_layer_skipped's special handling for MoE layers. While this fix is effective, it makes the code harder to understand. A more direct fix within is_layer_skipped would be ideal. If that's out of scope for this PR, consider adding a comment to explain why this apparent redundancy is necessary.

@zianglih (author) replied:
Done by c2ce10c

@zianglih marked this pull request as draft on February 12, 2026 22:38
ziang-and pushed 4 commits to zianglih/sglang that referenced this pull request on Feb 12, 2026
@zianglih marked this pull request as ready for review on February 22, 2026 05:18
@HandH1998 (Collaborator) commented Feb 26, 2026

/tag-and-rerun-ci

@zianglih (author) commented:

Some tests failed due to "The action 'Install dependencies' has timed out after 20 minutes."

zianglih pushed 2 commits to zianglih/sglang that referenced this pull request on Feb 28, 2026
@zianglih changed the title from "Allow serving FP8 checkpoints with mixed precision across layers" to "Support per-layer mixed FP8/BF16 serving for FP8 checkpoints" on Feb 28, 2026
@zianglih (author) commented:

@HandH1998 NVIDIA CI is mostly green; the only failure is one unrelated flaky test.

@zianglih changed the title from "Support per-layer mixed FP8/BF16 serving for FP8 checkpoints" to "[RL] Support per-layer mixed FP8/BF16 serving for FP8 checkpoints" on Feb 28, 2026
@HandH1998 (Collaborator) replied:

Ok, we will merge it soon.

@ispobock merged commit 0e86977 into sgl-project:main on Mar 1, 2026 (91 of 103 checks passed).