
[RL] Support per-layer mixed FP8/BF16 serving for FP8 checkpoints #18742

Merged: ispobock merged 9 commits into sgl-project:main from zianglih:mxfp8-last-n on Mar 1, 2026

Conversation

@zianglih (Contributor) commented Feb 12, 2026

Motivation

@HumansAnd

As noted in https://arxiv.org/abs/2509.25149, keeping "a few sensitive linear layers in higher precision" can improve training convergence.

radixark/miles#614 adds --first-last-layers-bf16 for mxfp8 RL training.

This PR enables reliable serving of FP8 checkpoints with intentional per-layer BF16 retention.

Modifications

  • Normalize modules_to_not_convert to include both model. and non‑model. prefixes so mixed‑precision BF16 tail layers (including MoE experts) are reliably skipped (a sketch of this normalization follows the list).
  • Pass packed_modules_mapping through Fp8Config at construction time to keep fused‑name skip logic consistent.
  • Fix is_layer_skipped for MoE experts by preserving coarse prefix matches while still honoring expert‑specific entries; remove the redundant MoE skip check in Fp8Config.
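
A minimal sketch of the prefix normalization described in the first item above; the helper name is hypothetical, and the real logic lives in python/sglang/srt/layers/quantization/fp8.py:

# Hypothetical sketch: make every ignored layer name match both its
# "model."-prefixed and unprefixed form when deciding whether to skip FP8.
def normalize_ignored_layers(ignored_layers):
    normalized = set()
    for name in ignored_layers:
        normalized.add(name)
        if name.startswith("model."):
            normalized.add(name[len("model."):])
        else:
            normalized.add("model." + name)
    return sorted(normalized)

# Example: a BF16 tail layer listed without the "model." prefix still matches
# the prefixed module name seen at load time.
print(normalize_ignored_layers(["layers.40.mlp.experts"]))
# -> ['layers.40.mlp.experts', 'model.layers.40.mlp.experts']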

Accuracy Tests

Download the checkpoint and convert it to a mixed-precision checkpoint using radixark/miles@83ba755:

hf download Qwen/Qwen3-30B-A3B-Instruct-2507 --local-dir /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507
python /root/miles/tools/convert_hf_to_mxfp8.py --model-dir /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507 --save-dir /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507-MXFP8-last-8 --num-layers-at-end-in-bf16 8

Serving the mixed-precision PTQ checkpoint:

python -m sglang.launch_server --kv-cache-dtype bf16 --model /data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507-MXFP8-last-8 --fp8-gemm-backend triton --moe-runner-backend cutlass &
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.964
Invalid: 0.000
Latency: 16.168 s
Output throughput: 10505.626 token/s

Weight update also works:

curl -sS http://localhost:30000/update_weights_from_disk \
  -H 'Content-Type: application/json' \
  -d '{
    "model_path": "/data/home/ziangli/models/Qwen3-30B-A3B-Instruct-2507-MXFP8-last-8",
    "flush_cache": true,
    "abort_all_requests": false
  }'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1209 --parallel 1209 --platinum
Accuracy: 0.964
Invalid: 0.000
Latency: 15.775 s
Output throughput: 10767.157 token/s

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (bot) commented

Summary of Changes

Hello @zianglih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility of FP8 checkpoint serving by enabling mixed precision capabilities across different layers. It introduces a mechanism to selectively skip FP8 quantization for FusedMoE layers, allowing them to operate in unquantized precision (e.g., BF16) when needed. Additionally, the changes refine how layers are designated to be ignored from quantization, providing more robust prefix matching for module names and incorporating a new mapping for packed modules to ensure accurate quantization control.

Highlights

  • Mixed Precision for FusedMoE Layers: Introduced the ability to serve FP8 checkpoints with mixed precision, specifically allowing FusedMoE layers to fall back to unquantized (BF16) precision when specified.
  • Enhanced Ignored Layers Configuration: Improved the ignored_layers mechanism to robustly handle layer names, supporting both "model." prefixed and non-prefixed variants for more flexible exclusion from FP8 quantization.
  • Packed Modules Mapping Support: Added a packed_modules_mapping parameter to Fp8Config to better manage and identify modules that might be grouped or packed, influencing quantization decisions.


Changelog
  • python/sglang/srt/layers/quantization/fp8.py
    • Imported UnquantizedFusedMoEMethod to support unquantized FusedMoE layers.
    • Added packed_modules_mapping as a new configuration parameter to Fp8Config and its __init__ and from_config methods.
    • Modified the ignored_layers processing logic to normalize layer names, ensuring both "model." prefixed and non-prefixed versions are considered for robust matching.
    • Updated the get_quant_method function to utilize packed_modules_mapping in is_layer_skipped and to return UnquantizedFusedMoEMethod for FusedMoE layers that are explicitly skipped (a toy sketch of this dispatch follows below).
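
A toy sketch of that skip-then-dispatch behavior; the class names mirror the real ones, but the bodies are stubs and the helper is a simplified stand-in, not the SGLang implementation:

# Toy sketch only: stub classes stand in for the real FP8 / unquantized MoE methods.
class Fp8MoEMethod: ...
class UnquantizedFusedMoEMethod: ...

def layer_is_ignored(prefix, ignored_layers):
    # Simplified stand-in for is_layer_skipped's coarse prefix match.
    return any(ignored in prefix for ignored in ignored_layers)

def moe_quant_method(prefix, ignored_layers):
    if layer_is_ignored(prefix, ignored_layers):
        return UnquantizedFusedMoEMethod()  # BF16 fallback for skipped experts
    return Fp8MoEMethod()

# An expert module under a BF16 tail layer gets the unquantized MoE method.
method = moe_quant_method("model.layers.47.mlp.experts", ["model.layers.47."])
print(type(method).__name__)  # UnquantizedFusedMoEMethod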

@gemini-code-assist (bot) left a comment

Code Review

This pull request enhances FP8 quantization by allowing mixed precision across layers. This is achieved by introducing a mechanism to skip quantization for specified layers, including LinearBase and FusedMoE layers. The changes include updating Fp8Config to handle ignored_layers and packed_modules_mapping more robustly. My review focuses on improving the implementation details for clarity and efficiency.

Comment on lines +220 to +222
should_skip = is_layer_skipped(
prefix, self.ignored_layers, fused_mapping=self.packed_modules_mapping
) or any(ignored in prefix for ignored in self.ignored_layers)
Severity: medium

The logic to determine if a FusedMoE layer should be skipped appears to be redundant. The is_layer_skipped function is called, and its result is then OR-ed with any(ignored in prefix for ignored in self.ignored_layers). This second check is already part of is_layer_skipped's logic for most cases.

This addition seems to be a workaround for a potential bug in is_layer_skipped's special handling for MoE layers. While this fix is effective, it makes the code harder to understand. A more direct fix within is_layer_skipped would be ideal. If that's out of scope for this PR, consider adding a comment to explain why this apparent redundancy is necessary.

@zianglih (author) replied:
Done by c2ce10c

@zianglih marked this pull request as draft on February 12, 2026 22:38
ziang-and pushed 4 commits to zianglih/sglang that referenced this pull request on Feb 12, 2026
@zianglih marked this pull request as ready for review on February 22, 2026 05:18
@HandH1998 (Collaborator) commented Feb 26, 2026

/tag-and-rerun-ci

@zianglih (author) commented:

Some tests failed due to "The action 'Install dependencies' has timed out after 20 minutes."

zianglih pushed 2 commits to zianglih/sglang that referenced this pull request on Feb 28, 2026
@zianglih changed the title from "Allow serving FP8 checkpoints with mixed precision across layers" to "Support per-layer mixed FP8/BF16 serving for FP8 checkpoints" on Feb 28, 2026
@zianglih (author) commented:

@HandH1998 NVIDIA CI is mostly green; the only failure is one unrelated flaky test.

@zianglih changed the title from "Support per-layer mixed FP8/BF16 serving for FP8 checkpoints" to "[RL] Support per-layer mixed FP8/BF16 serving for FP8 checkpoints" on Feb 28, 2026
@HandH1998 (Collaborator) replied:

Ok, we will merge it soon.

@ispobock merged commit 0e86977 into sgl-project:main on Mar 1, 2026 (91 of 103 checks passed).