[RL] Support per-layer mixed FP8/BF16 serving for FP8 checkpoints #18742

ispobock merged 9 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the flexibility of FP8 checkpoint serving by enabling mixed-precision capabilities across different layers. It introduces a mechanism to selectively skip FP8 quantization for specified layers.
Code Review
This pull request enhances FP8 quantization by allowing mixed precision across layers. This is achieved by introducing a mechanism to skip quantization for specified layers, including LinearBase and FusedMoE layers. The changes include updating Fp8Config to handle ignored_layers and packed_modules_mapping more robustly. My review focuses on improving the implementation details for clarity and efficiency.
```python
should_skip = is_layer_skipped(
    prefix, self.ignored_layers, fused_mapping=self.packed_modules_mapping
) or any(ignored in prefix for ignored in self.ignored_layers)
```
The logic to determine if a FusedMoE layer should be skipped appears to be redundant. The is_layer_skipped function is called, and its result is then OR-ed with any(ignored in prefix for ignored in self.ignored_layers). This second check is already part of is_layer_skipped's logic for most cases.
This addition seems to be a workaround for a potential bug in is_layer_skipped's special handling for MoE layers. While this fix is effective, it makes the code harder to understand. A more direct fix within is_layer_skipped would be ideal. If that's out of scope for this PR, consider adding a comment to explain why this apparent redundancy is necessary.
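To see why the reviewer considers the extra `any(...)` check redundant, here is a minimal, hypothetical reconstruction of substring-based skip matching (a simplified sketch, not the actual SGLang implementation of `is_layer_skipped`):

```python
def is_layer_skipped(prefix, ignored_layers, fused_mapping=None):
    """Sketch: decide whether a layer prefix should keep higher precision.

    `fused_mapping` maps a fused shard name (e.g. "qkv_proj") to the
    per-shard names it packs ("q_proj", "k_proj", "v_proj").
    """
    fused_mapping = fused_mapping or {}
    shard = prefix.split(".")[-1]
    if shard in fused_mapping:
        # A fused module is skipped only if every underlying shard is ignored.
        shard_prefixes = [prefix.replace(shard, s) for s in fused_mapping[shard]]
        return all(
            any(ignored in p for ignored in ignored_layers)
            for p in shard_prefixes
        )
    # Non-fused case: plain substring match against the ignored list.
    return any(ignored in prefix for ignored in ignored_layers)
```

The final `return any(...)` branch is exactly the check the call site ORs on again, which is why the review flags the duplication for non-fused layers; the fused branch is where expert-specific entries can diverge.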
This reverts commit a66907f.
This reverts commit b9eed87.
/tag-and-rerun-ci

Some tests failed due to …

@HandH1998 NVIDIA CI is mostly green, only 1 irrelevant flaky test.
Ok, we will merge it soon. |
Motivation
As noted in https://arxiv.org/abs/2509.25149, keeping the first and last layers in BF16 can improve training convergence.
radixark/miles#614 adds `--first-last-layers-bf16` for mxfp8 RL training. This PR enables reliable serving of FP8 checkpoints with intentional per-layer BF16 retention.
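Concretely, such a mixed-precision checkpoint declares its BF16 layers in the quantization config and the server skips FP8 dequantization for any matching prefix. The sketch below is illustrative only (the layer indices and helper name are assumptions, not taken from this PR):

```python
# Hypothetical excerpt of an FP8 checkpoint's quantization config; entries
# follow the common `modules_to_not_convert` convention.
quantization_config = {
    "quant_method": "fp8",
    "modules_to_not_convert": [
        "model.layers.0.",   # first decoder layer intentionally kept in BF16
        "model.layers.60.",  # last decoder layer intentionally kept in BF16
    ],
}

def serves_in_bf16(prefix):
    """Return True when a layer prefix matches any ignored entry."""
    return any(
        ignored in prefix
        for ignored in quantization_config["modules_to_not_convert"]
    )
```

Layers whose prefix matches an entry load and run in BF16; all other layers are served in FP8.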
Modifications
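One of the modifications below extends `modules_to_not_convert` so each entry matches both `model.`-prefixed and bare layer names. A small normalization helper could achieve this (the helper name and exact behavior are assumptions for illustration, not the PR's code):

```python
def normalize_ignored_layers(ignored_layers):
    """Return each ignored entry both with and without the leading "model."."""
    normalized = set()
    for name in ignored_layers:
        normalized.add(name)
        if name.startswith("model."):
            normalized.add(name[len("model."):])
        else:
            normalized.add("model." + name)
    return sorted(normalized)
```

With both variants present, substring skip checks succeed regardless of whether the serving path constructs prefixes with or without the `model.` root.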
- Extend `modules_to_not_convert` to include both `model.` and non-`model.` prefixes so mixed-precision BF16 tail layers (including MoE experts) are reliably skipped.
- Pass `packed_modules_mapping` through `Fp8Config` at construction time to keep fused-name skip logic consistent.
- Fix `is_layer_skipped` for MoE experts by preserving coarse prefix matches while still honoring expert-specific entries; remove the redundant MoE skip check in `Fp8Config`.

Accuracy Tests
Download the checkpoint and convert it to a mixed-precision checkpoint using radixark/miles@83ba755
Serving the mixed-precision PTQ checkpoint:
Weight update also works:
Benchmarking and Profiling
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`