[Quantization] Support FP8 MoE bias for models like GPT-OSS #34906

simon-mo merged 4 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for FP8 MoE bias in models like GPT-OSS-120B. The changes include registering bias parameters in Fp8MoEMethod.create_weights, passing them to the fused_experts kernel in apply, and adding a safety guard for the FusedMoEModularKernel.
While the logic is sound, there are a few critical issues regarding the data types used for bias parameters and potential typos in attribute names that could lead to runtime errors or precision loss. Specifically, biases should typically remain in the model's original high precision (e.g., BF16) rather than being quantized to FP8, and there are references to self.moe which should likely be self.moe_config.
```python
        set_weight_attrs(w2_weight, extra_weight_attrs)

        # BIASES (for models like GPT-OSS that have biased MoE)
        if self.moe.has_bias:
```
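To make the pattern behind this diff concrete, here is a simplified, illustrative sketch (not the exact vLLM code) of registering the MoE bias parameters alongside the FP8 weights in `create_weights()`. The parameter names `w13_bias`/`w2_bias` come from the PR; the function signature and shapes are assumptions. Per the review comment above, the biases stay in the model's original high precision (e.g. bfloat16) rather than being quantized to FP8:

```python
import torch


def register_moe_biases(layer, num_experts, intermediate_size, hidden_size,
                        params_dtype, extra_weight_attrs, set_weight_attrs):
    """Illustrative sketch: register per-expert MoE biases on `layer`.

    `set_weight_attrs` is passed in to mirror vLLM's helper of the same
    name; `params_dtype` is the model's original dtype (e.g. bfloat16),
    NOT an FP8 dtype, so no precision is lost on the biases.
    """
    # w13_bias covers the fused gate/up projection (hence 2x intermediate).
    w13_bias = torch.nn.Parameter(
        torch.zeros(num_experts, 2 * intermediate_size, dtype=params_dtype),
        requires_grad=False)
    layer.register_parameter("w13_bias", w13_bias)
    set_weight_attrs(w13_bias, extra_weight_attrs)

    # w2_bias covers the down projection back to the hidden size.
    w2_bias = torch.nn.Parameter(
        torch.zeros(num_experts, hidden_size, dtype=params_dtype),
        requires_grad=False)
    layer.register_parameter("w2_bias", w2_bias)
    set_weight_attrs(w2_bias, extra_weight_attrs)
```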
Can you double check this?
This is a false positive. `self.moe` is inherited from the parent class `FusedMoEMethodBase`:

```python
class FusedMoEMethodBase(QuantizeMethodBase):
    def __init__(self, moe: FusedMoEConfig):
        super().__init__()
        self.moe: FusedMoEConfig = moe
        self.moe_quant_config: FusedMoEQuantConfig | None = None
        self.moe_mk: mk.FusedMoEModularKernel | None = None
```
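The inheritance pattern being discussed can be reduced to a minimal, self-contained sketch: the base class sets `self.moe`, so a subclass like `Fp8MoEMethod` can read `self.moe.has_bias` without defining the attribute itself. Class names mirror the vLLM classes, but the bodies here are simplified stand-ins:

```python
from dataclasses import dataclass


@dataclass
class FusedMoEConfig:
    has_bias: bool = False


class FusedMoEMethodBase:
    def __init__(self, moe: FusedMoEConfig):
        # The attribute the static check flagged: set once on the base class.
        self.moe: FusedMoEConfig = moe


class Fp8MoEMethod(FusedMoEMethodBase):
    def create_weights(self) -> str:
        # `self.moe` resolves via normal attribute lookup on the base class,
        # so this is valid even though Fp8MoEMethod never assigns it.
        if self.moe.has_bias:
            return "registering w13_bias/w2_bias"
        return "no bias parameters"
```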
Hi @jasperjiaguo, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then commit the changes and push to your branch.
GPT-OSS-120B has biased MoE layers (`gate_up_proj_bias`, `down_proj_bias`). When serving the BF16 model with `--quantization fp8`, `Fp8MoEMethod` and `Fp8OnlineMoEMethod` don't register bias parameters, causing weight loading failures. This adds bias support to both FP8 MoE method classes:

- Register `w13_bias`/`w2_bias` in `Fp8MoEMethod.create_weights()` when `moe.has_bias` is set
- Inject biases into quant_config via `get_fused_moe_quant_config()`
- Register biases in `Fp8OnlineMoEMethod.create_weights()` using the original (unpatched) `weight_loader`

Tested on 4xH200 with GPT-OSS-120B BF16 + vLLM 0.15.1:

- `vllm serve --quantization fp8` loads and serves successfully
- TRITON FP8 MoE backend selected correctly
- GSM8K accuracy: 0.834 (FP8) vs 0.848 (BF16)
- 1.5x throughput improvement with FP8

Companion PR: sgl-project/sglang#18988

Signed-off-by: jasperjiaguo <jasperg662@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jasperjiaguo could you share the exact command to reproduce the test plan in this PR?
Summary

GPT-OSS-120B has biased MoE layers (`gate_up_proj_bias`, `down_proj_bias`). When serving the BF16 model with `--quantization fp8`, `Fp8MoEMethod` doesn't register bias parameters, causing weight loading failures.

This PR adds bias support to `Fp8MoEMethod`:

- Register `w13_bias`/`w2_bias` in `create_weights()` when `FusedMoEConfig.has_bias` is set
- Pass the biases to `fused_experts()` in `apply()`
- Guard against the unsupported `FusedMoEModularKernel` + bias combination (consistent with `UnquantizedFusedMoEMethod`)

Test plan

Tested on 4×H200 with GPT-OSS-120B BF16 model:

- `vllm serve --quantization fp8` loads successfully with biased MoE
- TRITON FP8 MoE backend (`fused_moe_kernel`) confirmed via nsys profiling

Related

Companion PR: sgl-project/sglang#18988
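The safety-guard bullet above can be sketched as a small standalone check. This is a hedged illustration, not the actual vLLM code: the function name and both arguments are hypothetical, and it only shows the shape of the guard (reject the modular-kernel path when MoE biases are present, consistent with `UnquantizedFusedMoEMethod`):

```python
def check_modular_kernel_bias_support(using_modular_kernel: bool,
                                      has_bias: bool) -> None:
    """Illustrative guard: FusedMoEModularKernel + bias is unsupported.

    Raising early here turns a silent wrong-result or kernel crash into a
    clear configuration error; bias-free and non-modular paths pass through.
    """
    if using_modular_kernel and has_bias:
        raise NotImplementedError(
            "FusedMoEModularKernel does not support biased MoE layers; "
            "use the fused_experts path instead.")
```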
🤖 Generated with Claude Code