[Quantization] Support FP8 MoE bias for models like GPT-OSS#34906

Merged
simon-mo merged 4 commits into vllm-project:main from jasperjiaguo:gpt-oss-fp8-moe-bias on Feb 24, 2026
Conversation

@jasperjiaguo
Contributor

Summary

GPT-OSS-120B has biased MoE layers (gate_up_proj_bias, down_proj_bias). When serving the BF16 model with --quantization fp8, Fp8MoEMethod doesn't register bias parameters, causing weight loading failures.

This PR adds bias support to Fp8MoEMethod:

  • Register w13_bias / w2_bias in create_weights() when FusedMoEConfig.has_bias is set
  • Pass biases through to fused_experts() in apply()
  • Guard against unsupported FusedMoEModularKernel + bias combination (consistent with UnquantizedFusedMoEMethod)
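The control flow described in the bullets above can be illustrated with a minimal, self-contained sketch. This is plain Python (no torch), and the class names, shapes, and the `use_modular_kernel` flag are simplified stand-ins for the real vLLM `Fp8MoEMethod` machinery, not the actual implementation:

```python
from dataclasses import dataclass


@dataclass
class FusedMoEConfigSketch:
    """Stand-in for vLLM's FusedMoEConfig; only fields used in the sketch."""
    num_experts: int
    hidden_size: int
    intermediate_size: int
    has_bias: bool = False


class Fp8MoEMethodSketch:
    def __init__(self, moe: FusedMoEConfigSketch):
        self.moe = moe
        self.params: dict[str, list] = {}

    def create_weights(self) -> None:
        # ... FP8 weight and scale registration elided ...
        if self.moe.has_bias:
            # Biases stay in the model's original high precision; only the
            # matmul weights are quantized to FP8.
            e = self.moe.num_experts
            h = self.moe.hidden_size
            i = self.moe.intermediate_size
            # w13 fuses gate and up projections, hence the 2 * i width.
            self.params["w13_bias"] = [[0.0] * (2 * i) for _ in range(e)]
            self.params["w2_bias"] = [[0.0] * h for _ in range(e)]

    def apply(self, use_modular_kernel: bool) -> dict:
        if use_modular_kernel and self.moe.has_bias:
            # Same guard as UnquantizedFusedMoEMethod: the modular kernel
            # path does not support biased experts.
            raise NotImplementedError(
                "FusedMoEModularKernel does not support bias"
            )
        # Pass the registered biases through to the fused experts kernel.
        return {
            "w1_bias": self.params.get("w13_bias"),
            "w2_bias": self.params.get("w2_bias"),
        }


method = Fp8MoEMethodSketch(
    FusedMoEConfigSketch(
        num_experts=4, hidden_size=8, intermediate_size=16, has_bias=True
    )
)
method.create_weights()
kwargs = method.apply(use_modular_kernel=False)
```

With `has_bias=False`, `create_weights()` registers nothing and `apply()` passes `None` biases, so the unbiased path is unchanged.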

Test plan

Tested on 4×H200 with GPT-OSS-120B BF16 model:

| Metric | BF16 (no quant) | BF16 + --quantization fp8 |
|---|---|---|
| GSM8K Accuracy | 0.848 | 0.834 |
| Output throughput | ~9,600 tok/s | ~14,000 tok/s |
  • vllm serve --quantization fp8 loads successfully with biased MoE
  • FP8 fused_moe_kernel confirmed via nsys profiling

Related

🤖 Generated with Claude Code

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Feb 19, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jasperjiaguo.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for FP8 MoE bias in models like GPT-OSS-120B. The changes include registering bias parameters in Fp8MoEMethod.create_weights, passing them to the fused_experts kernel in apply, and adding a safety guard for the FusedMoEModularKernel.

While the logic is sound, there are a few critical issues regarding the data types used for bias parameters and potential typos in attribute names that could lead to runtime errors or precision loss. Specifically, biases should typically remain in the model's original high precision (e.g., BF16) rather than being quantized to FP8, and there are references to self.moe which should likely be self.moe_config.

set_weight_attrs(w2_weight, extra_weight_attrs)

# BIASES (for models like GPT-OSS that have biased MoE)
if self.moe.has_bias:
Contributor


high

The attribute self.moe is not defined in Fp8MoEMethod. Based on the class initialization and vLLM conventions, this should be self.moe_config.

Suggested change:
- if self.moe.has_bias:
+ if self.moe_config.has_bias:

Collaborator


Can you double check this?

Contributor Author

@jasperjiaguo jasperjiaguo Feb 23, 2026


This is a false positive: self.moe is inherited from the parent class FusedMoEMethodBase:


class FusedMoEMethodBase(QuantizeMethodBase):
    def __init__(self, moe: FusedMoEConfig):
        super().__init__()
        self.moe: FusedMoEConfig = moe
        self.moe_quant_config: FusedMoEQuantConfig | None = None
        self.moe_mk: mk.FusedMoEModularKernel | None = None
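The inheritance the reply points to can be demonstrated with a tiny, self-contained sketch. The class names here are illustrative stand-ins (a plain dict replaces FusedMoEConfig), not the real vLLM classes:

```python
class BaseMethodSketch:
    """Stand-in for FusedMoEMethodBase: sets self.moe in __init__."""

    def __init__(self, moe: dict):
        self.moe = moe


class ChildMethodSketch(BaseMethodSketch):
    """Stand-in for Fp8MoEMethod: defines no __init__ of its own."""

    def has_bias(self) -> bool:
        # self.moe was set by the base-class __init__, so this lookup
        # succeeds even though the subclass never assigns it.
        return self.moe["has_bias"]


m = ChildMethodSketch({"has_bias": True})
```

Since the subclass does not override `__init__`, instantiation runs the base-class initializer, and `self.moe` is a perfectly valid attribute on the child instance.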


@jasperjiaguo force-pushed the gpt-oss-fp8-moe-bias branch 2 times, most recently from 3f49b35 to be3c045 on February 19, 2026 at 19:40
mergify bot removed the needs-rebase label on Feb 19, 2026
@mergify

mergify bot commented Feb 19, 2026

Hi @jasperjiaguo, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@jasperjiaguo force-pushed the gpt-oss-fp8-moe-bias branch 2 times, most recently from 5469510 to 7963ae7 on February 19, 2026 at 22:46
GPT-OSS-120B has biased MoE layers (gate_up_proj_bias, down_proj_bias).
When serving the BF16 model with `--quantization fp8`, Fp8MoEMethod and
Fp8OnlineMoEMethod don't register bias parameters, causing weight
loading failures.

This adds bias support to both FP8 MoE method classes:
- Register w13_bias/w2_bias in Fp8MoEMethod.create_weights() when
  moe.has_bias is set
- Inject biases into quant_config via get_fused_moe_quant_config()
- Register biases in Fp8OnlineMoEMethod.create_weights() using the
  original (unpatched) weight_loader

Tested on 4xH200 with GPT-OSS-120B BF16 + vllm 0.15.1:
- vllm serve --quantization fp8 loads and serves successfully
- TRITON Fp8 MoE backend selected correctly
- GSM8K accuracy: 0.834 (FP8) vs 0.848 (BF16)
- 1.5x throughput improvement with FP8

Companion PR: sgl-project/sglang#18988

Signed-off-by: jasperjiaguo <jasperg662@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Feb 22, 2026
@simon-mo simon-mo added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 22, 2026
@simon-mo simon-mo merged commit ec85340 into vllm-project:main Feb 24, 2026
59 of 62 checks passed
@vkuzo
Contributor

vkuzo commented Feb 24, 2026

@jasperjiaguo could you share the exact command to reproduce the test plan in this PR?

tom-zju pushed a commit to tom-zju/vllm that referenced this pull request Feb 26, 2026
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026

Labels

gpt-oss: Related to GPT-OSS models
ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants