…ding issue & add swiglu activation
Fix the problem that the MLP weight is not actually loaded, and add MLP bias support. See merge request jinyanc/sglang_fork!2
…ry and key tensors
…er in OpenAIMoeConfig
This update introduces a new optional parameter 'bias' to the FusedMoE implementation and its associated quantization methods. When enabled, bias parameters are created for both gate_up_proj and down_proj layers. If disabled, these parameters are registered as None. This change enhances flexibility in model configuration.
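The optional-bias behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual SGLang FusedMoE code; the class and attribute names here are hypothetical:

```python
import torch
from torch import nn


class FusedMoESketch(nn.Module):
    """Illustrative sketch only: when `bias` is enabled, bias parameters
    are created for gate_up_proj and down_proj; otherwise they are
    registered as None."""

    def __init__(self, num_experts: int, hidden: int, inter: int, bias: bool = False):
        super().__init__()
        self.w_gate_up = nn.Parameter(torch.empty(num_experts, 2 * inter, hidden))
        self.w_down = nn.Parameter(torch.empty(num_experts, hidden, inter))
        if bias:
            self.b_gate_up = nn.Parameter(torch.zeros(num_experts, 2 * inter))
            self.b_down = nn.Parameter(torch.zeros(num_experts, hidden))
        else:
            # Registering None keeps the attribute present, so loading
            # and quantization code can check `is None` uniformly.
            self.register_parameter("b_gate_up", None)
            self.register_parameter("b_down", None)
```

Registering `None` rather than omitting the attribute lets downstream code branch on a single `is None` check instead of `hasattr`.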
…n OpenAIMoeSparseMoeBlock
1. Fix mismatches including truncation, max_positions, etc. 2. Simply fall back to the native apply-rope implementation, since the sgl-kernel version does not support the OAI variant.
Align with reference code. See merge request jinyanc/sglang_fork!3
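For context, a native (non-fused) rotate-half RoPE application looks roughly like the sketch below. This is the generic form only; the OAI-specific scaling and truncation handling mentioned above is omitted, and the function name is hypothetical:

```python
import torch


def apply_rope_native(q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Generic rotate-half RoPE in plain PyTorch (fallback sketch).

    q:        [seq, heads, head_dim]
    cos, sin: [seq, head_dim]
    """
    d = q.shape[-1]
    q1, q2 = q[..., : d // 2], q[..., d // 2 :]
    # rotate-half: (q1, q2) -> (-q2, q1), then combine with cos/sin
    rotated = torch.cat((-q2, q1), dim=-1)
    return q * cos[:, None, :] + rotated * sin[:, None, :]
```

With `cos = 1` and `sin = 0` (position zero), the function is the identity, which is a handy sanity check.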
…salLM. The update ensures that parameters with ".mlp.router." in their names are correctly mapped to ".mlp.gate." and loaded appropriately, enhancing model compatibility.
…eters in OpenAIMoeForCausalLM. The change specifies that OpenAIMoe uses 'router' to refer to 'gate', improving code readability and understanding.
Fix gate-not-loaded bug. See merge request jinyanc/sglang_fork!4
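The router-to-gate remapping described above can be sketched as a simple name rewrite applied before weight lookup. The helper name here is hypothetical; the real logic lives in the weight-loading code of OpenAIMoeForCausalLM:

```python
def remap_weight_name(name: str) -> str:
    """GPT-OSS checkpoints use 'router' for what SGLang's module tree
    calls 'gate', so checkpoint parameter names are rewritten before
    being matched against model parameters."""
    if ".mlp.router." in name:
        name = name.replace(".mlp.router.", ".mlp.gate.")
    return name
```

All other parameter names pass through unchanged, so the remap is safe to apply unconditionally.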
Seems not needed; if the value is not >= 0, FlashInfer will not use sliding window.
So this will not work on Ada series GPUs?
We never tested on Ada. Feel free to try it.
Nope, I guess it's due to MXFP4. Any plans for this?
@chriswritescode-dev please refer to #8833 |
```python
    return output_q, scale

def swizzle(x: torch.Tensor):
    assert len(x.shape) == 3
```
Hi @hlu1, great work!
May I know the reason for swizzling a tensor outside of the main GPU kernel? Swizzling is related to shared-memory layout and has nothing to do with the GPU main-memory layout. I am really curious why this helps.
Swizzling is very important for the kernel performance for block scale formats on Hopper and Blackwell. It's required to get optimal global memory access.
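As a toy illustration only (the real swizzle pattern is format- and architecture-specific, and this function is hypothetical), reordering a 3-D scale tensor into contiguous tiles might look like:

```python
import torch


def tile_swizzle(x: torch.Tensor, tile: int = 4) -> torch.Tensor:
    """Toy sketch, NOT the actual kernel layout: regroup the last
    dimension into tiles of `tile` elements so the values each block
    consumes together are adjacent in global memory.

    x: [B, M, K] with K divisible by `tile`.
    Returns: [B, K // tile, M, tile]."""
    assert len(x.shape) == 3
    b, m, k = x.shape
    return x.view(b, m, k // tile, tile).transpose(1, 2).contiguous()
```

The point of doing this on the host side is that the kernel can then issue wide, coalesced global loads instead of gathering scattered elements.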
Motivation
Authors @xutizhou, @liz-badada, @zhuofan1123, @linhu-nv from NVIDIA Solution Architect Team
Many thanks to @qsang-nv, @PerkzZheng, @yzh119 for helping integrate attention sink with FlashInfer attention kernels, @dongfengy for helping integrate the Triton MoE kernel.
SGLang now supports GPT-OSS. This guide provides step-by-step instructions for deploying GPT-OSS with SGLang and running benchmark tests. (Note: the current implementation focuses on functional completeness rather than performance optimization. Community contributions for performance improvements are welcome. Some workarounds exist in the current codebase and will be addressed in future updates.)
Main Modifications
`--attention-backend [flashinfer, torch_native_sink]`; `--enable-fp8-act` for MoE MXFP4 W4A8

Prerequisites
`lmsysorg/sglang:latest`; on Blackwell: `lmsysorg/sglang:v0.4.8-cu128-b200`

Model Weights
Get model weights from HuggingFace Model Card
Command Example
To serve the GPT-OSS model, refer to the following command:
Launch Server
```bash
python3 -m sglang.launch_server \
  --model-path <path-to-gpt-oss> \
  --tp-size 4 \
  --enable-fp8-act \
  --cuda-graph-bs 256 \
  --max-running-requests 256 \
  --attention-backend flashinfer
```

Bench Serving
```bash
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 6144 \
  --random-input-len 1000 \
  --random-output-len 2000 \
  --random-range-ratio 1
```

Note:
`--enable-fp8-act`.

Bench offline
To benchmark the GPT-OSS model offline, refer to the following command:
```bash
python3 -m sglang.bench_offline_throughput \
  --model-path <path-to-gpt-oss> \
  --tp-size 4 \
  --num-prompts 6144 \
  --enable-fp8-act \
  --cuda-graph-bs 256 \
  --dataset-name random \
  --random-input 1000 \
  --random-output 2000 \
  --random-range-ratio 1 \
  --max-running-requests 256 \
  --attention-backend [flashinfer]
```

Simple Test
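Once the server is up, a quick sanity request can be sent through SGLang's OpenAI-compatible endpoint. This is a sketch using only the standard library; it assumes the default port 30000, and the `"default"` model name is a placeholder:

```python
import json
import urllib.request


def chat_once(prompt: str, base_url: str = "http://localhost:30000") -> str:
    """Send one chat completion to a running SGLang server and return
    the generated text. Assumes the server was launched as above."""
    payload = {
        "model": "default",  # placeholder; the server routes to the loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 32,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat_once("Say hello in one word."))
```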
Note
If you meet issues on Blackwell, refer to #7227.
Modifications
Accuracy Test
Benchmark & Profiling
Checklist