
Add model gpt-oss#8822

Closed
hlu1 wants to merge 138 commits into sgl-project:main from hlu1:feat/gpt-oss

Conversation

@hlu1 (Collaborator) commented Aug 5, 2025

Motivation

Authors @xutizhou, @liz-badada, @zhuofan1123, @linhu-nv from NVIDIA Solution Architect Team

Many thanks to @qsang-nv, @PerkzZheng, @yzh119 for helping integrate attention sink with FlashInfer attention kernels, @dongfengy for helping integrate the Triton MoE kernel.

SGLang now supports GPT-OSS. This guide provides step-by-step instructions for deploying GPT-OSS with SGLang and running benchmark tests. (Note: the current implementation focuses on functional completeness rather than performance optimization. Community contributions for performance improvements are welcome. Some workarounds exist in the current codebase and will be addressed in future updates.)

Main Modifications

  • Added support for the GPT-OSS model
  • Adapted model weight loading
  • Integrated Triton MoE support (notably the MXFP4_W4A16 and MXFP4_W4A8 data types)
  • Added multiple attention backends to support sliding-window attention (SWA) with sinks (FlashInfer attention backend; Torch native attention backend)
  • Added accuracy tests (GSM8K, MMLU, etc.), plus reasoning-mode and system_messages support
  • Args: --attention-backend [flashinfer, torch_native_sink], --enable-fp8-act for MoE MXFP4 W4A8
  • Others
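
The sliding-window-attention-with-sinks change can be illustrated with a toy softmax (a sketch only, not the FlashInfer kernel: the real implementation fuses this per head inside the attention kernel):

```python
import numpy as np

# Toy illustration of an "attention sink": a per-head sink logit joins the
# softmax denominator, so the weights over the real tokens sum to less
# than 1 and the sink absorbs the leftover probability mass.
def softmax_with_sink(scores: np.ndarray, sink_logit: float) -> np.ndarray:
    m = max(scores.max(), sink_logit)         # subtract max for stability
    e = np.exp(scores - m)
    denom = e.sum() + np.exp(sink_logit - m)  # sink joins the denominator
    return e / denom

scores = np.array([1.0, 2.0, 3.0])
weights = softmax_with_sink(scores, sink_logit=0.0)
```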

Prerequisites

  • Hopper / Blackwell GPUs
  • Docker image on Hopper: lmsysorg/sglang:latest; on Blackwell: lmsysorg/sglang:v0.4.8-cu128-b200
  • Transformers: latest released version
  • FlashInfer: latest released version
  • Triton 3.4.0
  • sgl-kernel 0.3.1
  • Torch 2.8

cd path-to/sglang
git submodule init && git submodule update

Model Weights

Get the model weights from the Hugging Face model card.

Command Example

To serve the GPT-OSS model, use the following command:

Launch Server

python3 -m sglang.launch_server \
    --model-path <path-to-gpt-oss> \
    --tp-size 4 \
    --enable-fp8-act \
    --cuda-graph-bs 256 \
    --max-running-requests 256 \
    --attention-backend flashinfer
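
Once the server is up, it can be queried through the OpenAI-compatible API SGLang exposes. A minimal request sketch (the model name, port, and prompt below are placeholders for illustration, not values from this PR):

```python
import json

# Hypothetical request body for the server launched above; SGLang serves an
# OpenAI-compatible /v1/chat/completions endpoint. POST this JSON to
# http://localhost:30000/v1/chat/completions (port is the assumed default).
payload = {
    "model": "gpt-oss",  # placeholder model name
    "messages": [{"role": "user", "content": "Explain attention sinks."}],
    "max_tokens": 128,
}
body = json.dumps(payload)
```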

Bench Serving

python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --num-prompts 6144 \
    --random-input-len 1000 \
    --random-output-len 2000 \
    --random-range-ratio 1

Note:

  • To enable the MXFP4 W4A8 Triton MoE, pass --enable-fp8-act.

Bench offline

To benchmark the GPT-OSS model offline, use the following command:

python3 -m sglang.bench_offline_throughput \
        --model-path <path-to-gpt-oss>  \
        --tp-size 4 \
        --num-prompts 6144 \
        --enable-fp8-act \
        --cuda-graph-bs 256 \
        --dataset-name random \
        --random-input 1000 \
        --random-output 2000 \
        --random-range-ratio 1 \
        --max-running-requests 256 \
        --attention-backend flashinfer

Simple Test

python3 test/srt/models/test_gpt_oss_models.py

Note

If you hit issues on Blackwell, refer to #7227.

Jinyan Chen and others added 30 commits June 25, 2025 04:27
fix the problem that mlp weight is not really loaded, and mlp bias support

See merge request jinyanc/sglang_fork!2
This update introduces a new optional parameter 'bias' to the FusedMoE implementation and its associated quantization methods. When enabled, bias parameters are created for both gate_up_proj and down_proj layers. If disabled, these parameters are registered as None. This change enhances flexibility in model configuration.
1. fix mismatch including truncation, max_positions, etc.
2. simply fall back to the native apply-rope impl, as the sgl-kernel version does not support the oai version.
align with reference code

See merge request jinyanc/sglang_fork!3
…salLM. The update ensures that parameters with ".mlp.router." in their names are correctly mapped to ".mlp.gate." and loaded appropriately, enhancing model compatibility.
…eters in OpenAIMoeForCausalLM. The change specifies that OpenAIMoe uses 'router' to refer to 'gate', improving code readability and understanding.
fix gate not load bug

See merge request jinyanc/sglang_fork!4
@Swipe4057 mentioned this pull request Aug 5, 2025
Contributor


Remove?

Contributor


seems not needed, if not >= 0 flashinfer will not use sliding window

@merrymercy
Contributor

Summary:

@chriswritescode-dev

So this will not work on Ada series GPUs?

@hlu1
Collaborator Author

hlu1 commented Aug 5, 2025

So this will not work on Ada series GPUs?

We never tested on Ada. Feel free to try it.

@chriswritescode-dev

So this will not work on Ada series GPUs?

We never tested on Ada. Feel free to try it.
sglang/python/sglang/srt/model_loader/loader.py", line 140, in _get_quantization_config
raise ValueError(
ValueError: The quantization method mxfp4 is not supported for the current GPU. Minimum capability: 90. Current capability: 89.

Nope, I guess it's due to mxfp4. Any plans for this?
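
The gate quoted in the traceback can be sketched as a plain compute-capability comparison (`supports_mxfp4` is a hypothetical helper for illustration, not SGLang API):

```python
# Illustrative sketch of the loader check quoted above: mxfp4 requires
# compute capability >= 9.0 (Hopper/Blackwell); Ada reports 8.9, hence
# the ValueError. `supports_mxfp4` is a hypothetical helper.
def supports_mxfp4(major: int, minor: int) -> bool:
    return (major, minor) >= (9, 0)

hopper_ok = supports_mxfp4(9, 0)     # H100 (SM 9.0)
ada_ok = supports_mxfp4(8, 9)        # Ada (SM 8.9)
```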

@zhyncs
Collaborator

zhyncs commented Aug 5, 2025

@chriswritescode-dev please refer to #8833

@zhyncs
Collaborator

zhyncs commented Aug 6, 2025

Thank you for your contributions. Since we have already received support in pull requests #8824 and #8843, we will now close this PR.

@zhyncs zhyncs closed this Aug 6, 2025
return output_q, scale

def swizzle(x: torch.Tensor):
assert len(x.shape) == 3
Contributor


hi @hlu1 great work!

May I know the reason to swizzle a tensor outside of the main GPU kernel? Swizzle is related to shared memory layout, and it has nothing to do with the GPU main memory layout. I am really interested in why this helps.

Collaborator Author


Swizzling is very important for the kernel performance for block scale formats on Hopper and Blackwell. It's required to get optimal global memory access.
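
The layout idea can be illustrated with a toy permutation (a sketch only; the real MXFP4 block-scale swizzle pattern is kernel-specific and not reproduced here). Pre-permuting a tensor into a blocked order changes which elements are adjacent in global memory without changing its contents:

```python
import numpy as np

# Toy "swizzle": reorder a 3-D tensor into a blocked layout ahead of the
# kernel so that loads along the block dimension are contiguous in global
# memory. The block size and axis order here are illustrative assumptions.
def swizzle_blocks(x: np.ndarray, block: int = 4) -> np.ndarray:
    assert x.ndim == 3 and x.shape[-1] % block == 0
    e, m, n = x.shape
    # group the last axis into blocks, then interleave blocks across rows
    return x.reshape(e, m, n // block, block).transpose(0, 2, 1, 3).reshape(e, m, n)

x = np.arange(2 * 2 * 8).reshape(2, 2, 8)
y = swizzle_blocks(x)
```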

@hlu1 hlu1 deleted the feat/gpt-oss branch November 14, 2025 21:59