[NPU]Support GPT-OSS for NPU by Todobe · Pull Request #14197 · sgl-project/sglang

Todobe · 2025-12-01T07:24:05Z

Co-author: @McZyWu

Motivation

Adapting GPT-OSS model for NPU.

Modifications

Added operators to the Ascend attention backend that handle sinks and sliding windows.
Replaced the swiglu function in torch_npu, which was affecting precision.
Adapted other NPU-incompatible method calls.
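
To make the first modification concrete, here is a minimal pure-Python sketch of single-head causal attention with a sink logit and a sliding window. It is an illustration only, not the Ascend operator added by this PR: the function name `sink_window_attention`, the scalar `sink` parameter, and the window convention (query `i` sees keys `j` with `i - window < j <= i`) are assumptions for this example. The sink is treated as an extra logit that takes part in softmax normalization but contributes no value vector.

```python
import math


def sink_window_attention(q, k, v, sink, window):
    """Reference single-head attention (hypothetical helper, not the NPU kernel).

    q, k, v: lists of equal length, each item a list of floats (one per token).
    sink:    scalar logit that joins the softmax denominator only.
    window:  number of most recent keys (including the current one) a query may see.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dim = len(v[0])
    out = []
    for i, qi in enumerate(q):
        # Causal + sliding-window mask: keep keys j with i - window < j <= i.
        visible = range(max(0, i - window + 1), i + 1)
        logits = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in visible]
        # Numerically stable softmax over the visible keys plus the sink logit.
        m = max(logits + [sink])
        exps = [math.exp(x - m) for x in logits]
        denom = sum(exps) + math.exp(sink - m)
        weights = [e / denom for e in exps]  # the sink's own weight is discarded
        out.append(
            [sum(w * v[j][d] for w, j in zip(weights, visible)) for d in range(dim)]
        )
    return out


if __name__ == "__main__":
    q = [[0.0], [0.0], [0.0]]
    v = [[1.0], [1.0], [1.0]]
    print(sink_window_attention(q, q, v, sink=0.0, window=2))
```

Because the sink absorbs part of the probability mass, the attention weights over real keys sum to less than 1; passing `sink=float("-inf")` recovers ordinary windowed softmax attention.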

Accuracy Tests

We tested the GPT-oss-120b-bf16 model on 256 questions from the GSM8K dataset with max_new_tokens=512. The accuracy was 84.4%.

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

gemini-code-assist · 2025-12-01T07:24:23Z

Summary of Changes

Hello @Todobe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the SGLang framework by enabling the GPT-OSS model to operate effectively on NPU hardware. The changes involve fundamental adjustments to the attention mechanism and activation functions within the Ascend backend, ensuring both compatibility and improved performance. The overall goal is to broaden the hardware support for the GPT-OSS model, allowing it to leverage the specialized capabilities of NPUs.

Highlights

NPU Adaptation: The GPT-OSS model has been adapted to run on NPU (Neural Processing Unit) hardware, specifically targeting the Ascend backend.
Attention Mechanism Enhancements: New operators capable of handling 'sinks' and 'sliding windows' have been integrated into the Ascend backend for the attention mechanism, improving its functionality on NPU.
Precision Fix for SwiGLU: The default swiglu function in torch_npu was replaced with a custom implementation to resolve precision issues affecting the GPT-OSS model on NPU.
NPU Compatibility: Various other incompatible method calls across the codebase have been identified and adapted to ensure proper functioning on NPU.
Accuracy Improvement: Accuracy tests on the GPT-oss-120b-bf16 model on the GSM8K dataset showed 67.2% accuracy on NPU, which is an improvement over the 62.5% observed on GPU for the same test.
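
For context on the SwiGLU item, the sketch below contrasts a plain SwiGLU with the clamped, scaled variant used by GPT-OSS-style MoE layers. It assumes the formulation from the public GPT-OSS reference code (interleaved gate/linear channels, `alpha = 1.702`, clamp limit `7.0`, and a `+ 1` on the linear branch). These constants and the function names are assumptions for illustration and are not taken from this PR's diff or from torch_npu.

```python
import math


def sigmoid(x: float) -> float:
    # Branch on sign so math.exp never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swiglu_plain(x):
    """A common SwiGLU layout: the first half gates the second half."""
    half = len(x) // 2
    return [g * sigmoid(g) * u for g, u in zip(x[:half], x[half:])]


def swiglu_oai_sketch(x, alpha=1.702, limit=7.0):
    """Clamped variant (assumed form): even channels gate, odd channels are linear."""
    gate, linear = x[0::2], x[1::2]
    gate = [min(g, limit) for g in gate]  # clamp from above only
    linear = [max(-limit, min(u, limit)) for u in linear]  # clamp both sides
    return [g * sigmoid(alpha * g) * (u + 1.0) for g, u in zip(gate, linear)]


if __name__ == "__main__":
    x = [1.0, 2.0, 100.0, 100.0]
    print(swiglu_plain(x))
    print(swiglu_oai_sketch(x))
```

The two functions differ in channel layout, scaling, and clamping, so substituting one for the other changes the numerical result even when both run without error.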


gemini-code-assist

Code Review

This pull request adapts the GPT-OSS model for NPU by adding operators for sinks and sliding windows, replacing a problematic swiglu function, and other NPU-specific adaptations. The changes are generally well-structured. I've identified a critical bug in the activation function logic for NPU MoE layers, which could affect models other than GPT-OSS. I've also suggested a refactoring to reduce code duplication in the attention backend. Addressing these points will improve the correctness and maintainability of the code.

python/sglang/srt/layers/quantization/unquant.py

python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py

python/sglang/srt/models/gpt_oss.py

iforgetmyname · 2026-01-16T07:31:26Z

python/sglang/srt/layers/moe/fused_moe_triton/layer.py

revert this file

iforgetmyname · 2026-01-16T07:31:33Z

python/sglang/srt/layers/moe/moe_runner/base.py

revert this file

iforgetmyname · 2026-01-16T07:32:56Z

python/sglang/srt/layers/quantization/unquant.py

+        if self.moe_runner_config.custom_act_fn is not None:
+            hidden_states = self.moe_runner_config.custom_act_fn(layer, hidden_states)
+        elif self.moe_runner_config.activation == "silu":


Add a `swiglu_oai` option for the activation here.
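
One way to read this request is as a string-keyed dispatch, with `swiglu_oai` as one more named activation instead of a callable passed through the runner config. The sketch below is hypothetical: `apply_activation`, `ACTIVATIONS`, the `(gate, up)` pair layout, and the default `alpha` and `limit` values are invented for illustration and do not mirror the actual code in `unquant.py`.

```python
import math


def _sigmoid(x: float) -> float:
    # Branch on sign so math.exp never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def silu_and_mul(pairs):
    """pairs: list of (gate, up) floats."""
    return [g * _sigmoid(g) * u for g, u in pairs]


def swiglu_oai(pairs, alpha=1.702, limit=7.0):
    """Clamped variant; alpha and limit are assumed defaults, not read from the PR."""
    out = []
    for g, u in pairs:
        g = min(g, limit)
        u = max(-limit, min(u, limit))
        out.append(g * _sigmoid(alpha * g) * (u + 1.0))
    return out


# Name -> implementation; adding an activation means adding one entry.
ACTIVATIONS = {"silu": silu_and_mul, "swiglu_oai": swiglu_oai}


def apply_activation(name, pairs):
    try:
        fn = ACTIVATIONS[name]
    except KeyError:
        raise ValueError(f"Unsupported activation: {name!r}") from None
    return fn(pairs)
```

With this shape, an unknown activation name fails loudly instead of silently falling through to a default branch.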

iforgetmyname · 2026-01-16T07:34:04Z

python/sglang/srt/models/gpt_oss.py

+        custom_act_fn = None
+        if _is_npu:
+            if self.layer_id == 0:
+                logger.warning(
+                    "Warning: GPT-OSS use custom activate function on FusedMoE when using ascend backend."
+                )
+            custom_act_fn = swiglu_oai


Revert this part of the changes; instead, modify the model config when initializing the model.
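
A rough sketch of what this could look like: the activation is chosen once, by name, when the model is constructed, so layers read it from the config instead of receiving a callable. `MoeConfigSketch` and `resolve_moe_config` are hypothetical names for this example and are not sglang classes or functions.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class MoeConfigSketch:
    """Hypothetical stand-in for the model/MoE config; not an sglang class."""

    activation: str = "silu"
    gemm1_clamp_limit: Optional[float] = None


def resolve_moe_config(config: MoeConfigSketch, on_npu: bool) -> MoeConfigSketch:
    """Choose the activation name once, when the model is constructed."""
    if on_npu:
        # Assumption: the NPU path selects the named activation via config.
        return replace(config, activation="swiglu_oai")
    return config


base = MoeConfigSketch(gemm1_clamp_limit=7.0)
npu_config = resolve_moe_config(base, on_npu=True)
print(npu_config.activation)  # swiglu_oai
print(base.activation)  # silu
```

Keeping the choice in the config means the model file needs no per-layer warning or platform branch when building the experts.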

iforgetmyname · 2026-01-16T07:34:15Z

python/sglang/srt/models/gpt_oss.py

            gemm1_clamp_limit=self.gemm1_clamp_limit,
            with_bias=True,
            prefix=add_prefix("experts", prefix),
+            custom_act_fn=custom_act_fn,


iforgetmyname · 2026-01-16T07:34:36Z

python/sglang/srt/models/gpt_oss.py

-            ),
-        )
+        extra_args = {}
+        if _is_cuda:


Suggested change

-        if _is_cuda:
+        if not _is_npu:
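
The practical difference between the two guards shows up on backends that are neither CUDA nor NPU. A small sketch, with hypothetical platform flags and a placeholder `extra_args` key (neither is taken from the PR):

```python
def build_extra_args(is_cuda: bool, is_npu: bool, use_suggested_guard: bool) -> dict:
    """Return the optional kwargs under either guard (placeholder key only)."""
    extra_args = {}
    enabled = (not is_npu) if use_suggested_guard else is_cuda
    if enabled:
        extra_args["placeholder_option"] = True
    return extra_args


# A backend that is neither CUDA nor NPU (for example a CPU-only build).
print(build_extra_args(is_cuda=False, is_npu=False, use_suggested_guard=False))
# {}
print(build_extra_args(is_cuda=False, is_npu=False, use_suggested_guard=True))
# {'placeholder_option': True}
```

Under `if _is_cuda:` such a backend loses the extra arguments; under `if not _is_npu:` only the NPU path is excluded.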

python/sglang/srt/models/gpt_oss.py

iforgetmyname · 2026-01-16T15:11:48Z

/tag-and-rerun-ci


ck-intel · 2026-01-21T17:45:46Z

python/sglang/srt/models/gpt_oss.py

        self.vocab_size = config.vocab_size
        self.pp_group = get_pp_group()

+        if is_npu:


This should be _is_npu, otherwise this condition will always be True and will hinder running on other backends.
@Todobe can you fix it?
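
The bug described here is the usual Python pitfall of testing a function object instead of its result. A minimal sketch, assuming (as the comment implies) that `is_npu` is a callable and `_is_npu` holds its cached return value; the body of `is_npu` below is a stand-in, not the real platform check:

```python
def is_npu() -> bool:
    """Stand-in platform probe; pretend we are NOT on an NPU."""
    return False


_is_npu = is_npu()  # evaluated once, for example at import time

# Buggy: a function object is always truthy, so this branch always runs.
buggy_branch_taken = bool(is_npu)

# Fixed: test the cached boolean instead.
fixed_branch_taken = bool(_is_npu)

print(buggy_branch_taken, fixed_branch_taken)  # True False
```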

Todobe added 2 commits November 27, 2025 16:19

adapt gpt-oss-120b

1d307a1

move sinks_attention to sgl-kernel-npu

fde8ed1

Todobe requested review from AniZpZ, BBuf, Edwardf0t1, FlamingoPg, Fridge003, HaiShaw, Ying1123, ch-wan, iforgetmyname, ispobock, merrymercy and ping1jing2 as code owners December 1, 2025 07:24

github-actions bot added quant LLM Quantization npu labels Dec 1, 2025

Todobe changed the title ~~Gpt oss~~ Adapt GPT-OSS for NPU Dec 1, 2025

Todobe changed the title ~~Adapt GPT-OSS for NPU~~ Support GPT-OSS for NPU Dec 1, 2025

Merge branch 'main' into gpt-oss

621f137

gemini-code-assist bot reviewed Dec 1, 2025

python/sglang/srt/layers/quantization/unquant.py (resolved)

python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py (resolved)

fix silu

9212884

ping1jing2 self-assigned this Dec 2, 2025

ping1jing2 reviewed Dec 2, 2025

python/sglang/srt/models/gpt_oss.py (outdated, resolved)

ping1jing2 changed the title ~~Support GPT-OSS for NPU~~ [Ascend]Support GPT-OSS for NPU Dec 2, 2025

Todobe and others added 5 commits December 4, 2025 10:51

move swiglu_oai to sgl-kernel-npu

d215df7

fix swiglu_oai

19a130e

Merge branch 'main' into gpt-oss

7bdb2c2

Merge remote-tracking branch 'shequ/main' into gpt-oss

d6a45b9

Merge branch 'main' into gpt-oss

f148f9f

Todobe and others added 3 commits December 8, 2025 14:08

Merge branch 'main' into gpt-oss

a87d4e4

Merge branch 'sgl-project:main' into gpt-oss

240e0ec

gptoss adpat prefix cache

17a0309

iforgetmyname reviewed Jan 16, 2026

modify swiglu activation

3a8e0a1

iforgetmyname approved these changes Jan 16, 2026

github-actions bot added the run-ci label Jan 16, 2026

iforgetmyname and others added 5 commits January 17, 2026 15:52

Merge branch 'main' into gpt-oss

703dd42

fix bias

d46f1f5

Merge branch 'gpt-oss' of https://github.com/Todobe/sgl-sglang into gpt-oss

23af5c4

rm extra file

30a7bd2

Merge branch 'main' into gpt-oss

8e9f5fe

iforgetmyname changed the title ~~[Ascend]Support GPT-OSS for NPU~~ [NPU]Support GPT-OSS for NPU Jan 18, 2026

iforgetmyname merged commit 733de6b into sgl-project:main Jan 18, 2026
375 of 401 checks passed

ck-intel reviewed Jan 21, 2026

ck-intel mentioned this pull request Jan 22, 2026

[NPU] [Bug Fix] Fix typo in npu device check in gpt_oss.py #17553

Merged

iforgetmyname merged 18 commits into sgl-project:main from Todobe:gpt-oss

4 participants