Support triton_kernels for GPT-OSS on SM120 #19718
Kangyan-Zhou merged 1 commit into sgl-project:main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
/tag-and-rerun-ci
Did someone test this after rebase with an RTX PRO 6000 @b8zhong?
@amittell I used a 5090. It's the same SM capability, so it's fine.
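(For context, a quick sketch of why the two cards are interchangeable here. The capability table below is hand-written from NVIDIA's published specs, not queried from hardware; at runtime you would call `torch.cuda.get_device_capability()` instead.)

```python
# Hand-written capability table (assumption: values match NVIDIA's
# published specs for Blackwell consumer/workstation parts).
DEVICE_CAPABILITY = {
    "GeForce RTX 5090": (12, 0),
    "RTX PRO 6000 Blackwell": (12, 0),
}

def sm_arch(capability: tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an sm_XXX string."""
    major, minor = capability
    return f"sm_{major}{minor}"

# Both cards report SM 12.0, so a kernel gated on sm_120 behaves the same.
assert sm_arch(DEVICE_CAPABILITY["GeForce RTX 5090"]) == "sm_120"
assert sm_arch(DEVICE_CAPABILITY["RTX PRO 6000 Blackwell"]) == "sm_120"
```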
@b8zhong I would appreciate being added as a co-author via Co-authored-by: in the commit message, not just the PR description. Or I can reopen my original PR that I authored.
Sure, no problem. I'll make sure to include it. Could you send your email (and, ideally, the Co-authored-by string)? Thanks~
@b8zhong Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>
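(For reference, GitHub only links the co-author when the email is wrapped in angle brackets; a squash-commit message of roughly this shape is what gets parsed — the title line here is assumed from the PR title:)

```
Support triton_kernels for GPT-OSS on SM120 (#19718)

Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>
```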
Force-pushed from ba22475 to ef78f22 (Compare)
@b8zhong Still no co-author attribution...? Can we just merge my original PR? I'll handle the rebase.
It's in the auto-squash message, FYI. It'll appear when it's merged.
@amittell I added your name in the force-merge message, please check. Thanks for your contributions!
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
@b8zhong looks like this doesn't work with `sglang serve --model-path openai/gpt-oss-120b --reasoning-parser gpt-oss --tool-call-parser gpt-oss`:
...
[2026-03-06 08:43:46] Using KV cache dtype: torch.bfloat16
[2026-03-06 08:43:46] Use sliding window memory pool. full_layer_tokens=318715, swa_layer_tokens=254972
[2026-03-06 08:43:46] KV Cache is allocated. #tokens: 254972, K size: 4.38 GB, V size: 4.38 GB
[2026-03-06 08:43:46] KV Cache is allocated. #tokens: 318715, K size: 5.47 GB, V size: 5.47 GB
[2026-03-06 08:43:46] SWAKVPool mem usage: 19.70 GB, swa size: 254972, full size: 318715
[2026-03-06 08:43:46] Memory pool end. avail mem=12.33 GB
[2026-03-06 08:43:46] Capture cuda graph begin. This can take up to several minutes. avail mem=12.25 GB
[2026-03-06 08:43:46] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=11.50 GB): 100%|██████████| 36/36 [01:09<00:00, 1.93s/it]
[2026-03-06 08:44:56] Capture cuda graph end. Time elapsed: 69.83 s. mem usage=0.75 GB. avail mem=11.49 GB.
[2026-03-06 08:44:56] Capture piecewise CUDA graph begin. avail mem=11.49 GB
[2026-03-06 08:44:56] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192]
[2026-03-06 08:45:00] install_torch_compiled
Compiling num tokens (num_tokens=8192): 0%| | 0/58 [00:00<?, ?it/s]/root/.local/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
torch._dynamo.utils.warn_once(msg)
[2026-03-06 08:45:08] Initializing SGLangBackend
[2026-03-06 08:45:08] SGLangBackend __call__
[2026-03-06 08:45:10] Compiling a graph for dynamic shape takes 0.64 s
[2026-03-06 08:45:10] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1772786710.2688763.py
Compiling num tokens (num_tokens=8192): 0%| | 0/58 [00:12<?, ?it/s]
[2026-03-06 08:45:13] Piecewise CUDA Graph failed with error:
Piecewise CUDA Graph is enabled by default as an experimental feature.
To work around this error, add --disable-piecewise-cuda-graph to your launch command.
Please report this issue at https://github.com/sgl-project/sglang/issues/new/choose
[2026-03-06 08:45:13] Scheduler hit an exception: Traceback (most recent call last):
File "/sglang/python/sglang/srt/managers/scheduler.py", line 3237, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/sglang/python/sglang/srt/managers/scheduler.py", line 365, in __init__
self.init_model_worker()
File "/sglang/python/sglang/srt/managers/scheduler.py", line 561, in init_model_worker
self.init_tp_model_worker()
File "/sglang/python/sglang/srt/managers/scheduler.py", line 519, in init_tp_model_worker
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/managers/tp_worker.py", line 258, in __init__
self._init_model_runner()
File "/sglang/python/sglang/srt/managers/tp_worker.py", line 341, in _init_model_runner
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 416, in __init__
self.initialize(min_per_gpu_memory)
File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 640, in initialize
self.init_piecewise_cuda_graphs()
File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 2247, in init_piecewise_cuda_graphs
self.piecewise_cuda_graph_runner = PiecewiseCudaGraphRunner(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 306, in __init__
self.warmup_compile(num_tokens=num_tokens)
File "/sglang/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 403, in warmup_compile
_ = self.model_runner.model.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/models/gpt_oss.py", line 635, in forward
hidden_states = self.model(
^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/compilation/compile.py", line 194, in trampoline
_ensure_compiled(self, *args, **kwargs)
File "/sglang/python/sglang/srt/compilation/compile.py", line 185, in _ensure_compiled
compiled_callable(*args, **kwargs)
File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 832, in compile_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/models/gpt_oss.py", line 541, in forward
def forward(
File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 414, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 413, in __call__
raise e
File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<eval_with_key>.74", line 269, in forward
submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_bias_, l_self_modules_layers_modules_0_layer_communicator_post_attention_layernorm_parameters_weight_, getitem_4, l_self_modules_layers_modules_1_layer_communicator_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_); getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_bias_ = l_self_modules_layers_modules_0_layer_communicator_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_layer_communicator_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/compilation/cuda_piecewise_backend.py", line 111, in __call__
return self.compiled_graph_for_general_shape(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 413, in __call__
raise e
File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<eval_with_key>.3", line 9, in forward
moe_impl = torch.ops.sglang.moe_impl(0, linear); linear = None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/_ops.py", line 1255, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/models/gpt_oss.py", line 206, in moe_impl
final_hidden_states = moe_fusion.experts(hidden_states, topk_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 961, in forward
return self.forward_impl(hidden_states, topk_output)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 990, in forward_impl
combine_input = self.run_moe_core(
^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1011, in run_moe_core
return self.quant_method.apply(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/layers/quantization/mxfp4.py", line 902, in apply
return self.runner.run(dispatch_output, quant_info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/layers/moe/moe_runner/runner.py", line 96, in run
runner_output = self.runner_core.run(runner_input, quant_info, running_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/layers/moe/moe_runner/triton_kernels.py", line 115, in run
output = triton_kernel_fused_experts_with_bias(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/triton_kernels_moe.py", line 306, in triton_kernel_fused_experts_with_bias
matmul_ogs(
File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs.py", line 370, in matmul_ogs
opt_flags = make_opt_flags(out_dtype, x.dtype, w.dtype, precision_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/opt_flags.py", line 302, in make_opt_flags
return make_default_opt_flags_nvidia(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/opt_flags.py", line 218, in make_default_opt_flags_nvidia
assert num_stages >= 1
^^^^^^^^^^^^^^^
AssertionError
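(A hedged sketch of the failure mode, not the actual triton_kernels code: opt_flags picks a software-pipelining depth from available shared memory, and a tile configuration too large for the device can drive the stage count to zero, tripping exactly this kind of assert. The function name and the numbers below are illustrative.)

```python
def pick_num_stages(smem_capacity: int, bytes_per_stage: int,
                    max_stages: int = 4) -> int:
    """Illustrative stage selection: as many pipeline stages as fit in
    shared memory, capped at max_stages. Not the real opt_flags logic."""
    return min(max_stages, smem_capacity // bytes_per_stage)

# A tile that fits leaves room for several stages:
assert pick_num_stages(101_376, 32_768) == 3
# A tile larger than shared memory yields 0 stages, which would trip
# an `assert num_stages >= 1` like the one in the traceback:
assert pick_num_stages(101_376, 131_072) == 0
```

The error message above already names the practical workaround: add `--disable-piecewise-cuda-graph` to the launch command.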
@Kangyan-Zhou Thanks, but the co-author line is malformed -- missing angle brackets around the email, so GitHub doesn't actually link it to my profile:

Co-authored-by: amittell 1388680+amittell@users.noreply.github.com

Should be:

Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>

Could you fix that? Separately -- I saw @mmangkad's report about the RTX PRO 6000 Server Edition failing with
Hi @amittell @Wangzheee @magicYang1573 @b8zhong, is that triton kernel faster than the marvel kernel?
@geraldstanje1 There is no "marvel kernel" in SGLang -- I'm not sure what you're referring to. The available MoE backends for MXFP4 on SM120 are triton (standard), cutlass, and triton_kernel.

I tested all three viable backends on an NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM120, 98GB) with GPT-OSS-120B, 5 runs per context length, 200 output tokens:

- triton_kernel (this PR) -- works
- triton (standard) -- OOM
- cutlass -- OOM, same error as triton

Both alternatives fail. Key finding: on a single RTX PRO 6000 (98GB), triton_kernel is the only backend that runs. For reference, comparing against vLLM on the same hardware (200 token output): vLLM is slightly faster at short context but doesn't support context lengths beyond ~48K on this hardware. SGLang with triton_kernel scales to 131K.

BTW @Wangzheee @magicYang1573 @b8zhong I never got that co-author attribution for my work; the email was missing the <> angle brackets, so GitHub ignored it. Any chance you can rectify that, please?


Tested on 2 x 5090:
This PR was written by @amittell in #16975; I just rebased the changes and tested the code.
Requires:
pip install triton_kernels --no-deps

Looks alright. Around 260 TPS.
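(The TPS figure is just generated tokens over decode wall-clock time; a trivial sketch, where the elapsed time is a hypothetical value chosen to land near the quoted number:)

```python
def tokens_per_second(output_tokens: int, elapsed_s: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock seconds."""
    return output_tokens / elapsed_s

# e.g. 200 output tokens decoded in ~0.77 s is about 260 TPS
print(round(tokens_per_second(200, 0.769)))  # -> 260
```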