
Support triton_kernels for GPT-OSS on SM120 #19718

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from bzhng-development:brayden/sm120-triton-kernel
Mar 3, 2026

Conversation

@b8zhong
Collaborator

@b8zhong b8zhong commented Mar 2, 2026

Tested on 2 x 5090:

This PR was written by @amittell in #16975; I just rebased the changes and tested the code.

Requires:
pip install triton_kernels --no-deps

python -m sglang.launch_server \
  --model openai/gpt-oss-20b \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss \
  --tp 2

Looks alright. Around 260 TPS

python3 -m sglang.test.send_one --stream --max-new-tokens 2048 --prompt "Fully explain a linear layer from scratch."


A linear layer, also known as a fully connected layer or dense layer, is a fundamental building block in neural networks. It transforms input data into a new representation by applying a linear transformation; in practice, a non-linear activation function is then applied as a separate step. Let's break down the components and operations involved in a linear layer:

1. Input: The linear layer receives an input vector or matrix, denoted as X. The input can be a single data point or a batch of data points.

2. Weights: The linear layer has a set of learnable parameters called weights, denoted as W. The weights are typically represented as a matrix of shape (input_dim, output_dim), where input_dim is the dimensionality of the input and output_dim is the desired dimensionality of the output.

3. Bias: In addition to weights, the linear layer also has a bias term, denoted as b. The bias is a learnable parameter that allows the layer to shift the output independently of the input. The bias is typically represented as a vector of shape (output_dim,).

4. Linear Transformation: The linear layer applies a linear transformation to the input by performing a matrix multiplication between the input X and the weight matrix W. This operation can be expressed as:

   Y = X * W
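The matmul-plus-bias described above can be sketched in a few lines of NumPy (my minimal illustration, not part of the PR or the model output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch of 4 inputs with input_dim=3, projected to output_dim=2.
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))   # weights: shape (input_dim, output_dim)
b = np.zeros(2)                   # bias: shape (output_dim,)

Y = X @ W + b                     # the linear transformation Y = X * W + b
print(Y.shape)                    # (4, 2): one output_dim-vector per input row
```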

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@b8zhong b8zhong marked this pull request as ready for review March 2, 2026 22:24

@b8zhong
Collaborator Author

b8zhong commented Mar 2, 2026

/tag-and-rerun-ci

@amittell

amittell commented Mar 2, 2026

Did someone test this after the rebase with an RTX PRO 6000, @b8zhong?

@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

@amittell I used 5090. It's the same SM capability so it's fine

@amittell

amittell commented Mar 3, 2026

> @amittell I used 5090. It's the same SM capability so it's fine

@b8zhong would appreciate being added as a co-author via Co-authored-by: in the commit message, not just the PR description. Or I can reopen my original PR that I authored.

@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

Sure, no problem. I'll make sure to include it. Could you send your email and, ideally, the full Co-authored-by string? Thanks~

@amittell

amittell commented Mar 3, 2026

> Sure no problem. I'll make sure to include it. Could you send your email and (ideally, the co-authored by string)? Thanks~

@b8zhong Co-authored-by: amittell 1388680+amittell@users.noreply.github.com

@b8zhong b8zhong enabled auto-merge (squash) March 3, 2026 01:00
@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

[Screenshot 2026-03-02 at 8:00:21 PM]

@amittell

amittell commented Mar 3, 2026

> [Screenshot 2026-03-02 at 8:00:21 PM]

Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>

Without the < >, git won't parse it correctly and it won't show up in the contributor graph. Luckily, it looks like you didn't commit yet.
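For reference, the trailer format git recognizes can be checked locally. A throwaway-repo sketch (the repo path is hypothetical; GitHub additionally requires the email wrapped in <> to link the co-author's profile):

```shell
# Create a throwaway repo and make an empty commit carrying the trailer.
git init -q /tmp/coauthor-demo
cd /tmp/coauthor-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -q \
    -m "Support triton_kernels for GPT-OSS on SM120" \
    -m "Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>"
# Print the trailers git parses out of the last commit message.
git log -1 --format=%B | git interpret-trailers --parse
```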

@b8zhong b8zhong disabled auto-merge March 3, 2026 01:29
@b8zhong b8zhong enabled auto-merge (squash) March 3, 2026 01:29
@b8zhong b8zhong force-pushed the brayden/sm120-triton-kernel branch from ba22475 to ef78f22 Compare March 3, 2026 15:11
@amittell

amittell commented Mar 3, 2026

@b8zhong Still no co-author attribution...? Can we just merge my original PR? I'll handle the rebase.

@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

It's in the auto squash message FYI. It'll appear when it's merged

@Kangyan-Zhou Kangyan-Zhou disabled auto-merge March 3, 2026 22:13
@Kangyan-Zhou Kangyan-Zhou enabled auto-merge (squash) March 3, 2026 22:13
@Kangyan-Zhou Kangyan-Zhou disabled auto-merge March 3, 2026 22:13
@Kangyan-Zhou Kangyan-Zhou merged commit 9305f0e into sgl-project:main Mar 3, 2026
94 of 103 checks passed
@Kangyan-Zhou
Collaborator

@amittell I added your name in the force-merge message, please check. Thanks for your contributions!

Kangyan-Zhou pushed a commit to Kangyan-Zhou/sglang that referenced this pull request Mar 4, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
qeternity pushed a commit to qeternity/sglang that referenced this pull request Mar 6, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
@mmangkad
Contributor

mmangkad commented Mar 6, 2026

@b8zhong looks like this doesn't work on an NVIDIA RTX PRO 6000 Blackwell Server Edition. I'm getting the error below:

sglang serve --model-path openai/gpt-oss-120b --reasoning-parser gpt-oss --tool-call-parser gpt-oss
...
[2026-03-06 08:43:46] Using KV cache dtype: torch.bfloat16
[2026-03-06 08:43:46] Use sliding window memory pool. full_layer_tokens=318715, swa_layer_tokens=254972
[2026-03-06 08:43:46] KV Cache is allocated. #tokens: 254972, K size: 4.38 GB, V size: 4.38 GB
[2026-03-06 08:43:46] KV Cache is allocated. #tokens: 318715, K size: 5.47 GB, V size: 5.47 GB
[2026-03-06 08:43:46] SWAKVPool mem usage: 19.70 GB, swa size: 254972, full size: 318715
[2026-03-06 08:43:46] Memory pool end. avail mem=12.33 GB
[2026-03-06 08:43:46] Capture cuda graph begin. This can take up to several minutes. avail mem=12.25 GB
[2026-03-06 08:43:46] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=11.50 GB): 100%|██████████| 36/36 [01:09<00:00,  1.93s/it]
[2026-03-06 08:44:56] Capture cuda graph end. Time elapsed: 69.83 s. mem usage=0.75 GB. avail mem=11.49 GB.
[2026-03-06 08:44:56] Capture piecewise CUDA graph begin. avail mem=11.49 GB
[2026-03-06 08:44:56] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192]
[2026-03-06 08:45:00] install_torch_compiled
Compiling num tokens (num_tokens=8192):   0%|          | 0/58 [00:00<?, ?it/s]/root/.local/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
[2026-03-06 08:45:08] Initializing SGLangBackend
[2026-03-06 08:45:08] SGLangBackend __call__
[2026-03-06 08:45:10] Compiling a graph for dynamic shape takes 0.64 s
[2026-03-06 08:45:10] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1772786710.2688763.py
Compiling num tokens (num_tokens=8192):   0%|          | 0/58 [00:12<?, ?it/s]
[2026-03-06 08:45:13] Piecewise CUDA Graph failed with error: 
Piecewise CUDA Graph is enabled by default as an experimental feature.
To work around this error, add --disable-piecewise-cuda-graph to your launch command.
Please report this issue at https://github.com/sgl-project/sglang/issues/new/choose
[2026-03-06 08:45:13] Scheduler hit an exception: Traceback (most recent call last):
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 3237, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 365, in __init__
    self.init_model_worker()
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 561, in init_model_worker
    self.init_tp_model_worker()
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 519, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/managers/tp_worker.py", line 258, in __init__
    self._init_model_runner()
  File "/sglang/python/sglang/srt/managers/tp_worker.py", line 341, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 416, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 640, in initialize
    self.init_piecewise_cuda_graphs()
  File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 2247, in init_piecewise_cuda_graphs
    self.piecewise_cuda_graph_runner = PiecewiseCudaGraphRunner(self)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 306, in __init__
    self.warmup_compile(num_tokens=num_tokens)
  File "/sglang/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 403, in warmup_compile
    _ = self.model_runner.model.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/models/gpt_oss.py", line 635, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/compilation/compile.py", line 194, in trampoline
    _ensure_compiled(self, *args, **kwargs)
  File "/sglang/python/sglang/srt/compilation/compile.py", line 185, in _ensure_compiled
    compiled_callable(*args, **kwargs)
  File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 832, in compile_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/models/gpt_oss.py", line 541, in forward
    def forward(
  File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 414, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 413, in __call__
    raise e
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.74", line 269, in forward
    submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_bias_, l_self_modules_layers_modules_0_layer_communicator_post_attention_layernorm_parameters_weight_, getitem_4, l_self_modules_layers_modules_1_layer_communicator_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_);  getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_bias_ = l_self_modules_layers_modules_0_layer_communicator_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_layer_communicator_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/compilation/cuda_piecewise_backend.py", line 111, in __call__
    return self.compiled_graph_for_general_shape(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 413, in __call__
    raise e
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.3", line 9, in forward
    moe_impl = torch.ops.sglang.moe_impl(0, linear);  linear = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/_ops.py", line 1255, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/models/gpt_oss.py", line 206, in moe_impl
    final_hidden_states = moe_fusion.experts(hidden_states, topk_output)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 961, in forward
    return self.forward_impl(hidden_states, topk_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 990, in forward_impl
    combine_input = self.run_moe_core(
                    ^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1011, in run_moe_core
    return self.quant_method.apply(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/quantization/mxfp4.py", line 902, in apply
    return self.runner.run(dispatch_output, quant_info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/moe_runner/runner.py", line 96, in run
    runner_output = self.runner_core.run(runner_input, quant_info, running_state)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/moe_runner/triton_kernels.py", line 115, in run
    output = triton_kernel_fused_experts_with_bias(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/triton_kernels_moe.py", line 306, in triton_kernel_fused_experts_with_bias
    matmul_ogs(
  File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs.py", line 370, in matmul_ogs
    opt_flags = make_opt_flags(out_dtype, x.dtype, w.dtype, precision_config,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/opt_flags.py", line 302, in make_opt_flags
    return make_default_opt_flags_nvidia(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/opt_flags.py", line 218, in make_default_opt_flags_nvidia
    assert num_stages >= 1
           ^^^^^^^^^^^^^^^
AssertionError

@amittell

amittell commented Mar 6, 2026

@Kangyan-Zhou Thanks, but the co-author line is malformed -- missing angle brackets around the email, so GitHub doesn't actually link it to my profile:

Co-authored-by: amittell 1388680+amittell@users.noreply.github.com

Should be:

Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>

Could you fix that?

Separately -- I saw @mmangkad's report about the RTX PRO 6000 Server Edition failing with assert num_stages >= 1, and the fix in #20040. Tested the fix on an RTX PRO 6000 Blackwell Workstation Edition (same SM120, 101KB shared memory per block) and can confirm it works. Server starts cleanly and inference runs fine.
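For context on that assertion: triton's matmul pipelining picks a stage count that must fit within the block's shared-memory budget. A hypothetical sketch of why a tile configuration tuned for a larger budget can drive the count to 0 on SM120 (the function and the tile number are mine, not triton_kernels source):

```python
# Hypothetical model, NOT triton_kernels code: how many pipeline stages
# of a given tile footprint fit in the shared-memory budget.
def pipeline_stages(smem_budget_kb: float, tile_kb: float) -> int:
    return int(smem_budget_kb // tile_kb)

H100_SMEM_KB = 228    # SM90 shared memory per block
SM120_SMEM_KB = 101   # RTX PRO 6000 Blackwell, per the comment above

big_tile_kb = 112     # made-up tile footprint sized for SM90-class budgets
print(pipeline_stages(H100_SMEM_KB, big_tile_kb))   # 2
print(pipeline_stages(SM120_SMEM_KB, big_tile_kb))  # 0 -> `assert num_stages >= 1` fires
```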

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
@geraldstanje1

geraldstanje1 commented Mar 22, 2026

hi @amittell @Wangzheee @magicYang1573 @b8zhong is that triton kernel faster than marvel kernel?

@amittell

amittell commented Mar 28, 2026

@geraldstanje1 There is no "marvel kernel" in SGLang -- I'm not sure what you're referring to. The available MoE backends for MXFP4 on SM120 are: triton_kernel (from the triton_kernels package, what this PR adds), triton (standard SGLang triton), cutlass, and flashinfer_* variants.

I tested all three viable backends on an NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM120, 98GB) with GPT-OSS-120B using lmsysorg/sglang:dev-cu13:

triton_kernel (this PR) -- 5 runs per context length, 200 output tokens

| Context | tok/s | std |
|---------|-------|-----|
| 4K      | 142.6 | 3.9 |
| 8K      | 129.9 | 0.1 |
| 16K     | 121.3 | 0.0 |
| 32K     | 103.4 | 0.0 |
| 64K     | 78.2  | 0.0 |

triton (standard) -- OOM

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.96 GiB.

cutlass -- OOM

Same error as triton. Both triton and cutlass backends attempt to dequantize MXFP4 weights to a larger format, requiring ~4GB extra VRAM that isn't available with the 120B model loaded.

Key finding

On a single RTX PRO 6000 (98GB), triton_kernel is the only MoE backend that can actually run GPT-OSS-120B because it handles FP4 natively without dequantization. The other backends OOM during weight loading.
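Back-of-envelope arithmetic supporting this (my numbers, not from the PR; assumes MXFP4's 4-bit values plus one shared 8-bit scale per 32-element block, and ~120B parameters as an order of magnitude):

```python
# Rough weight-memory comparison: MXFP4 vs. dequantized bf16.
params = 120e9                 # GPT-OSS-120B, order-of-magnitude parameter count
mxfp4_bits = 4 + 8 / 32        # 4.25 bits/weight incl. the per-block fp8 scale
bf16_bits = 16

mxfp4_gib = params * mxfp4_bits / 8 / 2**30
bf16_gib = params * bf16_bits / 8 / 2**30
print(f"mxfp4 ~{mxfp4_gib:.0f} GiB, bf16 ~{bf16_gib:.0f} GiB")
# Full bf16 dequantization wildly exceeds the card's 98 GB, so even
# per-layer dequant buffers (the ~4 GiB allocation in the OOM) can't fit
# once the MXFP4 weights, KV cache, and activations are resident.
```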

For reference, here's how it compares with vLLM on the same hardware (200 token output):

| Context | SGLang triton_kernel | vLLM baseline |
|---------|----------------------|---------------|
| 4K      | 142.6 tok/s          | 155.9 tok/s   |
| 8K      | 129.9 tok/s          | 155.1 tok/s   |
| 16K     | 121.3 tok/s          | 136.5 tok/s   |
| 32K     | 103.4 tok/s          | 109.5 tok/s   |
| 64K     | 78.2 tok/s           | --            |

vLLM is slightly faster at short context but doesn't support context lengths beyond ~48K on this hardware. SGLang with triton_kernel scales to 131K.

BTW @Wangzheee @magicYang1573 @b8zhong I never got that co-author attribution for my work; the merge message was missing the <> around the email, so GitHub ignored it. Any chance you can rectify this, please?
