
Support triton_kernels for GPT-OSS on SM120 #19718

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from bzhng-development:brayden/sm120-triton-kernel
Mar 3, 2026

Conversation

@b8zhong
Collaborator

@b8zhong b8zhong commented Mar 2, 2026

Tested on 2 x 5090:

This PR was written by @amittell in #16975; I just rebased the changes and tested the code.

Requires:
pip install triton_kernels --no-deps

python -m sglang.launch_server \
  --model openai/gpt-oss-20b \
  --reasoning-parser gpt-oss \
  --tool-call-parser gpt-oss \
  --tp 2

Looks alright. Around 260 TPS

python3 -m sglang.test.send_one --stream --max-new-tokens 2048 --prompt "Fully explain a linear layer from scratch."


A linear layer, also known as a fully connected layer or dense layer, is a fundamental building block in neural networks. It transforms input data into a new representation by applying a linear transformation; in practice, a non-linear activation function is then applied as a separate step. Let's break down the components and operations involved in a linear layer:

1. Input: The linear layer receives an input vector or matrix, denoted as X. The input can be a single data point or a batch of data points.

2. Weights: The linear layer has a set of learnable parameters called weights, denoted as W. The weights are typically represented as a matrix of shape (input_dim, output_dim), where input_dim is the dimensionality of the input and output_dim is the desired dimensionality of the output.

3. Bias: In addition to weights, the linear layer also has a bias term, denoted as b. The bias is a learnable parameter that allows the layer to shift the output independently of the input. The bias is typically represented as a vector of shape (output_dim,).

4. Linear Transformation: The linear layer applies a linear transformation to the input by performing a matrix multiplication between the input X and the weight matrix W. This operation can be expressed as:

   Y = X * W
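The matmul-plus-bias described above can be sketched in a few lines of NumPy (my minimal illustration, not part of the PR or the model output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch of 4 inputs with input_dim=3, projected to output_dim=2.
X = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))   # weights: shape (input_dim, output_dim)
b = np.zeros(2)                   # bias: shape (output_dim,)

Y = X @ W + b                     # the linear transformation Y = X * W + b
print(Y.shape)                    # (4, 2): one output_dim-vector per input row
```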

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@b8zhong b8zhong marked this pull request as ready for review March 2, 2026 22:24

@b8zhong
Collaborator Author

b8zhong commented Mar 2, 2026

/tag-and-rerun-ci

@amittell

amittell commented Mar 2, 2026

Did someone test this after the rebase with an RTX PRO 6000, @b8zhong?

@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

@amittell I used 5090. It's the same SM capability so it's fine

@amittell

amittell commented Mar 3, 2026

> @amittell I used 5090. It's the same SM capability so it's fine

@b8zhong would appreciate being added as a co-author via Co-authored-by: in the commit message, not just the PR description. Or I can reopen my original PR that I authored.

@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

Sure, no problem. I'll make sure to include it. Could you send your email and, ideally, the full Co-authored-by string? Thanks~

@amittell

amittell commented Mar 3, 2026

> Sure no problem. I'll make sure to include it. Could you send your email and (ideally, the co-authored by string)? Thanks~

@b8zhong Co-authored-by: amittell 1388680+amittell@users.noreply.github.com

@b8zhong b8zhong enabled auto-merge (squash) March 3, 2026 01:00
@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

[Screenshot 2026-03-02 at 8:00:21 PM]

@amittell

amittell commented Mar 3, 2026

> [Screenshot 2026-03-02 at 8:00:21 PM]

Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>

Without the < >, git won't parse it correctly and it won't show up in the contributor graph. Luckily, it looks like you didn't commit yet.
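For reference, the trailer format git recognizes can be checked locally. A throwaway-repo sketch (the repo path is hypothetical; GitHub additionally requires the email wrapped in <> to link the co-author's profile):

```shell
# Create a throwaway repo and make an empty commit carrying the trailer.
git init -q /tmp/coauthor-demo
cd /tmp/coauthor-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit --allow-empty -q \
    -m "Support triton_kernels for GPT-OSS on SM120" \
    -m "Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>"
# Print the trailers git parses out of the last commit message.
git log -1 --format=%B | git interpret-trailers --parse
```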

@b8zhong b8zhong disabled auto-merge March 3, 2026 01:29
@b8zhong b8zhong enabled auto-merge (squash) March 3, 2026 01:29
@b8zhong b8zhong force-pushed the brayden/sm120-triton-kernel branch from ba22475 to ef78f22 Compare March 3, 2026 15:11
@amittell

amittell commented Mar 3, 2026

@b8zhong Still no co-author attribution...? Can we just merge my original PR? I'll handle the rebase.

@b8zhong
Collaborator Author

b8zhong commented Mar 3, 2026

It's in the auto squash message FYI. It'll appear when it's merged

@Kangyan-Zhou Kangyan-Zhou disabled auto-merge March 3, 2026 22:13
@Kangyan-Zhou Kangyan-Zhou enabled auto-merge (squash) March 3, 2026 22:13
@Kangyan-Zhou Kangyan-Zhou disabled auto-merge March 3, 2026 22:13
@Kangyan-Zhou Kangyan-Zhou merged commit 9305f0e into sgl-project:main Mar 3, 2026
94 of 103 checks passed
@Kangyan-Zhou
Collaborator

@amittell I added your name in the force-merge message, please check. Thanks for your contributions!

Kangyan-Zhou pushed a commit to Kangyan-Zhou/sglang that referenced this pull request Mar 4, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
qeternity pushed a commit to qeternity/sglang that referenced this pull request Mar 6, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
@mmangkad
Contributor

mmangkad commented Mar 6, 2026

@b8zhong looks like this doesn't work on an NVIDIA RTX PRO 6000 Blackwell Server Edition. I'm getting the error below:

sglang serve --model-path openai/gpt-oss-120b --reasoning-parser gpt-oss --tool-call-parser gpt-oss
...
[2026-03-06 08:43:46] Using KV cache dtype: torch.bfloat16
[2026-03-06 08:43:46] Use sliding window memory pool. full_layer_tokens=318715, swa_layer_tokens=254972
[2026-03-06 08:43:46] KV Cache is allocated. #tokens: 254972, K size: 4.38 GB, V size: 4.38 GB
[2026-03-06 08:43:46] KV Cache is allocated. #tokens: 318715, K size: 5.47 GB, V size: 5.47 GB
[2026-03-06 08:43:46] SWAKVPool mem usage: 19.70 GB, swa size: 254972, full size: 318715
[2026-03-06 08:43:46] Memory pool end. avail mem=12.33 GB
[2026-03-06 08:43:46] Capture cuda graph begin. This can take up to several minutes. avail mem=12.25 GB
[2026-03-06 08:43:46] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=1 avail_mem=11.50 GB): 100%|██████████| 36/36 [01:09<00:00,  1.93s/it]
[2026-03-06 08:44:56] Capture cuda graph end. Time elapsed: 69.83 s. mem usage=0.75 GB. avail mem=11.49 GB.
[2026-03-06 08:44:56] Capture piecewise CUDA graph begin. avail mem=11.49 GB
[2026-03-06 08:44:56] Capture cuda graph num tokens [4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192]
[2026-03-06 08:45:00] install_torch_compiled
Compiling num tokens (num_tokens=8192):   0%|          | 0/58 [00:00<?, ?it/s]/root/.local/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
[2026-03-06 08:45:08] Initializing SGLangBackend
[2026-03-06 08:45:08] SGLangBackend __call__
[2026-03-06 08:45:10] Compiling a graph for dynamic shape takes 0.64 s
[2026-03-06 08:45:10] Computation graph saved to /root/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1772786710.2688763.py
Compiling num tokens (num_tokens=8192):   0%|          | 0/58 [00:12<?, ?it/s]
[2026-03-06 08:45:13] Piecewise CUDA Graph failed with error: 
Piecewise CUDA Graph is enabled by default as an experimental feature.
To work around this error, add --disable-piecewise-cuda-graph to your launch command.
Please report this issue at https://github.com/sgl-project/sglang/issues/new/choose
[2026-03-06 08:45:13] Scheduler hit an exception: Traceback (most recent call last):
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 3237, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 365, in __init__
    self.init_model_worker()
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 561, in init_model_worker
    self.init_tp_model_worker()
  File "/sglang/python/sglang/srt/managers/scheduler.py", line 519, in init_tp_model_worker
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/managers/tp_worker.py", line 258, in __init__
    self._init_model_runner()
  File "/sglang/python/sglang/srt/managers/tp_worker.py", line 341, in _init_model_runner
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 416, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 640, in initialize
    self.init_piecewise_cuda_graphs()
  File "/sglang/python/sglang/srt/model_executor/model_runner.py", line 2247, in init_piecewise_cuda_graphs
    self.piecewise_cuda_graph_runner = PiecewiseCudaGraphRunner(self)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 306, in __init__
    self.warmup_compile(num_tokens=num_tokens)
  File "/sglang/python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 403, in warmup_compile
    _ = self.model_runner.model.forward(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/models/gpt_oss.py", line 635, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/compilation/compile.py", line 194, in trampoline
    _ensure_compiled(self, *args, **kwargs)
  File "/sglang/python/sglang/srt/compilation/compile.py", line 185, in _ensure_compiled
    compiled_callable(*args, **kwargs)
  File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 832, in compile_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/models/gpt_oss.py", line 541, in forward
    def forward(
  File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 414, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 413, in __call__
    raise e
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.74", line 269, in forward
    submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_bias_, l_self_modules_layers_modules_0_layer_communicator_post_attention_layernorm_parameters_weight_, getitem_4, l_self_modules_layers_modules_1_layer_communicator_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_);  getitem_3 = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_self_attn_modules_o_proj_parameters_bias_ = l_self_modules_layers_modules_0_layer_communicator_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_layer_communicator_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/compilation/cuda_piecewise_backend.py", line 111, in __call__
    return self.compiled_graph_for_general_shape(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 837, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 413, in __call__
    raise e
  File "/root/.local/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.3", line 9, in forward
    moe_impl = torch.ops.sglang.moe_impl(0, linear);  linear = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/_ops.py", line 1255, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/models/gpt_oss.py", line 206, in moe_impl
    final_hidden_states = moe_fusion.experts(hidden_states, topk_output)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 961, in forward
    return self.forward_impl(hidden_states, topk_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 990, in forward_impl
    combine_input = self.run_moe_core(
                    ^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1011, in run_moe_core
    return self.quant_method.apply(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/quantization/mxfp4.py", line 902, in apply
    return self.runner.run(dispatch_output, quant_info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/moe_runner/runner.py", line 96, in run
    runner_output = self.runner_core.run(runner_input, quant_info, running_state)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/moe_runner/triton_kernels.py", line 115, in run
    output = triton_kernel_fused_experts_with_bias(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sglang/python/sglang/srt/layers/moe/fused_moe_triton/triton_kernels_moe.py", line 306, in triton_kernel_fused_experts_with_bias
    matmul_ogs(
  File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs.py", line 370, in matmul_ogs
    opt_flags = make_opt_flags(out_dtype, x.dtype, w.dtype, precision_config,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/opt_flags.py", line 302, in make_opt_flags
    return make_default_opt_flags_nvidia(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/lib/python3.12/site-packages/triton_kernels/matmul_ogs_details/opt_flags.py", line 218, in make_default_opt_flags_nvidia
    assert num_stages >= 1
           ^^^^^^^^^^^^^^^
AssertionError

@amittell

amittell commented Mar 6, 2026

@Kangyan-Zhou Thanks, but the co-author line is malformed -- missing angle brackets around the email, so GitHub doesn't actually link it to my profile:

Co-authored-by: amittell 1388680+amittell@users.noreply.github.com

Should be:

Co-authored-by: amittell <1388680+amittell@users.noreply.github.com>

Could you fix that?

Separately -- I saw @mmangkad's report about the RTX PRO 6000 Server Edition failing with assert num_stages >= 1, and the fix in #20040. Tested the fix on an RTX PRO 6000 Blackwell Workstation Edition (same SM120, 101KB shared memory per block) and can confirm it works. Server starts cleanly and inference runs fine.
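For context on that assertion: triton's matmul pipelining picks a stage count that must fit within the block's shared-memory budget. A hypothetical sketch of why a tile configuration tuned for a larger budget can drive the count to 0 on SM120 (the function and the tile number are mine, not triton_kernels source):

```python
# Hypothetical model, NOT triton_kernels code: how many pipeline stages
# of a given tile footprint fit in the shared-memory budget.
def pipeline_stages(smem_budget_kb: float, tile_kb: float) -> int:
    return int(smem_budget_kb // tile_kb)

H100_SMEM_KB = 228    # SM90 shared memory per block
SM120_SMEM_KB = 101   # RTX PRO 6000 Blackwell, per the comment above

big_tile_kb = 112     # made-up tile footprint sized for SM90-class budgets
print(pipeline_stages(H100_SMEM_KB, big_tile_kb))   # 2
print(pipeline_stages(SM120_SMEM_KB, big_tile_kb))  # 0 -> `assert num_stages >= 1` fires
```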

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: amittell 1388680+amittell@users.noreply.github.com
@geraldstanje1

geraldstanje1 commented Mar 22, 2026

hi @amittell @Wangzheee @magicYang1573 @b8zhong is that triton kernel faster than marvel kernel?

@amittell

amittell commented Mar 28, 2026

@geraldstanje1 There is no "marvel kernel" in SGLang -- I'm not sure what you're referring to. The available MoE backends for MXFP4 on SM120 are: triton_kernel (from the triton_kernels package, what this PR adds), triton (standard SGLang triton), cutlass, and flashinfer_* variants.

I tested all three viable backends on an NVIDIA RTX PRO 6000 Blackwell Workstation Edition (SM120, 98GB) with GPT-OSS-120B using lmsysorg/sglang:dev-cu13:

triton_kernel (this PR) -- 5 runs per context length, 200 output tokens

| Context | tok/s | std |
|---------|-------|-----|
| 4K      | 142.6 | 3.9 |
| 8K      | 129.9 | 0.1 |
| 16K     | 121.3 | 0.0 |
| 32K     | 103.4 | 0.0 |
| 64K     | 78.2  | 0.0 |

triton (standard) -- OOM

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.96 GiB.

cutlass -- OOM

Same error as triton. Both triton and cutlass backends attempt to dequantize MXFP4 weights to a larger format, requiring ~4GB extra VRAM that isn't available with the 120B model loaded.

Key finding

On a single RTX PRO 6000 (98GB), triton_kernel is the only MoE backend that can actually run GPT-OSS-120B because it handles FP4 natively without dequantization. The other backends OOM during weight loading.
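Back-of-envelope arithmetic supporting this (my numbers, not from the PR; assumes MXFP4's 4-bit values plus one shared 8-bit scale per 32-element block, and ~120B parameters as an order of magnitude):

```python
# Rough weight-memory comparison: MXFP4 vs. dequantized bf16.
params = 120e9                 # GPT-OSS-120B, order-of-magnitude parameter count
mxfp4_bits = 4 + 8 / 32        # 4.25 bits/weight incl. the per-block fp8 scale
bf16_bits = 16

mxfp4_gib = params * mxfp4_bits / 8 / 2**30
bf16_gib = params * bf16_bits / 8 / 2**30
print(f"mxfp4 ~{mxfp4_gib:.0f} GiB, bf16 ~{bf16_gib:.0f} GiB")
# Full bf16 dequantization wildly exceeds the card's 98 GB, so even
# per-layer dequant buffers (the ~4 GiB allocation in the OOM) can't fit
# once the MXFP4 weights, KV cache, and activations are resident.
```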

For reference, here's how it compares with vLLM on the same hardware (200 token output):

| Context | SGLang triton_kernel | vLLM baseline |
|---------|----------------------|---------------|
| 4K      | 142.6 tok/s          | 155.9 tok/s   |
| 8K      | 129.9 tok/s          | 155.1 tok/s   |
| 16K     | 121.3 tok/s          | 136.5 tok/s   |
| 32K     | 103.4 tok/s          | 109.5 tok/s   |
| 64K     | 78.2 tok/s           | --            |

vLLM is slightly faster at short context but doesn't support context lengths beyond ~48K on this hardware. SGLang with triton_kernel scales to 131K.

BTW @Wangzheee @magicYang1573 @b8zhong I never got that co-author attribution for my work; the merge message was missing the <> around the email, so GitHub ignored it. Any chance you can rectify this, please?
