[MoE][GPT-OSS] Add L40S/SM89 Marlin block-size policy #38054
will-deines wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request introduces model-aware attention backend and Marlin MoE block size policies. It adds logic to prioritize attention backends (Triton, FlashAttention, FlashInfer) for GPT-OSS models with attention sinks based on CUDA device capability (SM8x, SM9x, SM100+). It also implements a new policy for Marlin MoE block size selection, optimized for GPT-OSS models on SM89, and introduces a mechanism to use smaller Triton unified-attention sink tiles on SM8x GPUs, which have less shared memory. The review comments suggest improving readability and maintainability by defining magic numbers as named constants in both the Marlin MoE and Triton attention tile-size selection logic.
```python
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M = 64
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M = 32
GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD = 128


def _use_gpt_oss_sm89_marlin_block_size_policy(
    *,
    num_experts: int,
    topk: int,
    hidden_size: int,
    quant_type: ScalarType,
    device_capability: DeviceCapability | None,
) -> bool:
    return (
        device_capability == DeviceCapability(8, 9)
        and num_experts == 32
        and topk == 4
        and hidden_size == 2880
        and quant_type == scalar_types.float4_e2m1f
    )
```
For better readability and maintainability, especially for such a narrow, hardware-specific performance policy, it's good practice to define the magic numbers for the GPT-OSS 20B MoE shape as constants. This makes the code easier to understand and modify in the future.
```suggestion
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M = 64
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M = 32
GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD = 128

# GPT-OSS 20B MoE shape constants
GPT_OSS_20B_MOE_NUM_EXPERTS = 32
GPT_OSS_20B_MOE_TOP_K = 4
GPT_OSS_20B_MOE_HIDDEN_SIZE = 2880


def _use_gpt_oss_sm89_marlin_block_size_policy(
    *,
    num_experts: int,
    topk: int,
    hidden_size: int,
    quant_type: ScalarType,
    device_capability: DeviceCapability | None,
) -> bool:
    return (
        device_capability == DeviceCapability(8, 9)
        and num_experts == GPT_OSS_20B_MOE_NUM_EXPERTS
        and topk == GPT_OSS_20B_MOE_TOP_K
        and hidden_size == GPT_OSS_20B_MOE_HIDDEN_SIZE
        and quant_type == scalar_types.float4_e2m1f
    )
```

```python
def _use_small_sm8x_sink_tiles(
    device_capability: DeviceCapability | None,
    has_sinks: bool,
) -> bool:
    """Prefer smaller sink tiles on Ada/GA10x-class SM8x GPUs.

    SM86/SM89 parts have materially less shared memory per SM than SM80,
    so the sink-capable unified Triton path benefits from a smaller tile.
    """
    return (
        has_sinks
        and device_capability is not None
        and device_capability.major == 8
        and device_capability.minor in (6, 9)
    )
```
To improve readability and maintainability, it's better to define the magic numbers for device minor versions as a named constant. This makes the code's intent clearer and easier to update if more device types are added in the future.
```suggestion
_SM8X_DEVICES_WITH_LESS_SHARED_MEM = (6, 9)


def _use_small_sm8x_sink_tiles(
    device_capability: DeviceCapability | None,
    has_sinks: bool,
) -> bool:
    """Prefer smaller sink tiles on Ada/GA10x-class SM8x GPUs.

    SM86/SM89 parts have materially less shared memory per SM than SM80,
    so the sink-capable unified Triton path benefits from a smaller tile.
    """
    return (
        has_sinks
        and device_capability is not None
        and device_capability.major == 8
        and device_capability.minor in _SM8X_DEVICES_WITH_LESS_SHARED_MEM
    )
```
Co-authored-by: OpenAI Codex <noreply@openai.com>
Signed-off-by: Will Deines <will@garr.io>
(cherry picked from commit b43bcfd)
Signed-off-by: Will Deines <will@garr.io>
Force-pushed from b43bcfd to d761c0b
Purpose

This draft follows up the GPT-OSS L40S attention-policy work with a narrow
Marlin MoE runtime policy for the GPT-OSS 20B shape on SM89 / L40S.

The change keeps the existing generic Marlin `block_size_m` heuristic as the
default, but adds a model- and device-specific override for the observed
GPT-OSS MXFP4 MoE shape on L40S:

- `block_size_m=64` for tiny-M decode-like calls
- `block_size_m=32` for larger-M prefill-like calls

The policy is intentionally narrow:

- `DeviceCapability(8, 9)` only
- the GPT-OSS 20B MoE shape only (`hidden_size=2880`, `num_experts=32`, `top_k=4`)

Everything else keeps the existing generic Marlin auto policy unchanged.
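Roughly, the override could sit in front of the generic heuristic as sketched below. Only the three constant names and values come from this PR's diff; the `select_block_size_m` function, the `<=` threshold semantics, and the stubbed generic fallback are illustrative assumptions, not the PR's actual integration point.

```python
# Constants from this PR's diff.
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M = 64   # tiny-M decode-like calls
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M = 32   # larger-M prefill-like calls
GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD = 128


def select_block_size_m(m: int, use_sm89_policy: bool) -> int:
    """Pick a Marlin MoE block_size_m for a call with M tokens (sketch)."""
    if use_sm89_policy:
        # Narrow GPT-OSS-on-SM89 override: large tiles for tiny M,
        # smaller tiles once M crosses the threshold.
        if m <= GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD:
            return GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M
        return GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M
    # Hypothetical stand-in for the existing generic heuristic,
    # which remains the default everywhere else.
    return 64 if m <= 32 else 48
```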
Why This Is Not Duplicating Existing Open PRs

- The related attention PR is a follow-up that only changes CUDA attention
  backend selection plus Triton unified-attention tile policy. This PR
  changes Marlin MoE `block_size_m` selection.
- Another open PR adds a new MoE kernel for SM100. This PR does not add a
  new MoE kernel; it only adjusts Marlin runtime policy for SM89 / L40S.
- Other open PRs fix correctness / alignment bugs. This PR does not fix a
  correctness bug; it adds a narrow performance policy for GPT-OSS on L40S.
- The remaining related work is non-CUDA enablement work, while this PR is
  CUDA SM89-specific.
Motivation
Deployed L40S benchmark experiments on GPT-OSS 20B showed that the current
generic Marlin `block_size_m` heuristic is not the best fit for the GPT-OSS
MoE shape on L40S.

In local deploy sweeps on Modal L40S endpoints:

- `block_size_m=48` consistently regressed and was rejected
- `block_size_m=32` was strong for long-prefill control cases
- `block_size_m=64` was best on the decode-heavy case

This draft encodes that result as a narrow runtime selector instead of
requiring separate deployment variants.
Test Plan
Test Result
- `pytest ... -q` -> 17 passed
- `pre-commit` -> passed

Manual validation on deployed Modal L40S GPT-OSS 20B endpoints:

- `block_size_m=48` regressed both the decode-heavy case and the long-prefill
  control versus baseline
- `block_size_m=32` improved the decode-heavy case modestly and preserved a
  strong long-prefill control result
- `block_size_m=64` improved the decode-heavy case more than `32` while
  remaining strongly better than baseline on the long-prefill control

Representative deployed results versus the same baseline:

- 8: `b32`: -9.89% median per-request total; `b64`: -13.91% median
  per-request total
- 1: `b32`: -40.08% median per-request total; `b64`: -45.04% median
  per-request total

This draft uses `64` for tiny-M decode-like calls and `32` for larger-M
prefill-like calls to reflect that deployed sweep.
AI Assistance
AI assistance was used to help implement the selector, write the focused
tests, and analyze the L40S benchmark results. I reviewed every changed line
and ran the commands above.