[MoE][GPT-OSS] Add L40S/SM89 Marlin block-size policy #38054
will-deines wants to merge 1 commit into vllm-project:main from
Conversation
Code Review
This pull request introduces model-aware attention backend and Marlin MoE block size policies. It adds logic to prioritize attention backends (Triton, FlashAttention, FlashInfer) for GPT-OSS models with attention sinks based on CUDA device capability (SM8x, SM9x, SM100+). It also implements a new policy for Marlin MoE block size selection, optimized for GPT-OSS models on SM89, and introduces a mechanism to use smaller Triton unified-attention sink tiles on SM8x GPUs, which have less shared memory. The review comments suggest improving readability and maintainability by defining magic numbers as named constants in both the Marlin MoE and Triton attention tile-size selection logic.
```python
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M = 64
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M = 32
GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD = 128


def _use_gpt_oss_sm89_marlin_block_size_policy(
    *,
    num_experts: int,
    topk: int,
    hidden_size: int,
    quant_type: ScalarType,
    device_capability: DeviceCapability | None,
) -> bool:
    return (
        device_capability == DeviceCapability(8, 9)
        and num_experts == 32
        and topk == 4
        and hidden_size == 2880
        and quant_type == scalar_types.float4_e2m1f
    )
```
For better readability and maintainability, especially for such a narrow, hardware-specific performance policy, it's good practice to define the magic numbers for the GPT-OSS 20B MoE shape as constants. This makes the code easier to understand and modify in the future.
```suggestion
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M = 64
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M = 32
GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD = 128

# GPT-OSS 20B MoE shape constants
GPT_OSS_20B_MOE_NUM_EXPERTS = 32
GPT_OSS_20B_MOE_TOP_K = 4
GPT_OSS_20B_MOE_HIDDEN_SIZE = 2880


def _use_gpt_oss_sm89_marlin_block_size_policy(
    *,
    num_experts: int,
    topk: int,
    hidden_size: int,
    quant_type: ScalarType,
    device_capability: DeviceCapability | None,
) -> bool:
    return (
        device_capability == DeviceCapability(8, 9)
        and num_experts == GPT_OSS_20B_MOE_NUM_EXPERTS
        and topk == GPT_OSS_20B_MOE_TOP_K
        and hidden_size == GPT_OSS_20B_MOE_HIDDEN_SIZE
        and quant_type == scalar_types.float4_e2m1f
    )
```

```python
def _use_small_sm8x_sink_tiles(
    device_capability: DeviceCapability | None,
    has_sinks: bool,
) -> bool:
    """Prefer smaller sink tiles on Ada/GA10x-class SM8x GPUs.

    SM86/SM89 parts have materially less shared memory per SM than SM80,
    so the sink-capable unified Triton path benefits from a smaller tile.
    """
    return (
        has_sinks
        and device_capability is not None
        and device_capability.major == 8
        and device_capability.minor in (6, 9)
    )
```
To improve readability and maintainability, it's better to define the magic numbers for device minor versions as a named constant. This makes the code's intent clearer and easier to update if more device types are added in the future.
```suggestion
_SM8X_DEVICES_WITH_LESS_SHARED_MEM = (6, 9)


def _use_small_sm8x_sink_tiles(
    device_capability: DeviceCapability | None,
    has_sinks: bool,
) -> bool:
    """Prefer smaller sink tiles on Ada/GA10x-class SM8x GPUs.

    SM86/SM89 parts have materially less shared memory per SM than SM80,
    so the sink-capable unified Triton path benefits from a smaller tile.
    """
    return (
        has_sinks
        and device_capability is not None
        and device_capability.major == 8
        and device_capability.minor in _SM8X_DEVICES_WITH_LESS_SHARED_MEM
    )
```
Co-authored-by: OpenAI Codex <noreply@openai.com>
Signed-off-by: Will Deines <will@garr.io>
(cherry picked from commit b43bcfd)
Signed-off-by: Will Deines <will@garr.io>
Force-pushed from b43bcfd to d761c0b
Purpose

This draft follows up the GPT-OSS L40S attention-policy work with a narrow
Marlin MoE runtime policy for the GPT-OSS 20B shape on SM89 / L40S.

The change keeps the existing generic Marlin `block_size_m` heuristic as the
default, but adds a model- and device-specific override for the observed
GPT-OSS MXFP4 MoE shape on L40S:

- `block_size_m=64` for tiny-M decode-like calls
- `block_size_m=32` for larger-M prefill-like calls

The policy is intentionally narrow:

- `DeviceCapability(8, 9)` only
- the GPT-OSS 20B MoE shape only (`hidden_size=2880`, `num_experts=32`, `top_k=4`)

Everything else keeps the existing generic Marlin auto policy unchanged.
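Roughly, the override could sit in front of the generic heuristic as sketched below. Only the three constant names and values come from this PR's diff; the `select_block_size_m` function, the `<=` threshold semantics, and the stubbed generic fallback are illustrative assumptions, not the PR's actual integration point.

```python
# Constants from this PR's diff.
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M = 64   # tiny-M decode-like calls
GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M = 32   # larger-M prefill-like calls
GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD = 128


def select_block_size_m(m: int, use_sm89_policy: bool) -> int:
    """Pick a Marlin MoE block_size_m for a call with M tokens (sketch)."""
    if use_sm89_policy:
        # Narrow GPT-OSS-on-SM89 override: large tiles for tiny M,
        # smaller tiles once M crosses the threshold.
        if m <= GPT_OSS_SM89_MOE_SMALL_M_THRESHOLD:
            return GPT_OSS_SM89_MOE_BLOCK_SIZE_M_SMALL_M
        return GPT_OSS_SM89_MOE_BLOCK_SIZE_M_LARGE_M
    # Hypothetical stand-in for the existing generic heuristic,
    # which remains the default everywhere else.
    return 64 if m <= 32 else 48
```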
Why This Is Not Duplicating Existing Open PRs

- The related attention PR is a follow-up that only changes CUDA attention
  backend selection plus Triton unified-attention tile policy. This PR
  changes Marlin MoE `block_size_m` selection.
- Another open PR adds a new MoE kernel for SM100. This PR does not add a
  new MoE kernel; it only adjusts Marlin runtime policy for SM89 / L40S.
- Other open PRs fix correctness / alignment bugs. This PR does not fix a
  correctness bug; it adds a narrow performance policy for GPT-OSS on L40S.
- The remaining related work is non-CUDA enablement work, while this PR is
  CUDA SM89-specific.
Motivation
Deployed L40S benchmark experiments on GPT-OSS 20B showed that the current
generic Marlin `block_size_m` heuristic is not the best fit for the GPT-OSS
MoE shape on L40S.

In local deploy sweeps on Modal L40S endpoints:

- `block_size_m=48` consistently regressed and was rejected
- `block_size_m=32` was strong for long-prefill control cases
- `block_size_m=64` was best on the decode-heavy case

This draft encodes that result as a narrow runtime selector instead of
requiring separate deployment variants.
Test Plan
Test Result
- `pytest ... -q` -> 17 passed
- `pre-commit` -> passed

Manual validation on deployed Modal L40S GPT-OSS 20B endpoints:

- `block_size_m=48` regressed both the decode-heavy case and the long-prefill
  control versus baseline
- `block_size_m=32` improved the decode-heavy case modestly and preserved a
  strong long-prefill control result
- `block_size_m=64` improved the decode-heavy case more than `32` while
  remaining strongly better than baseline on the long-prefill control

Representative deployed results versus the same baseline:

- 8: `b32`: -9.89% median per-request total; `b64`: -13.91% median
  per-request total
- 1: `b32`: -40.08% median per-request total; `b64`: -45.04% median
  per-request total

This draft uses `64` for tiny-M decode-like calls and `32` for larger-M
prefill-like calls to reflect that deployed sweep.
AI Assistance
AI assistance was used to help implement the selector, write the focused
tests, and analyze the L40S benchmark results. I reviewed every changed line
and ran the commands above.