Fix SM120/SM121 (consumer Blackwell) codegen: no tensor memory pipeline #9852
Closed
mihai-chiorean wants to merge 2 commits into triton-lang:main from
Conversation
Consumer Blackwell GPUs (SM120/SM121 — DGX Spark, RTX 5090) lack the tensor memory (tcgen05) hardware present in datacenter Blackwell (SM100/SM103). The compiler was routing SM120/SM121 through the datacenter Blackwell pipeline, generating tensor memory instructions that cause illegal instruction crashes at runtime.

Changes:
- Add a `_has_tensor_memory()` helper (sketched below): returns `True` only for SM100/SM103 (arch family 10), `False` for SM120/SM121 (arch family 12)
- Route SM120/SM121 to the Hopper-like pipeline (MMAv2, no tmem) instead of the datacenter Blackwell pipeline
- Fix the `.target` regex in `make_ptx` to handle an optional "a" suffix
- SM120/SM121 get no "a" suffix in the arch string since they lack the accelerator features it implies

Tested on DGX Spark GB10 (SM121, aarch64, CUDA 13.1):
- Simple vector add kernel: PASS
- FP16 matmul with `tl.dot`: PASS (max error: 0.000004)
- Qwen3-Next `fused_qkvzba_split_reshape_cat_kernel`: PASS

Fixes: triton-lang#9181, triton-lang#8539, triton-lang#8335
Related: triton-lang#9734 (reverted — addressed the suffix but not the pipeline routing)

Signed-off-by: Mihai Chiorean <mihai.v.chiorean@gmail.com>
Signed-off-by: Mihai Chiorean <mihai-chiorean@users.noreply.github.com>
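For reference, a minimal sketch of the `_has_tensor_memory()` helper described above; this is a reconstruction from the commit message, not the verbatim patch:

```python
# Sketch reconstructed from the PR description, not the actual patch.
# Datacenter Blackwell (SM100/SM103, arch family 10) has tcgen05
# tensor memory; consumer Blackwell (SM120/SM121, arch family 12)
# does not, and neither do earlier architectures (Hopper and below).
def _has_tensor_memory(capability: int) -> bool:
    arch_family = capability // 10
    return arch_family == 10
```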
ThomasRaoux requested changes on Mar 25, 2026

Collaborator

ThomasRaoux left a comment:

As far as I know, people have been using Triton for sm_120 successfully, so I'm not sure what the problem is. The patch seems to be the same as what was reverted.
Comment on lines +111 to +118
```python
# SM120/SM121 (consumer Blackwell) lack tensor memory features
# that the "a" suffix enables. Only give "a" to SM >= 90 that
# actually have the corresponding accelerator features.
arch_family = capability // 10
if capability >= 90 and arch_family != 12:
    suffix = "a"
else:
    suffix = ""
```
Collaborator

We reverted the patch because this was incorrect, so I'm not sure why you are adding it back.
```diff
  passes.ttgpuir.add_schedule_loops(pm)
  passes.ttgpuir.add_pipeline(pm, opt.num_stages, dump_enabled)
- elif capability // 10 >= 10:
+ elif _has_tensor_memory(capability):
```
Collaborator

I don't think this is right; going through that path for sm_120 should be fine.
Author

Thomas, you're right on both points. I ran systematic tests on an actual SM121 (NVIDIA GB10 / DGX Spark) with Triton 3.5.1 and CUDA 13.1, and I can confirm: my original diagnosis was incorrect. Closing this PR. Thanks for the quick review.
Summary
Consumer Blackwell GPUs (SM120/SM121 — DGX Spark, RTX 5090) lack the tensor memory (tcgen05) hardware present in datacenter Blackwell (SM100/SM103). The compiler routes SM120/SM121 through the datacenter Blackwell pipeline, generating tensor memory instructions that cause `illegal instruction` crashes at runtime.

A previous fix attempt (#9734) was reverted (#9755) because it incorrectly claimed `sm_120a` isn't a valid arch. It IS valid — the real issue is the pipeline routing, not the suffix. This PR addresses the actual root cause.

Changes
- Add a `_has_tensor_memory()` helper — returns `True` only for SM100/SM103 (arch family 10); SM120/SM121 (arch family 12) return `False`.
- Route SM120/SM121 away from the datacenter Blackwell passes that emit tensor memory ops (`add_hoist_tmem_alloc`, `add_promote_lhs_to_tmem`).
- Fix the `.target` regex in `make_ptx` — handle an optional `a` suffix (`\.target sm_\d+a?`); see the sketch below.
- SM120/SM121 get no `a` suffix since they lack the accelerator features it implies.
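To illustrate the `.target` regex bullet, a small self-contained example. Only the pattern `\.target sm_\d+a?` comes from the PR; the test strings are illustrative:

```python
import re

# Pattern from the PR description; the test harness around it is
# illustrative. The optional "a?" lets the regex accept both suffixed
# and unsuffixed arch strings in the PTX .target directive.
target_re = re.compile(r"\.target sm_\d+a?")

for line in (".target sm_120", ".target sm_100a", ".target sm_90a"):
    assert target_re.search(line), line  # all three forms now match
```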
Test Results (DGX Spark GB10, SM121, aarch64, CUDA 13.1)

- Simple vector add kernel: PASS
- FP16 matmul with `tl.dot`: PASS (max error: 0.000004)
- Qwen3-Next `fused_qkvzba_split_reshape_cat_kernel`: PASS

Simple Triton kernels already worked on SM121 (they don't trigger tensor memory codegen). Complex kernels using `tl.dot` or 2D tensor operations crashed because the datacenter pipeline generated tensor memory instructions; a minimal sketch of that failing class follows.
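A minimal sketch of the failing kernel class, with illustrative shapes and names (none of this code is from the PR): any kernel whose `tl.dot` lowers through the datacenter Blackwell pipeline would hit illegal tensor memory instructions on SM121.

```python
import triton
import triton.language as tl

@triton.jit
def tiny_matmul(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    # Illustrative single-tile BLOCK x BLOCK matmul; tl.dot is the op
    # that triggers the MMA pipeline selection at issue on SM120/SM121.
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], tl.dot(a, b))

# Hypothetical launch on square fp16 tensors, e.g.:
#   tiny_matmul[(1,)](a, b, c, BLOCK=32)
```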
Why the Previous Fix Was Wrong

PR #9734 removed the `a` suffix for SM120, claiming `sm_120a` isn't valid. NVIDIA engineers objected — `sm_120a` IS a valid arch string (CUTLASS uses it). The PR was reverted.

This PR takes a different approach: the `a` suffix is secondary. The primary fix is routing SM120/SM121 away from the tensor memory pipeline. Even if we kept `sm_120a`, the pipeline routing fix alone would prevent the illegal instruction crash.

Impact
Unblocks all Triton-dependent models on DGX Spark (GB10) and RTX 5090 (SM120), including:

- `nvidia/Qwen3-Next-80B-A3B-Thinking-NVFP4`
- Any model using `tl.dot` or tensor operations
- `torch.compile` on SM121

Fixes: #9181, #8539, #8335
Related: #9734 (reverted), PyTorch #176426, vLLM #31128