[SM121] Enable native block-scaled dot_scaled for DGX Spark (GB10) #10010
Merged

masahi reviewed on Apr 13, 2026
SM121 (GB10 DGX Spark) supports the same mma.sync block-scaled instructions as SM120 (RTX 5090) but was excluded from the native lowering path by exact compute capability checks. Without this fix, dot_scaled on SM121 falls through to DecomposeScaledBlocked which upcasts to bf16 — ~10 TFLOPS vs ~270 TFLOPS with native mma.sync block-scaled FP4. Tested on GB10 with both MXFP4 (scale_vec::2X, ue8m0) and NVFP4 (scale_vec::4X, ue4m3).
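For illustration, here is a minimal sketch of the kind of gate change the description implies, assuming the native path is selected by an exact compute-capability comparison. The names below (`supportsNativeBlockScaledMma`, `kSm120`, `kSm121`) are hypothetical and not taken from the Triton sources:

```cpp
#include <cassert>

// Hypothetical sketch of the capability gate described above; names and
// structure are illustrative, not copied from Triton.
constexpr int kSm120 = 120; // e.g. RTX 5090
constexpr int kSm121 = 121; // GB10 DGX Spark

bool supportsNativeBlockScaledMma(int computeCapability) {
  // Before the fix: an exact match (computeCapability == kSm120) excluded
  // SM121, so dot_scaled fell through to DecomposeScaledBlocked and its
  // bf16 upcast. SM121 implements the same mma.sync block-scaled
  // instructions as SM120, so both now take the native lowering path.
  return computeCapability == kSm120 || computeCapability == kSm121;
}

int main() {
  assert(supportsNativeBlockScaledMma(kSm120));
  assert(supportsNativeBlockScaledMma(kSm121)); // previously false
  assert(!supportsNativeBlockScaledMma(90));    // Hopper takes another path
  return 0;
}
```

An alternative would be to key off the capability family (e.g. `computeCapability / 10 == 12`), which would also cover future SM12x parts; the PR description only confirms that SM121 now joins SM120 on the native path.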
Force-pushed from 8caf39d to f40d381
plognjen pushed a commit to plognjen/triton that referenced this pull request on Apr 14, 2026
raymondtay pushed a commit to raymondtay/triton that referenced this pull request on Apr 18, 2026
New contributor declaration

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because current test paths cover the flow, though there are no GB10s in CI to verify; AFAIK it does work for me.
- Select one of the following.
  - [x] I have not added any `lit` tests.