
[SM121] Enable native block-scaled dot_scaled for DGX Spark (GB10)#10010

Merged
ThomasRaoux merged 1 commit into triton-lang:main from ianbarber:sm121-dot-scaled on Apr 13, 2026

Conversation

@ianbarber
Contributor

SM121 (GB10 DGX Spark) supports the same mma.sync block-scaled instructions as SM120 (RTX 5090) but was excluded from the native lowering path by exact compute capability checks.

Without this fix, dot_scaled on SM121 falls through to DecomposeScaledBlocked which upcasts to bf16 — ~10 TFLOPS vs ~270 TFLOPS with native mma.sync block-scaled FP4.
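The gating change described above can be sketched in Python (illustrative only; the real change lives in C++ in `AccelerateMatmul.cpp`, and the function names here are hypothetical). The point is the shape of the fix: an exact compute-capability comparison excludes SM121 even though it supports the same instructions as SM120, so the check is widened to cover both.

```python
# Hypothetical sketch of the capability gate this PR relaxes.
# Names are illustrative, not the actual Triton internals.

def supports_native_block_scaled_before(cc: int) -> bool:
    # Exact match: only SM120 takes the native mma.sync path,
    # so SM121 falls through to the slow decomposed lowering.
    return cc == 120

def supports_native_block_scaled_after(cc: int) -> bool:
    # SM121 (GB10) supports the same block-scaled mma.sync
    # instructions, so both capabilities take the native path.
    return cc in (120, 121)

for cc in (100, 120, 121):
    print(cc,
          supports_native_block_scaled_before(cc),
          supports_native_block_scaled_after(cc))
```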

Tested on GB10 with both MXFP4 (scale_vec::2X, ue8m0) and NVFP4 (scale_vec::4X, ue4m3).
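As a rough sketch of what the two tested formats imply for scale layout (block sizes taken from the public OCP MX and NVFP4 format descriptions, not from this PR): MXFP4 carries one ue8m0 scale per 32-element block, while NVFP4 carries one ue4m3 scale per 16-element block, so NVFP4 needs twice as many scale factors along K.

```python
# Hedged sketch: scale-factor counts along K for the formats
# exercised in testing. Block sizes are an assumption based on
# the public format specs (MXFP4: 32 elements/scale, NVFP4: 16).

def num_scales(k: int, fmt: str) -> int:
    block = {"mxfp4": 32, "nvfp4": 16}[fmt]
    assert k % block == 0, "K must be a multiple of the scale block size"
    return k // block

print(num_scales(128, "mxfp4"))  # 4 scales along K
print(num_scales(128, "nvfp4"))  # 8 scales along K
```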

New contributor declaration

  • [x] I am not making a trivial change, such as fixing a typo in a comment.

  • [x] I have written a PR description following these
    [rules](https://cbea.ms/git-commit/#why-not-how).

  • [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

  • Select one of the following.

    • [ ] I have added tests.
      • `/test` for `lit` tests
      • `/unittest` for C++ tests
      • `/python/test` for end-to-end tests
    • [x] This PR does not need a test because existing test paths cover the flow; there are no GB10 machines in CI to verify, but it works on my hardware.

  • Select one of the following.

    • [x] I have not added any `lit` tests.

@ianbarber ianbarber requested a review from ptillet as a code owner on April 12, 2026 at 23:43
Review comment thread on lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp (outdated)
@ThomasRaoux ThomasRaoux merged commit b3e3693 into triton-lang:main Apr 13, 2026
16 of 18 checks passed
plognjen pushed a commit to plognjen/triton that referenced this pull request Apr 14, 2026
…riton-lang#10010)

raymondtay pushed a commit to raymondtay/triton that referenced this pull request Apr 18, 2026
…riton-lang#10010)

3 participants