
# [BACKEND] FP64 on Hopper: add support for m16n8k4 path for sm90+ #10313

Open

mwichro wants to merge 1 commit into triton-lang:main from mwichro:h100_fp64_optimizing_PR

Conversation

@mwichro
Contributor

@mwichro mwichro commented May 14, 2026

Summary

Triton's FP64 dot lowering previously emitted only m8n8k4.f64, leaving ~35% of the H100 performance on the table. This PR adds the m16n8k4.f64 MMA shape for sm_90+; m8n8k4.f64 is kept as the sm_80 fallback.

Performance

| M=N=K | cuBLAS (TFLOPS) | Triton before | Triton after | Gap after |
|-------|-----------------|---------------|--------------|-----------|
| 1024  | 47.59           | ~28           | 45.62        | 4.1 %     |
| 2048  | 55.99           | ~31           | 53.43        | 4.6 %     |
| 4096  | 53.89           | ~31           | 49.18        | 8.7 %     |

Performance was measured with the script provided in #10060.
The autotune settings need to be updated: num_warps=2.
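As a sanity check on the table above, the "Gap after" column is just the relative difference between the cuBLAS and Triton-after throughputs; a small script (using the TFLOPS numbers from the table) reproduces it:

```python
# Reproduce the "Gap after" column from the benchmark table above.
# Values are the cuBLAS and Triton-after TFLOPS reported in this PR.
results = {
    1024: (47.59, 45.62),
    2048: (55.99, 53.43),
    4096: (53.89, 49.18),
}

def gap_percent(cublas_tflops, triton_tflops):
    """Relative gap to cuBLAS, in percent."""
    return (cublas_tflops - triton_tflops) / cublas_tflops * 100

for size, (cublas, triton) in results.items():
    print(f"M=N=K={size}: gap = {gap_percent(cublas, triton):.1f} %")
```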

m16n8k8 and m16n8k16

I tried fully closing the gap by implementing m16n8k8 and m16n8k16 support, but there was no performance gain. So I am not including those changes, but the code should be extensible to add support for them:

  • pickFp64MmaK is the single point to extend with operand-K-aware dispatch (and an env-var override).
  • instrShape already carries K explicitly, so adding K=8/16 shapes is additive on the encoding side.
  • getMmaTypeDot dispatches on instrShape[M]; one would branch further on instrShape.back().
  • callMmaAmpereFp64M16K4 is a single-K helper; a TODO at its definition lists the regs/thread shape for k=8 (A=4, B=2) and k=16 (A=8, B=4), so the generalization is a localized refactor.
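For reference, the per-thread register counts in that TODO follow directly from the fragment sizes: each f64 element occupies one 64-bit register, a warp has 32 threads, and for an m16n8kK MMA the A and B fragments are M×K and K×N elements. A small sketch (illustrative arithmetic, not code from this PR) reproduces the k=8 and k=16 numbers:

```python
# Per-thread f64 register counts for mma.m16n8kK operand fragments.
# One f64 element = one 64-bit register; a warp has 32 threads.
WARP_SIZE = 32
M, N = 16, 8

def regs_per_thread(k):
    a_regs = M * k // WARP_SIZE  # A fragment is M x K elements
    b_regs = N * k // WARP_SIZE  # B fragment is K x N elements
    return a_regs, b_regs

for k in (4, 8, 16):
    a, b = regs_per_thread(k)
    print(f"k={k}: A={a} regs/thread, B={b} regs/thread")
```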

Remaining gap to cuBLAS

I've spent some time trying to close the gap, but it turns out it needs a deeper change.

The B operand for m16n8k{8,16} cannot be vectorized with the current shared-encoding choice (N-contiguous shared layout vs K-adjacent register fragment), so the BLOCK_K ≥ 32 autotune configs regress and the autotuner sticks with BLOCK_K = 16 (= m16n8k4).

Declaration

  • I am not making a trivial change, such as fixing a typo in a comment.
  • I have written a PR description following these
    rules.
  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.
  • I have added tests.
    • /test for lit tests
    • /unittest for C++ tests
    • /python/test for end-to-end tests
  • The lit tests I have added follow these best practices.

@mwichro mwichro requested review from lezcano and ptillet as code owners May 14, 2026 12:15
@mwichro mwichro changed the title # [BACKEND] FP64: Add support for m16n8k4 path for sm90+ # [BACKEND] FP64 on Hopper: add support for m16n8k4 path for sm90+ May 14, 2026
Contributor

@Jokeren Jokeren left a comment


Originally, I was very supportive of the FP64 path, but this update now seems much larger than I expected. Adding too many if/else branches would make the code harder to maintain, especially when different architectures choose different instruction shapes and FP64 is relatively low priority. Can you share with us your plan for the FP64 functionality and what else still needs to be pushed upstream? We may want to reach an agreement on what essential features should be there before working on more PRs.

cc @lezcano

@mwichro
Copy link
Copy Markdown
Contributor Author

mwichro commented May 14, 2026

The point of this PR is to gain as much performance as possible while keeping the changes reasonably small. At the same time, I wanted to make it easy to extend the implementation to larger tiles. I was not able to get any performance gains from adding support for more instructions.

Fully closing the gap would probably require deeper changes, so I opened this PR as at least a reasonable checkpoint. I am not able to run the profiler on the university H100/H200 servers; I can't even confirm that my reasoning about the performance gap is valid.

> what else still needs to be pushed upstream?

That is a good question. For me, those changes seem to be enough; the gap to cuBLAS is reasonably small, but on the other hand, it is still 5%, so there is definitely something to gain (and this PR feels like unfinished work for me).

So, let me return the question: Where is the sweet spot concerning FP64 performance?

As for my applications, the Mx8x4 tile is perfect.

@Jokeren
Contributor

Jokeren commented May 14, 2026

> That is a good question. For me, those changes seem to be enough; the gap to cuBLAS is reasonably small, but on the other hand, it is still 5%, so there is definitely something to gain (and this PR feels like unfinished work for me).

I believe supporting FP64 is mainly for the community's benefit. I think getting 100% or higher perf would be beneficial, but I haven't looked into the details.

If you are interested in further exploring what steps have to be taken to get there, a brief report would be better than submitting separate PRs. We can leave this PR open for now without merging and come back later if you figure out that there is no good way to achieve 100% perf without significant updates.

@mwichro
Contributor Author

mwichro commented May 14, 2026

Sounds like a plan. I had already investigated why there is a gap before I opened this PR; that is why I decided to post it before going further. Would you mind providing some comments on what I think is causing the gap?

My reasoning

By looking at PTX, with some help from Claude: B operand at kWidth ≥ 2 is loaded as two scalar ld.shared.b64 per thread (instead of one ld.shared.v2.b64). The two K-adjacent B elements per thread sit at non-contiguous addresses in shared memory because B's shared encoding has order=[1,0] (N-contiguous), inherited from global memory. PTX uses two different base registers for the two loads.

This means simply emitting ld.shared.v2.b64 is not possible: the elements being loaded aren't adjacent. B's shared encoding would have to be changed to order=[0,1] (K-contiguous) when kWidth ≥ 2 for f64.
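The adjacency argument can be illustrated with plain offset arithmetic. Below is a hypothetical sketch (not Triton code; swizzling is ignored for simplicity) that computes the shared-memory byte offsets of two K-adjacent f64 B elements under both orders; only the K-contiguous layout puts them 8 bytes apart, which a v2 (16-byte) load requires:

```python
# Byte offsets of two K-adjacent f64 elements of a K x N B tile in shared
# memory, for both layout orders (swizzling ignored for simplicity).
ELEM_BYTES = 8  # sizeof(f64)
K, N = 32, 32   # B tile shape from the TTGIR dump in this thread

def offset(k, n, order):
    # order=[1, 0]: axis 1 (N) is innermost -> row-major over (k, n)
    # order=[0, 1]: axis 0 (K) is innermost -> column-major over (k, n)
    if order == [1, 0]:
        return (k * N + n) * ELEM_BYTES
    return (n * K + k) * ELEM_BYTES

# Two K-adjacent elements of the B fragment: (k, n) and (k+1, n).
d_n_contig = offset(1, 0, [1, 0]) - offset(0, 0, [1, 0])  # N-contiguous layout
d_k_contig = offset(1, 0, [0, 1]) - offset(0, 0, [0, 1])  # K-contiguous layout
print(d_n_contig, d_k_contig)  # 256 bytes apart vs 8 bytes apart
```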

Also, ldmatrix has no f64 variant.

PTX dump

cuBLAS

```
ld.shared.v2.b64 {%rd74, %rd76}, [%r170];          // A load 1
ld.shared.v2.b64 {%rd75, %rd77}, [%r170+2048];     // A load 2
... 14 more A v2 loads ...
ld.shared.b64    %rd79, [%r175+49152];             // B load 1 (scalar)
ld.shared.b64    %rd78, [%r174+49152];             // B load 2 (scalar)
... 14 more B scalar loads ...
mma.sync.aligned.m16n8k8.row.col.f64.f64.f64.f64
  { %rd169-%rd172 },                                // C
  { %rd74, %rd75, %rd76, %rd77 },                   // A (4 regs, v2 path)
  { %rd78, %rd79 },                                 // B (2 regs, scalar path)
  { %rd169-%rd172 };
```

TTGIR

```
#shared  = swizzled_shared<vec=4, perPhase=1, maxPhase=4, order=[1, 0]>
#shared1 = swizzled_shared<vec=8, perPhase=1, maxPhase=2, order=[1, 0]>

%a = ttg.local_alloc : memdesc<3x64x32xf64, #shared,  #smem, mutable>
%b = ttg.local_alloc : memdesc<3x32x32xf64, #shared1, #smem, mutable>
```
  • order=[1, 0] means axis-1 is innermost. For A (M×K) axis-1 is K → A is K-contiguous, good.
  • For B (K×N) axis-1 is N → B is N-contiguous, which is the wrong axis for the m16n8k8 B-operand fragment.
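As a sanity check on the two bullet points above, a small helper (hypothetical, not Triton's actual implementation) that derives element strides from an order list confirms which axis each operand is contiguous along:

```python
# Derive element strides of a 2-D tile from a Triton-style `order` list,
# where order[0] is the fastest-varying (innermost) axis.
def strides_from_order(shape, order):
    strides = [0] * len(shape)
    step = 1
    for axis in order:  # innermost axis first
        strides[axis] = step
        step *= shape[axis]
    return strides

# A tile is M x K = 64 x 32, B tile is K x N = 32 x 32 (from the TTGIR dump).
a_strides = strides_from_order([64, 32], [1, 0])  # axis 1 (K) innermost
b_strides = strides_from_order([32, 32], [1, 0])  # axis 1 (N) innermost
print(a_strides)  # stride 1 along axis 1 -> A is K-contiguous (good)
print(b_strides)  # stride 1 along axis 1 -> B is N-contiguous (wrong axis)
```

Flipping B's order to [0, 1] would put stride 1 on the K axis instead, which is what the v2 load needs.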

@Jokeren
Contributor

Jokeren commented May 15, 2026

Thanks for the preliminary investigation. What I actually want is a comparison of the Triton FP64 and cuBLAS FP64 NCU profiling results and assembly code. I don't think the analysis from Claude is useful without checking what cuBLAS is doing.
