
# [BACKEND] FP64 on Hopper: add support for m16n8k4 path for sm90+ #10313

Open

mwichro wants to merge 1 commit into triton-lang:main from mwichro:h100_fp64_optimizing_PR

Conversation

@mwichro
Contributor

@mwichro mwichro commented May 14, 2026

Summary

Triton's FP64 dot lowering previously emitted only m8n8k4.f64, leaving ~35% of the H100 performance on the table. This PR adds the m16n8k4.f64 MMA shape for sm_90+; m8n8k4.f64 is kept as the sm_80 fallback.

Performance

| M=N=K | cuBLAS (TFLOPS) | Triton before | Triton after | Gap after |
|-------|-----------------|---------------|--------------|-----------|
| 1024  | 47.59           | ~28           | 45.62        | 4.1 %     |
| 2048  | 55.99           | ~31           | 53.43        | 4.6 %     |
| 4096  | 53.89           | ~31           | 49.18        | 8.7 %     |

Performance was measured with the script provided in #10060.
The autotune settings need to be updated: num_warps=2.
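As a sanity check on the table above, the "Gap after" column is just the relative difference between the cuBLAS and Triton-after throughputs; a small script (using the TFLOPS numbers from the table) reproduces it:

```python
# Reproduce the "Gap after" column from the benchmark table above.
# Values are the cuBLAS and Triton-after TFLOPS reported in this PR.
results = {
    1024: (47.59, 45.62),
    2048: (55.99, 53.43),
    4096: (53.89, 49.18),
}

def gap_percent(cublas_tflops, triton_tflops):
    """Relative gap to cuBLAS, in percent."""
    return (cublas_tflops - triton_tflops) / cublas_tflops * 100

for size, (cublas, triton) in results.items():
    print(f"M=N=K={size}: gap = {gap_percent(cublas, triton):.1f} %")
```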

m16n8k8 and m16n8k16

I tried fully closing the gap by implementing m16n8k8 and m16n8k16 support, but there was no performance gain. So I am not including those changes, but the code should be extensible to add support for them:

  • pickFp64MmaK is the single point to extend with operand-K-aware dispatch (and an env-var override).
  • instrShape already carries K explicitly, so adding K=8/16 shapes is additive on the encoding side.
  • getMmaTypeDot dispatches on instrShape[M]; one would branch further on instrShape.back().
  • callMmaAmpereFp64M16K4 is a single-K helper; a TODO at its definition lists the regs/thread shape for k=8 (A=4, B=2) and k=16 (A=8, B=4), so the generalization is a localized refactor.
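For reference, the per-thread register counts in that TODO follow directly from the fragment sizes: each f64 element occupies one 64-bit register, a warp has 32 threads, and for an m16n8kK MMA the A and B fragments are M×K and K×N elements. A small sketch (illustrative arithmetic, not code from this PR) reproduces the k=8 and k=16 numbers:

```python
# Per-thread f64 register counts for mma.m16n8kK operand fragments.
# One f64 element = one 64-bit register; a warp has 32 threads.
WARP_SIZE = 32
M, N = 16, 8

def regs_per_thread(k):
    a_regs = M * k // WARP_SIZE  # A fragment is M x K elements
    b_regs = N * k // WARP_SIZE  # B fragment is K x N elements
    return a_regs, b_regs

for k in (4, 8, 16):
    a, b = regs_per_thread(k)
    print(f"k={k}: A={a} regs/thread, B={b} regs/thread")
```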

Remaining gap to cuBLAS

I've spent some time trying to close the gap, but it turns out it needs a deeper change.

The B operand for m16n8k{8,16} cannot be vectorized with the current shared-encoding choice (N-contiguous shared layout vs K-adjacent register fragment), so the BLOCK_K ≥ 32 autotune configs regress and the autotuner sticks with BLOCK_K = 16 (= m16n8k4).

Declaration

  • I am not making a trivial change, such as fixing a typo in a comment.
  • I have written a PR description following these
    rules.
  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.
  • I have added tests.
    • /test for lit tests
    • /unittest for C++ tests
    • /python/test for end-to-end tests
  • The lit tests I have added follow these best practices.

@mwichro mwichro requested review from lezcano and ptillet as code owners May 14, 2026 12:15
@mwichro mwichro changed the title # [BACKEND] FP64: Add support for m16n8k4 path for sm90+ # [BACKEND] FP64 on Hopper: add support for m16n8k4 path for sm90+ May 14, 2026
Contributor

@Jokeren Jokeren left a comment


Originally, I was very supportive of the FP64 path, but this update now seems much larger than I expected. Adding too many if/else branches would make the code harder to maintain, especially when different architectures choose different instruction shapes and FP64 is relatively low priority. Can you share with us your plan for the FP64 functionality and what else still needs to be pushed upstream? We may want to reach an agreement on what essential features should be there before working on more PRs.

cc @lezcano

@mwichro
Copy link
Copy Markdown
Contributor Author

mwichro commented May 14, 2026

The point of this PR is to gain as much performance as possible while keeping the changes reasonably small. At the same time, I wanted to make it easy to extend the implementation to larger tiles. I was not able to get any performance gains from adding support for more instructions.

Fully closing the gap would probably require deeper changes, so I opened this PR as at least a reasonable checkpoint. I am not able to run the profiler on the university H100/H200 servers; I can't even confirm that my reasoning about the performance gap is valid.

> what else still needs to be pushed upstream?

That is a good question. For me, those changes seem to be enough; the gap to cuBLAS is reasonably small, but on the other hand, it is still 5%, so there is definitely something to gain (and this PR feels like unfinished work for me).

So, let me return the question: Where is the sweet spot concerning FP64 performance?

As for my applications, the Mx8x4 tile is perfect.

@Jokeren
Contributor

Jokeren commented May 14, 2026

> That is a good question. For me, those changes seem to be enough; the gap to cuBLAS is reasonably small, but on the other hand, it is still 5%, so there is definitely something to gain (and this PR feels like unfinished work for me).

I believe supporting FP64 is mainly for the community's benefit. I think getting 100% or higher perf would be beneficial, but I haven't looked into the details.

If you are interested in further exploring what steps have to be taken to get there, a brief report would be better than submitting separate PRs. We can leave this PR open for now without merging and come back later if you figure out that there is no good way to achieve 100% perf without significant updates.

@mwichro
Contributor Author

mwichro commented May 14, 2026

Sounds like a plan. I had already investigated why there is a gap before I opened this PR; that is why I decided to post it before going further. Would you mind providing some comments on what I think is causing the gap?

My reasoning

By looking at PTX, with some help from Claude: B operand at kWidth ≥ 2 is loaded as two scalar ld.shared.b64 per thread (instead of one ld.shared.v2.b64). The two K-adjacent B elements per thread sit at non-contiguous addresses in shared memory because B's shared encoding has order=[1,0] (N-contiguous), inherited from global memory. PTX uses two different base registers for the two loads.

This means simply emitting ld.shared.v2.b64 is not possible: the elements being loaded aren't adjacent. B's shared encoding would have to be changed to order=[0,1] (K-contiguous) when kWidth ≥ 2 for f64.
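The adjacency argument can be illustrated with plain offset arithmetic. Below is a hypothetical sketch (not Triton code; swizzling is ignored for simplicity) that computes the shared-memory byte offsets of two K-adjacent f64 B elements under both orders; only the K-contiguous layout puts them 8 bytes apart, which a v2 (16-byte) load requires:

```python
# Byte offsets of two K-adjacent f64 elements of a K x N B tile in shared
# memory, for both layout orders (swizzling ignored for simplicity).
ELEM_BYTES = 8  # sizeof(f64)
K, N = 32, 32   # B tile shape from the TTGIR dump in this thread

def offset(k, n, order):
    # order=[1, 0]: axis 1 (N) is innermost -> row-major over (k, n)
    # order=[0, 1]: axis 0 (K) is innermost -> column-major over (k, n)
    if order == [1, 0]:
        return (k * N + n) * ELEM_BYTES
    return (n * K + k) * ELEM_BYTES

# Two K-adjacent elements of the B fragment: (k, n) and (k+1, n).
d_n_contig = offset(1, 0, [1, 0]) - offset(0, 0, [1, 0])  # N-contiguous layout
d_k_contig = offset(1, 0, [0, 1]) - offset(0, 0, [0, 1])  # K-contiguous layout
print(d_n_contig, d_k_contig)  # 256 bytes apart vs 8 bytes apart
```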

Also, ldmatrix has no f64 variant.

PTX dump

cuBLAS

```
ld.shared.v2.b64 {%rd74, %rd76}, [%r170];          // A load 1
ld.shared.v2.b64 {%rd75, %rd77}, [%r170+2048];     // A load 2
... 14 more A v2 loads ...
ld.shared.b64    %rd79, [%r175+49152];             // B load 1 (scalar)
ld.shared.b64    %rd78, [%r174+49152];             // B load 2 (scalar)
... 14 more B scalar loads ...
mma.sync.aligned.m16n8k8.row.col.f64.f64.f64.f64
  { %rd169-%rd172 },                                // C
  { %rd74, %rd75, %rd76, %rd77 },                   // A (4 regs, v2 path)
  { %rd78, %rd79 },                                 // B (2 regs, scalar path)
  { %rd169-%rd172 };
```

TTGIR

```
#shared  = swizzled_shared<vec=4, perPhase=1, maxPhase=4, order=[1, 0]>
#shared1 = swizzled_shared<vec=8, perPhase=1, maxPhase=2, order=[1, 0]>

%a = ttg.local_alloc : memdesc<3x64x32xf64, #shared,  #smem, mutable>
%b = ttg.local_alloc : memdesc<3x32x32xf64, #shared1, #smem, mutable>
```
  • order=[1, 0] means axis-1 is innermost. For A (M×K) axis-1 is K → A is K-contiguous, good.
  • For B (K×N) axis-1 is N → B is N-contiguous, which is the wrong axis for the m16n8k8 B-operand fragment.
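As a sanity check on the two bullet points above, a small helper (hypothetical, not Triton's actual implementation) that derives element strides from an order list confirms which axis each operand is contiguous along:

```python
# Derive element strides of a 2-D tile from a Triton-style `order` list,
# where order[0] is the fastest-varying (innermost) axis.
def strides_from_order(shape, order):
    strides = [0] * len(shape)
    step = 1
    for axis in order:  # innermost axis first
        strides[axis] = step
        step *= shape[axis]
    return strides

# A tile is M x K = 64 x 32, B tile is K x N = 32 x 32 (from the TTGIR dump).
a_strides = strides_from_order([64, 32], [1, 0])  # axis 1 (K) innermost
b_strides = strides_from_order([32, 32], [1, 0])  # axis 1 (N) innermost
print(a_strides)  # stride 1 along axis 1 -> A is K-contiguous (good)
print(b_strides)  # stride 1 along axis 1 -> B is N-contiguous (wrong axis)
```

Flipping B's order to [0, 1] would put stride 1 on the K axis instead, which is what the v2 load needs.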

@Jokeren
Contributor

Jokeren commented May 15, 2026

Thanks for the preliminary investigation. What I actually want is a comparison of the Triton FP64 and cuBLAS FP64 NCU profiling results and assembly code. I don't think the analysis from Claude is useful without checking what cuBLAS is doing.
