@chsigg chsigg commented Aug 23, 2024

Hopper has very low throughput for conversion instructions, which causes these operations to quickly become an ALU bottleneck. Restating the conversion in terms of bitwise ops and SIMD bf16 instructions increases the throughput significantly and translates to meaningful speedups (e.g. 10% end-to-end on one matmul I was looking at).
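For readers unfamiliar with the idea, here is a minimal Python sketch of how an s8->bf16 conversion can be restated with bitwise ops plus one bf16 subtraction. It is an illustration under stated assumptions, not the PTX emitted by this PR: the helper names `bf16_bits_to_float` and `s8_to_bf16_bitwise` are hypothetical, and the final subtraction is emulated with Python floats rather than a packed bf16 instruction.

```python
import struct

BF16_128 = 0x4300  # bf16 bit pattern of 128.0 (exponent 2**7, zero mantissa)


def bf16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit bf16 pattern; bf16 is the upper half of an IEEE float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def s8_to_bf16_bitwise(x: int) -> float:
    """Convert a signed 8-bit integer without an integer-to-float instruction.

    Hypothetical helper for illustration only.
    """
    if not -128 <= x <= 127:
        raise ValueError("x must fit in a signed 8-bit integer")
    b = x & 0xFF  # two's-complement byte
    # The low 7 bits land in the 7-bit bf16 mantissa: value is 128 + (b & 0x7F).
    biased = bf16_bits_to_float(BF16_128 | (b & 0x7F))
    # The sign bit lands in the lowest exponent bit: 128.0 if clear, 256.0 if set.
    offset = bf16_bits_to_float(BF16_128 | (b & 0x80))
    # (128 + low7) - (128 + 128 * sign) == low7 - 128 * sign == x
    return biased - offset


if __name__ == "__main__":
    for value in (-128, -1, 0, 1, 127):
        print(value, s8_to_bf16_bitwise(value))
```

Every result lies in [-128, 127], which bf16 represents exactly with its 8 significant bits, so the subtraction introduces no rounding. On a GPU, the two bit patterns for a pair of bytes could presumably be assembled with byte-permute and mask instructions and subtracted with a single packed bf16 operation, which is where the throughput gain would come from.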

@chsigg chsigg requested a review from ptillet as a code owner August 23, 2024 11:07

@ThomasRaoux ThomasRaoux left a comment

LGTM

@chsigg chsigg merged commit 241e89c into triton-lang:main Aug 28, 2024
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
…s8->bf16 conversions (triton-lang#4563)

Co-authored-by: Adam Paszke <[email protected]>