Transposed read/writes #176
Comments
This seems like a bug. Triton is supposed to order threads so that it stays coalesced for both reads and writes, and automatically figure out that it should use shared memory. It used to work properly. Thanks for reporting, I'll look into it.
Hey! Sorry for the delay. I've just merged a bunch of fixes into …
Hi,
Is there a way to perform efficient transposed reads/writes in Triton? As an exercise, I wanted to write a transpose kernel in Triton. Taking inspiration from the matmul example in the docs, I wrote the following kernel, which naively transposes the input by swapping the strides.
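(The original kernel did not survive in this page. As a rough sketch of the stride-swap idea in NumPy terms — purely illustrative, not the reporter's Triton code — transposing can be expressed as copying from a view of the input whose strides have been swapped:)

```python
import numpy as np

# Illustrative only: a transpose is a copy whose source is read with
# swapped strides. This is the "swap the strides" trick the kernel uses.
a = np.arange(12, dtype=np.float32).reshape(3, 4)  # C-contiguous input

# View `a` with shape and strides swapped: this *is* the transpose,
# without moving any data yet.
a_t_view = np.lib.stride_tricks.as_strided(
    a,
    shape=(a.shape[1], a.shape[0]),
    strides=(a.strides[1], a.strides[0]),
)

# Materializing the view performs the actual strided gather (the part
# that, on a GPU, turns into uncoalesced memory traffic on one side).
out = np.ascontiguousarray(a_t_view)
assert np.array_equal(out, a.T)
```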
However, this performs suboptimally (likely because the store becomes uncoalesced), unless the destination is itself lazily transposed, in which case the write becomes coalesced and the kernel approaches the speed of a plain copy.
It would be very useful to have a generic way to express such transposed operations while avoiding uncoalesced reads/writes to DRAM (similar to how, in CUDA, one can use shared memory to transpose elements efficiently). If there is already a way to achieve this in Triton, I would appreciate some hints; if not, perhaps a tl.transpose function could be introduced?
Best regards,
Kenny
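(The shared-memory technique alluded to above can be sketched in NumPy terms — an illustration of the tiling pattern, not CUDA or Triton code; the tile size is an arbitrary choice. Each tile is read from a contiguous region, transposed in a small local buffer standing in for shared memory, and written back to a contiguous region, so neither side of the copy strides through DRAM element by element:)

```python
import numpy as np

def tiled_transpose(a: np.ndarray, tile: int = 32) -> np.ndarray:
    """Transpose `a` tile by tile, mimicking the CUDA shared-memory
    pattern: read a tile from contiguous rows, transpose it locally
    (the shared-memory stand-in), then write contiguous rows of the
    output."""
    m, n = a.shape
    out = np.empty((n, m), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            block = a[i:i + tile, j:j + tile]      # read: contiguous rows of `a`
            out[j:j + tile, i:i + tile] = block.T  # write: contiguous rows of `out`
    return out

a = np.random.rand(100, 65).astype(np.float32)
assert np.array_equal(tiled_transpose(a), a.T)
```

Slicing handles edge tiles automatically, so the shape need not be a multiple of the tile size.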