[BACKEND] Reinterpreted memory should represent the same amount of memory #10243
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request. Reviewed commit: e9c3b61f8d.
---
addressed
…mory

To do this we look at the offset size while computing the pseudoinverse of the layout. We extend the pseudoinverse to take a shape so that subviews are handled properly (their layout shape is the alloc shape, which may differ from the tensor shape).
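For intuition, here is a minimal sketch of the invariant being enforced, stated in terms of raw shapes and element bit widths (a hypothetical helper, not the actual backend code, which works on layouts via the pseudoinverse as described above):

```python
import math


def reinterpret_is_legal(old_shape, old_bitwidth, new_shape, new_bitwidth):
    # A reinterpret is only legal when both views cover the same number of
    # bits of the underlying shared-memory allocation.
    old_bits = math.prod(old_shape) * old_bitwidth
    new_bits = math.prod(new_shape) * new_bitwidth
    return old_bits == new_bits


# [32, 2] x uint32 may be viewed as [32, 4] x uint16 (2048 bits either way),
# but not as [32, 4] x uint32, which would claim twice the memory.
assert reinterpret_is_legal([32, 2], 32, [32, 4], 16)
assert not reinterpret_is_legal([32, 2], 32, [32, 4], 32)
```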
---
Changed the PR to disallow reinterprets of subslices, which heavily simplifies the implementation. Can you please review, @Mogball @ThomasRaoux @Jokeren?
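For illustration, a sketch of the pattern this rules out versus the suggested alternative, based on the Gluon calls from the repro below (layout values are placeholders, and the snippet is meant to live inside a @gluon.jit kernel):

```python
from triton.experimental.gluon import language as gl

# Base shared-memory allocation.
smem = gl.allocate_shared_memory(gl.uint32, [32, 2], gl.SwizzledSharedLayout(1, 1, 1, [0]))

# Disallowed after this PR: reinterpreting a subslice such as smem.index(0),
# whose elements may not be contiguous in the underlying allocation.
# smem.index(0)._reinterpret(gl.uint32, [2], layout=gl.SwizzledSharedLayout(1, 1, 1, [0]))

# Instead: reinterpret the base allocation, then take the subslice.
smem_rep = smem._reinterpret(gl.uint32, [32, 2], layout=gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))
view = smem_rep.index(0)
```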
---
@lezcano this PR disallows reinterpreting smem with an "index" virtual dim to another smem without the "index" dim (works on Triton 3.7, LLVM error on Triton 3.6). Is it possible to support it? Otherwise we need terrible hacks to implement "buffered processing", e.g. flushing profile events when the smem buffer is full, or doing a bitonic sort on the whole smem buffer when it is full.
---
Can you share a minimised repro of the issue?
---
@lezcano Here is a script that works on Triton 3.7.0, errors on Triton main, and gives an LLVM error on Triton 3.6.0:

```python
import torch

import triton.language as tl
from triton.experimental import gluon
from triton.experimental.gluon import language as gl


@gluon.jit
def smem_flush_kernel(out_ptr):
    num_smem_ev: tl.constexpr = 32
    layout: gl.constexpr = gl.BlockedLayout([1, 2], [32, 1], [1, 1], [1, 0])
    smem = gl.allocate_shared_memory(gl.uint32, [num_smem_ev, 2], gl.SwizzledSharedLayout(1, 1, 1, [0]))
    for j in range(32):
        value = gl.full([2], j, dtype=gl.uint32, layout=gl.SliceLayout(0, layout))
        smem.index(j).store(value)
    # Reinterpret the whole buffer with a different swizzled layout, then load it back.
    smem_rep = smem._reinterpret(gl.uint32, smem.shape, layout=gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))
    smem_val = smem_rep.load(layout)
    offs_m = gl.arange(0, smem.shape[0], layout=gl.AutoLayout())
    offs_n = gl.arange(0, smem.shape[1], layout=gl.AutoLayout())
    gl.store(out_ptr + offs_m[:, None] * 2 + offs_n[None, :], smem_val)


def main():
    out = torch.zeros((32, 2), dtype=torch.uint32, device="cuda")
    smem_flush_kernel[(1,)](out, num_warps=1)
    print(out)


if __name__ == "__main__":
    main()
```
---
Ah, right. Yeah, that we can support. Will add support for that on Monday.
---
fixed in #10286 |
We also check that when reinterpreting a pipelining buffer, the initial dimensions are the same. Addresses #10243 (comment)
We also disallow reinterpreting the layouts of subslices, as they'd be rather cursed when the subslice is not contiguous. Instead we ask the user to reinterpret the base layout.
We also improve the API for `_reinterpret` in Gluon by allowing you to pass in just the attributes you want to change.
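For example, something along these lines (assuming omitted attributes default to the current dtype and shape; keyword handling not verified against the merged API):

```python
# Old style: spell out dtype and shape even though only the layout changes.
smem_rep = smem._reinterpret(gl.uint32, smem.shape, layout=gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))

# Improved API: pass just the attribute you want to change.
smem_rep = smem._reinterpret(layout=gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))
```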