
[BACKEND] Reinterpreted memory should represent the same amount of memory #10243

Merged: lezcano merged 1 commit into main from reinterpret on May 7, 2026

Conversation

@lezcano (Contributor) commented May 6, 2026

We also disallow reinterpreting the layouts of subslices, as they'd be rather cursed when the subslice is not contiguous. Instead, we ask the user to reinterpret the base layout.

We also improve the API for _reinterpret in gluon by allowing you to pass in just the attributes you want to change.
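
As a rough sketch of the API change (a minimal illustration, not the exact implementation: the keyword-only call and the defaults for omitted attributes are assumptions based on the description above; the full positional form is the one used in the repro later in this thread):

from triton.experimental.gluon import language as gl

# `smem` is assumed to be a shared-memory descriptor returned by
# gl.allocate_shared_memory inside a @gluon.jit kernel.
new_layout = gl.SwizzledSharedLayout(1, 1, 1, [1, 0])

# Before: every attribute spelled out, even the unchanged ones.
view = smem._reinterpret(gl.uint32, smem.shape, layout=new_layout)

# After (assumed): pass only the attribute you actually want to change.
view = smem._reinterpret(layout=new_layout)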

@lezcano lezcano requested review from Jokeren and Mogball May 6, 2026 12:23

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9c3b61f8d


Review comment threads:
  • lib/Tools/LinearLayout.cpp (outdated)
  • include/triton/Tools/LinearLayout.h (outdated)
  • lib/Tools/LinearLayout.cpp (outdated)
  • lib/Dialect/TritonGPU/IR/Ops.cpp
@lezcano (Contributor, Author) commented May 6, 2026

addressed

…mory

To do this, we look at the offset size while working with the pseudoinverse of the layout. We extend the pseudoinverse to take a shape so that subviews are handled properly (their layout shape is the alloc shape, which may differ from the tensor shape).
@lezcano (Contributor, Author) commented May 6, 2026

Changed the PR to disallow reinterpreting subslices, which heavily simplifies the implementation. Can you please review, @Mogball @ThomasRaoux @Jokeren?
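
For illustration, a minimal sketch of the restriction (the exact error behaviour is assumed; .index() and _reinterpret are the calls used in the repro further down, and num_bufs / other_layout are hypothetical names):

from triton.experimental.gluon import language as gl

# Inside a @gluon.jit kernel.
smem = gl.allocate_shared_memory(gl.uint32, [num_bufs, 2],
                                 gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))

# Rejected after this PR (assumed): a subslice may not be contiguous in the
# allocation, so reinterpreting it directly is disallowed.
# view = smem.index(0)._reinterpret(gl.uint32, [2], layout=other_layout)

# Suggested pattern: reinterpret the base allocation, then take the subslice.
view = smem._reinterpret(gl.uint32, smem.shape, layout=other_layout).index(0)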

@peterbell10 (Contributor) left a comment


Looks like an unrelated change was committed?

Review comment threads:
  • python/tutorials/gluon/07-persistence.py
  • python/examples/gluon/01-attention-forward.py
@lezcano lezcano requested a review from peterbell10 May 7, 2026 08:53
@lezcano lezcano merged commit 40e899b into main May 7, 2026
16 of 18 checks passed
@lezcano lezcano deleted the reinterpret branch May 7, 2026 09:58
@FindDefinition

@lezcano this PR disallows reinterpreting smem that has an "index" virtual dim as smem without the "index" dim (this works on triton 3.7 and hits an LLVM error on triton 3.6). Is it possible to support it? Otherwise we need terrible hacks to implement "buffered processing", e.g. flushing profile events when the smem buffer is full, or running a bitonic sort on the whole smem buffer when it is full.

  • bitonic sort: we use global memory instead of smem as the buffer.
  • custom profile events (not proton): we use inline PTX to generate an illegal non-uniform scalar, then use it to index smem.

@lezcano (Contributor, Author) commented May 9, 2026

Can you share a minimised repro of the issue?

@FindDefinition

@lezcano Here is a script that works on triton 3.7.0, errors on triton main, and hits an LLVM error on triton 3.6.0:

import triton.language as tl
from triton.experimental.gluon import language as gl
from triton.experimental import gluon
import torch

@gluon.jit
def smem_flush_kernel(out_ptr):
    num_smem_ev: tl.constexpr = 32
    layout: gl.constexpr = gl.BlockedLayout([1, 2], [32, 1], [1, 1], [1, 0])
    # Shared buffer with a leading "index" dim, filled row by row via .index(j).
    smem = gl.allocate_shared_memory(gl.uint32, [num_smem_ev, 2], gl.SwizzledSharedLayout(1, 1, 1, [0]))
    for j in range(32):
        value = gl.full([2], j, dtype=gl.uint32, layout=gl.SliceLayout(0, layout))
        smem.index(j).store(value)

    # Reinterpret the whole buffer with a 2D swizzled layout, then load it back.
    smem_rep = smem._reinterpret(gl.uint32, smem.shape, layout=gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))
    smem_val = smem_rep.load(layout)
    offs_m = gl.arange(0, smem.shape[0], layout=gl.AutoLayout())
    offs_n = gl.arange(0, smem.shape[1], layout=gl.AutoLayout())
    gl.store(out_ptr + offs_m[:, None] * 2 + offs_n[None, :], smem_val)

def main():
    out = torch.zeros((32, 2), dtype=torch.uint32, device="cuda")
    smem_flush_kernel[(1,)](out, num_warps=1)
    print(out)

if __name__ == "__main__":
    main()

@lezcano (Contributor, Author) commented May 9, 2026

Ah, right. Yeah, that we can support. Will add support for that on Monday

@lezcano (Contributor, Author) commented May 11, 2026

fixed in #10286

lezcano added a commit that referenced this pull request May 13, 2026
We also check that when reinterpreting a pipelining buffer, the initial
dimensions are the same.

Addresses
#10243 (comment)
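
For context, a hedged sketch of what that check means (the exact shapes, the rejection behaviour, and num_stages are assumptions; only gl.allocate_shared_memory and _reinterpret from this thread are used):

from triton.experimental.gluon import language as gl

# Inside a @gluon.jit kernel; a pipelining buffer whose leading dimension
# indexes the stage/buffer.
smem = gl.allocate_shared_memory(gl.uint32, [num_stages, 32, 2],
                                 gl.SwizzledSharedLayout(1, 1, 1, [2, 1, 0]))

# OK (assumed): leading pipelining dimension unchanged, inner dims reshaped
# to the same amount of memory per stage.
ok = smem._reinterpret(gl.uint32, [num_stages, 64],
                       layout=gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))

# Rejected (assumed): the leading dimension no longer matches.
# bad = smem._reinterpret(gl.uint32, [2 * num_stages, 32],
#                         layout=gl.SwizzledSharedLayout(1, 1, 1, [1, 0]))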
