
[BACKEND][EZ] Tighten TMA multicta verifier#9941

Merged
lezcano merged 6 commits into main from multicast_2cta on Apr 10, 2026


@lezcano lezcano commented Apr 7, 2026

As per the PTX docs, TMAs have a very specific behaviour when executed
in a 2CTA kernel:

> .cta_group::1 : The mbarrier signal is also multicasted to the same offset as mbar in the shared memory of the destination CTA.
> .cta_group::2 : The mbarrier signal is multicasted either to all the odd numbered CTAs or the even numbered CTAs within the corresponding CTA-Pair. For each destination CTA specified in the ctaMask, the mbarrier signal is sent either to the destination CTA or its peer-CTA based on CTAs %cluster_ctarank parity of shared memory where the mbarrier object mbar resides.

As such, we require these CTA layouts in TMA barriers.
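
For illustration only, here is a small Python model of how I read the two quoted rules. It is not code from this PR or from Triton: the function name `mbarrier_signal_targets`, its parameters, and the default cluster size of 16 are all hypothetical, and the `.cta_group::2` branch encodes one interpretation of the PTX wording (each destination in `ctaMask` is redirected to whichever CTA of its pair has the same rank parity as the CTA holding `mbar`).

```python
from __future__ import annotations


def mbarrier_signal_targets(
    cta_mask: int, mbar_cta_rank: int, cta_group: int, num_ctas: int = 16
) -> set[int]:
    """Return the cluster ranks that would receive the mbarrier signal.

    Hypothetical model of the quoted PTX text, not an official API.
    """
    if cta_group not in (1, 2):
        raise ValueError("cta_group must be 1 or 2")
    targets = set()
    for dest in range(num_ctas):
        if not (cta_mask >> dest) & 1:
            continue  # dest is not selected by ctaMask
        if cta_group == 1:
            # Signal lands at the same mbar offset in the destination CTA itself.
            targets.add(dest)
        else:
            # Assumed reading: pick the CTA in dest's pair whose rank parity
            # matches the parity of the CTA where mbar resides.
            targets.add((dest & ~1) | (mbar_cta_rank & 1))
    return targets


# With all four CTAs of a 4-CTA cluster in the mask:
print(mbarrier_signal_targets(0b1111, mbar_cta_rank=0, cta_group=1))
print(mbarrier_signal_targets(0b1111, mbar_cta_rank=0, cta_group=2))
print(mbarrier_signal_targets(0b1111, mbar_cta_rank=1, cta_group=2))
```

Under this reading, a `.cta_group::2` copy only ever signals barriers in CTAs of a single parity, which is why the barrier's CTA layout has to be constrained accordingly.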

@lezcano lezcano requested a review from ptillet as a code owner April 7, 2026 14:31
@lezcano lezcano changed the title [BACKEND] Tighten TMA multicta layouts [BACKEND][EZ] Tighten TMA multicta verifier Apr 7, 2026
@lezcano lezcano enabled auto-merge (squash) April 7, 2026 15:48

lezcano commented Apr 8, 2026

Needs #9958. I'll merge this once that one is merged.

lezcano and others added 5 commits April 9, 2026 19:37
Co-authored-by: Codex <noreply@openai.com>
@lezcano lezcano merged commit 76fbef9 into main Apr 10, 2026
17 of 18 checks passed
@lezcano lezcano deleted the multicast_2cta branch April 10, 2026 08:34
plognjen pushed a commit to plognjen/triton that referenced this pull request Apr 14, 2026