[AMDGPU][SIInsertWaitcnts] Fix iota_range assertion when OtherMarks is empty in mergeAsyncMarks()#193499
Conversation
|
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums. |
|
@llvm/pr-subscribers-backend-amdgpu Author: xiaohuguo2023 ChangesWaitcntBrackets::mergeAsyncMarks() asserts when merging async wait-count Problem:
Changes:
Full diff: https://github.com/llvm/llvm-project/pull/193499.diff 2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index abd03d5f9bf73..120faaa82a315 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -3024,6 +3024,8 @@ bool WaitcntBrackets::mergeAsyncMarks(ArrayRef<MergeInfo> MergeInfos,
unsigned OtherSize = OtherMarks.size();
unsigned OurSize = AsyncMarks.size();
unsigned MergeCount = std::min(OtherSize, OurSize);
+ if (MergeCount == 0)
+ return StrictDom;
for (auto Idx : seq_inclusive<unsigned>(1, MergeCount)) {
for (auto T : inst_counter_types(Context->MaxCounter)) {
StrictDom |= mergeScore(MergeInfos[T], AsyncMarks[OurSize - Idx][T],
diff --git a/llvm/test/CodeGen/AMDGPU/asyncmark-merge-empty-other.mir b/llvm/test/CodeGen/AMDGPU/asyncmark-merge-empty-other.mir
new file mode 100644
index 0000000000000..ec9c33f33a1f8
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/asyncmark-merge-empty-other.mir
@@ -0,0 +1,75 @@
+# RUN: llc -mtriple=amdgcn -mcpu=gfx950 -verify-machineinstrs -run-pass=si-insert-waitcnts -o - %s | FileCheck %s
+
+# Regression test for mergeAsyncMarks() asserting when OtherMarks is empty.
+#
+# At a CFG join point where one predecessor has an ASYNCMARK (non-empty
+# AsyncMarks) and the other does not (empty OtherMarks), MergeCount becomes
+# min(0, N) = 0. Before the fix, seq_inclusive<unsigned>(1, 0) asserted
+# Begin <= End. After the fix the function returns early: the other predecessor
+# contributed no async marks so our marks are unchanged and no stricter waits
+# are needed.
+#
+# Key check: no S_WAITCNT is inserted at the join block (bb.3). The else
+# branch had no async operations so merging its empty marks into the then
+# branch's non-empty marks requires no additional waits.
+
+--- |
+ define void @asyncmark_merge_empty_other(i32 %cond) {
+ entry:
+ br i1 true, label %then, label %else
+ then:
+ br label %join
+ else:
+ br label %join
+ join:
+ ret void
+ }
+...
+
+---
+name: asyncmark_merge_empty_other
+tracksRegLiveness: true
+machineFunctionInfo:
+ occupancy: 8
+body: |
+ ; entry — conditional branch to then (async) or else (no async)
+ bb.0.entry:
+ successors: %bb.1(0x40000000), %bb.2(0x40000000)
+ liveins: $sgpr0, $sgpr1
+
+ ; CHECK-LABEL: name: asyncmark_merge_empty_other
+ ; CHECK: bb.0.entry:
+ ; CHECK: S_CMP_LG_U32 $sgpr0, $sgpr1, implicit-def $scc
+ ; CHECK-NEXT: S_CBRANCH_SCC1 %bb.2, implicit killed $scc
+ S_CMP_LG_U32 $sgpr0, $sgpr1, implicit-def $scc
+ S_CBRANCH_SCC1 %bb.2, implicit killed $scc
+
+ ; then branch — ASYNCMARK records non-empty AsyncMarks
+ bb.1.then:
+ successors: %bb.3(0x80000000)
+
+ ; CHECK: bb.1.then:
+ ; CHECK: ASYNCMARK
+ ; CHECK-NEXT: S_BRANCH %bb.3
+ ASYNCMARK
+ S_BRANCH %bb.3
+
+ ; else branch — no async operations, OtherMarks is empty at join
+ bb.2.else:
+ successors: %bb.3(0x80000000)
+
+ ; CHECK: bb.2.else:
+ ; CHECK: S_BRANCH %bb.3
+ S_BRANCH %bb.3
+
+ ; join — mergeAsyncMarks() sees non-empty AsyncMarks (bb.1) and empty
+ ; OtherMarks (bb.2). Before the fix: assertion. After the fix: early
+ ; return, no waitcnt inserted.
+ bb.3.join:
+ ; CHECK: bb.3.join:
+ ; CHECK-NOT: S_WAITCNT
+ ; CHECK: S_ENDPGM 0
+ ; Verify no waitcnt is inserted before S_ENDPGM — the else predecessor
+ ; had no async marks so no stricter waits are needed at the join.
+ S_ENDPGM 0
+...
|
There was a problem hiding this comment.
-NOT checks are fragile, use update_mir_test_checks
There was a problem hiding this comment.
| # RUN: llc -mtriple=amdgcn -mcpu=gfx950 -run-pass=si-insert-waitcnts -o - %s | FileCheck %s |
There was a problem hiding this comment.
wouldn't it make sense to place this at the top of the function, where there already is the AsyncMarks.empty() && OtherMarks.empty() check?
There was a problem hiding this comment.
Good call — I've moved the guard to the top of the function and changed && to || to cover the one-side-empty case, which also lets us drop the mid-function MergeCount == 0 check. Thanks for the suggestion!
There was a problem hiding this comment.
This isn't entirely accurate. No wait is generated because there is no wait.asyncmark instruction. The should have a wait.asyncmark in the join block, and it should emit an S_WAIT wait without crashing. For completeness, the test should also exercise the opposite pattern where the single asyncmark is in the else successor instead of the then successor.
There was a problem hiding this comment.
ah, good catch, I've rewritten it with two functions (asyncmark_in_then / asyncmark_in_else) that use a real BUFFER_LOAD_DWORD_OFFSET + ASYNCMARK on the async path and a consumer V_ADD_U32_e32 in the join block, so the generated S_WAITCNT can now be verified.
There was a problem hiding this comment.
The ordinary BUFFER_LOAD_DWORD_OFFSET does not use ASYNCMARK at all. On gfx900, you need BUFFER_LOAD_DWORD_LDS_OFFEN with the async bit set in its aux/cpol argument. Or on gfx1250, you need GLOBAL_LOAD_ASYNC_TO_LDS. Both are not strictly necessary, but I would suggest using only gfx1250.
The test is still missing a WAIT_ASYNCMARK in the join block. Right now, the wait is being generated because the buffer load is in fact not async at all.
There was a problem hiding this comment.
@ssahasra
Many Thanks for the detailed feedback. You're right that BUFFER_LOAD_DWORD_OFFSET is synchronous and the S_WAITCNT in the join was just ordinary vmcnt tracking, not async mark machinery.
-
I've replaced it with BUFFER_LOAD_DWORD_LDS_IDXEN (IsAsync=1) and added WAIT_ASYNCMARK 0 to the join block.
-
For the target: gfx1250 isn't currently feasible ?
- ASYNCMARK hits assert(ST.getGeneration() < AMDGPUSubtarget::GFX12) (SIInsertWaitcnts.cpp:3273) and
- WAIT_ASYNCMARK hits reportFatalUsageError("WAIT_ASYNCMARK is not ready for GFX12 yet") (line 2096).
- I've used gfx950 instead since that's the hardware where the original bug was found.
There was a problem hiding this comment.
For the target: gfx1250 isn't currently feasible ?
Is it possible that you are using a stale copy? Support for GFX1250 was added in #185813.
I've used gfx950 instead since that's the hardware where the original bug was found.
GFX9 uses VM_CNT to track async buffer loads. GFX1250 supports new intrinsics like load.async.to.lds, and these are tracked using ASYNC_CNT instead.
There was a problem hiding this comment.
@ssahasra You were right — my copy was stale. After rebasing onto upstream main I updated the test to gfx1250 with GLOBAL_LOAD_ASYNC_TO_LDS_B32 and WAIT_ASYNCMARK 0 in the join block. On the else-pattern the pass correctly expands WAIT_ASYNCMARK 0 to S_WAIT_ASYNCCNT 0.
There was a problem hiding this comment.
Done, thanks!
In mergeAsyncMarks(), when one branch of a conditional (e.g. EVEN_K=False K-loop boundary) has no async memory ops, OtherMarks can be empty while AsyncMarks is not. The existing early-exit only handles both-empty, so MergeCount = min(0, N) = 0 reaches seq_inclusive<unsigned>(1, 0) which asserts Begin <= End. Fix: return early when MergeCount == 0 — nothing to merge. Reproducer: compile a StreamK GEMM kernel on gfx950 (MI350X) with BLOCK_SIZE_K=256, K=2880 (not divisible → EVEN_K=False), num_stages=2, waves_per_eu=4, matrix_instr_nonkdim=32. The assert fires deterministically even when compiled alone (not a race). Fixes: iota_range<unsigned>(1, 0) → Assertion Begin <= End failed
- Regenerate CHECK lines using update_mir_test_checks.py - Remove -verify-machineinstrs from RUN line
Change `&&` to `||` in the early-exit check so the function returns immediately when either AsyncMarks or OtherMarks is empty, rather than falling through to the resize/pad logic and hitting a seq_inclusive assert (Begin > End) when MergeCount == 0. Removes the now-unreachable mid-function MergeCount == 0 guard.
Test two CFG patterns where one predecessor carries an ASYNCMARK from a
BUFFER_LOAD and the other is a sync path (empty OtherMarks):
asyncmark_in_then: ASYNCMARK in then-successor, else is sync
asyncmark_in_else: ASYNCMARK in else-successor, then is sync
Before the fix, mergeAsyncMarks() asserted Begin <= End via
seq_inclusive<unsigned>(1, 0). After the fix the function returns
early when either side is empty and preserves the pending async marks,
so S_WAITCNT is correctly emitted before the consumer in the join block
….join) as suggested in review
- Replaced BUFFER_LOAD_DWORD_OFFSET with BUFFER_LOAD_DWORD_LDS_IDXEN + IsAsync=1 — a real async LDS DMA that isAsync() recognises, recording AsyncScore[LOAD_CNT] before ASYNCMARK pushes it onto AsyncMarks - Added $m0 = S_MOV_B32 0 in bb.0 (required for LDS DMA) - Added WAIT_ASYNCMARK 0 to the join block so the async mark machinery is actually consumed - Changed join block consumer to DS_READ_B32_gfx9 (reads from LDS where the async DMA wrote) - Regenerated CHECK lines
Update MIR test to use gfx1250 async LDS DMA Switch from gfx950 + BUFFER_LOAD_DWORD_LDS_IDXEN to gfx1250 + GLOBAL_LOAD_ASYNC_TO_LDS_B32 as suggested in review. On GFX1250 the async load is tracked via ASYNC_CNT rather than VM_CNT, which properly exercises the ASYNCMARK/WAIT_ASYNCMARK machinery. The join block carries WAIT_ASYNCMARK 0; in asyncmark_in_else the pass expands it to S_WAIT_ASYNCCNT 0 to drain the pending async count. Rebased onto upstream main to pick up GFX1250 ASYNCMARK support added in llvm#185813. Regenerate CHECK lines with update_mir_test_checks.py.
4957222 to
5d9b8e4
Compare
| successors: %bb.3(0x80000000) | ||
| liveins: $vgpr0_vgpr1, $vgpr2 | ||
|
|
||
| GLOBAL_LOAD_ASYNC_TO_LDS_B32 $vgpr2, $vgpr0_vgpr1, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`) |
There was a problem hiding this comment.
| GLOBAL_LOAD_ASYNC_TO_LDS_B32 $vgpr2, $vgpr0_vgpr1, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`) | |
| GLOBAL_LOAD_ASYNC_TO_LDS_B32 $vgpr2, $vgpr0_vgpr1, 0, 0, implicit-def $asynccnt, implicit $exec, implicit $asynccnt :: (load (s32), addrspace 1), (store (s32), addrspace 3) |
There was a problem hiding this comment.
sorry, missed it. done now
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed. |
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed. |
To consolidate the one-side-empty guard at the top of the function,
the early-exit was changed from && to ||. However, after rebasing onto
upstream/main, the || condition broke asyncmark-max-pregfx12.ll: the
padding logic now sits between the early-exit and the seq_inclusive loop,
so returning early when AsyncMarks.empty() && OtherMarks non-empty skips
the padding and silently drops OtherMarks.
The correct fix keeps && at the top (handles the both-empty case, leaves
padding intact for the AsyncMarks-empty case) and adds:
if (MergeCount == 0)
return StrictDom;
after the padding to handle the OtherMarks-empty case without touching
the padding logic or triggering seq_inclusive<unsigned>(1, 0).
Update asyncmark-merge-empty-other.mir: with OtherMarks no longer
dropped, asyncmark_in_else correctly emits S_WAIT_ASYNCCNT 0 at the
join block.
this should be fixed now |
| @@ -3067,6 +3067,11 @@ bool WaitcntBrackets::mergeAsyncMarks(ArrayRef<MergeInfo> MergeInfos, | |||
| unsigned OtherSize = OtherMarks.size(); | |||
| unsigned OurSize = AsyncMarks.size(); | |||
| unsigned MergeCount = std::min(OtherSize, OurSize); | |||
| // OtherMarks is empty → OtherSize == 0 → MergeCount == 0. | |||
There was a problem hiding this comment.
→ is not ASICII. I'm not sure what our policy is, but I'd use -> instead.
| @@ -0,0 +1,139 @@ | |||
| # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 6 | |||
| # RUN: llc -mtriple=amdgcn -mcpu=gfx1250 -run-pass=si-insert-waitcnts -o - %s | FileCheck %s | |||
There was a problem hiding this comment.
| # RUN: llc -mtriple=amdgcn -mcpu=gfx1250 -run-pass=si-insert-waitcnts -o - %s | FileCheck %s | |
| # RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1250 -run-pass=si-insert-waitcnts < %s | FileCheck %s |
There was a problem hiding this comment.
This should not use < %s, that will think the input is IR
There was a problem hiding this comment.
< %s fails for MIR files (llc can't detect format from stdin), so kept -o - %s but applied the triple fix amdgcn → amdgcn-amd-amdhsa
|
@xiaohuguo2023 Congratulations on having your first Pull Request (PR) merged into the LLVM Project! Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR. Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues. How to do this, and the rest of the post-merge process, is covered in detail here. If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again. If you don't get any reports, no action is required from you. Your changes are working as expected, well done! |
…s empty in mergeAsyncMarks() (llvm#193499) WaitcntBrackets::mergeAsyncMarks() asserts when merging async wait-count state at a CFG join point where one predecessor has pending async memory operations and the other does not. Problem: - The existing early-exit only handles the both-empty case - When OtherMarks is empty but AsyncMarks is not, MergeCount = min(0, N) = 0 - seq_inclusive<unsigned>(1, 0) fires: "Assertion Begin <= End failed" Changes: - Add early return when MergeCount == 0 (OtherMarks is empty) - When the other predecessor contributed no async marks, our marks are unchanged and no stricter waits are needed - Add regression lit test: asyncmark-merge-empty-other.mir
…s empty in mergeAsyncMarks() (llvm#193499) WaitcntBrackets::mergeAsyncMarks() asserts when merging async wait-count state at a CFG join point where one predecessor has pending async memory operations and the other does not. Problem: - The existing early-exit only handles the both-empty case - When OtherMarks is empty but AsyncMarks is not, MergeCount = min(0, N) = 0 - seq_inclusive<unsigned>(1, 0) fires: "Assertion Begin <= End failed" Changes: - Add early return when MergeCount == 0 (OtherMarks is empty) - When the other predecessor contributed no async marks, our marks are unchanged and no stricter waits are needed - Add regression lit test: asyncmark-merge-empty-other.mir
This reverts commit 36394d4. We saw some compile failure cases with disabled vector combine. But should be resolved and should re-land this commit once with llvm/llvm-project#193499 or commit `81d618b6bc1e71cda79fe7bf9cbab63933dd5975` gets picked up in the next LLVM bump.
WaitcntBrackets::mergeAsyncMarks() asserts when merging async wait-count
state at a CFG join point where one predecessor has pending async memory
operations and the other does not.
Problem:
Changes:
unchanged and no stricter waits are needed