[AMDGPU] Make WaitcntBrackets::simplifyWaitcnt const again #173390
Conversation
The original design was:

- WaitcntBrackets::simplifyWaitcnt(Wait) updates Wait based on the current state of WaitcntBrackets, removing unnecessary waits.
- WaitcntBrackets::applyWaitcnt(Wait) updates WaitcntBrackets based on Wait, updating the state by applying the specified waits.

This was changed by llvm#164357, which started calling applyWaitcnt from simplifyWaitcnt. This patch restores the original design without any significant functional changes. There is some code duplication because both simplifyWaitcnt and applyWaitcnt need to understand how XCNT interacts with other counters like LOADCNT and KMCNT.
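For illustration, a minimal sketch of that division of labor, using simplified names and a single counter (this is not the real WaitcntBrackets interface): simplifyWaitcnt is a pure query and can therefore be const, while applyWaitcnt is the only mutator.

```cpp
#include <cstdio>

// Sketch only: one counter's pending-op window, tracked as a score range.
struct Brackets {
  unsigned ScoreLB = 0; // oldest op still pending
  unsigned ScoreUB = 0; // most recently issued op

  // Query only (const): drop a requested wait that the current state
  // already satisfies. ~0u encodes "no wait requested".
  void simplifyWaitcnt(unsigned &Count) const {
    if (Count != ~0u && ScoreUB - ScoreLB <= Count)
      Count = ~0u; // at most Count ops pending anyway; wait is redundant
  }

  // Mutation only: record that a wait of Count was actually emitted,
  // so at most Count ops of this type remain outstanding.
  void applyWaitcnt(unsigned Count) {
    if (Count != ~0u && ScoreUB - ScoreLB > Count)
      ScoreLB = ScoreUB - Count;
  }
};

int main() {
  Brackets B;
  B.ScoreUB = 3;            // three ops issued, none completed
  unsigned Count = 5;
  B.simplifyWaitcnt(Count); // only 3 pending, so waiting for <=5 is a no-op
  printf("simplified: 0x%x\n", Count); // 0xffffffff
  B.applyWaitcnt(1);        // an emitted wait leaves at most one pending
  printf("pending: %u\n", B.ScoreUB - B.ScoreLB); // 1
}
```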
@llvm/pr-subscribers-backend-amdgpu

Author: Jay Foad (jayfoad)

Full diff: https://github.com/llvm/llvm-project/pull/173390.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index e21583ae0876f..e927141ae0fe6 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -656,11 +656,11 @@ class WaitcntBrackets {
bool merge(const WaitcntBrackets &Other);
bool counterOutOfOrder(InstCounterType T) const;
- void simplifyWaitcnt(AMDGPU::Waitcnt &Wait);
+ void simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const;
void simplifyWaitcnt(InstCounterType T, unsigned &Count) const;
- bool hasRedundantXCntWithKmCnt(const AMDGPU::Waitcnt &Wait);
- bool canOptimizeXCntWithLoadCnt(const AMDGPU::Waitcnt &Wait);
- void simplifyXcnt(AMDGPU::Waitcnt &CheckWait, AMDGPU::Waitcnt &UpdateWait);
+ bool hasRedundantXCntWithKmCnt(const AMDGPU::Waitcnt &Wait) const;
+ bool canOptimizeXCntWithLoadCnt(const AMDGPU::Waitcnt &Wait) const;
+ void simplifyXcnt(AMDGPU::Waitcnt &Wait) const;
void determineWaitForPhysReg(InstCounterType T, MCPhysReg Reg,
AMDGPU::Waitcnt &Wait) const;
@@ -1212,7 +1212,7 @@ void WaitcntBrackets::print(raw_ostream &OS) const {
/// Simplify the waitcnt, in the sense of removing redundant counts, and return
/// whether a waitcnt instruction is needed at all.
-void WaitcntBrackets::simplifyWaitcnt(AMDGPU::Waitcnt &Wait) {
+void WaitcntBrackets::simplifyWaitcnt(AMDGPU::Waitcnt &Wait) const {
simplifyWaitcnt(LOAD_CNT, Wait.LoadCnt);
simplifyWaitcnt(EXP_CNT, Wait.ExpCnt);
simplifyWaitcnt(DS_CNT, Wait.DsCnt);
@@ -1220,7 +1220,7 @@ void WaitcntBrackets::simplifyWaitcnt(AMDGPU::Waitcnt &Wait) {
simplifyWaitcnt(SAMPLE_CNT, Wait.SampleCnt);
simplifyWaitcnt(BVH_CNT, Wait.BvhCnt);
simplifyWaitcnt(KM_CNT, Wait.KmCnt);
- simplifyXcnt(Wait, Wait);
+ simplifyXcnt(Wait);
}
void WaitcntBrackets::simplifyWaitcnt(InstCounterType T,
@@ -1332,16 +1332,32 @@ void WaitcntBrackets::applyWaitcnt(InstCounterType T, unsigned Count) {
setScoreLB(T, UB);
PendingEvents &= ~Context->WaitEventMaskForInst[T];
}
+
+ if (T == KM_CNT && Count == 0 && hasPendingEvent(SMEM_GROUP)) {
+ if (!hasMixedPendingEvents(X_CNT))
+ applyWaitcnt(X_CNT, 0);
+ else
+ PendingEvents &= ~(1 << SMEM_GROUP);
+ }
+ if (T == LOAD_CNT && hasPendingEvent(VMEM_GROUP) &&
+ !hasPendingEvent(STORE_CNT)) {
+ if (!hasMixedPendingEvents(X_CNT))
+ applyWaitcnt(X_CNT, Count);
+ else if (Count == 0)
+ PendingEvents &= ~(1 << VMEM_GROUP);
+ }
}
-bool WaitcntBrackets::hasRedundantXCntWithKmCnt(const AMDGPU::Waitcnt &Wait) {
+bool WaitcntBrackets::hasRedundantXCntWithKmCnt(
+ const AMDGPU::Waitcnt &Wait) const {
// Wait on XCNT is redundant if we are already waiting for a load to complete.
// SMEM can return out of order, so only omit XCNT wait if we are waiting till
// zero.
return Wait.KmCnt == 0 && hasPendingEvent(SMEM_GROUP);
}
-bool WaitcntBrackets::canOptimizeXCntWithLoadCnt(const AMDGPU::Waitcnt &Wait) {
+bool WaitcntBrackets::canOptimizeXCntWithLoadCnt(
+ const AMDGPU::Waitcnt &Wait) const {
// If we have pending store we cannot optimize XCnt because we do not wait for
// stores. VMEM loads retun in order, so if we only have loads XCnt is
// decremented to the same number as LOADCnt.
@@ -1349,27 +1365,17 @@ bool WaitcntBrackets::canOptimizeXCntWithLoadCnt(const AMDGPU::Waitcnt &Wait) {
!hasPendingEvent(STORE_CNT);
}
-void WaitcntBrackets::simplifyXcnt(AMDGPU::Waitcnt &CheckWait,
- AMDGPU::Waitcnt &UpdateWait) {
+void WaitcntBrackets::simplifyXcnt(AMDGPU::Waitcnt &Wait) const {
// Try to simplify xcnt further by checking for joint kmcnt and loadcnt
// optimizations. On entry to a block with multiple predescessors, there may
// be pending SMEM and VMEM events active at the same time.
// In such cases, only clear one active event at a time.
// TODO: Revisit xcnt optimizations for gfx1250.
- if (hasRedundantXCntWithKmCnt(CheckWait)) {
- if (!hasMixedPendingEvents(X_CNT)) {
- applyWaitcnt(X_CNT, 0);
- } else {
- PendingEvents &= ~(1 << SMEM_GROUP);
- }
- } else if (canOptimizeXCntWithLoadCnt(CheckWait)) {
- if (!hasMixedPendingEvents(X_CNT)) {
- applyWaitcnt(X_CNT, std::min(CheckWait.XCnt, CheckWait.LoadCnt));
- } else if (CheckWait.LoadCnt == 0) {
- PendingEvents &= ~(1 << VMEM_GROUP);
- }
- }
- simplifyWaitcnt(X_CNT, UpdateWait.XCnt);
+ if (hasRedundantXCntWithKmCnt(Wait))
+ Wait.XCnt = ~0u;
+ if (canOptimizeXCntWithLoadCnt(Wait) && Wait.XCnt >= Wait.LoadCnt)
+ Wait.XCnt = ~0u;
+ simplifyWaitcnt(X_CNT, Wait.XCnt);
}
// Where there are multiple types of event in the bracket of a counter,
@@ -1656,6 +1662,9 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
dbgs() << *It;
});
+ // Accumulate waits that should not be simplified.
+ AMDGPU::Waitcnt RequiredWait;
+
for (auto &II :
make_early_inc_range(make_range(OldWaitcntInstr.getIterator(), It))) {
LLVM_DEBUG(dbgs() << "pre-existing iter: " << II);
@@ -1682,16 +1691,18 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
TII->getNamedOperand(II, AMDGPU::OpName::simm16)->getImm();
AMDGPU::Waitcnt OldWait = AMDGPU::decodeLoadcntDscnt(IV, OldEnc);
if (TrySimplify)
- ScoreBrackets.simplifyWaitcnt(OldWait);
- Wait = Wait.combined(OldWait);
+ Wait = Wait.combined(OldWait);
+ else
+ RequiredWait = RequiredWait.combined(OldWait);
UpdatableInstr = &CombinedLoadDsCntInstr;
} else if (Opcode == AMDGPU::S_WAIT_STORECNT_DSCNT) {
unsigned OldEnc =
TII->getNamedOperand(II, AMDGPU::OpName::simm16)->getImm();
AMDGPU::Waitcnt OldWait = AMDGPU::decodeStorecntDscnt(IV, OldEnc);
if (TrySimplify)
- ScoreBrackets.simplifyWaitcnt(OldWait);
- Wait = Wait.combined(OldWait);
+ Wait = Wait.combined(OldWait);
+ else
+ RequiredWait = RequiredWait.combined(OldWait);
UpdatableInstr = &CombinedStoreDsCntInstr;
} else if (Opcode == AMDGPU::S_WAITCNT_lds_direct) {
// Architectures higher than GFX10 do not have direct loads to
@@ -1704,8 +1715,9 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
unsigned OldCnt =
TII->getNamedOperand(II, AMDGPU::OpName::simm16)->getImm();
if (TrySimplify)
- ScoreBrackets.simplifyWaitcnt(CT.value(), OldCnt);
- addWait(Wait, CT.value(), OldCnt);
+ addWait(Wait, CT.value(), OldCnt);
+ else
+ addWait(RequiredWait, CT.value(), OldCnt);
UpdatableInstr = &WaitInstrs[CT.value()];
}
@@ -1718,8 +1730,9 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
}
}
- // Save the pre combine waitcnt in order to make xcnt checks.
- AMDGPU::Waitcnt PreCombine = Wait;
+ ScoreBrackets.simplifyWaitcnt(Wait);
+ Wait = Wait.combined(RequiredWait);
+
if (CombinedLoadDsCntInstr) {
// Only keep an S_WAIT_LOADCNT_DSCNT if both counters actually need
// to be waited for. Otherwise, let the instruction be deleted so
@@ -1810,13 +1823,6 @@ bool WaitcntGeneratorGFX12Plus::applyPreexistingWaitcnt(
}
for (auto CT : inst_counter_types(NUM_EXTENDED_INST_CNTS)) {
- if ((CT == KM_CNT && ScoreBrackets.hasRedundantXCntWithKmCnt(PreCombine)) ||
- (CT == LOAD_CNT &&
- ScoreBrackets.canOptimizeXCntWithLoadCnt(PreCombine))) {
- // Xcnt may need to be updated depending on a pre-existing KM/LOAD_CNT
- // due to taking the backedge of a block.
- ScoreBrackets.simplifyXcnt(PreCombine, Wait);
- }
if (!WaitInstrs[CT])
continue;
diff --git a/llvm/test/CodeGen/AMDGPU/flat-load-saddr-to-vaddr.ll b/llvm/test/CodeGen/AMDGPU/flat-load-saddr-to-vaddr.ll
index 29aad9c5f9dc1..58a679fa8bf9d 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-load-saddr-to-vaddr.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-load-saddr-to-vaddr.ll
@@ -28,6 +28,7 @@ define amdgpu_kernel void @test_move_load_address_to_vgpr(ptr addrspace(1) nocap
; GCN-NEXT: s_wait_dscnt 0x0
; GCN-NEXT: flat_load_b32 v3, v[0:1] scope:SCOPE_SYS
; GCN-NEXT: s_wait_loadcnt 0x0
+; GCN-NEXT: s_wait_xcnt 0x0
; GCN-NEXT: v_add_nc_u64_e32 v[0:1], 4, v[0:1]
; GCN-NEXT: v_add_co_u32 v2, s0, v2, 1
; GCN-NEXT: s_and_b32 vcc_lo, exec_lo, s0
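As a side note on the simplifyXcnt hunk above: AMDGPU::Waitcnt uses ~0u per counter to mean "no wait requested", which is why clearing the xcnt wait is spelled `Wait.XCnt = ~0u`. A rough standalone sketch of the new logic, with the WaitcntBrackets pending-event queries stubbed out as hypothetical booleans:

```cpp
#include <cstdio>

struct Waitcnt {
  unsigned LoadCnt = ~0u, KmCnt = ~0u, XCnt = ~0u; // ~0u = no wait
};

// Hypothetical stand-ins for the real pending-event queries.
bool hasPendingSmemGroup() { return true; }
bool hasPendingVmemGroup() { return true; }
bool hasPendingStore() { return false; }

void simplifyXcnt(Waitcnt &Wait) {
  // SMEM can return out of order, so the XCNT wait is only redundant
  // when we already wait for kmcnt to reach zero.
  if (Wait.KmCnt == 0 && hasPendingSmemGroup())
    Wait.XCnt = ~0u;
  // VMEM loads return in order; with no pending stores, XCNT cannot lag
  // LOADCNT, so an XCnt wait no tighter than the LoadCnt wait adds nothing.
  if (hasPendingVmemGroup() && !hasPendingStore() && Wait.XCnt >= Wait.LoadCnt)
    Wait.XCnt = ~0u;
}

int main() {
  Waitcnt W;
  W.KmCnt = 0;
  W.XCnt = 2;
  simplifyXcnt(W);
  printf("XCnt: 0x%x\n", W.XCnt); // 0xffffffff: wait dropped
}
```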
; GCN-NEXT: s_wait_dscnt 0x0
; GCN-NEXT: flat_load_b32 v3, v[0:1] scope:SCOPE_SYS
; GCN-NEXT: s_wait_loadcnt 0x0
; GCN-NEXT: s_wait_xcnt 0x0
This is unintended but I'd like to fix it separately. The first time we process this block we eagerly promote S_WAIT_LOADCNT_soft 0 to S_WAIT_LOADCNT 0. The second time it looks like a required (non-soft) wait, which inhibits the optimization that simplifies away the wait for xcnt.
I think for this reason and others we really need to delay the promotion of soft to non-soft waits until after the iteration has reached a fixed point.
I see what you're saying in terms of how this came about, but that was kind of the point of the prior implementation: at any point where a new wait is going to be inserted, it should have full knowledge of the state of other waits that have already been applied (whether an implicit wait for something or a literal wait). Maybe part of the issue is that applyPreexistingWait has a dual meaning: it accumulates pre-existing wait instructions and it also handles soft waits.
I certainly agree that applyPreexistingWait is trying to do too much.
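To make the promotion problem concrete, here is a hypothetical two-pass sketch (WaitInst and the loop are invented for illustration; the real pass works on S_WAIT_* machine instructions):

```cpp
#include <cstdio>

// Invented for illustration: a wait that is either "soft" (a placeholder
// the pass inserted and may freely simplify) or "hard" (required as written).
struct WaitInst {
  unsigned Count;
  bool Soft;
};

int main() {
  // The block initially holds a soft s_wait_loadcnt 0.
  WaitInst W{0, /*Soft=*/true};

  for (int Iter = 1; Iter <= 2; ++Iter) {
    bool TrySimplify = W.Soft; // only soft waits may be simplified away
    printf("iteration %d: TrySimplify=%s\n", Iter, TrySimplify ? "yes" : "no");
    // Eager promotion: after the first visit the wait is no longer soft,
    // so on the second visit (via the loop backedge) it must be kept,
    // inhibiting the xcnt simplification described above.
    W.Soft = false;
  }
}
```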
RyanRio left a comment
I think CheckWait and UpdateWait made sense, since we're trying to consider many different waits at once and optimize them in different ways, which can affect how further optimizations happen.
Fair enough. For this patch I have restored the separate CheckWait and UpdateWait parameters in simplifyXcnt, and added them to simplifyWaitcnt since I want all callers to use that as the entry point.
bool hasRedundantXCntWithKmCnt(const AMDGPU::Waitcnt &Wait) const;
bool canOptimizeXCntWithLoadCnt(const AMDGPU::Waitcnt &Wait) const;
These two helpers now have only a single use each, and could be inlined.
nhaehnle left a comment
Just skimmed it but seems reasonable to me. One comment inline.