
[AMDGPU] IGLP: Fix static variables #137549

Open
ro-i wants to merge 2 commits into llvm:main from ro-i:iglp-static-fix

Conversation

@ro-i
Contributor

@ro-i ro-i commented Apr 27, 2025

Replace global / class-level static variables with instance members and guarantee thread-safety.

@arsenm as discussed

I tried to keep the changes as non-invasive as possible and without functional change.

@ro-i ro-i requested a review from arsenm April 27, 2025 21:09
@llvmbot
Member

llvmbot commented Apr 27, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Robert Imschweiler (ro-i)

Changes

Replace global / class-level static variables with instance members and guarantee thread-safety.

@arsenm as discussed

Some notable implications of these changes:

  • MFMASmallGemmSingleWaveOpt::applyIGLPStrategy(): no caching anymore between scheduling phases
  • MFMAExpInterleaveOpt::analyzeDAG(): needs to be able to run on post-ra DAG as well, which especially affects the dependency handling

I tried to keep the changes as non-invasive as possible and without functional change.


Full diff: https://github.com/llvm/llvm-project/pull/137549.diff

4 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/ScheduleDAG.h (+5-2)
  • (modified) llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h (+4-2)
  • (modified) llvm/lib/CodeGen/ScheduleDAG.cpp (+8-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp (+89-102)
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAG.h b/llvm/include/llvm/CodeGen/ScheduleDAG.h
index 1c8d92d149adc..f57bc55bf3131 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAG.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAG.h
@@ -734,7 +734,8 @@ class TargetRegisterInfo;
     /// Makes a DFS traversal and mark all nodes affected by the edge insertion.
     /// These nodes will later get new topological indexes by means of the Shift
     /// method.
-    void DFS(const SUnit *SU, int UpperBound, bool& HasLoop);
+    void DFS(const SUnit *SU, int UpperBound, bool &HasLoop,
+             std::optional<SDep::Kind> OnlyDepKind = std::nullopt);
 
     /// Reassigns topological indexes for the nodes in the DAG to
     /// preserve the topological ordering.
@@ -767,7 +768,9 @@ class TargetRegisterInfo;
                                  bool &Success);
 
     /// Checks if \p SU is reachable from \p TargetSU.
-    bool IsReachable(const SUnit *SU, const SUnit *TargetSU);
+    /// If OnlyDepKind is given, consider only dependencies of this kind.
+    bool IsReachable(const SUnit *SU, const SUnit *TargetSU,
+                     std::optional<SDep::Kind> OnlyDepKind = std::nullopt);
 
     /// Returns true if addPred(TargetSU, SU) creates a cycle.
     bool WillCreateCycle(SUnit *TargetSU, SUnit *SU);
diff --git a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
index e79b03c57a1e8..f1d8852c75005 100644
--- a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
+++ b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
@@ -287,8 +287,10 @@ namespace llvm {
     }
 
     /// IsReachable - Checks if SU is reachable from TargetSU.
-    bool IsReachable(SUnit *SU, SUnit *TargetSU) {
-      return Topo.IsReachable(SU, TargetSU);
+    /// If OnlyDepKind is given, only dependencies of this kind are considered.
+    bool IsReachable(SUnit *SU, SUnit *TargetSU,
+                     std::optional<SDep::Kind> OnlyDepKind = std::nullopt) {
+      return Topo.IsReachable(SU, TargetSU, OnlyDepKind);
     }
 
     /// Whether regions with a single MI should be scheduled.
diff --git a/llvm/lib/CodeGen/ScheduleDAG.cpp b/llvm/lib/CodeGen/ScheduleDAG.cpp
index 26857edd871e2..dec508f2f763d 100644
--- a/llvm/lib/CodeGen/ScheduleDAG.cpp
+++ b/llvm/lib/CodeGen/ScheduleDAG.cpp
@@ -570,7 +570,8 @@ void ScheduleDAGTopologicalSort::RemovePred(SUnit *M, SUnit *N) {
 }
 
 void ScheduleDAGTopologicalSort::DFS(const SUnit *SU, int UpperBound,
-                                     bool &HasLoop) {
+                                     bool &HasLoop,
+                                     std::optional<SDep::Kind> OnlyDepKind) {
   std::vector<const SUnit*> WorkList;
   WorkList.reserve(SUnits.size());
 
@@ -580,6 +581,8 @@ void ScheduleDAGTopologicalSort::DFS(const SUnit *SU, int UpperBound,
     WorkList.pop_back();
     Visited.set(SU->NodeNum);
     for (const SDep &SuccDep : llvm::reverse(SU->Succs)) {
+      if (OnlyDepKind && SuccDep.getKind() != *OnlyDepKind)
+        continue;
       unsigned s = SuccDep.getSUnit()->NodeNum;
       // Edges to non-SUnits are allowed but ignored (e.g. ExitSU).
       if (s >= Node2Index.size())
@@ -722,8 +725,9 @@ void ScheduleDAGTopologicalSort::AddSUnitWithoutPredecessors(const SUnit *SU) {
   Visited.resize(Node2Index.size());
 }
 
-bool ScheduleDAGTopologicalSort::IsReachable(const SUnit *SU,
-                                             const SUnit *TargetSU) {
+bool ScheduleDAGTopologicalSort::IsReachable(
+    const SUnit *SU, const SUnit *TargetSU,
+    std::optional<SDep::Kind> OnlyDepKind) {
   assert(TargetSU != nullptr && "Invalid target SUnit");
   assert(SU != nullptr && "Invalid SUnit");
   FixOrder();
@@ -737,7 +741,7 @@ bool ScheduleDAGTopologicalSort::IsReachable(const SUnit *SU,
   if (LowerBound < UpperBound) {
     Visited.reset();
     // There may be a path from TargetSU to SU. Check for it.
-    DFS(TargetSU, UpperBound, HasLoop);
+    DFS(TargetSU, UpperBound, HasLoop, OnlyDepKind);
   }
   return HasLoop;
 }
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
index a2b7cd088093a..4e5650d18b35a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
@@ -229,13 +229,13 @@ class SchedGroup {
   SchedGroup(SchedGroupMask SGMask, std::optional<unsigned> MaxSize,
              ScheduleDAGInstrs *DAG, const SIInstrInfo *TII)
       : SGMask(SGMask), MaxSize(MaxSize), DAG(DAG), TII(TII) {
-    SGID = NumSchedGroups++;
+    SGID = __atomic_fetch_add(&NumSchedGroups, 1, __ATOMIC_SEQ_CST);
   }
 
   SchedGroup(SchedGroupMask SGMask, std::optional<unsigned> MaxSize, int SyncID,
              ScheduleDAGInstrs *DAG, const SIInstrInfo *TII)
       : SGMask(SGMask), MaxSize(MaxSize), SyncID(SyncID), DAG(DAG), TII(TII) {
-    SGID = NumSchedGroups++;
+    SGID = __atomic_fetch_add(&NumSchedGroups, 1, __ATOMIC_SEQ_CST);
   }
 };
 
@@ -887,26 +887,26 @@ bool MFMASmallGemmOpt::applyIGLPStrategy(
 class MFMAExpInterleaveOpt final : public IGLPStrategy {
 private:
   // The count of TRANS SUs involved in the interleaved pipeline
-  static unsigned TransPipeCount;
+  unsigned TransPipeCount = 0;
   // The count of MFMA SUs involved in the interleaved pipeline
-  static unsigned MFMAPipeCount;
+  unsigned MFMAPipeCount = 0;
   // The count of Add SUs involved in the interleaved pipeline
-  static unsigned AddPipeCount;
+  unsigned AddPipeCount = 0;
   // The number of transitive MFMA successors for each TRANS SU
-  static unsigned MFMAEnablement;
+  unsigned MFMAEnablement = 0;
   // The number of transitive TRANS predecessors for each MFMA SU
-  static unsigned ExpRequirement;
+  unsigned ExpRequirement = 0;
   // The count of independent "chains" of MFMA instructions in the pipeline
-  static unsigned MFMAChains;
+  unsigned MFMAChains = 0;
   // The length of each independent "chain" of MFMA instructions
-  static unsigned MFMAChainLength;
+  unsigned MFMAChainLength = 0;
   // Whether or not the pipeline has V_CVT instructions
-  static bool HasCvt;
+  bool HasCvt = false;
   // Whether or not there are instructions between the TRANS instruction and
   // V_CVT
-  static bool HasChainBetweenCvt;
+  bool HasChainBetweenCvt = false;
   // The first occuring DS_READ which feeds an MFMA chain
-  static std::optional<unsigned> FirstPipeDSR;
+  std::optional<unsigned> FirstPipeDSR = std::nullopt;
   // The MFMAPipe SUs with no MFMA predecessors
   SmallVector<SUnit *, 4> MFMAChainSeeds;
   // Compute the heuristics for the pipeline, returning whether or not the DAG
@@ -1325,17 +1325,8 @@ class MFMAExpInterleaveOpt final : public IGLPStrategy {
   }
 };
 
-unsigned MFMAExpInterleaveOpt::TransPipeCount = 0;
-unsigned MFMAExpInterleaveOpt::MFMAPipeCount = 0;
-unsigned MFMAExpInterleaveOpt::AddPipeCount = 0;
-unsigned MFMAExpInterleaveOpt::MFMAEnablement = 0;
-unsigned MFMAExpInterleaveOpt::ExpRequirement = 0;
-unsigned MFMAExpInterleaveOpt::MFMAChains = 0;
-unsigned MFMAExpInterleaveOpt::MFMAChainLength = 0;
-bool MFMAExpInterleaveOpt::HasCvt = false;
-bool MFMAExpInterleaveOpt::HasChainBetweenCvt = false;
-std::optional<unsigned> MFMAExpInterleaveOpt::FirstPipeDSR = std::nullopt;
-
+// Note: we only want to check for true (data) dependencies (see SDep::Kind)
+// so that the logic also works in the PostRA phase.
 bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
   SmallVector<SUnit *, 10> ExpPipeCands;
   SmallVector<SUnit *, 10> MFMAPipeCands;
@@ -1353,17 +1344,25 @@ bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
 
   auto isAdd = [](unsigned Opc) { return Opc == AMDGPU::V_ADD_F32_e32; };
 
+  // Heuristic helper function (see below)
+  auto IsFMACDataDep = [](SDep &Dep) {
+    return Dep.getKind() == SDep::Kind::Data &&
+           Dep.getSUnit()->getInstr()->getOpcode() == AMDGPU::V_FMAC_F32_e32;
+  };
+
   AddPipeCount = 0;
   for (SUnit &SU : DAG->SUnits) {
     auto Opc = SU.getInstr()->getOpcode();
     if (TII->isTRANS(Opc)) {
       // Avoid counting a potential bonus V_EXP which all the MFMA depend on
-      if (SU.Succs.size() >= 7)
+      // FIXME: This heuristic needs improvement/clarification!
+      // In general, the pipeline seems to look like this:
+      //   fma_f32 -> exp_f32 -> cvt_f16_f32 -> v_pack_b32_f16 -> mfma_.._f16
+      //   (with potential arithmetic between exp and cvt)
+      //   see
+      //   https://github.com/llvm/llvm-project/pull/80370#discussion_r1483660378
+      if (SU.Succs.size() >= 7 && any_of(SU.Succs, IsFMACDataDep))
         continue;
-      for (auto &Succ : SU.Succs) {
-        if (Succ.getSUnit()->Succs.size() >= 7)
-          continue;
-      }
       ExpPipeCands.push_back(&SU);
     }
 
@@ -1390,7 +1389,7 @@ bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
   // Count the number of EXPs that reach an MFMA
   for (auto &PredSU : ExpPipeCands) {
     for (auto &SuccSU : MFMAPipeCands) {
-      if (DAG->IsReachable(SuccSU, PredSU)) {
+      if (DAG->IsReachable(SuccSU, PredSU, SDep::Kind::Data)) {
         if (!TempExp) {
           TempExp = PredSU;
           TempMFMA = SuccSU;
@@ -1418,7 +1417,7 @@ bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
       continue;
 
     for (auto &PredSU : ExpPipeCands) {
-      if (DAG->IsReachable(SuccSU, PredSU)) {
+      if (DAG->IsReachable(SuccSU, PredSU, SDep::Kind::Data)) {
         MFMAPipeSUs.push_back(SuccSU);
         break;
       }
@@ -1432,7 +1431,7 @@ bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
 
   std::optional<SUnit *> TempCvt;
   for (auto &SuccSU : CvtSUs) {
-    if (DAG->IsReachable(SuccSU, *TempExp)) {
+    if (DAG->IsReachable(SuccSU, *TempExp, SDep::Kind::Data)) {
       TempCvt = SuccSU;
       break;
     }
@@ -1441,13 +1440,14 @@ bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
   HasCvt = false;
   if (TempCvt.has_value()) {
     for (auto &SuccSU : MFMAPipeSUs) {
-      if (DAG->IsReachable(SuccSU, *TempCvt)) {
+      if (DAG->IsReachable(SuccSU, *TempCvt, SDep::Kind::Data)) {
         HasCvt = true;
         break;
       }
     }
   }
 
+  MFMAChainSeeds.clear();
   MFMAChains = 0;
   for (auto &MFMAPipeSU : MFMAPipeSUs) {
     if (is_contained(MFMAChainSeeds, MFMAPipeSU))
@@ -1474,7 +1474,7 @@ bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
   // The number of bit pack operations that depend on a single V_EXP
   unsigned PackSuccCount =
       llvm::count_if(PackSUs, [this, &TempExp](SUnit *VPack) {
-        return DAG->IsReachable(VPack, *TempExp);
+        return DAG->IsReachable(VPack, *TempExp, SDep::Kind::Data);
       });
 
   // The number of bit pack operations an MFMA depends on
@@ -1504,10 +1504,10 @@ bool MFMAExpInterleaveOpt::analyzeDAG(const SIInstrInfo *TII) {
   MFMAEnablement *= PackSuccCount;
 
   // The number of V_EXPs required to resolve all dependencies for an MFMA
-  ExpRequirement =
-      llvm::count_if(ExpPipeCands, [this, &PackPred](SUnit *ExpBase) {
-        return DAG->IsReachable(PackPred->getSUnit(), ExpBase);
-      });
+  ExpRequirement = llvm::count_if(ExpPipeCands, [this,
+                                                 &PackPred](SUnit *ExpBase) {
+    return DAG->IsReachable(PackPred->getSUnit(), ExpBase, SDep::Kind::Data);
+  });
 
   ExpRequirement *= PackPredCount;
   return true;
@@ -1518,12 +1518,7 @@ bool MFMAExpInterleaveOpt::shouldApplyStrategy(ScheduleDAGInstrs *DAG,
   const GCNSubtarget &ST = DAG->MF.getSubtarget<GCNSubtarget>();
   const SIInstrInfo *TII = ST.getInstrInfo();
 
-  if (Phase != AMDGPU::SchedulingPhase::PostRA)
-    MFMAChainSeeds.clear();
-  if (Phase != AMDGPU::SchedulingPhase::PostRA && !analyzeDAG(TII))
-    return false;
-
-  return true;
+  return analyzeDAG(TII);
 }
 
 bool MFMAExpInterleaveOpt::applyIGLPStrategy(
@@ -1550,18 +1545,18 @@ bool MFMAExpInterleaveOpt::applyIGLPStrategy(
   unsigned CurrMFMAForTransPosition = 0;
 
   auto incrementTransPosition = [&MFMAChain, &PositionInChain,
-                                 &CurrMFMAForTransPosition]() {
+                                 &CurrMFMAForTransPosition, this]() {
     CurrMFMAForTransPosition += MFMAEnablement;
     PositionInChain = (CurrMFMAForTransPosition / MFMAChains);
     MFMAChain = CurrMFMAForTransPosition % MFMAChains;
   };
 
-  auto getNextTransPositionInChain = [&CurrMFMAForTransPosition]() {
+  auto getNextTransPositionInChain = [&CurrMFMAForTransPosition, this]() {
     auto TempMFMAForTrans = CurrMFMAForTransPosition + MFMAEnablement;
     return (TempMFMAForTrans / MFMAChains);
   };
 
-  auto getNextTransMFMAChain = [&CurrMFMAForTransPosition]() {
+  auto getNextTransMFMAChain = [&CurrMFMAForTransPosition, this]() {
     auto TempMFMAForTrans = CurrMFMAForTransPosition + MFMAEnablement;
     return TempMFMAForTrans % MFMAChains;
   };
@@ -1571,7 +1566,7 @@ bool MFMAExpInterleaveOpt::applyIGLPStrategy(
   unsigned PositionInChainForMFMA = 0;
 
   auto incrementMFMAPosition = [&CurrMFMAPosition, &MFMAChainForMFMA,
-                                &PositionInChainForMFMA]() {
+                                &PositionInChainForMFMA, this]() {
     ++CurrMFMAPosition;
     MFMAChainForMFMA = CurrMFMAPosition % MFMAChains;
     PositionInChainForMFMA = CurrMFMAPosition / MFMAChains;
@@ -2062,22 +2057,16 @@ class MFMASmallGemmSingleWaveOpt final : public IGLPStrategy {
   }
 };
 
-static unsigned DSWCount = 0;
-static unsigned DSWWithPermCount = 0;
-static unsigned DSWWithSharedVMEMCount = 0;
-
 bool MFMASmallGemmSingleWaveOpt::applyIGLPStrategy(
     DenseMap<int, SUnitsToCandidateSGsMap> &SyncedInstrs,
     DenseMap<int, SmallVector<SchedGroup, 4>> &SyncedSchedGroups,
     AMDGPU::SchedulingPhase Phase) {
   unsigned MFMACount = 0;
   unsigned DSRCount = 0;
+  unsigned DSWCount = 0;
+  unsigned DSWWithPermCount = 0;
+  unsigned DSWWithSharedVMEMCount = 0;
 
-  bool IsInitial = Phase == AMDGPU::SchedulingPhase::Initial;
-
-  assert((!IsInitial || (DSWCount == 0 && DSWWithPermCount == 0 &&
-                         DSWWithSharedVMEMCount == 0)) &&
-         "DSWCounters should be zero in pre-RA scheduling!");
   SmallVector<SUnit *, 6> DSWithPerms;
   for (auto &SU : DAG->SUnits) {
     auto *I = SU.getInstr();
@@ -2086,7 +2075,7 @@ bool MFMASmallGemmSingleWaveOpt::applyIGLPStrategy(
     else if (TII->isDS(*I)) {
       if (I->mayLoad())
         ++DSRCount;
-      else if (I->mayStore() && IsInitial) {
+      else if (I->mayStore()) {
         ++DSWCount;
         for (auto Pred : SU.Preds) {
           if (Pred.getSUnit()->getInstr()->getOpcode() ==
@@ -2099,58 +2088,56 @@ bool MFMASmallGemmSingleWaveOpt::applyIGLPStrategy(
     }
   }
 
-  if (IsInitial) {
-    DSWWithPermCount = DSWithPerms.size();
-    auto *I = DSWithPerms.begin();
-    auto *E = DSWithPerms.end();
-
-    // Get the count of DS_WRITES with V_PERM predecessors which
-    // have loop carried dependencies (WAR) on the same VMEM_READs.
-    // We consider partial overlap as a miss -- in other words,
-    // for a given DS_W, we only consider another DS_W as matching
-    // if there is a corresponding (in terms of the VMEM_R it uses) V_PERM pred
-    // for every V_PERM pred of this DS_W.
-    DenseMap<MachineInstr *, SUnit *> VMEMLookup;
-    SmallVector<SUnit *, 6> Counted;
-    for (; I != E; I++) {
-      SUnit *Cand = nullptr;
-      bool MissedAny = false;
-      for (auto &Pred : (*I)->Preds) {
-        if (Pred.getSUnit()->getInstr()->getOpcode() != AMDGPU::V_PERM_B32_e64)
-          continue;
+  DSWWithPermCount = DSWithPerms.size();
+  auto *I = DSWithPerms.begin();
+  auto *E = DSWithPerms.end();
+
+  // Get the count of DS_WRITES with V_PERM predecessors which
+  // have loop carried dependencies (WAR) on the same VMEM_READs.
+  // We consider partial overlap as a miss -- in other words,
+  // for a given DS_W, we only consider another DS_W as matching
+  // if there is a corresponding (in terms of the VMEM_R it uses) V_PERM pred
+  // for every V_PERM pred of this DS_W.
+  DenseMap<MachineInstr *, SUnit *> VMEMLookup;
+  SmallVector<SUnit *, 6> Counted;
+  for (; I != E; I++) {
+    SUnit *Cand = nullptr;
+    bool MissedAny = false;
+    for (auto &Pred : (*I)->Preds) {
+      if (Pred.getSUnit()->getInstr()->getOpcode() != AMDGPU::V_PERM_B32_e64)
+        continue;
 
-        if (Cand && llvm::is_contained(Counted, Cand))
-          break;
+      if (Cand && llvm::is_contained(Counted, Cand))
+        break;
 
-        for (auto &Succ : Pred.getSUnit()->Succs) {
-          auto *MI = Succ.getSUnit()->getInstr();
-          if (!TII->isVMEM(*MI) || !MI->mayLoad())
-            continue;
+      for (auto &Succ : Pred.getSUnit()->Succs) {
+        auto *MI = Succ.getSUnit()->getInstr();
+        if (!TII->isVMEM(*MI) || !MI->mayLoad())
+          continue;
 
-          if (MissedAny || !VMEMLookup.size()) {
-            MissedAny = true;
-            VMEMLookup[MI] = *I;
-            continue;
-          }
+        if (MissedAny || !VMEMLookup.size()) {
+          MissedAny = true;
+          VMEMLookup[MI] = *I;
+          continue;
+        }
 
-          auto [It, Inserted] = VMEMLookup.try_emplace(MI, *I);
-          if (Inserted) {
-            MissedAny = true;
-            continue;
-          }
+        auto [It, Inserted] = VMEMLookup.try_emplace(MI, *I);
+        if (Inserted) {
+          MissedAny = true;
+          continue;
+        }
 
-          Cand = It->second;
-          if (llvm::is_contained(Counted, Cand)) {
-            MissedAny = true;
-            break;
-          }
+        Cand = It->second;
+        if (llvm::is_contained(Counted, Cand)) {
+          MissedAny = true;
+          break;
         }
       }
-      if (!MissedAny && Cand) {
-        DSWWithSharedVMEMCount += 2;
-        Counted.push_back(Cand);
-        Counted.push_back(*I);
-      }
+    }
+    if (!MissedAny && Cand) {
+      DSWWithSharedVMEMCount += 2;
+      Counted.push_back(Cand);
+      Counted.push_back(*I);
     }
   }
 

// (with potential arithmetic between exp and cvt)
// see
// https://github.com/llvm/llvm-project/pull/80370#discussion_r1483660378
if (SU.Succs.size() >= 7 && any_of(SU.Succs, IsFMACDataDep))
Contributor

I don't understand why this is looking for fmac specifically. The comment says "fma", but "FMA" isn't a real classification of instruction. I would expect something based off of scheduler resource use, maybe?

Contributor Author

For the moment, I reverted this specific change to keep the patch NFC (with the exception that missing reset of variables is fixed since those variables aren't static anymore)

@ro-i ro-i requested a review from arsenm November 26, 2025 15:20
@ro-i
Contributor Author

ro-i commented Jan 23, 2026

Ping (note that this is NFC regarding existing tests and should be NFC regarding existing code that doesn't run into the issue that the IGLP stuff wasn't reset properly)

Comment on lines +558 to +577
struct MFMASmallGemmSingleWaveCache {
unsigned DSWCount = 0;
unsigned DSWWithPermCount = 0;
unsigned DSWWithSharedVMEMCount = 0;
};

struct MFMAExpInterleaveCache {
SmallVector<const MachineInstr *, 4> MFMAChainSeedInstrs;
const MachineInstr *FirstPipeDSRInstr = nullptr;
unsigned TransPipeCount = 0;
unsigned MFMAPipeCount = 0;
unsigned AddPipeCount = 0;
unsigned MFMAEnablement = 0;
unsigned ExpRequirement = 0;
unsigned MFMAChains = 0;
unsigned MFMAChainLength = 0;
bool HasCvt = false;
bool HasChainBetweenCvt = false;
bool AnalysisResult = false;
};
Contributor

This does not belong in SIMachineFunctionInfo. This is pass-local state

Contributor Author

Maybe, but we need to cache it somewhere so that we have the data over multiple runs of that pass (otherwise, we're back to recompute everything every time or use static variables). What location would you suggest?

Contributor

Isn't "the pass" here a scheduling stage, contained within the single machine function scheduler pass?

Contributor Author

Yeah, sorry for the confusion.
Afaics, the Initial and PreRAReentry phases happen during the machine-scheduler pass, and the PostRA phase during the postmisched pass. And we do want to keep our caches between these passes if possible.

Contributor

The alternative is to make this an analysis pass. But either way it shouldn't be storing pointers and should be serializable. We already suffer greatly from poor reproducibility of the hard problems with MIR tests

Contributor Author

The issue is:
I could move the pointer parts to GCNScheduleDAGMILive, so that they are available during the pre-RA stages. (They aren't used in post-RA afaict.)
This would pull out all the pointer stuff from SIMachineFunctionInfo.
But: the caches in SIMachineFunctionInfo would still be keyed by instruction pointer, which we would need to translate to something else for serialization. MBB + index of instruction or something like this.

I whipped up a draft with Claude so that you can see what I'm talking about.

The analysis pass: could we even ensure that we can preserve the analysis? Otherwise, we would be back to square one, ig.

What would be your opinion on whether the IGLP algorithm itself could be changed to not even require that much information? I.e. kind of circumvent the issue of caching by not having that much that would need to be cached.

@github-actions

github-actions bot commented Feb 23, 2026

🐧 Linux x64 Test Results

  • 191834 tests passed
  • 4904 tests skipped

✅ The build succeeded and all tests passed.

Contributor

@frederik-h frederik-h left a comment

I am not sure that the performance improvements achievable by the caching are worth the added code complexity.

In fact, I did a quick experiment: I changed the baseline MFMASmallGemmSingleWaveOpt::applyIGLPStrategy implementation to always recompute the analysis data and compared the time spent in the scheduler of that build with the time using a build of this PR. For this I used different IR files from a build of a project that uses "IGLP_OPT 1" (composable_kernel). Recomputing was consistently faster for me. Perhaps that's different for the other strategy, or perhaps I made a mistake in my ad hoc benchmarking, but in any case I would suggest that you also try to benchmark if you haven't already.

I think @jrbyrnes originally implemented those strategies. @jrbyrnes Do you think the caching is crucial?

@ro-i If you do not see a significant performance advantage from the caching, perhaps you could fix the correctness issue first (recomputing the analysis data) and then try to improve the performance?

bb.1:
liveins: $vgpr0, $vgpr1

IGLP_OPT 1
Contributor

Add test for IGLP_OPT 2.

hasCvt: true
hasChainBetweenCvt: false
analysisResult: true
iglpSmallGemmSingleWaveCaches:
Contributor

This checks the cache parsing. Also check cache creation?

; GCN-NEXT: s_endpgm
entry:
%a = load i1, ptr %src, align 1
call void @llvm.amdgcn.iglp.opt(i32 1)
Contributor

Test "call void @llvm.amdgcn.iglp.opt(i32 2)"?

ro-i added 2 commits March 9, 2026 06:11
Replace global / class-level static variables with instance members and
guarantee thread-safety.
Assisted-by: claude-4.6-opus
