
Re-apply "[AMDGPU][Scheduler] Scoring system for rematerializations (#175050)" #177206

Merged
lucas-rami merged 4 commits into main from users/lucas-rami/rematerialization-reapply/4 on Feb 2, 2026
Conversation

@lucas-rami (Contributor) commented Jan 21, 2026

This re-applies commit f21e359 along with the fix for the compile failure introduced in 8ab7937 before the initial patch was reverted. It also fixes the previously observed assert failure.

We were hitting the assert on the HIP Blender workload due to a combination of two issues that could occur when rematerializations are rolled back.

  1. Small changes in slot indices (while preserving instruction order) compared to the pre-re-scheduling state mean that we have to re-compute live ranges for all register operands of rolled-back rematerializations. This was not being done before.
  2. Re-scheduling can move instructions whose register was rematerialized (and whose opcode is temporarily set to DBG_VALUE) to arbitrary positions in their respective regions, even before their read operands are defined. This makes re-scheduling reverts mandatory before rolling back rematerializations, as otherwise def-use chains may be broken. The original patch did not guarantee that ordering, but previous refactoring of the rollback/revert logic for the rematerialization stage now ensures that reverts always precede rollbacks, as sketched below.
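
A minimal sketch of that ordering, assuming hypothetical helper names (neither `revertScheduling` nor `rollbackRematerializations` is the actual stage interface):

```cpp
// Illustrative only: the ordering constraint from point 2. Reverts must
// restore pre-re-scheduling instruction order before any rematerialization
// is rolled back.
void finalizeStage() {
  // 1. Put every affected region back in its pre-re-scheduling order, so
  //    each DBG_VALUE-ified def once again precedes all of its reads.
  revertScheduling();
  // 2. Only then undo the rematerializations; slot indices may have shifted
  //    slightly, so live ranges of all register operands are re-computed.
  rollbackRematerializations();
}
```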

@llvmbot (Member) commented Jan 21, 2026

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

Patch is 174.02 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/177206.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+510-294)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+208-51)
  • (added) llvm/test/CodeGen/AMDGPU/machine-scheduler-rematerialization-scoring.mir (+523)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-attr.mir (+194-194)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-debug.mir (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir (+238-31)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+1-1)
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 8bd47aaadae56..7365030267b5a 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -28,11 +28,18 @@
 #include "GCNRegPressure.h"
 #include "SIMachineFunctionInfo.h"
 #include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/CodeGen/CalcSpillWeights.h"
 #include "llvm/CodeGen/MachineOperand.h"
+#include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
+#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
 #include "llvm/CodeGen/RegisterClassInfo.h"
 #include "llvm/MC/LaneBitmask.h"
+#include "llvm/MC/MCInstrItineraries.h"
+#include "llvm/MC/MCSchedule.h"
+#include "llvm/MC/TargetRegistry.h"
 #include "llvm/Support/ErrorHandling.h"
 
 #define DEBUG_TYPE "machine-scheduler"
@@ -970,6 +977,8 @@ void GCNScheduleDAGMILive::schedule() {
 
 GCNRegPressure
 GCNScheduleDAGMILive::getRealRegPressure(unsigned RegionIdx) const {
+  if (Regions[RegionIdx].first == Regions[RegionIdx].second)
+    return llvm::getRegPressure(MRI, LiveIns[RegionIdx]);
   GCNDownwardRPTracker RPTracker(*LIS);
   RPTracker.advance(Regions[RegionIdx].first, Regions[RegionIdx].second,
                     &LiveIns[RegionIdx]);
@@ -1274,33 +1283,222 @@ bool ClusteredLowOccStage::initGCNSchedStage() {
 #define REMAT_PREFIX "[PreRARemat] "
 #define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << REMAT_PREFIX; X;)
 
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+Printable PreRARematStage::ScoredRemat::print() const {
+  return Printable([&](raw_ostream &OS) {
+    OS << '(' << MaxFreq << ", " << FreqDiff << ", " << RegionImpact << ')';
+  });
+}
+#endif
+
 bool PreRARematStage::initGCNSchedStage() {
   // FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for
   // regions inbetween the defs and region we sinked the def to. Will need to be
   // fixed if there is another pass after this pass.
   assert(!S.hasNextStage());
 
-  if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() == 1)
+  if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() <= 1)
     return false;
 
+  // Maps all MIs (except lone terminators, which are not part of any region) to
+  // their parent region. Non-lone terminators are considered part of the region
+  // they delimit.
+  DenseMap<MachineInstr *, unsigned> MIRegion(MF.getInstructionCount());
+
   // Before performing any IR modification record the parent region of each MI
   // and the parent MBB of each region.
   const unsigned NumRegions = DAG.Regions.size();
-  RegionBB.reserve(NumRegions);
   for (unsigned I = 0; I < NumRegions; ++I) {
     RegionBoundaries Region = DAG.Regions[I];
     for (auto MI = Region.first; MI != Region.second; ++MI)
       MIRegion.insert({&*MI, I});
-    RegionBB.push_back(Region.first->getParent());
+    MachineBasicBlock *ParentMBB = Region.first->getParent();
+    if (Region.second != ParentMBB->end())
+      MIRegion.insert({&*Region.second, I});
+    RegionBB.push_back(ParentMBB);
   }
 
-  if (!canIncreaseOccupancyOrReduceSpill())
+#ifndef NDEBUG
+  auto PrintTargetRegions = [&]() -> void {
+    if (TargetRegions.none()) {
+      dbgs() << REMAT_PREFIX << "No target regions\n";
+      return;
+    }
+    dbgs() << REMAT_PREFIX << "Target regions:\n";
+    for (unsigned I : TargetRegions.set_bits())
+      dbgs() << REMAT_PREFIX << "  [" << I << "] " << RPTargets[I] << '\n';
+  };
+  auto PrintRematReg = [&](const RematReg &Remat) -> Printable {
+    return Printable([&, Remat](raw_ostream &OS) {
+      // Concatenate all region numbers in which the register is unused and
+      // live-through.
+      bool HasLiveThroughRegion = false;
+      OS << '[' << Remat.DefRegion << " -";
+      for (unsigned I = 0; I < NumRegions; ++I) {
+        if (Remat.isUnusedLiveThrough(I)) {
+          if (HasLiveThroughRegion) {
+            OS << ',';
+          } else {
+            OS << "- ";
+            HasLiveThroughRegion = true;
+          }
+          OS << I;
+        }
+      }
+      if (HasLiveThroughRegion)
+        OS << " -";
+      OS << "-> " << Remat.UseRegion << "] ";
+      Remat.DefMI->print(OS, /*IsStandalone=*/true, /*SkipOpers=*/false,
+                         /*SkipDebugLoc=*/false, /*AddNewLine=*/false);
+    });
+  };
+#endif
+
+  // Set an objective for the stage based on current RP in each region.
+  REMAT_DEBUG({
+    dbgs() << "Analyzing ";
+    MF.getFunction().printAsOperand(dbgs(), false);
+    dbgs() << ": ";
+  });
+  if (!setObjective()) {
+    LLVM_DEBUG(dbgs() << "no objective to achieve, occupancy is maximal at "
+                      << MFI.getMaxWavesPerEU() << '\n');
+    return false;
+  }
+  LLVM_DEBUG({
+    if (TargetOcc) {
+      dbgs() << "increase occupancy from " << *TargetOcc - 1 << '\n';
+    } else {
+      dbgs() << "reduce spilling (minimum target occupancy is "
+             << MFI.getMinWavesPerEU() << ")\n";
+    }
+    PrintTargetRegions();
+  });
+
+  if (!collectRematRegs(MIRegion)) {
+    REMAT_DEBUG(dbgs() << "No rematerializable registers\n");
+    return false;
+  }
+  const ScoredRemat::FreqInfo FreqInfo(MF, DAG);
+  REMAT_DEBUG({
+    dbgs() << "Rematerializable registers:\n";
+    for (const RematReg &Remat : RematRegs)
+      dbgs() << REMAT_PREFIX << "  " << PrintRematReg(Remat) << '\n';
+    dbgs() << REMAT_PREFIX << "Region frequencies\n";
+    for (auto [I, Freq] : enumerate(FreqInfo.Regions)) {
+      dbgs() << REMAT_PREFIX << "  [" << I << "] ";
+      if (Freq)
+        dbgs() << Freq;
+      else
+        dbgs() << "unknown ";
+      dbgs() << " | " << *DAG.Regions[I].first;
+    }
+  });
+
+  SmallVector<ScoredRemat> ScoredRemats;
+  for (const RematReg &Remat : RematRegs)
+    ScoredRemats.emplace_back(&Remat, FreqInfo, DAG);
+
+// Rematerialize registers in successive rounds until all RP targets are
+// satisfied or until we run out of rematerialization candidates.
+#ifndef NDEBUG
+  unsigned RoundNum = 0;
+#endif
+  BitVector RecomputeRP(NumRegions);
+  do {
+    assert(!ScoredRemats.empty() && "no more remat candidates");
+
+    // (Re-)Score and (re-)sort all remats in increasing score order.
+    for (ScoredRemat &Remat : ScoredRemats)
+      Remat.update(TargetRegions, RPTargets, FreqInfo, !TargetOcc);
+    sort(ScoredRemats);
+
+    REMAT_DEBUG({
+      dbgs() << "==== ROUND " << RoundNum++ << " ====\n"
+             << REMAT_PREFIX
+             << "Candidates with non-null score, in rematerialization order:\n";
+      for (const ScoredRemat &RematDecision : reverse(ScoredRemats)) {
+        if (RematDecision.hasNullScore())
+          break;
+        dbgs() << REMAT_PREFIX << "  " << RematDecision.print() << " | "
+               << *RematDecision.Remat->DefMI;
+      }
+      PrintTargetRegions();
+    });
+
+    RecomputeRP.reset();
+    unsigned RematIdx = ScoredRemats.size();
+
+    // Rematerialize registers in decreasing score order until we estimate
+    // that all RP targets are satisfied or until rematerialization candidates
+    // are no longer useful to decrease RP.
+    for (; RematIdx && TargetRegions.any(); --RematIdx) {
+      const ScoredRemat &Candidate = ScoredRemats[RematIdx - 1];
+      // Stop rematerializing on encountering a null score. Since scores
+      // monotonically decrease as we rematerialize, we know there is nothing
+      // useful left to do in such cases, even if we were to re-score.
+      if (Candidate.hasNullScore()) {
+        RematIdx = 0;
+        break;
+      }
+
+      const RematReg &Remat = *Candidate.Remat;
+      // When previous rematerializations in this round have already satisfied
+      // RP targets in all regions this rematerialization can impact, we have a
+      // good indication that our scores have diverged significantly from
+      // reality, in which case we interrupt this round and re-score. This also
+      // ensures that every rematerialization we perform is possibly impactful
+      // in at least one target region.
+      if (!Remat.maybeBeneficial(TargetRegions, RPTargets))
+        break;
+
+      REMAT_DEBUG(dbgs() << "** REMAT " << PrintRematReg(Remat) << '\n';);
+      // Every rematerialization we do here is likely to move the instruction
+      // into a higher frequency region, increasing the total sum latency of the
+      // instruction itself. This is acceptable if we are eliminating a spill in
+      // the process, but when the goal is increasing occupancy we get nothing
+      // out of rematerialization if occupancy is not increased in the end; in
+      // such cases we want to roll back the rematerialization.
+      RollbackInfo *Rollback =
+          TargetOcc ? &Rollbacks.emplace_back(&Remat) : nullptr;
+      rematerialize(Remat, RecomputeRP, Rollback);
+      unsetSatisifedRPTargets(Remat.Live);
+    }
+
+    REMAT_DEBUG({
+      if (!TargetRegions.any()) {
+        dbgs() << "** Interrupt round on all targets achieved\n";
+      } else if (RematIdx) {
+        dbgs() << "** Interrupt round on stale score for "
+               << *ScoredRemats[RematIdx - 1].Remat->DefMI;
+      } else {
+        dbgs() << "** Stop on exhausted rematerialization candidates\n";
+      }
+    });
+
+    // Peel off registers we already rematerialized from the vector's tail.
+    ScoredRemats.truncate(RematIdx);
+  } while ((updateAndVerifyRPTargets(RecomputeRP) || TargetRegions.any()) &&
+           !ScoredRemats.empty());
+  if (RescheduleRegions.none())
     return false;
 
-  // Rematerialize identified instructions and update scheduler's state.
-  rematerialize();
-  if (GCNTrackers)
-    DAG.RegionLiveOuts.buildLiveRegMap();
+  // Commit all pressure changes to the DAG and compute minimum achieved
+  // occupancy in impacted regions.
+  REMAT_DEBUG(dbgs() << "==== REMAT RESULTS ====\n");
+  unsigned DynamicVGPRBlockSize = MFI.getDynamicVGPRBlockSize();
+  for (unsigned I : RescheduleRegions.set_bits()) {
+    DAG.Pressure[I] = RPTargets[I].getCurrentRP();
+    REMAT_DEBUG(dbgs() << '[' << I << "] Achieved occupancy "
+                       << DAG.Pressure[I].getOccupancy(ST, DynamicVGPRBlockSize)
+                       << " (" << RPTargets[I] << ")\n");
+  }
+  AchievedOcc = MFI.getMaxWavesPerEU();
+  for (const GCNRegPressure &RP : DAG.Pressure) {
+    AchievedOcc =
+        std::min(AchievedOcc, RP.getOccupancy(ST, DynamicVGPRBlockSize));
+  }
+
   REMAT_DEBUG({
     dbgs() << "Retrying function scheduling with new min. occupancy of "
            << AchievedOcc << " from rematerializing (original was "
@@ -1339,6 +1537,10 @@ void UnclusteredHighRPStage::finalizeGCNSchedStage() {
 }
 
 bool GCNSchedStage::initGCNRegion() {
+  // Skip empty scheduling region.
+  if (DAG.begin() == DAG.end())
+    return false;
+
   // Check whether this new region is also a new block.
   if (DAG.RegionBegin->getParent() != CurrentMBB)
     setupNewBlock();
@@ -1346,8 +1548,8 @@ bool GCNSchedStage::initGCNRegion() {
   unsigned NumRegionInstrs = std::distance(DAG.begin(), DAG.end());
   DAG.enterRegion(CurrentMBB, DAG.begin(), DAG.end(), NumRegionInstrs);
 
-  // Skip empty scheduling regions (0 or 1 schedulable instructions).
-  if (DAG.begin() == DAG.end() || DAG.begin() == std::prev(DAG.end()))
+  // Skip regions with 1 schedulable instruction.
+  if (DAG.begin() == std::prev(DAG.end()))
     return false;
 
   LLVM_DEBUG(dbgs() << "********** MI Scheduling **********\n");
@@ -1842,37 +2044,21 @@ unsigned PreRARematStage::getStageTargetOccupancy() const {
   return TargetOcc ? *TargetOcc : MFI.getMinWavesPerEU();
 }
 
-bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
+bool PreRARematStage::setObjective() {
   const Function &F = MF.getFunction();
 
-  // Maps optimizable regions (i.e., regions at minimum and register-limited
-  // occupancy, or regions with spilling) to the target RP we would like to
-  // reach.
-  DenseMap<unsigned, GCNRPTarget> OptRegions;
+  // Set up "spilling targets" for all regions.
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(F);
   unsigned MaxVGPRs = ST.getMaxNumVGPRs(F);
-  bool HasVectorRegisterExcess;
-
-  auto ResetTargetRegions = [&]() {
-    OptRegions.clear();
-    HasVectorRegisterExcess = false;
-    for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-      const GCNRegPressure &RP = DAG.Pressure[I];
-      GCNRPTarget Target(MaxSGPRs, MaxVGPRs, MF, RP);
-      if (!Target.satisfied())
-        OptRegions.insert({I, Target});
-      // We are willing to consider a RP that does not satisfy the target
-      // as long as it doesn't result in spilling to memory. We
-      // tolerate SGPR spills to VGPR since these have a lower impact on the
-      // performance.
-      // VGPR/AGPR excess could be aliviated by copying to an AGPR/VGPR.
-      // Currently, we consider any spilling as bad (we should consider this
-      // case in the future).
-      HasVectorRegisterExcess |= Target.hasVectorRegisterExcess();
-    }
-  };
+  bool HasVectorRegisterExcess = false;
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    const GCNRegPressure &RP = DAG.Pressure[I];
+    GCNRPTarget &Target = RPTargets.emplace_back(MaxSGPRs, MaxVGPRs, MF, RP);
+    if (!Target.satisfied())
+      TargetRegions.set(I);
+    HasVectorRegisterExcess |= Target.hasVectorRegisterExcess();
+  }
 
-  ResetTargetRegions();
   if (HasVectorRegisterExcess || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
     // In addition to register usage being above addressable limits, occupancy
     // below the minimum is considered like "spilling" as well.
@@ -1881,94 +2067,68 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
     // There is no spilling and room to improve occupancy; set up "increased
     // occupancy targets" for all regions.
     TargetOcc = DAG.MinOccupancy + 1;
-    unsigned VGPRBlockSize =
-        MF.getInfo<SIMachineFunctionInfo>()->getDynamicVGPRBlockSize();
+    const unsigned VGPRBlockSize = MFI.getDynamicVGPRBlockSize();
     MaxSGPRs = ST.getMaxNumSGPRs(*TargetOcc, false);
     MaxVGPRs = ST.getMaxNumVGPRs(*TargetOcc, VGPRBlockSize);
-    ResetTargetRegions();
-  }
-  REMAT_DEBUG({
-    dbgs() << "Analyzing ";
-    MF.getFunction().printAsOperand(dbgs(), false);
-    dbgs() << ": ";
-    if (OptRegions.empty()) {
-      dbgs() << "no objective to achieve, occupancy is maximal at "
-             << MFI.getMaxWavesPerEU();
-    } else if (!TargetOcc) {
-      dbgs() << "reduce spilling (minimum target occupancy is "
-             << MFI.getMinWavesPerEU() << ')';
-    } else {
-      dbgs() << "increase occupancy from " << DAG.MinOccupancy << " to "
-             << TargetOcc;
-    }
-    dbgs() << '\n';
-    for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-      if (auto OptIt = OptRegions.find(I); OptIt != OptRegions.end()) {
-        dbgs() << REMAT_PREFIX << "  [" << I << "] " << OptIt->getSecond()
-               << '\n';
-      }
+    for (auto [I, Target] : enumerate(RPTargets)) {
+      Target.setTarget(MaxSGPRs, MaxVGPRs);
+      if (!Target.satisfied())
+        TargetRegions.set(I);
     }
-  });
-  if (OptRegions.empty())
-    return false;
+  }
 
-  // Accounts for a reduction in RP in an optimizable region. Returns whether we
-  // estimate that we have identified enough rematerialization opportunities to
-  // achieve our goal, and sets Progress to true when this particular reduction
-  // in pressure was helpful toward that goal.
-  auto ReduceRPInRegion = [&](auto OptIt, Register Reg, LaneBitmask Mask,
-                              bool &Progress) -> bool {
-    GCNRPTarget &Target = OptIt->getSecond();
-    if (!Target.isSaveBeneficial(Reg))
-      return false;
-    Progress = true;
-    Target.saveReg(Reg, Mask, DAG.MRI);
-    if (Target.satisfied())
-      OptRegions.erase(OptIt->getFirst());
-    return OptRegions.empty();
-  };
+  return TargetRegions.any();
+}
 
+bool PreRARematStage::collectRematRegs(
+    const DenseMap<MachineInstr *, unsigned> &MIRegion) {
   // We need up-to-date live-out info. to query live-out register masks in
   // regions containing rematerializable instructions.
   DAG.RegionLiveOuts.buildLiveRegMap();
 
-  // Cache set of registers that are going to be rematerialized.
-  DenseSet<unsigned> RematRegs;
+  // Set of registers already marked for potential rematerialization; used to
+  // avoid rematerialization chains.
+  SmallSet<Register, 4> MarkedRegs;
+  auto IsMarkedForRemat = [&MarkedRegs](const MachineOperand &MO) -> bool {
+    return MO.isReg() && MarkedRegs.contains(MO.getReg());
+  };
 
   // Identify rematerializable instructions in the function.
   for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-    auto Region = DAG.Regions[I];
-    for (auto MI = Region.first; MI != Region.second; ++MI) {
+    RegionBoundaries Bounds = DAG.Regions[I];
+    for (auto MI = Bounds.first; MI != Bounds.second; ++MI) {
       // The instruction must be rematerializable.
       MachineInstr &DefMI = *MI;
       if (!isReMaterializable(DefMI))
         continue;
 
-      // We only support rematerializing virtual registers with one definition.
+      // We only support rematerializing virtual registers with one
+      // definition.
       Register Reg = DefMI.getOperand(0).getReg();
       if (!Reg.isVirtual() || !DAG.MRI.hasOneDef(Reg))
         continue;
 
       // We only care to rematerialize the instruction if it has a single
-      // non-debug user in a different region. The using MI may not belong to a
-      // region if it is a lone region terminator.
+      // non-debug user in a different region.
+      // FIXME: Allow rematerializations with multiple uses. This should be
+      // relatively easy to support using the current cost model.
       MachineInstr *UseMI = DAG.MRI.getOneNonDBGUser(Reg);
       if (!UseMI)
         continue;
       auto UseRegion = MIRegion.find(UseMI);
-      if (UseRegion != MIRegion.end() && UseRegion->second == I)
+      if (UseRegion == MIRegion.end() || UseRegion->second == I)
         continue;
 
       // Do not rematerialize an instruction if it uses or is used by an
       // instruction that we have designated for rematerialization.
       // FIXME: Allow for rematerialization chains: this requires 1. updating
-      // remat points to account for uses that are rematerialized, and 2. either
-      // rematerializing the candidates in careful ordering, or deferring the
-      // MBB RP walk until the entire chain has been rematerialized.
-      if (Rematerializations.contains(UseMI) ||
-          llvm::any_of(DefMI.operands(), [&RematRegs](MachineOperand &MO) {
-            return MO.isReg() && RematRegs.contains(MO.getReg());
-          }))
+      // remat points to account for uses that are rematerialized, and 2.
+      // either rematerializing the candidates in careful ordering, or
+      // deferring the MBB RP walk until the entire chain has been
+      // rematerialized.
+      const MachineOperand &UseMO = UseMI->getOperand(0);
+      if (IsMarkedForRemat(UseMO) ||
+          llvm::any_of(DefMI.operands(), IsMarkedForRemat))
         continue;
 
       // Do not rematerialize an instruction if it uses registers that aren't
@@ -1979,106 +2139,182 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
                                               *DAG.TII))
         continue;
 
-      REMAT_DEBUG(dbgs() << "Region " << I << ": remat instruction " << DefMI);
-      RematInstruction &Remat =
-          Rematerializations.try_emplace(&DefMI, UseMI).first->second;
-
-      bool RematUseful = false;
-      if (auto It = OptRegions.find(I); It != OptRegions.end()) {
-        // Optimistically consider that moving the instruction out of its
-        // defining region will reduce RP in the latter; this assumes that
-        // maximum RP in the region is reached somewhere between the defining
-        // instruction and the end of the region.
-        REMAT_DEBUG(dbgs() << "  Defining region is optimizable\n");
-        LaneBitmask Mask = DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg];
-        if (ReduceRPInRegion(It, Reg, Mask, RematUseful))
-          return true;
-      }
-
-      for (unsigned LIRegion = 0; LIRegion != E; ++LIRegion) {
-        // We are only collecting regions in which the register is a live-in
-        // (and may be live-thr...
[truncated]
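
Since the diff above is truncated, here is a condensed, hedged sketch of the round-based driver added to `PreRARematStage::initGCNSchedStage` (control flow simplified from the hunks above; the `updateAndVerifyRPTargets` re-check and rollback bookkeeping are elided):

```cpp
// Condensed sketch, not the exact code: rematerialize in rounds, from the
// highest-scored candidate down, and start a new round whenever scores go
// stale relative to the remaining register-pressure (RP) targets.
while (!ScoredRemats.empty() && TargetRegions.any()) {
  // (Re-)score all remaining candidates and sort in increasing score order.
  for (ScoredRemat &R : ScoredRemats)
    R.update(TargetRegions, RPTargets, FreqInfo, !TargetOcc);
  sort(ScoredRemats);

  unsigned Idx = ScoredRemats.size();
  for (; Idx && TargetRegions.any(); --Idx) {
    const ScoredRemat &Best = ScoredRemats[Idx - 1];
    // Scores only decrease between rounds, so a null score means nothing
    // useful remains at all.
    if (Best.hasNullScore()) {
      Idx = 0;
      break;
    }
    // Every target region this remat could impact is already satisfied:
    // scores have diverged from reality, so interrupt the round and re-score.
    if (!Best.Remat->maybeBeneficial(TargetRegions, RPTargets))
      break;
    rematerialize(*Best.Remat);
  }
  // Peel off the candidates rematerialized this round.
  ScoredRemats.truncate(Idx);
}
```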

@github-actions bot commented Jan 21, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@lucas-rami force-pushed the users/lucas-rami/rematerialization-reapply/4 branch from c324769 to 8155bcf on January 21, 2026 17:30
Rollback->RematMI = RematMI;
// Make the original MI a debug instruction so that it does not influence
// scheduling.
DefMI.setDesc(TII->get(TargetOpcode::DBG_VALUE));
Contributor:

I don't really love this usage of DBG_VALUE. It's also at risk of becoming an obstacle for the debug record transition

Contributor Author:

Me neither, I am still trying to think of a way to make this better.

Collaborator:

Could we use KILL instead?

Contributor Author:

I didn't know about KILL instructions. I will try that, thanks for the suggestion.

Contributor:

What if there is a def on the instruction? Kill with a def on it will be an issue

Contributor Author:

Yes, the def/KILL interaction causes issues here. I was able to verify that KILL instructions with defs still contribute to and increase RP locally. Even if the defined register is marked dead, RP increases over the instruction, which can incorrectly push our max pressure estimate above what it really is.
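
For concreteness, a hedged sketch of the rejected alternative (the opcode is real; the surrounding context is assumed):

```cpp
// Illustrative only: swapping the placeholder opcode to KILL instead of
// DBG_VALUE keeps the instruction "real" for the scheduler, but a KILL that
// still carries a def occupies a register at that point -- even when the
// def is marked dead -- so the tracked max RP of the region can end up
// above the true pressure, defeating the purpose of the rematerialization.
DefMI.setDesc(TII->get(TargetOpcode::KILL));
```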

@lucas-rami (Contributor Author) commented Jan 22, 2026

Sorry for the late change @arsenm, I realized the implementation was triggering an issue on PSDB (ROCm#1170). The last commit fixes this by replacing register operands of instructions whose opcode we set to DBG_VALUE with the sentinel register during re-scheduling. If we don't do that, these operands remain in the use-lists of other registers, and this can make us hit a LIS assert (!BundleNonDebug.isDebugInstr() && "Could not use a debug instruction to query mi2iMap.") when updating their intervals.

The const qualifiers that the last commit removes could also trigger compile failures depending on the config (#175755).
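
A hedged sketch of the operand replacement described above (assumed shape, not the actual commit; the original registers are presumed saved elsewhere for the eventual rollback):

```cpp
// Illustrative only: detach the DBG_VALUE-ified instruction's register
// operands from MRI use-lists by substituting the $noreg sentinel, so that
// LiveIntervals never reaches a debug instruction through a use-list while
// updating intervals during re-scheduling.
for (MachineOperand &MO : DefMI.operands()) {
  if (MO.isReg() && MO.getReg().isVirtual())
    MO.setReg(Register()); // a default-constructed Register is $noreg
}
```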

…175050)"

This re-applies commit f21e359 along
with the fix for the compile failure introduced in
8ab7937 before the initial patch was
reverted, and fixes the previously observed assert failure.

We were hitting the assert in the HIP Blender due to a combination of
two issues that could happen when rematerializations are being rolled
back.

1. Small changes in slot indices (while preserving instruction order)
   compared to the pre-re-scheduling state mean that we have to
   re-compute live ranges for all register operands of rolled-back
   rematerializations. This was not being done before.
2. Re-scheduling can move registers that were rematerialized at
   arbitrary positions in their respective regions while their opcode
   is set to DBG_VALUE, even before their read operands are defined.
   This makes re-scheduling reverts mandatory before rolling back
   rematerializations, as otherwise def-use chains may be broken.
   The original patch did not guarantee that, but previous refactoring
   of the rollback/revert logic for the rematerialization stage now
   ensures that reverts always precede rollbacks.
Also removes a bunch of const specifiers on class members that prevent
std::sort from compiling on some configs.
@lucas-rami force-pushed the users/lucas-rami/rematerialization-reapply/4 branch from e786740 to acf2bbd on February 2, 2026 01:13
@lucas-rami merged commit 9dc6b49 into main on Feb 2, 2026
10 checks passed
@lucas-rami deleted the users/lucas-rami/rematerialization-reapply/4 branch on February 2, 2026 11:34
@mikaelholmen (Collaborator)

Hello @lucas-rami

The following two lit tests start crashing with this patch if we build with EXPENSIVE_CHECKS:

FAIL: LLVM :: CodeGen/AMDGPU/copy-hoist-no-spills.ll (1654 of 4564)
FAIL: LLVM :: CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir (3793 of 4564)

We get

# | llc: ../include/llvm/ADT/DenseMap.h:226: ValueT &llvm::DenseMapBase<llvm::DenseMap<unsigned int, llvm::LaneBitmask>, unsigned int, llvm::LaneBitmask, llvm::DenseMapInfo<unsigned int>, llvm::detail::DenseMapPair<unsigned int, llvm::LaneBitmask>>::at(const_arg_type_t<KeyT>) [DerivedT = llvm::DenseMap<unsigned int, llvm::LaneBitmask>, KeyT = unsigned int, ValueT = llvm::LaneBitmask, KeyInfoT = llvm::DenseMapInfo<unsigned int>, BucketT = llvm::detail::DenseMapPair<unsigned int, llvm::LaneBitmask>]: Assertion `Iter != this->end() && "DenseMap::at failed due to a missing key"' failed.
# | PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace and instructions to reproduce the bug.
# | Stack dump:
# | 0.	Program arguments: /repo/main/llvm/build-all-expensive/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 -run-pass=machine-scheduler -amdgpu-disable-rewrite-mfma-form-sched-stage=false -o - /repo/main/llvm/test/CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir
# | 1.	Running pass 'Function Pass Manager' on module '/repo/main/llvm/test/CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir'.
# | 2.	Running pass 'Machine Instruction Scheduler' on function '@src2_singledef_singleuse_dst_singleuse_singledef_vgpr'
# |  #0 0x000056282a67ac06 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/repo/main/llvm/build-all-expensive/bin/llc+0x8b32c06)
# |  #1 0x000056282a678365 llvm::sys::RunSignalHandlers() (/repo/main/llvm/build-all-expensive/bin/llc+0x8b30365)
# |  #2 0x000056282a67ba09 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
# |  #3 0x00007f593a1cb990 __restore_rt (/lib64/libpthread.so.0+0x12990)
# |  #4 0x00007f5937b6a52f raise (/lib64/libc.so.6+0x4e52f)
# |  #5 0x00007f5937b3de65 abort (/lib64/libc.so.6+0x21e65)
# |  #6 0x00007f5937b3dd39 _nl_load_domain.cold.0 (/lib64/libc.so.6+0x21d39)
# |  #7 0x00007f5937b62e86 (/lib64/libc.so.6+0x46e86)
# |  #8 0x000056282779d224 llvm::PreRARematStage::rematerialize(llvm::PreRARematStage::RematReg const&, llvm::BitVector&, llvm::PreRARematStage::RollbackInfo*) GCNSchedStrategy.cpp:0:0
# |  #9 0x000056282779a595 llvm::PreRARematStage::initGCNSchedStage() GCNSchedStrategy.cpp:0:0
# | #10 0x0000562827791f8f llvm::GCNScheduleDAGMILive::runSchedStages() GCNSchedStrategy.cpp:0:0
# | #11 0x0000562829541747 llvm::impl_detail::MachineSchedulerBase::scheduleRegions(llvm::ScheduleDAGInstrs&, bool) (/repo/main/llvm/build-all-expensive/bin/llc+0x79f9747)
# | #12 0x0000562829540a59 llvm::impl_detail::MachineSchedulerImpl::run(llvm::MachineFunction&, llvm::TargetMachine const&, llvm::impl_detail::MachineSchedulerImpl::RequiredAnalyses const&) (/repo/main/llvm/build-all-expensive/bin/llc+0x79f8a59)
# | #13 0x0000562829559986 (anonymous namespace)::MachineSchedulerLegacy::runOnMachineFunction(llvm::MachineFunction&) MachineScheduler.cpp:0:0
# | #14 0x0000562829453167 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/repo/main/llvm/build-all-expensive/bin/llc+0x790b167)
# | #15 0x0000562829a95334 llvm::FPPassManager::runOnFunction(llvm::Function&) (/repo/main/llvm/build-all-expensive/bin/llc+0x7f4d334)
# | #16 0x0000562829a9d562 llvm::FPPassManager::runOnModule(llvm::Module&) (/repo/main/llvm/build-all-expensive/bin/llc+0x7f55562)
# | #17 0x0000562829a95e56 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/repo/main/llvm/build-all-expensive/bin/llc+0x7f4de56)
# | #18 0x0000562826f983b7 compileModule(char**, llvm::SmallVectorImpl<llvm::PassPlugin>&, llvm::LLVMContext&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>&) llc.cpp:0:0
# | #19 0x0000562826f95816 main (/repo/main/llvm/build-all-expensive/bin/llc+0x544d816)
# | #20 0x00007f5937b567e5 __libc_start_main (/lib64/libc.so.6+0x3a7e5)
# | #21 0x0000562826f949ae _start (/repo/main/llvm/build-all-expensive/bin/llc+0x544c9ae)

@lucas-rami (Contributor Author)

@mikaelholmen Thanks for the catch, looking into it now.

lucas-rami added a commit that referenced this pull request Feb 3, 2026
…179461)

#177206 exposed a pre-existing typo in an EXPENSIVE_CHECKS-only check in
the scheduler's rematerialization stage. When rematerializing a register
that depends on other registers, the dependent registers should be live
at the live-ins of the rematerialized register's using region, but not
necessarily at the live-ins of its defining region, as the check was
written before this fix.

This fixes that and also hoists the check from the loop on regions where
the rematerialized register is live, since it's only supposed to run
once for the using region anyway. This fixes the following unit tests
which were failing when building with EXPENSIVE_CHECKS.
- `CodeGen/AMDGPU/copy-hoist-no-spills.ll`
- `CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir`
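
A hedged sketch of the corrected check's shape (hypothetical names and placement; only the region choice matters here):

```cpp
// Illustrative only: a remat candidate's register operands must be live-in
// to the region it is rematerialized into (the using region), not to the
// region that originally defined it.
#ifdef EXPENSIVE_CHECKS
for (const MachineOperand &MO : Remat.DefMI->uses()) {
  if (MO.isReg() && MO.getReg().isVirtual())
    assert(DAG.LiveIns[Remat.UseRegion].contains(MO.getReg()) &&
           "remat operand not live-in to the using region");
}
#endif
```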
moar55 pushed a commit to moar55/llvm-project that referenced this pull request Feb 3, 2026
…lvm#179461)

