
Re-apply "[AMDGPU][Scheduler] Scoring system for rematerializations (#175050)" #177206

Merged
lucas-rami merged 4 commits into main from users/lucas-rami/rematerialization-reapply/4 on Feb 2, 2026
Conversation

@lucas-rami (Contributor) commented Jan 21, 2026

This re-applies commit f21e359 along with the fix for the compile failure introduced in 8ab7937 before the initial patch was reverted. It also fixes the previously observed assert failure.

We were hitting the assert on the HIP Blender workload due to a combination of two issues that could occur when rematerializations are rolled back.

  1. Small changes in slot indices (while preserving instruction order) compared to the pre-re-scheduling state mean that we have to re-compute live ranges for all register operands of rolled-back rematerializations. This was not being done before.
  2. Re-scheduling can move instructions whose register was rematerialized (and whose opcode is temporarily set to DBG_VALUE) to arbitrary positions in their respective regions, even before their read operands are defined. This makes re-scheduling reverts mandatory before rolling back rematerializations, as otherwise def-use chains may be broken. The original patch did not guarantee that ordering, but previous refactoring of the rollback/revert logic for the rematerialization stage now ensures that reverts always precede rollbacks, as sketched below.
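
A minimal sketch of that ordering, assuming hypothetical helper names (neither `revertScheduling` nor `rollbackRematerializations` is the actual stage interface):

```cpp
// Illustrative only: the ordering constraint from point 2. Reverts must
// restore pre-re-scheduling instruction order before any rematerialization
// is rolled back.
void finalizeStage() {
  // 1. Put every affected region back in its pre-re-scheduling order, so
  //    each DBG_VALUE-ified def once again precedes all of its reads.
  revertScheduling();
  // 2. Only then undo the rematerializations; slot indices may have shifted
  //    slightly, so live ranges of all register operands are re-computed.
  rollbackRematerializations();
}
```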

@llvmbot (Member) commented Jan 21, 2026

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

Patch is 174.02 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/177206.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp (+510-294)
  • (modified) llvm/lib/Target/AMDGPU/GCNSchedStrategy.h (+208-51)
  • (added) llvm/test/CodeGen/AMDGPU/machine-scheduler-rematerialization-scoring.mir (+523)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-attr.mir (+194-194)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats-debug.mir (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir (+238-31)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+1-1)
diff --git a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
index 8bd47aaadae56..7365030267b5a 100644
--- a/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
+++ b/llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
@@ -28,11 +28,18 @@
 #include "GCNRegPressure.h"
 #include "SIMachineFunctionInfo.h"
 #include "Utils/AMDGPUBaseInfo.h"
+#include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/CodeGen/CalcSpillWeights.h"
 #include "llvm/CodeGen/MachineOperand.h"
+#include "llvm/CodeGen/MachineBasicBlock.h"
+#include "llvm/CodeGen/MachineBlockFrequencyInfo.h"
+#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
 #include "llvm/CodeGen/RegisterClassInfo.h"
 #include "llvm/MC/LaneBitmask.h"
+#include "llvm/MC/MCInstrItineraries.h"
+#include "llvm/MC/MCSchedule.h"
+#include "llvm/MC/TargetRegistry.h"
 #include "llvm/Support/ErrorHandling.h"
 
 #define DEBUG_TYPE "machine-scheduler"
@@ -970,6 +977,8 @@ void GCNScheduleDAGMILive::schedule() {
 
 GCNRegPressure
 GCNScheduleDAGMILive::getRealRegPressure(unsigned RegionIdx) const {
+  if (Regions[RegionIdx].first == Regions[RegionIdx].second)
+    return llvm::getRegPressure(MRI, LiveIns[RegionIdx]);
   GCNDownwardRPTracker RPTracker(*LIS);
   RPTracker.advance(Regions[RegionIdx].first, Regions[RegionIdx].second,
                     &LiveIns[RegionIdx]);
@@ -1274,33 +1283,222 @@ bool ClusteredLowOccStage::initGCNSchedStage() {
 #define REMAT_PREFIX "[PreRARemat] "
 #define REMAT_DEBUG(X) LLVM_DEBUG(dbgs() << REMAT_PREFIX; X;)
 
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+Printable PreRARematStage::ScoredRemat::print() const {
+  return Printable([&](raw_ostream &OS) {
+    OS << '(' << MaxFreq << ", " << FreqDiff << ", " << RegionImpact << ')';
+  });
+}
+#endif
+
 bool PreRARematStage::initGCNSchedStage() {
   // FIXME: This pass will invalidate cached BBLiveInMap and MBBLiveIns for
   // regions inbetween the defs and region we sinked the def to. Will need to be
   // fixed if there is another pass after this pass.
   assert(!S.hasNextStage());
 
-  if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() == 1)
+  if (!GCNSchedStage::initGCNSchedStage() || DAG.Regions.size() <= 1)
     return false;
 
+  // Maps all MIs (except lone terminators, which are not part of any region) to
+  // their parent region. Non-lone terminators are considered part of the region
+  // they delimit.
+  DenseMap<MachineInstr *, unsigned> MIRegion(MF.getInstructionCount());
+
   // Before performing any IR modification record the parent region of each MI
   // and the parent MBB of each region.
   const unsigned NumRegions = DAG.Regions.size();
-  RegionBB.reserve(NumRegions);
   for (unsigned I = 0; I < NumRegions; ++I) {
     RegionBoundaries Region = DAG.Regions[I];
     for (auto MI = Region.first; MI != Region.second; ++MI)
       MIRegion.insert({&*MI, I});
-    RegionBB.push_back(Region.first->getParent());
+    MachineBasicBlock *ParentMBB = Region.first->getParent();
+    if (Region.second != ParentMBB->end())
+      MIRegion.insert({&*Region.second, I});
+    RegionBB.push_back(ParentMBB);
   }
 
-  if (!canIncreaseOccupancyOrReduceSpill())
+#ifndef NDEBUG
+  auto PrintTargetRegions = [&]() -> void {
+    if (TargetRegions.none()) {
+      dbgs() << REMAT_PREFIX << "No target regions\n";
+      return;
+    }
+    dbgs() << REMAT_PREFIX << "Target regions:\n";
+    for (unsigned I : TargetRegions.set_bits())
+      dbgs() << REMAT_PREFIX << "  [" << I << "] " << RPTargets[I] << '\n';
+  };
+  auto PrintRematReg = [&](const RematReg &Remat) -> Printable {
+    return Printable([&, Remat](raw_ostream &OS) {
+      // Concatenate all region numbers in which the register is unused and
+      // live-through.
+      bool HasLiveThroughRegion = false;
+      OS << '[' << Remat.DefRegion << " -";
+      for (unsigned I = 0; I < NumRegions; ++I) {
+        if (Remat.isUnusedLiveThrough(I)) {
+          if (HasLiveThroughRegion) {
+            OS << ',';
+          } else {
+            OS << "- ";
+            HasLiveThroughRegion = true;
+          }
+          OS << I;
+        }
+      }
+      if (HasLiveThroughRegion)
+        OS << " -";
+      OS << "-> " << Remat.UseRegion << "] ";
+      Remat.DefMI->print(OS, /*IsStandalone=*/true, /*SkipOpers=*/false,
+                         /*SkipDebugLoc=*/false, /*AddNewLine=*/false);
+    });
+  };
+#endif
+
+  // Set an objective for the stage based on current RP in each region.
+  REMAT_DEBUG({
+    dbgs() << "Analyzing ";
+    MF.getFunction().printAsOperand(dbgs(), false);
+    dbgs() << ": ";
+  });
+  if (!setObjective()) {
+    LLVM_DEBUG(dbgs() << "no objective to achieve, occupancy is maximal at "
+                      << MFI.getMaxWavesPerEU() << '\n');
+    return false;
+  }
+  LLVM_DEBUG({
+    if (TargetOcc) {
+      dbgs() << "increase occupancy from " << *TargetOcc - 1 << '\n';
+    } else {
+      dbgs() << "reduce spilling (minimum target occupancy is "
+             << MFI.getMinWavesPerEU() << ")\n";
+    }
+    PrintTargetRegions();
+  });
+
+  if (!collectRematRegs(MIRegion)) {
+    REMAT_DEBUG(dbgs() << "No rematerializable registers\n");
+    return false;
+  }
+  const ScoredRemat::FreqInfo FreqInfo(MF, DAG);
+  REMAT_DEBUG({
+    dbgs() << "Rematerializable registers:\n";
+    for (const RematReg &Remat : RematRegs)
+      dbgs() << REMAT_PREFIX << "  " << PrintRematReg(Remat) << '\n';
+    dbgs() << REMAT_PREFIX << "Region frequencies\n";
+    for (auto [I, Freq] : enumerate(FreqInfo.Regions)) {
+      dbgs() << REMAT_PREFIX << "  [" << I << "] ";
+      if (Freq)
+        dbgs() << Freq;
+      else
+        dbgs() << "unknown ";
+      dbgs() << " | " << *DAG.Regions[I].first;
+    }
+  });
+
+  SmallVector<ScoredRemat> ScoredRemats;
+  for (const RematReg &Remat : RematRegs)
+    ScoredRemats.emplace_back(&Remat, FreqInfo, DAG);
+
+// Rematerialize registers in successive rounds until all RP targets are
+// satisfied or until we run out of rematerialization candidates.
+#ifndef NDEBUG
+  unsigned RoundNum = 0;
+#endif
+  BitVector RecomputeRP(NumRegions);
+  do {
+    assert(!ScoredRemats.empty() && "no more remat candidates");
+
+    // (Re-)Score and (re-)sort all remats in increasing score order.
+    for (ScoredRemat &Remat : ScoredRemats)
+      Remat.update(TargetRegions, RPTargets, FreqInfo, !TargetOcc);
+    sort(ScoredRemats);
+
+    REMAT_DEBUG({
+      dbgs() << "==== ROUND " << RoundNum++ << " ====\n"
+             << REMAT_PREFIX
+             << "Candidates with non-null score, in rematerialization order:\n";
+      for (const ScoredRemat &RematDecision : reverse(ScoredRemats)) {
+        if (RematDecision.hasNullScore())
+          break;
+        dbgs() << REMAT_PREFIX << "  " << RematDecision.print() << " | "
+               << *RematDecision.Remat->DefMI;
+      }
+      PrintTargetRegions();
+    });
+
+    RecomputeRP.reset();
+    unsigned RematIdx = ScoredRemats.size();
+
+    // Rematerialize registers in decreasing score order until we estimate
+    // that all RP targets are satisfied or until rematerialization candidates
+    // are no longer useful to decrease RP.
+    for (; RematIdx && TargetRegions.any(); --RematIdx) {
+      const ScoredRemat &Candidate = ScoredRemats[RematIdx - 1];
+      // Stop rematerializing on encountering a null score. Since scores
+      // monotonically decrease as we rematerialize, we know there is nothing
+      // useful left to do in such cases, even if we were to re-score.
+      if (Candidate.hasNullScore()) {
+        RematIdx = 0;
+        break;
+      }
+
+      const RematReg &Remat = *Candidate.Remat;
+      // When previous rematerializations in this round have already satisfied
+      // RP targets in all regions this rematerialization can impact, we have a
+      // good indication that our scores have diverged significantly from
+      // reality, in which case we interrupt this round and re-score. This also
+      // ensures that every rematerialization we perform is possibly impactful
+      // in at least one target region.
+      if (!Remat.maybeBeneficial(TargetRegions, RPTargets))
+        break;
+
+      REMAT_DEBUG(dbgs() << "** REMAT " << PrintRematReg(Remat) << '\n';);
+      // Every rematerialization we do here is likely to move the instruction
+      // into a higher frequency region, increasing the total sum latency of the
+      // instruction itself. This is acceptable if we are eliminating a spill in
+      // the process, but when the goal is increasing occupancy we get nothing
+      // out of rematerialization if occupancy is not increased in the end; in
+      // such cases we want to roll back the rematerialization.
+      RollbackInfo *Rollback =
+          TargetOcc ? &Rollbacks.emplace_back(&Remat) : nullptr;
+      rematerialize(Remat, RecomputeRP, Rollback);
+      unsetSatisifedRPTargets(Remat.Live);
+    }
+
+    REMAT_DEBUG({
+      if (!TargetRegions.any()) {
+        dbgs() << "** Interrupt round on all targets achieved\n";
+      } else if (RematIdx) {
+        dbgs() << "** Interrupt round on stale score for "
+               << *ScoredRemats[RematIdx - 1].Remat->DefMI;
+      } else {
+        dbgs() << "** Stop on exhausted rematerialization candidates\n";
+      }
+    });
+
+    // Peel off registers we already rematerialized from the vector's tail.
+    ScoredRemats.truncate(RematIdx);
+  } while ((updateAndVerifyRPTargets(RecomputeRP) || TargetRegions.any()) &&
+           !ScoredRemats.empty());
+  if (RescheduleRegions.none())
     return false;
 
-  // Rematerialize identified instructions and update scheduler's state.
-  rematerialize();
-  if (GCNTrackers)
-    DAG.RegionLiveOuts.buildLiveRegMap();
+  // Commit all pressure changes to the DAG and compute minimum achieved
+  // occupancy in impacted regions.
+  REMAT_DEBUG(dbgs() << "==== REMAT RESULTS ====\n");
+  unsigned DynamicVGPRBlockSize = MFI.getDynamicVGPRBlockSize();
+  for (unsigned I : RescheduleRegions.set_bits()) {
+    DAG.Pressure[I] = RPTargets[I].getCurrentRP();
+    REMAT_DEBUG(dbgs() << '[' << I << "] Achieved occupancy "
+                       << DAG.Pressure[I].getOccupancy(ST, DynamicVGPRBlockSize)
+                       << " (" << RPTargets[I] << ")\n");
+  }
+  AchievedOcc = MFI.getMaxWavesPerEU();
+  for (const GCNRegPressure &RP : DAG.Pressure) {
+    AchievedOcc =
+        std::min(AchievedOcc, RP.getOccupancy(ST, DynamicVGPRBlockSize));
+  }
+
   REMAT_DEBUG({
     dbgs() << "Retrying function scheduling with new min. occupancy of "
            << AchievedOcc << " from rematerializing (original was "
@@ -1339,6 +1537,10 @@ void UnclusteredHighRPStage::finalizeGCNSchedStage() {
 }
 
 bool GCNSchedStage::initGCNRegion() {
+  // Skip empty scheduling region.
+  if (DAG.begin() == DAG.end())
+    return false;
+
   // Check whether this new region is also a new block.
   if (DAG.RegionBegin->getParent() != CurrentMBB)
     setupNewBlock();
@@ -1346,8 +1548,8 @@ bool GCNSchedStage::initGCNRegion() {
   unsigned NumRegionInstrs = std::distance(DAG.begin(), DAG.end());
   DAG.enterRegion(CurrentMBB, DAG.begin(), DAG.end(), NumRegionInstrs);
 
-  // Skip empty scheduling regions (0 or 1 schedulable instructions).
-  if (DAG.begin() == DAG.end() || DAG.begin() == std::prev(DAG.end()))
+  // Skip regions with 1 schedulable instruction.
+  if (DAG.begin() == std::prev(DAG.end()))
     return false;
 
   LLVM_DEBUG(dbgs() << "********** MI Scheduling **********\n");
@@ -1842,37 +2044,21 @@ unsigned PreRARematStage::getStageTargetOccupancy() const {
   return TargetOcc ? *TargetOcc : MFI.getMinWavesPerEU();
 }
 
-bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
+bool PreRARematStage::setObjective() {
   const Function &F = MF.getFunction();
 
-  // Maps optimizable regions (i.e., regions at minimum and register-limited
-  // occupancy, or regions with spilling) to the target RP we would like to
-  // reach.
-  DenseMap<unsigned, GCNRPTarget> OptRegions;
+  // Set up "spilling targets" for all regions.
   unsigned MaxSGPRs = ST.getMaxNumSGPRs(F);
   unsigned MaxVGPRs = ST.getMaxNumVGPRs(F);
-  bool HasVectorRegisterExcess;
-
-  auto ResetTargetRegions = [&]() {
-    OptRegions.clear();
-    HasVectorRegisterExcess = false;
-    for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-      const GCNRegPressure &RP = DAG.Pressure[I];
-      GCNRPTarget Target(MaxSGPRs, MaxVGPRs, MF, RP);
-      if (!Target.satisfied())
-        OptRegions.insert({I, Target});
-      // We are willing to consider a RP that does not satisfy the target
-      // as long as it doesn't result in spilling to memory. We
-      // tolerate SGPR spills to VGPR since these have a lower impact on the
-      // performance.
-      // VGPR/AGPR excess could be aliviated by copying to an AGPR/VGPR.
-      // Currently, we consider any spilling as bad (we should consider this
-      // case in the future).
-      HasVectorRegisterExcess |= Target.hasVectorRegisterExcess();
-    }
-  };
+  bool HasVectorRegisterExcess = false;
+  for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
+    const GCNRegPressure &RP = DAG.Pressure[I];
+    GCNRPTarget &Target = RPTargets.emplace_back(MaxSGPRs, MaxVGPRs, MF, RP);
+    if (!Target.satisfied())
+      TargetRegions.set(I);
+    HasVectorRegisterExcess |= Target.hasVectorRegisterExcess();
+  }
 
-  ResetTargetRegions();
   if (HasVectorRegisterExcess || DAG.MinOccupancy >= MFI.getMaxWavesPerEU()) {
     // In addition to register usage being above addressable limits, occupancy
     // below the minimum is considered like "spilling" as well.
@@ -1881,94 +2067,68 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
     // There is no spilling and room to improve occupancy; set up "increased
     // occupancy targets" for all regions.
     TargetOcc = DAG.MinOccupancy + 1;
-    unsigned VGPRBlockSize =
-        MF.getInfo<SIMachineFunctionInfo>()->getDynamicVGPRBlockSize();
+    const unsigned VGPRBlockSize = MFI.getDynamicVGPRBlockSize();
     MaxSGPRs = ST.getMaxNumSGPRs(*TargetOcc, false);
     MaxVGPRs = ST.getMaxNumVGPRs(*TargetOcc, VGPRBlockSize);
-    ResetTargetRegions();
-  }
-  REMAT_DEBUG({
-    dbgs() << "Analyzing ";
-    MF.getFunction().printAsOperand(dbgs(), false);
-    dbgs() << ": ";
-    if (OptRegions.empty()) {
-      dbgs() << "no objective to achieve, occupancy is maximal at "
-             << MFI.getMaxWavesPerEU();
-    } else if (!TargetOcc) {
-      dbgs() << "reduce spilling (minimum target occupancy is "
-             << MFI.getMinWavesPerEU() << ')';
-    } else {
-      dbgs() << "increase occupancy from " << DAG.MinOccupancy << " to "
-             << TargetOcc;
-    }
-    dbgs() << '\n';
-    for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-      if (auto OptIt = OptRegions.find(I); OptIt != OptRegions.end()) {
-        dbgs() << REMAT_PREFIX << "  [" << I << "] " << OptIt->getSecond()
-               << '\n';
-      }
+    for (auto [I, Target] : enumerate(RPTargets)) {
+      Target.setTarget(MaxSGPRs, MaxVGPRs);
+      if (!Target.satisfied())
+        TargetRegions.set(I);
     }
-  });
-  if (OptRegions.empty())
-    return false;
+  }
 
-  // Accounts for a reduction in RP in an optimizable region. Returns whether we
-  // estimate that we have identified enough rematerialization opportunities to
-  // achieve our goal, and sets Progress to true when this particular reduction
-  // in pressure was helpful toward that goal.
-  auto ReduceRPInRegion = [&](auto OptIt, Register Reg, LaneBitmask Mask,
-                              bool &Progress) -> bool {
-    GCNRPTarget &Target = OptIt->getSecond();
-    if (!Target.isSaveBeneficial(Reg))
-      return false;
-    Progress = true;
-    Target.saveReg(Reg, Mask, DAG.MRI);
-    if (Target.satisfied())
-      OptRegions.erase(OptIt->getFirst());
-    return OptRegions.empty();
-  };
+  return TargetRegions.any();
+}
 
+bool PreRARematStage::collectRematRegs(
+    const DenseMap<MachineInstr *, unsigned> &MIRegion) {
   // We need up-to-date live-out info. to query live-out register masks in
   // regions containing rematerializable instructions.
   DAG.RegionLiveOuts.buildLiveRegMap();
 
-  // Cache set of registers that are going to be rematerialized.
-  DenseSet<unsigned> RematRegs;
+  // Set of registers already marked for potential rematerialization; used to
+  // avoid rematerialization chains.
+  SmallSet<Register, 4> MarkedRegs;
+  auto IsMarkedForRemat = [&MarkedRegs](const MachineOperand &MO) -> bool {
+    return MO.isReg() && MarkedRegs.contains(MO.getReg());
+  };
 
   // Identify rematerializable instructions in the function.
   for (unsigned I = 0, E = DAG.Regions.size(); I != E; ++I) {
-    auto Region = DAG.Regions[I];
-    for (auto MI = Region.first; MI != Region.second; ++MI) {
+    RegionBoundaries Bounds = DAG.Regions[I];
+    for (auto MI = Bounds.first; MI != Bounds.second; ++MI) {
       // The instruction must be rematerializable.
       MachineInstr &DefMI = *MI;
       if (!isReMaterializable(DefMI))
         continue;
 
-      // We only support rematerializing virtual registers with one definition.
+      // We only support rematerializing virtual registers with one
+      // definition.
       Register Reg = DefMI.getOperand(0).getReg();
       if (!Reg.isVirtual() || !DAG.MRI.hasOneDef(Reg))
         continue;
 
       // We only care to rematerialize the instruction if it has a single
-      // non-debug user in a different region. The using MI may not belong to a
-      // region if it is a lone region terminator.
+      // non-debug user in a different region.
+      // FIXME: Allow rematerializations with multiple uses. This should be
+      // relatively easy to support using the current cost model.
       MachineInstr *UseMI = DAG.MRI.getOneNonDBGUser(Reg);
       if (!UseMI)
         continue;
       auto UseRegion = MIRegion.find(UseMI);
-      if (UseRegion != MIRegion.end() && UseRegion->second == I)
+      if (UseRegion == MIRegion.end() || UseRegion->second == I)
         continue;
 
       // Do not rematerialize an instruction if it uses or is used by an
       // instruction that we have designated for rematerialization.
       // FIXME: Allow for rematerialization chains: this requires 1. updating
-      // remat points to account for uses that are rematerialized, and 2. either
-      // rematerializing the candidates in careful ordering, or deferring the
-      // MBB RP walk until the entire chain has been rematerialized.
-      if (Rematerializations.contains(UseMI) ||
-          llvm::any_of(DefMI.operands(), [&RematRegs](MachineOperand &MO) {
-            return MO.isReg() && RematRegs.contains(MO.getReg());
-          }))
+      // remat points to account for uses that are rematerialized, and 2.
+      // either rematerializing the candidates in careful ordering, or
+      // deferring the MBB RP walk until the entire chain has been
+      // rematerialized.
+      const MachineOperand &UseMO = UseMI->getOperand(0);
+      if (IsMarkedForRemat(UseMO) ||
+          llvm::any_of(DefMI.operands(), IsMarkedForRemat))
         continue;
 
       // Do not rematerialize an instruction if it uses registers that aren't
@@ -1979,106 +2139,182 @@ bool PreRARematStage::canIncreaseOccupancyOrReduceSpill() {
                                               *DAG.TII))
         continue;
 
-      REMAT_DEBUG(dbgs() << "Region " << I << ": remat instruction " << DefMI);
-      RematInstruction &Remat =
-          Rematerializations.try_emplace(&DefMI, UseMI).first->second;
-
-      bool RematUseful = false;
-      if (auto It = OptRegions.find(I); It != OptRegions.end()) {
-        // Optimistically consider that moving the instruction out of its
-        // defining region will reduce RP in the latter; this assumes that
-        // maximum RP in the region is reached somewhere between the defining
-        // instruction and the end of the region.
-        REMAT_DEBUG(dbgs() << "  Defining region is optimizable\n");
-        LaneBitmask Mask = DAG.RegionLiveOuts.getLiveRegsForRegionIdx(I)[Reg];
-        if (ReduceRPInRegion(It, Reg, Mask, RematUseful))
-          return true;
-      }
-
-      for (unsigned LIRegion = 0; LIRegion != E; ++LIRegion) {
-        // We are only collecting regions in which the register is a live-in
-        // (and may be live-thr...
[truncated]
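
Since the diff above is truncated, here is a condensed, hedged sketch of the round-based driver added to `PreRARematStage::initGCNSchedStage` (control flow simplified from the hunks above; the `updateAndVerifyRPTargets` re-check and rollback bookkeeping are elided):

```cpp
// Condensed sketch, not the exact code: rematerialize in rounds, from the
// highest-scored candidate down, and start a new round whenever scores go
// stale relative to the remaining register-pressure (RP) targets.
while (!ScoredRemats.empty() && TargetRegions.any()) {
  // (Re-)score all remaining candidates and sort in increasing score order.
  for (ScoredRemat &R : ScoredRemats)
    R.update(TargetRegions, RPTargets, FreqInfo, !TargetOcc);
  sort(ScoredRemats);

  unsigned Idx = ScoredRemats.size();
  for (; Idx && TargetRegions.any(); --Idx) {
    const ScoredRemat &Best = ScoredRemats[Idx - 1];
    // Scores only decrease between rounds, so a null score means nothing
    // useful remains at all.
    if (Best.hasNullScore()) {
      Idx = 0;
      break;
    }
    // Every target region this remat could impact is already satisfied:
    // scores have diverged from reality, so interrupt the round and re-score.
    if (!Best.Remat->maybeBeneficial(TargetRegions, RPTargets))
      break;
    rematerialize(*Best.Remat);
  }
  // Peel off the candidates rematerialized this round.
  ScoredRemats.truncate(Idx);
}
```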

@github-actions bot commented Jan 21, 2026

✅ With the latest revision this PR passed the C/C++ code formatter.

@lucas-rami force-pushed the users/lucas-rami/rematerialization-reapply/4 branch from c324769 to 8155bcf on January 21, 2026 17:30
Rollback->RematMI = RematMI;
// Make the original MI a debug instruction so that it does not influence
// scheduling.
DefMI.setDesc(TII->get(TargetOpcode::DBG_VALUE));
Contributor:

I don't really love this usage of DBG_VALUE. It's also at risk of becoming an obstacle for the debug record transition

Contributor Author:

Me neither, I am still trying to think of a way to make this better.

Collaborator:

Could we use KILL instead?

Contributor Author:

I didn't know about KILL instructions. I will try that, thanks for the suggestion.

Contributor:

What if there is a def on the instruction? Kill with a def on it will be an issue

Contributor Author:

Yes, the def/KILL interaction causes issues here. I was able to verify that KILL instructions with defs still contribute to and increase RP locally. Even if the defined register is marked dead, RP increases over the instruction, which can incorrectly push our max pressure estimate above what it really is.
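
For concreteness, a hedged sketch of the rejected alternative (the opcode is real; the surrounding context is assumed):

```cpp
// Illustrative only: swapping the placeholder opcode to KILL instead of
// DBG_VALUE keeps the instruction "real" for the scheduler, but a KILL that
// still carries a def occupies a register at that point -- even when the
// def is marked dead -- so the tracked max RP of the region can end up
// above the true pressure, defeating the purpose of the rematerialization.
DefMI.setDesc(TII->get(TargetOpcode::KILL));
```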

@lucas-rami (Contributor Author) commented Jan 22, 2026

Sorry for the late change @arsenm, I realized the implementation was triggering an issue on PSDB (ROCm#1170). The last commit fixes this by replacing register operands of instructions whose opcode we set to DBG_VALUE with the sentinel register during re-scheduling. If we don't do that, these operands remain in the use-lists of other registers, and this can make us hit a LIS assert (!BundleNonDebug.isDebugInstr() && "Could not use a debug instruction to query mi2iMap.") when updating their intervals.

The const qualifiers that the last commit removes could also trigger compile failures depending on the config (#175755).
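
A hedged sketch of the operand replacement described above (assumed shape, not the actual commit; the original registers are presumed saved elsewhere for the eventual rollback):

```cpp
// Illustrative only: detach the DBG_VALUE-ified instruction's register
// operands from MRI use-lists by substituting the $noreg sentinel, so that
// LiveIntervals never reaches a debug instruction through a use-list while
// updating intervals during re-scheduling.
for (MachineOperand &MO : DefMI.operands()) {
  if (MO.isReg() && MO.getReg().isVirtual())
    MO.setReg(Register()); // a default-constructed Register is $noreg
}
```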

…175050)"

This re-applies commit f21e359 along
with the fix for the compile failure introduced in
8ab7937 before the initial patch was
reverted, and fixes the previously observed assert failure.

We were hitting the assert in the HIP Blender due to a combination of
two issues that could happen when rematerializations are being rolled
back.

1. Small changes in slot indices (while preserving instruction order)
   compared to the pre-re-scheduling state mean that we have to
   re-compute live ranges for all register operands of rolled-back
   rematerializations. This was not being done before.
2. Re-scheduling can move registers that were rematerialized at
   arbitrary positions in their respective regions while their opcode
   is set to DBG_VALUE, even before their read operands are defined.
   This makes re-scheduling reverts mandatory before rolling back
   rematerializations, as otherwise def-use chains may be broken.
   The original patch did not guarantee that, but previous refactoring
   of the rollback/revert logic for the rematerialization stage now
   ensures that reverts always precede rollbacks.
Also removes a bunch of const specifiers on class members that prevent
std::sort from compiling on some configs.
@lucas-rami force-pushed the users/lucas-rami/rematerialization-reapply/4 branch from e786740 to acf2bbd on February 2, 2026 01:13
@lucas-rami merged commit 9dc6b49 into main on Feb 2, 2026
10 checks passed
@lucas-rami deleted the users/lucas-rami/rematerialization-reapply/4 branch on February 2, 2026 11:34
@mikaelholmen (Collaborator)

Hello @lucas-rami

The following two lit tests start crashing with this patch if we build with EXPENSIVE_CHECKS:

FAIL: LLVM :: CodeGen/AMDGPU/copy-hoist-no-spills.ll (1654 of 4564)
FAIL: LLVM :: CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir (3793 of 4564)

We get

# | llc: ../include/llvm/ADT/DenseMap.h:226: ValueT &llvm::DenseMapBase<llvm::DenseMap<unsigned int, llvm::LaneBitmask>, unsigned int, llvm::LaneBitmask, llvm::DenseMapInfo<unsigned int>, llvm::detail::DenseMapPair<unsigned int, llvm::LaneBitmask>>::at(const_arg_type_t<KeyT>) [DerivedT = llvm::DenseMap<unsigned int, llvm::LaneBitmask>, KeyT = unsigned int, ValueT = llvm::LaneBitmask, KeyInfoT = llvm::DenseMapInfo<unsigned int>, BucketT = llvm::detail::DenseMapPair<unsigned int, llvm::LaneBitmask>]: Assertion `Iter != this->end() && "DenseMap::at failed due to a missing key"' failed.
# | PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace and instructions to reproduce the bug.
# | Stack dump:
# | 0.	Program arguments: /repo/main/llvm/build-all-expensive/bin/llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx950 -run-pass=machine-scheduler -amdgpu-disable-rewrite-mfma-form-sched-stage=false -o - /repo/main/llvm/test/CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir
# | 1.	Running pass 'Function Pass Manager' on module '/repo/main/llvm/test/CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir'.
# | 2.	Running pass 'Machine Instruction Scheduler' on function '@src2_singledef_singleuse_dst_singleuse_singledef_vgpr'
# |  #0 0x000056282a67ac06 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/repo/main/llvm/build-all-expensive/bin/llc+0x8b32c06)
# |  #1 0x000056282a678365 llvm::sys::RunSignalHandlers() (/repo/main/llvm/build-all-expensive/bin/llc+0x8b30365)
# |  #2 0x000056282a67ba09 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
# |  #3 0x00007f593a1cb990 __restore_rt (/lib64/libpthread.so.0+0x12990)
# |  #4 0x00007f5937b6a52f raise (/lib64/libc.so.6+0x4e52f)
# |  #5 0x00007f5937b3de65 abort (/lib64/libc.so.6+0x21e65)
# |  #6 0x00007f5937b3dd39 _nl_load_domain.cold.0 (/lib64/libc.so.6+0x21d39)
# |  #7 0x00007f5937b62e86 (/lib64/libc.so.6+0x46e86)
# |  #8 0x000056282779d224 llvm::PreRARematStage::rematerialize(llvm::PreRARematStage::RematReg const&, llvm::BitVector&, llvm::PreRARematStage::RollbackInfo*) GCNSchedStrategy.cpp:0:0
# |  #9 0x000056282779a595 llvm::PreRARematStage::initGCNSchedStage() GCNSchedStrategy.cpp:0:0
# | #10 0x0000562827791f8f llvm::GCNScheduleDAGMILive::runSchedStages() GCNSchedStrategy.cpp:0:0
# | #11 0x0000562829541747 llvm::impl_detail::MachineSchedulerBase::scheduleRegions(llvm::ScheduleDAGInstrs&, bool) (/repo/main/llvm/build-all-expensive/bin/llc+0x79f9747)
# | #12 0x0000562829540a59 llvm::impl_detail::MachineSchedulerImpl::run(llvm::MachineFunction&, llvm::TargetMachine const&, llvm::impl_detail::MachineSchedulerImpl::RequiredAnalyses const&) (/repo/main/llvm/build-all-expensive/bin/llc+0x79f8a59)
# | #13 0x0000562829559986 (anonymous namespace)::MachineSchedulerLegacy::runOnMachineFunction(llvm::MachineFunction&) MachineScheduler.cpp:0:0
# | #14 0x0000562829453167 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/repo/main/llvm/build-all-expensive/bin/llc+0x790b167)
# | #15 0x0000562829a95334 llvm::FPPassManager::runOnFunction(llvm::Function&) (/repo/main/llvm/build-all-expensive/bin/llc+0x7f4d334)
# | #16 0x0000562829a9d562 llvm::FPPassManager::runOnModule(llvm::Module&) (/repo/main/llvm/build-all-expensive/bin/llc+0x7f55562)
# | #17 0x0000562829a95e56 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/repo/main/llvm/build-all-expensive/bin/llc+0x7f4de56)
# | #18 0x0000562826f983b7 compileModule(char**, llvm::SmallVectorImpl<llvm::PassPlugin>&, llvm::LLVMContext&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>&) llc.cpp:0:0
# | #19 0x0000562826f95816 main (/repo/main/llvm/build-all-expensive/bin/llc+0x544d816)
# | #20 0x00007f5937b567e5 __libc_start_main (/lib64/libc.so.6+0x3a7e5)
# | #21 0x0000562826f949ae _start (/repo/main/llvm/build-all-expensive/bin/llc+0x544c9ae)

@lucas-rami (Contributor Author)

@mikaelholmen Thanks for the catch, looking into it now.

lucas-rami added a commit that referenced this pull request Feb 3, 2026
…179461)

#177206 exposed a pre-existing typo in an EXPENSIVE_CHECKS-only check in
the scheduler's rematerialization stage. When rematerializing a register
that depends on other registers, the dependent registers should be live
at the live-ins of the rematerialized register's using region, but not
necessarily at the live-ins of its defining region, as the check was
written before this fix.

This fixes that and also hoists the check from the loop on regions where
the rematerialized register is live, since it's only supposed to run
once for the using region anyway. This fixes the following unit tests
which were failing when building with EXPENSIVE_CHECKS.
- `CodeGen/AMDGPU/copy-hoist-no-spills.ll`
- `CodeGen/AMDGPU/sched_mfma_rewrite_copies.mir`
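
A hedged sketch of the corrected check's shape (hypothetical names and placement; only the region choice matters here):

```cpp
// Illustrative only: a remat candidate's register operands must be live-in
// to the region it is rematerialized into (the using region), not to the
// region that originally defined it.
#ifdef EXPENSIVE_CHECKS
for (const MachineOperand &MO : Remat.DefMI->uses()) {
  if (MO.isReg() && MO.getReg().isVirtual())
    assert(DAG.LiveIns[Remat.UseRegion].contains(MO.getReg()) &&
           "remat operand not live-in to the using region");
}
#endif
```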
moar55 pushed a commit to moar55/llvm-project that referenced this pull request Feb 3, 2026
…lvm#179461)

