[AMDGPU] Change CF intrinsics lowering to reconverge on predecessors #108596
base: main
Conversation
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-llvm-regalloc Author: None (alex-t)
Changes
Brief overview: As we mask the EXEC bits when control flow diverges, we have to set them back after executing the instructions in the conditional block. Let's consider the following example:
The root cause of all the mentioned troubles is the attempt to make the common spill/split logic aware of (and involved in) the sophisticated, target-specific details of the control-flow implementation. We have two options to address this issue:
What was done: We opted for the second option. We made the EXEC manipulation at the divergence point conditional, which allows us to change EXEC for the conditional block but leave it unchanged along the fall-through path. Hence, we only need to restore EXEC at the end of the conditional block.
Patch is 10.24 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/108596.diff
329 Files Affected:
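To make the new placement concrete, below is a minimal hand-written LLVM IR sketch (not taken from the patch; the wave64 i64 mask mangling, block names, and values are assumptions). Previously llvm.amdgcn.end.cf was emitted at the top of the successor (join) block, i.e. the wave reconverged on the successor; with this change llvm.amdgcn.wave.reconverge is emitted at the end of the conditional block, i.e. the wave reconverges on the predecessor.

; illustrative sketch only, wave64 mangling assumed
define void @sketch(i32 %v, ptr addrspace(1) %p) {
entry:
  %cc = icmp eq i32 %v, 0
  %r = call { i1, i64 } @llvm.amdgcn.if.i64(i1 %cc)
  %guard = extractvalue { i1, i64 } %r, 0
  %mask = extractvalue { i1, i64 } %r, 1
  br i1 %guard, label %then, label %join
then:                                                   ; conditional block, predecessor of %join
  store i32 1, ptr addrspace(1) %p
  call void @llvm.amdgcn.wave.reconverge.i64(i64 %mask) ; new: restore the wave mask here
  br label %join
join:
  ; old: call void @llvm.amdgcn.end.cf.i64(i64 %mask) used to be placed here instead
  ret void
}
declare { i1, i64 } @llvm.amdgcn.if.i64(i1)
declare void @llvm.amdgcn.wave.reconverge.i64(i64)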
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee1..3ca766755a6319 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
// GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
// GFX90A-CAS-LABEL: _Z14atomic_add_casPf
// GFX90A-CAS: flat_atomic_cmpswap
-// GFX90A-CAS: s_cbranch_execnz
+// GFX90A-CAS: s_cbranch_scc1
__device__ float atomic_add_cas(float *p) {
return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
}
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index cd785331d1f146..e24a9110d596ca 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -272,9 +272,11 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
unsigned OpNo = I.getSuccessorIndex();
const Instruction *TI = Node->getTerminator();
BasicBlock *SuccBB = TI->getSuccessor(OpNo);
- auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
- double WeightPercent = ((double)BranchProb.getNumerator()) /
- ((double)BranchProb.getDenominator());
+ // auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
+ // double WeightPercent = ((double)BranchProb.getNumerator()) /
+ // ((double)BranchProb.getDenominator());
+ double WeightPercent = 0.5;
+
std::string TTAttr =
formatv("tooltip=\"{0} -> {1}\\nProbability {2:P}\" ", getBBName(Node),
getBBName(SuccBB), WeightPercent);
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 6efb17c55493a9..9fcda791fb4c72 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -879,8 +879,7 @@ class MachineBasicBlock
/// debug. This is the correct point to insert copies at the beginning of a
/// basic block. \p Reg is the register being used by a spill or defined for a
/// restore/split during register allocation.
- iterator SkipPHIsLabelsAndDebug(iterator I, Register Reg = Register(),
- bool SkipPseudoOp = true);
+ iterator SkipPHIsLabelsAndDebug(iterator I, bool SkipPseudoOp = true);
/// Returns an iterator to the first terminator instruction of this basic
/// block. If a terminator does not exist, it returns end().
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 49ce13dd8cbe39..984850980f4c9c 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -2058,8 +2058,7 @@ class TargetInstrInfo : public MCInstrInfo {
/// other instructions shall be inserted before it. This can be implemented
/// to prevent register allocator to insert spills for \p Reg before such
/// instructions.
- virtual bool isBasicBlockPrologue(const MachineInstr &MI,
- Register Reg = Register()) const {
+ virtual bool isBasicBlockPrologue(const MachineInstr &MI) const {
return false;
}
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..195ceb64eae4a8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3098,7 +3098,7 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
[llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
>;
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
[IntrWillReturn, IntrNoCallback, IntrNoFree]>;
// Represent unreachable in a divergent region.
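Since the operand type stays llvm_anyint_ty, the renamed intrinsic remains overloaded on the mask type, just as end.cf was. A sketch of the two manglings one would expect (assumed, mirroring the existing wave32/wave64 usage of end.cf):

declare void @llvm.amdgcn.wave.reconverge.i32(i32)   ; wave32 exec mask
declare void @llvm.amdgcn.wave.reconverge.i64(i64)   ; wave64 exec mask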
diff --git a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
index 3bb9da5f1a37bb..184b493694894d 100644
--- a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
+++ b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
@@ -461,8 +461,7 @@ class StatepointState {
if (EHPad && !RC.hasReload(Reg, RegToSlotIdx[Reg], EHPad)) {
RC.recordReload(Reg, RegToSlotIdx[Reg], EHPad);
- auto EHPadInsertPoint =
- EHPad->SkipPHIsLabelsAndDebug(EHPad->begin(), Reg);
+ auto EHPadInsertPoint = EHPad->SkipPHIsLabelsAndDebug(EHPad->begin());
insertReloadBefore(Reg, EHPadInsertPoint, EHPad);
LLVM_DEBUG(dbgs() << "...also reload at EHPad "
<< printMBBReference(*EHPad) << "\n");
diff --git a/llvm/lib/CodeGen/InlineSpiller.cpp b/llvm/lib/CodeGen/InlineSpiller.cpp
index 81ae805d64e1ec..201d3a5df3a536 100644
--- a/llvm/lib/CodeGen/InlineSpiller.cpp
+++ b/llvm/lib/CodeGen/InlineSpiller.cpp
@@ -463,7 +463,7 @@ bool InlineSpiller::hoistSpillInsideBB(LiveInterval &SpillLI,
MachineBasicBlock *MBB = LIS.getMBBFromIndex(SrcVNI->def);
MachineBasicBlock::iterator MII;
if (SrcVNI->isPHIDef())
- MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin(), SrcReg);
+ MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin());
else {
MachineInstr *DefMI = LIS.getInstructionFromIndex(SrcVNI->def);
assert(DefMI && "Defining instruction disappeared");
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index 5d06af3ebf3360..419d7e0312ae08 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -223,13 +223,13 @@ MachineBasicBlock::SkipPHIsAndLabels(MachineBasicBlock::iterator I) {
MachineBasicBlock::iterator
MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I,
- Register Reg, bool SkipPseudoOp) {
+ bool SkipPseudoOp) {
const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
iterator E = end();
while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
(SkipPseudoOp && I->isPseudoProbe()) ||
- TII->isBasicBlockPrologue(*I, Reg)))
+ TII->isBasicBlockPrologue(*I)))
++I;
// FIXME: This needs to change if we wish to bundle labels / dbg_values
// inside the bundle.
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index b671e510387530..22991a0fb4cb1e 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -806,10 +806,8 @@ SlotIndex SplitEditor::leaveIntvAtTop(MachineBasicBlock &MBB) {
return Start;
}
- unsigned RegIdx = 0;
- Register Reg = LIS.getInterval(Edit->get(RegIdx)).reg();
- VNInfo *VNI = defFromParent(RegIdx, ParentVNI, Start, MBB,
- MBB.SkipPHIsLabelsAndDebug(MBB.begin(), Reg));
+ VNInfo *VNI = defFromParent(0, ParentVNI, Start, MBB,
+ MBB.SkipPHIsLabelsAndDebug(MBB.begin()));
RegAssign.insert(Start, VNI->def, OpenIdx);
LLVM_DEBUG(dump());
return VNI->def;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 3fcb364fc2c536..c0d2853d159882 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1551,11 +1551,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
return true;
}
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+ MachineInstr &MI) const {
// FIXME: Manually selecting to avoid dealing with the SReg_1 trick
// SelectionDAG uses for wave32 vs wave64.
MachineBasicBlock *BB = MI.getParent();
- BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+ BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
.add(MI.getOperand(1));
Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
MachineInstr &I) const {
Intrinsic::ID IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
switch (IntrinsicID) {
- case Intrinsic::amdgcn_end_cf:
- return selectEndCfIntrinsic(I);
+ case Intrinsic::amdgcn_wave_reconverge:
+ return selectWaveReconvergeIntrinsic(I);
case Intrinsic::amdgcn_ds_ordered_add:
case Intrinsic::amdgcn_ds_ordered_swap:
return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 068db5c1c14496..c3ba26590dfbcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -116,7 +116,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
bool selectReturnAddress(MachineInstr &I) const;
bool selectG_INTRINSIC(MachineInstr &I) const;
- bool selectEndCfIntrinsic(MachineInstr &MI) const;
+ bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 4737a322c255f4..1d2ee6a4c96514 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
const unsigned MovExecOpc =
Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
- const unsigned MovExecTermOpc =
- Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
const unsigned XorTermOpc = Subtarget.isWave32() ?
AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,27 +947,27 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
B.setInsertPt(*BodyBB, BodyBB->end());
+ Register LoopMask = MRI.createVirtualRegister(
+ TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
// Update EXEC, switch all done bits to 0 and all todo bits to 1.
B.buildInstr(XorTermOpc)
- .addDef(ExecReg)
- .addReg(ExecReg)
- .addReg(NewExec);
+ .addDef(LoopMask)
+ .addReg(ExecReg)
+ .addReg(NewExec);
// XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
// s_cbranch_scc0?
// Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
- B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+ B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+ .addReg(LoopMask)
+ .addReg(NewExec)
+ .addMBB(LoopBB);
// Save the EXEC mask before the loop.
BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
.addReg(ExecReg);
- // Restore the EXEC mask after the loop.
- B.setMBB(*RestoreExecBB);
- B.buildInstr(MovExecTermOpc)
- .addDef(ExecReg)
- .addReg(SaveExecReg);
// Set the insert point after the original instruction, so any new
// instructions will be in the remainder.
@@ -4967,7 +4965,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
break;
}
- case Intrinsic::amdgcn_end_cf: {
+ case Intrinsic::amdgcn_wave_reconverge: {
unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index edd881c84078c6..cd8cbcc7f689d4 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -14,6 +14,7 @@
#include "AMDGPU.h"
#include "AMDGPUTargetMachine.h"
#include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/UniformityAnalysis.h"
#include "llvm/CodeGen/TargetPassConfig.h"
@@ -55,7 +56,7 @@ class SIAnnotateControlFlow {
Function *Else;
Function *IfBreak;
Function *Loop;
- Function *EndCf;
+ Function *WaveReconverge;
DominatorTree *DT;
StackVector Stack;
@@ -88,7 +89,7 @@ class SIAnnotateControlFlow {
bool handleLoop(BranchInst *Term);
- bool closeControlFlow(BasicBlock *BB);
+ bool tryWaveReconverge(BasicBlock *BB);
public:
SIAnnotateControlFlow(Module &M, const GCNSubtarget &ST, DominatorTree &DT,
@@ -123,7 +124,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
{ IntMask });
Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
- EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+ WaveReconverge = Intrinsic::getDeclaration(
+ &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
}
/// Is the branch condition uniform or did the StructurizeCFG pass
@@ -185,8 +187,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
/// Open a new "If" block
bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
- if (isUniform(Term))
- return false;
IRBuilder<> IRB(Term);
Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -287,43 +287,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
}
/// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
- llvm::Loop *L = LI->getLoopFor(BB);
-
- assert(Stack.back().first == BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
- if (L && L->getHeader() == BB) {
- // We can't insert an EndCF call into a loop header, because it will
- // get executed on every iteration of the loop, when it should be
- // executed only once before the loop.
- SmallVector <BasicBlock *, 8> Latches;
- L->getLoopLatches(Latches);
+ if (succ_empty(BB))
+ return false;
+ BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+ if (Term->getNumSuccessors() == 1) {
+ // The current BB's single successor is the top of the stack. We need to
+ // reconverge over that path.
+ BasicBlock *SingleSucc = *succ_begin(BB);
+ BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
+ if (isTopOfStack(SingleSucc)) {
+ Value *Exec = Stack.back().second;
+ IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
+ }
+ } else {
+ // We have a uniform conditional branch terminating the block.
+ // This block may be the last in the Then path of the enclosing divergent
+ // IF.
+ if (!isUniform(Term))
+ // Divergent loop is going to be further processed in another place
+ return false;
+
+ for (auto Succ : Term->successors()) {
+ if (isTopOfStack(Succ)) {
+ // Just split to make room for further WAVE_RECONVERGE insertion
SmallVector<BasicBlock *, 2> Preds;
- for (BasicBlock *Pred : predecessors(BB)) {
- if (!is_contained(Latches, Pred))
- Preds.push_back(Pred);
+ for (auto P : predecessors(Succ)) {
+ if (DT->dominates(BB, P))
+ Preds.push_back(P);
}
-
- BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
+ DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+ SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
false);
}
- Value *Exec = popSaved();
- BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
- if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
- Instruction *ExecDef = cast<Instruction>(Exec);
- BasicBlock *DefBB = ExecDef->getParent();
- if (!DT->dominates(DefBB, BB)) {
- // Split edge to make Def dominate Use
- FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
}
- IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
- // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
- // condition, for now just avoid copying these DebugLocs so that stepping
- // out of the then/else block in a debugger doesn't step to the condition.
- IRB.SetCurrentDebugLocation(DebugLoc());
- IRB.CreateCall(EndCf, {Exec});
}
return true;
@@ -341,14 +341,18 @@ bool SIAnnotateControlFlow::run(Function &F) {
if (!Term || Term->isUnconditional()) {
if (isTopOfStack(BB))
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
+ Changed |= tryWaveReconverge(BB);
continue;
}
if (I.nodeVisited(Term->getSuccessor(1))) {
if (isTopOfStack(BB))
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
+ // Let's take care of a uniform loop latch that may be closing the Then
+ // path of the enclosing divergent branch.
+ Changed |= tryWaveReconverge(BB);
if (DT->dominates(Term->getSuccessor(1), BB))
Changed |= handleLoop(Term);
@@ -363,9 +367,14 @@ bool SIAnnotateControlFlow::run(Function &F) {
continue;
}
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
}
+ if (isUniform(Term))
+ // Uniform conditional branch may be in the block that closes the Then
+ // path of the divergent conditional branch.
+ Changed |= tryWaveReconverge(BB);
+ else
Changed |= openIf(Term);
}
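As an illustration of the uniform-branch handling introduced above: when a block ends in a uniform conditional branch and one successor is the top-of-stack flow block of a divergent if, the pass now splits the dominated predecessor edges into a dedicated block (named with the ".reconverge" suffix) so there is a place for wave-mask restoration. A hand-written IR fragment of the resulting shape (block and value names are assumptions, not actual pass output):

; %then.end ends in a uniform branch whose taken target is the flow block of a divergent if
then.end:
  br i1 %uni.cc, label %Flow.reconverge, label %other
Flow.reconverge:                                   ; split off by SplitBlockPredecessors
  br label %Flow                                   ; room for a later WAVE_RECONVERGE insertion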
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81b52935ddf397..ae3b849a55ff2e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6475,7 +6475,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
return AMDGPUISD::ELSE;
case Intrinsic::amdgcn_loop:
return AMDGPUISD::LOOP;
- case Intrinsic::amdgcn_end_cf:
+ case Intrinsic::amdgcn_wave_reconverge:
llvm_unreachable("should not occur");
default:
return 0;
@@ -9848,9 +9848,10 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
return SDValue(Load, 0);
}
- case Intrinsic::amdgcn_end_cf:
- return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
- Op->getOperand(2), Chain), 0);
+ case Intrinsic::amdgcn_wave_reconverge:
+ return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+ MVT::Other, Op->getOperand(2), Chain),
+ 0);
case Intrinsic::amdgcn_s_barrier_init:
case Intrinsic::amdgcn_s_barrier_join:
case Intrinsic::amdgcn_s_wakeup_barrier: {
@@ -15693,6 +15694,28 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
}
}
+ // ISel inserts copies to registers for the successor PHIs
+ // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+ // branch.
+ for (auto &MBB : MF) {
+ for (auto &MI : MBB) {
+ if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+ MachineBasicBlock::iterator I(MI);
+ MachineBasicBlock::iterator Next = std::next(I);
+ bool NeedToMove = false;
+ while (Next != MBB.end() && !Next->isBranch()) {
+ NeedToMove = true;
+ Next++;
+ }
+ assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+ "Malformed CFG detected!\n");
+ if (NeedToMove) {
+ MBB.splice(Next, &MBB, &MI);
+ }
+ break;
+ }
+ }
+ }
// FIXME: This is a hack to fixup AGPR classes to use the properly aligned
// classes if required. Ideally the register class constraints would differ
// per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16451,7 +16474,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
default:
Result = false;
break;
- case Intrinsic::amdgcn_end_cf:
+ case Intrinsic::amdgcn_wave_reconverge:
case Intrinsic::amdgcn_loop:
Result = true;
break;
diff...
[truncated]
|
@llvm/pr-subscribers-llvm-ir Author: None (alex-t) ChangesBrief overview:As we mask the bits on the CF diverge, we have to set them back after executing the instructions in a conditional block. Let's consider the following example:
The root cause for the all mentioned troubles is an attempt to make the common spill/split logic aware of (and involved in) the sophisticated and target-specific details of the control flow implementation. We have 2 options to address this issue:
What was doneWe opted for the 2nd one. We made the EXEC manipulation at the divergence point conditional which allows us to change the EXEC for the conditional block but leave it unchanged along the fall-through path. Hence, we only need to restore EXEC at the end of the conditional block. Patch is 10.24 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108596.diff 329 Files Affected:
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee1..3ca766755a6319 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
// GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
// GFX90A-CAS-LABEL: _Z14atomic_add_casPf
// GFX90A-CAS: flat_atomic_cmpswap
-// GFX90A-CAS: s_cbranch_execnz
+// GFX90A-CAS: s_cbranch_scc1
__device__ float atomic_add_cas(float *p) {
return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
}
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index cd785331d1f146..e24a9110d596ca 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -272,9 +272,11 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
unsigned OpNo = I.getSuccessorIndex();
const Instruction *TI = Node->getTerminator();
BasicBlock *SuccBB = TI->getSuccessor(OpNo);
- auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
- double WeightPercent = ((double)BranchProb.getNumerator()) /
- ((double)BranchProb.getDenominator());
+ // auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
+ // double WeightPercent = ((double)BranchProb.getNumerator()) /
+ // ((double)BranchProb.getDenominator());
+ double WeightPercent = 0.5;
+
std::string TTAttr =
formatv("tooltip=\"{0} -> {1}\\nProbability {2:P}\" ", getBBName(Node),
getBBName(SuccBB), WeightPercent);
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 6efb17c55493a9..9fcda791fb4c72 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -879,8 +879,7 @@ class MachineBasicBlock
/// debug. This is the correct point to insert copies at the beginning of a
/// basic block. \p Reg is the register being used by a spill or defined for a
/// restore/split during register allocation.
- iterator SkipPHIsLabelsAndDebug(iterator I, Register Reg = Register(),
- bool SkipPseudoOp = true);
+ iterator SkipPHIsLabelsAndDebug(iterator I, bool SkipPseudoOp = true);
/// Returns an iterator to the first terminator instruction of this basic
/// block. If a terminator does not exist, it returns end().
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 49ce13dd8cbe39..984850980f4c9c 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -2058,8 +2058,7 @@ class TargetInstrInfo : public MCInstrInfo {
/// other instructions shall be inserted before it. This can be implemented
/// to prevent register allocator to insert spills for \p Reg before such
/// instructions.
- virtual bool isBasicBlockPrologue(const MachineInstr &MI,
- Register Reg = Register()) const {
+ virtual bool isBasicBlockPrologue(const MachineInstr &MI) const {
return false;
}
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..195ceb64eae4a8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3098,7 +3098,7 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
[llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
>;
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
[IntrWillReturn, IntrNoCallback, IntrNoFree]>;
// Represent unreachable in a divergent region.
diff --git a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
index 3bb9da5f1a37bb..184b493694894d 100644
--- a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
+++ b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
@@ -461,8 +461,7 @@ class StatepointState {
if (EHPad && !RC.hasReload(Reg, RegToSlotIdx[Reg], EHPad)) {
RC.recordReload(Reg, RegToSlotIdx[Reg], EHPad);
- auto EHPadInsertPoint =
- EHPad->SkipPHIsLabelsAndDebug(EHPad->begin(), Reg);
+ auto EHPadInsertPoint = EHPad->SkipPHIsLabelsAndDebug(EHPad->begin());
insertReloadBefore(Reg, EHPadInsertPoint, EHPad);
LLVM_DEBUG(dbgs() << "...also reload at EHPad "
<< printMBBReference(*EHPad) << "\n");
diff --git a/llvm/lib/CodeGen/InlineSpiller.cpp b/llvm/lib/CodeGen/InlineSpiller.cpp
index 81ae805d64e1ec..201d3a5df3a536 100644
--- a/llvm/lib/CodeGen/InlineSpiller.cpp
+++ b/llvm/lib/CodeGen/InlineSpiller.cpp
@@ -463,7 +463,7 @@ bool InlineSpiller::hoistSpillInsideBB(LiveInterval &SpillLI,
MachineBasicBlock *MBB = LIS.getMBBFromIndex(SrcVNI->def);
MachineBasicBlock::iterator MII;
if (SrcVNI->isPHIDef())
- MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin(), SrcReg);
+ MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin());
else {
MachineInstr *DefMI = LIS.getInstructionFromIndex(SrcVNI->def);
assert(DefMI && "Defining instruction disappeared");
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index 5d06af3ebf3360..419d7e0312ae08 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -223,13 +223,13 @@ MachineBasicBlock::SkipPHIsAndLabels(MachineBasicBlock::iterator I) {
MachineBasicBlock::iterator
MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I,
- Register Reg, bool SkipPseudoOp) {
+ bool SkipPseudoOp) {
const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
iterator E = end();
while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
(SkipPseudoOp && I->isPseudoProbe()) ||
- TII->isBasicBlockPrologue(*I, Reg)))
+ TII->isBasicBlockPrologue(*I)))
++I;
// FIXME: This needs to change if we wish to bundle labels / dbg_values
// inside the bundle.
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index b671e510387530..22991a0fb4cb1e 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -806,10 +806,8 @@ SlotIndex SplitEditor::leaveIntvAtTop(MachineBasicBlock &MBB) {
return Start;
}
- unsigned RegIdx = 0;
- Register Reg = LIS.getInterval(Edit->get(RegIdx)).reg();
- VNInfo *VNI = defFromParent(RegIdx, ParentVNI, Start, MBB,
- MBB.SkipPHIsLabelsAndDebug(MBB.begin(), Reg));
+ VNInfo *VNI = defFromParent(0, ParentVNI, Start, MBB,
+ MBB.SkipPHIsLabelsAndDebug(MBB.begin()));
RegAssign.insert(Start, VNI->def, OpenIdx);
LLVM_DEBUG(dump());
return VNI->def;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 3fcb364fc2c536..c0d2853d159882 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1551,11 +1551,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
return true;
}
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+ MachineInstr &MI) const {
// FIXME: Manually selecting to avoid dealing with the SReg_1 trick
// SelectionDAG uses for wave32 vs wave64.
MachineBasicBlock *BB = MI.getParent();
- BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+ BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
.add(MI.getOperand(1));
Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
MachineInstr &I) const {
Intrinsic::ID IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
switch (IntrinsicID) {
- case Intrinsic::amdgcn_end_cf:
- return selectEndCfIntrinsic(I);
+ case Intrinsic::amdgcn_wave_reconverge:
+ return selectWaveReconvergeIntrinsic(I);
case Intrinsic::amdgcn_ds_ordered_add:
case Intrinsic::amdgcn_ds_ordered_swap:
return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 068db5c1c14496..c3ba26590dfbcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -116,7 +116,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
bool selectReturnAddress(MachineInstr &I) const;
bool selectG_INTRINSIC(MachineInstr &I) const;
- bool selectEndCfIntrinsic(MachineInstr &MI) const;
+ bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 4737a322c255f4..1d2ee6a4c96514 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
const unsigned MovExecOpc =
Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
- const unsigned MovExecTermOpc =
- Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
const unsigned XorTermOpc = Subtarget.isWave32() ?
AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,27 +947,27 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
B.setInsertPt(*BodyBB, BodyBB->end());
+ Register LoopMask = MRI.createVirtualRegister(
+ TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
// Update EXEC, switch all done bits to 0 and all todo bits to 1.
B.buildInstr(XorTermOpc)
- .addDef(ExecReg)
- .addReg(ExecReg)
- .addReg(NewExec);
+ .addDef(LoopMask)
+ .addReg(ExecReg)
+ .addReg(NewExec);
// XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
// s_cbranch_scc0?
// Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
- B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+ B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+ .addReg(LoopMask)
+ .addReg(NewExec)
+ .addMBB(LoopBB);
// Save the EXEC mask before the loop.
BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
.addReg(ExecReg);
- // Restore the EXEC mask after the loop.
- B.setMBB(*RestoreExecBB);
- B.buildInstr(MovExecTermOpc)
- .addDef(ExecReg)
- .addReg(SaveExecReg);
// Set the insert point after the original instruction, so any new
// instructions will be in the remainder.
@@ -4967,7 +4965,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
break;
}
- case Intrinsic::amdgcn_end_cf: {
+ case Intrinsic::amdgcn_wave_reconverge: {
unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index edd881c84078c6..cd8cbcc7f689d4 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -14,6 +14,7 @@
#include "AMDGPU.h"
#include "AMDGPUTargetMachine.h"
#include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/UniformityAnalysis.h"
#include "llvm/CodeGen/TargetPassConfig.h"
@@ -55,7 +56,7 @@ class SIAnnotateControlFlow {
Function *Else;
Function *IfBreak;
Function *Loop;
- Function *EndCf;
+ Function *WaveReconverge;
DominatorTree *DT;
StackVector Stack;
@@ -88,7 +89,7 @@ class SIAnnotateControlFlow {
bool handleLoop(BranchInst *Term);
- bool closeControlFlow(BasicBlock *BB);
+ bool tryWaveReconverge(BasicBlock *BB);
public:
SIAnnotateControlFlow(Module &M, const GCNSubtarget &ST, DominatorTree &DT,
@@ -123,7 +124,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
{ IntMask });
Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
- EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+ WaveReconverge = Intrinsic::getDeclaration(
+ &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
}
/// Is the branch condition uniform or did the StructurizeCFG pass
@@ -185,8 +187,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
/// Open a new "If" block
bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
- if (isUniform(Term))
- return false;
IRBuilder<> IRB(Term);
Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -287,43 +287,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
}
/// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
- llvm::Loop *L = LI->getLoopFor(BB);
-
- assert(Stack.back().first == BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
- if (L && L->getHeader() == BB) {
- // We can't insert an EndCF call into a loop header, because it will
- // get executed on every iteration of the loop, when it should be
- // executed only once before the loop.
- SmallVector <BasicBlock *, 8> Latches;
- L->getLoopLatches(Latches);
+ if (succ_empty(BB))
+ return false;
+ BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+ if (Term->getNumSuccessors() == 1) {
+ // The current BBs single successor is a top of the stack. We need to
+ // reconverge over thaqt path.
+ BasicBlock *SingleSucc = *succ_begin(BB);
+ BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
+ if (isTopOfStack(SingleSucc)) {
+ Value *Exec = Stack.back().second;
+ IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
+ }
+ } else {
+ // We have a uniform conditional branch terminating the block.
+ // THis block may be the last in the Then path of the enclosing divergent
+ // IF.
+ if (!isUniform(Term))
+ // Divergent loop is going to be further processed in another place
+ return false;
+
+ for (auto Succ : Term->successors()) {
+ if (isTopOfStack(Succ)) {
+ // Just split to make a room for further WAVE_RECONVERGE insertion
SmallVector<BasicBlock *, 2> Preds;
- for (BasicBlock *Pred : predecessors(BB)) {
- if (!is_contained(Latches, Pred))
- Preds.push_back(Pred);
+ for (auto P : predecessors(Succ)) {
+ if (DT->dominates(BB, P))
+ Preds.push_back(P);
}
-
- BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
+ DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+ SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
false);
}
- Value *Exec = popSaved();
- BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
- if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
- Instruction *ExecDef = cast<Instruction>(Exec);
- BasicBlock *DefBB = ExecDef->getParent();
- if (!DT->dominates(DefBB, BB)) {
- // Split edge to make Def dominate Use
- FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
}
- IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
- // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
- // condition, for now just avoid copying these DebugLocs so that stepping
- // out of the then/else block in a debugger doesn't step to the condition.
- IRB.SetCurrentDebugLocation(DebugLoc());
- IRB.CreateCall(EndCf, {Exec});
}
return true;
@@ -341,14 +341,18 @@ bool SIAnnotateControlFlow::run(Function &F) {
if (!Term || Term->isUnconditional()) {
if (isTopOfStack(BB))
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
+ Changed |= tryWaveReconverge(BB);
continue;
}
if (I.nodeVisited(Term->getSuccessor(1))) {
if (isTopOfStack(BB))
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
+ // Let's take care of uniform loop latch that may be closing the Then
+ // path of the enclosing divergent branch.
+ Changed |= tryWaveReconverge(BB);
if (DT->dominates(Term->getSuccessor(1), BB))
Changed |= handleLoop(Term);
@@ -363,9 +367,14 @@ bool SIAnnotateControlFlow::run(Function &F) {
continue;
}
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
}
+ if (isUniform(Term))
+ // Uniform conditional branch may be in the block that closes the Then
+ // path of the divergent conditional branch.
+ Changed |= tryWaveReconverge(BB);
+ else
Changed |= openIf(Term);
}
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81b52935ddf397..ae3b849a55ff2e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6475,7 +6475,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
return AMDGPUISD::ELSE;
case Intrinsic::amdgcn_loop:
return AMDGPUISD::LOOP;
- case Intrinsic::amdgcn_end_cf:
+ case Intrinsic::amdgcn_wave_reconverge:
llvm_unreachable("should not occur");
default:
return 0;
@@ -9848,9 +9848,10 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
return SDValue(Load, 0);
}
- case Intrinsic::amdgcn_end_cf:
- return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
- Op->getOperand(2), Chain), 0);
+ case Intrinsic::amdgcn_wave_reconverge:
+ return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+ MVT::Other, Op->getOperand(2), Chain),
+ 0);
case Intrinsic::amdgcn_s_barrier_init:
case Intrinsic::amdgcn_s_barrier_join:
case Intrinsic::amdgcn_s_wakeup_barrier: {
@@ -15693,6 +15694,28 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
}
}
+ // ISel inserts copy to regs for the successor PHIs
+ // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+ // branch.
+ for (auto &MBB : MF) {
+ for (auto &MI : MBB) {
+ if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+ MachineBasicBlock::iterator I(MI);
+ MachineBasicBlock::iterator Next = std::next(I);
+ bool NeedToMove = false;
+ while (Next != MBB.end() && !Next->isBranch()) {
+ NeedToMove = true;
+ Next++;
+ }
+ assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+ "Malformed CFG detected!\n");
+ if (NeedToMove) {
+ MBB.splice(Next, &MBB, &MI);
+ }
+ break;
+ }
+ }
+ }
// FIXME: This is a hack to fixup AGPR classes to use the properly aligned
// classes if required. Ideally the register class constraints would differ
// per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16451,7 +16474,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
default:
Result = false;
break;
- case Intrinsic::amdgcn_end_cf:
+ case Intrinsic::amdgcn_wave_reconverge:
case Intrinsic::amdgcn_loop:
Result = true;
break;
diff...
[truncated]
|
@llvm/pr-subscribers-llvm-regalloc Author: None (alex-t) ChangesBrief overview:As we mask the bits on the CF diverge, we have to set them back after executing the instructions in a conditional block. Let's consider the following example:
The root cause for the all mentioned troubles is an attempt to make the common spill/split logic aware of (and involved in) the sophisticated and target-specific details of the control flow implementation. We have 2 options to address this issue:
What was doneWe opted for the 2nd one. We made the EXEC manipulation at the divergence point conditional which allows us to change the EXEC for the conditional block but leave it unchanged along the fall-through path. Hence, we only need to restore EXEC at the end of the conditional block. Patch is 10.24 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108596.diff 329 Files Affected:
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee1..3ca766755a6319 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
// GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
// GFX90A-CAS-LABEL: _Z14atomic_add_casPf
// GFX90A-CAS: flat_atomic_cmpswap
-// GFX90A-CAS: s_cbranch_execnz
+// GFX90A-CAS: s_cbranch_scc1
__device__ float atomic_add_cas(float *p) {
return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
}
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index cd785331d1f146..e24a9110d596ca 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -272,9 +272,11 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
unsigned OpNo = I.getSuccessorIndex();
const Instruction *TI = Node->getTerminator();
BasicBlock *SuccBB = TI->getSuccessor(OpNo);
- auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
- double WeightPercent = ((double)BranchProb.getNumerator()) /
- ((double)BranchProb.getDenominator());
+ // auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
+ // double WeightPercent = ((double)BranchProb.getNumerator()) /
+ // ((double)BranchProb.getDenominator());
+ double WeightPercent = 0.5;
+
std::string TTAttr =
formatv("tooltip=\"{0} -> {1}\\nProbability {2:P}\" ", getBBName(Node),
getBBName(SuccBB), WeightPercent);
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 6efb17c55493a9..9fcda791fb4c72 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -879,8 +879,7 @@ class MachineBasicBlock
/// debug. This is the correct point to insert copies at the beginning of a
/// basic block. \p Reg is the register being used by a spill or defined for a
/// restore/split during register allocation.
- iterator SkipPHIsLabelsAndDebug(iterator I, Register Reg = Register(),
- bool SkipPseudoOp = true);
+ iterator SkipPHIsLabelsAndDebug(iterator I, bool SkipPseudoOp = true);
/// Returns an iterator to the first terminator instruction of this basic
/// block. If a terminator does not exist, it returns end().
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 49ce13dd8cbe39..984850980f4c9c 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -2058,8 +2058,7 @@ class TargetInstrInfo : public MCInstrInfo {
/// other instructions shall be inserted before it. This can be implemented
/// to prevent register allocator to insert spills for \p Reg before such
/// instructions.
- virtual bool isBasicBlockPrologue(const MachineInstr &MI,
- Register Reg = Register()) const {
+ virtual bool isBasicBlockPrologue(const MachineInstr &MI) const {
return false;
}
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..195ceb64eae4a8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3098,7 +3098,7 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
[llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
>;
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
[IntrWillReturn, IntrNoCallback, IntrNoFree]>;
// Represent unreachable in a divergent region.
diff --git a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
index 3bb9da5f1a37bb..184b493694894d 100644
--- a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
+++ b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
@@ -461,8 +461,7 @@ class StatepointState {
if (EHPad && !RC.hasReload(Reg, RegToSlotIdx[Reg], EHPad)) {
RC.recordReload(Reg, RegToSlotIdx[Reg], EHPad);
- auto EHPadInsertPoint =
- EHPad->SkipPHIsLabelsAndDebug(EHPad->begin(), Reg);
+ auto EHPadInsertPoint = EHPad->SkipPHIsLabelsAndDebug(EHPad->begin());
insertReloadBefore(Reg, EHPadInsertPoint, EHPad);
LLVM_DEBUG(dbgs() << "...also reload at EHPad "
<< printMBBReference(*EHPad) << "\n");
diff --git a/llvm/lib/CodeGen/InlineSpiller.cpp b/llvm/lib/CodeGen/InlineSpiller.cpp
index 81ae805d64e1ec..201d3a5df3a536 100644
--- a/llvm/lib/CodeGen/InlineSpiller.cpp
+++ b/llvm/lib/CodeGen/InlineSpiller.cpp
@@ -463,7 +463,7 @@ bool InlineSpiller::hoistSpillInsideBB(LiveInterval &SpillLI,
MachineBasicBlock *MBB = LIS.getMBBFromIndex(SrcVNI->def);
MachineBasicBlock::iterator MII;
if (SrcVNI->isPHIDef())
- MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin(), SrcReg);
+ MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin());
else {
MachineInstr *DefMI = LIS.getInstructionFromIndex(SrcVNI->def);
assert(DefMI && "Defining instruction disappeared");
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index 5d06af3ebf3360..419d7e0312ae08 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -223,13 +223,13 @@ MachineBasicBlock::SkipPHIsAndLabels(MachineBasicBlock::iterator I) {
MachineBasicBlock::iterator
MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I,
- Register Reg, bool SkipPseudoOp) {
+ bool SkipPseudoOp) {
const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
iterator E = end();
while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
(SkipPseudoOp && I->isPseudoProbe()) ||
- TII->isBasicBlockPrologue(*I, Reg)))
+ TII->isBasicBlockPrologue(*I)))
++I;
// FIXME: This needs to change if we wish to bundle labels / dbg_values
// inside the bundle.
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index b671e510387530..22991a0fb4cb1e 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -806,10 +806,8 @@ SlotIndex SplitEditor::leaveIntvAtTop(MachineBasicBlock &MBB) {
return Start;
}
- unsigned RegIdx = 0;
- Register Reg = LIS.getInterval(Edit->get(RegIdx)).reg();
- VNInfo *VNI = defFromParent(RegIdx, ParentVNI, Start, MBB,
- MBB.SkipPHIsLabelsAndDebug(MBB.begin(), Reg));
+ VNInfo *VNI = defFromParent(0, ParentVNI, Start, MBB,
+ MBB.SkipPHIsLabelsAndDebug(MBB.begin()));
RegAssign.insert(Start, VNI->def, OpenIdx);
LLVM_DEBUG(dump());
return VNI->def;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 3fcb364fc2c536..c0d2853d159882 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1551,11 +1551,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
return true;
}
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+ MachineInstr &MI) const {
// FIXME: Manually selecting to avoid dealing with the SReg_1 trick
// SelectionDAG uses for wave32 vs wave64.
MachineBasicBlock *BB = MI.getParent();
- BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+ BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
.add(MI.getOperand(1));
Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
MachineInstr &I) const {
Intrinsic::ID IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
switch (IntrinsicID) {
- case Intrinsic::amdgcn_end_cf:
- return selectEndCfIntrinsic(I);
+ case Intrinsic::amdgcn_wave_reconverge:
+ return selectWaveReconvergeIntrinsic(I);
case Intrinsic::amdgcn_ds_ordered_add:
case Intrinsic::amdgcn_ds_ordered_swap:
return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 068db5c1c14496..c3ba26590dfbcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -116,7 +116,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
bool selectReturnAddress(MachineInstr &I) const;
bool selectG_INTRINSIC(MachineInstr &I) const;
- bool selectEndCfIntrinsic(MachineInstr &MI) const;
+ bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 4737a322c255f4..1d2ee6a4c96514 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
const unsigned MovExecOpc =
Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
- const unsigned MovExecTermOpc =
- Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
const unsigned XorTermOpc = Subtarget.isWave32() ?
AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,27 +947,27 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
B.setInsertPt(*BodyBB, BodyBB->end());
+ Register LoopMask = MRI.createVirtualRegister(
+ TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
// Update EXEC, switch all done bits to 0 and all todo bits to 1.
B.buildInstr(XorTermOpc)
- .addDef(ExecReg)
- .addReg(ExecReg)
- .addReg(NewExec);
+ .addDef(LoopMask)
+ .addReg(ExecReg)
+ .addReg(NewExec);
// XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
// s_cbranch_scc0?
// Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
- B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+ B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+ .addReg(LoopMask)
+ .addReg(NewExec)
+ .addMBB(LoopBB);
// Save the EXEC mask before the loop.
BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
.addReg(ExecReg);
- // Restore the EXEC mask after the loop.
- B.setMBB(*RestoreExecBB);
- B.buildInstr(MovExecTermOpc)
- .addDef(ExecReg)
- .addReg(SaveExecReg);
// Set the insert point after the original instruction, so any new
// instructions will be in the remainder.
@@ -4967,7 +4965,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
break;
}
- case Intrinsic::amdgcn_end_cf: {
+ case Intrinsic::amdgcn_wave_reconverge: {
unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index edd881c84078c6..cd8cbcc7f689d4 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -14,6 +14,7 @@
#include "AMDGPU.h"
#include "AMDGPUTargetMachine.h"
#include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/UniformityAnalysis.h"
#include "llvm/CodeGen/TargetPassConfig.h"
@@ -55,7 +56,7 @@ class SIAnnotateControlFlow {
Function *Else;
Function *IfBreak;
Function *Loop;
- Function *EndCf;
+ Function *WaveReconverge;
DominatorTree *DT;
StackVector Stack;
@@ -88,7 +89,7 @@ class SIAnnotateControlFlow {
bool handleLoop(BranchInst *Term);
- bool closeControlFlow(BasicBlock *BB);
+ bool tryWaveReconverge(BasicBlock *BB);
public:
SIAnnotateControlFlow(Module &M, const GCNSubtarget &ST, DominatorTree &DT,
@@ -123,7 +124,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
{ IntMask });
Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
- EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+ WaveReconverge = Intrinsic::getDeclaration(
+ &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
}
/// Is the branch condition uniform or did the StructurizeCFG pass
@@ -185,8 +187,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
/// Open a new "If" block
bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
- if (isUniform(Term))
- return false;
IRBuilder<> IRB(Term);
Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -287,43 +287,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
}
/// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
- llvm::Loop *L = LI->getLoopFor(BB);
-
- assert(Stack.back().first == BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
- if (L && L->getHeader() == BB) {
- // We can't insert an EndCF call into a loop header, because it will
- // get executed on every iteration of the loop, when it should be
- // executed only once before the loop.
- SmallVector <BasicBlock *, 8> Latches;
- L->getLoopLatches(Latches);
+ if (succ_empty(BB))
+ return false;
+ BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+ if (Term->getNumSuccessors() == 1) {
+ // The current BBs single successor is a top of the stack. We need to
+ // reconverge over thaqt path.
+ BasicBlock *SingleSucc = *succ_begin(BB);
+ BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
+ if (isTopOfStack(SingleSucc)) {
+ Value *Exec = Stack.back().second;
+ IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
+ }
+ } else {
+ // We have a uniform conditional branch terminating the block.
+ // THis block may be the last in the Then path of the enclosing divergent
+ // IF.
+ if (!isUniform(Term))
+ // Divergent loop is going to be further processed in another place
+ return false;
+
+ for (auto Succ : Term->successors()) {
+ if (isTopOfStack(Succ)) {
+ // Just split to make a room for further WAVE_RECONVERGE insertion
SmallVector<BasicBlock *, 2> Preds;
- for (BasicBlock *Pred : predecessors(BB)) {
- if (!is_contained(Latches, Pred))
- Preds.push_back(Pred);
+ for (auto P : predecessors(Succ)) {
+ if (DT->dominates(BB, P))
+ Preds.push_back(P);
}
-
- BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
+ DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+ SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
false);
}
- Value *Exec = popSaved();
- BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
- if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
- Instruction *ExecDef = cast<Instruction>(Exec);
- BasicBlock *DefBB = ExecDef->getParent();
- if (!DT->dominates(DefBB, BB)) {
- // Split edge to make Def dominate Use
- FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
}
- IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
- // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
- // condition, for now just avoid copying these DebugLocs so that stepping
- // out of the then/else block in a debugger doesn't step to the condition.
- IRB.SetCurrentDebugLocation(DebugLoc());
- IRB.CreateCall(EndCf, {Exec});
}
return true;
@@ -341,14 +341,18 @@ bool SIAnnotateControlFlow::run(Function &F) {
if (!Term || Term->isUnconditional()) {
if (isTopOfStack(BB))
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
+ Changed |= tryWaveReconverge(BB);
continue;
}
if (I.nodeVisited(Term->getSuccessor(1))) {
if (isTopOfStack(BB))
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
+ // Let's take care of uniform loop latch that may be closing the Then
+ // path of the enclosing divergent branch.
+ Changed |= tryWaveReconverge(BB);
if (DT->dominates(Term->getSuccessor(1), BB))
Changed |= handleLoop(Term);
@@ -363,9 +367,14 @@ bool SIAnnotateControlFlow::run(Function &F) {
continue;
}
- Changed |= closeControlFlow(BB);
+ Stack.pop_back();
}
+ if (isUniform(Term))
+ // Uniform conditional branch may be in the block that closes the Then
+ // path of the divergent conditional branch.
+ Changed |= tryWaveReconverge(BB);
+ else
Changed |= openIf(Term);
}
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81b52935ddf397..ae3b849a55ff2e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6475,7 +6475,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
return AMDGPUISD::ELSE;
case Intrinsic::amdgcn_loop:
return AMDGPUISD::LOOP;
- case Intrinsic::amdgcn_end_cf:
+ case Intrinsic::amdgcn_wave_reconverge:
llvm_unreachable("should not occur");
default:
return 0;
@@ -9848,9 +9848,10 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
return SDValue(Load, 0);
}
- case Intrinsic::amdgcn_end_cf:
- return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
- Op->getOperand(2), Chain), 0);
+ case Intrinsic::amdgcn_wave_reconverge:
+ return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+ MVT::Other, Op->getOperand(2), Chain),
+ 0);
case Intrinsic::amdgcn_s_barrier_init:
case Intrinsic::amdgcn_s_barrier_join:
case Intrinsic::amdgcn_s_wakeup_barrier: {
@@ -15693,6 +15694,28 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
}
}
+ // ISel inserts copy to regs for the successor PHIs
+ // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+ // branch.
+ for (auto &MBB : MF) {
+ for (auto &MI : MBB) {
+ if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+ MachineBasicBlock::iterator I(MI);
+ MachineBasicBlock::iterator Next = std::next(I);
+ bool NeedToMove = false;
+ while (Next != MBB.end() && !Next->isBranch()) {
+ NeedToMove = true;
+ Next++;
+ }
+ assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+ "Malformed CFG detected!\n");
+ if (NeedToMove) {
+ MBB.splice(Next, &MBB, &MI);
+ }
+ break;
+ }
+ }
+ }
// FIXME: This is a hack to fixup AGPR classes to use the properly aligned
// classes if required. Ideally the register class constraints would differ
// per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16451,7 +16474,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
default:
Result = false;
break;
- case Intrinsic::amdgcn_end_cf:
+ case Intrinsic::amdgcn_wave_reconverge:
case Intrinsic::amdgcn_loop:
Result = true;
break;
diff...
[truncated]
@@ -196,8 +208,10 @@ define amdgpu_kernel void @add_i32_constant(ptr addrspace(1) %out, ptr addrspace
; GFX11W32-NEXT: v_mbcnt_lo_u32_b32 v0, s1, 0
; GFX11W32-NEXT: ; implicit-def: $vgpr1
; GFX11W32-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11W32-NEXT: v_cmpx_eq_u32_e32 0, v0
; GFX11W32-NEXT: s_cbranch_execz .LBB0_2
; GFX11W32-NEXT: v_cmp_eq_u32_e32 vcc_lo, 0, v0 |
The codegen seems significantly worse for simple conditional branches like this, especially on GFX10.3+ where we try to use v_cmpx. Note that there is special hardware to handle v_cmpx followed by a VALU instruction without the kind of pipeline stall that you would normally get when you write to EXEC.
To improve matters a bit, could we aim for codegen like this?
v_cmp_eq_u32_e32 vcc_lo, 0, v0
s_cbranch_vccz .LBB0_2
s_mov_b32 exec_lo, vcc_lo
This saves one instruction overall and moves the exec modification into the body of the "if".
That doesn't actually help much, because now we have an additional stall before the next VALU instruction.
I agree with Jay that this change isn't acceptable from a codegen quality point of view.
That does not seem logical. The restores themselves only read EXEC, they do not write it. So why should they be part of the prolog? It seems the whole problem could be avoided by not considering VGPR restores to be part of the block prolog?
Let me describe the complete case. We have register allocation split in 2 steps - the first run allocates SGPRs and the second one takes care of VGPRs. It might happen that during the SGPR allocation we had to spill the register containing the EXEC mask value to memory. Hence, we have to load it back at the beginning of the flow block BEFORE the OR-ing with the current EXEC, as it is a live-in value that is an operand for the OR. As soon as we insert the reload, the prologue sequence is broken, and during the VGPR allocation any spill/reload will be placed before the point at which the exec mask is restored. The issue was addressed here: #69924 (also here https://reviews.llvm.org/D145329). Although, revisiting this now, I still don't understand why they decided to include ALL spill opcodes in the prologue rather than only the SGPR spills. At first glance, changing the isSpill(opcode) check to isSGPRSpill(opcode) in the snippet below would solve the initial case. I need to look at this a bit more. I am sure they would have done this if such a simple change had solved the problem.
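For illustration, a rough sketch of the kind of change meant, assuming the check lives in SIInstrInfo::isBasicBlockPrologue (the surrounding conditions are approximated here, not the exact in-tree snippet, which is not quoted in this thread):
// Sketch: report SGPR spills (and EXEC writes) as block-prologue instructions,
// but let ordinary VGPR reloads, which merely read EXEC, be placed after the
// prologue instead of being pulled above the split insertion point.
bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
                                       Register Reg) const {
  uint16_t Opcode = MI.getOpcode();
  return isSGPRSpill(Opcode) || // was: isSpill(Opcode)
         (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
          MI.modifiesRegister(AMDGPU::EXEC, &RI));
}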
Honestly, I don't like the way this change affects the code generation quality. I would never have tried to propose it if we had another way to achieve correctness. Currently, we have a compiler that either produces incorrect code or fails to compile a valid input.
Thank you for going into a bit more detail. It does seem like distinguishing between SGPR reloads and VGPR reloads would help. Hmm, what about a case where we run out of VGPRs to spill SGPRs into? What do we actually do in that case today?
Not sure if I understand your concern. We allocate SGPRs in a separate pass before VGPRs, so we have the full set of VGPRs on a given target available to spill SGPRs into. I can hardly imagine a kernel that requires 256x32 SGPRs.
Somebody recently mentioned a test case that ended up producing hundreds of MBs or even GBs of code :) Maybe the answer is that we don't support it and just report a compiler error. That may be a reasonable thing to do, but in any case having certainty about it would help the discussion here.
I like this change - it fixes the problem I reported in #109294.
With one trick mentioned inline, we can further reduce one instruction. In most cases, we would have to introduce one scalar instruction in the if-block and one SALU in the flow-block compared with the existing solution.
Block prologue is an unclear concept and not well implemented, and we don't know whether there are any gaps in the LLVM core framework that keep the prologue from working correctly.
We first introduced the block prologue to get the insertion point past the s_or_bnn exec for VALU instructions. For SALU, we still allow them to be inserted into the prologue because the s_or_bnn exec needs an input SGPR. For threaded VGPRs, we should always put them after the prologue. But as we may also reload SGPRs from WWM VGPRs, I think we should also put WWM-VGPR reloads in the block prologue.
Another issue with the implementation is that we should only count the instructions before the s_or_bnn exec as the prologue. Like in the case below, the last prologue instruction should be the S_OR_B64 exec.
$exec = S_OR_B64 $exec, killed renamable $sgpr48_sgpr49, implicit-def $scc
%129034:vgpr_32 = SI_SPILL_V32_RESTORE %stack.265, $sgpr32, 0, implicit $exec
%129055:vgpr_32 = SI_SPILL_V32_RESTORE %stack.266, $sgpr32, 0, implicit $exec
%129083:vgpr_32 = SI_SPILL_V32_RESTORE %stack.267, $sgpr32, 0, implicit $exec
%129635:vgpr_32 = SI_SPILL_V32_RESTORE %stack.282, $sgpr32, 0, implicit $exec
Although tuning the isSpill() check helps for this case, we still possibly have other issues related to this. I think it would be better if we fixed this, but I am not sure it can be done easily.
Meanwhile, I think it would definitely be helpful to figure out whether LiveRangeSplit can work with the block prologue properly, at least for the known cases.
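As an illustration of that boundary (an assumed helper, not existing code), the prologue test could simply stop at the first write to EXEC in the block:
// Hypothetical helper: an instruction counts as prologue only if no earlier
// instruction in the block has already restored EXEC (the s_or_bnn exec).
static bool isBeforeExecRestore(const MachineBasicBlock &MBB,
                                const MachineInstr &MI,
                                const SIRegisterInfo &TRI) {
  for (const MachineInstr &Cur : MBB) {
    if (&Cur == &MI)
      return true;  // reached MI without passing an EXEC write; still prologue
    if (Cur.modifiesRegister(AMDGPU::EXEC, &TRI))
      return false; // past the exec restore: no longer in the prologue
  }
  return false;
}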
; GFX11W64-NEXT: s_cmp_lg_u64 vcc, 0
; GFX11W64-NEXT: s_cmov_b64 exec, vcc
; GFX11W64-NEXT: s_cbranch_scc0 .LBB1_2
We can merge the s_cmp_lg for updating SCC and the exec save into one s_and_saveexec
s_and_saveexec_b64 s[0:1], vcc
s_cselect_b64 exec, vcc, s[0:1]
s_cbranch_scc0 .LBB1_2
so we don't need the separate s_mov_b64 s[0:1], exec at line 472 to save the exec.
No, we cannot. S_AND_SAVEEXEC changes the EXEC unconditionally.
The idea is that we only change the exec if we are going to execute "Then" block but leave it unchanged for Flow block. All further lowering is based on this assumption.
; GFX12W32-NEXT: s_xor_b32 s1, vcc_lo, exec_lo
; GFX12W32-NEXT: s_cmp_lg_u32 vcc_lo, 0
; GFX12W32-NEXT: s_cmov_b32 exec_lo, vcc_lo
; GFX12W32-NEXT: s_cbranch_scc0 .LBB2_4
This can also be simplified into:
s_and_saveexec_b32 s1, vcc_lo
s_cselect_b32 exec_lo, vcc_lo, s1
s_cbranch_scc0 .LBB2_4
Won't work for the same reason as above :(
; GFX10-W32-NEXT: s_xor_b32 s13, s14, exec_lo
; GFX10-W32-NEXT: s_cmp_lg_u32 s14, 0
; GFX10-W32-NEXT: s_cmov_b32 exec_lo, s14 |
We are safe to use (no kill in the else region):
s_and_saveexec s13, s14
s_cselect s14, s13
s_cbranch_scc0
It fixes the problem but requires us to consider WWM reloads as prologue instructions. Any VGPR reload starts a new live range and possibly introduces interference. CD's reply here (#111496 (comment)) made me hope that it would work. However, I would like to see the corresponding MIR test as proof.
This PR is a follow-up to #92809
Brief overview:
As we mask the bits when the CF diverges, we have to set them back after executing the instructions in the conditional block.
This currently happens at the beginning of the block where the CF converges - i.e. in the immediate post-dominator of the block where the CF diverged. We have to ensure that none of the instructions that read the EXEC mask register are inserted before the masked EXEC bits are restored to the state they had before the CF divergence. For that, we tuned the TargetInstrInfo::isBasicBlockPrologue method to report any instruction that writes EXEC as belonging to the prologue. Then it appeared that instructions loading values that were spilled to memory and are used in the current block must be placed at the block beginning: before their uses, but after the point where the EXEC mask is restored, since they load VGPRs and, hence, read EXEC. Hence, we had to consider all spilling opcodes to belong to the block prologue.
This solution worked well until we faced a problem with live interval splitting. To split an LI across a given physical register we need to insert a copy of the virtual register before the point where the current interval interferes with another one already assigned to the same physical register. If the LI being split is live-in to the block, we need to put the copy at the beginning of the block. SplitKit queries isBasicBlockPrologue for the proper insertion point.
Let's consider the following example:
$exec = S_OR_B64 $exec, killed renamable $sgpr48_sgpr49, implicit-def $scc
%129034:vgpr_32 = SI_SPILL_V32_RESTORE %stack.265, $sgpr32, 0, implicit $exec
%129055:vgpr_32 = SI_SPILL_V32_RESTORE %stack.266, $sgpr32, 0, implicit $exec
%129083:vgpr_32 = SI_SPILL_V32_RESTORE %stack.267, $sgpr32, 0, implicit $exec
%129635:vgpr_32 = SI_SPILL_V32_RESTORE %stack.282, $sgpr32, 0, implicit $exec
<-- another LI assigned to the same physreg starts here
%129657:vgpr_32 = COPY %14037:vgpr_32 <-- insertion point chosen after the "prologue"
We are to split the LI for the virtual register %14037. The COPY position was chosen after all the reloads, as they are considered to belong to the block prologue. Any of them could start a live interval that might have already been assigned to the same physical register across which we are aiming to split. In this case, we hit an assert in SplitKit.cpp reporting that the insertion point returned by the target will cause interference.
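To make the failure mode concrete, this is roughly how the insertion point ends up past the reloads (a simplified sketch of the SplitKit-style query; the helper name is made up and this is not the actual code):
// Every instruction the target reports as block prologue pushes the split
// COPY further down the block, past the VGPR reloads shown above. If one of
// those reloads already starts an interval assigned to the same physreg,
// SplitKit's interference assert fires at the returned point.
static MachineBasicBlock::iterator
firstNonPrologueInsertPt(MachineBasicBlock &MBB, const TargetInstrInfo &TII,
                         Register SplitReg) {
  MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
  while (I != E && TII.isBasicBlockPrologue(*I, SplitReg))
    ++I;
  return I; // the COPY for the split virtual register lands here
}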
The root cause of all the mentioned troubles is an attempt to make the common spill/split logic aware of (and involved in) the sophisticated and target-specific details of the control flow implementation.
We have 2 options to address this issue:
What was done
We opted for the 2nd one. We made the EXEC manipulation at the divergence point conditional, which allows us to change EXEC for the conditional block but leave it unchanged along the fall-through path. Hence, we only need to restore EXEC at the end of the conditional block.
Lowering example
Simple "if" MIR
simple_if.pdf
Lowered MIR
simple_if_lowered.pdf
Please note that we use a conditional move to update EXEC, so we only update it if the condition is "true" and leave it unchanged otherwise. Also, the SI_WAVE_RECONVERGE pseudo is inserted along the "true" path of the conditional branch, at the end of the predecessor block.