5 changes: 5 additions & 0 deletions llvm/include/llvm/CodeGen/MachineRegisterInfo.h
@@ -617,6 +617,11 @@ class MachineRegisterInfo {
/// form, so there should only be one definition.
LLVM_ABI MachineInstr *getVRegDef(Register Reg) const;

/// getDomVRegDefInBasicBlock - Return the last machine instr that defines
/// the specified virtual register in the basic block, searching backwards
/// from instruction I (exclusive). Returns MBB.end() if no definition is
/// found.
LLVM_ABI MachineBasicBlock::iterator
getDomVRegDefInBasicBlock(Register Reg, MachineBasicBlock &MBB,
                          MachineBasicBlock::iterator I) const;

/// getUniqueVRegDef - Return the unique machine instr that defines the
/// specified virtual register or null if none is found. If there are
/// multiple definitions or no definition, return null.
14 changes: 14 additions & 0 deletions llvm/lib/CodeGen/MachineRegisterInfo.cpp
@@ -674,3 +674,17 @@ bool MachineRegisterInfo::isReservedRegUnit(MCRegUnit Unit) const {
}
return false;
}

/// getDomVRegDefInBasicBlock - Return the last machine instr that defines
/// the specified virtual register in the basic block, searching backwards
/// from instruction I (exclusive). Returns MBB.end() if no definition is
/// found.
MachineBasicBlock::iterator MachineRegisterInfo::getDomVRegDefInBasicBlock(
    Register Reg, MachineBasicBlock &MBB, MachineBasicBlock::iterator I) const {
  if (I == MBB.begin())
    return MBB.end();
  // Iterate backwards from I (exclusive) to the beginning of the basic block.
  do {
    --I;
    if (I->modifiesRegister(Reg, getTargetRegisterInfo()))
      return I;
  } while (I != MBB.begin());
  return MBB.end();
}
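For orientation, a minimal sketch of how a caller can follow a same-block COPY chain with the new helper, mirroring its use in GCNLaneMaskUtils below (Reg, MBB, and UseMI are assumed to come from the surrounding pass; this is illustrative, not part of the patch):

// Resolve a chain of same-block COPYs backwards from UseMI.
// MBB.end() signals that Reg has no dominating def inside MBB.
MachineBasicBlock::iterator It = MRI.getDomVRegDefInBasicBlock(Reg, MBB, UseMI);
while (It != MBB.end() && It->getOpcode() == AMDGPU::COPY) {
  Reg = It->getOperand(1).getReg();
  It = MRI.getDomVRegDefInBasicBlock(Reg, MBB, It);
}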
43 changes: 24 additions & 19 deletions llvm/lib/Target/AMDGPU/AMDGPUWaveTransform.cpp
@@ -1839,7 +1839,7 @@ void ControlFlowRewriter::rewrite() {
Opcode = AMDGPU::S_CBRANCH_SCC1;
} else {
Register CondReg = Info.OrigCondition;
if (!LMA.isSubsetOfExec(CondReg, *Node->Block)) {
if (!LMA.isSubsetOfExec(CondReg, *Node->Block, Node->Block->end())) {
CondReg = LMU.createLaneMaskReg();
BuildMI(*Node->Block, Node->Block->end(), {}, TII.get(LMC.AndOpc),
CondReg)
@@ -1937,7 +1937,7 @@ void ControlFlowRewriter::rewrite() {
}
} else {
CondReg = LaneOrigin.CondReg;
if (!LMA.isSubsetOfExec(LaneOrigin.CondReg, *LaneOrigin.Node->Block)) {
if (!LMA.isSubsetOfExec(LaneOrigin.CondReg, *LaneOrigin.Node->Block,
                        LaneOrigin.Node->Block->getFirstTerminator())) {
Register Prev = CondReg;
CondReg = LMU.createLaneMaskReg();
BuildMI(*LaneOrigin.Node->Block,
@@ -2033,28 +2033,33 @@ void ControlFlowRewriter::rewrite() {
CFGNodeInfo &PredInfo = NodeInfo.find(Pred)->second;
Register PrimaryExec = PredInfo.PrimarySuccessorExec;

MachineInstr *PrimaryExecDef;
for (;;) {
PrimaryExecDef = MRI.getVRegDef(PrimaryExec);
if (PrimaryExecDef->getOpcode() != AMDGPU::COPY)
break;
PrimaryExec = PrimaryExecDef->getOperand(1).getReg();
}
// Turning off this copy-chain optimization to retain the Accumulator as
// the PrimaryExec.

// MachineInstr *PrimaryExecDef;

Reviewer: Code commented out won't look good; better to clean it all up. What is the significance of adding the above comment here? Are you planning to implement a similar optimization for the ACC-based non-SSA form? If so, leave a strong note mentioning that (and still clean up the commented code). Otherwise, remove the comment as well.

// for (;;) {
// PrimaryExecDef = MRI.getVRegDef(PrimaryExec);
// if (PrimaryExecDef->getOpcode() != AMDGPU::COPY)
// break;
// PrimaryExec = PrimaryExecDef->getOperand(1).getReg();
// }

// Rejoin = EXEC ^ PrimaryExec
//
// Fold immediately if PrimaryExec was obtained via XOR as well.
Register Rejoin;

if (PrimaryExecDef->getParent() == Pred->Block &&
PrimaryExecDef->getOpcode() == LMC.XorOpc &&
PrimaryExecDef->getOperand(1).isReg() &&
PrimaryExecDef->getOperand(2).isReg()) {
if (PrimaryExecDef->getOperand(1).getReg() == LMC.ExecReg)
Rejoin = PrimaryExecDef->getOperand(2).getReg();
else if (PrimaryExecDef->getOperand(2).getReg() == LMC.ExecReg)
Rejoin = PrimaryExecDef->getOperand(1).getReg();
}
// Turning off this XOR optimization since buildMergeLaneMasks() will not
// introduce an XOR instruction when creating the PrimaryExec.

// if (PrimaryExecDef->getParent() == Pred->Block &&
// PrimaryExecDef->getOpcode() == LMC.XorOpc &&
// PrimaryExecDef->getOperand(1).isReg() &&
// PrimaryExecDef->getOperand(2).isReg()) {
// if (PrimaryExecDef->getOperand(1).getReg() == LMC.ExecReg)
// Rejoin = PrimaryExecDef->getOperand(2).getReg();
// else if (PrimaryExecDef->getOperand(2).getReg() == LMC.ExecReg)
// Rejoin = PrimaryExecDef->getOperand(1).getReg();
// }

if (!Rejoin) {
// Try to find a previously generated XOR (or merely masked) value
@@ -2091,7 +2096,7 @@ void ControlFlowRewriter::rewrite() {

LLVM_DEBUG(Function.dump());
}

Updater.insertAccumulatorResets();
Updater.cleanup();
}

113 changes: 66 additions & 47 deletions llvm/lib/Target/AMDGPU/GCNLaneMaskUtils.cpp
@@ -31,13 +31,12 @@ bool GCNLaneMaskUtils::maybeLaneMask(Register Reg) const {

/// Determine whether the lane-mask register \p Reg is a wave-wide constant.
/// If so, the value is stored in \p Val.
bool GCNLaneMaskUtils::isConstantLaneMask(Register Reg, bool &Val) const {
bool GCNLaneMaskUtils::isConstantLaneMask(
    Register Reg, bool &Val, MachineBasicBlock &MBB,
    MachineBasicBlock::iterator MI) const {
MachineRegisterInfo &MRI = MF.getRegInfo();

const MachineInstr *MI;
for (;;) {
MI = MRI.getVRegDef(Reg);
if (!MI) {
MI = MRI.getDomVRegDefInBasicBlock(Reg, MBB, MI);
if (MI == MBB.end()) {
// This can happen when called from GCNLaneMaskUpdater, where Reg can
// be a placeholder that has not yet been filled in.
return false;
@@ -100,18 +99,20 @@ Register GCNLaneMaskUtils::createLaneMaskReg() const {
/// properly masked, i.e. use PrevReg directly instead of
/// (PrevReg & ~EXEC), and don't add extra 1-bits to DstReg
/// beyond (CurReg & EXEC).
/// \param isPrevZeroReg Indicates that PrevReg is a zero register.
void GCNLaneMaskUtils::buildMergeLaneMasks(MachineBasicBlock &MBB,
MachineBasicBlock::iterator I,
const DebugLoc &DL, Register DstReg,
Register PrevReg, Register CurReg,
GCNLaneMaskAnalysis *LMA,
bool accumulating) const {
bool accumulating,
bool isPrevZeroReg) const {
const GCNSubtarget &ST = MF.getSubtarget<GCNSubtarget>();
const SIInstrInfo *TII = ST.getInstrInfo();
bool PrevVal = false;
bool PrevConstant = !PrevReg || isConstantLaneMask(PrevReg, PrevVal);
bool PrevConstant = !PrevReg || isPrevZeroReg;
bool CurVal = false;
bool CurConstant = isConstantLaneMask(CurReg, CurVal);
bool CurConstant = isConstantLaneMask(CurReg, CurVal, MBB, I);

assert(PrevReg || !accumulating);

@@ -147,7 +148,7 @@ void GCNLaneMaskUtils::buildMergeLaneMasks(MachineBasicBlock &MBB,
}
if (!CurConstant) {
if ((PrevConstant && PrevVal) ||
(LMA && LMA->isSubsetOfExec(CurReg, MBB))) {
(LMA && LMA->isSubsetOfExec(CurReg, MBB, I))) {
CurMaskedReg = CurReg;
} else {
CurMaskedReg = createLaneMaskReg();
@@ -188,22 +189,26 @@ void GCNLaneMaskUtils::buildMergeLaneMasks(MachineBasicBlock &MBB,
/// (Reg & EXEC) == Reg when used in \p UseBlock.
bool GCNLaneMaskAnalysis::isSubsetOfExec(Register Reg,
MachineBasicBlock &UseBlock,
MachineBasicBlock::iterator I,
unsigned RemainingDepth) {
MachineRegisterInfo &MRI = LMU.function()->getRegInfo();
MachineInstr *DefInstr = nullptr;
MachineBasicBlock::iterator DefInstr = UseBlock.end();
const AMDGPU::LaneMaskConstants &LMC = LMU.getLaneMaskConsts();

for (;;) {
if (!Register::isVirtualRegister(Reg)) {
if (Reg == LMC.ExecReg &&
(!DefInstr || DefInstr->getParent() == &UseBlock))
(DefInstr == UseBlock.end() || DefInstr->getParent() == &UseBlock))
return true;
return false;
}

DefInstr = MRI.getVRegDef(Reg);
DefInstr = MRI.getDomVRegDefInBasicBlock(Reg, UseBlock, I);
if (DefInstr == UseBlock.end())
  return false;
if (DefInstr->getOpcode() == AMDGPU::COPY) {
Reg = DefInstr->getOperand(1).getReg();
I = DefInstr;
continue;
}

@@ -242,7 +247,7 @@ bool GCNLaneMaskAnalysis::isSubsetOfExec(Register Reg,
if ((LikeOr || IsAnd || IsAndN2) &&
(DefInstr->getOperand(1).isReg() && DefInstr->getOperand(2).isReg())) {
bool FirstIsSubset = isSubsetOfExec(DefInstr->getOperand(1).getReg(),
UseBlock, RemainingDepth);
UseBlock, DefInstr, RemainingDepth);
if (!FirstIsSubset && (LikeOr || IsAndN2))
return SubsetOfExec.try_emplace(Reg, false).first->second;

Expand All @@ -252,7 +257,7 @@ bool GCNLaneMaskAnalysis::isSubsetOfExec(Register Reg,
}

bool SecondIsSubset = isSubsetOfExec(DefInstr->getOperand(2).getReg(),
UseBlock, RemainingDepth);
UseBlock, DefInstr, RemainingDepth);
if (!SecondIsSubset)
return SubsetOfExec.try_emplace(Reg, false).first->second;

Expand All @@ -268,14 +273,14 @@ void GCNLaneMaskUpdater::init(Register Reg) {
Processed = false;
Blocks.clear();
// SSAUpdater.Initialize(LMU.getLaneMaskConsts().LaneMaskRC);
SSAUpdater.Initialize(Reg);
Accumulator = {};
}

/// Optional cleanup, may remove stray instructions.
void GCNLaneMaskUpdater::cleanup() {
Processed = false;
Blocks.clear();

Accumulator = {};
MachineRegisterInfo &MRI = LMU.function()->getRegInfo();

if (ZeroReg && MRI.use_empty(ZeroReg)) {
@@ -330,7 +335,7 @@ void GCNLaneMaskUpdater::addAvailable(MachineBasicBlock &Block,
Register GCNLaneMaskUpdater::getValueInMiddleOfBlock(MachineBasicBlock &Block) {
if (!Processed)
process();
return SSAUpdater.GetValueInMiddleOfBlock(&Block);
return Accumulator;
}

/// Return the value at the end of the given block, i.e. after any change that
@@ -342,7 +347,7 @@ Register GCNLaneMaskUpdater::getValueInMiddleOfBlock(MachineBasicBlock &Block) {
Register GCNLaneMaskUpdater::getValueAtEndOfBlock(MachineBasicBlock &Block) {
if (!Processed)
process();
return SSAUpdater.GetValueAtEndOfBlock(&Block);
return Accumulator;
}

/// Return the value in \p Block after the value merge (if any).
@@ -352,15 +357,15 @@ Register GCNLaneMaskUpdater::getValueAfterMerge(MachineBasicBlock &Block) {

auto BlockIt = findBlockInfo(Block);
if (BlockIt != Blocks.end()) {
if (BlockIt->Merged)
return BlockIt->Merged;
if (BlockIt->Value)
return Accumulator;
if (BlockIt->Flags & ResetInMiddle)
return ZeroReg;
}

// We didn't merge anything in the block, but the block may still be
// ResetAtEnd, in which case we need the pre-reset value.
return SSAUpdater.GetValueInMiddleOfBlock(&Block);
return Accumulator;
}

/// Determine whether \p MI defines and/or uses SCC.
@@ -422,22 +427,22 @@ void GCNLaneMaskUpdater::process() {
.addImm(0);
}

// Add available values.
if (!Accumulator) {
Accumulator = LMU.createLaneMaskReg();
BuildMI(Entry, Entry.getFirstTerminator(), {},
TII->get(LMU.getLaneMaskConsts().MovOpc), Accumulator)
.addImm(0);
}

// Reset accumulator.
for (BlockInfo &Info : Blocks) {
assert(Accumulating || !Info.Flags);
assert(Info.Flags || Info.Value);

if (Info.Value)
Info.Merged = LMU.createLaneMaskReg();

SSAUpdater.AddAvailableValue(
Info.Block,
(Info.Value && !(Info.Flags & ResetAtEnd)) ? Info.Merged : ZeroReg);
if (!Info.Value || (Info.Flags & ResetAtEnd))

Author: I want to discuss a further optimization here.

GCNLaneMaskUpdater::process() will process the BlockInfo for the following blocks:
X - the block for which we are computing the EXEC mask
R - the set of preds of X in the reconverged CFG
T - the set of preds of X in the thread-level CFG

Info.Value is set for all blocks in T (via GCNLaneMaskUpdater::addAvailable(), called from ControlFlowRewriter::rewrite()).
ResetAtEnd is set for all blocks in R (via GCNLaneMaskUpdater::addReset(), called from ControlFlowRewriter::rewrite()).

The SSAUpdater marks ZeroReg or the merged register as available based on the condition:
(Info.Value && !(Info.Flags & ResetAtEnd)) ? Info.Merged : ZeroReg

which translates to:
SSAUpdater.AddAvailableValue(x, MergedReg) for x \in T and x \notin R
SSAUpdater.AddAvailableValue(x, ZeroReg) for x \in R UNION (x \notin R and x \notin T)

The non-SSA approach uses a single Accumulator register to store the contributions from each block in T, instead of defining multiple merged registers. This Accumulator is reset at the end of the blocks where the SSAUpdater originally marked ZeroReg as available.

Therefore, we add Accumulator-reset-to-0 instructions at the end of the blocks in (x \in R) UNION (x \notin R and x \notin T).

I believe we can reduce this set further to just x \in R. This should work because (x \notin R and x \notin T), when not empty, contains only the block X itself, with X \notin R and X \notin T.

X is directly preceded by blocks in R in the reconverged CFG.
Blocks in R will have an Accumulator reset instruction at their end.
Therefore, adding an Accumulator reset instruction at the end of X is redundant.

Kindly let me know if this logic seems sound.
RefinedConditionForAccReset
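A minimal sketch of the proposed refinement (illustrative only; it reuses the patch's names but is not the actual implementation):

// Proposed: schedule a reset only for blocks in R (ResetAtEnd), dropping the
// (x \notin R and x \notin T) case, since X's predecessors in R already reset
// the accumulator at their ends.
for (BlockInfo &Info : Blocks) {
  if (Info.Flags & ResetAtEnd)
    AccumulatorResetBlocks[Info.Block].insert(Accumulator);
}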

Reviewer: I think we may need to reset at the end of X when X is in a loop. I am not sure.

AccumulatorResetBlocks[Info.Block].insert(Accumulator);
}

if (Accumulating && !SSAUpdater.HasValueForBlock(&Entry))
SSAUpdater.AddAvailableValue(&Entry, ZeroReg);

// Once the SSA updater is ready, we can fill in all merge code, relying
// on the SSA updater to insert required PHIs.
for (BlockInfo &Info : Blocks) {
@@ -448,11 +453,8 @@
Register Previous;
if (Info.Block != &LMU.function()->front() &&
!(Info.Flags & ResetInMiddle)) {
Previous = SSAUpdater.GetValueInMiddleOfBlock(Info.Block);
if (Accumulating) {
assert(!MRI.getVRegDef(Previous) ||
MRI.getVRegDef(Previous)->getOpcode() != AMDGPU::IMPLICIT_DEF);
} else {
Previous = Accumulator;
if (!Accumulating) {
MachineInstr *PrevInstr = MRI.getVRegDef(Previous);
if (PrevInstr && PrevInstr->getOpcode() == AMDGPU::IMPLICIT_DEF) {
PotentiallyDead.insert(PrevInstr);
Expand All @@ -466,18 +468,20 @@ void GCNLaneMaskUpdater::process() {

// Insert merge logic.
MachineBasicBlock::iterator insertPt = getSaluInsertionAtEnd(*Info.Block);
LMU.buildMergeLaneMasks(*Info.Block, insertPt, {}, Info.Merged, Previous,
Info.Value, LMA, Accumulating);

if (Info.Flags & ResetAtEnd) {
MachineInstr *mergeInstr = MRI.getVRegDef(Info.Merged);
if (mergeInstr->getOpcode() == AMDGPU::COPY &&
mergeInstr->getOperand(1).getReg().isVirtual()) {
assert(MRI.use_empty(Info.Merged));
Info.Merged = mergeInstr->getOperand(1).getReg();
mergeInstr->eraseFromParent();
}
}
LMU.buildMergeLaneMasks(*Info.Block, insertPt, {}, Accumulator, Previous,
Info.Value, LMA, Accumulating, Previous == ZeroReg);


// Switching off this optimization, since Accumulator will always have a use
// if (Info.Flags & ResetAtEnd) {
// MachineInstr *mergeInstr = MRI.getVRegDef(Info.Merged);
// if (mergeInstr->getOpcode() == AMDGPU::COPY &&
// mergeInstr->getOperand(1).getReg().isVirtual()) {
// assert(MRI.use_empty(Info.Merged));
// Info.Merged = mergeInstr->getOperand(1).getReg();
// mergeInstr->eraseFromParent();
// }
// }
}

Processed = true;
@@ -489,3 +493,18 @@
return llvm::find_if(
Blocks, [&](const auto &Entry) { return Entry.Block == &Block; });
}

void GCNLaneMaskUpdater::insertAccumulatorResets() {
  const SIInstrInfo *TII =
      LMU.function()->getSubtarget<GCNSubtarget>().getInstrInfo();
  for (auto &Entry : AccumulatorResetBlocks) {
    MachineBasicBlock *B = Entry.first;
    DenseSet<Register> &Accumulators = Entry.second;
    for (Register ACC : Accumulators) {
      // Get the first branch instruction.
      MachineBasicBlock::iterator I = B->getFirstTerminator();
      while (I != B->end() && !I->isBranch())
        I++;
      if (I == B->end())
        I--;
      BuildMI(*B, I, {}, TII->get(LMU.getLaneMaskConsts().MovOpc), ACC)
          .addImm(0);
    }
@vg0204 (Dec 11, 2025): Can't you insert all the resets one after another once you find the right place, rather than searching for the right insertion place for every accumulator? Seems a bit expensive!

Author: No, they need to be inserted at the end of the basic blocks, right before the first branch instruction.
When we first identify the resets in the process() function, later iterations will still add more instructions to the basic block.
Doing it separately at the end saves us from iterating to the correct insertion point and is a cleaner, less expensive approach.

Reviewer: Makes sense!

Reviewer: The idea of having the write to EXEC be a MovTermOpc breaks the new (non-SSA) flow, since we need to insert the accumulator reset at the end of the BB, before the actual terminator instructions. The better approach would be to delay the insertion of the EXEC write alongside the ACC-reset routine. There could be challenges, as we might not reset the ACC all the time; however, if you knew earlier which blocks need an ACC reset, we could handle those specially, and it can still be done without introducing MovTermOpc.
For now, we can continue with this choice of inserting MovTermOpc early and then later changing it to MovOpc while resetting the ACC.
The innermost loop in your code currently identifies the insertion point for each ACC. That code should be moved outside the loop. Secondly, once you get the first terminator, there is no need for the while loop to identify the branch instruction; you only need to skip the instruction that writes to the EXEC mask. If you consider the following, we can change MovTermOpc to MovOpc here as well:
for (auto &Entry : AccumulatorResetBlocks) {
  ...
  MachineBasicBlock::iterator I = B->getFirstTerminator();
  if (I is a write to EXEC with a MovTermOpc) {
    I->setDesc(TII.get(LMC.MovOpc)); // Change the Term MOV to a plain MOV.
    I++;
  } // This ensures the right insertion point is identified. Now insert the
    // ACC reset for all accumulators.
  for (Register ACC : Accumulators) {
    BuildMI(*B, I, {}, TII->get(LMU.getLaneMaskConsts().MovOpc), ACC)
        .addImm(0);
  }
}

Reviewer: > The idea of having the write to EXEC be a MovTermOpc breaks the new (non-SSA) flow, since we need to insert the accumulator reset at the end of the BB, before the actual terminator instructions.

I don't understand this part; writing to EXEC as MovTermOpc seems independent from writing to the accumulator.

Author: > The better approach would be to delay the insertion of the EXEC write alongside the ACC-reset routine. There could be challenges, as we might not reset the ACC all the time; however, if you knew earlier which blocks need an ACC reset, we could handle those specially, and it can still be done without introducing MovTermOpc.

EXEC insertions happen in two stages: first for all divergent incoming BBs, then for secondary BBs (creating rejoin masks). In the second stage, when we set EXEC to the computed rejoin masks, the insertion point is found by iterating from the first terminator (the MovTermOpc introduced by stage 1) in the getSaluInsertionAtEnd() function.
So this MovTermOpc acts as an anchor point for the EXEC = rejoin-mask instruction(s), which are added before the first EXEC write.
ACC reset instructions must come after all EXEC-set instructions in any BB.
The ACCs to be reset are identified while processing stages 1 and 2 in the process() function, so retaining them in a separate data structure, AccumulatorResetBlocks, and adding the resets after both stages are complete is the cleanest approach.

> I don't understand this part; writing to EXEC as MovTermOpc seems independent from writing to the accumulator.

That's true, but the order of instructions breaks the verifier, since it sees ACC reset instructions after a MovTermOpc:

$exec = rejoin_mask
...
$exec = MOV_TERM %Acc
SI_WAVE_CF_EDGE implicit-def $scc
%Acc = S_MOV_B32 0            // Scalar operation after a TERM operation is invalid
S_CBRANCH_EXECZ %bb.x, implicit $exec
S_BRANCH %bb.y
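
For contrast, under the suggestion above (demote the MOV_TERM via setDesc and insert the resets right after it), the block would presumably end up as follows (a sketch, not verified compiler output):

$exec = rejoin_mask
...
$exec = S_MOV_B32 %Acc        // Demoted from MOV_TERM, so later SALU ops are legal
%Acc = S_MOV_B32 0            // Reset now precedes the remaining terminators
SI_WAVE_CF_EDGE implicit-def $scc
S_CBRANCH_EXECZ %bb.x, implicit $exec
S_BRANCH %bb.y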

> The innermost loop in your code currently identifies the insertion point for each ACC. That code should be moved outside the loop. Secondly, once you get the first terminator, there is no need for the while loop to identify the branch instruction; you only need to skip the instruction that writes to the EXEC mask. If you consider the following, we can change MovTermOpc to MovOpc here as well.

Yes, this is a better approach; I will incorporate it.

}
}