[AMDGPU] Change CF intrinsics lowering to reconverge on predecessors #108596

Open
wants to merge 1 commit into main

Conversation

alex-t
Contributor

@alex-t alex-t commented Sep 13, 2024

This PR is a follow-up to #92809.

Brief overview:

Since we mask off EXEC bits at the point where control flow (CF) diverges, we have to restore them after the instructions of the conditional block have executed.
This currently happens at the beginning of the block where the CF reconverges, i.e. in the immediate post-dominator of the block where the CF diverged. We have to ensure that no instruction reading the EXEC mask register is inserted before the masked EXEC bits are restored to the state they had before the divergence. For that, we tuned the TargetInstrInfo::isBasicBlockPrologue method to report any instruction that writes EXEC as belonging to the block prologue. It then turned out that instructions reloading values that were spilled to memory and are used in the current block must be placed at the beginning of the block, before their uses but after the point where the EXEC mask is restored, since they load VGPRs and hence read EXEC. As a result, we also had to treat all spill opcodes as belonging to the block prologue. This solution worked well until we faced a problem with live interval splitting. To split a live interval (LI) across a given physical register, we need to insert a copy of the virtual register before the point where the current interval starts to interfere with another interval already assigned to the same physical register. If the LI being split is live-in to the block, the copy has to be placed at the beginning of the block, and SplitKit queries isBasicBlockPrologue for the proper insertion point.
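For reference, a minimal sketch of such a prologue hook, assuming the AMDGPU names used below (the actual in-tree SIInstrInfo::isBasicBlockPrologue differs in detail and also consults the Reg argument), could look like this:

// Sketch only: classify EXEC restores and spill reloads as block prologue so
// that nothing reading EXEC can be inserted before the mask is restored.
bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
                                       Register Reg) const {
  // Any non-terminator write of EXEC restores the mask after divergent CF.
  if (!MI.isTerminator() && MI.modifiesRegister(AMDGPU::EXEC, &RI))
    return true;
  // Spill reloads write VGPR lanes and therefore read EXEC, so they must
  // stay after the EXEC restore but before ordinary instructions.
  return isVGPRSpill(MI) || isSGPRSpill(MI);
}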

Let's consider the following example:

$exec = S_OR_B64 $exec, killed renamable $sgpr48_sgpr49, implicit-def $scc
%129034:vgpr_32 = SI_SPILL_V32_RESTORE %stack.265, $sgpr32, 0, implicit $exec
%129055:vgpr_32 = SI_SPILL_V32_RESTORE %stack.266, $sgpr32, 0, implicit $exec
%129083:vgpr_32 = SI_SPILL_V32_RESTORE %stack.267, $sgpr32, 0, implicit $exec
%129635:vgpr_32 = SI_SPILL_V32_RESTORE %stack.282, $sgpr32, 0, implicit $exec
<-- another LI assigned to the same physreg starts here
%129657:vgpr_32 = COPY %14037:vgpr_32 <-- insertion point chosen after the "prologue"
We need to split the LI for the virtual register %14037. The COPY position was chosen after all the reloads because they are considered part of the block prologue. However, any of those reloads could start a live interval that has already been assigned to the same physical register across which we are aiming to split. In this case, we hit an assert in SplitKit.cpp reporting that the insertion point returned by the target will cause interference.

The root cause of all the troubles mentioned above is the attempt to make the common spill/split logic aware of (and involved in) the sophisticated, target-specific details of the control flow implementation.

We have two options to address this issue:

  1. Further tune isBasicBlockPrologue. We could pass the LeaveBefore slot index as an argument and return "false" if the current instruction matches that slot. This could work if we passed a valid slot index only from the one relevant call site (SplitKit.cpp) and used an invalid slot index as the default. However, this would require changes to the common code in SplitKit, the basic register allocator, etc. (a hypothetical sketch of this interface follows the list).
  2. Restore the EXEC mask before the block where the CF converges, i.e. on the post-dominance frontier. This requires a massive change across the whole CF lowering stack, but it avoids any interaction with the insertion position.
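A hypothetical shape of the interface change from option 1, for illustration only (the LeaveBefore parameter name is assumed; this PR does not implement it):

// Hypothetical extension of the hook: SplitKit.cpp would pass its intended
// insertion slot, all other callers would keep the invalid default and see
// unchanged behaviour; the target would return false once MI reaches the
// LeaveBefore slot.
virtual bool isBasicBlockPrologue(const MachineInstr &MI,
                                  Register Reg = Register(),
                                  SlotIndex LeaveBefore = SlotIndex()) const;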

What was done

We opted for the second option. We made the EXEC manipulation at the divergence point conditional, which allows us to change EXEC for the conditional block while leaving it unchanged along the fall-through path. Hence, we only need to restore EXEC at the end of the conditional block.

Lowering example

Simple "if" MIR
simple_if.pdf
Lowered MIR
simple_if_lowered.pdf
Please note that we use a conditional move to update EXEC, so we only update it when the condition is true and leave it unchanged otherwise. Also, the SI_WAVE_RECONVERGE pseudo is inserted along the "true" path of the conditional branch, at the end of the predecessor block.
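For illustration, the divergence-point sequence can be sketched as follows; this is a minimal sketch with assumed parameters (CondReg holds the lowered condition mask, ThenMask is a scratch SGPR pair) rather than the exact SILowerControlFlow code:

static void emitConditionalExecUpdate(MachineBasicBlock &MBB, MachineInstr &MI,
                                      const SIInstrInfo *TII, Register CondReg,
                                      Register ThenMask) {
  const DebugLoc &DL = MI.getDebugLoc();
  // ThenMask = CondReg & EXEC: the lanes that enter the conditional block.
  // S_AND_B64 also sets SCC to 1 iff the result is non-zero.
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_AND_B64), ThenMask)
      .addReg(CondReg)
      .addReg(AMDGPU::EXEC);
  // S_CMOV_B64 writes its destination only when SCC is set, so EXEC is
  // narrowed along the taken path and left unchanged on the fall-through.
  BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_CMOV_B64), AMDGPU::EXEC)
      .addReg(ThenMask);
  // SI_WAVE_RECONVERGE at the end of the conditional block later restores
  // the saved mask so the wave reconverges before the join block.
}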

@llvmbot
Member

llvmbot commented Sep 13, 2024

@llvm/pr-subscribers-llvm-analysis
@llvm/pr-subscribers-backend-amdgpu
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-llvm-globalisel

Author: None (alex-t)

Changes

Patch is 10.24 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108596.diff

329 Files Affected:

  • (modified) clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu (+1-1)
  • (modified) llvm/include/llvm/Analysis/CFGPrinter.h (+5-3)
  • (modified) llvm/include/llvm/CodeGen/MachineBasicBlock.h (+1-2)
  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+1-2)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+1-1)
  • (modified) llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp (+1-2)
  • (modified) llvm/lib/CodeGen/InlineSpiller.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/MachineBasicBlock.cpp (+2-2)
  • (modified) llvm/lib/CodeGen/SplitKit.cpp (+2-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+10-12)
  • (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+46-37)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+28-5)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+40-23)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+1-2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+10-3)
  • (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+202-314)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp (+36)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp (+1-1)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/hidden-diverge-gmir.mir (+8-8)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/temporal-divergence.mir (+7-28)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir (+2-4)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/hidden-diverge.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll (+158-156)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll (+158-156)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.mir (+3-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+95-74)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+37-61)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+94-62)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+50-76)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+6-18)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.mir (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+23-14)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+48-32)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f64.ll (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll (+34-42)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (+12-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/is-safe-to-sink-bug.ll (+8-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-amdgcn.if.xfail.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-brcond.mir (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+60-68)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.make.buffer.rsrc.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f32.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f16.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.i8.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.ll (+10-25)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.f16.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.i8.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.ll (+20-50)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.s.buffer.load.ll (+96-240)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f32.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.ll (+4-10)
  • (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i32.ll (+6-6)
  • (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i64.ll (+5-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+196-168)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+13-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+22-13)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+23-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.32.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.64.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.load.1d.ll (+8-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.sample.1d.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.ll (+48-72)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-waterfall-agpr.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+79-57)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+21-13)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+333-240)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+1264-1000)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+3478-2793)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+199-155)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+286-200)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+286-200)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+114-92)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+25-21)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-14)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+306-278)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+50-35)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+1508-1508)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+1112-1101)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+1112-1101)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+5-65)
  • (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+140-92)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+32-28)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+493-474)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+281-198)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+394-442)
  • (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+405-349)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+13-8)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/early-tailduplicator-terminator.mir (+7-9)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-subvector.ll (+77-42)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+2595-2484)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+1395-1395)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+1395-1395)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+1720-1720)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+22-15)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+12-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+26-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+6-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+338-204)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+11-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f64.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll (+11-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+2439-2426)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+1641-1641)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+1641-1641)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+1994-1994)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+320-272)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1711-1212)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+1097-793)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+1097-793)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1665-1186)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+8-19)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+245-302)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+298-290)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+110-131)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+18-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+120-91)
  • (modified) llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.bf.ll (+74-40)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.ll (+459-255)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+48-36)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+110-66)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-soffset-mbuf.ll (+64-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+14-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+14-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll (+40-29)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.atomic.buffer.load.ll (+40-29)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.v2bf16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+44-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.format.v3f16.ll (+17-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+44-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.v2bf16.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_nortn.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_rtn.ll (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f32.ll (+34-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f64.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmin.f32.ll (+34-32)
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee1..3ca766755a6319 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
 // GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
 // GFX90A-CAS-LABEL: _Z14atomic_add_casPf
 // GFX90A-CAS:  flat_atomic_cmpswap
-// GFX90A-CAS:  s_cbranch_execnz
+// GFX90A-CAS:  s_cbranch_scc1
 __device__ float atomic_add_cas(float *p) {
   return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
 }
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index cd785331d1f146..e24a9110d596ca 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -272,9 +272,11 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
     unsigned OpNo = I.getSuccessorIndex();
     const Instruction *TI = Node->getTerminator();
     BasicBlock *SuccBB = TI->getSuccessor(OpNo);
-    auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
-    double WeightPercent = ((double)BranchProb.getNumerator()) /
-                           ((double)BranchProb.getDenominator());
+    // auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
+    // double WeightPercent = ((double)BranchProb.getNumerator()) /
+    //                        ((double)BranchProb.getDenominator());
+    double WeightPercent = 0.5;
+
     std::string TTAttr =
         formatv("tooltip=\"{0} -> {1}\\nProbability {2:P}\" ", getBBName(Node),
                 getBBName(SuccBB), WeightPercent);
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 6efb17c55493a9..9fcda791fb4c72 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -879,8 +879,7 @@ class MachineBasicBlock
   /// debug.  This is the correct point to insert copies at the beginning of a
   /// basic block. \p Reg is the register being used by a spill or defined for a
   /// restore/split during register allocation.
-  iterator SkipPHIsLabelsAndDebug(iterator I, Register Reg = Register(),
-                                  bool SkipPseudoOp = true);
+  iterator SkipPHIsLabelsAndDebug(iterator I, bool SkipPseudoOp = true);
 
   /// Returns an iterator to the first terminator instruction of this basic
   /// block. If a terminator does not exist, it returns end().
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 49ce13dd8cbe39..984850980f4c9c 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -2058,8 +2058,7 @@ class TargetInstrInfo : public MCInstrInfo {
   /// other instructions shall be inserted before it. This can be implemented
   /// to prevent register allocator to insert spills for \p Reg before such
   /// instructions.
-  virtual bool isBasicBlockPrologue(const MachineInstr &MI,
-                                    Register Reg = Register()) const {
+  virtual bool isBasicBlockPrologue(const MachineInstr &MI) const {
     return false;
   }
 
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..195ceb64eae4a8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3098,7 +3098,7 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
   [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
   [IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // Represent unreachable in a divergent region.
diff --git a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
index 3bb9da5f1a37bb..184b493694894d 100644
--- a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
+++ b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
@@ -461,8 +461,7 @@ class StatepointState {
 
       if (EHPad && !RC.hasReload(Reg, RegToSlotIdx[Reg], EHPad)) {
         RC.recordReload(Reg, RegToSlotIdx[Reg], EHPad);
-        auto EHPadInsertPoint =
-            EHPad->SkipPHIsLabelsAndDebug(EHPad->begin(), Reg);
+        auto EHPadInsertPoint = EHPad->SkipPHIsLabelsAndDebug(EHPad->begin());
         insertReloadBefore(Reg, EHPadInsertPoint, EHPad);
         LLVM_DEBUG(dbgs() << "...also reload at EHPad "
                           << printMBBReference(*EHPad) << "\n");
diff --git a/llvm/lib/CodeGen/InlineSpiller.cpp b/llvm/lib/CodeGen/InlineSpiller.cpp
index 81ae805d64e1ec..201d3a5df3a536 100644
--- a/llvm/lib/CodeGen/InlineSpiller.cpp
+++ b/llvm/lib/CodeGen/InlineSpiller.cpp
@@ -463,7 +463,7 @@ bool InlineSpiller::hoistSpillInsideBB(LiveInterval &SpillLI,
   MachineBasicBlock *MBB = LIS.getMBBFromIndex(SrcVNI->def);
   MachineBasicBlock::iterator MII;
   if (SrcVNI->isPHIDef())
-    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin(), SrcReg);
+    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin());
   else {
     MachineInstr *DefMI = LIS.getInstructionFromIndex(SrcVNI->def);
     assert(DefMI && "Defining instruction disappeared");
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index 5d06af3ebf3360..419d7e0312ae08 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -223,13 +223,13 @@ MachineBasicBlock::SkipPHIsAndLabels(MachineBasicBlock::iterator I) {
 
 MachineBasicBlock::iterator
 MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I,
-                                          Register Reg, bool SkipPseudoOp) {
+                                          bool SkipPseudoOp) {
   const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
 
   iterator E = end();
   while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
                     (SkipPseudoOp && I->isPseudoProbe()) ||
-                    TII->isBasicBlockPrologue(*I, Reg)))
+                    TII->isBasicBlockPrologue(*I)))
     ++I;
   // FIXME: This needs to change if we wish to bundle labels / dbg_values
   // inside the bundle.
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index b671e510387530..22991a0fb4cb1e 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -806,10 +806,8 @@ SlotIndex SplitEditor::leaveIntvAtTop(MachineBasicBlock &MBB) {
     return Start;
   }
 
-  unsigned RegIdx = 0;
-  Register Reg = LIS.getInterval(Edit->get(RegIdx)).reg();
-  VNInfo *VNI = defFromParent(RegIdx, ParentVNI, Start, MBB,
-                              MBB.SkipPHIsLabelsAndDebug(MBB.begin(), Reg));
+  VNInfo *VNI = defFromParent(0, ParentVNI, Start, MBB,
+                              MBB.SkipPHIsLabelsAndDebug(MBB.begin()));
   RegAssign.insert(Start, VNI->def, OpenIdx);
   LLVM_DEBUG(dump());
   return VNI->def;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 3fcb364fc2c536..c0d2853d159882 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1551,11 +1551,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
   return true;
 }
 
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+    MachineInstr &MI) const {
   // FIXME: Manually selecting to avoid dealing with the SReg_1 trick
   // SelectionDAG uses for wave32 vs wave64.
   MachineBasicBlock *BB = MI.getParent();
-  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
       .add(MI.getOperand(1));
 
   Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
     MachineInstr &I) const {
   Intrinsic::ID IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
   switch (IntrinsicID) {
-  case Intrinsic::amdgcn_end_cf:
-    return selectEndCfIntrinsic(I);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return selectWaveReconvergeIntrinsic(I);
   case Intrinsic::amdgcn_ds_ordered_add:
   case Intrinsic::amdgcn_ds_ordered_swap:
     return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 068db5c1c14496..c3ba26590dfbcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -116,7 +116,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   bool selectReturnAddress(MachineInstr &I) const;
   bool selectG_INTRINSIC(MachineInstr &I) const;
 
-  bool selectEndCfIntrinsic(MachineInstr &MI) const;
+  bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
   bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 4737a322c255f4..1d2ee6a4c96514 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
   const unsigned MovExecOpc =
       Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
-  const unsigned MovExecTermOpc =
-      Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
 
   const unsigned XorTermOpc = Subtarget.isWave32() ?
     AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,27 +947,27 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
 
   B.setInsertPt(*BodyBB, BodyBB->end());
 
+  Register LoopMask = MRI.createVirtualRegister(
+      TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
   B.buildInstr(XorTermOpc)
-    .addDef(ExecReg)
-    .addReg(ExecReg)
-    .addReg(NewExec);
+    .addDef(LoopMask)
+	.addReg(ExecReg)
+	.addReg(NewExec);
 
   // XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
   // s_cbranch_scc0?
 
   // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
-  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+      .addReg(LoopMask)
+      .addReg(NewExec)
+      .addMBB(LoopBB);
 
   // Save the EXEC mask before the loop.
   BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
     .addReg(ExecReg);
 
-  // Restore the EXEC mask after the loop.
-  B.setMBB(*RestoreExecBB);
-  B.buildInstr(MovExecTermOpc)
-    .addDef(ExecReg)
-    .addReg(SaveExecReg);
 
   // Set the insert point after the original instruction, so any new
   // instructions will be in the remainder.
@@ -4967,7 +4965,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
       break;
     }
-    case Intrinsic::amdgcn_end_cf: {
+    case Intrinsic::amdgcn_wave_reconverge: {
       unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
       OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
       break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index edd881c84078c6..cd8cbcc7f689d4 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -14,6 +14,7 @@
 #include "AMDGPU.h"
 #include "AMDGPUTargetMachine.h"
 #include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
@@ -55,7 +56,7 @@ class SIAnnotateControlFlow {
   Function *Else;
   Function *IfBreak;
   Function *Loop;
-  Function *EndCf;
+  Function *WaveReconverge;
 
   DominatorTree *DT;
   StackVector Stack;
@@ -88,7 +89,7 @@ class SIAnnotateControlFlow {
 
   bool handleLoop(BranchInst *Term);
 
-  bool closeControlFlow(BasicBlock *BB);
+  bool tryWaveReconverge(BasicBlock *BB);
 
 public:
   SIAnnotateControlFlow(Module &M, const GCNSubtarget &ST, DominatorTree &DT,
@@ -123,7 +124,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(
+      &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -185,8 +187,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
 
 /// Open a new "If" block
 bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
-  if (isUniform(Term))
-    return false;
 
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -287,43 +287,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 }
 
 /// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
-
-  assert(Stack.back().first == BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
+  if (succ_empty(BB))
+    return false;
 
+  BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+  if (Term->getNumSuccessors() == 1) {
+    // The current BB's single successor is the top of the stack. We need to
+    // reconverge over that path.
+    BasicBlock *SingleSucc = *succ_begin(BB);
+    BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
+    if (isTopOfStack(SingleSucc)) {
+      Value *Exec = Stack.back().second;
+      IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
+    }
+  } else {
+    // We have a uniform conditional branch terminating the block.
+    // This block may be the last in the Then path of the enclosing divergent
+    // IF.
+    if (!isUniform(Term))
+      // Divergent loop is going to be further processed in another place
+      return false;
+
+    for (auto Succ : Term->successors()) {
+      if (isTopOfStack(Succ)) {
+        // Just split to make a room for further WAVE_RECONVERGE insertion
     SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
+        for (auto P : predecessors(Succ)) {
+          if (DT->dominates(BB, P))
+            Preds.push_back(P);
     }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
+        DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
                                 false);
   }
 
-  Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
     }
-    IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
-    // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
-    // condition, for now just avoid copying these DebugLocs so that stepping
-    // out of the then/else block in a debugger doesn't step to the condition.
-    IRB.SetCurrentDebugLocation(DebugLoc());
-    IRB.CreateCall(EndCf, {Exec});
   }
 
   return true;
@@ -341,14 +341,18 @@ bool SIAnnotateControlFlow::run(Function &F) {
 
     if (!Term || Term->isUnconditional()) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      Changed |= tryWaveReconverge(BB);
 
       continue;
     }
 
     if (I.nodeVisited(Term->getSuccessor(1))) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      // Let's take care of uniform loop latch that may be closing the Then
+      // path of the enclosing divergent branch.
+      Changed |= tryWaveReconverge(BB);
 
       if (DT->dominates(Term->getSuccessor(1), BB))
         Changed |= handleLoop(Term);
@@ -363,9 +367,14 @@ bool SIAnnotateControlFlow::run(Function &F) {
         continue;
       }
 
-      Changed |= closeControlFlow(BB);
+      Stack.pop_back();
     }
 
+    if (isUniform(Term))
+      // Uniform conditional branch may be in the block that closes the Then
+      // path of the divergent conditional branch.
+      Changed |= tryWaveReconverge(BB);
+    else
     Changed |= openIf(Term);
   }
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81b52935ddf397..ae3b849a55ff2e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6475,7 +6475,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
       return AMDGPUISD::ELSE;
     case Intrinsic::amdgcn_loop:
       return AMDGPUISD::LOOP;
-    case Intrinsic::amdgcn_end_cf:
+    case Intrinsic::amdgcn_wave_reconverge:
       llvm_unreachable("should not occur");
     default:
       return 0;
@@ -9848,9 +9848,10 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
 
     return SDValue(Load, 0);
   }
-  case Intrinsic::amdgcn_end_cf:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
-                                      Op->getOperand(2), Chain), 0);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+                                      MVT::Other, Op->getOperand(2), Chain),
+                   0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
   case Intrinsic::amdgcn_s_wakeup_barrier: {
@@ -15693,6 +15694,28 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copy to regs for the successor PHIs
+  // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+  // branch.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+        assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+               "Malformed CFG detected!\n");
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+        break;
+      }
+    }
+  }
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16451,7 +16474,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
         default:
           Result = false;
           break;
-        case Intrinsic::amdgcn_end_cf:
+        case Intrinsic::amdgcn_wave_reconverge:
         case Intrinsic::amdgcn_loop:
           Result = true;
           break;
diff...
[truncated]

@llvmbot
Member

llvmbot commented Sep 13, 2024

@llvm/pr-subscribers-llvm-ir

Author: None (alex-t)

  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f32.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.ll (+4-10)
  • (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i32.ll (+6-6)
  • (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i64.ll (+5-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+196-168)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+13-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+22-13)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+23-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.32.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.64.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.load.1d.ll (+8-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.sample.1d.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.ll (+48-72)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-waterfall-agpr.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+79-57)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+21-13)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+333-240)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+1264-1000)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+3478-2793)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+199-155)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+286-200)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+286-200)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+114-92)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+25-21)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-14)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+306-278)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+50-35)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+1508-1508)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+1112-1101)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+1112-1101)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+5-65)
  • (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+140-92)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+32-28)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+493-474)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+281-198)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+394-442)
  • (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+405-349)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+13-8)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/early-tailduplicator-terminator.mir (+7-9)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-subvector.ll (+77-42)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+2595-2484)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+1395-1395)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+1395-1395)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+1720-1720)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+22-15)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+12-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+26-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+6-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+338-204)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+11-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f64.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll (+11-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+2439-2426)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+1641-1641)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+1641-1641)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+1994-1994)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+320-272)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1711-1212)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+1097-793)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+1097-793)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1665-1186)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+8-19)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+245-302)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+298-290)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+110-131)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+18-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+120-91)
  • (modified) llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.bf.ll (+74-40)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.ll (+459-255)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+48-36)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+110-66)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-soffset-mbuf.ll (+64-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+14-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+14-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll (+40-29)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.atomic.buffer.load.ll (+40-29)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.v2bf16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+44-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.format.v3f16.ll (+17-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+44-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.v2bf16.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_nortn.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_rtn.ll (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f32.ll (+34-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f64.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmin.f32.ll (+34-32)
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee1..3ca766755a6319 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
 // GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
 // GFX90A-CAS-LABEL: _Z14atomic_add_casPf
 // GFX90A-CAS:  flat_atomic_cmpswap
-// GFX90A-CAS:  s_cbranch_execnz
+// GFX90A-CAS:  s_cbranch_scc1
 __device__ float atomic_add_cas(float *p) {
   return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
 }
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index cd785331d1f146..e24a9110d596ca 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -272,9 +272,11 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
     unsigned OpNo = I.getSuccessorIndex();
     const Instruction *TI = Node->getTerminator();
     BasicBlock *SuccBB = TI->getSuccessor(OpNo);
-    auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
-    double WeightPercent = ((double)BranchProb.getNumerator()) /
-                           ((double)BranchProb.getDenominator());
+    // auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
+    // double WeightPercent = ((double)BranchProb.getNumerator()) /
+    //                        ((double)BranchProb.getDenominator());
+    double WeightPercent = 0.5;
+
     std::string TTAttr =
         formatv("tooltip=\"{0} -> {1}\\nProbability {2:P}\" ", getBBName(Node),
                 getBBName(SuccBB), WeightPercent);
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 6efb17c55493a9..9fcda791fb4c72 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -879,8 +879,7 @@ class MachineBasicBlock
   /// debug.  This is the correct point to insert copies at the beginning of a
   /// basic block. \p Reg is the register being used by a spill or defined for a
   /// restore/split during register allocation.
-  iterator SkipPHIsLabelsAndDebug(iterator I, Register Reg = Register(),
-                                  bool SkipPseudoOp = true);
+  iterator SkipPHIsLabelsAndDebug(iterator I, bool SkipPseudoOp = true);
 
   /// Returns an iterator to the first terminator instruction of this basic
   /// block. If a terminator does not exist, it returns end().
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 49ce13dd8cbe39..984850980f4c9c 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -2058,8 +2058,7 @@ class TargetInstrInfo : public MCInstrInfo {
   /// other instructions shall be inserted before it. This can be implemented
   /// to prevent register allocator to insert spills for \p Reg before such
   /// instructions.
-  virtual bool isBasicBlockPrologue(const MachineInstr &MI,
-                                    Register Reg = Register()) const {
+  virtual bool isBasicBlockPrologue(const MachineInstr &MI) const {
     return false;
   }
 
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..195ceb64eae4a8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3098,7 +3098,7 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
   [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
   [IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // Represent unreachable in a divergent region.
diff --git a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
index 3bb9da5f1a37bb..184b493694894d 100644
--- a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
+++ b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
@@ -461,8 +461,7 @@ class StatepointState {
 
       if (EHPad && !RC.hasReload(Reg, RegToSlotIdx[Reg], EHPad)) {
         RC.recordReload(Reg, RegToSlotIdx[Reg], EHPad);
-        auto EHPadInsertPoint =
-            EHPad->SkipPHIsLabelsAndDebug(EHPad->begin(), Reg);
+        auto EHPadInsertPoint = EHPad->SkipPHIsLabelsAndDebug(EHPad->begin());
         insertReloadBefore(Reg, EHPadInsertPoint, EHPad);
         LLVM_DEBUG(dbgs() << "...also reload at EHPad "
                           << printMBBReference(*EHPad) << "\n");
diff --git a/llvm/lib/CodeGen/InlineSpiller.cpp b/llvm/lib/CodeGen/InlineSpiller.cpp
index 81ae805d64e1ec..201d3a5df3a536 100644
--- a/llvm/lib/CodeGen/InlineSpiller.cpp
+++ b/llvm/lib/CodeGen/InlineSpiller.cpp
@@ -463,7 +463,7 @@ bool InlineSpiller::hoistSpillInsideBB(LiveInterval &SpillLI,
   MachineBasicBlock *MBB = LIS.getMBBFromIndex(SrcVNI->def);
   MachineBasicBlock::iterator MII;
   if (SrcVNI->isPHIDef())
-    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin(), SrcReg);
+    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin());
   else {
     MachineInstr *DefMI = LIS.getInstructionFromIndex(SrcVNI->def);
     assert(DefMI && "Defining instruction disappeared");
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index 5d06af3ebf3360..419d7e0312ae08 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -223,13 +223,13 @@ MachineBasicBlock::SkipPHIsAndLabels(MachineBasicBlock::iterator I) {
 
 MachineBasicBlock::iterator
 MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I,
-                                          Register Reg, bool SkipPseudoOp) {
+                                          bool SkipPseudoOp) {
   const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
 
   iterator E = end();
   while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
                     (SkipPseudoOp && I->isPseudoProbe()) ||
-                    TII->isBasicBlockPrologue(*I, Reg)))
+                    TII->isBasicBlockPrologue(*I)))
     ++I;
   // FIXME: This needs to change if we wish to bundle labels / dbg_values
   // inside the bundle.
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index b671e510387530..22991a0fb4cb1e 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -806,10 +806,8 @@ SlotIndex SplitEditor::leaveIntvAtTop(MachineBasicBlock &MBB) {
     return Start;
   }
 
-  unsigned RegIdx = 0;
-  Register Reg = LIS.getInterval(Edit->get(RegIdx)).reg();
-  VNInfo *VNI = defFromParent(RegIdx, ParentVNI, Start, MBB,
-                              MBB.SkipPHIsLabelsAndDebug(MBB.begin(), Reg));
+  VNInfo *VNI = defFromParent(0, ParentVNI, Start, MBB,
+                              MBB.SkipPHIsLabelsAndDebug(MBB.begin()));
   RegAssign.insert(Start, VNI->def, OpenIdx);
   LLVM_DEBUG(dump());
   return VNI->def;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 3fcb364fc2c536..c0d2853d159882 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1551,11 +1551,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
   return true;
 }
 
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+    MachineInstr &MI) const {
   // FIXME: Manually selecting to avoid dealing with the SReg_1 trick
   // SelectionDAG uses for wave32 vs wave64.
   MachineBasicBlock *BB = MI.getParent();
-  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
       .add(MI.getOperand(1));
 
   Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
     MachineInstr &I) const {
   Intrinsic::ID IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
   switch (IntrinsicID) {
-  case Intrinsic::amdgcn_end_cf:
-    return selectEndCfIntrinsic(I);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return selectWaveReconvergeIntrinsic(I);
   case Intrinsic::amdgcn_ds_ordered_add:
   case Intrinsic::amdgcn_ds_ordered_swap:
     return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 068db5c1c14496..c3ba26590dfbcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -116,7 +116,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   bool selectReturnAddress(MachineInstr &I) const;
   bool selectG_INTRINSIC(MachineInstr &I) const;
 
-  bool selectEndCfIntrinsic(MachineInstr &MI) const;
+  bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
   bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 4737a322c255f4..1d2ee6a4c96514 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
   const unsigned MovExecOpc =
       Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
-  const unsigned MovExecTermOpc =
-      Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
 
   const unsigned XorTermOpc = Subtarget.isWave32() ?
     AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,27 +947,27 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
 
   B.setInsertPt(*BodyBB, BodyBB->end());
 
+  Register LoopMask = MRI.createVirtualRegister(
+      TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
   B.buildInstr(XorTermOpc)
-    .addDef(ExecReg)
-    .addReg(ExecReg)
-    .addReg(NewExec);
+    .addDef(LoopMask)
+    .addReg(ExecReg)
+    .addReg(NewExec);
 
   // XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
   // s_cbranch_scc0?
 
   // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
-  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+      .addReg(LoopMask)
+      .addReg(NewExec)
+      .addMBB(LoopBB);
 
   // Save the EXEC mask before the loop.
   BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
     .addReg(ExecReg);
 
-  // Restore the EXEC mask after the loop.
-  B.setMBB(*RestoreExecBB);
-  B.buildInstr(MovExecTermOpc)
-    .addDef(ExecReg)
-    .addReg(SaveExecReg);
 
   // Set the insert point after the original instruction, so any new
   // instructions will be in the remainder.
@@ -4967,7 +4965,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
       break;
     }
-    case Intrinsic::amdgcn_end_cf: {
+    case Intrinsic::amdgcn_wave_reconverge: {
       unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
       OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
       break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index edd881c84078c6..cd8cbcc7f689d4 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -14,6 +14,7 @@
 #include "AMDGPU.h"
 #include "AMDGPUTargetMachine.h"
 #include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
@@ -55,7 +56,7 @@ class SIAnnotateControlFlow {
   Function *Else;
   Function *IfBreak;
   Function *Loop;
-  Function *EndCf;
+  Function *WaveReconverge;
 
   DominatorTree *DT;
   StackVector Stack;
@@ -88,7 +89,7 @@ class SIAnnotateControlFlow {
 
   bool handleLoop(BranchInst *Term);
 
-  bool closeControlFlow(BasicBlock *BB);
+  bool tryWaveReconverge(BasicBlock *BB);
 
 public:
   SIAnnotateControlFlow(Module &M, const GCNSubtarget &ST, DominatorTree &DT,
@@ -123,7 +124,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(
+      &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -185,8 +187,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
 
 /// Open a new "If" block
 bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
-  if (isUniform(Term))
-    return false;
 
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -287,43 +287,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 }
 
 /// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
-
-  assert(Stack.back().first == BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
+  if (succ_empty(BB))
+    return false;
 
+  BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+  if (Term->getNumSuccessors() == 1) {
+    // The current BB's single successor may be the top of the stack. We need
+    // to reconverge along that path.
+    BasicBlock *SingleSucc = *succ_begin(BB);
+    BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
+    if (isTopOfStack(SingleSucc)) {
+      Value *Exec = Stack.back().second;
+      IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
+    }
+  } else {
+    // We have a uniform conditional branch terminating the block.
+    // This block may be the last in the Then path of the enclosing divergent
+    // IF.
+    if (!isUniform(Term))
+      // Divergent loop is going to be further processed in another place
+      return false;
+
+    for (auto Succ : Term->successors()) {
+      if (isTopOfStack(Succ)) {
+        // Just split to make room for the subsequent WAVE_RECONVERGE insertion
     SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
+        for (auto P : predecessors(Succ)) {
+          if (DT->dominates(BB, P))
+            Preds.push_back(P);
     }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
+        DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
                                 false);
   }
 
-  Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
     }
-    IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
-    // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
-    // condition, for now just avoid copying these DebugLocs so that stepping
-    // out of the then/else block in a debugger doesn't step to the condition.
-    IRB.SetCurrentDebugLocation(DebugLoc());
-    IRB.CreateCall(EndCf, {Exec});
   }
 
   return true;
@@ -341,14 +341,18 @@ bool SIAnnotateControlFlow::run(Function &F) {
 
     if (!Term || Term->isUnconditional()) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      Changed |= tryWaveReconverge(BB);
 
       continue;
     }
 
     if (I.nodeVisited(Term->getSuccessor(1))) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      // Let's take care of a uniform loop latch that may be closing the Then
+      // path of the enclosing divergent branch.
+      Changed |= tryWaveReconverge(BB);
 
       if (DT->dominates(Term->getSuccessor(1), BB))
         Changed |= handleLoop(Term);
@@ -363,9 +367,14 @@ bool SIAnnotateControlFlow::run(Function &F) {
         continue;
       }
 
-      Changed |= closeControlFlow(BB);
+      Stack.pop_back();
     }
 
+    if (isUniform(Term))
+      // A uniform conditional branch may be in the block that closes the Then
+      // path of the divergent conditional branch.
+      Changed |= tryWaveReconverge(BB);
+    else
     Changed |= openIf(Term);
   }
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81b52935ddf397..ae3b849a55ff2e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6475,7 +6475,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
       return AMDGPUISD::ELSE;
     case Intrinsic::amdgcn_loop:
       return AMDGPUISD::LOOP;
-    case Intrinsic::amdgcn_end_cf:
+    case Intrinsic::amdgcn_wave_reconverge:
       llvm_unreachable("should not occur");
     default:
       return 0;
@@ -9848,9 +9848,10 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
 
     return SDValue(Load, 0);
   }
-  case Intrinsic::amdgcn_end_cf:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
-                                      Op->getOperand(2), Chain), 0);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+                                      MVT::Other, Op->getOperand(2), Chain),
+                   0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
   case Intrinsic::amdgcn_s_wakeup_barrier: {
@@ -15693,6 +15694,28 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copies to registers for the successor PHIs
+  // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+  // branch.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+        assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+               "Malformed CFG detected!\n");
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+        break;
+      }
+    }
+  }
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16451,7 +16474,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
         default:
           Result = false;
           break;
-        case Intrinsic::amdgcn_end_cf:
+        case Intrinsic::amdgcn_wave_reconverge:
         case Intrinsic::amdgcn_loop:
           Result = true;
           break;
diff...
[truncated]

@llvmbot
Member

llvmbot commented Sep 13, 2024

@llvm/pr-subscribers-llvm-regalloc

Author: None (alex-t)

Changes

Brief overview:

As we mask the bits on the CF diverge, we have to set them back after executing the instructions in a conditional block.
This currently happens at the beginning of the block where the CF converges - i.e. in the immediate post-dominator of the block where the CF diverged. We have to ensure that none of the instructions that read the EXEC mask register are inserted before masked EXEC bits are restored to the state they had before the CF divergence. For that, we tuned the TargetInstrInfo::isBasicBlockPrologue method to report any instruction that writes EXEC as belonging to the prologue. Then it appeared that instructions loading the values spilled to the memory and used in the current block must be placed at the block beginning before they are used but after the point where EXEC mask is restored, since they are loading VGPRs and, hence, read EXEC. Hence, we had to consider all spilling opcodes to belong to the block prologue. This solution worked well until we faced the problem with live interval splitting. To split the LI across the given physical register we need to insert a copy of a virtual register before the point where the current interval interferes with another one, already assigned to the same physical register. If the LI being split is "live-in" in the block we need to put a copy at the beginning of the block. SplitKit requests isBasicBlockPrologue for the proper insertion point.

Let's consider the following example:

$exec = S_OR_B64 $exec, killed renamable $sgpr48_sgpr49, implicit-def $scc
%129034:vgpr_32 = SI_SPILL_V32_RESTORE %stack.265, $sgpr32, 0, implicit $exec
%129055:vgpr_32 = SI_SPILL_V32_RESTORE %stack.266, $sgpr32, 0, implicit $exec
%129083:vgpr_32 = SI_SPILL_V32_RESTORE %stack.267, $sgpr32, 0, implicit $exec
%129635:vgpr_32 = SI_SPILL_V32_RESTORE %stack.282, $sgpr32, 0, implicit $exec
<-- another LI assigned to the same physreg starts here
%129657:vgpr_32 = COPY %14037:vgpr_32 <-- insertion point chosen after the "prologue"
We are to split the LI for the virtual register %14037. The COPY position was chosen after all reloads as they are considered to belong to the block prologue. Any of them could start live interval that might have been already assigned to the same physical register which we are aiming to split across. In this case, we hit an "assert" in SplitKit.cpp reporting that the insertion point returned by the target will cause interference.

The root cause for the all mentioned troubles is an attempt to make the common spill/split logic aware of (and involved in) the sophisticated and target-specific details of the control flow implementation.

We have 2 options to address this issue:

  1. Further tune isBasicBlockPrologue. We could pass the LeaveBefore slot index as an argument and return "false" if the current instruction matches this slot. This could work if we only pass a valid slot index from the exact call site (SplitKit.cpp) but use an invalid slot index as a default. Although, this would require changes in the common code in the split kit, basic regalloc etc.
  2. Restore the EXEC mask before the block where the CF converges, i.e., on the post-dominance frontier. That requires a massive change along the whole CF lowering stack but puts aside all interactions with the insertion position.

What was done

We opted for the 2nd one. We made the EXEC manipulation at the divergence point conditional which allows us to change the EXEC for the conditional block but leave it unchanged along the fall-through path. Hence, we only need to restore EXEC at the end of the conditional block.


Patch is 10.24 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108596.diff

329 Files Affected:

  • (modified) clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu (+1-1)
  • (modified) llvm/include/llvm/Analysis/CFGPrinter.h (+5-3)
  • (modified) llvm/include/llvm/CodeGen/MachineBasicBlock.h (+1-2)
  • (modified) llvm/include/llvm/CodeGen/TargetInstrInfo.h (+1-2)
  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+1-1)
  • (modified) llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp (+1-2)
  • (modified) llvm/lib/CodeGen/InlineSpiller.cpp (+1-1)
  • (modified) llvm/lib/CodeGen/MachineBasicBlock.cpp (+2-2)
  • (modified) llvm/lib/CodeGen/SplitKit.cpp (+2-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp (+5-4)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+10-12)
  • (modified) llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp (+46-37)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+28-5)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+40-23)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+1-2)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+10-3)
  • (modified) llvm/lib/Target/AMDGPU/SILowerControlFlow.cpp (+202-314)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeExecMasking.cpp (+36)
  • (modified) llvm/lib/Target/AMDGPU/SIOptimizeVGPRLiveRange.cpp (+1-1)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/hidden-diverge-gmir.mir (+8-8)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/temporal-divergence.mir (+7-28)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/MIR/uses-value-from-cycle.mir (+2-4)
  • (modified) llvm/test/Analysis/UniformityAnalysis/AMDGPU/deprecated/hidden-diverge.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmax.ll (+158-156)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/atomicrmw_fmin.ll (+158-156)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-phis-no-lane-mask-merging.mir (+3-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.ll (+95-74)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-divergent-i1-used-outside-loop.mir (+37-61)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+94-62)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.mir (+50-76)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-i1.mir (+6-18)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-temporal-divergent-reg.mir (-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergent-control-flow.ll (+23-14)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+48-32)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-no-rtn.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/global-atomic-fadd.f64.ll (+2-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/image-waterfall-loop-O0.ll (+34-42)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-atomicrmw.ll (+12-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-function-args.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/is-safe-to-sink-bug.ll (+8-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-amdgcn.if.xfail.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-brcond.mir (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+60-68)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.make.buffer.rsrc.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.load.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.format.f32.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.buffer.store.ll (+24-60)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.load.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f16.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.buffer.store.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.f16.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.i8.ll (+12-30)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.ptr.tbuffer.store.ll (+10-25)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.f16.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.i8.ll (+18-45)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.raw.tbuffer.store.ll (+20-50)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.s.buffer.load.ll (+96-240)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.add.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.cmpswap.ll (+16-40)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.format.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.load.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.format.f32.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.buffer.store.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.add.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.cmpswap.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.ll (+8-20)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.format.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.format.f32.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.buffer.store.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.f16.ll (+4-10)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.ptr.tbuffer.load.ll (+2-5)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.f16.ll (+6-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.struct.tbuffer.load.ll (+4-10)
  • (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i32.ll (+6-6)
  • (renamed) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wave.reconverge.i64.ll (+5-6)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+196-168)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memmove.ll (+13-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/localizer.ll (+23-22)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+22-13)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/non-entry-alloca.ll (+23-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn-s-buffer-load.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.32.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.else.64.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.load.1d.ll (+8-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.image.sample.1d.ll (+12-24)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.raw.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.s.buffer.load.ll (+48-72)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.load.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-amdgcn.struct.ptr.buffer.store.ll (+6-12)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-waterfall-agpr.mir (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+262-257)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+225-218)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udiv.i64.ll (+80-73)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/urem.i64.ll (+76-69)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/vni8-across-blocks.ll (+79-57)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+21-13)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll (+333-240)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+1264-1000)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+3478-2793)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_pixelshader.ll (+199-155)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_raw_buffer.ll (+286-200)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll (+286-200)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+114-92)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-nand.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bb-prolog-spill-during-regalloc.ll (+25-21)
  • (modified) llvm/test/CodeGen/AMDGPU/block-should-not-be-in-alive-blocks.mir (+13-14)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-folding-implicit-def-subreg.ll (+306-278)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+50-35)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+1508-1508)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+1112-1101)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+1112-1101)
  • (modified) llvm/test/CodeGen/AMDGPU/bug-sdag-emitcopyfromreg.ll (+5-65)
  • (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+60-36)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-flat.ll (+140-92)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx1030.ll (-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-gfx908.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/codegen-prepare-addrspacecast-non-null.ll (+32-28)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.ll (+493-474)
  • (modified) llvm/test/CodeGen/AMDGPU/collapse-endcf.mir (+281-198)
  • (modified) llvm/test/CodeGen/AMDGPU/constant-fold-imm-immreg.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cross-block-use-is-not-abi-copy.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cse-convergent.ll (+14-13)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-divergence-atomic.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcombine-v1i8-extractvecelt-crash.ll (+5-3)
  • (modified) llvm/test/CodeGen/AMDGPU/div_i128.ll (+394-442)
  • (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+405-349)
  • (modified) llvm/test/CodeGen/AMDGPU/divergent-branch-uniform-condition.ll (+13-8)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dpp_combine_gfx11.mir (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/early-tailduplicator-terminator.mir (+7-9)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-subvector.ll (+77-42)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+2595-2484)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+1395-1395)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+1395-1395)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+1720-1720)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.f16.ll (+22-15)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-combines.new.ll (+12-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fold-fabs.ll (+26-11)
  • (modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+6-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+338-204)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+11-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-no-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f32-rtn.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomic-fadd.f64.ll (-2)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd-wrong-subtarget.ll (+11-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+2439-2426)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+1641-1641)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+1641-1641)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+1994-1994)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+320-272)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+549-534)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+1711-1212)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+1097-793)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+1097-793)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+1665-1186)
  • (modified) llvm/test/CodeGen/AMDGPU/i1-copy-from-loop.ll (+14-12)
  • (modified) llvm/test/CodeGen/AMDGPU/i1_copy_phi_with_phi_incoming_value.mir (+8-19)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+245-302)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+298-290)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-call.ll (+110-131)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+18-10)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+120-91)
  • (modified) llvm/test/CodeGen/AMDGPU/issue92561-restore-undef-scc-verifier-error.ll (+9-9)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.bf.ll (+74-40)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.ll (+459-255)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+48-36)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-global-non-entry-func.ll (+110-66)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.f16.ll (+10-15)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.format.ll (+12-18)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.load.ll (+70-105)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f16.ll (+22-33)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll (+24-36)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.ll (+66-99)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.f16.ll (+16-24)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.load.ll (+18-27)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.f16.ll (+20-30)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.tbuffer.store.ll (+50-75)
  • (modified) llvm/test/CodeGen/AMDGPU/legalize-soffset-mbuf.ll (+64-96)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i32.ll (+14-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.inverse.ballot.i64.ll (+14-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll (+40-29)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.atomic.buffer.load.ll (+40-29)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.atomic.fadd.v2bf16.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+158-108)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.softwqm.ll (+20-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+44-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.buffer.load.format.v3f16.ll (+17-15)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+44-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd.v2bf16.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_nortn.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fadd_rtn.ll (+20-20)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f32.ll (+34-32)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmax.f64.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.buffer.atomic.fmin.f32.ll (+34-32)
diff --git a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
index 946927d88a1ee1..3ca766755a6319 100644
--- a/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
+++ b/clang/test/CodeGenCUDA/atomics-remarks-gfx90a.cu
@@ -10,7 +10,7 @@
 // GFX90A-CAS: A compare and swap loop was generated for an atomic fadd operation at system memory scope
 // GFX90A-CAS-LABEL: _Z14atomic_add_casPf
 // GFX90A-CAS:  flat_atomic_cmpswap
-// GFX90A-CAS:  s_cbranch_execnz
+// GFX90A-CAS:  s_cbranch_scc1
 __device__ float atomic_add_cas(float *p) {
   return __atomic_fetch_add(p, 1.0f, memory_order_relaxed);
 }
diff --git a/llvm/include/llvm/Analysis/CFGPrinter.h b/llvm/include/llvm/Analysis/CFGPrinter.h
index cd785331d1f146..e24a9110d596ca 100644
--- a/llvm/include/llvm/Analysis/CFGPrinter.h
+++ b/llvm/include/llvm/Analysis/CFGPrinter.h
@@ -272,9 +272,11 @@ struct DOTGraphTraits<DOTFuncInfo *> : public DefaultDOTGraphTraits {
     unsigned OpNo = I.getSuccessorIndex();
     const Instruction *TI = Node->getTerminator();
     BasicBlock *SuccBB = TI->getSuccessor(OpNo);
-    auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
-    double WeightPercent = ((double)BranchProb.getNumerator()) /
-                           ((double)BranchProb.getDenominator());
+    // auto BranchProb = CFGInfo->getBPI()->getEdgeProbability(Node, SuccBB);
+    // double WeightPercent = ((double)BranchProb.getNumerator()) /
+    //                        ((double)BranchProb.getDenominator());
+    double WeightPercent = 0.5;
+
     std::string TTAttr =
         formatv("tooltip=\"{0} -> {1}\\nProbability {2:P}\" ", getBBName(Node),
                 getBBName(SuccBB), WeightPercent);
diff --git a/llvm/include/llvm/CodeGen/MachineBasicBlock.h b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
index 6efb17c55493a9..9fcda791fb4c72 100644
--- a/llvm/include/llvm/CodeGen/MachineBasicBlock.h
+++ b/llvm/include/llvm/CodeGen/MachineBasicBlock.h
@@ -879,8 +879,7 @@ class MachineBasicBlock
   /// debug.  This is the correct point to insert copies at the beginning of a
   /// basic block. \p Reg is the register being used by a spill or defined for a
   /// restore/split during register allocation.
-  iterator SkipPHIsLabelsAndDebug(iterator I, Register Reg = Register(),
-                                  bool SkipPseudoOp = true);
+  iterator SkipPHIsLabelsAndDebug(iterator I, bool SkipPseudoOp = true);
 
   /// Returns an iterator to the first terminator instruction of this basic
   /// block. If a terminator does not exist, it returns end().
diff --git a/llvm/include/llvm/CodeGen/TargetInstrInfo.h b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
index 49ce13dd8cbe39..984850980f4c9c 100644
--- a/llvm/include/llvm/CodeGen/TargetInstrInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetInstrInfo.h
@@ -2058,8 +2058,7 @@ class TargetInstrInfo : public MCInstrInfo {
   /// other instructions shall be inserted before it. This can be implemented
   /// to prevent register allocator to insert spills for \p Reg before such
   /// instructions.
-  virtual bool isBasicBlockPrologue(const MachineInstr &MI,
-                                    Register Reg = Register()) const {
+  virtual bool isBasicBlockPrologue(const MachineInstr &MI) const {
     return false;
   }
 
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index dc13a35c66f9ab..195ceb64eae4a8 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -3098,7 +3098,7 @@ def int_amdgcn_loop : Intrinsic<[llvm_i1_ty],
   [llvm_anyint_ty], [IntrWillReturn, IntrNoCallback, IntrNoFree]
 >;
 
-def int_amdgcn_end_cf : Intrinsic<[], [llvm_anyint_ty],
+def int_amdgcn_wave_reconverge : Intrinsic<[], [llvm_anyint_ty],
   [IntrWillReturn, IntrNoCallback, IntrNoFree]>;
 
 // Represent unreachable in a divergent region.
diff --git a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
index 3bb9da5f1a37bb..184b493694894d 100644
--- a/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
+++ b/llvm/lib/CodeGen/FixupStatepointCallerSaved.cpp
@@ -461,8 +461,7 @@ class StatepointState {
 
       if (EHPad && !RC.hasReload(Reg, RegToSlotIdx[Reg], EHPad)) {
         RC.recordReload(Reg, RegToSlotIdx[Reg], EHPad);
-        auto EHPadInsertPoint =
-            EHPad->SkipPHIsLabelsAndDebug(EHPad->begin(), Reg);
+        auto EHPadInsertPoint = EHPad->SkipPHIsLabelsAndDebug(EHPad->begin());
         insertReloadBefore(Reg, EHPadInsertPoint, EHPad);
         LLVM_DEBUG(dbgs() << "...also reload at EHPad "
                           << printMBBReference(*EHPad) << "\n");
diff --git a/llvm/lib/CodeGen/InlineSpiller.cpp b/llvm/lib/CodeGen/InlineSpiller.cpp
index 81ae805d64e1ec..201d3a5df3a536 100644
--- a/llvm/lib/CodeGen/InlineSpiller.cpp
+++ b/llvm/lib/CodeGen/InlineSpiller.cpp
@@ -463,7 +463,7 @@ bool InlineSpiller::hoistSpillInsideBB(LiveInterval &SpillLI,
   MachineBasicBlock *MBB = LIS.getMBBFromIndex(SrcVNI->def);
   MachineBasicBlock::iterator MII;
   if (SrcVNI->isPHIDef())
-    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin(), SrcReg);
+    MII = MBB->SkipPHIsLabelsAndDebug(MBB->begin());
   else {
     MachineInstr *DefMI = LIS.getInstructionFromIndex(SrcVNI->def);
     assert(DefMI && "Defining instruction disappeared");
diff --git a/llvm/lib/CodeGen/MachineBasicBlock.cpp b/llvm/lib/CodeGen/MachineBasicBlock.cpp
index 5d06af3ebf3360..419d7e0312ae08 100644
--- a/llvm/lib/CodeGen/MachineBasicBlock.cpp
+++ b/llvm/lib/CodeGen/MachineBasicBlock.cpp
@@ -223,13 +223,13 @@ MachineBasicBlock::SkipPHIsAndLabels(MachineBasicBlock::iterator I) {
 
 MachineBasicBlock::iterator
 MachineBasicBlock::SkipPHIsLabelsAndDebug(MachineBasicBlock::iterator I,
-                                          Register Reg, bool SkipPseudoOp) {
+                                          bool SkipPseudoOp) {
   const TargetInstrInfo *TII = getParent()->getSubtarget().getInstrInfo();
 
   iterator E = end();
   while (I != E && (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
                     (SkipPseudoOp && I->isPseudoProbe()) ||
-                    TII->isBasicBlockPrologue(*I, Reg)))
+                    TII->isBasicBlockPrologue(*I)))
     ++I;
   // FIXME: This needs to change if we wish to bundle labels / dbg_values
   // inside the bundle.
diff --git a/llvm/lib/CodeGen/SplitKit.cpp b/llvm/lib/CodeGen/SplitKit.cpp
index b671e510387530..22991a0fb4cb1e 100644
--- a/llvm/lib/CodeGen/SplitKit.cpp
+++ b/llvm/lib/CodeGen/SplitKit.cpp
@@ -806,10 +806,8 @@ SlotIndex SplitEditor::leaveIntvAtTop(MachineBasicBlock &MBB) {
     return Start;
   }
 
-  unsigned RegIdx = 0;
-  Register Reg = LIS.getInterval(Edit->get(RegIdx)).reg();
-  VNInfo *VNI = defFromParent(RegIdx, ParentVNI, Start, MBB,
-                              MBB.SkipPHIsLabelsAndDebug(MBB.begin(), Reg));
+  VNInfo *VNI = defFromParent(0, ParentVNI, Start, MBB,
+                              MBB.SkipPHIsLabelsAndDebug(MBB.begin()));
   RegAssign.insert(Start, VNI->def, OpenIdx);
   LLVM_DEBUG(dump());
   return VNI->def;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
index 3fcb364fc2c536..c0d2853d159882 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
@@ -1551,11 +1551,12 @@ bool AMDGPUInstructionSelector::selectReturnAddress(MachineInstr &I) const {
   return true;
 }
 
-bool AMDGPUInstructionSelector::selectEndCfIntrinsic(MachineInstr &MI) const {
+bool AMDGPUInstructionSelector::selectWaveReconvergeIntrinsic(
+    MachineInstr &MI) const {
   // FIXME: Manually selecting to avoid dealing with the SReg_1 trick
   // SelectionDAG uses for wave32 vs wave64.
   MachineBasicBlock *BB = MI.getParent();
-  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_END_CF))
+  BuildMI(*BB, &MI, MI.getDebugLoc(), TII.get(AMDGPU::SI_WAVE_RECONVERGE))
       .add(MI.getOperand(1));
 
   Register Reg = MI.getOperand(1).getReg();
@@ -2083,8 +2084,8 @@ bool AMDGPUInstructionSelector::selectG_INTRINSIC_W_SIDE_EFFECTS(
     MachineInstr &I) const {
   Intrinsic::ID IntrinsicID = cast<GIntrinsic>(I).getIntrinsicID();
   switch (IntrinsicID) {
-  case Intrinsic::amdgcn_end_cf:
-    return selectEndCfIntrinsic(I);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return selectWaveReconvergeIntrinsic(I);
   case Intrinsic::amdgcn_ds_ordered_add:
   case Intrinsic::amdgcn_ds_ordered_swap:
     return selectDSOrderedIntrinsic(I, IntrinsicID);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
index 068db5c1c14496..c3ba26590dfbcf 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h
@@ -116,7 +116,7 @@ class AMDGPUInstructionSelector final : public InstructionSelector {
   bool selectReturnAddress(MachineInstr &I) const;
   bool selectG_INTRINSIC(MachineInstr &I) const;
 
-  bool selectEndCfIntrinsic(MachineInstr &MI) const;
+  bool selectWaveReconvergeIntrinsic(MachineInstr &MI) const;
   bool selectDSOrderedIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSGWSIntrinsic(MachineInstr &MI, Intrinsic::ID IID) const;
   bool selectDSAppendConsume(MachineInstr &MI, bool IsAppend) const;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 4737a322c255f4..1d2ee6a4c96514 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -785,8 +785,6 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
   const TargetRegisterClass *WaveRC = TRI->getWaveMaskRegClass();
   const unsigned MovExecOpc =
       Subtarget.isWave32() ? AMDGPU::S_MOV_B32 : AMDGPU::S_MOV_B64;
-  const unsigned MovExecTermOpc =
-      Subtarget.isWave32() ? AMDGPU::S_MOV_B32_term : AMDGPU::S_MOV_B64_term;
 
   const unsigned XorTermOpc = Subtarget.isWave32() ?
     AMDGPU::S_XOR_B32_term : AMDGPU::S_XOR_B64_term;
@@ -949,27 +947,27 @@ bool AMDGPURegisterBankInfo::executeInWaterfallLoop(
 
   B.setInsertPt(*BodyBB, BodyBB->end());
 
+  Register LoopMask = MRI.createVirtualRegister(
+      TRI->getRegClass(AMDGPU::SReg_1_XEXECRegClassID));
   // Update EXEC, switch all done bits to 0 and all todo bits to 1.
   B.buildInstr(XorTermOpc)
-    .addDef(ExecReg)
-    .addReg(ExecReg)
-    .addReg(NewExec);
+    .addDef(LoopMask)
+	.addReg(ExecReg)
+	.addReg(NewExec);
 
   // XXX - s_xor_b64 sets scc to 1 if the result is nonzero, so can we use
   // s_cbranch_scc0?
 
   // Loop back to V_READFIRSTLANE_B32 if there are still variants to cover.
-  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP).addMBB(LoopBB);
+  B.buildInstr(AMDGPU::SI_WATERFALL_LOOP)
+      .addReg(LoopMask)
+      .addReg(NewExec)
+      .addMBB(LoopBB);
 
   // Save the EXEC mask before the loop.
   BuildMI(MBB, MBB.end(), DL, TII->get(MovExecOpc), SaveExecReg)
     .addReg(ExecReg);
 
-  // Restore the EXEC mask after the loop.
-  B.setMBB(*RestoreExecBB);
-  B.buildInstr(MovExecTermOpc)
-    .addDef(ExecReg)
-    .addReg(SaveExecReg);
 
   // Set the insert point after the original instruction, so any new
   // instructions will be in the remainder.
@@ -4967,7 +4965,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
       OpdsMapping[1] = AMDGPU::getValueMapping(Bank, 32);
       break;
     }
-    case Intrinsic::amdgcn_end_cf: {
+    case Intrinsic::amdgcn_wave_reconverge: {
       unsigned Size = getSizeInBits(MI.getOperand(1).getReg(), MRI, *TRI);
       OpdsMapping[1] = AMDGPU::getValueMapping(AMDGPU::SGPRRegBankID, Size);
       break;
diff --git a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
index edd881c84078c6..cd8cbcc7f689d4 100644
--- a/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
+++ b/llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
@@ -14,6 +14,7 @@
 #include "AMDGPU.h"
 #include "AMDGPUTargetMachine.h"
 #include "GCNSubtarget.h"
+#include "llvm/Analysis/DomTreeUpdater.h"
 #include "llvm/Analysis/LoopInfo.h"
 #include "llvm/Analysis/UniformityAnalysis.h"
 #include "llvm/CodeGen/TargetPassConfig.h"
@@ -55,7 +56,7 @@ class SIAnnotateControlFlow {
   Function *Else;
   Function *IfBreak;
   Function *Loop;
-  Function *EndCf;
+  Function *WaveReconverge;
 
   DominatorTree *DT;
   StackVector Stack;
@@ -88,7 +89,7 @@ class SIAnnotateControlFlow {
 
   bool handleLoop(BranchInst *Term);
 
-  bool closeControlFlow(BasicBlock *BB);
+  bool tryWaveReconverge(BasicBlock *BB);
 
 public:
   SIAnnotateControlFlow(Module &M, const GCNSubtarget &ST, DominatorTree &DT,
@@ -123,7 +124,8 @@ void SIAnnotateControlFlow::initialize(Module &M, const GCNSubtarget &ST) {
   IfBreak = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_if_break,
                                       { IntMask });
   Loop = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_loop, { IntMask });
-  EndCf = Intrinsic::getDeclaration(&M, Intrinsic::amdgcn_end_cf, { IntMask });
+  WaveReconverge = Intrinsic::getDeclaration(
+      &M, Intrinsic::amdgcn_wave_reconverge, {IntMask});
 }
 
 /// Is the branch condition uniform or did the StructurizeCFG pass
@@ -185,8 +187,6 @@ bool SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
 
 /// Open a new "If" block
 bool SIAnnotateControlFlow::openIf(BranchInst *Term) {
-  if (isUniform(Term))
-    return false;
 
   IRBuilder<> IRB(Term);
   Value *IfCall = IRB.CreateCall(If, {Term->getCondition()});
@@ -287,43 +287,43 @@ bool SIAnnotateControlFlow::handleLoop(BranchInst *Term) {
 }
 
 /// Close the last opened control flow
-bool SIAnnotateControlFlow::closeControlFlow(BasicBlock *BB) {
-  llvm::Loop *L = LI->getLoopFor(BB);
-
-  assert(Stack.back().first == BB);
+bool SIAnnotateControlFlow::tryWaveReconverge(BasicBlock *BB) {
 
-  if (L && L->getHeader() == BB) {
-    // We can't insert an EndCF call into a loop header, because it will
-    // get executed on every iteration of the loop, when it should be
-    // executed only once before the loop.
-    SmallVector <BasicBlock *, 8> Latches;
-    L->getLoopLatches(Latches);
+  if (succ_empty(BB))
+    return false;
 
+  BranchInst *Term = dyn_cast<BranchInst>(BB->getTerminator());
+  if (Term->getNumSuccessors() == 1) {
+    // The current BB's single successor is the top of the stack. We need to
+    // reconverge over that path.
+    BasicBlock *SingleSucc = *succ_begin(BB);
+    BasicBlock::iterator InsPt = Term ? BasicBlock::iterator(Term) : BB->end();
+    if (isTopOfStack(SingleSucc)) {
+      Value *Exec = Stack.back().second;
+      IRBuilder<>(BB, InsPt).CreateCall(WaveReconverge, {Exec});
+    }
+  } else {
+    // We have a uniform conditional branch terminating the block.
+    // This block may be the last in the Then path of the enclosing divergent
+    // IF.
+    if (!isUniform(Term))
+      // Divergent loop is going to be further processed in another place
+      return false;
+
+    for (auto Succ : Term->successors()) {
+      if (isTopOfStack(Succ)) {
+        // Just split to make room for further WAVE_RECONVERGE insertion
     SmallVector<BasicBlock *, 2> Preds;
-    for (BasicBlock *Pred : predecessors(BB)) {
-      if (!is_contained(Latches, Pred))
-        Preds.push_back(Pred);
+        for (auto P : predecessors(Succ)) {
+          if (DT->dominates(BB, P))
+            Preds.push_back(P);
     }
-
-    BB = SplitBlockPredecessors(BB, Preds, "endcf.split", DT, LI, nullptr,
+        DomTreeUpdater DTU(DT, DomTreeUpdater::UpdateStrategy::Eager);
+        SplitBlockPredecessors(Succ, Preds, ".reconverge", &DTU, LI, nullptr,
                                 false);
   }
 
-  Value *Exec = popSaved();
-  BasicBlock::iterator FirstInsertionPt = BB->getFirstInsertionPt();
-  if (!isa<UndefValue>(Exec) && !isa<UnreachableInst>(FirstInsertionPt)) {
-    Instruction *ExecDef = cast<Instruction>(Exec);
-    BasicBlock *DefBB = ExecDef->getParent();
-    if (!DT->dominates(DefBB, BB)) {
-      // Split edge to make Def dominate Use
-      FirstInsertionPt = SplitEdge(DefBB, BB, DT, LI)->getFirstInsertionPt();
     }
-    IRBuilder<> IRB(FirstInsertionPt->getParent(), FirstInsertionPt);
-    // TODO: StructurizeCFG 'Flow' blocks have debug locations from the
-    // condition, for now just avoid copying these DebugLocs so that stepping
-    // out of the then/else block in a debugger doesn't step to the condition.
-    IRB.SetCurrentDebugLocation(DebugLoc());
-    IRB.CreateCall(EndCf, {Exec});
   }
 
   return true;
@@ -341,14 +341,18 @@ bool SIAnnotateControlFlow::run(Function &F) {
 
     if (!Term || Term->isUnconditional()) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      Changed |= tryWaveReconverge(BB);
 
       continue;
     }
 
     if (I.nodeVisited(Term->getSuccessor(1))) {
       if (isTopOfStack(BB))
-        Changed |= closeControlFlow(BB);
+        Stack.pop_back();
+      // Let's take care of uniform loop latch that may be closing the Then
+      // path of the enclosing divergent branch.
+      Changed |= tryWaveReconverge(BB);
 
       if (DT->dominates(Term->getSuccessor(1), BB))
         Changed |= handleLoop(Term);
@@ -363,9 +367,14 @@ bool SIAnnotateControlFlow::run(Function &F) {
         continue;
       }
 
-      Changed |= closeControlFlow(BB);
+      Stack.pop_back();
     }
 
+    if (isUniform(Term))
+      // Uniform conditional branch may be in the block that closes the Then
+      // path of the divergent conditional branch.
+      Changed |= tryWaveReconverge(BB);
+    else
     Changed |= openIf(Term);
   }
 
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 81b52935ddf397..ae3b849a55ff2e 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6475,7 +6475,7 @@ unsigned SITargetLowering::isCFIntrinsic(const SDNode *Intr) const {
       return AMDGPUISD::ELSE;
     case Intrinsic::amdgcn_loop:
       return AMDGPUISD::LOOP;
-    case Intrinsic::amdgcn_end_cf:
+    case Intrinsic::amdgcn_wave_reconverge:
       llvm_unreachable("should not occur");
     default:
       return 0;
@@ -9848,9 +9848,10 @@ SDValue SITargetLowering::LowerINTRINSIC_VOID(SDValue Op,
 
     return SDValue(Load, 0);
   }
-  case Intrinsic::amdgcn_end_cf:
-    return SDValue(DAG.getMachineNode(AMDGPU::SI_END_CF, DL, MVT::Other,
-                                      Op->getOperand(2), Chain), 0);
+  case Intrinsic::amdgcn_wave_reconverge:
+    return SDValue(DAG.getMachineNode(AMDGPU::SI_WAVE_RECONVERGE, DL,
+                                      MVT::Other, Op->getOperand(2), Chain),
+                   0);
   case Intrinsic::amdgcn_s_barrier_init:
   case Intrinsic::amdgcn_s_barrier_join:
   case Intrinsic::amdgcn_s_wakeup_barrier: {
@@ -15693,6 +15694,28 @@ void SITargetLowering::finalizeLowering(MachineFunction &MF) const {
     }
   }
 
+  // ISel inserts copy to regs for the successor PHIs
+  // at the BB end. We need to move the SI_WAVE_RECONVERGE right before the
+  // branch.
+  for (auto &MBB : MF) {
+    for (auto &MI : MBB) {
+      if (MI.getOpcode() == AMDGPU::SI_WAVE_RECONVERGE) {
+        MachineBasicBlock::iterator I(MI);
+        MachineBasicBlock::iterator Next = std::next(I);
+        bool NeedToMove = false;
+        while (Next != MBB.end() && !Next->isBranch()) {
+          NeedToMove = true;
+          Next++;
+        }
+        assert((Next == MBB.end() || !Next->readsRegister(AMDGPU::SCC, TRI)) &&
+               "Malformed CFG detected!\n");
+        if (NeedToMove) {
+          MBB.splice(Next, &MBB, &MI);
+        }
+        break;
+      }
+    }
+  }
   // FIXME: This is a hack to fixup AGPR classes to use the properly aligned
   // classes if required. Ideally the register class constraints would differ
   // per-subtarget, but there's no easy way to achieve that right now. This is
@@ -16451,7 +16474,7 @@ static bool hasCFUser(const Value *V, SmallPtrSet<const Value *, 16> &Visited,
         default:
           Result = false;
           break;
-        case Intrinsic::amdgcn_end_cf:
+        case Intrinsic::amdgcn_wave_reconverge:
         case Intrinsic::amdgcn_loop:
           Result = true;
           break;
diff...
[truncated]

@@ -196,8 +208,10 @@ define amdgpu_kernel void @add_i32_constant(ptr addrspace(1) %out, ptr addrspace
; GFX11W32-NEXT: v_mbcnt_lo_u32_b32 v0, s1, 0
; GFX11W32-NEXT: ; implicit-def: $vgpr1
; GFX11W32-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11W32-NEXT: v_cmpx_eq_u32_e32 0, v0
; GFX11W32-NEXT: s_cbranch_execz .LBB0_2
; GFX11W32-NEXT: v_cmp_eq_u32_e32 vcc_lo, 0, v0
Contributor

The codegen seems significantly worse for simple conditional branches like this, especially on GFX10.3+ where we try to use v_cmpx. Note that there is special hardware to handle v_cmpx followed by a VALU instruction without the kind of pipeline stall that you would normally get when you write to EXEC.

To improve matters a bit, could we aim for codegen like this?

v_cmp_eq_u32_e32 vcc_lo, 0, v0
s_cbranch_vccz .LBB0_2
s_mov_b32 exec_lo, vcc_lo

This saves one instruction overall and moves the exec modification into the body of the "if".

Collaborator

That doesn't actually help much, because now we have an additional stall before the next VALU instruction.

@nhaehnle
Collaborator

I agree with Jay that this change isn't acceptable from a codegen quality point of view.

Then it appeared that instructions loading the values spilled to the memory and used in the current block must be placed at the block beginning before they are used but after the point where EXEC mask is restored, since they are loading VGPRs and, hence, read EXEC. Hence, we had to consider all spilling opcodes to belong to the block prologue.

That does not seem logical. The restores themselves only read EXEC, they do not write it. So why should they be part of the prolog?

It seems the whole problem could be avoided by not considering VGPR restores to be part of the block prolog?

@alex-t
Contributor Author

alex-t commented Sep 16, 2024

I agree with Jay that this change isn't acceptable from a codegen quality point of view.

Then it appeared that instructions loading the values spilled to the memory and used in the current block must be placed at the block beginning before they are used but after the point where EXEC mask is restored, since they are loading VGPRs and, hence, read EXEC. Hence, we had to consider all spilling opcodes to belong to the block prologue.

That does not seem logical. The restores themselves only read EXEC, they do not write it. So why should they be part of the prolog?

It seems the whole problem could be avoided by not considering VGPR restores to be part of the block prolog?

Let me describe the complete case. Register allocation is split into 2 steps - the first run allocates SGPRs and the second one takes care of VGPRs. It might happen that during SGPR allocation we have to spill the register containing the EXEC mask value to memory. Hence, we have to load it back at the beginning of the flow block BEFORE the OR-ing with the current EXEC, as it is a live-in value that is an operand of the OR. As soon as that reload is inserted, the prologue sequence is broken, and during VGPR allocation any spill/reload will be placed before the point at which the exec mask is restored.

The issue was addressed here: #69924 (also here https://reviews.llvm.org/D145329).
Finally, the decision was made to consider all spill/reload opcodes as belonging to the block prologue.

Although, revisiting this now, I still don't understand why they decided to include ALL spill opcodes in the prologue rather than only the SGPR spills. Clearly, none of the VGPR reloads really belong to the prologue.

At first glance, changing isSpill(Opcode) to isSGPRSpill(Opcode) in the snippet below would solve the initial case.

return IsNullOrVectorRegister &&
       (isSpill(Opcode) || (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
                            MI.modifiesRegister(AMDGPU::EXEC, &RI)));

I need to look at this a bit more. I am sure they would have done this if such a simple change had solved the problem.
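A minimal sketch of what that one-token change would look like, written against the current shape of SIInstrInfo::isBasicBlockPrologue and assuming an isSGPRSpill(Opcode) predicate exists alongside isSpill(Opcode) (an illustration under those assumptions, not the patch's code):

bool SIInstrInfo::isBasicBlockPrologue(const MachineInstr &MI,
                                       Register Reg) const {
  // Scalar spills/reloads may stay in the prologue: they do not read EXEC.
  bool IsNullOrVectorRegister = true;
  if (Reg) {
    const MachineRegisterInfo &MRI = MI.getMF()->getRegInfo();
    IsNullOrVectorRegister = !RI.isSGPRClass(RI.getRegClassForReg(MRI, Reg));
  }

  uint16_t Opcode = MI.getOpcode();
  // was: isSpill(Opcode). VGPR reloads read EXEC but do not have to precede
  // the instruction that restores it, so they no longer count as prologue.
  return IsNullOrVectorRegister &&
         (isSGPRSpill(Opcode) ||
          (!MI.isTerminator() && Opcode != AMDGPU::COPY &&
           MI.modifiesRegister(AMDGPU::EXEC, &RI)));
}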

@alex-t
Contributor Author

alex-t commented Sep 16, 2024

Honestly, I don't like the way this change affects code generation quality. I would never have proposed it if we had another way to achieve correctness. Currently, we have a compiler that either produces incorrect code or fails to compile valid input.
As for the long-term solution, I would look to avoid any code insertion after the CF pseudos are lowered to exec mask manipulation. The LLVM register allocation framework was designed for CPU targets and assumes a CF model based on branches. The spill/split placement mechanisms make their decisions based on edge bundles as boundaries for the in/out values. In our case, for the flow blocks, the input boundary is in fact inside the block, right after the exec-mask-restoring instruction. So it seems like we are going to keep having more and more trouble with spill/split insertions until we move this logic out of there.

@nhaehnle
Collaborator

Thank you for going into a bit more detail. It does seem like distinguishing between SGPR reloads and VGPR reloads would help.

Hmm, what about a case where we run out of VGPRs to spill SGPRs into? What do we actually do in that case today?

@alex-t
Contributor Author

alex-t commented Sep 17, 2024

Thank you for going into a bit more detail. It does seem like distinguishing between SGPR reloads and VGPR reloads would help.

Hmm, what about a case where we run out of VGPRs to spill SGPRs into? What do we actually do in that case today?

Not sure if I understand your concern. We allocate SGPRs in a separate pass before VGPRs, so all of the VGPRs available on a given target can be used to spill SGPRs. I can hardly imagine a kernel that requires 256x32 SGPRs.

@nhaehnle
Collaborator

Somebody recently mentioned a test case that ended up producing hundreds of MBs or even GBs of code :)

Maybe the answer is that we don't support it and just report a compiler error. That may be a reasonable thing to do, but in any case having certainty about it would help the discussion here.

@cmc-rep cmc-rep self-requested a review September 20, 2024 01:40
@jayfoad
Contributor

jayfoad commented Sep 20, 2024

Although, revisiting this now, I still don't understand why they decided to include ALL spill opcodes in the prologue, but not only the SGPR spills? Clearly, none of the VGPR reloads really belong to the prologue.

At a first glance, changing the isSpill(opcode) to isSGPRSpill(opcode) in the snippet below would solve the initial case. return IsNullOrVectorRegister && (isSpill(Opcode) || (!MI.isTerminator() && Opcode != AMDGPU::COPY && MI.modifiesRegister(AMDGPU::EXEC, &RI))); }

I need to look at this a bit more. I am sure they would have done this if such a simple change had solved the problem.

I like this change - it fixes the problem I reported in #109294.

Contributor

@ruiling ruiling left a comment

With one trick mentioned inline, we can remove one more instruction. In most cases, we would introduce one scalar instruction in the if-block and one SALU in the flow-block compared with the existing solution.

The block prologue is an unclear concept and not well implemented. And we don't know if there are any gaps in the LLVM core framework that keep the prologue from working correctly.

We first introduced the block prologue to get the insertion point past the s_or_bnn exec for VALU instructions. For SALU instructions, we still allow them to be inserted into the prologue because the s_or_bnn exec needs an input SGPR. For threaded VGPRs, we should always put them after the prologue. But as we may also reload SGPRs from WWM-VGPRs, I think we should also put WWM-VGPR reloads in the block prologue.

Another issue with the implementation is that we should only count the instructions up to the s_or_bnn exec as the prologue. Like in the case below, the last prologue instruction should be the S_OR_B64 exec.

$exec = S_OR_B64 $exec, killed renamable $sgpr48_sgpr49, implicit-def $scc
%129034:vgpr_32 = SI_SPILL_V32_RESTORE %stack.265, $sgpr32, 0, implicit $exec
%129055:vgpr_32 = SI_SPILL_V32_RESTORE %stack.266, $sgpr32, 0, implicit $exec
%129083:vgpr_32 = SI_SPILL_V32_RESTORE %stack.267, $sgpr32, 0, implicit $exec
%129635:vgpr_32 = SI_SPILL_V32_RESTORE %stack.282, $sgpr32, 0, implicit $exec
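Spelled out, this notion of the prologue could look roughly like the helper below - a sketch only, assuming the SGPR-spill predicate mentioned earlier and treating the EXEC restore (e.g. $exec = S_OR_B64 ...) as the last prologue instruction:

static MachineBasicBlock::iterator
skipReconvergePrologue(MachineBasicBlock &MBB, const SIRegisterInfo &TRI) {
  MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
  while (I != E) {
    // PHIs, labels, debug instructions and SGPR reloads (which may feed the
    // mask restore) stay inside the prologue.
    if (I->isPHI() || I->isPosition() || I->isDebugInstr() ||
        SIInstrInfo::isSGPRSpill(*I)) {
      ++I;
      continue;
    }
    // The instruction restoring EXEC ends the prologue; insert after it.
    if (I->modifiesRegister(AMDGPU::EXEC, &TRI))
      return std::next(I);
    // Anything else (e.g. a VGPR reload) is already past the prologue.
    break;
  }
  return I;
}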

Although tuning the isSpill() check helps for this case, we still possibly have other issues related to this. I think it would be better to fix this, but I am not sure it can be easily done.

Meanwhile, I think it would definitely be helpful to figure out whether LiveRangeSplit can work properly with the block prologue, at least for the known cases.

Comment on lines +478 to +480
; GFX11W64-NEXT: s_cmp_lg_u64 vcc, 0
; GFX11W64-NEXT: s_cmov_b64 exec, vcc
; GFX11W64-NEXT: s_cbranch_scc0 .LBB1_2
Contributor

We can merge the s_cmp_lg for updating SCC and the exec save into one s_and_saveexec

s_and_saveexec_b64 s[0:1], vcc
s_cselect_b64 exec, vcc, s[0:1]
s_cbranch_scc0 .LBB1_2

so we don't need the separate s_mov_b64 s[0:1], exec at line 472 to save the exec.

Contributor Author

No, we cannot. S_AND_SAVEEXEC changes the EXEC unconditionally.
The idea is that we only change the exec if we are going to execute the "Then" block but leave it unchanged for the Flow block. All further lowering is based on this assumption.

Comment on lines +984 to +987
; GFX12W32-NEXT: s_xor_b32 s1, vcc_lo, exec_lo
; GFX12W32-NEXT: s_cmp_lg_u32 vcc_lo, 0
; GFX12W32-NEXT: s_cmov_b32 exec_lo, vcc_lo
; GFX12W32-NEXT: s_cbranch_scc0 .LBB2_4
Contributor

This can also be simplified into:
s_and_saveexec_b32 s1, vcc_lo
s_cselect_b32 exec_lo, vcc_lo, s1
s_cbranch_scc0 .LBB2_4

Contributor Author

Won't work for the same reason as above :(

Comment on lines +1434 to +1436
; GFX10-W32-NEXT: s_xor_b32 s13, s14, exec_lo
; GFX10-W32-NEXT: s_cmp_lg_u32 s14, 0
; GFX10-W32-NEXT: s_cmov_b32 exec_lo, s14
Contributor

We are safe to use (no kill in the else region):
s_and_saveexec s13, s14
s_cselect exec, s14, s13
s_cbranch_scc0

@alex-t
Contributor Author

alex-t commented Oct 16, 2024

Although, revisiting this now, I still don't understand why they decided to include ALL spill opcodes in the prologue, but not only the SGPR spills? Clearly, none of the VGPR reloads really belong to the prologue.
At a first glance, changing the isSpill(opcode) to isSGPRSpill(opcode) in the snippet below would solve the initial case. return IsNullOrVectorRegister && (isSpill(Opcode) || (!MI.isTerminator() && Opcode != AMDGPU::COPY && MI.modifiesRegister(AMDGPU::EXEC, &RI))); }
I need to look at this a bit more. I am sure they would have done this if such a simple change had solved the problem.

I like this change - it fixes the problem I reported in #109294.

It fixes the problem but requires us to consider WWM reloads as prologue instructions. Any VGPR reload starts a new live range and possibly introduces interference. CD's reply here (#111496 (comment)) made me hope that it would work. However, I would like to see the corresponding MIR test as proof.
