-
Notifications
You must be signed in to change notification settings - Fork 16k
[AMDGPU] Simplify legalization of PHI operands #177352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In practice when legalizeOperands is called on a PHI node, the result is never an SGPR class and the operands are never subregs. Simplify the code accordingly by using the result regclass for all the inputs. This includes using an AV class where previously we picked either an AGPR or VGPR class.
|
@llvm/pr-subscribers-backend-amdgpu Author: Jay Foad (jayfoad) ChangesIn practice when legalizeOperands is called on a PHI node, the result is Patch is 297.71 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/177352.diff 14 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index d4292744615c5..bd0644ec507c5 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -7317,44 +7317,12 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
return CreatedBB;
}
- // Legalize REG_SEQUENCE and PHI
- // The register class of the operands much be the same type as the register
+ // Legalize PHI
+ // The register class of the operands must be the same type as the register
// class of the output.
if (MI.getOpcode() == AMDGPU::PHI) {
- const TargetRegisterClass *RC = nullptr, *SRC = nullptr, *VRC = nullptr;
- for (unsigned i = 1, e = MI.getNumOperands(); i != e; i += 2) {
- if (!MI.getOperand(i).isReg() || !MI.getOperand(i).getReg().isVirtual())
- continue;
- const TargetRegisterClass *OpRC =
- MRI.getRegClass(MI.getOperand(i).getReg());
- if (RI.hasVectorRegisters(OpRC)) {
- VRC = OpRC;
- } else {
- SRC = OpRC;
- }
- }
-
- // If any of the operands are VGPR registers, then they all most be
- // otherwise we will create illegal VGPR->SGPR copies when legalizing
- // them.
- if (VRC || !RI.isSGPRClass(getOpRegClass(MI, 0))) {
- if (!VRC) {
- assert(SRC);
- if (getOpRegClass(MI, 0) == &AMDGPU::VReg_1RegClass) {
- VRC = &AMDGPU::VReg_1RegClass;
- } else
- VRC = RI.isAGPRClass(getOpRegClass(MI, 0))
- ? RI.getEquivalentAGPRClass(SRC)
- : RI.getEquivalentVGPRClass(SRC);
- } else {
- VRC = RI.isAGPRClass(getOpRegClass(MI, 0))
- ? RI.getEquivalentAGPRClass(VRC)
- : RI.getEquivalentVGPRClass(VRC);
- }
- RC = VRC;
- } else {
- RC = SRC;
- }
+ const TargetRegisterClass *VRC = getOpRegClass(MI, 0);
+ assert(!RI.isSGPRClass(VRC));
// Update all the operands so they have the same type.
for (unsigned I = 1, E = MI.getNumOperands(); I != E; I += 2) {
@@ -7368,7 +7336,7 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
// Avoid creating no-op copies with the same src and dst reg class. These
// confuse some of the machine passes.
- legalizeGenericOperand(*InsertBB, Insert, RC, Op, MRI, MI.getDebugLoc());
+ legalizeGenericOperand(*InsertBB, Insert, VRC, Op, MRI, MI.getDebugLoc());
}
}
diff --git a/llvm/test/CodeGen/AMDGPU/a-v-flat-atomicrmw.ll b/llvm/test/CodeGen/AMDGPU/a-v-flat-atomicrmw.ll
index b8962fa29e8f1..b5f952b0bb00d 100644
--- a/llvm/test/CodeGen/AMDGPU/a-v-flat-atomicrmw.ll
+++ b/llvm/test/CodeGen/AMDGPU/a-v-flat-atomicrmw.ll
@@ -926,12 +926,12 @@ define void @flat_atomic_xchg_i64_ret_av_av(ptr %ptr) #0 {
; GFX90A-LABEL: flat_atomic_xchg_i64_ret_av_av:
; GFX90A: ; %bb.0:
; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-NEXT: v_add_co_u32_e32 v2, vcc, 0x50, v0
+; GFX90A-NEXT: v_add_co_u32_e32 v4, vcc, 0x50, v0
; GFX90A-NEXT: s_mov_b64 s[4:5], src_private_base
-; GFX90A-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
-; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v3
+; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
+; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v5
; GFX90A-NEXT: ;;#ASMSTART
-; GFX90A-NEXT: ; def v[4:5]
+; GFX90A-NEXT: ; def v[2:3]
; GFX90A-NEXT: ;;#ASMEND
; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
@@ -939,23 +939,23 @@ define void @flat_atomic_xchg_i64_ret_av_av(ptr %ptr) #0 {
; GFX90A-NEXT: s_cbranch_execz .LBB14_2
; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.global
; GFX90A-NEXT: buffer_wbl2
-; GFX90A-NEXT: flat_atomic_swap_x2 v[0:1], v[2:3], v[4:5] glc
+; GFX90A-NEXT: flat_atomic_swap_x2 v[0:1], v[4:5], v[2:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
-; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: .LBB14_2: ; %Flow
; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB14_4
; GFX90A-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[2:3]
-; GFX90A-NEXT: v_cndmask_b32_e32 v2, -1, v2, vcc
-; GFX90A-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; GFX90A-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc
+; GFX90A-NEXT: buffer_load_dword v0, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_load_dword v1, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: s_nop 0
-; GFX90A-NEXT: buffer_store_dword v4, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_store_dword v5, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: buffer_store_dword v2, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_store_dword v3, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: .LBB14_4: ; %atomicrmw.phi
; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_waitcnt vmcnt(2)
@@ -1016,12 +1016,12 @@ define void @flat_atomic_xchg_i64_ret_av_v(ptr %ptr) #0 {
; GFX90A-LABEL: flat_atomic_xchg_i64_ret_av_v:
; GFX90A: ; %bb.0:
; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-NEXT: v_add_co_u32_e32 v2, vcc, 0x50, v0
+; GFX90A-NEXT: v_add_co_u32_e32 v4, vcc, 0x50, v0
; GFX90A-NEXT: s_mov_b64 s[4:5], src_private_base
-; GFX90A-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
-; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v3
+; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
+; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v5
; GFX90A-NEXT: ;;#ASMSTART
-; GFX90A-NEXT: ; def v[4:5]
+; GFX90A-NEXT: ; def v[2:3]
; GFX90A-NEXT: ;;#ASMEND
; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
@@ -1029,23 +1029,23 @@ define void @flat_atomic_xchg_i64_ret_av_v(ptr %ptr) #0 {
; GFX90A-NEXT: s_cbranch_execz .LBB15_2
; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.global
; GFX90A-NEXT: buffer_wbl2
-; GFX90A-NEXT: flat_atomic_swap_x2 v[0:1], v[2:3], v[4:5] glc
+; GFX90A-NEXT: flat_atomic_swap_x2 v[0:1], v[4:5], v[2:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
-; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: .LBB15_2: ; %Flow
; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB15_4
; GFX90A-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[2:3]
-; GFX90A-NEXT: v_cndmask_b32_e32 v2, -1, v2, vcc
-; GFX90A-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; GFX90A-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc
+; GFX90A-NEXT: buffer_load_dword v0, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_load_dword v1, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: s_nop 0
-; GFX90A-NEXT: buffer_store_dword v4, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_store_dword v5, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: buffer_store_dword v2, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_store_dword v3, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: .LBB15_4: ; %atomicrmw.phi
; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_waitcnt vmcnt(2)
@@ -1294,12 +1294,12 @@ define void @flat_atomic_xchg_i64_ret_v_av(ptr %ptr) #0 {
; GFX90A-LABEL: flat_atomic_xchg_i64_ret_v_av:
; GFX90A: ; %bb.0:
; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-NEXT: v_add_co_u32_e32 v2, vcc, 0x50, v0
+; GFX90A-NEXT: v_add_co_u32_e32 v4, vcc, 0x50, v0
; GFX90A-NEXT: s_mov_b64 s[4:5], src_private_base
-; GFX90A-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
-; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v3
+; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
+; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v5
; GFX90A-NEXT: ;;#ASMSTART
-; GFX90A-NEXT: ; def v[4:5]
+; GFX90A-NEXT: ; def v[2:3]
; GFX90A-NEXT: ;;#ASMEND
; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
@@ -1307,23 +1307,23 @@ define void @flat_atomic_xchg_i64_ret_v_av(ptr %ptr) #0 {
; GFX90A-NEXT: s_cbranch_execz .LBB18_2
; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.global
; GFX90A-NEXT: buffer_wbl2
-; GFX90A-NEXT: flat_atomic_swap_x2 v[0:1], v[2:3], v[4:5] glc
+; GFX90A-NEXT: flat_atomic_swap_x2 v[0:1], v[4:5], v[2:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: buffer_invl2
; GFX90A-NEXT: buffer_wbinvl1_vol
-; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: .LBB18_2: ; %Flow
; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB18_4
; GFX90A-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[2:3]
-; GFX90A-NEXT: v_cndmask_b32_e32 v2, -1, v2, vcc
-; GFX90A-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; GFX90A-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc
+; GFX90A-NEXT: buffer_load_dword v0, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_load_dword v1, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: s_nop 0
-; GFX90A-NEXT: buffer_store_dword v4, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_store_dword v5, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: buffer_store_dword v2, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_store_dword v3, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: .LBB18_4: ; %atomicrmw.phi
; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_waitcnt vmcnt(2)
@@ -6406,35 +6406,35 @@ define void @flat_atomic_add_i64_ret_av_av(ptr %ptr) #0 {
; GFX90A-LABEL: flat_atomic_add_i64_ret_av_av:
; GFX90A: ; %bb.0:
; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-NEXT: v_add_co_u32_e32 v2, vcc, 0x50, v0
+; GFX90A-NEXT: v_add_co_u32_e32 v4, vcc, 0x50, v0
; GFX90A-NEXT: s_mov_b64 s[4:5], src_private_base
-; GFX90A-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
-; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v3
+; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
+; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v5
; GFX90A-NEXT: ;;#ASMSTART
-; GFX90A-NEXT: ; def v[4:5]
+; GFX90A-NEXT: ; def v[2:3]
; GFX90A-NEXT: ;;#ASMEND
; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
; GFX90A-NEXT: s_xor_b64 s[4:5], exec, s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB90_2
; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.global
-; GFX90A-NEXT: flat_atomic_add_x2 v[0:1], v[2:3], v[4:5] glc
+; GFX90A-NEXT: flat_atomic_add_x2 v[0:1], v[4:5], v[2:3] glc
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: .LBB90_2: ; %Flow
; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB90_4
; GFX90A-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[2:3]
-; GFX90A-NEXT: v_cndmask_b32_e32 v2, -1, v2, vcc
-; GFX90A-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; GFX90A-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc
+; GFX90A-NEXT: buffer_load_dword v0, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_load_dword v1, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: v_add_co_u32_e32 v3, vcc, v0, v4
-; GFX90A-NEXT: v_addc_co_u32_e32 v4, vcc, v1, v5, vcc
-; GFX90A-NEXT: buffer_store_dword v3, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_store_dword v4, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: v_add_co_u32_e32 v2, vcc, v0, v2
+; GFX90A-NEXT: v_addc_co_u32_e32 v3, vcc, v1, v3, vcc
+; GFX90A-NEXT: buffer_store_dword v2, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_store_dword v3, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: .LBB90_4: ; %atomicrmw.phi
; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -6591,35 +6591,35 @@ define void @flat_atomic_sub_i64_ret_av_av(ptr %ptr) #0 {
; GFX90A-LABEL: flat_atomic_sub_i64_ret_av_av:
; GFX90A: ; %bb.0:
; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-NEXT: v_add_co_u32_e32 v2, vcc, 0x50, v0
+; GFX90A-NEXT: v_add_co_u32_e32 v4, vcc, 0x50, v0
; GFX90A-NEXT: s_mov_b64 s[4:5], src_private_base
-; GFX90A-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v1, vcc
-; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v3
+; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
+; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v5
; GFX90A-NEXT: ;;#ASMSTART
-; GFX90A-NEXT: ; def v[4:5]
+; GFX90A-NEXT: ; def v[2:3]
; GFX90A-NEXT: ;;#ASMEND
; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
; GFX90A-NEXT: s_xor_b64 s[4:5], exec, s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB92_2
; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.global
-; GFX90A-NEXT: flat_atomic_sub_x2 v[0:1], v[2:3], v[4:5] glc
+; GFX90A-NEXT: flat_atomic_sub_x2 v[0:1], v[4:5], v[2:3] glc
; GFX90A-NEXT: s_waitcnt lgkmcnt(0)
-; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
+; GFX90A-NEXT: ; implicit-def: $vgpr2_vgpr3
; GFX90A-NEXT: .LBB92_2: ; %Flow
; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB92_4
; GFX90A-NEXT: ; %bb.3: ; %atomicrmw.private
-; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[2:3]
-; GFX90A-NEXT: v_cndmask_b32_e32 v2, -1, v2, vcc
-; GFX90A-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[4:5]
+; GFX90A-NEXT: v_cndmask_b32_e32 v4, -1, v4, vcc
+; GFX90A-NEXT: buffer_load_dword v0, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_load_dword v1, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: v_sub_co_u32_e32 v3, vcc, v0, v4
-; GFX90A-NEXT: v_subb_co_u32_e32 v4, vcc, v1, v5, vcc
-; GFX90A-NEXT: buffer_store_dword v3, v2, s[0:3], 0 offen
-; GFX90A-NEXT: buffer_store_dword v4, v2, s[0:3], 0 offen offset:4
+; GFX90A-NEXT: v_sub_co_u32_e32 v2, vcc, v0, v2
+; GFX90A-NEXT: v_subb_co_u32_e32 v3, vcc, v1, v3, vcc
+; GFX90A-NEXT: buffer_store_dword v2, v4, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_store_dword v3, v4, s[0:3], 0 offen offset:4
; GFX90A-NEXT: .LBB92_4: ; %atomicrmw.phi
; GFX90A-NEXT: s_or_b64 exec, exec, s[4:5]
; GFX90A-NEXT: s_waitcnt vmcnt(0)
@@ -8881,28 +8881,28 @@ define void @flat_atomic_usub_sat_i64_ret_av_av(ptr %ptr) #0 {
; GFX90A-LABEL: flat_atomic_usub_sat_i64_ret_av_av:
; GFX90A: ; %bb.0:
; GFX90A-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX90A-NEXT: v_add_co_u32_e32 v4, vcc, 0x50, v0
+; GFX90A-NEXT: v_add_co_u32_e32 v6, vcc, 0x50, v0
; GFX90A-NEXT: s_mov_b64 s[4:5], src_private_base
-; GFX90A-NEXT: v_addc_co_u32_e32 v5, vcc, 0, v1, vcc
-; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v5
+; GFX90A-NEXT: v_addc_co_u32_e32 v7, vcc, 0, v1, vcc
+; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s5, v7
; GFX90A-NEXT: ;;#ASMSTART
-; GFX90A-NEXT: ; def v[6:7]
+; GFX90A-NEXT: ; def v[4:5]
; GFX90A-NEXT: ;;#ASMEND
; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: s_and_saveexec_b64 s[4:5], vcc
; GFX90A-NEXT: s_xor_b64 s[4:5], exec, s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB114_4
; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.global
-; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[4:5]
+; GFX90A-NEXT: flat_load_dwordx2 v[2:3], v[6:7]
; GFX90A-NEXT: s_mov_b64 s[6:7], 0
; GFX90A-NEXT: .LBB114_2: ; %atomicrmw.start
; GFX90A-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
-; GFX90A-NEXT: v_sub_co_u32_e32 v0, vcc, v2, v6
-; GFX90A-NEXT: v_subb_co_u32_e32 v1, vcc, v3, v7, vcc
+; GFX90A-NEXT: v_sub_co_u32_e32 v0, vcc, v2, v4
+; GFX90A-NEXT: v_subb_co_u32_e32 v1, vcc, v3, v5, vcc
; GFX90A-NEXT: v_cndmask_b32_e64 v0, v0, 0, vcc
; GFX90A-NEXT: v_cndmask_b32_e64 v1, v1, 0, vcc
-; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[4:5], v[0:3] glc
+; GFX90A-NEXT: flat_atomic_cmpswap_x2 v[0:1], v[6:7], v[0:3] glc
; GFX90A-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX90A-NEXT: v_cmp_eq_u64_e32 vcc, v[0:1], v[2:3]
; GFX90A-NEXT: s_or_b64 s[6:7], vcc, s[6:7]
@@ -8911,20 +8911,20 @@ define void @flat_atomic_usub_sat_i64_ret_av_av(ptr %ptr) #0 {
; GFX90A-NEXT: s_cbranch_execnz .LBB114_2
; GFX90A-NEXT: ; %bb.3: ; %Flow
; GFX90A-NEXT: s_or_b64 exec, exec, s[6:7]
-; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX90A-NEXT: ; implicit-def: $vgpr6_vgpr7
+; GFX90A-NEXT: ; implicit-def: $vgpr4_vgpr5
; GFX90A-NEXT: .LBB114_4: ; %Flow3
; GFX90A-NEXT: s_andn2_saveexec_b64 s[4:5], s[4:5]
; GFX90A-NEXT: s_cbranch_execz .LBB114_6
; GFX90A-NEXT: ; %bb.5: ; %atomicrmw.private
-; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[4:5]
-; GFX90A-NEXT: v_cndmask_b32_e32 v2, -1, v4, vcc
+; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[6:7]
+; GFX90A-NEXT: v_cndmask_b32_e32 v2, -1, v6, vcc
; GFX90A-NEXT: buffer_load_dword v0, v2, s[0:3], 0 offen
; GFX90A-NEXT: buffer_load_dword v1, v2, s[0:3], 0 offen offset:4
; GFX90A-NEXT: s_waitcnt vmcnt(1)
-; GFX90A-NEXT: v_sub_co_u32_e32 v3, vcc, v0, v6
+; GFX90A-NEXT: v_sub_co_u32_e32 v3, vcc, v0, v4
; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: v_subb_co_u32_e32 v4, vcc, v1, v7, vcc
+; GFX90A-NEXT: v_subb_co_u32_e32 v4, vcc, v1, v5, vcc
; GFX90A-NEXT: v_cndmask_b32_e64 v3, v3, 0, vcc
; GFX90A-NEXT: v_cndmask_b32_e64 v4, v4, 0, vcc
; GFX90A-NEXT: buffer_store_dword v3, v2, s[0:3], 0 offen
@@ -9027,28 +9027,29 @@ define void @flat_atomic_fadd_f32_ret_a_a(ptr %ptr) #0 {
; GFX90A-NEXT: ; %bb.1: ; %atomicrmw.check.private
; GFX90A-NEXT: s_mov_b64 s[6:7], src_private_base
; GFX90A-NEXT: v_cmp_ne_u32_e32 vcc, s7, v1
-; GFX90A-NEXT: ; implicit-def: $vgpr3
+; GFX90A-NEXT: ; implicit-def: $agpr0
; GFX90A-NEXT: s_and_saveexec_b64 s[6:7], vcc
; GFX90A-NEXT: s_xor_b64 s[6:7], exec, s[6:7]
; GFX90A-NEXT: s_cbranch_execz .LBB115_3
; GFX90A-NEXT: ; %bb.2: ; %atomicrmw.global
-; GFX90A-NEXT: global_atomic_add_f32 v3, v[0:1], v2, off glc
-; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
+; GFX90A-NEXT: global_atomic_add_f32 v0, v[0:1], v2, off glc
; GFX90A-NEXT: ; implicit-def: $vgpr2
+; GFX90A-NEXT: s_waitcnt vmcnt(0)
+; GFX90A-NEXT: v_accvgpr_write_b32 a0, v0
+; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: .LBB115_3: ; %Flow
; GFX90A-NEXT: s_andn2_saveexec_b64 s[6:7], s[6:7]
; GFX90A-NEXT: s_cbranch_execz .LBB115_5
; GFX90A-NEXT: ; %bb.4: ; %atomicrmw.private
; GFX90A-NEXT: v_cmp_ne_u64_e32 vcc, 0, v[0:1]
; GFX90A-NEXT: v_cndmask_b32_e32 v0, -1, v0, vcc
-; GFX90A-NEXT: buffer_load_dword v3, v0, s[0:3], 0 offen
+; GFX90A-NEXT: buffer_load_dword v1, v0, s[0:3], 0 offen
; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: v_add_f32_e32 v1, v3, v2
-; GFX90A-NEXT: buffer_store_dword v1, v0, s[0:3], 0 offen
+; GFX90A-NEXT: v_add_f32_e32 v2, v1, v2
+; GFX90A-NEXT: v_accvgpr_write_b32 a0, v1
+; GFX90A-NEXT: buffer_store_dword v2, v0, s[0:3], 0 offen
; GFX90A-NEXT: .LBB115_5: ; %Flow1
; GFX90A-NEXT: s_or_b64 exec, exec, s[6:7]
-; GFX90A-NEXT: s_waitcnt vmcnt(0)
-; GFX90A-NEXT: v_accvgpr_write_b32 a0, v3
; GFX90A-NEXT: ; implicit-def: $vgpr0_vgpr1
; GFX90A-NEXT: ; implicit-def: $vgpr2
; GFX90A-NEXT: .LBB115_6: ; %Flow2
@@ -9065,6 +9066,7 @@ define void @flat_atomic_fadd_f32_ret_a_a(ptr %ptr) #0 {
; GFX90A-NEXT: ;;#ASMSTART
; GFX90A-NEX...
[truncated]
|
|
@michaelselehov this should fix your test case just as well as #175995. The test case updates in this PR are identical to that one. |
|
I can only assume that this never happens in real codegen because the way we use |
| ; GFX90A-NEXT: v_accvgpr_write_b32 a0, v1 | ||
| ; GFX90A-NEXT: s_cbranch_execz .LBB223_5 | ||
| ; GFX90A-NEXT: s_branch .LBB223_6 | ||
| ; GFX90A-NEXT: .LBB223_3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Weird that this results in this CFG change. This looks more jumpy now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just arbitrary basic block reordering. With the patch we eliminate this instruction:
%76:vgpr_32 = COPY %5:av_32, implicit $exec
which leaves a basic block empty, which allows MachineSink to combine it into another block but switches the order of the successors, which cuases BranchFolding to make a different arbitrary layout decision.
| ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 6 | ||
| ; RUN: llc -march=amdgcn -mcpu=gfx950 < %s | FileCheck %s -check-prefix=GFX950 | ||
|
|
||
| declare i32 @_ZN25__hip_builtin_threadIdx_t7__get_xEv() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| declare i32 @_ZN25__hip_builtin_threadIdx_t7__get_xEv() | |
| declare hidden i32 @_ZN25__hip_builtin_threadIdx_t7__get_xEv() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| %i21 = load <4 x i32>, ptr addrspace(1) null, align 16 | ||
| %i22 = or <4 x i32> %i5, %i21 | ||
| %i23 = or <4 x i32> %i20, %i22 | ||
| store <4 x i32> %i23, ptr addrspace(5) null, align 2147483648 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid the (almost) UB access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| %i8 = phi <4 x i32> [ zeroinitializer, %bb ], [ %i17, %bb4 ] | ||
| %i9 = phi <4 x i32> [ zeroinitializer, %bb ], [ %i13, %bb4 ] | ||
| %i10 = zext i32 %i1 to i64 | ||
| %i11 = getelementptr i64, ptr addrspace(1) null, i64 %i10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid UB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| ; RUN: llc -march=amdgcn -mcpu=gfx950 < %s | FileCheck %s -check-prefix=GFX950 | ||
|
|
||
| declare i32 @_ZN25__hip_builtin_threadIdx_t7__get_xEv() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment what this should show?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| // Legalize REG_SEQUENCE and PHI | ||
| // The register class of the operands much be the same type as the register | ||
| // Legalize PHI | ||
| // The register class of the operands must be the same type as the register |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't actually true though, though it may be convenient. It's not "legalization" and probably belongs somewhere else?se l
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK but my patch does not change this, and there's only so much code I'm prepared to rewrite to fix a relatively minor issue to do with the choice of a/v/av classes.
As I said in the other PR, I think this is required to implement temporal divergence.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed. |
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed. |
Just checked, it does. |
arsenm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with test nits (and maybe worth looking at the one control flow change test?)
That MIR test was specifically testing PHI with subreg inputs, so I removed it. |
Carries local revert of llvm/llvm-project#169614 due to #22649. This revert had a test file conflict with llvm/llvm-project#177352 (`llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll`), which was resolved by excluding that file from the revert. Signed-off-by: Ian Wood <[email protected]>
In practice when legalizeOperands is called on a PHI node, the result is never an SGPR class and the operands are never subregs. Simplify the code accordingly by using the result regclass for all the inputs. This includes using an AV class where previously we picked either an AGPR or VGPR class.
In practice when legalizeOperands is called on a PHI node, the result is
never an SGPR class and the operands are never subregs. Simplify the
code accordingly by using the result regclass for all the inputs. This
includes using an AV class where previously we picked either an AGPR or
VGPR class.