
[AMDGPU] Add intrinsic-based optimization for rotate and funnel shift patterns #153406

Open
aleksandar-amd wants to merge 1 commit into llvm:main from aleksandar-amd:amdgpu-optimize-rotate-instr-selection

Conversation

@aleksandar-amd
Contributor

This patch introduces a new llvm.amdgcn.v_alignbit intrinsic and implements early pattern matching in AMDGPUCodeGenPrepare to convert divergent fshr/fshl operations into V_ALIGNBIT_B32 instructions. The approach otherwise forces rotate/funnel-shift operations to be expanded into (shl, or, shr) patterns during legalization.
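The fshr/fshl-to-alignbit mapping can be checked with a small, self-contained Python model. This is illustrative only, not code from the patch; it assumes the usual V_ALIGNBIT_B32 semantics, dst = ({src0, src1} >> src2[4:0])[31:0]:

```python
MASK32 = 0xFFFFFFFF

def alignbit(src0: int, src1: int, src2: int) -> int:
    """Model of V_ALIGNBIT_B32: low 32 bits of the 64-bit concat
    {src0, src1} shifted right by the low 5 bits of src2."""
    return (((src0 << 32) | src1) >> (src2 & 31)) & MASK32

def fshr(a: int, b: int, s: int) -> int:
    """Reference llvm.fshr.i32: funnel-shift-right of the pair (a, b)."""
    s &= 31
    return ((b >> s) | (a << (32 - s))) & MASK32  # s == 0 yields b, as specified

def fshl(a: int, b: int, s: int) -> int:
    """Reference llvm.fshl.i32: funnel-shift-left of the pair (a, b)."""
    s &= 31
    if s == 0:
        return a & MASK32
    return ((a << s) | (b >> (32 - s))) & MASK32

# fshr maps directly onto alignbit with the same amount:
assert fshr(0x12345678, 0x9ABCDEF0, 8) == alignbit(0x12345678, 0x9ABCDEF0, 8)
# fshl can be expressed via an inverted amount (32 - s), for s % 32 != 0:
assert fshl(0x12345678, 0x9ABCDEF0, 4) == alignbit(0x12345678, 0x9ABCDEF0, 28)
```

Note that because the hardware masks the shift amount to 5 bits, routing fshl through an inverted amount only coincides with fshl for amounts that are nonzero mod 32; the s % 32 == 0 case (where fshl must return the first operand unchanged) needs separate handling.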

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR with a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot
Member

llvmbot commented Aug 13, 2025

@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-amdgpu

Author: Aleksandar Spasojevic (aleksandar-amd)

Changes

This patch introduces a new llvm.amdgcn.v_alignbit intrinsic and implements early pattern matching in AMDGPUCodeGenPrepare to convert divergent fshr/fshl operations into V_ALIGNBIT_B32 instructions. The approach otherwise forces rotate/funnel-shift operations to be expanded into (shl, or, shr) patterns during legalization.


Patch is 4.51 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/153406.diff

73 Files Affected:

  • (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+5)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (+48)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+7-3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+5-8)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+7)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+32)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll (+466-458)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fshr.ll (+511-207)
  • (removed) llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-fshr.mir (-59)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-fshl.mir (+72-39)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-fshr.mir (+87-15)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-rotl-rotr.mir (+65-14)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+30-24)
  • (removed) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-fshr.mir (-168)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/uaddsat.ll (+91-65)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/usubsat.ll (+91-65)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.1024bit.ll (+18682-17264)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll (+788-686)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.256bit.ll (+1689-1485)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.32bit.ll (+147-121)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.48bit.ll (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.512bit.ll (+4230-3740)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll (+374-324)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.96bit.ll (+438-384)
  • (modified) llvm/test/CodeGen/AMDGPU/any_extend_vector_inreg.ll (+33-33)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+20-18)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+1594-1277)
  • (modified) llvm/test/CodeGen/AMDGPU/bswap.ll (+37-32)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fadd.ll (+197-181)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmax.ll (+94-88)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointer-atomicrmw-fmin.ll (+94-88)
  • (modified) llvm/test/CodeGen/AMDGPU/build-vector-packed-partial-undef.ll (+10-8)
  • (modified) llvm/test/CodeGen/AMDGPU/build_vector.ll (+3-1)
  • (modified) llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/divergence-driven-buildvector.ll (+7-5)
  • (modified) llvm/test/CodeGen/AMDGPU/fabs.bf16.ll (+27-23)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.bf16.ll (+128-124)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fadd.ll (+204-192)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmax.ll (+173-165)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fmin.ll (+173-165)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-atomicrmw-fsub.ll (+136-128)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-fabs.bf16.ll (+51-42)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+8-6)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg.bf16.ll (+21-16)
  • (modified) llvm/test/CodeGen/AMDGPU/freeze.ll (+10-8)
  • (modified) llvm/test/CodeGen/AMDGPU/fshl.ll (+353-244)
  • (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+1026-714)
  • (modified) llvm/test/CodeGen/AMDGPU/function-args.ll (+55-39)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+350-322)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmax.ll (+262-246)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fmin.ll (+262-246)
  • (modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fsub.ll (+200-184)
  • (modified) llvm/test/CodeGen/AMDGPU/idot4u.ll (+22-26)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll (+29-24)
  • (modified) llvm/test/CodeGen/AMDGPU/itofp.i128.ll (+9-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.store.bf16.ll (+25-18)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.store.ll (+10-7)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i8.ll (+383-341)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i8.ll (+510-357)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+100-92)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmax.ll (+144-136)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fmin.ll (+144-136)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fsub.ll (+100-92)
  • (modified) llvm/test/CodeGen/AMDGPU/packetizer.ll (+3)
  • (modified) llvm/test/CodeGen/AMDGPU/permute.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/permute_i8.ll (+142-120)
  • (modified) llvm/test/CodeGen/AMDGPU/rotate-add.ll (+60-22)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+136-68)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+234-53)
  • (modified) llvm/test/CodeGen/AMDGPU/scalar_to_vector.ll (+10-8)
  • (modified) llvm/test/CodeGen/AMDGPU/truncate-lshr-cast-build-vector-combine.ll (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/urem-seteq-illegal-types.ll (+17-15)
  • (modified) llvm/test/Transforms/InferAddressSpaces/SPIRV/generic-cast-explicit.ll (+8-7)
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index cf82f7f06a693..6a34fc039ac36 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -886,6 +886,11 @@ def int_amdgcn_bitop3 :
                         [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>, llvm_i32_ty],
                         [IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<3>>]>;
 
+def int_amdgcn_v_alignbit :
+  DefaultAttrsIntrinsic<[llvm_i32_ty],
+                        [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
+                        [IntrNoMem, IntrSpeculatable, IntrWillReturn]>;
+
 } // TargetPrefix = "amdgcn"
 
 // New-style image intrinsics
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index a9278c1dc3a6a..0f05389b03e62 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -245,6 +245,7 @@ class AMDGPUCodeGenPrepareImpl
 
   bool visitInstruction(Instruction &I) { return false; }
   bool visitBinaryOperator(BinaryOperator &I);
+  bool visitCallInst(CallInst &I);
   bool visitLoadInst(LoadInst &I);
   bool visitSelectInst(SelectInst &I);
   bool visitPHINode(PHINode &I);
@@ -253,6 +254,7 @@ class AMDGPUCodeGenPrepareImpl
   bool visitIntrinsicInst(IntrinsicInst &I);
   bool visitFMinLike(IntrinsicInst &I);
   bool visitSqrt(IntrinsicInst &I);
+  bool visitRotateIntrinsic(IntrinsicInst &I);
   bool run();
 };
 
@@ -1913,6 +1915,9 @@ bool AMDGPUCodeGenPrepareImpl::visitIntrinsicInst(IntrinsicInst &I) {
     return visitFMinLike(I);
   case Intrinsic::sqrt:
     return visitSqrt(I);
+  case Intrinsic::fshr:
+  case Intrinsic::fshl:
+    return visitRotateIntrinsic(I);
   default:
     return false;
   }
@@ -2103,6 +2108,49 @@ PreservedAnalyses AMDGPUCodeGenPreparePass::run(Function &F,
   return PA;
 }
 
+bool AMDGPUCodeGenPrepareImpl::visitCallInst(CallInst &I) {
+  if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(&I))
+    return visitIntrinsicInst(*II);
+  return false;
+}
+
+bool AMDGPUCodeGenPrepareImpl::visitRotateIntrinsic(IntrinsicInst &I) {
+  if (!I.getType()->isIntegerTy(32))
+    return false;
+
+  // Only convert divergent operations to v_alignbit
+  if (UA.isUniform(&I))
+    return false;
+
+  Intrinsic::ID IID = I.getIntrinsicID();
+  Value *Src0 = I.getOperand(0);
+  Value *Src1 = I.getOperand(1);
+  Value *Amt = I.getOperand(2);
+
+  IRBuilder<> Builder(&I);
+  Function *AlignBitFn = Intrinsic::getOrInsertDeclaration(
+      I.getModule(), Intrinsic::amdgcn_v_alignbit);
+
+  Value *AlignBitCall = nullptr;
+  
+  switch (IID) {
+  case Intrinsic::fshr:
+    AlignBitCall = Builder.CreateCall(AlignBitFn, {Src0, Src1, Amt});
+    break;
+  case Intrinsic::fshl: {
+    Value *InvAmt = Builder.CreateSub(Builder.getInt32(32), Amt);
+    AlignBitCall = Builder.CreateCall(AlignBitFn, {Src1, Src0, InvAmt});
+    break;
+  }
+  default:
+    return false;
+  }
+
+  I.replaceAllUsesWith(AlignBitCall);
+  I.eraseFromParent();
+  return true;
+}
+
 INITIALIZE_PASS_BEGIN(AMDGPUCodeGenPrepare, DEBUG_TYPE,
                       "AMDGPU IR optimizations", false, false)
 INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 64e68ab7d753c..3362834796f9d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -487,12 +487,16 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
     setOperationAction({ISD::ADDC, ISD::SUBC, ISD::ADDE, ISD::SUBE}, VT, Legal);
   }
 
-  // The hardware supports 32-bit FSHR, but not FSHL.
-  setOperationAction(ISD::FSHR, MVT::i32, Legal);
+  if (Subtarget->isGCN()) {
+    setOperationAction(ISD::FSHR, MVT::i32, Expand);
+    setOperationAction(ISD::ROTR, {MVT::i32, MVT::i64}, Expand);
+  } else {
+    setOperationAction(ISD::FSHR, MVT::i32, Legal);
+    setOperationAction(ISD::ROTR, {MVT::i32, MVT::i64}, Legal);
+  }
 
   // The hardware supports 32-bit ROTR, but not ROTL.
   setOperationAction(ISD::ROTL, {MVT::i32, MVT::i64}, Expand);
-  setOperationAction(ISD::ROTR, MVT::i64, Expand);
 
   setOperationAction({ISD::MULHU, ISD::MULHS}, MVT::i16, Expand);
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index b88891ac4894b..90634de0b12cd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -2058,17 +2058,14 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
     .clampScalar(0, S32, S64)
     .lower();
 
-  getActionDefinitionsBuilder({G_ROTR, G_ROTL})
-    .scalarize(0)
-    .lower();
+  getActionDefinitionsBuilder({G_ROTR, G_ROTL}).scalarize(0).lower();
 
   // TODO: Only Try to form v2s16 with legal packed instructions.
   getActionDefinitionsBuilder(G_FSHR)
-    .legalFor({{S32, S32}})
-    .lowerFor({{V2S16, V2S16}})
-    .clampMaxNumElementsStrict(0, S16, 2)
-    .scalarize(0)
-    .lower();
+      .lowerFor({{V2S16, V2S16}})
+      .clampMaxNumElementsStrict(0, S16, 2)
+      .scalarize(0)
+      .lower();
 
   if (ST.hasVOP3PInsts()) {
     getActionDefinitionsBuilder(G_FSHL)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 237929699dd9d..a5bd08072b47d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4123,6 +4123,12 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
   case AMDGPU::G_AMDGPU_SMED3:
   case AMDGPU::G_AMDGPU_FMED3:
     return getDefaultMappingVOP(MI);
+  case AMDGPU::G_ROTR:
+  case AMDGPU::G_ROTL: {
+    if (isSALUMapping(MI))
+      return getDefaultMappingSOP(MI);
+    return getDefaultMappingVOP(MI);
+  }
   case AMDGPU::G_UMULH:
   case AMDGPU::G_SMULH: {
     if (Subtarget.hasScalarMulHiInsts() && isSALUMapping(MI))
@@ -4824,6 +4830,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
     case Intrinsic::amdgcn_perm_pk16_b4_u4:
     case Intrinsic::amdgcn_perm_pk16_b6_u4:
     case Intrinsic::amdgcn_perm_pk16_b8_u4:
+    case Intrinsic::amdgcn_v_alignbit:
       return getDefaultMappingVOP(MI);
     case Intrinsic::amdgcn_log:
     case Intrinsic::amdgcn_exp2:
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index bd5dfa92a8e43..7ec9cd69e1600 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -2612,6 +2612,38 @@ def : GCNPat<(fshr i32:$src0, i32:$src1, i32:$src2),
 >;
 } // end True16Predicate = UseFakeTrue16Insts
 
+let True16Predicate = NotHasTrue16BitInsts in {
+let SubtargetPredicate = isNotGFX9Plus in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_e64 VSrc_b32:$src0, VSrc_b32:$src1, VSrc_b32:$src2)>;
+} // isNotGFX9Plus
+
+let SubtargetPredicate = isGFX9GFX10 in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_opsel_e64 /* src0_modifiers */ 0, $src0,
+                                       /* src1_modifiers */ 0, $src1,
+                                       /* src2_modifiers */ 0,
+                                       $src2, /* clamp */ 0, /* op_sel */ 0)>;
+} // isGFX9GFX10
+} // end True16Predicate = NotHasTrue16BitInsts
+
+let True16Predicate = UseRealTrue16Insts in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_t16_e64 /* src0_modifiers */ 0, $src0,
+                                     /* src1_modifiers */ 0, $src1,
+                                     /* src2_modifiers */ 0,
+                                     (EXTRACT_SUBREG $src2, lo16),
+                                     /* clamp */ 0, /* op_sel */ 0)>;
+} // end True16Predicate = UseRealTrue16Insts
+
+let True16Predicate = UseFakeTrue16Insts in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_fake16_e64 /* src0_modifiers */ 0, $src0,
+                                        /* src1_modifiers */ 0, $src1,
+                                        /* src2_modifiers */ 0,
+                                        $src2, /* clamp */ 0, /* op_sel */ 0)>;
+} // end True16Predicate = UseFakeTrue16Insts
+
 /********** ====================== **********/
 /**********   Indirect addressing  **********/
 /********** ====================== **********/
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
index fc81e16d68e98..5e54412c97a5c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
@@ -1768,102 +1768,102 @@ define i24 @v_fshl_i24(i24 %lhs, i24 %rhs, i24 %amt) {
 define amdgpu_ps i48 @s_fshl_v2i24(i48 inreg %lhs.arg, i48 inreg %rhs.arg, i48 inreg %amt.arg) {
 ; GFX6-LABEL: s_fshl_v2i24:
 ; GFX6:       ; %bb.0:
-; GFX6-NEXT:    v_cvt_f32_ubyte0_e32 v2, 24
-; GFX6-NEXT:    v_rcp_iflag_f32_e32 v2, v2
-; GFX6-NEXT:    s_bfe_u32 s9, s0, 0x80008
+; GFX6-NEXT:    v_cvt_f32_ubyte0_e32 v0, 24
+; GFX6-NEXT:    v_rcp_iflag_f32_e32 v0, v0
 ; GFX6-NEXT:    s_lshr_b32 s6, s0, 16
-; GFX6-NEXT:    s_and_b32 s8, s0, 0xff
-; GFX6-NEXT:    v_mul_f32_e32 v2, 0x4f7ffffe, v2
-; GFX6-NEXT:    v_cvt_u32_f32_e32 v2, v2
-; GFX6-NEXT:    s_lshl_b32 s9, s9, 8
-; GFX6-NEXT:    s_lshr_b32 s7, s1, 8
-; GFX6-NEXT:    s_or_b32 s8, s8, s9
+; GFX6-NEXT:    s_lshr_b32 s7, s0, 24
+; GFX6-NEXT:    s_and_b32 s9, s0, 0xff
+; GFX6-NEXT:    s_bfe_u32 s0, s0, 0x80008
+; GFX6-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX6-NEXT:    s_or_b32 s0, s9, s0
 ; GFX6-NEXT:    s_and_b32 s6, s6, 0xff
+; GFX6-NEXT:    s_lshr_b32 s8, s1, 8
+; GFX6-NEXT:    s_and_b32 s0, 0xffff, s0
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
 ; GFX6-NEXT:    s_and_b32 s1, s1, 0xff
-; GFX6-NEXT:    v_mov_b32_e32 v0, s0
-; GFX6-NEXT:    s_and_b32 s8, 0xffff, s8
+; GFX6-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
+; GFX6-NEXT:    s_or_b32 s0, s0, s6
+; GFX6-NEXT:    s_lshl_b32 s1, s1, 8
+; GFX6-NEXT:    s_and_b32 s6, s8, 0xff
+; GFX6-NEXT:    v_cvt_u32_f32_e32 v0, v0
+; GFX6-NEXT:    s_or_b32 s1, s7, s1
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s1, s1, s6
+; GFX6-NEXT:    s_lshr_b32 s6, s2, 16
+; GFX6-NEXT:    s_lshr_b32 s7, s2, 24
+; GFX6-NEXT:    s_and_b32 s9, s2, 0xff
+; GFX6-NEXT:    s_bfe_u32 s2, s2, 0x80008
+; GFX6-NEXT:    s_lshl_b32 s2, s2, 8
+; GFX6-NEXT:    v_not_b32_e32 v1, 23
+; GFX6-NEXT:    s_or_b32 s2, s9, s2
+; GFX6-NEXT:    s_and_b32 s6, s6, 0xff
+; GFX6-NEXT:    v_mul_lo_u32 v2, v0, v1
+; GFX6-NEXT:    s_lshr_b32 s8, s3, 8
+; GFX6-NEXT:    s_and_b32 s2, 0xffff, s2
 ; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
-; GFX6-NEXT:    v_alignbit_b32 v0, s1, v0, 24
-; GFX6-NEXT:    s_and_b32 s0, s7, 0xff
-; GFX6-NEXT:    v_not_b32_e32 v3, 23
-; GFX6-NEXT:    s_or_b32 s6, s8, s6
-; GFX6-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GFX6-NEXT:    s_lshl_b32 s0, s0, 16
-; GFX6-NEXT:    s_bfe_u32 s8, s2, 0x80008
-; GFX6-NEXT:    v_mul_lo_u32 v4, v2, v3
-; GFX6-NEXT:    v_or_b32_e32 v0, s0, v0
-; GFX6-NEXT:    s_lshr_b32 s0, s2, 16
-; GFX6-NEXT:    s_and_b32 s7, s2, 0xff
-; GFX6-NEXT:    s_lshl_b32 s8, s8, 8
-; GFX6-NEXT:    s_lshr_b32 s1, s3, 8
-; GFX6-NEXT:    s_or_b32 s7, s7, s8
-; GFX6-NEXT:    s_and_b32 s0, s0, 0xff
 ; GFX6-NEXT:    s_and_b32 s3, s3, 0xff
-; GFX6-NEXT:    v_mov_b32_e32 v1, s2
-; GFX6-NEXT:    s_and_b32 s7, 0xffff, s7
-; GFX6-NEXT:    s_lshl_b32 s0, s0, 16
-; GFX6-NEXT:    v_alignbit_b32 v1, s3, v1, 24
-; GFX6-NEXT:    s_and_b32 s1, s1, 0xff
-; GFX6-NEXT:    s_or_b32 s0, s7, s0
-; GFX6-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GFX6-NEXT:    s_lshl_b32 s1, s1, 16
-; GFX6-NEXT:    s_bfe_u32 s7, s4, 0x80008
-; GFX6-NEXT:    v_mul_hi_u32 v4, v2, v4
-; GFX6-NEXT:    v_or_b32_e32 v1, s1, v1
-; GFX6-NEXT:    s_lshr_b32 s1, s4, 16
-; GFX6-NEXT:    s_and_b32 s3, s4, 0xff
-; GFX6-NEXT:    s_lshl_b32 s7, s7, 8
-; GFX6-NEXT:    s_or_b32 s3, s3, s7
-; GFX6-NEXT:    s_and_b32 s1, s1, 0xff
-; GFX6-NEXT:    s_and_b32 s3, 0xffff, s3
-; GFX6-NEXT:    s_lshl_b32 s1, s1, 16
-; GFX6-NEXT:    s_or_b32 s1, s3, s1
-; GFX6-NEXT:    v_add_i32_e32 v2, vcc, v2, v4
-; GFX6-NEXT:    v_mul_hi_u32 v4, s1, v2
-; GFX6-NEXT:    s_lshr_b32 s2, s5, 8
-; GFX6-NEXT:    s_and_b32 s3, s5, 0xff
-; GFX6-NEXT:    v_mov_b32_e32 v5, s4
-; GFX6-NEXT:    v_alignbit_b32 v5, s3, v5, 24
-; GFX6-NEXT:    s_and_b32 s2, s2, 0xff
-; GFX6-NEXT:    v_and_b32_e32 v5, 0xffff, v5
-; GFX6-NEXT:    v_mul_lo_u32 v4, v4, 24
-; GFX6-NEXT:    s_lshl_b32 s2, s2, 16
-; GFX6-NEXT:    v_or_b32_e32 v5, s2, v5
-; GFX6-NEXT:    v_mul_hi_u32 v2, v5, v2
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s1, v4
-; GFX6-NEXT:    v_add_i32_e32 v6, vcc, v4, v3
-; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v4
+; GFX6-NEXT:    s_or_b32 s2, s2, s6
+; GFX6-NEXT:    s_lshl_b32 s3, s3, 8
+; GFX6-NEXT:    s_and_b32 s6, s8, 0xff
+; GFX6-NEXT:    s_or_b32 s3, s7, s3
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s3, s3, s6
+; GFX6-NEXT:    s_lshr_b32 s6, s4, 16
+; GFX6-NEXT:    s_lshr_b32 s7, s4, 24
+; GFX6-NEXT:    s_and_b32 s9, s4, 0xff
+; GFX6-NEXT:    s_bfe_u32 s4, s4, 0x80008
+; GFX6-NEXT:    v_mul_hi_u32 v2, v0, v2
+; GFX6-NEXT:    s_lshl_b32 s4, s4, 8
+; GFX6-NEXT:    s_or_b32 s4, s9, s4
+; GFX6-NEXT:    s_and_b32 s6, s6, 0xff
+; GFX6-NEXT:    s_and_b32 s4, 0xffff, s4
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s4, s4, s6
+; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
+; GFX6-NEXT:    v_mul_hi_u32 v2, s4, v0
+; GFX6-NEXT:    s_lshr_b32 s8, s5, 8
+; GFX6-NEXT:    s_and_b32 s5, s5, 0xff
+; GFX6-NEXT:    s_lshl_b32 s5, s5, 8
 ; GFX6-NEXT:    v_mul_lo_u32 v2, v2, 24
-; GFX6-NEXT:    v_cndmask_b32_e32 v4, v4, v6, vcc
-; GFX6-NEXT:    v_add_i32_e32 v6, vcc, v4, v3
-; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v4
-; GFX6-NEXT:    v_cndmask_b32_e32 v4, v4, v6, vcc
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, v5, v2
-; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, 23, v4
-; GFX6-NEXT:    v_add_i32_e32 v5, vcc, v2, v3
+; GFX6-NEXT:    s_and_b32 s6, s8, 0xff
+; GFX6-NEXT:    s_or_b32 s5, s7, s5
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s5, s5, s6
+; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, s4, v2
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v2, v1
+; GFX6-NEXT:    v_mul_hi_u32 v0, s5, v0
 ; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v2
-; GFX6-NEXT:    v_cndmask_b32_e32 v2, v2, v5, vcc
-; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v2, v3
+; GFX6-NEXT:    v_cndmask_b32_e32 v2, v2, v3, vcc
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v2, v1
 ; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v2
-; GFX6-NEXT:    v_and_b32_e32 v4, 0xffffff, v4
-; GFX6-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX6-NEXT:    v_and_b32_e32 v6, 0xffffff, v6
+; GFX6-NEXT:    v_mul_lo_u32 v0, v0, 24
 ; GFX6-NEXT:    v_cndmask_b32_e32 v2, v2, v3, vcc
-; GFX6-NEXT:    v_lshl_b32_e32 v4, s6, v4
-; GFX6-NEXT:    v_lshr_b32_e32 v6, s0, v6
 ; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 23, v2
 ; GFX6-NEXT:    v_and_b32_e32 v2, 0xffffff, v2
-; GFX6-NEXT:    v_or_b32_e32 v4, v4, v6
-; GFX6-NEXT:    v_lshlrev_b32_e32 v0, v2, v0
-; GFX6-NEXT:    v_lshrrev_b32_e32 v1, 1, v1
-; GFX6-NEXT:    v_and_b32_e32 v2, 0xffffff, v3
-; GFX6-NEXT:    v_lshrrev_b32_e32 v1, v2, v1
-; GFX6-NEXT:    v_bfe_u32 v2, v4, 8, 8
+; GFX6-NEXT:    v_lshl_b32_e32 v2, s0, v2
+; GFX6-NEXT:    s_lshr_b32 s0, s2, 1
+; GFX6-NEXT:    v_and_b32_e32 v3, 0xffffff, v3
+; GFX6-NEXT:    v_lshr_b32_e32 v3, s0, v3
+; GFX6-NEXT:    v_sub_i32_e32 v0, vcc, s5, v0
+; GFX6-NEXT:    v_or_b32_e32 v2, v2, v3
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v0, v1
+; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v0
+; GFX6-NEXT:    v_cndmask_b32_e32 v0, v0, v3, vcc
+; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v0, v1
+; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v0
+; GFX6-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX6-NEXT:    v_sub_i32_e32 v1, vcc, 23, v0
+; GFX6-NEXT:    v_and_b32_e32 v0, 0xffffff, v0
+; GFX6-NEXT:    s_lshr_b32 s0, s3, 1
+; GFX6-NEXT:    v_and_b32_e32 v1, 0xffffff, v1
+; GFX6-NEXT:    v_lshl_b32_e32 v0, s1, v0
+; GFX6-NEXT:    v_lshr_b32_e32 v1, s0, v1
+; GFX6-NEXT:    v_bfe_u32 v3, v2, 8, 8
 ; GFX6-NEXT:    v_or_b32_e32 v0, v0, v1
-; GFX6-NEXT:    v_and_b32_e32 v1, 0xff, v4
-; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
-; GFX6-NEXT:    v_or_b32_e32 v1, v1, v2
-; GFX6-NEXT:    v_bfe_u32 v2, v4, 16, 8
+; GFX6-NEXT:    v_and_b32_e32 v1, 0xff, v2
+; GFX6-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; GFX6-NEXT:    v_bfe_u32 v2, v2, 16, 8
+; GFX6-NEXT:    v_or_b32_e32 v1, v1, v3
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; GFX6-NEXT:    v_or_b32_e32 v1, v1, v2
 ; GFX6-NEXT:    v_and_b32_e32 v2, 0xff, v0
@@ -2568,156 +2568,101 @@ define <2 x i24> @v_fshl_v2i24(<2 x i24> %lhs, <2 x i24> %rhs, <2 x i24> %amt) {
 }
 
 define amdgpu_ps i32 @s_fshl_i32(i32 inreg %lhs, i32 inreg %rhs, i32 inreg %amt) {
-; GFX6-LABEL: s_fshl_i32:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    v_mov_b32_e32 v0, s1
-; GFX6-NEXT:    s_not_b32 s1, s2
-; GFX6-NEXT:    v_alignbit_b32 v0, s0, v0, 1
-; GFX6-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX6-NEXT:    v_mov_b32_e32 v1, s1
-; GFX6-NEXT:    v_alignbit_b32 v0, s0, v0, v1
-; GFX6-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX6-NEXT:    ; return to shader part epilog
-;
-; GFX8-LABEL: s_fshl_i32:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    v_mov_b32_e32 v0, s1
-; GFX8-NEXT:    s_not_b32 s1, s2
-; GFX8-NEXT:    v_alignbit_b32 v0, s0, v0, 1
-; GFX8-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX8-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8-NEXT:    v_alignbit_b32 v0, s0, v0, v1
-; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX8-NEXT:    ; return to shader part epilog
-;
-; GFX9-LABEL: s_fshl_i32:
-; GFX9:       ; %bb.0:
-; GFX9-NEXT:    v_mov_b32_e32 v0, s1
-; GFX9-NEXT:    s_not_b32 s1, s2
-; GFX9-NEXT:    v_alignbit_b32 v0, s0, v0, 1
-; GFX9-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX9-NEXT:    v_mov_b32_e32 v1, s1
-; GFX9-NEXT:    v_alignbit_b32 v0, s0, v0, v1
-; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX9-NEXT:    ; return to shader part epilog
-;
-; GFX10-LABEL: s_fshl_i32:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    v_alignbit_b32 v0, s0, s1, 1
-; GFX10-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX10-NEXT:    s_not_b32 s1, s2
-; GFX10-NEXT:    v_alignbit_b32 v0, s0, v0, s1
-; GFX10-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX10-NEXT:    ; return to shader part epilog
+; GCN-LABEL: s_fshl_i32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_not_b32 s3, s2
+; GCN-NEXT:    s_lshr_b32 s1, s1, 1
+; GCN-NEXT:    s_lshl_b32 s0, s0, s2
+; GCN-NEXT:    s_lshr_b32 s1, s1, s3
+; GCN-NEXT:    s_or_b32 s0, s0, s1
+; GCN-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: s_fshl_i32:
 ; GFX11:       ; %bb.0:
-; GFX11-NEXT:    v_alignbit_b32 v0, s0, s1, 1
-; GFX11-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX11-NEXT:    s_not_b32 s1, s2
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX11-NEXT:    v_alignbit_b32 v0, s0, v0, s1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-NEXT:    s_not_b32 s3, s2
+; GFX11-NEXT:    s_lshr_b32 s1, s1, 1
+; GFX11-NEXT:    s_lshl_b32 s0, s0, s2
+; GFX11-NEXT:    s_lshr_b32 s1, s1, s3
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s0, s1
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i32 @llvm.fshl.i32(i32 %lhs, i32 %rhs, i32 %amt)
   ret i32 %result
 }
 
 define amdgpu_ps i32 @s_fshl_i32_5(i32 inreg %lhs, i32 inreg %rhs) {
-; GFX6-LABEL: s_fshl_i32_5:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    v_mov_b32_e32 v0, s1
-; GFX6-NEXT:    v_alignbit_b32 v0, s0, v0, 27
-; GFX6-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX6-NEXT:    ; return to shader part epilog
-;
-; GFX8-LABEL: s_fshl_i32_5:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    v_mov_b32_e32 v0, s1
-; GFX8-NEXT:    v_alignbit_b32 v0, s0, v0, 27
-; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX8-NEXT:    ; return to shader part epilog
-;
-; GFX9-LABEL: s_fshl_i32_5:
-; GFX9:       ; %bb.0:
-; GFX9-NEXT:    v_mov_b32_e32 v0, s1
-; GFX9-NEXT:    v_alignbit_b32 v0, s0, v0, 27
-; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX9-NEXT:    ; return to shader part epilog
-;
-; GFX10-LABEL: s_fshl_i32_5:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    v_alignbit_b32 v0, s0, s1, 27
-; GFX10-NEXT:    v_readfirstlane_b32...
[truncated]
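The rotate changes above (G_ROTR/G_ROTL lowering plus the ROTR expansion in AMDGPUISelLowering) rest on the same identity: a 32-bit rotate is a funnel shift with both inputs equal, so a rotate-right folds into a single alignbit. A quick Python check of that identity, under the same assumed V_ALIGNBIT_B32 semantics as before (illustrative only, not code from the patch):

```python
MASK32 = 0xFFFFFFFF

def alignbit(src0: int, src1: int, src2: int) -> int:
    """Model of V_ALIGNBIT_B32: low 32 bits of {src0, src1} >> (src2 & 31)."""
    return (((src0 << 32) | src1) >> (src2 & 31)) & MASK32

def rotr32(x: int, s: int) -> int:
    """32-bit rotate-right, written as the (shr, shl, or) expansion."""
    s &= 31
    return ((x >> s) | (x << (32 - s))) & MASK32

# rotr(x, s) == alignbit(x, x, s) for every amount, including s == 0,
# since duplicating the operand makes the 5-bit masking harmless:
for s in range(64):
    assert rotr32(0x80000001, s) == alignbit(0x80000001, 0x80000001, s)
```

Unlike the fshl case, no edge-case handling is needed here: with both halves of the 64-bit concat equal, any masked amount produces the correct rotation.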

@llvmbot
Member

llvmbot commented Aug 13, 2025

@llvm/pr-subscribers-llvm-transforms

diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index cf82f7f06a693..6a34fc039ac36 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -886,6 +886,11 @@ def int_amdgcn_bitop3 :
                         [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>, llvm_i32_ty],
                         [IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<3>>]>;
 
+def int_amdgcn_v_alignbit :
+  DefaultAttrsIntrinsic<[llvm_i32_ty],
+                        [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
+                        [IntrNoMem, IntrSpeculatable, IntrWillReturn]>;
+
 } // TargetPrefix = "amdgcn"
 
 // New-style image intrinsics
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index a9278c1dc3a6a..0f05389b03e62 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -245,6 +245,7 @@ class AMDGPUCodeGenPrepareImpl
 
   bool visitInstruction(Instruction &I) { return false; }
   bool visitBinaryOperator(BinaryOperator &I);
+  bool visitCallInst(CallInst &I);
   bool visitLoadInst(LoadInst &I);
   bool visitSelectInst(SelectInst &I);
   bool visitPHINode(PHINode &I);
@@ -253,6 +254,7 @@ class AMDGPUCodeGenPrepareImpl
   bool visitIntrinsicInst(IntrinsicInst &I);
   bool visitFMinLike(IntrinsicInst &I);
   bool visitSqrt(IntrinsicInst &I);
+  bool visitRotateIntrinsic(IntrinsicInst &I);
   bool run();
 };
 
@@ -1913,6 +1915,9 @@ bool AMDGPUCodeGenPrepareImpl::visitIntrinsicInst(IntrinsicInst &I) {
     return visitFMinLike(I);
   case Intrinsic::sqrt:
     return visitSqrt(I);
+  case Intrinsic::fshr:
+  case Intrinsic::fshl:
+    return visitRotateIntrinsic(I);
   default:
     return false;
   }
@@ -2103,6 +2108,49 @@ PreservedAnalyses AMDGPUCodeGenPreparePass::run(Function &F,
   return PA;
 }
 
+bool AMDGPUCodeGenPrepareImpl::visitCallInst(CallInst &I) {
+  if (IntrinsicInst *II = dyn_cast<IntrinsicInst>(&I))
+    return visitIntrinsicInst(*II);
+  return false;
+}
+
+bool AMDGPUCodeGenPrepareImpl::visitRotateIntrinsic(IntrinsicInst &I) {
+  if (!I.getType()->isIntegerTy(32))
+    return false;
+
+  // Only convert divergent operations to v_alignbit
+  if (UA.isUniform(&I))
+    return false;
+
+  Intrinsic::ID IID = I.getIntrinsicID();
+  Value *Src0 = I.getOperand(0);
+  Value *Src1 = I.getOperand(1);
+  Value *Amt = I.getOperand(2);
+
+  IRBuilder<> Builder(&I);
+  Function *AlignBitFn = Intrinsic::getOrInsertDeclaration(
+      I.getModule(), Intrinsic::amdgcn_v_alignbit);
+
+  Value *AlignBitCall = nullptr;
+  
+  switch (IID) {
+  case Intrinsic::fshr:
+    AlignBitCall = Builder.CreateCall(AlignBitFn, {Src0, Src1, Amt});
+    break;
+  case Intrinsic::fshl: {
+    Value *InvAmt = Builder.CreateSub(Builder.getInt32(32), Amt);
+    AlignBitCall = Builder.CreateCall(AlignBitFn, {Src1, Src0, InvAmt});
+    break;
+  }
+  default:
+    return false;
+  }
+
+  I.replaceAllUsesWith(AlignBitCall);
+  I.eraseFromParent();
+  return true;
+}
+
 INITIALIZE_PASS_BEGIN(AMDGPUCodeGenPrepare, DEBUG_TYPE,
                       "AMDGPU IR optimizations", false, false)
 INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 64e68ab7d753c..3362834796f9d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -487,12 +487,16 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
     setOperationAction({ISD::ADDC, ISD::SUBC, ISD::ADDE, ISD::SUBE}, VT, Legal);
   }
 
-  // The hardware supports 32-bit FSHR, but not FSHL.
-  setOperationAction(ISD::FSHR, MVT::i32, Legal);
+  if (Subtarget->isGCN()) {
+    setOperationAction(ISD::FSHR, MVT::i32, Expand);
+    setOperationAction(ISD::ROTR, {MVT::i32, MVT::i64}, Expand);
+  } else {
+    setOperationAction(ISD::FSHR, MVT::i32, Legal);
+    setOperationAction(ISD::ROTR, {MVT::i32, MVT::i64}, Legal);
+  }
 
   // The hardware supports 32-bit ROTR, but not ROTL.
   setOperationAction(ISD::ROTL, {MVT::i32, MVT::i64}, Expand);
-  setOperationAction(ISD::ROTR, MVT::i64, Expand);
 
   setOperationAction({ISD::MULHU, ISD::MULHS}, MVT::i16, Expand);
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index b88891ac4894b..90634de0b12cd 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -2058,17 +2058,14 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
     .clampScalar(0, S32, S64)
     .lower();
 
-  getActionDefinitionsBuilder({G_ROTR, G_ROTL})
-    .scalarize(0)
-    .lower();
+  getActionDefinitionsBuilder({G_ROTR, G_ROTL}).scalarize(0).lower();
 
   // TODO: Only Try to form v2s16 with legal packed instructions.
   getActionDefinitionsBuilder(G_FSHR)
-    .legalFor({{S32, S32}})
-    .lowerFor({{V2S16, V2S16}})
-    .clampMaxNumElementsStrict(0, S16, 2)
-    .scalarize(0)
-    .lower();
+      .lowerFor({{V2S16, V2S16}})
+      .clampMaxNumElementsStrict(0, S16, 2)
+      .scalarize(0)
+      .lower();
 
   if (ST.hasVOP3PInsts()) {
     getActionDefinitionsBuilder(G_FSHL)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index 237929699dd9d..a5bd08072b47d 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4123,6 +4123,12 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
   case AMDGPU::G_AMDGPU_SMED3:
   case AMDGPU::G_AMDGPU_FMED3:
     return getDefaultMappingVOP(MI);
+  case AMDGPU::G_ROTR:
+  case AMDGPU::G_ROTL: {
+    if (isSALUMapping(MI))
+      return getDefaultMappingSOP(MI);
+    return getDefaultMappingVOP(MI);
+  }
   case AMDGPU::G_UMULH:
   case AMDGPU::G_SMULH: {
     if (Subtarget.hasScalarMulHiInsts() && isSALUMapping(MI))
@@ -4824,6 +4830,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
     case Intrinsic::amdgcn_perm_pk16_b4_u4:
     case Intrinsic::amdgcn_perm_pk16_b6_u4:
     case Intrinsic::amdgcn_perm_pk16_b8_u4:
+    case Intrinsic::amdgcn_v_alignbit:
       return getDefaultMappingVOP(MI);
     case Intrinsic::amdgcn_log:
     case Intrinsic::amdgcn_exp2:
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index bd5dfa92a8e43..7ec9cd69e1600 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -2612,6 +2612,38 @@ def : GCNPat<(fshr i32:$src0, i32:$src1, i32:$src2),
 >;
 } // end True16Predicate = UseFakeTrue16Insts
 
+let True16Predicate = NotHasTrue16BitInsts in {
+let SubtargetPredicate = isNotGFX9Plus in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_e64 VSrc_b32:$src0, VSrc_b32:$src1, VSrc_b32:$src2)>;
+} // isNotGFX9Plus
+
+let SubtargetPredicate = isGFX9GFX10 in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_opsel_e64 /* src0_modifiers */ 0, $src0,
+                                       /* src1_modifiers */ 0, $src1,
+                                       /* src2_modifiers */ 0,
+                                       $src2, /* clamp */ 0, /* op_sel */ 0)>;
+} // isGFX9GFX10
+} // end True16Predicate = NotHasTrue16BitInsts
+
+let True16Predicate = UseRealTrue16Insts in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_t16_e64 /* src0_modifiers */ 0, $src0,
+                                     /* src1_modifiers */ 0, $src1,
+                                     /* src2_modifiers */ 0,
+                                     (EXTRACT_SUBREG $src2, lo16),
+                                     /* clamp */ 0, /* op_sel */ 0)>;
+} // end True16Predicate = UseRealTrue16Insts
+
+let True16Predicate = UseFakeTrue16Insts in {
+def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
+             (V_ALIGNBIT_B32_fake16_e64 /* src0_modifiers */ 0, $src0,
+                                        /* src1_modifiers */ 0, $src1,
+                                        /* src2_modifiers */ 0,
+                                        $src2, /* clamp */ 0, /* op_sel */ 0)>;
+} // end True16Predicate = UseFakeTrue16Insts
+
 /********** ====================== **********/
 /**********   Indirect addressing  **********/
 /********** ====================== **********/
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
index fc81e16d68e98..5e54412c97a5c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fshl.ll
@@ -1768,102 +1768,102 @@ define i24 @v_fshl_i24(i24 %lhs, i24 %rhs, i24 %amt) {
 define amdgpu_ps i48 @s_fshl_v2i24(i48 inreg %lhs.arg, i48 inreg %rhs.arg, i48 inreg %amt.arg) {
 ; GFX6-LABEL: s_fshl_v2i24:
 ; GFX6:       ; %bb.0:
-; GFX6-NEXT:    v_cvt_f32_ubyte0_e32 v2, 24
-; GFX6-NEXT:    v_rcp_iflag_f32_e32 v2, v2
-; GFX6-NEXT:    s_bfe_u32 s9, s0, 0x80008
+; GFX6-NEXT:    v_cvt_f32_ubyte0_e32 v0, 24
+; GFX6-NEXT:    v_rcp_iflag_f32_e32 v0, v0
 ; GFX6-NEXT:    s_lshr_b32 s6, s0, 16
-; GFX6-NEXT:    s_and_b32 s8, s0, 0xff
-; GFX6-NEXT:    v_mul_f32_e32 v2, 0x4f7ffffe, v2
-; GFX6-NEXT:    v_cvt_u32_f32_e32 v2, v2
-; GFX6-NEXT:    s_lshl_b32 s9, s9, 8
-; GFX6-NEXT:    s_lshr_b32 s7, s1, 8
-; GFX6-NEXT:    s_or_b32 s8, s8, s9
+; GFX6-NEXT:    s_lshr_b32 s7, s0, 24
+; GFX6-NEXT:    s_and_b32 s9, s0, 0xff
+; GFX6-NEXT:    s_bfe_u32 s0, s0, 0x80008
+; GFX6-NEXT:    s_lshl_b32 s0, s0, 8
+; GFX6-NEXT:    s_or_b32 s0, s9, s0
 ; GFX6-NEXT:    s_and_b32 s6, s6, 0xff
+; GFX6-NEXT:    s_lshr_b32 s8, s1, 8
+; GFX6-NEXT:    s_and_b32 s0, 0xffff, s0
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
 ; GFX6-NEXT:    s_and_b32 s1, s1, 0xff
-; GFX6-NEXT:    v_mov_b32_e32 v0, s0
-; GFX6-NEXT:    s_and_b32 s8, 0xffff, s8
+; GFX6-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
+; GFX6-NEXT:    s_or_b32 s0, s0, s6
+; GFX6-NEXT:    s_lshl_b32 s1, s1, 8
+; GFX6-NEXT:    s_and_b32 s6, s8, 0xff
+; GFX6-NEXT:    v_cvt_u32_f32_e32 v0, v0
+; GFX6-NEXT:    s_or_b32 s1, s7, s1
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s1, s1, s6
+; GFX6-NEXT:    s_lshr_b32 s6, s2, 16
+; GFX6-NEXT:    s_lshr_b32 s7, s2, 24
+; GFX6-NEXT:    s_and_b32 s9, s2, 0xff
+; GFX6-NEXT:    s_bfe_u32 s2, s2, 0x80008
+; GFX6-NEXT:    s_lshl_b32 s2, s2, 8
+; GFX6-NEXT:    v_not_b32_e32 v1, 23
+; GFX6-NEXT:    s_or_b32 s2, s9, s2
+; GFX6-NEXT:    s_and_b32 s6, s6, 0xff
+; GFX6-NEXT:    v_mul_lo_u32 v2, v0, v1
+; GFX6-NEXT:    s_lshr_b32 s8, s3, 8
+; GFX6-NEXT:    s_and_b32 s2, 0xffff, s2
 ; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
-; GFX6-NEXT:    v_alignbit_b32 v0, s1, v0, 24
-; GFX6-NEXT:    s_and_b32 s0, s7, 0xff
-; GFX6-NEXT:    v_not_b32_e32 v3, 23
-; GFX6-NEXT:    s_or_b32 s6, s8, s6
-; GFX6-NEXT:    v_and_b32_e32 v0, 0xffff, v0
-; GFX6-NEXT:    s_lshl_b32 s0, s0, 16
-; GFX6-NEXT:    s_bfe_u32 s8, s2, 0x80008
-; GFX6-NEXT:    v_mul_lo_u32 v4, v2, v3
-; GFX6-NEXT:    v_or_b32_e32 v0, s0, v0
-; GFX6-NEXT:    s_lshr_b32 s0, s2, 16
-; GFX6-NEXT:    s_and_b32 s7, s2, 0xff
-; GFX6-NEXT:    s_lshl_b32 s8, s8, 8
-; GFX6-NEXT:    s_lshr_b32 s1, s3, 8
-; GFX6-NEXT:    s_or_b32 s7, s7, s8
-; GFX6-NEXT:    s_and_b32 s0, s0, 0xff
 ; GFX6-NEXT:    s_and_b32 s3, s3, 0xff
-; GFX6-NEXT:    v_mov_b32_e32 v1, s2
-; GFX6-NEXT:    s_and_b32 s7, 0xffff, s7
-; GFX6-NEXT:    s_lshl_b32 s0, s0, 16
-; GFX6-NEXT:    v_alignbit_b32 v1, s3, v1, 24
-; GFX6-NEXT:    s_and_b32 s1, s1, 0xff
-; GFX6-NEXT:    s_or_b32 s0, s7, s0
-; GFX6-NEXT:    v_and_b32_e32 v1, 0xffff, v1
-; GFX6-NEXT:    s_lshl_b32 s1, s1, 16
-; GFX6-NEXT:    s_bfe_u32 s7, s4, 0x80008
-; GFX6-NEXT:    v_mul_hi_u32 v4, v2, v4
-; GFX6-NEXT:    v_or_b32_e32 v1, s1, v1
-; GFX6-NEXT:    s_lshr_b32 s1, s4, 16
-; GFX6-NEXT:    s_and_b32 s3, s4, 0xff
-; GFX6-NEXT:    s_lshl_b32 s7, s7, 8
-; GFX6-NEXT:    s_or_b32 s3, s3, s7
-; GFX6-NEXT:    s_and_b32 s1, s1, 0xff
-; GFX6-NEXT:    s_and_b32 s3, 0xffff, s3
-; GFX6-NEXT:    s_lshl_b32 s1, s1, 16
-; GFX6-NEXT:    s_or_b32 s1, s3, s1
-; GFX6-NEXT:    v_add_i32_e32 v2, vcc, v2, v4
-; GFX6-NEXT:    v_mul_hi_u32 v4, s1, v2
-; GFX6-NEXT:    s_lshr_b32 s2, s5, 8
-; GFX6-NEXT:    s_and_b32 s3, s5, 0xff
-; GFX6-NEXT:    v_mov_b32_e32 v5, s4
-; GFX6-NEXT:    v_alignbit_b32 v5, s3, v5, 24
-; GFX6-NEXT:    s_and_b32 s2, s2, 0xff
-; GFX6-NEXT:    v_and_b32_e32 v5, 0xffff, v5
-; GFX6-NEXT:    v_mul_lo_u32 v4, v4, 24
-; GFX6-NEXT:    s_lshl_b32 s2, s2, 16
-; GFX6-NEXT:    v_or_b32_e32 v5, s2, v5
-; GFX6-NEXT:    v_mul_hi_u32 v2, v5, v2
-; GFX6-NEXT:    v_sub_i32_e32 v4, vcc, s1, v4
-; GFX6-NEXT:    v_add_i32_e32 v6, vcc, v4, v3
-; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v4
+; GFX6-NEXT:    s_or_b32 s2, s2, s6
+; GFX6-NEXT:    s_lshl_b32 s3, s3, 8
+; GFX6-NEXT:    s_and_b32 s6, s8, 0xff
+; GFX6-NEXT:    s_or_b32 s3, s7, s3
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s3, s3, s6
+; GFX6-NEXT:    s_lshr_b32 s6, s4, 16
+; GFX6-NEXT:    s_lshr_b32 s7, s4, 24
+; GFX6-NEXT:    s_and_b32 s9, s4, 0xff
+; GFX6-NEXT:    s_bfe_u32 s4, s4, 0x80008
+; GFX6-NEXT:    v_mul_hi_u32 v2, v0, v2
+; GFX6-NEXT:    s_lshl_b32 s4, s4, 8
+; GFX6-NEXT:    s_or_b32 s4, s9, s4
+; GFX6-NEXT:    s_and_b32 s6, s6, 0xff
+; GFX6-NEXT:    s_and_b32 s4, 0xffff, s4
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s4, s4, s6
+; GFX6-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
+; GFX6-NEXT:    v_mul_hi_u32 v2, s4, v0
+; GFX6-NEXT:    s_lshr_b32 s8, s5, 8
+; GFX6-NEXT:    s_and_b32 s5, s5, 0xff
+; GFX6-NEXT:    s_lshl_b32 s5, s5, 8
 ; GFX6-NEXT:    v_mul_lo_u32 v2, v2, 24
-; GFX6-NEXT:    v_cndmask_b32_e32 v4, v4, v6, vcc
-; GFX6-NEXT:    v_add_i32_e32 v6, vcc, v4, v3
-; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v4
-; GFX6-NEXT:    v_cndmask_b32_e32 v4, v4, v6, vcc
-; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, v5, v2
-; GFX6-NEXT:    v_sub_i32_e32 v6, vcc, 23, v4
-; GFX6-NEXT:    v_add_i32_e32 v5, vcc, v2, v3
+; GFX6-NEXT:    s_and_b32 s6, s8, 0xff
+; GFX6-NEXT:    s_or_b32 s5, s7, s5
+; GFX6-NEXT:    s_lshl_b32 s6, s6, 16
+; GFX6-NEXT:    s_or_b32 s5, s5, s6
+; GFX6-NEXT:    v_sub_i32_e32 v2, vcc, s4, v2
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v2, v1
+; GFX6-NEXT:    v_mul_hi_u32 v0, s5, v0
 ; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v2
-; GFX6-NEXT:    v_cndmask_b32_e32 v2, v2, v5, vcc
-; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v2, v3
+; GFX6-NEXT:    v_cndmask_b32_e32 v2, v2, v3, vcc
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v2, v1
 ; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v2
-; GFX6-NEXT:    v_and_b32_e32 v4, 0xffffff, v4
-; GFX6-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX6-NEXT:    v_and_b32_e32 v6, 0xffffff, v6
+; GFX6-NEXT:    v_mul_lo_u32 v0, v0, 24
 ; GFX6-NEXT:    v_cndmask_b32_e32 v2, v2, v3, vcc
-; GFX6-NEXT:    v_lshl_b32_e32 v4, s6, v4
-; GFX6-NEXT:    v_lshr_b32_e32 v6, s0, v6
 ; GFX6-NEXT:    v_sub_i32_e32 v3, vcc, 23, v2
 ; GFX6-NEXT:    v_and_b32_e32 v2, 0xffffff, v2
-; GFX6-NEXT:    v_or_b32_e32 v4, v4, v6
-; GFX6-NEXT:    v_lshlrev_b32_e32 v0, v2, v0
-; GFX6-NEXT:    v_lshrrev_b32_e32 v1, 1, v1
-; GFX6-NEXT:    v_and_b32_e32 v2, 0xffffff, v3
-; GFX6-NEXT:    v_lshrrev_b32_e32 v1, v2, v1
-; GFX6-NEXT:    v_bfe_u32 v2, v4, 8, 8
+; GFX6-NEXT:    v_lshl_b32_e32 v2, s0, v2
+; GFX6-NEXT:    s_lshr_b32 s0, s2, 1
+; GFX6-NEXT:    v_and_b32_e32 v3, 0xffffff, v3
+; GFX6-NEXT:    v_lshr_b32_e32 v3, s0, v3
+; GFX6-NEXT:    v_sub_i32_e32 v0, vcc, s5, v0
+; GFX6-NEXT:    v_or_b32_e32 v2, v2, v3
+; GFX6-NEXT:    v_add_i32_e32 v3, vcc, v0, v1
+; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v0
+; GFX6-NEXT:    v_cndmask_b32_e32 v0, v0, v3, vcc
+; GFX6-NEXT:    v_add_i32_e32 v1, vcc, v0, v1
+; GFX6-NEXT:    v_cmp_le_u32_e32 vcc, 24, v0
+; GFX6-NEXT:    v_cndmask_b32_e32 v0, v0, v1, vcc
+; GFX6-NEXT:    v_sub_i32_e32 v1, vcc, 23, v0
+; GFX6-NEXT:    v_and_b32_e32 v0, 0xffffff, v0
+; GFX6-NEXT:    s_lshr_b32 s0, s3, 1
+; GFX6-NEXT:    v_and_b32_e32 v1, 0xffffff, v1
+; GFX6-NEXT:    v_lshl_b32_e32 v0, s1, v0
+; GFX6-NEXT:    v_lshr_b32_e32 v1, s0, v1
+; GFX6-NEXT:    v_bfe_u32 v3, v2, 8, 8
 ; GFX6-NEXT:    v_or_b32_e32 v0, v0, v1
-; GFX6-NEXT:    v_and_b32_e32 v1, 0xff, v4
-; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
-; GFX6-NEXT:    v_or_b32_e32 v1, v1, v2
-; GFX6-NEXT:    v_bfe_u32 v2, v4, 16, 8
+; GFX6-NEXT:    v_and_b32_e32 v1, 0xff, v2
+; GFX6-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
+; GFX6-NEXT:    v_bfe_u32 v2, v2, 16, 8
+; GFX6-NEXT:    v_or_b32_e32 v1, v1, v3
 ; GFX6-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
 ; GFX6-NEXT:    v_or_b32_e32 v1, v1, v2
 ; GFX6-NEXT:    v_and_b32_e32 v2, 0xff, v0
@@ -2568,156 +2568,101 @@ define <2 x i24> @v_fshl_v2i24(<2 x i24> %lhs, <2 x i24> %rhs, <2 x i24> %amt) {
 }
 
 define amdgpu_ps i32 @s_fshl_i32(i32 inreg %lhs, i32 inreg %rhs, i32 inreg %amt) {
-; GFX6-LABEL: s_fshl_i32:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    v_mov_b32_e32 v0, s1
-; GFX6-NEXT:    s_not_b32 s1, s2
-; GFX6-NEXT:    v_alignbit_b32 v0, s0, v0, 1
-; GFX6-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX6-NEXT:    v_mov_b32_e32 v1, s1
-; GFX6-NEXT:    v_alignbit_b32 v0, s0, v0, v1
-; GFX6-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX6-NEXT:    ; return to shader part epilog
-;
-; GFX8-LABEL: s_fshl_i32:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    v_mov_b32_e32 v0, s1
-; GFX8-NEXT:    s_not_b32 s1, s2
-; GFX8-NEXT:    v_alignbit_b32 v0, s0, v0, 1
-; GFX8-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX8-NEXT:    v_mov_b32_e32 v1, s1
-; GFX8-NEXT:    v_alignbit_b32 v0, s0, v0, v1
-; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX8-NEXT:    ; return to shader part epilog
-;
-; GFX9-LABEL: s_fshl_i32:
-; GFX9:       ; %bb.0:
-; GFX9-NEXT:    v_mov_b32_e32 v0, s1
-; GFX9-NEXT:    s_not_b32 s1, s2
-; GFX9-NEXT:    v_alignbit_b32 v0, s0, v0, 1
-; GFX9-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX9-NEXT:    v_mov_b32_e32 v1, s1
-; GFX9-NEXT:    v_alignbit_b32 v0, s0, v0, v1
-; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX9-NEXT:    ; return to shader part epilog
-;
-; GFX10-LABEL: s_fshl_i32:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    v_alignbit_b32 v0, s0, s1, 1
-; GFX10-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX10-NEXT:    s_not_b32 s1, s2
-; GFX10-NEXT:    v_alignbit_b32 v0, s0, v0, s1
-; GFX10-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX10-NEXT:    ; return to shader part epilog
+; GCN-LABEL: s_fshl_i32:
+; GCN:       ; %bb.0:
+; GCN-NEXT:    s_not_b32 s3, s2
+; GCN-NEXT:    s_lshr_b32 s1, s1, 1
+; GCN-NEXT:    s_lshl_b32 s0, s0, s2
+; GCN-NEXT:    s_lshr_b32 s1, s1, s3
+; GCN-NEXT:    s_or_b32 s0, s0, s1
+; GCN-NEXT:    ; return to shader part epilog
 ;
 ; GFX11-LABEL: s_fshl_i32:
 ; GFX11:       ; %bb.0:
-; GFX11-NEXT:    v_alignbit_b32 v0, s0, s1, 1
-; GFX11-NEXT:    s_lshr_b32 s0, s0, 1
-; GFX11-NEXT:    s_not_b32 s1, s2
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
-; GFX11-NEXT:    v_alignbit_b32 v0, s0, v0, s1
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_readfirstlane_b32 s0, v0
+; GFX11-NEXT:    s_not_b32 s3, s2
+; GFX11-NEXT:    s_lshr_b32 s1, s1, 1
+; GFX11-NEXT:    s_lshl_b32 s0, s0, s2
+; GFX11-NEXT:    s_lshr_b32 s1, s1, s3
+; GFX11-NEXT:    s_delay_alu instid0(SALU_CYCLE_1)
+; GFX11-NEXT:    s_or_b32 s0, s0, s1
 ; GFX11-NEXT:    ; return to shader part epilog
   %result = call i32 @llvm.fshl.i32(i32 %lhs, i32 %rhs, i32 %amt)
   ret i32 %result
 }
 
 define amdgpu_ps i32 @s_fshl_i32_5(i32 inreg %lhs, i32 inreg %rhs) {
-; GFX6-LABEL: s_fshl_i32_5:
-; GFX6:       ; %bb.0:
-; GFX6-NEXT:    v_mov_b32_e32 v0, s1
-; GFX6-NEXT:    v_alignbit_b32 v0, s0, v0, 27
-; GFX6-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX6-NEXT:    ; return to shader part epilog
-;
-; GFX8-LABEL: s_fshl_i32_5:
-; GFX8:       ; %bb.0:
-; GFX8-NEXT:    v_mov_b32_e32 v0, s1
-; GFX8-NEXT:    v_alignbit_b32 v0, s0, v0, 27
-; GFX8-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX8-NEXT:    ; return to shader part epilog
-;
-; GFX9-LABEL: s_fshl_i32_5:
-; GFX9:       ; %bb.0:
-; GFX9-NEXT:    v_mov_b32_e32 v0, s1
-; GFX9-NEXT:    v_alignbit_b32 v0, s0, v0, 27
-; GFX9-NEXT:    v_readfirstlane_b32 s0, v0
-; GFX9-NEXT:    ; return to shader part epilog
-;
-; GFX10-LABEL: s_fshl_i32_5:
-; GFX10:       ; %bb.0:
-; GFX10-NEXT:    v_alignbit_b32 v0, s0, s1, 27
-; GFX10-NEXT:    v_readfirstlane_b32...
[truncated]

@aleksandar-amd aleksandar-amd force-pushed the amdgpu-optimize-rotate-instr-selection branch 3 times, most recently from 0eb9414 to c3bb4ac on August 13, 2025 at 14:20
… patterns

This patch introduces a new llvm.amdgcn.v_alignbit intrinsic and implements
early pattern matching in AMDGPUCodeGenPrepare to convert divergent fshr/fshl
operations into V_ALIGNBIT_B32 instructions. The approach forces expansion of
rotate/funnel shift operations during legalization into (shl, or, shr) patterns.
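For reference, the (shl, or, shr) expansion the commit message refers to can be modeled with a short Python sketch of the generic i32 `llvm.fshl`/`llvm.fshr` semantics (an illustration of the funnel-shift definition, not code from this patch):

```python
MASK32 = 0xFFFFFFFF

def fshl32(a, b, amt):
    # i32 llvm.fshl: high 32 bits of the 64-bit concatenation {a:b} shifted left by amt
    amt &= 31  # funnel-shift amounts are taken modulo the bit width
    if amt == 0:
        return a & MASK32  # guard: the inner shift by (32 - amt) would wrap at amt == 0
    return ((a << amt) & MASK32) | ((b & MASK32) >> (32 - amt))

def fshr32(a, b, amt):
    # i32 llvm.fshr: low 32 bits of the 64-bit concatenation {a:b} shifted right by amt
    amt &= 31
    if amt == 0:
        return b & MASK32
    return ((b & MASK32) >> amt) | ((a << (32 - amt)) & MASK32)
```

Note the explicit `amt & 31 == 0` guard: because the expansion uses shifts by `32 - amt`, that residue class needs special handling, which is exactly the edge case raised in the review discussion on this PR.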
@github-actions

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index f8e8fcf8d..f42a4aede 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -2064,10 +2064,10 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,
 
   // TODO: Only Try to form v2s16 with legal packed instructions.
   getActionDefinitionsBuilder(G_FSHR)
-    .lowerFor({{V2S16, V2S16}})
-    .clampMaxNumElementsStrict(0, S16, 2)
-    .scalarize(0)
-    .lower();
+      .lowerFor({{V2S16, V2S16}})
+      .clampMaxNumElementsStrict(0, S16, 2)
+      .scalarize(0)
+      .lower();
 
   if (ST.hasVOP3PInsts()) {
     getActionDefinitionsBuilder(G_FSHL)

Collaborator

@nhaehnle nhaehnle left a comment


Thank you. I think this is a good start, but note the important comments inline.

Value *InvAmt = Builder.CreateSub(Builder.getInt32(32), Amt);
AlignBitCall = Builder.CreateCall(AlignBitFn, {Src1, Src0, InvAmt});
} else
return false;
Collaborator

This should never happen, and our policy is to fail fast and loud rather than to try to continue somehow when something that shouldn't happen happens. (To go along with this, our policy is to use release builds with assertions enabled for a lot of testing -- you should definitely make sure you do the same in case you're running check-* targets with release builds.)

In this particular case, there are one of two idiomatic patterns you can use:

  if (IID == Intrinsic::fshr) {
    ...
  } else {
    assert(IID == Intrinsic::fshl);
    ...
  }

or

  if (IID == Intrinsic::fshr) {
    ...
  } else if (IID == Intrinsic::fshl) {
    ...
  } else {
    llvm_unreachable("bad intrinsic");
  }

In this case, I'd prefer the former due to the obvious left/right symmetry.

; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: v_alignbit_b32 v0, v0, v1, 27
; GCN-NEXT: v_alignbit_b32 v0, v1, v0, 27
Collaborator

This test change is a mis-compilation.

Comment on lines +2130 to +2132
else if (IID == Intrinsic::fshl) {
Value *InvAmt = Builder.CreateSub(Builder.getInt32(32), Amt);
AlignBitCall = Builder.CreateCall(AlignBitFn, {Src1, Src0, InvAmt});
Collaborator

@nhaehnle nhaehnle Sep 22, 2025

This is incorrect. See also the comment about a mis-compilation in the tests.

A key property of funnel shifts is that bits of src0 are always more significant than bits of src1 in the output. Therefore, swapping the sources like that cannot be correct.

Looking at the tests, here's what we used to generate for a fully generic fshl:

v_alignbit_b32 v1, v0, v1, 1
v_lshrrev_b32_e32 v0, 1, v0
v_not_b32_e32 v2, v2
v_alignbit_b32 v0, v0, v1, v2

This is basically ({src0 : src1} >> 1) >> (31 - amt) (the outer >> is the final v_alignbit_b32), which I think is kind of what you were going for with this lowering: shifting first by 1 and then by (31 - amt) (the bitwise NOT of amt) seems like it'd be the same as shifting by 32 - amt.

The problem is that all shifts only use the LSBs to determine the shift amount, and so this lowering will give incorrect results when amt is 0 (or to be more precise, when amt & 31 == 0).

I'm not actually sure what the best lowering here is. There are two kinds of contenders:

  1. The above sequence, which you can potentially get by forming (in LLVM IR) a 64-bit value out of the sources and then doing first a shift by 1 and then a shift by not Amt.
  2. A sequence ((amt & 31) != 0) ? v.align.bit(src0, src1, -amt) : src0

The second sequence should become something like:

v_and_b32 v3, 31, v2
v_sub_b32 v6, 0, v2
v_alignbit_b32 v5, v0, v1, v6
v_cmp_eq_b32 vcc_lo, 0, v2
v_cndmask_b32 v0, v0, v5, vcc_lo

This is worse than the sequence above (though there are probably many cases where the v_and_b32 can be optimized away thanks to known-bits analysis). But! It is somewhat natural to have a uniform or even constant shift amount, and I don't know how well isel would be able to optimize those cases if you use the first option for the IR here. (I suspect it might be able to optimize the constant case, but not the uniform case.) With the version of LLVM IR that corresponds to this alternative, a uniform shift amount should lead to only 2x VALU (v_alignbit_b32 + v_cndmask_b32) and a bunch of SALU, which is certainly better than 4x VALU.

So you may want to do a case distinction here based on whether the shift amount is uniform. However, the more important thing is that you try a couple of variants of this to find out which is really best. There may be a variation I haven't thought of.

(I strongly recommend not doing an explicit case distinction based on whether amt is constant, since we have uniform analysis here. UA catches the constant case as well, and automatic constant folding in the builder should do the rest.)
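The concern above can be made concrete by checking both the patch's rewrite and contender 2 against the reference semantics. In this Python sketch, `alignbit` models v_alignbit_b32 (select 32 bits of {src0:src1} at offset `src2 & 31`) and `-amt` is taken modulo 2^32 — both assumptions restated from the comment, not part of the patch:

```python
MASK32 = 0xFFFFFFFF

def alignbit(src0, src1, src2):
    # models v_alignbit_b32: {src0:src1} >> (src2 & 31), low 32 bits
    return ((((src0 & MASK32) << 32) | (src1 & MASK32)) >> (src2 & 31)) & MASK32

def fshl_ref(a, b, amt):
    # reference i32 llvm.fshl semantics
    amt &= 31
    if amt == 0:
        return a & MASK32
    return ((a << amt) & MASK32) | ((b & MASK32) >> (32 - amt))

def fshl_select(a, b, amt):
    # contender 2 from the review: ((amt & 31) != 0) ? alignbit(a, b, -amt) : a
    return alignbit(a, b, (-amt) & MASK32) if (amt & 31) != 0 else a & MASK32

def fshl_swapped(a, b, amt):
    # the patch's rewrite: alignbit(b, a, 32 - amt) -- sources swapped
    return alignbit(b, a, (32 - amt) & MASK32)
```

The select-based form matches the reference for every shift amount, including `amt & 31 == 0`, while the swapped-source rewrite does not (e.g. `fshl_swapped(2, 0, 1)` yields 0 where the reference gives 4), consistent with the mis-compilation flagged in the test diff.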

; This test checks that the address space casts for SPIR-V generic pointer casts
; are lowered correctly by the infer-address-spaces pass.
; RUN: opt < %s -passes=infer-address-spaces -S --mtriple=spirv64-unknown-unknown | FileCheck %s

; Casting a global pointer to a global pointer.
; Casting a global pointer to a global pointer.
Collaborator

Avoid churn caused by unrelated changes.

@@ -1,3 +1,4 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Collaborator

More unrelated changes... make sure to clean up your diff!

@nhaehnle
Copy link
Collaborator

nhaehnle commented Sep 22, 2025

By the way, on that last comment: Please make sure to always do a self-review first. By this I mean re-reading the diff yourself with a critical eye towards correctness, style, etc.; you should catch unrelated changes before asking somebody else to look at it.

Contributor

@arsenm arsenm left a comment

This intrinsic was explicitly removed in 82de129. This should not be reintroduced. It has identical semantics to llvm.fshr. We should not be introducing a new intrinsic for a bespoke handling of SALU vs. VALU optimization.
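The semantic equivalence asserted here can be checked with a small model. In this Python sketch, `alignbit` models v_alignbit_b32 as selecting 32 bits of the 64-bit concatenation {src0:src1} at offset `src2 & 31` (an assumption restated from this thread, not taken from the patch):

```python
MASK32 = 0xFFFFFFFF

def alignbit(src0, src1, src2):
    # models v_alignbit_b32: bits [(src2 & 31) + 31 : src2 & 31] of {src0:src1}
    full = ((src0 & MASK32) << 32) | (src1 & MASK32)
    return (full >> (src2 & 31)) & MASK32

def fshr32(a, b, amt):
    # reference i32 llvm.fshr semantics
    amt &= 31
    if amt == 0:
        return b & MASK32
    return ((b & MASK32) >> amt) | ((a << (32 - amt)) & MASK32)
```

Under this model the two agree for every shift amount (both reduce the amount modulo 32), which is the basis for the objection that the intrinsic duplicates llvm.fshr.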

[LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>, llvm_i32_ty],
[IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<3>>]>;

def int_amdgcn_v_alignbit :
Contributor

The naming convention omits the v prefix in intrinsic names.

Collaborator

I don't know if we've talked about this explicitly, but I think the v prefix makes sense because the entire purpose of the intrinsic is to serve as a pre-selection of sorts of v_alignbit in certain cases.

Contributor

That makes it even worse?

@nhaehnle
Collaborator

nhaehnle commented Oct 1, 2025

This intrinsic was explicitly removed in 82de129. This should not be reintroduced. It has identical semantics to llvm.fshr. We should not be introducing a new intrinsic for a bespoke handling of SALU vs. VALU optimization.

There was a conversation about this before and I don't know if there was more conversation now in the last week, but the current version seems the most pragmatic approach to me.

IIRC, the most natural alternative was to have separate ISel patterns based on divergence, but then certain optimizations would be missed because the pattern selection happens too late. And re-implementing those optimizations just out of some fairly arbitrary notion of purity seems wrong-headed to me.

In general, the fact that we have SelectionDAG + G-MIR + MIR sucks, and I'd say that the more codegen tasks can be done in LLVM IR, the better.

If there is a problem with this strategy, it's that we do not want frontends to generate this intrinsic. Admittedly, the mere existence of the intrinsic makes it tempting for frontends to make this mistake, and probably we should do something about that. At a minimum, the .td file should have a comment to that effect on the intrinsic definition. And perhaps there should be an "IR validation" step early in the codegen pipeline that errors out if the intrinsic is present.

More fundamentally, LLVM isn't set up to distinguish "codegen-only" intrinsics, although we have actually had them forever (llvm.amdgcn.if and friends). How about we make that distinction clearer somehow?

@arsenm
Contributor

arsenm commented Oct 1, 2025

IIRC, the most natural alternative was to have separate ISel patterns based on divergence, but then certain optimizations would be missed because the pattern selection happens too late.

I don't understand why the divergence predicates aren't good enough; it would maybe miss the cases with uniform values in VGPRs, but that's more of a DAG-specific problem that I don't think we should design around.

At a minimum, the .td file should have a comment to that effect on the intrinsic definition.

This would not work at all. We could have instcombine replace uses of the intrinsic.

And perhaps there should be an "IR validation" step early in the codegen pipeline that errors out if the intrinsic is present. More fundamentally, LLVM isn't set up to distinguish "codegen-only" intrinsics, although we have actually had them forever (llvm.amdgcn.if and friends). How about we make that distinction clearer somehow?

I don't think this would be good infrastructure
