5 changes: 5 additions & 0 deletions llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -886,6 +886,11 @@ def int_amdgcn_bitop3 :
[LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>, llvm_i32_ty],
[IntrNoMem, IntrSpeculatable, ImmArg<ArgIndex<3>>]>;

def int_amdgcn_v_alignbit :
Contributor: The naming convention omits the v prefix in intrinsic names.

Collaborator: I don't know if we've talked about this explicitly, but I think the v prefix makes sense because the entire purpose of the intrinsic is to serve as a pre-selection of sorts of v_alignbit in certain cases.

Contributor: That makes it even worse?

DefaultAttrsIntrinsic<[llvm_i32_ty],
[llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
[IntrNoMem, IntrSpeculatable, IntrWillReturn]>;

} // TargetPrefix = "amdgcn"

// New-style image intrinsics
35 changes: 35 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -253,6 +253,7 @@ class AMDGPUCodeGenPrepareImpl
bool visitIntrinsicInst(IntrinsicInst &I);
bool visitFMinLike(IntrinsicInst &I);
bool visitSqrt(IntrinsicInst &I);
bool visitFunnelShift(IntrinsicInst &I);
bool run();
};

@@ -1913,6 +1914,9 @@ bool AMDGPUCodeGenPrepareImpl::visitIntrinsicInst(IntrinsicInst &I) {
return visitFMinLike(I);
case Intrinsic::sqrt:
return visitSqrt(I);
case Intrinsic::fshr:
case Intrinsic::fshl:
return visitFunnelShift(I);
default:
return false;
}
@@ -2103,6 +2107,37 @@ PreservedAnalyses AMDGPUCodeGenPreparePass::run(Function &F,
return PA;
}

bool AMDGPUCodeGenPrepareImpl::visitFunnelShift(IntrinsicInst &I) {
if (!I.getType()->isIntegerTy(32))
return false;

// Only convert divergent operations to v_alignbit
if (UA.isUniform(&I))
return false;

Intrinsic::ID IID = I.getIntrinsicID();
Value *Src0 = I.getOperand(0);
Value *Src1 = I.getOperand(1);
Value *Amt = I.getOperand(2);

IRBuilder<> Builder(&I);
Function *AlignBitFn = Intrinsic::getOrInsertDeclaration(
I.getModule(), Intrinsic::amdgcn_v_alignbit);

Value *AlignBitCall = nullptr;
if (IID == Intrinsic::fshr)
AlignBitCall = Builder.CreateCall(AlignBitFn, {Src0, Src1, Amt});
else if (IID == Intrinsic::fshl) {
Value *InvAmt = Builder.CreateSub(Builder.getInt32(32), Amt);
AlignBitCall = Builder.CreateCall(AlignBitFn, {Src1, Src0, InvAmt});
Comment on lines +2130 to +2132
Collaborator @nhaehnle, Sep 22, 2025:

This is incorrect. See also the comment about a mis-compilation in the tests.

A key property of funnel shifts is that bits of src0 are always more significant than bits of src1 in the output. Therefore, swapping the sources like that cannot be correct.

Looking at the tests, here's what we used to generate for a fully generic fshl:

v_alignbit_b32 v1, v0, v1, 1
v_lshrrev_b32_e32 v0, 1, v0
v_not_b32_e32 v2, v2
v_alignbit_b32 v0, v0, v1, v2

This is basically ({src0 : src1} >> 1) >> (31 - amt) (the outer >> is the final v_alignbit_b32), which I think is kind of what you were going for with this lowering: shifting first by 1 and then by (31 - amt) (the bitwise NOT of shamt) seems like it'd be the same as shifting by 32 - amt.

The problem is that all shifts only use the LSBs to determine the shift amount, and so this lowering will give incorrect results when amt is 0 (or to be more precise, when amt & 31 == 0).

I'm not actually sure what the best lowering here is. There are two kinds of contenders:

  1. The above sequence, which you can potentially get by forming (in LLVM IR) a 64-bit value out of the sources and then doing first a shift by 1 and then a shift by not Amt.
  2. A sequence ((amt & 31) != 0) ? v.align.bit(src0, src1, -amt) : src0

The second sequence should become something like:

v_and_b32 v3, 31, v2
v_sub_b32 v6, 0, v2
v_alignbit_b32 v5, v0, v1, v6
v_cmp_eq_b32 vcc_lo, 0, v2
v_cndmask_b32 v0, v0, v5, vcc_lo

This is worse than the sequence above (though there are probably many cases where the v_and_b32 can be optimized away thanks to known bits analysis). But! It is somewhat natural to have a uniform or even constant shift amount, and I don't know how well isel would be able to optimize those cases if you used the first option for the IR here. (I suspect it might be able to optimize the constant case, but not the uniform case.) With the version of LLVM IR that corresponds to this alternative, a uniform shift amount should lead to only 2x VALU (v_alignbit_b32 + v_cndmask_b32) and a bunch of SALU, which is certainly better than 4x VALU.

So you may want to do a case distinction here based on whether the shift amount is uniform. However, the more important thing is that you try a couple of variants of this to find out which is really best. There may be a variation I haven't thought of.

(I strongly recommend not doing an explicit case distinction based on whether amt is constant, since we have uniform analysis here. UA catches the constant case as well, and automatic constant folding in the builder should do the rest.)

} else
return false;
Collaborator:

This should never happen, and our policy is to fail fast and loud rather than to try to continue somehow when something that shouldn't happen happens. (To go along with this, our policy is to use release builds with assertions enabled for a lot of testing -- you should definitely make sure you do the same in case you're running check-* targets with release builds.)

In this particular case, there are one of two idiomatic patterns you can use:

  if (IID == Intrinsic::fshr) {
    ...
  } else {
    assert(IID == Intrinsic::fshl);
    ...
  }

or

  if (IID == Intrinsic::fshr) {
    ...
  } else if (IID == Intrinsic::fshl) {
    ...
  } else {
    llvm_unreachable("bad intrinsic");
  }

In this case, I'd prefer the former due to the obvious left/right symmetry.


I.replaceAllUsesWith(AlignBitCall);
I.eraseFromParent();
return true;
}

INITIALIZE_PASS_BEGIN(AMDGPUCodeGenPrepare, DEBUG_TYPE,
"AMDGPU IR optimizations", false, false)
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
10 changes: 7 additions & 3 deletions llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -487,12 +487,16 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
setOperationAction({ISD::ADDC, ISD::SUBC, ISD::ADDE, ISD::SUBE}, VT, Legal);
}

-  // The hardware supports 32-bit FSHR, but not FSHL.
-  setOperationAction(ISD::FSHR, MVT::i32, Legal);
+  if (Subtarget->isGCN()) {
+    setOperationAction(ISD::FSHR, MVT::i32, Expand);
+    setOperationAction(ISD::ROTR, {MVT::i32, MVT::i64}, Expand);
+  } else {
+    setOperationAction(ISD::FSHR, MVT::i32, Legal);
+    setOperationAction(ISD::ROTR, {MVT::i32, MVT::i64}, Legal);
+  }

   // The hardware supports 32-bit ROTR, but not ROTL.
   setOperationAction(ISD::ROTL, {MVT::i32, MVT::i64}, Expand);
-  setOperationAction(ISD::ROTR, MVT::i64, Expand);

setOperationAction({ISD::MULHU, ISD::MULHS}, MVT::i16, Expand);

1 change: 0 additions & 1 deletion llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -2064,7 +2064,6 @@ AMDGPULegalizerInfo::AMDGPULegalizerInfo(const GCNSubtarget &ST_,

// TODO: Only Try to form v2s16 with legal packed instructions.
getActionDefinitionsBuilder(G_FSHR)
-    .legalFor({{S32, S32}})
.lowerFor({{V2S16, V2S16}})
.clampMaxNumElementsStrict(0, S16, 2)
.scalarize(0)
1 change: 1 addition & 0 deletions llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -4824,6 +4824,7 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
case Intrinsic::amdgcn_perm_pk16_b4_u4:
case Intrinsic::amdgcn_perm_pk16_b6_u4:
case Intrinsic::amdgcn_perm_pk16_b8_u4:
+  case Intrinsic::amdgcn_v_alignbit:
return getDefaultMappingVOP(MI);
case Intrinsic::amdgcn_log:
case Intrinsic::amdgcn_exp2:
32 changes: 32 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInstructions.td
Expand Up @@ -2612,6 +2612,38 @@ def : GCNPat<(fshr i32:$src0, i32:$src1, i32:$src2),
>;
} // end True16Predicate = UseFakeTrue16Insts

let True16Predicate = NotHasTrue16BitInsts in {
let SubtargetPredicate = isNotGFX9Plus in {
def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
(V_ALIGNBIT_B32_e64 VSrc_b32:$src0, VSrc_b32:$src1, VSrc_b32:$src2)>;
} // isNotGFX9Plus

let SubtargetPredicate = isGFX9GFX10 in {
def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
(V_ALIGNBIT_B32_opsel_e64 /* src0_modifiers */ 0, $src0,
/* src1_modifiers */ 0, $src1,
/* src2_modifiers */ 0,
$src2, /* clamp */ 0, /* op_sel */ 0)>;
} // isGFX9GFX10
} // end True16Predicate = NotHasTrue16BitInsts

let True16Predicate = UseRealTrue16Insts in {
def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
(V_ALIGNBIT_B32_t16_e64 /* src0_modifiers */ 0, $src0,
/* src1_modifiers */ 0, $src1,
/* src2_modifiers */ 0,
(EXTRACT_SUBREG $src2, lo16),
/* clamp */ 0, /* op_sel */ 0)>;
} // end True16Predicate = UseRealTrue16Insts

let True16Predicate = UseFakeTrue16Insts in {
def : GCNPat<(i32 (int_amdgcn_v_alignbit i32:$src0, i32:$src1, i32:$src2)),
(V_ALIGNBIT_B32_fake16_e64 /* src0_modifiers */ 0, $src0,
/* src1_modifiers */ 0, $src1,
/* src2_modifiers */ 0,
$src2, /* clamp */ 0, /* op_sel */ 0)>;
} // end True16Predicate = UseFakeTrue16Insts

/********** ====================== **********/
/********** Indirect addressing **********/
/********** ====================== **********/