Conversation

arsenm (Contributor) commented Jan 13, 2026

This fixes two cases that surface when the AMDGPU ABI is changed to pass <2 x i16>
values as packed on gfx6/gfx7. The ABI does not currently pack these values;
this is a preparatory fix for that change.

Insert a bitcast if there is a single part with a different size.
Previously this would miscompile by going through the scalarization
and extend path, dropping the high element.
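
A minimal MIR sketch of the single-part case (hypothetical register names; assuming a <2 x s16> source assigned to one packed s32 part, as on gfx6/gfx7 after the ABI change):

    ; Before: the scalarize-and-extend path kept only element 0.
    %e0:_(s16), %e1:_(s16) = G_UNMERGE_VALUES %val(<2 x s16>)
    %part:_(s32) = G_ANYEXT %e0(s16)   ; high element %e1 is dropped
    ; After: a single bitcast preserves both packed elements.
    %part:_(s32) = G_BITCAST %val(<2 x s16>)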

Also fix assertion failures in odd cases, like <3 x i16> -> i32; this needs
to unmerge with excess elements from the widened source vector.

All of this code is in need of a cleanup; this should look more
like the DAG version using getVectorTypeBreakdown.

llvmbot (Member) commented Jan 13, 2026

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-llvm-globalisel

Author: Matt Arsenault (arsenm)

Full diff: https://github.com/llvm/llvm-project/pull/175780.diff

1 file affected:

  • (modified) llvm/lib/CodeGen/GlobalISel/CallLowering.cpp (+24-2)
diff --git a/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp b/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
index e2ed45eec0ecd..0da360d8038b6 100644
--- a/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/CallLowering.cpp
@@ -568,6 +568,13 @@ static void buildCopyToRegs(MachineIRBuilder &B, ArrayRef<Register> DstRegs,
 
   const TypeSize PartSize = PartTy.getSizeInBits();
 
+  if (PartSize == SrcTy.getSizeInBits() && DstRegs.size() == 1) {
+    // TODO: Handle int<->ptr casts. It just happens the ABI lowering
+    // assignments are not pointer aware.
+    B.buildBitcast(DstRegs[0], SrcReg);
+    return;
+  }
+
   if (PartTy.isVector() == SrcTy.isVector() &&
       PartTy.getScalarSizeInBits() > SrcTy.getScalarSizeInBits()) {
     assert(DstRegs.size() == 1);
@@ -576,9 +583,11 @@ static void buildCopyToRegs(MachineIRBuilder &B, ArrayRef<Register> DstRegs,
   }
 
   if (SrcTy.isVector() && !PartTy.isVector() &&
-      TypeSize::isKnownGT(PartSize, SrcTy.getElementType().getSizeInBits())) {
+      TypeSize::isKnownGT(PartSize, SrcTy.getElementType().getSizeInBits()) &&
+      SrcTy.getElementCount() == ElementCount::getFixed(DstRegs.size())) {
     // Vector was scalarized, and the elements extended.
     auto UnmergeToEltTy = B.buildUnmerge(SrcTy.getElementType(), SrcReg);
+
     for (int i = 0, e = DstRegs.size(); i != e; ++i)
       B.buildAnyExt(DstRegs[i], UnmergeToEltTy.getReg(i));
     return;
@@ -645,9 +654,22 @@ static void buildCopyToRegs(MachineIRBuilder &B, ArrayRef<Register> DstRegs,
     }
   }
 
-  if (LCMTy.isVector() && CoveringSize != SrcSize)
+  if (LCMTy.isVector() && CoveringSize != SrcSize) {
     UnmergeSrc = B.buildPadVectorWithUndefElements(LCMTy, SrcReg).getReg(0);
 
+    unsigned ExcessBits = CoveringSize - DstSize * DstRegs.size();
+    if (ExcessBits != 0) {
+      SmallVector<Register, 8> PaddedDstRegs(DstRegs.begin(), DstRegs.end());
+
+      MachineRegisterInfo &MRI = *B.getMRI();
+      for (unsigned I = 0; I != ExcessBits; I += PartSize)
+        PaddedDstRegs.push_back(MRI.createGenericVirtualRegister(PartTy));
+
+      B.buildUnmerge(PaddedDstRegs, UnmergeSrc);
+      return;
+    }
+  }
+
   B.buildUnmerge(DstRegs, UnmergeSrc);
 }
 

arsenm marked this pull request as ready for review January 13, 2026 15:30
Base automatically changed from users/arsenm/amdgpu/fix-bf16-register-type-for-calling-conv to main January 13, 2026 16:48
arsenm force-pushed the users/arsenm/globalisel/fix-mishandling-bitcast-vector-to-scalar-return branch from 186d520 to aa92839 on January 13, 2026 16:51
arsenm force-pushed the users/arsenm/globalisel/fix-mishandling-bitcast-vector-to-scalar-return branch from aa92839 to 29cc6a8 on January 13, 2026 20:09
aemerson (Contributor) commented:

Do we have a test for this or will it be exercised later?

arsenm (Author) commented Jan 17, 2026

> Do we have a test for this or will it be exercised later?

Later; this combination isn't used until #175781.

arsenm (Author) commented Jan 22, 2026

ping

Comment on lines 657 to 671
if (LCMTy.isVector() && CoveringSize != SrcSize) {
UnmergeSrc = B.buildPadVectorWithUndefElements(LCMTy, SrcReg).getReg(0);

unsigned ExcessBits = CoveringSize - DstSize * DstRegs.size();
if (ExcessBits != 0) {
SmallVector<Register, 8> PaddedDstRegs(DstRegs.begin(), DstRegs.end());

MachineRegisterInfo &MRI = *B.getMRI();
for (unsigned I = 0; I != ExcessBits; I += PartSize)
PaddedDstRegs.push_back(MRI.createGenericVirtualRegister(PartTy));

B.buildUnmerge(PaddedDstRegs, UnmergeSrc);
return;
}
}
petar-avramovic (Collaborator) commented Jan 22, 2026

    %23:_(s16), %24:_(s16), %25:_(s16) = G_UNMERGE_VALUES %20(<3 x s16>)
    %26:_(s16) = G_IMPLICIT_DEF
    %27:_(<6 x s16>) = G_BUILD_VECTOR %23(s16), %24(s16), %25(s16), %26(s16), %26(s16), %26(s16)
    %21:_(s32), %22:_(s32), %28:_(s32) = G_UNMERGE_VALUES %27(<6 x s16>)
    $vgpr0 = COPY %21(s32)
    $vgpr1 = COPY %22(s32)

Not sure how this part is meant to work, but padding DstRegs looks wrong. It is asked to return <3 x s16> in 2 s32 registers (DstRegs); LCMTy seems to produce a larger type than needed, and we return in more registers than the target asked for:

SmallVector<Register, 8> PaddedDstRegs(DstRegs.begin(), DstRegs.end());

This might be getting into target-specific territory. Instead of LCMTy, we need to calculate how many elements to pad SrcTy with so that its size equals DstRegs.size() * PartTy.getSizeInBits().

petar-avramovic (Collaborator):

For example, when LCMTy is calculated like this, there is no need to pad DstRegs:

  LLT LCMTy = getCoverTy(SrcTy, PartTy);
  if (SrcTy.isVector() && DstRegs.size() > 1) {
    // Total size covered by the destination registers.
    unsigned CoverSize = DstRegs.size() * PartTy.getSizeInBits();
    LLT EltTy = SrcTy.getElementType();
    unsigned EltSize = EltTy.getSizeInBits();
    if (CoverSize % EltSize == 0)
      LCMTy = LLT::fixed_vector(CoverSize / EltSize, EltTy);
  }
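
With the numbers from this thread (returning <3 x s16> in two s32 registers), this gives CoverSize = 2 * 32 = 64 and EltSize = 16, so LCMTy = <4 x s16>, which unmerges into exactly two s32 parts with no padding registers.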

arsenm (Author):

This returns exactly as many registers as the target asked for; the padding is just to conform to the requirements of the unmerge.

e.g. for this:

define <3 x i16> @ret_v3i16() {
  ret <3 x i16> poison
}

We get:

    %8:_(<3 x s16>) = G_IMPLICIT_DEF
    %11:_(s16), %12:_(s16), %13:_(s16) = G_UNMERGE_VALUES %8(<3 x s16>)
    %14:_(s16) = G_IMPLICIT_DEF
    %15:_(<6 x s16>) = G_BUILD_VECTOR %11(s16), %12(s16), %13(s16), %14(s16), %14(s16), %14(s16)
    %9:_(s32), %10:_(s32), %16:_(s32) = G_UNMERGE_VALUES %15(<6 x s16>)
    $vgpr0 = COPY %9(s32)
    $vgpr1 = COPY %10(s32)
    SI_RETURN implicit $vgpr0, implicit $vgpr1

The padding is the unused %16.

petar-avramovic (Collaborator):

What are the requirements of the unmerge? If LCMTy is calculated in a different way (see the comment above), there is no need for the if (ExcessBits != 0) path.

arsenm (Author):

The result pieces need to cover the input piece, which is the widened LCMTy, so you need the extra unused result register.
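
Concretely: with LCMTy = <6 x s16> the unmerge input is 96 bits, while the two s32 destination registers cover only 64, so a third, unused s32 result (the %16 above) is needed to consume the remaining 32 bits.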

petar-avramovic (Collaborator):

If LCMTy is calculated in a better way (giving <4 x s16> instead of <6 x s16>), we can get:

    %8:_(<3 x s16>) = G_IMPLICIT_DEF
    %11:_(s16), %12:_(s16), %13:_(s16) = G_UNMERGE_VALUES %8(<3 x s16>)
    %14:_(s16) = G_IMPLICIT_DEF
    %15:_(<4 x s16>) = G_BUILD_VECTOR %11(s16), %12(s16), %13(s16), %14(s16)
    %9:_(s32), %10:_(s32) = G_UNMERGE_VALUES %15(<4 x s16>)
    $vgpr0 = COPY %9(s32)
    $vgpr1 = COPY %10(s32)
    SI_RETURN implicit $vgpr0, implicit $vgpr1

which does not require padding DstRegs; we can just unmerge.

arsenm (Author):

OK, I have your suggestion working. It happens to give worse codegen in two tests, but that's a downstream issue.

petar-avramovic (Collaborator):

Will look into the regressions; it is something in the legalizer/artifact combiner.

arsenm force-pushed the users/arsenm/globalisel/fix-mishandling-bitcast-vector-to-scalar-return branch from 29cc6a8 to 85d881b on January 22, 2026 14:23
petar-avramovic (Collaborator) left a comment

LGTM, thanks

arsenm merged commit 4a3f33d into main Jan 22, 2026
8 of 11 checks passed
arsenm deleted the users/arsenm/globalisel/fix-mishandling-bitcast-vector-to-scalar-return branch January 22, 2026 16:24
Harrish92 pushed a commit to Harrish92/llvm-project that referenced this pull request Jan 23, 2026
Harrish92 pushed a commit to Harrish92/llvm-project that referenced this pull request Jan 24, 2026