
[llvm] Vector data handled incorrectly when using -Os, -mfloat-abi=softfp, --target=armebv7-linux #102418

Closed
Zhenhang1213 opened this issue Aug 8, 2024 · 16 comments · Fixed by #105519

Comments

@Zhenhang1213
Contributor

Zhenhang1213 commented Aug 8, 2024

I found that the results on armebv7-linux differ from those on armv7.
Code:

#include <stdio.h> 
volatile int one = 1;

int main (int argc, char *argv[]) {
    __attribute__((vector_size((8)*sizeof(short)))) short v0 = {one, 1, 2, 3, 4, 5, 6, 7};
    __attribute__((vector_size((8)*sizeof(short)))) short v1;
    v1 = v0 % 2; 
    for (int __i = 0; __i < 8; __i++) { 
        printf("v1: %d\n", (*((short *) &(v1) + __i))); 
    }
    return 0;
}

When __i is 4, the result differs. I think the instruction vrev32.16 d17, d18 is wrong; when I change it to vmov.16 d17, d18, the result is correct. However, I don't know how to fix this bug.

https://godbolt.org/z/EqYebY73z
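For reference, the expected output of this program (every element of v0 taken modulo 2, which is what the little-endian build prints) is:

v1: 1
v1: 1
v1: 0
v1: 1
v1: 0
v1: 1
v1: 0
v1: 1

On armebv7 the element at __i == 4 comes out wrong.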

@llvmbot

llvmbot commented Aug 8, 2024

@llvm/issue-subscribers-backend-arm

Author: Austin (Zhenhang1213)


@davemgreen
Collaborator

#97782 recently fixed a similar sounding issue, but it doesn't look like the code has changed on trunk https://godbolt.org/z/7WYn38xTs.

@Zhenhang1213
Contributor Author

#97782 recently fixed a similar sounding issue, but it doesn't look like the code has changed on trunk https://godbolt.org/z/7WYn38xTs.

Yes, I compiled clang with this patch, and the generated code is the same.

@hstk30-hw
Contributor

Same problem, but a different situation.
The vrev32.16 d17, d18 is unnecessary; replacing

vmov.i32        d18, #0x10000
vrev32.16       d17, d18

with

vmov.i32        d17, #0x10000

gives the correct result.
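To see why the vrev is actively harmful here rather than just redundant, here is a small standalone sketch (an illustration, not compiler output): vrev32.16 swaps the two 16-bit halves inside each 32-bit word, so the splatted #0x10000 becomes #0x1 in each word and the 16-bit value 1 lands in the wrong lane.

// Illustration only: simulate vrev32.16 on a 64-bit D-register value.
#include <cstdint>
#include <cstdio>

// Swap the two 16-bit elements inside each 32-bit word, as vrev32.16 does.
static uint64_t vrev32_16(uint64_t d) {
  uint64_t out = 0;
  for (int word = 0; word < 2; ++word) {
    uint32_t w = (uint32_t)(d >> (32 * word));
    uint32_t swapped = (w << 16) | (w >> 16);
    out |= (uint64_t)swapped << (32 * word);
  }
  return out;
}

int main() {
  uint64_t d18 = 0x0001000000010000ULL; // vmov.i32 d18, #0x10000
  printf("before vrev32.16: 0x%016llx\n", (unsigned long long)d18);
  printf("after  vrev32.16: 0x%016llx\n", (unsigned long long)vrev32_16(d18));
  // prints 0x0000000100000001: the 1 has moved to the other 16-bit lane.
  return 0;
}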

@Zhenhang1213
Contributor Author

Same problem, but a different situation. The vrev32.16 d17, d18 is unnecessary; replacing

vmov.i32        d18, #0x10000
vrev32.16       d17, d18

with

vmov.i32        d17, #0x10000

gives the correct result.

@davemgreen could you give me some ideas? I've been puzzled by this problem for a long time.

@davemgreen
Collaborator

Hi, it honestly looks like vector constants are generated incorrectly if they can be specified with a larger movi size (i.e. using a vmov.i32 for an i16 constant splat).
https://godbolt.org/z/oPczbxbvq

That may be related to #68212, we perhaps should be passing IsBigEndian to isConstantSplat in LowerBUILD_VECTOR. The bitcast of the VMOVIMM also looks suspicious, and there might be a problem in PerformBITCASTCombine where it is ignoring casts that it should not. There might be cases where we get it wrong in two places that usually cancel each other out.
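For reference (my reading of the current headers, so double-check against your tree), the splat helper already takes an endianness flag, and the ARM call site currently leaves it at the little-endian default:

// Approximate declaration from llvm/include/llvm/CodeGen/SelectionDAGNodes.h:
bool isConstantSplat(APInt &SplatValue, APInt &SplatUndef,
                     unsigned &SplatBitSize, bool &HasAnyUndefs,
                     unsigned MinSplatBits = 0,
                     bool isBigEndian = false) const;

// The call in ARMTargetLowering::LowerBUILD_VECTOR does not pass the flag,
// so the splat value is always computed in little-endian lane order:
//   if (BVN->isConstantSplat(SplatBits, SplatUndef, SplatBitSize, HasAnyUndefs))
//     ...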

@Zhenhang1213
Contributor Author

Zhenhang1213 commented Aug 10, 2024

Hi, it honestly looks like vector constants are generated incorrectly if they can be specified with a larger movi size (i.e. using a vmov.i32 for an i16 constant splat). https://godbolt.org/z/oPczbxbvq

That may be related to #68212, we perhaps should be passing IsBigEndian to isConstantSplat in LowerBUILD_VECTOR. The bitcast of the VMOVIMM also looks suspicious, and there might be a problem in PerformBITCASTCombine where it is ignoring casts that it should not. There might be cases where we get it wrong in two places that usually cancel each other out.

However, I find that LLVM 15 is also wrong without this patch.

@Zhenhang1213
Contributor Author

Zhenhang1213 commented Aug 15, 2024

Hi, it honestly looks like vector constants are generated incorrectly if they can be specified with a larger movi size (i.e. using a vmov.i32 for an i16 constant splat). https://godbolt.org/z/oPczbxbvq

That may be related to #68212, we perhaps should be passing IsBigEndian to isConstantSplat in LowerBUILD_VECTOR. The bitcast of the VMOVIMM also looks suspicious, and there might be a problem in PerformBITCASTCombine where it is ignoring casts that it should not. There might be cases where we get it wrong in two places that usually cancel each other out.

if (Val.getNode()) {
  SDValue Vmov = DAG.getNode(ARMISD::VMOVIMM, dl, VmovVT, Val);
  return DAG.getNode(ISD::BITCAST, dl, VT, Vmov);
}

Is this code for endianness adjustment? In this scenario, should I adjust the lower part first and then fix the byte order after CONCAT_VECTORS?

@davemgreen
Collaborator

davemgreen commented Aug 15, 2024

I don't believe that #68212 broke or fixed anything on its own, as we don't set the new parameter. We might need to add the IsBigEndian parameter with the correct value in the call to isConstantSplat in ARMISelLowering. Then change the BITCAST you point to above into a VECTOR_REG_CAST and then see what else changes and if there are any other follow-on fixes that are needed. From what I remember from looking into it, the PerformBITCASTCombine function might need to be tightened up when there is BITCAST(VECTOR_REG_CAST). The important thing is to test it to make sure we are now getting the right results.

Does that sound do-able? Big-endian is unfortunately much less used than the rest so doesn't get as much testing.
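To make the BITCAST/VECTOR_REG_CAST distinction concrete, here is a tiny standalone simulation (illustration only, not backend code). ISD::BITCAST must preserve the in-memory byte image, which on big-endian implies a lane reshuffle (the VREVs), while ARMISD::VECTOR_REG_CAST just reinterprets the register with lane 0 staying in the low bits:

// Illustration only: v4i16 reinterpreted as v8i8 on a big-endian target.
// Model a 64-bit NEON D register as a uint64_t whose low bits are lane 0,
// which is how the hardware numbers lanes regardless of endianness.
#include <cstdint>
#include <cstdio>

int main() {
  // v4i16 {1, 2, 3, 4}: i16 lane i lives in bits [16*i+15 : 16*i].
  uint64_t reg = 0;
  for (uint64_t i = 0; i < 4; ++i)
    reg |= (i + 1) << (16 * i);

  // VECTOR_REG_CAST to v8i8: reinterpret the register; byte lane 0 is the
  // low byte of i16 lane 0.
  unsigned regcast_lane0 = (unsigned)(reg & 0xff);        // 0x01

  // BITCAST to v8i8 must match the big-endian memory image
  // 00 01 00 02 00 03 00 04, so byte lane 0 is the high byte of i16 lane 0.
  unsigned bitcast_lane0 = (unsigned)((reg >> 8) & 0xff); // 0x00

  printf("vector_reg_cast lane0 = 0x%02x, bitcast lane0 = 0x%02x\n",
         regcast_lane0, bitcast_lane0);
  return 0;
}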

@Zhenhang1213
Contributor Author

Zhenhang1213 commented Aug 16, 2024

I don't believe that #68212 broke or fixed anything on its own, as we don't set the new parameter. We might need to add the IsBigEndian parameter with the correct value in the call to isConstantSplat in ARMISelLowering. Then change the BITCAST you point to above into a VECTOR_REG_CAST and then see what else changes and if there are any other follow-on fixes that are needed. From what I remember from looking into it, the PerformBITCASTCombine function might need to be tightened up when there is BITCAST(VECTOR_REG_CAST). The important thing is to test it to make sure we are now getting the right results.

Does that sound do-able? Big-endian is unfortunately much less used than the rest so doesn't get as much testing.

Yes, I tried to modify it and some tests failed. I think there are some problems with the modified behavior, such as this case:

; RUN: llc -mtriple armeb-eabi -mattr=armv8.2-a,neon,fullfp16 -target-abi=aapcs-gnu -float-abi hard -o - %s | FileCheck %s
define void @conv_v4i16_to_v4f16( <4 x i16> %a, <4 x half>* %store ) {
; CHECK-LABEL: conv_v4i16_to_v4f16:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vmov.i64 d16, #0xffff00000000ffff
; CHECK-NEXT:    vldr d17, [r0]
; CHECK-NEXT:    vrev64.16 d18, d0
; CHECK-NEXT:    vrev64.16 d17, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vadd.i16 d16, d18, d16
; CHECK-NEXT:    vadd.f16 d16, d16, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vstr d16, [r0]
; CHECK-NEXT:    bx lr
entry:
  %c = add <4 x i16> %a, <i16 -1, i16 0, i16 0, i16 -1>
  %v = bitcast <4 x i16> %c to <4 x half>
  %w = load <4 x half>, <4 x half>* %store
  %z = fadd <4 x half> %v, %w
  store <4 x half> %z, <4 x half>* %store
  ret void
}

and the result is

vmov.i64        d16, #0xffff00000000ffff
vldr    d17, [r0]
vrev64.16       d18, d0
vadd.i16        d16, d18, d16
vrev64.16       d17, d17
vadd.f16        d16, d16, d17
vrev64.16       d16, d16
vstr    d16, [r0]
bx      lr

The logic of the two parts is not the same. When I debug the PerformVECTOR_REG_CASTCombine function, Op->getOpcode() is not ARMISD::VECTOR_REG_CAST, so it returns SDValue();

@Zhenhang1213
Contributor Author

I don't believe that #68212 broke or fixed anything on its own, as we don't set the new parameter. We might need to add the IsBigEndian parameter with the correct value in the call to isConstantSplat in ARMISelLowering. Then change the BITCAST you point to above into a VECTOR_REG_CAST and then see what else changes and if there are any other follow-on fixes that are needed. From what I remember from looking into it, the PerformBITCASTCombine function might need to be tightened up when there is BITCAST(VECTOR_REG_CAST). The important thing is to test it to make sure we are now getting the right results.

Does that sound do-able? Big-endian is unfortunately much less used than the rest so doesn't get as much testing.

After adding the IsBigEndian parameter with the correct value in the call to isConstantSplat in ARMISelLowering, and removing the big-endian adjustment for 64-bit VMOV in isVMOVModifiedImm, for this IR:

entry:
  %0 = bitcast <4 x i32> %src to <16 x i8>
  %r = and <16 x i8> %0, <i8 0, i8 0, i8 0, i8 1, i8 0, i8 0, i8 0, i8 1, i8 0, i8 0, i8 0, i8 1, i8 0, i8 0, i8 0, i8 1>
  ret <16 x i8> %r

the generated instructions are, I think, incorrect:


vrev64.8        q1, q0
vmov.i32        q0, #0x1
vrev32.8        q0, q0
vand            q1, q1, q0
vrev64.8        q0, q1
bx              lr

Could you give me more advice? Modifying BITCAST to VECTOR_REG_CAST brings more failures. Thanks.

@davemgreen
Collaborator

I was taking another look and using the IsBigEndian parameter to isConstantSplat might have been a mistake to suggest. I think it would make more sense to always have the constants produced in the "vector-lane-order" (same as little-endian, lowest lane at the bottom), which means not passing isBigEndian through to isConstantSplat.

The constants we create then don't need to be bitcast to change the lane order; they just need to be vector_reg_cast. Apparently i64 is for some reason special and the vector elements get reversed under

if (DAG.getDataLayout().isBigEndian()) {

That should be removed so we are consistent about which order the lanes appear in.

Could you put a patch together with those changes? We can go through the test differences to make sure they all seem correct. I'm pretty sure that something will need to change in PerformBITCASTCombine in the optimization of bitcast(vector_reg_cast(vmovimm)).

@Zhenhang1213
Contributor Author

Zhenhang1213 commented Aug 21, 2024

I was taking another look and using the IsBigEndian parameter to isConstantSplat might have been a mistake to suggest. I think it would make more sense to always have the constants produced in the "vector-lane-order" (same as little-endian, lowest lane at the bottom), which means not passing isBigEndian through to isConstantSplat.

The constants we create then don't need to be bitcast to change the lane order, they just need to be vector_reg_cast. Apparently i64 are for some reason special and the vector elements get reversed in

if (DAG.getDataLayout().isBigEndian()) {

, that should be removed so we are consistent about which order lanes appear.
Could you put a patch together with those changes? We can go through the test differences to make sure they all seem correct. I'm pretty sure that something will need to change in PerformBITCASTCombine in the optimization of bitcast(vector_reg_cast(vmovimm)).

I have tried these modifications.

diff --git a/llvm/lib/Target/ARM/ARMISelLowering.cpp b/llvm/lib/Target/ARM/ARMISelLowering.cpp
index 4c24d7020932..d864ba00a02b 100644
--- a/llvm/lib/Target/ARM/ARMISelLowering.cpp
+++ b/llvm/lib/Target/ARM/ARMISelLowering.cpp
@@ -7011,19 +7011,6 @@ static SDValue isVMOVModifiedImm(uint64_t SplatBits, uint64_t SplatUndef,
       ImmMask <<= 1;
     }

-    if (DAG.getDataLayout().isBigEndian()) {
-      // Reverse the order of elements within the vector.
-      unsigned BytesPerElem = VectorVT.getScalarSizeInBits() / 8;
-      unsigned Mask = (1 << BytesPerElem) - 1;
-      unsigned NumElems = 8 / BytesPerElem;
-      unsigned NewImm = 0;
-      for (unsigned ElemNum = 0; ElemNum < NumElems; ++ElemNum) {
-        unsigned Elem = ((Imm >> ElemNum * BytesPerElem) & Mask);
-        NewImm |= Elem << (NumElems - ElemNum - 1) * BytesPerElem;
-      }
-      Imm = NewImm;
-    }
-
     // Op=1, Cmode=1110.
     OpCmode = 0x1e;
     VT = is128Bits ? MVT::v2i64 : MVT::v1i64;
@@ -7831,7 +7818,7 @@ SDValue ARMTargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG,

       if (Val.getNode()) {
         SDValue Vmov = DAG.getNode(ARMISD::VMOVIMM, dl, VmovVT, Val);
-        return DAG.getNode(ISD::BITCAST, dl, VT, Vmov);
+        return DAG.getNode(ARMISD::VECTOR_REG_CAST, dl, VT, Vmov);
       }

       // Try an immediate VMVN.
@@ -18292,7 +18279,6 @@ static SDValue PerformBITCASTCombine(SDNode *N,
   if ((Src.getOpcode() == ARMISD::VMOVIMM ||
        Src.getOpcode() == ARMISD::VMVNIMM ||
        Src.getOpcode() == ARMISD::VMOVFPIMM) &&
-      SrcVT.getScalarSizeInBits() <= DstVT.getScalarSizeInBits() &&
       DAG.getDataLayout().isBigEndian())
     return DAG.getNode(ARMISD::VECTOR_REG_CAST, SDLoc(N), DstVT, Src);

The test big-endian-neon-fp16-bitconv.ll fails. I think PerformVECTOR_REG_CASTCombine should be changed next, right?

@Zhenhang1213
Contributor Author

Zhenhang1213 commented Aug 21, 2024

I was taking another look and using the IsBigEndian parameter to isConstantSplat might have been a mistake to suggest. I think it would make more sense to always have the constants produced in the "vector-lane-order" (same as little-endian, lowest lane at the bottom), which means not passing isBigEndian through to isConstantSplat.

The constants we create then don't need to be bitcast to change the lane order, they just need to be vector_reg_cast. Apparently i64 are for some reason special and the vector elements get reversed in

if (DAG.getDataLayout().isBigEndian()) {

, that should be removed so we are consistent about which order lanes appear.
Could you put a patch together with those changes? We can go through the test differences to make sure they all seem correct. I'm pretty sure that something will need to change in PerformBITCASTCombine in the optimization of bitcast(vector_reg_cast(vmovimm)).

After this patch, the vrevs after the vmov constants are all removed, so those tests fail, like this:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple armeb-eabi -mattr=armv8.2-a,neon,fullfp16 -target-abi=aapcs-gnu -float-abi hard -o - %s | FileCheck %s
define void @conv_v4i16_to_v4f16( <4 x i16> %a, <4 x half>* %store ) {
; CHECK-LABEL: conv_v4i16_to_v4f16:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vmov.i64 d16, #0xffff00000000ffff
; CHECK-NEXT:    vldr d17, [r0]
; CHECK-NEXT:    vrev64.16 d18, d0
; CHECK-NEXT:    vrev64.16 d17, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vadd.i16 d16, d18, d16
; CHECK-NEXT:    vadd.f16 d16, d16, d17
; CHECK-NEXT:    vrev64.16 d16, d16
; CHECK-NEXT:    vstr d16, [r0]
; CHECK-NEXT:    bx lr
entry:
  %c = add <4 x i16> %a, <i16 -1, i16 0, i16 0, i16 -1>
  %v = bitcast <4 x i16> %c to <4 x half>
  %w = load <4 x half>, <4 x half>* %store
  %z = fadd <4 x half> %v, %w
  store <4 x half> %z, <4 x half>* %store
  ret void
}

the result is:

vmov.i64        d16, #0xffff00000000ffff
vldr    d17, [r0]
vrev64.16       d18, d0
vadd.i16        d16, d18, d16
vrev64.16       d17, d17
vadd.f16        d16, d16, d17
vrev64.16       d16, d16
vstr    d16, [r0]
bx      lr

@davemgreen
Collaborator

Is that incorrect now, or was that vrev64.16 unnecessary?

The PerformBITCASTCombine part probably needs to be more restrictive, not less. If there is:

%a = v2f64 vmovimm
%b = v16i8 vector_reg_cast %a
%c = v2i64 bitcast %b

We need to keep the v16i8->v2i64 bitcast to keep the lane shuffling, even if the original type elt size is <= the final type elt size.
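A rough sketch of the kind of extra restriction being described (an assumption about the shape of the change, not the patch that eventually landed): recognise bitcast(vector_reg_cast(vmovimm)) explicitly and keep the BITCAST rather than folding it, so the lane shuffle survives.

// Hypothetical guard inside PerformBITCASTCombine (sketch only, conservative):
// if the bitcast source is a VECTOR_REG_CAST of a constant-immediate node,
// keep the BITCAST so the big-endian lane shuffle is not lost.
if (Src.getOpcode() == ARMISD::VECTOR_REG_CAST &&
    DAG.getDataLayout().isBigEndian()) {
  unsigned InnerOpc = Src.getOperand(0).getOpcode();
  if (InnerOpc == ARMISD::VMOVIMM || InnerOpc == ARMISD::VMVNIMM ||
      InnerOpc == ARMISD::VMOVFPIMM)
    return SDValue(); // leave the v16i8 -> v2i64 bitcast in place
}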

@Zhenhang1213
Contributor Author

Is that incorrect now, or was that vrev64.16 unnecessary?

The PerformBITCASTCombine part probably needs to be more restrictive, not less. If there is:

%a = v2f64 vmovimm
%b = v16i8 vector_reg_cast %a
%c = v2i64 bitcast %b

We need to keep the v16i8->v2i64 bitcast to keep the lane shuffling, even if the original type elt size is <= the final type elt size.

Compared to the original case, the vrev64.16 is unnecessary; in this case, the result is the same:
https://godbolt.org/z/zYv866sMY

CoTinker pushed a commit that referenced this issue Sep 7, 2024
Fix #102418, resolved the issue of generating incorrect vrev during
vectorization in big-endian scenarios
VitaNuo pushed a commit to VitaNuo/llvm-project that referenced this issue Sep 12, 2024
Fix llvm#102418, resolved the issue of generating incorrect vrev during
vectorization in big-endian scenarios