
Suboptimal ASM emitted for Vector256<T>.Zero and Vector128<T>.Zero #76067

Open
israellot opened this issue Sep 23, 2022 · 5 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@israellot

israellot commented Sep 23, 2022

When using Vector256.Zero, I would expect it to be kept in a fixed register and reused. Instead, what I see is a vxorps operation emitted every time.

AVX2 has 16 YMM registers.

Below is one example. I can get the desired behavior by forcing a zero vector variable instead of using Vector256.Zero.
Assigning Vector256<byte>.Zero to a variable alone does not do the trick; only the extra xor operation ensures it stays in a fixed register.

var byteVector = Vector256.LoadUnsafe<byte>(ref spanRef);
           
var low = Avx2.UnpackLow(byteVector, Vector256<byte>.Zero);
var high = Avx2.UnpackHigh(byteVector, Vector256<byte>.Zero);

var added = Avx2.Add(low.AsInt16(), high.AsInt16());

added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);

//ASM

mov      rax, bword ptr [rcx]
vmovdqu  ymm0, ymmword ptr[rax]
vxorps   ymm1, ymm1, ymm1
vpunpcklbw ymm1, ymm0, ymm1
vxorps   ymm2, ymm2, ymm2
vpunpckhbw ymm0, ymm0, ymm2
vpaddw   ymm0, ymm1, ymm0
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1

// Workaround: force the zero into a variable
var byteVector = Vector256.LoadUnsafe<byte>(ref spanRef);

var zero = Vector256<byte>.Zero;
zero = Avx2.Xor(zero, zero); //forces fixed register
            
var low = Avx2.UnpackLow(byteVector, zero);
var high = Avx2.UnpackHigh(byteVector, zero);

var added = Avx2.Add(low.AsInt16(), high.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());


//ASM
mov      rax, bword ptr [rcx]
vxorps   ymm0, ymm0, ymm0
vmovdqu  ymm1, ymmword ptr[rax]
vpunpcklbw ymm2, ymm1, ymm0
vpunpckhbw ymm1, ymm1, ymm0
vpaddw   ymm1, ymm2, ymm1
vphaddw  ymm1, ymm1, ymm0
vphaddw  ymm1, ymm1, ymm0
vphaddw  ymm1, ymm1, ymm0

category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:small

@israellot israellot added the tenet-performance Performance related issue label Sep 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@ghost

ghost commented Sep 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo
Member

EgorBo commented Sep 23, 2022

This is a known issue that we might eventually address with constant materialization during LSRA.
We don't currently do CSE for Vector<>.Zero because 1) it's cheap to materialize, and 2) we plan to rely on it in optimizations in lower*.cpp.

@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@EgorBo EgorBo added this to the Future milestone Sep 23, 2022
@israellot
Author

Thanks for the info @EgorBo

@TIHan
Contributor

TIHan commented Feb 2, 2023

Discussed this with @tannergooding for a bit. As an immediate step, we could add a peephole optimization that eliminates unnecessary vxorps ymm1, ymm1, ymm1 instructions by looking back to see whether the same instruction has already occurred, as long as we know that ymm1 was not written to in any other way.
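For readers following along, here is a minimal sketch of that back-scan idea as a standalone C# program. The `Instr`/`Peephole` model is entirely hypothetical (not real JIT data structures or APIs); it just shows the rule: a register-zeroing instruction is redundant if an identical zeroing of the same register precedes it with no intervening write to that register.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo: the second vxorps of ymm1 is redundant because nothing wrote
// ymm1 in between, so the peephole drops it (3 of 4 instructions survive).
var code = new List<Instr>
{
    new("vxorps",  "ymm1", new[] { "ymm1", "ymm1" }),
    new("vphaddw", "ymm0", new[] { "ymm0", "ymm1" }),
    new("vxorps",  "ymm1", new[] { "ymm1", "ymm1" }), // redundant
    new("vphaddw", "ymm0", new[] { "ymm0", "ymm1" }),
};
Console.WriteLine(Peephole.Run(code).Count); // 3

// Minimal instruction model; Op/Dest/Srcs are illustrative fields only.
record Instr(string Op, string Dest, string[] Srcs)
{
    // "vxorps r, r, r" zeroes r regardless of its prior contents.
    public bool IsZeroing => Op == "vxorps" && Srcs.All(s => s == Dest);
}

static class Peephole
{
    public static List<Instr> Run(IEnumerable<Instr> code)
    {
        var kept = new List<Instr>();
        foreach (var ins in code)
        {
            if (ins.IsZeroing && IsRedundant(ins, kept))
                continue; // elide the duplicate zeroing instruction
            kept.Add(ins);
        }
        return kept;
    }

    // Scan backwards: redundant if an identical zeroing of the same
    // register is found before any other write to that register.
    static bool IsRedundant(Instr ins, List<Instr> kept)
    {
        for (int i = kept.Count - 1; i >= 0; i--)
        {
            if (kept[i].IsZeroing && kept[i].Dest == ins.Dest)
                return true;
            if (kept[i].Dest == ins.Dest)
                return false; // register was overwritten since
        }
        return false;
    }
}
```

Note that a real implementation would also have to treat calls and other ABI kill points as writes to every volatile register, and a purely straight-line scan like this cannot hoist the zeroing out of a loop.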

@EgorBo
Member

EgorBo commented Oct 13, 2023

Another manifestation of this problem:

void Foo(ref byte b)
{
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 0)) = default;
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 16)) = default;
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 32)) = default;
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 48)) = default;
}

Emits:

            movi    v16.4s, #0
            str     q16, [x1]
            movi    v16.4s, #0
            str     q16, [x1, #0x10]
            movi    v16.4s, #0
            str     q16, [x1, #0x20]
            movi    v16.4s, #0
            str     q16, [x1, #0x30]

Expected:

            movi    v16.4s, #0
            stp     q16, q16, [x1]
            stp     q16, q16, [x1, #0x20]

> eliminate unnecessary vxorps ymm1, ymm1, ymm1 instructions by looking back to see if the same instruction occurred, as long as we know that ymm1 was not written to in any other way.

TL;DR: that didn't work due to ABI issues; it wouldn't be a real fix anyway, because it doesn't help with loop hoisting.
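To illustrate the loop-hoisting point: in a sketch like the following (hypothetical `ZeroInLoop`/`CountNonZeroBytes` names; assumes the .NET 8 `Vector256.Create(ReadOnlySpan<T>)` and `Vector256.ExtractMostSignificantBits` APIs), each use of `Vector256<byte>.Zero` in the loop body tends to re-materialize the constant every iteration, and a peephole over straight-line code cannot move it to the loop preheader.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

byte[] sample = { 0, 1, 0, 2 };
Console.WriteLine(ZeroInLoop.CountNonZeroBytes(sample)); // 2

static class ZeroInLoop
{
    // Counts non-zero bytes 32 at a time. The comparison against
    // Vector256<byte>.Zero inside the loop is exactly the pattern this
    // issue is about: the zero constant is cheap to materialize, so the
    // JIT may emit a vxorps per iteration instead of hoisting it.
    public static int CountNonZeroBytes(ReadOnlySpan<byte> data)
    {
        int count = 0;
        int i = 0;
        for (; i + Vector256<byte>.Count <= data.Length; i += Vector256<byte>.Count)
        {
            var v = Vector256.Create(data.Slice(i, Vector256<byte>.Count));
            // Lanes equal to zero set their mask bit; count the rest.
            uint zeroMask = Vector256.ExtractMostSignificantBits(
                Vector256.Equals(v, Vector256<byte>.Zero));
            count += Vector256<byte>.Count - BitOperations.PopCount(zeroMask);
        }
        for (; i < data.Length; i++)      // scalar tail
            if (data[i] != 0) count++;
        return count;
    }
}
```

Fixing this for loops would need CSE/hoisting or constant materialization in the register allocator, rather than a local back-scan.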
