
Suboptimal ASM emitted for Vector256<T>.Zero and Vector128<T>.Zero #76067

Open
israellot opened this issue Sep 23, 2022 · 5 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@israellot

israellot commented Sep 23, 2022

When using Vector256.Zero, I would expect it to be kept in a fixed register and reused. Instead, what I see is a vxorps operation emitted every time.

AVX2 has 16 YMM registers.

Below is one example. I can get the desired behavior by forcing a zero vector variable instead of using Vector256.Zero.
Assigning Vector256<byte>.Zero to a variable alone does not do the trick; only the extra xor operation ensures it stays in a fixed register.

var byteVector = Vector256.LoadUnsafe<byte>(ref spanRef);
           
var low = Avx2.UnpackLow(byteVector, Vector256<byte>.Zero);
var high = Avx2.UnpackHigh(byteVector, Vector256<byte>.Zero);

var added = Avx2.Add(low.AsInt16(), high.AsInt16());

added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);
added = Avx2.HorizontalAdd(added, Vector256<short>.Zero);

//ASM

mov      rax, bword ptr [rcx]
vmovdqu  ymm0, ymmword ptr[rax]
vxorps   ymm1, ymm1, ymm1
vpunpcklbw ymm1, ymm0, ymm1
vxorps   ymm2, ymm2, ymm2
vpunpckhbw ymm0, ymm0, ymm2
vpaddw   ymm0, ymm1, ymm0
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1
vxorps   ymm1, ymm1, ymm1
vphaddw  ymm0, ymm0, ymm1

// Workaround: force the zero into a variable
var byteVector = Vector256.LoadUnsafe<byte>(ref spanRef);

var zero = Vector256<byte>.Zero;
zero = Avx2.Xor(zero, zero); //forces fixed register
            
var low = Avx2.UnpackLow(byteVector, zero);
var high = Avx2.UnpackHigh(byteVector, zero);

var added = Avx2.Add(low.AsInt16(), high.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());
added = Avx2.HorizontalAdd(added, zero.AsInt16());


//ASM
mov      rax, bword ptr [rcx]
vxorps   ymm0, ymm0, ymm0
vmovdqu  ymm1, ymmword ptr[rax]
vpunpcklbw ymm2, ymm1, ymm0
vpunpckhbw ymm1, ymm1, ymm0
vpaddw   ymm1, ymm2, ymm1
vphaddw  ymm1, ymm1, ymm0
vphaddw  ymm1, ymm1, ymm0
vphaddw  ymm1, ymm1, ymm0

category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:small

@israellot israellot added the tenet-performance Performance related issue label Sep 23, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 23, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@ghost

ghost commented Sep 23, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@EgorBo
Member

EgorBo commented Sep 23, 2022

This is a known issue that we might eventually address with constant materialization during LSRA.
We don't currently do CSE for Vector<>.Zero because 1) it's cheap to materialize, and 2) we plan to rely on it in optimizations in lower*.cpp.

@EgorBo EgorBo removed the untriaged New issue has not been triaged by the area owner label Sep 23, 2022
@EgorBo EgorBo added this to the Future milestone Sep 23, 2022
@israellot
Author

Thanks for the info @EgorBo

@TIHan
Contributor

TIHan commented Feb 2, 2023

Discussed this with @tannergooding for a bit. As an immediate step, we could add a peephole optimization that eliminates unnecessary vxorps ymm1, ymm1, ymm1 instructions by looking back to see whether the same instruction has already occurred, as long as we know that ymm1 was not written to in any other way.
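For readers following along, here is a minimal sketch of that back-scan idea as a standalone C# program. The `Instr`/`Peephole` model is entirely hypothetical (not real JIT data structures or APIs); it just shows the rule: a register-zeroing instruction is redundant if an identical zeroing of the same register precedes it with no intervening write to that register.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo: the second vxorps of ymm1 is redundant because nothing wrote
// ymm1 in between, so the peephole drops it (3 of 4 instructions survive).
var code = new List<Instr>
{
    new("vxorps",  "ymm1", new[] { "ymm1", "ymm1" }),
    new("vphaddw", "ymm0", new[] { "ymm0", "ymm1" }),
    new("vxorps",  "ymm1", new[] { "ymm1", "ymm1" }), // redundant
    new("vphaddw", "ymm0", new[] { "ymm0", "ymm1" }),
};
Console.WriteLine(Peephole.Run(code).Count); // 3

// Minimal instruction model; Op/Dest/Srcs are illustrative fields only.
record Instr(string Op, string Dest, string[] Srcs)
{
    // "vxorps r, r, r" zeroes r regardless of its prior contents.
    public bool IsZeroing => Op == "vxorps" && Srcs.All(s => s == Dest);
}

static class Peephole
{
    public static List<Instr> Run(IEnumerable<Instr> code)
    {
        var kept = new List<Instr>();
        foreach (var ins in code)
        {
            if (ins.IsZeroing && IsRedundant(ins, kept))
                continue; // elide the duplicate zeroing instruction
            kept.Add(ins);
        }
        return kept;
    }

    // Scan backwards: redundant if an identical zeroing of the same
    // register is found before any other write to that register.
    static bool IsRedundant(Instr ins, List<Instr> kept)
    {
        for (int i = kept.Count - 1; i >= 0; i--)
        {
            if (kept[i].IsZeroing && kept[i].Dest == ins.Dest)
                return true;
            if (kept[i].Dest == ins.Dest)
                return false; // register was overwritten since
        }
        return false;
    }
}
```

Note that a real implementation would also have to treat calls and other ABI kill points as writes to every volatile register, and a purely straight-line scan like this cannot hoist the zeroing out of a loop.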

@EgorBo
Member

EgorBo commented Oct 13, 2023

Another manifestation of this problem:

void Foo(ref byte b)
{
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 0)) = default;
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 16)) = default;
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 32)) = default;
    Unsafe.As<byte, Vector128<byte>>(ref Unsafe.Add(ref b, 48)) = default;
}

Emits:

            movi    v16.4s, #0
            str     q16, [x1]
            movi    v16.4s, #0
            str     q16, [x1, #0x10]
            movi    v16.4s, #0
            str     q16, [x1, #0x20]
            movi    v16.4s, #0
            str     q16, [x1, #0x30]

Expected:

            movi    v16.4s, #0
            stp     q16, q16, [x1]
            stp     q16, q16, [x1, #0x20]

> eliminate unnecessary vxorps ymm1, ymm1, ymm1 instructions by looking back to see if the same instruction occurred, as long as we know that ymm1 was not written to in any other way.

TL;DR: that didn't work due to ABI issues; it wouldn't be a real fix anyway, because it doesn't help with loop hoisting.
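To illustrate the loop-hoisting point: in a sketch like the following (hypothetical `ZeroInLoop`/`CountNonZeroBytes` names; assumes the .NET 8 `Vector256.Create(ReadOnlySpan<T>)` and `Vector256.ExtractMostSignificantBits` APIs), each use of `Vector256<byte>.Zero` in the loop body tends to re-materialize the constant every iteration, and a peephole over straight-line code cannot move it to the loop preheader.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

byte[] sample = { 0, 1, 0, 2 };
Console.WriteLine(ZeroInLoop.CountNonZeroBytes(sample)); // 2

static class ZeroInLoop
{
    // Counts non-zero bytes 32 at a time. The comparison against
    // Vector256<byte>.Zero inside the loop is exactly the pattern this
    // issue is about: the zero constant is cheap to materialize, so the
    // JIT may emit a vxorps per iteration instead of hoisting it.
    public static int CountNonZeroBytes(ReadOnlySpan<byte> data)
    {
        int count = 0;
        int i = 0;
        for (; i + Vector256<byte>.Count <= data.Length; i += Vector256<byte>.Count)
        {
            var v = Vector256.Create(data.Slice(i, Vector256<byte>.Count));
            // Lanes equal to zero set their mask bit; count the rest.
            uint zeroMask = Vector256.ExtractMostSignificantBits(
                Vector256.Equals(v, Vector256<byte>.Zero));
            count += Vector256<byte>.Count - BitOperations.PopCount(zeroMask);
        }
        for (; i < data.Length; i++)      // scalar tail
            if (data[i] != 0) count++;
        return count;
    }
}
```

Fixing this for loops would need CSE/hoisting or constant materialization in the register allocator, rather than a local back-scan.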
