Flaky CSE around pdep mask #477

Open
damageboy opened this issue Dec 3, 2019 · 2 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization

@damageboy
Contributor

damageboy commented Dec 3, 2019

Repro Repo:

https://github.com/damageboy/coreclr-pdep-mask-flaky-cse

Relevant piece of code:

https://github.com/damageboy/coreclr-pdep-mask-flaky-cse/blob/d6bc610c1dd5416f717211676f2fb0b0ce42e3a2/Program.cs#L28-L41

            ulong t64;
            t64 = P.AsUInt64().GetElement(0);
            var p0 = ParallelBitDeposit(t64, 0x0707070707070707);
            var p1 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
            t64 = P.AsUInt64().GetElement(1);
            var p2 = ParallelBitDeposit(t64, 0x0707070707070707);
            var p3 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
            var tmp128 = ExtractVector128(P, 1);
            t64 = tmp128.AsUInt64().GetElement(0);
            var p4 = ParallelBitDeposit(t64, 0x0707070707070707);
            var p5 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
            t64 = tmp128.AsUInt64().GetElement(1);
            var p6 = ParallelBitDeposit(t64, 0x0707070707070707);
            var p7 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);

Generated asm:

https://github.com/damageboy/coreclr-pdep-mask-flaky-cse/blob/d6bc610c1dd5416f717211676f2fb0b0ce42e3a2/listing.asm#L15-L66

;             t64 = P.AsUInt64().GetElement(0);
00007FC3A6AB07F5 C5FC28C8             vmovaps ymm1,ymm0
00007FC3A6AB07F9 C4E1F97EC8           vmovq   rax,xmm1


;             var p0 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB07FE 48BF0707070707070707 mov     rdi,707070707070707h
00007FC3A6AB0808 C4E2FBF5FF           pdep    rdi,rax,rdi


;             var p1 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB080D 48C1E820             shr     rax,20h
00007FC3A6AB0811 48BE0707070707070707 mov     rsi,707070707070707h
00007FC3A6AB081B C4E2FBF5F6           pdep    rsi,rax,rsi


;             t64 = P.AsUInt64().GetElement(1);
00007FC3A6AB0820 C5FC28C8             vmovaps ymm1,ymm0
00007FC3A6AB0824 C4E3F916C801         vpextrq rax,xmm1,1


;             var p2 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB082A 48BA0707070707070707 mov     rdx,707070707070707h
00007FC3A6AB0834 C4E2FBF5D2           pdep    rdx,rax,rdx


;             var p3 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB0839 48C1E820             shr     rax,20h
00007FC3A6AB083D 48B90707070707070707 mov     rcx,707070707070707h
00007FC3A6AB0847 C4E2FBF5C9           pdep    rcx,rax,rcx


;             var tmp128 = ExtractVector128(P, 1);
00007FC3A6AB084C C4E37D39C001         vextracti128 xmm0,ymm0,1


;             t64 = tmp128.AsUInt64().GetElement(0);
00007FC3A6AB0852 C4E1F97EC0           vmovq   rax,xmm0


;             var p4 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB0857 49B80707070707070707 mov     r8,707070707070707h
00007FC3A6AB0861 C442FBF5C0           pdep    r8,rax,r8


;             var p5 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB0866 48C1E820             shr     rax,20h
00007FC3A6AB086A 49B90707070707070707 mov     r9,707070707070707h
00007FC3A6AB0874 C442FBF5C9           pdep    r9,rax,r9


;             t64 = tmp128.AsUInt64().GetElement(1);
00007FC3A6AB0879 C4E3F916C001         vpextrq rax,xmm0,1


;             var p6 = ParallelBitDeposit(t64, 0x0707070707070707);
00007FC3A6AB087F 49BA0707070707070707 mov     r10,707070707070707h
00007FC3A6AB0889 C442FBF5D2           pdep    r10,rax,r10


;             var p7 = ParallelBitDeposit(t64 >> 32, 0x0707070707070707);
00007FC3A6AB088E 48C1E820             shr     rax,20h
00007FC3A6AB0892 49BB0707070707070707 mov     r11,707070707070707h
00007FC3A6AB089C C4C2FBF5C3           pdep    rax,rax,r11

Issue

This is a minor variation on the bug I opened yesterday: #442

Somehow, just moving a few of these expressions around causes the JIT to stop CSE-ing the mask argument to PDEP reliably: in the listing above, the constant 0x0707070707070707 is loaded into a fresh register before every single pdep (unlike the code I posted in the previous issue).

Not sure why such a trivial change from the previous listing triggers this...

category:cq
theme:cse
skill-level:intermediate
cost:medium
impact:small

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 3, 2019
@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 3, 2019
@BruceForstall BruceForstall added this to the Future milestone Dec 3, 2019
@BruceForstall BruceForstall added optimization and removed untriaged New issue has not been triaged by the area owner labels Dec 3, 2019
@BruceForstall
Member

@briansull

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@jakobbotsch
Member

We don't CSE constants on xarch. If the example is run with DOTNET_JitConstCSE=3 (CSE constants including off of nearby constants), then we do CSE the constants here.
For proper profitable CSE'ing of constants we need support for rematerialization (#70182).
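
As a rough sketch of how one might try the setting mentioned above: the JIT reads its configuration knobs from environment variables, so the value can be exported before launching the repro. Whether a release runtime honors `DOTNET_JitConstCSE`, or whether a checked JIT build is needed, is an assumption here that this thread does not confirm. The `dotnet run -c Release` line in the comment is likewise only an assumed way of launching the repro project.

```shell
# Enable constant CSE for this shell session (value 3 is taken from the comment above).
# Assumption: the runtime/JIT build in use actually reads this knob.
export DOTNET_JitConstCSE=3

# Then, from a checkout of the repro repository, run the program as usual, e.g.:
#   dotnet run -c Release
```

If the knob takes effect, the regenerated listing should show the mask being loaded fewer times than in the asm above.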

@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 24, 2023