Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of reused constants in the register allocator #70182

Open
tannergooding opened this issue Jun 2, 2022 · 9 comments
Open

Improve handling of reused constants in the register allocator #70182

tannergooding opened this issue Jun 2, 2022 · 9 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@tannergooding
Copy link
Member

tannergooding commented Jun 2, 2022

Today https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/lsra-heuristic-tuning.md documents that it does FREE and then CONST_AVAILABLE. This ends up not being able to reuse an already enregistered constant if it wasn't "last use".

A simple example is the following where C81 and C82 both represent the CNS_DBL of 0.0:

68.#46  C81   Def    ORDER(A) mm0  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|     |     |V101a|     |
70.#47  C82   Def    ORDER(A) mm1  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|C82 a|     |V101a|     |

While these values aren't last use, the last use does happen before the constant is overwritten and in some cases we can end up with 4-5 registers all being initialized (as is the case for Matrix4x4.Decompose) to the same constant value (often zero so emitting xorps many times in a row).

It would be beneficial if the register allocator could improve this scenario and keep the constant in one register for such scenarios. This in particular appears to impact Zero and AllBitsSet since they are "cheap to compute" and so typically don't undergo CSE.

category:cq
theme:register-allocator

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 2, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Jun 2, 2022
@ghost
Copy link

ghost commented Jun 2, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Today https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/lsra-heuristic-tuning.md documents that it does FREE and then CONST_AVAILABLE. This ends up not being able to reuse an already enregistered constant if it wasn't "last use".

A simple example is the following where C81 and C82 both represent the CNS_DBL of 0.0:

68.#46  C81   Def    ORDER(A) mm0  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|     |     |V101a|     |
70.#47  C82   Def    ORDER(A) mm1  |     |V13 a|     |V2  a|V3  a|V0  a|V1  a|     |     |V10 a|V4  a|V7  a|C81 a|C82 a|     |V101a|     |

While these values aren't last use, the last use does happen before the constant is overwritten and in some cases we can end up with 4-5 registers all being initialized to the same constant value (often zero so emitting xorps many times in a row).

It would be beneficial if the register allocator could improve this scenario and keep the constant in one register for such scenarios. This in particular appears to impact Zero and AllBitsSet since they are "cheap to compute" and so typically don't undergo CSE.

Author: tannergooding
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@kunalspathak
Copy link
Member

There is an assumption in LSRA that materializing constants are cheap and hence we don't want to tie it to a register. May be if we have an information of the occurances of particular constants (zeros for sure), we can have it assigned a register. I tried experimenting by switching the order of CONST_AVAILABLE and FREE and that (not surprisingly) shows lot of regressions:

[15:20:52]
[15:20:52] Top method regressions (bytes):
[15:20:52]           54 (20.00% of base) : 36178.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateConstrainedBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           43 ( 2.55% of base) : 34909.dasm - System.Numerics.Matrix4x4:Decompose(System.Numerics.Matrix4x4,byref,byref,byref):bool
[15:20:52]           28 (19.18% of base) : 10260.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateLookAtBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           28 (19.18% of base) : 17977.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateWorldBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           26 (13.27% of base) : 35808.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           22 ( 6.83% of base) : 17425.dasm - System.Numerics.Tests.Perf_Plane:CreateFromVerticesBenchmark():System.Numerics.Plane:this
[15:20:52]           18 ( 1.91% of base) : 3355.dasm - Benchmarks.SIMD.RayTracer.RayTracer:CreateDefaultScene():Benchmarks.SIMD.RayTracer.Scene
[15:20:52]           10 ( 4.44% of base) : 33917.dasm - Benchstone.MDBenchI.MDXposMatrix:Test():bool:this
[15:20:52]            5 ( 0.38% of base) : 15958.dasm - Benchstone.MDBenchF.MDRomber:Test():bool:this
[15:20:52]            5 ( 1.61% of base) : 16620.dasm - Benchstone.MDBenchF.MDSqMtx:Test():bool:this
[15:20:52]
[15:20:52] Top method regressions (percentages):
[15:20:52]           54 (20.00% of base) : 36178.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateConstrainedBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           28 (19.18% of base) : 10260.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateLookAtBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           28 (19.18% of base) : 17977.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateWorldBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           26 (13.27% of base) : 35808.dasm - System.Numerics.Tests.Perf_Matrix4x4:CreateBillboardBenchmark():System.Numerics.Matrix4x4:this
[15:20:52]           22 ( 6.83% of base) : 17425.dasm - System.Numerics.Tests.Perf_Plane:CreateFromVerticesBenchmark():System.Numerics.Plane:this
[15:20:52]           10 ( 4.44% of base) : 33917.dasm - Benchstone.MDBenchI.MDXposMatrix:Test():bool:this
[15:20:52]           43 ( 2.55% of base) : 34909.dasm - System.Numerics.Matrix4x4:Decompose(System.Numerics.Matrix4x4,byref,byref,byref):bool
[15:20:52]           18 ( 1.91% of base) : 3355.dasm - Benchmarks.SIMD.RayTracer.RayTracer:CreateDefaultScene():Benchmarks.SIMD.RayTracer.Scene
[15:20:52]            5 ( 1.61% of base) : 16620.dasm - Benchstone.MDBenchF.MDSqMtx:Test():bool:this
[15:20:52]            5 ( 0.38% of base) : 15958.dasm - Benchstone.MDBenchF.MDRomber:Test():bool:this
[15:20:52]
[15:20:52] 10 total methods with Code Size differences (0 improved, 10 regressed), 0 unchanged.

image

windows/x64 coreclr:

image

@kunalspathak
Copy link
Member

This will most likely not happen in .NET 7

@tannergooding
Copy link
Member Author

This will most likely not happen in .NET 7

This seems fine to me. There are a few places where we still have what appear to be associated regressions most of which appear to be cases of Zero or AllBitsSet being initialized more than necessary.

This sometimes impacts loop alignment, etc. So we'll probably want to take a closer look in 8. But overall the changes related to GT_CNS_VEC are positive improvements and the regressions tend to be nanoseconds in range so it shouldn't be a huge priority in 7

@tannergooding
Copy link
Member Author

Just adding in a summary that I gave elsewhere...

There are basically two fighting forces at play. On one hand we have constant folding, containment, and other optimizations that can only kick in when we know we have a constant. On the other hand, if we don't know that one of these optimizations are going to kick in then hoisting these values outside the loop is likely strictly better. However, given that we can't really know for sure what will be emitted until lowering, we're making vague heuristical guesses in HIR instead.

The premise of my proposal is that we should always CSE and always hoist constants into locals. We should then specially track that a given local represents a constant and we should prevent such locals from being rewritten (either via a new node kind GT_LCL_CNS or a flag on GT_LCL_VAR/GT_LCL_FLD).

Constant folding and lowering would then need to be updated to consider both GT_CNS_* and this new type of "constant local". This allows constant folding to still occur, it allows lowering to still perform containment, etc. However, it also ensures that if a given architecture or code path won't specially handle the constant that it is still hoisted outside the loop and won't be a repeated per-iteration cost.

A downside to this are that it may have have some minor impact on throughput. LSRA will also need to determine if all users of a local mark it as contained so that it can avoid initializing a value and burning a register unnecessarily. Much like with other scenarios, LSRA could understand and treat these constants specially and know they are cheaper to rematerialize and so therefore can be preference them for spilling if register contention is an issue (which should result in no-worse codegen than what we have today).

@TIHan
Copy link
Contributor

TIHan commented Feb 2, 2023

Related: #76067

@kunalspathak
Copy link
Member

I don't have time to look into this issue in .NET 8. We need to have a "User Story" in future release that contains all these LSRA issues to get them prioritized.

@kunalspathak kunalspathak removed this from the 8.0.0 milestone Jun 5, 2023
@kunalspathak
Copy link
Member

kunalspathak commented Jun 16, 2023

While working on #87082, been debugging the codegen on osx/arm64 to see why we don't reuse the constant for such scenarios:

           movz    x0, #0x8C00
           movk    x0, #817 LSL #16
           movk    x0, #1 LSL #32
           movz    x1, #0x8C00
           movk    x1, #817 LSL #16
           movk    x1, #1 LSL #32
           ldr     x1, [x1]
           blr     x1

Idle code gen:

            movz    x0, #0x8C00
            movk    x0, #817 LSL #16
            movk    x0, #1 LSL #32
            ldr     x1, [x0]
            blr     x1

Multiple reasons:

We seem to track the registers containing constants, but as soon as we see a different interval, we reset the tracked register.

if (interval->isConstant)
{
setConstantReg(reg->regNum, interval->registerType);
}
else
{
clearConstantReg(reg->regNum, interval->registerType);
}

We try to match the constant represented by IConNode or VecCons but not anything further like FtnAddress (which is the case in above example).
If a register is holding a constant, we mark it as "busy". When we try to allocate for another position that contains exact same constant, we run try_FREE() heuristic first, which eliminates the register holding constant from list of possible candidates. When we do try_CONST_AVAILABLE(), we never see the register that is holding the same constant. Hence, we end up assigning a new register. And of course, there are bunch of asserts that kicks in when I was trying to make this work.

I am actually wondering, why can't we just have a lookup table of N entries, where N is the number of available registers. key would be const and value would be reg. We store the constant against the register that is holding it. A new heuristic will check if the constant we are allocating for, is present in lookup and if yes, just give it the register present in lookup instead of assigning it a new register. Of course, at block boundaries, we should flush this lookup table, so we don't hold the entries that are true in one pred and not in another.

@kunalspathak
Copy link
Member

Another example from @jakobbotsch - #76928 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

3 participants