Convert all user calls to indirect early on arm64 #89213

EgorBo · 2023-07-19T20:21:51Z

static void Test(int n)
{
    for (int i = 0; i < n; i++)
    {
        Console.WriteLine();
        Console.WriteLine();
    }
}

Codegen diff (on arm64): https://www.diffchecker.com/vHceD9dT/
(target address of Console.WriteLine is not CSE'd and hoisted from the loop)

Seeing some nice diffs locally on arm64 SPMI:

Same PR but limitted only to loops (blocks with "backward-branch" flag) to mitigate TP regressions:

EgorBo · 2023-07-19T23:48:04Z

Ugh, I guess I should not CSE/Hoist ldr instruction, otherwise if I call a method in a loop I won't be able to reload its new address (e.g if it's tiered up), so only 64-bit address of the vm slot should be CSE'd

EgorBo · 2023-07-20T15:51:39Z

[Benchmark]
public void NonVirtual()
{
    var x = Y;
    for (var i = 0; i < 100000000; i++)
    {
        x.Dispose(); // non-virtual non-inlineable call
        x.Dispose();
    }
}

Main:

; Assembly listing for method BasicBenchmark.Program+BasicBenchmark:NonVirtual():this (Tier1)	
            stp     fp, lr, [sp, #-0x30]!	
            stp     x19, x20, [sp, #0x18]	
            str     x21, [sp, #0x28]	
            mov     fp, sp	
            movz    x0, #0x81A8	
            movk    x0, #0x5800 LSL #16	
            movk    x0, #1 LSL #32	
            ldr     x19, [x0]	
            mov     w20, wzr	
            movz    w21, #0xE100	
            movk    w21, #0x5F5 LSL #16	
G_M000_IG03:	
            mov     x0, x19	
            movz    x1, #0x3B48	
            movk    x1, #0x668 LSL #16	
            movk    x1, #1 LSL #32	
            ldr     x1, [x1]	
            ldr     wzr, [x0]	
            blr     x1	
            mov     x0, x19	
            movz    x1, #0x3B48	
            movk    x1, #0x668 LSL #16	
            movk    x1, #1 LSL #32	
            ldr     x1, [x1]	
            blr     x1	
            add     w20, w20, #1	
            cmp     w20, w21	
            blt     G_M000_IG03	

            ldr     x21, [sp, #0x28]	
            ldp     x19, x20, [sp, #0x18]	
            ldp     fp, lr, [sp], #0x30	
            ret     lr	
; Total bytes of code 124

PR:

; Assembly listing for method BasicBenchmark.Program+BasicBenchmark:NonVirtual():this (Tier1-OSR)
            stp     fp, lr, [sp, #-0x30]!
            stp     x19, x20, [sp, #0x10]
            stp     x21, x22, [sp, #0x20]
            mov     fp, sp
            ldr     x20, [fp, #0x50]
            ldr     w19, [fp, #0x4C]
            movz    w21, #0xE100
            movk    w21, #0x5F5 LSL #16
            cmp     w19, w21
            bge     G_M000_IG05
            movz    x22, #0x3918
            movk    x22, #0x4D4 LSL #16
            movk    x22, #1 LSL #32
G_M000_IG04:
            mov     x0, x20
            ldr     x1, [x22]
            ldr     wzr, [x0]
            blr     x1
            mov     x0, x20
            ldr     x1, [x22]
            blr     x1
            add     w19, w19, #1
            cmp     w19, w21
            blt     G_M000_IG04

G_M000_IG05:
            ldp     x21, x22, [sp, #0x20]
            ldp     x19, x20, [sp, #0x10]
            ldp     fp, lr, [sp], #0x30
            add     sp, sp, #48
            ret     lr
; Total bytes of code 112

Codegen diff: https://www.diffchecker.com/dSO0krs3/

It does look like my PR is supposed to make it faster since we hoist 2 constants from the loop body - but the benchmark says it's not 😢:

Method	Job	Toolchain	Mean	Error	StdDev	Ratio
NonVirtual	Job-BMMOJI	/Core_Root/corerun	285.9 ms	0.66 ms	0.51 ms	1.42
NonVirtual	Job-JRBCUN	/Core_Root_base/corerun	202.0 ms	0.24 ms	0.22 ms	1.00

Apparently, CPU is more comforable with calls where target is populated right before the call? Anyway, so far I wasn't able to see perf wins from this...

kunalspathak · 2023-07-22T05:50:07Z

@a74nh - could you please check on your end why we don't see the expected wins here?

Test change

3b45165

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 19, 2023

Add comments, limit to arm, ignore cold blocks

31013b3

EgorBo added 2 commits July 20, 2023 12:14

Merge branch 'main' of github.com:dotnet/runtime into arm-indirect-call

9900779

Don't cse loads

f44457a

EgorBo marked this pull request as ready for review July 20, 2023 13:34

Limit to blocks with backward branches

52274de

EgorBo closed this Jul 20, 2023

EgorBo mentioned this pull request Jul 20, 2023

[JIT] ARM64 - NonVirtual method call slower inside of a loop than a Virtual method call #88807

Open

ghost locked as resolved and limited conversation to collaborators Aug 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Convert all user calls to indirect early on arm64 #89213

Convert all user calls to indirect early on arm64 #89213

EgorBo commented Jul 19, 2023 •

edited

Loading

EgorBo commented Jul 19, 2023 •

edited

Loading

EgorBo commented Jul 20, 2023 •

edited

Loading

kunalspathak commented Jul 22, 2023

Convert all user calls to indirect early on arm64 #89213

Convert all user calls to indirect early on arm64 #89213

Conversation

EgorBo commented Jul 19, 2023 • edited Loading

EgorBo commented Jul 19, 2023 • edited Loading

EgorBo commented Jul 20, 2023 • edited Loading

kunalspathak commented Jul 22, 2023

EgorBo commented Jul 19, 2023 •

edited

Loading

EgorBo commented Jul 19, 2023 •

edited

Loading

EgorBo commented Jul 20, 2023 •

edited

Loading