@EgorBo EgorBo commented Mar 15, 2025

This PR enables loop cloning for Spans
Closes #82946
Closes #110986
Closes #112019

Example:

static void Test(Span<int> span, int len)
{
    for (int i = 0; i < len; i++)
        span[i] = 0;
}

Current codegen:

; Assembly listing for method Test(System.Span`1[int],int) (FullOpts)
       sub      rsp, 40
       xor      eax, eax
       test     edx, edx
       jle      SHORT G_M2065_IG04
       align    [0 bytes for IG03]
G_M2065_IG03:
       cmp      eax, dword ptr [rcx+0x08]    ;; <-- bounds check each iteration
       jae      SHORT G_M2065_IG05
       mov      r8, bword ptr [rcx]
       xor      r10d, r10d
       mov      dword ptr [r8+4*rax], r10d
       inc      eax
       cmp      eax, edx
       jl       SHORT G_M2065_IG03
G_M2065_IG04:
       add      rsp, 40
       ret    
  
G_M2065_IG05:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 42, prolog size 4, PerfScore 38.00

New codegen:

; Assembly listing for method Test(System.Span`1[int],int) (FullOpts)
       sub      rsp, 40
       xor      eax, eax
       test     edx, edx
       jle      SHORT G_M2065_IG05
       cmp      edx, dword ptr [rcx+0x08]
       jg       SHORT G_M2065_IG06
       xor      eax, eax
       align    [15 bytes for IG04]
G_M2065_IG04:
       mov      r8, bword ptr [rcx]     ;; <-- no bounds checks (fast loop)
       xor      r10d, r10d
       mov      dword ptr [r8+rax], r10d
       add      rax, 4
       dec      edx
       jne      SHORT G_M2065_IG04
G_M2065_IG05:
       add      rsp, 40
       ret      

G_M2065_IG06:
       cmp      eax, dword ptr [rcx+0x08]  ;; slow loop (cloned)
       jae      SHORT G_M2065_IG07
       mov      r8, bword ptr [rcx]
       mov      r10d, eax
       xor      r9d, r9d
       mov      dword ptr [r8+4*r10], r9d
       inc      eax
       cmp      eax, edx
       jl       SHORT G_M2065_IG06
       jmp      SHORT G_M2065_IG05
G_M2065_IG07:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 87, prolog size 4, PerfScore 23.38
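
At the source level, the two loop copies in the listing above correspond roughly to the hand-written sketch below. This is only an illustration: the JIT performs the transformation on its internal IR, and the class and method names (`LoopCloningSketch`, `ZeroFirst`) are hypothetical.

```csharp
using System;

public static class LoopCloningSketch
{
    // Hand-written approximation of what loop cloning produces for Test().
    public static void ZeroFirst(Span<int> span, int len)
    {
        if (len <= 0)
            return;

        if (len <= span.Length)
        {
            // Fast loop: i < len <= span.Length holds on every iteration,
            // so the per-iteration bounds check is provably redundant.
            for (int i = 0; i < len; i++)
                span[i] = 0;
        }
        else
        {
            // Slow loop (the clone): keeps the bounds check and throws
            // IndexOutOfRangeException at the first out-of-range index.
            for (int i = 0; i < len; i++)
                span[i] = 0;
        }
    }
}
```

Both branches contain the same source; the difference is that in the first one the guard lets the compiler drop the range check, which is what the fast loop in the new codegen shows.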

Another example where this helps:

    void Test(Span<int> a, Span<int> b)
    {
        if (a.Length == b.Length)
        {
            for (int i = 0; i < a.Length; i++)
                a[i] = b[i];
        }
    }

Previously, we couldn't optimize the range check in the example above.

Diffs

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 15, 2025
@EgorBo EgorBo changed the title from "Loop cloning for Span (non-promoted)" to "Loop cloning for Span" Mar 15, 2025

EgorBo commented Mar 16, 2025

@MihuBot

EgorBo commented Mar 17, 2025

/azp list

EgorBo commented Mar 17, 2025

/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr pgo, runtime-coreclr pgostress, Fuzzlyn

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@bencyoung-Fignum

Out of interest, for loop cloning, if (len > span.Length) would it be less code/more efficient to run the cloned loop up to span.Length and only then switch to the un-optimized version? Or, because that path is likely to throw anyway, is it better to just do the simplest thing?
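
For illustration, the alternative asked about here could look roughly like the sketch below at the source level. It is not what the PR implements, and `SplitLoopSketch` and `ZeroFirstSplit` are hypothetical names.

```csharp
using System;

public static class SplitLoopSketch
{
    // Alternative strategy: run a check-free loop over the part that is known
    // to be in range, then finish in the checked loop.
    public static void ZeroFirstSplit(Span<int> span, int len)
    {
        int safe = Math.Min(len, span.Length);
        int i = 0;

        // Every index here satisfies i < span.Length, so no check is needed.
        for (; i < safe; i++)
            span[i] = 0;

        // Only entered when len > span.Length; the first iteration throws
        // IndexOutOfRangeException.
        for (; i < len; i++)
            span[i] = 0;
    }
}
```

The observable behavior matches the original loop; the cost is a second loop with a different start index, which is a more specialized shape than a plain copy of the loop guarded by one up-front condition.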

EgorBo commented Mar 17, 2025

Out of interest, for loop cloning, if (len > span.Length) would it be less code/more efficient to run the cloned loop up to span.Length and only then switch to the un-optimized version? Or, because that path is likely to throw anyway, is it better to just do the simplest thing?

Yep, the current impl is easier to implement and is more generic - we also need to check array instance for being null (for arrays) and there are other kinds of cloning conditions, e.g. if we have a virtual call inside the loop, we can add an additional cloning condition for the most popular type under that virtual call (PGO).
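
A rough source-level sketch of how several cloning conditions can be combined into one up-front guard is shown below. It assumes an array, a length bound, and an interface call whose most common receiver type is known; `NoOpDisposable` is a hypothetical stand-in for the type PGO observed most often, and the JIT's actual IR-level checks may differ.

```csharp
using System;

// Hypothetical "most popular" implementation seen by PGO.
public sealed class NoOpDisposable : IDisposable
{
    public int Calls;
    public void Dispose() => Calls++;
}

public static class CloningConditionsSketch
{
    public static int SumAndDispose(int[] arr, int len, IDisposable d)
    {
        int sum = 0;

        // All cloning conditions are evaluated once, before the loop:
        // non-null array, in-range trip count, expected receiver type.
        if (arr != null && len <= arr.Length && d is NoOpDisposable known)
        {
            // Fast loop: no null check, no bounds check, direct call.
            for (int i = 0; i < len; i++)
            {
                known.Dispose();
                sum += arr[i];
            }
        }
        else
        {
            // Fallback loop: unchanged semantics, with all checks and the
            // interface call left in place.
            for (int i = 0; i < len; i++)
            {
                d.Dispose();
                sum += arr[i];
            }
        }

        return sum;
    }
}
```

Because the fallback is a plain copy of the original loop, any failed condition (including a different receiver type) simply takes it, with no further case analysis.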

@bencyoung-Fignum

Thanks for the info. So would all "likely" optimizations go in the optimized version of the loop, and none of them in the fallback, or could you have some combinations? E.g. potentially multiple cloned loops with different assumptions? I guess you can assume the fallback is always the fully-unoptimized version, as you assume there will be a failure at some point

EgorBo commented Mar 17, 2025

Thanks for the info. So would all "likely" optimizations go in the optimized version of the loop, and none of them in the fallback, or could you have some combinations? E.g. potentially multiple cloned loops with different assumptions? I guess you can assume the fallback is always the fully-unoptimized version, as you assume there will be a failure at some point

It depends. Normally, yes: the fallback is not expected to be hit unless the code relies on an OOB exception. That is not the case for virtual calls, though: we clone loops that contain them, but the fallback may still be invoked (when some other type arrives). We discussed this recently in #113579 (comment)

EgorBo commented Mar 18, 2025

/azp run runtime-coreclr jitstress, runtime-coreclr pgo, runtime-coreclr pgostress

EgorBo commented Mar 18, 2025

@MihuBot

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@EgorBo EgorBo marked this pull request as ready for review March 18, 2025 01:35
Copilot AI review requested due to automatic review settings March 18, 2025 01:35

Copilot AI left a comment

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

EgorBo commented Mar 18, 2025

@AndyAyersMS @BruceForstall @dotnet/jit-contrib PTAL

Surprisingly, it was not difficult; my changes are mostly cosmetic (with asserts). Basically, if we have a LCL_VAR length, we don't need to deref the array object (it was either already dereferenced when this local was created, or it's a local span that doesn't need any dereference).

Diffs look sane to me. The TP impact is ~0.2% on average, with a huge outlier in libraries_tests.run. However, the same happens today for existing array cloning (for reference, here are the diffs for Main where loop cloning is disabled: diffs). The diffs are PerfScore improvements; they're better if we mark the cloned loop (the slow one) as cold (today, we give it a weight of 0.01).

Outerloop failures are not related.

@BruceForstall BruceForstall left a comment

You've overloaded the existing "jagged" array implementation with a case to support Span. Is this the cleanest way to express this? Would it be better to introduce a new LC_OPT(LcSpan) "type" of optimization (and maybe a LC_Span type that parallels LC_Array, etc.)?

Can Span participate in "jagged" arrays? E.g., for a[x][y][z], can a be a Span, a[x] be a span, a[x][y] be an array?

EgorBo commented Mar 19, 2025

@BruceForstall I've addressed your feedback. The impl is 2x bigger now, but I agree that it looks better. Diffs

assert(isIncreasingLoop || iterInfo->IsDecreasingLoop());
if (!isIncreasingLoop && !iterInfo->IsDecreasingLoop())
{
// Normally, we reject weird-looking loops in optIsLoopClonable, but it's not the case

@EgorBo EgorBo Mar 19, 2025

A small pre-existing issue: it can be reproduced on Main (it hits an assert in Checked builds) with this snippet:

using System;
using System.Runtime.CompilerServices;
using System.Threading; // needed for Thread.Sleep

class Program : IDisposable
{
    public static void Main()
    {
        for (int i = 0; i < 1200; i++)
        {
            try
            {
                Test(new int[100000000], 44, new Program());
                Thread.Sleep(16);
            }
            catch
            {
            }
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Test(int[] arr, int x, IDisposable d)
    {
        for (int i = 0; i < x; i--) // i-- combined with an i < x test is the odd loop shape being reproduced
        {
            d.Dispose();
            Console.WriteLine(arr[i]);
        }
    }

    public void Dispose()
    {
    }
}

EgorBo commented Mar 19, 2025

@EgorBot -amd -arm -profiler

using System; // for Span<T>
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Bench
{
    byte[] _arr1 = new byte[2000];
    byte[] _arr2 = new byte[2000];

    [Benchmark]
    [Arguments(1000)]
    public void CopyN(int elems)
    {
        Span<byte> span1 = _arr1;
        Span<byte> span2 = _arr2;

        for (int i = 0; i < elems; i++)
            span1[i] = span2[i];
    }


    [Benchmark]
    [Arguments(1000)]
    public void ReversedIter(int elems)
    {
        Span<byte> span = _arr1;

        // Reversed iteration
        for (int i = span.Length - 1; i >= 0; i--)
            span[i] = 42;
    }
}

@BruceForstall BruceForstall left a comment

LGTM. One nit/question to consider.

//
void SpanIndex::Print()
{
printf("V%02d[V%02d]", lenLcl, indLcl);

This seems a bit odd, as it indicates that lenLcl is an "array", but it's actually just the length of the array (in the span). Maybe use something like Span<V%02d>[V%02d] instead? Basically, something to make it obvious that lenLcl is not an array.

@EgorBo EgorBo Mar 19, 2025

yep, copy-paste from ArrIndex 🙂 I'll fix the comment in a follow-up (I think I found a few missed opportunities for LC) to avoid spinning the CI again

EgorBo commented Mar 19, 2025

/ba-g "wasm build failure is already fixed - #113685"

@EgorBo EgorBo merged commit 254b55a into dotnet:main Mar 20, 2025
105 of 107 checks passed
@EgorBo EgorBo deleted the loop-clone branch March 20, 2025 01:07

omariom commented Mar 20, 2025

@EgorBo Does it have to dereference Span's _reference on each iteration?

mov      r8, bword ptr [rcx]     ;; <-- no bounds checks (fast loop)

it doesn't for arrays

    L0020: xor eax, eax
    L0022: mov [rcx], eax
    L0024: add rcx, 4
    L0028: dec edx
    L002a: jne short L0020

EgorBo commented Mar 20, 2025

@EgorBo Does it have to dereference Span's _reference on each iteration?

@omariom it's an unfortunate, unrelated issue: in this example, the JIT doesn't promote the Span arg (which is an implicit byref) into variables and loads it from the arg (stack) on each iteration. It rarely happens in practice, as the code is usually more complex and convinces the JIT's struct promoter that promotion is profitable. I observe this issue only on Windows for these examples; e.g. the same example on Linux has no issues.

We'll eventually fix it once @jakobbotsch makes the new (aka physical) promotion the default
