JIT: Add a disabled-by-default implementation of strength reduction #104243

Merged 6 commits on Jul 4, 2024

Commits on Jul 1, 2024

  1. JIT: Add a disabled-by-default implementation of strength reduction

    This adds a disabled-by-default implementation of strength reduction. At
    this point the implementation should be correct; however, enabling it
    currently regresses both code size and perfscore. More work is needed
    to get the heuristics right and to make it kick in for more cases.
    
    Strength reduction replaces "expensive" operations computed on every
    loop iteration with cheaper ones by creating more induction
    variables. In C# terms it effectively transforms something like
    
    ```
    private struct S
    {
        public int A, B, C;
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int Sum(S[] ss)
    {
        int sum = 0;
        foreach (S v in ss)
        {
            sum += v.A;
            sum += v.B;
            sum += v.C;
        }
    
        return sum;
    }
    ```
    
    into something roughly equivalent to
    ```
    int sum = 0;
    ref S curS = ref ss[0];
    for (int i = 0; i < ss.Length; i++)
    {
      sum += curS.A;
      sum += curS.B;
      sum += curS.C;
      curS = ref Unsafe.Add(ref curS, 1);
    }
    ```
    
    With strength reduction enabled this PR thus changes codegen of the
    standard `foreach` version above from
    ```asm
    G_M63518_IG03:  ;; offset=0x0011
           lea      r10, [rdx+2*rdx]
           lea      r10, bword ptr [rcx+4*r10+0x10]
           mov      r9d, dword ptr [r10]
           mov      r11d, dword ptr [r10+0x04]
           mov      r10d, dword ptr [r10+0x08]
           add      eax, r9d
           add      eax, r11d
           add      eax, r10d
           inc      edx
           cmp      r8d, edx
           jg       SHORT G_M63518_IG03
    						;; size=36 bbWeight=4 PerfScore 39.00
    ```
    
    to
    ```asm
    G_M63518_IG04:  ;; offset=0x0011
           mov      r8, rcx
           mov      r10d, dword ptr [r8]
           mov      r9d, dword ptr [r8+0x04]
           mov      r8d, dword ptr [r8+0x08]
           add      eax, r10d
           add      eax, r9d
           add      eax, r8d
           add      rcx, 12
           dec      edx
           jne      SHORT G_M63518_IG04
    						;; size=31 bbWeight=4 PerfScore 34.00
    ```
    on x64. Further improvements can be made to enable better address modes.
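
    For experimenting with the shape of the transformed loop by hand, a
    minimal sketch follows. It is not what the JIT emits internally: the
    class and method names are hypothetical, and it assumes .NET 5 or later
    for `MemoryMarshal.GetArrayDataReference`, which is used instead of
    `ref ss[0]` so that an empty array does not throw. The loop counts down
    to mirror the `dec`/`jne` pair in the second listing.

    ```csharp
    using System;
    using System.Runtime.CompilerServices;
    using System.Runtime.InteropServices;

    public struct S
    {
        public int A, B, C;
    }

    // Hypothetical helper type, for illustration only.
    public static class StrengthReductionSketch
    {
        // Baseline: the element address is recomputed from the index on every iteration.
        public static int SumIndexed(S[] ss)
        {
            int sum = 0;
            for (int i = 0; i < ss.Length; i++)
            {
                sum += ss[i].A;
                sum += ss[i].B;
                sum += ss[i].C;
            }
            return sum;
        }

        // Hand-written approximation of the strength-reduced loop: the address
        // itself is the induction variable and advances by one element per iteration.
        public static int SumStrengthReduced(S[] ss)
        {
            int sum = 0;
            // Does not index the array, so an empty array is fine here.
            ref S cur = ref MemoryMarshal.GetArrayDataReference(ss);
            for (int remaining = ss.Length; remaining != 0; remaining--)
            {
                sum += cur.A;
                sum += cur.B;
                sum += cur.C;
                cur = ref Unsafe.Add(ref cur, 1);
            }
            return sum;
        }
    }
    ```

    Both methods should return the same value for any non-null array, which
    makes the pair convenient for comparing disassembly with the feature on
    and off.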
    
    The current heuristics try to ensure that we do not end up with more
    primary induction variables than before. Strength reduction only kicks
    in when it thinks that all uses of the primary IV can be replaced by
    the new primary IV. However, uses inside loop exit tests are allowed to
    stay unreplaced, on the assumption that the downwards loop
    transformation will be able to get rid of them.
    
    Getting the cases around overflow right turned out to be hard and
    required reasoning about trip counts that was added in a previous PR.
    Generally, the issue is that we need to prove that transforming a zero
    extension of an add recurrence to a 64-bit add recurrence is legal. For
    example, for a simple case of
    ```
    for (int i = 0; i < arr.Length; i++)
      sum += arr[i];
    ```
    
    the IV analysis is eventually going to end up wanting to show that
    `zext<64>(int32 <L, 0, 1>) => int64 <L, 0, 1>` is a correct
    transformation. This requires showing that the add recurrence does not
    step past 2^32-1, which requires the bound on the trip count that we can
    now compute. The reasoning done for both the trip count and around the
    overflow is still very limited but can be improved incrementally.
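
    To see why the trip count bound matters, the following sketch compares
    the zero extension of a wrapping 32-bit add recurrence with the
    corresponding 64-bit add recurrence. `ZeroExtensionSketch` and
    `FirstDivergence` are hypothetical names used only for this
    illustration; the JIT reasons about this symbolically rather than by
    iterating.

    ```csharp
    using System;

    // Hypothetical helper type, for illustration only.
    public static class ZeroExtensionSketch
    {
        // Returns the first iteration k in [0, tripCount) at which
        // zext<64>(int32 <L, start, step>) differs from int64 <L, start, step>,
        // or -1 if the two agree for the whole trip count.
        public static long FirstDivergence(uint start, uint step, long tripCount)
        {
            uint narrow = start;   // 32-bit add recurrence, wraps around on overflow
            ulong wide = start;    // 64-bit add recurrence with the same start and step
            for (long k = 0; k < tripCount; k++)
            {
                if ((ulong)narrow != wide)
                    return k;
                narrow = unchecked(narrow + step);
                wide += step;
            }
            return -1;
        }
    }
    ```

    `FirstDivergence(0, 1, 1000)` returns -1, since the recurrence never
    gets near 2^32-1 within 1000 iterations. `FirstDivergence(uint.MaxValue - 1, 1, 4)`
    returns 2, because the 32-bit value wraps to 0 while the 64-bit one
    keeps growing.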
    
    The implementation works by considering every primary IV of the loop in
    turn, and by initializing 'cursors' pointing to each use of the primary
    IV. It then repeatedly tries to advance these cursors to the parents of
    the uses, for as long as this results in a new set of cursors that all
    still compute the same (now derived) IV. If it manages to do this at
    least once, then replacing the cursors by a new primary IV should make
    the old primary IV unnecessary, while having replaced some operations
    by cheaper ones.
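
    The cursor idea can be illustrated with a toy model. This is only a
    sketch under simplifying assumptions, not the JIT's implementation:
    `Node`, `CursorSketch`, `AddRec`, `AdvanceCursors` and `BuildFieldLoad`
    are all hypothetical names, expressions are tiny trees with parent
    links, and the primary IV is assumed to be `<L, 0, 1>`. The constants 12
    and 16 mirror the element size and data offset visible in the x64
    listing above.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Toy expression node; hypothetical and unrelated to the JIT's real IR.
    public sealed class Node
    {
        public string Op = "";                        // "iv", "const", "add", "mul" or "load"
        public long Value;                            // only meaningful for "const"
        public Node[] Operands = Array.Empty<Node>();
        public Node Parent;                           // null at the root

        public static Node Iv() => new Node { Op = "iv" };
        public static Node Const(long v) => new Node { Op = "const", Value = v };
        public static Node Make(string op, params Node[] operands)
        {
            var n = new Node { Op = op, Operands = operands };
            foreach (Node o in operands) o.Parent = n;
            return n;
        }
    }

    public static class CursorSketch
    {
        // Describes `n` as an add recurrence (Start, Step) over the primary IV
        // <L, 0, 1>, or returns null when this toy analysis cannot.
        public static (long Start, long Step)? AddRec(Node n)
        {
            if (n.Op == "iv") return (0L, 1L);
            if (n.Op == "const") return (n.Value, 0L);
            if (n.Op != "add" && n.Op != "mul") return null;

            var a = AddRec(n.Operands[0]);
            var b = AddRec(n.Operands[1]);
            if (!a.HasValue || !b.HasValue) return null;
            var (s1, t1) = a.Value;
            var (s2, t2) = b.Value;

            if (n.Op == "add") return (s1 + s2, t1 + t2);
            if (t2 == 0) return (s1 * s2, t1 * s2);   // add recurrence * invariant
            if (t1 == 0) return (s1 * s2, s1 * t2);   // invariant * add recurrence
            return null;                               // product of two IVs is not affine
        }

        // Starts one cursor at each use of the primary IV and moves all cursors
        // to their parents for as long as every parent computes the same add
        // recurrence. Returns that recurrence, or null if no advance was possible.
        public static (long Start, long Step)? AdvanceCursors(IEnumerable<Node> uses)
        {
            List<Node> cursors = uses.ToList();
            if (cursors.Count == 0) return null;

            (long Start, long Step)? derived = null;
            while (true)
            {
                List<Node> parents = cursors.Select(c => c.Parent).ToList();
                if (parents.Any(p => p == null)) break;

                var recs = parents.Select(p => AddRec(p)).ToList();
                if (recs.Any(r => !r.HasValue)) break;

                var first = recs[0].Value;
                if (recs.Any(r => r.Value != first)) break;

                cursors = parents;
                derived = first;
            }
            return derived;
        }

        // Builds load(((iv * 12) + 16) + fieldOffset) and returns the "iv" leaf.
        public static Node BuildFieldLoad(long fieldOffset)
        {
            Node iv = Node.Iv();
            Node scaled = Node.Make("mul", iv, Node.Const(12));
            Node element = Node.Make("add", scaled, Node.Const(16));
            Node field = Node.Make("add", element, Node.Const(fieldOffset));
            Node.Make("load", field);
            return iv;
        }
    }
    ```

    Calling `AdvanceCursors` on the three uses built by `BuildFieldLoad(0)`,
    `BuildFieldLoad(4)` and `BuildFieldLoad(8)` advances past the multiply
    and the first add, then stops because the field offsets make the
    cursors diverge. The result is `(16, 12)`, which would be the start and
    step of the new primary IV in this toy model.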
    jakobbotsch committed Jul 1, 2024
    1786472
  2. Fix a GC hole

    jakobbotsch committed Jul 1, 2024
    efb641c

Commits on Jul 2, 2024

  1. 3a1850a
  2. 7c0e2f8
  3. 3d66247

Commits on Jul 3, 2024

  1. Fix some comments

    jakobbotsch committed Jul 3, 2024
    c06728d