JIT: Add a disabled-by-default implementation of strength reduction (#104243)

This adds a disabled-by-default implementation of strength reduction. At
this point the implementation should be correct; however, enabling it
currently regresses both code size and perfscore. More work is needed to
get the heuristics right and to make it kick in for more cases.

Strength reduction replaces "expensive" operations computed on every
loop iteration with cheaper ones by creating more induction
variables. In C# terms it effectively transforms something like

```
private struct S
{
    public int A, B, C;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static int Sum(S[] ss)
{
    int sum = 0;
    foreach (S v in ss)
    {
        sum += v.A;
        sum += v.B;
        sum += v.C;
    }

    return sum;
}
```

into the equivalent of
```
int sum = 0;
ref S curS = ref ss[0];
for (int i = 0; i < ss.Length; i++)
{
  sum += curS.A;
  sum += curS.B;
  sum += curS.C;
  curS = ref Unsafe.Add(ref curS, 1);
}
```
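The same transformation can be sketched outside the JIT. The following C++ is only an illustration, not code from this PR; the names `SumIndexed` and `SumStrengthReduced` are made up for the example. The first function recomputes the element address from the index on every iteration, while the second keeps the address itself as the induction variable and counts the remaining iterations down to zero.

```cpp
#include <cstddef>
#include <vector>

struct S
{
    int A, B, C;
};

// Indexed form: every iteration computes the address base + i * sizeof(S).
int SumIndexed(const std::vector<S>& ss)
{
    int sum = 0;
    for (std::size_t i = 0; i < ss.size(); i++)
    {
        const S& v = ss[i];
        sum += v.A;
        sum += v.B;
        sum += v.C;
    }
    return sum;
}

// Strength-reduced form: the element pointer is the induction variable and is
// advanced by one element per iteration; no multiplication remains in the loop.
int SumStrengthReduced(const std::vector<S>& ss)
{
    int      sum = 0;
    const S* cur = ss.data();
    for (std::size_t remaining = ss.size(); remaining != 0; remaining--)
    {
        sum += cur->A;
        sum += cur->B;
        sum += cur->C;
        cur++;
    }
    return sum;
}
```

Both functions return the same result for any input, including an empty vector, because the pointer is never dereferenced when there are no iterations.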

With strength reduction enabled this PR thus changes codegen of the
standard `foreach` version above from
```asm
G_M63518_IG03:  ;; offset=0x0011
       lea      r10, [rdx+2*rdx]
       lea      r10, bword ptr [rcx+4*r10+0x10]
       mov      r9d, dword ptr [r10]
       mov      r11d, dword ptr [r10+0x04]
       mov      r10d, dword ptr [r10+0x08]
       add      eax, r9d
       add      eax, r11d
       add      eax, r10d
       inc      edx
       cmp      r8d, edx
       jg       SHORT G_M63518_IG03
						;; size=36 bbWeight=4 PerfScore 39.00
```

to
```asm
G_M63518_IG04:  ;; offset=0x0011
       mov      r8, rcx
       mov      r10d, dword ptr [r8]
       mov      r9d, dword ptr [r8+0x04]
       mov      r8d, dword ptr [r8+0x08]
       add      eax, r10d
       add      eax, r9d
       add      eax, r8d
       add      rcx, 12
       dec      edx
       jne      SHORT G_M63518_IG04
						;; size=31 bbWeight=4 PerfScore 34.00
```
on x64. Further improvements can be made to enable better address modes.

The current heuristics try to ensure that we do not end up with more
primary induction variables than before. Strength reduction only kicks
in when it determines that all uses of the old primary IV can be replaced
by the new primary IV. However, uses inside loop exit tests are allowed
to stay unreplaced, on the assumption that the downwards-counted loop
transformation will be able to get rid of them.
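As a rough illustration of that rule, the check can be thought of as a predicate over the uses of the old primary IV. This is a simplified sketch, not the JIT's code; `IVUse` and `AllUsesAccountedFor` are hypothetical names invented for the example.

```cpp
#include <vector>

// Hypothetical summary of one use of the old primary IV inside the loop.
struct IVUse
{
    bool coveredByCursor; // would be rewritten in terms of the new primary IV
    bool inExitTest;      // appears inside a loop exit test
};

// Returns true when every use is either replaced by the new primary IV or sits
// in an exit test, where a later transformation is assumed to remove it.
bool AllUsesAccountedFor(const std::vector<IVUse>& uses)
{
    for (const IVUse& use : uses)
    {
        if (!use.coveredByCursor && !use.inExitTest)
        {
            return false;
        }
    }
    return true;
}
```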

Getting the cases around overflow right turned out to be hard and
required reasoning about trip counts that was added in a previous PR.
Generally, the issue is that we need to prove that transforming a zero
extension of an add recurrence to a 64-bit add recurrence is legal. For
example, for a simple case of
```
for (int i = 0; i < arr.Length; i++)
  sum += arr[i];
```

the IV analysis eventually needs to show that
`zext<64>(int32 <L, 0, 1>) => int64 <L, 0, 1>` is a correct
transformation. This requires showing that the add recurrence never
steps past 2^32-1, which in turn requires the bound on the trip count
that we can now compute. The reasoning about both the trip count and
overflow is still very limited, but it can be improved incrementally.
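To make the legality condition concrete, here is a small sketch. It is not the JIT's scalar evolution code; `AddRec32`, `ZextOfNarrow`, `WideValue` and `CanWidenZext` are hypothetical names, and the sketch assumes an unsigned interpretation with a known upper bound on the number of times the IV is stepped.

```cpp
#include <cstdint>

// Hypothetical model of a 32-bit add recurrence <L, start, step>: on iteration
// k the value is start + k * step, computed with 32-bit wraparound.
struct AddRec32
{
    uint32_t start;
    uint32_t step;
};

// Value of the 32-bit IV on iteration k, zero-extended to 64 bits.
uint64_t ZextOfNarrow(AddRec32 iv, uint64_t k)
{
    return static_cast<uint32_t>(iv.start + static_cast<uint32_t>(k) * iv.step);
}

// Value of the widened 64-bit recurrence <L, start, step> on iteration k.
uint64_t WideValue(AddRec32 iv, uint64_t k)
{
    return static_cast<uint64_t>(iv.start) + k * static_cast<uint64_t>(iv.step);
}

// The two agree on every iteration up to maxSteps exactly when
// start + maxSteps * step still fits in 32 bits, i.e. does not pass 2^32-1.
bool CanWidenZext(AddRec32 iv, uint64_t maxSteps)
{
    if (iv.step == 0)
    {
        return true; // the value never changes, so it cannot wrap
    }
    return maxSteps <= (UINT32_MAX - iv.start) / iv.step;
}
```

For the `arr.Length` loop above, the recurrence starts at 0 with step 1 and the array length bounds the number of steps, so the check succeeds. A recurrence starting near 2^32 would fail it after only a few steps.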

The implementation works by considering every primary IV of the loop in
turn and initializing 'cursors' pointing to each use of the primary IV.
It then repeatedly tries to advance these cursors to the parents of the
uses, as long as this results in a new set of cursors that all still
compute the same (now derived) IV. If it manages to do this at least
once, then replacing the cursors by a new primary IV should make the old
primary IV unnecessary, while having replaced some operations by cheaper
ones.
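The cursor walk can be sketched on a toy expression tree. This is a heavily simplified model under stated assumptions, not the implementation in this PR: `Node`, `AddRec`, `Analyze`, `Link` and `AdvanceCursors` are hypothetical names, the tree only knows a handful of operators, and overflow is ignored entirely.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical add recurrence <start, step>.
struct AddRec
{
    long long start;
    long long step;
};

enum class Op { UseIV, Const, Add, Mul, Load };

struct Node
{
    Op        op;
    long long value  = 0; // only meaningful for Const
    Node*     left   = nullptr;
    Node*     right  = nullptr;
    Node*     parent = nullptr;
};

// Wires up child and parent pointers; `right` may be null for unary nodes.
void Link(Node& parent, Node* left, Node* right)
{
    parent.left  = left;
    parent.right = right;
    left->parent = &parent;
    if (right != nullptr) right->parent = &parent;
}

// Tries to express `n` as an add recurrence in terms of the primary IV.
bool Analyze(const Node* n, AddRec primary, AddRec* result)
{
    AddRec l{}, r{};
    switch (n->op)
    {
        case Op::UseIV:
            *result = primary;
            return true;
        case Op::Const:
            *result = AddRec{n->value, 0};
            return true;
        case Op::Add:
            if (!Analyze(n->left, primary, &l) || !Analyze(n->right, primary, &r)) return false;
            *result = AddRec{l.start + r.start, l.step + r.step};
            return true;
        case Op::Mul:
            if (!Analyze(n->left, primary, &l) || !Analyze(n->right, primary, &r)) return false;
            if (r.step != 0) std::swap(l, r);
            if (r.step != 0) return false; // IV * IV is not an add recurrence
            *result = AddRec{l.start * r.start, l.step * r.start};
            return true;
        default:
            return false; // e.g. a load is not an add recurrence
    }
}

// Moves every cursor to its parent for as long as all parents compute the same
// non-constant add recurrence. Returns how many times the cursors advanced and
// stores the recurrence they ended up on in `derived`.
int AdvanceCursors(std::vector<const Node*>& cursors, AddRec primary, AddRec* derived)
{
    int steps = 0;
    *derived  = primary;
    while (!cursors.empty())
    {
        AddRec common{};
        for (std::size_t i = 0; i < cursors.size(); i++)
        {
            const Node* parent = cursors[i]->parent;
            AddRec      rec{};
            if (parent == nullptr || !Analyze(parent, primary, &rec) || rec.step == 0) return steps;
            if (i > 0 && (rec.start != common.start || rec.step != common.step)) return steps;
            common = rec;
        }
        for (const Node*& cursor : cursors) cursor = cursor->parent;
        *derived = common;
        steps++;
    }
    return steps;
}
```

In this model, two uses of `i` that both feed `16 + i * 12` advance twice and stop at the address computation, yielding the derived recurrence `<16, 12>`. That mirrors the `add rcx, 12` in the x64 code above, with field offsets left to the individual loads.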
jakobbotsch authored Jul 4, 2024
1 parent c57f9ae commit a8616d9
Showing 7 changed files with 1,000 additions and 34 deletions.
5 changes: 5 additions & 0 deletions src/coreclr/jit/arraystack.h
@@ -121,6 +121,11 @@ class ArrayStack
         tosIndex = 0;
     }

+    T* Data()
+    {
+        return data;
+    }
+
 private:
     CompAllocator m_alloc;
     int tosIndex; // first free location
6 changes: 4 additions & 2 deletions src/coreclr/jit/compiler.h
@@ -6430,11 +6430,9 @@ class Compiler
     Statement* fgNewStmtAtEnd(BasicBlock* block, GenTree* tree, const DebugInfo& di = DebugInfo());
     Statement* fgNewStmtNearEnd(BasicBlock* block, GenTree* tree, const DebugInfo& di = DebugInfo());

-private:
     void fgInsertStmtNearEnd(BasicBlock* block, Statement* stmt);
     void fgInsertStmtAtBeg(BasicBlock* block, Statement* stmt);

-public:
     void fgInsertStmtAfter(BasicBlock* block, Statement* insertionPoint, Statement* stmt);
     void fgInsertStmtBefore(BasicBlock* block, Statement* insertionPoint, Statement* stmt);

@@ -7563,6 +7561,8 @@ class Compiler

     PhaseStatus optInductionVariables();

+    template <typename TFunctor>
+    void optVisitBoundingExitingCondBlocks(FlowGraphNaturalLoop* loop, TFunctor func);
     bool optMakeLoopDownwardsCounted(ScalarEvolutionContext& scevContext,
                                      FlowGraphNaturalLoop* loop,
                                      LoopLocalOccurrences* loopLocals);
@@ -10345,6 +10345,8 @@ class Compiler
     STRESS_MODE(OPT_REPEAT) /* stress JitOptRepeat */ \
     STRESS_MODE(INITIAL_PARAM_REG) /* Stress initial register assigned to parameters */ \
     STRESS_MODE(DOWNWARDS_COUNTED_LOOPS) /* Make more loops downwards counted */ \
+    STRESS_MODE(STRENGTH_REDUCTION) /* Enable strength reduction */ \
+    STRESS_MODE(STRENGTH_REDUCTION_PROFITABILITY) /* Do more strength reduction */ \
     \
     /* After COUNT_VARN, stress level 2 does all of these all the time */ \
     \
