-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Canonicalize loop exits #98096
JIT: Canonicalize loop exits #98096
Conversation
This adds another canonicalization requirement for loops during the optimization phases, namely that all regular loop exit blocks have only loop predecessors. This gives a natural place to insert IR that we want to run only when we know the loop was entered. Exceptional loop exit blocks can still have non-loop predecessors, so these must still be accounted for by optimizations.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsThis adds another canonicalization requirement for loops during the Exceptional loop exit blocks can still have non-loop predecessors, so We could also have a stronger property, namely that all loop exits have only a single predecessor. However the main motivation for the canonicalization is #97865 which does not require this stronger canonicalization.
|
There are cases where we no longer do jump threading because RBO no longer can prove the exact value of a dominator when reaching through a pred. Example without this change: flowgraph
With this change, however, we get this flowgraph. This time @AndyAyersMS I wonder if there's a way to generalize RBO follow predecssors a bit more persistently and then do the jump threading on the BB36 -> BB38 edge instead of trying to do it on the BB38 -> BB39 edge? BB38 is an empty block in this case. |
Diff results for #98096Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 1,610,910 contexts (368,644 MinOpts, 1,242,266 FullOpts). MISSED contexts: 2,790 (0.17%) Overall (-101,288 bytes)
FullOpts (-101,288 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 1,621,203 contexts (360,162 MinOpts, 1,261,041 FullOpts). MISSED contexts: 2,647 (0.16%) Overall (-162,350 bytes)
FullOpts (-162,350 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 1,733,619 contexts (561,303 MinOpts, 1,172,316 FullOpts). MISSED contexts: 2,902 (0.17%) Overall (-6,016 bytes)
FullOpts (-6,016 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 1,478,010 contexts (263,527 MinOpts, 1,214,483 FullOpts). MISSED contexts: 2,751 (0.19%) Overall (-4,376 bytes)
FullOpts (-4,376 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 1,999,649 contexts (587,594 MinOpts, 1,412,055 FullOpts). MISSED contexts: base: 3,225 (0.16%), diff: 3,239 (0.16%) Overall (+41,365 bytes)
FullOpts (+41,365 bytes)
Details here Assembly diffs for linux/arm ran on windows/x86Diffs are based on 1,449,677 contexts (345,734 MinOpts, 1,103,943 FullOpts). MISSED contexts: 55,309 (3.68%) Overall (+6,954 bytes)
FullOpts (+6,954 bytes)
Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 1,624,599 contexts (327,626 MinOpts, 1,296,973 FullOpts). MISSED contexts: base: 4,647 (0.29%), diff: 5,067 (0.31%) Overall (+30,563 bytes)
FullOpts (+30,563 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (+0.08% to +0.40%)
FullOpts (+0.09% to +0.45%)
Throughput diffs for linux/x64 ran on windows/x64Overall (+0.09% to +0.41%)
FullOpts (+0.10% to +0.46%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (+0.07% to +0.36%)
FullOpts (+0.09% to +0.53%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (+0.08% to +0.37%)
FullOpts (+0.09% to +0.43%)
Throughput diffs for windows/x64 ran on windows/x64Overall (+0.09% to +0.51%)
FullOpts (+0.10% to +0.65%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (+0.04% to +0.30%)
FullOpts (+0.04% to +0.31%)
Throughput diffs for windows/x86 ran on windows/x86Overall (+0.10% to +0.29%)
FullOpts (+0.10% to +0.34%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.08% to +0.39%)
FullOpts (+0.09% to +0.44%)
Throughput diffs for linux/x64 ran on linux/x64Overall (+0.09% to +0.40%)
FullOpts (+0.09% to +0.44%)
Details here |
If there are empty preds we ought to be able to look through them. Added a note to #48115. |
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress |
Azure Pipelines successfully started running 2 pipeline(s). |
Diff results for #98096Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 1,610,910 contexts (368,644 MinOpts, 1,242,266 FullOpts). MISSED contexts: 2,790 (0.17%) Overall (-101,288 bytes)
FullOpts (-101,288 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 1,621,203 contexts (360,162 MinOpts, 1,261,041 FullOpts). MISSED contexts: 2,647 (0.16%) Overall (-162,350 bytes)
FullOpts (-162,350 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 1,733,619 contexts (561,303 MinOpts, 1,172,316 FullOpts). MISSED contexts: 2,902 (0.17%) Overall (-6,016 bytes)
FullOpts (-6,016 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 1,478,010 contexts (263,527 MinOpts, 1,214,483 FullOpts). MISSED contexts: 2,751 (0.19%) Overall (-4,376 bytes)
FullOpts (-4,376 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 1,999,649 contexts (587,594 MinOpts, 1,412,055 FullOpts). MISSED contexts: base: 3,225 (0.16%), diff: 3,239 (0.16%) Overall (+41,365 bytes)
FullOpts (+41,365 bytes)
Details here Assembly diffs for linux/arm ran on windows/x86Diffs are based on 1,449,677 contexts (345,734 MinOpts, 1,103,943 FullOpts). MISSED contexts: 55,309 (3.68%) Overall (+6,954 bytes)
FullOpts (+6,954 bytes)
Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 1,624,674 contexts (327,626 MinOpts, 1,297,048 FullOpts). MISSED contexts: base: 4,647 (0.29%), diff: 5,061 (0.31%) Overall (+30,611 bytes)
FullOpts (+30,611 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (+0.08% to +0.40%)
FullOpts (+0.09% to +0.45%)
Throughput diffs for linux/x64 ran on windows/x64Overall (+0.09% to +0.41%)
FullOpts (+0.10% to +0.46%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (+0.07% to +0.37%)
FullOpts (+0.09% to +0.53%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (+0.08% to +0.37%)
FullOpts (+0.09% to +0.43%)
Throughput diffs for windows/x64 ran on windows/x64Overall (+0.09% to +0.51%)
FullOpts (+0.10% to +0.65%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (+0.04% to +0.30%)
FullOpts (+0.04% to +0.31%)
Throughput diffs for windows/x86 ran on windows/x86Overall (+0.10% to +0.30%)
FullOpts (+0.10% to +0.34%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (+0.08% to +0.39%)
FullOpts (+0.09% to +0.44%)
Throughput diffs for linux/x64 ran on linux/x64Overall (+0.09% to +0.40%)
FullOpts (+0.09% to +0.44%)
Details here |
Diff results for #98096Assembly diffsAssembly diffs for windows/x64 ran on windows/x64Diffs are based on 1,999,649 contexts (587,594 MinOpts, 1,412,055 FullOpts). MISSED contexts: base: 3,225 (0.16%), diff: 3,239 (0.16%) Overall (+41,365 bytes)
FullOpts (+41,365 bytes)
Details here Assembly diffs for linux/arm ran on windows/x86Diffs are based on 1,449,677 contexts (345,734 MinOpts, 1,103,943 FullOpts). MISSED contexts: 55,309 (3.68%) Overall (+6,954 bytes)
FullOpts (+6,954 bytes)
Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 1,624,674 contexts (327,626 MinOpts, 1,297,048 FullOpts). MISSED contexts: base: 4,647 (0.29%), diff: 5,061 (0.31%) Overall (+30,611 bytes)
FullOpts (+30,611 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (+0.08% to +0.39%)
FullOpts (+0.09% to +0.44%)
Throughput diffs for linux/x64 ran on linux/x64Overall (+0.09% to +0.40%)
FullOpts (+0.09% to +0.44%)
Details here |
// Returns: | ||
// Estimated likelihood of the edge being taken. | ||
// | ||
weight_t Compiler::optEstimateEdgeLikelihood(BasicBlock* from, BasicBlock* to, bool* fromProfile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully this should make the logic more easily updated once we get to the point of having likelihoods here. I also want to switch the preheader code to use this helper in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it will become much simpler as you can just query the edge.
Diff results for #98096Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (+0.08% to +0.36%)
FullOpts (+0.09% to +0.40%)
Throughput diffs for linux/x64 ran on linux/x64Overall (+0.09% to +0.37%)
FullOpts (+0.10% to +0.41%)
Details here Throughput diffs for linux/arm64 ran on windows/x64Overall (+0.09% to +0.37%)
FullOpts (+0.09% to +0.41%)
Throughput diffs for linux/x64 ran on windows/x64Overall (+0.09% to +0.38%)
FullOpts (+0.10% to +0.43%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (+0.07% to +0.39%)
FullOpts (+0.09% to +0.57%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (+0.08% to +0.36%)
MinOpts (-0.01% to +0.00%)
FullOpts (+0.09% to +0.41%)
Throughput diffs for windows/x64 ran on windows/x64Overall (+0.09% to +0.43%)
FullOpts (+0.10% to +0.55%)
Details here |
Diff results for #98096Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 1,610,272 contexts (368,644 MinOpts, 1,241,628 FullOpts). MISSED contexts: 3,428 (0.21%) Overall (-7,664 bytes)
FullOpts (-7,664 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 1,620,764 contexts (360,162 MinOpts, 1,260,602 FullOpts). MISSED contexts: 3,086 (0.19%) Overall (-19,189 bytes)
FullOpts (-19,189 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 1,733,061 contexts (561,303 MinOpts, 1,171,758 FullOpts). MISSED contexts: 3,460 (0.20%) Overall (-2,384 bytes)
FullOpts (-2,384 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 1,477,297 contexts (263,527 MinOpts, 1,213,770 FullOpts). MISSED contexts: 3,464 (0.23%) Overall (-3,368 bytes)
FullOpts (-3,368 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 1,999,231 contexts (587,594 MinOpts, 1,411,637 FullOpts). MISSED contexts: 3,657 (0.18%) Overall (-11,691 bytes)
FullOpts (-11,691 bytes)
Details here Assembly diffs for linux/arm ran on windows/x86Diffs are based on 1,449,330 contexts (345,734 MinOpts, 1,103,596 FullOpts). MISSED contexts: 55,656 (3.70%) Overall (+306 bytes)
FullOpts (+306 bytes)
Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 1,618,548 contexts (327,626 MinOpts, 1,290,922 FullOpts). MISSED contexts: base: 11,019 (0.68%), diff: 11,173 (0.69%) Overall (+13,886 bytes)
FullOpts (+13,886 bytes)
Details here Throughput diffsThroughput diffs for windows/x64 ran on windows/x64Overall (+0.09% to +0.43%)
FullOpts (+0.10% to +0.55%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (+0.04% to +0.32%)
FullOpts (+0.04% to +0.32%)
Throughput diffs for windows/x86 ran on windows/x86Overall (+0.10% to +0.29%)
FullOpts (+0.10% to +0.33%)
Details here |
Current detailed throughput on benchmarks.run_pgo (which has one of the biggest TP regressions) looks like: Base: 97028407386, Diff: 97444421217, +0.4288%
30325644 : +8.35% : 6.25% : +0.0313% : `Compiler::fgRunDfs<`Compiler::fgComputeDfs'::`2'::<lambda_1>,`Compiler::fgComputeDfs'::`2'::<lambda_2>,`Compiler::fgComputeDfs'::`2'::<lambda_3> >'::`2'::<lambda_1>::operator()
29399635 : +11.02% : 6.06% : +0.0303% : public: static class BlockReachabilitySets * __cdecl BlockReachabilitySets::Build(class FlowGraphDfsTree *)
26004409 : +8.19% : 5.36% : +0.0268% : private: void __cdecl SsaBuilder::ComputeIteratedDominanceFrontier(struct BasicBlock *, class JitHashTable<struct BasicBlock *, struct JitPtrKeyFuncs<struct BasicBlock>, class jitstd::vector<struct BasicBlock *, class jitstd::allocator<struct BasicBlock *>>, class CompAllocator, class JitHashTableBehavior> const *, class jitstd::vector<struct BasicBlock *, class jitstd::allocator<struct BasicBlock *>> *)
25847460 : +8.79% : 5.33% : +0.0266% : public: static class FlowGraphDominatorTree * __cdecl FlowGraphDominatorTree::Build(class FlowGraphDfsTree const *)
20586307 : +1.37% : 4.25% : +0.0212% : private: bool __cdecl LiveVarAnalysis::PerBlockAnalysis(struct BasicBlock *, bool, bool)
15109730 : +0.68% : 3.12% : +0.0156% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
15029836 : NA : 3.10% : +0.0155% : public: bool __cdecl Compiler::optCanonicalizeExit(class FlowGraphNaturalLoop *, struct BasicBlock *)
13364811 : +11.53% : 2.76% : +0.0138% : public: static class FlowGraphNaturalLoops * __cdecl FlowGraphNaturalLoops::Find(class FlowGraphDfsTree const *)
10774676 : +11.78% : 2.22% : +0.0111% : public: struct FlowEdge * __cdecl Compiler::fgAddRefPred<0>(struct BasicBlock *, struct BasicBlock *, struct FlowEdge *)
9910264 : +1.31% : 2.04% : +0.0102% : memset
9424309 : +17.94% : 1.94% : +0.0097% : public: bool __cdecl FlowGraphNaturalLoop::ContainsBlock(struct BasicBlock *)
9337439 : +8.19% : 1.93% : +0.0096% : BasicBlock::VisitAllSuccs<`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >
8249103 : +3.15% : 1.70% : +0.0085% : public: virtual enum PhaseStatus __cdecl LinearScan::doLinearScan(void)
8209161 : +6.61% : 1.69% : +0.0085% : public: void __cdecl Compiler::fgCompactBlocks(struct BasicBlock *, struct BasicBlock *)
7489193 : +4.54% : 1.54% : +0.0077% : public: void __cdecl Compiler::optImpliedByTypeOfAssertions(unsigned __int64 *&)
7434216 : +11.80% : 1.53% : +0.0077% : public: struct FlowEdge * __cdecl Compiler::fgRemoveRefPred(struct BasicBlock *, struct BasicBlock *)
6934584 : +8.68% : 1.43% : +0.0071% : private: static bool __cdecl FlowGraphNaturalLoops::FindNaturalLoopBlocks(class FlowGraphNaturalLoop *, class ArrayStack<struct BasicBlock *> &)
5724310 : +3.53% : 1.18% : +0.0059% : private: void __cdecl SsaBuilder::InsertPhiFunctions(void)
5204044 : +4.10% : 1.07% : +0.0054% : BasicBlock::VisitAllSuccs<`Compiler::optReachable'::`12'::<lambda_1> >
5194162 : +3.78% : 1.07% : +0.0054% : `Compiler::optReachable'::`12'::<lambda_1>::operator()
5005719 : NA : 1.03% : +0.0052% : public: double __cdecl Compiler::optEstimateEdgeLikelihood(struct BasicBlock *, struct BasicBlock *, bool *)
4781029 : +72.14% : 0.99% : +0.0049% : public: void __cdecl Compiler::fgReplaceJumpTarget(struct BasicBlock *, struct BasicBlock *, struct BasicBlock *)
4739874 : +9.76% : 0.98% : +0.0049% : public: struct FlowEdge * __cdecl Compiler::BlockDominancePreds(struct BasicBlock *)
4657517 : NA : 0.96% : +0.0048% : public: bool __cdecl Compiler::optCanonicalizeExits(class FlowGraphNaturalLoop *)
4515950 : +2.63% : 0.93% : +0.0047% : public: void __cdecl BasicBlock::ensurePredListOrder(class Compiler *)
4455957 : +5.95% : 0.92% : +0.0046% : private: bool __cdecl jitstd::vector<struct BasicBlock *, class jitstd::allocator<struct BasicBlock *>>::ensure_capacity(unsigned __int64)
4303713 : +1.67% : 0.89% : +0.0044% : private: void __cdecl SsaBuilder::BlockRenameVariables(struct BasicBlock *)
4195084 : +5.40% : 0.87% : +0.0043% : public: enum BasicBlockVisit __cdecl BasicBlock::VisitRegularSuccs<class `public: void __cdecl DataFlow::ForwardAnalysis<class CSE_DataFlow>(class CSE_DataFlow &)'::`16'::<lambda_1>>(class Compiler *, class `public: void __cdecl DataFlow::ForwardAnalysis<class CSE_DataFlow>(class CSE_DataFlow &)'::`16'::<lambda_1>)
3822670 : +1.40% : 0.79% : +0.0039% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::ClearD(class Compiler *, unsigned __int64 *&)
3816270 : +8.32% : 0.79% : +0.0039% : VisitEHSuccs<0,`AllSuccessorEnumerator::AllSuccessorEnumerator'::`2'::<lambda_1> >
3801464 : +3.92% : 0.78% : +0.0039% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::Assign(struct BitVecTraits *, unsigned __int64 *&, unsigned __int64 *)
3698187 : +1.59% : 0.76% : +0.0038% : public: void __cdecl Compiler::fgValueNumberBlock(struct BasicBlock *)
3606886 : +0.98% : 0.74% : +0.0037% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::UnionD(class Compiler *, unsigned __int64 *&, unsigned __int64 *)
3485781 : +7.52% : 0.72% : +0.0036% : BasicBlock::VisitRegularSuccs<``FlowGraphNaturalLoops::Find'::`7'::<lambda_1>::operator()'::`2'::<lambda_1> >
3426035 : +1.91% : 0.71% : +0.0035% : public: void __cdecl DataFlow::ForwardAnalysis<class AssertionPropFlowCallback>(class AssertionPropFlowCallback &)
3272310 : +0.35% : 0.67% : +0.0034% : public: bool __cdecl Compiler::fgUpdateFlowGraph(bool, bool)
3202086 : +0.61% : 0.66% : +0.0033% : public: void __cdecl Compiler::fgPerBlockLocalVarLiveness(void)
3147900 : +4.39% : 0.65% : +0.0032% : VisitEHSuccs<0,`Compiler::optReachable'::`12'::<lambda_1> >
3036441 : +1.36% : 0.63% : +0.0031% : private: void __cdecl LiveVarAnalysis::Run(bool)
2892124 : +4.60% : 0.60% : +0.0030% : public: unsigned __int64 ** __cdecl Compiler::optInitAssertionDataflowFlags(void)
2723427 : +4.18% : 0.56% : +0.0028% : public: bool __cdecl Compiler::optReachable(struct BasicBlock *const, struct BasicBlock *const, struct BasicBlock *const)
2708291 : +0.78% : 0.56% : +0.0028% : public: void __cdecl ArrayStack<struct GenTree *>::Push(struct GenTree *)
2694774 : +0.83% : 0.56% : +0.0028% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::Assign(class Compiler *, unsigned __int64 *&, unsigned __int64 *)
2607088 : +1.30% : 0.54% : +0.0027% : public: void __cdecl RangeCheck::MergeEdgeAssertions(unsigned int, unsigned __int64 *const &, struct Range *)
2557973 : +1.69% : 0.53% : +0.0026% : public: static struct BasicBlock * __cdecl BasicBlock::New(class Compiler *)
2512966 : +3.22% : 0.52% : +0.0026% : public: enum PhaseStatus __cdecl Compiler::fgValueNumber(void)
2480155 : +4.94% : 0.51% : +0.0026% : private: static void __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::DataFlowDLong(struct BitVecTraits *, unsigned __int64 *&, unsigned __int64 *const, unsigned __int64 *const)
2476801 : +2.19% : 0.51% : +0.0026% : public: void __cdecl DataFlow::ForwardAnalysis<class CSE_DataFlow>(class CSE_DataFlow &)
2384405 : +5.83% : 0.49% : +0.0025% : public: struct FlowEdge * __cdecl Compiler::BlockPredsWithEH(struct BasicBlock *)
2370665 : +2.11% : 0.49% : +0.0024% : private: void __cdecl SsaBuilder::ComputeDominanceFrontiers(struct BasicBlock **, int, class JitHashTable<struct BasicBlock *, struct JitPtrKeyFuncs<struct BasicBlock>, class jitstd::vector<struct BasicBlock *, class jitstd::allocator<struct BasicBlock *>>, class CompAllocator, class JitHashTableBehavior> *)
2327029 : +1.31% : 0.48% : +0.0024% : private: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::LivenessDLong(class Compiler *, unsigned __int64 *&, unsigned __int64 *const, unsigned __int64 *const, unsigned __int64 *const)
2253101 : +0.72% : 0.46% : +0.0023% : public: enum PhaseStatus __cdecl Compiler::optAssertionPropMain(void)
2117591 : +0.46% : 0.44% : +0.0022% : public: void __cdecl Compiler::fgInterBlockLocalVarLiveness(void)
2052538 : +7.86% : 0.42% : +0.0021% : public: static unsigned __int64 * __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::MakeEmpty(class Compiler *)
2001861 : +3.09% : 0.41% : +0.0021% : protected: void __cdecl Compiler::optValnumCSE_InitDataFlow(void)
1907864 : +2.24% : 0.39% : +0.0020% : private: void __cdecl SsaBuilder::RenameVariables(void)
1879814 : +1.85% : 0.39% : +0.0019% : `SsaBuilder::AddPhiArgsToSuccessors'::`2'::<lambda_1>::operator()
1845748 : +2.31% : 0.38% : +0.0019% : public: bool __cdecl Compiler::fgHasCycleWithoutGCSafePoint(void)
1808558 : +1.36% : 0.37% : +0.0019% : public: bool __cdecl Compiler::fgCanCompactBlocks(struct BasicBlock *, struct BasicBlock *)
1806120 : +216.39% : 0.37% : +0.0019% : public: struct BasicBlock * __cdecl Compiler::fgNewBBbefore(enum BBKinds, struct BasicBlock *, bool, struct BasicBlock *)
1771031 : +3.61% : 0.37% : +0.0018% : public: void __cdecl jitstd::vector<struct FlowEdge *, class jitstd::allocator<struct FlowEdge *>>::push_back(struct FlowEdge *const &)
1718771 : +1.05% : 0.35% : +0.0018% : public: void __cdecl Compiler::fgInitBlockVarSets(void)
1716876 : +4.06% : 0.35% : +0.0018% : BasicBlock::VisitAllSuccs<`SsaBuilder::AddPhiArgsToSuccessors'::`2'::<lambda_1> >
1703289 : NA : 0.35% : +0.0018% : public: bool __cdecl Compiler::optCanonicalizeLoops(void)
1695760 : +5.76% : 0.35% : +0.0017% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::ClearD(struct BitVecTraits *, unsigned __int64 *&)
1549927 : +1.83% : 0.32% : +0.0016% : public: bool __cdecl Compiler::fgRenumberBlocks(void)
1502424 : +0.55% : 0.31% : +0.0015% : public: bool __cdecl Compiler::optBlockCopyProp(struct BasicBlock *, class JitHashTable<unsigned int, struct JitSmallPrimitiveKeyFuncs<unsigned int>, class ArrayStack<class Compiler::CopyPropSsaDef> *, class CompAllocator, class JitHashTableBehavior> *)
1497554 : +1.20% : 0.31% : +0.0015% : public: unsigned __int64 ** __cdecl Compiler::optComputeAssertionGen(void)
1457410 : +2.81% : 0.30% : +0.0015% : VisitEHSuccs<0,`Compiler::fgAddHandlerLiveVars'::`2'::<lambda_1> >
1431654 : +1.25% : 0.30% : +0.0015% : public: unsigned int __cdecl ValueNumStore::VNForMapSelectWork(enum ValueNumKind, enum var_types, unsigned int, unsigned int, int *, bool *, class SmallValueNumSet &)
1404833 : +0.95% : 0.29% : +0.0014% : public: enum PhaseStatus __cdecl Compiler::optVnCopyProp(void)
1340214 : +1.01% : 0.28% : +0.0014% : public: struct BasicBlock * __cdecl Compiler::fgFindInsertPoint(unsigned int, bool, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, struct BasicBlock *, bool)
1252790 : +215.07% : 0.26% : +0.0013% : public: void __cdecl Compiler::fgExtendEHRegionBefore(struct BasicBlock *)
1110922 : +4.38% : 0.23% : +0.0011% : VisitEHSuccs<0,`SsaBuilder::AddPhiArgsToSuccessors'::`2'::<lambda_1> >
1064221 : NA : 0.22% : +0.0011% : BasicBlock::VisitRegularSuccs<`Compiler::optEstimateEdgeLikelihood'::`11'::<lambda_2> >
1045442 : +2.06% : 0.22% : +0.0011% : public: unsigned int __cdecl BasicBlock::NumSucc(void) const
982656 : +623.63% : 0.20% : +0.0010% : public: static bool __cdecl Compiler::fgProfileWeightsConsistent(double, double)
981072 : +1.81% : 0.20% : +0.0010% : public: void __cdecl Compiler::lvaMarkLocalVars(struct BasicBlock *, bool)
960577 : +1.94% : 0.20% : +0.0010% : public: unsigned int __cdecl ValueNumStore::VNForFunc(enum var_types, enum VNFunc, unsigned int, unsigned int, unsigned int)
866074 : +3.18% : 0.18% : +0.0009% : private: void __cdecl JitHashTable<struct BasicBlock *, struct JitPtrKeyFuncs<struct BasicBlock>, struct FlowEdge *, class CompAllocator, class JitHashTableBehavior>::Grow(void)
824600 : +0.19% : 0.17% : +0.0008% : protected: static enum Compiler::fgWalkResult __cdecl Compiler::optVNAssertionPropCurStmtVisitor(struct GenTree **, struct Compiler::fgWalkData *)
800285 : +0.15% : 0.17% : +0.0008% : public: void __cdecl Compiler::optAssertionGen(struct GenTree *)
781361 : +3.05% : 0.16% : +0.0008% : public: enum PhaseStatus __cdecl Compiler::optRedundantBranches(void)
744362 : +9.51% : 0.15% : +0.0008% : public: void __cdecl ValueNumStore::GetCompareCheckedBound(unsigned int, struct ValueNumStore::CompareCheckedBoundArithInfo *)
715014 : +0.21% : 0.15% : +0.0007% : public: enum Compiler::fgWalkResult __cdecl GenTreeVisitor<class GenericTreeWalker<0, 1, 0, 1>>::WalkTree(struct GenTree **, struct GenTree *)
688709 : +0.43% : 0.14% : +0.0007% : protected: void __cdecl Compiler::optValnumCSE_Availability(void)
665899 : +6.19% : 0.14% : +0.0007% : public: struct BasicBlock * __cdecl Compiler::fgRemoveBlock(struct BasicBlock *, bool)
653562 : +2.14% : 0.13% : +0.0007% : public: __cdecl GCSafePointSuccessorEnumerator::GCSafePointSuccessorEnumerator(class Compiler *, struct BasicBlock *)
644460 : +0.22% : 0.13% : +0.0007% : private: struct ValueNumStore::Chunk * __cdecl ValueNumStore::GetAllocChunk(enum var_types, enum ValueNumStore::ChunkExtraAttribs)
629868 : +0.17% : 0.13% : +0.0006% : __security_check_cookie
605278 : +0.74% : 0.12% : +0.0006% : private: unsigned int __cdecl ValueNumStore::VnForConst<int, class ValueNumStore::VNMap<int, struct JitLargePrimitiveKeyFuncs<int>>>(int, class ValueNumStore::VNMap<int, struct JitLargePrimitiveKeyFuncs<int>> *, enum var_types)
593362 : +12.58% : 0.12% : +0.0006% : public: bool __cdecl Compiler::fgOptimizeBranchToEmptyUnconditional(struct BasicBlock *, struct BasicBlock *)
574772 : +0.87% : 0.12% : +0.0006% : private: void __cdecl Compiler::optComputeLoopSideEffectsOfBlock(struct BasicBlock *, class FlowGraphNaturalLoop *)
560695 : +5.14% : 0.12% : +0.0006% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, class Compiler *, class TrackedVarBitSetTraits>::AssignAllowUninitRhs(class Compiler *, unsigned __int64 *&, unsigned __int64 *)
553860 : NA : 0.11% : +0.0006% : public: void __cdecl Compiler::fgSetEHRegionForNewPreheaderOrExit(struct BasicBlock *)
551370 : +2.30% : 0.11% : +0.0006% : public: bool __cdecl ValueNumStore::IsVNCheckedBound(unsigned int)
514459 : +4.06% : 0.11% : +0.0005% : private: void __cdecl JitHashTable<struct ValueNumStore::VNDefFuncApp<3>, struct ValueNumStore::VNDefFuncAppKeyFuncs<3>, unsigned int, class CompAllocator, class JitHashTableBehavior>::Grow(void)
509557 : +0.45% : 0.11% : +0.0005% : public: enum PhaseStatus __cdecl Compiler::rangeCheckPhase(void)
501918 : +1.17% : 0.10% : +0.0005% : public: bool __cdecl BasicBlock::isEmpty(void) const
-491825 : -0.07% : 0.10% : -0.0005% : GenTreeVisitor<`Compiler::fgSetTreeSeq'::`2'::SetTreeSeqVisitor>::WalkTree
-563351 : -0.04% : 0.12% : -0.0006% : public: unsigned __int64 __cdecl LinearScan::RegisterSelection::select<0>(class Interval *, class RefPosition *)
-585906 : -4.62% : 0.12% : -0.0006% : protected: void __cdecl Compiler::optScaleLoopBlocks(struct BasicBlock *, struct BasicBlock *)
-600065 : -4.01% : 0.12% : -0.0006% : public: struct BasicBlock * __cdecl Compiler::fgConnectFallThrough(struct BasicBlock *, struct BasicBlock *)
-642350 : -0.08% : 0.13% : -0.0007% : protected: void __cdecl CodeGen::genCodeForBBlist(void)
-946757 : -0.77% : 0.20% : -0.0010% : private: void __cdecl emitter::emitJumpDistBind(void)
-982836 : -3.11% : 0.20% : -0.0010% : public: bool __cdecl BasicBlock::bbFallsThrough(void) const
-1067015 : -0.07% : 0.22% : -0.0011% : public: unsigned int __cdecl Compiler::gtSetEvalOrder(struct GenTree *)
-1295538 : -0.84% : 0.27% : -0.0013% : public: static void __cdecl BitSetOps<unsigned __int64 *, 1, struct BitVecTraits *, struct BitVecTraits>::IntersectionD(struct BitVecTraits *, unsigned __int64 *&, unsigned __int64 *)
-3237389 : -0.25% : 0.67% : -0.0033% : private: void __cdecl LinearScan::processBlockStartLocations(struct BasicBlock *)
-5147072 : -1.59% : 1.06% : -0.0053% : public: bool __cdecl Compiler::fgReorderBlocks(bool)
-9839750 : -3.24% : 2.03% : -0.0101% : public: enum PhaseStatus __cdecl Compiler::fgComputeEdgeWeights(void) I don't think there is much we can do here, this mostly just seems like innate cost of creating more blocks. So probably just going to have to live with the fact that it will eat into some of the TP improvements we got from all the DFS/dominator/loop work. |
cc @dotnet/jit-contrib PTAL @AndyAyersMS Diffs. A bit of churn from flow graph opts as is to be expected when new blocks with new weights are introduced. I hope with future FG opts work we'll get to a point where introduction of no-op blocks like this would be more or less free from diffs. I'll also give Bruce a chance to review when he's back next week before I merge this. There is an alternative which is to create these canonical exits on-demand during IV widening. It would presumably get rid of essentially all of the TP regression. However, I think that gets quite messy as we eventually saw with preheaders that used to have the same on-demand strategy but now are created eagerly. |
if (optCanonicalizeLoops()) | ||
{ | ||
fgInvalidateDfsTree(); | ||
m_dfsTree = fgComputeDfs(); | ||
m_loops = FlowGraphNaturalLoops::Find(m_dfsTree); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe one way to improve TP slightly would be to canonicalize exits as part of loop cloning so that we don't have to reinvoke canonicalization here. I might look into what this would save, but I would expect not that much since we're only paying here in cases where we actually did do loop cloning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These costs are likely dwarfed by the cost of cloning, so if we improve the cloning heuristics (which I assume will mean cloning fewer big loops) we should come out ahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it wouldn't be hard for cloning to also clone all its exit blocks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would work to clone the exits, although at the tradeoff of duplicating even more IR. Not sure whether that's worth it.
PTAL @BruceForstall |
We may want this eventually if we start trying to materialize last values -- for multi-exit loops you may need different expressions, depending on the relationship of the exit to the in-loop updates. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM
if (optCanonicalizeLoops()) | ||
{ | ||
fgInvalidateDfsTree(); | ||
m_dfsTree = fgComputeDfs(); | ||
m_loops = FlowGraphNaturalLoops::Find(m_dfsTree); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These costs are likely dwarfed by the cost of cloning, so if we improve the cloning heuristics (which I assume will mean cloning fewer big loops) we should come out ahead.
// Returns: | ||
// Estimated likelihood of the edge being taken. | ||
// | ||
weight_t Compiler::optEstimateEdgeLikelihood(BasicBlock* from, BasicBlock* to, bool* fromProfile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it will become much simpler as you can just query the edge.
if (optCanonicalizeLoops()) | ||
{ | ||
fgInvalidateDfsTree(); | ||
m_dfsTree = fgComputeDfs(); | ||
m_loops = FlowGraphNaturalLoops::Find(m_dfsTree); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it wouldn't be hard for cloning to also clone all its exit blocks?
if (optCanonicalizeLoops()) | ||
{ | ||
fgInvalidateDfsTree(); | ||
m_dfsTree = fgComputeDfs(); | ||
m_loops = FlowGraphNaturalLoops::Find(m_dfsTree); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary for unrolling? If we've fully unrolled a loop, then the exit blocks are no longer unique (they have N preds for N unrolls), but there is no loop left. Presumably all the other loops haven't changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loop unrolling does not necessarily remove the loop; if the loop has multiple backedges then it remains a natural loop after redirecting the loop test backedge.
// predecessors. | ||
// | ||
// Parameters: | ||
// loop - The loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing exit
param
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix as part of a follow up
This adds another canonicalization requirement for loops during the
optimization phases, namely that all regular loop exit blocks have only
loop predecessors. This gives natural places to insert IR that we want
to run only when we know the loop was exited.
Exceptional loop exit blocks can still have non-loop predecessors, so
these must still be accounted for by optimizations.
We could also have a stronger property, namely that all loop exits have only a single predecessor. However the main motivation for the canonicalization is #97865 which does not require this stronger canonicalization.
Example flowgraph: before, after