-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial support for partially jitted methods #34522
Conversation
Jit diffs (tier0, pmi) shows how much space all these throw sequences take up. Regressions seem to be PMI artifacts.
This gives us a ballpark estimate of how much code size we might save if we throw-helperized everything in corelib (biased on the high side because of Tier0 codegen; on the low side because there is no inlining). We can't yet apply this to R2R/Tier1 because we don't support transitions from optimized code. With this optimization we don't need to throw helperize, it more or less does that automagically. Note there are pass-through benefits in many cases as prologs are simpler and we have less zeroing, since the string formatting done in many throws creates structs; these costs are now "deferred" to the remainder method that is jitted only if the exceptional path is reached. Would guess that this ought to give us ~2% startup wins in scenarios where Tier0 jit time is prominent (~5% faster jitting, jit is 50% of total startup cost). Would be nice to try and apply this to funclets (at least catch/fault funclets) too; haven't yet looked into how feasible that might be. We could also potentially apply this more broadly based on IBC, if we had high confidence that IBC "zero" counts were reflective of how most of the code is actually used. cc @dotnet/jit-contrib @jkotas |
Will need a new jit guid, if/when we get closer to merging. |
Testing ongoing, but a few early returns:
|
Or Unsafe.Unlikely 🙂 So if JIT doesn't compile rarely used blocks - doesn't it make sense to allow inliner to inline more? E.g. don't count |
@EgorBo currently this only works for unoptimized codegen (no inlining). Agree things get quite interesting once we can handle patchpoints in optimized codegen, but that's a ways off. I need to look more carefully at what sort of costs partial jitting imposes, both in terms of time and memory. For example, if the continuation piece is already jitted, the transition time from the original method to the continuation is the sum of:
Would guess this time is dominated by VirtualUnwind, so maybe 100,000 instructions? Say ~ 50us? Need to measure. This limits the granularity at which partial compilation pays off -- the not-compiled region has to be relatively costly/slow/unlikely. It might be possible to backpatch the patchpoint call and turn it into a jump to the continuation -- that would reduce the transition time quite a bit (basically just the prolog/epilog, which should be 10's of instructions). That would make this technique viable for a wider variety of cases. Initial transition time would be quite a bit higher, as the jit needs to run; for RyuJit the minimum jitting time is on the order of 200us (?) and there is more work on the runtime side too. |
Remaining failure in the "jit-experimental / uncommon" leg seems to be a dataflow bug. I could work around by not optimizing the Tier0 uncommon fragments, but we should fix the bug. Not clear how easy it will be to repro w/o the unusual dataflow created by the changes in the PR. Will give it a try. |
Think I've fixed the last bug in the pri1 tests. Here's a sample of the codegen this produces, for a simple program: using System;
class E : Exception
{
public E(double v)
{
m_v = v;
}
double m_v;
}
class X
{
public static void Main(string[] args)
{
if (args.Length == 1)
{
throw new E(11.3);
}
Console.WriteLine("hello, world\n");
}
} Initial jitting of ; Assembly listing for method X:Main(System.String[])
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 1, 1 ) ref -> [rbp+0x10] class-hnd
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02 ] ( 1, 1 ) ref -> [rbp-0x08] must-init class-hnd exact "NewObj constructor temp"
;
; Lcl frame size = 48
G_M3786_IG01:
55 push rbp
4883EC30 sub rsp, 48
488D6C2430 lea rbp, [rsp+30H]
33C0 xor rax, rax
488945F8 mov qword ptr [rbp-08H], rax
48894D10 mov gword ptr [rbp+10H], rcx
;; bbWeight=1 PerfScore 4.00
G_M3786_IG02:
488B4D10 mov rcx, gword ptr [rbp+10H]
83790801 cmp dword ptr [rcx+8], 1
750A jne SHORT G_M3786_IG04
;; bbWeight=1 PerfScore 4.00
G_M3786_IG03:
B906000000 mov ecx, 6
E828044C5F call CORINFO_HELP_UNCOMMON_PATCHPOINT
;; bbWeight=0 PerfScore 0.00
G_M3786_IG04:
48B99031A16CA4010000 mov rcx, 0x1A46CA13190
488B09 mov rcx, gword ptr [rcx]
E8F6F8FFFF call System.Console:WriteLine(System.String)
90 nop
;; bbWeight=1 PerfScore 3.50
G_M3786_IG05:
488D6500 lea rsp, [rbp]
5D pop rbp
C3 ret
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 65, prolog size 16, PerfScore 20.00, (MethodHash=69d3f135) for method X:Main(System.String[])
; ============================================================ at runtime when the helper is called, it creates an OSR method for the bit we didn't compile: ; Assembly listing for method X:Main(System.String[])
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; OSR variant for entry point 0x6
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) ref -> zero-ref class-hnd
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T00] ( 4, 0 ) ref -> rsi class-hnd exact "NewObj constructor temp"
;
; Lcl frame size = 32
G_M3786_IG01:
56 push rsi
4883EC20 sub rsp, 32
C5F877 vzeroupper
;; bbWeight=1 PerfScore 2.25
G_M3786_IG02:
48B9E0CF21DBF97F0000 mov rcx, 0x7FF9DB21CFE0
E819A78D5F call CORINFO_HELP_NEWSFAST
488BF0 mov rsi, rax
488BCE mov rcx, rsi
E88E10FFFF call System.Exception:.ctor():this
C5FB100516000000 vmovsd xmm0, qword ptr [reloc @RWD00]
C5FB114678 vmovsd qword ptr [rsi+120], xmm0
488BCE mov rcx, rsi
E839A64B5F call CORINFO_HELP_THROW
CC int3
;; bbWeight=0 PerfScore 0.00
RWD00 dq 402699999999999Ah
; Total bytes of code 56, prolog size 8, PerfScore 8.05, (MethodHash=69d3f135) for method X:Main(System.String[])
; ============================================================ and we see the expected result:
We're optimizing the OSR part which is probably not the right thing here, but if the OSR continuation contains more code than just a throw -- say the throw was in a try -- we might want to optimize. The full Tier0 version of the method is 103 bytes,. Partial Tier0 + OSR is 121. This delta would go down some if we didn't optimize the OSR part. |
Remaining failures in "jit-experimental" are all from write-through. |
@dotnet/jit-contrib think this is ready for some feedback. I'm going to try and measure startup impact from this version. Possible next steps:
|
@dotnet/jit-contrib anyone want to take a look? This was (to me) a surprisingly small change on top of OSR, and I think it opens the door to some interesting possibilities. |
Introduce a new form of patchpoint for rarely-executed blocks in a method (typically, blocks with throws). Instead of jitting the code in those blocks, add code to call a new patchpoint helper (similar to the existing one) which finds or creates the jitted code for the method beyond this point, and then transitions control to that code. Suppressing codegen for exception throws in this way gives similar benefits to the throw helper transformation, without needing to modify source. This currently works only in Tier0, but in principle could be extended to handle optimized code.
There may also be an issue (not restricted to OSR) for try exits; left a todo there for now.
} | ||
else | ||
{ | ||
BitVecOps::IntersectionD(apTraits, block->bbAssertionIn, firstTryBlock->bbAssertionIn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment could be added here stating that for the normal case we use the assertion state from the first block of the try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
BitVecOps::IntersectionD(apTraits, block->bbAssertionIn, firstTryBlock->bbAssertionIn); | ||
} | ||
|
||
// TODO -- possible bug: we may need to merge in the out set for every block in the try range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still want this TODO ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're safe now, could not find a way to break us here. Will remove this.
@@ -396,6 +396,8 @@ CONFIG_INTEGER(JitGuardedDevirtualizationGuessBestClass, W("JitGuardedDevirtuali | |||
|
|||
// Enable insertion of patchpoints into Tier0 methods with loops. | |||
CONFIG_INTEGER(TC_OnStackReplacement, W("TC_OnStackReplacement"), 0) | |||
// Enable uncommon patchpoints in Tier0 methods | |||
CONFIG_INTEGER(TC_UncommonPatchpoint, W("TC_UncommonPatchpoint"), 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add " (by default this is disabled)" to the comment.
src/coreclr/src/jit/patchpoint.cpp
Outdated
compiler->lvaTable[ppCounterLclNum].lvType = TYP_INT; | ||
if (compiler->doesMethodHavePatchpoints()) | ||
{ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this Empty block or add some code or JITDUMP.
src/coreclr/src/jit/patchpoint.cpp
Outdated
// | ||
// ==> | ||
// | ||
// call JIT_UNCOMMON_PATHCPOINT (ilOffset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this method header comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's trying to say that we delete all the statements in the block and add a call; I'll clarify.
" unexpected context IP 0x%p\n", ip, GetIP(&frameContext)); | ||
EEPOLICY_HANDLE_FATAL_ERROR(COR_E_EXECUTIONENGINE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change in behavior, but it now matches the comments.
{ | ||
// Should be fatal | ||
STRESS_LOG2(LF_TIEREDCOMPILATION, LL_INFO10, "Jit_UncommonPatchpoint: patchpoint (0x%p) TRANSITION" | ||
" unexpected context IP 0x%p\n", ip, GetIP(&frameContext)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says that this should be fatal, but it isn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will update to match the other cases like this.
0ed75da
to
251a617
Compare
@briansull thanks -- updated per your feedback. I must have rebased somewhere along the way too, so had to force push. |
Had to do another merge (conflict with another of my changes). |
I'm going to close this for now, will revisit once 5.0 has branched off. |
Introduce a new form of patchpoint for rarely-executed blocks in a method
(typically, blocks with throws). Instead of jitting the code in those blocks,
add code to call a new patchpoint helper (similar to the existing one) which
finds or creates the jitted code for the method beyond this point, and then
transitions control to that code.
Suppressing codegen for exception throws in this way gives similar benefits
to the throw helper transformation, without needing to modify source. This
currently works only in Tier0, but in principle could be extended to handle
optimized code.