-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NativeAOT] Inline TLS access for windows/x64 #89472
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsInline the TLS access for windows/x64 so we avoid calls to get Current code gen: 0000000000001AC4: E8 00 00 00 00 call __GetThreadStaticBase_repro_Program
0000000000001AC9: C7 40 08 0D F0 0D mov dword ptr [rax+8],900DF00Dh
...
...
__GetThreadStaticBase_repro_Program:
000000000000104D: 8B 0D 00 00 00 00 mov ecx,dword ptr [_tls_index]
0000000000001053: 65 48 8B 04 25 58 mov rax,qword ptr gs:[58h]
00 00 00
000000000000105C: 48 8B 04 C8 mov rax,qword ptr [rax+rcx*8]
0000000000001060: B9 00 00 00 00 mov ecx,offset tls_InlinedThreadStatics
0000000000001065: 48 01 C1 add rcx,rax
0000000000001068: 48 8B 01 mov rax,qword ptr [rcx]
000000000000106B: 48 85 C0 test rax,rax
000000000000106E: 0F 84 00 00 00 00 je S_P_CoreLib_Internal_Runtime_ThreadStatics__GetInlinedThreadStaticBaseSlow
0000000000001074: C3 ret New code gen: 0000000000001AE5: 65 48 8B 04 25 58 mov rax,qword ptr gs:[58h]
00 00 00
0000000000001AEE: 8B 0D 00 00 00 00 mov ecx,dword ptr [_tls_index]
0000000000001AF4: BA 00 00 00 00 mov edx,offset tls_InlinedThreadStatics
0000000000001AF9: 48 03 14 C8 add rdx,qword ptr [rax+rcx*8]
0000000000001AFD: 48 8B CA mov rcx,rdx
0000000000001B00: 48 8B 19 mov rbx,qword ptr [rcx]
0000000000001B03: 48 85 DB test rbx,rbx
0000000000001B06: 74 1D je 0000000000001B25
0000000000001B08: C7 43 08 0D F0 0D mov dword ptr [rbx+8],900DF00Dh
|
@kunalspathak, some platforms seem to have tls_get_addr in RO section. We've started to get this on illumos-x64 after #87082: $ docker run --rm -v$(pwd):/runtime -e ROOTFS_DIR=/crossrootfs/x64 \
mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-cross-illumos-20220531132048-f13d79e \
/runtime/build.sh -c Release -os illumos -cross -gcc
...
[ 95%] Built target clrjit_win_x64_x64
[ 95%] Built target coreclr_static
/crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: ../../../vm/wks/CMakeFiles/cee_wks_core.dir/__/amd64/asmhelpers.S.o: warning: relocation against `__tls_get_addr@@SUNWprivate_1.1' in read-only section `.text'
/crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: ../../../vm/wks/CMakeFiles/cee_wks_core.dir/__/amd64/asmhelpers.S.o: relocation R_X86_64_PC32 against protected symbol `__tls_get_addr@@SUNWprivate_1.1' can not be used when making a shared object; recompile with -fPIC
/crossrootfs/x64/bin/../lib/gcc/x86_64-sun-solaris2.10/8.4.0/../../../../x86_64-sun-solaris2.10/bin/ld: final link failed: bad value
collect2: error: ld returned 1 exit status
dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/build.make:974: recipe for target 'dlls/mscoree/coreclr/libcoreclr.so' failed
make[2]: *** [dlls/mscoree/coreclr/libcoreclr.so] Error 1
CMakeFiles/Makefile2:4678: recipe for target 'dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/all' failed
make[1]: *** [dlls/mscoree/coreclr/CMakeFiles/coreclr.dir/all] Error 2
Makefile:135: recipe for target 'all' failed
make: *** [all] Error 2
/runtime/src/coreclr
Failed to build "CoreCLR component". I haven't had time to go deeper on it, but thought you might wana know about this variation. |
That seems familiar. Might have to spend time trying to understand. |
@dotnet/jit-contrib @VSadov @MichalStrehovsky |
@MichalStrehovsky - let me know if there are more pipelines that i should trigger to test this out. |
src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/DependencyAnalysis/NodeFactory.cs
Outdated
Show resolved
Hide resolved
src/coreclr/jit/helperexpansion.cpp
Outdated
// block (...): [weight: 1.0] | ||
// use(tlsRoot); | ||
|
||
if (hasLazyStaticCtor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to deal with the static constructor independently from the field access? Something like;
- If the type has static constructor, emit regular static constructor trigger. This can be then optimized/deduplicated with static constructor triggers for the same type.
- Emit thread static access. This has no side-effects, it can be moved around, CSEd, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to deal with the static constructor independently from the field access?
The expansion happens after all the optimizations, CSE, etc. is done. We tried earlier to do in earlier phase, but it was making it hard to hoist certain checks as mentioned in #82973 (comment). Ideally, we need to expand just subset of appropriate portion of tree in various phases to achieve idle code, but might be tricky to get all the cases. Even today, without these changes, we do not hoist the call to helper out of loop because of
runtime/src/coreclr/jit/optimizer.cpp
Lines 7575 to 7580 in bd83e17
else if (s_helperCallProperties.MayRunCctor(helpFunc) && | |
((call->gtFlags & GTF_CALL_HOISTABLE) == 0)) | |
{ | |
INDEBUG(failReason = "non-hoistable helper call";) | |
treeIsHoistable = false; | |
} |
static void Test(int n)
{
for (int i = 0; i < n; i++)
{
x1 = 5; x2 = 5; x3 = 5;
y1 = 5; y2 = 5; y3 = 5;
}
}
G_M000_IG01: ;; offset=0000H
push rsi
push rbx
sub rsp, 40
mov ebx, ecx
G_M000_IG02: ;; offset=0008H
lea rax, [(reloc 0x42aa78)]
cmp qword ptr [rax-08H], 0
jne SHORT G_M000_IG06
G_M000_IG03: ;; offset=0016H
xor esi, esi
test ebx, ebx
jle SHORT G_M000_IG05
G_M000_IG04: ;; offset=001CH
call CORINFO_HELP_READYTORUN_THREADSTATIC_BASE
mov dword ptr [rax+08H], 5
mov dword ptr [rax+0CH], 5
mov dword ptr [rax+10H], 5
mov dword ptr [rax+14H], 5
mov dword ptr [rax+18H], 5
mov dword ptr [rax+1CH], 5
inc esi
cmp esi, ebx
jl SHORT G_M000_IG04
G_M000_IG05: ;; offset=0051H
add rsp, 40
pop rbx
pop rsi
ret
G_M000_IG06: ;; offset=0058H
call CORINFO_HELP_READYTORUN_NONGCSTATIC_BASE
jmp SHORT G_M000_IG03
I propose to take these changes for .NET 8 because of time constraints and revisit missing gaps in .NET 9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dotnet/jit-contrib Do you believe that this duplication is worth it for the time being?
.NET 8 because of time constraints and revisit missing gaps in .NET 9.
I think it may be too late to take this change for .NET 8. We are actively trying to shut things down for .NET 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be too late to take this change for .NET 8.
In that case, I can start working on resolving the duplication problem that you pointed and get this in for .NET 9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried exploring on the idea of expanding lazy static ctor separately and it could be beneficial in cases where we have both a static init and a TLS, defined and used on same type (see example below). But it still won't eliminate the extra ctor check we introduce by expanding static initializer. If we have the static ctor trigger early, the VN will not
give it same number because under the hood, they call different helper. We could theortically avoid emitting the 2nd invocation of ctor
check (0x40000000004203c8
related check below) it during expansion, if we have seen similar ctor invocation earlier, but for that we will have to make sure the previous ctor invocation dominates the block having the 2nd ctor invocation.
static string i = 5.ToString();
[ThreadStatic]
static string j = 5.ToString();
void Foo()
{
Consume(i);
Consume(j);
}
G_M27646_IG02: ;; offset=0x0005
lea rcx, [(reloc 0x40000000004203c8)]
cmp qword ptr [rcx-0x08], 0
jne SHORT G_M27646_IG06
mov rcx, qword ptr [(reloc 0x40000000004203d0)]
mov rcx, gword ptr [rcx+0x08]
call Program:Consume(System.String)
lea r8, [(reloc 0x40000000004203c8)]
add r8, -8
cmp qword ptr [r8], 0
jne SHORT G_M27646_IG07
;; size=47 bbWeight=1 PerfScore 14.25
G_M27646_IG03: ;; offset=0x0034
mov rcx, qword ptr GS:[0x0058]
mov eax, dword ptr [(reloc 0x40000000004203d8)]
mov edx, (reloc 0x40000000004203e0)
add rdx, qword ptr [rcx+8*rax]
mov rbx, qword ptr [rdx]
test rbx, rbx
je SHORT G_M27646_IG08
;; size=32 bbWeight=1 PerfScore 9.50
G_M27646_IG04: ;; offset=0x0054
mov rcx, gword ptr [rbx+0x80]
call Program:Consume(System.String)
nop
;; size=13 bbWeight=1 PerfScore 3.25
G_M27646_IG05: ;; offset=0x0061
add rsp, 32
pop rbx
ret
;; size=6 bbWeight=1 PerfScore 1.75
G_M27646_IG06: ;; offset=0x0067
call CORINFO_HELP_READYTORUN_GCSTATIC_BASE
mov rcx, qword ptr [(reloc 0x40000000004203d0)]
mov rcx, gword ptr [rcx+0x08]
call Program:Consume(System.String)
lea r8, [(reloc 0x40000000004203c8)]
add r8, -8
cmp qword ptr [r8], 0
je SHORT G_M27646_IG03
;; size=38 bbWeight=0 PerfScore 0.00
G_M27646_IG07: ;; offset=0x008D
xor ecx, ecx
mov edx, -1
lea rax, [(reloc 0x40000000004203f0)]
call rax
mov rbx, rax
jmp SHORT G_M27646_IG04
;; size=21 bbWeight=0 PerfScore 0.00
G_M27646_IG08: ;; offset=0x00A2
mov rcx, rdx
lea rax, [(reloc 0x40000000004203e8)]
call rax
mov rbx, rax
jmp SHORT G_M27646_IG04
;; size=17 bbWeig
I am not sure how common it is to see such patterns, @VSadov - any thoughts?
In cases such as below, we generate expected code:
[ThreadStatic]
static string i = 5.ToString();
[ThreadStatic]
static string j = 5.ToString();
for (int k = 0; k < 10; k++)
{
Consume(i);
Consume(j);
}
G_M27646_IG02: ;; offset=0x0006
xor ebx, ebx
lea r8, [(reloc 0x40000000004203e8)]
add r8, -8
cmp qword ptr [r8], 0
jne SHORT G_M27646_IG05
mov rcx, qword ptr GS:[0x0058]
mov eax, dword ptr [(reloc 0x40000000004203c8)]
mov edx, (reloc 0x40000000004203d0)
add rdx, qword ptr [rcx+8*rax]
mov rsi, qword ptr [rdx]
test rsi, rsi
je SHORT G_M27646_IG06
;; size=51 bbWeight=1 PerfScore 14.50
G_M27646_IG03: ;; offset=0x0039
mov rcx, gword ptr [rsi+0x68]
call Program:Consume(System.String)
mov rcx, gword ptr [rsi+0x78]
call Program:Consume(System.String)
inc ebx
cmp ebx, 10
jl SHORT G_M27646_IG03
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of general patterns, I think we can only assume that threadstatics are relatively less common than other kinds of fields. Also it is not common to have a lot of them in the same type. It is valid and certainly possible to see numerous threadstatics in a type, but it is more common for a type to have just one such field that holds all the threadstatic state for the type.
I'd say - 2 threadstatic fields in a type is likely common enough, 20 - is much less likely.
Mixing threadstatic and regular static fields in the same type is fairly common.
As for the order of access within the same method - static or threadstatic first, I think it would be too subtle if one pattern performs better than another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can only assume that threadstatics are relatively less common than other kinds of fields
On the other hand, I'd guess that code that uses them is often performance sensitive so handling it efficiently for AOT is important.
I tried running a few libraires testcases in a loop with the change, for stress purposes, and once in a while (about every 5-15 iterations) I see occasional crashes that look like GC holes. Same tests with f2a9ef8 commit from main (I assume that is the baseline for this change) run without failures, at least for 300 iterations, that I waited for. Sometimes it is not a crash, but a test failure - typically like:
Possibly the same reason, just a different way to fail. Maybe worth checking if all is ok with GC info? |
better handling of GTF_ICON_SECREL_OFFSET
Better tracking `t_inlinedThreadStaticBase` as TYP_REF
I am running the same subset of the tests that was hitting the GC holes and after 100+ iterations I do not see any failures. |
I cannot comment on the shape of JIT changes, but we do get the desired codegen and the run time behavior. Very nice! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!!
The failures in
|
@BruceForstall - can you please take another look? |
* wip * working model * wip * wip * working * Add helper for tlsIndex * add methods in superpmi * revert some local changes * misc fixes * Stop emitting TLS access code for windows/x64 * fix linux build errors * Do not throw not implemented for windows/x64 * fix the problem where ThreadStaticBase helper was still getting invoked * Revert certain changes from JIT method * Introduce getThreadLocalStaticInfo_ReadyToRun() * Consume getThreadLocalStaticInfo_ReadyToRun() * Remove getTlsRootInfo() and other methods * Revert unneeded changes * missing gtInitCldHnd initialization * save target address * jit format * run thunkgenerator * resolve merge conflicts * fix issues so the TLS is inlined * Rename data structures from *_ReadyToRun to *_NativeAOT * jit format * fix some unit test * fix a bug * fix the weird jump problem * use enclosing type cls handle for VN of static gc/non-gc helper * fix a bug of resetting the flag * useEnclosingTypeOnly from runtime to determine if VN should optimize it * do not use vnf, but only use useEnclosingTypeAsArg0 * Use GT_COMMA to add GCStaticBase call next to TLS call * optimize the cctor call * Remove lazy ctor generation from tls * Update jitinterface to not fetch data for lazy ctor * fix errors after merge * fix test build errors * fix bug in CSE * Use CORINFO_FLG_FIELD_INITCLASS instead of separate flag * Use the INITCLASS flag * Remove useEnclosingTypeOnly * Add NoCtor * Use CORINFO_HELP_READYTORUN_THREADSTATIC_BASE_NOCTOR * Minor cleanup * Renegenrate thunk * Add the SetFalseTarget * fix merge conflict resolution * better handling of GTF_ICON_SECREL_OFFSET better handling of GTF_ICON_SECREL_OFFSET * review feedback * Disable optimization for minopts * Add comments around iiaSecRel * jit format * create emitNewInstrCns() * Expand TLS even if optimization is disabled * Track t_inlinedThreadStaticBase Better tracking `t_inlinedThreadStaticBase` as TYP_REF * jit format
Inline the TLS access for windows/x64 so we avoid calls to get
ThreadStaticBase
.Current code gen:
New code gen:
Contributes to #79521