-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[mono] Implement synch block fast paths in managed #81380
[mono] Implement synch block fast paths in managed #81380
Conversation
ffb0b11
to
5dbc58e
Compare
5dbc58e
to
1af4d32
Compare
This might lead to perf problems since the previous optimized c code is replaced by the not so optimized JITted code. |
@vargaz not everything has intrinsics (Monitor_test_synchronised and Monitor_test_owner were just plain icalls). Also the interpreter doesn't have intrinsics for On JIT/AOT the intrinsic is: if (mono_monitor_enter_v4_fast (_lock, ref lockTaken)) // no wrapper icall
goto done;
mono_monitor_enter_v4_internal (_lock, ref lockTaken); // icall with wrapper
done: I think it probably makes sense to keep this, but it's also possible that For interp+AOT I think it will for sure be cheaper than a interp->native transition. By the way, my plan is just to do the fast paths in managed. If we see contention, we'll still call |
3a19a0d
to
2c0f91c
Compare
On one laptop, I see this from running Release WASM System.Runtime.Tests using v8 without AOT: Baseline:
This PR:
So we lost about 150ms. Although neither run was very stable - I think my CPU is throttling. I'm planning to look at optimizing the interpreter IR. For single threaded wasm every |
Interesting. I didn't realize it, but we don't have |
6efb45b
to
32dc82d
Compare
Added CompareExchange and some Interpreter IR for ReliableEnterTimeoutRuntime method: System.Threading.Monitor:ReliableEnterTimeout (object,int,bool&) 0x16da01470 Locals size 128 Calculated stack height: 4, stated height: 8 IR_0000: ldstr [104 <- nil], 0 IR_0003: mov.8 [96 <- 0], IR_0006: call [48 <- 96], 1 IR_000a: bge.i4.imm.sp [nil <- 8], 0, IR_001c IR_000e: beq.i4.imm.sp [nil <- 8], 65535, IR_001c IR_0012: ldstr [104 <- nil], 2 IR_0015: newobj [48 <- 96], 4 IR_001a: throw [nil <- 48], IR_001c: mov.8 [40 <- 0], IR_001f: ldfld.i8 [48 <- 0], 8 IR_0023: mov.8 [48 <- 48], IR_0026: mov.8 [96 <- 0], IR_0029: call [56 <- 96], 5 IR_002d: mov.vt [32 <- 48], 8 IR_0031: mov.8 [48 <- 32], IR_0034: bne.un.i8.imm.sp [nil <- 48], 0, IR_0070 IR_0038: call [48 <- 96], 6 IR_003c: cknull [96 <- 48], IR_003f: call [48 <- 96], 7 IR_0043: conv.i8.u4 [48 <- 48], IR_0046: shl.i8.imm [48 <- 48], 10 IR_004a: mov.8 [48 <- 48], IR_004d: mov.vt [56 <- 32], 8 IR_0051: ldflda [64 <- 40], 8 IR_0055: mov.8 [72 <- 48], IR_0058: mov.8 [80 <- 56], IR_005b: mono_interlocked.cmpxchg.i8 [48 <- 64 72 80], IR_0060: mov.8 [96 <- 40], IR_0063: call [56 <- 96], 5 IR_0067: mov.8 [56 <- 32], IR_006a: bne.un.i8.s [nil <- 48 56], IR_00f1 IR_006e: br.s [nil <- nil], IR_00eb IR_0070: mov.8 [48 <- 32], IR_0073: ldc.i8 [56 <- nil], 2 IR_0079: and.i8 [48 <- 48 56], IR_007d: ble.un.i8.imm.sp [nil <- 48], 0, IR_008d IR_0081: mov.8 [96 <- 40], IR_0084: call [48 <- 96], 8 IR_0088: brfalse.i4.s [nil <- 48], IR_00f1 IR_008b: br.s [nil <- nil], IR_00eb IR_008d: mov.8 [48 <- 32], IR_0090: ldc.i8 [56 <- nil], 3 IR_0096: and.i8 [48 <- 48 56], IR_009a: bne.un.i8.imm.sp [nil <- 48], 0, IR_00f1 IR_009e: call [48 <- 96], 6 IR_00a2: cknull [96 <- 48], IR_00a5: call [48 <- 96], 7 IR_00a9: mov.8 [56 <- 32], IR_00ac: shr.un.i8.imm [56 <- 56], 10 IR_00b0: bne.un.i4.s [nil <- 56 48], IR_00f1 IR_00b4: ldloca.s [96 <- nil], 32 IR_00b7: call [48 <- 96], 9 IR_00bb: brfalse.i4.s [nil <- 48], IR_00c0 IR_00be: br.s [nil <- nil], IR_00f1 IR_00c0: mov.8 [48 <- 32], IR_00c3: add.i8.imm [48 <- 48], 4 IR_00c7: mov.8 [48 <- 48], IR_00ca: mov.vt [56 <- 32], 8 IR_00ce: ldflda [64 <- 40], 8 IR_00d2: mov.8 [72 <- 48], IR_00d5: mov.8 [80 <- 56], IR_00d8: mono_interlocked.cmpxchg.i8 [48 <- 64 72 80], IR_00dd: mov.8 [96 <- 40], IR_00e0: call [56 <- 96], 5 IR_00e4: mov.8 [56 <- 32], IR_00e7: bne.un.i8.s [nil <- 48 56], IR_00f1 IR_00eb: ldc.i4.1 [48 <- nil], IR_00ed: stind.i1 [nil <- 16 48], IR_00f0: ret.void [nil <- nil], IR_00f1: ldc.i4.1 [112 <- nil], IR_00f3: mov.8.3 [nil <- nil], 96 <- 0, 104 <- 8, 120 <- 16 IR_00fa: call [48 <- 96], 10 IR_00fe: ret.void [nil <- nil], |
With intrinsics
|
One final update. I think this is ready for review @vargaz @kg @BrzVlad Used a Interpreter IR for ReliableEnterTimeoutRuntime method: System.Threading.Monitor:ReliableEnterTimeout (object,int,bool&) 0x16b9d5470 Locals size 144 Calculated stack height: 4, stated height: 8 IR_0000: ldstr [120 <- nil], 0 IR_0003: mov.8 [112 <- 0], IR_0006: call [64 <- 112], 1 IR_000a: bge.i4.imm.sp [nil <- 8], 0, IR_001c IR_000e: beq.i4.imm.sp [nil <- 8], 65535, IR_001c IR_0012: ldstr [120 <- nil], 2 IR_0015: newobj [64 <- 112], 4 IR_001a: throw [nil <- 64], IR_001c: mov.8 [32 <- 0], IR_001f: ldloca.s [64 <- nil], 32 IR_0022: newobj_vt_inlined [80 <- 72], 8 IR_0026: stfld.i8 [nil <- 80 64], 0 IR_002a: mov.vt [48 <- 72], 8 IR_002e: mov.8 [64 <- 72], IR_0031: ldind.i8 [64 <- 64], IR_0034: ldfld.i8 [64 <- 64], 8 IR_0038: mov.8 [64 <- 64], IR_003b: mov.vt [40 <- 64], 8 IR_003f: mov.8 [64 <- 40], IR_0042: bne.un.i8.imm.sp [nil <- 64], 0, IR_007d IR_0046: call [64 <- 112], 5 IR_004a: cknull [112 <- 64], IR_004d: call [64 <- 112], 6 IR_0051: conv.i8.u4 [64 <- 64], IR_0054: shl.i8.imm [64 <- 64], 10 IR_0058: mov.8 [64 <- 64], IR_005b: mov.vt [72 <- 40], 8 IR_005f: mov.8 [80 <- 48], IR_0062: ldind.i8 [80 <- 80], IR_0065: ldflda [80 <- 80], 8 IR_0069: mov.8 [88 <- 64], IR_006c: mov.8 [96 <- 72], IR_006f: mono_interlocked.cmpxchg.i8 [64 <- 80 88 96], IR_0074: mov.8 [72 <- 40], IR_0077: bne.un.i8.s [nil <- 64 72], IR_00fe IR_007b: br.s [nil <- nil], IR_00f8 IR_007d: mov.8 [64 <- 40], IR_0080: ldc.i8 [72 <- nil], 2 IR_0086: and.i8 [64 <- 64 72], IR_008a: ble.un.i8.imm.sp [nil <- 64], 0, IR_009b IR_008e: mov.vt [112 <- 48], 8 IR_0092: call [64 <- 112], 7 IR_0096: brfalse.i4.s [nil <- 64], IR_00fe IR_0099: br.s [nil <- nil], IR_00f8 IR_009b: mov.8 [64 <- 40], IR_009e: ldc.i8 [72 <- nil], 3 IR_00a4: and.i8 [64 <- 64 72], IR_00a8: bne.un.i8.imm.sp [nil <- 64], 0, IR_00fe IR_00ac: call [64 <- 112], 5 IR_00b0: cknull [112 <- 64], IR_00b3: call [64 <- 112], 6 IR_00b7: mov.8 [72 <- 40], IR_00ba: shr.un.i8.imm [72 <- 72], 10 IR_00be: bne.un.i4.s [nil <- 72 64], IR_00fe IR_00c2: ldloca.s [112 <- nil], 40 IR_00c5: call [64 <- 112], 8 IR_00c9: brfalse.i4.s [nil <- 64], IR_00ce IR_00cc: br.s [nil <- nil], IR_00fe IR_00ce: mov.8 [64 <- 40], IR_00d1: add.i8.imm [64 <- 64], 4 IR_00d5: mov.8 [64 <- 64], IR_00d8: mov.vt [72 <- 40], 8 IR_00dc: mov.8 [80 <- 48], IR_00df: ldind.i8 [80 <- 80], IR_00e2: ldflda [80 <- 80], 8 IR_00e6: mov.8 [88 <- 64], IR_00e9: mov.8 [96 <- 72], IR_00ec: mono_interlocked.cmpxchg.i8 [64 <- 80 88 96], IR_00f1: mov.8 [72 <- 40], IR_00f4: bne.un.i8.s [nil <- 64 72], IR_00fe IR_00f8: ldc.i4.1 [64 <- nil], IR_00fa: stind.i1 [nil <- 16 64], IR_00fd: ret.void [nil <- nil], IR_00fe: ldc.i4.1 [128 <- nil], IR_0100: mov.8.3 [nil <- nil], 112 <- 0, 120 <- 8, 136 <- 16 IR_0107: call [64 <- 112], 9 IR_010b: ret.void [nil <- nil], |
The numbers from the library tests suite are not really useful, since there is no significant difference there. I was able to miraculously find the benchmark suite that I used for testing when I implemented the monitor thin locks ages ago. Running those benchmarks on jit and interp, with and without your change might provide some better context of the performance implications of this. |
Can the argument validation methods become aggressive inlined so that the fast path never performs a call? Calls aren't cheap in interp afaik, those alone might be worse than doing all this in an icall. You could put the failure case for validation in a call |
avoid duplicated lockword and sync block manipulations
Instead of treating InternalGetHashCode/InternalTryGetHashCode as intrinsics in the interpreter, intrinsify RuntimeHelpers.GetHashCode RuntimeHelpers.TryGetHashCode and avoid the managed code entirely when possible
The code that recognizes GC transition icalls in the interpreter was only assuming that it will see mono_threads_enter_gc_safe_region_unbalanced / mono_threads_exit_gc_safe_region_unbalanced which are used by managed-to-native wrappers. In most cases for native-to-managed wrappers the marshaller emits mono_threads_attach_coop / mono_threads_detach_coop and those are handled elsewhere by setting the needs_thread_attach flag on the InterpMethod. However when the mono_marshal_get_thunk_invoke_wrapper API is used, we emit a thunk invoke wrapper which uses mono_threads_enter_gc_unsafe_region_unbalanced / mono_threads_exit_gc_unsafe_region_unbalanced Recognize those two icalls and set the needs_thread_attach flag. (This is slightly more work than necessary - mono_threads_attach_coop checks if the thread was previously attached to the runtime and attaches it if it wasn't. The thunk invoke wrappers seem to assume the API caller attaches the thread - so we pay for an extra check. But on the other hand we don't need to change the execution-time behavior of the interpreter by reusing existing mechanisms)
src/mono/wasm/runtime/jiterpreter.ts
Outdated
append_stloc_tail(builder, getArgU16(ip, 1), WasmOpcode.i32_store); | ||
break; | ||
case MintOpcode.MINT_MONO_CMPXCHG_I8: | ||
// because i64 values can't pass through JS cleanly (c.f getRawCwrap and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these values enter/exit JS since the implementation of cmpxchg_i64 is C and the trace is pure wasm, so it should be fine to use i64s directly here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because you're using getRawCwrap there's no JS wrapper involved (see
const result = (<any>Module)["asm"][name]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it doesn't work - Wasm.instantiate throws an import error. because EMSCRIPTEN_KEEPALIVE
makes some weird signature for mono_jiterp_cas_i64
if I pass int64_t arguments by value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible that something else was wrong when I was doing it by value. I'm not too inclined to investigate it right now to be honest. We could make an issue to follow-up.
f419274
to
30a405d
Compare
For the i64 version, pass the address of the i64 values to the helper function. The issue is that since JS doesn't have i64, EMSCRIPTEN_KEEPALIVE messes with the function's signature and we get import errors in the JITed wasm module. Passing by address works around the issue since addresses are i32.
This reverts commit 5b498b8d4e265f8a68fe9453b535e48cf56756ce.
Only assert in debug builds
Make a "GC Unsafe Transition Builder" - that always calls mono_threads_attach_coop / mono_threads_detach_coop. Use it in the native to managed wrappers: emit_thunk_invoke_wrapper and emit_managed_wrapper This is a change in behavior for emit_thunk_invoke_wrapper - previously it directly called mono_threads_enter_gc_unsafe_region_unbalanced. That means that compared to the previous behavior, the thunk invoke wrappers are now a bit more lax: they will be able to be called on threads that aren't attached to the runtime and they will attach automatically. On the other hand existing code will continue to work, with the extra cost of a check of a thread local var. Using mono_thread_attach_coop also makes invoke wrappers work correctly in the interpreter - it special cases mono_thread_attach_coop but not enter_gc_unsafe.
30a405d
to
7a55bf0
Compare
src/mono/System.Private.CoreLib/src/System/Threading/Monitor.Mono.cs
Outdated
Show resolved
Hide resolved
// stack that stores a pointer to the object, thus pinning the object. | ||
private unsafe ref struct ObjectHeaderOnStack | ||
{ | ||
private Header** _header; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a Header** doesn't that mean there won't be a visible pointer to the object on the stack? Instead, it's a pointer-to-pointer, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_header
is a pointer to pointer. Which means the thing that it points to is on the stack (well... the thing it points to is in memory - but we happen to know it's on the stack) - that's the one we want the GC to notice.
The code that recognizes GC transition icalls in the interpreter was only assuming that it will see `mono_threads_enter_gc_safe_region_unbalanced` / `mono_threads_exit_gc_safe_region_unbalanced` which are used by managed-to-native wrappers. In most cases for native-to-managed wrappers the marshaller emits `mono_threads_attach_coop` / `mono_threads_detach_coop` and those are handled elsewhere by setting the `needs_thread_attach` flag on the `InterpMethod`. However when the `mono_marshal_get_thunk_invoke_wrapper` API is used, we emit a thunk invoke wrapper which uses `mono_threads_enter_gc_unsafe_region_unbalanced` / `mono_threads_exit_gc_unsafe_region_unbalanced` Change the thunk invoke wrapper to also use attach_coop/detach_coop. This makes the thunks a bit more flexible (they can now be called from unattached threads), at the cost of a TLS lookup. Also fixes interpreter support since they use the existing attach/detach handling. As an implementation detail, I added a GC Unsafe Transition Builder to the marshaling code. Fixes a failure seen in #81380 in the MonoAPI/MonoMono/Thunks test (exposed by calling a cctor later than previously) * Assert that CEE_MONO_GET_SP is followed by gc safe enter/exit icalls Only assert in debug builds * [marshal] Add GCUnsafeTransitionBuilder; use for thunk_invoke_wrapper Make a "GC Unsafe Transition Builder" - that always calls mono_threads_attach_coop / mono_threads_detach_coop. Use it in the native to managed wrappers: emit_thunk_invoke_wrapper and emit_managed_wrapper This is a change in behavior for emit_thunk_invoke_wrapper - previously it directly called mono_threads_enter_gc_unsafe_region_unbalanced. That means that compared to the previous behavior, the thunk invoke wrappers are now a bit more lax: they will be able to be called on threads that aren't attached to the runtime and they will attach automatically. On the other hand existing code will continue to work, with the extra cost of a check of a thread local var. Using mono_thread_attach_coop also makes invoke wrappers work correctly in the interpreter - it special cases mono_thread_attach_coop but not enter_gc_unsafe.
f56ffd8
to
71a9422
Compare
The underlying C functions are null-friendly, and the managed intrinsics are allowed to C null inputs. The assert is unnecessary.
Investigate one incremental alternative from #48058
Rather than switching to NativeAOT's Monitor implementation in one step, first get access to Mono's synchronization block from managed, and implement the fast paths for hash code and monitor code in C#.