
[NativeAOT-LLVM] Threads and GC Poll #2806

Open
yowl opened this issue Nov 22, 2024 · 13 comments
Labels
area-NativeAOT-LLVM LLVM generation for Native AOT compilation (including Web Assembly)

Comments

@yowl
Contributor

yowl commented Nov 22, 2024

I'd like to move threads forward a little bit. We need to insert calls to RhpGcPoll (currently a NOP) into the generated code. I think we are going to do this when making managed calls; are those enough places? Are we concerned with a method that has a long-running loop that makes no calls?

Secondly, should we be doing this in llvmlower.cpp or in fgInsertGCPolls or... ?

Thanks.

@SingleAccretion

SingleAccretion commented Nov 23, 2024

I think we are going to do this when making managed calls; are those enough places? Are we concerned with a method that has a long-running loop that makes no calls?

The general idea is that a poll needs to be present on all control paths that are "reasonably long". In the upstream suspension scheme, that's:

  1. Method returns in partially interruptible methods, via return address hijacking.
    • Note I am talking here about leaf, in the dynamic sense, frames.
    • The downstream scheme for partially interruptible methods could similarly add one explicit poll into prologue/epilogue, or even in the middle of the method (provided it dominates all exits). It doesn't matter a lot where - either way we will get one poll per method. However, it would be most advantageous to place polls in locations without live GC references, as the poll would force LSSA to spill them.
  2. Fully interruptible methods can be suspended almost anywhere (the exceptions are small block copies, PIs with SPGCT, prolog/epilog).
    • The downstream scheme would need to add polls on 'loop backedges', so that a spinning thread would not be able to starve the GC. Same considerations about GC-live variables apply.

The most difficult part here is the fully interruptible case - the general algorithm would need to guarantee timely suspension for all kinds of loops, including irreducible control flow. It is a problem that has been solved elsewhere though, I just don't have the best algorithm for it on-hand.
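To make the "polls on loop backedges" idea concrete, here is a minimal C++ sketch of what a loop looks like after poll insertion. All names are illustrative (the real runtime flag and poll helper are internal symbols); this only shows the shape of the transformation: a cheap flag check on the hot path before each backedge, with the actual parking logic on a slow path.

```cpp
#include <atomic>

// Hypothetical stand-in for the runtime's suspension-requested flag.
static std::atomic<int> g_suspendRequested{0};

// Stand-in for RhpGcPoll's slow path, which would park the thread
// in preemptive mode until the GC finishes.
static void RhpGcPollStub() { /* park the thread for GC */ }

// A loop with a poll inserted on its backedge: even if the body makes
// no managed calls, the GC can still suspend this thread in a timely way.
long SumWithPolls(const long* data, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];
        // Inserted poll: a relaxed load keeps the fast path cheap.
        if (g_suspendRequested.load(std::memory_order_relaxed))
            RhpGcPollStub(); // slow path, taken only during suspension
    }
    return sum;
}
```

Placing the check where no GC references are live (as noted above) keeps the poll from forcing extra spills.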

What is 'partially' and 'fully' interruptible code?

In the general case, code can contain loops, which can starve the GC. So we need to be able to suspend the code in the middle of such loops. This is where "full interruptibility" comes in: GC info is recorded for each instruction, and the suspending thread, if it sees that the current IP is in such a region, leaves the suspended thread stopped as-is.

GC info for fully interruptible methods is (relatively) large, so it is only done in cases where this granular control is needed, i. e. with loops that lack any GC safe points (managed calls). See https://github.com/dotnet/runtime/blob/f1332ab0d82ee0e21ca387cbd1c8a87c5dfa4906/src/coreclr/jit/flowgraph.cpp#L3980C40-L3984.

For other methods (the majority of them), only GC info at callsites is recorded, the assumption being that all callees, save a few helpers, are themselves interruptible and thus serve as implicit polls. This assumption is then realized via return address hijacking, where the return address on the stack of the executing method is overwritten by the suspending thread to point to a routine that will wait for GC to complete, thus "catching" the thread once it executes ret.

Secondly, should we be doing this in llvmlower.cpp or in fgInsertGCPolls or... ?

It should be done in lowering. fgInsertGCPolls as it currently exists is not really set up to do this kind of "full" poll insertion, and it also runs too early.

I am not sure what would be the best place to do it in lowering. Given the interplay with LSSA and the need to predict how poll insertion will affect how much will need to be spilled (for which liveness information can be required), it may be best to do just before LSSA, or even as part of it.

@SingleAccretion

There is another interesting question here that I haven't come across an answer to so far: does WASM even allow explicit polls? That is, if we have a loop:

```
while (!RhpSuspendThreads) { ... }
```

Is the WASM compiler allowed to rewrite it to:

```
bool alwaysTrueLocal = !RhpSuspendThreads;
while (alwaysTrueLocal) { ... } // Now unsuspendable!
```

E.g., in the IL memory model that is allowed: you need to use a stronger volatile (acquire) load to circumvent it, while all that is actually needed is a relaxed load, like volatile in C++.
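The hoisting concern can be illustrated in C++ terms. With a plain non-atomic flag, the compiler is allowed to hoist the load out of the loop (exactly the rewrite shown above); a relaxed atomic load is the weakest ordering that still forces a fresh load on every iteration. A hypothetical sketch, with `g_trap` standing in for the suspension flag:

```cpp
#include <atomic>

// Hypothetical suspension flag, set by the suspending thread.
std::atomic<bool> g_trap{false};

// Because the load is (relaxed) atomic, the compiler may not hoist it out
// of the loop, so the spinning thread is guaranteed to observe the flag.
// With a plain `bool` the loop could legally be compiled as infinite.
int SpinUntilTrapped(int workPerIteration) {
    int iterations = 0;
    while (!g_trap.load(std::memory_order_relaxed)) {
        for (volatile int i = 0; i < workPerIteration; ++i) { } // simulated work
        ++iterations;
    }
    return iterations;
}
```

The open question in this thread is whether a WASM engine grants ordinary (non-atomic) WASM loads even this much, or whether it may perform the same hoisting the IL memory model permits.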

@yowl
Contributor Author

yowl commented Nov 23, 2024

does WASM even allow explicit polls

Sounds like we need to establish this first.

@jkotas
Member

jkotas commented Nov 24, 2024

Some Silverlight configurations used explicit GC polls. The code to insert the explicit GC polls was deleted in dotnet/runtime#42664 . It inserted explicit GC polls on back edges and returns.

Is the WASM compiler allowed to rewrite it to:

It is not allowed. An infinite or long-running loop on one thread should not prevent other threads from making progress.

@SingleAccretion

SingleAccretion commented Nov 24, 2024

It is not allowed. An infinite or long-running loop on one thread should not prevent other threads from making progress.

Hmm, I am not sure what you mean. I meant: can the WASM -> native code compiler perform this transformation (like the IL -> native code one can)?

I.e., the question is what the semantics of ordinary loads in WASM are (as opposed to atomic loads, which are always sequentially consistent).

@yowl
Contributor Author

yowl commented Nov 24, 2024

Wouldn't we use the wasm atomics here?

@SingleAccretion

Wouldn't we use the wasm atomics here?

I don't see how that would work performance-wise - inserting Interlocked.CompareExchange into the epilogues of all methods.

@yowl
Contributor Author

yowl commented Nov 24, 2024

Why do we need to use Interlocked.CompareExchange to get a wasm atomic load emitted ?

@SingleAccretion

Why do we need to use Interlocked.CompareExchange to get a wasm atomic load emitted ?

It is not a question of IL -> WASM emission, but of WASM -> native code emission. A WASM atomic load is sequentially consistent, like CompareExchange is, and so will have similar performance characteristics.

@jkotas
Member

jkotas commented Nov 24, 2024

A sequentially consistent load does not require a full CompareExchange (in the C/C++ definition of sequential consistency, at least): https://godbolt.org/z/YqGGY4zK3 . Is wasm different for some reason?

@SingleAccretion

SingleAccretion commented Nov 24, 2024

Is wasm different for some reason?

No, my bad. But it will still require a stronger barrier: https://godbolt.org/z/GaET7znrP.
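The distinction the two comments above are drawing can be written down directly in C++. A seq_cst load is not a CompareExchange, but it is still stronger than the relaxed load the poll actually needs; on weakly ordered targets (and in WASM engines compiling to them) the seq_cst form carries an acquiring barrier that the relaxed form does not. Both functions below are trivially correct; the point is only the ordering each requests:

```cpp
#include <atomic>

// Hypothetical suspension flag.
std::atomic<int> g_flag{0};

// Sequentially consistent load: what a WASM atomic load maps to. On x86-64
// this is a plain MOV (the fence cost is paid on the store side), but on
// weaker architectures it becomes an acquiring load (e.g. ARM64 ldar).
int LoadSeqCst()  { return g_flag.load(std::memory_order_seq_cst); }

// Relaxed load: all the poll actually requires, and the form WASM has no
// way to express short of its (seq_cst) atomics.
int LoadRelaxed() { return g_flag.load(std::memory_order_relaxed); }
```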

Another question I forgot about initially is the "process-wide barrier" that is used in a few places (in suspension, at least). WASM doesn't have an equivalent.

@jkotas
Member

jkotas commented Nov 24, 2024

Process-wide barrier allows Preemptive<->Cooperative transitions without any memory barriers. If the process-wide barrier is not available, Preemptive<->Cooperative transition will have to have memory barriers.

The current thread suspension algorithm works like this:

  1. Set RhpSuspendThreads
  2. Ensure that RhpSuspendThreads is visible in all threads by issuing a process-wide barrier. It blocks Preemptive->Cooperative transitions.
  3. Wait for all threads to reach Preemptive mode (+ help threads to reach it faster by hijacking, etc.). Since the Preemptive->Cooperative transitions were blocked by (2), we know that once we observe that a thread is in preemptive mode it is going to stay there and won't sneak back to cooperative.
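A rough C++ sketch of that three-step sequence, with all names hypothetical and `ProcessWideBarrier()` standing in for the OS facility (FlushProcessWriteBuffers on Windows, sys_membarrier on Linux) that lets the running threads stay barrier-free; here it is approximated with an ordinary fence, which is exactly the substitution that is *not* sufficient in the real algorithm:

```cpp
#include <atomic>
#include <vector>

// Hypothetical flag corresponding to RhpSuspendThreads.
std::atomic<bool> g_suspendRequested{false};

// Hypothetical per-thread state: true while the thread is in preemptive mode.
struct ThreadState { std::atomic<bool> inPreemptiveMode{false}; };

// Stand-in for the process-wide barrier; a local fence is NOT an actual
// substitute, since it does nothing to the other threads' pending loads.
void ProcessWideBarrier() { std::atomic_thread_fence(std::memory_order_seq_cst); }

void SuspendAllThreads(std::vector<ThreadState*>& threads) {
    g_suspendRequested.store(true, std::memory_order_relaxed);   // step 1
    ProcessWideBarrier();                                        // step 2
    for (ThreadState* t : threads)                               // step 3
        while (!t->inPreemptiveMode.load(std::memory_order_acquire)) {
            /* hijacking / poll assistance would go here */
        }
}
```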

In the absence of a process-wide barrier, I do not see a way to make this algorithm thread-safe by adding memory barriers around reads of RhpSuspendThreads or writes to m_pTransitionFrame. We may need a different thread suspension algorithm. For example:

  • Preemptive<->Cooperative transitions would be done using Interlocked.CompareExchange. If they see a special "suspend now" sentinel value in m_pTransitionFrame, they will take the slow path that parks the thread in preemptive mode.
  • The thread suspension algorithm would work like this:
  1. Set m_pTransitionFrame to the special "suspend now" sentinel value on all threads that are running in cooperative mode using Interlocked.CompareExchange
  2. Wait for all threads to reach Preemptive mode
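The CAS-based alternative proposed above can be sketched as follows. All names are illustrative: `m_pTransitionFrame` is null while the thread runs in cooperative mode, holds a frame pointer while in preemptive mode, and holds a sentinel once suspension is requested, so both sides race on the same word with compare-exchange:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sentinel the suspender installs into m_pTransitionFrame.
constexpr uintptr_t kSuspendSentinel = 1;

struct ManagedThread {
    // 0 = cooperative mode; a frame pointer = preemptive mode;
    // kSuspendSentinel = suspension requested.
    std::atomic<uintptr_t> m_pTransitionFrame{0};

    // Cooperative -> Preemptive: publish the transition frame, unless the
    // suspender already installed the sentinel, in which case the caller
    // must take the slow path that parks the thread in preemptive mode.
    bool TryEnterPreemptive(uintptr_t frame) {
        uintptr_t expected = 0;
        return m_pTransitionFrame.compare_exchange_strong(expected, frame);
    }

    // Suspender side: atomically convert a cooperative thread (frame == 0)
    // to the "suspend now" state; fails if the thread is already preemptive.
    bool RequestSuspend() {
        uintptr_t expected = 0;
        return m_pTransitionFrame.compare_exchange_strong(expected, kSuspendSentinel);
    }
};
```

The cost of this scheme relative to the current one is an interlocked operation on every Preemptive<->Cooperative transition, which is what the process-wide barrier exists to avoid.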

@jkotas jkotas added the area-NativeAOT-LLVM LLVM generation for Native AOT compilation (including Web Assembly) label Nov 24, 2024
@jakobbotsch
Member

Note that the full interruptibility check was generalized into a simple cycle check in dotnet/runtime#95299. It should not be hard to adapt it to break all cycles it can find by placing GC polls into any block in those cycles. The code deleted in dotnet/runtime#42664 looks unnecessarily complicated to me.
