
[NativeAOT-LLVM] Threads and GC Poll #2806

Open
yowl opened this issue Nov 22, 2024 · 13 comments
Labels
area-NativeAOT-LLVM LLVM generation for Native AOT compilation (including Web Assembly)

Comments

@yowl
Contributor

yowl commented Nov 22, 2024

I'd like to move threads forward a little bit. We need to insert calls to RhpGcPoll (currently a NOP) into the generated code. I think we are going to do this when making managed calls; are those enough places? Are we concerned with a method that has a long-running loop that makes no calls?

Secondly, should we be doing this in llvmlower.cpp or in fgInsertGCPolls or... ?

Thanks.

@SingleAccretion

SingleAccretion commented Nov 23, 2024

I think we are going to do this when making managed calls; are those enough places? Are we concerned with a method that has a long-running loop that makes no calls?

The general idea is that a poll needs to be present on all control paths that are "reasonably long". In the upstream suspension scheme, that's:

  1. Method returns in partially interruptible methods, via return address hijacking.
    • Note I am talking here about leaf, in the dynamic sense, frames.
    • The downstream scheme for partially interruptible methods could similarly add one explicit poll into prologue/epilogue, or even in the middle of the method (provided it dominates all exits). It doesn't matter a lot where - either way we will get one poll per method. However, it would be most advantageous to place polls in locations without live GC references, as the poll would force LSSA to spill them.
  2. Fully interruptible methods can be suspended almost anywhere (the exceptions are small block copies, PIs with SPGCT, prolog/epilog).
    • The downstream scheme would need to add polls on 'loop backedges', so that a spinning thread would not be able to starve the GC. Same considerations about GC-live variables apply.

The most difficult part here is the fully interruptible case - the general algorithm would need to guarantee timely suspension for all kinds of loops, including irreducible control flow. It is a problem that has been solved elsewhere though, I just don't have the best algorithm for it on-hand.
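To make the "polls on loop backedges" idea concrete, here is a minimal C++ sketch of what a loop looks like after poll insertion. All names are illustrative (the real runtime flag and poll helper are internal symbols); this only shows the shape of the transformation: a cheap flag check on the hot path before each backedge, with the actual parking logic on a slow path.

```cpp
#include <atomic>

// Hypothetical stand-in for the runtime's suspension-requested flag.
static std::atomic<int> g_suspendRequested{0};

// Stand-in for RhpGcPoll's slow path, which would park the thread
// in preemptive mode until the GC finishes.
static void RhpGcPollStub() { /* park the thread for GC */ }

// A loop with a poll inserted on its backedge: even if the body makes
// no managed calls, the GC can still suspend this thread in a timely way.
long SumWithPolls(const long* data, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];
        // Inserted poll: a relaxed load keeps the fast path cheap.
        if (g_suspendRequested.load(std::memory_order_relaxed))
            RhpGcPollStub(); // slow path, taken only during suspension
    }
    return sum;
}
```

Placing the check where no GC references are live (as noted above) keeps the poll from forcing extra spills.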

What is 'partially' and 'fully' interruptible code?

In the general case, code can contain loops, which can starve the GC. So we need to be able to suspend the code in the middle of such loops. This is where "full interruptibility" comes in: GC info is recorded for each instruction, and the suspending thread, if it sees that the current IP is in such a region, leaves the suspended thread stopped as-is.

GC info for fully interruptible methods is (relatively) large, so it is only done in cases where this granular control is needed, i. e. with loops that lack any GC safe points (managed calls). See https://github.com/dotnet/runtime/blob/f1332ab0d82ee0e21ca387cbd1c8a87c5dfa4906/src/coreclr/jit/flowgraph.cpp#L3980C40-L3984.

For other methods (the majority of them), only GC info at callsites is recorded, the assumption being that all callees, save a few helpers, are themselves interruptible and thus serve as implicit polls. This assumption is then realized via return address hijacking, where the return address on the stack of the executing method is overwritten by the suspending thread to point to a routine that will wait for GC to complete, thus "catching" the thread once it executes ret.

Secondly, should we be doing this in llvmlower.cpp or in fgInsertGCPolls or... ?

It should be done in lowering. fgInsertGCPolls as it currently exists is not really set up to do this kind of "full" poll insertion, and it also runs too early.

I am not sure what would be the best place to do it in lowering. Given the interplay with LSSA and the need to predict how poll insertion will affect how much will need to be spilled (for which liveness information can be required), it may be best to do just before LSSA, or even as part of it.

@SingleAccretion

There is another interesting question here that I haven't come across an answer to so far: does WASM even allow explicit polls? That is, if we have a loop:

```
while (!RhpSuspendThreads) { ... }
```

Is the WASM compiler allowed to rewrite it to:

```
bool alwaysTrueLocal = !RhpSuspendThreads;
while (alwaysTrueLocal) { ... } // Now unsuspendable!
```

E.g., in the IL memory model that is allowed: you need to use a stronger volatile (acquire) load to circumvent it, while all that is actually needed is a relaxed load, like volatile in C++.
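The hoisting concern can be illustrated in C++ terms. With a plain non-atomic flag, the compiler is allowed to hoist the load out of the loop (exactly the rewrite shown above); a relaxed atomic load is the weakest ordering that still forces a fresh load on every iteration. A hypothetical sketch, with `g_trap` standing in for the suspension flag:

```cpp
#include <atomic>

// Hypothetical suspension flag, set by the suspending thread.
std::atomic<bool> g_trap{false};

// Because the load is (relaxed) atomic, the compiler may not hoist it out
// of the loop, so the spinning thread is guaranteed to observe the flag.
// With a plain `bool` the loop could legally be compiled as infinite.
int SpinUntilTrapped(int workPerIteration) {
    int iterations = 0;
    while (!g_trap.load(std::memory_order_relaxed)) {
        for (volatile int i = 0; i < workPerIteration; ++i) { } // simulated work
        ++iterations;
    }
    return iterations;
}
```

The open question in this thread is whether a WASM engine grants ordinary (non-atomic) WASM loads even this much, or whether it may perform the same hoisting the IL memory model permits.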

@yowl
Contributor Author

yowl commented Nov 23, 2024

does WASM even allow explicit polls

Sounds like we need to establish this first.

@jkotas
Member

jkotas commented Nov 24, 2024

Some Silverlight configurations used explicit GC polls. The code to insert the explicit GC polls was deleted in dotnet/runtime#42664 . It inserted explicit GC polls on back edges and returns.

Is the WASM compiler allowed to rewrite it to:

It is not allowed. An infinite or long-running loop on one thread should not prevent other threads from making progress.

@SingleAccretion

SingleAccretion commented Nov 24, 2024

It is not allowed. An infinite or long-running loop on one thread should not prevent other threads from making progress.

Hmm, I am not sure what you mean. I meant: can the WASM -> native code compiler perform this transformation (like the IL -> native code one can)?

I.e., the question is what the semantics of ordinary loads in WASM are (as opposed to atomic loads, which are always sequentially consistent).

@yowl
Contributor Author

yowl commented Nov 24, 2024

Wouldn't we use the wasm atomics here?

@SingleAccretion

Wouldn't we use the wasm atomics here?

I don't see how that would work performance-wise - inserting Interlocked.CompareExchange into the epilogues of all methods.

@yowl
Contributor Author

yowl commented Nov 24, 2024

Why do we need to use Interlocked.CompareExchange to get a wasm atomic load emitted ?

@SingleAccretion

Why do we need to use Interlocked.CompareExchange to get a wasm atomic load emitted ?

It is not a question of IL -> WASM emission, but of WASM -> native code emission. A WASM atomic load is sequentially consistent, like CompareExchange is, and so will have similar performance characteristics.

@jkotas
Member

jkotas commented Nov 24, 2024

A sequentially consistent load does not require a full CompareExchange (in the C/C++ definition of sequential consistency, at least): https://godbolt.org/z/YqGGY4zK3 . Is wasm different for some reason?

@SingleAccretion

SingleAccretion commented Nov 24, 2024

Is wasm different for some reason?

No, my bad. But it will still require a stronger barrier: https://godbolt.org/z/GaET7znrP.
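The distinction the two comments above are drawing can be written down directly in C++. A seq_cst load is not a CompareExchange, but it is still stronger than the relaxed load the poll actually needs; on weakly ordered targets (and in WASM engines compiling to them) the seq_cst form carries an acquiring barrier that the relaxed form does not. Both functions below are trivially correct; the point is only the ordering each requests:

```cpp
#include <atomic>

// Hypothetical suspension flag.
std::atomic<int> g_flag{0};

// Sequentially consistent load: what a WASM atomic load maps to. On x86-64
// this is a plain MOV (the fence cost is paid on the store side), but on
// weaker architectures it becomes an acquiring load (e.g. ARM64 ldar).
int LoadSeqCst()  { return g_flag.load(std::memory_order_seq_cst); }

// Relaxed load: all the poll actually requires, and the form WASM has no
// way to express short of its (seq_cst) atomics.
int LoadRelaxed() { return g_flag.load(std::memory_order_relaxed); }
```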

Another question I forgot about initially is the "process-wide barrier" that is used in a few places (in suspension, at least). WASM doesn't have an equivalent.

@jkotas
Member

jkotas commented Nov 24, 2024

Process-wide barrier allows Preemptive<->Cooperative transitions without any memory barriers. If the process-wide barrier is not available, Preemptive<->Cooperative transition will have to have memory barriers.

The current thread suspension algorithm works like this:

  1. Set RhpSuspendThreads
  2. Ensure that RhpSuspendThreads is visible in all threads by issuing a process-wide barrier. It blocks Preemptive->Cooperative transitions.
  3. Wait for all threads to reach Preemptive mode (+ help threads to reach it faster by hijacking, etc.). Since the Preemptive->Cooperative transitions were blocked by (2), we know that once we observe that a thread is in preemptive mode it is going to stay there and won't sneak back to cooperative.
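A rough C++ sketch of that three-step sequence, with all names hypothetical and `ProcessWideBarrier()` standing in for the OS facility (FlushProcessWriteBuffers on Windows, sys_membarrier on Linux) that lets the running threads stay barrier-free; here it is approximated with an ordinary fence, which is exactly the substitution that is *not* sufficient in the real algorithm:

```cpp
#include <atomic>
#include <vector>

// Hypothetical flag corresponding to RhpSuspendThreads.
std::atomic<bool> g_suspendRequested{false};

// Hypothetical per-thread state: true while the thread is in preemptive mode.
struct ThreadState { std::atomic<bool> inPreemptiveMode{false}; };

// Stand-in for the process-wide barrier; a local fence is NOT an actual
// substitute, since it does nothing to the other threads' pending loads.
void ProcessWideBarrier() { std::atomic_thread_fence(std::memory_order_seq_cst); }

void SuspendAllThreads(std::vector<ThreadState*>& threads) {
    g_suspendRequested.store(true, std::memory_order_relaxed);   // step 1
    ProcessWideBarrier();                                        // step 2
    for (ThreadState* t : threads)                               // step 3
        while (!t->inPreemptiveMode.load(std::memory_order_acquire)) {
            /* hijacking / poll assistance would go here */
        }
}
```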

In the absence of a process-wide barrier, I do not see a way to make this algorithm thread-safe by adding memory barriers around reads of RhpSuspendThreads or writes to m_pTransitionFrame. We may need a different thread suspension algorithm. For example:

  • Preemptive<->Cooperative transitions would be done using Interlocked.CompareExchange. If they see a special "suspend now" sentinel value in m_pTransitionFrame, they will take the slow path that parks the thread in preemptive mode.
  • The thread suspension algorithm would work like this:
  1. Set m_pTransitionFrame to the special "suspend now" sentinel value on all threads that are running in cooperative mode using Interlocked.CompareExchange
  2. Wait for all threads to reach Preemptive mode
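The CAS-based alternative proposed above can be sketched as follows. All names are illustrative: `m_pTransitionFrame` is null while the thread runs in cooperative mode, holds a frame pointer while in preemptive mode, and holds a sentinel once suspension is requested, so both sides race on the same word with compare-exchange:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sentinel the suspender installs into m_pTransitionFrame.
constexpr uintptr_t kSuspendSentinel = 1;

struct ManagedThread {
    // 0 = cooperative mode; a frame pointer = preemptive mode;
    // kSuspendSentinel = suspension requested.
    std::atomic<uintptr_t> m_pTransitionFrame{0};

    // Cooperative -> Preemptive: publish the transition frame, unless the
    // suspender already installed the sentinel, in which case the caller
    // must take the slow path that parks the thread in preemptive mode.
    bool TryEnterPreemptive(uintptr_t frame) {
        uintptr_t expected = 0;
        return m_pTransitionFrame.compare_exchange_strong(expected, frame);
    }

    // Suspender side: atomically convert a cooperative thread (frame == 0)
    // to the "suspend now" state; fails if the thread is already preemptive.
    bool RequestSuspend() {
        uintptr_t expected = 0;
        return m_pTransitionFrame.compare_exchange_strong(expected, kSuspendSentinel);
    }
};
```

The cost of this scheme relative to the current one is an interlocked operation on every Preemptive<->Cooperative transition, which is what the process-wide barrier exists to avoid.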

@jkotas jkotas added the area-NativeAOT-LLVM LLVM generation for Native AOT compilation (including Web Assembly) label Nov 24, 2024
@jakobbotsch
Member

Note that the full interruptibility check was generalized into a simple cycle check in dotnet/runtime#95299. It should not be hard to adapt it to break all cycles it can find by placing GC polls into any block in those cycles. The code deleted in dotnet/runtime#42664 looks unnecessarily complicated to me.
