-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Distributed.jl Worker
struct thread-safe.
#38405
Conversation
3b00732
to
95e6be5
Compare
I think #38487 may make this feasible now :) |
I hope so as well. |
95e6be5
to
810479f
Compare
@vchuravy bump 🙂 Could you please rebase this to give CI another go? |
810479f
to
20fd68a
Compare
20fd68a
to
300ed29
Compare
@JeffBezanson if you have the time I would appreciate a review. |
Seen failure:
|
e287246
to
20a514f
Compare
I hope you don't mind Valentin: I took the liberty of rebasing this branch over master. (On 1.7 we consistently hit issues that this branch should address, so we would love to see it over the line. Let me know if there is anything I can do to lend a hand :).) |
Interestingly this branch does not fully resolve the issues we hit, but it does cause our test suite to fail much earlier/harder and with a better stacktrace 🎉 (as opposed to, without this branch, throwing a more opaque trace in the background and then stalling till timeout). |
Happy to hand this off :) let me know if you want to chat about what needs doing (Happy to have a co-working session on this). I suspect the least will be making the ClusterSerializer threadsafe as well. |
stdlib/Distributed/src/remotecall.jl
Outdated
notify(any_gc_flag) | ||
msg = (remoteref_id(rr), myid()) | ||
# We cannot acquire locks from finalizers | ||
@async begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MacOS tester hit:
That's an assertion in |
5a64315
to
583004b
Compare
Makes the worker struct threadsafe as well as flushing the GC messages
583004b
to
0c073cc
Compare
@JeffBezanson I removed unnecessary changes like actually using |
I am gonna land this, but we should revert if we see any increased flakiness in CI. |
@@ -659,7 +690,12 @@ function create_worker(manager, wconfig) | |||
end | |||
|
|||
for wl in wlist | |||
(wl.state === W_CREATED) && wait(wl.c_state) | |||
if wl.state === W_CREATED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah merde...
(cherry picked from commit 740a33a)
…stributed_ts" (JuliaLang/julia#41722) Also reverts "fixup to pull request JuliaLang/julia#38405 (JuliaLang/julia#41641)" Seems to be causing hanging in CI testing. This reverts commit 740a33a and this reverts commit 5a1680533b461471f537f5f5d4ee3c2866b21e1f, reversing changes made to 02807b279a5e6d5acaeb7095e4c0527e2a5c190e. (cherry picked from commit bbc9429)
@jpsamaroo and I want to use Distributed in the presence of threading for Dagger. So this is the follow-up to #38134
@JeffBezanson you initially suggested that the
Condition
is replaced by anEvent
, but the code seems to want to support multiple state transitions.wait_on_conn
re-usesc_state
for it's time-out and it not clear to me that timing out there implies timing out other waiters.Lastly we looked at the gc messages for remote references and made a first attempt at making them safe under multi-threading. Since we can't take locks in a finalizer this implied spawning a task.
cc: @jonas-schulze