Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Distributed.jl Worker struct thread-safe. #38405

Merged
merged 1 commit into from
Jul 19, 2021
Merged

Conversation

vchuravy
Copy link
Member

@jpsamaroo and I want to use Distributed in the presence of threading for Dagger. So this is the follow-up to #38134
@JeffBezanson you initially suggested that the Condition is replaced by an Event, but the code seems to want to support multiple state transitions. wait_on_conn re-uses c_state for it's time-out and it not clear to me that timing out there implies timing out other waiters.

Lastly we looked at the gc messages for remote references and made a first attempt at making them safe under multi-threading. Since we can't take locks in a finalizer this implied spawning a task.

cc: @jonas-schulze

stdlib/Distributed/src/cluster.jl Show resolved Hide resolved
stdlib/Distributed/src/remotecall.jl Outdated Show resolved Hide resolved
stdlib/Distributed/src/remotecall.jl Outdated Show resolved Hide resolved
@vtjnash
Copy link
Member

vtjnash commented Nov 20, 2020

I think #38487 may make this feasible now :)

@vchuravy
Copy link
Member Author

I hope so as well.

@jonas-schulze jonas-schulze added the backport 1.6 Change should be backported to release-1.6 label Dec 9, 2020
@jonas-schulze
Copy link
Contributor

@vchuravy bump 🙂

Could you please rebase this to give CI another go?

@KristofferC KristofferC mentioned this pull request Dec 14, 2020
53 tasks
@ViralBShah ViralBShah added the parallelism Parallel or distributed computation label Dec 15, 2020
@KristofferC KristofferC mentioned this pull request Dec 19, 2020
26 tasks
@KristofferC KristofferC mentioned this pull request Jan 19, 2021
60 tasks
@KristofferC KristofferC mentioned this pull request Feb 1, 2021
10 tasks
@KristofferC KristofferC mentioned this pull request Feb 11, 2021
52 tasks
@KristofferC KristofferC mentioned this pull request Mar 14, 2021
14 tasks
@KristofferC KristofferC mentioned this pull request Mar 23, 2021
10 tasks
@vchuravy vchuravy removed the backport 1.6 Change should be backported to release-1.6 label Mar 23, 2021
@vchuravy vchuravy force-pushed the vc/distributed_ts branch from 20fd68a to 300ed29 Compare March 24, 2021 16:33
stdlib/Distributed/src/cluster.jl Show resolved Hide resolved
stdlib/Distributed/src/cluster.jl Outdated Show resolved Hide resolved
stdlib/Distributed/src/cluster.jl Show resolved Hide resolved
stdlib/Distributed/src/remotecall.jl Outdated Show resolved Hide resolved
stdlib/Distributed/test/threads.jl Outdated Show resolved Hide resolved
@vchuravy
Copy link
Member Author

@JeffBezanson if you have the time I would appreciate a review.

@vchuravy
Copy link
Member Author

vchuravy commented Apr 6, 2021

Seen failure:

      From worker 10:   UNHANDLED TASK ERROR: On worker 1:
      From worker 10:   On worker 14:
      From worker 10:   peer 10 has not connected to 14
      From worker 10:   Stacktrace:
      From worker 10:    [1] error
      From worker 10:      @ ./error.jl:33
      From worker 10:    [2] wait_for_conn
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:194
      From worker 10:    [3] #25
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:182
      From worker 10:    [4] exec_conn_func
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:183
      From worker 10:    [5] exec_conn_func
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:177
      From worker 10:    [6] #118
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 10:    [7] run_work_thunk
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 10:    [8] macro expansion
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 10:    [9] #117
      From worker 10:      @ ./task.jl:406
      From worker 10:   Stacktrace:
      From worker 10:    [1] #remotecall_fetch#167
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:416 [inlined]
      From worker 10:    [2] remotecall_fetch
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:408
      From worker 10:    [3] #remotecall_fetch#170
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443
      From worker 10:    [4] remotecall_fetch
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443 [inlined]
      From worker 10:    [5] #21
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:169
      From worker 10:    [6] #118
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 10:    [7] run_work_thunk
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 10:    [8] macro expansion
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 10:    [9] #117
      From worker 10:      @ ./task.jl:406
      From worker 10:   Stacktrace:
      From worker 10:    [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 10:      @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:416
      From worker 10:    [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64})
      From worker 10:      @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:408
      From worker 10:    [3] remotecall_fetch(::Function, ::Int64, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 10:      @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443
      From worker 10:    [4] remotecall_fetch
      From worker 10:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443 [inlined]
      From worker 10:    [5] (::Distributed.var"#20#23"{Distributed.Worker})()
      From worker 10:      @ Distributed ./threadingconstructs.jl:170
      From worker 13:   UNHANDLED TASK ERROR: On worker 1:
      From worker 13:   On worker 14:
      From worker 13:   peer 13 has not connected to 14
      From worker 13:   Stacktrace:
      From worker 13:    [1] error
      From worker 13:      @ ./error.jl:33
      From worker 13:    [2] wait_for_conn
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:194
      From worker 13:    [3] #25
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:182
      From worker 13:    [4] exec_conn_func
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:183
      From worker 13:    [5] exec_conn_func
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:177
      From worker 13:    [6] #118
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 13:    [7] run_work_thunk
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 13:    [8] macro expansion
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 13:    [9] #117
      From worker 13:      @ ./task.jl:406
      From worker 13:   Stacktrace:
      From worker 13:    [1] #remotecall_fetch#167
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:416 [inlined]
      From worker 13:    [2] remotecall_fetch
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:408
      From worker 13:    [3] #remotecall_fetch#170
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443
      From worker 13:    [4] remotecall_fetch
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443 [inlined]
      From worker 13:    [5] #21
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:169
      From worker 13:    [6] #118
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 13:    [7] run_work_thunk
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 13:    [8] macro expansion
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 13:    [9] #117
      From worker 13:      @ ./task.jl:406
      From worker 13:   Stacktrace:
      From worker 13:    [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 13:      @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:416
      From worker 13:    [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64})
      From worker 13:      @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:408
      From worker 13:    [3] remotecall_fetch(::Function, ::Int64, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 13:      @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443
      From worker 13:    [4] remotecall_fetch
      From worker 13:      @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443 [inlined]
      From worker 13:    [5] (::Distributed.var"#20#23"{Distributed.Worker})()
      From worker 13:      @ Distributed ./threadingconstructs.jl:170
      From worker 7:    UNHANDLED TASK ERROR: On worker 1:
      From worker 7:    On worker 14:
      From worker 7:    peer 7 has not connected to 14
      From worker 7:    Stacktrace:
      From worker 7:     [1] error
      From worker 7:       @ ./error.jl:33
      From worker 7:     [2] wait_for_conn
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:194
      From worker 7:     [3] #25
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:182
      From worker 7:     [4] exec_conn_func
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:183
      From worker 7:     [5] exec_conn_func
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:177
      From worker 7:     [6] #118
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 7:     [7] run_work_thunk
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 7:     [8] macro expansion
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 7:     [9] #117
      From worker 7:       @ ./task.jl:406
      From worker 7:    Stacktrace:
      From worker 7:     [1] #remotecall_fetch#167
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:416 [inlined]
      From worker 7:     [2] remotecall_fetch
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:408
      From worker 7:     [3] #remotecall_fetch#170
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443
      From worker 7:     [4] remotecall_fetch
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443 [inlined]
      From worker 7:     [5] #21
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:169
      From worker 7:     [6] #118
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 7:     [7] run_work_thunk
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 7:     [8] macro expansion
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 7:     [9] #117
      From worker 7:       @ ./task.jl:406
      From worker 7:    Stacktrace:
      From worker 7:     [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 7:       @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:416
      From worker 7:     [2] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64})
      From worker 7:       @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:408
      From worker 7:     [3] remotecall_fetch(::Function, ::Int64, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 7:       @ Distributed ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443
      From worker 7:     [4] remotecall_fetch
      From worker 7:       @ ~/builds/julia/usr/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:443 [inlined]
      From worker 7:     [5] (::Distributed.var"#20#23"{Distributed.Worker})()
      From worker 7:       @ Distributed ./threadingconstructs.jl:170
      From worker 2:    UNHANDLED TASK ERROR: On worker 1:

@Sacha0 Sacha0 force-pushed the vc/distributed_ts branch from e287246 to 20a514f Compare June 22, 2021 14:04
@Sacha0
Copy link
Member

Sacha0 commented Jun 22, 2021

I hope you don't mind Valentin: I took the liberty of rebasing this branch over master. (On 1.7 we consistently hit issues that this branch should address, so we would love to see it over the line. Let me know if there is anything I can do to lend a hand :).)

@Sacha0
Copy link
Member

Sacha0 commented Jun 22, 2021

Interestingly this branch does not fully resolve the issues we hit, but it does cause our test suite to fail much earlier/harder and with a better stacktrace 🎉 (as opposed to, without this branch, throwing a more opaque trace in the background and then stalling till timeout).

@vchuravy
Copy link
Member Author

Happy to hand this off :) let me know if you want to chat about what needs doing (Happy to have a co-working session on this). I suspect the least will be making the ClusterSerializer threadsafe as well.

notify(any_gc_flag)
msg = (remoteref_id(rr), myid())
# We cannot acquire locks from finalizers
@async begin
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sacha0 I was hoping we would get #39529 for this.

@vchuravy
Copy link
Member Author

The MacOS tester hit:

      From worker 3:	AssertionError("wpid > 0")CapturedException(

That's an assertion in process_messages.jl so likely related.

@Sacha0 Sacha0 added the bugfix This change fixes an existing bug label Jun 26, 2021
@JeffBezanson JeffBezanson added the multithreading Base.Threads and related functionality label Jul 15, 2021
Makes the worker struct threadsafe as well as flushing the GC messages
@vchuravy vchuravy force-pushed the vc/distributed_ts branch from 583004b to 0c073cc Compare July 16, 2021 14:02
@vchuravy
Copy link
Member Author

@JeffBezanson I removed unnecessary changes like actually using @spawn everywhere and locked most lookups for w.state

@vchuravy
Copy link
Member Author

I am gonna land this, but we should revert if we see any increased flakiness in CI.

@vchuravy vchuravy merged commit 5a16805 into master Jul 19, 2021
@vchuravy vchuravy deleted the vc/distributed_ts branch July 19, 2021 12:04
@@ -659,7 +690,12 @@ function create_worker(manager, wconfig)
end

for wl in wlist
(wl.state === W_CREATED) && wait(wl.c_state)
if wl.state === W_CREATED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah merde...

JeffBezanson pushed a commit that referenced this pull request Jul 19, 2021
KristofferC pushed a commit that referenced this pull request Jul 26, 2021
(cherry picked from commit 5af1cf0)
vtjnash added a commit that referenced this pull request Jul 28, 2021
Revert "fixup to pull request #38405 (#41641)"

Seems to be causing hanging in CI testing?

This reverts commit 5af1cf0.
This reverts commit 5a16805, reversing
changes made to 02807b2.
vtjnash added a commit that referenced this pull request Jul 29, 2021
…41722)

Also reverts "fixup to pull request #38405 (#41641)"

Seems to be causing hanging in CI testing.

This reverts commit 5af1cf0 and this
reverts commit 5a16805, reversing
changes made to 02807b2.
KristofferC pushed a commit that referenced this pull request Aug 2, 2021
…41722)

Also reverts "fixup to pull request #38405 (#41641)"

Seems to be causing hanging in CI testing.

This reverts commit 5af1cf0 and this
reverts commit 5a16805, reversing
changes made to 02807b2.

(cherry picked from commit 66f9b55)
vchuravy added a commit to JuliaLang/Distributed.jl that referenced this pull request Oct 6, 2023
vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this pull request Oct 6, 2023
…stributed_ts" (JuliaLang/julia#41722)

Also reverts "fixup to pull request JuliaLang/julia#38405 (JuliaLang/julia#41641)"

Seems to be causing hanging in CI testing.

This reverts commit 740a33a and this
reverts commit 5a1680533b461471f537f5f5d4ee3c2866b21e1f, reversing
changes made to 02807b279a5e6d5acaeb7095e4c0527e2a5c190e.

(cherry picked from commit bbc9429)
Keno pushed a commit that referenced this pull request Jun 5, 2024
Keno pushed a commit that referenced this pull request Jun 5, 2024
…41722)

Also reverts "fixup to pull request #38405 (#41641)"

Seems to be causing hanging in CI testing.

This reverts commit 740a33a and this
reverts commit 5a16805, reversing
changes made to 02807b2.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug multithreading Base.Threads and related functionality parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants