-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
do not use myid() to differentiate master & worker #32879
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
amitmurthy
reviewed
Aug 14, 2019
Occasionally while adding a large number of workers and particularly worker-worker connections are not lazy, it is possible to encounter the following error: ``` ERROR (unhandled task failure): MethodError: no method matching manage(::Base.Distributed.DefaultClusterManager, ::Int64, ::WorkerConfig, ::Symbol) Closest candidates are: manage(!Matched::Base.Distributed.SSHManager, ::Integer, ::WorkerConfig, ::Symbol) at distributed/managers.jl:224 manage(!Matched::Base.Distributed.LocalManager, ::Integer, ::WorkerConfig, ::Symbol) at distributed/managers.jl:337 manage(!Matched::Union{ClusterManagers.PBSManager, ClusterManagers.QRSHManager, ClusterManagers.SGEManager}, ::Int64, ::WorkerConfig, ::Symbol) at /home/jrun/.julia/v0.6/ClusterManagers/src/qsub.jl:115 ... Stacktrace: [1] deregister_worker(::Base.Distributed.ProcessGroup, ::Int64) at ./distributed/cluster.jl:903 [2] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:220 [3] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118 [4] (::Base.Distributed.#JuliaLang#101#102{TCPSocket,TCPSocket,Bool})() at ./event.jl:73 ``` It can be simulated with this exact sequence of events: - worker2 in process of connecting to master - master has received the worker2s listen port, connected to it, sent the JoinPGRP message to it - master is now aware of worker2, and has added it to its list of workers - worker2 has still not processed the JoinPGRP message, so it is still unaware of its worker id - worker3 now connects to master - master sends the JoinPGRP message along with list of existing workers that includes worker2 - worker3 connects to worker2 - worker2 receives a new connection from worker3 and attempts to process it - worker3 faces an error and exits, thus breaking the connection - worker2 gets an error processing message from worker3 - goes into error handling - the current error handling code sees the self pid as 1 and incorrectly thinks it is the master - attempts to process the worker disconnection as a master and gets the error we see The MethodError prevents proper cleanup at the worker where it happens. The issue seems to be that it is not correct to identify whether a Julia process is master or worker by looking at the process id. Instead we should have a dedicated indicator for that. This change adds a new local process role variable that is set to `:master` by default, but is set to `:worker` when `start_worker` is invoked. This allows a process to know that it is running as a worker irrespective of whether it has received a process id or not.
tanmaykm
force-pushed
the
tan/lproctype
branch
from
August 14, 2019 12:43
398e571
to
0f5290a
Compare
CI error on win32 seems unrelated. |
CI passes now. |
Keno
pushed a commit
that referenced
this pull request
Jun 5, 2024
Occasionally while adding a large number of workers and particularly worker-worker connections are not lazy, it is possible to encounter the following error: ``` ERROR (unhandled task failure): MethodError: no method matching manage(::Base.Distributed.DefaultClusterManager, ::Int64, ::WorkerConfig, ::Symbol) Closest candidates are: manage(!Matched::Base.Distributed.SSHManager, ::Integer, ::WorkerConfig, ::Symbol) at distributed/managers.jl:224 manage(!Matched::Base.Distributed.LocalManager, ::Integer, ::WorkerConfig, ::Symbol) at distributed/managers.jl:337 manage(!Matched::Union{ClusterManagers.PBSManager, ClusterManagers.QRSHManager, ClusterManagers.SGEManager}, ::Int64, ::WorkerConfig, ::Symbol) at /home/jrun/.julia/v0.6/ClusterManagers/src/qsub.jl:115 ... Stacktrace: [1] deregister_worker(::Base.Distributed.ProcessGroup, ::Int64) at ./distributed/cluster.jl:903 [2] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:220 [3] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118 [4] (::Base.Distributed.##101#102{TCPSocket,TCPSocket,Bool})() at ./event.jl:73 ``` It can be simulated with this exact sequence of events: - worker2 in process of connecting to master - master has received the worker2s listen port, connected to it, sent the JoinPGRP message to it - master is now aware of worker2, and has added it to its list of workers - worker2 has still not processed the JoinPGRP message, so it is still unaware of its worker id - worker3 now connects to master - master sends the JoinPGRP message along with list of existing workers that includes worker2 - worker3 connects to worker2 - worker2 receives a new connection from worker3 and attempts to process it - worker3 faces an error and exits, thus breaking the connection - worker2 gets an error processing message from worker3 - goes into error handling - the current error handling code sees the self pid as 1 and incorrectly thinks it is the master - attempts to process the worker disconnection as a master and gets the error we see The MethodError prevents proper cleanup at the worker where it happens. The issue seems to be that it is not correct to identify whether a Julia process is master or worker by looking at the process id. Instead we should have a dedicated indicator for that. This change adds a new local process role variable that is set to `:master` by default, but is set to `:worker` when `start_worker` is invoked. This allows a process to know that it is running as a worker irrespective of whether it has received a process id or not.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Occasionally while adding a large number of workers and particularly worker-worker connections are not lazy, it is possible to encounter the following error:
It can be simulated with this exact sequence of events:
The MethodError prevents proper cleanup at the worker where it happens.
The issue seems to be that it is not correct to identify whether a Julia process is master or worker by looking at the process id. Instead we should have a dedicated indicator for that.
This change adds a new local process role variable that is set to
:master
by default, but is set to:worker
whenstart_worker
is invoked. This allows a process to know that it is running as a worker irrespective of whether it has received a process id or not.