do not use myid() to differentiate master & worker #32879

Merged
merged 1 commit into JuliaLang:master on Aug 15, 2019

Conversation

tanmaykm
Member

Occasionally, while adding a large number of workers, particularly when worker-worker connections are not lazy, it is possible to encounter the following error:

```
ERROR (unhandled task failure): MethodError: no method matching manage(::Base.Distributed.DefaultClusterManager, ::Int64, ::WorkerConfig, ::Symbol)
Closest candidates are:
  manage(!Matched::Base.Distributed.SSHManager, ::Integer, ::WorkerConfig, ::Symbol) at distributed/managers.jl:224
  manage(!Matched::Base.Distributed.LocalManager, ::Integer, ::WorkerConfig, ::Symbol) at distributed/managers.jl:337
  manage(!Matched::Union{ClusterManagers.PBSManager, ClusterManagers.QRSHManager, ClusterManagers.SGEManager}, ::Int64, ::WorkerConfig, ::Symbol) at /home/jrun/.julia/v0.6/ClusterManagers/src/qsub.jl:115
  ...
Stacktrace:
 [1] deregister_worker(::Base.Distributed.ProcessGroup, ::Int64) at ./distributed/cluster.jl:903
 [2] message_handler_loop(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:220
 [3] process_tcp_streams(::TCPSocket, ::TCPSocket, ::Bool) at ./distributed/process_messages.jl:118
 [4] (::Base.Distributed.##101#102{TCPSocket,TCPSocket,Bool})() at ./event.jl:73
```
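
This only comes up when worker-worker connections are set up eagerly. As a rough illustration (the worker count here is arbitrary, and the `lazy` keyword is assumed to be honored by the cluster manager in use), an eager all-to-all setup can be requested with:

```julia
using Distributed

# Non-lazy setup makes every new worker connect to all existing workers
# immediately, which widens the window for the race described below.
addprocs(50; lazy=false)
```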

It can be simulated with this exact sequence of events:

  • worker2 is in the process of connecting to the master
    • the master has received worker2's listen port, connected to it, and sent the JoinPGRP message to it
    • the master is now aware of worker2 and has added it to its list of workers
    • worker2 has still not processed the JoinPGRP message, so it is still unaware of its own worker id
  • worker3 now connects to the master
    • the master sends the JoinPGRP message along with a list of existing workers that includes worker2
  • worker3 connects to worker2
  • worker2 receives a new connection from worker3 and attempts to process it
  • worker3 encounters an error and exits, breaking the connection
  • worker2 gets an error while processing a message from worker3
    • it goes into error handling
    • the current error handling code sees its own pid as 1 and incorrectly concludes it is the master (see the sketch after this list)
    • it attempts to process the worker disconnection as the master would and hits the error shown above
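
The problematic pattern is, roughly, the following. This is an illustrative approximation, not the verbatim Distributed code: the cleanup path guards its master-only work with a pid check, which a half-initialized worker also passes.

```julia
using Distributed

# Sketch of the faulty check: a worker that has not yet processed JoinPGRP
# still reports myid() == 1, so it wrongly takes the master-only branch and
# calls `manage` on a DefaultClusterManager, for which no method exists.
function on_worker_deregister(w)
    if myid() == 1   # "am I the master?" -- unreliable before JoinPGRP is processed
        # master-only path: notify the cluster manager that the worker is gone
        Distributed.manage(w.manager, w.id, w.config, :deregister)   # MethodError on a mis-identified worker
    end
end
```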

The MethodError prevents proper cleanup at the worker where it happens.

The root of the problem seems to be that identifying whether a Julia process is the master or a worker by looking at its process id is not reliable; instead, there should be a dedicated indicator for that.

This change adds a new local process role variable that is set to `:master` by default, but is set to `:worker` when `start_worker` is invoked. This allows a process to know that it is running as a worker irrespective of whether it has received a process id or not.
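
A minimal sketch of the idea, using hypothetical names (`PROCESS_ROLE`, `myrole`, `myrole!`) rather than the identifiers used in the actual patch:

```julia
# Process-local role, independent of the (possibly not-yet-assigned) pid.
const PROCESS_ROLE = Ref{Symbol}(:master)   # hypothetical name; defaults to :master

myrole() = PROCESS_ROLE[]
myrole!(role::Symbol) = (PROCESS_ROLE[] = role)

# A worker entry point such as start_worker would call `myrole!(:worker)`
# before any cluster messages are handled, so error handling can branch on
# the role rather than on myid() == 1:
is_master() = myrole() === :master
```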

@tanmaykm added the `parallelism` (Parallel or distributed computation) label on Aug 13, 2019
tanmaykm added a commit to tanmaykm/julia that referenced this pull request Aug 13, 2019
@tanmaykm
Member Author

CI error on win32 seems unrelated.

@tanmaykm
Member Author

CI passes now.

@JeffBezanson JeffBezanson merged commit 4c0049c into JuliaLang:master Aug 15, 2019
Keno pushed a commit that referenced this pull request Jun 5, 2024