
Auto launch of additional workers depending on the number of cores on a host. #9202

Closed

Conversation

amitmurthy
Contributor

This PR is inspired by https://groups.google.com/d/msg/julia-dev/J8plDsw76dI/VQbedWJXi20J and https://groups.google.com/d/msg/julia-dev/Ui7G-99jBpI/VDmO-s0YFiMJ

addprocs has a new keyword argument clone_worker, with a default value of 0. Any positive value n results in n additional workers being created for every new worker. clone_worker="auto" will launch Base.CPU_CORES - 1 additional workers.

The intended usage is for ClusterManagers to launch a single worker per host. With clone_worker="auto", addprocs will then automatically launch additional workers depending on the number of cores available on the host.

For example, on 2 hosts with 8 cores each, addprocs(["host1", "host2"]; clone_worker="auto") will first launch a single worker on each of "host1" and "host2". Next, it will launch 7 additional workers on each host via the 2 newly created workers. Thus only a single ssh session is used to connect to each host.
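For illustration, here is a minimal sketch of what the fan-out on the launching worker could look like; launch_clones and its body are assumptions for exposition, not the code in this PR:

```julia
# Hypothetical sketch of the per-host fan-out (not the actual PR diff).
# Runs on the first worker launched on a host and spawns the extra
# workers locally, so no additional ssh sessions are needed.
function launch_clones(clone_worker)
    n = clone_worker == "auto" ? Base.CPU_CORES - 1 : clone_worker
    exe = joinpath(JULIA_HOME, "julia")
    for i in 1:n
        # each clone's console output is forwarded to the master
        # through this (launching) worker
        spawn(`$exe --worker`)
    end
end
```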

One issue is that the process which launches the clones does not know the pids of the clones, so console output from the clones (which is channeled back to the master via the launching worker) is displayed as shown below:

julia> addprocs(1; clone_worker=3)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @everywhere println(myid())
1
        From worker 2:  2
        From worker 2.clone:    3
        From worker 2.clone:    4
        From worker 2.clone:    5

But other than the above, I think it is a good option for launching the required number of workers per host.

@ViralBShah
Member

clone_worker suggests that these are forked processes - is that so?

@ViralBShah added the parallelism (Parallel or distributed computation) label on Nov 30, 2014
@amitmurthy
Contributor Author

No. Usually the master process launches all workers. In this case, the first worker on a host launches the additional workers.

Open to suggestions for a better name - copy_worker?

@kourzanov

This sounds good (output from workers is a nice catch I missed in my version). However, I think the same could be accomplished without an additional keyword like :copy_worker. Starting a worker with a combination of -p N --worker (via :exeflags=-p N) should already be precise enough, with -p -1 doing the job of :clone_worker="auto".
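For example, under this suggestion the call might look like the following sketch (assuming exeflags is passed through to each launched worker, which then performs its own local fan-out):

```julia
# A sketch of the suggested exeflags route: each ssh-launched worker
# receives `-p 7 --worker` and itself spawns 7 additional local workers.
addprocs(["host1", "host2"]; exeflags=`-p 7`)
```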

@amitmurthy
Contributor Author

I'll give it a shot.

julia -p, i.e. with no value, would launch as many workers as there are cores.
julia -p N launches N additional workers.
julia -p --worker would launch an additional Base.CPU_CORES - 1 workers from a worker.
julia -p N --worker launches N additional workers from a worker.

We should also probably make the first parameter to addprocs optional, since the actual number of workers started will depend on the number of cores detected dynamically.
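A minimal sketch of that mapping, where p is the parsed -p value (nothing when -p is given without a value) and extra_workers is a hypothetical name, not code from this PR:

```julia
# Hypothetical mapping from `-p`/`--worker` to the number of extra
# workers to launch.
function extra_workers(p, is_worker::Bool)
    if p === nothing
        # bare `julia -p`: one worker per core; a worker doing the
        # launching keeps one core for itself
        return is_worker ? Base.CPU_CORES - 1 : Base.CPU_CORES
    else
        # explicit `julia -p N`: exactly N additional workers
        return p
    end
end
```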

@kourzanov

What about the situation where one can detect the number of cores already in use (through LSF etc.), say U cores, and then use the rest of the cores for Julia workers via -p -U, to play nicely with other existing users of a host? When used in conjunction with --worker, it could either always subtract one, or behave as you propose.
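A one-line sketch of that negative -p interpretation (an assumption; U is the count of busy cores detected via LSF, and workers_for is a hypothetical name):

```julia
# Sketch: `-p -U` means "all cores minus the U already in use".
workers_for(p) = p < 0 ? max(Base.CPU_CORES + p, 0) : p
```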

@amitmurthy
Contributor Author

julia -p N --worker will provide a means to do that. The LSF cluster manager should detect this and launch the worker with the appropriate arguments.

@kourzanov

OK, fine. Then scripts for, e.g., LSF should query not only the "used" core count but also the "total" core count, in order to calculate the "remaining" cores.

@JeffBezanson
Member

I have the same confusion as @ViralBShah: there isn't really any cloning or copying, so a different name should be used. Maybe "multiplicity" or just "count".

More importantly, it seems like this could be integrated more deeply. Instead of just adding a keyword argument, we could make arrays of (host, count) the core representation of a group of processes to start. The * syntax in machine files can use the same mechanism. I don't want to add too many tacked-on second-class features.
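A sketch of what that core representation might look like at the call site (hypothetical; the tuple form was only a proposal at the time of this PR):

```julia
# Hypothetical (host, count) representation; mirrors the `*` syntax in
# machine files (e.g. a line reading `8*host1`).
machines = [("host1", 8), ("host2", 8)]
addprocs(machines)
```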

@amitmurthy
Contributor Author

For starters, I'll do away with the new keyword and go with @kourzanov's suggestion of reusing exeflags for this, via the -p N --worker option.

Yes, we can integrate the * syntax with this as well, though that is only relevant for the built-in SSHManager. Other cluster managers will have their own implementations.

@StefanKarpinski
Member

Making the core representation (host, count) seems natural. Another consideration is how this will interact with threading in the future. Is that yet another level of hierarchy?

@kmsquire
Member

kmsquire commented Dec 2, 2014

Threading should probably be an additional hierarchy level.

When I was working in bioinformatics, I found GATK's cluster framework very well engineered. In particular, they handled both threading and separate processes with various cluster managers. It would be worth looking there for design ideas.

(I'd like to offer something more concrete than a pointer, but I'm finding my time stretched a bit thin these days.)

Cheers,
Kevin


@amitmurthy
Contributor Author

Superseded by #9309

@amitmurthy closed this on Dec 11, 2014