
Auto launch of additional workers depending on the number of cores on a host. #9202

Closed

Conversation

amitmurthy
Contributor

This PR is inspired by https://groups.google.com/d/msg/julia-dev/J8plDsw76dI/VQbedWJXi20J and https://groups.google.com/d/msg/julia-dev/Ui7G-99jBpI/VDmO-s0YFiMJ

addprocs has a new keyword argument clone_worker, with a default value of 0. Any positive value n results in n additional workers being created for every new worker. clone_worker="auto" will launch Base.CPU_CORES - 1 additional workers.

The intended usage is for ClusterManagers to launch a single worker per host. With clone_worker="auto", addprocs will then automatically launch additional workers depending on the number of cores available on the host.

For example, on 2 hosts with 8 cores each, addprocs(["host1", "host2"]; clone_worker="auto") will first launch a single worker on each of "host1" and "host2". Next, it will launch 7 additional workers on each host via the 2 newly created workers. Thus only a single ssh session is used to connect to each host.
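For illustration, here is a minimal sketch of what the fan-out on the launching worker could look like; launch_clones and its body are assumptions for exposition, not the code in this PR:

```julia
# Hypothetical sketch of the per-host fan-out (not the actual PR diff).
# Runs on the first worker launched on a host and spawns the extra
# workers locally, so no additional ssh sessions are needed.
function launch_clones(clone_worker)
    n = clone_worker == "auto" ? Base.CPU_CORES - 1 : clone_worker
    exe = joinpath(JULIA_HOME, "julia")
    for i in 1:n
        # each clone's console output is forwarded to the master
        # through this (launching) worker
        spawn(`$exe --worker`)
    end
end
```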

One issue is that the process which launches the clones does not know the pids of the clones, so console output from the clones (which is channeled back to the master via the launching worker) is displayed as shown below:

julia> addprocs(1; clone_worker=3)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @everywhere println(myid())
1
        From worker 2:  2
        From worker 2.clone:    3
        From worker 2.clone:    4
        From worker 2.clone:    5

But other than the above, I think it is a good option for launching the required number of workers per host.

@ViralBShah
Member

clone_worker suggests that these are forked processes - is that so?

@ViralBShah added the parallelism (Parallel or distributed computation) label on Nov 30, 2014
@amitmurthy
Contributor Author

No. Usually the master process launches all workers. In this case, the first worker on a host launches the additional workers.

Open to suggestions for a better name - copy_worker?

@kourzanov

This sounds good (output from workers is a nice catch I missed in my version). However, I think the same could be accomplished without an additional keyword like :copy_worker. Starting a worker with a combination of -p N --worker (via :exeflags=-p N) should already be precise enough, with -p -1 doing the job of :clone_worker="auto".
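For example, under this suggestion the call might look like the following sketch (assuming exeflags is passed through to each launched worker, which then performs its own local fan-out):

```julia
# A sketch of the suggested exeflags route: each ssh-launched worker
# receives `-p 7 --worker` and itself spawns 7 additional local workers.
addprocs(["host1", "host2"]; exeflags=`-p 7`)
```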

@amitmurthy
Contributor Author

I'll give it a shot.

julia -p, i.e. with no value, would launch as many workers as there are cores.
julia -p N launches N additional workers.
julia -p --worker would launch an additional Base.CPU_CORES - 1 workers from a worker.
julia -p N --worker launches N additional workers from a worker.

We should also probably make the first parameter to addprocs optional, since the actual number of workers started will depend on the number of cores detected dynamically.
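A minimal sketch of that mapping, where p is the parsed -p value (nothing when -p is given without a value) and extra_workers is a hypothetical name, not code from this PR:

```julia
# Hypothetical mapping from `-p`/`--worker` to the number of extra
# workers to launch.
function extra_workers(p, is_worker::Bool)
    if p === nothing
        # bare `julia -p`: one worker per core; a worker doing the
        # launching keeps one core for itself
        return is_worker ? Base.CPU_CORES - 1 : Base.CPU_CORES
    else
        # explicit `julia -p N`: exactly N additional workers
        return p
    end
end
```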

@kourzanov

What about the situation where one can detect the number of cores already in use (through LSF etc.), say U cores, and then use the rest of the cores for Julia workers via -p -U, to play nicely with other existing users of a host? When used in conjunction with --worker, it could either always subtract one, or behave as you propose.
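A one-line sketch of that negative -p interpretation (an assumption; U is the count of busy cores detected via LSF, and workers_for is a hypothetical name):

```julia
# Sketch: `-p -U` means "all cores minus the U already in use".
workers_for(p) = p < 0 ? max(Base.CPU_CORES + p, 0) : p
```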

@amitmurthy
Contributor Author

julia -p N --worker will provide a means to do that. The LSF cluster manager should detect this and launch the worker with the appropriate arguments.

@kourzanov

OK, fine. Then scripts for, e.g., LSF should query not only the "used" core count but also the "total" core count, in order to calculate the "remaining" cores.

@JeffBezanson
Member

I have the same confusion as @ViralBShah: there isn't really any cloning or copying, so a different name should be used. Maybe "multiplicity" or just "count".

More importantly, it seems like this could be integrated more deeply. Instead of just adding a keyword argument, we could make arrays of (host, count) the core representation of a group of processes to start. The * syntax in machine files can use the same mechanism. I don't want to add too many tacked-on second-class features.
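A sketch of what that core representation might look like at the call site (hypothetical; the tuple form was only a proposal at the time of this PR):

```julia
# Hypothetical (host, count) representation; mirrors the `*` syntax in
# machine files (e.g. a line reading `8*host1`).
machines = [("host1", 8), ("host2", 8)]
addprocs(machines)
```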

@amitmurthy
Contributor Author

For starters, I'll do away with the new keyword and go with @kourzanov's suggestion of reusing exeflags for this, via the -p N --worker option.

Yes, we can integrate the * syntax with this as well, though that is only relevant for the built-in SSHManager. Other cluster managers will have their own implementations.

@StefanKarpinski
Member

Making the core representation (host, count) seems natural. Another consideration is how this will interact with threading in the future. Is that yet another level of hierarchy?

@kmsquire
Member

kmsquire commented Dec 2, 2014

Threading should probably be an additional hierarchy level.

When I was working in bioinformatics, I found GATK's cluster framework very well engineered. In particular, they handled both threading and separate processes with various cluster managers. It would be worth looking there for design ideas.

(I'd like to offer something more concrete than a pointer, but I'm finding my time stretched a bit thin these days.)

Cheers,
Kevin


@amitmurthy
Contributor Author

Superseded by #9309

@amitmurthy closed this on Dec 11, 2014