Supporting multiple cluster schedulers #3549

Closed
ViralBShah opened this issue Jun 27, 2013 · 9 comments · Fixed by #3649

@ViralBShah
Member

Currently, supporting multiple cluster schedulers requires hacking code into multi.jl. It would be nice to have a way to abstract this out of Base. Ideally, Base would just have the capability to ssh and launch jobs.

SGE and scyld support (#3546) should ideally live in packages that extend the capabilities in Base. This will require some rethinking of how the worker-launch code is currently organized. Once that is done, it will become much easier to support many other batch schedulers.

cc: @nlhelper

@amitmurthy
Contributor

👍

@ViralBShah
Member Author

It would also be nice to be able to allocate EC2 instances in the same way.

@pao
Member

pao commented Jun 27, 2013

@ViralBShah you're looking for @nlhepler, I think.

@amitmurthy
Contributor

A suggested design:

  • Currently, the worker processes write out the host:port they are listening on to stdout.
    This is parsed by the master process and connections are established.
    The STDOUT of the worker processes is then redirected to the Julia console.
  • We can have another addprocs in Base:
    addprocs(n::Integer, launch_cb::Function, args...)
  • The callback function has the signature launch_cb(julia_exe_name, julia_exe_args, args...) => io || (io, host, port) || (io, host) || (host, port)
  • launch_cb is called by addprocs to launch a Julia worker process on any machine in the cluster.
  • The first two parameters - julia_exe_name and julia_exe_args - are specified by Base.addprocs. The remaining
    parameters are passed through.
  • launch_cb can return different combinations of io, host and port. io is the stdout of the worker process;
    host and port are the address used to connect to the process.
  • If only an io is returned, it is parsed for the host and port.
  • If only host and port are returned, they are used to connect directly.
  • If all three are returned, the io is read and then redirected to the console, and the returned host and port are used.
  • If io and host are returned, the port is obtained from the io while the returned host is used.
    (A sketch of such a callback appears at the end of this comment.)
  • We can bundle the existing sge, scyld and EC2 addprocs into a new package called Cluster,
    with appropriate callback functions for each cluster model.
  • Usage in Julia would be:

addprocs(5, Cluster.ec2, <args specific to EC2>)
addprocs(5, Cluster.sge, <args specific to sge>)
addprocs(5, Cluster.scyld, <args specific to scyld>)

Other clustering technologies can be supported via additions to the Cluster package.
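To make the callback contract concrete, here is a minimal sketch of one such callback under the proposed API. The name ssh_launch_cb, the single host pass-through argument, and the use of readsfrom (which in the Julia of that era returned the command's stdout stream along with the process object) are illustrative assumptions, not part of the proposal:

# Hypothetical callback matching launch_cb(julia_exe_name, julia_exe_args, args...).
# julia_exe_name and julia_exe_args are supplied by Base.addprocs; host is a
# pass-through argument. Returning (io, host) means addprocs reads the port from
# the worker's stdout but connects to the given host.
function ssh_launch_cb(julia_exe_name, julia_exe_args, host)
    cmd = `ssh $host "$julia_exe_name $julia_exe_args"`
    io, proc = readsfrom(cmd)   # capture the remote worker's stdout
    (io, host)
end

# addprocs(4, ssh_launch_cb, "node01.example.com")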

@JeffBezanson
Member

That looks good.

@papamarkou
Contributor

👍

@daviddelaat
Contributor

For the API it would maybe be nice to make EC2, SGE, etc. immutable types that inherit from a common abstract cluster type, and then do:

addprocs(5, Cluster.EC2(<args specific to EC2>))
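A minimal sketch of that idea, in the abstract/immutable type syntax of Julia at the time; the type names, fields and constructor arguments below are illustrative, not an agreed API:

abstract ClusterScheduler          # hypothetical common abstract type

immutable EC2 <: ClusterScheduler
    ami::String                    # machine image to boot workers from
    instance_type::String          # e.g. "m1.large"
end

immutable SGE <: ClusterScheduler
    queue::String                  # SGE queue to submit worker jobs to
end

# addprocs(5, EC2("ami-1234abcd", "m1.large"))
# addprocs(5, SGE("all.q"))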

@amitmurthy
Contributor

Good idea.

I am thinking of extending the signature of

addprocs(machines::AbstractVector; tunnel=false, dir=JULIA_HOME, exename="./julia-release-basic", sshflags::Cmd=``)

to

addprocs(machines::Union(AbstractVector, ClusterManager); tunnel=false, dir=JULIA_HOME, exename="./julia-release-basic", sshflags::Cmd=``)

where a concrete implementation of ClusterManager must have two members: np, the number of procs to be added, and launch, a function that will launch a worker instance.
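As a rough sketch of what a concrete manager could look like under that extended signature; the SGEManager type, its stub launch function, and the abstract ClusterManager declaration (which would live in Base under this proposal) are illustrative assumptions:

abstract ClusterManager            # provided by Base under this proposal

immutable SGEManager <: ClusterManager
    np::Int                        # number of worker processes to add
    launch::Function               # launches a worker instance, e.g. via qsub
end

# Stub launch function; a real one would submit the jobs and report the
# workers' host:port (or their stdout streams) back to addprocs.
sge_launch(exename, exeflags) = error("not implemented in this sketch")

# addprocs(SGEManager(8, sge_launch); dir="/opt/julia", exename="./julia-release-basic")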
