Proposal for provision pipeline #400

Michaelvll · 2022-02-21T06:37:30Z

We currently have quite a few problems: ray up not updating file_mounts (mostly for multi-node), logging that is not informative enough (#384), long wait times during file_mounts for large files, the workdir problem in #389, the ssh config being set too late (#385), and the cluster remaining in the INIT state when the user's setup fails (which does not distinguish whether the instance was successfully launched or not).

The following is my proposal for a new API:

For the Backend, we add a new setup function and, during execution, call setup() after the sync_file_mounts function. The provision function is then responsible only for launching the instances and setting up ray, just like the first step of gang_scheduling. Also, before provisioning, our backend can try ray exec handle.yaml -- ray status to check the availability of the cluster and skip ray up if it is ready. All user-customized file_mounts and setup steps are then handled by our own functions.
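A minimal sketch of the stage ordering described above, under stated assumptions: SketchBackend, ClusterStatus, launch, and the cluster_is_ready callback are hypothetical names for illustration and not the actual sky backend API. The callback stands in for a health check such as ray status. The sketch only records which stage runs when, and shows that a failing user setup no longer leaves the cluster looking un-launched.

```python
from enum import Enum
from typing import Callable, List


class ClusterStatus(Enum):
    INIT = "INIT"  # instances not yet confirmed as launched
    UP = "UP"      # instances launched and ray is running


class SketchBackend:
    """Hypothetical backend that records the order in which stages run."""

    def __init__(self, cluster_is_ready: Callable[[], bool]):
        # cluster_is_ready is an assumed stand-in for a check like `ray status`.
        self._cluster_is_ready = cluster_is_ready
        self.status = ClusterStatus.INIT
        self.calls: List[str] = []

    def provision(self) -> None:
        # Only launch instances and set up ray; skip ray up if already healthy.
        if self._cluster_is_ready():
            self.calls.append("provision:skipped")
        else:
            self.calls.append("provision:ray_up")
        self.status = ClusterStatus.UP

    def sync_file_mounts(self) -> None:
        # User file_mounts are synced by our own code, not by ray up.
        self.calls.append("sync_file_mounts")

    def setup(self, setup_ok: bool = True) -> None:
        # User setup runs after file_mounts, so it can read the mounted files.
        self.calls.append("setup")
        if not setup_ok:
            raise RuntimeError("user setup failed")

    def execute(self) -> None:
        self.calls.append("execute")


def launch(backend: SketchBackend, setup_ok: bool = True) -> None:
    """Run the stages in the proposed order."""
    backend.provision()
    backend.sync_file_mounts()
    backend.setup(setup_ok)
    backend.execute()
```

With this split, a setup failure raises after provision has already marked the cluster as UP, so the launch result and the setup result can be reported separately.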

concretevitamin · 2022-02-21T14:09:01Z

Sounds good at a high level. The upside is that having more control solves the problems mentioned. The only downsides I can see are:

- Not leveraging some of ray autoscaler's reuse and caching logic: it detects whether setup / file_mounts are up to date and skips those steps. This may be fine, as we make a distinction between sky launch (which should always rerun the entire setup) and sky exec anyway.
- This requires some involved work, such as changing the state diagram and definitions, and thus may require a lot of testing.
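For reference, one way such "is it up to date" detection could be reproduced on our side is a content fingerprint. This is only a sketch under assumptions: fingerprint and should_rerun are hypothetical helpers, and this is not a description of how ray autoscaler implements its caching. The force flag models the sky launch case, which always reruns setup.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(setup_commands: str, file_contents: Dict[str, bytes]) -> str:
    """Hash the setup commands and mounted file contents into one digest."""
    h = hashlib.sha256()
    h.update(setup_commands.encode("utf-8"))
    # Sort paths so the digest does not depend on dict ordering.
    for path in sorted(file_contents):
        h.update(b"\0" + path.encode("utf-8") + b"\0")
        h.update(file_contents[path])
    return h.hexdigest()


def should_rerun(previous: Optional[str], current: str, force: bool) -> bool:
    """Rerun when forced, when nothing was recorded, or when inputs changed."""
    return force or previous != current
```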

We should have a meeting to talk about this!

romilbhardwaj · 2022-02-21T15:47:41Z

I think this is a good idea. Also if I understand correctly, this is proposing we bring back STAGE.pre_exec and backend.run_post_setup like we had before and move the setup logic here, right?

https://github.com/sky-proj/sky/blob/795b0e9d703d7c66b7167b4601718ae638984851/prototype/sky/execution.py#L132

Another upside to adding the pre_exec stage is that users can now use contents of file_mounts for setup.

Michaelvll · 2022-02-21T17:38:17Z

> I think this is a good idea. Also if I understand correctly, this is proposing we bring back STAGE.pre_exec and backend.run_post_setup like we had before and move the setup logic here, right?
>
> https://github.com/sky-proj/sky/blob/795b0e9d703d7c66b7167b4601718ae638984851/prototype/sky/execution.py#L132

Yes, good point! Basically, it brings back backend.run_post_setup and moves all of the user's setup logic there.

Michaelvll added the proposal label Feb 21, 2022

Michaelvll mentioned this issue Feb 21, 2022

[Logging] Multiple Shared connection closed messages #404

Closed

Michaelvll added the P0 label Feb 21, 2022

Michaelvll self-assigned this Feb 21, 2022

Michaelvll mentioned this issue Feb 22, 2022

Decouple ray up and user's file_mounts / setup #407

Merged

3 tasks

Michaelvll linked a pull request Feb 23, 2022 that will close this issue

Decouple ray up and user's file_mounts / setup #407

Merged

3 tasks

Michaelvll closed this as completed in #407 Feb 23, 2022

Michaelvll mentioned this issue Feb 23, 2022

[Sky Launch] SSH to Machine during first run of sky launch #385

Closed
