Proposal for provision pipeline #400

Closed
Michaelvll opened this issue Feb 21, 2022 · 3 comments · Fixed by #407

@Michaelvll
Collaborator

Michaelvll commented Feb 21, 2022

We currently have quite a few problems with provisioning: ray up does not update file_mounts (mostly on multi-node clusters); the logging is not informative enough (#384); file_mounts take a long time for large files; the workdir problem in #389; the ssh config is set too late (#385); and when the user's setup fails, the cluster remains in the INIT state, which does not distinguish whether the instances were launched successfully or not.

The following is my proposal for a new API:

For the Backend, we add a new setup function, and during execution we call this setup() after the sync_file_mounts function. The provision function is then only responsible for launching the instances and setting up ray, just like the first step of gang_scheduling. Also, before provisioning, our backend can try ray exec handle.yaml -- ray status to check whether the cluster is available, and skip ray up if it is ready. All user-customized file_mounts and setup commands are then handled by our own functions.
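
A minimal sketch of the stage order described above, not the actual sky Backend. The class name SketchBackend, the launch helper, the injectable run_cmd runner, and the stage labels are all hypothetical, and the exact ray exec argument form is an assumption. It only illustrates that provision can skip ray up when the readiness check succeeds, and that setup runs after sync_file_mounts.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# A runner takes an argv list and returns the exit code (hypothetical type).
Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


class SketchBackend:
    """Hypothetical backend; method names follow the proposal's wording."""

    def __init__(self, run_cmd: Optional[Runner] = None) -> None:
        self.run_cmd: Runner = run_cmd or _run
        self.stages: List[str] = []  # records which stages ran, in order

    def cluster_ready(self, cluster_yaml: str) -> bool:
        # Assumption: a zero exit code from `ray status` on the head node
        # means the cluster is up and usable.
        return self.run_cmd(['ray', 'exec', cluster_yaml, 'ray status']) == 0

    def provision(self, cluster_yaml: str) -> None:
        # Only launches instances and sets up ray; no user setup here.
        if self.cluster_ready(cluster_yaml):
            self.stages.append('provision: skipped ray up')
            return
        if self.run_cmd(['ray', 'up', '-y', cluster_yaml]) != 0:
            raise RuntimeError('ray up failed')
        self.stages.append('provision: ray up')

    def sync_file_mounts(self) -> None:
        # Placeholder: the backend's own file sync would go here.
        self.stages.append('sync_file_mounts')

    def setup(self) -> None:
        # Placeholder: the user's setup commands would run here, after
        # file_mounts are in place.
        self.stages.append('setup')

    def execute(self) -> None:
        # Placeholder: run the user's task.
        self.stages.append('execute')


def launch(backend: SketchBackend, cluster_yaml: str) -> List[str]:
    """Run the stages in the proposed order and return the stage log."""
    backend.provision(cluster_yaml)
    backend.sync_file_mounts()
    backend.setup()
    backend.execute()
    return backend.stages
```

Injecting run_cmd keeps the sketch testable without a real cluster. Because setup is its own stage, a setup failure could be reported separately from a provisioning failure.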

@concretevitamin
Member

Sounds good at a high level. The upside is that having more control solves the problems mentioned. The only downsides I can see are:

  1. not leveraging some of the ray autoscaler's reuse and caching logic: it detects whether setup / file_mounts are up to date and skips those steps. This may be fine, as we make a distinction between sky launch (which should always rerun the entire setup) and sky exec anyway.
  2. this requires some involved work, such as changing the state diagram and definitions, and thus may require a lot of testing.

We should have a meeting to talk about this!

@romilbhardwaj
Collaborator

I think this is a good idea. Also if I understand correctly, this is proposing we bring back STAGE.pre_exec and backend.run_post_setup like we had before and move the setup logic here, right?

https://github.com/sky-proj/sky/blob/795b0e9d703d7c66b7167b4601718ae638984851/prototype/sky/execution.py#L132

Another upside to adding the pre_exec stage is that users can now use contents of file_mounts for setup.

@Michaelvll
Collaborator Author

> I think this is a good idea. Also if I understand correctly, this is proposing we bring back STAGE.pre_exec and backend.run_post_setup like we had before and move the setup logic here, right?
>
> https://github.com/sky-proj/sky/blob/795b0e9d703d7c66b7167b4601718ae638984851/prototype/sky/execution.py#L132

Yes, good point! Basically, it brings back backend.run_post_setup and moves all of the user's setup logic there.
