Torchx integration #321

vwxyzjn · 2022-11-19T05:00:53Z

Description

Our current cloud integration is pretty hacky. I haven't seen anyone use it, and it has been a maintenance burden for us. A more managed utility for launching experiments in the cloud would be preferable. There are two primary contenders, with the following pros and cons:

torchx
- ✅ support for slurm
- ✅ support for running tasks locally
- ✅ the docker image is automatically pushed with a hash for AWS Batch
- ❌ still need to spin up cloud resources (e.g., aws batch), which is complicated but can be mitigated by using terraform
skypilot
- ✅ support for managing spot instances and auto-resuming them
- ✅ compare pricing
- ✅ debuggability via sky ssh mycluster
  - ✅ good for folks who don't always have a GPU machine
- ❌ need to wait for the clusters to be spun up

Both:

- ✅ support for aws, gcp, azure

This PR

This PR adds a better cloud integration utility built on torchx. It should be a more elegant solution for us, with the following benefits:

- we can deprecate our cloud utilities and release ourselves from their maintenance burden
- support for slurm, kubernetes, aws batch, gcp (Add Google Cloud/GCP Scheduler Support + TPUs pytorch/torchx#410 (comment)) and others

Give it a try by running

poetry run torchx run --scheduler local_docker utils.python --gpu 1 --script cleanrl/cleanrl.py
poetry run torchx run --scheduler aws_batch --scheduler_args queue=c5a-large,image_repo=vwxyzjn/cleanrl utils.python --script cleanrl/ppo.py
poetry run torchx status aws_batch://torchx/c5a-large:torchx_utils_python-pn9sx3wzq0qcwd
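The `torchx status` command above takes an app handle, which torchx documents as having the form `scheduler://session/app_id`. As a rough sketch of how such a handle breaks down, the helper below splits it with shell parameter expansion. `parse_torchx_handle` is a hypothetical name for illustration and is not part of torchx. Reading the `aws_batch` app id as `queue:job_name` is an assumption based on the example handle in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of torchx): split an app handle of the form
# scheduler://session/app_id into its three parts, one per output line.
parse_torchx_handle() {
  local handle="$1"

  # Reject anything that does not look like scheduler://session/app_id.
  case "$handle" in
    *://*/*) ;;
    *) echo "not a torchx-style handle: $handle" >&2; return 1 ;;
  esac

  local scheduler="${handle%%://*}"  # text before "://"
  local rest="${handle#*://}"        # text after "://"
  local session="${rest%%/*}"        # text before the first "/"
  local app_id="${rest#*/}"          # text after the first "/"

  printf '%s\n%s\n%s\n' "$scheduler" "$session" "$app_id"
}

# Example with the handle from this PR; for aws_batch the app id appears
# to be "queue:job_name" (an assumption read off the example).
parse_torchx_handle "aws_batch://torchx/c5a-large:torchx_utils_python-pn9sx3wzq0qcwd"
```

This can help when scripting around job handles, for example to pick the scheduler name out of a list of saved handles before calling `torchx status` on each.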

Types of changes

Bug fix
New feature
New algorithm
Documentation

Checklist:

I've read the CONTRIBUTION guide (required).
I have ensured pre-commit run --all-files passes (required).
I have updated the documentation and previewed the changes via mkdocs serve.
I have updated the tests accordingly (if applicable).

If you are adding new algorithm variants or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

vercel · 2022-11-19T05:00:56Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated
cleanrl	✅ Ready (Inspect)	Visit Preview	Jan 1, 2023 at 3:14PM (UTC)

vwxyzjn · 2023-03-26T22:15:54Z

Closing this for now. We will likely go with a slurm integration in the future, such as https://github.com/vwxyzjn/cleanba/blob/a61c51214d44cbfcc055c77676c351fdeeb5e6cc/benchmark.sh#L3-L13

Torchx integration

ecad104

update dockerfile

fb247df

vercel bot deployed to Preview January 1, 2023 15:14

vwxyzjn closed this Mar 26, 2023
