Torchx integration #321

Closed

vwxyzjn (Owner) commented Nov 19, 2022

Description

Our current cloud integration is pretty hacky. I haven't seen anyone use it, and it has been a maintenance burden for us. A more managed utility for launching experiments in the cloud is desirable. There are two primary contenders; their pros and cons are:

  • torchx
    • ✅ support for slurm
    • ✅ support for running tasks locally
    • ✅ the docker image is automatically pushed with a hash for AWS Batch
    • ❌ still requires spinning up cloud resources (e.g., AWS Batch), which is complicated but can be mitigated by using Terraform
  • skypilot
    • ✅ support for managing spot instances and automatically resuming them
    • compare pricing
    • ✅ debuggability via sky ssh mycluster
      • ✅ good for folks who don't always have a GPU machine
    • ❌ need to wait for the clusters to be spun up

Both of them:

  • ✅ support for AWS, GCP, and Azure

This PR

Adds a better cloud integration utility by leveraging torchx, which should be an elegant solution for us.

Give it a try by running

poetry run torchx run --scheduler local_docker utils.python --gpu 1 --script cleanrl/cleanrl.py
poetry run torchx run --scheduler aws_batch --scheduler_args queue=c5a-large,image_repo=vwxyzjn/cleanrl utils.python --script cleanrl/ppo.py
poetry run torchx status aws_batch://torchx/c5a-large:torchx_utils_python-pn9sx3wzq0qcwd
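For anyone who wants to script these launches, here is a minimal Python sketch that assembles the same command lines and splits a status handle into its parts. The helper names `build_torchx_run` and `split_app_handle` are hypothetical and not part of this PR or of torchx. The flags are copied from the commands above, and the `scheduler://session/app_id` handle layout is an assumption inferred from the example handle.

```python
import shlex
from typing import Dict, List, Optional


def build_torchx_run(
    scheduler: str,
    script: str,
    component: str = "utils.python",
    scheduler_args: Optional[Dict[str, str]] = None,
    gpu: Optional[int] = None,
) -> List[str]:
    """Assemble the argv for a `torchx run` call like the ones above.

    Hypothetical helper: it only builds the command and never runs it.
    """
    argv = ["poetry", "run", "torchx", "run", "--scheduler", scheduler]
    if scheduler_args:
        # Scheduler options are passed as one comma-separated key=value string,
        # mirroring `queue=c5a-large,image_repo=vwxyzjn/cleanrl` above.
        joined = ",".join(f"{key}={value}" for key, value in scheduler_args.items())
        argv += ["--scheduler_args", joined]
    argv.append(component)
    if gpu is not None:
        argv += ["--gpu", str(gpu)]
    argv += ["--script", script]
    return argv


def split_app_handle(handle: str) -> Dict[str, str]:
    """Split a handle assumed to look like scheduler://session/app_id."""
    scheduler, sep, rest = handle.partition("://")
    session, slash, app_id = rest.partition("/")
    if not (sep and slash and scheduler and app_id):
        raise ValueError(f"not a scheduler://session/app_id handle: {handle!r}")
    return {"scheduler": scheduler, "session": session, "app_id": app_id}


if __name__ == "__main__":
    batch = build_torchx_run(
        "aws_batch",
        "cleanrl/ppo.py",
        scheduler_args={"queue": "c5a-large", "image_repo": "vwxyzjn/cleanrl"},
    )
    # Print a copy-pasteable shell command; pass `batch` to subprocess.run to launch.
    print(shlex.join(batch))
    print(split_app_handle(
        "aws_batch://torchx/c5a-large:torchx_utils_python-pn9sx3wzq0qcwd"
    ))
```

Keeping the command as a list rather than a single string avoids shell quoting issues if the list is later handed to `subprocess.run`.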

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation and previewed the changes via mkdocs serve.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have added links to the PR related to the algorithm variant.
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format).
    • I have added links to the tracked experiments.
    • I have updated the overview sections at the docs and the repo
  • I have updated the tests accordingly (if applicable).

vercel bot commented Nov 19, 2022

The latest updates on your projects.

Name: cleanrl · Status: ✅ Ready · Updated: Jan 1, 2023 at 3:14PM (UTC)

vwxyzjn (Owner, Author) commented Mar 26, 2023

Closing this for now. We will likely go with a Slurm integration in the future, such as the one in https://github.com/vwxyzjn/cleanba/blob/a61c51214d44cbfcc055c77676c351fdeeb5e6cc/benchmark.sh#L3-L13

vwxyzjn closed this Mar 26, 2023