Add Google Cloud/GCP Scheduler Support + TPUs #410

Closed
d4l3k opened this issue Mar 7, 2022 · 5 comments · May be fixed by #473
Labels
enhancement (New feature or request), module: runner (issues related to the torchx.runner and torchx.scheduler modules), scheduler-request (New scheduler requests)

Comments

@d4l3k (Member) commented Mar 7, 2022

It would be nice to have GCP + TPU support in addition to our existing schedulers. Currently you can run on GCP via Kubernetes and the Kubernetes scheduler, but it would be handy to have direct training platform support.

Example scheduler: AWS Batch https://github.com/pytorch/torchx/blob/main/torchx/schedulers/aws_batch_scheduler.py (a rough sketch of the same interface for GCP is included below)

Scheduler documentation: https://pytorch.org/torchx/main/schedulers

GCP Docs:

Stretch Goal:
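
For reference, a new scheduler is a module under torchx/schedulers that implements the Scheduler interface plus a create_scheduler factory, following the AWS Batch example linked above. A hedged sketch of that shape is below; the Scheduler/runopts/AppDryRunInfo names follow the existing torchx API, but everything GCP-specific (the option names, the GCPJob dataclass, the missing client calls) is hypothetical, not a working implementation.

```python
# Hedged sketch only: torchx API names are real, the GCP side is a placeholder.
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping, Optional

from torchx.schedulers.api import DescribeAppResponse, Scheduler
from torchx.specs.api import AppDef, AppDryRunInfo, runopts


@dataclass
class GCPJob:
    name: str
    project: str
    # request body that would eventually be sent to the GCP API
    request: Dict[str, Any] = field(default_factory=dict)


class GCPScheduler(Scheduler):
    def __init__(self, session_name: str) -> None:
        super().__init__("gcp", session_name)

    def run_opts(self) -> runopts:
        opts = runopts()
        opts.add("project", type_=str, required=True, help="GCP project to run the job in")
        opts.add("zone", type_=str, help="zone to place VMs/TPUs in")
        return opts

    def _submit_dryrun(self, app: AppDef, cfg: Mapping[str, Any]) -> AppDryRunInfo[GCPJob]:
        # translate the AppDef roles/resources into a GCP request body here
        job = GCPJob(name=app.name, project=cfg["project"])
        return AppDryRunInfo(job, repr)

    def schedule(self, dryrun_info: AppDryRunInfo[GCPJob]) -> str:
        # placeholder: submit dryrun_info.request via a GCP client and
        # return the resulting job id as the app_id
        raise NotImplementedError()

    def describe(self, app_id: str) -> Optional[DescribeAppResponse]:
        raise NotImplementedError()

    def _cancel_existing(self, app_id: str) -> None:
        raise NotImplementedError()


def create_scheduler(session_name: str, **kwargs: Any) -> GCPScheduler:
    return GCPScheduler(session_name)
```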

d4l3k added the enhancement (New feature or request) and module: runner (issues related to the torchx.runner and torchx.scheduler modules) labels Mar 7, 2022
@d4l3k (Member, Author) commented Apr 15, 2022

Resources on the TPU VM hosts:

v2-8

model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
96 cores

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          334Gi       1.5Gi       331Gi       1.0Mi       1.3Gi       331Gi
Swap:            0B          0B          0B

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        97G   15G   83G  16% /

v3-8

model name	: Intel(R) Xeon(R) CPU @ 2.00GHz
96 cores

tristanr@t1v-n-c76b9a61-w-0:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:          334Gi       1.5Gi       331Gi       1.0Mi       1.5Gi       331Gi
Swap:            0B          0B          0B

tristanr@t1v-n-c76b9a61-w-0:~$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        97G   15G   83G  15% /
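
These host specs would presumably map onto named resources in torchx.specs. A hedged sketch, assuming hypothetical tpu_v2_8 / tpu_v3_8 names and a made-up "tpu" capability key (torchx does not ship these definitions):

```python
# Hedged sketch: mapping the host specs above onto torchx.specs.Resource.
from torchx.specs import Resource

GiB: int = 1024  # memMB is expressed in MiB


def tpu_v2_8() -> Resource:
    # 96 vCPUs, ~334 GiB host RAM, 8 TPU v2 cores attached to the host
    return Resource(cpu=96, gpu=0, memMB=334 * GiB, capabilities={"tpu": "v2-8"})


def tpu_v3_8() -> Resource:
    # same host shape as v2-8, with 8 TPU v3 cores
    return Resource(cpu=96, gpu=0, memMB=334 * GiB, capabilities={"tpu": "v3-8"})
```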

@vwxyzjn commented Jul 21, 2022

Also see https://cloud.google.com/blog/products/compute/new-batch-service-processes-batch-jobs-on-google-cloud. Google recently announced Batch, the GCP equivalent of AWS Batch.

@d4l3k (Member, Author) commented Jul 26, 2022

@vwxyzjn thanks for sharing that! I hadn't seen it. Do you know if it supports TPUs? I didn't see anything listed in the announcement post.

@priyaramani (Contributor) commented

@vwxyzjn Hey there, we just added support for GCP Batch (initial version with support for basic components, #621). Please try it out and let us know; we would appreciate early feedback. Thanks!

@priyaramani (Contributor) commented

GCP Batch support has been added to TorchX as a prototype; see https://github.com/pytorch/torchx/releases/tag/v0.4.0
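
For anyone who wants to try it, a minimal sketch of submitting through the new scheduler from Python, assuming it is registered under the name "gcp_batch" and accepts a "project" run option (run `torchx runopts` against your install for the authoritative list). The image and project id below are placeholders.

```python
from torchx.runner import get_runner
from torchx.specs import AppDef, Resource, Role

# A single-role app that just echoes a message.
app = AppDef(
    name="echo-test",
    roles=[
        Role(
            name="echo",
            image="ghcr.io/pytorch/torchx:0.4.0",  # any image with /bin/echo works
            entrypoint="/bin/echo",
            args=["hello from gcp_batch"],
            resource=Resource(cpu=1, gpu=0, memMB=1024),
        )
    ],
)

with get_runner() as runner:
    app_handle = runner.run(app, scheduler="gcp_batch", cfg={"project": "my-gcp-project"})
    print(runner.status(app_handle))
```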
