slurm_scheduler, dir_workspace: add isolated workspaces for Slurm #416

Closed
wants to merge 1 commit into from

@d4l3k d4l3k commented Mar 10, 2022

This adds a new DirWorkspace that copies the current workspace into a separate directory for code isolation, and integrates it with the Slurm scheduler via the job_dir runopt. The job dir must not already exist; it will be created, and the job's CWD will be set to that job_dir.

  • .torchxignore is used for excluding files from the workspace
  • .torchxslurmjobdirs is used to track where job directories and thus logs are located
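
To illustrate the copy step described above, here is a minimal sketch and not the code from this PR. The helper names `read_ignore_patterns` and `copy_workspace` are hypothetical, and the ignore file is assumed to hold one `fnmatch`-style glob per line with `#` comments, matched against paths relative to the workspace root; the real `.torchxignore` matching rules may differ.

```python
import fnmatch
import os
import shutil
from pathlib import Path
from typing import List


def read_ignore_patterns(workspace: Path, name: str = ".torchxignore") -> List[str]:
    """Read glob patterns, skipping blank lines and '#' comments (assumed format)."""
    ignore_file = workspace / name
    if not ignore_file.exists():
        return []
    lines = (ln.strip() for ln in ignore_file.read_text().splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


def copy_workspace(workspace: str, job_dir: str) -> None:
    """Copy `workspace` into a brand-new `job_dir`, skipping ignored paths."""
    src = Path(workspace).resolve()
    dst = Path(job_dir).resolve()
    if dst.exists():
        # Mirrors the rule in the description: the job dir must not exist.
        raise FileExistsError(f"job_dir already exists: {dst}")
    if src == dst or src in dst.parents:
        # Copying a tree into itself is not handled by this sketch.
        raise ValueError("job_dir must be outside the workspace")
    patterns = read_ignore_patterns(src)

    def ignore(directory: str, names: List[str]) -> List[str]:
        # Called by copytree once per directory; returns the names to skip.
        skipped = []
        for name in names:
            rel = os.path.relpath(os.path.join(directory, name), src)
            rel = rel.replace(os.sep, "/")
            if any(fnmatch.fnmatch(rel, pat) for pat in patterns):
                skipped.append(name)
        return skipped

    shutil.copytree(src, dst, ignore=ignore)
```

Calling `copy_workspace(".", "/some/new/job_dir")` would then leave an isolated snapshot of the code in the job directory, so later edits to the working tree do not affect a queued job.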

Test plan:

Slurm integ tests + unit tests
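
For the `.torchxslurmjobdirs` bookkeeping mentioned in the description, a rough sketch of how a job ID could be mapped back to its job directory (and thus its logs) follows. The file name comes from this PR; the `job_id = job_dir` line format and the helpers `record_job_dir` and `lookup_job_dir` are assumptions made for illustration, not the scheduler's actual code.

```python
from pathlib import Path
from typing import Optional

JOB_DIRS_FILE = ".torchxslurmjobdirs"  # name from the PR; line format below is assumed


def record_job_dir(job_id: str, job_dir: str, tracking_file: str = JOB_DIRS_FILE) -> None:
    """Append one 'job_id = job_dir' line (assumed format) to the tracking file."""
    with open(tracking_file, "a") as f:
        f.write(f"{job_id} = {job_dir}\n")


def lookup_job_dir(job_id: str, tracking_file: str = JOB_DIRS_FILE) -> Optional[str]:
    """Return the recorded directory for `job_id`, or None if it is unknown."""
    path = Path(tracking_file)
    if not path.exists():
        return None
    found = None
    for line in path.read_text().splitlines():
        key, sep, value = line.partition(" = ")
        if sep and key.strip() == job_id:
            found = value.strip()  # the most recent entry wins
    return found
```

An append-only text file keeps the write path trivial at submission time; the lookup scans the whole file, which should be fine for the number of jobs a single user typically submits from one directory.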

@facebook-github-bot added the CLA Signed label Mar 10, 2022
codecov bot commented Mar 10, 2022

Codecov Report

Merging #416 (0c66b54) into main (2d3a981) will increase coverage by 0.08%.
The diff coverage is 98.18%.

@@            Coverage Diff             @@
##             main     #416      +/-   ##
==========================================
+ Coverage   94.20%   94.28%   +0.08%     
==========================================
  Files          66       67       +1     
  Lines        3690     3761      +71     
==========================================
+ Hits         3476     3546      +70     
- Misses        214      215       +1     

| Impacted Files | Coverage Δ |
|---|---|
| torchx/workspace/api.py | 95.00% <97.22%> (+11.66%) ⬆️ |
| torchx/schedulers/slurm_scheduler.py | 97.46% <97.36%> (-0.10%) ⬇️ |
| torchx/runner/api.py | 96.57% <100.00%> (ø) |
| torchx/schedulers/api.py | 93.24% <100.00%> (ø) |
| torchx/workspace/__init__.py | 100.00% <100.00%> (ø) |
| torchx/workspace/dir_workspace.py | 100.00% <100.00%> (ø) |
| torchx/workspace/docker_workspace.py | 91.95% <100.00%> (-1.63%) ⬇️ |
| torchx/schedulers/ray_scheduler.py | 94.69% <0.00%> (+0.75%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2d3a981...0c66b54.

@d4l3k force-pushed the slurmimp branch 2 times, most recently from f91cfbf to 9b60e81 on March 10, 2022 21:53
@facebook-github-bot

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot

@d4l3k has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@d4l3k has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

d4l3k added a commit that referenced this pull request Mar 11, 2022
Summary:
This adds a new `DirWorkspace` that copies the current workspace into a separate directory for code isolation, and integrates it with the `slurm` scheduler via the `job_dir` runopt. The job dir must not already exist; it will be created, and the job's CWD will be set to that job_dir.

* `.torchxignore` is used for excluding files from the workspace
* `.torchxslurmjobdirs` is used to track where job directories and thus logs are located

Pull Request resolved: #416

Test Plan: Slurm integ tests + unit tests

Reviewed By: kiukchung

Differential Revision: D34801126

Pulled By: d4l3k

fbshipit-source-id: 7423897d4a372f524230d08bc681493c112ce383
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D34801126

fbshipit-source-id: d53f6f36ad76921289116ee3e8a7c05b0975e594
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D34801126
