-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding basic preemption code #6161
Conversation
b50e092
to
483f932
Compare
continue | ||
index = checkpoint.find(self.monitor) + len(self.monitor) + 1 # Find monitor in str + 1 for '=' | ||
if index != -1: | ||
match = re.search('[A-z]', checkpoint[index:]) |
Check warning
Code scanning / CodeQL
Overly permissive regular expression range
try: | ||
self._fs.rm(filepath) | ||
logging.info(f"Removed checkpoint: {filepath}") | ||
except: |
Check notice
Code scanning / CodeQL
Except block handles 'BaseException'
27891dc
to
34a5964
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good to me. Once we pass CI and verify it's working on the clusters, let's merge it.
99553e5
to
65bc34d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments
nemo/utils/callbacks/preemption.py
Outdated
PreemptionCallback is always enabled. | ||
""" | ||
|
||
def __init__(self, checkpoint_callback, sig=signal.SIGTERM): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be a None by default, and if None then self.sig = [signal.SIGTERM]. That way BCP can pass other signals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@titu1994 thanks, made this change
nemo/utils/callbacks/preemption.py
Outdated
|
||
@property | ||
def interrupted(self): | ||
interrupted = torch.tensor(self._interrupted).int().to(torch.device('cuda')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of .to(), use device= inside of the torch.tensor(..., device=torch.cuda.current_device(), dtype=torch.int32).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this too
nemo/utils/exp_manager.py
Outdated
@@ -1122,6 +888,8 @@ def configure_checkpointing( | |||
if 'mp_rank' in checkpoint_callback.last_model_path or 'tp_rank' in checkpoint_callback.last_model_path: | |||
checkpoint_callback.last_model_path = uninject_model_parallel_rank(checkpoint_callback.last_model_path) | |||
trainer.callbacks.append(checkpoint_callback) | |||
preemption_callback = PreemptionCallback(checkpoint_callback) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be a default ? I would suggest to add a bool flag that enables by default but can be disabled if wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, changed this in the latest commit
fc56c8a
to
34bcc1f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preemption signal is currentlt not configurable from config
8a5eb1d
to
4f3a280
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this current version doesn't work as it's just overriding all self.xyz with last signal.
Let's revert the commit and merge earlier version.
nemo/utils/callbacks/preemption.py
Outdated
# Bool var that's initialized to false and made True upon receving the preemption signal | ||
self._interrupted = False | ||
self.released = False | ||
self.original_handler = signal.getsignal(sig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This just gets overriden by whatever is the last signal now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to do special logic for each signal. Listen for all, if any of those are received, ignore the rest and call preemption code c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this logic is getting too complicated revert this commit and go with previous tested version for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes thanks for the catch, you are right self.original_handler = signal.getsignal(sig)
will have the last signal. When I tested with 2 signals, it worked though bcoz the actual handlers are functioning fine and self.original_handler = signal.getsignal(sig)
just seems to be used to reset the handlers when the signal is received. Probably, that's why it dint show up as an error in my testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I'll go ahead and revert the commit and have the previous code merged first and then look into this.
74b17c4
to
2f1e826
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ready to merge
fd50914
to
c04f57a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
dbb20f4
to
e42618e
Compare
Add preemption functionality in preemption_callback.py under utils Refactor the code to move NemoModelCheckpoint callback under callbacks Signed-off-by: Abhishree <[email protected]>
1) Rename nemo/collections/common/callbacks/nemomodelcheckpoint.py to nemo/utils/callbacks/nemo_model_checkpoint.py 2) Rename nemo/utils/preemption_callback.py to nemo/utils/callbacks/preemption.py 3) Add docstrings, headers, logging and check for torch distributed Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree <[email protected]>
1) Add boolean flag for createing preemption callback 2) Make sig arg in PreemptionCallback as None 3) Other minor modifications and code comments Signed-off-by: Abhishree <[email protected]>
… if unavailable Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree <[email protected]>
…lass from exp_manager.py Signed-off-by: Abhishree <[email protected]>
c77f90f
to
0a329da
Compare
for more information, see https://pre-commit.ci
import re | ||
from copy import deepcopy | ||
from pathlib import Path | ||
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union |
Check notice
Code scanning / CodeQL
Unused import
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a later PR lets clean this part up.
self.best_model_path = best_k_models[0] | ||
self.best_model_score = self.best_k_models[self.best_model_path] | ||
|
||
def on_save_checkpoint(self, trainer, pl_module, checkpoint): |
Check notice
Code scanning / CodeQL
Explicit returns mixed with implicit (fall through) returns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool ! Looks good, thanks !
import re | ||
from copy import deepcopy | ||
from pathlib import Path | ||
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a later PR lets clean this part up.
* Adding basic preemption code Add preemption functionality in preemption_callback.py under utils Refactor the code to move NemoModelCheckpoint callback under callbacks Signed-off-by: Abhishree <[email protected]> * Adding the following modifications 1) Rename nemo/collections/common/callbacks/nemomodelcheckpoint.py to nemo/utils/callbacks/nemo_model_checkpoint.py 2) Rename nemo/utils/preemption_callback.py to nemo/utils/callbacks/preemption.py 3) Add docstrings, headers, logging and check for torch distributed Signed-off-by: Abhishree <[email protected]> * Minor edit in preemption.py Signed-off-by: Abhishree <[email protected]> * Removing unused imports Signed-off-by: Abhishree <[email protected]> * Remove device arg from PreemptionCallback class Signed-off-by: Abhishree <[email protected]> * Add more details in the NeMoModelCheckpointdocstring Signed-off-by: Abhishree <[email protected]> * Add the following modifications: 1) Add boolean flag for createing preemption callback 2) Make sig arg in PreemptionCallback as None 3) Other minor modifications and code comments Signed-off-by: Abhishree <[email protected]> * Modify torch cuda and distributed available checks to skip preemption if unavailable Signed-off-by: Abhishree <[email protected]> * Add preemption_enabled flag to preemption.py Signed-off-by: Abhishree <[email protected]> * Update nemo_model_checkpoint.py with the latest NemoModelCheckpoint class from exp_manager.py Signed-off-by: Abhishree <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]>
Add preemption functionality in preemption_callback.py under utils
Refactor the code to move NemoModelCheckpoint callback under callbacks
What does this PR do ?
Adding basic functionality for cluster preemption.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information