Adding basic preemption code #6161

Merged
merged 11 commits into from
Apr 6, 2023

Conversation

Collaborator

@athitten athitten commented Mar 9, 2023

Add preemption functionality in preemption_callback.py under utils
Refactor the code to move NemoModelCheckpoint callback under callbacks

What does this PR do ?

Adding basic functionality for cluster preemption.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the common label Mar 9, 2023
@athitten athitten force-pushed the athitten/cluster_preemption branch 2 times, most recently from b50e092 to 483f932 Compare March 16, 2023 04:35
@github-actions github-actions bot removed the common label Mar 16, 2023
continue
index = checkpoint.find(self.monitor) + len(self.monitor) + 1 # Find monitor in str + 1 for '='
if index != -1:
match = re.search('[A-z]', checkpoint[index:])

Check warning

Code scanning / CodeQL

Overly permissive regular expression range

Suspicious character range that is equivalent to [A-Z\[\\\]^_`a-z].
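As a side note on the warning above: `[A-z]` covers every ASCII code point from `A` (65) to `z` (122), so it also matches `[`, `\`, `]`, `^`, `_` and the backtick. A minimal sketch of the difference, assuming the intent was to match letters only. The helper name and the sample string are made up for illustration and are not from the PR:

```python
import re

# '[A-z]' spans ASCII 65..122, which includes '[', '\\', ']', '^', '_' and '`'.
permissive = re.compile(r'[A-z]')
# Letters only, which is presumably what was meant.
strict = re.compile(r'[A-Za-z]')


def first_match_offset(text, pattern):
    """Return the offset of the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.start() if match else None


sample = "0.1234_step"  # hypothetical tail of a checkpoint file name
print(first_match_offset(sample, permissive))  # stops at the underscore
print(first_match_offset(sample, strict))      # stops at the first letter
```

With the permissive range the match lands on the underscore, one character earlier than the first real letter.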
try:
self._fs.rm(filepath)
logging.info(f"Removed checkpoint: {filepath}")
except:

Check notice

Code scanning / CodeQL

Except block handles 'BaseException'

Except block directly handles BaseException.
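One way to address the notice above, sketched under the assumption that only ordinary failures should be swallowed. `remove_quietly` and its `remove` parameter are hypothetical; in the PR the call is `self._fs.rm(filepath)`, and the failure message here is illustrative:

```python
import logging


def remove_quietly(remove, filepath):
    """Call remove(filepath); log and swallow ordinary errors only."""
    try:
        remove(filepath)
        logging.info(f"Removed checkpoint: {filepath}")
        return True
    except Exception:
        # 'except Exception' does not catch KeyboardInterrupt or SystemExit,
        # so Ctrl-C and interpreter shutdown still propagate to the caller.
        logging.warning(f"Could not remove checkpoint: {filepath}")
        return False
```

A bare `except:` would also swallow `KeyboardInterrupt` and `SystemExit`, which is what CodeQL is flagging.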
@athitten athitten force-pushed the athitten/cluster_preemption branch 3 times, most recently from 27891dc to 34a5964 Compare March 30, 2023 17:11
@athitten athitten marked this pull request as ready for review March 30, 2023 17:13
@athitten athitten requested a review from titu1994 March 30, 2023 17:14
Collaborator

@ericharper ericharper left a comment

Everything looks good to me. Once we pass CI and verify it's working on the clusters, let's merge it.

@athitten athitten force-pushed the athitten/cluster_preemption branch from 99553e5 to 65bc34d Compare March 30, 2023 21:56
Collaborator

@titu1994 titu1994 left a comment

Some minor comments

PreemptionCallback is always enabled.
"""

def __init__(self, checkpoint_callback, sig=signal.SIGTERM):
Collaborator

This should be None by default, and if None then self.sig = [signal.SIGTERM]. That way BCP can pass other signals.

Collaborator Author

@titu1994 thanks, made this change
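A rough sketch of the suggestion above, assuming the callback should accept either a single signal or a list of them. `PreemptionSketch` is a hypothetical stand-in that shows only the constructor defaulting; it is not the real `PreemptionCallback`:

```python
import signal


class PreemptionSketch:
    """Hypothetical stand-in showing only the sig=None defaulting."""

    def __init__(self, checkpoint_callback, sig=None):
        self.checkpoint_callback = checkpoint_callback
        if sig is None:
            # Default to SIGTERM when the caller does not choose a signal.
            sig = [signal.SIGTERM]
        elif isinstance(sig, int):
            # signal.Signals is an IntEnum, so a single signal is an int.
            sig = [sig]
        self.sig = list(sig)
```

Using None as the default keeps the signature free of a hard-coded signal, so a cluster that sends a different signal can pass its own.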


@property
def interrupted(self):
interrupted = torch.tensor(self._interrupted).int().to(torch.device('cuda'))
Collaborator

Instead of .to(), use device= inside of the torch.tensor(..., device=torch.cuda.current_device(), dtype=torch.int32).

Collaborator Author

Added this too

nemo/utils/callbacks/preemption.py (resolved)
@@ -1122,6 +888,8 @@ def configure_checkpointing(
if 'mp_rank' in checkpoint_callback.last_model_path or 'tp_rank' in checkpoint_callback.last_model_path:
checkpoint_callback.last_model_path = uninject_model_parallel_rank(checkpoint_callback.last_model_path)
trainer.callbacks.append(checkpoint_callback)
preemption_callback = PreemptionCallback(checkpoint_callback)
Collaborator

Will this be the default? I would suggest adding a bool flag that enables it by default but allows it to be disabled if wanted.

Collaborator Author

Makes sense, changed this in the latest commit
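A sketch of what such a flag could look like. The names `ExpConfigSketch`, `create_preemption_callback` and `build_callbacks` are assumptions for illustration and not necessarily what the PR ended up with:

```python
from dataclasses import dataclass


@dataclass
class ExpConfigSketch:
    # Hypothetical flag: on by default, can be switched off in the config.
    create_preemption_callback: bool = True


def build_callbacks(cfg, checkpoint_callback, make_preemption_callback):
    """Return the callback list, adding the preemption callback only if enabled."""
    callbacks = [checkpoint_callback]
    if cfg.create_preemption_callback:
        callbacks.append(make_preemption_callback(checkpoint_callback))
    return callbacks
```

Passing the factory in as `make_preemption_callback` keeps the sketch independent of any real callback class.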

@athitten athitten force-pushed the athitten/cluster_preemption branch from fc56c8a to 34bcc1f Compare April 3, 2023 19:19
Collaborator

@titu1994 titu1994 left a comment

Preemption signal is currently not configurable from the config

nemo/utils/exp_manager.py (outdated, resolved)
nemo/utils/exp_manager.py (resolved)
nemo/utils/exp_manager.py (outdated, resolved)
@athitten athitten force-pushed the athitten/cluster_preemption branch from 8a5eb1d to 4f3a280 Compare April 5, 2023 05:47
nemo/utils/callbacks/preemption.py (fixed, resolved)
nemo/utils/callbacks/preemption.py (fixed)
Collaborator

@titu1994 titu1994 left a comment

I think the current version doesn't work, as it just overwrites every self.xyz attribute with the value for the last signal.

Let's revert the commit and merge earlier version.

# Bool var that's initialized to false and made True upon receving the preemption signal
self._interrupted = False
self.released = False
self.original_handler = signal.getsignal(sig)
Collaborator

This just gets overridden by whatever is the last signal now

Collaborator

We don't need special logic for each signal. Listen for all of them; if any one is received, ignore the rest and call the preemption code.

Collaborator

If this logic is getting too complicated revert this commit and go with previous tested version for now.

Collaborator Author

Yes, thanks for the catch. You are right: self.original_handler = signal.getsignal(sig) will hold only the handler for the last signal. When I tested with 2 signals it still worked, because the actual handlers function fine and self.original_handler just seems to be used to reset the handlers when the signal is received. That is probably why it didn't show up as an error in my testing.

Collaborator Author

For now, I'll go ahead and revert the commit and have the previous code merged first and then look into this.
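For when this gets revisited, a possible shape for the multi-signal case, assuming the goal is one shared flag plus a per-signal record of the original handlers. `MultiSignalFlag` is hypothetical, uses only the standard `signal` module, and has to be installed from the main thread:

```python
import signal


class MultiSignalFlag:
    """Hypothetical helper: one flag, many signals, per-signal original handlers."""

    def __init__(self, sigs=None):
        self.sigs = [signal.SIGTERM] if sigs is None else list(sigs)
        self.interrupted = False
        self.original_handlers = {}

    def install(self):
        for sig in self.sigs:
            # Keyed by signal, so a later signal cannot overwrite an earlier entry.
            self.original_handlers[sig] = signal.getsignal(sig)
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Whichever signal arrives first wins; restore the rest and set the flag.
        self.release()
        self.interrupted = True

    def release(self):
        for sig, handler in self.original_handlers.items():
            # getsignal() returns None for handlers not installed from Python;
            # fall back to the default action in that case.
            signal.signal(sig, signal.SIG_DFL if handler is None else handler)
        self.original_handlers.clear()
```

Storing the handlers in a dict avoids the single `self.original_handler` attribute being overwritten on each loop iteration, which is the issue raised in this thread.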

nemo/utils/callbacks/preemption.py (fixed, resolved)
@athitten athitten force-pushed the athitten/cluster_preemption branch from 74b17c4 to 2f1e826 Compare April 5, 2023 15:25
titu1994 previously approved these changes Apr 5, 2023
Collaborator

@titu1994 titu1994 left a comment

Ready to merge

@athitten athitten force-pushed the athitten/cluster_preemption branch from fd50914 to c04f57a Compare April 5, 2023 18:27
titu1994 previously approved these changes Apr 5, 2023
Collaborator

@titu1994 titu1994 left a comment

Lgtm

Add preemption functionality in preemption_callback.py under utils
Refactor the code to move NemoModelCheckpoint callback under callbacks

Signed-off-by: Abhishree <[email protected]>
1) Rename nemo/collections/common/callbacks/nemomodelcheckpoint.py to nemo/utils/callbacks/nemo_model_checkpoint.py
2) Rename nemo/utils/preemption_callback.py to nemo/utils/callbacks/preemption.py
3) Add docstrings, headers, logging and check for torch distributed

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree <[email protected]>
1) Add boolean flag for creating preemption callback
2) Make sig arg in PreemptionCallback as None
3) Other minor modifications and code comments

Signed-off-by: Abhishree <[email protected]>
@athitten athitten force-pushed the athitten/cluster_preemption branch from c77f90f to 0a329da Compare April 6, 2023 19:12
import re
from copy import deepcopy
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Tuple, Union

Check notice

Code scanning / CodeQL

Unused import

Import of 'List' is not used. Import of 'Any' is not used. Import of 'Tuple' is not used. Import of 'Dict' is not used.
Collaborator

In a later PR, let's clean this part up.

self.best_model_path = best_k_models[0]
self.best_model_score = self.best_k_models[self.best_model_path]

def on_save_checkpoint(self, trainer, pl_module, checkpoint):

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns

Mixing implicit and explicit returns may indicate an error as implicit returns always return None.
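For reference, the pattern this notice describes, shown in hypothetical functions unrelated to the PR's code: one branch returns a value while the other falls off the end and returns None implicitly. Making the None explicit is one common way to resolve it:

```python
def describe_mixed(score):
    if score is not None:
        return f"score={score:.2f}"
    # Falls through here, so the function implicitly returns None.


def describe_explicit(score):
    if score is None:
        return None  # the "nothing to report" case is now visible
    return f"score={score:.2f}"
```

Both functions behave the same; the second just makes the None case deliberate rather than accidental.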
Collaborator

@titu1994 titu1994 left a comment

Cool ! Looks good, thanks !

@titu1994 titu1994 merged commit 9ba4abd into main Apr 6, 2023
@titu1994 titu1994 deleted the athitten/cluster_preemption branch April 6, 2023 23:32
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023
* Adding basic preemption code

Add preemption functionality in preemption_callback.py under utils
Refactor the code to move NemoModelCheckpoint callback under callbacks

Signed-off-by: Abhishree <[email protected]>

* Adding the following modifications

1) Rename nemo/collections/common/callbacks/nemomodelcheckpoint.py to nemo/utils/callbacks/nemo_model_checkpoint.py
2) Rename nemo/utils/preemption_callback.py to nemo/utils/callbacks/preemption.py
3) Add docstrings, headers, logging and check for torch distributed

Signed-off-by: Abhishree <[email protected]>

* Minor edit in preemption.py

Signed-off-by: Abhishree <[email protected]>

* Removing unused imports

Signed-off-by: Abhishree <[email protected]>

* Remove device arg from PreemptionCallback class

Signed-off-by: Abhishree <[email protected]>

* Add more details in the NeMoModelCheckpoint docstring

Signed-off-by: Abhishree <[email protected]>

* Add the following modifications:

1) Add boolean flag for creating preemption callback
2) Make sig arg in PreemptionCallback as None
3) Other minor modifications and code comments

Signed-off-by: Abhishree <[email protected]>

* Modify torch cuda and distributed available checks to skip preemption if unavailable

Signed-off-by: Abhishree <[email protected]>

* Add preemption_enabled flag to preemption.py

Signed-off-by: Abhishree <[email protected]>

* Update nemo_model_checkpoint.py with the latest NemoModelCheckpoint class from exp_manager.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: hsiehjackson <[email protected]>