add war fix for sync issues #8130

Merged
merged 1 commit into from
Jan 5, 2024

@gshennvm gshennvm commented Jan 5, 2024

What does this PR do?

This is a workaround (WAR) for a deeper issue: checkpointing is not guarded by PyTorch distributed calls. As a result, the code sometimes tries to remove checkpoints that do not exist, and rmtree raises an error.
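The diff itself is not shown in this conversation, so the following is only a minimal sketch of the kind of guard the description suggests, not the code that was merged. It assumes the failure is a FileNotFoundError raised when a checkpoint directory has already been deleted; the helper name `remove_checkpoint_dir` is hypothetical.

```python
import shutil
from pathlib import Path


def remove_checkpoint_dir(path) -> bool:
    """Delete a checkpoint directory, tolerating the case where it is already gone.

    Hypothetical helper for illustration only.
    Returns True if this call removed the directory, False if it was absent.
    """
    try:
        shutil.rmtree(Path(path))
        return True
    except FileNotFoundError:
        # Another rank (or an earlier call) already deleted it; nothing to do.
        return False
```

Swallowing the error only hides the race. Addressing the underlying issue would more likely mean performing the deletion on a single rank and synchronizing the others, for example with torch.distributed.barrier().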

Signed-off-by: Gerald Shen <[email protected]>
@gshennvm gshennvm requested a review from cuichenx January 5, 2024 18:58
@github-actions github-actions bot added the NLP label Jan 5, 2024

gshennvm commented Jan 5, 2024

jenkins

@cuichenx cuichenx left a comment

LGTM, thanks. Let's plan to fix the underlying issue soon.

@gshennvm gshennvm merged commit d81ea6a into main Jan 5, 2024
15 checks passed
@gshennvm gshennvm deleted the geshen/checkpoint_sync_war branch January 5, 2024 23:49
@gshennvm gshennvm restored the geshen/checkpoint_sync_war branch January 5, 2024 23:49
@gshennvm gshennvm deleted the geshen/checkpoint_sync_war branch January 5, 2024 23:49
minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 19, 2024
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
Signed-off-by: Gerald Shen <[email protected]>
Signed-off-by: Sasha Meister <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024