Temporary workaround for loading best model at end with DeepSpeed #95

Merged: regisss merged 2 commits into main from fix_load_best_model_deepspeed on Sep 9, 2022

Conversation

@regisss regisss (Collaborator) commented Sep 9, 2022

What does this PR do?

Loading the best model at the end of training with --load_best_model_at_end fails with the current version of Habana DeepSpeed (0.6.1, see huggingface/transformers#17114).
This PR adds a temporary workaround: at the end of training, the best model is loaded as a regular PyTorch model rather than as a DeepSpeed engine. This should not be an issue because the best model is loaded for inference only: ZeRO-3 has not been validated yet (see here), and ZeRO-1/2 are useful for training only.
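
The idea of the workaround can be sketched as follows. This is a minimal illustration and not the diff merged in this PR: the helper names `best_model_weights_path` and `load_best_model_as_plain_pytorch` are hypothetical, and it assumes the best checkpoint folder contains a plain `pytorch_model.bin` state dict.

```python
import os

# Assumption: the best checkpoint folder holds the weights under this file name.
WEIGHTS_NAME = "pytorch_model.bin"


def best_model_weights_path(best_model_checkpoint):
    """Return the path to the plain PyTorch weights inside the best checkpoint folder."""
    path = os.path.join(best_model_checkpoint, WEIGHTS_NAME)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No {WEIGHTS_NAME} found in {best_model_checkpoint}")
    return path


def load_best_model_as_plain_pytorch(model, best_model_checkpoint):
    """Hypothetical helper: load the best weights without building a DeepSpeed engine."""
    import torch  # imported lazily so the path helper above works without PyTorch

    # Load the saved state dict on CPU and copy it into the existing model.
    state_dict = torch.load(best_model_weights_path(best_model_checkpoint), map_location="cpu")
    model.load_state_dict(state_dict)
    return model
```

Since the reloaded model is only used for inference, skipping the DeepSpeed engine here does not lose any of the ZeRO-1/2 optimizer-state partitioning that matters during training.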

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@regisss regisss requested a review from libinta September 9, 2022 00:07

HuggingFaceDocBuilderDev commented Sep 9, 2022

The documentation is not available anymore as the PR was closed or merged.

@regisss regisss merged commit 3d07854 into main Sep 9, 2022
@regisss regisss deleted the fix_load_best_model_deepspeed branch September 9, 2022 06:34
hsubramony pushed a commit that referenced this pull request Mar 13, 2024
yeonsily pushed a commit that referenced this pull request Mar 22, 2024

3 participants