
Updates in Trainer to support partial checkpointing for SM Model Parallel library #16314

Closed

cavdard wants to merge 65 commits into huggingface:main from cavdard:smp_trainer_partial_chekckpoint

Conversation

@cavdard (Contributor) commented Mar 21, 2022

What does this PR do?

  • Adds two new training args (smp_save_partial and smp_load_partial) to support partial checkpointing with SMP (a sketch of the intended save/load flow follows this list).
  • Uses the right ranks for partial checkpoint saving in should_save.
  • Uses local_state_dict() when saving partial checkpoints.
  • Uses smp.save instead of torch.save when partial checkpoint saving is enabled.
  • Uses smp.load instead of torch.load when partial checkpoint loading is enabled, and reorders partial checkpoint loading to happen after the model is wrapped, since smp.load can only load into an SMP model.
  • Updates the checks for the existence of checkpoint files, since SMP partial checkpoints append postfixes to the filename (for example, filename_0_0 or filename_0_0_0).
  • Switches from smp_gather to distributed_concat for DDP, since smp_gather causes increased memory usage on GPU 0 when tensor parallelism is enabled.
  • Adds load_best_model_at_end support for SMP.
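
To make the bullets above concrete, here is a minimal sketch of how the two new flags are meant to steer the save/load paths. It is illustrative only, not this PR's actual diff: the helper names, the WEIGHTS_NAME value, and the final load_state_dict call are assumptions, while smp.save, smp.load(partial=...), and local_state_dict() are the calls this PR relies on.

import os

import torch
import smdistributed.modelparallel.torch as smp

WEIGHTS_NAME = "pytorch_model.bin"  # assumed checkpoint filename


def save_checkpoint(model, args, output_dir):
    path = os.path.join(output_dir, WEIGHTS_NAME)
    if args.smp_save_partial:
        # Partial checkpoint: each rank saves only its own shard via
        # local_state_dict(); smp.save appends rank postfixes such as _0_0.
        smp.save(model.local_state_dict(), path, partial=True)
    else:
        torch.save(model.state_dict(), path)


def load_checkpoint(model, args, checkpoint_dir):
    path = os.path.join(checkpoint_dir, WEIGHTS_NAME)
    if args.smp_load_partial:
        # smp.load can only load into a model already wrapped by SMP, which is
        # why this PR moves checkpoint loading to after the model is wrapped.
        state_dict = smp.load(path, partial=True)
    else:
        state_dict = torch.load(path, map_location="cpu")
    model.load_state_dict(state_dict)  # assumed final step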

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

cavdard and others added 18 commits March 28, 2022 14:42
…e#16591)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* add a template to add missing tokenization test

* add cookiecutter setting

* improve doc

* Update templates/adding_a_missing_tokenization_test/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* [Doctests] Correct filenaming

* improve quicktour

* make style
…uggingface#15994)

* Adding new train_step logic to make things less confusing for users

* DO NOT ASK WHY WE NEED THAT SUBCLASS

* Metrics now working, at least for single-output models with type annotations!

* Updates and TODOs for the new train_step

* Make fixup

* Temporary test workaround until T5 has types

* Temporary test workaround until T5 has types

* I think this actually works! Needs a lot of tests though

* MAke style/quality

* Revert changes to T5 tests

* Deleting the aforementioned unmentionable subclass

* Deleting the aforementioned unmentionable subclass

* Adding a Keras API test

* Style fixes

* Removing unneeded TODO and comments

* Update test_step too

* Stop trying to compute metrics with the dummy_loss, patch up test

* Make style

* make fixup

* Docstring cleanup

* make fixup

* make fixup

* Stop expanding 1D input tensors when using dummy loss

* Adjust T5 test given the new compile()

* make fixup

* Skipping test for convnext

* Removing old T5-specific Keras test now that we have a common one

* make fixup

* make fixup

* Only skip convnext test on CPU

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Avoiding TF import issues

* make fixup

* Update compile() to support TF 2.3

* Skipping model.fit() on template classes for now

* Skipping model.fit() on template class tests for now

* Replace ad-hoc solution with find_labels

* make fixup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model 

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

* Type hints for BigBird

* removing typos

Co-authored-by: matt <rocketknight1@gmail.com>
If global_attention_mask is found in the model's inputs (used by certain
models, like LED) in the prediction_step method of Seq2SeqTrainer,
it is added to the gen_kwargs, which are passed to model.decode().
This allows us to properly set the global attention when decoding.
* [benchmark tool] trainer-benchmark.py

* improve

* massive rework/expansion

* fix

* mucho improved

* improved

* fix prefix

* fix

* fix diff calculation

* address suggestions
for filename in os.listdir(save_directory):
    full_filename = os.path.join(save_directory, filename)
    if filename.startswith(WEIGHTS_NAME[:-4]) and os.path.isfile(full_filename):
        os.remove(full_filename)

Is this not needed for SMP?
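
For context, a small illustration of what the prefix check above matches. WEIGHTS_NAME is the usual "pytorch_model.bin" constant; the exact name of an SMP partial shard is an assumption based on the "filename_0_0" postfix scheme described in this PR:

WEIGHTS_NAME = "pytorch_model.bin"
prefix = WEIGHTS_NAME[:-4]  # "pytorch_model" (drops the ".bin" suffix)

# Both a full checkpoint and an assumed partial shard match the prefix,
# so the cleanup loop would also remove stale partial files.
print("pytorch_model.bin".startswith(prefix))      # True
print("pytorch_model.bin_0_0".startswith(prefix))  # True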

model = self._wrap_model(self.model_wrapped)

if resume_from_checkpoint is not None:
    if is_sagemaker_mp_enabled():

can merge these two ifs
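
One possible shape of the merged check this comment is suggesting (illustrative only; the body of the branch is elided):

if resume_from_checkpoint is not None and is_sagemaker_mp_enabled():
    # a single combined condition instead of two nested ifs
    ...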

if resume_from_checkpoint is not None:
    if is_sagemaker_mp_enabled():
        if self.args.smp_load_partial:
            state_dict = smp.load(os.path.join(resume_from_checkpoint, WEIGHTS_NAME), partial=self.args.smp_load_partial)

You can use smp.load for both cases, as long as partial=self.args.smp_load_partial
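
A sketch of that simplification, reusing the names from the snippet above (illustrative only, not the final diff):

# One smp.load call covers both the partial and the full load, because the
# `partial` flag comes directly from the training arg.
state_dict = smp.load(
    os.path.join(resume_from_checkpoint, WEIGHTS_NAME),
    partial=self.args.smp_load_partial,
)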

if is_sagemaker_mp_enabled():
    if self.args.smp_load_partial:
        state_dict = smp.load(best_model_path, partial=self.args.smp_load_partial)
    else:

same comment to simplify here

if self.args.smp_save_partial:
    opt_state_dict = self.optimizer.local_state_dict()
else:
    opt_state_dict = self.optimizer.state_dict()

We wanted to standardize on gather_if_shard=False here.

When we do that, which processes need to save the partial state dict changes: if shard_optimizer_state is enabled, all processes save; otherwise only rdp_rank == 0 saves (see the sketch below).
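
A rough sketch of that rank logic; the helper name and its boolean parameter are illustrative, while rdp_rank() is the SMP call referenced above:

import smdistributed.modelparallel.torch as smp

def partial_opt_state_should_be_saved(shard_optimizer_state: bool) -> bool:
    # With gather_if_shard=False every process holds only its own shard of the
    # optimizer state, so every process must write it; otherwise one writer per
    # reduced-data-parallel group (rdp_rank 0) is enough.
    if shard_optimizer_state:
        return True
    return smp.rdp_rank() == 0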

stevhliu and others added 9 commits April 5, 2022 12:48
* 📝 add image/vision classification and asr

* 🖍 minor formatting fixes

* Fixed a typo in legacy seq2seq_trainer.py (huggingface#16531)

* Add ONNX export for BeiT (huggingface#16498)

* Add beit onnx conversion support

* Updated docs

* Added cross reference to ViT ONNX config

* call on_train_end when trial is pruned (huggingface#16536)

* Type hints added (huggingface#16529)

* Fix Bart type hints (huggingface#16297)

* Add type hints to PLBart PyTorch

* Remove pending merge conflicts

* Fix PLBart Type Hints

* Add changes from review

* Add VisualBert type hints (huggingface#16544)

* Adding missing type hints for mBART model (PyTorch) (huggingface#16429)

* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model 

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

Co-authored-by: matt <rocketknight1@gmail.com>

* Remove MBart subclass of XLMRoberta in tokenizer docs (huggingface#16546)

* Remove MBart subclass of XLMRoberta in tokenizer

* Fix style

* Copy docs from MBart50 tokenizer

* Use random_attention_mask for TF tests (huggingface#16517)

* use random_attention_mask for TF tests

* Fix for TFCLIP test (for now).

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Improve code example (huggingface#16450)

Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>

* Pin tokenizers version <0.13 (huggingface#16539)

* Pin tokenizers version <0.13

* Style

* Add code samples for TF speech models (huggingface#16494)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* [FlaxSpeechEncoderDecoder] Fix dtype bug (huggingface#16581)

* [FlaxSpeechEncoderDecoder] Fix dtype bug

* more fixes

* Making the impossible to connect error actually report the right URL. (huggingface#16446)

* Fix flax import in __init__.py: modeling_xglm -> modeling_flax_xglm (huggingface#16556)

* Add utility to find model labels (huggingface#16526)

* Add utility to find model labels

* Use it in the Trainer

* Update src/transformers/utils/generic.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Quality

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Enable doc in Spanish (huggingface#16518)

* Reorganize doc for multilingual support

* Fix style

* Style

* Toc trees

* Adapt templates

* Add use_auth to load_datasets for private datasets to PT and TF examples (huggingface#16521)

* fix formatting and remove use_auth

* Add use_auth_token to Flax examples

* add a test checking the format of `convert_tokens_to_string`'s output (huggingface#16540)

* add new tests

* add comment to overridden tests

* TF: Finalize `unpack_inputs`-related changes (huggingface#16499)

* Add unpack_inputs to remaining models

* removed kwargs to `call()` in TF models

* fix TF T5 tests

* [SpeechEncoderDecoderModel] Correct Encoder Last Hidden State Output (huggingface#16586)

* initialize the default rank set on TrainerState (huggingface#16530)

* initialize the default rank set on TrainerState

* fix style

* Trigger doc build

* Fix CI: test_inference_for_pretraining in ViTMAEModelTest (huggingface#16591)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* add a template to add missing tokenization test (huggingface#16553)

* add a template to add missing tokenization test

* add cookiecutter setting

* improve doc

* Update templates/adding_a_missing_tokenization_test/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* made _load_pretrained_model_low_mem static + bug fix (huggingface#16548)

* handle torch_dtype in low cpu mem usage (huggingface#16580)

* [Doctests] Correct filenaming (huggingface#16599)

* [Doctests] Correct filenaming

* improve quicktour

* make style

* Adding new train_step logic to make things less confusing for users (huggingface#15994)

* Adding new train_step logic to make things less confusing for users

* DO NOT ASK WHY WE NEED THAT SUBCLASS

* Metrics now working, at least for single-output models with type annotations!

* Updates and TODOs for the new train_step

* Make fixup

* Temporary test workaround until T5 has types

* Temporary test workaround until T5 has types

* I think this actually works! Needs a lot of tests though

* MAke style/quality

* Revert changes to T5 tests

* Deleting the aforementioned unmentionable subclass

* Deleting the aforementioned unmentionable subclass

* Adding a Keras API test

* Style fixes

* Removing unneeded TODO and comments

* Update test_step too

* Stop trying to compute metrics with the dummy_loss, patch up test

* Make style

* make fixup

* Docstring cleanup

* make fixup

* make fixup

* Stop expanding 1D input tensors when using dummy loss

* Adjust T5 test given the new compile()

* make fixup

* Skipping test for convnext

* Removing old T5-specific Keras test now that we have a common one

* make fixup

* make fixup

* Only skip convnext test on CPU

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Avoiding TF import issues

* make fixup

* Update compile() to support TF 2.3

* Skipping model.fit() on template classes for now

* Skipping model.fit() on template class tests for now

* Replace ad-hoc solution with find_labels

* make fixup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding missing type hints for BigBird model   (huggingface#16555)

* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model 

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

* Type hints for BigBird

* removing typos

Co-authored-by: matt <rocketknight1@gmail.com>

* [deepspeed] fix typo, adjust config name (huggingface#16597)

* 🖍 apply feedback

Co-authored-by: Cathy <815244047@qq.com>
Co-authored-by: Jim Rohrer <jrohrer1@gmail.com>
Co-authored-by: Ferdinand Schlatt <fschlatt@gmail.com>
Co-authored-by: Dahlbomii <101373053+Dahlbomii@users.noreply.github.com>
Co-authored-by: Gunjan Chhablani <chhablani.gunjan@gmail.com>
Co-authored-by: Rishav Chandra Varma <rishavchandra.v16@iiits.in>
Co-authored-by: matt <rocketknight1@gmail.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniel Stancl <46073029+stancld@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Karim Foda <35491698+KMFODA@users.noreply.github.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Andres Codas <andrescodas@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
Co-authored-by: Francesco Saverio Zuppichini <francesco.zuppichini@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Completed documentation of CTRL

* Missing optional None

* Added return types

* updated imports

* Update modeling_ctrl.py
* fix bart and mbart

* add ckpt names as variables

* fix mbart

* fix plbart

* use variable for ckpt name
…16609)

* Use CLIP model's config for some fields (if specified) instead of those of vision & text components.

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
lmvasque and others added 25 commits April 7, 2022 10:02
* Add inputs vector to calculate metric method

* Include inputs for evaluation metrics with backwards compatibility

* Prevent inputs create OOM issue and documentation details

* Update style and code documentation

* Fix style formatting issues

* Update files format with make style
…ate_dict (huggingface#16643)

* Updated _load_pretrained_model_low_mem to check if keys are in the stored state_dict

* update after conversions
* Update README.md Support Image

Updates the Support image linking to our EAP page (to give it a refresh + help avoid image fatigue).

Slack thread checking in with #open-source-internal on this update (https://huggingface.slack.com/archives/C021H1P1HKR/p1648838903316709)

* Compressed Updated Support image

* Improves Support Image Logo + Height

Updated the image based on logo + size feedback. Big thanks to Bibi for making quick edits to this image.
* base model done

* make style

* done

* added files

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Trigger doc build

* resolved conversations

* resolved conversations

* seer models

* minor changes

* minor changes

* make fixup

* glob variables

* minor changes

* fix copies

* config when possible

* resolved conflicts

* resolved conflicts

* resolved conflicts

* CI

* conversion script for 10b param

* fixed for 10b model

* minor updates in the doc + make style

* removed unused code

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* removed unused code

* removed unused code

* updated modeling_utils from main

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
* Add TapexTokenizer

* Improve docstrings and provide option to provide answer

* Remove option for pretokenized inputs

* Add TAPEX to README

* Fix copies

* Remove option for pretokenized inputs

* Initial commit: add tapex fine-tuning examples on both table-based question answering and table-based fact verification.

* - Draft a README file for running the script and introducing some background.
- Remove unused code lines in tabfact script.
- Disable the default `pad_to_max_length` option which is memory-consuming.

* * Support `as_target_tokenizer` function for TapexTokenizer.
* Fix the do_lower_case behaviour of TapexTokenizer.
* Add unit tests for target scenarios and cased/uncased scenarios for both source and target.

* * Replace the label BartTokenizer with TapexTokenizer's as_target_tokenizer function.
* Fix typos in tapex example README.

* * fix the evaluation script - remove the property `task_name`

* * Make the label space more clear for tabfact tasks

* * Using a new fine-tuning script for tapex-base on tabfact.

* * Remove the lowercase code outside the tokenizer - we use the tokenizer to control whether do_lower_case
* Guarantee the hyper-parameter can be run without out-of-memory on 16GB card and report the new reproduced number on wikisql

* * Remove the default tokenizer_name option.
* Provide evaluation command.

* * Support for WikiTableQuestion dataset.

* Fix a typo in README.

* * Fix the datasets's key name in WikiTableQuestions

* Run make fixup and move test to folder

* Fix quality

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply some more suggestions from code review

* Improve docstrings

* Overwrite failing test

* Improve comment in example scripts

* Fix rebase

* Add TAPEX to Auto mapping

* Add TAPEX to auto config mappings

* Put TAPEX higher than BART in auto mapping

* Add TAPEX to doc tests

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MBP.localdomain>
Co-authored-by: SivilTaram <qianlxc@outlook.com>
Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
* add vit tf doctest with @add_code_sample_docstrings

* add labels string back in

Co-authored-by: Johannes Kolbe <johannes.kolbe@tech.better.team>
The default value of `padding` in `DataCollatorWithPadding` is `True`, not `False`.
* fix QA sample

* For TF_QUESTION_ANSWERING_SAMPLE

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Fixed some bugs involving saving during epochs
* Added tests mimicking the existing examples tests
* Added in json exporting to all `no_trainer` examples for consistency
* [Trainer] tf32 arg doc

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* ✨ update audio examples with minds dataset

* 🖍 make style

* 🖍 minor fixes for doctests
@cavdard cavdard marked this pull request as draft April 9, 2022 02:18
@LysandreJik (Member)

Hey! It seems a bad rebase/merge happened on your PR. Usually, closing this PR and opening a new one from the same branch solves the problem.

@cavdard (Contributor, Author) commented Apr 12, 2022

Closing this PR.
Created a new PR

@github-actions (bot) commented May 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this May 15, 2022