Updates in Trainer to support partial checkpointing for SM Model Parallel library #16314
cavdard wants to merge 65 commits into huggingface:main from cavdard:smp_trainer_partial_chekckpoint
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
…cavdard/transformers into smp_trainer_partial_chekckpoint
Fix CI: test_inference_for_pretraining in ViTMAEModelTest (huggingface#16591)
Add a template for adding a missing tokenization test (huggingface#16553): add cookiecutter setting, improve the doc
[Doctests] Correct filenaming (huggingface#16599): improve quicktour, make style
Adding new train_step logic to make things less confusing for users (huggingface#15994): reworks the TF train_step/test_step and compile() logic (metrics without the dummy loss, no expansion of 1D input tensors, TF 2.3 support, find_labels instead of the ad-hoc solution), plus test and style fixes along the way
Adding missing type hints for the BigBird model (huggingface#16555), building on the earlier mBART type-hint work
If `global_attention_mask` is found in the model's inputs (used by certain models, like LED) in the `prediction_step` method of `Seq2SeqTrainer`, it is added to the `gen_kwargs` that are passed to `model.generate()`. This allows the global attention to be set properly when decoding.
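A minimal sketch of that behavior (an illustrative standalone helper, not the actual `Seq2SeqTrainer` code; the helper name and the `gen_kwargs` plumbing are assumptions):

```python
from typing import Any, Dict

import torch


def build_gen_kwargs(inputs: Dict[str, torch.Tensor], **gen_kwargs: Any) -> Dict[str, Any]:
    """Forward global_attention_mask (used by e.g. LED) to generate(), so the
    global attention is set properly while decoding."""
    if "global_attention_mask" in inputs:
        gen_kwargs["global_attention_mask"] = inputs["global_attention_mask"]
    return gen_kwargs


# Illustrative use inside a prediction step:
#   generated = model.generate(inputs["input_ids"], **build_gen_kwargs(inputs, max_length=128))
```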
[benchmark tool] trainer-benchmark.py: massive rework/expansion, with fixes to the prefix handling and diff calculation and reviewer suggestions addressed
```python
for filename in os.listdir(save_directory):
    full_filename = os.path.join(save_directory, filename)
    if filename.startswith(WEIGHTS_NAME[:-4]) and os.path.isfile(full_filename):
        os.remove(full_filename)
```
Is this not needed for SMP?
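For reference, `WEIGHTS_NAME[:-4]` strips the `.bin` suffix, so the loop above removes every file sharing the checkpoint prefix; a quick sketch of the matching logic (the per-rank `pytorch_model.bin_0` name is a hypothetical partial-checkpoint file):

```python
WEIGHTS_NAME = "pytorch_model.bin"
prefix = WEIGHTS_NAME[:-4]  # strips ".bin" -> "pytorch_model"

assert "pytorch_model.bin".startswith(prefix)    # full checkpoint matches
assert "pytorch_model.bin_0".startswith(prefix)  # hypothetical per-rank partial file
assert not "optimizer.pt".startswith(prefix)     # optimizer state is untouched
```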
src/transformers/trainer.py (outdated)
```python
model = self._wrap_model(self.model_wrapped)

if resume_from_checkpoint is not None:
    if is_sagemaker_mp_enabled():
```
src/transformers/trainer.py (outdated)
```python
if resume_from_checkpoint is not None:
    if is_sagemaker_mp_enabled():
        if self.args.smp_load_partial:
            state_dict = smp.load(os.path.join(resume_from_checkpoint, WEIGHTS_NAME), partial=self.args.smp_load_partial)
```
You can use `smp.load` for both cases, as long as `partial=self.args.smp_load_partial`.
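A minimal sketch of the suggested simplification, assuming the PR's proposed `smp_load_partial` flag (not the merged code; `is_sagemaker_mp_enabled` is the existing transformers helper):

```python
import os

import torch
import smdistributed.modelparallel.torch as smp
from transformers.utils import is_sagemaker_mp_enabled


def load_checkpoint_state_dict(args, resume_from_checkpoint, weights_name="pytorch_model.bin"):
    checkpoint_path = os.path.join(resume_from_checkpoint, weights_name)
    if is_sagemaker_mp_enabled():
        # smp.load covers both cases: partial=False behaves like a full load.
        return smp.load(checkpoint_path, partial=args.smp_load_partial)
    return torch.load(checkpoint_path, map_location="cpu")
```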
src/transformers/trainer.py (outdated)
```python
if is_sagemaker_mp_enabled():
    if self.args.smp_load_partial:
        state_dict = smp.load(best_model_path, partial=self.args.smp_load_partial)
    else:
```
The same comment applies here: simplify by using `smp.load` for both cases.
src/transformers/trainer.py (outdated)
```python
if self.args.smp_save_partial:
    opt_state_dict = self.optimizer.local_state_dict()
else:
    opt_state_dict = self.optimizer.state_dict()
```
We wanted to standardize on `gather_if_shard=False` here. When we do that, which processes need to save the partial state dict changes: if `shard_optimizer_state` is enabled, then all processes; otherwise only `rdp_rank() == 0`.
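A hedged sketch of that rule (assumes SMP's `smp.rdp_rank()` and a `shard_optimizer_state` config flag; illustrative, not the code merged in this PR):

```python
import smdistributed.modelparallel.torch as smp


def should_save_partial_opt_state(shard_optimizer_state: bool) -> bool:
    # With gather_if_shard=False, optimizer shards are never gathered onto a
    # single rank:
    #  - if the optimizer state is sharded, every process holds a unique shard
    #    and must save it;
    #  - otherwise the state is replicated across the reduced-data-parallel
    #    group, so only rdp_rank() == 0 needs to write it.
    if shard_optimizer_state:
        return True
    return smp.rdp_rank() == 0
```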
📝 Add image/vision classification and ASR docs; 🖍 minor formatting fixes; 🖍 apply feedback (rebased on upstream main, pulling in many upstream commits, e.g. huggingface#15994 through huggingface#16609, with their co-authors)
Completed documentation of CTRL: added missing Optional annotations and return types, updated imports (modeling_ctrl.py)
Fix BART and mBART doc examples: add checkpoint names as variables, fix mBART and PLBart, use a variable for the checkpoint name
Use CLIP model's config for some fields (if specified) instead of those of the vision & text components (huggingface#16609)
Add an inputs vector to the calculate-metric method: include inputs for evaluation metrics with backwards compatibility, prevent inputs from creating OOM issues, and update documentation and style
Updated `_load_pretrained_model_low_mem` to check if keys are in the stored state_dict (huggingface#16643)
Update README.md support image: link to the EAP page (to give it a refresh and help avoid image fatigue), compress the image, and improve the logo and height based on feedback. Slack thread checking in with #open-source-internal on this update (https://huggingface.slack.com/archives/C021H1P1HKR/p1648838903316709)
New model: base model done, conversion script for the 10B-parameter checkpoint, seer models, doc updates, fix-copies, and review suggestions applied
Add TapexTokenizer and TAPEX fine-tuning examples (table-based question answering on WikiSQL/WikiTableQuestions and table-based fact verification on TabFact), with docstrings, unit tests for source/target and cased/uncased scenarios, and Auto config/tokenizer mappings (TAPEX placed above BART)
Add ViT TF doctest with `@add_code_sample_docstrings`; add labels string back in
The default value of `padding` in `DataCollatorWithPadding` is `True`, not `False`.
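A small usage sketch of that default (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
collator = DataCollatorWithPadding(tokenizer)  # padding=True is already the default

batch = collator([tokenizer("short"), tokenizer("a somewhat longer sentence")])
print(batch["input_ids"].shape)  # both sequences padded to the longest in the batch
```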
Fix QA code sample (for `TF_QUESTION_ANSWERING_SAMPLE`)
Fixed some bugs involving saving during epochs; added tests mimicking the existing examples tests; added JSON exporting to all `no_trainer` examples for consistency
[Trainer] Document the `tf32` argument (src/transformers/training_args.py)
✨ Update audio examples with the MInDS dataset; 🖍 minor doctest and style fixes
…cavdard/transformers into smp_trainer_partial_chekckpoint
Hey! It seems a bad rebase/merge happened on your PR. Usually, closing this PR and opening a new one from the same branch solves the problem.
Closing this PR.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
- Adds two new arguments (`smp_save_partial` and `smp_load_partial`) to support partial checkpointing with SMP.
- Uses `local_state_dict()` with partial checkpoint saving.
- Uses `smp.save` instead of `torch.save` when partial checkpoint saving is enabled.
- Uses `smp.load` instead of `torch.load` when partial checkpoint loading is enabled. Reorders partial checkpoint loading to happen after the wrapping of the model, since `smp.load` can only load into an SMP model (see the sketch below).
- `smp_gather` is causing increased memory usage on GPU 0 when tensor parallelism is enabled; switches to `distributed_concat` for DDP.
- Adds `load_best_model_at_end` support for SMP.

Fixes # (issue)
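A condensed sketch of the save/load paths described above (`smp_save_partial` and `smp_load_partial` are the flags proposed in this PR; the helper names and paths are illustrative, not the Trainer's actual code):

```python
import os

import torch
import smdistributed.modelparallel.torch as smp

WEIGHTS_NAME = "pytorch_model.bin"


def save_model_checkpoint(trainer, output_dir):
    if trainer.args.smp_save_partial:
        # Each rank saves only its local partition of the model state.
        state_dict = trainer.model_wrapped.local_state_dict()
        smp.save(state_dict, os.path.join(output_dir, WEIGHTS_NAME), partial=True)
    else:
        torch.save(trainer.model.state_dict(), os.path.join(output_dir, WEIGHTS_NAME))


def load_model_checkpoint(trainer, resume_from_checkpoint):
    # Must run *after* _wrap_model: smp.load can only load into an SMP-wrapped model.
    state_dict = smp.load(
        os.path.join(resume_from_checkpoint, WEIGHTS_NAME),
        partial=trainer.args.smp_load_partial,
    )
    trainer.model_wrapped.load_state_dict(state_dict)
```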
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.