
Updates in Trainer to support partial checkpointing for SM Model Parallel library #16314

Closed

cavdard wants to merge 65 commits into huggingface:main from cavdard:smp_trainer_partial_chekckpoint

Conversation

@cavdard (Contributor) commented Mar 21, 2022

What does this PR do?

  • Adds two new training args (smp_save_partial and smp_load_partial) to support partial checkpointing with SMP (a sketch of the intended save/load flow follows this list).
  • Uses the right ranks for partial checkpoint saving in should_save.
  • Uses local_state_dict() when saving partial checkpoints.
  • Uses smp.save instead of torch.save when partial checkpoint saving is enabled.
  • Uses smp.load instead of torch.load when partial checkpoint loading is enabled, and reorders partial checkpoint loading to happen after the model is wrapped, since smp.load can only load into an SMP model.
  • Updates the checks for the existence of checkpoint files, since SMP partial checkpoints append postfixes to the filename (for example, filename_0_0 or filename_0_0_0).
  • Switches from smp_gather to distributed_concat for DDP, since smp_gather causes increased memory usage on GPU 0 when tensor parallelism is enabled.
  • Adds load_best_model_at_end support for SMP.
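
To make the bullets above concrete, here is a minimal sketch of how the two new flags are meant to steer the save/load paths. It is illustrative only, not this PR's actual diff: the helper names, the WEIGHTS_NAME value, and the final load_state_dict call are assumptions, while smp.save, smp.load(partial=...), and local_state_dict() are the calls this PR relies on.

import os

import torch
import smdistributed.modelparallel.torch as smp

WEIGHTS_NAME = "pytorch_model.bin"  # assumed checkpoint filename


def save_checkpoint(model, args, output_dir):
    path = os.path.join(output_dir, WEIGHTS_NAME)
    if args.smp_save_partial:
        # Partial checkpoint: each rank saves only its own shard via
        # local_state_dict(); smp.save appends rank postfixes such as _0_0.
        smp.save(model.local_state_dict(), path, partial=True)
    else:
        torch.save(model.state_dict(), path)


def load_checkpoint(model, args, checkpoint_dir):
    path = os.path.join(checkpoint_dir, WEIGHTS_NAME)
    if args.smp_load_partial:
        # smp.load can only load into a model already wrapped by SMP, which is
        # why this PR moves checkpoint loading to after the model is wrapped.
        state_dict = smp.load(path, partial=True)
    else:
        state_dict = torch.load(path, map_location="cpu")
    model.load_state_dict(state_dict)  # assumed final step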

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

cavdard and others added 18 commits March 28, 2022 14:42
…e#16591)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* add a template to add missing tokenization test

* add cookiecutter setting

* improve doc

* Update templates/adding_a_missing_tokenization_test/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* [Doctests] Correct filenaming

* improve quicktour

* make style
…uggingface#15994)

* Adding new train_step logic to make things less confusing for users

* DO NOT ASK WHY WE NEED THAT SUBCLASS

* Metrics now working, at least for single-output models with type annotations!

* Updates and TODOs for the new train_step

* Make fixup

* Temporary test workaround until T5 has types

* Temporary test workaround until T5 has types

* I think this actually works! Needs a lot of tests though

* MAke style/quality

* Revert changes to T5 tests

* Deleting the aforementioned unmentionable subclass

* Deleting the aforementioned unmentionable subclass

* Adding a Keras API test

* Style fixes

* Removing unneeded TODO and comments

* Update test_step too

* Stop trying to compute metrics with the dummy_loss, patch up test

* Make style

* make fixup

* Docstring cleanup

* make fixup

* make fixup

* Stop expanding 1D input tensors when using dummy loss

* Adjust T5 test given the new compile()

* make fixup

* Skipping test for convnext

* Removing old T5-specific Keras test now that we have a common one

* make fixup

* make fixup

* Only skip convnext test on CPU

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Avoiding TF import issues

* make fixup

* Update compile() to support TF 2.3

* Skipping model.fit() on template classes for now

* Skipping model.fit() on template class tests for now

* Replace ad-hoc solution with find_labels

* make fixup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model 

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

* Type hints for BigBird

* removing typos

Co-authored-by: matt <rocketknight1@gmail.com>
If global_attention_mask is found in the model's inputs (used by certain
models, like LED) in the prediction_step method of Seq2SeqTrainer,
it is added to the gen_kwargs, which are passed to model.decode().
This allows us to properly set the global attention when decoding.
* [benchmark tool] trainer-benchmark.py

* improve

* massive rework/expansion

* fix

* mucho improved

* improved

* fix prefix

* fix

* fix diff calculation

* address suggestions
for filename in os.listdir(save_directory):
    full_filename = os.path.join(save_directory, filename)
    if filename.startswith(WEIGHTS_NAME[:-4]) and os.path.isfile(full_filename):
        os.remove(full_filename)

Is this not needed for SMP?
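
For context, a small illustration of what the prefix check above matches. WEIGHTS_NAME is the usual "pytorch_model.bin" constant; the exact name of an SMP partial shard is an assumption based on the "filename_0_0" postfix scheme described in this PR:

WEIGHTS_NAME = "pytorch_model.bin"
prefix = WEIGHTS_NAME[:-4]  # "pytorch_model" (drops the ".bin" suffix)

# Both a full checkpoint and an assumed partial shard match the prefix,
# so the cleanup loop would also remove stale partial files.
print("pytorch_model.bin".startswith(prefix))      # True
print("pytorch_model.bin_0_0".startswith(prefix))  # True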

model = self._wrap_model(self.model_wrapped)

if resume_from_checkpoint is not None:
    if is_sagemaker_mp_enabled():

can merge these two ifs
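
One possible shape of the merged check this comment is suggesting (illustrative only; the body of the branch is elided):

if resume_from_checkpoint is not None and is_sagemaker_mp_enabled():
    # a single combined condition instead of two nested ifs
    ...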

if resume_from_checkpoint is not None:
    if is_sagemaker_mp_enabled():
        if self.args.smp_load_partial:
            state_dict = smp.load(os.path.join(resume_from_checkpoint, WEIGHTS_NAME), partial=self.args.smp_load_partial)

You can use smp.load for both cases, as long as partial=self.args.smp_load_partial
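
A sketch of that simplification, reusing the names from the snippet above (illustrative only, not the final diff):

# One smp.load call covers both the partial and the full load, because the
# `partial` flag comes directly from the training arg.
state_dict = smp.load(
    os.path.join(resume_from_checkpoint, WEIGHTS_NAME),
    partial=self.args.smp_load_partial,
)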

if is_sagemaker_mp_enabled():
    if self.args.smp_load_partial:
        state_dict = smp.load(best_model_path, partial=self.args.smp_load_partial)
    else:

same comment to simplify here

if self.args.smp_save_partial:
    opt_state_dict = self.optimizer.local_state_dict()
else:
    opt_state_dict = self.optimizer.state_dict()

We wanted to standardize on gather_if_shard=False here.

When we do that, which processes need to save the partial state dict changes: if shard_optimizer_state is enabled, all processes save; otherwise only rdp_rank == 0 saves (see the sketch below).
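
A rough sketch of that rank logic; the helper name and its boolean parameter are illustrative, while rdp_rank() is the SMP call referenced above:

import smdistributed.modelparallel.torch as smp

def partial_opt_state_should_be_saved(shard_optimizer_state: bool) -> bool:
    # With gather_if_shard=False every process holds only its own shard of the
    # optimizer state, so every process must write it; otherwise one writer per
    # reduced-data-parallel group (rdp_rank 0) is enough.
    if shard_optimizer_state:
        return True
    return smp.rdp_rank() == 0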

stevhliu and others added 9 commits April 5, 2022 12:48
* 📝 add image/vision classification and asr

* 🖍 minor formatting fixes

* Fixed a typo in legacy seq2seq_trainer.py (huggingface#16531)

* Add ONNX export for BeiT (huggingface#16498)

* Add beit onnx conversion support

* Updated docs

* Added cross reference to ViT ONNX config

* call on_train_end when trial is pruned (huggingface#16536)

* Type hints added (huggingface#16529)

* Fix Bart type hints (huggingface#16297)

* Add type hints to PLBart PyTorch

* Remove pending merge conflicts

* Fix PLBart Type Hints

* Add changes from review

* Add VisualBert type hints (huggingface#16544)

* Adding missing type hints for mBART model (PyTorch) (huggingface#16429)

* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model 

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

Co-authored-by: matt <rocketknight1@gmail.com>

* Remove MBart subclass of XLMRoberta in tokenizer docs (huggingface#16546)

* Remove MBart subclass of XLMRoberta in tokenizer

* Fix style

* Copy docs from MBart50 tokenizer

* Use random_attention_mask for TF tests (huggingface#16517)

* use random_attention_mask for TF tests

* Fix for TFCLIP test (for now).

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* Improve code example (huggingface#16450)

Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>

* Pin tokenizers version <0.13 (huggingface#16539)

* Pin tokenizers version <0.13

* Style

* Add code samples for TF speech models (huggingface#16494)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* [FlaxSpeechEncoderDecoder] Fix dtype bug (huggingface#16581)

* [FlaxSpeechEncoderDecoder] Fix dtype bug

* more fixes

* Making the impossible to connect error actually report the right URL. (huggingface#16446)

* Fix flax import in __init__.py: modeling_xglm -> modeling_flax_xglm (huggingface#16556)

* Add utility to find model labels (huggingface#16526)

* Add utility to find model labels

* Use it in the Trainer

* Update src/transformers/utils/generic.py

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Quality

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>

* Enable doc in Spanish (huggingface#16518)

* Reorganize doc for multilingual support

* Fix style

* Style

* Toc trees

* Adapt templates

* Add use_auth to load_datasets for private datasets to PT and TF examples (huggingface#16521)

* fix formatting and remove use_auth

* Add use_auth_token to Flax examples

* add a test checking the format of `convert_tokens_to_string`'s output (huggingface#16540)

* add new tests

* add comment to overridden tests

* TF: Finalize `unpack_inputs`-related changes (huggingface#16499)

* Add unpack_inputs to remaining models

* removed kwargs to `call()` in TF models

* fix TF T5 tests

* [SpeechEncoderDecoderModel] Correct Encoder Last Hidden State Output (huggingface#16586)

* initialize the default rank set on TrainerState (huggingface#16530)

* initialize the default rank set on TrainerState

* fix style

* Trigger doc build

* Fix CI: test_inference_for_pretraining in ViTMAEModelTest (huggingface#16591)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

* add a template to add missing tokenization test (huggingface#16553)

* add a template to add missing tokenization test

* add cookiecutter setting

* improve doc

* Update templates/adding_a_missing_tokenization_test/README.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* made _load_pretrained_model_low_mem static + bug fix (huggingface#16548)

* handle torch_dtype in low cpu mem usage (huggingface#16580)

* [Doctests] Correct filenaming (huggingface#16599)

* [Doctests] Correct filenaming

* improve quicktour

* make style

* Adding new train_step logic to make things less confusing for users (huggingface#15994)

* Adding new train_step logic to make things less confusing for users

* DO NOT ASK WHY WE NEED THAT SUBCLASS

* Metrics now working, at least for single-output models with type annotations!

* Updates and TODOs for the new train_step

* Make fixup

* Temporary test workaround until T5 has types

* Temporary test workaround until T5 has types

* I think this actually works! Needs a lot of tests though

* MAke style/quality

* Revert changes to T5 tests

* Deleting the aforementioned unmentionable subclass

* Deleting the aforementioned unmentionable subclass

* Adding a Keras API test

* Style fixes

* Removing unneeded TODO and comments

* Update test_step too

* Stop trying to compute metrics with the dummy_loss, patch up test

* Make style

* make fixup

* Docstring cleanup

* make fixup

* make fixup

* Stop expanding 1D input tensors when using dummy loss

* Adjust T5 test given the new compile()

* make fixup

* Skipping test for convnext

* Removing old T5-specific Keras test now that we have a common one

* make fixup

* make fixup

* Only skip convnext test on CPU

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Avoiding TF import issues

* make fixup

* Update compile() to support TF 2.3

* Skipping model.fit() on template classes for now

* Skipping model.fit() on template class tests for now

* Replace ad-hoc solution with find_labels

* make fixup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding missing type hints for BigBird model   (huggingface#16555)

* added type hints for mbart tensorflow tf implementation

* Adding missing type hints for mBART model 

Tensorflow Implementation model added with missing type hints

* Missing Type hints - correction

For TF model

* Code fixup using make quality tests

* Hint types - typo error

* make fix-copies and make fixup

* type hints

* updated files

* type hints update

* making dependent models coherent

* Type hints for BigBird

* removing typos

Co-authored-by: matt <rocketknight1@gmail.com>

* [deepspeed] fix typo, adjust config name (huggingface#16597)

* 🖍 apply feedback

Co-authored-by: Cathy <815244047@qq.com>
Co-authored-by: Jim Rohrer <jrohrer1@gmail.com>
Co-authored-by: Ferdinand Schlatt <fschlatt@gmail.com>
Co-authored-by: Dahlbomii <101373053+Dahlbomii@users.noreply.github.com>
Co-authored-by: Gunjan Chhablani <chhablani.gunjan@gmail.com>
Co-authored-by: Rishav Chandra Varma <rishavchandra.v16@iiits.in>
Co-authored-by: matt <rocketknight1@gmail.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>
Co-authored-by: Lysandre Debut <lysandre.debut@reseau.eseo.fr>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>
Co-authored-by: Daniel Stancl <46073029+stancld@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Karim Foda <35491698+KMFODA@users.noreply.github.com>
Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
Co-authored-by: Andres Codas <andrescodas@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
Co-authored-by: Francesco Saverio Zuppichini <francesco.zuppichini@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Completed documentation of CTRL

* Missing optional None

* Added return types

* updated imports

* Update modeling_ctrl.py
* fix bart and mbart

* add ckpt names as variables

* fix mbart

* fix plbart

* use variable for ckpt name
…16609)

* Use CLIP model's config for some fields (if specified) instead of those of vision & text components.

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
lmvasque and others added 25 commits April 7, 2022 10:02
* Add inputs vector to calculate metric method

* Include inputs for evaluation metrics with backwards compatibility

* Prevent inputs create OOM issue and documentation details

* Update style and code documentation

* Fix style formatting issues

* Update files format with make style
…ate_dict (huggingface#16643)

* Updated _load_pretrained_model_low_mem to check if keys are in the stored state_dict

* update after conversions
* Update README.md Support Image

Updates the Support image linking to our EAP page (to give it a refresh + help avoid image fatigue).

Slack thread checking in with #open-source-internal on this update (https://huggingface.slack.com/archives/C021H1P1HKR/p1648838903316709)

* Compressed Updated Support image

* Improves Support Image Logo + Height

Updated the image based on logo + size feedback. Big thanks to Bibi for making quick edits to this image.
* base model done

* make style

* done

* added files

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Trigger doc build

* resolved conversations

* resolved conversations

* seer models

* minor changes

* minor changes

* make fixup

* glob variables

* minor changes

* fix copies

* config when possible

* resolved conflicts

* resolved conflicts

* resolved conflicts

* CI

* conversion script for 10b param

* fixed for 10b model

* minor updates in the doc + make style

* removed unused code

* Apply suggestions from code review

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

* removed unused code

* removed unused code

* updated modeling_utils from main

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <Sylvain.gugger@gmail.com>
* Add TapexTokenizer

* Improve docstrings and provide option to provide answer

* Remove option for pretokenized inputs

* Add TAPEX to README

* Fix copies

* Remove option for pretokenized inputs

* Initial commit: add tapex fine-tuning examples on both table-based question answering and table-based fact verification.

* - Draft a README file for running the script and introducing some background.
- Remove unused code lines in tabfact script.
- Disable the default `pad_to_max_length` option which is memory-consuming.

* * Support `as_target_tokenizer` function for TapexTokenizer.
* Fix the do_lower_case behaviour of TapexTokenizer.
* Add unit tests for target scenarios and cased/uncased scenarios for both source and target.

* * Replace the label BartTokenizer with TapexTokenizer's as_target_tokenizer function.
* Fix typos in tapex example README.

* * fix the evaluation script - remove the property `task_name`

* * Make the label space more clear for tabfact tasks

* * Using a new fine-tuning script for tapex-base on tabfact.

* * Remove the lowercase code outside the tokenizer - we use the tokenizer to control whether do_lower_case
* Guarantee the hyper-parameter can be run without out-of-memory on 16GB card and report the new reproduced number on wikisql

* * Remove the default tokenizer_name option.
* Provide evaluation command.

* * Support for WikiTableQuestion dataset.

* Fix a typo in README.

* * Fix the datasets's key name in WikiTableQuestions

* Run make fixup and move test to folder

* Fix quality

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply some more suggestions from code review

* Improve docstrings

* Overwrite failing test

* Improve comment in example scripts

* Fix rebase

* Add TAPEX to Auto mapping

* Add TAPEX to auto config mappings

* Put TAPEX higher than BART in auto mapping

* Add TAPEX to doc tests

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MBP.localdomain>
Co-authored-by: SivilTaram <qianlxc@outlook.com>
Co-authored-by: Niels Rogge <nielsrogge@nielss-mbp.home>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
* add vit tf doctest with @add_code_sample_docstrings

* add labels string back in

Co-authored-by: Johannes Kolbe <johannes.kolbe@tech.better.team>
The default value of `padding` in `DataCollatorWithPadding` is `True`, not `False`.
* fix QA sample

* For TF_QUESTION_ANSWERING_SAMPLE

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Fixed some bugs involving saving during epochs
* Added tests mimicking the existing examples tests
* Added in json exporting to all `no_trainer` examples for consistency
* [Trainer] tf32 arg doc

* Update src/transformers/training_args.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* ✨ update audio examples with minds dataset

* 🖍 make style

* 🖍 minor fixes for doctests
@cavdard cavdard marked this pull request as draft April 9, 2022 02:18
@LysandreJik (Member)

Hey! It seems a bad rebase/merge happened on your PR. Usually, closing this PR and opening a new one from the same branch solves the problem.

@cavdard (Contributor, Author) commented Apr 12, 2022

Closing this PR.
Created a new PR

@github-actions (bot) commented May 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this May 15, 2022