
Conversation

@muellerzr
Contributor

What does this PR do?

This PR introduces the find_executable_batch_size decorator into Trainer, so that whenever a CUDA out-of-memory (OOM) error is hit, the training loop is restarted with a lower batch size.

The API looks like this:

trainer = Trainer()
trainer.train(auto_find_batch_size=True)

By default it is False, and using it requires Accelerate to be installed.

Fixes # (issue)

Partially solves #16987
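
For context, here is a minimal sketch of how Accelerate's find_executable_batch_size decorator is typically applied (inner_loop, make_dataloader, and starting_batch_size=64 are illustrative names, not taken from this PR):

from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=64)
def inner_loop(batch_size):
    # Everything that depends on the batch size (dataloaders, gradient
    # accumulation, ...) must be recreated here, because the decorator
    # re-runs the whole function with a smaller batch_size after a CUDA OOM.
    train_dataloader = make_dataloader(batch_size)  # hypothetical helper
    for batch in train_dataloader:
        ...  # usual forward/backward/step

inner_loop()  # called with no arguments; batch_size is injected by the decorator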

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger

@muellerzr muellerzr requested a review from sgugger May 3, 2022 19:32
Collaborator

@sgugger sgugger left a comment


Thanks for working on this! Left a few comments to make the PR a bit better :-)

@muellerzr muellerzr requested a review from sgugger May 4, 2022 16:22
Collaborator

@sgugger sgugger left a comment


Thanks for all the work on this. Pinging @LysandreJik to have a second set of eyes :-)

@muellerzr
Contributor Author

@stas00 I'm getting a test failure on the metrics:

 tests/trainer/test_trainer.py:1426: in check_mem_metrics
     metrics = trainer.train().metrics
 src/transformers/trainer.py:1215: in train
     ignore_keys_for_eval=ignore_keys_for_eval,
 src/transformers/trainer.py:1571: in _inner_training_loop
     self._memory_tracker.stop_and_update_metrics(metrics)
 src/transformers/trainer_utils.py:536: in stop_and_update_metrics
     stage = self.derive_stage()
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 
 self = <transformers.trainer_utils.TrainerMemoryTracker object at 0x7f929f787e10>
 
     def derive_stage(self):
         """derives the stage/caller name automatically"""
         caller = inspect.currentframe().f_back.f_back.f_code.co_name
         if caller in self.stages:
             return self.stages[caller]
         else:
             raise ValueError(
 >               f"was called from {caller}, but only expect to be called from one of {self.stages.keys()}"
             )
 E           ValueError: was called from _inner_training_loop, but only expect to be called from one of dict_keys(['__init__', 'train', 'evaluate', 'predict'])

Any advice on how to approach a solution?
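
(For anyone reading along: derive_stage inspects the name of the function two frames above it, which after this PR is _inner_training_loop instead of train, hence the ValueError. A stand-alone toy reproduction of that mechanism, not the actual Trainer code:)

import inspect

STAGES = {"train": "train"}

def derive_stage():
    # name of the function two frames up, mirroring TrainerMemoryTracker
    caller = inspect.currentframe().f_back.f_back.f_code.co_name
    return STAGES.get(caller, f"unexpected caller: {caller}")

def stop_and_update_metrics():
    return derive_stage()

def train():
    return stop_and_update_metrics()

def _inner_training_loop():
    return stop_and_update_metrics()

print(train())                 # -> train
print(_inner_training_loop())  # -> unexpected caller: _inner_training_loop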

@muellerzr muellerzr force-pushed the muellerzr-memory-decorator branch from 62c7bd3 to ff6caca on May 5, 2022 13:05
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@stas00
Contributor

stas00 commented May 5, 2022

This will overcome the problem:

diff --git a/src/transformers/trainer_utils.py b/src/transformers/trainer_utils.py
index 22b44a2f0..d4c523249 100644
--- a/src/transformers/trainer_utils.py
+++ b/src/transformers/trainer_utils.py
@@ -356,6 +356,7 @@ class TrainerMemoryTracker:
     stages = {
         "__init__": "init",
         "train": "train",
+        "_inner_training_loop": "train",
         "evaluate": "eval",
         "predict": "test",
     }
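
A quick way to sanity-check the idea after applying the patch to a local transformers checkout (hypothetical snippet, not part of this diff):

from transformers.trainer_utils import TrainerMemoryTracker

# With the extra entry, the frame-based stage detection maps the new
# inner training loop to the "train" stage instead of raising a ValueError.
assert TrainerMemoryTracker.stages["_inner_training_loop"] == "train"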

@muellerzr muellerzr requested a review from LysandreJik May 5, 2022 18:39
Member

@LysandreJik LysandreJik left a comment


Great, looks good to me!

@LysandreJik
Member

Please make sure all tests pass after resolving conflicts and before merging!

@muellerzr muellerzr merged commit 2fbb237 into main May 9, 2022
@muellerzr muellerzr deleted the muellerzr-memory-decorator branch May 9, 2022 16:29
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…uggingface#17068)

Co-authored-by: Sylvain Gugger <[email protected]>

- Adds auto_batch_size finder 
- Moves training loop to an inner training loop
@JohnGiorgi
Contributor

Any chance similar functionality could be supported for inference? 🙏
