
Conversation

@muellerzr
Contributor

What does this PR do?

This PR introduces the find_executable_batch_size decorator into Trainer, so that whenever a CUDA out-of-memory (OOM) error is hit, the training loop is restarted with a lower batch size.

The API looks like this:

trainer = Trainer()
trainer.train(auto_find_batch_size=True)

By default it is False, and using it requires Accelerate to be installed.

Fixes # (issue)

Partially solves #16987
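
For context, here is a minimal sketch of how Accelerate's find_executable_batch_size decorator is typically applied (inner_loop, make_dataloader, and starting_batch_size=64 are illustrative names, not taken from this PR):

from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=64)
def inner_loop(batch_size):
    # Everything that depends on the batch size (dataloaders, gradient
    # accumulation, ...) must be recreated here, because the decorator
    # re-runs the whole function with a smaller batch_size after a CUDA OOM.
    train_dataloader = make_dataloader(batch_size)  # hypothetical helper
    for batch in train_dataloader:
        ...  # usual forward/backward/step

inner_loop()  # called with no arguments; batch_size is injected by the decorator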

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sgugger

@muellerzr muellerzr requested a review from sgugger May 3, 2022 19:32
Collaborator

@sgugger sgugger left a comment


Thanks for working on this! Left a few comments to make the PR a bit better :-)

@muellerzr muellerzr requested a review from sgugger May 4, 2022 16:22
Collaborator

@sgugger sgugger left a comment


Thanks for all the work on this. Pinging @LysandreJik to have a second set of eyes :-)

@muellerzr
Contributor Author

@stas00 I'm getting a test failure on the metrics:

 tests/trainer/test_trainer.py:1426: in check_mem_metrics
     metrics = trainer.train().metrics
 src/transformers/trainer.py:1215: in train
     ignore_keys_for_eval=ignore_keys_for_eval,
 src/transformers/trainer.py:1571: in _inner_training_loop
     self._memory_tracker.stop_and_update_metrics(metrics)
 src/transformers/trainer_utils.py:536: in stop_and_update_metrics
     stage = self.derive_stage()
 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
 
 self = <transformers.trainer_utils.TrainerMemoryTracker object at 0x7f929f787e10>
 
     def derive_stage(self):
         """derives the stage/caller name automatically"""
         caller = inspect.currentframe().f_back.f_back.f_code.co_name
         if caller in self.stages:
             return self.stages[caller]
         else:
             raise ValueError(
 >               f"was called from {caller}, but only expect to be called from one of {self.stages.keys()}"
             )
 E           ValueError: was called from _inner_training_loop, but only expect to be called from one of dict_keys(['__init__', 'train', 'evaluate', 'predict'])

Any advice on how to approach a solution?
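
(For anyone reading along: derive_stage inspects the name of the function two frames above it, which after this PR is _inner_training_loop instead of train, hence the ValueError. A stand-alone toy reproduction of that mechanism, not the actual Trainer code:)

import inspect

STAGES = {"train": "train"}

def derive_stage():
    # name of the function two frames up, mirroring TrainerMemoryTracker
    caller = inspect.currentframe().f_back.f_back.f_code.co_name
    return STAGES.get(caller, f"unexpected caller: {caller}")

def stop_and_update_metrics():
    return derive_stage()

def train():
    return stop_and_update_metrics()

def _inner_training_loop():
    return stop_and_update_metrics()

print(train())                 # -> train
print(_inner_training_loop())  # -> unexpected caller: _inner_training_loop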

@muellerzr muellerzr force-pushed the muellerzr-memory-decorator branch from 62c7bd3 to ff6caca on May 5, 2022 13:05
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@stas00
Contributor

stas00 commented May 5, 2022

This will overcome the problem:

diff --git a/src/transformers/trainer_utils.py b/src/transformers/trainer_utils.py
index 22b44a2f0..d4c523249 100644
--- a/src/transformers/trainer_utils.py
+++ b/src/transformers/trainer_utils.py
@@ -356,6 +356,7 @@ class TrainerMemoryTracker:
     stages = {
         "__init__": "init",
         "train": "train",
+        "_inner_training_loop": "train",
         "evaluate": "eval",
         "predict": "test",
     }
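
A quick way to sanity-check the idea after applying the patch to a local transformers checkout (hypothetical snippet, not part of this diff):

from transformers.trainer_utils import TrainerMemoryTracker

# With the extra entry, the frame-based stage detection maps the new
# inner training loop to the "train" stage instead of raising a ValueError.
assert TrainerMemoryTracker.stages["_inner_training_loop"] == "train"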

@muellerzr muellerzr requested a review from LysandreJik May 5, 2022 18:39
Member

@LysandreJik LysandreJik left a comment


Great, looks good to me!

@LysandreJik
Member

Please make sure all tests pass after resolving conflicts and before merging!

@muellerzr muellerzr merged commit 2fbb237 into main May 9, 2022
@muellerzr muellerzr deleted the muellerzr-memory-decorator branch May 9, 2022 16:29
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…uggingface#17068)

Co-authored-by: Sylvain Gugger <[email protected]>

- Adds auto_batch_size finder 
- Moves training loop to an inner training loop
@JohnGiorgi
Contributor

Any chance similar functionality could be supported for inference? 🙏
