Repeated .fit() calls ignore max_steps iteration bound #4809

Closed
brdav opened this issue Nov 22, 2020 · 3 comments · Fixed by #5936
Labels: bug (Something isn't working) · good first issue (Good for newcomers) · help wanted (Open to be worked on) · priority: 1 (Medium priority task)


brdav commented Nov 22, 2020

🐛 Bug

Hello!
While trying to convert my code to PL (I'm starting to become a big fan!) I came across some unexpected behavior: in an iteration-based training setup, repeated calls to trainer.fit() ignore the iteration bound set by the max_steps argument. The trainer finishes the entire epoch, even though in my opinion it shouldn't (forgive me if I missed something obvious, which is easily possible since I'm new to PL).

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1gKLNoYXjW7s3ifSJJ00SVZi4b08GDy5F?usp=sharing

To Reproduce

In the BoringModel, I only changed the test cell to something like this:

def test_x(tmpdir):
    # init model
    model = BoringModel()

    iterations = 80
    print('fit() should only do {} iterations'.format(iterations))

    # Initialize a trainer with an iteration bound
    trainer = pl.Trainer(
        max_steps=iterations,
        progress_bar_refresh_rate=20,
    )

    # Train the model ⚡ (train/val are the dataloaders defined in the BoringModel notebook)
    trainer.fit(model, train, val)  # here the trainer correctly stops after 80 iterations

    print('global_step: ', trainer.global_step)
    print('first fit() done')

    trainer.fit(model, train, val)  # in this second fit(), the trainer completes the entire epoch

    # trainer.test(test_dataloaders=test)

Expected behavior

I expect repeated trainer.fit() calls to result in either

  1. no action, because trainer.global_step == trainer.max_steps, or

  2. fitting the model again for another max_steps iterations.

I think most people would expect the former.
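
In the meantime, a minimal user-side guard (a sketch that only relies on the public trainer.global_step and trainer.max_steps attributes; this is a workaround, not the fix requested here) avoids the surprise extra epoch on the second call:

# Sketch of a workaround: only call fit() again if the step budget
# is not yet exhausted.
if trainer.max_steps is not None and trainer.global_step >= trainer.max_steps:
    print('max_steps already reached, skipping fit()')
else:
    trainer.fit(model, train, val)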

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.0.7
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

The same problem arises if a fully trained model is loaded and trainer.fit() is then called. This is especially troubling when a trainer.test() call follows.
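
For concreteness, a sketch of that scenario (the checkpoint path is hypothetical, and train/val/test are the notebook's dataloaders): the fit() call should return immediately, but instead trains for another full epoch before test() runs.

# Sketch of the load-then-fit case; 'trained.ckpt' is a hypothetical checkpoint
# saved after the model already ran for max_steps iterations.
model = BoringModel.load_from_checkpoint('trained.ckpt')

trainer = pl.Trainer(max_steps=80)
trainer.fit(model, train, val)       # expected: no-op; observed: a full extra epoch
trainer.test(test_dataloaders=test)  # evaluates the unexpectedly further-trained model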

brdav added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Nov 22, 2020
github-actions (Contributor) commented

Hi! Thanks for your contribution, great first issue!

brdav (Author) commented Nov 22, 2020

The issue might go hand in hand with this pull request: #4086
Although as far as I can see, that pull request as it stands would not fix this issue, which is why I opened it here.

awaelchli (Contributor) commented

Hi! I agree, your observations are correct :)
It looks like what happens here is that max_steps is reached before the epoch is complete.
In the next fit, we detect that max_epochs has not been reached and continue training, but max_steps should take precedence.

I think we want max_epochs and max_steps to limit training such that whichever of the two leads to fewer training steps takes precedence.
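
One way to picture that rule (an illustrative sketch, not the actual Lightning internals) is a single stop check in which either bound can end training:

def should_stop(global_step: int, current_epoch: int, max_steps: int, max_epochs: int) -> bool:
    """Illustrative stop condition: training ends as soon as either bound is hit,
    so whichever of max_steps / max_epochs implies less training takes precedence."""
    return global_step >= max_steps or current_epoch >= max_epochs

Under such a rule, a second fit() that starts with global_step == max_steps would stop immediately, regardless of how far the current epoch has progressed.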
