Repeated .fit() calls ignore max_steps iteration bound #4809

Closed
brdav opened this issue Nov 22, 2020 · 3 comments · Fixed by #5936
Labels: bug (Something isn't working) · good first issue (Good for newcomers) · help wanted (Open to be worked on) · priority: 1 (Medium priority task)


brdav commented Nov 22, 2020

🐛 Bug

Hello!
While trying to convert my code to PL (I'm starting to become a big fan!) I came across some unexpected behavior: in an iteration-based training setup, repeated calls to trainer.fit() ignore the iteration bound set by the max_steps argument. The trainer finishes the entire epoch, even though in my opinion it shouldn't (forgive me if I missed something obvious, which is easily possible since I'm new to PL).

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1gKLNoYXjW7s3ifSJJ00SVZi4b08GDy5F?usp=sharing

To Reproduce

In the BoringModel, I only changed the test cell to something like this:

def test_x(tmpdir):
    # init model
    model = BoringModel()

    iterations = 80
    print('fit() should only do {} iterations'.format(iterations))

    # Initialize a trainer with an iteration bound
    trainer = pl.Trainer(
        max_steps=iterations,
        progress_bar_refresh_rate=20,
    )

    # Train the model ⚡ (train/val are the dataloaders defined in the BoringModel notebook)
    trainer.fit(model, train, val)  # here the trainer correctly stops after 80 iterations

    print('global_step: ', trainer.global_step)
    print('first fit() done')

    trainer.fit(model, train, val)  # in this second fit(), the trainer completes the entire epoch

    # trainer.test(test_dataloaders=test)

Expected behavior

I expect repeated trainer.fit() calls to result in either

  1. no action, because trainer.global_step == trainer.max_steps, or

  2. fitting the model again for another max_steps iterations.

I think most people would expect the former.
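
In the meantime, a minimal user-side guard (a sketch that only relies on the public trainer.global_step and trainer.max_steps attributes; this is a workaround, not the fix requested here) avoids the surprise extra epoch on the second call:

# Sketch of a workaround: only call fit() again if the step budget
# is not yet exhausted.
if trainer.max_steps is not None and trainer.global_step >= trainer.max_steps:
    print('max_steps already reached, skipping fit()')
else:
    trainer.fit(model, train, val)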

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.0.7
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

The same problem arises if a fully trained model is loaded and trainer.fit() is then called. This is especially troubling when a trainer.test() call follows.
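
For concreteness, a sketch of that scenario (the checkpoint path is hypothetical, and train/val/test are the notebook's dataloaders): the fit() call should return immediately, but instead trains for another full epoch before test() runs.

# Sketch of the load-then-fit case; 'trained.ckpt' is a hypothetical checkpoint
# saved after the model already ran for max_steps iterations.
model = BoringModel.load_from_checkpoint('trained.ckpt')

trainer = pl.Trainer(max_steps=80)
trainer.fit(model, train, val)       # expected: no-op; observed: a full extra epoch
trainer.test(test_dataloaders=test)  # evaluates the unexpectedly further-trained model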

brdav added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Nov 22, 2020
github-actions (Contributor) commented

Hi! Thanks for your contribution, great first issue!

brdav (Author) commented Nov 22, 2020

The issue might go hand in hand with this pull request: #4086
Although as far as I can see, that pull request as it stands would not fix this issue, which is why I opened it here.

awaelchli (Contributor) commented

Hi! I agree, your observations are correct :)
It looks like what happens here is that max_steps is reached before the epoch is complete.
In the next fit, we detect that max_epochs has not been reached and continue training, but max_steps should take precedence.

I think we want max_epochs and max_steps to limit training such that whichever of the two leads to fewer training steps takes precedence.
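
One way to picture that rule (an illustrative sketch, not the actual Lightning internals) is a single stop check in which either bound can end training:

def should_stop(global_step: int, current_epoch: int, max_steps: int, max_epochs: int) -> bool:
    """Illustrative stop condition: training ends as soon as either bound is hit,
    so whichever of max_steps / max_epochs implies less training takes precedence."""
    return global_step >= max_steps or current_epoch >= max_epochs

Under such a rule, a second fit() that starts with global_step == max_steps would stop immediately, regardless of how far the current epoch has progressed.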
