Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added max/min number of steps in Trainer #728

Merged
merged 25 commits into from
Feb 18, 2020
Merged

Added max/min number of steps in Trainer #728

merged 25 commits into from
Feb 18, 2020

Conversation

peteriz
Copy link

@peteriz peteriz commented Jan 22, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Partially Fixes #640
I added a max_steps argument in Trainer for stopping the training process when max_steps has been reached. This feature is highly desired in Transformer-based models (such as BERT, XLNet, etc.) where a warm-up phase and LR decay is required.
I will add future PRs covering stepwise processing (related to scheduler)

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

@peteriz peteriz requested a review from Borda January 22, 2020 14:47
@williamFalcon
Copy link
Contributor

@peteriz good addition. please update docs and add a test

@peteriz
Copy link
Author

peteriz commented Feb 2, 2020

@williamFalcon @Borda
All comments were addressed.

@williamFalcon
Copy link
Contributor

williamFalcon commented Feb 2, 2020

@peteriz awesome! mind adding a test for this?
Trainer is something we're super strict with testing.

@peteriz
Copy link
Author

peteriz commented Feb 2, 2020

No problem.
Can you direct me to similar test? I didn't find a similar test for epochs

@williamFalcon
Copy link
Contributor

williamFalcon commented Feb 2, 2020

A test for this would run trainer with all 4 of those cases. So, min_steps, max_steps and then make sure the step counts are correct.

edit:
just looked, looks like we don't directly test for that, a lot of tests do limit the epochs though. Maybe this is a good time to add it? It would be for this file:

https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/test_trainer.py

To make sure the test is fast, maybe set a limit on how much training data is used? 10%?

@peteriz
Copy link
Author

peteriz commented Feb 2, 2020

The intention of my question was on where to place the test.
Would tests/test_trainer.py be the right place?

@williamFalcon
Copy link
Contributor

yup! edited my original comment

@peteriz
Copy link
Author

peteriz commented Feb 3, 2020

@williamFalcon I pushed the test, I hope it is okay.

@williamFalcon
Copy link
Contributor

williamFalcon commented Feb 3, 2020

Failed on GPU:

______________ test_num_trainer_steps ________________________________________

tmpdir = local('/tmp/pytest-of-waf251/pytest-5/test_num_trainer_steps0')

    def test_num_trainer_steps(tmpdir):
        """Verify model trains according to speficied steps"""
        tutils.reset_seed()
        model, _ = tutils.get_model()

        trainer_options = dict(
            max_epochs=5,
            gpus=None,
            default_save_path=tmpdir,
            train_percent_check=0.05,
        )

        trainer_options['max_epochs'] = 2
        trainer_options['max_steps'] = 100
        trainer = Trainer(**trainer_options)
        result = trainer.fit(model)
        assert result == 1
        # should stop at max_steps
        assert trainer.global_step == 100, "Model did not stop at max_steps"

        trainer_options['max_epochs'] = 2
        trainer_options['max_steps'] = 500
        trainer = Trainer(**trainer_options)
        result = trainer.fit(model)
        assert result == 1
        # should stop at max_epochs
>       assert trainer.global_step == 93 * 2 and \
            trainer.current_epoch == 1, "Model did not stop at max_epochs"
E       AssertionError: Model did not stop at max_epochs
E       assert (278 == 186
E         -278
E         +186)

@peteriz
Copy link
Author

peteriz commented Feb 3, 2020

Thought Travis-CI pass was okay. I'll have a look.
I dont get why it failed GPU, the trainer is configured with gpus=None

@Borda
Copy link
Member

Borda commented Feb 3, 2020

unfortunately, it is hard to get free CI with GPUs for full testing, but we are working on it.. #486

@williamFalcon
Copy link
Contributor

still failing:

tmpdir = local('/tmp/pytest-of-waf251/pytest-6/test_num_trainer_steps0')

    def test_num_trainer_steps(tmpdir):
        """Verify model trains according to speficied steps"""
        tutils.reset_seed()
        model, _ = tutils.get_model()

        trainer_options = dict(
            max_epochs=5,
            gpus=None,
            default_save_path=tmpdir,
            train_percent_check=0.05,
        )

        trainer_options['max_epochs'] = 2
        trainer_options['max_steps'] = 100
        trainer = Trainer(**trainer_options)
        result = trainer.fit(model)
        assert result == 1
        # should stop at max_steps
        assert trainer.global_step == 100, "Model did not stop at max_steps"

        trainer_options['max_epochs'] = 2
        trainer_options['max_steps'] = 500
        trainer = Trainer(**trainer_options)
        result = trainer.fit(model)
        assert result == 1
        # should stop at max_epochs
>       assert trainer.global_step == 93 * 2 and \
            trainer.current_epoch == 1, "Model did not stop at max_epochs"
E       AssertionError: Model did not stop at max_epochs
E       assert (278 == 186
E         -278
E         +186)

@peteriz
Copy link
Author

peteriz commented Feb 3, 2020

@williamFalcon I could not reproduce that error on my GPU machine.
I changed the test to pull train set size from the data loader. Fingers crossed this should pass.

@williamFalcon
Copy link
Contributor

@peteriz can you rebase master so we can see if tests pass now?

@peteriz
Copy link
Author

peteriz commented Feb 13, 2020

Fixed. Could you squash my commits when merging?
Thanks

@Borda
Copy link
Member

Borda commented Feb 13, 2020

Fixed. Could you squash my commits when merging?

sure, that's what we always do... :]

tests/test_trainer.py Outdated Show resolved Hide resolved
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, just a minor points, pls :] Thx for you patience...

pytorch_lightning/trainer/trainer.py Show resolved Hide resolved
pytorch_lightning/trainer/trainer.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/training_loop.py Show resolved Hide resolved
pytorch_lightning/trainer/training_loop.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/training_loop.py Outdated Show resolved Hide resolved
tests/test_trainer.py Outdated Show resolved Hide resolved
tests/test_trainer.py Outdated Show resolved Hide resolved
@pep8speaks
Copy link

pep8speaks commented Feb 15, 2020

Hello @peteriz! Thanks for updating this PR.

Line 357:101: E501 line too long (106 > 100 characters)

Comment last updated at 2020-02-18 13:11:53 UTC

@Borda
Copy link
Member

Borda commented Feb 15, 2020

Hello @peteriz! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

the config is waiting here - #842 but he is right, the test is failing there...

@williamFalcon
Copy link
Contributor

>       assert trainer.global_step == num_train_samples * trainer_options['max_epochs']
E       assert 278 == 186
E         -278
E         +186

Not sure what's happening here. Likely, the global steps are ACTUALLY wrong or perhaps this isn't accounting for batch size in your num_train_samples * trainer_options['max_epochs'] calculation?

@peteriz
Copy link
Author

peteriz commented Feb 16, 2020

@williamFalcon Okay I changed the assert and it pulls max_epochs from the trainer.
I think this assert is essential for validating that the trainer stopped number_of_epochs * steps_per_epoch.

@williamFalcon
Copy link
Contributor

williamFalcon commented Feb 16, 2020

@peteriz this is still failing...

>       assert result == 1 and trainer.global_step == num_train_samples * trainer.max_nb_epochs \
            and trainer.current_epoch == trainer.max_nb_epochs - 1, "Model did not stop at max_epochs"
E       AssertionError: Model did not stop at max_epochs
E       assert (1 == 1
E         -1
E         +1 and 278 == 186
E         -278
E         +186)```

I'd also say that nested asserts are hard to debug. My suggestion is to split this into two asserts, and use PDB to figure out what's going on.

I'd love to help finish this but I'm pretty deep into another feature so I don't have the bandwith to debug this.

I'd also say to run this on 2 GPUs... not sure why this is different.

@peteriz
Copy link
Author

peteriz commented Feb 16, 2020

@williamFalcon I can separate the asserts back, but I was not able to get this error on my mac (python 3.7) or on my machines, cpu backed or 1/2/4 GPUs.
I need more info on what you're running (beside the test and CI here) ..

@peteriz
Copy link
Author

peteriz commented Feb 18, 2020

@Borda @williamFalcon What can we do to push this PR forward? It would be great to make this happen since I'm planning on integrating pytorch-lightning into our repo NLP Architect.

@williamFalcon
Copy link
Contributor

agreed! this is a critical PR for us.
But we reaaally need this test to pass. I think something is wrong with the functionality... there should be no caveats or what ifs on where these tests pass. if they don’t pass as they are on the machines i test on then we need to fix the functionality or test :)

so, maybe think about what would cause the difference in machines? and make sure you’re running the tests correctly.

i do bash .run_local_tests.sh on a 2 gpu machine

@williamFalcon
Copy link
Contributor

maybe it’s somehow loading a different checkpoint?? check the file cache clearing

@williamFalcon
Copy link
Contributor

@neggert

@Borda Borda added the help wanted Open to be worked on label Feb 18, 2020
@williamFalcon williamFalcon merged commit 054a353 into Lightning-AI:master Feb 18, 2020
@williamFalcon
Copy link
Contributor

ok, merged. @Borda mentioned this passed on his machine. we can dig into it deeper if it causes problems.

great addition!

@peteriz
Copy link
Author

peteriz commented Feb 18, 2020

@williamFalcon @Borda Thanks for reviewing. Expect more NLP model related contributions 🚀

@williamFalcon williamFalcon changed the title Added max number of steps in Trainer Added max/min number of steps in Trainer Feb 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement help wanted Open to be worked on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Step-wise processing, better support for IterableDataset, and others
5 participants