Added max/min number of steps in Trainer #728
Conversation
@peteriz good addition. please update docs and add a test
@williamFalcon @Borda
@peteriz awesome! mind adding a test for this?
No problem.
A test for this would run the trainer with all 4 of those cases, so min_steps, max_steps, and then make sure the step counts are correct. Edit: to make sure the test is fast, maybe set a limit on how much training data is used? 10%?
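A minimal sketch of what such a test could look like (the DummyModel fixture and the train_percent_check data-limiting argument are assumptions here; the actual test added in this PR may use different fixtures and option names):

```python
from pytorch_lightning import Trainer


def test_trainer_max_steps_and_min_steps():
    """Sketch: training stops at max_steps and never ends before min_steps."""
    model = DummyModel()  # hypothetical minimal LightningModule test fixture

    # max_steps should win over max_epochs
    trainer = Trainer(
        max_epochs=2,             # would normally run for 2 full epochs...
        max_steps=25,             # ...but must stop after 25 optimizer steps
        train_percent_check=0.1,  # assumed option to keep the test fast (10% of data)
    )
    trainer.fit(model)
    assert trainer.global_step == 25, "training did not stop at max_steps"

    # min_steps should keep training alive past e.g. an early-stopping trigger
    trainer = Trainer(
        min_steps=100,            # train for at least 100 steps
        max_epochs=5,
        train_percent_check=0.1,
    )
    trainer.fit(model)
    assert trainer.global_step >= 100, "training ended before min_steps"
```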
My question was about where to place the test.
yup! edited my original comment
@williamFalcon I pushed the test, I hope it is okay.
Failed on GPU:
I thought the Travis CI pass was okay. I'll have a look.
Unfortunately, it is hard to get free CI with GPUs for full testing, but we are working on it... see #486
still failing:
@williamFalcon I could not reproduce that error on my GPU machine.
@peteriz can you rebase master so we can see if tests pass now?
Fixed. Could you squash my commits when merging?
sure, that's what we always do... :]
Almost there, just a few minor points, pls :] Thx for your patience...
Hello @peteriz! Thanks for updating this PR.
Comment last updated at 2020-02-18 13:11:53 UTC
Not sure what's happening here. Likely the global steps are actually wrong, or perhaps this isn't accounting for batch size in your num_train_samples * trainer_options['max_epochs'] calculation?
@williamFalcon Okay, I changed the assert so it pulls max_epochs from the trainer.
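For context, a sketch of the arithmetic behind the batch-size point raised above (all numbers here are made up purely for illustration; the actual test fixture may already count batches rather than samples):

```python
import math

# Hypothetical values, only to show why batch size matters:
num_train_samples = 640
batch_size = 32
max_epochs = 2

# global_step advances once per batch, not once per sample, so the
# expected count has to divide by the batch size first.
batches_per_epoch = math.ceil(num_train_samples / batch_size)  # 20
expected_global_steps = batches_per_epoch * max_epochs         # 40

# The naive num_train_samples * max_epochs formula would give 1280,
# and any assert against trainer.global_step would fail.
```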
@peteriz this is still failing...
@williamFalcon I can split the asserts back out, but I was not able to reproduce this error on my Mac (Python 3.7) or on my other machines, CPU-only or with 1/2/4 GPUs.
@Borda @williamFalcon What can we do to push this PR forward? It would be great to make this happen, since I'm planning on integrating pytorch-lightning into our NLP Architect repo.
Agreed! This is a critical PR for us. So maybe think about what could cause the difference between machines, and make sure you're running the tests correctly. I do bash .run_local_tests.sh on a 2-GPU machine.
Maybe it's somehow loading a different checkpoint? Check that the file cache is being cleared.
OK, merged. @Borda mentioned this passed on his machine. We can dig into it deeper if it causes problems. Great addition!
@williamFalcon @Borda Thanks for reviewing. Expect more NLP-model-related contributions 🚀
Before submitting
What does this PR do?
Partially Fixes #640
I added a max_steps argument to Trainer for stopping the training process once max_steps has been reached. This feature is highly desired for Transformer-based models (such as BERT, XLNet, etc.) where a warm-up phase and LR decay are required. I will open future PRs covering step-wise processing (related to the scheduler).
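As an illustration, a minimal usage sketch of the new argument for a BERT-style warm-up/decay setup (MyTransformerModule and its warmup_steps/total_steps parameters are placeholder names; only the max_steps/min_steps Trainer arguments come from this PR):

```python
from pytorch_lightning import Trainer

# The same step budget drives both the Trainer stopping point and the
# LR schedule, which is the usual setup for BERT/XLNet fine-tuning.
total_steps = 10_000

model = MyTransformerModule(              # hypothetical LightningModule
    warmup_steps=int(0.1 * total_steps),  # linear warm-up over the first 10%
    total_steps=total_steps,              # used by the LR decay schedule
)

trainer = Trainer(
    max_steps=total_steps,  # stop after exactly total_steps optimizer steps
    min_steps=None,         # optional lower bound, e.g. to outlast early stopping
)
trainer.fit(model)
```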
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.