Added max/min number of steps in Trainer #728
Conversation
@peteriz good addition. please update docs and add a test
@williamFalcon @Borda
@peteriz awesome! mind adding a test for this?
No problem.
A test for this would run the trainer with all 4 of those cases, so min_steps, max_steps, and then make sure the step counts are correct. Edit: to make sure the test is fast, maybe set a limit on how much training data is used? 10%?
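A minimal sketch of what such a test could look like (the DummyModel fixture and the train_percent_check data-limiting argument are assumptions here; the actual test added in this PR may use different fixtures and option names):

```python
from pytorch_lightning import Trainer


def test_trainer_max_steps_and_min_steps():
    """Sketch: training stops at max_steps and never ends before min_steps."""
    model = DummyModel()  # hypothetical minimal LightningModule test fixture

    # max_steps should win over max_epochs
    trainer = Trainer(
        max_epochs=2,             # would normally run for 2 full epochs...
        max_steps=25,             # ...but must stop after 25 optimizer steps
        train_percent_check=0.1,  # assumed option to keep the test fast (10% of data)
    )
    trainer.fit(model)
    assert trainer.global_step == 25, "training did not stop at max_steps"

    # min_steps should keep training alive past e.g. an early-stopping trigger
    trainer = Trainer(
        min_steps=100,            # train for at least 100 steps
        max_epochs=5,
        train_percent_check=0.1,
    )
    trainer.fit(model)
    assert trainer.global_step >= 100, "training ended before min_steps"
```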
My question was about where to place the test.
yup! edited my original comment
@williamFalcon I pushed the test, I hope it is okay.
Failed on GPU:
I thought the Travis CI pass was okay. I'll have a look.
Unfortunately, it is hard to get free CI with GPUs for full testing, but we are working on it... see #486
still failing:
@williamFalcon I could not reproduce that error on my GPU machine.
@peteriz can you rebase master so we can see if tests pass now?
Fixed. Could you squash my commits when merging?
sure, that's what we always do... :]
Almost there, just a few minor points, pls :] Thx for your patience...
Hello @peteriz! Thanks for updating this PR.
Comment last updated at 2020-02-18 13:11:53 UTC
Not sure what's happening here. Likely the global steps are actually wrong, or perhaps this isn't accounting for batch size in your num_train_samples * trainer_options['max_epochs'] calculation?
@williamFalcon Okay, I changed the assert so it pulls max_epochs from the trainer.
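For context, a sketch of the arithmetic behind the batch-size point raised above (all numbers here are made up purely for illustration; the actual test fixture may already count batches rather than samples):

```python
import math

# Hypothetical values, only to show why batch size matters:
num_train_samples = 640
batch_size = 32
max_epochs = 2

# global_step advances once per batch, not once per sample, so the
# expected count has to divide by the batch size first.
batches_per_epoch = math.ceil(num_train_samples / batch_size)  # 20
expected_global_steps = batches_per_epoch * max_epochs         # 40

# The naive num_train_samples * max_epochs formula would give 1280,
# and any assert against trainer.global_step would fail.
```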
@peteriz this is still failing...
@williamFalcon I can split the asserts back out, but I was not able to reproduce this error on my Mac (Python 3.7) or on my other machines, CPU-only or with 1/2/4 GPUs.
@Borda @williamFalcon What can we do to push this PR forward? It would be great to make this happen, since I'm planning on integrating pytorch-lightning into our NLP Architect repo.
Agreed! This is a critical PR for us. So maybe think about what could cause the difference between machines, and make sure you're running the tests correctly. I do bash .run_local_tests.sh on a 2-GPU machine.
Maybe it's somehow loading a different checkpoint? Check that the file cache is being cleared.
OK, merged. @Borda mentioned this passed on his machine. We can dig into it deeper if it causes problems. Great addition!
@williamFalcon @Borda Thanks for reviewing. Expect more NLP-model-related contributions 🚀
Before submitting
What does this PR do?
Partially Fixes #640
I added a max_steps argument to Trainer for stopping the training process once max_steps has been reached. This feature is highly desired for Transformer-based models (such as BERT, XLNet, etc.) where a warm-up phase and LR decay are required. I will open future PRs covering step-wise processing (related to the scheduler).
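As an illustration, a minimal usage sketch of the new argument for a BERT-style warm-up/decay setup (MyTransformerModule and its warmup_steps/total_steps parameters are placeholder names; only the max_steps/min_steps Trainer arguments come from this PR):

```python
from pytorch_lightning import Trainer

# The same step budget drives both the Trainer stopping point and the
# LR schedule, which is the usual setup for BERT/XLNet fine-tuning.
total_steps = 10_000

model = MyTransformerModule(              # hypothetical LightningModule
    warmup_steps=int(0.1 * total_steps),  # linear warm-up over the first 10%
    total_steps=total_steps,              # used by the LR decay schedule
)

trainer = Trainer(
    max_steps=total_steps,  # stop after exactly total_steps optimizer steps
    min_steps=None,         # optional lower bound, e.g. to outlast early stopping
)
trainer.fit(model)
```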
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.