[RFC] Refactor validation logic API to naturally support step-based training duration #12000
Comments
Really want this functionality! Thanks for proposing that! There is also a similar thread here: #2534
Have you had a look at the proposal in #8135 (comment)? It has the advantage of not introducing new arguments.
Yes, PR #11993 implements the first option you mention in #8135 (comment). But I believe that the PR/solution doesn't adhere to the design principles "Simple internal code" and "Simple external API", therefore I also proposed this RFC.
Thanks for your critical thinking! Let's cc @PyTorchLightning/core-lightning for thoughts on (a) relying on the value types to differentiate between options as described in #8135 (comment). If (b) is chosen, I would suggest keeping
If (b) is chosen, I think the naming should be consistent with the learning rate schedule configuration: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#configure-optimizers Currently, there
what will be the configuration for
For this, I think it can be configured easily just with `validation_frequency=`. I think we can just extend `val_check_interval` (possibly renaming it) to support int values which operate over the number of training batches processed overall. Something like:
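The comment above suggests extending the existing `val_check_interval` argument to accept an int counted over all training batches processed. A hypothetical usage sketch of that suggestion (this behavior is the proposal under discussion, not necessarily the released API):

```python
from pytorch_lightning import Trainer

# Hypothetical: val_check_interval as an int counted over the total number
# of training batches processed, as suggested above.
trainer = Trainer(
    max_steps=10_000,
    val_check_interval=500,  # validate every 500 training batches overall
)
```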
Yes
That's what I propose in #8135 (comment) by setting
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
Closed by #11993
Proposed refactor
Currently, users can control when validation happens with arguments on the `Trainer` object. I propose to refactor the API to be similar to the learning rate schedule config:
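A sketch of how the proposed configuration could look. The `validation_interval` and `validation_frequency` names come from this proposal and do not exist in the released `Trainer` API:

```python
from pytorch_lightning import Trainer

# Proposed (hypothetical) arguments -- not part of the released Trainer API.
trainer = Trainer(
    max_steps=1000,
    validation_interval="step",  # "step" or "epoch"
    validation_frequency=100,    # int: every v units; float in (0, 1): a fraction
)
```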
This has 4 modes of operation:

- `validation_interval=step`, `validation_frequency=int(v)`, requires `max_steps=int(N)`: we train for a maximum of `N` steps and validate every `v` steps, with `v << N` and `v > 0, N > 0`. For example, `v=100, N=1000` would validate at steps `100, 200, ..., 1000`.
- `validation_interval=step`, `validation_frequency=float(v)`, requires `max_steps=int(N)`: we train for a maximum of `N` steps and validate every `N*v` steps, with `0 < v < 1` and `N > 0`. For example, `v=0.1, N=1000` would validate at steps `100, 200, ..., 1000`.
- `validation_interval=epoch`, `validation_frequency=int(v)`, requires `max_epochs=int(N)`: we train for a maximum of `N` epochs and validate every `v`-th epoch, with `v << N` and `v > 0, N > 0`. For example, `v=2, N=100` would validate at the end of epochs `2, 4, ..., 100`.
- `validation_interval=epoch`, `validation_frequency=float(v)`, requires `max_epochs=int(N)`: we train for a maximum of `N` epochs and validate multiple times during each epoch, every `v*len(dataloader)` steps, with `0 < v < 1` and `N > 0`. For example, `v=0.5, N=100` would validate twice each epoch, once in the middle and once at the end.

We would default to `validation_interval=epoch` and `validation_frequency=1` to mimic the current default behavior.

Motivation
The current API was designed with an "epoch"-based training mindset common in the earlier days of deep learning. With the growth of dataset sizes and the popularity of cyclic or staged learning rate schedules, it has become more common to describe training length in terms of steps instead of epochs. However, choosing a correct validation schedule when using `max_steps=N` with PyTorch Lightning is not intuitive, and users must still be aware of the number of steps in each epoch in order to avoid mistakes.

Pitch
If we change to this proposed API, all current use cases will still be supported, and it will become much easier to reason about when validation will happen.
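The four modes above can be sketched as a single resolver that turns the proposed arguments into a validation period. This is a hypothetical helper illustrating the proposed semantics, not part of the actual PyTorch Lightning implementation; `steps_per_epoch` stands in for `len(dataloader)`:

```python
def resolve_validation_period(validation_interval, validation_frequency,
                              max_steps=None, max_epochs=None,
                              steps_per_epoch=None):
    """Return ("step" | "epoch", period): validate every `period` such units.

    Hypothetical sketch of the four proposed modes, under the assumptions
    stated in the RFC (float frequencies lie strictly in (0, 1)).
    """
    if validation_interval == "step":
        if not max_steps or max_steps <= 0:
            raise ValueError("step interval requires max_steps > 0")
        if isinstance(validation_frequency, float):
            # Mode 2: float v in (0, 1) -> validate every N*v steps.
            if not 0 < validation_frequency < 1:
                raise ValueError("float frequency must be in (0, 1)")
            return "step", int(max_steps * validation_frequency)
        # Mode 1: int v -> validate every v steps.
        if validation_frequency <= 0:
            raise ValueError("int frequency must be positive")
        return "step", validation_frequency
    if validation_interval == "epoch":
        if not max_epochs or max_epochs <= 0:
            raise ValueError("epoch interval requires max_epochs > 0")
        if isinstance(validation_frequency, float):
            # Mode 4: float v in (0, 1) -> validate every v*len(dataloader) steps.
            if not 0 < validation_frequency < 1:
                raise ValueError("float frequency must be in (0, 1)")
            if not steps_per_epoch:
                raise ValueError("float epoch frequency needs steps_per_epoch")
            return "step", int(steps_per_epoch * validation_frequency)
        # Mode 3: int v -> validate every v-th epoch.
        if validation_frequency <= 0:
            raise ValueError("int frequency must be positive")
        return "epoch", validation_frequency
    raise ValueError(f"unknown validation_interval: {validation_interval!r}")
```

For example, `resolve_validation_period("step", 0.1, max_steps=1000)` yields `("step", 100)`, matching the second mode's worked example above.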
Additional context
A stop-gap solution to support the behavior of `validation_interval=step`, `validation_frequency=int(v)` with the current API is provided in #11993.

cc @justusschock @awaelchli @rohitgr7 @tchaton @Borda @kaushikb11