
[RFC] Refactor validation logic API to naturally support step-based training duration #12000

Closed
nikvaessen opened this issue Feb 19, 2022 · 9 comments
Labels: design (Includes a design discussion), refactor, trainer: argument, won't fix (This will not be worked on)

Comments

@nikvaessen (Contributor) commented Feb 19, 2022

Proposed refactor

Currently, users can control when validation happens with the following arguments in the Trainer object:

Trainer(
  val_check_interval: Union[int, float] = float(1.0),
  check_val_every_n_epoch: int = int(1)
)

I propose to refactor the API to be similar to the learning rate schedule config:

Trainer(
  validation_interval: "step" | "epoch" = "epoch",
  validation_frequency: Union[int, float] = int(1)
)

This has 4 modes of operation:

  1. validation_interval=step, validation_frequency=int(v), requires max_steps=int(N): We train for a maximum of N steps and validate every v steps, with v << N and v > 0, N > 0. For example, v=100, N=1000 would validate at step 100, 200, ..., 1000.

  2. validation_interval=step, validation_frequency=float(v), requires max_steps=int(N): We train for a maximum of N steps and validate every N*v steps, with 0 < v < 1 and N > 0. For example, v=0.1, N=1000 would validate at step 100, 200, ..., 1000.

  3. validation_interval=epoch, validation_frequency=int(v), requires max_epochs=int(N): We train for a maximum of N epochs and validate every v-th epoch, with v << N, and v > 0, N > 0. For example, v=2, N=100 would validate at the end of epoch 2, 4, ..., 100.

  4. validation_interval=epoch, validation_frequency=float(v), requires max_epochs=int(N): We train for a maximum of N epochs and validate multiple times during each epoch, according to v*len(dataloader), with 0 < v < 1, and N > 0. For example, v=0.5, N=100 would validate twice each epoch, once in the middle, and once at the end.

We would default to validation_interval=epoch and validation_frequency=1 to mimic the current default behavior.
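
For concreteness, here is a sketch of what the four modes above could look like at the call site. Note that validation_interval and validation_frequency are the arguments proposed in this RFC, not an existing Lightning API:

from pytorch_lightning import Trainer

# Mode 1: validate every 100 steps while training for at most 1000 steps.
trainer = Trainer(max_steps=1000, validation_interval="step", validation_frequency=100)

# Mode 2: validate every 0.1 * 1000 = 100 steps.
trainer = Trainer(max_steps=1000, validation_interval="step", validation_frequency=0.1)

# Mode 3: validate at the end of every 2nd epoch while training for at most 100 epochs.
trainer = Trainer(max_epochs=100, validation_interval="epoch", validation_frequency=2)

# Mode 4: validate twice per epoch (mid-epoch and at the end of the epoch).
trainer = Trainer(max_epochs=100, validation_interval="epoch", validation_frequency=0.5)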

Motivation

The current API was designed with an "epoch"-based training mindset common in the earlier days of deep learning. With the growth of dataset sizes and the popularity of cyclic or staged learning rate schedules, it has become more common to express training length in terms of steps instead of epochs. However, choosing a correct validation schedule when using max_steps=N with PyTorch Lightning is not intuitive, and users must still know the number of steps in each epoch to avoid mistakes.
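
To illustrate the problem, here is a hypothetical sketch assuming roughly 2,500 training batches per epoch and the current semantics of val_check_interval, where an integer value is counted within each epoch:

from pytorch_lightning import Trainer

# Goal: train for 10,000 steps and validate every 1,000 global steps.
trainer = Trainer(
    max_steps=10_000,
    val_check_interval=1_000,  # counted within each epoch, not across epochs
)
# With ~2,500 batches per epoch, validation runs at batches 1,000 and 2,000 of
# every epoch, i.e. at global steps 1,000, 2,000, 3,500, 4,500, ... rather than
# every 1,000 steps, so the user must still reason about the epoch length.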

Pitch

If we change to this proposed API, all current use cases will still be supported, and it will become much easier to reason about when validation happens.

Additional context

A stop-gap solution to support the behavior of validation_interval=step, validation_frequency=int(v) with the current API is provided in #11993.

cc @justusschock @awaelchli @rohitgr7 @tchaton @Borda @kaushikb11

@yangyi02 commented Apr 4, 2022

Really want this functionality! Thanks for proposing that!

There is also a similar thread here: #2534

@carmocca (Contributor) commented Apr 6, 2022

Have you had a look at the proposal in #8135 (comment)? It has the advantage of not introducing new arguments.

@nikvaessen (Contributor, Author) commented Apr 6, 2022

Yes, PR #11993 implements the first option you mention in #8135 (comment). However, I believe that solution doesn't adhere to the design principles of "Simple internal code" and "Simple external API", which is why I also proposed this RFC.

@carmocca added the design label Apr 6, 2022

@carmocca (Contributor) commented Apr 6, 2022

Thanks for your critical thinking! Let's cc @PyTorchLightning/core-lightning for thoughts on

  a. relying on the value types to differentiate between options, as described in #8135 (comment)
  b. deprecating the existing API and adding new explicit flags, as proposed in the top post.

If (b) is chosen, I would suggest keeping val_check_interval to match what you wrote as validation_frequency, and instead finding a less ambiguous name than validation_interval: val_granularity? val_cadence?

@nikvaessen (Contributor, Author) commented Apr 6, 2022

If (b) is chosen, I think the naming should be consistent with the learning rate schedule configuration: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#configure-optimizers

Currently, interval is used there to choose between epoch and step, and frequency is used to indicate the number of steps/epochs. I do prefer granularity over interval, as I feel frequency and interval have similar meanings.
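
For reference, the scheduler configuration mentioned above looks roughly like this (the specific optimizer and scheduler are only illustrative):

import torch

# Inside a LightningModule
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000)
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            "interval": "step",  # "epoch" or "step"
            "frequency": 1,      # act on the scheduler every 1 interval
        },
    }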

@rohitgr7 (Contributor) commented Apr 6, 2022

What will be the configuration for check_val_every_n_epoch=4 and val_check_interval=0.25, i.e. do validation 4 times at every 4th training epoch?

validation_interval=step, validation_frequency=float(v), requires max_steps=int(N): We train for a maximum of N steps and validate every N*v steps

For this, I think it can be configured easily with just validation_frequency=N*v, since the user already knows the value of N.

I think we can just extend val_check_interval (possibly rename this) to support int values which can operate over the number of training batches processed overall.

Something like:

  1. if val_check_interval is float, consider the number of training batches per epoch
  2. if val_check_interval is int, consider overall training batches already processed
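
In code, that distinction might look like the following sketch (the int-counts-overall-batches behavior is the proposed extension, not what val_check_interval does today):

from pytorch_lightning import Trainer

# float: validate 4 times per epoch (every 25% of the epoch's training batches)
trainer = Trainer(val_check_interval=0.25)

# int: validate every 2,000 training batches, counted across epoch boundaries
trainer = Trainer(val_check_interval=2_000)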

@carmocca (Contributor) commented Apr 7, 2022

What will be the configuration for check_val_every_n_epoch=4 and val_check_interval=0.25, i.e. do validation 4 times at every 4th training epoch?

Yes

I think we can just extend val_check_interval (possibly rename this) to support int values which can operate over the number of training batches processed overall.

That's what I propose in #8135 (comment) by setting check_val_every_n_epoch=None (the default is 1).
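
A rough sketch of that configuration, assuming check_val_every_n_epoch=None switches val_check_interval to counting training batches across epoch boundaries:

from pytorch_lightning import Trainer

trainer = Trainer(
    max_steps=10_000,
    val_check_interval=1_000,      # every 1,000 training batches...
    check_val_every_n_epoch=None,  # ...counted across epochs, not per epoch
)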

stale bot commented Jun 6, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix label Jun 6, 2022
@carmocca (Contributor) commented Jun 6, 2022

Closed by #11993
