
feat: allow validation after N training steps #461

Merged: 2 commits into main on Jun 13, 2024

Conversation

@SamuelLarkin (Collaborator) commented Jun 12, 2024

TODO

  • Is val_check_interval available since PyTorch Lightning 2.0.0, which is the version that EveryVoice currently requires?
  • Does it save a checkpoint each time it's validating?

PR Goal?

Change to save checkpoint and run validation on N steps instead of epochs.
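As a rough illustration of the semantics involved (names below are illustrative, not EveryVoice internals): PyTorch Lightning's `val_check_interval` accepts an int, meaning "validate every N training batches", or a float in (0, 1], meaning a fraction of an epoch (1.0 validates at the end of every epoch, the old behavior). A minimal sketch, assuming the interval divides the epoch evenly:

```python
# Sketch of `val_check_interval` semantics; `validation_steps` is a
# made-up helper for illustration, not part of EveryVoice or Lightning.
def validation_steps(steps_per_epoch: int, max_epochs: int, val_check_interval):
    """Return the global training steps at which validation would run."""
    if isinstance(val_check_interval, float):
        # A float is a fraction of an epoch (1.0 == end of each epoch).
        interval = max(1, int(steps_per_epoch * val_check_interval))
    else:
        # An int counts training batches directly.
        interval = val_check_interval
    total_steps = steps_per_epoch * max_epochs
    return [step for step in range(1, total_steps + 1) if step % interval == 0]

# With 90 steps per epoch, 3 epochs, and an interval of 9 (the test
# command in this PR), validation runs at steps 9, 18, ..., 270:
print(validation_steps(90, 3, 9)[:3])   # [9, 18, 27]
print(validation_steps(90, 3, 1.0))     # [90, 180, 270]  (old default)
```

This matches the checkpoint listing further down, where checkpoints appear at step multiples of 9.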

Fixes?

Fixes: #204

Feedback sought?

Merge request.

Priority?

Part of alpha so medium/high.

Tests added?

No

How to test?

Command

everyvoice train text-to-spec \
  config/everyvoice-text-to-spec.yaml \
  --config-args training.max_epochs=3 \
  --config-args training.val_check_interval=9

We can see a checkpoint every 9 steps:

lsd logs_and_checkpoints/FeaturePredictionExperiment/base/checkpoints/
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:38 2024  epoch=0-step=54.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:41 2024  epoch=0-step=63.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:44 2024  epoch=0-step=72.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:47 2024  epoch=0-step=81.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:50 2024  epoch=0-step=90.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:56 2024  epoch=1-step=108.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:59 2024  epoch=1-step=117.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:02 2024  epoch=1-step=126.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:05 2024  epoch=1-step=135.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:08 2024  epoch=1-step=144.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:10 2024  epoch=1-step=153.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:13 2024  epoch=1-step=162.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:16 2024  epoch=1-step=171.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:19 2024  epoch=1-step=180.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:25:53 2024  epoch=1-step=99.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:22 2024  epoch=2-step=189.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:25 2024  epoch=2-step=198.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:28 2024  epoch=2-step=207.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:31 2024  epoch=2-step=216.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:34 2024  epoch=2-step=225.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:37 2024  epoch=2-step=234.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:40 2024  epoch=2-step=243.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:43 2024  epoch=2-step=252.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:46 2024  epoch=2-step=261.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:49 2024  epoch=2-step=270-v1.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:48 2024  epoch=2-step=270.ckpt
.rw-rw---- sam037 nrc_ict 159 MB Wed Jun 12 10:26:49 2024  last.ckpt

Taking pitch_loss as an example, we can see that a validation occurred every 9 training steps.

[image: pitch_loss curve with validation points every 9 training steps]

Confidence?

Good

Version change?

Don't think so, because if training.val_check_interval is missing from config/everyvoice-text-to-spec.yaml, it will default to 1.0 just like before.

Related PRs?

@SamuelLarkin SamuelLarkin changed the title [WIP]feat: allow validation after N training steps feat: allow validation after N training steps Jun 12, 2024
@roedoejet (Member) left a comment

Thanks for this Sam - that was quick! I find the difference between check_val_every_n_epoch and val_check_interval a bit confusing. Looking at this documentation, I think we need to account for the scenario where the value of val_check_interval is greater than the number of steps in an epoch. In that scenario, check_val_every_n_epoch should be None. In fact, I think that's what our default should be: change check_val_every_n_epoch to None and then set val_check_interval to, say, 1000. What do you think?
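The interaction described here can be sketched as a small check that roughly paraphrases the constraint Lightning enforces (the real check lives inside the Trainer; the function and argument names here are illustrative):

```python
# Rough paraphrase of Lightning's constraint: an int `val_check_interval`
# may not exceed one epoch's batches unless `check_val_every_n_epoch` is
# None, in which case the interval counts total batches across epochs.
def validate_interval_config(val_check_interval, batches_per_epoch,
                             check_val_every_n_epoch=1):
    if (check_val_every_n_epoch is not None
            and isinstance(val_check_interval, int)
            and val_check_interval > batches_per_epoch):
        raise ValueError(
            f"`val_check_interval` ({val_check_interval}) must be <= the "
            f"number of training batches ({batches_per_epoch}); set "
            "`check_val_every_n_epoch=None` to count across epochs."
        )

validate_interval_config(9, 90)           # fine: 9 <= 90 batches/epoch
validate_interval_config(100, 90, None)   # fine: counting across epochs
# validate_interval_config(100, 90)       # would raise ValueError
```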

@SamuelLarkin (Collaborator, Author)

@roedoejet , I also had to read val_check_interval's documentation multiple times. It is confusing at first.

We could also rename val_check_interval to something more descriptive or simply change its documentation in EV.

I think it is valuable to know what performance we are getting when we have trained on exactly all examples the same number of times aka at the end of an epoch.

But there is the danger that the user sets val_check_interval too high while check_val_every_n_epoch is not None, and gets an error message from pytorch_lightning. I tried it and got:

ValueError:  `val_check_interval` (100) must be less than or equal to the number of the training batches (90). If you want to disable validation set `limit_val_batches` to 0.0 instead. If you want to validate based on the total training batches, set `check_val_every_n_epoch=None`.

I'll set check_val_every_n_epoch to None as you suggested.

But I think val_check_interval = 1000 is too high, as it implies (batch_size = 16) * (val_check_interval = 1000) = 16,000 examples, which is more than all the audio recordings we have in CV and also bigger than LJSpeech (13,100).
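The arithmetic above, worked out for both candidate intervals (the LJSpeech figure of 13,100 clips is from the comment; the variable names are just for this sketch):

```python
batch_size = 16
ljspeech_clips = 13_100  # total utterances in LJSpeech, per the comment above

for val_check_interval in (1000, 500):
    examples_between_validations = batch_size * val_check_interval
    print(val_check_interval, examples_between_validations,
          examples_between_validations > ljspeech_clips)
# interval 1000 -> 16,000 examples (more than one full pass over LJSpeech)
# interval 500  ->  8,000 examples (well under one pass)
```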

@roedoejet (Member) left a comment

Looks good! I just said 1000 because that's what HiFiGAN and FastSpeech2 use by default, but yes, I think 500 is fine. I updated the schemas, which is needed to pass CI.


CLI load time: 0:00.25
Pull Request HEAD: bb3643e535a12f2ea176915c15deaebada4636df
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package
import time:       261 |     103005 | typer


codecov bot commented Jun 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.15%. Comparing base (e044cd5) to head (bb3643e).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #461   +/-   ##
=======================================
  Coverage   74.15%   74.15%           
=======================================
  Files          44       44           
  Lines        2766     2767    +1     
  Branches      428      428           
=======================================
+ Hits         2051     2052    +1     
  Misses        630      630           
  Partials       85       85           


@SamuelLarkin SamuelLarkin merged commit 1beb279 into main Jun 13, 2024
4 checks passed
@SamuelLarkin SamuelLarkin deleted the dev.sl/204_validation_step branch June 13, 2024 19:41

Successfully merging this pull request may close these issues.

Change to save checkpoint and run validation on N steps instead of epochs
2 participants