Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add WORLD pitch estimators and F0 range as hyperparameters #149

Merged
merged 23 commits into from
Nov 20, 2023

Conversation

UtaUtaUtau
Copy link

  1. Add WORLD pitch estimators
    • Adds harvest and dio as other options for the pe hyperparameter. WORLD is used for the breathiness embed, so I thought it would make sense to also have their pitch estimation algorithms supported.
    • harvest is slow but very accurate. There are some weird F0 values in voiced-unvoiced boundaries, but I found that it does not affect the model quality much.
    • dio is faster but is not too accurate. Only added as an option for completeness.
  2. Add F0 range as hyperparameters
    • f0_min and f0_max is added to the base config so that F0 detection range is more customizable. get_pitch_parselmouth was also updated to take these new parameters into account.

i also removed hardcoded F0 ranges because what the heck is that 800 Hz max pitch in parselmouth that is way too low
okay maybe don't add ur own flare in the readme if u actually want to create a pull req
i saw it in the parselmouth thing might as well put it in to make sure
for some reason pyworld only likes float64?
why isn't it a hyperparameter in the first place
i think world is p accurate with the frames stuff but it's just to ensure
it's just to be similar to the parselmouth one.. it makes sense to not center the F0 after all
@yqzhishen yqzhishen requested a review from yxlllc October 18, 2023 16:37
@yxlllc
Copy link
Collaborator

yxlllc commented Oct 18, 2023

I hope that the default f0_max=800 is because the current upper limit of the vocoder's range G5 corresponds to 800Hz, which means that data with a higher range is inherently difficult to synthesize, and a smaller f0 range can improve some accuracy. . .
We will update the vocoder in the future, and we will consider changing this default value at that time, but it is best to change it back now.

@yqzhishen
Copy link
Member

Some of my own opinions and obersavation that may be necessary to share with you:

  1. Harvest may be too slow compared to parselmouth, and may not be as accurate as the NN-based RMVPE; on GPU or some high-end CPUs, the speed of harvest can be even slower than RMVPE, which benefits a lot from parallel computation.
  2. As for dio, I think accuracy should be preferred instead of speed when running preprocessing.
  3. These two algorithms are from WORLD, but this does not mean they are the most suitable pitch extractor for further WORLD analysis. In one of my former experiment, parselmouth+WORLD can outperform dio when extracting aperiodic part from waveform.
  4. f0_min and f0_max are only used by parselmouth before. Actually, parselmouth does not take them as hard limits; instead, they change the behavior of it somehow, even in unpredictable ways, especially f0_min... And setting an f0_max does not directly prevents parselmouth from producing f0 higher than that limit.

Anyway, I am glad to merge this PR as long as the code itself is correct. But I think it is necessary to point out the advantages and disadvantages of each pitch extractor as well as their differences in the documentation.

@hrukalive
Copy link

There are many f0_min/max in the codebase, should we also move this setting into binarization args specifically?

@UtaUtaUtau
Copy link
Author

I hope that the default f0_max=800 is because the current upper limit of the vocoder's range G5 corresponds to 800Hz, which means that data with a higher range is inherently difficult to synthesize, and a smaller f0 range can improve some accuracy. . .
We will update the vocoder in the future, and we will consider changing this default value at that time, but it is best to change it back now.

I think it would be best to change it to 800 by default if it's by accordance with the current vocoder, yes.

There are many f0_min/max in the codebase, should we also move this setting into binarization args specifically?

I think so? I put it outside of binarization args thinking that it would be better if it was closer to where pe is, though it wouldn't be that bad to have it somewhere else...

As with the advantages and disadvantages I can look more into it later as I have some other things to do in the meantime. Thank you for the quick responses!

@UtaUtaUtau
Copy link
Author

Okay I ran a test over the MIR-1K dataset to compare Harvest, DIO and Parselmouth and these are my findings using metrics from mir_eval:

  1. Harvest rarely misses adding pitch in voiced areas 99.78 ± 0.696 Voicing Recall, but that also results in less accurate unvoiced areas 38.47 ± 15.341 Voicing False Alarm
  2. All of them perform roughly the same in Raw Pitch Accuracy and Raw Chroma Accuracy, but Harvest has the highest performance out of the three 93.97 ± 4.013 RPA, 94.25 ± 2.883 RCA. In the RMVPE paper, RMVPE still outperforms in both categories, but I think it's worth it to have a second option that can achieve similar results.
  3. DIO is definitely falling behind even compared to Parselmouth... It can probably be removed.

That's for the qualitative side... Now for my personal views.

I think the overseas vocal synth community doesn't really mind the extra CPU overhead that Harvest gives, as a lot of the people helping in training their models usually do pre-processing locally. I personally don't mind this either, as what I care about truly is the quality of the results, and I am certain that misdetecting unvoiced areas isn't as bad as misdetecing voiced areas in pitch.

I also know that the biggest RVC fork (Mangio-RVC) preferred Harvest before they implemented RMVPE for more accurate pitch estimation. Research papers have also used Harvest for pitch estimation — most notable one being RefineGAN — which is what ACE Studio uses for its vocoder if I remember correctly. From this, I do think there's a lot of merit to adding Harvest in DiffSinger as an option for another pitch estimator. That's all thank you!

@yqzhishen
Copy link
Member

Then I agree that we should remove dio and preserv harvest. By the way, are you sure that harvest is faster than RMVPE on CPU?

@UtaUtaUtau
Copy link
Author

From what I remember in Mangio-RVC, RMVPE on CPU is still faster than Harvest...

@yqzhishen
Copy link
Member

Hi there,

Is this PR still active? It can be merged once the suggestions above are applied.

@UtaUtaUtau
Copy link
Author

Yes! I'm just not quite sure which ones to change other than removing Dio...

@yqzhishen
Copy link
Member

My suggestions:

  • Remove dio because it cannot produce satisfactory results
  • Restore f0_min and f0_max to 65 and 800, keep parselmouth as the default PE
  • If you would like, please add proper documentation to tell the users differences among the PEs, and how to switch and configure them (or leaving it for me to complete the docs after merging is okay)

@yqzhishen yqzhishen merged commit 931df27 into openvpi:main Nov 20, 2023
UtaUtaUtau added a commit to UtaUtaUtau/DiffSinger that referenced this pull request Nov 20, 2023
yqzhishen added a commit that referenced this pull request Nov 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants