Add WORLD pitch estimators and F0 range as hyperparameters #149
i also removed hardcoded F0 ranges because what the heck is that 800 Hz max pitch in parselmouth that is way too low
okay maybe don't add ur own flare in the readme if u actually want to create a pull req
i saw it in the parselmouth thing might as well put it in to make sure
oops
for some reason pyworld only likes float64?
why isn't it a hyperparameter in the first place
i think world is p accurate with the frames stuff but it's just to ensure
it's just to be similar to the parselmouth one.. it makes sense to not center the F0 after all
I believe the default f0_max=800 is there because the current upper limit of the vocoder's range, G5, corresponds to roughly 800 Hz (about 784 Hz), which means that data with a higher range is inherently difficult to synthesize, and a smaller F0 range can improve accuracy somewhat.
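For reference, the G5 figure above can be checked with the standard equal-temperament formula. This is only a minimal sketch, assuming A4 = 440 Hz and MIDI note numbering (G5 = note 79); the function names are illustrative and not taken from the codebase.

```python
import math

A4_HZ = 440.0    # assumed tuning reference
A4_MIDI = 69     # MIDI note number of A4


def midi_to_hz(note: float) -> float:
    """Convert a MIDI note number to frequency in Hz (12-tone equal temperament)."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq: float) -> float:
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    return A4_MIDI + 12.0 * math.log2(freq / A4_HZ)


print(round(midi_to_hz(79), 2))     # G5, about 783.99 Hz
print(round(hz_to_midi(800.0), 2))  # about 79.35, roughly a third of a semitone above G5
```

So an 800 Hz ceiling sits just above G5, and any voiced frame above that would fall outside the detection range.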
Some of my own opinions and observations that may be worth sharing with you:
Anyway, I am glad to merge this PR as long as the code itself is correct. But I think it is necessary to point out the advantages and disadvantages of each pitch extractor, as well as their differences, in the documentation.
There are many |
I think it would be best to change it to 800 by default if that's in accordance with the current vocoder, yes.
I think so? I put it outside of binarization args thinking that it would be better if it was closer to where As with the advantages and disadvantages I can look more into it later as I have some other things to do in the meantime. Thank you for the quick responses!
Okay I ran a test over the MIR-1K dataset to compare Harvest, DIO and Parselmouth and these are my findings using metrics from
That's for the qualitative side... Now for my personal views. I think the overseas vocal synth community doesn't really mind the extra CPU overhead that Harvest adds, as a lot of the people helping to train their models usually do pre-processing locally. I personally don't mind this either, as what I truly care about is the quality of the results, and I am certain that misdetecting unvoiced areas isn't as bad as misdetecting voiced areas in pitch. I also know that the biggest RVC fork (Mangio-RVC) preferred Harvest before they implemented RMVPE for more accurate pitch estimation. Research papers have also used Harvest for pitch estimation, the most notable one being RefineGAN, which is what ACE Studio uses for its vocoder if I remember correctly. From this, I do think there's a lot of merit to adding Harvest to DiffSinger as an option for another pitch estimator. That's all, thank you!
Then I agree that we should remove dio and keep harvest. By the way, are you sure that harvest is faster than RMVPE on CPU?
From what I remember in Mangio-RVC, RMVPE on CPU is still faster than Harvest...
Hi there, is this PR still active? It can be merged once the suggestions above are applied.
Yes! I'm just not quite sure which ones to change other than removing Dio...
My suggestions:
…penvpi#149)" This reverts commit 931df27.
Adds `harvest` and `dio` as other options for the `pe` hyperparameter. WORLD is used for the breathiness embed, so I thought it would make sense to also have their pitch estimation algorithms supported.
`harvest` is slow but very accurate. There are some weird F0 values at voiced-unvoiced boundaries, but I found that it does not affect the model quality much.
`dio` is faster but not very accurate. Only added as an option for completeness.
`f0_min` and `f0_max` are added to the base config so that the F0 detection range is more customizable. `get_pitch_parselmouth` was also updated to take these new parameters into account.
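As a rough illustration of what the description above amounts to, here is a minimal sketch of dispatching on `pe` with a configurable F0 range. It is not the code from this PR: the function names, the `f0_min=65.0` default, and the edge-padding strategy are assumptions made for the example. The pyworld calls (`harvest`, `dio`, `stonemask`) are the library's real functions, and the input is cast to float64 because pyworld expects double-precision arrays.

```python
def frame_period_ms(hop_size: int, sample_rate: int) -> float:
    """WORLD takes the frame period in milliseconds rather than a hop size in samples."""
    return 1000.0 * hop_size / sample_rate


def fit_length(f0, n_frames: int) -> list:
    """Trim or edge-pad an F0 sequence so it has exactly n_frames values (assumed strategy)."""
    f0 = list(f0)
    if len(f0) >= n_frames:
        return f0[:n_frames]
    pad_value = f0[-1] if f0 else 0.0
    return f0 + [pad_value] * (n_frames - len(f0))


def get_pitch_world(wav, sample_rate, hop_size, n_frames,
                    f0_min=65.0, f0_max=800.0, pe="harvest"):
    """Hypothetical helper: extract F0 with WORLD's Harvest or DIO within [f0_min, f0_max]."""
    if pe not in ("harvest", "dio"):
        raise ValueError(f"unsupported pitch extractor: {pe!r}")

    # Imported lazily so the helpers above work without these packages installed.
    import numpy as np
    import pyworld as pw

    x = np.ascontiguousarray(wav, dtype=np.float64)  # pyworld requires float64 input
    period = frame_period_ms(hop_size, sample_rate)

    if pe == "harvest":
        f0, _ = pw.harvest(x, sample_rate, f0_floor=f0_min, f0_ceil=f0_max,
                           frame_period=period)
    else:
        # DIO gives a coarse estimate; StoneMask refines it.
        f0, t = pw.dio(x, sample_rate, f0_floor=f0_min, f0_ceil=f0_max,
                       frame_period=period)
        f0 = pw.stonemask(x, f0, t, sample_rate)

    # Unvoiced frames are returned as 0 by WORLD; length is aligned to the target frames.
    return np.asarray(fit_length(f0, n_frames), dtype=np.float64)
```

With a mono waveform `wav` at 44100 Hz and a hop size of 512 (both assumed values), a call such as `get_pitch_world(wav, 44100, 512, n_frames, f0_min=65.0, f0_max=800.0, pe="harvest")` would return one F0 value per frame.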