Add WORLD pitch estimators and F0 range as hyperparameters #149

UtaUtaUtau · 2023-10-18T16:20:55Z

Add WORLD pitch estimators
- Adds harvest and dio as other options for the pe hyperparameter. WORLD is used for the breathiness embed, so I thought it would make sense to also have their pitch estimation algorithms supported.
- harvest is slow but very accurate. There are some weird F0 values in voiced-unvoiced boundaries, but I found that it does not affect the model quality much.
- dio is faster but is not too accurate. Only added as an option for completeness.
Add F0 range as hyperparameters
- f0_min and f0_max is added to the base config so that F0 detection range is more customizable. get_pitch_parselmouth was also updated to take these new parameters into account.

i also removed hardcoded F0 ranges because what the heck is that 800 Hz max pitch in parselmouth that is way too low

okay maybe don't add ur own flare in the readme if u actually want to create a pull req

i saw it in the parselmouth thing might as well put it in to make sure

oops

for some reason pyworld only likes float64?

why isn't it a hyperparameter in the first place

i think world is p accurate with the frames stuff but it's just to ensure

it's just to be similar to the parselmouth one.. it makes sense to not center the F0 after all

yeah

yxlllc · 2023-10-18T16:51:59Z

I hope that the default f0_max=800 is because the current upper limit of the vocoder's range G5 corresponds to 800Hz, which means that data with a higher range is inherently difficult to synthesize, and a smaller f0 range can improve some accuracy. . .
We will update the vocoder in the future, and we will consider changing this default value at that time, but it is best to change it back now.

yqzhishen · 2023-10-18T16:59:58Z

Some of my own opinions and obersavation that may be necessary to share with you:

Harvest may be too slow compared to parselmouth, and may not be as accurate as the NN-based RMVPE; on GPU or some high-end CPUs, the speed of harvest can be even slower than RMVPE, which benefits a lot from parallel computation.
As for dio, I think accuracy should be preferred instead of speed when running preprocessing.
These two algorithms are from WORLD, but this does not mean they are the most suitable pitch extractor for further WORLD analysis. In one of my former experiment, parselmouth+WORLD can outperform dio when extracting aperiodic part from waveform.
f0_min and f0_max are only used by parselmouth before. Actually, parselmouth does not take them as hard limits; instead, they change the behavior of it somehow, even in unpredictable ways, especially f0_min... And setting an f0_max does not directly prevents parselmouth from producing f0 higher than that limit.

Anyway, I am glad to merge this PR as long as the code itself is correct. But I think it is necessary to point out the advantages and disadvantages of each pitch extractor as well as their differences in the documentation.

hrukalive · 2023-10-18T18:19:45Z

There are many f0_min/max in the codebase, should we also move this setting into binarization args specifically?

UtaUtaUtau · 2023-10-18T20:46:12Z

I hope that the default f0_max=800 is because the current upper limit of the vocoder's range G5 corresponds to 800Hz, which means that data with a higher range is inherently difficult to synthesize, and a smaller f0 range can improve some accuracy. . .
We will update the vocoder in the future, and we will consider changing this default value at that time, but it is best to change it back now.

I think it would be best to change it to 800 by default if it's by accordance with the current vocoder, yes.

There are many f0_min/max in the codebase, should we also move this setting into binarization args specifically?

I think so? I put it outside of binarization args thinking that it would be better if it was closer to where pe is, though it wouldn't be that bad to have it somewhere else...

As with the advantages and disadvantages I can look more into it later as I have some other things to do in the meantime. Thank you for the quick responses!

UtaUtaUtau · 2023-10-21T11:39:51Z

Okay I ran a test over the MIR-1K dataset to compare Harvest, DIO and Parselmouth and these are my findings using metrics from mir_eval:

Harvest rarely misses adding pitch in voiced areas 99.78 ± 0.696 Voicing Recall, but that also results in less accurate unvoiced areas 38.47 ± 15.341 Voicing False Alarm
All of them perform roughly the same in Raw Pitch Accuracy and Raw Chroma Accuracy, but Harvest has the highest performance out of the three 93.97 ± 4.013 RPA, 94.25 ± 2.883 RCA. In the RMVPE paper, RMVPE still outperforms in both categories, but I think it's worth it to have a second option that can achieve similar results.
DIO is definitely falling behind even compared to Parselmouth... It can probably be removed.

That's for the qualitative side... Now for my personal views.

I think the overseas vocal synth community doesn't really mind the extra CPU overhead that Harvest gives, as a lot of the people helping in training their models usually do pre-processing locally. I personally don't mind this either, as what I care about truly is the quality of the results, and I am certain that misdetecting unvoiced areas isn't as bad as misdetecing voiced areas in pitch.

I also know that the biggest RVC fork (Mangio-RVC) preferred Harvest before they implemented RMVPE for more accurate pitch estimation. Research papers have also used Harvest for pitch estimation — most notable one being RefineGAN — which is what ACE Studio uses for its vocoder if I remember correctly. From this, I do think there's a lot of merit to adding Harvest in DiffSinger as an option for another pitch estimator. That's all thank you!

yqzhishen · 2023-10-21T18:02:09Z

Then I agree that we should remove dio and preserv harvest. By the way, are you sure that harvest is faster than RMVPE on CPU?

UtaUtaUtau · 2023-10-21T18:54:14Z

From what I remember in Mangio-RVC, RMVPE on CPU is still faster than Harvest...

yqzhishen · 2023-11-19T14:28:39Z

Hi there,

Is this PR still active? It can be merged once the suggestions above are applied.

UtaUtaUtau · 2023-11-19T20:36:05Z

Yes! I'm just not quite sure which ones to change other than removing Dio...

yqzhishen · 2023-11-20T04:46:32Z

My suggestions:

Remove dio because it cannot produce satisfactory results
Restore f0_min and f0_max to 65 and 800, keep parselmouth as the default PE
If you would like, please add proper documentation to tell the users differences among the PEs, and how to switch and configure them (or leaving it for me to complete the docs after merging is okay)

docs/BestPractices.md

…penvpi#149)" This reverts commit 931df27.

UtaUtaUtau added 19 commits August 13, 2023 23:40

Added WORLD pitch estimators

5fef0aa

i also removed hardcoded F0 ranges because what the heck is that 800 Hz max pitch in parselmouth that is way too low

Update README.md

0099737

okay maybe don't add ur own flare in the readme if u actually want to create a pull req

Merge branch 'openvpi:main' into main

3ed0506

Apply dtype change

e2c0da4

i saw it in the parselmouth thing might as well put it in to make sure

Update pw.py

686ad8d

oops

fix dtype mismatch

344c471

for some reason pyworld only likes float64?

add f0 range as a hyperparameter

233c8b4

why isn't it a hyperparameter in the first place

Merge branch 'openvpi:main' into main

64dea71

Merge branch 'openvpi:main' into main

5e07fce

Merge branch 'openvpi:main' into main

e71ad28

Merge branch 'openvpi:main' into main

2c82a61

Merge branch 'openvpi:main' into main

04efef8

move pad_frames to pw

947efb5

i think world is p accurate with the frames stuff but it's just to ensure

Merge branch 'main' of https://github.com/UtaUtaUtau/DiffSinger

9050e70

Merge branch 'openvpi:main' into main

411ea2c

Merge branch 'openvpi:main' into main

bd66223

change padding algorithm

7cda88a

it's just to be similar to the parselmouth one.. it makes sense to not center the F0 after all

Merge branch 'main' of https://github.com/UtaUtaUtau/DiffSinger

4cca34d

remove duplicate line

795d3b0

yeah

yqzhishen requested a review from yxlllc October 18, 2023 16:37

UtaUtaUtau added 2 commits November 1, 2023 15:01

Merge branch 'openvpi:main' into main

fec788a

Merge branch 'openvpi:main' into main

cd3834b

Remove DIO, change default range, add docs

f7f00c7

yqzhishen reviewed Nov 20, 2023

View reviewed changes

docs/BestPractices.md Show resolved Hide resolved

Add notice for F0 range

7e62ed0

yqzhishen approved these changes Nov 20, 2023

View reviewed changes

yqzhishen merged commit 931df27 into openvpi:main Nov 20, 2023

UtaUtaUtau added a commit to UtaUtaUtau/DiffSinger that referenced this pull request Nov 20, 2023

Revert "Add WORLD pitch estimators and F0 range as hyperparameters (o…

6b75f59

…penvpi#149)" This reverts commit 931df27.

yqzhishen added a commit that referenced this pull request Nov 20, 2023

Remove unexpected import in #149

961be9a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add WORLD pitch estimators and F0 range as hyperparameters #149

Add WORLD pitch estimators and F0 range as hyperparameters #149

UtaUtaUtau commented Oct 18, 2023

yxlllc commented Oct 18, 2023

yqzhishen commented Oct 18, 2023

hrukalive commented Oct 18, 2023

UtaUtaUtau commented Oct 18, 2023

UtaUtaUtau commented Oct 21, 2023

yqzhishen commented Oct 21, 2023

UtaUtaUtau commented Oct 21, 2023

yqzhishen commented Nov 19, 2023

UtaUtaUtau commented Nov 19, 2023

yqzhishen commented Nov 20, 2023

Add WORLD pitch estimators and F0 range as hyperparameters #149

Add WORLD pitch estimators and F0 range as hyperparameters #149

Conversation

UtaUtaUtau commented Oct 18, 2023

yxlllc commented Oct 18, 2023

yqzhishen commented Oct 18, 2023

hrukalive commented Oct 18, 2023

UtaUtaUtau commented Oct 18, 2023

UtaUtaUtau commented Oct 21, 2023

yqzhishen commented Oct 21, 2023

UtaUtaUtau commented Oct 21, 2023

yqzhishen commented Nov 19, 2023

UtaUtaUtau commented Nov 19, 2023

yqzhishen commented Nov 20, 2023