Montreal Forced Aligner (MFA) Version Inquiry #39

Open
zeynabyousefi opened this issue Nov 2, 2024 · 3 comments

Comments

@zeynabyousefi

Hello, I would like to know the exact version of the Montreal Forced Aligner (MFA) used in this project. I need to confirm the version to ensure compatibility with other project components.

@ytyeung
@wenyong-h
@ivanvovk
@huawei-noah-admin

@li1jkdaw

li1jkdaw commented Nov 4, 2024

Hi, @zeynabyousefi! We used MFA v1.0. As for the English acoustic model, its meta.yaml file states that it is version 0.9.0, with architecture gmm+hmm and features mfcc+deltas.
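
In case it helps, here is a minimal sketch of checking those values yourself by reading the acoustic model's meta.yaml with PyYAML. The path and the exact field names are assumptions inferred from the values quoted above, so adjust them to however the model is unpacked on your machine.

```python
import yaml

# Hypothetical path to the unpacked English acoustic model directory;
# point this at wherever your MFA model archive was extracted.
MODEL_META = "english_acoustic_model/meta.yaml"

with open(MODEL_META) as f:
    meta = yaml.safe_load(f)

# Field names are guessed from the values mentioned in the comment above
# and may differ slightly between MFA model releases.
print(meta.get("version"))       # expected: 0.9.0
print(meta.get("architecture"))  # expected: gmm+hmm
print(meta.get("features"))      # expected: mfcc+deltas
```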

@zeynabyousefi

Thanks.
I am training the Diff-VC Encoder model on the LJSpeech dataset. Currently, I am facing some issues with data preprocessing and setting up input parameters, and I would appreciate any guidance on the appropriate configuration for both.

Additionally, I've encountered errors while running the get_avg_mels.ipynb notebook, which seem to be due to mismatches in sample rates, audio features (such as MFCC or mel-spectrogram), or other processing parameters.

If specific settings are required for data preprocessing and input parameters, please provide detailed instructions.

Thank you in advance for your assistance!

@ytyeung
@wenyong-h
@ivanvovk
@huawei-noah-admin

@li1jkdaw

@zeynabyousefi
I can think of two reasons why you might be having problems with data preprocessing. The first is related to the mel feature parameters, and the second to the TextGrid files.

  1. The Diff-VC model operates on mel-spectrograms whose parameters are consistent with the universal HiFi-GAN vocoder (https://github.com/jik876/hifi-gan). You can find the exact parameters in inference.ipynb in the function get_mel. If you want to use the same universal HiFi-GAN vocoder as we do, please extract mel features from your audio with that get_mel function. In particular, the sample rate is 22050 Hz and the hop size is 256, and these values are hard-coded in the notebook get_avg_mels.ipynb, so this is probably why you get errors while running it (see the first sketch after this list).
  2. MFA must be run separately, and the features it uses to extract the alignment are different from the ones described above. But if I remember correctly, you do not have to extract those features manually to run alignment with MFA; you only have to prepare your audio files in the correct format and put them into folders with the specific structure (e.g. spk1/book1/wav1, wav2, ...). Please refer to the MFA documentation for more details. You may need to resample the audio to 16 kHz, but I'm not sure, so check that with the MFA documentation as well (see the second sketch after this list). After you perform the alignment, check the resulting .TextGrid files manually: make sure the phonemes and timestamps in those files are consistent with the corresponding audio files.
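
For the first point, here is a rough sketch of what HiFi-GAN-compatible mel extraction typically looks like. Only the 22050 Hz sample rate and the 256-sample hop are confirmed above; the FFT size, window length, mel-band count and frequency range below are common HiFi-GAN defaults that I am assuming, so please copy the exact values from get_mel in inference.ipynb rather than trusting this snippet.

```python
import librosa
import numpy as np

SAMPLE_RATE = 22050   # confirmed above
HOP_SIZE = 256        # confirmed above
N_FFT = 1024          # assumed common HiFi-GAN default
WIN_SIZE = 1024       # assumed
N_MELS = 80           # assumed
FMIN, FMAX = 0, 8000  # assumed

def extract_mel(path):
    # Load and resample the audio to the vocoder's sample rate.
    wav, _ = librosa.load(path, sr=SAMPLE_RATE)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_SIZE,
        win_length=WIN_SIZE, n_mels=N_MELS, fmin=FMIN, fmax=FMAX, power=1.0)
    # Log-compress with a small floor, as HiFi-GAN-style pipelines usually do.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

mel = extract_mel("LJ001-0001.wav")  # shape: (N_MELS, n_frames)
```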
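
For the second point, here is a rough sketch of laying out an LJSpeech-style corpus for MFA under the spk/book/utterance folder structure mentioned above. The paths, the 16 kHz resampling, and the .lab transcript convention are assumptions on my part, so please verify them against the MFA documentation for the version you use.

```python
import os

import librosa
import soundfile as sf

SRC_WAVS = "LJSpeech-1.1/wavs"    # hypothetical location of the original wavs
DST_DIR = "mfa_corpus/LJ/book1"   # spk/book/utterance layout mentioned above
MFA_SR = 16000                    # assumed target rate; please check the MFA docs

os.makedirs(DST_DIR, exist_ok=True)
for name in sorted(os.listdir(SRC_WAVS)):
    if not name.endswith(".wav"):
        continue
    # Resample each utterance and write it into the speaker/book folder.
    wav, _ = librosa.load(os.path.join(SRC_WAVS, name), sr=MFA_SR)
    sf.write(os.path.join(DST_DIR, name), wav, MFA_SR)
    # MFA also expects a transcript next to each wav (e.g. LJ001-0001.lab with
    # the utterance text from LJSpeech's metadata.csv); the .lab convention is
    # an assumption here, so confirm the expected format in the MFA docs.
```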

I would also like to mention that training the Diff-VC Average Voice Encoder on LJSpeech alone is not a good idea unless you want to perform one-to-any voice conversion where the source voice is always LJ. The main idea behind this Encoder is that it should convert any voice into a speaker-independent "average" voice while preserving the linguistic content of the source speech. It is supposed to be used in any-to-any voice conversion to transform any source voice into the "average" voice, thus helping to disentangle content and timbre. But if you train this Encoder on only one specific voice, it will not perform properly on arbitrary voices; it will only work as expected on that particular voice. So, if you want to achieve any-to-any voice conversion, you should train the Encoder on as many different voices as possible.
