Preparation before training. #30
Comments
I'm glad to hear that the processed ... This is a good observation and suggestion; I will pay attention to the stop token loss. Yes, I saw someone successfully train with AliTTS and get results. Since I don't have your four datasets, I want to get the data from other places. In addition:
Word segmentation can be added into the transcript as well, for instance ... As for the time length for each speaker, I think it depends on whether the model can learn the alignment or not. Of course, the more corpus, the better the evaluation. In my humble opinion, if the alignment can be learned, the synthesis is feasible. The time spent on training depends on the amount of data in your dataset. Typically speaking, once the stop token loss reduces to zero, the issue you mentioned above will not happen.
This answer resolved the doubts I have had for several days, thank you very much! I think I can use your code to train a single speaker with the Biaobei dataset, and then add more speakers.
@begeekmyfriend Sorry to disturb you again.
Question 1: I don't know the role of ...
Question 2: I trained 240K steps on a 1080Ti for 2 days. I want to test the intermediate result, but got the following error:
Can you tell me about the training steps? I really don't know what went wrong.
Your logcat shows strange information and I cannot tell what went wrong; I'm afraid I am not able to help you with that logging. The format of ...
Thanks, I have solved this bug; the problem lay in an error in the ... I have to say that your work is really great. Now I plan to do the following to improve the training:
Training is really slow. With a single GPU and more datasets added, it will take a week to reach 50K steps. @begeekmyfriend By the way, the filelists folder and the validation-anchor-dir parameter are not used. Can they be deleted?
I have now changed to this format:
Punctuation, word segmentation, and syllables have been added.
For the pausing problem, one approach is punctuation and another is word segmentation, as mentioned earlier; you can give it a try. You don't need a format as complicated as yours; just squeeze the pinyin together.
Yes, my results are better now, so I want to keep following your advice and train WaveRNN.
But when I run GTA, I hit the following error:
Nobody else in the issues has run into this problem. Do you know how to solve this error?
I referred to the code in inference.py and modified ...
so that it runs normally.
As long as it works; runtime environments differ from person to person. As for ...
Will the data generated with this warning affect WaveRNN training?
Also, what is the generation speed supposed to be? It has been 11 hours since last night and only 4,200 out of 20,000 utterances have been generated, on a 1080Ti. Is that speed a problem? It feels really slow, and I am not sure whether it is because I am writing to an external hard drive and the disk I/O is the bottleneck. So I would like to know your generation speed.
Is T2 generating mels that slowly? Are you not using batch synthesis? A 1080Ti with 11 GB is enough to synthesize a thousand sentences at once.
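For reference, a rough sketch of what batched mel synthesis could look like with PyTorch; text_to_sequence and model.inference are placeholders for whatever entry points the repository actually provides, so adapt the names to your own code:

```python
import torch

def synthesize_batch(model, texts, text_to_sequence, device='cuda'):
    """Pad a batch of token sequences and run one inference pass instead of looping sentence by sentence."""
    seqs = [torch.LongTensor(text_to_sequence(t)) for t in texts]
    lengths = torch.LongTensor([len(s) for s in seqs])
    padded = torch.zeros(len(seqs), int(lengths.max()), dtype=torch.long)
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
    with torch.no_grad():
        # hypothetical signature; use whatever inference call the repo exposes
        mels = model.inference(padded.to(device), lengths.to(device))
    return mels
```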
My change above was not great. I looked at someone else's fork; you only need to change this line: Line 155 in a75495e
Change: text --> seq
Then there is no error and it runs very fast. I followed your suggestion in issue4.
I thought you were synthesizing mels outside the training set; in that case your ...
I seem to have run into another small problem. After training for one day, the current log is:
In the path ...
643/2565 (179/312)
Do I have to train all 2565 epochs? At about 600 epochs per day, that would be roughly 4 days in total?
You could first check whether your training wavs and GTA mels are aligned at the sample/frame level.
Here is a comparison of the two files. Sorry, I don't know much about this and can't quite read it.
I mean whether the lengths of the mel and wav numpy arrays match.
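A minimal sketch of such a check, assuming the preprocessing saves wavs and mels as .npy files, the mel has shape (n_mels, frames), and the hop size matches hparams; the file names and hop_size value here are placeholders:

```python
import numpy as np

hop_size = 275  # assumption: the hop length used during preprocessing (check hparams)

wav = np.load('aida_000004_wav.npy')  # placeholder wav array saved by preprocessing
mel = np.load('aida_000004_mel.npy')  # placeholder mel array, assumed shape (n_mels, frames)

n_frames = mel.shape[1]
print('wav samples       :', len(wav))
print('mel frames        :', n_frames)
print('frames * hop_size :', n_frames * hop_size)
# If the pair is aligned, len(wav) should match n_frames * hop_size to within
# roughly one hop; a large mismatch means the wav/mel pair needs trimming or
# padding before GTA WaveRNN training.
```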
I found that your repository has a pretrained model for waveglow. I plan to try the pretrained model first and then fine-tune it on my own dataset, but the README does not seem to match your modified code, and I cannot get it to run either.
It was a version problem; switching from torch 1.6 to 1.3 made it work. But the pretrained model's quality is also mediocre, with serious robotic and raspy artifacts.
You can post-process the volume, for example using some mean-based calculation to unify the loudness across sentences.
Training mode one: ... Training mode two: ... So the mode two I am using now follows the issue above exactly; there should not be anything wrong, but I cannot find the problem. I also want to try training with mode one in parallel, simply without the --gta parameter; I should just be able to move the audio and mel under training_data into the mel and quant directories, right? Thanks, man. I do not work in this field and do not know it well, so chatting with me for this long must be tiring; really sorry.
Do you mean that part of the code in your waveRNN repository still has problems? That is, I cannot simply prepare the dataset and train directly, and need to check which part of the code is wrong?
example.zip: GTA training at 600K steps
Results of non-GTA training at 250K steps
It seems to be a problem with the mel value range; I am investigating.
I have two questions:
His fork uses mel_bias to solve this problem, with a fixed value of 2. I tried +2 and it improved somewhat, but it is still not good enough. At least the problem has been narrowed down to the mel range.
I processed the taco2 output with three different operations.
The best result was with +2... but none of them really work.
In that case it is up to you how to adjust it; I do not know the specifics of your setup.
Also, is the mel padding in wavernn -4?
Yes, -4. I have not modified the default parameters. Also, wavernn normalizes to [-1, 1] during training, right? I am checking against that part of the code.
This is not a tuning problem; it is that I have not figured out what range the inputs and outputs should have, or the correct way to process them. Since there is no README connecting the two repositories, I could only go by the information you gave in other issues. You can assume I processed everything exactly your way.
So I want to ask you: does the mel predicted by taco2 need any further processing before synthesizing audio with WaveRNN?
Whether I use [-4, 4] or [-1, 1], the results are bad, with a lot of noise, but at least it is no longer silent like option 1. The core problem is still this:
I just tried adding +4 to the mel directly and it actually gives the best result so far, just not very stable, with occasional crackling noise. I will check after class what offset value works best. I have not figured out the logic of why simply adding a constant to the mel improves the quality so much. I do not know the reason and cannot work out the right offset; here are the results for you to listen to: By the way, I did not use --gta; with --gta, synthesis stays terrible at 800K no matter how I tune it. This wavernn was trained on the original audio and mels extracted from the original audio.
The most direct comparison:
eval.npy is the mel output by taco2, and aida_000004.npy is the mel of the original audio (generated by preprocess.py). The audio synthesized from aida_000004.npy is very good. @begeekmyfriend Do you know where the problem is? I think it is the mel output by taco2. I have not changed max_abs_value, and I do not know why taco2's output mel is not in the range [-4, 4]. It seems one more processing step is needed before feeding it into wavernn, but I do not know what that step is; I have already tried normalization and adding a constant to the mel. I hope you can point me to the correct processing, or to a way of keeping taco2's output correctly within [-4, 4].
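One way to make this comparison concrete is to print the shape and value range of both arrays; the file names below are the ones mentioned above, and nothing else is specific to the repository:

```python
import numpy as np

for name in ('eval.npy', 'aida_000004.npy'):
    mel = np.load(name)
    print(f'{name}: shape={mel.shape}, min={mel.min():.3f}, max={mel.max():.3f}')
# If eval.npy (the Tacotron2 prediction) falls outside [-4, 4] while the
# ground-truth mel stays inside, that range mismatch is the likely source of
# the noisy WaveRNN output.
```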
I believe I clip the GTA mels; please double-check. wavernn should never see values below -4, and those out-of-range "scraps" are very likely what causes the noise in the synthesis.
GTA training has problems. I trained to 1000K steps, but the results are extremely poor, very shaky, and it even sounds like two people's voices. So I went back to the non-GTA mode, training wavernn on the original audio and the mels extracted from it, and by 150K steps it already sounded good; I trained to 600K and stopped. The question I am stuck on now is why the mel predicted by taco2 is not in the range [-4, 4].
That is expected, because ref_level_db is subtracted during preprocessing, which biases the output negatively; whether GTA or not, the value range is not symmetric. I simply force a clip at -4 to remove the "scraps", which does not affect quality. Also, tacotron2 and wavernn are independent; if you train wavernn on GTA mels, their range is basically the same as T2's inference output. If the GTA mels sound fine through Griffin-Lim, then the training should be fine; please confirm that.
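A minimal sketch of that clipping step, assuming max_abs_value = 4 as discussed in this thread; this reflects the advice above, not code taken verbatim from the repository:

```python
import numpy as np

max_abs_value = 4.0  # assumption: the max_abs_value used in hparams

def clip_mel(mel):
    """Force a predicted or GTA mel into [-max_abs_value, max_abs_value]."""
    return np.clip(mel, -max_abs_value, max_abs_value)

mel = np.load('eval.npy')  # mel predicted by Tacotron2
mel = clip_mel(mel)        # drop the out-of-range "scraps" before feeding WaveRNN
```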
Thanks for your help!
I have new progress; both the Biaobei female voice and the AliTTS male voice sound great.
Thanks again! There are two questions I would like your advice on:
I tested changing the max decoder steps:
You might try pkuseg for Chinese word segmentation. As for long-sentence splitting and splicing tools, I am afraid I have no idea; maybe punctuator2? You'd better read the README closely. Good luck!
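For reference, a minimal pkuseg sketch for segmenting a transcript line before pinyin conversion; how the word boundaries are then encoded in the label is left to your own preprocessing:

```python
import pkuseg

seg = pkuseg.pkuseg()                # default general-domain model
words = seg.cut('卡尔普陪外孙玩滑梯')   # returns a list of segmented words
print(' '.join(words))               # insert the boundaries into the label as needed
```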
Handing in my homework; here are some results for you to listen to: wav_sample.zip. Sharing a few of my findings:
Also, I would like to ask you a general-knowledge question: ... Finally, thank you again for open-sourcing this work!
Congrats! Your volume is not balanced enough; I suggest post-processing it, for example normalizing each sentence's loudness with the root mean square (RMS). Pause handling is generally also a matter of data annotation. Clip the mel value range to -4 across the board, GTA mels included. As for tones, that can be solved with a Python script during preprocessing; it is just code logic, so I will not go into detail. GTA mels are fine for me; the MOS in the paper shows they are even better than GT, so please investigate the cause further.
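A minimal sketch of the RMS-based volume post-processing suggested here; the target RMS level is an arbitrary assumption to be tuned on your own data:

```python
import numpy as np

def rms_normalize(wav, target_rms=0.1):
    """Scale a float waveform so its root-mean-square level equals target_rms."""
    rms = np.sqrt(np.mean(wav ** 2))
    if rms < 1e-8:
        return wav  # leave near-silent audio untouched to avoid dividing by ~0
    return np.clip(wav * (target_rms / rms), -1.0, 1.0)

# Apply to each synthesized sentence before saving or concatenating them.
```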
@leijue222
@yannier912
Following the Biaobei rule for consecutive third tones (33), I modify the input text label at prediction time, and all consecutive third tones are then pronounced correctly.
May I ask what "consecutive 33 tones" means? I searched but could not find anything about it; I am a newbie who has not been studying speech for long...
@leijue222 Thanks for sharing!
You can see that the three characters "卡尔普" in the Biaobei text are all third tone (333), but the Biaobei transcript labels them as (223). 假语 (33) and 躲躲闪闪 (3333) are handled by the same rule. In short, when third tones directly follow one another the tone changes, so the labels need some annotation processing.
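A rough sketch of this relabeling trick: within a run of consecutive third tones, change every syllable except the last to second tone (333 -> 223). Real Mandarin tone sandhi also depends on word grouping, so treat this as an illustration rather than a complete rule set:

```python
def third_tone_sandhi(pinyin):
    """pinyin: space-separated syllables with tone digits, e.g. 'ka3 er3 pu3'."""
    syllables = pinyin.split()
    out = []
    for i, syl in enumerate(syllables):
        # a non-final third tone followed by another third tone becomes second tone
        if syl.endswith('3') and i + 1 < len(syllables) and syllables[i + 1].endswith('3'):
            out.append(syl[:-1] + '2')
        else:
            out.append(syl)
    return ' '.join(out)

print(third_tone_sandhi('ka3 er3 pu3'))  # -> 'ka2 er2 pu3'
```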
Hi, @begeekmyfriend
The results of this work are really great! Every speaker's voice quality and speaking speed sound quite comfortable, and it can synthesize such long sentences.
According to other issues, I got the following information:
hparams.anchor_dirs should be changed to my own datasets. The .trn files record the Chinese pinyin, which is added into train.txt as the targets.
I still have some questions and hope to get your reply.
1. What format should the pinyin in the trn files use? Like this: huángyù línghǎn yǎ le hóulóng, or ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1, or some other format? Can you give me a reference format from your data? Following your code, I processed the Biaobei dataset into train.txt like the following; I am not sure whether the last column is correct.
2. The tacotron2 I have come across can only synthesize one or two sentences, about ten seconds; with longer text it skips words and rereads. But in your example a 2-minute sentence is synthesized. How did you do it? Do you split and splice to synthesize long sentences, or is the dataset you train on made of relatively long utterances rather than short sentences like the Biaobei data? Or some other method?
3. I am considering using the Ali TTS API to synthesize datasets. I only have the Biaobei data, so I want to use the API to synthesize datasets for another 3 speakers to stay consistent with you. Each speaker would then have 12 hours; I do not know whether this is feasible.