
Preparation before training. #30

Closed
leijue222 opened this issue Nov 12, 2020 · 52 comments

Comments

@leijue222
Contributor

leijue222 commented Nov 12, 2020

Hi, @begeekmyfriend
The results of this work are really great! Every speaker's voice quality and speaking speed sound quite comfortable, and the model can synthesize such long sentences.

According to other issues, I got the following information:

  1. The datasets in hparams.anchor_dirs should be changed to my own dataset.
  2. The training_data directory structure is here.
  3. The original corpus contains wav, txt and trn files. The trn file records Chinese pinyin, which will be added into train.txt as the targets.

I still have some questions and hope to get your reply.

  1. What is the storage format of trn? Is it like huángyù línghǎn yǎ le hóulóng, or ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1, or some other format? Can you give me a reference format from your data?
    According to your code, I processed the Biaobei dataset into a train.txt like the following; I'm not sure whether the last column is correct.
000001.npy|53504|209|ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
000002.npy|59648|233|jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
  2. The tacotron2 I came across can only synthesize one or two sentences, about ten seconds; anything longer will miss words and reread. But in your example, a 2-minute sentence is synthesized. How did you do it?
    Do you split and concatenate to synthesize long sentences, or is the data you train on relatively long rather than short sentences like the Biaobei data? Or some other method?

  3. I am considering using the Ali TTS API to synthesize datasets. I only have the Biaobei data, so I want to use the API to synthesize another 3 speakers' datasets to keep it consistent with yours. Then each speaker would have 12 hours; I don't know if this is feasible.

@leijue222 leijue222 changed the title How to get the dataset? Preparation before training. Nov 12, 2020
@begeekmyfriend
Owner

  1. The content in trn is Chinese pinyin written with letters plus tone numbers, like ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1. So the Biaobei format is correct. (A sketch of generating such transcripts is shown after this list.)

  2. You'd better train until the stop token loss drops to zero so that the model knows when to stop inference, and there will be no rereading in evaluation. I guess you stopped the training too early. And I do not know whether you are using a vocoder or not.

  3. I have no idea about the Ali TTS API, and I do not know whether you can synthesize speakers with a trained model from T2.

@leijue222
Contributor Author

leijue222 commented Nov 13, 2020

I'm glad to hear that the processed train.txt is correct. I see many people further process ka2 er2 pu3 into k a2 er2 pu3 (initials and finals + tone). I have been stuck on this, worried that the input format would affect the model's results. Thank you for telling me to keep ka2 er2 pu3.

This is a good suggestion; I will pay attention to the stop token loss.

Yes, I saw someone successfully train with AliTTS and get results. Because I don't have your four datasets, I want to get the data from other sources.

In addition:

  1. I noticed that you said each speaker needs at least 1 hour, so how many hours do you have for each speaker?
    If I use AliTTS to get another three datasets, then each speaker will have 12 hours.

  2. How long did you train, and on which GPU?

@begeekmyfriend
Owner

Word segmentation can be added into the transcript as well. For instance, ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1 can be modified into ka2er2pu3 pei2 wai4sun1 wan2 hua2ti1 for better tone. That is to say, you can use white space as the delimiter. And ka2er2pu3 can also be modified into kar2pu3 for a rhotic accent.
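A minimal sketch of producing such word-joined pinyin automatically, assuming jieba and pypinyin (neither is required by this repo):

# Join the pinyin of each segmented word so that white space marks word boundaries,
# e.g. "卡尔普陪外孙玩滑梯" -> roughly "ka3er3pu3 pei2 wai4sun1 wan2 hua2ti1".
import jieba
from pypinyin import lazy_pinyin, Style

def segmented_pinyin(text):
    words = jieba.cut(text)
    return " ".join("".join(lazy_pinyin(w, style=Style.TONE3)) for w in words)

print(segmented_pinyin("卡尔普陪外孙玩滑梯"))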

As for the time length for each speaker, I think it depends on whether the model can learn alignment or not. Of course, the more corpus, the better the evaluation. In my humble opinion, if the alignment can be learned, the synthesis is feasible.

The time spent on training depends on the amount of data in your dataset. Typically, once the stop token loss drops to zero, the issue you mentioned above will not happen.

@leijue222
Contributor Author

leijue222 commented Nov 13, 2020

Word segmentation can be added into the transcript as well. For instance, ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1 can be modified into ka2er2pu3 pei2 wai4sun1 wan2 hua2ti1 for better tone. That is to say, you can use white space as the delimiter. And ka2er2pu3 can also be modified into kar2pu3 for a rhotic accent.

This answer resolved several days of doubts for me, thank you very much!
I used the tacotron2 of other projects to train with hǎn yǎ le hóulóng-style pinyin as input, but I didn't get the correct tones.

I think I can use your code to train a single speaker with the Biaobei dataset, and then add more speakers.

@leijue222
Contributor Author

leijue222 commented Nov 13, 2020

You'd better train until the stop token loss drops to zero so that the model knows when to stop inference, and there will be no rereading in evaluation.

Why can I only see this one loss in TensorBoard, and not other losses such as the stop token loss? How do I check the status of the stop token loss?
(Screenshot: TensorBoard loss curve, 2020-11-13 21:46:27)
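If the training script only writes the total loss to TensorBoard, the individual terms could be logged as separate scalars along these lines; the variable and tag names here are hypothetical, not this repo's actual code:

# Log each loss term separately so the stop token (gate) loss shows up in TensorBoard.
# total_loss, mel_loss and gate_loss are placeholders for values from your training loop.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="logs/tb")

def log_losses(step, total_loss, mel_loss, gate_loss):
    writer.add_scalar("train/total_loss", total_loss, step)
    writer.add_scalar("train/mel_loss", mel_loss, step)
    writer.add_scalar("train/gate_loss", gate_loss, step)  # the stop token loss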

@leijue222
Contributor Author

leijue222 commented Nov 16, 2020

@begeekmyfriend Sorry to disturb you again.
The README description is not detailed enough, so I don't know whether I'm training correctly.
The following is my approach:

1. # Preprocessing
python preprocess.py
Process the Biaobei dataset to get training_data:
tacotron2
└── training_data
     ├── tts_biaobei_22050
     │     ├── audio/*.npy
     │     ├── mels/*.npy
     │     └── train.txt

2. # Training
CUDA_VISIBLE_DEVICES=1 python train.py --amp-run -o logs --init-lr 1e-3 --final-lr 1e-5 --epochs 1000 -bs 32 --weight-decay 1e-6 --log-file nvlog.json --dataset-path training_data --training-anchor-dirs tts_biaobei_22050

Question 1: I don't know the role of the filelists folder. I don't know how to generate the txt file inside it, or whether that txt file is even used.

Question 2: I trained 240K steps on a 1080Ti for 2 days. I want to test the intermediate result, but got the following error:

CUDA_VISIBLE_DEVICES=0 python inference.py -i text.txt -o outputs --amp-run --speaker-num 0 --speaker-id 0 --log-file nvlog.json

Loading Weights: "logs/checkpoint_latest.pt"
Traceback (most recent call last):
  File "inference.py", line 191, in <module>
    main()
  File "inference.py", line 146, in main
    model, args = load_and_setup_model(parser, args)
  File "inference.py", line 73, in load_and_setup_model
    model.restore_checkpoint(checkpoint_path)
  File "/media/yiwei/600G/tts/tacotron2/tacotron2/model.py", line 584, in restore_checkpoint
    self.load_state_dict(checkpoint['model'])
  File "/home/yiwei/anaconda3/envs/torch1.6/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1045, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Tacotron2:
        size mismatch for embedding.weight: copying a param with shape torch.Size([75, 512]) from checkpoint, the shape in current model is torch.Size([0, 512]).

Can you tell me about the training steps? I really don't know what went wrong.

@begeekmyfriend
Owner

Your log shows strange information and I cannot tell what went wrong; I'm afraid I'm not able to help you with that logging. The format of train.txt is the same as in Rayhane-mamah/Tacotron-2. You can see how it is generated by reading these lines of code.

@leijue222
Contributor Author

leijue222 commented Nov 16, 2020

Thanks, I have solved this bug; the problem was an error in the n_symbols parameter.
(Alignment plot: align_0838_261456)
This is the result at 260K steps; does this image look a bit strange?
These are two audio cases: one misses a word, and the other gets the pitch of a word wrong. Can WaveRNN solve these problems? I am still training tacotron2 and plan to train to 500K, but it is only at 260K now...
samples.zip

I have to say that your work is really great.

Now I plan to do the following work to improve the training:

  1. Change the text labels to add word segmentation and syllable information, like this:
    k a 3 - er 3 - p u 3 / p ei 2 - w ai 4 - s un 1 / w an 2 - h ua 2 - t i 1 / 。
    Here, - marks the end of a syllable and / marks a word boundary. (A sketch of splitting toned pinyin this way follows this list.)
  2. Add more speakers.
  3. Maybe try adding BERT.
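A minimal sketch of splitting tone-numbered pinyin into initial/final that way, assuming a simple longest-prefix match over the standard initials (this is not the repo's code):

# Split a toned syllable such as "ka3" into "k a 3", matching the label format above.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_syllable(syl):
    tone = syl[-1] if syl[-1].isdigit() else ""
    body = syl[:-1] if tone else syl
    for ini in INITIALS:  # two-letter initials (zh/ch/sh) are matched first
        if body.startswith(ini) and len(body) > len(ini):
            return " ".join(filter(None, [ini, body[len(ini):], tone]))
    return " ".join(filter(None, [body, tone]))  # zero-initial syllables like "er3"

print(" - ".join(split_syllable(s) for s in "ka3 er3 pu3".split()))
# -> "k a 3 - er 3 - p u 3"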

Training is really slow. With a single GPU and more datasets added, it will take a week to reach 500K.
It would be better to use multi-GPU training.

@begeekmyfriend By the way, the filelists folder and the validation-anchor-dir parameter are not used. Can they be deleted?

@leijue222
Contributor Author

leijue222 commented Nov 18, 2020

(Alignment plot: align_1071_334152)
I don't quite understand this plot; it looks a bit strange to me. Is there a problem?
(Screenshot: TensorBoard loss curves, 2020-11-18 10:33:24)
The loss seems unable to drop any further.
I synthesized two clips of a sentence from the 14th Five-Year Plan:
eval.zip
The pronunciation and audio quality are quite good; the drawback is that the pausing is wrong. I think a large part of the reason may be that my training data has no punctuation; I used this format directly:

ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1

I have now changed it to this format:

k a 3 - er 3 - p u 3 / p ei 2 - w ai 4 - s un 1 / w an 2 - h ua 2 - t i 1 / 。

I added punctuation, word segmentation and syllables; - marks the end of a syllable and / marks a word boundary.
At the same time, I used the API and the Biaobei text to generate a dataset with another voice. I now plan to train with the two Biaobei-text datasets (two voices) and the new labels above; that should give better results, e.g. improve the pausing problem, right? What do you think?
Also, does this parameter need to be changed? I don't know what value is appropriate; I now have one male voice and one female voice.

@begeekmyfriend
Owner

begeekmyfriend commented Nov 18, 2020

For the pausing problem, one approach is punctuation and another is word segmentation, as mentioned earlier. You can try it; you don't need such a complicated format, just join the pinyin of each word together.

@leijue222
Contributor Author

leijue222 commented Nov 20, 2020

Yes, my results are quite good now. So I want to keep following your advice and train WaveRNN.

CUDA_VISIBLE_DEVICES=0 python gta.py --amp-run -o gta --dataset-path training_data --training-anchor-dirs tts_biaobei_22050 tts_aida_22050

But when I run GTA, I get the following error:

  0%|                                                                                                                                          | 0/10000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "gta.py", line 185, in <module>
    main()
  File "gta.py", line 165, in main
    outputs = model.infer(to_gpu(seqs).long(), to_gpu(seq_lens).int(), to_gpu(targets).half(), to_gpu(target_lengths).int())
  File "/home/ming-y/yiwei/home/ming-jp/Desktop/yiwei/tacotron2-2/tacotron2/model.py", line 532, in infer
    encoder_outputs = self.encoder(embedded_inputs, text_lengths)
  File "/home/ming-y/anaconda3/envs/pytorch1.6/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ming-y/yiwei/home/ming-jp/Desktop/yiwei/tacotron2-2/tacotron2/model.py", line 222, in forward
    outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs, batch_first=True)
  File "/home/ming-y/anaconda3/envs/pytorch1.6/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 310, in pad_packed_sequence
    sequence.data, sequence.batch_sizes, batch_first, padding_value, max_seq_length)
RuntimeError: shape '[77, 1, 512]' is invalid for input of size 37888

No one else in the issues has run into this problem; do you know how to solve this error?

@leijue222
Contributor Author

leijue222 commented Nov 20, 2020

I referred to the code in inference.py and modified this part of gta.py:

  1. tacotron2/gta.py

    Lines 153 to 155 in a75495e

    seq = text_to_sequence(text, speaker_id, ['basic_cleaners'])
    seqs = torch.from_numpy(np.stack(seq)).unsqueeze(0)
    seq_lens = torch.IntTensor([len(text)])

    Changed it to the following:
    sequences, text_lengths, ids_sorted_decreasing = prepare_input_sequence(sentences, args.speaker_id)

  2. tacotron2/gta.py

    Lines 165 to 172 in a75495e

    outputs = model.infer(to_gpu(seqs).long(), to_gpu(seq_lens).int(), to_gpu(targets).half(), to_gpu(target_lengths).int())
    _, mel_out, _, _ = [output.cpu() for output in outputs if output is not None]
    mel_out = mel_out.squeeze()[:, :mel.size(-1) - 1]
    # clamp the range according to reference level decibel bias to eliminate background noises (20db)
    mel_out = np.clip(mel_out, args.mel_pad_val, -args.mel_pad_val)
    assert(mel_out.shape[-1] == wav.shape[-1] // args.hop_length)
    fname = os.path.basename(npy_path)
    np.save(os.path.join(args.output_dir, fname), mel_out, allow_pickle=False)

    Changed it to the following:

outputs = model.infer(sequences, text_lengths)
_, mels, _, _, mel_lengths = [output.cpu() for output in outputs]
ids_sorted_decreasing = ids_sorted_decreasing.numpy().tolist()
mels = [mel[:, :length] for mel, length in zip(mels, mel_lengths)]
mels = [mels[ids_sorted_decreasing.index(i)] for i in range(len(ids_sorted_decreasing))]
fname = os.path.basename(npy_path)
np.save(os.path.join(args.output_dir, str(speaker_id), fname), np.concatenate(mels, axis=-1), allow_pickle=False)

With this, gta.py runs normally, though accompanied by Warning! Reached max decoder steps.
@begeekmyfriend I don't know whether my modification is correct; I hope you can take a look at it. Thanks!

@begeekmyfriend
Owner

begeekmyfriend commented Nov 21, 2020

As long as it works; runtime environments vary from person to person. As for Reached max decoder steps, it means your stop token did not stop in time; I set a maximum number of decoder time steps.

@leijue222
Contributor Author

As long as it works; runtime environments vary from person to person. As for Reached max decoder steps, it means your stop token did not stop in time; I set a maximum number of decoder time steps.

Will the data generated under this warning affect WaveRNN training?

@leijue222
Contributor Author

Also, what generation speed should I expect? It has been 11 hours since last night and only 4,200 of the 20,000 utterances have been generated, on a 1080Ti. Is this speed abnormal? It feels really slow, and I'm not sure whether it is because I'm using an external hard drive with slow disk I/O or something else, so I'd like to know your generation speed.

@begeekmyfriend
Owner

T2 generates mels that slowly? Are you not using batched synthesis? A 1080Ti with 11 GB is enough to synthesize a thousand sentences at once.

@leijue222
Contributor Author

leijue222 commented Nov 21, 2020

My changes above were not great. I looked at someone else's fork; only this one line needs to be changed:

seq_lens = torch.IntTensor([len(text)])

Change: text --> seq

seq_lens = torch.IntTensor([len(seq)])

Then there is no error and it runs very fast.


Following your advice in issue4,
I got training running successfully. I hope to get good results.

| Epoch: 2/2565 (271/312) | Loss: 4.8291 | 2.1 steps/s | Step: 0k | Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
| Epoch: 4/2565 (41/312) | Loss: 4.3082 | 2.0 steps/s | Step: 0k | Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0

@begeekmyfriend
Owner

I thought you were synthesizing out-of-set mels. In that case, please submit a PR for gta.py.

@leijue222
Contributor Author

I seem to have run into another problem. After training for a day, the current log is:

| Epoch: 643/2565 (179/312) | Loss: 2.4632 | 2.2 steps/s | Step: 200k | Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
| Epoch: 650/2565 (114/312) | Loss: 2.4635 | 2.2 steps/s | Step: 202k | Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0

The audio produced under WaveRNN/model_outputs/voc_mol.wavernn and by test_waveRNN.py sounds like this:
waveRNN_middle_result.zip
Among them, eval.waveval.npy was generated by the GL algorithm; the rest are WaveRNN results. These results sound very weird. Did I get something wrong?

@begeekmyfriend
Owner

643/2565 (179/312)
How long has this been training? Not long at all.

@leijue222
Contributor Author

643/2565 (179/312)
How long has this been training? Not long at all.

Do I need to train all 2565 epochs? At about 600 epochs per day, that would be roughly 4 days in total?

@begeekmyfriend
Owner

You can first check whether your training wavs and the GTA mels are aligned at the sample/frame level.

@leijue222
Contributor Author

leijue222 commented Nov 22, 2020

Here is a comparison of the two files. Sorry, I don't know much about this and can't really read it.
(Screenshots: mel1, mel2)
To train WaveRNN I only did two things:

  1. Run gta.py to generate mels
  2. Following the reference, move the original audio files into the appropriate directory

@begeekmyfriend
Owner

I mean: do the lengths of the mel and wav numpy arrays match?
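A minimal sketch of that check, assuming audio/*.npy stores raw samples, the GTA mels are [n_mels, frames] arrays, and hop_length matches the value used in preprocessing (the paths and hop length here are assumptions):

# Verify that a wav and its GTA mel are aligned at the sample/frame level,
# i.e. number of mel frames == number of samples // hop_length.
import numpy as np

hop_length = 275  # assumed; must match the hop length used in preprocessing
wav = np.load("training_data/tts_biaobei_22050/audio/000001.npy")
mel = np.load("gta/0/000001.npy")

print(wav.shape, mel.shape)
assert mel.shape[-1] == len(wav) // hop_length, "wav and mel lengths do not match"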

@leijue222
Contributor Author

(Screenshot: sendpix13)

@leijue222
Contributor Author

I found that your waveglow repo has a pretrained model. I plan to try the pretrained model first and then fine-tune it on my own dataset. But the README does not seem to match your modified code, and I could not get it to run either.

@leijue222
Contributor Author

leijue222 commented Nov 23, 2020

It was a version problem; switching from torch 1.6 to 1.3 made it work. But the pretrained model's results are mediocre, with severe metallic and hoarse artifacts.
By the way, is my current WaveRNN training procedure correct? It is at 300K now, but the generated wavs are still very bad, and I'm worried the training is wrong.
| Epoch: 1028/2565 (266/312) | Loss: 2.4107 | 2.2 steps/s | Step: 320k | Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
200k-300k.zip
The 300K _generate results are better than at 200K, but why is the volume of the target_wav files so low?

@begeekmyfriend
Owner

You can post-process the volume, for example using some mean-based calculation to equalize the volume across sentences.

@leijue222
Contributor Author

leijue222 commented Nov 23, 2020

Training mode 1:
☞ input: original audio
☞ input: mels extracted from the original audio

Training mode 2:
☞ input: original audio
☞ input: mels generated by tacotron2

So I am currently using mode 2, exactly following the issue mentioned above; there shouldn't be anything wrong, but I can't find the problem.

I also want to try mode 1 in parallel, without the --gta flag. I should just be able to move the audio and mels under training_data into the mel and quant directories, right?

Thanks, man. I don't work in this field and don't know much about it; talking with me for so long must be tiring. Sorry about that.

@leijue222
Contributor Author

You can post-process the volume, for example using some mean-based calculation to equalize the volume across sentences.

Do you mean that part of the code in your WaveRNN repo still has problems? That is, I can't just prepare the dataset and train directly, and need to track down where the code is wrong?

@begeekmyfriend
Owner

Post-processing is something you need to customize yourself. You can look at pydub and learn what RMS is. Also, 300K steps is not that many; check again at 600K.
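A minimal sketch of that kind of RMS-based loudness matching with pydub (pydub and the target level are assumptions, not part of either repo):

# Bring each clip toward a common loudness target using pydub's dBFS, which is
# derived from the signal's RMS. Paths and TARGET_DBFS are illustrative.
from pydub import AudioSegment

TARGET_DBFS = -20.0

def match_loudness(in_path, out_path):
    seg = AudioSegment.from_wav(in_path)
    gain = TARGET_DBFS - seg.dBFS  # dB to add (or remove) to reach the target
    seg.apply_gain(gain).export(out_path, format="wav")

match_loudness("eval_000001.wav", "eval_000001_norm.wav")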

@leijue222
Contributor Author

leijue222 commented Nov 25, 2020

example.zip
The archive contains the results of GTA training at 600K and non-GTA training at 250K.

GTA training, 600K:

  1. WaveRNN synthesis from training-set mels: intelligible, but not good;
  2. WaveRNN synthesis from tacotron2-generated mels: almost no sound;

Non-GTA training, 250K:

  1. WaveRNN synthesis from training-set mels: very good;
  2. WaveRNN synthesis from tacotron2-generated mels: produces output, but with problems such as extremely low volume and various noises;

It seems to be a problem with the mel value range; I am investigating.

@leijue222
Contributor Author

leijue222 commented Nov 26, 2020

I have two questions:

  1. When training taco2, the mels of all the training data are in the range [-4, 4], but why are the generated mels not within this range?
    That is, the mels used for taco2 training are in [-4, 4], but the generated mels have no range constraint.

  2. Is there any normalization before WaveRNN training, or does it stay at [-4, 4]?

That fork uses mel_bias to solve the problem, with a fixed value of 2. I tried adding 2 and it does improve things, but it is still not good enough. At least the problem has been located: the mel range.

@leijue222
Contributor Author

I processed the taco2 outputs with three different operations:

  1. mels += 2
  2. constrain the mel range to [-1, 1]
  3. constrain the mel range to [-4, 4]

Adding 2 gave the best result... but none of them were really good.

@begeekmyfriend
Owner

In this situation it depends on how you tune it yourself; I don't know the specifics.

@begeekmyfriend
Owner

Also, is WaveRNN's mel padding value -4?

@leijue222
Contributor Author

leijue222 commented Nov 26, 2020

Also, is WaveRNN's mel padding value -4?

Yes, -4. I didn't modify the default parameters. By the way, WaveRNN does normalize to [-1, 1] during training, right? See this part of the code.

In this situation it depends on how you tune it yourself; I don't know the specifics.

This is not a tuning issue; it's that I haven't figured out the expected input/output ranges and the correct processing. Since there is no README connecting the two repos, I can only rely on the information you gave in other issues. You can assume I followed your approach exactly.

  1. Training tacotron2 and synthesizing audio with GL gives satisfactory results;
  2. Training WaveRNN: the input seems to be the problem, so taco2-generated mels passed through WaveRNN sound very bad.

So I would like to ask: does the mel predicted by taco2 need any further processing before WaveRNN synthesizes audio from it?

  1. No processing: the synthesis is almost silent
  2. Following someone else's fork, mels += 2: slightly better, but still with obvious problems.
  3. Normalization:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-4, 4), copy=False)  # rescale to [-4, 4], operating on the original array in place
mels = min_max_scaler.fit_transform(mels)

Neither [-4, 4] nor [-1, 1] gives good results; there is a lot of noise, though at least it is not silent like option 1.

The core problem is still this:

  1. The training mels for taco2 are all in [-4, 4], but the mels predicted by inference.py are not in this range;
  2. Is this why taco2's predictions cannot be fed directly into WaveRNN? I tried the normalization above and it does not work either.

@leijue222
Contributor Author

leijue222 commented Nov 26, 2020

I just tried mel += 4 directly, and surprisingly it gives the best result; it is just not very stable, with occasional crackling noise. I will check after class what offset value is appropriate. I don't understand the logic of why simply adding a constant to the mel improves the audio quality so much.


I can't figure out the cause or what offset is appropriate; here are the results for you to listen to:
eval.zip

By the way, I didn't use --gta. With --gta, even at 800K the synthesis is bad no matter how I tune it; this WaveRNN was trained on the original audio and mels extracted from the original audio.

@leijue222
Contributor Author

leijue222 commented Nov 26, 2020

The most direct comparison:

eval.npy
tensor([[[ -5.9253,  -5.0147,  -5.4051,  ...,  -8.9068,  -9.6344, -10.6227],
         [ -5.8990,  -4.6007,  -4.3217,  ...,  -8.8981,  -9.5619, -10.6757],
         [ -5.7439,  -3.6977,  -3.1137,  ...,  -8.8748,  -9.4325, -10.6157],
         ...,
         [ -6.7197,  -5.8039,  -5.5047,  ...,  -9.5029,  -9.7275, -10.1876],
         [ -6.6282,  -5.5546,  -5.0863,  ...,  -9.3651,  -9.6204, -10.1205],
         [ -6.6143,  -5.5120,  -5.0923,  ...,  -9.1186,  -9.4814, -10.0734]]])
| ████████████████ 44800/45056 | Batch Size: 8 | Gen Rate: 14.3kHz | Elapsed 3.193572521209717 seconds
aida_000004.npy
tensor([[[-1.4159, -1.1807, -1.5785,  ..., -3.5899, -3.9558, -4.0000],
         [-1.7761, -1.1344, -1.2679,  ..., -3.6451, -4.0000, -4.0000],
         [-2.9079, -0.4590,  0.0271,  ..., -3.6788, -3.6406, -4.0000],
         ...,
         [-1.4811, -1.2190, -1.3176,  ..., -4.0000, -4.0000, -4.0000],
         [-1.4003, -1.2009, -1.0263,  ..., -3.9386, -4.0000, -4.0000],
         [-1.6547, -1.4295, -1.3323,  ..., -3.8347, -3.8417, -4.0000]]])
| ████████████████ 44800/45056 | Batch Size: 8 | Gen Rate: 14.4kHz | Elapsed 3.173750162124634 seconds

eval.npy is the mel output by taco2; aida_000004.npy is the mel of the original audio (generated by preprocess.py).

The audio synthesized from aida_000004.npy is very good.
The audio synthesized from eval.npy is terrible, almost silent; mel += 4 improves it a lot, but cannot eliminate the occasional noises.
example4.zip

@begeekmyfriend Do you know where the problem is? I think it is the mel output by taco2. I never changed max_abs_value, and I don't know why the range of the mel output by taco2 is not within [-4, 4]. It seems to need some processing before it can be fed into WaveRNN, but I don't know what processing; I have already tried normalization and adding a constant to the mel. I hope you can point me to the correct processing, or to how to keep taco2's output correctly within [-4, 4].

@begeekmyfriend
Owner

begeekmyfriend commented Nov 27, 2020

I believe I clip the GTA mels; please confirm. How could WaveRNN see values below -4? Those "leftovers" are very likely what causes the noise in synthesis.

@leijue222
Contributor Author

leijue222 commented Nov 27, 2020

I believe I clip the GTA mels; please confirm. How could WaveRNN see values below -4? Those "leftovers" are very likely what causes the noise in synthesis.

GTA training has problems: I trained it to 1000K, but the result is extremely bad, very shaky, and it even sounds like there are two voices mixed together.

So I switched to non-GTA mode, training WaveRNN on the original audio and mels extracted from it; it already gave good results at 150K, and I stopped at 600K.

What I am still puzzled about is: why is the range of the mel predicted by taco2 not within [-4, 4]?

@begeekmyfriend
Owner

That is expected: preprocessing subtracts ref_level_db, which biases the output toward negative values, so the value range is not symmetric whether GTA or not. I just force-clip to -4 to remove those "leftovers"; it does not affect the results.

Also, tacotron2 and WaveRNN are independent of each other. If you train WaveRNN on GTA mels, their range is basically consistent with what T2 outputs at inference. If the GTA mels sound fine through Griffin-Lim, then the training should be fine; please confirm that.
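A minimal sketch of that clip before handing a predicted mel to the vocoder, assuming max_abs_value is 4 as discussed above (path names are illustrative):

# Clip a Tacotron2-predicted mel into the same [-4, 4] range as the training mels,
# removing the out-of-range "leftovers" that can show up as noise.
import numpy as np

MAX_ABS_VALUE = 4.0  # assumed to match hparams.max_abs_value / -mel_pad_val

mel = np.clip(np.load("eval.npy"), -MAX_ABS_VALUE, MAX_ABS_VALUE)
np.save("eval_clipped.npy", mel, allow_pickle=False)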

@leijue222
Contributor Author

Thanks for your help!

@leijue222
Contributor Author

leijue222 commented Dec 5, 2020

I have new progress: both the Biaobei female voice and the AliTTS male voice sound great.
There are two points I want to share:

  1. You can't judge whether the model is good or bad based only on the synthesis quality of the GL algorithm. Taco2 needs more training: my result at epoch 300 is better than at epoch 200 (with WaveRNN).
  2. If WaveRNN is good for one speaker and bad for another, you can reduce the good speaker's data and continue training.

Thanks again!

There are two questions I'd like your advice on:

  1. The accuracy of jieba word segmentation is not good enough; do you have a better word segmentation suggestion?
  2. You set max decoder steps = 1000. I tested that it can synthesize audio up to 34 seconds, and anything longer gets cut off. I tried splitting and splicing at commas, periods, etc., but the pauses were a little hard to control. Can you give me some suggestions? (A splitting sketch follows at the end of this comment.)

I tested changing max decoder steps:
26s audio, max decoder steps = 1000: the male voice is good.
26s audio, max decoder steps = 2000: the male voice sometimes has noise.
It's a bit strange; the 26s audio does not exceed 1000 steps...
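A minimal sketch of the splitting mentioned in question 2, cutting at Chinese punctuation so each chunk stays well under the decoder step limit (the punctuation set is an assumption):

# Split a long sentence at Chinese punctuation; each chunk keeps its trailing mark
# and can then be synthesized separately and spliced back together.
import re

def split_text(text):
    parts = re.findall(r"[^，、。！？；]+[，、。！？；]?", text)
    return [p.strip() for p in parts if p.strip()]

print(split_text("卡尔普陪外孙玩滑梯，假语村言别再拥抱我。"))
# -> ['卡尔普陪外孙玩滑梯，', '假语村言别再拥抱我。']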

@begeekmyfriend
Owner

You might try pkuseg for Chinese word segmentation. As for tools for splitting and splicing long sentences, I'm afraid I have no idea; maybe punctuator2? You'd better read its README closely. Good luck!

@leijue222
Contributor Author

Handing in my homework; here are some results for you to listen to: wav_sample.zip

Sharing some of what I learned:

  1. Pause handling:
    Because the pauses at punctuation in both the Biaobei and AliTTS data are quite short, the model can hardly learn pausing from punctuation labels.
    The solution is to synthesize the chunks separately and insert pauses of different lengths for different punctuation marks. So taco2's predicted range must be constrained to [-4, 4] (reportedly people have tried [0, 1] and [-1, 1], but this works best); a mel whose values are all -4 is a silent segment. (See the sketch after this list.)

  2. Tone handling (mainly the third-tone sandhi problem):
    No existing text-to-pinyin tool is 100% correct, and dirty data inevitably affects the tones, so I only use the Biaobei pinyin. I suggest separating initials, finals and tones to avoid a sparse matrix.
    The Biaobei pinyin follows a 33 tone sandhi rule: 33 -> 23, 333 -> 223, 3333 -> 2323. So before prediction the text labels have to be converted according to this rule.

  3. wavernn
    I did not use GTA mode; instead I used the original audio and mels extracted from it. Mainly because I can't tell which taco2 model is best, and I was afraid that if it wasn't good enough the mismatch would be too large and hurt the audio quality.
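A minimal sketch of the pause insertion described in point 1, joining per-chunk mels with silence frames set to the mel padding value; the pause lengths and frame rate are assumptions:

# Concatenate per-chunk mels with silent segments (frames filled with -4, the mel
# padding value); the silence length depends on the punctuation after each chunk.
import numpy as np

MEL_PAD_VAL = -4.0   # assumed mel padding value, as discussed above
FRAMES_PER_SEC = 80  # roughly 22050 Hz / 275 hop length, an assumption

def join_with_pauses(mels, puncts, n_mels=80):
    # mels: list of [n_mels, T] arrays; puncts: the punctuation mark after each chunk
    pause_sec = {"，": 0.25, "、": 0.2, "。": 0.5, "！": 0.5, "？": 0.5}
    pieces = []
    for mel, p in zip(mels, puncts):
        pieces.append(mel)
        n_frames = int(pause_sec.get(p, 0.3) * FRAMES_PER_SEC)
        pieces.append(np.full((n_mels, n_frames), MEL_PAD_VAL, dtype=np.float32))
    return np.concatenate(pieces, axis=-1)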


Also, I'd like to ask you a more general question:
You combined NVIDIA/tacotron2 and Rayhane-mamah/Tacotron-2, and both projects were originally only used to synthesize English.
Besides the dataset, what other changes are needed for the original taco2 model to synthesize Chinese?

Finally, thank you again for open-sourcing this!

@begeekmyfriend
Owner

Congrats! Your volume is not well balanced; I suggest post-processing it, for example with the root mean square (RMS).

Pause handling is also generally a matter of data annotation. Clip the mel value range to -4 across the board, GTA mels included.

As for tones, this can be solved with a Python script during preprocessing; it is just code logic, so I won't go into details.

GTA mels work fine for me; the MOS in the paper is even better than GT. Please look into the cause again.

@yannier912

@leijue222
Hello! I saw that you mentioned generating speech with the Ali API as training data; I listened to the results and they sound great! I'd like to ask:

  1. Is the Ali TTS API a free trial? Is it enough to generate 12 hours of training data?
  2. For the 12 hours of generated speech, did you use the Biaobei txt or other random text?
  3. I listened to your Biaobei and Ali samples; the rhythm, pausing and intonation are all very good. My Biaobei synthesis is clear enough and doesn't add or drop characters, but the rhythm and intonation sound very robotic. So what extra processing did you do on train.txt? My generated text looks like this:
    audio-002147.npy|mel-002147.npy|40700|148|两人再次想歪点子。|l iang3 r en2 z ai4 c i4 x iang3 w ai1 d ian3 z i 。
    Is it enough to manually add word segmentation to train.txt as the author suggested, e.g. liang3ren2 zai4ci4 xiang3 wai1dian3zi, to reach that quality?
    Looking forward to your reply! Thank you!!

@leijue222
Contributor Author

leijue222 commented Dec 29, 2020

@yannier912
Ali TTS gives developers a certain free quota; you can look it up.
Because the Biaobei text has been manually proofread, its accuracy is higher than anything produced by other tools, so I used the Biaobei text with some changes: new_baker.txt
Also, at prediction time pay attention to the 33 tone sandhi rule:

Biaobei's 33 tone sandhi pattern:
33 -> 23
333 -> 223
3333 -> 2323

Follow Biaobei's 33 sandhi pattern and modify the input text label at prediction time, and all 33 sequences will be pronounced correctly.
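A minimal sketch of applying that rule to tone-numbered pinyin before inference (runs of third tones longer than four syllables are left unchanged here; this is an illustration, not the exact Biaobei script):

# Rewrite runs of consecutive third tones according to the pattern above:
# 33 -> 23, 333 -> 223, 3333 -> 2323.
SANDHI = {2: "23", 3: "223", 4: "2323"}

def apply_33_sandhi(pinyin):
    syls = pinyin.split()
    out, i = [], 0
    while i < len(syls):
        if not syls[i].endswith("3"):
            out.append(syls[i])
            i += 1
            continue
        j = i
        while j < len(syls) and syls[j].endswith("3"):
            j += 1
        pattern = SANDHI.get(j - i, "3" * (j - i))  # other run lengths unchanged
        out.extend(syls[i + k][:-1] + t for k, t in enumerate(pattern))
        i = j
    return " ".join(out)

print(apply_33_sandhi("ka3 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1"))
# -> "ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1"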

@yannier912

What does "33 sandhi" refer to? I searched but couldn't find anything related; I'm still new to speech research...

@yannier912

@leijue222 Thanks for sharing!

@leijue222
Contributor Author

What does "33 sandhi" refer to? I searched but couldn't find anything related; I'm still new to speech research...

You can see that the three characters of 卡尔普 in the Biaobei text are all third tone (333), but the Biaobei transcript turns them into (223). The same rule applies to 假语 (33) and 躲躲闪闪 (3333). In short, when third tones occur consecutively the tones change, so the annotation handles it.
Training on the Biaobei text as-is is not a problem. Just note that at prediction time, if the input contains words with consecutive third tones, they also need to be converted following Biaobei's 33 sandhi rule.
