>>> LSutton
[February 18, 2021, 11:33pm]
Hey all,
I am attempting to continue fine-tuning an existing model. This is my
first attempt at training voice models. My hope was to take a working
model, fine-tune it, and hear the results gradually improve over
further epochs.
I understand this may be a premature evaluation, but going from a working
Tacotron2 model to just after the first epoch, the output is unrecognizable
as human speech. I expected a very fine, gradual improvement, starting
somewhere near where the original model ended up. Instead, the quality
drops so far that you can't understand it. So I am trying to
understand where I misconfigured my training setup. Below are my
configs and starting points:
Model: Multi-Speaker-Tacotron2 DDC,
using source code based on commit 6cc464e. (I tried the
advertised 2136433 commit, but there were issues getting the model to
align with the code, so I reviewed the git log and used a downstream
commit where fixes appeared to be merged. This seemed like a safe commit
to attempt, and it did not produce any model-based warnings or errors
when run.)
I pulled the model and configurations from the associated Colab
Notebook.
The configuration I ran with is:
{
'github_branch':'* origin2_dev',
'model': 'Tacotron2', // one of the model in models/
'run_name': 'vctk-r=2-ddc',
'run_description': 'tacotron2 on vctl r=2 with ddc only without guided attention',
'mixed_precision': false,
// AUDIO PARAMETERS
'audio':{
'fft_size': 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
'win_length': 1024, // stft window length in samples.
'hop_length': 256, // stft window hop-length in samples.
'frame_length_ms': null, // stft window length in ms. If null, 'win_length' is used.
'frame_shift_ms': null, // stft window hop-length in ms. If null, 'hop_length' is used.
// Audio processing parameters
'sample_rate': 48000, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
'preemphasis': 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
'ref_level_db': 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
'do_trim_silence': true,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
'trim_db': 25, // threshold for trimming silence. Set this according to your dataset.
// Griffin-Lim
'power': 1.5, // value to sharpen wav signals after GL algorithm.
'griffin_lim_iters': 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
// MelSpectrogram parameters
'num_mels': 80, // size of the mel spec frame.
'mel_fmin': 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
'mel_fmax': 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
'spec_gain': 1.0, // scalar value applied after log transform of spectrogram.
// Normalization parameters
'signal_norm': true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
'min_level_db': -100, // lower bound for normalization
'symmetric_norm': true, // move normalization to range [-1, 1]
'max_norm': 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
'clip_norm': true, // clip normalized values into the range.
//'stats_path': '/workspace/scale_stats.npy' // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
'stats_path': null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
},
// VOCABULARY PARAMETERS
// if custom character set is not defined,
// default set in symbols.py is used
'characters':{
'pad': '_',
'eos': '&',
'bos': '*',
//'characters': 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? ',
//'characters': 'ABCDEFGHIJKLMçãàáâêéíóôõúûabcdefghijklmnopqrstuvwxyz!'(),-.:;? ',
'characters': 'ABCDEFGHIJKLMNOPQRSTUVWXYZÇÃÀÁÂÊÉÍÓÔÕÚÛabcdefghijklmnopqrstuvwxyzçãàáâêéíóôõúû!(),-.:;? ',
'punctuations':'!'(),-.:;? ',
'phonemes':'iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ'̃' '
},
// DISTRIBUTED TRAINING
'distributed':{
'backend': 'nccl',
'url': 'tcp://localhost:54322'
},
'reinit_layers': [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
'batch_size': 32, // Batch size for training. Lower values than 32 might cause hard to learn attention. It is overwritten by 'gradual_training'.
'eval_batch_size':16,
'r': 2, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
'gradual_training': [[0, 7, 32], [1, 5, 32], [100000, 3, 16], [250000, 2, 16], [500000, 1, 16]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceed.
// LOSS SETTINGS
'loss_masking': false, // enable / disable loss masking against the sequence padding.
'decoder_loss_alpha': 0.5, // decoder loss weight. If > 0, it is enabled
'postnet_loss_alpha': 0.25, // postnet loss weight. If > 0, it is enabled
'ga_alpha': 0.0, // weight for guided attention loss. If > 0, guided attention is enabled.
'decoder_diff_spec_alpha': 0.25, // differential spectral loss weight. If > 0, it is enabled
'postnet_diff_spec_alpha': 0.25, // differential spectral loss weight. If > 0, it is enabled
'decoder_ssim_alpha': 0.5, // decoder SSIM loss weight. If > 0, it is enabled
'postnet_ssim_alpha': 0.25, // postnet SSIM loss weight. If > 0, it is enabled
// VALIDATION
'run_eval': true,
'test_delay_epochs': 1, //Until attention is aligned, testing only wastes computation time.
'test_sentences_file': null,//'../../../datasets/BRSpeech-3-Speakers-Paper/BRSpeech-3-Speakers-Paper/TTS-Portuguese_Corpus/test_setences.txt', // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
'noam_schedule': false, // use noam warmup and lr schedule.
'grad_clip': 1.0, // upper limit for gradients for clipping.
'epochs': 1000, // total number of epochs to train.
'lr': 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
'wd': 0.000001, // Weight decay weight.
'warmup_steps': 4000, // Noam decay steps to increase the learning rate from 0 to 'lr'
'seq_len_norm': true, // Normalize each sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has skewed distribution of sequence lengths.
// TACOTRON PRENET
'memory_size': -1, // ONLY TACOTRON - size of the memory queue used for storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
'prenet_type': 'original', // 'original' or 'bn'.
'prenet_dropout': true, // enable/disable dropout at prenet.
// ATTENTION
'attention_type': 'original', // 'original' or 'graves'
'attention_heads': 4, // number of attention heads (only for 'graves')
'attention_norm': 'softmax', // softmax or sigmoid. Suggested to use softmax for Tacotron2 and sigmoid for Tacotron.
'windowing': false, // Enables attention windowing. Used only in eval mode.
'use_forward_attn': false, // if it uses forward attention. In general, it aligns faster.
'forward_attn_mask': false, // Additional masking forcing monotonicity only in eval mode.
'transition_agent': false, // enable/disable transition agent of forward attention.
'location_attn': true, // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
'bidirectional_decoder': false, // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.
'double_decoder_consistency': true, // use DDC explained here https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency-draft/
'ddc_r': 4, // reduction rate for coarse decoder.
'apex_amp_level': null, // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use 'O1' to activate.
// STOPNET
'stopnet': true, // Train stopnet predicting the end of synthesis.
'separate_stopnet': true, // Train stopnet separately if 'stopnet==true'. It prevents stopnet loss from influencing the rest of the model. It gives a better model, but it trains SLOWER.
// TENSORBOARD and LOGGING
'print_step': 25, // Number of steps to log training on console.
'tb_plot_step': 100, // Number of steps to plot TB training figures.
'print_eval': false, // If True, it prints intermediate loss values in evaluation.
'save_step': 250, // Number of training steps expected to save training stats and checkpoints.
'checkpoint': true, // If true, it saves checkpoints per 'save_step'
'tb_model_param_stats': true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
'text_cleaner': 'english_cleaners',
'enable_eos_bos_chars': false, // enable/disable beginning of sentence and end of sentence chars.
'num_loader_workers': 8, // number of training data loader processes. Don't set it too big. 4-8 are good values.
'num_val_loader_workers': 8, // number of evaluation data loader processes.
'batch_group_size': 0, //Number of batches to shuffle after bucketing.
'min_seq_len': 2, // DATASET-RELATED: minimum text length to use in training
'max_seq_len': 153, // DATASET-RELATED: maximum text length
// PATHS
// 'output_path': '/data5/rw/pit/keep/', // DATASET-RELATED: output path for all training outputs.
'output_path': '/data/output/LJSpeech/',
// PHONEMES
'phoneme_cache_path': '/workspace/phonemes/', // phoneme computation is slow, therefore, it caches results in the given folder.
'use_phonemes': true, // use phonemes instead of raw characters. It is suggested for better pronunciation.
'phoneme_language': 'en-us', // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
// MULTI-SPEAKER and GST
// 'speaker_encoder_config_path':'/root/speaker_encoder/LibriTTS-common-voice-voxceleb_angle_proto/config.json', // config.json for the speaker encoder
// 'speaker_encoder_checkpoint_path': '/root/speaker_encoder/LibriTTS-common-voice-voxceleb_angle_proto/320k.pth.tar', // Speaker Encoder Checkpoint full path
'use_speaker_embedding': true, // use speaker embedding to enable multi-speaker learning.
'use_external_speaker_embedding_file': true, // if true, forces the model to use external embedding per sample instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs/1806.04558
'external_speaker_embedding_file': '/workspace/train/speakers.json', // if not null and use_external_speaker_embedding_file is true, it is used to load a specific embedding file and thus uses these embeddings instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs/1806.04558
'use_gst': false, // TACOTRON ONLY: use global style tokens
'gst': { // gst parameter if gst is enabled
'gst_style_input': null, // Condition the style input either on a
// -> wave file [path to wave] or
// -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {'0': 0.15, '1': 0.15, '5': -0.15}
// with the dictionary being len(dict) <= len(gst_style_tokens).
'disable_gst': false, // if true, it forces GST to predict a zero matrix.
'gst_use_speaker_embedding': true, // if true pass speaker embedding in attention input GST.
'gst_embedding_dim': 128,
'gst_num_heads': 4,
'gst_style_tokens': 5
},
// DATASETS
'datasets': // List of datasets. They all merged and they get different s$
[
{
'name': 'vctk',
'path': '/data/VCTK/VCTK-Corpus',
// for the VCTK test I chose 1 speaker from each accent,
'meta_file_train': ['p225', 'p234', 'p238', 'p245', 'p248', 'p261', 'p294', 'p302', 'p326', 'p335', 'p347'], // for VCTK, if a list is given, the speaker ids in the list are ignored for training; this is useful for testing cloning with new speakers
'meta_file_val': null
}
]
}
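For reference, the 'gradual_training' comment describes each entry as [first_step, r, batch_size]. Read literally, the entry in effect at a given global step can be looked up as in the sketch below. This is not the library's own scheduler: `active_schedule_entry` and its `step_scale` argument are hypothetical names used only for illustration. `step_scale` is an assumption (for example, a step count scaled by the number of GPUs) that would account for 2 output frames being reported at step 220000.

```python
# Schedule copied from the config above: [first_step, r, batch_size].
SCHEDULE = [[0, 7, 32], [1, 5, 32], [100000, 3, 16], [250000, 2, 16], [500000, 1, 16]]


def active_schedule_entry(schedule, global_step, step_scale=1):
    """Return (r, batch_size) of the last entry whose first_step has been reached.

    step_scale is a hypothetical multiplier on the step count; it is an
    assumption for illustration, not a documented library parameter.
    """
    effective_step = global_step * step_scale
    current = None
    for first_step, r, batch_size in schedule:
        if effective_step >= first_step:
            current = (r, batch_size)
    if current is None:
        raise ValueError("no schedule entry applies to this step")
    return current


# Plain lookup at the restored step: the entry starting at 100000 applies.
print(active_schedule_entry(SCHEDULE, 220000))                # (3, 16)
# With the step count doubled, the entry starting at 250000 applies.
print(active_schedule_entry(SCHEDULE, 220000, step_scale=2))  # (2, 16)
```

Comparing the looked-up r with the "Number of output frames" line in the training log is a quick way to see which schedule entry a resumed run is actually using.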
As stated above, this configuration runs as expected without warnings or
errors. But when I review the test audio in TensorBoard, it's
indistinguishable from pure noise. (Attached: test_audio.zip, 528.5 KB;
rename to test_audio.wav without unzipping.)
Further, the supporting images seem to indicate a 'start over' scenario.
Finally, as I said, I didn't see any warnings or errors on
application startup. I include the first portion of the application log
here:
python /workspace/TTS/TTS/bin/distribute.py --script /workspace/TTS/TTS/bin/train_tacotron.py --config_path=/workspace/train/config.json --continue_path=/workspace/output/ --restore_path=/workspace/train/tts_model.pth.tar
2021-02-18 16:14:28.576959: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-02-18 16:14:28.592577: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
> Using CUDA: True
> Number of GPUs: 2
> Training continues for /workspace/output/
> Setting up Audio Processor...
| > sample_rate:48000
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:0
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:50.0
| > mel_fmax:7600.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:25
| > do_sound_norm:False
| > stats_path:None
| > hop_length:256
| > win_length:1024
| > Found 39490 files in /data/VCTK/VCTK-Corpus
Training with 96 speakers: VCTK_p226, VCTK_p227, VCTK_p228, VCTK_p229, VCTK_p230, VCTK_p231, VCTK_p232, VCTK_p233, VCTK_p236, VCTK_p237, VCTK_p239, VCTK_p240, VCTK_p241, VCTK_p243, VCTK_p244, VCTK_p246, VCTK_p247, VCTK_p249, VCTK_p250, VCTK_p251, VCTK_p252, VCTK_p253, VCTK_p254, VCTK_p255, VCTK_p256, VCTK_p257, VCTK_p258, VCTK_p259, VCTK_p260, VCTK_p262, VCTK_p263, VCTK_p264, VCTK_p265, VCTK_p266, VCTK_p267, VCTK_p268, VCTK_p269, VCTK_p270, VCTK_p271, VCTK_p272, VCTK_p273, VCTK_p274, VCTK_p275, VCTK_p276, VCTK_p277, VCTK_p278, VCTK_p279, VCTK_p280, VCTK_p281, VCTK_p282, VCTK_p283, VCTK_p284, VCTK_p285, VCTK_p286, VCTK_p287, VCTK_p288, VCTK_p292, VCTK_p293, VCTK_p295, VCTK_p297, VCTK_p298, VCTK_p299, VCTK_p300, VCTK_p301, VCTK_p303, VCTK_p304, VCTK_p305, VCTK_p307, VCTK_p308, VCTK_p310, VCTK_p311, VCTK_p312, VCTK_p313, VCTK_p314, VCTK_p316, VCTK_p317, VCTK_p318, VCTK_p323, VCTK_p329, VCTK_p330, VCTK_p333, VCTK_p334, VCTK_p336, VCTK_p339, VCTK_p340, VCTK_p341, VCTK_p343, VCTK_p345, VCTK_p351, VCTK_p360, VCTK_p361, VCTK_p362, VCTK_p363, VCTK_p364, VCTK_p374, VCTK_p376
> Using model: Tacotron2
> Model restored from step 220000
> Model has 51314996 parameters
> EPOCH: 0/1000
> Number of output frames: 2
> DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 39096
| > Max length sequence: 181
| > Min length sequence: 9
| > Avg length sequence: 39.60463986085533
| > Num. instances discarded by max-min (max=153, min=2) seq limits: 184
| > Batch group size: 0.
> TRAINING (2021-02-18 16:16:17)
--> STEP: 24/1216 -- GLOBAL_STEP: 220025
| > decoder_loss: 4.02636 (4.76018)
| > postnet_loss: 3.88876 (4.63819)
| > stopnet_loss: 0.36340 (0.33130)
| > decoder_coarse_loss: 5.08758 (5.75460)
| > decoder_ddc_loss: 0.01690 (0.01789)
| > decoder_diff_spec_loss: 0.05701 (0.08949)
| > postnet_diff_spec_loss: 0.04774 (0.07405)
| > decoder_ssim_loss: 0.49228 (0.51219)
| > postnet_ssim_loss: 0.49113 (0.51146)
| > loss: 5.76943 (6.75253)
| > align_error: 0.62557 (0.60317)
| > avg_spec_length: 281.0
| > avg_text_length: 15.2
| > step_time: 3.8240
| > loader_time: 0.01
| > current_lr: 0.0001
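For orientation when reading the audio values in this log, the frame sizes can be converted to time. The sketch below assumes that 'win_length' and 'hop_length' are sample counts and that 'avg_spec_length' is a number of spectrogram frames; the helper names are hypothetical and not part of the library.

```python
# Values taken from the log above.
SAMPLE_RATE = 48000   # Hz
WIN_LENGTH = 1024     # assumed to be in samples
HOP_LENGTH = 256      # assumed to be in samples
AVG_SPEC_LENGTH = 281.0  # assumed to be in frames


def samples_to_ms(num_samples, sample_rate):
    """Convert a sample count to milliseconds."""
    return 1000.0 * num_samples / sample_rate


def frames_to_seconds(num_frames, hop_length, sample_rate):
    """Approximate audio duration covered by num_frames hops."""
    return num_frames * hop_length / sample_rate


print(samples_to_ms(WIN_LENGTH, SAMPLE_RATE))   # window: about 21.3 ms
print(samples_to_ms(HOP_LENGTH, SAMPLE_RATE))   # hop: about 5.3 ms
print(frames_to_seconds(AVG_SPEC_LENGTH, HOP_LENGTH, SAMPLE_RATE))  # about 1.5 s
print(SAMPLE_RATE / 2)  # Nyquist frequency in Hz, for comparison with mel_fmax
```

These figures only hold if the training data and the restored checkpoint share the same sample rate and hop length, so they are worth comparing against the settings the original model was trained with.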
Any help or insight into what I'm misunderstanding would be greatly
appreciated. Thanks much for the library and the hard work!
[This is an archived TTS discussion thread from discourse.mozilla.org/t/fine-tuning-vctk-model-destroys-quality]