Finish documentation for v2.2.0 release #156

Merged · 3 commits · Nov 22, 2023
Changes from 1 commit
Add detailed instructions for shallow diffusion
yqzhishen committed Nov 21, 2023
commit 76bf42b6ba1352e80d48868933bc8921e2db9fde
91 changes: 89 additions & 2 deletions docs/BestPractices.md
@@ -211,16 +211,103 @@ pe_ckpt: checkpoints/rmvpe/model.pt
[Harvest](https://github.com/mmorise/World) (Harvest: A high-performance fundamental frequency estimator from speech signals) is the recommended pitch extractor from Masanori Morise's WORLD, a free software package for high-quality speech analysis, manipulation and synthesis. It is a state-of-the-art algorithmic pitch estimator designed for speech, but it has also seen use in singing voice synthesis. It runs the slowest among the supported extractors, but produces more accurate F0 than parselmouth on clean and normal recordings.

To use Harvest, simply include the following line in your configuration file:

```yaml
pe: harvest
```

**Note:** It is also recommended to adjust the F0 detection range for Harvest in accordance with your dataset, as these values are hard boundaries for this algorithm and the defaults might not suffice for your use case. To change the F0 detection range, include or edit this part in the configuration file:

```yaml
f0_min: 65 # Minimum F0 to detect
f0_max: 800 # Maximum F0 to detect
```

## Shallow diffusion

Shallow diffusion is a mechanism that can improve quality and save inference time for diffusion models, first introduced in the original DiffSinger [paper](https://arxiv.org/abs/2105.02446). Instead of starting the diffusion process from pure Gaussian noise as classic diffusion does, shallow diffusion adds shallow Gaussian noise to a low-quality result generated by a simple network (called the auxiliary decoder), thus skipping many unnecessary steps at the beginning. By combining shallow diffusion with sampling acceleration algorithms, we can get better results at the same inference speed as before, or achieve higher inference speed without quality deterioration.
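
To make the mechanism concrete, below is a minimal PyTorch-style sketch of shallow diffusion at inference time. The helper names (`aux_decoder`, `denoise_step`, `alphas_cumprod`) and tensor shapes are illustrative assumptions, not the actual API of this repository.

```python
import torch

def shallow_diffusion_infer(cond, aux_decoder, denoise_step, alphas_cumprod, k_step):
    """Illustrative sketch: start the reverse diffusion from a noised
    auxiliary result instead of pure Gaussian noise."""
    # 1. Get a rough, low-quality mel-spectrogram from the auxiliary decoder.
    x_aux = aux_decoder(cond)                       # e.g. [B, T, M]

    # 2. Diffuse it forward to depth k_step (q(x_k | x_0) in DDPM terms)
    #    instead of sampling x_T from N(0, I).
    a_k = alphas_cumprod[k_step - 1]                # cumulative alpha at depth k
    noise = torch.randn_like(x_aux)
    x = a_k.sqrt() * x_aux + (1.0 - a_k).sqrt() * noise

    # 3. Run only the last k_step reverse steps; the earlier ones are skipped.
    for t in reversed(range(k_step)):
        x = denoise_step(x, t, cond)
    return x
```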

Currently, acoustic models in this repository support shallow diffusion. The main switch of shallow diffusion is `use_shallow_diffusion` in the configuration file, and most arguments of shallow diffusion can be adjusted under `shallow_diffusion_args`. See [Configuration Schemas](ConfigurationSchemas.md) for more details.

### Train full shallow diffusion models from scratch

To train a full shallow diffusion model from scratch, simply introduce the following settings in your configuration file:

```yaml
use_shallow_diffusion: true
K_step: 400 # adjust according to your needs
K_step_infer: 400 # should be <= K_step
```

Please note that when shallow diffusion is enabled, only the last `K_step` diffusion steps will be trained. Unlike classic diffusion models, which are trained on all steps, limiting `K_step` makes training more efficient. However, `K_step` should not be set too small, because without enough diffusion depth (steps) the low-quality auxiliary decoder results cannot be refined well. A range of 200 to 400 is appropriate for `K_step`.
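
As a rough illustration of what this restriction means during training, the diffusion timestep can simply be drawn from `[0, K_step)` instead of the full schedule. The helpers `q_sample` and `denoise_fn` below are hypothetical placeholders, not functions from this repository.

```python
import torch
import torch.nn.functional as F

def shallow_training_step(x0, cond, q_sample, denoise_fn, k_step):
    """Sketch of one training step with shallow diffusion enabled:
    only the last k_step timesteps of the schedule are ever trained."""
    b = x0.shape[0]
    t = torch.randint(0, k_step, (b,), device=x0.device)  # not the full range
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)          # forward-diffuse ground truth to step t
    pred = denoise_fn(x_t, t, cond)       # predict the injected noise
    return F.mse_loss(pred, noise)
```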

The auxiliary decoder and the diffusion decoder share the same linguistic encoder, which receives gradients from both decoders. In some experiments, it was found that gradients from the auxiliary decoder can cause a mismatch between the encoder and the diffusion decoder, leaving the latter unable to produce reasonable results. To prevent this, a configuration item called `aux_decoder_grad` applies a scale factor to the gradients coming from the auxiliary decoder during training. To adjust this factor, add the following to the configuration file:

```yaml
shallow_diffusion_args:
  aux_decoder_grad: 0.1 # should not be too high
```
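
A common way to implement such a factor, sketched here as an assumption about the mechanism rather than a quote of the actual code, is to mix the encoder output with a detached copy before feeding it to the auxiliary decoder: the forward value is unchanged, but the gradient flowing back to the encoder is multiplied by the scale.

```python
import torch

def scale_grad(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Identity in the forward pass; the gradient reaching x is scaled."""
    return x * scale + x.detach() * (1.0 - scale)

# Hypothetical usage inside the acoustic model's forward pass:
# condition = encoder(...)
# aux_mel = aux_decoder(scale_grad(condition, 0.1))   # damped gradient to encoder
# diff_mel = diffusion_decoder(condition, ...)        # full gradient to encoder
```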

### Train auxiliary decoder and diffusion decoder separately

Training a full shallow diffusion model consumes more memory because the auxiliary decoder is also part of the training graph. In memory-limited situations, the two decoders can be trained separately, i.e. one decoder after the other.

**STEP 1: train the diffusion decoder**

In the first stage, the linguistic encoder and the diffusion decoder are trained together, while the auxiliary decoder is left unchanged. Edit your configuration file like this:

```yaml
use_shallow_diffusion: true # make sure the main option is turned on
shallow_diffusion_args:
  train_aux_decoder: false # exclude the auxiliary decoder from the training graph
  train_diffusion: true # train the diffusion decoder as normal
  val_gt_start: true # should be true because the auxiliary decoder is not trained yet
```

Start training until `max_updates` is reached, or until you get satisfactory results in TensorBoard.

**STEP 2: train the auxiliary decoder**

In the second stage, only the auxiliary decoder is trained; the linguistic encoder and the diffusion decoder are excluded. Edit your configuration file like this:

```yaml
shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: false # exclude the diffusion decoder from the training graph
lambda_aux_mel_loss: 1.0 # no more need to limit the auxiliary loss
```

Then you should freeze the encoder to prevent it from being updated. If the encoder changes, it no longer matches the diffusion decoder, which would again leave the latter unable to produce correct results. Edit your configuration file:

```yaml
freezing_enabled: true
frozen_params:
  - model.fs2 # the linguistic encoder
```
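
Under the hood, freezing by parameter name usually just disables gradient updates for every parameter whose qualified name matches one of the listed prefixes. A minimal sketch, assuming a PyTorch module whose linguistic encoder lives under the attribute `fs2`:

```python
import torch.nn as nn

def freeze_by_prefix(model: nn.Module, prefixes=("fs2",)):
    """Sketch: turn off gradient updates for parameters matching a prefix."""
    for name, param in model.named_parameters():
        if any(name.startswith(p) for p in prefixes):
            param.requires_grad = False
```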

You should also manually reset your learning rate scheduler because this is a new training process for the auxiliary decoder. Possible ways are:

1. Rename the latest checkpoint to `model_ckpt_steps_0.ckpt` and remove the other checkpoints from the directory (see the sketch after this list).
2. Increase the initial learning rate (if you use a scheduler that decreases the LR over training steps) so that the auxiliary decoder gets a proper learning rate.
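
For option 1, renaming and pruning the checkpoints can be done with a few lines of Python; the checkpoint directory below is a placeholder for your own experiment folder:

```python
from pathlib import Path

ckpt_dir = Path("checkpoints/my_experiment")  # placeholder path
ckpts = sorted(ckpt_dir.glob("model_ckpt_steps_*.ckpt"),
               key=lambda p: int(p.stem.rsplit("_", 1)[-1]))
for old in ckpts[:-1]:
    old.unlink()                                        # drop older checkpoints
ckpts[-1].rename(ckpt_dir / "model_ckpt_steps_0.ckpt")  # restart step counting
```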

Additionally, `max_updates` should be adjusted to ensure enough training steps for the auxiliary decoder.

Once you have finished the configuration above, you can resume training. The auxiliary decoder normally does not need many steps to train, and you can stop when you get stable results in TensorBoard. Because this step is much more complicated than the previous one, it is recommended to run some inference to verify that the model was trained properly after everything is finished.

### Add shallow diffusion to classic diffusion models

Actually, all classic DDPMs have the ability to be "shallow". If you want to add shallow diffusion functionality to a former classic diffusion model, the only thing you need to do is to train an auxiliary decoder for it.

Before you start, you should edit the configuration file to ensure that you use the same datasets, and that you do not remove or add any of the functionalities of the old model. Then you can configure the old checkpoint in your configuration file:

```yaml
finetune_enabled: true
finetune_ckpt_path: xxx.ckpt # path to your old checkpoint
finetune_ignored_params: [] # do not ignore any parameters
```

Then you can follow the instructions in STEP 2 of the [previous section](#train-auxiliary-decoder-and-diffusion-decoder-separately) to finish your training.

## Performance tuning

This section is about accelerating training and utilizing hardware.