Finish documentation for v2.2.0 release #156

Merged · 3 commits · Nov 22, 2023
Changes from 1 commit
Add detailed instructions for shallow diffusion
yqzhishen committed Nov 21, 2023
commit 76bf42b6ba1352e80d48868933bc8921e2db9fde
91 changes: 89 additions & 2 deletions docs/BestPractices.md
@@ -211,16 +211,103 @@ pe_ckpt: checkpoints/rmvpe/model.pt
[Harvest](https://github.com/mmorise/World) (Harvest: A high-performance fundamental frequency estimator from speech signals) is the recommended pitch extractor from Masanori Morise's WORLD, a free software package for high-quality speech analysis, manipulation and synthesis. It is a state-of-the-art algorithmic pitch estimator designed for speech, but it has also seen use in singing voice synthesis. It runs the slowest among the supported extractors, but produces more accurate F0 than parselmouth on clean and normal recordings.

To use Harvest, simply include the following line in your configuration file:

```yaml
pe: harvest
```

**Note:** It is also recommended to adjust the F0 detection range for Harvest in accordance with your dataset, as these values are hard boundaries for this algorithm and the defaults might not suffice for your use case. To change the F0 detection range, include or edit this part in the configuration file:

```yaml
f0_min: 65 # Minimum F0 to detect
f0_max: 800 # Maximum F0 to detect
```

## Shallow diffusion

Shallow diffusion is a mechanism that can improve quality and save inference time for diffusion models, first introduced in the original DiffSinger [paper](https://arxiv.org/abs/2105.02446). Instead of starting the diffusion process from pure Gaussian noise as classic diffusion does, shallow diffusion adds shallow Gaussian noise to a low-quality result generated by a simple network (called the auxiliary decoder), thus skipping many unnecessary steps at the beginning. By combining shallow diffusion with sampling acceleration algorithms, we can get better results at the same inference speed as before, or achieve higher inference speed without quality deterioration.
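
To make the mechanism concrete, below is a minimal PyTorch-style sketch of shallow diffusion at inference time. The helper names (`aux_decoder`, `denoise_step`, `alphas_cumprod`) and tensor shapes are illustrative assumptions, not the actual API of this repository.

```python
import torch

def shallow_diffusion_infer(cond, aux_decoder, denoise_step, alphas_cumprod, k_step):
    """Illustrative sketch: start the reverse diffusion from a noised
    auxiliary result instead of pure Gaussian noise."""
    # 1. Get a rough, low-quality mel-spectrogram from the auxiliary decoder.
    x_aux = aux_decoder(cond)                       # e.g. [B, T, M]

    # 2. Diffuse it forward to depth k_step (q(x_k | x_0) in DDPM terms)
    #    instead of sampling x_T from N(0, I).
    a_k = alphas_cumprod[k_step - 1]                # cumulative alpha at depth k
    noise = torch.randn_like(x_aux)
    x = a_k.sqrt() * x_aux + (1.0 - a_k).sqrt() * noise

    # 3. Run only the last k_step reverse steps; the earlier ones are skipped.
    for t in reversed(range(k_step)):
        x = denoise_step(x, t, cond)
    return x
```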

Currently, acoustic models in this repository support shallow diffusion. The main switch of shallow diffusion is `use_shallow_diffusion` in the configuration file, and most arguments of shallow diffusion can be adjusted under `shallow_diffusion_args`. See [Configuration Schemas](ConfigurationSchemas.md) for more details.

### Train full shallow diffusion models from scratch

To train a full shallow diffusion model from scratch, simply introduce the following settings in your configuration file:

```yaml
use_shallow_diffusion: true
K_step: 400 # adjust according to your needs
K_step_infer: 400 # should be <= K_step
```

Please note that when shallow diffusion is enabled, only the last `K_step` diffusion steps will be trained. Unlike classic diffusion models, which are trained on all steps, limiting `K_step` makes training more efficient. However, `K_step` should not be set too small, because without enough diffusion depth (steps) the low-quality auxiliary decoder results cannot be refined well. A range of 200 to 400 is appropriate for `K_step`.
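
As a rough illustration of what this restriction means during training, the diffusion timestep can simply be drawn from `[0, K_step)` instead of the full schedule. The helpers `q_sample` and `denoise_fn` below are hypothetical placeholders, not functions from this repository.

```python
import torch
import torch.nn.functional as F

def shallow_training_step(x0, cond, q_sample, denoise_fn, k_step):
    """Sketch of one training step with shallow diffusion enabled:
    only the last k_step timesteps of the schedule are ever trained."""
    b = x0.shape[0]
    t = torch.randint(0, k_step, (b,), device=x0.device)  # not the full range
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)          # forward-diffuse ground truth to step t
    pred = denoise_fn(x_t, t, cond)       # predict the injected noise
    return F.mse_loss(pred, noise)
```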

The auxiliary decoder and the diffusion decoder share the same linguistic encoder, which receives gradients from both decoders. In some experiments, it was found that gradients from the auxiliary decoder can cause a mismatch between the encoder and the diffusion decoder, leaving the latter unable to produce reasonable results. To prevent this, a configuration item called `aux_decoder_grad` applies a scale factor to the gradients coming from the auxiliary decoder during training. To adjust this factor, add the following to the configuration file:

```yaml
shallow_diffusion_args:
  aux_decoder_grad: 0.1 # should not be too high
```
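
A common way to implement such a factor, sketched here as an assumption about the mechanism rather than a quote of the actual code, is to mix the encoder output with a detached copy before feeding it to the auxiliary decoder: the forward value is unchanged, but the gradient flowing back to the encoder is multiplied by the scale.

```python
import torch

def scale_grad(x: torch.Tensor, scale: float) -> torch.Tensor:
    """Identity in the forward pass; the gradient reaching x is scaled."""
    return x * scale + x.detach() * (1.0 - scale)

# Hypothetical usage inside the acoustic model's forward pass:
# condition = encoder(...)
# aux_mel = aux_decoder(scale_grad(condition, 0.1))   # damped gradient to encoder
# diff_mel = diffusion_decoder(condition, ...)        # full gradient to encoder
```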

### Train auxiliary decoder and diffusion decoder separately

Training a full shallow diffusion model consumes more memory because the auxiliary decoder is also part of the training graph. In memory-limited situations, the two decoders can be trained separately, i.e. one decoder after the other.

**STEP 1: train the diffusion decoder**

In the first stage, the linguistic encoder and the diffusion decoder are trained together, while the auxiliary decoder is left unchanged. Edit your configuration file like this:

```yaml
use_shallow_diffusion: true # make sure the main option is turned on
shallow_diffusion_args:
  train_aux_decoder: false # exclude the auxiliary decoder from the training graph
  train_diffusion: true # train the diffusion decoder as normal
  val_gt_start: true # should be true because the auxiliary decoder is not trained yet
```

Start training until `max_updates` is reached, or until you get satisfactory results in TensorBoard.

**STEP 2: train the auxiliary decoder**

In the second stage, only the auxiliary decoder is trained; the linguistic encoder and the diffusion decoder are excluded. Edit your configuration file like this:

```yaml
shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: false # exclude the diffusion decoder from the training graph
lambda_aux_mel_loss: 1.0 # no more need to limit the auxiliary loss
```

Then you should freeze the encoder to prevent it from being updated. If the encoder changes, it no longer matches the diffusion decoder, which would again leave the latter unable to produce correct results. Edit your configuration file:

```yaml
freezing_enabled: true
frozen_params:
  - model.fs2 # the linguistic encoder
```
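
Under the hood, freezing by parameter name usually just disables gradient updates for every parameter whose qualified name matches one of the listed prefixes. A minimal sketch, assuming a PyTorch module whose linguistic encoder lives under the attribute `fs2`:

```python
import torch.nn as nn

def freeze_by_prefix(model: nn.Module, prefixes=("fs2",)):
    """Sketch: turn off gradient updates for parameters matching a prefix."""
    for name, param in model.named_parameters():
        if any(name.startswith(p) for p in prefixes):
            param.requires_grad = False
```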

You should also manually reset your learning rate scheduler because this is a new training process for the auxiliary decoder. Possible ways are:

1. Rename the latest checkpoint to `model_ckpt_steps_0.ckpt` and remove the other checkpoints from the directory (see the sketch after this list).
2. Increase the initial learning rate (if you use a scheduler that decreases the LR over training steps) so that the auxiliary decoder gets a proper learning rate.
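
For option 1, renaming and pruning the checkpoints can be done with a few lines of Python; the checkpoint directory below is a placeholder for your own experiment folder:

```python
from pathlib import Path

ckpt_dir = Path("checkpoints/my_experiment")  # placeholder path
ckpts = sorted(ckpt_dir.glob("model_ckpt_steps_*.ckpt"),
               key=lambda p: int(p.stem.rsplit("_", 1)[-1]))
for old in ckpts[:-1]:
    old.unlink()                                        # drop older checkpoints
ckpts[-1].rename(ckpt_dir / "model_ckpt_steps_0.ckpt")  # restart step counting
```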

Additionally, `max_updates` should be adjusted to ensure enough training steps for the auxiliary decoder.

Once you have finished the configuration above, you can resume training. The auxiliary decoder normally does not need many steps to train, and you can stop when you get stable results in TensorBoard. Because this step is much more complicated than the previous one, it is recommended to run some inference to verify that the model was trained properly after everything is finished.

### Add shallow diffusion to classic diffusion models

Actually, all classic DDPMs have the ability to be "shallow". If you want to add shallow diffusion functionality to a former classic diffusion model, the only thing you need to do is to train an auxiliary decoder for it.

Before you start, you should edit the configuration file to ensure that you use the same datasets, and that you do not remove or add any of the functionalities of the old model. Then you can configure the old checkpoint in your configuration file:

```yaml
finetune_enabled: true
finetune_ckpt_path: xxx.ckpt # path to your old checkpoint
finetune_ignored_params: [] # do not ignore any parameters
```

Then you can follow the instructions in STEP 2 of the [previous section](#train-auxiliary-decoder-and-diffusion-decoder-separately) to finish your training.

## Performance tuning

This section is about accelerating training and utilizing hardware.