Examples of a good fine-tune? #65

addytheyoung · 2023-11-22T20:10:59Z

addytheyoung
Nov 22, 2023

Does anyone have an example of a good fine-tuned styletts2 model?

The only one I can find is the LJSpeech model, which sounds really good! But wondering what some other narrators / speakers would sound like, especially voices more outside the training dataset. Thanks, and awesome work on this.

Shiro836 · 2023-11-24T15:35:16Z

Shiro836
Nov 24, 2023

https://www.youtube.com/watch?v=Tuz7_7q0Pr0 Trained on this interview: https://www.youtube.com/watch?v=ozOoONmJ9EQ

16 replies

GUUser91 Apr 6, 2024

@Shiro836
I get this error when I used your script.

ImportError: 'speechbrain' must be installed to use 'speechbrain/spkrec-ecapa-voxceleb' embeddings. Visit https://speechbrain.github.io/ for installation instructions.

GUUser91 Apr 26, 2024

If anyone has a graphics card with 24gb vram like the 7900 xtx or 4090, then it's possible to set the max_len settings to 280 if anyone has Linux. Just use virtual console mode and close any program eating up vram with nvtop.

GUUser91 Apr 27, 2024

@borrero-c
How were you able to do adversarial training? I followed the instructions by clicking on your link. I set the joint_epoch epoch to 48 so it can start on epoch 49 and adversarial training never started. I also followed the instructions on this link and adversarial training never started.
#227 (comment)

borrero-c Apr 27, 2024

@GUUser91 how many epochs are you training for? Are you doing two separate sessions or one?

You can train adversarial by using the diffusion trained model as the base model in the config.yml file, how are you doing it?

GUUser91 Apr 27, 2024

@borrero-c
I set the epoch to 80 for training. This is a seperate session. I'm using the diffusion checkpoint as the base model. I set joint_epoch each to 48 and then 49. Here is a screenshot of the training.

Edit:
I tinkered around with the config_ft.yml file. I set Max_Len to 120. I set batch_percentage to 1. I set slmadv_params min_len to 100 and slmadv_params max_len to 120. Batch size is set to 2. Now the DiscLM and GenLM Loss stats are no longer at 0. I'm using a rx 7900 xtx.

Here a pic from my tensorboard folder.

yl4579 · 2023-11-25T20:54:40Z

yl4579
Nov 25, 2023
Maintainer

This is the results I got using default config for 50 epochs (past joint_epoch) and one hour of data: https://vocaroo.com/1aC4vr4jErDL
If we do not run SLM adversarial training part (stop before joint_epoch) it is slightly worse: https://vocaroo.com/1hxkfwlrhowS

SLM adversarial training is the most VRAM consuming part. I don't know how to mitigate this problem and fit it in smaller machines. Maybe techniques used for LLM finetuning could help as we are working with large speech language models here.

0 replies

78Alpha · 2024-01-12T03:15:47Z

78Alpha
Jan 12, 2024

Here are the results I have from 2 different models.

Aurora: 50 Epochs with joint training after 10, 8 Hours of audio, single voice. Batch Size 2, max length 220.
Chaos: 50 Epochs with joint training after 10, 10 hours of audio, 5 Hours of British audio for accent, 5 hours of depressed voice for the emotion. Batch Size 2, max length 220.

AuroraTest1.webm
AuroraTest2.webm
ChaosTest1.webm
ChaosTest2.webm

3 replies

godspirit00 Apr 28, 2024

Hello, could you please share your config? Did SLM train during your finetuning? I was having a problem that even with joint_epoch small than epochs, SLM training did not start during the entire finetuning process. I wonder what was wrong. So it'd be great if you can share something about your finetuning experiment. Thanks.

78Alpha May 4, 2024

No joint training was done because I left batch percentage at 0.5. It needs to be at least 1 if you're using batch size 2.

godspirit00 May 4, 2024

I tried with batch percentage 1 and batch size 2, but it didn't start either.

jonathandasilvasantos · 2024-04-10T01:13:06Z

jonathandasilvasantos
Apr 10, 2024

Fine-tuning on LibriTTS using a single Brazilian Portuguese speaker involved processing approximately 24 hours of audio over 60 epochs.

Link: https://drive.google.com/file/d/1pBqHbIuuaO7jvMsnnpbjrsFAPcHZKr41/view?usp=sharing

I'm using PL-BERT multilingual.

Please, any idea why there is this annoying noise on the end of the audio clip?

Thanks!

Jonathan S. Santos

0 replies

traderpedroso · 2024-05-23T14:10:39Z

traderpedroso
May 23, 2024

Fine-tuning on LibriTTS using a single Brazilian Portuguese speaker involved processing approximately 24 hours of audio over 60 epochs.

Link: https://drive.google.com/file/d/1pBqHbIuuaO7jvMsnnpbjrsFAPcHZKr41/view?usp=sharing

I'm using PL-BERT multilingual.

Please, any idea why there is this annoying noise on the end of the audio clip?

Thanks!

Jonathan S. Santos

Acredito que por falta de um pad de silêncio de pelo menos 400ms outra coisa se os áudios estiverem maior que o length faça o cálculo dos segundos e a frequência não tentei treinar ainda em português assim que concluir os LLM vou liberar um checkpoint em português se puder compartilhar seu check point

0 replies

GUUser91 · 2024-06-03T03:28:08Z

GUUser91
Jun 3, 2024

I tinkered around with the config_ft.yml file and I discovered I can do style diffusion and SLM Adversarial Training in one session on my 4090. Batch_size is set to 2. batch_percentage is set to 1. Note this can also work on a 7900 xtx. I'm using Virtual console mode and I used nvtop to close any program eating up vram. Epoch is set to 100 because DiscLM is usually at 0. I'm using vokan as the base model. Most of the audio files I gathered had background music and noise so I used resemble-enhance (Denoising via Gradio App version, not commandline version) and the audacity plug in acon digital deverberate 3 on the audio files. Then I used the audacity plug in trim extend to add 200 milliseconds in the beginning and end of the audio files.
Violet Parr: 8 minutes of audio. I set Max_Len to 270. slmadv_params min_len is set to 270 and slmadv_params max_len is set to 270.
https://vocaroo.com/15gtNvhjeFi0
Hiccup: 7 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1b5ewbnxQ2aa
Branch: 11 minutes of audio. I set Max_Len to 256. slmadv_params min_len is set to 256 and slmadv_params max_len is set to 256. Model coudn't pronounce bouhuhuh-ned and buhuhuhuh.
https://vocaroo.com/1kwWiqOQuEvg
https://vocaroo.com/1f2epsXNHBtH
Poppy: 15 minutes of audio. I set Max_Len to 270. slmadv_params min_len is set to 270 and slmadv_params max_len is set to 270.
https://vocaroo.com/157F3j65bYLE
Arnold Shortman: 4 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1ovdUEp7rVl2
https://vocaroo.com/1j7VQQSLo4Zo
Mr. Delicious: 4 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1gJ3lSN6ut9t
Helga G. Pataki: 14 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1bGX3i15t9Wm
https://vocaroo.com/1erNAm3ZH9n9
https://vocaroo.com/1cjvt2EHB6xm
Merida: 11 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/118XYZ0Oq4t7
Judy Hopps: 21 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1jf0jlUL3kOw
Wilbur Robinson: 7 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1agReLZAIx0c
Zuko: 6 minutes of audio. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280. Model coudn't pronounce Katara.
https://vocaroo.com/1i9Sn1TvaPTr
https://vocaroo.com/1bkYnTkqhsyj
Connor: 56 seconds of audio taken from the film, Ruby Gillman, Teenage Kraken and 5 minutes of audio taken from Jaboukie Young-White (Connor's voice actor) interviews. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/17W0YDHnkgky
Mara Jade (Heidi Shannon): 2 minutes and 30 seconds of audio taken from the video game, Star Wars: Jedi Knight - Mysteries of the Sith and 5 minutes of audio taken from elevenlabs. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280.
https://vocaroo.com/1n04GhNXlon9
Luke Skywalker (Mark Hamill): 2 minutes of audio taken from a digital copy of Return of the Jedi and 7 minutes and 50 seconds taken from a 1983 interview. I set Max_Len to 280. slmadv_params min_len is set to 280 and slmadv_params max_len is set to 280. Model coudn't pronounce Coruscant.
https://vocaroo.com/13KuwZNIMhuv

Edit:
I rented out a h100 from runpod again. I edit the config_ft.yml file with micro. I set batch_size to 4 and max_len to 500. I set slmadv_params min_len to 100 and slmadv_params max_len to 500 and batch_percentage to 1. And now SLM adversarial training has started to work.

Here's a screenshot of the vram usage.

This is what I did in runpod.
I update the repo.

apt update

I install these.

apt install aria2 p7zip-full curl jq micro

I use the pwd command to find directory / filepath infomation.

pwd

I put the training dataset in a zip file and then I upload it to either https://catbox.moe/ or https://litterbox.catbox.moe/ (Which lets you upload a 1GB file)

I download the vokan base model and zip file with aria2.

aria2c -x 16 -s 16 -k 1M https://archive.org/download/epoch_2nd_00012/epoch_2nd_00012.pth

or

wget https://archive.org/download/epoch_2nd_00012/epoch_2nd_00012.pth

or the gofile downloader.

https://github.com/ltsdw/gofile-downloader.

Link for vokan model.
https://huggingface.co/ShoukanLabs/Vokan

aria2c -x 16 -s 16 -k 1M https://files.catbox.moe/XXXXXXXX.zip

I unzip the file

7z x XXXXXXXX.zip

I download the gofile upload script file.

aria2c https://raw.githubusercontent.com/Sushrut1101/GoFile-Upload/master/upload.sh

I give the script permissions

chmod +rx upload.sh

I upload the pth file to https://gofile.io/

./upload.sh model.pth

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Examples of a good fine-tune? #65

{{title}}

Replies: 6 comments 19 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Examples of a good fine-tune? #65

Replies: 6 comments · 19 replies

yl4579 Nov 25, 2023 Maintainer

Replies: 6 comments 19 replies

yl4579
Nov 25, 2023
Maintainer