Watermark model slow training (cross-posted from facebookresearch/audioseal) #484

Open
christianc102 opened this issue Aug 12, 2024 · 10 comments

christianc102 commented Aug 12, 2024

Hi!

(This was cross-posted at facebookresearch/audioseal, but I wanted to also post it here for visibility -- thanks!)

Thanks so much for the helpful training code and documentation. Apologies in advance for the naive question--I'm pretty new to machine learning.

I'm trying to train my own watermarking model at 48kHz with my own dataset on an H100 node with 8 GPUs (H100 80GB HBM3) on a remote SLURM cluster, but as I scale the batch size the training speed appears to drop proportionally. There also appears to be an unexpected behavior where I specify dataset.batch_size=k but the submitted config (logged by wandb) shows dataset.batch_size=k/8.

As an example, I ran experiments setting dataset.batch_size=8, which became dataset.batch_size=1, yielding a max training speed of about 1.67 steps/second and GPU utilization averaging around 25%. When I set dataset.batch_size=128 (which became dataset.batch_size=16), training speed dropped to around 0.3 steps/second. Based on these results, it seems to me that parallelization isn't working the way it should?

I've tried preprocessing my dataset to one-second clips and removing some of the augmentations (even running an experiment with only noise augmentations) to try to increase GPU utilization, but nothing I've tried has improved the training speed.

Is this to be expected? Roughly how long did the original AudioSeal model take to train, using what amount of compute?

Thank you so much!

@hadyelsahar

Hi! Can you paste your run command here so I can make sure you are running it correctly?

As an example, I ran experiments setting dataset.batch_size=8, which became dataset.batch_size=1, yielding a max training speed of about 1.67 steps/second and GPU utilization averaging around 25%. When I set dataset.batch_size=128 (which became dataset.batch_size=16), training speed dropped to around 0.3 steps/second. Based on these results, it seems to me that parallelization isn't working the way it should?

This seems normal to me. The batch_size you pass as an argument is the effective batch size; it is divided internally across all GPUs. If I understand correctly, it is normal for steps/sec to drop when you increase the batch size, because each step now has more samples to compute.
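For intuition, here is a minimal sketch (assumed behaviour, not the actual audiocraft code) of how an effective batch size is split across data-parallel GPUs:

```python
# Minimal sketch, assuming the command-line batch size is the effective batch
# size and each of the `world_size` GPUs gets an equal slice of it.
def per_gpu_batch_size(effective_batch_size: int, world_size: int) -> int:
    assert effective_batch_size % world_size == 0, "batch size must divide evenly across GPUs"
    return effective_batch_size // world_size

print(per_gpu_batch_size(128, 8))  # -> 16, matching the dataset.batch_size=16 logged by wandb
```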

Have you tried plotting convergence curves for the different batch sizes?

Roughly how long did the original AudioSeal model take to train

The original training took 3-10 days to obtain good results on a 4-GPU machine, but after 20-40 hours you could already see it converging.

@Comedian1926

@hadyelsahar
Hello, thank you very much for your work. Are there more details about the training? The 400k Voxpopuli dataset is too large for me. I hope to verify the watermarking effects on a smaller dataset. In fact, I have trained for about 10 epochs on a 200-hour dataset, but there is no effect. So I would like to know the minimum effective dataset size in terms of hours. Thank you again.

@hadyelsahar

but there is no effect.

It would help a lot if you could share your evaluation metrics; you can find them in the Dora log directory, in ./history.json.
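For reference, here is a minimal sketch of plotting a convergence curve from that file. It assumes history.json is a JSON list with one entry per epoch, each mapping a stage name such as "train" or "valid" to a dict of metrics; the stage and metric names below are illustrative, so adjust them to whatever your file actually contains.

```python
import json
import matplotlib.pyplot as plt

# Hedged sketch: adjust the stage/metric keys to match your history.json.
with open("history.json") as f:
    history = json.load(f)

stage, metric = "valid", "d_loss"  # illustrative keys
values = [entry[stage][metric] for entry in history
          if stage in entry and metric in entry[stage]]

plt.plot(values)
plt.xlabel("epoch")
plt.ylabel(f"{stage}/{metric}")
plt.title("Convergence curve from history.json")
plt.show()
```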

The 400k Voxpopuli dataset is too large for me.

Note that in AudioCraft an epoch is just a predefined number of steps, not a full pass over the training data; we set the default to 2000 steps. So the size of your training data basically doesn't affect the time taken per epoch, it only affects the pool of samples your training draws from.

updates_per_epoch: 2000

We don't use the full 400k hours of VoxPopuli; we select 5k hours, with which you can reach good performance in about 80-100 epochs. We let our runs go to 200-300 epochs.
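As a rough back-of-the-envelope illustration of what a fixed-size epoch means (the batch size and clip length below are example values, not the official recipe):

```python
# Audio seen per "epoch" when an epoch is a fixed number of updates.
# Example values only; plug in your own batch size and clip length.
updates_per_epoch = 2000
batch_size = 16          # effective batch size across all GPUs
segment_seconds = 1.0    # clip length fed to the model

hours_per_epoch = updates_per_epoch * batch_size * segment_seconds / 3600
print(f"{hours_per_epoch:.1f} hours of audio per epoch")  # ~8.9 hours
```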

@pierrefdz
Contributor

I think the training could be made a bit more efficient indeed, but we have not focused on it that much...

Hello, thank you very much for your work. Are there more details about the training? The 400k Voxpopuli dataset is too large for me. I hope to verify the watermarking effects on a smaller dataset. In fact, I have trained for about 10 epochs on a 200-hour dataset, but there is no effect. So I would like to know the minimum effective dataset size in terms of hours. Thank you again.

@Comedian1926, if you want to study the watermark training at a smaller scale, what you can do is focus on some augmentations and remove the compression ones -- for those, we need to transfer to CPU, save in the new format, load, and transfer back to GPU, so they take a lot of time.
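To make that cost explicit, here is a rough sketch of the round trip a compression-style augmentation implies; the encode/decode callables are hypothetical placeholders, not the actual audiocraft augmentation code.

```python
import torch

# Illustrative only: the batch leaves the GPU, is run through a CPU-bound
# codec, and is copied back, which is why these augmentations dominate step time.
def compression_augmentation(wav: torch.Tensor, encode, decode) -> torch.Tensor:
    device = wav.device
    wav_cpu = wav.detach().cpu()      # GPU -> CPU transfer
    compressed = encode(wav_cpu)      # CPU-bound codec call (e.g. mp3/aac encode)
    decoded = decode(compressed)      # CPU-bound codec call (decode back to a waveform)
    return decoded.to(device)         # CPU -> GPU transfer
```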

What we observed during training is that the detection (and localization) accuracy increases very fast, in 10 epochs or even less. For the rest of the epochs, all metrics increase at a steady rate (notably the audio quality metrics).
Here is an example of some of the validation metrics (each point here covers 10 epochs, since we computed validation metrics every 10 epochs -- so 20 means 200 epochs).
[Image: example validation metric curves]

@Comedian1926

@pierrefdz @hadyelsahar Thank you very much for your reply, it is very useful to me. In my previous training the d_loss mainly stayed between 1.98 and 2, and I feel it did not converge. I am currently restarting the training and will share the logs with you, hoping it succeeds. Thank you again for your work!

@Comedian1926

@hadyelsahar @pierrefdz
Hello, I've trained another model, but it still doesn't seem to be converging.
My hardware is two RTX 3090s.
The training data is the VoxPopuli 10k en subset.
I've also experimented with adjusting the learning rate and batch size on a single card, but it didn't yield satisfactory results.
Here are the hyperparameters and logs for the training:
history.json
hyperparams.json
spec_7 (2).pdf
I appreciate any advice you can offer. Thank you in advance.


zjcqn commented Nov 4, 2024

I found that the PESQ computation in audiocraft/solvers/watermark.py is very time-consuming, so I skipped it.
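For anyone wanting a softer option than removing it outright, here is a hedged sketch of gating the metric behind a flag; only the pesq() call comes from the pesq PyPI package, while the wrapper function and flag are illustrative, not the audiocraft code.

```python
from typing import Optional

import torch
from pesq import pesq  # pip install pesq

def maybe_pesq(ref: torch.Tensor, deg: torch.Tensor, sample_rate: int = 16000,
               enabled: bool = False) -> Optional[float]:
    # Skip the expensive, CPU-bound metric unless explicitly enabled
    # (e.g. only during validation rather than on every training step).
    if not enabled:
        return None
    # pesq() only accepts 8 kHz ("nb") or 16 kHz ("wb") mono signals,
    # so higher sample rates would have to be resampled first.
    ref_np = ref.squeeze().cpu().numpy()
    deg_np = deg.squeeze().cpu().numpy()
    return pesq(sample_rate, ref_np, deg_np, "wb")
```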


zjcqn commented Nov 8, 2024

@hadyelsahar @pierrefdz Hello, I've trained another model, but it still doesn't seem to be converging. My hardware is two RTX 3090s. The training data is the VoxPopuli 10k en subset. I've also experimented with adjusting the learning rate and batch size on a single card, but it didn't yield satisfactory results. Here are the hyperparameters and logs for the training: history.json hyperparams.json spec_7 (2).pdf I appreciate any advice you can offer. Thank you in advance.

I have encountered a similar issue where my training results are not converging. Specifically, d_loss remains close to 2.0, and wm_mc_identity is around 0.693, indicating an accuracy of only 0.5 and rendering the detector completely ineffective. Even removing all augmentations does not resolve the problem.

Has anyone found a suitable solution? I would be extremely grateful for any useful suggestions.

Contributor

pierrefdz commented Nov 8, 2024

I'd suggest first trying to make things work without any perceptual losses, and seeing if you manage to make the bit accuracy and the detection go up. Something like:

# all the defaults from compression
losses:
  adv: 0.0
  feat: 0.0
  l1: 0.0
  mel: 0.0
  msspec: 0.0
  sisnr: 0.0
  wm_detection: 1.0 # loss for the first 2 bits, cannot be 0
  wm_mb: 1.0  # loss for the rest of the bits (the wm message)
  tf_loudnessratio: 0.0

Then add the rest little by little and adapt the optimization parameters to ensure that the training is able to start.
(Sometimes the training stays frozen depending on the hyperparameters. If you see that it does not take off, you can cut the run very early.)


zjcqn commented Nov 12, 2024

I'd suggest first trying to make things work without any perceptual losses, and seeing if you manage to make the bit accuracy and the detection go up. Something like:

# all the defaults from compression
losses:
  adv: 0.0
  feat: 0.0
  l1: 0.0
  mel: 0.0
  msspec: 0.0
  sisnr: 0.0
  wm_detection: 1.0 # loss for the first 2 bits, cannot be 0
  wm_mb: 1.0  # loss for the rest of the bits (the wm message)
  tf_loudnessratio: 0.0

Then add the rest little by little and adapt the optimization parameters to ensure that the training is able to start. (Sometimes the training stays frozen depending on the hyperparameters. If you see that it does not take off, you can cut the run very early.)

Thank you for your detailed reply. Your advice is very useful in diagnosing the issue, which now seems to be resolved.
I suspect that the primary problem I faced was a lack of adequate training duration, as wm_mb_identity only started to show a noticeable decline after 25 epochs. In addition, I changed the temperature setting of wm_mb_loss to 1, which led to a substantial improvement in the convergence rate. Watermark detection is now yielding favorable results, and the multi-bit message metrics continue to improve.
I am very grateful for your assistance.
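For readers unfamiliar with the knob being discussed, here is a generic illustration of what a temperature typically does in a multi-bit message loss (not the actual audiocraft wm_mb loss): the logits are divided by the temperature before the per-bit cross-entropy.

```python
import torch
import torch.nn.functional as F

# Generic, illustrative temperature-scaled message loss: a lower temperature
# sharpens the logits; temperature=1.0 leaves them unchanged.
def message_loss(logits: torch.Tensor, message: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    # logits: (batch, n_bits) raw detector scores; message: (batch, n_bits) bits in {0, 1}
    return F.binary_cross_entropy_with_logits(logits / temperature, message.float())
```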

[attached images]
