Skip to content

Conversation

@ebezzam
Copy link
Contributor

@ebezzam ebezzam commented Jul 30, 2025

What does this PR do

  1. Fix DAC conversion:
  • Most notably, performing weight norm removal on GPU instead of on CPU (otherwise get differences for layers with weight norm when applying models on GPU)
  • Missing feature extractor parameters
  • Correctly casting sampling rate
  1. More consistent add/remove weight norm functions
  2. Update explanation of high tolerances during testing. We now know it comes from weight norm removal on CPU (instead of GPU) and different implementations of Snake1d (their version uses JIT). Nevertheless, we stick with current models on the Hub, as differences are minimal.

Reproducer to show weight norm difference when doing weight removal on a different device: https://gist.github.com/ebezzam/c83f186dcfeaab8cac040c960eb474cd

Comment on lines 404 to 409
1. Transformer model does not use weight norm for speed-up. And during model conversion, weight norm was removed on
CPU (old script: https://github.com/huggingface/transformers/blob/8e077a3e452e8cab94ef62b37d68258bd3dcffed/src/transformers/models/dac/convert_dac_checkpoint.py#L230)
This leads to slightly different weight (1e-8) and the error accumulates. Removing weight norm on GPU would produce
equivalent weights (current conversion script).
2. Original version uses Snake1D activation with JIT: https://github.com/descriptinc/descript-audio-codec/blob/c7cfc5d2647e26471dc394f95846a0830e7bec34/dac/nn/layers.py#L18
Transformer version does not use JIT, so outputs are slightly different.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated (definite) reason for high tolerances

@ebezzam ebezzam requested a review from eustlb July 30, 2025 14:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ebezzam ebezzam mentioned this pull request Jul 30, 2025
4 tasks
@ebezzam ebezzam added the Audio label Jul 30, 2025
Copy link
Contributor

@vasqu vasqu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not update the conversion if we don't change the hub. Having a legacy path is unideal and makes it confusing for the average user as hub differs from the script here.

There are two option imo:

  • Change to new conversion (no extra flags) and update hub weights
  • Only leave the description where the differences stem from

I'd prefer option 1 even if it was breaking tbh. Would wait on Eustache here tbh

@github-actions
Copy link
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: dac

@ebezzam
Copy link
Contributor Author

ebezzam commented Aug 20, 2025

thanks @vasqu!

🚨 @eustlb (when you're back), @vasqu and I spoke offline that it would be better to:

  • ask Descript to update model weights (with conversion done on GPU)
  • switch to Snake 1D with JIT

Main reason being that several models are depending on DAC (XCodec, Dia, Higgs Boson, maybe more), and it would be better that they don't depend on a model with minor output differences. As model addition/integration will be trickier since we may not be able to isolate if differences are coming from DAC or from implementing the new model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants