Add Flash Attention 2 support to Musicgen and Musicgen Melody #29939
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sanchit-gandhi left a comment
Thanks for adding this!
return self.audio_encoder.sampling_rate

@property
def _attn_implementation(self):
This method is identical to the one in the PretrainedConfig class:

def _attn_implementation(self):

Can we remove it from here?
Not if we want to keep the setter part!
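For context, a minimal sketch of why the setter matters on a composite config like Musicgen's, assuming the chosen value has to be propagated to the decoder sub-config (the class and attribute names below are illustrative, not the actual implementation):

```python
from types import SimpleNamespace


class ToyCompositeConfig:
    """Toy stand-in for a composite config such as Musicgen's (illustrative only)."""

    def __init__(self, decoder):
        self.decoder = decoder
        self._attn_implementation_internal = None

    @property
    def _attn_implementation(self):
        # fall back to eager when nothing was set explicitly
        return self._attn_implementation_internal or "eager"

    @_attn_implementation.setter
    def _attn_implementation(self, value):
        # the setter is what lets the composite config forward the choice to its
        # sub-config; dropping it would lose that propagation
        self._attn_implementation_internal = value
        self.decoder._attn_implementation = value


config = ToyCompositeConfig(decoder=SimpleNamespace(_attn_implementation="eager"))
config._attn_implementation = "flash_attention_2"
assert config.decoder._attn_implementation == "flash_attention_2"
```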
MUSICGEN_ATTENTION_CLASSES = {
    "eager": MusicgenAttention,
    "flash_attention_2": MusicgenFlashAttention2,
Worth adding sdpa in one go as well? It would let you showcase the attention implementation through sdpa on a free-tier Colab T4 GPU (where FA2 is not available).
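If SDPA is added, the mapping would presumably grow a third entry along these lines (the `MusicgenSdpaAttention` class name is an assumption, mirroring how other models name their SDPA variants):

```python
MUSICGEN_ATTENTION_CLASSES = {
    "eager": MusicgenAttention,
    "sdpa": MusicgenSdpaAttention,  # assumed class name for the SDPA variant
    "flash_attention_2": MusicgenFlashAttention2,
}
```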
return self.audio_encoder.sampling_rate

@property
def _attn_implementation(self):
Same here
    else outputs_fa.decoder_hidden_states[-1]
)

assert torch.allclose(logits_fa[1:], logits[1:], atol=4e-2, rtol=4e-2)
Good enough for a generative audio model with FA2
I've copied over the same tolerance threshold as the other models (regardless of modality), btw.
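For reference, `torch.allclose` passes when `|a - b| <= atol + rtol * |b|` holds elementwise, so `atol=rtol=4e-2` allows a few percent of drift between the fp16 eager and FA2 kernels; a toy illustration of the kind of drift it tolerates (random data, not model logits):

```python
import torch

# Simulate small numerical drift between two attention kernels and check it
# against the loose tolerance used in the test.
logits = torch.randn(4, 2048)
logits_fa = logits + 5e-3 * torch.randn_like(logits)

assert torch.allclose(logits_fa, logits, atol=4e-2, rtol=4e-2)
assert not torch.allclose(logits_fa, logits, atol=1e-5, rtol=1e-5)
```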
I've also added SDPA! cc @amyeroberts or @ArthurZucker, could you review when you have time?
ArthurZucker left a comment
LGTM! Tests are ... huge, it would be nice if you could use copied from, it would help the review 😅
self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
self._use_sdpa = config._attn_implementation == "sdpa"
let's only save self._attn_implementation please
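In other words, keep only the implementation string and branch on it where needed; a rough sketch of the suggested simplification (not the exact diff):

```python
# Store the implementation string once instead of one boolean per backend;
# the attention layer can then compare against it wherever it needs to branch.
self._attn_implementation = config._attn_implementation
```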
return attn_output, None, past_key_value
copied from can be used here as well!
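For context, transformers marks duplicated blocks with a `# Copied from` comment so that `make fix-copies` can keep them in sync with their source; roughly like this (the exact source class path below is an assumption, not taken from the diff):

```python
# Copied from transformers.models.bart.modeling_bart.BartFlashAttention2 with Bart->Musicgen
class MusicgenFlashAttention2(MusicgenAttention):
    ...
```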
ArthurZucker left a comment
Ouf! Thanks for the big PR and adding those tests!
What does this PR do?
Supersedes #27924
The attention tests all pass, but there is no integration-level equivalence between the original attention models and the FA2 ones. I don't hear any difference in quality, though, even if the generated song is not exactly the same.
cc @sanchit-gandhi and @amyeroberts, could you review please?
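For anyone wanting to try the feature once merged, usage should look roughly like the standard Musicgen example with `attn_implementation` switched to FA2 (a sketch, assuming a GPU with `flash-attn` installed; the checkpoint name is just one of the public Musicgen checkpoints):

```python
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")

inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate a few seconds of audio; decode/save with the usual audio tooling.
audio_values = model.generate(**inputs, max_new_tokens=256)
```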