Fix -1e4 as attn mask #17306
Conversation
Force-pushed from ff7b92a to a6fd049.
sgugger left a comment:
The PyTorch implementation relies on self.device, which breaks model parallelism for big model inference, so we should avoid using it (I actually removed lots of instances where we used it recently, and will hunt down the other ones in another PR ;-) ).
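To make the concern concrete, here is a minimal sketch (not the actual transformers code; the function and tensor names are illustrative): the additive mask is built on the device of an input tensor rather than on self.device, so it lands next to the data even when layers are spread across several devices.

```python
import torch

def causal_bias(input_ids: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Build the additive causal bias on the same device as the inputs,
    # not on a module-level self.device, so model-parallel placement
    # (layers living on different devices) keeps working.
    seq_len = input_ids.shape[-1]
    bias = torch.full(
        (seq_len, seq_len),
        torch.finfo(dtype).min,
        dtype=dtype,
        device=input_ids.device,
    )
    # Keep only the strict upper triangle: future positions are masked,
    # past and current positions get a bias of 0.
    return torch.triu(bias, diagonal=1)
```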
Generally, this looks good to me. I'd prefer, though, not to factor out a one-liner into a function (even if we have to add the one-liner 100+ times); it's not good for readability to have to jump to another definition. Also, I'd advocate making three separate PRs (one for PT, one for TF, one for Flax). I think that should make the PRs easier to maintain as well as to review. A first test should then be that all slow tests pass. After that, it would be nice to run some fine-tuning for the most important models (BERT on GLUE, GPT-2 on causal LM, T5 on translation maybe). It may not even be necessary to verify that everything is correct with a training run if the slow tests all pass.
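For illustration only (the helper name below is hypothetical, not something proposed in this PR), the trade-off being discussed looks roughly like this:

```python
import torch

# Option A: a shared helper -- callers must jump to its definition to see the value.
def min_dtype_value(dtype: torch.dtype) -> float:
    return torch.finfo(dtype).min

# Option B: repeat the one-liner at every call site -- more duplication,
# but the masking value is visible right where the scores are masked.
def mask_scores_inline(scores: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    return scores.masked_fill(attention_mask == 0, torch.finfo(scores.dtype).min)
```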
Hi, would this be a problem for model parallelism for big model inference?
patil-suraj left a comment:
It looks good in general. I have pretty much the same comments as Patrick. I would advocate doing some fine-tuning even if the slow tests pass, to make sure it doesn't break anything, especially with models like T5 which have had issues with attention_mask.
@ydshieh Using the weight device is perfectly fine, thanks for checking!
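As a rough illustration (the module and attribute names are made up for this sketch), "the weight device" here means reading the device off one of the model's actual parameters instead of a module-level self.device:

```python
import torch
from torch import nn

class ToyModule(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def make_position_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Take the device from one of this module's own parameters
        # ("the weight device"); it stays correct even when different
        # layers of the model are placed on different devices.
        device = self.proj.weight.device
        return torch.arange(input_ids.shape[-1], device=device).expand_as(input_ids)
```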
Cool, exciting!
Force-pushed from 40c0ce7 to 70eb792.
Hi @patrickvonplaten @patil-suraj @sgugger @LysandreJik, this PR is ready for review.
sgugger left a comment:
Thanks for fixing all of those!
patrickvonplaten left a comment:
Great work @ydshieh! Looks good to me.
Force-pushed from 797fb16 to e52f8f9.
Force-pushed from 267912c to 5dc2a3f.
What does this PR do?
Fix the issues regarding -1e4 as attention mask.
Fix #17215 #17121 #14859
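As a hedged sketch of the kind of change involved (simplified, not the exact diff in this PR): a hard-coded -1e4 additive mask value may not be "negative enough" once computations run in fp16/bf16, so the masking value is derived from the computation dtype instead.

```python
import torch

def extended_attention_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # attention_mask: [batch, seq] with 1 for tokens to attend to, 0 for padding.
    mask = attention_mask[:, None, None, :].to(dtype)   # [batch, 1, 1, seq]
    # Old pattern (sketch): (1.0 - mask) * -1e4
    # Fixed pattern: use the smallest representable value of the dtype.
    return (1.0 - mask) * torch.finfo(dtype).min

# Usage sketch: the resulting bias is added to the raw attention scores before softmax.
# bias = extended_attention_mask(attention_mask, dtype=torch.float16)
# scores = scores + bias
```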