Conversation

@ydshieh
Collaborator

@ydshieh ydshieh commented May 17, 2022

What does this PR do?

Fix the issues caused by using -1e4 as the attention mask value.

Fix #17215 #17121 #14859
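
For context, the change boils down to building additive attention masks from the dtype's own minimum value rather than a hard-coded constant. A minimal sketch of that pattern (the helper name is hypothetical; the actual diff touches many modeling files):

```python
import torch

def make_extended_attention_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Hypothetical helper illustrating the pattern: keep 0.0 for attended positions and use
    torch.finfo(dtype).min (instead of a hard-coded -1e4 / -1e9) for masked positions."""
    mask = attention_mask[:, None, None, :].to(dtype)  # (batch, 1, 1, seq_len)
    return (1.0 - mask) * torch.finfo(dtype).min

mask = torch.tensor([[1, 1, 1, 0]])
print(make_extended_attention_mask(mask, torch.float16))
```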

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 17, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh ydshieh force-pushed the no_-1e4_for_attn_mask branch from ff7b92a to a6fd049 Compare May 17, 2022 18:53
@ydshieh ydshieh changed the title [WIP] Fix -1e4 as attn mask Fix -1e4 as attn mask May 17, 2022
@ydshieh ydshieh marked this pull request as ready for review May 17, 2022 20:31
Collaborator

@sgugger sgugger left a comment


The PyTorch implementation relies on self.device, which breaks model parallelism for big model inference, so we should avoid using it (I actually removed lots of instances where we used it recently, and will hunt down the remaining ones in another PR ;-) )

@patrickvonplaten
Contributor

patrickvonplaten commented May 17, 2022

Generally, this looks good to me. I'd prefer, though, not to factor out a one-liner into a function (even if we have to add the one-liner 100+ times). It's not good for readability to have to jump to modeling_utils.py, and the code saved is not worth it for a one-liner.

Also, I'd advocate making three separate PRs (one for PT, one for TF, one for Flax). I think that would make the PRs easier both to maintain and to review.

A first test should then be that all slow tests pass. After that, it would be nice to run some fine-tuning for the most important models (BERT on GLUE, GPT-2 on causal LM, maybe T5 on translation). If the slow tests all pass, it may not even be necessary to verify everything with a training run.
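
To make the trade-off concrete, the two options being compared look roughly like this (a sketch; `mask_value` is the helper mentioned later in the thread, with an assumed signature, and `masked_scores` is a hypothetical wrapper around the inline one-liner):

```python
import torch

# Option 1: a shared helper in modeling_utils.py (roughly what an earlier revision of this
# PR added as mask_value(); the exact signature is assumed here).
def mask_value(dtype: torch.dtype) -> float:
    return torch.finfo(dtype).min

# Option 2: the inline one-liner repeated in each modeling file, which stays readable at the
# call site without a jump to modeling_utils.py.
def masked_scores(scores: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    return scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
```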

@ydshieh
Collaborator Author

ydshieh commented May 18, 2022

Hi,

@patrickvonplaten:

  • I removed the new function.
  • I have to modify FlaxT5Attention; otherwise the PT/Flax T5 equivalence tests will fail.

@sgugger:

  • Since the new function mask_value() is removed, there is no more device issue. There is one place where I need to use a tensor and a device though:

https://github.com/huggingface/transformers/blob/195ef42e0be974e8c019e9d5f03070f65365c721/src/transformers/models/gpt2/modeling_gpt2.py#L202-L205

Would this be a problem for model parallelism in big model inference? It is attn_weights.device instead of self.device, though.
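
The referenced GPT-2 lines follow roughly this pattern (a paraphrased sketch, not the exact source), where the mask value is placed on the device of `attn_weights` rather than `self.device`:

```python
import torch

def apply_causal_mask(attn_weights: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
    # The fill value is materialized as a tensor with the same dtype and device as
    # attn_weights, so there is no reliance on self.device.
    mask_value = torch.finfo(attn_weights.dtype).min
    mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
    return torch.where(causal_mask, attn_weights, mask_value)
```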

Contributor

@patil-suraj patil-suraj left a comment


It looks good in general. I have pretty much the same comments as Patrick. I would advocate doing some fine-tuning even if the slow tests pass, to make sure it doesn't break anything, especially with models like T5, which have had issues with attention_mask.

@sgugger
Collaborator

sgugger commented May 18, 2022

@ydshieh Using the weight device is perfectly fine, thanks for checking!

@LysandreJik
Member

Cool, exciting!

@ydshieh ydshieh force-pushed the no_-1e4_for_attn_mask branch 5 times, most recently from 40c0ce7 to 70eb792 Compare May 25, 2022 08:35
@ydshieh
Collaborator Author

ydshieh commented May 25, 2022

Hi, @patrickvonplaten @patil-suraj @sgugger @LysandreJik

This PR is ready for review.

  • This only deals with PyTorch models, but FlaxT5 also has to be changed to make the PT/Flax equivalence test pass.
  • In general, the change is to use torch.finfo(dtype).min (with the correct dtype) instead of -10000, -1e9, etc.; see the quick check after this list.
  • In particular, there are changes in modeling_utils.py.
  • I verified the change by training a T5 model from scratch as well as by fine-tuning the t5-small checkpoint.
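
As a quick check of why the dtype-aware minimum matters (a minimal sketch, not code from the PR):

```python
import torch

# torch.finfo(dtype).min adapts to the compute dtype, while a hard-coded constant does not:
# -1e9 is not even representable in float16 and overflows to -inf.
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    print(dtype, torch.finfo(dtype).min)

print(torch.tensor(-1e9, dtype=torch.float16))  # tensor(-inf, dtype=torch.float16)
```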

Collaborator

@sgugger sgugger left a comment


Thanks for fixing all of those!

Contributor

@patrickvonplaten patrickvonplaten left a comment


Great work @ydshieh ! Looks good to me

@ydshieh ydshieh force-pushed the no_-1e4_for_attn_mask branch from 797fb16 to e52f8f9 Compare May 26, 2022 08:33
@ydshieh ydshieh force-pushed the no_-1e4_for_attn_mask branch from 267912c to 5dc2a3f Compare June 20, 2022 09:18
@ydshieh ydshieh merged commit d3cb288 into huggingface:main Jun 20, 2022
Successfully merging this pull request may close these issues.

-1e9 constants in T5 implementation
