Fix T5 incorrect weight decay in Trainer and official summarization example #18002
Conversation
The documentation is not available anymore as the PR was closed or merged.
sgugger
left a comment
Thanks for your PR, I've added a few comments.
We should keep both here, since the user might use other models.
src/transformers/trainer.py
Outdated
This import is not something we really want to add, as the Trainer shouldn't depend on individual model files. Instead, T5LayerNorm should subclass torch.nn.LayerNorm (it rewrites the init and the forward anyway).
Thanks for your comments! It seems that none of the existing layer norm variants are written as subclasses of nn.LayerNorm. I'm not sure if something like this is right:
```python
class T5LayerNorm(nn.LayerNorm):
    def __init__(self, hidden_size, eps):
        """
        Construct a layernorm module in the T5 style. No bias and no subtraction of mean.
        """
        super().__init__(hidden_size, eps, elementwise_affine=False)
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps
        self.reset_parameters()

    def forward(self, hidden_states):
        # T5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean
        # Square Layer Normalization https://arxiv.org/abs/1910.07467, thus variance is calculated
        # w/o mean and there is no bias. Additionally we want to make sure that the accumulation for
        # half-precision inputs is done in fp32.
        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)

        # convert into half-precision if necessary
        if self.weight.dtype in [torch.float16, torch.bfloat16]:
            hidden_states = hidden_states.to(self.weight.dtype)

        return self.weight * hidden_states

    def reset_parameters(self):
        if self.weight is not None:
            nn.init.ones_(self.weight)
```
LongT5LayerNorm is copied from T5LayerNorm, so it would need the same change.
This would create unused bias weights for every layernorm in T5 at the moment; I don't think we want this. @sgugger, I'm actually not really in favor of adding this abstraction to the T5LayerNorm class.
Only if the init calls the superclass. If it calls the super of the superclass (nn.Module), nothing breaks. We just can't have the Trainer start depending on multiple modeling files because there are 100 flavors of LayerNorm.
An alternative would be to have a constant in pt_utils to which each submodule that defines a specific LayerNorm adds its own; would that work better?
So here, after defining LongT5LayerNorm, we would have:
ALL_LAYERNORM_LAYERS.append(LongT5LayerNorm)
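(For concreteness, a minimal sketch of how such a shared registry could be consumed to decide which parameters get weight decay. The helper name and the module location below are assumptions for illustration, not the final implementation.)

```python
import torch.nn as nn

# proposed shared constant, e.g. in transformers/pytorch_utils.py (assumed location)
ALL_LAYERNORM_LAYERS = [nn.LayerNorm]

# each modeling file appends its own variant after defining it, e.g.
# ALL_LAYERNORM_LAYERS.append(T5LayerNorm)
# ALL_LAYERNORM_LAYERS.append(LongT5LayerNorm)


def get_decay_parameter_names(model: nn.Module) -> list:
    """Hypothetical helper: names of parameters that should receive weight decay,
    i.e. everything except biases and parameters of registered LayerNorm types."""
    no_decay = set()
    for module_name, module in model.named_modules():
        if isinstance(module, tuple(ALL_LAYERNORM_LAYERS)):
            prefix = f"{module_name}." if module_name else ""
            no_decay.update(prefix + param_name for param_name, _ in module.named_parameters())
    return [
        name
        for name, _ in model.named_parameters()
        if name not in no_decay and not name.endswith("bias")
    ]
```

The Trainer would then put only these parameters in the weight-decay optimizer group and give everything else weight_decay=0.0.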
@ADAning Could you amend your PR to use this solution maybe?
Does pt_utils mean transformers.utils?
No, transformers.pytorch_utils. Sorry, I misspelled it.
Actually, this is my first time using git... I don't know why other people's commits showed up after I fetched and rebased from upstream; it really bothered me. So I reset my branch to the commit from when I created this PR, and then the PR got closed, and my commit can't be seen there anymore. Could you re-open this PR? @sgugger
I'm sorry for my mistake.
sgugger
left a comment
Just re-opened your PR! In general you only need to rebase if the PR has been open for a long time (a month) or if there are some critical changes on the main branch that you need, so don't bother for small changes :-)
return self.weight * hidden_states
...
ALL_LAYERNORM_LAYERS.append(T5LayerNorm)
This should be after the try/except block below, which may change T5LayerNorm.
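(For context, a rough paraphrase of the pattern being referred to, not the exact contents of the modeling file: if an optional fused-kernel import rebinds T5LayerNorm, appending before the try/except would register the old class rather than the one the model actually instantiates.)

```python
# Assumes T5LayerNorm and ALL_LAYERNORM_LAYERS are already defined/imported above.
try:
    # optional fused RMSNorm kernel; the exact import is paraphrased here
    from apex.normalization import FusedRMSNorm

    T5LayerNorm = FusedRMSNorm  # rebinds the name
except ImportError:
    pass  # fall back to the pure-PyTorch T5LayerNorm defined above

# register only after the name is final, so the registry contains the class
# that instances of the model will actually be built from
ALL_LAYERNORM_LAYERS.append(T5LayerNorm)
```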
return self.weight * hidden_states
...
ALL_LAYERNORM_LAYERS.append(LongT5LayerNorm)
This should be after the try/except block below, which can change LongT5LayerNorm.
Thanks a lot for iterating with us!
patrickvonplaten
left a comment
Thanks a lot for iterating on this and sorry to reply so late here - the final solution looks very nice!
Fix T5 incorrect weight decay in Trainer and official summarization example (huggingface#18002)
* Add ALL_LAYERNORM_LAYERS for LayerNorm
* fix bug of appending layer norm
What does this PR do?
The official summarization examples use T5 as the pretrained model, but the T5 layer norm parameters are named layer_norm, not LayerNorm. In the official summarization example, the no_decay list only matches LayerNorm.weight, so T5's layer_norm weights are not excluded from weight decay. A similar problem occurs in trainer.py, which may cause Seq2SeqTrainer to train the T5 layer norms with weight decay, because T5 uses T5LayerNorm rather than nn.LayerNorm.
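To make the failure mode concrete, here is a small sketch (paraphrased, not a verbatim copy of the example script) of the name-based grouping the example uses and why T5's layer_norm weights slip into the decay group:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Name-based filter used by the example scripts (paraphrased):
no_decay = ["bias", "LayerNorm.weight"]
decay_params = [
    n for n, _ in model.named_parameters() if not any(nd in n for nd in no_decay)
]

# T5 names its norm parameters "...layer_norm.weight", so they do not match
# "LayerNorm.weight" and wrongly end up in the weight-decay group:
print([n for n in decay_params if "layer_norm" in n][:3])
# e.g. ['encoder.block.0.layer.0.layer_norm.weight', ...]
```

Matching on module type instead of parameter name avoids the problem, which is what registering T5LayerNorm in ALL_LAYERNORM_LAYERS enables: the Trainer can then exclude every registered LayerNorm flavor regardless of how its parameters are named.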
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.