Fix value_dim in TransformerDecoder's cross-attn layer
#667
Conversation
Oops, accidentally removed review request for @mattdangerw. Adding it back.
jbischof
left a comment
Thanks for catching this!
mattdangerw
left a comment
Fix looks good! I'm unclear why we need the testing change, though. Is it actually changing the test in any way?
-    intermediate_dim=4, num_heads=2
+    intermediate_dim=4,
+    num_heads=2,
+    has_cross_attention=True,
Hmm, why do we need this actually? Won't this line at the start of `call`, `has_encoder_sequence = encoder_sequence is not None`, mean the layer will be built with cross attention as soon as the decoder is called on two inputs?
Sorry 🤦🏼, not needed. Changing it back.
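For context, a minimal sketch (not part of this PR) of the lazy-build behavior described above, assuming the public `keras_nlp.layers.TransformerDecoder` API and arbitrary example shapes; passing an `encoder_sequence` in the call is what triggers building the cross-attention sublayer:

```python
import numpy as np
import keras_nlp

# Arbitrary example shapes, chosen for illustration only.
decoder_sequence = np.random.uniform(size=(2, 10, 64)).astype("float32")
encoder_sequence = np.random.uniform(size=(2, 12, 64)).astype("float32")

decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=4, num_heads=2)

# Calling the layer with both inputs builds it with cross-attention,
# so no extra constructor flag is needed in the test.
outputs = decoder(decoder_sequence, encoder_sequence)
print(outputs.shape)  # (2, 10, 64)
```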
mattdangerw
left a comment
Oops actually marked this as "changes requested" until we figure out the testing bit.
This bug cropped up when I was implementing `BartBackbone`: #661. Instead of passing `value_dim = hidden_dim` in the cross-attention layer, we should pass `head_dim`.

Let's look at the `TransformerDecoderBlock` layer given in the `tensorflow/models` repo. `value_dim` is not passed to the `keras.layers.MultiHeadAttention` layer, which means that `value_dim = key_dim = head_dim`.

Intuitively, if we pass `value_dim` as `hidden_dim = 768`, with `num_heads = 12`, the weight matrix for value will be of shape `(768, 12, 768)`. This is incorrect. The shape should be `(768, 12, 64)`.
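To make the shape argument concrete, here is a small sketch (not from the PR) that builds `keras.layers.MultiHeadAttention` both ways and inspects the value projection kernel via the private `_value_dense` attribute (an implementation detail that may change across Keras versions); the `768`/`12` numbers mirror the description above:

```python
import numpy as np
from tensorflow import keras

hidden_dim, num_heads = 768, 12
head_dim = hidden_dim // num_heads  # 64

x = np.zeros((1, 16, hidden_dim), dtype="float32")

# Buggy setup: value_dim explicitly set to hidden_dim.
buggy = keras.layers.MultiHeadAttention(
    num_heads=num_heads, key_dim=head_dim, value_dim=hidden_dim
)
buggy(x, x)  # call once so the layer builds its weights
print(buggy._value_dense.kernel.shape)  # (768, 12, 768)

# Fixed setup: value_dim left unset, so it defaults to key_dim (= head_dim).
fixed = keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_dim)
fixed(x, x)
print(fixed._value_dense.kernel.shape)  # (768, 12, 64)
```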