ReLU in residual connections? #7
Comments
Thanks for bringing up this issue. GTrXL is still a work in progress. We will investigate this detail about the ReLU.
I'm closing this issue for now. We decided not to add the ReLU activations, and we don't have the time to investigate this further right now.
Hi, many thanks for your feedback and the very interesting results. Very much appreciated. I might look into it in my own research, but as far as I can see, it does not seem to have much effect (apart from adding a little more computation to the model).
Hi,
I am using part of your code for a transformer architecture I need for my master's thesis research in RL. I noticed that in the original paper (Parisotto et al., 2019) the authors reorder the LayerNorms, placing them at the input of both the multi-head attention and the feed-forward sub-modules. I saw that you also implement this in your code, via the config["layer_norm"] setting. But the paper also says, and I quote: "Because the layer norm reordering causes a path where two linear layers are applied in sequence, we apply a ReLU activation to each sub-module output before the residual connection (see Appendix C for equations)." Indeed, in those equations they apply a ReLU to the output of both the multi-head attention and the feed-forward sub-modules before performing the residual connection. I did not see that step in your code (just the standard residual connection), so I wonder whether there is a particular reason for that, or whether I am missing something (I'm still quite new to these implementations). In any case, congratulations on your great work; it is helping me a lot to understand the inner workings of such architectures. Thanks!
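To make the question concrete, here is a minimal sketch of the wiring I understand from that quote. It is plain Python for illustration only, not the code from this repository or from the paper. The names pre_ln_block, layer_norm, and relu_before_residual are hypothetical, the attention and feed-forward sub-modules are stand-in callables acting on a single vector, and memory, gating, and the learned LayerNorm parameters are left out.

```python
import math
from typing import Callable, List

Vector = List[float]
SubModule = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def relu(x: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(v, 0.0) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used for the residual connections."""
    return [u + v for u, v in zip(a, b)]


def pre_ln_block(
    x: Vector,
    attention: SubModule,      # stand-in for the multi-head attention sub-module
    feed_forward: SubModule,   # stand-in for the position-wise feed-forward sub-module
    relu_before_residual: bool = True,  # hypothetical switch, not a setting of this repository
) -> Vector:
    """One block with LayerNorm at the input of each sub-module.

    With relu_before_residual=True, each sub-module output passes through a
    ReLU before it is added to the residual stream, as the quoted sentence
    describes. With False, this is the standard residual connection.
    """
    def act(v: Vector) -> Vector:
        return relu(v) if relu_before_residual else v

    # Attention sub-module: normalize, transform, (optionally) ReLU, then add.
    y = add(x, act(attention(layer_norm(x))))
    # Feed-forward sub-module: same pattern applied to the intermediate result.
    return add(y, act(feed_forward(layer_norm(y))))
```

One consequence of this wiring is that a sub-module can only add non-negative values to the residual stream when the ReLU is enabled. If both stand-ins return only negative values, the block passes its input through unchanged.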