
ReLU in residual connections? #7

Closed
ibagur opened this issue Feb 8, 2023 · 4 comments

Comments


ibagur commented Feb 8, 2023

Hi,

I am using part of your code for a particular implementation of a transformer architecture that I need for my master's thesis research in RL. I noticed in the original paper (Parisotto et al., 2019) that the authors re-order the LayerNorms so that they are placed at the input of both the multi-head attention and the feed-forward sub-modules. I saw that you also implement this in your code via the config["layer_norm"] setting. However, the paper also mentions, I quote: "Because the layer norm reordering causes a path where two linear layers are applied in sequence, we apply a ReLU activation to each sub-module output before the residual connection (see Appendix C for equations)." Indeed, in those equations a ReLU is applied to the output of both the multi-head attention and the feed-forward sub-modules before the residual connection is performed.

I did not see that specific step in your code (just the standard residual connection), so I wonder whether there is a particular reason for that, or maybe I am missing something (I am still quite a novice with these implementations). In any case, congratulations on your great work; it is helping me a lot to understand the inner workings of such architectures. Thanks!
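For reference, the reordering described in that quote amounts to roughly the following structure (a paraphrase with simplified notation, not the paper's exact Appendix C equations):

$$\bar{Y} = X + \mathrm{ReLU}\big(\mathrm{MultiHeadAttention}(\mathrm{LayerNorm}(X))\big)$$

$$Y = \bar{Y} + \mathrm{ReLU}\big(\mathrm{FeedForward}(\mathrm{LayerNorm}(\bar{Y}))\big)$$

That is, the LayerNorm moves to each sub-module's input, and a ReLU is applied to each sub-module's output just before it is added back onto the residual stream.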

@MarcoMeter (Owner)

Thanks for bringing up this issue. GTrXL is still a work in progress; we will look into this ReLU detail.


MarcoMeter commented Feb 13, 2023

I took some time to train TrXL with pre layer norm plus the missing ReLU activations on Mortar Mayhem Grid and Mystery Path Grid. I did not observe any gains in performance.

# Multi-head attention sub-module
# GRU Gate or skip connection
if self.use_gtrxl:
    # Forward GRU gating
    h = self.gate1(query, attention)
else:
    if self.layer_norm == "pre":
        # ReLU on the sub-module output before the residual connection
        attention = F.relu(attention)
    # Skip connection
    h = attention + query

# Feed-forward sub-module
# GRU Gate or skip connection
if self.use_gtrxl:
    # Forward GRU gating
    out = self.gate2(h, forward)
else:
    if self.layer_norm == "pre":
        # ReLU on the sub-module output before the residual connection
        forward = F.relu(forward)
    # Skip connection
    out = forward + h
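For anyone reading along, here is a rough, self-contained sketch of how the non-gated (`use_gtrxl = False`) branch above sits inside a full sub-module forward pass. The surrounding attribute names (`norm_attn`, `norm_ff`, `fc`) and the module itself are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreLNBlockSketch(nn.Module):
    """Illustrative transformer block: LayerNorm at each sub-module input
    ("pre") and a ReLU on each sub-module output before the residual.
    (A "post" layer-norm variant, which normalizes after the residual,
    is omitted here for brevity.)"""

    def __init__(self, dim: int, num_heads: int, ff_dim: int):
        super().__init__()
        self.layer_norm = "pre"
        self.norm_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ff = nn.LayerNorm(dim)
        self.fc = nn.Sequential(nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # Multi-head attention sub-module
        x = self.norm_attn(query) if self.layer_norm == "pre" else query
        attention, _ = self.attn(x, x, x)
        if self.layer_norm == "pre":
            attention = F.relu(attention)
        h = attention + query  # skip connection

        # Feed-forward sub-module
        y = self.norm_ff(h) if self.layer_norm == "pre" else h
        forward = self.fc(y)
        if self.layer_norm == "pre":
            forward = F.relu(forward)
        out = forward + h  # skip connection
        return out


# Quick shape check
block = PreLNBlockSketch(dim=64, num_heads=4, ff_dim=256)
print(block(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```

With `layer_norm` set to anything other than `"pre"`, the ReLU is skipped and the snippet reduces to the standard residual connection.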

Here is a rough plot of the training performance in Mystery Path Grid. Each experiment was run with 4 different seeds.
[Plot: mpg_trxl_gtrxl — training performance in Mystery Path Grid]

@MarcoMeter (Owner)

I'm closing this issue for now. We decided not to add the ReLU activations, and we don't have the time to investigate this further right now.


ibagur commented Feb 15, 2023

Hi,

Many thanks for your feedback and the very interesting results, much appreciated. I might take a closer look at this in my own research, but from what I can see it does not seem to have much of an effect (apart from adding a bit more computation to the model).
