Extending Self Attention #391
It would definitely be better to implement this using the RETURNN layers instead of extending the existing self-attention layer. From my perspective, the problems you are mentioning might not be solvable without extending some of the other layers. If you want to access computed layer outputs from previous recurrent steps, you can use the … I am not sure if there is another approach currently existing. Extending the …
Thanks for mentioning this. This is what I keep saying, and also explain in the tutorial, in the introduction video, and in many other places. This is why we definitely should not make it even more complex. I would even go as far as saying we should deprecate the existing self-attention layer. This is what you should follow in principle. (If you have not seen the tutorial or introduction video, please check them, to know the principles of RETURNN.) Layers should be as simple as possible, reduced mostly to a single operation (e.g. wrapping a single TF low-level function), such that you can basically define anything in your config by using the layers as building blocks. Also, such basic building blocks should then be useful for most other people as well, as they should be very generic. This should mostly be straightforward, except for automatic optimization, i.e. you should think about the two cases: where the layer is inside a recurrent loop (at decoding time, or maybe if you have other recurrence in the model), and where it can be calculated in parallel (independent of the recurrent loop). I haven't really thought about how to split up the existing self-attention layer. You could also implement another independent layer, … One principal rule of code in RETURNN is that it should be useful for many people, and would probably immediately be used by many people, now and in the future. For generic basic building blocks, this is usually always the case. For sth like …, this is less clear. So, to summarize: …
It seems that it is a better idea to write a new layer instead of hacking something into the existing layer.
Thanks for all of your comments!
That's a good idea - that's the main part that's missing. The idea of that layer would be then: …
Right. There is a very similar/related …
Yes. So let's be more specific. In the loop, it assumes an input of shape [B,D], and the output in the first frame would be [B,1,D], in the second frame [B,2,D], and so on, i.e. in general [B,i,D]. This assumes an initial state of [B,0,D], but that could be configured. It also means that when you now accumulate this frame-by-frame (for i=1...T), i.e. get the accumulated values from outside of the loop, you would get an output of shape [T,B,T,D], which has the usual mask on the second T axis.
This is not so clear. To be analogous to the within-loop case, it should expect an input [T,B,D] (or whatever axis order), and then just produce the same output [T,B,T,D], with the usual mask on the second T axis.
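To make the shape behaviour concrete, here is a small illustrative sketch. The layer name "cum_concat" and the input layer name "value" are only placeholders for the hypothetical accumulating layer being designed in this thread, not an existing API at this point:

```python
# Hypothetical accumulating layer as discussed above (names are placeholders).
# Inside the rec loop, per input frame of shape [B,D]:
#   frame 1 -> output [B,1,D]
#   frame 2 -> output [B,2,D]
#   frame i -> output [B,i,D]   (initial state: [B,0,D])
# Accumulated over the whole loop, i.e. as seen from outside the loop:
#   [T,B,T,D], with the usual sequence mask on the second T axis
#   (frame i only sees frames 1..i).
accumulate_layer = {"class": "cum_concat", "from": "value"}
```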
To add on the outside-loop case: however, returning [T,B,T,D] there is unnecessarily inefficient. You are right that, outside the loop, it could also just return [T,B,D]. But this has two problems: …
And there needs to be a custom layer which applies the mask only if outside the loop. But I wonder if this is really intuitive. What would this be called? We should maybe draft a net dict of how that would look, analogous to the …
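A hedged reconstruction of how such a draft net dict might look, for masked self-attention inside a rec unit. The "cum_concat" layer is the hypothetical accumulating layer from above, and the axis descriptors and `dot`/`softmax_over_spatial` options written here are assumptions on my side, not settled API:

```python
key_dim, value_dim = 64, 64
masked_self_att_unit = {
    "q": {"class": "linear", "from": "data:source", "with_bias": False, "n_out": key_dim},
    "k": {"class": "linear", "from": "data:source", "with_bias": False, "n_out": key_dim},
    "v": {"class": "linear", "from": "data:source", "with_bias": False, "n_out": value_dim},
    # Hypothetical layer: accumulate keys/values over the rec history.
    # In the loop: [B,D] -> [B,i,D]; moved out of the loop: [T,B,T',D].
    "k_accum": {"class": "cum_concat", "from": "k"},
    "v_accum": {"class": "cum_concat", "from": "v"},
    # Compare the current query against all accumulated keys.
    "energy": {"class": "dot", "from": ["q", "k_accum"],
               # reduce over the key/feature dim; the exact var/reduce axis
               # spec is exactly the part noted below as needing extra care
               "red1": "static:-1", "red2": "static:-1"},
    # Softmax over the accumulated ("history") axis. If this ends up outside
    # the loop, the separate masking layer discussed above is also needed.
    "att_weights": {"class": "softmax_over_spatial", "from": "energy",
                    "axis": "stag:history",
                    "energy_factor": key_dim ** -0.5},
    # Weighted sum over the accumulated values.
    "att": {"class": "dot", "from": ["att_weights", "v_accum"],
            "red1": "stag:history", "red2": "stag:history"},
}
```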
This is only a draft. There are a couple of details which probably need extra care. It misses: …
Incomplete also: in the last dot layer for …, we cannot use … Or maybe instead of using …
I think there is sth wrong in the … Another idea: … The shapes would become: …
However, the …
This would squeeze away the T' axis in the outside-the-loop case, as it would be part of the remaining axes and it assumes them to match. Not sure if it does it all correctly, or complains. Also not sure if it should do this implicitly like that, or whether we should make it more explicit. Then for the …
T'' would be like T, but a separate dim tag. Also the final …
That does not work. The axes to be reduced must match, but they don't. Maybe there is some way to overcome this. Or maybe this idea was just not good, and the first approach is better.
Right now, I think the first proposal (just return [T,B,T,D]) … As you said, it makes everything simpler if the … I am unsure whether a …
and let the … Or instead, …
Yes, I think I agree. The new tag would be sth with …
Let's not make it too complicated. Let's first think/draft one possibility, and only implement that.
Yes, this is really necessary. Also that a layer like …
Yes. But not sure what would be better. Remember, I don't like that layers become too complex, with more and more options added. That is our problem with the current self-attention layer.
This is the wrong way around, right? The axis for the softmax is …
Why two? It only needs to check for such a "history" dim tag, which you could maybe explicitly specify by … I'm a bit afraid that this could potentially hide bugs, e.g. if you have a typo there for the history axis, you would not get any error, it would silently just ignore this. This is bad. Any idea about this? Maybe if outside a loop, the history axis must exist, and if inside the loop, it will ignore it. (This is not perfectly generic, though. E.g. what if you have a loop in a loop? Etc...)
Yes sounds good.
Okay sure. Would we still want the layer to have an attribute …?
Yes okay, you do have a point. Then I would go for the additional masking layer I think.
Actually I think it is correct: …
I wanted it to be less specific to the …
Good point.
Hm, maybe. I'm not sure. The name is bad. Also, what would be the default? Or would this be a required argument? (If it has a default, we can also just leave it away.) Also, we speak about these two cases, inside loop / outside loop, but these are not really the only cases. It could be inside a loop, inside another loop. It would be good to abstract away a bit from that. Also, in an optimal world, the user should not really need to think about this, and just implement it in a way as it would be without automatic optimization, i.e. always inside the loop. Actually, what are the other arguments of …? I cannot really come up with a good name or other conceptual way to configure something like …
Ok. Btw, the naming for the current rec layer frame axis (":i" suggested above), and the naming of …
Btw, in the …
I think I/we are getting confused now about the naming of "history" here. But you are right, it depends on what the "history" axis refers to. If this refers to the dim tag by …
The … But we can solve that by adding a special flag/marker on the … However, then the … If you instead have two axes, and do the no-op mode when one of the axes does not exist, this can potentially be problematic. E.g. just a typo could lead to the case that the axis never exists, and it is always a no-op.
I think that either …
I agree that the axis should be configurable. But phew, I really can't tell what would be more intuitive :D
Don't we run into issues if the …? So if I understood you correctly, you want to call the axis generated by the … Maybe we could also postpone these …
So you think we should have this option on …?
The question about the …
Yes. So concluding from that, it should have the same name in both cases. So "rec-history" (or so) then in both cases. This is about the name of the new dim tag. It still needs to be distinguished for the other logic (e.g. in …).
Yes exactly.
I don't like that. We always should try to avoid error-prone code. And why not do it directly in a good way? We already have a possible solution, as outlined here.
They are widely used, but adding new attributes will not have any effect on existing code, so that shouldn't be a problem. And they would be optional, i.e. … The only problem I see with this is whether these two attribs/flags are maybe very specific to this specific case here, and will not be used in any other case. This is maybe a bit ugly. But maybe this is not too much of a problem.
Okay, then we simply only support the "rec-history" case.
Yeah okay, then let's do it this way.
Okay okay, I guess so, fine. Then everything is more or less clear now, right? I would start by using the names we chose here (i.e. "rec-history", etc.).
Yes.
I think so. At least the draft looks fine and like this could work.
Yea, I assume there will be subtle problems. But maybe not. We will see.
Use …
I pushed an initial draft of the new layer,
yielding the templates (without optimizations): …
and with optimizations: …
Obviously, the masking is still missing. So far I have come across some issues: …
It would be good if somebody could look over what I did :D
Some comments: …
Thanks for your comments!
Seems to have worked, thanks
Ah, okay of course we can do that then.
Ahh oh, I didn't think about that.
I agree.
I am not sure whether I understand you correctly. What exactly should we create in …?
Thanks okay, that does clear up some things! :)
Maybe we can also add an attribute with the rec time axis to … Somewhat related to this, in general I think we need some additional way of accessing time axes: …
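For context, a few of the axis descriptors that RETURNN layer options already accept (an illustrative list from my side, not exhaustive):

```python
axis_descriptor_examples = [
    "T",             # the (default) time/spatial axis
    "B",             # the batch axis
    "F",             # the feature axis
    "static:-1",     # the last static (non-dynamic) axis
    "stag:history",  # a spatial axis matched via its dim-tag description (substring match)
]
```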
From a first glance, it seems correct. Handling this can be tricky sometimes.
Yes, this would not be nice. This should be decoupled. I.e. the …
Yes exactly.
I would rather make this handled directly via …
You can create your own … But to the specific question/case: couldn't you just use …? Arguably, it's maybe a bit non-intuitive to use … Btw, we might need another attribute for …
Okay, so we essentially add the … Maybe we run into issues when other code copies and modifies … Alternatively, we could also stick to a simpler solution instead, like e.g. storing both the …
Yes, I think this is kind of out of the scope of this issue related to self attention :D
Ah true, one could do that. I wouldn't use that in a reference self attention implementation though.
True! It would also make it (apart from the masking layer) exactly analogous to the decoder self-attention, which I think is good. However, the … But there are also other issues in the implementation if we do that, e.g. our plan was to copy the …
Maybe yes. But if I understood correctly, that is somewhat impossible to implement 100% correctly right now, because we should not (but must) create a new tensor for this.
This is for generalized self attention (#391). Co-authored-by: Frithjof <[email protected]>
This is for generalized self attention (#391). Fixes #391. Co-authored-by: Frithjof <[email protected]>
With …, the test case is this code:
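The exact test code is not reproduced above; the following is a hedged reconstruction of such a test network, based on the `CumConcatLayer` that was merged for this issue. Option names (e.g. `out_spatial_dim`, the `dot` layer's `reduce` argument) changed across RETURNN versions and are assumptions on my side:

```python
from returnn.tf.util.data import DimensionTag

key_dim, value_dim = 64, 64
kv_time = DimensionTag(kind=DimensionTag.Types.Spatial, description="self-att-kv-time")

network = {
    "output": {"class": "rec", "from": "data", "unit": {
        "q": {"class": "linear", "from": "data:source", "with_bias": False, "n_out": key_dim},
        "k": {"class": "linear", "from": "data:source", "with_bias": False, "n_out": key_dim},
        "v": {"class": "linear", "from": "data:source", "with_bias": False, "n_out": value_dim},
        # Accumulate keys/values over all frames seen so far. Inside the loop this
        # grows frame by frame; when optimized out of the loop it is computed at
        # once, with the masking handled as discussed above.
        "k_accum": {"class": "cum_concat", "from": "k", "out_spatial_dim": kv_time},
        "v_accum": {"class": "cum_concat", "from": "v", "out_spatial_dim": kv_time},
        "energy": {"class": "dot", "from": ["q", "k_accum"], "reduce": "static:-1"},
        "att_weights": {"class": "softmax_over_spatial", "from": "energy",
                        "axis": kv_time, "energy_factor": key_dim ** -0.5},
        "att": {"class": "dot", "from": ["att_weights", "v_accum"], "reduce": kv_time},
        "output": {"class": "copy", "from": "att"},
    }},
}
```

With automatic optimization, these layers can then be moved out of the loop and computed over the whole sequence at once.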
I would suggest leaving this closed now. Any further issues, or missing functionality, should be discussed in new separate issues.
@Zettelkasten Can you check whether you have everything you need to implement the basic self-attention (well, basically that is already the test case, but maybe extend it), also positional encoding, relative positional encoding, LSH, and the other things you need?
Cool, awesome, many thanks for all the work and thought you put into this!
E.g. for generalized self attention (#391, #545). Co-authored-by: Frithjof <[email protected]>
For generalized non-rec self attention (#391).
For reference, an example for generalized self-attention (non-recursive) (via #656):
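The original example is not reproduced above; the following is a hedged sketch of how generalized non-recursive self-attention can look, assuming a `reinterpret_data` layer with `set_dim_tags` to give the keys/values their own time dim tag (whether this matches the exact example referenced via #656, I cannot say):

```python
from returnn.tf.util.data import DimensionTag

key_dim, value_dim = 64, 64
kv_time = DimensionTag(kind=DimensionTag.Types.Spatial, description="self-att-kv-time")

network = {
    "q": {"class": "linear", "from": "data", "with_bias": False, "n_out": key_dim},
    "k": {"class": "linear", "from": "data", "with_bias": False, "n_out": key_dim},
    "v": {"class": "linear", "from": "data", "with_bias": False, "n_out": value_dim},
    # Give keys/values a separate time dim tag, so that the two time axes
    # (queries vs. keys) can be distinguished in the layers below.
    "k_re": {"class": "reinterpret_data", "from": "k", "set_dim_tags": {"T": kv_time}},
    "v_re": {"class": "reinterpret_data", "from": "v", "set_dim_tags": {"T": kv_time}},
    "energy": {"class": "dot", "from": ["q", "k_re"], "reduce": "static:-1"},
    "att_weights": {"class": "softmax_over_spatial", "from": "energy",
                    "axis": kv_time, "energy_factor": key_dim ** -0.5},
    "att": {"class": "dot", "from": ["att_weights", "v_re"], "reduce": kv_time},
    "output": {"class": "copy", "from": "att"},
}
```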
The original issue description, for reference:
I want to implement some changes to the self-attention used in the Transformer for MT, namely implementing locality-sensitive hashing (https://arxiv.org/pdf/2001.04451.pdf).
Right now, self-attention is a single layer within RETURNN. While this is very convenient when using the default configuration, it is not very extensible: all options for it have been implemented as additional arguments of the layer, and the code for it has become pretty messy over time.
I could implement my changes within the layer by adding an additional parameter, but I think it might be better not to clutter the self-attention layer with even more (relatively specific) parameters.
Instead it might be nicer to implement them using other existing RETURNN layers, similar to how encoder-decoder attention is implemented in our Trafo config.
For unmasked self-attention (where one can attend to the entire sequence, e.g. used in the encoder), I don't see an issue in implementing it completely analogously to the encoder-decoder attention: use three `linear` layers to obtain queries, keys and values, compare all queries and keys against each other using a `dot` layer, then use a `softmax_over_spatial` layer to turn these attention energies into attention weights, and finally use a `generic_attention` layer to compute a weighted sum of the attention values.
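A hedged sketch of that unmasked (encoder) case, using the layers named above; the input layer name ("encoder") and the exact `dot` axis options are assumptions on my side. As the discussion in this thread shows, distinguishing the two time axes (queries vs. keys) is the subtle part in practice:

```python
enc_self_att = {
    "query": {"class": "linear", "from": "encoder", "with_bias": False, "n_out": 64},
    "key": {"class": "linear", "from": "encoder", "with_bias": False, "n_out": 64},
    "value": {"class": "linear", "from": "encoder", "with_bias": False, "n_out": 64},
    # Compare every query against every key (attention energies):
    "energy": {"class": "dot", "from": ["query", "key"],
               "red1": "static:-1", "red2": "static:-1"},
    # Turn the energies into attention weights over the keys' time axis:
    "att_weights": {"class": "softmax_over_spatial", "from": "energy"},
    # Weighted sum over the values:
    "att": {"class": "generic_attention", "weights": "att_weights", "base": "value"},
}
```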
For masked self-attention (where one cannot attend to future positions, e.g. used in the decoder), there are two things to consider: what would the `linear` layers generating attention keys and values return inside a recurrent loop (wouldn't that introduce a time axis even then)? And how do we handle that we do not need to recompute old keys/values? I have no idea how that should look.
What would be the best approach to extend self-attention? Stick to changing the RETURNN code of the layer? Or implement it in multiple layers, but then how do I solve the problems I mentioned?
Thanks :)