Extending Self Attention #391

Closed · Zettelkasten opened this issue Nov 16, 2020 · 81 comments

@Zettelkasten
Member

I want to implement some changes to the self-attention used in the Transformer for MT, namely locality-sensitive hashing (LSH) attention (https://arxiv.org/pdf/2001.04451.pdf).

Right now, self-attention is a single layer within RETURNN. While this is very convenient when using the default configuration, it is not very extensible: All options for it have been implemented as additional arguments for the layer, and the code for it has become pretty messy over time.
I could implement my changes within the layer by adding an additional parameter, but I think it might be better to not clutter the self-attention layer with even more (relatively specific) parameters.

Instead, it might be nicer to implement them using other existing RETURNN layers, similar to how encoder-decoder attention is implemented in our Trafo config.
For unmasked self-attention (where one can attend to the entire sequence, e.g. used in the encoder), I don't see an issue in implementing it completely analogously to the encoder-decoder attention:
Use three linear layers to obtain queries, keys and values. Compare all queries and keys against each other using a dot layer, and then use a softmax_over_spatial layer to turn these attention energies into attention weights. Finally use a generic_attention layer to compute a weighted sum of the attention values.
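
For reference, a rough sketch of how that could look as a net dict, modeled on the encoder-decoder attention pattern from the Trafo config (single head, no scaling; 'x', n_key and n_value are placeholders, not from this issue). Note that keeping the two time axes with identical dimension tags apart in the dot/softmax layers is exactly one of the subtleties that comes up further down in this thread:

'att_query': {'class': 'linear', 'from': 'x', 'activation': None, 'with_bias': False, 'n_out': n_key},    # [B,T,K]
'att_key': {'class': 'linear', 'from': 'x', 'activation': None, 'with_bias': False, 'n_out': n_key},      # [B,T,K]
'att_value': {'class': 'linear', 'from': 'x', 'activation': None, 'with_bias': False, 'n_out': n_value},  # [B,T,V]
'att_energy': {'class': 'dot', 'from': ['att_key', 'att_query'], 'red1': 'F', 'red2': 'F', 'var1': 'T', 'var2': 'T?'},  # [B,T_key,T_query]
'att_weights': {'class': 'softmax_over_spatial', 'from': 'att_energy'},  # softmax over the key time axis
'att': {'class': 'generic_attention', 'weights': 'att_weights', 'base': 'att_value'},  # [B,T_query,V]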

For masked self-attention (where one cannot attend to future positions, e.g. used in the decoder), there are two things to consider:

  • We have to mask all future target positions by setting the attention energies to -\infty. This could for example be done in the softmax layer (which already considers the total sequence length anyway).
  • When we are in a recurrent layer (e.g. during Trafo inference), then we would like to cache all previously computed attention keys, values and queries. The current self-attention layer does this, but it is also one of the reasons why it is messy to extend it currently: Both the recurrent and parallel case are handled somewhat differently.
    I have no idea how that should look. What would the linear layers generating attention keys and values return in a recurrent loop (wouldn't that introduce a time axis even then)? How to handle that we do not need to recompute old keys/values?

What would be the best approach to extend self-attention? Stick to changing the RETURNN code of the layer? Or implement it in multiple layers, but then how do I solve the problems I mentioned?
Thanks :)

@JackTemaki
Collaborator

It would definitely be better to implement this using the RETURNN layers instead of extending the SelfAttentionLayer.

From my perspective, the problems you are mentioning might not be solvable without extending some of the other layers. If you want to access computed layer outputs from previous recurrent steps, you can use the WindowLayer (unfortunately the docstring does not yet cover this special case). With the window layer, you could get e.g. the last N keys and values in the format [B,N,D] inside the recurrent subnetwork, and they would not be recomputed (see the sketch below). The problem is that you need all the states from the beginning, and the WindowLayer is currently not capable of doing so.

I am not sure if there is another approach currently existing. Extending the WindowLayer to have a "dynamic" window might not be the best approach, but it is the only one I can think of so far.
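
For illustration, a minimal sketch of that WindowLayer idea inside the rec subnetwork (N, key_dim, 'x' and the layer names are placeholders, and the in-loop behavior is exactly the special case the docstring does not yet cover):

'key': {'class': 'linear', 'from': 'x', 'activation': None, 'n_out': key_dim},  # [B,D] per decoder step
'key_window': {'class': 'window', 'from': 'key', 'window_size': N},  # last N steps kept as recurrent state, [B,N,D]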

@albertz
Member

albertz commented Nov 17, 2020

Right now, self-attention is a single layer within RETURNN. While this is very convenient when using the default configuration, it is not very extensible: All options for it have been implemented as additional arguments for the layer, and the code for it has become pretty messy over time.

Thanks for mentioning this. This is what I keep saying, and also explain in the tutorial, and also in the introduction video, and in many other places.

This is why we definitely should not make it even more complex. I would even go as far as saying we should deprecate the SelfAttentionLayer, as it obviously tends to become messy, or already is; one of the core principles of RETURNN is that layers should be simple.

This is what you should follow in principle. (If you have not seen the tutorial, or introduction video, please check it, to know the principles of RETURNN.) Layers should be as simple as possible, reduced mostly to a single operation (e.g. wrapping a single TF low-level function), such that you can basically define anything in your config by using the layers as building blocks. Also, such basic building blocks then should be useful for most other people as well, as they should be very generic.

This should mostly be straightforward, except for automatic optimization: you should think about the two cases, where the layer is inside a recurrent loop (at decoding time, or maybe if you have other recurrence in the model), and where it can be calculated in parallel (independent of the recurrent loop). I haven't really thought about how to split up SelfAttentionLayer into smaller building blocks such that it is still efficient for these two cases. This is the non-trivial part.

You could also implement another independent layer, SelfLSHAttentionLayer or so. But this might be very specific, so maybe not useful for other people (except if they want to use exactly the same LSH variant). But you could simply leave such a layer inside your config and not push it back to RETURNN.

One principal rule for code in RETURNN is that it should be useful for many people, and would probably be used immediately by many people, now and in the future. For generic basic building blocks, this is usually the case. For sth like SelfLSHAttentionLayer, this is not clear currently.

So, to summarize:

  • Try to think about the basic building blocks you need, and whether that works correctly with automatic optimization. Then add whatever basic building block is missing to RETURNN.
  • Or: Put your custom SelfLSHAttentionLayer into your config. But this is maybe not so nice.

@JackTemaki
Collaborator

It seems that it is a better idea to write a new layer instead of hacking something into the WindowLayer. @albertz suggested adding a ConcatenateInTimeLayer which returns the successive outputs in a recurrent layer.

@Zettelkasten
Member Author

Thanks for all of your comments!
I will start by implementing the 'normal' self-attention as currently provided by the SelfAttentionLayer using multiple RETURNN layers, and then go from there - I do not seem to be the only person here interested in making self-attention more configurable.

It seems that it is a better idea to write a new layer instead of hacking something into the WindowLayer. @albertz suggested adding a ConcatenateInTimeLayer which returns the successive outputs in a recurrent layer.

That's a good idea - that's the main part that's missing.
However, I would call the layer differently: ConcatenateInTimeLayer sounds very similar to Prefix/PostfixInTimeLayer to me (suggesting it concatenates multiple inputs along the time axis).
Perhaps something like AccumulatePreviousOutputLayer? That would be similar to the (obsolete) GetRecAccumulatedOutputLayer. But I'm not sure ..
Any thoughts? But that's just a name, I can start working on it nonetheless.

The idea of that layer would be then:

  • If the layer is not in a recurrent loop, it just copies its input (and maybe also asserts that a recurrent time axis already exists).
  • If it is in a recurrent loop, it adds a recurrent time axis which concatenates its input at this time step and all previous time steps.

@albertz
Member

albertz commented Nov 18, 2020

However, I would call the layer differently: ConcatenateInTimeLayer sounds very similar to Prefix/PostfixInTimeLayer to me (suggesting it concatenates multiple inputs along the time axis).
Perhaps something like AccumulatePreviousOutputLayer? That would be similar to the (obsolete) GetRecAccumulatedOutputLayer. But I'm not sure ..
Any thoughts? But that's just a name, I can start working on it nonetheless.

Right, ConcatenateInTimeLayer is not so clear.

There is a very similar/related CumsumLayer, which does tf.cumsum.
Analog to that, maybe CumConcatLayer?

  • If it is in a recurrent loop, it adds a recurrent time axis which concatenates its input at this time step and all previous time steps.

Yes. So let's be more specific. In the loop, it assumes an input of shape [B,D], and the output in the first frame would be [B,1,D], in the second frame [B,2,D], and so on, i.e. in general [B,i,D]. This assumes an initial state of [B,0,D], but that could be configured.

It also means that when you now accumulate this frame by frame (for i=1...T), i.e. get the accumulated values from outside of the loop, you would get an output of shape [T,B,T,D], which has the usual mask on the second T axis.

  • If the layer is not in a recurrent loop, it just copies its input (and maybe also asserts that a recurrent time axis already exists).

This is not so clear. To be analogous to the within-loop case, it should expect an input [T,B,D] (or whatever axis order), and then just produce the same output [T,B,T,D], with the usual mask on the second T axis.
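
To illustrate the shapes only, a plain numpy sketch (B, D, T and all names here are made up for this illustration):

import numpy as np

B, D, T = 2, 3, 4
frames = [np.random.randn(B, D) for _ in range(T)]  # per-frame input [B,D] inside the loop

# Inside the loop: in frame i, output the concatenation of frames 0..i -> [B,i+1,D].
accum = [np.stack(frames[:i + 1], axis=1) for i in range(T)]

# Accumulated over the whole loop (the outside view): [T,B,T,D] with the usual mask on the second T axis.
out = np.zeros((T, B, T, D))
mask = np.zeros((T, B, T), dtype=bool)
for i in range(T):
    out[i, :, :i + 1, :] = accum[i]
    mask[i, :, :i + 1] = True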

@albertz
Member

albertz commented Nov 18, 2020

To add to the outside-loop case:

Returning [T,B,T,D] there is unnecessarily inefficient. You are right that, outside the loop, it could also just return [T,B,D]. But this has two problems:

  • This would be inconsistent with the within-loop operation, where the accumulated shape would be [T,B,T,D]. This is kind of the requirement: whether automatic optimization is enabled or disabled should result in the same values. (Or at least the behavior in all possible cases should be the same.)

  • Follow-up operations would need to be adjusted, esp. the energy mask needs to be applied later on. But how should the follow-up layer, if it just gets some data [T,B,D], know that some mask needs to be applied?

CumConcatLayer or AccumulatePreviousOutputLayer would then also be bad names, if it has this special behavior. Or maybe the behavior should be a flag?

And there would need to be a custom layer which applies the mask only if outside the loop. But I wonder if this is really intuitive. How would this be called? HistoryMaskLayer? Maybe the newly introduced dimension (the second T in the within-loop case) could have a special flag on its dimension tag, accumulated_over_time or historic or so, and the HistoryMaskLayer only applies the mask if there is no dimension with such a flag. But I fear this becomes kind of non-intuitive.

We should maybe draft a net dict of how that would look.

Analogous to the SelfAttentionLayer, let's assume there is a layer x with the input:

{class: rec, unit: {
...,
x: ...,  # [B,D] inside, [T,B,D] outside
qkv: {class: linear, from: x, activation: None},  # [B,2*K+V] inside, [T,B,2*K+V] outside
qkv_split: {class: split, from: qkv, size_splits: [K,K,V]},
q: {class: copy, from: qkv_split/1},  # [B,K] inside, [T,B,K] outside
k: {class: copy, from: qkv_split/2},  # [B,K] inside, [T,B,K] outside
v: {class: copy, from: qkv_split/3},  # [B,V] inside, [T,B,V] outside
k_accum: {class: cum_concat, from: k, as_is_outside: True},  # [B,T,K] in both cases
v_accum: {class: cum_concat, from: v, as_is_outside: True},  # [B,T,V] in both cases
energy: {class: dot, from: [q, k_accum], red1:"static:-1", red2:"static:-1", var1:"T?", var2:"T"},  # [B,T] inside, [T,B,T] outside
energy_masked: {class: history_mask, from: energy, value:-inf},  # [B,T] inside, [T,B,T] outside (with mask applied)
att_weights: {class: softmax_over_spatial, from: energy_masked},  # [B,T] inside, [T,B,T] outside
att: {class: dot, from: [v_accum, att_weights], red1:"T", red2: "stag:history?", var1:"static:-1", var2:[]},  # [B,V] inside, [T,B,V] outside
}}

This is only a draft. There are a couple of details which probably need extra care. It misses:

  • Scale on query (simple; see the sketch right after this list).
  • Multiple heads (relatively simple).
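
For the query scale from the first point, a possible sketch in the style of the draft above (K again stands for the concrete key dimension; this is just one option, not part of the draft): either scale the queries with a small eval layer and feed q_scaled instead of q into the energy layer, or use the energy_factor option of softmax_over_spatial.

q_scaled: {class: eval, from: q, eval: "source(0) * K**-0.5"},  # option 1: scale the queries by 1/sqrt(K)
att_weights: {class: softmax_over_spatial, from: energy_masked, energy_factor: K**-0.5},  # option 2: scale the energies directly in the existing softmax layer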

Incomplete also: In the last dot-layer for att, I use stag:history?. This would be analogous to T?, i.e. it means it optionally selects this axis if it exists, and otherwise not.

We cannot use T? there because this would probably select the wrong spatial/time axis! Optional stag support is missing, but should be simple to add. history is just a made-up name here. It depends how the CumConcatLayer calls the new dimension tag.

Or maybe instead of using stag, we could also have a special axis identifier history, which would also be used by HistoryMaskLayer?

@albertz
Member

albertz commented Nov 18, 2020

I think there is sth wrong in the att at the end. It should not be red2: "stag:history?". red2 should always match red1, so it cannot be optional. So it would just be red2: "stag:history". To make that work, the HistoryMaskLayer actually could also introduce such a history dim tag on the masked time axis.

Another idea: CumConcatLayer, in the outside-loop case, could return shape [T,B,T',D], where T' is a special dimension of length 1, but which has the special meaning that it would expand to the other time-dim with masking. Such special meaning could be preserved in its dimension tag. Then it becomes somewhat more consistent with the inside-loop case, and also HistoryMaskLayer becomes more straight-forward.

The shapes would become:

v_accum: [T',B,D] inside, [T,B,T',D] outside

However, the DotLayers become less straight-forward. I think for energy, it just stays:

energy: {class: dot, from: [q, k_accum], red1:"static:-1", red2:"static:-1", var1:"T?", var2:"T"},  # [B,T'] inside, [T,B,T] outside

This would squeeze away the T' axis in case of outside-the-loop, as it would be part of the remaining axes and it assumes them to match. Not sure if it does it all correctly, or complains. Also not sure if it should do this implicitly like that, or we should make it more explicit.

Then for the HistoryMaskLayer, we would get:

energy_masked: [B,T'] inside, [T,B,T''] outside

T'' would be like T, but a separate dim-tag.

Also the final DotLayer for att becomes less straight-forward. Specifically, what should you put for red2? You want to reduce the axis if it can be matched to T, i.e. inside the loop, but not reduce it otherwise. You just want to squeeze it outside the loop, but not sure if that should be done in dot layer or separately. Maybe just like:

att: {class: dot, from: [v_accum, att_weights], red1:"stag:history", red2: "stag:history", var1:"static:-1", var2:[]},  # [B,V] inside, [T,B,V] outside

That does not work. The axes to be reduced must match, but they don't.
Inside the loop, red1 should be T', outside the loop, it should be T.

Maybe there is some way to overcome this. Or maybe this idea was just not good, and the first approach is better.

@Zettelkasten
Member Author

Zettelkasten commented Nov 19, 2020

Right now, I think the first proposal (just return [T,B,D] in the outside-loop case) is more intuitive and more straightforward to implement and use.
We can also just implement this behavior as a flag, and then have the benefits of both solutions.

As you said, it makes everything simpler if the CumConcatLayer marks the output time axis in some special way, e.g. using stag:history as you mentioned.
Then we can use stag everywhere instead of just T. This is a bit verbose, but I think necessary to keep the query axis (i.e. stag:extern_data:classes) and the key/value axis (i.e. stag:history or sth like that) apart.

I am unsure whether a HistoryMaskLayer is really necessary.
We could also add an extra parameter mask_future_of_axis to the SoftmaxOverSpatialLayer.
Then we could simply say

att_weights: {class: softmax_over_spatial, from: energy,
              axis: stag:history, mask_future_of_axis: stag:extern_data:classes?}

and let the softmax_over_spatial layer apply masking exactly if it receives a mask_future_of_axis, which is exactly the case when we are outside a loop.
But then again this is less configurable ... so maybe this is a bad idea, falling back into the trap of having too many arguments to RETURNN layers.

Or instead, HistoryMaskLayer could receive two time-axes as argument: e.g. axis and mask_future_of_axis, where again mask_future_of_axis can be optional.

@albertz
Member

albertz commented Nov 19, 2020

Right now, I think the first proposal (just return [T,B,D] in the outside-loop case) is more intuitive and more straightforward to implement and use.

Yes, I think I agree.
If outside the loop, CumConcatLayer would return [T_hist,B,D], for input [T,B,D], where T_hist=T, but with a different special dimension tag.

The new tag would be sth with history or so... maybe come up with a better name?

We can also just implement this behavior as a flag, and then have the benefits of both solutions.

Let's not make it too complicated. Let's first think/draft one possibility, and only implement that.

As you said, it makes everything simpler if the CumConcatLayer marks the output time axis in some special way, e.g. using stag:history as you mentioned.
Then we can use stag everywhere instead of just T. This is a bit verbose, but I think necessary to keep the query axis (i.e. stag:extern_data:classes) and the key/value axis (i.e. stag:history or sth like that) apart.

Yes, this is really necessary. Also so that a layer like HistoryMaskLayer (or SoftmaxOverSpatialLayer) can operate correctly.

I am unsure whether a HistoryMaskLayer is really necessary.
We could also add an extra parameter mask_future_of_axis to the SoftmaxOverSpatialLayer.

Yes. But not sure what would be better. Remember, I don't like that layers become too complex, and more and more options are added. That is our problem with SelfAttentionLayer in the first place. It's usually better to separate things into building blocks, which makes it easy for the user to try out variations.

Then we could simply say

att_weights: {class: softmax_over_spatial, from: energy,
              axis: stag:history, mask_future_of_axis: stag:extern_data:classes?}

This is the wrong way around, right? The axis for the softmax is T or stag:classes.

Or instead, HistoryMaskLayer could receive two time-axes as argument: e.g. axis and mask_future_of_axis, where again mask_future_of_axis can be optional.

Why two? It only needs to check for such a "history" dim tag, which you could maybe explicitly specify by axis: "stag:history?" or so. And if it doesn't find it, it does nothing.

I'm a bit afraid that this could potentially hide bugs, e.g. if you have a typo there for the history axis, you would not get any error, but it would silently just ignore this. This is bad. Any idea about this?

Maybe if outside a loop, the history axis must exist. And if inside the loop, it will ignore it. (This is not perfectly generic, though. E.g. what if you have a loop in a loop? Etc...)

@Zettelkasten
Member Author

Zettelkasten commented Nov 19, 2020

The new tag would be sth with history or so... maybe come up with a better name?
If outside the loop, CumConcatLayer would return [T_hist,B,D], for input [T,B,D], where T_hist=T, but with a different special dimension tag.

Yes sounds good.

Let's not make it too complicated. Let's first think/draft one possibility, and only implement that.

Okay sure. Would we still want the layer to have an attribute as_is_outside (or another name), and then for now assert that it is set to True for clarification (implying that as_is_outside=False might become the default argument in the future)? Or not?
Maybe we should also think of a better name if we want to include it.

Yes. But not sure what would be better. Remember, I don't like that layers become too complex, and more and more options are added. That is our problem with SelfAttentionLayer in the first place. It's usually better to separate things into building blocks, which makes it easy for the user to try out variations.

Yes okay, you do have a point. Then I would go for the additional masking layer I think.

Then we could simply say

att_weights: {class: softmax_over_spatial, from: energy,
              axis: stag:history, mask_future_of_axis: stag:extern_data:classes?}

This is the wrong way around, right? The axis for the softmax is T or stag:classes.

Actually I think it is correct: stag:history would mark the history of all previous keys and values (including the current one), which we want to apply the softmax along. stag:classes (the axis of the queries) on the contrary does not exist if inside a loop.

Or instead, HistoryMaskLayer could receive two time-axes as argument: e.g. axis and mask_future_of_axis, where again mask_future_of_axis can be optional.

Why two? It only needs to check for such a "history" dim tag, which you could maybe explicitly specify by axis: "stag:history?" or so. And if it doesn't find it, it does nothing.

I wanted it to be less specific to the stag:history tag; in general, my idea was that the layer takes two time axes, T1 and T2, and masks the input to e.g. -infinity where t1 > t2.

I'm a bit afraid that this could potentially hide bugs, e.g. if you have a typo there for the history axis, you would not get any error, but it would silently just ignore this. This is bad. Any idea about this?
Maybe if outside a loop, the history axis must exist. And if inside the loop, it will ignore it. (This is not perfectly generic, though. E.g. what if you have a loop in a loop? Etc...)

Good point.
Maybe we could make stag:history_mask? the default argument? That's not a real fix of course, but it should make typos less likely. Or add additional logic somewhere that ensures that one must not reference a stag that was never defined anywhere (though I don't know if there might be a valid reason to do that).

@albertz
Member

albertz commented Nov 19, 2020

Okay sure. Would we still want the layer [CumConcatLayer] to have an attribute as_is_outside (or another name), and then for now assert that it is set to True for clarification (implying that as_is_outside=False might become the default argument in the future)? Or not?

Hm, maybe. I'm not sure. The name is bad. Also, what would be the default? Or would this be a required argument? (If it has a default, we can also just leave it away.)

Also, we speak about these two cases inside loop / outside loop, but these are not really the only cases. It could be inside a loop, inside another loop. It would be good to abstract away a bit from that. Also, in an optimal world, the user should not really need to think about this, and just implement it in a way as it would be without automatic optimization, i.e. always inside the loop.

Actually, what are the other arguments of CumConcatLayer? It would be good to be able to explicitly specify the axis to be accumulated over. If inside a rec layer, we might need a special placeholder for the rec layer frames. If outside the loop, this axis would exist, but if inside the loop, this axis would not exist. Maybe sth like ":i" (that is also the name for the special layer which reflects the current frame index). This could be the default (axis=":i"). Or maybe the default is axis="T", and T can be treated somewhat specially here, i.e. if it does not exist, it is treated like ":i". And ":i" always implies that this is used inside a rec layer (no matter if inside the loop or outside).

I cannot really come up with a good name or other conceptual way to configure something like as_is_outside.
Maybe more like historic_existing_axis: bool? I.e. stressing the difference whether the specified axis exists or not -- this is already more generic than the concept whether within loop or not.
Or maybe we can leave that away for now.

Yes okay, you do have a point. Then I would go for the additional masking layer [HistoryMaskLayer or so] I think.

Ok.

Btw, the naming for the current rec layer frame axis (":i" suggested above), the naming of HistoryMaskLayer, and the naming of the dimension tag (and/or its special flag) of the newly introduced axis by CumConcatLayer should maybe be more related, to make it clearer. Maybe:

  • Current rec layer frame axis special name: "rec-frame" (would be same as "extern_data/classes" in training, or newly created axis in decoding)
  • New dim-tag by CumConcatLayer when axis does not exist (e.g. when in rec layer, and inside loop): "rec-history" (or maybe just "rec-frame" as well? would that work?)
  • New dim-tag by CumConcatLayer when axis exists (e.g. when in rec layer, and outside loop, or just outside rec layer): "left-masked" (or so?) (or also "rec-history"? but wouldn't that be confusing? maybe actually not...)
  • "HistoryMaskLayer": LeftMaskLayer (this would operate on "left-masked", but not on "rec-history")

Btw, regarding the axis you do the softmax over in the SoftmaxOverSpatialLayer:

att_weights: {class: softmax_over_spatial, from: energy,
              axis: stag:history, mask_future_of_axis: stag:extern_data:classes?}

This is the wrong way around, right? The axis for the softmax is T or stag:classes.

Actually I think it is correct: stag:history would mark the history of all previous keys and values (including the current one), which we want to apply the softmax along. stag:classes (the axis of the queries) on the contrary does not exist if inside a loop.

I think I/we get confused now about the naming of "history" here. But you are right, depending on what the "history" axis refers to. If this refers to the dim tag by CumConcatLayer for both cases (cum-concat input axis exists or not -- e.g. inside loop vs outside loop), then this is correct.

Or instead, HistoryMaskLayer could receive two time-axes as argument: e.g. axis and mask_future_of_axis, where again mask_future_of_axis can be optional.

Why two? It only needs to check for such a "history" dim tag, which you could maybe explicitly specify by axis: "stag:history?" or so. And if it doesn't find it, it does nothing.

I wanted it to be less specific to the stag:history tag; in general, my idea was that the layer takes two time axes, T1 and T2, and masks the input to e.g. -infinity where t1 > t2.

The HistoryMaskLayer needs to have a mode where it is a no-op, specifically when it is inside the loop. In more generic terms: it needs to distinguish whether the "history" dim tag comes from within the loop or not (the two cases of CumConcatLayer I described above).

But we can solve that by adding a special flag/marker on the DimensionTag object which explicitly tells that, sth like is_history_of_axis: Optional[DimensionTag], and maybe additionally also is_history_of_rec_loop: Optional[RecLayer]. It can have the same name "rec-history", but depending on the case, either the first or the second attrib would be set.

However, then the HistoryMaskLayer doesn't need to have another axis argument, right? You specify axis: "stag:rec-history" or so, and it finds that axis, and its dim tag. If is_history_of_axis is set on the dim tag, it expects that this axis is also present in the input, and it does the masking. If is_history_of_rec_loop is set, it does nothing. Otherwise it will error. So it really expects such a special axis / dim tag. Then there is no easy way (e.g. by some typo) to get it wrong and silently use no-op.

If you instead have two axes, and do the no-op mode when one of the axes does not exist, this can potentially be problematic. E.g. just a typo could lead to the case that the axis never exists, and it is always a no-op.
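
In pseudo-code, the dispatch described here might look like this (a sketch only; is_history_of_axis, is_history_of_rec_loop and apply_left_mask are the proposed/assumed names from this discussion, not existing RETURNN API):

def history_mask(data, history_tag):
    if history_tag.is_history_of_rec_loop is not None:
        return data  # inside the loop: the history axis only covers past frames, nothing to mask
    if history_tag.is_history_of_axis is not None:
        query_tag = history_tag.is_history_of_axis
        # outside the loop: set entries with history position > query position to -inf
        return apply_left_mask(data, history_tag=history_tag, query_tag=query_tag)
    raise ValueError("expected a rec-history dim tag, got %r" % (history_tag,))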

@Zettelkasten
Member Author

Okay sure. Would we still want the layer [CumConcatLayer] to have an attribute as_is_outside (or another name), and then for now assert that it is set to True for clarification (implying that as_is_outside=False might become the default argument in the future)? Or not?

Hm, maybe. I'm not sure. The name is bad. Also, what would be the default? Or would this be a required argument? (If it has a default, we can also just leave it away.)

I think that either as_is_outside=False (or whatever name) should be the default or it should be required, as this is the behavior that works most consistently no matter if it is in a loop or not.
I also can't think of a better name..

Actually, what are the other arguments of CumConcatLayer? It would be good to be able to explicitly specify the axis to be accumulated over. If inside a rec layer, we might need a special placeholder for the rec layer frames. If outside the loop, this axis would exist, but if inside the loop, this axis would not exist. Maybe sth like ":i" (that is also the name for the special layer which reflects the current frame index). This could be the default (axis=":i"). Or maybe the default is axis="T", and T can be treated somewhat specially here, i.e. if it does not exist, it is treated like ":i". And ":i" always implies that this is used inside a rec layer (no matter if inside the loop or outside).

I agree that the axis should be configurable.. But puh, I really can't tell what would be more intuitive :D
At least we agree on the default value, and then maybe we can just skip this for now and always use the innermost rec-time dim.

Btw, the naming for the current rec layer frame axis (":i" suggested above), the naming of HistoryMaskLayer, and the naming of the dimension tag (and/or its special flag) of the newly introduced axis by CumConcatLayer, these should be maybe more related, to make it more clear. Maybe:

  • Current rec layer frame axis special name: "rec-frame" (would be same as "extern_data/classes" in training, or newly created axis in decoding)
  • New dim-tag by CumConcatLayer when axis does not exist (e.g. when in rec layer, and inside loop): "rec-history" (or maybe just "rec-frame" as well? would that work?)
  • New dim-tag by CumConcatLayer when axis exists (e.g. when in rec layer, and outside loop, or just outside rec layer): "left-masked" (or so?) (or also "rec-history"? but wouldn't that be confusing? maybe actually not...)
  • "HistoryMaskLayer": LeftMaskLayer (this would operate on "left-masked", but not on "rec-history")

Don't we run into issues if the CumConcatLayer creates an axis with a different name depending on whether it is in a loop or not?
I mean, we would e.g. want to pass this identifier to the SoftmaxOverSpatial layer for the attention weights (but also to other layers), but if it is called differently depending on whether we are in a loop or not, we cannot do that, right?

So if I understood you correctly, you want to call the axis generated by the CumConcatLayer in either case (parallel or in-loop) rec-history, but add an is_history_of_axis and is_history_of_rec_loop attribute to the tag as you explained below (which is then used by e.g. HistoryMaskLayer to figure out if masking is necessary)?
That sounds like a good idea to me.

Maybe we could also postpone these DimensionTag-attributes and initially implement it using the simpler (but much more error prone) check if the query-time axis exists or not. For me it looks somewhat complicated to add these attributes to DimensionTag, as they are so widely used across the entire code base.

@albertz
Member

albertz commented Nov 19, 2020

I think that either as_is_outside=False (or whatever name) should be the default or it should be required, as this is the behavior that works most consistently no matter if it is in a loop or not.
I also can't think of a better name..

So you think we should have this option on CumConcatLayer?
I tend to think not having it is better for now. Esp if it anyway would otherwise be the default, and the only implemented case. Esp also if we cannot find a better name. (The name should be intuitive, i.e. you should understand what it does, without needing to look at the code/documentation.)

I agree that the axis should be configurable.. But puh, I really can't tell what would be more intuitive :D
At least we agree on the default value, and then maybe we can just skip this for now and always use the innermost rec-time dim.

The question about the CumConcatLayer axis option is quite clear I think. If it is ":i" / "rec-frame" (or "T", which would be treated specially here, just like "rec-frame"), it would use the innermost rec-time dim. And this would also be the default.

  • Current rec layer frame axis special name: "rec-frame" (would be same as "extern_data/classes" in training, or newly created axis in decoding)
  • New dim-tag by CumConcatLayer when axis does not exist (e.g. when in rec layer, and inside loop): "rec-history" (or maybe just "rec-frame" as well? would that work?)
  • New dim-tag by CumConcatLayer when axis exists (e.g. when in rec layer, and outside loop, or just outside rec layer): "left-masked" (or so?) (or also "rec-history"? but wouldn't that be confusing? maybe actually not...)
  • "HistoryMaskLayer": LeftMaskLayer (this would operate on "left-masked", but not on "rec-history")

Don't we run into issues if the CumConcatLayer creates an axis with a different name depending on whether it is in a loop or not?
I mean, we would e.g. want to pass this identifier to the SoftmaxOverSpatial layer for the attention weights (but also to other layers), but if it is called differently depending on whether we are in a loop or not, we cannot do that, right?

Yes. So concluding from that, it should have the same name in both cases. So "rec-history" (or so) then in both cases.

This is about the name of the new dim tag. It still needs to be distinguished for the other logic (e.g. in HistoryMaskLayer). But this could easily be done via is_history_of_axis/is_history_of_rec_loop flags in the dim tag, as I outlined.

So if I understood you correctly, you want to call the axis generated by the CumConcatLayer in either case (parallel or in-loop) rec-history, but add an is_history_of_axis and is_history_of_rec_loop attribute to the tag as you explained below (which is then used by e.g. HistoryMaskLayer to figure out if masking is necessary)?
That sounds like a good idea to me.

Yes exactly.

Maybe we could also postpone these DimensionTag-attributes and initially implement it using the simpler (but much more error prone) check if the query-time axis exists or not.

I don't like that. We always should try to avoid error-prone code. And why not do it directly in a good way? We already have a possible solution, as outlined here.

For me it looks somewhat complicated to add these attributes to DimensionTag, as they are so widely used across the entire code base.

They are widely used, but adding new attributes will not have any effect on existing code, so that shouldn't be a problem. And they would be optional, i.e. None by default.

The only problem I see with this is whether these two attribs/flags are maybe very specific to this particular case here, and will not be used in any other case. This is maybe a bit ugly. But maybe this is not too much of a problem.

@Zettelkasten
Member Author

So you think we should have this option on CumConcatLayer?
I tend to think not having it is better for now. Esp if it anyway would otherwise be the default, and the only implemented case. Esp also if we cannot find a better name. (The name should be intuitive, i.e. you should understand what it does, without needing to look at the code/documentation.)

Okay, then we simply only support the "as_is_outside=True" case and omit the argument entirely, right?

The question about the CumConcatLayer axis option is quite clear I think. If it is ":i" / "rec-frame" (or "T", which would be treated specially here, just like "rec-frame"), it would use the innermost rec-time dim. And this would also be the default.

Yeah okay, then let's do it this way.

Maybe we could also postpone these DimensionTag-attributes and initially implement it using the simpler (but much more error prone) check if the query-time axis exists or not.

I don't like that. We always should try to avoid error-prone code. And why not do it directly in a good way? We already have a possible solution, as outlined here.

Okay okay, I guess so, fine.

Then everything is more or less clear now, right?
I will then start work on an initial PR, and ask here for now if any further questions come up.

I would start by using the names we chose here (i.e. CumConcatLayer, HistoryMaskLayer and rec-history with is_history_of_axis/is_history_of_rec_loop). If we find better names we can of course easily change that before we merge this into the main branch.

@albertz
Member

albertz commented Nov 19, 2020

Okay, then we simply only support the "as_is_outside=True" case [for CumConcatLayer] and omit the argument entirely, right?

Yes.

Then everything is more or less clear now, right?

I think so. At least the draft looks fine and like this could work.

I will then start work on an initial PR, and ask here for now if any further questions come up.

Yea, I assume there will be subtle problems. But maybe not. We will see.
Create a new branch here such that I can maybe collaborate easily on some problems.

I would start by using the names we chose here (i.e. CumConcatLayer, HistoryMaskLayer and rec-history with is_history_of_axis/is_history_of_rec_loop).

Use RecHistoryMaskLayer. But otherwise, I think it's good.


@Zettelkasten self-assigned this Nov 19, 2020
@Zettelkasten
Member Author

Zettelkasten commented Nov 22, 2020

I pushed an initial draft of the CumConcatLayer in bb04e4c.
I didn't check whether the values calculated are actually correct, but all Data templates look right to me (with optimize_move_layers_out=True and False).
This works for a network dict like this:

'key': {'class': 'linear', 'from': ['state'], 'n_out': key_dim, 'activation': None, 'with_bias': None},
'query': {'class': 'linear', 'from': ['state'], 'n_out': key_dim, 'activation': None, 'with_bias': None},
'value': {'class': 'linear', 'from': ['state'], 'n_out': value_dim, 'activation': None, 'with_bias': None},
'accum_key': {'class': 'cum_concat', 'from': ['key'], 'axis': 'stag:extern_data:classes'},
'accum_value': {'class': 'cum_concat', 'from': ['value'], 'axis': 'stag:extern_data:classes'},
'energy': {'class': 'dot', 'from': ['accum_key', 'query'], 'red1': 'F', 'red2': 'F',
           'var1': 'stag:rec-history', 'var2': 'stag:extern_data:classes?', 'add_var2_if_empty': False},
'weights': {'class': 'softmax_over_spatial', 'from': ['energy'], 'axis': 'stag:rec-history'},
'att': {'class': 'dot', 'from': ['accum_value', 'weights'], 'red1': 'stag:rec-history', 'red2': 'stag:rec-history',
        'var1': 'F', 'var2': 'stag:extern_data:classes', 'add_var2_if_empty': False},

yielding the templates (without optimizations)

layer root/output:rec-subnet/'value' output: Data(name='value_output', shape=(5,), time_dim_axis=None, batch_shape_meta=[B,F|5])
layer root/output:rec-subnet/'accum_value' output: Data(name='accum_value_output', shape=(None, 5), batch_dim_axis=1, batch_shape_meta=[T|'rec-history:output/accum_value',B,F|5])
layer root/output:rec-subnet/'key' output: Data(name='key_output', shape=(5,), time_dim_axis=None, batch_shape_meta=[B,F|5])
layer root/output:rec-subnet/'accum_key' output: Data(name='accum_key_output', shape=(None, 5), batch_dim_axis=1, batch_shape_meta=[T|'rec-history:output/accum_key',B,F|5])
layer root/output:rec-subnet/'query' output: Data(name='query_output', shape=(5,), time_dim_axis=None, batch_shape_meta=[B,F|5])
layer root/output:rec-subnet/'energy' output: Data(name='energy_output', shape=(None,), batch_shape_meta=[B,T|F|'rec-history:output/accum_key'])
layer root/output:rec-subnet/'weights' output: Data(name='weights_output', shape=(None,), batch_shape_meta=[B,T|F|'rec-history:output/accum_key'])
layer root/output:rec-subnet/'att' output: Data(name='att_output', shape=(5,), time_dim_axis=None, batch_shape_meta=[B,F|5])

and with optimizations

layer root/output:rec-subnet-output/'value' output: Data(name='value_output', shape=(None, 5), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:classes',B,F|5])
layer root/output:rec-subnet-output/'accum_value' output: Data(name='accum_value_output', shape=(None, 5), batch_dim_axis=1, batch_shape_meta=[T|'rec-history:output/accum_value',B,F|5])
layer root/output:rec-subnet-output/'key' output: Data(name='key_output', shape=(None, 5), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:classes',B,F|5])
layer root/output:rec-subnet-output/'accum_key' output: Data(name='accum_key_output', shape=(None, 5), batch_dim_axis=1, batch_shape_meta=[T|'rec-history:output/accum_key',B,F|5])
layer root/output:rec-subnet-output/'query' output: Data(name='query_output', shape=(None, 5), batch_dim_axis=1, batch_shape_meta=[T|'time:var:extern_data:classes',B,F|5])
layer root/output:rec-subnet-output/'energy' output: Data(name='energy_output', shape=(None, None), batch_shape_meta=[B,T|'rec-history:output/accum_key',F|'time:var:extern_data:classes'])
layer root/output:rec-subnet-output/'weights' output: Data(name='weights_output', shape=(None, None), time_dim_axis=2, feature_dim_axis=1, batch_shape_meta=[B,F|'time:var:extern_data:classes',T|'rec-history:output/accum_key'])
layer root/output:rec-subnet-output/'att' output: Data(name='att_output', shape=(5, None), time_dim_axis=2, batch_shape_meta=[B,F|5,T|'time:var:extern_data:classes'])

Obviously, the masking is still missing.
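
In plain TF terms (not as a RETURNN layer, and assuming the energies are laid out as [B, T_history, T_query] as in the optimized template above), the missing masking would amount to something like this sketch; in the real layer it would of course be driven by the dim tags / sequence lengths rather than bare index ranges:

import tensorflow as tf

def mask_future(energy):
    # energy: [B, T_history, T_query]; set future history positions (t_hist > t_query) to -inf
    t_h, t_q = tf.shape(energy)[1], tf.shape(energy)[2]
    allowed = tf.range(t_h)[:, None] <= tf.range(t_q)[None, :]   # [T_history, T_query]
    allowed = tf.broadcast_to(allowed[None], tf.shape(energy))   # [B, T_history, T_query]
    return tf.where(allowed, energy, tf.fill(tf.shape(energy), float("-inf")))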

So far I have come across some issues:

  • I needed to set the size_placeholder of the CumConcatLayer both in get_out_data_from_opts and __init__: Only in __init__ do I have all the information to actually construct a meaningful size_placeholder tensor. However, other layers need to know that an axis called rec-history exists already during template construction, e.g. the energy layer above which gets stag:rec-history as input (it works fine if I use just T, but that's not ideal).
    My solution feels like a hack; is there a better option?
  • Also, what is the proper way to determine whether we are in a RecLayer in get_out_data_from_opts? I noticed that network.is_inside_rec_layer always gives True (probably because automatic optimization is called afterwards).
  • And how do I determine the time axis which corresponds to the RecLayer? Right now I use data.time_dim_axis, but that is probably not always the innermost rec-layer time axis.

It would be good if somebody could look over what I did :D

@albertz
Member

albertz commented Nov 22, 2020

Some comments:

  • Unfortunately there is some hiccup in the GitHub CI. This seems to be fixed now with my latest commit in master. Can you rebase to that, so that the tests run correctly again?

  • It is inefficient to calculate k,q,v in three independent linear layers. It is more efficient to do it like I suggested above:

qkv: {class: linear, from: x, activation: None},  # [B,2*K+V] inside, [T,B,2*K+V] outside
qkv_split: {class: split, from: qkv, size_splits: [K,K,V]},
q: {class: copy, from: qkv_split/1},  # [B,K] inside, [T,B,K] outside
k: {class: copy, from: qkv_split/2},  # [B,K] inside, [T,B,K] outside
v: {class: copy, from: qkv_split/3},  # [B,V] inside, [T,B,V] outside
  • In energy, you use 'var2': 'stag:extern_data:classes?', i.e. such an optional dim tag. We should avoid this if possible. This has the same problem as we discussed, that this can potentially lead to strange effects, e.g. if there is a typo. In this case, it is less problematic than in the other case, but still not so nice.
    One possible solution: Maybe you could also use 'var2': ':i'? Here :i has again a special meaning, as before, i.e. it refers to the rec layer frame axis. If inside the rec layer loop, it would not exist.

  • If we can avoid stag:..?, we also should not add this support at all. Not only is it problematic that it can hide bugs caused by typos. But also consider that we might want to slightly change dimension names at some point (some of these names are a bit inconsistent). Depending on how exactly we change it, and how it is used in the config, this could break configs. Then it is better to explicitly raise an exception, and not suddenly ignore the axis (which would likely lead to other errors that would be very annoying to debug and understand). This is also because our check for matching axes is currently just via name in tag.description. Maybe we should somehow make this all more explicit, i.e. not really by string matching, but by using explicit DimensionTag objects. Then it is clear that there cannot be typos.

  • In get_out_data_from_opts, no tensors should be constructed. The TF graph should not be modified/extended at all. I understand this can be tricky sometimes, or problematic, for cases when you get new dynamic sizes, and also new dim tags, like here. We maybe should think about some better way to solve this at some later point (but not here now).
    In your specific case, I think we can solve this, by always creating this directly in the RecLayer. Maybe we actually can treat it in a similar special way like ":i".
    This should probably be discussed further. We should think about how to do that in a clean way.

  • get_out_data_from_opts will get called multiple times, with different context. E.g. first for the dependency graph, where no actual computation will be done (we call this "template construction"). This is always treated as inside the loop. Then later it will be called when the layer is actually constructed. In general, you should make no assumptions about how often, when, under what conditions, with what input this is called.

  • The comment above probably clarifies your question about is_inside_rec_layer. Note that it also has an argument inside_loop.

  • how do I determine the time-axis which corresponds to the RecLayer

    That's a good question. I think this is currently not possible in general. In principle, you can use get_rec_parent_layer, and in many cases, rec_layer.output already has the time seq set (and thus the dim tag). However, not always. This needs to be set in advance somehow. But only the DimensionTag because the seq length tensor is not known yet.
    This should be discussed further.

@Zettelkasten
Member Author

Thanks for your comments!

Can you rebase to that, so that the tests run correctly again?

Seems to have worked, thanks

  • It is inefficient to calculate k,q,v in three independent linear layers. It is more efficient to do it like I suggested above:

Ah, okay of course we can do that then.
I ran into an issue when referring to a layer as qkv_split/0 within a RecLayer whose loop is optimized away. I think I fixed it in 793d46c by handling the case '/' in name exactly as in the non-RecLayer get_layer. Is that the correct way?

  • In energy, you use 'var2': 'stag:extern_data:classes?', i.e. such an optional dim tag. We should avoid this if possible. This has the same problem as we discussed, that this can potentially lead to strange effects, e.g. if there is a typo. In this case, it is less problematic than in the other case, but still not so nice.
    One possible solution: Maybe you could also use 'var2': ':i'? Here :i has again a special meaning, as before, i.e. it refers to the rec layer frame axis. If inside the rec layer loop, it would not exist.

Ahh oh, I didn't think about that.
Then we should implement :i (and maybe the alias rec-frame) in a place decoupled from the specific layer logic, e.g. in get_axis_from_description.
But then we need to pass the (parent-)layer as well, probably not so nice?

  • If we can avoid stag:..?, we also should not add this support at all.

I agree.

  • In get_out_data_from_opts, no tensors should be constructed. [...]
    In your specific case, I think we can solve this, by always creating this directly in the RecLayer. Maybe we actually can treat it in a similar special way like ":i".

I am not sure whether I understand you correctly. What exactly should we create in RecLayer? The rec-history DimensionTag? And then make it an attribute of RecLayer?

In general, you should make no assumptions about how often, when, under what conditions, with what input this is called.

  • The comment above probably clarifies your question about is_inside_rec_layer. Note that it also has an argument inside_loop.

Thanks okay, that does clear up some things! :)

That's a good question. I think this is currently not possible in general. In principle, you can use get_rec_parent_layer, and in many cases, rec_layer.output already has the time seq set (and thus the dim tag). However, not always. This needs to be set in advance somehow. But only the DimensionTag because the seq length tensor is not known yet.
This should be discussed further.

Maybe we can also add an attribute with the rec time axis to RecLayer (if I understood you correctly above that that is what you want to do with rec-history). This way we could check. Or vice versa, add the RecLayer as attribute to the DimensionTag?
I don't know.

Somewhat related to this, in general I think we need some additional way of accessing time axes:
In the case of encoder (unmasked) self-attention, we do not need a CumConcatLayer. Instead, the energy dot layer will just receive two time axes with the exact same dimension tag as input (usually stag:extern_data:classes). One of them will be marked as the time axis, the other one not.
But as far as I know it is impossible to access the time axis not marked as T. We could fix this by adding something like stag:...:0, stag:...:1 and so on.
But that still makes it very error-prone to mix up the axes.
Instead, it would be better if we rename (at least one of) the axes. Is there a way to do that currently? Or does somebody have a better solution in general?

@albertz
Member

albertz commented Nov 24, 2020

  • It is inefficient to calculate k,q,v in three independent linear layers. It is more efficient to do it like I suggested above:

Ah, okay of course we can do that then.
I ran into an issue when referring to a layer as qkv_split/0 within a RecLayer whose loop is optimized away. I think I fixed it in 793d46c by handling the case '/' in name exactly as in the non-RecLayer get_layer. Is that the correct way?

At first glance, it seems correct. Handling this can be tricky sometimes.

Then we should implement :i (and maybe the alias rec-frame) in a place decoupled from the specific layer logic, e.g. in get_axis_from_description.
But then we need to pass the (parent-)layer as well, probably not so nice?

Yes, this would not be nice. This should be decoupled. I.e. the Data should never need to know anything about layers.
This can be solved, however. The Data can and maybe should know about the current control flow frame/context (this is the TF logic / handling for tf.while_loop and tf.cond), or some abstracted concept of that. So we could add a new object ControlFlowContext or so, which can be another attribute of Data. A similar extension like DimensionTag or SearchBeam.
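
As a rough sketch of that idea (the class name ControlFlowContext is from this proposal; the attributes shown here are assumptions for illustration, not existing code):

from typing import Optional

class ControlFlowContext:
    """Abstraction of the TF control flow frame (tf.while_loop / tf.cond) that a Data lives in."""
    def __init__(self, kind: str, parent: Optional["ControlFlowContext"] = None):
        self.kind = kind              # e.g. "loop" (tf.while_loop) or "cond" (tf.cond)
        self.parent = parent          # enclosing context, for loop-in-loop cases
        self.loop_spatial_dim = None  # dim tag of the rec frames, i.e. the ":i" / "rec-frame" axis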

  • In get_out_data_from_opts, no tensors should be constructed. [...]
    In your specific case, I think we can solve this, by always creating this directly in the RecLayer. Maybe we actually can treat it in a similar special way like ":i".

I am not sure whether I understand you correctly. What exactly should we create in RecLayer? The rec-history DimensionTag? And then make it an attribute of RecLayer?

Yes exactly.
Or, if we follow my suggestion from before, this could also be an attribute of ControlFlowContext, owned by the rec_layer.output Data, which would be created by RecLayer.

Maybe we can also add an attribute with the rec time axis to RecLayer (if I understood you correctly above that that is what you want to do with rec-history). This way we could check. Or vice versa, add the RecLayer as attribute to the DimensionTag?
I don't know.

I would rather make this handled directly via Data. We need to change the logic for DimensionTag handling a bit. Currently it is attached to the size_placeholder tensors, i.e. not available unless they are constructed. But we can store them separately directly in Data. Then we can create it before it (the size) is actually known.
This would however need lots of other code changes to deal correctly with all cases when Data is modified. But maybe this needs to be done.
Edit: I will take a look at this.

Somewhat related to this, in general I think we need some additional way of accessing time axes:
In the case of encoder (unmasked) self-attention, we do not need a CumConcatLayer. Instead, the energy dot layer will just receive two time axes with the exact same dimension tag as input (usually stag:extern_data:classes). One of them will be marked as the time axis, the other one not.
But as far as I know it is impossible to access the time axis not marked as T. We could fix this by adding something like stag:...:0, stag:...:1 and so on.
But that still makes it very error-prone to mix up the axes.
Instead, it would be better if we rename (at least one of) the axes. Is there a way to do that currently? Or does somebody have a better solution in general?

You can create your own DimensionTags in your config. Then you can use an EvalLayer, and make a tf.identity on one of the size_placeholder, and attach your custom DimensionTag. However, this is arguably not so convenient or straight-forward, and also you need to understand the internals to get why the tf.identity is needed here (because the DimensionTag is attached to the size tensor).

But to the specific question/case: Couldn't you just use CumConcatLayer as well? It will do exactly that, i.e. create a new DimensionTag. And then just leave out the masking in the following.

Arguably, it's maybe a bit non-intuitive to use CumConcatLayer in that case. We could maybe really create a new layer, sth like NewDimTagLayer, which gets an axis, and the output will be just the same, but will have a new DimensionTag for that axis, based on the original DimensionTag.

Btw, we might need another attribute for DimensionTag, sth like same_as_with_new_id or so. (There is already same_as, but same_as really means that there is no difference between them, so this is wrong here.)

@Zettelkasten
Member Author

Then we should implement :i (and maybe the alias rec-frame) in a place decoupled from the specific layer logic, e.g. in get_axis_from_description.
But then we need to pass the (parent-)layer as well, probably not so nice?

Yes, this would not be nice. This should be decoupled. I.e. the Data should never need to know anything about layers.
This can be solved, however. The Data can and maybe should know about the current control flow frame/context (this is the TF logic / handling for tf.while_loop and tf.cond), or some abstracted concept of that. So we could add a new object ControlFlowContext or so, which can be another attribute of Data. A similar extension like DimensionTag or SearchBeam.

Okay, so we essentially add the rec-time axes (for now maybe only the innermost rec time axis, but we can of course extend this), encapsulated in some ControlFlowContext object, as a property of Data, right?

Maybe we run into issues when other code copies and modifies Data objects without considering this, though. I can't really estimate whether that will be a problem for now (but it seems related to the issue you mentioned that we would get when adding the DimensionTags also as a property of the Data).
For me though, this kind of sounds like a somewhat complicated solution that is somewhere between what we have right now (DimensionTags solely bound to tensors) and the one we probably actually want (DimensionTags as part of the Data itself).
I don't know if that's so smart - wouldn't it be easier to handle this when we move the DimensionTags to be an attribute of Data?

Alternatively, we could also stick to a simpler solution instead, like e.g. storing both the rec-frame and rec-history tensors as attributes of RecLayer.

I would rather have this handled directly via Data. We need to change the logic for DimensionTag handling a bit. Currently the tag is attached to the size_placeholder tensors, i.e. it is not available unless those are constructed. But we can store the tags separately, directly in Data. Then we can create a tag before the size itself is actually known.
This would however need lots of other code changes to handle all the cases where Data is modified. But maybe this needs to be done.
Edit: I will take a look at this.

Yes, I think this is kind of out of the scope of this issue related to self attention :D


You can create your own DimensionTags in your config. Then you can use an EvalLayer, and make a tf.identity on one of the size_placeholder, and attach your custom DimensionTag. However, this is arguably not so convenient or straight-forward, and also you need to understand the internals to get why the tf.identity is needed here (because the DimensionTag is attached to the size tensor).

Ah true, one could do that. I wouldn't use that in a reference self attention implementation though.

But to the specific question/case: Couldn't you just use CumConcatLayer as well? It will do exactly that, i.e. create a new DimensionTag. And then just leave out the masking in the following.

True! It would also make it (apart from the masking layer) exactly analogous to the decoder self attention, which I think is good.

However, the CumConcatLayer somewhat depends on being in a (possibly optimized away) recurrent layer: E.g., would we allow the axis=':i' argument when outside any recurrent layer? Also, axis='i' is currently the default parameter.

But there are also other issues in the implementation if we do that, e.g. our plan was to copy the rec-history DimensionTag from the RecLayer - we can't do that of course when there is none.

Arguably, it's maybe a bit non-intuitive to use CumConcatLayer in that case. We could maybe really create a new layer, sth like NewDimTagLayer, which gets an axis, and the output will be just the same, but will have a new DimensionTag for that axis, based on the original DimensionTag.

Maybe, yes. But if I understood correctly, that cannot be implemented 100% correctly right now, because we would have to create a new tensor for this DimensionTag in get_out_data_from_opts in order to rename it, which we should not do there.

albertz added a commit that referenced this issue Sep 10, 2021
This is for generalized self attention (#391).

Co-authored-by: Frithjof <[email protected]>
albertz added a commit that referenced this issue Sep 12, 2021
This is for generalized self attention (#391).
Fixes #391.

Co-authored-by: Frithjof <[email protected]>
@albertz
Member

albertz commented Sep 12, 2021

With CumConcatLayer (#589) and the corresponding test case (test_reclayer_optimize_out_cum_concat_gen_self_att, via 3ab8667), we have now a first version working.

The test case is this code:

# This is very much the vanilla self attention,
# implemented via the new generic way.
# See https://github.com/rwth-i6/returnn/issues/391 for a long discussion.
# Commented shapes are always for the layers inside the loop (not optimized).
"qkv": {"class": "linear", "from": "data:source", "activation": None, "n_out": n_key * 2 + n_value},  # [B,2*K+V]
"qkv_split": {"class": "split", "from": "qkv", "size_splits": [n_key, n_key, n_value]},
"q": {"class": "copy", "from": "qkv_split/0"},  # inside [B,K]. optimized out [T,B,K]
"k": {"class": "copy", "from": "qkv_split/1"},  # inside [B,K]. optimized out [T,B,K]
"v": {"class": "copy", "from": "qkv_split/2"},  # inside [B,V]. optimized out [T,B,V]
# cum_concat here. Note that the optimized-out shape is not as you might expect [T,max(t),B,K],
# but instead using the optimized format, with extended dyn size on the special dim tag,
# i.e. [t*,B,K], representing [T,t*,B,K].
"k_accum": {"class": "cum_concat", "new_dim": new_dim, "from": "k"},  # inside [t,B,K]. opt out [t*,B,K]
"v_accum": {"class": "cum_concat", "new_dim": new_dim, "from": "v"},  # inside [t,B,V]. opt out [t*,B,K]
"energy": {
  "class": "dot", "from": ["q", "k_accum"],
  "red1": "static:-1", "red2": "static:-1",
  "var1": None, "var2": new_dim},  # inside [B,t]. optimized out [T,B,t*]
"att_weights": {
  "class": "softmax_over_spatial", "from": "energy", "axis": new_dim},  # inside [B,t]. opt out [T,B,t*]
"att": {
  "class": "dot", "from": ["att_weights", "v_accum"],
  "red1": new_dim, "red2": new_dim,
  "var1": None, "var2": "static:-1"},  # inside [B,V]. opt out [T,B,V]

@albertz
Member

albertz commented Sep 12, 2021

I would suggest to leave this closed now. Any further issues, or missing functionality, should be discussed in new separate issues.

@albertz
Member

albertz commented Sep 12, 2021

@Zettelkasten Can you check whether you have everything you need to implement basic self attention (well, that is basically already the test case, but maybe extend it), and also positional encoding, relative positional encoding, LSH, and whatever else you need?

@Zettelkasten
Member Author

Cool, awesome - many thanks for all the work and thought you put into this!
I'll definitely play with this a lot, and also finish #570 and so on.

albertz added a commit that referenced this issue Sep 15, 2021
E.g. for generalized self attention (#391, #545).

Co-authored-by: Frithjof <[email protected]>
albertz added a commit that referenced this issue Sep 15, 2021
For generalized non-rec self attention (#391).
@albertz
Member

albertz commented Sep 15, 2021

For reference, an example of generalized self attention (non-recursive, i.e. outside the rec loop), via #656:

new_dim = DimensionTag(kind=DimensionTag.Types.Spatial, description="new_self_att_dim")
net_dict_new = {
  "qkv": {
    "class": "linear", "from": "data", "with_bias": False,
    "n_out": n_key_dim_total * 2 + n_value_dim_total},  # [B,T,2*K'+V']
  "qkv_": {
    "class": "split_dims", "from": "qkv",
    "axis": "F", "dims": (n_heads, n_key_dim_per_head * 2 + n_value_dim_per_head)},
  "qkv_split": {
    "class": "split", "from": "qkv_",
    "size_splits": [n_key_dim_per_head, n_key_dim_per_head, n_value_dim_per_head]},
  "q": {"class": "copy", "from": "qkv_split/0"},  # [B,T,H,K]
  "k": {"class": "copy", "from": "qkv_split/1"},  # [B,T,H,K]
  "v": {"class": "copy", "from": "qkv_split/2"},  # [B,T,H,V]
  "q_": {"class": "eval", "from": "q", "eval": "source(0) * %f" % ((n_key_dim_total // n_heads) ** -0.5)},
  "k_": {"class": "reinterpret_data", "from": "k", "set_dim_tags": {"T": new_dim}},  # [B,T_new,H,K]
  "v_": {"class": "reinterpret_data", "from": "v", "set_dim_tags": {"T": new_dim}},  # [B,T_new,H,V]
  "energy": {
    "class": "dot", "from": ["q_", "k_"],
    "red1": "static:-1", "red2": "static:-1",
    "var1": time_dim, "var2": new_dim},  # [B,H,T_new,T]
  "att_weights": {
    "class": "softmax_over_spatial", "from": "energy", "axis": new_dim},  # [B,H,T,T_new]
  "att": {
    "class": "dot", "from": ["att_weights", "v_"],
    "red1": new_dim, "red2": new_dim,
    "var1": time_dim, "var2": "static:-1"},  # [B,H,T,V]
  "output": {
    "class": "merge_dims", "from": "att", "axes": "static"},  # [B,T,V']
}
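For completeness, the free variables used in this snippet would be defined along these lines; the concrete numbers are arbitrary, and time_dim stands for the DimensionTag of the input time axis, which has to come from the surrounding setup:

# Arbitrary example dimensions for the snippet above:
n_heads = 8
n_key_dim_total = 512
n_value_dim_total = 512
n_key_dim_per_head = n_key_dim_total // n_heads
n_value_dim_per_head = n_value_dim_total // n_heads
# time_dim: DimensionTag of the input time axis (how to obtain it depends on the setup).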
